The Internet has buried companies under a mudslide of unstructured data. One of the most pressing problems facing IT is how to turn data that won’t fit into rows and columns into useful information. And while the amount of unstructured data is growing exponentially, the tools for dealing with it haven’t kept pace.
The magnitude of the deluge is staggering. Approximately 85 percent of all digital business information exists only as unstructured data, according to research by Merrill Lynch & Co. Inc. Most of that comes from the increasing use of the Web as an internal and external business channel.
The majority of unstructured data consists of text documents. Some of those, such as memos, letters, marketing materials and research documentation, have presented a storage and retrieval problem in business since before there was digital media. And now, in addition to those documents, there are e-mail messages, customer queries and responses from sales and support representatives generated by CRM applications, user group postings and chat messages, as well as images, movies and Web pages with their hyperlinked information.
E-mail volume alone has burgeoned; market research firm IDC predicts that there will be more than 60 billion messages sent annually by 2006. And besides the business imperative to take control of the organization’s knowledge base, federal regulatory initiatives increase the pressure on companies both to archive e-mail and to develop a way to research the content of the messages.
The other 15 percent of all business information — the structured data that generally resides neatly in spreadsheets and databases — is being sliced, diced, massaged and squeezed for every bit of business intelligence it will yield. Technologies to address unstructured data can’t match the functionality of these real-time analytics for structured data, and users have been slow to adopt them. Tim Berners-Lee, the Web’s primary architect, has famously observed that most of the information on the Web is designed for human consumption and resists being organized or analyzed by any automated process.
What do companies lose by not having the means to use unstructured data? Employees’ time, for one thing: recent studies indicate that information workers spend as much as a quarter of their time just finding and gathering job-related information. Nuanced information about trends and customer attitudes, for another.
Vendors recognize both the challenge and the opportunity presented by unstructured data. When recently asked what the next big thing in business intelligence and data warehousing would be, Don Hatcher, SAS Institute’s vice president of technology strategy, answered emphatically, “Unstructured (data), without a doubt. We’re working on it, and I’m sure the other (competing) companies are, too.”
SAS will try to make unstructured data a part of its customers’ “predictive process,” Hatcher said. The company is also “engaging thought leaders in the space” as it maps its route into the unstructured market. Of course, SAS and the other business-intelligence and analytics vendors haven’t exactly discovered a new frontier.
Search has been the traditional way to manage and mine unstructured data, especially text-based documents. The most fruitful techniques go well beyond the simple keyword queries most of us type into Google or Yahoo several times a day. Full-text searches, which began as a tool for the intelligence and library communities, have been around for decades — for almost as long as there have been digital documents.
Search technology companies are refining their products by adding natural-language search capability; stemming, which strips common suffixes so that variants of a word match the same root; and spelling correction. They’re also using metadata fields to narrow and focus searches by adding context to individual queries.
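To make those refinements concrete, here is a minimal Python sketch of how stemming, spelling correction and a metadata filter might combine in a single query. Everything in it is an assumption made for illustration: the three sample documents, the `dept` metadata field, the crude suffix list and the function names are hypothetical, and commercial search products use far more sophisticated linguistics than this.

```python
from difflib import get_close_matches

# Hypothetical mini-corpus: each document carries text plus a metadata field.
DOCS = [
    {"id": 1, "dept": "support", "text": "Customer reported billing errors"},
    {"id": 2, "dept": "sales", "text": "Billing terms for new customers"},
    {"id": 3, "dept": "support", "text": "Password reset instructions"},
]

# Deliberately crude suffix list (an assumption, not a real stemming algorithm).
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokens(text):
    """Return the set of stemmed words in a piece of text."""
    return {stem(w) for w in text.split()}


# Every stem that appears anywhere in the corpus, used for spelling correction.
VOCAB = sorted(set().union(*(tokens(d["text"]) for d in DOCS)))


def normalize(term):
    """Stem a query term, then snap it to the closest known stem if misspelled."""
    stemmed = stem(term)
    if stemmed in VOCAB:
        return stemmed
    close = get_close_matches(stemmed, VOCAB, n=1, cutoff=0.8)
    return close[0] if close else stemmed


def search(query, **metadata):
    """Return ids of documents that contain every query term and match the metadata."""
    wanted = {normalize(t) for t in query.split()}
    hits = []
    for doc in DOCS:
        if any(doc.get(key) != value for key, value in metadata.items()):
            continue  # the metadata filter narrows the search before text matching
        if wanted <= tokens(doc["text"]):
            hits.append(doc["id"])
    return hits


print(search("biling customers"))                  # misspelled query, no filter
print(search("biling customers", dept="support"))  # same query, narrowed by metadata
```

In this sketch the misspelled "biling" is stemmed and then corrected to the known stem "bill," and "customers" matches both "Customer" and "customers." Adding the `dept` filter shows how a metadata field supplies context that the keywords alone don’t carry.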
Vendors such as Verity, Autonomy, Stratify and Inxight offer software that automates the classification process and maintains taxonomies, as well as discovery systems that generate metadata from documents and allow users to dig through the hierarchical layers. The big content management vendors are making the direct link to business intelligence when they describe their search and classification offerings as “content intelligence.”
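As a rough illustration of what automated classification against a taxonomy involves, the Python sketch below assigns documents to categories by keyword overlap and then arranges them in a two-level hierarchy that a user could browse. It is a simple keyword-rule stand-in, not a description of how any of the vendors named above build their products; the taxonomy, the keywords and the sample documents are all hypothetical.

```python
# Hypothetical two-level taxonomy: "Top/Leaf" path -> keywords that signal it.
TAXONOMY = {
    "Customer/Billing": {"invoice", "refund", "payment"},
    "Customer/Support": {"password", "login", "error"},
    "Marketing/Campaigns": {"campaign", "brochure", "launch"},
}


def classify(text, min_hits=1):
    """Return matching taxonomy paths, best match first.

    The score is just the number of category keywords found in the text.
    """
    words = set(text.lower().split())
    scores = {path: len(words & keywords) for path, keywords in TAXONOMY.items()}
    matches = [path for path, score in scores.items() if score >= min_hits]
    return sorted(matches, key=lambda path: (-scores[path], path))


def build_tree(docs):
    """Group document ids under top-level and leaf categories for browsing."""
    tree = {}
    for doc_id, text in docs.items():
        for path in classify(text):
            top, leaf = path.split("/")
            tree.setdefault(top, {}).setdefault(leaf, []).append(doc_id)
    return tree


# Hypothetical sample documents.
docs = {
    "memo-1": "refund issued after payment error",
    "mail-7": "brochure for the spring launch",
}

print(classify(docs["memo-1"]))
print(build_tree(docs))
```

The category paths returned by `classify` are the generated metadata; `build_tree` turns them into the hierarchical layers a user would dig through. A document can land in more than one category, as the first memo does here.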
The big surprise, given the volume of unstructured data piling up in every modern company, is corporate IT’s lack of urgent interest in the problem. Data warehousing and business-intelligence projects are generally surviving the lousy economy more successfully than most technology initiatives. That’s because companies have been won over to the notion that the more they can learn from the data in their various databases and other structured repositories, the better off they’ll be in good times and in bad.
But those companies have yet to be convinced that they have the same need to exploit unstructured data. Some foot-dragging is understandable. Resources are in short supply. Catchphrases like content intelligence stir memories of the knowledge management hype that fizzled so miserably.
But the problem is only getting bigger, and the technologies that help us manage unstructured data and turn it into information are going to become increasingly important. If you don’t believe it, go check your e-mail in-box.