The challenge of unstructured data and what to do about it
We’re drowning in data. Not only is data growing by leaps and bounds, it’s also messier, less structured and harder to control. With more data than we can handle coming at us from every direction, how can we derive meaning and insight from it?
Structured data is a lot easier to deal with — to store, classify and analyze. Unstructured data, on the other hand, isn’t something that can be sorted into columns and rows in a relational database.
Unstructured data could be anything from text in an e-mail, instant message, spreadsheet, presentation or digitized article, to images, audio files and video files — basically anything that’s not stored within a relational database. And a new era of unstructured data has been ushered in, in the form of social media, whether Tweets or Facebook wall posts or blog comments.
The sheer volume of this data poses a major challenge for businesses, and it’s not always easy to locate something when you’re looking for it.
In a recent survey by Unisphere Research on behalf of MarkLogic Corp., called “The Post-Relational Reality Sets In: 2011 Survey on Unstructured Data,” 62 per cent of respondents said it is inevitable that unstructured data will exceed the volume of traditional relational data within the next decade, and 35 per cent said unstructured data has already surpassed or will surpass traditional relational data in the next 36 months.
But when it comes to unstructured data, we’re early in the game, according to Barry Cousins, senior research analyst with Info-Tech Research Group. “Vendors tend to come to market in a state of completion,” he said. “This is a case where they’re saying we’re just not there yet. The clear message is that unstructured data has a long life on the analytics workbench, more so than before.”
That’s because there’s a fundamental difference between structured and unstructured data. With structured data, you can run analytics (say, to compare year-over-year revenue growth) and the answer is never going to change. With unstructured data, you need to go back to it, again and again, which has implications for compliance and security.
Over the years, the term “OWS” meant any number of things. But as of a few weeks ago, it started to mean “Occupy Wall Street.” And that’s why we need to go back and refine and revalidate unstructured data. “The lexicon is evolving,” said Cousins. Also, sarcasm, context and cultural differences all have to be taken into account.
“SAS is one of the most evolved analytics companies there is. When they do this for people they’re using buildings full of linguistics experts and PhDs to refine these algorithms,” said Cousins. “And they vary from customer to customer.” He believes that although we’re seeing “amazing technology from newer entrants,” the traditional vendors in the analytics space, such as SAS, Oracle Corp. and IBM Corp., will likely dominate, since they have the resources to back them up.
“One of the interesting results [of the Unisphere survey] is that 86 per cent admit unstructured data is important but only 11 per cent have clear procedures,” said David Gorbet, vice-president of product strategy with MarkLogic Corp.
“You need to have more engaged business conversations between IT and business people and have a plan for big data as opposed to being reactive to it. There is a fairly major shift going on — it’s like a sea change where the form doesn’t really change but the substance is changing underneath.”
A lot of customers are finding that big data is not just “more of the same,” but a qualitative change that requires them to think about new ways of doing things — and incorporating “big data” into the fundamental operation of their business.
As a result, a number of new technologies are coming to market, and 43 per cent of respondents in the Unisphere survey said they are currently evaluating new data management technologies. Nineteen per cent are evaluating log monitoring and reporting tools such as Splunk, 18 per cent are looking at in-memory databases, 17 per cent are considering NoSQL databases, and 10 per cent are thinking about MPP data warehouses such as EMC Greenplum or Aster Data.
Much of modern business automation is thanks to “systems of record,” which are predicated on a relational database, said Gordon Ross, vice-president of OpenRoad Integrated Media Inc. Then there’s Facebook and Twitter and YouTube, which are “systems of engagement.” With systems of record, you’re interacting with data; with systems of engagement, you’re dealing with a person.
While e-mail has formal attributes that allow it to be stored, classified and sorted, for example, the body is messy and unstructured. At the end of the day, human communication is messy and unstructured. You can mine Twitter for content that mentions your brand and start to take a run toward overall sentiment based on frequency of keywords such as “love” or “suck,” but you have to coax out sentiment.
“Those are pretty crude [methods] and I think they miss the subtlety of human communication, which is context,” said Ross. But this messy human communication stuff, he said, is increasingly valuable and “we do have ways of storing it and searching through it that hopefully makes it useful inside of organizations.”
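The keyword-frequency approach described above can be sketched in a few lines of Python. This is only an illustration, not a method attributed to anyone quoted here: the brand name "Acme", the sample posts and the word lists beyond "love" and "suck" are all invented for the example.

```python
import re
from collections import Counter

# Hypothetical keyword lists; the article names only "love" and "suck".
POSITIVE = {"love"}
NEGATIVE = {"suck", "sucks"}


def keyword_sentiment(posts, brand):
    """Count positive and negative keywords in posts that mention the brand."""
    counts = Counter()
    for post in posts:
        # Lowercase and split into simple word tokens.
        words = re.findall(r"[a-z']+", post.lower())
        if brand.lower() not in words:
            continue  # skip posts that never mention the brand
        counts["mentions"] += 1
        counts["positive"] += sum(w in POSITIVE for w in words)
        counts["negative"] += sum(w in NEGATIVE for w in words)
    return counts


# Made-up sample posts; "Acme" is a placeholder brand.
posts = [
    "I love my new Acme phone",
    "Acme support sucks",
    "Oh great, I just love waiting an hour for Acme support",  # sarcastic
    "Nothing to do with the brand",
]

result = keyword_sentiment(posts, "Acme")
net_score = result["positive"] - result["negative"]
print(result["mentions"], result["positive"], result["negative"], net_score)
```

The third post is sarcastic, yet the counter scores it as positive because it contains "love." That is the kind of missing context Ross is pointing to.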
With data comes some ability to better understand and control our organizations, and actionable metrics can help negotiate the terrain for better decision-making.
“Arguably, a lot of businesses still struggle with how to make good decisions based on structured data. There’s a lack of knowledge or insight into how to use it,” said Ross. To move up the hierarchy of informational needs, you’ve got to deal with your structured data first before tackling unstructured or social data.
“It’s going to be a question of how to structure the unstructured,” said Cousins. And it’s a bitter pill for IT to swallow. “You have to continually refine how you do this. It’s not an old-fashioned IT project,” he said. It will force integration and discipline on data handling.
Start by making a decision to retain your unstructured data, he said, and develop a warehousing retention practice for it. Validation is the tough part: What data should you retain? What can be excluded? Even if you exclude it, should you retain it, in case one day it becomes relevant? And how often do you have to redefine that data?
Banks spent a lot of time coming up with a single point of reference for each customer. But if you’re looking at e-mail and social media records, you’re not looking for a customer number. “We did so much work to create a common customer number and the ability to relate everything together, but these issues send it permanently in the other direction,” said Cousins.
Ultimately, it’s about being able to leverage this data to make more valuable high-stakes business decisions, not about analyzing the data just because you happen to have it, said Gorbet. And that means understanding your business requirements.
“Focus first on data that’s going to provide value. Data warehouses only tell part of the story; see if there’s an unstructured element you could integrate to tell a more complete story, [such as] why sales are down in a certain region,” said Gorbet. That’s how companies can dip their toes into unstructured waters and, most likely, yield business value in short order.