Taken alone, most of the unstructured data created today won't look very important on close inspection: an email, a text message, a stock price, or even a sensor transmitting an 'off' signal.
But once these tiny bricks of data are put together, the resulting structure can tell us something quite important.
In truth, a big multinational shoe company probably doesn’t care what you, personally, think of its new sneaker. But it will surely care if thousands of others feel the same way. Similarly, if the power fails in one house, the electricity company won’t lose much sleep over it. But when the entire grid gets knocked out, you can expect something to be done about it.
The mess of information we’re immersed in represents a great technological challenge—that is, finding a way to give it all meaning. But since it also offers an irresistible power to business—virtual omniscience—the money to make industrial-level data sifting a reality has arrived in sufficient quantity to get it off the drawing board and into the commercial world.
But in giving meaning to a mass of unstructured data, a distinction has to be made when we start with our initial question. Our answer could be built out of a million tiny components. Or it could come in the form of one giant, irreducible entity. Each requires a very different kind of hardware.
And here is where we enter the worlds of “massively parallel” and “embarrassingly parallel.”
The strength of the elephant
We can’t talk unstructured data without making some mention of Apache’s Hadoop project, which is now the preferred beast of burden for this kind of big data. It’s the needle-in-the-haystack problem solved to the extreme: Hadoop can sift through endless bales of hay, neatly organizing every needle it finds. Massively scalable and cheap to run on commodity hardware, Hadoop represents something we’ve wanted for years but have only recently been able to use.
John Kreisa is the vice-president of Hortonworks Inc., a major contributor to the Hadoop project. As he puts it, Hadoop handles the big workloads, the messy workloads that don’t translate well into SQL, and it handles very different kinds of workloads. “It’s good in any one of those dimensions and certainly all of those dimensions at the same time, which makes it unique,” says Kreisa.
At the core of the platform is its ability to chop these large workloads into smaller and smaller ones using the combined power of distributed hardware boxes, each working independently. Essentially, Hadoop is an excellent tool to process data that can be divided into smaller parts (for example, millions of business transactions). For this reason, it’s attracted a great deal of interest in all sorts of business sectors, “everything from telecommunications to healthcare, financial services to big Web properties, retail… government,” Kreisa adds.
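The divide-and-conquer idea behind Hadoop can be sketched in a few lines. The following is a minimal, single-machine illustration of the map/reduce pattern, with hypothetical transaction records standing in for the millions a real cluster would split across many nodes:

```python
# Hypothetical transaction records; on a Hadoop cluster these would be
# partitioned across many machines rather than held in one list.
transactions = [
    {"store": "east", "amount": 120.0},
    {"store": "west", "amount": 80.0},
    {"store": "east", "amount": 45.5},
]

# "Map" step: each record is processed independently, emitting (key, value).
# Because no pair depends on any other, this step parallelizes trivially.
mapped = [(t["store"], t["amount"]) for t in transactions]

# "Reduce" step: values sharing a key are combined into one result.
def reduce_by_key(pairs):
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0.0) + value
    return totals

totals = reduce_by_key(mapped)
print(totals)  # {'east': 165.5, 'west': 80.0}
```

The crucial property is that the map step requires no communication between workers, which is exactly what makes such workloads cheap to spread across commodity hardware.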
Hadoop, the elephant, naturally gets a lot of the attention because of its size. But other platforms have been designed to attack the same datasets with the same ferocity, if not on the same scale.
Here in Canada, OpenText Corporation solves similar problems for its clients, though instead of processing numerical transactions the company is best known for parsing written language. Using software that operates through what OpenText calls "entity extraction," the computer is trained to recognize more than just numbers, says Lubor Ptacek, vice-president of strategic marketing at OpenText: "names of people, names of locations, names of organizations, currencies, dates, trademarks, products."
Then, he says, you have “content extraction,” linguistic algorithms that try to discern the meaning behind the data, an example of which would be what is now referred to as “sentiment analysis” (how do my customers feel about my product?). Textual analysis of this kind used to be done primarily internally (e.g., documents, emails), but is now being applied in the outside world as well: Twitter, Facebook, LinkedIn, Google, and so on.
Supercomputers: A very different animal
As impressive as this all may be, for the most part, these unstructured data tasks fall under the category of what’s called an “embarrassingly parallel” problem—one that is very simple to divide.
As meaning becomes less and less obvious, and more has to be inferred, individual problems become bigger and computers have to get smarter. Using millions of tweets as a proxy for sentiment could be done in a fairly simple manner, adding up occurrences of a certain keyword, or in a highly complicated one, trying to understand syntax, context and relationships.
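The simple end of that spectrum, counting keywords rather than understanding language, can be sketched in a few lines. The tweets and word lists below are invented for illustration:

```python
# Hypothetical tweets; a real pipeline would stream millions of them.
tweets = [
    "Love the new sneaker, best purchase this year",
    "The new sneaker falls apart after a week",
    "sneaker looks great but runs small",
]

# Toy keyword lists; real lexicons run to thousands of weighted terms.
POSITIVE = {"love", "great", "best"}
NEGATIVE = {"falls", "worst", "broken"}

def naive_sentiment(text):
    """Score one tweet by keyword hits alone: no syntax, no context."""
    words = set(text.lower().split())
    return len(words & POSITIVE) - len(words & NEGATIVE)

scores = [naive_sentiment(t) for t in tweets]
print(scores)       # [2, -1, 1]
print(sum(scores))  # 2 -- net sentiment across the sample
```

Because each tweet is scored independently, this version is embarrassingly parallel; it is the context-aware version, where meaning spans sentences and relationships, that resists such clean division.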
The former is basically analogous to analyzing database columns, says Steve Conway, research vice-president for high-performance computing at IDC Inc. Nothing terribly complicated. But the latter would demand more serious computing power. And this is where the utility of a technology like Hadoop starts to fade away, he says.
“The limitations of Hadoop are that you cannot pose intelligent questions,” says Conway. Not a tremendous practical limitation, he says, because most business questions are in fact quite simple (and divisible), but “the most challenging, the most economically important supercomputing problems do not fit that bill.”
The key distinction between a highly distributed platform like Hadoop and a single supercomputer is the fact that the CPUs in a supercomputer communicate while they solve the problem, he says, whereas with Hadoop and similar technologies, they don’t.
To ask very intelligent, very important questions, your processors must be in very close communication. Thus, supercomputers excel at tasks like identifying flaws in a car, or identifying emerging patterns of insurance fraud, says Conway. Technologies like Hadoop are useful only if you already know what you're looking for, he adds.
But when you don’t, by and large, you’ll want a supercomputer.
To explain why, he elaborates on his car example. If you want to fix a noise problem in a vehicle during the manufacturing process, you can’t simply replace the smaller rear window, where you suspect the problem is coming from, because it would affect the aerodynamics of the car. This sort of problem, in other words, cannot be parallelized.
Applied special relativity
You can, of course, connect the CPUs of different computers over a local network or over the Internet, creating a high-performance computer of sorts, but even over the fastest fibre-optic network, you’ll quickly run into an impenetrable barrier: the speed of light.
Light can travel 30 cm in a nanosecond, which sounds fast until you stretch the distances out and consider how much information can physically travel back and forth through the pipes. Distributed supercomputing-like programs, of which one of the largest is Stanford University's popular protein-folding research project, Folding@Home, have an impressive combined performance in FLOPS (floating-point operations per second) only because each individual CPU can work independently on the tiny steps needed to answer a bigger question.
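The arithmetic is unforgiving. A rough, back-of-the-envelope sketch (assuming signals in fibre travel at about two-thirds the speed of light in vacuum) shows why CPUs a metre apart can converse thousands of times per millisecond while data centres a thousand kilometres apart cannot:

```python
C = 299_792_458  # speed of light in vacuum, metres per second

def round_trip_ns(distance_m, medium_fraction=1.0):
    """Best-case round-trip signal time in nanoseconds, ignoring all
    switching and protocol overhead, at a fraction of c."""
    return 2 * distance_m / (C * medium_fraction) * 1e9

# CPUs one metre apart inside a single machine: about 6.7 ns round trip.
print(round_trip_ns(1))

# Data centres 1,000 km apart over fibre (~2/3 c): about ten million ns,
# i.e. roughly 10 ms per exchange -- before any real-world overhead.
print(round_trip_ns(1_000_000, 2/3))
```

For a problem whose processors must exchange intermediate results constantly, that million-fold gap in latency, not raw FLOPS, is the wall.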
Perhaps more importantly, projects like Folding@Home also aren't time-critical, says Conway.
With unstructured data, he says, counter-terrorism is an especially apt way to demonstrate the type of job a supercomputer can do. Does a gun purchase in one part of the world have anything to do with an intelligence report warning of an attempt to hijack an airliner? Time (and therefore, processor communication) is absolutely critical here.
Again, we return to finding the needle in the haystack, for which distributed computing is an excellent tool. But a supercomputer’s forte isn’t finding needles, says Conway, but rather something quite different: “finding patterns in shifting sands.”
From embarrassingly simple to shockingly intelligent
Regardless of what hardware is used to process unstructured data, and how it is used, the software running on it is becoming smarter. The unique advantages of high-performance computing will remain, but some of the functions it used to perform can now be reduced to simpler processes.
Jack Norris, vice-president of marketing at MapR Technologies Inc., a company that provides tools to make Hadoop more accessible to enterprises, points to a Google white paper that argues for simple big data algorithms over complex models.
"You look at what Google's capable of," Norris says. "Things that are traditionally in the realm of artificial intelligence or supercomputers can be reduced and done with a much simpler approach across massive amounts of processing."
At Hortonworks, Kreisa also notes that Hadoop did the pre-processing for Watson, IBM's Jeopardy! whiz supercomputer, doing "a lot of the heavy lifting."
And in any case, as the structuring of data becomes more developed and “intelligent” over the years, the limitations we face might not be on the hardware front, or related to the speed of light.
Leslie Owens, a research director at Forrester Research Inc. who studies enterprise search technologies, says applying true artificial intelligence to huge troves of unstructured data requires coping with the scale of this data somehow. At the moment, many systems operate based on rules, she says, where a certain keyword in an email, like "refund," would elicit some sort of automatic action.
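Rule-based systems of the kind Owens describes amount to little more than a lookup table. A minimal sketch, with invented keywords and action names, makes the point:

```python
# Hypothetical rule table: a keyword found in an incoming email
# maps to an action name. Both columns are invented for illustration.
RULES = {
    "refund": "route_to_billing",
    "cancel": "route_to_retention",
    "lawsuit": "escalate_to_legal",
}

def classify(email_body):
    """Return every action whose trigger keyword appears in the email."""
    words = set(email_body.lower().split())
    return [action for keyword, action in RULES.items() if keyword in words]

print(classify("I would like a refund before I cancel my account"))
# ['route_to_billing', 'route_to_retention']
```

The gap between this and prediction from "wide, diverse, heterogeneous" behavioural data is precisely what makes the newer systems exciting, and hard.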
"What's exciting is that people are now saying, 'we're watching these big patterns of people using tools like Google and Facebook and other things, and we're able to assume based on the wide, diverse, heterogeneous amount of information that is crunched by these systems that now we can predict with more accuracy what they really want,'" says Owens.
But there are so many intersecting elements that the bottleneck may in fact be in figuring out how to create a picture out of it that human beings can understand, she says. "How do you diagram or model out all these relationships?"
The new Chinese room
In the future, we may face a paradoxical situation in which so-called artificial intelligence can be applied to unstructured data, but understanding the results will challenge human intelligence. Programming a computer to understand natural language, as is done for textual analysis, is arguably not a form of advanced artificial intelligence in the sense that the computer simply associates—not infers—relationships between human language and machine code.
We can almost view it as a reversal of John Searle's famous thought experiment against artificial intelligence, in which a person who speaks only English sits in a room and follows instructions to write out Chinese characters, giving the outward appearance of understanding them without any actual comprehension.
When computers can make inferences from billions of lines of text, something beyond our own abilities, if not in theoretical terms then at least for all practical purposes, we could be the ones following instructions we accept but don't truly understand.