SAN FRANCISCO -- The floods that devastated the hard disk industry in Thailand are now half a year old, and the prices per terabyte are finally dropping once again. That means data will start piling up and people around the office will wonder what can be done with it. Perhaps there are some insights in those log files? Perhaps a bit of statistical analysis will find some nuggets of gold buried in all of that noise? Maybe we can find enough change buried in the couch cushions of these files to give us all a raise?
The industry now has a buzzword, "big data," for how we're going to do something with the huge amount of information piling up. "Big data" is replacing "business intelligence," which subsumed "reporting," which put a nicer gloss on "spreadsheets," which beat out the old-fashioned "printouts." Managers who long ago studied printouts are now hiring mathematicians who claim to be big data specialists to help them solve the same old problem: What's selling and why?
It's not fair to suggest that these buzzwords are simple replacements for each other. Big data is a more complicated world because the scale is much larger. The information is usually spread out over a number of servers, and the work of compiling the data must be coordinated among them. In the past, the work was largely delegated to the database software, which would use its magical JOIN mechanism to combine tables, then add up the columns before handing off the rectangle of data to the reporting software that would paginate it. This was often harder than it sounds. Database programmers can tell you stories about complicated JOIN commands that would lock up their database for hours as it tried to produce a report for the boss who wanted his columns just so.
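For readers who have not written one, the old JOIN-then-total workflow looks roughly like the following. This is a minimal sketch using Python's built-in sqlite3 module and an in-memory database; the `products` and `sales` tables and their columns are invented for illustration and do not come from any system described here.

```python
import sqlite3

# Hypothetical schema: table and column names are made up for this example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales (product_id INTEGER, amount REAL);
    INSERT INTO products VALUES (1, 'widget'), (2, 'gadget');
    INSERT INTO sales VALUES (1, 10.0), (1, 15.0), (2, 7.5);
""")

# JOIN matches each sale to its product; SUM adds up the column per product.
report = conn.execute("""
    SELECT p.name, SUM(s.amount) AS total
    FROM products AS p
    JOIN sales AS s ON s.product_id = p.id
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()

# The "rectangle of data" handed to the reporting layer.
for name, total in report:
    print(f"{name:10s} {total:8.2f}")

conn.close()
```

On three rows this finishes instantly. The hours-long lockups come when the same kind of query has to match millions of rows across several tables.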
The game is much different now. Hadoop is a popular tool for organizing the racks and racks of servers, and NoSQL databases are popular tools for storing data on these racks. These mechanisms can be much more powerful than the old single machine, but they are far from being as polished as the old database servers. Although SQL may be complicated, writing the JOIN query for a SQL database was often much simpler than gathering information from dozens of machines and compiling it into one coherent answer. Hadoop jobs are typically written in Java, and that requires another level of sophistication. The tools for tackling big data are just beginning to package this distributed computing power in a way that's a bit easier to use.
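The pattern of gathering partial answers from many machines and compiling them into one can be sketched in a few lines. This is a single-process Python illustration of the map-and-reduce idea, not Hadoop's Java API; the log format and the names `server_logs`, `map_sales` and `merge_counts` are assumptions made for the example. On a real cluster each map step would run on a separate server.

```python
from collections import Counter
from functools import reduce

# Hypothetical log lines, one list per server in the cluster.
server_logs = [
    ["sale widget", "sale gadget", "view widget"],
    ["sale widget", "view gadget"],
    ["sale gadget", "sale widget"],
]

def map_sales(lines):
    """Map step: count sales per product using only one server's lines."""
    counts = Counter()
    for line in lines:
        event, product = line.split()
        if event == "sale":
            counts[product] += 1
    return counts

def merge_counts(left, right):
    """Reduce step: combine two partial tallies into one."""
    return left + right

# Each server produces a partial tally; the tallies are then merged.
partials = [map_sales(lines) for lines in server_logs]
totals = reduce(merge_counts, partials, Counter())

print(dict(totals))
```

The work that makes real jobs hard is everything this sketch leaves out: shipping code to the machines, moving partial results across the network and recovering when a server fails midway.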
Many of the big data tools are also working with NoSQL data stores. These are more flexible than traditional relational databases, but that flexibility isn't as much of a departure from the past as Hadoop is. NoSQL queries can be simpler because the database design discourages the complicated tabular structure that drives the complexity of working with SQL. The main worry is that software needs to anticipate the possibility that not every row will have data for every column.
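That last worry is easy to show in code. Below is a minimal sketch, assuming records arrive as Python dictionaries the way many document stores return them; the field names and the `revenue_by_region` helper are hypothetical, and the choice of defaults is one option among several.

```python
# Hypothetical records: note that not every one carries every field.
records = [
    {"product": "widget", "price": 4.0, "region": "west"},
    {"product": "gadget", "price": 6.0},   # no region recorded
    {"product": "gizmo"},                  # no price or region recorded
]

def revenue_by_region(rows):
    """Total price per region, substituting defaults for missing fields."""
    totals = {}
    for row in rows:
        region = row.get("region", "unknown")  # default when the field is absent
        price = row.get("price", 0.0)          # treat a missing price as zero
        totals[region] = totals.get(region, 0.0) + price
    return totals

summary = revenue_by_region(records)
print(summary)
```

Whether a missing price should count as zero, be skipped or be flagged as an error is a business decision. The point is only that the code has to make that decision explicitly, where a relational schema would have forced it when the table was designed.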