SAN FRANCISCO — The floods that devastated the hard disk industry in Thailand are now half a year old, and the prices per terabyte are finally dropping once again. That means data will start piling up and people around the office will wonder what can be done with it. Perhaps there are some insights in those log files? Perhaps a bit of statistical analysis will find some nuggets of gold buried in all of that noise? Maybe we can find enough change buried in the couch cushions of these files to give us all a raise?
The industry now has a buzzword, “big data,” for how we’re going to do something with the huge amount of information piling up. “Big data” is replacing “business intelligence,” which subsumed “reporting,” which put a nicer gloss on “spreadsheets,” which beat out the old-fashioned “printouts.” Managers who long ago studied printouts are now hiring mathematicians who claim to be big data specialists to help them solve the same old problem: What’s selling and why?
It’s not fair to suggest that these buzzwords are simple replacements for each other. Big data is a more complicated world because the scale is much larger. The information is usually spread out over a number of servers, and the work of compiling the data must be coordinated among them. In the past, the work was largely delegated to the database software, which would use its magical JOIN mechanism to compile tables, then add up the columns before handing off the rectangle of data to the reporting software that would paginate it. This was often harder than it sounds. Database programmers can tell you the stories about complicated JOIN commands that would lock up their database for hours as it tried to produce a report for the boss who wanted his columns just so.
The game is much different now. Hadoop is a popular tool for organizing the racks and racks of servers, and NoSQL databases are popular tools for storing data on these racks. These mechanism can be much more powerful than the old single machine, but they are far from being as polished as the old database servers. Although SQL may be complicated, writing the JOIN query for the SQL databases was often much simpler than gathering information from dozens of machines and compiling it into one coherent answer. Hadoop jobs are written in Java, and that requires another level of sophistication. The tools for tackling big data are just beginning to package this distributed computing power in a way that’s a bit easier to use.
Many of the big data tools are also working with NoSQL data stores. These are more flexible than traditional relational databases, but the flexibility isn’t as much of a departure from the past as Hadoop. NoSQL queries can be simpler because the database design discourages the complicated tabular structure that drives the complexity of working with SQL. The main worry is that software needs to anticipate the possibility that not every row will have some data for every column.
The biggest challenge may be dealing with the expectations built up by the major motion picture “Moneyball.” All the bosses have seen it and absorbed the message that some clever statistics can turn a small-budget team into a World Series winner. Never mind that the Oakland Athletics never won the World Series during the “Moneyball” era. That’s the magic of Michael Lewis’ prose. The bosses are all thinking, “Perhaps if I can get some good stats, Hollywood will hire Brad Pitt to play me in the movie version.”
None of the software in this collection will come close to luring Brad Pitt to ask his agent for a copy of the script for the movie version of your Hadoop job. That has to come from within you or the other humans working on the project. Understanding the data and finding the right question to ask is often much more complicated than getting your Hadoop job to run quickly. That’s really saying something because these tools are only half of the job.
To get a handle for the promise of the field, I downloaded some big data tools, mixed in data, then stared at the answers for Einstein-grade insight. The information came from log files to the website that sells some of my books (wayner.org), and I was looking for some idea of what was selling and why. So I unpacked the software and asked the questions.
Big data tools: Jaspersoft BI Suite
The Jaspersoft package is one of the open source leaders for producing reports from database columns. The software is well-polished and already installed in many businesses turning SQL tables into PDFs that everyone can scrutinize at meetings.
The company is jumping on the big data train, and this means adding a software layer to connect its report generating software to the places where big data gets stored. The JasperReports Server now offers software to suck up data from many of the major storage platforms, including MongoDB, Cassandra, Redis, Riak, CouchDB, and Neo4j. Hadoop is also well-represented, with JasperReports providing a Hive connector to reach inside of HBase.
This effort feels like it is still starting up — many pages of the documentation wiki are blank, and the tools are not fully integrated. The visual query designer, for instance, doesn’t work yet with Cassandra’s CQL. You get to type these queries out by hand.
Once you get the data from these sources, Jaspersoft’s server will boil it down to interactive tables and graphs. The reports can be quite sophisticated interactive tools that let you drill down into various corners. You can ask for more and more details if you need them.
This is a well-developed corner of the software world, and Jaspersoft is expanding by making it easier to use these sophisticated reports with newer sources of data. Jaspersoft isn’t offering particularly new ways to look at the data, just more sophisticated ways to access data stored in new locations. I found this surprisingly useful. The aggregation of my data was enough to make basic sense of who was going to the website and when they were going there.
Big data tools: Pentaho Business Analytics
Pentaho is another software platform that began as a report generating engine; it is, like JasperSoft, branching into big data by making it easier to absorb information from the new sources. You can hook up Pentaho’s tool to many of the most popular NoSQL databases such as MongoDB and Cassandra. Once the databases are connected, you can drag and drop the columns into views and reports as if the information came from SQL databases.
I found the classic sorting and sifting tables to be extremely useful for understanding just who was spending the most amount of time at my website. Simply sorting by IP address in the log files revealed what the heavy users were doing.
Pentaho also provides software for drawing HDFS file data and HBase data from Hadoop clusters. One of the more intriguing tools is the graphical programming interface known as either Kettle or Pentaho Data Integration. It has a bunch of built-in modules that you can drag and drop onto a picture, then connect them. Pentaho has thoroughly integrated Hadoop and the other sources into this, so you can write your code and send it out to execute on the cluster.
Big data tools: Karmasphere Studio and Analyst
Many of the big data tools did not begin life as reporting tools. Karmasphere Studio, for instance, is a set of plug-ins built on top of Eclipse. It’s a specialized IDE that makes it easier to create and run Hadoop jobs.
I had a rare feeling of joy when I started configuring a Hadoop job with this developer tool. There are a number of stages in the life of a Hadoop job, and Karmasphere’s tools walk you through each step, showing the partial results along the way. I guess debuggers have always made it possible for us to peer into the mechanism as it does its work, but Karmasphere Studio does something a bit better: As you set up the workflow, the tools display the state of the test data at each step. You see what the temporary data will look like as it is cut apart, analyzed, then reduced.
Karmasphere also distributes a tool called Karmasphere Analyst, which is designed to simplify the process of plowing through all of the data in a Hadoop cluster. It comes with many useful building blocks for programming a good Hadoop job, like subroutines for uncompressing Zipped log files. Then it strings them together and parameterizes the Hive calls to produce a table of output for perusing.
Big data tools: Talend Open Studio
Talend also offers an Eclipse-based IDE for stringing together data processing jobs with Hadoop. Its tools are designed to help with data integration, data quality, and data management, all with subroutines tuned to these jobs.
Talend Studio allows you to build up your jobs by dragging and dropping little icons onto a canvas. If you want to get an RSS feed, Talend’s component will fetch the RSS and add proxying if necessary. There are dozens of components for gathering information and dozens more for doing things like a “fuzzy match.” Then you can output the results.
Stringing together blocks visually can be simple after you get a feel for what the components actually do and don’t do. This was easier for me to figure out when I started looking at the source code being assembled behind the canvas. Talend lets you see this, and I think it’s an ideal compromise. Visual programming may seem like a lofty goal, but I’ve found that the icons can never represent the mechanisms with enough detail to make it possible to understand what’s going on. I need the source code.
Talend also maintains TalendForge, a collection of open source extensions that make it easier to work with the company’s products. Most of the tools seem to be filters or libraries that link Talend’s software to other major products such as Salesforce.com and SugarCRM. You can suck down information from these systems into your own projects, simplifying the integration.
Big data tools: Skytree Server
Not all of the tools are designed to make it easier to string together code with visual mechanisms. Skytree offers a bundle that performs many of the more sophisticated machine-learning algorithms. All it takes is typing the right command into a command line.
Skytree is more focused on the guts than the shiny GUI. Skytree Server is optimized to run a number of classic machine-learning algorithms on your data using an implementation the company claims can be 10,000 times faster than other packages. It can search through your data looking for clusters of mathematically similar items, then invert this to identify outliers that may be problems, opportunities, or both. The algorithms can be more precise than humans, and they can search through vast quantities of data looking for the entries that are a bit out of the ordinary. This may be fraud — or a particularly good customer who will spend and spend.
The free version of the software offers the same algorithms as the proprietary version, but it’s limited to data sets of 100,000 rows. This should be sufficient to establish whether the software is a good match.
Big data tools: Tableau Desktop and Server
Tableau Desktop is a visualization tool that makes it easy to look at your data in new ways, then slice it up and look at it in a different way. You can even mix the data with other data and examine it in yet another light. The tool is optimized to give you all the columns for the data and let you mix them before stuffing it into one of the dozens of graphical templates provided.
Tableau Software started embracing Hadoop several versions ago, and now you can treat Hadoop “just like you would with any data connection.” Tableau relies upon Hive to structure the queries, then tries its best to cache as much information in memory to allow the tool to be interactive. While many of the other reporting tools are built on a tradition of generating the reports offline, Tableau wants to offer an interactive mechanism so that you can slice and dice your data again and again. Caching helps deal with some of the latency of a Hadoop cluster.
The software is well-polished and aesthetically pleasing. I often found myself reslicing the data just to see it in yet another graph, even though there wasn’t much new to be learned by switching from a pie chart to a bar graph and beyond. The software team clearly includes a number of people with some artistic talent.
Big data tools: Splunk
Splunk is a bit different from the other options. It’s not exactly a report-generating tool or a collection of AI routines, although it accomplishes much of that along the way. It creates an index of your data as if your data were a book or a block of text. Yes, databases also build indices, but Splunk’s approach is much closer to a text search process.
This indexing is surprisingly flexible. Splunk comes already tuned to my particular application, making sense of log files, and it sucked them right up. It’s also sold in a number of different solution packages, including one for monitoring a Microsoft Exchange server and another for detecting Web attacks. The index helps correlate the data in these and several other common server-side scenarios.
Splunk will take text strings and search around in the index. You might type in the URLs of important articles or the IP address. Splunk finds them and packages them into a timeline built around the time stamps it discovers in the data. All other fields are correlated, and you can click around to drill deeper and deeper into the data set. While this is a simple process, it’s quite powerful if you’re looking for the right kind of needle in your data feed. If you know the right text string, Splunk will help you track it. Log files are a great application for it.
A new Splunk tool called Shep, currently in private beta, promises bidirectional integration between Hadoop and Splunk, allowing you to exchange data between the systems and query Splunk data from Hadoop.
Bigger than big data
After wading through these products, it became clear that “big data” was much bigger than any single buzzword. It’s not really fair to lump together products that largely build tables with those that attempt complicated mathematical operations. Nor is it fair to compare simpler tools that work with generic databases with those that attempt to manage larger stacks spread out over multiple machines in frameworks like Hadoop.
To make matters worse, the targets are moving. Some of the more tantalizing new companies still aren’t sharing their software yet. Mysterious Platfora has a button you can click to stay informed, while another enigmatic startup, Continuity, just says, “We’re still in stealth, heads down and coding hard.” They’re surely not going to be the last new entrants in this area.
Despite the speed and sophistication of the new algorithms, I found myself liking the old classic reports the best. The Pentaho and Jaspersoft tools simply produce nice lists of the top entries, but this was all I needed. Knowing the top domains in my log file was enough.
The other algorithms are intellectually interesting, but they’re harder to apply with any consistency. They can flag clusters or do fuzzy matching, but my data set didn’t seem to lend itself to these analyses. Try as I might, I couldn’t figure out any applications for my data that didn’t seem contrived.
Others will probably feel differently. The clustering algorithms are used heavily in diverse applications such as helping people find similar products in online stores. Others use outlier detection algorithms to identify potential security threats. These all bear investigation, but the software is the least of the challenges.
Perhaps it is my lack of vision that left me clutching to the old sortable reports. In time, I may come to understand just how I might use the advanced algorithms to do more. This may be why most of these companies list consulting among their products. They will rent you one of their engineers, who is familiar with the software and the math, so you have a guide when you’re digging around the data. This is a good option for every business because the needs and demands are often rather abstract and filled with wishful hand-waving.
At a recent O’Reilly Strata conference on big data, one of the best panels debated whether it was better to hire an expert on the subject being measured or an expert on using algorithms to find outliers. I’m not sure I can choose, but I think it’s important to hire a person with a mandate to think deeply about the data. It’s not enough to just buy some software and push a button.