It may be a buzz phrase, the cloud computing of 2012, but I do find big data analytics fascinating. It's just the way my mind works; give me a big enough survey sample, and I can entertain myself with pivot tables for hours on end. But I felt I needed a better grounding in the concept, so I asked the folks at SAS Canada for a schooling. They connected me with Paul Kent of SAS Institute Inc. in Cary, N.C. Kent is the vice-president of platform research and development for the company.
It's a given that technology changes everything, but that's particularly true in the big data analytics field. The ability to analyze billions of lines of data in memory, innovations like the Hadoop MapReduce framework for distributed computing, and high-performance computing grids make it possible to perform analytics on ever-increasing amounts of data in near-real time.
On the other side of the equation, we're collecting more and more data to analyze. The evolution of data analysis is inextricably linked to the evolution of data collection. In the early days of computing, data was part of the application itself. Move along to the transactional database model, and data is collected from outside the application, but it has to comply with a specific structure of fields. Now, the sources of data aren't so structured: we're dealing with documents, images, and media files, often without the appropriate metadata; geo-location data that may or may not be associated with a transaction; social media feeds wherein context is everything; metering data from electrical grids; all manner of telematics from vehicles, production machinery, etc.
I remember a story from the days of yore, when data mining was a fresh concept. A colleague of mine called out a representative of one of the vendors over the beer and diapers issue: analyze enough transactional data, and you'll find a pattern that suggests people who buy diapers also buy beer, so a retailer can organize the shelves accordingly. Said colleague's complaint was that the company rep was presenting this as a fact, rather than as a theoretical example of the patterns that data mining can unlock, and factually, it wasn't true. It's an item of small relevance, but for the fact that it lodged the beer-and-diapers model of data mining in my head for the ensuing 15 years.
And it's a handy model to have when the skeptical say that big data analytics is just a jumped-up version of data mining. It highlights the fundamental difference, and my discussion with Kent crystallized it: data mining is transaction-focused, teasing patterns out of information of limited scope, whereas big data analytics has a behavioural focus. We're not concerned with the transaction, according to Kent, but with the behaviour that leads to the transaction. Of those many new types of data outlined a couple of paragraphs ago, almost all are related to behaviour.
That was my big data “aha” moment, and it fundamentally changed my understanding of analytics.