Precision is what most people strive for in using analytics. Crunching all those numbers with computing power that could run a small province has to generate an exact measure, right?
No, says an expert, who believes it pays to be vague — particularly in big data projects.
“I’ve seen many end results in files and they have very impressive numbers,” says Alan Khara, director of information of the First Nations Education Steering Committee (FNESC) of Vancouver. But “the end result is the numbers were never achieved or were way off.”
It’s more accurate in a report to give a range of percentages in a prediction, he told a Toronto big data conference Wednesday. Ironically, in many cases the narrower the results the greater the margin of error.
The belief that huge datasets generate precise results is one of the five myths about big data that contribute to the failure of projects, he said.
Khara’s experience in analytics for a university, financial institutions and a British Columbia health institution led him to the myths. The others are:
–The more data you have, the more you’ll get out of it.
Believing this is why organizations spend so much time sweeping in as much data as they can get their hands on, said Khara. But volume doesn’t guarantee information. He recalled working for an institution that collected 8 billion images of space (each about 10 MB) looking for a planet that was allegedly pulling Neptune further from the sun than its regular orbit. With the technology available at the time the research team thought it would take three years for their data model to search the images.
The hidden gravitational force wasn’t found. But other teams using similar data found some small but useful particles. What did they do differently? They invested more time in the data model than in collecting data.
“From a business point of view it isn’t important that you’re collecting big data, it is how you’re analyzing it,” he told the conference. The more diverse skills your team has the better, he added.
Investing time in a model doesn’t mean having the latest tools, he cautioned.
–Structured data is better than unstructured.
Not true in many cases. One problem is organizations often convert unstructured data (like metadata) into a structured form. But, said Khara, that conversion can alter important data from the unstructured file. If the file included time-sensitive or time-related data, failing to take that into account can be fatal.
He recalled a B.C. construction company complaining that after investing in big data tools to analyze its decades of data on physical infrastructure there were no useful results. But, Khara points out, data models can’t be static: data collected into a system years ago under one set of assumptions may not fit the assumptions of today, so data models today may not be able to analyze old data. “Everything changes with time.”
–We can sort the data later.
Because storage is getting cheaper every year, companies want to collect lots of data, and figure out what to do with it later.
But that means the data analysts can’t take into account incidents that may colour the data. For example, he said, in 2009 in India the government did exit polls during an election to make predictions, which turned out wrong compared to the actual vote. For this year’s election they also did exit polls but the results were spot on. The difference was that they sorted the data as it came in.
It isn’t easy to sort and check data at the same time as it’s collected, Khara said, but it’s worth it.
–Data is static.
The meaning of data changes over time. Think about comparing student results data over time. Student performance indicators 22 years ago are not the same as they are today: for example, curriculum has changed, world outlook has changed, demographics have changed, professions have changed. You have to come up with a calibration to take that into account. Be wary of comparing data across different times and places.
In an interview after his presentation Khara said there are more than five big data myths (he’s writing a book on them all). So here’s another: Big data needs a big investment.
Many needed analytic tools are free — like the Hadoop software for distributed processing of large datasets, he pointed out. And if you have the right people with wide skills there’s no need to hire specialists. Another myth is that big data is only for big companies.
“Many small organizations won’t touch big data because they think it’s expensive,” he said. But he knows of one organization that got into big data with an expenditure of $10,000.
The conference, which continues Thursday, is organized by Innovation Enterprise Ltd.