The open-source Hadoop framework for processing pools of big data across clusters of servers is a boon for organizations with large stocks of data. However, making efficient use of the platform for analyzing data can be a chore.
Typically raw data has to be combed and filtered before it can be scrutinized, a process that doesn’t allow large numbers of users to leverage Hadoop at once. And the smaller datasets that result from the refining are an inefficient use of a distributed cluster.
Datameer, a company that makes a Hadoop analytics suite, says you can have your cake and eat it too, without transferring filtered data to a data warehouse or other system for analytics.
The company said Wednesday that its upcoming Datameer 5.0 can use Hadoop’s MapReduce, in-memory technology and single-server execution to process data that many users can access simultaneously.
“What we’re delivering is an optimizer called Smart Execution that looks at the dataset characteristics, looks at the analytical characteristics, looks at the available resources on your Hadoop cluster,” Matt Schumpert, director of product marketing, said in an interview.
“Then Smart Execution will dispatch the different parts of work to the different engines looking at how busy is the cluster, how busy is the dataset, can we leverage some statistics we have about the data (like do the filter before the join) … to schedule and efficiently use all the different computational engines.”
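The "filter before the join" remark refers to a common query optimization. The following is a minimal Python sketch of the general idea, not Datameer's implementation; the `hash_join` helper, the sample `orders` and `customers` rows, and the `amount >= 100` filter are all hypothetical and chosen only for illustration.

```python
def hash_join(left, right, key):
    """Join two lists of dicts on `key` using an in-memory hash index."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Hypothetical sample data, for illustration only.
orders = [
    {"customer_id": 1, "amount": 50},
    {"customer_id": 2, "amount": 500},
    {"customer_id": 1, "amount": 700},
    {"customer_id": 3, "amount": 20},
]
customers = [
    {"customer_id": 1, "name": "Ann"},
    {"customer_id": 2, "name": "Bob"},
    {"customer_id": 3, "name": "Cy"},
]


def is_large(row):
    """Hypothetical filter: keep orders of 100 or more."""
    return row["amount"] >= 100


def join_then_filter():
    """Naive plan: join every order, then discard the small ones."""
    joined = hash_join(orders, customers, "customer_id")
    return [row for row in joined if is_large(row)], len(orders)


def filter_then_join():
    """Optimized plan: discard small orders first, then join what is left."""
    kept = [row for row in orders if is_large(row)]
    return hash_join(kept, customers, "customer_id"), len(kept)


naive_result, naive_rows_joined = join_then_filter()
fast_result, fast_rows_joined = filter_then_join()
```

Both plans produce the same rows, but the second feeds fewer rows into the join, which is the kind of saving an optimizer can look for when it has statistics about the data.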
On large datasets, Smart Execution uses Apache Tez, an optimized form of MapReduce, while small data analysis will be executed on a single Hadoop node or using in-memory technology. That selection is completely transparent to the end user, and does not require IT assistance or extra hardware or software, the company said. Smart Execution can add new advances in the Hadoop ecosystem, such as Spark, as they become enterprise ready.
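The size-based dispatch described above can be pictured with a short Python sketch. It is an illustration under stated assumptions, not Datameer's code: the `Engine` and `JobProfile` types, the `choose_engine` function and the 10 GB single-node limit are hypothetical, and the real optimizer is said to weigh more factors, such as how busy the cluster is.

```python
from dataclasses import dataclass
from enum import Enum


class Engine(Enum):
    IN_MEMORY = "in-memory"
    SINGLE_NODE = "single-node"
    TEZ = "tez"


@dataclass
class JobProfile:
    input_bytes: int        # size of the dataset the job reads
    free_memory_bytes: int  # memory currently available for in-memory work


# Hypothetical cutoff; the article does not give Datameer's actual thresholds.
SINGLE_NODE_LIMIT_BYTES = 10 * 1024**3


def choose_engine(job: JobProfile) -> Engine:
    """Pick the smallest engine that can plausibly handle the job."""
    if job.input_bytes <= job.free_memory_bytes:
        return Engine.IN_MEMORY
    if job.input_bytes <= SINGLE_NODE_LIMIT_BYTES:
        return Engine.SINGLE_NODE
    return Engine.TEZ


GB = 1024**3
small = choose_engine(JobProfile(input_bytes=100 * 1024**2, free_memory_bytes=1 * GB))
medium = choose_engine(JobProfile(input_bytes=5 * GB, free_memory_bytes=1 * GB))
large = choose_engine(JobProfile(input_bytes=500 * GB, free_memory_bytes=1 * GB))
```

Because the caller only submits a job profile and receives an engine, the choice stays hidden from the end user, which mirrors the transparency the company describes.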
The advantages are faster analytics, low latency and better utilization of the Hadoop cluster, Schumpert said.
Some organizations copy data back and forth between the Hadoop cluster and business intelligence or in-memory database tools, he said. That raises administration and security issues. Datameer 5.0 eliminates that copying. “It’s one job that runs through Hadoop and is audited through YARN,” he said, referring to a tool within Hadoop 2.0 that lets users run multiple applications with shared resource management.
It also reduces the cycle times of end users, he said. Datameer lets users define the analytical steps with a sample of data, and the software computes the full results.
“Now jobs can run faster, which means you’re going to be able to iterate faster. So if you need those accurate results before you can decide what’s going to be the next step, or you’re going to pass the result to another analyst who’s going to go down another path, you can do that quicker.”
Datameer 5.0 will be released in Q4.