Supporters of using Apache Hadoop for processing big data have two big boosters on their side: Intel Corp. and EMC Greenplum, each of which this week released its own distribution of the Hadoop software for storing and processing large amounts of data.
Chipmaker Intel said Tuesday that its version of the open source software, which includes a manager for deployment, is optimized for its Xeon processors and includes encryption that supports Intel’s AES New Instructions for security on its CPUs.
Meanwhile on Monday the Greenplum division of EMC announced Pivotal HD, which integrates the Greenplum database with Apache Hadoop.
They join three other commercial Hadoop distributions (Cloudera, Hortonworks and MapR) as options likely to appeal to organizations.
Forrester Research analyst Mike Gualtieri found the announcements exciting. It makes sense for Intel to get into the fray because Hadoop is a storage and data processing platform, he said. As a chip maker it can help with getting data more efficiently into Hadoop, he added. He also noted that Intel says it’s not trying to compete with enterprise software companies, which try to lock customers into their technologies.
Gualtieri was less impressed with Greenplum’s announcement, even though its Pivotal HD software includes a SQL database. One of Hadoop’s shortcomings is accessing data through SQL. However, he noted that there are other solutions. Cloudera is working on a project called Impala to put a fast SQL layer on top of Hadoop, he pointed out.
The ability to process big data – broadly defined as data bigger than most analytics software can handle – could bring big benefits to business, argues Intel. But “only a small fraction of the world is able to extract meaning from all of this information because the technologies, techniques and skills available today are either too rigid for the data types or too expensive to deploy.”
The optimizations made for the networking and IO technologies in the Intel Xeon processor platform also enable new levels of analytic performance, Intel said in a news release.
Analyzing one terabyte of data, which would previously take more than four hours to fully process, can now be done in seven minutes, it claims, thanks to a combination of Intel hardware and the company’s Hadoop distribution.
The proprietary management software is aimed at simplifying the deployment, configuration and monitoring of the Hadoop processing cluster. Optimal performance can be had through an automatic tuner, Intel says.
Intel [Nasdaq: INTC] said it is also contributing enhancements to the open source code covering the YARN distributed processing framework, the Hadoop Distributed File System and the Hive SQL query functions.
Intel Distribution for Apache Hadoop will be sold with technical support by solution and service providers. Support options include 24-hour, seven-day coverage or coverage eight hours a day, five days a week.
Partners supporting Intel Hadoop include Cisco Systems Inc., Cray, Dell, Red Hat, SAP, SAS, Teradata and a range of others.
Greenplum says its distribution significantly expands Hadoop by adding tools including a command centre for monitoring the file system, virtualization extensions and Isilon support; installation, configuration and management tools; and support for the Spring framework.
It also includes a relational database Greenplum calls HAWQ that has its own execution engine.
According to a Greenplum blog, HAWQ is “hundreds of times faster” than Hadoop’s Hive data warehouse.