Google claims MapReduce sets data-sorting record

Google Inc. said late last week that the results of in-house data-sorting tests bolster its claim that its MapReduce technology can manipulate more data, faster, than any conventional database.

According to a blog post by Grzegorz Czajkowski, a member of Google’s systems infrastructure team, MapReduce recently sorted 1 terabyte (TB) of data in 68 seconds, or about a third of the time Yahoo! Inc. took this summer.

Sorting, or rearranging, data is one of the most basic functions of a spreadsheet, database or other data manipulation software.

Google used 1,000 servers running MapReduce in parallel to sort the data, versus 910 for Yahoo, according to Czajkowski.
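To illustrate how a sort can be spread across many machines, here is a minimal single-process sketch of a range-partitioned sort. It is not Google's implementation, whose details the blog post summary does not give; the function name `partitioned_sort`, the sampling scheme and the use of in-memory lists as stand-ins for servers are all assumptions made for illustration.

```python
import bisect
import random


def partitioned_sort(records, num_workers):
    """Sort records by splitting them into key ranges, one per (simulated) worker.

    Hypothetical sketch: each "worker" is just a Python list in this process.
    """
    if not records:
        return []

    # Sample the input to choose splitter keys that roughly balance the ranges.
    sample_size = min(len(records), 10 * num_workers)
    sample = sorted(random.sample(records, sample_size))
    step = len(sample) / num_workers
    splitters = [sample[int(i * step)] for i in range(1, num_workers)]

    # "Map/shuffle" phase: route every record to the worker that owns its key range.
    buckets = [[] for _ in range(num_workers)]
    for record in records:
        buckets[bisect.bisect_right(splitters, record)].append(record)

    # "Reduce" phase: each worker sorts its own range independently.
    # Because the ranges are ordered, concatenating them yields a fully sorted result.
    result = []
    for bucket in buckets:
        result.extend(sorted(bucket))
    return result
```

In a real cluster the per-range sorts would run at the same time on separate machines, which is where the speed-up from adding servers would come from.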

Google also tested MapReduce’s ability to sort 1 petabyte (PB), or 1,000 TB, of data. That is equivalent to 12 times the amount of archived Web data in the U.S. Library of Congress as of May 2008, according to Google.

Using 4,000 servers, which is likely a small fraction of Google’s entire worldwide server infrastructure, MapReduce took six hours and two minutes to sort 1 PB, according to Czajkowski.
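The two reported runs can be put on a common footing by working out the implied throughput. The short calculation below is a back-of-the-envelope sketch that uses only the figures quoted in this article; it assumes decimal units (1 TB = 10^12 bytes), which the article does not specify, and the helper name `throughput` is made up for illustration.

```python
TB = 10**12      # assumption: decimal terabyte
PB = 1000 * TB   # the article equates 1 PB with 1,000 TB


def throughput(bytes_sorted, seconds, servers):
    """Return (aggregate bytes/sec, bytes/sec per server) for one sort run."""
    total = bytes_sorted / seconds
    return total, total / servers


# 1 TB in 68 seconds on 1,000 servers
tb_total, tb_per_server = throughput(TB, 68, 1000)

# 1 PB in 6 hours 2 minutes on 4,000 servers
pb_total, pb_per_server = throughput(PB, 6 * 3600 + 2 * 60, 4000)

print(f"1 TB run: {tb_total / 1e9:.1f} GB/s total, {tb_per_server / 1e6:.1f} MB/s per server")
print(f"1 PB run: {pb_total / 1e9:.1f} GB/s total, {pb_per_server / 1e6:.1f} MB/s per server")
```

Under those assumptions, the terabyte run works out to roughly 14.7 GB/s in aggregate (about 14.7 MB/s per server) and the petabyte run to roughly 46.0 GB/s (about 11.5 MB/s per server).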

“We’re not aware of any other sorting experiment at this scale and are obviously very excited to be able to process so much data so quickly,” he wrote.

Czajkowski did not say when the tests were done. He did reveal that as of early January this year, Google was processing an average of 20 PB total per day.

By comparison, the largest publicly known data warehouses today store several petabytes of data in total and process only a tiny fraction of that amount each day.

Google’s announcement appeared to be deliberately timed to coincide with a speech by a noted database expert and MapReduce critic, David DeWitt.

A former longtime University of Wisconsin-Madison computer science professor, DeWitt joined Microsoft this spring to run a new research lab being created on the Madison campus.

The lab will focus on helping Microsoft’s SQL Server “scale out” in order to run on hundreds or thousands of servers at a time. That will allow customers to run parallel database clusters technically similar to Google’s, though nowhere near the latter’s scale.

Early this year, DeWitt, along with database industry legend Michael Stonebraker, co-wrote a blog post arguing that MapReduce was a “sub-optimal…not novel” type of database that lacked many features modern DBAs and developers take for granted, and that it was unworthy of the hype it has received. In an interview last week with Computerworld, DeWitt praised MapReduce’s scalability and hardiness. But DeWitt also stood firm on MapReduce’s shortcomings.

He and Stonebraker are also submitting a paper to the Association for Computing Machinery (ACM) that compares the performance of several databases, including IBM’s DB2 and Stonebraker’s Vertica, with MapReduce and another similar non-relational data engine, Apache Hadoop.

That paper may be publicly available as early as late January, said DeWitt.

DeWitt gave a keynote speech on Friday at the Professional Association for SQL Server’s (PASS) conference in Seattle.

He did not directly criticize MapReduce during his PASS keynote speech, according to blog reports.
