From cars to computer technology, the desire for more power is often the driving force behind many technological innovations. Consider the information technology arena where storing a terabyte (TB) of data would have seemed like science fiction back in the 1980s. Today, multi-terabyte-sized data warehouses are poised to give way to the petabyte, with a storage capacity of 1024 TBs — about 250 billion pages of text.
Some governments and university research labs are already taking advantage of super-sized databases, and these systems could become a reality for the business sector within the next two years. While a petabyte of data may not be an easy concept to grasp, it is a reflection of the rapidly changing economic reality.
According to James Rothnie, chief technology officer at storage system vendor EMC Corporation, a typical American consumer will generate about 100 gigabytes of data during his or her lifetime. This includes medical, educational, insurance, and credit history data. Multiply that by 100 million consumers and the result is 10,000 petabytes of data. In Canada, with one tenth the population of the U.S., 30 million consumers will generate about 3,000 petabytes — still a staggering amount of data.
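The arithmetic behind these estimates is easy to check. The short Python sketch below assumes decimal units (1 GB = 10^9 bytes, 1 PB = 10^15 bytes), which is what the 10,000-petabyte figure implies. The function name lifetime_data_pb is purely illustrative.

```python
# Back-of-the-envelope check of the figures quoted above.
# Assumption: decimal units, not powers of 1,024.
GB = 10**9    # bytes in a gigabyte
PB = 10**15   # bytes in a petabyte


def lifetime_data_pb(consumers: int, gb_per_consumer: int = 100) -> float:
    """Total lifetime data, in petabytes, for a population of consumers."""
    return consumers * gb_per_consumer * GB / PB


us_pb = lifetime_data_pb(100_000_000)      # 100 million consumers
canada_pb = lifetime_data_pb(30_000_000)   # 30 million consumers
print(us_pb, canada_pb)
```

If both gigabytes and petabytes are counted in binary units (powers of 1,024), the totals come out roughly 5 percent lower.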
The cost of buying a petabyte database could be significant for many organizations; however, to put it in perspective, just 10 years ago the hardware and software needed to store and manage one megabyte (MB) of data cost about $15. That figure has now dropped to about 15 cents per MB and in the very near future could well be 1 cent per MB.
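To put a number on what that price drop means at petabyte scale, the following sketch multiplies the three per-megabyte prices quoted above by the number of megabytes in a petabyte. It assumes the binary definition used earlier (1 PB = 1,024 TB, so 1,024^3 MB), works in whole cents, and ignores everything a real purchase would involve beyond the raw per-megabyte price. The function name is illustrative.

```python
# Assumption: 1 PB = 1,024 TB, 1 TB = 1,024 GB, 1 GB = 1,024 MB.
MB_PER_PB = 1024 ** 3


def petabyte_cost_dollars(cents_per_mb: int) -> float:
    """Cost in dollars of one petabyte at a given price per megabyte."""
    return MB_PER_PB * cents_per_mb / 100


# Prices per MB taken from the paragraph above, expressed in cents.
for label, cents in [("10 years ago", 1500), ("today", 15), ("near future", 1)]:
    print(f"{label}: ${petabyte_cost_dollars(cents):,.0f} per petabyte")
```

Under these assumptions, a petabyte falls from roughly $16 billion to about $161 million, and to under $11 million at 1 cent per MB.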
As business, government and academia seek more sophisticated ways to use large volumes of data, the lower cost of storing and managing that data will make petabyte data warehouses a practical solution.
And data management will be critical. In a 2004 Teradata survey [1] of senior executives in some of North America’s top companies, well over half of respondents indicated their data volumes were double and even triple the previous year. Half of the executives also said decision-making was more complex in 2004 than in 2003, and 84 percent stated they were not getting reliable information in a timely manner.
As the cost becomes more affordable, and data and business needs continue to increase at an exponential rate, you may be wondering how to prepare for a petabyte-sized future. What can your organization do with 1,024 TBs of data? Can you really justify the investment?
The latest server technology, for example, is capable of scaling to 4.2 petabytes for commercial decision support. These servers are designed to grow as the business grows — from a single server to hundreds of servers — all without rewriting the applications, database or tools and utilities. More important than size, however, is server performance, which must be able to support businesses that depend on their data warehouse for mission-critical applications that drive revenue, customer satisfaction and employee productivity.
Commercially, San Antonio-based SBC Communications Inc. has brought petabyte capability into the mainstream with a system that can harness hundreds of top-of-the-line Intel CPUs with many hundreds of gigabytes of addressable memory and hundreds of TBs of disk space, all supporting a single, integrated database. Driving this is scalability, which is essential to the successful design and deployment of these supercomputing systems.
Scalability — more users, more capabilities
Many organizations with multi-TB data warehouses have begun to see significant returns on investment, as their competitive advantage is derived not from differences in prices or products, but rather from the ability to gather more detailed information on customers and prospects than the competition.
Converting prospects into loyal, long-time customers means offering them the right products, services and information, at the right time. By collecting enough detailed information about each prospect, organizations can better identify important buying patterns. This detailed data — currently measured in quantities of hundreds of TBs and eventually petabytes — will result in the ability to quickly and accurately search and deploy huge amounts of data throughout the enterprise. Key to this? System scalability.
Scalability allows an organization to add more processing power to a hardware configuration and get a linearly proportional increase in performance. The additional hardware stores and processes increasingly larger volumes of data (or progressively more complex queries, or larger numbers of concurrent queries) without reducing system performance. Poor design or product deployment, on the other hand, can produce just the opposite: performance that deteriorates faster than data size grows.
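One simple way to test whether a configuration scales linearly is to compare the speedup achieved with the amount of hardware added. The sketch below is illustrative only: the function name and the timings are hypothetical, not measurements from any real system.

```python
def scaling_efficiency(time_before: float, time_after: float,
                       hardware_factor: float) -> float:
    """Return the speedup achieved divided by the factor of hardware added.

    1.0 means perfectly linear scaling; lower values mean the added
    hardware is delivering less than its share.
    """
    speedup = time_before / time_after
    return speedup / hardware_factor


# Hypothetical timings (seconds) for the same query on the same data,
# before and after doubling the hardware.
linear = scaling_efficiency(time_before=120.0, time_after=60.0, hardware_factor=2.0)
sublinear = scaling_efficiency(time_before=120.0, time_after=80.0, hardware_factor=2.0)
print(linear, sublinear)
```

The same ratio can be tracked while data volume or query concurrency grows alongside the hardware; a value that stays near 1.0 is what linear scalability looks like in practice.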
There are four key dimensions to scalability:
Dimension one: parallel technology and ‘shared nothing’ architecture
In today’s information-rich business environment, organizations are constantly gathering enormous amounts of data to support key business applications and enterprise decision-making. This is occurring as the price per megabyte drops. When considering the ROI, will the extra data add enough value to your organization to justify the expense of storing it?
Retrieving richly detailed answers to strategic and tactical business queries efficiently means accepting that the size of the data warehouse really does matter.
A large, multinational insurance firm may decide to assess the lifetime value of its customers in one key customer segment. If the current database processes data serially, a query like this could drag down system performance. Answers to key business questions arrive more quickly and reliably through a ‘divide and conquer’ approach: parallel technology deployed on a ‘shared nothing’ architecture, in which each processing unit works on its own slice of the data with its own processor, memory and disk. That’s where quantifiable business value begins.
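The ‘divide and conquer’ idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not any vendor’s implementation: rows are hash-distributed by customer ID so that each "node" owns a disjoint slice, each node totals only its own rows, and a coordinator merges the partial answers. Threads stand in for what would be separate servers, and all function names are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition(rows, num_nodes):
    """Hash-distribute rows so each node owns a disjoint slice of the data."""
    nodes = [[] for _ in range(num_nodes)]
    for customer_id, amount in rows:
        nodes[hash(customer_id) % num_nodes].append((customer_id, amount))
    return nodes


def local_totals(node_rows):
    """Work done independently on one node, using only its own rows."""
    totals = defaultdict(float)
    for customer_id, amount in node_rows:
        totals[customer_id] += amount
    return dict(totals)


def lifetime_value(rows, num_nodes=4):
    """Total spend per customer, computed in parallel across the nodes."""
    nodes = partition(rows, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(local_totals, nodes))
    merged = {}
    for partial in partials:
        # Keys never collide: every customer lives on exactly one node.
        merged.update(partial)
    return merged


# Hypothetical (customer_id, premium_paid) rows.
rows = [(1, 100.0), (2, 50.0), (1, 25.0), (3, 10.0)]
result = lifetime_value(rows)
print(result)
```

Because no node needs another node’s data to finish its part, adding nodes shrinks each slice without adding contention, which is the property the serial approach lacks.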
Dimension two: so much data, so many queries
Data warehouses in large organizations must routinely handle thousands of concurrent queries from anywhere within the organization, at any time. Perhaps a large bank is gathering data on fraudulent debit and credit card transactions, or a national retailer is collecting data from thousands of daily POS transactions. Multiply this by hundreds of branches or stores, across various geographic regions, and the case for concurrent query capabilities becomes glaringly obvious.
Handling concurrent queries means the enterprise data warehouse must have sophisticated resource management capabilities so that as concurrent queries are made, the parallel database can satisfy multiple requests and scan multiple tables.
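One basic form of resource management is admission control: letting only a fixed number of queries run at once while the rest wait their turn. The following Python sketch illustrates that single idea with a semaphore. The QueryGovernor class, the fake_query function and the limit of three are hypothetical, and a real warehouse would also weigh priorities, memory and I/O.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class QueryGovernor:
    """Allow at most max_concurrent queries to execute at the same time."""

    def __init__(self, max_concurrent):
        self._slots = threading.Semaphore(max_concurrent)
        self._lock = threading.Lock()
        self.running = 0   # queries executing right now
        self.peak = 0      # highest concurrency observed

    def run(self, query_fn, *args):
        with self._slots:                      # wait here if all slots are busy
            with self._lock:
                self.running += 1
                self.peak = max(self.peak, self.running)
            try:
                return query_fn(*args)
            finally:
                with self._lock:
                    self.running -= 1


def fake_query(n):
    time.sleep(0.01)   # stand-in for real query work
    return n * n


governor = QueryGovernor(max_concurrent=3)
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(lambda n: governor.run(fake_query, n), range(20)))
print(results[:5], "peak concurrency:", governor.peak)
```

Twenty requests arrive from ten client threads, yet the governor never lets more than three execute together, so every request is eventually answered without the system being swamped.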
Dimension three: the challenge of complex data
To return to the 2004 survey of senior executives, the study revealed not only that the amount of data within North American business is increasing drastically, but also that the data itself is becoming more complex. This is another challenge for optimizing queries in massive databases.
How about the task of building a simple customer profile? Years ago this exercise would have involved three or four interrelated data points stored in disparate data marts. Now 30 or 40 data points are possible and accessible in one enterprise data warehouse. Yet if the warehouse generates an enormous table complete with literally billions of pieces of generically categorized transaction data, then all the processing capacity in the world won’t deliver a useful customer profile. True, the warehouse may be able to separate the data into different tables, but if it is unable to preserve the business relationships among the tables then the ability to analyze that data and the business value could be severely compromised.
As warehouses increase their capacity from TBs to petabytes, they must also be able to create a highly organized “file system” designed specifically for analytic queries. The system should contain multiple tables and preserve the business relationships across subject areas for easy cross-referencing.
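A small example shows what preserving business relationships across tables means in practice. The sketch below uses Python’s built-in sqlite3 module with a hypothetical three-table schema (customer, account, txn) linked by foreign keys. The table names, columns and sample rows are invented for illustration and bear no relation to any production warehouse design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces relationships only when asked
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE account (
        account_id  INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        product     TEXT NOT NULL
    );
    CREATE TABLE txn (
        txn_id      INTEGER PRIMARY KEY,
        account_id  INTEGER NOT NULL REFERENCES account(account_id),
        amount      REAL NOT NULL
    );
""")
conn.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
conn.executemany("INSERT INTO account VALUES (?, ?, ?)",
                 [(10, 1, "chequing"), (11, 1, "credit card"), (12, 2, "chequing")])
conn.executemany("INSERT INTO txn VALUES (?, ?, ?)",
                 [(100, 10, 250.0), (101, 11, 40.0), (102, 11, 60.0), (103, 12, 75.0)])

# Because the relationships are preserved, a profile is a simple join.
profile = conn.execute("""
    SELECT c.name,
           COUNT(DISTINCT a.account_id) AS accounts,
           COUNT(t.txn_id)              AS transactions,
           SUM(t.amount)                AS total_amount
    FROM customer AS c
    JOIN account  AS a ON a.customer_id = c.customer_id
    JOIN txn      AS t ON t.account_id  = a.account_id
    GROUP BY c.customer_id
    ORDER BY c.customer_id
""").fetchall()
print(profile)

# A transaction that points at a non-existent account is refused.
try:
    conn.execute("INSERT INTO txn VALUES (999, 404, 1.0)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
print("orphan rejected:", orphan_rejected)
```

The same transactions dumped into one generically categorized table would still hold the numbers, but nothing would tie them back to an account or a customer.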
Dimension four: automating sophisticated data queries and data mining
The super-sized data warehouse of the very near future must have the ability to handle queries and data mining that go beyond simply a tally of last month’s insurance fraud claims, last week’s debit card transactions, or the dollar amount of yesterday’s POS transactions. The warehouse must also be able to break down many components and determine an efficient route for gathering the appropriate information.
While a cost-based optimizer could certainly automate this process in most databases, database administrators often end up intervening, which makes the process costly and time-consuming. A data warehouse that truly delivers petabyte-type value would have an optimizer that handles sophisticated queries and data mining without human intervention.
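For readers unfamiliar with the term, a cost-based optimizer estimates how much work each possible execution plan would take and picks the cheapest. The toy Python sketch below makes that concrete for one decision, the order in which to join three tables. Everything in it is an assumption for illustration: the row counts, the single uniform selectivity and the brute-force search. Production optimizers rely on detailed statistics and far more refined cost models.

```python
from itertools import permutations


def estimated_cost(order, row_counts, selectivity):
    """Sum of estimated intermediate result sizes for joining tables in order."""
    rows = row_counts[order[0]]
    cost = 0.0
    for table in order[1:]:
        # Toy estimate: each join keeps a fixed fraction of the row pairs.
        rows = rows * row_counts[table] * selectivity
        cost += rows
    return cost


def choose_join_order(row_counts, selectivity):
    """Try every join order and return the one with the lowest estimated cost."""
    return min(permutations(row_counts),
               key=lambda order: estimated_cost(order, row_counts, selectivity))


# Hypothetical table sizes.
row_counts = {"customer": 1_000, "account": 5_000, "txn": 1_000_000}
best = choose_join_order(row_counts, selectivity=0.001)
naive = ("txn", "account", "customer")
print("chosen order:", best)
print("estimated cost:", estimated_cost(best, row_counts, 0.001),
      "versus", estimated_cost(naive, row_counts, 0.001))
```

Even in this tiny case the chosen plan joins the two small tables first and leaves the million-row table for last, cutting the estimated work roughly in half compared with starting from the largest table.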
The petabyte future
One exciting project well underway is the Internet Archive, a non-profit public organization founded in 1996 with a mandate to build an Internet library to give researchers, historians and scholars permanent access to historical collections that exist in digital format. In 2003 the Internet Archive Wayback Machine contained over 300 TBs of data, and was growing at a rate of 12 TBs per month. By 2004 it had reached over a petabyte and is now adding 20 TBs of data every month.
Commercially there will be many factors contributing to the need for more powerful data warehouses, not the least of which will be data-rich technologies like radio frequency identification (RFID). Not yet mainstream, RFID will certainly result in an explosion of information just waiting to be harnessed. And with the increasing quantity of data, the speed of decision-making will be tested.
Turning all of this data into relevant, reliable and, most importantly, actionable information will be possible with petabyte-sized data warehouses. The true challenge will be deriving value from that deeply detailed business intelligence to spur better decision-making across the enterprise. Unless the data warehouse can efficiently organize increasingly complex data and optimize sophisticated and concurrent queries, the amount of data stored is meaningless.
[1] For more information on the 2004 Teradata survey, visit http://www.teradata.com/t/page/128513/
–Stephen Brobst is an internationally recognized expert in data warehousing. He completed Master’s and PhD research at the Massachusetts Institute of Technology, and has authored numerous journal and conference papers in the fields of data management and parallel computing environments. Mr. Brobst is the Chief Technology Officer for Teradata, a division of NCR Corporation.