Data warehouse boost on a budget

When Premier Inc.’s medical databases began bogging down last year, the San Diego-based provider of clinical data put its data warehouse in a box—literally.

Premier gathers clinical data from 400 hospitals and sells access to it to pharmaceutical manufacturers. Last year, the company’s IBM Red Brick data warehouse had grown to 3TB, and one table included 3 billion entries. “When you go through 3 billion rows of data, you get long runtimes,” says Chris Stewart, director of data warehouse architecture.

The problem wasn’t just the size of the database, however, but how clients used the data. “Our users want to access all of the data from top to bottom,” says Stewart, and the complex, multipass queries created by Premier’s 4,000 users each week were slowing performance. Some wouldn’t run at all.

Instead of adding to its 24-processor Solaris server infrastructure or making further attempts to optimize the database, Stewart brought in an all-inclusive data warehouse appliance from Netezza Corp. in Framingham, Mass. Some calculations that took one or two days now finish in six to eight minutes on the appliance’s 108 processors. Premier still uses Red Brick for most queries, but the NPS 8150 appliance handles the “really, really ugly questions” that weren’t possible to process before, he says. “We couldn’t offer the product offerings we do today” without the appliance, Stewart says.
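For a sense of scale, the runtimes Stewart cites can be turned into a rough speedup range. The short calculation below is only a back-of-the-envelope sketch that takes "one or two days" and "six to eight minutes" at face value; the variable names are illustrative and do not come from Premier or Netezza.

```python
MINUTES_PER_DAY = 24 * 60

# Runtimes quoted in the article, converted to minutes.
before_minutes = (1 * MINUTES_PER_DAY, 2 * MINUTES_PER_DAY)  # one to two days
after_minutes = (6, 8)                                        # six to eight minutes

# Smallest speedup: shortest old runtime versus longest new runtime.
min_speedup = before_minutes[0] / after_minutes[1]
# Largest speedup: longest old runtime versus shortest new runtime.
max_speedup = before_minutes[1] / after_minutes[0]

print(f"Speedup between {min_speedup:.0f}x and {max_speedup:.0f}x")
```

Under those assumptions the improvement falls somewhere between 180 and 480 times.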

As data warehouses continue to grow, more users are demanding access to business intelligence (BI) tools to conduct data-mining exercises across large data sets. “We’re talking about using every single call-detail record generated in the last three years,” says Claudia Imhoff, president of Intelligent Solutions Inc., a consulting firm in Boulder, Colo.

It’s hard for database administrators (DBAs) to create aggregations of data, such as summarizations, that can facilitate the processing of these complex queries because users often don’t know in advance what they’re looking for. “These unplanned questions are the ones that knock the stuffing out of databases,” she says.
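The difficulty can be illustrated with a small sketch. It assumes a made-up table of (hospital, month, drug, cost) rows and a summary that a DBA planned in advance; none of these names or figures come from the companies in this story. The planned question is answered from the summary with one lookup, while the unplanned one has to go back to the detail rows because the summary never kept the drug column.

```python
from collections import defaultdict

# Hypothetical detail rows: (hospital, month, drug, cost).
detail_rows = [
    ("A", "2004-01", "drug_x", 120.0),
    ("A", "2004-01", "drug_y", 80.0),
    ("B", "2004-01", "drug_x", 200.0),
    ("B", "2004-02", "drug_y", 50.0),
]

def build_summary(rows):
    """Pre-aggregate total cost per (hospital, month), as planned in advance."""
    summary = defaultdict(float)
    for hospital, month, _drug, cost in rows:
        summary[(hospital, month)] += cost
    return dict(summary)

def cost_for_drug(rows, drug):
    """An unplanned question: the summary has no drug column, so scan the detail."""
    return sum(cost for _hospital, _month, d, cost in rows if d == drug)

summary = build_summary(detail_rows)
planned_answer = summary[("A", "2004-01")]               # one lookup in the summary
unplanned_answer = cost_for_drug(detail_rows, "drug_x")  # full pass over the detail
```

At four rows the scan is trivial; at 3 billion rows it is the kind of query that ties up a warehouse.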

But such queries are increasingly seen as business-critical, says William Fellows, an analyst at The 451 Group in New York. “The problem of querying data sets that are growing at over 100% a year has led to what might be called a data warehouse capability gap,” he says.

While market leaders like Teradata, a division of NCR Corp. in Dayton, Ohio, offer integrated systems to address this for high-end applications, Netezza and others are jumping in with moderately priced systems that don’t require the same high-end hardware and software investments as those from IBM, Oracle Corp. and Teradata.

It’s an interesting trend but still a small part of the $16 billion market for data warehouse hardware and software, says Dan Vesset, an analyst at IDC.

Small Players, Big Databases

Some start-ups offer only software, while others include software and hardware in a single bundle or appliance. But all use a parallelization scheme that involves symmetric multiprocessing or a massively parallel processing architecture. Designs vary, but all are based on the partitioning of data across servers—something Teradata has been doing for years, says Fellows. “There’s nothing new under the sun in terms of approach here except packaging and price,” he adds. While Netezza and competitors like to position themselves against Teradata, the company still dominates on the high end, he adds.
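The partitioning idea Fellows describes can be sketched in a few lines. This is a simplified, single-process simulation under stated assumptions, not any vendor's design: the function names are hypothetical, the "nodes" are plain lists, and the per-node scans run one after another rather than truly in parallel.

```python
def partition_rows(rows, key_index, num_nodes):
    """Spread rows across nodes by hashing the partition key."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[hash(row[key_index]) % num_nodes].append(row)
    return nodes

def scan_node(node_rows, predicate, value_index):
    """Work one node does on its own: scan its slice and return a partial sum."""
    return sum(row[value_index] for row in node_rows if predicate(row))

def distributed_sum(nodes, predicate, value_index):
    """Combine the per-node partial sums into the final answer."""
    return sum(scan_node(node_rows, predicate, value_index) for node_rows in nodes)

# Hypothetical rows: (record_id, amount).
rows = [(i, i % 7) for i in range(1000)]
nodes = partition_rows(rows, key_index=0, num_nodes=4)
total = distributed_sum(nodes, lambda row: row[1] > 3, value_index=1)
```

Because each node touches only its own slice, adding nodes shrinks the amount of data any one processor has to read, which is the property these systems rely on.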

Netezza’s NPS appliance abandons database indexes in favor of direct table scans, using brute-force processing to get the job done. The system includes its own database, with specialized field-programmable gate array (FPGA) logic that links processors and storage to speed up I/O. A system comparable to Premier’s, with 4.5TB of disk space, sells for “a little more than a million dollars,” says Netezza CEO Jit Saxena.
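The trade-off between indexes and table scans can be shown with a toy example. The sketch below is plain software on a made-up four-row table and says nothing about how Netezza's hardware works; the table contents and function names are assumptions for illustration only.

```python
# Hypothetical table of (patient_id, diagnosis, cost) rows.
table = [
    (1, "flu", 100),
    (2, "asthma", 250),
    (3, "flu", 90),
    (4, "fracture", 900),
]

def build_index(rows, column):
    """Map each value of one column to row positions; this takes extra storage."""
    index = {}
    for position, row in enumerate(rows):
        index.setdefault(row[column], []).append(position)
    return index

def lookup_with_index(rows, index, value):
    """Fast for the indexed column, but no help for any other predicate."""
    return [rows[position] for position in index.get(value, [])]

def full_scan(rows, predicate):
    """Brute force: read every row and keep those matching an arbitrary predicate."""
    return [row for row in rows if predicate(row)]

diagnosis_index = build_index(table, column=1)
flu_rows = lookup_with_index(table, diagnosis_index, "flu")
expensive_rows = full_scan(table, lambda row: row[2] > 200)
```

An index pays off only for questions someone anticipated. A scan answers any question at the cost of reading everything, which is why the approach depends on fast, parallel I/O.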

By dumping the indexes, Premier’s database dropped from 3TB to 1TB. The system is sufficiently fast that Stewart now uses the appliance to both process queries and build the data-aggregation tables that he loads into the Red Brick data warehouse.

Start-up Calpont Corp. in Rockwall, Texas, is developing a similar appliance that hard-codes the database on an FPGA chip. Because it will store the data on a solid-state disk, or synchronous dynamic RAM, however, it will be targeted at smaller data sets. A 128GB box capable of supporting 40GB to 50GB of data will have a price tag in the “couple hundred-thousand dollar range,” says CEO Jim Janicki. “We wanted a brute-force engine to handle everything we could throw at it,” he says of the device, which is scheduled to ship by midyear.

Datallegro Inc. in Aliso Viejo, Calif., is rolling out a turnkey system that functions much like the Netezza appliance, but it’s built using off-the-shelf components. “We’re taking standard, commodity servers with an open-source database,” says CEO Stuart Frost. Datallegro’s 3TB P3000 includes 21 dual-Xeon-processor servers, each connected to 12 Western Digital Corp. Raptor drives, and will sell for $450,000 when released this month. Frost is targeting Oracle customers with databases in the 1TB to 5TB range and up to 300 concurrent users.

Metapa Inc. takes a similar approach but lets users buy their own components based on its specification, rather than bundling everything together. Users “can assemble systems that are just as fast as the high-end data warehouses at a fraction of the cost. We don’t believe you need a specialized ASIC chip to get there,” says Scott Yara, founder and president of the San Mateo, Calif., start-up. The total price, including Metapa’s Cluster DataBase—due to ship in the second quarter—and required hardware, will be half the cost of a Netezza appliance, he claims.

Clareos Inc.’s CrossCut software, now available, adds yet another twist. Instead of using database tables, it combines a BI reporting tool with a spreadsheetlike data model that creates a single, flat file of rows and columns.

“The next generation of BI tools will have a flat file structure that will be very fast,” predicts Steve Foley, CEO of Herndon, Va.-based Clareos. CrossCut software and recommended hardware to process 146GB of data cost about $65,000. But the product differs from products like Netezza’s in one key respect: CrossCut is a read-only database that doesn’t provide update capability, Foley says. Competitors that use vector-based processing to support a real-time decision-making application include Alterion Inc. and Aleri Inc., says Fellows at The 451 Group.

By contrast, Teradata’s integrated systems connect clusters of high-performance servers using a proprietary high-speed interconnect called Bynet and store data in a Fibre Channel storage-area network. The vendor focuses on allowing large numbers of concurrent queries in a mixed-workload environment and supports “active data warehousing,” where databases are continuously updated, says Stephen Brobst, chief technology officer. He sees the start-ups’ products as best suited for single-function, low-end data marts and cautions that “data marts end up replicating data.”

But that’s a trade-off users may be willing to make when cost is a factor. “With an IBM or Teradata solution, your scalability is in large chunks,” says the vice president of infrastructure at a large financial services company that’s beta-testing a Datallegro system. The incremental cost for adding capacity to an appliance can be a small fraction of what it costs to upgrade his Sun Microsystems Inc. system. He is cautious about buying from a small vendor, but adds, “If they can deliver the same or better performance at 20% of the cost of an IBM or Teradata solution, then you have to do it.”

Most of these systems take a black-box approach to optimization, which means DBAs can’t do any tuning. That paradigm shift may be the toughest sell, says Intelligent Solutions’ Imhoff, and it’s definitely a weakness for Michael Benillouche, director of technology at ACNielsen Corp., who prefers to optimize his Oracle data marts.

But Premier’s Stewart sees that as an advantage. “My DBA staff has more time for development instead of hand-holding a database. We don’t need to build in cycles to make queries go faster,” he says.

In traditional systems, ad hoc queries that bog down the data warehouse are restricted, says Imhoff. Now IT can spin off a subset of data to more groups for business analytics without supplying DBA resources. “If I can bring in a technology that doesn’t require an army of DBAs, great Scott, what a boost,” she says.
