Some companies keep databases that contain all the personal information they can get about certain people. Other companies keep databases with all the data there is about all people – all the data available about the DNA that makes those people who they are, anyway.
DoubleTwist Inc. in Oakland, Calif., and Celera Genomics Corp. in Rockville, Md., are among several companies building biotechnology businesses based on the recently mapped human genome. Both companies collect and disseminate information about the genome to pharmaceutical and biotech companies, researchers and others trying to use maps of the human genome to design new drugs and other medical treatments.
To do that, however, both companies have to maintain enormous databases and update huge amounts of data as researchers add more detail to the still-sketchy guides to human DNA. That requires massive resources to store and annotate the data with research information to make it as useful as possible.
“Really, they’re like the companies that do oil-reservoir mapping using 2-D and 3-D soundings of reservoirs, then applying algorithms to analyse those,” says Nick Allen, a storage analyst at Stamford, Conn.-based Gartner Inc. “They’re both data-intensive operations, but [the oil companies] are churning that data every decade, not every day the way the biotechs are.”
About twice per month, for example, DoubleTwist takes in approximately 10GB of fresh data on the human genome from the National Institutes of Health in Bethesda, Md., and various research labs worldwide. DoubleTwist analyses that data and then adds notes describing the likely functions of some gene sequences and the relationships between specific genes and the proteins or enzymes they control.
The resulting data set and the research-and-development databases spun off from it take up so much space that “we’ve kind of given up on stringent controls on storage,” says Edward Kiruluta, chief technology officer at DoubleTwist.
“Storage is one thing we actually try to budget for, though we always go over. But I find a lot of our most problematic storage needs are temporary,” he says. “Say five people want to do some mining of a data set; the result can be five times the original data set. But we might only have to hold on to those results for three months, then we can delete it.”
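Kiruluta's arithmetic can be sketched in a few lines. The example below is only an illustration: the 100GB base data set, the `MiningJob` type and the `footprint_gb` helper are assumptions made up for this sketch, not DoubleTwist figures or tools. It assumes each of five mining jobs produces results as large as the source data and that those results are deleted after three months.

```python
from dataclasses import dataclass


@dataclass
class MiningJob:
    """Hypothetical record of one analyst's data-mining run."""
    start_month: int           # month in which the results are written
    result_gb: float           # size of the results
    retention_months: int = 3  # results are deleted after this many months


def footprint_gb(base_gb, jobs, month):
    """Storage in use in a given month: base data plus results not yet deleted."""
    live = sum(
        job.result_gb
        for job in jobs
        if job.start_month <= month < job.start_month + job.retention_months
    )
    return base_gb + live


base = 100.0  # assumed size of the original data set, in GB
jobs = [MiningJob(start_month=0, result_gb=base) for _ in range(5)]

for month in range(5):
    print(month, footprint_gb(base, jobs, month))
```

Under those assumptions the footprint is six times the base data set for three months, then falls back to the base once the temporary results are deleted.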
That kind of data economy helps, but it doesn’t do much to rein in the company’s need for storage, Kiruluta says. It’s not that the kind of high-end disk storage the company relies on for most of its data is cheap; the StorEdge arrays from Sun Microsystems Inc. that DoubleTwist uses are high-end enough to cost nearly as much as the machines they support, Kiruluta says. “If the machine costs a million, pretty much the storage that goes along with that will cost close to that as well,” he says.
Gartner’s figures indicate that more than 50 per cent of the cost of high-end machines and sometimes as much as 80 per cent goes to storage.
That kind of spending is worthwhile because those huge databases have to be available quickly; otherwise, the company can't produce enough advanced analytics and research to make its information valuable to customers, according to John Reynders, vice-president of information systems at Celera Genomics.
Celera was the leading commercial entity in the race to map the human genome. To support its own research, Celera maintains about a teraflop of computing power and 100TB of spinning disk space, as well as several “islands” on a storage-area network. All told, its storage totals two to three times the amount of data that even large companies like RadioShack Corp. or Lockheed Martin Corp. have to support.
DoubleTwist has only about 10TB of capacity but structures it in an innovative way. Rather than maintaining the central data in a relational database, the company keeps its main data store in an XML-based database that lets it more easily create links between data sets, redefine those links and translate the data into other formats, Kiruluta says.
One format is for a set of Oracle Corp. databases, and another is for the company’s proprietary data-analysis tools, which are installed at some customer sites. A third is a flat text file that’s used by the company’s online customers for queries that are too varied to easily define.
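A rough sketch shows how one XML record with explicit links might be translated into two of those target shapes: a flat text line and rows suitable for a relational table. The element names, attributes and identifiers (`gene`, `link`, `G1`, `P1`) are invented for illustration; the article does not describe DoubleTwist's actual schema or tools.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: a gene, a free-text annotation and links to other entries.
RECORD = """
<gene id="G1" name="example-gene">
  <annotation>likely function: placeholder text</annotation>
  <link type="controls" target="P1"/>
  <link type="related-to" target="G2"/>
</gene>
"""


def to_flat_text(gene):
    """Render one record as a single searchable line of text."""
    parts = [
        gene.get("id"),
        gene.get("name"),
        gene.findtext("annotation", default=""),
    ]
    parts += [f'{link.get("type")}:{link.get("target")}' for link in gene.findall("link")]
    return " | ".join(parts)


def to_link_rows(gene):
    """Render the record's links as (source, type, target) rows for a relational table."""
    return [
        (gene.get("id"), link.get("type"), link.get("target"))
        for link in gene.findall("link")
    ]


gene = ET.fromstring(RECORD)
print(to_flat_text(gene))
print(to_link_rows(gene))
```

Because the links live in the record itself, redefining a relationship means changing one element, and each export can be regenerated from the same source.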
“Relational databases are useful in the production system because you can put in schema that let you do high volumes of direct, specific queries,” Kiruluta says. “But a database like ours is going to have thousands of tables in it. To do a reasonable query on that data is going to require a lot of knowledge about the schema and that data.
“Biology does not have the luxury of a restrained vocabulary,” he adds. “Talk to 10 scientists, and they’ll have 10 names for the same thing. A text-search interface is actually a very good fit for that sort of thing.”
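The point about vocabulary can be illustrated with a small text search over flat records. This is a sketch under assumptions: the record layout and the `search` helper are hypothetical, and TP53 (also widely known as p53) is used only as a familiar example of one gene with several names.

```python
# Hypothetical flat-text records: identifier, primary name, free-text notes.
FLAT_RECORDS = [
    "G1 | TP53 | also called p53; tumor protein p53",
    "G2 | example-gene-2 | no aliases recorded",
]


def search(records, term):
    """Case-insensitive substring search; requires no knowledge of any schema."""
    needle = term.casefold()
    return [record for record in records if needle in record.casefold()]


# Three different names a scientist might use all reach the same record.
for term in ("TP53", "P53", "tumor protein"):
    print(term, "->", search(FLAT_RECORDS, term))
```

A query against thousands of relational tables would need to know which column holds which name; the text search only needs the name to appear somewhere in the record.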
Sometimes, apparently, even the most sophisticated data set surrenders to the simplest kind of questions.
Kevin Fogarty is a freelance writer in Sudbury, Mass.