Database horizons

The modern database era began in 1970, when E.F. Codd published his paper “A Relational Model of Data for Large Shared Data Banks.” His ideas made the logical manipulation of data independent of its physical location, greatly simplifying the work of application developers.

Now we are poised for another leap forward. Databases will scale to gargantuan proportions, span multiple locations and maintain information in heterogeneous formats. And they will be autonomous and self-tuning. The major database vendors are pursuing these goals in different ways.

Thirty years ago, IBM Corp. researcher Pat Selinger invented “cost-based” query optimization, by which searches against relational databases, such as IBM’s DB2, minimize their use of computer resources by finding the most efficient access methods and paths. Now Selinger, vice president of data management architecture and technology, is leading an effort at IBM called Leo – for Learning Optimizer – that she says will push DB2 optimization into a new realm.

Rather than optimizing a query once, when it’s compiled, Leo will watch production queries as they run and fine-tune them as it learns about data relationships and user needs. “It empirically derives interesting things about the data,” Selinger says. For example, Leo would come to realize that a ZIP code can be associated with only one state, or that a Camry is made only by Toyota, even if those rules aren’t specified in advance.
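To make the idea concrete, here is a minimal Python sketch of how a rule such as “a ZIP code belongs to only one state” could be inferred from observed rows. It is not IBM’s algorithm, which the article does not describe; the function name, column names and sample rows are all hypothetical. Note that observed data can only refute such a rule, never prove it, so a real optimizer would have to treat the result as a working assumption.

```python
def holds_functional_dependency(rows, determinant, dependent):
    """Return True if, in the rows seen so far, each value of `determinant`
    is paired with exactly one value of `dependent`."""
    seen = {}
    for row in rows:
        key = row[determinant]
        value = row[dependent]
        # setdefault records the first pairing; any later mismatch refutes the rule.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical sample of rows observed while queries run.
rows = [
    {"zip": "95120", "state": "CA", "make": "Toyota", "model": "Camry"},
    {"zip": "95120", "state": "CA", "make": "Honda", "model": "Accord"},
    {"zip": "10001", "state": "NY", "make": "Toyota", "model": "Camry"},
    {"zip": "10001", "state": "NY", "make": "Toyota", "model": "Corolla"},
]

print(holds_functional_dependency(rows, "zip", "state"))   # True
print(holds_functional_dependency(rows, "model", "make"))  # True
print(holds_functional_dependency(rows, "make", "model"))  # False
```

An optimizer that trusts the first two rules could, for instance, avoid underestimating how many rows match a filter on both ZIP code and state, since the second condition adds no selectivity beyond the first.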

Selinger says Leo will be most helpful in large and complex databases, and in databases where interdata relationships exist but aren’t explicitly declared by database designers. Leo is likely to be included in commercial releases of DB2 in about three years, she says.

Microsoft Corp. says users will never be persuaded to dump everything – e-mail, documents, audio/video, pictures, spreadsheets and so on – into one gigantic database. Therefore, the software vendor is developing technology that will allow a user to seamlessly reach across multiple, heterogeneous data stores with a single query.

Microsoft’s Unified Data project involves three steps, says Stan Sorensen, director of SQL Server. First, the company will devise XML-based “schemas” that define data types. Then it will develop methods for relating different data types to one another. Finally, it will develop a common query mechanism for distributed databases. For example, Sorensen says, “Suppose I search for a document that references Microsoft, and the document tells the query that there’s also a media file in another place that references Microsoft.”

The technology will appear in SQL Server in 18 months and will be added to other Microsoft products in ensuing years.

Oracle Corp. says its customers are moving toward data stores of huge size and complexity, spread over multiple locations. The company says its products will not only evolve to handle those kinds of jobs, but will also do them extraordinarily well. “Over the next couple of releases, we’ll see essentially fully autonomous databases,” says Robert Shimp, vice president of database marketing.

Oracle also wants to facilitate collaboration for people in different companies with widely varying information types. “What doesn’t exist today is the underlying infrastructure, or plumbing, that’s capable of managing all these diverse types of data,” Shimp says. “What you need is the ability to link all these clustered databases around the globe into a single, unified view for the individual user.”

Elsewhere, researchers are finding that the best design for some database applications isn’t a traditional database at all, but rather data streams. Researchers at Stanford University are working on ways that continuous flows of information – such as Web site hits, stock trades or telecommunications traffic – can be passed through queries and then archived or discarded. A query might, for example, be written to look continuously for suspicious patterns in network traffic and then spit out an alert.

The difficulty in handling some kinds of applications with a traditional database management system is one of timeliness, says Jennifer Widom, a computer science professor at Stanford. “If you want to put a stream of data into a DBMS, you have to at some point stop, create a load file, load the data and then query it,” she says. “Data stream queries are continuous; they just sit there and give you new answers automatically.”
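The contrast Widom draws can be illustrated with a short Python sketch of a continuous query over network traffic. It is an assumption-laden illustration, not the Stanford group’s design: the function names, the ten-second window, the threshold and the sample packets are all hypothetical. The query never loads data into a table. It inspects each packet as it arrives, keeps only a small sliding window of recent timestamps per source, and emits an alert the moment a source exceeds the threshold.

```python
from collections import defaultdict, deque


def watch_for_bursts(packets, window_seconds=10, threshold=3):
    """Continuously scan (timestamp, source) pairs and yield an alert
    whenever one source sends more than `threshold` packets within
    `window_seconds`."""
    recent = defaultdict(deque)  # source -> timestamps still inside the window
    for timestamp, source in packets:
        times = recent[source]
        times.append(timestamp)
        # Discard timestamps that have aged out of the window.
        while times and timestamp - times[0] > window_seconds:
            times.popleft()
        if len(times) > threshold:
            yield (timestamp, source, len(times))


def sample_traffic():
    """Hypothetical stand-in for a live feed of (seconds, source address)."""
    yield from [
        (0, "10.0.0.5"),
        (1, "10.0.0.9"),
        (2, "10.0.0.5"),
        (3, "10.0.0.5"),
        (4, "10.0.0.5"),
        (30, "10.0.0.5"),
    ]


for alert in watch_for_bursts(sample_traffic()):
    print(alert)  # (4, '10.0.0.5', 4)
```

Because the query is a generator, it produces each answer as soon as the triggering packet arrives and would keep running for as long as the feed does, which is the behavior Widom describes as queries that “just sit there and give you new answers automatically.”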

Widom and her colleagues are developing algorithms for stream queries, and she says her group will develop a comprehensive data stream management system. A prototype of such a system will take a number of years to develop, and the underlying technology will then be either licensed or offered as freeware, she says.