Bridging data islands

After just two months, a new software tool enabled Aventis Pharmaceuticals Inc. to discover a promising candidate for a new drug to treat asthma, arthritis or even perhaps cancer; it’s a chemical compound that might well have been overlooked using traditional IT tools.

Aventis is using DiscoveryLink, a feature of IBM Corp.’s DB2 database management system that can propel a single SQL query out to multiple, heterogeneous data sources and bring information back to the user in one coherent view.

“Using this integrated framework, scientists were able to pull data from many different sources around the world, visualize it in a new way that they could never do before,” says Peter Loupos, vice president for drug innovation and approval information systems at the Bridgewater, N.J.-based company.

IBM calls the Aventis approach to information integration “database federation.” To get at federated data, DB2 uses IBM “wrapper” software called DataJoiner and Relational Connect. There’s one wrapper for each type of data source – whether it’s from Oracle Corp. or Sybase Inc. systems, Microsoft Corp. SQL Server or flat files – with each one mapping the source data model to the DB2 data model. A single Aventis query may be sent against heterogeneous relational databases, unstructured documents and in-house expertise culled from e-mail and other sources.

Sending a SQL query against remote, heterogeneous databases is just one of several ways to integrate data. Others include the following:

— Custom, hard-wired interfaces that pass information from one application to another. These can be made to work exactly as users demand, but they can be costly to set up and maintain.

— Replication, in which a commercial product regularly or continuously copies databases or parts of databases from one place to another. Replication is simple but limited in its ability to do anything to data beyond copying it.

— Extract, Transform and Load (ETL), a process often used to create data warehouses and data marts. ETL software moves data from one place to another, applying rules or table lookups to combine or transform data in some way. ETL is powerful but can be very complex.

— Web services. Enabled by Internet protocols including the XML standard for exchanging data between disparate systems, Web services allow SQL-based relational data to be accessed as XML, or native XML to be accessed through SQL. Web services are ideal when applications are loosely coupled and difficult to integrate in other ways.

Regardless of the approach taken, data integration can be difficult, expensive and error-prone. In particular, great care must be taken to build interfaces between applications and databases that ensure accuracy and timeliness of information and that answer the needs of disparate communities of end users. Below, we look at how two organizations tackled their data integration problems, and you can read two more case studies online.

Case 1: Standardization Is Key at Affina

While it supports all of the various methods of integrating data, Oracle favors loading data into centralized databases and data warehouses. And that’s the approach taken by Oracle customer Affina, a contact-center service provider in Peoria, Ill.

Affina, which has about 3TB of data in Oracle databases, runs call centers for about 100 clients. That requires taking data from telephone calls, e-mail, Web chat and various back-end processing systems and feeding it into an “operational data store,” a data warehouse and several data marts, says Tom Asp, vice president for IT. Affina also pushes information — such as call analysis — back out to customers. For efficiency and security, data coming in and going out is staged in customer-specific tables that can be accessed by customers.

Affina uses Oracle Warehouse Builder, Oracle’s SQL Loader utility, custom PL/SQL procedures and .Net assemblies in ETL processes to pull in and reformat customer data from Lotus Notes e-mail databases, call records from telephone switches, flat files from interactive voice response units and client data from a system that emulates mainframe CICS. “Any way we could get data, we get it,” Asp says.

Affina also uses Oracle8i replication utilities to move data in real time from a big call center in El Paso, Texas, to its data warehouse in Peoria. The Peoria copy is used for reporting, but the production database remains in El Paso for performance reasons.

Automation and standardization are the keys to making such a complex operation run smoothly, Asp says. File transfers, schedulers and loaders are automated, and strict standards are adhered to so new processes can be added easily and without fear of bugs. He says Affina is increasingly moving to the exchange of information in XML format because “it provides a common, flexible file format so that ETL procedures can be created more efficiently.”

Case 2: Database Dilemma at Terracon

Terracon Inc., an engineering consulting firm in Lenexa, Kan., uses an Oracle enterprise resource planning (ERP) system and an Oracle data warehouse. A two- to three-hour weekly process uses scripts that Terracon wrote in PL/SQL to move data from the operational ERP data to the warehouse, where it is used for reporting. Not all information — customer and employee names, for example — is replicated to the warehouse. Terracon was unable to use Oracle’s replication product, Streams, because the two repositories used different versions of Oracle, according to CIO Frank Milano.

Milano says he’d like to convert the data warehouse to Microsoft’s SQL Server database. He says that with his existing Microsoft enterprise agreement he could do that for an additional $2,000 to $3,000 and save $30,000 to $50,000 in Oracle license fees. The ERP system would remain Oracle, however, because it’s the best fit for Terracon’s project accounting and financial management needs, he says.

And Milano says total cost of ownership with SQL Server would be lower than with Oracle. “Trying to maintain an Oracle database is brutal,” he says. “It’s complex, not for the faint of heart.” But, he concedes, Oracle is better suited for large, complex, dynamic database applications.

But Milano says he worries that interfacing the Oracle and SQL Server databases may be tricky. In addition to the weekly transfer of production data from the ERP system to the warehouse, users must be able to initiate queries against the warehouse and have them go to the ERP database to pick up customer and employee names, and queries against the ERP system must be able to go to the warehouse for information that has already been calculated and summarized. “That’s a great concern for us,” he says. “I don’t have a solution for it.”

But Microsoft claims it does. According to Tom Rizzo, group product manager for SQL Server, Terracon could move data from the Oracle ERP system to the SQL Server data warehouse – doing any needed editing or massaging en route – using the Data Transformation Services tools that ship with SQL Server. And to access both databases in a single query, Terracon could use the Distributed Query engine in SQL Server, Rizzo says.