Data consolidation has been a hot topic of late, and for a good reason: integrating data has the potential to energize a business.
According to some industry experts, merging information from disparate databases can lift some weight off the shoulders of a database administrator, since fewer databases mean less data centre maintenance. And when analysed, consolidated data can shed much light on an enterprise’s progress and pinpoint which areas of the business need work.
But according to a report entitled Managed Semantics: Key to a Coherent Data Environment, by Framingham, Mass.-based IDC analyst Carl Olofson, many enterprises have not yet tapped into the power of database fusion.
Olofson wrote that today’s typical IT configuration includes an assortment of files and databases, grouped according to the particular applications that use them, with little relation to one another.
“Data is skimmed, scooped, or ‘mined’ from various…files and databases and coalesced on a periodic basis into data warehouses and datamarts for business analysis purposes, but otherwise it is effectively uncoordinated,” he wrote.
He added that the data integration movement was born after organizations started realizing that related data in different applications “can, if shared, improve the efficiency and interoperational potential of those applications, resulting in enhanced reliability, cost-effectiveness, responsiveness, and flexibility.”
Scott Lundstrom, chief technology officer at AMR Research in Boston, said the drivers behind data consolidation are typically cost containment, control of data and the desire for a real-time, global view of a company’s operations.
At the top of the list is cost, which has been increasing particularly in the database space due to the maintenance multiple databases require.
There was a time, circa Y2K, when applications were demanding and unique enough that global companies ended up with multiple instances of the same information. “It would be not at all uncommon to find five to 10 instances of SAP running in a company,” each with its own database, he said.
Kevin Strange, vice-president and research director at analyst firm Gartner Inc. in Stamford, Conn., said the multiple database approach “creates a significant problem from a management and administrative perspective,” which ties into the cost issue. He estimated that for every dollar spent on storage, three dollars go toward managing it.
“Software prices have remained flat, and hardware prices have actually declined considerably, but the expense with databases is really in human management – the staffing and support and maintenance of the data centre facility itself,” added Lundstrom. “There is a business cost to having fragmented visibility of data, and to fix that, a company might have to go through a lengthy and expensive periodic consolidation process.”
Other consolidation drivers include mergers and acquisitions, as well as regulatory requirements such as the U.S. Sarbanes-Oxley accounting regulations and privacy laws, Lundstrom said.
One way to address the information integration issue between databases is to use extraction, transformation and loading (ETL) tools that pull out the data from one database and put it into the data warehouse. Vendors such as Toronto-based Hummingbird Ltd., Ascential Software Corp. of Westboro, Mass., and Informatica Corp., based in Redwood City, Calif., offer tools that can help accomplish this task.
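The ETL pattern these tools implement can be sketched in a few lines. The example below is a minimal, hypothetical illustration in Python – two in-memory SQLite databases stand in for an operational source and a warehouse, and the table names and aggregation logic are invented for illustration, not taken from any vendor’s product.

```python
import sqlite3

# Two in-memory SQLite databases stand in for a source database and a warehouse.
source = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")

source.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, "east", 100.0), (2, "west", 250.0), (3, "east", 75.0)])

warehouse.execute("CREATE TABLE sales_by_region (region TEXT, total REAL)")

# Extract: pull raw rows out of the operational database.
rows = source.execute("SELECT region, amount FROM orders").fetchall()

# Transform: aggregate the rows into the shape the warehouse expects.
totals = {}
for region, amount in rows:
    totals[region] = totals.get(region, 0.0) + amount

# Load: write the summarized data into the warehouse.
warehouse.executemany("INSERT INTO sales_by_region VALUES (?, ?)",
                      sorted(totals.items()))

print(warehouse.execute("SELECT * FROM sales_by_region").fetchall())
# [('east', 175.0), ('west', 250.0)]
```

Real ETL suites add scheduling, incremental loads and error handling on top of this basic extract-transform-load loop, but the data flow is the same.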
Since 2000, Trimac Corp., a Calgary-based industrial transportation company, has been using Hummingbird’s ETL suite, formerly known as Genio, for its data moving requirements. The firm’s vice-president of information technology, Janet Topic, said Trimac runs on Sybase 12.5 database technology, but uses an Oracle data warehouse.
Trimac maintains an online processing database containing six months of transactional information, Topic said. The data is automatically replicated into a decision support system (DSS) using basic Sybase utilities. Information in the DSS is kept for three years, and is “used to run reports and launch queries.”
At the beginning of every year, another Sybase utility purges information in the DSS that is more than three years old and inserts it into an archive database. Genio then takes information from the DSS and feeds it to Trimac’s Oracle data warehouse, where the data is aggregated and summarized, Topic said.
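The rolling three-year retention window Topic describes – a yearly purge from the DSS into an archive – can be sketched roughly as follows. This is a hypothetical illustration using SQLite; the actual Sybase utilities, schemas and column names are not described in detail here, so everything below is invented for illustration.

```python
import sqlite3

# SQLite stands in for the DSS and archive databases; the tables are invented.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE dss (id INTEGER, tx_date TEXT, amount REAL)")
db.execute("CREATE TABLE archive (id INTEGER, tx_date TEXT, amount REAL)")
db.executemany("INSERT INTO dss VALUES (?, ?, ?)", [
    (1, "2000-05-01", 10.0),   # older than three years: should be archived
    (2, "2003-02-15", 20.0),   # inside the retention window: should stay
])

run_year = 2004                     # the job runs at the start of each year
cutoff = f"{run_year - 3}-01-01"    # keep three years of DSS history

# Copy expired rows into the archive, then remove them from the DSS.
db.execute("INSERT INTO archive SELECT * FROM dss WHERE tx_date < ?", (cutoff,))
db.execute("DELETE FROM dss WHERE tx_date < ?", (cutoff,))

print(db.execute("SELECT id FROM dss").fetchall())      # [(2,)]
print(db.execute("SELECT id FROM archive").fetchall())  # [(1,)]
```

The copy-then-delete ordering matters: if the job fails between the two statements, rows are duplicated rather than lost, which is the safer failure mode for an archive step.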
The one problem Genio shares with other ETL tools is that it doesn’t support real-time integration – instead, it’s more of a “batch-type moving product,” Topic said.
IBM offers data integration products under the DB2 Information Integrator brand. These tools offer federation capabilities – where companies can “reach into the real-time data feed, pull information back into a consolidated view and operate on it” – as a complement to ETL, said Richard Hedges, program director of information integration products at IBM Canada in Markham, Ont.
Hedges said federation is part of the new wave of integration, helping businesses “extend beyond the view of relational databases, and maybe some legacy data sources, to really being able to tackle all kinds of different forms of data” from sources such as the Web, Excel spreadsheets and e-mail systems.
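Federation, as Hedges describes it, means answering a query from the live sources at the moment it is asked, rather than from a pre-built warehouse copy. The toy sketch below illustrates the idea only – it combines a relational table and a CSV “spreadsheet” into one view at query time. It is not the DB2 Information Integrator API, and all the names in it are hypothetical.

```python
import csv
import io
import sqlite3

# One source is a relational table, the other a CSV file standing in for a
# spreadsheet. Neither is copied into a central store ahead of time.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE crm_contacts (name TEXT, region TEXT)")
db.executemany("INSERT INTO crm_contacts VALUES (?, ?)",
               [("Acme Ltd.", "east"), ("Globex", "west")])

spreadsheet = io.StringIO("name,region\nInitech,east\n")

def federated_contacts():
    """Yield rows from both live sources as one consolidated view."""
    yield from db.execute("SELECT name, region FROM crm_contacts")
    spreadsheet.seek(0)  # re-read the "spreadsheet" on every query
    for row in csv.DictReader(spreadsheet):
        yield (row["name"], row["region"])

print(list(federated_contacts()))
# [('Acme Ltd.', 'east'), ('Globex', 'west'), ('Initech', 'east')]
```

Because each call re-reads the underlying sources, a change in either one shows up in the very next query – the “real-time data feed” property that distinguishes federation from batch ETL.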
Syed Rasheed, manager of sales consulting at Oracle Canada in Mississauga, Ont., said Oracle has built multiple options into its database offerings to assist with data distribution. “You can pick and choose the best way to implement (data distribution) that meets your needs,” he said. Some departments within an organization might require instant data updates, while others might be able to tolerate some sort of lag as a tradeoff for better performance. “With all these things what you have with Oracle is a single platform that provides multiple ways to distribute and share information with other databases in the organization.”
Gary Donaldson, manager of informatics at Xenon, a genetics-based drug discovery firm in Vancouver, said his company’s biggest database consolidation challenge was to bring together information from disparate online databases.
A year-and-a-half ago, Xenon’s business was focused on target discovery – identifying genes that cause diseases. For research purposes, the company had to sift through dozens of online medical databases to find information on diseases and genes.
Xenon used Perl, a standard programming language, and PHP, an open-source server-side scripting language for creating dynamic Web applications, to create a utility that would enable the integration between outside information and its internal databases. “We went with what we could do in the shortest amount of time to make things work,” he explained.
The firm had looked at some off-the-shelf products that offered the same capabilities, but “there were not a lot of companies out there that make this sort of system…and we felt it was risky to commit to just one of them,” Donaldson said. “It was also expensive, so we thought cost-wise we were better off and it would be cheaper to do ourselves. It took us six months to build it, but in six months we made the money back.”
Semantics and SQL
One challenge with external data integration remained: Xenon would never have control over the data source. The information in an external database might be easily accessible, but the schema might go through several revisions over time, Donaldson explained.
“It’s always a challenge trying to link all this data into your own information – someone else is managing it, which may mean the schema changes every three months – and that breaks everything that you already made,” he said.
Indeed, database languages have many of the same flaws as human dialects. According to Tom Traubitz, senior group product marketing manager for Sybase’s infrastructure platform group in Dublin, Calif., SQL may be a great grammar for doing queries on structured information, but it doesn’t address the problem of database semantics.
“In my database, I may call my table of customers ‘customer list’ while in another database, another person may call it ‘service contact list.’ We mean the same thing, but computers don’t know that, and won’t integrate the data, unless a human intervenes and helps the computer figure that out…SQL gives us the grammar but not the dictionary.”
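The “dictionary” Traubitz says SQL lacks can be pictured as an explicit mapping from each database’s local names onto a shared vocabulary. The sketch below is hypothetical – the table names echo his example, but the mapping structure and lookup function are invented for illustration.

```python
# A hand-built semantic map: each database's local table name is tied to one
# canonical term that every integrated query can rely on. The database keys
# and canonical name "customers" are invented for this example.
SEMANTIC_MAP = {
    "db_a": {"customer_list": "customers"},
    "db_b": {"service_contact_list": "customers"},
}

def canonical_table(database, local_name):
    """Translate a database-specific table name into the shared term."""
    return SEMANTIC_MAP[database][local_name]

print(canonical_table("db_a", "customer_list"))          # customers
print(canonical_table("db_b", "service_contact_list"))   # customers
```

The human intervention Traubitz mentions is exactly the work of building and maintaining this map; once it exists, an integration layer can resolve both local names to the same canonical table without further help.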
In his report, IDC’s Olofson wrote that before data can be consolidated, the differences in data representation among various databases and files must somehow be hidden or fixed. “A key vehicle for accomplishing normalization is to establish a common set of meanings from which all data representations are derived,” his report said.
Sybase offers a line of business process integration technologies along with specific adapters to extract information from databases tied to different applications. Traubitz said the company’s information integration product, Information Orchestrator, is designed to tie information together between off-the-shelf systems like SAP or PeopleSoft, which are based on different data dictionaries and semantics. “Part of our integration technologies revolve around ameliorating those differences or hiding them,” Traubitz said.
He added that there are other companies, such as San Francisco-based Autonomy Corporation plc, that offer technologies that can automatically recognize and categorize information in disparate databases by analyzing commonalities in their structure – in much the same way that search engines like Google and e-mail filters work.
Considerations before consolidating
Lundstrom said there are two ways of looking at consolidation: as a purely technical exercise, or as an evaluation of an organization’s business processes, followed by the rationalization of database products a company is using.
“If you’re simply consolidating databases, you can just upgrade your infrastructure by moving data onto a single instance of the database,” he said. Strange added that this approach can be “somewhat effective – you can reduce the amount of administration required.” However, it’s still not the optimal way to do it, “because you still have multiple copies of data to deal with,” he said.
The best approach, Lundstrom said, is for an organization to rationalize its data models in order to create one consolidated model – leading to more integration from the start. This involves an “upfront simplification process, a hard-edged evaluation of what products are being used and what databases are required,” added Lundstrom. “You rationalize them down to the minimum number of applications and databases required to support your business.”
All of this requires some planning, investment and resources to accomplish, Strange said. “But if a company can get rid of some of its database redundancy, the cost to manage that will also be reduced,” he said.