Biological Database Integration - Current Approaches
Author: Deron Eriksson
Description: Integration of Data from Heterogeneous Biological Databases using CORBA and XML - Current Approaches to Biological Data Integration.


8.2. Database Federations

The federation approach to integrating data from heterogeneous databases does not require the creation of a new giant database to hold transformed versions of the data from constituent databases as in the data warehouse approach. Stein et al [STE, 1998] point out: “… Interoperability does not require global integration. It is possible to develop software that can operate on multiple databases without solving the much harder problem of [globally] integrating those databases.” In the federation approach, individual databases can act in a completely autonomous fashion as if they are not part of the federation. The federation is dependent upon its constituent databases, but the individual databases are in no way dependent upon the federation, since they may not even be aware of the existence of the federation. Although the federation may involve many databases, they may appear to the user as a single database, as in the case of data warehouses. However, depending on the application, it may be useful for the federation to appear as a group of closely related databases. For instance, if two constituent databases contain closely related but not identical data, a user may wish to limit a query to data in only one of the databases so that the returned results are easier to interpret.

Karp [KAR, 1996] provides an excellent description of a database federation. “The alternative approach to physical integration is for data sources to remain distributed at multiple geographic sites. The DBs can be queried via a network such as the Internet… Some variations of the integrated approach involve a mediator, which is a software component that makes the physically distributed data sources appear to the user to be a single, logical data source that can be queried in a uniform fashion. The mediation approach also involves translator software, but mediators call on the translators dynamically (i.e. while evaluating user-queries), in contrast to the warehouse approach, in which translation occurs only during construction of the warehouse.” Benton [BEN, 1996] stresses the autonomous nature of the databases making up a federation: “A federated database is a collection of essentially autonomous databases in which each constituent database may be implemented using a different schema and using a different database engine. The federated database provides defined access methods to the distributed databases and allows the user to view them collectively as a single database.”

A translator is a software component that is placed over each of the individual databases, and communication between users, mediators, or other software components in the federation and a particular database is handled by this component. In this thesis, translators are also referred to as database accessors. The translator is responsible for tasks such as performing the data format conversions that are necessary for a particular database’s data to be used in an integrated fashion in a federation query. For instance, a database may contain data stored as objects, but the translator may convert this object data into a relational table to be used by the federation. A translator is useful for resolving certain types of heterogeneity such as value heterogeneity. Value heterogeneity may involve simple conversions of units, such as pounds to kilograms.
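The unit-conversion role of a translator can be sketched in a few lines of Java. This is a minimal illustration only, assuming a hypothetical source database that stores weights in pounds while the federation standard is kilograms; the class and method names are invented for the example.

```java
// Minimal sketch of a translator (database accessor) resolving value
// heterogeneity: the underlying (hypothetical) database stores weights
// in pounds, while the federation standard is kilograms.
import java.util.ArrayList;
import java.util.List;

public class WeightTranslator {
    private static final double KG_PER_POUND = 0.45359237;

    // Convert one raw value from the database's unit into the federation unit.
    public static double toKilograms(double pounds) {
        return pounds * KG_PER_POUND;
    }

    // Translate a whole column of raw results into federation-standard values.
    public static List<Double> translate(List<Double> rawPounds) {
        List<Double> kilograms = new ArrayList<>();
        for (double lb : rawPounds) {
            kilograms.add(toKilograms(lb));
        }
        return kilograms;
    }

    public static void main(String[] args) {
        System.out.println(translate(List.of(10.0, 2.2)));
    }
}
```

A real translator would of course perform richer conversions (for example, flattening object data into relational rows), but the principle is the same: raw database output goes in, federation-standard data comes out.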

A translator does not have to serve only one database, but in general this is good policy. If significant changes are made to a database, modifications to its translator may be necessary. This modification process is easiest to handle if the translator is specific to a single database, since this limits the translator’s complexity and avoids the creation of monolithic software. This approach is used by ODBC (Open DataBase Connectivity) and JDBC (Java Database Connectivity) drivers. ODBC and JDBC drivers provide a standard API (application programming interface) to relational databases so that, if a database has a driver, programs can interact with that database using standard SQL embedded in software. The API does not change, so it is always possible to talk to different databases using a standard method of communication, regardless of the nature of the particular database. Even migrating a database to a different vendor’s DBMS will not break code, provided an ODBC or JDBC driver is available for the new database.
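The JDBC case can be made concrete with a short sketch. The table and column names below are hypothetical, and running the code requires an actual JDBC driver on the classpath; the point is that nothing in the data-access code is specific to a particular DBMS vendor — only the connection URL changes.

```java
// Sketch: the same data-access code works against any database for which
// a JDBC driver exists. Only the connection URL (and the driver jar on
// the classpath) is database-specific.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class FederationAccess {

    // Standard SQL issued through the standard JDBC API; the "sequences"
    // table and "id" column are hypothetical.
    public static List<String> fetchSequenceIds(String jdbcUrl)
            throws SQLException {
        List<String> ids = new ArrayList<>();
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id FROM sequences")) {
            while (rs.next()) {
                ids.add(rs.getString("id"));
            }
        }
        return ids;
    }

    public static void main(String[] args) throws SQLException {
        // Migrating the database to another vendor's DBMS only means
        // supplying a different URL here; fetchSequenceIds is unchanged.
        String url = args.length > 0 ? args[0]
                                     : "jdbc:postgresql://localhost/biodb";
        System.out.println(fetchSequenceIds(url));
    }
}
```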

A mediator is a software component that is involved in query processing. In this thesis, a mediator is also referred to as a query processor. A mediator makes a determination as to which databases in the federation need to be involved via translators in a particular query. After translators return their results to the mediator, the mediator must perform integration of the data from the different databases. For example, it may perform a join of two relational tables returned from two different translators and then select tuples that fit certain selection criteria.
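The mediator's integration step described above can be sketched as follows. This is an illustrative equijoin over in-memory rows (represented as maps), assuming hypothetical column names; a real mediator would operate on whatever tabular structures its translators return.

```java
// Sketch of a mediator's integration step: join two "relational tables"
// returned by two translators on a shared column, then select tuples
// matching a criterion. Column names ("gene", "organism", ...) are
// hypothetical.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Mediator {

    // Equijoin: combine rows from the two translators that share a key value.
    public static List<Map<String, String>> join(
            List<Map<String, String>> left,
            List<Map<String, String>> right,
            String key) {
        Map<String, Map<String, String>> index = new HashMap<>();
        for (Map<String, String> row : right) {
            index.put(row.get(key), row);
        }
        List<Map<String, String>> joined = new ArrayList<>();
        for (Map<String, String> row : left) {
            Map<String, String> match = index.get(row.get(key));
            if (match != null) {
                Map<String, String> combined = new HashMap<>(row);
                combined.putAll(match);
                joined.add(combined);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Results as returned (already format-converted) by two translators.
        List<Map<String, String>> db1 = List.of(
                Map.of("gene", "BRCA1", "chromosome", "17"));
        List<Map<String, String>> db2 = List.of(
                Map.of("gene", "BRCA1", "organism", "human"),
                Map.of("gene", "unc-22", "organism", "C. elegans"));
        // Join, then select tuples that fit a selection criterion.
        for (Map<String, String> row : join(db1, db2, "gene")) {
            if ("human".equals(row.get("organism"))) {
                System.out.println(row);
            }
        }
    }
}
```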

Various data integration issues need to be resolved by mediators. Benton [BEN, 1996] describes this need: “Both syntactic and semantic knowledge must be incorporated into, and used by, a mediator if it is to merge information from multiple sources. The semantic fluidity of the biological concepts modeled by the databases implies the need for a mechanism to constantly update the mediators' semantic knowledge bases.” Mediator and translator updating can be accomplished in different ways. For example, certain knowledge or federation standards may be hard-coded into the software components. It is also possible to have a separate federation database dedicated to holding syntactic and semantic knowledge of the various databases involved in the federation, and changes to data in this data-dictionary-like database could be used to periodically update translators and mediators. Trigger-like mechanisms could be used so that changes to this database could automatically update the affected translators and mediators. The ability to save historic data in this database could be essential so that if problematic data changes cause some of the mediators and translators in the federation to malfunction, it could be possible to restore earlier working configurations.
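One way the trigger-like update mechanism above could be realized is an observer-style notification scheme, sketched below. All class and method names are hypothetical; the point is that the data-dictionary-like database pushes changes to registered translators and mediators while retaining historic versions for rollback.

```java
// Sketch of a trigger-like update mechanism: a federation data dictionary
// notifies registered translators and mediators when syntactic or semantic
// knowledge changes, and keeps historic versions so an earlier working
// configuration can be restored. All names are hypothetical.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FederationDictionary {

    // A translator or mediator that depends on federation knowledge.
    public interface FederationComponent {
        void knowledgeChanged(String key, String value);
    }

    private final Map<String, List<String>> history = new HashMap<>();
    private final List<FederationComponent> listeners = new ArrayList<>();

    public void register(FederationComponent component) {
        listeners.add(component);
    }

    // "Trigger": storing new knowledge pushes it to affected components
    // and appends it to the historic record.
    public void update(String key, String value) {
        history.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        for (FederationComponent c : listeners) {
            c.knowledgeChanged(key, value);
        }
    }

    // Historic data allows restoring an earlier working configuration.
    public String restore(String key, int version) {
        return history.get(key).get(version);
    }

    public static void main(String[] args) {
        FederationDictionary dict = new FederationDictionary();
        dict.register((key, value) ->
                System.out.println("component updated: " + key + " -> " + value));
        dict.update("weightUnit", "kg");
    }
}
```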

In federations, data integration follows a dynamic, software-oriented approach, whereas in data warehouses, data integration follows a static, data-oriented approach. Mediators perform query processing and data integration in database federations at run-time. Benton [BEN, 1996] describes this shift in integration: “An alternative to data warehouses or monolithic databases depends on building most of the integrating intelligence into database query tools. Ideally, such tools would be capable of using knowledge of the schemata of remote databases to construct (either automatically or semi-automatically) properly formed queries to each of the relevant remote databases, and capable of integrating the retrieved data into a coherent report for the user.”

There are several advantages to the federation approach. To begin with, federations do not require a global schema as is the case in typical data warehouses. As Li and Clifton [LI, 1999] state, “The federated database approach resolves some of the problems associated with a global schema... Federated databases only require partial integration. A federated database integrates a collection of local database systems by supporting interoperability between pairs or collections of the local databases rather than through a complete global schema.” The upkeep of an enormous data warehouse with its involved, global transformation issues is avoided in the federation approach. This is an important issue as biological databases continue to grow in number and in content. Benton [BEN, 1996] writes: “A principal advantage of an integration scheme based on mediators is that it allows the individual databases to operate autonomously, but to function collectively as a federation.” However, data warehouses don’t actually impede the functions of their constituent databases, so in actuality data warehouses also allow constituent databases to function autonomously. In fact, the scope of statements such as Benton’s should be narrowed, stating that federation queries and data retrieval are performed against autonomous databases. This is truly an advantage of federations, since they can perform queries against the original databases. Federations see data updates at run-time, so there is no delay from the time data enters a database to the time it is available to users, as is the case in data warehouses.

Federations have certain disadvantages relative to data warehouses. To begin with, since data format conversions must be done at run-time, federation query performance is slower. Additionally, queries must be performed against multiple databases. This can be a time-consuming operation, since a query may require a large cross-database join, and this join would probably be performed at a mediator rather than in one of the constituent databases. Such a join would be much faster in a single-site database like a data warehouse. Issues such as this have been addressed in work on query optimization in distributed databases. For example, if a table in one database contained a million rows, a table in another database contained ten rows, and a join of the two were needed, it would clearly make more sense to ship the ten rows to the first site and perform the join there. In a federation, a mediator would probably perform the join itself, which would be time-consuming. A mediator could, however, work in conjunction with an existing DBMS by sending it the data from the other database, so that the join is performed at one of the constituent databases rather than in the mediator.
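The join-site decision in the example above reduces to a simple rule, sketched here. The site names are hypothetical; a real distributed query optimizer would also weigh network cost, selectivity, and available indexes.

```java
// Sketch of a simple join-site decision for a federation mediator: ship
// the smaller relation's rows to the site holding the larger relation,
// so the join runs inside that site's DBMS rather than in the mediator.
public class JoinPlanner {

    // Returns the site at which the join should be performed, assuming
    // the dominant cost is the number of rows transferred.
    public static String chooseJoinSite(String siteA, long rowsA,
                                        String siteB, long rowsB) {
        return rowsA >= rowsB ? siteA : siteB;
    }

    public static void main(String[] args) {
        // A million-row table at db1 joined with a ten-row table at db2:
        // send the ten rows to db1 and join there.
        System.out.println(chooseJoinSite("db1", 1_000_000L, "db2", 10L));
    }
}
```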

Federations, like data warehouses, are not only a research topic. Benton [BEN, 1996] states: “While the development of mediator-based query systems is a subject of continuing computer science research, recent implementations (e.g. the CPL/Kleisli system) have shown that it is possible to provide access both to relational databases [the Genome Database (GDB) and GSDB] and to ‘unconventional’ sources of biological data [e.g. the Caenorhabditis elegans genetic database (AceDB), ASN.1 and BLAST] without building a monolithic database or writing a very application-specific code for each query.”

One popular biological database federation approach is the Kleisli-CPL system, in which a federation of databases is queried using a high-level common query language, CPL (collection programming language). Chung and Wong [CHU, 1999] state that CPL “supports a powerful data model and makes data transformation, manipulation and integration easy.” Translators in Kleisli are referred to as drivers. “Drivers are program scripts that perform the task of connecting to specific data sources or application programs, sending queries in the language of corresponding database-management systems (DBMSs) and transforming the retrieved data into the internal CPL data model. The Kleisli-CPL system is able to model data in diverse formats including plain text files, the popular relational or object-oriented data models, application programs and web-based data sources” [CHU, 1999]. The Kleisli-CPL engine core is responsible for parsing and executing queries. As of 1999, the Kleisli system featured over 50 drivers [CHU, 1999].

The TAMBIS system utilizes a special graphical interface over a Kleisli system. Chung and Wong [CHU, 1999] describe the TAMBIS system and offer some faint criticism: “The TAMBIS (transparent access to multiple bioinformatics information sources) project aims to provide transparent access to multiple databases and analysis tools using a knowledge-driven graphic user interface (GUI) for query formulation. The knowledge-driven GUI is implemented on top of an old version of the Kleisli system for the actual execution of database queries and data exchange. The TAMBIS query system is user friendly and provides a simple way for nonprogrammers to specify queries, but the kind of query that can be expressed is severely restricted by the designs of the TAMBIS query templates.”

The Jade system is a Java-based partial solution to multidatabase data integration. The Jade system focuses on data transformations that are performed by Jade adapters, which are software modules present on the database side of connections. An adapter is simply another term for a translator. An adapter “provides the appropriate methods to transform clients requests into the data sources’ idiosyncratic language” [STE, 1998]. Communication to and from the adapter on the non-database side of the adapter is performed using relational tables. Jade also offers the possibility of using objects in addition to relational tables. Stein et al [STE, 1998] compare Jade adapters to ODBC and JDBC drivers in the following: “Microsoft’s ODBC (for C language programs) and Sun’s JDBC (for Java programs) both provide a uniform API that allows developers to access the contents of relational databases without regard to their underlying semantics. However, these solutions are limited to SQL servers on specific platforms, whereas Jade is a cross-platform tool that accommodates flat files and object-oriented databases as well.” This accommodation is due to the transformational abilities of the Jade adapters. Jade’s cross-platform nature is due to its Java implementation.

Jade performs data transformations that help standardize the results returned from different databases. This is an important preliminary step to data integration. However, Jade does not provide any actual data integration capabilities. Jade does not resolve difficult data integration issues such as semantic heterogeneity. Instead, Jade defers these issues to the applications or other software modules that interact with the Jade adapters.

The distributed capabilities of CORBA make it an excellent tool for the construction of database federations. CORBA interfaces have been created for the EMBL sequence database and the Radiation Hybridization database in Europe. CORBA is an established, familiar technology, so construction of database federations using CORBA does not require experienced software engineers to learn a brand new technology.

One useful feature of CORBA in terms of its federation possibilities is that CORBA component programming can be simplified through inheritance. Lijnzaad et al [LIJ, 1998] state, “By using CORBA’s class inheritance, the aspects of the data that are common to the databases can be abstracted into a superclass, which must be implemented by all databases. The database-specific details can be implemented in subclasses specific to the particular database (typically, the local database).” Stein et al [STE, 1998] point out another benefit to CORBA: “CORBA’s data definition language makes it possible to share complex objects without regard to each database’s idiosyncratic representation of the information. Of course the protocol does not address the deeper semantic problem of interrelating different schemas, but it does provide a discipline for finding and describing a common semantic subset of two schemas.”
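The inheritance pattern Lijnzaad et al describe can be sketched in Java, whose interface inheritance mirrors what the IDL-to-Java mapping produces. All interface, class, and method names below are hypothetical illustrations, not drawn from any actual IDL specification.

```java
// Sketch of the inheritance pattern described above: aspects of the data
// common to all databases go in a superclass interface that every database
// must implement; database-specific details live in subclasses. All names
// here are hypothetical.
public class InheritanceSketch {

    // Common superclass: what every sequence database in the federation
    // must provide (in CORBA this would be an IDL interface).
    interface SequenceDatabase {
        String getSequence(String accession);
    }

    // Database-specific subclass adds details for one particular database.
    interface EmblLikeDatabase extends SequenceDatabase {
        String getFeatureTable(String accession);  // specific to this DB
    }

    // A local implementation (in CORBA terms, a servant) of the
    // database-specific interface.
    static class LocalEmblServant implements EmblLikeDatabase {
        public String getSequence(String accession) {
            return "ACGT";          // placeholder lookup
        }
        public String getFeatureTable(String accession) {
            return "source 1..4";   // placeholder
        }
    }

    public static void main(String[] args) {
        // A client can treat any implementation uniformly through the
        // common superclass, regardless of the underlying database.
        SequenceDatabase db = new LocalEmblServant();
        System.out.println(db.getSequence("X00001"));
    }
}
```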

CORBA addresses certain aspects of heterogeneity, as Spiridou [SPI, 2000] points out. “CORBA, as a middleware platform, solves some of the problems involved in data integration. It handles heterogeneity at the programming language and platform levels, and provides network transparency. This allows CORBA-based data integration approaches to focus on resolving metamodel and schematic heterogeneity.” The CORBA-based system in [SPI, 2000] supports data integration “through object composition and union of objects.”

Probably the most commonly mentioned drawback of using CORBA is its complexity. CORBA is a very involved architecture that offers great potential, but it is not an easy technology to master. Chung and Wong [CHU, 1999] criticize CORBA for requiring “tedious programming” and Barillot and Achard [BAR, 2000] state that “programming CORBA servers and applications requires the skills of highly trained software engineers.”

Raj and Ishii discuss CORBA-based interoperability of biological databases in [RAJ, 1999]. However, their technical discussion is for the most part simply a rephrasing of Karp’s work in [KAR, 1996]. The CORBA-based biological database interoperability application presented in their paper is trivial and not a significant contribution to biological database data integration. In their work, an applet presents a series of buttons representing different biological databases. When a user clicks a button, a CORBA server object sends the web address of the corresponding biological database to the client, which then directs the client’s web browser to open that database’s web page. Clicking buttons that cause web pages to open is a feeble example of the interoperability of biological databases.
