Biological Database Integration - Current Approaches
Author: Deron Eriksson
Description: Integration of Data from Heterogeneous Biological Databases using CORBA and XML - Current Approaches to Biological Data Integration.
8.3. Proposed Data Integration System

The system proposed in this work takes the database federation approach to biological data integration. It combines several technologies to produce a powerful technique for performing queries involving multiple databases. Automated multidatabase querying is not a new concept. However, this system offers a new approach to multidatabase queries that is valuable for a number of reasons discussed below. It has advantages, and at times disadvantages, relative to existing systems that integrate data from biological databases.

The proposed federated data integration system consists of a group of distributed objects that communicate via CORBA. A key feature of CORBA is location transparency, which allows objects to be distributed easily across multiple computers, so distributing the objects of the proposed system is trivial. CORBA’s object location transparency is accomplished by object registration and look-up involving a name server. Biological database federations are by their nature distributed systems, since they involve accessing data sources across a network at run-time.

CORBA’s ease of object distribution can be a very powerful feature in terms of system flexibility, since it allows objects to be placed in different locations if so desired. For example, each CORBA object that directly accesses a specific database could be run on the host running the web server that provides access to that database. This pairing of biological database web server and CORBA database access object could be useful if a large number of databases were incorporated into the system and each biological database organization was also responsible for the functionality of its CORBA database access object.
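The registration and look-up idea behind location transparency can be sketched in a few lines of Java. This is a simplified, in-process stand-in for a name server and not the CORBA Naming Service API; the names `DatabaseAccess`, `NameServer`, `bind`, `resolve`, and `SequenceDb` are hypothetical and chosen only for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical interface for an object that queries one biological database.
interface DatabaseAccess {
    String query(String term);
}

// Simplified stand-in for a name server: it maps names to object references.
// A real deployment would use the CORBA Naming Service instead of this class.
final class NameServer {
    private final Map<String, DatabaseAccess> bindings = new ConcurrentHashMap<>();

    // Registration: an object announces itself under a well-known name.
    void bind(String name, DatabaseAccess object) {
        bindings.put(name, object);
    }

    // Look-up: a caller obtains the object by name only.
    DatabaseAccess resolve(String name) {
        DatabaseAccess object = bindings.get(name);
        if (object == null) {
            throw new IllegalArgumentException("No object bound to name: " + name);
        }
        return object;
    }
}

final class NameServerDemo {
    // The caller depends on the registered name, never on a host or address.
    static String lookupAndQuery(NameServer names, String name, String term) {
        return names.resolve(name).query(term);
    }

    public static void main(String[] args) {
        NameServer names = new NameServer();
        names.bind("SequenceDb", term -> "SequenceDb result for " + term);
        System.out.println(lookupAndQuery(names, "SequenceDb", "BRCA1"));
    }
}
```

Because the caller only ever sees a name, the object bound to that name could be moved to another host without any change to `lookupAndQuery`, which is the flexibility the text attributes to CORBA.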
In the opposite extreme, all system objects could be run locally on one host, which could be useful if source code were distributed with the system and the user running it wished to modify the code. Additional object configurations, for reasons such as efficiency, can also be obtained in a straightforward manner using CORBA. Unless a technology such as CORBA or Java RMI is utilized, communication between distributed objects can be very difficult.

CORBA opens the possibility of using existing CORBA interfaces to databases in the data integration system. As described earlier, databases such as the Radiation Hybrid database have had CORBA interfaces implemented for them. Thus, using CORBA in a biological data integration system may allow for incorporation of existing work into the system.

The Jade approach utilizes Java as its implementation language, but Jade objects themselves are not distributed. Rather, they are run on a single server. Thus, Jade lacks the distribution flexibility of the proposed system. In addition, the proposed system actually performs queries and processes the results from multiple databases. Jade concentrates on translating data into relational tables and does not perform any data integration itself. This integration is left to the application level, which means that a significant amount of work may need to be performed in order to obtain the desired results.

Biological data warehouses such as Entrez and SRS also involve distributed technology since they are accessed across the Internet. However, their data integration occurs during periodic updates and at the database level. Thus, data warehouses do not require distributed objects to perform multidatabase queries, since all the data from the constituent databases are already integrated in the data warehouse. There are various benefits and drawbacks to these different approaches, as discussed in the previous two sections.
Querying of data warehouses is simplified since queries can be performed against a single data warehouse that contains all of the integrated information, so a group of distributed objects such as in the proposed system is not needed. However, creating a data warehouse can be a very difficult, large-scale process that requires global data integration across many types of heterogeneity. Additionally, data warehouse design is difficult since the warehouse should be optimized for the various types of queries it must serve, and this can become very involved. With database federations, data integration is a much simpler process since it is only necessary for the data involved in the query.

Queries involving data warehouses are generally faster than queries involving database federations since data integration takes place prior to the query in data warehouses. However, there can be situations in which a database federation could outperform a data warehouse. For example, if a particular query type required the data warehouse to sort through many enormous relational tables, the same query might be faster in a database federation if it required searching only a few smaller tables in the federated databases. A database federation also offers the potential for parallel computation, since several databases in the federation could perform queries at the same time.

The proposed system can actually take advantage of the data integration that has been performed in data warehouses. A database access object in the system can query a data warehouse, thus utilizing the fast, pre-integrated querying capabilities of data warehouses. Thus, the proposed data integration system can serve as a powerful complement to existing data warehouses such as Entrez and SRS. A drawback to using a warehouse is that, since a warehouse is updated periodically from its constituent databases, data can be out-of-date.
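The potential for parallel querying in a federation, including a data warehouse treated as just another source, might look roughly like the following Java sketch. It is illustrative only and is not taken from the proposed system: the class `FederatedQuery`, the method `queryAll`, and the source names are hypothetical, and each source is modelled as a plain function rather than a CORBA object.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

final class FederatedQuery {
    // Sends the same term to every source concurrently, then concatenates
    // the partial results in the order the sources were registered.
    static List<String> queryAll(Map<String, Function<String, List<String>>> sources, String term) {
        List<CompletableFuture<List<String>>> pending = new ArrayList<>();
        for (Function<String, List<String>> source : sources.values()) {
            // Each source is queried on a worker thread from the common pool.
            pending.add(CompletableFuture.supplyAsync(() -> source.apply(term)));
        }
        List<String> merged = new ArrayList<>();
        for (CompletableFuture<List<String>> result : pending) {
            merged.addAll(result.join()); // wait for this source to finish
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Function<String, List<String>>> sources = new LinkedHashMap<>();
        // Hypothetical sources: one live database and one data warehouse.
        sources.put("sequenceDb", term -> List.of("sequenceDb:" + term));
        sources.put("warehouse", term -> List.of("warehouse:" + term));
        System.out.println(queryAll(sources, "BRCA1"));
    }
}
```

The total waiting time in this arrangement is bounded by the slowest source rather than by the sum of all sources, which is the benefit the text attributes to federated, parallel queries.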
A database federation that accesses the source databases directly has access to current data.

Due to CORBA’s language independence, the system objects can be implemented in different programming languages. This gives the system useful flexibility. For example, if a significant amount of C++ source code pertinent to processing biological database queries were obtained, this code could be adapted to the data integration system in C++ rather than ported to another language such as Java, which would be a very time-consuming task. This is an advantage in terms of implementation flexibility over the Jade system, which limits objects to the Java programming language.

A useful feature of the proposed system, compared to the other systems described in this work, is its use of XML for communication between objects in the data integration system, which standardizes object communication. This is an attractive feature given XML’s growing popularity, which makes it a technology that many people are familiar with and can work with easily. Recently, it has become possible to retrieve data from biological databases formatted in XML. For example, the EMBL nucleotide sequence database currently offers sequence data in the AGAVE and BSML XML formats. This trend is very positive for the proposed data integration system, since query results already formatted in XML would offload a significant amount of the parsing that would otherwise need to be performed by the system’s database access objects, which are responsible for converting database results into the internal XML format understood by the data integration system.

Use of an XML representation for interobject communication is an attractive feature of this system compared to the Kleisli-CPL system, which uses the Collection Programming Language (CPL) internally. CPL is high-level and powerful, but it utilizes a complex query format.
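The conversion step performed by a database access object, turning a raw database result into XML for other objects to consume, can be sketched with the standard Java XML APIs. The input layout (one tab-separated identifier and sequence per line) and the element names `result`, `entry`, and `sequence` are assumptions made for this example; they are not the system's internal format, nor AGAVE or BSML.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class ResultToXml {
    // Converts hypothetical "ID<TAB>SEQUENCE" lines into a small XML document.
    static String toXml(String source, String rawResult) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("result");
            root.setAttribute("source", source);
            doc.appendChild(root);

            for (String line : rawResult.split("\n")) {
                String[] fields = line.split("\t", 2);
                if (fields.length < 2) {
                    continue; // skip blank or malformed lines
                }
                Element entry = doc.createElement("entry");
                entry.setAttribute("id", fields[0].trim());
                Element sequence = doc.createElement("sequence");
                sequence.setTextContent(fields[1].trim());
                entry.appendChild(sequence);
                root.appendChild(entry);
            }

            // Serialize the DOM tree to a string without the XML declaration.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not build XML result", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml("exampleDb", "X1\tACGT\nX2\tTTGA\n"));
    }
}
```

When a database already returns XML, most of this parsing disappears and the access object only needs to map one XML vocabulary onto another.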
An XML-based query format is easier to understand due to XML’s hierarchical structure and the metadata provided by XML tags.

Java servlet technology provides a straightforward interface to the data integration system that serves as a layer of abstraction between client applications and the system. As a result, clients do not need to use CORBA to communicate with the data integration system, so a software engineer writing a client application does not have to deal with any of the complexities involved in CORBA. Thus, writing clients that harness the power of the data integration system is very easy in terms of the technology used to communicate with it. In fact, a client can even be a standard web-based form, since the methods used to submit form data are the same methods used to access the data integration system. This uncomplicated means of access is similar to World Wide Web access to databases and data warehouses such as SRS, in that it provides a simple way to perform database queries without directly involving the user with complicated technologies such as CORBA. The client application created for the prototype system is similar to the TAMBIS system in that it provides a friendly user interface that can offer advanced functionality to users without exposing them to the more complicated internals of the data integration system.

Thus, CORBA, XML, and Java servlet technology combine to form a powerful distributed data integration system that provides a simple interface for client communication and uses a standardized data representation for communication between its objects. System objects in the biological database federation are language independent and can easily be placed on a single machine or distributed across multiple hosts on a network due to the flexibility provided by CORBA.
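Since the servlet layer accepts the same kind of request that a web form submits, a client needs little more than the ability to build a form-encoded request. The Java sketch below constructs such a request URL without sending it. The servlet address and the field names `database` and `term` are hypothetical; the actual parameters of the prototype are not given in this section.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class FormQueryClient {
    // Builds a GET request URL the way a browser would when submitting a form.
    static String buildQueryUrl(String servletUrl, Map<String, String> fields) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            query.add(URLEncoder.encode(field.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(field.getValue(), StandardCharsets.UTF_8));
        }
        return servletUrl + "?" + query;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("database", "sequenceDb"); // hypothetical field name and value
        fields.put("term", "BRCA1 human");    // hypothetical field name and value
        // Hypothetical servlet address, used here only for illustration.
        System.out.println(buildQueryUrl("http://localhost:8080/integration/query", fields));
    }
}
```

No CORBA classes appear anywhere in the client, which reflects the point made above: the servlet hides the distributed object layer behind ordinary HTTP form submission.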