Biological Database Integration - Abstract
Author: Deron Eriksson
Description: Abstract of my master's thesis on the integration of data from heterogeneous biological databases using CORBA and XML.


Integration of Data from Heterogeneous Biological Databases using CORBA and XML

The emerging field of bioinformatics utilizes computer technology to acquire, manage, and analyze biological data. Biological data are often most useful when combined from various sources. For example, it may be possible to ascertain the function of a gene by comparing the structure of the protein product of that gene with structurally similar proteins of known function. Thus, the integration of data from multiple biological data sources is an important goal of bioinformatics.

However, integrating data from heterogeneous biological databases is difficult for several reasons: (1) the vast quantity of biological data, (2) the growing number of biological databases, (3) the rapid growth of the data, (4) the overabundance of data types and formats, (5) the wide variety of bioinformatics data access techniques, (6) database heterogeneity, (7) errors in biological data, and (8) the interdisciplinary nature of bioinformatics.

Despite these difficulties, solutions to biological data integration have been developed. They fall primarily into two categories: database federations and data warehouses. Each technique has its advantages and drawbacks, and a solution can combine aspects of both.

This thesis presents a novel federation system that addresses the problem of integrating data from multiple biological databases. Communication between objects in the system is handled using CORBA, an object-oriented, language-independent, platform-independent, highly interoperable distributed computing architecture. Through CORBA, system objects can be distributed across multiple computers, and they can be written in any language on any platform that supports CORBA. Data transmitted between system objects are represented in an XML format, which standardizes the data format. Access to the federated data integration system is provided by a Java servlet, which allows clients to communicate with the system in a simple, straightforward manner. The thesis describes the design and implementation of a prototype of this federated data integration system. The prototype demonstrates the feasibility of using the system to perform automated queries involving multiple biological databases.
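
To illustrate how an XML representation can standardize data passed between system objects, the following is a minimal sketch in Java using only the standard JAXP classes. It is not the format used in the thesis: the element names (record, field), the source attribute, the class name XmlRecordSketch, and the sample values are all hypothetical, chosen only to show one record being serialized to XML by a sender and read back by a receiver.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: the XML layout below is an assumption, not the thesis format.
class XmlRecordSketch {

    // Serialize one record (field name -> value) from a named source into an XML string.
    static String toXml(String source, Map<String, String> fields) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.newDocument();
            Element record = doc.createElement("record");
            record.setAttribute("source", source);
            doc.appendChild(record);
            for (Map.Entry<String, String> entry : fields.entrySet()) {
                Element field = doc.createElement("field");
                field.setAttribute("name", entry.getKey());
                field.setTextContent(entry.getValue());
                record.appendChild(field);
            }
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize record", e);
        }
    }

    // Parse the XML produced by toXml back into a field name -> value map.
    static Map<String, String> fromXml(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("field");
            Map<String, String> fields = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element field = (Element) nodes.item(i);
                fields.put(field.getAttribute("name"), field.getTextContent());
            }
            return fields;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse record", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("gene", "exampleGene");          // placeholder value
        fields.put("organism", "exampleOrganism");  // placeholder value
        String xml = toXml("exampleDatabase", fields);
        System.out.println(xml);           // the XML a sending object would transmit
        System.out.println(fromXml(xml));  // the fields a receiving object would recover
    }
}
```

Because both sides agree only on the XML layout, a receiving object needs no knowledge of the native format of the database the record came from, which is the kind of standardization the XML layer is meant to provide.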