Biological Database Integration - New Approach
Author: Deron Eriksson
Description: Integration of Data from Heterogeneous Biological Databases using CORBA and XML - New Approach to Data Integration.
9. New Approach to Data Integration
This section presents the federated data integration prototype that was developed to perform queries involving multiple biological databases. The overall architectural design of the system is presented first, followed by the implementation details of the system components, including the client application that was developed to query the data integration system. The setup of the development environment is then described, along with the experiences encountered during the development of the prototype. Because the current database federation system is only a prototype, many useful modifications remain possible, and some of these enhancements are described in a section on future work.
9.1. Architecture Overview
The federated data integration system consists of a group of distributed components that together support database federation: a system access object, a query processing object, and multiple database access objects. All data integration system objects use CORBA to communicate with other objects, and XML is used to standardize the data representations used in the system. Since the system objects communicate via CORBA, the objects can easily be distributed across different hosts and can be written in different implementation languages if so desired. Messages exchanged between system objects are formatted in XML so that object communication takes place in a standard text-based representation that is hierarchical and easily parsed. Access to the data integration system is provided by the system access object, a Java servlet. The prototype features a client application written in Visual Basic that accesses the system via the system access object, which allows queries to be issued through a user-friendly interface. The client application also demonstrates the ease with which different types of clients can access and utilize the data integration system via the system access object.
An overview of this architecture is presented in Figure 9.1. In this figure, the federated data integration system consists of four objects: a system access object, a query processing object, and two database access objects. The database access objects can query two biological databases, the EMBL sequence database and the PubMed abstract database. Thus, multidatabase queries can be performed using these two databases if such functionality is programmed into the system.
The system access object allows clients to communicate with the data integration system and is responsible for returning query results to clients. The query processing object is responsible for determining which databases need to be contacted in order to perform a query, and it is also responsible for ultimately processing the results of such queries. Each database access object communicates with a specific database, typically via a web server. The database access object is responsible for taking a query from the query processor and formatting this query in a form specific to its database. The database access object also receives results from database queries and formats these results into a form understandable to the data integration system.
Figure 9.1: Federated Data Integration System Overview
The system access object is implemented as a Java servlet that takes queries from clients via the Hypertext Transfer Protocol (HTTP) GET and POST methods. The system access object provides a simple layer of abstraction that separates client applications from the details of the system so that clients, for example, do not need to use CORBA to communicate with the data integration system. As a result, many of the internals of the data integration system could be completely changed without affecting client applications. For example, the data integration system could switch from CORBA to another distributed technology such as Java RMI without affecting how clients access the system. The system access object also provides client flexibility, since different types of clients can communicate with the data integration system via the familiar GET and POST methods.
The system access object takes the client’s query and formats the query into a well-formed XML string, which is sent to the query processing object. The query processing object decides which database or databases need to be queried in order to complete the query based on the query type, and it in turn contacts the appropriate database accessor object or objects using query strings formatted into XML.
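As a rough illustration of this formatting step, the sketch below wraps a query type and its parameters in a small XML string. The class name, the element and attribute names (query, type, param, name), the query type string, and the placeholder accession number are all assumptions made for this example; the prototype's actual query format is not reproduced here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: wraps a client's query parameters in a small XML document. */
final class QueryXmlFormatter {

    /** Escapes the five characters that are special in XML text and attribute values. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    /**
     * Builds a query document. The element and attribute names used here are
     * illustrative assumptions, not the prototype's schema.
     */
    static String format(String queryType, Map<String, String> params) {
        StringBuilder xml = new StringBuilder();
        xml.append("<query type=\"").append(escape(queryType)).append("\">");
        for (Map.Entry<String, String> p : params.entrySet()) {
            xml.append("<param name=\"").append(escape(p.getKey())).append("\">")
               .append(escape(p.getValue()))
               .append("</param>");
        }
        xml.append("</query>");
        return xml.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("accession", "X00000"); // placeholder accession number
        System.out.println(format("embl_to_pubmed", params));
    }
}
```

Escaping the parameter values is what keeps the result well-formed even when a client submits characters such as & or <.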
Database accessor objects are used to query web-accessible biological databases such as PubMed and the EMBL sequence database. In this prototype, database accessor objects query these databases using the GET method. Each database accessor object is paired with a single biological database, which makes the system as a whole resilient to change. If the technique used to query a database changes, or if the results of a query are formatted differently, only the corresponding database accessor object needs to be modified. The other components in the data integration system require no modifications, since the database accessor object still performs the query and provides the results in formats understandable to the system. Thus, having a single database accessor object for each database serves to insulate the rest of the system from changes to a database.
A database accessor object receives an XML-formatted query from the query processing object and translates this query into the specific format required by the database that it accesses. The database accessor object performs the query of the database and translates the results into an XML form that is sent back to the query processor. The query processor takes these results and, based on the query type, decides whether to use them to perform an additional database query or to send the response back to the system access object, which returns the results to the client.
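The translation from an XML-formatted query to a database-specific request might look something like the following sketch. It assumes a hypothetical query format in which the accession number arrives as a param element with name="accession", and it uses a placeholder base URL and an invented "id" request parameter; none of these are taken from the prototype or from the real EMBL web server.

```java
import java.io.StringReader;
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical accessor helper: turns an XML query into a GET request URL. */
final class AccessorQueryTranslator {

    // Placeholder address; the real database URL is not given in the text.
    static final String BASE_URL = "http://db.example.org/entry";

    /** Pulls the accession number out of the (assumed) XML query format. */
    static String extractAccession(String queryXml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(queryXml)));
            NodeList params = doc.getElementsByTagName("param");
            for (int i = 0; i < params.getLength(); i++) {
                Element param = (Element) params.item(i);
                if ("accession".equals(param.getAttribute("name"))) {
                    return param.getTextContent().trim();
                }
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("query is not well-formed XML", e);
        }
        throw new IllegalArgumentException("query has no accession parameter");
    }

    /** Builds the database-specific GET URL; "id" is an invented parameter name. */
    static String toGetUrl(String queryXml) {
        String accession = extractAccession(queryXml);
        try {
            return BASE_URL + "?id=" + URLEncoder.encode(accession, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 should always be available", e);
        }
    }
}
```

Because only this class knows the URL layout, a change on the database side would be confined to the accessor, which is the insulation property described above.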
This system allows for data integration across biological databases, automating multiple database queries that would be time-consuming if performed manually. Clients are not limited to multidatabase queries. Clients may also submit single database queries if those queries have been implemented by the data integration system.
Figure 9.2 illustrates the sequence of events for the processing of a multidatabase query by the system. In this example, the client requests the system to perform a query that returns all PubMed abstracts that are associated with an EMBL DNA sequence entry.
The client submits the query type and an EMBL accession number to the system via the system access object using the GET method (step 1). In the EMBL nucleotide sequence database, “Accession numbers are the primary means of identifying sequences [and] provide a stable way of identifying entries from release to release… Accession numbers allow unambiguous citation of database entries” [EMB, 2002]. Following this, the system access object formats this query in XML and passes this query string to the query processor by calling a method on the query processing object via CORBA (step 2). The query processor examines the query type and determines that the first database to be queried is the EMBL sequence database. The query processor creates an XML-formatted query requesting an XML-formatted EMBL entry based on an EMBL accession number and passes this query to the EMBL database access object via a method call using CORBA (step 3).
The method of the EMBL database accessor extracts the accession number from the query string that the query processor passes to it as a parameter. The EMBL database access object performs a query of the EMBL sequence database using the accession number via the EMBL web server using the GET method (step 4). The EMBL web server returns an EMBL sequence flatfile entry to the EMBL accessor (step 5), which parses the flatfile into a hierarchical XML representation. This XML representation of the EMBL entry is returned to the query processor via the string return value of the accessor’s method (step 6).
The query processor extracts all PubMed citation unique identifiers referenced in the XML-parsed EMBL entry. The query processor creates an XML-based query string with these identification numbers and passes this string to the PubMed accessor via a method call using CORBA (step 7). The method of the PubMed accessor obtains the identification numbers from the query string and then queries the PubMed database via the PubMed web server with these identification numbers using the GET method (step 8). The abstracts referenced by the unique identifiers are returned to the PubMed accessor in plain text format (step 9). These abstracts are placed in an XML-based hierarchical structure and are returned to the query processor via the method’s return string (step 10). The query processor receives the query result and returns these abstracts to the system access object via the return string of the query processor method that the system access object originally called (step 11). The system access object returns the abstracts to the client application (step 12), thus completing the multidatabase query.
Figure 9.2: Multidatabase Query Example
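The query processor's part of this sequence (steps 3 through 10) can be sketched in plain Java, with an ordinary interface standing in for the accessors' CORBA interfaces. Everything named here is hypothetical: the DatabaseAccessor interface, the query type strings, the param and pubmedId element names, and the stubbed responses in main are assumptions chosen only to show the chaining of one database's results into the next database's query.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Stand-in for a database accessor's remote interface: XML query in, XML result out. */
interface DatabaseAccessor {
    String query(String queryXml);
}

/** Hypothetical sketch of the query processor's role in the multidatabase query. */
final class MultiDatabaseQuerySketch {

    // Assumed element name for PubMed identifiers inside the XML form of an EMBL entry.
    private static final Pattern PUBMED_ID = Pattern.compile("<pubmedId>(\\d+)</pubmedId>");

    private final DatabaseAccessor embl;
    private final DatabaseAccessor pubMed;

    MultiDatabaseQuerySketch(DatabaseAccessor embl, DatabaseAccessor pubMed) {
        this.embl = embl;
        this.pubMed = pubMed;
    }

    /** Collects every PubMed identifier found in the XML-formatted EMBL entry. */
    static List<String> extractPubMedIds(String emblEntryXml) {
        List<String> ids = new ArrayList<>();
        Matcher m = PUBMED_ID.matcher(emblEntryXml);
        while (m.find()) {
            ids.add(m.group(1));
        }
        return ids;
    }

    /** Queries EMBL first, then feeds the extracted identifiers to PubMed. */
    String abstractsForAccession(String accession) {
        // Assumes the accession is plain alphanumeric text; real code should escape it.
        String emblEntryXml = embl.query(
                "<query type=\"embl_entry\"><param name=\"accession\">"
                + accession + "</param></query>");

        StringBuilder pubMedQuery = new StringBuilder("<query type=\"pubmed_abstracts\">");
        for (String id : extractPubMedIds(emblEntryXml)) {
            pubMedQuery.append("<param name=\"pmid\">").append(id).append("</param>");
        }
        pubMedQuery.append("</query>");

        return pubMed.query(pubMedQuery.toString());
    }

    public static void main(String[] args) {
        // Stub accessors with made-up responses; no network access is performed.
        DatabaseAccessor emblStub = q -> "<entry><pubmedId>111</pubmedId></entry>";
        DatabaseAccessor pubMedStub = q -> "<abstracts><abstract>stub</abstract></abstracts>";
        System.out.println(
                new MultiDatabaseQuerySketch(emblStub, pubMedStub).abstractsForAccession("X00000"));
    }
}
```

Keeping the accessors behind a single narrow interface mirrors the design choice in the text: the query processor only ever sees XML strings, so either database can change without affecting the orchestration logic.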
An architectural overview of the data integration system can also be gained through a Unified Modeling Language (UML) class diagram. The primary classes used in the data integration system are shown in Figure 9.3. The operations (methods) of these classes will be expanded upon later in this chapter in the sections describing the system components.
Figure 9.3: UML Class Relationship Overview of the Data Integration System
The QueryProcessor, EMBLAccessor, and PubMedAccessor are CORBA object servers. Each server instantiates a servant and registers this servant with an ORB. Each servant is responsible for implementing the methods of a component’s CORBA interface. After registering a servant’s name with a naming service, a server waits for clients to invoke servant methods.
Thus, the QueryProcessor instantiates a QueryProcessorServant, the EMBLAccessor instantiates an EMBLAccessorServant, and the PubMedAccessor instantiates a PubMedAccessorServant. The QueryProcessorServant invokes methods on the EMBLAccessorServant and the PubMedAccessorServant, and the SystemAccessServlet invokes methods on the QueryProcessorServant. The QueryProcessorServant and SystemAccessServlet are CORBA clients, since they invoke methods on CORBA objects. The QueryProcessorServant is also a CORBA object itself.
The EMBLAccessorServant instantiates an EMBLParser for queries requiring XML parsing of EMBL flatfiles. This parser takes an EMBL flatfile entry and converts it into an easily parseable hierarchically structured XML format. At the time of this writing, the EMBL web server can return accession number query results in two XML formats – AGAVE and BSML. However, the XML format provided by the EMBLParser class is a truer representation of the data and structure of EMBL flatfiles.
An EMBL flatfile entry and the EMBLParser’s XML representation of this entry are given in the Appendix.
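To give a feel for the kind of conversion involved, the sketch below turns flatfile-style text into a very simple XML string. It relies only on the general EMBL flatfile layout, in which each line starts with a two-character line code followed by data, XX marks a spacer line, and // ends an entry. The class name and the entry and field element names are assumptions for this example, and the output is deliberately flat; it is far simpler than the hierarchical representation that the EMBLParser produces.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, much-simplified flatfile-to-XML conversion (not the EMBLParser). */
final class FlatfileToXmlSketch {

    /** Escapes characters that are special in XML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /**
     * Groups consecutive lines that share a line code into one field element.
     * Lines whose code is blank (such as sequence data) are appended to the
     * preceding field.
     */
    static String toXml(String flatfile) {
        List<String[]> fields = new ArrayList<>(); // each item: {line code, text}
        for (String line : flatfile.split("\r?\n")) {
            if (line.length() < 2 || line.startsWith("XX") || line.startsWith("//")) {
                continue; // skip blank lines, spacer lines, and the entry terminator
            }
            String code = line.substring(0, 2);
            String data = line.length() > 5 ? line.substring(5).trim() : "";
            boolean blankCode = code.trim().isEmpty();
            boolean continuation = !fields.isEmpty()
                    && (blankCode || fields.get(fields.size() - 1)[0].equals(code));
            if (continuation) {
                String[] last = fields.get(fields.size() - 1);
                last[1] = (last[1] + " " + data).trim();
            } else if (!blankCode) {
                fields.add(new String[] {code, data});
            }
        }
        StringBuilder xml = new StringBuilder("<entry>");
        for (String[] f : fields) {
            xml.append("<field code=\"").append(escape(f[0])).append("\">")
               .append(escape(f[1]))
               .append("</field>");
        }
        return xml.append("</entry>").toString();
    }
}
```

A fuller parser would go further, for example nesting the lines of each literature reference under a common parent element, which is the sort of structure a hierarchical representation adds.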