Biological Database Integration - New Approach
Author: Deron Eriksson
Description: Integration of Data from Heterogeneous Biological Databases using CORBA and XML - New Approach to Data Integration.



(Continued from page 1)

9.2. System Components

9.2.1. System Access Servlet

The data integration system can be accessed by outside clients via the system access object, which is implemented as a Java servlet. It extends the HttpServlet class, as in the following class declaration:

public class SystemAccessServlet extends HttpServlet

The SystemAccessServlet class implements the doGet() and doPost() methods so that the servlet can respond to client requests. The doGet() method responds to data submitted by a client via the GET method. In the GET method, the client sends data to the web server by placing that data after a URL in the HTTP request, as in the following example:

http://localhost:8080/examples/servlet/SystemAccessServlet?querytype=accessionnumbertoemblflatfile&accessionnumber=TRBG361

In this example, the web server is running locally on the computer, and the client requests that the SystemAccessServlet be invoked. The servlet is sent two name/value pairs. These pairs are placed after a question mark following the URL. An equal sign is used to assign a value to a name, and name/value pairs are separated by ampersands. The first name is querytype and its value is accessionnumbertoemblflatfile. The second name is accessionnumber and its value is TRBG361.
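The structure of the query string described above can be illustrated with a small sketch. In a real servlet the container performs this parsing and the values are read with getParameter(); the QueryStringExample class and its parse method below are hypothetical and serve only to show how the question mark, ampersands, and equal signs delimit the data.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the original system: splits the part of a
// GET URL that follows the question mark into name/value pairs.
final class QueryStringExample {

    static Map<String, String> parse(String url) {
        Map<String, String> pairs = new LinkedHashMap<>();
        int q = url.indexOf('?');
        if (q < 0) {
            return pairs; // no query string present
        }
        // Name/value pairs are separated by ampersands.
        for (String pair : url.substring(q + 1).split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            // The first equal sign separates the name from its value.
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            pairs.put(decode(name), decode(value));
        }
        return pairs;
    }

    // Reverses URL encoding such as %20 or '+' for a space.
    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String url = "http://localhost:8080/examples/servlet/SystemAccessServlet"
                + "?querytype=accessionnumbertoemblflatfile&accessionnumber=TRBG361";
        // Prints each name with its value, in the order they appear in the URL.
        for (Map.Entry<String, String> e : parse(url).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```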

The doGet() method has the following declaration:

public void doGet(HttpServletRequest request, HttpServletResponse response) throws IOException, ServletException


This method takes an HttpServletRequest object and an HttpServletResponse object as parameters. The HttpServletRequest object contains the name/value pair data sent by the client. The HttpServletResponse contains the response that the servlet sends back to the client. The method throws IOException and ServletException. An IOException can be thrown if an input or output error occurs while the request is being handled, and a ServletException can be thrown if the request cannot be handled for some other reason.

In the current implementation of the SystemAccessServlet, a call to doGet() results in a call to doPost(), as in the following:

public void doGet(HttpServletRequest request, HttpServletResponse response) throws IOException, ServletException
{
    doPost(request, response);
}


The doPost() method takes the parameters and throws the same exceptions as the doGet() method, as shown in the following declaration:

public void doPost(HttpServletRequest request, HttpServletResponse response) throws IOException, ServletException


The doPost() method responds to data submitted by a client via the POST method. As mentioned above, it is also invoked by the doGet() method. Thus, the SystemAccessServlet responds in the same manner to data submitted by the GET method as it does to data submitted by the POST method. In the POST method, the client sends name/value data to the web server, but these data are separate from the URL string. Instead, the POST data are sent in the body of the HTTP request, which the server reads as a stream.

The content type of the HttpServletResponse object is set to the “text/plain” MIME content type, since ASCII text will be returned to the client. MIME (Multipurpose Internet Mail Extensions) is a standard for describing different data formats. A PrintWriter object is obtained from the HttpServletResponse object. Textual data written to the PrintWriter object are ultimately returned to the client.

The SystemAccessServlet obtains the query type sent from the client via a getParameter() call. This method takes a parameter name as its argument and returns the corresponding value, or null if the request contains no parameter with that name:

String querytype = request.getParameter("querytype");

A variety of different query types are recognized by the prototype system. The query types included are the following:

(1) accessionnumbertoemblflatfile

(2) accessionnumbertoemblxmlresult

(3) accessionnumbertoagavexmlresult

(4) uidtopubmedabstract

(5) emblaccessionnumbertopubmedabstract


The accessionnumbertoemblflatfile query type is a single database query that takes an accession number and returns the corresponding EMBL flatfile from the EMBL nucleotide sequence database. The accessionnumbertoemblxmlresult query type is a similar EMBL database query that takes an accession number and returns an EMBL entry formatted in XML by the EMBLParser object. The accessionnumbertoagavexmlresult type is a single database query that takes an EMBL accession number and returns the corresponding EMBL entry formatted in the AGAVE XML standard. The uidtopubmedabstract query type is a single database query that takes a citation unique identifier (UID) and obtains the corresponding PubMed abstract. The emblaccessionnumbertopubmedabstract query type is a multidatabase query that takes an EMBL accession number and returns all PubMed abstracts that are referenced in the EMBL entry corresponding to the accession number. This query is described earlier in the Architecture Overview section and is depicted in the query example in Figure 9.2.

If the query type submitted by the client is not recognized, the following error message will be returned to the client application:

The query type was not recognized.
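A minimal sketch of this check follows. The five query type names and the error message are taken from the text; the QueryTypeCheck class, the validate method, and the choice of exact, case-sensitive matching are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper: decides whether a submitted query type is one of the
// five types listed for the prototype system.
final class QueryTypeCheck {

    static final Set<String> RECOGNIZED = new LinkedHashSet<>(Arrays.asList(
            "accessionnumbertoemblflatfile",
            "accessionnumbertoemblxmlresult",
            "accessionnumbertoagavexmlresult",
            "uidtopubmedabstract",
            "emblaccessionnumbertopubmedabstract"));

    static final String NOT_RECOGNIZED = "The query type was not recognized.";

    // Returns null for a recognized query type, otherwise the error message
    // that would be returned to the client. Matching is exact (an assumption).
    static String validate(String queryType) {
        boolean known = queryType != null && RECOGNIZED.contains(queryType);
        return known ? null : NOT_RECOGNIZED;
    }

    public static void main(String[] args) {
        System.out.println(validate("uidtopubmedabstract")); // recognized, so null
        System.out.println(validate("somethingelse"));       // the error message
    }
}
```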

If the query type is recognized, the SystemAccessServlet converts the name/value data sent from the client to an XML representation of the query. The values are enclosed in tags describing the values. This XML query string is sent to the QueryProcessorServant object.
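The conversion from name/value data to XML might look roughly like the following sketch. The text does not list the exact tags the servlet emits, so using each parameter name as the tag name is an assumption, as are the XmlQueryBuilder class and its methods. The sketch also assumes that parameter names are valid XML element names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: encloses each value in a tag named after its parameter.
final class XmlQueryBuilder {

    // Escapes the characters that are significant in XML text content.
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    static String toXmlQuery(Map<String, String> pairs) {
        StringBuilder xml = new StringBuilder("<?xml version=\"1.0\"?>\n<query>\n");
        for (Map.Entry<String, String> e : pairs.entrySet()) {
            xml.append("<").append(e.getKey()).append(">")
               .append(escape(e.getValue()))
               .append("</").append(e.getKey()).append(">\n");
        }
        return xml.append("</query>").toString();
    }

    public static void main(String[] args) {
        Map<String, String> pairs = new LinkedHashMap<>();
        pairs.put("querytype", "accessionnumbertoemblxmlresult");
        pairs.put("accessionnumber", "TRBG361");
        System.out.println(toXmlQuery(pairs));
    }
}
```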

Since the QueryProcessorServant is a CORBA object, the SystemAccessServlet must obtain a reference to this object. The SystemAccessServlet creates and initializes an ORB. The CORBA naming service is contacted and a reference to the QueryProcessorServant object (which has been registered in the name server as “QueryProcessor”) is obtained. The XML query string is passed to the QueryProcessorServant in a call to the doQuery() method, as in the following example:

String theResponse = qRef.doQuery(theQuery);

The SystemAccessServlet then waits for the response string to be returned from the QueryProcessorServant. If a problem is experienced in communicating with the QueryProcessorServant, the following error message is returned to the client:

ERROR: You connected to SystemAccessServlet but an error was experienced with the connection to the QueryProcessor.

When a response is received from the QueryProcessorServant, the SystemAccessServlet can return this response to the client by writing the response to the PrintWriter object:

out.println(theResponse);

The SystemAccessServlet is free to process the query result returned from the QueryProcessorServant before sending the result to the client. For example, the SystemAccessServlet may remove the XML tags enclosing a piece of textual data before returning the query response to the client.

During operation, the SystemAccessServlet displays status information to standard output. This is useful for monitoring normal operations and for debugging, since the execution of queries involves multiple distributed objects and the status information provided by the distributed objects can be essential for localizing possible issues.

9.2.2. Query Processor

The query processor is the primary component responsible for processing queries and query results. Some query processing is also handled by other components in the system. For example, the system access object processes client queries into an XML representation, and database access objects process the results of database queries into XML representations.

The query processor decides which databases need to be contacted in order to perform a query based on the query type that it receives in the XML-formatted query string from the system access object. Based on the query type, the query processor determines whether the results returned from a database access object should be returned to the system access object or if the results should be used in another query to another database access object.

The QueryProcessor object is the CORBA object server that instantiates the QueryProcessorServant and registers the servant with an ORB. The QueryProcessor object registers the servant using the name “QueryProcessor” in the CORBA naming service. The QueryProcessor object then waits for clients to invoke methods on the servant using the following:

java.lang.Object sync = new java.lang.Object();
synchronized (sync) {
    sync.wait();
}


The QueryProcessorServant is the actual CORBA object invoked by CORBA clients. The QueryProcessorServant is also a CORBA client since it invokes methods on the EMBLAccessorServant and PubMedAccessorServant objects.

The QueryProcessorServant implements the doQuery() method specified in the QueryProc.idl file. This IDL (Interface Definition Language) file declares a QueryProcessorInterface interface within the BioQueryProcessor module. The doQuery() method is declared within this interface. The QueryProc.idl file is shown below:

module BioQueryProcessor
{
    interface QueryProcessorInterface
    {
        string doQuery(in string theQuery);
    };
};


This IDL module is translated into its Java equivalent using the idlj compiler that accompanies the JDK 1.3. Both client stubs and server skeletons are generated using the following command:

idlj -fall QueryProc.idl

This generates a BioQueryProcessor directory containing the following *.java files:

QueryProcessorInterface.java

QueryProcessorInterfaceHelper.java

QueryProcessorInterfaceHolder.java

QueryProcessorInterfaceOperations.java

_QueryProcessorInterfaceImplBase.java

_QueryProcessorInterfaceStub.java


The QueryProc.idl file has been translated into its Java equivalent in the QueryProcessorInterfaceOperations.java file, as shown in the following code:

public interface QueryProcessorInterfaceOperations
{
    String doQuery (String theQuery);
} // interface QueryProcessorInterfaceOperations


The QueryProcessorInterface interface extends the QueryProcessorInterfaceOperations interface but does not add any additional operations to it.

The _QueryProcessorInterfaceImplBase class is the server skeleton and is an abstract class that implements the QueryProcessorInterface. The _QueryProcessorInterfaceStub class is the client stub. It also implements the QueryProcessorInterface. The Helper and Holder files contain supporting functionality that is not discussed here.

In order for the QueryProcessorServant to implement the doQuery() method, the class must extend the server skeleton, _QueryProcessorInterfaceImplBase:

class QueryProcessorServant extends _QueryProcessorInterfaceImplBase

The QueryProcessorServant implements the Java equivalent of the doQuery() method from the IDL file:

public String doQuery(String theQuery)

The method receives the XML-formatted query from the SystemAccessServlet as a parameter. Following this, the method creates an ORB and contacts the naming service to obtain references to the EMBLAccessorServant and PubMedAccessorServant objects. The EMBL servant is registered in the naming service as “EMBLAccessor” and the PubMed servant is registered as “PubMedAccessor.” If an exception is thrown, the error will be displayed in standard output.

The QueryProcessorServant then extracts the query type from the parameter. If the query type is not recognized, the following error message is sent to standard output:

ERROR: The Query Processor doesn't understand the query type

If the query type is recognized, the QueryProcessorServant carries out the query. As an example, a multidatabase query of type emblaccessionnumbertopubmedabstract is described here. The query processor first writes a diagnostic message to the console window describing the query type to be performed. It then begins the first query of the multidatabase query by extracting the EMBL accession number from the string it received from the SystemAccessServlet. Next, it creates a new XML-formatted query containing the accession number, which is used to query the EMBL sequence database and obtain an XML version of the corresponding EMBL entry. The sample mode value received from the SystemAccessServlet is also copied into this query string. For example, for accession number TRBG361, it creates the following XML string:

<?xml version="1.0"?>
<query>
    <samplemode>false</samplemode>
    <accessionnumber>TRBG361</accessionnumber>
</query>
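The text does not say how the prototype extracts values such as the accession number or the sample mode from an XML query string. One possible approach, sketched here under that caveat, is to use the DOM parser that ships with the JDK. The XmlQueryReader class and its valueOf method are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: reads a single value out of an XML-formatted query.
final class XmlQueryReader {

    // Returns the text of the first element with the given tag name,
    // or null if the query contains no such element.
    static String valueOf(String xml, String tag) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName(tag);
            return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Malformed XML query", e);
        }
    }

    public static void main(String[] args) {
        String query = "<?xml version=\"1.0\"?>\n<query>\n"
                + "<samplemode>false</samplemode>\n"
                + "<accessionnumber>TRBG361</accessionnumber>\n"
                + "</query>";
        System.out.println(valueOf(query, "accessionnumber"));
        System.out.println(valueOf(query, "samplemode"));
    }
}
```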


The doAccessionNumberToEMBLXMLResult method on the reference to the EMBLAccessorServant is then called, passing the new query as a parameter:

firstresult = emblRef.doAccessionNumberToEMBLXMLResult(quer);

This retrieves an XML representation of the EMBL entry corresponding to the accession number. The EMBLParser object is used by the EMBLAccessorServant to generate this XML representation from an EMBL flatfile entry. The EMBL flatfile entry for accession number TRBG361 and the system’s XML representation of this entry are shown in Appendix A.

If the EMBL entry has a cross-reference to a PubMed abstract, the PubMed Unique Identifier (UID) is extracted from the result returned from the EMBLAccessorServant. In the case of accession number TRBG361, the EMBL entry contains a reference to Medline (PubMed) UID 91322517. The QueryProcessorServant prepares an XML-formatted query for the PubMedAccessorServant using this UID:

<?xml version="1.0"?>
<query>
    <samplemode>false</samplemode>
    <uid>91322517</uid>
</query>


The doUIDToPubMedAbstractQuery method on the reference to the PubMedAccessorServant is called, passing the UID query as a parameter:

theResponse = pubmedRef.doUIDToPubMedAbstractQuery(quer2);

The PubMedAccessorServant returns an XML-formatted abstract corresponding to the UID. Since this abstract is the final result data desired in the multidatabase query, the QueryProcessorServant returns the result string to the SystemAccessServlet, which in turn returns the results to the client application.
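The chaining of the two queries can be summarized in a simplified sketch that leaves out CORBA entirely. The two accessors are represented by plain functions from a query string to a result string, and the sketch assumes that the EMBL XML result carries the cross-reference in a uid tag; the actual tag names appear only in Appendix A. The class name, method names, and the empty-string result for an entry without a cross-reference are likewise assumptions. Values are inserted without XML escaping to keep the sketch short.

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified, CORBA-free illustration of a two-step multidatabase query.
final class MultiDatabaseQuerySketch {

    // Builds a query in the shape shown in the text.
    static String buildQuery(boolean sampleMode, String tag, String value) {
        return "<?xml version=\"1.0\"?>\n<query>\n"
                + "<samplemode>" + sampleMode + "</samplemode>\n"
                + "<" + tag + ">" + value + "</" + tag + ">\n"
                + "</query>";
    }

    // Returns the text between <tag> and </tag>, or null if the tag is absent.
    static String extract(String xml, String tag) {
        String t = Pattern.quote(tag);
        Matcher m = Pattern.compile("<" + t + ">(.*?)</" + t + ">", Pattern.DOTALL)
                .matcher(xml);
        return m.find() ? m.group(1).trim() : null;
    }

    static String emblAccessionNumberToPubMedAbstract(
            String xmlQuery,
            Function<String, String> emblAccessor,
            Function<String, String> pubMedAccessor) {
        boolean sampleMode = Boolean.parseBoolean(extract(xmlQuery, "samplemode"));
        String accession = extract(xmlQuery, "accessionnumber");

        // First query: obtain the EMBL entry as XML.
        String firstResult =
                emblAccessor.apply(buildQuery(sampleMode, "accessionnumber", accession));

        // The result of the first query supplies the input to the second.
        String uid = extract(firstResult, "uid");
        if (uid == null) {
            return ""; // assumption: nothing to return without a cross-reference
        }
        return pubMedAccessor.apply(buildQuery(sampleMode, "uid", uid));
    }

    public static void main(String[] args) {
        String incoming = buildQuery(false, "accessionnumber", "TRBG361");
        // Stub accessors stand in for the remote servants.
        String result = emblAccessionNumberToPubMedAbstract(
                incoming,
                q -> "<entry><uid>91322517</uid></entry>",
                q -> "<abstract>stub abstract for " + extract(q, "uid") + "</abstract>");
        System.out.println(result);
    }
}
```

Passing the accessors in as functions keeps the chaining logic testable without a running ORB or naming service.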

9.2.3. Database Accessors

Database accessor objects are responsible for interfacing to the external web-accessible biological databases in the federation. They are responsible for translations between the data integration system and the databases accessed by the system. On the system side, database accessors communicate via XML-formatted data with the QueryProcessorServant. In this prototype, on the database side, database accessors perform queries of the biological databases using the GET method to submit data to CGI programs.

Each database accessor consists of a CORBA object server and an object servant. The object servant is the object that actually implements the database accessor interface. The object server instantiates the servant and an ORB and registers the servant object with the naming service via the ORB.

Each type of query that a database accessor can perform is specified by a particular operation of the object. Thus, the QueryProcessorServant communicates with the database accessors via method calls on the accessors in which the XML-formatted query data is passed as a parameter. Data is returned to the QueryProcessorServant by string return values from the operations.

A single database accessor is created for each biological database in the federation. This makes the system as a whole resilient to change. Each database accessor handles all of the translations involved in communicating with its biological database, so it presents a consistent interface and consistent data formats to the rest of the system. A change to a single database’s query requirements or result formats therefore requires updates only to that database’s accessor. Adding another biological database to the system requires the creation of a new database accessor for that database.
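The isolation that accessors provide can be illustrated with a plain-Java analogy that does not use CORBA. Everything in this sketch is hypothetical: the DatabaseAccessor interface reduces each accessor to a single string-in, string-out operation, and the AccessorRegistry map plays a role loosely comparable to the naming service.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java analogy: callers look accessors up by name and never see how an
// individual database is queried or how its results are translated.
final class AccessorRegistry {

    private final Map<String, DatabaseAccessor> accessors = new HashMap<>();

    // Comparable in spirit to registering a servant under a name.
    void register(String name, DatabaseAccessor accessor) {
        accessors.put(name, accessor);
    }

    DatabaseAccessor lookup(String name) {
        DatabaseAccessor accessor = accessors.get(name);
        if (accessor == null) {
            throw new IllegalArgumentException("No accessor registered as " + name);
        }
        return accessor;
    }

    public static void main(String[] args) {
        AccessorRegistry registry = new AccessorRegistry();
        registry.register("EMBLAccessor", q -> "<entry>stub EMBL result</entry>");
        // Supporting a new database means registering one more accessor.
        registry.register("PubMedAccessor", q -> "<abstract>stub abstract</abstract>");
        System.out.println(registry.lookup("EMBLAccessor").query("<query></query>"));
    }
}

// Each accessor hides one database behind the same contract.
interface DatabaseAccessor {
    String query(String xmlQuery);
}
```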

In the prototype system, two database accessors have been implemented: an EMBL accessor and a PubMed accessor.

9.2.3.1. EMBL Accessor

The EMBLAccessor object is a CORBA server and the EMBLAccessorServant is a CORBA servant. The EMBLAccessor instantiates an ORB, instantiates an EMBLAccessorServant, and then registers the servant in the naming service using the name “EMBLAccessor.” Following this, the EMBLAccessor waits for method invocations by the client, the QueryProcessorServant.

The EMBLAccessorServant implements the operations specified in the EMBLAcc.idl file. In this IDL file, the BioEMBLAccess module is declared, and the EMBLAccessInterface is declared within it. This interface consists of three operations. Each operation allows a specific type of query. The BioEMBLAccess module is shown below:

module BioEMBLAccess
{
    interface EMBLAccessInterface
    {
        string doAccessionNumberToEMBLFlatFileQuery(in string AccessionNumber);
        string doAccessionNumberToEMBLXMLResult(in string AccessionNumber);
        string doAccessionNumberToAGAVEXMLResult(in string AccessionNumber);
    };
};


Translation to Java equivalents using the idlj compiler yields a BioEMBLAccess directory containing the following files:

EMBLAccessInterface.java

EMBLAccessInterfaceHelper.java

EMBLAccessInterfaceHolder.java

EMBLAccessInterfaceOperations.java

_EMBLAccessInterfaceImplBase.java

_EMBLAccessInterfaceStub.java


The EMBLAcc.idl operations have been translated into their Java equivalents in the EMBLAccessInterfaceOperations.java file:

public interface EMBLAccessInterfaceOperations
{
    String doAccessionNumberToEMBLFlatFileQuery (String AccessionNumber);
    String doAccessionNumberToEMBLXMLResult (String AccessionNumber);
    String doAccessionNumberToAGAVEXMLResult (String AccessionNumber);
} // interface EMBLAccessInterfaceOperations


EMBLAccessInterface extends the EMBLAccessInterfaceOperations interface. The _EMBLAccessInterfaceImplBase server skeleton is an abstract class that implements EMBLAccessInterface. The _EMBLAccessInterfaceStub client stub implements the EMBLAccessInterface. The EMBLAccessorServant extends the skeleton, as shown below:

class EMBLAccessorServant extends _EMBLAccessInterfaceImplBase

The EMBLAccessorServant implements the three methods in the EMBLAccessInterfaceOperations interface, each of which is responsible for performing a particular query type. All three methods perform accession number queries to retrieve the EMBL entry corresponding to the submitted accession number, but each method returns the results in a different format. The doAccessionNumberToEMBLFlatFileQuery method returns the results in the standard EMBL flatfile format. An example flatfile can be found in Appendix A. The doAccessionNumberToEMBLXMLResult method returns the EMBL entry in an XML format particular to the data integration system. This format is used in the multidatabase query in which an EMBL accession number is used to retrieve a PubMed abstract, since it provides a useful, easily parseable format. An example of this format can also be found in Appendix A. The doAccessionNumberToAGAVEXMLResult method returns an EMBL entry in the AGAVE XML format.

The execution of the doAccessionNumberToEMBLXMLResult method will be used as an example. The method begins by checking the sample mode value. If the sample mode is true, a dummy query result is retrieved from the file system. If it is false, an actual database query is performed. In that case, the method extracts the accession number from the XML-formatted string passed to it by the QueryProcessorServant. It then instantiates an EMBLParser object and passes the accession number to it via a call to the parse method:

s = emblparser.parse(AccessionNumber);

The parse method returns a string that is an XML-formatted version of the EMBL entry corresponding to the accession number. To obtain the entry, it uses the Java URL class to access the EMBL sequence database’s accession number retrieval CGI program, passing the accession number to the program and requesting that the entry be returned in raw (plain text) format:

URL u = new URL("http://www.ebi.ac.uk/cgi-bin/dbfetch?style=raw&id=" + accnum);

The query result returned by the CGI program is read from an input stream and placed in a string. This result string is then parsed based on EMBL entry specifications into an XML representation. This string is returned to the EMBLAccessorServant, which in turn returns the result to the QueryProcessorServant.
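The read-and-parse step can be sketched as follows. The sketch relies on the general layout of EMBL flatfiles, in which each line starts with a two-letter line code, "XX" lines act as spacers, and "//" ends the entry. The XML produced here, with an entry root element and the line code used as the tag name, is an assumption and is not the format shown in Appendix A. The class and method names are hypothetical. In the prototype the Reader would wrap the input stream obtained from the URL; the sketch reads from a string so that it runs without network access.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical illustration of turning flatfile lines into tagged XML.
final class FlatFileToXmlSketch {

    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    static String toXml(Reader source) {
        StringBuilder xml = new StringBuilder("<entry>\n");
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith("//")) {
                    break; // end of the entry
                }
                if (line.length() < 6) {
                    continue; // spacer ("XX") or empty line
                }
                String code = line.substring(0, 2);
                if (code.trim().isEmpty() || code.equals("XX")) {
                    continue; // lines without a line code (such as sequence data) are skipped
                }
                // The data portion starts after the code and its padding.
                String data = line.substring(5).trim();
                xml.append("<").append(code).append(">")
                   .append(escape(data))
                   .append("</").append(code).append(">\n");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return xml.append("</entry>").toString();
    }

    public static void main(String[] args) {
        // Abbreviated, illustrative lines; not the actual entry from Appendix A.
        String flatfile = "ID   TRBG361\nXX\nRX   MEDLINE; 91322517.\n//\n";
        System.out.println(toXml(new StringReader(flatfile)));
    }
}
```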

9.2.3.2. PubMed Accessor

The PubMed database accessor is very similar to the EMBL database accessor. The PubMedAccessor is a CORBA server and the PubMedAccessorServant is its servant. The PubMedAccessor instantiates an ORB and a PubMedAccessorServant, and following this it registers the servant in the naming service using the name “PubMedAccessor.” Like the other CORBA server objects in the system, the PubMedAccessor then waits for method invocations from clients.

The PubMedAccessorServant implements the single operation specified in the PubMedAcc.idl file. The BioPubMedAccess module is declared in this file, and the PubMedAccessInterface is declared within it. This interface consists of one operation, the doUIDToPubMedAbstractQuery operation:

module BioPubMedAccess
{
    interface PubMedAccessInterface
    {
        string doUIDToPubMedAbstractQuery(in string UID);
    };
};


Translation of this IDL file to its Java equivalents yields the following files:

PubMedAccessInterface.java

PubMedAccessInterfaceHelper.java

PubMedAccessInterfaceHolder.java

PubMedAccessInterfaceOperations.java

_PubMedAccessInterfaceImplBase.java

_PubMedAccessInterfaceStub.java


The functions of these files are identical to their EMBL Accessor equivalents described in the previous section. Once again, the servant extends the skeleton:

class PubMedAccessorServant extends _PubMedAccessInterfaceImplBase

The PubMedAccessorServant implements the doUIDToPubMedAbstractQuery method. This method takes a PubMed citation Unique Identifier (UID) and returns the corresponding abstract. This is accomplished in a straightforward manner. A URL object is used to access the PubMed CGI program responsible for querying the PubMed database, and the GET method is used to submit query data to the program. The resulting input stream is read into a string. This PubMed data is formatted into XML by enclosing it in meaningful tags, and this result is returned to the QueryProcessorServant, thus completing the query dictated by the QueryProcessorServant.
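The three steps described above (building a GET request, reading the response stream into a string, and enclosing the text in tags) might be sketched as follows. The text does not give the address or parameter names of the PubMed CGI program, so the base URL passed to buildGetUrl and the uid parameter name are placeholders, and the pubmedabstract and abstract tag names are assumptions. The class and method names are hypothetical as well.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the accessor's three steps.
final class PubMedAccessorSketch {

    // Step 1: place the query data after the URL, as the GET method requires.
    static String buildGetUrl(String cgiBaseUrl, String uid) {
        try {
            return cgiBaseUrl + "?uid=" + URLEncoder.encode(uid, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Step 2: read the whole response stream into a string.
    static String readAll(InputStream in) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Step 3: enclose the returned text in descriptive tags.
    static String wrap(String uid, String abstractText) {
        return "<?xml version=\"1.0\"?>\n<pubmedabstract>\n"
                + "<uid>" + escape(uid) + "</uid>\n"
                + "<abstract>" + escape(abstractText.trim()) + "</abstract>\n"
                + "</pubmedabstract>";
    }

    public static void main(String[] args) {
        // A byte array stands in for the stream returned by the CGI program.
        InputStream response = new ByteArrayInputStream(
                "Stub abstract text\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(buildGetUrl("http://example.org/cgi-bin/query", "91322517"));
        System.out.println(wrap("91322517", readAll(response)));
    }
}
```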

(Continued on page 3)
