Biological Database Integration - XML Background
Author: Deron Eriksson
Description: Integration of Data from Heterogeneous Biological Databases using CORBA and XML - XML Background.


6. XML Background

Extensible Markup Language (XML) is a tag-based standard for hierarchically organizing textual data. XML is useful to this project because it offers a standard way to represent data communicated between objects within the federated data integration system. Since all communication is standardized in this fashion, data can be formatted based on particular tags, and data can be extracted through parsing based upon these tags.

XML is a specification derived from the more complex Standard Generalized Markup Language (SGML). The current XML specification can be found at [EXT, 2002]. Like SGML, XML governs how data is organized in a text document: it specifies rules for document structure and allows customized tags and attributes to be created, with tags demarcating the data. XML’s tag structure resembles that of Hypertext Markup Language (HTML), but XML itself does not specify what the tags in a document mean. XML’s structural rules are strict: a structural problem such as a missing end tag means the document is not well-formed, and a conforming parser will reject it.
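The strictness of these structural rules can be observed with any conforming parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The helper name is_well_formed and the sample <sequence> markup are assumptions made for illustration and are not part of this project.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical markup: the second document is missing its </name> end tag.
complete = "<sequence><name>example</name></sequence>"
broken = "<sequence><name>example</sequence>"

print(is_well_formed(complete))
print(is_well_formed(broken))
```

The parser does not attempt to repair the second document; it reports an error, and the caller decides how to handle it.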

An important distinction can be made between a well-formed XML document and a valid XML document. According to [MCL, 2000]: “A well-formed document is one in which all XML syntax rules are followed, and all elements and attributes are correctly positioned. However, a well-formed document is not necessarily valid, which means that it follows the constraints set upon a document by its DTD or schema.” DTD stands for document type definition, a description of the constraints that a particular type of XML document must follow. An XML Schema serves a similar purpose.
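To make the distinction concrete, the sketch below parses two well-formed documents and then checks each against a single hand-written constraint. This check is only a stand-in for real DTD or schema validation, which Python's standard library parser does not perform. The rule itself (every <sequence> element must contain exactly one <residues> child) and all element names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def satisfies_constraint(root):
    """Hand-written stand-in for a DTD rule: each <sequence> element
    must contain exactly one <residues> child."""
    return all(
        len(seq.findall("residues")) == 1 for seq in root.iter("sequence")
    )


# Both documents are well-formed, so both parse without error.
doc_ok = ET.fromstring(
    "<entry><sequence><residues>acgt</residues></sequence></entry>"
)
doc_bad = ET.fromstring("<entry><sequence/></entry>")

print(satisfies_constraint(doc_ok))   # constraint met
print(satisfies_constraint(doc_bad))  # well-formed, but constraint violated
```

In practice such constraints would be declared once in a DTD or XML Schema and enforced by a validating parser rather than coded by hand.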

For XML to serve as a standard for molecular biological data, a DTD or XML Schema must be defined that describes what particular XML tags mean and how they are to be used. A DTD can be embedded in an XML document or placed in a separate file. Three such DTDs are BSML (Bioinformatic Sequence Markup Language), BioML (Biopolymer Markup Language), and BlastXML. BSML offers a straightforward way of representing sequence data and related data.

XML tags can be used to give meaning to data. Barillot and Achard [BAR, 2000] illustrate how XML tags, unlike HTML markup, can attach meaning to biological data: “Of course, tags can then be analysed to perform intelligent queries against XML documents. In our previous example regarding the gene rap, one could tag every occurrence of the word used in the genomic context to differentiate it from the word rap in the context of music, for example, <gene>rap</gene>. This type of information can be used by the query engine. From a string of characters in HTML, we have now a concept in XML. The advantage of XML over HTML is clear from this example: it adds parseable (i.e. information that is extractable by a computer program) semantics to a document. The role of XML is to alleviate the task of the computer program, which cannot yet model the natural language, by providing some hints for comprehension.”
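The difference between searching a string of characters and querying a tagged concept can be sketched as follows. The <gene> tag comes from the quoted example; the surrounding <abstract> and <sentence> elements and the sample sentences are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: "rap" appears once as a gene name and once as plain text.
text = (
    "<abstract>"
    "<sentence>The <gene>rap</gene> gene was studied.</sentence>"
    "<sentence>The author also listens to rap.</sentence>"
    "</abstract>"
)
root = ET.fromstring(text)

# A plain string search counts every occurrence, whatever its meaning.
string_hits = text.count("rap")

# A tag-aware query returns only the occurrences marked up as genes.
gene_hits = [g.text for g in root.iter("gene") if g.text == "rap"]

print(string_hits, len(gene_hits))
```

The string search cannot tell the two uses of the word apart, whereas the tag-aware query returns only the genomic one.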

XML offers a hierarchical standard for data storage and communication, much as tables provide a standard structure for data in relational databases. XML’s tag format makes it very useful on the Internet, since it can be parsed by programs and is easily understood by those familiar with HTML. “… XML does more than just add semantics to the Web; it is also a powerful language for data interconnection. The flexibility and independence of XML with regards to computer operating systems makes it a universal hub between databases. In fact, several database-management systems already offer XML interfaces and all of them are integrating XML into their development plans. Other solutions, such as the common object request broker architecture (CORBA; http://www.corba.org), are already being used for data integration, but programming CORBA servers and applications requires the skills of highly trained software engineers. With its simplicity of deployment, XML is more dedicated to the users, whereas CORBA is a complex solution deployed by and for computer scientists (although it is powerful and useful in many other cases too)” [BAR, 2000].

Nucleic acid sequences and closely related information regarding those sequences are typically stored as flat files in databases such as the EMBL sequence database. A line of data in an EMBL flat file is described by the first few characters on that line. These characters serve the same purpose as XML tags in that they describe the data content on that line. If XML becomes a prevalent bioinformatics data standard, sequence data and other types of bioinformatics data may be stored in XML form. XML standardization of biological data formats could be a significant step in simplifying data integration. XML would be very useful in database federations, since it could place all biological data in a standard, easily parsed format and so greatly simplify the role of translators in those federations. This standardization could also simplify the data format conversions that are performed during the construction of data warehouses.
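The correspondence between flat-file line codes and XML tags can be sketched in a few lines. The record below is made up in the general style of an EMBL flat file (two-letter line codes, with "//" ending the entry) and is not real data. The function name flat_to_xml, the <entry> wrapper element, and the choice to reuse each line code directly as an element name are all assumptions made for illustration and do not follow BSML or any other published DTD.

```python
import xml.etree.ElementTree as ET

# Made-up record fragment in the style of an EMBL flat file; not real data.
flat_record = """\
ID   X00001; hypothetical entry
AC   X00001;
DE   Illustrative description line one
DE   and line two
//"""


def flat_to_xml(record):
    """Wrap each flat-file line in an element named after its line code."""
    entry = ET.Element("entry")
    for line in record.splitlines():
        if line.startswith("//"):  # "//" terminates an entry
            break
        # The line code is everything before the first space.
        code, _, content = line.partition(" ")
        field = ET.SubElement(entry, code)
        field.text = content.strip()
    return entry


entry = flat_to_xml(flat_record)
print(ET.tostring(entry, encoding="unicode"))
```

A real translator would also need to handle the sequence lines and feature tables, but the principle is the same: the line code becomes the tag.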

Storing biological data in XML form may have a large impact on biological data integration. “In summary, the hypertext approach to navigation is a fast way to achieve limited functionality. However, because such systems cannot accept complex queries from a network, and because the hypertext mark-up language (HTML) pages that they use do not contain enough structure to allow the computer to automatically extract and compute with individual data elements, these systems will ultimately inhibit the types of complex queries that can provide a quantum jump in biological analysis” [KAR, 1996]. Unlike HTML, XML does contain the structure necessary for automated data extraction, which allows queries to be performed against XML-stored data. XML can therefore support web-based manual navigation of biological data as well as automated navigation, searching, and retrieval, and it can also be useful in data warehouses and federations.