Biological Database Integration - Data Integration Issues
Author: Deron Eriksson
Description: Integration of Data from Heterogeneous Biological Databases using CORBA and XML - Data Integration Issues Relating to Bioinformatics Databases.
7.6. Database Heterogeneity

Heterogeneity takes many forms and arises at many levels when integrating data from multiple databases. The differences in data types and formats discussed earlier are themselves forms of heterogeneity. El-Khatib et al. [ELK, 2000] offer the following classification of database heterogeneity:

“1. Naming heterogeneity. This occurs when the same values are stored in different databases but the names given to the attributes are different in different systems. These can be handled by a simple (syntactic) attribute transformation of the query.
2. Relational structure heterogeneity. When the composition of elementary attributes into composite structures varies but once again values stored are identical. This can be handled by a (syntactic) relational transformation of the query.
3. Value heterogeneity. In this case the way in which values are represented is different in different databases. This may involve type and value transformations.
4. Semantic heterogeneity. This is the most difficult form to deal with as in this case the data stored in different databases embody different assumptions, e.g. in what they represent or in how they have been captured.
5. Data model heterogeneity. Here the data model itself is the issue and transformations between data models and differences between them are relevant.
6. Timing heterogeneity. This concerns the changes over time in the structure of a database, the representation of attributes and the values themselves. Basically, almost any difference from each of the preceding categories, which can occur between databases, may also arise within a single database if it changes with time.”

These heterogeneity categories may be combined, and they may also be broken down into subcategories. For instance, naming heterogeneity may be broken down into attribute synonyms, relation synonyms, attribute homonyms, relation homonyms, and attribute-relation homonyms [ELK, 2000].
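The two simplest categories, naming and value heterogeneity, can be sketched as table-driven transformations. The following Python sketch is illustrative only: the source names (db_a, db_b), the attribute names, and the choice of daltons as the shared unit are all hypothetical, and the transformation is applied to returned records rather than to the query itself, purely for brevity.

```python
# Hypothetical attribute-name mappings: each source's attribute names are
# mapped onto one shared vocabulary (naming heterogeneity).
ATTRIBUTE_MAP = {
    "db_a": {"gene_name": "name", "mol_weight_da": "molecular_weight"},
    "db_b": {"symbol": "name", "mw_kda": "molecular_weight"},
}

# Hypothetical value transformations: factors that convert each source's
# stored value into the shared unit, daltons (value heterogeneity).
UNIT_FACTOR = {
    "db_a": {"molecular_weight": 1.0},     # already stored in daltons
    "db_b": {"molecular_weight": 1000.0},  # kilodaltons -> daltons
}


def to_common_record(source, record):
    """Rename attributes and rescale values into the shared representation."""
    common = {}
    for attr, value in record.items():
        # Attributes without a mapping keep their original name.
        name = ATTRIBUTE_MAP[source].get(attr, attr)
        factor = UNIT_FACTOR[source].get(name)
        common[name] = value * factor if factor is not None else value
    return common


record_a = to_common_record("db_a", {"gene_name": "X1", "mol_weight_da": 52000.0})
record_b = to_common_record("db_b", {"symbol": "X1", "mw_kda": 52.0})
```

Under these assumptions both records end up with the same attribute names and the same unit, so they can be compared directly. Semantic heterogeneity cannot be handled by lookup tables of this kind, because there the underlying concepts differ and not just their names or representations.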
A synonym is the same concept described under different names in two or more databases, and a homonym is the same name used for different concepts in different databases. Data errors are a form of heterogeneity that is difficult to categorize in a general way [ELK, 2000].

A clear example of naming heterogeneity may be found in [STE, 1998]: “The simplest, but in many ways the most profound, differences are conflicts in the way concepts are named. For excellent reasons, the term locus may signify, in a particular database, a complementation group, a gene, a DNA sequence encoding a transcript, or a cytogenetic position. In each case, the term locus is well-defined and consistent within the database, but it is meaningless to attempt to compare locus objects that have been retrieved from two or more databases.”

Karp [KAR, 1996] gives the following example of value heterogeneity: “A relatively simple example is units of measure: two protein DBs might list molecular weights in daltons and in kilodaltons, respectively, but when combined into a single warehouse, all measurements must be transformed into common units.” Data transformations such as this are straightforward to implement.

Karp also describes an example of semantic heterogeneity, an extremely difficult type of heterogeneity to deal with when integrating data from multiple databases: “A more subtle example of semantic heterogeneity is the fact that GenBank and SWISS-PROT use different notions of what a sequence is: a nucleic acid sequence held in GenBank corresponds with an observation published in the scientific literature – two publications that report a sequence for the same gene will receive two separate GenBank records; whereas an amino acid sequence held in SWISS-PROT represents a consensus view of that protein as reflected in many publications. The Entrez warehouse contains these two types of biological sequences side by side, but does not attempt to transform them into a unified ontology.
The same situation is probably true for all sequence warehouses, owing to the extreme difficulty of performing transformations automatically” [KAR, 1996]. Semantic heterogeneity is especially an issue in bioinformatics data warehouses, where huge quantities of changing data are involved.

[KAR, 1996] also contains an excellent example of data model heterogeneity involving PIR, a protein sequence database, and Entrez, a data warehouse formed from many databases including PIR: “For example, PIR and Entrez use different data models, so all PIR data must be transformed into the Entrez model. In cases where the data model of the warehouse is not as powerful as the data model of the data source, information can be lost during transformation; unfortunately, bioinformatics warehouse projects do not typically document which information their converters lose” [KAR, 1996].

7.7. Data Errors

Errors are a normal and expected part of biological data, as in the earlier example of automated DNA sequencing. Biological data over groups of organisms or properties usually require associated statistical information in order to be meaningful, and statistics deals not with certainties but with probabilities. Relational databases designed for on-line transaction processing of ordinary business data typically store data in its smallest parts, and statistics for such data are computed by performing aggregate operations on the individual data items. However, certain types of raw biological data, such as genome trace files, may be so immense that the data may need to be processed first, and the processed values, along with statistical information about them, should be placed in the database instead of the raw data.

Errors in biological data can result from many factors, including human error and laboratory techniques that simply don’t give absolutely correct results.
These errors are propagated into databases when the data are submitted. “Two examples of errors found in the GenBank database are the inclusion of 0.36% vector-contaminated sequences in release 95–96 and a 10–20% rate of erroneous annotation of entries as ‘genes’. The current standard for accuracy of genomic sequences is less than one error per ten thousand bases; older data have a higher sequence error rate” [BRU, 2000]. Genomic sequence errors are generally the result of sequencing equipment that gives excellent but not perfect results, since sequencing is usually based on probabilities (i.e., the probability that a particular base appears in a particular position, based on fluorescent tagging in an electrophoresis process).

Brusic et al. [BRU, 2000] describe problems associated with automated annotations: “When the entries of the SLAD database were extracted from public databases, analysed and crosschecked with source publications, approximately 30% of the freshly extracted entries were found to contain at least one serious error as identified and subsequently re-annotated by a human expert. Automatic annotation of database entries while fast, tends to proliferate existing errors and introduce new errors. The overall error rates in functional assignment of protein sequences have been estimated to be 2.5–5% for ‘clear’ cases. The implication, in particular for the development of specialist databases, is the need for careful annotation, including human expert assessment of each entry.”

In settings such as biological data warehouses, the quantity of data makes it virtually impossible for human experts to manually assess the correctness of every database entry. Biological data errors will have to be dealt with continually because they cannot be entirely eliminated.
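To see why errors cannot be eliminated even when the quoted accuracy standard is met, consider a small arithmetic sketch in Python. It assumes that base-level errors are independent and occur at a uniform rate of exactly one per ten thousand bases. Both are simplifying assumptions made here for illustration, not claims from [BRU, 2000], and the function names are hypothetical.

```python
def expected_errors(sequence_length, error_rate=1e-4):
    """Expected number of erroneous bases in a sequence of the given length."""
    return sequence_length * error_rate


def prob_at_least_one_error(sequence_length, error_rate=1e-4):
    """Probability that a sequence contains one or more erroneous bases,
    assuming each base is wrong independently with probability error_rate."""
    return 1.0 - (1.0 - error_rate) ** sequence_length


# A hypothetical one-megabase sequence at the assumed error rate.
megabase_errors = expected_errors(1_000_000)

# A hypothetical 10,000-base sequence: chance that it is not error-free.
p_flawed = prob_at_least_one_error(10_000)
```

Under these assumptions a megabase of sequence is expected to contain roughly one hundred erroneous bases, and a 10,000-base sequence is more likely than not to contain at least one.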
[KAR, 1996] discusses problems with links between data entities in databases, such as stale links, in which a source entity’s link points to a target entity that no longer exists. Stale links can result from many factors, including physical DB reorganizations, forking of objects, retirement of objects, and merging of objects. They are particularly an issue in bioinformatics because of the high rate of change in biological data. Other link problems include incorrect links, which often come into existence through automated link inference. Unclear link semantics is another problem mentioned: a link exists, but it is not clear what the link means or what the results returned through it mean. Links, including hypertext links, are an important way of integrating data from different databases, so link errors can significantly impede integration efforts.

7.8. Interdisciplinary Field

The interdisciplinary nature of bioinformatics requires the cross-application of knowledge from many different fields to develop innovative solutions to the pressing questions in bioinformatics. Researchers trained in biology often lack the computer science training necessary to develop solutions to the problems that they face, and computer scientists who have the necessary computer training may lack the biological background necessary to develop adequate solutions that can be used comfortably by biologists.
As an example, a biologist should be able to ask a question in a biological form, such as, “What are the protein kinase genes on human chromosome 4?” The query that would actually fetch the data from biological databases could resemble the following, written in the Object Protocol Model multi-database query language (OPM*QL) and taken from [MAR, 1997]:

    SELECT GDB:Gene.displayName, GDB:Gene.accessionID, Feature.products.name
    FROM GSDB:Feature, GDB:Gene
    WHERE Feature.products.name MATCH "%protein kinase%"
    AND Feature.genes.gdb_xref = GDB:Gene.accessionID
    AND GDB:Gene.mapElements.map.chromosome.displayName = "4";

As another example, the biological question “What are the motifs that are components of guppy proteins?” has the following equivalent in the Collection Programming Language (CPL), from [PAT, 1998]:

    {m | \p<-get-sp-entry-by-os("guppy"), \m<-do-prosite-scan-by-entry-rec(p)}

Since bioinformatics data systems are typically used by biologists, a solution should allow them to ask questions in a form they can readily understand, and the system should translate that form into one that can actually be used to query or integrate data, as in the OPM and CPL examples. Likewise, the results returned by a query should be processed into a form that biologists can understand, since results should be comprehensible to end users.

Benton [BEN, 1996] gives an example of the cross-disciplinary problems faced by bioinformatics: “Bioinformatics occupies the interface between biology, computer science, applied mathematics, statistics, and computer and software engineering.
This interface is also the residence of a new culture gap but, in this case, it appears to be a three-culture problem, with biologists wanting immediate solutions to their data-management and analysis problems, computer scientists and mathematicians seeking interesting basic research problems, and the engineers asking both groups for a sufficiently well-defined specification for them to get on with building something useful. These problems are not unique to bioinformatics, but plague interdisciplinary computational science in general. The bioinformatics culture-gap has two principal sources: significant differences in the vocabularies and modalities of scientific approach between the three groups; and an underestimation on all sides of the effort required to bridge the gap.”

Even communication between biologists and computer scientists at a very basic level can be difficult and confusing, since what is considered basic knowledge in one field may be completely foreign to the other. Matthews [MAT, 2000] gives a clear example of a term common to both computer science and biology that has a completely different meaning in each field: “Just getting those involved in the code-breaking effort to understand everyone else is a problem right now, says Birney. He recalls a recent conference where a molecular biologist stood up in front of an audience of computer science people and asked whether she should explain what a vector was. ‘No one said anything, so she went ahead,’ recalls Birney. ‘It took the computer scientists quite a while to work out that what she meant by a vector - a means of gene transfer, like a virus - was totally different from what they mean by it, namely an array of numbers.’”

Many bioinformatics books have appeared recently, and bioinformatics programs have been started at several universities, which no doubt will help people gain a more thorough interdisciplinary knowledge of the field.