Biological Database Integration - Computer Technology
Author: Deron Eriksson
Description: Integration of Data from Heterogeneous Biological Databases using CORBA and XML - The Role of Computer Technology in Bioinformatics.
4. The Role of Computer Technology in Bioinformatics
Bioinformatics is critically dependent upon computer science and related technologies. Benton [BEN, 1996] states, “Bioinformatics stands on the foundations of computational science and engineering, and applied mathematics, and depends on large stores of both experimental and derived data.” Benton describes many computer technologies and their relationships to bioinformatics applications. Both software and hardware play critical roles in bioinformatics. “The amount of information available is growing exponentially. This has been largely because of an increasing sophistication in cloning and sequencing techniques, but also because of the ever increasing development of computing software and hardware coupled with decreasing costs” [SAN, 2000].
4.1. Processing Power
Perhaps the most important computer-related factor that has contributed to the growth of bioinformatics has been the increase in computer processing power coupled with decreasing costs. Inexpensive, powerful data processing makes tasks such as complex biological algorithmic computations possible. Powerful microprocessors can influence bioinformatics in other ways too. “Arguably, it was only Intel's development of the Pentium microprocessor that allowed Applied Biosystems and Amersham Pharmacia to create DNA sequencing machines capable of unravelling the genome 4 years ahead of original forecasts” [STOK, 2001].
The application of tremendous processing power to biological computations is clearly shown in the example of IBM’s Blue Gene supercomputer. “Twelve to fifteen times more powerful than today's top supercomputer, Blue Gene houses a million processors, each capable of performing a billion operations per second… Blue Gene's first assignment will be to tackle one of biology's toughest computational problems: Predicting the structure of a protein from its building blocks -- complicated strings of amino acids that contain thousands of atoms. When these molecules are formed in a cell, they fold themselves into exactly the right configuration in a matter of seconds. But with large proteins, no existing computer is powerful enough to predict the exact pattern of folds” [LIC, 2001].
The growth of networks, networking standards, and the Internet has had a revolutionary effect on the sharing of data and knowledge, and thus a profound effect on bioinformatics, a field focused on biological data acquisition, management, and analysis. The TCP/IP networking standard was developed in the 1970s and was incorporated into Berkeley Software Distribution (BSD) UNIX, version 4.2, after which it became the standard networking protocol for the Internet. Open networking standards play a critical role in access to data over networks. Before the TCP/IP standard was established, computerized network communication could be exceedingly difficult, since lower-level network programming could be required for communication to proceed. An alternative to open standards is to use proprietary networking packages such as CICS, but this ties network communications to proprietary technology and can be platform-dependent. As an example of the problems caused by the lack of a networking protocol standard, the planning of the Distributed INGRES database system was greatly hindered by the lack of UNIX networking software, which did not appear until BSD 4.2. The TCP/IP-based Internet, and in particular the World Wide Web, has made a staggering amount of biological data available. The graphical nature of the Web allows data to be exchanged in a straightforward, easy-to-use manner. This is very useful for biologists who may not be versed in some of the more esoteric computer communication methods designed by computer scientists for computer scientists.
Network programming technologies have been developed that allow for flexible and powerful distributed bioinformatics applications. Programming directly with sockets has traditionally been a complicated way of writing network applications, although socket programming in Java is relatively easy compared to socket programming in C or C++. CORBA (Common Object Request Broker Architecture) and Microsoft’s DCOM (Distributed Component Object Model) are powerful but complex technologies for distributed computing. Java RMI (Remote Method Invocation) offers a fairly simple way of writing distributed applications. Java RMI is similar to CORBA but limits the implementation language to Java. Other Java technologies, such as applets and servlets, offer further ways to program across networks. These topics are investigated in greater detail in the “Distributed Computing Paradigms” section.
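To give a sense of what programming via sockets involves, the following is a minimal sketch in Python (chosen here only for brevity; it is not one of the languages discussed above). It assumes a hypothetical "sequence service" that receives a DNA string over TCP and returns its reverse complement. The function names, the loopback address, and the service itself are illustrative assumptions, not part of any system described in this text.

```python
import socket
import threading

# Translation table mapping each DNA base to its complement.
COMPLEMENT = bytes.maketrans(b"ACGT", b"TGCA")


def reverse_complement(dna: bytes) -> bytes:
    """Complement every base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]


def read_all(sock: socket.socket) -> bytes:
    """Read from the socket until the peer signals end-of-stream."""
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)


def serve_once(server: socket.socket) -> None:
    """Hypothetical service: handle one client, reply, and close."""
    conn, _addr = server.accept()
    with conn:
        request = read_all(conn)
        conn.sendall(reverse_complement(request))


def reverse_complement_over_tcp(dna: bytes) -> bytes:
    """Start a one-shot server on the loopback interface and query it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        server.listen(1)
        port = server.getsockname()[1]
        worker = threading.Thread(target=serve_once, args=(server,))
        worker.start()
        with socket.create_connection(("127.0.0.1", port)) as client:
            client.sendall(dna)
            client.shutdown(socket.SHUT_WR)  # tell the server the request is complete
            reply = read_all(client)
        worker.join()
    return reply


if __name__ == "__main__":
    print(reverse_complement_over_tcp(b"ATGC"))
```

Even in this small case the programmer has to manage connection setup, message framing (here, closing the write side to mark the end of the request), and teardown by hand. Higher-level approaches such as CORBA and Java RMI are intended to hide this kind of detail behind remote method calls.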
The maturation of database technology has been essential to the growth of bioinformatics. Modern database management systems (DBMSs) can manage vast quantities of data and handle complex operations such as concurrency control and recovery management efficiently. Relational databases are well understood and standardized. Object-oriented databases lack the standardization of relational databases but allow complex objects to be stored under the control of the DBMS. External objects can also be saved in relational databases, for instance as an Oracle BLOB (binary large object) data type. Object-relational databases extend the relational model with object-oriented features. Database structures such as indices can be used for rapid retrieval of biological data. Decreasing memory prices can lead to faster query processing, since more memory allows more database data to be buffered, which reduces the slowdown caused by disk I/O. Celera’s database was essential to the storage and analysis of its human genome sequence data. Data warehousing and database federations are techniques that play important roles in biological data integration.
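The effect of an index on retrieval can be illustrated with a small sketch using Python's built-in sqlite3 module. The table name, column names, accession format, and sample rows below are hypothetical and chosen only for illustration; they do not correspond to any real biological database. The sketch asks SQLite for its query plan with and without an index on the lookup column.

```python
import sqlite3


def build_sequence_db(with_index: bool) -> sqlite3.Connection:
    """Create an in-memory table of made-up sequence records."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sequences (accession TEXT, organism TEXT, residues TEXT)"
    )
    rows = [
        (f"ACC{i:06d}", "Homo sapiens" if i % 2 else "Mus musculus", "ATGC" * 5)
        for i in range(1000)
    ]
    conn.executemany("INSERT INTO sequences VALUES (?, ?, ?)", rows)
    if with_index:
        # The index lets the DBMS locate an accession without reading every row.
        conn.execute(
            "CREATE INDEX idx_sequences_accession ON sequences (accession)"
        )
    return conn


def query_plan(conn: sqlite3.Connection, accession: str) -> str:
    """Return SQLite's description of how it would run the lookup."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT residues FROM sequences WHERE accession = ?",
        (accession,),
    ).fetchall()
    # The human-readable detail is the fourth column of each plan row.
    return " ".join(row[3] for row in rows)


if __name__ == "__main__":
    plain = build_sequence_db(with_index=False)
    indexed = build_sequence_db(with_index=True)
    print("without index:", query_plan(plain, "ACC000500"))
    print("with index:   ", query_plan(indexed, "ACC000500"))
```

Without the index, the plan reports a scan of the whole table; with it, the plan reports a search using idx_sequences_accession. The exact wording of the plan varies between SQLite versions, and other DBMSs expose query plans through their own commands.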