Biological Database Integration - Molecular Biology Background
Author: Deron Eriksson
Description: Integration of Data from Heterogeneous Biological Databases using CORBA and XML - Molecular Biology Background.


3. Molecular Biology Background

A few basic terms from molecular biology are needed to understand the concepts described in this work. Because the data integration system analyzes biological data, understanding the syntax and semantics of that data is important to understanding the system itself. This section describes the relationship between DNA, RNA, and protein, and relates protein structure to protein functionality. A gene located on a DNA strand is transcribed into RNA, which in turn is translated into protein. A protein molecule’s three-dimensional shape determines how the protein acts within a cell. Thus, genes in combination with environmental factors determine what occurs in a cell.

3.1 DNA

Humans and other higher organisms are made of cells, and each “normal” cell contains a nucleus, a separate compartment within the cell. Each nucleus holds long, twisted, double-stranded molecules of deoxyribonucleic acid (DNA). DNA and special protein molecules bind together to form tightly coiled structures called chromosomes, which can be viewed under a microscope. DNA contains the genetic blueprint of an organism. It is replicated and passed on to each daughter cell when a cell divides, so that the organism’s genetic code is handed down through every subsequent division. The complete set of a normal cell’s DNA, which contains all of the instructions required for an organism to live and function, is called the organism’s genome. The human genome consists of a sequence of over three billion bases. A slight amount of variation in the human genome is responsible for the genetic differences between people: “Looking in more depth at the human genome we already know about the 0.1% sequence variation that exists between individual genomes” [DUN, 2000]. Bacteria do not contain nuclei; their DNA exists within each bacterium but is not enclosed in a separate compartment.
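As a rough back-of-the-envelope check, the genome size and the quoted variation rate can be combined. The sketch below assumes a round genome size of exactly three billion bases and applies the 0.1% figure uniformly; both are simplifications made for illustration, and the function name is hypothetical.

```python
# Rough estimate of how many base positions differ between two human genomes.
# Assumptions (illustrative only): a round genome size of 3 billion bases and
# a uniform 0.1% variation rate, i.e. one differing position per thousand.
GENOME_SIZE_BASES = 3_000_000_000
VARIATION_ONE_IN = 1000  # 0.1% expressed as "1 in 1000" to keep integer math


def estimated_differences(genome_size: int, one_in: int) -> int:
    """Return the approximate number of differing base positions."""
    return genome_size // one_in


if __name__ == "__main__":
    print(estimated_differences(GENOME_SIZE_BASES, VARIATION_ONE_IN))
```

Under these assumptions, two individuals would differ at roughly three million base positions.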

DNA is made up of repeating units called nucleotides, each of which consists of three subunits: a sugar, a phosphate, and a nitrogenous base. DNA uses four different bases: adenine, thymine, cytosine, and guanine. In DNA, these nucleotides are arranged in a particular order, which is referred to as a sequence. DNA sequences are in essence the sentences that tell cells what to do, and like a sentence, the order of the “words” is responsible for the meaning of the sequence. Changing the order changes the meaning. DNA sequences are represented as strings of the letters A, T, C, and G, corresponding to the nitrogenous bases making up the sequence.
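The string representation described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the data integration system; the function names and the two example sequences are made up for the purpose.

```python
from collections import Counter

# The four letters used to write a DNA sequence.
DNA_ALPHABET = frozenset("ATCG")


def is_valid_dna(sequence: str) -> bool:
    """Return True if the string is non-empty and uses only A, T, C, and G."""
    return bool(sequence) and set(sequence.upper()) <= DNA_ALPHABET


def base_counts(sequence: str) -> Counter:
    """Count how many times each base letter occurs in the sequence."""
    return Counter(sequence.upper())


if __name__ == "__main__":
    first = "ATGGCA"   # hypothetical example sequence
    second = "ATGCGA"  # same letters, different order
    print(is_valid_dna(first))
    print(base_counts(first) == base_counts(second))  # same composition
    print(first == second)                            # but not the same sequence
```

The two example strings contain exactly the same bases, yet they are different sequences, which is the sense in which order carries the meaning.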

3.2 Protein

A gene is a sequence of DNA that codes for a protein. Gene identification is currently a very active area of research. Proteins are the main workhorses of the cell, responsible for performing various cellular processes such as facilitating chemical reactions. Genes on DNA are copied into an intermediate molecule called ribonucleic acid (RNA), a long single-stranded molecule similar to DNA, except that its sugar is ribose rather than deoxyribose and thymine is replaced by another base called uracil. This process of producing RNA from DNA is known as transcription. In higher organisms the initial transcript is often processed (for example, intervening segments are spliced out), so the final RNA sequence does not always correspond to one contiguous stretch of DNA. RNA (specifically, messenger RNA) is transported out of the nucleus. A messenger RNA molecule is used as a template to code for a polypeptide chain, which is a strand of subunits called amino acids. This process of going from RNA to an amino acid chain is called translation. Each set of three nucleotides of the RNA sequence is called a codon. Each codon specifies one amino acid, except for the stop codons that signal the end of the chain, and the amino acids are strung together in a strand. Twenty different amino acids are used.
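Transcription and translation, as described above, can be sketched as simple string operations. The example below is a deliberately simplified illustration: it assumes the input is the coding strand of a gene with no processing step, it uses the standard genetic code with one-letter amino acid abbreviations, and the function names and example sequence are hypothetical.

```python
from itertools import product
from typing import List

# Standard genetic code, listed with the bases in the order U, C, A, G.
# Each letter is a one-letter amino acid code; "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """Simplified transcription: rewrite the coding strand with U in place of T."""
    return coding_dna.upper().replace("T", "U")


def split_codons(rna: str) -> List[str]:
    """Split an RNA string into complete three-letter codons."""
    usable_length = len(rna) - len(rna) % 3
    return [rna[i:i + 3] for i in range(0, usable_length, 3)]


def translate(rna: str) -> str:
    """Translate codon by codon, stopping at the first stop codon."""
    chain = []
    for codon in split_codons(rna):
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        chain.append(amino_acid)
    return "".join(chain)


if __name__ == "__main__":
    dna = "ATGGCATGGTAA"          # hypothetical toy gene
    rna = transcribe(dna)
    print(rna)                    # AUGGCAUGGUAA
    print(split_codons(rna))      # ['AUG', 'GCA', 'UGG', 'UAA']
    print(translate(rna))         # MAW
```

Real transcription reads the complementary template strand and may be followed by processing; the sketch skips both steps to keep the DNA-to-RNA-to-protein mapping visible.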

Due to the chemical properties of the amino acids and factors within the cell, these polypeptide strands twist and turn, often bonding with other particular polypeptide strands. They take on characteristic three-dimensional shapes, and these molecules are called proteins. The set of all proteins of an organism is known as a proteome. Proteins are very complicated, and their shapes and chemical characteristics are responsible for their particular properties. For example, enzymes are proteins that catalyze chemical reactions, meaning that they lower the activation energy required for a particular chemical reaction to occur. They can accomplish this by attracting the chemical reactants and holding them in a particular orientation so that the reaction can more easily occur.

3.3 Gene Sequence Determines Protein Functionality

Small changes to a gene’s DNA sequence can have drastic effects on the structure of a protein coded for by the gene, since the chemical properties of each amino acid influence the structure of the protein. Since a protein’s three-dimensional structure is responsible for its functionality, alterations in structure can change how a protein acts, which can significantly affect how an organism functions as a whole. A single nucleotide polymorphism (SNP) is a difference at a single base position between an individual’s genome and the reference genome for the species. SNPs are usually understood as substitutions of one base for another; the insertion or deletion of a single base is generally classified separately, although some sources group these changes together. Sickle-cell anemia is an example of a disease that results from a single-base change. This condition is caused by a point mutation in which a thymine has replaced an adenine in the gene that codes for the beta subunit of hemoglobin. This causes red blood cells to form shapes resembling sickles. The body removes these misshapen cells too rapidly, causing a reduction in the amount of oxygen that can be carried in the bloodstream.
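The effect of a single base substitution on the encoded amino acids can be illustrated with a short sketch. The DNA fragment below is a toy example chosen to contain a GAG codon; it is not presented as the actual hemoglobin gene sequence, and the function names are hypothetical. The only facts relied on are the standard genetic code, in which GAG codes for glutamic acid (E) and GTG codes for valine (V).

```python
from itertools import product

# Standard genetic code written with DNA letters, bases in the order T, C, A, G.
# Each letter is a one-letter amino acid code; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
DNA_CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_dna(coding_dna: str) -> str:
    """Translate a coding-strand DNA string codon by codon (no stop handling)."""
    usable_length = len(coding_dna) - len(coding_dna) % 3
    return "".join(
        DNA_CODON_TABLE[coding_dna[i:i + 3]] for i in range(0, usable_length, 3)
    )


def substitute(sequence: str, position: int, new_base: str) -> str:
    """Return a copy of the sequence with one base replaced (0-based position)."""
    return sequence[:position] + new_base + sequence[position + 1:]


if __name__ == "__main__":
    normal = "CCTGAGGAG"                 # toy fragment, codons CCT GAG GAG
    mutant = substitute(normal, 4, "T")  # A -> T turns the middle codon into GTG
    print(normal, translate_dna(normal))  # CCTGAGGAG PEE
    print(mutant, translate_dna(mutant))  # CCTGTGGAG PVE
```

A change of one letter out of nine swaps one amino acid for another with different chemical properties, which is the kind of change that can alter how the folded protein behaves.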

Much like DNA sequences, protein subunits can be represented by linear sequences of letters representing the twenty amino acids. Very large DNA and protein sequence databases currently exist and are expanding rapidly. Because a protein’s shape and functionality are central to understanding it, data on protein shape and functionality are also being rapidly determined and placed into databases. Many other types of biological data are being placed in databases as well.

This section has given a very cursory introduction to molecular biology. It defines several terms such as DNA sequences and genomes, covers the relationships among DNA, RNA, and protein, and describes the role of protein structure in protein functionality. The Department of Energy’s Primer on Molecular Genetics [PRI, 2001], available in HTML and PDF formats, offers a more thorough introduction to molecular biology.