Biological Database Integration - Distributed Computing Paradigms
Author: Deron Eriksson
Description: Integration of Data from Heterogeneous Biological Databases using CORBA and XML - Distributed Computing Paradigms.
5. Distributed Computing Paradigms
Distributed computing allows independent programs located on different hosts on a network to communicate with one another. It is useful for building a system that integrates data from multiple sources for several reasons. To begin with, it allows the resources of multiple computers to be combined to solve data integration problems. For example, processes on different computers can specialize in particular data integration tasks, such as query processing or accessing a particular database. The hardware of each workstation can also be optimized for the task that it performs. If a particular operation is computationally intensive and the problem can be easily subdivided, the processing power of multiple CPUs distributed across a network can be combined to solve it. A cluster of workstations can perform such computationally intensive tasks in place of an expensive mainframe, which might not be able to complete a very large task in an acceptable amount of time.
Distributed computing can be used to harness the processing power of distant computers on a network. For example, complex queries can be performed on databases via web servers, and the results of these queries can be returned via distributed computing technologies. Distributed computing has other advantages compared to computing on a single processor. Although a distributed system has more points of failure, redundancy can be built into it so that if a workstation fails, another workstation can take over its workload. Additionally, if a system requires a significant increase in capabilities, distributed computing offers a mechanism for scaling the system across additional machines.
Liu [LIU, 2002] offers the following description of distributed computing: “A distributed system is a collection of independent computers, interconnected via a network, capable of collaborating on a task. By independent computers it is meant that the computers do not have any shared memory, or program execution spaces. Such computers are called loosely-coupled computers, as opposed to tightly-coupled computers which can share data using common memory space. Distributed computing is the computing performed in a distributed system.”
Web browsing is a familiar example of distributed computing. A client application, a web browser running on a personal computer, can access web pages provided by a web server application running on a server. This server can be located on a distant host on the Internet. The web server may dynamically create a web page, and this creation of the content of the page may require significant computing on the part of the server. The client also needs to perform computations, since it needs to take the web page data that it receives from the server and display it in the correct format in the browser window. In this example, the client and the server are independent computers across a network, and independent computing occurs on both the client and the server. Although independent, the client and server have collaborated to bring the user the desired web page.
Liu [LIU, 2002] ranks distributed computing paradigms in terms of their levels of abstraction. Message passing represents a low level of abstraction, whereas the client-server model represents a higher level of abstraction. Remote method invocation is more abstract than the client-server model, and object request brokers and mobile agents are an even higher level of abstraction. More distributed computing paradigms exist in addition to those mentioned.
The following sections describe several distributed computing technologies and discuss their appropriateness and relevance to the current work. A comparison of these technologies is given in [ORF, 1998].
A time-tested message passing technique for network communication involves sockets. A socket allows a process to talk to another process across a network, typically by opening a socket and then reading data from or writing data to that socket, much as data can be read from or written to a file. Sockets can be used from languages such as C and Java. Socket communication across different languages and different platforms can be difficult. Socket programming typically involves a significant amount of challenging code, although a language such as Java reduces this complexity compared to a language such as C. Because sockets are low level, they add little overhead and are fast.
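The file-like read and write pattern described above can be illustrated with a minimal sketch. Python is used here only for brevity; the function names `serve_once` and `echo_roundtrip`, the loopback address, and the upper-casing "service" are assumptions made for the illustration and are not part of the project.

```python
import socket
import threading


def serve_once(server_sock: socket.socket) -> None:
    """Accept one connection and send back the received bytes, upper-cased."""
    conn, _addr = server_sock.accept()
    with conn:
        data = conn.recv(1024)       # read from the socket, much like a file
        conn.sendall(data.upper())   # write the reply to the same socket


def echo_roundtrip(message: bytes) -> bytes:
    """Start a throwaway local server, send it one message, return the reply."""
    server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_sock.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
    server_sock.listen(1)
    port = server_sock.getsockname()[1]

    # The "remote" process is simulated by a thread on the same machine.
    worker = threading.Thread(target=serve_once, args=(server_sock,))
    worker.start()

    with socket.create_connection(("127.0.0.1", port)) as client:
        client.sendall(message)
        reply = client.recv(1024)

    worker.join()
    server_sock.close()
    return reply
```

Even in this small case, the programmer must decide on ports, buffer sizes, and the meaning of the exchanged bytes, which hints at the low-level detail involved in larger socket-based systems.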
Sockets were not chosen as the communication mechanism for this project for a number of reasons. To begin with, this project follows an object-oriented distributed approach. Although sockets can be used with objects, object-oriented technologies such as CORBA and Java RMI are based on objects and are thus better suited than sockets. Socket programming involves a significant amount of low-level details, so other technologies offer simpler and more elegant solutions. One goal of this project is to allow for language independence and platform independence in system objects. Socket programming across different languages and different platforms could be a very difficult task.
5.2 HTTP and Extensions
TCP/IP (Transmission Control Protocol/Internet Protocol) is the predominant network architecture in operation today and is the protocol suite used by the Internet. It is typically characterized by five layers [STA, 2000], which in order from highest to lowest are: (1) Application layer, (2) Transport layer, (3) Internet layer, (4) Network access layer, and (5) Physical layer.
The Hypertext Transfer Protocol (HTTP), a staple of web-based technology, is a client/server network communication protocol that functions at the application layer of a network, using TCP/IP for transferring data. The application layer handles communication between processes across a network and presents retrieved data to a user. Lower-level network communication details are hidden from the application layer. HTTP is the protocol used by the World Wide Web and, although hypertext is specified in its name, it can be used to transfer other types of data such as graphics, sound, and plain text.
HTTP is a stateless protocol. In a stateless protocol, a server processes each session with a client without knowledge of that client’s previous sessions with the server. Technologies such as cookies have been devised to allow state information to be maintained. A cookie is sent from a web server to a web browser and can contain state information such as a user’s name and address that were entered into a form. The browser can then send this information back to the web server the next time that the user accesses a web site on the server, thus identifying the user based on state information from a previous session.
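The cookie exchange can be sketched with Python's standard `http.cookies` module. This is an illustration only: the helper names `make_set_cookie_header` and `read_cookie_header` and the cookie names used with them are hypothetical.

```python
from http.cookies import SimpleCookie


def make_set_cookie_header(name: str, value: str) -> str:
    """Server side: build the value of a Set-Cookie response header."""
    cookie = SimpleCookie()
    cookie[name] = value
    return cookie[name].OutputString()


def read_cookie_header(header_value: str) -> dict:
    """Server side, later request: parse the Cookie header sent by the browser."""
    cookie = SimpleCookie()
    cookie.load(header_value)
    return {key: morsel.value for key, morsel in cookie.items()}
```

The server itself stays stateless; the state travels with each request, and the server reconstructs it from the header.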
Four basic steps are involved in a request/response transaction between a client and a server using HTTP, as described in [HAR, 1997]. First, the client makes a connection to the server, typically via TCP. The default port is 80, although other ports can be specified. Second, the client requests specific data from the server, such as a web page. According to [RFC, 1999], “A request message from a client to a server includes, within the first line of that message, the method to be applied to the resource, the identifier of the resource, and the protocol version in use.” Third, the server sends its response to the client. [RFC, 1999] details the contents of the response, which include a response code and the requested message. Fourth, the connection between the client and the server is closed.
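The four steps can be traced in a small self-contained sketch that runs a throwaway server on the local machine and talks to it over a raw socket. The handler class `HelloHandler`, the function `fetch`, and the requested path are assumptions for the illustration; the request uses HTTP/1.0 so that the server closes the connection after a single response.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a short text body."""

    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch(path: str) -> str:
    """Perform one request/response transaction and return the raw response."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: any free port
    port = server.server_address[1]
    worker = threading.Thread(target=server.handle_request)  # serve one request
    worker.start()

    # Step 1: the client connects to the server over TCP.
    with socket.create_connection(("127.0.0.1", port)) as conn:
        # Step 2: the request line carries method, resource identifier, version.
        request = f"GET {path} HTTP/1.0\r\nHost: 127.0.0.1\r\n\r\n"
        conn.sendall(request.encode("ascii"))
        # Step 3: read the response until the server stops sending.
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    # Step 4: leaving the with-block closes the client side of the connection.

    worker.join()
    server.server_close()
    return b"".join(chunks).decode("iso-8859-1")
```

The first line of the returned text is the status line containing the response code, followed by headers, a blank line, and the requested message.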
The GET and POST methods are the principal methods that a client can use to request data from a web server, typically via a CGI program or a Java servlet. Submission of form data to a CGI program serves as an example. Each input on a form, such as a text field or a checkbox, can have a name that identifies the input and a value, such as the text entered into the text field. A name-value pair is a character string in which an equal sign separates the name on the left from the value on the right. Spaces are represented by plus signs, and certain other characters are represented by a percent sign followed by the character’s two-digit hexadecimal code. For example, a period can be encoded as %2e. Name-value pairs are joined with ampersands, and the resulting string is called a query string. With the GET method, the query string is appended to the CGI program’s URL in the HTTP request, separated from the program’s name by a question mark, in order to submit the data to the CGI program. If a Java servlet is accessed rather than a CGI program, the servlet’s name is used in place of the program name. A CGI program or Java servlet can use the data in the query string to perform tasks such as database queries.
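Query string construction can be sketched with Python's standard `urllib.parse` module. The helper `build_get_url`, the example URL, and the field names are hypothetical and chosen only for illustration.

```python
from urllib.parse import parse_qs, unquote, urlencode


def build_get_url(program_url: str, fields: dict) -> str:
    """Append an encoded query string to a CGI program or servlet URL."""
    # urlencode joins name=value pairs with '&', turns spaces into '+',
    # and percent-encodes characters that are not safe in a URL.
    query_string = urlencode(fields)
    return f"{program_url}?{query_string}"


url = build_get_url(
    "http://example.org/cgi-bin/search",
    {"gene": "BRCA1", "organism": "Homo sapiens"},
)

# The receiving program reverses the encoding to recover the pairs.
decoded = parse_qs(url.split("?", 1)[1])
```

Note that `urlencode` leaves a period as-is, since it is safe in a URL; the escaped form `%2e` decodes to the same character, as `unquote("%2e")` shows.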
The POST method is similar to the GET method, but the query string is sent to the server in the body of the request rather than appended to the URL. This has advantages in certain situations. To begin with, it makes the submitted data less visible to users, since they do not see the query string appended to the URL as in the case of GET. In addition, very long query strings can cause problems with the GET method, so they should be submitted via the POST method.
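The difference between the two methods is easiest to see in the raw request messages. The following sketch builds both as plain strings; the function names, path, host, and field are assumptions made for the illustration, and nothing is sent over the network.

```python
from urllib.parse import urlencode


def build_get_request(path: str, fields: dict, host: str) -> str:
    """GET: the query string travels in the request line, after a '?'."""
    query = urlencode(fields)
    return f"GET {path}?{query} HTTP/1.1\r\nHost: {host}\r\n\r\n"


def build_post_request(path: str, fields: dict, host: str) -> str:
    """POST: the same query string travels in the message body instead."""
    body = urlencode(fields)  # pure ASCII, so len(body) equals its byte length
    return (
        f"POST {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
        f"{body}"
    )
```

In the POST message the URL stays short regardless of how much data is submitted, which is why long query strings are better suited to this method.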
A form is not necessary in order to submit name-value pairs using the GET or POST methods. Query strings can also be generated programmatically by applications. Examples of the use of the GET and POST methods can be found in [HAR, 1997].
The HTTP/1.1 standard is specified in RFC 2616 [RFC, 1999], published by the Internet Engineering Task Force (IETF).
Common Gateway Interface (CGI) is a protocol by which clients communicating with web servers can interact with programs. A CGI program can be written in various languages such as C or Perl. Typically, a form is submitted to a web server, which forwards the form data to a CGI program. The CGI program processes that data and returns the result to the web server, which sends the response back to the client. Since a new process is started for each request for a CGI program, CGI can have performance problems. One solution is to use a technology such as Internet Server API (ISAPI), which can allow server programs to run much faster than CGI programs. Because CGI programs do not persist between requests, maintaining state information is also more complicated.
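The processing step performed by a CGI program can be sketched as follows. Under CGI, the web server passes the query string of a GET request to the program in the `QUERY_STRING` environment variable, and the program writes headers, a blank line, and a body to standard output. The function `run_cgi`, the `name` field, and the greeting are hypothetical; the function returns the output as a string so that it can be examined without a web server.

```python
import os
from urllib.parse import parse_qs


def run_cgi(environ=None) -> str:
    """Return the text a simple CGI program would write to standard output."""
    # A real CGI program reads os.environ; a dict can be passed for testing.
    environ = os.environ if environ is None else environ
    fields = parse_qs(environ.get("QUERY_STRING", ""))
    name = fields.get("name", ["world"])[0]  # default when no name is given

    # CGI output: header lines, a blank line, then the response body.
    body = f"Hello, {name}!\n"
    return "Content-Type: text/plain\r\n\r\n" + body
```

Because the web server starts such a program afresh for every request, nothing computed in one invocation is available to the next unless it is stored externally or sent back to the client.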
The characteristics of CGI do not match the architecture of this project well, so CGI was not a serious contender as a network technology for this project. CGI could be used to provide system access to clients, but a Java servlet is a better choice for this role given the system requirements. CGI is typically not object-oriented and would be difficult to integrate into the project’s distributed object architecture.