Overview

The objective of Commd/Commlib is to decouple the Grid Engine components from the communication layer. A daemon should wait for the completion of a communication only if it is really necessary. Communication and CPU-bound work should overlap, to maximize the availability of the daemons.


The communication system is accessed through calls to a commlib, which contains the socket communication calls. Through these library calls the client program talks to the Commd responsible for it. The Commd is a program handling all communication. It is a multiplexer which receives, holds and sends messages through socket connections. Communication partners of the commd are commprocs linked with Commlib, or other Commds. The communication system is not limited to a single commd, which allows many possible constellations in which components communicate with each other. These constellations are constrained by the fact that Commd does no routing: the only information the commd has for forwarding a communication is the information given by the sender. Thus there has to be a commd on the host of the receiver, or the receiver must be known to the commd where the message arrived. The extremes in commd configurations are one commd on each host or one commd for all hosts.

Components of the communication system

In a normal Grid Engine installation only Commd (see sge_commd(8)) and Commlib are used. To control and test the communication system there are some additional executables:

Addresses

In order to interact, communication partners have to know how to contact each other. The communication system knows the communication partners as communication processes (commprocs). A commproc is identified by the host it runs on, by its name and by an identifying number. The number is needed for commprocs that can be started more than once at a time; it makes the address unique. The name and the number are supplied by the commproc. If the commproc is not interested in a specific number, it can leave the job of choosing a unique number to the Commd. These addresses are agreed upon at startup time of a commproc. Communication calls contain this triple to identify the communication partner.
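The address triple can be pictured with a small model. This is only an illustrative sketch in Python, not the commlib data structure: the names CommprocAddress and AddressRegistry are hypothetical, and the convention that an id of 0 means "let the Commd choose" is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple


@dataclass(frozen=True)
class CommprocAddress:
    """Hypothetical model of the (host, name, id) triple."""
    host: str
    name: str
    ident: int


class AddressRegistry:
    """Toy stand-in for the bookkeeping done when a commproc starts up."""

    def __init__(self) -> None:
        # Maps (host, name) to the set of ids already handed out.
        self._used: Dict[Tuple[str, str], Set[int]] = {}

    def enroll(self, host: str, name: str, ident: int = 0) -> CommprocAddress:
        used = self._used.setdefault((host, name), set())
        if ident == 0:
            # Assumption: 0 means the commproc does not care about the
            # number, so the registry picks the lowest free one.
            ident = 1
            while ident in used:
                ident += 1
        elif ident in used:
            raise ValueError("address already in use")
        used.add(ident)
        return CommprocAddress(host, name, ident)
```

Two commprocs with the same name on the same host then end up with different ids, so their triples stay distinct.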

How Communication Works

Communication is based on messages. The sender passes a message buffer and the length of this buffer to the communication routines. The contents are not interpreted by the communication system. The sender specifies an address for the receiver. This address can contain wild cards for the name and the id of the receiver. A wild card does not work like a multicast: only one receiver gets the message. There are two modes for sending and receiving a message, synchronous and asynchronous. A synchronous send/receive blocks the caller until the communication is done. An asynchronous send/receive returns as soon as possible. A synchronous send waits for the receiver to get the message. It returns if an error occurs, if a timeout takes place, or when the message has been copied successfully to the process space of the receiver.

The asynchronous send transfers the message to the Commd and then returns immediately. The sender can test later whether the message arrived. It is not possible to query more than one asynchronous message, because there is only one field to store the last acknowledged asynchronous message. In a synchronous receive, the receiver blocks until a message arrives, an error condition occurs or a timeout takes place. This timeout can be set by the commproc using the set_commlib_param function. In an asynchronous receive, the receiver gets a message if one is present at the Commd. If there is no message, the receiver can immediately continue its work; this amounts to polling for a message. To distinguish between the different messages a receiving commproc can get, there is a field called "tag" in receive_message/send_message. This allows the receiver to wait for a specific message while ignoring other messages addressed to it.
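The following Python sketch models the matching rules described above: wild cards in the receiver address, delivery to exactly one receiver, tag filtering, and an asynchronous receive that returns at once. It is an in-memory toy, not the commlib API. The class ToyCommd, its method names, and the use of an empty name and of 0 as wild card values are assumptions made for the illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed wild card values, chosen only for this sketch.
ANY_NAME = ""
ANY_ID = 0
ANY_TAG = 0


@dataclass
class Message:
    to_host: str
    to_name: str
    to_id: int
    tag: int
    payload: bytes  # never interpreted by the toy commd


class ToyCommd:
    def __init__(self) -> None:
        self._held: List[Message] = []

    def send_async(self, msg: Message) -> None:
        # Asynchronous send: hand the message over and return immediately.
        self._held.append(msg)

    def receive_async(self, host: str, name: str, ident: int,
                      tag: int = ANY_TAG) -> Optional[Message]:
        # Asynchronous receive: return a matching message or None (polling).
        for i, msg in enumerate(self._held):
            if msg.to_host != host:
                continue
            if msg.to_name != ANY_NAME and msg.to_name != name:
                continue
            if msg.to_id != ANY_ID and msg.to_id != ident:
                continue
            if tag != ANY_TAG and msg.tag != tag:
                continue
            # The message is removed, so only one receiver ever gets it.
            del self._held[i]
            return msg
        return None
```

A message sent to a wild card address is consumed by the first matching receiver that asks for it; a second receiver polling afterwards gets nothing. A receiver that passes a tag skips messages carrying other tags, which stay queued.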

Interface

This is a summary of the commlib functions; details can be found in libs/comm/commlib.h.

Internals

The actions of the commd are data driven. In principle there are two structures that control its operation. The first is the commproc structure. For each enrolling commproc the commd creates a commproc structure. The structure is freed if the commproc calls the function leave() or if the commd considers the commproc dead. The latter happens if the commproc terminates while waiting in a synchronous receive.

The commd holds one list of commprocs describing the status of all commprocs. The w_... fields in the commproc structure describe a wait condition. This allows the commd to decide, at the moment a message arrives, whether a commproc is ready to get this message. w_fd holds the open file descriptor of the connection to the commproc. Keeping the fd open has the advantage that Commd can recognize a breakdown of the commproc: a select on this fd shows activity, and the following read returns an error condition. Commprocs that break down are unregistered automatically.
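The breakdown detection described above can be demonstrated with a small Python sketch using the standard select and socket modules. It is not Commd code; the function name peer_is_gone is hypothetical. Note that on a stream socket an orderly close shows up as an empty read rather than as an error, so the sketch treats both cases as "peer gone".

```python
import select
import socket


def peer_is_gone(conn: socket.socket, timeout: float = 1.0) -> bool:
    """Return True if the peer of conn has closed or reset the connection."""
    readable, _, _ = select.select([conn], [], [], timeout)
    if not readable:
        # No activity on the descriptor: the peer is still considered alive.
        return False
    try:
        # MSG_PEEK leaves any real data in place for the normal read path.
        data = conn.recv(1, socket.MSG_PEEK)
    except OSError:
        # The read failed, e.g. because the connection was reset.
        return True
    # An empty read on a readable socket means end of file.
    return data == b""
```

With a connected pair from socket.socketpair(), the function reports the peer as alive while both ends are open and as gone once the other end has been closed.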

There is one message list in commd containing all known messages. A message is created at the time somebody makes a connection to the Commd; thus all communication with commd is done via messages. The message "scheduler" (process_received_message.c) is responsible for deciding what to do with a message. Message deletion is not done in one central place in the code. Messages are deleted when they are no longer needed, and the point of deletion depends heavily on the sort of message (cntl/asynchronous/synchronous). The message structure holds all the information concerning a message. The structure, and therefore the message, goes through many states in its lifetime. At each state transition the state field is modified. The intention behind this is to split the work into many small pieces, so that processing of a message can be interrupted and continued at any time. This is necessary because Commd is single threaded. If the processing of a message is blocked by a slow communication partner, the message is frozen in its current state and processing continues when the blocking condition vanishes.
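The idea of freezing a message in its state and resuming it later can be sketched as a tiny state machine. The Python below is illustrative only: the state names and the step function are hypothetical and do not correspond to the actual states used by Commd.

```python
from enum import Enum, auto


class State(Enum):
    # Hypothetical states; the real commd uses its own set.
    READ_HEADER = auto()
    READ_BODY = auto()
    DELIVER = auto()
    DONE = auto()


class ToyMessage:
    def __init__(self) -> None:
        self.state = State.READ_HEADER


def step(msg: ToyMessage, can_progress: bool) -> bool:
    """Advance the message by at most one state.

    Returns True if a transition happened. If the partner is not ready
    (can_progress is False) the message simply keeps its state, so the
    single-threaded main loop can turn to other messages and come back.
    """
    if msg.state is State.DONE or not can_progress:
        return False
    order = list(State)  # Enum members iterate in definition order
    msg.state = order[order.index(msg.state) + 1]
    return True
```

A main loop built on this would call step for each message whose descriptor select reports as ready, and leave the others untouched until the next round.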

There is a third structure, needed for handling hosts and aliases. Host entries are created when new hosts appear, e.g. when a message arrives from an unknown host. A single resolve via gethostbyname() is done when the structure is created. This information is refreshed on a regular basis. Old hosts are not removed, because they may be referenced from within other structures. This is why the deleted field is needed. An alias file can be kept for the commd. If Commd cannot resolve a host, or if it is forced to do so by getuniquehostname(), it rereads this file. The alias file has a simple structure. Every line contains a number of hosts which are aliases of each other. Delimiters are blank or comma.

Host name resolving

Hostname resolving in Grid Engine must be considered separately from host comparison. Host comparison is done in libuti.a and can be customized to bypass the common problems in hostname resolving described below.

Host name resolving is a difficult issue because there is no standard way to handle names in a network. The problems are: There are three different mechanisms for name to address resolution: DNS, NIS and the /etc/hosts table. Each host can have a different translation for a host name. Resolver libraries behave differently. Machines with more than one network interface may have more than one entry in the resolver tables. DNS is case insensitive; NIS and /etc/hosts are not. Some resolvers return different cases depending on the case of the input. The simplest approach to the case problem, and the one Commd actually takes, is not to distinguish between lower-case and upper-case letters in hostnames at all.

Having the same name for different hosts is not really a mistake by the network administrator. It may make sense to name a host "fileserver", and if a Grid Engine installation crosses network boundaries there may be two such file servers within one Grid Engine cluster. These situations can be handled by using fully qualified hostnames (e.g. fileserver.acme.com).

If the resolving tables return a short name and this name is ambiguous, the aliasing mechanism of Commd can help. This aliasing mechanism can be used to overrule the resolving tables. The alias file contains one line per host which has to be aliased. A line "fileserver.acme.com" forces commd to resolve this host at startup time and to make "fileserver.acme.com" the main name. This main name will be used by commd whenever the host is addressed (getuniquehostname; send/receive message).

The alias file can be used to solve another problem. If a host has more than one interface, this can be handled in two ways. If DNS is used, it is possible to assign more than one address to a hostname. So fileserver.acme.com can be entered with the addresses 199.99.99.1 and 199.99.100.2 in the resolving table. There is no problem with this. But if DNS is not used...

Not using DNS does not necessarily mean that DNS is not used at all. A combination of DNS, NIS and /etc/hosts is possible. This is forced by the fact that, for example, the resolver library of SunOS 4.1 cannot talk to DNS without using NIS. In most installations with two interfaces in a host there is one entry in the resolver table per interface (e.g. fileserver-eth.acme.com and fileserver-fddi.acme.com). If this is the case, it would be difficult for any component of Grid Engine to say explicitly that these two hostnames belong to the same host. For security reasons alone this is not tolerable. A line like "fileserver-eth.acme.com fileserver-fddi.acme.com" in the alias file solves this problem. To summarize the usage of aliases: the alias file helps to resolve ambiguities in hostnames. It cannot be used for aliasing in general. A line "fileserver-eth.acme.com my-favorite-host" does not introduce a new name. If a name is not resolvable, it is not accepted.
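The alias file rules collected so far (names separated by blanks or commas, all names on a line treated as the same host, no distinction of case, unresolvable names rejected) can be sketched as a small parser. This Python is an illustration, not the Commd implementation. The function name parse_alias_lines and the is_resolvable callback are hypothetical, and treating the first accepted name on a line as the main name is an assumption.

```python
import re
from typing import Callable, Dict, Iterable


def parse_alias_lines(
    lines: Iterable[str],
    is_resolvable: Callable[[str], bool] = lambda host: True,
) -> Dict[str, str]:
    """Map every accepted name (lower-cased) to the main name of its line."""
    main_name: Dict[str, str] = {}
    for line in lines:
        # Delimiters are blank or comma; case is ignored by lower-casing.
        names = [n.lower() for n in re.split(r"[,\s]+", line.strip()) if n]
        # A name that cannot be resolved does not become a new alias.
        names = [n for n in names if is_resolvable(n)]
        if not names:
            continue
        for name in names:
            # Assumption: the first accepted name on the line is the main name.
            main_name[name] = names[0]
    return main_name
```

With the two-interface example above, both fileserver-eth.acme.com and fileserver-fddi.acme.com map to the same main name, while an unresolvable name such as my-favorite-host is dropped when a resolver check is supplied.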

In order to minimize resolving activity, commd holds a list of known hosts, because resolving is a relatively expensive operation if it has to be done via the network. Holding this list causes a problem if the resolving tables are changed while commd is running. Commd tries to refresh its host list on a regular basis. Nevertheless it may take a while until changes take effect. This cannot be avoided, because some resolver libraries cache resolving information. One has to allow for a delay of 10 minutes. A newly started commd of course gets the current information. The automatic refreshing can be disabled with the "-dhr" switch.
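A host list with periodic refresh can be sketched as follows. This Python is a simplified model, not the Commd implementation: the class HostCache, the injected resolve and clock callables, and the refresh interval parameter are all assumptions, and the auto_refresh flag only mimics the effect the text attributes to the "-dhr" switch.

```python
import time
from typing import Callable, Dict, Tuple


class HostCache:
    """Toy cache of resolved hosts with optional periodic refresh."""

    def __init__(self, resolve: Callable[[str], str], refresh_seconds: float,
                 auto_refresh: bool = True,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._resolve = resolve          # the expensive lookup, e.g. via network
        self._refresh_seconds = refresh_seconds
        self._auto_refresh = auto_refresh
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}

    def lookup(self, name: str) -> str:
        key = name.lower()               # hostnames are compared case-insensitively
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None:
            value, stamp = entry
            fresh = now - stamp < self._refresh_seconds
            if fresh or not self._auto_refresh:
                return value             # served from the list, no resolving
        value = self._resolve(key)
        self._entries[key] = (value, now)
        return value
```

A changed resolver table becomes visible only after the entry has aged past the refresh interval, which is the delay the paragraph above warns about; with refreshing disabled the cached answer is kept for the lifetime of the cache.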

Commd and files

Due to the nature of Commd, file access within commd is a problem. In computing terms, file system access is a very slow operation, and it gets even slower if the accessed file system is a networked file system. Commd can be blocked by these operations, so file system access is minimized. In normal operation without aliases there is no file access at all. If there are aliases, the alias file is accessed every time a host cannot be resolved. This can be triggered by misbehaving commprocs and has to be avoided. The other file system accesses serve to log messages to a file. Logging goes to /tmp/commd if it exists (it is not created by commd). /tmp is usually a local file system, and accessing a local file system should not slow down the operation of commd noticeably.

Copyright 2001 Sun Microsystems, Inc. All rights reserved.