The Leading Source for Global News and Information Covering the Ecosystem of High Productivity Computing / February 2, 2007
When companies purchase a significant number of machines and cluster them together to solve their computing needs, their site environment often drives specific requirements for their clusters. These requirements vary and often include specific networking configurations, specific applications that need managing, specific approaches to software installation and maintenance, and existing management software and procedures that must be accommodated. The key to successful cluster administration software is that it be flexible enough to accommodate many of these environments. For optimum flexibility, the systems management software must have the following characteristics: flexible fundamental capabilities, extensible hardware control, a variety of node installation methods, extensible monitoring capabilities, hierarchical support, a modular and customizable design, and frequent updates and user contributions.
This article will discuss each of these characteristics in turn and give examples of cluster administration software that possesses these qualities.
Flexible Fundamental Capabilities
Cluster administration encompasses a wide variety of tasks that are often unique to the cluster or to the cluster's purpose. Therefore cluster management tools need to provide ways to accomplish many different tasks with simple tools. The more inherent flexibility in these tools the better. Basic functionality that is needed for cluster management includes:
With the basic tools above it is possible to accomplish a large number of complex cluster tasks, including installation and startup of the high performance computing (HPC) application stack, cluster-wide user management, configuration and startup of services, and addition of nodes to workload queues. For instance, installation and startup of HPC software can be done with software maintenance and the distributed shell. Services like NTP and the automounter, as well as user management, can be set up mainly by distributing configuration files from the management server.
Examples of cluster administration tools that include forms of the above functionality are xCAT, the C3 tools in OSCAR, Scali Manage, and CSM. Other tools can do just one of the tasks above. For example, Red Hat Network (up2date) and YUM provide software maintenance capabilities.
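As a rough illustration of the distributed shell idea, the sketch below runs one command on many nodes in parallel and collects the results per node. It is not how any of the products named above implement it. The function names (`dsh`, `ssh_runner`), the fan-out value, and the use of plain `ssh` in batch mode are assumptions made for the example.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ssh_runner(node, command, timeout=30):
    """Run one command on one node over ssh; returns (exit_code, output)."""
    proc = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", node, command],
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.returncode, proc.stdout + proc.stderr


def dsh(nodes, command, runner=ssh_runner, fanout=16):
    """Run `command` on every node in parallel; returns {node: (code, output)}."""
    nodes = list(nodes)

    def run_one(node):
        try:
            return runner(node, command)
        except Exception as exc:  # one unreachable node must not abort the rest
            return 255, str(exc)

    # At most `fanout` commands are in flight at any time.
    with ThreadPoolExecutor(max_workers=fanout) as pool:
        results = list(pool.map(run_one, nodes))
    return dict(zip(nodes, results))
```

Because the runner is a parameter, the same fan-out logic could push configuration files (for example with a runner that wraps `scp` or `rsync`) instead of executing a command.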
Extensible Hardware Control
Hardware control provides the key capability of managing the cluster hardware (powering on/off, query, console, firmware flash) without having to be physically present with the hardware. Most cluster hardware provides native mechanisms to accomplish these tasks, but the mechanisms often vary between hardware types. This makes remote hardware control challenging, since many clusters consist of heterogeneous hardware. Even if all the compute nodes are the same machine type, there are still I/O nodes and non-node devices such as switches, SAN controllers, and terminal servers to be managed.
To support the ever growing number of power methods, the administration software must support user-defined power methods that can be plugged into the main power commands. A pluggable method allows the software to more easily support new hardware, and allows the user to run the same command to all the nodes and devices, despite their different control methods. It also allows other software components, such as installation, to drive the power control to the various hardware.
In addition to power control of the cluster hardware, remote console is another area that requires pluggable methods. There are a wide variety of terminal servers and serial over LAN (SOL) support on the market, and each of these has its own intricacies for establishing a remote console session to the node.
The method for flashing the firmware of the nodes is usually very hardware specific, but at least some flexibility can be achieved by allowing additional drivers to be added to the flashing environment, and by allowing flashing to be done either in-band (while the node is running) or pre-OS (before the operating system is installed on the node).
Beyond supporting new terminal servers, in-house development of power and console methods allows more flexibility when upgrading cluster hardware: instead of waiting for, and then upgrading to, a version of the management software that supports the new hardware, you can script your own solution. Examples of simple hardware control methods that cluster administrators can easily develop are power on through Wake on LAN, power off through a distributed shell, and power control via a power switch from a vendor such as APC or Baytech. Cluster products that provide extensible power control include xCAT, Scali Manage, and CSM.
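A pluggable power method can be sketched as a small registry: each method is a function registered under a name, and one generic command dispatches to whatever method each node is configured with. The registry, the `power` function, and the node dictionary layout below are hypothetical and do not reflect the plug-in interface of any product mentioned here. The Wake on LAN payload itself follows the standard magic packet layout (six 0xFF bytes followed by the MAC address repeated 16 times).

```python
import socket

POWER_METHODS = {}  # method name -> handler(node, action)


def power_method(name):
    """Decorator that plugs a handler into the registry under `name`."""
    def register(handler):
        POWER_METHODS[name] = handler
        return handler
    return register


def magic_packet(mac):
    """Wake on LAN payload: 6 bytes of 0xFF, then the MAC repeated 16 times."""
    raw = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(raw) != 6:
        raise ValueError(f"bad MAC address: {mac}")
    return b"\xff" * 6 + raw * 16


@power_method("wol")
def wol(node, action):
    """User-defined method: power on only, via a UDP broadcast to port 9."""
    if action != "on":
        raise NotImplementedError("Wake on LAN can only power a node on")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(magic_packet(node["mac"]), ("255.255.255.255", 9))


def power(nodes, action):
    """Same command for every node; each node's own method does the work."""
    for node in nodes:
        POWER_METHODS[node["power_method"]](node, action)
```

Other components, such as an installer that needs to reboot a node, could call `power` without knowing which control method sits behind each node.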
Variety of Node Installation Methods
Installing the operating system and applications on nodes is one of the most important functions of cluster administration software, because it can take so long to do manually. Because the method of installation affects the other administration processes, it is important for the software to support a variety of installation methods.
For clusters in which the nodes are not all identical and for which there exists a separate software maintenance procedure, the approach of directly installing the RPMs from the distribution media (over the network) is generally the most useful. This allows administrators to initiate an install with just the distribution CDs in hand, and to easily specify a different list of RPMs for different nodes. Products that support this method of installation include Rocks, Clusterworx, Scali Manage, xCAT, and CSM. They generally use the unattended installation features of Kickstart and AutoYaST to automate the installation of multiple nodes over the network in parallel.
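One way to picture "a different list of RPMs for different nodes" is a small generator that builds the package section of a Kickstart file from a node's role. This is only a sketch: the role names and package lists are illustrative assumptions, and recent Kickstart versions expect the closing `%end` shown here while older releases omitted it.

```python
# Packages every node receives (illustrative, not a recommended set).
BASE_PACKAGES = ["@base", "openssh-server", "ntp"]

# Extra packages per node role (hypothetical roles for this example).
ROLE_PACKAGES = {
    "compute": ["openmpi"],
    "io": ["nfs-utils"],
}


def packages_section(role):
    """Return the %packages section of a Kickstart file for one node role."""
    pkgs = BASE_PACKAGES + ROLE_PACKAGES.get(role, [])
    return "\n".join(["%packages", *pkgs, "%end"]) + "\n"


if __name__ == "__main__":
    print(packages_section("compute"))
```

An unknown role simply falls back to the base list, so a new node type installs with sensible defaults until its own list is defined.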
While many users like the simplicity of the direct installation method, an equally large camp prefers the cloning method. This generally combines the node installation method with the node software maintenance strategy. In this approach a model node (sometimes called a "golden" node) is installed manually and configured exactly how the administrator wants the rest of the nodes to be. Then the software image is captured from the golden node and replicated to the other nodes. When updates or configuration changes are necessary, the golden node is updated and the capture/replicate process is done again. This approach is most effective for clusters in which the nodes are almost identical, in terms of both hardware and software. Products that provide cloning capability include OSCAR, HP XC, xCAT, CSM, Clusterworx, and the open source tools Partimage and SystemImager.
While installing the operating system locally on each node generally works well (disks are cheap, and the OS files can be loaded more quickly at boot time), some users are moving to diskless nodes. The motivation for this is generally not price or even easier maintenance (there are both pros and cons in this area). The motivation is usually reliability in large clusters, because the last moving part in the node is eliminated. Diskless nodes are achieved in most cases by sending the kernel and possibly some of the file systems to the node when it boots; the rest of the file systems are then NFS mounted from a server. Products that support diskless nodes include CIT, Clusterworx, Warewulf, xCAT, CSM, Scyld, and Egenera. (Scyld only loads a very minimal image on the node. Egenera makes a disk available to each node via a SAN.)
Most users have definite requirements about the type of node installation they want to use, since it is central to their entire administration strategy. Therefore, it is important for the cluster administration software to support as many of the node installation methods presented here as possible. These methods should also be customizable, both through user-defined post-installation scripts and through install/image servers that increase scalability and performance.
Extensible Monitoring Capabilities
Just as extensible hardware control is important, extensible monitoring of the cluster is a useful tool for automating responses to cluster events. While there are many enterprise software packages that provide error detection and response, it is useful to have at least some customizable, user-defined monitoring capabilities built into a cluster administration product. Common event information across the cluster to which the software may need to respond includes: node down and up events (useful for manipulating workload queues), memory and swap space used, filesystem space used, processor idle time, network adapter throughput, and syslog entries. The following extensibility points are important in event monitoring:
Products that have an extensible monitoring system include Ganglia, Nagios, Big Brother, Scali Manage, Clusterworx, and CSM.
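To make the idea of user-defined monitoring concrete, here is a minimal sketch in which administrators register their own sensors (functions that return a value) and rules (a condition on a sensor plus a response to run when it holds). The `Monitor` class and its method names are invented for this example and are not the interface of any product listed above.

```python
import shutil


class Monitor:
    """Holds user-defined sensors and condition/response rules."""

    def __init__(self):
        self.sensors = {}  # sensor name -> callable returning a value
        self.rules = []    # (sensor name, predicate, response) tuples

    def add_sensor(self, name, read):
        self.sensors[name] = read

    def add_rule(self, sensor, predicate, response):
        self.rules.append((sensor, predicate, response))

    def poll(self):
        """Read every sensor once, run matching responses, return what fired."""
        values = {name: read() for name, read in self.sensors.items()}
        fired = []
        for sensor, predicate, response in self.rules:
            value = values[sensor]
            if predicate(value):
                response(sensor, value)
                fired.append(sensor)
        return fired


def root_fs_percent_used():
    """Example sensor: percentage of the root filesystem in use."""
    usage = shutil.disk_usage("/")
    return 100.0 * usage.used / usage.total


if __name__ == "__main__":
    monitor = Monitor()
    monitor.add_sensor("rootfs", root_fs_percent_used)
    monitor.add_rule("rootfs", lambda pct: pct > 90.0,
                     lambda name, pct: print(f"{name} is {pct:.0f}% full"))
    monitor.poll()
```

A response could just as well drain a node from a workload queue or send mail; the point is that both the sensor and the response are supplied by the administrator.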
Hierarchical Support
There are several reasons to run cluster administration in a hierarchical fashion. The obvious reason is to be able to manage more nodes than the current scaling limit of the administration software supports. Another reason is to divide the nodes into smaller sets that can be managed individually, sometimes by different administrators, while still retaining some central control. A third reason is to handle unusual networking configurations, for example cross-geography clusters. A typical hierarchical cluster consists of a three-level hierarchy, in which there are sets of nodes, with each set being managed by a management server (called the First Line Management Server, or FMS). A top-level management server (the Executive Management Server, or EMS) manages all of the First Line Management Servers.
Ideally, all management operations could be done from the EMS, but at least the following are critical to be done from the EMS:
Hierarchical support is important to allow the cluster administration software to work in more cluster environments. The only products that we know of that support hierarchical clusters as described here are xCAT and CSM. Several products, for example CIT, support a hierarchy for one specific operation, usually for node installation or diskless boot.
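The three-level arrangement can be sketched as a routing problem: the EMS knows which FMS owns each node, groups a request by owner, and lets each FMS act on its own nodes. The classes below are a toy model under that assumption. The `run` method on the FMS is a stand-in that only returns a description of what it would do, and nothing here reflects how xCAT or CSM actually implement their hierarchy.

```python
from collections import defaultdict


class FMS:
    """First Line Management Server: directly manages one set of nodes."""

    def __init__(self, name, nodes):
        self.name = name
        self.nodes = set(nodes)

    def run(self, command, nodes):
        # Stand-in for real work (distributed shell, power control, install).
        return {node: f"{self.name} ran '{command}' on {node}" for node in nodes}


class EMS:
    """Executive Management Server: routes each request to the owning FMS."""

    def __init__(self, servers):
        self.owner = {node: fms for fms in servers for node in fms.nodes}

    def run(self, command, nodes):
        # Group the requested nodes by the FMS that manages them.
        batches = defaultdict(list)
        for node in nodes:
            batches[self.owner[node]].append(node)
        # Each FMS handles only its own batch; results are merged centrally.
        results = {}
        for fms, batch in batches.items():
            results.update(fms.run(command, batch))
        return results


if __name__ == "__main__":
    ems = EMS([FMS("fms-east", ["e1", "e2"]), FMS("fms-west", ["w1"])])
    for node, line in sorted(ems.run("uptime", ["e1", "w1"]).items()):
        print(node, "->", line)
```

In a real deployment the per-FMS batches would be dispatched in parallel, which is what lets the hierarchy scale past the limit of a single management server.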
Modular and Customizable
We have already mentioned that customers often have established system management processes in their lab prior to using any of the administration products mentioned in this article. It is not normally well received when the product dictates the processes to be used for all the administration tasks (installation, software maintenance, user management, configuration, monitoring, etc.). To avoid this "barrier to entry", the product must have the following characteristics:
Frequent Updates and User Contributions
As we all know, Linux software and its associated hardware do not stand still. The many components of a typical Linux cluster continue to evolve with new versions, usually several times a year, with all the components on different release schedules. And new technology continually appears. As a result, the administration software needs to continually adapt to its changing environment. This requires the ability to put out frequent updates to the product. Open source solutions (e.g. Rocks, CIT, OSCAR) generally have an easier time of this because of their iterative development style and because less regression testing is done by the development team (and more by the user community). But even vendor products need to find ways to release updates often. CSM uses a combination of traditional product releases and early updates on the xCSM site. User contributions can also help tremendously in keeping up with all the changing components. This is business as usual for open source solutions, but can be difficult for vendor products due to legal restrictions. This issue must be resolved in order for vendor products to keep up with the ever-changing environment.
In Linux clusters, there are so many open source administration utilities and so many home-grown solutions that there is very little need for a one-size-fits-all cluster administration product. The administration software must be extremely flexible to adapt to a variety of environments and to complement, but not conflict with, the utilities already in use.
Bruce is a Senior Technical Staff Member at IBM. He has been working in the area of systems management of clusters since 1989, including AIX clusters based on the IBM SP2, Windows clusters with IBM Director, and Linux & AIX clusters with CSM and open source.
Jennifer has been working with clusters for the past seven years. She currently spends her days at Sony Pictures Imageworks in sunny southern California.
Editor's note: Rick Friedman, Scali's VP of Marketing and Product Management, wished to clarify some of the Scali Manage capabilities referenced in the article:
"With regard to the section 'Variety of Node Installation Methods', Scali Manage, in fact, supports both image / cloning installation and RPM methods, enabling our users to use RPMs to create a master image for initial cluster setups, and then leveraging RPMs for updates without requiring the re-imaging of the existing systems.
"With regard to the section 'Hierarchical Support', Scali Manage provides this functionality, supporting multiple clusters, multiple networks, and multiple geographies in a hierarchical fashion, enabling cluster administrators to coordinate and manage HPC efforts throughout an organization.
"Finally, with regard to support for diskless nodes, this functionality will be added in our next release coming next month."