The Leading Source for Global News and Information Covering the Ecosystem of High Productivity Computing / March 31, 2006
OpenRTE (www.open-rte.org) is an open source project that is designed to provide a portable distributed computing run-time environment for HPC workloads. Ralph Castain, of Los Alamos National Laboratory, leads the development of this project and has been conducting research in large-scale distributed computing as a research scientist at Colorado State University for some time. HPCwire got a chance to speak with Ralph about OpenRTE. In this Q&A, he describes what it is, how it works, and what it could mean to the HPC community.
HPCwire: What are the origins and goals of OpenRTE?
Castain: OpenRTE began as a sub-project within the Open MPI initiative, which is aimed at creating a free, open source, peer-reviewed, production-quality, complete MPI-2 implementation. Open MPI strives to provide an extremely high-performance, competitive system by directly involving the HPC community -- vendors, third-party researchers, users, etc. -- in external development and feedback.
Building an MPI layer capable of meeting those objectives required the development of an equally ambitious run-time environment that would be capable of providing the necessary infrastructure. OpenRTE evolved from that effort. We recognized early on that we had an opportunity to provide an infrastructure that could transparently support extended features for MPI -- something that would transcend the traditional MPI environment of running an application on a single cluster. For example, we could use the infrastructure we were developing to transparently "stitch" together multiple applications, or to extend applications to span multiple clusters.
The goals of OpenRTE therefore became:
(a) Provide a seamless, transparent distributed computing environment that allows users to execute their applications in a single cluster or on multiple clusters, and/or to integrate individual applications together, all without changing their application code. Our current computing environments require that the user recompile their code, and perhaps even make changes in the source code itself, when moving from a cluster to a Grid or some other venue. In addition, connecting multiple applications generally requires that the integration be done at the source code level. Our objective is to make this as transparent as possible. Users should be able to execute the same code on a cluster or a Grid without changes, and be able to connect applications at run-time -- again, without changing the source code.
(b) Create a robust, production-quality platform upon which high-performance computing applications can execute. The run-time must do its job and then get out of the way -- it cannot impact critical timing loops. At the same time, it has to provide a rock-solid foundation: the run-time must never fail (instead, it must provide error messages and, if necessary, exit gracefully), and it must support responding to system faults so that the computing application can continue executing; and
(c) Create an extensible system based on a component architecture that allows developers to "overload" any OpenRTE function, thus enabling the necessary research to support new enhanced features in a production environment. We recognized up-front that OpenRTE's goals are ambitious, and that, as the research literature demonstrates, there are multiple ways of implementing just about every major functional block in the system. We wanted to create an architecture for OpenRTE that would make it easy for a researcher to replace one of those functional blocks with their own idea on how to implement it -- without that researcher having to write all the rest of the blocks required to make a system operational. Thus, a researcher interested in distributed data storage -- such as is found in OpenRTE's registry -- can "overload" that functional block with their own implementation, and then test the results in a production environment without having to write code to launch and monitor processes.
HPCwire: Can you describe the architecture of OpenRTE and how it works?
Castain: OpenRTE consists of four major subsystem blocks, all built upon a component architecture. Sitting at the core of the system is a publish/subscribe general purpose registry (GPR) that is used to synchronize events across the system. The GPR is designed to asynchronously notify subscribers of events such as data being entered or changed in the registry, and to transmit along with that notification whatever data the subscriber requests. This is the underlying mechanism supporting, among other things, the exchange of communication connection data among processes.
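As a rough illustration of the publish/subscribe idea described above, here is a minimal Python sketch. It is not the OpenRTE GPR interface: the `Registry` class, its method names, and the key and address strings are all hypothetical. For brevity it also delivers notifications synchronously inside `put`, whereas the interview describes asynchronous notification.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

# Hypothetical names throughout; this is not the OpenRTE GPR API.
Callback = Callable[[str, Any], None]


class Registry:
    """Toy publish/subscribe key-value store."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}
        self._subscribers: Dict[str, List[Callback]] = defaultdict(list)

    def subscribe(self, key: str, callback: Callback) -> None:
        """Ask to be notified whenever `key` is entered or changed."""
        self._subscribers[key].append(callback)

    def put(self, key: str, value: Any) -> None:
        """Store a value, then pass it to every subscriber of that key."""
        self._data[key] = value
        # Synchronous delivery, a simplification of the asynchronous design.
        for callback in list(self._subscribers[key]):
            callback(key, value)

    def get(self, key: str, default: Any = None) -> Any:
        """Return the stored value, or `default` if the key is absent."""
        return self._data.get(key, default)


# One process subscribes to a peer's contact data; the peer publishes it.
registry = Registry()
received = []
registry.subscribe("proc-1/contact", lambda key, value: received.append((key, value)))
registry.put("proc-1/contact", "tcp://10.0.0.5:4000")
```

After the `put` call, `received` holds the single notification, which mirrors how connection data could reach a subscriber without the subscriber polling for it.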
Subsystems in the Resource Management block make heavy use of the GPR to discover computing resources that may be available to the user, allocate those resources for use by a particular application, map the application's processes to specific allocated resources, and launch the processes to begin execution. In this block, the GPR is primarily used as an intermediate storage medium -- the resource discovery service, for example, places entries on the registry with information identifying the resources it has found, their state of operation, how much of their capability, if any, has been reserved for this user, etc. This information is then used by the allocation service to determine if additional resources need to be requested for the user and from where they might come. The mapper service then takes the allocated resources and maps processes to them so that the launch service can start the application.
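The discover, allocate, map, and launch sequence can be sketched as four small functions that communicate only through a shared store. This is a simplified illustration, not OpenRTE code: a plain dictionary stands in for the registry, and the node names, slot counts, and function names are assumptions made for the example.

```python
from typing import Any, Dict, List

# Hypothetical data layout; a plain dict stands in for the registry.
registry: Dict[str, Any] = {}


def discover(reg: Dict[str, Any]) -> None:
    """Record the nodes found and how many slots each one offers."""
    reg["nodes"] = {"node-a": 2, "node-b": 2}


def allocate(reg: Dict[str, Any], slots_needed: int) -> None:
    """Reserve whole nodes until the requested slot count is covered."""
    allocated: Dict[str, int] = {}
    remaining = slots_needed
    for name, slots in reg["nodes"].items():
        if remaining <= 0:
            break
        allocated[name] = slots
        remaining -= slots
    if remaining > 0:
        raise RuntimeError("not enough resources discovered")
    reg["allocation"] = allocated


def map_processes(reg: Dict[str, Any], num_procs: int) -> None:
    """Assign process ranks to allocated slots, filling each node in turn."""
    slots = [name for name, count in reg["allocation"].items() for _ in range(count)]
    reg["map"] = {rank: slots[rank] for rank in range(num_procs)}


def launch(reg: Dict[str, Any]) -> List[str]:
    """Stand-in for a launcher: return the commands it would issue."""
    return [f"start rank {rank} on {node}" for rank, node in sorted(reg["map"].items())]


# Each stage reads what the previous stage wrote to the shared store.
discover(registry)
allocate(registry, 3)
map_processes(registry, 3)
commands = launch(registry)
```

Because every stage reads and writes only the shared store, any one stage could be swapped for a different implementation without touching the others, which is the property the interview attributes to using the registry as intermediate storage.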
The Error Management block is the heart of the fault tolerance capability. It is broken into two subsystems according to what could loosely be considered "tactics" and "strategy". The State Monitoring and Reporting (SMR) subsystem pretty much lives up to its name -- it monitors the state-of-health of processes and computing resources, and records any changes on the registry. We use this capability, for example, to monitor the progress of an application as it is initialized for launch in OpenRTE, as it is launched somewhere, and as it completes execution.
The "strategic" part of the Error Management block is in the Error Manager (ERRMGR) itself. The ERRMGR is notified of changes in the status of processes or computing resources via the GPR's notification system. Upon learning of a change, the ERRMGR can respond by shutting down the entire application, migrating the affected processes to another resource, or any number of alternatives. This is one place where the component architecture really can be exploited -- developers can experiment with different strategies for error response by simply indicating, at the time of execution, which ERRMGR component they would like to use for this run.
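Selecting an error-response strategy by name at execution time might look like the following Python sketch. The strategy names, the `STRATEGIES` table, and the job layout are hypothetical and are not taken from OpenRTE; in practice the name could come from a command-line option.

```python
from typing import Callable, Dict, List

# Hypothetical strategy names and job layout, for illustration only.
Job = Dict[str, str]  # maps process name -> node name
Strategy = Callable[[Job, str, List[str]], Job]


def abort_job(job: Job, failed_node: str, spares: List[str]) -> Job:
    """Respond to a failure by shutting down the whole application."""
    return {}


def migrate(job: Job, failed_node: str, spares: List[str]) -> Job:
    """Respond by moving the affected processes to the first spare node."""
    if not spares:
        return abort_job(job, failed_node, spares)
    return {
        proc: (spares[0] if node == failed_node else node)
        for proc, node in job.items()
    }


# The table of available components; adding a strategy means adding an entry.
STRATEGIES: Dict[str, Strategy] = {"abort": abort_job, "migrate": migrate}


def handle_node_failure(strategy_name: str, job: Job,
                        failed_node: str, spares: List[str]) -> Job:
    """Look up the chosen strategy by name and apply it to the job."""
    return STRATEGIES[strategy_name](job, failed_node, spares)


job = {"p0": "node-a", "p1": "node-b"}
migrated = handle_node_failure("migrate", job, "node-b", ["node-c"])
aborted = handle_node_failure("abort", job, "node-b", ["node-c"])
```

Here `migrated` keeps `p0` on `node-a` and moves `p1` to `node-c`, while `aborted` is an empty job. The code that detects the failure does not change when a different strategy is chosen.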
The final major subsystem block comprises support services such as OpenRTE's messaging layer (which should not be confused with Open MPI's high-performance subsystem -- this one is solely for administrative messaging), name service, data services to facilitate heterogeneous operation of the OpenRTE system, and I/O forwarding.
HPCwire: OpenRTE appears to share some of the same attributes as Grid computing. How would you compare OpenRTE to current Grid technology? Is it complementary to Grid computing or do you see OpenRTE as an evolution of the Grid paradigm?
Castain: We have discussed this question at some length, both amongst ourselves and with our colleagues in the Grid community. I believe the correct answer really is that OpenRTE is both complementary to and an evolution of the Grid. Quite a bit of OpenRTE's design is based on our experience with Grid-based computing, so there is some obvious degree of overlap. There is also some degree of departure in that we are focused solely on scientific or technical applications -- we are not trying to provide a general-purpose computing platform that can also support business transactions or other areas currently being developed by the Grid community.
There are two primary differences between OpenRTE and the current Grid protocols. Probably the biggest difference is that OpenRTE is designed to operate at the user level -- no system administrator has to install any software or do anything to allow OpenRTE's operation. Most Grid implementations require at least some system administrator level installation and support. We decided to follow the MPI lead here and allow operation solely at the user level. This means that individual users can have their own versions of OpenRTE, each configured to meet their own needs, without conflict.
The other difference lies in the architecture -- components versus web services. Each has its respective advantages and weaknesses. From our experience, however, we believe that components offer some advantages in transparency and overload capability, and can do so very effectively in areas where the application space is constrained, for example, in high-performance scientific computing. We are not claiming that OpenRTE's architecture is ideal for all application spaces -- it would not, for example, be a good way to build a multi-corporation business transaction system -- only that it offers some advantages for the areas we are serving.
That said, OpenRTE is clearly an evolution of the Grid concept itself in that it supports the distributed, interconnected computing model at the heart of that paradigm. It could be considered complementary in that it is not exclusively based on the current Grid protocols. It should be noted, though, that several organizations have indicated interest in creating Grid-based components within OpenRTE that would support integration between the Grid and OpenRTE. I believe there is a role for both the current Grid protocols and OpenRTE.
HPCwire: Who are you talking to about OpenRTE and what kinds of responses are you getting?
Castain: Until recently, we have been almost exclusively focused on supporting the Open MPI effort's initial production release. With that accomplished, we have begun talking to people about more general uses for OpenRTE. Much of our effort so far, of course, has been just letting people know that OpenRTE exists, and what it can do!
We have had some encouraging early interest. The National Virtual Observatory/National Radio Astronomy Observatory folks are evaluating OpenRTE for use as part of their work on scalable data analysis frameworks for NVO. Several distributed sensor projects are exploring OpenRTE as a platform for their work. And we have had some interest from a group at the University of Illinois at Urbana-Champaign who are looking for a platform upon which to build a new parallel programming language.
It is still early, though, so we'll have to wait and see if OpenRTE meets their needs. Our interest, of course, is to see how we can modify OpenRTE to make this an attractive solution for them.
HPCwire: Are current releases of OpenRTE being used today?
Castain: Open MPI, which is built upon OpenRTE, is in production today at a number of sites around the world, including universities, companies, and many of the DOE laboratories.
HPCwire: How do you see OpenRTE evolving?
Castain: I think the most exciting future for OpenRTE really lies in two areas. First, we want to enable new capabilities within the high-performance computing community. For example, the next major release of OpenRTE will begin to support execution of multiple intercommunicating applications that can be "connected" at run-time with a simple command line option, and will allow users to execute applications on remote clusters from their desktop computer. In addition to the new modes of operation these directly provide to users, these features will also enable users to begin building high-performance computing "application modules".
This latter concept, I feel, represents the future of high-performance computing. The idea of "application modules" has, of course, been around for quite some time. Adoption of the idea, though, has been hampered by the necessity for source code level integration of the modules. If I have to modify the source code to connect to another module, then (a) it makes use of other modules very difficult, and (b) I have to incorporate considerable logic to allow operation of my code when the other module isn't available. In other words, the current environments don't make modular applications very easy to implement.
If, however, we can transparently allow them to be combined at run-time without modification of source code, then this paradigm might see broader use. I think OpenRTE might prove of some help there.
Second, and perhaps even more importantly, we hope to make high-performance computing easier -- and hence attract more people to that area. When talking amongst ourselves in the HPC community, we oft-times forget that probably less than 1% of the scientific community actually is engaged in high-performance computing. Some of that, of course, is due to the nature of different fields of endeavor -- some areas just don't need that kind of capability. However, if we can make as much of the hard stuff "transparent" to the user as possible, then maybe we can see a spreading of interest and involvement in HPC.
Ralph H. Castain is a scientist at Los Alamos National Laboratory where he conducts research on large-scale distributed computing environments, a continuing investigation that includes time spent as a Research Scientist in the Electrical and Computer Engineering Department at Colorado State University. While in Colorado, he founded the Colorado Grid Computing (COGrid) Initiative to create a statewide Grid computing system capable of meeting the needs of industry, government, and academia at all levels. Prior to joining the University, he spent eight years in industry leading a variety of technology initiatives, and eleven years at Los Alamos National Laboratory engaged in research spanning artificial intelligence, signal processing, and automated decision support systems. During that time, he served as Chief Scientist for Nonproliferation and Arms Control, and as an Industrial Fellow. Dr. Castain received his BS degree (physics) from Harvey Mudd College, and the MS (solid-state physics), MSEE (robotics), and PhD (nuclear physics) degrees from Purdue University.