November 27, 2006
The HPC4U European research project, which works on grid computing technologies, has released the first freeware version of its grid middleware, which provides fault tolerance for parallel applications. The system, based on a Linux kernel running as an MS Windows service (coLinux), lets users launch parallel applications on virtual nodes to see the fault tolerance mechanisms in action. (A user can start a parallel compute job on two compute nodes, kill one of those nodes and, within a second, see the job restart on two other nodes.)
HPC4U's freeware version uses a coLinux system. coLinux is a virtualisation approach that, in contrast to systems such as VMware, does not emulate an entire machine but runs the Linux kernel as an MS Windows service. This makes the system easier to run, since the operating system is booted from a CD-ROM or DVD without any prior installation on the computer's disk. The coLinux-based system uses CCS together with two free and open source components that offer basic fault tolerance mechanisms for parallel applications: BLCR (Berkeley Lab Checkpoint/Restart) and LAM-MPI.
BLCR allows programs running on Linux to be "checkpointed" (written entirely to a file) and later "restarted." BLCR performs checkpointing and restarting inside the Linux kernel. While this makes it less portable than solutions based on user-level libraries, it also gives BLCR full access to kernel resources, so it can restore resources (such as process IDs) that user-level libraries cannot. In the future, this will also allow BLCR to checkpoint and restart entire sessions and process groups (such as shell scripts and their subprocesses).
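To make the checkpoint/restart idea concrete, here is a minimal toy sketch in Python. It is not BLCR and does not use any BLCR interface: it is the user-level, application-driven kind of checkpointing that the paragraph above contrasts with BLCR's kernel-level approach. The function names (`run`, `checkpoint`, `restart`, `demo_restart`) and the file name `job.ckpt` are hypothetical and chosen only for illustration.

```python
import json
import os
import tempfile


def run(state, steps):
    """Advance a toy computation: add the next integer to a running total."""
    for _ in range(steps):
        state["i"] += 1
        state["total"] += state["i"]
    return state


def checkpoint(state, path):
    """Write the whole application state to a file.

    The state is written to a temporary file first and then renamed, so a
    crash in the middle of a checkpoint never leaves a half-written file.
    """
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def restart(path):
    """Read the saved state back so the computation can continue."""
    with open(path) as f:
        return json.load(f)


def demo_restart():
    """Run 5 steps, checkpoint, 'lose' the process state, then resume."""
    path = os.path.join(tempfile.mkdtemp(), "job.ckpt")  # hypothetical file name

    state = run({"i": 0, "total": 0}, 5)
    checkpoint(state, path)
    del state  # simulate the failure: in-memory state is gone

    resumed = run(restart(path), 5)
    uninterrupted = run({"i": 0, "total": 0}, 10)
    return resumed, uninterrupted


if __name__ == "__main__":
    resumed, uninterrupted = demo_restart()
    print(resumed == uninterrupted)
```

The sketch only restores what the application chose to save. Anything owned by the kernel, such as the process ID, is not captured, which is the limitation of user-level approaches that the article describes.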
LAM-MPI is an open-source implementation of the Message Passing Interface specification, including all of MPI-1.2 and much of MPI-2. One of the main advantages of using LAM-MPI in the HPC4U freeware bundle is its native compatibility with BLCR. As detailed on the LAM/MPI website, MPI applications running under LAM/MPI can be checkpointed to disk and restarted at a later time. LAM requires a third-party single-process checkpoint/restart toolkit to actually checkpoint and restart each individual MPI process; LAM itself handles the parallel coordination.
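The division of labour described above, with one layer coordinating the processes and another saving each process individually, can be sketched as a toy simulation. This is not LAM/MPI or BLCR code: threads stand in for MPI processes, a `threading.Barrier` stands in for the parallel coordination, and a small JSON file stands in for the single-process checkpoint. All names (`coordinated_run`, `load_checkpoints`, `demo_coordinated`, `rankN.ckpt`) are hypothetical. A real coordinated checkpoint must also deal with messages in flight between processes, which this sketch ignores.

```python
import json
import os
import tempfile
import threading


def coordinated_run(num_ranks, steps, checkpoint_at, ckpt_dir):
    """Run toy 'ranks' as threads that all pause at the same step to checkpoint."""
    barrier = threading.Barrier(num_ranks)
    results = [None] * num_ranks

    def rank_main(rank):
        value = rank  # per-rank application state
        for step in range(1, steps + 1):
            value += step
            if step == checkpoint_at:
                # Coordination: wait until every rank has reached this step.
                barrier.wait()
                # Single-process checkpoint: each rank saves only its own state.
                path = os.path.join(ckpt_dir, f"rank{rank}.ckpt")
                with open(path, "w") as f:
                    json.dump({"rank": rank, "step": step, "value": value}, f)
                # Nobody resumes until every checkpoint file has been written.
                barrier.wait()
        results[rank] = value

    threads = [threading.Thread(target=rank_main, args=(r,)) for r in range(num_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def load_checkpoints(ckpt_dir, num_ranks):
    """Read back the per-rank checkpoint files in rank order."""
    saved = []
    for rank in range(num_ranks):
        with open(os.path.join(ckpt_dir, f"rank{rank}.ckpt")) as f:
            saved.append(json.load(f))
    return saved


def demo_coordinated():
    """Three ranks, four steps, one coordinated checkpoint after step 2."""
    ckpt_dir = tempfile.mkdtemp()
    final = coordinated_run(3, 4, 2, ckpt_dir)
    saved = load_checkpoints(ckpt_dir, 3)
    return final, saved


if __name__ == "__main__":
    final, saved = demo_coordinated()
    print(final)
    print(saved)
```

Because every rank checkpoints at the same step, the saved files together form a consistent snapshot of the whole job, which is what allows a parallel job to be restarted as a unit.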
Combining these free and open source components with CCS, the resource management system developed by UPB and used by HPC4U, makes it possible to test HPC4U's basic functionality. Users simply boot their computers from the provided DVD, which temporarily turns those computers into compute nodes, and can then test the fault tolerance mechanisms on a given application.