November 09, 2010
If time marches on, computing marches up. In the current terascale and early petascale era, a single machine can contain thousands of processors. Connecting all these processors requires even more hardware, and the more hardware there is, the greater the odds of component failure. Such is the subject of an article at Scientific Computing, in which author Doug Baxter urges his audience to think about accommodating hardware failure by redesigning the software.
Hardware fault-tolerance measures are in use today, but the drawbacks are many. Current ways of dealing with faulty hardware include predicting when a component is about to fail, making components hot-swappable, and proactively rescheduling software running on parts that are about to fail. These methods are helpful, but only on hardware that is actively monitored. Another workaround is hardware redundancy, but the expense can make it impractical. There is also checkpoint restarting, but the cost and logistics of checkpointing massive volumes of distributed memory can cancel out the benefits.
It is for these reasons that Baxter recommends looking to the software design community to achieve fault-tolerant computing. He reports that researchers have started working on this goal and categorizes their efforts into two groups: data-centric software and process-centric software. Baxter proceeds to explore a process-centric strategy. In order for process-centric HPC codes to accommodate hardware failures, Baxter says that there must first be a shift in software design paradigms and a discarding of outmoded assumptions. Some examples of the latter are that input/output operations never fail and are relatively inexpensive, and that communications calls always succeed. The assumption that Baxter sets out to debunk, however, and one he says is particularly entrenched, is that a consistent set of resources is available for the duration of a computation. He goes on to make his case in detail, including possible pitfalls with suggested solutions.
In the end, Baxter calls for the software developer community to "design locally synchronized, dynamically scheduled, and hierarchically managed applications that can complete computations despite the expected modest number of hardware component failures." Imagine an application that can sense a hardware failure and just work around it, like a car avoiding a large pothole, able to continue to its destination.
Full story at Scientific Computing