October 31, 2013
Within the next decade, technical advances are expected to yield high-performance computing platforms capable of delivering 10^18 floating-point operations per second. Experts anticipate that these exascale machines will exhibit billion-way parallelism. The sheer number of cores raises a number of challenges, and reliability concerns are among the most prominent. It's against this backdrop that HPC researchers have stepped up efforts to understand the phenomenon of soft errors and silent data corruption.
In slides from a 2010 talk, Los Alamos National Laboratory researcher Sarah E. Michalak defines a soft error as "an unintended change in the state of an electronic device that alters the information that it stores without destroying its functionality, e.g. a bit flip caused by a cosmic-ray-induced neutron."
One of the most troubling ways that soft errors can manifest is as silent data corruption (SDC), which occurs when a computing system delivers incorrect results without logging an error. The application may run to completion, but its results can differ from those it would have produced in the absence of soft errors. In some cases the outcome is an incorrect scientific answer; in others, the application can hang for long periods of time, or even indefinitely. (The specifics of soft error vulnerabilities are given further coverage in this SC12 paper.)
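To make the failure mode concrete, here is a minimal, illustrative C sketch (not taken from the research described here) that flips a single bit of a double-precision value: no exception is raised and nothing is logged, yet any calculation that consumes the value will quietly produce the wrong answer.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Illustrative only: flip one bit of a double and observe the silent change. */
int main(void) {
    double x = 3.141592653589793;
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);   /* reinterpret the 64-bit pattern */
    bits ^= (uint64_t)1 << 52;        /* flip one bit (here, the lowest exponent bit) */
    double corrupted;
    memcpy(&corrupted, &bits, sizeof corrupted);
    /* No exception is raised and nothing is logged; the value is simply wrong. */
    printf("original:  %.15g\ncorrupted: %.15g\n", x, corrupted);
    return 0;
}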
Michalak has spent years studying supercomputing reliability by conducting experiments on soft error rates and silent data corruption rates. Michalak and a team of LANL researchers describe one such experiment in a recent paper, "Field Testing of Production and Decommissioned High Performance Computing Platforms at Los Alamos National Laboratory." The authors begin with the assertion that "Silent Data Corruption (SDC) has the potential to threaten the integrity of scientific calculations performed on high performance computing (HPC) platforms and other systems."
They explain that SDC can be caused by many factors, which is partly what makes it such a frustrating problem. The main culprits that have been identified are temperature and voltage fluctuations, particles (neutrons, protons, and alphas), manufacturing residues, oxide breakdown, and electrostatic discharge.
Some researchers expect that SDC will become more prevalent as a result of new technologies where clock frequencies, transistor counts and noise levels increase while feature sizes and voltages decrease. And of course as SDC increases, reliability falters.
To better understand and mitigate this challenging problem, the LANL-based research team has undertaken a program of field testing in which applications are run and their results checked for correctness. The tests were performed on both production and decommissioned HPC platforms at Los Alamos National Laboratory.
The large-scale field testing study – occurring over a five-year period – was designed to measure the prevalence of incorrect results observed on the HPC platforms as well as to trace the causes of the incorrect results so they can be remedied. The authors state that it was not possible to check all system logs to ensure that an incorrect result was not accompanied by a related error or warning message, so it is not known what percentage of observed incorrect results were genuinely silent.
The paper lays out early results from the study across six test platforms: three LANL HPC platforms were tested during production use and five were tested after decommissioning, with some machines tested in both states. Testing on production machines was restricted to idle nodes, while testing on retired machines could typically use all available nodes. Section III of the paper details the architectures of all six HPC platforms, the specifics of the tests, and the resulting error counts.
The platforms were tested using the High-Performance Linpack (HPL) benchmark and the Crisscross MPI data-transfer test code. Over the five-year experiment period, the test specifications were continuously refined. In the early stages, researchers used HPL-based testing only, with a diverse set of HPL problem sizes and manipulation of the ambient temperature and system voltages. Later testing included fewer HPL problem sizes and omitted the temperature and voltage controls.
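As an illustration of what such a correctness check involves, below is a hedged sketch of a paired MPI exchange in which each rank sends a known pattern to a partner and the receiver verifies every word on arrival, so a corruption during transfer is detected rather than silent. The buffer size, rank pairing, and data pattern are assumptions chosen for this example; they are not the actual Crisscross parameters.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical sketch in the spirit of a data-transfer correctness test. */
#define N (1 << 20)

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    unsigned long *buf = malloc(N * sizeof *buf);
    int partner = rank ^ 1;              /* pair ranks 0-1, 2-3, ... */
    long errors = 0;

    if (partner < size) {
        for (int i = 0; i < N; i++)
            buf[i] = (unsigned long)rank * N + i;   /* known pattern */
        /* exchange buffers with the partner rank */
        MPI_Sendrecv_replace(buf, N, MPI_UNSIGNED_LONG, partner, 0,
                             partner, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int i = 0; i < N; i++)                 /* verify what arrived */
            if (buf[i] != (unsigned long)partner * N + i)
                errors++;
    }
    if (errors)
        fprintf(stderr, "rank %d: %ld corrupted words detected\n", rank, errors);

    free(buf);
    MPI_Finalize();
    return 0;
}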
The study to this point represents over 500 node-years of computation and nearly 80 PB of data transfers – over 35 PB of within-node data transfers, and almost 44 PB of between-node data transfers.
Incorrect results were observed when running HPL calculations on two platforms – both of these were decommissioned systems (see platform numbers 3 and 4 in the paper). In their concluding paragraphs, the authors write that "the characteristics of these results are suggestive of transient faults such as those resulting from neutron-induced errors."
As these were preliminary results, there is still much more to come. The next step for the LANL team will be processing and presenting the results from the rest of the tested platforms, as well as developing FIT (failures-in-time) estimates for all tested platforms.
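For readers unfamiliar with the metric, a FIT rate conventionally counts expected failures per billion (10^9) device-hours, so a point estimate can be formed by dividing an observed error count by the accumulated device-hours. The short sketch below uses made-up numbers purely to show the arithmetic; it does not reproduce any figures from the LANL study.

#include <stdio.h>

/* Hedged sketch of a FIT point estimate: failures per 10^9 device-hours.
   observed_failures is a hypothetical count; node_years echoes only the
   order of magnitude mentioned in the article. */
int main(void) {
    double observed_failures = 2.0;      /* hypothetical count */
    double node_years = 500.0;           /* illustrative scale */
    double device_hours = node_years * 365.25 * 24.0;
    printf("FIT estimate: %.2f\n", observed_failures / device_hours * 1e9);
    return 0;
}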