October 02, 2013
There are many important issues in advancing the field of HPC toward the exascale era, but among all these variables, roughly five sticking points really stand out: one of them is controlling for soft errors.
As the number of cores per machine increases, incorrect behaviors, known as soft errors, begin to threaten the validity of simulations. When you consider that exascale machines will employ billion-way parallelism, the necessity to address this problem is clear.
A team of scientists from PNNL performed experiments revealing the high risk that soft errors pose to large-scale computations. The research team found that, without intervention, soft errors invalidate simulations in a large fraction of cases, but it also developed a technique that corrects more than 95 percent of them.
According to their paper in the Journal of Chemical Theory and Computation, the next generation of systems will combine millions of cores, increasing the odds of soft errors and, with them, unexpected results.
"Even if every core is highly reliable the sheer number of them will mean that the mean time between failures will become so short that most application runs will suffer at least one fault. In particular soft errors caused by intermittent incorrect behavior of the hardware are a concern as they lead to silent data corruption," note the authors.
The only way to deal with these errors is to identify and remedy them. The paper explores the impact of soft errors on optimization algorithms, which start with an initial guess and iteratively reduce the error until a correct solution is obtained. For a concrete example, the team used the Hartree–Fock method from quantum chemistry.
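To make the idea concrete, here is a minimal sketch (an illustrative assumption, not the authors' code): a simple fixed-point iteration, analogous in spirit to the self-consistent loop in Hartree–Fock, that absorbs a small injected "soft error" and still converges to the right answer.

```python
# Hypothetical sketch: an iterative solver that starts from an initial guess
# and refines it until convergence, with an optional injected perturbation
# standing in for a soft error.
import math

def solve(tol=1e-12, max_iter=200, error_iter=None, error_magnitude=0.0):
    x = 1.0  # initial guess
    for i in range(max_iter):
        x_new = math.cos(x)           # one refinement step toward the fixed point
        if i == error_iter:
            x_new += error_magnitude  # simulate a bit-flip-like corruption
        if abs(x_new - x) < tol:
            return x_new, i + 1
        x = x_new
    return x, max_iter

clean, n_clean = solve()
perturbed, n_perturbed = solve(error_iter=50, error_magnitude=1e-3)
print(f"clean run:     x = {clean:.12f} in {n_clean} iterations")
print(f"perturbed run: x = {perturbed:.12f} in {n_perturbed} iterations")
```

In this toy problem the small corruption merely costs a few extra iterations; in a real method such as Hartree–Fock, a large corruption can drive the calculation to the wrong solution or prevent convergence altogether, which is the failure mode the paper examines.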
The results indicate that the optimization algorithms absorbed soft errors of small magnitude but not large ones. In other words, calculations still failed in a significant fraction of cases. The team suggests that error detection and correction mechanisms tailored to different classes of data structures can catch these large errors. They conclude that more than 95 percent of soft errors can be corrected using these techniques with only a modest increase in computational cost.
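The following sketch is a hypothetical illustration of that kind of mechanism (an assumption, not the paper's implementation): a read-mostly data structure is protected with a checksum plus a redundant copy, so a large silent corruption can be detected and repaired before it propagates into the calculation.

```python
# Hypothetical detection-and-correction sketch for a protected data block.
import hashlib
import numpy as np

class ProtectedArray:
    def __init__(self, data):
        self.data = np.array(data, dtype=np.float64)
        self.backup = self.data.copy()            # redundant copy
        self.checksum = self._digest(self.data)   # reference checksum

    @staticmethod
    def _digest(arr):
        return hashlib.sha256(arr.tobytes()).hexdigest()

    def verify_and_repair(self):
        """Return True if corruption was found and repaired."""
        if self._digest(self.data) == self.checksum:
            return False
        self.data = self.backup.copy()  # restore from the redundant copy
        return True

# Usage: corrupt one element, then detect and repair it.
block = ProtectedArray(np.zeros((4, 4)))
block.data[2, 3] = 1e6                 # simulate a large silent corruption
print(block.verify_and_repair())       # True: corruption caught and fixed
```

The cost of such a scheme is the extra memory for the copy and the periodic checksum verification, which is consistent with the paper's claim of only a modest increase in computational cost.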
The work was supported by the eXtreme Scale Computing Initiative using resources from the Environmental Molecular Sciences Laboratory, located at PNNL, as well as the PNNL Institutional Computing Facility. The paper was authored by PNNL researchers Hubertus J. J. van Dam, Abhinav Vishnu, and Wibe A. de Jong.