For the 2014 SIAM Conference on Parallel Processing for Scientific Computing (Portland, Oregon, February 18-21), Keita Teranishi, Jaideep Ray, Mike Heroux, and I (all of Sandia National Laboratories) organized a four-part (17-speaker) minisymposium on fault-tolerant numerical algorithms. This was the third in an annual sequence of minisymposia on the topic offered at SIAM conferences, alternating between CS&E and Parallel Processing.
For our purposes, a “fault” occurs when a computer system behaves incorrectly in a way that could affect the running application’s correctness or ability to make progress. The application “fails” if it actually produces incorrect results or fails to complete. We distinguish further between “hard” and “soft” faults. Hard faults cause the application to end prematurely or to wait forever. Soft faults let the application keep running (for a while, at least) but corrupt intermediate states; this could make the application return an incorrect result. Speakers in our minisymposium considered faults of both types.
Algorithms studied in our sessions include iterative sparse linear solvers, nonlinear multigrid, finite-difference stencils, time integration, dense matrix factorizations, fast Fourier transforms, and multilevel Monte Carlo. Talks also covered fault injection (both by software simulation and by an actual neutron beam directed at real hardware), programming models capable of identifying code and data that can go awry, and current trends in system reliability.
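To make the idea of fault injection by software simulation concrete, here is a minimal sketch. It is not taken from any of the talks. It assumes a soft fault can be modeled as a single bit flip in one IEEE 754 double, injected into a toy Jacobi iteration for a small linear system. The helper names (`flip_bit`, `jacobi`, `residual_norm`), the `fault` triple, and the 3-by-3 test problem are all hypothetical and chosen only for illustration.

```python
import struct


def flip_bit(x: float, bit: int) -> float:
    """Return x with one bit of its IEEE 754 binary64 encoding flipped.

    Bit 0 is the least significant mantissa bit; bit 63 is the sign bit.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in range 0..63")
    (encoded,) = struct.unpack("<Q", struct.pack("<d", x))
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", encoded ^ (1 << bit)))
    return corrupted


def jacobi(a, b, iterations, fault=None):
    """Run a fixed number of Jacobi sweeps on a x = b, starting from zero.

    fault, if given, is a hypothetical (sweep, index, bit) triple: after that
    sweep, one bit of one entry of the iterate is flipped (a simulated soft fault).
    """
    n = len(b)
    x = [0.0] * n
    for sweep in range(iterations):
        # The comprehension reads the previous iterate before x is rebound.
        x = [
            (b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
            for i in range(n)
        ]
        if fault is not None and fault[0] == sweep:
            _, index, bit = fault
            x[index] = flip_bit(x[index], bit)
    return x


def residual_norm(a, b, x):
    """Infinity norm of the residual b - a x."""
    n = len(b)
    return max(abs(b[i] - sum(a[i][j] * x[j] for j in range(n))) for i in range(n))


# Assumed toy problem: diagonally dominant, exact solution [1, 1, 1].
A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
B = [3.0, 2.0, 3.0]

clean = residual_norm(A, B, jacobi(A, B, 60))
early = residual_norm(A, B, jacobi(A, B, 60, fault=(5, 1, 63)))   # sign flip in sweep 5
late = residual_norm(A, B, jacobi(A, B, 60, fault=(59, 1, 63)))   # sign flip in last sweep
print(clean, early, late)
```

In this toy case, the early sign flip is damped by the remaining sweeps, so the final residual is as small as in the clean run. The flip in the last sweep leaves a large residual, which an inexpensive check at the end would detect. Real fault-injection studies use far richer fault models, so this should be read only as an illustration of the basic idea.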
The growing interest of the parallel algorithms community in this topic is reflected in the breadth of our minisymposium, as well as in the Wednesday-evening panel discussion (Resilience at Exascale: Should it Worry Application Developers?) and related talks in other minisymposia. This interest has waxed and waned over the years, leading many to question whether we should expend so much effort on the problem. In particular, talk of application fault tolerance seems to return as each parallel “scale” transition approaches (terascale, petascale, and the upcoming exascale). At previous scale transitions, systems stepped up to protect applications, with no need for contributions from the applications themselves. However, experience has established that providing this protection incurs hardware costs (a high-bandwidth global file system) and requires significant “system middleware” effort to make current resilience mechanisms (mainly checkpoint-restart) efficient at scale.
As the exascale era approaches, the confluence of several factors suggests that algorithm developers need to get involved soon. First, the current processor-building technology, CMOS (complementary metal-oxide semiconductor), is close to its physical size limits. As CMOS transistor gates shrink to the thickness of a few atoms, it gets harder and harder to make them behave correctly and consistently. Second, for about ten years, commodity processors have been achieving performance improvements not through frequency increases but through parallelism. More parallelism means more components, which makes faults more likely. Third, very large parallel computers are now limited mainly by power. Megawatts of electricity have to get into the facility, and megawatts of heat have to leave it. The first two trends suggest that hardware will only become less reliable, yet counteracting this by introducing redundancy will cost power or performance. Thus, vendors will find it more and more tempting to relax system correctness, but only as long as applications can make up for it. Finally, it is not clear that existing system mechanisms for recovering from the loss of a parallel process (in particular, global coordinated checkpoint-restart) will work at extreme scales. This is due to their high disk bandwidth requirements, which introduce hardware and power costs poorly related to the actual needs of applications. Together, these trends explain why we algorithm developers need to look at this problem now, so that we can be involved in system design discussions.
During the Resilience at Exascale panel discussion, the mood of both panel and audience shifted audibly as soon as Shekhar Borkar (Intel Corp.) hinted that computer hardware might start to need the cooperation of system software for correctness. Getting good performance on correct parallel computers is already hard. Consider the difficulty of extending a library that uses only MPI (the Message Passing Interface, a model for distributed-memory parallelism) so that it also exploits shared-memory parallelism. Even worse, consider that on future computers, arithmetic or storage could be silently incorrect, or a process could fail at any time, and algorithms (rather than the checkpointing library) would be asked to deal with it. The growing uncertainty about system reliability causes special concern, given that many of the algorithms our community develops are used in simulations to support high-consequence decisions. Silent wrong answers could cause death or injury, or could have unacceptable financial or environmental costs. Our minisymposium offered some hope, however, that algorithms will be able to address this concern.
In closing, we point out that some of the techniques that can make algorithms get the right answer despite hardware faults can also improve their resistance to other problems—including software bugs, input that violates an algorithm’s assumptions, and other issues that affect correctness (such as loss of energy conservation). It was inspiring to read the software development history of the Mars rover Curiosity in the Communications of the ACM (February 2014). If large-scale computers are indeed becoming more and more sensitive to the physics of the real world, we scientific and engineering algorithm developers can take lessons from the embedded and real-time control communities. Moreover, principled use of the least expensive and most generally applicable resilience techniques can make codes safer and easier to use, even if our guesses about fault rates turn out to be exaggerated.
Slides with synchronized audio for many sessions from the 2014 parallel processing conference, including the minisymposium discussed here, are available online.