Building and Using a Fault-Tolerant MPI Implementation
The International Journal of High Performance Computing Applications
Jack J. Dongarra
Fault Tolerance for OpenSHMEM