A survey of rollback-recovery protocols in message-passing systems
ACM Computing Surveys
David B. Johnson
Lorenzo Alvisi
E. N. (Mootaz) Elnozahy
Globally precise-restartable execution of parallel programs