Understanding GPU errors on large-scale HPC systems and the implications for system design and operation
Recovering transient data
2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)
ACM SIGOPS Operating Systems Review
Philippe Olivier Alexandre Navaux
Philippe Navaux
Paolo Rech
Nathan DeBardeleben
Dave Londo
Daniel Oliveira