Understanding GPU errors on large-scale HPC systems and the implications for system design and operation