Resilience for MPI and GAS Runtimes: Opportunities and Limitations
Time: Thu 2019-10-03 11.00 - 12.00
Lecturer: Kiril Dichev
Location: Room 304, Teknikringen 14, KTH
With the growing number of compute components and the advance towards exascale systems, the system-wide mean time between failures (MTBF) is decreasing. Over the last decade, this has forced researchers to reconsider various aspects of resilience, including both application-specific and runtime-specific resilience.
To detect and correct soft faults, such as silent data corruption (SDC), application-specific resilience can check kernel invariants.
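As an illustration of what checking a kernel invariant can look like, here is a minimal sketch for a matrix-vector product y = A x. It relies on the checksum identity sum(y) = (column sums of A) . x, a standard idea from algorithm-based fault tolerance. The function names and the tolerance are assumptions made for this example and are not taken from the talk.

```python
import math

def matvec(A, x):
    """Plain matrix-vector product; A is a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def column_sums(A):
    """Checksum row of A: the sum of each column."""
    return [sum(col) for col in zip(*A)]

def checksum_holds(A, x, y, rel_tol=1e-9):
    """Check the invariant sum(y) == (column sums of A) . x.

    A violation suggests that y (or A, or x) was corrupted.
    The tolerance is a hypothetical choice for this sketch.
    """
    expected = sum(c * xi for c, xi in zip(column_sums(A), x))
    return math.isclose(sum(y), expected, rel_tol=rel_tol, abs_tol=1e-12)

A = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, 1.0]
y = matvec(A, x)
ok_clean = checksum_holds(A, x, y)

# Simulate a silent corruption of one output element.
y_bad = list(y)
y_bad[0] += 1.0
ok_corrupt = checksum_holds(A, x, y_bad)
```

A check of this kind only detects the fault. Correcting it would need additional redundancy (for example, a second weighted checksum) or recomputation of the kernel.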
More severe faults, such as fail-stop errors, irreversibly lose data and therefore require a more generic recovery mechanism, such as checkpoint-restart (C/R). C/R has been studied for many years; for MPI, flexible modern libraries enable it through slim APIs. Task-based runtimes, by contrast, and in particular GAS runtimes for distributed execution, hold the promise of a simpler and more unified programming model. With regard to resilience, they can enable C/R without user extensions, for example via fully transparent task-level backup. However, some aspects are far from trivial to implement for GAS runtimes.
Another long-standing research topic in resilience is local rollback, as opposed to global rollback. For MPI, we will examine solutions based on the powerful and generic message logging protocols. For GAS runtimes, a less generic but simpler approach could be designed using the dependency graphs that these runtimes usually maintain. We demonstrate this idea, in a rather burdensome way, with an MPI kernel. This is another indication that task-based runtimes could provide many benefits with less programming overhead than MPI codes. Still, existing GAS runtimes have fallen short on the complex task of a fully featured resilient implementation.
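To make the dependency-graph idea concrete, the following sketch computes which tasks would have to be re-executed after a failure, assuming that the outputs of some tasks are lost while the outputs of all others survive (for example, through a backup). It is a hypothetical illustration only: the graph representation and the function name tasks_to_replay are assumptions, not the method presented in the talk or the API of any existing runtime.

```python
def tasks_to_replay(deps, lost, needed):
    """Return the set of tasks that must be re-executed.

    deps:   dict mapping each task to the tasks whose outputs it reads
    lost:   set of tasks whose outputs were lost in the failure
    needed: tasks whose outputs are required to continue

    Walk backwards from the needed tasks and stop at any task
    whose output survived, so only the affected part is replayed.
    """
    replay = set()
    stack = [t for t in needed if t in lost]
    while stack:
        task = stack.pop()
        if task in replay:
            continue
        replay.add(task)
        for parent in deps.get(task, ()):
            if parent in lost and parent not in replay:
                stack.append(parent)
    return replay

# Diamond-shaped graph: d reads from b and c, which both read from a.
deps = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}

# The failure destroyed the outputs of b and d; a and c survived.
replay = tasks_to_replay(deps, lost={"b", "d"}, needed=["d"])
```

In this example only b and d are replayed, while a global rollback would restart all four tasks from the last checkpoint.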