The 27th ASE (Advanced Supercomputing Environment) Seminar
|Intended for||General public / Enrolled students / International students / Alumni / Companies / University students|
|Date(s)||April 7, 2017 16:00 — 18:00|
|Location||Hongo Area Campus|
|Venue||Information Technology Center, 4th floor, room #413 (Teleconference room)|
|Registration Method||No advance registration required. Please just come and join.|
|Contact||Associate Professor Akihiro Ida, Secretary of ASE Seminar, Information Technology Center, The University of Tokyo. E-mail: ida AT cc.u-tokyo.ac.jp|
For the 27th ASE Seminar, we invite Dr. Faisal Shahzad from Erlangen University, Germany, to speak on fault tolerance in software running on parallel computers.
ABSTRACT: To use future generations of supercomputers efficiently, the High Performance Computing (HPC) community anticipates two prime challenges: fault tolerance and power usage. A significant share of faults in HPC systems consists of hard failures, which in many cases lead to process failures and eventually to job failure.
In this talk, I will present our fault tolerance approach developed within the scope of the SPPEXA-ESSEX project. We have developed a Checkpoint/Restart and Automatic Fault Tolerance (CRAFT) library that serves two purposes. First, it provides a framework that significantly reduces the effort needed to implement application-level checkpoint/restart methods in a program. The user can extend the library with additional user-specific data types, making them "checkpointable" for future use. Second, it provides a simpler interface for dynamic process recovery, thus enabling applications to recover automatically after process failures. For this purpose, we have used User-Level Failure Mitigation (ULFM), which is a prototype implementation of fault-tolerant MPI. We have significantly reduced the complexity of failure detection and of the application recovery mechanism. These two functionalities of CRAFT can be used either separately or in combination. The features, optimizations, and limitations of the CRAFT library will be discussed in detail.