The 27th ASE (Advanced Supercomputing Environment) Seminar

March 22, 2017


Type Lecture
Intended for General public / Enrolled students / International students / Alumni / Companies / University students
Date(s) April 7, 2017 16:00 — 18:00
Location Hongo Area Campus
Venue Information Technology Center, 4th floor, room #413 (Teleconference room)
Capacity 20 people
Entrance Fee No charge
Registration Method No advance registration required
Please just come and join.
Contact Associate Professor Akihiro Ida, Secretary of ASE Seminar
Information Technology Center, The University of Tokyo
E-mail: ida AT

For the 27th ASE Seminar, we invite Dr. Faisal Shahzad from Erlangen University, Germany, to speak on fault tolerance in software running on parallel computers.

ABSTRACT: To use future generations of supercomputers efficiently, the High Performance Computing (HPC) community anticipates two prime challenges: fault tolerance and power usage. A significant share of faults in HPC systems consists of hard failures, which in many cases lead to the failure of one or more processes and eventually of the whole job.

In this talk, I will present our fault tolerance approach, developed within the scope of the SPPEXA-ESSEX project. We have developed a Checkpoint/Restart and Automatic Fault Tolerance (CRAFT) library that serves two purposes. First, it provides a framework that significantly reduces the effort needed to implement application-level checkpoint/restart methods in a program. The user can extend the library with additional user-specific data types, making them checkpointable for future use. Second, it provides an easier interface for dynamic process recovery, thus enabling applications to recover automatically after process failures. For this purpose, we have used User-Level Failure Mitigation (ULFM), which is a prototype implementation of fault-tolerant MPI. We have significantly reduced the complexity of failure detection and of the application recovery mechanism. These two functionalities of CRAFT can be used either separately or in combination. The features, optimizations, and limitations of the CRAFT library will be discussed in detail.
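As a rough illustration of what application-level checkpoint/restart means in general, the following is a minimal single-process sketch in Python. It is not CRAFT's interface and does not use MPI or ULFM; the function names (save_checkpoint, load_checkpoint, run), the pickle file format, and the fail_at parameter used to simulate a failure are all assumptions made for this example only.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic, so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, n_iters, checkpoint_every, fail_at=None):
    """Sum 0..n_iters-1, checkpointing periodically and resuming if possible.

    fail_at is a hypothetical knob that simulates a process failure.
    """
    state = load_checkpoint(path) or {"iteration": 0, "total": 0}
    while state["iteration"] < n_iters:
        if fail_at is not None and state["iteration"] == fail_at:
            raise RuntimeError("simulated process failure")
        state["total"] += state["iteration"]
        state["iteration"] += 1
        if state["iteration"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

If the first run fails partway through, calling run again with the same checkpoint path resumes from the last saved iteration instead of starting over. Only the work done since the last checkpoint is repeated.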
