PPoPP 2021
Sat 27 February - Wed 3 March 2021
Wed 3 Mar 2021 11:40 - 11:55 - Session 9. Tasks, Threads, and Fault Tolerance Chair(s): Pascal Felber

Aggressive technology scaling trends have worsened the transient fault problem in high-performance computing (HPC) systems. Some faults are benign, but others can lead to silent data corruption (SDC), a serious problem in which a fault introduces an error into an HPC simulation that is not readily detected. Due to the insidious nature of SDCs, researchers have worked to understand their impact on applications. Previous studies have relied on expensive fault injection campaigns with uniform sampling to provide overall SDC rates, but this approach provides no feedback on code regions that were not sampled.

In this research, we develop a method to systematically analyze all fault injection sites in an application using a small number of fault injection experiments. We use fault propagation data from a fault injection experiment to predict the resiliency of other, untested fault sites and to obtain an approximate fault tolerance threshold for each site: the largest error that can be introduced at the site without producing incorrect simulation results. We define the collection of threshold values over all fault sites in the program as a fault tolerance boundary and propose a simple but efficient method to approximate the boundary. In our experiments, we show that our method reduces the number of fault injection samples required to understand a program’s resiliency by several orders of magnitude compared with a traditional fault injection study.
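The idea of reusing propagation data can be illustrated with a small sketch. It is not the authors' implementation, only one plausible reading of the abstract under strong simplifying assumptions: the program is a deterministic chain of steps, each fault site holds a single value that depends only on the previous site, faults are additive, and an output is "correct" when it deviates from the fault-free result by at most `tol`. The names `run`, `update_boundary`, `steps`, and `tol` are hypothetical. Under these assumptions, if one injected error propagates through later sites and the final output is still acceptable, then the deviation observed at each later site is an error magnitude that site is known to tolerate, so a single experiment raises the estimated threshold at several sites at once.

```python
from typing import Callable, List, Optional

Step = Callable[[float], float]


def run(steps: List[Step], x0: float,
        inject_at: Optional[int] = None, error: float = 0.0) -> List[float]:
    """Run the chain and return the value at every fault site.

    Site 0 is the input; site k is the value after k steps.
    If inject_at is given, `error` is added to the value at that site.
    """
    values = []
    x = x0
    for site in range(len(steps) + 1):
        if site > 0:
            x = steps[site - 1](x)
        if site == inject_at:
            x += error  # hypothetical additive fault model
        values.append(x)
    return values


def update_boundary(boundary: List[float], golden: List[float],
                    faulty: List[float], inject_at: int, tol: float) -> bool:
    """Use one experiment to raise threshold estimates at downstream sites.

    Returns False (and leaves the boundary untouched) when the faulty
    output deviates from the golden output by more than tol.
    """
    if abs(faulty[-1] - golden[-1]) > tol:
        return False
    for site in range(inject_at, len(golden)):
        # The deviation seen at this site was tolerated by the program.
        deviation = abs(faulty[site] - golden[site])
        boundary[site] = max(boundary[site], deviation)
    return True


# Toy program (an assumption for illustration): double, add one, halve.
steps = [lambda x: 2 * x, lambda x: x + 1, lambda x: 0.5 * x]
tol = 0.3

golden = run(steps, 1.0)
boundary = [0.0] * len(golden)  # estimated tolerated error per site

# One small injection at site 0 yields estimates for all four sites.
accepted = update_boundary(
    boundary, golden, run(steps, 1.0, inject_at=0, error=0.25), 0, tol)

# A larger injection corrupts the output and contributes nothing.
rejected = update_boundary(
    boundary, golden, run(steps, 1.0, inject_at=0, error=0.5), 0, tol)
```

In this sketch the values in `boundary` are lower bounds on each site's threshold, not the thresholds themselves, and taking absolute values ignores the sign of the error. A real program has branching data flow and many values per site, where this simple chain argument would not hold as stated.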

Conference Day
Wed 3 Mar

Displayed time zone: Eastern Time (US & Canada)

11:10 - 12:10
Session 9. Tasks, Threads, and Fault Tolerance (Main Conference)
Chair(s): Pascal Felber (University of Neuchâtel)
11:10
15m
Talk
Advanced Synchronization Techniques for Task-based Runtime Systems
Main Conference
David Álvarez (Barcelona Supercomputing Center), Kevin Sala (Barcelona Supercomputing Center), Marcos Maroñas (Barcelona Supercomputing Center), Aleix Roca (Barcelona Supercomputing Center), Vicenç Beltran (Barcelona Supercomputing Center)
11:25
15m
Talk
An Ownership Policy and Deadlock Detector for Promises
Main Conference
Caleb Voss (Georgia Institute of Technology), Vivek Sarkar (Georgia Institute of Technology)
11:40
15m
Talk
Understanding a Program's Resiliency Through Error Propagation
Main Conference
Zhimin Li, Harshitha Menon (Lawrence Livermore National Laboratory), Kathryn Mohror (Lawrence Livermore National Laboratory), Peer-Timo Bremer (Lawrence Livermore National Laboratory), Yarden Livnat (University of Utah), Valerio Pascucci (University of Utah)
11:55
15m
Talk
Lightweight Preemptive User-Level Threads
Main Conference
Shumpei Shiina (The University of Tokyo), Shintaro Iwasaki (Argonne National Laboratory), Kenjiro Taura (The University of Tokyo), Pavan Balaji (Argonne National Laboratory)