Computing the expected makespan of task graphs in the presence of silent errors - INRIA - Institut National de Recherche en Informatique et en Automatique Accéder directement au contenu
Communication Dans Un Congrès Année : 2016

Computing the expected makespan of task graphs in the presence of silent errors

Résumé

Applications structured as Directed Acyclic Graphs (DAGs) of tasks correspond to a general model of parallel computation that occurs in many domains, including popular scientific workflows. DAG scheduling has received an enormous amount of attention, and several list-scheduling heuristics have been proposed and shown to be effective in practice. Many of these heuristics make scheduling decisions based on path lengths in the DAG. At large scale, however, compute platforms and thus tasks are subject to various types of failures with no longer negligible probabilities of occurrence. Failures that have recently received increasing attention are " silent errors, " which cause a task to produce incorrect results even though it ran to completion. Tolerating silent errors is done by checking the validity of the results and re-executing the task from scratch in case of an invalid result. The execution time of a task then becomes a random variable, and so are path lengths. Unfortunately, computing the expected makespan of a DAG (and equivalently computing expected path lengths in a DAG) is a computationally difficult problem. Consequently, designing effective scheduling heuristics is preconditioned on computing accurate approximations of the expected makespan. In this work we propose an algorithm that computes a first order approximation of the expected makespan of a DAG when tasks are subject to silent errors. We compare our proposed approximation to previously proposed such approximations for three classes of application graphs from the field of numerical linear algebra. Our evaluations quantify approximation error with respect to a ground truth computed via a brute-force Monte Carlo method. We find that our proposed approximation outperforms previously proposed approaches, leading to large reductions in approximation error for low (and realistic) failure rates, while executing much faster.
Fichier principal
Vignette du fichier
icpp_2016.pdf (394.71 Ko) Télécharger le fichier
Origine : Fichiers produits par l'(les) auteur(s)
Loading...

Dates et versions

hal-01354711 , version 1 (19-08-2016)

Identifiants

  • HAL Id : hal-01354711 , version 1

Citer

Henri Casanova, Julien Herrmann, Yves Robert. Computing the expected makespan of task graphs in the presence of silent errors. Ninth International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2), 2016, Aug 2016, Philadelphia, United States. ⟨hal-01354711⟩
277 Consultations
324 Téléchargements

Partager

Gmail Facebook X LinkedIn More