Checkpointing strategies for parallel jobs. - INRIA - Institut National de Recherche en Informatique et en Automatique
Conference paper. Year: 2011

Checkpointing strategies for parallel jobs.

Abstract

This work provides an analysis of checkpointing strategies for minimizing expected job execution times in an environment that is subject to processor failures. In the case of both sequential and parallel jobs, we give the optimal solution for exponentially distributed failure inter-arrival times, which, to the best of our knowledge, is the first rigorous proof that periodic checkpointing is optimal. For non-exponentially distributed failures, we develop a dynamic programming algorithm to maximize the amount of work completed before the next failure, which provides a good heuristic for minimizing the expected execution time. Our work considers various models of job parallelism and of parallel checkpointing overhead. We first perform extensive simulation experiments assuming that failures follow Exponential or Weibull distributions, the latter being more representative of real-world systems. The obtained results not only corroborate our theoretical findings, but also show that our dynamic programming algorithm significantly outperforms previously proposed solutions in the case of Weibull failures. We then discuss results from simulation experiments that use failure logs from production clusters. These results confirm that our dynamic programming algorithm significantly outperforms existing solutions for real-world clusters.
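To make the exponential case concrete, here is a minimal sketch of periodic checkpointing for a sequential job. It is an illustration under stated assumptions, not code from the paper. The assumed model: failures arrive at a constant rate, each failure costs a fixed downtime during which no further failure occurs, recovery then takes a fixed time and can itself be interrupted, and the job is cut into equal chunks that each end with a checkpoint. Under that model the expected time per chunk has the closed form used below. All function and parameter names are hypothetical, and the brute-force search over the number of chunks is a convenience chosen for this sketch.

```python
import math


def expected_makespan(work, chunks, ckpt, downtime, recovery, rate):
    """Expected completion time with `chunks` equal chunks, each followed by a checkpoint.

    Assumed model (not taken from the abstract): exponential failures with
    rate `rate`, no failure during `downtime`, failures possible during
    `recovery`, and a failed chunk restarts from the last checkpoint.
    """
    chunk = work / chunks
    # Expected time for one chunk plus its checkpoint:
    #   exp(rate * recovery) * (1 / rate + downtime) * (exp(rate * (chunk + ckpt)) - 1)
    per_chunk = (
        math.exp(rate * recovery)
        * (1.0 / rate + downtime)
        * math.expm1(rate * (chunk + ckpt))
    )
    return chunks * per_chunk


def best_chunk_count(work, ckpt, downtime, recovery, rate, max_chunks=10_000):
    """Brute-force the chunk count in 1..max_chunks with the smallest expected makespan."""
    return min(
        range(1, max_chunks + 1),
        key=lambda k: expected_makespan(work, k, ckpt, downtime, recovery, rate),
    )


if __name__ == "__main__":
    # Hypothetical numbers: 1000 s of work, 10 s checkpoints, 1 s downtime,
    # 10 s recovery, and a mean time between failures of 500 s.
    k = best_chunk_count(1000.0, 10.0, 1.0, 10.0, 1.0 / 500.0, max_chunks=200)
    print(k, expected_makespan(1000.0, k, 10.0, 1.0, 10.0, 1.0 / 500.0))
```

The trade-off is visible in the formula: more chunks mean less work lost per failure but more time spent checkpointing, so the expected makespan first falls and then rises as the chunk count grows.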
Main file
paper.pdf (429.74 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-00738504, version 1 (04-10-2012)

Identifiers

  • HAL Id: hal-00738504, version 1

Cite

Marin Bougeret, Henri Casanova, Mikael Rabie, Yves Robert, Frédéric Vivien. Checkpointing strategies for parallel jobs. SuperComputing (SC) - International Conference for High Performance Computing, Networking, Storage and Analysis, 2011, United States. pp.1-11. ⟨hal-00738504⟩