An optimal algorithm for scheduling checkpoints with variable costs - INRIA - Institut National de Recherche en Informatique et en Automatique
Report (Technical Report) Year: 2010

An optimal algorithm for scheduling checkpoints with variable costs

Abstract

Over the last decade, computing systems have turned to large-scale parallel platforms composed of thousands of processors. Many current applications run on such systems for long durations, up to several days or weeks. Recent statistical studies of failures on high-performance computing platforms emphasize that the mean time between failures may not exceed a few hours. Thus, it is necessary to develop efficient strategies that ensure the safe and reliable completion of applications. This may be achieved through redundancy or by storing intermediate computation states on reliable external devices. Saved states are then used to restart computations from the last checkpoint. The latter approach, called checkpointing, is one of the most popular fault-tolerance techniques in parallel systems.
Main file
trystram_fault_tolerance.pdf (221.46 KB)
Origin: Files produced by the author(s)

Dates and versions

inria-00558861 , version 1 (24-01-2011)

Identifiers

  • HAL Id : inria-00558861 , version 1

Cite

Mohamed Slim Bouguerra, Denis Trystram, Frédéric Wagner. An optimal algorithm for scheduling checkpoints with variable costs. [Technical Report] 2010. ⟨inria-00558861⟩