Assessing general-purpose algorithms to cope with fail-stop and silent errors - INRIA - Institut National de Recherche en Informatique et en Automatique
Report (Research Report) Year: 2014

Assessing general-purpose algorithms to cope with fail-stop and silent errors

Abstract

In this paper, we combine the traditional checkpointing and rollback recovery strategies with verification mechanisms to cope with both fail-stop and silent errors. The objective is to minimize makespan and/or energy consumption. For divisible load applications, we use first-order approximations to find the optimal checkpointing period to minimize execution time, with an additional verification mechanism to detect silent errors before each checkpoint, hence extending the classical formula by Young and Daly for fail-stop errors only. We further extend the approach to include intermediate verifications, and to consider a bi-criteria problem involving both time and energy (linear combination of execution time and energy consumption). Then, we focus on application workflows whose dependence graph is a linear chain of tasks. Here, we determine the optimal checkpointing and verification locations, with or without intermediate verifications, for the bi-criteria problem. Rather than using a single speed during the whole execution, we further introduce a new execution scenario, which allows for changing the execution speed via dynamic voltage and frequency scaling (DVFS). In this latter scenario, we determine the optimal checkpointing and verification locations, as well as the optimal speed pairs for each task segment between any two consecutive checkpoints. Finally, we conduct an extensive set of simulations to support the theoretical study, and to assess the performance of each algorithm, showing that the best overall performance is achieved under the most flexible scenario using intermediate verifications and different speeds.
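As a rough illustration of the first-order reasoning mentioned in the abstract, the sketch below computes a checkpointing period under a simplified waste model. It is not taken from the report. It assumes a checkpoint cost `C`, a verification cost `V` paid before each checkpoint, a fail-stop error rate `lambda_f`, and a silent error rate `lambda_s` (all names hypothetical). The assumed waste per unit of work is `(V + C)/T + (lambda_f/2 + lambda_s) * T`: a fail-stop error loses half a period on average, while a silent error is only caught by the verification at the end of the period and so loses all of it. With `V = 0` and `lambda_s = 0`, the result reduces to the classical Young/Daly period `sqrt(2 * C / lambda_f)`.

```python
import math


def young_daly_period(checkpoint_cost: float, fail_stop_rate: float) -> float:
    """Classical first-order period for fail-stop errors only: sqrt(2C / lambda_f)."""
    return math.sqrt(2.0 * checkpoint_cost / fail_stop_rate)


def first_order_period(
    checkpoint_cost: float,
    verification_cost: float,
    fail_stop_rate: float,
    silent_rate: float,
) -> float:
    """Period minimizing the ASSUMED waste model
        (V + C) / T + (lambda_f / 2 + lambda_s) * T,
    which is a simplification and not necessarily the report's exact model.
    """
    overhead = verification_cost + checkpoint_cost
    # Setting the derivative of the waste to zero gives
    # T = sqrt(2 (V + C) / (lambda_f + 2 lambda_s)).
    return math.sqrt(2.0 * overhead / (fail_stop_rate + 2.0 * silent_rate))


if __name__ == "__main__":
    # Illustrative values only (seconds and errors per second).
    print(young_daly_period(50.0, 1e-4))
    print(first_order_period(30.0, 20.0, 1e-4, 5e-5))
```

In this model, silent errors count twice as much as fail-stop errors in the denominator, so a given silent error rate shortens the period more than the same fail-stop rate would.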
Main file
RR-8599_extended.pdf (1009.21 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-01066664 , version 1 (23-09-2014)
hal-01066664 , version 2 (08-10-2014)
hal-01066664 , version 3 (10-10-2014)
hal-01066664 , version 4 (06-01-2015)
hal-01066664 , version 5 (09-02-2016)

Identifiers

  • HAL Id: hal-01066664, version 5

Cite

Anne Benoit, Aurélien Cavelan, Yves Robert, Hongyang Sun. Assessing general-purpose algorithms to cope with fail-stop and silent errors. [Research Report] RR-8599, INRIA. 2014, pp.42. ⟨hal-01066664v5⟩
