Analysis and Modeling of Time-Correlated Failures in Large-Scale Distributed Systems

Nezih Yigitbasi; Matthieu Gallet; Derrick Kondo; Alexandru Iosup; Dick Epema

doi:10.1109/GRID.2010.5697961

Conference paper, Year: 2010


Nezih Yigitbasi

Role: Author

Delft University of Technology

Matthieu Gallet

Role: Corresponding author
PersonId : 16856
IdHAL : matthieu-gallet

Laboratoire de l'Informatique du Parallélisme

Algorithms and Scheduling for Distributed Heterogeneous Platforms

Derrick Kondo

Role: Corresponding author
PersonId : 849131

Middleware efficiently scalable

Alexandru Iosup

Role: Author

Parallel and Distributed Group

Delft University of Technology

Dick Epema

Role: Author

Parallel and Distributed Group

Abstract

The analysis and modeling of the failures bound to occur in today's large-scale production systems is invaluable in providing the understanding needed to make these systems fault-tolerant yet efficient. Many previous studies have modeled failures without taking into account their time-varying behavior, under the assumption that failures are independent and identically distributed. However, the presence of time correlations between failures (such as peak periods with an increased failure rate) refutes this assumption and can have a significant impact on the effectiveness of fault-tolerance mechanisms. For example, a proactive fault-tolerance mechanism is more effective if the failures are periodic or predictable; similarly, the performance of checkpointing, redundancy, and scheduling solutions depends on the frequency of failures. In this study we analyze and model the time-varying behavior of failures in large-scale distributed systems. Our study is based on nineteen failure traces obtained from (mostly) production large-scale distributed systems, including grids, P2P systems, DNS servers, web servers, and desktop grids. We first investigate the time correlation of failures, and find that many of the studied traces exhibit strong daily patterns and high autocorrelation. Then, we derive a model that focuses on the peak failure periods occurring in real large-scale distributed systems. Our model characterizes the duration of peaks, the peak inter-arrival time, the inter-arrival time of failures during the peaks, and the duration of failures during peaks; for each of these we determine the best-fitting probability distribution from a set of several candidate distributions, and present the parameters of the best fit. Finally, we validate our model against the nineteen real failure traces, and find that the failures it characterizes are responsible on average for over 50% and up to 95% of the downtime of these systems.
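The two analysis steps named in the abstract, measuring time correlation and extracting peak failure periods, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the synthetic hourly counts, the function names, and in particular the peak rule (a time slot counts as part of a peak when its failure count exceeds the mean plus `k` standard deviations), which is not taken from the paper. The distribution-fitting step is not shown.

```python
from statistics import mean, pstdev


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a positive integer `lag`."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(series)")
    mu = mean(series)
    var = sum((x - mu) ** 2 for x in series)
    if var == 0:
        return 0.0  # a constant series carries no correlation information
    cov = sum((series[t] - mu) * (series[t + lag] - mu) for t in range(n - lag))
    return cov / var


def find_peaks(series, k=1.0):
    """Return (start, end) slot indices, end exclusive, of maximal runs
    whose count exceeds mean + k * std (hypothetical peak rule)."""
    threshold = mean(series) + k * pstdev(series)
    peaks, start = [], None
    for i, count in enumerate(series):
        if count > threshold and start is None:
            start = i  # a peak begins
        elif count <= threshold and start is not None:
            peaks.append((start, i))  # the peak ended at the previous slot
            start = None
    if start is not None:
        peaks.append((start, len(series)))
    return peaks


def peak_statistics(peaks):
    """Peak durations and peak inter-arrival times (start to start), in slots."""
    durations = [end - start for start, end in peaks]
    inter_arrivals = [b[0] - a[0] for a, b in zip(peaks, peaks[1:])]
    return durations, inter_arrivals


# Synthetic hourly failure counts (made up): three days, with a burst
# at hours 9 and 10 of each day.
day = [1] * 24
day[9] = day[10] = 8
counts = day * 3

daily_acf = autocorrelation(counts, 24)  # correlation at a one-day lag
peaks = find_peaks(counts)
durations, inter_arrivals = peak_statistics(peaks)
```

On real traces, the resulting `durations` and `inter_arrivals` samples would be the inputs to the distribution-fitting step the abstract describes; the choice of slot width and threshold would need to follow the paper's own definitions.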

Domains

Distributed, Parallel, and Cluster Computing [cs.DC]

Contributor: Equipe Roma

https://inria.hal.science/hal-00786266

Submitted on: Friday, February 8, 2013, 11:33:12

Last modified on: Thursday, April 4, 2024, 21:29:51

Dates and versions

hal-00786266 , version 1 (08-02-2013)

Identifiers

HAL Id : hal-00786266 , version 1
DOI : 10.1109/GRID.2010.5697961

Cite

Nezih Yigitbasi, Matthieu Gallet, Derrick Kondo, Alexandru Iosup, Dick Epema. Analysis and Modeling of Time-Correlated Failures in Large-Scale Distributed Systems. Grid 2010 - Proceedings of the 11th ACM/IEEE International Conference on Grid Computing, 2010, Bruxelles, Belgium. pp.355-366, ⟨10.1109/GRID.2010.5697961⟩. ⟨hal-00786266⟩

Collections

ENS-LYON UNIV-RENNES1 UGA CNRS INRIA UNIV-LYON1 IRISA LIG GRID5000 INRIA2 UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES UDL SILECS UR1-MATH-NUM LIG_SIDCH
