Self-Healing of Operational Workflow Incidents on Distributed Computing Infrastructures

Rafael Ferreira da Silva; Tristan Glatard; Frédéric Desprez

Report (Research Report) Year: 2012

Rafael Ferreira da Silva

Role: Corresponding author
PersonId : 916317

Images et Modèles

Tristan Glatard

Role: Author
PersonId : 867504

Images et Modèles

Frédéric Desprez

Role: Author
PersonId : 6600
IdHAL : frederic-desprez
IdRef : 034430563

Laboratoire de l'Informatique du Parallélisme

Abstract

Distributed computing infrastructures are commonly used through scientific gateways, but operating these gateways requires significant human intervention to handle operational incidents. This report presents a self-healing process that quantifies incident degrees of workflow activities from metrics measuring long-tail effect, application efficiency, data transfer issues, and site-specific problems. These metrics are simple enough to be computed online, and they make few assumptions about application or resource characteristics. Based on their degree, incidents are classified into levels and associated with sets of healing actions, which are selected using association rules that model correlations between incident levels. We specifically study the long-tail effect and propose a new algorithm to control task replication. The healing process is parametrized on real application traces acquired in production on the European Grid Infrastructure. Experimental results obtained in the Virtual Imaging Platform show that the proposed method speeds up execution by up to a factor of 4, consumes up to 26% less resource time than a control execution, and properly detects unrecoverable errors.

Distributed computing infrastructures are commonly used through dedicated application environments, but administering these environments requires significant human effort to resolve the incidents that arise in production. This report presents an automatic administration method that quantifies the degree of incidents affecting workflow activities. This degree is obtained from metrics measuring the delay of the last tasks, application efficiency, data transfer problems, and the specificity of an incident to a site. These metrics are simple enough to be computed online, and they make very few assumptions about application and resource characteristics. Based on their degree, incidents are classified into levels and associated with sets of actions selected from association rules that model the correlation between levels. We study in particular the delay of the last tasks and propose an algorithm to control their replication. Our automatic administration method is parametrized from traces of real applications acquired in production on the European Grid Infrastructure (EGI). Experimental results obtained on the Virtual Imaging Platform (VIP) show that the method can speed up execution by up to a factor of 4, saves 26% of resources compared with a control execution, and correctly detects incidents that cannot be resolved.
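The step described in the abstract, where an incident degree is classified into a level and the level is associated with a healing action, can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the report: the degree range of [0, 1], the threshold values, the incident names used as keys, and the action strings are all hypothetical. The sketch also uses a fixed lookup table and does not model the association rules the report uses to select actions.

```python
from bisect import bisect_right

# Hypothetical thresholds separating incident levels. The report parametrizes
# its healing process on production traces; those values are not shown here.
THRESHOLDS = {
    "long_tail": [0.3, 0.7],
    "data_transfer": [0.5],
}

# Hypothetical healing actions keyed by (incident, level).
# A missing key means that no action is configured for that level.
ACTIONS = {
    ("long_tail", 1): "replicate slowest tasks",
    ("long_tail", 2): "replicate and alert operator",
    ("data_transfer", 1): "replicate input files",
}


def incident_level(incident: str, degree: float) -> int:
    """Map a degree assumed to lie in [0, 1] to a discrete level (0 = lowest)."""
    if not 0.0 <= degree <= 1.0:
        raise ValueError("degree must lie in [0, 1]")
    # bisect_right counts how many thresholds are less than or equal to degree.
    return bisect_right(THRESHOLDS[incident], degree)


def select_actions(degrees: dict[str, float]) -> list[str]:
    """Return the configured action for each incident, in input order."""
    actions = []
    for incident, degree in degrees.items():
        action = ACTIONS.get((incident, incident_level(incident, degree)))
        if action is not None:
            actions.append(action)
    return actions
```

With these assumed tables, `select_actions({"long_tail": 0.8, "data_transfer": 0.2})` selects only the long-tail action, because the data-transfer degree stays below its single threshold.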

Keywords

Error detection and handling; production distributed systems; workflow execution

Domains

Distributed, Parallel, and Cluster Computing [cs.DC]

Main file

RR-8022.pdf (1.05 MB)

Origin: Files produced by the author(s)

Contributor: Frédéric Desprez

https://inria.hal.science/hal-00720369

Submitted on: Tuesday, 24 July 2012, 13:34:15

Last modified on: Thursday, 11 May 2023, 11:56:10

Long-term archiving on: Friday, 16 December 2016, 02:38:52

Dates and versions

hal-00720369, version 1 (24-07-2012)

Identifiers

HAL Id: hal-00720369, version 1

Cite

Rafael Ferreira da Silva, Tristan Glatard, Frédéric Desprez. Self-Healing of Operational Workflow Incidents on Distributed Computing Infrastructures. [Research Report] RR-8022, INRIA. 2012, pp.24. ⟨hal-00720369⟩

Collections

UNIV-ST-ETIENNE ENS-LYON CNRS INRIA UNIV-LYON1 INSA-LYON INRIA-RRRT CREATIS LARA INSA-GROUPE UDL

184 views

311 downloads
