Is it Worth Relaxing Fault Tolerance to Speed Up Decommission in Distributed Storage Systems? - INRIA - Institut National de Recherche en Informatique et en Automatique
Conference paper, Year: 2019

Is it Worth Relaxing Fault Tolerance to Speed Up Decommission in Distributed Storage Systems?

Abstract

Efficient resource utilization is a major concern for large-scale computer platforms. One method used to lower energy consumption and operational cost is to reduce the amount of idle resources. This can be achieved by using malleability, namely, the possibility for resource managers to dynamically increase or decrease the amount of resources of jobs while they are running. Decommissioning (i.e., removing from the cluster) the idle nodes as soon as possible allows the resource manager to quickly reallocate those nodes to other jobs. Challenges appear when such nodes host part of a distributed storage system. Such storage systems may need to transfer large amounts of data before releasing the nodes, in order to ensure data availability and a certain level of fault tolerance. In this paper, we model and evaluate the performance of the decommission operation when relaxing the level of fault tolerance (i.e., the number of replicas) during this operation. Intuitively, this is expected to reduce the amount of data transfers needed before nodes are released, and thus allow nodes to be returned to the resource manager faster. We quantify theoretically how much time and resources are saved by such a fast decommission strategy compared with a standard decommission that does not temporarily reduce the fault-tolerance level. We establish lower bounds for the duration of the different phases of a fast decommission. We use the lower bounds to estimate when fast decommission would be useful to reduce the usage of core-hours and when not. We implement a prototype for fast decommission and experimentally validate the lower bounds on the duration of the operation and confirm in practice our theoretical findings.
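The trade-off described in the abstract can be illustrated with a small back-of-the-envelope model. The sketch below is not the model from the paper. It is a simplified illustration under assumptions introduced here: every node stores the same amount of data, each object has `r` replicas placed uniformly at random on distinct nodes, every node sends and receives at the same bandwidth, and the relaxed fault-tolerance level is a single surviving replica. All function and parameter names are hypothetical.

```python
from math import comb


def standard_release_time(n, x, data_per_node, bandwidth):
    """Lower bound on the time before x of n nodes can be released when the
    replication level is preserved (toy model, not the paper's).

    Every replica hosted on a leaving node must be recreated on one of the
    n - x remaining nodes, so reception on those nodes is the bottleneck.
    """
    return (x * data_per_node) / ((n - x) * bandwidth)


def fast_must_move(n, x, r, data_per_node):
    """Volume that must leave the departing nodes before release when only
    one surviving replica per object is required (assumed relaxation).

    Under uniform random placement, an object has all r replicas on the
    x leaving nodes with probability C(x, r) / C(n, r).
    """
    unique_data = n * data_per_node / r
    return unique_data * comb(x, r) / comb(n, r)


def fast_release_time(n, x, r, data_per_node, bandwidth):
    """Lower bound on the time before release for the relaxed strategy.

    Only leaving nodes hold the data to move, so throughput is capped by
    the smaller of the sender and receiver groups.
    """
    links = min(x, n - x)
    return fast_must_move(n, x, r, data_per_node) / (links * bandwidth)


def fast_recovery_time(n, x, r, data_per_node, bandwidth):
    """Lower bound on the time the remaining nodes need, after release,
    to bring every object back to r replicas."""
    remaining = x * data_per_node - fast_must_move(n, x, r, data_per_node)
    return remaining / ((n - x) * bandwidth)


if __name__ == "__main__":
    # Hypothetical cluster: 20 nodes, 8 leaving, 3 replicas,
    # 100 GB per node, 1 GB/s per node.
    n, x, r, d, b = 20, 8, 3, 100.0, 1.0
    t_std = standard_release_time(n, x, d, b)
    t_fast = fast_release_time(n, x, r, d, b)
    t_rec = fast_recovery_time(n, x, r, d, b)
    print(f"standard release: {t_std:.1f} s")
    print(f"fast release:     {t_fast:.1f} s (+ {t_rec:.1f} s recovery)")
    print(f"node-seconds returned earlier: {x * (t_std - t_fast):.1f}")
```

In this toy model the relaxed strategy releases nodes sooner, but the release and recovery phases together last at least as long as a standard decommission, which is why the benefit depends on how the freed nodes and the slower recovery are accounted for.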
Main file
Paper.pdf (184.76 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-02116727 , version 1 (01-05-2019)

Cite

Nathanaël Cheriere, Matthieu Dorier, Gabriel Antoniu. Is it Worth Relaxing Fault Tolerance to Speed Up Decommission in Distributed Storage Systems?. CCGrid 2019 - IEEE/ACM International Symposium in Cluster, Cloud, and Grid Computing, May 2019, Larnaca, Cyprus. pp.1-10, ⟨10.1109/CCGRID.2019.00024⟩. ⟨hal-02116727⟩
