FP-Hadoop: Efficient Processing of Skewed MapReduce Jobs

Miguel Liroz-Gistau; Reza Akbarinia; Divyakant Agrawal; Patrick Valduriez

doi:10.1016/j.is.2016.03.008

Journal article, Information Systems, Year: 2016



Miguel Liroz-Gistau

Role: Author
PersonId: 901689

Scientific Data Management

Reza Akbarinia

Role: Author
PersonId: 172647
IdHAL: reza-akbarinia
ORCID: 0000-0002-7098-0361
IdRef: 119863421

Scientific Data Management

Divyakant Agrawal

Role: Author

University of California [Santa Barbara]

Patrick Valduriez

Role: Author
PersonId: 172604
IdHAL: patrick-valduriez
ORCID: 0000-0001-6506-7538
IdRef: 028314417

Scientific Data Management

Abstract

Nowadays, we are witnessing the fast production of very large amounts of data, particularly by the users of online systems on the Web. However, processing this big data is very challenging, since both space and computational requirements are hard to satisfy. One solution for dealing with such requirements is to take advantage of parallel frameworks, such as MapReduce or Spark, which make it possible to build powerful computing and storage units on top of ordinary machines. Although these key-based frameworks have been praised for their high scalability and fault tolerance, they perform poorly in the presence of data skew. There are important cases where a high percentage of the processing on the reduce side ends up being done by a single node. In this paper, we present FP-Hadoop, a Hadoop-based system that makes the reduce side of MapReduce more parallel by efficiently tackling the problem of reduce data skew. FP-Hadoop introduces a new phase, denoted intermediate reduce (IR), where blocks of intermediate values are processed by intermediate reduce workers in parallel. With this approach, even when all intermediate values are associated with the same key, the main part of the reducing work can be performed in parallel, taking advantage of the computing power of all available workers. We implemented a prototype of FP-Hadoop and conducted extensive experiments over synthetic and real datasets. We achieved excellent performance gains compared to native Hadoop, e.g., more than 10 times in reduce time and 5 times in total execution time.
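
The intermediate reduce idea described in the abstract can be illustrated with a minimal sketch. This is not FP-Hadoop's API: the function names (`split_into_blocks`, `intermediate_reduce`, `final_reduce`, `skewed_reduce`), the block size, and the choice of a sum as the reduce operation are all assumptions made for illustration. It also assumes the reduce operation can be applied to partial blocks and merged afterwards. A thread pool stands in for the intermediate reduce workers.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def split_into_blocks(values: Sequence[int], block_size: int) -> List[Sequence[int]]:
    """Cut the intermediate values of one key into fixed-size blocks."""
    return [values[i:i + block_size] for i in range(0, len(values), block_size)]


def intermediate_reduce(block: Sequence[int]) -> int:
    """Hypothetical IR function: partially aggregate one block (here, a sum)."""
    return sum(block)


def final_reduce(partials: Sequence[int]) -> int:
    """Final reduce: merge the partial results produced by the IR workers."""
    return sum(partials)


def skewed_reduce(values: Sequence[int], block_size: int = 1000, workers: int = 4) -> int:
    """Reduce all values of a single key via a parallel intermediate phase."""
    blocks = split_into_blocks(values, block_size)
    # Each block is handed to a worker independently of its key, so the
    # work is spread out even when every value shares the same key.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(intermediate_reduce, blocks))
    return final_reduce(partials)


# Extreme skew: every intermediate value belongs to the same key.
values_for_hot_key = list(range(10_000))
total = skewed_reduce(values_for_hot_key, block_size=1_000, workers=4)
```

In this sketch the final reducer only merges ten partial sums instead of scanning ten thousand values, which is the effect the abstract attributes to the IR phase. The threads here only model the structure; a real deployment would distribute the blocks across cluster workers.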

Keywords

MapReduce; Data Skew; Parallel Data Processing

Domains

Information Retrieval [cs.IR]

Main file

infsys.pdf (362.54 KB)

Origin: Files produced by the author(s)


https://hal-lirmm.ccsd.cnrs.fr/lirmm-01377715

Submitted on: Friday, October 7, 2016, 14:33:42

Last modified on: Thursday, February 15, 2024, 03:31:44

Long-term archiving on: Friday, February 3, 2017, 19:10:39

Dates and versions

lirmm-01377715, version 1 (07-10-2016)

Identifiers

HAL Id: lirmm-01377715, version 1
DOI: 10.1016/j.is.2016.03.008

Cite

Miguel Liroz-Gistau, Reza Akbarinia, Divyakant Agrawal, Patrick Valduriez. FP-Hadoop: Efficient Processing of Skewed MapReduce Jobs. Information Systems, 2016, 60, pp.69-84. ⟨10.1016/j.is.2016.03.008⟩. ⟨lirmm-01377715⟩


Collections

UNIV-RENNES1 CNRS INRIA IRISA GRID5000 ZENITH LIRMM INRIA2 UR1-MATH-STIC UR1-UFR-ISTIC MIPS UNIV-MONTPELLIER UNIV-RENNES SILECS UR1-MATH-NUM

169 Views

571 Downloads
