Data placement in massively distributed environments for fast parallel mining of frequent itemsets

Saber Salah; Reza Akbarinia; Florent Masseglia

doi:10.1007/s10115-017-1041-5

Journal article, Knowledge and Information Systems (KAIS), Year: 2017


Saber Salah

Role: Author
PersonId: 967928

Scientific Data Management

Reza Akbarinia

Role: Author
PersonId: 172647
IdHAL: reza-akbarinia
ORCID: 0000-0002-7098-0361
IdRef: 119863421

Scientific Data Management

Florent Masseglia

Role: Author
PersonId: 172896
IdHAL: florent-masseglia
ORCID: 0000-0002-1149-585X
IdRef: 120528681

Scientific Data Management

Abstract

Frequent itemset mining is one of the fundamental building blocks of data mining. However, despite the crucial recent advances made in the data mining literature, few solutions, whether standard or improved, scale. This is particularly the case when (i) the quantity of data is very large and/or (ii) the minimum support is very low. In this paper, we address the problem of parallel frequent itemset mining (PFIM) in very large databases, and study the impact and effectiveness of specific data placement strategies in a massively distributed environment. By offering a clever data placement and an optimal organization of the extraction algorithms, we show that the arrangement of both the data and the different processes can make the global job either completely inoperative or very effective. In this setting, we propose two highly scalable PFIM algorithms, namely P2S (Parallel-2-Steps) and PATD (Parallel Absolute Top Down). P2S discovers itemsets from large databases in two simple yet efficient parallel jobs, while PATD makes the mining of very large databases simpler and more compact. Its mining process consists of a single parallel job, which dramatically reduces the mining runtime, the communication cost, and the energy consumption overhead on a distributed computing platform. Our proposed approaches have been extensively evaluated on massive real-world data sets. The experimental results confirm the effectiveness and scalability of our proposals through the significant scale-up obtained with very low minimum supports compared to other alternatives.
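
To make the idea of mining frequent itemsets over partitioned data concrete, here is a minimal single-machine sketch of a generic two-pass, partition-based scheme: the first pass mines each partition locally to produce candidates, and the second pass counts those candidates over all the data. This is an illustration under stated assumptions, not the authors' P2S or PATD; the function names `local_frequent` and `two_pass_mining`, the brute-force local miner, and the relative support parameter `min_ratio` are all hypothetical choices made for the example.

```python
import math
from collections import Counter
from itertools import combinations


def local_frequent(partition, min_ratio):
    """Brute-force local mining (assumed helper, only suitable for tiny data).

    Returns every itemset whose count inside this partition reaches
    ceil(min_ratio * number of transactions in the partition).
    """
    threshold = math.ceil(min_ratio * len(partition))
    counts = Counter()
    for transaction in partition:
        items = sorted(set(transaction))
        for size in range(1, len(items) + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return {itemset for itemset, n in counts.items() if n >= threshold}


def two_pass_mining(partitions, min_ratio):
    """Two-pass scheme: local candidates first, then a global count."""
    # Pass 1: any globally frequent itemset must be locally frequent in
    # at least one partition, so the union of local results is a safe
    # candidate set.
    candidates = set()
    for partition in partitions:
        candidates |= local_frequent(partition, min_ratio)

    # Pass 2: count each candidate over all partitions and keep those
    # that reach the global threshold.
    total = sum(len(partition) for partition in partitions)
    threshold = math.ceil(min_ratio * total)
    counts = Counter()
    for partition in partitions:
        for transaction in partition:
            present = set(transaction)
            for itemset in candidates:
                if present.issuperset(itemset):
                    counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= threshold}


# Toy data: two partitions of two transactions each (made up for the example).
partitions = [
    [["a", "b"], ["a", "c"]],
    [["a", "b"], ["b", "c"]],
]
result = two_pass_mining(partitions, min_ratio=0.5)
```

In a distributed setting, each pass would typically run as one parallel job with one partition per worker, so the way transactions are assigned to partitions determines how many candidates the first pass emits. That is the kind of data placement effect the abstract refers to.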

Domains

Distributed, Parallel, and Cluster Computing [cs.DC]; Numerical Analysis [cs.NA]

Main file

KAIS_2017.pdf (619.75 KB)

Origin: Files produced by the author(s)


https://hal-lirmm.ccsd.cnrs.fr/lirmm-01620383

Submitted on: Friday, 20 October 2017, 14:58:28

Last modified on: Thursday, 15 February 2024, 03:31:39

Long-term archiving on: Sunday, 21 January 2018, 14:49:50

Dates and versions

lirmm-01620383, version 1 (20-10-2017)

Identifiers

HAL Id: lirmm-01620383, version 1
DOI: 10.1007/s10115-017-1041-5

Cite

Saber Salah, Reza Akbarinia, Florent Masseglia. Data placement in massively distributed environments for fast parallel mining of frequent itemsets. Knowledge and Information Systems (KAIS), 2017, 53 (1), pp.207-237. ⟨10.1007/s10115-017-1041-5⟩. ⟨lirmm-01620383⟩


Collections

UNIV-RENNES1 CNRS INRIA IRISA GRID5000 ZENITH LIRMM INRIA2 UR1-MATH-STIC UR1-UFR-ISTIC MIPS UNIV-MONTPELLIER UNIV-RENNES SILECS UR1-MATH-NUM

205 Views

334 Downloads
