Head-to-head comparison of clustering methods for heterogeneous data: a simulation-driven benchmark

Gregoire Preud'Homme; Kévin Duarte; Kevin Dalleau; Claire Lacomblez; Emmanuel Bresso; Malika Smaïl-Tabbone; Miguel Couceiro; Marie-Dominique Devignes; Masatake Kobayashi; Olivier Huttin; João Pedro Ferreira; Faiez Zannad; Patrick Rossignol; Nicolas Girerd

doi:10.1038/s41598-021-83340-8

Journal article, Scientific Reports, Year: 2021



Gregoire Preud'Homme

Role: Author

Défaillance Cardiovasculaire Aiguë et Chronique

Centre d'investigation clinique plurithématique Pierre Drouin [Nancy]

Cardiovascular and Renal Clinical Trialists [Vandoeuvre-les-Nancy]

French-Clinical Research Infrastructure Network - F-CRIN [Paris]

Kévin Duarte

Role: Author
PersonId : 973427

Défaillance Cardiovasculaire Aiguë et Chronique

Centre d'investigation clinique plurithématique Pierre Drouin [Nancy]

Cardiovascular and Renal Clinical Trialists [Vandoeuvre-les-Nancy]

French-Clinical Research Infrastructure Network - F-CRIN [Paris]

Kevin Dalleau

Role: Author
PersonId : 16248
IdHAL : kevin-dalleau
IdRef : 227824776

Computational Algorithms for Protein Structures and Interactions

Claire Lacomblez

Role: Author

Défaillance Cardiovasculaire Aiguë et Chronique

Centre d'investigation clinique plurithématique Pierre Drouin [Nancy]

Cardiovascular and Renal Clinical Trialists [Vandoeuvre-les-Nancy]

French-Clinical Research Infrastructure Network - F-CRIN [Paris]

Emmanuel Bresso

Role: Author
PersonId : 760590
ORCID : 0000-0002-5650-3155

Computational Algorithms for Protein Structures and Interactions

Malika Smaïl-Tabbone

Role: Author
PersonId : 2054
IdHAL : malika-smail-tabbone
ORCID : 0000-0002-8119-2117
IdRef : 190929065

Cardiovascular and Renal Clinical Trialists [Vandoeuvre-les-Nancy]

French-Clinical Research Infrastructure Network - F-CRIN [Paris]

Computational Algorithms for Protein Structures and Interactions

Miguel Couceiro

Role: Author
PersonId : 1498
IdHAL : miguel-couceiro
ORCID : 0000-0003-2316-7623
IdRef : 223362395

Knowledge representation, reasoning

Marie-Dominique Devignes

Role: Author
PersonId : 742369
IdHAL : mddevignes
ORCID : 0000-0002-0399-8713
IdRef : 134215001

Cardiovascular and Renal Clinical Trialists [Vandoeuvre-les-Nancy]

French-Clinical Research Infrastructure Network - F-CRIN [Paris]

Computational Algorithms for Protein Structures and Interactions

Masatake Kobayashi

Role: Author
PersonId : 795184
ORCID : 0000-0003-0008-2256

Défaillance Cardiovasculaire Aiguë et Chronique

Centre d'investigation clinique plurithématique Pierre Drouin [Nancy]

Cardiovascular and Renal Clinical Trialists [Vandoeuvre-les-Nancy]

French-Clinical Research Infrastructure Network - F-CRIN [Paris]

Olivier Huttin

Role: Author
PersonId : 770504
IdRef : 127860991

Défaillance Cardiovasculaire Aiguë et Chronique

Centre d'investigation clinique plurithématique Pierre Drouin [Nancy]

Cardiovascular and Renal Clinical Trialists [Vandoeuvre-les-Nancy]

French-Clinical Research Infrastructure Network - F-CRIN [Paris]

João Pedro Ferreira

Role: Author
PersonId : 800586
ORCID : 0000-0002-2304-6138

Défaillance Cardiovasculaire Aiguë et Chronique

Centre d'investigation clinique plurithématique Pierre Drouin [Nancy]

Cardiovascular and Renal Clinical Trialists [Vandoeuvre-les-Nancy]

French-Clinical Research Infrastructure Network - F-CRIN [Paris]

Faiez Zannad

Role: Author
PersonId : 756877
ORCID : 0000-0001-7456-1570

Défaillance Cardiovasculaire Aiguë et Chronique

Centre d'investigation clinique plurithématique Pierre Drouin [Nancy]

Cardiovascular and Renal Clinical Trialists [Vandoeuvre-les-Nancy]

French-Clinical Research Infrastructure Network - F-CRIN [Paris]

Patrick Rossignol

Role: Author
PersonId : 755843
ORCID : 0000-0001-8009-3873

Défaillance Cardiovasculaire Aiguë et Chronique

Centre d'investigation clinique plurithématique Pierre Drouin [Nancy]

Cardiovascular and Renal Clinical Trialists [Vandoeuvre-les-Nancy]

French-Clinical Research Infrastructure Network - F-CRIN [Paris]

Nicolas Girerd

Role: Author
PersonId : 771124
ORCID : 0000-0002-3278-2057
IdRef : 164445757

Défaillance Cardiovasculaire Aiguë et Chronique

Centre d'investigation clinique plurithématique Pierre Drouin [Nancy]

Cardiovascular and Renal Clinical Trialists [Vandoeuvre-les-Nancy]

French-Clinical Research Infrastructure Network - F-CRIN [Paris]

Abstract

Choosing the most appropriate unsupervised machine-learning method for "heterogeneous" or "mixed" data, i.e. data with both continuous and categorical variables, can be challenging. Our aim was to examine the performance of various clustering strategies for mixed data using both simulated and real-life data. We conducted a benchmark analysis of "ready-to-use" tools in R, comparing 4 model-based methods (Kamila algorithm, Latent Class Analysis, Latent Class Model [LCM] and Clustering by Mixture Modeling) and 5 distance/dissimilarity-based methods (Gower distance or Unsupervised Extra Trees dissimilarity followed by hierarchical clustering or Partitioning Around Medoids, and K-prototypes). Clustering performance was assessed by the Adjusted Rand Index (ARI) on 1000 generated virtual populations of mixed variables across 7 scenarios with varying population sizes, numbers of clusters, numbers of continuous and categorical variables, proportions of relevant (non-noisy) variables and degrees of variable relevance (low, mild, high). The clustering methods were then applied to data from the EPHESUS randomized clinical trial (a heart failure trial evaluating the effect of eplerenone) to illustrate the differences between clustering techniques. The simulations revealed the dominance of K-prototypes, Kamila and LCM over all other methods. Overall, methods using dissimilarity matrices in classical algorithms such as Partitioning Around Medoids and hierarchical clustering had a lower ARI than model-based methods in all scenarios. When applied to a real-life clinical dataset, LCM showed promising results with regard to (1) differences in clinical profiles across clusters, (2) prognostic performance (highest C-index) and (3) identification of patient subgroups with substantial treatment benefit.
The present findings suggest key differences in clustering performance between the tested algorithms (limited to tools readily available in R). In most of the tested scenarios, model-based methods (in particular the Kamila and LCM packages) and K-prototypes performed best in the setting of heterogeneous data.
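The abstract scores each method by the Adjusted Rand Index between the simulated cluster labels and the labels a method recovers. As a rough illustration of that metric only, the sketch below computes the standard (Hubert-Arabie) ARI from a contingency table in plain Python. The benchmark itself used R tools; the function name `adjusted_rand_index` and the toy labels are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Hypothetical helper: standard Adjusted Rand Index of two partitions."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both label sequences must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare; treated as perfect agreement

    # Contingency table cells and its row/column margins.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Numbers of point pairs placed together: in both partitions, in the
    # true partition, and in the predicted partition.
    pairs_both = sum(comb(c, 2) for c in cells.values())
    pairs_true = sum(comb(c, 2) for c in rows.values())
    pairs_pred = sum(comb(c, 2) for c in cols.values())

    expected = pairs_true * pairs_pred / comb(n, 2)  # agreement expected by chance
    maximum = (pairs_true + pairs_pred) / 2          # upper bound of the index
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (pairs_both - expected) / (maximum - expected)


if __name__ == "__main__":
    truth = [0, 0, 1, 1]
    print(adjusted_rand_index(truth, [1, 1, 0, 0]))  # same partition, relabeled
    print(adjusted_rand_index(truth, [0, 1, 0, 1]))  # partition that splits every true pair
```

Because the index is corrected for chance, it equals 1 for identical partitions regardless of how the clusters are named, stays near 0 for random labelings, and can be negative when agreement is worse than chance.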

Keywords

Clustering method; Q1 Distance or transformation; Q2 Merge mode; Q3 Optimization algorithm; Numeric; Categorical

Domains

Cardiology and cardiovascular system

Main file

s41598-021-83340-8.pdf (2 MB)

41598_2021_83340_MOESM1_ESM.docx (1.17 MB)

41598_2021_83340_MOESM1_ESM.pdf (348.89 KB)

Origin: Publisher files allowed on an open archive

Contributor: Erwan BOZEC

https://hal.univ-lorraine.fr/hal-03165272

Submitted on: Wednesday, March 10, 2021, 15:21:10

Last modified on: Monday, September 25, 2023, 12:30:08

Long-term archiving on: Friday, June 11, 2021, 19:00:27

Dates and versions

hal-03165272, version 1 (10-03-2021)

Identifiers

HAL Id: hal-03165272, version 1
DOI : 10.1038/s41598-021-83340-8
PUBMED : 33603019
PUBMEDCENTRAL : PMC7892576

Cite

Gregoire Preud'Homme, Kévin Duarte, Kevin Dalleau, Claire Lacomblez, Emmanuel Bresso, et al. Head-to-head comparison of clustering methods for heterogeneous data: a simulation-driven benchmark. Scientific Reports, 2021, 11 (1), pp.4202. ⟨10.1038/s41598-021-83340-8⟩. ⟨hal-03165272⟩


Collections

INSERM CNRS INRIA UNIV-LORRAINE INRIA2 LORIA LORIA-NLPKD LORIA-AIS DCAC-UL BMS-UL ANR

197 views

247 downloads
