High-dimensional clustering

Christophe Biernacki; Cathy Maugis

Book chapter, Year: 2017

Christophe Biernacki

Role: Author

Laboratoire Paul Painlevé - UMR 8524

MOdel for Data Analysis and Learning

Cathy Maugis

Role: Author
PersonId : 15433
IdHAL : cathy-maugis-rabusseau
IdRef : 130874329

Institut de Mathématiques de Toulouse UMR5219

Abstract

High-dimensional (HD) data sets are now common, driven mostly by technological developments: automated acquisition of variables, cheaper data storage, and more powerful standard computers that allow fast data management. All fields are affected by this general inflation in the number of variables; only the definition of "high" depends on the domain. In marketing, this number can be of order 10^2, in microarray gene expression between 10^2 and 10^4, in text mining 10^3 or more, and of order 10^6 for single nucleotide polymorphism (SNP) data. Sometimes many more variables are involved, which is typically the case with discretized curves, for instance curves arising from temporal sequences. Such a technological revolution has a huge impact on other scientific fields, such as societal or mathematical ones. In particular, high-dimensional data management brings new challenges to statisticians, since standard (low-dimensional) data analysis methods struggle to apply directly to the new (high-dimensional) data sets. The reasons are twofold and sometimes linked: combinatorial difficulties, or a disastrously large increase in the variance of estimates. Data analysis methods are essential for providing a synthetic view of data sets, allowing data summary and data exploration, for instance to support future decision making. This need is even more acute in the high-dimensional setting since, on the one hand, the large number of variables suggests that the data convey a lot of information but, on the other hand, such information may be hidden behind their volume.

Domains

Methodology [stat.ME]

Main file

JES2014-chap2.pdf (1.71 MB)

Origin: Files produced by the author(s)

Contributor: Christophe Biernacki

https://hal.science/hal-01252673

Submitted on: Tuesday, January 12, 2016, 08:24:48

Last modified on: Friday, April 19, 2024, 14:04:05

Long-term archiving on: Friday, April 15, 2016, 21:20:59

Dates and versions

hal-01252673, version 1 (07-01-2016)

hal-01252673, version 2 (12-01-2016)

Identifiers

HAL Id: hal-01252673, version 2

Cite

Christophe Biernacki, Cathy Maugis. High-dimensional clustering. Choix de modèles et agrégation, edited by J-J. Droesbeke, G. Saporta, C. Thomas-Agnan, Technip, 2017, 9782710811770. ⟨hal-01252673v2⟩

Collections

UNIV-TLSE2 CNRS INRIA INSA-TOULOUSE IMT UT1-CAPITOLE INRIA2 UNIV-LILLE INSA-GROUPE UNIV-UT3 UT3-TOULOUSEINP LPP-MATH

360 views

234 downloads