Unsupervised recall and precision measures: a step towards new efficient clustering quality indexes

Jean-Charles Lamirel; Maha Ghribi; Pascal Cuxac

Conference paper, Year: 2010



Jean-Charles Lamirel

Role: Author
PersonId : 8202
IdHAL : jean-charles-lamirel

Natural Language Processing: representation, inference and semantics

Maha Ghribi

Role: Author
PersonId : 865717

Institut de l'information scientifique et technique

Pascal Cuxac

Role: Author
PersonId : 179348
IdHAL : pascal-cuxac
ORCID : 0000-0002-6809-5654
IdRef : 165835257

Institut de l'information scientifique et technique

Abstract

Classification methods are now commonly used to analyze large data corpora, for example in the domain of scientific survey or in the strategic analysis of research. The aim of classification is to build homogeneous groups of data that share a certain number of identical characteristics. Clustering, or unsupervised classification, makes it possible to highlight these groups without prior knowledge of the processed data. A central problem that then arises is assessing the quality of the results: a quality index is a criterion that makes it possible to decide which clustering method to use, to fix an optimal number of clusters, and to evaluate or develop a new method. Traditional quality indexes, which are mainly distance-based and rely on the concepts of intra-cluster inertia and inter-cluster inertia (Lebart et al. (1982)), fail to properly estimate clustering quality in several cases, such as that of textual data (Ghribi et al. (2010)). We therefore present in this paper an alternative approach to clustering quality evaluation, based on unsupervised Recall and Precision measures that exploit the descriptors of the data associated with the obtained clusters. Recall measures the exhaustiveness of the contents of the clusters in terms of the peculiar descriptors specific to each cluster. Precision measures the homogeneity of the clusters in terms of the proportion of their data containing the associated peculiar descriptors. Finally, we present an experimental comparison of the behavior of the classical indexes and of our new approach on a dataset of bibliographic references from the PASCAL database. This comparison shows that, among the indexes compared, our method is the only one that can distinguish between homogeneous and heterogeneous clustering results.
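The abstract describes the two measures only in words. The sketch below is one plausible reading of them, not the formulas from the paper. It assumes that each data item is a set of descriptors, that a descriptor is "peculiar" to the cluster in which it occurs most often, that Recall averages the share of a descriptor's occurrences that fall inside the cluster, and that Precision averages the share of the cluster's items that carry the descriptor. The function names and the toy data are hypothetical.

```python
from collections import defaultdict


def peculiar_descriptors(clusters):
    """Map each cluster to the descriptors that occur most often in it.

    Assumption: a descriptor is peculiar to the cluster holding the largest
    number of items that contain it (ties go to the smallest cluster id).
    """
    counts = defaultdict(lambda: defaultdict(int))  # descriptor -> cluster -> item count
    for cid, items in clusters.items():
        for item in items:
            for descriptor in item:
                counts[descriptor][cid] += 1
    peculiar = defaultdict(set)
    for descriptor, per_cluster in counts.items():
        best = max(sorted(per_cluster), key=lambda cid: per_cluster[cid])
        peculiar[best].add(descriptor)
    return peculiar, counts


def unsupervised_recall_precision(clusters):
    """Return {cluster id: (recall, precision)} under the assumptions above."""
    peculiar, counts = peculiar_descriptors(clusters)
    scores = {}
    for cid, items in clusters.items():
        props = peculiar.get(cid, set())
        if not props or not items:
            continue  # a cluster without peculiar descriptors gets no score
        # Recall: how much of each peculiar descriptor's usage the cluster captures.
        recall = sum(counts[d][cid] / sum(counts[d].values()) for d in props) / len(props)
        # Precision: how many of the cluster's items carry each peculiar descriptor.
        precision = sum(counts[d][cid] / len(items) for d in props) / len(props)
        scores[cid] = (recall, precision)
    return scores


# Toy data (hypothetical): two clusters of items described by keyword sets.
clusters = {
    "A": [{"neural", "network"}, {"neural", "learning"}],
    "B": [{"protein", "gene"}, {"gene", "neural"}],
}
scores = unsupervised_recall_precision(clusters)
for cid, (recall, precision) in sorted(scores.items()):
    print(cid, round(recall, 3), round(precision, 3))
```

Under this reading, a cluster scores high on Recall when its peculiar descriptors rarely appear elsewhere, and high on Precision when most of its items share those descriptors. Neither score depends on distances between items.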

Keywords

Clustering; Quality indexes; Text mining; Heterogeneous data

Domains

Machine learning [cs.LG]; Neural networks [cs.NE]; Artificial intelligence [cs.AI]; Other [stat.ML]; Linguistics; Computation and language [cs.CL]; Information retrieval [cs.IR]; Machine Learning [stat.ML]; Performance and reliability [cs.PF]


https://inria.hal.science/inria-00535961

Submitted on: Sunday, November 14, 2010, 16:14:00

Last modified on: Friday, March 24, 2023, 14:52:53

Dates and versions

inria-00535961 , version 1 (14-11-2010)

Identifiers

HAL Id : inria-00535961 , version 1

Cite

Jean-Charles Lamirel, Maha Ghribi, Pascal Cuxac. Unsupervised recall and precision measures: a step towards new efficient clustering quality indexes. 19th International Conference on Computational Statistics - COMPSTAT'2010, Aug 2010, Paris, France. pp.63-64. ⟨inria-00535961⟩


Collections

CNRS INRIA UNIV-LORRAINE INRIA2 LORIA INIST

122 views

0 downloads
