Conference paper, Year: 2010

Unsupervised recall and precision measures: a step towards new efficient clustering quality indexes

Abstract

Classification methods are now commonly used to analyze large data corpora, for example in scientific surveys or in strategic analyses of research. The aim of classification is to build homogeneous groups of data that share a certain number of identical characteristics. Clustering, or unsupervised classification, makes it possible to highlight these groups without prior knowledge of the processed data. A central problem is then to assess the quality of the results: a quality index is a criterion that makes it possible to decide which clustering method to use, to fix an optimal number of clusters, and to evaluate or develop a new method. Traditional quality indexes, which are mainly distance-based and rely on the concepts of intra-cluster inertia and inter-cluster inertia (Lebart et al. (1982)), fail to properly estimate clustering quality in several cases, such as that of textual data (Ghribi et al. (2010)). In this paper we therefore present an alternative approach to clustering quality evaluation, based on unsupervised Recall and Precision measures that exploit the descriptors of the data associated with the obtained clusters. Recall measures the exhaustiveness of the contents of the clusters in terms of the peculiar descriptors specific to each cluster. Precision measures the homogeneity of the clusters in terms of the proportion of data containing the associated peculiar descriptors. Finally, we present an experimental comparison of the behavior of the classical indexes and our new approach on a dataset of bibliographical references from the PASCAL database. This comparison clearly highlights that, among the compared methods, ours is the only one that can distinguish between homogeneous and heterogeneous clustering results.
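The abstract does not give formulas for the two measures, so the following Python sketch is only one possible reading of them, not the authors' definition. It assumes binary descriptors (each data item is a set of descriptors), and it assumes that a descriptor is "peculiar" to the cluster in which its relative frequency is highest. Under those assumptions, the Recall of a cluster averages, over its peculiar descriptors, the share of all items carrying a descriptor that fall inside the cluster, and the Precision averages the share of the cluster's items that carry it. The function names `peculiar_descriptors` and `unsupervised_recall_precision` are hypothetical.

```python
from collections import defaultdict


def peculiar_descriptors(clusters):
    """Assign each descriptor to the cluster where its relative frequency is highest.

    `clusters` maps a cluster label to a list of items; each item is a set of
    descriptors. This assignment rule is an assumption of the sketch.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for label, items in clusters.items():
        for item in items:
            for descriptor in item:
                counts[descriptor][label] += 1

    peculiar = {label: set() for label in clusters}
    for descriptor, by_cluster in counts.items():
        # Ties are broken by label order so that the result is deterministic.
        best = max(
            sorted(by_cluster),
            key=lambda label: by_cluster[label] / len(clusters[label]),
        )
        peculiar[best].add(descriptor)
    return peculiar


def unsupervised_recall_precision(clusters):
    """Return {label: (recall, precision)} for clusters with peculiar descriptors."""
    peculiar = peculiar_descriptors(clusters)
    all_items = [item for items in clusters.values() for item in items]
    scores = {}
    for label, items in clusters.items():
        descriptors = peculiar[label]
        if not descriptors:
            continue  # a cluster without peculiar descriptors is not scored
        recall = 0.0
        precision = 0.0
        for d in descriptors:
            in_cluster = sum(d in item for item in items)
            overall = sum(d in item for item in all_items)
            recall += in_cluster / overall        # exhaustiveness for descriptor d
            precision += in_cluster / len(items)  # homogeneity for descriptor d
        scores[label] = (recall / len(descriptors), precision / len(descriptors))
    return scores


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    toy = {
        "A": [{"neural", "network"}, {"neural", "learning"}],
        "B": [{"protein", "gene"}, {"gene", "learning"}],
    }
    for label, (recall, precision) in unsupervised_recall_precision(toy).items():
        print(label, round(recall, 3), round(precision, 3))
```

Because both values are computed from the descriptors alone, the sketch needs no reference partition and no distance between items, which is the property the abstract contrasts with inertia-based indexes.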
No file deposited

Dates and versions

inria-00535961, version 1 (14-11-2010)

Identifiers

  • HAL Id: inria-00535961, version 1

Cite

Jean-Charles Lamirel, Maha Ghribi, Pascal Cuxac. Unsupervised recall and precision measures: a step towards new efficient clustering quality indexes. 19th International Conference on Computational Statistics - COMPSTAT'2010, Aug 2010, Paris, France. pp.63-64. ⟨inria-00535961⟩