Importance of Dataspace Embeddings when Evaluating Text Clustering Methods - INRIA - Institut National de Recherche en Informatique et en Automatique
Book chapter, Year: 2020

Importance of Dataspace Embeddings when Evaluating Text Clustering Methods

Abstract

Fair evaluation of text clustering methods requires clarifying the relations between 1) pre-processing, which results in raw term occurrence vectors, 2) data transformation, and 3) the method in the strict sense. We have tried to compare empirically a dozen well-known methods and variants in a protocol crossing three contrasting open-access corpora with a few dozen transformed dataspaces. We compared the resulting clusterings to their supposed "ground-truth" classes by means of four commonly used indices. The results both confirm well-established implicit combinations and show good performance for unexpected ones, mostly in spectral or kernel dataspaces. The rich material resulting from these roughly 600 runs includes a wealth of intriguing facts, which call for further research on the specificities of text corpora in relation to methods and dataspaces.
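
The crossing of dataspaces and methods described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the chapter: the toy corpus and its classes, the two transformations (`raw`, `tfidf_l2`), the two methods (`kmeans` with a deterministic farthest-point start, `single_link`), and the Rand index standing in for the four indices, which the abstract does not name.

```python
import math
from itertools import combinations, product

# Toy corpus with assumed "ground-truth" classes (illustrative only).
docs = [
    "cat dog cat pet",
    "dog pet animal",
    "stock market price",
    "market price trade stock",
]
truth = [0, 0, 1, 1]

def term_counts(texts):
    """Step 1, pre-processing: raw term occurrence vectors."""
    vocab = sorted({w for t in texts for w in t.split()})
    return [[float(t.split().count(w)) for w in vocab] for t in texts]

# Step 2, data transformations (two hypothetical dataspaces).
def raw(X):
    return X

def tfidf_l2(X):
    """Weight counts by log(n / document frequency), then L2-normalise rows."""
    n = len(X)
    df = [sum(1 for row in X if row[j] > 0) for j in range(len(X[0]))]
    out = []
    for row in X:
        w = [c * math.log(n / df[j]) for j, c in enumerate(row)]
        norm = math.sqrt(sum(v * v for v in w)) or 1.0
        out.append([v / norm for v in w])
    return out

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Step 3, methods in the strict sense (two hypothetical choices).
def kmeans(X, k, iters=20):
    """Lloyd's k-means with a deterministic farthest-point initialisation."""
    centers = [X[0]]
    while len(centers) < k:
        centers.append(max(X, key=lambda p: min(dist2(p, c) for c in centers)))
    labels = [0] * len(X)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in X]
        for c in range(k):
            members = [p for p, l in zip(X, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def single_link(X, k):
    """Agglomerative clustering: merge the two closest clusters until k remain."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: min(dist2(X[i], X[j])
                               for i in clusters[ab[0]] for j in clusters[ab[1]]),
        )
        clusters[a] += clusters.pop(b)  # a < b, so index a is unaffected
    labels = [0] * len(X)
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels

def rand_index(a, b):
    """Fraction of item pairs on which two partitions agree."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

transforms = {"raw": raw, "tfidf_l2": tfidf_l2}
methods = {"kmeans": kmeans, "single_link": single_link}

# Cross every dataspace with every method and score against the classes.
X = term_counts(docs)
results = {}
for (t_name, t), (m_name, m) in product(transforms.items(), methods.items()):
    results[(t_name, m_name)] = rand_index(truth, m(t(X), 2))

for key, score in sorted(results.items()):
    print(key, round(score, 3))
```

Keeping the transformation and the method as separate, interchangeable functions is what makes it possible to attribute a score to the dataspace, to the method, or to their combination. On a corpus this small and this cleanly separated, every combination recovers the two classes, so the sketch shows only the structure of such a protocol, not any of the chapter's findings.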
Main file
Lelu_Cadot_IFCSpost_V8.pdf (129.77 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-03053176, version 1 (10-12-2020)
hal-03053176, version 2 (14-12-2020)

Identifiers

  • HAL Id: hal-03053176, version 2

Cite

Alain Lelu, Martine Cadot. Importance of Dataspace Embeddings when Evaluating Text Clustering Methods. Data Analysis and Rationality in a Complex World, In press. ⟨hal-03053176v2⟩