Conference paper, Year: 2009

Graphes des liens et anti-liens statistiquement valides entre les mots d'un corpus textuel [Graphs of statistically valid links and anti-links between the words of a text corpus]

Abstract

Neighborhood is a central concept in data mining, and many definitions of it have been implemented, mainly rooted in geometrical or topological considerations. We propose here a statistical definition of neighborhood: our TourneBool randomization test processes an objects vs. attributes binary table in order to establish which inter-attribute relations are fortuitous and which are meaningful, without any hypothesis on the underlying statistical distributions, but taking the empirical distributions into account. The result is a robust and statistically validated graph. An encouraging earlier small-scale test led us to scale up the different phases of the process, making it possible to test it on one of the publicly available Reuters test corpora. We then characterized the resulting word graph with a series of well-known indicators, such as clustering coefficients, degree distribution and correlation, cluster modularity and size distribution. Another graph structure stems from this process: the one conveying the negative "counter-relations" between words, i.e. words which "steer clear" of one another. We characterize the counter-relations graph in the same way.
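The abstract does not spell out the TourneBool procedure, so the following is only a minimal sketch of the general idea of a randomization test on a binary objects vs. attributes table, not the authors' implementation. It assumes a margin-preserving "checkerboard swap" shuffle (every row sum and column sum is kept, which is one way of taking the empirical distributions into account) and uses the raw co-occurrence count as the statistic. The function names `swap_randomize`, `cooccurrence` and `classify_pairs`, and the parameters `n_random`, `alpha` and the number of swaps, are illustrative assumptions.

```python
import random


def swap_randomize(table, n_swaps, rng):
    """Return a shuffled copy of a 0/1 table.

    Each accepted swap flips a 2x2 checkerboard ([[1,0],[0,1]] <-> [[0,1],[1,0]]),
    which leaves every row sum and every column sum unchanged.
    """
    t = [row[:] for row in table]
    n_rows, n_cols = len(t), len(t[0])
    for _ in range(n_swaps):
        r1, r2 = rng.sample(range(n_rows), 2)
        c1, c2 = rng.sample(range(n_cols), 2)
        is_checkerboard = (
            t[r1][c1] == t[r2][c2]
            and t[r1][c2] == t[r2][c1]
            and t[r1][c1] != t[r1][c2]
        )
        if is_checkerboard:
            t[r1][c1], t[r1][c2] = t[r1][c2], t[r1][c1]
            t[r2][c1], t[r2][c2] = t[r2][c2], t[r2][c1]
    return t


def cooccurrence(table, a, b):
    """Number of objects (rows) in which attributes a and b are both present."""
    return sum(row[a] & row[b] for row in table)


def classify_pairs(table, n_random=200, alpha=0.05, seed=0):
    """Split attribute pairs into links and anti-links (assumed decision rule).

    A pair is a link if its observed co-occurrence is rarely reached in the
    randomized tables, and an anti-link if it is rarely that low.
    """
    rng = random.Random(seed)
    n_cols = len(table[0])
    pairs = [(a, b) for a in range(n_cols) for b in range(a + 1, n_cols)]
    observed = {p: cooccurrence(table, *p) for p in pairs}
    at_least = dict.fromkeys(pairs, 0)  # randomized count >= observed
    at_most = dict.fromkeys(pairs, 0)   # randomized count <= observed
    n_swaps = 10 * sum(map(sum, table))  # heuristic mixing length (assumption)
    current = table
    for _ in range(n_random):
        current = swap_randomize(current, n_swaps, rng)
        for p in pairs:
            c = cooccurrence(current, *p)
            at_least[p] += c >= observed[p]
            at_most[p] += c <= observed[p]
    # Empirical one-sided p-values with the usual +1 correction.
    links = [p for p in pairs if (at_least[p] + 1) / (n_random + 1) < alpha]
    anti_links = [p for p in pairs if (at_most[p] + 1) / (n_random + 1) < alpha]
    return links, anti_links


if __name__ == "__main__":
    # Toy table: rows are documents, columns are words (hypothetical data).
    toy = [[1, 1, 0] for _ in range(10)] + [[0, 0, 1] for _ in range(10)]
    links, anti_links = classify_pairs(toy)
    print("links:", links)
    print("anti-links:", anti_links)
```

The retained links and anti-links can then be read as the edge lists of two graphs over the words, which is where indicators such as clustering coefficients and degree distributions would be computed.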
No file deposited

Dates and versions

inria-00342751, version 1 (28-11-2008)

Identifiers

  • HAL Id: inria-00342751, version 1

Cite

Alain Lelu, Martine Cadot. Graphes des liens et anti-liens statistiquement valides entre les mots d'un corpus textuel. Extraction et gestion de connaissance 2009 (EGC'09), Pierre Gançarski, Jan 2009, Strasbourg, France. pp.367-378. ⟨inria-00342751⟩