Graphes des liens et anti-liens statistiquement valides entre les mots d'un corpus textuel

Alain Lelu; Martine Cadot

Conference paper, Year: 2009



Alain Lelu

Role: Author
PersonId : 844123

Laboratoire de Semio-Linguistique, Didactique et Informatique

Knowledge Information and Web Intelligence

Martine Cadot

Role: Author
PersonId : 9342
IdHAL : martine-cadot
IdRef : 113870906

Machine Learning and Computational Biology

Abstract

Neighborhood is a central concept in data mining, and many definitions have been implemented, mainly rooted in geometrical or topological considerations. We propose here a statistical definition of neighborhood: our TourneBool randomization test processes an objects vs. attributes binary table in order to establish which inter-attribute relations are fortuitous and which are meaningful, without any hypothesis on the underlying statistical distributions, while taking the empirical distributions into account. The result is a robust and statistically validated graph. An encouraging earlier small-scale test led us to scale up the different phases of the process, making it possible to test it on one of the publicly available Reuters test corpora. We then characterized the resulting word graph with a series of well-known indicators, such as clustering coefficients, degree distribution and correlation, cluster modularity and size distribution. Another graph structure stems from this process: the one conveying the negative "counter-relations" between words, i.e. words which "steer clear" of one another. We characterize the counter-relations graph in the same way.
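To make the idea of a given-marginals randomization test concrete, here is a minimal Python sketch. It is not the TourneBool implementation, whose details are not given in this abstract. It assumes a generic "checkerboard swap" randomization that keeps every row sum and column sum of the binary table fixed, and it assumes a simple empirical p-value rule. The function names (`swap_randomize`, `cooccurrence`, `classify_pair`) and the parameters `n_sim`, `n_swaps` and `alpha` are hypothetical and chosen only for illustration.

```python
import random


def swap_randomize(table, n_swaps, rng):
    """Return a randomized copy of a binary table with unchanged marginals.

    `table` is a list of sets: one set of attribute names (strings) per object.
    Each successful swap exchanges one attribute between two objects, so every
    object keeps its size and every attribute keeps its total count.
    """
    rows = [set(r) for r in table]
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(rows)), 2)
        only_i = sorted(rows[i] - rows[j])  # attributes present in i but not j
        only_j = sorted(rows[j] - rows[i])  # attributes present in j but not i
        if not only_i or not only_j:
            continue  # no valid swap for this pair of objects
        a, b = rng.choice(only_i), rng.choice(only_j)
        rows[i].remove(a)
        rows[i].add(b)
        rows[j].remove(b)
        rows[j].add(a)
    return rows


def cooccurrence(rows, a, b):
    """Number of objects containing both attributes a and b."""
    return sum(1 for r in rows if a in r and b in r)


def classify_pair(table, a, b, n_sim=200, n_swaps=None, alpha=0.05, seed=0):
    """Label a pair as 'link', 'anti-link' or 'not significant' (assumed rule)."""
    rng = random.Random(seed)
    if n_swaps is None:
        # Assumed heuristic: about ten swap attempts per 1-entry of the table.
        n_swaps = 10 * sum(len(r) for r in table)
    observed = cooccurrence(table, a, b)
    sims = [
        cooccurrence(swap_randomize(table, n_swaps, rng), a, b)
        for _ in range(n_sim)
    ]
    p_high = sum(s >= observed for s in sims) / n_sim  # unusually many co-occurrences
    p_low = sum(s <= observed for s in sims) / n_sim   # unusually few co-occurrences
    if p_high <= alpha:
        return "link"
    if p_low <= alpha:
        return "anti-link"
    return "not significant"
```

In this sketch, pairs labelled "link" would become edges of the relations graph and pairs labelled "anti-link" would become edges of the counter-relations graph. Restarting the randomization from the original table for every simulation is a simplification made for readability, not a claim about how the authors scaled the process.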

Keywords

Neighborhood graph; randomization test; graph characterization; statistics; data mining; text mining; given-marginals random matrix; statistically significant relation; statistical learning

Domains

Computation and Language [cs.CL]; Artificial Intelligence [cs.AI]; Modeling and Simulation; Document and Text Processing; Statistics [math.ST]; Statistics Theory [stat.TH]


https://inria.hal.science/inria-00342751

Submitted on: Friday, November 28, 2008, 13:48:30

Last modified on: Wednesday, April 24, 2024, 10:51:08

Dates and versions

inria-00342751 , version 1 (28-11-2008)

Identifiers

HAL Id : inria-00342751 , version 1

Cite

Alain Lelu, Martine Cadot. Graphes des liens et anti-liens statistiquement valides entre les mots d'un corpus textuel. Extraction et gestion de connaissance 2009 (EGC'09), Pierre Gançarski, Jan 2009, Strasbourg, France. pp.367-378. ⟨inria-00342751⟩


Collections

CNRS INRIA UNIV-FCOMTE UNIV-LORRAINE TDS-MACS LORIA ELLIADD

