Identification thématique hiérarchique : Application aux forums de discussions [Hierarchical topic identification: application to discussion forums]

Brigitte Bigi; Kamel Smaïli

Conference paper, Year: 2002

Brigitte Bigi

Role: Author
PersonId: 7990
IdHAL: brigittebigi
ORCID: 0000-0003-1834-6918
IdRef: 079410790

Analysis, perception and recognition of speech

Kamel Smaïli

Role: Author
PersonId: 2521
IdHAL: kamel-smaili
IdRef: 034429700

Analysis, perception and recognition of speech

Abstract

Statistical language models aim to capture the regularities of natural language, but they still suffer from several shortcomings that stem from the complexity of the language itself. Recent work has shown that these models can be improved when they know the topic being discussed and can adapt to it. The topic of a document is then obtained by a topic identification mechanism, but the topics handled in this way are often of different granularity, which is why it seems appropriate to organize them in a hierarchy. This organization calls for specific topic identification techniques. This paper proposes a unigram-based statistical model that automatically identifies the topic of a document within a predefined tree of possible topics. We also present a criterion that lets the model attach a degree of reliability to its decision. All experiments were carried out on data extracted from the 'fr' group of French discussion forums (newsgroups).
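The abstract does not give the model's equations, so the following is only a minimal sketch of how a unigram model could pick a topic in a predefined tree and report how reliable the choice looks. Everything specific here is an assumption made for illustration and not the authors' method: add-one smoothing, a greedy top-down descent, and the use of the log-likelihood margin between the best and second-best sibling as a reliability score. The topic names and training documents are toy, hypothetical data.

```python
import math
from collections import Counter


class TopicNode:
    """One topic in the predefined tree; leaves carry training documents."""

    def __init__(self, name, children=(), docs=()):
        self.name = name
        self.children = list(children)
        self.docs = list(docs)      # tokenized training documents
        self.counts = Counter()     # unigram counts for this topic
        self.total = 0              # number of training tokens


def train(node):
    """Count unigrams bottom-up; a parent pools the counts of its children."""
    for doc in node.docs:
        node.counts.update(doc)
    for child in node.children:
        train(child)
        node.counts.update(child.counts)
    node.total = sum(node.counts.values())


def log_likelihood(node, tokens, vocab_size):
    """Add-one smoothed unigram log-likelihood (the smoothing is an assumption)."""
    denom = node.total + vocab_size
    return sum(math.log((node.counts[t] + 1) / denom) for t in tokens)


def identify(root, tokens):
    """Greedy top-down descent through the tree.

    Returns the chosen topic path and, for each level, the log-likelihood
    margin between the best and second-best sibling (assumed reliability score).
    """
    vocab_size = len(root.counts)
    path, margins = [], []
    node = root
    while node.children:
        scores = sorted(
            ((log_likelihood(c, tokens, vocab_size), c) for c in node.children),
            key=lambda pair: pair[0],
            reverse=True,
        )
        best_score, best = scores[0]
        runner_up = scores[1][0] if len(scores) > 1 else -math.inf
        path.append(best.name)
        margins.append(best_score - runner_up)
        node = best
    return path, margins


# Toy, hypothetical topic tree and training data.
root = TopicNode("fr", children=[
    TopicNode("comp", children=[
        TopicNode("comp.os", docs=[["kernel", "linux", "boot", "disk"]]),
        TopicNode("comp.lang", docs=[["python", "code", "function", "compile"]]),
    ]),
    TopicNode("rec", children=[
        TopicNode("rec.cuisine", docs=[["recipe", "butter", "oven", "flour"]]),
        TopicNode("rec.sport", docs=[["match", "goal", "team", "score"]]),
    ]),
])
train(root)

path, margins = identify(root, ["linux", "kernel", "boot"])
print(path)     # ['comp', 'comp.os']
print(margins)  # one margin per level of the tree
```

A small margin at some level means two sibling topics explain the document almost equally well, so a caller could stop the descent there and report the coarser parent topic instead of a doubtful leaf.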

Keywords

Topic identification, language modeling, unigrams

Domains

Computation and Language [cs.CL]

Main file

TALN02Pdf.pdf (1.3 MB)

Origin: Files produced by the author(s)

https://hal.science/hal-01563654

Submitted on: Tuesday, 18 July 2017, 09:51:47

Last modified on: Monday, 11 September 2023, 17:41:19

Long-term archiving on: Saturday, 27 January 2018, 03:33:18

Dates and versions

hal-01563654, version 1 (18-07-2017)

Identifiers

HAL Id: hal-01563654, version 1

Cite

Brigitte Bigi, Kamel Smaïli. Identification thématique hiérarchique : Application aux forums de discussions. 9ème conférence annuelle sur le Traitement Automatique des Langues Naturelles - TALN'02, Jun 2002, Nancy, France. pp. 24-27. ⟨hal-01563654⟩

Collections

CNRS INRIA UNIV-LORRAINE INRIA2 LORIA LORIA-NLPKD
