Statistical Language Modeling Based on Variable-Length Sequences

Imed Zitouni; Kamel Smaïli; Jean-Paul Haton

Journal article, Computer Speech and Language, Year: 2003


Author affiliation (all three authors): Analysis, perception and recognition of speech

Abstract

In natural language, and especially in spontaneous speech, people often group words into phrases that become common expressions. The reasons are phonological (easier pronunciation) or semantic (a phrase is easier to remember when a meaning is assigned to a block of words). Classical language models do not adequately account for such phrases. A better approach is to model some word sequences as if they were individual dictionary entries: the sequences are added to the vocabulary, and language models are computed over the extended vocabulary. In this paper, we present a method for automatically retrieving the most relevant phrases from a corpus of written sentences. The originality of our approach is that the phrases are extracted from a linguistically tagged corpus, so the extracted phrases are linguistically viable. To measure the contribution of classes to phrase retrieval, we implemented the same algorithm without classes; the class-based method outperformed it by 11%. Our approach uses information-theoretic criteria, which ensure high statistical consistency and make the decision to select a candidate sequence optimal with respect to language perplexity. We propose several variants of language models, with and without word sequences, including a model in which the trigger pairs are linguistically more significant. We show that using sequences decreases the word error rate and improves the normalized perplexity. For instance, the best sequence model improves perplexity by 16% and the accuracy of our dictation system (MAUD) by approximately 14%. Experiments, in terms of perplexity and recognition rate, were carried out with a vocabulary of 20,000 words extracted from a corpus of 43 million words comprising two years of the French newspaper Le Monde. The acoustic model (HMM) is trained on the Bref80 corpus.
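The idea of treating a word sequence as a single vocabulary entry can be illustrated with a small sketch. It is not the authors' algorithm: the abstract does not give the exact selection criterion, so the code below assumes a count-weighted pointwise mutual information score over adjacent word pairs, works on plain words rather than linguistic classes, and uses a unigram model evaluated on its own training text. The function names (`best_pair`, `merge_pair`, `normalized_perplexity`) and the toy corpus are hypothetical. Perplexity is normalized here by dividing the log-probability by the number of original words, an assumed convention that keeps models with different unit inventories comparable.

```python
import math
from collections import Counter


def best_pair(sentences, min_count=2):
    """Return the adjacent word pair with the highest count-weighted PMI.

    Assumption: this score stands in for the information-theoretic
    criterion mentioned in the abstract, which is not specified there.
    """
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(p for s in sentences for p in zip(s, s[1:]))
    total = sum(unigrams.values())
    best, best_score = None, 0.0
    for (a, b), c in bigrams.items():
        if c < min_count:
            continue
        # Contribution of the pair (a, b) to the mutual information.
        score = (c / total) * math.log(c * total / (unigrams[a] * unigrams[b]))
        if score > best_score:
            best, best_score = (a, b), score
    return best


def merge_pair(sentences, pair):
    """Rewrite each sentence so that every occurrence of `pair` is one token."""
    a, b = pair
    merged = []
    for s in sentences:
        out, i = [], 0
        while i < len(s):
            if i + 1 < len(s) and s[i] == a and s[i + 1] == b:
                out.append(a + "_" + b)  # the sequence becomes a vocabulary unit
                i += 2
            else:
                out.append(s[i])
                i += 1
        merged.append(out)
    return merged


def normalized_perplexity(sentences, n_words):
    """Unigram perplexity over units, normalized by the original word count."""
    counts = Counter(u for s in sentences for u in s)
    total = sum(counts.values())
    log_prob = sum(c * math.log(c / total) for c in counts.values())
    return math.exp(-log_prob / n_words)


# Toy corpus (hypothetical data, for illustration only).
corpus = [
    "il y a un chat".split(),
    "il y a un chien".split(),
    "il y a du pain".split(),
    "un chat dort".split(),
]
n_words = sum(len(s) for s in corpus)

pair = best_pair(corpus)
merged = merge_pair(corpus, pair)
pp_words = normalized_perplexity(corpus, n_words)
pp_sequences = normalized_perplexity(merged, n_words)
```

On this toy corpus the merged model has a lower normalized perplexity than the word-only model, which is the kind of comparison the abstract reports. A realistic setup would iterate the select-and-merge step, use held-out data, and use a smoothed n-gram model.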

Keywords

language modeling; phrases; triggers; cache; sequences; normalized perplexity; language model

Domains

Other [cs.OH]


https://inria.hal.science/inria-00099785

Submitted on: Tuesday, 26 September 2006, 09:41:12

Last modified on: Friday, 24 March 2023, 14:52:48

Dates and versions

inria-00099785 , version 1 (26-09-2006)

Identifiers

HAL Id: inria-00099785, version 1

Cite

Imed Zitouni, Kamel Smaïli, Jean-Paul Haton. Statistical Language Modeling Based on Variable-Length Sequences. Computer Speech and Language, 2003, 17 (1), pp.27-41. ⟨inria-00099785⟩


Collections

CNRS INRIA UNIV-LORRAINE INRIA2 LORIA
