Conference paper. Year: 2000

Beyond the Conventional Statistical Language Models: The Variable-Length Sequences Approach

Abstract

In natural language, some word sequences occur very frequently. A classical language model, such as an n-gram model, does not adequately account for such sequences because it underestimates their probabilities. A better approach is to model word sequences as if they were individual dictionary elements: sequences are treated as additional entries of the word lexicon, on which the language models are computed. In this paper, we present an original method for automatically determining the most important phrases in corpora. This method is based on information-theoretic criteria, which ensure high statistical consistency, and on French grammatical classes, which capture additional types of linguistic dependencies. In addition, perplexity is used to make the decision to select a candidate sequence more accurate. We also propose several variants of language models with and without word sequences. Among them, we present a model in which the trigger pairs are more linguistically significant. The originality of this model, compared with commonly used trigger approaches, is that it uses word sequences to estimate the trigger pairs rather than limiting itself to single words. Experimental tests, in terms of perplexity and recognition rate, are carried out with a vocabulary of 20,000 words and a corpus of 43 million words. The word sequences proposed by our algorithm reduce perplexity by more than 16% compared with models limited to single words. Introducing these word sequences into our dictation machine improves accuracy by approximately 15%.
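
The idea of promoting a frequent word sequence to a single lexicon entry can be illustrated with a small sketch. This is not the authors' algorithm: it assumes pointwise mutual information over adjacent word pairs as a stand-in for the information-theoretic criteria, and it leaves out both the grammatical classes and the perplexity check mentioned in the abstract. The function names `best_pair_by_pmi` and `merge_pair`, the `min_count` threshold, and the `_` joiner are all hypothetical.

```python
import math
from collections import Counter


def best_pair_by_pmi(sentences, min_count=2):
    """Return (pair, score) for the adjacent word pair with the highest
    pointwise mutual information, among pairs seen at least min_count times.

    Assumption: PMI stands in for the paper's information-theoretic criteria.
    """
    unigrams = Counter(word for sent in sentences for word in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())

    best_pair, best_score = None, float("-inf")
    for (left, right), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable estimates
        p_pair = count / total_bigrams
        p_left = unigrams[left] / total_unigrams
        p_right = unigrams[right] / total_unigrams
        score = math.log(p_pair / (p_left * p_right))
        if score > best_score:
            best_pair, best_score = (left, right), score
    return best_pair, best_score


def merge_pair(sentences, pair, joiner="_"):
    """Rewrite the corpus so every occurrence of pair becomes one token."""
    left, right = pair
    merged = []
    for sent in sentences:
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent) and sent[i] == left and sent[i + 1] == right:
                out.append(left + joiner + right)  # new lexicon entry
                i += 2
            else:
                out.append(sent[i])
                i += 1
        merged.append(out)
    return merged


corpus = [["a", "b"], ["a", "b"], ["c", "d"]]
pair, score = best_pair_by_pmi(corpus)
corpus = merge_pair(corpus, pair)
```

Repeating the two steps on the rewritten corpus lets longer sequences emerge, since a merged token can itself be paired with a neighbour. A language model would then be estimated on the rewritten corpus, and, following the abstract, a candidate sequence would be kept only if it does not degrade perplexity.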
Main file
ICSLP00.pdf (362.28 KB)
Origin: files produced by the author(s)

Dates and versions

inria-00099107, version 1 (21-11-2017)

Identifiers

  • HAL Id: inria-00099107, version 1

Cite

Imed Zitouni, Kamel Smaïli, Jean-Paul Haton. Beyond the Conventional Statistical Language Models: The Variable-Length Sequences Approach. International Conference on Speech Language Processing, 2000, Beijing, China. pp.4. ⟨inria-00099107⟩