Conference paper. Year: 2010

Parsing word clusters

Abstract

We present and discuss experiments in statistical parsing of French, where terminal forms used during training and parsing are replaced by more general symbols, particularly clusters of words obtained through unsupervised linear clustering. We build on the work of Candito and Crabbé (2009), who proposed using clusters built over slightly coarsened French inflected forms. We investigate the alternative method of building clusters over lemma/part-of-speech pairs, using a raw corpus that is automatically tagged and lemmatized. We find that both methods lead to comparable improvements over the baseline (we obtain F_1=86.20% and F_1=86.21% respectively, compared to a baseline of F_1=84.10%). Yet, when we replace gold lemma/POS pairs with their corresponding clusters, we obtain an upper bound (F_1=87.80%) that suggests room for improvement for this technique, should tagging/lemmatisation performance increase for French. We also analyze the improvement in performance for both techniques with respect to word frequency. We find that replacing word forms with clusters improves attachment performance for words that are originally either unknown or low-frequency, since these words are replaced by cluster symbols that tend to have higher frequencies. Furthermore, clustering also helps significantly for medium to high frequency words, suggesting that training on word clusters leads to better probability estimates for these words.
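The core substitution described in the abstract, replacing each terminal with a cluster symbol before training and parsing, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the toy `CLUSTERS` table, the cluster identifiers, the `C_UNK` fallback symbol, and the function name `to_cluster_terminals` do not come from the paper, which derives its clusters from a large raw corpus.

```python
from typing import Dict, List, Tuple

# Hypothetical toy mapping from (lemma, POS) pairs to cluster identifiers.
# The entries and identifiers are invented for illustration only.
CLUSTERS: Dict[Tuple[str, str], str] = {
    ("manger", "V"): "C0110",
    ("dévorer", "V"): "C0110",
    ("pomme", "N"): "C1011",
    ("poire", "N"): "C1011",
}

# Assumed fallback symbol for pairs that have no cluster.
UNKNOWN = "C_UNK"


def to_cluster_terminals(tagged: List[Tuple[str, str]]) -> List[str]:
    """Replace each (lemma, POS) pair with its cluster symbol."""
    return [CLUSTERS.get(pair, UNKNOWN) for pair in tagged]


# A tagged and lemmatized sentence, given as (lemma, POS) pairs.
sentence = [("dévorer", "V"), ("poire", "N"), ("kiwi", "N")]
terminals = to_cluster_terminals(sentence)
print(terminals)
```

In this toy table the rarer pair ("dévorer", V) shares a symbol with the more common ("manger", V), so a parser trained on cluster symbols would see one higher-frequency terminal instead of two sparser ones. This is only the intuition behind the frequency analysis in the abstract, not a reproduction of the authors' pipeline.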
Main file
SPMRL2010-LemmaClust.pdf (94.77 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-00495177, version 1 (07-09-2010)

Identifiers

  • HAL Id: hal-00495177, version 1

Cite

Marie Candito, Djamé Seddah. Parsing word clusters. NAACL/HLT-2010 Workshop on Statistical Parsing of Morphologically Rich Languages - SPMRL 2010, Jun 2010, Los Angeles, United States. pp.76-84. ⟨hal-00495177⟩
