A Word Clustering Approach to Domain Adaptation: Robust parsing of source and target domains

Djamé Seddah; Marie Candito; Enrique Henestroza Anguiano; Henestroza Anguiano Enrique

doi:10.1093/logcom/exs082

Journal article, Journal of Logic and Computation, Year: 2013


Djamé Seddah

Role: Author
PersonId : 11545
IdHAL : djameseddah
IdRef : 086185136

Analyse Linguistique Profonde à Grande Echelle ; Large-scale deep linguistic processing

Université Paris-Sorbonne

Marie Candito

Role: Author
PersonId : 13596
IdHAL : marie-candito
IdRef : 153698616

Analyse Linguistique Profonde à Grande Echelle ; Large-scale deep linguistic processing

Enrique Henestroza Anguiano

Role: Author
PersonId : 878441

Analyse Linguistique Profonde à Grande Echelle ; Large-scale deep linguistic processing

Henestroza Anguiano Enrique

Role: Author

Analyse Linguistique Profonde à Grande Echelle ; Large-scale deep linguistic processing

Abstract

We present a technique to improve out-of-domain statistical parsing by reducing lexical data sparseness in a PCFG-LA architecture. We replace terminal symbols with unsupervised word clusters acquired from a large newspaper corpus augmented with target-domain data. We also investigate the impact of guiding out-of-domain parsing with predicted part-of-speech tags. We provide an evaluation for French, and obtain improvements in performance for both non-technical and technical target domains. Though the improvements over a strong baseline are slight, an interesting result is that the proposed techniques also improve parsing performance on the source domain, contrary to techniques such as self-training, thus leading to a more robust parser overall. We also describe new target domain evaluation treebanks, freely available, that comprise a total of about 3,000 annotated sentences from the medical domain, regional newspaper articles, French Europarl and French Wikipedia.
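The terminal-replacement step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how clusters are induced, how they are named, or how unseen words are handled, so the mapping `clusters`, the cluster labels, the lowercasing step, and the `C_UNK` fallback below are all hypothetical choices made for illustration.

```python
from typing import Dict, List

# Hypothetical label for words missing from the cluster map (an assumption,
# not something specified in the abstract).
UNKNOWN_CLUSTER = "C_UNK"


def replace_terminals(
    tokens: List[str],
    word_to_cluster: Dict[str, str],
    unknown: str = UNKNOWN_CLUSTER,
) -> List[str]:
    """Map each token to its cluster label, so the parser sees cluster
    labels instead of raw word forms. Lowercasing before lookup is an
    assumption of this sketch."""
    return [word_to_cluster.get(tok.lower(), unknown) for tok in tokens]


# Toy cluster map; in practice it would come from an unsupervised
# clustering run over a large corpus plus target-domain text.
clusters = {
    "le": "C0",
    "la": "C0",
    "patient": "C1",
    "malade": "C1",
    "dort": "C2",
}

clustered = replace_terminals(["Le", "malade", "dort"], clusters)
print(clustered)
```

Because "patient" and "malade" share a label in this toy map, a grammar trained on sentences containing one can apply to sentences containing the other, which is the sparseness reduction the abstract refers to.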

Keywords

Out of domain and morphologically-rich languages statistical parsing; unsupervised word clustering; data driven lemmatization; treebanking; biomedical; self training

Domains

Document and Text Processing

Contributor: Djamé Seddah

https://inria.hal.science/hal-00940224

Submitted on: Friday, January 31, 2014, 15:52:42

Last modified on: Friday, January 21, 2022, 03:21:22

Dates and versions

hal-00940224 , version 1 (31-01-2014)

Identifiers

HAL Id : hal-00940224 , version 1
DOI : 10.1093/logcom/exs082

Cite

Djamé Seddah, Marie Candito, Enrique Henestroza Anguiano, Henestroza Anguiano Enrique. A Word Clustering Approach to Domain Adaptation: Robust parsing of source and target domains. Journal of Logic and Computation, 2013, ⟨10.1093/logcom/exs082⟩. ⟨hal-00940224⟩


Collections

UNIV-PARIS7 INRIA INRIA2 SORBONNE-UNIVERSITE ANR
