French Contextualized Word-Embeddings with a sip of CaBeRnet: a New French Balanced Reference Corpus

Murielle Fabre; Pedro Javier Ortiz Suárez; Benoît Sagot; Éric Villemonte de La Clergerie

Conference paper. Year: 2020

Murielle Fabre

Role: Author
PersonId: 176984
IdHAL: murielle-fabre
IdRef: 236178288

Automatic Language Modelling and ANAlysis & Computational Humanities

Laboratoire de Linguistique Formelle

Pedro Javier Ortiz Suárez

Role: Author
PersonId: 178412
IdHAL: pedro-ortiz-suarez
ORCID: 0000-0003-0343-8852
IdRef: 264210743

Automatic Language Modelling and ANAlysis & Computational Humanities

Sorbonne Université

Benoît Sagot

Role: Author
PersonId: 1461
IdHAL: bsagot
ORCID: 0000-0002-0107-8526
IdRef: 177454229

Automatic Language Modelling and ANAlysis & Computational Humanities

Éric Villemonte de La Clergerie

Role: Author

Automatic Language Modelling and ANAlysis & Computational Humanities

Abstract

This paper describes and compares the impact of different types and sizes of training corpora on language models such as ELMo. Addressing the fundamental question of quality versus quantity, we evaluate four French training corpora on downstream parsing, POS-tagging, and named-entity recognition tasks. The paper studies the relevance of a new corpus, CaBeRnet, which features a representative range of language usage, including a balanced variety of genres (oral transcriptions, newspapers, popular magazines, technical reports, fiction, academic texts) in oral and written styles. We hypothesize that a linguistically representative and balanced corpus will allow the language model to be more efficient and representative of a given language, and therefore yield better scores on different evaluation sets and tasks.

Keywords

Language Models French BERT ELMo Tagging Parsing NER Balanced French Corpus

Domains

Computation and Language [cs.CL]; Linguistics

Main file

LREC_Fabre_Ortiz.pdf (211.09 KB)

Origin: Files produced by the author(s)

Contributor: Benoît Sagot

https://inria.hal.science/hal-02678358

Submitted on: Sunday, May 31, 2020, 20:22:43

Last modified on: Thursday, February 1, 2024, 10:06:29

Dates and versions

hal-02678358, version 1 (31-05-2020)

Identifiers

HAL Id: hal-02678358, version 1

Cite

Murielle Fabre, Pedro Javier Ortiz Suárez, Benoît Sagot, Éric Villemonte de La Clergerie. French Contextualized Word-Embeddings with a sip of CaBeRnet: a New French Balanced Reference Corpus. CMLC-8 - 8th Workshop on the Challenges in the Management of Large Corpora, May 2020, Marseille, France. ⟨hal-02678358⟩

Collections

UNIV-RENNES1 CNRS INRIA IRISA LLF INRIA2 CAMPUS-AAR AAI UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES SORBONNE-UNIVERSITE UP-SOCIETES-HUMANITES ANR PRAIRIE-IA UR1-MATH-NUM

220 views

539 downloads
