When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models

Benjamin Muller; Antonis Anastasopoulos; Benoît Sagot; Djamé Seddah

Preprint, Working Paper. Year: 2020

Benjamin Muller

Role: Author

Automatic Language Modelling and ANAlysis & Computational Humanities

Antonis Anastasopoulos

Role: Author

George Mason University [Fairfax]

Benoît Sagot

Role: Author
PersonId : 1461
IdHAL : bsagot
ORCID : 0000-0002-0107-8526
IdRef : 177454229

Automatic Language Modelling and ANAlysis & Computational Humanities

Djamé Seddah

Role: Author
PersonId : 11545
IdHAL : djameseddah
IdRef : 086185136

Automatic Language Modelling and ANAlysis & Computational Humanities

Abstract

Transfer learning based on pretraining language models on a large amount of raw data has become a new norm for reaching state-of-the-art performance in NLP. Still, it remains unclear how this approach should be applied to unseen languages that are not covered by any available large-scale multilingual language model and for which only a small amount of raw data is generally available. In this work, by comparing multilingual and monolingual models, we show that such models behave in multiple ways on unseen languages. Some languages greatly benefit from transfer learning and behave similarly to closely related high-resource languages, whereas others apparently do not. Focusing on the latter, we show that this failure to transfer is largely related to the impact of the script used to write such languages. Transliterating those languages significantly improves the performance of large-scale multilingual language models on downstream tasks.

Domains

Computation and Language [cs.CL]

Contributor: Benoît Sagot

https://inria.hal.science/hal-03109106

Submitted on: Wednesday, January 13, 2021, 16:07:25

Last modified on: Thursday, February 1, 2024, 10:05:37

Dates and versions

hal-03109106 , version 1 (13-01-2021)

Identifiers

HAL Id : hal-03109106 , version 1
ARXIV : 2010.12858

Cite

Benjamin Muller, Antonis Anastasopoulos, Benoît Sagot, Djamé Seddah. When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models. 2020. ⟨hal-03109106⟩

Collections

UNIV-RENNES1 INRIA IRISA INRIA2 UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES ANR PRAIRIE-IA UR1-MATH-NUM
