Text Corpora and the Challenge of Newly Written Languages

Alice Millour; Karën Fort

Conference paper. Year: 2020

Alice Millour

Role: Author
PersonId: 21553
IdHAL: alice-millour
IdRef: 253127947

Sens, Texte, Informatique, Histoire

Karën Fort

Role: Author
PersonId: 2215
IdHAL: karen-fort
ORCID: 0000-0002-0723-8850
IdRef: 176299548

Sens, Texte, Informatique, Histoire

Semantic Analysis of Natural Language

Abstract

Text corpora are the foundation on which most natural language processing systems rely. However, for many languages, collecting or building a text corpus of sufficient size remains a complex issue, especially for corpora that are accessible and distributed under a clear license allowing modification (such as annotation) and further resharing. In this paper, we review the sources of text corpora usually called upon to fill the gap in low-resource contexts, and how crowdsourcing has been used to build linguistic resources. We then present our own experiments with crowdsourcing text corpora and analyze the obstacles we encountered. Although the results obtained in terms of participation are still unsatisfactory, we argue that the effort towards greater involvement of speakers should be pursued, especially when the language of interest is newly written.

Keywords

text corpora, dialectal variants, spelling, crowdsourcing

Domains

Computation and Language [cs.CL]

Main file

ccurl2020_kfam.pdf (202.06 KB)

Origin: Files produced by the author(s)

https://hal.science/hal-02611209

Submitted on: Monday, May 18, 2020, 11:44:22

Last modified on: Saturday, October 7, 2023, 21:36:24

Dates and versions

hal-02611209, version 1 (18-05-2020)

Identifiers

HAL Id: hal-02611209, version 1

Cite

Alice Millour, Karën Fort. Text Corpora and the Challenge of Newly Written Languages. 1st Joint SLTU and CCURL Workshop (SLTU-CCURL 2020), May 2020, Marseille, France. ⟨hal-02611209⟩

Collections

CNRS INRIA UNIV-LORRAINE INRIA2 LORIA LORIA-NLPKD SORBONNE-UNIVERSITE STIH SU-LETTRES
