Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

Julia Kreutzer; Isaac Caswell; Lisa Wang; Ahsan Wahab; Daan van Esch; Nasanbayar Ulzii-Orshikh; Allahsera Tapo; Nishant Subramani; Artem Sokolov; Claytone Sikasote; Monang Setyawan; Supheakmungkol Sarin; Sokhar Samb; Benoît Sagot; Clara Rivera; Annette Rios; Isabel Papadimitriou; Salomey Osei; Pedro Ortiz Suarez; Iroro Orife; Kelechi Ogueji; Rubungo Andre Niyongabo; Toan Q. Nguyen; Mathias Müller; André Müller; Shamsuddeen Hassan Muhammad; Nanda Muhammad; Ayanda Mnyakeni; Jamshidbek Mirzakhalov; Tapiwanashe Matangira; Colin Leong; Nze Lawson; Sneha Kudugunta; Yacine Jernite; Mathias Jenny; Orhan Firat; Bonaventure F. P. Dossou; Sakhile Dlamini; Nisansa de Silva; Sakine Çabuk Balli; Stella Biderman; Alessia Battisti; Ahmed Baruwa; Ankur Bapna; Pallavi Baljekar; Israel Abebe Azime; Ayodele Awokoya; Duygu Ataman; Orevaoghene Ahia; Oghenefego Ahia; Sweta Agrawal; Mofetoluwa Adeyemi

doi:10.1162/tacl_a_00447

Journal article, Transactions of the Association for Computational Linguistics, Year: 2022

Authors and affiliations

Julia Kreutzer: Google Inc.; Masakhane NLP
Isaac Caswell: Google Inc.
Lisa Wang: Google Inc.
Ahsan Wahab: Turkic Interlingua
Daan van Esch: Google Inc.
Nasanbayar Ulzii-Orshikh: Computer Science Department [Haverford]
Allahsera Tapo: Masakhane NLP; RobotsMali
Nishant Subramani: Masakhane NLP; Intel Labs Berkeley
Artem Sokolov: Google Inc.
Claytone Sikasote: Masakhane NLP; University of Zambia [Lusaka]
Monang Setyawan: Google Inc.
Supheakmungkol Sarin: Turkic Interlingua
Sokhar Samb: African Institute for Mathematical Sciences
Benoît Sagot: Automatic Language Modelling and ANAlysis & Computational Humanities (PersonId: 1461; IdHAL: bsagot; ORCID: 0000-0002-0107-8526; IdRef: 177454229)
Clara Rivera: Google Inc.
Annette Rios: Universität Zürich [Zürich] = University of Zurich
Isabel Papadimitriou: Stanford University
Salomey Osei: Masakhane NLP; Kwame Nkrumah University of Science and Technology
Pedro Ortiz Suarez: Automatic Language Modelling and ANAlysis & Computational Humanities; Sorbonne Université (PersonId: 178412; IdHAL: pedro-ortiz-suarez; ORCID: 0000-0003-0343-8852; IdRef: 264210743)
Iroro Orife: Masakhane NLP
Kelechi Ogueji: Masakhane NLP; University of Waterloo [Waterloo]
Rubungo Andre Niyongabo: Masakhane NLP; University of Electronic Science and Technology of China [Chengdu]
Toan Q. Nguyen: University of Notre Dame [Indiana]
Mathias Müller: Universität Zürich [Zürich] = University of Zurich
André Müller: Universität Zürich [Zürich] = University of Zurich
Shamsuddeen Hassan Muhammad: Masakhane NLP; Bayero University Kano
Nanda Muhammad: Google Inc.
Ayanda Mnyakeni: Google Inc.
Jamshidbek Mirzakhalov: Turkic Interlingua; University of South Florida [Tampa]
Tapiwanashe Matangira: Google Inc.
Colin Leong: Masakhane NLP
Nze Lawson: Google Inc.
Sneha Kudugunta: Google Inc.
Yacine Jernite: Masakhane NLP; Hugging Face
Mathias Jenny: Universität Zürich [Zürich] = University of Zurich
Orhan Firat: Turkic Interlingua; Google Inc.
Bonaventure F. P. Dossou: Masakhane NLP; Jacobs University = Constructor University [Bremen]
Sakhile Dlamini: Google Inc.
Nisansa de Silva: University of Moratuwa
Sakine Çabuk Balli: Google Inc.
Stella Biderman: EleutherAI
Alessia Battisti: Universität Zürich [Zürich] = University of Zurich
Ahmed Baruwa: Masakhane NLP; Obafemi Awolowo University
Ankur Bapna: Google Inc.
Pallavi Baljekar: Google Inc.
Israel Abebe Azime: Masakhane NLP; African Institute for Mathematical Sciences
Ayodele Awokoya: Masakhane NLP; University of Ibadan
Duygu Ataman: Turkic Interlingua; Universität Zürich [Zürich] = University of Zurich
Orevaoghene Ahia: Masakhane NLP; InstaDeep
Oghenefego Ahia: Turkic Interlingua
Sweta Agrawal: University of Maryland [Baltimore]
Mofetoluwa Adeyemi: Masakhane NLP; Defence Space Administration [Abuja]

Abstract

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. However, to date there has been no systematic analysis of the quality of these publicly available datasets, or whether the datasets actually contain content in the languages they claim to represent. In this work, we manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4), and audit the correctness of language codes in a sixth (JW300). We find that lower-resource corpora have systematic issues: at least 15 corpora are completely erroneous, and a significant fraction contains less than 50% sentences of acceptable quality. Similarly, we find 82 corpora that are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-speakers of the languages in question, and supplement the human judgements with automatic analyses. Inspired by our analysis, we recommend techniques to evaluate and improve multilingual corpora and discuss the risks that come with low-quality data releases.
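
The abstract's threshold of "less than 50% sentences of acceptable quality" can be illustrated with a minimal sketch. It assumes each corpus is represented by a small sample of manually labeled sentences. The label names, the `acceptable_share` and `flag_corpora` helpers, and the toy annotations are all hypothetical and are not taken from the paper.

```python
from collections import Counter

# Hypothetical label set; the paper's actual annotation scheme may differ.
ACCEPTABLE = {"correct"}


def acceptable_share(labels):
    """Return the fraction of sampled sentences whose label counts as acceptable."""
    if not labels:
        raise ValueError("need at least one annotated sentence")
    counts = Counter(labels)
    return sum(counts[label] for label in ACCEPTABLE) / len(labels)


def flag_corpora(samples, threshold=0.5):
    """Return the sorted names of corpora whose acceptable share is below the threshold."""
    return sorted(
        name for name, labels in samples.items()
        if acceptable_share(labels) < threshold
    )


# Toy annotations invented for illustration only (not data from the paper).
samples = {
    "corpus_a": ["correct", "correct", "wrong_language", "correct"],
    "corpus_b": ["non_linguistic", "wrong_language", "correct", "non_linguistic"],
    "corpus_c": ["wrong_language"] * 4,
}

for name in sorted(samples):
    print(name, acceptable_share(samples[name]))
print(flag_corpora(samples))
```

With these toy inputs, `corpus_b` and `corpus_c` fall below the 0.5 threshold and would be flagged. A real audit would also need a sampling procedure and annotator guidelines, which this sketch leaves out.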

Domains

Computation and Language [cs.CL]

Main file

tacl_a_00447.pdf (348.24 KB)

Origin: Files produced by the author(s)

Contributor: Benoît Sagot

https://inria.hal.science/hal-03177623

Submitted on: Sunday, 13 February 2022, 19:17:14

Last modified on: Thursday, 1 February 2024, 10:06:32

Long-term archiving on: Saturday, 14 May 2022, 18:18:22

Dates and versions

hal-03177623, version 1 (13-02-2022)

License

Attribution

Identifiers

HAL Id: hal-03177623, version 1
arXiv: 2103.12028
DOI: 10.1162/tacl_a_00447

Cite

Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, et al. Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. Transactions of the Association for Computational Linguistics, 2022, 10, pp. 50-72. ⟨10.1162/tacl_a_00447⟩. ⟨hal-03177623⟩

Collections

UNIV-RENNES1 INRIA IRISA INRIA2 GENCI UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES SORBONNE-UNIVERSITE ANR UR1-MATH-NUM

391 views

250 downloads
