VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation - INRIA - Institut National de Recherche en Informatique et en Automatique
Conference paper. Year: 2021

VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation

Abstract

We introduce VoxPopuli, a large-scale multilingual corpus providing 400K hours of unlabeled speech data in 23 languages. It is the largest open data to date for unsupervised representation learning as well as semi-supervised learning. VoxPopuli also contains 1.8K hours of transcribed speeches in 15 languages and their aligned oral interpretations into 15 target languages totaling 17.3K hours. We provide speech recognition (ASR) baselines and validate the versatility of VoxPopuli unlabeled data in semi-supervised ASR and speech-to-text translation under challenging out-of-domain settings. The corpus is available at https://github.com/facebookresearch/voxpopuli.
Main file
2101.00390.pdf (268.23 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-03831929, version 1 (27-10-2022)

Cite

Changhan Wang, Morgane Rivière, Ann Lee, Anne Wu, Chaitanya Talnikar, et al. VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation. ACL 2021 - 59th Annual Meeting of the Association for Computational Linguistics, Aug 2021, Bangkok, Thailand. ⟨10.18653/v1/2021.acl-long.80⟩. ⟨hal-03831929⟩