Towards unsupervised learning of speech features in the wild - INRIA - Institut National de Recherche en Informatique et en Automatique
Conference paper, Year: 2020

Towards unsupervised learning of speech features in the wild

Abstract

Recent work on unsupervised contrastive learning of speech representations has shown promising results, but so far it has mostly been applied to clean, curated speech datasets. Can it also be used with unprepared audio data "in the wild"? Here, we explore three potential problems in this setting: (i) the presence of non-speech data, (ii) noisy or low-quality speech data, and (iii) imbalance in the speaker distribution. We show that on the Libri-light train set, which is itself a relatively clean speech-only dataset, these problems combined can already incur a performance cost of up to 30% relative on the ABX score. The first two problems can be alleviated by data filtering: voice activity detection selects speech segments, while the perplexity of a model trained on clean data helps to discard entire files. The third problem can be alleviated by learning a speaker embedding in the predictive branch of the model. Together, these techniques build more robust speech features that can be transferred to an ASR task in the low-resource setting.
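
As a rough illustration of the two filtering steps named in the abstract, the sketch below keeps high-energy segments within a file and discards whole files whose perplexity under a scoring model is too high. It is not the authors' pipeline: the energy threshold is a crude stand-in for a real voice activity detector, the per-file log-probabilities are assumed to come from some model trained on clean data, and every function name, parameter, and file name here is hypothetical.

```python
import math


def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def speech_segments(samples, frame_len, energy_threshold):
    """Return (start, end) sample ranges whose frame energy exceeds the threshold.

    Assumption: a simple energy gate stands in for a real VAD model.
    """
    segments = []
    start = None
    energies = frame_energies(samples, frame_len)
    for idx, energy in enumerate(energies):
        if energy > energy_threshold:
            if start is None:
                start = idx * frame_len  # a speech run begins here
        elif start is not None:
            segments.append((start, idx * frame_len))  # the run just ended
            start = None
    if start is not None:
        segments.append((start, len(energies) * frame_len))
    return segments


def perplexity(log_probs):
    """Perplexity from per-step natural-log probabilities."""
    return math.exp(-sum(log_probs) / len(log_probs))


def keep_files(file_log_probs, max_perplexity):
    """Keep files whose perplexity under the (assumed) clean-data model is low."""
    return [
        name
        for name, log_probs in file_log_probs.items()
        if perplexity(log_probs) <= max_perplexity
    ]


# Toy usage: silence, a short burst, silence.
audio = [0.0] * 4 + [0.5, -0.5, 0.5, -0.5] + [0.0] * 4
print(speech_segments(audio, frame_len=4, energy_threshold=0.01))  # [(4, 8)]

# Hypothetical per-file log-probabilities from a scoring model.
scores = {
    "clean.flac": [math.log(0.5)] * 3,
    "noisy.flac": [math.log(0.1)] * 3,
}
print(keep_files(scores, max_perplexity=5.0))  # ['clean.flac']
```

In this sketch the two filters work at different granularities, mirroring the abstract: segment selection trims within a file, while the perplexity cut-off accepts or rejects a file as a whole.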
Main file
Riviere_D_2020_Towards_CPC_in_the_wild.SLT.pdf (214.32 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-03070411, version 1 (15-12-2020)

Identifiers

  • HAL Id: hal-03070411, version 1

Cite

Morgane Rivière, Emmanuel Dupoux. Towards unsupervised learning of speech features in the wild. SLT 2020: IEEE Spoken Language Technology Workshop, Dec 2020, Shenzhen / Virtual, China. ⟨hal-03070411⟩
86 views
719 downloads
