Deep Variational Generative Models for Audio-visual Speech Separation

Viet-Nhat Nguyen; Mostafa Sadeghi; Elisa Ricci; Xavier Alameda-Pineda

doi:10.1109/MLSP52302.2021.9596406

Conference paper, Year: 2021

Viet-Nhat Nguyen

Role: Author

Interpretation and Modelling of Images and Videos

Mostafa Sadeghi

Role: Author
PersonId: 752828
IdHAL: msadeghi
ORCID: 0000-0002-0272-8017

Interpretation and Modelling of Images and Videos

Speech Modeling for Facilitating Oral-Based Communication

Elisa Ricci

Role: Author

Fondazione Bruno Kessler [Trento, Italy]

Xavier Alameda-Pineda

Role: Author
PersonId: 16186
IdHAL: xavier-alameda-pineda
ORCID: 0000-0002-5354-1084
IdRef: 18450919X

Interpretation and Modelling of Images and Videos

Towards socially intelligent robots through learning, perception, and control

Abstract

In this paper, we address audio-visual speech separation given a single-channel audio recording and visual information (lip movements) associated with each speaker. We propose an unsupervised technique based on audio-visual generative modeling of clean speech. More specifically, during training, a latent variable generative model is learned from clean speech spectrograms using a variational auto-encoder (VAE). To better utilize the visual information, the posteriors of the latent variables are inferred from mixed speech (instead of clean speech) together with the visual data. The visual modality also serves as a prior for the latent variables, through a visual network. At test time, the learned generative model (for both speaker-independent and speaker-dependent scenarios) is combined with an unsupervised non-negative matrix factorization (NMF) variance model for background noise. All the latent variables and noise parameters are then estimated by a Monte Carlo expectation-maximization algorithm. Our experiments show that the proposed unsupervised VAE-based method yields better separation performance than NMF-based approaches as well as a supervised deep learning-based technique.
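As a rough illustration of what an NMF variance model for background noise can look like, the sketch below factorizes a non-negative power spectrogram `V` into `W @ H` with multiplicative updates. The abstract does not state which divergence or update rule is used, so the Itakura-Saito divergence and the majorization-minimization updates here are assumptions, and the names `fit_nmf_variance` and `is_divergence` are hypothetical. The VAE speech model and the Monte Carlo expectation-maximization procedure are not reproduced.

```python
import numpy as np


def is_divergence(V, V_hat):
    """Itakura-Saito divergence summed over all time-frequency bins."""
    ratio = V / V_hat
    return float(np.sum(ratio - np.log(ratio) - 1.0))


def fit_nmf_variance(V, rank=4, n_iter=100, eps=1e-10, seed=0):
    """Fit V ~= W @ H with non-negative factors (assumed IS-NMF, not the paper's code).

    V    : (F, N) strictly positive power spectrogram
    W    : (F, rank) spectral templates
    H    : (rank, N) temporal activations
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, rank)) + eps
    H = rng.random((rank, n_frames)) + eps
    losses = []
    for _ in range(n_iter):
        # Update the templates W with H fixed (exponent 1/2: MM form for IS).
        V_hat = W @ H + eps
        W *= np.sqrt(((V / V_hat**2) @ H.T) / ((1.0 / V_hat) @ H.T))
        # Update the activations H with W fixed.
        V_hat = W @ H + eps
        H *= np.sqrt((W.T @ (V / V_hat**2)) / (W.T @ (1.0 / V_hat)))
        losses.append(is_divergence(V, W @ H + eps))
    return W, H, losses


# Synthetic stand-in for a noise power spectrogram (64 frequency bins, 40 frames).
rng = np.random.default_rng(1)
V = rng.random((64, 3)) @ rng.random((3, 40)) + 1e-3
W, H, losses = fit_nmf_variance(V)
print(f"IS divergence: first iteration {losses[0]:.3f}, last iteration {losses[-1]:.3f}")
```

In a separation pipeline of the kind the abstract describes, the product `W @ H` would play the role of the time-varying noise variance, with `W` and `H` re-estimated alongside the speech latent variables rather than fitted in isolation as done here.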

Domains

Computer Vision and Pattern Recognition [cs.CV]; Signal and Image Processing [eess.SP]; Machine Learning [cs.LG]; Sound [cs.SD]

Contributor: Perception team

https://inria.hal.science/hal-02930662

Submitted on: Friday, September 4, 2020, 14:36:19

Last modified on: Thursday, April 4, 2024, 18:22:40

Dates and versions

Identifiers

HAL Id: hal-02930662, version 1
arXiv: 2008.07191
DOI: 10.1109/MLSP52302.2021.9596406

Cite

Viet-Nhat Nguyen, Mostafa Sadeghi, Elisa Ricci, Xavier Alameda-Pineda. Deep Variational Generative Models for Audio-visual Speech Separation. MLSP 2021 - IEEE International Workshop on Machine Learning for Signal Processing, Oct 2021, Gold Coast, Australia. ⟨10.1109/MLSP52302.2021.9596406⟩. ⟨hal-02930662⟩

Collections

UNIV-RENNES1 UGA CNRS INRIA IRISA LJK LJK_GI LJK_GI_PERCEPTION UNIV-LORRAINE INRIA2 LORIA LORIA-NLPKD UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES MIAI ANR UR1-MATH-NUM LJK-GI-ROBOTLEARN
