INRIA - Institut National de Recherche en Informatique et en Automatique
Preprint, Working Paper - Year: 2020

Deep Variational Metric Learning For Transfer Of Expressivity In Multispeaker Text To Speech

Abstract

In this paper, we propose an approach relying on multiclass N-pair loss based deep metric learning in a recurrent conditional variational autoencoder (RCVAE). We use the RCVAE to implement a multispeaker expressive text-to-speech (TTS) system. The proposed approach conditions the TTS system on speaker embeddings and leads to clustering of the latent space representation with respect to emotion. Deep metric learning helps to reduce the intra-class variance and increase the inter-class variance in the latent space; thus, we introduce a multiclass N-pair loss to enhance the meaningfulness of the latent space representation. To represent the speaker, we extract speaker embeddings from an x-vector based speaker recognition model trained on speech data from many speakers. To predict the vocoder features, we use the RCVAE for acoustic modeling, conditioning the model on the textual features as well as on the speaker embedding. We transfer expressivity by using the mean of the latent variables for each emotion to generate expressive speech in the voices of speakers for whom no expressive speech data is available. We compare the results with those of an RCVAE model without the multiclass N-pair loss as the baseline. The performance measured by mean opinion score (MOS), speaker MOS, and expressive MOS shows that N-pair loss based deep metric learning significantly improves the transfer of expressivity to the target speaker's voice in the synthesized speech.
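
The paper itself provides no code; the following is a minimal PyTorch sketch of how a multiclass N-pair loss could be applied to the RCVAE latent vectors. The batch layout (one anchor/positive pair per emotion class), the function name n_pair_loss, and the tensor sizes are illustrative assumptions, not the authors' implementation; the sketch relies on the standard equivalence between the N-pair objective and a softmax cross-entropy over anchor-positive similarity logits.

import torch
import torch.nn.functional as F

def n_pair_loss(anchors: torch.Tensor, positives: torch.Tensor) -> torch.Tensor:
    # Multiclass N-pair loss (Sohn, 2016), sketched for RCVAE latents.
    # anchors, positives: (N, D) latent vectors; row i of `positives` has the
    # same emotion class as row i of `anchors`, and the remaining N-1 rows
    # act as negatives for that anchor.
    logits = anchors @ positives.t()                  # (N, N) similarity matrix
    targets = torch.arange(anchors.size(0), device=anchors.device)
    # Cross-entropy over each similarity row pulls the same-emotion pair
    # together (reducing intra-class variance) and pushes the other emotion
    # classes apart (increasing inter-class variance).
    return F.cross_entropy(logits, targets)

# Illustrative usage: 4 emotion classes, 16-dimensional latent space.
z_anchor = torch.randn(4, 16)      # latents of anchor utterances
z_positive = torch.randn(4, 16)    # latents of same-emotion utterances
metric_loss = n_pair_loss(z_anchor, z_positive)

In training, such a metric term would presumably be combined with the usual variational objective (reconstruction loss plus KL divergence); at synthesis time, the per-emotion mean of the latent variables is used to transfer expressivity to target speakers, as described in the abstract.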
Main file
EUSIPCO_HAL_cersion.pdf (562.81 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-02573885, version 1 (14-05-2020)
hal-02573885, version 2 (22-10-2020)

Identifiers

  • HAL Id: hal-02573885, version 1

Cite

Ajinkya Kulkarni, Vincent Colotte, Denis Jouvet. Deep Variational Metric Learning For Transfer Of Expressivity In Multispeaker Text To Speech. 2020. ⟨hal-02573885v1⟩
292 Views
681 Downloads
