INRIA - Institut National de Recherche en Informatique et en Automatique
Preprint, Working Paper - Year: 2020

Deep Variational Metric Learning For Transfer Of Expressivity In Multispeaker Text To Speech

Abstract

In this paper, we propose an approach relying on multiclass N-pair loss based deep metric learning in a recurrent conditional variational autoencoder (RCVAE). We use the RCVAE to implement a multispeaker expressive text-to-speech (TTS) system. The proposed approach conditions the TTS system on speaker embeddings and leads to clustering of the latent space representation with respect to emotion. Deep metric learning helps to reduce the intra-class variance and increase the inter-class variance in the latent space; thus, we introduce a multiclass N-pair loss to enhance the meaningfulness of the latent space representation. To represent the speaker, we extract speaker embeddings from an x-vector based speaker recognition model trained on speech data from many speakers. To predict the vocoder features, we use the RCVAE for acoustic modeling, conditioning the model on the textual features as well as on the speaker embedding. We transfer expressivity by using the mean of the latent variables for each emotion to generate expressive speech in the voices of speakers for whom no expressive speech data is available. We compare the results with those of an RCVAE model without the multiclass N-pair loss as the baseline. The performance measured by mean opinion score (MOS), speaker MOS, and expressive MOS shows that N-pair loss based deep metric learning significantly improves the transfer of expressivity to the target speaker's voice in the synthesized speech.
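
The paper itself provides no code; the following is a minimal PyTorch sketch of how a multiclass N-pair loss could be applied to the RCVAE latent vectors. The batch layout (one anchor/positive pair per emotion class), the function name n_pair_loss, and the tensor sizes are illustrative assumptions, not the authors' implementation; the sketch relies on the standard equivalence between the N-pair objective and a softmax cross-entropy over anchor-positive similarity logits.

import torch
import torch.nn.functional as F

def n_pair_loss(anchors: torch.Tensor, positives: torch.Tensor) -> torch.Tensor:
    # Multiclass N-pair loss (Sohn, 2016), sketched for RCVAE latents.
    # anchors, positives: (N, D) latent vectors; row i of `positives` has the
    # same emotion class as row i of `anchors`, and the remaining N-1 rows
    # act as negatives for that anchor.
    logits = anchors @ positives.t()                  # (N, N) similarity matrix
    targets = torch.arange(anchors.size(0), device=anchors.device)
    # Cross-entropy over each similarity row pulls the same-emotion pair
    # together (reducing intra-class variance) and pushes the other emotion
    # classes apart (increasing inter-class variance).
    return F.cross_entropy(logits, targets)

# Illustrative usage: 4 emotion classes, 16-dimensional latent space.
z_anchor = torch.randn(4, 16)      # latents of anchor utterances
z_positive = torch.randn(4, 16)    # latents of same-emotion utterances
metric_loss = n_pair_loss(z_anchor, z_positive)

In training, such a metric term would presumably be combined with the usual variational objective (reconstruction loss plus KL divergence); at synthesis time, the per-emotion mean of the latent variables is used to transfer expressivity to target speakers, as described in the abstract.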
Main file
EUSIPCO_HAL_cersion.pdf (562.81 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-02573885, version 1 (14-05-2020)
hal-02573885, version 2 (22-10-2020)

Identifiers

  • HAL Id: hal-02573885, version 1

Cite

Ajinkya Kulkarni, Vincent Colotte, Denis Jouvet. Deep Variational Metric Learning For Transfer Of Expressivity In Multispeaker Text To Speech. 2020. ⟨hal-02573885v1⟩
292 Views
681 Downloads
