Deep variational metric learning for transfer of expressivity in multispeaker text to Speech - INRIA - Institut National de Recherche en Informatique et en Automatique Accéder directement au contenu
Communication Dans Un Congrès Année : 2020

Deep variational metric learning for transfer of expressivity in multispeaker text to Speech

Résumé

In this paper, we propose to use the deep metric learning based multi-class N-pair loss, for text-to-speech (TTS) synthesis. We use the proposed loss function in a recurrent conditional variational autoencoder (RCVAE) for transferring expressivity in a French multispeaker TTS system. We extracted the speaker embeddings from the x-vector based speaker recognition model trained on speech data from many speakers to represent the speaker identity. We use mean of the latent variables to transfer expressivity for each emotion to generate expressive speech in the desired speaker's voice. In contrast to the commonly used loss functions such as triplet loss or contrastive loss, multi-class N-pair loss considers all the negative examples which make each class of emotion distinguished from one another. Furthermore, the presented approach assists in creating a robust representation of expressivity irrespective of speaker identities. Our proposed approach demonstrates the improved performance for transfer of expressivity in the target speaker's voice in a synthesized speech. To our knowledge, it is for the first time multi-class N-pair loss and x-vector based speaker embeddings are used in a TTS system.
Fichier principal
Vignette du fichier
SLSP_2020_published_version.pdf (683.84 Ko) Télécharger le fichier
Origine : Fichiers produits par l'(les) auteur(s)
Loading...

Dates et versions

hal-02573885 , version 1 (14-05-2020)
hal-02573885 , version 2 (22-10-2020)

Identifiants

  • HAL Id : hal-02573885 , version 2

Citer

Ajinkya Kulkarni, Vincent Colotte, Denis Jouvet. Deep variational metric learning for transfer of expressivity in multispeaker text to Speech. SLSP 2020 - 8th International Conference on Statistical Language and Speech Processing, Oct 2020, Cardiff / Virtual, United Kingdom. ⟨hal-02573885v2⟩
292 Consultations
681 Téléchargements

Partager

Gmail Facebook X LinkedIn More