Conference paper · Year: 2021

Adapting Language Models When Training on Privacy-Transformed Data

Abstract

In recent years, voice-controlled personal assistants have revolutionized the way users interact with smart devices and mobile applications. The spoken queries collected through these assistants are then used by system providers to retrain and improve their language models (LMs). Because each spoken message may reveal personal information, private data must be removed from the input utterances before training. However, this can harm LM training, since privacy-transformed data is unlikely to match the test distribution. This paper aims to fill that gap by focusing on the adaptation of an LM initially trained on privacy-transformed utterances. Our data sanitization process relies on named-entity recognition, and we propose an LM adaptation strategy over the private data with minimal loss. Class-based modeling is an effective way to overcome data sparsity when training n-gram models, while neural LMs can exploit longer contexts and thus yield better predictions. Our methodology combines the predictive power of class-based models with the generalization capability of neural models. Privacy transformation alone causes a relative 11% increase in word error rate (WER) compared to an LM trained on the clean data. Despite the privacy preservation, we can still achieve comparable accuracy: empirical evaluations attain a relative WER improvement of 8% over the initial model.
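As a rough illustration of the sanitization step described in the abstract (a minimal sketch, not the authors' actual pipeline), the snippet below uses spaCy's pretrained English NER model as a hypothetical stand-in for the recognizer and replaces each detected entity with a class tag such as <person> or <gpe>; the model name and tag inventory are assumptions for the example only.

    # Minimal sketch of NER-based privacy transformation (illustrative only).
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumed pretrained English NER model

    def privacy_transform(utterance: str) -> str:
        """Replace named entities with class tags, so the LM is trained on
        e.g. '<person>' instead of the actual name."""
        doc = nlp(utterance)
        pieces, last = [], 0
        for ent in doc.ents:
            pieces.append(utterance[last:ent.start_char])
            pieces.append(f"<{ent.label_.lower()}>")  # e.g. <person>, <gpe>, <date>
            last = ent.end_char
        pieces.append(utterance[last:])
        return "".join(pieces)

    print(privacy_transform("Call John Smith in Paris tomorrow at 5 pm."))
    # Expected output is something like "Call <person> in <gpe> <date> at <time>."
    # (the exact tags depend on the NER model used).

In the paper's setting, such class tags would then serve as the word classes of the class-based n-gram component, while the neural LM is adapted on the same transformed text.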
Main file: Paper_1854.pdf (121.75 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-03189354 , version 1 (03-04-2021)
hal-03189354 , version 2 (08-05-2022)

Identifiers

  • HAL Id: hal-03189354, version 1

Cite

Mehmet Ali Tugtekin Turan, Dietrich Klakow, Emmanuel Vincent, Denis Jouvet. Adapting Language Models When Training on Privacy-Transformed Data. LREC 2022 - 13th Language Resources and Evaluation Conference, Aug 2021, Brno, Czech Republic. ⟨hal-03189354v1⟩
220 views
444 downloads
