Adapting Language Models When Training on Privacy-Transformed Data

Tugtekin Turan, Dietrich Klakow, Emmanuel Vincent, Denis Jouvet


Abstract
In recent years, voice-controlled personal assistants have revolutionized the interaction with smart devices and mobile applications. The data they collect are then used by system providers to train language models (LMs). Each spoken message reveals personal information; hence, private information must be removed from the input sentences. Our data sanitization process relies on recognizing named entities and replacing them with other words from the same class. However, this may harm LM training, because privacy-transformed data is unlikely to match the test distribution. This paper aims to fill this gap by focusing on the adaptation of LMs initially trained on privacy-transformed sentences using a small amount of original, untransformed data. To do so, we combine class-based LMs, which provide an effective approach to overcoming data sparsity in the context of n-gram LMs, with neural LMs, which handle longer contexts and can yield better predictions. Our experiments show that training an LM on privacy-transformed data results in a relative 11% word error rate (WER) increase compared to training on the original untransformed data, and that adapting this model on a limited amount of original untransformed data leads to a relative 8% WER improvement over the model trained solely on privacy-transformed data.
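The sanitization step described in the abstract, recognizing named entities and replacing them with other words from the same class, can be illustrated with a minimal sketch. This is not the authors' actual pipeline: it assumes spaCy's off-the-shelf English NER model, and the per-class surrogate pools (SURROGATES) are hypothetical placeholders.

```python
# Minimal sketch of class-preserving entity replacement for privacy
# transformation. Assumes spaCy and its small English model are installed;
# the surrogate pools below are illustrative, not from the paper.
import random
import spacy

nlp = spacy.load("en_core_web_sm")

# Hypothetical surrogate pools, keyed by NER class.
SURROGATES = {
    "PERSON": ["Alice Martin", "John Baker", "Maria Lopez"],
    "GPE": ["Lyon", "Berlin", "Porto"],
    "ORG": ["Acme Corp", "Globex", "Initech"],
}

def privacy_transform(sentence: str) -> str:
    """Replace each recognized named entity with a random surrogate
    drawn from the same class, leaving all other words untouched."""
    doc = nlp(sentence)
    out, last = [], 0
    for ent in doc.ents:
        if ent.label_ in SURROGATES:
            out.append(sentence[last:ent.start_char])
            out.append(random.choice(SURROGATES[ent.label_]))
            last = ent.end_char
    out.append(sentence[last:])
    return "".join(out)

print(privacy_transform("Call Peter Smith when you land in Paris."))
# e.g. "Call Maria Lopez when you land in Berlin."
```

As the abstract notes, data transformed this way no longer matches the original test distribution, which motivates adapting the resulting LM on a small amount of untransformed data.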
Anthology ID:
2022.lrec-1.465
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
4367–4373
URL:
https://aclanthology.org/2022.lrec-1.465
Cite (ACL):
Tugtekin Turan, Dietrich Klakow, Emmanuel Vincent, and Denis Jouvet. 2022. Adapting Language Models When Training on Privacy-Transformed Data. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 4367–4373, Marseille, France. European Language Resources Association.
Cite (Informal):
Adapting Language Models When Training on Privacy-Transformed Data (Turan et al., LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.465.pdf