Unsupervised training data re-weighting for natural language understanding with local distribution approximation

Jose Garrido Ramas, Dieu-thu Le, Bei Chen, Manoj Kumar, Kay Rottmann


Abstract
One of the major challenges of training Natural Language Understanding (NLU) production models lies in the discrepancy between the distributions of the offline training data and the online live data, caused by, e.g., a biased sampling scheme, cyclic seasonality shifts, annotated training data coming from a variety of sources, and a changing pool of users. As a consequence, a model trained on the offline data is biased. We observe this problem especially in task-oriented conversational systems, where the topics of interest and the characteristics of the user population change over time. In this paper, we propose an unsupervised approach to mitigating offline training data sampling bias across multiple NLU tasks. We show that a local distribution approximation in the pre-trained embedding space enables the estimation of importance weights for training samples, guiding re-sampling for effective bias mitigation. We illustrate our novel approach on multiple NLU datasets and show improvements obtained without additional annotation, making this a general approach for mitigating the effects of sampling bias.
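For intuition, the sketch below shows one plausible way such importance weights could be estimated: a k-nearest-neighbor density-ratio approximation in a pre-trained embedding space, followed by weighted re-sampling of the training set. This is a minimal illustration under assumptions of our own, not the paper's implementation; the function name knn_importance_weights, the choice of k-NN estimator, and the scikit-learn-based code are all hypothetical.

```python
# Hedged sketch: unsupervised importance weighting via a k-NN density-ratio
# approximation in a pre-trained embedding space. Names and estimator choice
# are illustrative assumptions, not the authors' method.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_importance_weights(train_emb, live_emb, k=10):
    """Estimate w(x) ~ p_live(x) / p_train(x) for each training embedding.

    Uses the classical k-NN density estimate p(x) ~ k / (n * V * r_k^d),
    so the ratio reduces to (n_train / n_live) * (r_train / r_live)^d,
    where r_k is the distance to the k-th nearest neighbor.
    """
    n_train, d = train_emb.shape
    n_live = live_emb.shape[0]

    # Distance to the k-th nearest live point: small distance = high live density.
    nn_live = NearestNeighbors(n_neighbors=k).fit(live_emb)
    r_live = nn_live.kneighbors(train_emb)[0][:, -1]

    # Distance to the k-th nearest *other* training point (k+1 neighbors,
    # because each point is its own nearest neighbor at distance 0).
    nn_train = NearestNeighbors(n_neighbors=k + 1).fit(train_emb)
    r_train = nn_train.kneighbors(train_emb)[0][:, -1]

    # Work in log space to avoid overflow from the r^d terms in high dimensions.
    eps = 1e-12
    log_w = (np.log(n_train) - np.log(n_live)
             + d * (np.log(r_train + eps) - np.log(r_live + eps)))
    w = np.exp(log_w - log_w.max())   # stabilize before normalizing
    return w / w.sum()                # weights sum to 1 for re-sampling

# Usage: re-sample the training set toward the live distribution, then retrain.
# weights = knn_importance_weights(train_emb, live_emb, k=10)
# idx = np.random.choice(len(train_emb), size=len(train_emb),
#                        replace=True, p=weights)
```

In this sketch, a training sample that lies in a region dense with live data but sparse in training data receives a large weight and is over-sampled, which is the re-weighting effect the abstract describes; the exact estimator and re-sampling scheme used in the paper may differ.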
Anthology ID:
2022.emnlp-industry.15
Volume:
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track
Month:
December
Year:
2022
Address:
Abu Dhabi, UAE
Editors:
Yunyao Li, Angeliki Lazaridou
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
154–160
URL:
https://aclanthology.org/2022.emnlp-industry.15
DOI:
10.18653/v1/2022.emnlp-industry.15
Cite (ACL):
Jose Garrido Ramas, Dieu-thu Le, Bei Chen, Manoj Kumar, and Kay Rottmann. 2022. Unsupervised training data re-weighting for natural language understanding with local distribution approximation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 154–160, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Unsupervised training data re-weighting for natural language understanding with local distribution approximation (Garrido Ramas et al., EMNLP 2022)
PDF:
https://aclanthology.org/2022.emnlp-industry.15.pdf