Training data reduction for multilingual Spoken Language Understanding systems

Anmol Bansal, Anjali Shenoy, Krishna Chaitanya Pappu, Kay Rottmann, Anurag Dwarakanath


Abstract
Fine-tuning self-supervised pre-trained language models such as BERT has significantly improved state-of-the-art performance on natural language processing tasks. Similar fine-tuning setups can also be used in commercial large-scale Spoken Language Understanding (SLU) systems to perform intent classification and slot tagging on user queries. Fine-tuning such powerful models for use in commercial systems requires large amounts of training data and compute resources to achieve high performance. This paper studies empirical methods for identifying training data redundancies in the fine-tuning paradigm. In particular, we explore rule-based and semantic techniques to reduce data in a multilingual fine-tuning setting and report our results on key SLU metrics. Through our experiments, we show that fine-tuning on a reduced data set can achieve performance on par with, or better than, a model fine-tuned on the entire data set.
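To make the idea of semantic redundancy reduction concrete, the following is a minimal illustrative sketch, not the paper's actual method: it drops near-duplicate training utterances by thresholding pairwise cosine similarity, using TF-IDF vectors as a simple stand-in for the semantic representations an SLU pipeline would typically use. The function name, threshold, and example data are all hypothetical.

```python
# Illustrative sketch only (assumed, not from the paper): reduce training data
# by keeping an utterance only if it is not too similar to an already-kept one.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def reduce_training_data(utterances, threshold=0.7):
    """Greedy near-duplicate filtering over TF-IDF cosine similarity."""
    vectors = TfidfVectorizer().fit_transform(utterances)
    kept = []  # indices of utterances retained so far
    for i in range(len(utterances)):
        if kept:
            sims = cosine_similarity(vectors[i], vectors[kept])
            if sims.max() >= threshold:
                continue  # too similar to something already kept; drop it
        kept.append(i)
    return [utterances[i] for i in kept]


if __name__ == "__main__":
    data = [
        "play some jazz music",
        "play some jazz music please",
        "set an alarm for 7 am",
        "what is the weather today",
    ]
    print(reduce_training_data(data))
```

In a multilingual SLU setting, the TF-IDF vectors would more plausibly be replaced by multilingual sentence embeddings, and the threshold tuned per locale; the greedy filtering structure stays the same.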
Anthology ID:
2021.icon-main.36
Volume:
Proceedings of the 18th International Conference on Natural Language Processing (ICON)
Month:
December
Year:
2021
Address:
National Institute of Technology Silchar, Silchar, India
Editors:
Sivaji Bandyopadhyay, Sobha Lalitha Devi, Pushpak Bhattacharyya
Venue:
ICON
Publisher:
NLP Association of India (NLPAI)
Pages:
298–306
URL:
https://aclanthology.org/2021.icon-main.36
Cite (ACL):
Anmol Bansal, Anjali Shenoy, Krishna Chaitanya Pappu, Kay Rottmann, and Anurag Dwarakanath. 2021. Training data reduction for multilingual Spoken Language Understanding systems. In Proceedings of the 18th International Conference on Natural Language Processing (ICON), pages 298–306, National Institute of Technology Silchar, Silchar, India. NLP Association of India (NLPAI).
Cite (Informal):
Training data reduction for multilingual Spoken Language Understanding systems (Bansal et al., ICON 2021)
PDF:
https://aclanthology.org/2021.icon-main.36.pdf