Industry Scale Semi-Supervised Learning for Natural Language Understanding

Luoxin Chen, Francisco Garcia, Varun Kumar, He Xie, Jianhua Lu


Abstract
This paper presents a production Semi-Supervised Learning (SSL) pipeline based on the student-teacher framework, which leverages millions of unlabeled examples to improve Natural Language Understanding (NLU) tasks. We investigate two questions related to the use of unlabeled data in a production SSL context: 1) how to select samples from a huge unlabeled data pool that are beneficial for SSL training, and 2) how the selected data affects the performance of different state-of-the-art SSL techniques. We compare four widely used SSL techniques, Pseudo-label (PL), Knowledge Distillation (KD), Virtual Adversarial Training (VAT), and Cross-View Training (CVT), in conjunction with two data selection methods: committee-based selection and submodular optimization based selection. We further examine the benefits and drawbacks of these techniques when applied to intent classification (IC) and named entity recognition (NER) tasks, and provide guidelines specifying when each of these methods might be beneficial for improving large scale NLU systems.
Anthology ID:
2021.naacl-industry.39
Volume:
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers
Month:
June
Year:
2021
Address:
Online
Editors:
Young-bum Kim, Yunyao Li, Owen Rambow
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
311–318
URL:
https://aclanthology.org/2021.naacl-industry.39
DOI:
10.18653/v1/2021.naacl-industry.39
Cite (ACL):
Luoxin Chen, Francisco Garcia, Varun Kumar, He Xie, and Jianhua Lu. 2021. Industry Scale Semi-Supervised Learning for Natural Language Understanding. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers, pages 311–318, Online. Association for Computational Linguistics.
Cite (Informal):
Industry Scale Semi-Supervised Learning for Natural Language Understanding (Chen et al., NAACL 2021)
PDF:
https://aclanthology.org/2021.naacl-industry.39.pdf
Video:
https://aclanthology.org/2021.naacl-industry.39.mp4