Efficient Semi-supervised Consistency Training for Natural Language Understanding

George Leung, Joshua Tan


Abstract
Manually labeled training data is expensive, noisy, and often scarce, such as when developing new features or localizing existing features for a new region. In cases where labeled data is limited but unlabeled data is abundant, semi-supervised learning methods such as consistency training can improve model performance by training models to output consistent predictions between original and augmented versions of unlabeled data. In this work, we explore different data augmentation methods for consistency training (CT) on Natural Language Understanding (NLU) domain classification (DC) in the limited labeled data regime. We explore three types of augmentation techniques (human paraphrasing, back-translation, and dropout) for unlabeled data and train DC models to jointly minimize both the supervised loss and the consistency loss on unlabeled data. Our results demonstrate that DC models trained with CT methods and dropout-based augmentation on only 0.1% (2,998 instances) of labeled data, with the remainder as unlabeled, can achieve a top-1 relative accuracy reduction of 12.25% compared to a fully supervised model trained with 100% of labeled data, outperforming fully supervised models trained on 10x that amount of labeled data. The dropout-based augmentation achieves performance comparable to back-translation-based augmentation with far fewer computational resources. This paves the way for using large-scale unlabeled data for semi-supervised learning in production NLU systems.
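The joint objective the abstract describes (a supervised loss on the small labeled set plus a consistency loss between two dropout-perturbed predictions on unlabeled data) can be sketched roughly as below. The toy linear model, the KL-based consistency term, and the loss weight `lam` are illustrative assumptions for exposition; the paper's actual architectures, augmentations, and loss weighting may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(x, W, drop_p, rng):
    # Dropout on the inputs acts as the stochastic augmentation:
    # two passes over the same batch give two different predictions.
    mask = rng.random(x.shape) >= drop_p
    h = (x * mask) / (1.0 - drop_p)  # inverted-dropout scaling
    return softmax(h @ W)

def supervised_loss(p, y, eps=1e-12):
    # Standard cross-entropy on the labeled examples.
    return -np.mean(np.log(p[np.arange(len(y)), y] + eps))

def consistency_loss(p, q, eps=1e-12):
    # KL(p || q) between the two dropout-perturbed prediction sets.
    return np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1))

# Toy data: 4-dim inputs, 3 domain classes (illustrative sizes).
W = rng.normal(size=(4, 3))
x_lab = rng.normal(size=(8, 4))
y_lab = rng.integers(0, 3, size=8)
x_unlab = rng.normal(size=(32, 4))  # abundant unlabeled batch

p_lab = forward(x_lab, W, drop_p=0.0, rng=rng)  # clean pass for supervised loss
p1 = forward(x_unlab, W, drop_p=0.3, rng=rng)   # first stochastic pass
p2 = forward(x_unlab, W, drop_p=0.3, rng=rng)   # second stochastic pass

lam = 1.0  # consistency-loss weight (assumed, not from the paper)
total = supervised_loss(p_lab, y_lab) + lam * consistency_loss(p1, p2)
```

In a real training loop, `total` would be minimized by gradient descent, so the model is penalized both for misclassifying the labeled examples and for making inconsistent predictions across dropout perturbations of the unlabeled ones.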
Anthology ID:
2022.naacl-industry.11
Volume:
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Track
Month:
July
Year:
2022
Address:
Hybrid: Seattle, Washington + Online
Editors:
Anastassia Loukina, Rashmi Gangadharaiah, Bonan Min
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
86–93
URL:
https://aclanthology.org/2022.naacl-industry.11
DOI:
10.18653/v1/2022.naacl-industry.11
Cite (ACL):
George Leung and Joshua Tan. 2022. Efficient Semi-supervised Consistency Training for Natural Language Understanding. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Track, pages 86–93, Hybrid: Seattle, Washington + Online. Association for Computational Linguistics.
Cite (Informal):
Efficient Semi-supervised Consistency Training for Natural Language Understanding (Leung & Tan, NAACL 2022)
PDF:
https://aclanthology.org/2022.naacl-industry.11.pdf
Video:
https://aclanthology.org/2022.naacl-industry.11.mp4