Unsupervised Data Augmentation with Naive Augmentation and without Unlabeled Data

David Lowell, Brian Howard, Zachary C. Lipton, Byron Wallace


Abstract
Unsupervised Data Augmentation (UDA) is a semisupervised technique that applies a consistency loss to penalize differences between a model’s predictions on (a) observed (unlabeled) examples; and (b) corresponding ‘noised’ examples produced via data augmentation. While UDA has gained popularity for text classification, open questions linger over which design decisions are necessary and how to extend the method to sequence labeling tasks. In this paper, we re-examine UDA and demonstrate its efficacy on several sequential tasks. Our main contribution is an empirical study of UDA to establish which components of the algorithm confer benefits in NLP. Notably, although prior work has emphasized the use of clever augmentation techniques including back-translation, we find that enforcing consistency between predictions assigned to observed and randomly substituted words often yields comparable (or greater) benefits compared to these more complex perturbation models. Furthermore, we find that applying UDA’s consistency loss affords meaningful gains without any unlabeled data at all, i.e., in a standard supervised setting. In short, UDA need not be unsupervised to realize much of its noted benefits, and does not require complex data augmentation to be effective.
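The consistency loss described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes KL divergence as the measure of disagreement between predictions, uniform random word substitution with probability `p` as the naive augmentation, and a hypothetical `toy_model` standing in for a real classifier. All function names and parameter values here are illustrative assumptions.

```python
import math
import random


def random_substitute(tokens, vocab, p=0.3, rng=None):
    """Naive augmentation: replace each token, with probability p,
    by a word drawn uniformly from vocab (p and vocab are assumptions)."""
    rng = rng or random.Random()
    return [rng.choice(vocab) if rng.random() < p else tok for tok in tokens]


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consistency_loss(model, tokens, vocab, p=0.3, rng=None):
    """Penalize disagreement between predictions on the observed input
    and on its randomly substituted counterpart. No label is needed."""
    noised = random_substitute(tokens, vocab, p, rng)
    pred_observed = softmax(model(tokens))
    pred_noised = softmax(model(noised))
    return kl_divergence(pred_observed, pred_noised)


def toy_model(tokens):
    """Hypothetical stand-in classifier: two logits from cue-word counts."""
    positive = sum(t in {"good", "great"} for t in tokens)
    negative = sum(t in {"bad", "awful"} for t in tokens)
    return [float(positive), float(negative)]


vocab = ["good", "great", "bad", "awful", "movie", "plot", "the", "was"]
sentence = ["the", "movie", "was", "great"]
loss = consistency_loss(toy_model, sentence, vocab, p=0.3, rng=random.Random(0))
```

Because the loss never consults a label, the same term can be computed on unlabeled text or on the inputs of a labeled training set. The latter corresponds to the supervised setting the abstract mentions, where it would be added to the usual supervised loss. How the two terms are weighted is left open here.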
Anthology ID:
2021.emnlp-main.408
Volume:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2021
Address:
Online and Punta Cana, Dominican Republic
Editors:
Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
4992–5001
URL:
https://aclanthology.org/2021.emnlp-main.408
DOI:
10.18653/v1/2021.emnlp-main.408
Cite (ACL):
David Lowell, Brian Howard, Zachary C. Lipton, and Byron Wallace. 2021. Unsupervised Data Augmentation with Naive Augmentation and without Unlabeled Data. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4992–5001, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Unsupervised Data Augmentation with Naive Augmentation and without Unlabeled Data (Lowell et al., EMNLP 2021)
PDF:
https://aclanthology.org/2021.emnlp-main.408.pdf
Video:
https://aclanthology.org/2021.emnlp-main.408.mp4
Data:
CoNLL 2003, EBM-NLP, Evidence Inference, IMDb Movie Reviews