Rethinking Data Augmentation in Text-to-text Paradigm

Yanan Chen, Yang Liu


Abstract
Because manually labelling data is costly, recent studies augment the training data to improve the generalization of machine learning models, an approach known as data augmentation (DA). With the rise of pre-trained language models (PLMs), some recent DA work synthesizes new samples by drawing on the knowledge acquired during PLM pre-training. Following this direction, we propose to integrate text-to-text language models into a new two-phase augmentation framework: 1) a fine-tuning phase, in which PLMs are adapted to downstream classification with the help of two novel schemes, and 2) a generation phase, in which the fine-tuned models are used to create new samples that improve performance. This paradigm opens up a new way of designing fine-tuning schemes that better serve DA, is easy to implement, and can be readily extended to other tasks. We evaluate our proposal on two public classification datasets and demonstrate its effectiveness with remarkable gains.
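The two-phase idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the two fine-tuning schemes, so the prompt formats (`classify:` and `generate <label>:`), the helper names `to_text_pairs` and `augment`, and the stub generator are all hypothetical. The sketch only shows how labelled examples might be cast as text-to-text pairs for fine-tuning, and how a fine-tuned generator might then be queried for new labelled samples.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (text, label)


def to_text_pairs(examples: List[Example]) -> List[Tuple[str, str]]:
    """Phase 1 data prep (assumed format): cast each labelled example
    into (source, target) string pairs for a text-to-text model."""
    pairs = []
    for text, label in examples:
        pairs.append((f"classify: {text}", label))  # text -> label
        pairs.append((f"generate {label}:", text))  # label -> text
    return pairs


def augment(
    examples: List[Example],
    generate: Callable[[str], str],
    per_label: int,
) -> List[Example]:
    """Phase 2 (assumed procedure): prompt a fine-tuned generator once per
    attempt and keep outputs that are non-empty and not already seen."""
    labels = sorted({label for _, label in examples})
    seen = {text for text, _ in examples}
    new_examples = []
    for label in labels:
        for _ in range(per_label):
            text = generate(f"generate {label}:").strip()
            if text and text not in seen:
                seen.add(text)
                new_examples.append((text, label))
    return new_examples


# Stand-in for a fine-tuned text-to-text model (hypothetical canned outputs).
def stub_generate(prompt: str) -> str:
    canned = {
        "generate negative:": "boring film",
        "generate positive:": "great movie",
    }
    return canned[prompt]


train = [("great movie", "positive"), ("dull plot", "negative")]
pairs = to_text_pairs(train)
synthetic = augment(train, stub_generate, per_label=2)
```

In practice `stub_generate` would be replaced by a call to the fine-tuned model, and the deduplication step could be extended with a label-consistency filter; both choices are assumptions of this sketch rather than details taken from the paper.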
Anthology ID:
2022.coling-1.99
Volume:
Proceedings of the 29th International Conference on Computational Linguistics
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
1157–1162
URL:
https://aclanthology.org/2022.coling-1.99
Cite (ACL):
Yanan Chen and Yang Liu. 2022. Rethinking Data Augmentation in Text-to-text Paradigm. In Proceedings of the 29th International Conference on Computational Linguistics, pages 1157–1162, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Cite (Informal):
Rethinking Data Augmentation in Text-to-text Paradigm (Chen & Liu, COLING 2022)
PDF:
https://aclanthology.org/2022.coling-1.99.pdf
Data:
AG News, C4, SST