DoubleMix: Simple Interpolation-Based Data Augmentation for Text Classification

Hui Chen, Wei Han, Diyi Yang, Soujanya Poria


Abstract
This paper proposes a simple yet effective interpolation-based data augmentation approach, termed DoubleMix, to improve the robustness of models in text classification. DoubleMix first applies a few simple augmentation operations to generate several perturbed samples for each training sample, and then uses the perturbed data and the original data to carry out a two-step interpolation in the hidden space of neural models. Concretely, it first mixes the perturbed samples into a single synthetic sample and then mixes the original sample with that synthetic perturbed sample. DoubleMix enhances models’ robustness by learning the “shifted” features in hidden space. On six text classification benchmark datasets, our approach outperforms several popular text augmentation methods, including token-level, sentence-level, and hidden-level data augmentation techniques. Experiments in low-resource settings also show that our approach consistently improves models’ performance when training data is scarce. Extensive ablation studies and case studies confirm that each component of our approach contributes to the final performance and show that our approach exhibits superior performance on challenging counterexamples. Additionally, visual analysis shows that text features generated by our approach are highly interpretable.
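The two-step interpolation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `mix_perturbed` and `double_mix`, the `weights` argument, and the coefficient `lam` are hypothetical, and the abstract does not say how the mixing coefficients are chosen or at which hidden layer the mixing happens. Hidden representations are modeled here as plain lists of floats.

```python
from typing import List, Sequence

Vector = List[float]


def mix_perturbed(perturbed: Sequence[Vector], weights: Sequence[float]) -> Vector:
    """Step 1 (sketch): combine the perturbed hidden vectors into one synthetic vector.

    `weights` is a hypothetical parameter; it is normalized to sum to 1 so the
    result is a convex combination of the perturbed vectors.
    """
    if not perturbed or len(perturbed) != len(weights):
        raise ValueError("need exactly one weight per perturbed sample")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    normalized = [w / total for w in weights]
    dim = len(perturbed[0])
    return [
        sum(w * vec[i] for w, vec in zip(normalized, perturbed))
        for i in range(dim)
    ]


def double_mix(
    original: Vector,
    perturbed: Sequence[Vector],
    weights: Sequence[float],
    lam: float,
) -> Vector:
    """Step 2 (sketch): interpolate the original vector with the synthetic one.

    `lam` is a hypothetical coefficient in [0, 1]; larger values keep the
    result closer to the original hidden vector.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    synthetic = mix_perturbed(perturbed, weights)
    return [lam * o + (1.0 - lam) * s for o, s in zip(original, synthetic)]


# Usage: one original hidden vector and two perturbed versions of it.
original = [1.0, 0.0]
perturbed = [[0.0, 2.0], [2.0, 0.0]]
mixed = double_mix(original, perturbed, weights=[1.0, 1.0], lam=0.75)
print(mixed)
```

In a real model the same arithmetic would be applied to tensors at a chosen hidden layer, and the coefficients would typically be sampled rather than fixed; both choices are left open here because the abstract does not specify them.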
Anthology ID:
2022.coling-1.409
Volume:
Proceedings of the 29th International Conference on Computational Linguistics
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
4622–4632
URL:
https://aclanthology.org/2022.coling-1.409
Cite (ACL):
Hui Chen, Wei Han, Diyi Yang, and Soujanya Poria. 2022. DoubleMix: Simple Interpolation-Based Data Augmentation for Text Classification. In Proceedings of the 29th International Conference on Computational Linguistics, pages 4622–4632, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Cite (Informal):
DoubleMix: Simple Interpolation-Based Data Augmentation for Text Classification (Chen et al., COLING 2022)
PDF:
https://aclanthology.org/2022.coling-1.409.pdf
Code:
declare-lab/doublemix
Data:
IMDb Movie Reviews, MultiNLI, SNLI, SST