On the Impact of Data Augmentation on Downstream Performance in Natural Language Processing

Itsuki Okimura, Machel Reid, Makoto Kawano, Yutaka Matsuo


Abstract
Within the broader scope of machine learning, data augmentation is a common strategy to improve the generalization and robustness of models. While data augmentation has been widely used in computer vision, its use in NLP has been comparatively limited. The reason is that, within NLP, the impact of proposed data augmentation methods on performance has not been evaluated in a unified manner, so it is unclear which methods are effective. In this paper, we tackle this by evaluating the impact of 12 data augmentation methods on multiple datasets when finetuning pre-trained language models. We find minimal improvements when data sizes are constrained to a few thousand examples, and performance degradation when data size is increased. We also use various methods to quantify the strength of data augmentations, and find that these values, though weakly correlated with downstream performance, correlate negatively or positively depending on the task. Furthermore, we find a glaring lack of consistently performant data augmentations. All of this points to the difficulty of data augmentation for NLP tasks, and we are inclined to believe that static data augmentations are not broadly applicable given these properties.
Anthology ID:
2022.insights-1.12
Volume:
Proceedings of the Third Workshop on Insights from Negative Results in NLP
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Shabnam Tafreshi, João Sedoc, Anna Rogers, Aleksandr Drozd, Anna Rumshisky, Arjun Akula
Venue:
insights
Publisher:
Association for Computational Linguistics
Pages:
88–93
URL:
https://aclanthology.org/2022.insights-1.12
DOI:
10.18653/v1/2022.insights-1.12
Cite (ACL):
Itsuki Okimura, Machel Reid, Makoto Kawano, and Yutaka Matsuo. 2022. On the Impact of Data Augmentation on Downstream Performance in Natural Language Processing. In Proceedings of the Third Workshop on Insights from Negative Results in NLP, pages 88–93, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
On the Impact of Data Augmentation on Downstream Performance in Natural Language Processing (Okimura et al., insights 2022)
PDF:
https://aclanthology.org/2022.insights-1.12.pdf
Video:
https://aclanthology.org/2022.insights-1.12.mp4
Data
SST, SST-2