On Evaluation Protocols for Data Augmentation in a Limited Data Scenario

Frédéric Piedboeuf, Philippe Langlais


Abstract
Textual data augmentation (DA) is a prolific field of study in which novel techniques for creating artificial data are regularly proposed, and it has demonstrated great efficiency in small-data settings, at least for text classification tasks. In this paper, we challenge those results, showing that classical data augmentation (which modifies sentences) is simply a way of performing better fine-tuning, and that spending more time fine-tuning before applying data augmentation negates its effect. This is a significant contribution, as it answers several questions that were left open in recent years, namely: which DA technique performs best (all of them, as long as they generate data close enough to the training set so as not to impair training) and why DA showed positive results (it facilitates the training of the network). We further show that zero- and few-shot DA via conversational agents such as ChatGPT or Llama 2 can increase performance, confirming that this form of data augmentation is preferable to classical methods.
Anthology ID:
2025.coling-main.231
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
3428–3443
URL:
https://aclanthology.org/2025.coling-main.231/
Cite (ACL):
Frédéric Piedboeuf and Philippe Langlais. 2025. On Evaluation Protocols for Data Augmentation in a Limited Data Scenario. In Proceedings of the 31st International Conference on Computational Linguistics, pages 3428–3443, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
On Evaluation Protocols for Data Augmentation in a Limited Data Scenario (Piedboeuf & Langlais, COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.231.pdf