Mitigating Shortcut Learning via Smart Data Augmentation based on Large Language Model

Xinyi Sun, Hongye Tan, Yaxin Guo, Pengpeng Qiang, Ru Li, Hu Zhang


Abstract
Data-driven pre-trained language models are prone to shortcut learning, wherein they rely on spurious correlations between the data and the ground-truth labels. This reliance can undermine the robustness and generalization of the model. Data augmentation is a promising remedy: by integrating anti-shortcut data into the training set, shortcut-induced biases can be mitigated. However, existing methods face three challenges: 1) manually defined shortcuts are tailored to particular datasets, restricting generalization; 2) the inherent confirmation bias during model training hampers the effectiveness of data augmentation; 3) insufficient exploration of the relationship between model performance and the quantity of augmented data may result in excessive data consumption. To tackle these challenges, we propose Smart Data Augmentation based on Large Language Models (SAug-LLM), which leverages LLMs to autonomously identify shortcuts and generate their anti-shortcut counterparts. In addition, a dual validation is employed to mitigate confirmation bias during model retraining. Furthermore, the data augmentation process is optimized to effectively rectify model biases while minimizing data consumption. We validate the effectiveness and generalization of our method through extensive experiments across various natural language processing tasks, demonstrating an average performance improvement of 5.61%.
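The abstract describes a pipeline in which an LLM first names the spurious cue in a training example and then rewrites the example so that cue no longer co-occurs with the label. The sketch below is a minimal, hypothetical illustration of those two stages only, not the authors' implementation: the `llm` completion callable, the prompts, and the helper names (`identify_shortcut`, `generate_anti_shortcut`, `augment`) are all assumptions for illustration, and the paper's dual-validation and data-quantity-optimization stages are omitted.

```python
# Hypothetical sketch of the first two SAug-LLM stages: LLM-driven
# shortcut identification and anti-shortcut counterfactual generation.
# The `llm` callable, prompts, and function names are illustrative
# assumptions, not the authors' implementation.
from typing import Callable, List, Tuple

def identify_shortcut(llm: Callable[[str], str], text: str, label: str) -> str:
    """Ask the LLM which surface cue spuriously predicts the label."""
    prompt = (
        f"Input: {text}\nLabel: {label}\n"
        "Which word or phrase is a spurious cue for this label? "
        "Answer with the cue only."
    )
    return llm(prompt).strip()

def generate_anti_shortcut(llm: Callable[[str], str], text: str,
                           label: str, cue: str) -> str:
    """Ask the LLM to rewrite the input so the cue no longer co-occurs
    with the label, yielding an anti-shortcut counterpart."""
    prompt = (
        f"Rewrite the input so it keeps the label '{label}' but drops "
        f"the spurious cue '{cue}':\n{text}"
    )
    return llm(prompt).strip()

def augment(llm: Callable[[str], str],
            dataset: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return the original data plus one anti-shortcut counterpart per example."""
    extra = []
    for text, label in dataset:
        cue = identify_shortcut(llm, text, label)
        extra.append((generate_anti_shortcut(llm, text, label, cue), label))
    return dataset + extra

if __name__ == "__main__":
    # Stub LLM so the demo runs offline; swap in a real completion API.
    def fake_llm(prompt: str) -> str:
        return "a well-crafted film" if prompt.startswith("Rewrite") else "spielberg"

    print(augment(fake_llm, [("spielberg delivers again", "positive")]))
```

In this toy run, "spielberg" plays the role of a spurious cue correlated with the positive label; the augmented counterpart keeps the label while removing the cue, which is the property anti-shortcut examples need in order to break the correlation during retraining.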
Anthology ID: 2025.coling-main.543
Volume: Proceedings of the 31st International Conference on Computational Linguistics
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue: COLING
Publisher: Association for Computational Linguistics
Pages: 8160–8172
URL: https://aclanthology.org/2025.coling-main.543/
Cite (ACL): Xinyi Sun, Hongye Tan, Yaxin Guo, Pengpeng Qiang, Ru Li, and Hu Zhang. 2025. Mitigating Shortcut Learning via Smart Data Augmentation based on Large Language Model. In Proceedings of the 31st International Conference on Computational Linguistics, pages 8160–8172, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal): Mitigating Shortcut Learning via Smart Data Augmentation based on Large Language Model (Sun et al., COLING 2025)
PDF: https://aclanthology.org/2025.coling-main.543.pdf