Mitigating Shortcut Learning via Smart Data Augmentation based on Large Language Model
Xinyi Sun | Hongye Tan | Yaxin Guo | Pengpeng Qiang | Ru Li | Hu Zhang
Proceedings of the 31st International Conference on Computational Linguistics (2025)
Data-driven pre-trained language models are prone to shortcut learning, relying on spurious correlations between surface features of the data and the ground-truth labels. This reliance can undermine the robustness and generalization of the model. To address this issue, data augmentation emerges as a promising solution: by integrating anti-shortcut data into the training set, the models' shortcut-induced biases can be mitigated. However, existing methods face three challenges: 1) Manually defined shortcuts are tailored to particular datasets, which restricts generalization. 2) The inherent confirmation bias during model training hampers the effectiveness of data augmentation. 3) The relationship between model performance and the amount of augmented data is rarely examined, which can lead to excessive data consumption. To tackle these challenges, we propose Smart Data Augmentation based on Large Language Models (SAug-LLM). It leverages LLMs to autonomously identify shortcuts and generate anti-shortcut counterparts. In addition, dual validation is employed to mitigate confirmation bias during model retraining. Furthermore, the data augmentation process is optimized to effectively rectify model biases while minimizing data consumption. We validate the effectiveness and generalization of our method through extensive experiments across various natural language processing tasks, demonstrating an average performance improvement of 5.61%.
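To make the described pipeline concrete, the sketch below shows one plausible way such an augmentation loop could be organized: an LLM rewrites each training example into an anti-shortcut counterpart, a second LLM pass filters the rewrites (a stand-in for the paper's dual validation), and augmentation stops once held-out performance no longer improves, to limit data consumption. This is a minimal illustration under stated assumptions, not the authors' released implementation; the `llm` callable, the prompts, the validation check, and the stopping thresholds are all hypothetical.

```python
# Hypothetical sketch of an LLM-driven anti-shortcut augmentation loop.
# Function names, prompts, and thresholds are illustrative assumptions,
# not the SAug-LLM implementation described in the paper.

from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (text, label)

SHORTCUT_PROMPT = (
    "Identify a surface cue in this example that spuriously predicts the "
    "label, then rewrite the text so the cue no longer matches the label "
    "while the true label stays the same.\n\nText: {text}\nLabel: {label}"
)


def generate_anti_shortcut(llm: Callable[[str], str], example: Example) -> Example:
    """Ask the LLM to produce an anti-shortcut counterpart of one example."""
    text, label = example
    rewritten = llm(SHORTCUT_PROMPT.format(text=text, label=label))
    return rewritten, label


def dual_validate(llm: Callable[[str], str], candidate: Example) -> bool:
    """Keep a candidate only if a second LLM pass still assigns the
    original label (a simplified stand-in for dual validation)."""
    text, label = candidate
    predicted = llm(f"Label this text with one word: {text}")
    return predicted.strip().lower() == label.lower()


def augment_until_stable(
    llm: Callable[[str], str],
    train_set: List[Example],
    evaluate: Callable[[List[Example]], float],
    max_rounds: int = 5,
    min_gain: float = 0.005,
) -> List[Example]:
    """Add validated anti-shortcut data in rounds and stop once the
    held-out score stops improving, to avoid excessive data consumption."""
    best = evaluate(train_set)
    for _ in range(max_rounds):
        batch = [generate_anti_shortcut(llm, ex) for ex in train_set]
        batch = [ex for ex in batch if dual_validate(llm, ex)]
        candidate_set = train_set + batch
        score = evaluate(candidate_set)
        if score - best < min_gain:
            break
        train_set, best = candidate_set, score
    return train_set
```

In this sketch, `llm` is any text-in/text-out callable and `evaluate` retrains a classifier on the candidate set and returns its score on a held-out challenge set; the stopping rule ties the amount of generated data to measured performance gains rather than to a fixed augmentation budget.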