SelfMix: Robust Learning against Textual Label Noise with Self-Mixup Training

Dan Qiao, Chenchen Dai, Yuyang Ding, Juntao Li, Qiang Chen, Wenliang Chen, Min Zhang


Abstract
The conventional success of text classification relies on annotated data, and the new paradigm of pre-trained language models (PLMs) still requires some labeled data for downstream tasks. However, in real-world applications, label noise inevitably exists in training data, damaging the effectiveness, robustness, and generalization of models built on such data. Recently, remarkable progress has been made in mitigating this problem for visual data, whereas only a few studies address textual data. To fill this gap, we present SelfMix, a simple yet effective method for handling label noise in text classification tasks. SelfMix uses a Gaussian Mixture Model to separate clean samples from noisy ones and leverages semi-supervised learning. Unlike previous works that require multiple models, our method applies the dropout mechanism to a single model to reduce confirmation bias in self-training and introduces a textual-level mixup training strategy. Experimental results on three text classification benchmarks with different types of text show that our method outperforms strong baselines designed for both textual and visual data under different noise ratios and noise types. Our code is available at https://github.com/noise-learning/SelfMix.
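The two core ingredients named in the abstract, Gaussian-mixture sample separation and mixup, can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes that samples are separated by fitting a two-component mixture to per-sample training losses, and that mixup interpolates sentence representation vectors and their label distributions. The function names, the `alpha` value, and the `max(lam, 1 - lam)` convention are all illustrative assumptions.

```python
import math
import random


def _pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def clean_probabilities(losses, iters=50):
    """Fit a 1-D, two-component Gaussian mixture to per-sample losses with EM.

    Returns, for each loss, the posterior probability of the component with
    the smaller mean, treated here (by assumption) as the "clean" component.
    """
    lo, hi = min(losses), max(losses)
    means = [lo, hi]                              # initialise at the extremes
    spread = max((hi - lo) ** 2 / 4.0, 1e-6)
    variances = [spread, spread]
    weights = [0.5, 0.5]

    def e_step():
        resp = []
        for x in losses:
            p = [weights[k] * _pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(p) or 1e-300              # guard against underflow
            resp.append([pk / total for pk in p])
        return resp

    for _ in range(iters):
        resp = e_step()
        # M-step: re-estimate weights, means and variances from responsibilities.
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-12
            weights[k] = nk / len(losses)
            means[k] = sum(r[k] * x for r, x in zip(resp, losses)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, losses)) / nk
            variances[k] = max(var, 1e-6)         # variance floor for stability

    clean = 0 if means[0] <= means[1] else 1
    return [r[clean] for r in e_step()]


def mixup(x_i, x_j, y_i, y_j, alpha=0.75, rng=random):
    """Interpolate two feature vectors and their label distributions.

    lam is drawn from Beta(alpha, alpha); taking max(lam, 1 - lam) keeps the
    mixture closer to the first sample (an assumed convention).
    """
    lam = rng.betavariate(alpha, alpha)
    lam = max(lam, 1.0 - lam)

    def mix(a, b):
        return [lam * u + (1.0 - lam) * v for u, v in zip(a, b)]

    return mix(x_i, x_j), mix(y_i, y_j), lam


if __name__ == "__main__":
    # Hypothetical losses: five small (likely clean), four large (likely noisy).
    losses = [0.05, 0.10, 0.08, 0.12, 0.07, 2.1, 2.4, 1.9, 2.2]
    probs = clean_probabilities(losses)
    labeled = [i for i, p in enumerate(probs) if p > 0.5]
    print("indices kept as labeled:", labeled)
```

In a setup of this kind, samples with a high clean probability would keep their labels, while the rest would be treated as unlabeled data for the semi-supervised stage.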
Anthology ID:
2022.coling-1.80
Volume:
Proceedings of the 29th International Conference on Computational Linguistics
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
960–970
URL:
https://aclanthology.org/2022.coling-1.80
Cite (ACL):
Dan Qiao, Chenchen Dai, Yuyang Ding, Juntao Li, Qiang Chen, Wenliang Chen, and Min Zhang. 2022. SelfMix: Robust Learning against Textual Label Noise with Self-Mixup Training. In Proceedings of the 29th International Conference on Computational Linguistics, pages 960–970, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Cite (Informal):
SelfMix: Robust Learning against Textual Label Noise with Self-Mixup Training (Qiao et al., COLING 2022)
PDF:
https://aclanthology.org/2022.coling-1.80.pdf
Code
noise-learning/selfmix
Data
IMDb Movie Reviews