Noise Learning for Text Classification: A Benchmark

Bo Liu, Wandi Xu, Yuejia Xiang, Xiaojun Wu, Lejian He, Bowen Zhang, Li Zhu


Abstract
Noise learning is important for text classification, which depends on massive amounts of labeled data that can be error-prone. However, we find that noise learning in text classification is relatively underdeveloped: (1) many methods that have been proven effective in the image domain have not been explored in text classification, and (2) it is difficult to compare previous studies fairly because they run experiments under different noise settings. In this work, we adapt four state-of-the-art noise learning methods from the image domain to text classification. Moreover, we conduct comprehensive experiments on our noise learning benchmark with seven commonly used methods, four datasets, and five noise modes. Additionally, most previous works rest on the implicit hypothesis that commonly used datasets such as TREC, Ag-News and Chnsenticorp contain no errors. However, these datasets in fact contain 0.61% to 15.77% noisy labels, which we define as intrinsic noise and which can cause inaccurate evaluation. Therefore, we build a new dataset, Golden-Chnsenticorp (G-Chnsenticorp), without intrinsic noise to compare the effects of different noise learning methods more accurately. To the best of our knowledge, this is the first benchmark of noise learning for text classification.
Anthology ID:
2022.coling-1.402
Volume:
Proceedings of the 29th International Conference on Computational Linguistics
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Editors:
Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond, Seung-Hoon Na
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
4557–4567
URL:
https://aclanthology.org/2022.coling-1.402
Cite (ACL):
Bo Liu, Wandi Xu, Yuejia Xiang, Xiaojun Wu, Lejian He, Bowen Zhang, and Li Zhu. 2022. Noise Learning for Text Classification: A Benchmark. In Proceedings of the 29th International Conference on Computational Linguistics, pages 4557–4567, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Cite (Informal):
Noise Learning for Text Classification: A Benchmark (Liu et al., COLING 2022)
PDF:
https://aclanthology.org/2022.coling-1.402.pdf
Data
AG News