Denoising Labeled Data for Comment Moderation Using Active Learning

Andraž Pelicon; Vanja M. Karan; Ravi Shekhar; Matthew Purver; Senja Pollak

Abstract

Noisily labeled textual data is abundant on internet platforms that allow user-created content. Training models, such as offensive language detection models for comment moderation, on such data may prove difficult, as the noise in the labels prevents the model from converging. In this work, we propose to use active learning methods for denoising training data. The goal is to use active learning to sample the most informative examples with noisy labels and send them to the oracle for reannotation, thus reducing the overall cost of reannotation. In this setting, we tested three existing active learning methods, namely DBAL, Variance of Gradients (VoG) and BADGE. The proposed approach to data denoising is tested on the problem of offensive language detection. We observe that active learning can be effectively used for data denoising; however, care should be taken when choosing the algorithm for this purpose.

Anthology ID: 2024.lrec-main.413
Volume: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month: May
Year: 2024
Address: Torino, Italia
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues: LREC | COLING
Publisher: ELRA and ICCL
Pages: 4626–4633
URL: https://aclanthology.org/2024.lrec-main.413/
Cite (ACL): Andraž Pelicon, Vanja Mladen Karan, Ravi Shekhar, Matthew Purver, and Senja Pollak. 2024. Denoising Labeled Data for Comment Moderation Using Active Learning. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 4626–4633, Torino, Italia. ELRA and ICCL.
Cite (Informal): Denoising Labeled Data for Comment Moderation Using Active Learning (Pelicon et al., LREC-COLING 2024)
PDF: https://aclanthology.org/2024.lrec-main.413.pdf
