Jinjin Tian


UseClean: learning from complex noisy labels in named entity recognition
Jinjin Tian | Kun Zhou | Meiguo Wang | Yu Zhang | Benjamin Yao | Xiaohu Liu | Chenlei Guo
Proceedings of the 2023 CLASP Conference on Learning with Small Data (LSD)

We investigate and refine denoising methods for the NER task on data that potentially contains extremely noisy labels from multiple sources. In this paper, we first summarize all possible noise types and noise generation schemes, based on which we build a thorough evaluation system. We then pinpoint the bottleneck of current state-of-the-art denoising methods using our evaluation system. Correspondingly, we propose several refinements: a two-stage framework to avoid error accumulation; a novel confidence score that uses minimal clean supervision to increase predictive power; automatic cutoff fitting to avoid extensive hyper-parameter tuning; and a warm-started weighted partial CRF to better learn on the noisy tokens. Additionally, we propose using adaptive sampling to further boost performance in long-tailed entity settings. Our method improves the F1 score by at least 5 to 10% on average over the current state of the art across extensive experiments.