CleanCoNLL: A Nearly Noise-Free Named Entity Recognition Dataset

Susanna Rücker, Alan Akbik


Abstract
The CoNLL-03 corpus is arguably the best-known and most widely used benchmark dataset for named entity recognition (NER). However, prior work has found significant numbers of annotation errors, incompleteness, and inconsistencies in the data. This poses challenges to objectively comparing NER approaches and analyzing their errors, as current state-of-the-art models achieve F1-scores that are comparable to or even exceed the estimated noise level in CoNLL-03. To address this issue, we present a comprehensive relabeling effort assisted by automatic consistency checking that corrects 7.0% of all labels in the English CoNLL-03. Our effort adds a layer of entity linking annotation both for better explainability of NER labels and as an additional safeguard of annotation quality. Our experimental evaluation finds not only that state-of-the-art approaches reach significantly higher F1-scores (97.1%) on our data, but crucially that the share of correct predictions falsely counted as errors due to annotation noise drops from 47% to 6%. This indicates that our resource is well suited to analyze the remaining errors made by state-of-the-art models, and that the theoretical upper bound even on high-resource, coarse-grained NER is not yet reached. To facilitate such analysis, we make CleanCoNLL publicly available to the research community.
Anthology ID:
2023.emnlp-main.533
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
8628–8645
URL:
https://aclanthology.org/2023.emnlp-main.533
DOI:
10.18653/v1/2023.emnlp-main.533
Cite (ACL):
Susanna Rücker and Alan Akbik. 2023. CleanCoNLL: A Nearly Noise-Free Named Entity Recognition Dataset. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 8628–8645, Singapore. Association for Computational Linguistics.
Cite (Informal):
CleanCoNLL: A Nearly Noise-Free Named Entity Recognition Dataset (Rücker & Akbik, EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-main.533.pdf
Video:
https://aclanthology.org/2023.emnlp-main.533.mp4