Data Cleaning Tools for Token Classification Tasks

Karthik Muthuraman, Frederick Reiss, Hong Xu, Bryan Cutler, Zachary Eichenberger


Abstract
Human-in-the-loop systems for cleaning NLP training data rely on automated sieves to isolate potentially incorrect labels for manual review. We have developed a novel technique for flagging potentially incorrect labels with high sensitivity in named entity recognition corpora. We incorporated our sieve into an end-to-end system for cleaning NLP corpora, implemented as a modular collection of Jupyter notebooks built on extensions to the Pandas DataFrame library. We used this system to identify incorrect labels in the CoNLL-2003 corpus for English-language named entity recognition (NER), one of the most influential corpora for NER model research. Unlike previous work that only looked at a subset of the corpus’s validation fold, our automated sieve enabled us to examine the entire corpus in depth. Across the entire CoNLL-2003 corpus, we identified over 1300 incorrect labels (out of 35089 in the corpus). We have published our corrections, along with the code we used in our experiments. We are developing a repeatable version of the process we used on the CoNLL-2003 corpus as an open-source library.
Anthology ID:
2021.dash-1.10
Volume:
Proceedings of the Second Workshop on Data Science with Human in the Loop: Language Advances
Month:
June
Year:
2021
Address:
Online
Venues:
DaSH | NAACL
Publisher:
Association for Computational Linguistics
Pages:
59–61
URL:
https://aclanthology.org/2021.dash-1.10
DOI:
10.18653/v1/2021.dash-1.10
Cite (ACL):
Karthik Muthuraman, Frederick Reiss, Hong Xu, Bryan Cutler, and Zachary Eichenberger. 2021. Data Cleaning Tools for Token Classification Tasks. In Proceedings of the Second Workshop on Data Science with Human in the Loop: Language Advances, pages 59–61, Online. Association for Computational Linguistics.
Cite (Informal):
Data Cleaning Tools for Token Classification Tasks (Muthuraman et al., DaSH 2021)
PDF:
https://aclanthology.org/2021.dash-1.10.pdf
Data
CoNLL-2003