If You Build Your Own NER Scorer, Non-replicable Results Will Come

Constantine Lignos, Marjan Kamyab


Abstract
We attempt to replicate a named entity recognition (NER) model implemented in a popular toolkit and discover that a critical barrier to doing so is the inconsistent evaluation of improper label sequences. We define these sequences and examine how two scorers differ in their handling of them, finding that one approach produces F1 scores approximately 0.5 points higher on the CoNLL 2003 English development and test sets. We propose best practices to increase the replicability of NER evaluations by increasing transparency regarding the handling of improper label sequences.
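To make the abstract's point concrete, here is a minimal sketch (not the paper's actual scorers) of how two BIO span decoders can diverge on an improper label sequence such as I-ORG with no preceding B-ORG: a lenient, conlleval-style decoder repairs the stray I- tag into a new entity, while a strict decoder discards it, so the two scorers count different spans and report different F1.

```python
# Illustrative only: two plausible ways an NER scorer might decode BIO tags.
# Neither is claimed to match any specific toolkit's implementation.

def spans_lenient(tags):
    """Repair a stray I-X by treating it as the start of a new entity."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes any open span
        prefix, _, etype = tag.partition("-")
        # Close the open span on O, on B, or on an I of a different type.
        if prefix in ("B", "O") or (prefix == "I" and etype != label):
            if start is not None:
                spans.append((label, start, i))
                start, label = None, None
        # Open a span on B, or on an I with no span to continue (the repair).
        if prefix == "B" or (prefix == "I" and start is None):
            start, label = i, etype
    return spans

def spans_strict(tags):
    """Discard I-X tags that do not legally continue an open span."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):
        prefix, _, etype = tag.partition("-")
        if start is not None and (prefix != "I" or etype != label):
            spans.append((label, start, i))
            start, label = None, None
        if prefix == "B":  # only B- may open a span; stray I- is ignored
            start, label = i, etype
    return spans

tags = ["O", "I-ORG", "I-ORG", "O"]  # improper: entity begins with I-
print(spans_lenient(tags))  # [('ORG', 1, 3)] — the sequence counts as an entity
print(spans_strict(tags))   # [] — the sequence is dropped entirely
```

On a proper sequence both decoders agree; only improper sequences like the one above drive them apart, which is exactly why undocumented handling of such sequences makes reported F1 scores hard to replicate.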
Anthology ID:
2020.insights-1.15
Volume:
Proceedings of the First Workshop on Insights from Negative Results in NLP
Month:
November
Year:
2020
Address:
Online
Venues:
EMNLP | insights
Publisher:
Association for Computational Linguistics
Pages:
94–99
URL:
https://aclanthology.org/2020.insights-1.15
DOI:
10.18653/v1/2020.insights-1.15
Cite (ACL):
Constantine Lignos and Marjan Kamyab. 2020. If You Build Your Own NER Scorer, Non-replicable Results Will Come. In Proceedings of the First Workshop on Insights from Negative Results in NLP, pages 94–99, Online. Association for Computational Linguistics.
Cite (Informal):
If You Build Your Own NER Scorer, Non-replicable Results Will Come (Lignos & Kamyab, insights 2020)
PDF:
https://aclanthology.org/2020.insights-1.15.pdf
Video:
https://slideslive.com/38940802
Data
CoNLL-2003