HardEval: Focusing on Challenging Tokens to Assess Robustness of NER

Gabriel Bernier-Colborne, Phillippe Langlais


Abstract
To assess the robustness of NER systems, we propose an evaluation method that focuses on subsets of tokens that represent specific sources of errors: unknown words and label shift or ambiguity. These subsets provide a system-agnostic basis for evaluating specific sources of NER errors and assessing room for improvement in terms of robustness. We analyze these subsets of challenging tokens in two widely used NER benchmarks, then exploit them to evaluate NER systems in both in-domain and out-of-domain settings. Results show that these challenging tokens explain the majority of errors made by modern NER systems, although they represent only a small fraction of test tokens. They also indicate that label shift is harder to deal with than unknown words, and that there is much more room for improvement than the standard NER evaluation procedure would suggest. We hope this work will encourage NLP researchers to adopt rigorous and meaningful evaluation methods, and will help them develop more robust models.
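The idea of scoring a system only on challenging tokens can be illustrated with a minimal Python sketch. The abstract does not give the paper's exact subset definitions, so the ones used here are assumptions: a test token counts as "unknown" if it never occurs in the training data, and as "label shift" if it occurs in training but never with the label it carries in the test set. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def hard_token_subsets(train, test):
    """Find test token positions that fall into two 'hard' subsets.

    train, test: lists of sentences, each a list of (token, label) pairs.
    Assumed definitions (illustrative, not the paper's):
      unknown     - the token never occurs in the training data
      label_shift - the token occurs in training, but never with this label
    """
    # Record every label each training token was observed with.
    seen_labels = defaultdict(set)
    for sentence in train:
        for token, label in sentence:
            seen_labels[token].add(label)

    subsets = {"unknown": [], "label_shift": []}
    for i, sentence in enumerate(test):
        for j, (token, label) in enumerate(sentence):
            if token not in seen_labels:
                subsets["unknown"].append((i, j))
            elif label not in seen_labels[token]:
                subsets["label_shift"].append((i, j))
    return subsets


def subset_accuracy(test, predictions, positions):
    """Token-level accuracy restricted to the given (sentence, token) positions."""
    if not positions:
        return None  # nothing to score in this subset
    correct = sum(predictions[i][j] == test[i][j][1] for i, j in positions)
    return correct / len(positions)


# Toy data (hypothetical): "Paris" is a location in training, a person in test.
train = [[("Paris", "B-LOC"), ("is", "O"), ("nice", "O")]]
test = [[("Paris", "B-PER"), ("Hilton", "I-PER"), ("is", "O")]]
predictions = [["B-LOC", "I-PER", "O"]]  # output of some hypothetical tagger

subsets = hard_token_subsets(train, test)
unknown_acc = subset_accuracy(test, predictions, subsets["unknown"])
shift_acc = subset_accuracy(test, predictions, subsets["label_shift"])
```

Because the subsets are derived from the training and test data alone, the same positions can be reused to compare any number of systems, which is one way to read the abstract's "system-agnostic" claim. This sketch reports token-level accuracy for simplicity; the metric actually used in the paper may differ.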
Anthology ID:
2020.lrec-1.211
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
1704–1711
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.211
Cite (ACL):
Gabriel Bernier-Colborne and Phillippe Langlais. 2020. HardEval: Focusing on Challenging Tokens to Assess Robustness of NER. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 1704–1711, Marseille, France. European Language Resources Association.
Cite (Informal):
HardEval: Focusing on Challenging Tokens to Assess Robustness of NER (Bernier-Colborne & Langlais, LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.211.pdf