A Methodology for the Comparison of Human Judgments With Metrics for Coreference Resolution

Mariya Borovikova, Loïc Grobol, Anaïs Halftermeyer, Sylvie Billot


Abstract
We propose a method for investigating the interpretability of metrics used for the coreference resolution task through comparisons with human judgments. We provide a corpus with annotations of different error types and human evaluations of their gravity. Our preliminary analysis shows that metrics considerably overlook several error types and, more generally, penalize errors less severely than humans do. This study is conducted on French texts, but the methodology is language-independent.
Anthology ID:
2022.humeval-1.2
Volume:
Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval)
Month:
May
Year:
2022
Address:
Dublin, Ireland
Venues:
ACL | HumEval
Publisher:
Association for Computational Linguistics
Pages:
16–23
URL:
https://aclanthology.org/2022.humeval-1.2
DOI:
10.18653/v1/2022.humeval-1.2
Cite (ACL):
Mariya Borovikova, Loïc Grobol, Anaïs Halftermeyer, and Sylvie Billot. 2022. A Methodology for the Comparison of Human Judgments With Metrics for Coreference Resolution. In Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval), pages 16–23, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
A Methodology for the Comparison of Human Judgments With Metrics for Coreference Resolution (Borovikova et al., HumEval 2022)
PDF:
https://aclanthology.org/2022.humeval-1.2.pdf
Data
CoNLL-2012