Posthoc Verification and the Fallibility of the Ground Truth

Yifan Ding, Nicholas Botzer, Tim Weninger


Abstract
Classifiers commonly make use of pre-annotated datasets, wherein a model is evaluated by pre-defined metrics on a held-out test set typically made of human-annotated labels. Metrics used in these evaluations are tied to the availability of well-defined ground truth labels, and these metrics typically do not allow for inexact matches. These noisy ground truth labels and strict evaluation metrics may compromise the validity and realism of evaluation results. In the present work, we conduct a systematic label verification experiment on the entity linking (EL) task. Specifically, we ask annotators to verify the correctness of annotations after the fact (i.e., posthoc). Compared to pre-annotation evaluation, state-of-the-art EL models performed extremely well according to the posthoc evaluation methodology. Surprisingly, we find predictions from EL models had a similar or higher verification rate than the ground truth. We conclude with a discussion on these findings and recommendations for future evaluations. The source code, raw results, and evaluation scripts are publicly available via the MIT license at https://github.com/yifding/e2e_EL_evaluate.
Anthology ID:
2022.dadc-1.3
Volume:
Proceedings of the First Workshop on Dynamic Adversarial Data Collection
Month:
July
Year:
2022
Address:
Seattle, WA
Venue:
DADC
Publisher:
Association for Computational Linguistics
Pages:
23–29
URL:
https://aclanthology.org/2022.dadc-1.3
DOI:
10.18653/v1/2022.dadc-1.3
Cite (ACL):
Yifan Ding, Nicholas Botzer, and Tim Weninger. 2022. Posthoc Verification and the Fallibility of the Ground Truth. In Proceedings of the First Workshop on Dynamic Adversarial Data Collection, pages 23–29, Seattle, WA. Association for Computational Linguistics.
Cite (Informal):
Posthoc Verification and the Fallibility of the Ground Truth (Ding et al., DADC 2022)
PDF:
https://aclanthology.org/2022.dadc-1.3.pdf
Code:
yifding/e2e_EL_evaluate