Inherent Disagreements in Human Textual Inferences

Ellie Pavlick, Tom Kwiatkowski


Abstract
We analyze humans’ disagreements about the validity of natural language inferences. We show that, very often, disagreements are not dismissible as annotation “noise”, but rather persist as we collect more ratings and as we vary the amount of context provided to raters. We further show that the type of uncertainty captured by current state-of-the-art models for natural language inference is not reflective of the type of uncertainty present in human disagreements. We discuss implications of our results in relation to the recognizing textual entailment (RTE)/natural language inference (NLI) task. We argue for a refined evaluation objective that requires models to explicitly capture the full distribution of plausible human judgments.
Anthology ID:
Q19-1043
Volume:
Transactions of the Association for Computational Linguistics, Volume 7
Year:
2019
Address:
Cambridge, MA
Editors:
Lillian Lee, Mark Johnson, Brian Roark, Ani Nenkova
Venue:
TACL
Publisher:
MIT Press
Pages:
677–694
URL:
https://aclanthology.org/Q19-1043/
DOI:
10.1162/tacl_a_00293
Cite (ACL):
Ellie Pavlick and Tom Kwiatkowski. 2019. Inherent Disagreements in Human Textual Inferences. Transactions of the Association for Computational Linguistics, 7:677–694.
Cite (Informal):
Inherent Disagreements in Human Textual Inferences (Pavlick & Kwiatkowski, TACL 2019)
PDF:
https://aclanthology.org/Q19-1043.pdf