A German WSC dataset comparing coreference resolution by humans and machines

Wiebke Petersen, Katharina Spalek


Abstract
We present a novel German Winograd-style dataset for direct comparison of human and model behavior in coreference resolution. Each item was answered by ten participants, yielding accuracy, confidence ratings, and response times. Unlike classic WSC tasks, humans select among three pronouns rather than between two potential antecedents, increasing task difficulty. While majority-vote accuracy is high, individual responses reveal that not all items are trivial and that variability is obscured by aggregation. Pretrained language models evaluated without fine-tuning show clear performance gaps, yet their accuracy and confidence scores correlate notably with human data, mirroring certain patterns of human uncertainty and error. Dataset-specific limitations, including pragmatic reinterpretations and imbalanced pronoun distributions, highlight the importance of high-quality, balanced resources for advancing computational and cognitive models of coreference resolution.
Anthology ID:
2025.iwcs-main.10
Volume:
Proceedings of the 16th International Conference on Computational Semantics
Month:
September
Year:
2025
Address:
Düsseldorf, Germany
Editors:
Kilian Evang, Laura Kallmeyer, Sylvain Pogodalla
Venue:
IWCS
SIG:
SIGSEM
Publisher:
Association for Computational Linguistics
Pages:
110–117
URL:
https://aclanthology.org/2025.iwcs-main.10/
Cite (ACL):
Wiebke Petersen and Katharina Spalek. 2025. A German WSC dataset comparing coreference resolution by humans and machines. In Proceedings of the 16th International Conference on Computational Semantics, pages 110–117, Düsseldorf, Germany. Association for Computational Linguistics.
Cite (Informal):
A German WSC dataset comparing coreference resolution by humans and machines (Petersen & Spalek, IWCS 2025)
PDF:
https://aclanthology.org/2025.iwcs-main.10.pdf