Clustering Examples in Multi-Dataset Benchmarks with Item Response Theory

Pedro Rodriguez, Phu Mon Htut, John Lalor, João Sedoc


Abstract
In natural language processing, multi-dataset benchmarks for common tasks (e.g., SuperGLUE for natural language inference and MRQA for question answering) have risen in importance. Invariably, tasks and individual examples vary in difficulty. Recent analysis methods infer properties of examples such as difficulty. In particular, Item Response Theory (IRT) jointly infers example and model properties from the output of benchmark tasks (i.e., scores for each model-example pair). Therefore, it seems sensible that methods like IRT should be able to detect differences between datasets in a task. This work shows that current IRT models are not as good at identifying differences as we would expect, explains why this is difficult, and outlines future directions that incorporate more (textual) signal from examples.
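The paper's own IRT formulation is not reproduced on this page, but the core idea the abstract describes can be sketched with the simplest IRT variant, the one-parameter (Rasch) model: each model j gets a latent ability θ_j, each example i a latent difficulty b_i, and the probability of a correct response depends only on their difference. The function and parameter names below are illustrative, not from the paper.

```python
import math

def irt_1pl(theta: float, b: float) -> float:
    """One-parameter (Rasch) IRT model: probability that a model with
    ability `theta` answers an example with difficulty `b` correctly.
    P(correct) = sigmoid(theta - b)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# When ability equals difficulty, the model is at chance (p = 0.5);
# a stronger model is more likely to get the same hard example right.
p_matched = irt_1pl(theta=1.0, b=1.0)
p_weak = irt_1pl(theta=0.0, b=1.0)
p_strong = irt_1pl(theta=2.0, b=1.0)
```

In practice the latent parameters are fit jointly from the full matrix of model-example scores; the clustering question the paper studies is whether the fitted example parameters separate by source dataset.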
Anthology ID:
2022.insights-1.14
Volume:
Proceedings of the Third Workshop on Insights from Negative Results in NLP
Month:
May
Year:
2022
Address:
Dublin, Ireland
Venues:
ACL | insights
Publisher:
Association for Computational Linguistics
Pages:
100–112
URL:
https://aclanthology.org/2022.insights-1.14
DOI:
10.18653/v1/2022.insights-1.14
Cite (ACL):
Pedro Rodriguez, Phu Mon Htut, John Lalor, and João Sedoc. 2022. Clustering Examples in Multi-Dataset Benchmarks with Item Response Theory. In Proceedings of the Third Workshop on Insights from Negative Results in NLP, pages 100–112, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
Clustering Examples in Multi-Dataset Benchmarks with Item Response Theory (Rodriguez et al., insights 2022)
PDF:
https://aclanthology.org/2022.insights-1.14.pdf
Data
DynaSent | MRQA | SST | SuperGLUE