I am a Strange Dataset: Metalinguistic Tests for Language Models

Tristan Thrush, Jared Moore, Miguel Monares, Christopher Potts, Douwe Kiela


Abstract
Statements involving metalinguistic self-reference (“This paper has six sections.”) are prevalent in many domains. Can large language models (LLMs) handle such language? In this paper, we present “I am a Strange Dataset”, a new dataset for addressing this question. There are two subtasks: generation and verification. In generation, models continue statements like “The penultimate word in this sentence is” (where a correct continuation is “is”). In verification, models judge the truth of statements like “The penultimate word in this sentence is sentence.” (false). We also provide minimally different metalinguistic non-self-reference examples to complement the main dataset by probing for whether models can handle metalinguistic language at all. The dataset is hand-crafted by experts and validated by non-expert annotators. We test a variety of open-source LLMs (7B to 70B parameters) as well as closed-source LLMs through APIs. All models perform close to chance across both subtasks and even on the non-self-referential metalinguistic control data, though we find some steady improvement with model scale. GPT 4 is the only model to consistently do significantly better than chance, and it is still only in the 60% range, while our untrained human annotators score well in the 89-93% range. The dataset and evaluation toolkit are available at https://github.com/TristanThrush/i-am-a-strange-dataset
Anthology ID:
2024.acl-long.482
Volume:
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
ACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
8888–8907
Language:
URL:
https://aclanthology.org/2024.acl-long.482
DOI:
Bibkey:
Cite (ACL):
Tristan Thrush, Jared Moore, Miguel Monares, Christopher Potts, and Douwe Kiela. 2024. I am a Strange Dataset: Metalinguistic Tests for Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8888–8907, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
I am a Strange Dataset: Metalinguistic Tests for Language Models (Thrush et al., ACL 2024)
Copy Citation:
PDF:
https://aclanthology.org/2024.acl-long.482.pdf