I am a Strange Dataset: Metalinguistic Tests for Language Models

Tristan Thrush; Jared Moore; Miguel Monares; Christopher Potts; Douwe Kiela

doi:10.18653/v1/2024.acl-long.482

Abstract

Statements involving metalinguistic self-reference (“This paper has six sections.”) are prevalent in many domains. Can large language models (LLMs) handle such language? In this paper, we present “I am a Strange Dataset”, a new dataset for addressing this question. There are two subtasks: generation and verification. In generation, models continue statements like “The penultimate word in this sentence is” (where a correct continuation is “is”). In verification, models judge the truth of statements like “The penultimate word in this sentence is sentence.” (false). We also provide minimally different metalinguistic non-self-reference examples to complement the main dataset by probing for whether models can handle metalinguistic language at all. The dataset is hand-crafted by experts and validated by non-expert annotators. We test a variety of open-source LLMs (7B to 70B parameters) as well as closed-source LLMs through APIs. All models perform close to chance across both subtasks and even on the non-self-referential metalinguistic control data, though we find some steady improvement with model scale. GPT 4 is the only model to consistently do significantly better than chance, and it is still only in the 60% range, while our untrained human annotators score well in the 89-93% range. The dataset and evaluation toolkit are available at https://github.com/TristanThrush/i-am-a-strange-dataset
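The verification subtask described above can be illustrated with a minimal sketch. This is not the dataset's evaluation toolkit: the helper names `split_words` and `penultimate_claim_holds` are hypothetical, and the tokenization rule (split on whitespace, strip surrounding punctuation) is an assumption made only for this example. It checks whether a statement of the form "The penultimate word in this sentence is X" is true of itself.

```python
import string


def split_words(sentence: str) -> list[str]:
    """Split on whitespace and strip surrounding punctuation (an assumed rule)."""
    stripped = (token.strip(string.punctuation) for token in sentence.split())
    return [word for word in stripped if word]


def penultimate_claim_holds(sentence: str) -> bool:
    """Return True if the last word names the sentence's own penultimate word.

    Assumes the sentence has the form
    "The penultimate word in this sentence is X", so X is the final word.
    """
    words = split_words(sentence)
    if len(words) < 2:
        raise ValueError("sentence needs at least two words")
    return words[-2].lower() == words[-1].lower()


prefix = "The penultimate word in this sentence is"

# Generation example from the abstract: continuing with "is" makes it true.
print(penultimate_claim_holds(prefix + " is."))

# Verification example from the abstract: "sentence" makes it false.
print(penultimate_claim_holds(prefix + " sentence."))
```

The point of the sketch is that the truth value depends on the full sentence, including the continuation itself, which is what makes the statements self-referential.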

Anthology ID: 2024.acl-long.482
Volume: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 8888–8907
URL: https://aclanthology.org/2024.acl-long.482/
DOI: 10.18653/v1/2024.acl-long.482
Cite (ACL): Tristan Thrush, Jared Moore, Miguel Monares, Christopher Potts, and Douwe Kiela. 2024. I am a Strange Dataset: Metalinguistic Tests for Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8888–8907, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): I am a Strange Dataset: Metalinguistic Tests for Language Models (Thrush et al., ACL 2024)
PDF: https://aclanthology.org/2024.acl-long.482.pdf