@inproceedings{thrush-etal-2024-strange,
title = "{I} am a Strange Dataset: Metalinguistic Tests for Language Models",
author = "Thrush, Tristan and
Moore, Jared and
Monares, Miguel and
Potts, Christopher and
Kiela, Douwe",
editor = "Ku, Lun-Wei and
Martins, Andre and
Srikumar, Vivek",
booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.luhme-long.482/",
doi = "10.18653/v1/2024.acl-long.482",
pages = "8888--8907",
abstract = "Statements involving metalinguistic self-reference ({\textquotedblleft}This paper has six sections.{\textquotedblright}) are prevalent in many domains. Can large language models (LLMs) handle such language? In this paper, we present {\textquotedblleft}I am a Strange Dataset{\textquotedblright}, a new dataset for addressing this question. There are two subtasks: generation and verification. In generation, models continue statements like {\textquotedblleft}The penultimate word in this sentence is{\textquotedblright} (where a correct continuation is {\textquotedblleft}is{\textquotedblright}). In verification, models judge the truth of statements like {\textquotedblleft}The penultimate word in this sentence is sentence.{\textquotedblright} (false). We also provide minimally different metalinguistic non-self-reference examples to complement the main dataset by probing for whether models can handle metalinguistic language at all. The dataset is hand-crafted by experts and validated by non-expert annotators. We test a variety of open-source LLMs (7B to 70B parameters) as well as closed-source LLMs through APIs. All models perform close to chance across both subtasks and even on the non-self-referential metalinguistic control data, though we find some steady improvement with model scale. GPT 4 is the only model to consistently do significantly better than chance, and it is still only in the 60{\%} range, while our untrained human annotators score well in the 89-93{\%} range. The dataset and evaluation toolkit are available at https://github.com/TristanThrush/i-am-a-strange-dataset"
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods ID="thrush-etal-2024-strange">
    <titleInfo>
      <title>I am a Strange Dataset: Metalinguistic Tests for Language Models</title>
    </titleInfo>
    <name type="personal">
      <namePart type="given">Tristan</namePart>
      <namePart type="family">Thrush</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Jared</namePart>
      <namePart type="family">Moore</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Miguel</namePart>
      <namePart type="family">Monares</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Christopher</namePart>
      <namePart type="family">Potts</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Douwe</namePart>
      <namePart type="family">Kiela</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <originInfo>
      <dateIssued>2024-08</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
      <titleInfo>
        <title>Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</title>
      </titleInfo>
      <name type="personal">
        <namePart type="given">Lun-Wei</namePart>
        <namePart type="family">Ku</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Andre</namePart>
        <namePart type="family">Martins</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Vivek</namePart>
        <namePart type="family">Srikumar</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <originInfo>
        <publisher>Association for Computational Linguistics</publisher>
        <place>
          <placeTerm type="text">Bangkok, Thailand</placeTerm>
        </place>
      </originInfo>
      <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>Statements involving metalinguistic self-reference (“This paper has six sections.”) are prevalent in many domains. Can large language models (LLMs) handle such language? In this paper, we present “I am a Strange Dataset”, a new dataset for addressing this question. There are two subtasks: generation and verification. In generation, models continue statements like “The penultimate word in this sentence is” (where a correct continuation is “is”). In verification, models judge the truth of statements like “The penultimate word in this sentence is sentence.” (false). We also provide minimally different metalinguistic non-self-reference examples to complement the main dataset by probing for whether models can handle metalinguistic language at all. The dataset is hand-crafted by experts and validated by non-expert annotators. We test a variety of open-source LLMs (7B to 70B parameters) as well as closed-source LLMs through APIs. All models perform close to chance across both subtasks and even on the non-self-referential metalinguistic control data, though we find some steady improvement with model scale. GPT 4 is the only model to consistently do significantly better than chance, and it is still only in the 60% range, while our untrained human annotators score well in the 89-93% range. The dataset and evaluation toolkit are available at https://github.com/TristanThrush/i-am-a-strange-dataset</abstract>
    <identifier type="citekey">thrush-etal-2024-strange</identifier>
    <identifier type="doi">10.18653/v1/2024.acl-long.482</identifier>
    <location>
      <url>https://aclanthology.org/2024.luhme-long.482/</url>
    </location>
    <part>
      <date>2024-08</date>
      <extent unit="page">
        <start>8888</start>
        <end>8907</end>
      </extent>
    </part>
  </mods>
</modsCollection>
%0 Conference Proceedings
%T I am a Strange Dataset: Metalinguistic Tests for Language Models
%A Thrush, Tristan
%A Moore, Jared
%A Monares, Miguel
%A Potts, Christopher
%A Kiela, Douwe
%Y Ku, Lun-Wei
%Y Martins, Andre
%Y Srikumar, Vivek
%S Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
%D 2024
%8 August
%I Association for Computational Linguistics
%C Bangkok, Thailand
%F thrush-etal-2024-strange
%X Statements involving metalinguistic self-reference (“This paper has six sections.”) are prevalent in many domains. Can large language models (LLMs) handle such language? In this paper, we present “I am a Strange Dataset”, a new dataset for addressing this question. There are two subtasks: generation and verification. In generation, models continue statements like “The penultimate word in this sentence is” (where a correct continuation is “is”). In verification, models judge the truth of statements like “The penultimate word in this sentence is sentence.” (false). We also provide minimally different metalinguistic non-self-reference examples to complement the main dataset by probing for whether models can handle metalinguistic language at all. The dataset is hand-crafted by experts and validated by non-expert annotators. We test a variety of open-source LLMs (7B to 70B parameters) as well as closed-source LLMs through APIs. All models perform close to chance across both subtasks and even on the non-self-referential metalinguistic control data, though we find some steady improvement with model scale. GPT 4 is the only model to consistently do significantly better than chance, and it is still only in the 60% range, while our untrained human annotators score well in the 89-93% range. The dataset and evaluation toolkit are available at https://github.com/TristanThrush/i-am-a-strange-dataset
%R 10.18653/v1/2024.acl-long.482
%U https://aclanthology.org/2024.luhme-long.482/
%U https://doi.org/10.18653/v1/2024.acl-long.482
%P 8888-8907
Markdown (Informal)
[I am a Strange Dataset: Metalinguistic Tests for Language Models](https://aclanthology.org/2024.luhme-long.482/) (Thrush et al., ACL 2024)
ACL
Tristan Thrush, Jared Moore, Miguel Monares, Christopher Potts, and Douwe Kiela. 2024. I am a Strange Dataset: Metalinguistic Tests for Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8888–8907, Bangkok, Thailand. Association for Computational Linguistics.
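
The abstract's two subtasks can be made concrete with a short, self-contained sketch. The Python snippet below is illustrative only and assumes nothing about the authors' evaluation toolkit at https://github.com/TristanThrush/i-am-a-strange-dataset; it simply reproduces the "penultimate word" example quoted in the abstract, checking the correct generation continuation ("is") and the false verification statement.

# Illustrative sketch, not the authors' code: ground-truth checks for the
# "penultimate word" example from the abstract. Only the statement texts
# and gold labels come from the abstract; the helper is hypothetical.

def penultimate_word(sentence: str) -> str:
    """Return the second-to-last word of a sentence, ignoring the final period."""
    words = sentence.rstrip(".").split()
    return words[-2]

# Generation subtask: the model continues the statement so that it becomes
# true of itself. Appending "is" makes "is" the penultimate word.
prompt = "The penultimate word in this sentence is"
continuation = "is"  # a correct continuation, per the abstract
completed = f"{prompt} {continuation}."
assert penultimate_word(completed) == continuation

# Verification subtask: the model judges whether a completed statement is
# true of itself. Here the penultimate word is "is", not "sentence",
# so the statement is false (the gold label given in the abstract).
claim = "The penultimate word in this sentence is sentence."
claimed_word = claim.rstrip(".").split()[-1]
print(penultimate_word(claim) == claimed_word)  # False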