ModeLing: A Novel Dataset for Testing Linguistic Reasoning in Language Models

Nathan Chi, Teodor Malchev, Riley Kong, Ryan Chi, Lucas Huang, Ethan Chi, R. McCoy, Dragomir Radev


Abstract
Large language models (LLMs) perform well on (at least) some evaluations of both few-shot multilingual adaptation and reasoning. However, evaluating the intersection of these two skills—multilingual few-shot reasoning—is difficult: even relatively low-resource languages can be found in large training corpora, raising the concern that when we intend to evaluate a model’s ability to generalize to a new language, that language may have in fact been present during the model’s training. If such language contamination has occurred, apparent cases of few-shot reasoning could actually be due to memorization. Towards understanding the capability of models to perform multilingual few-shot reasoning, we propose modeLing, a benchmark of Rosetta stone puzzles. This type of puzzle, originating from competitions called Linguistics Olympiads, contain a small number of sentences in a target language not previously known to the solver. Each sentence is translated to the solver’s language such that the provided sentence pairs uniquely specify a single most reasonable underlying set of rules; solving requires applying these rules to translate new expressions (Figure 1). modeLing languages are chosen to be extremely low-resource such that the risk of training data contamination is low, and unlike prior datasets, it consists entirely of problems written specifically for this work, as a further measure against data leakage. Empirically, we find evidence that popular LLMs do not have data leakage on our benchmark.
Anthology ID:
2024.sigtyp-1.14
Volume:
Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP
Month:
March
Year:
2024
Address:
St. Julian's, Malta
Editors:
Michael Hahn, Alexey Sorokin, Ritesh Kumar, Andreas Shcherbakov, Yulia Otmakhova, Jinrui Yang, Oleg Serikov, Priya Rani, Edoardo M. Ponti, Saliha Muradoğlu, Rena Gao, Ryan Cotterell, Ekaterina Vylomova
Venues:
SIGTYP | WS
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
113–119
Language:
URL:
https://aclanthology.org/2024.sigtyp-1.14
DOI:
Bibkey:
Cite (ACL):
Nathan Chi, Teodor Malchev, Riley Kong, Ryan Chi, Lucas Huang, Ethan Chi, R. McCoy, and Dragomir Radev. 2024. ModeLing: A Novel Dataset for Testing Linguistic Reasoning in Language Models. In Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, pages 113–119, St. Julian's, Malta. Association for Computational Linguistics.
Cite (Informal):
ModeLing: A Novel Dataset for Testing Linguistic Reasoning in Language Models (Chi et al., SIGTYP-WS 2024)
Copy Citation:
PDF:
https://aclanthology.org/2024.sigtyp-1.14.pdf