Riley Kong
2024
ModeLing: A Novel Dataset for Testing Linguistic Reasoning in Language Models
Nathan Chi | Teodor Malchev | Riley Kong | Ryan Chi | Lucas Huang | Ethan Chi | R. McCoy | Dragomir Radev
Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP
Large language models (LLMs) perform well on (at least) some evaluations of both few-shot multilingual adaptation and reasoning. However, evaluating the intersection of these two skills—multilingual few-shot reasoning—is difficult: even relatively low-resource languages can be found in large training corpora, raising the concern that when we intend to evaluate a model’s ability to generalize to a new language, that language may have in fact been present during the model’s training. If such language contamination has occurred, apparent cases of few-shot reasoning could actually be due to memorization. Towards understanding the capability of models to perform multilingual few-shot reasoning, we propose modeLing, a benchmark of Rosetta stone puzzles. This type of puzzle, originating from competitions called Linguistics Olympiads, contains a small number of sentences in a target language not previously known to the solver. Each sentence is translated to the solver’s language such that the provided sentence pairs uniquely specify a single most reasonable underlying set of rules; solving requires applying these rules to translate new expressions (Figure 1). modeLing languages are chosen to be extremely low-resource such that the risk of training data contamination is low, and unlike prior datasets, modeLing consists entirely of problems written specifically for this work, as a further measure against data leakage. Empirically, we find evidence that popular LLMs do not have data leakage on our benchmark.
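For concreteness, below is a minimal sketch of what a single Rosetta stone puzzle instance might look like once serialized for evaluation. The field names and the toy language data are illustrative assumptions, not the actual modeLing schema or any real modeLing puzzle.

```python
# Illustrative sketch only: field names and example data are assumptions,
# not the actual modeLing schema or contents.
puzzle = {
    "language": "hypothetical-target-language",
    # A handful of sentence pairs that uniquely determine the underlying rules.
    "examples": [
        {"target": "mika lo", "english": "the dog runs"},
        {"target": "mika su", "english": "the dog sleeps"},
        {"target": "tavi lo", "english": "the cat runs"},
    ],
    # New expressions the solver must translate by applying the inferred rules.
    "queries": [
        {"direction": "target->english", "text": "tavi su"},      # expected: "the cat sleeps"
        {"direction": "english->target", "text": "the cat runs"},  # expected: "tavi lo"
    ],
}

# A few-shot prompt could then be assembled from the example pairs plus one query.
prompt = "\n".join(f"{p['target']} = {p['english']}" for p in puzzle["examples"])
prompt += "\nTranslate: " + puzzle["queries"][0]["text"]
print(prompt)
```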
2022
FeTaQA: Free-form Table Question Answering
Linyong Nan | Chiachun Hsieh | Ziming Mao | Xi Victoria Lin | Neha Verma | Rui Zhang | Wojciech Kryściński | Hailey Schoelkopf | Riley Kong | Xiangru Tang | Mutethia Mutuma | Ben Rosand | Isabel Trindade | Renusree Bandaru | Jacob Cunningham | Caiming Xiong | Dragomir Radev
Transactions of the Association for Computational Linguistics, Volume 10
Existing table question answering datasets contain abundant factual questions that primarily evaluate a QA system’s comprehension of query and tabular data. However, restricted by their short-form answers, these datasets fail to include question–answer interactions that represent more advanced and naturally occurring information needs: questions that ask for reasoning and integration of information pieces retrieved from a structured knowledge source. To complement the existing datasets and to reveal the challenging nature of the table-based question answering task, we introduce FeTaQA, a new dataset with 10K Wikipedia-based {table, question, free-form answer, supporting table cells} pairs. FeTaQA is collected from noteworthy descriptions of Wikipedia tables that contain information people tend to seek; generation of these descriptions requires advanced processing that humans perform on a daily basis: understand the question and table, retrieve, integrate, infer, and conduct text planning and surface realization to generate an answer. We provide two benchmark methods for the proposed task: a pipeline method based on semantic parsing-based QA systems and an end-to-end method based on large pretrained text generation models, and show that FeTaQA poses a challenge for both methods.
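To illustrate the shape of these {table, question, free-form answer, supporting table cells} pairs, here is a small sketch of what one instance might look like. The field names, coordinate convention, and values are assumptions for illustration, not the dataset’s actual schema or contents.

```python
# Illustrative sketch only: field names and values are assumptions about the
# general shape of a FeTaQA-style instance, not the dataset's actual schema.
instance = {
    # Wikipedia-style table given as a header row plus data rows.
    "table": {
        "header": ["Year", "Film", "Role"],
        "rows": [
            ["2019", "Example Film A", "Lead"],
            ["2021", "Example Film B", "Supporting"],
        ],
    },
    "question": "Which roles did the actor play in 2019 and 2021?",
    # Free-form answer that integrates several cells rather than a single-span value.
    "answer": "They played the lead role in Example Film A (2019) and a "
              "supporting role in Example Film B (2021).",
    # Coordinates of the supporting cells as (row_index, column_index) pairs.
    "supporting_cells": [(0, 0), (0, 2), (1, 0), (1, 2)],
}

print(instance["question"])
print(instance["answer"])
```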