2024
ModeLing: A Novel Dataset for Testing Linguistic Reasoning in Language Models
Nathan Chi | Teodor Malchev | Riley Kong | Ryan Chi | Lucas Huang | Ethan Chi | R. Thomas McCoy | Dragomir Radev
Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP
Large language models (LLMs) perform well on (at least) some evaluations of both few-shot multilingual adaptation and reasoning. However, evaluating the intersection of these two skills—multilingual few-shot reasoning—is difficult: even relatively low-resource languages can be found in large training corpora, raising the concern that when we intend to evaluate a model’s ability to generalize to a new language, that language may have in fact been present during the model’s training. If such language contamination has occurred, apparent cases of few-shot reasoning could actually be due to memorization. Towards understanding the capability of models to perform multilingual few-shot reasoning, we propose modeLing, a benchmark of Rosetta stone puzzles. This type of puzzle, originating from competitions called Linguistics Olympiads, contains a small number of sentences in a target language not previously known to the solver. Each sentence is translated to the solver’s language such that the provided sentence pairs uniquely specify a single most reasonable underlying set of rules; solving requires applying these rules to translate new expressions (Figure 1). modeLing languages are chosen to be extremely low-resource such that the risk of training data contamination is low, and unlike prior datasets, modeLing consists entirely of problems written specifically for this work, as a further measure against data leakage. Empirically, we find evidence that popular LLMs do not have data leakage on our benchmark.
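As a rough illustration of the puzzle format the abstract describes, the sketch below encodes a toy Rosetta-stone item as Python data and formats it as a few-shot prompt. The mini-language, field names, and `build_prompt` helper are invented for this example; they are not taken from the modeLing benchmark itself.

```python
# Illustrative only: a toy Rosetta-stone-style puzzle. The "language" below is
# invented; modeLing's actual items use real extremely low-resource languages
# and are not reproduced here.
puzzle = {
    "pairs": [                      # known sentence pairs (target -> English)
        ("mita ka", "the dog sleeps"),
        ("mita nu", "the dog eats"),
        ("sora ka", "the bird sleeps"),
    ],
    "query": "sora nu",             # new expression the solver must translate
}

def build_prompt(puzzle: dict) -> str:
    """Format the sentence pairs and query as a few-shot translation prompt."""
    lines = [f"{src} = {tgt}" for src, tgt in puzzle["pairs"]]
    lines.append(f"{puzzle['query']} = ?")
    return "\n".join(lines)

print(build_prompt(puzzle))         # expected answer: "the bird eats"
```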
2022
Stanford MLab at SemEval 2022 Task 7: Tree- and Transformer-Based Methods for Clarification Plausibility
Thomas Yim | Junha Lee | Rishi Verma | Scott Hickmann | Annie Zhu | Camron Sallade | Ian Ng | Ryan Chi | Patrick Liu
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)
In this paper, we detail the methods we used to determine the idiomaticity and plausibility of inserting candidate words or phrases into an instructional text as part of SemEval-2022 Task 7: Identifying Plausible Clarifications of Implicit and Underspecified Phrases in Instructional Texts. Given a step in an instructional text with an implicit or underspecified slot, certain phrases fill that slot more plausibly than others. We explored several architectures, including tree-based methods over GloVe embeddings, ensembled BERT and ELECTRA models, and GPT-2-based infilling methods.
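A minimal sketch of one idea mentioned above: ranking candidate phrases by how plausibly they fill a slot, using GPT-2 language-model loss over the completed sentence. The instruction text, candidate phrases, and the choice of mean token loss as the score are illustrative assumptions, not the system submitted to the task.

```python
# Hedged sketch: GPT-2-based plausibility scoring of candidate fillers.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

step = "Whisk the eggs, then pour the mixture into ___ and bake for 20 minutes."
candidates = ["a greased pan", "the refrigerator", "a running blender"]

def plausibility(text: str) -> float:
    """Higher score = lower mean LM loss = more plausible sentence."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss   # mean cross-entropy over tokens
    return -loss.item()

scores = {c: plausibility(step.replace("___", c)) for c in candidates}
print(max(scores, key=scores.get))           # expected: "a greased pan"
```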
2021
RedwoodNLP at SemEval-2021 Task 7: Ensembled Pretrained and Lightweight Models for Humor Detection
Nathan Chi | Ryan Chi
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)
An understanding of humor is an essential component of human-facing NLP systems. In this paper, we investigate several methods for detecting humor in short statements as part of SemEval-2021 Shared Task 7. For Task 1a, we apply an ensemble of fine-tuned pre-trained language models; for Tasks 1b, 1c, and 2a, we investigate various tree-based and linear machine learning models. Our final system achieves an F1-score of 0.9571 (ranked 24/58) on Task 1a, an RMSE of 0.5580 (ranked 18/50) on Task 1b, an F1-score of 0.5024 (ranked 26/36) on Task 1c, and an RMSE of 0.7229 (ranked 45/48) on Task 2a.
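A minimal sketch of ensembling fine-tuned classifiers by averaging their predicted probabilities, in the spirit of the Task 1a system described above. The checkpoint names are placeholders and the actual fine-tuned weights are not reproduced, so this only illustrates the ensembling step, not the submitted system.

```python
# Hedged sketch: average softmax probabilities from several classifiers.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoints = ["bert-base-uncased", "roberta-base"]   # placeholder models
text = "I told my computer a joke about UDP, but I'm not sure it got it."

probs = []
for name in checkpoints:
    tok = AutoTokenizer.from_pretrained(name)
    clf = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2).eval()
    inputs = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = clf(**inputs).logits
    probs.append(torch.softmax(logits, dim=-1))

avg = torch.stack(probs).mean(dim=0)          # simple unweighted average
print("humorous" if avg[0, 1] > avg[0, 0] else "not humorous")
```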