2024
ModeLing: A Novel Dataset for Testing Linguistic Reasoning in Language Models
Nathan Chi | Teodor Malchev | Riley Kong | Ryan Chi | Lucas Huang | Ethan Chi | R. McCoy | Dragomir Radev
Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP
Large language models (LLMs) perform well on (at least) some evaluations of both few-shot multilingual adaptation and reasoning. However, evaluating the intersection of these two skills, multilingual few-shot reasoning, is difficult: even relatively low-resource languages can be found in large training corpora, raising the concern that when we intend to evaluate a model’s ability to generalize to a new language, that language may have in fact been present during the model’s training. If such language contamination has occurred, apparent cases of few-shot reasoning could actually be due to memorization. Towards understanding the capability of models to perform multilingual few-shot reasoning, we propose modeLing, a benchmark of Rosetta stone puzzles. This type of puzzle, originating from competitions called Linguistics Olympiads, contains a small number of sentences in a target language not previously known to the solver. Each sentence is translated to the solver’s language such that the provided sentence pairs uniquely specify a single most reasonable underlying set of rules; solving requires applying these rules to translate new expressions (Figure 1). modeLing languages are chosen to be extremely low-resource such that the risk of training data contamination is low, and unlike prior datasets, it consists entirely of problems written specifically for this work, as a further measure against data leakage. Empirically, we find evidence that popular LLMs do not have data leakage on our benchmark.
2023
Stanford MLab at SemEval 2023 Task 7: Neural Methods for Clinical Trial Report NLI
Conner Takehana | Dylan Lim | Emirhan Kurtulus | Ramya Iyer | Ellie Tanimura | Pankhuri Aggarwal | Molly Cantillon | Alfred Yu | Sarosh Khan | Nathan Chi
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)
We present a system for natural language inference in breast cancer clinical trial reports, as framed by SemEval 2023 Task 7: Multi-evidence Natural Language Inference for Clinical Trial Data. In particular, we propose a suite of techniques for two related inference subtasks: entailment and evidence retrieval. The purpose of the textual entailment identification subtask is to determine the inference relation (either entailment or contradiction) between given statement pairs, while the goal of the evidence retrieval task is to identify a set of sentences that support this inference relation. To this end, we propose fine-tuning Bio+Clinical BERT, a BERT-based model pre-trained on clinical data. Along with presenting our system, we analyze our architectural decisions in the context of our model’s accuracy and conduct an error analysis. Overall, our system ranked 20 / 30 on the entailment subtask.
2022
RNRE-NLP at SemEval-2022 Task 4: Patronizing and Condescending Language Detection
Rylan Yang | Ethan Chi | Nathan Chi
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)
An understanding of patronizing and condescending language detection is an important part of identifying and addressing discrimination and prejudice in various forms of communication. In this paper, we investigate several methods for detecting patronizing and condescending language in short statements as part of SemEval-2022 Task 4. For Task 1a, we investigate applying both lightweight (tree-based and linear) machine learning classification models and fine-tuned pre-trained large language models. Our final system achieves an F1-score of 0.4321, a recall-score of 0.5016, and a precision-score of 0.3795 (ranked 53 / 78) on Task 1a.
ISD at SemEval-2022 Task 6: Sarcasm Detection Using Lightweight Models
Samantha Huang | Ethan Chi | Nathan Chi
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)
A robust comprehension of sarcasm detection is critical for creating artificial systems that can effectively perform sentiment analysis in written text. In this work, we investigate AI approaches to identifying whether a text is sarcastic or not as part of SemEval-2022 Task 6. We focus on creating systems for Task A, where we experiment with lightweight statistical classification approaches trained on both GloVe features and manually-selected features. Additionally, we investigate fine-tuning the transformer model BERT. Our final system for Task A is an Extreme Gradient Boosting Classifier trained on manually-engineered features. Our final system achieved an F1-score of 0.2403 on Subtask A and was ranked 32 of 43.
2021
RedwoodNLP at SemEval-2021 Task 7: Ensembled Pretrained and Lightweight Models for Humor Detection
Nathan Chi | Ryan Chi
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)
An understanding of humor is an essential component of human-facing NLP systems. In this paper, we investigate several methods for detecting humor in short statements as part of SemEval-2021 Shared Task 7. For Task 1a, we apply an ensemble of fine-tuned pre-trained language models; for Tasks 1b, 1c, and 2a, we investigate various tree-based and linear machine learning models. Our final system achieves an F1-score of 0.9571 (ranked 24 / 58) on Task 1a, an RMSE of 0.5580 (ranked 18 / 50) on Task 1b, an F1-score of 0.5024 (ranked 26 / 36) on Task 1c, and an RMSE of 0.7229 (ranked 45 / 48) on Task 2a.