Ethan Chi

2024

Large language models (LLMs) perform well on (at least) some evaluations of both few-shot multilingual adaptation and reasoning. However, evaluating the intersection of these two skills, multilingual few-shot reasoning, is difficult: even relatively low-resource languages can be found in large training corpora, raising the concern that when we intend to evaluate a model’s ability to generalize to a new language, that language may have in fact been present during the model’s training. If such language contamination has occurred, apparent cases of few-shot reasoning could actually be due to memorization. Towards understanding the capability of models to perform multilingual few-shot reasoning, we propose modeLing, a benchmark of Rosetta stone puzzles. This type of puzzle, originating from competitions called Linguistics Olympiads, contains a small number of sentences in a target language not previously known to the solver. Each sentence is translated into the solver’s language such that the provided sentence pairs uniquely specify a single most reasonable underlying set of rules; solving requires applying these rules to translate new expressions (Figure 1). The languages in modeLing are chosen to be extremely low-resource so that the risk of training data contamination is low, and, unlike prior datasets, the benchmark consists entirely of problems written specifically for this work, as a further measure against data leakage. Empirically, we find evidence that popular LLMs do not have data leakage on our benchmark.

2022

RNRE-NLP at SemEval-2022 Task 4: Patronizing and Condescending Language Detection
Rylan Yang | Ethan Chi | Nathan Chi
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

Detecting patronizing and condescending language is an important part of identifying and addressing discrimination and prejudice in various forms of communication. In this paper, we investigate several methods for detecting patronizing and condescending language in short statements as part of SemEval-2022 Task 4. For Task 1a, we investigate applying both lightweight (tree-based and linear) machine learning classification models and fine-tuned pre-trained large language models. Our final system achieves an F1-score of 0.4321, a recall of 0.5016, and a precision of 0.3795 (ranked 53 / 78) on Task 1a.
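As a quick consistency check on the reported scores, the sketch below recomputes F1 as the harmonic mean of precision and recall. It assumes the three figures describe the same class on the same test set; the function name `f1_score` is a hypothetical helper written for this illustration, not code from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values as reported in the abstract for Task 1a.
precision = 0.3795
recall = 0.5016

f1 = f1_score(precision, recall)
print(f"{f1:.4f}")  # prints 0.4321
```

The recomputed value agrees with the reported F1-score of 0.4321 to four decimal places.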

ISD at SemEval-2022 Task 6: Sarcasm Detection Using Lightweight Models
Samantha Huang | Ethan Chi | Nathan Chi
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

A robust comprehension of sarcasm detection is critical for creating artificial systems that can effectively perform sentiment analysis in written text. In this work, we investigate AI approaches to identifying whether a text is sarcastic or not as part of SemEval-2022 Task 6. We focus on creating systems for Task A, where we experiment with lightweight statistical classification approaches trained on both GloVe features and manually-selected features. Additionally, we investigate fine-tuning the transformer model BERT. Our final system for Task A is an Extreme Gradient Boosting Classifier trained on manually-engineered features. Our final system achieved an F1-score of 0.2403 on Subtask A and was ranked 32 of 43.
