Ri-Sheng Huang
2025
LOBSTER: Linguistics Olympiad Benchmark for Structured Evaluation on Reasoning
Da-Chen Lian | Ri-Sheng Huang | Pin-Er Chen | Chunki Lim | You-Kuan Lin | Guan-Yu Tseng | Zhen-Yu Lin | Pin-Cheng Chen | Shu-Kai Hsieh
Proceedings of the 37th Conference on Computational Linguistics and Speech Processing (ROCLING 2025)
We propose the Linguistics Olympiad Benchmark for Structured Evaluation on Reasoning, or LOBSTER, a linguistically informed benchmark designed to evaluate large language models (LLMs) on complex linguistic puzzles from the International Linguistics Olympiad (IOL). Unlike prior benchmarks that focus solely on final-answer accuracy, our benchmark provides concrete evaluation protocols and rich typological metadata across over 90 low-resource and cross-cultural languages alongside the puzzles. Through systematic evaluations of state-of-the-art models on multilingual abilities, we demonstrate that LLMs struggle with low-resource languages, underscoring the need for such a benchmark. Experiments with various models on our benchmark show that IOL problems remain a challenging task for reasoning models, though there are ways to enhance performance; for example, iterative reasoning outperforms single-pass approaches in both final answers and explanations. Our benchmark offers a comprehensive foundation for advancing linguistically grounded, culturally informed, and cognitively plausible reasoning in LLMs.
Zero-Shot Evaluation of Conversational Language Competence in Data-Efficient LLMs Across English, Mandarin, and French
Sheng-Fu Wang | Ri-Sheng Huang | Shu-Kai Hsieh | Laurent Prévot
Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Large Language Models (LLMs) have achieved outstanding performance across various natural language processing tasks, including those from the Discourse and Dialogue traditions. However, these achievements typically rely on pretraining on huge datasets. In contrast, humans learn to speak and communicate through dialogue and spontaneous speech with only a fraction of that language exposure. This disparity has spurred interest in evaluating whether smaller, carefully selected and curated pretraining datasets can support robust performance on specific tasks. Drawing inspiration from the BabyLM initiative, we construct small (10M-token) pretraining datasets from different sources, including conversational transcripts and Wikipedia-style text. To assess the impact of these datasets, we develop evaluation benchmarks focusing on discourse and interactional markers, extracted from high-quality spoken corpora in English, French, and Mandarin. Employing a zero-shot classification framework inspired by the BLiMP benchmark, we design tasks wherein the model must determine, between a genuine utterance extracted from a corpus and its minimally altered counterpart, which one is the authentic instance. Our findings reveal that the nature of pretraining data significantly influences model performance on discourse-related tasks. Models pretrained on conversational data exhibit a clear advantage in handling discourse and interactional markers compared to those trained on written or encyclopedic text. Furthermore, models trained on a small amount of spontaneous speech transcripts perform comparably to standard LLMs.
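To illustrate the BLiMP-style zero-shot protocol described in the abstract, the sketch below scores a genuine utterance and a minimally altered counterpart with a causal language model and counts the pair as solved when the genuine utterance receives the higher total log-likelihood. This is an illustrative assumption about the evaluation setup, not the paper's actual code: the model name (gpt2 as a stand-in for the paper's small data-efficient models), the scoring function, and the example pair with a swapped discourse marker are all invented for demonstration.

# Minimal sketch of a BLiMP-style zero-shot minimal-pair evaluation.
# Assumptions: a Hugging Face causal LM stands in for the paper's
# data-efficient models; the example pair is invented for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; the paper pretrains its own 10M-token models
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def sentence_log_likelihood(text: str) -> float:
    """Total log-probability the model assigns to the given text."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood per predicted token;
    # multiply by the number of predicted tokens to recover the total.
    return -out.loss.item() * (ids.size(1) - 1)

def prefers_genuine(genuine: str, altered: str) -> bool:
    """Zero-shot decision: the higher-scoring sentence is taken as authentic."""
    return sentence_log_likelihood(genuine) > sentence_log_likelihood(altered)

# Invented minimal pair: the attested discourse marker vs. an implausible swap.
pair = ("Well, I guess we could try that.", "Therefore, I guess we could try that.")
print("correct" if prefers_genuine(*pair) else "incorrect")

Accuracy over a benchmark would then be the fraction of such pairs on which the model prefers the genuine utterance, which is how the comparison across pretraining sources (conversational vs. Wikipedia-style) would be read.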