Kyuhee Kim
2026
Query-Following vs Context-Anchoring: How LLMs Handle Cross-Turn Language Switching
Kyuhee Kim | Chengheng Li Chen | Anna Sotnikova
Proceedings of the First Workshop on Multilingual Multicultural Evaluation
When multilingual users switch languages mid-conversation, how should LLMs respond? We extend MultiChallenge to evaluate cross-turn language switching, translating 182 multi-turn conversations into German, Chinese, Spanish, and Arabic. Across five frontier models, we observe asymmetric behavior: switching into a foreign language (EN→X) yields high query-language fidelity (89–99%), but switching back to English (X→EN) reveals divergent policies. GPT-5 follows the query language (>95%), while Claude Opus 4.5 and Command R+ maintain the established conversation language (<8%). Task accuracy remains stable across conditions regardless of language selection differences. A simple explicit system prompt shows limited effectiveness in modifying these defaults.
2025
Nunchi-Bench: Benchmarking Language Models on Cultural Reasoning with a Focus on Korean Superstition
Kyuhee Kim | Sangah Lee
Findings of the Association for Computational Linguistics: ACL 2025
As large language models (LLMs) become key advisors in various domains, their cultural sensitivity and reasoning skills are crucial in multicultural environments. We introduce Nunchi-Bench, a benchmark designed to evaluate LLMs’ cultural understanding, with a focus on Korean superstitions. The benchmark consists of 247 questions spanning 31 topics, assessing factual knowledge, culturally appropriate advice, and situational interpretation. We evaluate multilingual LLMs in both Korean and English to analyze their ability to reason about Korean cultural contexts and how language variations affect performance. To systematically assess cultural reasoning, we propose a novel verification strategy with customized scoring metrics that capture the extent to which models recognize cultural nuances and respond appropriately. Our findings highlight significant challenges in LLMs’ cultural reasoning. While models generally recognize factual information, they struggle to apply it in practical scenarios. Furthermore, explicit cultural framing enhances performance more effectively than relying solely on the language of the prompt. To support further research, we publicly release Nunchi-Bench alongside a leaderboard.