Figurative language conveys stance, emotion, and social nuance, making its appropriate use essential in dialogue. While large language models (LLMs) often succeed in recognizing figurative expressions at the sentence level, their ability to use them coherently in conversation remains uncertain. We introduce FLUID QA, the first multilingual benchmark that evaluates figurative usage in dialogue across English, Korean, and Chinese. Each item embeds figurative choices into multi-turn contexts. To support interpretation, we include FLUTE-bi, a sentence-level diagnostic task. Results reveal a persistent gap: models that perform well on FLUTE-bi frequently fail on FLUID QA, especially in sarcasm and metaphor. These errors reflect systematic rhetorical confusion and limited discourse reasoning. FLUID QA provides a scalable framework for assessing usage-level figurative competence across languages.
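The benchmark's exact item schema is not given here; as a minimal sketch, assuming each FLUID QA item pairs a multi-turn dialogue context with candidate figurative continuations and a gold index, a usage-level evaluation loop with per-category reporting could look like the following (the `FluidItem` structure and the `predict` callable are illustrative, not the benchmark's actual format):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class FluidItem:
    """Hypothetical usage-level item: a dialogue plus candidate figurative replies."""
    dialogue: List[str]    # prior turns, speaker-alternating
    candidates: List[str]  # figurative continuations to choose from
    gold: int              # index of the contextually appropriate candidate
    category: str          # e.g., "sarcasm" or "metaphor"

def evaluate(items: List[FluidItem],
             predict: Callable[[List[str], List[str]], int]) -> Dict[str, float]:
    """Accuracy overall and per rhetorical category for any choice-making model."""
    correct = 0
    per_cat: Dict[str, tuple] = {}
    for item in items:
        hit = predict(item.dialogue, item.candidates) == item.gold
        correct += hit
        n_ok, n_all = per_cat.get(item.category, (0, 0))
        per_cat[item.category] = (n_ok + hit, n_all + 1)
    return {"overall": correct / len(items),
            **{cat: ok / n for cat, (ok, n) in per_cat.items()}}
```

Breaking accuracy down by rhetorical category is what would surface the sarcasm- and metaphor-specific failures described above.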
Although an increasing number of multilingual large language models (LLMs) have begun to support Korean, there remains a notable lack of benchmark datasets specifically designed to evaluate their proficiency in Korean cultural and linguistic understanding. A major reason for this gap is that many available Korean benchmarks are adapted from English originals via translation, which often fails to reflect the cultural context embedded in the Korean language. Even the few benchmarks built from native Korean data with cultural content typically focus on tasks such as bias or hate-speech detection, where cultural knowledge serves merely as topical background rather than as a core component of semantic understanding. To address this gap, we introduce the Korean Idiom Matching Benchmark (KIM Bench), which consists of 1,175 instances. Idioms are culture-specific and often untranslatable, making them ideal for testing models’ cross-cultural semantic understanding. Using KIM Bench, we evaluate global and Korean native models. Our analysis shows that larger and locally trained models better capture idiom semantics and cultural nuances, while chain-of-thought prompting may reduce accuracy. Models still struggle with deep semantic and contextual understanding. KIM Bench offers a compact tool for cross-cultural evaluation and insights into improving performance on culturally grounded tasks.
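The paper's actual prompt templates are not reproduced here; the sketch below, assuming a multiple-choice idiom-matching format, shows one way to construct direct versus chain-of-thought prompts so that the two conditions discussed above can be compared (the wording and option labels are illustrative):

```python
def build_kim_prompt(context: str, idiom_options: list,
                     chain_of_thought: bool = False) -> str:
    """Hypothetical prompt builder for a KIM Bench-style idiom-matching item."""
    options = "\n".join(f"({chr(65 + i)}) {opt}" for i, opt in enumerate(idiom_options))
    instruction = (
        "Choose the Korean idiom that best fits the context below.\n"
        f"Context: {context}\n{options}\n"
    )
    if chain_of_thought:
        # CoT condition: ask for reasoning first, then a final letter.
        return instruction + "Explain your reasoning step by step, then answer with a single letter."
    # Direct condition: answer only, with no intermediate reasoning.
    return instruction + "Answer with a single letter only."
```

Scoring model outputs under both settings on the same items is one straightforward way to reproduce the comparison between direct and chain-of-thought prompting.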
As LLMs are increasingly used in global conversational settings, concerns remain about their ability to handle complex sociocultural contexts. This study evaluates LLMs’ empathetic understanding in Korean—a high-context language—using a pragmatics-based Discourse Completion Task (DCT) focused on interpretive judgment rather than generation. We constructed a dataset varying relational hierarchy, intimacy, and emotional valence, and compared responses from proprietary and open-source LLMs to those of Korean speakers. Most LLMs showed over-empathizing tendencies and struggled with ambiguous relational cues. Neither model size nor Korean fine-tuning significantly improved performance. While humans reflected relational nuance and contextual awareness, LLMs relied on surface strategies. These findings underscore LLMs’ limits in socio-pragmatic reasoning and introduce a scalable, culturally flexible framework for evaluating socially aware AI.
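The DCT items themselves are not shown here; as a sketch under the assumption that each judged item records its social conditions plus the response strategy chosen by the model and by human raters, model–human agreement can be broken down per condition as follows (field names are illustrative):

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class DCTResponse:
    """One judged DCT item: social conditions plus the chosen response strategy."""
    hierarchy: str     # e.g., "senior", "peer", "junior"
    intimacy: str      # e.g., "close", "distant"
    valence: str       # e.g., "positive", "negative"
    human_choice: str  # strategy selected by Korean speakers
    model_choice: str  # strategy selected by the LLM

def agreement_by_condition(responses: List[DCTResponse]) -> Dict[Tuple[str, str, str], float]:
    """Share of items where the model matches the human choice, per condition."""
    buckets = defaultdict(lambda: [0, 0])
    for r in responses:
        key = (r.hierarchy, r.intimacy, r.valence)
        buckets[key][0] += r.model_choice == r.human_choice
        buckets[key][1] += 1
    return {key: ok / n for key, (ok, n) in buckets.items()}
```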
Large language models (LLMs) are pretrained to predict the next word, but scaling them up requires significant computing resources. Numerous big tech companies and research institutes have developed multilingual LLMs (MLLMs) to meet current demands, yet these models often overlook less-resourced languages (LRLs). This study proposes three strategies to enhance LRL performance on top of publicly available MLLMs. First, the MLLM vocabulary is expanded with LRL tokens to enhance expressiveness. Second, bilingual data are used for pretraining to align the high- and less-resourced languages. Third, a high-quality, small-scale instruction dataset is constructed and instruction tuning is performed to strengthen the LRL. The experiments employed the Llama2 model with Korean as the LRL, and the resulting model was quantitatively evaluated against other publicly developed LLMs across eight tasks. A qualitative assessment was also performed based on human evaluation and GPT-4. Experimental results show that the proposed Bllossom model outperforms previously proposed Korean monolingual models in qualitative analyses.
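The released Bllossom training code is not shown here; as a minimal sketch of the first strategy (vocabulary expansion) using the Hugging Face transformers API, assuming access to a Llama2 checkpoint and a list of Korean subword candidates, the tokenizer and embedding matrix could be extended as follows before the bilingual pretraining and instruction-tuning stages:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-2-7b-hf"  # assumes access to the gated checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Illustrative Korean tokens; in practice these would come from a
# subword vocabulary trained on a Korean corpus.
korean_tokens = ["안녕하세요", "감사합니다", "데이터"]
num_added = tokenizer.add_tokens(korean_tokens)

# Grow the embedding matrix so the new token ids get trainable vectors;
# these new rows are then learned during the bilingual pretraining stage.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```

The new embedding rows are randomly initialized, which is one reason the bilingual pretraining step is needed before the expanded vocabulary improves Korean expressiveness.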