Hansaem Kim

2025

pdf bib abs
AI Knows Where You Are: Exposure, Bias, and Inference in Multimodal Geolocation with KoreaGEO
Xiaonan Wang | Bo Shao | Hansaem Kim
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Recent advances in vision-language models (VLMs) have enabled accurate image-based geolocation, raising serious concerns about location privacy risks in everyday social media posts. Yet, a systematic evaluation of such risks is still lacking: existing benchmarks show coarse granularity, linguistic bias, and a neglect of multimodal privacy risks. To address these gaps, we introduce KoreaGEO, the first fine-grained, multimodal, and privacy-aware benchmark for geolocation, built on Korean street views. The benchmark covers four socio-spatial clusters and nine place types with rich contextual annotations and two captioning styles that simulate real-world privacy exposure. To evaluate mainstream VLMs, we design a three-path protocol spanning image-only, functional-caption, and high-risk-caption inputs, enabling systematic analysis of localization accuracy, spatial bias, and reasoning behavior. Results show that input modality exerts a stronger influence on localization precision and privacy exposure than model scale or architecture, with high-risk captions substantially boosting accuracy. Moreover, they highlight structural prediction biases toward core cities.

pdf bib abs
FLUID QA: A Multilingual Benchmark for Figurative Language Usage in Dialogue across English, Chinese, and Korean
Seoyoon Park | Hyeji Choi | Minseon Kim | Subin An | Xiaonan Wang | Gyuri Choi | Hansaem Kim
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Figurative language conveys stance, emotion, and social nuance, making its appropriate use essential in dialogue. While large language models (LLMs) often succeed in recognizing figurative expressions at the sentence level, their ability to use them coherently in conversation remains uncertain. We introduce FLUID QA, the first multilingual benchmark that evaluates figurative usage in dialogue across English, Korean, and Chinese. Each item embeds figurative choices into multi-turn contexts. To support interpretation, we include FLUTE-bi, a sentence-level diagnostic task. Results reveal a persistent gap: models that perform well on FLUTE-bi frequently fail on FLUID QA, especially in sarcasm and metaphor. These errors reflect systematic rhetorical confusion and limited discourse reasoning. FLUID QA provides a scalable framework for assessing usage-level figurative competence across languages.

pdf bib abs
Automated Claim–Evidence Extraction for Political Discourse Analysis: A Large Language Model Approach to Rodong Sinmun Editorials
Gyuri Choi | Hansaem Kim
Proceedings of the Eighth Fact Extraction and VERification Workshop (FEVER)

This study investigates the feasibility of automating political discourse analysis using large language models (LLMs), with a focus on 87 editorials from Rodong Sinmun, North Korea’s official newspaper. We introduce a structured analytical framework that integrates Chain-of-Thought prompting for claim–evidence extraction and a GPT-4o–based automated evaluation system (G-Eval). Experimental results demonstrate that LLMs possess emerging discourse-level reasoning capabilities, showing notably improved alignment with expert analyses under one-shot prompting conditions. However, the models often reproduced ideological rhetoric uncritically or generated interpretive hallucinations, highlighting the risks of fully automated analysis. To address these issues, we propose a Hybrid Human-in-the-Loop evaluation framework that combines expert judgment with automated scoring. This study presents a novel approach to analyzing politically sensitive texts and offers empirical insights into the quantitative assessment of ideological discourse, underscoring the scalability and potential of automation-driven methodologies.

The existing assessments of planning capabilities of large language models (LLMs) remain largely limited to single-language or specific representation formats. To address this gap, we introduce the Multi-Plan benchmark comprising 204 multilingual and multi-format travel planning scenarios. In experimental results obtained with state-of-the-art LLMs, the Multi-Plan benchmark effectively highlights the performance disparities among models, notably showing superior results for reasoning-specialized models. Interestingly, language differences exhibited minimal impact, whereas mathematically structured representations significantly improved planning accuracy for most models, underscoring the crucial role of the input format. These findings enhance our understanding of planning abilities of LLMs, offer valuable insights for future research, and emphasize the need for more sophisticated AI evaluation methods. This dataset is publicly available at http://huggingface.co/datasets/Bllossom/Multi-Plan.

pdf bib abs
Too Polite to be Human: Evaluating LLM Empathy in Korean Conversations via a DCT-Based Framework
Seoyoon Park | Jaehee Kim | Hansaem Kim
Proceedings of the Third Workshop on Social Influence in Conversations (SICon 2025)

As LLMs are increasingly used in global conversational settings, concerns remain about their ability to handle complex sociocultural contexts. This study evaluates LLMs’ empathetic understanding in Korean—a high-context language—using a pragmatics-based Discourse Completion Task (DCT) focused on interpretive judgment rather than generation. We constructed a dataset varying relational hierarchy, intimacy, and emotional valence, and compared responses from proprietary and open-source LLMs to those of Korean speakers. Most LLMs showed over-empathizing tendencies and struggled with ambiguous relational cues. Neither model size nor Korean fine-tuning significantly improved performance. While humans reflected relational nuance and contextual awareness, LLMs relied on surface strategies. These findings underscore LLMs’ limits in socio-pragmatic reasoning and introduce a scalable, culturally flexible framework for evaluating socially-aware AI.

2024

Large language models (LLMs) use pretraining to predict the subsequent word; however, their expansion requires significant computing resources. Numerous big tech companies and research institutes have developed multilingual LLMs (MLLMs) to meet current demands, overlooking less-resourced languages (LRLs). This study proposed three strategies to enhance the performance of LRLs based on the publicly available MLLMs. First, the MLLM vocabularies of LRLs were expanded to enhance expressiveness. Second, bilingual data were used for pretraining to align the high- and less-resourced languages. Third, a high-quality small-scale instruction dataset was constructed and instruction-tuning was performed to augment the LRL. The experiments employed the Llama2 model and Korean was used as the LRL, which was quantitatively evaluated against other developed LLMs across eight tasks. Furthermore, a qualitative assessment was performed based on human evaluation and GPT4. Experimental results showed that our proposed Bllossom model exhibited superior performance in qualitative analyses compared to previously proposed Korean monolingual models.

pdf bib
Are large language models affected by politeness? Focusing on request speech acts in Korean
Gayeon Jung | Joeun Kang | Fei Li | Hansaem Kim
Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation

pdf bib
KULTURE Bench: A Benchmark for Assessing Language Model in Korean Cultural Context
Xiaonan Wang | Jinyoung Yeo | Joon-Ho Lim | Hansaem Kim
Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation

2022

2020

pdf bib abs
Building Korean Abstract Meaning Representation Corpus
Hyonsu Choe | Jiyoon Han | Hyejin Park | Tae Hwan Oh | Hansaem Kim
Proceedings of the Second International Workshop on Designing Meaning Representations

To explore the potential sembanking in Korean and ways to represent the meaning of Korean sentences, this paper reports on the process of applying Abstract Meaning Representation to Korean, a semantic representation framework that has been studied in wide range of languages, and its output: the Korean AMR corpus. The corpus which is constructed so far is a size of 1,253 sentences and its raw texts are from ExoBrain Corpus, a state-led R&D project on language AI. This paper also analyzes the result in both qualitative and quantitative manners, proposing discussions for further development.

pdf bib abs
Analysis of the Penn Korean Universal Dependency Treebank (PKT-UD): Manual Revision to Build Robust Parsing Model in Korean
Tae Hwan Oh | Ji Yoon Han | Hyonsu Choe | Seokwon Park | Han He | Jinho D. Choi | Na-Rae Han | Jena D. Hwang | Hansaem Kim
Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies

In this paper, we first open on important issues regarding the Penn Korean Universal Treebank (PKT-UD) and address these issues by revising the entire corpus manually with the aim of producing cleaner UD annotations that are more faithful to Korean grammar. For compatibility to the rest of UD corpora, we follow the UDv2 guidelines, and extensively revise the part-of-speech tags and the dependency relations to reflect morphological features and flexible word- order aspects in Korean. The original and the revised versions of PKT-UD are experimented with transformer-based parsing models using biaffine attention. The parsing model trained on the revised corpus shows a significant improvement of 3.0% in labeled attachment score over the model trained on the previous corpus. Our error analysis demonstrates that this revision allows the parsing model to learn relations more robustly, reducing several critical errors that used to be made by the previous model.

Using current methods, the construction of multilingual resources in FrameNet is an expensive and complex task. While crowdsourcing is a viable alternative, it is difficult to include non-native English speakers in such efforts as they often have difficulty with English-based FrameNet tools. In this work, we investigated cross-lingual issues in crowdsourcing approaches for multilingual FrameNets, specifically in the context of the newly constructed Korean FrameNet. To accomplish this, we evaluated the effectiveness of various crowdsourcing settings whereby certain types of information are provided to workers, such as English definitions in FrameNet or translated definitions. We then evaluated whether the crowdsourced results accurately captured the meaning of frames both cross-culturally and cross-linguistically, and found that by allowing the crowd workers to make intuitive choices, they achieved a quality comparable to that of trained FrameNet experts (F1 > 0.75). The outcomes of this work are now publicly available as a new release of Korean FrameNet 1.1.

pdf bib abs
Annotation Issues in Universal Dependencies for Korean and Japanese
Ji Yoon Han | Tae Hwan Oh | Lee Jin | Hansaem Kim
Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020)

To investigate issues that arise in the process of developing a Universal Dependency (UD) treebank for Korean and Japanese, we begin by addressing the typological characteristics of Korean and Japanese. Both Korean and Japanese are agglutinative and head-final languages. And the principle of word segmentation for both languages is different from English, which makes it difficult to apply UD guidelines. Following the typological characteristics of the two languages and the issue of UD application, we review the application of UPOS and DEPREL schemes to the two languages. The annotation principles for AUX, ADJ, DET, ADP and PART are discussed for the UPOS scheme, and the annotation principles for case, aux, iobj, and obl are discussed for the DEPREL scheme.

2019

pdf bib abs
Copula and Case-Stacking Annotations for Korean AMR
Hyonsu Choe | Jiyoon Han | Hyejin Park | Hansaem Kim
Proceedings of the First International Workshop on Designing Meaning Representations

This paper concerns the application of Abstract Meaning Representation (AMR) to Korean. In this regard, it focuses on the copula construction and its negation and the case-stacking phenomenon thereof. To illustrate this clearly, we reviewed the :domain annotation scheme from various perspectives. In this process, the existing annotation guidelines were improved to devise annotation schemes for each issue under the principle of pursuing consistency and efficiency of annotation without distorting the characteristics of Korean.

2018

pdf bib abs
Enhancing Universal Dependencies for Korean
Youngbin Noh | Jiyoon Han | Tae Hwan Oh | Hansaem Kim
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

In this paper, for the purpose of enhancing Universal Dependencies for the Korean language, we propose a modified method for mapping Korean Part-of-Speech(POS) tagset in relation to Universal Part-of-Speech (UPOS) tagset in order to enhance the Universal Dependencies for the Korean Language. Previous studies suggest that UPOS reflects several issues that influence dependency annotation by using the POS of Korean predicates, particularly the distinctiveness in using verb, adjective, and copula.