Won Ik Cho - ACL Anthology

Won Ik Cho

2025

Single Ground Truth Is Not Enough: Adding Flexibility to Aspect-Based Sentiment Analysis Evaluation
Soyoung Yang | Hojun Cho | Jiyoung Lee | Sohee Yoon | Edward Choi | Jaegul Choo | Won Ik Cho
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Aspect-based sentiment analysis (ABSA) is a challenging task of extracting sentiments along with their corresponding aspects and opinion terms from the text. The inherent subjectivity of span annotation creates variability in the surface forms of extracted terms, complicating the evaluation process. Traditional evaluation methods often constrain ground truths (GT) to a single term, potentially misrepresenting the accuracy of semantically valid predictions that differ in surface form. To address this limitation, we propose a novel and fully automated pipeline that expands existing evaluation sets by adding alternative valid terms for aspect and opinion. Our approach facilitates an equitable assessment of language models by accommodating multiple-answer candidates, resulting in enhanced human agreement compared to single-answer test sets (achieving up to a 10%p improvement in Kendall’s Tau score). Experimental results demonstrate that our expanded evaluation set helps uncover the capabilities of large language models (LLMs) in ABSA tasks, which are concealed by the single-answer GT sets. Consequently, our work contributes to the development of a flexible evaluation framework for ABSA by embracing diverse surface forms in span extraction tasks in a cost-effective and reproducible manner. Our code and dataset are open at https://github.com/dudrrm/zoom-in-n-out-absa.

Hermit Kingdom Through the Lens of Multiple Perspectives: A Case Study of LLM Hallucination on North Korea
Eunjung Cho | Won Ik Cho | Soomin Seo
Proceedings of the 31st International Conference on Computational Linguistics

Hallucination in large language models (LLMs) remains a significant challenge for their safe deployment, particularly due to its potential to spread misinformation. Most existing solutions address this challenge by focusing on aligning the models with credible sources or by improving how models communicate their confidence (or lack thereof) in their outputs. While these measures may be effective in most contexts, they may fall short in scenarios requiring more nuanced approaches, especially in situations where access to accurate data is limited or determining credible sources is challenging. In this study, we take North Korea - a country characterised by an extreme lack of reliable sources and the prevalence of sensationalist falsehoods - as a case study. We explore and evaluate how some of the best-performing multilingual LLMs and specific language-based models generate information about North Korea in three languages spoken in countries with significant geo-political interests: English (United States, United Kingdom), Korean (South Korea), and Mandarin Chinese (China). Our findings reveal significant differences, suggesting that the choice of model and language can lead to vastly different understandings of North Korea, which has important implications given the global security challenges the country poses.

AMAN: Agent for Mentoring and Assisting Newbies in MMORPG
Jeehyun Lee | Seung-Moo Yang | Won Ik Cho
Proceedings of the 31st International Conference on Computational Linguistics: Industry Track

In online games with diverse content and frequent updates, newcomers first learn gameplay mechanics through community intelligence but soon face challenges that require real-time guidance from senior gamers. To provide easy access to such support, we introduce AMAN, Agent for Mentoring and Assisting Newbies in MMORPG (Massively Multiplayer Online Role-Playing Game) - a companion chatbot designed to engage novice gamers. Our model functions as a human-like chat buddy that interacts with users in a friendly manner while providing substantive informational depth. In this light, we propose a multi-stage learning approach that incorporates continual pre-training with a sequence of online resources and instruction tuning on curated dialogues. To align with gamers’ specific needs, we first analyze user-oriented topics from online communities regarding a widely played MMORPG and construct a domain-specific dataset. Furthermore, we develop multi-turn dialogue data to foster dynamic conversations with users. The evaluation results with a model trained upon a publicly available language model show the practical applicability of our approach, illustrating how conversational assistants in online games can help novice gamers.

2024

RICoTA: Red-teaming of In-the-wild Conversation with Test Attempts
Eujeong Choi | Younghun Jeong | Soomin Kim | Won Ik Cho
Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation

Chamain: Harmonizing Character Persona Integrity with Domain-Adaptive Knowledge in Dialogue Generation
Seung-Moo Yang | Jeehyun Lee | Won Ik Cho
Proceedings of the 6th Workshop on NLP for Conversational AI (NLP4ConvAI 2024)

Recent advances in large language models (LLMs) have shown their capacity for generating natural dialogues, leveraging extensive pre-trained knowledge. However, the seamless integration of domain-specific knowledge into dialogue agents, without undermining their personas or unique textual style, remains a challenging task. Traditional approaches, such as constructing knowledge-aware character dialogue datasets or training LLMs from the ground up, require considerable resources. Sequentially fine-tuning character chatbots across multiple datasets or applying existing merging techniques often leads to catastrophic forgetting, resulting in the loss of both knowledge and the character’s distinct persona. This compromises the model’s ability to consistently generate character-driven dialogues within a user-centric framework. In this context, we introduce a novel model merging method, Chamain, which effortlessly enhances the performance of character models, much like finding a “free lunch”. Chamain merges domain-specific knowledge into a character model by parameter-wise weight combination of instruction-tuned models and learns to reflect persona’s unique characteristics and style through Layer-wise merging. Our experiments demonstrate that Chamain effectively maintains style while also solving domain-specific problems to a certain extent compared to the baselines, even showing a higher style probability compared to the character model in legal QA.

2023

Study on the Domain Adaption of Korean Speech Act using Daily Conversation Dataset and Petition Corpus
Youngsook Song | Won Ik Cho
Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages

In Korean, quantitative speech act studies have usually been conducted on single utterances with unspecified sources. In this study, we annotate sentences from the National Institute of Korean Language’s Messenger Corpus and the National Petition Corpus, as well as example sentences from an academic paper on contemporary Korean vlogging, and check the discrepancy between human annotation and model prediction. In particular, for sentences with differences in locutionary and illocutionary forces, we analyze the causes of errors to see if stylistic features used in a particular domain affect the correct inference of speech act. Through this, we see the necessity to build and analyze a balanced corpus in various text domains, taking into account cases with different usage roles, e.g., messenger conversations belonging to private conversations and petition corpus/vlogging script that have an unspecified audience.

Revisiting Korean Corpus Studies through Technological Advances
Won Ik Cho | Sangwhan Moon | Youngsook Song
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation

2022

APEACH: Attacking Pejorative Expressions with Analysis on Crowd-Generated Hate Speech Evaluation Datasets
Kichang Yang | Wonjun Jang | Won Ik Cho
Findings of the Association for Computational Linguistics: EMNLP 2022

In hate speech detection, developing training and evaluation datasets across various domains is a critical issue. However, major approaches crawl social media texts and hire crowd-workers to annotate the data. Following this convention often restricts the scope of pejorative expressions to a single domain, lacking generalization. Sometimes domain overlap between the training corpus and the evaluation set leads to overestimating the prediction performance when pretraining language models on a low-data language. To alleviate these problems in Korean, we propose APEACH, which asks unspecified users to generate hate speech examples followed by minimal post-labeling. We find that APEACH can collect useful datasets that are less sensitive to the lexical overlaps between the pretraining corpus and the evaluation set, thereby properly measuring the model performance.

OpenKorPOS: Democratizing Korean Tokenization with Voting-Based Open Corpus Annotation
Sangwhan Moon | Won Ik Cho | Hye Joo Han | Naoaki Okazaki | Nam Soo Kim
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Korean is a language with complex morphology that uses spaces at larger-than-word boundaries, unlike other East-Asian languages. While morpheme-based text generation can provide significant semantic advantages compared to commonly used character-level approaches, Korean morphological analyzers only provide a sequence of morpheme-level tokens, losing information in the tokenization process. Two crucial issues are the loss of spacing information and subcharacter level morpheme normalization, both of which make the tokenization result challenging to reconstruct the original input string, deterring the application to generative tasks. As this problem originates from the conventional scheme used when creating a POS tagging corpus, we propose an improvement to the existing scheme, which makes it friendlier to generative tasks. On top of that, we suggest a fully-automatic annotation of a corpus by leveraging public analyzers. We vote the surface and POS from the outcome and fill the sequence with the selected morphemes, yielding tokenization with a decent quality that incorporates space information. Our scheme is verified via an evaluation done on an external corpus, and subsequently, it is adapted to Korean Wikipedia to construct an open, permissive resource. We compare morphological analyzer performance trained on our corpus with existing methods, then perform an extrinsic evaluation on a downstream task.

Assessing How Users Display Self-Disclosure and Authenticity in Conversation with Human-Like Agents: A Case Study of Luda Lee
Won Ik Cho | Soomin Kim | Eujeong Choi | Younghoon Jeong
Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022

There is an ongoing discussion on what makes humans more engaged when interacting with conversational agents. However, in the area of language processing, there has been a paucity of studies on how people react to agents and share interactions with others. We attack this issue by investigating the user dialogues with human-like agents posted online and aim to analyze the dialogue patterns. We construct a taxonomy to discern the users’ self-disclosure in the dialogue and the communication authenticity displayed in the user posting. We annotate the in-the-wild data, examine the reliability of the proposed scheme, and discuss how the categorization can be utilized for future research and industrial development.

StyleKQC: A Style-Variant Paraphrase Corpus for Korean Questions and Commands
Won Ik Cho | Sangwhan Moon | Jongin Kim | Seokmin Kim | Nam Soo Kim
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Paraphrasing is often performed with less concern for controlled style conversion. Especially for questions and commands, style-variant paraphrasing can be crucial in tone and manner, which also matters with industrial applications such as dialog systems. In this paper, we attack this issue with a corpus construction scheme that simultaneously considers the core content and style of directives, namely intent and formality, for the Korean language. Utilizing manually generated natural language queries on six daily topics, we expand the corpus to formal and informal sentences by human rewriting and transferring. We verify the validity and industrial applicability of our approach by checking the adequate classification and inference performance that fit with conventional fine-tuning approaches, at the same time proposing a supervised formality transfer task.

Evaluating How Users Game and Display Conversation with Human-Like Agents
Won Ik Cho | Soomin Kim | Eujeong Choi | Younghoon Jeong
Proceedings of the 3rd Workshop on Computational Approaches to Discourse

Recently, with the advent of high-performance generative language models, artificial agents that communicate directly with the users have become more human-like. This development allows users to perform a diverse range of trials with the agents, and the responses are sometimes displayed online by users who share or show-off their experiences. In this study, we explore dialogues with a social chatbot uploaded to an online community, with the aim of understanding how users game human-like agents and display their conversations. Having done this, we assert that user postings can be investigated from two aspects, namely conversation topic and purpose of testing, and suggest a categorization scheme for the analysis. We analyze 639 dialogues to develop an annotation protocol for the evaluation, and measure the agreement to demonstrate the validity. We find that the dialogue content does not necessarily reflect the purpose of testing, and also that users come up with creative strategies to game the agent without being penalized.

2021

Google-trickers, Yaminjeongeum, and Leetspeak: An Empirical Taxonomy for Intentionally Noisy User-Generated Text
Won Ik Cho | Soomin Kim
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

WARNING: This article contains contents that may offend the readers. Strategies that insert intentional noise into text when posting it are commonly observed in the online space, and sometimes they aim to let only certain community users understand the genuine semantics. In this paper, we explore the purpose of such actions by categorizing them into tricks, memes, fillers, and codes, and organize the linguistic strategies that are used for each purpose. Through this, we identify that such strategies can be conducted by authors for multiple purposes, regarding the presence of stakeholders such as ‘Peers’ and ‘Others’. We finally analyze how these strategies appear differently in each circumstance, along with the unified taxonomy accompanying examples.

VUS at IWSLT 2021: A Finetuned Pipeline for Offline Speech Translation
Yong Rae Jo | Youngki Moon | Minji Jung | Jungyoon Choi | Jihyung Moon | Won Ik Cho
Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021)

In this technical report, we describe the fine-tuned ASR-MT pipeline used for the IWSLT shared task. We remove less useful speech samples by checking WER with an ASR model, and further train a wav2vec and Transformers-based ASR module based on the filtered data. In addition, we cleanse the errata that can interfere with the machine translation process and use it for Transformer-based MT module training. Finally, in the actual inference phase, we use a sentence boundary detection model trained with constrained data to properly merge fragment ASR outputs into full sentences. The merged sentences are post-processed using part of speech. The final result is yielded by the trained MT module. The performance using the dev set displays BLEU 20.37, and this model records the performance of BLEU 20.9 with the test set.

How Does the Hate Speech Corpus Concern Sociolinguistic Discussions? A Case Study on Korean Online News Comments
Won Ik Cho | Jihyung Moon
Proceedings of the Workshop on Natural Language Processing for Digital Humanities

Social consensus has been established on the severity of online hate speech since it not only causes mental harm to the target, but also gives displeasure to the people who read it. For Korean, the definition and scope of hate speech have been discussed widely in research, but such considerations were hardly extended to the construction of hate speech corpora. Therefore, we create a Korean online hate speech dataset with concrete annotation guidelines to see how real-world toxic expressions concern sociolinguistic discussions. This inductive observation reveals that hate speech in online news comments is mainly composed of social bias and toxicity. Furthermore, we check how the final corpus corresponds with the definition and scope of hate speech, and confirm that the overall procedure and outcome are in concurrence with the sociolinguistic discussions.

Modeling the Influence of Verb Aspect on the Activation of Typical Event Locations with BERT
Won Ik Cho | Emmanuele Chersoni | Yu-Yin Hsu | Chu-Ren Huang
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2020

BEEP! Korean Corpus of Online News Comments for Toxic Speech Detection
Jihyung Moon | Won Ik Cho | Junbum Lee
Proceedings of the Eighth International Workshop on Natural Language Processing for Social Media

Toxic comments in online platforms are an unavoidable social issue under the cloak of anonymity. Hate speech detection has been actively done for languages such as English, German, or Italian, where manually labeled corpora have been released. In this work, we first present 9.4K manually labeled entertainment news comments for identifying Korean toxic speech, collected from a widely used online news platform in Korea. The comments are annotated regarding social bias and hate speech since both aspects are correlated. The inter-annotator agreement Krippendorff’s alpha score is 0.492 and 0.496, respectively. We provide benchmarks using CharCNN, BiLSTM, and BERT, where BERT achieves the highest score on all tasks. The models generally display better performance on bias identification, since hate speech detection is a more subjective issue. Additionally, when BERT is trained with bias labels for hate speech detection, the prediction score increases, implying that bias and hate are intertwined. We make our dataset publicly available and open competitions with the corpus and benchmarks.

Discourse Component to Sentence (DC2S): An Efficient Human-Aided Construction of Paraphrase and Sentence Similarity Dataset
Won Ik Cho | Jong In Kim | Young Ki Moon | Nam Soo Kim
Proceedings of the Twelfth Language Resources and Evaluation Conference

Assessing the similarity of sentences and detecting paraphrases is an essential task both in theory and practice, but achieving a reliable dataset requires high resource. In this paper, we propose a discourse component-based paraphrase generation for the directive utterances, which is efficient in terms of human-aided construction and content preservation. All discourse components are expressed in natural language phrases, and the phrases are created considering both speech act and topic so that the controlled construction of the sentence similarity dataset is available. Here, we investigate the validity of our scheme using the Korean language, a language with diverse paraphrasing due to frequent subject drop and scramblings. With 1,000 intent argument phrases and thus generated 10,000 utterances, we make up a sentence similarity dataset of practically sufficient size. It contains five sentence pair types, including paraphrase, and displays a total volume of about 550K. To emphasize the utility of the scheme and dataset, we measure the similarity matching performance via conventional natural language inference models, also suggesting the multi-lingual extensibility.

Open Korean Corpora: A Practical Report
Won Ik Cho | Sangwhan Moon | Youngsook Song
Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)

Korean is often referred to as a low-resource language in the research community. While this claim is partially true, it is also because the availability of resources is inadequately advertised and curated. This work curates and reviews a list of Korean corpora, first describing institution-level resource development, then further iterating through a list of current open datasets for different types of tasks. We then propose a direction on how open-source dataset construction and releases should be done for less-resourced languages to promote research.

Towards an Efficient Code-Mixed Grapheme-to-Phoneme Conversion in an Agglutinative Language: A Case Study on To-Korean Transliteration
Won Ik Cho | Seok Min Kim | Nam Soo Kim
Proceedings of the 4th Workshop on Computational Approaches to Code Switching

Code-mixed grapheme-to-phoneme (G2P) conversion is a crucial issue for modern speech recognition and synthesis tasks, but has seldom been investigated at the sentence level in the literature. In this study, we construct a system that performs precise and efficient multi-stage code-mixed G2P conversion for a less studied agglutinative language, Korean. The proposed system undertakes a sentence-level transliteration that is effective in the accurate processing of Korean text. We formulate the underlying philosophy that supports our approach and demonstrate how it fits with the contemporary document.

Pay Attention to Categories: Syntax-Based Sentence Modeling with Metadata Projection Matrix
Won Ik Cho | Nam Soo Kim
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation

Machines Getting with the Program: Understanding Intent Arguments of Non-Canonical Directives
Won Ik Cho | Youngki Moon | Sangwhan Moon | Seok Min Kim | Nam Soo Kim
Findings of the Association for Computational Linguistics: EMNLP 2020

Modern dialog managers face the challenge of having to fulfill human-level conversational skills as part of common user expectations, including but not limited to discourse with no clear objective. Along with these requirements, agents are expected to extrapolate intent from the user’s dialogue even when subjected to non-canonical forms of speech. This depends on the agent’s comprehension of paraphrased forms of such utterances. Especially in low-resource languages, the lack of data is a bottleneck that prevents advancements of the comprehension performance for these types of agents. In this regard, here we demonstrate the necessity of extracting the intent argument of non-canonical directives in a natural language format, which may yield more accurate parsing, and suggest guidelines for building a parallel corpus for this purpose. Following the guidelines, we construct a Korean corpus of 50K instances of question/command-intent pairs, including the labels for classification of the utterance type. We also propose a method for mitigating class imbalance, demonstrating the potential applications of the corpus generation method and its multilingual extensibility.

2019

On Measuring Gender Bias in Translation of Gender-neutral Pronouns
Won Ik Cho | Ji Won Kim | Seok Min Kim | Nam Soo Kim
Proceedings of the First Workshop on Gender Bias in Natural Language Processing

Ethics regarding social bias has recently raised striking issues in natural language processing. Especially for gender-related topics, the need for a system that reduces the model bias has grown in areas such as image captioning, content recommendation, and automated employment. However, detection and evaluation of gender bias in machine translation systems are not yet thoroughly investigated, as the task is cross-lingual and challenging to define. In this paper, we propose a scheme for making up a test set that evaluates the gender bias in a machine translation system, with Korean, a language with gender-neutral pronouns. Three word/phrase sets are primarily constructed, each incorporating positive/negative expressions or occupations; all the terms are gender-independent or at least not biased to one side severely. Then, additional sentence lists are constructed concerning formality of the pronouns and politeness of the sentences. With the generated sentence set of size 4,236 in total, we evaluate gender bias in conventional machine translation systems utilizing the proposed measure, which is termed here as translation gender bias index (TGBI). The corpus and the code for evaluation are available online.

2018

HashCount at SemEval-2018 Task 3: Concatenative Featurization of Tweet and Hashtags for Irony Detection
Won Ik Cho | Woo Hyun Kang | Nam Soo Kim
Proceedings of the 12th International Workshop on Semantic Evaluation

This paper proposes a novel feature extraction process for SemEval task 3: Irony detection in English tweets. The proposed system incorporates a concatenative featurization of tweet and hashtags, which helps distinguish between the irony-related and the other components. The system embeds tweets into a vector sequence with widely used pretrained word vectors, partially using a character embedding for the words that are out of vocabulary. Identification was performed with BiLSTM and CNN classifiers, achieving F1 scores of 0.5939 (23/42) and 0.3925 (10/28) for the binary and the multi-class case, respectively. The reliability of the proposed scheme was verified by analyzing the Gold test data, which demonstrates how hashtags can be taken into account when identifying various types of irony.

Co-authors

Youngsook Song 3

Younghoon Jeong 2

Seung-Moo Yang 2

Emmanuele Chersoni 1

Jungyoon Choi 1

Chu-Ren Huang 1

Younghun Jeong 1

Woo Hyun Kang 1

Young Ki Moon 1

Naoaki Okazaki 1
