JinYeong Bak - ACL Anthology

JinYeong Bak

Also published as: Jinyeong Bak

2025

Unintended Harms of Value-Aligned LLMs: Psychological and Empirical Insights
Sooyung Choi | Jaehyeok Lee | Xiaoyuan Yi | Jing Yao | Xing Xie | JinYeong Bak
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The application scope of Large Language Models (LLMs) continues to expand, leading to increasing interest in personalized LLMs that align with human values. However, aligning these models with individual values raises significant safety concerns, as certain values may correlate with harmful information. In this paper, we identify specific safety risks associated with value-aligned LLMs and investigate the psychological principles behind these challenges. Our findings reveal two key insights. (1) Value-aligned LLMs are more prone to harmful behavior compared to non-fine-tuned models and exhibit slightly higher risks in traditional safety evaluations than other fine-tuned models. (2) These safety issues arise because value-aligned LLMs genuinely generate text according to the aligned values, which can amplify harmful outcomes. Using a dataset with detailed safety categories, we find significant correlations between value alignment and safety risks, supported by psychological hypotheses. This study offers insights into the “black box” of value alignment and proposes in-context alignment methods to enhance the safety of value-aligned LLMs.

Self-Training Meets Consistency: Improving LLMs’ Reasoning with Consistency-Driven Rationale Evaluation
Jaehyeok Lee | Keisuke Sakaguchi | JinYeong Bak
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Self-training approaches for large language models (LLMs) improve reasoning abilities by training the models on their self-generated rationales. Previous approaches have labeled rationales that produce correct answers for a given question as appropriate for training. However, a single measure risks misjudging rationale quality, leading the models to learn flawed reasoning patterns. To address this issue, we propose CREST (Consistency-driven Rationale Evaluation for Self-Training), a self-training framework that further evaluates each rationale through follow-up questions and leverages this evaluation to guide its training. Specifically, we introduce two methods: (1) filtering out rationales that frequently result in incorrect answers on follow-up questions and (2) preference learning based on mixed preferences from rationale evaluation results of both original and follow-up questions. Experiments on three question-answering datasets using open LLMs show that CREST not only improves the logical robustness and correctness of rationales but also improves reasoning abilities compared to previous self-training approaches.

Text Overlap: An LLM with Human-like Conversational Behaviors
JiWoo Kim | Minsuk Chang | JinYeong Bak
Proceedings of the Third Workshop on Social Influence in Conversations (SICon 2025)

Traditional text-based human-AI interactions typically follow a strict turn-taking approach. This rigid structure limits conversational flow, unlike natural human conversations, which can freely incorporate overlapping speech. However, our pilot study suggests that even in text-based interfaces, overlapping behaviors such as backchanneling and proactive responses lead to more natural and functional exchanges. Motivated by these findings, we introduce text-based overlapping interactions as a new challenge in human-AI communication, characterized by real-time typing, diverse response types, and interruptions. To enable AI systems to handle such interactions, we define three core tasks: deciding when to overlap, selecting the response type, and generating utterances. We construct a synthetic dataset for these tasks and train OverlapBot, an LLM-driven chatbot designed to engage in text-based overlapping interactions. Quantitative and qualitative evaluations show that OverlapBot increases turn exchanges compared to traditional turn-taking systems, with users making 72% more turns and the chatbot 130% more turns, which is perceived as efficient by end-users. This finding suggests that overlapping interactions enhance communicative efficiency and engagement.

Tagged Span Annotation for Detecting Translation Errors in Reasoning LLMs
Taemin Yeom | Yonghyun Ryu | Yoonjung Choi | Jinyeong Bak
Proceedings of the Tenth Conference on Machine Translation

We present the AIP team’s submission to the WMT 2025 Unified MT Evaluation Shared Task, focusing on the span-level error detection subtask. Our system emphasizes response format design to better harness the capabilities of OpenAI’s o3, the state-of-the-art reasoning LLM. To this end, we introduce Tagged Span Annotation (TSA), an annotation scheme designed to more accurately extract span-level information from the LLM. On our refined version of the WMT24 ESA dataset, our reference-free method achieves an F1 score of approximately 27 for character-level label prediction, outperforming the reference-based XCOMET-XXL at approximately 17.

Proceedings of the Tenth Workshop on Noisy and User-generated Text
JinYeong Bak | Rob van der Goot | Hyeju Jang | Weerayut Buaphet | Alan Ramponi | Wei Xu | Alan Ritter

2024

KpopMT: Translation Dataset with Terminology for Kpop Fandom
JiWoo Kim | Yunsu Kim | JinYeong Bak
Proceedings of the Seventh Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2024)

While machines learn from existing corpora, humans have the unique capability to establish and accept new language systems. This leads humans to form unique language systems within social groups. In line with this, we focus on a remaining gap in addressing translation challenges within social groups, where in-group members use unique terminologies. We propose the KpopMT dataset, which aims to fill this gap by enabling precise terminology translation, choosing Kpop fandom as an initial social group to study given its global popularity. Expert translators provide 1k English translations for Korean posts and comments, each annotated with specific terminology within social groups’ language systems. We evaluate existing translation systems including GPT models on KpopMT to identify their failure cases. Results show overall low scores, underscoring the challenges of reflecting group-specific terminologies and styles in translation. We make KpopMT publicly available.

PEMA: An Offsite-Tunable Plug-in External Memory Adaptation for Language Models
HyunJin Kim | Young Jin Kim | JinYeong Bak
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Pre-trained language models (PLMs) show impressive performance in various downstream NLP tasks. However, pre-training large language models demands substantial memory and training compute. Furthermore, due to the substantial resources required, many PLM weights are confidential. Consequently, users are compelled to share their data with model owners for fine-tuning on specific tasks. To overcome these limitations, we introduce Plug-in External Memory Adaptation (PEMA), a Parameter-Efficient Fine-Tuning (PEFT) method, enabling PLM fine-tuning without requiring access to all the weights. PEMA integrates with context representations from test data during inference to perform downstream tasks. It uses external memory to store PLM-generated context representations mapped with target tokens. Our method utilizes weight matrices of a LoRA-like bottlenecked adapter in the PLM’s final layer to enhance efficiency. Our approach also includes Gradual Unrolling, a novel interpolation strategy to improve generation quality. We validate PEMA’s effectiveness through experiments on syntactic and real datasets for machine translation and style transfer. Our findings show that PEMA outperforms other PEFT approaches in memory and latency efficiency for training, and also excels in maintaining sentence meaning and generating appropriate language and styles.

Proceedings of the Ninth Workshop on Noisy and User-generated Text (W-NUT 2024)
Rob van der Goot | JinYeong Bak | Max Müller-Eberstein | Wei Xu | Alan Ritter | Tim Baldwin

2023

Conversational Emotion-Cause Pair Extraction with Guided Mixture of Experts
DongJin Jeong | JinYeong Bak
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

The Emotion-Cause Pair Extraction (ECPE) task aims to pair all emotions and corresponding causes in documents. ECPE is an important task for developing human-like responses. However, previous ECPE research has been conducted on news articles, which have different characteristics from dialogues. To address this issue, we propose a Pair-Relationship Guided Mixture-of-Experts (PRG-MoE) model, which considers dialogue features (e.g., speaker information). PRG-MoE automatically learns the relationship between utterances and advises a gating network to incorporate dialogue features in the evaluation, yielding substantial performance improvement. We employ a new ECPE dataset, which is an English dialogue dataset, with more emotion-cause pairs in documents than news articles. We also propose Cause Type Classification, which classifies emotion-cause pairs according to the types of the cause of a detected emotion. For reproducing the results, we make available all our code and data.

Diversity Enhanced Narrative Question Generation for Storybooks
Hokeun Yoon | JinYeong Bak
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Question generation (QG) from a given context can enhance comprehension, engagement, assessment, and overall efficacy in learning or conversational environments. Despite recent advancements in QG, the challenge of enhancing or measuring the diversity of generated questions often remains unaddressed. In this paper, we introduce a multi-question generation model (mQG), which is capable of generating multiple, diverse, and answerable questions by focusing on context and questions. To validate the answerability of the generated questions, we employ a SQuAD 2.0 fine-tuned question answering model, classifying the questions as answerable or not. We train and evaluate mQG on the FairytaleQA dataset, a well-structured QA dataset based on storybooks, with narrative questions. We further apply a zero-shot adaptation on the TellMeWhy and SQuAD1.1 datasets. mQG shows promising results across various evaluation metrics, among strong baselines.

It Ain’t Over: A Multi-aspect Diverse Math Word Problem Dataset
Jiwoo Kim | Youngbin Kim | Ilwoong Baek | JinYeong Bak | Jongwuk Lee
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

The math word problem (MWP) is a complex task that requires natural language understanding and logical reasoning to extract key knowledge from natural language narratives. Previous studies have provided various MWP datasets but lack diversity in problem types, lexical usage patterns, languages, and annotations for intermediate solutions. To address these limitations, we introduce a new MWP dataset, named DMath (Diverse Math Word Problems), offering a wide range of diversity in problem types, lexical usage patterns, languages, and intermediate solutions. The problems are available in English and Korean and include an expression tree and Python code as intermediate solutions. Through extensive experiments, we demonstrate that the DMath dataset provides a new opportunity to evaluate the capability of large language models, i.e., GPT-4 only achieves about 75% accuracy on the DMath dataset.

From Values to Opinions: Predicting Human Behaviors and Stances Using Value-Injected Large Language Models
Dongjun Kang | Joonsuk Park | Yohan Jo | JinYeong Bak
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Being able to predict people’s opinions on issues and behaviors in realistic scenarios can be helpful in various domains, such as politics and marketing. However, conducting large-scale surveys like the European Social Survey to solicit people’s opinions on individual issues can incur prohibitive costs. Leveraging prior research showing the influence of core human values on individual decisions and actions, we propose to use value-injected large language models (LLMs) to predict opinions and behaviors. To this end, we present the Value Injection Method (VIM), a collection of two methods (argument generation and question answering) designed to inject targeted value distributions into LLMs via fine-tuning. We then conduct a series of experiments on four tasks to test the effectiveness of VIM and the possibility of using value-injected LLMs to predict opinions and behaviors of people. We find that LLMs value-injected with variations of VIM substantially outperform the baselines. Also, the results suggest that opinions and behaviors can be better predicted using value-injected LLMs than the baseline approaches.

2022

HUE: Pretrained Model and Dataset for Understanding Hanja Documents of Ancient Korea
Haneul Yoo | Jiho Jin | Juhee Son | JinYeong Bak | Kyunghyun Cho | Alice Oh
Findings of the Association for Computational Linguistics: NAACL 2022

Historical records in Korea before the 20th century were primarily written in Hanja, an extinct language based on Chinese characters and not understood by modern Korean or Chinese speakers. Historians with expertise in this time period have been analyzing the documents, but that process is very difficult and time-consuming, and language models would significantly speed up the process. Toward building and evaluating language models for Hanja, we release the Hanja Understanding Evaluation dataset consisting of chronological attribution, topic classification, named entity recognition, and summary retrieval tasks. We also present BERT-based models with continued training on the two major corpora from the 14th to the 19th centuries: the Annals of the Joseon Dynasty and Diaries of the Royal Secretariats. We compare the models with several baselines on all tasks and show there are significant improvements gained by training on the two corpora. Additionally, we run zero-shot experiments on the Daily Records of the Royal Court and Important Officials (DRRI). The DRRI dataset has not been studied much by historians, and not at all by the NLP community.

Translating Hanja Historical Documents to Contemporary Korean and English
Juhee Son | Jiho Jin | Haneul Yoo | JinYeong Bak | Kyunghyun Cho | Alice Oh
Findings of the Association for Computational Linguistics: EMNLP 2022

The Annals of Joseon Dynasty (AJD) contain the daily records of the Kings of Joseon, the 500-year kingdom preceding the modern nation of Korea. The Annals were originally written in an archaic Korean writing system, ‘Hanja’, and were translated into Korean from 1968 to 1993. The resulting translation was however too literal and contained many archaic Korean words; thus, a new expert translation effort began in 2012. Since then, the records of only one king have been completed in a decade. In parallel, expert translators are working on English translation, also at a slow pace and produced only one king’s records in English so far. Thus, we propose H2KE, a neural machine translation model, that translates historical documents in Hanja to more easily understandable Korean and to English. Built on top of multilingual neural machine translation, H2KE learns to translate a historical document written in Hanja, from both a full dataset of outdated Korean translation and a small dataset of more recently translated contemporary Korean and English. We compare our method against two baselines: a recent model that simultaneously learns to restore and translate Hanja historical document and a Transformer based model trained only on newly translated corpora. The experiments reveal that our method significantly outperforms the baselines in terms of BLEU scores for both contemporary Korean and English translations. We further conduct extensive human evaluation which shows that our translation is preferred over the original expert translations by both experts and non-expert Korean speakers.

2021

Learning Sequential and Structural Information for Source Code Summarization
YunSeok Choi | JinYeong Bak | CheolWon Na | Jee-Hyong Lee
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Knowledge-Enhanced Evidence Retrieval for Counterargument Generation
Yohan Jo | Haneul Yoo | JinYeong Bak | Alice Oh | Chris Reed | Eduard Hovy
Findings of the Association for Computational Linguistics: EMNLP 2021

Finding counterevidence to statements is key to many tasks, including counterargument generation. We build a system that, given a statement, retrieves counterevidence from diverse sources on the Web. At the core of this system is a natural language inference (NLI) model that determines whether a candidate sentence is valid counterevidence or not. Most NLI models to date, however, lack proper reasoning abilities necessary to find counterevidence that involves complex inference. Thus, we present a knowledge-enhanced NLI model that aims to handle causality- and example-based inference by incorporating knowledge graphs. Our NLI model outperforms baselines for NLI tasks, especially for instances that require the targeted inference. In addition, this NLI model further improves the counterevidence retrieval system, notably finding complex counterevidence better.

2020

Speaker Sensitive Response Evaluation Model
JinYeong Bak | Alice Oh
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Automatic evaluation of open-domain dialogue response generation is very challenging because there are many appropriate responses for a given context. Existing evaluation models merely compare the generated response with the ground truth response and rate many of the appropriate responses as inappropriate if they deviate from the ground truth. One approach to resolve this problem is to consider the similarity of the generated response with the conversational context. In this paper, we propose an automatic evaluation model based on that idea and learn the model parameters from an unlabeled conversation corpus. Our approach considers the speakers in defining the different levels of similar context. We use a Twitter conversation corpus that contains many speakers and conversations to test our evaluation model. Experiments show that our model outperforms the other existing evaluation metrics in terms of high correlation with human annotation scores. We also show that our model trained on Twitter can be applied to movie dialogues without any additional training. We provide our code and the learned parameters so that they can be used for automatic evaluation of dialogue response generation models.

2019

Variational Hierarchical User-based Conversation Model
JinYeong Bak | Alice Oh
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Generating appropriate conversation responses requires careful modeling of the utterances and speakers together. Some recent approaches to response generation model both the utterances and the speakers, but these approaches tend to generate responses that are overly tailored to the speakers. To overcome this limitation, we propose a new model with a stochastic variable designed to capture the speaker information and deliver it to the conversational context. An important part of this model is the network of speakers in which each speaker is connected to one or more conversational partners, and this network is then used to model the speakers better. To test whether our model generates more appropriate conversation responses, we build a new conversation corpus containing approximately 27,000 speakers and 770,000 conversations. With this corpus, we run experiments of generating conversational responses and compare our model with other state-of-the-art models. By automatic evaluation metrics and human evaluation, we show that our model outperforms other models in generating appropriate responses. An additional advantage of our model is that it generates better responses for various new user scenarios, for example when one of the speakers is a known user in our corpus but the partner is a new user. For replicability, we make available all our code and data.

2018

Conversational Decision-Making Model for Predicting the King’s Decision in the Annals of the Joseon Dynasty
JinYeong Bak | Alice Oh
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Styles of leaders when they make decisions in groups vary, and the different styles affect the performance of the group. To understand the key words and speakers associated with decisions, we initially formalize the problem as one of predicting leaders’ decisions from discussion with group members. As a dataset, we introduce conversational meeting records from a historical corpus, and develop a hierarchical RNN structure with attention and pre-trained speaker embedding in the form of a Conversational Decision Making Model (CDMM). The CDMM outperforms other baselines in predicting leaders’ final decisions from the data. We explain why CDMM works better than other methods by showing the key words and speakers discovered from the attentions as evidence.

2017

Rotated Word Vector Representations and their Interpretability
Sungjoon Park | JinYeong Bak | Alice Oh
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Vector representation of words improves performance in various NLP tasks, but the high dimensional word vectors are very difficult to interpret. We apply several rotation algorithms to the vector representation of words to improve the interpretability. Unlike previous approaches that induce sparsity, the rotated vectors are interpretable while preserving the expressive performance of the original vectors. Furthermore, any prebuilt word vector representation can be rotated for improved interpretability. We apply rotation to skip-gram and GloVe vectors and compare the expressive power and interpretability with the original vectors and the sparse overcomplete vectors. The results show that the rotated vectors outperform the original and the sparse overcomplete vectors for interpretability and expressiveness tasks.

2015

Five Centuries of Monarchy in Korea: Mining the Text of the Annals of the Joseon Dynasty
JinYeong Bak | Alice Oh
Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)

2014

Self-disclosure topic model for classifying and analyzing Twitter conversations
JinYeong Bak | Chin-Yew Lin | Alice Oh
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Self-disclosure topic model for Twitter conversations
JinYeong Bak | Chin-Yew Lin | Alice Oh
Proceedings of the Joint Workshop on Social Dynamics and Personal Attributes in Social Media

2012

Self-Disclosure and Relationship Strength in Twitter Conversations
JinYeong Bak | Suin Kim | Alice Oh
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
