Joonsuk Park


pdf bib
SQuARe: A Large-Scale Dataset of Sensitive Questions and Acceptable Responses Created through Human-Machine Collaboration
Hwaran Lee | Seokhee Hong | Joonsuk Park | Takyoung Kim | Meeyoung Cha | Yejin Choi | Byoungpil Kim | Gunhee Kim | Eun-Ju Lee | Yong Lim | Alice Oh | Sangchul Park | Jung-Woo Ha
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The potential social harms that large language models pose, such as generating offensive content and reinforcing biases, are steeply rising. Existing works focus on coping with this concern while interacting with ill-intentioned users, such as those who explicitly make hate speech or elicit harmful responses. However, discussions on sensitive issues can become toxic even if the users are well-intentioned. For safer models in such scenarios, we present the Sensitive Questions and Acceptable Response (SQuARe) dataset, a large-scale Korean dataset of 49k sensitive questions with 42k acceptable and 46k non-acceptable responses. The dataset was constructed leveraging HyperCLOVA in a human-in-the-loop manner based on real news headlines. Experiments show that acceptable response generation significantly improves for HyperCLOVA and GPT-3, demonstrating the efficacy of this dataset.

pdf bib
KoSBI: A Dataset for Mitigating Social Bias Risks Towards Safer Large Language Model Applications
Hwaran Lee | Seokhee Hong | Joonsuk Park | Takyoung Kim | Gunhee Kim | Jung-woo Ha
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track)

Large language models (LLMs) not only learn natural text generation abilities but also social biases against different demographic groups from real-world data. This poses a critical risk when deploying LLM-based applications. Existing research and resources are not readily applicable in South Korea due to the differences in language and culture, both of which significantly affect the biases and targeted demographic groups. This limitation requires localized social bias datasets to ensure the safe and effective deployment of LLMs. To this end, we present KosBi, a new social bias dataset of 34k pairs of contexts and sentences in Korean covering 72 demographic groups in 15 categories. We find that through filtering-based moderation, social biases in generated content can be reduced by 16.47%p on average for HyperClova (30B and 82B), and GPT-3.

pdf bib
Critic-Guided Decoding for Controlled Text Generation
Minbeom Kim | Hwanhee Lee | Kang Min Yoo | Joonsuk Park | Hwaran Lee | Kyomin Jung
Findings of the Association for Computational Linguistics: ACL 2023

Steering language generation towards objectives or away from undesired content has been a long-standing goal in utilizing language models (LM). Recent work has demonstrated reinforcement learning and weighted decoding as effective approaches to achieve a higher level of language control and quality with pros and cons. In this work, we propose a novel critic decoding method for controlled language generation (CriticControl) that combines the strengths of reinforcement learning and weighted decoding. Specifically, we adopt the actor-critic framework and train an LM-steering critic from reward models. Similar to weighted decoding, our method freezes the language model and manipulates the output token distribution using a critic to improve training efficiency and stability. Evaluation of our method on three controlled generation tasks, topic control, sentiment control, and detoxification, shows that our approach generates more coherent and well-controlled texts than previous methods. In addition, CriticControl demonstrates superior generalization ability in zero-shot settings. Human evaluation studies also corroborate our findings.

pdf bib
ClaimDiff: Comparing and Contrasting Claims on Contentious Issues
Miyoung Ko | Ingyu Seong | Hwaran Lee | Joonsuk Park | Minsuk Chang | Minjoon Seo
Findings of the Association for Computational Linguistics: ACL 2023

With the growing importance of detecting misinformation, many studies have focused on verifying factual claims by retrieving evidence. However, canonical fact verification tasks do not apply to catching subtle differences in factually consistent claims, which might still bias the readers, especially on contentious political or economic issues. Our underlying assumption is that among the trusted sources, one’s argument is not necessarily more true than the other, requiring comparison rather than verification. In this study, we propose ClaimDIff, a novel dataset that primarily focuses on comparing the nuance between claim pairs. In ClaimDiff, we provide human-labeled 2,941 claim pairs from 268 news articles. We observe that while humans are capable of detecting the nuances between claims, strong baselines struggle to detect them, showing over a 19% absolute gap with the humans. We hope this initial study could help readers to gain an unbiased grasp of contentious issues through machine-aided comparison.

pdf bib
Asking Clarification Questions to Handle Ambiguity in Open-Domain QA
Dongryeol Lee | Segwang Kim | Minwoo Lee | Hwanhee Lee | Joonsuk Park | Sang-Woo Lee | Kyomin Jung
Findings of the Association for Computational Linguistics: EMNLP 2023

Ambiguous questions persist in open-domain question answering, because formulating a precise question with a unique answer is often challenging. Previous works have tackled this issue by asking disambiguated questions for all possible interpretations of the ambiguous question. Instead, we propose to ask a clarification question, where the user’s response will help identify the interpretation that best aligns with the user’s intention. We first present CAmbigNQ, a dataset consisting of 5,653 ambiguous questions, each with relevant passages, possible answers, and a clarification question. The clarification questions were efficiently created by generating them using InstructGPT and manually revising them as necessary. We then define a pipeline of three tasks—(1) ambiguity detection, (2) clarification question generation, and (3) clarification-based QA. In the process, we adopt or design appropriate evaluation metrics to facilitate sound research. Lastly, we achieve F1 of 61.3, 25.1, and 40.5 on the three tasks, demonstrating the need for further improvements while providing competitive baselines for future work.

pdf bib
Tree of Clarifications: Answering Ambiguous Questions with Retrieval-Augmented Large Language Models
Gangwoo Kim | Sungdong Kim | Byeongguk Jeon | Joonsuk Park | Jaewoo Kang
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Questions in open-domain question answering are often ambiguous, allowing multiple interpretations. One approach to handling them is to identify all possible interpretations of the ambiguous question (AQ) and to generate a long-form answer addressing them all, as suggested by Stelmakh et al., (2022). While it provides a comprehensive response without bothering the user for clarification, considering multiple dimensions of ambiguity and gathering corresponding knowledge remains a challenge. To cope with the challenge, we propose a novel framework, Tree of Clarifications (ToC): It recursively constructs a tree of disambiguations for the AQ—via few-shot prompting leveraging external knowledge—and uses it to generate a long-form answer. ToC outperforms existing baselines on ASQA in a few-shot setup across the metrics, while surpassing fully-supervised baselines trained on the whole training set in terms of Disambig-F1 and Disambig-ROUGE. Code is available at

pdf bib
mRedditSum: A Multimodal Abstractive Summarization Dataset of Reddit Threads with Images
Keighley Overbay | Jaewoo Ahn | Fatemeh Pesaran zadeh | Joonsuk Park | Gunhee Kim
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

The growing number of multimodal online discussions necessitates automatic summarization to save time and reduce content overload. However, existing summarization datasets are not suitable for this purpose, as they either do not cover discussions, multiple modalities, or both. To this end, we present mRedditSum, the first multimodal discussion summarization dataset. It consists of 3,033 discussion threads where a post solicits advice regarding an issue described with an image and text, and respective comments express diverse opinions. We annotate each thread with a human-written summary that captures both the essential information from the text, as well as the details available only in the image. Experiments show that popular summarization models—GPT-3.5, BART, and T5—consistently improve in performance when visual information is incorporated. We also introduce a novel method, cluster-based multi-stage summarization, that outperforms existing baselines and serves as a competitive baseline for future work.

pdf bib
From Values to Opinions: Predicting Human Behaviors and Stances Using Value-Injected Large Language Models
Dongjun Kang | Joonsuk Park | Yohan Jo | JinYeong Bak
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Being able to predict people’s opinions on issues and behaviors in realistic scenarios can be helpful in various domains, such as politics and marketing. However, conducting large-scale surveys like the European Social Survey to solicit people’s opinions on individual issues can incur prohibitive costs. Leveraging prior research showing influence of core human values on individual decisions and actions, we propose to use value-injected large language models (LLM) to predict opinions and behaviors. To this end, we present Value Injection Method (VIM), a collection of two methods—argument generation and question answering—designed to inject targeted value distributions into LLMs via fine-tuning. We then conduct a series of experiments on four tasks to test the effectiveness of VIM and the possibility of using value-injected LLMs to predict opinions and behaviors of people. We find that LLMs value-injected with variations of VIM substantially outperform the baselines. Also, the results suggest that opinions and behaviors can be better predicted using value-injected LLMs than the baseline approaches.

pdf bib
Relevance-assisted Generation for Robust Zero-shot Retrieval
Jihyuk Kim | Minsoo Kim | Joonsuk Park | Seung-won Hwang
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track

Zero-shot retrieval tasks such as the BEIR benchmark reveal out-of-domain generalization as a key weakness of high-performance dense retrievers. As a solution, domain adaptation for dense retrievers has been actively studied. A notable approach is synthesizing domain-specific data, by generating pseudo queries (PQ), for fine-tuning with domain-specific relevance between PQ and documents. Our contribution is showing that key biases can cause sampled PQ to be irrelevant, negatively contributing to generalization. We propose to preempt their generation, by dividing the generation into simpler subtasks, of generating relevance explanations and guiding the generation to avoid negative generalization. Experiment results show that our proposed approach is more robust to domain shifts, validated on challenging BEIR zero-shot retrieval tasks.

pdf bib
Proceedings of the 10th Workshop on Argument Mining
Milad Alshomary | Chung-Chi Chen | Smaranda Muresan | Joonsuk Park | Julia Romberg
Proceedings of the 10th Workshop on Argument Mining


pdf bib
Argument Mining for Review Helpfulness Prediction
Zaiqian Chen | Daniel Verdi do Amarante | Jenna Donaldson | Yohan Jo | Joonsuk Park
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

The importance of reliably determining the helpfulness of product reviews is rising as both helpful and unhelpful reviews continue to accumulate on e-commerce websites. And argumentational features—such as the structure of arguments and the types of underlying elementary units—have shown to be promising indicators of product review helpfulness. However, their adoption has been limited due to the lack of sufficient resources and large-scale experiments investigating their utility. To this end, we present the AMazon Argument Mining (AM2) corpus—a corpus of 878 Amazon reviews on headphones annotated according to a theoretical argumentation model designed to evaluate argument quality.Experiments show that employing argumentational features leads to statistically significant improvements over the state-of-the-art review helpfulness predictors under both text-only and text-and-image settings.

pdf bib
Plug-and-Play Adaptation for Continuously-updated QA
Kyungjae Lee | Wookje Han | Seung-won Hwang | Hwaran Lee | Joonsuk Park | Sang-Woo Lee
Findings of the Association for Computational Linguistics: ACL 2022

Language models (LMs) have shown great potential as implicit knowledge bases (KBs). And for their practical use, knowledge in LMs need to be updated periodically. However, existing tasks to assess LMs’ efficacy as KBs do not adequately consider multiple large-scale updates. To this end, we first propose a novel task—Continuously-updated QA (CuQA)—in which multiple large-scale updates are made to LMs, and the performance is measured with respect to the success in adding and updating knowledge while retaining existing knowledge. We then present LMs with plug-in modules that effectively handle the updates. Experiments conducted on zsRE QA and NQ datasets show that our method outperforms existing approaches. We find that our method is 4x more effective in terms of updates/forgets ratio, compared to a fine-tuning baseline.

pdf bib
Masked Summarization to Generate Factually Inconsistent Summaries for Improved Factual Consistency Checking
Hwanhee Lee | Kang Min Yoo | Joonsuk Park | Hwaran Lee | Kyomin Jung
Findings of the Association for Computational Linguistics: NAACL 2022

Despite the recent advances in abstractive summarization systems, it is still difficult to determine whether a generated summary is factual consistent with the source text. To this end, the latest approach is to train a factual consistency classifier on factually consistent and inconsistent summaries. Luckily, the former is readily available as reference summaries in existing summarization datasets. However, generating the latter remains a challenge, as they need to be factually inconsistent, yet closely relevant to the source text to be effective. In this paper, we propose to generate factually inconsistent summaries using source texts and reference summaries with key information masked. Experiments on seven benchmark datasets demonstrate that factual consistency classifiers trained on summaries generated using our method generally outperform existing models and show a competitive correlation with human judgments. We also analyze the characteristics of the summaries generated using our method. We will release the pre-trained model and the code at


pdf bib
Automatic Fact-Checking with Document-level Annotations using BERT and Multiple Instance Learning
Aalok Sathe | Joonsuk Park
Proceedings of the Fourth Workshop on Fact Extraction and VERification (FEVER)

Automatic fact-checking is crucial for recognizing misinformation spreading on the internet. Most existing fact-checkers break down the process into several subtasks, one of which determines candidate evidence sentences that can potentially support or refute the claim to be verified; typically, evidence sentences with gold-standard labels are needed for this. In a more realistic setting, however, such sentence-level annotations are not available. In this paper, we tackle the natural language inference (NLI) subtask—given a document and a (sentence) claim, determine whether the document supports or refutes the claim—only using document-level annotations. Using fine-tuned BERT and multiple instance learning, we achieve 81.9% accuracy, significantly outperforming the existing results on the WikiFactCheck-English dataset.

pdf bib
Argument Mining on Twitter: A Case Study on the Planned Parenthood Debate
Muhammad Mahad Afzal Bhatti | Ahsan Suheer Ahmad | Joonsuk Park
Proceedings of the 8th Workshop on Argument Mining

Twitter is a popular platform to share opinions and claims, which may be accompanied by the underlying rationale. Such information can be invaluable to policy makers, marketers and social scientists, to name a few. However, the effort to mine arguments on Twitter has been limited, mainly because a tweet is typically too short to contain an argument — both a claim and a premise. In this paper, we propose a novel problem formulation to mine arguments from Twitter: We formulate argument mining on Twitter as a text classification task to identify tweets that serve as premises for a hashtag that represents a claim of interest. To demonstrate the efficacy of this formulation, we mine arguments for and against funding Planned Parenthood expressed in tweets. We first present a new dataset of 24,100 tweets containing hashtag #StandWithPP or #DefundPP, manually labeled as SUPPORT WITH REASON, SUPPORT WITHOUT REASON, and NO EXPLICIT SUPPORT. We then train classifiers to determine the types of tweets, achieving the best performance of 71% F1. Our results manifest claim-specific keywords as the most informative features, which in turn reveal prominent arguments for and against funding Planned Parenthood.


pdf bib
Automated Fact-Checking of Claims from Wikipedia
Aalok Sathe | Salar Ather | Tuan Manh Le | Nathan Perry | Joonsuk Park
Proceedings of the Twelfth Language Resources and Evaluation Conference

Automated fact checking is becoming increasingly vital as both truthful and fallacious information accumulate online. Research on fact checking has benefited from large-scale datasets such as FEVER and SNLI. However, such datasets suffer from limited applicability due to the synthetic nature of claims and/or evidence written by annotators that differ from real claims and evidence on the internet. To this end, we present WikiFactCheck-English, a dataset of 124k+ triples consisting of a claim, context and an evidence document extracted from English Wikipedia articles and citations, as well as 34k+ manually written claims that are refuted by the evidence documents. This is the largest fact checking dataset consisting of real claims and evidence to date; it will allow the development of fact checking systems that can better process claims and evidence in the real world. We also show that for the NLI subtask, a logistic regression system trained using existing and novel features achieves peak accuracy of 68%, providing a competitive baseline for future work. Also, a decomposable attention model trained on SNLI significantly underperforms the models trained on this dataset, suggesting that models trained on manually generated data may not be sufficiently generalizable or suitable for fact checking real-world claims.


pdf bib
A Corpus of eRulemaking User Comments for Measuring Evaluability of Arguments
Joonsuk Park | Claire Cardie
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)


pdf bib
Argument Mining with Structured SVMs and RNNs
Vlad Niculae | Joonsuk Park | Claire Cardie
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We propose a novel factor graph model for argument mining, designed for settings in which the argumentative relations in a document do not necessarily form a tree structure. (This is the case in over 20% of the web comments dataset we release.) Our model jointly learns elementary unit type classification and argumentative relation prediction. Moreover, our model supports SVM and RNN parametrizations, can enforce structure constraints (e.g., transitivity), and can express dependencies between adjacent relations and propositions. Our approaches outperform unstructured baselines in both web comments and argumentative essay datasets.


pdf bib
A Corpus of Argument Networks: Using Graph Properties to Analyse Divisive Issues
Barbara Konat | John Lawrence | Joonsuk Park | Katarzyna Budzynska | Chris Reed
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Governments are increasingly utilising online platforms in order to engage with, and ascertain the opinions of, their citizens. Whilst policy makers could potentially benefit from such enormous feedback from society, they first face the challenge of making sense out of the large volumes of data produced. This creates a demand for tools and technologies which will enable governments to quickly and thoroughly digest the points being made and to respond accordingly. By determining the argumentative and dialogical structures contained within a debate, we are able to determine the issues which are divisive and those which attract agreement. This paper proposes a method of graph-based analytics which uses properties of graphs representing networks of arguments pro- & con- in order to automatically analyse issues which divide citizens about new regulations. By future application of the most recent advances in argument mining, the results reported here will have a chance to scale up to enable sense-making of the vast amount of feedback received from citizens on directions that policy should take.


pdf bib
Conditional Random Fields for Identifying Appropriate Types of Support for Propositions in Online User Comments
Joonsuk Park | Arzoo Katiyar | Bishan Yang
Proceedings of the 2nd Workshop on Argumentation Mining

pdf bib
Automatic Identification of Rhetorical Questions
Shohini Bhattasali | Jeremy Cytryn | Elana Feldman | Joonsuk Park
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)


pdf bib
Identifying Appropriate Support for Propositions in Online User Comments
Joonsuk Park | Claire Cardie
Proceedings of the First Workshop on Argumentation Mining


pdf bib
Improving Implicit Discourse Relation Recognition Through Feature Set Optimization
Joonsuk Park | Claire Cardie
Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue