Seung-won Hwang - ACL Anthology

2026

D3: Dynamic Docid Decoding for Multi-Intent Generative Retrieval
Jaeyoung Kim | Dohyeon Lee | Soona Hong | Seung-won Hwang
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track)

Generative Retrieval (GR) maps queries to documents by generating discrete identifiers (DocIDs). However, offline DocID assignment and constrained decoding often prevent GR from capturing query-specific intent, especially when documents express multiple or unseen intents (i.e., intent misalignment). We introduce Dynamic Docid Decoding (D3), an inference-time mechanism that adaptively refines DocIDs through delayed, query-informed identifier expansion. D3 uses (a) verification to detect intent misalignment and (b) dynamic decoding to extend DocIDs with query-aligned tokens, even those absent from the pre-indexed vocabulary, enabling plug-and-play DocID expansion beyond the static vocabulary while adding minimal overhead. Experiments on NQ320k and MS-MARCO show that D3 consistently improves retrieval accuracy, especially on unseen and multi-intent documents, across various GR models, including a +2.4%p nDCG@10 gain on the state-of-the-art model.
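
For readers unfamiliar with the nDCG@10 metric cited in this abstract, the following is a minimal sketch of how it is commonly computed. It assumes the linear-gain variant and assumes the ranked list already contains every relevant document, so the ideal ordering can be derived from the list itself; the function names are illustrative and are not taken from the paper.

```python
import math

def dcg_at_k(relevances, k):
    # Discounted cumulative gain: each relevance is divided by log2(position + 1),
    # where position is 1-based (hence rank + 2 for a 0-based rank).
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    # Normalize by the DCG of the best possible ordering of the same list.
    # Assumption: every relevant document appears somewhere in `relevances`.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Graded relevance labels in ranked order (toy data).
perfect = ndcg_at_k([3, 2, 1, 0])   # already ideally ordered
swapped = ndcg_at_k([0, 1])         # the only relevant document is ranked second
```

A ranking that is already in ideal order scores 1.0, and pushing a relevant document further down the list lowers the score.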

TAGQuant: Token-Aware Clustering for Group-Wise Quantization
Jaeseong Lee | Seung-won Hwang | Aurick Qiao | Zhewei Yao | Yuxiong He
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track)

Grouping, e.g., grouping channels, which is widely used in current integer-based quantization, has become essential for the emerging MXFP4 format. Ideally, each group should contain channels with similar quantization scales. To form such groups, existing work clusters the channels using a scalar proxy, ignoring the token dimension, which we find suboptimal. In this paper, we propose TAGQuant, a simple yet powerful enhancement for such “group-wise” quantization. By strategically shuffling channels to group those with similar token-wise activation distributions, TAGQuant ensures better clustering of large- and small-range values. This shuffle operation is hardware-efficient and is seamlessly integrated into the quantization process with only 0.01x latency overhead. TAGQuant reduces relative GSM8K error in both INT4 and MXFP4 formats, by up to 86% in Llama-3.1-8B-Instruct compared to baselines, validating the effectiveness of our channel shuffling approach for group-wise quantization. Code is publicly available.
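
To illustrate why channel grouping matters for group-wise quantization, here is a toy sketch. It is not TAGQuant: it reorders channels by a scalar proxy (maximum absolute value), which is the kind of baseline the abstract describes, and it uses a simple symmetric absmax INT4 scheme. All names and data are hypothetical.

```python
def quantize_group(values, qmax=7):
    # Symmetric absmax quantization: one shared scale per group (INT4 range -7..7).
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [round(v / scale) * scale for v in values]

def groupwise_error(channels, order, group_size):
    # Mean squared error after quantizing channels group by group, following `order`.
    total, count = 0.0, 0
    for start in range(0, len(order), group_size):
        idx = order[start:start + group_size]
        flat = [v for i in idx for v in channels[i]]
        deq = quantize_group(flat)
        total += sum((a - b) ** 2 for a, b in zip(flat, deq))
        count += len(flat)
    return total / count

# Each inner list holds one channel's activations across three tokens (toy data).
channels = [
    [0.10, -0.20, 0.15],   # small range
    [8.0, -6.0, 7.5],      # large range
    [0.05, 0.12, -0.08],   # small range
    [9.0, 5.5, -7.0],      # large range
]
natural = [0, 1, 2, 3]
# Scalar-proxy reordering (assumption): sort channels by their maximum magnitude.
shuffled = sorted(natural, key=lambda i: max(abs(v) for v in channels[i]))

err_natural = groupwise_error(channels, natural, group_size=2)
err_shuffled = groupwise_error(channels, shuffled, group_size=2)
```

In this toy setup the reordered grouping yields a lower error, because small-range channels no longer share a scale with large-range ones. The abstract's point is that a token-aware criterion can group channels better than such a scalar proxy.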

2025

Towards Lifelong Dialogue Agents via Timeline-based Memory Management
Kai Tzu-iunn Ong | Namyoung Kim | Minju Gwak | Hyungjoo Chae | Taeyoon Kwon | Yohan Jo | Seung-won Hwang | Dongha Lee | Jinyoung Yeo
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

To achieve lifelong human-agent interaction, dialogue agents need to constantly memorize perceived information and properly retrieve it for response generation (RG). While prior studies focus on getting rid of outdated memories to improve retrieval quality, we argue that such memories provide rich, important contextual cues for RG (e.g., changes in user behaviors) in long-term conversations. We present THEANINE, a framework for LLM-based lifelong dialogue agents. THEANINE discards memory removal and manages large-scale memories by linking them based on their temporal and cause-effect relations. Enabled by this linking structure, THEANINE augments RG with memory timelines, i.e., series of memories representing the evolution or causality of relevant past events. Along with THEANINE, we introduce TeaFarm, a counterfactual-driven evaluation scheme, addressing the limitations of G-Eval and human efforts when assessing agent performance in integrating past memories into RG. A supplementary video for THEANINE and data for TeaFarm are at https://huggingface.co/spaces/ResearcherScholar/Theanine.

PERC: Plan-As-Query Example Retrieval for Underrepresented Code Generation
Jaeseok Yoo | Hojae Han | Youngwon Lee | Jaejin Kim | Seung-won Hwang
Proceedings of the 31st International Conference on Computational Linguistics

Code generation with large language models has shown significant promise, especially when employing retrieval-augmented generation (RAG) with few-shot examples. However, selecting effective examples that enhance generation quality remains a challenging task, particularly when the target programming language (PL) is underrepresented. In this study, we present two key findings: (1) retrieving examples whose presented algorithmic plans can be referenced for generating the desired behavior significantly improves generation accuracy, and (2) converting code into pseudocode effectively captures such algorithmic plans, enhancing retrieval quality even when the source and the target PLs are different. Based on these findings, we propose Plan-as-query Example Retrieval for few-shot prompting in Code generation (PERC), a novel framework that utilizes algorithmic plans to identify and retrieve effective examples. We validate the effectiveness of PERC through extensive experiments on the CodeContests, HumanEval and MultiPL-E benchmarks: PERC consistently outperforms the state-of-the-art RAG methods in code generation, whether the source and target programming languages match or differ, highlighting its adaptability and robustness in diverse coding environments.

Tree-of-Prompts: Abstracting Control-Flow for Prompt Optimization
Jihyuk Kim | Shubham Garg | Lahari Poddar | Seung-won Hwang | Chris Hench
Findings of the Association for Computational Linguistics: ACL 2025

Prompt optimization (PO) generates prompts to guide Large Language Models (LLMs) in performing tasks. Existing methods, such as PromptAgent, rely on a single static prompt, which struggles with disjoint cases in complex tasks. Although MoP uses multiple prompts, it fails to account for variations in task complexity. Inspired by programmatic control flow, we introduce a nested if-else structure to address both varying similarities and complexities across diverse cases. We propose Tree-of-Prompts (ToP), which implements this structure by recursively expanding child prompts from a parent prompt. Sibling prompts tackle disjoint cases while inheriting shared similarities from their parent, and handle cases more complex than the parent. Evaluated on Gorilla (understanding), MATH (reasoning), and a subset of BBH benchmarks, ToP outperforms PromptAgent and MoP, with improvements of 1.4% and 4.6% over PromptAgent and 3.2% and 4.5% over MoP, when tested with GPT-4o-mini and Llama 3.2-3B, respectively.

ECoRAG: Evidentiality-guided Compression for Long Context RAG
Yeonseok Jeong | Jinsu Kim | Dohyeon Lee | Seung-won Hwang
Findings of the Association for Computational Linguistics: ACL 2025

Large Language Models (LLMs) have shown remarkable performance in Open-Domain Question Answering (ODQA) by leveraging external documents through Retrieval-Augmented Generation (RAG). To reduce the RAG overhead incurred by longer contexts, context compression is necessary. However, prior compression methods do not focus on filtering out non-evidential information, which limits the performance of LLM-based RAG. We thus propose the Evidentiality-guided RAG (ECoRAG) framework. ECoRAG improves LLM performance by compressing retrieved documents based on evidentiality, ensuring that answer generation is supported by the correct evidence. As an additional step, ECoRAG reflects on whether the compressed content provides sufficient evidence and, if not, retrieves more until the evidence is sufficient. Experiments show that ECoRAG improves LLM performance on ODQA tasks, outperforming existing compression methods. Furthermore, ECoRAG is highly cost-efficient, as it not only reduces latency but also minimizes token usage by retaining only the necessary information to generate the correct answer. Code is available at https://github.com/ldilab/ECoRAG.

TALE: Token-Adaptive Low-Rank KVCache Approximation with Reconstruction Elimination
Jaeseong Lee | Seung-won Hwang | Aurick Qiao | Daniel Campos | Zhewei Yao | Yuxiong He
Transactions of the Association for Computational Linguistics, Volume 13

KVCache, by storing key-value pairs for reuse, has been crucial for enhancing inference efficiency for large language models (LLMs). However, the increasing memory demands of KVCache, especially with recent trends of longer input sequences, present a major challenge. In this work, we propose an innovative token-adaptive low-rank approximation strategy for KVCache compression. By applying varying ranks based on token significance, our method compresses KVCache efficiently while retaining critical information. Moreover, we introduce a lazy approximation technique, which approximates lazily only when needed, alongside a reconstruction-free design to bypass costly recalculations. Combined with multi-level quantization, this method reduces KVCache size by 9.1× on the Llama-3.1-8B model, with minimal performance degradation on complex tasks such as GSM8K. Moreover, our custom attention implementation shows up to 2× latency reduction compared to the conventional method in long context scenarios. The code is publicly available.

RoToR: Towards More Reliable Responses for Order-Invariant Inputs
Soyoung Yoon | Dongha Ahn | Youngwon Lee | Minkyu Jung | HyungJoo Jang | Seung-won Hwang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Mitigating positional bias of language models (LMs) for listwise inputs is a well-known and important problem (e.g., lost-in-the-middle). While zero-shot order-invariant LMs have been proposed to solve this issue, their success on practical listwise problems has been limited. In this work, as a first contribution, we identify two limitations that keep zero-shot invariant LMs from being practical: (1) a training and inference distribution mismatch arising from modifying positional ID assignments to enforce invariance, and (2) a failure to adapt to the mixture of order-invariant and order-sensitive inputs in practical listwise problems. Then, to overcome these issues, we propose (1) RoToR, a zero-shot invariant LM for genuinely order-invariant inputs with minimal modifications of positional IDs, and (2) Selective Routing, an adaptive framework that handles both order-invariant and order-sensitive inputs in listwise tasks. On the Lost in the middle (LitM), Knowledge Graph QA (KGQA), and MMLU benchmarks, we show that RoToR with Selective Routing can effectively handle practical listwise input tasks in a zero-shot manner (https://github.com/soyoung97/RoToR).

Query-focused Referentiability Learning for Zero-shot Retrieval
Jaeyoung Kim | Dohyeon Lee | Seung-won Hwang
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Dense passage retrieval enhances Information Retrieval (IR) by encoding queries and passages into a representation space. However, passage representations often fail to be referenced by their gold queries under domain shifts, revealing a weakness in the representation space. One desirable concept for representations is “argmaxable”. Being argmaxable ensures that no representations are theoretically excluded from selection due to geometric constraints. To be argmaxable, a notable approach is to increase isotropy, where representations are evenly spread out in all directions. These findings, while desirable also for IR, focus on passage representations and not on queries, making it challenging to apply them directly to IR. In contrast, we introduce a novel query-focused concept of “referentiable” tailored for IR tasks, which ensures that passage representations are referenced by their gold queries. Building on this, we propose Learning Referentiable Representation (LRR), and two strategic metrics, Self-P and Self-Q, quantifying how referentiable the representations are. Our experiments compare three dense model versions: Naive, Isotropic, and Referentiable, demonstrating that LRR leads to enhanced zero-shot performance, surpassing existing naive and isotropic versions.

HARP: Hesitation-Aware Reframing in Transformer Inference Pass
Romain Storaï | Seung-won Hwang
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

This paper aims to improve the performance of large language models by addressing the variable computational demands in inference steps, where some tokens require more computational resources than others. We present HARP, a simple modification to the “off-the-shelf” Transformer forward pass. Drawing from hesitation and the framing effect in decision-making, HARP selectively applies additional computation when the model encounters uncertainty during token generation. Our method mimics human cognitive processes by pausing at difficult decision points and reframing inputs for a different perspective. Unlike other approaches, HARP is model-agnostic, training-free, and easy to implement. We evaluate our method across various downstream tasks and model sizes, demonstrating performance improvements of up to +5.16%. Notably, HARP achieves these gains while maintaining inference times twice as fast as beam search. Simple and yet with significant gains, HARP provides insights into the potential of adaptive computation for enhancing the performance of Transformer-based language models.
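
The idea of spending extra computation only at uncertain steps can be sketched as follows. This is a loose illustration, not the HARP implementation: the use of Shannon entropy as the uncertainty signal, the threshold value, the equal-weight mixing rule, and the `forward` and `reframe` callables are all assumptions made for the example.

```python
import math

def entropy(probs):
    # Shannon entropy (in nats) of a next-token distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def next_token_probs(forward, inputs, reframe, threshold=1.0, weight=0.5):
    # `forward` maps inputs to a probability list; `reframe` perturbs the inputs.
    probs = forward(inputs)
    if entropy(probs) <= threshold:
        return probs                      # confident step: a single forward pass
    alt = forward(reframe(inputs))        # uncertain step: one extra pass
    mixed = [weight * a + (1 - weight) * b for a, b in zip(probs, alt)]
    total = sum(mixed)
    return [m / total for m in mixed]

# Toy stand-ins for a model and a reframing function (hypothetical).
calls = []
def toy_forward(text):
    calls.append(text)
    return [0.25, 0.25, 0.25, 0.25] if text == "ambiguous" else [0.85, 0.05, 0.05, 0.05]

def toy_reframe(text):
    return text + " (reframed)"

uncertain = next_token_probs(toy_forward, "ambiguous", toy_reframe)
confident = next_token_probs(toy_forward, "clear", toy_reframe)
```

Here the uniform distribution triggers a second pass, while the peaked distribution is returned after a single pass, so the extra cost is paid only where the model hesitates.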

Overcoming Source Object Grounding for Semantic Image Editing
Yeonjoon Jung | Seungtaek Choi | Seung-won Hwang
Transactions of the Association for Computational Linguistics, Volume 13

Recent diffusion models have demonstrated remarkable capabilities in text-to-image generation. However, their stochastic denoising process often causes semantic image editing (SIE) models to misapply textual instructions. That is, models often leave the source object unchanged or erroneously alter the background. We refer to this challenge as source object grounding. To address this challenge, we introduce R-SIE, a region-wise SIE framework. During inference, R-SIE models noise separately for distinct image regions, enabling precise control over the transformed areas. To reinforce the inference, we devise an automatic pipeline leveraging bounding boxes to generate unambiguous training data. Additionally, we propose two region-focused metrics, CLIP-Region Class (CLIP-RC) and CLIP-Global Context (CLIP-GC), to independently assess how well the source object is edited and the background is preserved, respectively. Experimental results demonstrate that region-wise diffusion improves existing baselines, and our data generation pipeline further enhances these improvements.

STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning
Jaeseong Lee | Seung-won Hwang | Aurick Qiao | Daniel F Campos | Zhewei Yao | Yuxiong He
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Mixture-of-experts (MoEs) have been adopted for reducing inference costs by sparsely activating experts in large language models (LLMs). Despite these reductions, the massive number of parameters in MoEs still makes them expensive to serve. Conventionally, unstructured or structured pruning has been considered to reduce the number of parameters. Our key contribution is exploring the interpolation between structured and unstructured pruning, to propose a novel structured-then-unstructured (STUN) approach that outperforms both structured and unstructured pruning, especially for MoEs. In the first stage, we show a scalable expert pruning with O(1) forward passes, unlike existing work requiring O(kⁿ/√n) forward passes for n experts, which cannot scale to recent MoEs with hundreds of experts. We then show that our expert-pruned MoEs are robust to the unstructured pruning that follows. Experiments on Snowflake Arctic and Mixtral show that our proposal is highly effective: for Snowflake Arctic, a 480B-sized MoE with 128 experts, our method needs only one H100 and two hours to achieve nearly no loss in performance with 40% sparsity, even in generative tasks such as GSM8K, where state-of-the-art structured or unstructured pruning methods fail. The code is publicly available.

Counterfactual-Consistency Prompting for Relative Temporal Understanding in Large Language Models
Jongho Kim | Seung-won Hwang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Despite the advanced capabilities of large language models (LLMs), their temporal reasoning ability remains underdeveloped. Prior works have highlighted this limitation, particularly in maintaining temporal consistency when understanding event relations. For example, models often confuse mutually exclusive temporal relations like “before” and “after” between events and make inconsistent predictions. In this work, we tackle the issue of temporal inconsistency in LLMs by proposing a novel counterfactual prompting approach. Our method generates counterfactual questions and enforces collective constraints, enhancing the model’s consistency. We evaluate our method on multiple datasets, demonstrating significant improvements in event ordering for explicit and implicit events and temporal commonsense understanding, by effectively addressing temporal inconsistencies.

Smarter, Not Harder: Training-Free Adaptive Computation for Transformers
Romain Storaï | Jaeseong Lee | Seung-won Hwang
Findings of the Association for Computational Linguistics: ACL 2025

Adaptive Computation in Transformers (ACT) has been pursued in two directions: efficiency- and performance-focused. We study performance-focused ACT, or PACT, which invests more computation on hard steps to improve performance, such as by adding forward passes. We first discuss beam search and hesitation-based methods as PACT and their limitations. While the hesitation-based approach outperforms beam search by perturbing input embeddings, it suffers from inefficiency due to invalidating KVCache and exhibits instability due to its reliance on randomness. To address this, we propose IMPACT, a novel PACT method that perturbs network weights rather than input embeddings. This approach enables the reuse of KVCache, offers deterministic predictions, and significantly improves memory and computational efficiency. By achieving a better balance between performance and efficiency, IMPACT makes PACT accessible to communities with consumer-grade hardware.

PLEX: Adaptive Parameter-Efficient Fine-Tuning for Code LLMs using Lottery-Tickets
Jaeseong Lee | Hojae Han | Jongyoon Kim | Seung-won Hwang | Naun Kang | KyungJun An | Sungho Jang
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)

Fine-tuning large language models (LLMs) for code generation is challenging due to computational costs and the underrepresentation of some programming languages (PLs) in pre-training. We propose PLEX, a lottery-ticket based parameter-efficient fine-tuning (PEFT) method that adapts LLMs to both well-represented and underrepresented PLs. During lottery ticket selection, PLEX employs a dual strategy: for well-represented PLs, it leverages the LLM’s full parametric knowledge by selecting from full layers, while for underrepresented PLs, it narrows the selection scope to dense layers, prioritizing the most influential parameters. Additionally, PLEX-E, a low-rank extension of PLEX, further reduces computational costs by limiting the scope of fine-tuning. On MultiPL-E benchmarks, PLEX achieves state-of-the-art performance among PEFT methods, while PLEX-E maintains competitive results with reduced computational overhead. Both variants demonstrate effective adaptation across diverse programming languages, particularly for those underrepresented in pre-training.

Agent-as-Judge for Factual Summarization of Long Narratives
Yeonseok Jeong | Minsoo Kim | Seung-won Hwang | Byung-Hak Kim
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large Language Models (LLMs) have demonstrated near-human performance in summarization tasks based on traditional metrics such as ROUGE and BERTScore. However, these metrics do not adequately capture critical aspects of summarization quality, such as factual accuracy, particularly for long narratives (>100K tokens). Recent advances, such as LLM-as-a-Judge, address the limitations of metrics based on lexical similarity but still exhibit factual inconsistencies, especially in understanding character relationships and states. In this work, we introduce NarrativeFactScore (NFS), the first “Agent-as-a-Judge” framework that evaluates and refines factuality in narrative summarization. By leveraging a Character Knowledge Graph (CKG) extracted from the input narrative, NarrativeFactScore evaluates the factuality and provides actionable guidance for refinement, such as identifying missing or erroneous facts. Our experimental results demonstrate that constructing the CKG enables reasoning with 1/3 of the factuality computation used in the prior approach, and achieves three times higher correlation with human judgments. Furthermore, refinement with actionable guidance improves the quality of the summary.

FaVe: Factored and Verified Search Rationale for Long-form Answer
Jihyuk Kim | Sungjin Lee | Seung-won Hwang | Yang Liu
Findings of the Association for Computational Linguistics: ACL 2025

Targeting long-form question-answering, chain-of-query (CoQ) has been studied, integrating chain-of-thought (CoT) with retrieval-augmented generation. CoQ answers the complex question step-by-step, through simpler subquestions (SQs) from which relevant knowledge is retrieved. By doing so, CoQ aims to improve the answer comprehensiveness and verifiability, at the expense of latency. Our first contribution is showing that the chaining often incurs harmful effects on both objectives, and SQs left unverified often fail to answer the given question. Second, we propose a better alternative to CoQ, union-of-query, which adopts a factored approach to break the harmful chain. Finally, we propose to verify SQs before answers, by fine-tuning the SQ generator using verified SQs and introducing a selector that verifies SQs at test time. Employing vicuna-13b, our approach, denoted by FaVe (short for Factored and Verified search), even outperforms ChatGPT baselines while maintaining efficiency.

tRAG: Term-level Retrieval-Augmented Generation for Domain-Adaptive Retrieval
Dohyeon Lee | Jongyoon Kim | Jihyuk Kim | Seung-won Hwang | Joonsuk Park
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Neural retrieval models have emerged as an effective tool for information retrieval, but their performance suffers when there is a domain shift between training and test data distributions. Recent work aims to construct pseudo-training data for the target domain by generating domain-adapted pseudo-queries using large language models (LLMs). However, we identify that LLMs exhibit a “seen term bias” where the generated pseudo-queries fail to include relevant “unseen” terms as expected for domain adaptation purposes. To address this limitation, we propose to improve the term recall of unseen query terms, by using term-level Retrieval-Augmented Generation (tRAG). Specifically, unlike existing document-level RAG, we propose to generate domain-specific keywords from all documents in the corpus, including those unseen in any individual document. To filter hallucination, generated keywords are retrieved and reranked, leveraging relevance feedback from both retrievers and LLMs. Experiments on the BEIR benchmark show tRAG significantly improves recall for unseen terms by 10.6% and outperforms LLM and retrieval-augmented generation baselines on overall retrieval performance.

Query Variant Detection Using Retriever as Environment
Minji Seo | Youngwon Lee | Seung-won Hwang | Seoho Song | Hee-Cheol Seo | Young-In Song
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)

This paper addresses the challenge of detecting query variants, i.e., pairs of queries with identical intents. One application in commercial search engines is reformulating user queries with their variants online. While measuring pairwise query similarity has been an established standard, it often falls short of capturing semantic equivalence when word forms or order differ. We propose leveraging retrieval as environment feedback (EF), based on the premise that desirable retrieval outcomes from equivalent queries should be interchangeable. Experimental results on both proprietary and public datasets demonstrate the efficacy of the proposed method, both with and without LLM calls.

CoEx – Co-evolving World-model and Exploration
Minsoo Kim | Seung-won Hwang
Findings of the Association for Computational Linguistics: EMNLP 2025

Planning in modern LLM agents relies on the utilization of LLM as an internal world model, acquired during pretraining. However, existing agent designs fail to effectively assimilate new observations into dynamic updates of the world model. This reliance on the LLM’s static internal world model is progressively prone to misalignment with the underlying true state of the world, leading to the generation of divergent and erroneous plans. We introduce a hierarchical agent architecture, CoEx, in which hierarchical state abstraction allows LLM planning to co-evolve with a dynamically updated model of the world. CoEx plans and interacts with the world by using LLM reasoning to orchestrate dynamic plans consisting of subgoals, and its learning mechanism continuously incorporates these subgoal experiences into a persistent world model in the form of a neurosymbolic belief state, comprising textual inferences and code-based symbolic memory. We evaluate our agent across a diverse set of agent scenarios involving rich environments and complex tasks including ALFWorld, PDDL, and Jericho. Our experiments show that CoEx outperforms existing agent paradigms in planning and exploration.

From Token to Action: State Machine Reasoning to Mitigate Overthinking in Information Retrieval
Dohyeon Lee | Yeonseok Jeong | Seung-won Hwang
Findings of the Association for Computational Linguistics: EMNLP 2025

Chain-of-Thought (CoT) prompting enables complex reasoning in large language models (LLMs), including applications in information retrieval (IR). However, it often leads to overthinking, where models produce excessively long and semantically redundant traces with little or no benefit. We identify two key challenges in IR: redundant trajectories that revisit similar states and misguided reasoning that diverges from user intent. To address these, we propose State Machine Reasoning (SMR), a transition-based reasoning framework composed of discrete actions (REFINE, RERANK, STOP) that support early stopping and fine-grained control. Experiments on the BEIR and BRIGHT benchmarks show that SMR improves retrieval performance (nDCG@10) by 3.4% while reducing token usage by 74.4%. It generalizes across LLMs and retrievers without requiring task-specific tuning, offering a practical alternative to conventional CoT reasoning.
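
A transition loop over the three actions named in this abstract might look like the sketch below. Only the action names REFINE, RERANK, and STOP come from the abstract; the state representation, the revisit check, the step budget, and the toy policy, refine, and rerank functions are assumptions for illustration, not the authors' implementation.

```python
from enum import Enum

class Action(Enum):
    REFINE = "refine"   # rewrite the query
    RERANK = "rerank"   # reorder the candidate documents
    STOP = "stop"       # terminate early

def run_state_machine(query, docs, policy, refine, rerank, max_steps=5):
    state = (query, tuple(docs))
    seen = {state}
    trace = []
    for _ in range(max_steps):
        action = policy(*state)
        trace.append(action)
        if action is Action.STOP:
            break
        q, d = state
        if action is Action.REFINE:
            new_state = (refine(q), d)
        else:
            new_state = (q, tuple(rerank(q, d)))
        if new_state in seen:
            break               # assumed guard against revisiting a state
        seen.add(new_state)
        state = new_state
    return state, trace

# Toy components (hypothetical): a real system would call an LLM and a retriever.
def toy_policy(query, docs):
    if "python" not in query:
        return Action.REFINE
    if "python" not in docs[0]:
        return Action.RERANK
    return Action.STOP

def toy_refine(query):
    return query + " python"

def toy_rerank(query, docs):
    terms = set(query.split())
    return sorted(docs, key=lambda d: -len(terms & set(d.split())))

final_state, trace = run_state_machine(
    "snake care", ["cat care tips", "python snake care guide"],
    toy_policy, toy_refine, toy_rerank,
)
```

Because each step is a discrete action on an explicit state, the loop can stop as soon as the policy emits STOP or a state repeats, rather than generating a long free-form reasoning trace.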

CORD: Balancing COnsistency and Rank Distillation for Robust Retrieval-Augmented Generation
Youngwon Lee | Seung-won Hwang | Daniel F Campos | Filip Graliński | Zhewei Yao | Yuxiong He
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

With the adoption of retrieval-augmented generation (RAG), large language models (LLMs) are expected to ground their generation to the retrieved contexts. Yet, this is hindered by the position bias of LLMs, which fail to attend evenly to all contexts. Previous work has addressed this by synthesizing contexts with perturbed positions of the gold segment, creating a position-diversified train set. We extend this intuition to propose consistency regularization with augmentation and distillation. First, we augment each training instance with its position perturbation to encourage consistent predictions, regardless of ordering. We also distill behaviors of this pair, although it can be counterproductive in certain RAG scenarios where the given order from the retriever is crucial for generation quality. We thus propose CORD, balancing COnsistency and Rank Distillation: CORD adaptively samples noise-controlled perturbations from an interpolation space, ensuring both consistency and respect for the rank prior. Empirical results show this balance enables CORD to consistently outperform baselines across diverse RAG benchmarks.

Inference Scaling for Bridging Retrieval and Augmented Generation
Youngwon Lee | Seung-won Hwang | Daniel F Campos | Filip Graliński | Zhewei Yao | Yuxiong He
Findings of the Association for Computational Linguistics: NAACL 2025

Retrieval-augmented generation (RAG) has emerged as a popular approach to steering the output of a large language model (LLM) by incorporating retrieved contexts as inputs. However, existing work has observed a generator bias, such that improving the retrieval results may negatively affect the outcome. In this work, we show that such bias can be mitigated through inference scaling, by aggregating inference calls over permuted orders of the retrieved contexts. The proposed Mixture-of-Intervention (MoI) explicitly models the debiased utility of each passage with multiple forward passes to construct a new ranking. We also show that MoI can leverage the retriever’s prior knowledge to reduce the computational cost by minimizing the number of permutations considered and lowering the cost per LLM call. We showcase the effectiveness of MoI on diverse RAG tasks, improving ROUGE-L on MS MARCO and EM on HotpotQA benchmarks by ~7 points.

PROM: Pivoted and Regulated Optimization for Multilingual Instruction Learning
Jaeseong Lee | Seung-won Hwang | Hojin Lee | Yunju Bak | Changmin Lee
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

Large language models (LLMs) have become standard for natural language generation tasks, with instruction-tuning enhancing their capabilities. However, the lack of instruction-tuning datasets in languages other than English limits their application to diverse languages. To address this, researchers have adapted English-centric LLMs to other languages by appending English tuning data with its translated pair, from which we observe negative interference between the two. To resolve this, our contribution is identifying English as an internal pivot language, based on which we disentangle the roles of English and target language data in training. Specifically, we first design two roles as pivoted objectives, and also propose to regulate between the two, to better generalize for under-represented languages. Experiments across various languages demonstrate the effectiveness of our approach on multiple benchmarks. The code is publicly available for further exploration.

2024

Coffee-Gym: An Environment for Evaluating and Improving Natural Language Feedback on Erroneous Code
Hyungjoo Chae | Taeyoon Kwon | Seungjun Moon | Yongho Song | Dongjin Kang | Kai Tzu-iunn Ong | Beong-woo Kwak | Seonghyeon Bae | Seung-won Hwang | Jinyoung Yeo
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

This paper presents Coffee-Gym, a comprehensive RL environment for training models that provide feedback on code editing. Coffee-Gym includes two major components: (1) Coffee, a dataset containing humans’ code edit traces for coding questions and human-written feedback for editing erroneous code; (2) CoffeeEval, a reward function that faithfully reflects the helpfulness of feedback by assessing the performance of the revised code in unit tests. With them, Coffee-Gym addresses the unavailability of high-quality datasets for training feedback models with RL, and provides more accurate rewards than the SOTA reward model (i.e., GPT-4). By applying Coffee-Gym, we elicit feedback models that outperform baselines in enhancing open-source code LLMs’ code editing, making them comparable with closed-source LLMs. We make the dataset and the model checkpoint publicly available at https://huggingface.co/spaces/Coffee-Gym/Project-Coffee-Gym.

HIL: Hybrid Isotropy Learning for Zero-shot Performance in Dense retrieval
Jaeyoung Kim | Dohyeon Lee | Seung-won Hwang
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Advancements in dense retrieval models have brought ColBERT to prominence in Information Retrieval (IR) with its advanced interaction techniques. However, ColBERT is reported to frequently underperform in zero-shot scenarios, where traditional techniques such as BM25 still exceed it. Addressing this, we propose to balance representation isotropy and anisotropy for zero-shot model performance, based on our observations that isotropy can enhance cosine similarity computations and anisotropy may aid in generalizing to unseen data. Striking a balance between these isotropic and anisotropic qualities stands as a critical objective to refine model efficacy. Based on this, we present ours, a Hybrid Isotropy Learning (HIL) architecture that integrates isotropic and anisotropic representations. Our experiments with the BEIR benchmark show that our model significantly outperforms the baseline ColBERT model, highlighting the importance of harmonized isotropy in improving zero-shot retrieval performance.

COMMIT: Code-Mixing English-Centric Large Language Model for Multilingual Instruction Tuning
Jaeseong Lee | YeonJoon Jung | Seung-won Hwang
Findings of the Association for Computational Linguistics: NAACL 2024

Recently, instruction-tuned large language models (LLMs) are showing prominent performance on various tasks, such as question answering. However, the majority of instruction-tuned LLMs are English-centric, which hinders their application to low-resource language QA. In this paper, we propose COde-Mixed Multilingual Instruction Tuning (COMMIT) to adapt English-centric LLM to low-resource language QA. We point out two main causes of English-centricness: imbalance of unlabeled data, and English-centric instruction tuning datasets. To deviate from English-centric instruction tuning, we propose to specialize code-mixing for instruction tuning, which blocks code-mixing in English templates, to leverage the potential of its superiority. To overcome data imbalance, we perform cross-lingual alignment. The majority of cross-lingual alignment works focused on making representations similar, which is not desirable to decoder-based LLMs, such as LLaMA. Therefore, we propose code-mixed continual causal language modeling to align the decoder. COMMIT improves the exact match score of low-resourced language QA by up to 32x. Code is publicly available.

DADA: Distribution-Aware Domain Adaptation of PLMs for Information Retrieval
Dohyeon Lee | Jongyoon Kim | Seung-won Hwang | Joonsuk Park
Findings of the Association for Computational Linguistics: ACL 2024

Pre-trained language models (PLMs) exhibit promise in retrieval tasks but struggle with out-of-domain data due to distribution shifts. Addressing this, generative domain adaptation (DA), known as GPL, tackles distribution shifts by generating pseudo queries and labels to train models for predicting query-document relationships in new domains. However, it overlooks the domain distribution, causing the model to struggle with aligning the distribution in the target domain. We, therefore, propose a Distribution-Aware Domain Adaptation (DADA) to guide the model to consider the domain distribution knowledge at the level of both a single document and the corpus, which is referred to as observation-level feedback and domain-level feedback, respectively. Our method effectively adapts the model to the target domain and expands document representation to unseen gold query terms using domain and observation feedback, as demonstrated by empirical results on the BEIR benchmark.

Intended Target Identification for Anomia Patients with Gradient-based Selective Augmentation
Jongho Kim | Romain Storaï | Seung-won Hwang
Findings of the Association for Computational Linguistics: EMNLP 2024

In this study, we investigate the potential of language models (LMs) in aiding patients experiencing anomia, a difficulty identifying the names of items. Identifying the intended target item from patient’s circumlocution involves the two challenges of term failure and error. (1) The terms relevant to identifying the item remain unseen. (2) What makes the challenge unique is inherent perturbed terms by semantic paraphasia, which are not exactly related to the target item, hindering the identification process. To address each, we propose robustifying the model from semantically paraphasic errors and enhancing the model with unseen terms with gradient-based selective augmentation (GradSelect). Specifically, the gradient value controls augmented data quality amid semantic errors, while the gradient variance guides the inclusion of unseen but relevant terms. Due to limited domain-specific datasets, we evaluate the model on the Tip of the Tongue dataset as an intermediary task and then apply our findings to real patient data from AphasiaBank. Our results demonstrate strong performance against baselines, aiding anomia patients by addressing the outlined challenges.

ListT5: Listwise Reranking with Fusion-in-Decoder Improves Zero-shot Retrieval
Soyoung Yoon | Eunbi Choi | Jiyeon Kim | Hyeongu Yun | Yireun Kim | Seung-won Hwang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We propose ListT5, a novel reranking approach based on Fusion-in-Decoder (FiD) that handles multiple candidate passages at both train and inference time. We also introduce an efficient inference framework for listwise ranking based on m-ary tournament sort with output caching. We evaluate and compare our model on the BEIR benchmark for zero-shot retrieval task, demonstrating that ListT5 (1) outperforms the state-of-the-art RankT5 baseline with a notable +1.3 gain in the average NDCG@10 score, (2) has an efficiency comparable to pointwise ranking models and surpasses the efficiency of previous listwise ranking models, and (3) overcomes the lost-in-the-middle problem of previous listwise rerankers. Our code, model checkpoints, and the evaluation framework will be fully open-sourced.

ContrastiveMix: Overcoming Code-Mixing Dilemma in Cross-Lingual Transfer for Information Retrieval
Junggeun Do | Jaeseong Lee | Seung-won Hwang
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

Multilingual pretrained language models (mPLMs) have been widely adopted in cross-lingual transfer, and code-mixing has demonstrated effectiveness across various tasks in the absence of target language data. Our contribution involves an in-depth investigation into the counterproductive nature of training mPLMs on code-mixed data for information retrieval (IR). Our finding is that while code-mixing demonstrates a positive effect in aligning representations across languages, it hampers the IR-specific objective of matching representations between queries and relevant passages. To balance between positive and negative effects, we introduce ContrastiveMix, which disentangles contrastive loss between these conflicting objectives, thereby enhancing zero-shot IR performance. Specifically, we leverage both English and code-mixed data and employ two contrastive loss functions, by adding an additional contrastive loss that aligns embeddings of English data with their code-mixed counterparts in the query encoder. Our proposed ContrastiveMix exhibits statistically significant outperformance compared to mDPR, particularly in scenarios involving lower linguistic similarity, where the conflict between goals is more pronounced.

ArchCode: Incorporating Software Requirements in Code Generation with Large Language Models
Hojae Han | Jaejin Kim | Jaeseok Yoo | Youngwon Lee | Seung-won Hwang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

This paper aims to extend the code generation capability of large language models (LLMs) to automatically manage comprehensive software requirements from given textual descriptions. Such requirements include both functional (i.e., achieving expected behavior for inputs) and non-functional (e.g., time/space performance, robustness, maintainability) requirements. However, textual descriptions can either express requirements verbosely or may even omit some of them. We introduce ARCHCODE, a novel framework that leverages in-context learning to organize requirements observed in descriptions and to extrapolate unexpressed requirements from them. ARCHCODE generates requirements from given descriptions, conditioning them to produce code snippets and test cases. Each test case is tailored to one of the requirements, allowing for the ranking of code snippets based on the compliance of their execution results with the requirements. Public benchmarks show that ARCHCODE better satisfies functional requirements, significantly improving Pass@k scores. Furthermore, we introduce HumanEval-NFR, the first evaluation of LLMs’ non-functional requirements in code generation, demonstrating ARCHCODE’s superiority over baseline methods. The implementation of ARCHCODE and the HumanEval-NFR benchmark are both publicly accessible.

Breaking ReLU Barrier: Generalized MoEfication for Dense Pretrained Models
Jaeseong Lee | Seung-won Hwang | Wonpyo Park | Mingi Ji
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

As the scale of language models (LMs) continues to grow, there is a heightened interest in reducing the inference cost associated with these models. Mixture-of-Experts (MoEs) present an efficient alternative to dense models, while the existing methods to convert pretrained dense models to MoEs are limited to ReLU-based models with natural sparsity. This paper introduces G-MoEfication, applicable to arbitrary dense models, where ReLU-based activation sparsity assumptions no longer hold. For generalizations, we encounter the dilemma of needing to zero-out deactivated experts, while also avoiding excessive zeroing-out to retain dense activation information. We publicly release our code and report results conducted with mBERT, SantaCoder-1.1B, Phi-2-2.7B, and Falcon-7B demonstrating the efficacy of our approach in general scenarios: from multitask to multilingual, from fine-tuning to zero-shot evaluation.

RaDA: Retrieval-augmented Web Agent Planning with LLMs
Minsoo Kim | Victor Bursztyn | Eunyee Koh | Shunan Guo | Seung-won Hwang
Findings of the Association for Computational Linguistics: ACL 2024

Agents powered by large language models (LLMs) inherit important limitations, such as the restricted context length, dependency on human-engineered exemplars (e.g., for task decomposition), and insufficient generalization. To address these challenges, we propose RaDA, a novel planning method for Web agents that does not require manual exemplars, efficiently leverages the LLMs’ context, and enhances generalization. RaDA disentangles planning into two stages: for a new given task, during Retrieval-augmented Task Decomposition (RaD), it decomposes tasks into high-level subtasks; next, during Retrieval-augmented Action Generation (RaA), it traverses the trajectory obtained with RaD to iteratively synthesize actions based on dynamically retrieved exemplars. We compare RaDA with strong baselines covering a broad space of design choices, using both GPT-3.5 and GPT-4 as backbones; and we find consistent improvements over previous SOTA in two challenging benchmarks, CompWoB and Mind2Web, covering settings with different complexities. We show the contributions of RaDA via ablation studies and qualitative analysis; and we discuss the structural benefits of our more compositional design.

ScriptMix: Mixing Scripts for Low-resource Language Parsing
Jaeseong Lee | Dohyeon Lee | Seung-won Hwang
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Despite the success of multilingual pretrained language models (mPLMs) for tasks such as dependency parsing (DEP) or part-of-speech (POS) tagging, their coverage of 100s of languages is still limited, as most of the 6500+ languages remain “unseen”. To adapt mPLMs for including such unseen languages, existing work has considered transliteration and vocabulary augmentation. Meanwhile, the consideration of combining the two has been surprisingly lacking. To understand why, we identify both complementary strengths of the two, and the hurdles to realizing it. Based on this observation, we propose ScriptMix, combining two strengths, and overcoming the hurdle. Specifically, ScriptMix a) is trained with dual-script corpus to combine strengths, but b) with separate modules to avoid gradient conflict. In combining modules properly, we also point out the limitation of the conventional method AdapterFusion, and propose AdapterFusion+ to overcome it. We empirically show ScriptMix is effective: ScriptMix improves the POS accuracy by up to 14%, and improves the DEP LAS score by up to 5.6%. Our code is publicly available.

Evidentiality-aware Retrieval for Overcoming Abstractiveness in Open-Domain Question Answering
Yongho Song | Dahyun Lee | Myungha Jang | Seung-won Hwang | Kyungjae Lee | Dongha Lee | Jinyoung Yeo
Findings of the Association for Computational Linguistics: EACL 2024

The long-standing goal of dense retrievers in abstractive open-domain question answering (ODQA) tasks is to learn to capture evidence passages among relevant passages for any given query, such that the reader produces factually correct outputs from evidence passages. One of the key challenges is the insufficient amount of training data with the supervision of the answerability of the passages. Recent studies rely on iterative pipelines to annotate answerability using signals from the reader, but their high computational costs hamper practical applications. In this paper, we instead focus on a data-driven approach and propose Evidentiality-Aware Dense Passage Retrieval (EADPR), which leverages synthetic distractor samples to learn to discriminate evidence passages from distractors. We conduct extensive experiments to validate the effectiveness of our proposed method on multiple abstractive ODQA tasks.

Interventional Speech Noise Injection for ASR Generalizable Spoken Language Understanding
YeonJoon Jung | Jaeseong Lee | Seungtaek Choi | Dohyeon Lee | Minsoo Kim | Seung-won Hwang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Recently, pre-trained language models (PLMs) have been increasingly adopted in spoken language understanding (SLU). However, automatic speech recognition (ASR) systems frequently produce inaccurate transcriptions, leading to noisy inputs for SLU models, which can significantly degrade their performance. To address this, our objective is to train SLU models to withstand ASR errors by exposing them to noises commonly observed in ASR systems, referred to as ASR-plausible noises. Speech noise injection (SNI) methods have pursued this objective by introducing ASR-plausible noises, but we argue that these methods are inherently biased towards specific ASR systems, or ASR-specific noises. In this work, we propose a novel and less biased augmentation method of introducing the noises that are plausible to any ASR system, by cutting off the non-causal effect of noises. Experimental results and analyses demonstrate the effectiveness of our proposed methods in enhancing the robustness and generalizability of SLU models against unseen ASR systems by introducing more diverse and plausible ASR noises in advance.

QuBE: Question-based Belief Enhancement for Agentic LLM Reasoning
Minsoo Kim | Jongyoon Kim | Jihyuk Kim | Seung-won Hwang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Despite advancements in Large Language Models (LLMs), many complex tasks are not easily solved in a single inference step, requiring the use of agentic LLMs in interactive environments. However, agentic LLMs suffer from a phenomenon known as reasoning derailment, due to the indiscriminate incorporation of observations from partially observable environments. We introduce QuBE, a method that enhances agents’ focus on task-relevant contexts, by constructing a belief state via question answering. We validate QuBE through experiments in two agentic LLM scenarios with partial observability: 1) a canonical interactive decision-making scenario using text-based game engines, and 2) an interactive retrieval-augmented generation (RAG) scenario using search engines. In the AlfWorld text-based game, QuBE outperforms established baselines by substantial margins, and in the search engine scenario, it achieves marked improvements on the BeIR zero-shot retrieval benchmark. The results demonstrate that QuBE significantly mitigates reasoning derailment, refining the decision-making process of LLM agents in partially observed environments.

Disentangling Questions from Query Generation for Task-Adaptive Retrieval
Yoonsang Lee | Minsoo Kim | Seung-won Hwang
Findings of the Association for Computational Linguistics: EMNLP 2024

This paper studies the problem of information retrieval, to adapt to unseen tasks. Existing work generates synthetic queries from domain-specific documents to jointly train the retriever. However, the conventional query generator assumes the query as a question, thus failing to accommodate general search intents. A more lenient approach incorporates task-adaptive elements, such as few-shot learning with a 137B LLM. In this paper, we challenge the trend of equating query and question, and instead conceptualize the query generation task as a “compilation” of high-level intent into a task-adaptive query. Specifically, we propose EGG, a query generator that better adapts to wide search intents expressed in the BeIR benchmark. Our method outperforms baselines and existing models on four tasks with underexplored intents, while utilizing a query generator 47 times smaller than the previous state-of-the-art. Our findings reveal that instructing the LM with explicit search intent is a key aspect of modeling an effective query generator.

Chaining Event Spans for Temporal Relation Grounding
Jongho Kim | Dohyeon Lee | Minsoo Kim | Seung-won Hwang
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Accurately understanding temporal relations between events is a critical building block of diverse tasks, such as temporal reading comprehension (TRC) and relation extraction (TRE). For example, in TRC, we need to understand the temporal semantic differences between the following two questions that are lexically near-identical: “What finished right before the decision?” or “What finished right after the decision?”. To discern the two questions, existing solutions have relied on answer overlaps as a proxy label to contrast similar and dissimilar questions. However, we claim that answer overlap can lead to unreliable results, due to spurious overlaps of two dissimilar questions with coincidentally identical answers. To address the issue, we propose a novel approach that elicits proper reasoning behaviors through a module for predicting time spans of events. We introduce the Timeline Reasoning Network (TRN) operating in a two-step inductive reasoning process: In the first step, the model initially answers each question with semantic and syntactic information. The next step chains multiple questions on the same event to predict a timeline, which is then used to ground the answers. Results on TORQUE and TB-dense, for the TRC and TRE tasks respectively, demonstrate that TRN outperforms previous methods by effectively resolving the spurious overlaps using the predicted timeline.

2023

Learning to Rank Generation with Pairwise Partial Rewards
Youngwon Lee | Jinu Lee | Seung-won Hwang
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

This paper studies the use of reinforcement learning for conditional text generation, which overcomes the limitation of the prevalent supervised maximum likelihood estimation approach. However, it still suffers from challenges including the large action space and the delayed reward, as the reward can be computed only after an entire sequence is generated. To address these challenges, we propose a method that provides partial rewards for intermediate actions taken on partial sequences. This enables the model to promptly prioritize actions that lead to the generation of more desirable sequences. Our method’s key contribution lies in its focus on distinguishing relatively more desirable actions rather than striving to precisely estimate pointwise values for arbitrary partial sequences. Instead, our model learns to discern the relative desirability between pairs of actions, or rank actions in a pairwise manner, only when necessary and feasible. This is materialized in an efficient way by leveraging the prefix tree constructed from the sampled sequences. Experimental results on paraphrase generation and constrained machine translation tasks showcase the effectiveness of our method.

Relevance-assisted Generation for Robust Zero-shot Retrieval
Jihyuk Kim | Minsoo Kim | Joonsuk Park | Seung-won Hwang
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track

Zero-shot retrieval tasks such as the BEIR benchmark reveal out-of-domain generalization as a key weakness of high-performance dense retrievers. As a solution, domain adaptation for dense retrievers has been actively studied. A notable approach is synthesizing domain-specific data, by generating pseudo queries (PQ), for fine-tuning with domain-specific relevance between PQ and documents. Our contribution is showing that key biases can cause sampled PQ to be irrelevant, negatively contributing to generalization. We propose to preempt their generation, by dividing the generation into simpler subtasks, of generating relevance explanations and guiding the generation to avoid negative generalization. Experiment results show that our proposed approach is more robust to domain shifts, validated on challenging BEIR zero-shot retrieval tasks.

Two Examples are Better than One: Context Regularization for Gradient-based Prompt Tuning
Hyeonmin Ha | Soyoung Jung | Jinsol Park | Minjoon Seo | Seung-won Hwang | Byung-Gon Chun
Findings of the Association for Computational Linguistics: ACL 2023

Prompting has gained tremendous attention as an efficient method for the adaptation of large-scale language models. However, prompts often act against human intuition and report unstable performances, which has motivated methods that automatically find effective prompts. One popular approach is gradient-based search, which iteratively updates a (randomly) initialized prompt towards the optimal one with the guide of gradients. We propose a novel regularization method, CoRe, for gradient-based prompt tuning techniques, which guides a prompt to produce a task context properly. CoRe realizes two regularization effects, context attuning and context filtering, that improve prediction performance in a zero-shot in-context learning setting where a model makes inferences only with the prompt tuned by CoRe, without any demonstration examples for in-context learning. Context attuning guides the context generated by the input and the tuned prompt toward embedding the appropriate context for the task. In our theoretical analysis, regularizing the context extends to improving zero-shot in-context learning performance. Context filtering steers the prompt to select only the task-related context so that context attuning solely focuses on creating and sending the right task context. We evaluate CoRe on natural language understanding datasets and two large language models, GPT2-XL and GPT-J. Our training scheme shows performance improvements up to 11.9% on GPT2-XL, and up to 6.3% on GPT-J in zero-shot settings.

CR-COPEC: Causal Rationale of Corporate Performance Changes to learn from Financial Reports
Ye Chun | Sunjae Kwon | Kyunghwan Sohn | Nakwon Sung | Junyoup Lee | Byoung Seo | Kevin Compher | Seung-won Hwang | Jaesik Choi
Findings of the Association for Computational Linguistics: EMNLP 2023

In this paper, we introduce CR-COPEC, short for Causal Rationale of Corporate Performance Changes from financial reports. This is a comprehensive large-scale domain-adaptation causal sentence dataset to detect financial performance changes of corporations. CR-COPEC contributes to two major achievements. First, it detects causal rationale from 10-K annual reports of U.S. companies, which contain experts’ causal analysis following accounting standards in a formal manner. This dataset can be widely used by both individual investors and analysts as material information resources for investing and decision-making without tremendous effort to read through all the documents. Second, it carefully considers different characteristics which affect the financial performance of companies in twelve industries. As a result, CR-COPEC can distinguish causal sentences in various industries by taking unique narratives in each industry into consideration. We also provide an extensive analysis of how well the CR-COPEC dataset is constructed and suited for classifying target sentences as causal ones with respect to industry characteristics.

On Sample-Efficient Code Generation
Hojae Han | Yu Jin Kim | Byoungjip Kim | Youngwon Lee | Kyungjae Lee | Kyungmin Lee | Moontae Lee | Kyunghoon Bae | Seung-won Hwang
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track

Large language models often struggle to predict runtime behavior in code generation tasks, leading to a reliance on rejection sampling (best-of-n) to generate multiple code snippets then select the best. Our distinction is reducing sampling costs, without compromising generation quality. We introduce EFFICODE, a novel framework that prioritizes sampling on test problems that models can solve. We show how EFFICODE estimates solvability to optimize computational costs during multiple sampling. Based on empirical evidence, EFFICODE consistently demonstrates reduced sampling budgets while maintaining comparable code generation performance, especially when problems are challenging. In addition, utilizing EFFICODE to rank sampled code snippets also shows its effectiveness in answer code selection for reducing temporal costs, by not requiring any execution or test case generation.

On Interfacing Tip-of-the-tongue References In Movie Cases
Jongho Kim | Soona Hong | Seung-won Hwang
Proceedings of the Second Workshop on Natural Language Interfaces

Consistency is Key: On Data-Efficient Modality Transfer in Speech Translation
Hojin Lee | Changmin Lee | Seung-won Hwang
Findings of the Association for Computational Linguistics: EMNLP 2023

End-to-end approaches have shown promising results for speech translation (ST), but they suffer from its data scarcity compared to machine translation (MT). To address this, progressive training has become a common practice, of using external MT data during the fine-tuning phase. Despite its prevalence and computational overhead, its validity is not extensively corroborated yet. This paper conducts an empirical investigation and finds that progressive training is ineffective. We identify learning-forgetting trade-off as a critical obstacle, then hypothesize and verify that consistency learning (CL) breaks the dilemma of learning-forgetting. The proposed method, which combines knowledge distillation (KD) and CL, outperforms the previous methods on MuST-C dataset even without additional data, and our proposed consistency-informed KD achieves additional improvements against KD+CL. Code and models are available at https://github.com/hjlee1371/consistency-s2tt.

Intervention-Based Alignment of Code Search with Execution Feedback
Hojae Han | Minsoo Kim | Seung-won Hwang | Nan Duan | Shuai Lu
Findings of the Association for Computational Linguistics: EMNLP 2023

One of the fundamental goals in code search is to retrieve functionally correct code for a given natural language query. As annotating for correctness requires executing test cases (i.e., obtaining execution feedback), existing code search training datasets approximate text-code co-occurrences as positive execution feedback. However, this approximation may misalign models’ retrieval decisions from ground-truth correctness. To address this limitation, we propose Code Intervention-based Reinforcement Learning (CIRL) that perturbs training code to result in misalignment (i.e., code intervention), then tests models’ decisions and corrects them with the execution feedback by reinforcement learning. The first technical contribution of CIRL is to induce the execution feedback from perturbation, without actual execution. Secondly, CIRL introduces structural perturbations using abstract syntax trees, going beyond simple lexical changes. Experimental results on various datasets demonstrate the effectiveness of CIRL compared to conventional approaches.

Multilingual Lottery Tickets to Pretrain Language Models
Jaeseong Lee | Seung-won Hwang
Findings of the Association for Computational Linguistics: EMNLP 2023

The curse of multilinguality in training multilingual pretrained language models (mPLMs) refers to the negative interference between languages, especially when the capacity is limited. While increasing the capacity may appear intuitive for overcoming this curse, it negatively affects both training and inference costs. Our distinction is pursuing the competing goals of reducing negative interference, while keeping capacity per language more or less the same. Specifically, we first scale the model to reduce interference, then search for a per-language subnetwork, or a lottery ticket, with comparable performance to the full model. According to the lottery ticket hypothesis, this scale-then-find-ticket approach alleviates interfering signals as in the scaled model, but redistributes parameters to keep the parameters reduced. Finally, to avoid the cost of multiple retraining for searching multilingual tickets, we explore zero-shot neural architecture search (NAS) methods. We investigate the most appropriate zero-shot NAS method to find multilingual tickets. Our proposed multilingual tickets reduce the inference cost of models for each language, while boosting the performances. The ticket search cost is negligible and tickets found qualitatively preserve linguistic similarity. Our code is publicly available.

Retrieval-augmented Video Encoding for Instructional Captioning
Yeonjoon Jung | Minsoo Kim | Seungtaek Choi | Jihyuk Kim | Minji Seo | Seung-won Hwang
Findings of the Association for Computational Linguistics: ACL 2023

Instructional videos make learning knowledge more efficient, by providing a detailed multimodal context of each procedure in instruction. A unique challenge posed by instructional videos is key-object degeneracy, where any single modality fails to sufficiently capture the key objects referred to in the procedure. For machine systems, such degeneracy can disturb the performance of a downstream task such as dense video captioning, leading to the generation of incorrect captions omitting key objects. To repair degeneracy, we propose a retrieval-based framework to augment the model representations in the presence of such key-object degeneracy. We validate the effectiveness and generalizability of our proposed framework over baselines using modalities with key-object degeneracy.

On Complementarity Objectives for Hybrid Retrieval
Dohyeon Lee | Seung-won Hwang | Kyungjae Lee | Seungtaek Choi | Sunghyun Park
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Dense retrieval has shown promising results in various information retrieval tasks, and hybrid retrieval, combined with the strength of sparse retrieval, has also been actively studied. A key challenge in hybrid retrieval is to make sparse and dense complementary to each other. Existing models have focused on dense models to capture “residual” features neglected in the sparse models. Our key distinction is to show how this notion of residual complementarity is limited, and propose a new objective, denoted as RoC (Ratio of Complementarity), which captures a fuller notion of complementarity. We propose a two-level orthogonality designed to improve RoC, then show that the improved RoC of our model, in turn, improves the performance of hybrid retrieval. Our method outperforms all state-of-the-art methods on three representative IR benchmarks: MSMARCO-Passage, Natural Questions, and TREC Robust04, with statistical significance. Our finding is also consistent in various adversarial settings.

When to Read Documents or QA History: On Unified and Selective Open-domain QA
Kyungjae Lee | Sang-eun Han | Seung-won Hwang | Moontae Lee
Findings of the Association for Computational Linguistics: ACL 2023

This paper studies the problem of open-domain question answering, with the aim of answering a diverse range of questions leveraging knowledge resources. Two types of sources, QA-pair and document corpora, have been actively leveraged, with the following complementary strengths. The former is highly precise when a paraphrase of the given question q was seen and answered during training, often posed as a retrieval problem, while the latter generalizes better to unseen questions. A natural follow-up is thus leveraging both models, yet naive pipelining or integration approaches have failed to bring additional gains over either model alone. Our distinction is interpreting the problem as calibration, which estimates the confidence of predicted answers as an indicator to decide when to use a document or QA-pair corpus. The effectiveness of our method was validated on widely adopted benchmarks such as Natural Questions and TriviaQA.

On Consistency Training for Language-Based Image Editing Interface
Youngwon Lee | Ayoung Lee | Yeonjoon Jung | Seung-won Hwang
Proceedings of the Second Workshop on Natural Language Interfaces

2022

Privacy-Preserving Text Classification on BERT Embeddings with Homomorphic Encryption
Garam Lee | Minsoo Kim | Jai Hyun Park | Seung-won Hwang | Jung Hee Cheon
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Embeddings, which compress information in raw text into semantics-preserving low-dimensional vectors, have been widely adopted for their efficacy. However, recent research has shown that embeddings can potentially leak private information about sensitive attributes of the text, and in some cases, can be inverted to recover the original input text. To address these growing privacy challenges, we propose a privatization mechanism for embeddings based on homomorphic encryption, to prevent potential leakage of any piece of information in the process of text classification. In particular, our method performs text classification on the encryption of embeddings from state-of-the-art models like BERT, supported by an efficient GPU implementation of the CKKS encryption scheme. We show that our method offers encrypted protection of BERT embeddings, while largely preserving their utility on downstream text classification tasks.

PLM-based World Models for Text-based Games
Minsoo Kim | Yeonjoon Jung | Dohyeon Lee | Seung-won Hwang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

World models have improved the ability of reinforcement learning agents to operate in a sample-efficient manner, by being trained to predict plausible changes in the underlying environment. As the core tasks of world models are future prediction and commonsense understanding, our claim is that pre-trained language models (PLMs) already provide a strong base upon which to build world models. Worldformer is a recently proposed world model for text-based game environments, based only partially on PLMs and transformers. Our distinction is to fully leverage PLMs as actionable world models in text-based game environments, by reformulating generation as constrained decoding which decomposes actions into verb templates and objects. We show that our model improves future valid action prediction and graph change prediction. Additionally, we show that our model better reflects commonsense than a standard PLM.

Pseudo-Relevance for Enhancing Document Representation
Jihyuk Kim | Seung-won Hwang | Seoho Song | Hyeseon Ko | Young-In Song
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

This paper studies how to enhance the document representation for the bi-encoder approach in dense document retrieval. The bi-encoder, separately encoding a query and a document as a single vector, is favored for high efficiency in large-scale information retrieval, compared to more effective but complex architectures. To combine the strength of the two, the multi-vector representation of documents for bi-encoder, such as ColBERT preserving all token embeddings, has been widely adopted. Our contribution is to reduce the size of the multi-vector representation, without compromising the effectiveness, supervised by query logs. Our proposed solution decreases the latency and the memory footprint, up to 8- and 3-fold, validated on MSMARCO and real-world search query logs.

ReACC: A Retrieval-Augmented Code Completion Framework
Shuai Lu | Nan Duan | Hojae Han | Daya Guo | Seung-won Hwang | Alexey Svyatkovskiy
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Code completion, which aims to predict the following code token(s) according to the code context, can improve the productivity of software development. Recent work has proved that statistical language modeling with transformers can greatly improve the performance in the code completion task via learning from large-scale source code datasets. However, current approaches focus only on code context within the file or project, i.e., internal context. Our distinction is utilizing “external” context, inspired by the human behavior of copying from related code snippets when writing code. Specifically, we propose a retrieval-augmented code completion framework, leveraging both lexical copying and referring to code with similar semantics by retrieval. We adopt a stage-wise training approach that combines a source code retriever and an auto-regressive language model for programming languages. We evaluate our approach on the code completion task in the Python and Java programming languages, achieving state-of-the-art performance on the CodeXGLUE benchmark.

FAD-X: Fusing Adapters for Cross-lingual Transfer to Low-Resource Languages
Jaeseong Lee | Seung-won Hwang | Taesup Kim
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Adapter-based tuning, by adding light-weight adapters to multilingual pretrained language models (mPLMs), selectively updates language-specific parameters to adapt to a new language, instead of finetuning all shared weights. This paper explores an effective way to leverage a public pool of pretrained language adapters, to overcome resource imbalances for low-resource languages (LRLs). Specifically, our research question is whether pretrained adapters can be composed to complement or replace LRL adapters. While composing adapters for the multi-task learning setting has been studied, the same question for LRLs has remained largely unanswered. To answer this question, we study how to fuse adapters across languages and tasks, then validate how our proposed fusion adapter, namely FAD-X, can enhance cross-lingual transfer from pretrained adapters, on well-known named entity recognition and classification benchmarks.

Normalizing Mutual Information for Robust Adaptive Training for Translation
Youngwon Lee | Changmin Lee | Hojin Lee | Seung-won Hwang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Despite the success of neural machine translation models, tensions between fluency of optimizing target language modeling and source-faithfulness remain as challenges. Previously, Conditional Bilingual Mutual Information (CBMI), a scoring metric for the importance of target sentences and tokens, was proposed to encourage fluent and faithful translations. The score is obtained by combining the probability from the translation model and the target language model, which is then used to assign different weights to losses from sentences and tokens. Meanwhile, we argue this metric is not properly normalized, for which we propose Normalized Pointwise Mutual Information (NPMI). NPMI utilizes an additional language model on source language to approximate the joint likelihood of source-target pair and the likelihood of the source, which is then used for normalizing the score. We showed that NPMI better captures the dependence between source-target and that NPMI-based token-level adaptive training brings improvements over baselines with empirical results from En-De, De-En, and En-Ro translation tasks.
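
For reference, the textbook definition of normalized pointwise mutual information divides PMI by the negative log of the joint probability, which bounds the score in [-1, 1]. The sketch below computes that standard quantity from log-probabilities. The function name and the toy probabilities are illustrative assumptions, and this is not the paper's implementation, which (per the abstract) approximates the joint and source likelihoods with language models.

```python
import math


def npmi(log_p_xy: float, log_p_x: float, log_p_y: float) -> float:
    """Standard NPMI: PMI(x, y) / -log p(x, y), which lies in [-1, 1].

    All arguments are natural-log probabilities (hypothetical inputs here;
    in practice they would come from model scores).
    """
    if log_p_xy >= 0.0:
        raise ValueError("joint probability must be strictly less than 1")
    pmi = log_p_xy - log_p_x - log_p_y
    return pmi / -log_p_xy


# Toy check 1: x and y independent, p(x, y) = p(x) * p(y), so NPMI is about 0.
independent = npmi(math.log(0.06), math.log(0.2), math.log(0.3))

# Toy check 2: x and y always co-occur, p(x, y) = p(x) = p(y), so NPMI is 1.
always_together = npmi(math.log(0.2), math.log(0.2), math.log(0.2))
```

Unlike raw PMI, whose range depends on how rare the events are, the normalized score is comparable across pairs with very different frequencies.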

Plug-and-Play Adaptation for Continuously-updated QA
Kyungjae Lee | Wookje Han | Seung-won Hwang | Hwaran Lee | Joonsuk Park | Sang-Woo Lee
Findings of the Association for Computational Linguistics: ACL 2022

Language models (LMs) have shown great potential as implicit knowledge bases (KBs). For their practical use, knowledge in LMs needs to be updated periodically. However, existing tasks to assess LMs’ efficacy as KBs do not adequately consider multiple large-scale updates. To this end, we first propose a novel task, Continuously-updated QA (CuQA), in which multiple large-scale updates are made to LMs, and the performance is measured with respect to the success in adding and updating knowledge while retaining existing knowledge. We then present LMs with plug-in modules that effectively handle the updates. Experiments conducted on zsRE QA and NQ datasets show that our method outperforms existing approaches. We find that our method is 4x more effective in terms of updates/forgets ratio, compared to a fine-tuning baseline.

BotsTalk: Machine-sourced Framework for Automatic Curation of Large-scale Multi-skill Dialogue Datasets
Minju Kim | Chaehyeong Kim | Yong Ho Song | Seung-won Hwang | Jinyoung Yeo
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

To build open-domain chatbots that are able to use diverse communicative skills, we propose a novel framework BotsTalk, where multiple agents grounded to the specific target skills participate in a conversation to automatically annotate multi-skill dialogues. We further present Blended Skill BotsTalk (BSBT), a large-scale multi-skill dialogue dataset comprising 300K conversations. Through extensive experiments, we demonstrate that our dataset can be effective for multi-skill dialogue systems which require an understanding of skill blending as well as skill grounding. Our code and data are available at https://github.com/convei-lab/BotsTalk.

Collective Relevance Labeling for Passage Retrieval
Jihyuk Kim | Minsoo Kim | Seung-won Hwang
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Deep learning for Information Retrieval (IR) requires a large amount of high-quality query-document relevance labels, but such labels are inherently sparse. Label smoothing redistributes some observed probability mass over unobserved instances, often uniformly, uninformed of the true distribution. In contrast, we propose knowledge distillation for informed labeling, without incurring high computation overheads at evaluation time. Our contribution is designing a simple but efficient teacher model which utilizes collective knowledge, to outperform state-of-the-art models distilled from a more complex teacher model. Specifically, we train up to 8× faster than the state-of-the-art teacher, while distilling the rankings better. Our code is publicly available at https://github.com/jihyukkim-nlp/CollectiveKD.
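
To make the baseline concrete, the following is a minimal sketch of uniform label smoothing over the candidate passages of one query, i.e. the uninformed redistribution that the abstract contrasts with distillation. The function name, the `epsilon` value, and the toy labels are illustrative assumptions, and this is not the paper's proposed method.

```python
from typing import List


def smooth_labels(labels: List[float], epsilon: float = 0.1) -> List[float]:
    """Uniform label smoothing for one query's candidate passages.

    The observed labels are normalized into a distribution, then a fraction
    `epsilon` of the probability mass is spread evenly over all candidates,
    regardless of how relevant the unlabeled ones actually are.
    """
    n = len(labels)
    total = sum(labels)
    if n == 0 or total <= 0:
        raise ValueError("need at least one candidate with a positive label")
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must be in [0, 1]")
    return [(1.0 - epsilon) * (y / total) + epsilon / n for y in labels]


# One labeled positive among four candidates (toy data).
smoothed = smooth_labels([1.0, 0.0, 0.0, 0.0], epsilon=0.1)
```

Every unlabeled passage receives the same share here, which is exactly the "uninformed" behavior that an informed teacher distribution is meant to replace.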

Mind the Gap! Injecting Commonsense Knowledge for Abstractive Dialogue Summarization
Seungone Kim | Se June Joo | Hyungjoo Chae | Chaehyeong Kim | Seung-won Hwang | Jinyoung Yeo
Proceedings of the 29th International Conference on Computational Linguistics

In this paper, we propose to leverage the unique characteristics of dialogues sharing commonsense knowledge across participants, to resolve the difficulties in summarizing them. We present SICK, a framework that uses commonsense inferences as additional context. Compared to previous work that solely relies on the input dialogue, SICK uses an external knowledge model to generate a rich set of commonsense inferences and selects the most probable one with a similarity-based selection method. Built upon SICK, SICK++ utilizes commonsense as supervision, where the task of generating commonsense inferences is added upon summarizing the dialogue in a multi-task learning setting. Experimental results show that with injected commonsense knowledge, our framework generates more informative and consistent summaries than existing methods.

Modularized Transfer Learning with Multiple Knowledge Graphs for Zero-shot Commonsense Reasoning
Yu Jin Kim | Beong-woo Kwak | Youngwook Kim | Reinald Kim Amplayo | Seung-won Hwang | Jinyoung Yeo
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Commonsense reasoning systems should be able to generalize to diverse reasoning cases. However, most state-of-the-art approaches depend on expensive data annotations and overfit to a specific benchmark without learning how to perform general semantic reasoning. To overcome these drawbacks, zero-shot QA systems have shown promise as a robust learning scheme by transforming a commonsense knowledge graph (KG) into synthetic QA-form samples for model training. Considering the increasing variety of commonsense KGs, this paper aims to extend the zero-shot transfer learning scenario into multiple-source settings, where different KGs can be utilized synergetically. Towards this goal, we propose to mitigate the loss of knowledge from the interference among the different knowledge sources, by developing a modular variant of the knowledge aggregation as a new zero-shot commonsense reasoning framework. Results on five commonsense reasoning benchmarks demonstrate the efficacy of our framework, improving the performance with multiple KGs.

Towards Compositional Generalization in Code Search
Hojae Han | Seung-won Hwang | Shuai Lu | Nan Duan | Seungtaek Choi
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

We study compositional generalization, which aims to generalize on unseen combinations of seen structural elements, for code search. Unlike existing approaches that only partially pursue this goal, we study how to extract structural elements, which we call templates, that directly target compositional generalization. Thus we propose CTBERT, or Code Template BERT, representing code using automatically extracted templates as building blocks. We empirically validate CTBERT on two public code search benchmarks, AdvTest and CSN. Further, we show that templates are complementary to data flow graphs in GraphCodeBERT, by enhancing structural context around variables.

Debiasing Event Understanding for Visual Commonsense Tasks
Minji Seo | YeonJoon Jung | Seungtaek Choi | Seung-won Hwang | Bei Liu
Findings of the Association for Computational Linguistics: ACL 2022

We study event understanding as a critical step towards visual commonsense tasks. Meanwhile, we argue that current object-based event understanding is purely likelihood-based, leading to incorrect event prediction, due to biased correlation between events and objects. We propose to mitigate such biases with do-calculus, proposed in causality research, but overcoming its limited robustness, by an optimized aggregation with association-based prediction. We show the effectiveness of our approach, intrinsically by comparing our generated events with ground-truth event annotation, and extrinsically by downstream commonsense tasks.

2021

Query Generation for Multimodal Documents
Kyungho Kim | Kyungjae Lee | Seung-won Hwang | Young-In Song | Seungwook Lee
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

This paper studies the problem of generating likely queries for multimodal documents with images. Our application scenario is enabling efficient “first-stage retrieval” of relevant documents, by attaching generated queries to documents before indexing. We can then index this expanded text to efficiently narrow down to candidate matches using an inverted index, so that expensive reranking can follow. Our evaluation results show that our proposed multimodal representation meaningfully improves relevance ranking. More importantly, our framework can achieve the state of the art in first-stage retrieval scenarios.

Robustifying Multi-hop QA through Pseudo-Evidentiality Training
Kyungjae Lee | Seung-won Hwang | Sang-eun Han | Dohyeon Lee
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

This paper studies the bias problem of multi-hop question answering models, of answering correctly without correct reasoning. One way to robustify these models is to supervise them not only to answer correctly, but also to follow correct reasoning chains. An existing direction is to annotate reasoning chains to train models, requiring expensive additional annotations. In contrast, we propose a new approach to learn evidentiality, deciding whether the answer prediction is supported by correct evidence, without such annotations. Instead, we compare counterfactual changes in answer confidence with and without evidence sentences, to generate “pseudo-evidentiality” annotations. We validate our proposed model on an original set and challenge set in HotpotQA, showing that our method is accurate and robust in multi-hop reasoning.

Structure-Augmented Keyphrase Generation
Jihyuk Kim | Myeongho Jeong | Seungtaek Choi | Seung-won Hwang
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

This paper studies the keyphrase generation (KG) task for scenarios where structure plays an important role. For example, a scientific publication consists of a short title and a long body, where the title can be used for de-emphasizing unimportant details in the body. Similarly, for short social media posts (e.g., tweets), scarce context can be augmented from titles, though these are often missing. Our contribution is generating/augmenting structure and then injecting this information into the encoding, using existing keyphrases of other documents to complement missing/incomplete titles. We propose novel structure-augmented document encoding approaches that consist of the following two phases: The first phase, generating structure, extends the given document with related but absent keyphrases, augmenting missing context. The second phase, encoding structure, builds a graph of keyphrases and the given document to obtain the structure-aware representation of the augmented text. Our empirical results validate that our proposed structure augmentation and augmentation-aware encoding/decoding can improve KG for both scenarios, outperforming the state-of-the-art.

2020

SQuAD2-CR: Semi-supervised Annotation for Cause and Rationales for Unanswerability in SQuAD 2.0
Gyeongbok Lee | Seung-won Hwang | Hyunsouk Cho
Proceedings of the Twelfth Language Resources and Evaluation Conference

Existing machine reading comprehension models are reported to be brittle for adversarially perturbed questions when optimizing only for accuracy, which led to the creation of new reading comprehension benchmarks, such as SQuAD 2.0, which contains such types of questions. However, despite the super-human accuracy of existing models on such datasets, it is still unclear how the model predicts the answerability of the question, potentially due to the absence of a shared annotation for the explanation. To address this absence, we release the SQuAD2-CR dataset, which contains annotations on unanswerable questions from the SQuAD 2.0 dataset, to enable an explanatory analysis of the model prediction. Specifically, we annotate (1) an explanation of why the most plausible answer span cannot be the answer and (2) which part of the question causes unanswerability. We share intuitions and experimental results on how this dataset can be used to analyze and improve the interpretability of existing reading comprehension model behavior.

Label-Efficient Training for Next Response Selection
Seungtaek Choi | Myeongho Jeong | Jinyoung Yeo | Seung-won Hwang
Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing

This paper studies label augmentation for training dialogue response selection. The existing model is trained by “observational” annotation, where one observed response is annotated as gold. In this paper, we propose “counterfactual augmentation” of pseudo-positive labels. We validate that the effectiveness of augmented labels is comparable to that of positives, such that ours outperforms state-of-the-art models without augmentation.

Less is More: Attention Supervision with Counterfactuals for Text Classification
Seungtaek Choi | Haeju Park | Jinyoung Yeo | Seung-won Hwang
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We aim to leverage human and machine intelligence together for attention supervision. Specifically, we show that human annotation cost can be kept reasonably low, while its quality can be enhanced by machine self-supervision. For this goal, we explore the advantage of counterfactual reasoning over the associative reasoning typically used in attention supervision. Our empirical results show that this machine-augmented human attention supervision is more effective than existing methods requiring a higher annotation cost, in text classification tasks, including sentiment analysis and news categorization.

Retrieval-Augmented Controllable Review Generation
Jihyeok Kim | Seungtaek Choi | Reinald Kim Amplayo | Seung-won Hwang
Proceedings of the 28th International Conference on Computational Linguistics

In this paper, we study review generation given a set of attribute identifiers, which are user ID, product ID and rating. This is a difficult subtask of natural language generation since models are limited to the given identifiers, without any specific descriptive information regarding the inputs, when generating the text. The capacity of these models is thus confined and dependent on how well the models can capture vector representations of attributes. We thus propose to additionally leverage references, which are selected from a large pool of texts labeled with one of the attributes, as textual information that enriches inductive biases of given attributes. With these references, we can now pose the problem as an instance of text-to-text generation, which makes the task easier since texts that are syntactically and semantically similar to the output text are provided as input. Using this framework, we address issues such as selecting references from a large candidate set without textual context and improving the model complexity for generation. Our experiments show that our models improve over previous approaches on both automatic and human evaluation metrics.

2019

Soft Representation Learning for Sparse Transfer
Haeju Park | Jinyoung Yeo | Gengyu Wang | Seung-won Hwang
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Transfer learning is effective for improving the performance of tasks that are related, and Multi-task learning (MTL) and Cross-lingual learning (CLL) are important instances. This paper argues that hard-parameter sharing, i.e., hard-coding layers shared across different tasks or languages, cannot generalize well when sharing with a loosely related task. Such a case, which we call sparse transfer, might actually hurt performance, a phenomenon known as negative transfer. Our contribution is using adversarial training across tasks to “soft-code” shared and private spaces, to keep the shared space from becoming too sparse. In CLL, our proposed architecture considers another challenge of dealing with low-quality input.

Categorical Metadata Representation for Customized Text Classification
Jihyeok Kim | Reinald Kim Amplayo | Kyungjae Lee | Sua Sung | Minji Seo | Seung-won Hwang
Transactions of the Association for Computational Linguistics, Volume 7

The performance of text classification has improved tremendously using intelligently engineered neural-based models, especially those injecting categorical metadata as additional information, e.g., using user/product information for sentiment classification. This information has been used to modify parts of the model (e.g., word embeddings, attention mechanisms) such that results can be customized according to the metadata. We observe that current representation methods for categorical metadata, which are devised for human consumption, are not as effective as claimed in popular classification methods, outperformed even by simple concatenation of categorical features in the final layer of the sentence encoder. We conjecture that categorical features are harder to represent for machine use, as available context only indirectly describes the category, and even such context is often scarce (for tail category). To this end, we propose using basis vectors to effectively incorporate categorical metadata on various parts of a neural-based model. This additionally decreases the number of parameters dramatically, especially when the number of categorical features is large. Extensive experiments on various data sets with different properties are performed and show that through our method, we can represent categorical metadata more effectively to customize parts of the model, including unexplored ones, and increase the performance of the model greatly.

Evaluating Research Novelty Detection: Counterfactual Approaches
Reinald Kim Amplayo | Seung-won Hwang | Min Song
Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13)

In this paper, we explore strategies to evaluate models for the task of research paper novelty detection: Given all papers released at a given date, which of the papers discuss new ideas and influence future research? We find that novelty is not a singular concept, and thus inherently lacks ground-truth annotations with cross-annotator agreement, which is a major obstacle in evaluating these models. The test-of-time award is closest to such annotation, but it can only be made retrospectively and is extremely scarce. We thus propose to compare and evaluate models using counterfactual simulations. First, we ask models if they can differentiate papers at time t from a counterfactual paper from future time t+d. Second, we ask models if they can predict the test-of-time award at t+d. These are proxies that can be agreed upon by human annotators and easily augmented by correlated signals, using which evaluation can be done through four tasks: classification, ranking, correlation and feature selection. We show these proxy evaluation methods complement each other regarding error handling, coverage, interpretability, and scope, and thus altogether contribute to the observation of the relative strength of existing models.

MICRON: Multigranular Interaction for Contextualizing RepresentatiON in Non-factoid Question Answering
Hojae Han | Seungtaek Choi | Haeju Park | Seung-won Hwang
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

This paper studies the problem of non-factoid question answering, where the answer may span over multiple sentences. Existing solutions can be categorized into representation- and interaction-focused approaches. We combine their complementary strength, by a hybrid approach allowing multi-granular interactions, but represented at word level, enabling an easy integration with strong word-level signals. Specifically, we propose MICRON: Multigranular Interaction for Contextualizing RepresentatiON, a novel approach which derives contextualized uni-gram representation from n-grams. Our contributions are as follows: First, we enable multi-granular matches between question and answer n-grams. Second, by contextualizing word representation with surrounding n-grams, MICRON can naturally utilize word-based signals for query term weighting, known to be effective in information retrieval. We validate MICRON in two public non-factoid question answering datasets: WikiPassageQA and InsuranceQA, showing our model achieves the state of the art among baselines with reported performances on both datasets.

Learning with Limited Data for Multilingual Reading Comprehension
Kyungjae Lee | Sunghyun Park | Hojae Han | Jinyoung Yeo | Seung-won Hwang | Juho Lee
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

This paper studies the problem of supporting question answering in a new language with limited training resources. As an extreme scenario, when no such resource exists, one can (1) transfer labels from another language, and (2) generate labels from unlabeled data, using a translator and an automatic labeling function, respectively. However, these approaches inevitably introduce noise into the training data, due to translation or generation errors, which requires a judicious use of data with varying confidence. To address this challenge, we propose a weakly-supervised framework that quantifies such noise from automatically generated labels, to deemphasize or fix noisy data in training. On the reading comprehension task, we demonstrate the effectiveness of our model on low-resource languages with varying similarity to English, namely, Korean and French.

NL2pSQL: Generating Pseudo-SQL Queries from Under-Specified Natural Language Questions
Fuxiang Chen | Seung-won Hwang | Jaegul Choo | Jung-Woo Ha | Sunghun Kim
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Generating SQL code from natural language questions (NL2SQL) is an emerging research area. Existing studies have mainly focused on clear scenarios where specified information is fully given to generate a SQL query. However, in developer forums such as Stack Overflow, questions cover more diverse tasks including table manipulation or performance issues, where a table is not specified. The SQL queries posted on Stack Overflow, which we call Pseudo-SQL (pSQL), usually do not contain table schemas and are not necessarily executable, yet are sufficient to guide developers. Here we describe a new task, NL2pSQL, which generates pSQL code from natural language questions on under-specified database issues. In addition, we define two new metrics suitable for the proposed NL2pSQL task, Canonical-BLEU and SQL-BLEU, instead of the conventional BLEU. With a baseline model using a sequence-to-sequence architecture integrated with a denoising autoencoder, we confirm the validity of our task. Experiments show that the proposed NL2pSQL approach yields well-formed queries (up to 43% more than a standard Seq2Seq model). Our code and datasets will be publicly released.

2018

Semi-supervised Training Data Generation for Multilingual Question Answering
Kyungjae Lee | Kyoungho Yoon | Sunghyun Park | Seung-won Hwang
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Mining Cross-Cultural Differences and Similarities in Social Media
Bill Yuchen Lin | Frank F. Xu | Kenny Zhu | Seung-won Hwang
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Cross-cultural differences and similarities are common in cross-lingual natural language understanding, especially for research in social media. For instance, people of distinct cultures often hold different opinions on a single named entity. Also, understanding slang terms across languages requires knowledge of cross-cultural similarities. In this paper, we study the problem of computing such cross-cultural differences and similarities. We present a lightweight yet effective approach, and evaluate it on two novel tasks: 1) mining cross-cultural differences of named entities and 2) finding similar terms for slang across languages. Experimental results show that our framework substantially outperforms a number of baseline methods on both tasks. The framework could be useful for machine translation applications and research in computational social science.

Visual Choice of Plausible Alternatives: An Evaluation of Image-based Commonsense Causal Reasoning
Jinyoung Yeo | Gyeongbok Lee | Gengyu Wang | Seungtaek Choi | Hyunsouk Cho | Reinald Kim Amplayo | Seung-won Hwang
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Entity Commonsense Representation for Neural Abstractive Summarization
Reinald Kim Amplayo | Seonjae Lim | Seung-won Hwang
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

A major proportion of a text summary includes important entities found in the original text. These entities build up the topic of the summary. Moreover, they hold commonsense information once they are linked to a knowledge base. Based on these observations, this paper investigates the usage of linked entities to guide the decoder of a neural text summarizer to generate concise and better summaries. To this end, we leverage an off-the-shelf entity linking system (ELS) to extract linked entities and propose Entity2Topic (E2T), a module easily attachable to a sequence-to-sequence model that transforms a list of entities into a vector representation of the topic of the summary. Currently available ELSs are still not sufficiently effective, possibly introducing unresolved ambiguities and irrelevant entities. We resolve the imperfections of the ELS by (a) encoding entities with selective disambiguation, and (b) pooling entity vectors using firm attention. By applying E2T to a simple sequence-to-sequence model with an attention mechanism as the base model, we see significant improvements in performance on the Gigaword (sentence to title) and CNN (long document to multi-sentence highlights) summarization datasets, by at least 2 ROUGE points.

Cold-Start Aware User and Product Attention for Sentiment Classification
Reinald Kim Amplayo | Jihyeok Kim | Sua Sung | Seung-won Hwang
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The use of user/product information in sentiment analysis is important, especially for cold-start users/products, whose reviews are very limited in number. However, current models do not deal with the cold-start problem, which is typical in review websites. In this paper, we present Hybrid Contextualized Sentiment Classifier (HCSC), which contains two modules: (1) a fast word encoder that returns word vectors embedded with short- and long-range dependency features; and (2) Cold-Start Aware Attention (CSAA), an attention mechanism that considers the existence of the cold-start problem when attentively pooling the encoded word vectors. HCSC introduces shared vectors that are constructed from similar users/products and are used when the original distinct vectors do not have sufficient information (i.e., cold-start). This is decided by a frequency-guided selective gate vector. Our experiments show that, in terms of RMSE, HCSC performs significantly better when compared with previous models on famous datasets, despite having less complexity, and thus can be trained much faster. More importantly, our model performs significantly better than previous models when the training data is sparse and has cold-start problems.

2016

Probabilistic Prototype Model for Serendipitous Property Mining
Taesung Lee | Seung-won Hwang | Zhongyuan Wang
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Besides providing relevant information, amusing users has been an important role of the web. Many web sites provide serendipitous (unexpected but relevant) information to draw user traffic. In this paper, we study the representative scenario of mining an amusing quiz. An existing approach leverages a knowledge base to mine an unexpected property and then finds quiz questions on that property, based on prototype theory in cognitive science. However, the existing deterministic model is vulnerable to noise in the knowledge base. Therefore, we instead propose to leverage a probabilistic approach to build a prototype that can overcome noise. Our extensive empirical study shows that our approach not only significantly outperforms baselines, by 0.06 in accuracy and 0.11 in serendipity, but also shows higher relevance than the traditional relevance-pursuing baseline using TF-IDF.

2014

Map Translation Using Geo-tagged Social Media
Sunyou Lee | Taesung Lee | Seung-won Hwang
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers

Understanding Relation Temporality of Entities
Taesung Lee | Seung-won Hwang
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2013

On Applying and Extending Bitext for Entity Translation
Seung-won Hwang
Proceedings of the Workshop on Twenty Years of Bitext

Enriching Entity Translation Discovery using Selective Temporality
Gae-won You | Young-rok Cha | Jinhan Kim | Seung-won Hwang
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Bootstrapping Entity Translation on Weakly Comparable Corpora
Taesung Lee | Seung-won Hwang
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2010

Mining Name Translations from Entity Graph Mapping
Gae-won You | Seung-won Hwang | Young-In Song | Long Jiang | Zaiqing Nie
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

Co-authors

Youngwon Lee 10

Kyungjae Lee 10

Reinald Kim Amplayo 7

Yeonjoon Jung 7

Young-In Song 4

Daniel F Campos 3

Hyungjoo Chae 3

Yeonseok Jeong 3

Sunghyun Park 3

Romain Storaï 3

Filip Gralinski 2

Myeongho Jeong 2

Chaehyeong Kim 2

Beong-woo Kwak 2

Gyeongbok Lee 2

Kai Tzu-iunn Ong 2

Seonghyeon Bae 1

Kyunghoon Bae 1

Victor Bursztyn 1

Daniel Campos 1

Young-rok Cha 1

Jung Hee Cheon 1

Byung-Gon Chun 1

Kevin Compher 1

HyungJoo Jang 1

Byoungjip Kim 1

Byung-Hak Kim 1

Youngwook Kim 1

Seungwook Lee 1

Bill Yuchen Lin 1

Seungjun Moon 1

Jai Hyun Park 1

Lahari Poddar 1

Hee-Cheol Seo 1

Kyunghwan Sohn 1

Alexey Svyatkovskiy 1

Zhongyuan Wang 1

Kyoungho Yoon 1

Venues