Dohyeon Lee

2026

D3: Dynamic Docid Decoding for Multi-Intent Generative Retrieval
Jaeyoung Kim | Dohyeon Lee | Soona Hong | Seung-won Hwang
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track)

Generative Retrieval (GR) maps queries to documents by generating discrete identifiers (DocIDs). However, offline DocID assignment and constrained decoding often prevent GR from capturing query-specific intent, especially when documents express multiple or unseen intents (i.e., intent misalignment). We introduce Dynamic Docid Decoding (D3), an inference-time mechanism that adaptively refines DocIDs through delayed, query-informed identifier expansion. D3 uses (a) verification to detect intent misalignment and (b) dynamic decoding to extend DocIDs with query-aligned tokens, even those absent from the pre-indexed vocabulary, enabling plug-and-play DocID expansion beyond the static vocabulary while adding minimal overhead. Experiments on NQ320k and MS-MARCO show that D3 consistently improves retrieval accuracy, especially on unseen and multi-intent documents, across various GR models, including a +2.4%p nDCG@10 gain on the state-of-the-art model.

Beyond Markovian Forgetfulness: Episodic Memory for Reasoning-Intensive Retrieval
Dohyeon Lee | Yeonseok Jeong | Seung-won Hwang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Reasoning-intensive information retrieval uses large language models to solve complex queries via multi-step reasoning. However, existing methods have critical limitations. Chain-of-Thought (CoT) approaches suffer from inefficiency, while state-based methods, despite better token efficiency, often fall into reasoning cycles that trap the query refinement process. To address these issues, we propose Episodic Memory for Retrieval (EMR), which enhances the state-based framework with an episodic memory. This module stores the full history of prior states for a query, allowing the model to avoid repeating such cycles. Experiments on the BRIGHT benchmark show that EMR consistently outperforms both CoT and state-based baselines. Moreover, it is highly token-efficient, reducing token usage by 72% on average. Our results show that episodic memory is an effective and token-efficient mechanism for reasoning-intensive retrieval. The gains also generalize across different base models and stay efficient in terms of end-to-end latency. The code is available at https://github.com/ldilab/EMR.

2025

tRAG: Term-level Retrieval-Augmented Generation for Domain-Adaptive Retrieval
Dohyeon Lee | Jongyoon Kim | Jihyuk Kim | Seung-won Hwang | Joonsuk Park
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Neural retrieval models have emerged as an effective tool for information retrieval, but their performance suffers when there is a domain shift between training and test data distributions. Recent work aims to construct pseudo-training data for the target domain by generating domain-adapted pseudo-queries using large language models (LLMs). However, we identify that LLMs exhibit a “seen term bias” where the generated pseudo-queries fail to include relevant “unseen” terms as expected for domain adaptation purposes. To address this limitation, we propose to improve the term recall of unseen query terms by using term-level Retrieval-Augmented Generation (tRAG). Specifically, unlike existing document-level RAG, we propose to generate domain-specific keywords from all documents in the corpus, including those unseen in any individual document. To filter hallucination, generated keywords are retrieved and reranked, leveraging relevance feedback from both retrievers and LLMs. Experiments on the BEIR benchmark show tRAG significantly improves recall for unseen terms by 10.6% and outperforms LLM and retrieval-augmented generation baselines on overall retrieval performance.

Query-focused Referentiability Learning for Zero-shot Retrieval
Jaeyoung Kim | Dohyeon Lee | Seung-won Hwang
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Dense passage retrieval enhances Information Retrieval (IR) by encoding queries and passages into representation space. However, passage representations often fail to be referenced by their gold queries under domain shifts, revealing a weakness in representation space. One desirable concept for representations is “argmaxable”. Being argmaxable ensures that no representations are theoretically excluded from selection due to geometric constraints. To be argmaxable, a notable approach is to increase isotropy, where representations are evenly spread out in all directions. These findings, while desirable also for IR, focus on passage representation and not on query, making it challenging to directly apply their findings to IR. In contrast, we introduce a novel query-focused concept of “referentiable” tailored for IR tasks, which ensures that passage representations are referenced by their gold queries. Building on this, we propose Learning Referentiable Representation (LRR), and two strategic metrics, Self-P and Self-Q, quantifying how the representations are referentiable. Our experiments compare three dense model versions: Naive, Isotropic, and Referentiable, demonstrating that LRR leads to enhanced zero-shot performance, surpassing existing naive and isotropic versions.

From Token to Action: State Machine Reasoning to Mitigate Overthinking in Information Retrieval
Dohyeon Lee | Yeonseok Jeong | Seung-won Hwang
Findings of the Association for Computational Linguistics: EMNLP 2025

Chain-of-Thought (CoT) prompting enables complex reasoning in large language models (LLMs), including applications in information retrieval (IR). However, it often leads to overthinking, where models produce excessively long and semantically redundant traces with little or no benefit. We identify two key challenges in IR: redundant trajectories that revisit similar states and misguided reasoning that diverges from user intent. To address these, we propose State Machine Reasoning (SMR), a transition-based reasoning framework composed of discrete actions (REFINE, RERANK, STOP) that support early stopping and fine-grained control. Experiments on the BEIR and BRIGHT benchmarks show that SMR improves retrieval performance (nDCG@10) by 3.4% while reducing token usage by 74.4%. It generalizes across LLMs and retrievers without requiring task-specific tuning, offering a practical alternative to conventional CoT reasoning.

ECoRAG: Evidentiality-guided Compression for Long Context RAG
Yeonseok Jeong | Jinsu Kim | Dohyeon Lee | Seung-won Hwang
Findings of the Association for Computational Linguistics: ACL 2025

Large Language Models (LLMs) have shown remarkable performance in Open-Domain Question Answering (ODQA) by leveraging external documents through Retrieval-Augmented Generation (RAG). To reduce RAG overhead from longer contexts, context compression is necessary. However, prior compression methods do not focus on filtering out non-evidential information, which limits performance in LLM-based RAG. We thus propose the Evidentiality-guided RAG (ECoRAG) framework. ECoRAG improves LLM performance by compressing retrieved documents based on evidentiality, checking whether answer generation is supported by the correct evidence. As an additional step, ECoRAG reflects on whether the compressed content provides sufficient evidence and, if not, retrieves more until it is sufficient. Experiments show that ECoRAG improves LLM performance on ODQA tasks, outperforming existing compression methods. Furthermore, ECoRAG is highly cost-efficient, as it not only reduces latency but also minimizes token usage by retaining only the necessary information to generate the correct answer. Code is available at https://github.com/ldilab/ECoRAG.

2024

HIL: Hybrid Isotropy Learning for Zero-shot Performance in Dense retrieval
Jaeyoung Kim | Dohyeon Lee | Seung-won Hwang
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Advancements in dense retrieval models have brought ColBERT to prominence in Information Retrieval (IR) with its advanced interaction techniques. However, ColBERT is reported to frequently underperform in zero-shot scenarios, where traditional techniques such as BM25 still exceed it. Addressing this, we propose to balance representation isotropy and anisotropy for zero-shot model performance, based on our observations that isotropy can enhance cosine similarity computations and anisotropy may aid in generalizing to unseen data. Striking a balance between these isotropic and anisotropic qualities stands as a critical objective to refine model efficacy. Based on this, we present a Hybrid Isotropy Learning (HIL) architecture that integrates isotropic and anisotropic representations. Our experiments with the BEIR benchmark show that our model significantly outperforms the baseline ColBERT model, highlighting the importance of harmonized isotropy in improving zero-shot retrieval performance.

ScriptMix: Mixing Scripts for Low-resource Language Parsing
Jaeseong Lee | Dohyeon Lee | Seung-won Hwang
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Despite the success of multilingual pretrained language models (mPLMs) for tasks such as dependency parsing (DEP) or part-of-speech (POS) tagging, their coverage of hundreds of languages is still limited, as most of the 6500+ languages remain “unseen”. To adapt mPLMs to include such unseen languages, existing work has considered transliteration and vocabulary augmentation. Meanwhile, the consideration of combining the two has been surprisingly lacking. To understand why, we identify both the complementary strengths of the two and the hurdles to realizing them. Based on this observation, we propose ScriptMix, combining the two strengths and overcoming the hurdle. Specifically, ScriptMix a) is trained with a dual-script corpus to combine strengths, but b) with separate modules to avoid gradient conflict. In combining modules properly, we also point out the limitation of the conventional method AdapterFusion, and propose AdapterFusion+ to overcome it. We empirically show ScriptMix is effective: ScriptMix improves the POS accuracy by up to 14%, and improves the DEP LAS score by up to 5.6%. Our code is publicly available.

DADA: Distribution-Aware Domain Adaptation of PLMs for Information Retrieval
Dohyeon Lee | Jongyoon Kim | Seung-won Hwang | Joonsuk Park
Findings of the Association for Computational Linguistics: ACL 2024

Pre-trained language models (PLMs) exhibit promise in retrieval tasks but struggle with out-of-domain data due to distribution shifts. Addressing this, generative domain adaptation (DA), known as GPL, tackles distribution shifts by generating pseudo queries and labels to train models for predicting query-document relationships in new domains. However, it overlooks the domain distribution, causing the model to struggle with aligning the distribution in the target domain. We, therefore, propose a Distribution-Aware Domain Adaptation (DADA) to guide the model to consider the domain distribution knowledge at the level of both a single document and the corpus, which is referred to as observation-level feedback and domain-level feedback, respectively. Our method effectively adapts the model to the target domain and expands document representation to unseen gold query terms using domain and observation feedback, as demonstrated by empirical results on the BEIR benchmark.

Interventional Speech Noise Injection for ASR Generalizable Spoken Language Understanding
YeonJoon Jung | Jaeseong Lee | Seungtaek Choi | Dohyeon Lee | Minsoo Kim | Seung-won Hwang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Recently, pre-trained language models (PLMs) have been increasingly adopted in spoken language understanding (SLU). However, automatic speech recognition (ASR) systems frequently produce inaccurate transcriptions, leading to noisy inputs for SLU models, which can significantly degrade their performance. To address this, our objective is to train SLU models to withstand ASR errors by exposing them to noises commonly observed in ASR systems, referred to as ASR-plausible noises. Speech noise injection (SNI) methods have pursued this objective by introducing ASR-plausible noises, but we argue that these methods are inherently biased towards specific ASR systems, or ASR-specific noises. In this work, we propose a novel and less biased augmentation method of introducing the noises that are plausible to any ASR system, by cutting off the non-causal effect of noises. Experimental results and analyses demonstrate the effectiveness of our proposed methods in enhancing the robustness and generalizability of SLU models against unseen ASR systems by introducing more diverse and plausible ASR noises in advance.

Chaining Event Spans for Temporal Relation Grounding
Jongho Kim | Dohyeon Lee | Minsoo Kim | Seung-won Hwang
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Accurately understanding temporal relations between events is a critical building block of diverse tasks, such as temporal reading comprehension (TRC) and relation extraction (TRE). For example, in TRC, we need to understand the temporal semantic differences between the following two questions that are lexically near-identical: “What finished right before the decision?” and “What finished right after the decision?”. To discern the two questions, existing solutions have relied on answer overlaps as a proxy label to contrast similar and dissimilar questions. However, we claim that answer overlap can lead to unreliable results, due to spurious overlaps of two dissimilar questions with coincidentally identical answers. To address the issue, we propose a novel approach that elicits proper reasoning behaviors through a module for predicting time spans of events. We introduce the Timeline Reasoning Network (TRN) operating in a two-step inductive reasoning process: In the first step, the model initially answers each question with semantic and syntactic information. The next step chains multiple questions on the same event to predict a timeline, which is then used to ground the answers. Results on TORQUE and TB-dense, for the TRC and TRE tasks respectively, demonstrate that TRN outperforms previous methods by effectively resolving the spurious overlaps using the predicted timeline.

2023

On Complementarity Objectives for Hybrid Retrieval
Dohyeon Lee | Seung-won Hwang | Kyungjae Lee | Seungtaek Choi | Sunghyun Park
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Dense retrieval has shown promising results in various information retrieval tasks, and hybrid retrieval, combined with the strength of sparse retrieval, has also been actively studied. A key challenge in hybrid retrieval is to make sparse and dense complementary to each other. Existing models have focused on dense models to capture “residual” features neglected in the sparse models. Our key distinction is to show how this notion of residual complementarity is limited, and propose a new objective, denoted as RoC (Ratio of Complementarity), which captures a fuller notion of complementarity. We propose a two-level orthogonality designed to improve RoC, then show that the improved RoC of our model, in turn, improves the performance of hybrid retrieval. Our method outperforms all state-of-the-art methods on three representative IR benchmarks: MSMARCO-Passage, Natural Questions, and TREC Robust04, with statistical significance. Our finding is also consistent in various adversarial settings.

2022

PLM-based World Models for Text-based Games
Minsoo Kim | Yeonjoon Jung | Dohyeon Lee | Seung-won Hwang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

World models have improved the ability of reinforcement learning agents to operate in a sample-efficient manner, by being trained to predict plausible changes in the underlying environment. As the core tasks of world models are future prediction and commonsense understanding, our claim is that pre-trained language models (PLMs) already provide a strong base upon which to build world models. Worldformer is a recently proposed world model for text-based game environments, based only partially on PLMs and transformers. Our distinction is to fully leverage PLMs as actionable world models in text-based game environments, by reformulating generation as constrained decoding which decomposes actions into verb templates and objects. We show that our model improves future valid action prediction and graph change prediction. Additionally, we show that our model better reflects commonsense than a standard PLM.

2021

Robustifying Multi-hop QA through Pseudo-Evidentiality Training
Kyungjae Lee | Seung-won Hwang | Sang-eun Han | Dohyeon Lee
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

This paper studies the bias problem of multi-hop question answering models: answering correctly without correct reasoning. One way to robustify these models is to supervise them not only to answer correctly, but also to do so with correct reasoning chains. An existing direction is to annotate reasoning chains to train models, requiring expensive additional annotations. In contrast, we propose a new approach to learn evidentiality, deciding whether the answer prediction is supported by correct evidence, without such annotations. Instead, we compare counterfactual changes in answer confidence with and without evidence sentences, to generate “pseudo-evidentiality” annotations. We validate our proposed model on an original set and challenge set in HotpotQA, showing that our method is accurate and robust in multi-hop reasoning.
