Honglak Lee - ACL Anthology

Honglak Lee

2026

IRPO: Implicit Policy Regularized Preference Optimization
Youngsoo Jang | Yu Jin Kim | Geon-Hyeong Kim | Honglak Lee | Moontae Lee
Findings of the Association for Computational Linguistics: EACL 2026

Training complexity often scales with the size of hyperparameter space for Large Language Models (LLMs). While Direct Preference Optimization (DPO) offers learning stability through reparameterizing the reward function, its regularization against the reference policy can lead to suboptimal outcomes when the reference policy is not optimal. Recent DPO variants address this concern, but at a cost: they introduce additional hyperparameters, reducing feasibility for LLM fine-tuning. To overcome this challenge, we introduce Implicit policy Regularized Preference Optimization (IRPO), which tackles suboptimality while maintaining training simplicity. By treating the winning policy that generated the chosen responses in a pairwise dataset as an implicit policy, IRPO maximizes KL-regularized reward without extra hyperparameters. Then we propose a novel PO algorithm that directly optimizes the IRPO objective by estimating the likelihood ratio between implicit policies. As the winning policy generally outperforms the reference policy, IRPO can effectively address suboptimality. Our experiments show that IRPO significantly outperforms baseline algorithms with the same hyperparameter complexity. Moreover, IRPO demonstrates comparable performance to recent algorithms that rely on a larger number of hyperparameters, offering a practical solution for scalable LLM fine-tuning.
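
For background, the reference-policy regularization this abstract refers to is visible in the standard DPO loss, written below in common notation. This is the baseline DPO objective only, not the IRPO objective, which the abstract does not state; the symbols ($\pi_\theta$ for the trained policy, $\pi_{\mathrm{ref}}$ for the reference policy, $y_w$ and $y_l$ for the chosen and rejected responses, $\beta$ for the regularization strength) are the conventional ones and are assumed here rather than taken from the listing.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

Both log-ratios are taken against $\pi_{\mathrm{ref}}$, which is why a weak reference policy can limit the outcome.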

Beyond Blind Following: Evaluating Robustness of LLM Agents under Imperfect Guidance
Yao Fu | Ran Qiu | Xinhe Wang | Jacob Sansom | Sathvika Ayyappa Prabhu | Huijie Tang | Jaekyeom Kim | Sungryull Sohn | Honglak Lee
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models (LLMs) have shown strong capabilities as task-solving agents across interactive domains. However, in complex environments, these agents may need to rely on auxiliary guidance to reduce the search space or make up for limited domain-specific knowledge. Such guidance includes human-provided manuals and demonstrations, retrieved examples from memory or external tools, high-level heuristics, and agent-acquired knowledge from prior interactions. However, this guidance may be imperfect. For example, due to changes in the environment, ambiguous or simplified language, or retrieval errors from external sources, guidance can be incomplete, outdated, or contextually mismatched, potentially causing errors or failures during task execution. To address this, we introduce MIRAGE, a benchmark for MeasurIng Robustness of LLM Agents under Imperfect GuidancE. MIRAGE includes procedurally generated environments in navigation, cooking, and gaming, where both the environment and the auxiliary guidance vary in fidelity and relevance. We further extend MIRAGE to realistic web tasks via WebArena, using noisy or underspecified instructions extracted from demonstrations. Our findings reveal critical failure modes in current LLM agents and motivate future work on improving their robustness under imperfect guidance.

2025

MASSW: A New Dataset and Benchmark Tasks for AI-Assisted Scientific Workflows
Xingjian Zhang | Yutong Xie | Jin Huang | Jinge Ma | Zhaoying Pan | Qijia Liu | Ziyang Xiong | Tolga Ergen | Dongsub Shim | Honglak Lee | Qiaozhu Mei
Findings of the Association for Computational Linguistics: NAACL 2025

Scientific innovation relies on detailed workflows, which include critical steps such as contextualizing literature, generating ideas, validating ideas, interpreting results, and planning new research. Scientific publications that document these workflows are extensive and unstructured, making it difficult to effectively navigate and explore the space of scientific innovation. To meet this challenge, we introduce **MASSW**, a comprehensive dataset of **M**ulti-**A**spect **S**ummarization of **S**cientific **W**orkflows. MASSW includes more than 152,000 peer-reviewed publications from 17 leading computer science conferences spanning the past 50 years. Using Large Language Models (LLMs), we automatically extract five core aspects from these publications – *context, key idea, method, outcome*, and *projected impact* – which correspond to five key steps in a research workflow. We show that these LLM-extracted summaries are comparable in quality to human annotations, and they facilitate a variety of downstream tasks, corresponding to different types of predictions and recommendations along the scientific workflow. Overall, MASSW demonstrates decent utility as a pre-computed and trustworthy resource for the AI4Science community to create and benchmark a wide range of new AI methods for optimizing scientific workflows and fostering scientific innovation. Our code and datasets are made available anonymously: [link](https://osf.io/7ygrq/?view_only=3d8261a0ea09489fa67ece2c68235afa).

Revisiting LLM Value Probing Strategies: Are They Robust and Expressive?
Siqi Shen | Mehar Singh | Lajanugen Logeswaran | Moontae Lee | Honglak Lee | Rada Mihalcea
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

The value orientation of Large Language Models (LLMs) has been extensively studied, as it can shape user experiences across demographic groups. However, two key challenges remain: (1) the lack of systematic comparison across value probing strategies, despite the Multiple Choice Question (MCQ) setting being vulnerable to perturbations, and (2) the uncertainty over whether probed values capture in-context information or predict models’ real-world actions. In this paper, we systematically compare three widely used value probing methods: token likelihood, sequence perplexity, and text generation. Our results show that all three methods exhibit large variances under non-semantic perturbations in prompts and option formats, with sequence perplexity being the most robust overall. We further introduce two tasks to assess expressiveness: demographic prompting, testing whether probed values adapt to cultural context; and value–action agreement, testing the alignment of probed values with value-based actions. We find that demographic context has little effect on the text generation method, and probed values only weakly correlate with action preferences across all methods. Our work highlights the instability and the limited expressive power of current value probing methods, calling for more reliable LLM value representations.

Interactive and Expressive Code-Augmented Planning with Large Language Models
Anthony Zhe Liu | Xinhe Wang | Jacob Sansom | Yao Fu | Jongwook Choi | Sungryull Sohn | Jaekyeom Kim | Honglak Lee
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large Language Models (LLMs) demonstrate strong abilities in common-sense reasoning and interactive decision-making, but often struggle with complex, long-horizon planning tasks. Recent techniques have sought to structure LLM outputs using control flow and code to improve planning performance. However, code-based approaches can be error-prone and insufficient for handling ambiguous or unstructured data. To address these challenges, we propose REPL-Plan, an LLM planning approach that is fully code-expressive (it can utilize all the benefits of code) while also being dynamic (it can flexibly adapt from errors and use the LLM for soft reasoning). In REPL-Plan, an LLM solves tasks by interacting with a Read-Eval-Print Loop (REPL), which iteratively executes and evaluates code, similar to language shells or interactive code notebooks, allowing the model to flexibly correct errors and handle tasks dynamically. We demonstrate that REPL-Plan achieves strong results across various planning domains compared to previous methods.

2024

Mitigating Biases for Instruction-following Language Models via Bias Neurons Elimination
Nakyeong Yang | Taegwan Kang | Stanley Jungkyu Choi | Honglak Lee | Kyomin Jung
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Instruction-following language models often show undesirable biases. These undesirable biases may be accelerated in the real-world usage of language models, where a wide range of instructions is used through zero-shot example prompting. To solve this problem, we first define the bias neuron, which significantly affects biased outputs, and prove its existence empirically. Furthermore, we propose a novel and practical bias mitigation method, CRISPR, to eliminate bias neurons of language models in instruction-following settings. CRISPR automatically determines biased outputs and categorizes neurons that affect the biased outputs as bias neurons using an explainability method. Experimental results demonstrate the effectiveness of our method in mitigating biases under zero-shot instruction-following settings without losing the model’s task performance and existing knowledge. The experimental results reveal the generalizability of our method as it shows robustness under various instructions and datasets. Surprisingly, our method can mitigate the bias in language models by eliminating only a few neurons (at least three).

Code Models are Zero-shot Precondition Reasoners
Lajanugen Logeswaran | Sungryull Sohn | Yiwei Lyu | Anthony Liu | Dong-Ki Kim | Dongsub Shim | Moontae Lee | Honglak Lee
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

One of the fundamental skills required for an agent acting in an environment to complete tasks is the ability to understand what actions are plausible at any given point. This work explores a novel use of code representations to reason about action preconditions for sequential decision making tasks. Code representations offer the flexibility to model procedural activities and associated constraints as well as the ability to execute and verify constraint satisfaction. Leveraging code representations, we extract action preconditions from demonstration trajectories in a zero-shot manner using pre-trained code models. Given these extracted preconditions, we propose a precondition-aware action sampling strategy that ensures actions predicted by a policy are consistent with preconditions. We demonstrate that the proposed approach enhances the performance of few-shot policy learning approaches across task-oriented dialog and embodied textworld benchmarks.
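
As a rough illustration of precondition-aware action sampling, the sketch below walks a policy's ranked candidate actions and returns the first one whose precondition holds in the current state. The state fields, action names, and the `PRECONDITIONS` table are hypothetical toy values; in the paper the preconditions are extracted by a pre-trained code model, which is not reproduced here.

```python
from typing import Callable, Dict, List

State = Dict[str, bool]
Precondition = Callable[[State], bool]

# Hypothetical preconditions, written as executable checks over a toy state.
PRECONDITIONS: Dict[str, Precondition] = {
    "open_fridge": lambda s: not s["fridge_open"],
    "take_milk": lambda s: s["fridge_open"] and not s["holding_milk"],
    "pour_milk": lambda s: s["holding_milk"],
}


def sample_action(ranked_actions: List[str], state: State) -> str:
    """Return the highest-ranked action whose precondition holds in `state`."""
    for action in ranked_actions:
        check = PRECONDITIONS.get(action)
        # Actions without a known precondition are treated as always allowed.
        if check is None or check(state):
            return action
    raise ValueError("no candidate action satisfies its precondition")


state = {"fridge_open": False, "holding_milk": False}
policy_ranking = ["pour_milk", "take_milk", "open_fridge"]  # policy's preference order
chosen = sample_action(policy_ranking, state)
print(chosen)  # open_fridge
```

Because the preconditions are ordinary code, they can be executed against the state to verify constraint satisfaction instead of being judged by the policy itself.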

Understanding the Capabilities and Limitations of Large Language Models for Cultural Commonsense
Siqi Shen | Lajanugen Logeswaran | Moontae Lee | Honglak Lee | Soujanya Poria | Rada Mihalcea
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Large language models (LLMs) have demonstrated substantial commonsense understanding through numerous benchmark evaluations. However, their understanding of cultural commonsense remains largely unexamined. In this paper, we conduct a comprehensive examination of the capabilities and limitations of several state-of-the-art LLMs in the context of cultural commonsense tasks. Using several general and cultural commonsense benchmarks, we find that (1) LLMs have a significant discrepancy in performance when tested on culture-specific commonsense knowledge for different cultures; (2) LLMs’ general commonsense capability is affected by cultural context; and (3) the language used to query the LLMs can impact their performance on culture-related tasks. Our study points to the inherent bias in the cultural understanding of LLMs and provides insights that can help develop culturally-aware language models.

Efficient Dynamic Hard Negative Sampling for Dialogue Selection
Janghoon Han | Dongkyu Lee | Joongbo Shin | Hyunkyung Bae | Jeesoo Bang | Seonghwan Kim | Stanley Jungkyu Choi | Honglak Lee
Proceedings of the 6th Workshop on NLP for Conversational AI (NLP4ConvAI 2024)

Recent studies have demonstrated significant improvements in selection tasks, and a considerable portion of this success is attributed to incorporating informative negative samples during training. While traditional methods for constructing hard negatives provide meaningful supervision, they depend on static samples that do not evolve during training, leading to sub-optimal performance. Dynamic hard negative sampling addresses this limitation by continuously adapting to the model’s changing state throughout training. However, the high computational demands of this method restrict its applicability to certain model architectures. To overcome these challenges, we introduce an efficient dynamic hard negative sampling (EDHNS). EDHNS enhances efficiency by pre-filtering easily discriminable negatives, thereby reducing the number of candidates the model needs to compute during training. Additionally, it excludes question-candidate pairs where the model already exhibits high confidence from loss computations, further reducing training time. These approaches maintain learning quality while minimizing computation and streamlining the training process. Extensive experiments on DSTC9, DSTC10, Ubuntu, and E-commerce benchmarks demonstrate that EDHNS significantly outperforms baseline models, proving its effectiveness in dialogue selection tasks.

Deep Exploration of Cross-Lingual Zero-Shot Generalization in Instruction Tuning
Janghoon Han | Changho Lee | Joongbo Shin | Stanley Jungkyu Choi | Honglak Lee | Kyunghoon Bae
Findings of the Association for Computational Linguistics: ACL 2024

Instruction tuning has emerged as a powerful technique, significantly boosting zero-shot performance on unseen tasks. While recent work has explored cross-lingual generalization by applying instruction tuning to multilingual models, previous studies have primarily focused on English, with a limited exploration of non-English tasks. For in-depth exploration of cross-lingual generalization in instruction tuning, we perform instruction tuning individually for two distinct language meta-datasets. Subsequently, we assess the performance on unseen tasks in the language different from the one used for training. To facilitate this investigation, we introduce a novel non-English meta-dataset named “KORANI” (Korean Natural Instruction), comprising 51 Korean benchmarks. Moreover, we design cross-lingual templates to mitigate discrepancies in language and instruction-format of the template between training and inference within the cross-lingual setting. Our experiments reveal consistent improvements through cross-lingual generalization in both English and Korean, outperforming baseline by average scores of 20.7% and 13.6%, respectively. Remarkably, these enhancements are comparable to those achieved by mono-lingual instruction tuning and even surpass them in some tasks. The result underscores the significance of relevant data acquisition across languages over linguistic congruence with unseen tasks during instruction tuning.

Small Language Models Need Strong Verifiers to Self-Correct Reasoning
Yunxiang Zhang | Muhammad Khalifa | Lajanugen Logeswaran | Jaekyeom Kim | Moontae Lee | Honglak Lee | Lu Wang
Findings of the Association for Computational Linguistics: ACL 2024

Self-correction has emerged as a promising solution to boost the reasoning performance of large language models (LLMs), where LLMs refine their solutions using self-generated critiques that pinpoint the errors. This work explores whether small (≤ 13B) language models (LMs) have the ability of self-correction on reasoning tasks with minimal inputs from stronger LMs. We propose a novel pipeline that prompts smaller LMs to collect self-correction data that supports the training of self-refinement abilities. First, we leverage correct solutions to guide the model in critiquing their incorrect responses. Second, the generated critiques, after filtering, are used for supervised fine-tuning of the self-correcting reasoner through solution refinement. Our experimental results show improved self-correction abilities of two models on five datasets spanning math and commonsense reasoning, with notable performance gains when paired with a strong GPT-4-based verifier, though limitations are identified when using a weak self-verifier for determining when to correct.

Prospector: Improving LLM Agents with Self-Asking and Trajectory Ranking
Byoungjip Kim | Youngsoo Jang | Lajanugen Logeswaran | Geon-Hyeong Kim | Yu Jin Kim | Honglak Lee | Moontae Lee
Findings of the Association for Computational Linguistics: EMNLP 2024

Large language models (LLMs) have shown the ability to solve complex decision-making tasks beyond natural language processing tasks. LLM agents based on few-shot in-context learning (ICL) achieve surprisingly high performance without training. Despite their simplicity and generalizability, ICL-based agents are limited in their ability to incorporate feedback from an environment. In this paper, we introduce Prospector, an LLM agent that consists of two complementary LLMs, an Actor and a Critic. To elicit better instruction-aligned actions from the LLM agent, we propose AskAct prompting that performs an additional self-asking step such as goal and progress checking before generating an action. Furthermore, to implicitly incorporate the environment feedback, we propose Trajectory Ranking that orders generated trajectories by predicting the expected total reward. Prospector encourages the LLM Actor to generate diverse (creative) trajectories, and harnesses the LLM Critic to select the most rewarding trajectory. On representative decision-making benchmark environments such as ALFWorld and WebShop, we empirically demonstrate that Prospector can considerably increase the success rate of given tasks, while outperforming recent advancements such as ReAct and Reflexion.

Auto-Intent: Automated Intent Discovery and Self-Exploration for Large Language Model Web Agents
Jaekyeom Kim | Dong-Ki Kim | Lajanugen Logeswaran | Sungryull Sohn | Honglak Lee
Findings of the Association for Computational Linguistics: EMNLP 2024

In this paper, we introduce Auto-Intent, a method to adapt a pre-trained large language model (LLM) as an agent for a target domain without direct fine-tuning, where we empirically focus on web navigation tasks. Our approach first discovers the underlying intents from target domain demonstrations unsupervisedly, in a highly compact form (up to three words). With the extracted intents, we train our intent predictor to predict the next intent given the agent’s past observations and actions. In particular, we propose a self-exploration approach where top-k probable intent predictions are provided as a hint to the pre-trained LLM agent, which leads to enhanced decision-making capabilities. Auto-Intent substantially improves the performance of GPT-3.5, 4 and Llama-3.1-70B, 405B agents on the large-scale real-website navigation benchmarks from Mind2Web and online navigation tasks from WebArena with its cross-benchmark generalization from Mind2Web.

Instruction Matters: A Simple yet Effective Task Selection for Optimized Instruction Tuning of Specific Tasks
Changho Lee | Janghoon Han | Seonghyeon Ye | Stanley Jungkyu Choi | Honglak Lee | Kyunghoon Bae
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Instruction tuning has been proven effective in enhancing zero-shot generalization across various tasks and in improving the performance of specific tasks. For task-specific improvements, strategically selecting and training on related tasks that provide meaningful supervision is crucial, as this approach enhances efficiency and prevents performance degradation from learning irrelevant tasks. In this light, we introduce a simple yet effective task selection method that leverages instruction information alone to identify relevant tasks, optimizing instruction tuning for specific tasks. Our method is significantly more efficient than traditional approaches, which require complex measurements of pairwise transferability between tasks or the creation of data samples for the target task. Additionally, by aligning the model with the unique instructional template style of the meta-dataset, we enhance its ability to granularly discern relevant tasks, leading to improved overall performance. Experimental results demonstrate that training on a small set of tasks, chosen solely based on the instructions, results in substantial improvements in performance on benchmarks such as P3, Big-Bench, NIV2, and Big-Bench Hard. Significantly, these improvements surpass those achieved by prior task selection methods, highlighting the superiority of our approach.

2023

Merging Generated and Retrieved Knowledge for Open-Domain QA
Yunxiang Zhang | Muhammad Khalifa | Lajanugen Logeswaran | Moontae Lee | Honglak Lee | Lu Wang
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Open-domain question answering (QA) systems are often built with retrieval modules. However, retrieving passages from a given source is known to suffer from insufficient knowledge coverage. Alternatively, prompting large language models (LLMs) to generate contextual passages based on their parametric knowledge has been shown to improve QA performance. Yet, LLMs tend to “hallucinate” content that conflicts with the retrieved knowledge. Based on the intuition that answers supported by both sources are more likely to be correct, we propose COMBO, a Compatibility-Oriented knowledge Merging for Better Open-domain QA framework, to effectively leverage the two sources of information. Concretely, we match LLM-generated passages with retrieved counterparts into compatible pairs, based on discriminators trained with silver compatibility labels. Then a Fusion-in-Decoder-based reader model handles passage pairs to arrive at the final answer. Experiments show that COMBO outperforms competitive baselines on three out of four tested open-domain QA benchmarks. Further analysis reveals that our proposed framework demonstrates greater efficacy in scenarios with a higher degree of knowledge conflicts.

TOD-Flow: Modeling the Structure of Task-Oriented Dialogues
Sungryull Sohn | Yiwei Lyu | Anthony Liu | Lajanugen Logeswaran | Dong-Ki Kim | Dongsub Shim | Honglak Lee
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Task-Oriented Dialogue (TOD) systems have become crucial components in interactive artificial intelligence applications. While recent advances have capitalized on pre-trained language models (PLMs), they exhibit limitations regarding transparency and controllability. To address these challenges, we propose a novel approach focusing on inferring the TOD-flow graph from dialogue data annotated with dialog acts, uncovering the underlying task structure in the form of a graph. The inferred TOD-flow graph can be easily integrated with any dialogue model to improve its prediction performance, transparency, and controllability. Our TOD-flow graph learns what a model can, should, and should not predict, effectively reducing the search space and providing a rationale for the model’s prediction. We show that the proposed TOD-flow graph better resembles human-annotated graphs compared to prior approaches. Furthermore, when combined with several dialogue policies and end-to-end dialogue models, we demonstrate that our approach significantly improves dialog act classification and end-to-end response generation performance in the MultiWOZ and SGD benchmarks.

From Heuristic to Analytic: Cognitively Motivated Strategies for Coherent Physical Commonsense Reasoning
Zheyuan Zhang | Shane Storks | Fengyuan Hu | Sungryull Sohn | Moontae Lee | Honglak Lee | Joyce Chai
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Pre-trained language models (PLMs) have shown impressive performance in various language tasks. However, they are prone to spurious correlations, and often generate illusory information. In real-world applications, PLMs should justify decisions with formalized, coherent reasoning chains, but this challenge remains under-explored. Cognitive psychology theorizes that humans are capable of utilizing fast and intuitive *heuristic* thinking to make decisions based on past experience, then rationalizing the decisions through slower and deliberative *analytic* reasoning. We incorporate these interlinked dual processes in fine-tuning and in-context learning with PLMs, applying them to two language understanding tasks that require coherent physical commonsense reasoning. We show that our proposed Heuristic-Analytic Reasoning (HAR) strategies drastically improve the coherence of rationalizations for model decisions, yielding state-of-the-art results on Tiered Reasoning for Intuitive Physics (TRIP). We also find that this improved coherence is a direct result of more faithful attention to relevant language context in each step of reasoning. Our findings suggest that human-like reasoning strategies can effectively improve the coherence and reliability of PLM reasoning.

A Picture is Worth a Thousand Words: Language Models Plan from Pixels
Anthony Liu | Lajanugen Logeswaran | Sungryull Sohn | Honglak Lee
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Planning is an important capability of artificial agents that perform long-horizon tasks in real-world environments. In this work, we explore the use of pre-trained language models (PLMs) to reason about plan sequences from text instructions in embodied visual environments. Prior PLM based approaches for planning either assume observations are available in the form of text by a captioning model, reason about plans from the instruction alone, or incorporate information about the visual environment in limited ways (such as a pre-trained affordance function). In contrast, we show that the PLM can accurately plan even when observations are directly encoded as input prompts for the PLM. We show this simple approach outperforms prior approaches in experiments on the ALFWorld and VirtualHome benchmarks.

GRACE: Discriminator-Guided Chain-of-Thought Reasoning
Muhammad Khalifa | Lajanugen Logeswaran | Moontae Lee | Honglak Lee | Lu Wang
Findings of the Association for Computational Linguistics: EMNLP 2023

In the context of multi-step reasoning, e.g., with chain-of-thought, language models (LMs) can easily assign a high likelihood to incorrect steps. As a result, decoding strategies that optimize for solution likelihood often yield incorrect solutions. To address this issue, we propose Guiding chain-of-thought ReAsoning with a CorrectnEss Discriminator (GRACE), a stepwise decoding approach that steers the decoding process towards producing correct reasoning steps. GRACE employs a discriminator trained with a contrastive loss over correct and incorrect steps, which is used during decoding to score next-step candidates based on their correctness. Importantly, GRACE only requires sampling from the LM, without the need for LM training or fine-tuning. Using models from FLAN-T5 and LLaMA families, we evaluate GRACE over four math and two symbolic reasoning tasks, where it exhibits substantial performance gains compared to greedy decoding, verifiers, and self-consistency in most settings. When further combined with self-consistency, GRACE outperforms all the baselines by sizeable margins. Human and LLM evaluations over GSM8K show that GRACE not only improves the final answer accuracy but also the correctness of the intermediate reasoning.
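
The stepwise idea can be sketched as a loop that, at every step, asks a proposer for next-step candidates and keeps the one a correctness scorer rates highest. This is a simplified sketch under stated assumptions, not the GRACE implementation: `toy_propose` stands in for sampling from the LM, `toy_score` stands in for the trained discriminator, and both are hypothetical helpers defined only for this example.

```python
from typing import Callable, List


def guided_decode(
    propose: Callable[[List[str]], List[str]],
    score: Callable[[List[str], str], float],
    max_steps: int,
) -> List[str]:
    """Build a solution step by step, keeping the best-scored candidate each time."""
    steps: List[str] = []
    for _ in range(max_steps):
        candidates = propose(steps)
        if not candidates:  # the proposer signals that the solution is complete
            break
        steps.append(max(candidates, key=lambda c: score(steps, c)))
    return steps


def toy_propose(prefix: List[str]) -> List[str]:
    """Hypothetical stand-in for sampling next-step candidates from an LM."""
    table = [
        ["3 * 4 = 7", "3 * 4 = 12"],
        ["2 + 12 = 14", "2 + 12 = 24"],
    ]
    return table[len(prefix)] if len(prefix) < len(table) else []


def toy_score(prefix: List[str], step: str) -> float:
    """Hypothetical stand-in for the discriminator: 1.0 if the arithmetic holds."""
    lhs, rhs = step.split("=")
    a, op, b = lhs.split()
    value = int(a) * int(b) if op == "*" else int(a) + int(b)
    return 1.0 if value == int(rhs) else 0.0


solution = guided_decode(toy_propose, toy_score, max_steps=5)
print(solution)  # ['3 * 4 = 12', '2 + 12 = 14']
```

Only the proposer is ever sampled from; nothing in the loop updates it, which mirrors the abstract's point that no LM training or fine-tuning is required.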

Fine-grained Text Style Transfer with Diffusion-Based Language Models
Yiwei Lyu | Tiange Luo | Jiacheng Shi | Todd Hollon | Honglak Lee
Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023)

Diffusion probabilistic models have shown great success in generating high-quality images controllably, and researchers have tried to bring this controllability to the text generation domain. Previous works on diffusion-based language models have shown that they can be trained without external knowledge (such as pre-trained weights) and still achieve stable performance and controllability. In this paper, we trained a diffusion-based model on the StylePTB dataset, the standard benchmark for fine-grained text style transfers. The tasks in StylePTB require much more refined control over the output text compared to tasks evaluated in previous works, and our model was able to achieve state-of-the-art performance on StylePTB on both individual and compositional transfers. Moreover, our model, trained on limited data from StylePTB without external knowledge, outperforms previous works that utilized pretrained weights, embeddings, and external grammar parsers, and this may indicate that diffusion-based language models have great potential under low-resource settings.

Few-shot Reranking for Multi-hop QA via Language Model Prompting
Muhammad Khalifa | Lajanugen Logeswaran | Moontae Lee | Honglak Lee | Lu Wang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We study few-shot reranking for multi-hop QA (MQA) with open-domain questions. To alleviate the need for a large number of labeled question-document pairs for retriever training, we propose PromptRank, which relies on language model prompting for multi-hop path reranking. PromptRank first constructs an instruction-based prompt that includes a candidate document path and then computes the relevance score between a given question and the path based on the conditional likelihood of the question given the path prompt according to a language model. PromptRank yields strong retrieval performance on HotpotQA with only 128 training examples compared to state-of-the-art methods trained on thousands of examples — 73.6 recall@10 by PromptRank vs. 77.8 by PathRetriever and 77.5 by multi-hop dense retrieval.
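
The scoring rule described here, ranking each candidate path by the likelihood of the question given that path, can be illustrated with a toy model. The sketch below is an assumption-laden stand-in: it replaces the language model and instruction prompt with an add-one-smoothed unigram model built from the path's own text, and the example question and documents are made up for illustration.

```python
import math
from collections import Counter
from typing import List


def question_log_likelihood(question: str, path_docs: List[str]) -> float:
    """Toy stand-in for log P(question | path): smoothed unigram model over the path text."""
    path_tokens = " ".join(path_docs).lower().split()
    q_tokens = question.lower().split()
    counts = Counter(path_tokens)
    vocab_size = len(set(path_tokens) | set(q_tokens))
    total = len(path_tokens)
    # Add-one smoothing keeps unseen question tokens from giving log(0).
    return sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in q_tokens)


def rerank(question: str, paths: List[List[str]]) -> List[List[str]]:
    """Order candidate document paths from most to least likely to yield the question."""
    return sorted(paths, key=lambda p: question_log_likelihood(question, p), reverse=True)


question = "where was the director of inception born"
path_a = [
    "inception was directed by christopher nolan",
    "christopher nolan was born in london",
]
path_b = ["paris is a city in france", "france is in europe"]

ranked = rerank(question, [path_b, path_a])
print(ranked[0] == path_a)  # True
```

Scoring the question given the path, rather than the path given the question, means every candidate is judged on the same fixed target text regardless of path length.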

Unsupervised Task Graph Generation from Instructional Video Transcripts
Lajanugen Logeswaran | Sungryull Sohn | Yunseok Jang | Moontae Lee | Honglak Lee
Findings of the Association for Computational Linguistics: ACL 2023

This work explores the problem of generating task graphs of real-world activities. Different from prior formulations, we consider a setting where text transcripts of instructional videos performing a real-world activity (e.g., making coffee) are provided and the goal is to identify the key steps relevant to the task as well as the dependency relationship between these key steps. We propose a novel task graph generation approach that combines the reasoning capabilities of instruction-tuned language models along with clustering and ranking components to generate accurate task graphs in a completely unsupervised manner. We show that the proposed approach generates more accurate task graphs compared to a supervised learning approach on tasks from the ProceL and CrossTask datasets.

2022

Few-shot Subgoal Planning with Language Models
Lajanugen Logeswaran | Yao Fu | Moontae Lee | Honglak Lee
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Pre-trained language models have shown successful progress in many text understanding benchmarks. This work explores the capability of these models to predict actionable plans in real-world environments. Given a text instruction, we show that language priors encoded in pre-trained models allow us to infer fine-grained subgoal sequences. In contrast to recent methods which make strong assumptions about subgoal supervision, our experiments show that language models can infer detailed subgoal sequences from few training sequences without any fine-tuning. We further propose a simple strategy to re-rank language model predictions based on interaction and feedback from the environment. Combined with pre-trained navigation and visual reasoning components, our approach demonstrates competitive performance on subgoal prediction and task completion in the ALFRED benchmark compared to prior methods that assume more subgoal supervision.

2019

Zero-Shot Entity Linking by Reading Entity Descriptions
Lajanugen Logeswaran | Ming-Wei Chang | Kenton Lee | Kristina Toutanova | Jacob Devlin | Honglak Lee
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We present the zero-shot entity linking task, where mentions must be linked to unseen entities without in-domain labeled data. The goal is to enable robust transfer to highly specialized domains, and so no metadata or alias tables are assumed. In this setting, entities are only identified by text descriptions, and models must rely strictly on language understanding to resolve the new entities. First, we show that strong reading comprehension models pre-trained on large unlabeled data can be used to generalize to unseen entities. Second, we propose a simple and effective adaptive pre-training strategy, which we term domain-adaptive pre-training (DAP), to address the domain shift problem associated with linking unseen entities in a new domain. We present experiments on a new dataset that we construct for this task and show that DAP improves over strong pre-training baselines, including BERT. The data and code are available at https://github.com/lajanugen/zeshel.

2016

Dependency Sensitive Convolutional Neural Networks for Modeling Sentences and Documents
Rui Zhang | Honglak Lee | Dragomir R. Radev
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Co-authors

Kyunghoon Bae 2

Youngsoo Jang 2

Geon-Hyeong Kim 2

Rada Mihalcea 2

Yunxiang Zhang 2

Hyunkyung Bae 1

Ming-Wei Chang 1

Jongwook Choi 1

Seonghwan Kim 1

Byoungjip Kim 1

Anthony Zhe Liu 1

Soujanya Poria 1

Sathvika Ayyappa Prabhu 1

Dragomir Radev 1

Kristina Toutanova 1

Nakyeong Yang 1

Seonghyeon Ye 1

Xingjian Zhang 1

Zheyuan Zhang 1

Venues