Jaekyeom Kim

2026

Large language models (LLMs) have shown strong capabilities as task-solving agents across interactive domains. However, in complex environments, these agents may need to rely on auxiliary guidance to reduce the search space or make up for limited domain-specific knowledge. Such guidance includes human-provided manuals and demonstrations, retrieved examples from memory or external tools, high-level heuristics, and agent-acquired knowledge from prior interactions. However, this guidance may be imperfect. For example, due to changes in the environment, ambiguous or simplified language, or retrieval errors from external sources, guidance can be incomplete, outdated, or contextually mismatched, potentially causing errors or failures during task execution. To address this, we introduce MIRAGE, a benchmark for MeasurIng Robustness of LLM Agents under Imperfect GuidancE. MIRAGE includes procedurally generated environments in navigation, cooking, and gaming, where both the environment and the auxiliary guidance vary in fidelity and relevance. We further extend MIRAGE to realistic web tasks via WebArena, using noisy or underspecified instructions extracted from demonstrations. Our findings reveal critical failure modes in current LLM agents and motivate future work on improving their robustness under imperfect guidance.

2025

pdf bib abs

Large Language Models (LLMs) demonstrate strong abilities in common-sense reasoning and interactive decision-making, but often struggle with complex, long-horizon planning tasks. Recent techniques have sought to structure LLM outputs using control flow and code to improve planning performance. However, code-based approaches can be error-prone and insufficient for handling ambiguous or unstructured data. To address these challenges, we propose REPL-Plan, an LLM planning approach that is fully code-expressive (it can utilize all the benefits of code) while also being dynamic (it can flexibly adapt from errors and use the LLM for soft reasoning). In REPL-Plan, an LLM solves tasks by interacting with a Read-Eval-Print Loop (REPL), which iteratively executes and evaluates code, similar to language shells or interactive code notebooks, allowing the model to flexibly correct errors and handle tasks dynamically. We demonstrate that REPL-Plan achieves strong results across various planning domains compared to previous methods.

2024

pdf bib abs

Auto-Intent: Automated Intent Discovery and Self-Exploration for Large Language Model Web Agents
Jaekyeom Kim | Dong-Ki Kim | Lajanugen Logeswaran | Sungryull Sohn | Honglak Lee
Findings of the Association for Computational Linguistics: EMNLP 2024

In this paper, we introduce Auto-Intent, a method to adapt a pre-trained large language model (LLM) as an agent for a target domain without direct fine-tuning, where we empirically focus on web navigation tasks. Our approach first discovers the underlying intents from target domain demonstrations unsupervisedly, in a highly compact form (up to three words). With the extracted intents, we train our intent predictor to predict the next intent given the agent’s past observations and actions. In particular, we propose a self-exploration approach where top-k probable intent predictions are provided as a hint to the pre-trained LLM agent, which leads to enhanced decision-making capabilities. Auto-Intent substantially improves the performance of GPT-3.5, 4 and Llama-3.1-70B, 405B agents on the large-scale real-website navigation benchmarks from Mind2Web and online navigation tasks from WebArena with its cross-benchmark generalization from Mind2Web.

pdf bib abs

Self-correction has emerged as a promising solution to boost the reasoning performance of large language models (LLMs), where LLMs refine their solutions using self-generated critiques that pinpoint the errors. This work explores whether small (≤ 13B) language models (LMs) have the ability of self-correction on reasoning tasks with minimal inputs from stronger LMs. We propose a novel pipeline that prompts smaller LMs to collect self-correction data that supports the training of self-refinement abilities. First, we leverage correct solutions to guide the model in critiquing their incorrect responses. Second, the generated critiques, after filtering, are used for supervised fine-tuning of the self-correcting reasoner through solution refinement. Our experimental results show improved self-correction abilities of two models on five datasets spanning math and commonsense reasoning, with notable performance gains when paired with a strong GPT-4-based verifier, though limitations are identified when using a weak self-verifier for determining when to correct.

Co-authors

Yao Fu 1

Sathvika Ayyappa Prabhu 1

Ran Qiu 1

Huijie Tang 1

Lu Wang 1

Yunxiang Zhang 1

Venues

Fix author