Kushan Mitra
2026
Verification-Aware Planning for Multi-Agent Systems
Tianyang Xu | Dan Zhang | Kushan Mitra | Estevam Hruschka
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Tianyang Xu | Dan Zhang | Kushan Mitra | Estevam Hruschka
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language model (LLM) agents are increasingly deployed to tackle complex tasks, often necessitating collaboration among multiple specialized agents. However, multi-agent collaboration introduces new challenges in planning, coordination, and verification. Execution failures frequently arise not from flawed reasoning alone, but from subtle misalignments in task interpretation, output format, or inter-agent handoffs. To address these challenges, we present VeriMAP, a framework for multi-agent collaboration with verification-aware planning. The VeriMAP planner decomposes tasks, models subtask dependencies, and encodes planner-defined passing criteria as subtask verification functions (VFs) in Python and natural language. We evaluate VeriMAP on diverse datasets, demonstrating that it outperforms both single- and multi-agent baselines while enhancing system robustness and interpretability. Our analysis highlights how verification-aware planning enables reliable coordination and iterative refinement in multi-agent systems, without relying on external labels or annotations.
RECAP: REwriting Conversations for Intent Understanding in Agentic Planning
Kushan Mitra | Dan Zhang | Hannah Kim | Estevam Hruschka
Findings of the Association for Computational Linguistics: EACL 2026
Kushan Mitra | Dan Zhang | Hannah Kim | Estevam Hruschka
Findings of the Association for Computational Linguistics: EACL 2026
Understanding user intent is essential for effective planning in conversational assistants, particularly those powered by large language models (LLMs) coordinating multiple agents. However, real-world dialogues are often ambiguous, underspecified, or dynamic, making intent understanding a persistent challenge. Traditional classification-based approaches struggle to generalize in open-ended settings, leading to brittle interpretations and poor downstream planning.We propose RECAP (REwriting Conversations for Agent Planning), a new benchmark designed to evaluate and advance intent rewriting, reframing user-agent dialogues into concise representations of user goals. RECAP captures diverse challenges such as ambiguity, intent drift, vagueness, and mixed-goal conversations. Alongside the dataset, we introduce an LLM-based evaluator that compares planning utility given a user-agent dialogue.Using RECAP, we develop a prompt-based rewriting approach that outperforms baselines, in terms of plan preference. We further demonstrate that fine-tuning two DPO-based rewriters yields additional utility gains. Our results highlight intent rewriting as a critical and tractable component for improving agentic planning in open-domain dialogue systems.
2025
AIPOM: Agent-aware Interactive Planning for Multi-Agent Systems
Hannah Kim | Kushan Mitra | Chen Shen | Dan Zhang | Estevam Hruschka
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Hannah Kim | Kushan Mitra | Chen Shen | Dan Zhang | Estevam Hruschka
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Large language models (LLMs) are being increasingly used for planning in orchestrated multi-agent systems. However, existing LLM-based approaches often fall short of human expectations and, critically, lack effective mechanisms for users to inspect, understand, and control their behaviors. These limitations call for enhanced transparency, controllability, and human oversight. To address this, we introduce AIPOM, a system supporting human-in-the-loop planning through conversational and graph-based interfaces. AIPOM enables users to transparently inspect, refine, and collaboratively guide LLM-generated plans, significantly enhancing user control and trust in multi-agent workflows. Our code and demo video are available at https://github.com/megagonlabs/aipom.
FactLens: Benchmarking Fine-Grained Fact Verification
Kushan Mitra | Dan Zhang | Sajjadur Rahman | Estevam Hruschka
Findings of the Association for Computational Linguistics: ACL 2025
Kushan Mitra | Dan Zhang | Sajjadur Rahman | Estevam Hruschka
Findings of the Association for Computational Linguistics: ACL 2025
Large Language Models (LLMs) have shown impressive capability in language generation and understanding, but their tendency to hallucinate and produce factually incorrect information remains a key limitation. To verify LLM-generated contents and claims from other sources, traditional verification approaches often rely on holistic models that assign a single factuality label to complex claims, potentially obscuring nuanced errors. In this paper, we advocate for a shift towards fine-grained verification, where complex claims are broken down into smaller sub-claims for individual verification, allowing for more precise identification of inaccuracies, improved transparency, and reduced ambiguity in evidence retrieval. However, generating sub-claims poses challenges, such as maintaining context and ensuring semantic equivalence with respect to the original claim. We introduce **FactLens**, a benchmark for evaluating fine-grained fact verification, with metrics and automated evaluators of sub-claim quality. The benchmark data is manually curated to ensure high-quality ground truth. Our results show alignment between automated FactLens evaluators and human judgments, and we discuss the impact of sub-claim characteristics on the overall verification performance.
2024
MEGAnno+: A Human-LLM Collaborative Annotation System
Hannah Kim | Kushan Mitra | Rafael Li Chen | Sajjadur Rahman | Dan Zhang
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations
Hannah Kim | Kushan Mitra | Rafael Li Chen | Sajjadur Rahman | Dan Zhang
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations
Large language models (LLMs) can label data faster and cheaper than humans for various NLP tasks. Despite their prowess, LLMs may fall short in understanding of complex, sociocultural, or domain-specific context, potentially leading to incorrect annotations. Therefore, we advocate a collaborative approach where humans and LLMs work together to produce reliable and high-quality labels. We present MEGAnno+, a human-LLM collaborative annotation system that offers effective LLM agent and annotation management, convenient and robust LLM annotation, and exploratory verification of LLM labels by humans.
Characterizing Large Language Models as Rationalizers of Knowledge-intensive Tasks
Aditi Mishra | Sajjadur Rahman | Kushan Mitra | Hannah Kim | Estevam Hruschka
Findings of the Association for Computational Linguistics: ACL 2024
Aditi Mishra | Sajjadur Rahman | Kushan Mitra | Hannah Kim | Estevam Hruschka
Findings of the Association for Computational Linguistics: ACL 2024
Large language models (LLMs) are proficient at generating fluent text with minimal task-specific supervision. However, their ability to generate rationales for knowledge-intensive tasks (KITs) remains under-explored. Generating rationales for KIT solutions, such as commonsense multiple-choice QA, requires external knowledge to support predictions and refute alternate options. In this work, we consider the task of generating retrieval-augmented rationalization of KIT model predictions via external knowledge guidance within a few-shot setting. Surprisingly, crowd-workers preferred LLM-generated rationales over existing crowd-sourced rationales, generated in a similar knowledge-guided setting, on aspects such as factuality, sufficiency, and convincingness. However, fine-grained evaluation of such rationales highlights the need for further improvements in conciseness, novelty, and domain invariance. Additionally, through an expert-sourced study evaluating the reliability of the rationales, we demonstrate that humans’ trust in LLM-generated rationales erodes when communicated faithfully, i.e., without taking model prediction accuracy into account. We find that even instrumenting simple guardrails can be effective for reliable rationalization.