Peter Boothroyd


2025

ASPERA: A Simulated Environment to Evaluate Planning for Complex Action Execution
Alexandru Coca | Mark Gaynor | Zhenxing Zhang | Jianpeng Cheng | Bo-Hsiang Tseng | Peter Boothroyd | Hector Martinez Alonso | Diarmuid O Seaghdha | Anders Johannsen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

This work evaluates the potential of large language models (LLMs) to power digital assistants capable of complex action execution. Such assistants rely on pre-trained programming knowledge to execute multi-step goals by composing objects and functions defined in assistant libraries into action execution programs. To achieve this, we develop ASPERA, a framework comprising an assistant library simulation and a human-assisted LLM data generation engine. Our engine allows developers to guide LLM generation of high-quality tasks consisting of complex user queries, simulation state and corresponding validation programs, tackling data availability and evaluation robustness challenges. Alongside the framework we release Asper-Bench, an evaluation dataset of 250 challenging tasks generated using ASPERA, which we use to show that program generation grounded in custom assistant libraries is a significant challenge to LLMs compared to dependency-free code generation.

PyTOD: Programmable Task-Oriented Dialogue with Execution Feedback
Alexandru Coca | Bo-Hsiang Tseng | Peter Boothroyd | Jianpeng Cheng | Zhenxing Zhang | Mark Gaynor | Joe Stacey | Tristan Guigue | Héctor Martínez Alonso | Diarmuid Ó Séaghdha | Anders Johannsen
Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Programmable task-oriented dialogue (TOD) agents enable language models to follow structured dialogue policies, but their effectiveness hinges on accurate dialogue state tracking (DST). We present PyTOD, an agent that generates executable code to track dialogue state and uses policy and execution feedback for efficient error correction. To achieve this, PyTOD employs a simple constrained decoding approach, using a language model instead of grammar rules to follow API schemata. This leads to state-of-the-art DST performance on the challenging SGD benchmark. Our experiments show that PyTOD surpasses strong baselines in both accuracy and cross-turn consistency, demonstrating the effectiveness of execution-aware state tracking.