Matthew Toles
2026
FormGym: Doing Paperwork with Agents
Matthew Toles | Isaac Song | Rattandeep Singh | Zhou Yu
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
End-to-end form filling refers to automatically populating fields in a document-style form with the appropriate information derived from external data. Although this task is prevalent and useful, no formal benchmark exists for evaluating systems’ form completion accuracy. Existing datasets focus on parsing, extraction, and web form interaction rather than end-to-end completion of document-style forms. We propose FormGym, a benchmark formulation of the end-to-end form filling task that evaluates form completion accuracy. We construct FormGym by repurposing three existing datasets and adding one new dataset to achieve more challenging, diverse, and realistic test cases. Our studies show that baseline vision language agents (VLAs) perform poorly on FormGym in every scenario, primarily due to poor field localization. GUI agents perform better but suffer from high latency and cost. We therefore also introduce FieldFinder, a field localization tool that enables zero-shot VLAs to find input fields and place text in them accurately. We find that VLAs augmented with FieldFinder outperform the baselines across all models.
2025
Learning and Evaluating Factual Clarification Question Generation Without Examples
Matthew Toles | Yukun Huang | Zhou Yu
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)
Real-world tasks such as giving legal or technical advice often depend on context that is missing at the outset. The ability to elicit missing factual information by asking clarifying questions (ACQ) is an important element of real-life collaboration on such reasoning tasks. Although intent disambiguation has been heavily investigated, factual reasoning remains underexplored. To enable evaluation of clarification question generation in factual domains, we present a new task that focuses on the ability to elicit missing information in multi-hop reasoning tasks. We observe that humans outperform GPT-4o by a large margin, while Llama 3 8B Instruct does not even beat the dummy baseline on some metrics. Finally, we find that by fine-tuning Llama 3 8B Instruct on its own generations, filtered via rejection sampling, we can improve information recovery by 27.6% without using any manually labeled data.
Program Synthesis Dialog Agents for Interactive Decision-Making
Matthew Toles | Nikhil Balwani | Rattandeep Singh | Valentina Giulia Sartori Rodriguez | Zhou Yu
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Many real-world eligibility problems, ranging from medical diagnosis to tax planning, can be mapped to decision problems expressed in natural language, wherein a model must make a binary choice based on the features of the user. Large-scale domains such as legal codes or frequently updated funding opportunities render human annotation (e.g., web forms or decision trees) impractical, suggesting a need for agents that can automatically assist in decision-making. Since relevant information is often known only to the user, it is important that these agents can ask the right questions. To evaluate this task, we propose BeNYfits, a new benchmark for determining user eligibility for multiple overlapping social benefits opportunities through interactive decision-making. Our experiments show that current language models struggle with frequent hallucinations: GPT-4o scores only 35.7 F1 using a ReAct-style chain-of-thought. We therefore introduce ProADA, a novel approach that uses program synthesis to assist in decision-making by mapping dialog planning to a code generation problem and using gaps in structured data to determine the best next action. ProADA improves the F1 score to 56.2 while using nearly the same number of dialog turns.