Austen Liao
2026
TDFlow: Agentic Workflows for Test Driven Development
Kevin Han | Siddharth Maddikayala | Tim Knappe | Om Patel | Austen Liao | Amir Barati Farimani
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
We introduce TDFlow, a novel test-driven agentic workflow that frames repository-scale software engineering as a test-resolution task: given a set of human-written tests, TDFlow repeatedly proposes, revises, and debugs repository-scale patches using precisely engineered sub-agents and tightly constrained tools. The workflow decomposes program repair into four components, each governed by its own sub-agent. This simple, forced decoupling of patch proposing, debugging, patch revision, and optional test generation (1) reduces the long-context burden on any individual sub-agent, (2) focuses each sub-agent on a specific, pre-defined sub-task, and (3) allows for specialized performance improvement on specific sub-tasks. When provided human-written tests, TDFlow attains an 88.8% pass rate on SWE-Bench Lite (an absolute improvement of 27.8% over the next best baseline) and 94.3% on SWE-Bench Verified. We further show that the primary obstacle to human-level software engineering performance lies in writing successful reproduction tests. Manual inspection of the 800 TDFlow runs on SWE-Bench Lite and Verified uncovered only 7 instances of test hacking, which were subsequently counted as failures. We envision a human-LLM interactive system powered by TDFlow in which human developers write tests that LLM systems then solve. Together, these results show that modern LLMs, when embedded in a narrowly engineered, test-driven workflow, already achieve human-level test resolution; the final frontier for fully autonomous repository repair is accurate reproduction test generation.
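The propose/debug/revise decomposition described in the abstract can be sketched as a short control loop. This is a minimal, hypothetical illustration rather than TDFlow's implementation: the callables stand in for the paper's sub-agents and tools, and max_rounds is an assumed budget.

    from typing import Callable, Optional, Tuple

    def test_driven_repair(
        propose: Callable[[str], str],                 # tests -> candidate patch
        run_tests: Callable[[str], Tuple[bool, str]],  # patch -> (passed, failure log)
        debug: Callable[[str], str],                   # failure log -> diagnosis
        revise: Callable[[str, str], str],             # (patch, diagnosis) -> new patch
        tests: str,
        max_rounds: int = 5,
    ) -> Optional[str]:
        # Each callable models one sub-agent; passing only the arguments
        # shown keeps every sub-agent's context deliberately narrow.
        patch = propose(tests)
        for _ in range(max_rounds):
            passed, log = run_tests(patch)
            if passed:
                return patch              # all provided tests resolved
            diagnosis = debug(log)        # debugger sees only the failure log
            patch = revise(patch, diagnosis)
        return None                       # unresolved within the round budget

The point of the structure is that no single sub-agent ever sees the full interaction history, which is what reduces the long-context burden described above.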
2025
EnDive: A Cross-Dialect Benchmark for Fairness and Performance in Large Language Models
Abhay Gupta | Jacob Cheung | Philip Meng | Shayan Sayyed | Kevin Zhu | Austen Liao | Sean O’Brien
Findings of the Association for Computational Linguistics: EMNLP 2025
The diversity of human language, shaped by social, cultural, and regional influences, presents significant challenges for natural language processing (NLP) systems. Existing benchmarks often overlook intra-language variations, leaving speakers of non-standard dialects underserved. To address this gap, we introduce EnDive (English Diversity), a benchmark that evaluates seven state-of-the-art (SOTA) large language models (LLMs) across tasks in language understanding, algorithmic reasoning, mathematics, and logic. Our framework translates Standard American English (SAE) datasets into five underrepresented dialects using few-shot prompting with verified examples from native speakers, and compares these translations against rule-based methods via fluency assessments, preference tests, and semantic similarity metrics. Human evaluations confirm high translation quality, with average scores of at least 6.02/7 for faithfulness, fluency, and formality. By filtering out near-identical translations, we create a challenging dataset that reveals significant performance disparities: models consistently underperform on dialectal inputs compared to SAE. EnDive thus advances dialect-aware NLP by uncovering model biases and promoting more equitable language technologies.
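The near-identical filter mentioned above lends itself to a short sketch. A minimal version, assuming an arbitrary sentence-embedding function and an illustrative cosine cutoff (the paper's exact metric and threshold may differ):

    from typing import Callable, List, Tuple
    import numpy as np

    def filter_near_identical(
        pairs: List[Tuple[str, str]],        # (SAE sentence, dialect translation)
        embed: Callable[[str], np.ndarray],  # any sentence-embedding model
        threshold: float = 0.98,             # assumed cutoff, not EnDive's
    ) -> List[Tuple[str, str]]:
        kept = []
        for sae, dialect in pairs:
            a, b = embed(sae), embed(dialect)
            cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
            # Drop pairs whose translation barely differs from the SAE
            # source, keeping only genuinely dialectal rewrites.
            if cosine < threshold:
                kept.append((sae, dialect))
        return kept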
Self Knowledge-Tracing for Tool Use (SKT-Tool): Helping LLM Agents Understand Their Capabilities in Tool Use
Joshua Vigel | Renpei Cai | Eleanor Chen | Anish Neema | Austen Liao | Kevin Zhu | Sean O’Brien
The Sixth Workshop on Insights from Negative Results in NLP
Large Language Models (LLMs) enhanced with tool use and APIs improve task performance but often misuse them, leading to inefficiency and unnecessary cost. We propose Self Knowledge-Tracing for Tool Use (SKT-Tool), a method enabling LLMs to assess their capabilities and make informed API usage decisions using knowledge tracing (KT). Our teacher-student framework helps LLMs optimize API calls in real-time without fine-tuning. Experiments across multiple datasets show that SKT-Tool significantly reduces API calls while maintaining accuracy, offering a scalable and cost-effective solution for tool-augmented LLMs. We conclude by analyzing shortcomings in this method and identifying directions for future work.
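The gating decision at the heart of SKT-Tool can be illustrated with a toy knowledge-tracing tracker. This sketch is hypothetical (the paper's teacher-student framework is richer): it simply tracks the model's observed unaided success rate per task type and permits an API call only when that estimate falls below a threshold.

    from collections import defaultdict

    class ToolUseGate:
        """Toy per-task-type knowledge tracing for tool-use decisions."""

        def __init__(self, call_threshold: float = 0.7, prior: float = 0.5):
            # Each entry holds [successes, trials], seeded with a weak prior.
            self.stats = defaultdict(lambda: [prior, 1.0])
            self.call_threshold = call_threshold

        def should_call_tool(self, task_type: str) -> bool:
            successes, trials = self.stats[task_type]
            # Low estimated self-capability -> spend the API call.
            return (successes / trials) < self.call_threshold

        def update(self, task_type: str, solved_without_tool: bool) -> None:
            # Teacher signal: was the unaided answer correct?
            self.stats[task_type][0] += 1.0 if solved_without_tool else 0.0
            self.stats[task_type][1] += 1.0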