Evaluating Compound AI Systems through Behaviors, Not Benchmarks

Pranav Bhagat; K N Ajay Shastry; Pranoy Panda; Chaitanya Devaguptapu

doi:10.18653/v1/2025.findings-emnlp.1314

Abstract

Compound AI (CAI) systems, also referred to as LLM Agents, combine LLMs with retrievers and tools to enable information-seeking applications in the real world. Ensuring that these systems perform reliably is therefore critical. However, traditional evaluation using benchmark datasets and aggregate metrics often fails to capture their true operational performance, because understanding the operational efficacy of these information-seeking systems requires the ability to probe their behavior across a spectrum of simulated scenarios to identify potential failure modes. We therefore present a behavior-driven evaluation framework that generates test specifications - explicit descriptions of expected system behaviors in specific scenarios - aligned with real usage contexts. These test specifications serve as formal declarations of system requirements that are then automatically transformed into concrete test cases. Specifically, our framework operates in two phases: (1) generating diverse test specifications via submodular optimization over semantic diversity and document coverage of the tests, and (2) implementing these specifications through graph-based pipelines supporting both tabular and textual sources. Evaluations on the QuAC and HybriDialogue datasets, across SoTA LLMs, reveal that our framework identifies failure modes missed by traditional metrics, yielding failure rates twice as high as those observed on human-curated datasets.
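The first phase, selecting diverse test specifications by submodular optimization, can be illustrated with a small greedy sketch. This is not the authors' implementation: the abstract does not state the objective, so the facility-location term (standing in for semantic diversity), the count of distinct source documents (standing in for document coverage), the weight `lam`, and the names `greedy_select` and `objective` are all assumptions made for illustration. Both terms are monotone submodular, so plain greedy selection is a natural baseline.

```python
from typing import List, Sequence, Set


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def objective(selected: List[int], embeddings: List[Sequence[float]],
              docs: List[Set[str]], lam: float) -> float:
    """Assumed objective: facility-location term plus weighted document coverage."""
    if not selected:
        return 0.0
    # Facility location: every candidate is scored by its most similar selected
    # specification; similarities are clipped at 0 to keep the term monotone.
    representation = sum(
        max(max(cosine(embeddings[i], embeddings[j]), 0.0) for j in selected)
        for i in range(len(embeddings))
    )
    # Coverage: number of distinct source documents touched by the selection.
    covered = set().union(*(docs[j] for j in selected))
    return representation + lam * len(covered)


def greedy_select(embeddings: List[Sequence[float]], docs: List[Set[str]],
                  k: int, lam: float = 1.0) -> List[int]:
    """Greedily pick up to k candidate specifications by largest marginal gain."""
    selected: List[int] = []
    remaining = set(range(len(embeddings)))
    current = 0.0
    while remaining and len(selected) < k:
        best, best_gain = None, 0.0
        for c in sorted(remaining):  # sorted for deterministic tie-breaking
            gain = objective(selected + [c], embeddings, docs, lam) - current
            if gain > best_gain:
                best, best_gain = c, gain
        if best is None:  # no candidate improves the objective
            break
        selected.append(best)
        remaining.remove(best)
        current += best_gain
    return selected


# Toy data (hypothetical): four candidate specifications with 2-d embeddings
# and the documents each one draws on.
toy_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
toy_docs = [{"d1"}, {"d1"}, {"d2"}, {"d1", "d3"}]
chosen = greedy_select(toy_embeddings, toy_docs, k=2)
```

With these toy inputs the greedy pass prefers specifications that are both semantically spread out and touch new documents, skipping the near-duplicate candidates. A real pipeline would substitute sentence embeddings and actual document identifiers, and might weight the two terms differently.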

Anthology ID: 2025.findings-emnlp.1314
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 24193–24222
URL: https://aclanthology.org/2025.findings-emnlp.1314/
DOI: 10.18653/v1/2025.findings-emnlp.1314
Cite (ACL): Pranav Bhagat, K N Ajay Shastry, Pranoy Panda, and Chaitanya Devaguptapu. 2025. Evaluating Compound AI Systems through Behaviors, Not Benchmarks. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 24193–24222, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Evaluating Compound AI Systems through Behaviors, Not Benchmarks (Bhagat et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-emnlp.1314.pdf
Checklist: 2025.findings-emnlp.1314.checklist.pdf
