@inproceedings{arcadinho-etal-2024-automated,
title = "Automated test generation to evaluate tool-augmented {LLM}s as conversational {AI} agents",
author = "Arcadinho, Samuel and
Aparicio, David Oliveira and
Almeida, Mariana S. C.",
editor = "Hupkes, Dieuwke and
Dankers, Verna and
Batsuren, Khuyagbaatar and
Kazemnejad, Amirhossein and
Christodoulopoulos, Christos and
Giulianelli, Mario and
Cotterell, Ryan",
booktitle = "Proceedings of the 2nd GenBench Workshop on Generalisation (Benchmarking) in NLP",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.genbench-1.4",
pages = "54--68",
abstract = "Tool-augmented LLMs are a promising approach to create AI agents that can have realistic conversations, follow procedures, and call appropriate functions. However, evaluating them is challenging due to the diversity of possible conversations, and existing datasets focus only on single interactions and function-calling. We present a test generation pipeline to evaluate LLMs as conversational AI agents. Our framework uses LLMs to generate diverse tests grounded on user-defined procedures. For that, we use intermediate graphs to limit the LLM test generator{'}s tendency to hallucinate content that is not grounded on input procedures, and enforce high coverage of the possible conversations. Additionally, we put forward ALMITA, a manually curated dataset for evaluating AI agents in customer support, and use it to evaluate existing LLMs. Our results show that while tool-augmented LLMs perform well in single interactions, they often struggle to handle complete conversations. While our focus is on customer support, our test generation pipeline is general enough to evaluate different AI agents.",
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="arcadinho-etal-2024-automated">
<titleInfo>
<title>Automated test generation to evaluate tool-augmented LLMs as conversational AI agents</title>
</titleInfo>
<name type="personal">
<namePart type="given">Samuel</namePart>
<namePart type="family">Arcadinho</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">David</namePart>
<namePart type="given">Oliveira</namePart>
<namePart type="family">Aparicio</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mariana</namePart>
<namePart type="given">S</namePart>
<namePart type="given">C</namePart>
<namePart type="family">Almeida</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2024-11</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 2nd GenBench Workshop on Generalisation (Benchmarking) in NLP</title>
</titleInfo>
<name type="personal">
<namePart type="given">Dieuwke</namePart>
<namePart type="family">Hupkes</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Verna</namePart>
<namePart type="family">Dankers</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Khuyagbaatar</namePart>
<namePart type="family">Batsuren</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Amirhossein</namePart>
<namePart type="family">Kazemnejad</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Christos</namePart>
<namePart type="family">Christodoulopoulos</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mario</namePart>
<namePart type="family">Giulianelli</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ryan</namePart>
<namePart type="family">Cotterell</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Miami, Florida, USA</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
<abstract>Tool-augmented LLMs are a promising approach to create AI agents that can have realistic conversations, follow procedures, and call appropriate functions. However, evaluating them is challenging due to the diversity of possible conversations, and existing datasets focus only on single interactions and function-calling. We present a test generation pipeline to evaluate LLMs as conversational AI agents. Our framework uses LLMs to generate diverse tests grounded on user-defined procedures. For that, we use intermediate graphs to limit the LLM test generator’s tendency to hallucinate content that is not grounded on input procedures, and enforce high coverage of the possible conversations. Additionally, we put forward ALMITA, a manually curated dataset for evaluating AI agents in customer support, and use it to evaluate existing LLMs. Our results show that while tool-augmented LLMs perform well in single interactions, they often struggle to handle complete conversations. While our focus is on customer support, our test generation pipeline is general enough to evaluate different AI agents.</abstract>
<identifier type="citekey">arcadinho-etal-2024-automated</identifier>
<location>
<url>https://aclanthology.org/2024.genbench-1.4</url>
</location>
<part>
<date>2024-11</date>
<extent unit="page">
<start>54</start>
<end>68</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Automated test generation to evaluate tool-augmented LLMs as conversational AI agents
%A Arcadinho, Samuel
%A Aparicio, David Oliveira
%A Almeida, Mariana S. C.
%Y Hupkes, Dieuwke
%Y Dankers, Verna
%Y Batsuren, Khuyagbaatar
%Y Kazemnejad, Amirhossein
%Y Christodoulopoulos, Christos
%Y Giulianelli, Mario
%Y Cotterell, Ryan
%S Proceedings of the 2nd GenBench Workshop on Generalisation (Benchmarking) in NLP
%D 2024
%8 November
%I Association for Computational Linguistics
%C Miami, Florida, USA
%F arcadinho-etal-2024-automated
%X Tool-augmented LLMs are a promising approach to create AI agents that can have realistic conversations, follow procedures, and call appropriate functions. However, evaluating them is challenging due to the diversity of possible conversations, and existing datasets focus only on single interactions and function-calling. We present a test generation pipeline to evaluate LLMs as conversational AI agents. Our framework uses LLMs to generate diverse tests grounded on user-defined procedures. For that, we use intermediate graphs to limit the LLM test generator’s tendency to hallucinate content that is not grounded on input procedures, and enforce high coverage of the possible conversations. Additionally, we put forward ALMITA, a manually curated dataset for evaluating AI agents in customer support, and use it to evaluate existing LLMs. Our results show that while tool-augmented LLMs perform well in single interactions, they often struggle to handle complete conversations. While our focus is on customer support, our test generation pipeline is general enough to evaluate different AI agents.
%U https://aclanthology.org/2024.genbench-1.4
%P 54-68
Markdown (Informal)
[Automated test generation to evaluate tool-augmented LLMs as conversational AI agents](https://aclanthology.org/2024.genbench-1.4) (Arcadinho et al., GenBench 2024)
ACL