@inproceedings{beyranvand-etal-2025-gear,
title = "{GEAR}: A Scalable and Interpretable Evaluation Framework for {RAG}-Based Car Assistant Systems",
author = "Beyranvand, Niloufar and
Dastmalchi, Hamidreza and
An, Aijun and
Davoudi, Heidar and
Chan, Winston and
DiCarlantonio, Ron",
editor = "Potdar, Saloni and
Rojas-Barahona, Lina and
Montella, Sebastien",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track",
month = nov,
year = "2025",
address = "Suzhou (China)",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-industry.187/",
pages = "2792--2810",
ISBN = "979-8-89176-333-3",
abstract = "Large language models (LLMs) increasingly power car assistants, enabling natural language interaction for tasks such as maintenance, troubleshooting, and operational guidance. While retrieval-augmented generation (RAG) improves grounding using vehicle manuals, evaluating response quality remains a key challenge. Traditional metrics like BLEU and ROUGE fail to capture critical aspects such as factual accuracy and information coverage. We propose GEAR, a fully automated, reference-based evaluation framework for car assistant systems. GEAR uses LLMs as evaluators to compare assistant responses against ground-truth counterparts, assessing coverage, correctness, and other dimensions of answer quality. To enable fine-grained evaluation, both responses are decomposed into key facts and labeled as essential, optional, or safety-critical using LLMs. The evaluator then determines which of these facts are correct and covered. Experiments show that GEAR aligns closely with human annotations, offering a scalable and reliable solution for evaluating car assistants."
}

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="beyranvand-etal-2025-gear">
<titleInfo>
<title>GEAR: A Scalable and Interpretable Evaluation Framework for RAG-Based Car Assistant Systems</title>
</titleInfo>
<name type="personal">
<namePart type="given">Niloufar</namePart>
<namePart type="family">Beyranvand</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Hamidreza</namePart>
<namePart type="family">Dastmalchi</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Aijun</namePart>
<namePart type="family">An</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Heidar</namePart>
<namePart type="family">Davoudi</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Winston</namePart>
<namePart type="family">Chan</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ron</namePart>
<namePart type="family">DiCarlantonio</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-11</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track</title>
</titleInfo>
<name type="personal">
<namePart type="given">Saloni</namePart>
<namePart type="family">Potdar</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Lina</namePart>
<namePart type="family">Rojas-Barahona</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Sebastien</namePart>
<namePart type="family">Montella</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Suzhou (China)</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-333-3</identifier>
</relatedItem>
<abstract>Large language models (LLMs) increasingly power car assistants, enabling natural language interaction for tasks such as maintenance, troubleshooting, and operational guidance. While retrieval-augmented generation (RAG) improves grounding using vehicle manuals, evaluating response quality remains a key challenge. Traditional metrics like BLEU and ROUGE fail to capture critical aspects such as factual accuracy and information coverage. We propose GEAR, a fully automated, reference-based evaluation framework for car assistant systems. GEAR uses LLMs as evaluators to compare assistant responses against ground-truth counterparts, assessing coverage, correctness, and other dimensions of answer quality. To enable fine-grained evaluation, both responses are decomposed into key facts and labeled as essential, optional, or safety-critical using LLMs. The evaluator then determines which of these facts are correct and covered. Experiments show that GEAR aligns closely with human annotations, offering a scalable and reliable solution for evaluating car assistants.</abstract>
<identifier type="citekey">beyranvand-etal-2025-gear</identifier>
<location>
<url>https://aclanthology.org/2025.emnlp-industry.187/</url>
</location>
<part>
<date>2025-11</date>
<extent unit="page">
<start>2792</start>
<end>2810</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T GEAR: A Scalable and Interpretable Evaluation Framework for RAG-Based Car Assistant Systems
%A Beyranvand, Niloufar
%A Dastmalchi, Hamidreza
%A An, Aijun
%A Davoudi, Heidar
%A Chan, Winston
%A DiCarlantonio, Ron
%Y Potdar, Saloni
%Y Rojas-Barahona, Lina
%Y Montella, Sebastien
%S Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
%D 2025
%8 November
%I Association for Computational Linguistics
%C Suzhou (China)
%@ 979-8-89176-333-3
%F beyranvand-etal-2025-gear
%X Large language models (LLMs) increasingly power car assistants, enabling natural language interaction for tasks such as maintenance, troubleshooting, and operational guidance. While retrieval-augmented generation (RAG) improves grounding using vehicle manuals, evaluating response quality remains a key challenge. Traditional metrics like BLEU and ROUGE fail to capture critical aspects such as factual accuracy and information coverage. We propose GEAR, a fully automated, reference-based evaluation framework for car assistant systems. GEAR uses LLMs as evaluators to compare assistant responses against ground-truth counterparts, assessing coverage, correctness, and other dimensions of answer quality. To enable fine-grained evaluation, both responses are decomposed into key facts and labeled as essential, optional, or safety-critical using LLMs. The evaluator then determines which of these facts are correct and covered. Experiments show that GEAR aligns closely with human annotations, offering a scalable and reliable solution for evaluating car assistants.
%U https://aclanthology.org/2025.emnlp-industry.187/
%P 2792-2810
Markdown (Informal)
[GEAR: A Scalable and Interpretable Evaluation Framework for RAG-Based Car Assistant Systems](https://aclanthology.org/2025.emnlp-industry.187/) (Beyranvand et al., EMNLP 2025)
ACL
Niloufar Beyranvand, Hamidreza Dastmalchi, Aijun An, Heidar Davoudi, Winston Chan, and Ron DiCarlantonio. 2025. GEAR: A Scalable and Interpretable Evaluation Framework for RAG-Based Car Assistant Systems. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 2792–2810, Suzhou (China). Association for Computational Linguistics.
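
The abstract describes a fact-level, LLM-as-judge pipeline: decompose both the assistant response and the ground-truth answer into key facts, label each fact as essential, optional, or safety-critical, and then judge which facts are covered and correct. Below is a minimal, hypothetical sketch of that loop. The function names, prompts, parsing scheme, and the `llm` callable are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of the fact-level evaluation loop described in the GEAR
# abstract. The `llm` argument is any callable that takes a prompt string and
# returns the model's text output; prompts and parsing are simplified.
from dataclasses import dataclass
from typing import Callable, Dict, List

Label = str  # assumed labels: "essential" | "optional" | "safety-critical"


@dataclass
class Fact:
    text: str
    label: Label


def decompose(answer: str, llm: Callable[[str], str]) -> List[Fact]:
    """Ask an LLM to split an answer into labeled key facts, one per line,
    formatted as '<label>: <fact>'. Parsing here is deliberately simple."""
    raw = llm(
        "Decompose the answer into key facts. Prefix each line with "
        "'essential:', 'optional:', or 'safety-critical:'.\n\n" + answer
    )
    facts = []
    for line in raw.splitlines():
        if ":" in line:
            label, text = line.split(":", 1)
            facts.append(Fact(text.strip(), label.strip().lower()))
    return facts


def judge(fact: Fact, context: str, llm: Callable[[str], str]) -> bool:
    """Binary yes/no judgment of whether a single fact is supported by the
    given text (used for both coverage and correctness checks)."""
    verdict = llm(
        f"Text:\n{context}\n\nFact: {fact.text}\n"
        "Is this fact stated and supported by the text above? Answer yes or no."
    )
    return verdict.strip().lower().startswith("yes")


def coverage_and_correctness(assistant: str, ground_truth: str,
                             llm: Callable[[str], str]) -> Dict[str, float]:
    """Coverage: fraction of ground-truth facts found in the assistant answer.
    Correctness: fraction of assistant facts supported by the ground truth."""
    gt_facts = decompose(ground_truth, llm)   # facts the answer should cover
    ans_facts = decompose(assistant, llm)     # facts the answer actually states
    covered = [f for f in gt_facts if judge(f, assistant, llm)]
    correct = [f for f in ans_facts if judge(f, ground_truth, llm)]
    return {
        "coverage": len(covered) / max(len(gt_facts), 1),
        "correctness": len(correct) / max(len(ans_facts), 1),
    }
```

A full scorer in the spirit of the paper would presumably weight safety-critical facts more heavily, evaluate additional answer-quality dimensions, and aggregate these per-question scores over a benchmark set; those details are not specified in the abstract and are omitted here.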