Whom to Trust? Analyzing the Divergence Between User Satisfaction and LLM-as-a-Judge in E-Commerce RAG Systems

Arif Türkmen, Kaan Efe Keleş


Abstract
We study retrieval-augmented generation (RAG) evaluation in the Trendyol QA Assistant using 150k real e-commerce interactions. Our framework combines user satisfaction labels, LLM-as-a-judge scoring, and factor-based diagnostics to separate retrieval errors from generation errors. We find that judge models broadly reflect user satisfaction trends, though they often miss important nuances of dissatisfaction. Factor-level analysis highlights systematic error patterns across query types and context quality, demonstrating that hybrid evaluation, combining multiple LLM judges with direct user feedback, offers the most reliable assessment strategy for production RAG systems.
Anthology ID:
2026.mme-main.12
Volume:
Proceedings of the First Workshop on Multilingual Multicultural Evaluation
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Pinzhen Chen, Vilém Zouhar, Hanxu Hu, Simran Khanuja, Wenhao Zhu, Barry Haddow, Alexandra Birch, Alham Fikri Aji, Rico Sennrich, Sara Hooker
Venues:
MME | WS
Publisher:
Association for Computational Linguistics
Pages:
189–195
URL:
https://aclanthology.org/2026.mme-main.12/
Cite (ACL):
Arif Türkmen and Kaan Efe Keleş. 2026. Whom to Trust? Analyzing the Divergence Between User Satisfaction and LLM-as-a-Judge in E-Commerce RAG Systems. In Proceedings of the First Workshop on Multilingual Multicultural Evaluation, pages 189–195, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
Whom to Trust? Analyzing the Divergence Between User Satisfaction and LLM-as-a-Judge in E-Commerce RAG Systems (Türkmen & Keleş, MME 2026)
PDF:
https://aclanthology.org/2026.mme-main.12.pdf