Arif Türkmen
2026
Whom to Trust? Analyzing the Divergence Between User Satisfaction and LLM-as-a-Judge in E-Commerce RAG Systems
Arif Türkmen | Kaan Efe Keleş
Proceedings of the First Workshop on Multilingual Multicultural Evaluation
We study retrieval-augmented generation (RAG) evaluation in the Trendyol QA Assistant using 150k real e-commerce interactions. Our framework combines user satisfaction labels, LLM-as-a-judge scoring, and factor-based diagnostics to separate retrieval errors from generation errors. We find that judge models broadly reflect user satisfaction trends but often miss important nuances of dissatisfaction. Factor-level analysis highlights systematic error patterns across query types and context quality, demonstrating that hybrid evaluation, which combines multiple LLM judges with direct user feedback, offers the most reliable assessment strategy for production RAG systems.
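
The comparison between judge verdicts and user satisfaction described above can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not name an agreement metric, so Cohen's kappa is an assumption made here, as are the binary label encoding (1 = satisfied, 0 = dissatisfied), the function names, and the toy labels, all of which are hypothetical.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: computed from each rater's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def missed_dissatisfaction_rate(user, judge):
    """Share of user-dissatisfied (0) interactions that the judge rated as fine (1)."""
    judge_on_dissatisfied = [j for u, j in zip(user, judge) if u == 0]
    if not judge_on_dissatisfied:
        return 0.0
    return sum(j == 1 for j in judge_on_dissatisfied) / len(judge_on_dissatisfied)


# Hypothetical toy labels, not data from the paper.
user_labels = [1, 1, 0, 1, 0, 0, 1, 1]
judge_labels = [1, 1, 1, 1, 0, 1, 1, 1]

kappa = cohen_kappa(user_labels, judge_labels)
missed = missed_dissatisfaction_rate(user_labels, judge_labels)
print(f"kappa={kappa:.3f}, missed dissatisfaction={missed:.3f}")
```

Reporting a chance-corrected agreement score alongside the rate of missed dissatisfaction makes it possible for a judge to look accurate overall while still overlooking the unhappy cases, which is the pattern the abstract describes.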