Automatic Evaluation of Open-Domain Real Conversations: Combining Encoder-Based, Dialogue-Based Features and Large Language Models Ratings
Cristina Conforto-López | Marcos Estecha-Garitagoitia | Mario Rodriguez-Cantelar | Ricardo de Córdoba | Luis Fernando D’Haro
Proceedings of the 16th International Workshop on Spoken Dialogue System Technology, 2026
Conversational AI is a central application of NLP, yet ensuring high response quality remains challenging due to the inherently subjective nature of user satisfaction. Dialogue evaluation can be performed manually (through expert or user ratings) or automatically, using methods that aim to predict quality scores consistent with human judgment. In this work, we present a reference-free automatic dialogue evaluation system that predicts user ratings from a dataset of real human–chatbot interactions collected during the Alexa Prize Socialbot Grand Challenge 5, combining multiple complementary models to improve correlation with human scores. Experimental results show that the model achieving the highest Pearson correlation with user ratings is an XGBoost regressor that combines features such as conversation length, engineered flags capturing conversation characteristics, predictions from an Encoder-based Panel of Experts (PoE), and instruction-based outputs from a fine-tuned LLM. The overall Pearson correlation on the evaluation set is 0.404, which is competitive with prior work trained on an order of magnitude more dialogues, albeit using different datasets and system configurations.
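To make the feature-combination step concrete, below is a minimal sketch of the approach the abstract describes: stacking per-dialogue features (conversation length, engineered flags, PoE predictions, and a fine-tuned-LLM rating) into an XGBoost regressor and scoring it by Pearson correlation against user ratings. All feature values, array shapes, and hyperparameters here are illustrative placeholders, not the paper's actual pipeline.

```python
import numpy as np
from scipy.stats import pearsonr
from xgboost import XGBRegressor

# Hypothetical per-dialogue feature matrix. Each row stacks the feature
# groups named in the abstract: conversation length, engineered flags,
# Panel-of-Experts (PoE) predictions, and a fine-tuned-LLM rating. The
# exact features and their extraction are assumptions, not the paper's code.
rng = np.random.default_rng(0)
n_dialogues = 500
X = np.column_stack([
    rng.integers(2, 60, n_dialogues),      # conversation length (turns)
    rng.integers(0, 2, (n_dialogues, 3)),  # engineered binary flags
    rng.random((n_dialogues, 4)),          # PoE sub-metric predictions
    rng.random(n_dialogues) * 5,           # fine-tuned-LLM rating (0-5)
])
y = rng.random(n_dialogues) * 5            # user ratings (placeholder)

# Plain train/eval split and a vanilla XGBoost regressor; these
# hyperparameters are illustrative defaults, not the paper's configuration.
split = int(0.8 * n_dialogues)
model = XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.05)
model.fit(X[:split], y[:split])

# Evaluate with the criterion the abstract reports: Pearson correlation
# between predicted scores and held-out user ratings.
r, _ = pearsonr(model.predict(X[split:]), y[split:])
print(f"Pearson correlation on the held-out set: {r:.3f}")
```

With the random placeholder data above the correlation is naturally near zero; the point of the sketch is only the structure of the pipeline, in which heterogeneous quality signals are fused by a gradient-boosted regressor rather than averaged directly.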