Automatic Evaluation of Open-Domain Real Conversations: Combining Encoder-Based, Dialogue-Based Features and Large Language Models Ratings

Cristina Conforto López; Marcos Estecha-Goritagoitia; Mario Rodríguez-Cantelar; Ricardo Córdoba; Luis Fernando D’Haro

Abstract

Conversational AI is a central application of NLP, yet ensuring high response quality remains challenging because user satisfaction is inherently subjective. Dialogue evaluation can be performed manually, through expert or user ratings, or automatically, using methods that aim to predict quality scores consistent with human judgment. In this work, we present a reference-free automatic dialogue evaluation system that predicts user ratings on a dataset of real human–chatbot interactions collected during the Alexa Prize Socialbot Grand Challenge 5, combining multiple complementary models to improve correlation with human scores. Experimental results indicate that the highest Pearson correlation with users’ ratings is achieved by an XGBoost regression model that combines features such as conversation length, engineered flags capturing conversation characteristics, predictions from an Encoder-based Panel of Experts (PoE), and instruction-based outputs from a fine-tuned LLM. The overall Pearson correlation on the evaluation set is 0.404, which is competitive with prior work trained on an order of magnitude more dialogues, albeit using different datasets and system configurations.
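
For readers unfamiliar with the metric reported in the abstract, the following is a minimal sketch of how a Pearson correlation between predicted and user-assigned ratings can be computed. It is not the authors' code: the function name `pearson` and the rating values are hypothetical and chosen only for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: user ratings on a 1-5 scale and a model's predictions.
user_ratings = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [2.1, 2.0, 3.2, 3.9, 4.4]

r = pearson(predicted, user_ratings)
print(f"Pearson r = {r:.3f}")
```

A value near 1 means the predictions rise and fall with the user ratings; a value near 0 means there is no linear relationship between them.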

Anthology ID: 2026.iwsds-1.5
Volume: Proceedings of the 16th International Workshop on Spoken Dialogue System Technology
Month: February
Year: 2026
Address: Trento, Italy
Editors: Giuseppe Riccardi, Seyed Mahed Mousavi, Maria Ines Torres, Koichiro Yoshino, Zoraida Callejas, Shammur Absar Chowdhury, Yun-Nung Chen, Frederic Bechet, Joakim Gustafson, Géraldine Damnati, Alex Papangelis, Luis Fernando D’Haro, John Mendonça, Raffaella Bernardi, Dilek Hakkani-Tur, Giuseppe "Pino" Di Fabbrizio, Tatsuya Kawahara, Firoj Alam, Gokhan Tur, Michael Johnston
Venue: IWSDS
Publisher: Association for Computational Linguistics
Pages: 52–63
URL: https://aclanthology.org/2026.iwsds-1.5/
Cite (ACL): Cristina Conforto López, Marcos Estecha-Goritagoitia, Mario Rodriguez-Cantelar, Ricardo Cordoba, and Luis Fernando D’Haro. 2026. Automatic Evaluation of Open-Domain Real Conversations: Combining Encoder-Based, Dialogue-Based Features and Large Language Models Ratings. In Proceedings of the 16th International Workshop on Spoken Dialogue System Technology, pages 52–63, Trento, Italy. Association for Computational Linguistics.
Cite (Informal): Automatic Evaluation of Open-Domain Real Conversations: Combining Encoder-Based, Dialogue-Based Features and Large Language Models Ratings (Conforto López et al., IWSDS 2026)
PDF: https://aclanthology.org/2026.iwsds-1.5.pdf
