Evaluation of Response Generation Models: Shouldn’t It Be Shareable and Replicable?

Seyed Mahed Mousavi, Gabriel Roccabruna, Michela Lorandi, Simone Caldarella, Giuseppe Riccardi

Abstract
Human Evaluation (HE) of automatically generated responses is necessary for the advancement of human-machine dialogue research. Current automatic evaluation measures are poor surrogates, at best. There are no agreed-upon HE protocols, and it is difficult to develop them. As a result, researchers either perform non-replicable, non-transparent, and inconsistent procedures or, worse, limit themselves to automated metrics. We propose to standardize the human evaluation of response generation models by publicly sharing a detailed protocol. The proposal covers task design, annotator recruitment, task execution, and annotation reporting. Such a protocol and process can be used as-is, in whole or in part, or modified and extended by the research community. We validate the protocol by evaluating two conversationally fine-tuned state-of-the-art models (GPT-2 and T5) on the complex task of personalized response generation. We invite the community to use this protocol, or its future community-amended versions, as a transparent, replicable, and comparable approach to HE of generated responses.
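As an illustrative aside on the annotation-reporting step named above, protocols of this kind typically report inter-annotator agreement alongside the collected judgments. Below is a minimal sketch in Python, assuming two annotators assign categorical labels to the same set of generated responses; the label set and ratings are hypothetical and not taken from the paper.

    # Minimal sketch: chance-corrected inter-annotator agreement for an HE report.
    # The label set here is hypothetical; the paper does not prescribe it.
    from sklearn.metrics import cohen_kappa_score

    # One categorical judgment per generated response, in the same order.
    annotator_a = ["appropriate", "appropriate", "generic", "contradictory", "appropriate"]
    annotator_b = ["appropriate", "generic", "generic", "contradictory", "appropriate"]

    kappa = cohen_kappa_score(annotator_a, annotator_b)
    print(f"Cohen's kappa: {kappa:.2f}")  # agreement corrected for chance, in [-1, 1]

Reporting a chance-corrected coefficient such as Cohen's kappa, rather than raw percent agreement, keeps the figure comparable across label distributions and annotation batches.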
Anthology ID:
2022.gem-1.12
Volume:
Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates (Hybrid)
Editors:
Antoine Bosselut, Khyathi Chandu, Kaustubh Dhole, Varun Gangal, Sebastian Gehrmann, Yacine Jernite, Jekaterina Novikova, Laura Perez-Beltrachini
Venue:
GEM
SIG:
SIGGEN
Publisher:
Association for Computational Linguistics
Pages:
136–147
URL:
https://aclanthology.org/2022.gem-1.12
DOI:
10.18653/v1/2022.gem-1.12
Cite (ACL):
Seyed Mahed Mousavi, Gabriel Roccabruna, Michela Lorandi, Simone Caldarella, and Giuseppe Riccardi. 2022. Evaluation of Response Generation Models: Shouldn’t It Be Shareable and Replicable? In Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), pages 136–147, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
Cite (Informal):
Evaluation of Response Generation Models: Shouldn’t It Be Shareable and Replicable? (Mousavi et al., GEM 2022)
PDF:
https://aclanthology.org/2022.gem-1.12.pdf
Video:
https://aclanthology.org/2022.gem-1.12.mp4