Understanding the Impact of Experiment Design for Evaluating Dialogue System Output

Sashank Santhanam, Samira Shaikh


Abstract
Evaluation of output from natural language generation (NLG) systems is typically conducted via crowdsourced human judgments. To understand how experiment design might affect the quality and consistency of such human judgments, we designed a between-subjects study with four experimental conditions. Through our systematic study with 40 crowdsourced workers in each task, we find that using continuous scales achieves more consistent ratings than Likert-scale or ranking-based experiment designs. Additionally, we find that factors such as no prior experience of participating in similar studies of rating dialogue system output
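As an illustrative aside (not part of the paper), the kind of rating-consistency comparison described in the abstract is often quantified with an inter-rater agreement statistic such as Krippendorff's alpha. The sketch below uses hypothetical rating matrices and the open-source krippendorff Python package; the paper does not prescribe this particular metric or library.

# Illustrative only: compare inter-rater consistency of hypothetical
# Likert-scale vs. continuous-scale ratings using Krippendorff's alpha.
# Requires: pip install krippendorff numpy
import numpy as np
import krippendorff

# Rows = raters, columns = rated dialogue responses.
likert_ratings = np.array([      # hypothetical 1-5 Likert judgments
    [4, 2, 5, 3, 1],
    [5, 2, 4, 3, 2],
    [3, 1, 5, 4, 1],
], dtype=float)

continuous_ratings = np.array([  # hypothetical 0-100 slider judgments
    [78, 35, 92, 60, 15],
    [81, 30, 88, 62, 20],
    [75, 33, 90, 58, 18],
], dtype=float)

alpha_likert = krippendorff.alpha(reliability_data=likert_ratings,
                                  level_of_measurement="ordinal")
alpha_continuous = krippendorff.alpha(reliability_data=continuous_ratings,
                                      level_of_measurement="interval")

print(f"Krippendorff's alpha (Likert):     {alpha_likert:.3f}")
print(f"Krippendorff's alpha (continuous): {alpha_continuous:.3f}")

A higher alpha for one condition would indicate that raters in that condition agreed with each other more closely; the numbers above are placeholders, not results from the study.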
Anthology ID: 2020.winlp-1.33
Volume: Proceedings of the Fourth Widening Natural Language Processing Workshop
Month: July
Year: 2020
Address: Seattle, USA
Editors: Rossana Cunha, Samira Shaikh, Erika Varis, Ryan Georgi, Alicia Tsai, Antonios Anastasopoulos, Khyathi Raghavi Chandu
Venue: WiNLP
Publisher: Association for Computational Linguistics
Pages: 124–127
URL: https://aclanthology.org/2020.winlp-1.33
DOI: 10.18653/v1/2020.winlp-1.33
Cite (ACL): Sashank Santhanam and Samira Shaikh. 2020. Understanding the Impact of Experiment Design for Evaluating Dialogue System Output. In Proceedings of the Fourth Widening Natural Language Processing Workshop, pages 124–127, Seattle, USA. Association for Computational Linguistics.
Cite (Informal): Understanding the Impact of Experiment Design for Evaluating Dialogue System Output (Santhanam & Shaikh, WiNLP 2020)
Video: http://slideslive.com/38929573