RADDLE: An Evaluation Benchmark and Analysis Platform for Robust Task-oriented Dialog Systems

Baolin Peng, Chunyuan Li, Zhu Zhang, Chenguang Zhu, Jinchao Li, Jianfeng Gao


Abstract
For task-oriented dialog systems to be maximally useful, they must be able to process conversations in a way that is (1) generalizable with a small number of training examples for new task domains, and (2) robust to user input in various styles, modalities, or domains. In pursuit of these goals, we introduce the RADDLE benchmark, a collection of corpora and tools for evaluating the performance of models across a diverse set of domains. By including tasks with limited training data, RADDLE is designed to favor and encourage models with a strong generalization ability. RADDLE also includes a diagnostic checklist that facilitates detailed robustness analysis in aspects such as language variations, speech errors, unseen entities, and out-of-domain utterances. We evaluate recent state-of-the-art systems based on pre-training and fine-tuning, and find that grounded pre-training on heterogeneous dialog corpora performs better than training a separate model per domain. Adversarial training is also proposed to improve model robustness against noisy inputs. Overall, existing models are less than satisfactory in robustness evaluation, which suggests opportunities for future improvement.
Anthology ID:
2021.acl-long.341
Volume:
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Month:
August
Year:
2021
Address:
Online
Editors:
Chengqing Zong, Fei Xia, Wenjie Li, Roberto Navigli
Venues:
ACL | IJCNLP
Publisher:
Association for Computational Linguistics
Pages:
4418–4429
URL:
https://aclanthology.org/2021.acl-long.341
DOI:
10.18653/v1/2021.acl-long.341
Bibkey:
Cite (ACL):
Baolin Peng, Chunyuan Li, Zhu Zhang, Chenguang Zhu, Jinchao Li, and Jianfeng Gao. 2021. RADDLE: An Evaluation Benchmark and Analysis Platform for Robust Task-oriented Dialog Systems. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4418–4429, Online. Association for Computational Linguistics.
Cite (Informal):
RADDLE: An Evaluation Benchmark and Analysis Platform for Robust Task-oriented Dialog Systems (Peng et al., ACL-IJCNLP 2021)
PDF:
https://aclanthology.org/2021.acl-long.341.pdf
Video:
https://aclanthology.org/2021.acl-long.341.mp4
Data
GLUE | MultiWOZ