Semi-Supervised Dialogue Policy Learning via Stochastic Reward Estimation

Xinting Huang, Jianzhong Qi, Yu Sun, Rui Zhang


Abstract
In task-oriented dialogue systems, dialogue policy optimization often receives feedback only upon task completion. This is insufficient for training intermediate dialogue turns, since supervision signals (or rewards) are provided only at the end of a dialogue. To address this issue, reward learning has been introduced to learn from the state-action pairs of an optimal policy and provide turn-by-turn rewards. This approach requires complete state-action annotations of human-to-human dialogues (i.e., expert demonstrations), which is labor-intensive. To overcome this limitation, we propose a novel reward learning approach for semi-supervised policy learning. The proposed approach learns a dynamics model that serves as the reward function, modeling dialogue progress (i.e., state-action sequences) based on expert demonstrations with or without annotations. The dynamics model computes rewards by predicting whether the dialogue progress is consistent with expert demonstrations. We further propose learning action embeddings to improve the generalization of the reward function. The proposed approach outperforms competitive policy learning baselines on MultiWOZ, a benchmark multi-domain dataset.
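A minimal sketch of the core idea, not the authors' code: a learned dynamics model predicts the next dialogue state from the current state and action, and the log-likelihood of the observed next state under that prediction serves as a turn-level reward (a high likelihood means the transition looks like expert dialogue progress). All identifiers here (DynamicsReward, state_dim, action_dim) are illustrative assumptions, and the Gaussian next-state predictor is a simplification of the paper's stochastic reward estimation.

```python
# Illustrative sketch only; a simplification of the paper's approach.
import torch
import torch.nn as nn

class DynamicsReward(nn.Module):
    """Predicts the next dialogue state from (state, action); the
    log-likelihood of the observed next state is used as the reward."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, state_dim)       # predicted mean
        self.log_std = nn.Linear(hidden, state_dim)  # predicted spread

    def forward(self, state, action):
        h = self.net(torch.cat([state, action], dim=-1))
        return self.mu(h), self.log_std(h).clamp(-5, 2)

    def reward(self, state, action, next_state):
        # Higher log-likelihood => transition is more consistent with
        # expert dialogue progress, so it earns a higher reward.
        mu, log_std = self(state, action)
        dist = torch.distributions.Normal(mu, log_std.exp())
        return dist.log_prob(next_state).sum(dim=-1)

# Training on expert demonstrations: maximize the log-likelihood of
# observed transitions (i.e., minimize the negative reward). The
# dimensions and random tensors below stand in for real dialogue data.
model = DynamicsReward(state_dim=16, action_dim=8)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
s, a, s_next = torch.randn(32, 16), torch.randn(32, 8), torch.randn(32, 16)
loss = -model.reward(s, a, s_next).mean()
opt.zero_grad(); loss.backward(); opt.step()
```

In the paper's setting, raw dialogue acts would additionally pass through learned action embeddings (e.g., an nn.Embedding lookup) before entering the dynamics model, which is what lets the reward generalize across related actions.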
Anthology ID: 2020.acl-main.62
Volume: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Month: July
Year: 2020
Address: Online
Editors: Dan Jurafsky, Joyce Chai, Natalie Schluter, Joel Tetreault
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 660–670
URL: https://aclanthology.org/2020.acl-main.62
DOI: 10.18653/v1/2020.acl-main.62
Cite (ACL): Xinting Huang, Jianzhong Qi, Yu Sun, and Rui Zhang. 2020. Semi-Supervised Dialogue Policy Learning via Stochastic Reward Estimation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 660–670, Online. Association for Computational Linguistics.
Cite (Informal): Semi-Supervised Dialogue Policy Learning via Stochastic Reward Estimation (Huang et al., ACL 2020)
PDF: https://aclanthology.org/2020.acl-main.62.pdf
Video: http://slideslive.com/38929372