Clipping Loops for Sample-Efficient Dialogue Policy Optimisation

Yen-Chen Wu; Carl Edward Rasmussen

doi:10.18653/v1/2021.naacl-main.267

Abstract

Training dialogue agents requires a large number of interactions with users because agents cannot tell which responses within a lengthy dialogue are bad. In this paper, we propose loop-clipping policy optimisation (LCPO) to eliminate useless responses. LCPO consists of two stages: loop clipping and advantage clipping. First, in loop clipping, we clip off useless responses (called loops) from the dialogue history (called trajectories). The clipped trajectories are more succinct than the original ones, and the estimation of the state value is more accurate. Second, in advantage clipping, we estimate and clip the advantages of useless responses and normal ones separately. The clipped advantages distinguish useless actions from others and reduce the probabilities of useless actions efficiently. In experiments on the Cambridge Restaurant Dialogue System, LCPO uses only 260 training dialogues to achieve an 80% success rate, whereas the PPO baseline requires 2160 dialogues. In addition, LCPO receives a score of 3.7 out of 5 in human evaluation, in which the agent interactively collects 100 real-user dialogues during the training phase.
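
The abstract does not define a loop formally, so the following is only a minimal sketch of what the loop-clipping stage might look like, under the assumption that a loop is a segment of the trajectory that returns to a previously visited dialogue state. The function name `clip_loops`, the `(state, action)` step representation, and the example states are hypothetical and are not taken from the paper. Advantage clipping is not sketched here.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

# A trajectory step is a (state, action) pair. This representation is an
# assumption made for illustration, not the paper's data structure.
Step = Tuple[Hashable, str]


def clip_loops(trajectory: Sequence[Step]) -> List[Step]:
    """Drop every segment that leads back to an already-visited state.

    Assumption: a "loop" is a run of steps that starts at some state and
    later returns to that same state, so the steps in between made no
    progress and can be removed.
    """
    clipped: List[Step] = []
    position: Dict[Hashable, int] = {}  # state -> index of its step in `clipped`

    for state, action in trajectory:
        if state in position:
            # The dialogue is back at a state seen earlier: remove the
            # steps from that earlier visit onwards (the loop).
            cut = position[state]
            for old_state, _ in clipped[cut:]:
                del position[old_state]
            del clipped[cut:]
        position[state] = len(clipped)
        clipped.append((state, action))

    return clipped


# Hypothetical dialogue: the agent asks for the area twice without the
# state changing, which forms a loop back to "start".
example = [
    ("start", "request_area"),
    ("confused", "repeat"),
    ("start", "request_food"),
    ("food_known", "inform"),
]
clipped_example = clip_loops(example)
```

In this sketch, `clipped_example` keeps only the last action taken from `"start"` and the final `"food_known"` step, so every state appears at most once in the clipped trajectory.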

Anthology ID: 2021.naacl-main.267
Volume: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month: June
Year: 2021
Address: Online
Editors: Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, Yichao Zhou
Venue: NAACL
Publisher: Association for Computational Linguistics
Pages: 3420–3428
URL: https://aclanthology.org/2021.naacl-main.267/
DOI: 10.18653/v1/2021.naacl-main.267
Cite (ACL): Yen-Chen Wu and Carl Edward Rasmussen. 2021. Clipping Loops for Sample-Efficient Dialogue Policy Optimisation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3420–3428, Online. Association for Computational Linguistics.
Cite (Informal): Clipping Loops for Sample-Efficient Dialogue Policy Optimisation (Wu & Rasmussen, NAACL 2021)
PDF: https://aclanthology.org/2021.naacl-main.267.pdf
Video: https://aclanthology.org/2021.naacl-main.267.mp4