Yuhan Dong
2022
Offline-to-Online Co-Evolutional User Simulator and Dialogue System
Dafeng Chi
|
Yuzheng Zhuang
|
Yao Mu
|
Bin Wang
|
Jianzhu Bao
|
Yasheng Wang
|
Yuhan Dong
|
Xin Jiang
|
Qun Liu
|
Jianye Hao
Proceedings of the Towards Semi-Supervised and Reinforced Task-Oriented Dialog Systems (SereTOD)
Reinforcement learning (RL) has emerged as a promising approach to fine-tune offline pretrained GPT-2 model in task-oriented dialogue (TOD) systems. In order to obtain human-like online interactions while extending the usage of RL, building pretrained user simulators (US) along with dialogue systems (DS) and facilitating jointly fine-tuning via RL becomes prevalent. However, joint training brings distributional shift problem caused by compounding exposure bias. Existing methods usually iterative update US and DS to ameliorate the ensued non-stationarity problem, which could lead to sub-optimal policy and less sample efficiency. To take a step further for tackling the problem, we introduce an Offline-to-oNline Co-Evolutional (ONCE) framework, which enables bias-aware concurrent joint update for RL-based fine-tuning whilst takes advantages from GPT-2 based end-to-end modeling on US and DS. Extensive experiments demonstrate that ONCE builds high-quality loops of policy learning and dialogues data collection, and achieves state-of-the-art online and offline evaluation results on MultiWOZ2.1 dataset. Opensourced code will be implemented with Mindspore (MS, 2022) and released on our homepage.