Yangyang Zhao
2024
Bootstrapped Policy Learning for Task-oriented Dialogue through Goal Shaping
Yangyang Zhao | Ben Niu | Mehdi Dastani | Shihan Wang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Reinforcement learning shows promise in optimizing dialogue policies, but addressing the challenge of reward sparsity remains crucial. While curriculum learning offers a practical solution by strategically training policies from simple to complex, it hinges on the assumption of a gradual increase in goal difficulty to ensure a smooth knowledge transition across varied complexities. In complex dialogue environments without intermediate goals, achieving seamless knowledge transitions becomes tricky. This paper proposes a novel Bootstrapped Policy Learning (BPL) framework, which adaptively tailors a progressively challenging subgoal curriculum for each complex goal through goal shaping, ensuring a smooth knowledge transition. Goal shaping involves goal decomposition and evolution: decomposing complex goals into subgoals with solvable maximum difficulty and progressively increasing difficulty as the policy improves. Moreover, to enhance BPL’s adaptability across various environments, we explore various combinations of goal decomposition and evolution within BPL, and identify two universal curriculum patterns that remain effective across different dialogue environments, independent of specific environmental constraints. By integrating the summarized curriculum patterns, our BPL has exhibited efficacy and versatility across four publicly available datasets with different difficulty levels.
Rescue Conversations from Dead-ends: Efficient Exploration for Task-oriented Dialogue Policy Optimization
Yangyang Zhao | Mehdi Dastani | Jinchuan Long | Zhenyu Wang | Shihan Wang
Transactions of the Association for Computational Linguistics, Volume 12
Training a task-oriented dialogue policy using deep reinforcement learning is promising but requires extensive environment exploration. The amount of wasted invalid exploration makes policy learning inefficient. In this paper, we define dead-end states and argue that they are an important cause of invalid exploration. When a conversation enters a dead-end state, regardless of the actions taken afterward, it will continue in a dead-end trajectory until the agent reaches a termination state or the maximum number of turns. We propose a Dead-end Detection and Resurrection (DDR) method that detects dead-end states in an efficient manner and provides a rescue action to guide and correct the exploration direction. To prevent dialogue policies from repeating errors, DDR also performs dialogue data augmentation by adding relevant experiences that include dead-end states and penalties into the experience pool. We first validate the dead-end detection reliability and then demonstrate the effectiveness and generality of the method across various domains through experiments on four public dialogue datasets.
2021
Efficient Dialogue Complementary Policy Learning via Deep Q-network Policy and Episodic Memory Policy
Yangyang Zhao | Zhenyu Wang | Changxi Zhu | Shihan Wang
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Deep reinforcement learning has shown great potential in training dialogue policies. However, its favorable performance comes at the cost of many rounds of interaction. Most existing dialogue policy methods rely on a single learning system, while the human brain has two specialized learning and memory systems, which support finding good solutions without requiring copious examples. Inspired by the human brain, this paper proposes a novel complementary policy learning (CPL) framework, which exploits the complementary advantages of the episodic memory (EM) policy and the deep Q-network (DQN) policy to achieve fast and effective dialogue policy learning. In order to coordinate between the two policies, we propose a confidence controller to control the complementary time according to their relative efficacy at different stages. Furthermore, memory connectivity and time pruning are proposed to guarantee the flexible and adaptive generalization of the EM policy in dialogue tasks. Experimental results on three dialogue datasets show that our method significantly outperforms existing methods relying on a single learning system.
Co-authors
- Shihan Wang 3
- Mehdi Dastani 2
- Zhenyu Wang 2
- Jinchuan Long 1
- Ben Niu 1