Generation and Extraction Combined Dialogue State Tracking with Hierarchical Ontology Integration

Recently, the focus of dialogue state tracking has expanded from single domains to multiple domains. The task is characterized by slots shared between domains, and as the scenario grows more complex, the out-of-vocabulary problem also becomes more severe. Current models do not satisfactorily solve the challenges of ontology integration between domains or the out-of-vocabulary problem. To address these issues, we explore the hierarchical semantics of the ontology and enhance the interrelation between slots with masked hierarchical attention. In the state value decoding stage, we tackle the out-of-vocabulary problem by combining a generation method and an extraction method. We evaluate the performance of our model on two representative datasets, MultiWOZ in English and CrossWOZ in Chinese. The results show that our model yields a significant performance gain over current state-of-the-art state tracking models and is more robust to the out-of-vocabulary problem than other methods.


Introduction
Dialogue state tracking (DST) is in charge of updating the belief state in a task-oriented dialogue system (Gao et al., 2019a). Traditional discriminative DST models assume that the task ontology is well defined in advance, that is to say, all states and their values are known to the model. They usually rely on hand-crafted features or task-specific lexicons (Henderson et al., 2014; Mrkšić and Vulić, 2018), which makes them time-consuming to build and hard to extend to new tasks. To overcome this, open-vocabulary models have been proposed that decode the state value directly from the dialogue context (Xu and Hu, 2018; Goel et al., 2019).
In recent years, the research frontier for task-oriented dialogue systems has expanded from single domains to multiple domains (Budzianowski et al., 2018; Zhu et al., 2020; Cheng et al., 2020), bringing new challenges that demand prompt solutions. First, current models do not sufficiently consider the interrelation between slots in the multi-domain scenario. For example, when a user asks "I also want to find an attraction near the restaurant", the attraction is implied to be in the same area as the restaurant. This implicit relation between the area slots of attraction and restaurant is the key to exactly tracking the user's intent. Prior work simply used the summed or concatenated (Zhang et al., 2019; Kim et al., 2020) embedding of the domain and slot as the state representation for the decoder. Second, the out-of-vocabulary (OOV) problem becomes more severe as users ask questions involving a wider range of entities and more diverse wording.
In this paper, we propose a generation and extraction combined method with hierarchical ontology integration, named GeeX, for dialogue state tracking. First, we explore the hierarchical semantics of the ontology to enhance the representation of slots in multiple domains. Inspired by Chen et al. (2019), we adopt a directed acyclic graph to represent the ontology and enhance slot interaction between domains with masked hierarchical attention. We use the ontology of MultiWOZ (Budzianowski et al., 2018) to illustrate this mechanism. As shown in Figure 1, the ontology has four hierarchies, i.e., Domain, Intent, Slot, and Value. A state is expressed as a combination of Domain, Intent, and Slot, and the goal of DST is to decode the Value for each state mentioned in the dialogue context. The hierarchically represented ontology is efficient and effective in two respects. First, it enhances the interrelation between slots in multiple domains. Second, its compact structure is efficient for state representation, which suits domain expansion since a new domain often shares slots with an old one (Rastogi et al., 2020). To address the OOV problem, we combine generation and extraction: we first predict the state operation policy to select a suitable decoding strategy, and then invoke the corresponding decoder for value decoding according to the predicted policy.
The contributions of this paper are summarized as follows: (i) We adopt masked hierarchical attention over the ontology to enhance the slot interrelation between domains. (ii) We combine generation and extraction to handle the OOV problem in dialogue state tracking. (iii) Experimental results demonstrate that GeeX outperforms state-of-the-art baselines on two representative datasets. Furthermore, GeeX also shows robustness in OOV testing.

Architecture
We use a four-stage model for state tracking; Figure 2 illustrates the architecture.

Masked Hierarchical Attention
We use three-layer masked hierarchical attention to explicitly integrate the state information. Assume there are $M$ states in total (the full state lists of MultiWOZ and CrossWOZ are given in Appendix A). For the $m$-th state, we use a state-specific mask $M^m_l \in \mathbb{R}^{|O_l|}$ to activate certain gates and only pass their information to the next level, disentangling the layer-wise information. The state is computed by

$$\tilde{S}^m_l = \mathrm{Att}\big(\tilde{S}^m_{l-1},\ M^m_l \odot O_l,\ M^m_l \odot O_l\big), \qquad l = 1, 2, 3,$$

where Att is the standard scaled dot-product attention (Vaswani et al., 2017) with hidden dimension $d_h$, $l$ is the layer number, $|\cdot|$ denotes length, $O_l$ denotes the node embeddings of the Domain, Intent, and Slot layers for $l = 1, 2, 3$, respectively, and $\tilde{S}^m_0$ is the initial query for the $m$-th state. The final output $\tilde{S}_m = \tilde{S}^m_3$ serves as the state representation. The dialogue state is the concatenation of all individual states, i.e., $S = S_1 \oplus \cdots \oplus S_M$, where $S_m = \tilde{S}_m \oplus V_m$, $V_m$ is the value of $\tilde{S}_m$, and $\oplus$ denotes concatenation. Note that ontology nodes are shared across states, so the hierarchical attention helps to implicitly model the interrelation between states.
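To make the masking concrete, below is a minimal PyTorch sketch of one layer-by-layer pass for a single state. The single-head attention, the 0/1 mask convention, and the initialization of the query are our assumptions; the paper does not show its exact parameterization.

```python
import torch
import torch.nn.functional as F

def masked_attention(query, nodes, mask):
    """Scaled dot-product attention (Vaswani et al., 2017) restricted to
    the ontology nodes activated by a binary state-specific mask."""
    d_h = query.size(-1)
    scores = query @ nodes.T / d_h ** 0.5             # (1, |O_l|)
    scores = scores.masked_fill(mask == 0, float("-inf"))
    return F.softmax(scores, dim=-1) @ nodes          # (1, d_h)

def state_representation(layers, masks, query):
    """Traverse Domain -> Intent -> Slot (l = 1, 2, 3), passing
    information only through the nodes that belong to this state."""
    for nodes, mask in zip(layers, masks):
        query = masked_attention(query, nodes, mask)
    return query                                      # \tilde{S}_m

# Toy ontology: 3 domains, 5 intents, 8 slots; hidden size 64.
d_h = 64
layers = [torch.randn(n, d_h) for n in (3, 5, 8)]
masks = [torch.tensor([1, 0, 0]),
         torch.tensor([1, 1, 0, 0, 0]),
         torch.tensor([0, 1, 0, 0, 0, 0, 0, 0])]
s_m = state_representation(layers, masks, torch.randn(1, d_h))
```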

Transformer Encoder
We represent the dialogue context as the concatenation of the last turn's system response $D$ and the current turn's user utterance $U$. At the $t$-th turn, the dialogue context is denoted as $C_t = D_{t-1} \oplus U_t$. We use a Transformer (Vaswani et al., 2017) to fuse the state information into the dialogue context, concatenating the last turn's state $S_{t-1}$ and the current dialogue context $C_t$ as the input, i.e.,

$$X_t = [\mathrm{CLS}] \oplus S_{t-1} \oplus [\mathrm{SEP}] \oplus C_t,$$

where [CLS] and [SEP] are special tokens as in (Devlin et al., 2019). In the output layer, we obtain a hidden representation for each input token. In particular, $h_{\tilde{S}_m}$ corresponds to $\tilde{S}_m$ and represents the information of the $m$-th state.
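For clarity, here is a small sketch of how the encoder input sequence might be assembled; the exact delimiter placement and the flattening of $S_{t-1}$ into tokens are our assumptions, since the original equation was lost in extraction.

```python
def build_encoder_input(prev_state_tokens, prev_response_tokens, user_tokens):
    """[CLS] S_{t-1} [SEP] C_t, with C_t = D_{t-1} (+) U_t."""
    return (["[CLS]"] + prev_state_tokens + ["[SEP]"]
            + prev_response_tokens + user_tokens + ["[SEP]"])

tokens = build_encoder_input(
    ["restaurant", "-", "pricerange", "=", "cheap"],   # flattened S_{t-1}
    ["what", "area", "do", "you", "prefer", "?"],      # D_{t-1}
    ["the", "centre", ",", "please"])                  # U_t
```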

Operation Gate
We predict the decoding operation $o_m \in O = \{\mathrm{CARRYOVER}, \mathrm{GENERATE}, \mathrm{EXTRACT}, \mathrm{NULL}\}$ with a four-way classifier, where CARRYOVER keeps the state value the same as in the last turn, GENERATE decodes the value with the generation decoder, EXTRACT decodes the value with the extraction decoder, and NULL means the state is not mentioned in the context and its value is empty. For each state, we compute the decoding operation probability $P^m_{op} \in \mathbb{R}^{|O|}$ by

$$P^m_{op} = \mathrm{softmax}\big(W_{op}\, h_{\tilde{S}_m}\big),$$

where $W_{op} \in \mathbb{R}^{|O| \times d_h}$ is a learnable parameter.
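A minimal sketch of the gate as a linear layer followed by a softmax, matching the shape of $W_{op}$ above; the class and variable names are ours.

```python
import torch
import torch.nn as nn

OPERATIONS = ("CARRYOVER", "GENERATE", "EXTRACT", "NULL")

class OperationGate(nn.Module):
    """Four-way classifier: P_op^m = softmax(W_op h_{S~_m})."""
    def __init__(self, d_h):
        super().__init__()
        self.w_op = nn.Linear(d_h, len(OPERATIONS), bias=False)  # W_op

    def forward(self, h_states):                 # h_states: (M, d_h)
        return torch.softmax(self.w_op(h_states), dim=-1)

gate = OperationGate(d_h=64)
probs = gate(torch.randn(30, 64))                # one row per state
ops = [OPERATIONS[i] for i in probs.argmax(dim=-1).tolist()]
```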

State Value Decoder
We build two parallel decoders for state value prediction and selectively decode the values of states whose operation policy is GENERATE or EXTRACT.

Generation Decoding. We use Gated Recurrent Units (GRU) (Cho et al., 2014) as the basic decoder and employ a copy mechanism that computes a probability distribution over the dialogue context to encourage reusing words from it. We use the state representation $h_{\tilde{S}_m}$ (whose operation policy is GENERATE) to initialize the decoder hidden state. The final probability of decoding a certain word, e.g., the $\tau$-th token $u(\tau)$, is calculated by

$$P(\tau) = p_{gen}\, P_{voc}(\tau) + (1 - p_{gen})\, P_{copy}(\tau),$$

where $P_{voc}(\tau)$ is the probability computed from the decoder hidden state over the whole vocabulary, $P_{copy}(\tau)$ is the probability of copying words from the context, and $p_{gen}$ is a soft gate balancing the two distributions.
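A sketch of one copy-augmented decoding step under the usual pointer-generator formulation; the gate $p_{gen}$ and the attention used for $P_{copy}$ are our assumptions standing in for the garbled original equation.

```python
import torch
import torch.nn.functional as F

def copy_augmented_step(dec_hidden, vocab_logits, ctx_hidden, ctx_ids,
                        vocab_size, w_gate):
    """One decoding step mixing the vocabulary distribution P_voc with a
    copy distribution P_copy over the dialogue context tokens."""
    p_voc = F.softmax(vocab_logits, dim=-1)                    # (V,)
    attn = F.softmax(ctx_hidden @ dec_hidden, dim=-1)          # (T,) context attention
    p_copy = torch.zeros(vocab_size).scatter_add_(0, ctx_ids, attn)
    p_gen = torch.sigmoid(w_gate @ dec_hidden)                 # scalar copy/generate gate
    return p_gen * p_voc + (1 - p_gen) * p_copy                # final P(tau)

# Toy shapes: hidden 64, context length 12, vocabulary 100.
out = copy_augmented_step(torch.randn(64), torch.randn(100),
                          torch.randn(12, 64),
                          torch.randint(0, 100, (12,)),
                          100, torch.randn(64))
```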
Extraction Decoding. We treat state value prediction as an extractive reading comprehension problem (Gao et al., 2019b, 2020). Specifically, we use the state representation $h_{\tilde{S}_m}$ (whose policy is EXTRACT) as the query, the dialogue context $H$ as the background, and the state value as the answer. The extraction can be formalized as

$$P^m_s,\ P^m_e = \mathrm{EXT}\big(h_{\tilde{S}_m},\, H\big),$$

where $P^m_s$ and $P^m_e$ are the start-index and end-index probabilities over the dialogue context, respectively. In implementation, we use the extraction method EXT from Hu et al. (2018).
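A sketch of the span extraction step with bilinear start/end scoring; the scoring function is our assumption standing in for EXT (Hu et al., 2018).

```python
import torch
import torch.nn.functional as F

def extract_span(h_state, ctx_hidden, w_s, w_e):
    """Score each context position as the start/end of the value span,
    using the state representation as the query."""
    p_start = F.softmax(ctx_hidden @ (w_s @ h_state), dim=-1)   # P_s^m, (T,)
    p_end = F.softmax(ctx_hidden @ (w_e @ h_state), dim=-1)     # P_e^m, (T,)
    start = int(p_start.argmax())
    end = start + int(p_end[start:].argmax())                   # enforce end >= start
    return start, end

T, d_h = 24, 64
span = extract_span(torch.randn(d_h), torch.randn(T, d_h),
                    torch.randn(d_h, d_h), torch.randn(d_h, d_h))
```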

Learning
We use cross-entropy to compute the operation policy loss and the state value decoding losses:

$$L_* = -\, y_*^{\top} \log P_*, \qquad * \in \{op,\ gen,\ ext\},$$

where $y_*$ is the gold label for $P_*$. We adopt multi-task learning to train the model. The optimization objective is a combination of the three loss functions:

$$L = L_{op} + L_{gen} + L_{ext}.$$

Main Results

The performance gap between the vanilla extractive models (i.e., DSTReader, TripPy) and GeeX mainly comes from the limitation that their decoding vocabulary is restricted to words occurring in the dialogue history. For example, a user may want a cheap restaurant but describe it as economical; an extractive model then fails to predict the right answer span.
GeeX also achieves higher scores than the generation decoding models (i.e., TRADE, SOM). On further inspection, we find that most state value tokens can be directly extracted from the context (82.0% in MultiWOZ 2.0, 84.2% in MultiWOZ 2.1, and 83.7% in CrossWOZ), and the extractive decoder is more robust when decoding longer sequences. However, the generation decoder can produce values that do not appear in the context, so it is a natural complement to the extractive method.
Another performance gain comes from operation prediction. As stated in (Kim et al., 2020), a relatively large share of errors originates from the operation gate. SOM uses CARRYOVER for all states that remain unchanged, neglecting the difference between "none" and an inherited value. GeeX uses CARRYOVER for state values inherited from the last turn and NULL for empty values, which helps to explicitly exploit the last turn's belief state.

Ablation Study
Ablation results are reported in the bottom half of Table 1. The degradation of -MHA, -EXT, and -GEN validates the necessity of hierarchical ontology integration and the parallel decoding approach. -EXT outperforms SOM among generation decoding methods, and -GEN outperforms DSTReader among extraction decoding methods, demonstrating that hierarchical ontology integration effectively promotes slot interaction and brings a performance boost. Compared with the vanilla extraction and generation decoding models, the improvement of -MHA further confirms that the two parallel decoding approaches are complementary to each other.

OOV Testing
We simulate OOV instances by randomly masking value tokens in the dialogue context; for example, we change "I would like modern European food" into "I would like [UNK] European food". We take three representative models, i.e., SUMBT, DSTReader, and SOM, for comparison. As shown in Figure 3, GeeX still performs well at all OOV rates. This is because the extraction decoder plays a crucial role in predicting OOV tokens, which is also reflected in the smaller performance drop of DSTReader. In addition, the performance of SOM decreases more sharply as more instances are set to OOV, demonstrating that the copy-augmented model struggles with multiple consecutive unknown words. SUMBT performs worst, showing that the discriminative model is ill-equipped to recognize unknown tokens.
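A sketch of how such OOV instances can be simulated; the per-token masking rate and the span-based interface are our assumptions about the setup.

```python
import random

def mask_value_tokens(tokens, value_spans, oov_rate, seed=0):
    """Replace gold value tokens with [UNK] at a given rate, e.g. turning
    'I would like modern European food' into
    'I would like [UNK] European food'."""
    rng = random.Random(seed)
    out = list(tokens)
    for start, end in value_spans:          # token spans of gold state values
        for i in range(start, end):
            if rng.random() < oov_rate:
                out[i] = "[UNK]"
    return out

utterance = "I would like modern European food".split()
print(" ".join(mask_value_tokens(utterance, [(3, 5)], oov_rate=0.5)))
```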

Conclusion
In this paper, we explore the hierarchical structure of the ontology and combine generation and extraction for state value decoding. As domains keep expanding, supervised learning alone cannot satisfy the rapidly growing requirements. In future work, few-shot learning and knowledge fusion can be applied to further improve domain transfer performance.