A Sequence-to-Sequence Approach to Dialogue State Tracking

This paper is concerned with dialogue state tracking (DST) in a task-oriented dialogue system. Although significant progress has been made recently, building a highly effective DST module remains challenging. This paper proposes a new approach to dialogue state tracking, referred to as Seq2Seq-DU, which formalizes DST as a sequence-to-sequence problem. Seq2Seq-DU employs two BERT-based encoders to respectively encode the utterances in the dialogue and the descriptions of schemas, an attender to calculate attentions between the utterance embeddings and the schema embeddings, and a decoder to generate pointers representing the current state of the dialogue. Seq2Seq-DU has the following advantages: it can jointly model intents, slots, and slot values; it can leverage the rich representations of utterances and schemas based on BERT; and it can effectively deal with categorical slots, non-categorical slots, and unseen schemas. In addition, Seq2Seq-DU can also be used in the NLU (natural language understanding) module of a dialogue system. Experimental results on benchmark datasets in different settings (SGD, MultiWOZ2.2, MultiWOZ2.1, WOZ2.0, DSTC2, M2M, SNIPS, and ATIS) show that Seq2Seq-DU outperforms the existing methods.


Figure 1: An example of dialogue state tracking. Given a dialogue history that contains user utterances and system utterances, and descriptions of schema that contain all possible intents and slot-value pairs, a dialogue state for the current turn is created, represented by intents and slot-value pairs. Some slot values are obtained from the schema (categorical) and others are extracted from the utterances (non-categorical). #4, #6, etc. denote pointers.
In DST, several semantic frames representing the 'states' of dialogue are created and updated over multiple turns of dialogue. Domain knowledge in dialogues is captured by a representation referred to as a schema, which consists of possible intents, slots, and slot values. Slot values can come from a pre-defined set, in which case the corresponding slot is referred to as a categorical slot, or from an open set, in which case the corresponding slot is referred to as a non-categorical slot. Figure 1 shows an example of DST. We think that a DST module (and an NLU module) should have the following abilities. (1) Global: the model can jointly represent intents, slots, and slot values. (2) Representable: it has a strong capability to represent knowledge for the task, on top of a pre-trained language model like BERT. (3) Scalable: the model can deal with categorical and non-categorical slots and unseen schemas.
Many methods have been proposed for DST (e.g., Zhong et al., 2018; Mrkšić et al., 2017; Goo et al., 2018). There are two lines of relevant research. (1) To enhance the scalability of DST, a problem formulation referred to as schema-guided dialogue has been proposed. In this setting, it is assumed that natural language descriptions of schemas across multiple domains are given and utilized. Consequently, a number of methods have been developed that make use of schema descriptions to increase the scalability of DST (Rastogi et al., 2019; Zang et al., 2020; Noroozi et al., 2020). These methods regard DST as a classification and/or extraction problem and independently infer the intent and slot-value pairs for the current turn. Therefore, the proposed models are generally representable and scalable, but not global.
(2) There are also a few methods that view DST as a sequence-to-sequence problem. Some methods sequentially infer the intent and slot-value pairs for the current turn on the basis of the dialogue history and usually employ a hierarchical structure (not based on BERT) for the inference (Lei et al., 2018; Ren et al., 2019; Chen et al., 2020b). Recently, a new approach was proposed which formalizes the tasks in dialogue as sequence prediction problems using a unified language model (based on GPT-2) (Hosseini-Asl et al., 2020). The method cannot deal with unseen schemas and intents, however, and thus is not scalable.
We propose a novel approach to DST, referred to as Seq2Seq-DU (sequence-to-sequence for dialogue understanding), which combines the advantages of the existing approaches. To the best of our knowledge, no previous work has studied this approach. We think that DST should be formalized as a sequence-to-sequence or 'translation' problem in which the utterances in the dialogue are transformed into semantic frames. In this way, the intents, slots, and slot values can be jointly modeled. Moreover, NLU can be viewed as a special case of DST, and thus Seq2Seq-DU can also be applied to NLU. We note that very recently the effectiveness of the sequence-to-sequence approach has also been verified in other language understanding tasks (Paolini et al., 2021).
Seq2Seq-DU comprises a BERT-based encoder to encode the utterances in the dialogue, a BERT-based encoder to encode the schema descriptions, an attender to calculate attentions between the utterance embeddings and schema embeddings, and a decoder to generate pointers to items representing the intents and slot-value pairs of the state.
Seq2Seq-DU has the following advantages.
(1) Global: it relies on the sequence to sequence framework to simultaneously model the intents, slots, and slot-values. (2) Representable: It employs BERT (Devlin et al., 2019) to learn and utilize better representations of not only the current utterance but also the previous utterances in the dialogue. If schema descriptions are available, it also employs BERT for the learning and utilization of their representations. (3) Scalable: It uses the pointer generation mechanism, as in the Pointer Network (Vinyals et al., 2015), to create representations of intents, slots, and slot-values, no matter whether the slots are categorical or non-categorical, and whether the schemas are unseen or not.
Experimental results on benchmark datasets show that Seq2Seq-DU performs much better than the baselines on SGD, MultiWOZ2.2, and MultiWOZ2.1 in multi-turn dialogue with schema descriptions, is superior to BERT-DST on WOZ2.0, DSTC2, and M2M in multi-turn dialogue without schema descriptions, and works equally well as Joint BERT on ATIS and SNIPS in single-turn dialogue (in fact, it degenerates to Joint BERT).

Related Work
There has been a large amount of work on task-oriented dialogue, especially dialogue state tracking and natural language understanding (e.g., Chen et al., 2017). Table 1 summarizes the existing methods for DST. We also indicate the methods against which we make comparisons in our experiments.

Dialogue State Tracking
Previous approaches mainly focus on encoding the dialogue context and employ deep neural networks such as CNN, RNN, and LSTM-RNN to independently infer the values of slots in DST (Mrkšić et al., 2017; Xu and Hu, 2018; Zhong et al., 2018; Rastogi et al., 2017; Ramadan et al., 2018; Zhang et al., 2019; Heck et al., 2020). These approaches cannot deal with unseen schemas in new domains, however. To cope with the problem, a new direction called schema-guided dialogue has recently been proposed, which assumes that natural language descriptions of schemas are provided and can be used to help transfer knowledge across domains. Accordingly, a number of methods have been developed in the recent dialogue competition SGD (Rastogi et al., 2019; Zang et al., 2020; Noroozi et al., 2020; Chen et al., 2020a). Our work is partially motivated by the SGD initiative. Our model Seq2Seq-DU is unique in that it formalizes schema-guided DST as a sequence-to-sequence problem using BERT and pointer generation.
In fact, sequence-to-sequence models have also been utilized in DST. Sequicity (Lei et al., 2018) is a two-step sequence-to-sequence model which first encodes the dialogue history and generates a belief span, and then generates a language response from the belief span. COMER (Ren et al., 2019) and CREDIT (Chen et al., 2020b) are hierarchical sequence-to-sequence models which represent the intents and slot-value pairs in a hierarchical way and employ a multi-stage decoder. SimpleTOD (Hosseini-Asl et al., 2020) is a unified approach to task-oriented dialogue which employs a single causal language model to perform sequence prediction in DST, Policy, and NLG. Our proposed approach also uses a sequence-to-sequence model. There are significant differences between our model Seq2Seq-DU and the existing models. First, there is no hierarchy in the decoding of Seq2Seq-DU; a flat structure on top of BERT appears to be sufficient for jointly capturing the intents, slots, and values. Second, the decoder in Seq2Seq-DU generates pointers instead of tokens, and thus can easily and effectively handle categorical slots, non-categorical slots, and unseen schemas.

Natural Language Understanding
Traditionally, the problem of NLU is decomposed into two independent issues, namely classification of intents and sequence labeling of slot-value pairs (Liu and Lane, 2016; Hakkani-Tür et al., 2016). For example, deep neural networks combined with conditional random fields have been employed for the task (Yao et al., 2014). Recently, the pre-trained language model BERT has been exploited to further enhance accuracy. Methods have also been proposed which jointly train and utilize classification and sequence labeling models (Chen et al., 2019).

Our Approach
Our approach Seq2Seq-DU formalizes dialogue state tracking as a sequence to sequence problem using BERT and pointer generation. As shown in Figure 2, Seq2Seq-DU consists of an utterance encoder, a schema encoder, an utterance schema attender, and a state decoder. In each turn of dialogue, the utterance encoder transforms the current user utterance and the previous utterances in the dialogue into a sequence of utterance embeddings using BERT; the schema encoder transforms the schema descriptions into a set of schema embeddings also using BERT; the utterance schema attender calculates attentions between the utterance embeddings and the schema embeddings to create attended utterance and schema representations; finally, the state decoder sequentially generates a state representation on the basis of the attended representations using LSTM and pointer generation.
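The data flow through the four components can be sketched as follows. This is a schematic with hypothetical function names (not the authors' code); each callable stands in for one of the modules described above.

```python
def seq2seq_du_turn(dialogue_history, schema_descriptions,
                    utterance_encoder, schema_encoder, attender, state_decoder):
    """One turn of Seq2Seq-DU: utterances and schema descriptions in,
    pointer sequence (dialogue state) out.

    The four callables are stand-ins for the BERT utterance encoder,
    the BERT schema encoder, the utterance-schema attender, and the
    LSTM pointer-generation decoder.
    """
    D = utterance_encoder(dialogue_history)      # one embedding per utterance token
    E = schema_encoder(schema_descriptions)      # one embedding per schema element
    D_att, E_att = attender(D, E)                # bidirectional attention fusion
    return state_decoder(D_att, E_att)           # sequence of pointers
```

The sketch makes the paper's claim concrete: the state is produced jointly by a single decoder pass, rather than by independent per-slot classifiers.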

Utterance Encoder
The utterance encoder takes the current user utterance as well as the previous utterances (user and system utterances) in the dialogue (a sequence of tokens) as input and employs BERT to construct a sequence of utterance embeddings. The relations between the current utterance and the previous utterances are captured by the encoder. The input of the encoder is a sequence of tokens of length N, denoted as X = (x_1, ..., x_N). The first token x_1 is [CLS], followed by the tokens of the current user utterance and the tokens of the previous utterances, separated by [SEP]. The output is a sequence of embeddings, also of length N, denoted as D = (d_1, ..., d_N) and referred to as the utterance embeddings, with one embedding for each token.
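The input construction can be sketched as below. The paper specifies [CLS] first and [SEP] as the separator; the exact placement of [SEP] between individual history turns is our assumption, and `build_encoder_input` is our own name.

```python
def build_encoder_input(current_utterance, previous_utterances):
    """Assemble the token sequence X = (x_1, ..., x_N) fed to the BERT
    utterance encoder: [CLS], the current user utterance, then the
    previous utterances, with [SEP] as separator.

    Separator placement between history turns is an assumption made
    for illustration; whitespace tokenization stands in for BERT's
    WordPiece tokenizer.
    """
    tokens = ["[CLS]"] + current_utterance.split() + ["[SEP]"]
    for utt in previous_utterances:
        tokens += utt.split() + ["[SEP]"]
    return tokens
```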

Schema Encoder
The schema encoder takes the descriptions of intents, slots, and categorical slot values (a set of combined sequences of tokens) as input and employs BERT to construct a set of schema embeddings.

Table 2: Description sequences for schema elements.

Schema | Sequence 1          | Sequence 2
Intent | service description | intent description
Slot   | service description | slot description
Value  | slot description    | value

Suppose that there are I intents, S slots, and V categorical slot values in the schemas. Each schema element is described by two description sequences, as outlined in Table 2. The schema encoder in fact adopts the same approach to schema encoding as (Rastogi et al., 2019). There are two advantages to this approach. First, the encoder can be trained across different domains; schema descriptions in different domains can be utilized together. Second, once the encoder is fine-tuned, it can be used to process unseen schemas with new intents, slots, and slot values.
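The pairing of description sequences in Table 2 can be sketched as a small helper (`schema_element_sequences` is our own name, added for illustration):

```python
def schema_element_sequences(element_type, service_desc, element_desc, value=None):
    """Return the (Sequence 1, Sequence 2) pair fed to the schema encoder,
    following Table 2: intents and slots pair the service description with
    their own description; categorical values pair the slot description
    with the value string (for values, element_desc is the slot description).
    """
    if element_type in ("intent", "slot"):
        return (service_desc, element_desc)
    if element_type == "value":
        return (element_desc, value)
    raise ValueError(f"unknown schema element type: {element_type}")
```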

Utterance-Schema Attender
The utterance-schema attender takes the sequence of utterance embeddings and the set of schema embeddings as input and calculates schema-attended utterance representations and utterance-attended schema representations. In this way, information from the utterances and information from the schemas are fused.
First, the attender constructs an attention matrix A indicating the similarities between utterance embeddings and schema embeddings. Given the i-th utterance token embedding d_i and the j-th schema embedding e_j, it calculates the similarity as

A_ij = r^T tanh(W_1 d_i + W_2 e_j),

where r, W_1, W_2 are trainable parameters. The attender then normalizes each row of matrix A as a probability distribution, to obtain matrix Ā. Each row represents the attention weights of schema elements with respect to an utterance token. The schema-attended utterance representations are then calculated as D^a = E Ā^T. The attender also normalizes each column of matrix A as a probability distribution, to obtain matrix Â. Each column represents the attention weights of utterance tokens with respect to a schema element. The utterance-attended schema representations are then calculated as E^a = D Â.
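A minimal NumPy sketch of this bidirectional attention follows, assuming embeddings are stored as columns of D (d x N) and E (d x M); the function name and the explicit loops are ours, for clarity rather than efficiency.

```python
import numpy as np

def utterance_schema_attention(D, E, r, W1, W2):
    """Bidirectional utterance-schema attention.
    D: d x N utterance embeddings, E: d x M schema embeddings.
    A[i, j] = r^T tanh(W1 d_i + W2 e_j); rows are normalized for the
    schema-attended utterance view, columns for the reverse view."""
    N, M = D.shape[1], E.shape[1]
    A = np.empty((N, M))
    for i in range(N):
        for j in range(M):
            A[i, j] = r @ np.tanh(W1 @ D[:, i] + W2 @ E[:, j])
    A_row = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)   # rows sum to 1
    A_col = np.exp(A) / np.exp(A).sum(axis=0, keepdims=True)   # columns sum to 1
    D_att = E @ A_row.T   # d x N schema-attended utterance representations
    E_att = D @ A_col     # d x M utterance-attended schema representations
    return D_att, E_att
```

Note the shapes: D^a has one column per utterance token (each a mixture of schema embeddings), and E^a has one column per schema element (each a mixture of token embeddings), which is how information from the two sides is fused.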

State Decoder
The state decoder sequentially generates a state representation (semantic frame) for the current turn, which is represented as a sequence of pointers to elements of the schemas and tokens of the utterances (cf. Figure 1). The sequence can then be re-formalized either as a semantic frame in dialogue state tracking, [intent; (slot_1, value_1); (slot_2, value_2); ...], or as a sequence of labels in NLU (intent labeling and slot filling). The pointers point to the elements of intents, slots, and slot values in the schema descriptions (categorical slot values), as well as to the tokens in the utterances (non-categorical slot values). The elements in the schemas can be either words or phrases, and the tokens in the utterances form spans for extraction of slot values.
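Resolving a decoded pointer sequence into a semantic frame can be sketched as follows. The flat [intent, slot, value, slot, value, ...] layout and the index convention (schema pointers first, then utterance-token pointers) are our simplifying assumptions for illustration, not the paper's exact encoding.

```python
def pointers_to_frame(pointer_seq, schema_items, utterance_tokens):
    """Resolve a pointer sequence into a frame [intent; (slot, value); ...].

    Pointers below len(schema_items) refer to schema elements (intents,
    slots, categorical values); the remaining indices refer to utterance
    tokens (non-categorical values). Layout assumptions are ours."""
    def resolve(p):
        if p < len(schema_items):
            return schema_items[p]
        return utterance_tokens[p - len(schema_items)]
    intent = resolve(pointer_seq[0])
    pairs = [(resolve(s), resolve(v))
             for s, v in zip(pointer_seq[1::2], pointer_seq[2::2])]
    return {"intent": intent, "slots": pairs}
```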
The state decoder is an LSTM using pointers (Vinyals et al., 2015) and attention (Bahdanau et al., 2015). It takes the two representations D^a and E^a as input. At each decoding step t, the decoder receives the embedding of the previous item w_{t-1}, the utterance context vector u_t, the schema context vector s_t, and the previous hidden state h_{t-1}, and produces the current hidden state:

h_t = LSTM([w_{t-1}; u_t; s_t], h_{t-1}).

We adopt the attention function in (Bahdanau et al., 2015) to calculate the context vectors u_t and s_t from D^a and E^a, respectively. The decoder then generates a pointer from the set of pointers to the schema elements and the tokens of the utterances on the basis of the hidden state h_t. Specifically, it generates the pointer of item w according to the distribution

P(#w | h_t) = softmax(q^T tanh(U_1 k_w + U_2 h_t)),

where #w is the pointer of item w, k_w is the representation of item w either in the utterance representations D^a or in the schema representations E^a, q, U_1, and U_2 are trainable parameters, and the softmax is calculated over all possible pointers. During decoding, the decoder employs beam search to find the best sequence of pointers in terms of sequence probability.
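The pointer distribution at one decoding step can be sketched in NumPy as below; `pointer_distribution` is our own name, and K stacks the candidate representations k_w (columns drawn from D^a and E^a).

```python
import numpy as np

def pointer_distribution(h_t, K, q, U1, U2):
    """Distribution over pointer candidates given decoder state h_t.

    K: dim x W matrix of candidate representations k_w (from D^a and E^a).
    score(w) = q^T tanh(U1 k_w + U2 h_t); softmax is over all candidates."""
    scores = np.array([q @ np.tanh(U1 @ K[:, w] + U2 @ h_t)
                       for w in range(K.shape[1])])
    exp = np.exp(scores - scores.max())   # numerically stable softmax
    return exp / exp.sum()
```

Because the softmax ranges only over schema elements and utterance tokens, the output space automatically adapts to whatever schema is supplied, which is what makes the decoder usable on unseen schemas.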

Training
The training of Seq2Seq-DU follows the standard procedure of sequence-to-sequence learning. The only difference is that it is always conditioned on the schema descriptions. Each instance in training consists of the current utterance and the previous utterances, together with the state representation (sequence of pointers) for the current turn. Two pre-trained BERT models are used for the representations of utterances and schema descriptions, respectively. The BERT models are then fine-tuned in the training process. Cross-entropy loss is utilized to measure the loss of generating a sequence.
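The sequence-level cross-entropy can be sketched as below, assuming teacher forcing over the gold pointer sequence (`sequence_loss` is our own name; the paper only states that cross-entropy loss is used).

```python
import math

def sequence_loss(step_distributions, gold_pointers):
    """Cross-entropy of the gold pointer sequence under the decoder's
    per-step distributions: the negative sum of log-probabilities
    assigned to each gold pointer."""
    return -sum(math.log(dist[p])
                for dist, p in zip(step_distributions, gold_pointers))
```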

Datasets
We conduct experiments using benchmark datasets for task-oriented dialogue. SGD (Rastogi et al., 2019) and MultiWOZ2.2 (Zang et al., 2020) are datasets for DST; they include schemas with categorical slots and non-categorical slots in multiple domains, together with natural language descriptions of the schemas. WOZ2.0 and DSTC2 (Henderson et al., 2014) are datasets for DST; they contain schemas with only categorical slots in a single domain. M2M (Shah et al., 2018) is a dataset for DST with span annotations for slot values in multiple domains. ATIS (Tur et al., 2010) and SNIPS (Coucke et al., 2018) are datasets for NLU in single-turn dialogues in a single domain. Table 3 gives the statistics of the datasets used in the experiments.

Baselines and Variants
We compare our approach with the state-of-the-art methods on the datasets. We also include two variants of Seq2Seq-DU. The differences lie in whether the schema descriptions are used and in the formation of the dialogue state. Seq2Seq-DU-w/oSchema: used for datasets that do not have schema descriptions; it only contains the utterance encoder and the state decoder. Seq2Seq-DU-SeqLabel: used for NLU in single-turn dialogue; it views the problem as sequence labeling and only contains the utterance encoder and the state decoder.

Evaluation Measures
We use the following metrics in evaluation. Intent Accuracy: the percentage of turns in a dialogue for which the intent is correctly identified. Joint Goal Accuracy: the percentage of turns for which all the slots are correctly identified. For non-categorical slots, a fuzzy matching score is used on SGD and exact match is used on the other datasets, to keep the numbers comparable with prior work. Slot F1: the F1 score evaluating the accuracy of slot sequence labeling.
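Joint Goal Accuracy can be sketched as follows; this is an exact-match version (SGD's fuzzy matching for non-categorical slots is omitted), and the function name is our own.

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns whose predicted slot-value set exactly matches
    the gold set. Each state is a list of (slot, value) pairs; a turn
    counts as correct only if every slot is correct (exact match)."""
    correct = sum(set(p) == set(g)
                  for p, g in zip(predicted_states, gold_states))
    return correct / len(gold_states)
```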

Training
We use the pre-trained BERT model ([BERT-Base, Uncased]), which has 12 hidden layers of 768 units and 12 self-attention heads, to encode utterances and schema descriptions. The hidden size of the LSTM decoder is also 768. The dropout probability is 0.1. We use beam search for decoding, with a beam size of 5. The batch size is set to 8. Adam (Kingma and Ba, 2014) is used for optimization with an initial learning rate of 1e-4. Hyperparameters are chosen using the validation dataset in all cases.
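For reference, the reported hyperparameters can be collected into a single configuration; the values are copied from the text above, while the dictionary keys are our own naming.

```python
# Training configuration as reported in the paper (values from the text;
# key names are ours, chosen for illustration).
CONFIG = {
    "bert_model": "BERT-Base, Uncased",   # 12 layers, 768 hidden units, 12 heads
    "decoder_hidden_size": 768,
    "dropout": 0.1,
    "beam_size": 5,
    "batch_size": 8,
    "optimizer": "Adam",
    "learning_rate": 1e-4,
}
```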
The training curves of all models are shown in Appendix A.

Experimental Results
Tables 4, 5, 6, and 7 show the results. One can see that Seq2Seq-DU performs significantly better than the baselines in DST and performs equally well as the baselines in NLU.
DST is carried out in different settings on SGD, MultiWOZ2.2, MultiWOZ2.1, WOZ2.0, DSTC2, and M2M. In all cases, Seq2Seq-DU works significantly better than the baselines. The results indicate that Seq2Seq-DU is a general and effective model for DST that can be applied in multiple settings. Specifically, Seq2Seq-DU can leverage the schema descriptions for DST when they are available (SGD, MultiWOZ2.2, and MultiWOZ2.1). It can work well in zero-shot learning to deal with unseen schemas (SGD). It can also effectively handle categorical slots (MultiWOZ2.1, WOZ2.0, and DSTC2) and non-categorical slots (M2M). The success of Seq2Seq-DU appears to be due to its architecture design, with a sequence-to-sequence framework, BERT-based encoders, an utterance-schema attender, and a pointer generation decoder.
NLU is formalized as sequence labeling on ATIS and SNIPS. Seq2Seq-DU degenerates to Seq2Seq-DU-SeqLabel, which is equivalent to the Joint BERT baseline. The results confirm this: the performance of Seq2Seq-DU is comparable with that of Joint BERT, indicating that Seq2Seq-DU can also be employed for NLU.

Ablation Study
We also conduct an ablation study of Seq2Seq-DU. We validate the effects of three factors: the BERT-based encoders, the utterance-schema attention, and the pointer generation decoder. The results indicate that all components of Seq2Seq-DU are indispensable.

Effect of BERT
To investigate the effectiveness of using BERT in the utterance encoder and schema encoder, we replace BERT with a bidirectional LSTM and run the model on SGD and MultiWOZ2.2. As shown in Figure 3, the performance of the BiLSTM-based model, Seq2Seq-DU-w/oBert, in terms of Joint GA and Int. Acc. decreases significantly compared with Seq2Seq-DU. This indicates that the BERT-based encoders create and utilize more accurate representations for dialogue understanding.

Effect of Attention
To investigate the effectiveness of using attention, we compare Seq2Seq-DU with Seq2Seq-DU-w/oAttention, which eliminates the attention mechanism; Seq2Seq-DU-w/SchemaAtt, which only contains the utterance-attended schema representations; and Seq2Seq-DU-w/UtteranceAtt, which only contains the schema-attended utterance representations. Figure 3 shows the results on SGD and MultiWOZ2.2 in terms of Joint GA and Int. Acc. One can observe that without attention the performance deteriorates considerably. In addition, the performance of either unidirectional attention is inferior to that of bidirectional attention. Thus, utilizing bidirectional attention between utterances and schema descriptions is desirable.

Effect of Pointer Generation
To investigate the effectiveness of the pointer generation mechanism, we directly generate words from the vocabulary instead of generating pointers in the decoding process. Figure 3 also shows the results of Seq2Seq-DU-w/oPointer on SGD and MultiWOZ2.2 in terms of Joint GA and Int. Acc. From the results we can see that pointer generation is crucial for coping with unseen schemas.
In SGD, which contains a large number of unseen schemas in the test set, there is significant performance degradation without pointer generation. The results on MultiWOZ2.2, which does not have unseen schemas in the test set, show that pointer generation also brings significant improvement on already seen schemas by making full use of the schema descriptions.

Discussions

Case Study
We make a qualitative analysis of the results of Seq2Seq-DU and the SGD-baseline on SGD and MultiWOZ2.2. We find that Seq2Seq-DU can make more accurate inferences of dialogue states by leveraging the relations existing in the utterances and schema descriptions. For example, in the first case in Table 8, the user wants to find a cheap guesthouse. Seq2Seq-DU can correctly infer that the hotel type is "guesthouse" by referring to the relation between "hotel-pricerange" and "hotel-type". In the second case, the user wants to rent a room with in-unit laundry. In the dataset, a user who intends to rent a room will care more about the laundry property. Seq2Seq-DU can effectively extract the relation between "intent" and "in-unit-laundry", yielding a correct result. In contrast, the SGD-baseline does not model the relations in the schemas, and thus it cannot properly infer the values of "hotel-type" and "in-unit-laundry".

Dealing with Unseen Schemas
We analyze the zero-shot learning ability of Seq2Seq-DU. The accuracies are highest in the domains with all seen schemas. The domains that have more partially seen schemas achieve higher accuracies, such as "Hotels", "Movies", and "Services". The accuracies decline in the domains with more unseen schemas, such as "Messaging" and "RentalCars". We conclude that Seq2Seq-DU can perform zero-shot learning across domains, although this ability still needs enhancement.

Conclusion
We have proposed a new approach to dialogue state tracking. The approach, referred to as Seq2Seq-DU, treats dialogue state tracking (DST) as a problem of transforming all the utterances in a dialogue into semantic frames (state representations) on the basis of schema descriptions. Seq2Seq-DU is unique in that, within the sequence-to-sequence framework, it employs BERT in the encoding of utterances and schema descriptions respectively and generates pointers in the decoding of the dialogue state. Seq2Seq-DU is a global, representable, and scalable model for DST as well as NLU (natural language understanding). Experimental results show that Seq2Seq-DU significantly outperforms the state-of-the-art methods for DST on the benchmark datasets of SGD, MultiWOZ2.2, MultiWOZ2.1, WOZ2.0, DSTC2, and M2M, and performs as well as the state of the art for NLU on the benchmark datasets of ATIS and SNIPS.
A Training Curves

Figure 4 shows the training losses of Seq2Seq-DU on the training datasets, while Figure 5 shows the accuracies of Seq2Seq-DU on the test sets during training. We regard training as converged when the fluctuation of the loss is less than 0.01 for 20 thousand consecutive steps. Seq2Seq-DU converges at the 180k-th step on SGD, MultiWOZ2.2, and MultiWOZ2.1. Seq2Seq-DU-w/oSchema converges at the 150k-th step on WOZ2.0 and at the 140k-th step on DSTC2 and M2M. Furthermore, Seq2Seq-DU-SeqLabel converges at the 130k-th step on ATIS and SNIPS. This is consistent with the general trend in machine learning that more complex models are harder to train.