Effective Sequence-to-Sequence Dialogue State Tracking

Sequence-to-sequence models have been applied to a wide variety of NLP tasks, but how to properly use them for dialogue state tracking has not been systematically investigated. In this paper, we study this problem from the perspectives of pre-training objectives as well as the formats of context representations. We demonstrate that the choice of pre-training objective makes a significant difference to the state tracking quality. In particular, we find that masked span prediction is more effective than auto-regressive language modeling. We also explore using Pegasus, a span prediction-based pre-training objective for text summarization, for the state tracking model. We found that pre-training for the seemingly distant summarization task works surprisingly well for dialogue state tracking. In addition, we found that while recurrent state context representation works also reasonably well, the model may have a hard time recovering from earlier mistakes. We conducted experiments on the MultiWOZ 2.1-2.4, WOZ 2.0, and DSTC2 datasets with consistent observations.


Introduction
Sequence-to-sequence (Seq2Seq) modeling (Sutskever et al., 2014) is one of the most widely adopted generative framework for a multitude of NLP tasks. While it has also been applied for task-oriented dialogue modeling Lei et al., 2018;Rongali et al., 2020;Feng et al., 2020), how to best setup Seq2Seq models on this task remains an understudied topic. In this paper, we investigate this problem from two perspectives: Pre-training objectives and dialogue context representation, and we focus on the dialogue state tracking (DST) task.
The flexibility of the Seq2Seq model allows us to adopt and compare pre-training strategies for other NLP tasks sharing the same architec-ture. Specifically, we first experimented with different pre-training setups of T5 (Raffel et al., 2020) which have been shown to be effective for generic language understanding. Additionally, as an exploratory effort, we applied Pegasus (Zhang et al., 2020b), a pre-training procedure designed for text summarization, to the task of DST.
Additionally, to investigate how different dialogue context representations affect Seq2Seq performance, we compare two versions of all models: one accepts full conversation history as context, and one that feeds the previously predicted states recurrently as summary of context.
We conduct systematic experiments on the Multi-WOZ (Budzianowski et al., 2018) benchmark. For fair comparison with existing approaches, we report results on MultiWOZ 2. 1-2.4 (Eric et al., 2019;Zang et al., 2020;Han et al., 2020;Ye et al., 2021), all 4 variations of the benchmark proposed to date. In addition, we report results on the WOZ 2.0 DSTC2 (Henderson et al., 2014) datasets. Our findings can be summarized as follows: 1. Pre-training procedures involving masked span prediction work consistently better than auto-regressive language modeling objectives. 2. Pre-training for text summarization works surprisingly well for DST, despite it being a seemingly irrelevant task. 3. Recurrent models work reasonably well by including previously predicted states and constant length dialogue history. However they may suffer from the problem of not being able to recover from early mistakes.
The inputs to the encoder are dialogue contexts, and the decoder generates a sequence of strings of the format slot1=value1,slot2=value2,... describing the predicted states conditioned on the given context. Depending on how we represent the dialogue context, we consider two variations of the model: 1. Full-history model: The most straightforward way of preparing the context is simply to concatenate turns from the entire history as inputs to the encoder, which ensures the model to have full access to the raw information required to predict the current state. This setup is also adopted by several generative dialogue models such as SimpleTOD (Hosseini-Asl et al., 2020), Seq2Seq-DU (Feng et al., 2020) and SOLOIST (Peng et al., 2021). A potential drawback of the full-history model is that it may become increasingly inefficient as a conversation unfolds and the input length grows.
2. Recurrent-state model: An alternative approach is to include just the N recent turns in the conversation history, and replace turns from 1 to T − N with dialogue states up to T − N (where T is the current turn index). That is, the inputs to encoder have the format [states(turn 1,...,T −N ), turn T −N +1,...,T ].
States provide a summarization of the conversation semantics. By consolidating remote histories into states we not only reduce the context lengths, but also discard information not immediately related to the purpose of state tracking. Similar setup has also been considered by generative models including GPT-2 (Budzianowski and Vulić, 2019) and Sequicity (Lei et al., 2018), although in their cases only the last turn has been considered (N = 1).
An example of the input and output formats for both models is given in Appendix A.1.

Pre-training
Pre-training followed by task-specific fine-tuning is becoming a standard paradigm for contemporary NLP model training. Existing pre-training objectives mainly fall into two categories: masked span prediction (where the span length can be 1 corresponding to word prediction) and auto-regressive prediction. Objectives like BERT MLM (Devlin et al., 2019) and the denoising setup in T5 (Raffel et al., 2020) belong to the former category, while GPT (Radford et al., 2019;Brown et al., 2020) and the prefix LM setup in T5 fall into the latter.
For generative dialogue modeling, both pretraining styles have been considered. For example, Seq2Seq-DU (Feng et al., 2020) adopted a BERTpre-trained encoder, while SimpleTOD (Hosseini-Asl et al., 2020) and SOLOIST (Peng et al., 2021) are based on the GPT-2 auto-regressive prediction procedure. Nevertheless, it remains unclear which style is more effective for dialogue understanding.
To study this problem, we compare span prediction and auto-regressive language modeling (ARLM) by pre-training the encoder and decoder simultaneously using the denoising and prefix LM objectives from T5. To compare the relative effectiveness of different pre-training styles, we consider 3 setups: 1) Pre-training the model with span prediction only; 2) Continuing the pre-training of models from setup (1) with prefix LM; 3) Pre-training the model only with prefix LM only.
While T5 pre-training has demonstrated its effectiveness for generic language understanding tasks such as the GLUE and SuperGLUE benchmarks, we are curious as to which procedures are biased towards the downstream DST task. While it can be difficult to define an objective that applies immediately to DST, we consider a surrogate pre-training for a seemingly remote task: Summarization. To properly summarize a large chunk of text requires the model to be able to extract key semantics out of a clutter of inputs, which to some extent shares a similar problem structure as DST.
Following this intuition, we choose Pegasus (Zhang et al., 2020b), a strong pre-training objective developed for summarization based on Seq2Seq, as an alternative for comparison. In brief, Pegasus defines a self-supervised objective named Gap Sentence Generation (GSG), which identifies potentially important sentences in a paragraph according to some heuristics (for example, the topm sentences with the highest ROUGE score with respect to the remaining ones), masks them out, and forces the decoder to predict these pivoting sentences. A critical difference between Pegasus and other span prediction objectives is that masked spans are carefully identified instead of randomized. This principled operation positions the model to work specifically well for the downstream task of summarization.

General Setup
Our models are built with the open-source framework Lingvo (Shen et al., 2019) 1 . Each encoder and decoder has 12 Transformer layers, 8 attention head's and embedding dimension 768. Our models are trained with 16 TPUv3 chips (Jouppi et al., 2017). We use the memory-efficient Adafactor (Shazeer and Stern, 2018) as the optimizer, with learning rate 0.01 and inverse squared root decay schedule. We use the default SentencePiece model provided by T5 2 with vocabulary size 32k. For the pre-training procedure, we strictly follow the setups and procedures described in (Zhang et al., 2020b) and (Raffel et al., 2020). For decoding, we use beam search with size 5. We also enabled label smoothing with uncertainty 0.1 during training.

Datasets
We conduct our experiments on the MultiWOZ (Budzianowski et al., 2018) benchmark. The original MultiWOZ dataset, released in 2018, was known to contain substantial annotation errors.  2021)). The existence of multiple versions of the same benchmark, as well as ad-hoc pre-and post-processing procedures 3 adopted by different research groups make it difficult to compare results fairly. We therefore report results on all of Multi-WOZ 2.1-2.4 4 , without any pre-or post-processing of the original data. We use Joint-Goal-Accuracy (JGA) as the metric for all experiments.
In addition to the MultiWOZ datasets, we also report results on the WOZ 2.0 5    Note that on DSTC2, unlike other methods which combines the n-best speech recognition hypotheses as inputs, we make use of only the top 1-best hypothesis for simplicity, although the combination of n-best hypotheses could potentially further improve DST quality.

Results
We first report the MultiWOZ JGA scores achieved by the full-history models in Table 1, in which "span" and "ARLM" indicate masked span prediction and auto-regressive language modeling for pretraining respectively, and "span+ARLM" means pre-training with "span" followed by "ARLM". In addition, WOZ and DSTC2 JGA scores are reported in Table 2.
From Tables 1 and 2 we make the following observations: 1. Pre-training procedures with masked span prediction involved ("span", "span+ARLM") consistently performed better than using "ARLM" alone. This is true even if "span" is continued by "ARLM", and this result is seen in not just MultiWOZ 2.1-2.4 but also WOZ 2.0 and DSTC2.
2. Pegasus pre-training works almost equally well or better than the T5 pretraining, indicating that some features can be shared and transferred between the two tasks. Again, this observation is consistent across all benchmarks. This also corroborates conclusion (1) above in that span prediction objectives are more effective for DST.  3. Without pre-training, model quality drops miserably, as expected.

Recurrent Results
For the recurrent-state model, we report results for the Pegasus pre-trained model on MultiWOZ 2.1-2.4 in Table 3, with N = 1, 2, 3 (number of recent history turns). Each turn contains a pair of user and agent utterance. Our observations on models pre-trained with T5 are similar. The results show that while the recurrent-state models achieved reasonably good JGA on all data sets, they are nevertheless worse than the full-history model, despite the fact that the representation of context is more concise for the recurrent model. What is more, the choice of N can make a big difference to the model quality.  A closer look at failed examples produced by the model reveals that the main reason why the recurrent context representation achieved worse results is that they had a hard time recovering from prediction mistakes made at earlier turns. Since previously predicted states are feedback to the model inputs for future predictions, as long as the model made a mistake at earlier turns, this wrong prediction will be carried over as future inputs, causing the model to make consecutive prediction errors. We therefore believe that for the DST task, it may still be important to provide the model full access to dialogue history, so that it can learn to correct its predictions once a mistake was made in the past.

Remarks
From results enumerated in Sec. 3.3, one will see wildly varying scores across MultiWOZ 2.1-2.4, despite the fact that each dataset evolved from the same benchmark. This poses a concerning question of whether existing approaches can generalize well across different setups and benchmarks. For example, TripPy performed remarkably well on 2.1 and 2.3 (55.3%, 63.0%), but dropped to 59.6% on 2.4 (which claims to be the "cleanest" version of MultiWOZ). SOM-DST on the other hand, underperformed on 2.1 and 2.3 but achieved a strong result on 2.4.
We therefore suggest researchers working on the MultiWOZ benchmark report results on multiple version of the data with consistent or no data processing steps, to provide the community a more faithful assessment of the quality of their approaches.

Related Work
Generative sequential models have been applied for task-oriented dialogue problems in several ways. (Budzianowski and Vulić, 2019;Hosseini-Asl et al., 2020;Peng et al., 2021) adopted GPT-2, a unidirectional pretrained Transformer LM, as backbones for the generation of states, actions and responses. Under the framework of Seq2Seq, perhaps most similar to our work is (Feng et al., 2020), which adopts a Transformer encoder-decoder architecture, with the encoder pre-trained with BERT which is also used to encode schema. Besides,  is an early example that uses encoder outputs as state representation, merged with KB representation for the decoder to generate responses; (Lei et al., 2018) proposes a simplistic two stage CopyNet on top of Seq2Seq model to enable word copying from input sequences;  proposes a hierarchical Seq2Seq model for coarse-to-fine DST; (Zeng and Nie, 2021) proposes a "flat" encoder-decoder structure which reuses a BERT encoder for the function of a decoder with hidden layer states reused.
In terms of pre-training, BERT and GPT are still the most commonly used techniques (Zaib et al., 2021;Zhang et al., 2020c). Various pre-training methods developed for dialogue-specific problems have also been developed.  uses dialogue specific datasets for pre-training and fine-tuning. (Mehri et al., 2019) studies 4 ways of pre-training aiming at better capturing discourselevel dependencies for multi-turn dialogues;  proposed a contrastive pre-training loss to capture important qualities of dialogues; (Bao et al., 2021) proposed a curriculum pre-training procedure for response generation, subsuming opendomain, knowledge-grounded, task-oriented dialogue applications. (Liu et al., 2021) factorize the generative dialogue model according to the noisy channel model, pre-training each component separately.

Conclusion
We studied the problem of how to perform the DST task with Seq2Seq models effectively from the perspective of pre-training and context representation. We demonstrated that Seq2Seq pre-training objectives involving masked span prediction are more preferred than auto-regressive predictions for dialogue understanding. This observation further generalizes to the adoption of Pegasus, a span prediction objective for summarization, which works surprisingly well on DST tasks. We also find that recurrent state representation for dialogue context can work reasonably well.