Diable: Efficient Dialogue State Tracking as Operations on Tables

State-of-the-art sequence-to-sequence systems for dialogue state tracking (DST) use the full dialogue history as input, represent the current state as a list with all the slots, and generate the entire state from scratch at each dialogue turn. This approach is inefficient, especially when the number of slots is large and the conversation is long. In this paper, we propose Diable, a new task formalisation that simplifies the design and implementation of efficient DST systems and allows one to easily plug and play large language models. We represent the dialogue state as a table and formalise DST as a table manipulation task. At each turn, the system updates the previous state by generating table operations based on the dialogue context. Extensive experimentation on the MultiWoz datasets demonstrates that Diable (i) outperforms strong efficient DST baselines, (ii) is 2.4x more time-efficient than current state-of-the-art methods while retaining competitive Joint Goal Accuracy, and (iii) is robust to noisy data annotations thanks to the table operations approach.


Introduction
Dialogue state tracking (DST; Jacqmin et al., 2022) is the task of tracking user requests from the dialogue history in the form of slot-value pairs (Henderson et al., 2014; Mrkšić et al., 2015; Rastogi et al., 2020a). The slots are defined in a domain-specific schema and represent the fields that need to be extracted from the dialogue to execute queries in the backend and generate responses. Recent generative approaches to DST based on language models (Wu et al., 2019; Kim et al., 2020) often use the entire dialogue history as input and represent the state, at each turn, as the concatenation of all the slots in the schema, where inactive slots are reported with a placeholder value (see Figure 2). This representation is known as the cumulative state (Hosseini-Asl et al., 2020; Feng et al., 2021; Zhao et al., 2022) and implies generating the entire state from scratch at each dialogue turn. This approach is computationally inefficient, especially for long conversations and large schemas.

Figure 1: Diable approach to DST. The figure presents the first two turns of a dialogue (user's utterances are orange, the system's are green). When the conversation starts, the state table is empty. At each dialogue turn, the system outputs a table update operation (either INSERT or DELETE), and the state is modified accordingly.
We propose Efficient Dialogue State Tracking as Operations on Tables (Diable, shown in Figure 1), a novel task formulation and a new DST approach that better uses the generative capabilities of language models. Our approach simplifies the design and implementation of DST systems and works with any sequence-to-sequence model. Our intuition is that a DST system translates conversations into filters for database searches. Inspired by formal languages for databases and the recent success in applying sequence-to-sequence models to text-to-SQL tasks (Yin et al., 2020; Scholak et al., 2021), we represent the dialogue state as an implicit table and frame DST as a table manipulation task. At each turn, the system updates the previous state by generating update operations, expressed in a simplified formal language, based on the current dialogue context (see Figure 1). Diable is the first end-to-end DST system that outputs state operations and values jointly while processing all slots simultaneously.

Figure 2: Cumulative state approach to DST. At each dialogue turn, the system outputs all the slots. Inactive slots are filled with a placeholder value (none).
Based on extensive experimentation using the MultiWoz benchmark (Budzianowski et al., 2018), we show that Diable can successfully and efficiently translate conversations into filters for database searches. Our approach minimises the number of input and output tokens required, resulting in a significant efficiency gain (a 2.4x reduction in inference time compared to state-of-the-art cumulative state systems).
Our main contributions are as follows:
• We introduce a novel DST task formulation and a new system, Diable, specifically designed to enhance efficiency and leverage the capabilities of state-of-the-art sequence-to-sequence models.
• We show that our DST task formulation does not require ad-hoc data preprocessing, the full history, or extra supervision, and works with any sequence-to-sequence model without requiring any architectural modification.
• We demonstrate that Diable achieves better Joint Goal Accuracy on MultiWoz than other efficient baselines while being competitive with state-of-the-art cumulative state systems.
• We show that Diable is robust to noise in the training data, resulting in more stable results across three versions of MultiWoz.

A Taxonomy of DST Approaches
The goal of DST systems is to handle long, diverse conversations in multiple domains with large schemas and unrestricted vocabulary, potentially without extra supervision (Eric et al., 2020; Rastogi et al., 2020b). Achieving this goal has prompted the development of different DST approaches.
Ontology-based approaches treat DST as either a classification or a token classification task. They assume that all possible slot-value pairs are restricted to a fixed set, or ontology, either predefined or extracted from the training data. Classification-based approaches output a probability distribution over values given the dialogue context and a slot (Henderson et al., 2014), while token classification approaches output a probability distribution over slots for each token (Liao et al., 2021).
The ontology-based formulation simplifies the DST task considerably; thus, the performance of these systems is usually relatively high for specific datasets (Zhong et al., 2018; Ye et al., 2021, 2022a).
Complex dialogues with large schemas pose a significant challenge for traditional ontology-based approaches as they do not easily generalise to new domains nor scale to large ontologies (Mrkšić et al., 2017; Rastogi et al., 2017; Zhong et al., 2018; Ye et al., 2021). For this reason, ontology-based approaches are out of scope for our paper.
Open-vocabulary approaches address these limitations by formulating DST as either a reading comprehension task, wherein for each slot a span is extracted from the dialogue context (Gao et al., 2019; Chao and Lane, 2019), or as a generation task, wherein a value for each slot is generated based on the dialogue history (Wu et al., 2019).
By leveraging sequence-to-sequence models (Brown et al., 2020; Lewis et al., 2020; Raffel et al., 2020), generative approaches have recently achieved state-of-the-art results (Xu and Hu, 2018; Lee et al., 2019; Chao and Lane, 2019; Gao et al., 2019; Wu et al., 2019; Kumar et al., 2020; Heck et al., 2020; Hosseini-Asl et al., 2020; Lee et al., 2021; Zhao et al., 2021, 2022). However, these methods predict the dialogue state from scratch at each turn and generate a value for each slot, even when a slot is not active (Figure 2). We argue (§6) that these are the main sources of inefficiency in current DST systems. We compare Diable with these methods in the "Cumulative State Models" section of Table 1.
Efficient approaches seek efficiency by minimising the number of values to generate, decomposing DST into two successive sub-tasks: state operation prediction and value generation. In this way, only the slots that need to be changed are considered for value generation (Kim et al., 2020). These approaches are the most related to Diable in that they target efficiency. We compare them against Diable in the "Efficient Models" section of Table 1. Often, these methods (Ren et al., 2019; Zhu et al., 2020) use the cumulative state representation, which is the primary source of inefficiencies (we discuss this issue in the context of §5, Table 2), and need to output operations for all slots. For example, Kim et al. (2020) and Zeng and Nie (2020) predict an operation for each slot in the input by adding a classification head on top of the tokens representing the individual slots and predict four kinds of state operations: "carryover", "delete", "dontcare", and "update". For those slots categorised as "update", the contextual representation is further processed to decode the slot value. However, this approach limits the ability of such systems to deal with large schemas because the full schema needs to fit in the input context. Differently from these approaches, we remove the two-component structure by adopting a sequence-to-sequence approach that allows us to jointly generate operations and values for all slots simultaneously and works with any sequence-to-sequence model. Importantly, we only need to predict operations for the active slots (i.e., the slots actually mentioned in the conversation).

Lin et al. (2020b) seek efficiency differently by introducing the notion of "Levenshtein belief span", based on the concept of belief span (Lei et al., 2018), which reformats the dialogue state into a text span allowing models to generate slot values dynamically. They propose to focus only on the differences between states at subsequent turns. We take this approach a step further by explicitly outputting operations for all slots that change from one turn to another simultaneously, while retaining our minimal state representation.
Diable: Dialogue State Tracking as Operations on Tables
We introduce a novel efficient formulation of the DST task and a system, Diable, specifically designed to enhance efficiency and optimise the capabilities of state-of-the-art sequence-to-sequence models. In this section, we describe our approach, formalise the DST problem, and introduce the concepts of state as a table and state operations.

Problem Definition
The goal of DST is defined as learning a mapping from a dialogue context to a dialogue state. Specifically, let D_{1:T} = (D_1, ..., D_T) denote a dialogue of T turns, where D_t = (U_t, R_t) denotes the user utterance and the system response at turn t, respectively. At turn t, the dialogue context, C_t, is defined as the set of all the information available up to that turn. It always includes the current dialogue utterance, but can additionally contain the previous state, utterances from the dialogue history, and extra supervision (e.g., slot descriptions and the schema). We consider a dialogue context composed of only the previous dialogue turn(s) and the previous state, that is, C_t = (D_t, B_{t-1}). We do not use any schema information and let the model learn it during training.

The dialogue state at turn t is defined as a set B_t = {(s, v_t) | s ∈ S_t}, where S_t ⊆ S denotes the subset of active slots at that turn out of all the predefined slots in the schema, and v_t is the value corresponding to slot s. The tracker is expected to carry over the previously extracted slots into the current state, i.e., the state at each turn includes all the slots active since the beginning of the conversation up to that point. We operationalise this process by framing it as a sequence-to-sequence task in which a model, f_θ, receives a textual representation of the dialogue context, τ_c(C_t), and outputs a textual representation of the required operations, τ_s(O_t), where τ_c and τ_s are the templating functions that convert the dialogue context and the state operations to a string, respectively. We provide more details about these functions in Appendix B. The structure of the system can be described as follows:

    τ_s(O_t) = f_θ(τ_c(C_t))                (1)
    B_t = Interpret(τ_s(O_t), B_{t-1})      (2)

where in Eq. (2) we use an operation interpreter to parse the string representation of the operations and apply them to the previous state, B_{t-1}. Based on the definition of the state operations, the operation interpreter can be based on different formal languages (e.g., Regular Expressions, SQL). We use T5v1.1 (Raffel et al., 2020) as the backbone for Diable. During training, we use teacher forcing (Goyal et al., 2016) and pass the oracle dialogue state, B_{t-1}, in the input. At test time, we pass the previously predicted state, B̂_{t-1}, instead. To learn the model, we optimise the negative log-likelihood of the state operations given the dialogue context, that is,

    L(θ) = − log P(τ_s(O_t) | τ_c(C_t)),

where f_θ is used to parameterise the probabilistic model P. We use the Adafactor optimiser (Shazeer and Stern, 2018) with no weight decay and set the learning rate to 10^-4 with a constant schedule. We fix the training budget at 40k optimiser steps and set the batch size to 32. We generate the output sequence using beam search decoding with 4 beams. We describe the training and inference processes in detail in Appendix C.

Representing the State as a Table
In our approach, we represent the dialogue state as a table that is sequentially updated. Specifically, a state, B, is represented by a simple two-column table in which the first column is used for the slot name and the second for the slot value (Figure 1). We define the slot name as the concatenation of domain and slot separated by a dash, e.g., restaurant-area (see Appendix B). The state table is passed into the dialogue context by simple "linearisation" (Suhr et al., 2020; Scholak et al., 2021; Shaw et al., 2021): the rows are converted to slot-value tuples, cast to a string using the template {slot} = {value}, and concatenated together using ; as a separator. During the linearisation, we randomise the order of the rows to avoid overfitting to specific positional biases.
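To make the linearisation concrete, the following is a minimal Python sketch; the function name, the use of a plain dict for the table, the example slot-value pairs, and the exact separator spacing are illustrative assumptions rather than the authors' implementation.

import random

def linearise_state(state: dict, rng: random.Random) -> str:
    # Cast each row to "{slot} = {value}" and join with "; ";
    # the row order is shuffled to avoid positional biases.
    rows = [f"{slot} = {value}" for slot, value in state.items()]
    rng.shuffle(rows)
    return "; ".join(rows)

state = {"restaurant-area": "centre", "restaurant-food": "italian"}
print(linearise_state(state, random.Random(0)))
# e.g. "restaurant-food = italian; restaurant-area = centre"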

State Tracking via Table Operations
We introduced above how we operationalise state tracking as the generation of state operations; here, we define the table operations themselves. The state table, B_0, is initialised empty. At turn t, based on the dialogue context, C_t, the DST system generates the set of operations, O_t = {O_1, ..., O_{N_t}}, where N_t is the number of slots that change between turn t−1 and turn t. Finally, the generated operations are applied to the previous state to get the new state. We use two operations: INSERT, which adds a slot-value pair to the table, and DELETE, which removes a slot from it (Figure 1). In principle, a slot value can also change during a conversation; such updates are less frequent and are caused mostly by inconsistent annotations. In our preliminary experiments, we empirically found that adding a dedicated UPDATE operation does not improve performance despite adding complexity; thus, we decided not to use it. We emphasise that the specific definition of the operations is not critical for the efficiency of our method, and it can be easily adapted to any specific use case. To convert operations to strings we use the template {command} {slot} = {value}. If multiple operations need to be applied, we concatenate them using ; as a separator (see Appendix B). We define the target sequence as the concatenation of all the slot-level operations. Since the order in which the operations are applied does not affect the output, we randomise their position during training.
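As an illustration, below is a minimal Python sketch of how such operation strings could be produced and interpreted; the INSERT/DELETE semantics follow the description above, while the parsing logic, function names, and example values are assumptions for exposition, not the exact implementation.

def serialise_operations(operations: list) -> str:
    # Operations are (command, slot, value) triples cast to
    # "{command} {slot} = {value}" and joined with "; ".
    return "; ".join(f"{cmd} {slot} = {value}" for cmd, slot, value in operations)

def apply_operations(state: dict, operations_str: str) -> dict:
    # A simple operation interpreter: parse the generated string and
    # update the previous state accordingly.
    new_state = dict(state)
    for op in operations_str.split(";"):
        op = op.strip()
        if not op:
            continue
        command, _, rest = op.partition(" ")
        slot, _, value = (part.strip() for part in rest.partition("="))
        if command == "INSERT":
            new_state[slot] = value       # add or overwrite a slot-value pair
        elif command == "DELETE":
            new_state.pop(slot, None)     # drop the slot if present
    return new_state

state = apply_operations({}, "INSERT hotel-area = centre; INSERT hotel-stars = 4")
state = apply_operations(state, "DELETE hotel-stars = 4")
print(state)  # {"hotel-area": "centre"}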

Datasets
The MultiWoz dataset (Budzianowski et al., 2018) is a collection of 10k multi-domain, task-oriented, human-to-human conversations. It is one of the most used benchmarks in the DST literature (Jacqmin et al., 2022). Nonetheless, it is known to contain annotation errors, and previous work has proposed different versions (Eric et al., 2020; Han et al., 2021; Ye et al., 2022b) and data normalisation procedures (most notably, the TRADE scripts from Wu et al. (2019), which normalise both text and labels) to mitigate this issue. Thus, it is difficult to have a fair comparison of results across the literature. Following the MultiWoz convention (Wu et al., 2019), we filter out dialogues in the "bus", "police", and "hospital" domains (and the respective slots from multi-domain dialogues), and remove the invalid dialogue SNG01862.json. We experiment with multiple versions (2.1, 2.2, and 2.4) and use the data as-is (see Appendix B). To construct the training set, we extract the operations automatically from the dataset.
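A plausible way to derive these gold operations, sketched below in Python, is to diff consecutive annotated states, emitting an INSERT for every new or changed slot and a DELETE for every slot that disappears; the exact rule used by the authors may differ, so treat this as an assumption for illustration.

def extract_operations(prev_state: dict, curr_state: dict) -> list:
    # Emit an INSERT for every new or changed slot and a DELETE for every
    # slot that is no longer active in the current gold state.
    ops = []
    for slot, value in curr_state.items():
        if prev_state.get(slot) != value:
            ops.append(("INSERT", slot, value))
    for slot, value in prev_state.items():
        if slot not in curr_state:
            ops.append(("DELETE", slot, value))
    return ops

prev = {"restaurant-area": "centre"}
curr = {"restaurant-area": "centre", "restaurant-food": "italian"}
print(extract_operations(prev, curr))  # [("INSERT", "restaurant-food", "italian")]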

Evaluation
We use Joint Goal Accuracy (JGA; Henderson et al., 2014) as the main metric for all experiments: it measures the proportion of turns for which the predicted state (slot-value pairs) exactly matches the gold label. At each turn, for each slot, a list of acceptable values is included in the annotation (e.g., hotel-name: ["marriott", "marriott hotel"]). We consider a value correct if it exactly matches one of the available options. Importantly, we perform an uncased evaluation since the annotation casing is not consistent.
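A minimal Python sketch of the metric as described (uncased exact match against the list of acceptable values per slot); variable names and the handling of extra predicted slots are our own assumptions.

def turn_is_correct(pred: dict, gold: dict) -> bool:
    # A turn counts only if the predicted slots coincide with the gold slots
    # and every predicted value (uncased) is one of the acceptable values.
    if set(pred) != set(gold):
        return False
    return all(
        pred[slot].lower() in {v.lower() for v in acceptable}
        for slot, acceptable in gold.items()
    )

def joint_goal_accuracy(predictions: list, references: list) -> float:
    correct = sum(turn_is_correct(p, g) for p, g in zip(predictions, references))
    return correct / len(references)

gold = [{"hotel-name": ["marriott", "marriott hotel"]}]
pred = [{"hotel-name": "Marriott"}]
print(joint_goal_accuracy(pred, gold))  # 1.0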

Cumulative State and Efficient Baselines
We compare our results with a set of strong cumulative state models (i.e., models that use all previous turns and output a value for each slot at each turn, see Figure 2), and efficient baseline models. We also implement our own version of a cumulative state model and its "lighter" variant, lightCumulative: the state does not include the inactive slots. In all our experiments, the full cumulative models underperform lightCumulative while being less efficient (∼1.18x slower). Thus, we only report the results of lightCumulative, effectively selecting a stronger baseline.
In the upper part of Table 1 ("Cumulative State Models"), we include results from state-of-the-art generative cumulative state models. In each section, we report details and results for encoder-based, sequence-to-sequence, and T5-based models, respectively. The latter class of models based on T5 is related to our implementation of Diable in that they share the same backbone. However, they are not directly comparable due to the additional text preprocessing and label normalisation. The results of our own re-implementation of a cumulative state model, lightCumulative, are directly comparable as we adopted the same experimental conditions.
In the bottom part of Table 1 ("Efficient Models"), we report the JGA of the latest generative efficient DST approaches in the literature. Despite being related to our implementation of Diable, these approaches are not directly comparable since they rely on additional information (e.g., the schema) or are based on a different backbone model.

Results
In this section, we discuss our experimental results.
In Table 1, we summarise the JGA on three versions of MultiWoz (2.1, 2.2, and 2.4) for both Diable and the baseline models. The results for the baseline models are taken from previous work (Tian et al., 2021; Wang et al., 2022; Zhao et al., 2021) when they are better than, or missing from, the original papers for a particular version. The results for our Diable and lightCumulative implementations are averaged across 5 random seeds.

Diable and Cumulative State Models
We compare Diable's performance to cumulative state models, i.e., models that have access to all previous turns in the conversation. We emphasise that Diable uses no or only a limited number of previous turns and thus has less context than these models. On the one hand, our goal is to evaluate the trade-off between efficiency and performance; on the other, to study the capability of the system to generate correct table operations.
The cumulative state model results are shown in the first part of Table 1. First, D3ST (Zhao et al., 2022) achieves the best JGA score on MultiWoz 2.4 when the backbone is T5-base. Similarly to Diable, D3ST is based on a T5 model; however, it has access to more information, such as the schema, the slot descriptions, and the list of possible values ("picklist") for categorical slots. Nonetheless, Diable scores within 1 standard deviation in terms of JGA, while being more than 2x more efficient.
When the backbone model of D3ST (Zhao et al., 2022) is scaled up to XXL (11B), it scores 57.80, 58.70, and 75.90, respectively, on the three versions of the MultiWoz dataset. These scores are significantly higher than all other baselines. However, this improvement is solely due to increasing the model size, and we argue that the same performance improvement can be achieved by scaling the backbone of Diable to larger models. In particular, error analysis shows that most of the errors in our instantiation of Diable-based systems are due to the model not recognising active slots ("under-predictions"). A larger backbone model can alleviate this issue by picking up less obvious contextual cues. Finally, the difference is more significant for version 2.1 because D3ST also applies text preprocessing, as used in other baselines. Moreover, baselines that use smaller models (the first part of Table 1) consistently score lower than those based on the larger and better pre-trained T5 checkpoints. The only exception is AG-DST (Tian et al., 2021), but their backbone model has 310M parameters.

We further compare Diable to our implementation of a cumulative state T5-based model (lightCumulative). This comparison is fairer, as the models are put in the exact same experimental conditions. Our goal here is to quantify the improvements due to our proposed approach, isolating additional effects from model pre-training and architectural changes. The results show that our Diable approach has a significantly better JGA (+3 absolute points) on the less noisy version of MultiWoz (i.e., 2.4) and has similar performance on 2.1 and 2.2, while still being more efficient.
Next, we compare Diable with two other strong models based on the T5 architecture (making them directly comparable, besides the preprocessing steps): DaP (ind) (Lee et al., 2021) and Seq2seq (Zhao et al., 2021). Both models achieve a slightly higher JGA than Diable on 2.2 (1 absolute point); however, they are again less efficient and have access to a larger context. DaP relies on slot descriptions (thus, the schema) and runs inference once for every slot, which is not scalable to large schemas. The improvements in Seq2seq are likely due to its extensive DST-specific pre-training.
Our results confirm that Diable-based systems, while being efficient, achieve competitive JGA on MultiWoz compared to both other strong efficient DST baselines and state-of-the-art cumulative state DST systems, without requiring any ad-hoc data preprocessing, access to the full history, extra supervision, or large backbone models.

Diable and Efficient Models
Comparing Diable to other efficient state-of-the-art DST models based on state operations, we see significant improvements of up to almost 4 JGA points on version 2.4 (shown in the "Efficient Models" section of Table 1). Only Transformer-DST (Zeng and Nie, 2020) is able to outperform our model on 2.1. However, it uses data preprocessing (text and label normalisation) and extra supervision (the schema). This model is an improved version of SOM-DST (Kim et al., 2020); therefore, the same argument applies to the latter, which achieves slightly lower performance even when using the same extra supervision and text normalisation.

Latency Analysis
Table 2 reports the median inference time and the speed-up factor of Diable relative to lightCumulative. Our approach is more than 2x faster, even when using the full history as context. These results clearly show that the biggest efficiency gains are obtained by shortening the output sequence, that is, replacing the cumulative state with state operations. Consequently, adding only the last N turns comes at a small cost for Diable while potentially helping the model to recover values not present in the current dialogue context. Using the Inference Time Complexity (ITC) notation introduced by Ren et al. (2019), our proposed approach has Ω(1) and O(N), where N is the number of slots in the schema, as the best- and worst-case ITC, respectively, whereas SOM-DST and Transformer-DST have a best-case ITC of Ω(N).

Discussion
Our task formalisation is intuitively simple and is especially beneficial for large pre-trained sequence-to-sequence models. First, the state is expanded sequentially and thus only includes the necessary slots. This minimises the size of the input context, allowing the models to scale to larger schemas before reaching their maximum input length. Second, since the model only needs to focus on the state changes, the decoder only generates operations for a limited number of slots (previous slots persist implicitly in the state, so there is no need for explicit "carryover" operations). Third, our system is general in that it deals with span-based and categorical slots in the same way, and outputs both the operations and the slot-value pairs in a single forward pass, without the need for specialised architectures. Finally, since not all pre-defined slots are needed in the input, we do not have to access the schema beforehand, and thus it can be learned from the data directly.

Impact of the Dialogue History
Table 3 compares the effect of the context size for both lightCumulative and Diable trained on version 2.2. Comparing the results from the upper and bottom parts of the table, we see that using only the previous state barely changes the JGA of lightCumulative but benefits Diable. We hypothesise that, being a cleaner and more compact representation of the conversation, the previous state introduces less noise than the full history. This is especially true in conversations for which the value of a slot is changed or removed throughout the conversation. However, completely removing the dialogue history reduces the ability of the model to recover values referenced at the beginning of the conversation. We hypothesise that this negative effect is not too evident because of the entity bias present in the MultiWoz dataset (Qian et al., 2021), which allows the model to memorise and correctly predict values for certain slots even when they are not present in the dialogue context (§6.4). Finally, when evaluated on the cleaned version 2.4, Diable consistently matches or outperforms lightCumulative.

Table 4 compares the performance of the base and large versions of T5v1.1 for both lightCumulative and Diable models. We find that scaling up the model size does not improve JGA; however, we hypothesise that scaling it further can improve the performance similarly to D3ST (Zhao et al., 2022).

Impact of the State Representation
When replacing the tabular state representation with a cumulative one in Diable, ceteris paribus, we find a 3% reduction in JGA for version 2.4 and up to 5% for other versions. In this variant, at the beginning of the conversation, the state includes all the slots with the none value; the INSERT operation is unchanged, while the DELETE operation becomes an update with a none value.

Error Propagation
Diable, like any recursive state model (Zhao et al., 2021), is affected by error propagation: since we pass the previously predicted state at each turn, errors can persist across turns. We measure the potential gains from completely avoiding error propagation by using the gold previous state rather than the predicted one in the dialogue context. Table 5 reports the resulting upper bound on JGA for our simple Diable instantiation and highlights that there is potential to improve JGA by adopting recent methodologies targeted at reducing error propagation (Zhang et al., 2022).
In our experiments, we identify two main sources of error propagation that account for more than 60% of the total mistakes: state under-prediction (i.e., the model does not recognise that a certain slot is active) and value misprediction. Under-prediction happens when the system is unable to recognise that specific slots are active. Since MultiWoz presents a strong entity bias (e.g., "Cambridge" appears in 50% of the destination cities in the training data; Qian et al., 2021), a possible direction to address this issue is to use data augmentation methods targeted at reducing entity bias and annotation inconsistency (Summerville et al., 2020; Lai et al., 2022), thereby improving the overall slot recall. Value misprediction happens when the value for a correctly predicted slot is wrong. This is especially evident when the same slot is discussed in multiple turns and its value can potentially change. One way to address this limitation is to automatically pre-select the previous dialogue turns that contain the relevant information about a specific slot and include them in the context window (Yang et al., 2021; Guo et al., 2022; Wang et al., 2022).
We do not constrain the generation in any way, and thus Diable can generate invalid slots or values (e.g., attraction-time). In our experiments, errors due to invalid states are rare (less than 2% of the total mistakes): in fact, using the schema to filter incorrectly predicted slots at each turn did not improve the JGA significantly (less than 1%). There are several promising techniques that can further improve the performance of our system at a minor efficiency cost, such as amendable generation (Tian et al., 2021), constrained generation (Lin et al., 2020a), and schema descriptions (Lee et al., 2021; Zhao et al., 2022). Finally, with larger schemas and more diverse conversations, constraining the set of values that the model can predict can potentially further improve performance and safety.

Future Directions
In §5, we showed that Diable is an effective DST approach that is competitive with budget-matched (in terms of parameter count) cumulative state baselines. We emphasise that our goal is not to reach state-of-the-art JGA on the MultiWoz dataset. We intentionally keep our Diable-based models as simple as possible, by not adding extra supervision signals, to clearly measure the effectiveness of our approach. However, the benefits coming from Diable can easily be added on top of other methods. We believe our approach can be improved and expanded in several ways.
Explicitly Modelling Slot Dependence. Diable treats slots independently of each other and implicitly uses the model's capability of learning their co-occurrence patterns. However, as the schema becomes larger and the dialogues longer, slot dependence becomes more complex and the model might fail to learn it effectively. Explicitly modelling the slot dependence can potentially improve performance, robustness (to spurious correlations), and efficiency, for example, by selecting only relevant turns from the dialogue history as context to predict slot values. In our experiments, we show consistent improvements across all MultiWoz versions by adding the previous 4 dialogue turns to the dialogue context (Table 1, last 2 rows). However, this simple heuristic might be suboptimal when the schema is large and the dialogue is long because relevant turns may not be the immediately preceding ones, and we might add irrelevant context or omit relevant information. Instead, adopting a more granular turn selection method based on the slot dependence (Yang et al., 2021; Guo et al., 2022) can improve both performance and efficiency.

Representations. When passing the previous state in the context, we simply linearise the table. That is, we represent the previous states as discrete tokens passed in the input context for the next turn. This allowed us to use the T5 architecture without modification. A promising direction for future work is to use continuous representations for the state table (Wu et al., 2022). This representation can potentially require fewer or no tokens to represent the state, thus further improving the efficiency of our approach.

Conclusions
In this paper, we introduce a novel efficient formulation of the DST task and a new system, Diable, specifically designed to enhance efficiency and optimise the capabilities of state-of-the-art sequence-to-sequence models. Diable represents the dialogue state as an implicit table and updates it using a sequence-to-sequence model to generate table operations at each turn. Our task formalisation provides a significant efficiency gain (up to a 2.4x speed-up in inference time) compared to the cumulative state approaches adopted by current state-of-the-art DST systems. Moreover, this sizeable improvement comes with a minimal efficiency-accuracy trade-off. In fact, Diable outperforms other efficient DST approaches in the literature by more than 3 absolute JGA points on MultiWoz 2.4 and shows competitive performance with respect to current state-of-the-art DST systems. Diable comes with other advantages: it is simple and general (it makes no assumptions about the schema and does not require any specialised architecture) and it is robust to noise. Moreover, it allows one to easily plug and play sequence-to-sequence models without any architectural modification. Finally, our approach goes beyond the dialogue setting and can be adapted to the sequential processing of long documents for information extraction tasks with memory-constrained language models.

Acknowledgements

We thank Thomas Mueller and Marcello Federico for their constructive and detailed feedback on the early versions of the paper. We thank Tiago Pimentel, Josef Valvoda, Davide Lesci, and Marco Lesci for their feedback on the final version.

Limitations
In Section 6.4, we already discussed the limitations and challenges of the proposed model (e.g., the model has access to less contextual information from the conversation history; errors can propagate more easily as it does not re-predict the entire cumulative state at each step; and mistakes can only be fixed by an explicit delete or update operation). In the following, we concentrate on the limitations concerning the scope of this work.
Languages. We experimented with a limited number of languages (English) and datasets (MultiWoz 2.1, 2.2, and 2.4). We do not have experimental evidence that our method can work for other languages, including languages with a richer morphology. Still, our system has been built without any language-specific constraints or resources, other than the T5 checkpoints and the manually annotated training set. Our method can be applied to any other language (without modification) for which these resources are available, or by applying cross-lingual techniques and resources (e.g., multilingual language models, translation/projection of the training set) to transfer to other languages zero-shot. In those cases, the expected quality is lower, but the efficiency advantage of Diable remains.
Models. We experimented with two models (T5v1.1 base and large). This is due to restricting our computational budget to be both economically and environmentally friendly, which made it infeasible to conduct thorough experiments using larger-scale language models. However, we re-emphasise that Diable allows one to easily plug and play arbitrary language models, and the efficiency advantage of Diable remains.
Diversity in the Evaluation Dataset. We experimented with three different versions of the MultiWoz dataset (2.1, 2.2, and 2.4). Although this is the current benchmark for DST accepted by the community, and we followed the standard evaluation methodology and metrics, we are aware that the results presented might not be directly generalisable to other datasets or real-world scenarios with a considerable data shift with respect to MultiWoz. Additionally, MultiWoz has a certain level of noise and this can have an impact on the evaluation and the generalisation capabilities of the trained models.

A Dataset Statistics

In this section, we report statistics about versions 2.1-2.4 of the MultiWoz dataset. Table 6 shows the distribution of domains across dialogues and turns. Table 9 reports the distribution of slots across dialogues and turns. Finally, Table 10 reports general statistics regarding the frequency of domains, slots, and turns.

B Data and Input Details
Slot-Value Representation. We represent states as a list of triplets in the form (domain, slot, value). We define a slot as the concatenation of the domain name and slot name, e.g., (restaurant-area, center). Annotations can possibly contain multiple acceptable values. During testing, this is not problematic as we consider a prediction correct when the predicted value is contained in the acceptable set of values. However, during training, we need to choose one value in order to use teacher forcing. To do so, we first check which of the possible values actually appears as a span in the text. If none are present, we choose the longest. Since the casing is inconsistent, we lowercase all the values.
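A small Python sketch of the target-value selection rule just described (prefer a value that appears verbatim in the dialogue text, otherwise take the longest, and lowercase the result); the function name and the tie-breaking when several values appear in the text are our own assumptions.

def choose_training_value(acceptable_values: list, dialogue_text: str) -> str:
    # Prefer a value that appears verbatim (uncased) in the dialogue text;
    # otherwise fall back to the longest value. The result is lowercased.
    spans = [v for v in acceptable_values if v.lower() in dialogue_text.lower()]
    chosen = spans[0] if spans else max(acceptable_values, key=len)
    return chosen.lower()

print(choose_training_value(["marriott", "marriott hotel"], "I stayed at the Marriott."))
# "marriott"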
Label Normalisation. MultiWoz contains noisy annotations, and different authors have tried to alleviate the issue by devising different label normalisation procedures, for example, the scripts by Wu et al. (2019) and Hosseini-Asl et al. (2020). In this work, we try to balance being as faithful as possible to the original annotations without needlessly penalising the evaluation of our system. In detail, we target the following noisy annotation values:
• Typos: "guest house", "swimming pool", "night club", and "concert hall" appear both with and without spaces. When one version is present, we also add the other as a possible correct answer. This normalisation affects the "hotel-type", "attraction-type", "hotel-name", and "attraction-name" slots.
• Spelling inconsistencies: "theater" and "center" appear in both their UK and US English spellings. When one is present, we add the alternative version. This normalisation affects the "hotel-area", "restaurant-area", "attraction-area", and "attraction-type" slots.
• Names starting with "the": some names appear with the "the" proposition.In such cases, we add a version without the proposition.This normalisation affects the "hotelname","restaurant-name", and "attractionname" slots.
• Categorical slots: the "hotel-star" slot is a categorical slot whose values are integers in 0-8. In some cases, the annotation includes the literal "star" string. In such cases, we remove "star" from the annotation.
Overall, these are minimal changes. Many such errors caused by noisy labels are still present in the dataset. We leave the creation of an even cleaner evaluation dataset as future work. More details on the impact of these normalisations are available in Appendix D.
Input Creation. In a preliminary study, we experimented with different possible templates for the input sequence and found that, beyond a certain point, adding more text to the prompt was not beneficial and the exact wording did not have a big impact. Therefore, to balance simplicity and accuracy, we used the same simple templates for all experiments, for both lightCumulative and Diable. In our experiments, up to decimal differences, including the schema in the input context does not affect performance. Similarly, there is no impact from excluding the <sep> token used to separate the various parts of the input context. However, excluding the prefixes (i.e., generate update operations and generate full state) reduces performance by up to 1.2% JGA for both the state operations and the cumulative state models. A similar effect is caused by removing the "system"/"user" identifiers (as also observed by Hosseini-Asl et al. (2020)). Note that we did not optimise over the choice of the prefixes; given the instruction fine-tuned models recently proposed (Chung et al., 2022), we hypothesise, and leave for future work, that different prompts can improve DST, especially in the few-shot setting. Finally, lower-casing the text decreases performance by up to 1% JGA.
An example of an actual processed conversation is shown in Table 7.
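As a rough illustration of how such an input string might be assembled for Diable, the following Python sketch uses the prefix, <sep> separator, and speaker identifiers mentioned above; the exact field labels (e.g., "previous state:") and their ordering are assumptions and may differ from the template used in the paper.

def build_diable_input(user: str, system: str, prev_state_str: str, history: str = "") -> str:
    # Task prefix, current turn with speaker identifiers, optional history,
    # and the linearised previous state, separated by the <sep> token.
    parts = [
        "generate update operations",
        f"system: {system or 'none'} user: {user or 'none'}",
    ]
    if history:
        parts.append(f"history: {history}")
    parts.append(f"previous state: {prev_state_str or 'none'}")
    return " <sep> ".join(parts)

print(build_diable_input("i need a cheap hotel in the centre", "", ""))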

C Training Details
Hardware Details. We used a server with 8 Tesla V100 GPUs. The batch size on each GPU was limited to 8. Thus, we ran the majority of the experiments in a multi-GPU setting with 4 GPUs allocated to each training job, with no need for gradient accumulation. Diable models were trained in 3-4 hours, while cumulative state models required 5-6 hours.
Model. For all experiments, we use the T5 architecture (Raffel et al., 2020) and the associated T5v1.1 base and large checkpoints available on the HuggingFace Hub via the transformers (Wolf et al., 2020) library, implemented using the PyTorch (Paszke et al., 2019) framework. In a preliminary study, we compared T5v1.1 with the original T5 and flan-T5 (Chung et al., 2022) variants and did not see any significant differences; we chose the T5v1.1 checkpoints since they are not fine-tuned on downstream tasks.
Data Preparation. We use the default SentencePiece (Kudo and Richardson, 2018) tokenizer with a 32k vocabulary associated with the T5v1.1 checkpoint and available in the tokenizers library (Wolf et al., 2020). We add <sep> to the vocabulary as a special separator token. We truncate only the input sequence at 512 tokens during training, but do not truncate during evaluation in order not to penalise cumulative state models (our main baseline).
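For concreteness, a sketch of this setup using the public transformers API; the checkpoint identifier is the standard T5v1.1 base name and the call sequence is illustrative rather than the exact preprocessing pipeline.

from transformers import AutoTokenizer

# Load the SentencePiece tokenizer shipped with the public T5v1.1 base checkpoint.
tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-base")
# Register <sep> as an additional special token used to separate input fields;
# the model's embedding matrix must then be resized accordingly, e.g. with
# model.resize_token_embeddings(len(tokenizer)).
tokenizer.add_special_tokens({"additional_special_tokens": ["<sep>"]})
# During training, only the input sequence is truncated at 512 tokens.
batch = tokenizer("generate update operations <sep> ...", truncation=True, max_length=512)
print(len(batch["input_ids"]))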
Training. We use the Pytorch-Lightning library to implement the training loop. For all experiments, we use the Adafactor optimiser (Shazeer and Stern, 2018) with no weight decay, eps=[1e-30, 0.001], clip_threshold: 1.0, decay_rate: -0.8, beta1: null, scale_parameter: false, relative_step: false, and warmup_init: false. We set the learning rate to 10^-4 and use a constant schedule. We fix the training budget at 40k optimiser steps and set the batch size to 32 to trade off the speed and precision of the gradient estimation.
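With the hyper-parameters listed above, the optimiser instantiation in the transformers library would look roughly as follows; this is a sketch, assuming the public T5v1.1 base checkpoint.

from transformers import Adafactor, T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("google/t5-v1_1-base")
# Constant learning rate of 1e-4, no weight decay, and the remaining
# Adafactor settings as listed in the text.
optimizer = Adafactor(
    model.parameters(),
    lr=1e-4,
    weight_decay=0.0,
    eps=(1e-30, 1e-3),
    clip_threshold=1.0,
    decay_rate=-0.8,
    beta1=None,
    scale_parameter=False,
    relative_step=False,
    warmup_init=False,
)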
Inference. For all experiments, we use beam search decoding, as implemented in the HuggingFace library (Wolf et al., 2020), with 4 beams and no additional settings.
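A sketch of the corresponding decoding call with the HuggingFace generate API; the input string and the generation length cap are illustrative assumptions.

from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-base")
model = T5ForConditionalGeneration.from_pretrained("google/t5-v1_1-base")

inputs = tokenizer("generate update operations <sep> user: i need a cheap hotel", return_tensors="pt")
# Beam search decoding with 4 beams, as used for all experiments.
output_ids = model.generate(**inputs, num_beams=4, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))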
Reproducibility. For reproducibility, we fix the pseudo-random seed for both data shuffling and model initialisation. In each experiment, we fix the seed for the pseudo-random number generators and use CUDA deterministic operations. In particular, we use the seed_everything function from Pytorch-Lightning (https://github.com/Lightning-AI/lightning/blob/94e6d52b7e2f2a9ffc21f7e11e087808666fe710/src/lightning_lite/utilities/seed.py#L20) to set the seed for the pseudo-random number generators in pytorch, numpy, and python.random. In addition, it sets the following environment variables: PL_GLOBAL_SEED, which is passed to spawned subprocesses (e.g., the ddp_spawn backend), and PL_SEED_WORKERS.
Hyper-parameters. In our initial exploration, we used the default hyper-parameters suggested in the T5 paper (Raffel et al., 2020) and on the HuggingFace blog, that is, batch size 128 and a constant learning rate equal to 10^-3. Given the size of MultiWoz, this roughly corresponded to 4k update steps. However, this budget proved to be insufficient. In particular, our own re-implementation of the cumulative state type of models was not in line with the results reported in the literature. More importantly, our Diable model was clearly undertrained, as demonstrated by the fact that model selection on the validation set was consistently selecting the last checkpoint. Therefore, we scaled up the training budget by 10x to roughly 40k update steps. We rescaled the batch size to 32 and, consistently, the learning rate to 10^-4; a similar setup is used by Zhao et al. (2022). This new training budget corresponds to roughly 20 epochs. We did not notice any significant improvement by further increasing it. Finally, we experimented with both Adafactor and AdamW (Loshchilov and Hutter, 2019); the former consistently outperformed the latter while also speeding up the training process.

D Complete Tables of Results
In this section, we report the complete set of results for our Diable system and lightCumulative (our own reproduction of a cumulative state model). We run each experiment with 5 different random seeds and report statistics across runs. Furthermore, we show the effect of label normalisation in rows identified by "fix label".

Table 2: Median instance-level runtime in milliseconds and relative speed-up vs. a cumulative state baseline.

Table 3: Effect on JGA (mean ±1 standard deviation) of different context and state representations.
Table 3 compares the performance of models trained on MultiWoz 2.2 with different context and state representations. Notably, when evaluated on the cleaner 2.4 version (bottom row of both parts of the table), Diable consistently outperforms lightCumulative. In fact, regardless of the dialogue context, Diable achieves a better JGA on 2.4. We hypothesise that the lower accuracy of lightCumulative is due to overfitting the noisy annotations of the training set. In particular, we think that since it generates the full state from scratch at every turn, the decoder might learn wrong correlations amongst slots that are wrongly annotated in the training set. For example, hotel-type and attraction-type are inconsistently and sparsely annotated in the training set, while in the test set of version 2.4 they tend to appear almost always together with the respective hotel-name and attraction-name slots. Thus, a cumulative state model can learn to not generate one when the other is present. Instead, since Diable is based on state changes, we presume that it learns to treat slots more independently.

Table 4: MultiWoz 2.2 and 2.4 test set JGA for T5v1.1 base and large trained on MultiWoz 2.2.


Table 5: JGA (mean ±1 standard deviation) with the gold and the predicted previous state in the input context.

Table 6: Frequency of domains across dialogues and turns for MultiWoz 2.1-2.4.

Table 7: Example of a conversation from the MultiWoz dataset (dialogue MUL0003.json) processed according to our task formalisation. The "Input" column shows the template used to construct the input sequence. It includes optional fields separated by the <sep> token. For example, if the context also includes the dialogue history, we add <sep> history: {history} before the operations. In red, the states from the previous turn. We use the value none to fill in empty utterances (e.g., the first system utterance), states (e.g., the first state is always empty), or no-operations (i.e., when the state does not need to be updated).

Table 8: JGA on the evaluation sets of MultiWoz 2.1-2.4 for T5v1.1-large models trained on the MultiWoz 2.2 training set. Result statistics obtained across 5 random seeds. The evaluation also includes the raw metrics with no label normalisation.

Table 9: Frequency of slots across dialogues and turns for MultiWoz 2.1-2.4.