MetaASSIST: Robust Dialogue State Tracking with Meta Learning

Existing dialogue datasets contain substantial noise in their state annotations. Such noise can hurt model training and ultimately lead to poor generalization performance. A general framework named ASSIST has recently been proposed to train robust dialogue state tracking (DST) models. It introduces an auxiliary model to generate pseudo labels for the noisy training set. These pseudo labels are combined with vanilla labels by a common fixed weighting parameter to train the primary DST model. Despite the improvements ASSIST brings to DST, tuning the weighting parameter is challenging. Moreover, a single parameter shared by all slots and all instances may be suboptimal. To overcome these limitations, we propose a meta learning-based framework, MetaASSIST, to adaptively learn the weighting parameter. Specifically, we propose three schemes with varying degrees of flexibility, ranging from slot-wise to both slot-wise and instance-wise, to convert the weighting parameter into learnable functions. These functions are trained in a meta-learning manner using the validation set as meta data. Experimental results demonstrate that all three schemes achieve competitive performance. Most impressively, we achieve a state-of-the-art joint goal accuracy of 80.10% on MultiWOZ 2.4.


Introduction
Task-oriented dialogue systems have recently become a hot research topic. They act as digital personal assistants, helping users with various tasks such as hotel bookings, restaurant reservations, and weather checks. Dialogue state tracking (DST) is recognized as a core task of the dialogue manager. Its goal is to keep track of users' intentions at each turn of the dialogue (Mrkšić et al., 2017; Rastogi et al., 2020). Tracking the dialogue state accurately is of significant importance, as the state information will be fed into the dialogue policy learning module to determine the next system action to perform (Manotumruksa et al., 2021). In general, the
dialogue state is represented as a set of (slot, value) pairs (Henderson et al., 2014; Budzianowski et al., 2018). The slots for a particular task or domain are predefined (e.g., "hotel-name"). Their values are extracted from the dialogue context. So far, a great variety of DST models have been proposed (Wu et al., 2019; Campagna et al., 2020; Balaraman et al., 2021; Lee et al., 2021; Guo et al., 2022; Shin et al., 2022; Wang et al., 2022). These models assume that all state labels provided in the dataset are correct, without considering the effect of label noise. However, dialogue state annotations are error-prone, especially considering that most dialogue datasets (e.g., MultiWOZ; Budzianowski et al., 2018) are collected through crowdsourcing. The presence of label noise may impair model training and lead to poor generalization performance of the trained model, as deep neural models can easily overfit noisy training data (Zhang et al., 2021).
In order to robustly train DST models from noisy labels, Ye et al. (2022) proposed a general framework dubbed ASSIST, which augments the standard model training procedure with a small clean dataset. As shown in Figure 1, ASSIST first trains an auxiliary model on the small clean dataset and applies this model to generate pseudo labels for each sample in the noisy training set. Then, it linearly combines the pseudo labels and vanilla labels to train the primary model. Both theoretically and empirically, ASSIST has been shown to be effective in reducing the impact of label noise.
However, ASSIST adopts a common weighting parameter to combine the pseudo labels and vanilla labels for all slots and all training samples, which is suboptimal. In reality, different slots tend to have different noise rates (Eric et al., 2020), indicating that the weighting parameter should be slot-wise. On the other hand, different training samples may also require different weighting parameters, since whether pseudo labels or vanilla labels should be preferred is highly dependent on specific training instances. Furthermore, the weighting parameter is considered a hyperparameter and thus needs to be carefully tuned on each dataset.
To address the aforementioned limitations of ASSIST, we propose MetaASSIST, a meta learning-based general framework that supports automatically learning slot-wise (and instance-wise) weighting parameters. Specifically, our contributions are:
• We propose three different schemes for transforming the weighting parameters into learnable functions. These schemes have varying degrees of flexibility, ranging from slot-wise to both slot-wise and instance-wise.
• We propose to train these learnable functions through a meta-learning paradigm that takes the validation set as meta data and adaptively adjusts the parameters of each learnable function (and consequently the weighting parameters) by reducing the validation loss.
• We conduct extensive experiments to test the effectiveness of the three proposed schemes. All of them achieve superior performance. For the first time, we achieve over 80% joint goal accuracy on MultiWOZ 2.4 (Ye et al., 2021a).

Preliminaries
In task-oriented dialogue systems, the DST module transforms users' goals or intentions expressed in unstructured natural language into structured state representations (e.g., a series of slot-value pairs). The state representations are continually updated in each round of user-system interaction.

Problem Statement
More formally, we denote a dialogue of T turns as X = {(R_1, U_1), ..., (R_T, U_T)}, where R_t and U_t denote the system response and user utterance at turn t (1 ≤ t ≤ T), respectively. We use X_t to represent the dialogue context from the first turn to the t-th turn, i.e., X_t = {(R_1, U_1), ..., (R_t, U_t)}.
Further, let S denote the set of all predefined slots and B_t = {(s, v_t) | s ∈ S} the dialogue state at turn t, where v_t is the corresponding value of slot s at turn t. The DST problem is then defined as learning a dialogue state tracker F : X_t → B_t. As discussed earlier, annotating dialogue states via crowdsourcing is prone to incorrect and inconsistent labels, and these noisy annotations are likely to adversely affect model training. We denote the noisy state annotations as B̃_t = {(s, ṽ_t) | s ∈ S}, where ṽ_t is the noisy label of slot s at turn t. In this work, B̃_t refers to the labels provided in the dataset and B_t to the unknown true state annotations. As pointed out by Ye et al. (2022), existing DST approaches are only able to learn a suboptimal dialogue state tracker F̃ : X_t → B̃_t rather than the optimal tracker F : X_t → B_t. Aiming to learn a strong dialogue state tracker F* that better approximates F, Ye et al. (2022) proposed a general framework, ASSIST, that supports training DST models robustly from noisy labels.
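As a concrete illustration of the notation above, a dialogue state B_t can be viewed as a dictionary of (slot, value) pairs that is carried over and overwritten turn by turn. This is a minimal sketch; the slot names and the `update_state` helper are illustrative, not part of any DST model discussed here.

```python
# Illustrative sketch: a dialogue state B_t as (slot, value) pairs over a
# predefined slot set S, updated cumulatively across turns.

SLOTS = ["hotel-name", "hotel-area", "restaurant-food"]  # predefined set S

def update_state(prev_state, turn_updates):
    """Carry over the previous turn's state and overwrite updated slots."""
    state = dict(prev_state)
    for slot, value in turn_updates.items():
        assert slot in SLOTS, f"unknown slot: {slot}"
        state[slot] = value
    return state

# Turn 1: the user asks for an Italian restaurant.
b1 = update_state({}, {"restaurant-food": "italian"})
# Turn 2: the user also wants a hotel in the centre.
b2 = update_state(b1, {"hotel-area": "centre"})
print(b2)  # {'restaurant-food': 'italian', 'hotel-area': 'centre'}
```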

Overview of ASSIST
ASSIST assumes that a small clean dataset is available. Based on this assumption, it first trains an auxiliary model on the clean dataset. Then, it leverages the trained model to generate pseudo labels for each sample in the large noisy training set. The generated pseudo labels are expected to be a good complement to the vanilla noisy labels. Therefore, combining the two types of labels has the potential to reduce the influence of noisy labels when training the primary model.
Denote the generated pseudo state annotations as B̂_t = {(s, v̂_t) | s ∈ S}, where v̂_t represents the pseudo label of slot s at turn t. Within the framework of ASSIST, the primary model is required to predict B̃_t and B̂_t concurrently during the training process. In other words, the target of model training becomes learning a dialogue state tracker F* : X_t → C(B̂_t, B̃_t), where C(B̂_t, B̃_t) denotes a combination of B̂_t and B̃_t. There can be different methods to combine the generated pseudo labels and vanilla noisy labels. The most straightforward way is to combine them linearly, which is also the strategy adopted in ASSIST. The linearly combined label of slot s at turn t is formulated as:

v_t^c = α v̂_t + (1 − α) ṽ_t,    (1)

where v̂_t and ṽ_t are the one-hot vector representations of the pseudo label and the vanilla noisy label, respectively. The parameter α (0 ≤ α ≤ 1) is employed to control the weights of v̂_t and ṽ_t. Let p(v̂_t | X_t, s) denote the likelihood of v̂_t and p(ṽ_t | X_t, s) the likelihood of ṽ_t. Then, the likelihood of the combined label v_t^c is calculated as:

p(v_t^c | X_t, s) = α p(v̂_t | X_t, s) + (1 − α) p(ṽ_t | X_t, s).    (2)

Based on this formula, the training objective of the primary model can be derived as:

L(Θ) = − Σ_{X_t ∈ D_n} Σ_{s ∈ S} log p(v_t^c | X_t, s),    (3)

where D_n represents the noisy training set.
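A minimal numeric sketch of this linear combination and the resulting training loss. The helper names `combine` and `nll` are ours, and the toy lists stand in for model distributions; this only illustrates Eqs. (1)-(3), not the ASSIST implementation.

```python
import math

# Sketch of ASSIST's label combination: v_c = alpha*v_hat + (1-alpha)*v_tilde,
# and the cross-entropy loss against the resulting soft target.

def combine(v_hat, v_tilde, alpha):
    return [alpha * p + (1 - alpha) * q for p, q in zip(v_hat, v_tilde)]

def nll(probs, target):
    # cross-entropy between a soft target and a predicted distribution
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

v_hat = [1.0, 0.0, 0.0]    # one-hot pseudo label
v_tilde = [0.0, 1.0, 0.0]  # one-hot vanilla (possibly noisy) label
v_c = combine(v_hat, v_tilde, alpha=0.4)
print(v_c)  # [0.4, 0.6, 0.0]

probs = [0.5, 0.3, 0.2]    # the model's predicted distribution over values
loss = nll(probs, v_c)     # mixes the two labels' negative log-likelihoods
```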
MetaASSIST: A Meta Learning-Based Version of ASSIST

Equations (1) and (3) show that a single α is shared by all slots when combining the pseudo labels and vanilla labels. This is suboptimal, as the ratio of the noise rate of pseudo labels to that of vanilla labels tends to differ across slots. When the vanilla labels have higher quality than the generated pseudo labels, α should be set to a small value; otherwise, a large α should be used. This implies that setting α to different values for different slots can help train the primary model more robustly. In the following, we first theoretically show that the combined labels obtained via slot-wise weighting parameters, instead of a common one, can better approximate the unknown true labels. Then, we elaborate on the proposed framework MetaASSIST.

Theoretical Justification
Following (Ye et al., 2022), we employ the mean squared loss to define the mean approximation error of any corrupted labels v̄_t to their corresponding unknown true labels v_t, as formalized below:

err(v̄_t) = (1 / |D_c|) Σ_{X_t ∈ D_c} Σ_{s ∈ S} ‖v̄_t − v_t‖²,    (4)

Here, D_c refers to the small clean dataset. Both v̄_t and v_t are the vector representations of labels. Let α_s be the slot-wise weighting parameter for slot s. We use v_t^s to denote the combined label obtained by replacing α with α_s in Eq. (1). Thus,

v_t^s = α_s v̂_t + (1 − α_s) ṽ_t.    (5)

Like α, α_s is also bounded between 0 and 1. Substituting the corrupted labels v̄_t in Eq. (4) with v_t^s and v_t^c, we have the following theorem:

Theorem 1. The optimal mean approximation error with respect to the combined labels v_t^s derived from slot-wise weighting parameters α_s is smaller than or equal to that of the combined labels v_t^c derived from a shared weighting parameter α, i.e., min_{{α_s}} err(v_t^s) ≤ min_α err(v_t^c).

Proof. The conclusion is immediate: any shared parameter α is the special case in which all slot-wise parameters take the same value (α_s = α for every s ∈ S), so the slot-wise minimum can never exceed the shared-parameter minimum.
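Theorem 1 can be checked on a toy example: a shared α is just the special case where every slot-wise α_s takes the same value, so a grid search over slot-wise parameters can never do worse. The data below are fabricated purely for illustration.

```python
# Toy numeric check of Theorem 1 (illustrative data, not from the paper).

def approx_error(alpha, v_hat, v_tilde, v_true):
    # squared error of the combined label to the (unknown) true label
    return sum((alpha * h + (1 - alpha) * t - g) ** 2
               for h, t, g in zip(v_hat, v_tilde, v_true))

# Two slots with opposite noise: slot A's pseudo label is right,
# slot B's vanilla label is right.
slots = {
    "A": ([1, 0], [0, 1], [1, 0]),  # (pseudo, vanilla, true)
    "B": ([0, 1], [1, 0], [1, 0]),
}
grid = [i / 100 for i in range(101)]

best_shared = min(sum(approx_error(a, *slots[s]) for s in slots) for a in grid)
best_slotwise = sum(min(approx_error(a, *slots[s]) for a in grid) for s in slots)
assert best_slotwise <= best_shared
print(best_slotwise, best_shared)  # slot-wise attains 0 here; a shared alpha cannot
```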

Slot-Wise Weighting Parameters as Meta Learnable Functions

In the framework of ASSIST, α is treated as a hyperparameter. It needs to be meticulously tuned in the training phase so that the primary model achieves its best performance. Although it is feasible to tune a single parameter α, it would become extremely painful to tune all the slot-wise parameters, because multi-domain dialogues can have dozens or even hundreds of slots (e.g., there are 37 slots in the MultiWOZ dataset; Eric et al., 2020). To circumvent the troublesome step of tuning each slot-wise parameter α_s of slot s, we propose to learn all these parameters automatically via meta learning (Hospedales et al., 2021). Specifically, we propose three different schemes to cast the slot-wise weighting parameters as learnable functions, described in detail below.

Scheme One (S1): The first scheme assumes that the parameter α_s is fully independent of the dialogue context X_t. As a consequence of this assumption, all training samples share the same α_s for slot s. Given that α_s is restricted to the range 0 to 1, it is tricky to learn it directly with gradient-based optimizers. In our implementation, we introduce an unconstrained learnable parameter w_s and regard α_s as a Sigmoid function of w_s:

α_s = f_1(w_s) = 1 / (1 + e^{−w_s}).    (6)

Thus, the parameter w_s rather than α_s is directly optimized during the training process.
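A minimal sketch of this parametrization (slot names and initial values are illustrative): each slot owns an unconstrained scalar w_s, and the Sigmoid keeps α_s inside (0, 1).

```python
import math

# Scheme S1 sketch: per-slot unconstrained weights w_s, squashed to alpha_s.

def sigmoid(w):
    return 1.0 / (1.0 + math.exp(-w))

weights = {"hotel-name": 0.0, "taxi-departure": -2.0}  # learnable w_s (toy init)
alphas = {slot: sigmoid(w) for slot, w in weights.items()}
print(alphas["hotel-name"])  # 0.5 -- w_s = 0 corresponds to equal weighting
```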
Scheme Two (S2): Apart from being slot-wise, the second scheme assumes that the parameter α_s should also be relevant to the dialogue context X_t (i.e., instance-wise). This assumption is of practical significance, as whether the vanilla labels or the pseudo labels should be preferred may vary across training samples. In order to make α_s instance-wise, we first construct a five-dimensional feature vector from the loss values of both the vanilla labels and the pseudo labels. Here, l̃_s and l̂_s denote the loss values of the vanilla label ṽ_t and the pseudo label v̂_t of slot s associated with the dialogue context X_t, respectively; they are calculated as the negative log-likelihoods:

l̃_s = − log p(ṽ_t | X_t, s),    l̂_s = − log p(v̂_t | X_t, s).

We then utilize an MLP network (Rumelhart et al., 1986) with a single hidden layer, followed by the Sigmoid activation function, to map this feature vector to α_s (i.e., α_s = f_2(·)).

Scheme Three (S3): The first and second schemes require that the weights of the pseudo label v̂_t and vanilla label ṽ_t of each slot in each training sample add up to 1. In reality, however, both v̂_t and ṽ_t can be incorrect for some training samples, in which case it is beneficial to assign small weights to both labels. In the third scheme, we therefore remove the constraint on the sum and adopt two weighting parameters to combine the pseudo labels and vanilla labels. The combined label v_t^s is given by:

v_t^s = α̂_s v̂_t + α̃_s ṽ_t.    (11)

We learn α̂_s and α̃_s (0 ≤ α̂_s, α̃_s ≤ 1) in the same way as α_s is learned in the second scheme (using two MLP-based functions f_3 and f_3′). Note that Eq. (11) can be rewritten as:

v_t^s = (α̂_s + α̃_s) (β_s v̂_t + (1 − β_s) ṽ_t),    (14)

where β_s = α̂_s / (α̂_s + α̃_s). Comparing Eq. (14) to Eq. (5), the main difference is that the combined label is further weighted by α̂_s + α̃_s. This reweighting is expected to discard the training samples whose pseudo labels and vanilla labels are both incorrect, by driving α̂_s + α̃_s to a small value.
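The rewriting of Eq. (11) into the β_s form can be verified numerically. The sketch below uses made-up weights and labels; it only checks the algebraic identity described above.

```python
# Scheme S3 sketch: alpha_hat*v_hat + alpha_tilde*v_tilde equals
# (alpha_hat + alpha_tilde) * (beta*v_hat + (1-beta)*v_tilde) with
# beta = alpha_hat / (alpha_hat + alpha_tilde). The factor
# (alpha_hat + alpha_tilde) lets both labels be down-weighted at once.

def combine_s3(v_hat, v_tilde, a_hat, a_tilde):
    return [a_hat * h + a_tilde * t for h, t in zip(v_hat, v_tilde)]

def combine_rewritten(v_hat, v_tilde, a_hat, a_tilde):
    beta = a_hat / (a_hat + a_tilde)
    scale = a_hat + a_tilde
    return [scale * (beta * h + (1 - beta) * t) for h, t in zip(v_hat, v_tilde)]

v_hat, v_tilde = [1.0, 0.0], [0.0, 1.0]
a = combine_s3(v_hat, v_tilde, 0.3, 0.2)
b = combine_rewritten(v_hat, v_tilde, 0.3, 0.2)
assert all(abs(x - y) < 1e-12 for x, y in zip(a, b))
print(a)  # both labels get small weight: the sample is effectively down-weighted
```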
In schemes S2 and S3, the weighting parameters are both slot-wise and instance-wise. Compared to scheme S1, in which the weighting parameters are only slot-wise, adding instance-wise flexibility can make the combined labels even more accurate in the optimal case. For example, when the pseudo label of slot s in a training sample is correct while its vanilla label is wrong, the best α_s in scheme S2 is 1.0, which leads to zero approximation error.

Learning Algorithm
When training the primary model, besides its own parameters, the parameters of the learnable functions used to predict the weights also need to be optimized. Inspired by the common practice of choosing the best model checkpoint according to performance on the validation set, we employ the validation set as meta data and train the involved functions (i.e., f_1, f_2, f_3 and f_3′) in a meta-learning manner.
For the sake of uniformly describing the learning processes of the three proposed schemes, we unify the combined label v_t^s as:

v_t^s = f(w_1) v̂_t + f′(w_2) ṽ_t,    (15)

where w_1 and w_2 are the parameters of the learnable functions. Note that for schemes S1 and S2, f′(w_2) = 1 − f(w_1). Then, the training objective of the primary model is derived as:

L(Θ) = − Σ_{X_t ∈ D_n} Σ_{s ∈ S} log p(v_t^s | X_t, s),    (16)

Here, Θ represents the parameters of the primary model and is optimized by minimizing L(Θ), i.e.,

Θ*(w_1, w_2) = argmin_Θ L(Θ).    (17)

The optimal parameters Θ*(w_1, w_2) are expected to achieve the best performance on the validation set D_v. Hence, we can optimize w_1 and w_2 in the following way:

w_1*, w_2* = argmin_{w_1, w_2} Σ_{X_t ∈ D_v} Σ_{s ∈ S} L_s^v(Θ*(w_1, w_2)),    (18)

where L_s^v(Θ*(w_1, w_2)) represents the loss of slot s corresponding to the validation sample X_t, calculated from the predictions of the primary model with parameters Θ*(w_1, w_2).

Batch-Based Online Approximation
As shown in Eqs. (17) and (18), two nested loops of optimization are required to calculate the optimal parameters Θ*, w_1* and w_2*. Each single loop over the whole dataset can be fairly expensive. Following (Ren et al., 2018), we adopt an online strategy that updates Θ, w_1 and w_2 alternately through a single optimization loop based on mini-batch data. Algorithm 1 summarizes the overall training procedure (including auxiliary model training).

Algorithm 1 Learning algorithm of MetaASSIST

Input: The small clean dataset D_c, noisy training dataset D_n, validation dataset D_v, batch sizes n, m, k, and numbers of training steps J_A, J_P for the auxiliary and primary models.
Output: The parameters w_1, w_2 of the learnable functions and the primary model parameters Θ^(J_P).
1: // Auxiliary model training
2: for j = 1, 2, ..., J_A do
3:   M_c ← SampleMiniBatch(D_c, n);
4:   Update the auxiliary model on M_c;
5: end for
6: Apply the trained auxiliary model to generate pseudo state annotations B̂_t for each X_t ∈ D_n;
7: // Primary model training
8: Initialize parameters Θ^(0), w_1^(0) and w_2^(0);
9: for j = 1, 2, ..., J_P do
10:   M_n ← SampleMiniBatch(D_n, m);
11:   M_v ← SampleMiniBatch(D_v, k);
12:   Θ̂^(j)(w_1^(j−1), w_2^(j−1)) ← train the primary model on M_n for one step with the combined label v_t^s = f(w_1^(j−1)) v̂_t + f′(w_2^(j−1)) ṽ_t;
13:   w_1^(j), w_2^(j) ← update w_1^(j−1), w_2^(j−1) on M_v with loss values derived from Θ̂^(j)(w_1^(j−1), w_2^(j−1));
14:   Θ^(j) ← update Θ^(j−1) on M_n again using the new label v_t^s = f(w_1^(j)) v̂_t + f′(w_2^(j)) ṽ_t;
15: end for

The procedure of training the primary model in MetaASSIST is similar to that of standard model training, except that three extra steps (lines 11-13) are added, because the optimal combined label v_t^s is unknown at the start. In Algorithm 1, we dynamically update v_t^s by adapting w_1 and w_2. At step j, we first use w_1^(j−1) and w_2^(j−1) to form the combined label and train the primary model on batch M_n for one step, which results in an interim model with parameters Θ̂^(j)(w_1^(j−1), w_2^(j−1)) (line 12). Then, we apply this interim model to the validation batch M_v and compute the validation loss. By lowering this loss (e.g., via a one-step SGD update), we obtain the updated w_1^(j) and w_2^(j) (line 13). After that, we use w_1^(j) and w_2^(j) to update v_t^s and apply this new combined label to train the (j − 1)-th step primary model on batch M_n again, which eventually leads to the updated primary model Θ^(j) (line 14).
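The alternating update can be illustrated on a deliberately tiny problem: a scalar "primary model" θ, one pseudo label, one noisy vanilla label, and a single meta parameter w with α = Sigmoid(w). All gradients are written out by hand for this scalar case; this is a toy sketch of the Ren et al. (2018)-style one-step lookahead, not the authors' implementation.

```python
import math

# Toy scalar version of the batch-based online approximation: the primary
# "model" is a single parameter theta fit to a 1-D target, and
# alpha = sigmoid(w) mixes pseudo label y_hat with vanilla label y_tilde.

def sigmoid(w):
    return 1.0 / (1.0 + math.exp(-w))

y_hat, y_tilde, y_val = 1.0, 0.0, 1.0   # here the pseudo label is the clean one
theta, w = 0.0, 0.0                     # model parameter and meta parameter
lr, meta_lr = 0.1, 1.0

for _ in range(200):
    a = sigmoid(w)
    y_c = a * y_hat + (1 - a) * y_tilde            # combined soft target
    # (1) one lookahead SGD step on the training loss (theta - y_c)^2
    theta_hat = theta - lr * 2 * (theta - y_c)
    # (2) meta step: lower the validation loss (theta_hat - y_val)^2 w.r.t. w;
    #     d(theta_hat)/dw = lr * 2 * sigmoid'(w) * (y_hat - y_tilde)
    d_theta_hat_dw = lr * 2 * a * (1 - a) * (y_hat - y_tilde)
    w -= meta_lr * 2 * (theta_hat - y_val) * d_theta_hat_dw
    # (3) real update of theta with the refreshed combined target
    a = sigmoid(w)
    y_c = a * y_hat + (1 - a) * y_tilde
    theta = theta - lr * 2 * (theta - y_c)

print(sigmoid(w), theta)  # alpha has drifted toward trusting the pseudo label
```

Because the pseudo label matches the validation target here, the meta update steadily raises α: the lookahead step exposes which label mixture lowers the validation loss, exactly the role lines 12-13 of Algorithm 1 play.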

Datasets
We conduct experiments mainly on MultiWOZ 2.4 (Ye et al., 2021a). It is the latest refined version of MultiWOZ 2.0 (Budzianowski et al., 2018), a large-scale multi-domain task-oriented dialogue dataset consisting of over 10,000 dialogues spanning seven domains. The validation set and test set of MultiWOZ 2.4 have been carefully reannotated, while its training set remains the same as that of MultiWOZ 2.1 (Eric et al., 2020) and is therefore noisy. Following (Ye et al., 2022), we adopt the validation set as the small clean dataset. Thus, the validation set is used to train both the auxiliary model and the learnable functions. We also conduct experiments on MultiWOZ 2.0, whose validation set and test set have been replaced with the counterparts of MultiWOZ 2.4. Due to this change, we name the dataset MultiWOZ 2.0* in the following. The only difference between MultiWOZ 2.0* and MultiWOZ 2.4 is that the training set of the former is much noisier.

Evaluation Metrics
We adopt Joint Goal Accuracy (JGA), Joint Turn Accuracy (JTA) and Slot Accuracy (SA) as evaluation metrics. JGA is the primary metric for DST. It refers to the ratio of dialogue turns for which the entire state is correctly predicted. JTA is defined as the ratio of dialogue turns in which the values of all active slots are correctly predicted. A slot is said to be active if its value needs to be updated. SA considers only slot-level information and is calculated as the average of all individual slot accuracies.
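The metrics can be sketched as follows on toy dialogue states (JTA is omitted, since it additionally needs per-turn active-slot annotations):

```python
# Sketch of JGA and SA as described above, on fabricated example states.

def joint_goal_accuracy(preds, golds):
    # a turn counts only if the entire predicted state matches the gold state
    hits = sum(1 for p, g in zip(preds, golds) if p == g)
    return hits / len(golds)

def slot_accuracy(preds, golds, slots):
    # average per-slot correctness over all turns
    correct = total = 0
    for p, g in zip(preds, golds):
        for s in slots:
            total += 1
            correct += p.get(s) == g.get(s)
    return correct / total

slots = ["hotel-area", "restaurant-food"]
golds = [{"hotel-area": "centre", "restaurant-food": "italian"},
         {"hotel-area": "north", "restaurant-food": "none"}]
preds = [{"hotel-area": "centre", "restaurant-food": "italian"},
         {"hotel-area": "centre", "restaurant-food": "none"}]
print(joint_goal_accuracy(preds, golds))   # 0.5: one of two turns fully correct
print(slot_accuracy(preds, golds, slots))  # 0.75: three of four slot values correct
```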

Auxiliary and Primary Models
Table 1: For ASSIST, α = 0.0 means that only the vanilla labels are used to train the primary model; α = 1.0 means that only the generated pseudo labels are used; α = 0.4 is the best common weighting parameter found in (Ye et al., 2022). All three schemes in MetaASSIST use both types of labels.

We use the same auxiliary and primary models as ASSIST to assess the effectiveness of MetaASSIST. Since the clean dataset is small, a simple auxiliary model AUX-DST was specially designed to avoid overfitting (Ye et al., 2022). It uses
slot-token attention to extract slot-specific information and selects the value that best matches this information as the prediction. It is also adopted as one of the primary models. The other primary models considered are: 1) SOM-DST (Kim et al., 2020), an open-vocabulary method that regards the dialogue state as a fixed-sized memory and selectively overwrites this memory with new values; and 2) STAR (Ye et al., 2021b), an ontology-based method that uses a stacked slot self-attention mechanism to learn the correlations amongst slots automatically.

Results and Discussion

Main Results
Table 1 shows the performance of the three primary models on MultiWOZ 2.4 trained using ASSIST and our proposed framework MetaASSIST. We observe that all three schemes in MetaASSIST substantially improve the performance of the primary models on the test set compared to training with only vanilla labels (α = 0.0) or only pseudo labels (α = 1.0). This observation indicates that the proposed schemes are effective in learning appropriate weighting parameters for combining pseudo labels and vanilla labels. Further, we observe that scheme S2 consistently outperforms ASSIST with the best common weighting parameter (α = 0.4), except for the slot accuracy of AUX-DST. For example, STAR achieves 80.10% joint goal accuracy when using scheme S2 to learn the weighting parameters. Table 2 presents the performance of SOM-DST trained on MultiWOZ 2.0*. It also shows that scheme S2 achieves better results. On both MultiWOZ 2.4 and MultiWOZ 2.0*, we find that the performance of scheme S1 slightly lags behind ASSIST (with the best value of α) in terms of joint goal accuracy, even though the weighting parameters learned in scheme S1 are slot-wise. We speculate that the learning algorithm fails to find the optimal slot-wise weighting parameters, settling instead for suboptimal ones. In §5.4, we show that scheme S1 can actually outperform ASSIST when the weighting parameters are initialized with the best value of α used in ASSIST.

Figure 2: The distribution of learned weights in the three schemes. For scheme S2, we include the average weight of each slot in (a). For scheme S3, we illustrate the distribution of the sum of the two weighting parameters it involves.

As for scheme S3, Table 1 shows that it achieves the best performance when AUX-DST is adopted as the primary model. For SOM-DST and STAR, its performance is comparable to the best results of ASSIST. Table 1 also demonstrates that scheme S3 consistently outperforms scheme S1. However, it is inferior to scheme S2 when taking SOM-DST and STAR as the primary model. Recall that scheme S3 has the highest degree of flexibility in its weighting parameters. These results suggest that while higher flexibility can in principle yield better results, the practical performance may fall short due to the difficulty of learning optimal values for the weighting parameters.
From Table 1, it can further be seen that scheme S1 consistently achieves higher validation performance than schemes S2 and S3 (except for the joint turn accuracy of STAR). This might seem confusing, given that scheme S1 underperforms schemes S2 and S3 on the test set and that the validation set is utilized to train the learnable functions in all three schemes, so high validation performance is expected. However, we found that the distributions of the validation and test sets are not exactly the same (e.g., some slot values appear only in the test set). This implies that scheme S1 tends to overfit the validation data. Schemes S2 and S3 suffer less from this issue because their weighting parameters depend not only on the state labels but also on the dialogue context.

Domain-Specific Accuracy
Apart from the overall performance comparison, we also investigate the performance improvements in each domain. For this purpose, we report the domain-specific joint goal accuracy of SOM-DST on MultiWOZ 2.4 in Table 3. As can be observed, MetaASSIST achieves the best performance in four domains. In particular, MetaASSIST outperforms ASSIST (α = 0.4) by 4.06 absolute points in the taxi domain. It can also be observed that MetaASSIST consistently outperforms ASSIST across all domains when ASSIST only considers vanilla labels (α = 0.0).

Distribution of Learned Weights
Figure 2 illustrates the distribution of the learned weights in each scheme. We conduct this study on MultiWOZ 2.4 and use STAR as the primary model. As shown in Figure 2 (a), the learned weights in scheme S1 indeed vary across slots. For most slots, the weights are less than 0.5, indicating that their vanilla labels are of higher quality than their pseudo labels. We also observe that the average weight of each slot in scheme S2 is smaller than the corresponding weight in scheme S1. In fact, the learned weights in scheme S2 are more consistent with the optimal value used in ASSIST (i.e., 0.4). Since schemes S2 and S3 are instance-wise, we randomly select a slot and plot the distribution of weights of this slot over all training samples. The results are shown in Figures 2 (b) and (c). As can be seen, the learned weights vary across training samples. In scheme S2, although the learned weights for most training samples fall between 0.4 and 0.5, there are also many samples whose weights can be as small as 0 or as large as 1. Note that scheme S3 has two weighting parameters; we plot the distribution of their sums. It is interesting to observe that the sums for most training samples are around 1, even though we have removed the summation constraint in scheme S3. Nonetheless, we also observe that the sums for many training samples are less than 1, meaning that small weights have been assigned to both the pseudo labels and the vanilla labels.
Figure 3 illustrates the distribution of weights in scheme S2 relative to the loss values of pseudo labels and vanilla labels. We see that when both the vanilla loss and the pseudo loss are very small, the weights are around 0.5. When the vanilla loss is much smaller than the pseudo loss, the weights tend to be small; when the pseudo loss is much smaller than the vanilla loss, the weights tend to be large.
The observations above confirm the strong capability of MetaASSIST in learning proper slot-wise (and instance-wise) weights based on loss values.

Scheme S1 with Prior Knowledge
Given that the weighting parameters in scheme S1 are only slot-wise, we can readily initialize these parameters with any specified value. This implies that we can integrate prior knowledge into scheme S1. Specifically, we study using the optimal value of α found in ASSIST to initialize its weighting parameters. The results on MultiWOZ 2.4 are shown in Figure 4. We observe that the prior knowledge effectively improves the performance of scheme S1 for all three primary models. Furthermore, the results demonstrate that scheme S1 can outperform ASSIST when they use the same prior knowledge.

Performance over Training Epochs
Figure 5 depicts the curves of validation accuracy and test accuracy over training epochs. We utilize SOM-DST as the primary model and conduct this experiment on MultiWOZ 2.4. For MetaASSIST, scheme S2 is applied. It is shown that the validation and test accuracy improve much faster with MetaASSIST than with ASSIST during the early training epochs. In subsequent epochs, the validation accuracy with MetaASSIST is also higher and changes more smoothly.
Related Work

Although these neural DST models have demonstrated good performance, they fail to consider the effect of label noise. It has been shown that no matter how noisy the training data are, neural models can easily overfit them (Zhang et al., 2021). As a result, the generalization performance of models trained on noisy data is usually unsatisfactory. Recently, Ye et al. (2022) proposed a general framework, ASSIST, to robustly train DST models on noisy data. Their experimental results show that several existing DST models achieve much higher performance when trained under this framework. However, as discussed earlier, ASSIST contains a parameter that needs to be tuned on each dataset. Besides, the parameter is shared among all slots and all training samples, which we have shown to be suboptimal. Our proposed framework leverages meta learning to automatically learn slot-wise (and instance-wise) parameters, overcoming the limitations of ASSIST.
The essence of meta learning is learning to learn (Hospedales et al., 2021; Zhu et al., 2022), which makes it a natural fit for our task of automatically learning parameters. Several existing works (Huang et al., 2020; Zeng et al., 2021; Dingliwal et al., 2021) have already applied meta learning to DST. These works directly adopt the MAML (Finn et al., 2017) algorithm or its variants and focus on improving the few-shot learning ability of DST models, whereas our focus is to improve the robustness of DST models.

Conclusion
In this work, we proposed MetaASSIST, a meta learning-based general framework for robustly training DST models on noisy data. MetaASSIST improves on ASSIST by automatically learning slot-wise (and instance-wise) weighting parameters that are used to combine pseudo labels and vanilla labels. Our comprehensive experiments demonstrate the effectiveness of MetaASSIST. For future work, we plan to extend the current framework to utilize pseudo labels generated by multiple auxiliary models.

Limitations
Our proposed framework MetaASSIST learns to weight pseudo labels and vanilla labels by minimizing the validation loss. Although it reduces the impact of label noise on model training, it runs the risk of biasing the trained model towards overfitting the validation data. One may argue that selecting the best model checkpoint based on validation performance is a standard strategy in machine learning. However, our empirical study shows that high performance on the validation set does not necessarily lead to high performance on the test set. This is because the validation set and test set are usually small, and their empirical data distributions can differ considerably. For a model trained with our framework to generalize well, the validation set should be unbiased, but this requirement seems very demanding. In practice, we can augment the validation set to alleviate this problem.
Another limitation is that our proposed learning algorithm is more time-consuming than regular model training. As described in Algorithm 1, for each training batch, the model needs to perform two forward and backward passes (the first pass to obtain the interim model, the second pass to obtain the updated model). For each validation (meta) batch, the model also needs to perform one forward and backward pass. Therefore, the learning algorithm needs roughly 3× the training time of regular training. Nonetheless, compared to ASSIST, the proposed framework MetaASSIST is more time-efficient overall, since ASSIST requires trying a large number of values for α to find the best one.

Ethics Statement
The DST module is an essential component of many industrial and commercial dialogue systems. Performance improvements on DST can help these systems better understand users' requirements, thereby improving user satisfaction. Our proposed framework could be applied to these systems to improve their DST performance. The proposed framework can also be applied to other NLP and machine learning applications.

A Implementation Details
The MultiWOZ dataset contains seven domains: attraction, hotel, restaurant, taxi, train, hospital and police. However, the hospital and police domains only occur in the training set. Following previous works (Wu et al., 2019; Kim et al., 2020), we remove these two domains. This results in five domains with 30 slots in total.
For a fair comparison with ASSIST, we directly employ the pseudo labels published by the authors instead of training a new auxiliary model ourselves to generate pseudo labels. For all primary models, we modify their released code to implement our learning algorithm. All the primary models adopt BERT (Devlin et al., 2019) as the dialogue context encoder and are initialized with the pretrained BERT-base-uncased model. As for the MLP network in schemes S2 and S3, we set the hidden layer dimension to 768; the output layer dimension is fixed at 1. The MLP network is randomly initialized. For scheme S1, we initialize the weighting parameters to 0.5. For all primary models, we adopt their default hyperparameter settings, except for the number of training epochs. For SOM-DST, we halve the batch size due to its high GPU memory requirement. We fix the validation (meta) batch size at 8 for all three primary models. AdamW (Loshchilov and Hutter, 2017) is employed as the optimizer, and a linear scheduler with warmup is used to adjust the learning rate dynamically. The warmup proportion is fixed at 0.1. Tables 4 and 5 summarize the training epochs and peak validation (meta) learning rates for each scheme and each model.
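The warmup schedule mentioned above can be sketched as a simple function of the step index. The exact shape is an assumption on our part (the authors use a standard linear-warmup scheduler from their released code); with a warmup proportion of 0.1, the rate ramps up over the first 10% of steps and then decays linearly to zero.

```python
# Sketch of a linear-warmup, linear-decay learning-rate schedule
# (assumed shape; illustrative only).

def warmup_linear_lr(step, total_steps, peak_lr, warmup_prop=0.1):
    warmup_steps = int(total_steps * warmup_prop)
    if step < warmup_steps:
        # ramp up linearly from 0 to peak_lr
        return peak_lr * step / max(1, warmup_steps)
    # decay linearly from peak_lr back to 0
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

total = 100
assert warmup_linear_lr(0, total, 1e-4) == 0.0
assert warmup_linear_lr(10, total, 1e-4) == 1e-4   # peak at the end of warmup
assert warmup_linear_lr(100, total, 1e-4) == 0.0   # fully decayed
```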

B Convergence Analysis
Considering that our proposed learning algorithm optimizes the primary model and the learnable functions alternately, it is meaningful to study its convergence. To this end, we plot the loss value curves of the training batch and validation (meta) batch over training steps. We adopt AUX-DST as the primary model and apply scheme S1 to learn the weighting parameters. We conduct this experiment on MultiWOZ 2.4. The results are illustrated in Figure 6. As can be observed, the training loss and validation (meta) loss both converge to relatively small values after sufficient training steps.

C Error Analysis
We further investigate the error rate with respect to each slot. We adopt SOM-DST as the primary model and compare scheme S2 to ASSIST with the best value of α (α = 0.4) and ASSIST without the pseudo labels (α = 0.0). We conduct the experiment on MultiWOZ 2.4 as well, and the results are illustrated in Figure 7. It is shown that MetaASSIST achieves lower error rates on 28 slots compared to ASSIST (α = 0.0). MetaASSIST also outperforms ASSIST (α = 0.4) on 18 of the 30 slots. These results verify again the superiority of our proposed framework MetaASSIST.

Figure 1: The structure of ASSIST and MetaASSIST. Both frameworks utilize soft labels obtained by linearly combining pseudo labels (one-hot) and vanilla labels (one-hot) using a weighting parameter α to enhance the training process, compared to standard training that relies only on vanilla noisy labels. ASSIST adopts a single α shared by all slots and all training samples, while MetaASSIST uses slot-wise (and instance-wise) αs.


Figure 3: The distribution of weights relative to loss values.

Figure 6: Training and validation (meta) loss curves over training steps.

Figure 7: The error rate of each slot on MultiWOZ 2.4. SOM-DST is employed as the primary model.

Table 4: Number of maximum training epochs and peak validation (meta) learning rate on MultiWOZ 2.4.