ML-LMCL: Mutual Learning and Large-Margin Contrastive Learning for Improving ASR Robustness in Spoken Language Understanding

Spoken language understanding (SLU) is a fundamental task in task-oriented dialogue systems. However, inevitable errors from automatic speech recognition (ASR) usually impair understanding performance and lead to error propagation. Although there have been attempts to address this problem through contrastive learning, they (1) treat clean manual transcripts and ASR transcripts equally, without discrimination, during fine-tuning; (2) neglect the fact that semantically similar pairs are still pushed away when contrastive learning is applied; (3) suffer from the problem of Kullback-Leibler (KL) vanishing. In this paper, we propose Mutual Learning and Large-Margin Contrastive Learning (ML-LMCL), a novel framework for improving ASR robustness in SLU. Specifically, during fine-tuning we apply mutual learning, training two SLU models on the manual transcripts and the ASR transcripts respectively, so that the two models iteratively share knowledge. We also introduce a distance polarization regularizer to avoid pushing intra-cluster pairs away as much as possible. Moreover, we use a cyclical annealing schedule to mitigate the KL vanishing issue. Experiments on three datasets show that ML-LMCL outperforms existing models and achieves new state-of-the-art performance.


Introduction
Spoken language understanding (SLU) is an important component of various personal assistants, such as Amazon's Alexa, Apple's Siri, Microsoft's Cortana and Google's Assistant (Young et al., 2013). SLU aims at taking human speech input and extracting semantic information for two typical subtasks, intent detection and slot filling (Tur and De Mori, 2011). Pipeline approaches and end-to-end approaches are the two main kinds of solutions for SLU. Pipeline SLU methods usually combine automatic speech recognition (ASR) and natural language understanding (NLU) in a cascaded manner, so they can easily leverage external datasets and external pre-trained language models. However, error propagation is a common problem of pipeline approaches, where an inaccurate ASR output can lead to a series of errors in downstream subtasks. As shown in Figure 1, due to an error from ASR, the model cannot predict the intent correctly. Following Chang and Chen (2022), this paper focuses only on intent detection. Learning error-robust representations is an effective way to mitigate the negative impact of ASR errors and is gaining increasing attention. Remedies for ASR errors can be broadly categorized into two types: (1) applying machine translation to translate erroneous ASR transcripts into clean manual transcripts (Mani et al., 2020; Wang et al., 2020; Dutta et al., 2022); (2) using masked language modeling to adapt the model. However, these methods usually require additional speech-related inputs (Huang and Chen, 2019; Sergio et al., 2020; Wang et al., 2022), which may not always be readily available. Therefore, this paper focuses on improving ASR robustness in SLU without using any speech-related input features.
Although existing error-robust SLU models have achieved promising progress, we find that they suffer from three main issues: (1) Manual and ASR transcripts are treated as the same type. In fine-tuning, existing methods simply combine manual and ASR transcripts into the final dataset, which limits performance. Intuitively, the information from manual transcripts and the information from ASR transcripts play different roles, so a model fine-tuned on their combination cannot discriminate their specific contributions. Based on our observations, models trained on clean manual transcripts usually have higher accuracy, while models trained on ASR transcripts are usually more robust to ASR errors. Therefore, manual and ASR transcripts should be treated differently to improve the performance of the model.
(2) Semantically similar pairs are still pushed away. Conventional contrastive learning indiscriminately enlarges the distances between all pairs of instances, potentially leading to ambiguous intra-cluster and inter-cluster distances (Mishchuk et al., 2017; Zhang et al., 2022), which is detrimental for SLU. Specifically, if clean manual transcripts are pushed away from their associated ASR transcripts while becoming closer to other sentences, the negative impact of ASR errors is further exacerbated.
(3) They suffer from the problem of KL vanishing. Inevitable label noise usually has a negative impact on the model (Li et al., 2022; Cheng et al., 2023b). Existing methods apply self-distillation to minimize the Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951) between the current prediction and the previous one in order to reduce label noise in the training set. However, we find these methods suffer from the KL vanishing issue, which has been observed in other tasks (Zhao et al., 2017). KL vanishing can adversely affect model training, so it is crucial to solve this problem to improve performance.
In this paper, we propose Mutual Learning and Large-Margin Contrastive Learning (ML-LMCL), a novel framework that tackles the above three issues. For the first issue, we propose a mutual learning paradigm. In fine-tuning, we train two SLU models on the manual and ASR transcripts, respectively. The two models are collaboratively trained as peers, with the aim of iteratively learning from and sharing knowledge with each other. Mutual learning allows effective dual knowledge transfer (Liao et al., 2020; Zhao et al., 2021; Zhu et al., 2021), which improves performance. For the second issue, our framework implements large-margin contrastive learning to distinguish between intra-cluster and inter-cluster pairs. Specifically, we apply a distance polarization regularizer that penalizes all pairwise distances within the margin region, which encourages polarized distances for similarity determination and obtains a large margin in the distance space in an unsupervised way. For the third issue, following Fu et al. (2019), we mitigate KL vanishing by adopting a cyclical annealing schedule: the training process is split into many cycles, and in each cycle the coefficient of the KL divergence progressively increases from 0 to 1 over some iterations and then stays at 1 for the remaining iterations. Experimental results on three datasets, SLURP, ATIS and TREC6 (Bastianelli et al., 2020; Hemphill et al., 1990; Li and Roth, 2002; Chang and Chen, 2022), demonstrate that ML-LMCL significantly outperforms the previous best models, and model analysis further verifies its advantages.
The contributions of our work are four-fold: (1) we propose a mutual learning paradigm that trains two SLU models on manual and ASR transcripts respectively and lets them iteratively share knowledge; (2) we introduce a distance polarization regularizer to avoid pushing semantically similar pairs away; (3) we adopt a cyclical annealing schedule to mitigate the KL vanishing issue; (4) experiments on three datasets show that ML-LMCL achieves new state-of-the-art performance.


Self-supervised Contrastive Pre-training

Motivated by the success of pre-trained models (Liu et al., 2022b; Zhang et al., 2023a; Cheng et al., 2023a; Zhang et al., 2023b; Yang et al., 2023a), we continually train a pre-trained RoBERTa (Liu et al., 2019) on a spoken language corpus. Given a mini-batch of N pairs of transcripts B = {(x^p_i, x^q_i)}_{i=1..N}, where x^p_i denotes a clean manual transcript and x^q_i denotes its associated ASR transcript, we first apply the pre-trained RoBERTa and use the last-layer [CLS] representation to obtain h^p_i for x^p_i and h^q_i for x^q_i, as shown in Figure 2. We then apply a self-supervised contrastive loss L_sc (Chen et al., 2020a; Gao et al., 2021) to adjust the sentence representations:

L_sc = - Σ_{(h_i, h_j) ∈ P} log [ exp(s(h_i, h_j) / τ_sc) / Σ_{h_k ≠ h_i} exp(s(h_i, h_k) / τ_sc) ],

where P consists of the 2N positive pairs of the form (h^p_i, h^q_i) or (h^q_i, h^p_i), τ_sc is a temperature hyper-parameter, and s(·, ·) denotes the cosine similarity function. The numerator brings a clean manual transcript and its associated ASR transcript (positive example) close together, while the denominator pushes irrelevant transcripts (negative examples) far apart to promote uniformity in the representation space (Wang and Isola, 2020). Note that for a given transcript, its negative examples may be either clean manual transcripts or ASR transcripts; in Figure 2, for example, "recap my day" is a clean manual transcript and "chicken tikka recipe" is an ASR transcript.
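The self-supervised contrastive loss described above can be sketched in a few lines of NumPy. This is a minimal illustrative sketch, not the paper's implementation; the function name, the batch layout, and the in-batch negative handling are our assumptions.

```python
import numpy as np

def self_supervised_contrastive_loss(h_p, h_q, tau=0.2):
    """InfoNCE over N (manual, ASR) pairs: each transcript's positive is its
    paired counterpart; the other 2N - 2 transcripts in the batch act as negatives."""
    h = np.concatenate([h_p, h_q], axis=0)                  # (2N, d)
    h = h / np.linalg.norm(h, axis=1, keepdims=True)        # unit-normalize rows
    sim = h @ h.T / tau                                     # scaled cosine similarities
    np.fill_diagonal(sim, -np.inf)                          # exclude self-similarity
    n = len(h_p)
    pos = np.concatenate([np.arange(n) + n, np.arange(n)])  # i <-> i + n are positives
    log_denom = np.log(np.exp(sim).sum(axis=1))             # log of each row's denominator
    return float(np.mean(log_denom - sim[np.arange(2 * n), pos]))
```

The loss is small when each transcript is closer to its paired counterpart than to the other in-batch transcripts.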
However, conventional contrastive learning has the problem that semantically similar pairs are still pushed away (Chen et al., 2021). It indiscriminately enlarges the distances between all pairs of instances and may fail to distinguish intra-cluster from inter-cluster pairs correctly, causing some similar instance pairs to still be pushed away. Moreover, it may wrongly discard some negative pairs and regard them as semantically similar, even though its learning objective treats each pair of original instances as dissimilar. These problems mean that the distance between a clean manual transcript and its associated ASR transcript is not significantly smaller than the distance between unpaired instances, which is detrimental to improving ASR robustness. Motivated by Chen et al. (2021), we introduce a distance polarization regularizer to build a large-margin contrastive learning model. For simplicity, we define the normalized pairwise distance

D_ij = (1 - s(h_i, h_j)) / 2,

which measures the dissimilarity between a pair (h_i, h_j) ∈ B with a real value in [0, 1]. We collect these distances into a matrix D ∈ [0, 1]^{M×M}, where M = 2N denotes the total number of transcripts in B.
The matrix D consists of the distances D_ij, and we suppose there exist thresholds 0 < δ+ < δ− < 1 such that the intra-class distances are smaller than δ+ while the inter-class distances are larger than δ−. The proposed distance polarization regularizer L_reg is:

L_reg = || max(0, (D - δ+) ⊙ (δ− - D)) ||_1,

where δ+ and δ− are the threshold parameters, ⊙ denotes the element-wise product (with the thresholds broadcast over D), and || · ||_1 denotes the ℓ1-norm. The region (δ+, δ−) ⊆ [0, 1] can be regarded as a large margin for discriminating the similarity of data pairs. L_reg encourages a sparse distance distribution within the margin region (δ+, δ−), because any distance D_ij falling into (δ+, δ−) increases L_reg. Minimizing L_reg therefore pushes pairwise distances out of the margin region, so that each data pair is adaptively separated as either similar or dissimilar. As a result, by introducing the regularizer, our framework can better distinguish between intra-cluster and inter-cluster pairs.
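A minimal NumPy sketch may help make the margin mechanics concrete. The function name is our own, and the product of two ReLU-style clamps is one way to realize the penalty: each pairwise distance contributes only when it falls strictly inside the margin (δ+, δ−).

```python
import numpy as np

def distance_polarization_regularizer(h, delta_plus=0.2, delta_minus=0.5):
    """Penalize pairwise distances that fall inside the margin (delta+, delta-).

    h: (M, d) representations. D_ij = (1 - cos(h_i, h_j)) / 2 lies in [0, 1].
    The product of the two clamped terms is positive only for distances strictly
    inside the margin, so minimizing the l1 penalty empties the margin region.
    """
    h = h / np.linalg.norm(h, axis=1, keepdims=True)
    D = (1.0 - h @ h.T) / 2.0                          # normalized cosine distances
    inside = np.maximum(0.0, D - delta_plus) * np.maximum(0.0, delta_minus - D)
    return float(inside.sum())                         # l1-norm of a non-negative matrix
```

For instance, two unit vectors with cosine similarity 0.4 have distance 0.3, which lies inside the default margin and is penalized, while identical vectors (distance 0) contribute nothing.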
The final large-margin self-supervised contrastive learning loss L^reg_sc is the weighted sum of the self-supervised contrastive loss L_sc and the regularizer L_reg:

L^reg_sc = L_sc + λ_reg · L_reg,

where λ_reg is a hyper-parameter.

Mutual Learning
Previous work reveals that mutual learning can exploit the mutual guidance between two models to improve their performance simultaneously (Nie et al., 2018; Hong et al., 2021). With mutual learning, we can obtain compact networks that perform better than those distilled from a strong but static teacher. In fine-tuning, we use the same pre-trained model as in Sec. 2.1 to train two networks, one on the manual transcripts and one on the ASR transcripts. For a manual transcript x^p_i and its associated ASR transcript x^q_i, the output probabilities p^t_{i,p} and p^t_{i,q} at the t-th epoch are:

p^t_{i,p} = M_clean(x^p_i),   p^t_{i,q} = M_asr(x^q_i),

where M_clean denotes the model trained on clean manual transcripts and M_asr denotes the model trained on ASR transcripts. We adopt the Jensen-Shannon (JS) divergence as the mimicry loss, which effectively encourages the two models to mimic each other. The mutual learning loss L_mut in Figure 3 is:

L_mut = Σ_{i=1}^{N} JS(p^t_{i,p} || p^t_{i,q}).
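The mimicry loss can be sketched as follows. This is an illustrative implementation under our own assumptions (probability vectors as inputs, a mean over the batch), not the authors' code.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log((a + eps) / (b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def mutual_learning_loss(probs_clean, probs_asr):
    """Symmetric mimicry loss between the two peers' predicted distributions:
    probs_clean[i] from M_clean on x^p_i, probs_asr[i] from M_asr on x^q_i."""
    return float(np.mean([js_divergence(p, q) for p, q in zip(probs_clean, probs_asr)]))
```

Unlike a one-way KL term, the JS divergence is symmetric, so neither model is treated as a fixed teacher.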

Supervised Contrastive Learning
We also apply supervised contrastive learning in fine-tuning by using label information. Pairs with the same label are regarded as positive samples and pairs with different labels as negative samples; the embeddings of positive samples are pulled closer while those of negative samples are pushed away (Jian et al., 2022; Zhou et al., 2022). We utilize the supervised contrastive loss L^p_c for the model trained on manual transcripts and L^q_c for the model trained on ASR transcripts to encourage the learned representations to be aligned with their labels:

L^p_c = - Σ_i (1 / |P(i)|) Σ_{j ∈ P(i)} log [ exp(s(h^p_i, h^p_j) / τ_c) / Σ_{k ≠ i} exp(s(h^p_i, h^p_k) / τ_c) ],

where P(i) = {j ≠ i : y^p_j = y^p_i} denotes the in-batch indices whose labels match that of h^p_i, τ_c is the temperature hyper-parameter, and L^q_c is defined analogously on the ASR-side representations h^q with labels y^q.
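A SupCon-style sketch of the supervised contrastive loss follows; the ASR-side loss is computed the same way on the other model's representations. The function signature and the averaging over positives are our assumptions, not the paper's exact formulation.

```python
import numpy as np

def supervised_contrastive_loss(h, labels, tau=0.2):
    """Supervised contrastive loss: for each anchor, in-batch examples with
    the same label are positives; all other examples serve as negatives."""
    h = h / np.linalg.norm(h, axis=1, keepdims=True)
    sim = h @ h.T / tau
    np.fill_diagonal(sim, -np.inf)                    # an anchor is not its own positive
    losses = []
    for i in range(len(h)):
        pos = [j for j in range(len(h)) if j != i and labels[j] == labels[i]]
        if not pos:                                   # skip anchors with no positive
            continue
        log_denom = np.log(np.exp(sim[i]).sum())
        losses.append(np.mean([log_denom - sim[i, j] for j in pos]))
    return float(np.mean(losses))
```

When embeddings cluster by label, each positive term dominates its denominator and the loss is small; labels that cut across the clusters make it large.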
As in Sec. 2.1, we also use distance polarization regularizers L^p_reg and L^q_reg to enhance the generalization ability of the contrastive learning algorithm, applying the regularizer to D^p and D^q respectively, where D^p denotes the matrix of pairwise distances over the clean manual transcripts and D^q denotes the corresponding matrix over the ASR transcripts.
The large-margin supervised contrastive learning losses L^reg_c,p and L^reg_c,q in Figure 3 are:

L^reg_c,p = L^p_c + λ^p_reg · L^p_reg,   L^reg_c,q = L^q_c + λ^q_reg · L^q_reg,

where λ^p_reg and λ^q_reg are two hyper-parameters. The final large-margin supervised contrastive learning loss L^reg_c is:

L^reg_c = L^reg_c,p + L^reg_c,q.

Self-distillation
To further reduce the impact of ASR errors, we apply a self-distillation method, regularizing the model by minimizing the Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951; He et al., 2022) between the current prediction and the previous one (Liu et al., 2020, 2021). For the manual transcript x^p_i and its corresponding label y^p_i, p^t_{i,p} = P(y^p_i | x^p_i, t) denotes the probability distribution of x^p_i at the t-th epoch, and p^t_{i,q} = P(y^q_i | x^q_i, t) denotes that of x^q_i. The self-distillation losses L^p_d and L^q_d in Figure 3 are formulated as:

L^p_d = Σ_{i=1}^{N} KL(p^{t-1}_{i,p} || p^t_{i,p}),   L^q_d = Σ_{i=1}^{N} KL(p^{t-1}_{i,q} || p^t_{i,q}),

where both distributions are smoothed by the temperature τ_d. Note that p^0_{i,p} is the one-hot vector of label y^p_i and p^0_{i,q} is that of label y^q_i. The final self-distillation loss L_d is the sum of the two loss functions:

L_d = L^p_d + L^q_d.
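A per-example sketch of the self-distillation term, under our assumption that the temperature τ_d is applied to the logits before the softmax (a common convention; the paper's exact smoothing is not shown in the extracted text):

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

def self_distillation_loss(prev_logits, curr_logits, tau=5.0, eps=1e-12):
    """KL(previous || current) between temperature-smoothed predictions.

    The previous epoch's prediction acts as a frozen teacher; in the paper,
    the epoch-0 teacher distribution is the one-hot label vector instead.
    """
    p = softmax(prev_logits / tau)   # teacher: previous-epoch prediction
    q = softmax(curr_logits / tau)   # student: current-epoch prediction
    return float(np.sum(p * np.log((p + eps) / (q + eps))))
```

The loss is zero when the current prediction matches the previous one and grows as the model drifts away from its earlier (smoothed) output.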

Training Objective
Pre-training. Following Chang and Chen (2022), the pre-training loss L_pt is the weighted sum of the large-margin self-supervised contrastive learning loss L^reg_sc and an MLM loss L_mlm:

L_pt = L_mlm + λ_pt · L^reg_sc,

where λ_pt is the coefficient balancing the two tasks.
Fine-tuning. The final fine-tuning loss L_ft is the weighted sum of the cross-entropy loss L_ce, the mutual learning loss L_mut, the large-margin supervised contrastive learning loss L^reg_c and the self-distillation loss L_d:

L_ft = L_ce + α · L_mut + β · L^reg_c + γ · L_d,

where α, β, γ are trade-off hyper-parameters. However, directly using the KL divergence for the self-distillation loss may suffer from the vanishing issue. To mitigate KL vanishing, we adopt a cyclical annealing schedule, which is also applied for this purpose in Fu et al. (2019); Zhao et al. (2021). Concretely, γ changes periodically during training:

γ = min(1, (t mod G) / (R · G)),

where t represents the current training iteration and R and G are two hyper-parameters (G is the cycle length and R controls the fraction of each cycle spent increasing γ from 0 to 1).
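The schedule can be sketched as below. The exact parameterization of the original schedule equation is not preserved in the extracted text, so this linear ramp, modeled on Fu et al. (2019), is an assumption: γ rises from 0 to 1 over the first R fraction of each G-iteration cycle, then stays at 1.

```python
def cyclical_gamma(t, G=5000, R=0.5):
    """Cyclical annealing coefficient for the KL (self-distillation) term.

    t: current training iteration; G: cycle length in iterations;
    R: fraction of each cycle spent ramping gamma linearly from 0 up to 1.
    """
    progress = (t % G) / G            # position within the current cycle, in [0, 1)
    return min(1.0, progress / R)     # linear ramp, then clamp at 1
```

With the paper's reported settings (G = 5000, R = 0.5), γ ramps up over the first 2500 iterations of each cycle and resets to 0 when a new cycle begins.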

Datasets and Metrics
Following Chang and Chen (2022), we conduct experiments on three publicly available benchmark datasets: SLURP, ATIS and TREC6 (Bastianelli et al., 2020; Hemphill et al., 1990; Li and Roth, 2002; Chang and Chen, 2022). The statistics of the three datasets are shown in Table 1.
SLURP is a challenging SLU dataset with various domains, speakers, and recording settings. An intent in SLURP is a (scenario, action) pair; joint accuracy is used as the evaluation metric, and a prediction is considered correct only when both the scenario and the action are correctly predicted. The ASR transcripts are obtained with the Google Web API.
ATIS and TREC6 are two SLU datasets for flight reservation and question classification, respectively. We use the synthesized text released by Phoneme-BERT (Sundararaman et al., 2021), where the data is synthesized by a text-to-speech (TTS) model and then transcribed by ASR. We adopt accuracy as the evaluation metric for intent detection.

Implementation Details
We pre-train the model for 10K steps with batch size 128 on each dataset, and fine-tune the whole model for up to 10 epochs with batch size 256 to avoid overfitting. Training early-stops if the loss on the dev set does not decrease for 3 epochs. On SLURP, two separate classification heads are trained for scenario and action over the shared encoder representations. The mask ratio of MLM is set to 0.15, τ_sc to 0.2, δ+ to 0.2, δ− to 0.5, λ_reg to 0.1, τ_c to 0.2, λ^p_reg to 0.15, λ^q_reg to 0.15, τ_d to 5, λ_pt to 0.5, α to 1, β to 0.1, R to 0.5, and G to 5000. Reported scores are averaged over 5 runs. During both pre-training and fine-tuning, we use the Adam optimizer (Kingma and Ba, 2015) with β1 = 0.9, β2 = 0.98, and 4k warm-up updates. The training process lasts a few hours. All experiments are conducted on an Nvidia Tesla A100 GPU.

Baselines
We compare our model with the following baselines: (1) RoBERTa (Liu et al., 2019): a RoBERTa-base model directly fine-tuned on the target training data; (2) Phoneme-BERT (Sundararaman et al., 2021): a RoBERTa-base model further pre-trained on an additional corpus with phoneme information and then fine-tuned on the target training data; (3) SimCSE (Gao et al., 2021): a state-of-the-art sentence embedding method based on contrastive learning; (4) SpokenCSE (Chang and Chen, 2022): a strong baseline for improving ASR robustness in the SLU task.

Main Results
The performance comparison of ML-LMCL and the baselines is shown in Table 2, from which we make the following observations: (1) ML-LMCL gains consistent improvements on all tasks and datasets. This is because our model achieves mutual guidance between the models trained on the manual and ASR transcripts, allowing the two models to share knowledge with each other. Moreover, large-margin contrastive learning encourages the model to distinguish more accurately between intra-cluster and inter-cluster pairs, avoiding pushing away semantically similar pairs as much as possible. Finally, the cyclical annealing schedule mitigates KL vanishing, which improves the robustness of the model. Even without manual transcripts, ML-LMCL still surpasses SpokenCSE, which further demonstrates the effectiveness of large-margin contrastive learning and the cyclical annealing schedule for improving ASR robustness in SLU.
(2) The improvement on the SLURP dataset is more significant. We believe this is because SLURP is a more challenging SLU dataset than ATIS and TREC6: an intent in SLURP is a (scenario, action) pair, and a prediction is considered correct only if both the scenario and the action are correctly predicted. Due to the shortcomings of conventional contrastive learning, previous work fails to align ASR transcripts with their associated manual transcripts with high accuracy, so under ASR errors it is common for one of the two components of an intent to be incorrectly predicted. ML-LMCL is dedicated to overcoming these shortcomings, resulting in better alignment and improved performance.

Analysis
To verify the advantages of ML-LMCL from different perspectives, we use clean manual transcripts and conduct a set of ablation experiments; the results are shown in Table 3.

Table 3: Results of the ablation experiments when using clean manual transcripts.

Effectiveness of Mutual Learning
One of the core contributions of ML-LMCL is mutual learning, which allows the two models trained on manual and ASR transcripts to learn from each other. To verify its effectiveness, we remove the mutual learning loss and denote this variant as w/o L_mut in Table 3. We observe that accuracy drops by 0.48, 0.38 and 0.44 on SLURP, ATIS and TREC6, respectively. Contrastive learning benefits from larger batch sizes because they provide more negative examples to facilitate convergence (Chen et al., 2020a), and many attempts have been made to improve contrastive learning by indirectly increasing the batch size (He et al., 2020; Chen et al., 2020b). Therefore, to verify that the gains come from the proposed mutual learning rather than an indirectly boosted batch size, we double the batch size after removing the mutual learning loss and denote this variant as w/o L_mut + bsz↑. Despite the larger batch size, this variant still performs worse than ML-LMCL, which demonstrates that the improvements come from the proposed mutual learning rather than the boosted batch size.

Effectiveness of Distance Polarization Regularizer
To verify the effectiveness of the distance polarization regularizer, we remove it in pre-training and in fine-tuning, denoted as w/o L_reg and w/o L^p_reg & L^q_reg, respectively. When L_reg is removed, accuracy drops by 0.24, 0.23 and 0.19 on SLURP, ATIS and TREC6, respectively; when L^p_reg and L^q_reg are removed, accuracy drops by 0.41, 0.29 and 0.22. These results demonstrate that the distance polarization regularizer alleviates the negative impact of conventional contrastive learning. Furthermore, the accuracy drop is greater in fine-tuning than in pre-training. We believe this is because supervised contrastive learning in fine-tuning is more easily affected by label noise than unsupervised contrastive learning in pre-training; as a result, more semantically similar pairs are incorrectly pushed away in fine-tuning when the regularizer is removed.
Chang and Chen (2022) also propose a self-distilled soft contrastive learning loss to relieve the negative effect of noisy labels in supervised contrastive learning. However, we believe the regularizer can also effectively reduce the impact of label noise, so ML-LMCL does not include a separate module for this problem. To verify this, we augment ML-LMCL with the self-distilled soft contrastive learning loss, termed w/ L_soft. We observe that L_soft not only brings no improvement but even causes performance drops, which confirms that the distance polarization regularizer can indeed reduce the impact of label noise.

Effectiveness of Cyclical Annealing Schedule
We also remove the cyclical annealing schedule, denoted as w/o cyc. Accuracy drops by 0.18, 0.13 and 0.11 on SLURP, ATIS and TREC6, respectively, which demonstrates that the cyclical annealing schedule also plays an important role in enhancing performance by mitigating KL vanishing.

Visualization
To better understand how mutual learning and large-margin contrastive learning affect and contribute to the final result, we visualize an example from the SLURP dataset in Figure 4 (Abdi and Williams, 2010).
A circle and a square in the same color indicate that the corresponding manual and ASR transcripts are associated.
Compared with the baselines, ML-LMCL yields a clearly smaller intra-cluster distance, which further demonstrates that our method can align an ASR transcript and its associated manual transcript with high accuracy and better avoid pushing semantically similar pairs away.

Related work
Error-robust Spoken Language Understanding. SLU usually suffers from ASR error propagation, and this paper focuses on improving ASR robustness in SLU. Chang and Chen (2022) make the first attempt to use contrastive learning to improve ASR robustness with only textual information. Following Chang and Chen (2022), this paper focuses only on intent detection in SLU, which is usually formulated as an utterance classification problem. As a large number of pre-trained models achieve surprising results across various tasks (Dong et al., 2022; Yang et al., 2023c; Zhu et al., 2023; Yang et al., 2023b), some BERT-based (Devlin et al., 2019) pre-trained work has been explored in SLU, where the representation of the special token [CLS] is used for intent detection.
In our work, we adopt RoBERTa and try to learn the invariant representations between clean manual transcripts and erroneous ASR transcripts.
Mutual Learning. Our method is motivated by the recent success of mutual learning, an effective method that trains two models of the same architecture simultaneously but with different initializations and encourages them to learn collaboratively from each other. Unlike knowledge distillation (Hinton et al., 2015), mutual learning does not need a powerful teacher network, which is not always available. Mutual learning was first proposed to leverage information from multiple models and allow effective dual knowledge transfer in image processing tasks (Zhang et al., 2018; Zhao et al., 2021).

Contrastive Learning. Contrastive learning aims at learning example representations by minimizing the distance between positive pairs in the vector space and maximizing the distance between negative pairs (Saunshi et al., 2019; Liang et al., 2022; Liu et al., 2022a); it was first proposed in the field of computer vision (Chopra et al., 2005; Schroff et al., 2015; Sohn, 2016; Chen et al., 2020a; Wang and Liu, 2021). In NLP, contrastive learning has been applied to sentence embeddings (Giorgi et al., 2021; Yan et al., 2021), translation (Pan et al., 2021; Ye et al., 2022) and summarization (Wang et al., 2021; Cao and Wang, 2021). Recently, Chen et al. (2021) pointed out that conventional contrastive learning algorithms are still not good enough, since they fail to maintain a large margin in the distance space for reliable instance discrimination. Inspired by this, we add a distance polarization regularizer similar to that of Chen et al. (2021) to address this issue. To the best of our knowledge, we are the first to introduce the idea of large-margin contrastive learning to the SLU task.

Conclusion
In this paper, we propose ML-LMCL, a novel framework for improving ASR robustness in SLU. We apply mutual learning and introduce a distance polarization regularizer; moreover, a cyclical annealing schedule is utilized to mitigate KL vanishing. Experiments and analysis on three benchmark datasets show that our model significantly outperforms previous models whether or not clean manual transcripts are available in fine-tuning. Future work will focus on improving ASR robustness with only clean manual transcripts.

Limitations
By applying mutual learning, introducing a distance polarization regularizer and utilizing a cyclical annealing schedule, ML-LMCL achieves significant improvements on three benchmark datasets. Nevertheless, we summarize two limitations for further discussion and investigation by other researchers: (1) ML-LMCL still requires ASR transcripts in fine-tuning to align with the target inference scenario. However, ASR transcripts may not always be readily available due to the constraints of ASR systems and privacy concerns. In future work, we will attempt to further improve ASR robustness without using any ASR transcripts.
(2) The training and inference runtime of ML-LMCL is larger than that of the baselines. We attribute the extra cost to the fact that ML-LMCL has more parameters than the baselines. In future work, we plan to design a new paradigm with fewer parameters to reduce the requirement for GPU resources.


Ethics Statement

This paper does not involve any data collection or release, so there are no privacy issues. All the datasets used in this paper are publicly available and widely adopted by researchers to test the performance of SLU models.

Figure 1: An example of the intent being predicted incorrectly due to an ASR error.

Figure 2: Illustration of the pre-training stage. We apply large-margin self-supervised contrastive learning with paired transcripts. A positive pair consists of a clean transcript and the associated ASR transcript.

Figure 3: Illustration of the fine-tuning stage. Two networks, on the clean manual transcripts and the ASR transcripts, are collaboratively trained via mutual learning (§2.2). Large-margin supervised contrastive learning (§2.3) and self-distillation (§2.4) are applied to further reduce the impact of ASR errors.

Fine-tuning

Following Haihong et al. (2019) and Chen et al. (2022), the intent detection objective is the standard cross-entropy loss L_ce between the predicted intent distribution and the gold intent label.


Table 1: The statistics of all datasets. The test set of SLURP is sub-sampled.

Table 2: Accuracy results on three datasets. † denotes that ML-LMCL obtains statistically significant improvements over baselines with p < 0.01. "w/o manual transcripts" denotes that clean manual transcripts are not used in fine-tuning, i.e., the loss functions associated with clean manual transcripts, including L^p_ce, L_mut, L^reg_c,p and L^p_d, are set to 0. "w/ manual transcripts" denotes that clean manual transcripts are used in fine-tuning.