The Interpreter Understands Your Meaning: End-to-end Spoken Language Understanding Aided by Speech Translation

End-to-end spoken language understanding (SLU) remains elusive even with current large pretrained language models on text and speech, especially in multilingual scenarios. Machine translation has been established as a powerful pretraining objective on text, as it enables the model to capture high-level semantics of the input utterance and associations between different languages, which is desirable for speech models that work on lower-level acoustic frames. Motivated particularly by the task of cross-lingual SLU, we demonstrate that speech translation (ST) is a good means of pretraining speech models for end-to-end SLU in both intra- and cross-lingual scenarios. By introducing ST, our models outperform baselines on monolingual and multilingual intent classification as well as spoken question answering, using the SLURP, MINDS-14, and NMSQA benchmarks. To verify the effectiveness of our methods, we also create new benchmark datasets from both synthetic and real sources, for speech summarization and low-resource/zero-shot transfer from English to French or Spanish. We further show the value of preserving knowledge from the ST pretraining task for better downstream performance, possibly using Bayesian transfer regularizers.


Introduction
Modern artificial intelligence is characterized by large pretrained language models (PTLMs) whose strong language capabilities can be adapted to various downstream tasks. The success of PTLMs rests on carefully designed pretraining tasks that bestow the desired capabilities on the model. Current PTLMs are mostly trained on self-supervised objectives, which started from masked language modelling (MLM) and next sentence prediction (NSP) in BERT (Devlin et al., 2019) but have since evolved into more difficult ones such as whole word masking (Cui et al., 2021), span masking (Joshi et al., 2020), text infilling, and token deletion (Lewis et al., 2020), while the rather simple NSP has been replaced by sentence permutation, document rotation (Lewis et al., 2020), and sentence order prediction (Lan et al., 2020). All these efforts introduce more challenges into the pretraining phase to mine stronger semantic supervision signals out of unlabelled data. This is particularly important for pretrained spoken language models like wav2vec2 (Baevski et al., 2020) and HuBERT (Hsu et al., 2021), which perform MLM on phonetic units from lower-level audio signals; such signals are less informative and require models to carry out additional labor on acoustics, so their high-level capacities are more restricted. This may explain why automatic speech recognition (ASR) models fine-tuned upon them with paired data still have a role in fully end-to-end (E2E) SLU systems, often as a pretrained feature extractor (Seo et al., 2022; Arora et al., 2022). Unlike cascaded SLU, in which ASR produces transcripts for further processing, in such E2E systems ASR as an auxiliary task or additional pretraining stage provides strong supervision that explicitly links audio signals to representations corresponding to the denser and semantically richer textual space, which is valuable for downstream understanding tasks.
On text, self-supervised objectives are rather effective thanks to enormous corpora with high information density, but supervised tasks are still used in many cases, machine translation (MT) being a frequent one. CoVe (McCann et al., 2017), a pioneer of the current PTLM paradigm, leverages the encoder of a pretrained sequence-to-sequence (seq2seq) MT model and achieved the then state-of-the-art on various downstream tasks. Belinkov et al. (2020) further validate the language capabilities of MT at the morphological, syntactic, and semantic levels, and T5 (Raffel et al., 2020) uses an ensemble of supervised tasks including MT. Also, by bridging different languages, MT enjoys advantages in cross-lingual generation and commonly serves as a pretraining or auxiliary task for cross-lingual text summarization (Zhu et al., 2019; Xu et al., 2020). Furthermore, when trained with inputs in multiple languages, the model encoder may align and push together representations of inputs in different languages with similar meanings so as to produce the same output in the target language, thanks to the guidance from paired data (Johnson et al., 2017; Schwenk and Douze, 2017). With this semantic-centric language agnosticity, such an encoder can be directly applied to another language in downstream tasks (Eriguchi et al., 2018). Following that, MT-based tasks prove effective and outperform multilingual self-supervised models on a wide range of downstream tasks such as cross-lingual natural language inference, question answering, and bi-text mining (Conneau and Lample, 2019; Artetxe and Schwenk, 2019; Chi et al., 2020; Kale et al., 2021).
Inspired by those works, we hypothesize that the counterpart of multilingual MT on speech, i.e., E2E speech translation (ST), which directly maps speech in various languages to text in other languages, will also be effective as a pretraining task for E2E SLU, for three critical advantages:
1. It requires high-level understanding of the utterance, as an interpreter must "understand" an utterance before interpreting it into a different language, unlike ASR, which transcribes speech verbatim, and MLM on phonetic units, which needs less semantic understanding.
2. It captures long-term dependencies and a global view of the full utterance, in contrast to ASR and MLM on text and speech, which can often be resolved with local contexts.
3. It enables better cross-lingual transfer compared with multilingual ASR models and self-supervised PTLMs, which lack supervision that promotes language agnosticity.
Therefore, in this paper we leverage speech translation as an extra pretraining task. As shown in Figure 1, the model is first pretrained on speech translation between English and French in both directions, then fine-tuned on downstream tasks with an additional classifier module, reusing the encoder. Furthermore, unlike knowledge for self-supervised objectives, which is loosely connected to target tasks, knowledge for performing generic tasks like ASR and ST, which link speech to semantically rich text, should be more useful. Therefore, we also attempt to preserve such knowledge by multi-task learning, as well as by regularizers derived from continual learning for parameter preservation, including L2-SP (Li et al., 2018b) and EWC (Kirkpatrick et al., 2017), which show benefits especially in low-resource cases.
We evaluate our methods on a variety of tasks for semantic understanding of speech, including existing benchmarks: SLURP (Bastianelli et al., 2020) and MINDS-14 (Gerz et al., 2021) for monolingual and multilingual intent classification (IC), plus NMSQA (Lin et al., 2022) for spoken question answering (SQA). We then create a French version of SLURP for cross-lingual transfer using both synthetic and real speech, and follow Huang et al. (2022) to create a dataset for summarizing spoken news into short headlines. On all the tasks, our approach outperforms the previous state-of-the-art and ASR pretraining, often by a large margin.
To summarize, our contributions are three-fold:
1. We show the effectiveness of speech translation for pretraining speech models in both intra-lingual and cross-lingual cases.
2. We demonstrate the application of Bayesian transfer learning regularizers for better preservation of knowledge from pretraining.
3. We build multiple datasets for evaluating spoken language understanding capabilities.

Methods
As shown in Figure 1, we first build a speech translator for English and French following Li et al. (2021), combining pretrained models on speech and text, though we fine-tune the whole model. Audio signals are fed into an acoustic encoding transformer initialized with the lower half of XLS-R (Babu et al., 2022), the multilingual wav2vec2, so that phonetic representations of the audio are extracted into a 320x-downsampled sequence. The upper 12 layers of XLS-R are discarded for computational efficiency, as they are more correlated with the pretraining task and their parameters are found less useful for downstream tasks (Zhu et al., 2022). The task then resembles machine translation from the phonetic embeddings to the output text, for which we leverage an MT model based on mBART (Liu et al., 2020). A 3-layer, 8x-downsampling CNN adaptor is inserted to match the sequence length to typical mBART inputs. A target embedding is then prepended to specify the target language or task, similar to the target token used in mBART. To promote language agnosticity, we do not indicate the source language. The model is then trained on translation data between French and English in both directions. We also compare with training on ASR, by setting the target language to be the same as the source language, and with joint training on ASR and ST simultaneously.
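The wiring described above can be sketched as follows. This is a hypothetical, dimension-reduced illustration: toy modules stand in for the pretrained XLS-R and mBART weights, and all dimensions are made up for demonstration.

```python
import torch
import torch.nn as nn

class CNNAdaptor(nn.Module):
    """Three stride-2 conv layers: 8x temporal downsampling in total."""
    def __init__(self, dim):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1)
            for _ in range(3))

    def forward(self, x):          # x: (batch, time, dim)
        x = x.transpose(1, 2)      # Conv1d expects (batch, dim, time)
        for conv in self.convs:
            x = torch.relu(conv(x))
        return x.transpose(1, 2)   # back to (batch, time/8, dim)

dim = 16  # toy dimension; the real models use much larger hidden sizes
# stand-in for the truncated XLS-R acoustic encoder
acoustic_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
adaptor = CNNAdaptor(dim)
target_embedding = nn.Embedding(2, dim)  # one id per target language/task

frames = torch.randn(1, 64, dim)           # phonetic-level features
h = adaptor(acoustic_encoder(frames))      # (1, 8, dim) after 8x downsampling
tgt = target_embedding(torch.tensor([[0]]))    # prepended target embedding
encoder_inputs = torch.cat([tgt, h], dim=1)    # (1, 9, dim), fed to the mBART encoder
print(encoder_inputs.shape)
```

Note that no source-language embedding is used, matching the language-agnostic design described above.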
After pretraining, a 3-layer transformer, also transferred from a PTLM, is stacked as the classifier, taking the encoder outputs as its inputs. For IC, we use layers 2-4 of XLM-R (Conneau et al., 2020), stacked with linear classifier heads upon the mean-pooled representations. For cross-lingual transfer, IC models trained on English are then applied to, or fine-tuned on, French data. For SQA, which extracts an answer span for the question from a long context article, we use Longformer (Beltagy et al., 2020), a PTLM specially designed for long inputs, given the extensive length of the context, as in Lin et al. (2022). Two linear classifiers are stacked upon each frame to predict the start and the end of the answer span, along with an answer-existence classifier upon the mean-pooled sequence to predict whether the answer exists in the provided context. We then concatenate the question audio with each sentence of the article as model inputs, and pick the predicted answer span from the sentence with the highest answer-existence likelihood. For these two tasks we discard the decoder, while for speech summarization we keep the seq2seq architecture to generate the textual summary, with an additional target embedding indicating the target task. The model is then trained on the target data. Furthermore, it has been found that promoting language agnosticity explicitly can help zero-shot transfer (Arivazhagan et al., 2019); hence we attempt to add language adversarial training on the encoder outputs during pretraining and fine-tuning, using a 2-layer language classifier with a gradient reversal layer to explicitly align the representations of different languages.
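A gradient reversal layer of the kind used for language adversarial training can be sketched as follows. This is a minimal illustration with toy dimensions: the language classifier is trained to identify the language, while the reversed gradient pushes the encoder toward language-agnostic representations.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None  # reversed gradient for the encoder

# toy mean-pooled encoder outputs for a batch of 4 utterances
encoder_out = torch.randn(4, 16, requires_grad=True)
# 2-layer language classifier over 2 languages (en/fr), as in the text
lang_clf = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 2))

logits = lang_clf(GradReverse.apply(encoder_out, 1.0))
loss = nn.functional.cross_entropy(logits, torch.tensor([0, 1, 0, 1]))
loss.backward()
# encoder_out.grad now points AWAY from minimizing the language loss,
# discouraging language-identifiable features in the encoder.
```

The classifier itself still learns to detect the language normally; only the gradient flowing into the encoder is reversed.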
To preserve the knowledge from pretraining, we start from a multi-task learning approach that jointly trains on the pretraining task (ST and/or ASR) and the target task, adaptively incorporating the knowledge for both types of tasks. Since this is computationally intensive and requires access to the pretraining data, we further explore Bayesian transfer learning approaches based on regularizers for continual learning, which limit the shift of model parameters based on the Laplace approximation of parameter distributions with a Gaussian prior (MacKay, 1992). In all our models, we determine the parameters by optimizing the loss function with L2 regularization, i.e., minimizing L(D; θ) + α‖θ‖₂² for the parameters θ ∈ R^N given data D = {(x, y)} and the hyperparameter α, in which the cross-entropy loss L corresponds to the negative log-likelihood −log p(y; θ) of the labels given the model outputs. This can be formulated as maximum a posteriori (MAP) estimation of θ by maximizing log p(θ; D), which equals log p(D; θ) + log p(θ) − log p(D) by Bayes' theorem. With a flat prior on D and a zero-mean diagonal Gaussian prior of uniform variance N(0, σ²) on θ, the optimization objective corresponds to

θ̂ = argmax_θ [ log p(D; θ) − (1 / (2σ²)) ‖θ‖₂² ].  (1)

In this way, optimizing the network with L2 regularization can be viewed as placing a flat zero-mean Gaussian prior on the model parameters that assigns higher probability to close-to-zero parameters, with a larger α corresponding to a smaller variance σ². Based on this idea, L2-SP (Li et al., 2018b) limits the parameter shift during fine-tuning by assigning a flat Gaussian prior centered at the parameters before fine-tuning, θ₀. With this N(θ₀, σ²) prior, better transfer learning performance has been observed.
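The L2-SP penalty described above, an L2 term centered at the pretrained parameters θ₀ rather than at zero, can be sketched as follows. This is a minimal sketch with a toy linear model; names and the drift simulation are illustrative only.

```python
import torch
import torch.nn as nn

def l2_sp_penalty(model, theta0, alpha):
    """alpha * ||theta - theta0||_2^2, i.e. an N(theta0, sigma^2) prior."""
    return alpha * sum(
        ((p - p0) ** 2).sum()
        for p, p0 in zip(model.parameters(), theta0))

model = nn.Linear(8, 2)
# snapshot of the parameters right after "pretraining"
theta0 = [p.detach().clone() for p in model.parameters()]

assert l2_sp_penalty(model, theta0, 5e-3).item() == 0.0  # no shift yet

with torch.no_grad():           # simulate fine-tuning drift of +0.1 everywhere
    for p in model.parameters():
        p.add_(0.1)
print(l2_sp_penalty(model, theta0, 5e-3).item())  # 18 params * 0.01 * 5e-3 = 9e-4
```

During fine-tuning this penalty is simply added to the task loss, pulling the parameters back toward their pretrained values.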
Nevertheless, this is an over-simplification of the prior, as parameters are not all equal: some are more critical to performance on the pretraining task than others. The importance of a parameter can be represented by the posterior distribution p(θ; D_p) near θ₀ on the pretraining data D_p, which is tied to the pretraining loss through L(D_p; θ) = −log p(D_p; θ). In this way, elastic weight consolidation (EWC) (Kirkpatrick et al., 2017) assigns a Gaussian prior N(θ₀, σ²) with diagonal covariance according to the estimated posterior distribution (i.e., the loss landscape) on the pretraining task. A parameter θ_i with larger impact on L(D_p; θ) will have a sharper p(θ_i; D_p) and smaller σ_i, and thus less flexibility during fine-tuning, lower variance in the fine-tuning prior, and a higher weight for its L2 regularization, serving the goal of preserving the knowledge for the pretraining task. The variance vector is estimated by the diagonal of the Fisher matrix as in Appendix A, using the expected squared gradients (Pascanu and Bengio, 2014), often estimated by averaging over the stochastic gradient descent (SGD) trajectory. However, we use the Adam optimizer, which already maintains v̂_t, an exponential moving average of squared gradients (Kingma and Ba, 2015) that is close to a linear average given a close-to-one β₂ = 0.999. Hence we reuse it to set the per-parameter weight α·F_i for the L2 regularization.

Experiments

Pretraining
We first pretrain the model using a combination of the En→Fr portion of MuST-C (Gangi et al., 2019) and the Fr→En portion of TEDx (Salesky et al., 2021), both derived from TED talks, plus the Fr→En portion of CoVoST2 (Wang et al., 2021a) based on general sentences from Common Voice (Ardila et al., 2020), with the texts further cleaned and sentences that are too long or contain foreign characters removed. With a total of over 700 hours of speech data paired with transcripts and translations in both directions, we achieve satisfactory results on the ST and ASR pretraining tasks, as indicated in Table 1, and ASR+ST training shows better performance than the single-task variants.

Tasks
We then train the model on a variety of benchmarks as follows; more details are given in Appendix B.
SLURP is a large English SLU dataset with 72.2k real speech recordings and 69.3k synthetic audio clips covering a broad range of speech commands given to voice assistants. We use its IC labels with 18 scenarios and 46 actions, for which we use two separate linear heads on the classifier.
MINDS-14 is a multilingual IC dataset for banking scenarios with 14 types of intents covering 14 languages, with ∼600 utterances per language; we use data in four languages (en-AU, en-GB, en-US, and fr-FR). Under the 3:2:5 train-dev-test split of the XTREME-S benchmark (Conneau et al., 2022), the training data is rather scarce, demanding data-efficient multilingual modelling.
NMSQA, or Natural Multi-Speaker Question Answering, is a spoken QA dataset consisting of audio for the questions and segmented context articles from SQuAD (Rajpurkar et al., 2016), with 97.6k question-answer pairs given in over 300 hours of synthetic audio from 12 speakers produced by Amazon TTS, coupled with a 60-speaker real test set of 2.7 hours of recordings. The goal is similar to textual QA: predict the correct span in the spoken context audio that answers the question. Performance is measured by the Audio Overlapping Score (AOS) (Li et al., 2018a), defined as AOS = |X ∩ Y| / |X ∪ Y|, in which X is the predicted audio span and Y the ground truth.
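The AOS metric amounts to a temporal intersection-over-union between two time spans; a minimal sketch, with spans given as (start, end) in seconds:

```python
def aos(pred, gold):
    """AOS = |X ∩ Y| / |X ∪ Y| over time intervals (start, end)."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0

print(aos((2.0, 6.0), (4.0, 8.0)))  # overlap 2s, union 6s -> 0.333...
```

A perfect prediction scores 1.0 and a non-overlapping one scores 0.0, so AOS rewards both locating and tightly bounding the answer audio.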
SLURP-Fr is a French version of SLURP that we create to evaluate the cross-lingual transfer capabilities of the model, based on MASSIVE (FitzGerald et al., 2022), a translation of the SLURP texts into multiple languages. With the same input domain and output categories, zero-shot transfer becomes possible. We first produce audio for the 16.5k French samples in MASSIVE, under a 7:2:1 train-dev-test split, using Google TTS with four different WaveNet-based voices. Then we invite two native French speakers to read out a total of 477 randomly selected, category-balanced held-out utterances, forming the real test set. To mimic SLURP, we record the audio indoors with two microphones under both near-field and far-field conditions. We also define a 100-shot-per-category subset with 4.5k samples in total to simulate an even lower-resource condition.
Spoken Gigaword is a spoken version of the summarization (headline generation) task on Gigaword (Rush et al., 2015), as proposed by Huang et al. (2022), aimed at generating a brief headline from a short piece of English spoken news. As the authors did not release their data, we follow their method to filter the dataset by removing samples that are too long or contain special characters, and create a synthetic dataset of 131.5 hours of audio produced by Google TTS with 9 neural en-US voices, containing 50k training samples, 1k validation samples, and 385 test samples, the latter a result of manual filtering due to frequent noise in the test set. Performance is reported with the commonly used ROUGE metric (Lin, 2004).

Results
Monolingual IC Following previous work, we report the test accuracy on SLURP as in Table 2, compared with baselines including ESPnet-SLU (Arora et al., 2022) at 86.30%, CTI (Seo et al., 2022) at 86.92%, and Generative IC+SF (Wang et al., 2021b) at 87.13% with wav2vec2 and 89.38% with HuBERT. Models with ST pretraining outperform those trained on ASR only, while adding ASR to ST pretraining makes limited difference, though it gives better WER and BLEU during pretraining. Nonetheless, ASR pretraining does help when compared with the None model, fine-tuned directly from self-supervised PTLMs without any additional pretraining. Joint training on both the pretraining and the downstream task consistently improves the results. Owing to the lack of a publicly available multilingual version of HuBERT, which has been found stronger than wav2vec2 on various downstream tasks, we rely on the multilingual version of wav2vec2 as our acoustic encoder. Nevertheless, we reach much better results than various notable baselines, including the approach of jointly generating intents and slots (Wang et al., 2021b) based on wav2vec2, and slightly higher accuracy than the HuBERT version of that method, the current state-of-the-art.
Multilingual IC We then report the accuracy on MINDS-14 in Table 3 for four languages, plus the average accuracy across languages, compared with a baseline fine-tuned directly from XLS-R. The results are consistent with the monolingual case: ST pretraining significantly improves performance on SLU tasks, joint training is beneficial, and adding ASR gives limited gains.
Cross-lingual IC With our SLURP-Fr dataset parallel to the English SLURP, we show the advantage of our method for cross-lingual IC transfer under full-data, 100-shot, and zero-shot conditions. As given in Table 4, ST pretraining again brings clear gains. We further add language adversarial training during pretraining and fine-tuning to promote language agnosticity, which outperforms the other models in most cases, particularly in the zero-shot setting. This implies the usefulness of language adversarial training and the importance of language-agnostic input features for the classifier.
Spoken QA We compare our methods with the results reported by Lin et al. (2022), including a cascaded pipeline that fine-tunes Longformer on transcripts from wav2vec2-based ASR, and the DUAL approach that fine-tunes Longformer on units extracted by a frozen HuBERT. For a fair comparison, we fine-tune only the classifier built from layers 2-4 of Longformer and the top 5 layers of the mBART encoder, while the rest of the model is frozen and used as a feature extractor, so that the number of trainable parameters is comparable to the baselines. The DUAL approach enjoys a view of the whole article, while our model processes the question with each sentence of the article individually. Nevertheless, as shown in Table 5, the additional pretraining stage leads to better feature extraction and model initialization than DUAL, and ST further improves the performance, successfully beating the cascaded solution.
Speech summarization Speech summarization differs from the previous tasks in that the model first needs to capture the general meaning of the speech and then produce representations suitable for generation. Hence it can reuse the seq2seq model and be formulated as an extra "target language". However, the produced output is still in English, which gives the ASR task an advantage. In our experiments, simply fine-tuning the model rapidly led to overfitting, so we perform joint training only. As shown in Table 6, ASR benefits the summarization task: the ST+ASR model consistently outperforms the ST-only one, while the ST-only model still beats the ASR-only model, signifying the importance of the semantic understanding capability brought by ST pretraining. Similar to the cross-lingual IC case, we build a cascaded baseline that first transcribes the inputs with our ASR model, which introduces WERs of 9.1% and 8.9% on dev and test respectively, and then applies a BART-based model fine-tuned on the full textual Gigaword (ROUGE-1/2/L of 37.28/18.58/34.53) to produce the summaries. Applied to the relatively simple utterances in Spoken Gigaword, it reaches higher performance on dev, which indicates the challenge our new benchmark poses for E2E systems, though our E2E approach keeps the gap quite small and achieves better results on test.
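For reference, ROUGE-1 is essentially a unigram-overlap F-score between a generated headline and a reference; a minimal sketch (the reported scores use the standard ROUGE toolkit, which additionally handles stemming and multiple references):

```python
from collections import Counter

def rouge1_f(candidate, reference):
    """Unigram-overlap F1 between whitespace-tokenized strings."""
    c, r = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((c & r).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    prec = overlap / sum(c.values())
    rec = overlap / sum(r.values())
    return 2 * prec * rec / (prec + rec)

print(rouge1_f("police arrest protest leaders",
               "police arrest leaders of protest"))  # 4 shared unigrams
```

ROUGE-2 and ROUGE-L follow the same idea with bigrams and longest common subsequences, respectively.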

Pretraining knowledge preservation
Although jointly training the model on the pretraining task and the target task proves helpful, it requires more computational resources and access to the pretraining data; otherwise, as shown in the experiments above, a performance gap appears. Therefore, as discussed in Section 2, we experiment with Bayesian transfer approaches, targeting ST pretraining on SLURP and MINDS-14, plus ST+ASR pretraining on MINDS-14, which has a considerable 1.32% accuracy gap. We try various weights α for the L2-SP regularization, ranging from 1e-5 to 1e-2. We then inspect the distribution of the approximated Fisher diagonals F_i, which range from 1e-20 to 1e-5 as in Appendix C. For optimization stability we clamp the weight α·F_i at 1e-2, and use EWC weights of 2e2, 2e4, 2e6, and 2e7 to roughly match the magnitude of the weights with those for the L2-SP regularizer.
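The reuse of Adam statistics described in Section 2 can be sketched as follows. This is a toy illustration: Adam's `exp_avg_sq` buffer (its running average of squared gradients) serves as the diagonal Fisher estimate F_i, and the fine-tuning penalty is Σ_i min(α·F_i, clamp)·(θ_i − θ₀_i)², with the clamp value taken from the text; the model and step counts are made up.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)
opt = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))

# a few "pretraining" steps so Adam accumulates squared-gradient statistics
for _ in range(5):
    loss = model(torch.randn(4, 8)).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

theta0, fisher = [], []
for p in model.parameters():
    theta0.append(p.detach().clone())
    fisher.append(opt.state[p]["exp_avg_sq"].clone())  # v_t as diagonal Fisher

def ewc_penalty(model, alpha=2e4, clamp=1e-2):
    """Per-parameter-weighted L2-SP penalty, with alpha*F_i clamped at 1e-2."""
    return sum(
        (torch.clamp(alpha * f, max=clamp) * (p - p0) ** 2).sum()
        for p, p0, f in zip(model.parameters(), theta0, fisher))

print(float(ewc_penalty(model)))  # exactly 0 right after the snapshot
```

During fine-tuning this penalty is added to the task loss, so parameters with large accumulated squared gradients (sharp pretraining loss landscape) are held close to θ₀.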
Results are shown in Figure 2; for MINDS-14 the average accuracies are reported. In the case of SLURP, the amount of data is likely already sufficient, so that preserving the pretraining knowledge helps only if done in a fully adaptive way, namely joint training; the regularizers thus have limited effect, or even harm the accuracy when the weight is large. Under the low-resource condition of MINDS-14, however, both regularizers are effective. As in Li et al. (2018b), EWC, despite being more flexible and adaptive, does not necessarily lead to better transfer learning. This is consistent with our findings on MINDS-14: both regularizers can close the accuracy gap or even surpass the joint-training model given an appropriate weight, while the best regularizer varies across cases, although the more adaptive EWC can reach better results, as indicated by the MINDS-14 ST+ASR case. In this way, we demonstrate the effectiveness of continual learning regularizers for better transfer learning on large pretrained language models.

Related work
Translation as an auxiliary task It has been found that representations from MT models capture various aspects of the input utterance, such as syntax (Shi et al., 2016), morphology (Belinkov et al., 2017), and semantic inferences (Poliak et al., 2018; Belinkov et al., 2020). Hence MT has been established as a pretraining task, as in CoVe (McCann et al., 2017), for various downstream tasks. Unlike this paper, however, recent work in this direction has focused on multilingual and cross-lingual cases, starting from attempts to reuse MT representations as sentence embeddings for text classification (Shi et al., 2016; Lu et al., 2018) and, particularly often, for semantic similarity and bi-text mining (Schwenk and Douze, 2017; Vázquez et al., 2019; Raganato et al., 2019; Artetxe and Schwenk, 2019). For pretraining PTLMs to be fine-tuned, MT proves effective on downstream cross-lingual tasks in few-shot and zero-shot transfer (Eriguchi et al., 2018), often accompanied by similar tasks like translation language modelling (Conneau and Lample, 2019; Kale et al., 2021), cross-lingual MLM (Chi et al., 2021), and dictionary denoising (Reid and Artetxe, 2022). In particular, MT has been used as an auxiliary task for cross-lingual intent classification on text (Schuster et al., 2019; Siddhant et al., 2020; van der Goot et al., 2021), and is widely used in cross-lingual generation, including summarization (Zhu et al., 2019; Cao et al., 2020; Xu et al., 2020; Takase and Okazaki, 2022), simplification (Mallinson et al., 2020), and data-to-text generation (Kale and Roy, 2020).
Bayesian transfer learning Viewing the pretrained model not as a point estimate but as a distribution is critical for continual learning, as in EWC (Kirkpatrick et al., 2017), and the idea has also been applied to transfer learning to regularize fine-tuning, as in L2-SP for image classification (Li et al., 2018b), though similar regularizers have been used for MT (Barone et al., 2017) and ASR (Liao, 2013). More recently, Shwartz-Ziv et al. (2022) propose to approximate the prior using the SGD trajectory, as in SWAG (Maddox et al., 2019), for transfer learning.

Conclusion
In this paper we confirm our hypothesis that speech translation can be a powerful pretraining and joint-training task for various end-to-end models on tasks involving semantic understanding of speech, including intent classification, spoken question answering, and speech summarization. In particular, it benefits multilingual scenarios and cross-lingual transfer, including the zero-shot case. We also create two new datasets for the above tasks. Furthermore, we demonstrate the effectiveness of continual learning regularizers for preserving knowledge from pretraining for downstream tasks.

A Elastic weight consolidation
To estimate the posterior distribution of the parameters θ trained on the pretraining data D_p, we perform a Taylor expansion of the log-likelihood log f(θ) = log p(θ; D_p) near the parameters after pretraining, θ₀, which is assumed to be near the optimum, making ∇ log f(θ₀) ≈ 0. Hence,

log f(θ) ≈ log f(θ₀) + ½ (θ − θ₀)ᵀ H (θ − θ₀),  where H = ∇²_θ log f(θ) |_{θ₀}.  (2)

Therefore, through this second-order expansion, p(θ; D_p) is approximated by a Gaussian distribution corresponding to the quadratic term above, with θ₀ as the mean and (−H)⁻¹ as the covariance. To estimate the Hessian, we use Bayes' theorem and take a flat prior on D_p, giving

H ≈ ∇²_θ log p(D_p; θ) |_{θ₀},  (3)

while the Fisher information matrix can be written as

F = E[∇_θ log p(D_p; θ) ∇_θ log p(D_p; θ)ᵀ] = −E[∇²_θ log p(D_p; θ)],  (4)

with the second equality holding under standard regularity conditions, so that −H can be estimated by F. Therefore, the posterior distribution of θ on the pretraining task is approximated by a Gaussian with θ₀ as the mean and F⁻¹ as the covariance. The Fisher matrix can then be estimated from squared gradients as in Pascanu and Bengio (2014), and EWC further simplifies it by assuming a diagonal covariance.

B Implementation details
We pretrain the model following common settings in the field on single 24GB V100 GPUs, using the Adam optimizer with a learning rate schedule of 20k linear warmup steps from 0 to 1e-4, followed by inverse-sqrt decay to 3e-5. Models are selected and early stopping is performed according to the WER or BLEU on the dev set. The PTLMs we use are the 24-layer "large" versions provided by Hugging Face. A dynamic batching strategy is adopted to accommodate input utterances of different lengths; together with gradient accumulation, an average batch size of ∼25 with ∼500 target tokens per step is used. The wav2vec2 part is frozen for the first 10k steps, and utterances shorter than 0.1s or longer than 10s are excluded during the first 20k steps. L2 regularization with α=5e-3 is applied to the weights, except in the Bayesian transfer learning experiments. The settings are similar for fine-tuning, except that the encoder is frozen during the initial steps, and for joint-training models a 1:3 ratio between data for the pretraining and target tasks is used. For smaller datasets, including MINDS-14, SLURP-Fr, and Spoken Gigaword, the data ratio, dropout rate, and learning rate schedule are further tuned to avoid overfitting. More details can be found in the source code.
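The learning-rate schedule above can be sketched as a small function: linear warmup from 0 to the peak over the warmup steps, then inverse-square-root decay floored at the final rate. Treating 3e-5 as a hard floor is an assumption of this sketch.

```python
def lr_at(step, peak=1e-4, warmup=20_000, lr_min=3e-5):
    """Linear warmup to `peak`, then inverse-sqrt decay floored at `lr_min`."""
    if step < warmup:
        return peak * step / warmup          # linear warmup from 0
    return max(lr_min, peak * (warmup / step) ** 0.5)  # inverse-sqrt decay

print(lr_at(10_000))   # halfway through warmup -> 5e-5
print(lr_at(20_000))   # peak -> 1e-4
print(lr_at(80_000))   # decayed by sqrt(1/4) -> 5e-5
```

The same shape, with retuned constants, applies to the fine-tuning runs described above.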

C EWC weight distribution
The distributions of the log estimated Fisher diagonals for each weight matrix or bias vector are illustrated in Figure 3. It can be observed that most weights are concentrated around 1e-5 to 1e-10, and they are close to each other, as the standard deviations are of similar magnitude. Hence with a large α such as 2e7, most weights will reach the 1e-2 clamping threshold. The exceptions are the biases for the key projections in the attention modules, which correspond to the lower-left cluster and have much smaller weights.

Figure 1 :
Figure 1: The framework of speech translation aided spoken language understanding, with the encoder-decoder seq2seq model based on pretrained language models and fine-tuned on speech translation, then transferred to downstream tasks like intent classification with a stacked classifier leveraging representations produced by the translation encoder.

Figure 2 :
Figure 2: Results for Bayesian transfer regularizers when applied to different tasks, with the goal of mitigating the gap between the performance of single-task and joint-training models, indicated by the lower and the upper horizontal lines. The x-axis indicates the regularization weight for EWC/L2-SP, and the y-axis the accuracy. The regularizers bring positive effects on the data-scarce MINDS-14 task, but not on SLURP.

Figure 3 :
Figure 3: The distribution of the estimated Fisher diagonals shown in heat map, with x-axis for the means of the log squared gradients of each weight or bias, and y-axis the standard deviation.

Table 1 :
Test results on the cleaned pretraining datasets given by word error rate (WER) for ASR and BLEU score for ST, with French inputs for TEDx and CoVoST2, and English inputs for MuST-C.


Table 2 :
SLURP test results of our models, compared to baselines of ESPnet-SLU using HuBERT-ASR+Conformer, the Continuous Token Interface approach, and the SpeechBrain models that jointly generate predictions for intent classification and slot filling.

Table 3 :
Test accuracies for models on MINDS-14 multilingual IC, compared with directly fine-tuning the full XLS-R model. Both ST pretraining and joint training show benefits.

Table 4 :
Accuracies for models on SLURP-Fr cross-lingual IC transferred from SLURP with different amounts of data, accompanied by a cascaded system. Results highlight ST pretraining and language adversarial training.

Table 5 :
AOS (↑) scores for models on NMSQA. The pretraining tasks prove helpful, particularly ST pretraining, which outperforms the cascaded system.

Table 6 :
ROUGE (↑) scores for models on Spoken Gigaword speech summarization. ST still proves the most beneficial, while ASR is also found helpful in this situation of generating summaries in the same language.