CIF-PT: Bridging Speech and Text Representations for Spoken Language Understanding via Continuous Integrate-and-Fire Pre-Training

Speech and text representations generated by pre-trained models contain modality-specific information that can be combined to benefit spoken language understanding (SLU) tasks. In this work, we propose a novel pre-training paradigm termed Continuous Integrate-and-Fire Pre-Training (CIF-PT). It relies on a simple but effective frame-to-token alignment, continuous integrate-and-fire (CIF), to bridge the representations of speech and text. It jointly performs speech-to-text training and language model distillation through CIF as the pre-training (PT). Evaluated on the SLU benchmark SLURP, CIF-PT outperforms the state-of-the-art model by 1.94% in accuracy and 2.71% in SLU-F1 on the tasks of intent classification and slot filling, respectively. We also observe that the cross-modal representation extracted by CIF-PT performs better than other neural interfaces for SLU tasks, including the dominant speech representation learned from self-supervised pre-training.


Introduction
Spoken language understanding (SLU) plays a key role in speech interaction systems such as spoken dialogue systems, voice assistants, and automated calling robots. It focuses on extracting key information and making predictions from audio signals of human speech (Wang et al., 2005; Tur and Mori, 2011). Traditional methods decompose SLU into two cascaded tasks: automated speech recognition (ASR) and natural language understanding (NLU), where audio signals are first transcribed into text and then processed by a text-based language understanding model. In this cascaded scheme, the errors of the ASR module propagate to the NLU module and degrade the final performance. Moreover, the text predicted by the ASR module may not be the ideal interface for the language understanding task. For example, acoustic information such as intonation and pitch, which may be helpful for understanding tasks, is lost after ASR. To tackle these problems, recent studies employ end-to-end approaches for SLU (Serdyuk et al., 2018; Haghani et al., 2018; Chung et al., 2021; Arora et al., 2022), where language understanding is performed directly from audio signals without explicitly using the text predicted by ASR.
For text-based language understanding tasks, pre-trained language models such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019b) and GPT (Radford et al., 2019) have achieved remarkable success. These models use self-supervised pre-training on large-scale unlabeled corpora to learn contextual representations at the token or sentence level with rich syntactic and semantic knowledge (Liu et al., 2019a), which significantly benefits downstream tasks such as NLU during fine-tuning. This self-supervised pre-training approach has been extended to representation learning on speech. Models such as wav2vec 2.0 (Baevski et al., 2020), HuBERT (Hsu et al., 2021) and data2vec (Baevski et al., 2022a) focus on learning better frame-level contextual representations from unlabeled speech data, to improve the performance of ASR as well as other speech processing tasks. For end-to-end SLU, these self-supervised speech models have proven to be powerful backbones for learning semantic representations (Wang et al., 2021; Arora et al., 2022).
The self-supervised pre-training methods for speech mainly focus on leveraging speech data to model acoustic information (Chung et al., 2021) at the frame level, while pre-trained language models work at the higher token or sentence level to encode linguistic knowledge (Liu et al., 2019a). These two kinds of representation can be combined to better benefit downstream tasks such as SLU. The combination of speech and text representations can be performed by jointly pre-training on data of the two modalities (Chuang et al., 2020), or by distilling one pre-trained representation into another (Kim et al., 2021). In either case, the frame-level speech representation needs to be aligned with the token-level textual representation. Frame-to-token alignment methods such as forced alignment have been applied to speech-text joint pre-training (Chuang et al., 2020). However, these alignment methods mainly rely on external models or rules, and can only generate a hard alignment mapping that cannot be updated in end-to-end training. On the other hand, aligning frames and tokens through cross-attention (Arora et al., 2022; Zhu et al., 2022) suffers from high complexity and lacks token timestamps that are synchronized to frames.
Frame-to-token alignment also plays a critical role in ASR systems. Various works, such as Connectionist Temporal Classification (CTC) (Graves et al., 2006), Listen, Attend and Spell (LAS) (Chan et al., 2016), RNN Transducer (RNN-T) (Graves, 2012) and Continuous Integrate-and-Fire (CIF) (Dong and Xu, 2020), focus on effective alignment methods for better speech recognition performance. Among these works, the CIF alignment, which explicitly aggregates frame-level speech representations into token-level ones, is adopted in our work to combine with text representation. Specifically, we propose a novel pre-training paradigm: Continuous Integrate-and-Fire Pre-Training (CIF-PT) for end-to-end SLU. CIF-PT includes two pre-training tasks. The first task is speech-to-text modeling (Wang et al., 2020) with CIF alignment; in this work, we apply the ASR task that transcribes speech to text. The second task is language model distillation (LMD). Since the speech representation integrated by CIF is at the token level, token-level distillation from a pre-trained language model can be performed to inject text-based linguistic knowledge into the representation. Through the joint pre-training of the two tasks, CIF-PT is able to generate representations with information from both the speech and text modalities.
We examine our CIF-PT method on downstream SLU tasks including intent classification and slot filling. On the SLU benchmark SLURP (Bastianelli et al., 2020), the end-to-end SLU model with CIF-PT outperforms the state-of-the-art model by 1.94% in accuracy and 2.71% in SLU-F1 on the tasks of intent classification and slot filling, respectively. The cross-modal representation extracted by CIF-PT is also competitive with other neural interfaces (Rao et al., 2020; Raju et al., 2022) used in SLU. These results, together with a series of experiments including an ablation study and pre-training on out-of-domain data, demonstrate the effectiveness and generalization of CIF-PT.

Related Works
End-to-End SLU. Various works extend models originally designed for ASR to the field of SLU. Peng et al. (2022) propose Branchformer as an alternative to Conformer (Gulati et al., 2020), and show performance gains in SLU as well as ASR. Huang et al. (2022) jointly train ASR and SLU as multiple tasks to exploit knowledge shared across tasks. Seo et al. (2022) use the probability distribution output of the ASR model as a continuous token interface (CTI) for downstream NLU. Self-supervised representation learning on speech data provides powerful backbones for SLU, such as wav2vec 2.0 (Baevski et al., 2020) and HuBERT (Hsu et al., 2021). Arora et al. (2022) propose ESPnet-SLU and analyze the performance of a HuBERT encoder pre-trained with ASR as a feature extractor for SLU. Wang et al. (2021) perform partial and entire fine-tuning of pre-trained wav2vec 2.0 and HuBERT on SLU tasks.
Cross-Modal Pre-training for SLU. In order to exploit information from both speech and text for SLU, joint pre-training on speech and text data has been proposed. SpeechBERT (Chuang et al., 2020) extends the masked language model (MLM) pre-training of BERT to a mixture of audio and text data. In SPLAT (Chung et al., 2021), a speech module and a language module are jointly pre-trained with token-level and sentence-level alignment. Another line of research focuses on knowledge distillation from a pre-trained language model into a pre-trained speech encoder. Kim et al. (2021) use BERT as a teacher to perform sentence-level knowledge distillation at the pre-training stage and target-specific distillation during fine-tuning. Zhu et al. (2022) introduce cross-attention between text and speech and perform distillation on the attention heads for knowledge transfer.
Frame-to-Token Alignment in SLU. In SpeechBERT (Chuang et al., 2020), forced alignment based on an external ASR engine is used to train the initial phonetic-semantic joint embedding. Chung et al. (2021) adopt a heuristic alignment approach in SPLAT, where alignment scores are computed as the cosine similarity between the output embeddings of the pre-trained speech and text models. Cross-attention alignment is introduced by Zhu et al. (2022) to capture the interactions between text tokens and speech frames. For SpeechT5, since the pre-training does not strictly rely on paired audio-text data, Ao et al. (2022) adopt a shared codebook for speech and text representations and a diversity loss to encourage alignment in the latent space.

Method
In this section, we present the architecture of our proposed continuous integrate-and-fire pre-training (CIF-PT) method for SLU. As shown in Figure 1, our end-to-end SLU models go through two stages: CIF-PT and SLU training.
During CIF-PT, we employ two pre-training tasks: ASR training with CIF alignment and token-level language model distillation (LMD). These two tasks help the model learn contextual representations of the speech features that are aligned to tokens and carry high-level linguistic knowledge. After CIF-PT, the pre-trained parameters, including the speech encoder and the CIF part, are used for downstream SLU tasks such as intent classification and slot filling.

ASR training with CIF Alignment
As shown in Figure 1(a), the CIF-based ASR model consists of three parts: the speech encoder, the CIF part, and the corresponding decoder. An input speech utterance is first processed into a sequence of frames $x = \{x_t\}_{t=1}^{T}$ by a speech feature extractor (e.g. a mel filter bank or a convolutional front-end (Baevski et al., 2020)), where $x_t$ is the feature vector of the $t$-th frame. The speech encoder converts the frame-level input vectors into frame-level hidden states $h = \{h_t\}$. The CIF part follows the speech encoder and converts the frame-level hidden states $h$ into token-level speech representations $c$. We follow the CIF setup of Dong and Xu (2020), which is summarized as follows. First, the encoded hidden states $h$ are used to predict a weight $\alpha_t$ for each frame. The weights $\alpha$ and the frame-level hidden states $h$ are then input to CIF to obtain the token-level representations $c = \{c_i\}_{i=1}^{N}$, where $N$ is the total number of tokens. Each token-level representation $c_i$ is a linear combination of the frame-level representations $\{h_t\}$. At each frame step $t$, the weight $\alpha_t$ is added to an accumulated weight, $\alpha^a_i \leftarrow \alpha^a_i + \alpha_t$, and the frame-level hidden state $h_t$ is integrated into the token-level representation, $c_i \leftarrow c_i + \alpha_t h_t$, until the accumulated weight $\alpha^a_i$ exceeds a threshold $\beta$. When $\alpha^a_i$ exceeds $\beta$, the weight of the boundary hidden state is split into two parts, $\alpha_t = \alpha_{t1} + \alpha_{t2}$, so that the accumulated weight for each token is exactly $\beta$; the second part $\alpha_{t2}$ is carried over to the next token representation. In this way, the frame-level hidden states are integrated into token-level representations, which not only reduces the redundancy of speech information but also reduces the computational complexity of the subsequent ASR decoder and downstream understanding tasks.
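The integration rule described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function name `cif_integrate` is hypothetical, every weight is assumed to be smaller than β (so a frame closes at most one token), and the scaling strategy and tail handling of the original CIF are omitted.

```python
from typing import List, Sequence


def cif_integrate(hidden: Sequence[Sequence[float]],
                  alphas: Sequence[float],
                  beta: float = 1.0) -> List[List[float]]:
    """Integrate frame-level vectors into token-level vectors (illustrative).

    Assumes 0 <= alpha_t < beta for every frame; a trailing partial token
    (accumulated weight below beta at the end) is simply dropped.
    """
    dim = len(hidden[0])
    tokens: List[List[float]] = []
    acc_weight = 0.0          # accumulated weight of the current token
    acc_state = [0.0] * dim   # partially integrated token representation

    for h_t, a_t in zip(hidden, alphas):
        if acc_weight + a_t < beta:
            # No token boundary yet: keep integrating this frame.
            acc_weight += a_t
            acc_state = [c + a_t * h for c, h in zip(acc_state, h_t)]
        else:
            # Boundary frame: split a_t so the current token receives
            # exactly beta in total, and emit ("fire") the token.
            a_t1 = beta - acc_weight
            a_t2 = a_t - a_t1
            tokens.append([c + a_t1 * h for c, h in zip(acc_state, h_t)])
            # The remaining weight starts the next token.
            acc_weight = a_t2
            acc_state = [a_t2 * h for h in h_t]
    return tokens
```

For example, with one-dimensional frames `[[4.0], [8.0], [2.0]]` and weights `[0.75, 0.5, 0.75]`, the second frame is split 0.25/0.25 between the first and second token, so three frames are reduced to two token-level vectors.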
We use the autoregressive ASR decoder of Dong and Xu (2020). It accepts the previous token $y_{i-1}$ and the integrated $c_i$ from the CIF part as inputs, and autoregressively predicts the output token distribution for each $c_i$. The CIF-based encoder-decoder model is trained with a cross-entropy (CE) loss $L_{\mathrm{CE}}$ in a teacher-forcing manner. Optionally, a CTC loss $L_{\mathrm{CTC}}$ can be applied to the frame-level hidden states $h$ for joint training. The quantity loss $L_{\mathrm{QUA}}$ supervises the CIF part so that the predicted quantity of tokens is close to the number of target tokens. The final CIF loss is the weighted sum of the three: $L_{\mathrm{CIF}} = L_{\mathrm{CE}} + \lambda_1 L_{\mathrm{CTC}} + \lambda_2 L_{\mathrm{QUA}}$ (1)
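For orientation, the cross-entropy and quantity terms could plausibly be written as below. These forms are assumptions inferred from the verbal description (in particular the conditioning of the decoder and the summation bounds), not equations copied from the paper; the quantity term is written for a threshold $\beta = 1$, where the sum of the weights equals the predicted number of tokens.

```latex
% Assumed forms, reconstructed from the prose; not the paper's own equations.
L_{\mathrm{CE}} = -\sum_{i=1}^{N} \log p\left(y_i \mid y_{<i},\, c\right),
\qquad
L_{\mathrm{QUA}} = \Bigl|\, \sum_{t} \alpha_t - N \Bigr|
```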

Language Model Distillation
Since the speech representation $c_i$ integrates speech information at the token level, we use a pre-trained BERT model as a knowledge distillation teacher to inject textual knowledge into the speech representation. Let $x = \{x_i\}_{i=1}^{T}$ be the speech frame sequence and $y = \{y_i\}_{i=1}^{N}$ be the corresponding transcript token sequence. As shown in Figure 1, $x$ is encoded into speech features $\{c_i\}_{i=1}^{N}$ by the speech encoder and the CIF part, and $y$ is encoded by BERT into contextual representation vectors $\{h^t_i\}_{i=1}^{N}$. Since $\{c_i\}$ are aligned to tokens, token-level knowledge distillation can be performed directly to make the speech representation close to the contextual representation produced by BERT, thus forming a cross-modal representation. We consider three types of language model distillation (LMD) loss in this paper: an MSE loss, a smoothed L1 loss and a contrastive loss. Using the BERT hidden output $h^t_i$ as the target, the MSE loss penalizes the squared distance between $c_i$ and $h^t_i$. The smoothed L1 loss is proposed in (Baevski et al., 2022a), where a parameter $\gamma$ controls the transition from a squared loss to an $L_1$ loss.
The contrastive loss encourages $c_i$ to be closer to $h^t_i$ than other $c'$ sampled from an in-batch negative set $N_c$, where closeness is measured by the cosine similarity function $\mathrm{sim}(\cdot, \cdot)$ scaled by a temperature scalar $\tau$.
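The three distillation objectives can be illustrated with the following pure-Python sketch for a single token position. It is an assumption-laden illustration, not the paper's code: the function names are hypothetical, the per-dimension averaging is a choice made here, the smoothed L1 form follows the usual definition (quadratic below γ, linear above), and the contrastive term is written as an InfoNCE-style loss in which the negatives $c'$ are compared against the same teacher vector $h^t_i$.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def mse_loss(c: Vector, h: Vector) -> float:
    """Mean squared error between student vector c and teacher vector h."""
    return sum((a - b) ** 2 for a, b in zip(c, h)) / len(c)


def smoothed_l1_loss(c: Vector, h: Vector, gamma: float = 1.0) -> float:
    """Quadratic for |diff| <= gamma, linear beyond it (averaged over dims)."""
    total = 0.0
    for a, b in zip(c, h):
        d = abs(a - b)
        total += 0.5 * d * d / gamma if d <= gamma else d - 0.5 * gamma
    return total / len(c)


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(c_i: Vector, h_i: Vector,
                     negatives: Sequence[Vector], tau: float = 0.1) -> float:
    """InfoNCE-style loss: c_i is the positive, `negatives` are other c'."""
    logits = [cosine(c_i, h_i) / tau]
    logits += [cosine(c_neg, h_i) / tau for c_neg in negatives]
    # Numerically stable log-sum-exp over positive and negative logits.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]
```

A small temperature sharpens the softmax over the candidates, so the loss is dominated by the negatives that are most similar to the teacher vector.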
The LMD task is trained simultaneously with the CIF-based ASR task in a multi-task manner, which forms the training loss $L$ of CIF-PT, where $\lambda$ is the weight of the LMD loss: $L = L_{\mathrm{CIF}} + \lambda L_{\mathrm{LMD}}$ (2)

Spoken Language Understanding
After CIF-PT, the pre-trained speech encoder and CIF part convert speech input into the sequence of cross-modal representations $\{c_i\}$, which is used for downstream SLU training. We evaluate our pre-trained model on the SLU tasks of intent classification and slot filling. The corresponding intent decoder and slot decoder are shown in Figure 1(b). For intent classification, $\{c_i\}$ is fed into additional Transformer layers to generate task-specific decoder states. We use the average of the decoder states over all positions as the utterance representation, and predict the intent from it through a linear projection.
The slot filling task is performed in a sequence generation style. The slot types and slot values are concatenated as targets $\{y^s_i\}$ to train a sequence-to-sequence model, i.e. "[SEP] slot_type1 slot_value1 [SEP] slot_type2 slot_value2". The slot decoder consists of Transformer decoder layers where the sequence of $c_i$ is used as the key and value of the cross-attention layer. We train the encoder-decoder to generate the slot target sequence $\{y^s_i\}_{i=0}^{K}$ with teacher forcing.
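The target format quoted above can be made concrete with a small serialization sketch. The helper names `build_slot_target` and `parse_slot_target` are hypothetical, and the parser assumes that a slot type is a single whitespace-free token while a slot value may contain several words; the paper does not specify these details.

```python
from typing import List, Tuple

SEP = "[SEP]"


def build_slot_target(slots: List[Tuple[str, str]]) -> str:
    """Serialize (slot_type, slot_value) pairs into one target string."""
    parts: List[str] = []
    for slot_type, slot_value in slots:
        parts.extend([SEP, slot_type, slot_value])
    return " ".join(parts)


def parse_slot_target(target: str) -> List[Tuple[str, str]]:
    """Recover (slot_type, slot_value) pairs from a generated string."""
    slots: List[Tuple[str, str]] = []
    for chunk in target.split(SEP):
        # First token is the slot type, the rest is the slot value.
        fields = chunk.strip().split(maxsplit=1)
        if len(fields) == 2:
            slots.append((fields[0], fields[1]))
    return slots
```

Parsing the generated string back into pairs is what makes a span-level metric such as SLU-F1 computable on the decoder output; malformed chunks without a value are skipped in this sketch.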
4 Experimental Setup

Dataset and Preprocessing
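We conduct experiments on the SLURP dataset (Bastianelli et al., 2020), which is the largest SLU benchmark and is also linguistically more diverse than other datasets. It is collected for developing an in-home personal robot assistant. The train, development and test splits from the SLURP paper are used for the training and evaluation of our methods. In addition to using the in-domain SLURP data for pre-training, we also introduce the Librispeech (Panayotov et al., 2015) dataset, which contains 960 hours of speech derived from audiobooks, as the out-of-domain pre-training dataset; it is only used in Section 5.4. All speech data is re-sampled to or kept at 16 kHz, and all text data is converted into a sequence of subword units by the subword-nmt (Sennrich et al., 2016) toolkit. Specifically, we generate 10706 subword units by performing 36000 merge operations on the training set of Librispeech, and use the learned BPE as the only tokenizer for the text of all datasets.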

Model Configuration
In this part, we detail the model structure and configuration used in our experiments. All the models are implemented using PyTorch (Paszke et al., 2019).
Encoder. We use two types of speech encoder, denoted as conformer and data2vec in subsequent experiments. The conformer encoder consists of a two-layer convolutional front-end and 15 conformer blocks (Li et al., 2021). It applies an 8-fold temporal down-sampling similar to (Dong et al., 2019). The hidden size in the conformer blocks is 400. The data2vec encoder follows the official data2vec-large configuration (Baevski et al., 2022b) and uses the released model from (Wolf et al., 2020). For the text encoder that provides the text representation in CIF-PT, we follow the BASE configuration of BERT (Devlin et al., 2019) and use our learned BPE tokenizer to perform pre-training on the English Wikipedia corpus.
CIF part. We follow the implementation of the weight estimator and CIF calculator in (Dong and Xu, 2020). The channel number of the convolutional layer is kept the same as the hidden size of the decoder.
The threshold β during CIF calculation is set to 1.0.The corresponding scaling strategy and tail handling methods are also used.
Table 1:
Model                                  IC (Acc.)   SF (SLU-F1)
MTL-SLT (Huang et al., 2022)           83.10%      74.49%
SpeechBrain (Ravanelli et al., 2021)   85.34%      74.26%
ESPnet-SLU (Arora et al., 2022)        86.30%      71.90%
CTI (Seo et al., 2022)                 86.92%      74.66%
Branchformer (Peng et al., 2022)       88.10%      77.70%
HuBERT SLU (Wang et al., 2021)         89

Decoder. We use three types of decoder in our experiments: the ASR decoder for speech-to-text training in CIF-PT, and the downstream intent decoder for IC and slot decoder for SF. The ASR decoder uses the original autoregressive decoder (Dong and Xu, 2020) with 2-layer self-attention networks (SANs, also known as transformer encoder layers (Vaswani et al., 2017)). The hidden size is 400 when the encoder is conformer and 512 when it is data2vec. The intent decoder uses 2-layer SANs followed by an average pooling layer. The slot decoder uses 4-layer SANs for the tag-based slot decoder and 4 transformer decoder layers (with an additional cross-attention layer) for the generation-based slot decoder. Unless otherwise stated, the generation-based slot decoder is used by default. The hidden size of the two types of SLU decoder is kept the same as that of the ASR decoder.

Training and Evaluation
We use the AdamW (Loshchilov and Hutter, 2018) optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.98$ and a weight decay of 1e-5. During CIF pre-training, we warm up the learning rate for the first 4% of updates to a peak of 1e-3, keep it constant for the following 64% of updates, and then linearly decay it to 1e-4. The total number of training steps is 80k. We set the weight of the CTC loss to $\lambda_1 = 0.5$ and the weight of the quantity loss to $\lambda_2 = 1.0$. The hyper-parameters of the LMD loss are explored in Section 5.2. During SLU training, we follow the Noam scheduler (Vaswani et al., 2017) with 1600 warm-up steps and a peak learning rate of 5e-4. The total number of training steps is 32k.
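The two learning-rate schedules described above can be sketched as plain functions of the update step. Several details are assumptions made for illustration and are not stated in the text: the warm-up is taken to be linear from zero, the Noam schedule is written in a form normalized so that its maximum equals the stated peak, and the function names are hypothetical.

```python
def tri_stage_lr(step: int,
                 total_steps: int = 80_000,
                 peak_lr: float = 1e-3,
                 final_lr: float = 1e-4,
                 warmup_frac: float = 0.04,
                 hold_frac: float = 0.64) -> float:
    """Warm up, hold, then linearly decay (CIF pre-training schedule)."""
    warmup_steps = round(total_steps * warmup_frac)   # 3200 by default
    hold_steps = round(total_steps * hold_frac)       # 51200 by default
    decay_steps = total_steps - warmup_steps - hold_steps
    if step < warmup_steps:
        # Linear warm-up from 0 (assumed starting point).
        return peak_lr * step / warmup_steps
    if step < warmup_steps + hold_steps:
        return peak_lr
    # Linear decay from peak_lr to final_lr over the remaining updates.
    progress = min(1.0, (step - warmup_steps - hold_steps) / decay_steps)
    return peak_lr + (final_lr - peak_lr) * progress


def noam_lr(step: int, warmup_steps: int = 1600,
            peak_lr: float = 5e-4) -> float:
    """Noam schedule, normalized so the maximum (at warmup_steps) is peak_lr."""
    step = max(step, 1)  # avoid division by zero at step 0
    return peak_lr * min(step / warmup_steps, (warmup_steps / step) ** 0.5)
```

With the default arguments, the pre-training rate reaches 1e-3 at step 3200, stays there until step 54400, and ends at 1e-4 at step 80000; the SLU rate peaks at 5e-4 at step 1600 and then decays with the inverse square root of the step.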
After training, we first average the last 10 checkpoints of each model and then use the averaged model for evaluation. We follow Bastianelli et al. (2020) and use accuracy and SLU-F1 to evaluate the models on the tasks of IC and SF, respectively. During inference on the SF task, we perform beam search with a beam width of 10 and a temperature scalar of 1.25. All experimental results are averaged over at least 2 runs.
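Checkpoint averaging amounts to an element-wise mean over the saved parameter values. The sketch below illustrates this on plain Python dictionaries that map parameter names to flat lists of floats; this data layout and the name `average_checkpoints` are assumptions chosen to keep the example self-contained, not the format used in the paper's training code.

```python
from typing import Dict, List

Checkpoint = Dict[str, List[float]]


def average_checkpoints(checkpoints: List[Checkpoint]) -> Checkpoint:
    """Element-wise mean of parameters across checkpoints.

    All checkpoints are assumed to share the same parameter names
    and the same number of values per parameter.
    """
    if not checkpoints:
        raise ValueError("at least one checkpoint is required")
    n = len(checkpoints)
    averaged: Checkpoint = {}
    for name in checkpoints[0]:
        # Group the k-th value of this parameter from every checkpoint.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(values) / n for values in columns]
    return averaged
```

In the setting described above, the list passed in would hold the last 10 saved checkpoints of a run.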
5 Results and Analysis

Main Results
To verify the effectiveness of our proposed methods, we first conduct three sets of experiments to explore the importance of designs in CIF-PT.The main results are summarized in Table 2.
The first two rows of Table 2 show the performance of our end-to-end SLU models using CIF-PT. Consistent with our expectation, the model M1 with the self-supervised data2vec encoder obtains better results than the model M0 with the conformer encoder on both tasks. We also compare the performance of our methods with published results. As shown in Table 1, the model with CIF-PT (M1 in Table 2) achieves state-of-the-art results on both the IC and SF tasks. The performance advantage on the SF task reaches 2.71% SLU-F1. We suspect that the cross-modal representation extracted by CIF-PT contains more language knowledge, which benefits SF more because SF needs to predict the slot key and the speech content simultaneously. It is worth mentioning that the model M0 with the conformer encoder also achieves competitive performance, which is superior or comparable to the published strong models (Wang et al., 2021; Seo et al., 2022) with self-supervised speech encoders.
For the model M2 in Table 2, we ablate the CIF-PT used in the model M0 and conduct joint training of the ASR and SLU tasks from scratch. The results show that ablating CIF-PT leads to a large performance degradation on both SLU tasks. Since CIF-PT consumes extra pre-training steps, we suspect that the total number of training steps may be a factor in the performance gap. Therefore, we triple the number of training steps (from 32k to 96k) to obtain the model M3. The performance gap narrows, but the model M0 with CIF-PT still has a performance advantage over the model M3 with longer SLU training.
For the models M5 and M6 in Table 2, we ablate the language model distillation (LMD) used in CIF-PT. During pre-training, we find that applying LMD brings a 3.9% relative WER reduction (14.83 → 14.25) on the model with the conformer encoder. During SLU training, we also observe that LMD improves the performance on the two tasks in Table 2. The improvement is smaller with the data2vec encoder. A possible reason is that the model with the data2vec encoder already has strong modeling power and has learned effective patterns and textual knowledge, so that the injected textual knowledge is helpful for fewer evaluation samples.
We also compare the cross-modal representation extracted by CIF-PT with the speech representation derived from self-supervised learning. For the model M7 in Table 2, we ablate the frame-to-token CIF alignment in the SLU models and directly pass the frame-level speech representation extracted by data2vec to the SLU decoder. Although model M7 achieves competitive results, it improves further when combined with CIF. We suspect the reason is two-fold: 1) CIF performs a frame-to-token mapping that integrates relevant speech and semantic information, and is thus able to remove the information redundancy in adjacent frames; 2) CIF-PT bridges the speech representation and the text representation through ASR training and LMD, thus providing more textual knowledge that benefits SLU performance. To further verify our hypothesis, we introduce CTC-based ASR pre-training (CTC-PT) before the training of the SLU model. The results show that CTC-PT provides improvements on SLU tasks (M7 → M8, M2 → M4), but a gap to CIF-PT remains. The above observations demonstrate the effectiveness of CIF-PT.
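The relative reduction quoted for the conformer encoder follows directly from the two WER values:

```latex
\frac{14.83 - 14.25}{14.83} = \frac{0.58}{14.83} \approx 0.039 = 3.9\%
```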

Comparison on Language Model Distillation
In this part, we compare different language model distillation (LMD) methods applied in CIF-PT.
From Figure 2 we make three observations: (1) All LMD methods have a positive effect on SLU performance in most cases, except for one outlier that uses the MSE loss with a weight of 0.01 on SF. The degradation disappears as the loss weight increases; (2) The contrastive LMD method gives better results on both SLU tasks than the other two methods. We suspect the reason is that contrastive distillation with a proper temperature scalar mainly focuses on distinguishing hard negatives, instead of forcing the representations to be identical as MSE does. This helps the extracted representation retain speech and language information at the same time, which may benefit SLU modeling; (3) The temperature scalar in the contrastive LMD method affects the downstream SLU tasks, with a $\tau$ value of 0.01 producing the best results on both SLU tasks.

Comparison on Neural Interfaces
We have compared the token-level representation $c_i$ extracted by CIF-PT with frame-level speech representations in Section 5.1. In this part, we further compare $c_i$ with other popular token-level neural interfaces (or representations) summarized in (Raju et al., 2022), including the hidden interface $m_i$, the posterior interface $p_i$, the tied embedding interface $e_i$, and their combinations. For a fair comparison, we do not use the LMD loss in CIF-PT, since it would benefit $c_i$. The results are shown in Table 3.
On the task of IC, we find that all token-level neural interfaces achieve comparable accuracy. This may be because these interfaces contain similar information that is useful for IC, and the pooling operation in the intent decoder further reduces the differences between representations. The combination of $c_i$ and $e_i$ achieves the best performance. We suspect this is because they are located at the beginning ($c_i$) and the end ($e_i$) of the ASR decoder respectively, so their information may differ substantially and be complementary.
On the task of SF, we observe relatively large differences among these neural interfaces. With the generation-based slot decoder, $c_i$ obtains the best performance, while the other interfaces lag behind. A possible explanation is that $c_i$, which is derived purely from speech inputs, contains more original and comprehensive speech information, and can thus provide sufficient information for the cross-attention computation in the slot decoder. In contrast, the other interfaces are all computed via the autoregressive ASR decoder, so their information may be biased toward a particular hypothesis that contains errors at inference time. In addition, using $c_i$ as the interface also avoids the mismatch between teacher-forcing inputs in training and predicted inputs in inference.
Interestingly, with the tag-based slot decoder, $c_i$ performs worse than the other neural interfaces. Since the tag-based slot model predicts a slot key for each token of the one-best ASR hypothesis, the neural interfaces $m_i$, $p_i$ and $e_i$, which are updated synchronously with ASR decoding, can provide slot predictions that are closer to the final ASR hypothesis. The original speech information provided by $c_i$ can also supplement these interfaces, and the best performance is obtained by the combination of $c_i$ and $m_i$.
Comparing the two types of slot decoder, the model with the generation-based slot decoder is superior to the one with the tag-based slot decoder. We believe this is because the generation-based decoder uses bidirectional contextual information from the full sequence, which gives it a higher ceiling in predicting slot information. In contrast, the tag-based decoder can only use the uni-directional information available from the autoregressive ASR decoder. However, this characteristic makes the tag-based model suitable for low-latency application scenarios.

Comparison on Out-of-domain Data
In the above experiments, CIF pre-training is performed on the in-domain SLURP dataset. During SLU training, the pre-trained parameters are kept frozen ('Slurp-Frozen' in Table 4) and only the SLU decoder is trained. In this part, we first explore unfreezing the pre-trained parameters during SLU training ('Slurp-Unfrozen' in Table 4). Specifically, we keep the pre-trained parameters frozen in the first half of training, and then train the entire model by performing joint training of the ASR and SLU tasks. The results show that unfreezing the pre-trained parameters leads to a slight performance degradation. We suspect this is because the textual knowledge injected by LMD suffers from catastrophic forgetting during SLU training. Nevertheless, the result achieved by 'Slurp-Unfrozen' is still better than that of the model using the frozen pre-trained model without LMD (model M5 in Table 2). We also conduct experiments on an out-of-domain pre-training dataset (Librispeech) to explore its effect on the final SLU performance. Consistent with our expectations, freezing the parameters pre-trained on out-of-domain data ('LS-Frozen' in Table 4) leads to a large performance degradation on the SLU tasks of SLURP. When these pre-trained parameters are unfrozen ('LS-Unfrozen' in Table 4) during SLU fine-tuning, the model obtains a noticeable performance boost, and even outperforms the model pre-trained on the SLURP dataset. This partly reflects the good generalization and the transfer learning potential of our proposed CIF-PT method.

Conclusion
In this work, we propose a new pre-training paradigm: Continuous Integrate-and-Fire Pre-Training (CIF-PT) for end-to-end SLU. CIF serves as a bridge connecting the speech and text modalities: on the one hand, it integrates the speech representation into the token level through the frame-to-token alignment ability learned from the ASR pre-training task. On the other hand, it supports one-to-one transfer of textual knowledge into the integrated token-level speech representation via the pre-training task of language model distillation. After CIF-PT, we obtain a cross-modal representation that is used as the neural interface to downstream SLU tasks.
Evaluated on SLURP, the largest SLU benchmark, CIF-PT sets new state-of-the-art results on both the IC and SF tasks. We further validate the effectiveness and generalization of CIF-PT through a series of experiments including an ablation study and pre-training on out-of-domain data. We also observe that the cross-modal representation extracted by CIF-PT is competitive with other neural interfaces on SLU. We believe that CIF-PT has the potential to better encode long-form speech content (e.g. spoken paragraphs) through its language model distillation, and we will explore combining it with LLM methods like ChatGPT to further empower spoken language understanding (SLU) systems.

Limitation
In the process of conducting experiments, we find that our method has some limitations. First, CIF-PT needs to be performed on a dataset with speech-text pairs. For small-scale datasets that only contain speech and SLU labels, our method needs an external ASR dataset to conduct the pre-training, which increases the complexity of model building. In addition, in CIF-PT we need to ensure that the tokenizer of the pre-trained language model is consistent with the tokenizer of the ASR task. However, there is usually a gap between the two in terms of vocabulary size. For the sake of performance, it is necessary to modify the tokenizer of one or both sides.

Figure 1: Architecture of our end-to-end SLU model with CIF-PT: (a) shows the procedure of CIF-PT including the ASR task with CIF alignment and token-level language model distillation; (b) shows the model structure of SLU decoder used for SLU training, including intent decoder and slot decoder for IC and SF, respectively.
Figure 2: (a) and (b) depict the performance fluctuation of different LMD methods on the two SLU tasks as the weight λ of LMD loss changes. (c) depicts the performance fluctuation of the contrastive LMD method as the temperature scalar τ changes.

Figure 3: Model structure of our ASR decoder and SLU decoders. Different neural interfaces are depicted in the ASR decoder. The details of the tag-based slot decoder and the generation decoder are also included in this figure. $c_i$ in this figure could be replaced by other interfaces, which are investigated in Table 3.
, $x$ is encoded into speech features $\{c_i\}_{i=1}^{N}$ by the speech encoder and the CIF part. $y$ is encoded by BERT into contextual representation vectors $\{h^t_i\}_{i=1}^{N}$. Since $\{c_i\}$ are aligned to tokens, token-level knowledge distillation can be directly
largest SLU benchmark and is also linguistically more diverse than other datasets. It is collected for developing an in-home personal robot assistant. The train, development and test splits from the SLURP paper are used for the training and evaluation of our methods. In addition to using the in-domain SLURP data for pre-training, we also introduce the Librispeech (Panayotov et al., 2015) dataset, which contains 960 hours of speech derived from audiobooks, as the out-of-domain pre-training dataset; it is only used in Section 5.4. All speech data is re-sampled to or kept at 16 kHz, and all text data is converted into a sequence of subword units by the subword-nmt (Sennrich et al., 2016) toolkit. Specifically, we generate 10706 subword units by performing 36000 merge operations on the training set of Librispeech, and use the learned BPE as the only tokenizer for the text of all datasets.

Table 2: Ablation study on the proposed CIF-PT. For fair comparison, models in this table use the same structure of SLU decoder. For M7, where CIF is ablated, the frame-level outputs of the speech encoder are passed directly to the SLU decoder. M8 follows the model structure of M7 but performs CTC Pre-Training (CTC-PT) on the ASR task before training on SLU. The model structure of M4 is similar to M8 except that it uses conformer as its speech encoder. All models are pre-trained and fine-tuned only on the SLURP data.

Table 3: Comparison with other token-level neural interfaces summarized in (Raju et al., 2022). Here, $c_i$ represents the cross-modal representations extracted by CIF-PT. $m_i$ represents the output of the ASR decoder (Hidden Interface). $p_i$ represents the posterior predicted by the ASR decoder (Posterior Interface). $e_i$ represents the token embedding of the ASR one-best token sequence (Tied Embedding Interface). Generation and tag in the table denote the two types of slot decoder used in our model; we detail their structures in Section A.2.

Table 4: Comparison on the out-of-domain data. In the model column, Slurp and LS before the dash denote the dataset used for pre-training, where LS stands for Librispeech. Frozen and Unfrozen denote the state of the pre-trained parameters in SLU training.