Speech-Text Pre-training for Spoken Dialog Understanding with Explicit Cross-Modal Alignment

Recently, speech-text pre-training methods have shown remarkable success in many speech and natural language processing tasks. However, most previous pre-trained models are tailored for one or two specific tasks and fail to generalize to a wide range of speech-text tasks. In addition, existing speech-text pre-training methods fail to explore the contextual information within a dialog to enrich utterance representations. In this paper, we propose Speech-text Pre-training for spoken dialog understanding with ExpliCiT cRoss-Modal Alignment (SPECTRA), which is the first-ever speech-text dialog pre-training model. Concretely, to consider the temporality of the speech modality, we design a novel temporal position prediction task to capture the speech-text alignment. This pre-training task aims to predict the start and end time of each textual word in the corresponding speech waveform. In addition, to learn the characteristics of spoken dialogs, we generalize a response selection task from textual dialog pre-training to speech-text dialog pre-training scenarios. Experimental results on four different downstream speech-text tasks demonstrate the superiority of SPECTRA in learning speech-text alignment and multi-turn dialog context.


Introduction
In recent years, speech-text pre-training, which learns universal feature representations from a large training corpus (Chen et al., 2018; Li et al., 2021; Bapna et al., 2021), has achieved significant success in both uni-modal (Schneider et al., 2019; Dosovitskiy et al., 2020) and multi-modal (Lu et al., 2019; Radford et al., 2021) downstream tasks. Existing speech-text pre-training works mainly employed multi-modal self-supervised pre-training objectives, such as cross-modal masked data modeling (Li et al., 2021; Kang et al., 2022a) and cross-modal contrastive learning (Sachidananda et al., 2022; Elizalde et al., 2022), which align the speech utterance representation to the corresponding text sentence representation. Despite the remarkable progress of previous speech-text pre-training models, there are still several technical challenges to constructing an effective and unified speech-text pre-training model for spoken dialog understanding, which are not addressed well in prior works. First, previous models are mainly tailored for specific speech-text tasks, such as speech-to-text translation (Liu et al., 2020b) and spoken language understanding, failing to generalize to a wide range of speech-text tasks. Although Tang et al. (2022) proposed a unified speech-text pre-training method for speech translation and recognition, it fails to exploit the temporality of an input speech sequence and cannot learn the fine-grained speech-text alignment.

* Equal contribution. This work was conducted when Tianshu Yu and Haoyu Gao were interning at Alibaba. † Min Yang and Yongbin Li are corresponding authors. 1 For reproducibility, we release our code and pre-trained model at: https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/SPECTRA.
Second, limited exploration has been attempted to bridge the gap between plain speeches/texts and human conversations. In particular, existing speech-text pre-training methods fail to explore the context information within a dialog. Nevertheless, spoken dialog understanding needs to effectively process context information so as to help the system better understand the current utterance, since humans may omit previously mentioned entities/constraints and introduce substitutions to what has already been mentioned.
In this paper, we propose Speech-text dialog Pre-training for spoken dialog understanding with ExpliCiT cRoss-Modal Alignment (SPECTRA), which is the first-ever speech-text dialog pre-training model. We illustrate the framework of our method in Figure 1 and its details in Figure 2. The backbone of SPECTRA is composed of a text encoder, a speech encoder, and a fusion module, learning semantic/acoustic information and the interaction between them, and is pre-trained on a large-scale real-world multi-modal (speech-text) dialog corpus. We propose two pre-training objectives to learn better context-aware speech/text representations for spoken dialog understanding (Dai et al., 2022; Zhang et al., 2022b). Specifically, to consider the temporality of the speech modality, we design a novel temporal position prediction task to capture the speech-text alignment by predicting the start and end time of each textual word in the corresponding speech waveform. In addition, to learn the characteristics of spoken dialogs (Gao et al., 2023; Qian et al., 2023), we devise a cross-modal response selection objective to consider the context information within each dialog.
Our contributions are summarized as follows:

• To the best of our knowledge, we are the first to propose a speech-text dialog pre-training model for spoken dialog understanding, which fully exploits the characteristics of multi-modal (speech/text) dialogs.
• We introduce two pre-training objectives (temporal position prediction and cross-modal response selection) to effectively learn speech-text alignment and dialog context information.
• We conduct extensive experiments on five benchmark datasets belonging to four downstream speech-text tasks, including emotion recognition in conversation (ERC), multimodal sentiment analysis (MSA), spoken language understanding (SLU), and dialog state tracking (DST). We believe that the release of the pre-trained model and source code would push forward the research in this area.

Related Work
Uni-modal Pre-training In recent years, pre-trained language models (PLMs), such as BERT (Kenton and Toutanova, 2019), RoBERTa (Liu et al., 2019), and GPT (Radford et al., 2019a), have been proposed and applied to many NLP tasks, yielding impressive performances. PLMs benefit from the rich linguistic knowledge in large-scale corpora (He et al., 2022c,a). Inspired by the success of PLMs in NLP tasks, several speech pre-training models, such as Wav2vec (Schneider et al., 2019), HuBERT (Hsu et al., 2021), and WavLM (Chen et al., 2022), were proposed to learn high-quality universal speech representations from massive speech data.
Multimodal Pre-training Compared to multimodal pre-training for vision-and-language tasks, speech-text pre-training is relatively less explored.
SpeechBERT (Chuang et al., 2020) jointly trained multimodal representations based on a single BERT for spoken question answering. CTAL (Li et al., 2021) extended the original Transformer to the cross-modal setting by modifying the attention mechanism of the Transformer decoder. ST-BERT combined a pre-trained acoustic model with BERT and took phoneme posteriors and subword-level tokenized text as input. Kang et al. (2022b) explored a multimodal pre-training model in extremely low-resource data scenarios. CLAM (Sachidananda et al., 2022) employed contrastive learning and the multi-rate information inherent in audio and lexical inputs to align acoustic and lexical information. STPT (Tang et al., 2022) proposed a multi-task learning framework to integrate different modalities in speech-text pre-training. Another line of work built a spoken language understanding model that trained a semantically rich BERT-based conversation model along with a speech-based model. Different from previous works, SPECTRA is the first-ever speech-text dialog pre-training model, which bridges the gap between plain texts/speeches and human conversations.

Figure 2: The overview of SPECTRA. The left part shows the illustration of the temporal position prediction task and the cross-modal response selection task. The right part shows the overall structure of the pre-trained model.

Method
In this section, we introduce the model architecture and pre-training objectives of SPECTRA. Figure 2 shows the overall structure of our model SPECTRA, which consists of a text encoder, a speech encoder, and a modality fusion module. During pre-training, we first convert paired text and speech inputs into uni-modal embeddings, which are then fed into the text encoder and speech encoder respectively to obtain uni-modal representations. Finally, we concatenate the text representations and speech representations as the input of our modality fusion module to get fused representations for speech-text pre-training.

Data Preparation
Before diving into our model, we first prepare the input text and speech sequences. Let D = {T_1, T_2, ..., T_n} denote a conversation with n dialog turns, where each dialog turn T_i consists of a slice of raw speech waveform s_i and its corresponding text t_i = {w_{i1}, w_{i2}, ...}. Here, w_{ij} is the j-th word of t_i, and is annotated with its corresponding start/end time in the speech, denoted as s_{ij}/e_{ij}. Each sample X_i consists of the current turn T_i, the textual dialog history {t_{i-k}, ..., t_{i-2}, t_{i-1}}, and the previous speech dialog history s_{i-1}. In this way, each sample X_i consists of k+1 turns of text and 2 turns of speech, where the speeches correspond to the latest 2 turns of text. Note that we only use 2 turns of speech in pre-training for efficiency, since the length of a speech representation is much longer than that of its corresponding text representation.
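The sample construction described above can be sketched as follows (a minimal illustration; the helper name and data layout are our own, not the released code):

```python
def build_samples(dialog, k=7):
    """Construct pre-training samples from a conversation.

    `dialog` is a list of (text, speech) turn pairs. Each sample X_i keeps
    the current turn plus up to k previous text turns, but only the speech
    of the latest two turns, mirroring the paper's description.
    """
    samples = []
    for i in range(1, len(dialog)):
        lo = max(0, i - k)
        texts = [t for t, _ in dialog[lo:i + 1]]        # up to k+1 text turns
        speeches = [s for _, s in dialog[i - 1:i + 1]]  # latest 2 speech turns
        samples.append({"texts": texts, "speeches": speeches})
    return samples
```

For a 4-turn conversation with k=2, this yields 3 samples, each pairing two speech turns with at most three text turns.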

Text Embeddings
For each input element, its vector representation is a summation of the corresponding token embedding, absolute position embedding and segment embedding.
Specifically, we first concatenate all text sentences of each sample X_i in temporal order to construct the text input:

I_i = <s> t_{i-k} </s> t_{i-k+1} </s> ... </s> t_i </s>

Note that we use the special token <s> to mark the start of the whole sequence, and </s> to mark the end of each turn. Then, we encode each token in I_i using a pre-trained RoBERTa (Liu et al., 2019) tokenizer. We assign a learnable segment embedding e_{t,1} to the tokens of t_i and the last </s> token, and e_{t,0} to the rest of the tokens. The detailed tokenizing and encoding process is described in Appendix A.
We denote x_i as the input text embeddings of I_i.
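The embedding summation can be illustrated with a small sketch (the table shapes and lookup style are assumptions for illustration; the actual implementation follows RoBERTa's embedding layer):

```python
import numpy as np

def text_embeddings(token_ids, segment_ids, tok_table, pos_table, seg_table):
    # vector representation of each input element: token embedding
    # + absolute position embedding + segment embedding
    positions = np.arange(len(token_ids))
    return tok_table[token_ids] + pos_table[positions] + seg_table[segment_ids]
```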

Uni-modal Encoders
Text Encoder Inspired by the remarkable success of uni-modal pre-trained models on various downstream tasks, we employ RoBERTa (Liu et al., 2019) as our text encoder. We pass x_i into the text encoder to obtain the sequence representations:

H_{t,i} = TextEncoder(x_i)

where H_{t,i} ∈ R^{n×d_h} denotes the output hidden states of the last layer of RoBERTa, n is the length of the input I_i, and d_h is the dimension of the hidden states.
Speech Encoder We design our speech encoder based on the WavLM architecture (Chen et al., 2022) with three key modules: a feature extractor, a feature projection module, and a Transformer encoder module. The feature extractor consists of 8 temporal convolutional layers and a layer normalization. We implement the first seven convolutional layers to be the same as in WavLM, and add another convolutional layer with 512 channels, a stride of 5, and a kernel size of 5, in order to shorten the length of the output speech features. As a result, each output token of the speech features represents approximately 200ms of speech with a stride of 100ms. The feature projection module is a layer normalization followed by a fully connected layer converting the size of the speech features from 512 to d_h. The Transformer encoder module is equipped with a convolution-based relative position embedding layer and 12 WavLM Transformer layers. For each sample, we directly input the speech waveforms s_{i-1} and s_i into our speech encoder, and denote the outputs of the feature projection layer for s_{i-1} and s_i as f_{i-1} and f_i:

f_{i-1} = FeatProj(FeatExt(s_{i-1})), f_i = FeatProj(FeatExt(s_i))

Then, we obtain a speech sequence a_i by concatenating f_{i-1} and f_i together with a separation token [SEP] and a starting token [CLS]:

a_i = [[CLS]; f_{i-1}; [SEP]; f_i]

where a_i ∈ R^{(m_{i-1}+m_i+2)×d_h} denotes the concatenated sequence, and m_{i-1} and m_i are the lengths of f_{i-1} and f_i, respectively. We pass a_i as the input of the Transformer encoder module to get the speech sequence representations:

H_{s,i} = TransformerEncoder(a_i)

where H_{s,i} ∈ R^{(m_{i-1}+m_i+2)×d_h} denotes the hidden states of the last Transformer layer.
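To see why the extra stride-5 convolution yields roughly one token per 100ms, the frame count can be sketched with a rough calculation (we approximate WavLM's seven-layer conv front end as a single convolution with a 400-sample receptive field and 320-sample stride, i.e. 25ms/20ms at 16 kHz; the exact layer configuration is not reproduced here):

```python
def conv_output_length(length, kernel, stride):
    # output length of a 1-D convolution without padding
    return (length - kernel) // stride + 1

def speech_frames(num_samples):
    # approximate WavLM's 7-layer conv front end as one conv with a
    # 400-sample receptive field and 320-sample stride (25 ms / 20 ms at 16 kHz)
    wavlm_frames = conv_output_length(num_samples, kernel=400, stride=320)
    # the extra layer (kernel 5, stride 5) divides the frame rate by 5,
    # giving roughly one token per 100 ms
    return conv_output_length(wavlm_frames, kernel=5, stride=5)
```

Under these assumptions, 10 seconds of 16 kHz audio (160,000 samples) yields about 99 speech tokens, i.e. a hop of roughly 100ms per token.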

Modality Fusion Module
To integrate the two modalities, we employ a single self-attention Transformer layer as our modality fusion module. We first concatenate the text sequence representation H_{t,i} and the speech sequence representation H_{s,i} together. Then, we assign the text and speech representations learnable modality embeddings e_{m,0} and e_{m,1} respectively, and add the modality embeddings to the concatenated representations as the input of our modality fusion module. Finally, we obtain the output hidden representations of the modality fusion module, H_i ∈ R^{(n+m_{i-1}+m_i+2)×d_h}, as the speech-text joint representations.
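A minimal sketch of the fusion input construction (a NumPy stand-in for the actual Transformer inputs; broadcasting adds the modality embedding to every token):

```python
import numpy as np

def fuse_inputs(H_text, H_speech, e_m0, e_m1):
    # add the learnable modality embeddings to each token's representation,
    # then concatenate along the sequence dimension; the result is fed to
    # the single self-attention fusion layer
    return np.concatenate([H_text + e_m0, H_speech + e_m1], axis=0)
```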

Pre-training Tasks
We introduce two novel pre-training objectives for our SPECTRA model, empowering SPECTRA to capture speech-text alignment and multimodal dialog context effectively.

Temporal Position Prediction
Existing speech-text pre-training works mainly learn from prior visual-text pre-training models. These works ignore that speech is a temporal sequence, and thus fail to learn fine-grained speech-text alignment. In this work, we propose a novel temporal position prediction (TPP) objective, which utilizes the textual part of the hidden representations H_i to predict the starting and ending time of each word in the speech waveform. In particular, for each word w_{ij} in utterance t_i with its start/end time annotations s_{ij}/e_{ij}, we denote the hidden states of its first/last token in H_i as h^s_{ij}/h^e_{ij}. The goal of the TPP pre-training objective is to predict the starting and ending time in s_i from h^s_{ij} and h^e_{ij}, respectively. We use a squared error loss to optimize the TPP task:

L^{ij}_{TPP} = (h^s_{ij} W_start − s_{ij} / L_a)^2 + (h^e_{ij} W_end − e_{ij} / L_a)^2

where W_start, W_end ∈ R^{d_h×1} are learnable parameters, and L_a is the maximum speech length limit. By normalizing s_{ij} and e_{ij} over L_a, we guarantee that the starting and ending times fall into [0,1].
Here, we only calculate the TPP loss for the words in the last two turns of dialog (i.e., t_{i-1} and t_i) for each sample X_i. We calculate the average TPP loss over all words within those two turns as the TPP loss of dialog X_i:

L_{TPP} = (1 / (l_{i-1} + l_i)) Σ_{w_{ij} ∈ t_{i-1} ∪ t_i} [ (h^s_{ij} W_start − s_{ij} / L_a)^2 + (h^e_{ij} W_end − e_{ij} / L_a)^2 ]

where l_{i-1} and l_i denote the total lengths (in words) of the transcripts t_{i-1} and t_i in sample X_i.
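The averaged TPP loss can be sketched as follows (a NumPy illustration of our reading of the description above; we use weight vectors of shape (d_h,) instead of (d_h, 1) for simplicity):

```python
import numpy as np

def tpp_loss(h_start, h_end, starts, ends, W_start, W_end, L_a):
    # h_start/h_end: (num_words, d_h) hidden states of each word's
    # first/last token; starts/ends: ground-truth times in the waveform
    pred_s = h_start @ W_start
    pred_e = h_end @ W_end
    # squared error against targets normalized into [0, 1] by L_a,
    # averaged over all words in the last two turns
    return np.mean((pred_s - starts / L_a) ** 2 + (pred_e - ends / L_a) ** 2)
```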

Cross-modal Response Selection
Inspired by the success of response selection tasks in textual dialog systems (Bao et al., 2019), we design a cross-modal response selection (CRS) objective. For each sample X_i, we randomly replace the text query t_i or the speech query s_i with utterances or speech from other dialogs in the dataset. In this way, for each sample X_i, we can obtain three kinds of corrupted samples as negatives: (1) only the speech query is randomly substituted; (2) only the text query is randomly substituted; (3) both the text and speech queries are randomly substituted. A sample in which both the text and speech queries remain unchanged serves as the positive, as illustrated in Figure 2. Since the output of the first <s> token can be viewed as the representation of the whole speech-text sample, we apply a softmax function following a fully connected layer on top of the hidden state of the token <s> as a four-way classifier, predicting which case the current example belongs to. We utilize the cross-entropy loss to optimize the cross-modal response selection task, denoted as L_CRS.
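The negative-sample construction for CRS can be sketched as follows (the label ordering, field names, and helper are assumptions for illustration, not the released code):

```python
import random

def make_crs_example(sample, pool, rng):
    # labels: 0 = positive (queries unchanged), 1 = speech substituted,
    # 2 = text substituted, 3 = both substituted (ordering is an assumption)
    label = rng.randrange(4)
    text_q, speech_q = sample["text_query"], sample["speech_query"]
    if label in (1, 3):
        speech_q = rng.choice(pool)["speech_query"]
    if label in (2, 3):
        text_q = rng.choice(pool)["text_query"]
    return {"text_query": text_q, "speech_query": speech_q, "label": label}
```

The four-way classifier on the <s> hidden state is then trained to recover `label` with cross-entropy.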

Cross-modal Masked Data Modeling
Following previous works (Li et al., 2021), we also adopt the cross-modal representations H_i for cross-modal masked language modeling (CMLM) and cross-modal masked acoustic modeling (CMAM) objectives. For masked language modeling, we follow the setup of RoBERTa (Liu et al., 2019) and dynamically mask out textual input tokens with a probability of 15%. For masked acoustic modeling, we follow Baevski et al. (2020) and Liu et al. (2020a) and mask continuous speech frames.
We modify the implementation of the original masked acoustic modeling method in previous works to increase the average number of masked speech frames in each sample. We provide the details of masked acoustic modeling in Algorithm 1 in Appendix B. The speech token masking step is performed between the feature extractor and feature projection. We employ the cross-entropy loss for the CMLM task (L CMLM ) and the mean absolute error loss for the CMAM task (L CMAM ).
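Span-style continuous frame masking in the spirit of wav2vec 2.0 can be sketched as follows (the probability and span length are illustrative defaults, not the paper's exact configuration, which is given in Algorithm 1 of Appendix B):

```python
import random

def mask_spans(num_frames, mask_prob=0.065, span_len=10, rng=None):
    """Pick continuous spans of speech frames to mask: each frame is a
    span start with probability `mask_prob`, and `span_len` consecutive
    frames are masked from each chosen start (spans may overlap)."""
    rng = rng or random.Random(0)
    masked = set()
    for start in range(num_frames):
        if rng.random() < mask_prob:
            masked.update(range(start, min(start + span_len, num_frames)))
    return sorted(masked)
```

Raising `mask_prob` or `span_len` increases the average number of masked frames per sample, which is the effect of the modification described above.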

Joint Pre-training Objective
We combine the four pre-training objectives to form a joint pre-training objective for speech-text pre-training:

L = L_TPP + L_CRS + L_CMLM + L_CMAM

Fine-tuning on Downstream Tasks
We fine-tune SPECTRA on four downstream tasks, including multimodal sentiment analysis (MSA), emotion recognition in conversation (ERC), spoken language understanding (SLU), and dialog state tracking (DST).
We use the hidden state of the <s> token in H_i, denoted as h_i, and pass it through a prediction head with two fully connected layers and a GELU activation (Hendrycks and Gimpel, 2016) between them to get the prediction:

y_i = σ(h_i W^(1)) W^(2)

where σ denotes the GELU activation function, and W^(1) ∈ R^{d_h×d_h} and W^(2) ∈ R^{d_h×d_o} are new learnable parameters in the fine-tuning stage. The output size d_o for the MSA task is 1, and for ERC and SLU it is the corresponding number of classes. We adopt the squared error loss as the fine-tuning loss function for MSA. The cross-entropy loss is utilized for the rest of the tasks.
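The prediction head can be sketched as follows (bias terms omitted for brevity; the tanh approximation of GELU is used to keep the sketch self-contained):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (Hendrycks & Gimpel, 2016)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def prediction_head(h, W1, W2):
    # two fully connected layers with a GELU in between, applied to the
    # hidden state h of the <s> token
    return gelu(h @ W1) @ W2
```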

Pre-training Data
In this paper, we adopt Spotify100K (Clifton et al., 2020), a real-world speech-text dialog dataset, to pre-train SPECTRA. Spotify100K contains 105,360 podcast episodes with nearly 60,000 hours of speech, covering a variety of genres, subject matters, speaking styles, and structural formats. The corpus also provides automatically generated word-level textual transcripts that mark the starting and ending time in the speech for each word.
For a fair comparison with previous speech-text pre-training studies, we only use the first 960 hours of speech as well as the corresponding transcripts to pre-train our SPECTRA model.

Experimental Setup
Baselines In addition to state-of-the-art downstream models tailored for MSA, ERC, SLU, and DST (see Section 4.3-4.6), we also compare SPECTRA with three types of pre-training models, including the text-modality pre-training model RoBERTa (Liu et al., 2019), the speech-modality pre-training model WavLM (Chen et al., 2022), and the speech-text multimodal pre-training model CTAL (Li et al., 2021).

Experimental Settings during Pre-training
We use the first 960 hours of speech and textual transcripts of the Spotify100K dataset for pre-training. We cut the speech waveform into slices of a maximum length of 10 seconds and view each slice with the corresponding transcripts as a single dialog turn, forming 356,380 dialog turns in total. By using these dialogs and setting k to a maximum of 7, we construct 350,784 samples, where each sample consists of 2~8 dialog turns of text and 2 turns of speech. Besides, we use the pre-trained models RoBERTa-base and WavLM-base+ to initialize our text and speech encoders, respectively. Since our speech encoder has one more convolutional layer than WavLM-base+, we only initialize the first seven convolutional layers with pre-trained parameters and randomly initialize the last layer. Both text and speech encoders have 12 Transformer layers with a hidden size d_h of 768. We pre-train our SPECTRA model for 100 epochs on 8 Tesla-A100 GPUs with a batch size of 20 per GPU. We use AdamW (Loshchilov and Hutter, 2018) to optimize our model with a peak learning rate of 1 × 10^-4 and a linear warmup for the first 1% of updates.

Experimental Settings during Fine-tuning
For the SpokenWoz dataset, each dialog turn consists of two utterances, one from the user and the other from the system. For the other datasets, each dialog turn is a single utterance. For all datasets, we truncate the speech length of each dialog turn to a maximum of 10 seconds. We fine-tune our pre-trained checkpoint on each downstream dataset using the AdamW (Loshchilov and Hutter, 2018) optimizer with a peak learning rate of 2 × 10^-5 and a cosine annealing warmup.

Fine-tuning on MSA
For the MSA task, our model aims to predict the positive or negative sentiment polarity of the given multi-modal input. We conduct experiments on two multi-modal datasets, MOSI (Zadeh et al., 2016) and MOSEI (Zadeh et al., 2018), to evaluate the effectiveness of our model on the MSA task. We adopt the accuracy of positive/negative sentiment classification (denoted as Acc2) as the evaluation metric for our model and the baselines. The experimental results are reported in Table 1.
From the results, we can observe that our model achieves substantially better performance than previous state-of-the-art (SOTA) methods on both datasets. In particular, for the MOSI dataset, the accuracy increases by 3.10% over the strongest baseline MIB (Mai et al., 2022). In addition, as shown in Table 2, our SPECTRA also significantly outperforms the speech modality pre-training model WavLM and speech-text pre-training model CTAL.

Fine-tuning on ERC
The ERC task requires the model to predict the emotion category of an utterance given a speech clip with its transcripts and dialog history. Here, we fine-tune our model on the widely-used IEMOCAP dataset (Busso et al., 2008), and follow the settings of Chudasama et al. (2022) to perform a 6-way classification task. For each sample, we construct 11 turns of text and 2 turns of speech with a maximum text length of 512.
In Table 1, we report the accuracy of six-way classification for our model and the previous SOTA method M2FNET (Chudasama et al., 2022). In addition, from Table 2, we can observe that our method outperforms the uni-modal pre-training models, as well as the speech-text pre-training baseline CTAL. Compared with the uni-modal baselines RoBERTa and WavLM, our model benefits from multi-modal pre-training tasks that capture interactions and alignments between modalities. Compared with CTAL, our model learns better speech-text alignment and multi-turn dialog context information with the help of the TPP and CRS pre-training tasks.

Fine-tuning on SLU
We also conduct experiments on the spoken language understanding (SLU) task, which aims to predict the user intent (Lin and Xu, 2019) given a spoken utterance with its textual transcript. We use MIntRec (Zhang et al., 2022a) as the experimental dataset for SLU and adopt classification accuracy as the evaluation metric. From Tables 1 and 2, we can observe that SPECTRA obtains significantly better results than previous methods. In particular, our SPECTRA model improves the results of RoBERTa and the previous SOTA method MAG-BERT (Rahman et al., 2020) by 1.55% and 2.47%, respectively. Compared to WavLM and CTAL, our model better captures the semantic information in textual data and the context information within each dialog.

Fine-tuning on DST
For dialog state tracking, we use an in-house, large-scale, cross-modal dataset called SpokenWoz. The dataset was collected by crowdsourcing recordings through phone calls using the Appen platform (https://appen.com/). Transcriptions were obtained using a commercial ASR system, and speech-text pairs were annotated using a schema similar to MultiWoz (Eric et al., 2019). SpokenWoz consists of 204k turns, 5.7k dialogs, and 249 hours of recordings. We adopt joint goal accuracy (JGA) as the evaluation metric, which compares the predicted and ground-truth dialog states at each turn. We follow TripPy (Heck et al., 2020) and substitute its context model BERT with our SPECTRA model.
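Joint goal accuracy can be computed as follows (a standard sketch; representing each turn's dialog state as a slot-value dict is an assumption about the data format):

```python
def joint_goal_accuracy(pred_states, gold_states):
    """Fraction of turns whose full predicted dialog state (all slot-value
    pairs) exactly matches the ground truth; partial credit is not given."""
    assert len(pred_states) == len(gold_states)
    correct = sum(p == g for p, g in zip(pred_states, gold_states))
    return correct / len(gold_states)
```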
As shown in Table 1, our model outperforms the previous SOTA method, SPACE+WavLM. In addition, our model also surpasses the three pre-training baselines by a noticeable margin. This demonstrates that better speech-text alignment is critical to tackling complicated conversations.

Ablation Study
To better understand the effectiveness of our SPECTRA pre-training method, we investigate the influence of the pre-training components and the dialog history on the overall performance of SPECTRA. We report the ablation test results in Table 2.
Impact of Pre-training To demonstrate the effectiveness of multi-modal pre-training, we directly use the uni-modal encoders and randomly initialize the modality fusion module. We observe a significant performance drop by comparing (a) "w/o multi-modal pre-training" to the other pre-training settings on all five datasets. In particular, setting (a) directly collapses on the ERC task, which is a complicated and conversational scenario. This verifies the necessity of cross-modal pre-training and aligning the speech and text modalities. In addition, by comparing SPECTRA with setting (b) "using less pre-training data", we find that using more pre-training data can further improve the performance of our model.

Impact of TPP and CRS By comparing setting (c) "w/o TPP" to SPECTRA, the performance on all five datasets drops to different extents, which verifies the generalization and effectiveness of our TPP pre-training task. Specifically, the performance drops significantly on SpokenWoz, which requires the model to have a stronger ability to align the two modalities. This demonstrates that our TPP pre-training task empowers the model with stronger alignment modeling ability. For setting (d) "w/o CRS", the performance drops significantly on multi-turn dialog tasks such as ERC and DST compared with SPECTRA. This suggests that the CRS task is essential for modeling multi-turn dialog context.

Impact of Dialog History
In setting (e) "using 1 turn of textual dialog history", each instance consists of 2 turns of paired speech and text. The model performance drops substantially on the ERC and DST downstream tasks compared with SPECTRA. This demonstrates that increasing the dialog history in the pre-training stage is beneficial to tasks that require multi-turn dialog context.

Case Study
To gain an intuitive understanding of how our proposed SPECTRA model learns cross-modal interaction, we conduct a case study on two cases sampled from the MIntRec dataset. These two cases are incorrectly predicted by the model pre-trained without TPP but correctly predicted by our SPECTRA model. In Figure 3, we visualize the self-attention weights of the fusion layer in our model as well as in the model pre-trained without TPP (denoted as w/o TPP). From Figures 3(a) and 3(c), we observe that there are rich cross-modal interactions in the fusion layer of the proposed SPECTRA model. Our model can capture fine-grained information between text and speech for more accurate classification. In contrast, we also visualize the self-attention weights of the w/o TPP model in Figures 3(b) and 3(d). Both cases show that the text and speech sequences seldom connect to each other in the self-attention layers.
In Table 3, we also show the intent prediction results obtained by SPECTRA and w/o TPP. From the results, we can observe that our model attends to both text and speech sequences effectively to predict the correct intent. However, the w/o TPP model is confused by wrong labels since it hardly attends to the speech tokens, which indicates that it tends to omit useful information that exists exclusively in speech.

Conclusion
In this paper, we proposed SPECTRA, the first speech-text dialog pre-training model. Considering the temporality of the speech and text modalities, we introduced a novel temporal position prediction pre-training task to learn word-level speech-text alignment. To capture multi-modal dialog context, we generalized the response selection task to multi-modal scenarios. Extensive experiments show that our pre-training method learns better cross-modal interactions as well as multi-modal contextual information, and significantly outperforms other strong baselines. In the future, we would like to extend speech-text dialog pre-training to more modalities or generative tasks.

Limitations
We analyze the limitations of this work, so as to further improve our model in future work. Based on our empirical observations, we identify three primary limitations. (1) First, our proposed SPECTRA method relies on large-scale spoken dialog corpora with explicit word-level speech-text alignment annotations, such as Spotify100K. This limits the generality of our model on other spoken dialog corpora. In the future, we would like to develop a semi-supervised pre-training method to leverage both labeled and unlabeled datasets. (2) Second, our method is mainly designed for speech-text understanding and has not been fully explored for generative tasks. We plan to devise a dialog generation pre-training objective to empower the model with better generation ability. (3) Third, this work only involves the speech and text modalities. We are interested in handling more modalities, such as images or videos, to enrich the cross-modal information in the joint representations.