CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations

Existing audio-language task-specific predictive approaches focus on building complicated late-fusion mechanisms. However, these models face the challenges of overfitting with limited labels and low model generalization ability. In this paper, we present a Cross-modal Transformer for Audio-and-Language, i.e., CTAL, which aims to learn the intra-modality and inter-modality connections between audio and language through two proxy tasks on a large amount of audio-and-language pairs: masked language modeling and masked cross-modal acoustic modeling. After fine-tuning our pre-trained model on multiple downstream audio-and-language tasks, we observe significant improvements across various tasks, such as emotion classification, sentiment analysis, and speaker verification. On this basis, we further propose a specially-designed fusion mechanism that can be used in the fine-tuning phase, which allows our pre-trained model to achieve better performance. Lastly, we present detailed ablation studies showing that both our novel cross-modality fusion component and our audio-and-language pre-training methods significantly contribute to the promising results.


Introduction
Speech processing requires understanding a range of acoustic and language content, including phonemes, tone, words, and semantic meaning. Unlike humans, who can benefit from self-learning through real-world experiences, current speech processing methods are like narrow experts that rely heavily on large amounts of task-specific human annotations; a small change in the learning problem means they have to start all over again. In recent years, pre-training for the single modalities of natural language processing (NLP) and audio signal processing has been widely explored due to its ease of use and competent generalization to various downstream tasks.
In the field of NLP, pre-trained models such as BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), XLNet (Yang et al., 2019) and GPT2 (Radford et al., 2019) share the same idea of first leveraging a large-scale unlabeled corpus to perform contextualized language model (LM) pre-training, then fine-tuning to adapt to downstream tasks such as machine reading comprehension (Lai et al., 2017), question answering (Rajpurkar et al., 2016) and natural language inference (Bowman et al., 2015), which has substantially advanced the state-of-the-art results. Following the success of pre-training in NLP, BERT-like models have also been applied in the audio processing community (Schneider et al., 2019; Baevski et al., 2019, 2020), learning robust audio representations through an audio-style self-supervised context prediction task.
Despite these influential single-modal methods, for tasks at the intersection of audio and language, such as speech emotion classification (Livingstone and Russo, 2018; Busso et al., 2008), speaker verification (Panayotov et al., 2015) and sentiment analysis (Zadeh et al., 2018), large-scale pre-training for the modality pair of audio and language is barely explored. The previous approach has been to train task-specific experts upon the concatenation of language representations and audio representations in a late-fusion manner (Ramirez et al., 2011; Glodek et al., 2011; Zadeh et al., 2017; Yoon et al., 2018, 2019; Xu et al., 2019), without any generic audio-and-language pre-training. These task-specific experts suffer from overfitting when trained with limited data. Also, due to the heterogeneities across the language and audio modalities, late fusion of high-level representations lacks surface-level cross-modal alignment and complementation during a pre-training phase.
Motivated by these observations, we propose CTAL, a pre-trainable generic representation for audio-and-language, and show its strong performance on three established audio-and-language tasks: emotion classification (Busso et al., 2008), sentiment analysis (Zadeh et al., 2018) and speaker verification (Panayotov et al., 2015). We propose a multimodal Transformer as our backbone model, which consists of two modules: a language stream encoding module that takes words as input elements, and a text-referred audio stream encoding module that accepts both frame-level Mel-spectrograms and token-level output embeddings from the language stream encoding module as input elements. In order to learn both intra-modality and inter-modality connections, we pre-train our model with two tasks: (1) masked language modeling; (2) masked cross-modal acoustic modeling. Different from single-modality pre-training (e.g., Masked Acoustic Modeling (MAM) in MOCKINGJAY (Liu et al., 2020b)), our cross-modal pre-training enables the model to reconstruct masked audio features from both intra-modality and inter-modality information. On the basis of our pre-trained model, a regularization term based on feature orthogonality is introduced during the fine-tuning stage, designed to ensure that features of different modalities provide information from different perspectives; note that this orthogonal regularization mechanism is general and not limited to audio-language tasks.
The main contributions of our paper are as follows: • We present CTAL, a pre-training framework for strong audio-and-language representations with Transformer, which is helpful in learning both intra-modality and inter-modality connections. To the best of our knowledge, we are the first to introduce pre-training across the audio and language modalities.
• We propose a novel cross-modality fusion mechanism for the fine-tuning stage, which forces our pre-trained model to learn composite features from different views.
• Comprehensive empirical evidence demonstrates that CTAL achieves state-of-the-art results on various downstream speech understanding tasks, such as emotion classification, sentiment analysis, and speaker verification. We conduct detailed ablation studies and analysis to prove the effectiveness of our model components and pre-training strategies.
Related Work

Self-Supervised Uni-modal Pre-training
There has been long-standing interest in NLP around self-supervised representation learning. Early works explored approaches to improve word embeddings (Mikolov et al., 2013; Le and Mikolov, 2014; Pennington et al., 2014), a low-level linguistic representation. After that, pre-trained NLP models based on multi-layer Transformers, such as BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), XLNet (Yang et al., 2019) and GPT2 (Radford et al., 2019), benefit from context-sensitive representation learning on large-scale corpora and show significant improvements on various downstream language understanding tasks. Self-supervised learning in audio signal processing has also shown increasing promise. Following BERT, many approaches (Jiang et al., 2019; Liu et al., 2020a,b; Chi et al., 2020) have been proposed to learn high-level acoustic representations rather than surface features such as log Mel-spectrograms or waveforms, which can reveal the abundant information within audio signals.

Multimodal Pre-training
While pre-training for audio-and-language representations has rarely been studied, several attempts have been made to pre-train models for vision-and-language tasks on the Visual Question Answering (Antol et al., 2015) and Visual Commonsense Reasoning (Zellers et al., 2019) datasets. In general, these vision-and-language pre-training methods can be divided into two categories, according to their encoder architectures. (a) Prior works like ViLBERT (Lu et al., 2019) and LXMERT (Tan and Bansal, 2019) apply two single-modal networks to encode the input text and images respectively and adopt cross-modal interactions in a symmetric fusion manner. (b) The other category of pre-training frameworks, like VisualBERT (Li et al., 2019) and Unicoder-VL (Li et al., 2020), concatenates vision and language features as a unified single-stream input and utilizes a universal encoder to learn joint multimodal representations. Notably, transferring the above algorithms directly from the vision-and-language to the audio-and-language field faces challenges: (1) a unified architecture is not suitable for the audio-language modality pair, since both text and audio waveforms are generally long sequences, and cross-modal aggregation at the very beginning with the Transformer self-attention mechanism leads to higher computational complexity; (2) speech audio is more informative than language text, as it contains both the semantic information of the speech text and the personal feelings of the speakers. Thus, it is not suitable to apply the symmetric cross-modal fusion modules proposed in prior vision-and-language pre-training research. Based on these facts, we design our backbone model with a language stream encoding module and a text-referred audio stream encoding module, which allow the necessary intra- and inter-modality connections during pre-training at lower computational cost.
The closest work to our approach is from Haque et al. (2019), and our approach differs from it in two ways: (1) we use a more explicit, multi-component design for the cross-modality connections (i.e., a text-referred audio stream encoding module and a novel cross-modality fusion component); (2) we employ different pre-training tasks that accept both text and audio frames as input to conduct contextualized masked language modeling and masked cross-modal acoustic modeling, while the previous work only adopts audio as input and formulates a multitask learning problem by reconstructing linguistic and acoustic features from a hidden speech embedding during pre-training.

Approach
In this section, we first present our cross-modal pre-training framework CTAL, including details of the text and audio preprocessing and the encoding modules for the separate modalities. Then we present our pre-training tasks. In the end, we propose our novel fusion mechanism, which can be utilized in the fine-tuning stage. Following convention, we use bold upper-case letters to represent matrices and bold lower-case letters to represent vectors.

CTAL
We build our cross-modal Transformer by extending the original Transformer (Vaswani et al., 2017) into the multimodal paradigm. As shown in Figure 1, CTAL takes an audio sequence and its corresponding text as input. Each audio clip is represented as a sequence of frames, and each text is represented as a sequence of tokens. We encode the input into linguistic and audio embeddings, and feed them into a text encoding module and a text-referred audio encoding module respectively to generate the final language representations and text-referred audio representations. Following the formulation and notation of Vaswani et al. (2017), we use Q, K and V as the queries, keys and values of the attention mechanism, MultiHead(Q, K, V) as multi-head attention, FFN(X) as the position-wise feed-forward network, and LayerNorm(X) as layer normalization. Next, we describe the components in detail.

Input Embeddings
Linguistic Embedding To encode any input text with a modest-size (30K units) subword vocabulary, we follow the text preprocessing of RoBERTa, which tokenizes each input text w = {w_0, w_1, ..., w_T} with byte-level Byte-Pair Encoding (BBPE) (Radford et al., 2019). We also add the special tokens <s> and </s> to represent the start and end tokens, as is common practice; T is the total length of the input tokens. We then sum each token embedding and its corresponding position embedding to form the final input token embeddings {e^w_0, e^w_1, ..., e^w_T} for the language modality.

Audio Embedding
The input audio signal is first transformed into frames of width 50ms with a step of 12.5ms. Then 80-dimensional Mel-spectrogram features are extracted from each frame and concatenated with their first-order derivatives, bringing the feature dimension to 160. In this way, the raw signal is converted into a sequence of frame-level acoustic surface features {a_0, a_1, ..., a_T}, where T is the total number of frames. For simplicity, we refer to this audio feature sequence as the input acoustic features hereafter. Finally, we feed these surface features through a projection layer and add the position embeddings to obtain the input audio embeddings {e^a_0, e^a_1, ..., e^a_T} for the audio modality.
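As a concrete check of the framing arithmetic above, the sketch below computes how many 50ms/12.5ms frames a clip yields. The 16 kHz sampling rate is an assumption (the paper does not state it); each resulting frame then carries 80 Mel bins plus their first-order derivatives, for 160 dimensions.

```python
def frame_count(num_samples: int, sr: int = 16000,
                win_ms: float = 50.0, hop_ms: float = 12.5) -> int:
    """Number of frames produced by a sliding window of width `win_ms`
    with step `hop_ms` (framing parameters from the paper; the sampling
    rate `sr` is an assumed value)."""
    win = int(sr * win_ms / 1000)   # samples per frame, e.g. 800 at 16 kHz
    hop = int(sr * hop_ms / 1000)   # samples per step, e.g. 200 at 16 kHz
    if num_samples < win:
        return 0
    return 1 + (num_samples - win) // hop

# A 1-second clip at 16 kHz yields 77 frames; each frame is then mapped
# to an 80-dim Mel vector concatenated with its delta -> 160 dims.
```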

Text Encoding Module
As shown in Figure 1, we apply the original Transformer encoder to the language stream inputs; each language stream encoding layer consists of one multi-head self-attention sublayer and one position-wise feed-forward sublayer. We stack N such language encoding layers in our text encoding module, using the output of the (k−1)-th layer as the input to the k-th layer, and we initialize H^0_w with {e^w_0, e^w_1, ..., e^w_T}. The final output of the language stream encoding module is H^N_w ∈ R^{T×d_w}, where d_w denotes the hidden size of the language stream representations. The first token of every text sequence is always a special start token (<s>), and the final hidden state corresponding to this token is used as the aggregated text sequence representation for classification tasks.
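Concretely, each language encoding layer follows the standard Transformer encoder update of Vaswani et al. (2017). The equations below are a reconstruction of that standard form; the exact residual/normalization placement is assumed rather than taken from the paper:

```latex
\tilde{\mathbf{H}}^{k}_{w} = \mathrm{LayerNorm}\!\left(\mathbf{H}^{k-1}_{w} + \mathrm{MultiHead}\!\left(\mathbf{H}^{k-1}_{w},\, \mathbf{H}^{k-1}_{w},\, \mathbf{H}^{k-1}_{w}\right)\right)

\mathbf{H}^{k}_{w} = \mathrm{LayerNorm}\!\left(\tilde{\mathbf{H}}^{k}_{w} + \mathrm{FFN}\!\left(\tilde{\mathbf{H}}^{k}_{w}\right)\right)
```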

Text-Referred Audio Encoding Module
For the text-referred audio encoding module, we first initialize the hidden representations H^0_a with {e^a_0, e^a_1, ..., e^a_T} and pass them through a stack of N text-referred audio encoding layers to acquire the final audio stream representations H^N_a. Our text-referred audio encoding module differs from the original Transformer decoder in two of its multi-head attention mechanisms. First, in order to learn a bi-directional intra-modality representation for audio, we remove the future mask from the masked multi-head self-attention. Second, we apply multi-head cross-modal attention, which accepts the final language stream representations as keys and values in each layer to model the inter-modality interactions. Finally, we obtain the text-referred audio representation of the N-th layer, H^N_a ∈ R^{T×d_a}, where d_a denotes the hidden size of the audio stream representations.
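Writing out the l-th text-referred audio encoding layer under the same standard conventions (again a reconstruction, with assumed residual/normalization placement): an unmasked bi-directional self-attention, a cross-modal attention that uses the final language representations H^N_w as keys and values, and a feed-forward sublayer compose as:

```latex
\tilde{\mathbf{H}}^{l}_{a} = \mathrm{LayerNorm}\!\left(\mathbf{H}^{l-1}_{a} + \mathrm{MultiHead}\!\left(\mathbf{H}^{l-1}_{a},\, \mathbf{H}^{l-1}_{a},\, \mathbf{H}^{l-1}_{a}\right)\right)

\hat{\mathbf{H}}^{l}_{a} = \mathrm{LayerNorm}\!\left(\tilde{\mathbf{H}}^{l}_{a} + \mathrm{MultiHead}\!\left(\tilde{\mathbf{H}}^{l}_{a},\, \mathbf{H}^{N}_{w},\, \mathbf{H}^{N}_{w}\right)\right)

\mathbf{H}^{l}_{a} = \mathrm{LayerNorm}\!\left(\hat{\mathbf{H}}^{l}_{a} + \mathrm{FFN}\!\left(\hat{\mathbf{H}}^{l}_{a}\right)\right)
```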

Masked Language Modeling (MLM)
For the language stream, we use the masked language modeling task for language intra-modality learning. As shown in Figure 1, the MLM task setup is almost the same as in RoBERTa (Liu et al., 2019): we dynamically mask out the input tokens with a probability of 15%. Masked tokens are replaced with a special <mask> token 80% of the time, a random token 10% of the time, and left unaltered 10% of the time. The goal of MLM is to predict these masked tokens based on the observed tokens. Note that we do not introduce acoustic information for masked token prediction, since the semantic information of the language text can be well captured through the language input alone, and introducing cross-modal inputs is redundant, which we demonstrate through our later ablation study.
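The dynamic masking procedure can be sketched as follows. The helper name `dynamic_mask` and the toy `VOCAB` are illustrative, not the paper's implementation (the real model draws random replacements from its 30K BBPE vocabulary):

```python
import random

MASK = "<mask>"
VOCAB = ["the", "a", "of", "and", "speech"]  # toy vocabulary for illustration

def dynamic_mask(tokens, p=0.15, rng=None):
    """RoBERTa-style dynamic masking: each token is independently
    selected with probability p; a selected token is replaced by <mask>
    80% of the time, by a random token 10% of the time, and kept
    unchanged 10% of the time. Returns the corrupted sequence and the
    per-position prediction targets (None = position not masked)."""
    rng = rng or random.Random()
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < p:
            targets.append(tok)          # the model must recover the original token
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK)
            elif r < 0.9:
                corrupted.append(rng.choice(VOCAB))
            else:
                corrupted.append(tok)
        else:
            targets.append(None)
            corrupted.append(tok)
    return corrupted, targets
```

Because the masking is re-sampled every epoch ("dynamic"), the model sees different corruption patterns of the same sentence over training.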

Masked Cross-modal Acoustic Modeling (MCAM)

For the audio stream, we propose MCAM to train the text-referred audio representations. Prior research by Baevski et al. (2020) indicates that the performance of acoustic pre-trained models on downstream tasks improves as the spans of continuously masked frames grow during the pre-training phase. However, due to the complexity of audio signals, the long-term dependencies in audio sequences are hard to capture with acoustic features alone. To mitigate this problem, we propose MCAM to capture effective information from audio by learning both intra- and inter-modality connections between audio and language.
To implement MCAM, we first split the audio into segments of C_num consecutive frames each, where C_num is uniformly sampled from 20 to 50. Then we randomly select 15% of these segments; for each of them, we set it entirely to zero 80% of the time, replace it with C_num other randomly selected frames from the same audio 10% of the time, and keep it unchanged in the remaining cases. In this manner, we prevent the model from exploiting the local smoothness of acoustic frames, and the model is required to make inferences based on global information rather than local cues. Finally, the goal is to reconstruct the masked acoustic features {a_i | i ∈ T_mask} based on the remaining acoustic features and the language stream prompt, by minimizing the L1 loss between the original masked acoustic features and the predicted ones.
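A minimal sketch of the segment-level masking is below. Whether C_num is sampled once per utterance or once per segment is ambiguous in the text, so this sketch samples it once per utterance; the function name and list-of-lists frame representation are illustrative:

```python
import random

def mcam_mask(frames, rng, p=0.15, c_min=20, c_max=50):
    """Segment-level corruption for MCAM (a sketch). `frames` is a list
    of feature vectors (lists of floats). Each segment of c consecutive
    frames is selected with probability p; a selected segment is zeroed
    80% of the time, swapped with other random frames from the same
    utterance 10%, and left unchanged 10%. Returns the corrupted frames
    and the indices whose originals must be reconstructed (L1 loss)."""
    c = rng.randint(c_min, c_max)            # frames per segment
    out = [f[:] for f in frames]
    masked_idx = []
    for start in range(0, len(frames), c):
        seg = list(range(start, min(start + c, len(frames))))
        if rng.random() >= p:
            continue
        masked_idx.extend(seg)
        r = rng.random()
        if r < 0.8:                          # zero out the whole segment
            for i in seg:
                out[i] = [0.0] * len(frames[i])
        elif r < 0.9:                        # swap in other random frames
            for i in seg:
                out[i] = frames[rng.randrange(len(frames))][:]
        # else: keep the segment unchanged (still a prediction target)
    return out, masked_idx
```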
Overall, our final pre-training objective is to minimize the sum of the losses above.

Fine-Tuning CTAL
CTAL is designed to be a generic pre-training model for various audio-language tasks, and it is relatively simple to fine-tune CTAL for various downstream tasks with just one additional output layer. To further combine information from the different modalities, we propose a novel and flexible fusion mechanism for the fine-tuning stage. We denote H^N_w ∈ R^{T×d} and H^N_a ∈ R^{T×d} as the final representations from the text encoding module and the text-referred audio encoding module, and we assume that both modules have the same hidden size d.
To fine-tune on speech understanding tasks, we need to compress the input sequence (for both the language and audio streams) into a single hidden vector. Following Wang (2018), who shows that max pooling tends to produce too many false negatives while attention pooling tends to produce too many false positives, we combine an Attention-Pooling layer and a Max-Pooling layer so that they complement each other. Applying Attention-Pooling and Max-Pooling to the audio stream's final representations H^N_a yields h^attn_a ∈ R^d and h^max_a ∈ R^d respectively,
where v^attn_a and W^attn_a are the trainable parameters of the audio-side Attention-Pooling layer.
As discussed in Section 3.1.2, for the language stream, we adopt the final hidden state of the start token, h_{w_0} ∈ R^d, as the aggregated text sequence representation h^attn_w for Attention-Pooling, and we apply an additional Max-Pooling over the text stream output H^N_w to obtain h^max_w. We then fuse the aggregated sequence representations of the two modalities, where ⊕ denotes vector concatenation, and the final hidden state h_fuse is used as the audio-and-language representation for classification tasks.
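The two pooling operators and the concatenation-based fusion can be sketched as follows. This is a simplified reading: the attention scores here use only a scoring vector v, whereas the paper's Attention-Pooling layer also includes a projection W^attn, and all function names are illustrative:

```python
import math

def attention_pool(H, v):
    """Attention-Pooling: score each timestep with vector v, softmax
    the scores, and return the attention-weighted sum of rows of H."""
    scores = [sum(h_i * v_i for h_i, v_i in zip(h, v)) for h in H]
    m = max(scores)                              # stabilized softmax
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    w = [x / z for x in w]
    return [sum(w[t] * H[t][j] for t in range(len(H)))
            for j in range(len(H[0]))]

def max_pool(H):
    """Max-Pooling over time: element-wise max across timesteps."""
    return [max(H[t][j] for t in range(len(H))) for j in range(len(H[0]))]

def fuse(h_attn_a, h_max_a, h_attn_w, h_max_w):
    """Concatenate the pooled vectors into h_fuse (one plausible reading
    of the paper's ⊕ fusion)."""
    return h_attn_a + h_max_a + h_attn_w + h_max_w
```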

Orthogonal Regularization
One key characteristic of multimodal learning is that the generated representations of the different modalities are supposed to depict a sample from different points of view. To encourage the two modules to produce representations with complementary rather than similar characteristics, we introduce, in addition to the loss function inherent to each task, a regularization term that is minimized simultaneously with the objective to encourage orthogonality between the representations during the fine-tuning stage.
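One common instantiation of such an orthogonality regularizer (shown here as an assumed example, not necessarily the paper's exact form) is the squared cosine similarity between the pooled audio and language vectors, which is zero exactly when the two representations are orthogonal:

```latex
\mathcal{L}_{\mathrm{orth}} = \left( \frac{\mathbf{h}_a^{\top}\,\mathbf{h}_w}{\lVert \mathbf{h}_a \rVert_2 \, \lVert \mathbf{h}_w \rVert_2} \right)^{2}
```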

Experimental Setup and Result
In this section, we present CTAL pre-training details and fine-tuning results on three downstream audio-and-language tasks.

Pre-training Details
We pre-train CTAL on the public LibriSpeech dataset (Panayotov et al., 2015), a dataset of read English speech that includes both audio recordings and the corresponding authorized transcripts. It has 7 subsets in total (train-clean-100, train-clean-360, train-other-500, dev-clean, dev-other, test-clean, test-other). Subsets with "clean" in their names contain audio with higher recording quality, while the other subsets have relatively lower-quality recordings. We use all three training subsets for pre-training, comprising approximately 960 hours of speech and 280k utterances.
Following Radford et al. (2019), we train a BBPE tokenizer on the LibriSpeech corpus with additional special tokens (<s>, </s>, <mask>, <pad>) as our language stream tokenizer, and we tokenize the input text into token sequences as described in Section 3.1.1. For the audio stream, we use Librosa (McFee et al., 2015), a well-established audio analysis Python package, to extract the 160-dimensional input acoustic features for each frame as described in Section 3.1.1. For the pre-trained model architecture, we denote the number of layers (i.e., language stream encoding layers and text-referred audio stream encoding layers) as N, the number of self-attention heads as A, and the hidden size as H. We primarily report results on two model sizes: CTAL_BASE (N=3, A=12, H=768) and CTAL_LARGE (N=6, A=12, H=768). The total number of parameters is 60M for CTAL_BASE and 110M for CTAL_LARGE. More implementation details are in Appendix A.1.

Fine-tuning on Downstream Tasks
We transfer our pre-trained CTAL model to a set of three established speech understanding tasks, with simple and necessary modifications on the output layers, loss function and training strategy.

Emotion Classification
In the emotion classification task, given a speech clip, the model is asked to predict which emotion category the speech belongs to. We conduct experiments on the widely used IEMOCAP dataset (Busso et al., 2008). The dataset was recorded from ten actors, divided into five sessions, where each session consists of dialogues between two speakers of different genders. The dataset contains audio, transcriptions, and video recordings; we use only the audio and transcriptions in our study. The recorded dialogues have been sliced into utterances and labeled with 10 categories by three annotators, and utterances without any text content are filtered out in our experiment. For consistent comparison with previous works, we follow the settings of Xu et al. (2019), which use four emotions (angry, happy, neutral and sad) for classification and perform 5-fold cross-validation over sessions, where each session is used as the test set in turn and the remainder as the training set. We adopt two widely used evaluation metrics: weighted accuracy (WA), the overall classification accuracy, and unweighted accuracy (UA), the average recall over the four classes. We report the WA and UA averaged over the 5-fold cross-validation experiments; higher WA and UA indicate better model performance.
To fine-tune on IEMOCAP, we represent the input sequence (a pair of audio and text) as described in Section 4.1 and use the final hidden vector h_fuse as the audio-and-language representation. The only new parameters introduced during fine-tuning are the classification layer weights W ∈ R^{4×d}, and CTAL fine-tuning is driven by the cross-entropy loss between the predicted class and the gold label. More implementation details are in Appendix A.2. We select multiple models that claim to achieve SOTA results on the IEMOCAP dataset as our baselines; note that these previous methods are specifically designed for the task and have no pre-training stage. See details in Appendix A.3. Table 1 presents our experimental results on the IEMOCAP dataset. Since some prior works experiment with different train/test splits, we re-implement the baseline models with their published code. Both CTAL_BASE and CTAL_LARGE outperform all three baselines by a substantial margin, obtaining 3.86% and 4.95% absolute WA improvement, and 3.56% and 4.49% absolute UA improvement, respectively, over the prior state of the art.

Sentiment Analysis
The goal of the sentiment analysis task is to predict the degree of positive or negative sentiment. Compared to the emotion classification task, sentiment analysis is a regression task rather than a classification task. We adopt the CMU-MOSEI dataset (Zadeh et al., 2018) for evaluation, which contains 23,454 movie review video clips taken from YouTube. We use only the audio and corresponding transcriptions as input in our experiments. Each sample in the dataset is labeled with a sentiment score from -3 (strongly negative) to 3 (strongly positive) by human annotators. We follow the same experimental protocol as MulT (Tsai et al., 2019), with the same train/test data split and the same evaluation metrics, which include two classification metrics, binary accuracy (Acc_2: accuracy over positive/negative sentiment classification) and F1 score, and two regression metrics, mean absolute error (MAE) and the Pearson correlation coefficient (Corr) between the model's predictions and the human annotations. Since the prior top results reported on the CMU-MOSEI dataset are all achieved using all three modalities, as does MulT, we prune the vision-related components in MulT and re-train the model using only audio and text information.
During fine-tuning on sentiment analysis, we introduce additional parameters W ∈ R^{1×d} to project the final hidden representation h_fuse to the sentiment score, and the model is trained to minimize the L1 loss between the predicted scores and the gold annotations. The other fine-tuning settings for CMU-MOSEI are almost the same as for IEMOCAP. As shown in Table 2, we observe improvements across all 4 metrics for CTAL over the MulT baseline under both the base and large settings.

Speaker Verification
Speaker verification focuses on verifying the speaker identity of an utterance by comparing it with pre-recorded voice print information. In this experiment, we adopt LibriSpeech (Panayotov et al., 2015) for evaluation, which includes 292k utterances collected from more than 2,438 speakers. Following the same experimental settings as prior works (Wan et al., 2018; Jung et al., 2019), we fine-tune our pre-trained model with all training splits (train-clean-100, train-clean-360 and train-other-500) and evaluate it on the test-clean part, which contains 40 speakers unseen in the training part. Note that although the training set for this task is identical to the one used for pre-training, the speaker identity information and the test-clean data are not exposed during pre-training, so comparisons between our models and prior works are fair. We add a classifier on top of the fused embedding h_fuse and fine-tune with the cross-entropy loss. The output size of the classifier equals the number of unique speakers in the training set.
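At evaluation time (described next), verification reduces to thresholding a cosine score between identity embeddings; sweeping the threshold trades false accepts against false rejects, and the Equal Error Rate is the point where the two rates coincide. A minimal sketch (the threshold value and function names are illustrative):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two identity embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def verify(emb_enroll, emb_test, threshold=0.7):
    """Accept the trial if the similarity exceeds a threshold.
    The threshold here is purely illustrative; in practice it is swept
    to locate the Equal Error Rate operating point."""
    return cosine_similarity(emb_enroll, emb_test) >= threshold
```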
For evaluation, we use the representation before the classifier as the input audio's identity embedding, and the cosine distance between paired audio embeddings serves as the indicator for the final decision. Similar to prior studies, we report the Equal Error Rate (EER) as the evaluation metric; lower EER represents better model performance. We choose two SOTA models as our baselines (Wan et al., 2018; Jung et al., 2019).

Ablation Studies

CTAL_LARGE significantly outperforms CTAL_BASE across all tasks. Besides, as the size of the pre-training data increases, CTAL achieves better performance on all evaluation metrics except Acc_2 and F1 in the sentiment analysis task (comparing setting (a) "pre-train with train-clean-360" against CTAL_LARGE). The effectiveness of the asymmetric encoder design for audio-and-language representations is demonstrated by comparing CTAL_LARGE to LXMERT, where both models are designed to have a similar number of parameters. Setting (d) "w/o Orthogonal Fusion" removes our proposed cross-modality orthogonal-fusion component; comparing it with CTAL_LARGE, the model's performance decreases on all three downstream tasks, which proves the component's effectiveness. Settings (e) "w/o Audio Outputs" and (f) "w/o Language Outputs" use the output embeddings from only the audio or the language encoding module for downstream fine-tuning. Comparing them to (d), we find that each set of embeddings contributes to the audio-and-language tasks and that the best performance is achieved through the appropriate fusion of both parts. Lastly, setting (g) "w/o Cross-modal Pre-training" utilizes the unimodal pre-trained models RoBERTa and Mockingjay, pre-trained on the LibriSpeech dataset, and fuses their output embeddings for the downstream tasks. Note that "w/o Cross-modal Pre-training" is chosen to have the same model size as CTAL_LARGE for comparison purposes. Additionally, we present the performance of each single-modality pre-trained model, Mockingjay and RoBERTa, to demonstrate the advantages of multimodal pre-training. From the results, our approach still achieves better performance across all three tasks, which proves the importance of introducing inter-modality learning during the pre-training phase.

Effect of Pre-training
We analyze the effect of pre-trained CTAL by visualizing its performance on downstream tasks versus different proportions of training data. From the results, CTAL needs only half the amount of training data to achieve better performance than the SOTA baselines. More details are provided in Appendix A.5.

Conclusion
In this work, we proposed CTAL, a novel pre-trainable generic representation for audio-and-language tasks. It is pre-trained with two pre-training tasks on a large-scale dataset of audio-and-language pairs. Extensive empirical analysis demonstrates that our pre-trained model can significantly boost performance on various speech understanding tasks and achieve new state-of-the-art results. Besides, we show the effectiveness of the different model components and the competent generalization capability via detailed ablation studies and analysis.

A.1 Pre-training Details
We use Adam (Kingma and Ba, 2014) as the optimizer with an initial learning rate of 5e-5 and a linearly decayed learning rate schedule with warm-up (Devlin et al., 2018). We pre-train our model on 4 16G-V100 GPUs with a batch size of 16 for 1,000,000 steps, and the whole pre-training process takes roughly 48 hours.

A.2 Fine-tuning Details
We use a batch size of 4 and fine-tune for 20 epochs over each fold on one 16G-V100 GPU. We use AdamW (Loshchilov and Hutter, 2018) as the optimizer in the fine-tuning stage; the learning rate is initialized to 1e-5 and we apply a cosine annealing learning rate schedule (Loshchilov and Hutter, 2016).

A.3 Baseline for Emotion Classification
Xu et al. (2019) aims to produce stronger multimodal representations by learning the alignment between speech frames and text words using an attention mechanism; we call it LSTM_alignment in our paper since the original paper did not name the method. MDRE (Yoon et al., 2018) uses dual RNNs to encode the information from audio and text separately, then combines them by simple representation concatenation to predict emotion classes. MHA (Yoon et al., 2019) proposes a multi-hop attention mechanism to infer the correlation between the audio and language modalities based on the output hidden representations of two bi-directional long short-term memory (BiLSTM) encoders, and outputs the final classification result from the concatenation of the audio and language representations.
A.4 Baseline for Speaker Verification

GE2E (Wan et al., 2018) designs a general loss function that emphasizes examples that are difficult to verify at each step of the training process, while RawNet (Jung et al., 2019) proposes an end-to-end network that takes raw audio waveforms as input to extract speaker embeddings.

A.5 Visualizing CTAL Behavior
In Figure 2a and Figure 2b, we show the performance on IEMOCAP, from which we make two observations. First, on both metrics, CTAL outperforms all baselines across different proportions of training data. Second, the figures show that CTAL needs only half the amount of training data to achieve better performance than the baselines. The results on MOSEI, shown in Figure 3a and Figure 3b, support the same conclusion.
In Figure 4, we use t-SNE (Van der Maaten and Hinton, 2008) to visualize the speaker embeddings of the test set extracted from pre-trained CTAL without any training on downstream tasks. In the figure, each point represents an utterance and each speaker has a distinct color. We observe that the model has some capability to distinguish the utterances of different speakers with pre-training alone.
By comparing (b) "w/o MLM" and (c) "w/o MCAM" to "w/o Pre-training", we see the benefits of pre-training with MCAM and MLM respectively. However, compared with CTAL_LARGE, both (b) and (c) suffer a dramatic performance decrease on all downstream tasks. This fact indicates the importance of joint training with the MLM and MCAM tasks during the pre-training stage. Together, these comparisons demonstrate the effectiveness of pre-training and of each individual task.

Figure 3: Metrics of different models vs. amount of training data on CMU-MOSEI.

Figure 4: Visualization of 10 speakers' embeddings via t-SNE. Different colors represent different speakers.

Table 1: Comparison to the SOTA methods on the IEMOCAP dataset.

Table 2: Comparison to the SOTA methods on the CMU-MOSEI dataset.
See more details in Appendix A.4. The comparison results are shown in Table 3. From the table, we observe that CTAL_BASE outperforms GE2E and RawNet by 1.85% and 0.59% respectively, and CTAL_LARGE outperforms the two baselines by 2.24% and 0.98% respectively.

Table 3: Comparison to the SOTA methods on the LibriSpeech dataset.

Table 4: Results of the ablation study with CTAL_LARGE. The notation (MAM) indicates that the acoustic stream encoding module is pre-trained with the MAM task. EER is not reported for setting (d) or RoBERTa, because it does not make sense to perform speaker verification with only semantic embeddings.