Learning Shared Semantic Space for Speech-to-Text Translation

Despite its numerous potential applications and great impact, end-to-end speech translation (ST) has long been treated as an independent task, failing to fully draw strength from the rapid advances of its sibling, text machine translation (MT). Because text and audio inputs are represented differently, the modality gap has rendered MT data and end-to-end MT models incompatible with their ST counterparts. To overcome this obstacle, we propose Chimera to bridge the representation gap. By projecting audio and text features to a common semantic representation, Chimera unifies the MT and ST tasks and boosts the performance on the ST benchmarks MuST-C and Augmented LibriSpeech to a new state-of-the-art. Specifically, Chimera obtains 27.1 BLEU on MuST-C EN-DE, improving the SOTA by a +1.9 BLEU margin. Further experimental analyses demonstrate that the shared semantic space indeed conveys common knowledge between these two tasks and thus paves a new way for augmenting training resources across modalities. Code, data, and resources are available at https://github.com/Glaciohound/Chimera-ST.


Introduction
Speech-to-text translation (ST) takes speech input in a source language and outputs text utterances in a target language. It has many real-world applications, including automatic video captioning, simultaneous translation for international conferences, etc. Traditional ST approaches cascade automatic speech recognition (ASR) and machine translation (MT) (Sperber et al., 2017; Beck et al., 2019). However, cascaded models often suffer from error propagation and translation latency. As a result, there have been a series of recent attempts at end-to-end speech-to-text translation (Liu et al., 2019; Weiss et al., 2017; Bérard et al., 2018; Duong et al., 2016; Jia et al., 2019; Dong et al., 2021b; Wang et al., 2020b). The end-to-end approaches learn a single unified model, which is easier to deploy, has lower latency, and could potentially reduce errors.
However, it remains a challenge for end-to-end ST models to catch up with their cascaded counterparts in performance. We argue that the root cause is the gap between the two modalities, speech and text. Although both encode human language, they are dissimilar in both coding attributes (pitch, volume, and intonation versus words, affixes, and punctuation) and length (thousands of time frames versus tens of words). This issue is further compounded by the relatively smaller amount of parallel data available for ST than for MT.
To tackle these challenges, we resort to making use of the additional bilingual data available for MT. Our hypothesis is that, to better leverage MT data, an ideal model should be able to bridge the representations of speech and text. Motivated by this intuition, we propose Chimera, a text-speech shared semantic memory network. It learns a semantic memory by projecting features from both modalities into a shared semantic space. This approach unifies the ST and MT workflows and thus has the advantage of leveraging massive MT corpora as a side boost in training. It can also use speech-text pairs to align the semantic memories from the two modalities.
This idea of a unified text-speech representation also finds a neural basis in recent evidence from functional neuroimaging (van Atteveldt et al., 2004; Spitsyna et al., 2006; Shankweiler et al., 2008). Specifically, van Atteveldt et al. (2004) and Spitsyna et al. (2006) identify certain regions in the brain at which the processing streams for speech sounds and visual text converge. Shankweiler et al. (2008) further verify that the size of such convergence sites correlates positively with subjects' reading skills. Coincidentally, these convergence sites overlap with regions responsive to downstream activities such as lexical and semantic word recognition (Binder et al., 2003) and spontaneous generation of speech (Blank et al., 2002). This evidence establishes the pivotal role of a modality-agnostic converged representation in the brain's language activities.
This intuition remains largely unexplored in previous studies, with only a few exceptions (Indurthi et al., 2020; Liu et al., 2020), possibly due to the aforementioned difficulties and the marginal improvements reported so far.
Our results show that Chimera achieves new state-of-the-art results on all 8 translation directions in the benchmark datasets MuST-C and Augmented LibriSpeech. Specifically, Chimera obtains a 27.1 BLEU score on MuST-C EN-DE, which surpasses the best previously reported result by +1.9 BLEU. We also provide results under model variations and ablations, validate our design choices with detailed analysis, and visualize the semantic space Chimera has learned.
Our work makes the following contributions. First, we propose Chimera, which is able to bridge the modality gap between speech and text. Second, we derive a novel bi-modal contrastive training task to learn an alignment between semantic memories of two modalities. Finally, Chimera achieves a new state-of-the-art performance on the MuST-C benchmark and demonstrates its efficacy in learning modality-agnostic semantic representations.

Related Work
End-to-end ST Since the first proof-of-concept works (Bérard et al., 2016; Duong et al., 2016), solving speech translation in an end-to-end manner has attracted extensive attention (Vila et al., 2018; Salesky et al., 2018, 2019; Di Gangi et al., 2019b; Bahar et al., 2019a; Di Gangi et al., 2019c; Inaguma et al., 2020). Standard training techniques such as pretraining (Weiss et al., 2017; Bérard et al., 2018; Bansal et al., 2019; Stoian et al., 2020; Wang et al., 2020a), multi-task training (Vydana et al., 2021; Le et al., 2020; Tang et al., 2021), meta-learning (Indurthi et al., 2020), and curriculum learning (Kano et al., 2017; Wang et al., 2020b) have been applied. As ST data are expensive to collect, Jia et al. (2019); Pino et al. (2019); Bahar et al. (2019b) augment training with data synthesized from ASR and MT corpora. Methods utilizing trained models, such as knowledge distillation (Liu et al., 2019) and model adaptation (Di Gangi et al., 2020), have also been shown to be effective. Among these attempts, Indurthi et al. (2020); Le et al. (2020); Liu et al. (2020) are the most related to ours, as they also train models on ASR or MT data. However, they lack pivotal modules to semantically bridge the gap between audio and text, and could thus suffer from modality mismatch in representations.

Cascaded ST The cascaded approach has a longer history in ST (Sperber et al., 2017). To alleviate its innate problem of error propagation, Cheng et al. (2018) introduce synthetic ASR-related errors and perturbations. On the other hand, post-processing techniques such as re-segmentation (Matusov et al., 2006), punctuation restoration (Fügen, 2008), and disfluency detection (Fitzgerald et al., 2009) have been proposed to fix flaws or errors that occur during translation.

Cross-Lingual Techniques Techniques from multilingual tasks are also related to ours, as they likewise aim at extracting common features from sources with different representations (in their case, different languages). However, multilingual settings lack the key difficulties of the audio-text modality gap discussed above. Lu et al. (2018) and Vazquez Carrillo et al. (2019) are early attempts that build an LSTM-based attentional interlingua. Yu et al. (2018); Yang et al. (2019) use a similar cosine-based loss for multilingual training. Zhu et al. (2020) is perhaps the closest to ours in method, but Chimera is simpler in model and objectives, and its memories are additionally designed to focus on specific semantic categories.
Proposed Method: Text-Speech Shared Semantic Memory Network

Speech Translation Overview
An ST corpus usually consists of a set of triplets $S = \{(x_i, z_i, y_i)\}$, where $x_i$ is the audio wave sequence, $z_i$ is the transcript sequence, and $y_i$ is the translation sequence in the target language. As a benefit of the shared semantic projection, Chimera is able to leverage large-scale MT training corpora in addition to $S$.

[Figure 2: Besides the MT and ST translation losses, we adopt a bi-modal contrastive loss to help align the semantic memories from text and speech. In short, among the semantic memory elements from both sides of a paired speech and transcript, the contrastive loss maximizes the cosine similarity between corresponding semantic memory elements.]

Chimera Architecture

Figure 1 illustrates the structure of Chimera. It consists of three major components: an encoding module, a shared semantic projection module, and a decoding module.

Encoding Module Different from that of a conventional ST model, the encoding module of Chimera accepts either speech or text as input. For text input, we use word embeddings plus positional embeddings. For speech input, we use a pretrained Wav2Vec2 (Baevski et al., 2020) to extract speech features. As the speech features can be very long, we apply an additional 1-dimensional strided CNN to reduce their length. Both the speech and text branches share a common subsequent Transformer encoder (Vaswani et al., 2017). The final output of the encoding module is a sequence of contextual features.

Shared Semantic Projection Module Shared semantic projection plays a pivotal role in Chimera. The contextual features of speech and text may follow different distributions and have different lengths; ideally, the shared semantic projection computes a constant number of semantic features as its output semantic memories. This module takes the contextual features from the encoding module as input and outputs semantic memories of fixed length m. It consists of n attentional layers and keeps a tuple of m trainable memory queries that indicate the types of desired semantic information and serve as the initial "memories". The uni-modal contextual features serve as attention "keys" and "values", while the memories serve as attention "queries". The memories are iteratively fed through the n shared semantic projection layers, with each layer's output used as the input to the next layer; the final output is used as the semantic memory:
$$I_1 = M_0, \qquad O_i = \mathrm{AttentionLayer}_i(I_i; \hat{H}), \qquad I_{i+1} = O_i \quad (i = 1, \dots, n),$$

where $M_0 \in \mathbb{R}^{m \times d}$, $\hat{H} \in \mathbb{R}^{l \times d}$, $I_i$, and $O_i$ denote the memory queries, the contextual features, and the input and output of the $i$-th layer, respectively; $l$ is the length of the contextual features and $d$ is the shared vector dimension. The top-most output $O_n$ is finally fed into the decoding module.
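As a concrete reference, the following is a minimal PyTorch sketch of this iterative projection. For brevity it omits the feed-forward sublayers, residual connections, and layer normalization that a full Transformer-style layer would include; all names (e.g., SemanticProjection) are illustrative rather than taken from the released code.

```python
import torch
import torch.nn as nn

class SemanticProjection(nn.Module):
    """Projects variable-length contextual features (speech or text)
    onto a fixed number m of semantic memory vectors."""

    def __init__(self, m: int = 64, d: int = 512, n_layers: int = 3, n_heads: int = 8):
        super().__init__()
        # m trainable memory queries M_0, shared by both modalities
        self.memory_queries = nn.Parameter(torch.randn(m, d))
        self.layers = nn.ModuleList([
            nn.MultiheadAttention(d, n_heads, batch_first=True)
            for _ in range(n_layers)
        ])

    def forward(self, contextual: torch.Tensor) -> torch.Tensor:
        # contextual: (batch, l, d) output of the shared encoder; l differs
        # between speech and text inputs, but the output shape does not.
        batch = contextual.size(0)
        memories = self.memory_queries.unsqueeze(0).expand(batch, -1, -1)  # I_1 = M_0
        for layer in self.layers:
            # memories act as queries; contextual features as keys and values
            memories, _ = layer(memories, contextual, contextual)  # O_i -> I_{i+1}
        return memories  # O_n: (batch, m, d) fixed-length semantic memory
```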
Decoding Module The decoding module is a conventional Transformer decoder. The only difference is that its input is the fixed-size $O_n$, which can come from either speech or text.

Training Objectives
The training objective of Chimera consists of three parts, with supervision signals coming from speech-to-text translation data, text machine translation data, and a bi-modal contrastive task.

Speech-to-Text Translation Training The workflow of Chimera in speech translation is straightforward. The training objective is the negative log-likelihood loss on the speech-translation pairs $\{(x_i, y_i)\}$ from the data $S$.
Text Machine Translation Training Chimera is also able to train on parallel sentences thanks to the unification of speech and text representations. The parallel data come either directly from an MT dataset or from transcript-translation pairs. This enables the model to acquire further knowledge from the much larger external MT corpus. As in ST training, the shared semantic projection module projects the contextual text features into the shared semantic space, and the decoding module then consumes them to output a translation.
Bi-modal Contrastive Training The motivation behind Chimera's design is to bridge the speech and text representations. We introduce a bi-modal contrastive training task to learn an alignment between representations from speech and text, as illustrated in Figure 2. Intuitively, the contrastive loss forces each pair $(M^{text}_i, M^{speech}_i)$ of corresponding semantic memory elements to be projected close to each other, while the softmax over non-matching elements trains the model to maintain diversity among the semantic memories.
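A minimal sketch of this bi-modal contrastive objective is given below, assuming an InfoNCE-style formulation in which the i-th text memory element is the positive for the i-th speech memory element and all other elements serve as negatives; the temperature hyperparameter is our assumption and is not specified in the text.

```python
import torch
import torch.nn.functional as F

def bimodal_contrastive_loss(mem_text: torch.Tensor,
                             mem_speech: torch.Tensor,
                             temperature: float = 0.1) -> torch.Tensor:
    # mem_text, mem_speech: (m, d) semantic memories produced from a
    # paired transcript and speech input.
    text = F.normalize(mem_text, dim=-1)
    speech = F.normalize(mem_speech, dim=-1)
    # (m, m) matrix of cosine similarities between all memory pairs
    logits = text @ speech.t() / temperature
    targets = torch.arange(text.size(0), device=text.device)
    # the i-th text memory should match the i-th speech memory, and vice versa
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```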
The final loss is a weighted sum of the three losses:

$$\mathcal{L} = \lambda_{ST}\,\mathcal{L}_{ST} + \lambda_{MT}\,\mathcal{L}_{MT} + \lambda_{ctr}\,\mathcal{L}_{ctr}.$$

Experiments
We conduct experiments on the MuST-C and Augmented LibriSpeech benchmarks and, to validate our model design, carry out ablation studies and visualize the representations Chimera has learned.

Dataset and Preprocessing
MuST-C and Augmented LibriSpeech serve as the ST benchmarks, while WMT14, OpenSubtitles, and OPUS100 provide the additional MT corpora for pretraining.

For text input, on each translation direction, all texts (ST transcripts and translations, and MT source and target texts) are pre-processed in the same way. Texts are case-sensitive. Punctuation is kept, split from words, and normalized; non-print punctuation is removed. The sentences are then tokenized with the Moses tokenizer. We filter out samples whose number of source or target tokens exceeds 250 or whose source-to-target length ratio falls outside the range [2/3, 3/2]. For sub-wording, we use a unigram sentencepiece model with a dictionary size of 10000. For each translation direction, the sentencepiece model is learned on all text data from both the ST and MT corpora. The dictionary is shared across MT and ST and across source and target languages.
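The filtering and sub-wording steps could look roughly like the sketch below; the file name and any sentencepiece options beyond vocabulary size and model type are assumptions.

```python
import sentencepiece as spm

def keep_pair(src_tokens, tgt_tokens, max_len=250, ratio_band=(2 / 3, 3 / 2)):
    """Length/ratio filter applied to tokenized sentence pairs."""
    if len(src_tokens) > max_len or len(tgt_tokens) > max_len:
        return False
    ratio = len(src_tokens) / max(len(tgt_tokens), 1)
    return ratio_band[0] <= ratio <= ratio_band[1]

# One unigram model per translation direction, learned on all ST + MT text;
# the resulting 10k dictionary is shared across tasks and languages.
spm.SentencePieceTrainer.train(
    input="all_text.txt",            # hypothetical concatenation of ST + MT text
    model_prefix="spm_unigram_10k",
    vocab_size=10000,
    model_type="unigram",
)
```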
Unless otherwise stated, performance is evaluated with case-sensitive BLEU (Papineni et al., 2002), computed by sacreBLEU (https://github.com/mjpost/sacrebleu) with the 13a tokenizer and full punctuation. We average the 7 consecutive checkpoints around the one with the best dev loss and use a beam size of 10.

[Table 4: Performance of Mem-16 Chimera when freezing different modules during fine-tuning. "S.S. Projection" abbreviates shared semantic projection. "Fixed" indicates that the weights in a module are not updated during fine-tuning, and "-" means otherwise. The results demonstrate that freezing modules indeed hampers the model's ability to adapt, but the weights pretrained on MT are already highly informative for ST.]
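The checkpoint averaging described above can be done along the lines of the following sketch, a simplified analogue of fairseq's scripts/average_checkpoints.py; the checkpoint layout with a "model" key is an assumption.

```python
import torch

def average_checkpoints(paths):
    """Uniformly average model weights from several checkpoints."""
    avg, n = None, len(paths)
    for p in paths:
        state = torch.load(p, map_location="cpu")["model"]
        if avg is None:
            avg = {k: v.float().clone() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    # normalize and cast back to the original parameter dtypes
    return {k: (v / n).to(state[k].dtype) for k, v in avg.items()}
```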

Model Configuration
For text input, we use 512-dimensional word embeddings plus sinusoidal positional embeddings. For audio input, the Wav2Vec2 module follows the base configuration in Baevski et al. (2020) and uses parameters pretrained on LibriSpeech audio data only. The 1-dimensional CNN over speech features has 2 layers with stride 2, kernel size 5, padding 2, and hidden dimension 1024. The shared Transformer encoder consists of 6 layers. The memory queries are 64 512-dimensional vectors. The shared semantic projection resembles a 3-layer Transformer encoder. The Transformer decoder has 6 layers. Each of these Transformer layers, except for those in the Wav2Vec2 module, has an embedding dimension of 512, a hidden dimension of 512, and 8 attention heads.
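For reference, the configuration above can be summarized as follows; the field names are illustrative and not taken from the released code.

```python
from dataclasses import dataclass

@dataclass
class ChimeraConfig:
    embed_dim: int = 512         # shared Transformer embedding dimension
    hidden_dim: int = 512        # hidden dimension of Transformer layers
    attention_heads: int = 8
    encoder_layers: int = 6      # shared Transformer encoder
    projection_layers: int = 3   # shared semantic projection
    decoder_layers: int = 6
    memory_queries: int = 64     # 16 for the "Mem-16" variant
    cnn_layers: int = 2          # 1-D CNN over Wav2Vec2 features
    cnn_stride: int = 2
    cnn_kernel: int = 5
    cnn_padding: int = 2
    cnn_hidden: int = 1024
```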
In both the pretraining and fine-tuning stages, we use an Adam optimizer with β1 = 0.9, β2 = 0.98, and 4k warm-up updates, and apply an inverse square root schedule for the learning rate. In MT pretraining, the learning rate is 5e-4, the maximum number of updates is 300k, and each batch holds at most 33k input tokens. In ST fine-tuning, the learning rate is 1e-4, the maximum number of updates is 150k, and each batch holds at most 16M source audio frames. The loss weights λ_ST, λ_MT, and λ_ctr are all set to 1.
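The inverse square root schedule corresponds to roughly the following rule, a sketch matching the standard Transformer schedule; warming up linearly from zero is our assumption.

```python
def inverse_sqrt_lr(step: int, peak_lr: float, warmup: int = 4000) -> float:
    """Linear warm-up to peak_lr, then decay proportional to 1/sqrt(step).
    peak_lr is 5e-4 for MT pretraining and 1e-4 for ST fine-tuning."""
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * (warmup ** 0.5) * (step ** -0.5)
```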
We also report results on a base version of Chimera whose memory queries are only 16 512-dimensional vectors (codenamed "Mem-16"). For training efficiency and simplicity, all ablation studies and visualizations adopt the Mem-16 configuration unless stated otherwise. Both Chimera and Chimera Mem-16 contain around 165M parameters. The whole training process for one trial on 8 Nvidia Tesla-V100 GPUs generally takes 20-40 hours depending on the translation direction.

[Table 5: BLEU scores of Mem-16 Chimera on the MuST-C tst-COMMON set without one or both auxiliary tasks. "×" means the task is not used during fine-tuning, and "✓" means otherwise. "Contrastive" is the bi-modal contrastive task. Removing one or both tasks greatly harms the model's performance on both language pairs.]

Benchmark Experiments
Training We train Chimera in a pretrain-then-fine-tune manner. In the first stage, we pretrain Chimera on MT datasets to leverage additional sources of training data and to provide a better initialization point. In the fine-tuning stage, we adopt multi-task training as described in Section 3.3: in addition to the conventional ST task, Chimera is also fine-tuned on the MT and bi-modal contrastive tasks to align inputs from speech and text.

Baselines We include as baselines the speech Transformer model from Ott et al. (2019), the ESPnet result from Inaguma et al. (2020), the adaptive feature selection method from Zhang et al. (2020a), the dual-decoder Transformer from Le et al. (2020), and Modality-Agnostic Meta-Learning from Indurthi et al. (2020) in Table 1. We also provide a series of baseline results from a simple combination of Wav2Vec2 (Baevski et al., 2020) and a Transformer. It can be viewed as Chimera without external MT pretraining, and yields competitive but not state-of-the-art results.
To verify the effectiveness of our training technique, we also compare with other baselines able to leverage an external MT corpus in Table 3.

Results The experiment results are shown in Tables 1 and 2. Chimera achieves state-of-the-art performance on all language pairs, even though, unlike most baselines, we do not utilize Google Translate results on Augmented LibriSpeech. The EN-DE results of Chimera use WMT14+OpenSubtitles for MT pretraining; a detailed ablation study on the effect of MT data can be found in Section 4.4. Note that the improvement on EN-PT is not as significant as on EN-DE and EN-FR. We attribute this to the data discrepancy between OPUS100 and MuST-C: a large number of sentences in OPUS100 come from movie subtitles, which are more informal, contain repeated sentences, and cover different topics from the public speeches in MuST-C.
In Table 3, under the same data condition, Chimera outperforms other techniques such as decoder pre-training and knowledge distillation (Liu et al., 2019).

Ablation Studies and Visualizations
Knowledge Shared across Tasks One potential benefit of our design is that the shared semantic space can hold common knowledge shared across the ST and MT tasks. To validate this motivating idea, we analyze the model's behavior while manipulating its modules. If certain weights pretrained during the MT task also contain meaningful information for ST, fixing them should not greatly harm the model's performance.

[Figure 4: 2-dimensional PCA projection of the semantic memories in Mem-16 Chimera across different samples. Each colored cluster (circled) represents a semantic memory element; a "·" corresponds to a speech semantic memory, and a "+" marks a text one.]
Specifically, after MT pretraining, we fix certain modules and do not update their weights during fine-tuning. We choose to fix the weights in the shared semantic projection module, the decoding module, or both. Table 4 shows the results: after freezing modules, the scores on both EN-DE and EN-FR drop only slightly. This demonstrates that freezing weights does hamper the model's ability to adapt from the MT to the ST dataset, but the decreased scores remain comparable to many of the best results in Table 1. This validates the effectiveness of the shared semantic space and indicates that the weights pretrained on MT are already informative enough for Chimera to generalize well on the ST task.
Multi-task Training One advantage of bridging the modality gap is that the model can fully benefit from training on auxiliary tasks with more data, such as those described in Section 3.2. To evaluate their impact, we conduct another ablation study on EN-DE and EN-FR, in which either or both of the auxiliary tasks are removed during fine-tuning.
The results of this ablation are presented in Table 5.

[Figure 5: A visualization of one particular semantic memory in Mem-16 Chimera on different samples and modalities. "·" marks speech representations, and "+" marks text representations. Marks of the same color come from the same speech-transcript pair and are linked with dashed lines. Some speech-transcript pairs are circled together and annotated with their transcripts (e.g., "We're going to film it for real.", "I won't go into detail."). Three fonts distinguish three groups of transcripts with similar patterns.]
There is a significant drop (for example, p = 0.020 in a one-tailed Student's t-test comparing rows 1 and 2) in BLEU scores when either of the auxiliary tasks is abandoned. Although the bi-modal contrastive loss is not directly related to the ST task, a poor alignment between the semantic memories of speech and text hinders Chimera from leveraging the knowledge shared across the two tasks. When the MT task is removed, the drop in BLEU scores is also large. A likely explanation is that during fine-tuning, the auxiliary MT task is necessary to keep the shared knowledge from being forgotten.
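Such a test can be reproduced along these lines; the per-run BLEU lists below are placeholders for illustration, not the actual scores behind Table 5.

```python
from scipy.stats import ttest_ind

# Placeholder per-run BLEU scores, NOT the actual numbers behind Table 5
with_both_tasks = [26.9, 27.3, 27.1]      # row 1: both auxiliary tasks kept
without_contrastive = [26.0, 26.3, 26.1]  # row 2: contrastive task removed
stat, p = ttest_ind(with_both_tasks, without_contrastive, alternative="greater")
print(f"one-tailed p = {p:.3f}")
```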
Interestingly, abandoning both tasks produces results similar to abandoning either one alone. This suggests that the two auxiliary tasks are effective only in combination: only by using the MT task to keep the pretrained parameters from being forgotten, and the bi-modal contrastive task to align speech and text representations, can Chimera benefit from the knowledge shared through MT pretraining.
Additional Machine Translation Data We attribute the performance gain of Chimera mainly to pretraining on MT data. One piece of evidence is the gain from using the larger OpenSubtitles corpus for EN-DE MT pretraining in Table 1. We also vary the amount of MT data available during pretraining on the EN-DE direction; the results are plotted in Figure 3. As the size of the additional MT dataset increases, the MuST-C BLEU score improves significantly. This confirms the importance of massive high-quality MT data for pretraining Chimera. The results also help explain the relatively inferior scores on EN-PT in Table 1, which uses the OPUS100 dataset for pretraining.

Visualization of Semantic Memories
The shared semantic projection is designed to only extract semantic categories of information necessary for decoding, regardless of the input modality. In this way, it can bridge the different representations of speech and text during computation and facilitate knowledge sharing between MT and ST.
To validate this motivation, we visualize the memories with Principal Component Analysis (PCA) in Figure 4. Up to 100 speech-transcript paired samples are randomly chosen from the validation set. We record the vector values of the 16 semantic memories of Chimera Mem-16 when the inputs are speeches or transcripts, and apply 2-dimensional PCA. The 16 semantic memories are distinguished by 16 colors; every "·" corresponds to a semantic memory from speech, and each "+" to one from text. The semantic memories are clearly clustered, each occupying a particular region. Speech and text representations are also projected close to each other within the same region, demonstrating the model's ability to ignore representation differences and bridge the modality gap.
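The visualization itself is straightforward to reproduce, assuming the semantic memories have been exported as arrays; a sketch with illustrative shapes and names:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_memories(mem_speech: np.ndarray, mem_text: np.ndarray):
    # mem_speech, mem_text: (num_samples, m, d) memories recorded from
    # paired speech and transcript inputs.
    n, m, d = mem_speech.shape
    flat = np.concatenate([mem_speech, mem_text]).reshape(-1, d)
    coords = PCA(n_components=2).fit_transform(flat).reshape(2, n, m, 2)
    colors = plt.cm.tab20(np.linspace(0, 1, m))
    for i in range(m):  # one color per semantic memory element
        plt.scatter(*coords[0, :, i].T, c=[colors[i]], marker=".")  # speech
        plt.scatter(*coords[1, :, i].T, c=[colors[i]], marker="+")  # text
    plt.show()
```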
To take a closer look at the structure of each semantic memory subspace, we randomly choose one semantic memory and apply PCA to its corresponding cluster. The results are visualized in Figure 5. These samples come from 50 speech-transcript pairs; each pair of speech ("·") and transcript ("+") shares the same color, and paired points are linked by dashed lines.
Two interesting properties can be observed. First, paired speech and transcript inputs are again close to each other, even though they come from different modalities. Second, the visualized representations are organized according to their semantic or syntactic patterns. We recognize several clusters in the figure and annotate their transcripts with different fonts: the three annotations at the upper-right corner (italic font) are all questions; those at the upper-left corner (wavy-underlined font) all use the simple future tense; and at the bottom-left corner (underlined font) is a cluster of sentences with copular verbs. This shows that the shared semantic space Chimera has learned is well structured, and thus validates our model design.
Visualization of Inter-Modal Attention Alignment Attention is the internal mechanism of the Transformer-based modules. In Chimera's design, attention is used to extract m key semantic categories of features from the input. To investigate whether these extracted features are indeed semantic, we further visualize the similarity between attention patterns on paired audio and text in Figure 6. The colors, which distinguish the different memories attending to the inputs, are clustered along the sequence and distributed close to the diagonal, demonstrating an alignment between matching tokens in the two modalities. We also observe four bright columns, where the full stop in text aligns with pauses in audio. This indicates that the memories capture semantic rather than positional information.

Conclusions and Future Work
In this paper, we propose Chimera, a model that learns a text-speech shared semantic memory network to bridge the gap between speech and text representations. By leveraging a large amount of external machine translation data, Chimera achieves new state-of-the-art performance on all 8 translation directions of the MuST-C dataset. Additional experimental results demonstrate its ability to learn a well-structured shared semantic space and to effectively share learned knowledge across MT and ST, and validate our design of the auxiliary tasks.
In the future, we will focus on deriving better tasks to tightly align speech and text representations. Also, the workflows of MT and ST are only partially shared in Chimera, which still requires the model to adapt to ST when switching to the fine-tuning stage; better coupling their computation graphs remains a challenge for future designs.