ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer

Text-to-speech (TTS) has undergone remarkable improvements in performance, particularly with the advent of Denoising Diffusion Probabilistic Models (DDPMs). However, the perceived quality of audio depends not solely on its content, pitch, rhythm, and energy, but also on the physical environment. In this work, we propose ViT-TTS, the first visual TTS model with scalable diffusion transformers. ViT-TTS complements the phoneme sequence with visual information to generate high-perceived audio, opening up new avenues for practical applications of AR and VR that allow a more immersive and realistic audio experience. To mitigate the data scarcity in learning visual acoustic information, we 1) introduce a self-supervised learning framework to enhance both the visual-text encoder and denoiser decoder; 2) leverage a diffusion transformer that is scalable in terms of parameters and capacity to learn visual scene information. Experimental results demonstrate that ViT-TTS achieves new state-of-the-art results, outperforming cascaded systems and other baselines regardless of the visibility of the scene. With low-resource data (1h, 2h, 5h), ViT-TTS achieves results comparable to rich-resource baselines. (Audio samples are available at https://ViT-TTS.github.io/.)


Introduction
Text-to-speech (TTS) (Ren et al., 2019; Huang et al., 2022a,b) aims to synthesize audio that is consistent with the reference samples in terms of semantic meaning, timbre, emotion, and melody, and has shown remarkable advancements with the advent of Denoising Diffusion Probabilistic Models (DDPMs). However, the perceived audio quality is not solely determined by these aspects, as it is also influenced by the surrounding physical environment. For instance, a room with hard surfaces like concrete or glass reflects sound waves, whereas a room with soft surfaces such as carpets or curtains absorbs them. This variance can drastically impact the clarity and quality of the sound we hear.
To ensure an authentic and captivating experience, it is imperative to accurately model the acoustics of a room, particularly in virtual reality (VR) and augmented reality (AR) applications. Recent years have seen a surge of significant research (Li et al., 2022; Radford et al., 2021; Li et al., 2023; Huang et al., 2023b) addressing the language-visual modeling problem. For instance, Li et al. (2022) proposed a unified video-language pre-training framework for learning robust representations, while Radford et al. (2021) focused on pre-training on large-scale image-text pairs via contrastive learning. Visual TTS opens up numerous practical applications, including dubbing archival films, providing a more immersive and realistic experience in virtual and augmented reality, and adding appropriate sound effects to games.
Despite the benefits of language-visual approaches, training visual TTS models typically requires a large amount of training data, while there are very few resources providing parallel text-visual-audio data due to the heavy annotation workload. Besides, creating a sound experience that matches the visual content remains challenging when developing AR/VR applications, as it is still unclear how various regions of an image contribute to reverberation and how to incorporate the visual modality as auxiliary information in TTS.
In this work, we formulate the task of visual TTS as generating audio with reverberation effects in target scenarios given text and an environmental image, and introduce ViT-TTS to address the issues of data scarcity and room acoustic modeling. To enhance visual-acoustic matching, we 1) propose visual-text fusion to integrate visual and textual information, which provides fine-grained language-visual reasoning by attending to regions of the image; 2) leverage a transformer architecture to promote the scalability of the diffusion model. Regarding the data shortage challenge, we pre-train the encoder and decoder in a self-supervised manner, showing that large-scale pre-training reduces the data requirements for training visual TTS models.
Experimental results demonstrate that ViT-TTS generates speech samples with accurate reverberation effects in target scenarios, achieving new state-of-the-art results in terms of perceptual quality. In addition, we investigate the scalability of ViT-TTS and its performance under low-resource conditions (1h/2h/5h). The main contributions of this work are summarized as follows:
• We propose the first visual text-to-speech model, ViT-TTS, with visual-text fusion, which enables the generation of high-perceived audio that matches the physical environment.
• We show that large-scale pre-training alleviates the data scarcity in training visual TTS models.
• We introduce a diffusion transformer that is scalable in terms of parameters and capacity to learn visual scene information.
• Experimental results on subjective and objective evaluations demonstrate state-of-the-art results in terms of perceptual quality. With low-resource data (1h, 2h, 5h), ViT-TTS achieves results comparable to rich-resource baselines.
Related Work

Text-To-Speech
Text-to-speech (TTS) approaches fall into two categories: (1) first generating a mel-spectrogram from a text or phoneme sequence (Wang et al., 2017; Ren et al., 2019) and then converting the generated spectrogram into a waveform via a vocoder (Kong et al., 2020; Lee et al., 2022; Huang et al., 2022a); (2) generating audio directly from text (Donahue et al., 2020; Kim et al., 2021). Earlier TTS models (Li et al., 2019; Wang et al., 2017) adopt an autoregressive manner, which suffers from slow inference speed. As a solution, non-autoregressive models have been proposed to enable fast inference by generating mel-spectrograms in parallel. More recently, Grad-TTS (Popov et al., 2021), DiffSpeech (MoonInTheRiver, 2021), and ProDiff (Huang et al., 2022c) have employed diffusion generative models to generate high-quality audio, but they all rely on convolutional architectures such as WaveNet (Oord et al., 2016) and U-Net (Ronneberger et al., 2015) as the backbone.
In contrast, some studies (Peebles and Xie, 2023; Bao et al., 2023) on image generation tasks have explored transformers (Vaswani et al., 2017) as an alternative to convolutional architectures, achieving results competitive with U-Net. In this paper, we present the first transformer-based diffusion model for TTS as an alternative to convolutional architectures. By harnessing the scalable properties of transformers, we enhance the model capacity to more effectively capture visual scene information and improve model performance.

Self-supervised Pre-training
There are two main criteria for optimizing speech pre-training: contrastive loss (Oord et al., 2018; Chung and Glass, 2020; Baevski et al., 2020) and masked prediction loss (Devlin et al., 2018). Contrastive loss is used to distinguish between positive and negative samples with respect to a reference sample, while masked prediction loss was originally proposed for natural language processing (Devlin et al., 2018; Lewis et al., 2019) and later applied to speech processing (Baevski et al., 2020; Hsu et al., 2021). Some recent work (Chung et al., 2021) has combined the two approaches, achieving good performance on downstream automatic speech recognition (ASR) tasks. In this work, we leverage the success of self-supervised learning to enhance both the encoder and the decoder and thereby alleviate the data scarcity issue.

Acoustic Matching
The primary objective of acoustic matching is to convert audio from a source environment into audio that resembles the target environment. In the field of blind estimation (Mack et al., 2020; Xiong et al., 2018; Murgai et al., 2017; Mezghani and Swindlehurst, 2018), acoustic matching is applied to generate a simple room impulse response (RIR) that can be used to synthesize the corresponding target audio, characterized by two critical acoustic metrics: the direct-to-reverberant ratio (DRR) (Zahorik, 2002) and the reverberation time 60 (RT60) (Ratnam et al., 2003). The music production community also employs acoustic matching to modify reverberation, thus simulating the reverberation of a target space or processing algorithm (Koo et al., 2021; Sarroff and Michaels, 2020). Recently, there has been research on visual acoustic matching (Chen et al., 2022), which generates audio as if recorded in a target environment given an input source audio clip and an image of that environment. However, our proposed visual TTS is distinct from the work mentioned above, as it aims to generate audio that captures the room acoustics of the target environment from written text and an image of the target environment.

Overview
The overall architecture is presented in Figure 1. To alleviate the issue of data scarcity, we leverage unlabeled data to pre-train the visual-text encoder and denoiser decoder with scalable transformers in a self-supervised manner. To capture the visual scene information, we employ a visual-text fusion module to reason about how different image patches contribute to the text. As a neural vocoder, BigvGAN (Lee et al., 2022) converts the mel-spectrograms into audio that matches the target scene.

Enhanced visual-text Encoder
Self-supervised Pre-training The advent of masked language models (Devlin et al., 2018; Clark et al., 2020) has marked a significant milestone in the field of natural language processing. To alleviate the data scarcity issue (Huang et al., 2022d; Liu et al., 2023; Huang et al., 2023c) and learn a robust contextual encoder, we adopt a BERT-like masking strategy in the pre-training stage. Specifically, we randomly mask 15% of each phoneme sequence and predict only those masked tokens rather than reconstructing the entire input.
The masked phoneme sequence is then fed into the text encoder to obtain hidden states. The final hidden states are passed through a linear projection layer over the vocabulary to obtain the predicted tokens. Finally, we compute the cross-entropy loss between the predicted tokens and the target tokens.
The [MASK] token used during the pre-training phase does not appear in the fine-tuning phase. To mitigate this mismatch between pre-training and fine-tuning, we handle the phonemes selected for masking as follows: 1) with 80% probability, replace them with [MASK]; 2) with 10% probability, keep the phoneme unchanged; and 3) with 10% probability, replace it with a random token from the dictionary.
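A minimal sketch of this masking scheme, assuming integer phoneme IDs; the MASK_ID value, the -100 ignore index, and the helper name are illustrative assumptions, not taken from the paper:

```python
import random

MASK_ID = 0  # hypothetical id of the [MASK] token in the phoneme vocabulary

def mask_phonemes(phonemes, vocab_size, mask_ratio=0.15):
    """BERT-style masking: select 15% of positions; of those,
    80% become [MASK], 10% stay unchanged, 10% become a random token."""
    inputs, targets = list(phonemes), [-100] * len(phonemes)  # -100 = ignored by CE loss
    n_mask = max(1, int(len(phonemes) * mask_ratio))
    for i in random.sample(range(len(phonemes)), n_mask):
        targets[i] = phonemes[i]          # loss is computed only on selected positions
        r = random.random()
        if r < 0.8:
            inputs[i] = MASK_ID           # 80%: replace with [MASK]
        elif r < 0.9:
            pass                          # 10%: keep the original phoneme
        else:
            inputs[i] = random.randrange(vocab_size)  # 10%: random token
    return inputs, targets
```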

Visual-Text Fusion
In the fine-tuning stage, we integrate the visual modality into the encoder via a fusion module that combines visual and textual information. Before feeding into the visual-text encoder, we first extract image features from panoramic images through ResNet18 (He et al., 2016) and obtain phoneme embeddings. Both the image features and phoneme embeddings are fed into a variant of the transformer to obtain the hidden sequences. Specifically, we first pass the phonemes through relative self-attention:

$$e_{ij} = \frac{Q_i (K_j + R_{ij})^{\top}}{\sqrt{d_k}}, \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{l=1}^{n} \exp(e_{il})}, \qquad z_i = \sum_{j=1}^{n} \alpha_{ij}\,(V_j + R_{ij}),$$

where n is the length of the phoneme embedding, R_ij is the relative position embedding of key and value, d_k is the dimension of the key, and Q, K, V are all derived from the phoneme embedding. We use relative self-attention to model how much phoneme p_i attends to phoneme p_j. After that, we use cross-attention instead of a simplistic concatenation approach, since it lets us reason about how different image patches contribute to the text after feature extraction:

$$\mathrm{CrossAttention}(P, V) = \mathrm{softmax}\!\left(\frac{P V^{\top}}{\sqrt{d_v}}\right) V,$$

where P is the phoneme embedding, V is the visual features, and d_v is the dimension of the visual features.
Finally, the feed-forward layer is applied to output the hidden sequence.
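The following PyTorch sketch illustrates the fusion step described above; the head count, feed-forward size, and residual arrangement are assumptions for illustration, and the paper's exact block layout is in Figure 1:

```python
import torch
import torch.nn as nn

class VisualTextFusion(nn.Module):
    """Sketch of the fusion step: phonemes attend to image patch features
    via multi-head cross-attention (hidden size 256 follows Section 4)."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, phoneme_hidden, visual_feats):
        # phoneme_hidden: [B, n_phonemes, d], output of relative self-attention
        # visual_feats:   [B, n_patches, d], ResNet18 features of the panorama
        fused, attn = self.cross_attn(query=phoneme_hidden,
                                      key=visual_feats, value=visual_feats)
        h = phoneme_hidden + fused
        # attn weights expose how much each image patch contributes to each phoneme
        return h + self.ffn(h), attn
```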

Enhanced Diffusion Transformer
Scalable Transformer As a rapidly growing category of generative models, DDPMs have demonstrated their exceptional ability to deliver top-notch results in both image (Zhang and Agrawala, 2023; Ho and Salimans, 2022) and audio synthesis (Huang et al., 2022c, 2023a; Lam et al., 2021). However, the most dominant diffusion TTS models adopt a convolutional architecture such as WaveNet or U-Net as the de facto choice of backbone. This architectural choice limits the model's scalability for effectively incorporating panoramic visual images.
Recent research (Peebles and Xie, 2023; Bao et al., 2023) in the image synthesis field has revealed that the inductive bias of convolutional structures is not a critical determinant of DDPMs' performance. Instead, transformers have emerged as a viable alternative.
For this reason, we propose a diffusion transformer that leverages the scalability of transformers to expand model capacity and incorporate room acoustic information. Moreover, we adopt the adaptive normalization layers used in GANs and initialize each transformer block as the identity function to enhance the transformer architecture.
Unconditional Pre-training In this part, we investigate self-supervised learning from large amounts of unlabeled mel-spectrogram data to alleviate data scarcity. Specifically, assuming the target mel-spectrogram is x_0, we first randomly select a proportion of 0.065 of its time steps as starting indices and apply a mask that spans 10 steps, following wav2vec 2.0 (Baevski et al., 2020). Then, we obtain x_t through a diffusion process, which is defined by a fixed Markov chain from the data x_0 to the latent variable x_t.
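A sketch of this span masking, assuming masked frames are simply zeroed out (the paper does not specify the fill value):

```python
import torch

def span_mask(mel, p=0.065, span=10):
    """wav2vec 2.0-style span masking over mel-spectrogram frames:
    sample ~p of the frames as start indices and mask the next `span` steps."""
    T = mel.shape[-1]                       # mel: [n_mels, T]
    mask = torch.zeros(T, dtype=torch.bool)
    starts = torch.rand(T) < p              # each frame is a start with prob p
    for s in torch.nonzero(starts).flatten().tolist():
        mask[s:s + span] = True             # spans may overlap, as in wav2vec 2.0
    masked = mel.clone()
    masked[:, mask] = 0.0                   # fill value for masked frames (assumption)
    return masked, mask
```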
At each diffusion step t ∈ [1, T], a small Gaussian noise is added to x_{t-1} to obtain x_t, according to a small positive constant β_t:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right).$$

The x_t obtained from the diffusion process is passed through the transformer to predict the Gaussian noise ϵ_θ. The loss is defined as the mean squared error in the ϵ space, and efficient training optimizes a random term of t with stochastic gradient descent:

$$\mathcal{L}_{\theta} = \mathbb{E}_{x_0, \epsilon, t}\left[\left\lVert \epsilon - \epsilon_{\theta}\!\left(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\right)\right\rVert_2^2\right], \qquad \bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s), \quad \epsilon \sim \mathcal{N}(0, I).$$

To this end, ViT-TTS takes advantage of this reconstruction loss to learn self-supervised representations, which largely alleviates the challenge of data scarcity. A detailed formulation of DDPMs is provided in Appendix C.
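Putting the two equations together, a single training step might look like the following sketch; the denoiser's model(x_t, t) signature is an assumption, and conditioning inputs are omitted for brevity:

```python
import torch
import torch.nn.functional as F

def diffusion_train_step(model, x0, T=100, beta_1=1e-4, beta_T=0.06):
    """One DDPM training step: sample t, diffuse x0 to x_t in closed form,
    and regress the injected noise. The beta schedule follows Section 4."""
    betas = torch.linspace(beta_1, beta_T, T)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)          # \bar{alpha}_t
    t = torch.randint(0, T, (x0.shape[0],))                # random step per sample
    eps = torch.randn_like(x0)
    a = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))     # broadcast over mel dims
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * eps           # closed-form q(x_t | x_0)
    return F.mse_loss(model(x_t, t), eps)                  # epsilon-space MSE
```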
Controllable Fine-tuning During the fine-tuning stage, we face the following challenges: (1) there is a data scarcity issue with the available panoramic images and target environmental audio for training; (2) a fast training method is equally crucial for optimizing the diffusion model, as it can save a significant amount of time and storage space. To address these challenges, we draw inspiration from Zhang and Agrawala (2023) and implement a swift fine-tuning technique. Specifically, we create two copies of the pre-trained diffusion model weights, namely a "trainable copy" and a "locked copy," to learn the input conditions. We fix all parameters of the pre-trained transformer, designated as Θ, and duplicate them into trainable parameters Θ_t. We train these trainable parameters and connect them to the "locked copy" via zero convolution layers. These convolution layers are unique in that they have 1×1 kernels with weights and biases initialized to zero, progressively growing from zeros to optimized parameters in a learned fashion.
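A simplified sketch of this scheme, assuming the condition can be added to the trainable copy's input; the exact injection points follow Zhang and Agrawala (2023) and are not fully specified here:

```python
import copy
import torch.nn as nn

def zero_conv(channels):
    """1x1 convolution initialized to zero, so the trainable branch
    initially contributes nothing and grows during fine-tuning."""
    conv = nn.Conv1d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControllableDenoiser(nn.Module):
    """Sketch: frozen pre-trained denoiser ("locked copy") plus a trainable
    copy whose output is injected through a zero convolution."""
    def __init__(self, pretrained_denoiser, channels=384):
        super().__init__()
        self.trainable = copy.deepcopy(pretrained_denoiser)  # trainable copy Theta_t
        self.locked = pretrained_denoiser
        for p in self.locked.parameters():
            p.requires_grad = False                  # fix pre-trained weights Theta
        self.zero = zero_conv(channels)

    def forward(self, x_t, t, condition):
        # the trainable copy sees the condition; its output enters via zero conv
        residual = self.zero(self.trainable(x_t + condition, t))
        return self.locked(x_t, t) + residual
```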

Architecture
As illustrated in Figure 1, our model comprises a visual-text encoder, a variance adaptor, and a spectrogram denoiser. The visual-text encoder converts phoneme embeddings and visual features into hidden sequences, while the variance adaptor predicts the duration of each hidden sequence to regulate the length of the hidden sequences to match that of the speech frames. Furthermore, different variances like pitch and speaker embedding are incorporated into the hidden sequences following FastSpeech 2 (Ren et al., 2022). Finally, the spectrogram denoiser iteratively refines the length-regulated hidden states into mel-spectrograms. We provide more details in Appendix B.
Visual-Text Encoder The visual-text encoder consists of relative position transformer blocks based on the transformer architecture. Specifically, it comprises a pre-net for phoneme embedding, a visual feature extractor for the image, and a transformer encoder that includes multi-head self-attention, multi-head cross-attention, and a feed-forward layer.
Variance Adaptor In the variance adaptor, the duration and pitch predictors share a similar model structure, consisting of a 2-layer 1D convolutional network with ReLU activation, each layer followed by layer normalization and dropout, and an extra linear layer to project the hidden states into the output sequence.
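A sketch of this predictor; the kernel size and dropout rate are assumptions, as such details appear only in Appendix B:

```python
import torch.nn as nn

class VariancePredictor(nn.Module):
    """Sketch of the duration/pitch predictor: two Conv1D+ReLU blocks,
    each followed by LayerNorm and dropout, then a linear projection."""
    def __init__(self, d_model=256, kernel_size=3, dropout=0.5):
        super().__init__()
        self.conv1 = nn.Conv1d(d_model, d_model, kernel_size, padding=kernel_size // 2)
        self.conv2 = nn.Conv1d(d_model, d_model, kernel_size, padding=kernel_size // 2)
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        self.proj = nn.Linear(d_model, 1)  # scalar duration (or pitch) per position

    def forward(self, h):                  # h: [B, T, d_model]
        # Conv1d expects [B, C, T], hence the transposes around each conv
        x = self.dropout(self.norm1(self.conv1(h.transpose(1, 2)).transpose(1, 2).relu()))
        x = self.dropout(self.norm2(self.conv2(x.transpose(1, 2)).transpose(1, 2).relu()))
        return self.proj(x).squeeze(-1)    # [B, T]
```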
Spectrogram Denoiser The spectrogram denoiser takes x_t as input to predict the noise ϵ added in the diffusion process, conditioned on the step embedding E_t and the encoder output. We adopt a variant of the transformer as our backbone and make improvements upon the standard transformer motivated by Peebles and Xie (2023), which mainly include: (1) we replace the standard layer norm layers in transformer blocks with adaptive layer norm (adaLN), regressing scale and shift parameters from the sum of the embedding vector of t and the hidden sequence; (2) inspired by ResNets (He et al., 2016), we initialize each transformer block as the identity function by initializing the modulation MLP to output the zero vector.
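The sketch below shows one attention sublayer with adaLN and identity initialization, using the denoiser dimensions from Section 4 (hidden size 384, 12 heads); the scale/shift/gate layout follows the adaLN-Zero design of Peebles and Xie (2023) and is illustrative rather than the exact block:

```python
import torch.nn as nn

class AdaLNZeroBlock(nn.Module):
    """Sketch of one denoiser sublayer with adaptive layer norm: scale, shift,
    and gate are regressed from the conditioning (step embedding + encoder
    output); the zero-initialized projection makes the block start as identity."""
    def __init__(self, d_model=384, n_heads=12):
        super().__init__()
        self.norm = nn.LayerNorm(d_model, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.to_mod = nn.Linear(d_model, 3 * d_model)   # -> gamma, beta, gate
        nn.init.zeros_(self.to_mod.weight)              # identity at initialization
        nn.init.zeros_(self.to_mod.bias)

    def forward(self, x, cond):                         # x: [B, T, d], cond: [B, d]
        gamma, beta, gate = self.to_mod(cond).unsqueeze(1).chunk(3, dim=-1)
        h = self.norm(x) * (1 + gamma) + beta           # adaLN modulation
        h, _ = self.attn(h, h, h)
        return x + gate * h                             # gated residual, zero at init
```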

Pre-training, Fine-tuning, and Inference Procedures
Pre-training The pre-training has two stages: 1) encoder stage: pre-train the visual-text encoder via the masked LM loss L_CE (i.e., cross-entropy loss) to predict the masked tokens; 2) decoder stage: the masked x_0 is fed into the denoiser to predict the Gaussian noise ϵ_θ, and the mean squared error (MSE) loss is applied between the predicted and target Gaussian noise.

Fine-tuning
We begin by loading the model weights from the pre-trained visual-text encoder and unconditional diffusion decoder, after which we fine-tune both until the model converges. The final loss consists of the following parts: (1) sample reconstruction loss L_θ: the MSE between the predicted and target Gaussian noise;
(2) variance reconstruction losses L_dur and L_p: the MSE between the predicted and target phoneme-level duration and pitch.
Inference During inference, DDPM iteratively runs the reverse process to obtain the data sample x 0 , and then we use a pre-trained BigvGAN-16khz-80band as the vocoder to transform the generated mel-spectrograms into waveforms.
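A sketch of this reverse process with the β schedule from Section 4; conditioning inputs are again omitted, and the posterior variance is set to β_t as in the original DDPM formulation:

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape, T=100, beta_1=1e-4, beta_T=0.06):
    """Sketch of the reverse process: start from Gaussian noise and iteratively
    denoise to a mel-spectrogram, which the vocoder then converts to a waveform."""
    betas = torch.linspace(beta_1, beta_T, T)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                              # x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps = model(x, torch.full((shape[0],), t))      # predicted noise at step t
        mean = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise              # sigma_t^2 = beta_t
    return x                                            # x_0: generated mel-spectrogram
```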

Model Configurations
The size of the phoneme vocabulary is 73. The dimension of the phoneme embeddings and the hidden size of the visual-text transformer blocks are both 256. We use a pre-trained ResNet18 as the image feature extractor. As for the pitch encoder, the size of the lookup table and of the encoded pitch embedding are set to 300 and 256, respectively. In the denoiser, the number of transformer-B layers is 5, with hidden size 384 and 12 attention heads. We initialize each transformer block as the identity function and set T to 100, with β increasing linearly from β_1 = 10^{-4} to β_T = 0.06. More detailed information on the model configuration is provided in Appendix B.

Pre-training, Fine-tuning, and Inference During the pre-training stage, we pre-train the encoder for 120k steps and the decoder for 160k steps until convergence. The diffusion probabilistic models were trained on 1 NVIDIA A100 GPU with a batch size of 48 sentences. In the inference stage, we uniformly use a pre-trained BigvGAN-16khz-80band (Lee et al., 2022) as the vocoder to transform the generated mel-spectrograms into waveforms.

Scalable Diffusion Transformer
We compare diffusion transformers sweeping over the model configurations (S, B, L, XL) and conduct evaluations in terms of audio quality and parameter count. Appendix A gives the details of the model configurations. The results are shown in Table 1, from which we make the following observations: (1) Increasing the depth and number of layers of the transformer significantly enhances the performance of the diffusion model, improving both objective and subjective metrics; this demonstrates that expanding the model size enables finer-grained room acoustic modeling. (2) Our proposed diffusion transformer outperforms the WaveNet backbone with a similar number of parameters across both the test-unseen and test-seen sets, most significantly in the RT60 metric. We attribute this to the fact that, instead of directly concatenating the conditional input as WaveNet does, we replace the standard layer norm layers in transformer blocks with adaptive layer norm to regress dimension-wise scale and shift parameters from the sum of the embedding vectors of the diffusion step and the encoder output, which better incorporates the conditional information, as shown for GANs (Brock et al., 2018; Karras et al., 2019).

Model Performances
In this study, we conduct a comprehensive comparison of the generated audio quality against other systems, including 1) GT, the ground-truth audio; 2) GT (voc.), where we first convert the ground-truth audio into mel-spectrograms and then convert them back to audio using BigvGAN; 3) DiffSpeech (MoonInTheRiver, 2021), one of the most popular DDPMs based on WaveNet; 4) ProDiff (Huang et al., 2022c), a recent generator-based diffusion model proposed to reduce sampling time; 5) Visual-DiffSpeech, which incorporates our visual-text fusion module into DiffSpeech; 6) Cascaded, a system composed of DiffSpeech and Visual Acoustic Matching (VAM) (Chen et al., 2022). The results, compiled and presented in Table 2, provide valuable insights into the effectiveness of our approach: (1) As expected, the results on the test-unseen set are poorer than on the test-seen set because the test-unseen set contains unseen scenarios. However, our proposed model achieves the best performance among all systems on both sets, indicating that our model generates the best-perceived audio matching the target environment from written text.
(2) Our model surpasses the TTS diffusion models (i.e., DiffSpeech and ProDiff) across all metric scores, especially in terms of RTE values. This suggests that conventional diffusion models in TTS do poorly at modeling room acoustic information, as they mainly focus on audio content, pitch, energy, etc. Our proposed visual-text fusion module addresses this challenge by injecting visual properties into the model, resulting in more accurate prediction of the correct acoustics from images and high-perceived audio synthesis. (3) The comparison with Visual-DiffSpeech highlights the advantages of our choice of transformer backbone and self-supervised pre-training. Although Visual-DiffSpeech adds the visual-text module, the choice of WaveNet and the lack of a self-supervised pre-training strategy make it perform worse at predicting the correct acoustics from images and synthesizing high-perceived audio. (4) The cascaded system composed of DiffSpeech and Visual Acoustic Matching, which models visual properties, is better than the other baselines. However, compared to our proposed model, it performs worse in both the test-unseen and test-seen environments. This suggests that our direct visual text-to-speech system eliminates the error propagation caused by the cascaded pipeline, resulting in high-perceived audio. In conclusion, our comprehensive evaluation demonstrates the effectiveness of our proposed model in generating high-quality audio that matches the target environment.

Low Resource Evaluation
Training visual text-to-speech models typically requires a large amount of parallel target-environment image and audio training data, while very few such resources exist due to the heavy annotation workload. In this section, we prepare low-resource audio-visual data (1h/2h/5h) and leverage large-scale text-only and audio-only data to boost the performance of the visual TTS system, in order to investigate the effectiveness of our self-supervised learning methods. The results are compiled and presented in Table 3.

Case Study
We provide two examples of generation sampled from a large empty room with significant reverberation in the test-seen environment, depicted in Figure 2, and make the following observations: 1) Mel-spectrograms produced by ViT-TTS are noticeably more similar to the target counterparts. 2) Moreover, in challenging scenarios with unseen scene images, cascaded systems suffer severely from noise and missing reverberation details, an issue that is largely alleviated in ViT-TTS.

Figure 2: Visualizations of the ground-truth and generated mel-spectrograms by different visual TTS models. The text corresponding to the first line (test-seen) is "it is so made that everywhere we feel the sense of punishment", while the second line (test-unseen) is "the task will not be difficult returned david hesitating though i greatly fear your presence would rather increase than mitigate his unhappy fortunes".

Ablation Studies
We conduct ablation studies on encoder pre-training, decoder pre-training, and the visual input (see Table 4). We find that after replacing the target image with a random image, the performance of our model significantly degrades, indicating that our model indeed captures the room acoustic information of the visual input.

Conclusion
In this paper, we proposed ViT-TTS, the first visual text-to-speech model, which converts written text and a target environmental image into audio that matches the target environment. To mitigate the data scarcity for training visual TTS models and to model visual acoustic information, we 1) introduced a self-supervised learning framework to enhance both the visual-text encoder and denoiser decoder; 2) leveraged a diffusion transformer scalable in terms of parameters and capacity to improve performance. Experimental results demonstrated that ViT-TTS achieves new state-of-the-art results and performs comparably to rich-resource baselines even with limited data. To this end, ViT-TTS provides a solid foundation for future visual text-to-speech studies, and we envision that our approach will have far-reaching impacts on the fields of AR and VR.

Limitation and Potential Risks
As indicated in the experimental setup, we utilized ResNet-18 as our image feature extractor. While it is a classic extractor, newer extractors may perform better. In future work, we will explore the use of superior extractors to enhance the quality of the generated audio.
Moreover, our pre-trained encoder and decoder are based on the SoundSpace-Speech dataset, which, as described in the dataset section, is not sufficiently large.To address this limitation in future work, we will pre-train on a large-scale dataset to achieve better performance in low-resource scenarios.
ViT-TTS lowers the barrier to visual text-to-speech generation, which may enable fraud and scams by impersonating someone else's voice. Furthermore, it has the potential to aid the spread of false information and rumors.

E EVALUATION MATRIX

E.1 Evaluation Metrics
We measure the sample quality of the generated waveforms using both objective metrics and subjective indicators. The objective metrics are designed to measure various aspects of waveform quality between the ground-truth audio and the generated sample. Following the common practice of (Huang et al., 2022c; MoonInTheRiver, 2021; Popov et al., 2021), we randomly select a part of the test set (50 samples) for objective evaluation. We provide the following metrics: (1) RT60 Error (RTE): the correctness of the room acoustics, measured between the predicted and target waveforms' RT60 values. RT60 indicates the reverberation time in seconds for an audio signal to decay by 60 dB, a standard metric for characterizing room acoustics. We estimate the RT60 directly from the magnitude spectrograms of the output audio, using a model trained on disjoint SoundSpaces data.
(2) Mel Cepstral Distortion (MCD): measures the spectral distance between the synthesized and reference mel-spectrum features. Since RTE is solely intended for evaluating the room acoustic performance of the generated audio, we additionally incorporate the MCD metric to assess the quality of the mel-spectrogram.
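A rough sketch of an MCD computation; this is a generic formulation with crude length alignment (real evaluations often align frames with DTW first), not necessarily the exact script used here:

```python
import numpy as np
import librosa

def mel_cepstral_distortion(ref_wav, syn_wav, sr=16000, n_mfcc=13):
    """Frame-wise MCD between reference and synthesized waveforms,
    dropping the 0th (energy) coefficient as is conventional."""
    ref = librosa.feature.mfcc(y=ref_wav, sr=sr, n_mfcc=n_mfcc)[1:]
    syn = librosa.feature.mfcc(y=syn_wav, sr=sr, n_mfcc=n_mfcc)[1:]
    n = min(ref.shape[1], syn.shape[1])                   # crude length alignment
    diff = ref[:, :n] - syn[:, :n]
    k = 10.0 / np.log(10.0) * np.sqrt(2.0)                # standard MCD constant
    return float(np.mean(k * np.sqrt((diff ** 2).sum(axis=0))))
```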
For subjective metrics, we use crowd-sourced human evaluation via Amazon Mechanical Turk, where raters are asked to rate the Mean Opinion Score (MOS) on a 1-5 Likert scale.

E.2 RT60 Estimator
Following Chen et al. (2022), we first encode 2.56 s speech clips as spectrograms, process them with a ResNet18 (He et al., 2016), and predict the RT60 of the speech. The ground-truth RT60 is calculated with the Schroeder method (Schroeder, 1965). We optimize the MSE loss between the predicted and ground-truth RT60.
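A sketch of the Schroeder backward-integration estimate from a room impulse response; the -5 to -25 dB fitting range is a common convention, not necessarily the exact one used here:

```python
import numpy as np

def rt60_schroeder(rir, sr):
    """Schroeder (1965) method: backward-integrate the squared impulse
    response to get the energy decay curve, fit the -5 to -25 dB segment,
    and extrapolate the slope to a 60 dB decay."""
    edc = np.cumsum(rir[::-1] ** 2)[::-1]                 # energy decay curve
    edc_db = 10.0 * np.log10(edc / edc[0] + 1e-12)        # normalize, in dB
    t = np.arange(len(rir)) / sr
    seg = (edc_db <= -5.0) & (edc_db >= -25.0)            # T20 fitting region
    slope, intercept = np.polyfit(t[seg], edc_db[seg], 1)
    return -60.0 / slope                                  # seconds to decay 60 dB
```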

E.3 MOS Evaluation
To probe audio quality, we conduct MOS (mean opinion score) tests and explicitly instruct the raters to "focus on examining the audio quality, naturalness, and whether the audio matches the given image." The testers listen to and rate the samples, and each tester is asked to evaluate the subjective naturalness on a 1-5 Likert scale.
Our subjective evaluation tests are crowd-sourced and conducted via Amazon Mechanical Turk. These ratings are obtained independently for model samples and reference audio, and both are reported. Screenshots of the instructions for testers are shown in Figure 3. A small subset of the speech samples used in the test is available at https://ViT-TTS.github.io/.

F LOW RESOURCE SETTING
We partition the training set of SoundSpaces-Speech into 1h/2h/5h subsets based on the alphabetical order of speech IDs.Subsequently, we employ these subsets to fine-tune our pre-trained models and assess their performance on identical test sets.

Figure 1:
The overall architecture of ViT-TTS. In subfigure (b), V_i denotes the visual sequence and N_1 denotes the number of encoder layers. In subfigure (c), N_2 is the number of transformer layers. α and β are the dimension-wise scale parameters, while γ is the dimension-wise shift parameter. c is the variance adaptor's output and t is the diffusion step.

Table 1:
Comparison between diffusion WaveNet and diffusion transformers sweeping over the model configurations (S, B, L, XL). All models remove the pre-training stage; other conditions not related to the backbone remain the same in training and inference.

Table 2:
Comparison with baselines on SoundSpaces-Speech for the seen and unseen scenarios. The diffusion step of all diffusion models is set to 100. We use the pre-trained model provided by VAM for the evaluation of the cascaded system.

Table 3:
Low-resource evaluation results.

Table 4:
Ablation study results. EP, DP, and RI denote encoder pre-training, decoder pre-training, and random images, respectively.