Translation-Enhanced Multilingual Text-to-Image Generation

Research on text-to-image generation (TTI) still predominantly focuses on the English language due to the lack of annotated image-caption data in other languages; in the long run, this might widen inequitable access to TTI technology. In this work, we thus investigate multilingual TTI (termed mTTI) and the current potential of neural machine translation (NMT) to bootstrap mTTI systems. We provide two key contributions. 1) Relying on a multilingual multi-modal encoder, we provide a systematic empirical study of standard methods used in cross-lingual NLP when applied to mTTI: Translate Train, Translate Test, and Zero-Shot Transfer. 2) We propose Ensemble Adapter (EnsAd), a novel parameter-efficient approach that learns to weigh and consolidate the multilingual text knowledge within the mTTI framework, mitigating the language gap and thus improving mTTI performance. Our evaluations on standard mTTI datasets COCO-CN, Multi30K Task2, and LAION-5B demonstrate the potential of translation-enhanced mTTI systems and also validate the benefits of the proposed EnsAd which derives consistent gains across all datasets. Further investigations on model variants, ablation studies, and qualitative analyses provide additional insights on the inner workings of the proposed mTTI approaches.


Introduction and Motivation
Text-to-Image Generation (TTI) is an emerging yet rapidly growing area, owing its recent progress to ever-growing deep generative models, larger-scale multi-modal datasets, and increasing computational resources. The success of recent TTI work is impressive; e.g., it is possible to synthesise not only high-resolution complex scenes (Ramesh et al., 2022; Rombach et al., 2022), but also surrealist and 'aesthetics-aware' paintings (Gallego, 2022).
However, current models are made and deployed almost exclusively for the English language (EN). This is primarily due to the lack of annotated image-caption data in other languages, which might result in inequitable access to TTI technology in the long run, especially for low-resource languages (Blasi et al., 2022). Hiring human annotators to write high-quality image descriptions is time-consuming and expensive; 'gold standard' data, if it exists at all, is thus typically used for evaluation purposes only (Lan et al., 2017; Aggarwal and Kale, 2020).
Even if we put the crucial concerns of data scarcity aside, training state-of-the-art (SotA) TTI models from scratch for each language is technically infeasible and impractical: it would consume massive computational resources, exceeding the capabilities of many research labs (Ramesh et al., 2021; Saharia et al., 2022) and raising concerns about its environmental impact (Schwartz et al., 2020). For instance, DALL-E (Ramesh et al., 2021) is trained on 1,024 × 16GB NVIDIA® V100 GPUs for a total of 430,000 updates. DALL-E Mega, an attempt to reproduce DALL-E's results, reports an estimated emission of 18,013.47 kg CO2-equivalents, training on a TPU v3-256 (128 × TPU v3 chips) for 56 days; the estimation is based on a publicly available machine learning emissions calculator (Luccioni et al., 2019).
Therefore, in this work, we focus on multilingual TTI (mTTI) through the lens of NLP's cross-lingual transfer learning methods, leaning on the reasonable assumption of having abundant image-text pairs in English (and/or a pretrained EN TTI model), but only limited gold-standard data for fine-tuning and evaluation in a target language.
In particular, we investigate the role of cross-lingual transfer and (neural) machine translation (MT) in bootstrapping mTTI, and we focus on two crucial research questions. (RQ1) Are standard MT-based cross-lingual transfer methods feasible for mTTI, and how do they compare with standard zero-shot cross-lingual transfer methods? (RQ2) Is it possible to enhance zero-shot cross-lingual transfer relying on (ensembles of) MT-generated output for improved mTTI?
Our experiments and core findings are based on several mTTI benchmarks. First, we use the standard and publicly available COCO-CN (Li et al., 2019) and Multi30K (Elliott et al., 2016), and we also build a new dataset for Finnish as a lower-resource language from LAION-5B (Schuhmann et al., 2022). Regarding RQ1, we then conduct a systematic empirical study comparing the standard cross-lingual transfer methods: TRANSLATE TRAIN, TRANSLATE TEST, and ZERO-SHOT TRANSFER. Our main results indicate that TRANSLATE TRAIN achieves the best performance, followed by ZERO-SHOT TRANSFER, which outperforms TRANSLATE TEST.
Regarding RQ2, we aim to combine MT-based and zero-shot cross-lingual transfer via fast and parameter-efficient fine-tuning. Inspired by the speech processing literature where a list of Automatic Speech Recognition (ASR) hypotheses can be jointly considered for downstream tasks (Ganesan et al., 2021; Liu et al., 2021) to alleviate the misrecognition of ASR systems, we propose a module within our mTTI framework termed Ensemble Adapter (ENSAD). It fuses the text encodings of 'non-English' text input and a set of its translations to English. Additionally inspired by Ponti et al. (2021), the idea is to combine the knowledge from multiple translations to mitigate potential translation errors, and in that way boost cross-lingual transfer for mTTI.
Our proposed method derives robust gains across all evaluation datasets. Besides offering SotA mTTI performance, the introduced ENSAD component also adds only 0.1% dedicated extra parameters (relative to the full mTTI model size) for each supported target language. Put simply, the use of ENSAD increases the portability of our mTTI framework through quick and parameter-efficient adaptation to new languages. The resources of our work are available at https://www.amazon.science/code-and-datasets/translation-enhanced-multilingual-text-to-image-generation.

Related Work
Text-to-Image Generation. There are generally two categories of standard TTI setups: 1) a supervised setup, where gold standard training and test data are from the same domain (e.g., both from MS-COCO); and 2) a zero-shot setup, where there is a domain difference between the training data (typically large-scale noisy Web-crawled data) and the high-quality test data (typically manually constructed). GAN-based models are common in supervised TTI setups (Reed et al., 2016; Xu et al., 2018; Zhu et al., 2019): they still hold the SotA results, offering smaller model sizes and faster image generation speed (Zhang et al., 2021; Tao et al., 2022; Zhou et al., 2022). GigaGAN (Kang et al., 2023), a recent attempt to scale up GAN models, achieves fairly strong and competitive zero-shot TTI performance. However, in the zero-shot setup, large Vector Quantised Variational Autoencoder (VQVAE)-based models (Ramesh et al., 2021; Crowson et al., 2022; Gafni et al., 2022) and large diffusion models (Nichol et al., 2022; Ramesh et al., 2022; Saharia et al., 2022) play the leading role and offer the best performance.
Multilingual and Non-EN TTI. Research on mTTI and non-EN TTI is currently limited and only in its infancy. Cogview is a large VQVAE-based Chinese TTI model with training data partly from crawling Chinese websites and social media platforms, and partly from translating EN data (Ding et al., 2021). ruDALL-E is a VQVAE-based Russian TTI model recreating DALL-E (Ramesh et al., 2021) with training data translated from EN data. To the best of our knowledge, there are only two existing papers attempting multilingual or cross-lingual TTI. Zhang et al. (2022) align two monolingual text encoders, one for the source and the other for the target language, with a fixed image generator pretrained on the source language (i.e., EN). Jung et al. (2022) take a step further, relying on a multilingual text encoder that supports more languages simultaneously.
We note several crucial differences to the prior work. 1) The two papers are based on earlier TTI models (Xu et al., 2018), which are now largely surpassed by recent SotA models (Zhou et al., 2022). 2) Their model designs are tied to the model of Xu et al. (2018) and cannot be easily adapted to the latest SotA TTI models. 3) They use traditional LSTM text encoders enhanced by mono-modal BERT features, while SotA TTI models (Zhou et al., 2022; Saharia et al., 2022; Rombach et al., 2022) use the multi-modal CLIP model (Radford et al., 2021). Therefore, we neither adopt them as baselines nor try to adapt them for our use, also taking into account the difficulty of replicating the prior work as no code has been released to date. In contrast, our work relies on the mCLIP text encoder (Carlsson et al., 2022), the multilingual version of CLIP, and is developed based on LAFITE (Zhou et al., 2022), a SotA TTI model. In fact, as shown later in our work, training an English TTI model using mCLIP without any further tuning can already realise zero-shot mTTI, similar to what has been attempted by Jung et al. (2022).
Translation-Based Cross-lingual Transfer. Machine translation (MT) at both lexical level and sentence level has been successfully used for cross-lingual transfer learning in NLP, where TRANSLATE TRAIN and TRANSLATE TEST usually serve as strong baselines for downstream tasks (Conneau et al., 2018; Glavaš et al., 2019; Hu et al., 2020; Ponti et al., 2021; Li et al., 2022a,b). In addition, MT is used to generate sentence pairs for training multilingual multi-modal models (Zhou et al., 2021; Carlsson et al., 2022). However, MT is still largely underexplored and underutilised for mTTI. In this work, we analyse the potential of MT to enhance multilingual and cross-lingual TTI.

Methodology
In this section, we first introduce our base mLAFITE model and three baseline approaches for mTTI (§3.1). Next, we propose an Ensemble Adapter module that can work in synergy with the pretrained mLAFITE model to improve mTTI performance (§3.2). Finally, we describe how we train our Ensemble Adapter and formulate our loss functions (§3.3).

mLAFITE and Baselines
For easier deployment and comparison of different cross-lingual transfer methods, our work focuses on the relatively lightweight GAN-based models, which are faster to train and evaluate compared with VQVAE-based models and large diffusion models (see §2). In particular, we adopt LAFITE (Zhou et al., 2022), a SotA GAN-based English TTI model, as our starting point. To unlock its multilingual capabilities, we replace its English-only CLIP text encoder (Radford et al., 2021) with mCLIP (Carlsson et al., 2022), which is already pretrained to align the sentence representation spaces of 68 languages.
There are three common categories of cross-lingual transfer approaches which we apply to mTTI and adopt as our principal baselines:
TRANSLATE TRAIN. We translate all the captions from the English training set (e.g., COCO) into a (non-EN) target language (L) relying on an MT system. We then train a LAFITE TTI model in the target language from scratch, relying on mCLIP as the text encoder. At inference, an L sentence is directly fed into the target-language TTI model.
The other two approaches instead rely on a TTI model pretrained with English data, and they do not require further tuning with captions in the target languages. As our first step, we pretrain an mCLIP-based LAFITE model (we call it mLAFITE for brevity) from scratch.
TRANSLATE TEST. At inference, we first translate a caption in L into EN via MT, and the EN translation then serves as mLAFITE's input.
ZERO-SHOT TRANSFER. Since mCLIP is a multilingual sentence encoder, text in L can be directly fed to our mLAFITE for TTI without any extra fine-tuning.
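The three baselines differ only in where translation enters the pipeline, which the following minimal Python sketch tries to make explicit. All callables (`translate_en_to_l`, `translate_l_to_en`, `encode`, `generate`) are hypothetical stand-ins for an MT system, the mCLIP text encoder, and the TTI generator; none of them is an API from the paper's code release.

```python
from typing import Callable, List, Tuple

Translator = Callable[[str], str]            # hypothetical MT system: text -> text
Encoder = Callable[[str], List[float]]       # hypothetical stand-in for mCLIP
Generator = Callable[[List[float]], object]  # hypothetical stand-in for the TTI generator


def build_translate_train_set(
    en_pairs: List[Tuple[object, str]], translate_en_to_l: Translator
) -> List[Tuple[object, str]]:
    """TRANSLATE TRAIN: machine-translate the EN training captions into L.

    A target-language TTI model is then trained from scratch on the result.
    """
    return [(image, translate_en_to_l(caption)) for image, caption in en_pairs]


def translate_test(
    caption_l: str, translate_l_to_en: Translator, encode: Encoder, generate: Generator
) -> object:
    """TRANSLATE TEST: translate the L caption into EN, then run the EN-pretrained model."""
    return generate(encode(translate_l_to_en(caption_l)))


def zero_shot_transfer(caption_l: str, encode: Encoder, generate: Generator) -> object:
    """ZERO-SHOT TRANSFER: feed the L caption to the multilingual encoder directly."""
    return generate(encode(caption_l))
```

Only TRANSLATE TRAIN changes the training data; the other two reuse one EN-pretrained model and differ in a single inference-time call.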

mLAFITE with Ensemble Adapter
We now propose an attention-based Ensemble Adapter (ENSAD) module that aims to improve mTTI via leveraging knowledge from multiple translations of the same input. The full pipeline and how ENSAD extends the base mLAFITE model are illustrated in Figure 1. Given an input sentence in language L, L ≠ EN, we first use any (N)MT system to sample a set of EN translations. We then deploy the ENSAD module between the mCLIP text encoder and the TTI generator to fuse the mCLIP-extracted embeddings, bridging the EN-L language domain gap. The adapter can be trained with only a small set of image-L text pairs while mCLIP and the TTI generator networks are kept frozen.
Formally, we use $x_0$ to denote the L input text, while $\{x_1, x_2, ..., x_m\}$ is a set of m EN translations of the L input text. The fixed mCLIP encoder extracts their respective ($l_2$-normalised) d-dimensional sentence embeddings, yielding the matrix $H = (h_0, h_1, ..., h_m) \in \mathbb{R}^{d \times (m+1)}$. Then, our proposed ENSAD learns to fuse these sentence encodings from H. The query (q), key (K), and value (V) inputs of our attention are defined in Eqs. (1)-(3). Note that $\{h_0, h_1, ..., h_m\}$ are all close to each other in the mCLIP representation space. Therefore, to focus on the 'additional information' contained in the EN translations, we take the difference between $h_i$, $i > 0$, and $h_0$ as in Eq. (3).
The calculation of attention scores is then based on the standard additive attention (Bahdanau et al., 2015). ENSAD's hidden size is $d_{hid}$; $W_q, W_k, W_v \in \mathbb{R}^{d_{hid} \times d}$ are the respective mappings for query, key, and value inputs; $b \in \mathbb{R}^{d_{hid}}$ is the bias, and $W_p \in \mathbb{R}^{1 \times d_{hid}}$ is a final projection matrix for deriving the attention scores. Then, the context vector is an attention-guided summarisation of V. ENSAD's final output is the linear combination of $h_0$ and the context vector; here, $W_o \in \mathbb{R}^{d \times d}$ is the output mapping, and $\alpha$ is an interpolation hyperparameter. We also $l_2$-normalise the outputs of Eqs. (3), (7), (8), as well as the $\tanh(W_o V)$ term in Eq. (6).
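To make the data flow through ENSAD concrete, here is a minimal NumPy sketch of one forward pass. It is an illustration under stated assumptions, not the authors' implementation: it assumes q = h_0, K = (h_1, ..., h_m), V = (h_1 - h_0, ..., h_m - h_0), that the value mapping W_v enters the additive score alongside W_q and W_k, that scores are softmax-normalised over the m translations, and that the output is (1 - α) h_0 + α · context. The function names and the default `alpha` are illustrative.

```python
import numpy as np


def l2_normalise(x, axis=0, eps=1e-12):
    """Scale x to unit l2 norm along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(a):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(a - a.max())
    return e / e.sum()


def ensad_forward(H, W_q, W_k, W_v, b, W_p, W_o, alpha=0.2):
    """H has shape (d, m+1): column 0 is the L input, columns 1..m its EN translations."""
    h0 = H[:, 0]                                   # assumed query, shape (d,)
    K = H[:, 1:]                                   # assumed keys, shape (d, m)
    V = l2_normalise(K - h0[:, None], axis=0)      # differences to h0, shape (d, m)
    # Additive attention (assumed form): one scalar score per translation.
    hidden = np.tanh((W_q @ h0 + b)[:, None] + W_k @ K + W_v @ V)  # (d_hid, m)
    scores = softmax((W_p @ hidden).ravel())       # (m,), sums to 1
    values = l2_normalise(np.tanh(W_o @ V), axis=0)  # (d, m)
    context = l2_normalise(values @ scores)        # attention-weighted summary, (d,)
    h = l2_normalise((1.0 - alpha) * h0 + alpha * context)
    return h, scores
```

With `alpha=0.0` the sketch reduces to passing h_0 through unchanged, which corresponds to ZERO-SHOT TRANSFER; larger values let the translation-derived context shift the text feature.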

Contrastive Adversarial Training
Our Generator (G) and Discriminator (D) network structures and the pretraining process of the base mLAFITE model all follow LAFITE's original implementation for supervised TTI. As illustrated in Figure 1, we take the pretrained mLAFITE and insert the ENSAD between mCLIP and G. We then adversarially train ENSAD and D iteratively while mCLIP and G are kept frozen. Additionally, we propose to optimise a novel contrastive objective aligning the D-extracted real image and fake (synthesised) image features in adversarial training.
The (m)LAFITE GAN framework is adapted from the popular unconditional StyleGAN2 framework (Karras et al., 2020b), which features a redesigned adaptive instance normalization mechanism (Huang and Belongie, 2017) in G: it enables the unconditional channel-wise 'style information' (e.g., pose, lighting, background style) to control G's image synthesis backbone (convolution and upsampling layers). The 'style information' is derived as follows: a random noise z is sampled from the standard Gaussian distribution $\mathcal{N}(0, I)$ and transformed into a so-called unconditional StyleSpace, which has been shown to be a well-disentangled intermediate latent space (Wu et al., 2021). LAFITE further proposes to inject text-conditioning information into the StyleSpace via a series of non-linear and affine mappings. In our pipeline, G takes our ENSAD-gathered feature h and noise z, and it then outputs a fake image $I^{fake} = G(h, z)$.
The discriminator has a characteristic 'two-branch' design: 1) D is in essence a convolutional image encoder, producing $f_D(I)$, a d-dim image feature for any real or fake (i.e., synthesised) input image I; 2) D also predicts if I is real or fake based on both I and h, where the prediction (a scalar output) is denoted as $D(I, h) = D_s(I) + h^T f_D(I)$. This is realised via adding two affine transformations on top of a shared visual backbone for deriving $f_D(I)$ and $D_s(I)$, respectively. We then define the adversarial (AD) losses for ENSAD and D following LAFITE, where n is the batch size and $\sigma(\cdot)$ is the sigmoid function.
We propose an auxiliary contrastive loss (Eq. (12)), aligning the discriminator-extracted $I^{fake}$ and $I^{real}$ features, where $\cos(\cdot)$ calculates the cosine similarity and $\tau$ is the temperature.
In the original LAFITE paper, there are already two auxiliary contrastive losses: 1) $L^G_{CL}$ aligns CLIP-extracted image features of $I^{fake}$ and the input text embedding, i.e., h in our case; 2) $L^D_{CL}$ aligns $f_D(I)$ with its associated h. In our preliminary experiments, we found that $L^G_{CL}$ was not useful for ENSAD, so we completely remove it. Our final losses for training ENSAD and D (Eqs. (13) and (14)) use two hyperparameters, $\lambda_1$ and $\lambda_2$, to control the weights of the contrastive losses. The full training process is also summarised in Algorithm 1, available in Appendix C. Note that the use of ENSAD introduces only up to 0.1% extra parameters for each target language relative to the full model size. This parameter efficiency boosts the portability of our mTTI framework, enabling quick and efficient adaptation to new languages.
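The loss terms described in this subsection can be sketched in NumPy as follows. This is a hedged illustration rather than the paper's exact formulation: the adversarial part assumes the standard sigmoid-based GAN losses (which is how we read "following LAFITE"), and the contrastive part assumes an InfoNCE-style objective in which row i of the fake features is matched against row i of the real features within a batch. The function names and the direction of the softmax normalisation are assumptions, and the weighting by λ1 and λ2 is deliberately left out because the text does not say which term each weight scales.

```python
import numpy as np


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -np.logaddexp(0.0, -x)


def adversarial_losses(d_real, d_fake):
    """d_real, d_fake: shape (n,), the scalar outputs D(I_real, h) and D(I_fake, h)."""
    # Generator side (here: ENSAD) wants fakes to be scored as real.
    loss_g = -np.mean(log_sigmoid(d_fake))
    # Discriminator wants reals scored high and fakes low; log(1 - sigmoid(x)) = log_sigmoid(-x).
    loss_d = -np.mean(log_sigmoid(d_real)) - np.mean(log_sigmoid(-d_fake))
    return loss_g, loss_d


def contrastive_loss(f_fake, f_real, tau=0.5):
    """f_fake, f_real: shape (n, d) D-extracted features; row i of each forms a positive pair."""
    f_fake = f_fake / np.linalg.norm(f_fake, axis=1, keepdims=True)
    f_real = f_real / np.linalg.norm(f_real, axis=1, keepdims=True)
    logits = (f_fake @ f_real.T) / tau  # (n, n) cosine similarities scaled by 1/tau
    log_prob = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)
    return -np.mean(np.diag(log_prob))
```

Lowering `tau` sharpens the softmax over the batch, so mismatched fake/real pairs are penalised more strongly.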

Datasets
mLAFITE pretraining is based on the MS-COCO (Chen et al., 2015) training set comprising 82,783 images, where each image is associated with 5 EN captions. 10% of the training set is held out as our dev set, and the rest is used for training. MS-COCO also provides a validation set (40,504 images), frequently used for TTI evaluation.
For mTTI, we choose evaluation datasets that satisfy the following criteria: a) no overlap between images in the test set and images used in pretraining; b) the test set includes at least 5K images; c) the captions are human-written descriptions and not (manual or MT-derived) translations from EN captions. Based on these requirements, we select three 'non-EN' datasets, outlined in what follows.
COCO-CN (Li et al., 2019) provides Chinese (ZH) captions (i.e., human descriptions) for 20,341 MS-COCO images. 6,748 of them are from the COCO validation set not seen during mLAFITE pretraining; we thus use them as our test set. We randomly sample 20% of the rest as our dev set (2,718), and the training set has 10,875 images. Each image has only one ZH caption. COCO-CN additionally offers 5,000 ZH sentences manually translated from EN captions; we only use the corresponding EN-ZH sentence pairs to calculate BLEU scores for comparing different MT systems.
LAION-5B (Schuhmann et al., 2022) is a large-scale Web-crawled vision-language dataset with 5 billion image-text pairs covering 100+ languages. We focus on Finnish (FI) as a lower-resource language for our evaluation. Unlike the carefully annotated COCO-CN and Multi30K, LAION-5B's data are noisy, so we rely on massive filtering to select relatively high-quality data. The full data creation process for FI is provided in Appendix D.
The final dataset comprises training, development and test portions with 10,000, 2,000, and 18,000 image-text pairs, respectively. Our manual inspection of the final dataset indicates that it is of acceptable quality, although it has its own characteristics (Appendix D) and its quality in general still cannot match that of COCO-CN or Multi30K. We use the data in our main experiments 1) as an initial trial to extend TTI evaluation to 'non-COCO-style' captions and another language and 2) for comparative analyses with COCO-CN and Multi30K.
Supplementary Dataset: IGLUE. In order to further widen the set of target languages, we also experiment with IGLUE xFlickr&CO (Bugliarello et al., 2022). It provides 2K images, where one half comes from the MS-COCO validation set and the other half from Multi30K, with associated human descriptions in 5 additional languages: Spanish (ES), Indonesian (ID), Japanese (JA), Russian (RU), and Turkish (TR). Since IGLUE does not offer a training set, we use it only for RQ1-related experiments. Although IGLUE does not comply with our criterion b) above, we use it to extend our empirical analyses to more languages.
Table 6 in Appendix A provides a full and systematic overview of languages and data statistics used in this work.

Experimental Setup
In what follows, we outline our experimental setups and choices related to the two core RQs from §1.
We also show details concerning [...] mBART50 (Liu et al., 2020; Tang et al., 2021), and M2M100 (Fan et al., 2021). We leverage them to generate the 1-best translations for TRANSLATE TRAIN and TRANSLATE TEST, and we also compare the BLEU scores of the MT systems against the TTI performance. Note that training a TRANSLATE TRAIN TTI model from scratch for each of the MT systems also takes 75 hours; our TRANSLATE TRAIN experiments thus do not extend to other datasets beyond COCO-CN due to the high computational cost. Given the considerations above, along with preliminary evaluations on COCO-CN which showed that Marian outperforms mBART50 and M2M100, for the other datasets we focus on comparing the Marian-based TRANSLATE TEST with ZERO-SHOT TRANSFER.
RQ2 Experiments. RQ2 further studies the effectiveness of the proposed ENSAD module; see §3 and Figure 1. We select Marian as the NMT backbone and sample m EN translations for each input sentence in the input language L. To compare with ENSAD (with the frozen mLAFITE generator), we also propose and experiment with several insightful and simple baselines (without the use of ENSAD) in addition to the RQ1 baselines: 1) we try standard mean-pooling as a simple ensembling baseline directly on mLAFITE; 2) we fine-tune G using the original non-EN captions; 3) we fine-tune G using mean-pooled text features. Finally, we also investigate variants which combine ENSAD with the tunable generator G to check if further gains can be achieved.
Training for RQ2 experiments is conducted on 8×V100 GPUs with a batch size per GPU of 16 for about 7 hours (i.e., a total of 2 million data points sampled from the respective training sets). We use the Adam optimiser (Kingma and Ba, 2014) with a learning rate of 5e-4 and betas of (0, 0.99). For the generator-tuning baselines, their contrastive loss setups completely follow the original LAFITE (Zhou et al., 2022). In our ENSAD experiments, $\lambda_1 = 4$ and $\lambda_2 = 2$. Other hyper-parameters are as follows: the NMT beam size is 12, the NMT temperature is 2.0, images are scaled to resolution 256 × 256, $m = 12$, $d = 512$, $d_{hid} = 256$, and $\tau = 0.5$. In addition, we fuse 10% and 1% standard Gaussian noise into $h_0$ and $h_i$ ($1 \le i \le m$) respectively as a data augmentation 'trick'. The hyper-parameters are tuned on our dev split of COCO-CN with details in Appendix G. The same set of hyper-parameters is also adopted for the other two datasets.
Side Experiments. Besides the main RQ1 and RQ2 experiments, we also conduct a series of side analyses focused on ENSAD. They span 1) the impact of the number of EN translations m, 2) the impact of the interpolation hyperparameter α, and 3) robustness tests. We also conduct 4) ablation studies to validate the effectiveness of different components, and 5) present generated images and ENSAD attention scores.
Evaluation Metric. Following Zhou et al. (2022) and Ramesh et al. (2021), we report the Fréchet Inception Distance (FID) (Heusel et al., 2017), the most authoritative automatic evaluation metric for TTI so far. It is computed with 30,000 synthesised images, generated using randomly sampled test set texts, against the test set ground-truth images.
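For readers unfamiliar with the metric, FID fits a Gaussian to the image features of each set and computes the Fréchet distance between the two Gaussians, $\|\mu_a - \mu_b\|^2 + \mathrm{Tr}(\Sigma_a + \Sigma_b - 2(\Sigma_a \Sigma_b)^{1/2})$; lower is better. The NumPy sketch below implements only this distance on precomputed feature matrices. Feature extraction with an Inception network is outside the sketch, and the function name `frechet_distance` is illustrative rather than taken from any evaluation toolkit.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets of shape (n, d)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^{1/2}) equals the sum of the square roots of the eigenvalues
    # of cov_a @ cov_b, which are real and non-negative for covariance matrices.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)
```

Two identical feature sets give a distance of (numerically) zero, and shifting every feature by a constant c adds d · c² through the mean term alone.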

Results and Discussion
The main results are structured around the two central RQs from §1, discussed in §6.1 and §6.2.

RQ1: Results and Analyses
Comparison of Three Baselines. The results of TRANSLATE TRAIN, TRANSLATE TEST, and ZERO-SHOT TRANSFER on COCO-CN are summarised in Table 1. While all three methods use mCLIP, TRANSLATE TEST and ZERO-SHOT TRANSFER are based on a pretrained EN mLAFITE and do not require any further tuning. TRANSLATE TRAIN achieves the best FID scores; however, it requires training from scratch with translated L captions (see §3.1 and §5). Since MS-COCO provides ground-truth human-written EN captions for COCO-CN images, and Multi30K Task2 also provides EN human descriptions, we directly feed the EN captions to mLAFITE and report the FID scores as an upper bound (see the first row of each of Tables 1 and 2).
The scores in Tables 1 and 2 show that ZERO-SHOT TRANSFER outperforms TRANSLATE TEST, demonstrating the strong capability of the multilingual mCLIP text encoder. TRANSLATE TEST compares unfavourably to other methods, revealing the gap between EN translations and the ground-truth EN human descriptions (e.g., translation errors, 'translationese' bias). We further extend the comparison to five more languages from the IGLUE dataset, and the results from Table 7 in Appendix E corroborate the finding that ZERO-SHOT TRANSFER generally outperforms TRANSLATE TEST.
Comparison of MT Systems. We compare the performance of the four MT systems on COCO-CN and also report their BLEU scores on the additional 5K sentence pairs. Table 1, as expected, reveals that the commercial Amazon Translate system offers much stronger MT performance than the three academic NMT systems in terms of BLEU. Concerning mTTI, Amazon Translate is the best system within the TRANSLATE TEST category and ranks second with TRANSLATE TRAIN. Interestingly, there are some salient discrepancies between BLEU-based and TTI-based system rankings. For example, Marian ranks second in TRANSLATE TEST and is the best system with TRANSLATE TRAIN, although its MT performance is below that of both Amazon Translate and mBART50. We speculate that this might be due to the pretraining specifics of mCLIP, where Marian-generated pseudo-parallel sentence pairs were used (Carlsson et al., 2022).
In TRANSLATE TEST, M2M100 obtains the lowest ZH→EN BLEU score and also achieves the worst TTI performance. However, mBART50 and M2M100 have close EN→ZH BLEU scores in TRANSLATE TRAIN, and a small edge in BLEU cannot guarantee a better TTI performance. We additionally compare Marian and Amazon Translate for TRANSLATE TEST in Tables 2 and 7 (Appendix E) on other languages and datasets, which further validate the core findings.

RQ2: Results and Analyses
Effectiveness of ENSAD. The main results are summarised in Table 3.
Variants of ENSAD. We further investigate the impact of crucial design choices and hyper-parameters in ENSAD such as m, α, and V (see Eq. (3)) respectively on the final TTI performance. The results of different variants are provided in Table 4. They indicate that increasing the number of translations m seems to be conducive to downstream TTI performance. In addition, when V = K, the FID score worsens, demonstrating the usefulness of the V variant as formulated by Eq. (3). Finally, the TTI performance deteriorates when α > 0.2, showing that $h_0$ should still be the main component of h, and ENSAD provides auxiliary information (i.e., a translation-based enhancement).
Ablation Study. We now study the usefulness of the two contrastive losses used: 1) our proposed $L_{CL}$ and 2) $L^D_{CL}$ inherited from LAFITE. The results in Table 5 show that removing $L_{CL}$ causes a noticeable performance drop (increased FID). However, removing $L^D_{CL}$ has only a minor impact on the FID score. When removing both CL losses, the adversarial losses alone produce an FID score of 14.82. We additionally try the CL loss setup of the original LAFITE and find that the setup is detrimental to the training of ENSAD, producing a worse FID score than using the adversarial losses alone.
TTI Examples and Attention Scores. Finally, we refer the reader to Appendix H, where we present images synthesised with TRANSLATE TEST, ZERO-SHOT TRANSFER, and our ENSAD models and where we also show the ENSAD attention scores. The differences between images are subtle, and we were unable to find a clear pattern that links high attention scores with particular translations.

Conclusion
This work is one of the first investigations of multilingual and cross-lingual text-to-image generation (TTI), with a particular focus on investigating the use of machine translation (MT) for the task. We systematically compared the standard cross-lingual transfer approaches TRANSLATE TRAIN, TRANSLATE TEST and ZERO-SHOT TRANSFER in the context of TTI and also studied the differences across MT systems. We then proposed a novel Ensemble Adapter (ENSAD) method that leverages multiple translations to further improve the TTI performance, with strong and consistent gains reported across a series of standard TTI benchmarks in different languages.

Limitations
First, we again emphasise that the lack of high-quality non-English image-caption pairs is a primary obstacle to wider-scale multilingual and cross-lingual TTI investigations. We hope that researchers in the future can construct and release more high-quality vision-language data for different languages, especially for low-resource ones.
Second, our work uses the 512-dim 'XLM-R Large Vit-B/32' mCLIP and is based on the StyleGAN2 framework (Karras et al., 2020b). Since the main focus of our work is to realise multilingual and cross-lingual TTI and enable fair comparisons across different models and approaches, we compare all proposed and baseline methods with the same mCLIP text encoder and the same GAN framework. However, for readers and potential users interested in 'chasing' stronger absolute FID scores, we speculate that the larger 640-dim 'XLM-R Large Vit-B/16+' mCLIP text encoder and the more recent StyleGAN3 (Karras et al., 2021) can be helpful.
Third, we notice that in addition to LAFITE, several state-of-the-art large diffusion models such as those from Saharia et al. (2022) and Rombach et al. (2022) also use CLIP to condition image generation on text input. This means that we might also be able to derive multilingual diffusion models for mTTI by replacing CLIP with mCLIP, and to enhance the mTTI performance with our proposed ENSAD (of course, we would need to redesign our loss functions). However, due to limited computational resources, we leave this to future work.
Fourth, ENSAD boosts cross-lingual transfer for TTI by combining the knowledge from multiple translations, which can mitigate potential translation errors. Our work does not demonstrate whether ENSAD is applicable and adaptable to downstream cross-lingual tasks besides TTI. This is because 1) downstream tasks other than TTI are out of the scope of this work and 2) adapting ENSAD to different tasks would require a redesign of model structures and losses catering to the characteristics of each downstream task, so we believe it is not appropriate to expand the topic and cover everything in a single piece of work. Therefore, we also leave this to future work.

A Data Statistics and Languages
In Table 6, we summarise the data statistics and languages covered in our experiments.

B Additional Discussion on Data Sources
Even without human-annotated image descriptions, there are two possible ways to derive captions for a target language L.
First, we could translate EN captions into L manually (still costly) or via machine translation. Our TRANSLATE TRAIN baseline (see §3) derives training data via machine translation and trains an L TTI model from scratch. One main disadvantage of this approach is that it incurs huge training costs. While translations can be used as training data, we are conservative about using translated captions for TTI evaluation, which can cause unexpected bias (Elliott et al., 2016; van Miltenburg et al., 2017; Bugliarello et al., 2022).
Second, it is possible to use cheaper but noisy Web-crawled visual-language data.For example, the recently released LAION-5B dataset (Schuhmann et al., 2022) has 5 billion image-text pairs for 100+ languages.There are previous examples that successfully trained SotA EN TTI models with Web-crawled data, such as large VQVAE-based models and diffusion models.The models described in Ramesh et al. (2021), Nichol et al. (2022) and Ramesh et al. (2022) are trained on EN largescale Web-crawled data, but are eventually also tested on the gold-standard MS-COCO validation set.In our work, in addition to two gold-standard datasets, we also try to build on our own a smallscale dataset for both training and evaluation by filtering relatively good-quality image-text pairs from a subset of the noisy LAION-5B data (details in §4).Training non-EN TTI models from scratch with large-scale Web-crawled data such as LAION-5B is out of the scope of our work, and we focus on crosslingual transfer learning setups with limited L data.As mentioned in §1, this is to a large extent due to concerns about huge computational costs for training TTI models.Moreover, there are circa 7, 000 languages worldwide (Lewis, 2009), and for lowresource languages not covered in LAION-5B's 100+ languages, cross-lingual transfer learning approaches would still be the first choice.Furthermore, the number of EN texts in LAION-5B is more than the total amount of texts from its 100+ non-EN texts.Making full use of the huge amount of EN image-text pairs via cross-lingual transfer learning ENSAD forward pass hi←ENSAD(Hi); 7: Synthesise fake image I f ake i ←G( hi, z) 8: Feed ( hi, I real ) and ( hi, I f ake ) to D respectively; 9: Update ENSAD with Eq. ( 13); 10: Update D with Eq. ( 14); 11: end while might be beneficial for other languages.Therefore, we think that cross-lingual transfer learning in relatively low-resource setups for multilingual TTI is a critical and valuable research topic.

C The Detailed Training Process of ENSAD
We summarise the training process of our ENSAD method (see §3) in Algorithm 1.
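The control flow of Algorithm 1 can be sketched in plain Python as follows. This is only a structural sketch under stated assumptions: `ensad`, `generator`, `discriminator`, `sample_noise`, and the two update callbacks (standing in for Eq. (13) and Eq. (14)) are hypothetical callables supplied by the caller, the convergence test is replaced by a fixed step budget `max_steps`, and drawing the noise z once per mini-batch is an assumption of the sketch rather than something stated in the algorithm.

```python
from typing import Any, Callable, Iterable, Tuple


def train_ensad(
    batches: Iterable[Tuple[Any, Any]],
    ensad: Callable[[Any], Any],
    generator: Callable[[Any, Any], Any],
    discriminator: Callable[[Any, Any], Any],
    sample_noise: Callable[[], Any],
    update_ensad: Callable[[Any, Any], None],
    update_discriminator: Callable[[Any, Any], None],
    max_steps: int,
) -> int:
    """Run the adversarial loop for at most max_steps mini-batches.

    All callables are hypothetical placeholders; the real losses live in
    the update callbacks and are not reproduced here.
    """
    steps_done = 0
    for H, real_images in batches:              # mini-batch of (H_i, I_real_i)
        if steps_done >= max_steps:             # stand-in for the convergence test
            break
        z = sample_noise()                      # assumption: z drawn once per batch
        h = ensad(H)                            # ENSAD forward pass
        fake_images = generator(h, z)           # synthesise fake images with G
        d_real = discriminator(h, real_images)  # D on (h, I_real)
        d_fake = discriminator(h, fake_images)  # D on (h, I_fake)
        update_ensad(d_real, d_fake)            # placeholder for the Eq. (13) update
        update_discriminator(d_real, d_fake)    # placeholder for the Eq. (14) update
        steps_done += 1
    return steps_done
```

Note that the generator is only called, never updated, which mirrors the setting in which ENSAD is trained with G kept frozen.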

D Deriving LAION-5B Dataset for Finnish
We download circa 5.1 million image-caption pairs from the FI category of LAION-5B. Since the Web-crawled data are noisy, we apply several filtering steps: 1) since our images will be scaled to resolution 256 × 256, to avoid distortion we keep only images with a width-height ratio between 0.5 and 2; 2) we keep captions with a minimum length of 8 words, which is also a requirement of MS-COCO (Chen et al., 2015) in its data annotation; 3) we use the langdetect library to remove texts misclassified into the LAION-5B FI category and make sure the remaining texts are actually in Finnish; 4) we keep captions with one ending period '.'. After these steps, 239K pairs are left; we calculate mCLIP scores (cosine similarities between mCLIP-extracted text and image features) for all pairs and keep the 30K highest-ranking pairs as the final dataset. We randomly split the data into training, development and test portions with 10,000, 2,000, and 18,000 pairs, respectively. We 'sanity-check' 50 randomly sampled instances from our filtered data and find that, in most cases, the text matches the image content. However, there are a small number of exceptional cases where the text contains extra information beyond the image content itself (e.g., event descriptions). Overall, the quality of our FI data still cannot match MS-COCO or Multi30K. Another interesting finding is that LAION-5B captions often use real and concrete names such as 'Messi' and 'the national stadium' to describe the image content, while MS-COCO and Multi30K tend to use general words such as 'a man'/'a football player' and 'a stadium'/'a building'.
Table 6: Data statistics categorised by languages. This table includes information such as language family, ISO 639-1 code, dataset name, train/dev/test split, and statistics on sequence length (number of words per caption). Note that MS-COCO EN data is used for pretraining our mLAFITE only. We also show for each dataset if there is an image domain overlap with MS-COCO images used for mLAFITE pretraining. ✓: all images are from MS-COCO; x: none of the images is from MS-COCO; ✓ -: half of the images are from MS-COCO. For IGLUE Indonesian data, we remove an empty caption and its associated image, so there are 1,999 images left.
• Computing Infrastructure: we run our code on an Amazon EC2 P3.16xlarge Instance with 8×16GB Nvidia® Tesla® V100 GPUs, 64×2.30GHz Intel® Xeon® E5-2686 v4 CPU cores, and 488GB RAM.
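The four filtering steps and the mCLIP-based ranking described in Appendix D can be sketched as below. This is a minimal illustration, not the authors' pipeline: the `Pair` record, the function names, and the `detect_lang` callable are hypothetical, the features are assumed to be precomputed, and reading step 4 as "exactly one period, at the end of the caption" is one possible interpretation.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Pair:
    caption: str
    width: int
    height: int
    text_feat: Sequence[float]   # hypothetical: precomputed text feature
    image_feat: Sequence[float]  # hypothetical: precomputed image feature


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, used here as the text-image matching score."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def passes_filters(pair: Pair, detect_lang: Callable[[str], str],
                   lang: str = "fi") -> bool:
    # 1) width-height ratio between 0.5 and 2
    if pair.height <= 0 or not 0.5 <= pair.width / pair.height <= 2.0:
        return False
    caption = pair.caption.strip()
    # 2) at least 8 whitespace-separated words
    if len(caption.split()) < 8:
        return False
    # 3) language identification must agree with the target language
    if detect_lang(caption) != lang:
        return False
    # 4) assumption: exactly one period, and it ends the caption
    return caption.endswith(".") and caption.count(".") == 1


def build_dataset(pairs: Sequence[Pair], detect_lang: Callable[[str], str],
                  top_k: int, split_sizes: Tuple[int, int, int],
                  seed: int = 0) -> Tuple[List[Pair], List[Pair], List[Pair]]:
    kept = [p for p in pairs if passes_filters(p, detect_lang)]
    # rank by similarity score and keep the top_k highest-ranking pairs
    kept.sort(key=lambda p: cosine(p.text_feat, p.image_feat), reverse=True)
    top = kept[:top_k]
    random.Random(seed).shuffle(top)  # random train/dev/test split
    n_train, n_dev, n_test = split_sizes
    train = top[:n_train]
    dev = top[n_train:n_train + n_dev]
    test = top[n_train + n_dev:n_train + n_dev + n_test]
    return train, dev, test
```

With the figures reported above, a call would use `top_k=30000` and `split_sizes=(10000, 2000, 18000)`. The `detect` function of the langdetect library, which returns a language code such as `'fi'`, is one possible choice for `detect_lang`.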
6.37e-02
Houses are built on water, all around mountain mountains, as long as possible, and have access to water. | 2.97e-02
The house is constructed in the form of water, surrounded by mountains and long distances. | 9.80e-03
Houses are built with water and are located far beyond the range of hills around them. | 9.17e-03
The homes have been built on water and are surrounded by mountain areas from a distance. | 5.64e-03
Houses are built on water and surround it far from the mountains. | 4.62e-03
The houses are built according to water and spread around them from a great direction to a very deep range of mountains. | 1.80e-03
Houses are constructed around the mountain and built from a distant distance to an open point of view. | 3.64e-04
The houses are built by water and are encircled by mountains, as far as the hills are concerned. | 2.27e-04
The houses were built in the form of water. They were in a remote area around the mountains. | 8.77e-05
The houses were built watery and were driven from a very distant part of the forest and surrounded by mountains. | 5.30e-08

一个客厅，一个大窗户下面的沙发，桌子。
A living room, a couch under a huge window, a table. | 3.08e-01
A sitting room, a couch under a big window, a table. | 2.24e-01
A living room, a sofa under a big window, a table. | 1.59e-01
A living room, a couch under a big window, a table. | 1.31e-01
A living room, a couch under a big window, a table. | 1.31e-01
A living hall, a couch under that big window, a table. | 2.12e-02
I was in the living room, the sofa below the great window, the table. | 1.27e-02
In the living room, in the couch under a big window, in the table. | 5.58e-03
One living room. One large window under the couch. The table. | 3.86e-03
There was a living room, a couch under a large window, there was a table. | 1.66e-03
There's one room, and a big couch under the large window, and there's a table. | 1.22e-03
There was a guest room, there was a couch underneath a great window, there was a

Einen braunen Hund der spazieren geht in der Wiese.
I think he'd be able to walk in the meadow and we could have a brown dog to go for a walk. | 5.72e-01
He walks a brown dog in the meadows, who goes for a walk. | 4.24e-01
A brown dog that goes walking in the meadow. | 1.88e-03
A brown dog who goes for walks in the meadow. | 1.16e-03
A brown dog who goes for walks in the meadow. | 1.16e-03
A brown dog who goes for a walk in the meadow. | 1.02e-04
A brown dog going for a walk in the meadow. | 1.10e-06
A brown dog who walks in the meadow. | 8.78e-09
A brown dog taking a walk in the meadow. | 4.56e-09
A brown dog taking a walk in the meadow. | 4.56e-09
A brown dog walking in the meadow.
Home made soda on the terrace in glass jar and karaaffles and poured into small glasses. | 2.07e-01
on the terrace table made homemade sodacello in a glass jar and in a girdle as well as poured into small glasses. | 1.26e-01
On the terrace table in a glass jar and a karahas of homemade limecello put down in small glasses. | 8.66e-02
On a terrace table of homemade lemonade in a glass jar and slab of gizzard and poured into small glasses. | 8.64e-02
The table on the terrace has homemade soda crystals in a glass jar and swath and is poured into small glasses. | 8.30e-02
On the terrace table housed lemonade in glass jars and swaths and poured down into small glasses. | 8.24e-02
On the terrace table of home made wine in glass jars and karaffes and poured into small glasses. | 7.41e-02
The terrace is equipped with homemade lemonade in the jar and perch and poured into small glasses. | 5.89e-02
Top of the terrace is homemade lemonade in a jar of glass and karaoke and poured into small glasses. | 5.67e-02
On the table of the terrace it's homemade limocello with glass pots and clovers and poured into small glasses. | 5.50e-02
On a table of terraces, homemade lemoncello is made in a glass jar and in a caraments and poured into small glasses. | 4.91e-02
on the table of terraces with homemade soda on a glass jar and karaffe and poured in small glasses. | 3.56e-02

Tuli tuhosi pahoin historiallisen kirkon vuoden 2006 toukokuussa.
In May 2006 the historic church was badly destroyed by fire. | 3.51e-01
In May of 2006, the historical church was severely destroyed by fire. | 2.39e-01
It was, in May 2006, when the fire badly destroyed the historic church. | 9.44e-02
There was a great destruction of this historical church in May 2006. | 6.56e-02
The fire did a great deal of damage to an historic church in May 2006. | 5.30e-02
In May 2006 fire caused a very severe damage to the historic church. | 3.58e-02
The fire seriously destroyed the historical church in May 2006. | 3.49e-02
The fire was severely destroyed by the historical church in May 2006. | 3.49e-02
A fire severely destroyed the historical church in May 2006. | 3.05e-02
There's been massive damage to the historical Church in May 2006 when the fire took place. | 3.01e-02
Fire was devastatingly damaged by the historic church in May 2006. | 2.18e-02
Fire caused the serious destruction of the historic church in May 2006. | 9.31e-03

这是一个干净，但拥挤的厨房。
That's a clean-up, but crowded kitchen full of fruits.
It's clean but crowded in the kitchen full of fruits.
That's a clean, but crowded kitchen full of fruits.
It's a clean, but crowded kitchen full of fruits.
It's a clean-up but congested kitchen full of fruits.
That's a clean, but congested kitchen full of fruits.
And it's a clean, but crowded kitchen full of fruits.
It's a clean, but congested kitchen full of fruits.
-IT'S THIS IS A cleanING BUT CLOTHED CLIMBEN COILLOR IN THE CRUCKIT.-[CLICKS] full of fruits.
It was a clean but crowd-cooked kitchen full of fruits.
That's a clean one, but crowd-cooked kitchen full of fruits.
It's a clean, but congested kitchen full of fruits.
Table 9: Additional information added to the EN translations. The underlined texts in red are added phrases. Removing the phrases derives the NMT-generated translations.

Algorithm 1: Supervised Training of Ensemble Adapter
1: Input: An image-text dataset {x^0_i, [...]
2: [...] H_i for each x^0_i with NMT and mCLIP
3: while not converged do
4: Sample mini-batch {H_i, I^real_i [...]
For all methods except EN Captions'. With an image domain gap present (i.e., DE and FI), training ENSAD (with frozen G) still shows a small edge over fine-tuning G (without ENSAD) for DE; however, for the noisier LAION-5B data, fine-tuning G is

Table 6 (column headers): # of Images | Dev Set: # of Images | Test Set: # of Images | Min Seq Len | Max Seq Len | Avg. Seq Len | Image Domain Overlap

, 0.5}, and d_hid from {32, 64, 128, 256, 512}.
Table 7: RQ1 results: TRANSLATE TEST vs. ZERO-SHOT TRANSFER on five languages from IGLUE. FID↓: lower is better. Column headers: ID: FID↓ | JA: FID↓ | RU: FID↓ | TR: FID↓ | Avg.: FID↓
To better understand what kind of information ENSAD extracts from EN translations, we also try to manually add additional information to EN translations (the additional information does not appear in and is not added to the original L input). Of course, this section is for probing purposes only, since MT systems are not likely to produce the same translations. We found that when the additional information is added to only several of the 12 EN translations, it is hardly reflected in the generated image. Here, we show two COCO-CN test set examples in Figure 3, where we add the new information into all 12 EN translations simultaneously. In its first and second rows, the original L input is 'An open laptop is on the table.' and 'It's a clean, but crowded kitchen.' respectively (translated from the original Chinese captions). We manually add new objects 'roses' and 'fruits' respectively to all their EN translations as in Table 9. As seen in Figure 3, the roses and fruits do appear in the generated images.

Table 8: ENSAD attention scores.
EN translations with 'and roses' added (see Table 9):
On a table is put on an open laptop, and roses.
It was on the desk with an open laptop, and roses.
There's a computer that's open that has an open laptop sitting on the table, and roses.
There's a opened laptop on the table, and roses.
There's an open laptop sitting on the table, and roses.
An open laptop's on the table, and roses.
And we have a laptop on your desk that's open, and roses.
There was a laptop that was open on the table, and roses.
A computer that opened up his laptop is in place on the table, and roses.
There was a computer on the table. There was an open laptop on the table, and roses.
There's an open laptop on the table, and roses.
There was an unopened laptop on the table, and roses.