Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training

In this paper, we introduce Cross-View Language Modeling, a simple and effective pre-training framework that unifies cross-lingual and cross-modal pre-training with shared architectures and objectives. Our approach is motivated by a key observation that cross-lingual and cross-modal pre-training share the same goal of aligning two different views of the same object into a common semantic space. To this end, the cross-view language modeling framework considers both multi-modal data (i.e., image-caption pairs) and multi-lingual data (i.e., parallel sentence pairs) as two different views of the same object, and trains the model to align the two views by maximizing the mutual information between them with conditional masked language modeling and contrastive learning. We pre-train CCLM, a Cross-lingual Cross-modal Language Model, with the cross-view language modeling framework. Empirical results on IGLUE, a multi-lingual multi-modal benchmark, and two multi-lingual image-text retrieval datasets show that while conceptually simpler, CCLM significantly outperforms the prior state-of-the-art with an average absolute improvement of over 10%. Moreover, CCLM is the first multi-lingual multi-modal pre-trained model that surpasses the translate-test performance of representative English vision-language models by zero-shot cross-lingual transfer.


Introduction
Although recent multi-lingual multi-modal pre-trained models achieve promising results on multi-lingual vision-and-language tasks, they still significantly under-perform "translate-test", a simple baseline which translates the test examples into English and uses an English-only vision-language model for inference. This prevents existing multi-lingual multi-modal models from being applied in real-world applications. In contrast, multi-lingual pre-trained models such as XLM-R [17] significantly outperform the translate-test baseline in most languages and are widely used in practical applications.
This paper aims to fully exploit the potential of multi-lingual multi-modal pre-training. We point out two major limitations of the current state of the art. First, existing methods do not exploit parallel text corpora, which can be easily collected and are abundant for many language pairs. Instead, M3P performs masked language modeling on monolingual texts in different languages for multi-lingual alignment. However, parallel texts have been shown to be more helpful in the multi-lingual pre-training literature [17,19]. Second, a number of new pre-training objectives involving specific architecture changes and different input-output formats have been introduced for English or image pivoting, making it non-trivial to combine them for better performance or scale them to larger data.
In this work, we argue that multi-lingual and multi-modal pre-training essentially pursue the same goal of aligning two different views of the same object into a common semantic space. Therefore, we believe these two seemingly different strategies can be combined into a unified framework. To this end, we introduce cross-view language modeling, a simple and effective framework that unifies cross-lingual and cross-modal pre-training with shared architectures and objectives. Specifically, we consider both multi-modal data (i.e., image-caption pairs) and multi-lingual data (i.e., parallel sentence pairs) as pairs of two different views of the same object. With either multi-modal or multi-lingual data as input, we encode the two views with Transformer models and then fuse their representations with a cross-attention Transformer model shared for both cross-modal and cross-lingual fusion. We train the model to align the two views in a common semantic space by maximizing the mutual information between them with a conditional masked language modeling objective, a contrastive learning objective, and a matching objective. In this way, the cross-view language modeling framework unifies the English pivoting and image pivoting schemes seamlessly and gets the best of both worlds.
To evaluate the effectiveness of our approach, we pre-train CCLM, a Cross-lingual Cross-modal Language Model, with the proposed cross-view language modeling framework. Experimental results show that CCLM significantly outperforms the prior state-of-the-art, with average absolute improvements of 11.4% and 32.7% on multi-lingual vision-language understanding and retrieval tasks (in terms of accuracy and R@1, respectively) on IGLUE [27], a recently released multi-lingual multi-modal benchmark. Notably, CCLM is the first multi-lingual vision-language model that surpasses the "translate-test" performance of mono-lingual vision-language models via zero-shot cross-lingual transfer, which we believe is a crucial step towards practical multi-lingual multi-modal pre-training.
Contributions. (1) We propose a cross-view language modeling framework that unifies multi-lingual and multi-modal pre-training with shared architectures and objectives. (2) We pre-train CCLM with the proposed approach on publicly available image-text pairs and parallel sentence pairs. (3) CCLM advances the state-of-the-art of multi-lingual vision-language pre-training by a large margin and surpasses the translate-test baseline for the first time.

Related Work
Multi-lingual Pre-training Multilingual BERT [3] demonstrates that good cross-lingual transfer results can be achieved by performing masked language modeling on multi-lingual corpora with a shared vocabulary and shared weights. Later, XLM [16], XLM-R [17], and Unicoder [28] introduce a number of new objectives, including translation language modeling (TLM), cross-lingual word recovery, and cross-lingual paraphrase classification, to improve multi-lingual pre-training. More recently, MAD-X [18] and InfoXLM [19] further improve multi-lingual pre-training via adapters [29] and contrastive learning, respectively.
Vision-Language Pre-training Inspired by the success of language model pre-training, a number of works [20,21,24,23,30] investigate vision-language pre-training on large-scale image-caption pairs and propose a number of objectives to align vision and language representations, including masked multi-modal modeling, multi-modal alignment prediction, RoI feature regression, and image-text matching, to name a few. Vision-language pre-training has reshaped the landscape of vision-and-language research and pushed the state-of-the-art on a wide range of vision-language tasks [31]. However, it is non-trivial to collect large-scale image-caption pairs in other languages. As such, most existing vision-language pre-trained models are limited to English tasks.

Figure 1: Illustration of the cross-view language modeling framework. CCLM takes two different views of the same object, i.e., either (a) image-caption pairs or (b) parallel sentence pairs, as input. CCLM first encodes the two views separately with Transformer encoders. Then the representations of the two views are fused by a Transformer-based fusion model, which is shared for both cross-lingual and cross-modal fusion. CCLM is optimized by maximizing the mutual information between the two views via a conditional masked language modeling loss, a contrastive loss, and a matching loss.
Multi-lingual Multi-modal Pre-training Multi-lingual multi-modal pre-training aims to make multi-modal models applicable to non-English texts via cross-lingual transfer. In this paper we mainly consider multi-modal in the vision-language context. The key difficulty of multi-lingual multi-modal pre-training is the lack of non-English image-text pairs. Two representative works tackle this by pivoting on either English texts or images. Specifically, M3P [25] uses English as the pivot and alternates between English-only vision-language pre-training, multi-lingual masked language modeling, and multi-modal code-switched training. UC2 [26], on the other hand, translates English captions into multiple languages and treats images as the anchor, achieving state-of-the-art results on various multi-lingual vision-language tasks. More recently, MURAL [32] pre-trains a dual encoder model on massive multi-lingual multi-modal data via contrastive learning and achieves a new state-of-the-art on multi-lingual image-text retrieval datasets. However, the dual encoder architecture of MURAL prevents it from performing multi-modal understanding tasks well.

Overview
Cross-view language modeling is a simple framework that unifies cross-lingual pre-training and cross-modal pre-training with shared architectures and objectives. CCLM consists of an image encoder, a cross-lingual text encoder, and a fusion model, all of which are Transformer-based. Specifically, the image encoder [33] first splits an image into non-overlapping patches and then embeds these patches with Transformer layers, yielding {v_cls, v_1, ..., v_N1}. For an image at a resolution of 224×224 with a patch size of 32×32, we have N_1 = 49. Similarly, the cross-lingual text encoder encodes a text input via Transformer layers, yielding {w_cls, w_1, ..., w_N2}, where N_2 is the length of the text input. Then, the fusion model fuses the text features with the corresponding image features or the features of the translated text via cross-attention, producing {x_cls, x_1, ..., x_N2}.
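This encode-then-fuse dataflow can be summarized with a short sketch. The PyTorch-style pseudocode below is only illustrative: the module interfaces, tensor shapes, and constructor arguments are our assumptions for exposition, not the released implementation.

```python
import torch.nn as nn

class CCLMSketch(nn.Module):
    """Minimal sketch of the encode-then-fuse dataflow described above.
    The concrete sub-modules (Swin image encoder, XLM-R halves) are passed in
    as stand-ins; their exact interfaces here are assumptions."""

    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module,
                 fusion_model: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder   # image patches -> {v_cls, v_1..v_N1}
        self.text_encoder = text_encoder     # text tokens   -> {w_cls, w_1..w_N2}
        self.fusion_model = fusion_model     # cross-attention fusion, fully shared
                                             # between cross-modal and cross-lingual input

    def forward(self, text_tokens, image=None, translation_tokens=None):
        w = self.text_encoder(text_tokens)                  # (B, N2+1, d)
        if image is not None:                               # cross-modal view
            context = self.image_encoder(image)             # (B, N1+1, d)
        else:                                               # cross-lingual view
            context = self.text_encoder(translation_tokens)
        x = self.fusion_model(w, context)                   # (B, N2+1, d): {x_cls, x_1..x_N2}
        return x
```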
As illustrated in Figure 1, with either (text, image) pairs or (text, translation) pairs as input, we consider the paired input as two different views and train the model to align their representations in a common semantic space. This unified cross-view perspective allows us to share input-output formats, architectures, and training objectives between cross-lingual inputs and cross-modal inputs. Specifically, we completely share the fusion model for both cross-lingual fusion and cross-modal fusion, and optimize the model with a contrastive loss, a matching loss, and a conditional masked language modeling loss for both cross-lingual and cross-modal inputs. We select these objectives because they are universally effective in both the cross-lingual and the cross-modal pre-training literature [19,34]. We will show that the three losses maximize sequence-level and token-level mutual information between image-caption pairs or parallel sentence pairs. Moreover, we empirically find that the three losses are more effective for cross-lingual cross-modal pre-training than certain task-specific losses such as masked region-to-token language modeling, which is specific to multi-modal pre-training, or translation language modeling, which is specific to multi-lingual pre-training.

A Mutual Information Maximization Perspective
In this section, we explain our approach from an information-theoretic perspective. Formally, given two random variables A and B, the mutual information I(A, B) measures the dependency between them. We define A = a and B = b as two different views of a data point, which can be either an image-caption pair or a parallel sentence pair. We will show that CCLM maximizes a lower bound of I(A, B) for cross-lingual cross-modal pre-training by minimizing the InfoNCE loss [35], defined as:

\mathcal{L}_{nce} = -\mathbb{E}_{p(a,b)} \left[ \log \frac{\exp(f_\theta(a, b))}{\sum_{\tilde{b} \in \tilde{B}} \exp(f_\theta(a, \tilde{b}))} \right],

where f_θ(a, b) ∈ R is a score function parameterized by θ, and B̃ contains the positive sample b and |B̃| − 1 negative samples.
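As a concrete reference, minimizing L_nce with in-batch negatives reduces to a softmax cross-entropy over pairwise scores. The snippet below is a minimal sketch of this computation, assuming the scores f_θ(a_i, b_j) have already been arranged into a matrix.

```python
import torch
import torch.nn.functional as F

def info_nce(scores: torch.Tensor) -> torch.Tensor:
    """InfoNCE with in-batch negatives: scores[i, j] = f_theta(a_i, b_j).
    For row i, column i is the positive pair and the remaining columns act as
    the |B~| - 1 negatives, so the loss is a softmax cross-entropy per row."""
    targets = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, targets)
```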
The contrastive loss between the image encoder and the cross-lingual text encoder is a symmetric version of L_nce:

\mathcal{L}_{cl} = -\frac{1}{2} \mathbb{E}_{p(a,b)} \left[ \log \frac{\exp(f_\theta(a, b))}{\sum_{\tilde{b} \in \tilde{B}} \exp(f_\theta(a, \tilde{b}))} + \log \frac{\exp(f_\theta(a, b))}{\sum_{\tilde{a} \in \tilde{A}} \exp(f_\theta(\tilde{a}, b))} \right],

where |Ã| = |B̃| = N is the batch size and we predict the matched (a, b) pairs against in-batch negatives. Here f_θ(a, b) = g_v(v_cls)^T g_w(w_cls)/τ for an image-caption pair and f_θ(a, b) = g_w(w^a_cls)^T g_w(w^b_cls)/τ for a parallel sentence pair. v_cls and w_cls are the output [CLS] embeddings of the image encoder and the cross-lingual text encoder. g_v and g_w are transformations that map the [CLS] embeddings to normalized lower-dimensional representations. τ is a learnable temperature parameter.
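A hedged sketch of this symmetric contrastive loss follows; the hidden and projection dimensions and the initial temperature value are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveHead(nn.Module):
    """Symmetric in-batch contrastive loss over [CLS] embeddings. g_v / g_w are
    the projections described above; for parallel sentence pairs, g_w would be
    applied to both views instead of g_v."""

    def __init__(self, hidden_dim: int = 768, proj_dim: int = 256):
        super().__init__()
        self.g_v = nn.Linear(hidden_dim, proj_dim)    # image [CLS] projection
        self.g_w = nn.Linear(hidden_dim, proj_dim)    # text  [CLS] projection
        self.temp = nn.Parameter(torch.tensor(0.07))  # learnable temperature tau

    def forward(self, v_cls: torch.Tensor, w_cls: torch.Tensor) -> torch.Tensor:
        a = F.normalize(self.g_v(v_cls), dim=-1)      # (N, proj_dim)
        b = F.normalize(self.g_w(w_cls), dim=-1)      # (N, proj_dim)
        scores = a @ b.t() / self.temp                # f_theta over the whole batch
        targets = torch.arange(a.size(0), device=a.device)
        # predict b from a and a from b against in-batch negatives, then average
        return 0.5 * (F.cross_entropy(scores, targets)
                      + F.cross_entropy(scores.t(), targets))
```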
Similarly, the matching loss applied to the output [CLS] embedding of the fusion model, denoted as x_cls(a, b), can also be viewed as a symmetric version of L_nce:

\mathcal{L}_{match} = -\frac{1}{2} \mathbb{E}_{p(a,b)} \left[ \log \frac{\exp(f_\theta(a, b))}{\exp(f_\theta(a, b)) + \exp(f_\theta(a, b'))} + \log \frac{\exp(f_\theta(a, b))}{\exp(f_\theta(a, b)) + \exp(f_\theta(a', b))} \right],

where we only sample one negative instance (a', b) or (a, b') for each ground-truth (a, b) pair and predict whether a pair is matched (true or false). In this case, f_θ(a, b) is computed by a prediction head on top of x_cls(a, b). The conditional MLM loss can also be interpreted as maximizing the mutual information [36] between the context c = (â, b), where â denotes the masked text input and b is the corresponding image or translated text, and the masked token w_i in a:

\mathcal{L}_{mlm} = -\mathbb{E} \left[ \log \frac{\exp(\psi(w_i)^\top x_i)}{\sum_{w \in \mathcal{V}} \exp(\psi(w)^\top x_i)} \right],

where x_i is the output vector of the fusion model at the position of w_i, ψ(w): V → R^d is a lookup function that maps a word token w into a parametric vector, and V is the full vocabulary set.
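The matching and conditional MLM terms can likewise be written as standard cross-entropies. The sketch below assumes a two-way matching head and Hugging-Face-style MLM labels (ignore index -100 at unmasked positions); both are illustrative choices, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def matching_loss(head: nn.Module, x_cls_pos: torch.Tensor,
                  x_cls_neg: torch.Tensor) -> torch.Tensor:
    """Matching loss on the fusion [CLS] output: one sampled negative pair per
    ground-truth pair, classified as matched (1) or not matched (0)."""
    logits = head(torch.cat([x_cls_pos, x_cls_neg], dim=0))          # (2B, 2)
    labels = torch.cat([torch.ones(len(x_cls_pos), dtype=torch.long),
                        torch.zeros(len(x_cls_neg), dtype=torch.long)]).to(logits.device)
    return F.cross_entropy(logits, labels)

def conditional_mlm_loss(mlm_head: nn.Module, x: torch.Tensor,
                         labels: torch.Tensor) -> torch.Tensor:
    """Conditional MLM: predict each masked token w_i from the fusion output x_i,
    i.e. a softmax over the vocabulary via the output embedding psi."""
    logits = mlm_head(x)                                              # (B, N2, |V|)
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           labels.view(-1), ignore_index=-100)
```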
Finally, the pre-training objective of CCLM is defined as:

\mathcal{L} = \mathcal{L}_{cl} + \mathcal{L}_{match} + \mathcal{L}_{mlm},

where the contrastive loss and the matching loss maximize sequence-level mutual information while the MLM loss maximizes token-level mutual information; the two are complementary to each other.

Pre-training Datasets
Multi-modal Data We pre-train CCLM on the combination of image-caption pairs and parallel texts. For image-caption pairs, we follow the practice of UC2 and use their released translation-augmented version of the CC3M dataset. It contains the original CC3M image-caption pairs [37] and machine-translated captions in five different languages (German, French, Czech, Japanese, and Chinese). In addition to this setting, we also experiment with a setting that includes the COCO dataset [38] and the Visual Genome (VG) dataset [39], which together contain around 1 million image-caption pairs. We consider this setup because the COCO and VG datasets are commonly used in the vision-language pre-training literature but were not used by previous work on multi-lingual multi-modal pre-training. We denote the models trained with these two variants as CCLM_3M and CCLM_4M, respectively.

Multi-lingual Data
As for the parallel text corpus, we collect a subset of the WikiMatrix [40] dataset containing parallel texts between English and other languages in the IGLUE benchmark. The multi-lingual pre-training data consists of 19M parallel sentence pairs in total.

Implementation Details
We initialize the image encoder with the Swin Transformer [41], which consists of 12 Transformer layers. The cross-lingual text encoder and the fusion model are initialized with the first and second halves of XLM-R [42] respectively, consisting of six layers each. The image encoder takes images at a resolution of 224 × 224 as input. The maximum sequence length is set to 30 and 64 for image-caption pairs and parallel texts, respectively. During fine-tuning, we increase the image resolution to 384 × 384 and interpolate the positional embeddings of image patches following Dosovitskiy et al. [33].
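For concreteness, the initialization split described above could be obtained roughly as follows with the Hugging Face transformers library. Treating layers 0-5 as the text encoder and layers 6-11 as the fusion model follows the text; the checkpoint name and the handling of the newly added cross-attention modules are our assumptions.

```python
from transformers import XLMRobertaModel

# Load the 12-layer XLM-R checkpoint (assuming the base-size model).
xlmr = XLMRobertaModel.from_pretrained("xlm-roberta-base")

# First half initializes the cross-lingual text encoder, second half the fusion model.
text_encoder_layers = xlmr.encoder.layer[:6]   # layers 0-5
fusion_model_layers = xlmr.encoder.layer[6:]   # layers 6-11

# In CCLM, the fusion layers additionally contain cross-attention modules for
# attending to the other view; those have no XLM-R counterpart and would be
# newly initialized.
```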
We apply mixed precision for pre-training. Following UC2, we train the model for 30 epochs on 8 NVIDIA A100 GPUs with a batch size of 1024, which takes ∼1.5 days. We use the AdamW [43] optimizer with a weight decay of 0.02. The learning rate is warmed up from 1e-5 to 1e-4 in the first 2500 steps and then decayed to 1e-5 following a linear schedule. Pre-training alternates between image-caption batches and parallel-text batches.
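A minimal sketch of this optimization recipe is given below, assuming a LambdaLR-style linear warmup and decay; the exact schedule implementation in the original codebase may differ.

```python
import torch

def build_optimizer_and_schedule(model, total_steps: int, warmup_steps: int = 2500,
                                 peak_lr: float = 1e-4, floor_lr: float = 1e-5):
    """AdamW with 0.02 weight decay; linear warmup from 1e-5 to 1e-4 over the
    first 2500 steps, then linear decay back to 1e-5 over the remaining steps."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr, weight_decay=0.02)

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            lr = floor_lr + (peak_lr - floor_lr) * step / warmup_steps
        else:
            progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
            lr = peak_lr - (peak_lr - floor_lr) * progress
        return lr / peak_lr   # LambdaLR scales the base (peak) learning rate

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```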

Downstream Tasks
We adapt CCLM to two groups of downstream datasets: the IGLUE benchmark, a recently released benchmark for evaluating multi-lingual multi-modal pre-training, and multi-lingual image-text retrieval datasets, namely the multi-lingual versions of Flickr30K [44,45] and MSCOCO [46]. We describe the downstream datasets as follows.
Flickr30K: This dataset extends Flickr30K [44] from English (en) to German (de), French (fr), and Czech (cs). It contains 31,783 images and provides five captions per image in English and German, and one caption per image in French and Czech. Dataset splits are defined as in the original Flickr30K.
MSCOCO: This dataset extends the MSCOCO caption dataset [46] by translating the captions into Japanese [47] and Chinese [48]. The Japanese and Chinese subsets consist of 820k and 20k captions respectively. Following previous work, we use the same train, dev, and test splits for English and Japanese as defined in Karpathy and Li [49]. As for Chinese, we use the COCO-CN split [48].
XVNLI: The Cross-lingual Visual NLI dataset is released by the IGLUE benchmark. It is collected by combining SNLI [50] with its multi-modal [51] and multi-lingual [52] counterparts. It requires the model to predict if a text-hypothesis "entails", "contradicts", or is "neutral" to an image-premise.
xGQA: The Cross-lingual Grounded Question Answering task [53] is collected by manually translating the GQA [54] validation set into 7 languages. It requires a model to answer several types of structured questions about an image. We model GQA as a generation task following Li et al. [34].
MaRVL: The Multicultural Reasoning over Vision and Language dataset [55] requires the model to determine whether a textual description is true or false about a pair of images. The MaRVL dataset is used for testing and the NLVR2 [56] dataset is used for training.
xFlickr&CO and WIT: The xFlickr&CO dataset is collected by combining 1,000 images each from Flickr30K and MSCOCO and crowdsourcing image descriptions in 6 other languages. Similarly, the Wikipedia-based Image Text dataset [57] is collected from Wikipedia in 108 languages. We follow the data preprocessing and splitting details in IGLUE for both datasets.
For all retrieval tasks, we apply the same ranking strategy as ALBEF [34] for inference. For all experiments on Flickr30K and MSCOCO, we fine-tune the model for 10 epochs with a batch size of 160 on 8 GPUs, a learning rate of 1e-5, a weight decay of 0.01, and a warmup ratio of 0.1. We find these hyperparameters work well across settings, datasets, and languages; therefore, we do not perform any hyperparameter search on Flickr30K and MSCOCO. For the IGLUE datasets, we present detailed fine-tuning hyperparameters in Table 5 and Table 6 in the Appendix.
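The ALBEF-style ranking strategy first retrieves candidates with the cheap contrastive similarity and then re-scores the top-k candidates with the fusion model's matching head. The sketch below illustrates this two-stage procedure; itm_score_fn and the value of k are placeholders, not the exact settings used in our experiments.

```python
import torch

@torch.no_grad()
def rerank_retrieval(sim_matrix: torch.Tensor, itm_score_fn, k: int = 128):
    """Two-stage retrieval inference: sim_matrix[q, c] is the contrastive similarity
    between query q and candidate c; itm_score_fn(q, c) returns the matching score
    computed by the fusion model for that pair (an expensive forward pass)."""
    rankings = []
    for q in range(sim_matrix.size(0)):
        topk = sim_matrix[q].topk(k).indices                    # cheap candidate selection
        scores = torch.tensor([itm_score_fn(q, int(c)) for c in topk])
        rankings.append(topk[scores.argsort(descending=True)])  # re-rank by matching score
    return rankings
```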
Compared Methods We compare CCLM with mUNITER, xUNITER, M3P, and UC2; we briefly describe UC2, the strongest of these baselines, below. UC2: The state-of-the-art multi-lingual vision-language model, which relies on (text-only) machine translation to obtain CC3M data in five languages (Czech, French, German, Japanese, and Mandarin). The model is then pre-trained on multi-lingual multi-modal batches where a caption is sampled uniformly from the available languages for each image. As for pre-training objectives, in addition to conventional vision-language pre-training objectives, a visual-conditioned translation language modeling objective is added to improve multi-lingual multi-modal alignment.
All compared models are pre-trained with multi-modal data from CC3M. mUNITER, xUNITER, and M3P use Wikipedia data in different languages as multi-lingual data, while UC2 uses the translated version of CC3M as its multi-lingual data.

Results on IGLUE Benchmark
We first evaluate CCLM on the IGLUE benchmark. We follow the practice of IGLUE and report results in both zero-shot and few-shot cross-lingual transfer settings. In the zero-shot setting, models fine-tuned on the English training sets are directly evaluated on the target languages. In the few-shot setting, the English-trained models are further fine-tuned on a few labeled examples in a target language before being evaluated on that language. We select exactly the same few-shot examples following the IGLUE instructions to ensure our results are comparable with those reported in IGLUE. The results are shown in Table 1. Results of compared models are copied from IGLUE. We omit the few-shot evaluation on the WIT dataset because this setup is also omitted in IGLUE.
First, looking at the zero-shot cross-lingual transfer results, we can see that CCLM_3M outperforms all compared models by a substantial margin while being pre-trained on the same multi-modal data. Specifically, compared to UC2, the prior state-of-the-art, CCLM_3M obtains an average accuracy improvement of 11.4% on multi-lingual multi-modal understanding tasks including XVNLI, xGQA, and MaRVL, and average R@1 improvements of 47.3% and 18.2% on the multi-lingual multi-modal retrieval datasets xFlickr&CO and WIT. This confirms that previous multi-lingual multi-modal models fail to fully exploit the potential of multi-lingual multi-modal pre-training and that our proposed cross-view language modeling framework can better align multi-lingual multi-modal representations with unified objectives. We also find that the performance can be further improved (CCLM_4M) by adding the COCO and VG data, which are used for pre-training most English VLMs. Notably, CCLM is the first multi-lingual multi-modal pre-trained model that performs competitively with the translate-test results of the representative English VLMs tested in the IGLUE benchmark. Concretely, CCLM_4M outperforms the translate-test results of all representative English VLMs in the IGLUE benchmark on XVNLI, MaRVL, xFlickr&CO, and WIT, while performing slightly worse on the xGQA dataset. This, for the first time, demonstrates the potential of multi-lingual multi-modal pre-training for building practical real-world applications involving vision-language tasks in different languages.
As for the few-shot results, we find that, similar to existing models, CCLM can also benefit from few-shot learning with a few examples in the target languages. By training on a few examples per class in the target languages, CCLM consistently outperforms the translate-test results of English VLMs by a larger margin. This further confirms the potential of multi-lingual multi-modal pre-training.

Results on Multi-lingual Retrieval
We also compare CCLM with state-of-the-art methods on the conventional image retrieval and text retrieval tasks on which UC2 and M3P originally reported their results. We follow the practice of prior work and evaluate in three different settings: English-only fine-tuning, single-language fine-tuning, and all-language fine-tuning, where the model is fine-tuned on English data, target-language data, and the combination of training data in all languages, respectively.
The results are shown in Table 2. For zero-shot cross-lingual transfer, we can see that CCLM_3M also substantially outperforms UC2, the prior state-of-the-art, with an average improvement of over 16% (in terms of averaged recall) across five languages. This confirms that our approach can better align multi-lingual multi-modal representations. Including the COCO and VG data also yields some improvements, which is consistent with the previous results on the IGLUE benchmark. Fine-tuning on target languages or on the combination of all languages yields consistent improvements. The improvements are not as large as those for UC2 and M3P, which is probably because the zero-shot cross-lingual transfer ability of CCLM is strong enough and the performance of our models is already saturating. Nevertheless, CCLM_3M still substantially outperforms the prior state-of-the-art by 9.4% and 6.8% averaged recall across five languages when fine-tuned on target languages or on the combination of all languages, respectively. Moreover, CCLM_4M also significantly outperforms MURAL_large, the prior state-of-the-art in the all-language fine-tuning setting, by 3.8% averaged recall across four languages. This is notable because MURAL_large is larger than our model and is pre-trained on much more data (∼450× more image-text pairs and ∼390× more parallel sentence pairs). Moreover, we show that CCLM also outperforms MURAL_large in the zero-shot setting (w/o fine-tuning) in Table 7.

Table 2: Results on multi-lingual image-text retrieval. We compute the average Recall@K for both image-to-text retrieval and text-to-image retrieval with K = 1, 5, 10 as the evaluation metric. For our models, the mean and standard deviation (in brackets) of 3 runs with different random seeds are reported. Results of compared models are directly copied from the corresponding papers. Numbers of MURAL are shown in gray because it is a dual encoder model pre-trained on much larger data and is thus not comparable with the other results.

Cross-lingual Transfer Gap
In addition to absolute cross-lingual transfer results, we also compare the cross-lingual transfer gap of different models. We visualize the ratio of a model's performance on non-English languages to its performance on the English test set in Figure 2. A larger radar chart indicates that the model has a smaller relative transfer gap and can better transfer its performance to non-English test sets. We can see that CCLM's relative cross-lingual transfer gap is consistently smaller than that of UC2 across all tasks in the IGLUE benchmark (a) and all languages in the multi-lingual retrieval datasets (b). The absolute cross-lingual transfer gap is even more significant. For example, in Table 2, we can see that for M3P, the absolute zero-shot cross-lingual transfer gaps between EN-CS and EN-JA in Flickr30K and MSCOCO are 41.4% and 32.6%, respectively. This indicates that masked language modeling on unpaired texts in multiple languages is not very effective for the cross-lingual alignment of multi-modal models. The gaps for UC2 are reduced to 13.2% and 16.4%, demonstrating the effectiveness of using machine-translated captions for multi-lingual multi-modal pre-training. Surprisingly, CCLM_4M further reduces these gaps to 4% and 2.8%. This further confirms that the proposed cross-view language modeling framework can effectively transfer multi-modal representations from English to other languages without language-specific fine-tuning.

Table 3: Ablation study results. Models w/o shared cross-attention and FFN are ablated variants where these modules are separately parameterized in the cross-lingual fusion model and the cross-modal fusion model. Models w/ TLM and TLM + CL are ablated variants where the multi-lingual objectives are those used in XLM-R and InfoXLM respectively, and thus are not unified with the multi-modal objectives. All compared models are pre-trained for 15 epochs.
In addition, we visualize the multi-lingual text representations and image representations of CCLM and a baseline approach in Figure 3, which clearly shows that our approach better aligns multi-lingual image-text representations. It is also noteworthy that the improved cross-lingual transfer ability does not sacrifice the model's performance on English. In Table 8, we can see that CCLM performs competitively with state-of-the-art English VLMs on representative English vision-language tasks. For completeness, we also report per-language results on the IGLUE benchmark in Tables 9, 10, 11, 12, and 13. Please refer to the Appendix for these analyses and results.

Ablation Study
We also conduct an in-depth ablation study to investigate the role of different design choices in the cross-view language modeling framework. We pre-train 5 ablated variants of CCLM in which parallel sentence pairs, the unified architecture, or the unified objectives are ablated. All compared models are pre-trained on the same CC3M and WikiMatrix data (except for the variant without parallel sentence pairs) for 15 epochs to ensure a fair comparison. The results are shown in Table 3. First, we find that separately parameterizing the cross-attention and FFN modules in the cross-lingual and cross-modal fusion models leads to inferior results, especially on multi-lingual multi-modal understanding tasks such as xGQA. We also find that using common objectives from the multi-lingual pre-training literature, which differ from the multi-modal objectives, underperforms the unified objectives used in our approach. These observations confirm the importance of unified architectures and objectives in the cross-view language modeling framework. Moreover, we find that the use of parallel sentence pairs also plays a very important role. This indicates that previous methods fail to fully exploit the potential of language pivoting for multi-lingual multi-modal pre-training.

Conclusion
In this paper, we introduce cross-view language modeling, a simple and effective framework that unifies cross-lingual and cross-modal pre-training. It treats cross-lingual and cross-modal pre-training as the same procedure of aligning the representations of two different views of the same object, and thus uses shared model architectures and training objectives for multi-lingual multi-modal pre-training. We train CCLM with the proposed framework and show that it advances the state-of-the-art on all downstream multi-lingual vision-language tasks by a large margin. More importantly, it surpasses the translate-test baseline for the first time, demonstrating the potential of multi-lingual multi-modal pre-training. We believe our model can become a foundation for future multi-lingual multi-modal research and serve as a strong baseline. Moreover, the cross-view language modeling framework also has the potential to unify more modalities such as speech and video with the same architectures and objectives. We leave this for future work.

A.1 Limitations and Potential Social Impacts
In this paper, CCLM is pre-trained with CC3M and WikiMatrix as the multi-modal and multi-lingual pre-training data, which are of moderate size. Also, we only experiment with base-size models. These choices are made because we want to make an apples-to-apples comparison with previous works such as M3P and UC2. However, there exist larger publicly available multi-lingual datasets (e.g., MultiUN [58], OPUS [59], etc.) and multi-modal datasets (e.g., CC12M, LAION [60], etc.). As suggested by the comparison between CCLM_4M and CCLM_3M, adding these large-scale datasets will probably lead to further performance improvements. Using larger models will also likely lead to some improvements. However, we do not run experiments with huge pre-training data and giant models due to environmental considerations [61,62], and we try to make our experiments as "green" as possible.
As for social impact, multi-modal pre-trained models can be used in applications that help people with a disability in one modality. Our work makes these applications applicable to speakers of non-English, and potentially low-resource, languages. In sum, our work potentially enables deep learning technology to benefit more people, and is unlikely to have a direct negative social impact.

A.2 Details of Pre-training Datasets
We pre-train CCLM on the combination of image-caption pairs and parallel texts. For image-caption pairs, we follow UC2 and use their released translation-augmented version of the CC3M dataset, which contains machine-translated captions in five different languages (German, French, Czech, Japanese, and Chinese). As for the parallel text corpus, we collect a subset of the WikiMatrix [40] dataset containing parallel texts between English and other languages in the IGLUE benchmark.

A.3 Fine-tuning Details

For few-shot experiments, we search over three learning rates {1e-6, 1e-5, 5e-5}. The few-shot data sizes and final hyperparameters are shown in Table 6. For all tasks, we train the network for 60 epochs on each language and evaluate every 10 epochs to select the best results.

Task              XVNLI   xGQA   MaRVL   xFlickr&CO
Shot number       48      48     20      100
Learning rate     1e-6    1e-6   1e-6    1e-6
Batch size        64      64     32      32
Epochs            60      60     60      60
Max input length  80      40     40      80

Table 6: Data size and hyperparameters used for IGLUE few-shot fine-tuning.
A.4 Zero-Shot on Retrieval

Table 7: Zero-shot results on multi-lingual image-text retrieval. We compute the average Recall@K for both image-to-text retrieval and text-to-image retrieval with K = 1, 5, 10 as the evaluation metric. Results of compared models are directly copied from the corresponding papers.
A.5 Results on English Tasks

Table 8 reports CCLM's performance on three representative English multi-modal tasks. We can observe that CCLM also achieves competitive performance compared to state-of-the-art English multi-modal baselines.

A.7 Per-Language Results
We also provide per-language results on the IGLUE benchmark in the following tables, which supplement the results in Table 1 of the main paper.

Table 13: Full per-language results on WIT. We take Recall@1 as the evaluation metric for both image-to-text retrieval and text-to-image retrieval.