Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages

Vision-Language Pre-training (VLP) has advanced the performance of many vision-language tasks, such as image-text retrieval, visual entailment, and visual reasoning. The pre-training mostly utilizes lexical databases and image queries in English. Previous work has demonstrated that pre-training in English does not transfer well to other languages in a zero-shot setting. However, multilingual pre-trained language models (MPLM) have excelled at a variety of single-modal language tasks. In this paper, we propose a simple yet efficient approach to adapt VLP to unseen languages using MPLM. We utilize a cross-lingual contextualised token embedding alignment approach to train text encoders for non-English languages. Our approach does not require image input and primarily uses machine translation, eliminating the need for target language data. Our evaluation across three distinct tasks (image-text retrieval, visual entailment, and natural language visual reasoning) demonstrates that this approach outperforms the state-of-the-art multilingual vision-language models without requiring large parallel corpora. Our code is available at https://github.com/Yasminekaroui/CliCoTea.


Introduction
Inspired by the recent advancements in language model pre-training, Vision-Language Pre-trained Models (VLPMs) have demonstrated state-of-the-art performance across a wide range of vision-language (VL) tasks such as text-to-image retrieval, visual reasoning, visual entailment, and visual QA (Chen et al., 2020; Li et al., 2021, 2022). However, extending VLPMs to multilingual scenarios is still challenging. On one hand, the majority of these models are trained on monolingual (English) corpora and thus cannot perform well for other languages. On the other hand, multilingual pre-trained language models (Devlin et al., 2018; Conneau et al., 2019) cannot handle vision data (e.g., images or videos) directly.

* Yasmine performed this work while interning at EPFL.

Figure 1: Overview of our approach. We adapt the text encoder of a monolingual VL model to an unseen language (a). Then we use the adapted model for a VL downstream task in a zero-shot setting (b).
Lately, there have been attempts (M3P, mUNITER, UC2) to pivot on images or English texts to align multilingual representations with vision features (Chen et al., 2020; Ni et al., 2021; Zhou et al., 2021).
However, a recent benchmark on multilingual multimodal pre-training (IGLUE) (Bugliarello et al., 2022) shows that although these models achieve promising zero-shot cross-lingual transfer performance on some VL tasks, they still fall short of the "translate-test" baseline (using an English-only VLPM on translations of the text examples).
A more recent work (CCLM) achieves promising performance on the IGLUE benchmark by exploiting massive parallel text and image-text corpora to pre-train a VL model (Zeng et al., 2022). This approach is motivated by a key observation that multilingual and multimodal pre-training essentially achieve the same goal of aligning two different views of the same object into a common semantic space. Although this framework performs well on the IGLUE benchmark, it requires a large amount of parallel data: its pre-training phase relies on 19M multilingual parallel sentence pairs extracted from WikiMatrix (Schwenk et al., 2021), jointly trained with 4M image-text pairs in multiple languages.
In this work, we propose a simple yet efficient way to adapt VLP models to unseen languages without requiring large parallel corpora. We propose to align the monolingual text encoder of a VLPM (achieving state-of-the-art performance on English downstream VL tasks) with a multilingual pre-trained language model (e.g., mBERT), using only a small in-domain parallel text corpus. Recent progress in Neural Machine Translation (NMT) enables us to create such a parallel corpus by automatically translating the data from English to any other language, even low-resource ones (e.g., Swahili). Moreover, since our approach relies on token alignment, it is robust to errors made by NMT. Our zero-shot evaluation across three of the four IGLUE tasks shows that the proposed method achieves state-of-the-art results while using a small set of in-domain parallel sentences. The key steps of our approach are illustrated in Figure 1.

CLiCoTEA: Cross-Lingual Contextualised Token Embedding Alignment

We propose CLiCoTEA, an approach to transfer a monolingual vision-language (VL) pre-trained model from a language L1 with an abundant number of image-text training pairs (i.e., English) to a second language L2. As we focus in this paper on the zero-shot setting, we do the transfer after fine-tuning the pre-trained monolingual VL model on a downstream task t for which training samples are available in language L1. CLiCoTEA consists of six steps:
1. Pre-train a monolingual VL model on a massive collection of image-text pairs, where text is written in language L1.
2. Fine-tune the VL pre-trained model on the downstream task t in language L1.
3. Create a parallel text corpus by translating the training set from step 2 into the target language L2. Note that this step can be done automatically using neural machine translation.
4. Create a list of aligned tokens for each (potentially noisy) parallel sentence pair using a token alignment model.
5. Cross-lingual transfer by aligning contextualised token embeddings. As illustrated in Figure 1a, this step transfers the fine-tuned VL model to the new language L2 by aligning a pre-trained multilingual LM (e.g., mBERT or XLM-R) with the text encoder of the VL pre-trained model, using the list of aligned tokens created in step 4.
6. Zero-shot transfer to L2 by swapping the monolingual text encoder of the VL pre-trained model with the aligned multilingual text encoder learned in step 5. An example of visual reasoning in Indonesian is illustrated in Figure 1b.
In practice, steps 1 and 2 are the most computationally expensive. Therefore, we propose to adapt fine-tuned VL models to new languages by performing only steps 3 to 5, which can be computed in a few hours on a single GPU.
We note that CLiCoTEA could be used with any multimodal pre-trained model where one of the modalities is a monolingual text encoder. We focus in this paper on VL models, but CLiCoTEA could be applied for instance to a language-knowledge model such as GreaseLM (Zhang et al., 2021) or DRAGON (Yasunaga et al., 2022).

Pre-trained Models
Vision-Language Model In step 1 of CLiCoTEA, we use the Align BEfore Fuse (ALBEF) framework (Li et al., 2021) as our Vision-Language Pre-trained Model (VLPM). ALBEF has been fine-tuned on multiple downstream VL tasks and achieves state-of-the-art performance. We use the ALBEF fine-tuned models in step 2 for the downstream tasks described in Section 3.3. Unlike other competitive VL pre-trained models (such as BLIP (Li et al., 2022)) that inject visual information by inserting cross-attention into each transformer block of the text encoder, ALBEF first encodes the image and text independently with a detector-free image encoder and a text encoder. It then uses a multimodal encoder to fuse the image features with the text features through cross-modal attention. All encoders are based on transformer networks, with the text encoder being a 6-layer transformer initialised with the first 6 layers of BERT-base. We thus extract this 6-layer text encoder for the cross-lingual transfer training in step 5.
Multilingual Language Model As a multilingual pre-trained language model, we use multilingual BERT (mBERT) (Devlin et al., 2018), available on the HuggingFace hub at https://huggingface.co/bert-base-multilingual-cased. It has been trained on the 104 languages with the largest Wikipedias using a masked language modelling (MLM) objective and has demonstrated remarkable zero-shot cross-lingual transfer capabilities (Wu and Dredze, 2019; Pires et al., 2019; Hu et al., 2020; Conneau et al., 2018). We extract its first 6 transformer layers to be aligned with the text encoder of ALBEF in step 5.

Implementation Details
Word Alignment Since the parallel sentences do not contain word-level alignment information, in step 4 of CLiCoTEA we use awesome-align (Dou and Neubig, 2021), available at https://github.com/neulab/awesome-align, a tool that automatically extracts word alignments from mBERT. The generated word pairs are then filtered to keep only one-to-one, one-to-many, or many-to-one alignments, removing many-to-many alignments. This is done for all languages except Chinese, because otherwise less than 3% of the training data would remain. The advantage of this filtering is twofold: a) it removes noise from the matched word pairs; b) it reduces the training time and computation. For words that are split into sub-word tokens, we consider either the left-most token embedding (i.e., the first sub-word token of a word) or the average embedding across all sub-word tokens.
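The filtering can be sketched as follows. The exact criterion is not spelled out above, so this is one plausible reading, in which a pair is dropped only when both of its tokens also appear in other pairs (i.e., the pair is part of a many-to-many alignment):

```python
from collections import Counter

def filter_alignments(pairs):
    """Keep one-to-one, one-to-many, and many-to-one alignments;
    drop a (src, tgt) pair only when the source token AND the target
    token each occur in several pairs (many-to-many)."""
    src_counts = Counter(s for s, _ in pairs)
    tgt_counts = Counter(t for _, t in pairs)
    return [(s, t) for s, t in pairs
            if src_counts[s] == 1 or tgt_counts[t] == 1]

# Token-index pairs as produced by a word aligner for one sentence pair
pairs = [(0, 0), (1, 1), (1, 2), (3, 3), (4, 3), (5, 5), (5, 6), (6, 5)]
print(filter_alignments(pairs))  # (5, 5) is many-to-many and is dropped
```

Here (1, 1)/(1, 2) is a one-to-many alignment and (3, 3)/(4, 3) a many-to-one alignment, so both survive; only (5, 5), whose tokens both occur in further pairs, is removed.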

Contextualised Token Alignment Training
Given a set of aligned contextual word pairs extracted from parallel sentences, we define the training set as {(x_i, y_i)}_{i=1}^{n}, where x_i ∈ R^d is the contextualised embedding of token i in the target language (obtained from mBERT), and y_i ∈ R^d is the contextualised embedding of its aligned token in the source language (obtained from the fine-tuned ALBEF). In step 5 of CLiCoTEA, we minimise the following training objective: sum_{i=1}^{n} ||x_i - y_i||_2. The parameters of the source language encoder are frozen, while those of the target language encoder are fine-tuned at training time. The learning rate is set to 5×10^-5 and the batch size to 128. These hyperparameters are set using the NLVR2, Flickr30k, and SNLI validation sets, for each task respectively. For each target language, the training is done on a single GeForce GTX TITAN X GPU in a few hours.
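The objective above sums the L2 distances between aligned embedding pairs. A minimal sketch in plain Python (in practice x_i comes from the trainable target-language encoder and y_i from the frozen ALBEF text encoder, and the loss is minimised by gradient descent rather than merely evaluated):

```python
import math

def alignment_loss(tgt_embs, src_embs):
    """Sum over aligned token pairs of ||x_i - y_i||_2, where x_i is the
    target-language embedding (trainable) and y_i the source-language
    embedding (frozen)."""
    assert len(tgt_embs) == len(src_embs)
    return sum(
        math.sqrt(sum((xj - yj) ** 2 for xj, yj in zip(x, y)))
        for x, y in zip(tgt_embs, src_embs)
    )

# Toy 2-d embeddings: the pairwise distances are 0 and 5, so the loss is 5.0
print(alignment_loss([[0.0, 0.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]]))  # 5.0
```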
Data Augmentation As multilingual language models are generally pre-trained on the source language L1, the contextualised token alignment can be trained not only with sentences from the target language L2, but also with sentences from the source language L1. This strategy doubles the training size and, consequently, the training time, but it can be used for tasks where the number of available training sentences is limited.
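With a parallel corpus stored as (target, source) sentence pairs, this augmentation amounts to also aligning each source sentence with itself; the helper name below is illustrative, not from the paper:

```python
def augment_with_source(parallel_pairs):
    """Bilingual alignment training: besides each (L2 sentence, L1 sentence)
    pair, also align each L1 sentence with itself, doubling the data."""
    return parallel_pairs + [(src, src) for _, src in parallel_pairs]

pairs = [("dua anjing berlari", "two dogs are running")]
print(augment_with_source(pairs))
```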

Downstream Tasks
In step 6, we evaluate CLiCoTEA on three tasks from the IGLUE benchmark in the zero-shot setting:
• xFlickr&CO: The dataset is composed of 1000 images from Flickr30K (Plummer et al., 2015) and 1000 images from the MSCOCO dataset (Lin et al., 2014). These images come with crowdsourced image captions in 6 different languages. xFlickr&CO is a retrieval task dataset composed of two subtasks: image-to-text retrieval (TR) and text-to-image retrieval (IR).
• XVNLI: The dataset merges SNLI hypotheses with Flickr30K (Plummer et al., 2015) images, with the test set translated into four languages. The task is visual entailment (VE), a fine-grained reasoning task to determine whether a text hypothesis "contradicts", "entails", or is "neutral" with respect to an image.
• MaRVL: The dataset is a multilingual expansion of the NLVR2 dataset (Suhr et al., 2017), with images related to concepts of five languages and cultures. The task is visual reasoning (VR), which consists in determining whether a statement is correct given a pair of images.
Table 1 shows the datasets used for a) fine-tuning the monolingual VL pre-trained model in step 2, b) training the alignment of contextualised token embeddings in step 5, and c) testing the zero-shot cross-lingual transfer in step 6. To create the parallel corpus in step 3, all datasets used for fine-tuning the monolingual pre-trained VL model are translated into the languages of the corresponding IGLUE test datasets using the GoogleTrans Python API. Statistics about the translated datasets can be found in Section A.1. MaRVL being the smallest dataset, the data augmentation strategy described in Section 3.2 is applied only for this task. Detailed results on data augmentation can be found in Section 3.2.

Experimental Results
Results reported in Table 2 show that CLiCoTEA outperforms the state-of-the-art CCLM models for all downstream tasks except retrieval. The largest improvement over CCLM is obtained in visual entailment, with an increase of almost 5%. The superiority of CLiCoTEA is especially high for Spanish (+7.68%), as can be seen from Table 10 in Section A.4. The average performance on visual reasoning is similar to CCLM, but CLiCoTEA significantly outperforms CCLM by around 4% on low-resource languages such as Tamil and Swahili (results per language can be seen in Table 8).

Conclusion
In this paper, we present CLiCoTEA, an approach for adapting Vision-Language pre-trained models to unseen languages. Unlike other approaches that rely on an expensive pre-training phase (both in terms of data and computation), our approach adapts the contextualised token embeddings of a multilingual pre-trained language model by aligning them with the contextualised token embeddings of the VLPM text encoder. By aligning the ALBEF text encoder with mBERT, we show that CLiCoTEA outperforms CCLM, which exploits massive parallel text and image-text corpora.
CLiCoTEA achieves state-of-the-art performance on visual entailment and visual reasoning, with an increase of almost 5% on visual entailment. It is especially effective for low-resource languages, as it does not require large corpora for the adaptation.

Limitations
The general performance of CLiCoTEA could be improved with a better MPLM than mBERT, such as XLM-R, which has a larger token vocabulary and has been pre-trained on a much larger dataset. Our approach is currently not applicable to generation tasks, where a multilingual text decoder would be needed to generate text in unseen languages. We leave this for future work.

A.2 Results on Retrieval
Zero-shot performance on the xFlickr&CO dataset, the image-text and text-image retrieval tasks from the IGLUE benchmark, for the five available languages (DE: German, ES: Spanish, ID: Indonesian, RU: Russian, TR: Turkish) is reported in Table 6 and Table 7.

A.3 Results on Natural Language Visual Reasoning

Table 8 shows the zero-shot performance on the MaRVL dataset, the natural language visual reasoning task from the IGLUE benchmark, for all available languages (ID: Indonesian, SW: Swahili, TA: Tamil, TR: Turkish, ZH: Chinese). As MaRVL is the smallest dataset among the three IGLUE tasks, we apply the data augmentation for training the alignment as described in Section 3.2. Results reported in Table 9 show that there is a drop of 3.35% for Turkish and 9.99% for Chinese when training only with the target language L2, while there is no significant difference for the three other languages (Indonesian, Swahili, and Tamil).

Table 9: Zero-shot performance of CLiCoTEA on visual reasoning with the MaRVL dataset using monolingual (L2) or bilingual (L1 + L2) alignment training. Accuracy is reported.

To eliminate the need for machine translation in step 3 of CLiCoTEA, we created a parallel text corpus with sentences obtained from XNLI (Conneau et al., 2018), which is publicly available and covers 15 languages. A subset of XNLI was used for training the alignment by considering only the sentences that are semantically close to the captions in NLVR2. To do so, we used the Sentence-Transformers framework (available at https://www.sbert.net) to compute sentence embedding similarities between NLVR2 captions and XNLI English sentences, and kept only the sentences with a cosine similarity higher than 0.5. About 50k English sentences from XNLI are semantically close to NLVR2 captions; we thus selected their parallel sentences in Swahili, Turkish, and Chinese to perform an evaluation on the MaRVL dataset. After the contextualised token alignment training on the XNLI-based datasets, our results in Table 11 suggest that a multilingual open-domain dataset gives better results than mUNITER and xUNITER but underperforms the results obtained by translating in-domain training sets. This could be explained by the fact that although these datasets are multilingual, the sentences are not semantically close enough to the NLVR2 captions.
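The selection of XNLI sentences can be sketched as follows; `embed` stands in for the Sentence-Transformers encoder and is an assumption of this sketch, as is the function name:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def select_close_sentences(xnli_en, nlvr2_captions, embed, threshold=0.5):
    """Keep the XNLI English sentences whose embedding has cosine
    similarity above `threshold` with at least one NLVR2 caption."""
    caption_embs = [embed(c) for c in nlvr2_captions]
    return [s for s in xnli_en
            if max(cosine(embed(s), e) for e in caption_embs) > threshold]

# Toy embedder standing in for a Sentence-Transformers model
def embed(sentence):
    return [1.0, 0.0] if "dog" in sentence else [0.0, 1.0]

print(select_close_sentences(["a dog sleeps", "a red car"], ["two dogs play"], embed))
```

The retained English sentences' parallel counterparts in the target languages then replace the machine-translated corpus.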

B5. Did you provide documentation of the artifacts, e.g., coverage of domains, languages, and linguistic phenomena, demographic groups represented, etc.? Section 3 B6. Did you report relevant statistics like the number of examples, details of train / test / dev splits, etc. for the data that you used / created? Even for commonly-used benchmark datasets, include the number of examples in train / validation / test splits, as these provide necessary context for a reader to understand experimental results. For example, small differences in accuracy on large test sets may be significant, while on small test sets they may not be. Section 3 and Appendix A C Did you run computational experiments? Section 3 C1. Did you report the number of parameters in the models used, the total computational budget (e.g., GPU hours), and computing infrastructure used? Sections 2 and 3 The Responsible NLP Checklist used at ACL 2023 is adopted from NAACL 2022, with the addition of a question on AI writing assistance.