AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities

In this work, we present a conceptually simple and effective method to train a strong bilingual/multilingual multimodal representation model. Starting from the pre-trained multimodal representation model CLIP released by OpenAI, we replace its text encoder with the pre-trained multilingual text encoder XLM-R, and align language and image representations through a two-stage training schema consisting of teacher learning and contrastive learning. We validate our method through evaluations on a wide range of tasks. We set new state-of-the-art performances on a range of tasks including ImageNet-CN, Flickr30k-CN, COCO-CN, and XTD. Further, we obtain performance very close to CLIP's on almost all tasks, suggesting that one can simply alter the text encoder in CLIP to gain extended capabilities such as multilingual understanding. Our models and code are available at https://github.com/FlagAI-Open/FlagAI.


Introduction
Learning a good representation in a joint space for vision and language has been a long pursuit in the research of Artificial Intelligence (AI). Recently, the milestone work of CLIP (Radford et al., 2021) from OpenAI demonstrated impressive zero-shot performances across a number of tasks, such as image classification on ImageNet (Deng et al., 2009) and image-to-text and text-to-image retrieval on Flickr30k (Young et al., 2014) and MSCOCO (Lin et al., 2014; Chen et al., 2015). This has spurred efforts to build contrastive language-image models in other languages such as Italian (Bianchi et al., 2021), Korean (Ko and Gu, 2022), and Chinese (Changpinyo et al., 2021; Fei et al., 2021; Wang et al., 2022; Gu et al., 2022; Xie et al., 2022), as well as in cross-lingual and multilingual settings (Aggarwal and Kale, 2020a).
Training a good language-image representation model often requires a huge amount of text-image pairs and vast computational resources. For instance, CLIP used 400M text-image pairs, and Taiyi (Wang et al., 2022), a recently proposed Chinese model, used 123M text-image pairs. To alleviate these problems, several works leverage the existing text-image representation of CLIP and extend its language capabilities to other languages (Portaz et al., 2019; Aggarwal and Kale, 2020a; Gu et al., 2022; Zhai et al., 2022). CN-CLIP (Yang et al., 2022) aligns a new Chinese text encoder to the CLIP vision encoder through 200M Chinese text-image pairs. More recently, M-CLIP (Carlsson et al., 2022) proposed to use Teacher Learning (a.k.a. Knowledge Distillation) on the text encoder of the CLIP model to learn a multilingual text-image representation model. This method uses only machine-translated data from English to a target language, without any text-image pairs. However, existing works in the cross-lingual or multilingual setting mainly focus on the model's retrieval performance and ignore its generalization ability. The datasets used to evaluate retrieval performance are often small, e.g., 1,000 images in the test set of Flickr30k, so retrieval performance fluctuates sharply with changes in the training data distribution. Although current methods achieve good retrieval performance, they often do not perform well on ImageNet classification. The ability to accurately classify images over 1,000 classes is a stronger indicator of a model's generalization ability.
To address the aforementioned problems, we propose a multilingual model named Alter ego CLIP (AltCLIP), which achieves strong performances on ImageNet and multimodal retrieval tasks across languages. AltCLIP learns a multilingual text-image representation under a two-stage framework (see Figure 1 for an overview). In the first stage, we use Teacher Learning on parallel texts to distill the knowledge learned by CLIP and align different languages with images. In the second stage, we further improve the alignment of text and images via Contrastive Learning (Hadsell et al., 2006) on a moderate amount of multilingual text-image pairs. We employ this method to train a multilingual vision-language model that supports nine languages, which we call AltCLIP-M9.
We present an extensive experimental comparison over a variety of benchmarks and baseline methods to demonstrate the effectiveness of our method. We show that using recall-based parallel text data in teacher learning yields a well-aligned text-image representation in both English and the extended languages, while contrastive learning with text-image pairs effectively aligns the multilingual language model to the CLIP vision encoder. The model trained by this two-stage strategy achieves very strong performance on a broad range of multilingual multimodal benchmarks, including the original English multimodal benchmarks studied in CLIP (Radford et al., 2021). AltCLIP-M9 sets new state-of-the-art results on multilingual image classification and retrieval tasks. Furthermore, AltCLIP-M9 achieves superior cross-modal performances in Chinese, Korean, Japanese, and Italian compared to methods trained from scratch on large-scale text-image pairs. Lastly, we apply AltCLIP-M9 to the task of text-to-image generation (Ramesh et al., 2021; Rombach et al., 2022) to show that it enables high-quality image generation from prompts in different languages.

Related Work
CLIP (Radford et al., 2021) provides a strong English vision-language representation. To expand the languages covered by CLIP, there are prior studies on learning bilingual text-image representations (Ko and Gu, 2022; Bianchi et al., 2021) and multilingual text-image representations (Aggarwal and Kale, 2020a). In the realm of multilingual models, MURAL (Jain et al., 2021), a dual-tower model, employs contrastive learning between multilingual text pairs and text-image pairs to expand the paradigm of multimodal learning. It was trained on large-scale private data obtained through web crawling, including more than 6 billion translation pairs and 1.8 billion image-caption pairs. Carlsson et al. (2022) proposed a way to utilize Teacher Learning (a.k.a. Knowledge Distillation) (Hinton et al., 2015) to train a new text encoder from the original CLIP model with only machine-translated parallel data. Although this method achieves promising cross-lingual retrieval performances with only text data, its zero-shot classification performance in English drops significantly. In the domain of Chinese text-image pretraining models, prior work includes Taiyi (Wang et al., 2022), CN-CLIP (Yang et al., 2022), Wukong (Gu et al., 2022), R2D2 (Xie et al., 2022), and BriVL (Huo et al., 2021; Fei et al., 2021). These methods often need large-scale Chinese text-image pairs and suffer a significant performance decline on English tasks.
XLM-R (Conneau et al., 2020) is a multilingual language model that achieves strong performances on a wide range of cross-lingual tasks. In our work, we use the XLM-R model as the underlying text encoder and align it with the image encoder trained in CLIP, to achieve competitive performances on cross-lingual and cross-modality tasks.
Knowledge distillation. In knowledge distillation, the teacher-student architecture is the generic vehicle for knowledge transfer. The capacity gap between a large deep neural network and a small student network can degrade knowledge transfer (Mirzadeh et al., 2020; Gao et al., 2021). To transfer knowledge to student networks effectively, a variety of methods have been proposed for a controlled reduction of model complexity (Crowley et al., 2018; Liu et al., 2019; Wang et al., 2018). In this work, we use the multilingual model XLM-R as the student model to effectively transfer multilingual knowledge.

Methodology
We propose a two-stage method to learn a multilingual multimodal representation model. In the first stage, we follow the work of Carlsson et al. (2022) and use Teacher Learning to learn a multilingual text encoder from the CLIP text encoder. In this stage, no images are needed; only language parallel data is used. In the second stage, we use text-image pairs to further fine-tune the model via contrastive learning. Our overall training procedure is summarized in Figure 1.

Teacher Learning Stage
In this stage, we perform Teacher Learning (Hinton et al., 2015) on text encoders. We use the text encoder from CLIP (Radford et al., 2021) as the teacher text encoder, and the XLM-R (Conneau et al., 2020) model pretrained on multilingual data as the student text encoder. A fully-connected layer is added to transform the output of the XLM-R model into the same output dimension as the teacher encoder. We use parallel text data between English and the other languages to distill the knowledge of text-image alignment.
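The sketch below renders this architecture in PyTorch: XLM-R followed by a fully-connected layer projecting to the teacher's output dimension. Class and parameter names (and the 768-dimensional teacher output, matching CLIP ViT-L/14's text projection) are our own illustrative assumptions, not the released implementation.

```python
# Minimal sketch of the student text encoder: XLM-R plus a fully-connected
# projection to the teacher text encoder's output dimension.
import torch.nn as nn
from transformers import XLMRobertaModel

class StudentTextEncoder(nn.Module):
    def __init__(self, teacher_dim: int = 768):  # assumed CLIP ViT-L/14 dim
        super().__init__()
        self.xlmr = XLMRobertaModel.from_pretrained("xlm-roberta-large")
        # Map XLM-R's hidden size (1024 for the large model) to teacher_dim.
        self.proj = nn.Linear(self.xlmr.config.hidden_size, teacher_dim)

    def forward(self, input_ids, attention_mask):
        out = self.xlmr(input_ids=input_ids, attention_mask=attention_mask)
        cls_embedding = out.last_hidden_state[:, 0]  # [CLS] token embedding
        return self.proj(cls_embedding)
```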
Given a parallel text input $(sent_1, sent_2)$, the teacher text encoder generates the learning target from input $sent_1$: the embedding of the [TOS] token, denoted by $x^{t}_{tos}$. The student text encoder generates the embedding $x^{s}_{cls}$ from input $sent_2$. We minimize the Mean Squared Error (MSE) between $x^{t}_{tos}$ and $x^{s}_{cls}$. After such training, the student text encoder keeps most of its multilingual capability and obtains text-image alignment capability in both languages. Note that the teacher encoder is only used at training time; at inference time, only the student encoder is used as the text encoder.
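Concretely, for a batch of $N$ parallel pairs, the distillation objective is (a sketch consistent with the description above):

$$\mathcal{L}_{\mathrm{distill}} = \frac{1}{N} \sum_{i=1}^{N} \left\| x^{t}_{tos,i} - x^{s}_{cls,i} \right\|_2^2$$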
To show that our method is extensible to more languages, we build a multilingual version (AltCLIP-M9) and a bilingual version (AltCLIP-M2). AltCLIP-M9 supports nine languages: English (EN), Chinese (CN), Spanish (ES), French (FR), Russian (RU), Arabic (AR), Japanese (JA), Korean (KO), and Italian (IT). For the bilingual version (AltCLIP-M2), we align Chinese with English, with the same concept and architecture as in the multilingual version.

Contrastive Learning Stage
This stage of training aims to further improve text-image alignment via contrastive learning on multilingual text-image pairs. As illustrated in Figure 1, we use the image encoder from CLIP, which is based on the Vision Transformer (ViT) (Dosovitskiy et al., 2020), as our image encoder, and use the student text encoder learned in the Teacher Learning stage as our text encoder.
We use a contrastive loss (Hadsell et al., 2006) between the output projections of the image encoder and the text encoder, as done in previous work (Radford et al., 2021). We follow LiT (Zhai et al., 2022) in freezing the image encoder at training time and only updating the parameters of the text encoder. We observe that this stage of training further improves the model's performance on various evaluation benchmarks, as presented in Section 5.
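As a sketch of this objective in CLIP's symmetric form: with normalized image embeddings $I_i$, text embeddings $T_i$, cosine similarity $\mathrm{sim}(\cdot,\cdot)$, and temperature $\tau$,

$$\mathcal{L} = -\frac{1}{2N} \sum_{i=1}^{N} \left[ \log \frac{\exp(\mathrm{sim}(I_i, T_i)/\tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(I_i, T_j)/\tau)} + \log \frac{\exp(\mathrm{sim}(I_i, T_i)/\tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(I_j, T_i)/\tau)} \right]$$

where, because the image encoder is frozen, gradients flow only into the text encoder.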

Training Datasets
In this section, we describe the training datasets used in our two-stage training schema.

Teacher Learning Stage

We use a parallel text corpus to distill knowledge from the CLIP text encoder into the XLM-R text encoder. The parallel corpus consists of a recall-based corpus and a machine-translated corpus produced by MBART (Tang et al., 2020). We use the same amount of data for each language: 5M recall-based parallel sentence pairs collected from OPUS (Tiedemann, 2012), 10M machine-translated pairs from LAION (Schuhmann et al., 2021), and 3M machine-translated pairs from Conceptual Captions (CC3M) (Sharma et al., 2018). We use TSL2019 (5M) (Xu, 2019) as the parallel data for training AltCLIP-M2.

Contrastive Learning Stage
We use unfiltered text-image pair data in this stage. For AltCLIP-M9, we randomly select 7 million text-image pairs per language from LAION2B-Multi (Schuhmann et al., 2022). For AltCLIP-M2, we employ only half a million text-image pairs per language in training.

Implementation details
We initialize our text encoder from XLM-R Large and use the text encoder from CLIP ViT-L/14 as the teacher text encoder. We use the image encoder from CLIP ViT-L/14 as our image encoder. In the Teacher Learning stage, we trained for 27 hours using 11×8 NVIDIA A100-SXM4-40GB GPUs. In the Contrastive Learning stage, we continued training for an additional 12 hours using 8 NVIDIA A100-SXM4-40GB GPUs. Detailed training settings can be found in Appendix A.3.

Experiments
We present experimental results in this section. In Section 5.1, we introduce the datasets and metrics used. We comprehensively validate our model on multilingual multimodal benchmarks in Section 5.2. In Section 5.3, we conduct an ablation study on the effects of various design choices in Teacher Learning and Contrastive Learning. Finally, in Section 5.4, we apply AltCLIP to text-to-image generation and show that our model aligns text across different languages.

Evaluation Datasets and Metrics
In this section, we describe the datasets and metrics used. We use ImageNet (Deng et al., 2009) and its four out-of-distribution test variants, i.e., ImageNet-Sketch (Wang et al., 2019), ImageNet-A (Hendrycks et al., 2021b), ImageNet-R (Hendrycks et al., 2021a), and ImageNetV2 (Recht et al., 2019), to evaluate zero-shot image classification performances in English (Radford et al., 2021), Chinese, Japanese, Italian, and Korean. We adapt the manual prompt templates from CLIP for English and use machine-translated versions of those templates for Chinese and Korean. For Japanese and Italian, the templates are collected from the same sources, with translated class names.
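For readers unfamiliar with the protocol, the following Python sketch shows how prompt templates are typically used for zero-shot classification; `encode_text` and `encode_image` are placeholder names for the model's API, not the exact FlagAI interface.

```python
# Sketch of template-based zero-shot classification: average each class's
# text embedding over all prompt templates, then pick the class whose
# embedding is most similar to the image embedding.
import torch

def zero_shot_classify(model, image, class_names, templates):
    with torch.no_grad():
        weights = []
        for name in class_names:
            prompts = [t.format(name) for t in templates]  # e.g. "a photo of a {}."
            emb = model.encode_text(prompts)               # (n_templates, d)
            emb = emb / emb.norm(dim=-1, keepdim=True)
            weights.append(emb.mean(dim=0))
        w = torch.stack(weights)                           # (n_classes, d)
        w = w / w.norm(dim=-1, keepdim=True)

        img = model.encode_image(image)                    # (1, d)
        img = img / img.norm(dim=-1, keepdim=True)
        return (img @ w.T).argmax(dim=-1)                  # predicted class index
```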
For cross-modal retrieval, we evaluate AltCLIP-M9 on the XTD dataset (Aggarwal and Kale, 2020b) and Multi30k (Elliott et al., 2016). XTD is built by selecting 1K images from COCO (Lin et al., 2014) and translating the corresponding English captions into 11 languages.
The Multi30k dataset is a collection of multilingual image captions that provides translations of captions in English, German, French, and Czech for 29,000 images. We select Flickr30k (Young et al., 2014) and COCO, as well as their corresponding Chinese datasets, Flickr30k-CN (Lan et al., 2017) and COCO-CN (Li et al., 2019), to evaluate zero-shot image-to-text and text-to-image retrieval performances in Chinese.
The evaluation metrics for image classification benchmarks are accuracy (default), mean per class (the average recall over categories, for imbalanced datasets such as FGVC Aircraft, Oxford-IIIT Pets, Caltech-101, and Oxford Flowers 102), 11-point mAP (the mean of 11-point interpolated average precision per class, for VOC 2007), and mean(top1, top5) (the mean of acc@1 and acc@5, for Kinetics-400 and Kinetics-600). For cross-modal retrieval benchmarks, we use Recall@K with K ∈ {1, 5, 10} and Mean Recall (the average of the Recall@K values) for both image-to-text and text-to-image retrieval, following the setup in CLIP (Radford et al., 2021).
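To make the retrieval metrics concrete, here is a minimal sketch of Recall@K and Mean Recall. It assumes a precomputed query-candidate similarity matrix in which the ground-truth match of query i is candidate i (multi-caption datasets would need an index mapping instead).

```python
# Recall@K: fraction of queries whose ground-truth candidate appears among
# the top-K most similar candidates. Mean Recall averages K in {1, 5, 10}.
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    ranks = (-sim).argsort(axis=1)  # candidates sorted by descending similarity
    hits = [i in ranks[i, :k] for i in range(sim.shape[0])]
    return float(np.mean(hits))

def mean_recall(sim: np.ndarray) -> float:
    return float(np.mean([recall_at_k(sim, k) for k in (1, 5, 10)]))
```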

Zero-shot performance
Image Classification. We first present evaluation results of zero-shot image classification on the ImageNet dataset and its four out-of-distribution variants. For baselines, we compare our model with OpenCLIP (Radford et al., 2021), CN-CLIP (Yang et al., 2022), KELIP (Ko and Gu, 2022), IT-CLIP (Bianchi et al., 2021), JA-CLIP, and multilingual CLIP (M-CLIP) (Carlsson et al., 2022). As illustrated in Table 1, AltCLIP-M9 outperforms OpenCLIP in English and sets new state-of-the-art results on ImageNet, ImageNet-A, ImageNet-R, and ImageNetV2 in Chinese, Japanese, Korean, and Italian. These results demonstrate the effectiveness of our method in expanding the language ability of CLIP. Compared to Chinese/Korean baseline models pretrained on hundreds of millions of text-image pairs, we use only 18M parallel sentence pairs and 7M text-image pairs (per language) in training.

Multilingual Cross-modal Retrieval
We compare our model with CLIP, M-CLIP (Carlsson et al., 2022), mUSE (Yang et al., 2020), UC2 (Zhou et al., 2021), MLA (Zhang et al., 2022), ALIGN (Jia et al., 2021), and MURAL (Jain et al., 2021). The results on Multi30k (Elliott et al., 2016) and XTD (Aggarwal and Kale, 2020b) are shown in Table 2, where AltCLIP-M9 achieves state-of-the-art results in 7 languages and outperforms the original CLIP model in English. This superior performance is likely due to the use of higher-quality parallel corpora during the Teacher Learning stage, which mitigates potential bias from machine translation. Additionally, we utilize contrastive learning to further align the text and image representations, which is crucial for downstream tasks. We discuss this in more detail in Section 5.3. We also provide additional cases in Appendix A.4.

Full CLIP benchmark. We present the evaluation results for a range of tasks in English in Figure 2. We compare the effectiveness of the multilingual AltCLIP-M9 and AltCLIP-M9-T with the original CLIP. AltCLIP-M9 outperforms CLIP, indicating that our method effectively fuses the abilities of CLIP and XLM-R. We observe that after the Teacher Learning stage, the model already learns a good text-image representation, as it achieves better average results than the original CLIP model on a range of zero-shot benchmarks.
The Contrastive Learning stage further improves the model's performance, particularly on retrieval tasks such as Flickr30k.
Task-level transferability. We evaluate the transferability of AltCLIP for zero-shot image classification on the "Image Classification in the Wild" (ICinW) tasks from the ELEVATER benchmark (Li et al., 2022). ICinW is a publicly available benchmark for evaluating the large-scale task-level transferability of vision-language models; it consists of a series of image classification datasets such as KITTI-Distance (Fritsch et al., 2013) and Hateful Memes (Kiela et al., 2020). The results are shown in Table 3.

Table 5: Ablation experiments. For a fair comparison, all models were trained for 10 epochs and evaluated using the average results over the nine-language ImageNet series tasks (INs), image-retrieval tasks (IRs), and text-retrieval tasks (TRs) on XTD and Multi30K for eight languages (excluding Arabic).
Comparison with models trained from scratch. We compare our model with models trained on hundreds of millions of text-image pairs: CLIP in English, and R2D2 (Xie et al., 2022), Wukong (Gu et al., 2022), Taiyi (Wang et al., 2022), and CN-CLIP (Yang et al., 2022) in Chinese. The results are shown in Table 4. AltCLIP-M9 outperforms all baseline models, including those trained with large-scale text-image pairs, on most datasets and tasks. We notice that AltCLIP-M2 outperforms CLIP on both text-to-image and image-to-text retrieval. This could be due to the following reasons: (1) we used a small subset (less than 1M pairs) of LAION-5B at the Contrastive Learning stage, which has a different distribution from the pretraining data used in CLIP; (2) our language encoder, initialized from XLM-R, provides better language understanding ability. We elaborate on the detailed results of the bilingual setting in Appendix A.2.

Ablation study
We evaluate the effectiveness of AltCLIP-M9 by analyzing its major components in this section. We use CL to denote the Contrastive Learning stage, and MT and RB to denote the machine-translated and recall-based parallel data used in the Teacher Learning stage. We evaluate variations of our model in English-only and multilingual settings, using the average scores on the ImageNet series (INs), image retrieval tasks (IRs), and text retrieval tasks (TRs) as evaluation metrics. Results in Table 5 show that excluding machine-translated data has a significant impact on performance, except on the multilingual ImageNet series tasks. Combining machine-translated and recall-based parallel data leads to a significant improvement on most tasks, indicating that both the quality and the diversity of training data matter. Additionally, the Contrastive Learning stage significantly improves the model's performance on multilingual tasks, reaching 58.4 on the multilingual INs, a 3.9% improvement.

Examples of text-to-image generation
In this section, we apply our model to the task of text-to-image generation to enable multilingual image generation and to show the effect of language alignment in our model. We use the text encoder of AltCLIP-M9 to fine-tune a Stable Diffusion model (Rombach et al., 2022). We use stable-diffusion v1-4 (https://huggingface.co/CompVis/stable-diffusion-v-1-4-original) as initialization and AltCLIP-M9 as the language encoder, and we freeze all parameters in the diffusion model except the key and value projection layers of the cross-attention blocks during fine-tuning. The dataset used for fine-tuning is the same one used for the Contrastive Learning stage, as described in Section 4.1. As demonstrated in Figure 3, our model generates high-quality images comparable to those generated by Stable Diffusion. This is likely because AltCLIP-M9 achieves competitive performance in English with CLIP, which is the text encoder used in the original Stable Diffusion model. Additionally, we observe that our model generates similar images for translated English and Chinese prompts, demonstrating the effect of language alignment. More examples of images generated from prompts in different languages can be found in Appendix A.5.
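A minimal sketch of this selective fine-tuning is shown below. The parameter-name filters follow the Hugging Face diffusers convention ("attn2" for cross-attention, "to_k"/"to_v" for key/value projections); treat them as a naming assumption, not the authors' exact training script.

```python
# Freeze the diffusion UNet except the key/value projections of its
# cross-attention blocks, so only those layers adapt to the new text encoder.
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet"
)
for name, param in unet.named_parameters():
    # "attn2" marks cross-attention; "to_k"/"to_v" are its key/value layers.
    trainable = "attn2" in name and ("to_k" in name or "to_v" in name)
    param.requires_grad = trainable
```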

Conclusion
In this work, we propose an effective two-stage training method for learning multilingual multimodal representation models through teacher learning and contrastive learning. Its effectiveness is demonstrated through extensive experiments on a wide range of multilingual multimodal benchmarks. AltCLIP-M9 outperforms the original CLIP model on many English tasks and sets new state-of-the-art zero-shot results on multiple image classification tasks in Chinese, Korean, Italian, and Japanese, as well as on multilingual retrieval tasks. Meanwhile, our method is highly data-efficient, consuming only around 1% of the text-image pairs used by prior work on vision-language pretraining.

Limitations
This study has certain limitations. One is the limited scope of the training data employed. The AltCLIP model is trained on open-source parallel corpora and publicly available unfiltered text-image pairs; a more careful study of the training data, e.g., filtering text-image pairs by relevance and text/image quality, may further improve the overall performance of the model. Another limitation is the challenge of evaluating the model in a multilingual setting. Despite our best efforts to include as many benchmarks as possible and to translate English datasets, the evaluation of the model's performance in other languages is not as comprehensive as in English. For example, fewer tasks, such as OCR or action recognition in videos, are available in other languages. In addition, the use of machine translation may introduce biases that affect performance. Future research should focus on creating a more robust and scientifically rigorous multilingual evaluation framework.

Ethics Statement
The AltCLIP approach presents an innovative way of building robust multilingual multimodal representation models while minimizing the need for energy-intensive GPU training, promoting a more sustainable approach. It also allows for greater accessibility, as it does not require extensive computational resources to implement. Furthermore, our model was trained on open-sourced data, and the model itself is open-sourced to promote transparency and reproducibility. However, we have not carefully investigated the training data we used, such as LAION (Schuhmann et al., 2022). The data may contain unsafe or biased text and/or images, and models pretrained on it have the potential to reproduce sensitive training data. It is crucial to use this method responsibly and ethically to ensure it contributes to safe applications.

Figure 1 :
Figure 1: An illustration of AltCLIP. In the Teacher Learning stage, the student model (XLM-R) learns a well-aligned multilingual text-image representation. The Contrastive Learning stage further improves alignment using only 7 million text-image pairs per language, making training far more resource-efficient than training from scratch.

Figure 2 :
Figure 2: Experimental results on the CLIP benchmark. AltCLIP-M9-T denotes our model after the Teacher Learning stage, while AltCLIP-M9 denotes our model after the Contrastive Learning stage. All image encoders are CLIP ViT-L/14.


Table 1 :
Results on multilingual image classification benchmarks. We compare AltCLIP-M9 with M-CLIP and models trained from scratch in five languages. For a fair comparison, models with ViT-L are chosen by default, except those marked with †. The metric reported is zero-shot classification accuracy. We also build datasets and evaluate our model in the remaining four languages with machine translation; details are in Appendix A.1.

Table 2 :
Results on the multilingual cross-modal retrieval datasets. Recall@10 is reported for text-to-image retrieval on XTD, and average recall for text-to-image and image-to-text retrieval on Multi30K. † denotes languages unseen when training AltCLIP-M9, such as German and Czech. ‡ denotes reproduced results. Note that the strong results of MURAL-Large come from large-scale private data: 6 billion translation pairs (up to 100 million per language) in 109 languages and 1.8 billion image-caption pairs.

Table 3 :
Results on Image Classification in the Wild (ICinW).

Table 4 :
Experimental results on English and Chinese retrieval tasks. All image encoders used in these models are ViT-L for a fair comparison. † denotes results reported from the original papers.