CAPIVARA: Cost-Efficient Approach for Improving Multilingual CLIP Performance on Low-Resource Languages

This work introduces CAPIVARA, a cost-efficient framework designed to enhance the performance of multilingual CLIP models in low-resource languages. While CLIP has excelled in zero-shot vision-language tasks, the resource-intensive nature of model training remains challenging. Many datasets lack linguistic diversity, featuring solely English descriptions for images. CAPIVARA addresses this by augmenting text data using image captioning and machine translation to generate multiple synthetic captions in low-resource languages. We optimize the training pipeline with LiT, LoRA, and gradient checkpointing to alleviate the computational cost. Through extensive experiments, CAPIVARA emerges as state of the art in zero-shot tasks involving images and Portuguese texts. We show the potential for significant improvements in other low-resource languages, achieved by fine-tuning the pre-trained multilingual CLIP using CAPIVARA on a single GPU for 2 hours. Our model and code are available at https://github.com/hiaac-nlp/CAPIVARA.


Introduction
The challenge of learning a joint multimodal representation for vision and language has driven the development of various pre-trained models in recent years (Wang et al., 2021; Gao et al., 2021; Yang et al., 2022b; Geng et al., 2022; Li et al., 2023). Remarkably, CLIP (Radford et al., 2021) has gained attention for achieving state of the art on zero-shot vision-language tasks through contrastive learning, aligning images and text within a multimodal embedding space.
Training models such as CLIP requires massive data and computational resources despite their good generalization capacity. These models are typically trained with datasets containing hundreds of millions of image-text pairs, often collected from the web. However, many datasets only provide images paired with English descriptions; as a result, the research community focuses excessively on English texts, whereas other languages are neglected, reinforcing cultural, regional, and linguistic biases (Bender et al., 2021). While recent advancements include approaches for languages beyond English (Bianchi et al., 2021; Yang et al., 2022a; Ko and Gu, 2022) and multilingual methods (Carlsson et al., 2022; Chen et al., 2023), they primarily focus on high-resource languages. There is a scarcity of approaches considering low-resource languages, and even models that include them show performance disparities in tasks involving these languages compared to tasks with English texts.
We propose a cost-efficient approach for improving multilingual CLIP performance in low-resource languages (CAPIVARA), addressing the performance gap with English and reducing computational requirements. Our approach relies on the assumption that datasets may contain images annotated with noisy descriptions. Thus, our framework utilizes BLIP2 (Li et al., 2023) to generate multiple synthetic captions for each image, addressing the challenges of noisy annotations and limited language diversity. Using the re-annotated dataset, we translate both the original and generated captions into the target language and fine-tune the multilingual model. To mitigate the computational cost associated with CLIP model training, we optimize the training pipeline with the LiT strategy (Zhai et al., 2022), wherein the image encoder remains frozen during training, together with gradient checkpointing (Chen et al., 2016) and LoRA (Hu et al., 2021). Figure 1 demonstrates that substantial improvements in low-resource languages can be achieved by fine-tuning the pre-trained multilingual CLIP with CAPIVARA.
Our main contributions are as follows:
• We introduce CAPIVARA, a low-cost data-centric framework that leverages image captioning models to enhance the annotation of existing datasets, improving the performance of pre-trained multilingual CLIP in low-resource languages. We also report the carbon footprint of our method.
M-CLIP (Carlsson et al., 2022) employs a teacher-learning technique that transfers knowledge from a pre-trained teacher network to new language models. M-CLIP is applied to 68 languages, using versions of datasets translated by the MarianMT model (Junczys-Dowmunt et al., 2018). AltCLIP (Altering the Language Encoder in CLIP) (Chen et al., 2023) introduces a bilingual model for Chinese and a multilingual one for 11 languages. Like M-CLIP, its teacher-learning technique involves only the textual model across various languages. However, AltCLIP differs by incorporating English text distillation, human-curated translations, and a final fine-tuning phase. It also uses the LiT strategy to freeze the image encoder.
Data-Centric Approaches. Multimodal learning has been mainly explored through algorithmic designs, often treating datasets as monolithic. Santurkar et al. (2023) reveal that CLIP's performance depends on three pre-training dataset properties: dataset size, caption descriptiveness, and caption variability for each image. They employ BLIP (Bootstrapping Language-Image Pre-training) (Li et al., 2022b) to generate new captions to address limited text diversity, improving CLIP performance. Similarly, Fan et al. (2023) propose LaCLIP (Language augmented CLIP), which uses an LLM (Large Language Model) to rewrite captions and increase the text diversity within text-image pairs in the pre-training dataset. However, the decoupled text-generation process might limit effectiveness in datasets with non-descriptive captions (Nguyen et al., 2023).
Our work is related to Fan et al. (2023) and Nguyen et al. (2023). However, their studies focus on English captions during training and require extensive computational resources. In contrast, our research addresses a constrained scenario with limited computational power (a single GPU) and a lack of annotated datasets in the target language. We leverage multilingual OpenCLIP and English-annotated open datasets to enhance model performance in Portuguese. Our method, centered on Portuguese-translated captions, can be extended to other languages, making it well-suited for low-resource language challenges.

Method
This section details our approach, including generating captions, translating them into Portuguese, and integrating these new captions into the training pipeline. It also describes optimization through LoRA and gradient checkpointing, effectively reducing the computational resources for CLIP model training. Figure 2 illustrates the main components of CAPIVARA.

Model Architecture
We use the pre-trained multilingual model OPENCLIP VIT-B/32 XLM-ROBERTA BASE (OPENCLIP for short). This model utilizes XLM-RoBERTa Base (Conneau et al., 2020) as the text encoder and ViT Base (Dosovitskiy et al., 2020) with a 32×32 patch size as the image encoder. The model was pre-trained on LAION-5B (Schuhmann et al., 2022) for 12.8B steps with a batch size of 90k. We employ base versions of the encoders, as larger models would demand significantly greater computational resources for both training and inference. This consideration is crucial when addressing the low-resource language community.

Datasets
We use CC3M (Sharma et al., 2018) and modifications of it to fine-tune the OPENCLIP model to improve its performance in Portuguese. For zero-shot text-to-image and image-to-text retrieval tasks, we use PraCegoVer (dos Santos et al., 2022), which is composed of images originally annotated with Portuguese texts, and our Portuguese-translated versions of MS COCO (Lin et al., 2014) and Flickr30k (Plummer et al., 2017). We also translate the labels from ImageNet (Deng et al., 2009) and the ELEVATER benchmark datasets (Li et al., 2022a) for image classification.

Dataset Filtering
Similar to Schuhmann et al. (2022) and Gadre et al. (2023), we apply CLIP score filtering: we discard examples where the cosine similarity between the image and text embeddings, computed by OPENCLIP VIT-B/32 XLM-ROBERTA BASE, is lower than 0.20. We apply this method to CC3M, naming the resulting dataset CC3M-Filtered. We also apply it to PraCegoVer, used as a test set, to remove unrelated image-text pairs.
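The filtering rule can be sketched as follows, assuming precomputed image and text embeddings; the function name and toy data are illustrative, not from the released code:

```python
import numpy as np

def clip_score_filter(image_embs, text_embs, threshold=0.20):
    """Keep only pairs whose image-text cosine similarity >= threshold."""
    # L2-normalize so the row-wise dot product equals cosine similarity
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    scores = np.sum(image_embs * text_embs, axis=1)  # one score per pair
    keep = scores >= threshold
    return keep, scores

# toy embeddings: first pair well aligned, second pair nearly orthogonal
imgs = np.array([[1.0, 0.0], [1.0, 0.0]])
txts = np.array([[0.9, 0.1], [0.05, 1.0]])
keep, scores = clip_score_filter(imgs, txts)
```

In a real pipeline the embeddings would come from the frozen OPENCLIP encoders; the 0.20 threshold is the value reported above.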
In CLIP training with image augmentation, the contrastive objective can be written as

L = -(1/2B) Σ_{i=1}^{B} [ log( exp(sim(aug(x_i), t_i)/τ) / Σ_{j=1}^{B} exp(sim(aug(x_i), t_j)/τ) ) + log( exp(sim(aug(x_i), t_i)/τ) / Σ_{j=1}^{B} exp(sim(aug(x_j), t_i)/τ) ) ]    (2)

where B is the batch size, τ is a learnable temperature that scales the logits, sim(•) and aug(•) stand for cosine similarity and the augmentation operation, respectively, and x_i and t_i denote the image and text embeddings of the i-th pair.
In the original proposal, only images are augmented, as indicated in Equation 2, which might limit the text guidance to the image encoder. Fan et al. (2023) propose to use an LLM to augment the texts in addition to the image augmentation, replacing each t_i in Equation 2 with aug(t_i), as shown in Equation 3. However, this text-generation process does not consider the image content.
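For concreteness, a NumPy sketch of the symmetric contrastive loss might look like the following (the augmentation operators are omitted, since they act on raw inputs before encoding; this is an illustrative implementation, not the released training code):

```python
import numpy as np

def clip_contrastive_loss(img_embs, txt_embs, tau=0.07):
    """Symmetric InfoNCE loss over a batch of aligned image/text embeddings."""
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    txt = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    logits = img @ txt.T / tau  # B x B cosine similarities scaled by temperature
    B = logits.shape[0]

    def log_softmax(z, axis):
        # numerically stable log-softmax
        z = z - z.max(axis=axis, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

    idx = np.arange(B)
    i2t = -log_softmax(logits, axis=1)[idx, idx]  # image-to-text direction
    t2i = -log_softmax(logits, axis=0)[idx, idx]  # text-to-image direction
    return float((i2t + t2i).mean() / 2)

# perfectly aligned pairs yield a much lower loss than mismatched ones
aligned = clip_contrastive_loss(np.eye(4), np.eye(4))
shuffled = clip_contrastive_loss(np.eye(4), np.eye(4)[::-1])
```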
We propose to use BLIP2 (https://huggingface.co/Salesforce/blip2-opt-2.7b) to generate new captions conditioned on the images from CC3M. In contrast to Nguyen et al. (2023), and drawing inspiration from LaCLIP (Fan et al., 2023), we generate multiple captions for each image in the dataset by passing different prefixes to BLIP2. Our approach addresses the limitation of LaCLIP and has the advantage of generating multiple captions per image, a limitation of Nguyen et al. (2023). Still, as BLIP2 is a monolingual model, we generate the captions in English and then translate them into Portuguese using Google Translate. Therefore, our text augmentation comprises generating English captions with BLIP2 and translating them into Portuguese.
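The multi-prefix generation step can be sketched generically as below; `captioner` stands in for a BLIP2 wrapper and `translate` for a translation service, both hypothetical interfaces introduced only for illustration:

```python
def generate_multiple_captions(image, captioner, prefixes, translate=None):
    """Produce one synthetic caption per prefix, optionally translating each.

    `captioner(image, prefix)` is any image-conditioned text generator
    (e.g., a BLIP2 wrapper); `translate` maps English to the target language.
    """
    captions = [captioner(image, prefix) for prefix in prefixes]
    if translate is not None:
        captions = [translate(c) for c in captions]
    return captions

# stub captioner for illustration; a real pipeline would call BLIP2
stub = lambda image, prefix: f"{prefix} a capybara near {image}"
caps = generate_multiple_captions("a river", stub,
                                  prefixes=["a photo of", "a picture of"])
```

With ten prefixes, this yields the ten synthetic captions per image used in the experiments.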
During training, for each image, we randomly sample one caption from among the original and generated ones to fine-tune the text encoder. Hence, at each epoch, a different text can be selected for each image. For evaluation, we translate the annotations from Flickr30k and MS COCO, and the labels from ImageNet and ELEVATER.
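The per-epoch caption sampling can be sketched as follows (the function name is illustrative):

```python
import random

def sample_caption(original_caption, synthetic_captions, rng=random):
    """Pick one caption uniformly among the original and synthetic ones."""
    pool = [original_caption] + list(synthetic_captions)
    return rng.choice(pool)

# over many epochs, every caption in the pool gets a chance to be selected
rng = random.Random(0)
pool_hits = {sample_caption("orig", ["s1", "s2"], rng) for _ in range(100)}
```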

Training
This work takes place within the context of limited computational resources, so we apply several techniques to reduce the cost of fine-tuning OPENCLIP. First, we use Gradient Checkpointing (Chen et al., 2016), which reduces the memory usage to O(√n) when training n layers. This method discards the layers' activations after the forward pass and recomputes them during the backward pass when needed. Using this technique, we achieve a considerable reduction in GPU memory usage.
Another method contributing to memory reduction is LiT (Zhai et al., 2022), which only trains the text encoder while keeping the image encoder frozen. The motivation for training only the text encoder is that the image encoder has already undergone extensive pre-training and can produce good representations for images. Hence, we train the text encoder with captions in Portuguese so that this model learns to align the text embeddings to fixed image features, producing a multimodal embedding space. This strategy speeds up training and reduces memory since the image encoder does not compute gradients.
Finally, we also apply LoRA (Hu et al., 2021) to reduce the number of trainable parameters, which lowers both the memory needed to train the models and the training time. LoRA re-parameterizes the dense layers as follows:

h = W_o x + (α/r) B A x,

where W_o ∈ R^{d_1×d_2} is the frozen pre-trained weight matrix, h is the result of the re-parameterization, A ∈ R^{r×d_2} and B ∈ R^{d_1×r} are decomposition matrices, r < min(d_1, d_2) is the low-dimensional rank of the decomposition, and α is a scaling hyperparameter. Similar to Hu et al. (2021), we apply LoRA to the query (Q) and value (V) self-attention modules of the text encoder.
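A NumPy sketch of the LoRA forward pass under this re-parameterization, using the standard zero initialization of B so that training starts from the frozen model's behavior (an illustrative implementation, not the released code):

```python
import numpy as np

def lora_forward(x, W0, A, B, alpha, r):
    """h = W0 x + (alpha / r) * B A x, with W0 frozen and only A, B trained."""
    return W0 @ x + (alpha / r) * (B @ (A @ x))

d1, d2, r = 4, 3, 2
rng = np.random.default_rng(0)
W0 = rng.normal(size=(d1, d2))   # frozen pre-trained weights
A = rng.normal(size=(r, d2))     # trainable down-projection
B = np.zeros((d1, r))            # trainable up-projection, zero-initialized
x = rng.normal(size=(d2,))
h = lora_forward(x, W0, A, B, alpha=1.0, r=r)
```

With B at zero, the LoRA term vanishes and h equals W0 x; gradients then flow only through A and B during fine-tuning.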
The original OPENCLIP consists of 366M parameters. Applying the LiT strategy reduces this number to 88M trainable parameters (24% of the total). Further integration of LoRA reduces the trainable parameters to only 0.1% (300k). We report all the training hyperparameters in Appendix A.1.
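As a sanity check on the parameter budgets above, the fractions can be computed directly from the counts given in the text; the rounding matches the reported 24% and 0.1%:

```python
total = 366_000_000           # full OPENCLIP (image + text encoders)
lit_trainable = 88_000_000    # LiT: only the text encoder is trained
lora_trainable = 300_000      # LiT + LoRA on the Q and V projections

lit_fraction = lit_trainable / total    # roughly 24% of all parameters
lora_fraction = lora_trainable / total  # roughly 0.1% of all parameters
```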

Evaluation
To evaluate the proposed framework's generalization capacity, we follow the typical procedure of evaluating pre-trained models (Radford et al., 2021; Yang et al., 2022a; Ko and Gu, 2022) on zero-shot cross-modal retrieval (text-to-image and image-to-text retrieval) and zero-shot image classification.
Zero-shot Cross-modal Retrieval: We evaluate our methods on three cross-modal retrieval datasets: PraCegoVer, MS COCO, and Flickr30k.
PraCegoVer is a multimodal dataset with native Portuguese captions based on Instagram posts. We build upon the conventional MS COCO and Flickr30k datasets, using Google Translate to translate all captions into Portuguese. To assess cross-modal retrieval performance, we adopt the recall@K evaluation metric, where K = {1, 5, 10}, and the mean recall, representing the average value across the recall@K instances.
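The recall@K protocol can be sketched as below; `recall_at_k` is an illustrative helper (not from the paper's released code) that assumes candidate i is the ground-truth match for query i:

```python
import numpy as np

def recall_at_k(similarity, ks=(1, 5, 10)):
    """similarity[i, j]: score between query i and candidate j.

    Returns recall@K for each K plus the mean recall across the Ks."""
    n = similarity.shape[0]
    # rank candidates per query, best score first
    ranking = np.argsort(-similarity, axis=1)
    # position of the correct candidate in each query's ranking
    positions = np.array([np.where(ranking[i] == i)[0][0] for i in range(n)])
    recalls = {k: float(np.mean(positions < k)) for k in ks}
    recalls["mean"] = float(np.mean(list(recalls.values())))
    return recalls

# a perfect retriever scores 1.0 everywhere
perfect = recall_at_k(np.eye(12))
# here query 0's match is ranked second, so recall@1 drops to 0.5
partial = recall_at_k(np.array([[0.2, 0.9], [0.1, 0.8]]), ks=(1,))
```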
Zero-shot Image Classification: We evaluate our pre-trained models on ImageNet-1k (Deng et al., 2009) and on the ELEVATER image classification toolkit (Li et al., 2022a). The latter contains 20 datasets designed for image classification tasks across various domains and an easy-to-use toolkit to evaluate pre-trained language-augmented visual models. To accommodate evaluation in Portuguese, we manually translated the labels for each dataset, as well as the templates, following the methodology outlined in Radford et al. (2021). For ImageNet-1k, we report the top-1 accuracy metric. Appendix A.2 provides the specific metrics for each dataset in the ELEVATER benchmark.

Experiments and Results
This section presents a comprehensive set of experiments designed to investigate the effects of dataset filtering and the specific influence of each module within our framework, CAPIVARA. To reduce the effects of randomness, we ran each experiment setup three times. We focus on zero-shot tasks involving images and Portuguese texts. Since no CLIP model is publicly available for Portuguese, we adopt as baseline the pre-trained multilingual model OPENCLIP due to its state-of-the-art performance in many tasks with Portuguese captions.
Dataset Filtering & CAPIVARA. We investigate two data-centric approaches: filtering the training set to select promising samples and remove noise, and enhancing annotations with our proposed framework. Using CAPIVARA, for each image in CC3M, we add 10 synthetic captions generated with BLIP2, besides the original caption. We comprehensively analyze the impact of the dataset filtering presented in Sec. 3.3 and the effectiveness of CAPIVARA in cross-modal retrieval tasks on the Flickr30k, MS COCO (with Portuguese-translated captions), and PraCegoVer datasets.
Table 1 shows the results of the text-to-image (txt2img) and image-to-text (img2txt) retrieval tasks conducted on OPENCLIP. These results encompass models fine-tuned and trained using the CAPIVARA framework on the original CC3M dataset and its filtered version, CC3M-Filtered. In Table 1, the columns "Synth." and "Trans." indicate which settings include synthetic captions and whether or not the captions are translated.
Employing CC3M with translated captions (fourth row in Table 1) for fine-tuning increases the mean recall score by roughly 2 percentage points (pp.) in text-to-image and image-to-text retrieval on Flickr30k and MS COCO, compared to the baseline, OPENCLIP. However, for the PraCegoVer dataset, a decline of 1.6 pp. in text-to-image retrieval and a more significant drop of 9.3 pp. in image-to-text retrieval are observed. Comparing the fine-tuning using CC3M and CC3M-Filtered, one can note an average enhancement of 0.9 pp. in mean recall for text-to-image retrieval and a 0.4 pp. improvement for image-to-text retrieval across all three datasets.
In addition, as an intermediate step in our architecture, we employ synthetic captions to mitigate noise in the training data. To illustrate the performance gains, we compare the results of only translating the training set and of translating and generating synthetic captions (CAPIVARA), fourth and sixth rows in Table 1, respectively. For the Flickr30k dataset, we observe a 1.1 pp. improvement in text-to-image retrieval with synthetic captions, with no significant difference in image-to-text retrieval. On the MS COCO dataset, we note a 1.5 pp. increase in text-to-image retrieval and a 1.2 pp. gain in image-to-text retrieval. Additionally, when evaluating the PraCegoVer dataset under the same conditions, we find a 2.6 pp. improvement in text-to-image retrieval and a 4.7 pp. gain in image-to-text retrieval. Thus, in most cases, using synthetic data as a means of data augmentation and noise reduction yields a positive impact. Details about the impact of the number of synthetic captions on performance are shown in Table A6.

The most significant performance gains over the baseline are achieved using CAPIVARA. For instance, the model trained on CC3M with CAPIVARA (sixth row) yields a 3.5 pp. improvement in text-to-image retrieval for Flickr30k and MS COCO and a 1 pp. enhancement on PraCegoVer. Notably, in image-to-text retrieval, CAPIVARA (CC3M) increases 2 pp. on Flickr30k and has a remarkable 4.7 pp. gain on MS COCO over the baseline. Also, models trained on CC3M and CC3M-Filtered with CAPIVARA demonstrate similar performance levels. These experiments demonstrate the effectiveness of our proposal, CAPIVARA, in enhancing multilingual CLIP performance in Portuguese.
Caption Translation. We also investigate the impact of automatic caption translation on the final model performance for Portuguese texts. We conducted experiments training the model on datasets containing only English annotations (i.e., CC3M + no-translation and CC3M + no-translation + synthetic captions) and their counterparts translated into Portuguese using Google Translate (i.e., CC3M + translation and CC3M + translation + synthetic captions). The evaluation comprises the Flickr30k, MS COCO, and PraCegoVer datasets with only Portuguese captions; in particular, the images in PraCegoVer are originally annotated in Portuguese. We present the results in Table 1.
One can note a substantial improvement when translating annotations within the training dataset. Specifically, models trained on datasets containing Portuguese annotations exhibit an average increase of 2.5 pp. in text-to-image mean recall compared to their English-trained counterparts. Similarly, employing Portuguese-translated captions leads to a mean recall improvement of 1.6 pp. for image-to-text retrieval on both the Flickr30k and MS COCO datasets. Fine-tuning with the original CC3M (i.e., CC3M + no-translation) hampers text-to-image performance across all three datasets and drops the image-to-text mean recall on PraCegoVer by a notable 7 pp. By training the model on translated synthetic captions, CAPIVARA consistently outperformed all the other settings. Our method increases the average performance by 3.2 pp. compared to fine-tuning on the original CC3M dataset. This experiment highlights the importance of including the automatic translation of captions into the target language, Portuguese, in our training pipeline.
Training Pipeline Optimization. This work is inserted in a context of restricted computational resources, in which only a single RTX Quadro 8000 GPU is available. We therefore propose a method to optimize our training pipeline, detailed in Sec. 3.5, combining the LiT, Gradient Checkpointing (G. Checkpt), and LoRA techniques. In this section, we investigate the impact of this optimization in terms of model performance and cost.

Table 1: Impact analysis of synthetic captions (Synth.) and translation (Trans.) on our framework. This table compares the performance of CLIP fine-tuning on English and Portuguese-translated texts, both with and without the addition of synthetic captions. It shows the experimental results in cross-modal retrieval on Flickr30k and MS COCO with captions translated into Portuguese, and PraCegoVer. We report the average and standard deviation of mean recall for text-to-image (txt2img) and image-to-text (img2txt) retrieval tasks. Our CAPIVARA achieves the best performance across datasets, highlighting its efficacy in enhancing pre-trained multilingual CLIP.

The LiT + G. Checkpt + LoRA setting achieves performance similar to the one that fine-tunes the entire text encoder on Flickr30k, but it decreases the average performance on MS COCO by 1.2 pp.
In addition, the model trained with our optimization technique LiT + G. Checkpt + LoRA + 1500 steps + BS=1000 presented a decline of 0.2 pp. compared to LiT + G. Checkpt + LoRA. Using our optimization method can remarkably reduce the GPU memory (from 38 GB to 8.5 GB) and training time (from 31h to 2h), yet outperform the baseline by 2.5 pp. across the tasks. Our training pipeline requires very modest computational resources compared to the literature, as shown in Table 3. These experiments demonstrate that our optimization method can effectively reduce the cost of fine-tuning CLIP, allowing researchers with restricted computing resources to conduct experiments.
Low-resource Languages. To demonstrate the effectiveness of CAPIVARA in improving pre-trained multilingual CLIP performance on low-resource languages, we expand our investigation to include Xhosa and Hindi. Figure 1 compares the performance between the pre-trained OPENCLIP (baseline) and the models trained with the whole CAPIVARA optimized pipeline, which refers to the setting LiT + G. Checkpt + LoRA + 1500 steps + BS=1000, named CAPIVARA + Opt., for text-to-image and image-to-text retrieval on Flickr30k and MS COCO. This experiment employs our optimized training pipeline (Sec. 4), training models for 2 hours on a single GPU Quadro RTX 8000 with a memory usage of 8.5 GB.
The baseline presents the weakest performance in Xhosa across all tasks, with mean recall close to zero on MS COCO and of 3 and 10 in text-to-image and image-to-text retrieval on Flickr30k, respectively. CAPIVARA increases the average performance in this language by 6.5 pp. on Flickr30k and MS COCO. The most significant improvement is noted in Hindi: a remarkable increase of 15 pp. on MS COCO and 21 pp. on Flickr30k is obtained with CAPIVARA. This experiment shows that CAPIVARA effectively boosts the pre-trained multilingual CLIP's performance in other low-resource languages at a low computational cost.
Image Classification. In addition to zero-shot cross-modal retrieval tasks, we also evaluate our models on zero-shot image classification across 21 datasets. The results are presented in Table 4. In the context of ELEVATER, training CLIP with CAPIVARA yielded an average improvement of 0.6 pp. over our baseline. We plot the bar chart in Figure A1 to thoroughly analyze the performance gap between the baseline and the model trained with CAPIVARA for each dataset within ELEVATER. Our method consistently surpassed the baseline across most datasets, yielding substantial accuracy improvements of 5.53 pp., 5.15 pp., and 3.07 pp. for KITTI-Distance, MNIST, and GTSRB, respectively. Regarding ImageNet-1k, CAPIVARA exhibited a performance gain of 0.2 pp. compared to the baseline. In addition, the performance of the model trained with CAPIVARA + Opt. is close to our baseline. Hence, LoRA-tuning for 1500 steps maintains the average performance on zero-shot image classification, whereas it considerably improves the performance on zero-shot cross-modal retrieval.
Carbon Footprint. Despite the remarkable achievements of large language models, their deployment requires substantial computational power, resulting in significant energy usage. For instance, models such as GPT-3 and BLOOM consumed approximately 1,287 MWh and 433 MWh, respectively, in their training, corresponding to 502 tonnes and 25 tonnes of CO2 emissions (Maslej et al., 2023). The BLOOM model's carbon footprint alone surpasses an average American's annual carbon emissions by 1.4 times.

To compare energy consumption between our model and larger language models, we employed the codecarbon tool (Courty et al., 2023). The results are shown in Table 5. As other CLIP-like models do not provide energy and carbon expenditure data, we present a comparison with other large language models for which such data is available in the literature (Maslej et al., 2023). For the baseline model, the energy usage amounted to 6.4 kWh, resulting in 0.5 kg of CO2-equivalent emissions. Applying LoRA and reducing the number of training steps decreased energy consumption to 5.6 kWh and 1.8 kWh, respectively, resulting in 0.4 kg and 0.1 kg of CO2-equivalent emissions. These calculations are based on Brazil's energy mix, where hydropower is the primary energy source. They do not include the carbon footprint of the initial pre-training performed by OPENCLIP, but only the training with CAPIVARA. We aim to advance sustainable AI systems development by employing these techniques and optimizing training times.

Conclusion
This work demonstrates the potential challenges of fine-tuning multilingual CLIP models for low-resource languages due to noisy annotations. To address this issue, we introduce CAPIVARA, a cost-effective framework that leverages image captioning models to enhance dataset annotations. We conducted extensive experiments involving dataset filtering, re-annotation, and automatic translation. CAPIVARA effectively boosts OPENCLIP performance for Portuguese texts, achieving state-of-the-art results in many zero-shot tasks. Our findings show the importance of dataset re-annotation and automatic translation.
We also propose optimizing our training pipeline using LiT, LoRA, and gradient checkpointing. Our results show a substantial improvement in Portuguese performance by fine-tuning the pre-trained OPENCLIP on a single GPU for 2 hours with only 8.5 GB of memory, considerably modest compared to the literature. Moreover, we demonstrate that our framework is readily extensible to other low-resource languages.
A direction for future research involves investigating the scalability of the proposed approach in terms of dataset and model size, building upon its success with base models. We also plan to explore different image captioning models and text decoding methods. Due to the cost of generating synthetic captions and translating them to Portuguese, there is interest in automating the process, possibly by improving BLIP2's performance in Portuguese. Besides, given the success of LoRA, other parameter-efficient fine-tuning methods can be explored. Lastly, an interesting research question remains open: "how many examples annotated in a low-resource language are necessary to achieve a performance comparable to English?".

Limitations
Model. Unlike other studies that compare models with varying architectures and sizes (Radford et al., 2021; Yang et al., 2022a; Li et al., 2022c; Mu et al., 2022), our research focuses on specific choices: ViT-B/32 as the image encoder and XLM-RoBERTa Base as the text encoder. Future work will explore different model sizes within our budget and consider alternative fine-tuning approaches, such as Parameter-Efficient Fine-Tuning (PEFT) (Liao et al., 2023).
Data. Recent efforts to adapt CLIP for specific languages (Ko and Gu, 2022; Yang et al., 2022a; Bianchi et al., 2021) have typically used datasets much larger than in our study. Investigating scalability with respect to training dataset size could reveal the optimal trade-off between cost and performance.
Generating captions in languages such as Portuguese involves two steps, caption generation and machine translation, due to the lack of robust non-English image captioning models. Hence, future research could focus on fine-tuning image captioning models for target languages to streamline the process and improve accuracy. Our study used the BLIP2 model for caption generation, but exploring alternative models could enhance results.
An additional limitation is the prevalent use of machine-translated data in multilingual datasets (Carlsson et al., 2022; Jain et al., 2021; Yang et al., 2022a; Bianchi et al., 2021). However, these datasets may not effectively capture unique expressions, cultural nuances, and proper nouns, leading to bias over-amplification, where biases from the source text become exaggerated in the translated output (Hovy and Prabhumoye, 2021; Prabhumoye et al., 2021; Hovy et al., 2020).

Ethics Statement
CAPIVARA is a cost-efficient framework designed to enhance the performance of multilingual CLIP models in low-resource languages. For this purpose, CAPIVARA augments text data using image captioning and machine translation to generate multiple synthetic captions in low-resource languages, and the training pipeline is optimized with LiT, LoRA, and gradient checkpointing to alleviate the computational cost. Intended for general tasks, the model learns to represent texts and images in a joint space. It can be employed in text-to-image retrieval, image-to-text retrieval, and image classification tasks. The developed model is particularly intended for scientific researchers.
Based on known problems with image and language models, the model may present lower performance for under-represented and minority groups (Bender et al., 2021). To adapt the model to low-resource languages, we use texts translated from English; thus, the model does not represent the cultural and local aspects of the countries that speak these target languages. This can lead to linguistic biases and a lack of representativeness for the target groups.
The datasets used comprise texts from the internet and carry biases; thus, the model may perform differently on data collected from other sources. Also, the datasets may contain data with cultural, political, or religious positioning.
Furthermore, CAPIVARA does not generate any type of data that could pose a risk to human life.However, our model can be adapted for other specific tasks, e.g., image or text generation, which could contribute to generating false information and harming people.CAPIVARA is a framework that aims to improve performance for low-resource languages.However, our results show that despite the significant improvements achieved with CAPI-VARA, there is still a considerable gap between the model performance with English texts and texts in low-resource languages.Further research is needed to improve performance across languages and incorporate cultural and linguistic elements into the model.
Since language models require large computational, environmental, and financial resources (Bender et al., 2021), CAPIVARA optimizes its training pipeline, resulting in a smaller carbon footprint than traditional fine-tuning. More details about ethical considerations can be found in the Model Cards (Appendix A.6).

Author Contributions

E.C. advised G.O.S. throughout all tasks. H.P. advised the team on all tasks and contributed to the writing process. S.A. served as the principal advisor of the team, providing guidance on all tasks and contributing to the writing process. All authors reviewed the manuscript and provided critical feedback to enhance its quality.

A.1 Hyperparameters
To facilitate the reproducibility of this work, we present Tables A1 and A2. These tables contain the hyperparameters used for the best models evaluated in the different experiments. Table A1 contains only the hyperparameters used in the fine-tuning of the OPENCLIP model for Portuguese. Table A2 considers the hyperparameters for the LoRA-tuning of the models with optimizations and 1500 steps, in Portuguese, Hindi, and Xhosa.

A.2 Results on ELEVATER and ImageNet-1k
In our supplementary experiments on the ELEVATER and ImageNet-1k benchmarks, summarized in Table A3, we consistently observe that our approach outperforms the baseline model across various setups, with the exception of CAPIVARA + Opt. This suggests that more training steps might be necessary to fully leverage LoRA's potential in fine-tuning. Furthermore, Table A3 reveals the effect of caption generation and filtering on the efficacy of our method. By analyzing the scenarios with synthetic captions, one can note that training with multiple captions per image outperforms plain OPENCLIP + Fine-tuning, both with and without filtering. Notably, the optimal configuration involves training with CAPIVARA on CC3M-Filtered, resulting in a performance boost of 0.6 pp. over the baseline. Still, similar to the cross-modal retrieval results in Sec. A.3.1, we do not observe a significant performance gain by augmenting the number of generated captions. Table A4 provides the specific metrics for each dataset in the ELEVATER benchmark.
Figure A1 presents the difference in performance between fine-tuning with CAPIVARA and the baseline, OPENCLIP. The majority of datasets exhibit positive differences, indicating a favorable improvement over the baseline with CAPIVARA. Notably, the model trained with CAPIVARA achieved substantial improvements of 5.53 and 5.15 pp. on two datasets, KITTI-Distance and MNIST, respectively. However, it is important to acknowledge instances where the performance of our model under this configuration falls short. Noteworthy cases include the Oxford-IIIT Pets dataset, encompassing 37 distinct breeds of cats and dogs, where the fine-tuned model amplifies confusion between the cat breeds British Shorthair and Russian Blue, as well as the dog breeds Leonberger and Newfoundland, leading to reduced overall correctness.

A.3.1 Impact of Multiple Captions & Generated Caption Selection
To further validate the contributions of synthetic captions, we analyze the influence of multiple captions per image and how to select proper captions for each image. This latter aspect is related to BLIP2's hallucination, i.e., the model generating text that does not match the associated image (Xu et al., 2023). The use of these synthetic annotations can therefore introduce noise into the dataset. To address this issue, we implement caption selection strategies:

Threshold-based: We select texts among the original and generated captions based on their similarity to the associated image. A caption is selected if the similarity between it and the image is greater than or equal to a given threshold; in this case, the threshold is 0.15.
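Once the image-text similarities are available, this threshold filter reduces to a one-line comparison. The sketch below is a minimal, hypothetical implementation; the similarity scores would come from a CLIP-style encoder, which is not shown here.

```python
import numpy as np

def threshold_filter(captions, similarities, thr=0.15):
    """Keep captions whose image-text similarity meets the threshold.

    captions:     list of candidate captions (original + generated)
    similarities: one image-text similarity score per caption,
                  e.g., cosine similarity from a CLIP-style model
    thr:          minimum similarity required (0.15 in the paper)
    """
    similarities = np.asarray(similarities)
    return [c for c, s in zip(captions, similarities) if s >= thr]
```

For instance, with scores [0.31, 0.05, 0.22] and the default threshold, the second caption would be discarded.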

Threshold-based + near-duplication removal:
We first apply the threshold-based filter, and then we remove near-duplicate captions using Algorithm 1, keeping a minimum of k_min = 3 captions per image. Algorithm 1 first computes the text similarity matrix. Then, it computes the cost of removing a text t_i as c(t_i) = Σ_{j=1, j≠i}^{B} sim(t_i, t_j). At each step, it removes the text with the highest cost and updates the cost array. The algorithm stops when all similarity scores are lower than a given threshold or the minimum number of captions is reached. In this way, the algorithm keeps the maximum diversity among the texts.

# captions: image captions
# k_min: minimum number of texts to keep
# thr: maximum similarity allowed between texts

# Remove similar texts, keeping the maximum
# diversity among them
def remove_similar(captions, k_min=3, thr=0.3):
    if len(captions) < k_min:
        return captions
    sim_matrix = text_similarity(captions)
    n_texts = sim_matrix.shape[0]
    # set the cost in the diagonal to zero
    sim_matrix -= np.eye(n_texts)
    while not (sim_matrix <= thr).all() and n_texts > k_min:
        # compute the cost to remove each text as the sum of the
        # similarity between that text and all others
        cost = sim_matrix.sum(axis=0)
        # remove the text with the highest cost
        i = np.argmax(cost)
        # set the cost of the removed text to zero
        sim_matrix[i, :] = 0
        sim_matrix[:, i] = 0
        n_texts -= 1
    # compute the final cost for all texts
    cost = sim_matrix.sum(axis=0)
    # all texts whose cost is zero will be removed
    remove_indices = np.where(cost == 0)[0]
    # return the filtered texts
    return [caption for i, caption in enumerate(captions)
            if i not in remove_indices]

Algorithm 1: Python-like pseudocode of the near-duplicate text removal algorithm.
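Algorithm 1 can be run end-to-end by plugging in any text-similarity function. The self-contained version below substitutes a toy Jaccard word-overlap measure for `text_similarity` (the paper's actual similarity is presumably embedding-based; the stand-in is only for illustration):

```python
import numpy as np

def text_similarity(captions):
    """Toy stand-in for the paper's text-similarity function:
    Jaccard overlap between word sets (1.0 on the diagonal)."""
    sets = [set(c.lower().split()) for c in captions]
    n = len(sets)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            sim[i, j] = len(sets[i] & sets[j]) / len(sets[i] | sets[j])
    return sim

def remove_similar(captions, k_min=3, thr=0.3):
    """Algorithm 1: greedily drop the caption most similar to all others
    until every pairwise similarity is <= thr or k_min captions remain."""
    if len(captions) < k_min:
        return captions
    sim_matrix = text_similarity(captions)
    n_texts = sim_matrix.shape[0]
    sim_matrix -= np.eye(n_texts)      # zero the diagonal (self-similarity)
    while not (sim_matrix <= thr).all() and n_texts > k_min:
        cost = sim_matrix.sum(axis=0)  # redundancy cost of each text
        i = np.argmax(cost)            # most redundant text
        sim_matrix[i, :] = 0           # mark it as removed
        sim_matrix[:, i] = 0
        n_texts -= 1
    cost = sim_matrix.sum(axis=0)
    removed = set(np.where(cost == 0)[0])  # zero final cost = removed
    return [c for i, c in enumerate(captions) if i not in removed]
```

Running it on four captions where three are near-duplicates drops the one most similar to the rest, keeping the set as diverse as possible.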
From a thorough analysis of the results in Table A5, we note that none of the caption selection strategies significantly impacted model performance. All strategies performed similarly to CAPIVARA with no caption selection. Specifically, the threshold-based caption selection strategy performed slightly better than the others, but still on par with CAPIVARA. This result suggests that BLIP2 is effective in generating captions related to the images and, because of this, the caption selection methods did not affect the final performance. Nevertheless, Figure A8 and the results in Table A6 reveal that BLIP2 produces only slightly different texts. Therefore, generating multiple captions per image has a limited effect on text augmentation; adding 10 captions yielded only a slight improvement.

To determine the optimal batch size for our method, we conducted experiments fixing the number of steps at 5,863 and varying the batch size within our GPU memory limitation. We experimented with three batch sizes: 1000, 2816, and 4300. Each setting was tested with traditional fine-tuning and with CAPIVARA; the results are presented in Table A7. Overall, we do not observe a significant gain from increasing the batch size. Intriguingly, in the context of CAPIVARA, performance slightly improves across the datasets as we increase the batch size from 1000 to 2816, but it declines with a batch size of 4300. For this reason, the CAPIVARA models were trained with an average batch size of 2816, while the optimized CAPIVARA models were trained with a batch size of 1000. This study shows that using smaller batches to train the optimized models does not result in a significant loss, while saving memory and training time.

A.4 Qualitative Analysis
We conducted experiments on Flickr30k for a qualitative analysis of the model's ability in cross-modal retrieval tasks; the outcomes are presented in Figures A6 and A7. Figure A6 shows the result of the image-to-text retrieval task, where our model retrieves the five Portuguese texts most similar to a given image. For the first example, all the retrieved texts correctly describe the image content, which consists of a group of women running in a race. However, in the second example, none of the retrieved texts matches the input image, illustrating the limitations of our model. Similarly, we qualitatively analyze our model in text-to-image retrieval. In Figure A7, we present four example texts and the top-5 images most similar to each of them. Overall, the model ranks the correct images at the top. Regarding the other retrieved images, although the scene representations match the texts, the model overlooks details mentioned in the text, such as the number of people, objects, and colors. This can happen either because the dataset contains no image with all the elements from the text, so the model retrieves the most similar images available, or because of model limitations. Finally, the last example shows an instance where the model fails: given the text "Woman and man walking across wooden rope bridge with a caution sign beside it.", the model does not rank the expected image among the top-5 most similar.
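Mechanically, both retrieval directions reduce to ranking candidates by cosine similarity in the shared embedding space. A minimal sketch, with placeholder embeddings standing in for the image and text encoders:

```python
import numpy as np

def top_k_retrieval(query_emb, candidate_embs, k=5):
    """Rank candidates by cosine similarity to the query embedding
    and return the indices of the top-k matches (best first)."""
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    scores = c @ q                 # cosine similarity of each candidate
    return np.argsort(-scores)[:k] # indices sorted by descending score
```

For image-to-text retrieval the query is an image embedding and the candidates are text embeddings; for text-to-image retrieval the roles are swapped.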

A.5 Synthetic Captions Generated by BLIP2
In the process of text augmentation, the BLIP2 model (Li et al., 2023) was used to generate new captions for the images. However, this model presents some issues regarding text generation: for example, it may generate text that does not match the image, or repeat words. Several strategies, best described in Sec. 3, were used to mitigate these problems in our work. Figure A8 shows three images from CC3M along with their original caption and 10 captions generated with BLIP2.
The first image represents an example where the generated captions are good and diverse: all captions correctly describe the image, there are no repeated words, and a wide variety of words is used to describe the scene. The captions generally describe the image and add new elements to the description, although they still contain repetitive structures. The second example presents a scenario of good captions but low textual diversity: the captions describe the image, but there is a high level of repetition in the sentence structures. The third example illustrates a case of badly generated captions and low textual diversity: the model not only shows a lot of word repetition, but also fails to represent the image, i.e., it hallucinates.

A.6 Model Cards
This section was created using the Model Cards for Model Reporting framework (Mitchell et al., 2019).

Model Details
• Developed by researchers from the Natural Language Processing Group of the Artificial Intelligence and Cognitive Architectures Hub (H.IAAC).
• CAPIVARA is a cost-efficient framework designed to enhance the performance of multilingual CLIP models in low-resource languages.
• CAPIVARA augments text data using image captioning and machine translation to generate multiple synthetic captions in low-resource languages. The training pipeline is optimized with LiT, LoRA, and gradient checkpointing to alleviate the computational cost.
• For further information or questions, please contact Sandra Avila avilas@unicamp.br.

Intended Use
• Intended to be used for general tasks focused on finding a representation in a common space for texts and images. Examples of such tasks are image-to-text retrieval, text-to-image retrieval, and image classification.
• Particularly intended for scientific researchers.
• Not intended for use involving aspects, positions, and cultural values of under-represented regions (e.g., Brazilian memes), due to the lack of representativeness in the datasets used for training. It also cannot be used with long texts (more than 77 tokens).

Factors
• Based on known problems with image and language models, potential relevant factors include under-represented and minority groups. To adapt the model to low-resource languages, texts were initially translated from English; thus, the model does not represent the cultural and geographical aspects of the countries that speak the target languages. The datasets used consist of texts collected from the Internet; therefore, the model may not perform as well on data collected from other sources and may carry biases from the original texts.

Metrics
• Evaluation metrics include Mean Recall, the average of recall@K for K = 1, 5, 10, for cross-modal retrieval, which is the main task of CAPIVARA, and top-1 accuracy for the image classification task on ImageNet-1k. Moreover, the ELEVATER benchmark was used for the image classification task; Appendix A.2 provides the specific metrics used (see Table A4).
• Each experiment was run three times, and the mean and standard deviation were reported for all experiments performed (see Section 4).
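The Mean Recall metric described above can be sketched as follows; this is a minimal, illustrative implementation assuming each query's results are given as a ranked list of item indices:

```python
def recall_at_k(ranked_lists, ground_truth, k):
    """Fraction of queries whose ground-truth item appears in the top-k results."""
    hits = sum(gt in ranks[:k] for ranks, gt in zip(ranked_lists, ground_truth))
    return hits / len(ground_truth)

def mean_recall(ranked_lists, ground_truth, ks=(1, 5, 10)):
    """Mean Recall as used here: the average of recall@K over K = 1, 5, 10."""
    return sum(recall_at_k(ranked_lists, ground_truth, k) for k in ks) / len(ks)
```

For example, if one query's correct item is ranked 2nd and another's is ranked 10th, recall@1 is 0.0, recall@5 is 0.5, recall@10 is 1.0, and Mean Recall is 0.5.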

Quantitative Analyses
• Quantitative Analyses can be seen in Figure 1 and Section 4.

Evaluation Data
• Evaluation data include the Flickr30k, MS COCO, and PraCegoVer datasets for the cross-modal retrieval task, and all 20 datasets from the ELEVATER benchmark plus ImageNet-1k for the image classification task (see Table A3).
• These datasets were chosen because they are the most widely used in the literature, except for PraCegoVer, a dataset with images and texts originally in Portuguese, used precisely to evaluate linguistic and cultural aspects present in the Portuguese language. (Note: data originally in English that has been translated into the target language will be made available with the model.)
• See Section 3.2 for more details about data preprocessing.
Training Data
• The training data was the CC3M dataset.
• This dataset was chosen because of the number of examples it provides and its higher data quality, also taking into account our limited computing infrastructure for training the model.
• See Section 3.2 for more details about data preprocessing.
• It is possible that the model was trained with data where group distributions are not homogeneous and, therefore, encoded some type of bias.

Ethical Considerations
• CAPIVARA does not deliberately use sensitive data in training. However, since it uses data collected from the Internet, consisting of images and annotations about the images' content, it is possible that data with political, religious, or cultural positioning have been used.
• CAPIVARA does not generate any type of data that could pose a risk to human life. However, our model can be adapted for other specific tasks, e.g., image or text generation, which could contribute to generating false information and harming people.
• The model's training data was translated via Google Translate from English into the target language.This can lead to linguistic biases and a lack of representativeness for the target groups.
• CAPIVARA adopts training-time optimizations, resulting in a smaller carbon footprint than traditional fine-tuning. It therefore presents a better financial and environmental alternative for improving the performance of pre-trained models.

Caveats and Recommendations
• Further work is needed to assess the impact of adding more samples from the target language, and how much this brings the performance of the target language closer to that of English, which currently performs best. See Section 5 for more future work.
• People and groups who do not have access to the Internet, and therefore do not produce digital content, are under-represented in the training set; CAPIVARA, however, is intended to be applied to languages with low digital resources. CAPIVARA offers a technique to improve performance for low-resource languages, but there is still a performance gap between English texts and texts in low-resource languages. Future studies are required to improve performance for different languages and to include cultural and linguistic aspects of the target language in the model.
• An ideal evaluation dataset would additionally include annotations made in the target language, which also considers cultural and linguistic aspects and has a background of minority and underrepresented groups.
• Current literature is constantly evaluating the ethical risks and impacts that vision and language models can have on society.Keeping up with this work is extremely important, as these studies can point to risks and negative impacts that have not yet been considered in this current version of Model Cards.
• Ideally, when using CAPIVARA as a base model for other applications, a study of the ethical impacts of the application should be carried out before it is implemented.
• It is highly recommended to read these Model Cards in conjunction with the article that introduces CAPIVARA, as the article contains detailed information on the entire life cycle of the proposed model.

Figure 1 :
Figure 1: Improving multilingual CLIP performance in low-resource languages: Xhosa, Hindi, and Portuguese. This figure illustrates CAPIVARA's effectiveness in enhancing the performance of pre-trained multilingual CLIP models, the OPENCLIP baseline (B), for low-resource languages. The percentage-point increase in mean recall for text-to-image (txt2img) and image-to-text (img2txt) retrieval with low-resource languages on the Flickr30k and MS COCO datasets is highlighted above the respective bars. CAPIVARA significantly improves the baseline performance with only 2 hours of training and 8.5 GB of GPU memory.

Figure 2 :
Figure 2: CAPIVARA overview. In our framework, the training dataset comprises images annotated with English captions. To enhance the annotations, we use an image captioning model to generate synthetic captions for the images. Then, both original and synthetic captions are translated from English to the target language, in our case, Portuguese. We freeze the image encoder and fine-tune the text encoder on the translated captions, aligning it with the visual representation by optimizing the InfoNCE loss. While it is possible to fine-tune the entire text encoder, such an approach is resource-intensive. Thus, we propose an optimization method based on LoRA-tuning that can significantly reduce the associated computational cost and speed up training.
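The InfoNCE objective mentioned in the overview can be sketched as follows. This is a minimal, illustrative numpy version of the symmetric contrastive loss; actual implementations typically use PyTorch with a learnable temperature.

```python
import numpy as np

def info_nce_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings:
    row i of each matrix is a matching pair (positive); every other
    combination within the batch serves as a negative."""
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    txt = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # scaled cosine similarities

    def xent_diagonal(l):
        # cross-entropy with the diagonal (matching pair) as the target
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # average the image-to-text and text-to-image directions
    return 0.5 * (xent_diagonal(logits) + xent_diagonal(logits.T))
```

Intuitively, the loss is small when each image embedding is closest to its own caption's embedding and far from every other caption in the batch.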
G.O.S., D.A.B.M., and A.I.F. collaborated on dataset translation, designing and implementing the proposed pipeline, analyzing the results, and writing the manuscript. G.O.S. also conducted experiments related to dataset filtering, re-annotation, translation, and low-resource languages. D.A.B.M. worked on constructing training datasets, focused on experiments to optimize the pipeline, and conducted a carbon footprint analysis. A.I.F. executed inferences for zero-shot image classification. In collaboration with G.O.S., D.A.B.M., and A.I.F., J.S. wrote the Ethics Statement and Model Cards sections. L.P. helped in the result analysis. P.B. contributed to dataset translation. T.S. helped in constructing training datasets. H.M. contributed to the discussion with the team and the result analysis. N.S. advised A.I.F. and T.S. throughout all tasks.
Figures A2 to A5 offer a deeper dive into these observations, presenting normalized confusion matrices that provide granular insights into the datasets where CAPIVARA underperformed the baseline. Specifically, Figures A2 and A3 unveil nuances in accurate and erroneous predictions within the FER-2013 dataset. Notably, the baseline model excels in recognizing neutral expressions, while the fine-tuned model performs well in identifying expressions of sadness. However, the fine-tuned model is also more likely to confound emotions such as sadness and neutral expressions. Figures A4 and A5 present normalized confusion matrices for the Oxford-IIIT Pets dataset, highlighting the fine-tuned model's tendency to amplify confusion between specific cat and dog breeds.

Figure A2 :
Figure A2: Normalized confusion matrix of the FER-2013 dataset for the OPENCLIP baseline model.

Figure A4 :
Figure A4: Normalized confusion matrix of the Oxford-IIIT Pets dataset for OPENCLIP baseline model.

Figure A5 :
Figure A5: Normalized confusion matrix of the Oxford-IIIT Pets dataset for the OPENCLIP + Fine-tuning model with 10 generated annotations.

Figure A6 (retrieved texts for the first example):
#1: Várias mulheres em trajes de corrida correm em grupo. (Several women in racing singlets run in a pack.)
#2: Atletas do Japão, Alemanha e China estão correndo lado a lado. (Athletes from Japan, Germany, and China are running side by side.)
#3: Um grupo de mulheres de várias origens étnicas está competindo em uma maratona. (A group of women from various ethnic backgrounds are competing in a marathon.)
#4: Três corredores competem em uma corrida. (Three runners compete in a race.)
#5: Três corredores passam correndo em uma competição de atletismo. (Three runners race past at a track meet.)

Figure A8 :
Figure A8: Examples of images with synthetic captions generated by BLIP2.

Table 2 :
Impact of optimization techniques. We evaluate models trained on CC3M with CAPIVARA combined with several optimization techniques. We report the experimental results in terms of mean recall for text-to-image (txt2img) and image-to-text (img2txt) retrieval, along with memory (M) and training time (T) costs. Our optimization method yields the best training time and computational cost while performing similarly to the other approaches.

Table 3 :
Summary of the models and resources invested in their training, considering the dataset size, the GPU/TPU used, and the required training time.

Table 5 :
Average costs per trained model in terms of energy consumption and equivalent CO2 emissions (CO2-eq), compared with the number of trainable parameters (# Param.). All the models were trained with a batch size (BS) of 2816 for 5863 steps, except for CAPIVARA + LoRA + 1500 steps / BS=1000.
household in the United States for up to 41 years.

Table A1 :
Hyperparameters used in the fine-tuning.

Table A3 :
Results on the ELEVATER benchmark. Ablations with and without LoRA.

Table A4 :
Details of the image classification datasets on the ELEVATER benchmark.
The Oxford-IIIT Pets dataset encompasses 37 breeds of cats and dogs, and the FER-2013 dataset features a range of human emotional expressions. On these datasets, our model presented a performance decline, with respective decrements of 2.19 and 1.97 pp. in comparison to the baseline (see Figures A2 to A5).

Table A5 :
Experimental results for caption selection strategies. In this table, "threshold-based + near-duplication removal", "threshold-based", and "rank-based" refer to caption selection methods, whereas CAPIVARA does not consider any caption selection strategy. For each setting, we report the average and standard deviation of mean recall.

Table A6 :
Impact of multiple captions. This table presents the results of models trained with different numbers of synthetic captions translated into Portuguese. We report the average and standard deviation of mean recall for each setting across the Flickr30k, MS COCO, and PraCegoVer datasets.

Table A7 :
Comparison between different batch sizes in fine-tuning and CAPIVARA settings.