LMCap: Few-shot Multilingual Image Captioning by Retrieval Augmented Language Model Prompting

Multilingual image captioning has recently been tackled by training with large-scale machine translated data, which is an expensive, noisy, and time-consuming process. Without requiring any multilingual caption data, we propose LMCap, an image-blind few-shot multilingual captioning model that works by prompting a language model with retrieved captions. Specifically, instead of following the standard encoder-decoder paradigm, given an image, LMCap first retrieves the captions of similar images using a multilingual CLIP encoder. These captions are then combined into a prompt for an XGLM decoder, in order to generate captions in the desired language. In other words, the generation model does not directly process the image, instead processing retrieved captions. Experiments on the XM3600 dataset of geographically diverse images show that our model is competitive with fully-supervised multilingual captioning models, without requiring any supervised training on any captioning data.


Introduction
The task of image captioning has witnessed impressive performance gains with the trend of large-scale encoder-decoder models and vision-and-language pre-training (Li et al., 2022; Wang et al., 2021; Hu et al., 2022; Wang et al., 2022). Despite all of this progress, existing models are mostly available in English or are specialised for other high-resource languages. This limits access to the technology for the broader range of languages that exist in the world. Moreover, the current mainstream trend results in design decisions and methods that may only work well for English-centric datasets or the couple of languages for which captioning data is available (Ruder, 2020). There is a need to develop multilingual image captioning models that can serve speakers of different languages.
Still, scaling captioning models to a wide variety of languages involves different challenges. One major limitation is the lack of multilingual image-caption pairs of clean labelled data for training the models. One possible solution is to automatically translate the existing English datasets (Thapliyal et al., 2022). While effective, this approach can result in models that learn translation artefacts, and perpetuates an English-centric perspective instead of encouraging the use of geographically diverse concepts that are not overly specific to Western culture (Liu et al., 2021). Moreover, with or without automatic translations, training captioning models with multilingual data can be expensive, given the amount of data and number of parameters needed to mitigate the curse of multilinguality (Conneau et al., 2019; Goyal et al., 2021).
This paper presents LMCAP, an image-blind multilingual image captioning model that does not require any training specific to image captioning. We propose an efficient method that reuses a pretrained multilingual language model and adapts it to the vision-and-language captioning setting. Our work is motivated by the recent "Socratic Models" framework (Zeng et al., 2022), in which different models can be combined through text prompting (e.g., image captioning can be achieved by prompting a language model with a set of visual concepts extracted from the predictions of a vision model). Different from the original Socratic Models, our approach is inspired by retrieval-augmented generation (Lewis et al., 2020; Izacard et al., 2022). Specifically, a multilingual language model generates captions given a prompt consisting of the captions retrieved from similar images, and a demonstration of how to produce a caption in the desired language. We note here that this is an image-blind approach, i.e., the language model producing the caption does not actually process the image.
Our main contributions are as follows: (1) We propose a few-shot multilingual image captioning approach named LMCAP, that re-uses pre-trained models without requiring any training specific to image captioning; (2) To the best of our knowledge, LMCAP is the first captioning model with retrieval-augmented generation in a multilingual setting, and in a few-shot setting of captioning; (3) We report on experiments with the XM3600 benchmark (Thapliyal et al., 2022) of human-authored captions and geographically diverse images, demonstrating that LMCAP exhibits strong few-shot performance on a wide variety of languages; (4) We further show that LMCAP performs substantially better than the original Socratic Models. Moreover, instead of only achieving competitive performance against other zero-shot models, LMCAP can also compete with a large-scale supervised state-of-the-art captioning model.

Background and Related Work
Image Captioning: The task of automatically generating textual descriptions for input images has been largely explored in English, while multilingual image captioning has only been addressed in a couple of studies (Gu et al., 2018; Thapliyal et al., 2022; Chen et al., 2022). As in most recent work on image captioning (Li et al., 2022; Wang et al., 2021, 2022), studies addressing multilingual setups have also focused on scaling the size of encoder-decoder models and the amount of training data, resorting to machine translated versions of multimodal data to accommodate multiple languages (Thapliyal et al., 2022). Instead of training a large-scale encoder-decoder model, we follow a few-shot setting with an image-blind approach based on prompting.
Few-Shot and Zero-Shot Approaches: Performing few-shot learning by prompting a language model with examples and demonstrations of a task (Brown et al., 2020; Radford et al., 2019; Schick and Schütze, 2020) is an efficient and effective alternative to updating model parameters. Similarly to other NLP tasks, recent work in the vision-and-language domain has used prompt-based learning by building on top of pre-trained language and vision models, although usually also involving extra multimodal training (Tsimpoukelli et al., 2021; Alayrac et al., 2022; Jin et al., 2021). In our work, we follow a few-shot prompting approach similar to the recent Socratic Models (Zeng et al., 2022), which do not involve any multimodal training, as described next. In image captioning, there have also been zero-shot methods that, similarly to our approach, do not involve any training, by relying on prompts or adaptations over the decoding algorithms, such as ZeroCap (Tewel et al., 2021) and ConZic (Zeng et al., 2023). However, these models work for English and not for the multilingual captioning setting.
Socratic Models: Zeng et al. (2022) proposed the Socratic Models (SMs) framework, where different multimodal pre-trained models communicate via zero-shot or few-shot prompting. For the task of image captioning, SMs generate captions by prompting a language model (i.e., GPT-3 (Brown et al., 2020)) with information about the input image obtained with another pre-trained model (i.e., CLIP (Radford et al., 2021)). The visual information is in this way represented in a language-based prompt, containing the number of people present in the image, the places, the objects, and the type of image. We explore a similar approach in the multilingual setting by reusing multilingual models, and through a retrieval-based prompt.

Retrieval-augmentation:
The knowledge from language models can be adapted and expanded by combining it with non-parametric knowledge from datastores (i.e., external memories) (Khandelwal et al., 2019; Lewis et al., 2020; Izacard et al., 2022; Ram et al., 2023). The success of conditioning generation with retrieved information, in several different NLP tasks, has inspired some recent studies in image captioning (Ramos et al., 2023a; Fei, 2021; Sarto et al., 2022; Ramos et al., 2023b). The study that is most closely related to our captioning model is SmallCap (Ramos et al., 2023b), an encoder-decoder model that is prompted with retrieved captions as well. However, in image captioning, retrieval-augmentation has mostly been explored with supervised learning and not few-shot learning. Moreover, retrieval-augmentation remains unexplored in the multilingual scenario.

Model
Language Model Prompt-based Captioning (LMCAP) is a few-shot multilingual captioning model augmented with retrieval. It involves prompting a Language Model (LM) with captions retrieved from a datastore by a Vision-and-Language Model (VLM). Captions are generated in an image-blind manner, without actually processing the visual contents of the input image, instead using a prompt containing the retrieved captions. The method works as follows: first, given an input image, the VLM is used to find relevant captions in the datastore. Second, the retrieved captions are converted to a language prompt, which is encoded by the multilingual LM to generate captions in a desired language, conditioning the generation on the prompt. Finally, the set of generated captions can be scored by the VLM against the input image, to select the best caption. The main aspects of our approach are shown in Figure 1 and fully detailed next.

Image-Text Retrieval:
The input image and a datastore of captions are encoded by a multilingual CLIP (M-CLIP) model (Carlsson et al., 2022), i.e., a VLM that can be used to calculate image-text similarity. Given the encoded data, M-CLIP is used to retrieve the K most similar captions from the datastore. The datastore contains captions associated with diverse images, which can be in English or another language. The retrieved captions serve to guide the language model as examples of what the predicted caption should resemble, through the use of a prompt, as described next.
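The retrieval step can be sketched as an exact inner-product search over caption embeddings. The snippet below is a minimal illustration in pure Python rather than the actual implementation: the function names, the toy 3-dimensional vectors, and the L2 normalisation of the embeddings are assumptions made for the example, and a real system would obtain the embeddings from M-CLIP.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def retrieve_top_k(image_emb, caption_embs, captions, k=4):
    """Return the k captions with the highest inner product to the image.

    Assumption: embeddings are L2-normalised, so the inner product
    equals the cosine similarity.
    """
    query = l2_normalize(image_emb)
    scored = []
    for caption, emb in zip(captions, caption_embs):
        score = sum(q * c for q, c in zip(query, l2_normalize(emb)))
        scored.append((score, caption))
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [caption for _, caption in scored[:k]]


# Toy datastore with hypothetical embeddings (not produced by M-CLIP).
captions = [
    "a horse grazing in a field",
    "a small bathroom with blue walls",
    "a brown horse next to a barn",
]
caption_embs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.8, 0.3, 0.1]]
image_emb = [1.0, 0.2, 0.0]

print(retrieve_top_k(image_emb, caption_embs, captions, k=2))
```

With these toy vectors, the two horse captions are ranked above the bathroom caption. A brute-force scan like this is only practical for small datastores; an indexing library performs the same kind of search at scale.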
Retrieval-augmented Prompting: The retrieved captions, which represent the visual information about the image, are formatted into a prompt for the language model. The prompt starts with fixed N-shot examples and ends with the retrieval information about the input image, to guide the language model. Each shot is a demonstration of how to generate a caption in a desired language for an image, given a set of retrieved captions. After these N examples, the prompt terminates with the retrieved information about the actual input image. An example of the format of the prompt can be seen in Figure 1 and in more detail in Appendix D. We note that the retrieved captions, either from the fixed N-shot examples or those corresponding to the input image, can be presented in any language or in multiple languages.
Prompting Multilingual Text Generation: The aforementioned prompt is used as input for XGLM (Lin et al., 2021), a pre-trained multilingual autoregressive LM, to generate captions in a given language. XGLM is applied in a few-shot setting, which means that LMCAP does not require any training (i.e., the captions are generated by providing the prompt at inference time to XGLM). Captions are generated in the desired language by including an example in the N demonstrations in the prompt, as shown in Figure 1.
Multilingual Reranking: After the LM generates a set of captions, the multilingual VLM performs a final image-text similarity step to find the caption that best describes the input image. This is based on the same M-CLIP model used for the initial image-text retrieval.
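The reranking step amounts to picking the candidate whose text embedding is most similar to the image embedding. The following is a minimal sketch under stated assumptions: `embed_text` is a hypothetical callable standing in for the M-CLIP text encoder, cosine similarity is assumed as the scoring function, and the embeddings are toy values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def rerank_candidates(image_emb, candidates, embed_text):
    """Return the candidate caption closest to the image embedding.

    embed_text is a hypothetical stand-in for the text encoder of the
    vision-and-language model.
    """
    return max(candidates, key=lambda cap: cosine(image_emb, embed_text(cap)))


# Toy text encoder backed by a lookup table (illustrative values only).
toy_embeddings = {
    "un caballo en un campo": [0.9, 0.1],
    "un baño azul": [0.1, 0.9],
    "un perro": [0.5, 0.5],
}
best = rerank_candidates([1.0, 0.0], list(toy_embeddings), toy_embeddings.get)
print(best)
```

Because the same encoder is used for retrieval and reranking in the approach described above, the candidate embeddings live in the same space as the image embedding, which is what makes a direct similarity comparison meaningful.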

Evaluation
In this section, we describe the evaluation of LMCAP. We describe the experimental setup and results, and we also present ablation studies and further discussions about our approach.

Experimental Setup
Model: LMCAP uses two pre-trained multilingual models, namely the autoregressive XGLM language model facebook/xglm-2.9B, and the multilingual M-CLIP vision-and-language model xlm-roberta-large-ViT-H-14, respectively available on HuggingFace (Wolf et al., 2020) and OpenCLIP. Our approach does not require any training, generating captions at inference time using a single NVIDIA V100S 32GB GPU.
To generate a caption in a desired language, XGLM is prompted with retrieved captions extracted by the M-CLIP model. For caption retrieval, the input image and a set of captions from a datastore are both encoded by M-CLIP to perform direct image-text search. The datastore contains English captions from the COCO training set and is indexed offline with the nearest-neighbour search library FAISS (Johnson et al., 2017), using the IndexFlatIP index, which does not involve training. A set of K = 4 retrieved captions is used in the prompt for the input image, along with a fixed set of N = 3 shot examples, as described in Appendix D. Conditioned on the prompt, XGLM generates captions using beam-search decoding with a beam of 3. A set of c = 3 candidate captions is re-ranked using M-CLIP, to select the final generated caption in the desired language. The code for LMCAP is made freely available.

Datasets:
We mainly evaluate our approach on XM3600, i.e., a multilingual image captioning dataset (Thapliyal et al., 2022) featuring geographically-diverse images, collected from Open Images based on the regions of 36 languages. For each language, 100 images were selected and annotated with human-generated captions, resulting in a total of 3600 images and 261375 captions across the 36 languages. XM3600 does not contain training or validation splits.
For validation and hyperparameter tuning, we relied on the COCO (Chen et al., 2015) validation split (COCO-DEV) from the standard Karpathy splits (Karpathy and Fei-Fei, 2015). For "reference captions", we machine translate the English captions into Spanish, Hindi, and Chinese, using the M2M-100 model (Fan et al., 2021), similarly in spirit to Thapliyal et al. (2022), who used the Google Translate API (https://cloud.google.com/translate). We make this development set available to the community at https://github.com/RitaRamo/lmcap. As previously mentioned, we also use the captions from the COCO training set to build the datastore. The datastore simply contains the original English captions from COCO without incurring an expensive and noisy machine translation process, unlike in the study from Thapliyal et al. (2022).

Model Assessment and Comparison:
We compare LMCAP with the four multilingual models proposed by Thapliyal et al. (2022). These models combine different mT5 (Xue et al., 2020) and ViT (Zhai et al., 2022) versions and are trained in a fully-supervised fashion on COCO-35L and CC3M-35L, i.e., versions of the original COCO and CC3M datasets (Chen et al., 2015; Sharma et al., 2018) translated with Google's machine translation API. Specifically, BB+CC combines mT5-base and ViT-B/16 pretrained on CC3M-35L and finetuned on COCO-35L; BB is trained on COCO-35L; Bg switches to the ViT-g/14 model; and Lg uses mT5-large and ViT-g/14, also trained with COCO-35L. For reference, Thapliyal et al. (2022) spent 5000 TPU hours to train their models, while our method can be used out-of-the-box for inference, which takes 45 minutes per language on the XM3600 benchmark.
Following Thapliyal et al. (2022), results are reported with the CIDEr (Vedantam et al., 2015) metric for English, Spanish, Hindi, and Chinese, with other languages covered in Section 4.4. CIDEr is a standard captioning metric that computes how well the generated caption matches the consensus of the reference captions, based on Term Frequency-Inverse Document Frequency (TF-IDF). In Appendix A, we include more generation metrics for holistic evaluation. To compute the metrics, we used the COCO evaluation package, and the SacreBLEU tokenization (Post, 2018).
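To make the TF-IDF consensus idea concrete, the sketch below computes a simplified CIDEr-style score: for each n-gram order from 1 to 4, the candidate and each reference are mapped to TF-IDF vectors, their cosine similarities are averaged over the references, and the result is averaged over n. This is an illustrative simplification and not the COCO evaluation package: it uses whitespace tokenization, applies no stemming, score scaling, or length penalty, and all function names are assumptions made for the example.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def tfidf(tokens, n, doc_freq, num_images):
    """TF-IDF vector over n-grams; IDF is computed at the image level."""
    counts = ngram_counts(tokens, n)
    total = sum(counts.values())
    vec = {}
    for gram, count in counts.items():
        df = max(1, doc_freq.get(gram, 0))  # unseen n-grams get the maximum IDF
        vec[gram] = (count / total) * math.log(num_images / df)
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def simplified_cider(candidate, image_id, references, max_n=4):
    """Average TF-IDF cosine similarity to the references, over n = 1..max_n.

    references maps an image id to its list of reference captions.
    """
    num_images = len(references)
    score = 0.0
    for n in range(1, max_n + 1):
        # Document frequency: number of images whose references contain the n-gram.
        doc_freq = Counter()
        for refs in references.values():
            seen = set()
            for ref in refs:
                seen.update(ngram_counts(ref.split(), n))
            doc_freq.update(seen)
        cand_vec = tfidf(candidate.split(), n, doc_freq, num_images)
        sims = [
            cosine(cand_vec, tfidf(ref.split(), n, doc_freq, num_images))
            for ref in references[image_id]
        ]
        score += sum(sims) / len(sims)
    return score / max_n


references = {
    "img1": ["a dog runs on grass"],
    "img2": ["a cat sleeps on the sofa"],
}
print(simplified_cider("a dog runs on grass", "img1", references))
print(simplified_cider("a cat sleeps on the sofa", "img1", references))
```

In this toy corpus, the words shared by both images ("a", "on") receive zero IDF weight, so a caption that only shares those words with the references scores zero, while an exact match scores close to one. This illustrates why the metric rewards content that is specific to the image rather than generic phrasing.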

Results
XM3600: Following Thapliyal et al. (2022), we report results on XM3600 for English, Spanish, Hindi, and Chinese, in Table 1. We can see that LMCAP outperforms all supervised approaches on Chinese, and achieves competitive performance on the other languages, despite being image-blind and not being trained on any image captioning data. For English, Spanish, and Hindi, we note that LMCAP is only outperformed by the large-scale supervised variant BB+CC, pre-trained on CC3M and finetuned on COCO, jointly on English and the other 35 languages for the two datasets, i.e., with 123M captions. Compared with the other variants that are only trained on COCO-35L, our model has a substantially higher CIDEr score across all four languages. We also show that our model can further benefit from increasing the datastore (LMCAP+), as described in more detail in Section 4.3.

COCO:
For completeness, we also report results on the machine translated COCO-DEV set in Table 2. In the top half of the table we show the performance of the 4 SOTA models on COCO-DEV via Google's machine translation API. Since this dataset was not provided by the authors, we also perform automatic machine translation, but using the M2M-100 model (Fan et al., 2021). [...] COCO for any of those languages, neither was it trained on any multimodal data. This is especially the case for English, where our model reaches a similar CIDEr score, although it only reaches about half the performance for the other languages. In Appendix B, we also compare LMCAP with prompt-based captioning methods that were specially designed for English.

Ablation Studies
To better understand the design choices of LMCAP, we report a series of ablation tests on COCO-DEV, to avoid direct tuning on the XM3600 benchmark.
Prompt: Given that LMCAP works by prompting a language model with K retrieved captions and N-shot examples, we study the effect of our prompt when varying K and N. Table 3 shows the importance of not depending on a single retrieved caption across the 4 languages. This is similar to previous findings in retrieval-augmented captioning studies focusing on English (Sarto et al., 2022; Ramos et al., 2023b), which showed that a large K makes the model more robust to mismatched captions. We further see that English and Spanish benefit from encoding a larger set of retrieved captions, while Hindi and Chinese work better with a smaller K. We select K = 4 since it has close-to-optimal performance for each of the languages. We then explore varying the number of N-shot examples, and found N = 3 to be the optimal value on all four languages. We thus use K = 4 and N = 3 in the prompt of LMCAP.
Datastore: We also studied different contents for the datastore beyond the English captions from the COCO training set, shown in Table 4. Given that our model reaches much better performance on English, we hypothesise that our model can better generate captions in a desired language when having the retrieved captions in that same language. This could be validated using translations from COCO in the other languages, but since those are not available, we instead used a machine translated version of the Conceptual Captions dataset (CCM3) from Qiu et al. (2022). We used the English, Spanish, and Chinese versions of the CCM3 training set, respectively for each of the corresponding languages (CCM3-L). We found that performance deteriorates on the COCO-DEV dataset, which might be explained by the difference between the COCO and CCM3-L datasets. Even combining the two datasets (COCO + CCM3-L) is worse than using only the COCO dataset.
In an attempt to cover more diverse concepts, we augmented COCO with three large web datasets (Conceptual Captions (Sharma et al., 2018), Conceptual 12M (Changpinyo et al., 2021), and SBU captions (Ordonez et al., 2011)), using their noise-free versions (Li et al., 2022). We refer to this dataset as CCS, and it contains synthetic model-generated texts for the web images. Using CCS leads to an improvement compared to just using COCO, except for Hindi. In Table 1, we also report results on XM3600 with this best datastore configuration, for which the performance again decreases for Hindi, but improves substantially on English and Chinese. The benefits of including a more diverse collection of captions are further shown in Appendix E with some qualitative examples (e.g., LMCAP was now able to generate the French concept macarons in English). Notice that the retrieved captions from CCS are still in English. Thus, although there is a lack of multilingual image-caption pairs with clean labelled data, it would be interesting to pursue further work on incorporating retrieved information from other languages, in order to improve performance to levels similar to those for English.

Additional Discussion
We now discuss the performance of LMCAP across the 36 languages, taking into consideration the data that was used for pre-training the LM. We also compare our approach with SMs and a simple baseline of retrieval plus translation. To complement the quantitative evaluation, we show some qualitative examples.
Multilingual Pre-training: In Table 6, we report the results of LMCAP on XM3600 for all the 36 languages considered in the dataset, ordered by the percentage of pre-training data used in XGLM for each language. LMCAP shows strong few-shot performance on the diverse set of languages on which XGLM was pre-trained. Similarly to the BB+CC and Lg models, which are limited to the 36 languages they were trained on, our model is also dependent on the LM pre-training data, although there is potential to replace XGLM by another large LM, in order to generalize to other languages.
Comparison with Socratic Models: Since LMCAP is inspired by Socratic Models (SMs), we compare them against our approach. For this, XGLM receives the Socratic prompt that includes the image type, the number of people, places and object categories, instead of our retrieved captions. Results are reported in [...] CIDEr improvement of more than 39.1% on English, 20.0% on Spanish, 11.5% on Hindi, and 21.4% on Chinese. This confirms the effectiveness of our retrieval-augmented LM prompting approach.
Figure 2 content (captions generated by LMCAP for each image, followed by the retrieved captions provided in the prompt):

Image 1
en: "a young man is standing in front of microphones"
es: "un joven presenta algo en un micro" (a young man presents something on a microphone)
hi: "एक युवा व्यक्ति एक लैपटॉप के सामने खड़ा है" (a young man stands in front of a laptop)
zh: "一个年轻的男子站在他的电脑前,他准备开始演讲" (a young man standing in front of his computer, ready to give a speech)
Retrieved captions:
- a young man holds a microphone while staring at a laptop computer
- a man is standing in front of microphones
- the emcee is ready to introduce the first speaker
- blurry photograph of a young man presenting something

Image 2
en: "two people and a kid skiing along a trail"
es: "dos hombres y un niño esquiando en una pista de nieve" (two men and a boy skiing on a snow slope)
hi: "दो लोग और एक बच्चा स्कीइंग के रास्ते पर चल रहा है" (two men and a child are walking on the way to skiing)
zh: "两个大人和一个小男孩在雪地上滑雪" (two adults and a little boy skiing on the snow)
Retrieved captions:
- two people and a kid skiing along a trail
- an adult and two small children are cross country skiing
- two men and a little boy are skiing on a snowy spot
- two adults on skis with a child on skis between them

Image 3
en: "a woman sitting in front of a cake for her birthday"
es: "un pastel de cumpleaños" (a birthday cake)
hi: "एक बहुत ही सुंदर और स्वादिष्ट जन्मदिन केक" (a very nice and delicious birthday cake)
zh: "一个老妇人坐在她的生日蛋糕前" (an old lady sits in front of her birthday cake)
Retrieved captions:
- a large square cake with pink candles sticking out of it
- a man twenty ninth birthday consisted of a family dinner and a homemade cake
- an elderly woman celebrates her 90th birthday with a cake
- a woman sitting in front of a birthday cake for her 90th birthday

Baseline of Retrieval with Translation: We also compared our approach against a baseline that retrieves the nearest caption in English and translates it into the other languages using the M2M-100 model (Table 8). This is to quantify the impact of prompting the language model compared to performing direct translation on retrieved captions. On COCO-DEV, we see that LMCAP only outperforms these results on English. Notice, however, that the references on COCO-DEV for the other languages were produced with M2M-100, the same model used by the baseline, which biases the CIDEr comparison in favour of the baseline. When evaluating on human-labelled data, as is the case with the XM3600 dataset, we see the benefits of prompting with retrieval information. Notice also that both LMCAP and the retrieval baseline outperform the BB model (the latter also competitive with the other 3 SOTA variants), even though BB was trained for hours on large-scale machine-translated multimodal data. This shows the clear benefits of using retrieval-augmentation in multilingual image captioning, not just for result quality but to avoid high computation costs as well.
Qualitative Results: Figure 2 shows examples of captions generated in different languages by LMCAP, together with the retrieved captions that are provided in the prompt regarding each blind-input image. Qualitative examples tend to show diversity in the generation across the languages, with the retrieved information being itself diverse. For instance, in the first example, for English and Spanish, LMCAP focuses on describing that a man is in front of microphones (i.e., based on the first two retrieved captions). In turn, for Hindi and Chinese, the man is in front of a laptop (i.e., from the first retrieved caption), and the Chinese caption also mentions that he is ready to give a speech (i.e., given the last two retrieved captions). In the second image, we can see that LMCAP can simply copy a retrieved caption to generate in English, while for the other languages the model may come up with terms not directly present in the retrieved captions (e.g., "snow slope" in Spanish). The last image is a negative example, where incorrect retrieved captions led the model into errors in English and Chinese, showing that there are also limitations in our image-blind approach. For more examples, see Appendix C.

Conclusions
This paper proposes LMCAP, an image-blind few-shot multilingual image captioning model. LMCAP is based on prompting a language model with N-shot examples and retrieved captions extracted by a vision-and-language model, to condition caption generation in a desired language with a multilingual language model. On XM3600, i.e., a human-labelled massively multilingual multimodal benchmark, LMCAP performs competitively against the state-of-the-art without involving expensive training with large-scale translated multimodal data, or with any captioning data. Experimental results further demonstrate that LMCAP largely outperforms Socratic Models (Zeng et al., 2022), showing that retrieval augmentation plays a crucial role in our prompting approach. As future work, we plan to further assess the use of multilingual data in the datastore, as well as the impact of directly promoting diversity (Ye et al., 2022; Levy et al., 2022) in the captions used in the prompt.

Limitations
Image captioning and multilingual image captioning studies tend to focus on the COCO dataset, which was shown to contain gender imbalance. Previous research has also shown that models trained on COCO tend to amplify this bias (Hendricks et al., 2018; Zhao et al., 2017). While our model is not trained on COCO or on any captioning data, it relies on a pre-trained language model, which is known to suffer from different sources of bias and fairness issues (Bommasani et al., 2021; Sheng et al., 2021; Schramowski et al., 2022). Our model also involves retrieval-augmentation with captions extracted by a vision-and-language model, also pre-trained in an unsupervised manner. As in the case of other retrieval-augmented generative models (Lewis et al., 2020), LMCAP is inherently biased towards the retrieved information. Notwithstanding, by conditioning on information from a datastore with clean and curated text, LMCAP has potential to ameliorate some of the generation issues of the language model (e.g., avoid hateful or violent language). To gain insights into the biases present in LMCAP, we recommend analysing the retrieved captions used by the model, since they provide cues to the predictions, as shown in Figure 2. We argue that it can be much harder to have a direct interpretation for captioning models that are not retrieval-augmented.
Another limitation of our model relates to it following a fully image-blind approach, which heavily depends on information from similar captions instead of the visual content from the actual input image. To address this limitation, future work could additionally include concepts extracted from the image in the prompt, as proposed in Socratic Models, combined with the retrieved information. [...] more attention to low-resource languages as well (i.e., languages beyond those covered in our tests). Evaluating LMCAP with additional datasets, covering an even larger set of languages and concepts, would be desirable.
Examples (captions generated by LMCAP for each image, followed by the retrieved captions provided in the prompt):

en: "the new york stock exchange"
es: "una escena de la ciudad de nueva york" (a scene of new york city)
hi: "एक नई स्टॉक एक्सचेंज की तस्वीर" (photo of a new stock exchange)
zh: "纽约证券交易所" (new york stock exchange)
Retrieved captions:
- a street scene with focus on the new york stock exchange
- an intersection with a street sign and flag, at a stock exchange building
- a window to a building that has an american flag in it
- wall st sign up close with numbers 95 through 104

en: "a military style helicopter that is in a hangar"
es: "un helicóptero militar que se encuentra en un hangar" (a military helicopter in a hangar)
hi: "एक एयर फोर्स हेलीकॉप्टर जमीन पर खड़ा है" (an air force helicopter stands on the ground)
zh: "军用直升机停在空地" (a military helicopter parked in open space)
Retrieved captions:
- a military style helicopter that is in a hangar
- a large army looking helicopter landing at an airport
- a helicopter that is sitting with its back wheels on the ground
- an air force helicopter sitting in a gravel area

en: "polar bear diving underwater"
es: "un oso polar es visto bajo el agua" (a polar bear is seen under the water)
hi: "एक बाघ पानी के अंदर है" (a tiger is inside the water)
zh: "一只北极熊在水下潜水" (an arctic bear dives underwater)
Retrieved captions:
- a polar bear dives underwater at the zoo
- a polar bear as seen from underwater camera
- a polar bear diving to the bottom of his tank
- a polar bear in the zoo dives underwater

D Prompt-Template
We follow the Socratic template, but instead of including different categories (objects, places, number of people, etc.), we insert the retrieved captions. By following the same template, rather than a completely different one, we can assess the impact of including retrieval compared to the original Socratic framework. Our template is:

I am an intelligent image captioning bot. Similar images have the following captions: <caption 1> <caption 2> <caption 3> <caption 4>. A creative short caption I can generate to describe this image in <language> is:

Between the retrieved captions we use the special end-of-sentence token (i.e., </s>) of XGLM. Notice also that our prompt starts with 3 fixed shot examples from images in the training dataset (i.e., the same template is repeated multiple times to encode the N-shot examples). We share the N-shot examples and the set of K retrieved captions used in our prompt, together with the code at https://github.com/RitaRamo/lmcap.
The following text is a concrete example of the prompt provided for the first image of XM3600.
I am an intelligent image captioning bot. Similar images have the following captions: a horse grazing in a grassy field next to a barn</s> a brown horse grazing in its pen and a red barn and water</s> a pretty brown horse eating some grass in a bare field</s> a horse is eating grass next to a barn in the middle of a pasture</s> A creative short caption I can generate to describe this image in spanish is: Un caballo marrón es grasa cerca de una casa roja</s>

I am an intelligent image captioning bot. Similar images have the following captions: a teal toilet is the center of this bathroom photo</s> a small bathroom with brightly painted blue walls</s> the bathroom has a splash of color with the blue tiles</s> the sink is above a turquoise tile sink</s> A creative short caption I can generate to describe this image in spanish is: Un baño muy limpio y bien decorado</s>

I am an intelligent image captioning bot. Similar images have the following captions: a woman and child focus on a pink device in public</s> a woman holding a small child while standing near a crowd</s> a very cute lady posing with a small kid</s> a young child with a cell phone and an adult</s> A creative short caption I can generate to describe this image in spanish is: Una mujer se acercó a mirar en su teléfono mientras está listo para tomar una foto</s>

I am an intelligent image captioning bot. Similar images have the following captions: a brown chicken is walking around outside with another hen</s> a couple of roosters standing in a field</s> a hen pecks the ground while another looks off in the distance</s> a couple of roosters are in a field</s> A creative short caption I can generate to describe this image in spanish is:.
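The template and example above can be assembled programmatically. The following is a minimal sketch, not the released code: the function names, the representation of a shot as a (retrieved captions, target caption) pair, and the exact whitespace between blocks are assumptions made for illustration.

```python
PREFIX = (
    "I am an intelligent image captioning bot. "
    "Similar images have the following captions: "
)
SUFFIX = " A creative short caption I can generate to describe this image in {language} is:"
EOS = "</s>"  # end-of-sentence token placed after each caption


def format_block(retrieved, language, target=None):
    """Format one template block; a target caption turns it into a shot."""
    captions = " ".join(caption + EOS for caption in retrieved)
    block = PREFIX + captions + SUFFIX.format(language=language)
    if target is not None:
        block += " " + target + EOS
    return block


def build_prompt(shots, retrieved, language):
    """Concatenate the N-shot demonstrations and the block for the input image.

    shots is a list of (retrieved_captions, target_caption) pairs.
    """
    blocks = [format_block(caps, language, target) for caps, target in shots]
    blocks.append(format_block(retrieved, language))  # left open for the LM
    return " ".join(blocks)


shots = [
    (
        ["a teal toilet is the center of this bathroom photo",
         "a small bathroom with brightly painted blue walls"],
        "Un baño muy limpio y bien decorado",
    ),
]
retrieved = [
    "a couple of roosters standing in a field",
    "a couple of roosters are in a field",
]
print(build_prompt(shots, retrieved, "spanish"))
```

The final block ends with the open instruction, so the language model continues the text with a caption in the requested language. Changing the target language only requires swapping the language name and the demonstration captions.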

E Augmented Datastore Examples
In this appendix, we provide qualitative examples on XM3600 when the datastore is augmented with CCS, i.e., with large and diverse data. In Figure 4, we can see generation improving for English, where LMCAP correctly mentions the French concept of macarons, available in the retrieved captions. In line with the quantitative results provided in Section 4.3, we can also see a possible explanation for why generation degraded for Hindi, which has a lower pre-training language ratio than English: LMCAP seems to have copied the last of the 3-shot examples provided in the prompt (described above in Section D), maybe due to the presence of more noise in the CCS data. Another example can be seen in Figure 5, where LMCAP is more specific in generating the flower type orchid.

Figure 1: Illustration of the key aspects of LMCAP, a few-shot multilingual image captioning approach that re-uses pre-trained unimodal models without requiring any training. In our image-blind approach, a multilingual language model (XGLM) is prompted with information retrieved with a multilingual CLIP model. The prompt contains a set of N-shot examples and K retrieved captions, to guide caption generation in a desired language.

Figure 2: Examples of captions generated by LMCAP for English, Spanish, Hindi, and Chinese, on XM3600 images and based on retrieved captions regarding each blind-input image.

Figure 3: More examples of captions generated by LMCAP for XM3600 images, with retrieval from COCO.

Figure 4: An example of caption generation by LMCAP conditioned on captions retrieved from COCO (top) compared to augmenting the datastore with CCS (bottom).
Figure 5: An example of LMCAP generation based on retrieval from COCO (top) or COCO augmented with CCS (bottom).

Table 1: Results on the geographically-diverse XM3600 benchmark. We compare our few-shot LMCAP model against large-scale supervised multilingual and multimodal SOTA models proposed by Thapliyal et al. (2022). Best results in bold and second-best underlined.

Table 2: CIDEr performance on the COCO dataset. The top of the table presents SOTA results on the COCO validation split, translated via the GOOGLE API. The bottom rows of the table show our model's performance on COCO-DEV, translated via the M2M-100 model. |θ| corresponds to the number of trainable parameters in the model (in millions).

Table 3: The effect of using different numbers of K retrieved captions and N few-shot examples. Results reported on COCO-DEV with best results in bold.

Table 5: CIDEr performance on COCO-DEV, across the different variants of XGLM, to show the scaling behaviour of the LM used in LMCAP. RAM corresponds to the GPU memory consumption.

Model Size: In Table 5, we show the importance of using a language model that has a sufficiently large number of parameters. Both XGLM-562M and XGLM-1.7B are unable to generate captions beyond English. On the other hand, the 7.5B variant can lead to a stronger performance, but large-scale [...] memory, which limits the size of the prompt that can be encoded with modest hardware. LMCAP uses the more efficient XGLM-2.9B version. These results are in line with previous findings, which suggest that stronger few-shot performance is achieved when the prompt is encoded by large LMs (Brown et al., 2020).

Table 7: Comparison to Socratic Models (SMs) on the COCO-DEV dataset. LMCAP clearly outperforms SMs, as highlighted in bold.

Table 8: Comparison to direct translation on retrieved captions (Baseline), on COCO-DEV and XM3600.