PR-MCS: Perturbation Robust Metric for MultiLingual Image Captioning

Vulnerability to lexical perturbation is a critical weakness of automatic evaluation metrics for image captioning. This paper proposes Perturbation Robust Multi-Lingual CLIPScore (PR-MCS), a novel reference-free image captioning metric applicable to multiple languages that is robust to such perturbations. To achieve perturbation robustness, we fine-tune the text encoder of CLIP with our language-agnostic method so that it distinguishes perturbed text from the original text. To verify the robustness of PR-MCS, we introduce a new fine-grained evaluation dataset consisting of detailed captions, critical objects, and the relationships between the objects for 3,000 images in five languages. In our experiments, PR-MCS significantly outperforms baseline metrics in capturing lexical noise across all perturbation types in all five languages, proving that PR-MCS is highly robust to lexical perturbations.


Introduction
Image captioning (Xu et al., 2015; Vinyals et al., 2015, 2016; Lu et al., 2017) is a multimodal task that automatically generates captions describing the visual content of an image, integrating the visual and textual modalities. Image captioning is a natural language generation (NLG) task (Gatt and Krahmer, 2018), but its evaluation metrics have different characteristics from other NLG metrics (Sai et al., 2022). Image captioning metrics should evaluate not only linguistic fluency and syntactic thoroughness but also semantic correspondence to the visual content (Bai and An, 2018).
Evaluation criteria for image captioning have evolved from N-gram-based metrics (Papineni et al., 2002; Lin, 2004; Banerjee and Lavie, 2005; Vedantam et al., 2015) to reference-free metrics (Lee et al., 2020, 2021; Hessel et al., 2021). Recently, CLIPScore (Hessel et al., 2021) was proposed to leverage the large-scale pre-trained vision and language model CLIP (Radford et al., 2021). By evaluating generated captions via the cosine similarity between embedded vectors (i.e., image and text) computed with CLIP, CLIPScore achieves a higher correlation with human judgments than traditional metrics.
However, Sai et al. (2021) have revealed that current metrics are prone to failure in capturing lexical noise in generated captions. For example, when a perturbation is applied to an original caption (e.g., a removal or swap at the token level), existing image captioning metrics do not recognize the change and compute a score similar to that for the original caption. This failure to capture lexical noise raises a critical question concerning the reliability of the metric, as shown in the example in Figure 1. CLIPScore exhibits the same tendency in our analysis, which reflects its vulnerability to perturbed texts. By extending CLIPScore to a multilingual setting, we observe that a multilingual CLIPScore exhibits the same limitations in multiple languages other than English, i.e., French, German, Spanish, and Japanese.
In this paper, we address this problem by proposing a novel method for enhancing the perturbation robustness of CLIPScore. Our method fine-tunes the text encoder of CLIP with perturbed captions so that the text encoder can distinguish perturbed text embeddings from original text embeddings. The simplicity and effectiveness of our method enable us to apply it to multiple languages without relying on human annotations. Using our method, we develop Perturbation-Robust Multilingual CLIPScore (PR-MCS), a perturbation-robust and language-agnostic metric for image captioning.
Furthermore, to validate the robustness of PR-MCS against perturbations and its high human correlation, we introduce two newly created datasets: M-FineCap3k and M-CapEval1k. Currently, most image captioning datasets are limited to English, necessitating a machine translation (Bahdanau et al., 2014; Johnson et al., 2017) process for multilingual experiments. However, this process relies on the performance of the translation model, which may result in lower evaluation reliability compared to human-annotated labels. Hence, we elicit image captions directly from human experts, tailored to the purpose of the datasets. Firstly, M-FineCap3k is designed as an image captioning dataset, created to provide fine-grained captions appropriate to the corresponding images. Secondly, M-CapEval1k serves as a benchmark dataset developed for measuring the human correlation of image captioning metrics.
Finally, experimental results on five datasets, including M-FineCap3k, demonstrate that PR-MCS outperforms baseline metrics in capturing lexical noise in captions across all five languages considered. In addition, the results of measuring human correlation using M-CapEval1k reveal that PR-MCS exhibits a strong alignment with human judgments. Therefore, we confirm that the proposed PR-MCS is a useful and reliable image captioning metric with perturbation robustness and a strong correlation with human judgments.
Related Work

CLIPScore

Researchers have also proposed unreferenced image captioning metrics that evaluate generated captions by comparing them with the original images, requiring no ground-truth captions (Madhyastha et al., 2019; Kusner et al., 2015; Lee et al., 2021; Chen et al., 2020). CLIPScore (Hessel et al., 2021), also a reference-free metric, relies heavily on the CLIP (Radford et al., 2021) model, trained on 400 million image-caption pairs with a contrastive objective function that distinguishes original image-caption pairs from unmatched captions. CLIPScore is the weighted cosine similarity between the image embedding and the text embedding encoded by the CLIP model. Although CLIPScore exhibits a high correlation with human evaluation, it is limited in that it applies only to English. In this study, we propose a new multilingual image captioning metric developed by extending CLIPScore to a multilingual setting.
Perturbation Robustness

In a recent study, Sai et al. (2021) selected various criteria for assessing how NLG evaluation metrics perform. In addition, perturbations were applied to multiple image captioning factors to assess the perturbation robustness of image captioning metrics. Sai et al. (2021) provided a perturbation checklist of metrics for NLG tasks; we go further and present a novel metric that overcomes the limitations of other metrics. We select some perturbation criteria from among those suggested by Sai et al. (2021), designate them as target perturbations, and show that CLIPScore cannot detect these perturbations in multiple languages. Even if the generated captions are corrupted, CLIPScore outputs similar results for the original and corrupted sentences. This study proposes a novel metric with perturbation robustness based on CLIPScore to address its weaknesses in multiple languages.
Perturbation Robust Multi-Lingual CLIPScore

We first pre-train a multilingual text encoder via teacher learning (Figure 2, left). In teacher learning, the multilingual text encoder learns the pre-trained CLIP text embedding of an English sentence, so that the embedding of the same sentence translated by a machine translation model is similar to it. We use MSE loss as the teacher learning loss:

L_teacher = MSE(t, s) = ||t - s||^2,

where t is the teacher embedding and s is the student embedding. More details on multilingual text encoder pre-training can be found in Appendix A.
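A minimal PyTorch sketch of this teacher-learning step follows; the encoder objects, batching, and optimizer setup are hypothetical stand-ins for the actual training code.

```python
import torch
import torch.nn.functional as F

def teacher_learning_step(clip_text_encoder, multilingual_encoder,
                          english_batch, translated_batch, optimizer):
    """One teacher-learning step: the multilingual student is trained so
    that its embedding of a translated caption matches the frozen CLIP
    text embedding of the English source caption."""
    with torch.no_grad():
        t = clip_text_encoder(english_batch)    # teacher embedding t
    s = multilingual_encoder(translated_batch)  # student embedding s
    loss = F.mse_loss(s, t)                     # L_teacher = MSE(t, s)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```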
We present a new multilingual image captioning metric, MCS, using this model as a backbone. MCS scores an image-caption pair by weighting the cosine similarity between the embeddings produced by the visual and text encoders. For an image-caption input pair (I, c), MCS is computed as

MCS(I, c) = w * max(0, cos(V(I), T(c))),

where V(I) is the visual embedding of the image passed through the visual encoder and T(c) is the text embedding of the caption passed through the multilingual text encoder. The value of w is set to 2.5, as in the original CLIPScore.
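In code, the score amounts to a clipped, scaled cosine similarity; a minimal sketch, assuming `visual_encoder` and `text_encoder` are placeholders for the CLIP visual encoder and the pre-trained multilingual text encoder:

```python
import torch
import torch.nn.functional as F

def mcs(visual_encoder, text_encoder, image, caption, w=2.5):
    """MCS(I, c) = w * max(0, cos(V(I), T(c))), CLIPScore-style."""
    v = visual_encoder(image)    # V(I): visual embedding
    t = text_encoder(caption)    # T(c): multilingual text embedding
    cos = F.cosine_similarity(v, t, dim=-1)
    return w * torch.clamp(cos, min=0.0)
```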

Human Correlations of MCS
An adequate evaluation of image captioning metrics requires assessing the correlation between metric-generated caption scores and human-generated caption scores. However, to the best of our knowledge, no benchmark exists to evaluate image captioning metrics across multiple languages.
For languages other than English, machine translation is necessary, but this approach can lead to incorrect results when machine translation models inadvertently correct sentences that have already been annotated with a low score. To tackle this problem, we created the M-CapEval1k benchmark by translating CapEval1k (Lee et al., 2021), an image captioning metric evaluation set originally in English, into five languages (English, German, Spanish, French, and Japanese) with the assistance of native speakers of each language. Our translation process ensured that the goodness or badness scores for each sentence were maintained. This benchmark can be leveraged for the quantitative evaluation of multilingual image captioning metrics, and an example of M-CapEval1k is provided in Appendix C.
Table 1 shows the Kendall tau-c (τc) values (Kendall, 1938) representing the human correlation of the metrics for each language. MCS, which uses our pre-trained model as a backbone, demonstrates a correlation with human judgment similar to CLIPScore in English. For other languages, it can serve as a baseline for multilingual image captioning metrics in future research. Therefore, the MCS presented in this paper extends CLIPScore well to languages beyond English.

Vulnerability to lexical perturbation
We employed some of the perturbation criteria identified by Sai et al. (2021) and checked the perturbation sensitivity of CLIPScore and MCS. One of the criteria, "Repetition", is a perturbation in which words are repeated at the token level in the original caption (e.g., "I am a boy." → "I am am a boy boy."). Figure 3 shows the scores given by the baseline metrics when repetitive lexical noise is introduced. We randomly selected 3,000 samples from the MSCOCO (Lin et al., 2014) dataset, translated them into four languages, and injected repetitive lexical noise into the captions. The blue bars show the scores for the original captions, and the red bars show the scores for the perturbed captions. For English, CLIPScore is used as the metric, and for the other languages, MCS is used for score extraction. A caption to which lexical noise has been added is expected to have a lower matching score with the image than the original caption. However, for all languages, the scores for the perturbed captions are not lower than those for the original captions; there are even cases in which the perturbed caption is given a higher score. Similar tendencies can be observed for other perturbation criteria besides "Repetition". These results confirm that CLIPScore and MCS are vulnerable to lexical perturbation and that a metric robust to perturbation is needed.

Perturbation-Robust Multilingual CLIPScore (PR-MCS)

We introduce a novel language-agnostic perturbation method that increases the robustness of MCS. The method fine-tunes the multilingual text encoder by adding three losses to the original CLIP loss. CLIPScore itself is computed from the embeddings of pre-trained CLIP without additional training. The CLIP loss L_CLIP is the in-batch contrastive loss using cross-entropy, implemented identically to the pre-training loss of CLIP. We build on the contrastive loss of CLIP to maintain CLIPScore's high correlation with human judgment.
Then, we train the text encoder by adding three losses for perturbation robustness. These losses aim to keep the image embedding close to the original caption embedding while increasing its distance from the perturbed caption embedding. An (image, original caption, perturbed caption) triplet is used as input to fine-tune the text encoder through the three losses, as shown in Figure 2 (right):

L1 = 1 - cos(V(I), T(o)),    (1)
L2 = max(0, cos(V(I), T(p)) - m),    (2)
L3 = max(0, cos(T(o), T(p)) - m),    (3)

where (I, o, p) is the (image, original caption, perturbed caption) triplet, V is the visual encoder, and T is the text encoder. Equation (1) is the cosine embedding loss of the two representations, needed to increase the similarity between the image embedding and the original caption. Since MCS is based on cosine similarity, the purpose of Eq. (1) is to yield a higher score for the original caption. Equation (2) reduces the margined cosine similarity between the image embedding and the perturbed caption embedding. Equation (3) reduces the similarity between the original and perturbed caption embeddings. The margin m is set to 0.1. These three losses are combined with the CLIP loss to obtain the final objective:

L = L_CLIP + λ1 L1 + λ2 L2 + λ3 L3.

We develop PR-MCS by fine-tuning Multilingual CLIP using the proposed loss function.
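The following PyTorch sketch implements this objective under the reconstruction above; the normalized-embedding assumption, batching, and symmetric in-batch CLIP loss are simplifications rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def pr_mcs_loss(V_I, T_o, T_p, logit_scale, m=0.1,
                lam1=0.1, lam2=0.05, lam3=0.05):
    """Fine-tuning objective for a batch of (image, original caption,
    perturbed caption) triplets. V_I, T_o, T_p: (B, D) L2-normalized
    embeddings; logit_scale: CLIP's learned temperature."""
    # L_CLIP: symmetric in-batch contrastive loss over (image, original).
    logits = logit_scale * V_I @ T_o.t()
    labels = torch.arange(V_I.size(0), device=V_I.device)
    l_clip = (F.cross_entropy(logits, labels)
              + F.cross_entropy(logits.t(), labels)) / 2
    l1 = (1 - F.cosine_similarity(V_I, T_o)).mean()        # Eq. (1)
    l2 = F.relu(F.cosine_similarity(V_I, T_p) - m).mean()  # Eq. (2)
    l3 = F.relu(F.cosine_similarity(T_o, T_p) - m).mean()  # Eq. (3)
    return l_clip + lam1 * l1 + lam2 * l2 + lam3 * l3
```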
PR-MCS(I, c) = w * max(0, cos(V(I), T*(c))),

where T*(c) is the text embedding from the fine-tuned multilingual text encoder. w is also set to 2.5, as in the original CLIPScore and MCS.

M-FineCap3k Dataset
To evaluate the performance of PR-MCS, we introduce a new image captioning dataset, M-FineCap3k. Most existing image captioning datasets are limited to English (Young et al., 2014; Krishna et al., 2017). Therefore, a machine translation (MT) model from English to other languages is essential for evaluating image captioning in various languages. However, a translated evaluation set is highly dependent on the performance of the MT model and is likely to carry an English-language bias. In addition, translations are often unnatural in the target language because it is difficult to reflect the unique characteristics of each language, such as word order and lexical choice (Zhang and Toral, 2019; Cao et al., 2020). Therefore, results obtained using a translated evaluation set achieve poorer agreement with human evaluation than those obtained using the English evaluation set. For these reasons, a human-annotated image captioning evaluation set covering a wide variety of languages is needed.

Multilingual image captioning evaluation set
We introduce a new human-annotated multilingual evaluation dataset, M-FineCap3k. We extended FineCapEval (Cho et al., 2022), which has only English captions, to five languages (English, German, French, Spanish, and Japanese). For each language, human experts viewed the images and wrote captions directly. Sentences written directly by native speakers are more fluent than translated versions. Moreover, M-FineCap3k can capture various cultural aspects that MT models cannot (Liu et al., 2021). Therefore, the reliability of evaluation in multilingual settings increases. An example of the dataset is shown in Figure 4. Human annotators for each language created a caption, critical objects, backgrounds, and relationships for a given image.

Fine-grained caption with critical objects
The widely used multilingual image captioning dataset Multi30K (Elliott et al., 2016, 2017) is also based on human annotations of images from Flickr30K (Young et al., 2014). However, it is composed of brief sentences, making it challenging to apply diverse lexical perturbations, and it covers only two additional languages, namely German and French. To address this, we create M-FineCap3k with long, detailed captions of 20 words or more to enhance the impact of human annotation and to generate a range of perturbed captions for evaluation purposes. In addition, we had human experts mark critical objects, backgrounds, and relationships so that perturbed captions can be created effectively. As described above, an image captioning metric should also reflect the semantic correspondence of whether the caption captures the information contained in the visual content well. The critical object of a caption points to the most important object in the visual content, so it plays a key role in the comparison between embeddings. When perturbation is applied to this critical object, a more powerful and effective perturbation is achieved.
For instance, a well-known weakness of CLIP is that it does not produce different results when the positions of critical objects in a sentence are changed. For example, the CLIP text embeddings of the two sentences "A blue car in front of the white church" and "A white car in front of the blue church" are almost identical. To evaluate robustness to this perturbation, we construct perturbation criteria using the critical object information. The statistics of the dataset can be found in Appendix D.

Figure 6: Experiment result graph (panels for Flickr8k, VizWiz, M-FineCap3k, and Multi30k). The y-axis value represents the score drop of the perturbed caption as a percentage difference compared to the original caption. The experiment is conducted with five datasets, and we report the average over the languages to confirm the results for each perturbation. As a result, it can be seen that PR-MCS is more robust than baseline metrics for all perturbations across all datasets.

Experiments
Our framework seeks to identify whether a given metric can detect lexical noise in a generated caption. Through exhaustive experiments, we evaluate whether the proposed PR-MCS successfully distinguishes perturbed captions from original captions.

Experimental Setup
Fine-tuning set

We use MSCOCO, the dataset most widely used for image captioning, as the fine-tuning set to enhance the perturbation robustness of MCS. We use the training and validation splits of the MSCOCO dataset described by Chen et al. (2020). The training set contains 414k elements. Since MSCOCO contains only English captions, the captions are translated into the four other languages using the MBART-50 (Liu et al., 2020) MT model.
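The translation step can be reproduced approximately as below with Hugging Face Transformers; the exact checkpoint and decoding settings are not specified in the paper, so the many-to-many mBART-50 checkpoint here is an assumption.

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# Assumed checkpoint; the paper only names "MBART-50".
name = "facebook/mbart-large-50-many-to-many-mmt"
model = MBartForConditionalGeneration.from_pretrained(name)
tokenizer = MBart50TokenizerFast.from_pretrained(name, src_lang="en_XX")

def translate(caption: str, tgt_lang: str) -> str:
    """Translate an English MSCOCO caption, e.g. tgt_lang='de_DE'."""
    inputs = tokenizer(caption, return_tensors="pt")
    out = model.generate(
        **inputs, forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang])
    return tokenizer.batch_decode(out, skip_special_tokens=True)[0]

# e.g. translate("A man riding a bike.", "fr_XX")
```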

Baseline metric
As the baselines of the experiment, we use two MCS metrics. As mentioned above, an MCS metric is configured using CLIP's visual encoder and a multilingual text encoder. The first baseline is the MCS metric built on the multilingual CLIP text encoder implemented by Fredrik Carlsson (2022) as the backbone. The second baseline is the MCS metric built on the multilingual text encoder trained by the teacher learning method described in Section 3.1.

Perturbation configuration
We select the following five criteria to perturb the sentences in the fine-tuning and evaluation sets. These criteria are error types commonly found in model-generated captions and are part of the checklist proposed by Sai et al. (2021). We select five orthogonal criteria. An example of each perturbation is shown in Figure 5.
Repetition

Repeated words are found in several model-generated captions. A well-known problem is that transformer models are vulnerable to this noise because they do not capture repetitive perturbation well at the embedding level. We repeat each word token with a probability of 0.4.

Removal

Among the sentences given a low score in evaluation datasets for image captioning metrics, such as Composite (Aditya et al., 2015) or Pascal50s (Vedantam et al., 2015), incomplete sentences with removed word tokens are found. We configure this perturbation by removing tokens to reflect that noise. Each word token is removed with a probability of 0.4.
Masking

Masking is a perturbation in which randomly selected tokens in the caption are replaced with [Mask] tokens. When lexical noise is applied at the token level, the meaning of the corresponding token disappears, but unlike in the Removal case, its position is maintained. Position information can be critical in a reference-free metric based on a transformer model such as CLIPScore (Dai et al., 2019; Wu et al., 2021). Therefore, even though the [Mask] token does not appear in generated captions, we select Masking as a criterion separate from Removal to address the above case. Each word token is replaced with a [Mask] token with a probability of 0.4.
Jumble

We generate perturbed samples using a random-order permutation at the token level of the original reference caption. The model computing the metric still sees all tokens of the sentence along with the visual content, but considerable noise is introduced into the position information.
Substitution

Substitution involves changing the positions of key elements in a sentence. In the case of M-FineCap3k, substitution is performed using the critical objects annotated by human experts. In the remaining datasets, nouns in the caption are extracted and their positions are swapped. The perturbed caption contains every element of the original caption but, unlike in the Jumble case, does not deform the grammatical structure at all. Detecting substitution noise is the most challenging task because it requires judging semantic correspondence to visual content perfectly.
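A minimal sketch of these five perturbations at the word-token level, assuming whitespace-tokenized captions; the 0.4 probabilities follow the text, while the single-token substitution heuristic is a simplification of the critical-object swapping described above.

```python
import random

def repetition(tokens, p=0.4):
    """Repeat each word token with probability p."""
    out = []
    for tok in tokens:
        out.append(tok)
        if random.random() < p:
            out.append(tok)
    return out

def removal(tokens, p=0.4):
    """Remove each word token with probability p."""
    return [tok for tok in tokens if random.random() >= p]

def masking(tokens, p=0.4):
    """Replace each word token with [Mask] with probability p,
    keeping its position."""
    return [tok if random.random() >= p else "[Mask]" for tok in tokens]

def jumble(tokens):
    """Random-order permutation of all tokens."""
    return random.sample(tokens, len(tokens))

def substitution(tokens, critical_objects):
    """Swap the positions of critical objects (or nouns) in the caption;
    simplified here to single-token objects."""
    idxs = [i for i, tok in enumerate(tokens) if tok in critical_objects]
    shuffled = random.sample(idxs, len(idxs))
    out = list(tokens)
    for i, j in zip(idxs, shuffled):
        out[i] = tokens[j]
    return out
```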

Perturbation robustness evaluation
We report the main results for all datasets and languages in Figure 6 and Table 2. A robust evaluation metric is expected to give lower scores to perturbed captions than to original captions.
Each graph in Figure 6 shows the experimental results for MSCOCO, VizWiz, Flickr8k, Multi30K, and M-FineCap3k by perturbation. Each point represents the average result over the five languages for one perturbation, showing how large a score drop the perturbed caption has relative to the original caption. The green line indicates PR-MCS, and the blue and red lines refer to the two baseline multilingual CLIPScores. The scores that the baseline metrics assign to perturbed captions do not differ much from those for the original captions under any perturbation method. In some cases, the scores for the perturbed captions are even higher than those for the original captions. In contrast, our metric exhibits a significant score decrease for all perturbations, meaning that it can clearly distinguish when a perturbation is applied. In other words, our metric is robust to all perturbations in the evaluation datasets. In particular, even for Repetition and Substitution, which are known to be challenging perturbations, PR-MCS detects the perturbations very well, while the baseline metrics do not capture them at all.
Table 2 shows the original-caption score and the average perturbed-caption score given by each metric on the four evaluation sets for each language, along with the percentage by which the score decreased for each perturbation relative to the original caption. For the baseline metrics, the average score decrease under perturbation is very small, approximately 3% relative to the original caption. It is difficult to say that a metric can distinguish the perturbed caption from the original caption based on such a slight difference. In contrast, for PR-MCS, the percentage decrease for the perturbed caption ranges from 50% to 70%. Clearly, our proposed method exhibits perturbation robustness in the metric score, and perturbed captions can be identified from the score drop alone, in the manner of anomaly detection. On VizWiz, Flickr8k, Multi30K, and M-FineCap3k, the performance is outstanding even though these distributions are not seen during fine-tuning.
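For reference, the score drop reported in Figure 6 and Table 2 can be computed as a simple percentage difference; the helper below is an illustrative assumption, not the paper's evaluation code.

```python
def score_drop_pct(orig_scores, pert_scores):
    """Percentage score drop of perturbed captions relative to originals;
    negative values mean the metric scored the perturbed captions lower."""
    orig = sum(orig_scores) / len(orig_scores)
    pert = sum(pert_scores) / len(pert_scores)
    return 100.0 * (pert - orig) / orig
```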

Few-shot setting for M-FineCap3k
As the results summarized in Section 5.3 show, PR-MCS is much more robust to perturbation than the baselines on M-FineCap3k. However, the performance degradation for perturbed captions on M-FineCap3k is smaller than on MSCOCO, VizWiz, Flickr8k, and Multi30K (e.g., an average drop of -39.89%, compared to -55.66% on MSCOCO). We attribute this to the distribution shift caused by the difference in sequence length between the MSCOCO fine-tuning set and the M-FineCap3k test set. VizWiz, Flickr8k, and Multi30K are composed of short captions, so their caption lengths do not differ much from MSCOCO's. Therefore, to check whether the distribution can be learned when some information about the M-FineCap3k test set is provided, we perform additional experiments on M-FineCap3k in a few-shot setting. We split M-FineCap3k into subsets with a 1:9 size ratio and use only the 300 perturbed captions of the smaller subset as the few-shot input.
The experimental results for the few-shot setting are shown in Table 2 and Figure 6 (the yellow line in the rightmost graph). When the distribution of the fine-grained captions is given, the overall performance in perturbation detection, in terms of the average score, increases for all five languages. These results show that lexical noise in long sentences is captured more reliably after learning from a small number of samples in a few-shot setting. The experimental results for all languages and all perturbations for each dataset are provided in Appendix H.

Correlations with human judgment
Table 3 and Table 4 show that PR-MCS is a useful image captioning metric with high correlation with human judgment. Higher values of the Kendall tau-c (τc) and the Pearson correlation coefficient (ρ) (Benesty et al., 2009), the indicators of correlation with human judgment, are better. The Kendall tau-c value measures the rank-based similarity between two variables, and the Pearson correlation coefficient measures the linear correlation between two sets of data.
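Both measures are available in SciPy (Kendall tau-c via the variant argument in recent SciPy releases); the score arrays below are toy placeholders:

```python
from scipy.stats import kendalltau, pearsonr

# Toy placeholder scores; in practice these come from the metric and annotators.
metric_scores = [0.71, 0.42, 0.88, 0.35, 0.60]
human_scores = [0.80, 0.40, 0.90, 0.25, 0.55]

tau_c, _ = kendalltau(metric_scores, human_scores, variant="c")  # Kendall tau-c
rho, _ = pearsonr(metric_scores, human_scores)                   # Pearson rho
print(f"tau-c={tau_c:.3f}, rho={rho:.3f}")
```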
For Table 3, M-CapEval1k is used as the evaluation set for measuring the performance of image captioning metrics, and it demonstrates the excellent characteristics of PR-MCS across the various languages in the dataset. As introduced in Section 3.2, our M-CapEval1k serves as a benchmark for measuring the correlation of a given metric with human judgment across various languages. PR-MCS exhibits perturbation robustness while showing performance similar to CLIPScore and MCS in Kendall correlation in English, and even higher performance in Pearson correlation.
In Table 4, it is evident that PR-MCS remains a valuable metric on existing English benchmark datasets such as Composite (Aditya et al., 2015) and Flickr8k_Expert (Hodosh et al., 2013). The correlation of PR-MCS with human judgment for English sentences is notably strong compared to conventional reference-based metrics and is similar to CLIPScore. Furthermore, as shown in Table 3, PR-MCS operates in a multilingual setting and demonstrates perturbation robustness, distinguishing it from CLIPScore. Because PR-MCS is multilingual, we can further showcase its effectiveness by translating these datasets into languages besides English using machine translation. PR-MCS consistently demonstrates strong human correlation across all other languages, and comparative results with reference-based image captioning metrics can be found in Appendix I.
In light of these cumulative findings, existing metrics such as CLIPScore and MCS are vulnerable to sentences containing the kinds of perturbations an image captioning model may output. PR-MCS can evaluate good sentences positively and bad sentences negatively even for potentially perturbed sentences; thus, it can be used in place of CLIPScore.

Conclusion
In this paper, we propose PR-MCS, a perturbation-robust metric for multilingual image captioning using language-agnostic fine-tuning. PR-MCS, developed by fine-tuning the text encoder of CLIP, can distinguish lexically perturbed text from original text. We also propose a new fine-grained multilingual image captioning evaluation set, M-FineCap3k, for use in perturbation robustness evaluation. Experimental results on existing datasets and our new dataset show that PR-MCS detects perturbations well and is robust to perturbation in multiple languages. Furthermore, using our M-CapEval1k, we verify that PR-MCS is a useful metric with strong correlation with human judgment.

Limitations

Model bias of machine translation in training
In our study, evaluation sets are created by direct annotation in languages other than English to remove the bias of the machine translation (MT) model in the evaluation phase. However, in the training phase, the dataset is too large to annotate directly in multiple languages. Therefore, the pre-training and fine-tuning sets are translated into the other languages using the MT model, so we have no choice but to depend on the performance of the MT model and cannot fully avoid model bias.
Recently, Reimers and Gurevych (2019) released a Multilingual CLIP based on the ViT-B/32 CLIP model. We constructed an MCS with this model as the backbone and measured human correlation using M-CapEval1k. The results showed a slightly lower but still similar correlation compared to the MCS with our own Multilingual Text Encoder (average Kendall score over the five languages: sentence-transformers-based MCS 0.249, our MCS 0.255). However, as this is a large-scale model, it has the disadvantage of slow inference speed for an automatic metric. We therefore conducted fine-tuning with our more lightweight model, which retains high human correlation. In future work, experiments with this large-scale model would allow additional analysis of our proposed methodology and new metrics through a wider variety of experimental results and interpretations, and would also reveal trends for large-scale models.

Ethics Statement
The annotators for the two newly created datasets (M-FineCap3k and M-CapEval1k) were hired through a data annotation service. The remuneration was calculated according to the annotators' country of residence, as determined by the company. Annotators were asked not to write any toxic content (1. offensive, sexist, or racist comments; 2. toxic words; 3. sexual behavior). All other datasets and models used in the experiments are from publicly available websites or GitHub.

B.1 Reproducibility checklist
Dataset and Source code We provide our pre-training, fine-tuning, and evaluation source code, along with the configuration code for perturbations, as supplementary material. We will publicly release our dataset M-FineCap3k and the full code with weight parameters.
Computing Resources An AMD Ryzen Threadripper 2950X (3.50 GHz) with a GeForce RTX 2080 Ti is used for the experiments. All code is implemented in Python 3.6.15 and PyTorch 1.7.1. Fine-tuning each model takes 5 epochs, at about 6 hours per epoch.

Number of Parameters
The number of parameters of our multilingual CLIP is about 66M, the same as Distil-Multilingual BERT.
Train-Valid-Test split MSCOCO, used for fine-tuning, consists of a 414k training set and a 25k validation set. We split the training set 9:1 and use the two parts for fine-tuning and validation. We also randomly extract 3k samples from the existing validation set and use them as a test set.
The hyperparameters were manually tuned based on effective detection of lexical noise while maintaining high human correlation; the best-performing λ values of the fine-tuning objective are λ1 = 0.1, λ2 = 0.05, λ3 = 0.05. In the main text, we report single-run scores obtained after finding the best-performing parameters.

C M-CapEval1k examples
Examples of M-CapEval1k can be seen in Figure 8 and Figure 9. Native speakers of each language translated into their language while maintaining the score, based on the image, the original caption, and the score assigned to the pair. We use the word transcreate because the captions were translated while maintaining the score, rather than simply translated. The API for collecting the M-CapEval1k dataset is shown in Figure 10, and the instructions for collecting the dataset are shown in Figure 11.

D M-FineCap3k statistics

Table 5 provides detailed statistics for M-FineCap3k, including the dataset size for each language, the average sentence length, and the average numbers of critical objects, backgrounds, and relationships. M-FineCap3k consists of lengthy sentences of approximately 20 word tokens on average. For Japanese, since sentences contain no spacing, sentence length is calculated using a word-extraction tokenizer, and the resulting length is almost the same as at the character level. In addition, there are three to four critical objects in all languages, so each sentence describes the visual content of an image in great detail.

F Perturbed caption examples
Examples of perturbed captions for languages other than English can be seen in Figures 12-15. The critical objects shuffled for in-sentence substitution perturbation are displayed in distinct colors.

H All results tables
MSCOCO

The results for all perturbations across all languages on the MSCOCO 3k evaluation set can be found in Table 7.

Flickr8k
The results for all perturbations across all languages on the Flickr8k evaluation set can be found in Table 8.

VizWiz
The results for all perturbations across all languages on the VizWiz evaluation set can be found in Table 9.

Multi30k
The results for all perturbations across all languages on the Multi30k evaluation set can be found in Table 10.

M-FineCap3k
The results for all perturbations across all languages on the M-FineCap3k evaluation set can be found in Table 11.

I Results for existing datasets in languages other than English

The results for diverse languages on the Composite (Aditya et al., 2015) and Flickr8k_Expert (Hodosh et al., 2013) datasets are shown in Table 6. Given that these benchmark datasets are originally in English, we employed the MBART-50 machine translation model (Tang et al., 2020) to translate them into various languages. When contrasted with reference-based image captioning metrics, PR-MCS shows a significantly strong correlation across all languages for both datasets, and its performance closely resembles that of MCS in a comparative analysis.

Algorithm 1: Python implementation of perturbation.

Figure 1: An example of a perturbation robustness test. The baseline metric gives similar scores to the original and perturbed captions, but our metric shows a prominent score drop for the perturbed caption, indicating that the perturbation is well detected.

Figure 2: Overall training procedure of PR-MCS. We pre-train the multilingual text encoder with teacher learning. Then, we fine-tune the multilingual text encoder.

Figure 3: Scores of original captions (blue) and perturbed captions (red) with repetitive lexical noise. In all cases, the perturbed captions show no differences from the original captions.
Figure 4: An example of the proposed M-FineCap3k dataset (translation provided for explanation purposes).

Critical Objects: white shirt, grey shorts, golf, green field
Original: A man, wearing a white shirt and grey shorts, is playing golf on a green field with green trees and a blue sky in the background.
Jumble: with green playing blue trees a background. green in shorts, and white the is wearing man, a A and grey on sky a golf shirt field
Removal: man, white and shorts, is playing a green trees a
Repetition: A man, man, wearing a white white shirt and and grey shorts, shorts, is is playing playing golf on a a green field with green green trees and a a blue sky in the background.
Masking: A [MASK] wearing a [MASK] shirt [MASK] grey [MASK] [MASK] [MASK] golf on [MASK] green field with [MASK] [MASK] and [MASK] blue sky in the background.
Substitution: A man, wearing a golf and green field, is playing white shirt on a grey shorts with green trees and a blue sky in the background.

Figure 5: Example of perturbed captions of M-FineCap3k in English. The critical objects are shuffled for in-sentence substitution.

Figure 7: M-FineCap3k evaluation set examples for each language.

Examples (the example transcreations are written in English for ease of understanding)

EN Caption: a bird perched on top of a tree
Score (0-1 range): 0.25
Reference Captions:
- Two birds going up the back of a giraffe.
- Two birds sitting on the back of a giraffe.
- Two birds are sitting on a wall near the bushes.
- A large giraffe that is walking by some trees.
- Two birds standing on the back of a giraffe.
Good Transcreation: a bird sitting on top of a tree
Bad Transcreation: Two birds sitting on a tree. (Reason: The number of birds written incorrectly in the original caption has been corrected.)

EN Caption: a person holding a cell phone in their hand
Score (0-1 range): 0.8
Reference Captions:
- A smart phone being held up in front of a laptop.
- The person is holding his cell phone while on his laptop.
- A hand holding a cellphone with a laptop in the background.
- IPhone with a screen full of icons in front of a laptop.
- someone holding a cell phone in front of a laptop
Good Transcreation: a person holding a mobile phone in their hand
Bad Transcreation: a young girl is holding a cell phone (Reason: Incorrect or unknown information (a young girl) is added which is not included in the original caption.)

EN Caption: a tray of food with a bunch of bananas.
Score (0-1 range): 0.4
Reference Captions:
- A meal on an airplane of cereal, milk and fruit.
- A tray covered in food on top of a table.
- An airline lunch tray filled with healthy food.
- A tray with breakfast of orange juice, cereal with milk, a banana and a piece of bread.
- Breakfast on the train prepares the worker for the day ahead.
Good Transcreation: a tray of food items with a bunch of bananas.
Bad Transcreation: a tray of food including a banana. (Reason: Incorrect information in the original caption (a bunch of bananas) is revised correctly.)

EN Caption: a baby brushing his teeth with a toothbrush.
Score (0-1 range): 0.85
Reference Captions:
- The baby sits on the furniture holding a toothbrush in his mouth.
- The young baby is sticking a toothbrush in his mouth.
- A baby is sticking a toothbrush in its mouth.
- A baby playing with a white toothbrush in its mouth.
- A baby boy has a toothbrush in his mouth.
Good Transcreation: a baby brushing his teeth with a toothbrush.

Figure 10: M-CapEval1k collection API.

French

Critical Objects: chat noir, chat blanc, matelas pour animaux de compagnie blanc et rose
Original: Un chat noir et un chat blanc allongés ensemble sur un matelas pour animaux de compagnie blanc et rose posé sur un tapis blanc.
Jumble: pour Un compagnie un blanc un sur chat blanc sur rose et matelas de allongés un blanc. posé animaux et noir tapis ensemble chat
Removal: Un et un chat blanc allongés un pour animaux compagnie rose posé blanc.
Repetition: Un Un chat noir et et un un chat chat blanc blanc allongés allongés ensemble sur un un matelas pour pour animaux animaux de compagnie compagnie blanc et rose rose posé posé sur un tapis blanc.

Table 2: Experiment results. Each value is the average for each perturbation. For all five datasets, PR-MCS outperforms the baselines for all languages, and performance further increases after additional M-FineCap3k fine-tuning in the few-shot setting.

Table 4: Kendall tau-c (τc) correlations with human judgment for existing English datasets.

Table 6: Kendall tau-c (τc) correlations with human judgment for existing datasets in languages other than English.