Parameter-Efficient Cross-lingual Transfer of Vision and Language Models via Translation-based Alignment

Pre-trained vision and language models such as CLIP have witnessed remarkable success in connecting images and texts, with a primary focus on English texts. Despite recent efforts to extend CLIP to support other languages, disparities in performance among languages have been observed due to uneven resource availability. Additionally, current cross-lingual transfer methods for these pre-trained models would consume excessive resources for a large number of languages. Therefore, we propose a new parameter-efficient cross-lingual transfer learning framework that utilizes a translation-based alignment method to mitigate multilingual disparities and explores parameter-efficient fine-tuning methods for cross-lingual transfer. Extensive experiments on the XTD and Multi30K datasets, covering 11 languages under zero-shot, few-shot, and full-dataset learning scenarios, show that our framework significantly reduces the multilingual disparities among languages and improves cross-lingual transfer results, especially in low-resource scenarios, while keeping and fine-tuning only an extremely small number of parameters compared to the full model (e.g., our framework requires only 0.16\% additional parameters of a full model for each language in the few-shot learning scenario). The codes are available at \url{https://github.com/eric-ai-lab/PECTVLM}.


Introduction
Cross-lingual transfer of models facilitates the transfer of learned representations or knowledge from one language to another. It plays a vital role in enhancing performance on target languages where labeled data and linguistic resources are particularly limited. Cross-lingual transfer has found applications in various NLP tasks, including sentiment classification (Chen et al., 2018), dependency parsing (Ahmad et al., 2018), named entity recognition (Rahimi et al., 2019), question answering (Lewis et al., 2019), and dialogue (Schuster et al., 2018), among many others. Recent advancements, such as XLM-R (Conneau et al., 2019), mBART (Liu et al., 2020), and mT5 (Xue et al., 2020), have extended the capabilities of large language models in a multilingual manner, enabling them to comprehend and process multiple languages concurrently.
The two-stream vision-language pre-trained model CLIP (Radford et al., 2021) has demonstrated remarkable performance in image-text retrieval (Cao et al., 2022) by encoding images and text into a shared representation space. However, it primarily focuses on English and cannot comprehend other languages. To address this limitation, Multilingual-CLIP (Carlsson et al., 2022) has been proposed to enhance CLIP's ability to support multiple languages through cross-lingual transfer. Nevertheless, Multilingual-CLIP treats English as a pivot language, leading to performance disparities across languages, especially low-resource ones. While previous work (Wang et al., 2022) has assessed and highlighted this multilingual disparity, there is currently a lack of proposed solutions to address it.
Multilingual models often encounter a performance trade-off across different languages (Xin et al., 2022), in the sense that overfitting the model to one language may degrade its performance in another. This can be a significant issue, as the need to train and maintain separate models for each language becomes resource-intensive when dealing with a large number of languages.
The goal of this paper is to address the multilingual disparity in a parameter-efficient manner.
To achieve this, we introduce a framework that extends the capabilities of the Multilingual-CLIP model. More specifically, within this framework, we propose a translation-based alignment method that effectively minimizes the gap between translated and natural language distributions. This alignment method plays a crucial role in significantly reducing the multilingual disparity exhibited by Multilingual-CLIP. Artetxe et al. (2023) also point out the importance of machine translation in classification tasks. Additionally, we adopt Parameter-Efficient Fine-tuning (PEFT) methods (Houlsby et al., 2019; Karimi Mahabadi et al., 2021; He et al., 2022a; Rücklé et al., 2020; Li and Liang, 2021; Guo et al., 2020; Hu et al., 2021; Zaken et al., 2021; Lester et al., 2021a) as a solution to achieve parameter efficiency. Furthermore, we find that, in the zero-shot scenario, hard prompts can also reduce the multilingual disparity and improve multilingual ability in addition to providing parameter efficiency. Compared with full-model fine-tuning on each language, our framework mitigates the multilingual disparity and obtains higher average performance across all languages, using far fewer additional parameters than a single model.
We conduct our experiments on the XTD and Multi30K datasets covering 11 languages in zero-shot, few-shot, and full-dataset learning scenarios. Through extensive analytical experiments, we verify the effectiveness of our framework and provide answers to our research questions. Based on the results of our experiments, we conclude the following: 1. The Multilingual-CLIP model achieves better performance than the original CLIP model, but still suffers from a significant multilingual disparity. Meanwhile, we find that machine translation can map the distribution of text embeddings to a better initialization and reduce the multilingual disparity. (Section 5.2) 2. Mapping the distribution of text embeddings to a better initialization and approximating the natural pivot language distribution as a better target can significantly help reduce the multilingual disparity. (Section 5.3) 3. PEFT methods can address the excessive resource consumption of Multilingual-CLIP with acceptable performance degradation. Moreover, we find that hard prompts in English are very effective in the zero-shot learning scenario and can be applied to all languages. (Section 5.4)

Background
Multilingual-CLIP CLIP (Radford et al., 2021), proposed by OpenAI, is a two-stream vision-language pre-trained model with a textual and a visual encoder. It is trained on a large-scale image-text pair dataset using a contrastive loss to encode images and texts into a shared embedding space. CLIP calculates the cosine similarity between image and text features to measure their semantic similarity.
Recently, Multilingual-CLIP (Carlsson et al., 2022) extended CLIP to a multilingual version. This work replaces the original English text encoder with a pre-trained multilingual language model such as M-BERT (Devlin et al., 2018) and trains it via teacher learning (Hinton et al., 2015). Although Multilingual-CLIP endows CLIP with multilingual capabilities, its performance in other languages is worse than in English due to the limited amount of data available in low-resource languages, leading to insufficient training in these languages. Furthermore, training data for other languages are translated from English text, which can result in a distribution gap between training and practical application. Noticing this problem, we aim to reduce this multilingual disparity in this paper.
Parameter-Efficient Fine-tuning As the size of foundation models (Bommasani et al., 2021) increases, fine-tuning and saving the entire model becomes very resource-intensive. Many parameter-efficient fine-tuning (PEFT) methods have been proposed to solve this issue. These approaches add additional parameters inside the model (Houlsby et al., 2019; Karimi Mahabadi et al., 2021; He et al., 2022a; Rücklé et al., 2020; Li and Liang, 2021), optimize a small portion of the parameters or their low-rank decomposition matrices (Guo et al., 2020; Hu et al., 2021; Zaken et al., 2021), or add trainable token embeddings to the input (Lester et al., 2021a). Moreover, He et al. (2021) and Ding et al. (2022) analyze and combine these approaches from a unified perspective. Furthermore, Hu et al. (2022) and Zhang et al. (2022) propose automatic methods to search for an optimal combination of these PEFT methods for language models and visual models, respectively. Many works (Gao et al., 2021; Zhou et al., 2022; Zhang et al., 2021; He et al., 2022b) also apply PEFT methods to CLIP models. Nevertheless, these PEFT methods have not been thoroughly explored for the Multilingual-CLIP model in the cross-lingual transfer setting. It is important to note that PEFT methods often result in performance declines of varying degrees compared to full-model fine-tuning. Therefore, it is essential to conduct experiments to verify the effectiveness of these methods and determine the most appropriate approach for specific tasks and models.

Framework
Our main contribution is a cross-lingual transfer framework for Multilingual-CLIP (Figure 1). In this framework, we propose a novel translation-based cross-lingual alignment method that reduces the multilingual disparity, and we exploit parameter-efficient tuning methods to solve the resource consumption problem in cross-lingual transfer.

Insights of Our Method
Our method is grounded in experimental results (Table 2) and motivated by the work of Wang et al. (2022). Table 2 shows that machine translation reduces M-CLIP's multilingual disparity and improves its performance, with English consistently performing the best. Wang et al. (2022) demonstrate that text embeddings are more closely aligned in the translated portion. Our hypothesis is that English text yields embeddings of better quality owing to rich language resources, and thus becomes a good target for alignment. Additionally, translation can bring other languages' embeddings to a better initialization for further alignment with English.

Translation-based Cross-lingual Alignment
In Figure 1(a), we present a diagram of the translation-based cross-lingual alignment method. The blue circles represent the distribution of natural language embeddings in the representation space, while the orange ones represent the embedding distribution of text generated by machine translation. Since the pre-training of Multilingual-CLIP involves aligning English text with text translated into other target languages, there exists a disparity in the text distribution between training and real-world usage. This gap varies among languages and contributes to the multilingual discrepancy. In our approach, we aim to minimize this distribution gap by employing machine translation to map one embedding distribution to another. We propose an alignment method using pivot-target language text pairs, which describe the same image in both the pivot (English) and target languages.
Our alignment method admits different combinations of routines and loss functions, which we compare. Alignment routines. The term "alignment routine" refers to which two distinct embedding types we align. Given the pivot language (English) and a target language, there are three routines in the representation space for narrowing the gap between embedding distributions, as shown in the left part of Figure 1. Specifically, these routines are (1) aligning original English and target language text embeddings; (2) aligning translated English (to target) and original target language text embeddings, since Multilingual-CLIP is pre-trained on target languages only in the translation distribution, where it performs better than on the natural language distribution; and (3) aligning original English and translated target language (to English) text embeddings. We compare these three routines in the experiments and find that routine 3 performs best. Note that these routines do not apply to the pivot language (English), and routine 3 still needs machine translation during inference.
Alignment loss functions. In addition to the alignment routines, alignment loss functions must also be considered. Mean Squared Error (MSE) loss and contrastive loss are two practical loss functions for narrowing the distance between embeddings.
To be specific, the original contrastive loss between image and text embeddings can be written as:

$L_{i2t} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(\mathrm{sim}(v_i, t_i)/\tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(v_i, t_j)/\tau)}, \quad L_{t2i} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(\mathrm{sim}(t_i, v_i)/\tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(t_i, v_j)/\tau)}, \quad L_{CL} = \frac{1}{2}(L_{i2t} + L_{t2i}),$

where $v_i$ is the visual embedding of the image in the $i$-th pair, $t_j$ represents the textual embedding of the text in the $j$-th pair, and $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity. We use i2t and t2i to denote image-to-text and text-to-image matching. $\tau$ is the temperature used to scale the cosine similarity; following CLIP (Radford et al., 2021), it is set to 0.01. $N$ is the number of image-text pairs in the dataset. Pivot and target language text embeddings are obtained through the text encoder: $t^{pivot}_i = E_{text}(x^{pivot}_i)$ and $t^{target}_i = E_{text}(x^{target}_i)$. Note that texts can be translated from one language to another, e.g., $x^{target \rightarrow pivot}_i = \mathrm{MT}(x^{target}_i)$. The two losses that regularize the distance between parallel text embeddings can be represented as the MSE loss, $L_{MSE} = \frac{1}{N} \sum_{i=1}^{N} (t^{pivot}_i - t^{target}_i)^2$ (7), or a contrastive loss of the same form as above applied between $t^{pivot}$ and $t^{target}$, where $t^{pivot}_i$ and $t^{target}_i$ refer to the $i$-th text embedding in the pivot (English) and target language. The alignment loss is added to the contrastive loss between images and texts with an alignment coefficient $\lambda$: $L = L_{CL} + \lambda L_{align}$ (9).
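As a minimal illustration of these objectives, the losses above can be sketched in plain Python, with embeddings as lists of floats (a real implementation would operate on framework tensors, and which embeddings play the pivot and target roles depends on the chosen alignment routine):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(a * a for a in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(img_embs, txt_embs, tau=0.01):
    """Symmetric image-text InfoNCE loss, L_CL = (L_i2t + L_t2i) / 2."""
    n = len(img_embs)
    sims = [[cosine(v, t) / tau for t in txt_embs] for v in img_embs]

    def nce(rows):
        # -log softmax of the matched (diagonal) pair, averaged over the batch
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[i]
        return total / n

    i2t = nce(sims)
    t2i = nce([list(col) for col in zip(*sims)])  # transpose: text-to-image
    return 0.5 * (i2t + t2i)

def mse_alignment_loss(pivot_embs, target_embs):
    """Eq. (7): mean squared distance between parallel text embeddings."""
    n = len(pivot_embs)
    return sum(
        sum((p - t) ** 2 for p, t in zip(pe, te))
        for pe, te in zip(pivot_embs, target_embs)
    ) / n

def total_loss(img_embs, txt_embs, pivot_embs, lam=1.0):
    """Eq. (9): image-text contrastive loss plus lambda-weighted alignment."""
    return contrastive_loss(img_embs, txt_embs) + lam * mse_alignment_loss(
        pivot_embs, txt_embs
    )
```

When the pivot and target embeddings coincide, the alignment term vanishes and only the image-text contrastive term remains.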

Parameter-Efficient Cross-lingual Transfer Learning
Our framework utilizes Parameter-Efficient Fine-tuning (PEFT) methods to solve the resource consumption problem. We compare the following PEFT methods with full-model fine-tuning. When training these PEFT modules, we freeze the parameters of Multilingual-CLIP. PEFT methods are usually designed for different tasks; we instead use them for different languages to achieve parameter efficiency.
Adapter (Houlsby et al., 2019): Figure 1(c) top right. Adapter adds a small trainable linear module after every attention and feed-forward layer. It consists of a down-sampling matrix W_down ∈ R^{d×r} and an up-sampling matrix W_up ∈ R^{r×d} with a nonlinear activation function f in the middle, where d is the dimension of the adapter input x ∈ R^d. With a residual connection, the output O can be written as O = x + f(x W_down) W_up (10).
Compacter (Karimi Mahabadi et al., 2021): Figure 1(c) center right. An improvement over Adapter, which replaces the standard Adapter layer with a low-rank hypercomplex Adapter layer that requires fewer parameters and yields competitive results. Specifically, Compacter decomposes W_down into the sum of k Kronecker products of matrices, where r_B represents the rank of B_i; W_up is decomposed in the same way with shared A_i.
LoRA (Hu et al., 2021): Figure 1(c) bottom right. LoRA assumes a low-rank change of the model weights W ∈ R^{d×k} and uses two trainable rank-decomposition matrices W_A ∈ R^{d×r} and W_B ∈ R^{r×k} to approximate the weight change, adding x W_A W_B to the original output O for input x. Following the default setting of LoRA, we apply this method to the query and value projection matrices (W_q, W_v) in self-attention layers.
Hard and soft prompts (Lester et al., 2021b): Figure 1(b) bottom. A hard prompt attaches a text prompt to the front of the input text (e.g., "a photo of [Text]"); it is manually designed and explainable. CLIP uses multiple hard prompts in the pre-training phase, so we are interested in whether they are applicable in cross-lingual transfer scenarios. We compare different combinations of hard prompts and input texts in different languages, and find that English hard prompts work well on average across all languages. A soft prompt, also known as prompt tuning, adds trainable token embeddings to the front of the input. We do not plot soft prompts in our framework as our experiments show they do not perform well.
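A minimal sketch of the Adapter and LoRA forward passes described above, in plain Python with matrices as nested lists (real implementations use framework tensor operations; the shapes here are illustrative):

```python
def matmul(x, w):
    """Multiply a row vector x (length d_in) by a d_in x d_out matrix w."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def relu(v):
    return [max(0.0, a) for a in v]

def adapter_forward(x, w_down, w_up):
    """Bottleneck adapter with residual connection: O = x + f(x W_down) W_up."""
    hidden = relu(matmul(x, w_down))  # down-project to rank r
    return [xi + hi for xi, hi in zip(x, matmul(hidden, w_up))]

def lora_forward(x, w, w_a, w_b):
    """LoRA: frozen projection x W plus low-rank update x W_A W_B."""
    base = matmul(x, w)
    delta = matmul(matmul(x, w_a), w_b)
    return [b + d for b, d in zip(base, delta)]
```

With W_up (Adapter) or W_A (LoRA) initialized to zero, both modules start as identity perturbations of the frozen model, which is the standard initialization choice.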
In our experiments, we also tune the linear head and the layer-norm layers of the text encoder when training Adapter, Compacter, and LoRA; the numbers of their parameters are 0.44%, 0.05%, and 0.16% of the text encoder, respectively. Hard prompts only require saving a few words, while soft prompts require storing several token embeddings, each of which takes up 1024 floating-point numbers in storage space.

Optimization Objective
Our optimization objective is to find the optimal parameters of the language-specific PEFT modules for each target language by minimizing the loss L.
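In practice this amounts to freezing the base model and optimizing only the PEFT parameters. A toy sketch, assuming a dict of named parameters standing in for a framework's `model.named_parameters()` (the keyword list is illustrative, not the released code's exact naming):

```python
def select_trainable(named_params, keywords=("adapter", "lora", "layer_norm", "head")):
    """Freeze every parameter of the base model except the language-specific
    PEFT modules (plus the linear head and layer norms, following the setup
    described above). Returns the dict of parameters to pass to the optimizer."""
    trainable = {}
    for name, param in named_params.items():
        param["requires_grad"] = any(k in name for k in keywords)
        if param["requires_grad"]:
            trainable[name] = param
    return trainable
```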

Experimental Setup
Dataset We work with two datasets: (1) Flickr30K (Young et al., 2014) is an English image captioning dataset, split into train/dev/test sets of 29000/1024/1000 images. The Multi30K (Elliott et al., 2016) dataset extends the captions of Flickr30K with human-translated and independently collected German sentences, and Elliott et al. (2017) and Barrault et al. (2018) further extend the Flickr30K captions to French and Czech, respectively. (2) XTD (Aggarwal and Kale, 2020).
Base Model and Translation Tool We use XLM-R Large-ViT L/14 (Carlsson et al., 2022) as our base model. The model keeps the original visual encoder of OpenAI ViT L/14 (Radford et al., 2021) fixed and replaces the text encoder with XLM-Roberta Large (Conneau et al., 2019) trained by teacher learning (Hinton et al., 2015). We use Google Translation 1, a strong Neural Machine Translation (NMT) system, to translate between all the different languages.
To reduce computational overhead, we translate the datasets in advance rather than at use time. We provide the translation results together with the code. An analysis of different machine translation tools can be found in Appendix B.
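A sketch of this pre-translation step, with a placeholder `translate` function standing in for the actual NMT system and a cache format that is an assumption, not the released code's:

```python
import json
import os

def translate(text, src, tgt):
    """Placeholder for a real NMT call (e.g., an online translation API)."""
    return f"[{src}->{tgt}] {text}"

def pretranslate_dataset(captions, src, tgt, cache_path):
    """Translate every caption once and cache the result on disk, so later
    runs reuse the cached translations instead of re-querying the NMT system."""
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)
    translated = [translate(c, src, tgt) for c in captions]
    with open(cache_path, "w") as f:
        json.dump(translated, f)
    return translated
```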

Experiments and Analysis
In this section, we first present the overall results of our framework with the optimal combination, and then conduct analytical experiments to demonstrate that machine translation is effective (Section 5.2), that routine 3 with MSE loss is the best choice for alignment (Section 5.3), and that hard prompt, LoRA, and Adapter perform best in the zero-shot, few-shot, and full-dataset scenarios, respectively (Section 5.4). The details of our experimental configurations are in Appendix A.

Cross-Lingual Transfer Results on XTD and Multi30K
Table 1 shows the results of Multilingual-CLIP with and without our framework on the XTD and Multi30K datasets in zero-shot, few-shot, and full-dataset scenarios. We report the Recall@1 score on the English dataset and the average score across all other languages to evaluate multilingual performance, and we calculate the standard deviation and range as statistical indicators of the multilingual disparity.
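These statistics can be computed directly; a small helper is sketched below, using hypothetical per-language Recall@1 scores rather than the paper's actual numbers:

```python
import statistics

def disparity_metrics(recall_at_1):
    """recall_at_1 maps language code -> Recall@1 (%). English is reported
    separately; disparity statistics are computed over the other languages."""
    non_en = [v for k, v in recall_at_1.items() if k != "en"]
    return {
        "en": recall_at_1["en"],
        "avg_without_en": statistics.mean(non_en),
        "std": statistics.stdev(non_en),
        "range": max(non_en) - min(non_en),
    }
```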
Compared to the vanilla Multilingual-CLIP, our framework performs better in all zero-shot, few-shot, and full-dataset scenarios. It reduces the range by more than 5 points and the standard deviation by more than 1.5 points while achieving significant performance improvements both in English and on average across all other languages on the XTD dataset in both zero-shot and few-shot scenarios, which are common application scenarios for low-resource languages. The improvement on the Multi30K dataset is also significant. In terms of the number of parameters, our framework is also more efficient than full-model fine-tuning: across eleven languages, Adapter and LoRA use only 4.89% and 1.73% of the parameters respectively, far less than the parameters of 11 full models.
To sum up, our framework significantly reduces the multilingual disparity and enables parameter-efficient cross-lingual transfer, with a byproduct of improved multilingual performance.

Analysis on Multilingual Disparity of Multilingual CLIP
Considering that Multilingual-CLIP replaces the text encoder with a multilingual version while keeping the image encoder of CLIP, we directly compare Multilingual-CLIP to the CLIP model equipped with machine translation as a strong baseline (Jain et al., 2021). We compare the performance of the CLIP and Multilingual-CLIP models on the XTD dataset with the help of machine translation: we utilize machine translation to convert non-English corpora into English, and this translated version is then employed as input for both CLIP and Multilingual-CLIP. Additionally, we translate the English dataset into each respective language and evaluate with Multilingual-CLIP.

Result analysis. As shown in Table 2, we observe that although the original CLIP has limited multilingual capabilities, with the help of machine translation it can achieve high multilingual capability to a certain extent and even surpass Multilingual-CLIP. However, Multilingual-CLIP with machine translation obtains the best multilingual capability and the lowest disparity in both scenarios. We did not use the setting "M-CLIP (en→tgt)" as a comparison as it is not a practical application scenario. On the other hand, there is a large multilingual disparity for Multilingual-CLIP: for example, the difference between Japanese and English is up to 16% and the standard deviation is up to 4.7%. Aided by machine translation, the improvement of Multilingual-CLIP on multilingual disparity is very obvious. Finally, using data translated from English, Multilingual-CLIP shows a large improvement in other languages (a mean improvement of 1.6%) but still falls short of English. This may be because the model is pre-trained with text translated from English, making it more adapted to this situation. This also suggests that the multilingual disparity is partly the result of differences in dataset quality across languages rather than the ability of the model.
Remark. In terms of the multilingual representation space, machine translation serves the purpose of mapping text from one embedding distribution to another. By mapping text into English, we achieve a more favorable initialization of the distribution for subsequent alignment. While the embedding distribution of translated text may differ slightly from that of natural language, this disparity is significantly smaller than the gap between two distinct languages. Consequently, it becomes easier to optimize and narrow the gap between these distributions.

Analysis on How to Exploit Pivot Language
Datasets in low-resource languages are usually small, and we can obtain the corresponding pivot language (English) text from target-language text through human annotation. In particular, for image caption datasets, annotators can directly give high-quality English captions based on the image without mastering other languages. For (relatively) high-resource languages, parallel text is also a source of texts with the same meaning in different languages. We call these texts pivot-target language text pairs. Since the model has higher performance on the pivot language and these pivot-target text pairs provide more information, there must be a better approach to exploiting the pivot language for cross-lingual transfer of Multilingual-CLIP.

Table 2: Recall@1 in percentage on the image-text retrieval XTD dataset. We compare CLIP and Multilingual-CLIP with machine translation as a tool in zero-shot and few-shot scenarios. We average the scores to obtain the overall performance across languages, and evaluate the multilingual disparity with standard deviation and range. M-CLIP is short for Multilingual-CLIP and MT is short for machine translation. We bold the best scores for zero-shot and few-shot respectively. "Avg.-en" represents the average score without English and "en→tgt" means data in other languages is translated from the English set.

Table 3: We compare different alignment routines for parallel corpora on the XTD and Multi30K datasets. We report the English score once, as these combinations cannot be applied to the English set. CL is short for contrastive loss. We bold the best results on each dataset.
Comparison of different alignment routines and loss functions. In Section 3.2, we mention three alignment routines and two alignment loss functions. We compare their different combinations against two baselines without alignment. In the first baseline, we directly apply a contrastive loss between images and texts in the target language within a mini-batch.
For the second baseline, we additionally translate all texts to English beforehand on top of the first baseline.
We conduct experiments on XTD and Multi30K in zero-shot, few-shot, and full-dataset learning scenarios.
As shown in Table 3, we first compare different routines combined with the MSE loss and find that routine 3, translating the target language into English and aligning the natural and translated English embedding distributions, performs best. "Natural" here means generated by humans rather than by machine translation. We then compare the different alignment loss functions with routine 3 and find that the MSE loss performs best. Ultimately, routine 3 combined with the MSE loss performs best on all three metrics. This can be explained by the facts that Multilingual-CLIP uses an MSE loss for text-text pairs in the pre-training stage, and that the natural English embedding distribution is a better target distribution, i.e., one with higher performance. Meanwhile, machine translation maps the target language embedding distribution close to this optimal distribution where Multilingual-CLIP performs best, making it easier to optimize.

Analysis on Parameter-Efficient Cross-Lingual Transfer Learning
In this section, we first evaluate the performance of hard prompts. Then we compare Adapter, Compacter, and LoRA, and discuss the feasibility of using these methods for cross-lingual transfer.

Hard Prompt
We primarily focus on the prompting method, in particular the hard prompt method, which demonstrates remarkable capabilities in zero-shot learning scenarios where fine-tuning model parameters on domain-specific data is unfeasible. In a multilingual setting, we must take two factors into account. First, prompts can be constructed in multiple languages, potentially leading to performance disparities across languages. Second, the text input can be automatically translated into any other language using machine translation. Therefore, we explore the following combinations: (1) simply prepend an English prompt to the text; (2) translate the optimal English prompt into the target languages and prepend the translated prompts to their respective texts; (3) translate all text inputs to English and prepend English prompts.

Result analysis. From the results in Table 4, comparing different hard prompts, we find that "a photo of" performs best. However, translating the English prompt into the target languages makes performance decrease slightly. Finally, we obtain the best performance by translating all text inputs into English and prepending the best English prompt.
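The three combinations can be sketched as follows; `translate` is a stand-in for the MT system (identity by default) and the prompt string follows the best-performing template above:

```python
def build_input(text, lang, combination, prompt_en="a photo of",
                translate=lambda s, src, tgt: s):
    """Assemble the model input for the three prompt/text combinations."""
    if combination == 1:   # English prompt + original target-language text
        return f"{prompt_en} {text}"
    if combination == 2:   # prompt translated into the target language
        return f"{translate(prompt_en, 'en', lang)} {text}"
    if combination == 3:   # English prompt + text translated into English
        return f"{prompt_en} {translate(text, lang, 'en')}"
    raise ValueError(f"unknown combination: {combination}")
```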
Comparison with soft prompt. We also conduct experiments on soft prompts based on the third combination: initializing the prompt from the best hard prompt, which results in 3 trainable token embeddings, and randomly initializing 20 token embeddings. With 50 training instances in the few-shot scenario, the model obtains marginal improvement or even performance degradation with these templates.
As a result, we do not incorporate soft prompt in our framework.

Other PEFT Methods
In this section, we compare three popular PEFT methods, Adapter, Compacter, and LoRA, with full-model fine-tuning. Following He et al. (2022a), we unfreeze the linear head and assign the same learning rate as in fine-tuning. The number of parameters of Adapter, Compacter, and LoRA is only 0.45%, 0.05%, and 0.16% of Multilingual-CLIP's text encoder, respectively. Thus, even if we assign separate parameters to each of 100 languages, the total number of parameters is smaller than that of a single model. As a result, even though we use language-specific modules, our method still achieves parameter efficiency and outperforms full-model fine-tuning. From the results shown in Table 5, we find that Adapter performs best in the full-dataset scenario, preserving 99.7% of the fine-tuning performance, and LoRA performs best in the few-shot scenario, achieving almost the same performance as fine-tuning. We further adopt Adapter and LoRA with the best combination discussed in Section 5.3, which forms part of our final framework, and find that this combination obtains better performance than full-model fine-tuning. This indicates that, with the help of PEFT methods, our framework achieves parameter efficiency without losing much performance.
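The parameter arithmetic can be checked with back-of-the-envelope numbers; the shapes below (model dimension 1024, 24 layers, rank 8, roughly 550M encoder parameters) are assumptions about an XLM-R Large-sized encoder, not figures reported by the paper:

```python
def lora_param_fraction(d_model, n_layers, rank, full_params):
    """LoRA on W_q and W_v: each target matrix gets W_A (d x r) and
    W_B (r x d) per transformer layer."""
    lora_params = n_layers * 2 * (d_model * rank + rank * d_model)
    return lora_params / full_params

# Illustrative shapes (assumed, not the paper's exact configuration).
fraction = lora_param_fraction(1024, 24, 8, 550_000_000)
```

Under these assumed shapes the LoRA parameters come out well below one percent of the encoder, which is consistent in spirit with the sub-percent figures reported above.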

Summary of PEFT Methods
In summary, this section presents three primary contributions of PEFT methods: (1) We conducted a comprehensive investigation of PEFT methods in cross-lingual transfer learning scenarios using the M-CLIP model.Through this study, we discovered novel insights.For instance, in multimodal and cross-lingual transfer settings, we found that adapters are often not the optimal solution, and employing English prompts tends to yield superior performance compared to multilingual prompts.
(2) Our framework demonstrated improved parameter efficiency compared to full-model fine-tuning, especially when considering the vast array of available languages.
(3) We achieved enhanced transfer learning performance, a significant accomplishment considering that PEFT methods generally exhibit lower performance than full-model fine-tuning in other research works.

Conclusion
In this paper, we propose a framework that significantly mitigates the multilingual disparity of Multilingual-CLIP in a parameter-efficient manner in zero-shot, few-shot and full-dataset scenarios.
Our framework uses a translation-based alignment method and adopts parameter-efficient tuning methods. Analytical experiments indicate that machine translation is effective for cross-lingual transfer; exploiting the pivot language can help reduce the disparity; and parameter-efficient tuning methods reduce resource consumption without much performance degradation.

Limitations
Our work primarily focuses on addressing the multilingual disparity by improving the multilingual text encoder in the CLIP-like framework. However, it is very possible that the visual encoder can also be enhanced with image and text data from diverse cultures and languages. During the pre-training process, the original Multilingual-CLIP's visual encoder is directly aligned with English corpora only, and connected with other languages by using English as a pivot language. We expect future work to address the multilingual disparity problem from the perspective of a more powerful visual encoder.

Broader Impact
This work provides a framework for cross-lingual transfer in the few-shot learning setting. The deployment of our method has the potential to mitigate the performance disparity of state-of-the-art multimodal models for scarce-resource languages. However, we note that our method relies on the collection of parallel corpora, either collected from online machine translation systems or from native human speakers. Our work does not thoroughly scrutinize whether these parallel corpora contain implicit social biases along different dimensions, such as race, gender, and religion. When these parallel corpora contain unexpected biases or stereotypes, it is likely that a model learned from such data may perpetuate biases that we did not foresee.

Figure 1: Illustration of our framework based on Multilingual-CLIP. We propose a translation-based alignment method to narrow the distribution gap and adopt PEFT methods to achieve parameter efficiency. We also find that hard prompts are very effective in the zero-shot scenario.
Figure 1(c): Illustration of the PEFT modules: Adapter (top right), Compacter (center right), and LoRA (bottom right).

1 https://translate.google.com/

Table 4: Results on the XTD dataset for comparison of different combinations of prompts and texts.

Table 5: Comparison of different PEFT methods. We care about the degree of performance degradation caused by each PEFT method. We bold the best score among the three PEFT methods, and bold the score of our framework when it is better than full-model fine-tuning. Our framework updates 0.16% and 0.45% of the parameters for the few-shot and full-dataset scenarios, respectively. The non-"Our framework" rows show results without translation alignment.