Efficient Multilingual Language Model Compression through Vocabulary Trimming



Introduction
Multilingual language model (LM) pre-training (Devlin et al., 2019; Conneau et al., 2019; Liu et al., 2020; Xue et al., 2021) has been shown to be an efficient mechanism for storing information from many languages in a single model, without the need to train multiple language-specific models. Moreover, it has proven reliable for cross-lingual tasks (Pires et al., 2019; Conneau and Lample, 2019) and can provide competitive performance in most settings, generally similar to its monolingual counterparts (Goyal et al., 2021), while being less affected by culturally-dependent biases (Ahn and Oh, 2021). As with monolingual models, multilingual LMs can be used for zero/few-shot learning (Scao et al., 2022) by increasing the model size and, more frequently, can be specialized to different tasks by fine-tuning on task-specific data. In practice, however, training multilingual LMs raises a few practical issues, such as the curse of multilinguality (Conneau et al., 2019; Pfeiffer et al., 2022), a trade-off between the number of languages and the performance in each individual language, or multilingual vocabulary construction, which requires careful design for better generalization (Chung et al., 2020; Zheng et al., 2021; Liang et al., 2023).
Besides such generalization concerns, multilingual LMs usually have more parameters than their monolingual counterparts due to the need for a large vocabulary covering multiple languages. This becomes an important issue in practice when the resources available to host models are limited. For instance, while sharing the same configuration (i.e., the same number of layers and hidden units), T5 SMALL (Raffel et al., 2020) and mT5 SMALL (Xue et al., 2021) have 140M and 300M parameters, respectively. This difference is due solely to their vocabulary sizes: 50K for T5 and 250K for mT5. In fact, the embedding matrix stemming from the LM vocabulary can occupy a large portion of the parameter space. For instance, the ratio of the embedding matrix to the full model's parameter size can be higher than 80% for multilingual LMs, in contrast to monolingual models such as T5 (see Figure 1).
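As a back-of-the-envelope check of this ratio, the sketch below reproduces the parameter counts quoted above. The assumptions are ours, not the paper's: a hidden size of 512 (the "small" configuration of both T5 and mT5) and separate, untied input and output embedding matrices; exact figures vary slightly with the real model configurations.

```python
def embedding_fraction(vocab_size: int, total_params: int,
                       hidden: int = 512, tied: bool = False) -> float:
    """Approximate fraction of parameters taken by the (sub)word embeddings.

    With untied input/output embeddings, both matrices count.
    """
    matrices = 1 if tied else 2
    return matrices * vocab_size * hidden / total_params

# mT5-small: ~250K vocabulary, ~300M total parameters
print(round(embedding_fraction(250_000, 300_000_000), 2))  # ≈ 0.85

# T5-small: ~50K vocabulary, ~140M total parameters
print(round(embedding_fraction(50_000, 140_000_000), 2))  # ≈ 0.37
```

Under these assumptions, over 80% of mT5-small's parameters sit in its embedding matrices, consistent with the figure quoted in the text.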
In this paper, we propose a simple vocabulary trimming (VT) method to remove tokens from the vocabulary of a multilingual LM that are irrelevant to the target language. This is achieved by automatically identifying language-specific tokens from an underlying text corpus. We consider two VT strategies, pre-FT VT (VT before fine-tuning) and post-FT VT (VT after fine-tuning), and analyse them by varying the final vocabulary size. We conduct experiments on two generation tasks, question answering (QA) and question generation (QG), and two classification tasks, sentiment analysis and natural language inference (NLI), across seven different languages. The experimental results show that both pre- and post-FT VT can reduce the model size while retaining the original performance in the generation tasks (QA and QG), and particularly in the classification tasks (sentiment and NLI), where the results are close to identical despite the significant reduction in vocabulary size. In all tasks, the original performance can generally be maintained with less than 40% of the full model parameters for all languages.
Finally, even though pre-trained LMs have reported impressive performance on various NLP downstream tasks (Kenton and Toutanova, 2019; Liu et al., 2019; Conneau et al., 2019), such LMs also exhibit worrying levels of social bias in certain situations (May et al., 2019; Kurita et al., 2019; Kaneko and Bollegala, 2021). One natural question that arises is whether VT influences the bias level in multilingual LMs, including fine-tuned models. For this purpose, we evaluate social bias in multilingual LMs after applying VT in different settings and compare them against their monolingual counterparts. Experimental results show that the monolingual LM tends to contain more bias than its multilingual versions. Moreover, compared to the original multilingual LM, the bias level shows no significant change after applying VT. These results suggest that a monolingual LM can be induced by applying VT to its corresponding multilingual LM, thereby obtaining a monolingual LM that is less biased than its originally pre-trained monolingual counterpart.

Related Work
Several studies have explored the possibility of modifying or adapting the vocabulary of LMs. For instance, Artetxe et al. (2020) and Marchisio et al. (2022) adapted a monolingual LM to another language by learning the embedding matrix on the new language while fixing the other weights. Similarly, Wang et al. (2019) augmented the vocabulary of a multilingual LM with new languages via multilingual word alignment (Lample et al., 2018). Ostendorff and Rehm (2023) used multi-stage fine-tuning to obtain an LM in the target language from an LM in a source language. These prior works modify existing mono/multilingual LMs to include new languages, i.e. augmenting the multilinguality of the LMs. In contrast, our study focuses on compressing multilingual LMs into the target language to effectively achieve smaller monolingual LMs, i.e. reducing the multilingual representation of the LMs while retaining the capability in a specific target language.
The work of Abdaoui et al. (2020) is the most relevant to our study as, to the best of our knowledge, they introduced the idea of VT for the first time. However, their analysis is limited to NLI with pre-fine-tuning VT on mBERT (Devlin et al., 2019) only, as well as a fixed vocabulary size after VT. In contrast, our study compares two VT strategies, before and after fine-tuning, and shows how the latter strategy, not considered in Abdaoui et al. (2020), can be a more effective compression technique in some settings. Furthermore, we extend the experiments to generation tasks as well as classification tasks with more recent LMs such as mBART (Liu et al., 2020) and mT5 (Xue et al., 2021), and provide an exhaustive analysis of the effect of VT.

Vocabulary Trimming
To perform vocabulary trimming (VT), we first need a multilingual LM as input. The idea is to tailor the model to a particular target language l, which in principle belongs to the same set of languages used to train the input multilingual LM. For the target language l, VT first identifies language-specific tokens in a language-specific corpus C_l, and removes all tokens, along with their embeddings, except for those appearing in C_l, as described in Figure 2. In our analysis (§5), we also consider keeping only the top-n most frequent tokens in C_l to further reduce the model size by removing less frequent tokens. We consider two VT strategies: (1) before fine-tuning (pre-FT) and (2) after fine-tuning (post-FT).
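The trimming step above can be sketched as follows. This is a minimal sketch under simplifying assumptions of ours: a plain token-to-id dict stands in for the real subword tokenizer, and the embedding matrix is a NumPy array; serializing a real SentencePiece model and handling tied output embeddings are omitted.

```python
from collections import Counter

import numpy as np

def trim_vocabulary(vocab, embeddings, corpus, tokenize, top_n=None, keep_ids=()):
    """Keep only tokens observed in `corpus`, optionally just the top-n most frequent.

    vocab:      dict token -> row index in `embeddings`
    embeddings: (V, d) array of token embeddings
    corpus:     iterable of strings in the target language (C_l)
    tokenize:   callable str -> list of tokens (the model's tokenizer)
    keep_ids:   ids always retained (e.g. special tokens like <pad>, </s>)
    Returns the remapped vocabulary and the trimmed embedding matrix.
    """
    # Count token occurrences in the language-specific corpus
    counts = Counter(tok for text in corpus for tok in tokenize(text))
    kept = [(tok, n) for tok, n in counts.most_common() if tok in vocab]
    if top_n is not None:
        kept = kept[:top_n]
    # Old row indices to retain, plus the always-kept special tokens
    old_ids = sorted({vocab[tok] for tok, _ in kept} | set(keep_ids))
    id_to_tok = {i: t for t, i in vocab.items()}
    # Reassign contiguous ids and slice the embedding matrix accordingly
    new_vocab = {id_to_tok[i]: new for new, i in enumerate(old_ids)}
    return new_vocab, embeddings[old_ids]

# Toy example (whitespace tokenizer, 5-token vocabulary):
vocab = {"<pad>": 0, "le": 1, "chat": 2, "Katze": 3, "dort": 4}
emb = np.arange(10, dtype=float).reshape(5, 2)
new_vocab, new_emb = trim_vocabulary(
    vocab, emb, ["le chat", "le chat dort"], tokenize=str.split, keep_ids=[0])
# "Katze" never occurs in the corpus and is dropped; the rows of the
# kept tokens are preserved and re-indexed: shape (4, 2) instead of (5, 2).
```

In a real setting, `tokenize` would be the multilingual LM's subword tokenizer and `keep_ids` its special tokens, so the trimmed model remains drop-in compatible for the target language.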
The difference between these two strategies is simply whether VT is performed before or after fine-tuning, as shown in Figure 3. Both have advantages and drawbacks: while pre-FT VT can reduce fine-tuning time, as the trimmed LM is smaller than the original, post-FT VT only needs an already fine-tuned multilingual LM. This way, post-FT VT can be used as a post-processing step, and no additional language-specific training is required.

Experimental Setting
Tasks and datasets. In order to test the efficacy of VT, we consider two generation tasks, question answering (QA) and question generation (QG), and two classification tasks, sentiment analysis and natural language inference (NLI). As the datasets for QA, we use SQuAD (Rajpurkar et al., 2016) (English), Spanish SQuAD (Casimiro Pio et al., 2019) (Spanish), FQuAD (d'Hoffschmidt et al., 2020) (French), Italian SQuAD (Croce et al., 2018) (Italian), JAQuAD (So et al., 2022) (Japanese), KorQuAD (Lim et al., 2019) (Korean), and SberQuAD (Efimov et al., 2020) (Russian). For QG, we use the same datasets adapted for QG via QG-Bench (Ushio et al., 2022). For sentiment analysis, we use Twitter-based datasets for English (Rosenthal et al., 2017), Arabic (Rosenthal et al., 2017), French (Benamara et al., 2017), Italian (Barbieri et al., 2016), German (Cieliebak et al., 2017), Portuguese (Brum and Nunes, 2017), and Spanish (Díaz-Galiano et al., 2018) from UMSAB (Unified Multilingual Sentiment Analysis Benchmark) (Barbieri et al., 2022). All the sentiment analysis datasets contain three labels: positive, neutral and negative. For NLI, we use XNLI (Conneau et al., 2018), a multilingual NLI dataset including English, French, German, Spanish and Arabic, which are the languages included in the sentiment analysis experiment. We fine-tune LMs on the training sets of each language, which were translated automatically from English and released in the original paper.
Evaluation metrics. For the evaluation, we use the following standard metrics: answer span F1 score (Ans-F1) and exact match (EM) for QA; METEOR (MTR) and BERTScore (BS) for QG, which have been shown to be the metrics most correlated with human judgement (Ushio et al., 2022); macro-F1 score for sentiment analysis, following Barbieri et al. (2022); and accuracy for NLI. As the language-specific corpus C_l from which to extract vocabulary counts for VT, we use mC4 (Xue et al., 2021), one of the largest public multilingual corpora.
Base language models. As the base LMs, given computational constraints, we chose the smallest mT5 and mBART to fine-tune on QA and QG, and XLM-R and XLM-V (Liang et al., 2023) for sentiment analysis and NLI. All these models have a vocabulary size of 250K, except for XLM-V, which has a vocabulary of 901K subword tokens. For our experiments, we compare the results of pre/post-FT VT against vanilla LM fine-tuning without VT, which we refer to as No-Trim.
Fine-tuning. For model fine-tuning, we use lmqg (Ushio et al., 2023) for QA/QG, and Ray Tune for sentiment analysis. In both cases, we use the default search space for hyperparameter search. For NLI, we follow the same hyperparameters used in Liang et al. (2023). All the resulting models and code will be released upon acceptance of the paper.

Generation Tasks: QA & QG
Table 1 shows the overall results on QA and QG.
The results confirm that both pre- and post-FT VT can maintain the original performance in most cases, while being smaller than the original models thanks to a significantly reduced vocabulary size. First, post-FT VT achieves at least the same performance as vanilla fine-tuning for all languages and both LMs in QA and QG, except for a few cases such as mBART QA in Korean and mBART QG in Russian, although the decrease is no more than 0.5%. Meanwhile, pre-FT VT outperforms the vanilla fine-tuned model by a relatively important margin in some cases, such as mBART French QA and mT5 Spanish QA. In contrast, there are a few models where pre-FT VT degrades the performance of the original model, such as mT5 QA in Korean (2.6% decrease in Ans-F1) or mBART QA in Russian (3.2% decrease in Ans-F1).
Since we keep all the vocabulary that appears in the language-specific corpus C_l, the percentage of reduced parameters depends on the language. In general, VT can reduce the model size efficiently for the Asian (Japanese/Korean) and European (Spanish/French/Italian) languages considered (50% for mT5 and 70% for mBART), but the size remains high for other languages (English/Russian).

Classification Tasks: Sentiment & NLI
Table 2 shows the results on sentiment analysis and NLI. In this case, post-FT VT robustly preserves the performance of the original No-Trim baseline in both tasks for XLM-R and XLM-V, while retaining no more than 40% of the vocabulary and 60% of the overall parameter size of the original XLM-V and XLM-R models in all the non-English datasets. The XLM-V PT sentiment model is the only post-FT VT case where a slight decrease can be observed (0.1%). On the other hand, the accuracy of pre-FT VT appears to be sensitive to the language and task: it improves the performance in some languages, such as Italian (XLM-R and XLM-V achieve 7.9% and 3.8% increases in sentiment analysis), but decreases the performance by a non-trivial margin in others, such as Arabic, where XLM-R decreases by 5% in sentiment analysis and 2% in XNLI. Since XLM-V has a larger vocabulary, the percentage of parameters removed by VT is more prominent for XLM-V, as seen in Arabic (20.2%) and Portuguese (28.9%), for example.

Vocabulary Size Analysis
Generation Tasks: QA & QG
In our main experimental results (§4.2), all the unique tokens that appeared in the monolingual corpus were kept, which resulted in a low compression ratio for some languages such as English. In this section, we instead keep only the top-n most frequent tokens for varying n. Figure 4 shows the QG and QA results of mT5 with pre/post-FT VT for different vocabulary sizes. Post-FT VT retains the original performance in most of the languages in QG (English, Italian, Korean, and Russian), but the result is sensitive to the choice of n. For example, Japanese/Korean QA and Russian QG with pre-FT VT for top-5K (16% of the original mT5) outperform No-Trim as well as post-FT VT, but, on the contrary, Japanese QG with pre-FT VT is worse for any choice of n. This larger variation in results may also be due to the hyperparameter search space, as the optimal hyperparameters for the original multilingual LM (which is the one trained for post-FT VT) may differ. We leave this extended analysis for future work.

Classification Tasks: Sentiment & NLI
Figure 5 and Figure 6 show the results of XLM-R on sentiment analysis and NLI. In NLI, we can see that both pre- and post-FT VT can reduce the vocabulary to 30K (39% of the original XLM-R in parameter size) without any decrease, except a 0.3% drop for pre-FT VT in German, and there is no decrease of more than 0.4% even with top-15K post-FT VT. In sentiment analysis, pre-FT VT with top-10K (33% of the original XLM-R in parameter size) can retain the accuracy of the No-Trim baseline in French and Italian. Moreover, post-FT VT with 30K can retain the original F1 score without a major drop in sentiment analysis, yet the decrease in F1 score is slightly more prominent than in NLI (1.1% in Arabic sentiment analysis).
The sentiment analysis datasets are collected from Twitter, so a dataset in a single language can contain tokens from other languages (hashtags, named entities, or even code-switching). In contrast, XNLI translates English NLI into other languages, so there is less chance for a dataset to contain tokens from other languages. This can explain the effectiveness of top-n VT in NLI compared with sentiment analysis, as smaller values of n result in a vocabulary with fewer tokens from other languages, which limits the ability of the models to handle foreign tokens.

Monolingual vs. Multilingual LMs: The Case of Social Bias
There has been extensive literature in NLP comparing monolingual and multilingual LMs (Muller et al., 2021; Goyal et al., 2021). In terms of performance, there is no clear consensus on which type is better for certain languages, tasks or settings. However, there are other important factors that play a role in this comparison. First, monolingual models tend to have a smaller vocabulary size, which makes them more practical. In contrast, a single multilingual model can be used for a large number of languages. Moreover, multilingual LMs are less prone to capture and carry cultural- or language-dependent biases. This is due to the combination of languages and cultures in a single model, which makes it less biased towards specific cultures (Liang et al., 2020; Ahn and Oh, 2021).
Prior works have shown that different types of bias consistently appear in language-specific models (Nadeem et al., 2021; Nangia et al., 2020; Blodgett et al., 2021; Dhamala et al., 2021; Kaneko et al., 2022; Zhou et al., 2022). While the comparison of monolingual and multilingual LMs is not the main focus of this paper, this analysis is certainly relevant. Trimming the vocabulary of a multilingual model essentially makes the model smaller, and therefore alleviates one of the main drawbacks of using multilingual language models for language-specific tasks, namely their larger size. On top of that, this strategy enables the use of monolingual models with potentially less social bias. In the following, we present a comparison of monolingual and multilingual LMs (both trimmed and not trimmed) in terms of social bias and general performance.

Experimental setting
Social bias datasets. To study the effect of VT on the social bias present in pre-trained LMs, we first conduct experiments on two commonly used social bias evaluation datasets for masked LMs: StereoSet (SS; Nadeem et al., 2021) and the crowdsourced stereotype pairs benchmark (CP; Nangia et al., 2020). SS consists of associative contexts covering four types of social bias: race, gender, religion, and profession, while CP is crowdsourced and annotated by workers in the United States and contains nine types of social bias: race, gender, sexual orientation, religion, age, nationality, disability, physical appearance, and socioeconomic status. In order to further investigate the impact of pre/post-FT VT on LMs, we trim and fine-tune models on sentiment analysis in different orders and evaluate the social bias in such models on the Equity Evaluation Corpus (EEC; Kiritchenko and Mohammad, 2018), considering two bias types: gender and race. The EEC dataset was specifically designed to examine social bias in sentiment analysis systems.
Evaluation metrics. We compare the pseudo-likelihood scores returned by each model for stereotypical and anti-stereotypical sentences using AULA (All Unmasked Likelihood with Attention weights; Kaneko and Bollegala, 2022). AULA has been shown to be robust against the frequency biases of masked tokens and provides a more reliable assessment than alternative metrics when evaluating social biases in masked language models (MLMs). Given a sentence pair in the test dataset, e.g. "My mom spent all day cooking for Thanksgiving" vs. "My dad spent all day cooking for Thanksgiving", the first sentence is considered stereotypical and the second anti-stereotypical. AULA computes the percentage of stereotypical sentences preferred by the MLM over anti-stereotypical ones as the bias score. The AULA score falls within the range [0, 100], and an unbiased model would return a bias score close to 50. A bias score greater than or less than 50 indicates a bias towards the stereotype or the anti-stereotype, respectively. Since the original AULA is not suited to evaluating fine-tuned models, we adapt AULA to the EEC dataset to obtain the bias score for the LMs fine-tuned on sentiment analysis, and denote this metric as EEC-AULA. Specifically, given a model that assigns sentiment labels (e.g., positive, neutral, negative) to sentences, we consider the percentage of stereotypical test sentences receiving a more negative label than their anti-stereotypical counterparts as the corresponding bias evaluation measure.
General performance. As a proxy to test general performance, we use the general language understanding evaluation (GLUE; Wang et al., 2018) benchmark. We acknowledge the limitations of using this benchmark to draw reliable conclusions at large (Ethayarajh and Jurafsky, 2020), but it nevertheless provides a good proxy for understanding the overall performance of comparable models on standard NLP tasks. Moreover, these experiments are only aimed at analysing the effect of vocabulary trimming on general performance.
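To make the aggregation concrete, the sketch below shows how such a bias score is computed from per-pair model scores. It assumes the pseudo-log-likelihoods for each stereotypical/anti-stereotypical pair have already been produced by the MLM (the attention-weighted scoring that is the substance of AULA is not reproduced here), and the numbers are purely illustrative.

```python
def bias_score(pair_scores):
    """Percentage of pairs where the stereotypical sentence is preferred.

    pair_scores: list of (stereo_pll, anti_pll) pseudo-log-likelihood tuples.
    Returns a value in [0, 100]; 50 means unbiased, >50 leans stereotypical.
    """
    preferred = sum(1 for stereo, anti in pair_scores if stereo > anti)
    return 100.0 * preferred / len(pair_scores)

# Four pairs, three of which assign a higher score to the stereotypical
# sentence: the model prefers the stereotype 75% of the time.
print(bias_score([(-1.0, -2.0), (-0.5, -0.9), (-3.0, -1.0), (-2.0, -2.5)]))  # 75.0
```

The EEC-AULA adaptation described above follows the same shape, with "preferred" replaced by "received a more negative sentiment label".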

Models.
We compute the bias scores of RoBERTa (Liu et al., 2019) as the base monolingual LM, and XLM-R (Conneau et al., 2019) as its multilingual counterpart (they have been trained with the same architecture and on overlapping corpora). We explore two VT settings applied to XLM-R: standard VT keeping the full English vocabulary (VT XLM-R) and VT keeping the top-50K English vocabulary (top-50K VT XLM-R), which matches the vocabulary size of the monolingual RoBERTa model. Our experiments are based both on masked language models evaluated with AULA (for which post-FT VT has no effect, as no fine-tuning is involved) and on the models fine-tuned on the sentiment analysis task presented in §4.1, evaluated with EEC-AULA, as well as on the corresponding GLUE training sets.

Results
Table 3 shows the performance of pre-FT VT and post-FT VT models against the original monolingual and multilingual LMs on the social bias evaluation datasets and the GLUE benchmark. Both AULA and GLUE results are computed using the LMs without fine-tuning (i.e., RoBERTa, XLM-R, VT XLM-R, and top-50K VT XLM-R), whereas the EEC-AULA results are computed using the models combining VT and fine-tuning strategies. We observe that the monolingual model contains the highest level of social bias compared to the multilingual models across all settings. In particular, RoBERTa obtains the overall highest bias score on the EEC dataset after fine-tuning, with an alarmingly high 85.7 score on race (see Appendix C for a breakdown of social bias results). On the other hand, compared to the original XLM-R, there is no significant change in performance on the social bias and GLUE evaluation tasks for the pre-FT VT and post-FT VT models. This is important, as we can apply the proposed VT method to any multilingual LM, obtaining a monolingual one with consistent performance on the GLUE benchmark and less social bias than the original monolingual model pre-trained in the target language, without using any ad-hoc debiasing methods.

Table 3: Results of pre/post-FT VT models compared with the original monolingual and multilingual models on two social bias analysis benchmarks (AULA for pre-trained masked language models and EEC-AULA for models fine-tuned on sentiment analysis) and the GLUE benchmark. The VT models are trimmed to the English vocabulary with different vocabulary sizes: EN (full English vocabulary) and 50K (top 50K subword tokens). Note that for post-FT VT, the results on AULA are exactly the same as for the original XLM-R. The green and red colours represent social bias towards anti-stereotypical sentences (scores lower than 50) and stereotypical sentences (scores higher than 50), respectively. Lighter colours indicate less social bias observed in the LM.

Discussion
Vocabulary trimming before and after fine-tuning. According to the results, pre-FT VT appears to be generally more effective in classification tasks (see §4.2.2). For generation tasks, both pre- and post-FT VT robustly retain the original performance while considerably reducing the model size (see §4.2.1). As a guideline for choosing the type of VT in such cases, post-FT VT should be more suitable if one already has a fine-tuned model, as no additional training is needed. Moreover, post-FT VT is more robust as a compression mechanism, as the performance is largely maintained with respect to that of the original multilingual LM. On the other hand, if one needs to fine-tune a model from scratch and the computational resources are limited, we recommend exploring pre-FT VT, as fine-tuning a trimmed LM should be more efficient due to its smaller vocabulary and parameter count and, in some cases, can lead to better overall results. However, this process has to be done carefully, as the set of optimal hyperparameters could differ from that of the original multilingual LM fine-tuning process.
Monolingual and multilingual LM comparison.
While this paper has not focused on comparing monolingual and multilingual models, a natural question is whether vocabulary trimming strategies are needed in a world where monolingual LMs exist. A monolingual model may perform similarly to a multilingual model (Goyal et al., 2021). However, the multilingual model is often larger, mainly due to its larger vocabulary storage requirements. In contrast, our proposed VT technique does not require any extra LM training or computational resources: only a multilingual LM is needed, from which we can induce multiple smaller language-specific monolingual models. This may reduce the overall carbon footprint and especially help with lower-resource languages for which a high-quality monolingual model does not exist. Finally, our social bias analysis (§6) shows how monolingual models exhibit larger social biases (especially racial ones) than a VT-induced multilingual LM. This is consistent with prior work suggesting that, because a multilingual LM has been trained on more languages, and hence greater cultural variety, these diverging viewpoints can compensate for each other (Ahn and Oh, 2021).

Conclusion
In this paper, we proposed vocabulary trimming (VT), a method to reduce the vocabulary of a multilingual LM to a vocabulary specific to any given target language. VT can induce a monolingual LM in the target language by leveraging an existing multilingual LM. The main advantage of this filtering step is the reduced size, as well as avoiding having to train monolingual LMs from scratch, which would be computationally demanding. Our experiments show how VT can retain the high performance of the original multilingual LM while largely reducing the model size. For all languages evaluated, a 35% compression rate proves sufficient to keep the original performance of the larger mT5 multilingual LM in both QA and QG, with a similar 39% in NLI and 55% in sentiment analysis with XLM-R. Interestingly, in some cases, the compressed LM can even achieve better results than the original larger model when trimmed before fine-tuning. Since the main goal of the paper was to compress a multilingual LM while keeping its original performance, we leave the analysis of this behaviour for future work.

Limitations
We have not tested our methodology on truly low-resource languages. Because of this, the behaviour could differ when applying VT to a language with fewer resources or one that is poorly represented in the underlying training corpus. The LMs used in this paper are limited to 600M parameters, and we have not considered larger models such as mT5 XXL or BLOOM (Scao et al., 2022) due to our limited computational resources. As the language-specific corpus to compute token frequencies, we employ mC4, one of the largest multilingual corpora. Nonetheless, this is used as a proxy, and having access to the full corpus used to train the multilingual model could potentially give better results. Similarly, we acknowledge the limitations of the analysis comparing multilingual and monolingual models in terms of social bias. Due to the evaluation data available and the existence of comparable monolingual and multilingual LMs, the evaluation is focused on English only, and the results could differ for other languages. Moreover, there are other types of biases not covered in this evaluation.

Ethics Statement
Pre-trained LMs are known to contain undesirable biases and to generate toxic content in some edge cases (Schick et al., 2021), so the resulting models could inherit such biases. While we have not analysed in detail the output of all models in the tasks evaluated, in this paper we have made an attempt to study this effect in terms of social biases for both base pre-trained LMs and fine-tuned LMs.

A Top-n VT of XLM-R
Table 4 shows the results of XLM-R fine-tuned on sentiment analysis and NLI with pre/post-FT VT for different top-n.

B Top-n VT of mT5
Table 5 shows the results of mT5 fine-tuned on QA and QG with pre/post-FT VT for different top-n.

C Details of Results on Social Bias Evaluation
Table 6 shows the details of the social bias evaluation (EEC dataset) for each emotion type observed in the LMs fine-tuned on sentiment analysis. Table 7 shows the details of social bias for each bias type in both the CP and SS datasets observed in the compared LMs.

Figure 1 :
Figure 1: The ratio of the embedding matrix to the number of parameters for each multilingual LM.
Zheng et al. (2021) proposed a method to evaluate the ability of a vocabulary to represent a particular language, and Chung et al. (2020) proposed a multilingual vocabulary construction method that balances the trade-off between optimizing for cross-lingual subword sharing and the need for robust representation of individual languages. XLM-V (Liang et al., 2023) combines the ideas of Zheng et al. (2021) and Chung et al. (2020) to efficiently enlarge the vocabulary size along with model size scaling.

Figure 2 :
Figure 2: An illustration of vocabulary trimming for Korean and French.

Figure 3 :
Figure 3: Comparison of pre-FT vs post-FT VT, illustrated with fine-tuning on a task in French.

Figure 4 :
Figure 4: QG (METEOR) and QA (Ans-F1) results for mT5 with pre/post-FT VT for different vocabulary sizes compared to the original multilingual LM (No-Trim).

Figure 5 :
Figure 5: Sentiment analysis macro-F1 results of XLM-R with pre/post-FT VT for different vocabulary sizes compared to No-Trim.

Figure 6 :
Figure 6: NLI accuracy of XLM-R with pre/post-FT VT for different vocabulary sizes compared to No-Trim.

Table 1 :
Results on QA (Ans-F1/EM) and QG (MTR/BS), including both the vocabulary size and the number of parameters after VT with the ratio to the original model (%). The best results for each LM and language are in bold. Note that the parameter size of the original mT5 and mBART (No-Trim) is 300M and 611M, respectively, both with a vocabulary size of 250K.

Table 2 :
Results of sentiment analysis (macro-F1) and XNLI (accuracy), including both the vocabulary size and the number of parameters after VT with the ratio to the original model (%). The best results for each LM and language are in bold. Note that the overall parameter size of the original XLM-R and XLM-V (No-Trim) is 278M and 778M, respectively, with vocabulary sizes of 250K and 901K.

Table 4 :
Results of XLM-R on sentiment analysis (macro-F1) and NLI (accuracy) with different top-n vocabulary sizes at VT. The best results for each LM and language are in bold.