Allocating Large Vocabulary Capacity for Cross-Lingual Language Model Pre-Training

Compared to monolingual models, cross-lingual models usually require a more expressive vocabulary to represent all languages adequately. We find that many languages are under-represented in recent cross-lingual language models due to the limited vocabulary capacity. To this end, we propose an algorithm, VoCap, to determine the desired vocabulary capacity of each language. However, increasing the vocabulary size significantly slows down pre-training. To address this issue, we propose k-NN-based target sampling to accelerate the expensive softmax. Our experiments show that the multilingual vocabulary learned with VoCap benefits cross-lingual language model pre-training. Moreover, k-NN-based target sampling mitigates the side effects of increasing the vocabulary size while achieving comparable performance and faster pre-training speed. The code and the pretrained multilingual vocabularies are available at https://github.com/bozheng-hit/VoCapXLM.


Introduction
Pretrained cross-lingual language models (Conneau and Lample, 2019; Conneau et al., 2020; Chi et al., 2021b; Xue et al., 2020) have recently shown great success in improving cross-lingual transferability. These models encode texts from different languages into universal representations with a shared multilingual vocabulary and a shared Transformer encoder (Vaswani et al., 2017). By pre-training cross-lingual language models on the large-scale multilingual corpus, the models achieve state-of-the-art performance on various downstream tasks, e.g., cross-lingual question answering and cross-lingual sentence classification.
Although the Transformer architectures used in most pretrained monolingual and cross-lingual language models are almost identical, the vocabularies are quite different. The vocabulary sizes in existing pretrained monolingual language models typically range from 30K to 60K subword units (Devlin et al., 2019; Dong et al., 2019; Bao et al., 2020). Meanwhile, state-of-the-art pretrained cross-lingual language models use a shared multilingual vocabulary of 250K subword units to represent more than 100 languages (Conneau et al., 2020; Chi et al., 2021b; Xue et al., 2020). Although some subword units are shared across languages, no more than 2.5K language-specific subword units on average are allocated for each language, which is still relatively small. Besides, the multilingual vocabulary is trained on the combined multilingual corpus with subword segmentation algorithms like BPE (Sennrich et al., 2015) and the unigram language model (Kudo, 2018). During vocabulary construction, these algorithms tend to select more subword units shared across languages with common scripts like Latin and Cyrillic (Chung et al., 2020b), but have a lower chance to select language-specific subword units. It is hard to determine how much vocabulary capacity a particular language requires and whether the shared multilingual vocabulary has allocated enough capacity to represent the language.

* Contribution during internship at Microsoft Research.
In this paper, we propose VOCAP, an algorithm that allocates a large vocabulary for cross-lingual language models by separately evaluating the required vocabulary capacity of each language. First, we use the average log probability (ALP) to evaluate the ability of a vocabulary to represent a particular language. We find that ALP is highly correlated with downstream task performance, and we use it as an indicator to allocate language-specific vocabulary capacity. In addition, the language-specific pre-training corpus size should also be considered, since the pretrained model can only learn limited knowledge from low-resource languages where the pre-training data is scarce; allocating too much vocabulary capacity for low-resource languages is therefore inefficient. VOCAP leverages both ALP and the pre-training corpus size to evaluate the required vocabulary capacity of each language. We finally allocate a multilingual vocabulary with 500K subword units with VOCAP and show that it can significantly improve model performance.
However, increasing the vocabulary size has two practical drawbacks: slow pre-training speed and heavy model size. To address the pre-training speed issue, we propose k-NN-based target sampling, an approximate algorithm that improves the computing efficiency of the expensive softmax caused by the large vocabulary. We pre-train the model with a small subset of the entire vocabulary, constructed from the k nearest neighbors of the target words in the current mini-batch, measured by the inner product of subword embeddings. As for the model size, we halve the embedding dimension and draw a different conclusion from Conneau et al. (2020): increasing the vocabulary from 250K to 500K under a fixed model capacity can still improve performance.
Our contributions are summarized as follows:

• We propose VOCAP, an algorithm to allocate appropriate vocabulary capacity for each language in the shared multilingual vocabulary of cross-lingual language models.
• We propose k-NN-based target sampling, a softmax approximation algorithm to improve the computing efficiency during cross-lingual language model pre-training.
• We evaluate our methods on the XTREME benchmark (Hu et al., 2020), including three different tasks on seven datasets. Experiments show that VOCAP consistently outperforms previous vocabulary construction methods. Meanwhile, our k-NN-based target sampling enables effective acceleration while achieving comparable performance.

VOCAP: Language-Specific Vocabulary Capacity Allocation
We attribute the main factors that affect the performance of a particular language in a cross-lingual language model to the language-specific pre-training corpus size and vocabulary capacity. While previous work adjusts the pre-training corpus size with an exponentially smoothed sampling distribution (Conneau and Lample, 2019; Conneau et al., 2020), few existing works have explored the effect of the language-specific vocabulary capacity in pretrained cross-lingual language models. In this section, we first investigate the correlation between language-specific vocabulary capacity and downstream task performance through experiments. Then we introduce our proposed multilingual vocabulary allocation algorithm VOCAP.

Investigating Language-Specific Vocabulary Capacity
We start by introducing average log probability (ALP) to quantify the language-specific vocabulary capacity in the shared multilingual vocabulary for a specific language. Given a monolingual corpus composed of sentences D_i = {s_1, ..., s_{|D_i|}} from the i-th language and tokenized with vocabulary V, the average log probability is defined as

ALP(D_i, V) = (1 / |D_i|) Σ_{j=1}^{|D_i|} Σ_{k=1}^{|s_j|} log p_uni(s_j^k)    (1)

where s_j^k is the k-th subword of the sentence s_j, and p_uni(·) is the unigram distribution counted on the monolingual corpus D_i. It is difficult to count the language-specific subword units in multilingual vocabularies directly, since the raw text contains a lot of code-switched data. By contrast, ALP is a more convenient indicator of language-specific vocabulary capacity, and it is penalized by low-frequency subword units.
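To make the definition concrete, ALP can be computed directly from a tokenized corpus. The sketch below is ours; the function name and data layout are assumptions, not from the paper:

```python
import math
from collections import Counter

def average_log_prob(tokenized_corpus):
    """ALP of a vocabulary on one language's corpus.

    `tokenized_corpus` is a list of sentences, each a list of subword
    units produced by tokenizing with the vocabulary V under study.
    The unigram distribution p_uni is counted on the same corpus.
    """
    counts = Counter(tok for sent in tokenized_corpus for tok in sent)
    total = sum(counts.values())
    p_uni = {tok: c / total for tok, c in counts.items()}
    # Average over sentences of the summed log unigram probabilities,
    # as in Equation (1); low-frequency subwords pull the score down.
    return sum(
        sum(math.log(p_uni[tok]) for tok in sent)
        for sent in tokenized_corpus
    ) / len(tokenized_corpus)
```

With Equation (1) in this form, comparing two vocabularies for the same language amounts to tokenizing the corpus with each and comparing the returned scores.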
To investigate the impact of language-specific vocabulary capacity, we first learn monolingual vocabularies in different sizes to obtain vocabularies with different ALP, i.e., language-specific vocabulary capacity. Then we conduct pre-training with these monolingual vocabularies on their corresponding monolingual corpora. Finally, we evaluate these monolingual models on downstream tasks and study the correlation between languagespecific vocabulary capacity and downstream task performance.

Setup
To alleviate the bias from individual languages' characteristics, we select four languages with different pre-training corpus sizes from different language families: Hindi (hi), Persian (fa), Italian (it), and Russian (ru). We first learn thirty monolingual vocabularies for each language on the corresponding monolingual corpus, with vocabulary sizes ranging from 1K to 30K. Then we pretrain monolingual language models with the corresponding monolingual vocabularies. We evaluate these pretrained models on two downstream tasks from the XTREME benchmark, NER (Pan et al., 2017) and POS (Zeman et al., 2019), since they provide annotated task data for a large number of languages. The vocabularies are learned on the reconstructed CommonCrawl corpus (Chi et al., 2021b; Conneau et al., 2020) using SentencePiece (Kudo and Richardson, 2018) with the unigram language model (Kudo, 2018). The unigram distributions are also counted on the CommonCrawl corpus. The Wikipedia corpus is used for all pre-training experiments in this paper, since its smaller size makes experiments easier to run. More details about the pre-training data can be found in the appendix.

Observations
Increasing the vocabulary size affects the ALP of different languages in varying degrees. In Figure 1, we show the correlation between vocabulary size and ALP for four different languages. We observe that ALP varies across languages, mainly because ALP correlates with the lexical granularity of the language, i.e., the average number of tokens per sentence. Besides, when the vocabulary size is larger than 10,000, the gains of increasing the monolingual vocabulary size for hi and fa are smaller than for it and ru. We attribute this to the fact that hi and fa do not have extensive compounding. Another observation is that for each language, every time we increase the vocabulary size by 1K, the increment in ALP is monotonically decreasing.
ALP correlates positively with downstream task performance. In Figure 2 and Figure 3, we illustrate the downstream task performance of models pretrained with monolingual vocabularies on the corresponding monolingual corpora. We observe that ALP correlates positively with downstream task performance, making language-specific ALP a valid indicator for allocating the multilingual vocabulary. Another natural option for allocating the multilingual vocabulary is to use the monolingual vocabulary size directly as the indicator of language-specific vocabulary capacity. We compare ALP against vocabulary size and observe that ALP correlates better with the downstream task performance than vocabulary size does. Besides, ALP reflects language-specific characteristics, while vocabulary size does not. The detailed comparison is shown in the appendix.
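The appendix comparison boils down to two Pearson correlations. The following sketch uses hypothetical stand-in numbers, not measurements from the paper, purely to show the computation:

```python
import numpy as np

# Hypothetical per-vocabulary measurements for one language: ALP values,
# the corresponding vocabulary sizes, and downstream F1 scores.
alp = np.array([-12.1, -11.0, -10.4, -10.1, -9.9])
size = np.array([1_000, 5_000, 10_000, 20_000, 30_000])
f1 = np.array([61.2, 66.8, 69.0, 69.9, 70.3])

# Pearson correlation of each indicator with task performance. ALP
# saturates the way F1 does, while raw size keeps growing linearly,
# so ALP tracks F1 more closely on this toy data.
r_alp = np.corrcoef(alp, f1)[0, 1]
r_size = np.corrcoef(size, f1)[0, 1]
```

The same two-line comparison, run on the real per-language measurements, is how the appendix contrasts ALP with vocabulary size as an allocation indicator.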

Allocating Multilingual Vocabulary with VOCAP
Based on the observations in Section 2.1.2, we first give the implementation of our proposed vocabulary allocation algorithm VOCAP. Then we compare the multilingual vocabulary learned with VOCAP and directly learned with SentencePiece on the multilingual corpus.

VOCAP Implementation
We formulate the vocabulary construction of VOCAP as the problem of finding the optimal way to allocate language-specific vocabulary size to each language, such that the overall ALP of all languages is maximized. In addition to the language-specific vocabulary capacity measured with ALP from Equation (1), the language-specific pre-training corpus size also affects the downstream task performance. Considering the two factors, the procedure of VOCAP can be formulated as

max_{t_1, ..., t_N} Σ_{i=1}^{N} (q_i)^β · ALP(D_i, V_{t_i}^i)   s.t.   Σ_{i=1}^{N} t_i = T    (2)

where t_i ∈ {x × 1000 | x ≤ 50, x ∈ N+} is the number of subword units allocated to the i-th language, β is a rescaling factor, V_{t_i}^i is the vocabulary of the i-th language with t_i subword units, T is the size of the target multilingual vocabulary, and q_i is the probability of sampling training instances from the i-th language during pre-training (Conneau and Lample, 2019; Conneau et al., 2020):

q_i = (n_i)^α / Σ_{j=1}^{N} (n_j)^α    (3)

where n_i is the number of instances in the i-th language, and α is a rescaling factor used to alleviate the bias towards high-resource languages. Since the increment in ALP when increasing the vocabulary size by a fixed number of units is monotonically decreasing, Equation (2) can be solved with the greedy algorithm in Algorithm 1.
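Since each additional 1K units yields a monotonically smaller ALP gain, the greedy solver can repeatedly award the next 1K units to the language with the largest weighted marginal gain. Below is a minimal sketch under that reading; function and argument names are ours, and `alp[i]` maps a candidate size in thousands to the precomputed ALP(D_i, V_{t_i}^i):

```python
def vocap_allocate(alp, corpus_sizes, T, alpha=0.7, beta=0.7, step=1, cap=50):
    """Greedy allocation of T (thousands of) subword units, a sketch of
    the solver for Equation (2).

    alp: list of dicts, alp[i][t] = ALP of language i with t*1000 units.
    corpus_sizes: per-language corpus sizes n_i, used for Equation (3).
    """
    # Sampling probabilities q_i of Equation (3), smoothed by alpha.
    weights = [s ** alpha for s in corpus_sizes]
    q = [w / sum(weights) for w in weights]
    t = [step] * len(alp)  # start each language at the smallest size
    while sum(t) < T:
        # Weighted marginal ALP gain of giving language i one more step;
        # languages at the cap (or without a precomputed value) get -inf.
        gains = [
            (q[i] ** beta) * (alp[i].get(t[i] + step, float("-inf")) - alp[i][t[i]])
            if t[i] + step <= cap else float("-inf")
            for i in range(len(alp))
        ]
        best = max(range(len(alp)), key=gains.__getitem__)
        t[best] += step
    return t
```

Because the marginal gains are non-increasing, taking the locally best step at each iteration is sufficient; no backtracking is needed.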

Intrinsic Analysis
We compare the multilingual vocabulary learned with VOCAP against the one directly learned with SentencePiece on the multilingual corpus. The multilingual corpus used to learn vocabularies in this paper is the concatenation of sentences sampled randomly from the monolingual corpora. Sentences from the i-th language are sampled with probability q_i from Equation (3), using α = 0.7. We keep only the languages whose corpus size is larger than 0.1GB, resulting in 86 languages. We evaluate the multilingual vocabularies with their ALP on each language's monolingual corpus, and show results for different-resourced languages in Figure 4. We refer to languages with less than 1GB and more than 10GB of pre-training corpus in the reconstructed CommonCrawl as low-resource and high-resource languages, respectively, and to the rest as mid-resource languages. When directly learning the vocabulary on the multilingual corpus using SentencePiece, the vocabulary with 500K subword units (JOINT 500K) only has a negligible improvement over the vocabulary with 250K subword units (JOINT 250K). Meanwhile, our method (VOCAP 500K) consistently outperforms JOINT 500K across different-resourced languages, especially for mid and low-resource languages. The statistics of the allocated vocabulary size for each language in VOCAP 500K are shown in the appendix.

Accelerating Large-Vocabulary Language Model Pre-Training

Although extending the multilingual vocabulary benefits cross-lingual language models, pre-training with such large vocabularies brings two practical issues: slow pre-training speed and heavy model size. To tackle these issues, we first introduce k-NN-based target sampling in Section 3.1, a softmax approximation algorithm that improves computing efficiency. Then we describe how we reallocate the model parameters to keep the model size fixed in Section 3.2.

k-NN-Based Target Sampling
To reduce the expensive computation cost of the softmax function, we propose k-NN-based target sampling to approximate the expensive softmax. The original masked language modeling objective minimizes the cross-entropy loss for every masked subword w_i over the extensive multilingual vocabulary V. The proposed k-NN-based target sampling instead uses a smaller vocabulary subset Ṽ. The approximation of the masked language modeling loss for the masked subword w_i is defined as

L(w_i) = −log [ exp(h⊤ v_{w_i} + b_{w_i}) / Σ_{w_j ∈ Ṽ} exp(h⊤ v_{w_j} + b_{w_j}) ]    (4)

where h is the corresponding output vector of the penultimate network layer, i.e., the output vector of the Transformer encoder, v_{w_i} is the embedding of the subword unit w_i, and b_{w_i} is a bias term. We construct the vocabulary subset Ṽ as

Ṽ = W ∪ ( ∪_{w_i ∈ W} I_k(w_i) )    (5)

where W denotes the set of target masked subword units in the current mini-batch, and I_k(w_i) denotes the k most similar subwords to w_i, measured with the inner product of the subword embeddings v_{w_i} and v_{w_j}. However, retrieving I_k(w_i) at every training step for every subword unit w_i ∈ W requires as much computation cost as the softmax itself, which is unaffordable.

Algorithm 2: Pre-training with k-NN-based target sampling
Input: multilingual corpus D_m; size k of k-NN-based target sampling; multilingual vocabulary V; learning rate τ
Output: model parameters θ
1: while not converged do
2:   Sample n mini-batches {X^(t), W^(t)}_{t=1}^n ∼ D_m   (X^(t) is a mini-batch of monolingual text, and W^(t) is the set of masked subwords)
3:   Update I_k(w_i) for every w_i ∈ V
4:   for t ← 1 to n do   (train the model for n steps)
5:     Construct the vocabulary subset from W^(t) with Equation (5) and update θ with learning rate τ
As an alternative, we compute I_k(w_i) for every subword w_i ∈ V according to the current subword embeddings every n training steps and replace the previous version of I_k(w_i) with the new one. We determine the value of n such that |V| ≪ n × |W|, so that the amortized cost of refreshing the indices is negligible. We illustrate the pre-training procedure with k-NN-based target sampling in Algorithm 2.
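As an illustration of one step's loss under Equations (4) and (5), here is a NumPy sketch. It is ours, not the authors' implementation; a real model would run this on GPU with learned, trainable embeddings:

```python
import numpy as np

def knn_target_sampled_loss(h, targets, emb, bias, knn_index):
    """Approximate MLM loss of Equation (4) on one mini-batch.

    h: (B, d) Transformer outputs at the masked positions
    targets: (B,) ids of the masked subwords W
    emb, bias: output embedding table (|V|, d) and bias vector (|V|,)
    knn_index: (|V|, k) precomputed nearest neighbours by inner product,
        refreshed only every n steps as in Algorithm 2.
    """
    # Vocabulary subset of Equation (5): targets plus their neighbours.
    subset = np.unique(np.concatenate([targets, knn_index[targets].ravel()]))
    # Position of each target inside the sorted subset.
    pos = np.searchsorted(subset, targets)
    # Softmax restricted to the subset instead of the full vocabulary.
    logits = h @ emb[subset].T + bias[subset]        # (B, |subset|)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), pos].mean()
```

Because the softmax is taken over `subset` rather than the full vocabulary, the output matrix multiply shrinks from (B, |V|) to (B, |subset|), which is where the speed-up comes from.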
From a practical point of view under the cross-lingual setting, previous sampling-based softmax approximation methods sample subwords either from recent mini-batches or from the unigram distribution. This makes the task simpler, since a considerable part of the subword samples comes from languages different from the target's. Meanwhile, our k-NN-based target sampling uses subwords with similar representations, such as synonyms, which forces the model to discriminate the ground-truth subword from a set of noise samples that are not easy to distinguish. When using an approximate algorithm, the key point is to retain the difficult part of the original masked language modeling objective as much as possible.

Reducing the Embedding Dimension
In order to keep the number of model parameters fixed while increasing the vocabulary size, we follow Lan et al. (2020) and Chung et al. (2020a) to reduce both the input and output embedding dimension and linearly project the embeddings to the hidden dimension of the Transformer blocks. More precisely, we halve the embedding dimension when the vocabulary size is doubled. This rebalancing strategy only slightly degrades the model performance but improves pre-training speed and decreases the model size.

Table 1: Evaluation results on the XTREME benchmark. "XLM-R 250K" denotes using the XLM-R (Conneau et al., 2020) vocabulary with 250K subword units. "k-NN" and "half emb" denote our k-NN-based target sampling method and using half the embedding dimension, respectively.

Conneau et al. (2020) also studied the relation between the size of the shared multilingual vocabulary and downstream task performance with multilingual models of a fixed number of parameters. They keep the overall number of parameters constant by adjusting the width (i.e., hidden size) of the Transformer. Notice that we only reduce the embedding dimension while keeping the Transformer blocks untouched.
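The rebalancing strategy can be sketched as a factorized embedding, in the spirit of Lan et al. (2020) and Chung et al. (2020a); the class and variable names here are illustrative assumptions, not from the paper:

```python
import numpy as np

class FactorizedEmbedding:
    """Reduced-dimension embedding with a linear projection to the
    Transformer hidden size, a sketch of the rebalancing strategy."""

    def __init__(self, vocab_size, emb_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.table = rng.normal(scale=0.02, size=(vocab_size, emb_dim))
        self.proj = rng.normal(scale=0.02, size=(emb_dim, hidden_dim))

    def __call__(self, token_ids):
        # Look up in the small e-dimensional table, then project to d.
        return self.table[token_ids] @ self.proj

# Doubling the vocabulary while halving the embedding dimension keeps
# the embedding parameter count nearly constant: |V|*e + e*d vs |V|*d.
params_250k_full = 250_000 * 768
params_500k_half = 500_000 * 384 + 384 * 768
```

Only the small e-by-d projection matrix is added, so the 500K-vocabulary model with halved embeddings ends up within a few hundred thousand parameters of the 250K-vocabulary baseline.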

Setup
Fine-Tuning Datasets To validate the effectiveness of our methods, we conduct experiments on three types of cross-lingual understanding tasks from the XTREME benchmark (Hu et al., 2020): two classification datasets, XNLI (Conneau et al., 2018) and PAWS-X (Yang et al., 2019); two structured prediction datasets, POS (Zeman et al., 2019) and NER (Pan et al., 2017); and three question answering datasets, XQuAD (Artetxe et al., 2020), MLQA (Lewis et al., 2020), and TyDiQA-GoldP (Clark et al., 2020).

Implementation Details
We adapt the Transformer architecture from the base model setting in Conneau et al. (2020), i.e., 12 layers and a hidden dimension of 768. We use the masked language modeling objective to train our models for 1 million updates on eight 32GB Nvidia V100 GPUs with a batch size of 256. We update the top-k indices for every word in the multilingual vocabulary every 1,000 training steps and use k = 50 in k-NN-based target sampling. The learning rate is scheduled with a polynomial decay with 10K warmup steps, where the peak learning rate is set to 0.0001. We adapt the other pre-training hyper-parameters from Chi et al. (2021b). All fine-tuning results are averaged over five random seeds. The fine-tuning pipeline is based on the code base of Zheng et al. (2021). The fine-tuning implementation details are shown in the appendix.

Results

Table 1 shows XTREME fine-tuning results with models pretrained using different vocabularies and acceleration strategies. Compared to vocabularies directly learned on the multilingual corpus with SentencePiece, i.e., XLM-R 250K and JOINT 250K, our VOCAP 250K improves on the question answering datasets but degrades on PAWS-X, POS, and NER. Increasing the vocabulary from VOCAP 250K to VOCAP 500K then closes the gap and brings improvements on six datasets, the exception being PAWS-X, which only includes seven high-resource languages. However, increasing the size of the vocabulary directly learned with SentencePiece from JOINT 250K to JOINT 500K does not improve the performance as our VOCAP method does, showing the importance of selecting language-specific subword units and estimating how much vocabulary capacity each language requires. Since increasing the vocabulary size brings the issues of model size and pre-training speed, we study the proposed methods to accelerate pre-training: k-NN-based target sampling (k-NN) and using half the embedding dimension (half emb).
Our k-NN method improves the pre-training speed with a 500K vocabulary so that it is 1.18 times that of vanilla pre-training with a 250K vocabulary. Meanwhile, pre-training with our k-NN method does not significantly degrade the performance; it even brings improvements on XNLI, MLQA, and TyDiQA. We then halve the embedding dimension of the models with the 500K vocabulary, resulting in a similar number of parameters to the models with the 250K vocabulary. The overall performance degrades by 0.6 points but still consistently improves over the models with 250K vocabularies, while the speed is comparable. Combining the two methods, we achieve a 1.35-times speed-up and more than 1 point of improvement with a similar model size compared to the models with 250K vocabularies.

Analysis and Discussion
We conduct a thorough analysis to understand the impact of our proposed methods on cross-lingual language models. To reduce the computation load, we only pre-train the cross-lingual language models for 500K steps for some of our settings.
k-NN-based target sampling outperforms previous sampling-based approaches. To verify the effectiveness of our proposed k-NN-based sampling method, we compare it against previous sampling-based approaches used to approximate the softmax, namely target sampling (Jean et al., 2015), noise contrastive estimation (NCE; Mnih and Teh, 2012), and negative sampling (NEG; Mikolov et al., 2013). The results are shown in Table 2.
For a fair comparison, since our k-NN-based sampling method with k = 50 samples a vocabulary subset with fewer than 50,000 subword units per batch on average, we sample 50,000 negative subword units per batch for target sampling, NCE, and NEG. Among the four methods, NCE and NEG are significantly worse than k-NN and target sampling. We attribute this to NCE and NEG needing more training steps to converge (Mnih and Teh, 2012). Besides, the original NCE typically samples different negative samples for every target word, while we here use the same 50,000 negative samples for all target words in the current mini-batch, which is more efficient on GPUs.
Effect of the value of k in k-NN-based target sampling. We illustrate the downstream task performance with different values of k in our k-NN-based target sampling in Table 3. While a smaller k gives a faster pre-training speed, we observe that even with a value as small as 5, the results do not significantly degrade compared to using the original softmax. We attribute this to the fact that, by retrieving the subword samples most similar to the target subword, the model can focus on the difficult part of the original masked language modeling objective; more precisely, it focuses on discriminating the ground-truth subword from a set of noise samples that are not easy to distinguish. Considering the overall performance, the pre-training speed, and the memory required to store the k-NN indices, we use k = 50 in all our experiments.
Language-specific pre-training corpus size should also be considered when allocating vocabulary capacity. The pre-training corpus size varies across languages. It is inefficient to allocate a large vocabulary capacity for low-resource languages with scarce pre-training data, since the pretrained model can only learn limited knowledge from these languages. Here we study the value of the rescaling factor β from Equation (2) in multilingual vocabulary construction in Table 4. The rescaling factor β controls the number of selected language-specific subword units. Increasing the value of β improves the performance on XNLI, where most languages are high-resource languages. However, it degrades the performance on NER, where more low-resource languages exist. Considering the overall performance, we use β = 0.7 in our experiments.
The proposed acceleration strategies significantly improve the downstream task performance under the same pre-training cost. Increasing the vocabulary size slows down pre-training, even though there is almost no difference in fine-tuning speed. We study the relationship between downstream task performance and pre-training cost under different model settings in Figure 5. We observe that VOCAP 500K +k-NN achieves the best performance. Models trained with the 500K vocabulary consistently outperform those with the 250K vocabulary on XNLI. Besides, we observe that the MLQA performance of the model trained with the 250K vocabulary degrades as training continues, while the models trained with the 500K vocabulary do not, indicating that sufficient vocabulary capacity is essential for the question answering task.
VOCAP gains more improvement on mid and low-resource languages than high-resource languages. In Figure 4 in Section 2, we show that the vocabulary learned with VOCAP benefits the vocabulary capacity of low-resource languages more than high-resource languages, indicating the improvements should mainly come from low-resource languages. To verify this, we compare VOCAP against SentencePiece baseline on the performance of different-resourced languages on XNLI and NER in Figure 6. We observe that the vocabulary learned with VOCAP significantly outperforms the vocabularies directly learned with SentencePiece on mid and low-resource languages. This observation is also consistent with the ALP results in Figure 4.

Related Work
Pretrained Cross-Lingual Language Models Recent work pre-trains Transformer models (Vaswani et al., 2017) on large-scale multilingual corpora to obtain pretrained cross-lingual language models (Conneau and Lample, 2019; Conneau et al., 2020; Chi et al., 2020, 2021a; Chung et al., 2020a; Xue et al., 2020; Ma et al., 2020, 2021). These models are capable of encoding texts from different languages into universal representations and significantly improve cross-lingual transferability.

Multilingual Vocabulary Construction Cross-lingual language models need large vocabularies to ensure all languages are adequately represented. Recent research on constructing multilingual vocabularies for cross-lingual language models can be categorized into two groups. mBERT (Devlin et al., 2019), XLM (Conneau and Lample, 2019), and XLM-R (Conneau et al., 2020) learn vocabularies on a combined multilingual corpus with WordPiece (Wu et al., 2016), BPE (Sennrich et al., 2015), and the unigram language model (Kudo, 2018) from SentencePiece (Kudo and Richardson, 2018), respectively. Chung et al. (2020b) propose to balance the trade-off between optimizing for cross-lingual subword sharing and the need for robust representation of individual languages. They first group languages into clusters and learn vocabularies individually on each cluster, then combine all cluster-vocabularies to form a single unified multilingual vocabulary. Compared to Chung et al. (2020b), our advantage is that we separately quantify the vocabulary capacity each language needs with average log probability and balance the construction procedure with the pre-training corpus size.
Softmax Approximation Approximating the softmax was a core problem in training NLP models with a large vocabulary, e.g., in neural machine translation and language modeling. With the rise of subword representations (Sennrich et al., 2015; Wu et al., 2016; Kudo, 2018), vocabulary sizes decreased significantly, and the problem has been less studied recently. Nevertheless, the need for training cross-lingual language models with a large multilingual vocabulary has drawn our attention back to softmax approximation approaches. Existing approaches can be grouped into softmax-based and sampling-based approaches. Softmax-based approaches include hierarchical softmax (Morin and Bengio, 2005), differentiated softmax, and CNN-softmax (Kim et al., 2016). However, these approaches improve the softmax efficiency by changing its architecture, which is unsuitable for either training on GPUs or multilingual settings. Sampling-based approaches instead optimize some other easy-to-compute loss function to approximate the original softmax, including target sampling (Jean et al., 2015), noise contrastive estimation (Mnih and Teh, 2012), and negative sampling (Mikolov et al., 2013). Our k-NN-based target sampling is also a sampling-based approach.

Conclusion
In this paper, we study pre-training cross-lingual language models with large vocabulary capacity. First, we propose VOCAP to construct a large multilingual vocabulary for cross-lingual language models. We conduct a quantitative analysis showing that average log probability is a valid indicator of vocabulary capacity for a particular language, and that it correlates with downstream task performance on the language. VOCAP uses the language-specific average log probability and the pre-training corpus size to allocate appropriate vocabulary capacity for each language in the multilingual vocabulary. Moreover, we propose k-NN-based target sampling to accelerate pre-training with the allocated large multilingual vocabulary by approximating the expensive softmax. We also show that reducing the embedding dimension is an effective way to keep the improvement brought by the large vocabulary without increasing the number of model parameters. The experiments demonstrate the effectiveness of the proposed vocabulary construction method as well as the acceleration methods.

A Correlation between Language-Specific Vocabulary Capacity and Task Performance
We compare the Pearson correlation coefficients between ALP and downstream task performance with the coefficients between vocabulary size and downstream task performance in the corresponding table of this appendix.

D Pre-Training Data
We use the reconstructed CommonCrawl corpus from Chi et al. (2021b) to learn vocabularies in our paper. Because tokenizing the pre-training data is time-consuming, we instead conduct our pre-training on Wikipedia, since it has a smaller size. We only consider the languages that are shared by the reconstructed CommonCrawl corpus and Wikipedia. The statistics of the Wikipedia corpus and the reconstructed CommonCrawl corpus are listed in Table 8 and Table 7.

Table 9: The statistics of the allocated vocabulary size (in thousands of subword units) for each language.

lang size  lang size  lang size
…    3     hy   5     ps   3
ar   15    id   13    pt   20
as   2     is   3     ro   13
az   5     it   22    ru   34
ba   2     ja   23    sa   1
be   3     ka   4     sd   2
bg   9     kk   4     si   3
bn   6     km   4     sk   11
ca   8     kn   2     sl   8
cs   14    ko   17    sq   7
cy   3     ky   3     sr   10
da   9     la   3     sv   18
de   24    lo   2     sw   3
el   17    lt   7     ta   6
en   23    lv   6     te   4
eo   4     mk   4     tg   5
es   26    ml   3     th   14
et   5     mn   3     tl   4
eu   4     mr   3     tr   18
fa   9     ms   4     tt   3
fi   9     mt   3     ug   3
fr   25    my   2     uk   12
ga   2     ne   3     ur   5
gl   5     nl   14    uz   2
gu   2     nn   3     vi   12
he   6     no   7     yi   2
hi   6     or   2     zh   30
hr   6     pa   3