Fast Vocabulary Transfer for Language Model Compression

Real-world business applications require a trade-off between language model performance and size. We propose a new method for model compression that relies on vocabulary transfer. We evaluate the method on various vertical domains and downstream tasks. Our results indicate that vocabulary transfer can be effectively used in combination with other compression techniques, yielding a significant reduction in model size and inference time while marginally compromising on performance.


Introduction
In the last few years, many NLP applications have come to rely more and more on large pre-trained Language Models (LMs) (Devlin et al., 2018; Liu et al., 2019; He et al., 2020). Because larger LMs, on average, exhibit higher accuracy, a common trend has been to increase model size. Some LMs, like GPT-3 (Brown et al., 2020) and BLOOM (https://bigscience.huggingface.co/blog/bloom), have reached hundreds of billions of parameters. However, these models' superior performance comes at the cost of a steep increase in computational footprint, both for development and for inference, ultimately hampering their adoption in real-world business use cases. Beyond models that only a few hi-tech giants can afford, like GPT-3, even smaller LMs with hundreds of millions of parameters can be too expensive or infeasible for certain products. For one thing, despite being far cheaper than their bigger cousins, fine-tuning, deploying and maintaining large numbers of such models (one for each downstream task) soon becomes too expensive. Furthermore, latency and/or hardware requirements may limit their applicability to specific use cases. For all these reasons, significant efforts, in both academic and industry-driven research, are oriented towards designing solutions that drastically reduce the costs of LMs.
Recently, several attempts have been made to make these models smaller, faster and cheaper, while retaining most of their original performance (Gupta et al., 2015; Shen et al., 2020). Notably, Knowledge Distillation (KD) (Hinton et al., 2015) is a teacher-student framework, whereby the teacher is a pre-trained large model and the student a smaller one. The framework requires that both the teacher and the student estimate the same probability distribution. While the outcome is a smaller model, in the context of language modeling this procedure constrains the student to operate with the same vocabulary as the teacher.
In this work, we explore a method for further reducing an LM's size by compressing its vocabulary through the training of a tokenizer on the downstream task domain. The tokenizer (Sennrich et al., 2016; Schuster and Nakajima, 2012; Kudo and Richardson, 2018) is a crucial part of modern LMs. In particular, by moving from word-level to subword-level units, tokenization solves two problems: vocabulary explosion and unknown words. Moreover, the capability to tokenize text effectively in any domain is key to the massive adoption of pre-trained general-purpose LMs fine-tuned on downstream tasks. Indeed, tokenizers can still process out-of-distribution texts, at the cost of frequently splitting words into multiple tokens.
However, language varies significantly across vertical domains or, more generally, across topics. Hence, ad-hoc tokenizers, trained on the domain statistics, may perform a more efficient tokenization, reducing on average the length of the tokenized sequences. This is important, since compact and meaningful inputs could reduce computational costs while improving performance. Indeed, the memory and time complexity of attention layers grows quadratically with the sequence length (Vaswani et al., 2017). Furthermore, a vertical tokenizer may require a smaller vocabulary, which also affects the size of the embedding matrix, hence further reducing the model's size.

Figure 1: Sketch of the VT procedure. First, the vocabulary is constructed on the in-domain data; then an embedding is assigned to each token, transferring information from the pre-trained representations of the general-purpose language model.
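To illustrate why in-domain vocabularies shorten sequences, here is a minimal sketch of WordPiece-style greedy longest-match tokenization. The two toy vocabularies and the medical term are invented for demonstration; they are not taken from the paper's experiments.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first subword split (WordPiece-style)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            # Continuation pieces carry the "##" prefix, as in BERT's vocabulary.
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no piece matched at this position
        start = end
    return pieces

# Toy vocabularies: the generic one lacks the medical term "thrombocytopenia"
# and must fragment it, while the in-domain one stores it as a single token.
generic_vocab = {"th", "##rom", "##bo", "##cy", "##top", "##enia"}
domain_vocab = {"thrombocytopenia"}

print(wordpiece("thrombocytopenia", generic_vocab))  # six pieces
print(wordpiece("thrombocytopenia", domain_vocab))   # one piece
```

With a domain-specific vocabulary, the same word costs one token instead of six, which is exactly the kind of sequence-length saving that reduces the quadratic attention cost discussed above.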
Following this intuition, we propose a Vocabulary Transfer (VT) technique to adapt LMs to smaller, in-domain tokenizers, in order to further compress and accelerate them. This technique is complementary to the aforementioned model compression methods and independent of the type of tokenizer. As a matter of fact, we apply it in combination with KD.
Our experiments show that VT achieves an inference speed-up between x1.07 and x1.40, depending on the downstream task, with a limited performance drop, and that combining VT with KD yields an overall speed-up of up to x2.76.
The paper is organized as follows. After reviewing related work in Section 2, we present the methodology in Section 3, outline the experiments in Section 4, and draw our conclusions in Section 5.

Related Works
The goal of model compression is to shrink and optimize neural architectures while retaining most of their initial performance. Research on LM compression has followed a variety of approaches, such as quantization (Gupta et al., 2015; Shen et al., 2020), pruning (Zhu and Gupta, 2017; Michel et al., 2019), knowledge distillation (Sanh et al., 2019; Jiao et al., 2020; Wang et al., 2020), and combinations thereof (Polino et al., 2018).
One of the most popular distillation approaches in NLP was proposed by Sanh et al. (2019). The resulting model, called DistilBERT, is a smaller version of BERT with the same architecture but half the layers, trained to imitate the full output distribution of the teacher (a pre-trained BERT model). DistilBERT is 40% smaller than BERT, retains 97% of its language understanding capabilities, and enables a 60% inference-time speed-up. Further compression was achieved by Jiao et al. (2020) by adding transformer-layer, prediction-layer and embedding-layer distillation. The resulting model, TinyBERT, is 10 times smaller than BERT, with only four layers and reduced embedding sizes. Related methods have been proposed (Sun et al., 2020; Wang et al., 2020), achieving similar compression rates. All these works focus on the distillation of general-purpose language models. Gordon and Duh (2020) investigated the interaction between KD and domain adaptation.
Little attention has been devoted so far to the role of tokenization in the context of model compression. Even in domain adaptation (Gordon and Duh, 2020), the vocabulary was kept the same. Both the versatility of subword-level tokenization and the constraints imposed by the teacher-student framework (same output distribution) have discouraged such investigations. Recently, Samenko et al. (2021) presented an approach for transferring the vocabulary of an LM into a new vocabulary learned from a new domain, with the purpose of boosting the performance of the fine-tuned model. To the best of our knowledge, we are the first to study VT in the scope of model compression.

Vocabulary Transfer
Let us consider an LM trained on a general-purpose domain D_gen and associated with a vocabulary V_gen. Such a vocabulary is used by the LM's tokenizer to produce an encoding of the input string via an embedding matrix E_gen defined on V_gen. More specifically, a tokenizer is a function that maps a textual string into a sequence of symbols of a given vocabulary V. Given a tokenizer T associated with a vocabulary V and a string s, we have

T(s) = (t_1, t_2, ..., t_n), with t_i ∈ V.

Hence, the vocabulary of the tokenizer determines how words in a text are split, whether into words, subwords, or even characters. These symbols, which define the LM's vocabulary, are statistically determined by training the tokenizer on the distribution of a dataset. Now, let us consider a vertical domain D_in, also referred to as in-domain. For the reasons discussed earlier, a vocabulary V_in specialized on D_in fits the language distribution better than V_gen. Unfortunately, with a new vocabulary, the embedding representations associated with the tokens of V_gen would be lost. Thus, VT aims to initialize V_in by re-using most of the information learned by the LM pre-trained on D_gen. Once the new tokenizer T_in has been trained on the in-domain dataset D_in with a given vocabulary size, T_in will differ from the LM's tokenizer T_gen. However, the two vocabularies V_gen and V_in may still have a large portion of their symbols in common. Our objective is to transfer most of the information from V_gen into V_in. To this end, we first define a mapping between each symbol in V_in and a set of symbols in V_gen. Then, we define an assignment criterion, based on this mapping, to obtain the embeddings for the tokens of T_in. The existence of a meaningful mapping is the only underlying assumption of VT, which holds for all phonographic languages.
One such criterion, called Vocabulary Initialization with Partial Inheritance (VIPI), was defined by Samenko et al. (2021). Whenever a token is in V_in but not in V_gen, VIPI computes all the partitions of the new token into tokens from V_gen, takes the minimal partitions, and finally averages their embeddings to obtain an embedding for the new token. In contrast, we define a simplified variant of VIPI called FVT, for Fast Vocabulary Transfer. Instead of computing all tokenizations, FVT uses a straightforward assignment mechanism, whereby each token t_i ∈ V_in is partitioned using T_gen. If t_i belongs to both vocabularies, t_i ∈ V_in ∩ V_gen, then T_gen(t_i) = t_i and the in-domain LM embedding E_in(t_i) is the same as the embedding in the general LM:

E_in(t_i) = E_gen(t_i).    (1)

If instead t_i ∈ V_in \ V_gen, then the in-domain embedding is the average of the embeddings associated with the tokens produced by T_gen:

E_in(t_i) = (1 / |T_gen(t_i)|) Σ_{t_j ∈ T_gen(t_i)} E_gen(t_j).    (2)

Note that Equation 2 is a generalization of Equation 1: in case t_i ∈ V_in ∩ V_gen, Equation 2 falls back to Equation 1.
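The FVT assignment can be sketched in a few lines, with embeddings represented as plain Python lists; the helper names (`fvt_init`, `tokenize_gen`) are ours for illustration, not from the paper's released code.

```python
def fvt_init(new_vocab, gen_emb, tokenize_gen):
    """Fast Vocabulary Transfer (Equations 1 and 2).

    new_vocab    -- tokens of the in-domain vocabulary V_in
    gen_emb      -- dict mapping each token of V_gen to its embedding (list of floats)
    tokenize_gen -- the generic tokenizer T_gen, mapping a token to pieces of V_gen
    """
    in_emb = {}
    for tok in new_vocab:
        if tok in gen_emb:
            # t_i in V_in ∩ V_gen: copy the pre-trained embedding (Eq. 1)
            in_emb[tok] = list(gen_emb[tok])
        else:
            # t_i in V_in \ V_gen: average the embeddings of its T_gen pieces (Eq. 2)
            pieces = [gen_emb[p] for p in tokenize_gen(tok)]
            dim = len(pieces[0])
            in_emb[tok] = [sum(v[d] for v in pieces) / len(pieces) for d in range(dim)]
    return in_emb
```

For example, a new in-domain token split by T_gen into two known pieces receives the mean of their two embeddings as its initialization.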
Once embeddings are initialized with FVT, we adjust the model's weights by training it with masked language modeling (MLM) on the in-domain data before fine-tuning it on the downstream task. MLM eases adaptation and was already found to be beneficial by Samenko et al. (2021). We observed this trend as well during preliminary experiments, and therefore kept this tuning stage in all our experiments.
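The MLM adaptation step uses standard BERT-style masking (15% of positions selected; of those, 80% replaced by [MASK], 10% by a random token, 10% left unchanged). The function below is an illustrative reimplementation of that published scheme, not the authors' code.

```python
import random

def mask_tokens(ids, vocab_size, mask_id, mask_prob=0.15, seed=0):
    """BERT-style MLM masking. Of the selected positions (mask_prob of all),
    80% are replaced by the [MASK] id, 10% by a random token id, and 10%
    are left unchanged. Labels use -100 for positions that are not predicted."""
    rng = random.Random(seed)
    inputs, labels = list(ids), [-100] * len(ids)
    for i, tok in enumerate(ids):
        if rng.random() < mask_prob:
            labels[i] = tok  # the model must recover the original token here
            r = rng.random()
            if r < 0.8:
                inputs[i] = mask_id
            elif r < 0.9:
                inputs[i] = rng.randrange(vocab_size)
            # else: keep the original token unchanged
    return inputs, labels
```

Running this over the in-domain corpus for one epoch (as in our setup) lets the averaged FVT embeddings drift towards their in-domain usage before task fine-tuning.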
As a baseline, we also implement a method called Partial Vocabulary Transfer (PVT), whereby only the tokens belonging to both vocabularies, t_i ∈ V_in ∩ V_gen, are initialized with pre-trained embeddings, while unseen new tokens are randomly initialized.
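In the same toy representation, the PVT baseline differs only in how unseen tokens are handled; a sketch follows (the Gaussian scale 0.02, mirroring BERT's initializer range, is an assumption on our part):

```python
import random

def pvt_init(new_vocab, gen_emb, dim, seed=0):
    """Partial Vocabulary Transfer: copy embeddings of shared tokens,
    initialize tokens outside V_gen at random (no averaging of pieces)."""
    rng = random.Random(seed)
    return {
        tok: list(gen_emb[tok]) if tok in gen_emb
        else [rng.gauss(0.0, 0.02) for _ in range(dim)]
        for tok in new_vocab
    }
```

Comparing `pvt_init` with `fvt_init` isolates the contribution of Equation 2: the two methods agree on shared tokens and differ only on new in-domain tokens.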

Distillation
VT can be combined with other model compression methods like quantization, pruning and KD. For some of these methods the combination is trivial, since they have no impact on the vocabulary. KD, however, requires the vocabularies of the student and the teacher to be aligned; hence, its integration with VT is non-trivial. Accordingly, we set up a KD procedure with VT, in order to determine the effects of applying both to an LM. Our distillation consists of two steps. In the first step, we replicate the distillation process used in (Sanh et al., 2019) for DistilBERT, in which the number of encoder layers is halved and a triple loss function is applied: a distillation loss, an MLM loss, and a cosine embedding loss. However, unlike the original setup, we do not remove the token-type embeddings and the pooler. Inspired by Gordon and Duh (2020), after distilling the student on D_gen, we further distil it on D_in. However, instead of adapting the teacher before the second distillation, we simply distil the student a second time on the in-domain dataset. Finally, we apply VT using either FVT or PVT and fine-tune the student model on the in-domain datasets.
Our choice of applying VT after KD is based on the finding by Kim and Hassan (2020) that different input embedding spaces produce different output embedding spaces. This difference in spaces is not conducive to knowledge transfer during distillation. Hence, if VT were applied to the student first, its input embedding space would differ greatly from that of the pre-trained teacher during distillation.
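The triple loss of the DistilBERT-style step can be sketched on toy values as follows. The weights alpha, beta, gamma are illustrative placeholders, the MLM cross-entropy is passed in precomputed, and the temperature scaling is shown without the usual T^2 gradient correction.

```python
import math

def softmax(logits, T=1.0):
    """Softmax with temperature T."""
    exps = [math.exp(l / T) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distil_loss(student_logits, teacher_logits, T=2.0):
    """Distillation term: cross-entropy between the teacher's softened
    distribution and the student's."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_t, p_s))

def cosine_loss(h_student, h_teacher):
    """Cosine embedding term aligning student and teacher hidden states."""
    dot = sum(a * b for a, b in zip(h_student, h_teacher))
    ns = math.sqrt(sum(a * a for a in h_student))
    nt = math.sqrt(sum(b * b for b in h_teacher))
    return 1.0 - dot / (ns * nt)

def triple_loss(student_logits, teacher_logits, mlm_ce, h_s, h_t,
                alpha=0.5, beta=0.25, gamma=0.25):
    """Weighted sum of the three terms; the weights are illustrative only."""
    return (alpha * distil_loss(student_logits, teacher_logits)
            + beta * mlm_ce
            + gamma * cosine_loss(h_s, h_t))
```

Because the distillation term compares output distributions token by token, it presupposes a shared vocabulary between teacher and student, which is precisely why VT is applied only after both distillation passes.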

Experiments
In the experiments, we measure the impact of FVT on three main KPIs: quality (F1 score), model size, and inference speed-up.

Experimental Setup
For all our experiments, we consider the pre-trained cased version of BERT base (Devlin et al., 2018) as our pre-trained language model. Its tokenizer has a vocabulary of 28,996 wordpieces. We then define four vocabulary sizes for retraining our tokenizers: we take the original vocabulary size, define it as 100%, and subsequently reduce it to 75%, 50%, and 25%.
From now on, we refer to these tokenizers as T_100, T_75, T_50 and T_25 respectively, while the original tokenizer is called T_gen. Models are fine-tuned for 10 epochs with early stopping on the downstream task. We set the initial learning rate to 3·10^-5 and the batch size to 64 for each task. The sequence length is set to 64 for ADE and CoNLL03 and to 128 for LEDGAR. Each configuration is repeated 3 times with different random initializations. MLM is performed for one epoch.

Datasets
To best assess the effectiveness of VT, we apply it to three different tasks from three heterogeneous linguistic domains: medical (ADE), legal (LEDGAR) and news (CoNLL03). Table 4 reports the dataset statistics.
ADE. The Adverse Drug Events (ADE) corpus (Gurulingappa et al., 2012) is a binary sentence classification dataset in the medical domain. This domain is particularly suitable for investigating the benefits of VT, since documents are characterized by frequent technical terms, such as drug and disease names, that are rare in common language. Domain-specific words are usually split into multiple tokens, yielding longer sequences and breaking the semantics of a word into multiple pieces. An example is shown in Figure 2.

LEDGAR. LEDGAR (Tuggener et al., 2020) is a document classification corpus of legal provisions in contracts from the US Securities and Exchange Commission (SEC). The dataset is annotated with 100 mutually-exclusive labels. It is also part of LexGLUE (Chalkidis et al., 2022), a benchmark for legal language understanding.
CoNLL03. CoNLL03 (Tjong Kim Sang and De Meulder, 2003) is a popular Named Entity Recognition (NER) benchmark made of news stories from the Reuters corpus. We chose this corpus because, differently from ADE and LEDGAR, the news domain typically uses a more standard language; hence, we expect its distribution to differ less from the one captured by a general-purpose tokenizer trained on web data. The statistics in Table 1 confirm this hypothesis: the sequence compression gained with domain-specific tokenizers is less significant than on LEDGAR and ADE.

Results
We report an extensive evaluation of FVT on different setups and perspectives.
In-domain Tokenization. By retraining the tokenizer on the in-domain dataset, the average number of tokens per sequence decreases, since the learned distribution reduces the number of word splits, as shown in Table 1. In the medical domain, which is particularly specialized, we notice a remarkable 32% reduction in the average number of tokens per sequence. We expect this to yield a noticeable impact on inference-time speed-up. Furthermore, we can notice in Figure 3 that the sequence length distribution shifts to the left for the learned tokenizers. It can also be observed that, by reducing the vocabulary size of the in-domain tokenizer, the sequence length distribution begins to shift back to the right. Indeed, with fewer tokens in its vocabulary, the tokenizer needs to break words into subwords more frequently.
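The average-tokens-per-sequence statistic behind Table 1 is straightforward to compute; a toy illustration (the two-sentence corpus and the splits are invented for demonstration):

```python
def avg_tokens(corpus, tokenize):
    """Average number of tokens per sequence under a given tokenizer."""
    return sum(len(tokenize(s)) for s in corpus) / len(corpus)

# Invented example: a generic tokenizer fragments the medical term,
# while an in-domain tokenizer keeps it whole.
corpus = ["patient developed thrombocytopenia", "aspirin induced rash"]
generic = lambda s: s.replace("thrombocytopenia", "th rom bo cy top enia").split()
domain = lambda s: s.split()

# Relative reduction in average sequence length from switching tokenizers.
reduction = 1 - avg_tokens(corpus, domain) / avg_tokens(corpus, generic)
```

On real corpora this reduction directly bounds the attainable inference speed-up, since shorter sequences mean fewer (and quadratically cheaper) attention computations.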
Vocabulary Transfer. The results shown in Tables 2 and 3 reveal a few interesting findings. First, the FVT initialization method consistently outperforms the PVT baseline, which confirms the positive contribution of Equation 2. Second, transferring vocabulary with FVT causes limited drops in performance, especially on LEDGAR (the largest dataset), where F1 slightly increases despite a 75% vocabulary reduction. We observed a higher degradation on CoNLL03. We believe this is due to the less specialized nature of the news domain, which reduces the benefits of adapting the vocabulary to it. Overall, the effects of FVT on model performance do not follow a steadily decreasing trend, as might be presumed when reducing the vocabulary size, as is also evident from Figure 4. In some cases, somewhat surprisingly, reducing the vocabulary size yields better model performance; in others, a 50% vocabulary size yields better results than either a full reduction or no reduction at all. Hence, vocabulary size should be treated as a hyper-parameter, where the model selection criteria may vary depending on the application's KPIs, such as acceptable F1 drop, disk occupation and latency constraints.
Vocabulary Transfer and Distillation. The results summarized in Table 3 clearly indicate that KD is complementary to VT: there is no harm in applying them together in terms of performance on the downstream task. Crucially, this guarantees full exploitation of FVT in the scope of language model compression.
Compression and Efficiency. Having shown that VT has limited impact on performance, we now analyze its effects on efficiency and model compression. The reduced input length enabled by in-domain tokenization brings a reduction in inference time: the more specialized the language, the higher the speed-up obtained with in-domain tokenizers. This is confirmed by the experiments, where the major benefits are obtained in the medical domain, with a x1.40 speed-up. On CoNLL03, instead, where the language is much less specialized, the speed-up decreases and even disappears with T_25. Distillation further pushes compression and speed-up in every benchmark and setup, up to about 55% (of which 15% due to VT) and x2.75 respectively.
In summary, depending on the application needs, VT enables a strategic trade-off between compression rate, inference speed and accuracy.

Conclusion
The viability and success of industrial NLP applications often hinge on a delicate trade-off between computational requirements, responsiveness and output quality. Hence, language model compression methods are an active area of research whose practical ramifications are self-evident. One of the factors that greatly contribute to a model's inference speed and memory footprint is vocabulary size. VT has recently been proposed for improving performance, but never so far in the scope of model compression. In this work, we ran an extensive experimental study on the application of a lightweight method for VT, called FVT. An analysis conducted on various downstream tasks, application domains and vocabulary sizes, and on its possible combination with knowledge distillation, indicates that FVT enables a strategic trade-off between compression rate, inference speed and accuracy, especially, but not only, in more specialized domains. Importantly, FVT appears to be orthogonal to other model compression methods.
In the future, we plan to fully integrate Vocabulary Transfer within Knowledge Distillation during the learning process, in order to maximize the information transfer. We also plan to define a unified metric that combines all the KPIs, to facilitate model selection.

Figure 2 :
Figure 2: Example of different tokenizations using a pre-trained or an adapted tokenizer. In the latter case, domain-specific words are not broken down into multiple word pieces.

Figure 3 :
Figure 3: Sequence length distribution of each tokenizer on ADE, LEDGAR and CoNLL03 (left to right).

Figure 4 :
Figure 4: F1 score vs. model size of VT with or without KD on ADE. VT and KD together can further compress an LM's size in exchange for a limited performance drop. FVT is better than PVT. A smaller vocabulary size does not always imply lower performance.

Table 1 :
Average sequence length on the three datasets with different tokenizers. T_gen is the generic tokenizer (BERT cased), the same for each corpus, while the T_% are tokenizers trained on the vertical domain itself, where % indicates the percentage of the original vocabulary size used for training.

Table 2 :
F1 results on the three benchmarks. A pre-trained language model fine-tuned on the task (T_gen) is compared with models having in-domain tokenizers of different sizes (T_100, T_75, T_50, T_25), adapted by transferring information with FVT or PVT.

Table 3 :
F1 results on the three benchmarks. A distilled language model fine-tuned on the task (T_gen) is compared with models having in-domain tokenizers of different sizes (T_100, T_75, T_50, T_25), adapted by transferring information with FVT or PVT.

Table 4 :
Number of examples of each dataset.

Table 5 :
The first row (T_gen) reports the absolute values of the LM fine-tuned on the downstream task without VT or KD. The rows below show values relative to T_gen.