Extremely Small BERT Models from Mixed-Vocabulary Training

Pretrained language models like BERT have achieved strong results on NLP tasks but are impractical to deploy on resource-limited devices because of their memory footprint. A large fraction of this footprint comes from the input embeddings, whose size grows with both the input vocabulary and the embedding dimension. Existing knowledge distillation methods used for model compression cannot be directly applied to train student models with a reduced vocabulary. To this end, we propose a distillation method that aligns the teacher and student embeddings via mixed-vocabulary training. Our method compresses BERT-Large into a task-agnostic model with a smaller vocabulary and hidden dimensions that is an order of magnitude smaller than other distilled BERT models and offers a better size-accuracy trade-off on language understanding benchmarks as well as a practical dialogue task.


Introduction
Recently, pre-trained context-aware language models like ELMo (Peters et al., 2018), GPT (Radford et al., 2019), BERT (Devlin et al., 2018) and XLNet (Yang et al., 2019) have outperformed traditional word embedding models like Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), achieving strong results on a number of language understanding tasks. However, these models are typically too large to host on mobile/edge devices, especially for real-time inference. Recent work has explored, inter alia, knowledge distillation (Ba and Caruana, 2014; Hinton et al., 2015) to train small-footprint student models by implicit transfer of knowledge from a teacher model.
Most distillation methods, however, require the student and teacher output spaces to be aligned. This complicates task-agnostic distillation of BERT into smaller-vocabulary student BERT models, since the input vocabulary is also the output space for the masked language modeling (MLM) task used in BERT. This in turn limits these distillation methods' ability to compress the input embedding matrix, which makes up a major proportion of the model parameters: e.g., the ∼30K input WordPiece embeddings of the BERT-Base model make up over 21% of the model size. This proportion is even higher for most distilled BERT models, owing to these distilled models typically having fewer layers than their teacher BERT counterparts.
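For concreteness, a quick back-of-the-envelope check of that proportion, assuming the standard BERT-Base configuration (30,522-token vocabulary, 768-dimensional embeddings) and its commonly cited total of roughly 110M parameters (both figures are approximations, not taken from this paper):

```python
# Rough estimate of the embedding matrix's share of BERT-Base parameters.
vocab_size, emb_dim = 30522, 768
embedding_params = vocab_size * emb_dim      # ~23.4M parameters in the input embeddings
total_params = 110_000_000                   # BERT-Base total, approximate
print(embedding_params / total_params)       # ~0.21, i.e. just over 21% of the model
```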
We present a task- and model-agnostic distillation approach for training small, reduced-vocabulary BERT models running into a few megabytes. In our setup, the teacher and student models have incompatible vocabularies and tokenizations for the same sequence. We therefore align the student and teacher WordPiece embeddings by training the teacher on the MLM task with a mix of teacher-tokenized and student-tokenized words in each sequence, and then using these student embeddings to train smaller student models. Using our method, we train compact 6- and 12-layer reduced-vocabulary student models which achieve competitive performance in addition to high compression on benchmark datasets as well as a real-world application in language understanding for dialogue.

Related Work
Work in NLP model compression falls broadly into four classes: matrix approximation, weight quantization, pruning/sharing, and knowledge distillation.
The former two seek to map model parameters to low-rank approximations (Tulloch and Jia, 2017) and lower-precision integers/floats (Chen et al., 2015; Zhou et al., 2018; Shen et al., 2019) respectively. In contrast, pruning aims to remove/share redundant model weights (Li et al., 2016; Lan et al., 2019). More recently, dropout (Srivastava et al., 2014) has been used to cut inference latency by early exit (Fan et al., 2019; Xin et al., 2020).

Figure 1: Depiction of our mixed-vocabulary training approach. (Left) Stage I, involving the retrained teacher BERT with its default configuration (e.g., 30K vocabulary, 768 hidden dimensions) and mixed-vocabulary input. (Right) Stage II, involving the student model with a smaller vocabulary (5K) and hidden dimensions (e.g., 256), with embeddings initialized from Stage I.
Another highly relevant line of work focuses on reducing the size of the embedding matrix, either via factorization (Shu and Nakayama, 2018;Lan et al., 2019) or vocabulary selection/pruning (Provilkov et al., 2019;Chen et al., 2019b).

Proposed Approach
Here, we discuss our rationale for reducing the student vocabulary size and the challenges this poses, followed by our mixed-vocabulary distillation approach.

Student Vocabulary
WordPiece (WP) tokens (Wu et al., 2016) are subword units obtained by applying greedy segmentation to a training corpus. Given such a corpus and a desired number of tokens D, a WordPiece vocabulary is generated by selecting D subword tokens such that the corpus, when segmented with the chosen WordPiece model, contains as few WordPieces as possible. The greedy algorithm for this optimization problem is described in more detail in Sennrich et al. (2016). Most published BERT models use a vocabulary of 30522 WordPieces, obtained by running the above algorithm on the Wikipedia and BooksCorpus (Zhu et al., 2015) corpora with a desired vocabulary size D of 30000. For our student model, we chose a target vocabulary size D of 5000 WordPiece tokens. Using the same WordPiece vocabulary generation algorithm and corpus as above, we obtain a 4928-WordPiece vocabulary for the student model. This student vocabulary includes all ASCII characters as separate tokens, ensuring no out-of-vocabulary words upon tokenization with this vocabulary. Additionally, the 30K teacher BERT vocabulary includes 93.9% of the WP tokens in this 5K student vocabulary but does not subsume it. We explore other strategies to obtain a small student vocabulary in Section 6.
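To make the segmentation side of this concrete, below is a minimal Python sketch of greedy longest-match WordPiece segmentation of a single word against a given vocabulary. It illustrates how a trained WordPiece model is applied, not the vocabulary-selection algorithm itself, and the function and variable names are our own rather than the paper's.

```python
# Minimal sketch of greedy longest-match WordPiece segmentation.
# `vocab` is a set of WordPiece tokens; continuation pieces carry a "##" prefix.
def wordpiece_segment(word, vocab, unk="[UNK]"):
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Take the longest vocabulary entry matching at this position.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no match: fall back to the unknown token
        pieces.append(piece)
        start = end
    return pieces

# e.g. with vocab = {"learn", "##ing"}:
# wordpiece_segment("learning", vocab) -> ["learn", "##ing"]
```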
For task-agnostic student models, we reuse BERT's masked language modeling (MLM) task: words in context are randomly masked and predicted given the context via softmax over the model's WP vocabulary. Thus, the output spaces for our teacher (30K) and student (5K) models are unaligned. This, coupled with both vocabularies tokenizing the same words differently, means existing distillation methods do not apply to our setting.

Mixed-vocabulary training
We propose a two-stage approach for implicit transfer of knowledge to the student via the student embeddings, as described below.
Stage I (Student Embedding Initialization): We first train the student embeddings with the teacher model initialized from BERT-Large. For a given input sequence, we mix the vocabularies by randomly selecting (with probability $p_{SV}$, a hyperparameter) words from the sequence to segment using the student vocabulary, with the remaining words segmented using the teacher vocabulary. As shown on the left of Figure 1, for the input ['I', 'like', 'machine', 'learning'], the words 'like' and 'learning' are segmented using the student vocabulary (in blue), and the others using the teacher vocabulary (in green). Similar to Lample and Conneau (2019), this step seeks to align the student and teacher embeddings for the same tokens: the model learns to predict student tokens using context segmented with the teacher vocabulary, and vice versa.
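As a rough illustration of this word-level vocabulary mixing, here is a minimal Python sketch. The names and interface are assumptions: `student_tok` and `teacher_tok` stand in for the two WordPiece tokenizers (callables from a word to its pieces), and the returned per-piece vocabulary tags are only one possible bookkeeping scheme.

```python
import random

# Each whitespace word is independently routed to the student or teacher
# WordPiece tokenizer with probability p_sv, as in Figure 1.
def mixed_segment(words, student_tok, teacher_tok, p_sv=0.5, seed=None):
    rng = random.Random(seed)
    pieces, vocab_ids = [], []   # vocab_ids marks which vocabulary segmented each piece
    for word in words:
        use_student = rng.random() < p_sv
        tok = student_tok if use_student else teacher_tok
        for piece in tok(word):
            pieces.append(piece)
            vocab_ids.append("s" if use_student else "t")
    return pieces, vocab_ids

# e.g. mixed_segment(["I", "like", "machine", "learning"], student_tok, teacher_tok)
# might segment "like" and "learning" with the student vocabulary and the rest
# with the teacher vocabulary.
```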
Note that since the student embeddings have a lower dimension than the teacher embeddings (they are meant for the smaller student model), we project them up to the teacher embedding dimension with a trainable affine layer before they are fed into the teacher BERT. We choose to keep the two embedding matrices separate despite the high token overlap: this is partly to keep our approach robust to settings with lower vocabulary overlap, and partly due to empirical considerations described in Section 6.
Let $\theta_s$/$eb_s$ and $\theta_t$/$eb_t$ denote the transformer-layer and embedding weights of the student and teacher models respectively. The Stage I loss, defined in Equation 1, is the MLM cross entropy summed over the masked positions $M_t$ in the teacher input:

$$L_1 = -\sum_{i \in M_t} \log P_{v_i}(y_i = c_i \mid \theta_t, eb_t, eb_s) \quad (1)$$

Here $y_i$ and $c_i$ denote the predicted and true tokens at position $i$ respectively and can belong to either vocabulary, and $v_i \in \{s, t\}$ denotes the vocabulary used to segment the token at position $i$. Separate softmax layers $P_{v_i}$ are used for token prediction, one per vocabulary, depending on the segmenting vocabulary $v_i$ for token $i$. All teacher parameters ($\theta_t$, $eb_t$) and the student embeddings ($eb_s$) are updated in this stage.
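The following PyTorch sketch illustrates how these pieces could fit together: student pieces are embedded with $eb_s$ and projected up to the teacher width by the trainable affine layer, and a separate softmax head per vocabulary scores each masked position. The module, method, and argument names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedVocabStageI(nn.Module):
    """Illustrative sketch of the Stage I heads; names and shapes are assumptions."""
    def __init__(self, teacher_dim=1024, student_dim=256,
                 teacher_vocab=30522, student_vocab=4928):
        super().__init__()
        self.student_emb = nn.Embedding(student_vocab, student_dim)   # eb_s
        self.up_proj = nn.Linear(student_dim, teacher_dim)            # trainable affine projection
        self.teacher_head = nn.Linear(teacher_dim, teacher_vocab)     # softmax over teacher vocab
        self.student_head = nn.Linear(teacher_dim, student_vocab)     # softmax over student vocab

    def embed_inputs(self, token_ids, vocab_ids, teacher_emb):
        # token_ids: 1-D LongTensor of piece ids; vocab_ids: "s"/"t" per piece;
        # teacher_emb: the teacher's nn.Embedding (eb_t).
        rows = []
        for tid, v in zip(token_ids, vocab_ids):
            if v == "s":
                rows.append(self.up_proj(self.student_emb(tid)))
            else:
                rows.append(teacher_emb(tid))
        return torch.stack(rows)  # [seq_len, teacher_dim], fed into the teacher encoder

    def mlm_loss(self, hidden, labels, vocab_ids, masked_positions):
        # hidden: [seq_len, teacher_dim] teacher encoder outputs; labels: gold piece ids.
        # Cross entropy is summed over masked positions M_t, using the softmax head
        # matching each position's segmenting vocabulary (Equation 1).
        loss = hidden.new_zeros(())
        for i in masked_positions:
            head = self.student_head if vocab_ids[i] == "s" else self.teacher_head
            logits = head(hidden[i]).unsqueeze(0)            # [1, vocab_size]
            loss = loss + F.cross_entropy(logits, labels[i].view(1))
        return loss
```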
Stage II (Student Model Layers): With the student embeddings initialized in Stage I, we now train the student model normally, i.e., using only the student vocabulary and discarding the teacher model. Equation 2 shows the student MLM loss, where $M_s$ is the set of positions masked in the student input; all student model parameters ($\theta_s$, $eb_s$) are updated.

$$L_2^s = -\sum_{i \in M_s} \log P_s(y_i = c_i \mid \theta_s, eb_s) \quad (2)$$

Experiments
For evaluation, we finetune the student model just as one would finetune the original BERT model, i.e., without using the teacher model or any task-specific distillation. We describe our experiments below, with dataset details left to the appendix.

Evaluation Tasks and Datasets
We fine-tune and evaluate the distilled student models on two classes of language understanding tasks.
GLUE benchmark tasks:
• MNLI: Multi-Genre Natural Language Inference (Williams et al., 2018), a 3-way sentence-pair classification task with 393K training instances.
• SST-2: Stanford Sentiment Treebank (Socher et al., 2013), a 2-way sentence classification task with 67K training instances.
Spoken Language Understanding: Since we are also interested in edge-device applications, we evaluate on spoken language understanding, a practical task in dialogue systems. We use the SNIPS dataset (Coucke et al., 2018) of ∼14K virtual assistant queries, each comprising one of 7 intents and values for one or more of 39 pre-defined slots. The intent detection and slot filling subtasks are modeled respectively as 7-way sentence classification and sequence tagging with IOB slot labels.

Models and Baselines
For GLUE, we train student models with 6 and 12 layers, 4 attention heads, and embedding/hidden dimensions fixed to 256, each using the compact 5K-WP vocabulary. We also evaluate baselines without knowledge distillation (NoKD), parameterized identically to the distilled student models (incl. the 5K vocabulary) and trained from scratch on the teacher's MLM objective. We further compare our models on GLUE with existing distilled BERT approaches, namely PKD, DistilBERT, TinyBERT, MobileBERT, and BERT-of-Theseus, discussed alongside the results below.
For SNIPS, we shift our focus to smaller, low-latency models for on-device use cases. Here, we train student models with 6 layers and embedding/hidden dimensions ∈ {96, 192, 256}. The smaller models here may not be competitive on GLUE but are adequate for practical tasks such as spoken LU. We compare with two strong baselines (the first of which is sketched below):
• BERT-Base (Chen et al., 2019a), with the intent and IOB slot tags predicted using the [CLS] token and the first WP token of each word respectively, and
• StackProp (Qin et al., 2019), which uses a series of smaller recurrent and self-attentive encoders.
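Below is a minimal PyTorch sketch of the intent/slot prediction scheme used in the first baseline (and of how a BERT-style student encoder can be used for this task): the [CLS] output feeds a 7-way intent classifier, and the output at the first WordPiece of each word feeds an IOB slot tagger. Class and argument names, and the exact slot label count, are illustrative assumptions.

```python
import torch
import torch.nn as nn

class IntentSlotHeads(nn.Module):
    """Illustrative intent/slot heads on top of a BERT-style encoder."""
    def __init__(self, hidden_dim=256, num_intents=7, num_slot_labels=79):
        # num_slot_labels assumes B-/I- tags for 39 slots plus O (an assumption).
        super().__init__()
        self.intent_clf = nn.Linear(hidden_dim, num_intents)
        self.slot_clf = nn.Linear(hidden_dim, num_slot_labels)

    def forward(self, encoder_out, first_wp_index):
        # encoder_out: [seq_len, hidden_dim] encoder outputs, position 0 = [CLS].
        # first_wp_index: LongTensor holding the index of the first WordPiece of each word.
        intent_logits = self.intent_clf(encoder_out[0])           # 7-way intent classification
        slot_logits = self.slot_clf(encoder_out[first_wp_index])  # one IOB tag per word
        return intent_logits, slot_logits
```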

Training Details
Distillation: For all our models, we train the teacher model with mixed-vocabulary inputs (Stage I) for 500K steps, followed by 300K steps of training just the student model (Stage II). We use the same corpora as the teacher model, i.e., BooksCorpus (Zhu et al., 2015) and English Wikipedia. For both stages, up to 20 input tokens were masked for MLM. In Stage I, up to 10 of these masked tokens were tokenized using the teacher vocabulary, and the rest using the student vocabulary.
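The per-sequence masking budget described above could be realized roughly as in the sketch below. The paper specifies only the caps (20 masked tokens overall, at most 10 from teacher-segmented positions), so the random selection procedure and names here are assumptions.

```python
import random

def pick_masked_positions(vocab_ids, max_masks=20, max_teacher_masks=10, seed=None):
    # vocab_ids: "s"/"t" per token position, as produced by the mixed segmentation.
    rng = random.Random(seed)
    teacher_pos = [i for i, v in enumerate(vocab_ids) if v == "t"]
    student_pos = [i for i, v in enumerate(vocab_ids) if v == "s"]
    rng.shuffle(teacher_pos)
    rng.shuffle(student_pos)
    picked = teacher_pos[:max_teacher_masks]                 # at most 10 teacher-vocab positions
    picked += student_pos[:max(0, max_masks - len(picked))]  # fill the rest from student-vocab positions
    return sorted(picked)
```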
We optimize the loss using LAMB (You et al., 2019) with a maximum learning rate of 0.00125, linear warmup over the first 10% of steps, a batch size of 2048, and a sequence length of 128. Distillation was done on Cloud TPUs in an 8x8 pod configuration. $p_{SV}$, the probability of segmenting a Stage I input word using the student vocabulary, is set to 0.5.
Finetuning: For all downstream task evaluations on GLUE, we finetune for 10 epochs using LAMB with a learning rate of 0.0001 and a batch size of 64. For all experiments on SNIPS, we use ADAM with a learning rate of 0.0001 and a batch size of 64.
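As a small illustration, the linear-warmup schedule used with LAMB during distillation could look roughly like the sketch below; the shape of any post-warmup decay is not stated in the paper, so holding the rate constant afterwards is an assumption.

```python
def learning_rate(step, total_steps=500_000, peak_lr=0.00125, warmup_frac=0.10):
    # Linear warmup to the peak rate over the first 10% of steps (Stage I settings);
    # the constant tail after warmup is an assumption, since the decay is unspecified.
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr
```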

Results
GLUE: Table 1 shows results on downstream GLUE tasks and model sizes for our proposed models, BERT-Base/Large, and the baselines. Our models consistently improve upon the identically parameterized NoKD baselines, indicating that mixed-vocabulary training is better than training from scratch and avoids a large teacher-student performance gap. Compared with PKD and DistilBERT, our 6-layer model outperforms PKD-3 while being >7x smaller, and our 12-layer model is comparable to PKD-6 and DistilBERT-4 while being ∼5-6x smaller.
Interestingly, our models do particularly well on the MRPC task: the 6-layer distilled model performs almost as well as PKD-6 while being over 10x smaller. This may be due to our smaller models being data-efficient on the smaller MRPC dataset.
TinyBERT and BERT-of-Theseus are trained in a task-specific fashion, i.e., a teacher model already finetuned on the downstream task is used for distillation. TinyBERT's non-task-specific results are reported on the GLUE dev sets and are therefore not directly comparable with ours. Even so, our 12-layer model performs credibly compared with the two, presenting a competitive size-accuracy tradeoff, particularly when compared to the 6x larger BERT-of-Theseus.
MobileBERT performs strongly for its size while being task-agnostic. Our 12-layer model, in comparison, retains ∼98% of its performance with 57% fewer parameters and may thus be better suited for use on highly resource-limited devices.
TinyBERT sees major gains from task-specific data augmentation and distillation, and MobileBERT from student architecture search and bottleneck layers. Notably, our technique targets the student vocabulary without conflicting with any of the above methods and can, in fact, be combined with them for even smaller models.
SNIPS: Table 2 shows results on the SNIPS intent and slot tasks for our models and two state-of-the-art baselines. Our smallest 6-layer model retains over 95% of the BERT-Base model's slot filling F1 score (Sang and Buchholz, 2000) while being 30x smaller (<10 MB without quantization) and 57x faster on a mobile device, yet task-agnostic. Our larger distilled models also demonstrate strong performance (0.2-0.5% higher slot F1 than the respective NoKD baselines) with small model sizes and latencies low enough for real-time inference. This indicates that small multi-task BERT models (Tsai et al., 2019) present better size, accuracy, and latency trade-offs for on-device usage than recurrent encoder-based models such as StackProp.

Discussion
Impact of vocabulary size: We trained a model from scratch identical to BERT-Base except for using our 5K-WP student vocabulary. On the SST-2 and MNLI-m dev sets, this model obtained 90.9% and 83.7% accuracy respectively, only 1.8% and 0.7% lower than BERT-Base.
Since embeddings account for a larger fraction of the model parameters when there are fewer layers, we trained another model identical to our 6×256 model, but with a 30K-WP vocabulary and teacher label distillation. This model showed small gains (0.1% / 0.5% accuracy on SST-2 / MNLI-m dev) over our analogous distilled model, but with 30% more parameters solely due to the larger vocabulary.
This suggests that a small WordPiece vocabulary may be almost as effective for sequence classification/tagging tasks, especially for smaller BERT models and up to moderately long inputs. Curiously, increasing the student vocabulary size to 7K or 10K did not lead to an increase in performance on GLUE. We surmise that this may be due to underfitting owing to the embeddings accounting for a larger proportion of the model parameters.
Alternative vocabulary pruning: Probing other strategies for a small-vocabulary model, we used the above 6×256 30K-WP vanilla distilled model to obtain a smaller model by pruning the vocabulary to contain the intersection of the 30K and 5K vocabularies (total 4629 WPs). This model is 1.2% smaller than our 4928-WP distilled model, but drops 0.8% / 0.7% on SST-2/MNLI-m dev sets.
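A minimal sketch of this intersection-pruning baseline, assuming a NumPy embedding matrix and token-to-index dictionaries (the names are ours): keep only the WordPieces present in both vocabularies and slice out the corresponding rows of the 30K embedding matrix.

```python
import numpy as np

def prune_to_intersection(large_vocab, small_vocab, embedding_matrix):
    # large_vocab / small_vocab: dicts mapping token -> row index;
    # embedding_matrix: np.ndarray of shape [len(large_vocab), dim].
    kept = sorted(set(large_vocab) & set(small_vocab), key=lambda t: large_vocab[t])
    new_vocab = {tok: i for i, tok in enumerate(kept)}
    new_matrix = embedding_matrix[[large_vocab[t] for t in kept]]  # e.g. ~4629 rows kept
    return new_vocab, new_matrix
```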
Furthermore, to exploit the high overlap in vocabularies, we tried running our distillation pipeline with the embeddings of student tokens that also appear in the teacher vocabulary tied (after projection up to the teacher dimension) to the corresponding teacher embeddings. This model, however, dropped 0.7% / 0.5% on SST-2/MNLI-m compared to our analogous 6×256 distilled model.
We also tried pretraining BERT-Large from scratch with the 5K vocabulary and performing vanilla distillation into a 6×256 student: this model dropped 1.2% / 0.7% on SST-2/MNLI-m relative to our comparable distilled model, indicating the efficacy of mixed-vocabulary training over vanilla distillation.

Conclusion
We propose a novel approach to knowledge distillation for BERT, focusing on using a significantly smaller vocabulary for the student BERT models. Our mixed-vocabulary training method encourages implicit alignment of the teacher and student WordPiece embeddings. Our highly compressed 6- and 12-layer distilled student models are optimized for on-device use cases and demonstrate competitive performance on both benchmark datasets and practical tasks. Our technique is unique in targeting the student vocabulary size, enabling easy combination with most BERT distillation methods.