MVP-BERT: Multi-Vocab Pre-training for Chinese BERT

Although the development of pre-trained language models (PLMs) has significantly raised the performance of various Chinese natural language processing (NLP) tasks, two issues remain. First, the vocabulary (vocab) of these Chinese PLMs is still the one provided by Google Chinese BERT (Devlin et al., 2019), which is based on Chinese characters (chars). Second, the masked language model pre-training is based on a single vocab, limiting downstream task performance. In this work, we first experimentally demonstrate that building a vocab via Chinese word segmentation (CWS) guided sub-word tokenization (SGT) can improve the performance of Chinese PLMs. Then we propose two versions of multi-vocab pre-training (MVP), Hi-MVP and AL-MVP, to improve the models' expressiveness. Experiments show that: (a) MVP training strategies improve PLMs' downstream performance, especially on span-level tasks; (b) our AL-MVP outperforms the recent AMBERT (Zhang & Li, 2020) after large-scale pre-training and is more robust against adversarial attacks.


Introduction
The pre-trained language models (PLMs), including BERT (Devlin et al., 2019) and its variants, have proven beneficial for many natural language processing (NLP) tasks, such as text classification, question answering (Rajpurkar et al., 2018), natural language inference (NLI) (Bowman et al., 2015), and relation extraction (Zhu et al., 2020), in English, Chinese, and many other languages. Although they bring impressive improvements for Chinese NLP tasks, most Chinese PLMs still use the vocabulary (vocab) provided by Google Chinese BERT (Devlin et al., 2019).
Google Chinese BERT is a character (char) based model, since it splits Chinese characters with blank spaces. In the pre-BERT era, part of the literature on Chinese NLP first performs Chinese word segmentation (CWS) to divide the text inputs into sequences of words and then uses a word-based vocab in NLP models (Xu et al., 2015; Zou et al., 2013). There has been much debate over which vocab a Chinese NLP model should adopt.
The advantages of char-based models are apparent. First, a char-based vocab is smaller, thus reducing the model size. Second, it does not rely on CWS, thus avoiding word segmentation errors; this can directly result in performance gains in span-based tasks such as named entity recognition (NER). Third, char-based models are less vulnerable to data sparsity or the presence of out-of-vocab (OOV) words and thus less prone to over-fitting. However, word-based models have their own advantages. First, they produce shorter sequences than their char-based counterparts and are thus faster. Second, words are less ambiguous, which helps models learn the semantic meanings of words. Third, with a word-based model, exposure bias may be reduced in text generation tasks (Zhao et al., 2013). Another branch of literature tries to balance the two by combining word-based embeddings with char-based embeddings (Yin et al., 2016; Dong et al., 2016).
This article tries to strike a balance between char-based and word-based models and provides alternative approaches for pre-training Chinese PLMs. We experiment with three approaches to building a vocab for Chinese PLMs: (1) following Devlin et al. (2019), separate the Chinese chars with white spaces and then learn a sub-word tokenizer (denoted as CHAR); (2) first segment the sentences with a CWS toolkit such as jieba and then learn a sub-word tokenizer (denoted as SGT); (3) do CWS, keep the high-frequency words as tokens, and tokenize the low-frequency words into smaller units (denoted as SEG). See Figure 1 for the workflow of each approach on an input sentence. The experiments show that SGT is best suited for PLMs.
Inspired by the previous work that incorporates multiple vocabularies (vocabs) or naturally combines multiple vocabs (Yin et al., 2016;Dong et al., 2016;Zhang & Li, 2020), we also investigate a series of strategies, which we will call Multi-Vocab Pre-training (MVP) strategies. The first version of MVP incorporates a hierarchical structure to combine the char-based vocab and word-based vocab. From the viewpoint of model forward pass, Chinese characters' embeddings are aggregated to form the vector representations of multi-gram words or tokens, which are fed into transformer encoders. Then the word-based vocab will be used in masked language model (MLM) training. The second version of MVP (denoted as AL-MVP) is to employ an additional vocab to form an auxiliary loss term in MLM, enhancing the PLM's ability to capture the contextual information.
Extensive experiments and ablation studies are conducted. We select BPE implemented by sentencepiece as the sub-word tokenization model, and ALBERT (Lan et al., 2019) (tiny and base models) as our PLMs. Pre-training is done on the Chinese Wikipedia corpus (C-1) and a larger corpus we collect (C-2). The MVP strategies are compared on a series of Chinese benchmark datasets, two of which are sentence classification (CLS) tasks, two are named entity recognition (NER) tasks, and the remaining two are machine reading comprehension (MRC) tasks. The experimental results reveal the following take-aways: 1) combining CWS and sub-word tokenization yields the best vocab for Chinese PLMs; 2) MVP strategies can improve a single-vocab model on all three types of tasks.
We summarize the contributions of this work as follows.
• We validate that combining CWS and sub-word tokenization is a better way to build vocabs for Chinese PLMs.
• We propose novel MVP pre-training strategies for enhancing Chinese PLMs and show that they are effective.

Related work
Since Devlin et al. (2019), a large body of literature on pre-trained language models has appeared and pushed the NLP community forward at a speed never witnessed before. Peters et al. (2018) is one of the earliest PLMs that learns contextualized representations of words. GPTs (Radford et al., 2018, 2019) and BERT (Devlin et al., 2019) take advantage of the Transformer (Vaswani et al., 2017). GPTs are uni-directional and make predictions on the input text in an auto-regressive manner, while BERT is bi-directional and makes predictions on the whole or part of the input text. At its core, what makes BERT so powerful are its pre-training tasks, i.e., masked language modeling (MLM) and next sentence prediction (NSP), where the former is more important than the latter. Since BERT, a series of improvements have been proposed. The first branch of literature improves the model architecture of BERT. ALBERT (Lan et al., 2019) makes BERT more lightweight via embedding factorization and cross-layer parameter sharing. Zaheer et al. (2020) improve BERT's performance on longer sequences by employing sparser attention. The second branch of literature improves the training of BERT, stabilizing and improving it with larger corpora. More work has focused on new language pre-training tasks. ALBERT (Lan et al., 2019) introduces sentence order prediction (SOP). StructBERT designs two novel pre-training tasks, a word structural task and a sentence structural task, to learn better representations of tokens and sentences. ERNIE 2.0 proposes a series of pre-training tasks and applies continual learning to incorporate them. ELECTRA (Clark et al., 2020) uses a GAN-style pre-training task to efficiently utilize all tokens in pre-training. Our work is closely related to this branch: we design a series of novel pre-training objectives by incorporating multiple vocabularies. Our proposed method is off-the-shelf and can be easily combined with other pre-training tasks.
Another branch of literature looks into the role of words in pre-training. Although not mentioned in Devlin et al. (2019), the authors propose whole word masking in their open-source repository, which is effective for pre-training BERT. In SpanBERT, text spans are masked in pre-training, and the learned model substantially enhances the performance of span selection tasks. Prior work also indicates that word segmentation is vital for Chinese PLMs: masking tokens in units of natural Chinese words instead of single Chinese characters significantly improves Chinese PLMs, and applying CWS to build a vocab improves Chinese-English translation performance. AMBERT (Zhang & Li, 2020) proposes to leverage vocabs of different granularity in encoding sentences and improves pre-training. Compared to this literature, our contributions are: (a) we find that combining CWS and sub-word tokenization can improve pre-trained models' performance on downstream tasks; (b) we propose MVP pre-training tasks, which are proven to improve the expressiveness of pre-trained models and downstream performance.

Our methods
This section presents our methods for rebuilding the vocab for Chinese PLMs and introduces our series of MVP strategies.

Building the vocabs
We investigate four workflows to process the text inputs, each corresponding to a different vocab (or group of vocabs) (Figure 1). We first introduce the single-vocab models: CHAR, SEG, and SGT.
For char-based vocab CHAR, Chinese characters in the corpus are treated as words in English and are separated with blank spaces, and a sub-word tokenizer is learned. This method is essentially how BERT (Devlin et al., 2019) builds the Chinese vocab.
SGT (short for segmentation-guided tokenization) requires the corpus sentences to be segmented with a CWS tool, and a sub-word tokenizer such as BPE is learned on the segmented sentences. Under SGT, some natural Chinese words will be split into pieces, but many tokens still contain multiple Chinese chars.
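The SGT workflow can be pictured with a short sketch, assuming jieba as the CWS toolkit and the sentencepiece BPE trainer; the corpus path, output prefix, and character-coverage setting are illustrative placeholders rather than our exact pre-processing script.

```python
import jieba
import sentencepiece as spm

# 1) Segment the raw corpus (one sentence per line) with a CWS toolkit, so that
#    the BPE trainer sees natural Chinese words instead of raw character streams.
with open("corpus_raw.txt", encoding="utf-8") as fin, \
        open("corpus_seg.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(" ".join(jieba.cut(line.strip())) + "\n")

# 2) Learn a BPE sub-word tokenizer on the segmented corpus: long-tail words get
#    split into pieces, while frequent multi-char words survive as whole tokens.
spm.SentencePieceTrainer.train(
    input="corpus_seg.txt",
    model_prefix="sgt",
    vocab_size=31692,           # the SGT size found to work best in our experiments
    model_type="bpe",
    character_coverage=0.9995,  # a common choice for Chinese; an assumption here
)

# 3) Tokenize new text: segment with the same CWS tool, then apply the BPE model.
sp = spm.SentencePieceProcessor(model_file="sgt.model")
print(sp.encode(" ".join(jieba.cut("我喜欢自然语言处理")), out_type=str))
```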
Finally, SEG (short for segmentation) with size N is built with the following procedure: (a) perform CWS on the corpus; (b) tokenize long-tail Chinese words and non-Chinese tokens into units that have high frequencies; (c) sort the vocab by frequency, and if the most frequent N words or tokens cover R percent of the corpus, take them as the vocab; otherwise, redo (b).
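For comparison, the SEG procedure above can be sketched as a frequency-coverage loop; treating the split of long-tail words into single characters as step (b) is a simplifying assumption, since the exact re-tokenization rule is not fully specified here.

```python
from collections import Counter

def build_seg_vocab(seg_sentences, n=72635, coverage=0.9999):
    """Sketch of SEG: keep the top-n frequent units once they cover `coverage` of the corpus."""
    # (a) seg_sentences: the corpus after CWS, one list of words per sentence
    tokens = [w for sent in seg_sentences for w in sent]
    while True:
        counts = Counter(tokens)
        vocab = {w for w, _ in counts.most_common(n)}
        kept = sum(c for w, c in counts.items() if w in vocab)
        # (c) stop once the most frequent n units cover R percent of the corpus
        if kept / len(tokens) >= coverage:
            return vocab
        # (b) re-tokenize long-tail units into higher-frequency pieces
        #     (split into single characters here, an illustrative simplification)
        new_tokens = [p for w in tokens
                      for p in ([w] if w in vocab else list(w))]
        if new_tokens == tokens:  # nothing left to split; avoid looping forever
            return vocab
        tokens = new_tokens
```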
Note that SEG is essentially how AMBERT (Zhang & Li, 2020) builds the vocab for its Chinese PLM. However, AMBERT does not learn a sub-word tokenizer after CWS, which makes our SGT different from theirs. Our experiments show that SGT yields comparatively better PLMs.

Multi-vocab pre-training (MVP)
In this subsection, we will introduce MVP, a series of natural extensions to the MLM task by Devlin et al. (2019).

Hierarchical MVP
We first introduce hierarchical MVP (Hi-MVP); Figure 2(a) illustrates how it processes input sentences. Two vocabs, a more fine-grained vocab V_f and a more coarse-grained vocab V_c, are combined hierarchically. Sequences are first tokenized via V_c, and then the Chinese tokens (if they contain multiple Chinese chars) are split into single chars. Thus V_f consists of Chinese chars and the non-Chinese tokens from V_c. Chinese chars and non-Chinese tokens are then embedded into vectors. The representations of the chars inside a token are aggregated into the representation of that token, which is fed into the transformer encoder. In this work, we apply a convolutional network (with kernel size 3 and the number of channels equal to the embedding size) followed by max-pooling to convert each char sequence into a fixed-size token-level representation.
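A minimal PyTorch sketch of this char-to-token aggregator; the ReLU nonlinearity and tensor layout are assumptions on top of what the text specifies (kernel size 3, channels equal to the embedding size, max-pooling).

```python
import torch
import torch.nn as nn

class CharToTokenAggregator(nn.Module):
    """Aggregate the char embeddings inside one V_c token into a single token vector."""

    def __init__(self, emb_size: int):
        super().__init__()
        # kernel size 3 and #channels equal to the embedding size, as described above
        self.conv = nn.Conv1d(emb_size, emb_size, kernel_size=3, padding=1)

    def forward(self, char_embs: torch.Tensor) -> torch.Tensor:
        # char_embs: (num_tokens, max_chars_per_token, emb_size)
        h = self.conv(char_embs.transpose(1, 2))   # -> (num_tokens, emb_size, max_chars)
        h = torch.relu(h)                          # nonlinearity: an assumption
        return h.max(dim=-1).values                # max-pool over chars -> (num_tokens, emb_size)

# "喜欢" has two chars, so its two char vectors collapse into one token vector,
# which is then fed into the transformer encoder with the other token vectors.
agg = CharToTokenAggregator(emb_size=128)
token_vec = agg(torch.randn(1, 2, 128))            # shape: (1, 128)
```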
During the MLM task, whole word masking is applied: we mask 15% of the tokens in V_c. For example, in Figure 2(a), "喜欢" (like) is masked, so in the char sequence the two tokens "喜" and "欢" are masked. A classifier is designated to predict the masked V_c token "喜欢". Let x and y denote the token sequences, with lengths l_x and l_y, for the same sentence under V_c and V_f respectively, in which a part of the tokens are masked. Denote x_mask as the masked tokens under V_c. The loss function for Hi-MVP is

$$\mathcal{L}_{\text{Hi-MVP}} = -\sum_{i=1}^{l_x} I_{x_i} \log P\left(x_i \mid x \backslash x_{\text{mask}}\right), \quad (1)$$

in which I_{x_i} is a binary variable indicating whether the i-th token is masked in x.
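A sketch of how Eq. (1) could be computed, assuming `logits_c` holds the classifier's predictions over V_c at every coarse-token position; names and the normalization comment reflect common practice rather than our released code.

```python
import torch
import torch.nn.functional as F

def hi_mvp_loss(logits_c: torch.Tensor, x_ids: torch.Tensor, mask_x: torch.Tensor) -> torch.Tensor:
    """Eq. (1): negative log-likelihood of the gold V_c tokens, summed over masked positions.

    logits_c: (l_x, |V_c|) predictions at every coarse-grained token position
    x_ids:    (l_x,)       gold V_c token ids
    mask_x:   (l_x,)       I_{x_i}, 1.0 if the i-th coarse token is masked, else 0.0
    """
    log_probs = F.log_softmax(logits_c, dim=-1)
    nll = -log_probs.gather(-1, x_ids.unsqueeze(-1)).squeeze(-1)  # -log P(x_i | ...)
    return (nll * mask_x).sum()  # in practice this is usually averaged over masked tokens
```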

Auxiliary loss MVP
Figure 2(b) depicts another version of MVP. In this method, a sentence is tokenized and embedded with a fine-grained vocab V_f (e.g., a char-based vocab), and an MLM task on V_f is conducted. Different from the vanilla MLM, however, an auxiliary MLM loss based on a more coarse-grained vocab V_c is added; we therefore call this method auxiliary loss MVP (AL-MVP). For example, the encoded representations of the chars "喜" and "欢" inside the word "喜欢" are aggregated into the vector representation of the word, and an auxiliary MLM layer is tasked to predict the word based on V_c. For the aggregator, we adopt the BERT-style pooler, which uses the starting char's representation as the word's representation. Let x and y denote the token sequences for the same sentence under V_f and V_c, and denote x_mask and y_mask as the masked tokens under V_f and V_c, respectively. The loss function for AL-MVP is

$$\mathcal{L}_{\text{AL-MVP}} = -\sum_{i=1}^{l_x} I_{x_i} \log P\left(x_i \mid x \backslash x_{\text{mask}}\right) - \lambda \sum_{i=1}^{l_y} I_{y_i} \log P\left(y_i \mid x \backslash x_{\text{mask}}\right), \quad (2)$$

in which I_{x_i} and I_{y_i} are binary variables indicating whether the i-th token is masked in sequence x and y, respectively, and λ is a coefficient that measures the relative importance of the auxiliary MLM task.
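A sketch of the AL-MVP objective in Eq. (2) on top of a char-level encoder; the module and variable names, and the use of `F.cross_entropy`, are illustrative choices rather than our released implementation. Only the encoder is kept at fine-tuning time, so these heads add no inference cost.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ALMVPHeads(nn.Module):
    """Main MLM head over the fine-grained vocab V_f plus an auxiliary head over V_c."""

    def __init__(self, hidden: int, vf_size: int, vc_size: int, lam: float = 0.5):
        super().__init__()
        self.main_head = nn.Linear(hidden, vf_size)  # predicts masked V_f tokens
        self.aux_head = nn.Linear(hidden, vc_size)   # predicts masked V_c words
        self.lam = lam                               # λ in Eq. (2)

    def forward(self, enc, x_ids, mask_x, word_start, y_ids, mask_y):
        # enc:        (l_x, hidden) encoder outputs for the char-level (V_f) sequence
        # word_start: (l_y,)        index of the first char of each V_c word (BERT-style pooler)
        main = F.cross_entropy(self.main_head(enc), x_ids, reduction="none")
        word_repr = enc[word_start]                  # (l_y, hidden)
        aux = F.cross_entropy(self.aux_head(word_repr), y_ids, reduction="none")
        # Eq. (2): loss over masked positions only, with the auxiliary term weighted by λ
        return (main * mask_x).sum() + self.lam * (aux * mask_y).sum()
```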
Note that AL-MVP is different from AMBERT's architecture (Figure 2(c)). In AMBERT, a sequence has to be encoded twice with different vocabs. In contrast, AL-MVP is a plug-in pre-training strategy: during inference, the PLM is the same as the original PLM.
For notational convenience, we denote the model pre-trained with the Hi-MVP strategy and vocab V as Hi-MVP(V). AL-MVP with a fine-grained vocab V_f and a coarse-grained vocab V_c is denoted as AL-MVP(V_f, V_c).

Setup
Two corpora are used for pre-training. The first is Chinese Wikipedia (C-1), on which we conduct most of the experiments and ablation studies. We then use the other corpus (C-2) to match the SOTA performances. C-2 has 25 million documents and thus approximately the same size as the Chinese corpus used by AMBERT (Zhang & Li, 2020). CHAR's vocab size is set to 21,128, the same as Google Chinese BERT. We consider three vocab sizes for SGT: {21,128, 31,692, 72,635}. We will show in the experiments that SGT works best with vocab size 31,692, so for the experiments with AL-MVP we only consider SGT with vocab size 31,692. We set the vocab size of SEG to 72,635, the same as AMBERT. Table 1 reports the basic statistics for the tokens in these vocabs. As the vocab size goes up, the vocab includes more and more phrase-level tokens (# Chinese chars ≥ 2).
For Hi-MVP, we consider Hi-MVP(SGT) and Hi-MVP(SEG). For AL-MVP, we consider AL-MVP(CHAR, SGT), AL-MVP(CHAR, SEG), and AL-MVP(SGT, SEG). The relative importance coefficient λ in Eq. 2 is tuned over {0.1, 0.5, 1.0, 2.0, 10.0} by training on a small corpus of 100k sentences and a small dev set of 5k sentences. We finally select λ = 0.5 for all models.
For pre-training, whole word masking is adopted: a total of 15% of the words (from CWS) in the corpus are chosen. Following BERT (Devlin et al., 2019), 80% of the chosen words are masked, 10% are replaced by a random word, and the rest remain unchanged. For AL-MVP, 1/3 of the time masked tokens from the fine-grained vocab are predicted, 1/3 of the time masked tokens from the coarse-grained vocab are predicted, and the rest of the time masked tokens from both vocabs are predicted.
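A hedged sketch of this masking scheme; drawing the 10% replacement word from the same sentence (rather than from the vocab) is a simplification, and the schedule function merely mirrors the 1/3-1/3-1/3 split described above.

```python
import random

MASK = "[MASK]"

def whole_word_mask(words, mask_rate=0.15):
    """Choose ~15% of the CWS words; 80% masked, 10% randomly replaced, 10% kept (whole-word)."""
    out, is_masked = list(words), [False] * len(words)
    for i, w in enumerate(words):
        if random.random() >= mask_rate:
            continue
        is_masked[i] = True
        r = random.random()
        if r < 0.8:
            out[i] = MASK                  # every char of the word is masked
        elif r < 0.9:
            out[i] = random.choice(words)  # replaced by a random word (simplified: same sentence)
        # else: the word is kept unchanged but still predicted
    return out, is_masked

def almvp_prediction_targets():
    """AL-MVP: 1/3 fine-grained only, 1/3 coarse-grained only, 1/3 both vocabs."""
    return random.choice(["fine", "coarse", "both"])
```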
In this article, all models use ALBERT as the encoder, under two different settings. The first is a smaller ALBERT model (ALBERT-tiny): the number of layers is 3, the embedding size is 128, and the hidden size is 256. We use this setting for extensive comparisons and ablation studies. The second configuration is the same as ALBERT-base; we pre-train the best AL-MVP model under it to show that our method also works for larger language models. Other ALBERT configurations remain the same as in ALBERT (Lan et al., 2019). The pre-training hyper-parameters are almost the same as for ALBERT (Lan et al., 2019), and the maximum sequence length is 512; for AL-MVP, the sequence length is counted under the more fine-grained vocab. The batch size is 1024, and all models are trained for 12.5k steps. The pre-training optimizer is LAMB with a learning rate of 1e-4. For fine-tuning, the sequence length is 256, the learning rate is 2e-5, the optimizer is Adam (Kingma & Ba, 2015), and the batch size is set to a power of 2 such that each epoch contains fewer than 500 steps. Each model is run on a given task 10 times, and the average performance scores are reported for reproducibility.
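For reference, the settings above can be collected into a single configuration sketch; values not stated in the text (warm-up, weight decay, etc.) are deliberately omitted.

```python
# ALBERT-tiny encoder and the pre-training / fine-tuning settings reported above.
PRETRAIN_CONFIG = {
    "num_layers": 3,
    "embedding_size": 128,
    "hidden_size": 256,
    "max_seq_length": 512,   # counted under the more fine-grained vocab for AL-MVP
    "batch_size": 1024,
    "train_steps": 12_500,
    "optimizer": "LAMB",
    "learning_rate": 1e-4,
}

FINETUNE_CONFIG = {
    "max_seq_length": 256,
    "optimizer": "Adam",
    "learning_rate": 2e-5,
    # batch size: a power of 2 chosen so that each epoch has fewer than 500 steps
}
```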

Baseline models
The first group of baselines is the original Google Chinese BERT (Devlin et al., 2019) with different vocabs. The second is AMBERT (Zhang & Li, 2020), a pre-trained model with two vocabs of different granularity. For a fair comparison, we pre-train the baselines ourselves with the same corpus.

Benchmark tasks
For downstream tasks, we select two sentence pair classification (CLS) tasks: (1) XNLI (Conneau et al., 2018); (2) LCQMC (Liu et al., 2018). We also investigate two named entity recognition (NER) tasks: MSRA NER (MSRA) (Levow, 2006), from the open domain, and CCKS NER (CCKS; https://biendata.com/competition/CCKS2017 2/), collected from medical records. For machine reading comprehension (MRC) tasks, we consider two benchmark datasets, CMRC2018 and ChID (Zheng et al., 2019).

Results for different vocabs
Table 3 reports the results of pre-training ALBERT-tiny with a series of different vocabs. We can see that SGT obtains the best results on CLS, while CHAR and SGT have comparable results on the span-level tasks, NER and MRC. Even though the model with SEG has more parameters than the one with SGT, it consistently under-performs SGT. These results indicate two conclusions. First, CWS alone cannot build a proper vocab for Chinese BERT. Second, sub-word tokenizers learned on the segmented Chinese corpus can decompose long-tail words into tokens while keeping meaningful phrases intact, improving the downstream performance of ALBERT.

Table 3 also reports SGT's performance with different vocab sizes. The results show that vocab size 31,692 is best suited for Chinese PLMs. When SGT's vocab size goes up, the less frequent tokens do not receive enough training, which hurts downstream performance. When SGT's vocab size goes down, it becomes essentially similar to CHAR and thus cannot leverage the phrasal information of the Chinese language. Therefore, for the experiments in the rest of the paper, we only use SGT with vocab size 31,692.
SGT also has an efficiency advantage over CHAR. We run inference on the LCQMC test set with batch size 1, keeping the sequence length as it is, and observe that SGT has a 1.25x inference speed-up over CHAR.

Results for MVP
In this subsection, we analyze the results for our MVP strategies. We can see from Table 2 that, when trained on the same corpus, Hi-MVP's performance can match AMBERT's. Note that AMBERT has twice the computational complexity of our Hi-MVP. Hi-MVP encodes the sentence from the char level up to the phrase level, thus learning how the sentence is composed from its components.
Note that Hi-MVP's pre-training works at the phrase level; thus, it does not perform well on span-level tasks. In contrast, the AL-MVP models improve over the single-vocab models, especially on span-level tasks. Also, our two versions of AL-MVP models can outperform AMBERT on most of the tasks. AL-MVP asks the model to learn a more general representation that can work with different vocabs, making the model better understand a token's relation with its contexts. Between the two AL-MVP models, AL-MVP(SGT, SEG) performs best on five of the six tasks; on CMRC2018, its performance is very close to that of AL-MVP(CHAR, SEG). AL-MVP(SGT, SEG) maintains SGT's advantage on CLS tasks while improving NER and MRC via AL-MVP pre-training.

Ablation on the pre-training strategies of AL-MVP
For AL-MVP, we emphasize that cross-vocab MLM is essential to the pre-training. Thus, we compare AL-MVP(SGT, SEG) with two other versions. First, AL-MVP(SGT, SEG)-1 keeps only the main MLM layer in Figure 2(b), i.e., it makes MLM predictions only on the more fine-grained vocab. Second, AL-MVP(SGT, SEG)-2 keeps only the auxiliary MLM layer in Figure 2(b), i.e., it makes MLM predictions only on the more coarse-grained vocab. Table 4 shows that AL-MVP(SGT, SEG) achieves the best results on all six tasks. The results show that MLM pre-training combining both vocabs can effectively improve the PLM's language understanding abilities and downstream performance.

Large scale pre-training
In this section, we report the pre-training results on C-2, a large-scale corpus matching the size of AMBERT's corpus. Table 5 reports the performance of ALBERT-base models. We first directly report the results of AMBERT from Zhang & Li (2020) on the CMRC2018 and ChID tasks. Besides, to eliminate the effect of different training corpora, we also train AMBERT on the C-2 corpus. The results show that our AL-MVP(SGT, SEG) model outperforms both AMBERT models. Note that our pre-training requires only half the GPU time of AMBERT, and the inference speed of AL-MVP(SGT, SEG) is 2.15x that of AMBERT.

Robustness over adversarial attacks
We claim that our AL-MVP training strategy asks the ALBERT encoder to efficiently draw information from contexts into token representations, thus improving its expressiveness. It is therefore reasonable to expect that AL-MVP pre-trained models should be more robust to adversarial attacks. In this subsection, we leverage the TextFooler framework (Jin et al., 2020) to conduct black-box attacks on the LCQMC and XNLI datasets. In Table 6, we report the original performance, the after-attack performance, and the number of queries needed by TextFooler to attack each model. We can see that AL-MVP(SGT, SEG) increases the number of queries needed for an attack by a clear margin. Compared with AMBERT, our AL-MVP(SGT, SEG) demonstrates clear robustness improvements.

Conclusions
In this work, we propose a series of novel pre-training methods called MVP, which leverage multiple vocabularies in language model pre-training. To select the vocabs for MVP pre-training, we first conduct experiments validating that SGT, which combines Chinese word segmentation and sub-word tokenization, works best for Chinese language model pre-training. We then show experimentally that our proposed MVP methods can achieve better performance than AMBERT with fewer computational resources. Finally, we show that our MVP method can improve the pre-trained model's robustness against adversarial attacks.