LICHEE: Improving Language Model Pre-training with Multi-grained Tokenization

Language model pre-training based on large corpora has achieved tremendous success in constructing enriched contextual representations and has led to significant performance gains on a diverse range of Natural Language Understanding (NLU) tasks. Despite this success, most current pre-trained language models, such as BERT, are trained with single-grained tokenization, usually with fine-grained characters or sub-words, making it hard for them to learn the precise meaning of coarse-grained words and phrases. In this paper, we propose a simple yet effective pre-training method named LICHEE to efficiently incorporate multi-grained information from input text. Our method can be applied to various pre-trained language models and improves their representation capability. Extensive experiments conducted on CLUE and SuperGLUE demonstrate that our method achieves comprehensive improvements on a wide variety of NLU tasks in both Chinese and English with little extra inference cost incurred, and that our best ensemble model achieves state-of-the-art performance in the CLUE benchmark competition.


Introduction
Pre-trained language models (PLMs) such as GPT (Radford et al., 2018), BERT (Devlin et al., 2019) and XLNet have become enormously popular and achieved great success on diverse natural language understanding tasks, such as sentiment analysis, question answering, and language inference. These models usually utilize a transformer architecture (Vaswani et al., 2017) to capture the dependencies between tokens in the input text, to model the language information, and to learn contextual representations. They are first pre-trained on large-scale unlabeled corpora and subsequently fine-tuned on labeled data from downstream tasks.
In many NLU applications, tokenization often affects the performance and needs to be chosen carefully. The input tokens for pre-trained language models are usually fine-grained, e.g., words and sub-words for English and characters for Chinese. Compared with coarse-grained tokens such as phrases, the advantage of fine-grained tokens is that they form a smaller vocabulary, yielding abundant training samples per token, and thus alleviating the data sparsity issue and out-of-vocabulary (OOV) problem. However, even trained on large corpora, it is still hard for language models pre-trained with fine-grained tokens to learn the correct attention boundaries of larger semantic units in many languages (Zhang and Li, 2020).
To obtain a more accurate model, prior studies attempt to incorporate coarse-grained information into models trained with fine-grained tokenization by masking sequences of consecutive tokens in the pre-training stage (Joshi et al., 2020; Cui et al., 2019). Zhang and Li (2020) propose AMBERT, a Siamese network based on BERT that handles multi-grained input text, which uses two encoders with shared weights to separately encode fine-grained tokens and coarse-grained tokens into two sequences of contextualized representations. Despite its effectiveness, the inference cost of AMBERT almost doubles that of the original BERT due to the dual-encoder structure, which is often unacceptable in industrial scenarios.
In this paper, we propose a novel method named LICHEE designed to efficiently leverage input information at multiple levels of granularity in the pre-training stage, in order to enhance the representation ability of PLMs. Unlike AMBERT, which encodes the fine-grained and coarse-grained tokens with two encoders and thereby significantly increases the inference cost, in LICHEE the fusion of multi-grained information happens at the embedding level. This requires no change to the original model structure of the PLM, and thus induces little extra inference cost when applied in online NLP applications. Specifically, LICHEE first pre-processes the input text into fine-grained and coarse-grained tokens, which are passed through two embedding layers, respectively, to derive their corresponding vector representations. Both vector representations are then merged via pooling to form the multi-grained embedding vector, which serves as the input to the PLM encoder. Finally, the enhanced contextual representations generated by the PLM encoder, with both fine-grained and coarse-grained information incorporated, are obtained and used for downstream tasks.
We have applied LICHEE to enhance multiple different pre-trained language models, including BERT (Devlin et al., 2019), ALBERT (Lan et al., 2019), and GPT (Brown et al., 2020), and conducted extensive evaluation of the resulting language models on Chinese natural language understanding (NLU) tasks from the CLUE (Liang Xu, 2020) benchmark. Results show that with LICHEE, the resulting pre-trained language models significantly outperform their single-grained counterparts on almost all tasks, by taking advantage of multi-grained information to effectively and efficiently produce more accurate representations.
In addition, we also participated in the CLUE benchmark competition with our best ensemble model built upon a collection of LICHEE-enhanced BERT-large models, and achieved the state-of-the-art performance of an average score of 80.42 (as of January 8, 2021) over 9 different Chinese NLU tasks, as well as the best scores on two individual tasks: IFLYTEK and CSL.
Moreover, we have also conducted English natural language understanding experiments based on SuperGLUE (Wang et al., 2019a) benchmarks. Significant improvements are observed when LICHEE is employed in the pre-training stage, which demonstrates that the proposed pre-training method is generally effective in different language settings.

Related Work
In this section, we give a brief overview of some popular pre-trained language models and studies on the training techniques related to tokenization.
Pre-trained language models are pre-trained on large unsupervised corpora and aim to produce a meaningful representation for each input token, considering not only the meaning of the token itself but also its surrounding context. ELMo (Peters et al., 2018) is one of the first pre-trained language models based on bidirectional LSTMs, which produces the contextual representation of each token by concatenating its left-to-right and right-to-left representations. The GPT models (Radford et al., 2018, 2019; Brown et al., 2020) leverage the powerful Transformer (Vaswani et al., 2017) to build an auto-regressive language model that predicts the next token given its history context. BERT (Devlin et al., 2019) is a bidirectional auto-encoding language model also based on the transformer. It consists of two pre-training objectives: masked language modeling (MLM) and next sentence prediction (NSP). Yang et al. (2019) point out the discrepancy between the pre-training and fine-tuning stages of BERT due to the masking symbol, and propose a permutation language model called XLNet.
The great popularity of BERT has drawn many researchers to make improvements on the architecture. RoBERTa improves several training details of BERT, including dynamic masking and the removal of the NSP pre-training task. ALBERT (Lan et al., 2019) reduces the model parameters with cross-layer weight sharing and accelerates the training process. ELECTRA (Clark et al., 2019) proposes a new token detection task and adopts a generator-discriminator framework to pre-train the language model. Although most pre-trained language models are built on fine-grained tokenization, coarse-grained information proves to be helpful to model performance. Cui et al. (2019) propose a masking scheme called "whole word masking" (WWM) for Chinese BERT, where consecutive characters belonging to the same word are masked together. In ERNIE, knowledge graphs are added to enhance the model, and entity-level masking is used during pre-training, which is beneficial for language understanding tasks. SpanBERT (Joshi et al., 2020) proposes to mask random spans instead of random tokens, and adopts a new span boundary objective to replace the next sentence prediction task in pre-training. Instead of focusing on the masking scheme, AMBERT (Zhang and Li, 2020) proposes to adopt two encoders with shared parameters to learn the representations of fine-grained and coarse-grained tokens in parallel. However, even though the weight-sharing setting reduces the number of model parameters, the dual-encoder structure of AMBERT incurs twice the inference cost, which remains a huge issue when deployed in online applications.
Different from AMBERT, our work merges the fine-grained and coarse-grained tokenizations at the embedding level, and achieves significant performance gains with little additional computational cost.

Methodology
In this section, we present LICHEE, our general multi-grained framework for language model pre-training, and its detailed implementation, including the pre-training methods for both auto-regressive and auto-encoding tasks as well as fine-tuning details. Figure 1 gives an overview of LICHEE, where input information from multiple granularities is leveraged to enhance the representation ability of many pre-trained language models.

Model Architecture
The framework takes in text sequences as input, which are tokenized into token sequences. In this paper, we keep two vocabularies and use two tokenizers to perform fine-grained and coarse-grained tokenization, where vocabulary items are selected based on their token frequencies in the pre-training corpora. The definitions of "fine grain" and "coarse grain" vary across languages. For example, in English, words and phrases are often used as the fine-grained and coarse-grained tokens respectively, while in Chinese, characters and words are used instead. Formally, for a given input text sequence $T$, we use $t^f_i$ to denote the $i$-th fine-grained token and $t^c_{j\text{-}k}$ to denote a coarse-grained token composed of the fine-grained tokens $\{t^f_j, \ldots, t^f_k\}$ between positions $j$ and $k$. For example, in Figure 1, the coarse-grained token "New York Times" is composed of the first, second, and third fine-grained tokens, and is denoted as $t^c_{1\text{-}3}$.

After tokenization, two separate embedding layers are used to map the tokens to their vector representations. Specifically, each fine-grained token $t^f_i$ is passed into a fine-grained embedding layer to produce the fine-grained embedding vector $e^f_i \in \mathbb{R}^d$ of the token, where $d$ denotes the dimension of the fine-grained embedding. Similarly, the coarse-grained embedding $e^c_{j\text{-}k} \in \mathbb{R}^d$ of the same dimension $d$ is derived by feeding token $t^c_{j\text{-}k}$ to the coarse-grained embedding layer:

$$e^f_i = \mathrm{Embed}^f(t^f_i), \qquad e^c_{j\text{-}k} = \mathrm{Embed}^c(t^c_{j\text{-}k}) \quad (1)$$

For each token $t^f_i$, we construct its multi-grained embedding vector $e_i \in \mathbb{R}^d$ by performing a max-pooling operation on the derived fine-grained embedding $e^f_i$ and the coarse-grained embedding $e^c_{j\text{-}k}$ of its corresponding coarse-grained token $t^c_{j\text{-}k}$:

$$e_i = \mathrm{max\text{-}pooling}(e^f_i, e^c_{j\text{-}k}), \qquad j \le i \le k \quad (2)$$

Note that $d$ is equal to the original embedding dimension of the single-grained PLM, to show that the performance gain is attributable to the introduction of multi-grained information rather than a modified model structure.
Finally, the combined embedding vectors $e$ are fed into the PLM encoder to construct the final contextualized representations $h$ enhanced with multi-grained information:

$$h = \mathrm{Encoder}(e) \quad (3)$$
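The max-pooling fusion described above can be sketched in a few lines of plain Python. This is an illustrative toy, not the paper's implementation: the dimension $d = 4$ and the embedding values are made up, and a real system would use learned embedding matrices.

```python
def max_pool_fuse(fine_emb, coarse_emb):
    """Element-wise max over a fine-grained and a coarse-grained embedding vector."""
    return [max(f, c) for f, c in zip(fine_emb, coarse_emb)]

# Toy example with d = 4: the coarse token t^c_{1-3} spans fine tokens 1..3,
# so every fine position inside the span shares the same coarse embedding.
fine_embs = [
    [0.1, -0.3, 0.5, 0.0],   # t^f_1 ("New")
    [0.4, 0.2, -0.1, 0.3],   # t^f_2 ("York")
    [-0.2, 0.6, 0.1, -0.4],  # t^f_3 ("Times")
]
coarse_emb = [0.3, 0.0, 0.2, 0.1]  # t^c_{1-3} ("New York Times")

# Equation (2): e_i = max-pooling(e^f_i, e^c_{j-k}) for j <= i <= k.
multi_grained = [max_pool_fuse(e, coarse_emb) for e in fine_embs]
```

Because the fused vectors keep the original dimension $d$, they can be fed directly into an unmodified PLM encoder, which is what keeps the extra inference cost negligible.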

Pre-training
We have applied LICHEE to both auto-regressive and auto-encoding PLMs, such as GPT and BERT. For auto-regressive PLMs, the pre-training task is next-token prediction, which aims to predict the next token $t_i$ based on its previous context $t_{<i}$ by optimizing the following objective function:

$$\mathcal{L} = -\sum_i \log p_\theta(t_i \mid t_{<i})$$

where the conditional probability $p_\theta$ is modeled by a network with parameters $\theta$.
In our framework, we adjust the objective function to include both the fine-grained context $t^f_{<i}$ and the coarse-grained context $t^c_{<i}$:

$$\mathcal{L} = -\sum_i \log p_\theta(t^f_i \mid t^f_{<i}, t^c_{<i})$$

Note that when making predictions on any token within a coarse-grained span, $t_i \in t^c_{j\text{-}k}$, the token embedding $e_i$ would cause information leakage, as it involves the coarse-grained token embedding $e^c_{j\text{-}k}$, which contains information beyond the history context. For example, in the case illustrated in Figure 1, the prediction of the token "York" should not rely on the token "New" and its embedding $e_1$, as it discloses the entire coarse-grained token "New York Times" through the coarse-grained embedding $e^c_{1\text{-}3}$. Therefore, we can only exploit the context before the start position of the coarse-grained token to make predictions:

$$\mathcal{L} = -\sum_i \log p_\theta(t^f_i \mid t^f_{<j}, t^c_{<j}), \qquad j \le i \le k$$

where $j$ and $k$ are the start and end positions of the coarse-grained token.

Figure 1: The overall structure of our proposed pre-training framework LICHEE. Fine-grained and coarse-grained tokens are first derived from the input text by tokenization, and separately passed into two individual embedding layers. The multi-grained embedding vectors are acquired by taking a max-pooling over the fine-grained and coarse-grained embedding vectors, and are fed into the PLM encoder to extract the final contextualized representations.
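The leakage rule described above, namely that a token inside a coarse-grained span may only condition on context strictly before the span's start, can be sketched as a small helper. The function name and 1-based indexing follow the paper's notation and are otherwise illustrative.

```python
def ar_context_limit(i, coarse_spans):
    """Largest position c such that only tokens at positions < c are safe history
    when predicting the fine-grained token at position i (1-based, as in the paper).
    coarse_spans is a list of inclusive (j, k) spans covered by coarse tokens."""
    for j, k in coarse_spans:
        if j <= i <= k:
            # Inside a coarse span: the fused embeddings e_j..e_k all carry the
            # span's full content, so only context strictly before j is usable.
            return j
    return i  # outside any multi-token span: the ordinary history t_{<i}

# "New York Times" is t^c_{1-3}; predicting "York" (i = 2) may only
# condition on tokens before position 1, i.e. none of the span itself.
assert ar_context_limit(2, [(1, 3)]) == 1
assert ar_context_limit(5, [(1, 3)]) == 5
```

In practice this limit would be baked into the attention mask of the auto-regressive decoder rather than computed per token at run time.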
For auto-encoding PLMs, we only include the Masked Language Modeling (MLM) task in the pre-training process, as the Next Sentence Prediction (NSP) task has been shown to provide no benefit in many recent studies (Lan et al., 2019; Zhang and Li, 2020). In MLM, 15% of the tokens are randomly selected for substitution, of which 80% are replaced with the [MASK] token, 10% are replaced with random tokens, and 10% stay unchanged.
The objective is to recover the masked tokens $T_m \subset T$ from the altered input text sequence $\tilde{T}$:

$$\mathcal{L} = -\sum_{t \in T_m} \log p_\theta(t \mid \tilde{T})$$

In our framework, we propose to exploit the multi-grained information of the input in the MLM task:

$$\mathcal{L} = -\sum_{t \in T_m} \log p_\theta(t \mid \tilde{T}^f, \tilde{T}^c)$$

where $\tilde{T}^f$ and $\tilde{T}^c$ stand for the fine-grained and coarse-grained altered input text. Similar to the strategy deployed in auto-regressive PLMs, we apply a masking strategy in which, when a fine-grained token $t^f_i$ is to be masked, its corresponding coarse-grained token $t^c_{j\text{-}k}$ and all the fine-grained tokens $t^f_j, \ldots, t^f_k$ belonging to it are also masked, in order to avoid information leakage through the multi-grained embeddings.
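The span-expansion part of this masking strategy can be sketched as follows. This is a simplified illustration: the 15% selection rate and the 80/10/10 replacement mix described above are omitted, and the span map is a made-up toy.

```python
def expand_to_spans(chosen, coarse_spans):
    """Expand a set of chosen fine-token indices so that whenever any token of a
    coarse-grained span is selected for masking, the whole span is masked together.
    coarse_spans maps a fine index to its inclusive (start, end) span."""
    expanded = set()
    for i in chosen:
        j, k = coarse_spans.get(i, (i, i))  # tokens outside any span mask alone
        expanded.update(range(j, k + 1))
    return expanded

def apply_mask(tokens, masked_positions, mask_token="[MASK]"):
    return [mask_token if i in masked_positions else t for i, t in enumerate(tokens)]

# Positions 0-2 form one coarse span ("New York Times"); selecting position 1
# ("York") pulls the whole span into the mask.
spans = {0: (0, 2), 1: (0, 2), 2: (0, 2)}
masked = expand_to_spans({1}, spans)
```

Masking the whole span together ensures the coarse-grained embedding of the span cannot leak the identity of any of its masked fine-grained tokens.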

Fine-tuning
When fine-tuning on downstream tasks, we append the special tokens ([CLS], [SEP]) to both the fine-grained and coarse-grained vocabularies. In sentence-level classification tasks, [CLS] is attached to the start of the input sequence in auto-encoding PLMs like BERT, and to the end of the input in auto-regressive PLMs like GPT. Its multi-grained contextualized representation h_[CLS] is used to represent the whole input sequence and is passed into a projection layer for the final prediction.
Similarly, for tasks that include token-level span detection, such as Question Answering, the contextual representation h i for each token t i is extracted and utilized in the task.

Experiments
We have carried out extensive experiments on various natural language understanding tasks on both Chinese and English datasets. In the following sections, we first introduce the pre-training datasets used in our evaluation and provide the implementation details of our framework. We then demonstrate the effectiveness of LICHEE by conducting comprehensive experiments on various Chinese NLU datasets with multiple different PLMs, and compare our method with other baseline methods. Next, we perform a thorough ablation study to evaluate different approaches to integrating input text information from multiple granularities. Finally, we apply LICHEE to an English BERT to verify its efficacy on English NLU tasks.

Pre-Training Datasets
For Chinese, there is no commonly used corpus for pre-training language models. We utilize a large corpus consisting of 450GB of text from a wide range of popular Chinese applications, including Kandian, Zhihu, Wechat, and Weibo, covering news, wiki, and blog content.
Similar to most Chinese PLMs, characters are used as fine-grained tokens due to the nature of the Chinese language. For coarse-grained tokens, we use QQSeg, a segmentation tool with an open API, to segment the text, and the segmented words are treated as coarse-grained tokens. For the construction of vocabularies, we follow Google's Chinese BERT and include 21,128 tokens in the fine-grained vocabulary. For the coarse-grained vocabulary, we calculate token frequencies and trim out tokens with frequency lower than 8, resulting in 210,946 tokens. Note that, in order to alleviate the out-of-vocabulary (OOV) problem, all tokens in the fine-grained vocabulary are also included in the coarse-grained vocabulary.
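The coarse-grained vocabulary construction described above can be sketched as a frequency filter over a pre-segmented corpus. The function name and the toy corpus are illustrative; only the frequency threshold of 8 and the inclusion of the fine-grained vocabulary come from the text.

```python
from collections import Counter

def build_coarse_vocab(segmented_corpus, fine_vocab, min_freq=8):
    """Keep segmented words whose corpus frequency is at least min_freq, and
    always include the fine-grained vocabulary to alleviate the OOV problem."""
    counts = Counter(tok for sentence in segmented_corpus for tok in sentence)
    frequent = {tok for tok, c in counts.items() if c >= min_freq}
    return frequent | set(fine_vocab)

# Toy corpus: "rare" appears only twice and is trimmed out by the threshold.
corpus = [["深度", "学习"]] * 8 + [["rare", "学习"]] * 2
vocab = build_coarse_vocab(corpus, fine_vocab={"深", "度", "学", "习"}, min_freq=8)
```

Including every fine-grained token in the coarse-grained vocabulary guarantees that any text segment has at least a character-level fallback, so no input is ever truly out of vocabulary.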
For English, a corpus of 6.2 million documents (18.9GB of compressed text) from Wikipedia is leveraged to pre-train the model. We first perform sub-word tokenization with the BPE algorithm (Sennrich et al., 2015) on the English text, where the produced words and sub-words constitute the fine-grained vocabulary of 28,996 tokens. For the coarse-grained vocabulary, we treat high-frequency words as coarse-grained tokens, resulting in 136,630 tokens in total, which also include all tokens from the fine-grained vocabulary to address the same OOV concern.

Benchmarks
The evaluation of the pre-trained models is conducted on various downstream NLU tasks. In our experiments, all the Chinese PLMs are evaluated on the Chinese Language Understanding Evaluation (CLUE) benchmark (Liang Xu, 2020), a comprehensive language understanding benchmark developed for Chinese containing 9 natural language understanding tasks. Among the 9 tasks, there are two single-sentence classification tasks (TNEWS and IFLYTEK), four sentence-pair classification tasks (AFQMC, OCNLI, CLUEWSC, and CSL), and three question-answering tasks (CMRC2018, CHID, and C3). Note that OCNLI has replaced CMNLI since Oct 22, 2020. We compare model performance by reporting the score of each task and the average score over all tasks.
For English tasks, we use the SuperGLUE benchmark (Wang et al., 2019a), an extension of GLUE (Wang et al., 2019b) consisting of a collection of 8 NLU tasks of higher difficulty for comprehensively evaluating the performance of English PLMs. SuperGLUE contains a word sense disambiguation task (WiC), two textual entailment tasks (CB and RTE), two reasoning tasks (COPA and WSC), and three question-answering tasks (BoolQ, MultiRC, and ReCoRD).

Experiment Setup
In order to demonstrate the general applicability and effectiveness of our framework, we have implemented three different pre-trained language models with our method (BERT, ALBERT, and GPT) and compare their performance with their corresponding single-grained baselines.
For BERT and ALBERT, we follow the "base" structure in (Devlin et al., 2019) with an encoder of 12 layers. The GPT model in our experiments is likewise made up of a 12-layer transformer decoder. We apply the following training setting to all three models. For better scalability with large batches, we adopt LAMB (You et al., 2019) in place of Adam (Kingma and Ba, 2014) as the optimizer, with a batch size of 768 and a learning rate of 2e-4. We first train the model for 1M steps with a maximum sequence length of 128, and then increase the maximum length to 512 for another 100K steps to better capture long-distance dependencies. To enhance training efficiency, we adopt the mixed-precision training technique (Micikevicius et al., 2017) during pre-training, which is performed on 4 Nvidia V100 GPUs.
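The two-stage schedule described above can be summarized as a small configuration sketch. The field names are hypothetical and not tied to any specific training framework; only the step counts, sequence lengths, batch size, and learning rate come from the text.

```python
# Hypothetical sketch of the two-stage pre-training schedule described above.
pretrain_schedule = [
    {"steps": 1_000_000, "max_seq_len": 128},  # stage 1: shorter sequences
    {"steps": 100_000, "max_seq_len": 512},    # stage 2: long-range dependencies
]
optimizer_cfg = {"name": "LAMB", "batch_size": 768, "learning_rate": 2e-4}

total_steps = sum(stage["steps"] for stage in pretrain_schedule)
```

Training mostly at length 128 and only briefly at length 512 keeps the quadratic attention cost low for the bulk of pre-training while still exposing the model to long contexts.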
We have also implemented a LICHEE-enhanced ensemble model based on BERT-large to participate in the CLUE benchmark competition. During training, we set the batch size to 1,024, and the maximum sequence lengths of the first and second stages are set to 256 and 512, respectively. 64 Nvidia V100 GPUs are used to train the model.
For the evaluation of each task, we derive 6 results with different random seeds and report the average performance in this paper.

Main Results
In Table 1, we apply our multi-grained pre-training method to three pre-trained language models (BERT, ALBERT, and GPT) and compare them with their single-grained baselines on the CLUE benchmark (https://www.cluebenchmarks.com/rank.html). From the results, we can see that our method achieves significant performance gains by exploiting the multi-grained information of the text input. The average CLUE scores of our multi-grained BERT-LICHEE, ALBERT-LICHEE and GPT-LICHEE are 73.92, 69.30 and 68.73 respectively, producing significant absolute improvements of 2.80, 2.03, and 1.32 over their single-grained baseline models. Aside from the improvement in the average CLUE score, it is also worth mentioning that our multi-grained BERT-LICHEE and GPT-LICHEE outperform their single-grained baselines on all 9 NLU tasks in CLUE, while the ALBERT-LICHEE model beats the single-grained ALBERT on 8 out of 9 tasks, which provides strong evidence that the benefits of our method are generally applicable across different pre-trained language models and diverse NLU tasks.
In order to further investigate the potential of LICHEE, we apply it to an ensemble model based on BERT-large and participate in the CLUE benchmark competition. As demonstrated in Table 2, our method outperforms all other candidates on the average score over the 9 CLUE tasks by a significant margin, and also achieves state-of-the-art performance on two individual NLU tasks: IFLYTEK and CSL. This result further proves that our multi-grained pre-training method brings significant improvements to the representation ability of language models and is generally effective across a wide range of downstream NLU tasks. The reason for LICHEE's success is that our multi-grained pre-training strategy models the contextual information of the input text so as to leverage the advantages of both granularities: fine-grained token representations are easier to learn given the abundant training samples, while coarse-grained tokens are more complete lexical units and provide more accurate contextual information. Furthermore, in our framework, the combination of multi-grained information is realized at the embedding level so that the model structure stays unaltered, showing that the benefits are achieved entirely through the information gain of multi-grained pre-training rather than model-level modifications.

Table 3: Ablation study of different pre-training strategies with the BERT model on the CLUE dataset. Two single-grained (SG) baselines and five multi-grained (MG) methods (LICHEE and its variants) with different ways of integrating the fine-grained and coarse-grained representations are evaluated.

Ablation Analysis
We have conducted an ablation analysis on the CLUE benchmark with BERT to evaluate the impact of our multi-grained design, and to perform a comprehensive study of the different methods of integrating the multi-grained embeddings. Table 3 lists the performance of model variants with different training strategies, including two single-grained methods and five multi-grained methods.
The original single-grained BERT, whose masking scheme is solely based on fine-grained tokens, gives an average CLUE score of 71.12. The Whole Word Masking (WWM) technique (Cui et al., 2019) performs masking operations on consecutive fine-grained tokens that form a coarse-grained token and improves the performance to 72.24. Note that although WWM utilizes coarse-grained token boundary information during the masking operations, it does not explicitly train representations for coarse-grained tokens. Therefore, we also treat WWM as a single-grained pre-training method.
For multi-grained pre-training methods, we have conducted experiments exploring five different approaches to combining the embedding representations of fine-grained and coarse-grained tokens, including concatenating the embedding vectors with different dimension settings, and integrating them with mean-pooling and max-pooling. For the concatenation approaches, we keep the dimension of the concatenated multi-grained embeddings at 768 to align with the baseline models, and apply three settings that adjust the fine-grained and coarse-grained embedding dimensions to (384, 384), (256, 512), and (512, 256), respectively. Empirically, we find that the three concatenation settings achieve similar performance, while using larger embedding vectors for fine-grained tokens and smaller ones for coarse-grained tokens produces a slightly better average CLUE score of 73.08.
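The three combination strategies compared in the ablation can be sketched on toy vectors. The function is illustrative; the dimension comments reflect the settings described above (concatenation splits the 768-dimensional budget, pooling keeps both embeddings at 768).

```python
def combine(fine, coarse, strategy):
    """Merge a fine-grained and a coarse-grained embedding vector.
    concat: the two dims must sum to the model dimension (e.g. 512 + 256 = 768);
    mean/max: both inputs keep the full model dimension (768 in the paper)."""
    if strategy == "concat":
        return fine + coarse
    if strategy == "mean":
        return [(f + c) / 2 for f, c in zip(fine, coarse)]
    if strategy == "max":
        return [max(f, c) for f, c in zip(fine, coarse)]
    raise ValueError(f"unknown strategy: {strategy}")

fine, coarse = [1.0, -2.0], [3.0, 0.0]
concat_out = combine(fine, coarse, "concat")  # doubles the length
mean_out = combine(fine, coarse, "mean")
max_out = combine(fine, coarse, "max")
```

Note how only concatenation changes the output dimensionality; the pooling variants preserve it, which is why they can use full-size embeddings on both sides.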
Exploiting mean-pooling to integrate the multi-grained information yields further gains over the concatenation methods and reaches an average CLUE score of 73.22, which may be attributed to the greater number of embedding parameters, as pooling methods do not require shrinking the embedding dimension and allow both the fine-grained and coarse-grained embedding dimensions to stay at 768. Finally, LICHEE with max-pooling outperforms all the aforementioned approaches, attains an overall score of 73.92, and achieves the best score on 3 out of 9 CLUE tasks, owing to its capability of extracting more representative features. In particular, on the CLUEWSC task, LICHEE achieves an accuracy of 81.03 while the second-best method only reaches 76.54. We believe this is because the small training set of CLUEWSC, with only 532 examples, makes it more dependent on powerful pre-trained representations, so the advantage of the max-pooling method is amplified. Overall, we can see from Table 3 that all multi-grained pre-training methods outperform the single-grained baselines by a significant margin, which again proves that incorporating multi-grained information during the pre-training phase is efficacious and benefits model performance considerably.

Inference Speed Analysis
We have also studied the inference speed of LICHEE, comparing it with the original single-grained BERT and another multi-grained method, AMBERT. Table 5 gives a brief comparison in terms of FLOPs and speedup, tested on a binary classification task with a sequence length of 512. FLOPs indicates the number of floating-point operations the model performs for a single forward pass; generally speaking, the higher a model's FLOPs, the slower its inference speed.
We can see that the FLOPs of AMBERT is 87.0 billion, twice that of the single-grained BERT. This means the inference time of AMBERT is almost doubled, which costs far more time and resources and is often unacceptable for real-world applications. Meanwhile, our multi-grained method produces a model with 43.5 billion FLOPs, a negligible increase over the single-grained baseline, because the additional operations only include an embedding lookup for coarse-grained tokens and a max-pooling operation to integrate the fine-grained and coarse-grained embedding vectors. In summary, LICHEE produces significant performance gains with negligible extra inference time.

English Tasks
We have also conducted experiments on the SuperGLUE benchmark to evaluate LICHEE on English language tasks, comparing it with the single-grained baseline BERT-WWM (Cui et al., 2019).
As shown in Table 4, the BERT model pre-trained with our multi-grained method outperforms the single-grained BERT-WWM on all 8 SuperGLUE tasks and attains an average score of 65.53, surpassing the baseline by 1.89. This improvement over BERT-WWM demonstrates that the effectiveness of LICHEE is attributable largely to the information gain of its multi-grained representations, rather than just token boundary information. We also notice that, similar to the CLUEWSC task, a large accuracy increase of 8.45 is achieved on the CB dataset, which has only 250 training samples, because our pre-training method leverages the information gain of multi-grained tokens and produces more accurate representations, which is especially effective on tasks with small training sets.
This result clearly illustrates that LICHEE is not only effective for character-based languages such as Chinese, which rely heavily on correct tokenization, but also produces significant improvements for naturally tokenized languages such as English.

Conclusion
In this paper, we have proposed a novel multi-grained method for language model pre-training named LICHEE, which can be applied to both auto-regressive and auto-encoding PLMs. In our method, fine-grained embeddings and coarse-grained embeddings are separately learned and integrated into multi-grained embeddings, which are then passed into the encoder of the language model. Experiments show that LICHEE enhances model performance by a great margin on downstream tasks in both Chinese and English, and significantly improves inference speed compared to the prior multi-grained method.