AMBERT: A Pre-trained Language Model with Multi-Grained Tokenization

Pre-trained language models such as BERT have exhibited remarkable performance in many tasks in natural language understanding (NLU). The tokens in the models are usually fine-grained in the sense that for languages like English they are words or sub-words and for languages like Chinese they are characters. In English, for example, there are multi-word expressions which form natural lexical units and thus the use of coarse-grained tokenization also appears to be reasonable. In fact, both fine-grained and coarse-grained tokenizations have advantages and disadvantages for learning of pre-trained language models. In this paper, we propose a novel pre-trained language model, referred to as AMBERT (A Multi-grained BERT), on the basis of both fine-grained and coarse-grained tokenizations. For English, AMBERT takes both the sequence of words (fine-grained tokens) and the sequence of phrases (coarse-grained tokens) as input after tokenization, employs one encoder for processing the sequence of words and another encoder for processing the sequence of phrases, utilizes shared parameters between the two encoders, and finally creates a sequence of contextualized representations of the words and a sequence of contextualized representations of the phrases. Experiments have been conducted on benchmark datasets for Chinese and English, including CLUE, GLUE, SQuAD and RACE. The results show that AMBERT can outperform BERT in all cases, and the improvements are particularly significant for Chinese. We also develop a method to improve the efficiency of AMBERT in inference, which still performs better than BERT at the same computational cost as BERT.


Introduction
Pre-trained models such as BERT, RoBERTa, and ALBERT (Devlin et al., 2018; Lan et al., 2019) have shown great power in natural language understanding (NLU). The Transformer-based language models are first learned from a large corpus in pre-training, and then learned from labeled data of a downstream task in fine-tuning. With the Transformer (Vaswani et al., 2017), the pre-training technique, and big data, the models can effectively capture the lexical, syntactic, and semantic relations between the tokens in the input text and achieve state-of-the-art performance in many NLU tasks, such as sentiment analysis, textual entailment, and machine reading comprehension.
In BERT, for example, pre-training is mainly conducted based on masked language modeling (MLM), in which about 15% of the tokens in the input text are masked with a special token [MASK], and the goal is to reconstruct the original tokens from the masked input. Fine-tuning is separately performed for individual tasks such as text classification, text matching, and text span detection. Usually, the tokens in the input text are fine-grained; for example, they are words or sub-words in English and characters in Chinese. In principle, the tokens can also be coarse-grained, that is, for example, phrases in English and words in Chinese. There are many multi-word expressions in English such as 'New York' and 'ice cream', and the use of phrases also appears to be reasonable. It is more sensible to use words (including single character words) in Chinese, because they are basic lexical units. In fact, all existing pre-trained language models employ single-grained (usually fine-grained) tokenization.
Previous work indicates that the fine-grained approach and the coarse-grained approach both have pros and cons. The tokens in the fine-grained approach are less complete as lexical units but their representations are easier to learn (because there are fewer token types and more tokens in training data), while the tokens in the coarse-grained approach are more complete as lexical units but their representations are more difficult to learn (because there are more token types and fewer tokens in training data). Moreover, for the coarse-grained approach there is no guarantee that tokenization (segmentation) is completely correct. Sometimes ambiguity exists and it would be better to retain all possibilities of tokenization. In contrast, for the fine-grained approach tokenization is carried out at the primitive level and there is no risk of 'incorrect' tokenization.
For example, Li et al. (2019) observe that fine-grained models consistently outperform coarse-grained models in deep learning for Chinese language processing. They point out that the reason is that low-frequency words (coarse-grained tokens) tend to have insufficient training data and tend to be out of vocabulary, and as a result the learned representations are not sufficiently reliable. On the other hand, previous work also demonstrates that masking of coarse-grained tokens in pre-training of language models is helpful (Cui et al., 2019; Joshi et al., 2020). That is, although the model itself is fine-grained, masking of consecutive tokens (phrases in English and words in Chinese) can lead to learning of a more accurate model. In Appendix A, we give examples of attention maps in BERT to further support the assertion.
In this paper, we propose A Multi-grained BERT model (AMBERT), which employs both fine-grained and coarse-grained tokenizations. For English, AMBERT extends BERT by simultaneously constructing representations for both words and phrases in the input text using two encoders. Specifically, AMBERT first conducts tokenization at both word and phrase levels. It then takes the embeddings of words and phrases as input to the two encoders with shared parameters. Finally, it obtains a contextualized representation for the word and a contextualized representation for the phrase at each position. Note that the number of parameters in AMBERT is comparable to that of BERT, because the parameters in the two encoders are shared. The only additional parameters come from the multi-grained embeddings. AMBERT can represent the input text at both word level and phrase level, to leverage the advantages of the two approaches of tokenization and create richer representations for the input text at multiple granularities.
AMBERT consists of two encoders and thus its computational cost is roughly doubled compared with BERT. We also develop a method for improving the efficiency of AMBERT in inference, which only uses one of the two encoders. One can choose either the fine-grained encoder or the coarse-grained encoder for a specific task using a development dataset.
We conduct extensive experiments to make a comparison between AMBERT and the baselines as well as alternatives to AMBERT, using the benchmark datasets in English and Chinese. The results show that AMBERT outperforms single-grained BERT models by a large margin in both Chinese and English. In English, compared to Google BERT, AMBERT achieves a 2.0% higher GLUE score, a 2.5% higher RACE score, and a 5.1% higher SQuAD score. In Chinese, AMBERT improves the average score by over 2.7% in CLUE. Furthermore, AMBERT with only one encoder can perform much better than the single-grained BERT models with a similar amount of inference time.
We make the following contributions.
• Study of multi-grained pre-trained language models,
• Proposal of a new pre-trained language model called AMBERT as an extension of BERT,
• Empirical verification of AMBERT on the English and Chinese benchmark datasets GLUE, SQuAD, RACE, and CLUE,
• Proposal of an efficient inference method for AMBERT.

Related work
There has been a large amount of work on pre-trained language models. ELMo (Peters et al., 2018) is one of the first pre-trained language models for learning contextualized representations of words in the input text. Leveraging the power of Transformer (Vaswani et al., 2017), GPTs (Radford et al., 2018) are developed as unidirectional models to make predictions on the input text in an auto-regressive manner, and BERT (Devlin et al., 2018) is developed as a bidirectional model to make predictions on the whole or part of the input text. Masked language modeling (MLM) and next sentence prediction (NSP) are the two tasks in pre-training of BERT. Since the inception of BERT, a number of new models have been proposed to further enhance its performance. XLNet is a permutation language model which can improve the accuracy of MLM. RoBERTa represents a new way of training more reliable BERT with a very large amount of data. ALBERT (Lan et al., 2019) is a light-weight version of BERT, which shares parameters across layers. StructBERT (Wang et al., 2019) incorporates word and sentence structures into BERT to learn better representations of tokens and sentences. ERNIE 2.0 is a variant of BERT pre-trained on multiple tasks with coarse-grained tokens masked. ELECTRA (Clark et al., 2020) has a GAN-style architecture for efficiently utilizing all tokens in pre-training.

It has been found that the use of coarse-grained tokens is beneficial for pre-trained language models. Devlin et al. (2018) point out that 'whole word masking' is effective for training of BERT. It is also observed that whole word masking is useful for building a Chinese BERT (Cui et al., 2019). In ERNIE (Sun et al., 2019b), entity-level masking is employed as a strategy for pre-training and proved to be effective for language understanding tasks. In SpanBERT (Joshi et al., 2020), text spans are masked in pre-training and the learned model can substantially enhance the accuracies of span selection tasks. It is indicated that word segmentation is especially important for Chinese, and a BERT-based Chinese text encoder is proposed with n-gram representations (Diao et al., 2019). All existing work focuses on the use of single-grained tokens in learning and utilization of pre-trained language models. In this work, we propose a general technique of exploiting multi-grained tokens for pre-trained language models and apply it to BERT.

Our Method: AMBERT
In this section, we present the model, pre-training, and fine-tuning of AMBERT. We also present a discussion on alternatives to AMBERT. Figure 1 gives an overview of AMBERT. AMBERT takes a text as input. Tokenization is conducted on the input text to obtain a sequence of fine-grained tokens and a sequence of coarse-grained tokens. AMBERT has two encoders, one for processing the fine-grained token sequence and the other for processing the coarse-grained token sequence. Each of the encoders has exactly the same architecture as that of BERT (Devlin et al., 2018). The two encoders share the same parameters at each corresponding layer, except that each has its own token embedding parameters. The fine-grained encoder generates contextualized representations from the sequence of fine-grained tokens through its layers. In parallel, the coarse-grained encoder generates contextualized representations from the sequence of coarse-grained tokens through its layers. AMBERT outputs a sequence of contextualized representations for the fine-grained tokens and a sequence of contextualized representations for the coarse-grained tokens.

Model
AMBERT is expressive in that it learns and utilizes contextualized representations of the input text at both fine-grained and coarse-grained levels. The model retains all possibilities of tokenizations and learns the attention weights (importance) of representations of multi-grained tokens. AMBERT is also efficient through sharing of parameters between the two encoders. The parameters represent the same ways of combining representations, no matter whether representations are those of fine-grained tokens or coarse-grained tokens.
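To make the parameter-sharing argument concrete, the sketch below counts only the token-embedding weights, which are the one component the two encoders do not share. The vocabulary sizes are the English figures reported in the data section of the paper; the hidden size of 768 is an assumption (the usual value for 12-layer base models), and the function name is illustrative.

```python
# Token-embedding parameters for single-grained vs. multi-grained models.
# HIDDEN_SIZE = 768 is an assumption (typical for 12-layer base models).
HIDDEN_SIZE = 768
FINE_VOCAB = 30_522    # English words (fine-grained), as reported in the paper
COARSE_VOCAB = 77_645  # English phrases (coarse-grained), as reported in the paper


def embedding_params(vocab_size: int, hidden_size: int = HIDDEN_SIZE) -> int:
    """Number of weights in one token-embedding table."""
    return vocab_size * hidden_size


fine_only = embedding_params(FINE_VOCAB)
extra_for_ambert = embedding_params(COARSE_VOCAB)

print(f"fine-grained embedding table: {fine_only:,}")
print(f"extra coarse-grained table:   {extra_for_ambert:,}")
```

Under these assumptions, the coarse-grained embedding table is the only block of weights that AMBERT adds on top of a fine-grained BERT, since the Transformer layers themselves are shared.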

Pre-Training
Pre-training of AMBERT is mainly conducted on the basis of masked language modeling (MLM), at both fine-grained and coarse-grained levels. Next sentence prediction (NSP) is not essential, as indicated in many studies after BERT (Lan et al., 2019). We only use NSP in our experiments for comparison purposes. Let x̂ denote the sequence of fine-grained tokens with some of them being masked, and x̄ denote the masked fine-grained tokens. Let ẑ denote the sequence of coarse-grained tokens with some of them being masked, and z̄ denote the masked coarse-grained tokens. Pre-training is defined as optimization of the following function, where m_i takes 1 or 0 as values and m_i = 1 indicates that fine-grained token x_i is masked, m denotes the total number of fine-grained tokens; n_j takes 1 or 0 as values and n_j = 1 indicates that coarse-grained token z_j is masked, n denotes the total number of coarse-grained tokens; and θ denotes parameters.
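Under the definitions above, the pre-training objective can be written as the sum of the masked-token negative log-likelihoods at the two granularities. The form below is a sketch reconstructed from the notation in this section; in particular, the equal weighting of the two terms is an assumption.

```latex
\min_{\theta} \; - \sum_{i=1}^{m} m_i \log p_{\theta}(x_i \mid \hat{x})
               \; - \sum_{j=1}^{n} n_j \log p_{\theta}(z_j \mid \hat{z})
\tag{1}
```

Each masked fine-grained token is predicted only from the masked fine-grained sequence, and each masked coarse-grained token only from the masked coarse-grained sequence; the two terms are coupled through the shared parameters θ.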

Fine-Tuning
Figure 1: Overview of AMBERT. The input is a sentence in English and the output is the overall representation of the sentence. There are two encoders for processing the sequence of fine-grained tokens and the sequence of coarse-grained tokens respectively. The final contextualized representations of fine-grained tokens and coarse-grained tokens are denoted as r_x0, r_x1, ..., r_xm and r_z0, r_z1, ..., r_zn respectively.

In fine-tuning of AMBERT for classification, the fine-grained encoder and coarse-grained encoder create special [CLS] representations, and both representations are used for classification. Fine-tuning is defined as optimization of the following function, which is a regularized loss of multi-task learning, starting from the pre-trained model, where x is the input text, y is the classification label, r_x0 and r_z0 are the [CLS] representations of the fine-grained encoder and coarse-grained encoder, [a, b] denotes concatenation of vectors a and b, λ is a regularization coefficient, and ‖·‖_2 denotes the L2 norm. The last term is based on agreement regularization (Brantley et al., 2019), which forces agreement between the predictions (ỹ_x and ỹ_z).
Similarly, fine-tuning of AMBERT for span detection can be carried out, in which the representations of fine-grained tokens are concatenated with the representations of corresponding coarse-grained tokens. The concatenated representations are then utilized in the task.
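The classification objective described above can be sketched in a few lines of plain Python. The text specifies a multi-task loss over the [CLS] representations plus an agreement term weighted by λ; the exact decomposition used here (one cross-entropy term on the concatenation [r_x0, r_z0], one on r_x0, one on r_z0) is an assumption, as are the function and argument names. The inputs are the logits already produced by three hypothetical classification heads.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ambert_classification_loss(
    logits_joint: Sequence[float],   # head on the concatenation [r_x0, r_z0]
    logits_fine: Sequence[float],    # head on r_x0 only
    logits_coarse: Sequence[float],  # head on r_z0 only
    label: int,
    lam: float = 1.0,
) -> float:
    """Three cross-entropy terms plus lam * ||y_x - y_z||_2 (assumed form)."""
    p_joint = softmax(logits_joint)
    p_fine = softmax(logits_fine)
    p_coarse = softmax(logits_coarse)
    cross_entropy = -(
        math.log(p_joint[label])
        + math.log(p_fine[label])
        + math.log(p_coarse[label])
    )
    # Agreement regularization: L2 distance between the two predictions.
    agreement = math.sqrt(sum((a - b) ** 2 for a, b in zip(p_fine, p_coarse)))
    return cross_entropy + lam * agreement


# The two single-encoder heads disagree, so the agreement term is non-zero.
print(ambert_classification_loss([1.5, 0.2], [2.0, 0.0], [0.0, 2.0], label=0))
```

Setting `lam=0.0` removes the agreement term and leaves only the multi-task cross-entropy part.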

Inference
We propose two ways of using AMBERT in inference. One is to utilize AMBERT itself and the other is to utilize only one encoder of AMBERT. The former performs better but needs more computation, and the latter performs slightly worse but only needs computation comparable to BERT. One can choose to drop either the fine-grained encoder or the coarse-grained encoder in AMBERT through evaluation on a development dataset, which makes the computational cost close to that of BERT.
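The selection step for single-encoder inference amounts to keeping whichever encoder scores higher on the development set. The snippet below is a minimal sketch of that rule; the function name and the example scores are hypothetical and are not taken from the paper.

```python
from typing import Dict


def choose_encoder(dev_scores: Dict[str, float]) -> str:
    """Return the encoder name with the highest development-set score."""
    if not dev_scores:
        raise ValueError("dev_scores must contain at least one encoder")
    return max(dev_scores, key=dev_scores.get)


# Hypothetical development-set accuracies for one task (not from the paper).
example_scores = {"fine": 0.871, "coarse": 0.864}
kept = choose_encoder(example_scores)
print(f"keep the {kept}-grained encoder at inference time")
```

The choice is made once per task, so the cost at inference time is that of running a single BERT-sized encoder.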

Alternatives
We can consider two alternatives to AMBERT, which also rely on multi-grained tokenization. We refer to them as AMBERT-Combo and AMBERT-Hybrid and make comparisons of them with AMBERT in our experiments.
AMBERT-Combo has two individual encoders, an encoder (BERT) working on the fine-grained token sequence and the other encoder (BERT) working on the coarse-grained token sequence, without parameter sharing between them. In learning and inference AMBERT-Combo simply combines the output layers of the two encoders. Its fine-tuning is similar to that of AMBERT.
Table 1: Performance on classification tasks in CLUE in terms of accuracy (%). The numbers in boldface denote the best results of tasks. Average accuracies of models are also given. Numbers of parameters (param) and time complexities (cmplx) of models are also shown, where l, n, and d denote layer number, sequence length, and hidden representation size respectively. The tasks with mark † are those with data augmentation.

AMBERT-Hybrid has only one encoder (BERT) working on both the fine-grained token sequence and the coarse-grained token sequence. It creates representations on the concatenation of two sequences and lets the representations of the two sequences interact with each other at each layer. Its pre-training is formalized in the following function, where the notations are the same as in (1). Its fine-tuning is the same as that of BERT.
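With the notation of the Pre-Training section (masked sequences \hat{x} and \hat{z}, mask indicators m_i and n_j, parameters θ), the AMBERT-Hybrid objective can plausibly be written as below. This is a reconstruction, under the assumption that each masked token is predicted from the concatenation of both masked sequences, which is what distinguishes the hybrid model from AMBERT.

```latex
\min_{\theta} \; - \sum_{i=1}^{m} m_i \log p_{\theta}(x_i \mid \hat{x}, \hat{z})
               \; - \sum_{j=1}^{n} n_j \log p_{\theta}(z_j \mid \hat{x}, \hat{z})
```

Because both sequences appear in every conditional, fine-grained and coarse-grained tokens can attend to each other at every layer of the single encoder.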

Experiments
We make comparisons between AMBERT and the baselines including fine-grained BERT and coarse-grained BERT, as well as the alternatives including AMBERT-Combo and AMBERT-Hybrid, using benchmark datasets in both Chinese and English. The experiments on the alternatives can also be seen as an ablation study on AMBERT. The ablation studies for the regularization coefficient λ are given in Appendix E.

Data for Pre-Training
For Chinese, we use a corpus consisting of 25 million documents (57G uncompressed text) from Jinri Toutiao¹. Note that there is no common corpus for training of Chinese BERT. For English, we use a corpus of 13.9 million documents (47G uncompressed text) from Wikipedia and OpenWebText (Gokaslan and Cohen, 2019).

The characters in the Chinese texts are naturally taken as fine-grained tokens. We conduct word segmentation on the texts and treat the words as coarse-grained tokens. We employ a word segmentation tool based on an n-gram model. Both tokenizations exploit WordPiece embeddings (Wu et al., 2016). There are 21,128 characters and 72,635 words in the vocabulary of Chinese.

¹ Jinri Toutiao is a popular news app in China.
The words in the English texts are naturally taken as fine-grained tokens. We perform coarse-grained tokenization on the English texts in the following way. First, we calculate the n-grams in the Wikipedia documents using KenLM (Heafield, 2011). We next build a phrase-level dictionary consisting of phrases whose frequencies are sufficiently high and whose last words highly depend on their previous words. We then employ a left-to-right search algorithm to perform phrase-level tokenization on the texts. There are 30,522 words and 77,645 phrases in the vocabulary of English.
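The last step of this pipeline can be illustrated with a greedy left-to-right longest-match segmenter. The text only states that a left-to-right search is used, so the longest-match rule, the `max_len` cap, and the function name are assumptions; the toy dictionary reuses the example phrases 'New York' and 'ice cream' from the introduction.

```python
from typing import List, Set, Tuple


def phrase_tokenize(
    words: List[str],
    phrase_dict: Set[Tuple[str, ...]],
    max_len: int = 4,
) -> List[str]:
    """Greedy left-to-right longest-match segmentation into phrases."""
    tokens: List[str] = []
    i = 0
    while i < len(words):
        # Try the longest candidate first, then shrink down to two words.
        for span in range(min(max_len, len(words) - i), 1, -1):
            candidate = tuple(words[i : i + span])
            if candidate in phrase_dict:
                tokens.append(" ".join(candidate))
                i += span
                break
        else:
            # No phrase starts here: keep the single word as its own token.
            tokens.append(words[i])
            i += 1
    return tokens


# Toy dictionary; the real one is built from n-gram statistics.
toy_dict = {("new", "york"), ("ice", "cream")}
print(phrase_tokenize("i ate ice cream in new york".split(), toy_dict))
# -> ['i', 'ate', 'ice cream', 'in', 'new york']
```

A greedy scan of this kind commits to the first match it finds, which is one way the 'incorrect tokenization' discussed in the paper can arise when phrases overlap.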

Experimental setup
We make use of the same parameter settings for the AMBERT and BERT models. All models in this paper are 'base models' having 12 encoder layers. It is too computationally expensive for us to train the models as 'large models' having 24 layers. To retain consistency, the masked spans in the coarse-grained encoder are also masked in the fine-grained encoder. The details of pre-training and fine-tuning are the same as those in the original BERT paper (Devlin et al., 2018), and are given in Appendix C.
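The consistency constraint on masking can be sketched as follows: masks are sampled at the coarse-grained level and then projected onto the fine-grained tokens that each coarse-grained token covers. Sampling each coarse token independently with probability 0.15 is a simplification, the replacement strategy for masked positions is omitted, and the function name and input format are illustrative.

```python
import random
from typing import List, Tuple


def consistent_masks(
    coarse_tokens: List[List[str]],
    mask_prob: float = 0.15,
    seed: int = 0,
) -> Tuple[List[int], List[int]]:
    """Sample coarse-level mask flags and project them onto fine tokens.

    Each coarse token is given as the list of fine-grained tokens it covers,
    so the alignment between the two sequences is explicit.
    """
    rng = random.Random(seed)
    coarse_mask = [1 if rng.random() < mask_prob else 0 for _ in coarse_tokens]
    fine_mask: List[int] = []
    for flag, pieces in zip(coarse_mask, coarse_tokens):
        # Every fine token inside a masked coarse token is masked as well.
        fine_mask.extend([flag] * len(pieces))
    return coarse_mask, fine_mask


sentence = [["i"], ["ate"], ["ice", "cream"], ["in"], ["new", "york"]]
n_flags, m_flags = consistent_masks(sentence, mask_prob=0.5, seed=1)
print("coarse mask:", n_flags)
print("fine mask:  ", m_flags)
```

The two returned lists play the roles of the indicators n_j and m_i in the pre-training notation.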

Benchmarks
We use the Chinese Language Understanding Evaluation (CLUE) benchmark (Xu et al., 2020) for experiments in Chinese. CLUE contains six classification tasks, namely TNEWS, IFLYTEK, CLUEWSC2020, AFQMC, CSL, and CMNLI³, and three Machine Reading Comprehension (MRC) tasks, namely CMRC2018, ChID, and C³. The details of all the benchmarks are shown in Appendix B. Data augmentation is also performed for all models in the tasks of TNEWS, CSL and CLUEWSC2020 to achieve better performance (see Appendix D for a detailed explanation).

Experimental Results
We compare AMBERT with the BERT baselines, including the BERT model released by Google, referred to as Google BERT, and the BERT models trained by us, referred to as Our BERT, including fine-grained (character) and coarse-grained (word) models. A case study is given in Appendix F. Table 1 shows the results of the classification tasks. AMBERT improves average scores of the BERT baselines by about 1.0% and also works better than AMBERT-Combo and AMBERT-Hybrid. The results of MRC tasks are shown in Table 2. AMBERT improves average scores of the BERT baselines by over 3.0%. Our BERT (word) performs poorly in CMRC2018. This is probably because the results of word segmentation are not accurate enough for the task. AMBERT-Combo and AMBERT-Hybrid are on average better than single-grained BERT models. AMBERT further outperforms both of them.

³ The task is introduced at the CLUE website.
We also compare AMBERT with the state-of-the-art models such as RoBERTa and ALBERT in the CLUE benchmark. The base models are trained with different datasets and procedures, and thus the comparisons should only be taken as references. Note that the settings of the base models are the same as those of Xu et al. (2020). Table 3 shows the results. The average score of AMBERT is higher than all the other models. We conclude that multi-grained tokenization is very helpful for pre-trained language models and the design of AMBERT is reasonable.

Benchmarks
The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018) is a collection of nine NLU tasks. Following BERT (Devlin et al., 2018), we exclude the task WNLI for the reason that results of different models on this task are undifferentiated. In addition, three MRC tasks are also included, i.e., SQuAD v1.1, SQuAD v2.0, and RACE.

Experimental Results
We compare AMBERT with the BERT models on the tasks in GLUE. The results of Google BERT are from the original paper (Devlin et al., 2018), and the results of Our BERT are obtained by us. From Table 4 we can see the following trends. 1) Multi-grained models, particularly AMBERT, can achieve better results than single-grained models. 2) Among the multi-grained models, AMBERT performs best with fewer parameters and less computation. A case study is given in Appendix F.

We also make a comparison on the MRC tasks. The results of Google BERT are either from the papers (Devlin et al., 2018) or from our runs with the official code. From Table 5 we draw the following conclusions. 1) In SQuAD, AMBERT outperforms Google BERT by a large margin. Our BERT (word) generally performs well and Our BERT (phrase) performs poorly in the span detection tasks. 2) In RACE, AMBERT performs best among all the baselines on both the development set and the test set. 3) AMBERT is the best multi-grained model.
We compare AMBERT with the state-of-the-art models in both GLUE and MRC benchmarks. The results of baselines, in Table 6, are either reported in published papers or re-implemented by us with HuggingFace's Transformers (Wolf et al., 2019). We use the provided implementation in HuggingFace's Transformers, without additional data augmentation, question-answering module, or other tricks. Note that AMBERT outperforms all the models on average without using training techniques such as bigger batches and dynamic masking.

Enhancement of Inference Speed
We also conduct experiments on the efficient inference method of AMBERT on CLUE/GLUE/SQuAD/RACE. We choose the fine-grained encoder for the span detection tasks (CMRC2018 and SQuAD) because it performs much better in the tasks. We choose the coarse-grained encoder for the other Chinese tasks and the fine-grained encoder for the other English tasks because they perform better on average. All the decisions are made based on the results from the Dev datasets. The detailed results are shown in Table 7. We conclude that, a) for the English tasks, AMBERT with one chosen encoder achieves similar results as AMBERT with two encoders and outperforms the single-grained "Our BERT" models with a large margin; b) for the Chinese tasks, AMBERT with one chosen encoder performs slightly worse than AMBERT but performs much better than the single-grained "Our BERT" models. Therefore, in practice, one can train an AMBERT with two encoders and use only one of the encoders in inference.

Discussions
We further investigate the reason that AMBERT is superior to AMBERT-Combo. Figure 3 shows the distances between the [CLS] representations of the fine-grained encoder and coarse-grained encoder in AMBERT-Combo and AMBERT after pre-training, in terms of cosine distance (one minus cosine similarity) and normalized Euclidean distance. One can see that the distances in AMBERT-Combo are larger than the distances in AMBERT. We perform the assessment using the data in different tasks and find similar trends. The results indicate that the representations of the fine-grained encoder and coarse-grained encoder are closer in AMBERT than in AMBERT-Combo. These are natural consequences of using AMBERT and AMBERT-Combo, whose parameters are respectively shared and unshared across encoders. It implies that the higher performance of AMBERT is due to its parameter sharing, which can learn and represent similar ways of combining tokens no matter whether they are fine-grained or coarse-grained. An intuitive explanation is that the ways of combining representations of fine-grained tokens and the ways of combining representations of coarse-grained tokens "in the same contexts" are exactly the same.
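The two distances used in this comparison are straightforward to compute for a pair of [CLS] vectors. Cosine distance follows the definition given in the text; for the normalized Euclidean distance, the sketch below assumes it means the Euclidean distance between L2-normalized vectors, which the text does not spell out.

```python
import math
from typing import Sequence


def l2_norm(u: Sequence[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(sum(a * a for a in u))


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """One minus cosine similarity, as defined in the text."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (l2_norm(u) * l2_norm(v))


def normalized_euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between L2-normalized vectors (assumed definition)."""
    norm_u, norm_v = l2_norm(u), l2_norm(v)
    return math.sqrt(sum((a / norm_u - b / norm_v) ** 2 for a, b in zip(u, v)))


# Toy stand-ins for the [CLS] vectors of the two encoders.
r_x0 = [0.2, 0.9, -0.1]
r_z0 = [0.3, 0.7, 0.1]
print(cosine_distance(r_x0, r_z0))
print(normalized_euclidean_distance(r_x0, r_z0))
```

Under this assumed definition the two measures are monotonically related, since the squared normalized Euclidean distance equals twice the cosine distance.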
We also examine the reasons that AMBERT works better than AMBERT-Hybrid, while both of them exploit multi-grained tokenization. Figure 2 shows the attention weights of the first layers in AMBERT and AMBERT-Hybrid, as well as the single-grained BERT models, after pre-training. In AMBERT-Hybrid, the fine-grained tokens attend more to the corresponding coarse-grained tokens and as a result the attention weights among fine-grained tokens are weakened. In contrast, in AMBERT the attention weights among fine-grained tokens and those among coarse-grained tokens are intact. It appears that attentions among single-grained tokens (fine-grained ones or coarse-grained ones) play important roles in downstream tasks.
To answer the question of why the improvements by AMBERT on Chinese are larger than on English in the same pre-training settings, we make a further analysis. We tokenize 10,000 randomly selected Chinese sentences from five tasks in CLUE with our Chinese word tokenizer. The average proportion of words is 51.5%, which indicates that about half of the tokens are fine-grained and half are coarse-grained in Chinese. Similarly, we tokenize 10,000 randomly selected English sentences from five different tasks in GLUE with our English phrase tokenizer. The average proportion of phrases is only 13.1%, which means that there are far fewer coarse-grained tokens than fine-grained tokens in English. (Please refer to Table 10 in the Appendix for more details of the experiments.) Therefore, we postulate that for Chinese it is necessary for a model to process the language at both fine-grained and coarse-grained levels. AMBERT indeed has the capability.
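A proportion of this kind can be computed from the output of the coarse-grained tokenizer. The sketch below assumes that the reported figure is the fraction of tokens that span more than one fine-grained unit (multi-character words in Chinese, multi-word phrases in English); the paper does not state the exact definition, and the function name and input format are illustrative.

```python
from typing import List


def coarse_proportion(segmented: List[List[str]]) -> float:
    """Fraction of tokens that span more than one fine-grained unit.

    `segmented` is the coarse tokenizer output, with each token represented
    as the list of fine-grained units (characters or words) it contains.
    """
    if not segmented:
        return 0.0
    multi = sum(1 for token in segmented if len(token) > 1)
    return multi / len(segmented)


# "i ate ice cream in new york" -> 5 tokens, 2 of which are phrases.
example = [["i"], ["ate"], ["ice", "cream"], ["in"], ["new", "york"]]
print(f"{coarse_proportion(example):.1%}")  # -> 40.0%
```

Averaging this quantity over a sample of sentences gives a corpus-level figure comparable in spirit to the 51.5% and 13.1% reported above.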

Conclusion
In this paper, we have proposed a novel pre-trained language model called AMBERT, as an extension of BERT. AMBERT employs multi-grained tokenization, that is, it uses both words and phrases in English and both characters and words in Chinese. With multi-grained tokenization, AMBERT learns in parallel the representations of the fine-grained tokens and the coarse-grained tokens using two encoders with shared parameters. We also develop an alternative way of using AMBERT in inference to save computation cost. Experimental results have demonstrated that AMBERT significantly outperforms BERT and other models in NLU tasks in both English and Chinese. AMBERT increases average score of Google BERT by about 2.7% in Chinese benchmark CLUE. AMBERT improves Google BERT by over 3.0% on a variety of tasks in English benchmarks GLUE, SQuAD (1.1 and 2.0), and RACE.
As future work, we plan to study the following issues: 1) to investigate model acceleration methods in learning of AMBERT, such as sparse attention (Kitaev et al., 2020; Zaheer et al., 2020) and synthetic attention (Tay et al., 2020); 2) to apply the technique of AMBERT to other pre-trained language models such as XLNet; 3) to employ AMBERT in other NLU tasks.

A Attention maps for single-grained models
We construct fine-grained and coarse-grained BERT models for English and Chinese, and examine the attention maps of the models using the BertViz tool (Vig, 2019). Figure 4 shows the attention maps of the first layer of fine-grained models for several sentences in English and Chinese.
One can see that there are tokens that improperly attend to other tokens in the sentences. For example, in the English sentences, the words "drawing", "new", and "dog" have high attention weights to "portrait", "york", and "food", respectively, which are not appropriate. Similarly, in the Chinese sentences, the characters "拍", "北", "长" have high attention weights to "卖", "京", "市", respectively, which are also not reasonable. (It is verified that the bottom layers of BERT mainly represent lexical information, the middle layers mainly represent syntactic information, and the top layers mainly represent semantic information (Jawahar et al., 2019).) Ideally, at the first layer a token should only attend to the tokens with which it forms a lexical unit. This cannot be guaranteed in the fine-grained BERT model, however, because a fine-grained token may belong to multiple lexical units (i.e., there is ambiguity).

Figure 5 shows the attention maps of the first layer of coarse-grained models for the same sentences in English and Chinese. In the English sentences, the words are combined into the phrases of "drawing room", "york minister", and "dog food". The attentions are appropriate in the first two sentences, but not in the last sentence because of the incorrect tokenization. Similarly, in the Chinese sentences, the high attention weights of words "球拍 (bat)" and "京城 (capital)" are reasonable, but that of word "市长 (mayor)" is not. Note that incorrect tokenization is inevitable.

B Detailed descriptions for the benchmarks

B.1 Chinese Tasks
TNEWS is a text classification task in which titles of news articles in TouTiao are to be classified into 15 classes. IFLYTEK is a task of assigning app descriptions into 119 categories. CLUEWSC2020, standing for the Chinese Winograd Schema Challenge, is a co-reference resolution task. AFQMC is a binary classification task that aims to predict whether two sentences are semantically similar. CSL uses the Chinese Scientific Literature dataset containing abstracts and their keywords of papers, and the goal is to identify whether given keywords are the original keywords of a paper. CMNLI is based on translation from MNLI (Williams et al., 2017), which is a large-scale, crowd-sourced entailment classification task. CMRC2018 (Cui et al., 2018) makes use of a span-based dataset for Chinese machine reading comprehension. ChID (Zheng et al., 2019) is a large-scale Chinese IDiom cloze test. C³ (Sun et al., 2019a) is a free-form multiple-choice machine reading comprehension dataset for Chinese.

B.2 English Tasks
CoLA (

In pre-training of the AMBERT models, in total 15% of the coarse-grained tokens are masked, which is the same proportion as for the BERT models. We adopt the standard hyper-parameters of BERT in pre-training of the models except batch sizes, which are tuned to make our fine-grained BERT models comparable to the Google BERT models. Table 8 shows the hyper-parameters in our Chinese AMBERT and English AMBERT. Our BERT models and alternatives of AMBERT (AMBERT-Combo and AMBERT-Hybrid) all use the same hyper-parameters in pre-training. The optimizer is Adam (Kingma and Ba, 2014). To enhance efficiency, we use mixed precision for all the models. Training is carried out on Nvidia V100 GPUs. The number of GPUs used for training ranges from 32 to 64, depending on the model size.

Figure 4: Attention maps of first layers of fine-grained BERT models for English and Chinese sentences. The Chinese sentences are "商店里的兵乓球拍卖完了 (Table tennis bats are sold out in the shop)", "北上京城 施展平生报复 (Go north to Beijing to fulfill the dream)", "南京市长江大桥位于南京 (The Nanjing Yangtze River bridge is located in Nanjing)". Different colors represent attention weights in different heads and darkness represents weight.

C.2 Hyper-parameters in Fine-tuning
For the Chinese tasks, since the original papers do not report detailed hyper-parameters in fine-tuning of the baseline models, we uniformly use the same hyper-parameters as shown in Table 11 except the number of training epochs, because AMBERT and AMBERT-Combo have more parameters and need more training to converge. For all models, we stop training at the epoch where the performance on the development set stops improving. Table 11 also shows all the hyper-parameters in fine-tuning of the English models. We adopt the best hyper-parameters in the original papers for the baselines. Moreover, for AMBERT‡, we also tune the learning rate ([1e-5, 2e-5, 3e-5]) and batch size ([16, 32]) for GLUE with the same method as in RoBERTa.
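The search space for the tuned GLUE runs is small enough to enumerate directly. The sketch below lists the six combinations of the learning rates and batch sizes given above; the dictionary keys are illustrative names, and the training and evaluation loop that would consume each configuration is left out.

```python
from itertools import product

LEARNING_RATES = [1e-5, 2e-5, 3e-5]  # values listed in the text
BATCH_SIZES = [16, 32]               # values listed in the text

# Cartesian product of the two lists: one dict per fine-tuning run.
grid = [
    {"learning_rate": lr, "batch_size": bs}
    for lr, bs in product(LEARNING_RATES, BATCH_SIZES)
]

for config in grid:
    print(config)
```

Each configuration would be fine-tuned separately, with the best one chosen on the development set.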

D Data Augmentation
To enhance the performance, we conduct data augmentation for the three Chinese classification tasks of TNEWS, CSL, and CLUEWSC2020. In TNEWS, we use both keywords and titles. In CSL, we concatenate keywords with a special token " ". In CLUEWSC2020, we duplicate a few instances having pronouns in the training data such as "她 (she)".

Table 9 shows the results of using different values as regularization coefficients in fine-tuning on the development sets of CLUE, GLUE and RACE. It appears that for most tasks the use of regularization is necessary. For simplicity, we did not use the best value of the coefficient for each task; instead, we adopt 0.0 for RACE and 1.0 for the other tasks.
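The data augmentation steps described in Section D for CSL and CLUEWSC2020 could be sketched as follows. This is only an illustration under stated assumptions: the separator `HYPOTHETICAL_SEP` is a placeholder and not the special token actually used, the instance format (a dict with a `"text"` field) is assumed, and the sketch duplicates every instance containing a listed pronoun, whereas the text only says that a few such instances are duplicated.

```python
HYPOTHETICAL_SEP = "[KW]"  # placeholder separator; the actual token is not specified here


def join_keywords(keywords, sep=HYPOTHETICAL_SEP):
    """Concatenate CSL keywords into one string using a separator token."""
    return sep.join(keywords)


def duplicate_pronoun_instances(instances, pronouns=("她",), copies=1):
    """Append `copies` duplicates of each instance whose text has a pronoun.

    `instances` is assumed to be a list of dicts with a "text" field.
    The original order is preserved; duplicates follow their source instance.
    """
    augmented = []
    for inst in instances:
        augmented.append(inst)
        if any(p in inst["text"] for p in pronouns):
            augmented.extend(dict(inst) for _ in range(copies))
    return augmented
```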

F Case study
We also qualitatively study the results of BERT and AMBERT, and find that they support our claims (cf. Section 1) very well. Here, we give some random examples from the entailment tasks (QNLI and CMNLI) in Table 12. One can make the following observations.
1) The fine-grained models (e.g., Our BERT word) cannot effectively use complete lexical units such as "Doctor Who" and "打死" (sentence pairs 1 and 5), which may result in incorrect predictions.
2) The coarse-grained models (e.g., Our BERT phrase), on the other hand, cannot effectively deal with incorrect tokenizations, for example, "the blind" and "格式" (sentence pairs 2 and 6).
3) AMBERT is able to make effective use of complete lexical units such as "sister station" in sentence pair 4 and "员工/ 工人" in sentence pair 7, and is robust to incorrect tokenizations, such as "used to" in sentence pair 3.
4) AMBERT can in general make more accurate decisions on difficult sentence pairs with both fine-grained and coarse-grained tokenization results.

Table 12: Case study for sentence matching tasks in both English and Chinese (QNLI and CMNLI). The value "0" denotes entailment relation, while the value "1" denotes no entailment relation. WORD/PHRASE represents Our BERT word/phrase. In English the tokens in the same phrase are concatenated with " ", and in Chinese phrases are split with "/".

Columns: Sentence1, Sentence2, Label, WORD, PHRASE, AMBERT

Sentence pair 1
Sentence1: What Star Trek episode has a nod to Doctor Who? (What Star Trek episode has a nod to Doctor Who?)
Sentence2: There have also been many references to Doctor Who in popular culture and other science fiction, including Star Trek: The Next Generation ("The Neutral Zone") and Leverage. (There have also been many references to Doctor Who in popular culture and other science fiction, including Star Trek: the next generation ("the neutral zone") and leverage.)
Label: 0, WORD: 1, PHRASE: 0, AMBERT: 0

Sentence pair 2
Sentence1: What was the name of the blind date concept program debuted by ABC in 1966? (What was the name of the blind date concept program debuted by ABC in 1966?)
Sentence2: In December of that year, the ABC television network premiered The Dating Game, a pioneer series in its genre, which was a reworking of the blind date concept in which a suitor selected one of three contestants sight unseen based on the answers to selected questions. (In December of that year, the ABC television network premiered the dating game, a pioneer series in its genre, which was a reworking of the blind date concept in which a suitor selected one of three contestants sight unseen based on the answers to selected questions.)
Label: 0, WORD: 0, PHRASE: 1, AMBERT: 0

Sentence pair 3
Sentence1: What are two basic primary resources used to guage complexity? (What are two basic primary resources used to guage complexity?)
Sentence2: The theory formalizes this intuition, by introducing mathematical models of computation to study these problems and quantifying the amount of resources needed to solve them, such as time and storage. (The theory formalizes this intuition, by introducing mathematical models of computation to study these problems and quantifying the amount of resources needed to solve them, such as time and storage.)

Sentence pair 4
Sentence1: What is the frequency of the radio station WBT in North Carolina? (What is the frequency of the radio station WBT in north carolina?)
Sentence2: WBT will also simulcast the game on its sister station WBTFM (99.3 FM), which is based in Chester, South Carolina. (WBT will also simulcast the game on its sister station WBTFM (99.3 FM), which is based in Chester, South Carolina.)