ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information

Recent pretraining models in Chinese neglect two important aspects specific to the Chinese language: glyph and pinyin, which carry significant syntactic and semantic information for language understanding. In this work, we propose ChineseBERT, which incorporates both the glyph and pinyin information of Chinese characters into language model pretraining. The glyph embedding is obtained from different fonts of a Chinese character and is able to capture character semantics from visual features, and the pinyin embedding characterizes the pronunciation of Chinese characters, which handles the highly prevalent heteronym phenomenon in Chinese (the same character has different pronunciations with different meanings). Pretrained on a large-scale unlabeled Chinese corpus, the proposed ChineseBERT model yields a significant performance boost over baseline models with fewer training steps. The proposed model achieves new SOTA performances on a wide range of Chinese NLP tasks, including machine reading comprehension, natural language inference, text classification and sentence pair matching, and competitive performances in named entity recognition and word segmentation.


Introduction
Large-scale pretrained models have become a fundamental backbone for various natural language processing tasks such as natural language understanding, text classification (Reimers and Gurevych, 2019; Chai et al., 2020) and question answering (Clark and Gardner, 2017; Lewis et al., 2020). Apart from English NLP tasks, pretrained models have also demonstrated their effectiveness for various Chinese NLP tasks (Cui et al., 2019a).
Since pretraining models were originally designed for English, two important aspects specific to the Chinese language are missing in current large-scale pretraining: glyph-based information and pinyin-based information. For the former, a key aspect that distinguishes Chinese from languages such as English and German is that Chinese is a logographic language. The logographic form of a character encodes semantic information. For example, "液(liquid)", "河(river)" and "湖(lake)" all have the radical "氵(water)", which indicates that they are all semantically related to water. Intuitively, the rich semantics behind Chinese character glyphs should enhance the expressiveness of Chinese NLP models. This idea has motivated a variety of work on learning and incorporating Chinese glyph information into neural models (Sun et al., 2014; Shi et al., 2015; Liu et al., 2017; Dai and Cai, 2017; Su and Lee, 2017; Meng et al., 2019), but not yet in large-scale pretraining.
For the latter, pinyin, the Romanized sequence of a Chinese character representing its pronunciation(s), is crucial in modeling both semantic and syntactic information that cannot be captured by contextualized or glyph embeddings. This aspect is especially important considering the highly prevalent heteronym phenomenon in Chinese, where the same character has multiple pronunciations, each of which is associated with a specific meaning. Each pronunciation is associated with a specific pinyin expression. At the semantic level, for example, the Chinese character "乐" has two distinctly different pronunciations: "乐" can be pronounced as "yuè [yE 51 ]", which means "music", and "lè [lG 51 ]", which means "happy". At the syntactic level, pronunciations help identify the part of speech of a character. For example, the character "还" has two pronunciations: "huán [xwan 35 ]" and "hái [xaI 35 ]", with the former meaning the verb "return" and the latter meaning the adverb "also". Different pronunciations of the same character cannot be distinguished by the glyph embedding, since the logographic form is the same, or by the char-ID embedding, since they both point to the same character ID, but they can be characterized by pinyin.
In this work, we propose ChineseBERT, a model that incorporates the glyph and pinyin information of Chinese characters into the process of large-scale pretraining. The glyph embedding is based on different fonts of a Chinese character and is able to capture character semantics from the visual surface character forms. The pinyin embedding models different semantic meanings that share the same character form and thus bypasses the limitation of intertwined morphemes behind a single character. For a Chinese character, the glyph embedding, the pinyin embedding and the character embedding are combined to form a fusion embedding, which models the distinctive semantic property of that character.
With less training data and fewer training epochs, ChineseBERT achieves a significant performance boost over baselines across a wide range of Chinese NLP tasks. It achieves new SOTA performances on machine reading comprehension, natural language inference, text classification and sentence pair matching, and results comparable to SOTA performances in named entity recognition and word segmentation.

Related Work

Large-Scale Pretraining in NLP
Recent years have witnessed substantial work on large-scale pretraining in NLP. BERT (Devlin et al., 2018), which is built on top of the Transformer architecture (Vaswani et al., 2017), is pretrained on a large-scale unlabeled text corpus with the Masked Language Model (MLM) and Next Sentence Prediction (NSP) objectives. Following this trend, considerable progress has been made by modifying the masking strategy (Joshi et al., 2020), the pretraining tasks (Clark et al., 2020) or the model backbones (Choromanski et al., 2020). Specifically, RoBERTa removed the NSP pretraining task, since it has been shown to offer no benefit for downstream performance. The GPT series (Radford et al., 2019; Brown et al., 2020) and other BERT variants (Lewis et al., 2019; Song et al., 2019; Lample and Conneau, 2019; Dong et al., 2019; Zhu et al., 2020) adapted the paradigm of large-scale unsupervised pretraining to text generation tasks such as machine translation, text summarization and dialog generation, so that generative models can enjoy the benefit of large-scale pretraining.
Unlike English, Chinese has its own characteristics in terms of syntax, lexicon and pronunciation. Hence, the pretraining of Chinese models should be adapted to these characteristics. Prior work proposed using Chinese characters as the basic unit instead of the words or subwords used in English (Wu et al., 2016; Sennrich et al., 2016). ERNIE applied three types of masking strategies (char-level masking, phrase-level masking and entity-level masking) to enhance the ability to capture multi-granularity semantics. Cui et al. (2019a) pretrained models using the Whole Word Masking strategy, where all characters within a Chinese word are masked together. In this way, the model learns to address a more challenging task than predicting word components. More recently, Zhang et al. (2020) developed the largest Chinese pretrained language model to date, CPM. It is pretrained on 100GB of Chinese data and has 2.6B parameters, comparable to "GPT3 2.7B" (Brown et al., 2020). Xu et al. (2020) released the first large-scale Chinese Language Understanding Evaluation benchmark, CLUE, facilitating research in large-scale Chinese pretraining.

Learning Glyph Information
Learning glyph information from surface Chinese character forms has attracted attention since the rise of deep neural networks. Inspired by word embeddings (Mikolov et al., 2013b,a) [...]

Figure 2: Glyph embedding. The operator shown in the figure denotes vector concatenation. For each Chinese character, we use three types of fonts (FangSong, XingKai and LiShu), each of which is a 24 × 24 image with pixel values ranging from 0 to 255. The images are concatenated into a tensor of size 24 × 24 × 3. The tensor is flattened and passed to an FC layer to obtain the glyph embedding.

Figure 3: Pinyin embedding. For any Chinese character, e.g. 猫 (cat) in this case, a CNN with width 2 is applied to the sequence of Romanized pinyin letters, followed by max-pooling to derive the final pinyin embedding.

Figure 4: Fusion embedding. The operator shown in the figure denotes vector concatenation, and × is vector-matrix multiplication. We concatenate the char embedding, the glyph embedding and the pinyin embedding, and use an FC layer with a learnable matrix W_F to induce the fusion embedding.

Figure 1 shows an overview of the proposed ChineseBERT model. For each Chinese character, its char embedding, glyph embedding and pinyin embedding are first concatenated, and then mapped to a D-dimensional embedding through a fully connected layer to form the fusion embedding. The fusion embedding is then added to the position embedding, and the sum is fed as input to the BERT model. Since we do not use the NSP pretraining task, we omit the segment embedding. We use both Whole Word Masking (WWM) (Cui et al., 2019a) and Char Masking (CM) for pretraining (see Section 4.2 for details).

Input
The input to the model is the addition of the learnable absolute positional embedding and the fusion embedding, where the fusion embedding is based on the char embedding, the glyph embedding and the pinyin embedding of the corresponding character. The char embedding performs in a way analogous to the token embedding used in BERT but at the character granularity. Below we respectively describe how to induce the glyph embedding, the pinyin embedding and the fusion embedding.
Glyph Embedding We follow Meng et al. (2019) and use three types of Chinese fonts (FangSong, XingKai and LiShu), each of which is instantiated as a 24 × 24 image with floating-point pixel values ranging from 0 to 255. Unlike Meng et al. (2019), who used CNNs to convert images to representations, we use an FC layer. We first flatten the 24 × 24 × 3 tensor into a 1,728-dimensional vector (24 × 24 × 3 = 1,728). The flattened vector is then fed to an FC layer to obtain the output glyph vector.
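The flatten-then-project step can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the pixel data are random stand-ins for rendered font images, the output size `EMBED_DIM` is a hypothetical small value, and the flattening order is an assumption. Note that three 24 × 24 images yield 24 × 24 × 3 = 1,728 input features.

```python
import random

FONT_COUNT = 3    # FangSong, XingKai, LiShu
IMAGE_SIZE = 24   # each glyph image is 24 x 24
EMBED_DIM = 8     # hypothetical output size D, kept small for illustration


def flatten_glyph(images):
    """Flatten FONT_COUNT images (each IMAGE_SIZE x IMAGE_SIZE) into one vector.

    The font-major ordering used here is an assumption; any fixed order works
    as long as it is the same for every character.
    """
    flat = []
    for image in images:
        for row in image:
            flat.extend(row)
    return flat


def fully_connected(vector, weights):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


rng = random.Random(0)
# Stand-in pixel values in [0, 255]; real images would be rendered from fonts.
images = [
    [[rng.uniform(0, 255) for _ in range(IMAGE_SIZE)] for _ in range(IMAGE_SIZE)]
    for _ in range(FONT_COUNT)
]
flat = flatten_glyph(images)
weights = [[rng.gauss(0, 0.01) for _ in range(len(flat))] for _ in range(EMBED_DIM)]
glyph_embedding = fully_connected(flat, weights)
```

Replacing the CNN of earlier glyph models with a single linear map keeps the glyph branch cheap, at the cost of ignoring the spatial layout of the image.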
Pinyin Embedding The pinyin embedding for each character is used to decouple different semantic meanings belonging to the same character form, as shown in Figure 3. We use the open-source pypinyin package to generate pinyin sequences for the constituent characters of the input. pypinyin is a system that combines machine learning models with dictionary-based rules to infer the pinyin of characters in context. The pinyin of a Chinese character is a sequence of Roman letters, with one of four diacritics denoting the tone. We use special tokens to denote tones, which are appended to the end of the Roman letter sequence. We apply a CNN with width 2 to the pinyin sequence, followed by max-pooling, to derive the resulting pinyin embedding. This makes the output dimensionality independent of the length of the input pinyin sequence. The length of the input pinyin sequence is fixed at 8; when the actual sequence is shorter than 8, the remaining slots are filled with the special letter "-".
Fusion Embedding Once we have the char embedding, the glyph embedding and the pinyin embedding for a character, we concatenate them to form a 3D-dimensional vector. The fusion layer maps the 3D-dimensional vector to a D-dimensional vector through a fully connected layer. The fusion embedding is added to the position embedding, and the sum is fed to the BERT layers. An illustration is shown in Figure 4.
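As a hedged illustration of the pinyin branch, the sketch below pads a pinyin sequence to length 8, applies a width-2 convolution over letter embeddings and max-pools over positions. The digit tone tokens, the vocabulary, the dimensions `LETTER_DIM` and `OUT_DIM`, and the random weights are all assumptions made for the example; only the width, the fixed length and the padding letter come from the text.

```python
import random

MAX_LEN = 8                    # fixed pinyin sequence length
PAD = "-"                      # special padding letter
TONES = ["1", "2", "3", "4"]   # hypothetical tone tokens
VOCAB = list("abcdefghijklmnopqrstuvwxyz") + TONES + [PAD]
LETTER_DIM = 4                 # hypothetical letter embedding size
OUT_DIM = 6                    # hypothetical number of convolution filters


def encode_pinyin(letters, tone):
    """Append the tone token, then right-pad with PAD up to MAX_LEN."""
    tokens = list(letters) + [str(tone)]
    if len(tokens) > MAX_LEN:
        raise ValueError("pinyin sequence longer than MAX_LEN")
    return tokens + [PAD] * (MAX_LEN - len(tokens))


def pinyin_embedding(tokens, letter_table, filters):
    """Width-2 convolution over letter embeddings, then max-pooling."""
    vectors = [letter_table[t] for t in tokens]
    pooled = []
    for kernel in filters:  # each kernel holds 2 * LETTER_DIM weights
        responses = []
        for i in range(len(vectors) - 1):
            window = vectors[i] + vectors[i + 1]  # two adjacent letters
            responses.append(sum(w * x for w, x in zip(kernel, window)))
        pooled.append(max(responses))  # max over all window positions
    return pooled


rng = random.Random(0)
letter_table = {t: [rng.gauss(0, 1) for _ in range(LETTER_DIM)] for t in VOCAB}
filters = [[rng.gauss(0, 1) for _ in range(2 * LETTER_DIM)] for _ in range(OUT_DIM)]

mao = encode_pinyin("mao", 1)  # 猫 (cat), first tone
mao_vec = pinyin_embedding(mao, letter_table, filters)
```

Because max-pooling collapses the position axis, the output always has `OUT_DIM` entries regardless of how many letters the syllable contains.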
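The fusion step amounts to a concatenation followed by one matrix multiplication and an addition. The following sketch assumes a toy dimensionality `D = 4` and random vectors; the function names `fusion_embedding` and `model_input` are hypothetical.

```python
import random

D = 4  # hypothetical embedding size, kept small for illustration


def fusion_embedding(char_vec, glyph_vec, pinyin_vec, w_f):
    """Concatenate three D-dim vectors (3D entries) and project back to D."""
    concat = char_vec + glyph_vec + pinyin_vec
    return [sum(w * x for w, x in zip(row, concat)) for row in w_f]


def model_input(fusion_vec, position_vec):
    """Add the position embedding to the fusion embedding."""
    return [f + p for f, p in zip(fusion_vec, position_vec)]


rng = random.Random(0)
char_vec = [rng.gauss(0, 1) for _ in range(D)]
glyph_vec = [rng.gauss(0, 1) for _ in range(D)]
pinyin_vec = [rng.gauss(0, 1) for _ in range(D)]
position_vec = [rng.gauss(0, 1) for _ in range(D)]
# W_F has D rows and 3D columns.
w_f = [[rng.gauss(0, 0.1) for _ in range(3 * D)] for _ in range(D)]

fused = fusion_embedding(char_vec, glyph_vec, pinyin_vec, w_f)
inputs = model_input(fused, position_vec)
```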

Data
We collected our pretraining data from CommonCrawl. After pre-processing (such as removing data with too much English text and filtering out HTML tags), about 10% of the data is retained as high-quality data for pretraining, containing 4B Chinese characters in total. We use the LTP toolkit (Che et al., 2010) to identify the boundaries of Chinese words for whole word masking.

Masking Strategies
We use two masking strategies, Whole Word Masking (WWM) and Char Masking (CM), for ChineseBERT. Prior work suggested that using Chinese characters as the basic input unit can alleviate the out-of-vocabulary issue in the Chinese language. We thus adopt the method of masking random characters in the given context, denoted by Char Masking. On the other hand, a large number of words in Chinese consist of multiple characters, for which the CM strategy may be too easy for the model to predict. For example, for the input context "我喜欢逛紫禁[M] (I like going to the Forbidden [M])", the model can easily predict that the masked character is "城 (City)". Hence, we follow Cui et al. (2019a) and use WWM, a strategy that masks out all characters within a selected word, mitigating the easy-prediction shortcoming of the CM strategy. Note that for both WWM and CM, the basic input unit is the Chinese character. The main difference between WWM and CM lies in how they mask characters and how the model predicts masked characters.
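The difference between the two strategies can be sketched as follows. This is an illustrative toy, not the pretraining code: the word segmentation of the example sentence is written by hand (in practice it would come from a word segmenter), and the mask token `[M]` follows the example above.

```python
import random

MASK = "[M]"


def char_masking(words, rng, mask_prob=0.15):
    """Mask each character independently of word boundaries."""
    chars = [c for word in words for c in word]
    return [MASK if rng.random() < mask_prob else c for c in chars]


def whole_word_masking(words, rng, mask_prob=0.15):
    """Select whole words; every character of a selected word is masked."""
    output = []
    for word in words:
        if rng.random() < mask_prob:
            output.extend([MASK] * len(word))
        else:
            output.extend(list(word))
    return output


# Hand-written segmentation of "我喜欢逛紫禁城" (assumed, for illustration).
words = ["我", "喜欢", "逛", "紫禁城"]
cm_example = char_masking(words, random.Random(0))
wwm_example = whole_word_masking(words, random.Random(0))
```

In both functions the output is a character sequence of the same length as the input; only the unit of selection differs.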

Pretraining Details
Different from Cui et al. (2019a), who pretrained their model starting from the official pretrained Chinese BERT model, we train the ChineseBERT model from scratch. To encourage the model to learn both long-term and short-term dependencies, we alternate pretraining between packed input and single input, where the packed input is the concatenation of multiple sentences with a maximum length of 512, and the single input is a single sentence. We feed the packed input with a probability of 0.9 and the single input with a probability of 0.1. We apply Whole Word Masking 90% of the time and Char Masking 10% of the time. The masking probability for each word/char is 15%. If the i-th word/char is chosen, we mask it 80% of the time, replace it with a random word/char 10% of the time and keep it unchanged 10% of the time. We also use the dynamic masking strategy to avoid duplicate training instances. We use two model setups, base and large, consisting of 12/24 Transformer layers respectively, with input dimensionality of 768/1,024 and 12/16 heads per layer. This makes our models comparable to other BERT-style models in terms of model size. At the time of submission, we had trained the base model for 500K steps with a maximum learning rate of 1e-4, a warmup of 20K steps and a batch size of 3.2K, and the large model for 280K steps with a maximum learning rate of 3e-4, a warmup of 90K steps and a batch size of 8K. After pretraining, the model can be finetuned directly on downstream tasks in the same way as BERT (Devlin et al., 2018).
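The sampling rules in this paragraph can be written down as a short sketch. The helper names `corrupt_selected` and `choose_input_mode`, the mask token string and the toy vocabulary are assumptions for illustration; the 80%/10%/10% split and the 0.9 packed-input probability are taken from the text.

```python
import random

MASK = "[MASK]"


def corrupt_selected(token, vocab, rng):
    """Apply the 80/10/10 rule to a token already chosen for prediction."""
    r = rng.random()
    if r < 0.8:
        return MASK               # 80%: replace with the mask token
    if r < 0.9:
        return rng.choice(vocab)  # 10%: replace with a random token
    return token                  # 10%: keep the original token


def choose_input_mode(rng, packed_prob=0.9):
    """Pick packed input with probability packed_prob, else single input."""
    return "packed" if rng.random() < packed_prob else "single"


rng = random.Random(0)
toy_vocab = ["我", "喜", "欢", "城"]  # hypothetical tiny vocabulary
corrupted = [corrupt_selected("紫", toy_vocab, rng) for _ in range(10000)]
modes = [choose_input_mode(rng) for _ in range(10000)]
```

Drawing these decisions afresh every time a sequence is fed to the model, rather than once during preprocessing, is one way to realize dynamic masking.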

Experiments
We conduct extensive experiments on a variety of Chinese NLP tasks. Models are separately finetuned on task-specific datasets for evaluation. Concretely, we use the following tasks:
• Machine Reading Comprehension (MRC)
• Natural Language Inference (NLI)
• Text Classification (TC)
• Sentence Pair Matching (SPM)
• Named Entity Recognition (NER)
• Chinese Word Segmentation (CWS)
We compare ChineseBERT to the current state-of-the-art ERNIE, BERT-wwm (Cui et al., 2019a) and MacBERT models. ERNIE adopts various masking strategies including token-level, phrase-level and entity-level masking to pretrain BERT on large-scale heterogeneous data. BERT-wwm/RoBERTa-wwm continues pretraining on top of the official pretrained Chinese BERT/RoBERTa models with the Whole Word Masking pretraining strategy. Unless otherwise specified, we use BERT/RoBERTa to represent BERT-wwm/RoBERTa-wwm and omit "wwm". MacBERT improves upon RoBERTa by using the MLM-as-correction (Mac) pretraining strategy as well as the sentence-order prediction (SOP) task. It is worth noting that BERT and BERT-wwm do not have large versions available online, and we thus omit the corresponding performances.
A comparison of these models is shown in Table 1. It is worth noting that the number of training steps of the proposed model is significantly smaller than that of the baseline models. Unlike BERT-wwm and MacBERT, which are initialized from pretrained BERT, the proposed model is trained from scratch. Because of the additional glyph and pinyin components, the proposed model cannot be directly initialized from a vanilla BERT model, as the model structures are different. Even though it is trained from scratch, the proposed model is trained for fewer steps than the further pretraining of BERT-wwm and MacBERT after BERT initialization.

Machine Reading Comprehension
Machine reading comprehension tests the model's ability to answer questions based on the given contexts. We use two datasets for this task: CMRC 2018 (Cui et al., 2019b) and CJRC (Duan et al., 2019). CMRC is a span-extraction style dataset, while CJRC additionally has yes/no questions and no-answer questions. CMRC 2018 and CJRC respectively contain 10K/3.2K/4.9K and 39K/6K/6K data instances for training/dev/test. Test results for CMRC 2018 are obtained from the CLUE leaderboard. Note that the CJRC dataset is different from the one used in Cui et al. (2019a), as Cui et al. (2019a) did not release their train/dev/test split. We thus run the released models on the CJRC dataset used in this work for comparison.
Results are shown in Table 2 and Table 3. As we can see, ChineseBERT yields a significant performance boost on both datasets, and the improvement of EM is larger than that of F1 on the CJRC dataset, which indicates that ChineseBERT is better at detecting exact answer spans.

Table 3: Performances of different models on the MRC dataset CJRC. We report results for baseline models based on their released models. • represents models pretrained on extended data.
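To make the distinction between the two metrics concrete, the sketch below computes exact match (EM) and a character-level overlap F1 for a single predicted span. Scoring at the character level and without any text normalization is an assumption made for this example; the official evaluation scripts of the datasets may differ.

```python
from collections import Counter


def exact_match(prediction, reference):
    """Return 1.0 only when the predicted span equals the reference span."""
    return float(prediction == reference)


def char_f1(prediction, reference):
    """Character-overlap F1 between a predicted and a reference span."""
    common = Counter(prediction) & Counter(reference)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(prediction)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


# A partially correct span: full precision, recall of 3/5.
em = exact_match("紫禁城", "北京紫禁城")   # 0.0
f1 = char_f1("紫禁城", "北京紫禁城")       # 0.75
```

A prediction with slightly wrong boundaries still earns partial F1 credit but no EM credit, which is why a larger EM gain points to more precise span boundaries.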

Natural Language Inference (NLI)
The goal of NLI is to determine the entailment relationship between a hypothesis and a premise. We use the Cross-lingual Natural Language Inference (XNLI) dataset (Conneau et al., 2018) for evaluation. The corpus is a crowd-sourced collection of 5K test and 2.5K dev pairs for the MultiNLI corpus. Each sentence pair is annotated with the "entailment", "neutral" or "contradiction" label. We use the official machine-translated Chinese data for training. Results are presented in Table 4, which shows that ChineseBERT achieves the best performance for both the base and large setups.

Table 4: Performances of different models on XNLI. Accuracy is reported for comparison. • represents models pretrained on extended data.

Text Classification (TC)
In text classification, the model is required to categorize a piece of text into one of the specified classes. We follow Cui et al. (2019a) and use THUCNews (Li and Sun, 2007) and ChnSentiCorp for this task. THUCNews is a subset of THUCTC, with 50K/5K/10K data points for training/dev/test respectively. Data is evenly distributed across 10 domains including sports, finance, etc. ChnSentiCorp is a binary sentiment classification dataset containing 9.6K/1.2K/1.2K data points for training/dev/test respectively. The two datasets are relatively simple, with vanilla BERT achieving an accuracy above 95%. Hence, apart from THUCNews and ChnSentiCorp, we also use TNEWS, a more difficult dataset that is included in the CLUE benchmark (Xu et al., 2020). TNEWS is a 15-class short news text classification dataset with 53K/10K/10K data points for training/dev/test. Results are shown in Table 5. On ChnSentiCorp and THUCNews, the improvement from ChineseBERT is marginal, as baselines have already achieved quite high results on these two datasets. On the TNEWS dataset, ChineseBERT outperforms all other models. We can see that the ERNIE model performs only slightly worse than ChineseBERT. This is because ERNIE is trained on additional web data, which is beneficial for modeling web news text that covers a wide range of domains.

Sentence Pair Matching (SPM)
For SPM, the model is asked to determine whether a given sentence pair expresses the same semantics. We use the LCQMC  and BQ Corpus

Named Entity Recognition (NER)
For NER tasks (Chiu and Nichols, 2016; Lample et al., 2016; Li et al., 2019a), the model is asked to identify named entities within a piece of text, which is formalized as a sequence labeling task. We use OntoNotes 4.0 (Weischedel et al., 2011) and Weibo NER (Peng and Dredze, 2015) for this task. OntoNotes has 18 named entity types and Weibo has 4 named entity types. OntoNotes and Weibo respectively contain 15K/4K/4K and 1,350/270/270 instances for training/dev/test. Results are shown in Table 7. As we can see, ChineseBERT significantly outperforms BERT and RoBERTa in terms of F1. In spite of a slight loss in precision for the base version, the gains in recall are particularly high, leading to a final performance boost in F1.
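The interplay of precision, recall and F1 can be illustrated with a small sketch that scores predicted entities against gold entities. Treating an entity as correct only when its (start, end, type) triple matches exactly is an assumption for this example, as are the toy spans; the text does not specify the scorer used.

```python
def precision_recall_f1(predicted, gold):
    """Entity-level scores over sets of (start, end, type) triples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy spans: one of two predictions is right, one of three gold entities found.
gold_entities = [(0, 2, "PER"), (5, 7, "LOC"), (9, 11, "ORG")]
predicted_entities = [(0, 2, "PER"), (5, 7, "ORG")]
p, r, f1 = precision_recall_f1(predicted_entities, gold_entities)
```

Since F1 is the harmonic mean of the two quantities, a large gain in recall can outweigh a small drop in precision.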

Chinese Word Segmentation
The task divides text into words and is formalized as a character-based sequence labeling task. We use the PKU and MSRA datasets for Chinese word segmentation. PKU consists of 19K/2K sentences for training and test, and MSRA consists of 87K/4K sentences for training and test. The output representation of each character is fed to a softmax function for the final prediction. Results are shown in Table 8, where we can see that ChineseBERT outperforms BERT-wwm and RoBERTa-wwm on both datasets for both metrics.
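One common way to cast segmentation as character-level labeling is a B/M/E/S tag scheme (begin, middle, end, single). The sketch below assumes this scheme for illustration; the text does not state which tag set is used.

```python
def words_to_tags(words):
    """Convert a segmented sentence into one B/M/E/S tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their predicted tags."""
    words, current = [], ""
    for char, tag in zip(chars, tags):
        current += char
        if tag in ("E", "S"):  # a word ends at E or S
            words.append(current)
            current = ""
    if current:                # flush an unfinished word, if any
        words.append(current)
    return words


segmented = ["我", "喜欢", "逛", "紫禁城"]
tags = words_to_tags(segmented)  # ["S", "B", "E", "S", "B", "M", "E"]
restored = tags_to_words("".join(segmented), tags)
```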

Ablation Studies
In this section, we conduct ablation studies to understand the behaviors of ChineseBERT. We use the Chinese named entity recognition dataset OntoNotes 4.0 for analysis and all models are based on the base version.

Table 9: Performances for different models without considering glyph or pinyin information.

The Effect of Glyph Embeddings and Pinyin Embeddings
We would like to explore the effects of glyph embeddings and pinyin embeddings. For a fair comparison, we pretrained different models on the same dataset, with the same number of training steps and the same model size. Setups include "-glyph", where glyph embeddings are not considered and we only consider pinyin and char-ID embeddings; "-pinyin", where pinyin embeddings are not considered and we only consider glyph and char-ID embeddings; and "-glyph-pinyin", where only char-ID embeddings are considered, and the model degenerates to RoBERTa. We finetune the different models on the OntoNotes 4.0 NER dataset for comparison. Results are shown in Table 9. As can be seen, removing either glyph embeddings or pinyin embeddings results in performance degradation, and removing both has the greatest negative impact on the F1 value, a drop of about 2 points. This validates the importance of both pinyin and glyph embeddings for modeling Chinese semantics. The reason why "-glyph-pinyin" performs worse than RoBERTa is that the model we use here is trained on less data with fewer training steps.

The Effect of Training Data Size
We hypothesize that glyph and pinyin embeddings also serve as strong regularization over text semantics, which means that the proposed ChineseBERT model is able to perform better with less training data. We randomly sample 10%∼90% of the training data while maintaining the ratio of samples with entities to samples without entities. We perform each experiment five times and report the average F1 value on the test set. Figure 5 shows the results. As can be seen, ChineseBERT performs better across all setups. With less than 30% of the training data, the improvement of ChineseBERT is slight, but with over 30% of the training data, the performance improvement is greater. This is because ChineseBERT still requires sufficient training data to fully train the glyph and pinyin embeddings, and insufficient training data would lead to inadequate training.
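The ratio-preserving sampling can be sketched as a simple stratified subsample. The function name, the `has_entity` predicate and the toy data are hypothetical; rounding each group size to the nearest integer is one possible choice among several.

```python
import random


def stratified_subsample(samples, has_entity, fraction, rng):
    """Sample the same fraction from each group so their ratio is preserved."""
    with_entities = [s for s in samples if has_entity(s)]
    without_entities = [s for s in samples if not has_entity(s)]
    k_with = round(len(with_entities) * fraction)
    k_without = round(len(without_entities) * fraction)
    return rng.sample(with_entities, k_with) + rng.sample(without_entities, k_without)


# Toy data: 40 samples with entities and 60 without (labels are made up).
toy_samples = [{"id": i, "entities": ["X"] if i < 40 else []} for i in range(100)]
subset = stratified_subsample(
    toy_samples, lambda s: bool(s["entities"]), 0.3, random.Random(0)
)
```

Repeating the call with different seeds and averaging the resulting scores corresponds to running each experiment several times.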

Conclusion
In this paper, we introduce ChineseBERT, a large-scale pretrained Chinese NLP model. It leverages the glyph and pinyin information of Chinese characters to enhance the model's ability to capture context semantics from surface character forms and to disambiguate polyphonic characters in Chinese. The proposed ChineseBERT model achieves a significant performance boost across a wide range of Chinese NLP tasks. It also performs better than vanilla pretrained models with less training data, indicating that the introduced glyph embeddings and pinyin embeddings serve as a strong regularizer for semantic modeling in Chinese. Future work involves training a larger version of ChineseBERT.