Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training

We introduce CDBERT, a new learning paradigm that enhances the semantic understanding ability of Chinese PLMs with dictionary knowledge and the structure of Chinese characters. We name the two core modules of CDBERT Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries, and Jiezi refers to the process of enhancing characters' glyph representations with structure understanding. To facilitate dictionary understanding, we propose three pre-training tasks: Masked Entry Modeling, Contrastive Learning for Synonym and Antonym, and Example Learning. We evaluate our method on both the modern Chinese understanding benchmark CLUE and the ancient Chinese benchmark CCLUE. Moreover, we propose a new polysemy discrimination task, PolyMRC, based on our collected dictionary of ancient Chinese. Our paradigm demonstrates consistent improvements over previous Chinese PLMs across all tasks, and yields significant gains in the few-shot setting of ancient Chinese understanding.


Introduction
Large-scale pre-trained language models (PLMs) such as BERT (Devlin et al., 2018) and GPT (Brown et al., 2020) have revolutionized various research fields in the natural language processing (NLP) landscape, including language generation (Brown et al., 2020), text classification (Wang et al., 2018), and language reasoning. The de facto paradigm for building such LMs is to feed massive training corpora to a Transformer-based language model with billions of parameters.
Apart from English PLMs, similar approaches have also been attempted for multilingual (Lample and Conneau, 2019) and Chinese language understanding tasks (Sun et al., 2019a, 2021b). To enhance Chinese character representations, pioneering works have incorporated additional character information, including glyph (a character's geometric shape), pinyin (its pronunciation), and stroke (its writing order) (Sun et al., 2021b). Nevertheless, there still exists a large performance gap between current state-of-the-art (SOTA) English PLMs and those for Chinese or other non-Latin languages, which leads us to rethink a central question: what are the unique aspects of Chinese that are essential to achieve human-level Chinese understanding?
With an in-depth investigation of Chinese language understanding, this work points out the following crucial challenges that have barely been addressed in previous Chinese PLMs.
• Frequent vs. Rare Characters. Unlike English, which builds its frequently used vocabulary from 26 letters (30,522 WordPieces in BERT), the Chinese character vocabulary is much smaller (21,128 in Chinese BERT¹), of which only about 3,500 characters occur frequently. As of 2023, over 17 thousand characters have been newly appended to the Chinese character set. This phenomenon requires models to quickly adapt to rare or even unseen characters.
• One vs. Many Meanings. Compared with English expressions, polysemy is more common for Chinese characters, and most of their meanings are semantically distinct. Like the character set, the meanings of characters keep changing. For example, the character "卷" has recently acquired a new meaning: "the involution phenomenon caused by peer pressure".
• Holistic vs. Compositional Glyphs. Given the logographic nature of Chinese characters, glyph information has been incorporated in previous works. However, most treat the glyph as an independent visual image while neglecting its compositional structure and its relationship with the character's semantic meaning.
In this work, we propose CDBERT, a new Chinese pre-training paradigm that goes beyond feature aggregation and mines information from Chinese dictionaries and glyph structures, two essential sources that interpret Chinese characters' meanings. We name the two core modules of CDBERT Shuowen and Jiezi, in homage to one of the earliest Chinese dictionaries, compiled in the Han Dynasty. Figure 1 depicts the overall model. Shuowen refers to the process of finding the most appropriate definition of a character in a Chinese dictionary. Indeed, resorting to dictionaries for Chinese understanding is not unusual even for Chinese linguistics experts, especially when it comes to ancient Chinese (aka classical Chinese) understanding. Different from previous works that simply use dictionaries as an additional text corpus, we propose a fine-grained definition retrieval framework over Chinese dictionaries. Specifically, we design three types of objectives for dictionary pre-training: Masked Entry Modeling (MEM) to learn entry representations; a Contrastive Learning objective with synonyms and antonyms; and Example Learning (EL) to distinguish polysemy via the examples in the dictionary. Jiezi refers to the process of decomposing characters and understanding the semantic information carried by their glyphs. Such a process grants native Chinese speakers the ability to understand new characters. In CDBERT, we leverage radical embeddings and the previous success of the CLIP model (Radford et al., 2021) to enhance the model's glyph understanding capability.

¹ https://github.com/ymcui/Chinese-BERT-wwm
We evaluate CDBERT with extensive experiments and demonstrate consistent improvements over previous baselines on both modern and ancient Chinese understanding benchmarks. It is worth noting that our method achieves significant improvement on the CCLUE-MRC task in the few-shot setting. Additionally, we construct a new dataset that tests models' ability to distinguish polysemy in Chinese. Based on BaiduHanyu, we construct a polysemy machine reading comprehension task (PolyMRC): given an example and an entry, the model needs to choose the proper definition from multiple interpretations of the entry. We believe our benchmark will help advance Chinese semantic understanding.
In summary, the contributions of this work are four-fold: (i) We propose CDBERT, a new learning paradigm for improving PLMs with Chinese dictionaries and characters' glyph representations; (ii) We derive three pre-training tasks, Masked Entry Modeling, Contrastive Learning for Synonym and Antonym, and Example Learning, for learning a dictionary knowledge base with a polysemy retriever (Sec. 3.1); (iii) We propose a new task, PolyMRC, specially designed for benchmarking models' ability to distinguish polysemy in ancient Chinese; this new task complements existing benchmarks for Chinese semantic understanding (Sec. 4); (iv) We systematically evaluate and analyze CDBERT on both modern and ancient Chinese NLP tasks, and demonstrate improvements across all these tasks for different types of PLMs. In particular, we obtain a significant performance boost in the few-shot setting of ancient Chinese understanding.

Related Work
Chinese Language Model Chinese characters, different from Latin letters, are generally logograms. At an early stage, Devlin et al. (2018) and Liu et al. (2019b) proposed BERT-like language models with a character-level masking strategy on Chinese corpora. Sun et al. (2019b) took phrase-level and entity-level masking strategies to learn multi-granularity semantics. Cui et al. (2019) pre-trained transformers by masking all characters within a Chinese word. Lai et al. (2021) learn multi-granularity information with a constructed lattice graph. Recently, Zeng et al. (2021) and Su et al. (2022b) pre-trained large language models with billions of parameters for Chinese understanding and generation. In addition to improving masking strategies or model size, some researchers probe the semantics in the structure of Chinese characters to enhance word embeddings. Since Chinese characters are hierarchically composed of radicals, components, and strokes, various works (Sun et al., 2014; Shi et al., 2015; Li et al., 2015; Yin et al., 2016; Xu et al., 2016; Lu et al., 2022) learn Chinese word embeddings by combining indexed radical embeddings or hierarchical graphs. Benefiting from the strong representation capability of convolutional neural networks (CNNs), some researchers learn morphological information directly from the glyph (Liu et al., 2017; Zhang and LeCun, 2017; Dai and Cai, 2017; Su and Lee, 2017; Tao et al., 2019). Sehanobish and Song (2020) and Xuan et al. (2020) apply glyph embeddings to improve the performance of BERT on named entity recognition (NER). Besides, polysemy is common among Chinese characters, where one character may correspond to different meanings with different pronunciations. Therefore, Zhang et al. (2019) use pinyin to help models distinguish Chinese words. Sun et al. (2021c) first incorporated both glyph and pinyin of Chinese characters into a PLM, achieving SOTA performance across a wide range of Chinese NLP tasks. Su et al. (2022a) pre-trained a robust Chinese BERT with synthesized adversarial contrastive learning examples covering semantic, phonetic, and visual features.
Knowledge Augmented Pre-training Although PLMs have shown great success on many NLP tasks, they still have limitations on reasoning tasks and domain-specific tasks, where the data distribution of downstream tasks differs from the training corpus. Even ChatGPT, the strongest LLM, which achieves significant performance boosts across a wide range of NLP tasks, cannot answer questions involving up-to-date knowledge, and re-training LLMs frequently is impractical due to the prohibitive costs. As a result, researchers have been dedicated to injecting various types of knowledge into PLMs/LLMs. Knowledge in existing methods can be classified into text knowledge (Hu et al., 2022) and graph knowledge, where text knowledge can be further divided into linguistic and non-linguistic knowledge. Specifically, some works use lexical information (Lauscher et al., 2019; Lyu et al., 2021) or syntax trees (Sachan et al., 2020; Li et al., 2020; Bai et al., 2021) to enhance the ability of PLMs on linguistic tasks. For non-linguistic knowledge, some researchers incorporate general knowledge such as Wikipedia with retrieval methods (Guu et al., 2020; Yao et al., 2022) to improve performance on downstream tasks, while others use domain-specific corpora (Lee et al., 2019; Beltagy et al., 2019) to transfer PLMs to corresponding downstream tasks. Compared with text knowledge, a knowledge graph contains more structured information and is better suited for reasoning; thus a line of work (Liu et al., 2019a; He et al., 2021; Sun et al., 2021a) designed fusion methods to combine KGs with PLMs.
Dictionary Augmented Pre-training Considering the heavy-tailed distribution of pre-training corpora and the difficulty of accessing knowledge graphs, some works inject dictionary knowledge into PLMs to alleviate the above problems, e.g., by enhancing a PLM with rare-word definitions from English dictionaries, or by pre-training BERT with an English dictionary as a pre-training corpus and adopting an attention-based infusion mechanism for downstream tasks.

Shuowen: Dictionary as Pre-trained Knowledge
We model three steps of looking up a dictionary as pre-training tasks: 1) Masked Entry Modeling (MEM) — the basic usage of a dictionary is to clarify the meaning of an entry; 2) Contrastive Learning for Synonym and Antonym (CL4SA) — for ambiguous meanings, we often refer to synonyms and antonyms for further understanding; 3) Example Learning (EL) — we figure out the accurate meaning through several classical examples.
Masked Entry Modeling (MEM) Following existing transformer-based language pre-training models (Devlin et al., 2018; Liu et al., 2019b), we take MEM as a pre-training task: the entry is masked and must be recovered from its definition. The MEM objective L_mem is computed as the cross-entropy between the recovered entry and the ground truth.
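As a minimal sketch of MEM (the vocabulary, tokenization, and mock model output below are illustrative, not the paper's actual setup), the entry token is masked and the model's recovered distribution is scored against the ground-truth entry:

```python
import math

# Toy vocabulary and one dictionary entry; purely illustrative.
VOCAB = ["[MASK]", "犬", "忠", "诚", "动", "物"]
TOK = {w: i for i, w in enumerate(VOCAB)}

def make_mem_example(entry, definition):
    """Replace the entry token with [MASK]; the model must recover it
    from the dictionary definition that follows."""
    tokens = ["[MASK]"] + list(definition)
    return tokens, TOK[entry]

def mem_loss(predicted_probs, target_id):
    """L_mem: cross-entropy between the predicted distribution over the
    vocabulary and the ground-truth entry token."""
    return -math.log(predicted_probs[target_id] + 1e-12)

tokens, target = make_mem_example("犬", "忠诚动物")
# A mock model output that puts most of its mass on the correct entry.
probs = [0.05, 0.80, 0.05, 0.05, 0.025, 0.025]
loss = mem_loss(probs, target)
```

In a real implementation the `probs` row would come from the PLM's masked-token head rather than being hard-coded.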
Contrastive Learning for Synonym and Antonym (CL4SA) We adopt contrastive learning to better shape the semantics of the pre-trained representation. We construct positive sample pairs ⟨ent, syno⟩ with synonyms in the dictionary, and negative sample pairs ⟨ent, anto⟩ with antonyms in the dictionary. The goal of CL4SA is to pull the positive pair closer while pushing the negative pair apart. We describe the contrastive objective as:

L_cl4sa = max(0, γ − h_ent · h_syno + h_ent · h_anto) (1)

where · denotes the element-wise product (summed into a similarity score), γ is a margin, and h_ent, h_syno, h_anto are the representations of the original entry, the synonym, and the antonym, respectively. In practice, we use the hidden state of the [CLS] token as the representation of the input. Since antonyms in the dictionary are far fewer than synonyms, we randomly sample entries from the vocabulary as additional negatives. To distinguish these sampled entries from true antonyms, we assign the two kinds of negatives different weights.
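A hedged sketch of CL4SA on toy vectors, treating · as a summed element-wise product and using a margin form; the exact margin value and the down-weighting scheme for sampled negatives are assumptions for illustration:

```python
def dot(u, v):
    """Similarity as the summed element-wise product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def cl4sa_loss(h_ent, h_syno, h_anto, margin=1.0, weight=1.0):
    """Margin-based contrastive loss: pull <ent, syno> together and push
    <ent, anto> apart. `weight` down-weights randomly sampled negatives
    relative to true antonyms (the paper only says the two kinds of
    negatives get different weights)."""
    return weight * max(0.0, margin - dot(h_ent, h_syno) + dot(h_ent, h_anto))

# Toy [CLS] representations.
h_ent  = [1.0, 0.0]
h_syno = [0.9, 0.1]   # close to the entry
h_anto = [0.5, 0.0]   # still too close: the margin is violated

strict  = cl4sa_loss(h_ent, h_syno, h_anto, weight=1.0)   # true antonym
sampled = cl4sa_loss(h_ent, h_syno, h_anto, weight=0.5)   # random negative
```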
Example Learning (EL) Compared with other languages, polysemy in Chinese is more pervasive: most characters or words have more than one meaning. To better distinguish the multiple definitions of an entry in a certain context, we introduce example learning, which learns to weight the different definitions of an entry for a given example. Specifically, given an entry ent with K definitions def_1, ..., def_K and an example, an attention block produces weights Attn_def over the definitions. We use the cross-entropy loss to supervise the training of this meaning retriever:

L_el = CrossEntropy(one-hot(def), Attn_def) (2)

where one-hot(·) is a one-hot vector of the ground-truth definition index.
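Eq. (2) can be sketched as dot-product attention over a toy entry's definitions, followed by cross-entropy against the gold definition index (the representations and dimensions here are made up):

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def el_loss(h_example, def_reps, gold_index):
    """Example Learning: attend from the example representation over the
    K definition representations, then apply cross-entropy against the
    ground-truth definition index."""
    attn = softmax([sum(a * b for a, b in zip(h_example, d)) for d in def_reps])
    return -math.log(attn[gold_index] + 1e-12), attn

# Toy entry with K = 3 definitions; definition 1 matches the example.
h_example = [1.0, 0.0]
defs = [[0.1, 0.9], [2.0, 0.0], [0.2, 0.5]]
loss, attn = el_loss(h_example, defs, gold_index=1)
```

After pre-training, the same attention weights are what the retriever uses to pool definitions for downstream tasks.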
We sum all the above objectives to obtain the final loss function:

L = λ_1 L_mem + λ_2 L_cl4sa + λ_3 L_el (3)

where λ_1, λ_2, λ_3 are hyper-parameters balancing the three tasks.

Jiezi: Glyph-enhanced Character Representation
Chinese characters, unlike Latin script, carry strong semantic information in their glyphs. We adopt two structured learning strategies to capture the semantics of Chinese characters. Following Sun et al. (2021b), we extract the glyph feature with a CNN-based network.
CLIP-enhanced glyph representation To better capture the semantics of glyphs, we learn the glyph representation through a contrastive learning algorithm. Specifically, we concatenate a character c with its definition def as the text input, and generate a picture of the character as the visual input. We initialize our model with the pre-trained checkpoint of Chinese-CLIP and use the symmetric cross-entropy loss over the similarity scores between text and visual inputs as the objective. To alleviate the influence of pixel-level noise, we follow Jaderberg et al. (2014, 2016) and generate a large number of character images by transformations of font, size, direction, etc. Besides, we add some Chinese character images in the wild (Yuan et al., 2019) to the training corpus to improve robustness. Finally, we extract the glyph feature through the text encoder to mitigate pixel bias.
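The symmetric cross-entropy objective can be sketched as a generic CLIP-style loss on toy features (this is not the exact Chinese-CLIP implementation; the temperature and feature values are illustrative):

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    z = sum(exps)
    return [e / z for e in exps]

def symmetric_clip_loss(text_feats, image_feats, temperature=0.07):
    """Average the cross-entropy in both directions (text->image and
    image->text); the i-th "character + definition" text should match
    the i-th glyph image, i.e. the diagonal of the similarity matrix."""
    n = len(text_feats)
    sims = [[sum(a * b for a, b in zip(t, v)) / temperature
             for v in image_feats] for t in text_feats]
    t2i = -sum(math.log(softmax(sims[i])[i]) for i in range(n)) / n
    i2t = -sum(math.log(softmax([sims[j][i] for j in range(n)])[i])
               for i in range(n)) / n
    return (t2i + i2t) / 2

# Toy aligned features: text i matches image i exactly, so the loss is small.
text_feats = [[1.0, 0.0], [0.0, 1.0]]
image_feats = [[1.0, 0.0], [0.0, 1.0]]
loss = symmetric_clip_loss(text_feats, image_feats)
```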

Radical-based character embedding
Since the glyph feature requires extra processing and is affected by noise in the images, we also propose a radical-based embedding for end-to-end pre-training. We first construct a radical vocabulary, then add a radical embedding to each character's embedding according to its radical token in the radical vocabulary. Thus we can pre-train CDBERT in an end-to-end manner.
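The radical-based embedding amounts to a simple lookup-and-add; the radical mapping and vectors below are hand-picked toys, not the paper's actual tables:

```python
# Toy radical mapping: 江 and 河 share the water radical 氵.
RADICAL_OF = {"江": "氵", "河": "氵", "吗": "口"}
RADICAL_VOCAB = {"氵": 0, "口": 1}

char_emb = {"江": [0.2, 0.1], "河": [0.3, -0.1], "吗": [0.0, 0.5]}
radical_emb = [[1.0, 0.0], [0.0, 1.0]]  # one row per radical token

def embed(char):
    """Character embedding plus its radical embedding; both tables stay
    trainable, so pre-training remains end-to-end."""
    base = char_emb[char]
    rad = radical_emb[RADICAL_VOCAB[RADICAL_OF[char]]]
    return [b + r for b, r in zip(base, rad)]
```

Characters sharing a radical (e.g. 江 and 河) thus share an additive semantic component even if one of them is rare.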

Applying CDBERT to downstream tasks
Following prior work, we use CDBERT as a knowledge base for retrieving entry definitions. Specifically, given an input expression, we first look up all its entries in the dictionary. Then we use the dictionary pre-trained model to obtain representations of these entries. The attention block pre-trained with the EL task serves as a retriever that weights the definitions of each entry with multiple meanings, and a weighted sum pools them into the CDBERT-augmented representation of the input. Finally, we concatenate the original output of the language model with the CDBERT-augmented representation for the final prediction in downstream tasks.
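The fusion step can be sketched as weighted-sum pooling over retrieved definition representations followed by feature concatenation (shapes and values are illustrative):

```python
def fuse_for_downstream(lm_output, def_reps, attn_weights):
    """Pool the retrieved definition representations with the EL-task
    attention weights, then concatenate the pooled vector with the
    language model's output for the downstream head."""
    dim = len(def_reps[0])
    pooled = [sum(w * d[i] for w, d in zip(attn_weights, def_reps))
              for i in range(dim)]
    return lm_output + pooled  # list concatenation = feature concatenation

# One entry with two candidate definitions; the retriever prefers the second.
lm_output = [0.5, -0.5]
def_reps = [[1.0, 0.0], [3.0, 1.0]]
attn_weights = [0.25, 0.75]
fused = fuse_for_downstream(lm_output, def_reps, attn_weights)
```

The downstream classifier then consumes the concatenated vector, so the original LM features are preserved alongside the dictionary signal.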

The PolyMRC Task
(Figure 3: Illustration of applying CDBERT to downstream tasks. ⊕ indicates the concatenation operation; the Attn. Block is the attention model pre-trained with the EL task.)

Most existing Chinese language understanding benchmarks do not require strong semantic understanding ability. Hence, we propose a new dataset and a new machine reading comprehension task focusing on polysemy understanding. Specifically, we construct a dataset from dictionary entries that have multiple meanings and examples. In the Polysemy Machine Reading Comprehension (PolyMRC) task, the example serves as the context and the explanations serve as the choices; the goal is to find the correct explanation of the entry in the example. Table 1 shows the statistics of the dataset.
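A hypothetical PolyMRC instance and its evaluation might look like the following; the field names and the definitions shown are illustrative, not taken from the released dataset:

```python
# One made-up PolyMRC instance: the example sentence is the context,
# the entry's candidate definitions are the choices, and the label
# indexes the correct definition.
instance = {
    "entry": "卷",
    "context": "考试临近，大家都卷起来了。",
    "choices": [
        "把东西弯转裹成圆筒形",            # to roll up
        "因同辈竞争压力而过度努力的现象",  # "involution" under peer pressure
        "书卷；卷宗",                      # scroll; file
    ],
    "label": 1,
}

def accuracy(predictions, instances):
    """Fraction of instances whose predicted choice index matches the label."""
    correct = sum(p == inst["label"] for p, inst in zip(predictions, instances))
    return correct / len(instances)
```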

Implementation Details
We pre-train CDBERT based on multiple official pre-trained Chinese BERT models. All models are pre-trained for 10 epochs with batch size 64 and maximum sequence length 256. We adopt AdamW as the optimizer and set the learning rate to 5e-5 with a warmup ratio of 0.05. We set λ_1 = 0.6, λ_2 = 0.2, and λ_3 = 0.2 in Eqn. (3) for all experiments. We fine-tune on CLUE with the default settings reported in the CLUE GitHub repository².

Baselines
BERT We adopt the official BERT-base model pre-trained on the Chinese Wikipedia corpus as a baseline.
RoBERTa Besides BERT, we use two stronger PLMs as baselines: RoBERTa-base-wwm-ext and RoBERTa-large-wwm-ext (RoBERTa and RoBERTa-large for short). In these models, wwm denotes continued pre-training of the official RoBERTa models with the whole-word-masking strategy, and ext denotes pre-training on extended data beyond the Wikipedia corpus.
MacBERT MacBERT improves on RoBERTa by adopting the MLM-as-correction (Mac) masking strategy and adding sentence order prediction (SOP) as a new pre-training task. We use MacBERT-large as a strong baseline.

CLUE
We evaluate the general natural language understanding (NLU) capability of our method on the CLUE benchmark, which includes text classification and machine reading comprehension (MRC) tasks. There are six datasets for text classification: CMNLI for natural language inference, IFLYTEK for long text classification, TNEWS' for short text classification, AFQMC for semantic similarity, CLUEWSC 2020 for coreference resolution, and CSL for keyword recognition. The text classification tasks can be further divided into single-sentence tasks and sentence-pair tasks. The MRC tasks include span-selection-based CMRC2018, multiple-choice questions C3, and the idiom cloze ChID.

The results of text classification are shown in Table 2. In general, CDBERT performs better on single-sentence tasks than on sentence-pair tasks. Compared with the baselines, CDBERT achieves an average improvement of 1.8% on single-sentence classification (TNEWS', IFLYTEK, and WSC). Besides, CDBERT outperforms the baselines on the long text classification task IFLYTEK by 2.08% accuracy on average, which is more significant than the gain (1.07%) on the short text classification task TNEWS'. This is because TNEWS' consists of news titles in 15 categories, and most titles consist of common words that are easy to understand, while IFLYTEK requires a comprehensive understanding of long contexts. In comparison, the average improvement on sentence-pair tasks brought by CDBERT is 0.76%, lower than on single-sentence tasks. These results suggest that dictionaries are of limited help for advanced NLU abilities such as entailment, keyword extraction, and natural language inference.

² https://github.com/CLUEbenchmark/CLUE
We present the results on MRC tasks in Table 3. CDBERT yields an average performance boost of 0.79% on MRC tasks across all baselines. It is worth noting that as the PLM grows in parameters and training corpus, the gain obtained by CDBERT becomes smaller. We believe this reflects a limitation of the CLUE benchmark, as several large language models have already surpassed human performance on it.

CCLUE
Ancient Chinese (aka classical Chinese) is the essence of Chinese culture, but it differs from modern Chinese in many ways. CCLUE³ is a general ancient Chinese NLU benchmark including an NER task, short and long sentence classification tasks, and a machine reading comprehension task. We use CCLUE to evaluate the ability of CDBERT to adapt modern Chinese pre-trained models to ancient Chinese understanding tasks.
To assess how CDBERT helps modern Chinese PLMs understand ancient Chinese, we test our model on the CCLUE benchmark, pre-training CDBERT on an ancient Chinese dictionary for fairness. Results are presented in Table 4: CDBERT helps on all three general NLU task types, namely sequence labeling, text classification, and machine reading comprehension. On the MRC task, CDBERT improves the average accuracy of all 4 models from 42.93 to 44.72 (4.15% relative), significantly better than on the other tasks. In addition, the gain obtained from model scale is smaller than on the CLUE datasets, presumably because the pre-training corpora of these PLMs contain little ancient Chinese.

Results on PolyMRC are shown in Table 5. Compared to the baselines, CDBERT shows a 1.01% average improvement in accuracy. We notice that the overall performance correlates only weakly with the scale of the PLM's training corpus, which is a good sign: it suggests the new task cannot be solved by simply adding training data.

FewShot Setting on PolyMRC and CCLUE-MRC
To further investigate the ability of CDBERT in the few-shot setting, we construct two challenge datasets based on CCLUE-MRC and PolyMRC.

Ablation Study

We conduct ablation studies on the different components of CDBERT, using CCLUE-MRC for analysis and RoBERTa-base as the backbone. The overall results are shown in Table 7. Overall, CDBERT improves RoBERTa from 42.30 to 44.14 (4.3% relative).
The Effect of Character Structure We first evaluate the effects of radical embeddings and glyph embeddings. For fair comparison, we keep all other settings unchanged and focus on the following setups: "-Radical", where the radical embedding is removed, and "Rep Glyph", where the radical embedding is replaced with the glyph embedding. Results are shown in rows 3-4. When we replace the radical embedding with the glyph embedding, accuracy drops by 1.61 points, a larger degradation than removing the radical embedding. We attribute this to the scale of the training corpus, which is not large enough to fuse the pre-trained glyph feature into CDBERT.
The Effect of Dictionary We then assess the effectiveness of the dictionary. We replace the original dictionary with a character dictionary (row 5) and keep the model size and related hyper-parameters the same as in CDBERT pre-training for fairness. During fine-tuning, we identify all characters included in the character dictionary for injecting dictionary knowledge. We observe that the character-level CDBERT helps to some degree (1.1%) but is much worse than the original CDBERT. On the one hand, the number of Chinese characters is limited; on the other hand, a word and its constituent characters may have totally different explanations.
The Effect of Pre-training Tasks Finally, we evaluate the pre-training tasks of CDBERT, namely CL4SA and EL (rows 6-7). Both CL4SA and EL help improve the NLU ability of the PLM, with EL demonstrating a larger improvement than CL4SA: the average improvements on CCLUE-MRC brought by CL4SA and EL are 1.05% and 2.68%, respectively. To verify that the gains come from CDBERT rather than the additional corpus, we follow Cui et al. (2019) and continue pre-training RoBERTa on the dictionary, regarded as extended data. As shown in row 8, the additional pre-training data does bring further improvement, but it remains about 1 point below our proposed CDBERT.

Limitations
We collect the dictionary from the Internet, and although we make an effort to remove duplicate explanations, there remains noise in the dictionary. Besides, not all words are included in the dictionary; in other words, both the quality and the coverage of the entries remain to be improved. Additionally, our method is pre-trained on BERT-like transformers to enhance the corresponding PLMs, and cannot be directly applied to LLMs whose internals are unavailable. In the future, we will use the retriever for disambiguation and for infusing dictionary knowledge into LLMs.

Conclusion
In this work, we leverage Chinese dictionaries and the structural information of Chinese characters to enhance the semantic understanding ability of PLMs. To make Chinese dictionary knowledge better serve PLMs, we propose three pre-training objectives simulating the process of looking up a dictionary, and incorporate radical or glyph features into CDBERT. Experimental results on both modern and ancient Chinese tasks show that our method significantly improves the semantic understanding ability of various PLMs. In the future, we will explore our method on more high-quality dictionaries (e.g., bilingual dictionaries), adapt it to LLMs to reduce semantic errors, and probe more fine-grained structural information of logograms in both understanding and generation tasks.