Glyph Enhanced Chinese Character Pre-Training for Lexical Sememe Prediction

Sememes are defined as the atomic units that describe the semantic meaning of concepts. Due to the difficulty of manually annotating sememes and the inconsistency of annotations between experts, the lexical sememe prediction task has been proposed. However, previous methods rely heavily on word or character embeddings and ignore fine-grained information. In this paper, we propose a novel pre-training method designed to better incorporate the internal information of Chinese characters. The Glyph enhanced Chinese Character representation (GCC) is used to assist sememe prediction. We experiment and evaluate our model on HowNet, a famous sememe knowledge base. The experimental results show that our method outperforms existing non-external-information models.


Introduction
In linguistics, sememes are defined as the minimum semantic units of human language (Bloomfield, 1926), which describe the semantic meaning of concepts. HowNet (Dong and Dong, 2003) is one of the most well-known sememe knowledge bases (KBs), and has been widely used in many NLP tasks (Qi et al., 2021), such as semantic similarity computation (Liu, 2002), sentiment analysis (Fu et al., 2013; Huang et al., 2014), language modeling (Gu et al., 2018), word representation learning (Niu et al., 2017) and short text matching (Lyu et al., 2021).
In order to free human experts from the laborious sememe annotation job, prior work proposes the task of sememe prediction, which aims to automatically select related sememes from a closed sememe set for each word, and introduces two frameworks based on word embeddings and matrix factorization. However, these methods usually fail on the prediction problem of low-frequency words.

* The corresponding authors are Lu Chen and Kai Yu.
Motivated by this, Jin et al. (2018) present character-enhanced sememe prediction (CSP), taking advantage of both internal character information and external context information of words. However, CSP is an ensemble model which still relies on word and character representations, and ignores fine-grained information.
For internal structural information of words, many researchers believe that using characters alone is not sufficient for capturing the semantic information (Yu et al., 2017; Cao et al., 2018; Meng et al., 2019). For instance, the words "森林(forest)" and "木头(wood)" are semantically related, but these two words share no information since they consist of different characters. To address this problem, we split each Chinese character into several components, and regard components as the minimum units expressing the meaning of a character. We believe that fine-grained units can share more information between semantically related words, which helps model prediction. Take Figure 1 for example: the characters of the word "濒海(near the sea)" have components "步(step)" and "氵(water)", which are related to its sememes, namely "靠近(BeNear)" and "水域(waters)", respectively.
In order to better incorporate the internal information of Chinese characters, we pre-train a Glyph enhanced Chinese Character embedding (GCC) with a masked multi-head self-attention model for the sememe prediction task. More specifically, we use the same model structure as BERT (Devlin et al., 2019), but change the input unit and the masking scheme. First, we regard Chinese words as our training samples and take the components of each character in the word to form the input sequence. Second, we mask random tokens and predict the modified tokens as well as all characters in the sample. We evaluate our model on the HowNet sememe KB. Experimental results demonstrate that our model outperforms the baseline model. In summary, our contributions include:
• To the best of our knowledge, we are the first to use a masked language model (MLM) objective to force the model to learn the internal information of characters.
• We propose a novel sememe prediction framework considering both internal and contextual character information.
• Our method is particularly useful for low-frequency words and shows effectiveness and robustness on the dataset.
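The masking scheme in the pre-training objective above can be sketched as follows. This is a minimal illustration only: the [C_M]/[S_M] tags and the rule that the [CHAR] tag is never masked follow the paper, while the 15% masking rate is an assumption borrowed from standard MLM practice.

```python
import random

def mask_sequence(tokens, types, rate=0.15, seed=0):
    """Replace random structure/component tokens with [S_M]/[C_M];
    the character tag [CHAR] is never masked. Returns the masked
    sequence and the prediction targets (None where nothing is masked)."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok, typ in zip(tokens, types):
        if typ != "CHAR" and rng.random() < rate:
            masked.append("[S_M]" if typ == "STC" else "[C_M]")
            targets.append(tok)   # the model must recover this token
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets

# Hypothetical serialized character: tag, then operators and components.
tokens = ["[CHAR]", "⿰", "氵", "⿰", "步", "页"]
types  = ["CHAR", "STC", "CPN", "STC", "CPN", "CPN"]
masked, targets = mask_sequence(tokens, types)
print(len(masked) == len(tokens))  # True
```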

Methodology
In this section, we first introduce the architecture of the pre-training model. Then, we describe how to incorporate the pre-trained representations into the sememe prediction task.

Pre-Training Model Architecture
As shown in Figure 2, the framework of our pre-training model includes an embedding layer and a masked transformer encoder layer. First, we use the file describing the structures of Han Ideographs and refer to Ke and Hagiwara to obtain all the Chinese character trees. Then, we use a depth-first algorithm to convert each character tree into a sequence (Nguyen et al., 2019). Note that there are two types of tokens in the input sequence. As shown in the left block of Figure 2, the leaf nodes (positions 2, 5, 6, 7) are components of the Chinese character, and the inner nodes (positions 1, 3, 4) are structural composition operators (such as vertical stacking) applied to their children nodes. The character "濒 (near)" can thus be serialized as a token sequence x = (x_1, ..., x_n) with each x_i ∈ C ∪ T, where C is the set of components and T is the set of structural composition operators.
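The depth-first serialization above can be sketched as a pre-order traversal. The `Node` class and the particular decomposition of "濒" shown here are illustrative assumptions, not the authors' exact data format.

```python
class Node:
    """One node of a character decomposition tree: a structural
    composition operator (inner node) or a component (leaf)."""
    def __init__(self, token, children=None):
        self.token = token
        self.children = children or []

def serialize(node):
    """Pre-order depth-first traversal: emit the operator first,
    then recursively serialize its children."""
    tokens = [node.token]
    for child in node.children:
        tokens.extend(serialize(child))
    return tokens

# Hypothetical decomposition of 濒: left-right (⿰) composition of the
# water radical 氵 and 频, itself a left-right composition of 步 and 页.
tree = Node("⿰", [Node("氵"), Node("⿰", [Node("步"), Node("页")])])
print(serialize(tree))  # ['⿰', '氵', '⿰', '步', '页']
```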

Embedding Layer
The input embedding of the model is the sum of token embedding, type embedding, position embedding and character segmentation embedding.
For token embeddings, we maintain two lookup tables and use [CHAR] as the character tag representing the entire character information, [S_M] to mask structure type tokens, and [C_M] to mask component tokens. To distinguish them, we use type embeddings to indicate the token types, i.e., CHAR for the character tag, STC for structure type tokens, and CPN for component tokens. For position embeddings, we assign a number starting from 0 to each token belonging to the same character. Finally, our model uses segmentation embeddings to identify different characters. For instance, the input sequence in Figure 2 is marked with a sequence of segment tags, i.e., {A, ..., A, B, ..., B}. All the embeddings have the same dimension d.
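The embedding layer thus reduces to summing four lookups of dimension d. A minimal sketch, where all table sizes, the dimension d, and the random initialization are illustrative assumptions:

```python
import numpy as np

d = 8
rng = np.random.default_rng(0)
token_emb = rng.normal(size=(100, d))  # components, operators, [CHAR], [C_M], [S_M]
type_emb  = rng.normal(size=(3, d))    # CHAR / STC / CPN
pos_emb   = rng.normal(size=(16, d))   # position within a character, from 0
seg_emb   = rng.normal(size=(2, d))    # segment tag A / B

def embed(token_ids, type_ids, pos_ids, seg_ids):
    """Input embedding = token + type + position + segmentation embeddings."""
    return (token_emb[token_ids] + type_emb[type_ids]
            + pos_emb[pos_ids] + seg_emb[seg_ids])

x = embed([5, 6, 7], [0, 1, 2], [0, 1, 2], [0, 0, 0])
print(x.shape)  # (3, 8)
```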

Masked Transformer Encoder
We use a multi-head self-attention network as the basic structure. Given the representation of the sequence tokens X ∈ R^{n×d}, where n is the number of tokens in the sequence and d is the dimension of each token, the process of masked self-attention can be formulated as

Attn(X) = softmax((QK^T)/√d_k + M)V,  Q = XW_Q, K = XW_K, V = XW_V,

where W_Q, W_K, W_V ∈ R^{d×d_k} are learnable parameters, and M ∈ R^{n×n} is the attention mask matrix (Liu et al., 2020). We obtain M by setting M_ij = 0 if token i is allowed to attend to token j, and M_ij = -∞ otherwise. Masking component tokens helps the model learn fine-grained information from the contextual component sequence. Masking structure type tokens helps the model learn the structural information of components. We also predict the character at the tag [CHAR]. This objective forces the model to gather all useful multi-granularity information into the token [CHAR]. The advantage is that we can easily use the hidden output of [CHAR] as the character representation u for downstream tasks, such as the sememe prediction task.
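A single-head sketch of the masked self-attention above, with the additive mask M (0 where attention is allowed, -∞ where blocked, so blocked positions receive zero weight after the softmax). Dimensions and the choice of blocked positions are illustrative.

```python
import numpy as np

def masked_attention(X, W_Q, W_K, W_V, M):
    """softmax(QK^T / sqrt(d_k) + M) V for one attention head."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + M
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V

n, d, d_k = 4, 8, 8
rng = np.random.default_rng(1)
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))
M = np.zeros((n, n))
M[:, 3] = -np.inf  # e.g. block every token from attending to position 3
out = masked_attention(X, W_Q, W_K, W_V, M)
print(out.shape)  # (4, 8)
```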

Sememe Prediction Model
Given a word w ∈ W, the goal of our model is to predict the corresponding P(s|w) for each sememe s ∈ S, where W is the word set and S is the set of sememes existing in HowNet. Then, we recommend the sememes with high scores to w. Our sememe prediction model GCC (Figure 3) has two parts: an encoder, which encodes the word-related information into a vector, and a multi-label classifier, which uses the vector to compute a score for each sememe.
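A minimal sketch of this multi-label scoring head and its one-versus-all sigmoid cross-entropy training loss. The dimensions (|S| = 1,400 sememes, hidden size l = 512 per direction) follow the paper, but the random vector h merely stands in for the encoder output.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sememe_scores(h, W, b):
    """x = W h + b: one score per sememe in S."""
    return W @ h + b

def ova_bce_loss(x, y):
    """Multi-label one-versus-all cross-entropy over sigmoid outputs."""
    p = sigmoid(x)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

num_sememes, l = 1400, 512
rng = np.random.default_rng(2)
h = rng.normal(size=2 * l)                  # both LSTM directions concatenated
W = rng.normal(size=(num_sememes, 2 * l)) * 0.01
b = np.zeros(num_sememes)
y = np.zeros(num_sememes)
y[[3, 17]] = 1                              # a word with two gold sememes
x = sememe_scores(h, W, b)
print(x.shape)  # (1400,)
```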
We use a Bidirectional LSTM (Bi-LSTM) (Schuster and Paliwal, 1997) as the encoder. For each word w, we concatenate the word and the characters c_i in the word as {w, c_1, ..., c_n}, and then convert it to embeddings {w, c_1, ..., c_n} trained on the SogouT corpus using Skip-gram (Mikolov et al., 2013). We incorporate our pre-trained character embedding by an addition operation:

ĉ_i = c_i + W_U u_i,

where W_U is a projection matrix and u_i is the character representation mentioned in Section 2.1.3. Then, we pass the sequence to the Bi-LSTM. The concatenation of the last hidden states in both directions, denoted as h, is fed to the multi-label classifier:

x = W h + b,

where W ∈ R^{|S|×2l}, x, b ∈ R^{|S|}, and l is the dimension of the hidden states in a single direction. Each element of x is a score for the corresponding sememe in S. For training, we use the multi-label one-versus-all cross-entropy loss

L = -Σ_j [ y_j log σ(x_j) + (1 - y_j) log(1 - σ(x_j)) ],

where σ is the sigmoid function and y_j ∈ {0, 1} indicates whether the j-th sememe is in the sememe set of word w.


Experiments

Experimental Setup
Pre-Training Data We adopt the Tencent embedding corpus (Song et al., 2018), which covers over 8 million Chinese words and phrases. We remove non-Chinese tokens such as punctuation and pure digits, and finally obtain 7,291,828 words as our pre-training samples.
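The filtering step above can be sketched with a simple regular expression. The exact Unicode ranges the authors used are not specified, so restricting to the CJK Unified Ideographs block is an assumption:

```python
import re

# Keep only words made entirely of CJK Unified Ideographs,
# dropping punctuation, pure digits, and Latin tokens.
CHINESE = re.compile(r'^[\u4e00-\u9fff]+$')

def keep(word):
    return bool(CHINESE.match(word))

words = ["森林", "123", "hello", "濒海", "，"]
print([w for w in words if keep(w)])  # ['森林', '濒海']
```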

Sememe Prediction Dataset
To make the results comparable, we follow Du et al. (2020), who proposed the previous state-of-the-art model. Their dataset is constructed from the HowNet sememe KB, where they disregard the hierarchical structure of sememes and filter out low-frequency sememes that appear fewer than 5 times in HowNet. The final number of sememes we use is 1,400. The dataset contains 48,383 words in total, divided into non-overlapping training, validation, and test sets in the ratio 8:1:1.
Hyper-parameters Both the pre-training and sememe prediction models are trained with Adam (Kingma and Ba, 2014) with a learning rate of 0.0001. For pre-training, we use the structure of BERT-base and a batch size of 1024. For sememe prediction, the dimension of word embeddings is 200, the dimension of the Bi-LSTM hidden states is 512 × 2, and the batch size is 128. Our code is available at https://github.com/lbe0613/GCC.

Evaluation Metrics
Following Xie et al., we use mean average precision (MAP) as the evaluation metric. We rank all sememes according to the model output. For a word with K sememes whose rankings are r_1 ≤ r_2 ≤ ... ≤ r_K, the average precision is

AP = (1/K) Σ_{k=1}^{K} k / r_k,

and MAP is the mean of the average precision over all test words.
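The metric above can be computed in a few lines. A small worked example: a word whose three gold sememes are ranked 1st, 2nd, and 4th gets AP = (1/3)(1/1 + 2/2 + 3/4) ≈ 0.9167.

```python
def average_precision(ranks):
    """AP for one word: ranks are the positions of its K gold
    sememes in the model's sorted output (1-indexed)."""
    ranks = sorted(ranks)
    return sum(k / r for k, r in enumerate(ranks, start=1)) / len(ranks)

def mean_average_precision(rank_lists):
    """MAP: mean of per-word AP over the test set."""
    return sum(average_precision(r) for r in rank_lists) / len(rank_lists)

print(round(average_precision([1, 2, 4]), 4))  # 0.9167
```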

Results
In Table 1, we report the average results of 5 runs to ensure reliability. We compare our model with two types of baselines: representation-based models and definition-based models. Traditional representation-based models include SPWE and CSP, an ensemble model relying on word and character embeddings. Definition-based models utilize dictionary definitions as external information; such models include LD+Seq2Seq, MC, and SCorP. Our GCC models belong to the representation-based models. We also compare GCC with other models utilizing glyph information; here, we simply replace our GCC embedding in Figure 3 with the character embeddings of JWE and Glyce.

Models                        MAP
SPWE                          55.04
CSP (Jin et al., 2018)        58.93
LD+Seq2Seq†                   30.49
MC† (Du et al., 2020)         60.55
SCorP† (Du et al., 2020)      64.65
GCC w/o pre-train (Ours)      58.18
GCC♣ (Ours)                   60.23
JWE♣ (Yu et al., 2017)        59.03
Glyce♣ (Meng et al., 2019)    59.10

Table 1: Sememe prediction results of all models. The second-part models with † utilize external dictionary definition information, and the third-part models with ♣ consider glyph information.

As shown in Table 1, the models considering glyph information perform better than all traditional representation-based models, which demonstrates that glyphs can enhance Chinese character embeddings for the sememe prediction task. In particular, GCC has an absolute improvement of 2.05% over the GCC baseline without pre-training and significantly outperforms JWE and Glyce. The reason is that, first, Chinese characters are pictographic, and glyphs express the meaning of a word to a certain extent, which is related to the word's sememes; second, pre-training enables GCC to better integrate fine-grained information into the Chinese character representation.

In addition, since experts refer to dictionary definitions when annotating sememes (Dong and Dong, 2003), such definitions are very powerful semantic information for sememe prediction. Even so, our model is still comparable to MC and even better than LD+Seq2Seq while only using the information in the words themselves.

Figure 4 shows the evaluation results at different word frequencies for four strong models. We can see that GCC is superior to the other models in all word frequency ranges. In addition, word frequency has a great impact on sememe prediction. Since low-frequency words are usually unrelated to each other and contain fewer and simpler sememes, model performance drops drastically on low-frequency words. However, our model GCC is particularly helpful in improving performance on them. When the word frequency is less than 50, MAP increases by 3.31% after utilizing the glyph enhanced character embedding. Compared with the other models using glyph information (JWE and Glyce), GCC gains at least 2.3%, which is greater than its gain in all other word frequency ranges.

The examples in Figure 5 show how glyph information assists sememe prediction. We present the sememe labels with their corresponding ranks and the average precision score of each model; average precision refers to the accuracy on a single sample, and the model recommends the top-ranked sememes to words. In Figure 5(a), the component "月(moon)" is related to "肉(flesh)" in Chinese. Thus, the rank of the sememe flesh rises from 27 to 6 when incorporating glyph information, and the average precision score increases from 44.82 to 64.88. In Figure 5(b), the component "艹(grass)" corresponds to the sememe grass, which is related to bury, because objects can be buried under grass; the rank of the sememe bury rises from 62 to 6, while the average precision score increases from 44.44 to 80.95. Moreover, the sememe die also corresponds to a component of the character "葬(burial)", which further demonstrates that glyphs are related to the semantics of the word.

Conclusion
In this work, we pre-train a Glyph enhanced Chinese Character embedding (GCC) for sememe prediction. The model is evaluated on the HowNet sememe KB and outperforms existing non-external-information models. Our experiments show that glyph information can enhance the semantic expression of words and achieves better performance on low-frequency words.