Improving Model Generalization: A Chinese Named Entity Recognition Case Study

Generalization is an important ability that helps to ensure that a machine learning model can perform well on unseen data. In this paper, we study the effect of data bias on model generalization, using Chinese Named Entity Recognition (NER) as a case study. Specifically, we analyze five benchmarking datasets for Chinese NER and observe the following two types of data bias that can compromise model generalization ability. Firstly, the test sets of all five datasets contain a significant proportion of entities that have been seen in the training sets. Such test data therefore cannot reflect the true generalization ability of a model. Secondly, all datasets are dominated by a few fat-head entities, i.e., entities appearing with particularly high frequency. As a result, a model might be able to produce high prediction accuracy simply by keyword memorization without leveraging context knowledge. To address these data biases, we first refine each test set by excluding seen entities from it, so as to better evaluate a model’s generalization ability. Then, we propose a simple yet effective entity rebalancing method to make entities within the same category equally distributed, encouraging a model to leverage both name and context knowledge in the training process. Experimental results demonstrate that the proposed entity rebalancing method significantly improves a model’s ability to detect unseen entities, especially for the company, organization and position categories.


Introduction
Named Entity Recognition (NER) is a fundamental building block for various downstream natural language processing tasks such as relation extraction (Bunescu and Mooney, 2005), event extraction (Ji and Grishman, 2008), information retrieval (Chen et al., 2015), question answering (Diefenbach et al., 2018), etc. Due to ambiguous word boundaries and complex composition (Gui et al., 2019), the Chinese NER task is more challenging than English NER.
Recently, by leveraging pretrained language models (e.g., BERT (Devlin et al., 2018)), we have witnessed superior performance on Chinese NER datasets, including MSRA, Weibo, OntoNotes 4.0 and Resume (Xuan et al., 2020). Despite the superior performance of the fine-tuned models, we argue that there are two types of data bias that can compromise model generalization ability.
First, we observe that in widely used Chinese NER datasets, 50% to 70% of the entities in the test data are seen in the training data. Such test data therefore cannot evaluate the true generalization ability of a model. Second, the datasets are dominated by a few fat-head entities, i.e., entities appearing with particularly high frequency. For example, within the organization category of Cluener (Xu et al., 2020), the fat-head entity 曼联 (Manchester United) appears 59 times, while 法兰克福队 (Eintracht Frankfurt) occurs only once. As a result, a model might be encouraged to memorize those fat-head entities rather than leveraging context knowledge during the training process. The rationale is that given the same entity and diverse contexts, the easiest way for the model to converge is to memorize the entity rather than extracting patterns from the diverse contexts.
To address these data biases, we first refine each test set by excluding seen entities from it, so as to better evaluate a model's generalization ability. Then, we propose a simple yet effective entity rebalancing method to make entities within the same category distributed equally, encouraging a model to leverage both name and context knowledge in the training process.
The contributions of this paper are as follows.
• We analyze five benchmarking Chinese NER datasets and identify two types of data bias that can compromise model generalization ability.
• We refine each test set by excluding seen entities from it, which better measures real model generalization. Specifically, the competitive BERT+CRF model only achieves 33.33% and 65.10% F1 score on detecting unseen organization entities of the Cluener and MSRA datasets respectively, which is far from satisfactory.
• We design a simple yet effective algorithm to rebalance the entity distribution. The experiments show that the proposed method significantly improves model generalization. In particular, the F1 score improves by 12.61% and 37.14% on the organization category of the Cluener and MSRA datasets respectively.

Seen vs Unseen Entity
If an entity in the dev/test data has been covered by the training data, we refer to it as a seen entity. Otherwise, it is an unseen entity. To quantify the degree to which entities in the dev/test data have been seen in the training data, we define a measurement called the entity coverage ratio. The entity coverage ratio of data D_te is denoted by r(D_te), and is calculated as

r(D_te) = |{e ∈ Ent(D_te) : e ∈ Ent(D_train)}| / |Ent(D_te)|

where Ent(·) denotes a function to obtain the list of annotated entities and D_train represents the training data. As Table 2 shows, the entity coverage ratios of the dev and test data in the different benchmarking datasets are very high, ranging from 0.429 to 0.709.

Observation 1 The test sets of Chinese NER datasets contain a significant proportion of seen entities.
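The entity coverage ratio can be sketched as follows; this is a minimal mention-level reading of the definition (each annotated mention in the test data counts as seen if its surface form occurs anywhere in the training annotations), and the toy entity lists are illustrative only.

```python
from collections import Counter

def entity_coverage_ratio(test_entities, train_entities):
    """Fraction of annotated entity mentions in the test data whose
    surface form also appears in the training data (mention-level)."""
    train_set = set(train_entities)
    mentions = Counter(test_entities)
    total = sum(mentions.values())
    if total == 0:
        return 0.0
    seen = sum(count for ent, count in mentions.items() if ent in train_set)
    return seen / total

# Toy example: two of the three test mentions are seen in training.
train = ["曼联", "曼联", "上海", "北京"]
test = ["曼联", "法兰克福队", "北京"]
print(entity_coverage_ratio(test, train))  # → 0.666...
```

A type-level variant (distinct entities rather than mentions) would replace the `Counter` with a `set`; the paper's ratios could follow either reading.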

Fat-head vs Long-tail Entity
A fat-head entity is defined as an entity appearing with particularly high frequency, while a long-tail entity is one with very few mentions. To identify the existence of fat-head entities, we use kurtosis (Balanda and MacGillivray, 1988), a statistical measure of how heavily the tails of a distribution differ from those of a normal distribution. A high kurtosis (greater than 3) usually indicates the existence of outliers, i.e., fat-head entities. Table 3 shows the kurtosis score of each category in the different datasets. For example, the kurtosis score of the PER category of the training data in the MSRA dataset is 984.1, which is very high. We also find that the 1% of distinct entities with the highest frequency contribute 21% of the overall annotations.
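The measure above can be sketched with the standard (non-excess) Pearson kurtosis, m4/m2², where a normal distribution scores about 3; the frequency list below is a hypothetical single-category example, not taken from any of the datasets.

```python
def kurtosis(xs):
    """Pearson kurtosis m4 / m2^2 of a sample; values well above 3
    suggest heavy tails, i.e., outliers such as fat-head entities.
    Assumes the sample is not constant (m2 > 0)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n  # 2nd central moment
    m4 = sum((x - mean) ** 4 for x in xs) / n  # 4th central moment
    return m4 / (m2 ** 2)

# Per-entity annotation counts in one category (toy numbers):
# one fat-head entity with 59 mentions among 99 singletons.
freqs = [59] + [1] * 99
print(kurtosis(freqs))  # far above 3, flagging the outlier
```

`scipy.stats.kurtosis` computes the same quantity (note it defaults to *excess* kurtosis, i.e., subtracting 3, so the threshold there would be 0).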
Observation 2 Fat-head entities prevail in different categories of Chinese NER datasets.
We believe this finding also holds for other NER datasets, since an annotated corpus is usually collected within a certain time frame during which some entities (e.g., celebrities, organizations) get much more exposure than others.
We hypothesize that the dominance of fat-head entities will cause the model to simply memorize those high-frequency entities without fully leveraging context knowledge. The rationale is that given the same entity and diverse contexts, the easiest way for model convergence is to memorize the entity rather than extracting patterns from the diverse contexts.

Method
To improve a model's generalization ability in detecting unseen entities, we argue that the model should be trained to leverage both name and context knowledge (Nie et al., 2020; Lin et al., 2020). Thus, we propose a simple yet effective entity rebalancing algorithm. The main idea is to make annotated entities equally distributed within the same category. There are two major reasons why the proposed entity rebalancing algorithm works. First, the equal distribution encourages the model to leverage both name knowledge and context knowledge, since there are no simple statistical cues (Niven and Kao, 2019) arising from an uneven distribution to exploit. Second, different entities within the same category should be semantically interchangeable in most cases, which avoids a train-test discrepancy.
The proposed algorithm works as follows. First, rebalance the annotated entity frequencies in the training data. Let C_l denote the original entity frequency counter of category l. For example, C_l = {e1: 11, e2: 1, e3: 1} means entity e1 is annotated 11 times while e2 and e3 are each annotated once in category l, which is very imbalanced. We then turn C_l into the balanced entity frequency counter C_l^b = {e1: 5, e2: 4, e3: 4}. In C_l^b, the difference between the maximum and minimum entity frequency is at most 1. Second, replace a fat-head entity with a randomly sampled entity of the same category once its accumulated occurrence count surpasses its rebalanced frequency in C_l^b. Details are shown in Algorithm 1.
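The two steps above can be sketched as follows. This is not the authors' Algorithm 1 but a plausible reading of it: it only shows the counting logic (in the real data, replacing an entity also means substituting the surface span in the sentence and adjusting its labels), and the under-budget sampling rule for picking replacements is an assumption.

```python
import random
from collections import Counter

def balance_counter(counter):
    """Step 1: redistribute the total annotation count evenly across
    the distinct entities; max and min frequency differ by at most 1."""
    entities = list(counter)
    base, rem = divmod(sum(counter.values()), len(entities))
    return {e: base + (1 if i < rem else 0) for i, e in enumerate(entities)}

def rebalance_mentions(mentions, budget, seed=0):
    """Step 2: once an entity's accumulated occurrences exceed its
    rebalanced budget, swap it for a randomly sampled same-category
    entity that still has budget left (assumed sampling rule)."""
    rng = random.Random(seed)
    used = Counter()
    out = []
    for e in mentions:
        if used[e] >= budget[e]:
            # fat-head entity over budget: pick an under-budget entity
            candidates = [x for x in budget if used[x] < budget[x]]
            e = rng.choice(candidates)
        used[e] += 1
        out.append(e)
    return out

c = Counter({"e1": 11, "e2": 1, "e3": 1})
print(balance_counter(c))  # → {'e1': 5, 'e2': 4, 'e3': 4}
```

Because the total number of mentions equals the total budget, the rebalanced mention list matches the balanced counter exactly.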

Experiment Settings
According to Observation 1, the test sets of Chinese NER datasets contain a significant proportion of seen entities, which fail to evaluate the true generalization ability of a model. In our study, a test sample is excluded if it contains entities that are covered in the training data. For Cluener (Xu et al., 2020), we split the original training set into 90% train and 10% dev, and use the original development set for testing, as the test set is not publicly available. For the Resume (Zhang and Yang, 2018) and Weibo (Peng and Dredze, 2015) datasets, we report evaluation results only on selected categories, since the other categories contain zero or very few unseen entities. We use the BIOES tagging scheme to label named entities, since previous studies have shown consistent improvements with this scheme (Ratinov and Roth, 2009). We report the span-level micro-averaged F1 score obtained from the seqeval (Nakayama, 2018) toolkit using the IOBES scheme.
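The test-set refinement described above can be sketched as follows; the `(text, entities)` sample structure and the example sentences are hypothetical, and a sample is dropped as soon as any of its annotated entities is seen in training.

```python
def filter_unseen(test_samples, train_entities):
    """Keep only test samples none of whose annotated entities appear
    in the training data. Each sample is a (text, entities) pair."""
    seen = set(train_entities)
    return [(text, ents) for text, ents in test_samples
            if not any(e in seen for e in ents)]

# Toy example: the first sample contains a seen entity and is excluded.
train_ents = ["曼联", "北京"]
samples = [("曼联获胜", ["曼联"]),
           ("法兰克福队获胜", ["法兰克福队"])]
print(filter_unseen(samples, train_ents))  # → [('法兰克福队获胜', ['法兰克福队'])]
```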
We use BERT+CRF as the model architecture. In particular, we use the bert-base-chinese pre-trained model (12-layer, 768-hidden, 12-heads) released by Google (Devlin et al., 2018). The hyperparameters of the model are tuned on the development set using grid search (details are reported in Appendix A). As shown in Table 4, the adopted BERT+CRF model is competitive with more complicated state-of-the-art models.

Table 4: Comparison between BERT+CRF and the state-of-the-art models using the same train/dev/test splits as (Xuan et al., 2020).

Results
Table 5 presents the comparisons between the proposed method and the baseline on five Chinese NER datasets. The baseline uses the original training data, while the proposed method applies the entity rebalancing algorithm to the original training data. For the Cluener, MSRA and OntoNotes datasets, each with over 10K training samples, our proposed method outperforms the baseline across categories. One exception is the address category of the Cluener dataset, where the proposed method performs worse than the baseline by 2.58%. We believe this is because the address category contains both geopolitical entities and location entities, which are not semantically interchangeable.
For the Weibo dataset, the proposed method outperforms the baseline by 8.89% on the PER.NAM category, but performs worse on the PER.NOM category. Note that the PER.NOM category contains entities such as man, woman and friend, which are hard to generalize to based on context knowledge. For the Resume dataset, the proposed method does not work well. We attribute this to the structure of the resume corpus, which is a mere concatenation of fields such as name, education and organization. Thus, there is very little context knowledge to leverage.
Overall, the proposed entity rebalancing method is able to improve a model's generalization ability in detecting unseen entities. However, the proposed method only works for categories which meet certain conditions. First, the entities of the same category need to be semantically interchangeable. Second, the entities should be dependent on context knowledge.

Table 5: Evaluation results (F1 score) of the proposed entity rebalancing method and the baseline on unseen test data.

Conclusion and Future Work
In this paper, we take Chinese NER as a case study, aiming to improve model generalization by mitigating data bias. We first refine each test set by excluding seen entities from it, so as to better evaluate a model's generalization ability. Then, we propose an entity rebalancing method to make entities within the same category equally distributed. Experimental results show that the proposed entity rebalancing method significantly improves a model's ability to detect unseen entities. As future work, we will first investigate the generalizability of this study to non-Chinese NER. Second, we will improve the entity replacement algorithm by leveraging a language model so that the replaced entity is more semantically plausible.