CLiMP: A Benchmark for Chinese Language Model Evaluation

Linguistically informed analyses of language models (LMs) contribute to the understanding and improvement of such models. Here, we introduce the corpus of Chinese linguistic minimal pairs (CLiMP) to investigate what knowledge Chinese LMs acquire. CLiMP consists of sets of 1,000 minimal pairs (MPs) for 16 syntactic contrasts in Chinese, covering 9 major Chinese linguistic phenomena. The MPs are semi-automatically generated, and human agreement with the labels in CLiMP is 95.8%. We evaluate 11 different LMs on CLiMP, covering n-grams, LSTMs, and Chinese BERT. We find that classifier–noun agreement and verb complement selection are the phenomena that models generally perform best at. However, models struggle the most with the ba construction, binding, and filler-gap dependencies. Overall, Chinese BERT achieves an 81.8% average accuracy, while the performances of LSTMs and 5-grams are only moderately above chance level.


Introduction
Language models (LMs) are crucial parts of natural language processing (NLP) systems for a large variety of tasks, including summarization, machine translation, and dialog generation. More recently, they have become popular in the form of pretrained models, which are then fine-tuned on downstream tasks and often obtain state-of-the-art performance (Peters et al., 2018; Devlin et al., 2019; Conneau et al., 2020). However, which linguistic phenomena language models can or cannot learn is still poorly understood for many languages.
Resources for the syntactic evaluation of LMs, such as BLiMP (Warstadt et al., 2020), have focused mainly on English, and non-English resources currently cover only a small set of phenomena (Mueller et al., 2020; Gulordava et al., 2018; Ravfogel et al., 2018). To spur the analysis and subsequent improvement of LMs in Chinese, we introduce the corpus of Chinese linguistic minimal pairs (CLiMP), which can be used to evaluate LMs' knowledge of Chinese grammar.
CLiMP consists of 16 individual datasets that are semi-automatically generated from grammar templates. Each set, or paradigm, contains 1,000 minimal pairs (MPs). Together, they cover 9 core linguistic phenomena in Chinese. Human agreement on this corpus is 95.8%, confirming that CLiMP represents robust contrasts in Chinese grammar. High performance on CLiMP thus implies a high correlation with human acceptability judgments across these phenomena.
We use CLiMP to study Chinese BERT (Devlin et al., 2019), 6 LSTM (Hochreiter and Schmidhuber, 1997) LMs, and 4 5-gram LMs. For each MP, we evaluate whether the LM assigns a higher probability to the grammatical or the ungrammatical sentence. Our results show that Chinese BERT is closest to human performance, achieving an 81.8% accuracy on average over all phenomena, while the performances of LSTMs and 5-grams, regardless of the training data size, are only moderately above chance level. Classifier-noun agreement and verb complement selection are the phenomena that models generally perform best at, suggesting that Chinese LMs are better at acquiring knowledge of local selectional restrictions. The bǎ construction, binding, and filler-gap dependencies are the phenomena models have the most difficulties with. This indicates that they struggle to learn hierarchical syntax and to identify long-distance dependencies.

Language Models
LMs assign probabilities to sequences of words (Jurafsky and Martin, 2009). Recently, they have become commonly used as pretrained models, which can be fine-tuned for downstream NLP tasks (Peters et al., 2018; Devlin et al., 2019; Conneau et al., 2020). Strictly speaking, LMs compute the probabilities of words based only on past context. BERT (Devlin et al., 2019), however, is trained using a masked language modeling objective: it predicts words based on both past and future tokens. Wang and Cho (2019) show that BERT is a Markov random field language model that can generate text and can assign sentences a pseudo-log-likelihood score, computed by summing the conditional log probabilities of all tokens in the sentence. Shin et al. (2019) and Salazar et al. (2020) apply pseudo-log-likelihood scores to sentence ranking and LM evaluation.
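As a minimal sketch, pseudo-log-likelihood scoring sums the conditional log probability of each token with that position masked. Here `masked_logprob` is a toy, frequency-based stand-in for a real masked LM forward pass (names and the toy counts are illustrative, not from the paper):

```python
import math

# Toy stand-in for a masked LM. A real implementation would run one BERT
# forward pass per position, with that position replaced by [MASK];
# this toy version ignores context and scores tokens by corpus frequency.
TOY_COUNTS = {"the": 50, "cat": 5, "sat": 3, "mat": 2, "zxqv": 1}
TOTAL = sum(TOY_COUNTS.values())

def masked_logprob(tokens, i):
    """Log P(tokens[i] | remaining tokens) under the toy model."""
    return math.log(TOY_COUNTS.get(tokens[i], 1) / TOTAL)

def pseudo_log_likelihood(tokens):
    """Sum of conditional log probabilities, masking one position at a time."""
    return sum(masked_logprob(tokens, i) for i in range(len(tokens)))

# A sentence made of frequent tokens scores higher than one with a rare token.
assert pseudo_log_likelihood(["the", "cat", "sat"]) > \
       pseudo_log_likelihood(["the", "zxqv", "sat"])
```

The same sum-of-conditionals structure carries over to the real BERT case; only the scoring function changes.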

Evaluation of Linguistic Knowledge
Numerous methods exist for probing syntactic knowledge of neural network models in English (Hewitt and Manning, 2019;Tenney et al., 2019), and a growing body of work evaluates the syntactic knowledge of neural models by testing whether they can judge the grammatical acceptability of sentences. One common version of this task uses MPs to evaluate LMs' linguistic knowledge (Linzen et al., 2016;Marvin and Linzen, 2018;Warstadt et al., 2020;Wilcox et al., 2018).
An MP is a pair of sentences that differ by a single edit, such that one is acceptable and the other is not, as in (1) and (2). Native speakers can be asked to choose which sentence in each pair sounds more grammatical. Semi-automatically generating MPs can yield a large set of controlled sentences, providing sufficient data for model evaluation (Linzen et al., 2016; Marvin and Linzen, 2018; Ettinger et al., 2018). It is possible to model acceptability in a completely unsupervised way using LMs: the model assigns a probability to each sentence in an MP, the sentence with the higher score is predicted to be the acceptable one, and the model's predictions can be evaluated against human judgments (Marvin and Linzen, 2018; Warstadt et al., 2020). Supervised approaches are also possible (Warstadt et al., 2019), but can be less informative about LMs' linguistic knowledge acquisition due to the bias introduced by training on acceptability judgment labels.
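The unsupervised forced-choice setup can be sketched in a few lines, assuming any sentence-scoring function; the toy unigram scorer and the example pair below are illustrative stand-ins, not the paper's actual models or data:

```python
import math

# Toy unigram scorer standing in for a real LM's log-probability.
FREQ = {"他": 30, "喝": 10, "水": 10, "书": 8, "了": 20}

def score(sentence):
    """Sum of unigram log counts; unseen tokens get count 1."""
    return sum(math.log(FREQ.get(tok, 1)) for tok in sentence)

def forced_choice_accuracy(pairs, score):
    """Fraction of minimal pairs where the scorer prefers the acceptable sentence."""
    correct = sum(score(good) > score(bad) for good, bad in pairs)
    return correct / len(pairs)

# One MP: "he drank water" (acceptable) vs. "he drank a book" (odd).
pairs = [(["他", "喝", "水", "了"], ["他", "喝", "书", "了"])]
print(forced_choice_accuracy(pairs, score))  # 1.0 on this toy pair
```

Swapping in an LSTM, n-gram, or pseudo-log-likelihood scorer requires changing only `score`, which is why the same evaluation harness works for all 11 models.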
Some prior work evaluates the linguistic knowledge of different non-English models (Ravfogel et al., 2018;Gulordava et al., 2018;Mueller et al., 2020). However, these efforts focus mainly on subject-verb agreement, which is absent in Chinese, and the knowledge of Chinese LMs has not yet been explicitly studied.
Finally, the linguistic abilities of English BERT have been investigated in a large body of prior work, e.g., Clark et al. (2019), Vig (2019), and Hewitt and Manning (2019). We refer the reader to Rogers et al. (2021) for an overview.

CLiMP
Our main contribution is CLiMP, a corpus of Chinese MPs designed to evaluate Chinese LMs. CLiMP consists of 1,000 MPs for each of 16 grammatical contrasts, covering 9 major Chinese linguistic phenomena. Example MPs for each phenomenon are shown in Table 1.

Data Generation
We generate data from grammar templates for every paradigm we incorporate. Our templates set lexical, syntactic, and semantic constraints for each paradigm, aiming to build robust contrasts and to keep the sentence length the same within each MP. We then build an annotated vocabulary and generate sentences by sampling words from it. (1) and (2) show an MP together with the template used to create it.
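A minimal sketch of this generation scheme, using classifier-noun agreement as the paradigm: the tiny lexicon, feature annotations, and function names below are hypothetical illustrations, not the actual CLiMP templates or vocabulary:

```python
import random

random.seed(0)

# Hypothetical annotated vocabulary fragment: each noun is annotated with the
# classifier it selects (the real CLiMP vocabulary has 3,456 words, 84 features).
NOUNS = {"书": "本", "猫": "只", "纸": "张"}   # noun -> its classifier
CLASSIFIERS = sorted(set(NOUNS.values()))
NUMERALS = ["一", "两", "三"]

def generate_pair():
    """One classifier-noun agreement MP: numeral + classifier + noun,
    with the ungrammatical member using a mismatching classifier.
    Both members have the same length, as the templates require."""
    noun, clf = random.choice(list(NOUNS.items()))
    num = random.choice(NUMERALS)
    bad_clf = random.choice([c for c in CLASSIFIERS if c != clf])
    good = num + clf + noun       # e.g. 一本书 "one book"
    bad = num + bad_clf + noun    # same template, wrong classifier
    return good, bad

pairs = [generate_pair() for _ in range(5)]
for good, bad in pairs:
    assert len(good) == len(bad)  # MP members match in length
```

Sampling from an annotated vocabulary under template constraints is what makes it feasible to produce 1,000 controlled MPs per paradigm.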

Vocabulary
We translate Warstadt et al.'s (2020) English vocabulary, which contains 3,000 English words with morphological, syntactic, and semantic annotations. We add words and features specific to Chinese linguistic phenomena, including classifiers, verb complements, action verbs, and coverbs. Our final vocabulary contains 3,456 words and 84 features.
Figure 1 shows the frequency of the words in CLiMP's vocabulary in the Chinese Internet Corpus (http://corpus.leeds.ac.uk/frqc/internet-zh.num). 1,055 of the words in CLiMP are within the 5,000 most frequent words in that corpus.

Linguistic Phenomena
CLiMP covers 9 major linguistic phenomena in Mandarin Chinese, cf. Table 1. They are selected from a comprehensive Chinese grammar book by Po-Ching and Rimmington (2015). Following Po-Ching and Rimmington's discussion, we now explain the phenomena not present in English. The bǎ construction is an SOV construction involving the particle bǎ, which precedes the object and moves it to a position before the main verb; it is only grammatical with a subset of transitive verbs. Coverbs are verb-like items that precede the main verb in a serial verb construction. They almost invariably have to be used in conjunction with other verbs in a sentence; they share some properties with prepositions, but are not syntactically interchangeable with them. Classifiers obligatorily appear with nouns when those are modified by numerals or adjectives; Mandarin has dozens of classifiers, and each noun selects the classifier it combines with. Verb complements follow a verb, often expressing the result or manner of an event; not all verbs can be used with all complements, making certain combinations ungrammatical. Finally, Mandarin noun phrases are head-final: the relative clause precedes the noun it modifies.

Data Validation
To verify whether the MPs in our dataset show clear contrasts, we conduct two rounds of human validation with 22 annotators. They are all native speakers of Chinese, 14 female and 8 male, with ages ranging from 20 to 48. All of them have at least a high school degree.
In our first human validation, each human annotator is assigned a subset (100 MPs) of a paradigm. We let them perform the same forced-choice task as our models: decide for each MP which sentence seems more acceptable. We discard one paradigm, the coverb-direction paradigm, after this validation, because its human validation accuracy is below 85%. The average human agreement for the remaining paradigms is 95.8%.
In the second human validation, we sample 15 MPs from each of the remaining paradigms, resulting in a dataset of 240 MPs. 16 annotators complete the same forced-choice task on this dataset. We count an MP as valid if more than half of the annotators agree with its label. The human agreement on this dataset is 97.1%, showing that our data creation process results in valid examples.
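The majority-vote criterion above is straightforward to state as code; this is a sketch of the decision rule, with illustrative vote lists rather than the actual annotation data:

```python
def is_valid(votes):
    """An MP counts as valid if more than half of the annotators agree
    with its label. `votes` is one boolean per annotator: True where the
    annotator chose the sentence that CLiMP marks as acceptable."""
    return sum(votes) > len(votes) / 2

def agreement(dataset_votes):
    """Share of MPs in a dataset whose majority vote matches the label."""
    return sum(map(is_valid, dataset_votes)) / len(dataset_votes)

# With 16 annotators, 9 agreeing suffices; an exact 8-8 tie does not.
assert is_valid([True] * 9 + [False] * 7)
assert not is_valid([True] * 8 + [False] * 8)
```

Note that "more than half" makes ties count against validity, which is the conservative reading of the criterion.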

Comparison with BLiMP
BLiMP consists of 67 datasets, each containing 1,000 MPs and organized by phenomenon into 12 categories. CLiMP contains only 16 datasets because Mandarin Chinese is far less inflectional. Three phenomena are covered by both corpora: anaphor agreement, binding, and filler-gap dependencies. The human agreement for these three phenomena in BLiMP is 97.5%, 87.3%, and 86.9%, respectively; the corresponding accuracies in CLiMP are 94.5%, 99%, and 100%. The overall human agreement for BLiMP is 88.6%, which is 7.2% lower than for CLiMP.

Models and Methods
We use accuracy for evaluation: an MP in CLiMP is classified correctly if an LM assigns a higher probability to the grammatical sentence than to the ungrammatical one. We evaluate statistical and neural LMs, including masked LMs. Corpora containing 0.4M, 2M, and 21.5M sentences are used for further exploration, and we also investigate the effect of different tokenizations: character tokenization and word tokenization (https://github.com/fxsjy/jieba).

Chinese BERT BERT (Devlin et al., 2019) is a transformer-based neural model (Vaswani et al., 2017). Here, we evaluate Chinese BERT (https://github.com/google-research/bert/blob/master/multilingual.md). This model has 12 layers, 768 hidden units, 12 attention heads, and 110M parameters; its training dataset contains 25M sentences. We assign probabilities to sentences with this model by masking the words in a sentence one by one, computing the probability of each masked word, and, finally, multiplying the probabilities of all words (Wang and Cho, 2019; Salazar et al., 2020), using https://github.com/xu-song/bert-as-language-model.

LSTM LMs We further evaluate 6 LSTM (Hochreiter and Schmidhuber, 1997) LMs. These models have 2 layers, 200 hidden units, and 2 attention heads. We train them using PyTorch's word-level language model example code (https://github.com/pytorch/examples/tree/master/word_language_model) on 3 differently sized Chinese Wikipedia corpora: 0.4M, 2M, and 21.5M sentences. We further compare word-level and character-level models (cf. Table 2). For evaluation, we employ code adapted by Warstadt et al. (2020) from Gulordava et al. (2018) (https://github.com/sheng-fu/colorlessgreenRNNs).

n-gram LMs Finally, we experiment with 4 different 5-gram LMs, trained on 0.4M and 2M sentences from Chinese Wikipedia. For each corpus size, we train one word-based and one character-based LM. These models are implemented using KenLM.

Table 1: Nine Chinese linguistic phenomena covered by CLiMP, with acceptable and unacceptable sentence examples. Minimal differences are underlined. The second line of each example shows a gloss, the third line an English translation. N represents how many paradigms (each with 1,000 examples) are within each phenomenon.
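The two tokenization schemes compared above differ as sketched below: character-level tokenization splits every Chinese character, while word-level tokenization groups characters into lexical words. The greedy longest-match segmenter and toy lexicon here are illustrative stand-ins for jieba, which the experiments actually use:

```python
def char_tokenize(sentence):
    """Character-level tokenization: every character is one token."""
    return list(sentence)

# Hypothetical word segmentation; the real experiments use jieba
# (https://github.com/fxsjy/jieba), not this toy lexicon.
TOY_LEXICON = ["喜欢", "猫", "我"]

def word_tokenize(sentence):
    """Greedy longest-match segmentation over the toy lexicon;
    characters not covered by the lexicon become single-char tokens."""
    tokens, i = [], 0
    while i < len(sentence):
        for w in sorted(TOY_LEXICON, key=len, reverse=True):
            if sentence.startswith(w, i):
                tokens.append(w)
                i += len(w)
                break
        else:
            tokens.append(sentence[i])
            i += 1
    return tokens

assert char_tokenize("我喜欢猫") == ["我", "喜", "欢", "猫"]   # 4 tokens
assert word_tokenize("我喜欢猫") == ["我", "喜欢", "猫"]       # 3 tokens
```

The choice matters because it changes both the vocabulary size and the effective context window of the n-gram and LSTM models.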

Results
All results are shown in Table 2.
Phenomenon-specific Results Our LMs perform best on classifier-noun agreement and verb complement selection: Chinese BERT's accuracy is only 6.8% and 3% lower, respectively, than that of humans on these two phenomena. LSTMs and 5-grams remain around 30% behind humans, but still perform better on these phenomena than on the others in CLiMP. This indicates that Chinese LMs acquire knowledge of local selectional restrictions better than the linguistic knowledge needed to master other phenomena.
Our LMs struggle most with the bǎ construction, binding, and filler-gap dependencies. All models perform close to chance level on binding, suggesting that they lack the hierarchical knowledge necessary to correctly resolve the structural relationship between a reflexive and its binder. Similarly, most models perform near chance on filler-gap dependencies, suggesting that they do not robustly represent long-distance dependencies.

On the head-final construction, Chinese BERT performs surprisingly poorly compared to the other models: only 53.1% accuracy, as compared to an average accuracy of 81% for the LSTMs. The coverb construction, in contrast, is easy for Chinese BERT: it achieves 87.9% accuracy, while the highest accuracy among all other models is 47%.
Model-specific Results Comparing across models, Chinese BERT achieves by far the highest overall accuracy, at 81.8%. Our LSTMs all perform worse, but obtain surprisingly similar scores: from 60.4% to 66.0%. The performances of our 5-grams range from 55.9% to 65.7%. Keeping tokenization and corpus size constant, three out of four 5-grams are outperformed by LSTMs. Overall, we thus find that neural models have an advantage over statistical models.
Comparing among the LSTMs, we find, similarly to Hu et al. (2020), that corpus size does not have much influence on overall performance, with the caveat that these models perform close to chance. In contrast, a larger corpus size does result in better performance for the 5-grams. We also compare the effect of different tokenizations: character-based 5-grams perform better than word-based ones. For LSTMs, however, character tokenization results in better performance only for our smallest corpus size (0.4M).
Compared to English LMs (Warstadt et al., 2020), the human-model gap is much bigger for Chinese models. While neither the models nor the datasets are directly comparable between our work and prior work, this still suggests that more analysis and development are needed for non-English models.

Conclusion
We introduced CLiMP, a suite of diagnostic test sets aimed at evaluating which syntactic phenomena Chinese LMs learn, and used it to evaluate 11 different models. All LMs appeared to have learned local selectional restrictions, but struggled with argument structure alternations, hierarchical structure, and long-distance dependencies. Chinese BERT performed best on CLiMP overall. However, it obtained a 14% lower accuracy than humans, suggesting there is still much room for improvement. We hope that CLiMP will serve as a linguistically informed resource for benchmarking and analyzing future progress on Chinese LMs. CLiMP is available at https://nalacub.github.io/resources.