SLING: Sino Linguistic Evaluation of Large Language Models

To understand what kinds of linguistic knowledge are encoded by pretrained Chinese language models (LMs), we introduce the benchmark of Sino LINGuistics (SLING), which consists of 38K minimal sentence pairs in Mandarin Chinese grouped into 9 high-level linguistic phenomena. Each pair demonstrates the acceptability contrast of a specific syntactic or semantic phenomenon (e.g., The keys are lost vs. The keys is lost), and an LM should assign lower perplexity to the acceptable sentence. In contrast to the CLiMP dataset (Xiang et al., 2021), which also contains Chinese minimal pairs but was created by translating the vocabulary of the English BLiMP dataset, the minimal pairs in SLING are derived primarily by applying syntactic and lexical transformations to naturally-occurring, linguist-annotated sentences from the Chinese Treebank 9.0, thus addressing severe issues in CLiMP's data generation process. We test 18 publicly available pretrained monolingual (e.g., BERT-base-zh, CPM) and multilingual (e.g., mT5, XLM) language models on SLING. Our experiments show that the average accuracy of the LMs is far below human performance (69.7% vs. 97.1%); BERT-base-zh achieves the highest accuracy (84.8%) of all tested LMs, outperforming even much larger models. Additionally, we find that most LMs have a strong gender and number (singular/plural) bias, and that they perform better on local phenomena than on hierarchical ones.


Introduction
While large-scale pretrained language models (LMs) have achieved considerable downstream success (Devlin et al., 2019; Xue et al., 2021; Brown et al., 2020, a.o.), it remains challenging to evaluate how much linguistic knowledge they have acquired. One approach is to design minimal pairs consisting of two sentences that differ only in a critical word or phrase, which renders only one of the sentences acceptable (e.g., The keys are lost vs. The keys is lost). If an LM is sensitive to the phenomenon exemplified by the minimal pair (in this case, plurality), it should assign a lower perplexity to the acceptable sentence. This methodology can be used to test an LM's understanding of a wide range of linguistic phenomena; for example, the BLiMP dataset (Warstadt et al., 2020) contains 67K minimal pairs automatically generated via manually-constructed grammars that span 12 high-level English phenomena.

Figure 1: An illustration of the SLING dataset. The A sentence is acceptable but B, a minimal-edit counterpart of A, is not. LMs see one sentence at a time and are expected to assign a lower (pseudo-)perplexity to the acceptable sentence. Overall, LMs underperform Chinese native speakers on SLING (97% vs. 70%), making it an exciting benchmark for future Chinese LM research.
Can we create similar datasets to study linguistic phenomena in a different language, such as Chinese? As a first step in this direction, Xiang et al. (2021) introduce CLiMP, a Chinese dataset of minimal pairs. However, we identify two major issues with CLiMP's construction process: (1) its vocabulary is translated from BLiMP's vocabulary, which, due to morphological differences between English and Chinese (e.g., the latter lacks nominal or verbal inflections), results in a large number of unintelligible sentences; and (2) the grammatical templates for several phenomena (anaphor agreement, classifier-noun agreement, and filler-gap dependencies) are inadequately designed, which along with the vocabulary issue results in minimal pairs that do not have any clear contrast. To address these issues, we introduce SLING (Sino LINGuistics benchmark), a dataset of 38K minimal pairs to study nine Chinese linguistic phenomena, many of which are unique to the Chinese language. Instead of translating BLiMP, we construct SLING primarily using the Chinese Treebank 9.0 (Xue et al., 2016), which was annotated by trained linguists (see Table 1 for a comparison). We extract subtrees from human-validated constituency parses in this treebank and then carefully edit them using manually-designed linguistic templates to create minimal pairs. SLING does not suffer from the issues we found in CLiMP, and it additionally includes semantic as well as syntactic phenomena, seven of which are not found in CLiMP. A human validation of SLING with 16 native speakers confirms that its minimal pairs unambiguously show the acceptability contrast across all phenomena, yielding an almost perfect inter-annotator agreement (Fleiss' κ = 0.88).
We evaluate a total of 18 publicly-available pretrained LMs on SLING, including monolingual Chinese (e.g., bert-base-chinese, PanGu-α) and multilingual models (e.g., mT5, XLM-R). Our results reveal that: (1) no LM consistently outperforms others on SLING; (2) larger LMs do not necessarily outperform smaller ones; (3) monolingual Chinese LMs generally perform better than multilingual ones; and (4) humans significantly outperform all LMs (97.1% vs. 69.7% average across LMs). We observe that the ranking of models on CLiMP differs from that on SLING: for example, bert-base-chinese is the best-performing model on SLING (average accuracy 84.8%), while chinese-pert-base performs best on CLiMP (81.2%). This result is due in part to the issues in CLiMP's construction process, as well as the different phenomena that we test in SLING. Additionally, SLING is more discriminative than CLiMP (i.e., LM accuracy varies more across the phenomena), which makes it more useful as a diagnostic benchmark, especially given the large gap between human and model performance.

Table 1: Comparison between CLiMP (Xiang et al., 2021) and SLING. SLING is created with a natural and diverse vocabulary, covers new semantic and syntactic Chinese linguistic phenomena, and is evaluated on large pretrained LMs, including multilingual models like mT5.
2 Evaluating Chinese LMs with Minimal Pairs: CLiMP and Its Shortcomings

Using minimal pairs to detect the function of a single element (e.g., a phoneme, affix, or word) is a common practice in linguistics. In Figure 1, by changing the position of 了, sentence A is transformed into the ungrammatical sentence B, which reveals how the two aspect markers 在 and 了 interact. In this paper, following BLiMP and CLiMP, we call each major grammatical category a phenomenon, and the minimal pair types within each phenomenon paradigms. The A and B sentences in Figure 1 form a minimal pair of a paradigm in the aspect phenomenon of SLING.

Xiang et al. (2021) created CLiMP to evaluate 9 Chinese syntactic phenomena with 16 paradigms. However, the dataset suffers from two major issues: (1) faulty minimal pair generation templates and (2) its translated vocabulary. In this section, we discuss the issues in detail and show why they hamper CLiMP's utility as a diagnostic dataset for LMs.
CLiMP's minimal pairs often do not show the desired acceptability contrast. This problem is especially prominent in the ba construction, binding/anaphor, and filler-gap dependency phenomena, on which Xiang et al. (2021) conclude that LMs perform poorly. The templates used to generate data for these phenomena are the primary cause of these errors, as we show below.

ba construction: Many minimal pairs associated with this construction do not exhibit the acceptability contrast. We examine the first 50 minimal pairs of this phenomenon in CLiMP and discover that 6 pairs actually have the wrong acceptability label. The primary reason for the low quality of these pairs is that CLiMP does not carefully control the source of unacceptability (Abrusán, 2019), which we discuss further in the Limitations section. Specific to the ba construction, CLiMP does not include essential information about thematic relations in the vocabulary. Another contributing factor is the small size of the CLiMP vocabulary, which is translated from that of BLiMP despite many annotated features of BLiMP not applying to Chinese (e.g., number features, verb forms, or cases). For example, the English verb buy has six forms in BLiMP, listed in Table 2, which differ from each other in seven verb-related features. These inflections are useful in English for distinguishing sentence acceptability in several BLiMP phenomena (e.g., Passive, Irregular Forms, and Subject-Verb Agreement); however, they do not apply to Chinese because the language lacks inflection, and thus they cannot help construct Chinese paradigms. In Chinese, the same forms can be represented and built based on the three words shown in bold: mai (buy), (zheng)zai (progressive marker), and le (perfective marker). They do not need to be redundantly listed in the vocabulary. After removing the redundant word types, CLiMP's vocabulary size is 1,272 (including 230 proper names), not 3,456 as Xiang et al. (2021) report. This lack of diversity in the vocabulary contributes to the generation of nonsensical sentences using their minimal pair templates.

Binding and anaphor paradigms: These two paradigms test whether the gender feature of the object anaphor agrees with that of the subject. Issues in these paradigms stem from the fact that CLiMP uses proper names, which were added to CLiMP's vocabulary in addition to the one translated from BLiMP. However, Chinese proper names do not always unambiguously show gender. If the gender of the subject is ambiguous, as in (1), where Ye Zi can be either gender (similarly for Alex in English), LM performance is not representative of whether the models know the function of the reflexive anaphor, which is exactly what the binding and anaphor paradigms are meant to test.
Other issues with these two paradigms are discussed in detail in Appendix D.2.

Filler-gap paradigm:
To create minimal pairs for the filler-gap paradigm in CLiMP, Xiang et al. (2021) use what they call the topicalization construction. However, (2a), taken from CLiMP, does not contain a filler-gap topicalization dependency.
A real topicalization filler-gap structure is the one in (2b), in which the direct object of the verb buy is topicalized and moved to the beginning of the sentence, leaving a gap at its base-generated position (Huang et al., 2009, Section 6.1). Unfortunately, the minimal pairs associated with this paradigm are generated from an erroneous template, which means no conclusions can be drawn from model performance on it.

Creating the SLING Benchmark
This section describes our process of generating minimal pairs for SLING. We make use of the Chinese Treebank 9.0 (Xue et al., 2016), a Chinese corpus with linguist-annotated constituency parses that contains 2,084,387 words. This treebank allows us to use naturally-occurring sentences to construct our minimal pairs, unlike the synthetic and sometimes nonsensical sentences of CLiMP. Also, unlike CLiMP, whose linguistic templates rely solely on one grammar book (Po-Ching and Rimmington, 2015), our linguistic templates are constructed by a native Chinese linguist (the first author of this paper) based on multiple works in linguistics. Details of the construction of each phenomenon and the cited works can be found in Appendix D. The general minimal pair generation process is to identify a linguistic pattern, search for relevant linguistic structures in the Treebank, and form minimal pairs by applying hand-crafted transformation rules to the extracted structures.
Figure 2 provides an overview of this process, with the same running example as this section.
3.1 Corpus: Chinese Treebank 9.0

Chinese Treebank 9.0 is a corpus of parsed text (3,247,331 Chinese and foreign characters) from various sources, both formal and colloquial. The Treebank contains 132,080 sentences; we extract a subset of these sentences that contain linguistic structures of interest and then manipulate those sentences to create minimal pairs for SLING.

Pattern Search
The most important patterns and corresponding strings extracted from the Treebank are classifier-noun phrases, compound noun phrases, and verb-object phrases. To demonstrate the extraction process, we will use classifier-noun phrases as an example. We extract classifier-noun phrases by searching for subtrees that have NP as their root node and contain a classifier M, as in (3). For each subtree, a classifier-noun pair is extracted as shown in Figure 2. Because each noun may have multiple compatible classifiers, a dictionary is created with the nouns as keys and the compatible classifiers as the values. Compound noun phrases and verb-object phrases are extracted in a similar way but stored as subtrees only.
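The dictionary-building step can be sketched as follows. The (classifier, noun) pairs and the helper name `build_classifier_dict` are illustrative stand-ins, not SLING's released code:

```python
from collections import defaultdict

# Illustrative (classifier, noun) pairs standing in for those extracted
# from NP subtrees containing a measure word (M) in Chinese Treebank 9.0.
extracted_pairs = [
    ("本", "书"),    # ben + book
    ("张", "纸"),    # zhang + paper
    ("本", "小说"),  # ben + novel
    ("部", "小说"),  # bu + novel (a noun may take several classifiers)
]

def build_classifier_dict(pairs):
    """Map each noun to the set of classifiers attested with it."""
    compatible = defaultdict(set)
    for classifier, noun in pairs:
        compatible[noun].add(classifier)
    return compatible

compatible = build_classifier_dict(extracted_pairs)
# 小说 (novel) ends up with both of its attested classifiers, 本 and 部.
```

Storing classifiers as sets makes the later compatibility check (is a candidate classifier attested for this noun?) a constant-time lookup.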

Sentence Generation
Minimal pairs are generated based on linguistic templates and the extracted strings. Using the classifier-noun agreement phenomenon as an example, the template is CD M Noun. For the acceptable phrases, the M is taken from the classifiers that are compatible with the noun in the dictionary. For the unacceptable phrases, M is randomly chosen from a classifier list (after making sure it is not in the list of compatible classifiers).

Table 3: An overview of the phenomena present in SLING along with their properties. The table indicates whether the paradigms within each phenomenon represent syntactic (syn) or semantic (sem) knowledge, whether they involve a distractor (e.g., the roses in the vase are/*is . . .), whether there are long-distance dependencies (e.g., these beautiful red blooming roses), and whether the LMs need hierarchical knowledge of the language (e.g., Figure 3) to distinguish acceptable sentences from unacceptable ones. Details of each phenomenon are given in Appendix D.
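A minimal sketch of this template instantiation; the function name `make_minimal_pair` and the tiny classifier inventory are illustrative, not SLING's actual generation script:

```python
import random

classifier_inventory = ["本", "张", "部", "只", "条"]   # illustrative list
compatible = {"小说": {"本", "部"}, "纸": {"张"}}       # noun -> compatible classifiers

def make_minimal_pair(numeral, noun, rng):
    """Instantiate the CD M Noun template: the acceptable phrase uses a
    compatible classifier, the unacceptable one a mismatched classifier."""
    good_m = rng.choice(sorted(compatible[noun]))
    bad_m = rng.choice([m for m in classifier_inventory
                        if m not in compatible[noun]])
    return numeral + good_m + noun, numeral + bad_m + noun

rng = random.Random(0)
good, bad = make_minimal_pair("三", "纸", rng)  # good is 三张纸
```

The explicit exclusion check mirrors the step described above: a randomly drawn classifier only yields the unacceptable phrase after confirming it is not in the noun's compatible set.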
In addition to phrases extracted from the Treebank, we also extract the transitive verbs7 used in CLiMP's anaphor and binding phenomena,8 and for certain phenomena we also utilize word lists (e.g., locations, pronouns, and occupations) to build the minimal pairs. Finally, for each paradigm in SLING, we generate one thousand minimal pairs.

Phenomena
As summarized in Table 3, SLING includes 9 major Chinese linguistic phenomena in syntax and semantics. Several minimal pair paradigms are designed to test an LM's robustness to distance and distractors in a dependency relation, as well as whether it has the essential linguistic knowledge of hierarchy in Chinese; more details are provided in Appendix D. Here we describe the gist of each phenomenon. The alternative question phenomenon tests the knowledge that the disjunctor haishi and the polar question marker ma may not co-occur. In the anaphor agreement phenomenon, we first use baselines to test the LMs' gender and number bias (see Appendix D.2). Then, the morpheme ziji (self) is added to test whether the LMs know the function of ziji and match the gender/number of the reflexive with the sentence subject. To avoid the issue caused by Chinese proper names in CLiMP, we use gender + occupation as the subject of sentences to clearly indicate the gender. The aspect phenomenon tests knowledge of the perfective aspect markers le and guo in terms of their interaction with tense and the progressive marker zai. Classifier-noun agreement is observed when a noun is modified by a numeral or demonstrative; one noun can be compatible with more than one classifier, and the matching can be idiosyncratic. The definiteness effect phenomenon rests on the observation that the demonstratives zhe (this)/na (that) and the quantifier mei (every) may not occur in the post-verbal position of an existential you (there is) sentence. Polarity items (PI) are words or phrases whose occurrence is restricted to certain contexts (e.g., negative or affirmative). We test two negative PIs, renhe (any) and shenme (what), as well as one positive PI, huoduo huoshao (more or less). Chinese relative clauses exhibit a filler-gap dependency relationship: if the gap is a simple subject or direct object position, no resumptive noun or pronoun is allowed. Lastly, the wh-fronting phenomenon shows that in the absence of a specific context (e.g., an echo question), a wh phrase must stay in situ.

7 The transitive verbs from CLiMP are used in a small portion of the minimal pairs in SLING's Anaphora dataset, which requires transitive verbs that take animate subjects and objects. The acceptability contrast of the sentences does not rely on those verbs. Extracting such verbs from the Treebank was impossible because the animacy of nouns is not encoded in the parse.

8 The vocabulary and data generation code of CLiMP can be found at https://github.com/beileixiang/CLiMP.

Human Validation
Two rounds of human validation were conducted on PCIbex (Zehr and Schwarz, 2018) to verify the quality of the generated minimal pairs.9 Eleven students from the University of Massachusetts Amherst were recruited as annotators for the first round, and five for the second round. Each student had finished at least senior high school in China, and they all use Chinese on a daily basis. For the first-round evaluation, every annotator rated 20 pairs from each of the 30 paradigms (not the baselines).10 The annotators were shown one minimal pair at a time and asked to choose the more acceptable sentence. In total, the annotation task took 1.5 to 2 hours on average, and the annotators were paid $40 each. Details on the second annotation round can be found in Appendix E. The final raw human accuracy mean over all paradigms is 97.12% (median = 97.27%, SD = 2.29%). The inter-annotator agreement as measured by Fleiss' κ is 0.8823, indicating almost perfect agreement (Landis and Koch, 1977).
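Fleiss' κ can be computed directly from a ratings matrix. The sketch below is a generic implementation (not the scripts used for SLING's validation), with rows as items (minimal pairs) and columns as answer categories (chose sentence A vs. sentence B):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa: ratings[i][j] is how many annotators assigned item i
    to category j; every row must sum to the same number of raters."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    # Observed agreement: mean fraction of agreeing annotator pairs per item.
    p_bar = sum((sum(c * c for c in row) - n_raters)
                / (n_raters * (n_raters - 1))
                for row in ratings) / n_items
    # Chance agreement from the marginal category proportions.
    total = n_items * n_raters
    p_e = sum((sum(row[j] for row in ratings) / total) ** 2
              for j in range(len(ratings[0])))
    return (p_bar - p_e) / (1 - p_e)

# Toy example: 3 annotators, 4 minimal pairs, perfect agreement with
# balanced answers.
kappa = fleiss_kappa([[3, 0], [0, 3], [3, 0], [0, 3]])  # 1.0
```

The κ = 0.88 reported above would correspond to a matrix where annotators overwhelmingly, but not unanimously, pick the same sentence in each pair.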

Experimental Setup
Evaluated Models: There are many publicly available pretrained monolingual Chinese LMs and multilingual LMs. While Xiang et al. (2021) only test bert-base-chinese, three LSTM LMs, and two 5-gram LMs in their work on CLiMP, we experiment with the 18 LMs listed in Table 4.11 There are 6 pairs of LMs (color-coded in Table 4) in which one model is trained with either more parameters or larger training data than the other in the pair.12 Although lstm-zh-cluecorpussmall and gpt2-zh-cluecorpussmall also differ in their model structure, we pair them to see whether a Transformer-based architecture leads to better model performance. We run the same suite of LMs on CLiMP, show the results in Table 7, and discuss them in Section 5.6.

9 After the first round, the human accuracies on the two compound noun paradigms were 61.36% and 77.27%. To improve the quality of SLING, we revised the generation process of the two paradigms and re-evaluated their quality.

10 Ten practice and 24 filler item pairs were created to test whether the annotators understood and paid attention to the task. Those pairs are irrelevant to the paradigms of interest. All annotators passed these tests with 100% accuracy.

11 Most LMs tokenize an input sentence into characters, but CPM-Generate and PanGu-α occasionally cut an input into words, and the ByT5 models use bytes.

12 The mengzi-bert-base-fin model is mengzi-base further trained with 20G of extra financial news and research reports.
Evaluation: To evaluate the performance of an LM on SLING, we use perplexity for the causal LMs and pseudo-perplexity (Salazar et al., 2020) for the masked LMs (see Appendix B for details). Given a minimal pair, the LM should assign a lower (pseudo-)perplexity to the acceptable sentence. The accuracy of an LM on a paradigm is the proportion of minimal pairs in which the model assigns the acceptable sentence the lower (pseudo-)perplexity.
Why perplexity? We choose perplexity over other metrics (e.g., raw probability) because some phenomena in SLING have systematic differences in sentence length within minimal pairs (e.g., Polarity Item, Relative Clause). Thus, we require a length-normalized metric like perplexity, since metrics such as probability inherently prefer shorter sentences (Wu et al., 2016; Koehn and Knowles, 2017; Brown et al., 2020; Holtzman et al., 2021). Additionally, perplexity (or pseudo-perplexity) is applicable to all phenomena and all LMs tested in SLING (details in Appendix B). We considered other evaluation metrics such as prefix methods (Linzen et al., 2016; Gulordava et al., 2019; Wilcox et al., 2019), by-word surprisal (Futrell et al., 2018), and training an acceptability classifier (Warstadt et al., 2019) but eventually decided not to use them for reasons detailed in Appendix C.
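The evaluation logic can be sketched as below; the toy per-token log-probabilities stand in for the scores a real causal or masked LM would supply, and the helper names are hypothetical:

```python
import math

def perplexity(token_logprobs):
    """Length-normalized score: exp of the negative mean token log-likelihood."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def minimal_pair_accuracy(pairs, score):
    """Fraction of pairs where the acceptable sentence gets the lower
    (pseudo-)perplexity; `score` maps a sentence to its token log-probs."""
    correct = sum(perplexity(score(good)) < perplexity(score(bad))
                  for good, bad in pairs)
    return correct / len(pairs)

# Toy stand-in for an LM's token log-probabilities.
toy_scores = {
    "钥匙 丢 了": [-1.0, -1.5, -0.5],   # acceptable: higher likelihood
    "钥匙 了 丢": [-1.0, -4.0, -3.0],   # unacceptable word order
}
acc = minimal_pair_accuracy([("钥匙 丢 了", "钥匙 了 丢")], toy_scores.get)  # 1.0
```

Because `perplexity` divides by the token count, a longer sentence is not automatically penalized, which is exactly the length-normalization argument made above.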

Results & Analysis
Table 5 reports the human performance and the results of the LMs on each phenomenon. Overall, LM performance (with bert-base-zh at 84.8% being the best) lags far behind human performance (97.1%). Looking into each phenomenon, although some LMs occasionally perform better than humans (e.g., on the definiteness effect), no single LM performs consistently well. Comparing the monolingual LMs to the multilingual ones, the former generally perform better than the latter. In the following subsections, we analyze model performance from the perspectives of model size, distance, and hierarchy. By-phenomenon results and analyses are in Appendix F.

Model Size
To investigate whether a larger model performs better on SLING, two-tailed pairwise Wilcoxon signed-rank tests were conducted on each LM pair in Table 4. The tests indicated that the performance of the LMs in the pert and mengzi pairs differed statistically significantly from each other, while there was no statistical difference in the other LM pairs. Further one-tailed pairwise Wilcoxon signed-rank tests on these two pairs revealed (counterintuitively) that the smaller LMs (pert-base, mengzi-base) perform better than the larger ones (pert-large, mengzi-fin). The test results can be found in Table 9 in Appendix G.3. This finding coincides with the conclusion drawn in BLiMP and CLiMP that increasing model size does not necessarily improve model performance.
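As an illustration of this two-step testing procedure, the sketch below runs it on made-up per-paradigm accuracies (the real tests use the actual SLING paradigm accuracies of each LM pair):

```python
from scipy.stats import wilcoxon

# Made-up per-paradigm accuracies for a hypothetical small/large LM pair.
small = [0.90, 0.85, 0.78, 0.92, 0.67, 0.81, 0.72, 0.88]
large = [0.84, 0.80, 0.75, 0.93, 0.60, 0.77, 0.70, 0.80]

# Step 1: two-tailed test -- do the paired accuracies differ at all?
_, p_two = wilcoxon(small, large, alternative="two-sided")

# Step 2: if they differ, a one-tailed follow-up asks whether the
# smaller model's accuracies are systematically greater.
_, p_greater = wilcoxon(small, large, alternative="greater")
```

With this toy data the smaller model wins on 7 of 8 paradigms, so both p-values come out small, mirroring the counterintuitive pattern reported for the pert and mengzi pairs.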

LMs are Affected by Distance
The classifier-noun phenomenon was designed to test whether the LMs are affected by distance in a dependency. For example, in (4), the classifier is separated from the noun by a long adjective, making the local dependency distant. The noun phrase can also be a compound noun (5), in which case the classifier should agree with the second noun.
(4) 三户非常优秀且高效的家庭
3 households of very excellent and efficient families

(5) 三本非常优秀且高效的家庭小说
3 copies of very excellent and efficient family fiction

Two two-tailed paired Wilcoxon signed-rank tests were conducted to compare the simple noun paradigms with and without a long adjective, as well as the ones with compound nouns. The results indicated that there was a statistically significant difference in model performance when the long adjective was present versus absent in the simple noun paradigms. There was no such difference in the compound noun paradigms. Further one-tailed Wilcoxon signed-rank tests showed that, with a long adjective, LM performance on the simple noun paradigms decreased. The p-values are reported in Table 10.

LMs struggle with Hierarchy
All LMs struggle with hierarchical phenomena and are vulnerable to linear closeness, as shown in the results for the anaphor and classifier-noun phenomena. The anaphor phenomenon was designed to test whether the LMs prefer linear or hierarchical closeness. For the LMs to correctly choose the acceptable sentences, they should prefer hierarchical closeness. In the example in Figure 3, DP5 can only agree in its gender feature with DP1, which is hierarchically closer. If the LMs are distracted by the linearly closer DP3, they would pick the unacceptable sentence in which DP5 is herself.

Figure 3: The syntactic structure of the sentence 男学者在女导演的店里申请了他自己的退税。 (The male scholar applied for his own tax return at the female film director's shop.) The reflexive anaphor himself must be bound by DP1, which is hierarchically closer, rather than DP3, which is linearly closer. Details of the tree can be found in Appendix D.
Two two-tailed paired Wilcoxon signed-rank tests were conducted on the male and female anaphor paradigms with and without a PP, respectively. The results show that there is a statistically significant decrease in performance when the distractor is present.16 The descriptive and test statistics can be found in Table 11.
The classifier-noun phenomenon is designed to test whether the LMs are aware of the right-headedness of Chinese compound nouns and match the classifier with the second noun in a compound rather than the first (cf. (4) and (5)). If the LMs do not have this knowledge but prefer linear closeness, they will choose the wrong sentence in a minimal pair. The statistics and the results of two two-tailed Wilcoxon signed-rank tests in Table 12 show that the LMs performed worse when the distractor was present.

Strong Gender and Number Bias
Because the LMs can have gender and number bias, in the anaphor phenomena we use baselines (e.g., The male baker likes him / her.) to test the bias.17 The higher the accuracy, the more biased an LM is towards him. Figure 9 in Appendix G.3 shows that, with a male subject, only four monolingual LMs (gpt2-zh, CPM, pert-base, and ernie) are gender-neutral. When the subject is female, all LMs are biased towards a female object (see Figure 12).
One reviewer raised the concern that the anaphora resolution in those baselines can only be reliably solved in the context of preceding text, which is true in real-life situations. However, in our test setting, since there is no context, the models should ideally be gender-neutral on average (Bordia and Bowman, 2019).

16 This is even the case in the female paradigms, where the LMs are strongly biased. The female baseline row in Table 8 shows that when the sentence subject is female, and there is no need for the object to agree with the subject, the LMs are strongly biased towards a female object. A detailed explanation of the baselines can be found in Appendix D.2.

17 The Chinese baseline has the same structure as this English translation.
The LMs also have number bias. A baseline example is The three male bakers like them / him. The higher the accuracy, the more biased an LM is towards them. As seen in the results in Table 8 (Appendix G.3), while most LMs are biased towards a plural object when the subject is plural, PanGu-α is strongly biased towards a singular object.
The purpose of the baselines is to reliably test whether the LMs know that the gender/number of ziji (self) should agree with the subject's gender/number in the paradigms. As it turns out, the female and number features are not useful for our purpose because the LMs already achieve a 'high' accuracy on the baselines, making it ambiguous whether the high accuracy on non-baselines arises because they know the function of ziji (self) or because they are simply biased. The male self paradigm, on the other hand, shows that most monolingual LMs were able to use ziji as a cue to match the gender of the subject and object. Among the multilingual LMs, only gpt3-davinci achieved a meaningful accuracy increase.

Vulnerable to Uncertainty
In the current study, haishi, le, and wh phrases can each have more than one usage depending on context. We observe that the LMs performed worse on the paradigms involving those phrases. This is most obvious in the aspect and polarity item phenomena.
In the aspect phenomenon, the possible position of guo is relatively fixed compared to le, and there is no interaction between guo and the progressive marker. The LMs performed better on the guo paradigms than on the le paradigms.
In the polarity item phenomenon, the contexts in which the positive polarity item more or less can occur are more restricted than those of any, which are in turn more restricted than those of wh phrases. Accordingly, LM performance is best on more or less, followed by any, and worst on wh phrases.

Evaluating Our Set of 18 LMs on CLiMP
We ran the 18 LMs on CLiMP and compared model rankings and performance on CLiMP and SLING. We observe major differences: the best LM on SLING is bert-base-chinese (84.8%), while on CLiMP it is chinese-pert-base (81.22%). That said, monolingual LMs perform better than multilingual LMs on both datasets.18 While the average performance of the LMs on the two datasets is similar (SLING 69.7%, CLiMP 70.1%), the LMs show significantly larger variation across phenomena on SLING (SD = 24.1%) than on CLiMP (SD = 13.2%). Thus, SLING is more discriminative of the strengths and weaknesses of LMs, as model accuracies are more polarized across phenomena in SLING than in CLiMP. Finally, because CLiMP does not test the LMs' bias in the gender and number features for its binding and anaphor paradigms, LM performance on these two paradigms is uninformative, since we do not know what role the bias plays in the tests. SLING corrects this issue by including 8 baseline paradigms and shows that the LMs can be strongly biased (see Section 5.4).
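The discriminativeness comparison boils down to comparing standard deviations of per-phenomenon accuracies; a sketch with made-up numbers (not the paper's results):

```python
from statistics import stdev

# Made-up per-phenomenon accuracies for one LM on each benchmark,
# illustrating the shape of the comparison only.
sling_acc = [0.95, 0.40, 0.88, 0.52, 0.97, 0.33, 0.79, 0.61, 0.85]
climp_acc = [0.72, 0.65, 0.70, 0.58, 0.77, 0.69, 0.74, 0.66, 0.62]

# A larger spread means the benchmark separates an LM's strengths
# from its weaknesses more sharply, even at a similar mean accuracy.
sling_spread = stdev(sling_acc)
climp_spread = stdev(climp_acc)
```

Two benchmarks with near-identical means can thus differ greatly in how much they reveal about where a model fails.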

Conclusion
We present SLING, a new benchmark for evaluating Chinese linguistic knowledge in large-scale pretrained LMs. Unlike the existing CLiMP dataset, in which we identify several critical issues, we construct SLING from naturally-occurring sentences in the Chinese Treebank. Our results show that monolingual Chinese LMs achieve better performance on SLING than multilingual LMs. We find that LMs handle local dependencies better than long-range dependencies or those with distractors, and that they are better at syntactic than at semantic phenomena. Overall, there remains a large gap between LM and human performance.

Limitations
As a benchmark for evaluating LMs' Chinese linguistic knowledge, SLING covers 9 major Chinese grammatical phenomena with 38K minimal pairs. However, there are still phenomena that are important but not included in the current work: for example, the ba and bei constructions. For those structures, unacceptability can have different sources (e.g., syntax or pragmatics); simple syntactic structure restrictions are not enough. When deciding which phenomena to include in SLING, we deliberately avoided such cases because the (un)acceptability of these phenomena can be mitigated by contextual or world knowledge. As a result, human judgment can vary significantly. As an example, take the bei construction (passive): the sentence 王萍被嘴举了 (Wang was lifted by a mouth) is wildly bizarre to some people, while for others it is acceptable because it is possible to imagine a world in which each body part is a mighty character that can lift things. Such "unacceptable" sentences are different from The roses is red, which cannot be rescued by any context.

18 The Kendall tau correlation of the two rankings is 0.42 for monolingual LMs and 0.79 for multilingual LMs.
Another limitation is that even though Chinese Treebank 9.0 contains a rich and diverse vocabulary, it can still be inadequate at times. For example, for the classifier-noun agreement phenomenon in SLING, we were not able to extract enough high-quality compound nouns and thus had to manually create 196 minimal pairs, as described in Appendix E. One possible way to get around this limitation is to train a parser on the Treebank and use it to automatically parse even more raw Chinese data. We leave this for future work.

Ethical Considerations
Following best practices (McMillan-Major et al., 2021), we plan to open-source our dataset along with a data card. We will follow the templates used in the GEM benchmark (Gehrmann et al., 2021) and the HuggingFace Datasets repository (Lhoest et al., 2021). Overall, our project had a small computational cost since we did not need to do any model training. We performed inference with all 18 LMs on a single RTX8000 GPU with 48GB of memory. All inference experiments in this paper can be completed within a day on this single GPU.

A Ngram Count of CLiMP and SLING
CLiMP contains 16K minimal pairs (32K sentences) and SLING 38K (76K sentences). The average sentence length in CLiMP is 11.8 (median = 11) and in SLING 12.5 (median = 12). Because of the difficulty of defining what counts as a word in Chinese, we report unigram-to-four-gram type counts in Table 6, together with the word type counts returned by Jieba. Because SLING has more sentences, which can lead to larger type counts, we randomly shuffled the sentences and took 32K sentences to calculate the n-gram and Jieba word type counts. One reason for having 1K sentence pairs in each paradigm is to cancel out the potential influence of word frequency on the perplexity of sentences. Having a diverse vocabulary surely helps in this sense.
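Character n-gram type counting sidesteps word segmentation entirely; a minimal sketch (`ngram_types` is an illustrative helper, not the script behind Table 6):

```python
def ngram_types(sentences, n):
    """Count distinct character n-grams (types, not tokens)."""
    types = set()
    for sentence in sentences:
        chars = [c for c in sentence if not c.isspace()]
        for i in range(len(chars) - n + 1):
            types.add("".join(chars[i:i + n]))
    return len(types)

sents = ["三本小说", "三部小说"]
unigrams = ngram_types(sents, 1)  # 5 distinct characters
bigrams = ngram_types(sents, 2)   # 5 distinct character bigrams
```

Counting types rather than tokens is what makes the measure a proxy for vocabulary diversity, which is why the comparison above subsamples SLING to CLiMP's sentence count first.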

B Metrics
Causal LMs Perplexity (PPL) is used for causal LMs to decide the preferred sentence. Each token w_i is assigned a probability given the prefix seen so far, and the perplexity is calculated from the resulting log-likelihood. For a sentence S of length m, its perplexity is calculated as:

$$\mathrm{PPL}(S) = \exp\Big(-\frac{1}{m}\sum_{i=1}^{m}\log p(w_i \mid w_{<i})\Big)$$

Each sentence in a minimal pair is assigned a perplexity value; the one with the lower perplexity is taken as the sentence the model chooses.
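This computation can be sketched in a few lines; the per-token log-probabilities below are hypothetical stand-ins for what a causal LM would actually return:

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(-(1/m) * sum of log p(w_i | w_<i)) over the m tokens."""
    m = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / m)

# Hypothetical per-token probabilities for the two sentences of a minimal pair:
# the acceptable sentence should come out with the lower perplexity.
acceptable = [math.log(p) for p in (0.5, 0.4, 0.5, 0.6)]
unacceptable = [math.log(p) for p in (0.5, 0.4, 0.1, 0.6)]
chosen = "A" if perplexity(acceptable) < perplexity(unacceptable) else "B"
```

Because perplexity normalizes by sentence length m, the comparison remains meaningful even when the two sentences tokenize to different lengths.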
Masked LMs Pseudo-perplexity (pseudo-PPL) is used to evaluate masked LMs (Salazar et al., 2020). Concretely, the tokens in a sentence are masked one at a time; for each masked token w_j, the masked LM returns a probability distribution over the vocabulary at that position given the surrounding context. For a sentence S of length m, its pseudo-perplexity is calculated as:

$$\mathrm{pseudoPPL}(S) = \exp\Big(-\frac{1}{m}\sum_{j=1}^{m}\log p(w_j \mid W_{\setminus j})\Big)$$

C Related Work: Methods of Evaluating Linguistic Knowledge and Their Limitations in SLING
To investigate what kind of, and how much, linguistic knowledge large-scale pretrained LMs have compared to humans, previous work has focused on a limited set of LMs and probed their internal encoding of linguistic knowledge (Tenney et al., 2019a,b; Clark et al., 2019). Other works investigate LMs' knowledge of a small subset of English syntactic grammar using prefix methods (Linzen et al., 2016; Gulordava et al., 2019; Wilcox et al., 2019), by-word surprisal (Futrell et al., 2018), or a trained acceptability classifier (Warstadt et al., 2019).
Prefix method Linzen et al. (2016) focus on English subject-verb dependencies and use a prefix method for evaluation, which requires LMs to assign probabilities to the next word given a prefix.
The grammatical next word is expected to have a higher probability (e.g., The keys are vs. *The keys is). The task includes local subject-verb dependencies as well as long-distance dependencies with distractors (e.g., The roses in the vase by the door are vs. *The roses in the vase by the door is). The prefix method is adopted in later works, for example, Gulordava et al. (2019) and Wilcox et al. (2019).
A limitation of the prefix method is that it mostly applies to inflectional grammatical phenomena in a dependency relationship. For Chinese, a language that largely lacks inflection, its usage is very limited. Taking SLING as an example, the prefix method is not applicable to the nine phenomena because the minimal pairs' acceptability depends on:
• the presence/absence of a crucial word (Alternative Question, Anaphor (number), Aspect, Polarity Item, Relative Clause);
• the word order (Aspect, Wh-fronting);
• the choice of a crucial word in the middle of a sentence, whose acceptability depends on the part of the sentence after that word (Anaphor (gender), Classifier-Noun, Definiteness Effect, Polarity Item, Relative Clause).
By-word surprisal Another evaluation method, inspired by controlled psycholinguistic experimentation, is the by-word surprisal and sentence completion methods proposed by Futrell et al. (2018) to explore LMs' knowledge of syntax. Surprisal reflects whether LMs are affected by the presence/absence of critical words in grammatical configurations. In the sentence completion task, LMs complete a sentence given a prefix. Human annotators then judge the grammaticality of the completed sentences.
The by-word surprisal method addresses one limitation of the prefix method (i.e., acceptability that depends on the presence/absence of a crucial word) but still does not account for the other two listed above. The sentence completion method faces similar restrictions and cannot be applied at scale because it requires human judgement of the completed sentences. Warstadt et al. (2019) trained an acceptability classifier to perform a grammaticality judgement task on sentences collected from the linguistics literature and marked for their acceptability.

Acceptability classifier
There are several limitations to training a classifier. First, it involves many debatable design decisions (e.g., hyperparameters). Second, LMs may learn the task from the training data (Hewitt and Liang, 2019; Voita and Titov, 2020). Our goal is to measure the linguistic capability of pretrained LMs without additional help from a training dataset that has the same distribution as the test set.
Overall, the previous methods are either applicable only to a subset of linguistic phenomena or depend on the performance of a classifier. The minimal pair method used in BLiMP overcomes these limitations.
Minimal pair method To cover a wide range of linguistic phenomena, Warstadt et al. (2020) introduced minimal pair evaluation for LMs and created the Benchmark of Linguistic Minimal Pairs for English (BLiMP). It evaluates linguistic knowledge of twelve English grammatical phenomena spanning syntax and semantics. Each phenomenon consists of minimal pair paradigms representing its different aspects. All minimal pairs are code-generated using templates created by linguists and an annotated vocabulary of 3,000 words. The dataset is human-validated.
The results on BLiMP show that the tested LMs are good at local dependency relations (e.g., morphological agreement) but bad at phenomena involving hierarchy and semantic knowledge. Concerning training size and model size, while increasing training size can improve model performance, increasing model size does so to a lesser extent.
Other possible metrics and their limitations Other possible metrics are raw probability and a masked-token method. However, probability is not a suitable metric for SLING for at least two reasons. First, probability is only useful for minimal pairs whose sentences have the same length; otherwise, probability by nature prefers shorter sentences. Second, the sentences in a minimal pair need to have similar word orders, because tokenizers might tokenize a sentence differently depending on the word order, causing the two sentences in a minimal pair to have different token lengths. In the masked-token method, we mask out the crucial word in each sentence of a minimal pair and ask an LM for the probability of the two masked words. This method is not applicable to causal LMs. For masked LMs, it is only applicable to Anaphor (gender), Classifier-Noun, and Definiteness Effect in SLING, where the word order does not change. In those cases, since SLING uses minimal pairs, the masked token is exactly the part in which the two sentences differ; hence, the masked-token method returns the same results as pseudo-perplexity.
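The length bias of raw probability can be seen in a toy calculation (all probability values below are invented for illustration): a shorter sentence with mediocre per-token probabilities can beat a longer sentence whose tokens are individually more probable, while length-normalized perplexity ranks them the other way.

```python
import math

# Hypothetical per-token probabilities: a short sentence vs. a longer one
# whose tokens are individually more probable.
short_sent = [0.5, 0.5, 0.5]   # joint probability 0.125
long_sent = [0.6] * 6          # joint probability ~0.047

def joint_prob(ps):
    return math.prod(ps)

def ppl(ps):
    return math.exp(-sum(math.log(p) for p in ps) / len(ps))

length_biased = joint_prob(short_sent) > joint_prob(long_sent)   # raw prob prefers short
ppl_prefers_long = ppl(long_sent) < ppl(short_sent)              # PPL prefers better tokens
```

This is why SLING's comparisons are made in (pseudo-)perplexity rather than raw sentence probability.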

D Linguistic Phenomena
The current work focuses on six syntactic and three semantic phenomena in Chinese. Table 3 offers an overview. There are 30 test paradigms. The anaphor phenomenon has 8 baseline paradigms to detect LMs' gender (male/female) and number (singular/plural) biases.
All phenomena have at least one paradigm that can be solved by checking the linear order of tokens. Some phenomena require a negative co-occurrence of words; for example, in the alternative question phenomenon, the disjunctor haishi and the polar question particle ma may not co-occur. Other phenomena require a positive co-occurrence; for example, in the polarity item phenomenon, the grammaticality of renhe (any) depends on the occurrence of negation.
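As a hypothetical illustration (treating sentences as character strings, which glosses over segmentation), the two kinds of co-occurrence constraints reduce to simple linear checks:

```python
def violates_negative_cooccurrence(sentence, word_a, word_b):
    """A negative constraint: word_a and word_b may not both occur."""
    return word_a in sentence and word_b in sentence

def violates_positive_cooccurrence(sentence, item, licensors):
    """A positive constraint: item may only occur if some licensor occurs."""
    return item in sentence and not any(l in sentence for l in licensors)

# haishi (还是) and the polar question particle ma (吗) may not co-occur.
bad_altq = violates_negative_cooccurrence("他们是老师还是木匠吗", "还是", "吗")
# renhe (任何) requires a negation such as mei (没) or bu (不).
ok_pi = not violates_positive_cooccurrence("他没有任何问题", "任何", ["没", "不"])
```

Paradigms that can be solved by such linear checks contrast with the hierarchical paradigms discussed next, which no string-level check can decide.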
Three phenomena contain paradigms that require LMs to use knowledge of hierarchy. If LMs use linear closeness rather than hierarchical closeness, they will wrongly assign a lower perplexity to the unacceptable sentence in a minimal pair. The anaphor phenomenon, for example, contains such paradigms.
The anaphor, classifier-noun agreement, and relative clause phenomena have paradigms that test LMs' robustness to distractors and long-distance dependencies. A distractor is an element that intervenes between the head and its dependent in a dependency/agreement relation. For example, in The roses in the vase are ..., roses and are are in a dependency relation, and vase is the distractor. By distance, we mean that the head and its dependent are separated from each other (e.g., these beautiful red blooming roses).
This section introduces the phenomena in turn. If a phenomenon is also in CLiMP, a comparison between CLiMP and the current work is provided.

D.1 Alternative Questions with haishi
Chinese alternative questions (AltQ) are most reliably marked by the disjunctor haishi (Huang et al., 2009). Although haishi has different usages (Wu, 2010), when it is used as the disjunctor, the polar question particle ma (SP) cannot occur. Minimal pairs like (6), glossed "Are they teachers or carpenters?", are built on this restriction.

D.2 Anaphor
Mandarin Chinese has two reflexive pronouns: ziji and ta(men)-ziji. The former is morphologically simple, with no person, number, or gender features. The latter, ta(men)-ziji, contains the pronoun ta, which encodes gender in writing: 她 for singular third person female, 他 for singular third person male, and 它 for singular third person non-human.
The character men indicates plurality. Because of this morphological richness, ta(men)-ziji is used to form minimal pairs. Since CLiMP also contains the binding phenomenon, its implementation is introduced first, followed by the binding phenomenon in the current work.
Binding Phenomenon in CLiMP Xiang et al. (2021) use singular female and male third person reflexives ta-ziji to test LMs' knowledge of binding. There are two paradigms. The first has a simple SVO structure in which the object is an anaphor and must match the gender feature of the subject. The second paradigm involves a distractor between the antecedent and the reflexive (e.g., DP2 in Figure 4). The distractor differs from the true antecedent in its gender feature, and is linearly closer to the reflexive but hierarchically farther. It turns out that the LMs struggle with this paradigm: the results show that they did no better than chance. One of the acceptable binding sentences in Xiang et al. (2021) is cited below; we provide its syntax in Figure 4. The corresponding unacceptable sentence changes herself to himself.
(7)

Although, by comparing the two paradigms, Xiang et al. (2021) find the models are bad at dealing with hierarchy and distractors, there are four shortcomings in the minimal pair design that weaken this observation. First, it was not tested whether the LMs knew the gender of the proper names; because Chinese names do not always clearly indicate gender, this can cause the LMs to guess randomly. Second, the syntax of the second paradigm is complex because it involves ellipsis. With the presence of ellipsis, it is not certain whether the models did badly because they preferred a linearly closer agreement or because they could not recover the omitted subject correctly. Third, CLiMP has no baseline for the gender biases of the LMs; hence, we cannot know whether the models know the function of ziji or simply prefer one gender. Fourth, CLiMP does not have separate corpora for the two genders, so we do not know whether the LMs are bad at both female and male reflexive agreement or only one of them.

Paradigms in Current Work
To remedy the four shortcomings, the current work includes baseline paradigms to test LMs' gender bias. Sentences have a simple SVO structure. Instead of using proper names as the subject, the paradigms use a gender word plus an occupation to indicate the gender of a noun. The female and male reflexive agreements are tested separately.
To form the baseline minimal pairs for the male reflexive agreement, an occupation and a transitive verb were chosen randomly. Following the verb is either a male or a female pronoun. Example (8) is one resulting minimal pair.
(8) nan dianyuan baituole ta / ta.
    male shop.assistant got.rid.of him / her
    "The male shop assistant got rid of him / her."

Both sentences are acceptable. The purpose is to see whether the models are gender biased when there is no cue for any gender agreement. The other baselines are formed in the same way.
With the baseline established, the minimal pairs for reflexive agreement are created by adding ziji to the end of the sentences in the baselines, turning (8) into (9). Because of the presence of ziji, the gender of ta should agree with the gender of the male shop assistant; hence himself is acceptable but herself is not. Such agreement can be solved by linear closeness. The next paradigm tests whether LMs prefer a linearly closer or a hierarchically closer noun as the antecedent of an anaphor. An example is (10); the syntax of the grammatical sentence is in Figure 5. Like Figure 4, Figure 5 involves a distractor (DP3) but has no ellipsis. It is an SVO sentence with a prepositional phrase (PP) modifying the verb phrase. The antecedent of DP5 can only be DP1, which c-commands himself, while DP3 is embedded deep in the PP. DP1 is hierarchically closer to himself while DP3 is linearly closer. The LMs will fail if they have no knowledge of hierarchical structure.
The current work also uses the number feature to test LMs. Baselines are used to see whether the tested LMs are biased toward singularity or plurality. The gender feature is kept constant so that any distinct behaviour is caused only by the number feature.

D.3 Aspect Markers le and guo
The morphemes le and guo often function as perfective aspect markers. Although they can occur in sentences of various tenses, without the help of a future-oriented adverb together with morphemes such as cai or jiu, they only occur in sentences in the past tense. A paradigm is built on this observation; an example is in (11).

D.4 Classifier-Noun Agreement

In Chinese, a noun is typically paired with a compatible classifier (Huang et al., 2009). The difficulty in classifier-noun agreement is that the matching can be idiosyncratic, and one noun can be compatible with multiple classifiers. CLiMP includes the classifier-noun agreement phenomenon, which consists of three paradigms. However, because the variables in its minimal pairs are not well controlled, the experimental results are not conclusive.
Classifier-Noun Agreement in CLiMP The first paradigm is local classifier-noun matching. The second paradigm inserts an adjective of two to four characters between the classifier and the noun to increase the distance between the two; there is no distractor in the adjective. The third paradigm further increases the distance by using a relative clause instead of an adjective. Without showing the results of each paradigm, Xiang et al. (2021) report that the mean model performance is 71.66% (median 70.1%); Chinese BERT performs best (92.9%). The overall human accuracy on the paradigms is 99.7%.
There are two issues with the paradigms. First, some minimal pairs do not show a clear contrast. Example (15), taken from CLiMP, intends the classifier jia to be unacceptable; however, both liang and jia are compatible with the noun bike.
(15)

The reason for this issue is that each noun in the CLiMP vocabulary is associated with only one classifier. However, as mentioned before, classifier-noun matching can be a many-to-many relation. The second issue concerns the relative clauses in the third paradigm: some relative clauses contain a distractor, and in certain cases the distractor even matches the classifier.

Paradigms in Current Work
The current work has five paradigms for classifier-noun agreement. To avoid the issues in CLiMP, we built a classifier-noun dictionary in which each noun is associated with a group of classifiers. When creating the minimal pairs, it is ensured that the classifier in the unacceptable sentence is not listed as a compatible classifier of the noun.
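A minimal sketch of how such a dictionary can be used to sample an incompatible classifier; the dictionary fragment and helper below are illustrative, not the actual SLING resource:

```python
import random

# Illustrative fragment: each noun maps to its set of compatible classifiers.
CLASSIFIER_DICT = {
    "警察": {"名", "个", "位"},  # policeman
    "铁路": {"条"},              # railway
}
ALL_CLASSIFIERS = set().union(*CLASSIFIER_DICT.values())

def incompatible_classifier(noun, rng=random):
    """Sample a classifier NOT listed as compatible with the noun."""
    candidates = sorted(ALL_CLASSIFIERS - CLASSIFIER_DICT[noun])
    return rng.choice(candidates)
```

Drawing the unacceptable classifier from the complement of the noun's compatible set is what guarantees a genuine acceptability contrast, avoiding CLiMP's one-classifier-per-noun problem.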
Of the five paradigms, one tests models' knowledge of the linear order of demonstratives (DT) or numerals (CD) and classifiers (M) before a noun. The other four test LMs' knowledge of classifier-noun agreement.
The first of the four paradigms involves local classifier-noun agreement. The second inserts a long adjective between the classifier and the noun; still, no knowledge of hierarchy is needed. The third paradigm is based on compound nouns; an example is given in (16). Chinese compound nouns are right-headed (Huang et al., 2009); hence, noun1 functions as a distractor. In (16), ming is the classifier for policeman while tiao is for railway. The last paradigm adds a long adjective after the classifier in the third paradigm. For the compound noun paradigms, knowledge of hierarchy is needed: the LMs should know the right-headedness of Chinese compound nouns.

D.5 Definiteness Effect
It has long been noticed that certain strong determiners cannot appear in the postverbal position of an English existential there-sentence (Keenan, 1987; Abbott, 1993; Zucchi, 1995). Similar effects have been observed in Chinese (Xu, 1995; Hu and Pan, 2008). The phenomenon tested here involves Chinese you (have), a close counterpart of the there-construction. The demonstratives zhe (this) and na (that), as well as the quantifier mei (every), are used as equivalents of the strong determiners in English; the phrase yi (one) + M is used as a counterpart of English weak determiners. This paradigm can be solved by checking the linear co-occurrence of two elements, you and a strong determiner. An example is in (17).

D.6 Polarity Items
Polarity items (PIs) are common in natural languages (Tóth, 1999; Yoshimura, 2007; Kumar, 2013; Giannakidou et al., 2019, a.o.). English, for example, has any, ever, and yet. In Chinese, renhe (any) and shenme (what) are two actively investigated negative PIs; they occur under negation, in polar questions, and in conditionals (Cheng, 1994; Wang and Hsieh, 1996; Lin, 1998; Chen, 2012; Lin and Giannakidou, 2015). The phenomenon contains three paradigms. No complex hierarchical structure is involved: all paradigms can be solved by checking the linear co-occurrence or absence of certain tokens. The first paradigm concerns renhe (any). The second involves shenme, a multi-functional phrase. It is often seen in wh-questions (e.g., ni chi shenme, lit. you eat what, "What do you eat?"). However, shenme also occurs in contexts where typical negative PIs occur. The acceptability contrast is manipulated by the presence of negation. To avoid a wh-question reading, the adverb shenzhi (even) is used, which can occur in affirmative or negative contexts but not in wh-questions, as it can be a focus intervener (Beck, 2006).

D.7 Relative Clauses
Relative clauses in Mandarin Chinese are head-final, meaning a modifying clause occurs before the modified noun. This characteristic is tested in CLiMP. Another characteristic of Chinese relative clauses is that they are filler-gap constructions: in the gap position, a resumptive noun is out of the question, and a resumptive pronoun cannot occur freely (Zhou and Han, 2012, as cited in Wen, 2020). The minimal pairs of this paradigm differ in two aspects. First, the acceptable sentences contain le but the unacceptable ones do not. Second, the acceptable sentences do not contain mei but the unacceptable ones do. This seems to render the pairs not minimally distinct. However, the morpheme mei is a negation that encodes the perfective aspect, which is what le does in the acceptable sentences. Keeping le in the unacceptable sentences would make them unacceptable for a reason that is not at issue here. Hence, even though on the surface the two sentences are not minimally distinct, semantically they are.

D.8 Wh-fronting
As mentioned in Section D.6, shenme is frequently used to form wh-questions. In canonical wh-questions, the wh-phrases stay in situ (Huang et al., 2009). Without a very specific appropriate context, wh-fronting is unacceptable. Hence, no matter whether shenme alone functions as an object or modifies a noun as in (22), the noun phrase containing it cannot be fronted. To force a question reading of shenme, the phrase jiujing or daodi (on earth) is added. There is no complex hierarchy in the sentences, and the wh-phrases are all objects.

E Second Round of Human Validation
The minimal pairs of the two compound noun paradigms were refined. Among the 2,000 new minimal pairs, 1,804 were code-generated and 196 were manually created. To verify the minimal pair quality, a second round of human validation was conducted. Five annotators (3 female, 2 male) with an average age of 22.2 were recruited in the same way as described in Section 3.5. Twenty pairs of sentences were randomly sampled from both the code-generated and manually created minimal pairs of each paradigm. The practice and filler items were reused. Each annotator rated 114 pairs; all completed the practice and filler items with 100% accuracy. The task took less than 10 minutes, and the annotators were paid $5. The raw accuracy on the newly validated pairs was 95.25% (κ = 0.8823). The manually created minimal pairs had higher accuracy than the code-generated ones (97.5% vs. 93%). After the second round, the mean raw human accuracy over all paradigms is 97.12%.

F By-phenomenon Results and Analyses
AltQ The multilingual LMs either prefer the sentences with ma or perform near chance. Although the monolingual LMs perform better, only bert-base-zh and ernie reach an accuracy higher than 90%. There can be multiple reasons for the unsatisfactory performance. First, haishi is multi-functional, which might leave the LMs unsure of its disjunctor usage. Second, ma only occurs in interrogative contexts, which can make the LMs prefer having it. Third, the LMs may not take a global view of the sentences but only attend to parts of them, which could explain their near-random guessing.

Anaphor (Gender) The LMs are gender biased. Figure 9 shows that, with a male subject, only four monolingual LMs (gpt2-zh, CPM, pert-base, and ernie) are gender neutral. When the subject is female, all LMs are biased (see Figure 12). The monolingual LMs strongly prefer a female object.
On one hand, because the LMs are strongly biased, using the female gender to test the anaphor phenomenon is inconclusive: comparing Figure 13 to Figure 12, it is unclear whether the LMs achieved high accuracy because they knew ziji or just because they preferred the female feature. The male self paradigm, on the other hand, shows that most monolingual LMs were able to use ziji as a cue to enforce gender agreement between the subject and the object. Among the multilingual LMs, only gpt3-davinci achieved a meaningful accuracy increase.
Turning to the female self with PP paradigm in Figure 14, even though the monolingual LMs prefer the female feature in the baseline, when there is a male distractor in the PP that is linearly closer to the reflexive, the LMs are affected, reflected as a decrease in accuracy. Fewer multilingual LMs are affected by the distractor; in fact, XLM-large and ByT5-small even show an increase in accuracy. On the male self with PP paradigm, only the mengzi models and gpt3-davinci are relatively unaffected by the distractor.

Anaphor (Number)
The plural number feature is used to elicit anaphor agreement. The feature is imposed on the subject by using a numeral + classifier, the plural marker men, or both; the plural feature on the object reflexive is marked by adding men to it. As it turns out, the number feature is not a good choice because most LMs are strongly biased (see Table 8).

Aspect Compared to le, guo has a fixed position in a VP and cannot take wide scope over the progressive marker zai. The results show that the LMs performed better on the guo paradigms than on the le ones. There is no obvious reason why CPM in Figure 18 performs extremely badly.

Classifier-noun agreement The first paradigm tested the LMs' knowledge of the relative order of a demonstrative and a classifier. Figure 20 shows that, except for the CPM, PanGu-α, mt5, and ByT5 models, all LMs' accuracies are comparable to the human annotators'.
Comparing the paradigms with simple nouns (Figures 21 and 22) to the ones with compound nouns (Figures 23 and 24), the multilingual models are more severely affected by the existence of a distractor (i.e., noun1 in a compound noun) than the monolingual ones. The LMs are less affected by the distance created by the long adjective (Figure 21 vs. Figure 22, and Figure 23 vs. Figure 24).

Definiteness Effect Except for CPM, PanGu-α, and pert-large, all monolingual models have decent accuracy. On the multilingual side, the ByT5 models are especially bad.

Polarity item Among the three PIs, huoduo huoshao (more or less) reliably occurs only in affirmative contexts; the negative PIs renhe (any) and shenme (what) can occur in negative, interrogative, and affirmative contexts. Fifteen of the eighteen LMs reached an accuracy on huoduo huoshao comparable to or even better than humans'. On the other two PIs, although quite a few LMs perform even better than humans, the accuracy values are overall worse and more uneven.

Relative clause In the resumptive noun paradigm, only CPM and pert-large perform satisfactorily. The other models are either near chance (lstm and mt5-small) or strongly misled by the repeated filler in the gap position. The reason could be that the LMs are vulnerable to repetition, or to local grammaticality. When the gap in the relative clause is filled by a pronoun that matches the gender of the head noun, fewer than half of the LMs are able to notice the minimal pair contrast.

Wh-fronting All monolingual models performed well, probably because wh-in-situ is a prominent feature of Mandarin Chinese. Except for the mt5 and ByT5 models, most multilingual models also did well. The gpt3-davinci model even reaches 100% accuracy.

G.1 CLiMP
The results are reported in Table 7 and Figure 6.

G.2 SLING
The results are reported in Table 8 and Figure 7 to Figure 33.

G.3 Statistical Tests
The results are reported in Table 9 to Table 12.

[Figure 1 example — A (acceptable): "They are already in the process of having a meal." B (unacceptable): 他们在吃了饭。 "They are already in the process of having finished a meal." Chinese speakers: 97.1% acc.; pretrained LMs: 69.7% acc.]

Figure 2: An illustration of the minimal pair generation process used to construct SLING.

Figure 5: The syntactic structure of the sentence in (10), with himself bound by DP1.

Figure 6: The box represents the inter-quartile range of the human and LM accuracies, with an orange line at the median and a green triangle at the mean. The whiskers extend from the box by 1.5 times the inter-quartile range. Dots are accuracy values past the end of the whiskers.

Figure 32: The LM accuracy on the bare wh-fronting paradigm.

Table 1: A comparison between CLiMP and SLING.

Table 5: The average percentage accuracy of the LMs and human performance on each phenomenon (random guessing is 50%). Overall, humans significantly outperform all LMs. No LM performs well on all phenomena, but monolingual LMs perform better than multilingual ones. A larger model size does not imply better performance. The vertical line separates the monolingual and multilingual models. The anaphor phenomenon accuracies include the baselines.

Table 6: Counts of one- to four-gram types in CLiMP and SLING, and word type counts by Jieba.
The index 1 indicates that its antecedent is DP1.
The above paradigms can be solved linearly, but the interaction between le and zai requires knowledge of hierarchy. The morpheme le can co-occur with zai if le takes scope over zai, but not the other way around. Based on this, two paradigms are formed: the first (13) tests the knowledge that le cannot scope under zai; the other (14) shows that le can scope over zai.
The last paradigm in the current phenomenon focuses on the adverb huoduo huoshao (more or less). It is less studied than renhe (any) or shenme (what). Nonetheless, a search of the CCL corpus confirms that there is no sentence in which bu or mei (not) negates the verb within 10 characters before or after huoduo huoshao. Hence, the acceptability of the minimal pairs is built on the absence of negation. Zhou and Han (2012) point out that resumptive pronouns may not occur in simple subject or direct object positions. The current study uses this property and constructs minimal pairs as in (21). If the LMs are not aware of the relative clause structure in those sentences, they can perform poorly because of the local coherence created by the filled-in gaps.

Table 8: Eighteen LMs' performance on SLING. The lines marked in blue are baselines; they are expected to have an accuracy of 50%, meaning the LMs are gender/number neutral.