Controlled Evaluation of Grammatical Knowledge in Mandarin Chinese Language Models

Prior work has shown that structural supervision helps English language models learn generalizations about syntactic phenomena such as subject-verb agreement. However, it remains unclear if such an inductive bias would also improve language models' ability to learn grammatical dependencies in typologically different languages. Here we investigate this question in Mandarin Chinese, which has a logographic, largely syllable-based writing system; different word order; and sparser morphology than English. We train LSTMs, Recurrent Neural Network Grammars, Transformer language models, and Transformer-parameterized generative parsing models on two Mandarin Chinese datasets of different sizes. We evaluate the models' ability to learn different aspects of Mandarin grammar that assess syntactic and semantic relationships. We find suggestive evidence that structural supervision helps models represent syntactic state across intervening content and improves performance in low-data settings, indicating that the benefits of hierarchical inductive biases in acquiring dependency relationships may extend beyond English.


Introduction
A rich collection of targeted linguistic evaluations has shown that neural language models can, perhaps surprisingly, learn many aspects of grammar from unlabeled linguistic input (e.g., Linzen et al., 2016; Gulordava et al., 2018; Warstadt et al., 2020; Hu et al., 2020; Xiang et al., 2021). There is also growing evidence that explicit modeling of syntax helps neural network-based language models represent syntactic state and exhibit human-like processing of non-local grammatical dependencies, including number agreement (Kuncoro et al., 2018), negative polarity licensing, filler-gap dependencies (Hu et al., 2020), and garden-path effects (Futrell et al., 2019; Hu et al., 2020). However, this line of research has focused primarily on the syntax of English. It is unclear to what extent structural supervision may help neural language models generalize for languages with differing typologies. Expanding these analyses beyond English has the potential to inform scientific questions about inductive biases for language acquisition, as well as practical questions about model architectures that approach language-independence (Bender, 2011).
Here, we perform a controlled case study of grammatical knowledge in Mandarin Chinese language models. The orthography and grammar of Chinese provide a useful testing ground given the differences from English and other Indo-European languages (see, e.g., Shopen 1985; Li et al. 2015). Whereas today's Indo-European languages like English generally use phone-based orthography, Chinese uses a logographic system where each character generally corresponds to a syllable. Most Mandarin Chinese words are one or two syllables, influencing the distribution of tokens. Grammatically, Chinese has almost no inflectional morphology, and corpus studies suggest that the average dependency length of Mandarin Chinese sentences is larger than that of English sentences (Jiang and Liu, 2015), with potential implications for language modeling. On the one hand, the need to track input across long dependencies may make structural supervision more beneficial for Mandarin Chinese language models; on the other hand, the prevalence of these dependencies may make it easier for them to learn to maintain non-local information without explicitly modeling syntax. Other fine-grained differences in typology also affect the types of syntactic tests that can be conducted. For example, since relative clauses precede the head noun in Chinese (unlike in English), we can manipulate the distance of a verb-object dependency by inserting relative clauses in between. These characteristics motivate our choice of Mandarin Chinese as a language for evaluating structurally supervised neural language models.
We design six classes of Mandarin test suites covering a range of syntactic and semantic relationships, some specific to Mandarin and some comparable to English. We train neural language models with differing inductive biases on two datasets of different sizes, and compare models' performance on our targeted evaluation materials. While most prior work investigating syntactically guided language models has used Recurrent Neural Network Grammar models (Dyer et al., 2016) -potentially conflating structural supervision with a particular parameterization -this work further explores structured Transformer language models (Qian et al., 2021). Our results are summarized as follows. We find that structural supervision yields greatest performance advantages in low-data settings, in line with prior work on English language models. Our results also suggest a potential benefit of structural supervision in deriving garden-path effects induced by local classifier-noun mismatch, and in maintaining syntactic expectations across intervening content within a dependency relation. These findings suggest that the benefits of hierarchical inductive biases in acquiring dependency relationships may not be specific to English.

Targeted Linguistic Evaluation
Linguistic minimal pairs have been used to construct syntactic test suites in English (e.g., Linzen et al., 2016; Marvin and Linzen, 2018; Mueller et al., 2020; Davis and van Schijndel, 2020; Warstadt et al., 2020) and other languages such as Italian, Spanish, French, and Russian (Ravfogel et al., 2018; Gulordava et al., 2018; Mueller et al., 2020; Davis and van Schijndel, 2020). A minimal pair is formed by two sentences that differ in grammaticality or acceptability but are otherwise matched in structure and lexical content. For example, in an English pair of the form 'The man drinks ...' (A) vs. '*The man drink ...' (B), the two sentences differ only at the main verb 'drinks'/'drink', which must agree with its subject 'The man' in number. Since only the third-person singular form 'drinks' agrees with the subject, (A) is grammatical, whereas (B) is not.
The closest work to ours is the corpus of Chinese linguistic minimal pairs (CLiMP; Xiang et al., 2021), which provides a benchmark for testing the syntactic generalization of Mandarin Chinese language models. While CLiMP focuses on building a comprehensive challenge set, the current work performs controlled experiments to investigate the effect of structural supervision on language models' ability to learn syntactic and semantic relationships. Moreover, the test items in CLiMP are (semi-)automatically generated, which may result in semantically anomalous sentences and introduce noise into the evaluation phase. In contrast, we manually construct the items in our test suites to sound as natural as possible.

Evaluation Paradigm
The general structure of our evaluation paradigm follows that of Hu et al. (2020). We use surprisal as a linking function between the language model output and human expectations (Hale, 2001). The surprisal of a word w_i is defined as its negative log probability conditioned on the preceding words in the same context:

S(w_i) = -log P(w_i | w_1 . . . w_{i-1})

Our test suites take the form of a group of handwritten, controlled sentence sets. Each sentence set, or test item, contains at least two minimally differing sentences, and each sentence contains a stimulus prefix and a downstream target region. The content of the target region remains fixed across the sentence variants within the test item, while the content of the stimulus varies in a minimal manner that modulates the sentence's grammaticality or acceptability. The target region, where we measure the surprisal output of a certain model, is underlined in all the example items described in Section 3.2.
For each test item, we measure success by computing the difference in surprisals assigned by the model to the target region, conditioned on the ungrammatical vs. grammatical stimulus prefixes. If the model successfully captures the dependency, it should be less surprised at the grammatical target region than the ungrammatical one, leading to a positive surprisal difference (ungrammatical − grammatical). If this criterion is satisfied, then the model achieves a success score of 1, and 0 otherwise. These binary scores are averaged over test suite items and/or classes to obtain accuracy scores.
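This scoring criterion can be sketched in a few lines of Python. This is a hypothetical illustration, not the authors' released code; the function names are our own:

```python
import math

def surprisal(prob: float) -> float:
    """Surprisal in bits: the negative log probability of a word in context."""
    return -math.log2(prob)

def item_score(p_target_grammatical: float, p_target_ungrammatical: float) -> int:
    """Score 1 if the target region is less surprising after the grammatical
    prefix than after the ungrammatical one, else 0."""
    diff = surprisal(p_target_ungrammatical) - surprisal(p_target_grammatical)
    return 1 if diff > 0 else 0

def suite_accuracy(items) -> float:
    """Average binary scores over (p_grammatical, p_ungrammatical) pairs."""
    scores = [item_score(pg, pu) for pg, pu in items]
    return sum(scores) / len(scores)
```

Because the target region is held fixed across conditions, the comparison isolates the effect of the stimulus prefix on the model's expectations.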

Test Suites
We organize our materials into six classes of test suites, each of which assesses models' knowledge of a particular linguistic phenomenon. The test suite classes include both syntactic and semantic dependencies, some of which do not exist in languages that have been the focus of targeted evaluation, and thus have not yet been explored. MISSING OBJECT evaluates syntactic knowledge of argument structure, SUBORDINATION and GARDEN PATH SUBJECT/OBJECT assess representation of syntactic state, and CLASSIFIER-NOUN COMPATIBILITY and VERB-NOUN COMPATIBILITY evaluate a combination of syntactic and semantic factors (Balari, 1992). Looking cross-linguistically, one class assesses a phenomenon present in Mandarin but not English (CLASSIFIER-NOUN COMPATIBILITY); two classes assess an expectation-violation phenomenon that is present in both Mandarin and English but arises from different sources (GARDEN PATH SUBJECT/OBJECT); and three classes assess phenomena present in both languages (VERB-NOUN COMPATIBILITY, MISSING OBJECT, and SUBORDINATION).
For each phenomenon targeted by a given test suite class, the two components of the syntactic/semantic dependency often occur adjacently in a sentence. However, if a language model robustly represents the dependency, then it should maintain its expectations even when intervening content is present between the upstream and downstream ends of the dependency. We assess the robustness of the models' grammatical knowledge on each test suite class by inserting three commonly-used types of modifiers to create non-local dependencies: adjectives, subject-extracted relative clauses (SRCs) and their variants, and object-extracted relative clauses (ORCs). The resulting set of test suites is described in greater detail in the following sections.

1 All code and data, including test suites, can be found at https://github.com/YiwenWang03/syntactic-generalization-mandarin

Classifier-Noun Compatibility
Classifiers are a special class of words in Chinese languages that are obligatorily used with numerals in a noun phrase. Each specific classifier in Mandarin is compatible only with a set of noun referents that is largely semantically delimited. The general classifier "个" (CL-GENERAL), in contrast, is compatible with most nouns. In the CLASSIFIER-NOUN COMPATIBILITY test suites, we evaluate whether a model expects nouns from a semantically compatible class over those from a semantically incompatible class, given a specific classifier.
"The child heard a familiar song." 'The child heard a familiar song." "The child heard a familiar album." Example (1) shows a test item from the suite with adjectival modifiers. We also consider ORCs and SRCs as modifiers in this test suite class (see Appendix A.2.1). Here we consider the classifier "首"(CL SONG ), which is compatible with the noun "歌曲"(song) but not the noun "专辑"(album), and the classifier "张"(CL ALBUM ), which is compatible with the noun "专辑" but not the noun "歌曲". The four variants (1.a-d) show four possible combinations of the two classifiers and the two nouns.
Here the target region is the sentence-final noun together with the period, and we measure surprisal over this entire region. We also check that the two nouns compared within each test item have similar frequency in the training data. A human-like language model should assign lower surprisals to the target regions in (1.a) and (1.c), the items with an appropriate classifier-noun pair, and higher surprisals to the target regions in (1.b) and (1.d), the items with mismatched classifier-noun pairs. In other words, we evaluate four pair-wise comparisons to see whether they meet the following criteria: (1.b) > (1.a), (1.d) > (1.c), (1.d) > (1.a), and (1.b) > (1.c). We report mean accuracy averaged across all four pair-wise comparisons as a model's accuracy on a given test suite.
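The four pair-wise comparisons can be sketched as follows. This is our own illustration; the variant labels and function name are hypothetical:

```python
def classifier_noun_accuracy(s: dict) -> float:
    """s maps variant labels '1a'..'1d' to target-region surprisals.
    (1.a) and (1.c) are matched classifier-noun pairs; (1.b) and (1.d)
    are mismatched. Each comparison succeeds when the mismatched
    variant is more surprising than the matched one."""
    comparisons = [("1b", "1a"), ("1d", "1c"), ("1d", "1a"), ("1b", "1c")]
    wins = sum(1 for mismatch, match in comparisons if s[mismatch] > s[match])
    return wins / len(comparisons)
```

The per-item accuracy is then averaged over all items in the suite.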

Garden-Path Effects
Garden-path effects are a class of phenomena in human sentence processing, where the incremental parsing state of a sentence prefix needs to be reanalyzed as the comprehender processes a downstream disambiguator region (Bever, 1970). We construct a set of test suites evaluating whether models exhibit garden-path effects induced by locally mismatched classifier-noun pairs situated within a globally coherent sentence, inspired by previous human behavioral studies (Wu et al., 2018).
To illustrate the classifier-induced garden-path effect, consider examples (2.a) and (2.b), both glossed 'He left the factory that the friend started.' In (2.b), the general classifier "个" (CL-GENERAL) is compatible with the immediately following noun, "朋友" (friend), resulting in a garden-path interpretation in which this noun is the object of the main-clause verb "离开" (leave). This interpretation is disconfirmed, however, by the next verb, "开" (start), which indicates that the noun "朋友" is actually the start of a relative clause preceding the main-clause object, "工厂" (factory). This means that the relative-clause verb "开" should be highly surprising in (2.b). In (2.a), in contrast, the specific classifier "间" (CL-BUILDING) is incompatible with "朋友" (though it is compatible with "工厂"), cueing the upcoming relative-clause structure (Wu et al., 2018). A human-like language model should thus show a higher surprisal at the target verb "开" for (2.b) than (2.a).
We design test suites following the structure of examples (2) and (3). Examples (2.a-b) show the structure of items in the GARDEN PATH OBJECT set, where an ORC modifies the object of the main-clause verb and the target region is the verb immediately following the noun closest to the classifier. Examples (3.a-b) show the basic structure of items in the GARDEN PATH SUBJECT set, where an ORC modifies the sentence subject.
"The factory that the friend started has closed." Here in (3.b), the garden-path interpretation is that the ORC's subject "朋友" may be initially analyzed as the subject of the sentence during incremental processing; the target region of the garden-path effect is the disambiguating word "的"(DE) that ends the relative clause and precedes the head noun of the true main subject of the sentence, "工厂". The criterion for getting a test item correct is that the model shows lower surprisal at the target region "开" (start) in (2.a) than (2.b), and lower surprisal at the target region "的" in (3.a) than (3.b). Examples (2) and (3) have no modifiers in between the classifier stimulus and the target region. To manipulate the length of dependency, we also consider same types of modifiers as in CLASSIFIER-NOUN COMPATIBILITY in the full test suites.

Verb-Noun Compatibility
Similar to CLASSIFIER-NOUN COMPATIBILITY, VERB-NOUN COMPATIBILITY is also a group of semantic test suites, assessing the compatibility between a transitive verb and its direct object noun.
(4) shows an example with an adjective modifier in between the verb and its object, where (4.b) is semantically inconsistent since the verb "阅读" (read) does not match the object noun "电脑" (computer). The stimulus is the transitive verb, and the target region is the object noun and the period (which encapsulates the possibility of an incomplete but potentially grammatical sentence). We insert adjective, ORC, and SRC modifiers (the same as in CLASSIFIER-NOUN COMPATIBILITY) between the verb and object. The expected behavior is that the surprisal at the target region is lower in the semantically consistent variant (here, (4.a)).

Missing Object
Next, we turn to phenomena primarily characterized by syntactic expectations. The first of these test suite classes, MISSING OBJECT, assesses models' ability to track a direct object required by a transitive main verb. Consider (5.a) and (5.b) (no modifier case), where (5.b) is glossed 'The journalist interviewed.' (5.a) is grammatical and (5.b) is not, since the main verb "采访" (interview) requires a downstream direct object. To test whether models learn this dependency, we record the model's surprisal at the sentence-final period "。". The model should be more surprised to see "。" in (5.b), since a sentence is less likely to end without an object required by the verb. Note that this is a case where we assess the human-likeness of an autoregressive language model by whether it is temporarily confused as to the structural interpretation midway through the sentence, as evidenced by its next-word predictions.
To continue our investigation of long-distance dependencies, we add three types of modifiers: a single SRC, coordinated SRCs, and embedded SRCs, exemplified in Appendix A.2.5. We expect the insertion of modifiers after the main verb to increase difficulty, as the model must track the verb-object dependency over a greater amount of content. In addition, the coordinated and embedded SRCs are longer and more syntactically complex than the single SRCs.

Subordination
Finally, our SUBORDINATION test suites assess the ability of a model to maintain global expectations for a main clause while inside a local subordinate clause. For example, consider (6.a) and (6.b) (no modifier case), where (6.b) ends with a period immediately after the subordinate clause, leaving the expected main clause unrealized. In this case, we test the surprisal at the sentence-final period. If the model correctly represents the gross syntactic state within the subordinate clause, then it should assign higher surprisal to the period in sentences like (6.b) than in sentences like (6.a). We include modifiers (the same types as in CLASSIFIER-NOUN COMPATIBILITY) before the subject noun inside the matrix clause.

Models
To investigate the effect of syntax modeling in learning the dependencies described in Section 3.2, we train four classes of neural language models by crossing two types of parameterization with two types of supervision. Two of our model classes are trained for vanilla next-word prediction: Long Short-Term Memory networks (LSTM; Hochreiter and Schmidhuber, 1997) and Transformers (Vaswani et al., 2017). The remaining two model classes are based on the LSTM and Transformer architectures, but explicitly incorporate syntactic structure during training: Recurrent Neural Network Grammars (RNNG; Dyer et al., 2016) and Transformer-parameterized parsing-as-language-modelling models (PLM; Qian et al., 2021). While prior work on structural supervision in English language models has focused primarily on RNNGs, both RNNGs and PLMs are joint probabilistic models of terminal word sequences along with the corresponding constituency parses. Thus, they both explicitly model syntax (in contrast to their vanilla language modeling counterparts), while featuring different parameterizations.
We use the PyTorch implementation of the LSTM (Paszke et al., 2019). The Transformer and PLM models are based on the HuggingFace GPT-2 architecture (Radford et al., 2019). While we use a model architecture equivalent in size to pre-trained GPT-2, we do not use the pretrained tokenizer. All of our models are trained on a pre-tokenized Mandarin corpus and share the same vocabulary for each training dataset. Model sizes are reported in Table 3a in Appendix B.
For the LSTM and Transformer models, we calculate the surprisal at the target region by taking the negative log of the model's predicted conditional probability. We estimate the RNNGs' and PLMs' word surprisals with word-synchronous beam search (Stern et al., 2017), following Hale et al. (2018). The action beam size is 100 and the word beam size is 10. For regions with multi-token content, we sum the surprisals of the individual tokens.
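For a multi-token target region, summing per-token surprisals is equivalent to taking the negative log of the product of the tokens' conditional probabilities. A minimal sketch (our own illustration, not the authors' code):

```python
import math

def region_surprisal(token_probs) -> float:
    """Surprisal of a multi-token target region: the sum of per-token
    surprisals, i.e. -log2 of the product of the conditional
    probabilities assigned to each token in the region."""
    return sum(-math.log2(p) for p in token_probs)
```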
As a baseline, we additionally implement an n-gram model with Kneser-Ney smoothing (Kneser and Ney, 1995) using the SRILM toolkit (Stolcke, 2002). For cases where the smoothed n-gram model assigns identical probabilities to the target region across different conditions in a test item, we break the tie by flipping a fair coin to determine the outcome for that particular item.
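The tie-breaking scheme for the n-gram baseline can be sketched as follows. This is our own illustration of the coin-flip rule; the SRILM scoring itself is not shown:

```python
import random

def tie_broken_score(s_grammatical: float, s_ungrammatical: float,
                     rng: random.Random) -> int:
    """Binary item score for the n-gram baseline. When the two conditions
    receive identical surprisals, the outcome is decided by a fair coin."""
    if s_ungrammatical > s_grammatical:
        return 1
    if s_ungrammatical < s_grammatical:
        return 0
    return rng.randint(0, 1)  # identical surprisals: flip a fair coin
```

Seeding the random generator keeps the tie-breaking reproducible across evaluation runs.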

Corpus Data
We consider two datasets to explore how training data size affects models' ability to acquire grammatical knowledge (similar to Hu et al., 2020). The LSTM and Transformer models are trained on the raw text only, whereas the RNNG and PLM models are trained with additional syntactic annotations.
Chinese Treebank (CTB) The Chinese Treebank (CTB 9.0; Xue et al., 2016) is a Chinese language corpus annotated with Penn Treebank-style (Marcus et al., 1993) constituency parses. We use the Newswire, Magazine articles, Broadcast news, Broadcast conversations, and Weblogs sections, as we expect these sources to contain well-formed sentences with a variety of syntactic constructions. We follow the split defined by Shao et al. (2017) to construct training, development, and test sets.
Xinhua News Data To investigate the effects of increased training data size on models' syntactic generalization, we create a larger corpus combining CTB with a subset of the Xinhua News corpus (Wu, 1995). We filter out extremely long sentences (>100 tokens) and map tokens occurring less than twice in the training data to fine-grained UNK tokens. Appendix C reports full statistics of the training corpora.
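These preprocessing steps can be sketched as follows. This is a simplified illustration: the paper maps rare tokens to fine-grained UNK classes, whereas this sketch uses a single placeholder symbol:

```python
from collections import Counter

MAX_LEN = 100    # drop extremely long sentences (>100 tokens)
MIN_COUNT = 2    # tokens seen fewer than twice become UNK

def preprocess(sentences):
    """Filter over-long sentences, then replace rare tokens with '<unk>'.
    Each sentence is a list of tokens."""
    kept = [s for s in sentences if len(s) <= MAX_LEN]
    counts = Counter(tok for s in kept for tok in s)
    return [[tok if counts[tok] >= MIN_COUNT else "<unk>" for tok in s]
            for s in kept]
```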
We train ten types of language models, crossing model architecture (n-gram, LSTM, RNNG, Transformer, and PLM) with dataset (Chinese Treebank and hybrid Xinhua dataset). Each model type is trained with multiple random seeds, and results are reported as averages across these instances. Model perplexity scores are reported in Table 3b.

Results
We begin by reporting the overall performance of the models on the test suite classes introduced in Section 3.2. Figure 1 shows accuracy scores averaged across test suites within each class. First, we note that the n-gram baseline performs the worst overall among the language models, which matches our expectations, since syntactic dependencies beyond the 5-token window are difficult for the model to capture.
Turning to the neural models, we assess the effects of training data size and architecture by examining the mean accuracy scores across test suites for each model. We fit separate linear mixed-effects models comparing the effects of data size, using the lme4 package in R (Bates et al., 2015). The dependent variable is the mean accuracy score (Figure 2). For each language model type, the main effect is a binary indicator of whether the model is trained on the CTB dataset or the larger Xinhua dataset. We include test suite class, modifier type, and model seed as random factors with random intercepts and slopes. Across test suite classes, the Transformer-Xinhua models outperform their smaller CTB counterparts (p < .05), but the effect of data size is less clear for the LSTM, RNNG, and PLM models.
Comparing different model architectures, we find that in the SUBORDINATION test suite class, RNNGs trained on the smaller CTB dataset achieve performance comparable to LSTMs trained on the larger Xinhua dataset (p = .938; linear mixed-effects model with model type as the main factor). Furthermore, for both VERB-NOUN COMPATIBILITY and SUBORDINATION, the RNNGs perform better than the LSTMs and Transformers when trained on the smaller CTB dataset (see Figure 1). These results suggest that an inductive bias for learning hierarchical structure may help in low-data Chinese settings, consistent with prior work on English language models.

Table 1: Comparison between language models that perform explicit syntax modeling (RNNGs and PLMs) and their vanilla counterparts (LSTMs and Transformers). ↑ represents statistically significant improvement in structurally supervised language models, and ↓ represents the opposite direction. *: p ≤ .05, **: p ≤ .01, ***: p ≤ .001.

Overall, we find only suggestive benefits of structural supervision. While prior work in English (Kuncoro et al., 2018; Hu et al., 2020; Qian et al., 2021) has shown that RNNGs and PLMs can improve upon their vanilla LSTM and Transformer counterparts, the improvement is relatively smaller for Mandarin. To compare the performance of models with and without structural supervision, we fit a generalized linear mixed-effects model on the binary outcome (whether or not the model predicts a positive surprisal difference) for each test item, within each combination of test suite class, training data, and model parameterization. We consider a binary indicator of whether or not the model performs syntax modeling explicitly as the main factor, and include test item, modifier type, and model seed as random factors. Table 1 summarizes the results.
For the CTB-only models, the structurally supervised models (RNNG and PLM) achieve accuracy either significantly greater or comparable to the corresponding vanilla models (LSTM and Transformer) across all test suite classes. However, the pattern is less clear for the Xinhua-trained models: the structurally supervised models lead to both gains and losses in accuracy compared to the vanilla models. We conjecture that word segmentation and parsing errors in the automatically-annotated Xinhua dataset might have affected the learning process of the model. In addition, the Xinhua training data explored in this work is still not very large in size (<10 million tokens), so it could be that further benefits of syntactic supervision may be more pronounced with much larger training datasets. Nevertheless, the suggestive benefits of explicitly modelling syntax with very small amounts of data could have implications for language modeling in low-resource settings.
We also assess whether our models are better at capturing syntactic or semantic relationships. To do this, we group the six classes of test suites into three categories: syntactic dependency (MISSING OBJECT and SUBORDINATION), semantic relationship (CLASSIFIER-NOUN COMPATIBILITY and VERB-NOUN COMPATIBILITY), and a hybrid capturing semantically-driven syntactic state representations (GARDEN PATH SUBJECT and GARDEN PATH OBJECT). We find that the average accuracy score is higher in the syntactic test suites than in the semantic test suites (p < .001), 8 suggesting that the language models in our study - including those with no explicit syntax modeling - find it easier to learn syntactic dependencies than semantic relationships.

Robustness to Intervening Content
Next, we investigate the effect of structural supervision on tracking dependencies across intervening content. We focus our analysis on MISSING OBJECT, 9 as the modifiers considered in these test suites can be ordered according to their syntactic complexity (no SRC < single SRC < coordinated SRCs < embedded SRCs). Figure 3 shows the models' performance on this test suite class as a function of modifier complexity, ranging from least to most difficult along the horizontal axis of each subplot. The vanilla LSTMs and Transformers clearly degrade in performance as the intervening material between the stimulus and the target region grows in length and complexity. In contrast, it is also visually apparent that the RNNG models do not degrade sharply as modifiers get longer and more complex. We fit linear mixed-effects models to investigate the relationship between modifier type and accuracy for each language model. Our results appear to confirm that both the RNNGs and PLMs do not significantly degrade on the single SRC and coordinated SRC modifiers (compared to the no-modifier baseline). For the most complex modifier (embedded SRC), all models suffer in accuracy, but the magnitude of this effect is smaller for the RNNGs and PLMs compared to the LSTMs and Transformers. Taken together, our results suggest that while structural supervision does not give the language models a significant advantage over their vanilla counterparts in accuracy scores, it does seem to help the models maintain syntactic expectations despite the intervention of syntactically complex content.

8 See Appendix D.1 for details.

9 The modifiers used in the other test suite classes (adjective, ORC, SRC) are not as directly comparable, since they vary in multiple dimensions, not just complexity.

Garden-Path Effects
Building upon the CLASSIFIER-NOUN COMPATIBILITY results, we investigate whether a mismatched local classifier-noun pair may serve as an early cue for the upcoming RC structure, inducing a garden-path effect. Figure 1 shows that the neural models systematically perform better on GARDEN PATH OBJECT than GARDEN PATH SUBJECT. We conjecture that neural language models may implicitly predict an ORC modifying the subject noun regardless of the type of classifier. Therefore, the language models may be more prepared to see an ORC modifying the subject "工厂" in (3.b) than one modifying the object "工厂" in (2.b).
To gain a better understanding of model performance on these two test suite classes, we examine the average difference in target-region surprisal values between sentences with and without a local classifier-noun mismatch (Figure 4). The average difference gives detailed information on how each language model processes the garden-path region, complementary to the binary success/failure score achieved by a model on a given test item. Figure 4a shows that the neural models have a positive average surprisal difference across test items for GARDEN PATH OBJECT. Furthermore, the magnitude of this difference increases with the inclusion of the larger Xinhua dataset, suggesting that with more data, models become more confident in taking the incongruence between the classifier and the noun as a pre-RC cue. 11 On the other hand, recall that all models perform rather poorly on GARDEN PATH SUBJECT (Figure 1). Figure 4b shows that only the RNNGs trained on the Xinhua corpus output a statistically significant positive average surprisal difference (p < .001; one-sample t-test). PLM-Xinhua, although not statistically significant, has a positive mean surprisal difference as well. This is because the magnitude of the surprisal differences predicted by the RNNG-Xinhua and PLM-Xinhua models is greater when they exhibit the predicted garden-path effects, and smaller when they do not follow the predicted direction. Therefore, structural supervision may help models represent syntactic state in a more human-like way.

11 This result also accords with prior findings that classifiers facilitate object-modifying RC processing (Wu et al., 2014, 2018).
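The one-sample t-test on per-item surprisal differences can be sketched as follows. This is our own illustration, computing only the t statistic against a null mean of zero (degrees of freedom and the p-value lookup are omitted):

```python
import math
import statistics

def one_sample_t(diffs) -> float:
    """t statistic for testing whether the mean surprisal difference
    across test items differs from zero: t = mean / (sd / sqrt(n)),
    using the sample standard deviation."""
    n = len(diffs)
    mean = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)
    return mean / (sd / math.sqrt(n))
```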

Conclusion
This work evaluates Mandarin Chinese language models on six grammatical relationships, including syntactic dependencies and semantic compatibilities. We use Mandarin as a case study for analyzing how the potential advantages of explicit syntax modeling (as performed by RNNGs and PLMs) generalize from English to a typologically different language. Although structural supervision does not boost the sequential model's learning in all relationships tested in this study, we find that it does allow the RNNG and PLM models to learn dependencies robust to increasingly complex modifiers, as seen in the MISSING OBJECT test suites. Compared to the vanilla sequence-based LSTM and Transformer models, explicit syntactic modeling also seems to help with grammatical generalization in settings with small training data. We also find that Mandarin syntactic dependencies (such as tracking gross syntactic state within a subordinate clause) tend to be easier to learn than semantic dependencies (such as the compatibility between classifiers and nouns). This study is one of the first steps towards understanding the role structural inductive biases may play in learning semantic and syntactic relationships in typologically diverse languages.

B Model Information
We find that the perplexity score reported in Table 3b is comparatively high for the RNNGs compared to that reported in Dyer et al. (2016). This may be because the CTB data we use includes some informal and spoken language, such as weblogs and broadcast conversations.

C Corpus Statistics

D.1 Syntactic vs. Semantic Relationships

In this section, we focus on comparing how well models learn syntactic dependencies and semantic compatibility. Recall that we group CLASSIFIER-NOUN COMPATIBILITY and VERB-NOUN COMPATIBILITY as semantic relationships, and MISSING OBJECT and SUBORDINATION as syntactic dependencies. We compute the accuracy scores of these two categories, as shown in Figure 6. Here we exclude the n-gram models, since their performance deviates substantially from that of the other language models. Adding modifiers between the stimulus and the target region seems to narrow the gap somewhat. This, again, might be due to the fact that for MISSING OBJECT, we intentionally make the intervening content increasingly hard to learn, dragging down the average accuracy score for syntactic dependencies. As when testing the effect of data size, we determine statistical significance with a linear mixed-effects model on the accuracy score. We predict accuracy with a binary indicator of whether or not the test is in the syntactic group that we define, including model type and test item as random factors. We find that syntactic relationships seem to be easier to learn than semantic ones regardless of the intervening content (p < .001).

D.2 Intervening Content in MISSING OBJECT
In this section, we discuss our statistical analysis of language models' robustness to intervening content. We fit separate linear mixed-effects models for each language model, with the accuracy score as the dependent variable and modifier type as the predictor. We include model seed and data size as random factors, both with random intercepts and random slopes. Table 5 summarizes our results. Recall that in MISSING OBJECT, we consider four types of modifiers: none, single SRC, coordinated SRCs, and embedded SRCs. Each cell in Table 5 represents the coefficient of a particular modifier with respect to the no-modifier baseline for a particular language model class. For the single SRC and coordinated SRC modifiers, neither the RNNG nor the PLM models show significant degradation in the accuracy score. All models suffer from the embedded SRC modifier, with negative coefficients that are all statistically significant. However, RNNGs and PLMs seem to be affected the least, with larger coefficients than the vanilla LSTMs and Transformers. This suggests that structural supervision helps language models learn syntactic dependencies that are more robust to intervening content.

E Accuracy by Model and Test Suite Class
See