Investigating Transfer Learning in Multilingual Pre-trained Language Models through Chinese Natural Language Inference

Multilingual transformers (XLM, mT5) have been shown to have remarkable transfer skills in zero-shot settings. Most transfer studies, however, rely on automatically translated resources (XNLI, XQuAD), making it hard to discern the particular linguistic knowledge that is being transferred, and the role of expert annotated monolingual datasets when developing task-specific models. We investigate the cross-lingual transfer abilities of XLM-R for Chinese and English natural language inference (NLI), with a focus on the recent large-scale Chinese dataset OCNLI. To better understand linguistic transfer, we created 4 categories of challenge and adversarial tasks (totaling 17 new datasets) for Chinese that build on several well-known resources for English (e.g., HANS, NLI stress-tests). We find that cross-lingual models trained on English NLI do transfer well across our Chinese tasks (e.g., in 3/4 of our challenge categories, they perform as well/better than the best monolingual models, even on 3/5 uniquely Chinese linguistic phenomena such as idioms, pro drop). These results, however, come with important caveats: cross-lingual models often perform best when trained on a mixture of English and high-quality monolingual NLI data (OCNLI), and are often hindered by automatically translated resources (XNLI-zh). For many phenomena, all models continue to struggle, highlighting the need for our new diagnostics to help benchmark Chinese and cross-lingual models. All new datasets/code are released at https://github.com/huhailinguist/ChineseNLIProbing.


Introduction
Recent pre-trained multilingual transformer models, such as XLM(-R) (Conneau and Lample, 2019; Conneau et al., 2020), mT5 (Xue et al., 2020) and others (Liu et al., 2020; Lewis et al., 2020) have been shown to be successful in NLP tasks for several non-English languages (Khashabi et al., 2020; Choi et al., 2021), as well as in multilingual benchmarks (Devlin et al., 2019; Conneau et al., 2020; Xue et al., 2020; Artetxe et al., 2020). A particular appeal is that they can be used for cross-lingual and zero-shot transfer. That is, after pre-training on a raw, unaligned corpus consisting of text from many languages, models can be subsequently fine-tuned on a particular task in a resource-rich language (e.g., English) and directly applied to the same task in other languages without requiring any additional language-specific training.
Given this recent progress, a natural question arises: does it make sense to invest in large-scale task-specific dataset construction for low-resourced languages, or does cross-lingual transfer alone suffice for many languages and tasks? A closely related question is: how well do multilingual models transfer across specific linguistic and language-specific phenomena? While there has been much recent work on probing multilingual models (Wu and Dredze, 2019; Pires et al., 2019; Karthikeyan et al., 2019, inter alia), a particular limitation is that most studies rely on automatically translated resources such as XNLI (Conneau et al., 2018) and XQuAD (Artetxe et al., 2020), which makes it difficult to discern the particular linguistic knowledge that is being transferred and the role of large-scale, expert-annotated monolingual datasets when building task- and language-specific models.
In this paper, we investigate the cross-lingual transfer abilities of XLM-R (Conneau et al., 2020) for Chinese natural language inference (NLI). Our focus on Chinese NLI is motivated by the recent release of the first large-scale, human-annotated Chinese NLI dataset OCNLI (Original Chinese NLI) (Hu et al., 2020) 2, which we use to directly investigate the role of high-quality task-specific data vs. English-based cross-lingual transfer. To better understand linguistic transfer, and help benchmark recent SOTA Chinese NLI models, we created 4 categories of challenge/adversarial tasks (totaling 17 new datasets) for Chinese that build on several well-established resources for English and the literature on model probing (see Poliak (2020)). Our new resources, which are summarized in Table 1, include adversarial tasks modeled on the English NLI stress tests (Naik et al., 2018), as well as a collection of basic reasoning and logic semantic probes for Chinese based on Richardson et al. (2020).
Our results are largely positive: we find that cross-lingual models trained exclusively on English NLI do transfer relatively well across our new Chinese tasks (e.g., in 3/4 of the challenge categories shown in Table 1, they perform overall as well as or better than the best monolingual Chinese models without additional specialized training on Chinese data, and have competitive performance on OCNLI). A particularly striking result is that such models even perform well on 3/5 uniquely Chinese linguistic phenomena such as idioms and pro-drop, providing evidence that many language-specific phenomena do indeed transfer. These results, however, come with important caveats: on several phenomena we find that models continue to struggle and are far outpaced by conservative estimates of human performance (e.g., our best model on Chinese HANS remains ∼19% behind human performance), highlighting the need for more language-specific diagnostic tests. Also, fine-tuning models on mixtures of English NLI data and high-quality monolingual data (OCNLI) consistently performs the best, whereas mixing with automatically translated datasets (XNLI-zh) can greatly hinder model performance. This last result shows that high-quality monolingual datasets still play an important role when building cross-lingual models. However, the particular type of monolingual dataset that is needed can vary and is best informed by targeted behavioral testing of the type we pursue here.
2 To our knowledge, OCNLI is currently the largest non-English NLI dataset that was annotated in the style of English MNLI without any translation.

Related Work
There has been a lot of work on trying to understand multilingual transformers (Wu and Dredze, 2019; Pires et al., 2019), which has focused on either examining the representation of different layers in the transformer architecture or the lexical overlap between languages. Karthikeyan et al. (2019) investigate the role of network depth and number of attention heads, as well as syntactic/word-order similarity on the cross-lingual transfer performance. In addition to studies cited at the outset, positive results of cross-lingual transfer across a wide range of languages are reported in Wu and Dredze (2020); Nozza et al. (2020), with a focus on transfer across specific tasks such as POS tagging, NER; in contrast, we focus on different categories of linguistic transfer, which has received less attention, as well as the role of monolingual data for transfer in NLI.
Studies into the linguistic abilities and robustness of current NLI models have proliferated in recent years, partly owing to the discovery of systematic biases, or annotation artifacts (Gururangan et al., 2018; Poliak et al., 2018), in benchmark NLI datasets such as SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018). This has been coupled with the development of new adversarial tests such as HANS (McCoy et al., 2019) and the NLI stress-tests (Naik et al., 2018), as well as several new linguistic challenge datasets (Glockner et al., 2018; Richardson et al., 2020; Geiger et al., 2020; Yanaka et al., 2019; Saha et al., 2020; Goodwin et al., 2020, inter alia) that focus on a wide range of linguistic and reasoning phenomena. All of this work focuses exclusively on English, whereas we focus on constructing analogous probing datasets tailored to Chinese to help advance research on Chinese NLI and cross-lingual transfer.
There has been a surge in the development of NLI resources for languages other than English. Such resources are often created in the following two ways: (1) from scratch, in the style of MNLI (Williams et al., 2018), where annotators are used to produce hypotheses and inference labels based on a provided set of premises, as pursued for Chinese OCNLI (Hu et al., 2020), or SciTail (Khot et al., 2018), where sentences are paired automatically and labeled by annotators (Amirkhani et al., 2020; Hayashibe, 2020). (2) Through automatic (Conneau et al., 2018; Budur et al., 2020; Real et al., 2020)

Dataset creation
In this section, we describe the details of the 4 types of challenge datasets we constructed for Chinese to study cross-lingual transfer (see details in Table 1). They fit into two general categories: Adversarial datasets (Section 3.1) built largely from patterns in OCNLI (Hu et al., 2020) and XNLI (Conneau et al., 2018) and Probing/diagnostic datasets (Section 3.2), which are built from scratch in a parallel fashion to existing datasets in English.
While we aim to mimic the annotation protocols pursued in the original English studies, we place the additional methodological constraint that each new dataset is vetted, either through human annotation using a disjoint set of Chinese linguists, or through internal mediation among local Chinese experts; details are provided below.

Adversarial dataset
Examples from the 7 adversarial tests we created are illustrated in Table 2. Chinese HANS is built from patterns extracted from the large-scale Chinese NLI dataset OCNLI (Hu et al., 2020), whereas the Distraction, Antonym, Synonym and Spelling subsets are built from an equal mixture of OCNLI and XNLI-zh (Conneau et al., 2018) data; in the latter case, this design allows us to fairly compare the effect of training on expert-annotated data (i.e., OCNLI) vs. automatically translated data (i.e., XNLI-zh), as detailed in Section 4.
McCoy et al. (2019) discovered systematic biases/heuristics in the MNLI dataset, which they named "lexical/subsequence/constituent" overlap. "Lexical overlap" is defined as the set of pairs where the vocabulary of the hypothesis is a subset of the vocabulary of the premise. For example, the premise "The boss is meeting the client." and the hypothesis "The client is meeting the boss." exhibit lexical overlap and stand in an entailment relation. However, lexical overlap does not necessarily mean the premise will entail the hypothesis, e.g., "The judge was paid by the actor." does not entail "The actor was paid by the judge." (examples from McCoy et al. (2019)). Thus a model relying on the heuristic will fail catastrophically in the second case.
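The overlap heuristics can be made concrete with a small check. The sketch below is illustrative only and is not the extraction script adapted from the HANS repository: the helper names are hypothetical, tokenization is a plain regular expression on English text (Chinese data would need a word segmenter), and the subsequence definition (the hypothesis is a contiguous token subsequence of the premise) follows the English HANS paper rather than anything stated in this section.

```python
import re

def tokens(sentence):
    # Lowercase and keep word characters only (an assumption for this sketch;
    # a Chinese pipeline would use a word segmenter instead).
    return re.findall(r"\w+", sentence.lower())

def lexical_overlap(premise, hypothesis):
    """True if every word of the hypothesis also occurs in the premise."""
    return set(tokens(hypothesis)) <= set(tokens(premise))

def subsequence_overlap(premise, hypothesis):
    """True if the hypothesis is a contiguous token subsequence of the premise."""
    p, h = tokens(premise), tokens(hypothesis)
    return any(p[i:i + len(h)] == h for i in range(len(p) - len(h) + 1))

pairs = [
    ("The boss is meeting the client.", "The client is meeting the boss.", "entailment"),
    ("The judge was paid by the actor.", "The actor was paid by the judge.", "non-entailment"),
]
for premise, hypothesis, label in pairs:
    # Both pairs satisfy the heuristic, yet only the first is an entailment.
    print(lexical_overlap(premise, hypothesis), label)
```

Both example pairs satisfy the lexical overlap check, which is exactly why a model that predicts entailment from overlap alone gets the second pair wrong.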

Chinese HANS
Inspired by the English HANS, we examine whether OCNLI also possesses such biases, as its annotation procedure is similar to that of MNLI. We follow the design of the original HANS experiments, and adapt their scripts to extract examples in OCNLI that satisfy the two heuristics. We find a heavy bias towards "entailment": 79.5% of such examples are labeled "entailment", similar to MNLI. To construct a Chinese HANS, we first look into the syntactic structures of the examples satisfying the two heuristics. Then we write 29 templates for the lexical overlap heuristic and 11 templates for subsequence overlap. Using the templates and a vocabulary of 263 words, we generated 1,941 NLI pairs. See Table 2 for examples and Appendix A for details.
Distraction We add distractions to the premise or hypothesis, similar to the "length mismatch" and "word overlap" conditions in the NLI stress tests of Naik et al. (2018). The distractions are either tautologies ("true is not false") or a true statement from our world knowledge ("Finland is not a permanent member of the UN security council"), which should not influence the inference label. We control whether the distraction contains a negation or not, and thus create four conditions: premise-negation, premise-no-negation, hypothesis-negation, and hypothesis-no-negation. See Table 2.
Antonym We replace a word in the premise with its antonym to form a contradiction. To ensure the quality of the resulting NLI pairs, we manually examined the initially generated data and decided to replace only nouns and adjectives, as they are more likely to produce real contradictions.
Synonym We replace a word in the premise with its synonym to form an entailment.
Spelling We replace one random character in the hypotheses with its homonym (character with the same pinyin pronunciation ignoring tones) as this is one of the most common types of misspellings in Chinese.
Numerical reasoning We create a probing set for numerical reasoning, following simple heuristics such as the following. When the premise is Mary types x words per minute, the entailed hypothesis can be: Mary types less than y words per minute, where x < y. A contradictory hypothesis: Mary types y words per minute, where x > y or x < y. Then a neutral pair can be produced by reversing the premise and hypothesis of the above entailment pair. Four heuristic rules (with 6 words for quantification) are used and the seed sentences are extracted from Ape210k (Zhao et al., 2020), a dataset of Chinese elementary-school math problems. The resulting data contains 8,613 NLI pairs.
For quality control and to compute human performance, we randomly sampled 50 examples from all subsets and asked 5 Chinese speakers to verify them. Our goal is to mimic the human annotation protocol from Nangia and Bowman (2019), which gives us a conservative estimate of human performance given that our annotators received very few instructions. Their majority vote agrees with the gold label 90.0% of the time, which suggests that our data is of high quality and allows us to later compare against model performance.
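The agreement figure above is a majority vote compared against the gold label. A minimal sketch of that computation follows; the function names and the toy votes are hypothetical, and ties (possible with three labels and five annotators) are resolved here by taking the first label seen, which is an arbitrary choice for illustration.

```python
from collections import Counter

def majority_label(votes):
    """Return the most frequent label among the annotators' votes."""
    return Counter(votes).most_common(1)[0][0]

def agreement_with_gold(annotations, gold):
    """Fraction of examples where the majority vote equals the gold label."""
    hits = sum(majority_label(v) == g for v, g in zip(annotations, gold))
    return hits / len(gold)

# Hypothetical votes from 5 annotators on 2 examples
# (E/N/C = entailment/neutral/contradiction); not data from the paper.
annotations = [["E", "E", "N", "E", "E"], ["C", "N", "N", "C", "C"]]
gold = ["E", "N"]
print(agreement_with_gold(annotations, gold))
```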

Probing/diagnostic datasets
While the Chinese HANS and stress tests are designed to adversarially test the models, we also create probing or diagnostic datasets which are aimed at examining the models' linguistic and reasoning abilities.
Hand-crafted diagnostics We expanded the diagnostic dataset from the Chinese NLU Benchmark (CLUE) (Xu et al., 2020) in the following two ways. First, 6 Chinese linguists (PhD students) created diagnostics for 4 Chinese-specific linguistic phenomena. Here are two of the phenomena: (1) pro-drop: subjects or objects in Chinese can be dropped when they can be recovered from the context (Li et al., 1981). Thus the model needs to figure out the subject/object from the context. (2) four-character idioms (i.e., 成语 Chengyu): a special type of Chinese idiom that has exactly four characters, usually with a figurative meaning different from the literal meaning, e.g., 打草惊蛇 hit hay startle snake (behaving carelessly and causing your enemy to become vigilant). We construct examples to test whether models understand the figurative meaning of the idioms. Specifically, we first create a premise P which includes the idiom, with enough context so that a human is highly likely to interpret the idiom figuratively. Then we create an entailed hypothesis that is based on the figurative (correct) interpretation, and a neutral/contradictory hypothesis that uses the literal (incorrect) meaning (see Table 11 in the Appendix for an example). For each P we write 3 hypotheses, one for each inference relation. We also added diagnostics involving world knowledge.
Second, we double the number of diagnostic pairs for all 9 existing linguistic phenomena in CLUE with pairs whose premises are selected from a large news corpus and whose hypotheses are hand-written by our linguists, to accompany the 514 artificially created examples in CLUE. The resulting new diagnostic set is 4 times as large as the original one, with a total of 2,122 NLI pairs. For quality control, each pair was double-checked by local Chinese linguists not involved in this study, and the controversial cases were discarded after a discussion among the 6 linguists. See Table 11 in Appendix A for examples.
Semantic fragments Following Richardson et al. (2020) and Salvatore et al. (2019), we design synthesized fragments to examine models' understanding of six types of linguistic and logical inference: boolean, comparative, conditional, counting, negation and quantifier, where each category has 2-4 templates. See example templates and NLI pairs in Table 3.
The data is generated using context-free grammar rules and a vocabulary of 80,000 person names (Chinese and transliterated) and 8,659 city names; we also expanded the predicates and comparative relations in Richardson et al. (2020) to make the data more challenging. As a result, we generated 1,000 examples for each fragment. For quality control, each template was checked by 3 linguists/logicians; in addition, 20 examples from each category were checked for correctness by local experts.
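To give a feel for how a fragment is synthesized, the sketch below fills a single comparative template with randomly drawn names. It is a simplification under stated assumptions: the template, the English wording, the five names and the function name are all hypothetical, and a single hard-coded template stands in for the context-free grammar rules described above.

```python
import random

# Hypothetical name list; the paper draws from 80,000 person names.
NAMES = ["Li Wei", "Zhang Min", "Wang Fang", "Chen Jie", "Liu Yang"]

def comparative_example(names, rng):
    """Build one premise and three labelled hypotheses from an 'older than' chain."""
    a, b, c, d = rng.sample(names, 4)
    premise = f"{a} is older than {b}. {b} is older than {c}."
    hypotheses = {
        "entailment": f"{a} is older than {c}.",      # follows by transitivity
        "contradiction": f"{c} is older than {a}.",   # reverses the entailed order
        "neutral": f"{d} is older than {a}.",         # d is never mentioned in the premise
    }
    return premise, hypotheses

premise, hypotheses = comparative_example(NAMES, random.Random(0))
print(premise)
for label, hypothesis in hypotheses.items():
    print(label, "|", hypothesis)
```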

Experimental setup
Our main goal is to test whether cross-lingual transfer is robust against the adversarial and probing data we created when evaluated without additional training. Thus we need to compare the best Chinese monolingual models with the best multilingual models trained either on English NLI data alone, or on combinations of Chinese and English data. 9
Chinese monolingual models We experimented with two current state-of-the-art transformer models: RoBERTa-large (Liu et al., 2019) and Electra-large-discriminator (Clark et al., 2019). We use the Chinese models released by Cui et al. (2020) 10, implemented in the Huggingface Transformers library (Wolf et al., 2020).
Multilingual model We use XLM-RoBERTa-large (Conneau et al., 2020). We choose XLM-R over mT5 (Xue et al., 2020) because XLM-R generally performs better than mT5 at the same model size (see original paper for details). Also, XLM-R, as a RoBERTa model, is architecturally the most closely related to existing Chinese pre-trained models.
(2) XNLI-small: 50k examples from XNLI, the same size as the training data of OCNLI. (3) OCNLI: Original Chinese NLI dataset (Hu et al., 2020). It is a Chinese NLI dataset collected from scratch, following the MNLI procedure, with 50k training examples. We use this to measure the effect of the quality of training data; that is, whether it is better to use small, high-quality training data (OCNLI), or large, low-quality MT data (XNLI). (4) OCNLI + XNLI: a combination of the two training sets, 440k examples.
Fine-tuning data for XLM-R To examine cross-lingual transfer, we fine-tune XLM-R on English NLI data alone and on English + Chinese NLI data: (1) [...] (4) XNLI + English all NLI. These two are set to examine whether combining Chinese and English fine-tuning data is helpful.
9 We also ran the same experiments for Chinese-to-English transfer, i.e., fine-tuning XLM-R with OCNLI and evaluating on the four English counterpart datasets. We find that transferring from OCNLI to English does not perform as well as monolingual English models, likely due to the small size of OCNLI. Detailed results are reported in Appendix C.
10 We use hfl/chinese-roberta-wwm-ext-large from https://github.com/ymcui/Chinese-BERT-wwm and hfl/chinese-electra-large-discriminator from https://github.com/ymcui/Chinese-ELECTRA.
[Table 3: example templates and NLI pairs for the semantic fragments; columns: category, premise, hypothesis, label.]
We fine-tune the models on OCNLI-dev. Acknowledging that different training runs can produce very different checkpoints for behavioral testing (D'Amour et al., 2020), we run 5 models with different seeds and report the mean accuracy of the models with the best hyper-parameter setting (for details see Appendix B).
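Reporting over several seeds amounts to aggregating one accuracy per run. The sketch below shows that aggregation; the accuracies are made-up placeholders rather than results from the paper, and the standard deviation is included only as an illustration of run-to-run spread (the text above reports the mean).

```python
from statistics import mean, stdev

def summarize_runs(accuracies):
    """Mean and sample standard deviation of accuracy over runs with different seeds."""
    return mean(accuracies), stdev(accuracies)

# Hypothetical dev accuracies (%) from 5 seeds; not numbers from the paper.
runs = [78.0, 79.0, 80.0, 79.0, 79.0]
avg, sd = summarize_runs(runs)
print(f"{avg:.2f} +/- {sd:.2f}")
```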

Results on OCNLI dev
Results on the dev set of OCNLI are presented in Table 4. For monolingual RoBERTa, we see a similar performance as reported in the OCNLI paper (Hu et al., 2020), with 79.11% accuracy. The monolingual Electra achieves a very close accuracy of 79.02% (not shown in the Table). As we see the same trend in the following experiments, we will therefore only report results on RoBERTa.
For XLM-R, fine-tuning on MNLI or En-all-NLI gives us reasonable results of around 72% to 74%, which is better than models fine-tuned on XNLI, indicating that fine-tuning on English data (MNLI) alone can outperform monolingual models fine-tuned on the same data machine-translated into Chinese (XNLI). This is consistent with previous results on Korean (Choi et al., 2021) and Persian (Khashabi et al., 2020) for other NLU tasks.
What is also interesting is that combining OCNLI and En-all-NLI gives us a boost of 2% to 82.18% (a result that is comparable to the current published SOTA), showing the power of mixing high-quality English and Chinese training data.
Table 5 shows results of the Chinese HANS data tested on the aforementioned monolingual models and cross-lingual model.

Chinese HANS
Cross-lingual transfer achieves strong results. We first notice that when XLM-R is fine-tuned solely on the English data (En-all-NLI), the performance (∼69%) is only slightly worse than the best monolingual model (∼71%). This suggests that cross-lingual transfer from English to Chinese is quite successful for an adversarial dataset like HANS. Second, adding OCNLI to En-all-NLI in the training data gives a big boost of about 9%, and achieves the overall best result. This is about 12% higher than combining XNLI and the English data, demonstrating the advantage of the expert-annotated OCNLI over machine-translated XNLI, even though the latter is about 8 times the size of the former. Despite these results, however, we note that all models continue to perform below human performance, suggesting more room for improvement.
Our results also suggest that examples involving the subsequence heuristic are more difficult than those targeting the lexical overlap heuristic for the transformer models we tested (see the "subsequence" and "lexical overlap" columns in Table 5). This is in line with the results reported in the English HANS paper (specifically Table 15 [...] Table 5). This stands in contrast with the lexical overlap heuristic, where the best monolingual model performs similarly to the best zero-shot cross-lingual transfer (75.39% versus 77.62%). This is one of the few cases where cross-lingual transfer under-performs the monolingual setting by a large margin, suggesting that in certain situations monolingual models may be preferred.
Table 6 presents the accuracies on all the stress tests. We first see that cross-lingual zero-shot transfer using all English NLI data performs even better than the best monolingual model (∼74% vs. ∼71%). This demonstrates the power of cross-lingual transfer learning. Adding OCNLI to all English NLI gives another increase of about 3 percentage points (to 77%), while adding XNLI hurts the performance, again showing the importance of having expert-annotated language-specific data.

Stress tests
Antonyms and Synonyms All models except those fine-tuned on OCNLI achieved an almost perfect score on the synonym test. However, for antonyms, both monolingual and multilingual models fine-tuned with OCNLI perform better than those fine-tuned with XNLI. XLM-R fine-tuned with English NLI data only again outperforms the best of the monolingual models (∼80% vs. ∼72%). Interestingly, adding XNLI to all English NLI data hurts the accuracy badly (a 14% drop), while adding OCNLI to the same English data improves the result slightly.
As antonyms are harder to learn (Glockner et al., 2018), we take our results to mean that either expert-annotated data for Chinese or a huge English NLI dataset is needed for a model to learn decent representations of antonyms, as indicated by the high performance of RoBERTa fine-tuned with OCNLI (71.81%), and XLM-R fine-tuned with En-all-NLI (80.36%), on antonyms. That is, using machine-translated XNLI will not work well for learning antonyms (∼55% accuracy).
Distraction Results in Table 6 show that adding distractions to the hypotheses has a more negative impact on models' performance than appending distractions to the premises. The difference is about 20% for all models (see "Distr H" columns and "Distr P" columns in Table 6), which has not been reported in previous studies, to the best of our knowledge. Including a negation in the hypothesis makes it even more challenging, as we see another one percent drop in accuracy for all models. This is expected, as previous literature has demonstrated the key role negation plays when hypotheses are produced by the annotators (Poliak et al., 2018).
Spelling This is another case where cross-lingual transfer with English data alone falls behind monolingual Chinese models (by about 4%). Also, the best results are from fine-tuning XLM-R with OCNLI + XNLI, rather than a combination of English and Chinese data. Considering that the data is created by swapping Chinese characters with others of the same pronunciation, we take this to suggest that monolingual models are still better at picking up the misspellings or learning the connections between characters at the phonological level.
Numerical Reasoning Results in the last column of Table 6 suggest a similar pattern: using all English NLI data for cross-lingual transfer outperforms the best monolingual model. However, fine-tuning a monolingual model with the small OCNLI (50k examples, accuracy: 54%) achieves better accuracy than using a much larger MNLI (390k examples, accuracy: 51%) for cross-lingual transfer, although both are worse than XLM-R fine-tuned with all English NLI, which has more than 1,000k examples (accuracy: 83%). This suggests that there are cases where a monolingual setting (RoBERTa with OCNLI) is competitive against zero-shot transfer with a large English dataset (XLM-R with MNLI). However, that competitiveness may disappear when the English dataset grows by an order of magnitude in size or becomes more diverse (En-all-NLI contains 4 different English NLI datasets).

Hand-written diagnostics
Results on the expanded diagnostics are presented in Table 7. We first see that XLM-R fine-tuned with only English performs very well, at 70.2% and 71.9%, even slightly higher than the best monolingual Chinese model (69.3%). Most surprisingly, in 3/5 categories with uniquely Chinese linguistic features, zero-shot transfer outperforms monolingual models. Only in "non-core arguments" and "time of event" do we see higher performance with OCNLI as the fine-tuning data. What is particularly striking is that for "idioms (Chengyu)", XLM-R fine-tuned only on English data achieves the best result, suggesting that cross-lingual transfer is capable of learning meaning representations beyond the surface lexical information, at least for many of the idioms we tested. The overall results (accuracy of 74.3%) indicate that cross-lingual transfer is very successful in most cases. Manual inspection of the results shows that for many NLI pairs with idioms, XLM-R correctly predicts the figurative interpretation of the idiom as entailment, and the literal interpretation as non-entailment, as described in Section 3.2. Looking at OCNLI and XNLI, we observe that they perform similarly when used to fine-tune monolingual RoBERTa. However, when used to fine-tune XLM-R, OCNLI has a clear advantage (68.0% versus 60.9%), suggesting that OCNLI may produce more stable results than XNLI. Furthermore, when coupled with English data to be used with XLM-R, we see again that OCNLI + En-all-NLI results in an accuracy 3 percent higher than XNLI + En-all-NLI.
Table 8: Accuracy on the Chinese semantic probing datasets, designed following Richardson et al. (2020).

Semantic fragments
Results on the semantic probing datasets (shown in Table 8) are more mixed. First, the results are in general much worse than on the other evaluation data, but overall, XLM-R fine-tuned with OCNLI and all English data still performs the best. The overall lower performance is likely due to the longer premises and hypotheses in the semantic probing datasets, compared with the other three evaluation sets. Second, zero-shot transfer is better than or on par with monolingual Chinese RoBERTa in 4/6 semantic fragments (all except Boolean and quantifier). Third, for Boolean and comparative, XLM-R fine-tuned with OCNLI has a much better result than all other monolingual models or XLM-R fine-tuned with mixed data. We also observe that all models have their highest performance on the "counting" fragment. Note that none of the models have seen any data from the "counting" fragment during fine-tuning. That is, all the knowledge comes from pre-training and from fine-tuning on general NLI datasets. The surprisingly good performance of the XLM-R model (w/ En-all-NLI, 92.34%) suggests that it may have already acquired a mapping from counting the words/names to numbers, and that this knowledge can be transferred cross-linguistically.

Conclusion and Future Work
In this paper, we examine the cross-lingual transfer ability of XLM-R in the context of Chinese NLI through four new sets of adversarial/probing tasks consisting of a total of 17 new high-quality and linguistically motivated challenge datasets. We find that cross-lingual transfer via fine-tuning solely on benchmark English data generally yields impressive performance. In 3/4 of our task categories, such zero-shot transfer outperforms our best monolingual models trained on benchmark Chinese NLI data, including 3/5 of our hand-crafted challenge tasks that test uniquely Chinese linguistic phenomena. These results suggest that multilingual models are indeed capable of considerable cross-lingual linguistic transfer and that zero-shot NLI may serve as a serious alternative to large-scale dataset development for new languages. These results come with several important caveats. Models are still outperformed by conservative estimates of human performance, and our best models still have considerable room for improvement; we hope that our new resources will be useful for continuing to benchmark progress on Chinese NLI. We also find that high-quality Chinese NLI data (e.g., OCNLI) can help improve results further, which suggests an important role for certain types of expertly annotated monolingual data in a training pipeline. Because our study is limited to behavioral testing, the exact reason why cross-lingual zero-shot transfer generally performs well, especially on some Chinese-specific phenomena, is an open question that requires further investigation. In particular, we believe that techniques that couple behavioral testing with intervention techniques (Geiger et al., 2020; Vig et al., 2020) and other analysis methods (Giulianelli et al., 2018; Belinkov and Glass, 2019; Sinha et al., 2021) might provide insight, and that our new resources can play an important role in such future work.

A Details for dataset creation
In this section, we list example NLI pairs and their translations. For examples of the Chinese HANS and stress tests, see Table 2. For the expanded diagnostics, see Table 11. For the semantic/logic probing dataset, see Table 3. Table 9 lists the number of examples in OCNLI for each inference label that satisfy the two heuristics we are examining. We observe that entailment examples form the majority for both heuristics. Therefore, we hypothesize that if the heuristics are learned, the entailment examples are likely to be correctly predicted, while non-entailment (contradiction and neutral) examples are prone to receive wrong predictions.

Chinese HANS
To guarantee that the generated sentences are syntactically and semantically sound, we add features to our vocabulary so that subject-predicate and verb-object constraints are satisfied, e.g., some verbs can only take animate subjects and objects. We then generate 50 premise-hypothesis pairs for each template described in our Github repository. Excluding duplicated examples, our generated dataset has 1,941 pairs and the distribution of the three labels is shown in Table 10.
The templates for the two heuristics are listed in Appendix D.
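Template-based generation with vocabulary features can be sketched as follows. This is a toy illustration and not the released generation code: the English vocabulary, the animacy features, the single subject-object swap template and all identifiers are hypothetical, chosen only to show how feature constraints filter out unsound sentences before sampling.

```python
import itertools
import random

# Hypothetical toy vocabulary with features; the paper uses 263 Chinese words.
NOUNS = {"judge": {"animate": True}, "actor": {"animate": True}, "letter": {"animate": False}}
VERBS = {
    "thanked": {"animate_subject": True, "animate_object": True},
    "saw": {"animate_subject": True, "animate_object": False},
}

def compatible(subj, verb, obj):
    """Check the subject-predicate and verb-object animacy constraints."""
    needs = VERBS[verb]
    if needs["animate_subject"] and not NOUNS[subj]["animate"]:
        return False
    if needs["animate_object"] and not NOUNS[obj]["animate"]:
        return False
    return True

def swap_template(n_pairs, rng):
    """Lexical-overlap template: swapping subject and object gives a non-entailed hypothesis."""
    candidates = [
        (s, v, o)
        for s, o in itertools.permutations(NOUNS, 2)
        for v in VERBS
        # Both the premise and the swapped hypothesis must be sound sentences.
        if compatible(s, v, o) and compatible(o, v, s)
    ]
    # Sampling without replacement avoids duplicated pairs.
    chosen = rng.sample(candidates, min(n_pairs, len(candidates)))
    return [(f"The {s} {v} the {o}.", f"The {o} {v} the {s}.", "non-entailment")
            for s, v, o in chosen]

for premise, hypothesis, label in swap_template(50, random.Random(0)):
    print(premise, "|", hypothesis, "|", label)
```

With this toy vocabulary only four distinct pairs exist, so a request for 50 pairs returns four; a larger vocabulary is what makes 50 pairs per template feasible.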
Antonym After looking at the quality of the initially generated data, we decided to replace only nouns and adjectives with their antonyms, since such replacements are most likely to result in grammatical and contradictory hypotheses.
Synonym After inspecting the initially generated data, we decided to perform replacements only on verbs and adjectives. To ensure the quality of synonyms, we rank the synonyms from a commonly used synonym dictionary by their vector similarity to the original word, and pick the top-ranking synonym.
Distraction We created the distraction data in a way similar to the stress test setting (Naik et al., 2018), but experimented with variations as to where the "distractor statement" (either a tautology or a true statement) was added: the premise or the hypothesis. The distractor statement also varied w.r.t. whether it contains a negation:
• Premise-no-negation: A distractor statement is added to the end of the premise and it contains no negation.
• Premise-negation: A distractor statement containing a negation is added to the premise.
• Hypothesis-no-negation: A distractor statement is added to the end of the hypothesis.
• Hypothesis-negation: Same as the previous condition except that the distractor contains a negation.
Only two tautologies are used in Naik et al. (2018). To examine the influence of different true statements more thoroughly, we designed 50 statements that vary in three factors: length, out-of-vocabulary words, and negation word. They form 25 statement pairs (1 tautology pair and 24 true-statement pairs); each pair consists of a statement and its corresponding negated form. All statements range from 5 to 16 characters. For the negated statements, two common Chinese negation words, 不 and 没, are used. Of the 24 true-statement pairs, half contain at least one word that is out-of-vocabulary with respect to OCNLI. Experiments show that length, out-of-vocabulary words, and the choice of negator have little effect on the results.
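The four distraction conditions can be illustrated with a short sketch. The distractor pair, the connective "而且" used to attach it, and the function names are hypothetical choices made for illustration; the gold label is assumed to stay unchanged because the appended statement is always true.

```python
# Hypothetical distractor pair: a true statement and its negated counterpart.
DISTRACTOR = {"plain": "太阳从东方升起", "negated": "太阳不从西方升起"}


def add_distractor(premise, hypothesis, label, target, negation,
                   distractor=DISTRACTOR):
    """Append a distractor to the premise or the hypothesis.

    The connective used to attach it is an assumption of this sketch.
    """
    text = distractor["negated" if negation else "plain"]

    def attach(sentence):
        # Drop the final full stop, then append the distractor as a new clause.
        return sentence.rstrip("。") + "，而且" + text + "。"

    if target == "premise":
        premise = attach(premise)
    elif target == "hypothesis":
        hypothesis = attach(hypothesis)
    else:
        raise ValueError("target must be 'premise' or 'hypothesis'")
    return premise, hypothesis, label  # label assumed unchanged


def all_conditions(premise, hypothesis, label):
    """Build the four variants: {premise, hypothesis} x {no-negation, negation}."""
    return {
        f"{target}-{'negation' if negation else 'no-negation'}":
            add_distractor(premise, hypothesis, label, target, negation)
        for target in ("premise", "hypothesis")
        for negation in (False, True)
    }


variants = all_conditions("他在家吃饭。", "他在家。", "entailment")
```

Keeping the plain and negated forms in one pair makes it easy to compare the same distractor with and without a negation word at the same position.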
Spelling We generate a set of data containing "spelling errors" by replacing one random character in the hypothesis with a homonym, defined as a character with the same pinyin pronunciation, ignoring tones. We also restrict the frequency of the homonym to the range of 100 to 6000 so that the character is neither too rare nor too frequent.
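A minimal sketch of this perturbation is given below. The pinyin table and frequency values are hypothetical toy data (a real run would need a full character-to-pinyin resource and corpus statistics), and restricting the random choice to positions that actually have an eligible homonym is an assumption of the sketch.

```python
import random

# Hypothetical toy resources (illustrative values only).
TONELESS_PINYIN = {
    "他": "ta", "她": "ta", "它": "ta", "塔": "ta",
    "在": "zai", "再": "zai",
    "家": "jia", "加": "jia",
}
FREQUENCY = {
    "他": 9000, "她": 3000, "它": 2500, "塔": 50,
    "在": 20000, "再": 4000,
    "家": 5000, "加": 1500,
}


def homonyms(char, min_freq=100, max_freq=6000):
    """Characters with the same toneless pinyin whose frequency is in range."""
    pinyin = TONELESS_PINYIN.get(char)
    if pinyin is None:
        return []
    return sorted(
        c for c, p in TONELESS_PINYIN.items()
        if p == pinyin and c != char
        and min_freq <= FREQUENCY.get(c, 0) <= max_freq
    )


def add_spelling_error(hypothesis, rng):
    """Replace one random character that has an eligible homonym."""
    positions = [i for i, ch in enumerate(hypothesis) if homonyms(ch)]
    if not positions:
        return hypothesis  # nothing can be replaced
    i = rng.choice(positions)
    replacement = rng.choice(homonyms(hypothesis[i]))
    return hypothesis[:i] + replacement + hypothesis[i + 1:]


noisy = add_spelling_error("他在家", random.Random(0))
```

In this toy table, 塔 is excluded as too rare and 在 as too frequent, which mirrors the purpose of the frequency window.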
Numerical reasoning We extracted sentences from Ape210k (Zhao et al., 2020), a large-scale math word problem dataset containing 210K Chinese elementary school-level problems (see footnote 16). We generate entailed, contradictory and neutral hypotheses for each premise, with the rules below:
1. Entailment: Randomly choose a number x from the hypothesis and change it to y. If y > x, prefix it with a phrase that translates to "less than"; if y < x, prefix it with a phrase that translates to "more than".
Premise: Mary types 110 words per minute.
Hypothesis: Mary types less than 510 words per minute.
2. Contradiction: Do one of the following: 1) randomly choose a number x from the hypothesis and change it; 2) randomly choose a number from the hypothesis and prefix it with a phrase that translates to "less than" or "more than".
Premise: Mary types 110 words per minute.
Hypothesis: Mary types 710 words per minute.
3. Neutral: Reverse the corresponding entailed premise-hypothesis pair.
Premise: Mary types less than 510 words per minute.
Hypothesis: Mary types 110 words per minute.
The resulting dataset contains 2,871 unique premise sentences and 8,613 NLI pairs.
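The three rules can be sketched as follows, using English strings in place of the Chinese phrases for readability. The function name, the range from which the new number is drawn, and the choice to always perturb the first number in the sentence are simplifying assumptions; only the first contradiction option (changing the number) is shown. Producing one hypothesis per label gives three pairs per premise, which is consistent with 2,871 premises yielding 8,613 pairs.

```python
import random
import re

NUMBER = re.compile(r"\d+")


def make_pairs(premise, rng):
    """Build one entailment, one contradiction and one neutral pair."""
    match = NUMBER.search(premise)  # assumption: perturb the first number
    if match is None:
        raise ValueError("premise must contain a number")
    x = int(match.group())

    def new_number():
        # Draw a number different from x; the range is an arbitrary choice.
        y = x
        while y == x:
            y = rng.randint(1, 2 * x + 10)
        return y

    def substitute(text):
        return premise[:match.start()] + text + premise[match.end():]

    # Rule 1: y > x gets "less than", y < x gets "more than".
    y = new_number()
    phrase = "less than" if y > x else "more than"
    entailed = substitute(f"{phrase} {y}")

    # Rule 2 (first option): simply change the number.
    contradicted = substitute(str(new_number()))

    return [
        (premise, entailed, "entailment"),
        (premise, contradicted, "contradiction"),
        (entailed, premise, "neutral"),  # Rule 3: reversed entailment pair
    ]


pairs = make_pairs("Mary types 110 words per minute.", random.Random(0))
```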
Diagnostics The diagnostics for classifiers (or measure words) and non-core arguments are explained in detail below (see examples in Table 11).
1. Classifiers (or measure words): in Chinese, when modified by a numeral, a noun must be preceded by a word from a category called classifiers. They can be semantically vacuous but sometimes also carry semantic content: 一匹狼 one pi wolf (one wolf); 一群狼 one qun wolf (one pack of wolves). Our examples require the model to understand the semantic content of the classifiers.
2. Non-core arguments: in Chinese syntax, a noun phrase in an argument position (e.g., object) sometimes does not serve as an object: 今天吃筷子，不吃叉子。today eat chopsticks, not eat fork (We eat with chopsticks today, not with forks). Sun (2009) shows that this structure is very productive in Chinese, and we take example sentences from her dissertation.
Footnote 16: We split all problems into individual sentences and filter out sentences without numbers. Then we remove sentences without any named entities ("PERSON", "LOCATION" and "ORGANIZATION") using the NER tool provided by the LTP toolkit (Che et al., 2020).
Additionally, the pro-drop examples are constructed such that models return the correct inference relation only when they successfully identify what the dropped pro refers to. That is, our constructed premises involve several entities the dropped pro could potentially refer to, and the entailed hypothesis identifies the correct referent while the neutral/contradictory hypothesis does not (see Table 11 for an example).
Table 12 presents the hyperparameters used for the models. The learning-rate search space is 1e-5, 2e-5, 3e-5, 4e-5 and 5e-5 for RoBERTa, and 5e-6, 7e-6, 9e-6, 2e-5 and 5e-5 for XLM-R.

C Chinese-to-English transfer
We present Chinese-to-English transfer results in this section. As mentioned in the main text, in most cases zero-shot transfer learning does not work well, most likely due to the small size of OCNLI. However, for 3 out of the 4 datasets, XLM-R fine-tuned with the mixed data outperforms the monolingual setting, suggesting that even though OCNLI is only 1/20 the size of En-all-NLI, XLM-R can still acquire some useful information from OCNLI, in addition to what is present in En-all-NLI.
Specifically, (1) for English HANS, XLM-R fine-tuned with OCNLI is about 13 percentage points below the best English monolingual model, as shown in Table 13. (2) For the stress tests shown in Table 14, the gap is about 5 percentage points (XLM-R with OCNLI = 74%; RoBERTa with En-all-NLI = 79%). Interestingly, XLM-R with OCNLI performs the best for Negation and Word overlap. It even outperforms RoBERTa with MNLI on the Antonym test, which seems to be consistent with the high performance of OCNLI-trained models on the Chinese Antonym test in our constructed stress tests. (3) For the semantic probing data, as shown in Table 15, XLM-R with OCNLI is 5 percentage points behind the monolingual model fine-tuned with all English NLI, but performs better than the monolingual RoBERTa fine-tuned with MNLI (53.6% vs. 51.3%). This is quite surprising since the size of OCNLI is only 1/8 of MNLI. (4) For the English diagnostics, as shown in Table 16 and Table 17, XLM-R with OCNLI is 7 percentage points behind RoBERTa fine-tuned with MNLI. We leave it for future work to thoroughly examine transfer learning from a "low-resource" language such as Chinese to a high-resource one such as English.

D Example templates for Chinese HANS
We present the templates for the two heuristics in Chinese HANS in Table 18 and Table 19.