Common Sense Beyond English: Evaluating and Improving Multilingual Language Models for Commonsense Reasoning

Commonsense reasoning research has so far been limited to English. We aim to evaluate and improve popular multilingual language models (ML-LMs) to help advance commonsense reasoning (CSR) beyond English. We collect the Mickey corpus, consisting of 561k sentences in 11 different languages, which can be used for analyzing and improving ML-LMs. We propose Mickey Probe, a language-general probing task for fairly evaluating the common sense of popular ML-LMs across different languages. In addition, we also create two new datasets, X-CSQA and X-CODAH, by translating their English versions to 15 other languages, so that we can evaluate popular ML-LMs for cross-lingual commonsense reasoning. To improve the performance beyond English, we propose a simple yet effective method — multilingual contrastive pretraining (MCP). It significantly enhances sentence representations, yielding a large performance gain on both benchmarks (e.g., +2.7% accuracy for X-CSQA over XLM-R_L).


Introduction
Understanding natural language relies heavily on commonsense reasoning (CSR), which is the process of making inferences with commonsense knowledge. Commonsense knowledge is the set of general facts that reflect our natural understanding of the physical world and human behavior, and it is usually assumed as implicit background when people communicate with each other. It is thus of vital importance to evaluate and improve the commonsense reasoning capability of language models (LMs), towards building general natural language understanding (NLU) systems (Davis and Marcus, 2015). Many recent benchmark datasets and probing methods have been proposed to evaluate machine common sense. As shown in Figure 1, the LAMA probe (Petroni et al., 2019) analyzes LMs' zero-shot commonsense recalling ability; CommonsenseQA (CSQA) (Talmor et al., 2019) is instead a multiple-choice QA task that needs fine-tuning; CODAH (Chen et al., 2019) and SWAG (Zellers et al., 2018) focus on the ability to complete the most plausible scenes. However, all these works have been limited to English. Consequently, follow-up analysis and reasoning methods (Lin et al., 2019; Feng et al., 2020) also focus only on English LMs like BERT (Devlin et al., 2019). Such an English-centric trend in commonsense reasoning research not only limits our research scope, but also tends to exacerbate English-specific biases that might prevent future methods from generalizing beyond English (Ponti et al., 2020).
It is of pressing urgency for the community to develop NLU systems that can serve all languages in the world, to bridge the gap between different cultures and eliminate language barriers (Hu et al., 2020), and multilingual language models (ML-LMs), such as XLM-R (Conneau et al., 2020), are among the most promising tools to achieve this ambitious goal. Although ML-LMs have been evaluated on a few NLU tasks, e.g., XNLI (Conneau et al., 2018) and XTREME (Hu et al., 2020), it is still relatively unclear how ML-LMs perform on commonsense reasoning tasks, due to the lack of 1) dedicated methods for probing common sense in ML-LMs and 2) multilingual benchmark datasets for commonsense reasoning.
To analyze how much common sense ML-LMs already have without any tuning, we propose MICKEYPROBE, a zero-shot probing task. It tasks a ML-LM to rank a set of contrastive assertions (i.e., declarative sentences) in the same language by their commonsense plausibility, for which we use pseudo-log-likelihood (PLL) (Salazar et al., 2020) as a proxy. Unlike the LAMA probe, it can study multi-token concepts, which are ubiquitous in some non-English languages. In addition, it fairly compares performance across different languages via a language-invariant evaluation protocol. Alongside the probing task, we also create MickeyCorpus, a large-scale multilingual dataset consisting of 561k sentences in 11 different languages. Our experiments reveal that there are always large discrepancies across different languages in the tested ML-LMs, and that different ML-LMs show very different language preferences.
Beyond supervision-free analysis of ML-LMs, we also study their performance in commonsense reasoning tasks, such as CSQA and CODAH, within a cross-lingual transfer setting (i.e., trained on English data and tested on other languages). We find that existing ML-LMs tend to have much lower accuracy in commonsense reasoning beyond English. We conjecture that a major common weakness of existing ML-LMs is that their pretraining stages lack a proper sentence-level objective. Therefore, we propose multilingual contrastive pre-training (MCP), which tasks a ML-LM to select the correct assertion out of a set of N contrastive assertions in N different languages. We re-format MickeyCorpus by sampling across languages and thus form a dedicated pre-training corpus for the MCP task. To fairly evaluate different ML-LMs and validate the effectiveness of MCP, we create X-CSQA and X-CODAH, two cross-lingual commonsense reasoning datasets, by translating their English versions to 15 other languages, including low-resource ones such as Swahili (sw) and Urdu (ur). Experiments show that the proposed MCP objective indeed significantly improves the performance of state-of-the-art ML-LMs in cross-lingual commonsense reasoning. Our contributions are as follows:
• Resources. We collect a large multilingual parallel corpus, MickeyCorpus, consisting of 561k sentences in 11 languages, which can be used for analyzing and improving ML-LMs. We also create X-CSQA and X-CODAH, two cross-lingual CSR benchmarks in 16 languages, for question answering and scene completion, respectively.
• Evaluation and analysis. We analyze multiple popular ML-LMs with MICKEYPROBE, a language-invariant, zero-shot task for probing common sense in ML-LMs; we also evaluate them on X-CSQA and X-CODAH in a cross-lingual transfer setting.
• Method to improve ML-LMs. We propose multilingual contrastive pretraining, a simple and effective sentence-level pretext task that significantly improves state-of-the-art ML-LMs in cross-lingual commonsense reasoning.

Background and Related Work
In this section, we introduce important concepts, background knowledge, and related work before presenting our own work in the following sections.

Multilingual Language Models
A multilingual language model (ML-LM) aims to produce text representations for multiple languages in a unified embedding space. One of the unique advantages of ML-LMs is their potential ability to perform zero-shot cross-lingual transfer: a model trained (or fine-tuned) on data in one language (usually English) can be directly used in other languages as well, without further fine-tuning. Improving ML-LMs is thus believed to be one of the most promising approaches towards multilingual NLU at scale. mBERT (Devlin et al., 2019) is simply the BERT model (Devlin et al., 2019) trained on multilingual corpora, without any design specific to multilinguality. Distil-mBERT (d-mBERT) (Sanh et al., 2019) is a smaller mBERT trained by knowledge distillation. Conneau and Lample (2019) proposed XLM(-100), which is pretrained with both masked language modeling (MLM) and translation language modeling (TLM). Conneau et al. (2020) further proposed XLM-R, which improves XLM with a better sub-token vocabulary and high-quality multilingual corpora (CC100). We leave the analysis of recent seq2seq ML-LMs, such as mBART and mT5 (Xue et al., 2021), as future work, because their architectures differ significantly from the other ML-LMs. Note that the above ML-LMs are pretrained only with token-level training objectives such as MLM (i.e., recovering masked tokens in monolingual text) and TLM (i.e., recovering masked tokens in a pair of parallel sentences in two different languages). However, most NLU tasks, including commonsense reasoning, rely heavily on sentence-level representations. We argue that a well-designed sentence-level pre-training objective should improve ML-LMs for NLU tasks. This intuition motivates us to propose a sentence-level pre-training objective, MCP (Section 5).
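The two token-level objectives can be illustrated with a minimal sketch. `mask_tokens` and `tlm_input` are hypothetical helper names; real ML-LM pretraining operates on subword IDs and also uses random/keep substitutions, which are omitted here:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", rng=None):
    """MLM corruption sketch: replace a random fraction of tokens with
    [MASK]; the model must recover the tokens at the returned indices."""
    rng = rng or random.Random(0)
    masked, targets = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < mask_rate:
            masked[i] = mask_token
            targets.append(i)
    return masked, targets

def tlm_input(src_tokens, tgt_tokens, sep="[/S]"):
    """TLM sketch: concatenate a parallel sentence pair so the model can
    use context from both languages when recovering masked tokens."""
    return src_tokens + [sep] + tgt_tokens
```

In TLM, `mask_tokens` would then be applied to the concatenated pair, letting the model attend across languages when filling a mask.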
Commonsense Reasoning Benchmarks
CSQA is a question answering task, while CODAH and SWAG are scene completion tasks; all three use a multiple-choice format, as shown in Figure 1. These benchmarks are widely used to evaluate LMs for commonsense reasoning. Unfortunately, they are limited to English and thus cannot evaluate the multilingual commonsense knowledge of models, which motivates us to create X-CSQA and X-CODAH. The recent XCOPA (Ponti et al., 2020) dataset shares a similar goal, but it focuses only on event-based causal reasoning within the scope of human social behavior, and is thus arguably more culturally biased. In contrast, X-CSQA and X-CODAH mainly evaluate general world knowledge and cover more fine-grained types of reasoning (e.g., quantitative, negation), and thus support a more language-agnostic, comprehensive assessment of the common sense of ML-LMs.

The LAMA Probe and Its Limitations
The LAMA Probe (Petroni et al., 2019) is the seminal work on probing for common sense in (English) language models. It has a straightforward intuition: if a pretrained language model contains more commonsense knowledge, then it should be better at recalling a masked token in a commonsense assertion (e.g., "birds have [mask]"). Specifically, given a LAMA-probe sentence s and its masked token w_t, a LM under testing uses all past and future tokens s_{\t} := (w_1, ..., w_{t-1}, w_{t+1}, ..., w_{|s|}) as the input to rank all tokens in the vocabulary by the probability P(w_t | s_{\t}) via zero-shot inference. One can evaluate the performance of recalling common sense by measuring the position of the correct token (e.g., "wings") in the ranked list. That is, the LAMA probe method uses token-level probability as a proxy to probe for common sense in LMs, by ranking all tokens in their vocabularies.
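The LAMA-style ranking step can be sketched as follows; the toy probability table stands in for a real masked-LM head's distribution over the vocabulary at the masked position:

```python
def lama_rank(vocab_scores, gold_word):
    """Rank of the gold token among all vocabulary tokens, sorted by the
    model's probability for the masked slot (1 = best). `vocab_scores`
    maps token -> P(token | context), a stand-in for a real LM output."""
    ranked = sorted(vocab_scores, key=vocab_scores.get, reverse=True)
    return ranked.index(gold_word) + 1

# toy distribution for "birds have [mask]"
scores = {"wings": 0.41, "eyes": 0.30, "legs": 0.22, "cars": 0.01}
assert lama_rank(scores, "wings") == 1
```

Note how the metric hinges on one gold token: "eyes" and "legs" are also plausible completions, which is exactly the fairness issue discussed below.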
This intuitive method, however, has several inherent limitations. First, in many other languages, multi-token concepts are ubiquitous, for example, "图书馆" ("library" in Simplified Chinese). Follow-up work presents several methods to decode multi-token entities so that the LAMA probe can be adapted for language-specific analysis, but it is infeasible to use token-level probing tasks if we want to analyze ML-LMs across languages. In addition, the evaluation metric of the LAMA probe can be unfair, because there may be many correct words for a masked position (e.g., "birds have legs/eyes"). The ranking metrics of the LAMA probe, however, tend to ignore these alternatives, resulting in a less trustworthy analysis. Moreover, vocabulary-specific ranking is unfair when comparing across different languages, since different vocabularies constitute very different label spaces. These limitations of the LAMA Probe prevent us from analyzing common sense in ML-LMs across typologically diverse languages.

The Mickey Probe
The challenges of using the LAMA Probe for probing common sense in ML-LMs motivate us to propose a more suitable method for analyzing ML-LMs, one that can fairly compare across a diverse set of languages. We present MICKEYPROBE, a Multilingual task for probing commonsense knowledge and analysis. We design a language-agnostic probing task with a sentence-selection objective for analyzing the common sense of a ML-LM: given a set of assertions (i.e., declarative sentences) that have similar words and syntactic features, select the one with the highest commonsense plausibility. We present the task formulation in this section and then introduce how we collect the dedicated dataset in Section 4.
Notations. We define a Mickey probe M as a set of K assertions in the same language, where one and only one of them (say, M_t) is the truth assertion, with better commonsense plausibility than the other K−1. Each Mickey probe M has multiple semantically equivalent versions in different languages. Let us denote a language by l ∈ L, where L = {en, fr, ru, zh, . . . } and |L| is the number of languages of interest. Then, M^l is the probe M in language l. For example, M^en and M^fr denote probes with the same meaning in English (en) and French (fr), respectively. We use M to denote a multilingual parallel dataset for MICKEYPROBE, which consists of T × |L| × K assertions, where T is the number of MICKEYPROBE items and each item has K assertions in each of the |L| languages. Finally, we can formally describe a multilingual parallel dataset M for MICKEYPROBE:

M = { M | M^{l_x}_i ⋈ M^{l_y}_i, ∀ i ∈ [1, K], ∀ l_x, l_y ∈ L },    (1)

where the notation ⋈ indicates that two assertions in different languages (e.g., l_x and l_y) are semantically equivalent to each other. We leave the details of creating such an M to Section 4.

Commonsense Probing Task. Given a Mickey probe M in the dataset M, and supposing the index of the truth assertion is t, a perfect multilingual language model would produce sentence probabilities such that it always gives the truth assertion M^l_t the highest probability among the candidates, for every language l.

Figure 2: A Mickey Probe example M has versions in different languages (e.g., M^en, M^zh), each of which is a set of 5 assertions. We rank assertions in the same language by their PLLs to probe common sense in ML-LMs across different languages.
Properly computing sentence probabilities from masked language models is still an open problem, but the recently proposed pseudo-log-likelihood scoring (PLL) (Salazar et al., 2020) has shown promising results in many downstream NLP applications that need sentence re-ranking (e.g., speech recognition and translation), suggesting it is a promising proxy for sentence probability. Given a sentence s, its PLL is defined as:

PLL(s) := Σ_{i=1}^{|s|} log P(w_i | s_{\i}).    (3)

That is, we mask each token w_i in turn and use the remaining context s_{\i} to obtain the probability of w_i in the sentence s. Finally, we sum these log-probabilities to approximate log P(s).
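The PLL definition above can be sketched directly; `cond_prob` is a hypothetical stand-in for a masked LM's probability of the true token at a masked position:

```python
import math

def pseudo_log_likelihood(tokens, cond_prob):
    """PLL sketch: mask each position in turn and sum the log-probability
    the (mocked) masked LM assigns to the true token given the rest of
    the sentence. `cond_prob(context, position, token)` receives the
    sentence with position replaced by [MASK]."""
    total = 0.0
    for i, w in enumerate(tokens):
        context = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
        total += math.log(cond_prob(context, i, w))
    return total
```

Because each position is scored with full bidirectional context, PLL suits masked LMs better than a left-to-right chain-rule factorization would.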
Evaluation Metric. The evaluation metric for MICKEYPROBE over a multilingual parallel dataset M in a specific language l is the overall hit@k accuracy of the selection results:

hit@k(l) = Σ_{M ∈ M} 1{truth-rank(M^l) ≤ k} / |M|,

where truth-rank(M^l) is the position of the truth assertion M^l_t within M^l when its assertions are sorted by the PLLs defined in Eq. (3). hit@1 is equivalent to conventional accuracy.
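The hit@k metric can be sketched over precomputed PLL scores; the convention that the truth assertion is stored first in each probe's score list is an assumption made for this illustration:

```python
def hit_at_k(plls, k=1):
    """Fraction of probes whose truth assertion is ranked within the top
    k by PLL. `plls` is a list of per-probe score lists with the truth
    assertion's PLL first (an illustrative layout), followed by the
    K-1 distractors' PLLs."""
    hits = 0
    for scores in plls:
        # 1 + number of distractors strictly outscoring the truth
        truth_rank = 1 + sum(1 for s in scores[1:] if s > scores[0])
        hits += truth_rank <= k
    return hits / len(plls)
```

With k=1 this reduces to plain selection accuracy, matching the text.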
Advantages of MICKEYPROBE. There are two key advantages of MICKEYPROBE for evaluating ML-LMs: (1) sentence-level probability applies more generally across languages than the LAMA probe, which only studies single-token English words; (2) the task formulation creates a relatively closed-ended setting, so that we can use a language-independent evaluation metric to fairly compare across languages within a ML-LM, and across ML-LMs for a particular language. In addition, we can see the LAMA Probe as a monolingual, word-level special case of the more general MICKEYPROBE: L = {en}, and M contains a single probe M^en whose K is the vocabulary size, i.e., a fixed [mask] is filled by every token in the vocabulary.

The Mickey Corpus and Evaluation
We present a procedure for automatically creating a multilingual parallel dataset M for the probing task MICKEYPROBE. Our collected corpus, named MickeyCorpus, has 561k sentences in 11 languages (T=10.2k, K=5, |L|=11).

Creating English Probes
For the correct commonsense assertions in English, we use an existing resource, the OMCS corpus (Singh et al., 2002), which contains human-written English sentences that describe commonsense facts. Each assertion can be used as a truth assertion M^en_t, on which we perform perturbations to create the other K−1 distractor assertions (i.e., false candidates), yielding an M^en example.
Inspired by the BERT-Attack method, we use a simple method to generate false assertions that are semantically related and syntactically similar to the truth assertions. Given a correct assertion, we first randomly sample a few (1 to 3) words with a part-of-speech tag of noun, verb, or adjective, and replace them with [mask]. Then, we use a beam-search-style method to decode the [mask] tokens one by one from left to right. To ensure that the distractors are less plausible, we limit the decoding steps to only sample tokens that rank between 200th and 300th. We repeat the above procedure multiple times with different sets of [mask] tokens. Then, we use Stanza to remove distractors whose sequences of POS tags or morphological features differ from those of the truth assertion. Finally, we sample K−1 of them as the distractors.
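The perturbation procedure can be sketched as follows; `rank_tokens` is a hypothetical hook returning the vocabulary sorted by a masked LM's probability for a position, and the Stanza POS/morphology filter is omitted:

```python
import random

def make_distractor(tokens, content_idx, rank_tokens, rng,
                    n_swap=(1, 3), rank_window=(200, 300)):
    """Distractor-generation sketch: pick 1-3 content-word positions and
    refill each, left to right, with a token sampled from ranks 200-300
    of the (mocked) model distribution, so the result stays fluent but
    less plausible. `content_idx` lists noun/verb/adjective positions."""
    out = list(tokens)
    k = rng.randint(n_swap[0], min(n_swap[1], len(content_idx)))
    for pos in sorted(rng.sample(content_idx, k)):  # decode left to right
        ranked = rank_tokens(out, pos)
        lo, hi = rank_window
        out[pos] = rng.choice(ranked[lo:hi])
    return out
```

In the full pipeline this would be run several times per truth assertion, with Stanza then discarding candidates whose POS-tag sequence changed.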

Scaling to Ten Other Languages
We use bidirectional translation with the MarianMT models (Junczys-Dowmunt et al., 2018) pretrained on the OPUS corpora (Tiedemann, 2016). We translate all English probes to the 25 languages that have models in both directions, and then translate them back to English. As the outputs from these models might contain noise and errors, we compute the semantic similarity (i.e., cosine similarity) between the original M^en and the back-translated M^{x-en} via the SentenceBERT (Reimers and Gurevych, 2019) model.
To ensure quality and fair comparisons, we set a similarity threshold of 0.75 and keep the intersection of probes across all languages. Considering that some languages tend to have lower-quality translations, we finally choose the 10 best languages to build the Mickey Probe dataset for our analysis, yielding 10.2k examples in each language and 10.2k × 5 × 11 ≈ 561k sentences in total. The language set is L = {en, de, fr, ru, es, hi, vi, bg, zh, nl, it}. Note that our purpose in checking the back-translation quality here is mainly to keep only the high-quality translations for all language pairs that we consider. Conventional metrics, e.g., the BLEU score (Papineni et al., 2002), which focus on exact word matches, are thus less suitable: given the original sentence "I have a book", the translation results "I have a novel" and "I have a tool" would be seen as equally wrong. Inspired by BERTScore, our BT-Cosine score is based on SentenceBERT, which gives a higher score to the former and a lower score to the latter, owing to the semantic relatedness between "novel" and "book." We observed that most of our back-translations are in similar situations, and thus decided to use BT-Cosine instead.
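The BT-Cosine filter can be sketched as below; `embed` is a stand-in for a SentenceBERT-style sentence encoder, and `keep_probe` is a hypothetical helper name:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense sentence embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def keep_probe(embed, original, back_translations, threshold=0.75):
    """A probe survives only if every language's back-translation stays
    semantically close to the original English sentence, which realizes
    the 'intersection across all languages' step in the text."""
    e0 = embed(original)
    return all(cosine(e0, embed(bt)) >= threshold for bt in back_translations)
```

Filtering on the intersection keeps the corpus parallel: a probe is either present in all 11 languages or dropped entirely.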

Analyzing ML-LMs with Mickey Probes
We now use the MickeyCorpus to evaluate the 5 pre-trained ML-LMs introduced in Section 2.1: d-mBERT (Sanh et al., 2019), mBERT (Devlin et al., 2019), XLM (Conneau and Lample, 2019), XLM-R Base, and XLM-R Large (Conneau et al., 2020). All these ML-LMs' pretraining objectives contain masked-word-prediction tasks, so we can easily use PLLs (Eq. 3) to probe them in a zero-shot, supervision-free manner with hit@1 accuracy. (The hit@2 results are shown in the Appendix.) We present a histogram in Figure 3 and show the concrete results in Table 1. We find that there are always large discrepancies across different languages in all tested ML-LMs, which motivates us to analyze the following questions.
Q1: Do different ML-LMs have similar language preferences? No. We arrange the languages in the same order for all ML-LMs in Figure 3, namely the monotonically descending order of XLM-R L. Interestingly, we find that different ML-LMs are good at different languages, resulting in a very diverse set of trends. For example, XLM-R B has higher performance in it than in zh and fr, unlike XLM-R L, which is pre-trained on the same corpora with the same objectives. mBERT and d-mBERT have stronger performance in fr than in nl and de, unlike XLM and XLM-R.
Q2: Does length influence PLL ranking? Not much. The PLL computation indeed tends to prefer shorter sequences (see Eq. 3), so one may wonder whether the length of assertions influences the probing results. The "Shortest" row in Table 1 presents the results when we always select the shortest assertion within a probe, instead of ranking by PLL. The large gaps between these scores and XLM-R L's suggest that the probing task indeed uses PLL as a valid proxy for evaluating common sense based on sentence-level semantics rather than length.
Q3: Is the translation quality a key factor? We show "BT-Cosine", the mean of the cosine scores between the original English sentences and the back-translated ones, and sort the table by these numbers. The first 5 languages, {de, it, es, fr, nl}, have the largest BT-Cosine, i.e., the best translation quality, and they indeed show better performance in general for the XLM-R models. However, although zh has a worse BT-Cosine than vi, all ML-LMs perform better in zh than in vi. Thus, we believe the translation quality of MickeyCorpus is not a major factor influencing our understanding of ML-LMs. This suggests that further study should instead examine each ML-LM's pre-training corpora in different languages.
Q4: Does the size of pre-training corpora matter? We list the size of the monolingual corpus in each language of CC-100, on which XLM-R is pretrained (i.e., the CC-size row). Although ru has a much larger corpus than de, it, etc., the XLM-R performance in ru is much worse. In addition, fr and nl have almost the same translation quality, and fr's CC-size is twice that of nl, yet the performance in fr is still much worse than in nl. We conjecture this could be due either to the design of the sub-token vocabulary or to the text quality (rather than the size) of the CC-100 corpora.
Further implications. The benchmark results of five popular ML-LMs on the MICKEYPROBE task over the MickeyCorpus offer an initial, valuable understanding of the commonsense knowledge of ML-LMs under a unified evaluation protocol. One can either compare a ML-LM across different languages or compare ML-LMs for a particular language in Table 1. These comparable results support further analysis that can benefit the development of ML-LMs in the future. After all, even the best ML-LM, XLM-R L, degrades considerably in non-English languages, and also performs slightly worse than RoBERTa L in en (93.4%). We argue that (culture-invariant) commonsense knowledge should be seen as an important way to connect multiple languages and thus better align them in the shared embedding space induced by a ML-LM.

Multilingual Contrastive Pre-Training
In this section, we reformat the MICKEYPROBE so that we can reuse the MickeyCorpus for improving pre-trained ML-LMs for commonsense reasoning beyond English. We propose a multilingual contrastive pre-training (MCP) task that focuses on enhancing the sentence-level representations of ML-LMs. MCP improves a ML-LM in a multilingual, contrastive environment, where the model learns to select the assertion with the best commonsense plausibility from a set of contrastive sentences in different languages. Each MCP example is a multilingual set of assertions, whereas each Mickey probe is a monolingual set. MCP Learning. Given a MCP example C ∈ C, we append one dense linear layer f on top of a ML-LM, with parameters denoted Θ_{ML-LM}, for learning to predict the commonsense plausibility score of each assertion C_i ∈ C as follows:

o_i = f(h_i; Θ_f),   z = SoftMax(o),   ℒ = − Σ_i 1_i log z_i.    (4)

We first get the logit o_i of each assertion by projecting its [CLS] embedding h_i via a dense layer f with parameters Θ_f; then we use SoftMax to normalize the logits into plausibility scores z_i; finally, we compute the cross-entropy loss ℒ, where 1_i = 1 if C_i is the correct assertion and 0 otherwise. We fine-tune {Θ_{ML-LM}, Θ_f} to minimize the overall loss over the MCP dataset C.
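The loss computation over one MCP example can be sketched numerically; the logits stand in for the dense head's outputs over each assertion's [CLS] embedding:

```python
import math

def mcp_loss(logits, truth_index):
    """Softmax cross-entropy over one MCP example: normalize the
    per-assertion logits into plausibility scores z_i and penalize the
    negative log-score of the correct assertion."""
    m = max(logits)                                # shift for stability
    exps = [math.exp(o - m) for o in logits]
    total = sum(exps)
    z = [e / total for e in exps]
    return -math.log(z[truth_index])
```

Raising the correct assertion's logit relative to the distractors' drives the loss toward zero, which is what fine-tuning {Θ_ML-LM, Θ_f} optimizes over the whole MCP dataset.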

Evaluation for Cross-lingual CSR
In this section, we introduce the datasets, experimental setup, results, and our analysis.
Algorithm 1: Convert a Mickey probe M into an example for the MCP task.
In: M ∈ M. /* A probe with |L| sub-sets; each sub-set M^{l_x} is a set of K assertions in the same language l_x ∈ L, and M^{l_x}_t is always the truth. */
Out: C. /* A set of V assertions in different languages. */
Remarks: Sample_n(X) randomly samples n unique elements from a set X.
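One plausible reading of Algorithm 1 can be sketched as below; the exact sampling strategy (how many languages, how distractors are allocated) is an assumption for illustration, and `probe_to_mcp` is a hypothetical helper name:

```python
import random

def probe_to_mcp(probe, truth_index, num_langs, per_lang, rng):
    """Sketch of converting a Mickey probe to an MCP example: keep the
    truth assertion in one randomly chosen language and fill the other
    slots with distractor assertions drawn from several sampled
    languages, yielding a multilingual contrastive set with exactly one
    correct answer. `probe` maps language code -> list of K assertions,
    with the truth at `truth_index` in every language."""
    langs = rng.sample(sorted(probe), num_langs)
    truth_lang = rng.choice(langs)
    example = [(truth_lang, probe[truth_lang][truth_index])]
    for lang in langs:
        distractors = [a for i, a in enumerate(probe[lang]) if i != truth_index]
        for a in rng.sample(distractors, per_lang):
            example.append((lang, a))
    rng.shuffle(example)
    return example
```

Because the candidates mix languages, the model cannot solve an example with language identification alone; it must compare plausibility across languages.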

X-CSQA & X-CODAH: Two New Benchmarks for Evaluating XCSR
To evaluate ML-LMs for commonsense reasoning in a cross-lingual zero-shot transfer setting, we create two benchmark datasets, namely X-CSQA and X-CODAH. Table 3 shows the statistics of the two datasets. Specifically, we use online commercial services such as DeepL Pro Translate to collect high-quality translations of the examples in CSQA and CODAH for 15 languages other than English. Since CODAH is small (only 2.7k examples), we use 7k SWAG validation examples, which share the same formulation, as additional training data. We discuss the reduction of cultural differences and the quality control of automatic translations, as well as other details, in the Ethical Considerations (the paragraph on cultural bias reduction) and Appendix A. As our goal is to evaluate different ML-LMs (instead of different languages) in a unified evaluation protocol for cross-lingual commonsense reasoning, we argue that such automatically translated examples, though possibly noisy, can serve as a starting benchmark for obtaining meaningful analysis before human-translated datasets become available in the future.

Setup
We focus on the 4 popular ML-LMs introduced in Section 2.1: mBERT, XLM-100, XLM-R B, and XLM-R L, as well as our proposed MCP method. For both tasks, we concatenate each prompt (the question or first sentence) with each candidate option to form an input sequence for scoring.
Why zero-shot cross-lingual transfer? It is almost impossible to collect data in all languages that an NLU system might serve. Therefore, prior works mainly focus on zero-shot cross-lingual transfer (Conneau et al., 2018), which is more meaningful and offers a lower-bound performance analysis. It is also an ideal setting for studying CSR, because most commonsense facts are language-invariant; thus, an English-finetuned ML-LM for CSR should be able to transfer its ability to a wide range of other languages as well. Furthermore, since our goal in this paper is to evaluate and improve ML-LMs, translating back to English and then using an English-only LM would not serve this end.

Experiments for Cross-lingual CSR
In Table 2, we present the empirical results on X-CODAH and X-CSQA for the ML-LMs as well as two models enhanced by our proposed MCP method. On both tasks, XLM-R L performs best by a large margin. Enhanced by the MCP method, both XLM-R B and XLM-R L see significant improvements (e.g., a 2.7% absolute improvement for XLM-R L on X-CSQA-avg).
Can MCP's improvement generalize to unseen, low-resource languages? Note that the MCP dataset only involves 9 languages, and 6 languages are totally unseen in MCP training (i.e., {pl, ar, ja, pt, sw, ur}). The largest performance gain is in ru on X-CSQA and in vi on X-CODAH. Surprisingly, we find the improvements on the unseen languages are also large for XLM-R L (e.g., 48.4 → 52.3 for ar). In addition, for the two low-resource languages sw and ur, MCP also brings 2 to 3 percentage points of improvement for XLM-R L. This is, however, not always the case for XLM-R B, which we conjecture is more prone to overfitting. Although ML-LMs enjoy the merits of zero-shot cross-lingual transfer, their performance is usually worse than the English-only RoBERTa L on the en-test (70.4% vs 66.7% for X-CSQA). Although MCP can mitigate the gap (70.4% vs 69.5%) for X-CSQA, a large gap (81.6% vs 69.9%) remains for X-CODAH. We use Fig. 4 to analyze how the different categories of commonsense reasoning in CODAH vary across languages. We find that others, reference, and negation have relatively smaller variances across languages, as they are more language-invariant. However, a few polysemous and idiom examples can be English-specific and may not generalize to other languages. More detailed analysis is in the Appendix.
From the dev-accuracy curves in Figure 5, we see that the MCP-enhanced XLM-R models are much more sample-efficient and converge much faster than the vanilla versions. This suggests that MCP, if applied to a larger corpus with broader topics, could potentially produce a better ML-LM of more general use, especially when only limited labelled data is available. Our results on XNLI-10% (using 10% of the training data) (Conneau et al., 2018) show that MCP-enhanced XLM-R L gains 1.2 points of accuracy on average over 15 languages. As our focus in this paper is commonsense reasoning, we leave the study of other cross-lingual NLU tasks as future work. Importantly, our experiments imply that a proper (continual) pre-training task with a (contrastive) sentence-level objective can improve both final performance and learning efficiency.

Conclusion
We evaluate and improve popular multilingual language models (ML-LMs) for advancing commonsense reasoning beyond English. We propose MICKEYPROBE, a language-agnostic probing task for analyzing the common sense of ML-LMs in a zero-shot manner. With our new benchmark datasets created via automatic translation, X-CSQA and X-CODAH, we evaluate ML-LMs in a cross-lingual transfer setting for commonsense reasoning. We also improve the state-of-the-art ML-LMs with a simple yet effective method, multilingual contrastive pre-training, which uses a sentence-level objective to enhance sentence representations, yielding a significant performance gain. All of the above work is based on MickeyCorpus, which can serve as both a probing dataset and a pretraining corpus for analyzing and improving ML-LMs. We hope our resources and pre-training method for ML-LMs can help the community advance commonsense reasoning beyond English.

* Ethical Considerations
Resource Copyright This work presents three new resources: MickeyCorpus, X-CODAH, and X-CSQA, which are multilingual extensions of OMCS (Singh et al., 2002), CSQA (Talmor et al., 2019), and CODAH (Chen et al., 2019), respectively. All three original sources of the data are publicly available for free, and we do not add any additional requirements for accessing our resources. We will highlight the original sources of our data and ask users to cite the original papers when they use our extended versions for research.
Cultural Bias Reduction Like most multilingual parallel resources, especially in the general NLU domain, there exists potential data bias due to the barriers between languages as well as cultural differences (Acharya et al., 2020; Lin et al., 2018), which can induce labeling differences for the same situation. For example, a question like "What do people usually drink in the morning? (coffee/tea/milk)" or "When does a wedding usually start? (morning/afternoon/evening)" might be answered very differently by people from different backgrounds and cultures, not to mention different languages. The prior English commonsense resources on which our datasets are built already possess such inherent bias, even within the English language. Therefore, before translating CSQA and CODAH, we intentionally removed the examples that were either labeled as non-neutral by a pre-trained sentiment classifier or contained keywords relevant to social behavior (e.g., weddings). We manually inspected the English and Chinese test examples in X-CSQA and X-CODAH and are confident that there are few strongly controversial examples. However, we admit that such reduction of cultural differences in common sense has not been systematically measured in this work for other languages.
Application Risks of Cross-lingual CSR
This work also evaluates a few multilingual language models (ML-LMs) for cross-lingual commonsense reasoning (XCSR) and introduces a new method which outperforms them. This raises the question of whether harm might arise from applications of XCSR, or, since XCSR is intended as a step toward making English-only CSR applicable in other languages, whether harm might arise more generally from existing ML-LMs. Among the risks that must be considered in any deployment of NLP technology is that responses may be wrong or biased in ways that would lead to improperly justified decisions. Although in our view the current technology is still relatively immature and unlikely to be fielded in applications that would cause harm of this sort, it is desirable that ML-LMs provide audit trails and recourse, so that their predictions can be explained to and critiqued by affected parties.

B Hyper-parameters
We summarize the hyper-parameters used for training ML-LMs on X-CODAH and X-CSQA in Table 7. Evaluation steps are 100 for all models and datasets. The maximum sequence length is 100 for X-CODAH and 64 for X-CSQA. The batch size here refers to "train batch size per device × # GPUs × # gradient accumulation steps". Note that the MCP methods use exactly the same hyper-parameters, which we found optimal by tuning on the dev set. The learning rates we tried for all models are from the range {3e-5, 2e-5, 1e-5, 8e-6, 6e-6, 5e-6}, and the warm-up steps are selected from {50, 100, 200, 300, 500}. Table 4 shows the model architectures and sizes that we used, from Conneau et al. (2020). We show the tokenization (tnz) used by each Transformer model, the number of layers L, the hidden size H_m, the dimension of the feed-forward layer H_ff, the number of attention heads A, the vocabulary size V, and the total number of parameters #params.
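The search space described above can be written out explicitly; the grid below only enumerates the candidate values stated in the text, while the per-model winning values are those reported in Table 7:

```python
from itertools import product

# Candidate values from the hyper-parameter search described above.
LEARNING_RATES = [3e-5, 2e-5, 1e-5, 8e-6, 6e-6, 5e-6]
WARMUP_STEPS = [50, 100, 200, 300, 500]

# Full grid of (learning rate, warm-up steps) configurations to sweep.
grid = list(product(LEARNING_RATES, WARMUP_STEPS))
```

This yields 6 × 5 = 30 configurations per model; in practice each would be evaluated on the dev set and the best kept.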

D Additional Experimental Results
D.1 Hit@1 Accuracy in Histogram

D.2 Hit@k Accuracy of Mickey Probes
Table 5 shows the hit@2 accuracy of the five ML-LMs on the MickeyProbe task, i.e., whether the models rank the correct assertion within the top 2. Unlike hit@1, which only accepts the best prediction, hit@2 is more flexible; thus, the hit@2 performance increases compared with hit@1. We can see that the discrepancies across languages still exist.

D.3 Categorized X-CODAH Analysis
Please refer to the CODAH (Chen et al., 2019) paper for the definitions and concrete examples of each category. We show the benchmark results of MCP(XLM-R L) on X-CODAH within the different categories in Table 6. RB stands for fine-tuning the RoBERTa-Large model on the English X-CODAH dataset. We find that the largest gaps in en are in the Idioms and Others categories. Interestingly, the Quantities category is where MCP performs better than RoBERTa-Large.