Breaking the Language Barrier: Improving Cross-Lingual Reasoning with Structured Self-Attention

In this work, we study whether multilingual language models (MultiLMs) can transfer logical reasoning abilities to other languages when they are fine-tuned for reasoning in a different language. We evaluate the cross-lingual reasoning abilities of MultiLMs in two schemes: (1) where the language of the context and the question remain the same in the new languages that are tested (i.e., the reasoning is still monolingual, but the model must transfer the learned reasoning ability across languages), and (2) where the language of the context and the question is different (which we term code-switched reasoning). On two logical reasoning datasets, RuleTaker and LeapOfThought, we demonstrate that although MultiLMs can transfer reasoning ability across languages in a monolingual setting, they struggle to transfer reasoning abilities in a code-switched setting. Following this observation, we propose a novel attention mechanism that uses a dedicated set of parameters to encourage cross-lingual attention in code-switched sequences, which improves the reasoning performance by up to 14% and 4% on the RuleTaker and LeapOfThought datasets, respectively.


Introduction
Recent studies show that language models (LMs) are capable of logically reasoning over natural language statements (Clark et al., 2020b), reasoning with their implicit knowledge (Talmor et al., 2020), and performing multi-step reasoning via chain-of-thought prompting when the model size is large enough (Wei et al., 2022b; Kojima et al., 2022; Wei et al., 2022a). A separate line of work has focused on pre-training language models on multilingual corpora to enable knowledge transfer across different languages. These efforts led to multilingual language models (MultiLMs) such as mBERT (Devlin et al., 2019), mT5 (Xue et al., 2021), and XLM-R (Conneau et al., 2020), which have been shown to generalize in a zero-shot cross-lingual setting (Pires et al., 2019a; Conneau and Lample, 2019). The cross-lingual transfer is often enabled by fine-tuning the MultiLM on a high-resource language (typically English) and then evaluating it on other target languages.
However, as most of the recent efforts on reasoning-related tasks have been centered around English, our knowledge of the multilingual reasoning capabilities of language models remains limited. In this work, we investigate the logical reasoning capabilities of MultiLMs, especially in monolingual and structured code-switched settings (Figure 1). In the monolingual setting, the context and the question are in the same language. In the structured code-switched setting, the context and question are in two different languages. The code-switched setting arises in many realistic scenarios, such as when non-English speakers ask questions about information that is unavailable in their native language (Asai et al., 2021).
For both reasoning settings, we conduct experiments using the RuleTaker dataset (Clark et al., 2020b), which contains artificial facts and rules, and the LeapOfThought dataset (Talmor et al., 2020), which incorporates real-world knowledge into the reasoning context. Our results show that although MultiLMs perform well when fine-tuned in different languages (i.e., high in-language performance when fine-tuning and testing on the same language), their cross-lingual transfer can vary considerably, especially in the code-switched setting. We posit that the lack of code-switched data in MultiLM pre-training makes fine-tuning on code-switched data inconsistent with pre-training.
To improve the code-switched reasoning capabilities of MultiLMs, we propose two methods. First, we propose a dedicated cross-lingual query matrix (section 4.1) to better model cross-lingual attention when the MultiLMs receive code-switched sequences as input. This query matrix is pre-trained on unsupervised code-switched data, either shared across all language pairs or specific to a single one. Then, we propose a structured attention dropout (see section 4.1), in which we randomly mask attention between tokens from different languages (i.e., context-question attentions) during training. This masking makes the fine-tuning phase more consistent with the pre-training by regularizing cross-lingual attention.
By combining the two methods, we also experiment with an interfering variant of the cross-lingual query, which considerably improves cross-lingual generalization, especially in code-switched settings. We evaluate our methods in the code-switched setting and show they improve the cross-lingual transfer of MultiLMs by 14% and 4% for the RuleTaker and LeapOfThought datasets, respectively.

Motivation
Most prior work on reasoning with language models remains limited to monolingual (English) systems (Han et al., 2021; Sanyal et al., 2022; Shi et al., 2023; Tang et al., 2023). In this work, we investigate the reasoning abilities of MultiLMs, formulating an analysis in formal reasoning that evaluates MultiLMs on their ability to resolve logical statements. Given a set of facts and rules as context (in natural language sentences), the task is to predict whether a given statement is true.
In our multilingual reasoning setting, we assume a given set of languages L = {L1, L2, ..., LN}, and define Lc and Lq as the context and statement languages, respectively. Typically, MultiLMs are evaluated in a monolingual setup where Lc = Lq. However, if MultiLMs are truly multilingual, we posit that they should also be able to reason in a scenario where Lc ≠ Lq. Thus, to evaluate the multilingual reasoning ability of MultiLMs, we first define four different evaluation setups based on the language of the context or statement: (1) both the context and statement are always in one language (monolingual reasoning); (2) the context is always in one language, and the statement can be in any language; (3) the context can be in any language, but the statement is always in one language; and (4) both the context and statement can be in any language.
To have a reasonable baseline to compare with the code-switched setups, we first focus on the monolingual evaluation (1), in which we evaluate the reasoning ability of MultiLMs for nine typologically different languages. Then, by fine-tuning the models on code-switched data, we evaluate their performance for setups (2) and (3), where either the language of the context or the language of the question differs from the training data. This evaluation aims to study the possibility of teaching models to reason across languages in a code-switched setting, and to investigate the extent to which they can transfer their reasoning to other code-switched data formats. Finally, we hypothesize that in order to succeed in setup (4), the model would have to be strong in setups (2) and (3). Since our experimental results show that MultiLMs struggle in these two setups, we focus on improving their performance for setups (2) and (3).

Multilingual Reasoning
In this section, we describe our evaluation of the logical reasoning capabilities of MultiLMs for monolingual and code-switched settings.

Analysis Setup
We run our experiments on two datasets focusing on multi-hop logical reasoning over natural language knowledge: RuleTaker. This is a set of five datasets, each constrained by the maximum depth of inference required to prove the facts used in its questions (Clark et al., 2020b). This dataset is generated under the closed-world assumption (CWA), assuming a statement is false if it is not provable. Each example consists of facts and rules (i.e., context) and a statement (more details in Appendix A.1).
LeapOfThought (LoT). This dataset comprises ∼30K true or false hypernymy inferences, verbalized using manually written templates (Talmor et al., 2020). The hypernymy relations and properties are derived from WORDNET (Fellbaum, 1998) and CONCEPTNET (Speer et al., 2017). This dataset contains two main test sets: EXPLICIT REASONING, where inference is performed over explicit natural language statements, and IMPLICIT REASONING, where the model must reason by combining the context with missing information that should be implicitly encoded in the model. We create a modified version of LoT and use the IMPLICIT REASONING test set in our evaluation. The dataset modification pipeline and the reason for using only the IMPLICIT evaluation setting are further discussed in Appendix A.2.
Models. We conduct all our experiments using the cased version of multilingual BERT (mBERT; Devlin et al. 2019) and the base version of XLM-R (Conneau et al., 2020). We train a binary classifier on top of the model's classification token (e.g., [CLS] in mBERT) to predict whether a given statement/question is true or false. The model's input is [CLS] context [SEP] statement [SEP], and the [CLS] output token is used for the classification. For evaluation, we measure the model's accuracy. We use full fine-tuning for these experiments. The random baseline is 50% (binary classification).
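The input construction can be sketched as follows (a minimal illustration with pre-split tokens; the real pipeline uses the mBERT/XLM-R subword tokenizers, and the segment convention here mirrors BERT-style segment embeddings):

```python
def build_input(context_tokens, statement_tokens):
    """Assemble the classifier input: [CLS] context [SEP] statement [SEP].

    Returns the token sequence and a parallel segment list marking which
    tokens belong to the context span (0) vs. the statement span (1).
    """
    tokens = ["[CLS]"] + context_tokens + ["[SEP]"] + statement_tokens + ["[SEP]"]
    segments = [0] * (len(context_tokens) + 2) + [1] * (len(statement_tokens) + 1)
    return tokens, segments

# Toy monolingual example; in the code-switched setting the two halves
# would come from different languages.
tokens, segments = build_input(["the", "cat", "is", "blue"],
                               ["the", "cat", "is", "red"])
```

The [CLS] position (segment 0, index 0) is the one fed to the binary true/false classifier.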
Languages. Both the RuleTaker and LoT datasets are only available in English. We translated these two datasets into eight languages using the Google Translate API. We have chosen typologically diverse languages covering different language families: Germanic, Romance, Indo-Aryan, and Semitic, including both high- and medium-resource languages from the NLP perspective. These languages are French (fr), German (de), Chinese (zh), Russian (ru), Spanish (es), Farsi (fa), Italian (it), and Arabic (ar).

Reasoning Over Monolingual Data
The average in-language and cross-lingual zero-shot performance of mBERT for each source language is depicted in Table 1. For the cross-lingual zero-shot performance, we first fine-tune models on a single source language, test them on all other languages, and then average these results.
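These two aggregates can be computed from a source-by-target accuracy matrix as in the following sketch (the numbers are made up for illustration):

```python
import numpy as np

def in_language_and_crosslingual(acc):
    """acc: (N, N) matrix with rows = source (fine-tuning) language and
    columns = target (evaluation) language."""
    acc = np.asarray(acc, dtype=float)
    in_lang = np.diag(acc)                        # train and test on the same language
    n = acc.shape[0]
    off_diag = acc[~np.eye(n, dtype=bool)].reshape(n, n - 1)
    cross = off_diag.mean(axis=1)                 # average over all other target languages
    return in_lang, cross

# Toy 3-language accuracy matrix (illustrative values only).
acc = [[0.95, 0.80, 0.78],
       [0.82, 0.93, 0.79],
       [0.75, 0.77, 0.90]]
in_lang, cross = in_language_and_crosslingual(acc)
```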
On the RuleTaker dataset, the model is able to learn the task for the Depth-0 subset nearly perfectly for almost all the languages, exhibiting relatively high cross-lingual transfer performance (∼87%). However, for models trained on higher depths (i.e., requiring more reasoning hops), the model's performance drops for both in-language and cross-lingual evaluation settings, and the performance gap between different source languages increases. Moreover, when increasing the depth, zero-shot cross-lingual performance suffers more than in-language performance, showing that as the complexity of the task increases, it becomes harder to generalize to other languages.
For the LoT dataset, the model must learn to reason by combining its implicit knowledge of hypernyms with the given explicit knowledge. However, the model's performance differs across languages, suggesting that the model's ability to access and use its implicit knowledge is not the same for all languages. We also observe that a language with high in-language performance does not necessarily have high zero-shot cross-lingual performance. We hypothesize that for some languages, the model starts learning language-specific noise that does not generalize to other languages.
We generally observe the same patterns for the XLM-R model (see Appendix B) when fine-tuned on the monolingual RuleTaker and LoT datasets.

Reasoning Over Code-Switched Data
When we fine-tune the model using a code-switched data format, the context is in one language and the statement is in another. In our experiments, we use English as an anchor language for the context (i.e., en-X) or for the statement (i.e., X-en). In the fine-tuning phase, we learn the task using the en-X data format, and evaluate it on both en-X and X-en data formats. The models' average in-language and zero-shot cross-lingual performance are shown in Table 2.
For Depth-0 of the RuleTaker dataset, mBERT is able to learn the reasoning task almost perfectly for most languages. As the depth of the task increases, the performance of the code-switched reasoning declines. This decline is more pronounced at higher depths compared to the monolingual scenario. While the model is capable of learning reasoning within this framework, its zero-shot generalization to other code-switched data, such as en-X (where the context language remains English but the statement language changes), is poor. Reasoning over two languages poses a greater challenge than reasoning within monolingual data due to the need for information alignment across languages. Consequently, the transferability of such tasks to other language pairs becomes more challenging.

Table 1: In-language and cross-lingual zero-shot performance (accuracy) of the mBERT model for the RuleTaker and LeapOfThought datasets. In-language performance corresponds to evaluating the model in the same language as the training data.
On the LoT dataset, the model performs quite well on the code-switched data, outperforming the monolingual scenario for nearly all languages. The relatively high code-switched performance shows that the language of the context plays an important role in accessing the implicit knowledge encoded in the model's parameters, as the model must rely on this knowledge to solve the task. Providing the context in English facilitates access to implicit knowledge compared to other languages. This is also in line with the empirical observation that generalization to X-en is considerably worse than to en-X. We generally observe the same pattern for XLM-R (see Appendix B) when fine-tuned on the code-switched RuleTaker and LoT datasets.
Following the empirical observations showing MultiLMs struggle to transfer reasoning abilities in a code-switched setting, we propose a novel attention mechanism to mitigate the lack of code-switched transfer in these models.
Cross-lingual-aware Self-Attention

Although MultiLMs have been pre-trained on multilingual corpora, individual inputs to the model stay mostly monolingual (Devlin et al., 2019; Conneau et al., 2020). When these models are fine-tuned on a code-switched downstream task, unlike the pre-training phase, tokens from different languages can attend to each other, which, as demonstrated in Tables 2 and 8, results in poor generalization to other code-switched language pairs. We also observe that self-attention patterns change considerably when we compare the attention patterns of in-language and cross-lingual code-switched samples (see Figure 4).

Approach
In order to make the fine-tuning phase more consistent with the pre-training, we propose two methods to better handle the cross-lingual interactions of tokens.
Cross-lingual Query To better model cross-lingual attention for code-switched tasks, we pre-train a cross-lingual query matrix Q_cross (while keeping all other parameters frozen) on code-switched unsupervised data (more experimental details in section 4.2). More specifically, we use two sets of attention masks, M1 and M2, where M1 enforces the query matrix Q to focus only on monolingual attentions, and M2 constrains the cross-lingual query Q_cross to cross-lingual attentions (see Figure 3a). Formally, the self-attention probabilities for a given attention head, up to a (row-wise) normalization term, are computed as

A ∝ exp((M1 ⊙ (QK^T) + M2 ⊙ (Q_cross K^T)) / √d)

where Q and K are the query and key matrices, d is the model's hidden dimension, and ⊙ represents the Hadamard product. It is worth noting that this scheme still allows attention between all tokens; however, monolingual and cross-lingual attentions are handled by different query matrices. The proposed Q_cross can either be pre-trained for a single language pair (e.g., the en-fr pair, where the context is in English and the question/statement is in French), or it can be shared across many language pairs. We show in Section 4.3 that having a language-pair-specific Q_cross enables modularity, meaning a model trained on a given source language pair can perform considerably better on another language pair by simply swapping the source Q_cross matrix with the target one.

Figure 2: Illustration of the drop attention scheme. Due to the input's code-switched structure, we want to limit the attention between context and question tokens. Tokens from the same language can fully attend to each other, but dropout (white cells) is applied to cross-lingual attention. To ensure a reliable bridge between context and question, the first token (e.g., [CLS] in mBERT) attends to all tokens, and all tokens attend to the first token.
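As a concrete illustration, here is a single-head NumPy sketch of the dual-query attention (the shapes, the shared key matrix, and the softmax placement are our assumptions for illustration; the real method operates on every head of the MultiLM):

```python
import numpy as np

def dual_query_attention(X, Wq, Wq_cross, Wk, M1, M2):
    """Self-attention where monolingual score entries (M1 == 1) come from Q
    and cross-lingual entries (M2 == 1) come from Q_cross.

    X: (n, d) token representations; M1, M2: (n, n) binary masks with
    M1 + M2 == 1 everywhere, so every token pair is still covered.
    """
    d = X.shape[1]
    Q, Qc, K = X @ Wq, X @ Wq_cross, X @ Wk
    scores = (M1 * (Q @ K.T) + M2 * (Qc @ K.T)) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    return A / A.sum(axis=-1, keepdims=True)       # row-wise normalization

rng = np.random.default_rng(0)
n, d = 6, 8
X = rng.standard_normal((n, d))
Wq, Wqc, Wk = (rng.standard_normal((d, d)) for _ in range(3))
lang = np.array([0, 0, 0, 1, 1, 1])                # first 3 tokens: context language
M1 = (lang[:, None] == lang[None, :]).astype(float)  # monolingual pairs -> Q
M2 = 1.0 - M1                                        # cross-lingual pairs -> Q_cross
A = dual_query_attention(X, Wq, Wqc, Wk, M1, M2)
```

Because the two masks partition the score matrix, all tokens still attend to all tokens; only which query produced each score changes.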
Structured Attention Dropout As mentioned earlier, the poor generalization of MultiLMs in code-switched settings can be attributed to inconsistency between the pre-training and fine-tuning phases, where the former mostly deals with monolingual attention while the latter needs to handle cross-lingual attention as well. We propose that consistency can be improved by limiting cross-lingual attention in the fine-tuning phase (i.e., regularizing computational interactions between languages). As demonstrated in Figure 2, this can be achieved by randomly masking attention scores (i.e., attention dropout), with probability P_mask, when tokens from different languages attend to each other. Moreover, to ensure a reliable bridge between context and question, we never mask the attention scores of the first token (e.g., [CLS] in mBERT) to help the model better flow information between the two sections. Table 11 demonstrates the importance of structured attention dropout for better generalization in code-switched settings.
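A sketch of how such a dropout mask could be built (our illustration; the language-id convention and seeding are assumptions):

```python
import numpy as np

def structured_dropout_mask(lang_ids, p_mask, rng):
    """Binary attention mask: monolingual pairs always attend; cross-lingual
    pairs are dropped with probability p_mask; position 0 ([CLS]) is never
    masked in either direction, bridging context and question."""
    lang_ids = np.asarray(lang_ids)
    same_lang = lang_ids[:, None] == lang_ids[None, :]
    keep = rng.random(same_lang.shape) >= p_mask       # survivors of the dropout
    mask = np.where(same_lang, 1.0, keep.astype(float))
    mask[0, :] = 1.0                                   # [CLS] attends to everything
    mask[:, 0] = 1.0                                   # everything attends to [CLS]
    return mask

rng = np.random.default_rng(0)
# First 3 tokens in the context language, last 3 in the statement language.
mask = structured_dropout_mask([0, 0, 0, 1, 1, 1], p_mask=0.5, rng=rng)
```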
Interfering Cross-lingual Query Given the promising performance of the attention dropout for code-switched tasks, we experiment with a variation of the cross-lingual query, where queries Q and Q_cross also partially handle cross-lingual and monolingual attentions, respectively (see Figure 3b). We empirically observe that having attention masks that can randomly interfere with each other generally results in better performance (see Table 12) compared to the attention masks proposed in Figure 3a. In this scheme, M1 and M2 are generated randomly and online, but once sampled, the same masks are used for all the attention heads in all layers (more details in Appendix D). Due to better empirical performance, this variation of the cross-lingual query is used for all the following experiments.
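One plausible sampling scheme for the interfering masks could look like the following sketch (the flip-based sampling and the `p_flip` parameter are our assumptions for illustration; the paper's exact procedure is in Appendix D):

```python
import numpy as np

def interfering_masks(lang_ids, p_flip, rng):
    """Sample complementary masks M1 (for Q) and M2 (for Q_cross).

    Starting from the clean monolingual/cross-lingual split, each entry is
    flipped with probability p_flip, so both queries occasionally handle
    the "other" kind of attention. M1 + M2 == 1 everywhere, and the same
    sampled masks would be reused across all heads and layers.
    """
    lang_ids = np.asarray(lang_ids)
    M1 = (lang_ids[:, None] == lang_ids[None, :]).astype(float)
    flips = rng.random(M1.shape) < p_flip
    M1 = np.where(flips, 1.0 - M1, M1)
    return M1, 1.0 - M1

rng = np.random.default_rng(0)
M1, M2 = interfering_masks([0, 0, 1, 1], p_flip=0.2, rng=rng)
```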

Experimental Setup
All models are trained with the AdamW optimizer (Loshchilov and Hutter, 2017) using the HuggingFace Transformers (Wolf et al., 2020) implementation for Transformer-based (Vaswani et al., 2017) models. The hyperparameters used for the different experiments can be found in Appendix C. All reported scores are averaged over three different seeds.
Fine-tuning Setup. As Bitfit fine-tuning outperforms full fine-tuning in all our experiments, we only report the Bitfit results here (Zaken et al., 2021). In Bitfit tuning, only the biases of the MultiLM encoder are tuned, together with the classifier and pooler parameters.
Language Pairs. To show the effectiveness of the proposed method, we fine-tune the models on four typologically diverse languages (language of the statement), namely fr, de, zh, and ru. Our analysis shows that combining monolingual and code-switched data in the fine-tuning step improves the reasoning performance. Moreover, a multilingual reasoner should be able to reason over both monolingual and code-switched data. So, for this set of evaluations, we use a combination of English and en-X data (half of each) as the training dataset, which we denote mix(en, en-X).
Pre-training Cross-lingual Query. We train a shared (Shared Q_cross) or language-pair-specific (Pair Q_cross) cross-lingual query matrix. For Shared Q_cross, a shared cross-lingual query is trained on a parallel code-switched variant of XNLI (Conneau et al., 2018a), where an English premise is followed by the same premise in another language. For Pair Q_cross, we train a cross-lingual query for each en-X language pair, again using the XNLI dataset. In both cases, only the cross-lingual query matrix is trained and the rest of the parameters are frozen. The training runs for 500K iterations.
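The freezing logic can be sketched as follows (the parameter names are hypothetical, not actual HuggingFace module names):

```python
def freeze_all_but_qcross(param_names):
    """Map each parameter name to whether it should be trained.

    Only the cross-lingual query is updated; every other parameter stays
    frozen during Q_cross pre-training. 'query_cross' is a hypothetical
    name for the added Q_cross parameters.
    """
    return {name: ("query_cross" in name) for name in param_names}

names = [
    "encoder.layer.0.attention.query.weight",
    "encoder.layer.0.attention.query_cross.weight",
    "encoder.layer.0.attention.key.weight",
    "encoder.layer.0.attention.value.weight",
]
flags = freeze_all_but_qcross(names)
```

In a PyTorch setting this map would drive each parameter's `requires_grad` flag.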
Baselines. We compare the performance of the proposed method against two baselines: (1) the pre-trained model (original), and (2) a model pre-trained on code-switched data (CS-baseline). For the CS-baseline, we pre-train the model on the parallel code-switched variant of XNLI (similar to the data we use to learn the shared cross-lingual query matrix) for 500K iterations to adapt the model to the code-switched setting.
Cross-lingual Evaluation. For all the experiments, we evaluate the zero-shot performance of the model on (1) a monolingual setting (where both context and question are in one language), (2) an en-X code-switched setting (where the context is in English and the question is in another language), and (3) an X-en code-switched setting (where the question is in English and the context is in another language). When we Bitfit fine-tune the model using a language-pair-specific query matrix (Pair Q_cross), we use the query matrix of the target language pair during inference (only the weights). For example, when doing the zero-shot evaluation on en-zh, we use the en-zh cross-lingual query matrix instead of the one from the fine-tuned model.
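Swapping the pair-specific query matrix at inference time amounts to overwriting only the Q_cross entries of the fine-tuned model's state dict, sketched here with hypothetical key names and string placeholders standing in for weight tensors:

```python
def swap_qcross(finetuned_state, target_qcross_state):
    """Replace the fine-tuned model's Q_cross weights with the target
    language pair's pre-trained ones; all other weights are untouched."""
    swapped = dict(finetuned_state)
    for name, weight in target_qcross_state.items():
        assert "query_cross" in name, "only Q_cross entries may be swapped"
        swapped[name] = weight
    return swapped

# Fine-tuned on en-fr, evaluated zero-shot on en-zh by swapping Q_cross.
finetuned = {"layer0.query.weight": "Q_ft",
             "layer0.query_cross.weight": "Qc_en_fr"}
target = {"layer0.query_cross.weight": "Qc_en_zh"}
merged = swap_qcross(finetuned, target)
```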

Experimental Results
Table 3 shows the average zero-shot transfer performance (accuracy) for the RuleTaker dataset. For both mBERT and XLM-R, introducing a shared cross-lingual query matrix (Shared Q_cross) improves the reasoning accuracy. These results underscore the significance of maintaining consistency between the pre-training and fine-tuning phases in code-switched downstream tasks to facilitate effective transfer learning.
Using a specific query matrix for each language pair (Pair Q_cross) further boosts the cross-lingual transfer performance across most tested settings (up to 18%). In this scenario, there is a dedicated set of parameters to learn the attention patterns for a language pair, rather than sharing the same parameters among many different language pairs. In other words, dedicated parameters help the model learn attention patterns for specific language pairs. Interestingly, in many cases, our approach also improves the transfer performance for monolingual data (mono). We hypothesize that, by having a separate cross-lingual query matrix, the model does not need to learn the cross-lingual attention pattern using the same parameters, reducing the chance of overfitting to the code-switched format.
We also conducted a comparison with a code-switched baseline in which the MultiLM is pre-trained on a code-switched version of XNLI. The code-switched baseline (CS-baseline) showed improved transfer results for the en-X format and, in some cases, performed competitively with the Pair Q_cross approach. However, it negatively affected performance in monolingual and X-en scenarios, particularly for the mBERT model. In essence, the model overfit to the language pairs in the en-X format it was trained on, making it unable to generalize effectively to monolingual and other code-switched formats. On the other hand, both Shared Q_cross and Pair Q_cross demonstrated the ability to generalize their reasoning to the X-en format. We also perform a qualitative analysis of self-attention patterns for our proposed method in Figure 5, showing that the attention patterns remain more similar between in-language and cross-lingual code-switched samples (unlike Figure 4). We hypothesize that this attention-pattern stability makes the MultiLM more language-neutral.

Table 3: Average cross-lingual transfer of mBERT and XLM-R models on the RuleTaker datasets to monolingual samples (mono) and code-switched language pairs (en-X and X-en). Original is the pre-trained model and CS-baseline is the model pre-trained on code-switched data. Shared Q_cross and Pair Q_cross refer to cases where the cross-lingual query matrix Q_cross is either shared across many language pairs or specific to each language pair, respectively.
Regarding the cross-lingual transfer across languages, we observe that the reasoning ability of the model does not transfer across languages equally (Appendix F). The more similar the languages, the higher the transfer performance. For example, the model trained on en-fr has its highest transfer performance on Romance languages (e.g., es, it, en-es, and en-it). In almost all cases, and regardless of the training data language, en-fa and en-ar are the hardest languages to transfer to.
To study the effect of the cross-lingual query matrix on an implicit reasoning task, we expand our experimentation to include the LeapOfThought (LoT) dataset. Table 4 illustrates the average zero-shot transfer performance for this dataset. For this dataset, our proposed method also enhances the reasoning ability of the models for all examined language pairs. However, the degree of improvement is smaller compared to the RuleTaker dataset. In the case of the implicit reasoning task within the LoT dataset, the model must rely on both contextual cues and implicit knowledge to successfully solve the task. Conversely, for the RuleTaker dataset, the model is required to fully reason over the context. Consequently, for implicit reasoning, the model only partially uses contextual information, resulting in a lesser impact on performance when improving cross-lingual context-question attentions.

Generalization to other Reasoning Tasks
So far, our experiments have focused on the logical reasoning ability of MultiLMs, either in monolingual or code-switched settings. However, to demonstrate the proposed method's generalization to other reasoning tasks, we extend our experiments to the XNLI dataset. To create structured code-switched inputs for this task, we change the language of the premise and the hypothesis for a given input. More specifically, in a code-switched setting (e.g., en-fr), the premise is in English and the hypothesis is in French. We fine-tune the mBERT model on a combination of English and code-switched English-French data (mix(en, en-fr)), then zero-shot transfer it to other languages for monolingual evaluations (excluding en and fr) and other language pairs for code-switched evaluation (excluding the en-fr and fr-en pairs). Table 5 presents the performance of the mBERT model with the cross-lingual query compared to the baselines in both monolingual and code-switched settings. We observe a ∼4% improvement on en-X, ∼7% on X-en, and competitive performance on monolingual evaluation setups, indicating the effectiveness of our proposed method on downstream tasks other than logical reasoning.

Related Work
Reasoning in NLP. Language models (LMs) have demonstrated their ability to perform logical reasoning over natural language statements (Clark et al., 2020b; Chen et al., 2023). They can also leverage their implicit knowledge for reasoning purposes (Talmor et al., 2020) and exhibit multi-step reasoning capabilities by utilizing chain-of-thought prompting, even with minimal demonstrations or instructions, when the model size is sufficiently large (Wei et al., 2022b; Kojima et al., 2022; Wei et al., 2022a; Tang et al., 2023). In parallel to English-centric efforts on reasoning tasks, there have been attempts to create multilingual reasoning datasets to evaluate the cross-lingual abilities of pre-trained MultiLMs (Conneau et al., 2018b; Artetxe et al., 2020; Clark et al., 2020a; Hu et al., 2020; Shi et al., 2022; Tarunesh et al., 2021). To the best of our knowledge, this work is the first to study logical reasoning in the context of code-switched NLP. Furthermore, a majority of prior studies have focused on word-level code-switching, where the language of certain words in a text randomly changes. Our investigation instead delves into the realm of "structured code-switching", wherein language transitions occur at the section level.

Discussion
In this study, we explored the effectiveness of MultiLMs in a code-switched setting and found that while these models exhibit strong reasoning capabilities in monolingual settings, they struggle when it comes to code-switching. To address this, we first proposed structured attention dropout, which encourages the model to rely less on cross-lingual attention when dealing with code-switched data. This simple method considerably improved cross-lingual transfer to other code-switched languages, demonstrating the importance of structured attention for this setting. We then proposed a novel structured attention mechanism, incorporating the cross-lingual query, that helps the model better handle cross-lingual attention in the code-switched setting. The proposed cross-lingual query matrix, pre-trained on unsupervised code-switched data, significantly improved cross-lingual transfer to other code-switched language pairs in all studied settings, demonstrating the importance of code-switched alignment for MultiLMs. We also observed better cross-lingual code-switched performance for the LeapOfThought dataset (real-world knowledge contexts) compared to RuleTaker (artificial facts and rules). We attribute LeapOfThought's better code-switched performance to the use of real-world knowledge in the reasoning context, in line with Tang et al.'s (2023) observation that language models perform better when provided with commonsense-consistent contexts and struggle with artificial ones.

Limitations
In this work, we evaluate our proposed method on encoder-only language models; the impact of this method on autoregressive and encoder-decoder models has not been explored, leaving room for further investigation and evaluation. Moreover, our experiments are limited to relatively small language models (fewer than one billion parameters), and our findings do not necessarily extend to large language models. Furthermore, the scope of our experiments is constrained by the availability of multilingual data and computational resources. Consequently, our evaluation is limited to two specific datasets and covers only nine languages. While we strive for diversity in our selection, it is important to recognize that broader and more extensive datasets encompassing a wider range of languages could offer additional perspectives and potentially reveal new insights.

A Dataset Details
This section further elaborates on the datasets used in our experiments. Both datasets in this study are translated using the Google Translate API to investigate our proposed method's cross-lingual transfer. Starting from the English dataset, the samples are translated into eight other languages, namely French (Fr), Farsi (Fa), German (De), Arabic (Ar), Spanish (Es), Chinese (Zh), Russian (Ru), and Italian (It). Below we discuss each studied dataset in more detail.

A.1 RuleTaker Dataset
The RuleTaker dataset (Clark et al., 2020b) is a set of five datasets, requiring various depths of inference to answer the questions. Each dataset consists of examples in the form of a triple: (context, statement, answer). The context is composed of a series of facts and rules, while the statement represents the question that needs to be proven, and the answer is either "T" (true) if the statement logically follows from the context, or "F" (false) if it does not (false under a closed-world assumption, CWA). All the facts, rules, and question statements are expressed in synthetic English. Essentially, each example represents a self-contained logical theory in linguistic form, with a question asking, "Is it true?" The dataset generation procedure ensures that every question can be answered by a formal reasoner, given the closed-world assumption. Each dataset is limited by the maximum level of inference needed to validate the facts employed in its corresponding questions. These datasets are categorized based on their depth restrictions (up to depths D = 0, D ≤ 1, D ≤ 2, D ≤ 3, and D ≤ 5, respectively). A depth of D = 0 implies that the accurate facts can be readily "proven" by directly looking them up within the given context, without requiring any inference. The fifth dataset encompasses questions that span up to a depth of 5; it serves as a test of the ability to generalize to depths not encountered during training on the other four datasets. In our experiments, we use datasets with depths 0 to 4. Each dataset contains 100k examples randomly split 70/10/20 into train/dev/test partitions.

A.2 LeapOfThought (LoT) Dataset
The primary focus of the LoT dataset (Talmor et al., 2020) revolves around a specific form of inference that integrates implicit taxonomic knowledge (such as hypernymy and meronymy) with explicit rules derived from natural language. The hypernymy relations and properties are derived from WORDNET (Fellbaum, 1998) and CONCEPTNET (Speer et al., 2017). Each example consists of two components: (1) a hypothesis, a textual statement that can be either true or false, and (2) explicit knowledge, represented as a list of textual statements. These statements can be classified as either facts, which describe properties of specific entities, or rules, which describe properties of a particular class. The explicit knowledge is carefully constructed to ensure that the truth value of the hypothesis cannot be determined solely based on the provided information.

A.2.1 Discussion on LoT Dataset Artifacts
The LoT dataset was designed to test how well NLP models can (possibly) reason using real-world knowledge. However, as we show in this section, the dataset has some artifacts that cause NLP models to take shortcuts instead of actually performing the reasoning. In the following analysis, we focus only on the original English LoT dataset (not on the translated samples).
In our preliminary experiments on this dataset, we observed that MultiLMs perform surprisingly well in cross-lingual code-switched settings (on the EXPLICIT dev set), even when the statement is in a medium-resource language like Farsi or Arabic (with the context in English). We hypothesized that the model mostly relies on the context for reasoning; therefore, a statement in a medium- or low-resource language does not necessarily impact the model's performance. We validate this hypothesis by training a context-only model (without access to the respective statements), and surprisingly this model reaches ∼94% accuracy on the EXPLICIT dev set (see Table 6). To ensure that the model cannot achieve non-random accuracy by relying only on the context, we randomly negate 50% of the statements (also negating the respective labels), so that a context-only model performs at chance. The resulting dataset is the modified LoT used in all experiments in the paper.
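This debiasing step can be sketched as follows. Here `negate` is a deliberately simplified string-level stand-in for the actual negation procedure, and the field names are hypothetical:

```python
import random

def negate(statement):
    # Simplified stand-in for statement negation: toggles a "not"
    # after the first copula-like token ("is" / "has").
    words = statement.split()
    for i, w in enumerate(words):
        if w in ("is", "has"):
            if i + 1 < len(words) and words[i + 1] == "not":
                return " ".join(words[:i + 1] + words[i + 2:])
            return " ".join(words[:i + 1] + ["not"] + words[i + 1:])
    return statement

def debias(dataset, seed=0):
    """Negate a random 50% of statements and flip their labels, so that
    a context-only model scores at chance on the modified dataset."""
    rng = random.Random(seed)
    out = []
    for ex in dataset:
        ex = dict(ex)
        if rng.random() < 0.5:
            ex["statement"] = negate(ex["statement"])
            ex["label"] = 1 - ex["label"]
        out.append(ex)
    return out
```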
To further investigate artifacts present in the modified LoT dataset, we inject noise into the statement (without changing the context) as follows:
• Swapping the statement's subject with a randomly selected entity from the whole dataset
• Swapping the statement's object with a randomly selected entity from the whole dataset
• Swapping both the statement's subject and object with randomly selected entities from the whole dataset
As demonstrated by the EXPLICIT evaluation results in Table 6, the model can still attain high reasoning performance even when the entities in the context and statement are inconsistent. However, since reasoning performance on the IMPLICIT evaluation set drops to (almost) random when noise is injected into the statement entities, we believe that LoT artifacts have less effect on this evaluation setting. Therefore, to evaluate the MultiLMs' reasoning performance, we use the IMPLICIT evaluation set throughout the paper.
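The noise-injection procedure above can be sketched as follows, assuming (hypothetically) that each statement's subject and object are available as separate fields:

```python
import random

def swap_entities(dataset, swap_subject=True, swap_object=True, seed=0):
    """Perturb statements by replacing their subject and/or object with a
    random entity drawn from the whole dataset. Field names are our
    assumption, not the dataset's actual schema."""
    rng = random.Random(seed)
    subjects = [ex["subject"] for ex in dataset]
    objects = [ex["object"] for ex in dataset]
    noisy = []
    for ex in dataset:
        ex = dict(ex)
        if swap_subject:
            ex["subject"] = rng.choice(subjects)
        if swap_object:
            ex["object"] = rng.choice(objects)
        noisy.append(ex)
    return noisy
```

Running it with `swap_object=False` (or `swap_subject=False`) yields the first two noise settings; the default yields the third.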

B Multilingual Reasoning: XLM-R results
Sections 3.2 and 3.3 discussed the in-language and cross-lingual performance of the mBERT model on monolingual and code-switched data. This section evaluates the XLM-R model in the same evaluation settings as mBERT.
Table 7 demonstrates the average in-language and cross-lingual zero-shot performance of XLM-R for each source language in a monolingual setting. Code-switched evaluation results are depicted in Table 8.

Table 7: Monolingual Setting: In-language and cross-lingual zero-shot performance (accuracy) of the XLM-R model for the RuleTaker and LeapOfThought datasets.

C Experimental Setup Details
C.1 Full Fine-tuning Versus Bitfit
As discussed in section 4.2, the performances of our proposed model and the baselines in Tables 3 and 4 are achieved with Bitfit tuning (Zaken et al., 2021). Tu et al. (2022) previously observed that parameter-efficient fine-tuning (PEFT) generalizes better cross-lingually than full fine-tuning. In our experiments, we also found that using a PEFT method like Bitfit considerably improves cross-lingual transfer across different languages.
Table 9 demonstrates the generalization improvement brought by Bitfit over the full fine-tuning baseline on the RuleTaker dataset, especially in code-switched settings. We observed a similar pattern for the other RuleTaker depths and the LoT dataset. It is worth noting that using a PEFT method especially helps with transfer to code-switched tasks, which is our main focus in this paper.

C.2 Curriculum Learning
For depths 2 and 3 of the RuleTaker dataset, which involve more reasoning hops, we observed that curriculum learning (Bengio et al., 2009) makes XLM-R training more robust. The curriculum is performed by first training the MultiLM for 3 epochs on a subset of the dataset with depth 0 (i.e., no hops are needed for reasoning), and then continuing training on the full dataset. This technique not only makes XLM-R training more robust but also improves the final reasoning performance.
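The curriculum can be sketched as follows, with `train_fn` standing in for the usual fine-tuning loop, the 3-epoch warmup taken from the description above, and the `depth` field name and total epoch count being our assumptions:

```python
def curriculum_train(model, dataset, train_fn, warmup_epochs=3, total_epochs=10):
    """Curriculum sketch: first train on the depth-0 subset (no reasoning
    hops needed), then continue training on the full dataset.
    `train_fn(model, examples, epochs)` is a placeholder for the actual
    training loop."""
    depth0 = [ex for ex in dataset if ex["depth"] == 0]
    train_fn(model, depth0, epochs=warmup_epochs)
    train_fn(model, dataset, epochs=total_epochs - warmup_epochs)
```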

C.3 Hyperparameters
The hyperparameters for all experiments are provided in Table 10 for both the mBERT and XLM-R models. We use the AdamW optimizer with a warmup ratio of 0.1 for all experiments.
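The warmup-then-linear-decay schedule used with AdamW (Table 10 notes that the learning rate decays linearly to zero) can be sketched as a simple step-to-learning-rate function; the peak learning rate below is a placeholder, not a value from Table 10:

```python
def lr_schedule(step, total_steps, warmup_ratio=0.1, peak_lr=2e-5):
    """Linear warmup for the first warmup_ratio fraction of steps,
    then linear decay to zero (peak_lr is a hypothetical value)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * (total_steps - step) / max(1, total_steps - warmup_steps)
```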

D Cross-lingual Query
This section further discusses the methods proposed in section 4.1.

D.1 Structured Attention Dropout
As previously discussed in section 4.1, limiting cross-lingual attention during fine-tuning makes this phase more consistent with pre-training, where the MultiLM mostly deals with monolingual attention. Table 11 demonstrates that applying dropout to cross-lingual attentions (see Figure 2) considerably improves cross-lingual generalization in code-switched settings. The results in Table 11 are achieved with a 40% dropout on cross-lingual attentions (i.e., P mask = 0.4).
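A sketch of how such a structured dropout mask could be built from per-token language IDs; the boolean-matrix representation and the language-ID input are our simplifications of the actual implementation:

```python
import random

def cross_lingual_dropout_mask(lang_ids, p_mask=0.4, seed=0):
    """Return mask[i][j] = True when the attention from token i to token j
    is dropped. Only cross-lingual pairs (different language IDs) are
    candidates; attention to and from the first token ([CLS]) is never
    masked, since it acts as a bridge."""
    rng = random.Random(seed)
    n = len(lang_ids)
    mask = [[False] * n for _ in range(n)]
    for i in range(1, n):
        for j in range(1, n):
            if lang_ids[i] != lang_ids[j] and rng.random() < p_mask:
                mask[i][j] = True
    return mask
```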

D.2 Interfering Cross-lingual Query
Inspired by the promising performance of the structured attention dropout, we propose a setting where the query matrix Q also partially handles cross-lingual attentions, and the cross-lingual query Q cross partially handles monolingual attentions. The only difference between the interfering cross-lingual query and the non-interfering scheme is their respective attention masks, M 1 and M 2, as illustrated in Figure 3. We also empirically demonstrate in Table 12 that the interfering scheme consistently generalizes better than the non-interfering one, especially in code-switched settings. For all fine-tuning experiments with the interfering cross-lingual query, we use a 70% attention dropout (i.e., P mask = 0.7), meaning that 70% of cross-lingual attentions for query Q, and 70% of monolingual attentions for query Q cross, are masked.
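The two masks can be sketched as boolean matrices where True means the corresponding attention score is dropped for that query; setting p_mask = 1 recovers the non-interfering scheme. The per-token language-ID input is our simplification:

```python
import random

def interfering_masks(lang_ids, p_mask=0.7, seed=0):
    """Sketch of masks M1 (for query Q) and M2 (for query Q_cross) in the
    interfering scheme: M1 drops each cross-lingual attention with
    probability p_mask, M2 drops each monolingual attention with
    probability p_mask, and neither mask ever drops attention to or from
    the first token ([CLS]), which serves as a bridge."""
    rng = random.Random(seed)
    n = len(lang_ids)
    m1 = [[False] * n for _ in range(n)]  # True = dropped for Q
    m2 = [[False] * n for _ in range(n)]  # True = dropped for Q_cross
    for i in range(1, n):
        for j in range(1, n):
            cross = lang_ids[i] != lang_ids[j]
            if cross and rng.random() < p_mask:
                m1[i][j] = True
            if not cross and rng.random() < p_mask:
                m2[i][j] = True
    return m1, m2
```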

E Attention Visualization
As discussed earlier, MultiLMs perform well in-language, but when they are transferred to other languages (especially code-switched ones) their performance degrades considerably (see Table 2). This section first analyzes the attention patterns of the baseline models in both in-language and cross-lingual evaluation settings. Then, we analyze the attention pattern of our proposed model, which incorporates the cross-lingual query.
We hypothesize that for reasonable cross-lingual performance, the attention pattern on cross-lingual samples should not change significantly compared to in-language samples. Figure 4 visualizes the attention pattern between tokens in the last layer of the (baseline) mBERT model across all attention heads. The mBERT model is fine-tuned on the mix(en, en-fr) depth-0 RuleTaker dataset, so the en-fr sample is considered in-language and the en-ar sample is considered a zero-shot transfer. It is worth noting that the two samples are semantically identical and only the questions are in different languages. Comparing the two samples' attention patterns, we can see that the pattern changes considerably (in particular, the strong attention signals become much weaker when the en-ar sample is given as input), which to some extent explains the poor generalization of the baseline models to other code-switched tasks.
In contrast, as demonstrated by Figure 5, the attention pattern of our proposed method, which incorporates the cross-lingual query, is much more stable between the in-language (en-fr) and zero-shot transfer (en-ar) samples. We believe that this stability in the attention patterns makes our models more language-neutral than the baseline, which is also demonstrated by the significant cross-lingual improvements over the baselines in Tables 3 and 4.
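As a simple quantitative proxy for this stability (our own illustration; the figures in the paper are qualitative), one could compare the flattened attention maps of the two inputs with cosine similarity:

```python
import math

def attention_stability(attn_a, attn_b):
    """Cosine similarity between two flattened attention maps (same shape),
    used here as a rough proxy for how much the attention pattern shifts
    between an in-language and a code-switched input."""
    flat_a = [p for row in attn_a for p in row]
    flat_b = [p for row in attn_b for p in row]
    dot = sum(a * b for a, b in zip(flat_a, flat_b))
    norm_a = math.sqrt(sum(a * a for a in flat_a))
    norm_b = math.sqrt(sum(b * b for b in flat_b))
    return dot / (norm_a * norm_b)
```

A value near 1 would indicate a stable pattern, as observed for the cross-lingual-query model.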

F Detailed Cross-lingual Query Results
Tables 3 and 4 demonstrated the average cross-lingual transfer to either monolingual or code-switched settings. This section presents the detailed cross-lingual performance of the models with the cross-lingual query, compared to the Original and CS-baseline models. Tables 13 and 14 present the detailed cross-lingual transfer of mBERT trained on the RuleTaker and LeapOfThought datasets, respectively. Tables 15 and 16 present the corresponding results for the XLM-R model.

Table 13: Cross-lingual transfer of the mBERT model on the RuleTaker datasets to either monolingual samples or code-switched language pairs (en-X and X-en). The Original is the pre-trained model, and the CS-baseline is the model that continues pre-training on code-switched data. Shared Q cross and Pair Q cross refer to cases where the cross-lingual query is either shared across many language pairs or is specific to each language pair, respectively. Scores are averaged across three different seeds.

Figure 1: An example of monolingual and code-switched reasoning. In code-switched reasoning, the context and question are in different languages.

Figure 3: Illustration of the attention masks in Section 4.1. In the proposed scheme, two sets of independent query matrices (Q and Q cross) collaborate to compute the attention scores. Matrix M 1 enforces the Q matrix to focus mostly on monolingual attentions, and matrix M 2 constrains Q cross to handle mostly cross-lingual attentions. The difference between the masks in the two figures is the structured attention dropout probability being either one (left) or less than one (right). It is worth noting that the first token (e.g., [CLS] in mBERT) is used as a bridge in both M 1 and M 2, meaning its respective attentions are not masked.

Figure 5: Attention visualization of the mBERT model with cross-lingual query for in-language (en-fr) and zero-shot transfer (en-ar) samples, both from depth-0 of the RuleTaker dataset. The underlying mBERT model is fine-tuned on the mix(en, en-fr) of the RuleTaker depth-0 dataset. We can see that the attention patterns of our proposed model are more stable between in-language and cross-lingual samples, compared to the baseline model in Figure 4.

Table 1: Monolingual Setting: In-language and cross-lingual zero-shot performance (accuracy) of the mBERT model for the RuleTaker and LeapOfThought datasets. Cross-lingual performance is the average performance of the model fine-tuned on a single source language and then zero-shot transferred to other languages.

Table 4: Average cross-lingual transfer of mBERT and XLM-R on the LoT dataset to monolingual samples (mono) and code-switched language pairs (en-X and X-en). The Original is the pre-trained model and the CS-baseline refers to the model pre-trained on code-switched data. Shared Q cross and Pair Q cross refer to cases where the cross-lingual query is either shared across many language pairs or is specific to each language pair, respectively.

Table 5: Performance (accuracy) of the mBERT model on the XNLI dataset in both monolingual and code-switched evaluation settings.

Table 6: This table investigates the artifacts present in the LeapOfThought dataset. We evaluate the model's in-language and cross-lingual performance.

Table 8: Code-Switched Setting: In-language and cross-lingual performance (accuracy) of the XLM-R model for the RuleTaker and LeapOfThought datasets.

Table 10: Hyperparameters of the pre-training and fine-tuning experiments for the mBERT and XLM-RoBERTa models. The learning rate decays linearly from its initial value to zero.

Table 11: Average cross-lingual transfer of the mBERT model when tuned on a mixture of English and English-French (mix(en, en-fr)) RuleTaker data (depth-0).

Table 14: Cross-lingual transfer of the mBERT model on the LeapOfThought dataset to either monolingual samples or code-switched language pairs (en-X and X-en). The Original is the pre-trained model, and the CS-baseline is the model that continues pre-training on code-switched data. Shared Q cross and Pair Q cross refer to cases where the cross-lingual query is either shared across many language pairs or is specific to each language pair, respectively. Scores are averaged across three different seeds.