COSY: COunterfactual SYntax for Cross-Lingual Understanding

Pre-trained multilingual language models, e.g., multilingual BERT, are widely used in cross-lingual tasks, yielding state-of-the-art performance. However, such models suffer from a large performance gap between source and target languages, especially in the zero-shot setting, where the models are fine-tuned only on English but tested on other languages for the same task. We tackle this issue by incorporating language-agnostic information, specifically, universal syntax such as dependency relations and POS tags, into language models, based on the observation that universal syntax is transferable across different languages. Our approach, called COunterfactual SYntax (COSY), includes the design of SYntax-aware networks as well as a COunterfactual training method that implicitly forces the networks to learn not only the semantics but also the syntax. To evaluate COSY, we conduct cross-lingual experiments on natural language inference and question answering using mBERT and XLM-R as network backbones. Our results show that COSY achieves state-of-the-art performance on both tasks, without using auxiliary training data.


Introduction
With the emergence of BERT (Devlin et al., 2019), large-scale pre-trained language models have become an indispensable component in the solutions to many natural language processing (NLP) tasks. Recently, large-scale multilingual transformer-based models, such as mBERT (Devlin et al., 2019), XLM (Lample and Conneau, 2019) and XLM-R (Conneau et al., 2020a), have been widely deployed as backbones in cross-lingual NLP tasks (Wu and Dredze, 2019; Pires et al., 2019; Keung et al., 2019). However, these models, trained on a single resource-rich language, e.g., English, all suffer from a large drop in performance when tested on different target languages, e.g., Chinese and German, a setting called zero-shot cross-lingual transfer. For example, on the XQUAD dataset, mBERT achieves an exact match score on the target language Chinese that is 24 percentage points lower than on the training language English (Hu et al., 2020). This indicates that the model has seriously overfitted to English.

1 Our code is publicly available on GitHub: https://github.com/PluviophileYU/COSY

Figure 1: Examples of two sentences in English and Chinese that have the same meaning and share the same syntax in the format of dependency relations and POS tags.
An intuitive way to tackle this is to introduce language-agnostic information, the most transferable feature across languages, which is lacking in existing multilingual language models (Choenni and Shutova, 2020). In our work, we propose to exploit reliable language-agnostic information: syntax in the form of universal dependency relations and universal POS tags (de Marneffe et al., 2014; Nivre et al., 2016; Zhou et al., 2019, 2021). As illustrated in Figure 1, the sentences in Chinese and English share the same meaning but have different word orders. The order difference hampers the transferability between English and Chinese in conventional language models (with sequential words as input). In contrast, it is clear from Figure 1 that the two sentences share identical dependency relations and POS tags. Thus, we can incorporate such universal syntax information to enhance the transferability across different languages. To achieve this learning objective in deep models, we design syntax-aware networks that incorporate the encodings of dependency relations and POS tags into the encoding of semantics. However, we find empirically that the conventional attention-based incorporation of syntax, e.g., relational graph attention networks (Ishiwatari et al., 2020), has little effect on improving the model. One possible reason is that the learning process may be dominated by the pre-trained language model due to its strength in semantic representation learning, which leads to an overfitted model. This raises the question of how to induce the model to focus more on syntax while maintaining its original capability of representing semantics. To this end, we propose a novel COunterfactual SYntax (COSY) method, inspired by causal inference (Roese, 1997; Pearl et al., 2009) and contrastive learning (He et al., 2020).
The intuition behind COSY is to create copies of training instances with their syntactic features altered (see the "counterfactual" syntax in Figure 2), and to force the encodings of the counterfactual instances to be different from the encodings of their corresponding factual instances. In this way, the model learns to put more emphasis on syntactic information when encoding an instance, and such encodings are likely to perform well across languages.
We evaluate our COSY method on both question answering (QA) and natural language inference (NLI) under cross-lingual settings. Experimental results show that, without using any additional data, COSY is superior to the state-of-the-art methods. Contributions: 1) we develop a syntax-aware network that incorporates transferable syntax in language models; 2) we propose a novel counterfactual training method that addresses the technical challenge of emphasizing syntax; and 3) extensive experiments on three benchmarks demonstrate the effectiveness of our method for cross-lingual tasks.

Related Work
Cross-lingual Transfer. Large-scale pre-trained language models (Devlin et al., 2019) have achieved great success in various natural language processing tasks. Recent studies (Lample and Conneau, 2019; Conneau et al., 2020a) extend pre-trained language models to multilingual tasks and demonstrate their prominent capability in cross-lingual knowledge transfer, even under the zero-shot scenario (Wu and Dredze, 2019; Pires et al., 2019; Hsu et al., 2019).
Motivated by the success of multilingual language models on cross-lingual transfer, several works explore how these models work and what their bottleneck is. On the one hand, some studies find that shared sub-words (Wu and Dredze, 2019; Dufter and Schütze, 2020) and the parameters of the top layers (Conneau et al., 2020b) are crucial for cross-lingual transfer. On the other hand, the bottleneck is attributed to two issues: (i) catastrophic forgetting (Keung et al., 2020), where knowledge learned in the pre-training stage is forgotten during downstream fine-tuning; and (ii) the lack of language-agnostic features (Choenni and Shutova, 2020; Zhao et al., 2020) or the linguistic discrepancy between the source and the target languages (Wu and Dredze, 2019; Lauscher et al., 2020). In this work, we aim to tackle zero-shot and few-shot cross-lingual transfer by focusing on the second issue.
Existing works can be roughly divided into two groups. The first proposes to modify the language model by aligning languages with parallel data (Zhao et al., 2020) or strengthening sentence-level representations (Wei et al., 2020). The second group focuses on the learning paradigm for fine-tuning on downstream tasks. For instance, some methods adopt meta-learning or intermediate-task training (Phang et al., 2020) to learn cross-lingual knowledge. Our COSY belongs to the second group and fills the gap of using syntactic information in zero-shot (few-shot) cross-lingual understanding.

Counterfactual Analysis. Counterfactual analysis aims to evaluate the causal effect of a variable by considering its counterfactual scenario. It has been widely studied in epidemiology (Rothman and Greenland, 2005) and social science (Steel, 2004), and counterfactual reasoning has recently motivated studies in a variety of applications.
In the natural language processing community, counterfactual methods have also emerged recently in text classification (Choi et al., 2020), story generation (Qin et al., 2019), dialog systems (Zhu et al., 2020), gender bias (Vig et al., 2020; Shin et al., 2020), question answering, and sentiment bias (Huang et al., 2020). To the best of our knowledge, we are the first to conduct counterfactual analysis in cross-lingual understanding. Different from previous works (Zhu et al., 2020; Qin et al., 2019) that generate word-level or sentence-level counterfactual samples, our counterfactual analysis dives into the syntax level, which is more controllable than text and free from a complex language generation module.

COSY: COunterfactual SYntax
COSY aims to leverage the syntactic information, e.g., dependency relations and POS tags, to increase the transferability of cross-lingual language models. Specifically, COSY implicitly forces the networks to learn to encode the input not only based on semantic features but also based on syntactic features through syntax-aware networks and a counterfactual training method.
As illustrated in Figure 3, COSY consists of three branches, each based on syntax-aware networks (SAN) and indicated by a distinct color. The main branch (in black) is the factual branch that uses factual syntax as input. The red and blue branches are counterfactual branches using counterfactual dependency relations and counterfactual POS tags as input, respectively. The counterfactual training method guides the black branch to put more emphasis on syntactic information with the help of the other two branches. Note that the red and blue branches are used only for counterfactual training; only the prediction from the black branch is used in testing.
Below, we first elaborate the modules of SAN in Section 3.1, and then introduce the counterfactual training method in Section 3.2.

Syntax-Aware Networks (SAN)
As shown in Figure 3, SAN contains four major modules: a set of feature extractors, a relational graph attention network (RGAT), a fusion projection, and a classifier. In this section, we use the route of the black branch as an example to elaborate each module. The set of feature extractors includes three components: a pre-trained language model, a dependency graph constructor, and a POS tags extractor.

Pre-trained Language Model. Following previous work (Hu et al., 2020), we deploy a pre-trained multilingual language model, e.g., mBERT (Devlin et al., 2019), to encode each input sentence into contextual features. Given a sequence of tokens of length S, we denote the derived contextual features as H = [h_1, ..., h_S] ∈ R^{S×d}, where d is the dimensionality of each hidden vector.

Dependency Graph Constructor. We use it to construct the (factual) dependency graph for each input sentence. In this work, the Stanza toolkit (Qi et al., 2020) is used to extract the universal dependency relations as the first step. Then, the dependency graph can be represented as G = {V, R, E}, where the nodes V are tokens, the edges E denote the existence of dependency relations, and the set R contains the relation types for E.
Figure 3: The overall pipeline of our COSY. We call the architecture syntax-aware networks (Section 3.1) and the training method counterfactual training (Section 3.2). The architecture has three branches: black, red and blue. The black branch is the normal attention-based network with additional syntactic information, and only its prediction is used in the testing stage. The red and blue branches are novel: they generate the counterfactual syntax samples and drive the counterfactual losses in the training stage, the key functions in COSY. RGAT stands for Relational Graph Attention Network (Ishiwatari et al., 2020; Linmei et al., 2019). The RGAT modules and the Fusion Projection modules are shared across branches, e.g., the two RGAT modules share parameters. Cat denotes concatenation.

As shown in Figure 3, we define three kinds of relation types in R: 1) a forward syntactic relation, e.g., love --OBJ--> apples; 2) an inverse syntactic relation, e.g., apples --OBJ^-1--> love; and 3) a self loop SELF that allows information to flow from a node to itself. Note that we regard the ROOT relation as a self-loop. In this way, we obtain 75 different types of relations in total, and denote the relation embedding matrix as R ∈ R^{75×d'}, where d' is the dimensionality of the syntax embeddings.

POS Tags Extractor. We deploy the same Stanza toolkit (Qi et al., 2020) to assign (factual) POS tags P to all tokens. We obtain 17 different types of POS tags and denote the embedding matrix as T ∈ R^{17×d'}.
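As a concrete illustration, the relation inventory and edge construction described above can be sketched in plain Python. The function names and the toy parse below are illustrative, not the paper's actual code; with the full set of 37 universal dependency relations, the inventory yields exactly 37 × 2 + 1 = 75 types.

```python
def build_relation_inventory(ud_relations):
    """Map each relation name, its inverse (REL^-1), and SELF to integer ids."""
    rel2id = {}
    for rel in ud_relations:
        rel2id[rel] = len(rel2id)             # forward, e.g. obj
        rel2id[rel + "^-1"] = len(rel2id)     # inverse, e.g. obj^-1
    rel2id["SELF"] = len(rel2id)              # shared self-loop type
    return rel2id

def build_edges(heads, deprels, rel2id):
    """Build directed (src, dst, rel_id) triples from a parse.

    heads[i] is the 1-based index of the head of token i+1 (0 for ROOT).
    Every token gets a SELF loop; the ROOT relation is treated as a self
    loop only, following the paper.
    """
    edges = []
    for dep, (head, rel) in enumerate(zip(heads, deprels), start=1):
        edges.append((dep, dep, rel2id["SELF"]))
        if head == 0:                                     # ROOT token
            continue
        edges.append((head, dep, rel2id[rel]))            # head --REL--> dep
        edges.append((dep, head, rel2id[rel + "^-1"]))    # dep --REL^-1--> head
    return edges

# Toy parse of "I love apples": "love" (token 2) is the root.
rel2id = build_relation_inventory(["nsubj", "obj"])
edges = build_edges(heads=[2, 0, 2], deprels=["nsubj", "root", "obj"],
                    rel2id=rel2id)
```

In practice the `heads` and `deprels` arrays would come from Stanza's dependency parser rather than being written by hand.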

Relational Graph Attention Networks (RGAT).
RGAT is one of the standard backbones for incorporating a dependency graph (Ishiwatari et al., 2020; Linmei et al., 2019). Given the (factual) dependency graph G with the contextual features of each node, RGAT generates relation-aware features for each node. Details are given below.

Suppose e_ij is the directed edge from node v_i to node v_j with dependency relation r. The importance score of v_j from v_i is computed as:

s_ij = [e^s_ij ; e^r_ij] W_Attn,   (1)

where W_Attn ∈ R^{(d/2+d')×1} maps a vector to a scalar, e^r_ij is the embedding of the dependency relation between v_i and v_j from R, and e^s_ij is computed by element-wise multiplication between v_i and v_j:

e^s_ij = (h_i W_K) ⊙ (h_j W_Q),   (2)

where W_K ∈ R^{d×d/2} and W_Q ∈ R^{d×d/2} are the learnable parameters for the key and query projections (Vaswani et al., 2017), and h_i and h_j denote the contextual features extracted from the pre-trained language model. Then, the importance scores are normalized across N_j to obtain the attention score of v_j from v_i:

α_ij = exp(s_ij) / Σ_{k ∈ N_j} exp(s_kj),   (3)

where N_j denotes the set of nodes pointing to v_j. The relation-aware feature of v_j is computed as the weighted sum of all nodes in N_j with the corresponding attention scores. After computing all nodes, we get the relation-aware features Ĥ = [ĥ_1, ..., ĥ_S].

Fusion Projection. We fuse the relation-aware features Ĥ with the (factual) POS tag information before feeding them into the classifier. Given the POS tags P, the fused feature for each token is represented by

h̃_j = [ĥ_j ; p_j] W_F,   (4)

where W_F ∈ R^{(d+d')×d} are the learnable parameters of the fusion projection and p_j is the embedding of the POS tag of the j-th token from T. The fused features of the entire sequence are denoted as H̃ = [h̃_1, ..., h̃_S].

Classifier. The classifier is designed for the specific task, such as NLI or QA, following Devlin et al. (2019).
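The attention and fusion steps above can be sketched with numpy. The toy dimensions and random weights are illustrative only; the aggregation as an attention-weighted sum over neighbour features follows the description in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
S, d, dp = 4, 8, 6                       # sequence length, hidden dim d, syntax dim d'
H = rng.normal(size=(S, d))              # contextual features from the language model
R = rng.normal(size=(75, dp))            # relation-type embeddings
W_K = rng.normal(size=(d, d // 2))       # key projection
W_Q = rng.normal(size=(d, d // 2))       # query projection
W_attn = rng.normal(size=(d // 2 + dp, 1))

def relation_aware_feature(j, in_edges):
    """Compute ĥ_j for node j from its incoming edges [(i, rel_id), ...]."""
    scores = []
    for i, rel in in_edges:
        e_s = (H[i] @ W_K) * (H[j] @ W_Q)              # element-wise key/query product
        scores.append(float(np.concatenate([e_s, R[rel]]) @ W_attn))
    a = np.exp(scores - np.max(scores))
    a = a / a.sum()                                    # softmax over N_j
    return sum(w * H[i] for w, (i, _) in zip(a, in_edges))

# Fusion projection: concatenate ĥ_j with the token's POS embedding.
T = rng.normal(size=(17, dp))            # POS-tag embeddings
W_F = rng.normal(size=(d + dp, d))

def fuse(h_hat, pos_id):
    return np.concatenate([h_hat, T[pos_id]]) @ W_F

h_hat = relation_aware_feature(1, [(0, 3), (1, 74), (2, 5)])
h_tilde = fuse(h_hat, pos_id=7)
```

A full RGAT layer would run `relation_aware_feature` for every node and stack the results into Ĥ before fusion.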

Counterfactual Training
Recall that the challenge in the effective utilization of syntax is how to induce the model to focus more on syntax while maintaining its original capability of representing semantics. Inspired by counterfactual analysis (Pearl et al., 2009; Pearl, 2010; Pearl and Mackenzie, 2018) and contrastive learning (Hadsell et al., 2006), we propose a counterfactual training method that incorporates counterfactual syntax (a counterfactual dependency graph and counterfactual POS tags) on the red and blue branches in Figure 3. Each branch is designed to guide the model to focus on one type of syntax, i.e., the dependency graph or the POS tags.

Counterfactual Dependency Graph. The counterfactual dependency graph is used on the red branch together with the factual POS tags in Figure 3. We build it by keeping the graph structure and nodes, and replacing each relation type (except for the self-loop SELF) with a randomized (counterfactual) type. We name it G⁻. We feed G⁻ and H into RGAT to obtain the counterfactual relation-aware features, denoted as Ĥ⁻. Then, we fuse Ĥ⁻ with the factual POS tags to derive the counterfactual features on the red branch. Finally, we calculate the similarity between the factual and the counterfactual features with the dot-product operation:

L_dep = (1/S) Σ_{j=1}^{S} h̃_j · h̃⁻_j.

Minimizing this counterfactual loss forces the model to emphasize the syntactic information related to dependency relations.
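A minimal sketch of the counterfactual graph construction and the dot-product similarity, assuming SELF is one fixed relation id and taking the mean dot product over tokens (the exact normalization of the loss is our assumption, not stated in the text):

```python
import numpy as np

NUM_REL, SELF = 75, 74          # assume the last relation id is the shared SELF type

def counterfactual_graph(edges, rng):
    """Keep nodes and structure; randomize every non-SELF relation type."""
    return [(i, j, r) if r == SELF
            else (i, j, int(rng.integers(0, NUM_REL - 1)))  # never draws SELF
            for i, j, r in edges]

def dot_similarity_loss(H_fact, H_cf):
    """Mean dot product between factual and counterfactual fused features;
    minimizing it pushes the two encodings apart."""
    return float(np.mean(np.sum(H_fact * H_cf, axis=-1)))

rng = np.random.default_rng(0)
edges = [(1, 2, 10), (2, 1, 11), (2, 2, SELF)]
cf_edges = counterfactual_graph(edges, rng)
```

Note that a randomized type may coincide with the original one by chance; the paper's variant study (Section on counterfactual syntax generation) suggests this has little effect.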
Counterfactual POS Tags. Counterfactual POS tags are used together with the factual dependency graph on the blue branch in Figure 3. We create counterfactual POS tags P⁻ from the factual POS tags P by randomly selecting a POS tag for each token, and accordingly replace each embedding p_j by p⁻_j. Given the relation-aware features Ĥ from the black branch, we then feed the embeddings of the counterfactual POS tags into Eq. 4 and obtain the counterfactual features

h̃⁻_j = [ĥ_j ; p⁻_j] W_F.

Finally, we calculate the similarity between the factual and the counterfactual features on the blue branch with the dot-product operation:

L_pos = (1/S) Σ_{j=1}^{S} h̃_j · h̃⁻_j.

Minimizing this counterfactual loss forces the model to emphasize the syntactic information related to POS tags. The overall loss function used in training is

L = L_task + λ (L_dep + L_pos),

where L_task is the task-specific loss, i.e., a cross-entropy loss, and λ is a scale that balances the task-specific loss and our proposed counterfactual losses.
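The counterfactual POS sampling and the overall objective can be sketched as follows (function names are illustrative):

```python
import numpy as np

NUM_POS = 17   # number of universal POS tags

def counterfactual_pos(pos_ids, rng):
    """Draw a uniformly random POS tag for every token (P^- from P)."""
    return rng.integers(0, NUM_POS, size=len(pos_ids))

def total_loss(l_task, l_dep_cf, l_pos_cf, lam=0.1):
    """L = L_task + lambda * (L_dep + L_pos); the experiments use lambda = 0.1."""
    return l_task + lam * (l_dep_cf + l_pos_cf)

rng = np.random.default_rng(0)
p_cf = counterfactual_pos([3, 3, 3, 3], rng)
```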

Experiments
In this section, we evaluate our COSY method for cross-lingual understanding under both zero-shot and few-shot settings. For the zero-shot setting, we train on English and evaluate the model on different target languages. For the few-shot setting, we follow the implementation of prior work and use the development set of the target languages for model fine-tuning.

Datasets
We evaluate our method on the natural language inference (NLI) and question answering (QA) tasks. We briefly introduce the datasets used in our experiments as follows.

Natural Language Inference (NLI). Given two sentences, NLI asks for the relationship between them, which can be entailment, contradiction or neutral. We conduct experiments on XNLI (Conneau et al., 2018) and evaluate our method on 13 target languages.

Question Answering (QA). We consider the extractive QA task that asks the model to locate the answer span for a given question in a passage. We conduct experiments on MLQA and XQUAD.

Implementation
In data preprocessing, we feed the same syntactic information to each of the subwords of the same word after tokenization. Our implementation of the pre-trained language models (mBERT and XLM-R) is based on Hugging Face's Transformers (Wolf et al., 2020). We select the checkpoint and set hyper-parameters, e.g., the learning rate and the λ in the loss function, based on the performance on the corresponding development sets. We select the learning rate from {7.5e−6, 1e−5, 3e−5} and fix the batch size to 32. We select the syntax embedding dimension d' from {100, 300}. The λ in the counterfactual loss is set to 0.1 (see Figure 4). A linear warm-up strategy for the learning rate is adopted over the first 10% of optimization steps. Adam (Kingma and Ba, 2014) is adopted as the optimizer. All experiments are conducted on a workstation with dual NVIDIA V100 32GB GPUs.
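The warm-up schedule can be sketched as a pure function; Hugging Face's `get_linear_schedule_with_warmup` behaves similarly, though the linear decay to zero after warm-up is our assumption here, since the text only specifies the warm-up phase.

```python
def linear_warmup_lr(step, total_steps, base_lr, warmup_frac=0.1):
    """Ramp the learning rate linearly over the first warmup_frac of steps,
    then decay it linearly to zero (the decay shape is assumed)."""
    warmup = max(1, int(total_steps * warmup_frac))
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup))
```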

Results
We compare our method with naive fine-tuning and the state-of-the-art methods. The overall results on the three benchmarks are presented in Table 1 and Table 2.

Table 2: Results of XNLI under the few-shot setting (mBERT). We report the testing results of English ("en."), the average results over all non-English languages ("non-en. avg.") and the average results over all languages ("avg."). * denotes previously reported results. More details are available in the Appendix.
Comparison with Naive Fine-tuning. From Table 1 and Table 2, we can observe that COSY consistently outperforms the naive fine-tuning method on all datasets, e.g., by an average of 1.9 percentage points (accuracy) on XNLI and 2.9 percentage points (F1) on XQUAD with XLM-R large in the zero-shot setting. These observations demonstrate the effectiveness of COSY and suggest that universal syntax, as a language-agnostic feature, can enhance the transferability for cross-lingual understanding. Furthermore, the results show that COSY works with different backbones and is thus model-agnostic.
Comparison with the State of the Art. We first outline the SOTA zero-shot (few-shot) cross-lingual methods we compare with: (1) XMAML-One borrows the idea of meta-learning. Specifically, XMAML-One utilizes the development data of an auxiliary language during training, e.g., using the development set of Spanish in training to assist German on MLQA. XMAML-One reports results based on the most beneficial auxiliary language.
(2) STILT (Phang et al., 2020) performs intermediate-task training before fine-tuning on the target task. On the one hand, we observe that COSY surpasses the compared SOTA methods on all evaluation metrics. Although meta-learning methods (Finn et al., 2017; Gu et al., 2018; Sun et al., 2019) have advanced the state of the art in few-shot learning, our COSY still outperforms the meta-learning-based method, i.e., XMAML-One, by 1.1 percentage points in the few-shot setting. On the other hand, the superiority of COSY is also reflected in other aspects, shown in Table 1. Specifically, COSY requires neither additional datasets nor a cumbersome data selection process, making it more convenient and resource-saving.

Discussion and Analysis
Ablation Study. In Table 3, we show the MLQA, XQUAD and XNLI results in four ablative settings, evaluating the approach when we (1) only use the SAN-Black branch; (2) use the SAN-Black branch with an intuitive gating mechanism to balance the information from the pre-trained language model and the syntax; (3) use the SAN-Black and SAN-Red branches; and (4) use the SAN-Black and SAN-Blue branches.
Comparing the ablative results, we can see that our full method (5) achieves the overall top performance in all settings. Syntax features are incorporated into the models in (1)-(5), and all of them outperform the naive fine-tuning method, which demonstrates the effectiveness of universal syntax. Analyzing the settings one by one, we observe that SAN-Black (1) attains only limited improvement over naive fine-tuning, since the incorporated syntax is largely overlooked by the model. The gating mechanism (2) fails to solve this overlooking issue. Both (3) and (4), with counterfactual training, bring gains compared to (1), and the results indicate that dependency relations are more effective than POS tags. We also observe that our full method (5) does not accumulate the gains from (3) and (4). One explanation could be that part of the information provided by the dependency relations and the POS tags overlaps. For instance, if we see an edge word_a --AMOD--> word_b, we may infer that word_a is a NOUN and word_b is an ADJ.

Effect of λ. We now study the impact of the scale value λ on the counterfactual losses. For clarity, we show the results for different values of log λ in Figure 4. We observe that COSY attains the highest results when λ = 0.1 on both MLQA and XNLI. As the value drops, the effect of the counterfactual losses shrinks and the performance approaches that of naive fine-tuning (red dotted line). With a large value of λ, e.g., λ = 1, the model begins to over-emphasize syntax and overlook semantics, which leads to a significant decrease in performance.

Effect of COSY. In this part, we first study whether the counterfactual training method indeed guides the model to focus more on syntactic information, by analyzing COSY and SAN-Black.
Since it is non-trivial to measure the utilization of syntax in a straightforward way, we adopt a standard approach to measuring the importance of neurons in deep models (Kádár et al., 2017). Specifically, we perturb the syntactic features of the test data with Gaussian noise and check whether our model is more easily affected by the syntax perturbation; if so, this verifies that our model indeed relies more on syntax. The results are shown in Figure 5: the performance drop of COSY is larger than that of SAN-Black.
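The perturbation probe can be sketched as follows; `perturb_syntax` is a hypothetical helper, not the paper's code. The measurement itself is the difference in task performance with clean versus perturbed syntax embeddings.

```python
import numpy as np

def perturb_syntax(syntax_embeddings, sigma, rng):
    """Add i.i.d. Gaussian noise to syntax embeddings (relation or POS
    embeddings); a model that relies more on syntax should show a larger
    performance drop under this perturbation."""
    noise = rng.normal(scale=sigma, size=syntax_embeddings.shape)
    return syntax_embeddings + noise

rng = np.random.default_rng(0)
clean = np.zeros((5, 4))
noisy = perturb_syntax(clean, sigma=0.5, rng=rng)
```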
Meanwhile, we also explore whether COSY yields more meaningful syntax embeddings than SAN-Black. Specifically, we compute the correlation score (absolute cosine similarity) between the embedding of each syntactic relation and that of its corresponding inverse relation. For COSY, we observe that the score of the related types is 42.4× larger than that of two randomly selected embeddings (averaged over 10000 draws). For SAN-Black, however, the score is only 1.4× larger. This demonstrates that COSY attains more meaningful syntax representations than SAN-Black.

Table 4: Results of different ways of generating counterfactual syntax, with mBERT as the backbone. "Current" means the generation way described in Section 3. We report the average performance over all languages.
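The correlation measurement can be sketched as follows; `inverse_relation_ratio` is a hypothetical helper mirroring the described computation (mean score over related pairs divided by the mean score over random pairs, as in the 42.4× and 1.4× figures).

```python
import numpy as np

def abs_cosine(u, v):
    """Absolute cosine similarity between two embedding vectors."""
    return abs(float(u @ v)) / (np.linalg.norm(u) * np.linalg.norm(v))

def inverse_relation_ratio(R, pairs, rng, n_random=10000):
    """Mean |cos| over (rel, rel^-1) embedding-id pairs, divided by the
    mean |cos| over randomly drawn pairs of embeddings from R."""
    related = np.mean([abs_cosine(R[a], R[b]) for a, b in pairs])
    idx = rng.integers(0, len(R), size=(n_random, 2))
    random_mean = np.mean([abs_cosine(R[a], R[b]) for a, b in idx])
    return float(related / random_mean)

rng = np.random.default_rng(0)
R_demo = rng.normal(size=(74, 32))
# Degenerate demo: pairing each embedding with itself gives |cos| = 1,
# so the ratio is well above 1 for random embeddings.
ratio = inverse_relation_ratio(R_demo, [(0, 0), (1, 1)], rng, n_random=500)
```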

Counterfactual Syntax Generation.
Here we analyze alternative ways of generating counterfactual syntax. Specifically, we design the following variants and report the results in Table 4: (1) we replace not only edge types but also connections when constructing the counterfactual dependency graph; (2) for each input sequence, we create 5 counterfactual dependency graphs and 5 sets of counterfactual POS tags, and the counterfactual loss is the average over the 5 sets; (3) we replace the factual syntax with a fixed type, e.g., a padding type, instead of a random type drawn from all types; (4) in each generation process, we only replace 50% of the factual syntax.
Comparing (1) with the result of "SAN-Black,Blue" in Table 3, we can see that (1) does not work. We believe that randomly changing connections in G⁻, e.g., creating an edge from the first token to the last token in a long passage, may have a significant effect on Ĥ⁻, which is undesirable for the optimization of the counterfactual loss. Results from (2) and (4) suggest that the number of generated counterfactual syntax samples and the randomization ratio do not play an important role in COSY. We also find that randomizing over all types is better than simple replacement with a fixed type.

Conclusion
We study how to effectively plug syntactic information into cross-lingual understanding. Specifically, we propose a novel counterfactual-syntax-based approach to emphasize the importance of syntax in cross-lingual models. We conduct extensive experiments on three cross-lingual benchmarks and show that our approach outperforms the SOTA methods without additional datasets. For future work, we will combine our approach with other orthogonal methods, e.g., meta-learning, to further improve its effectiveness.

Table 6: Results on MLQA under the zero-shot setting. We report the Exact Match and F1 score (EM / F1) on 7 languages. *: our implementation using official code; 1: (Hu et al., 2020); 2: (Liang et al., 2020); 3: (Yuan et al., 2020); 4: ; 5: (Phang et al., 2020).

Table 7: Results on XQUAD under the zero-shot setting. We report the Exact Match and F1 score (EM / F1) on 10 languages. *: our implementation using official code; 1: (Hu et al., 2020); 2: (Phang et al., 2020).