Syntax-augmented Multilingual BERT for Cross-lingual Transfer

In recent years, we have seen a colossal effort in pre-training multilingual text encoders using large-scale corpora in many languages to facilitate cross-lingual transfer learning. However, due to typological differences across languages, cross-lingual transfer is challenging. Nevertheless, language syntax, e.g., syntactic dependencies, can bridge the typological gap. Previous works have shown that pre-trained multilingual encoders, such as mBERT (Devlin et al., 2019), capture language syntax, helping cross-lingual transfer. This work shows that explicitly providing language syntax and training mBERT using an auxiliary objective to encode the universal dependency tree structure helps cross-lingual transfer. We perform rigorous experiments on four NLP tasks: text classification, question answering, named entity recognition, and task-oriented semantic parsing. The experiment results show that syntax-augmented mBERT improves cross-lingual transfer on popular benchmarks, such as PAWS-X and MLQA, by 1.4 and 1.6 points on average across all languages. In the generalized transfer setting, the gains are larger still: 3.9 and 3.1 points on average on PAWS-X and MLQA.


Introduction
Cross-lingual transfer reduces the requirement of labeled data to perform natural language processing (NLP) in a target language, and thus can avail NLP applications in low-resource languages. However, transferring across languages is challenging because of linguistic differences at the levels of morphology, syntax, and semantics. For example, word order difference is one of the crucial factors that impact cross-lingual transfer (Ahmad et al., 2019). The two sentences in English and Hindi shown in Figure 1 have the same meaning but a different word order (while English has an SVO (Subject-Verb-Object) order, Hindi follows SOV). However, the sentences have a similar dependency structure, and the constituent words have similar part-of-speech tags. Presumably, language syntax can help to bridge the typological differences across languages.
* Work done during internship at Facebook AI.
In recent years, we have seen a colossal effort to pre-train Transformer encoders (Vaswani et al., 2017) on large-scale unlabeled text data in one or many languages. Multilingual encoders, such as mBERT (Devlin et al., 2019) or XLM-R (Conneau et al., 2020), map text sequences into a shared multilingual space by jointly pre-training in many languages. This allows us to transfer the multilingual encoders across languages, and they have been found effective for many NLP applications, including text classification (Bowman et al., 2015; Conneau et al., 2018), question answering (Rajpurkar et al., 2016; Lewis et al., 2020), named entity recognition (Pires et al., 2019; Wu and Dredze, 2019), and more. Since the introduction of mBERT, several works (Wu and Dredze, 2019; Pires et al., 2019; K et al., 2020) attempted to reason about its success in cross-lingual transfer. In particular, Wu and Dredze (2019) showed that mBERT captures language syntax that makes it effective for cross-lingual transfer. A few recent works (Hewitt and Manning, 2019; Jawahar et al., 2019; Chi et al., 2020) suggest that BERT learns compositional features, mimicking a tree-like structure that agrees with the Universal Dependencies taxonomy.

[Figure 2: An example from MLQA (Lewis et al., 2020) with predictions from mBERT and our proposed syntax-augmented mBERT. In "Q:x-C:y", x and y indicate the question and context languages, respectively. Based on our analysis of the highlighted tokens' attention weights, we conjecture that mBERT answers 630 as the token is followed by "miembros", while 315 is followed by "senadores" in Spanish.]
However, fine-tuning on the downstream task in a source language may not require mBERT to retain structural features or learn to encode syntax. We argue that encouraging mBERT to learn the correlation between syntax structure and target labels can benefit cross-lingual transfer. To support our argument, we show an example of question answering (QA) in Figure 2. In the example, mBERT predicts an incorrect answer given the Spanish-language context, which can be corrected by exploiting syntactic clues. Utilizing syntax structure can also benefit generalized cross-lingual transfer (Lewis et al., 2020), where the input text sequences belong to different languages, for example, answering an English question based on a Spanish passage or predicting text similarity given the two sentences shown in Figure 1. In such a setting, syntactic clues may help to align sentences.
In this work, we propose to augment mBERT with universal language syntax while fine-tuning on downstream tasks. We use a graph attention network (GAT) (Veličković et al., 2018) to learn structured representations of the input sequences that are incorporated into the self-attention mechanism. We adopt an auxiliary objective to train GAT such that it embeds the dependency structure of the input sequence accurately. We perform an evaluation on zero-shot cross-lingual transfer for text classification, question answering, named entity recognition, and task-oriented semantic parsing. Experiment results show that augmenting mBERT with syntax improves cross-lingual transfer, e.g., on PAWS-X and MLQA, by 1.4 and 1.6 points on average across all the target languages. Syntax-augmented mBERT achieves a remarkable gain in generalized cross-lingual transfer; on PAWS-X and MLQA, performance is boosted by 3.9 and 3.1 points on average across all language pairs. Furthermore, we discuss challenges and limitations in modeling universal language syntax. We release the code to help future works.


Syntax-augmented Multilingual BERT

Multilingual BERT (mBERT) (Devlin et al., 2019) enables cross-lingual learning as it embeds text sequences into a shared multilingual space. mBERT is fine-tuned on downstream tasks, e.g., text classification, using monolingual data and then directly employed to perform on the target languages. This is referred to as zero-shot cross-lingual transfer. Our main idea is to augment mBERT with language syntax for zero-shot cross-lingual transfer. We employ a graph attention network (GAT) (Veličković et al., 2018) to learn syntax representations and fuse them into the self-attention mechanism of mBERT.
In this section, we first briefly review the Transformer encoder underlying mBERT (§2.1), then describe the graph attention network (GAT) that learns syntax representations from the dependency structure of text sequences (§2.2). Finally, we describe how language syntax is explicitly incorporated into the Transformer encoder (§2.3).

Transformer Encoder
The Transformer encoder (Vaswani et al., 2017) is composed of an embedding layer and stacked encoder layers. Each encoder layer consists of two sublayers: a multi-head attention layer followed by a fully connected feed-forward layer. We detail the process of encoding an input token sequence $(w_1, \ldots, w_n)$ into a sequence of vector representations $H = [h_1, \ldots, h_n]$ as follows.
Embedding Layer is parameterized by two embedding matrices: the token embedding matrix $W_e \in \mathbb{R}^{U \times d_{model}}$ and the position embedding matrix $W_p \in \mathbb{R}^{U \times d_{model}}$ (where $U$ is the vocabulary size and $d_{model}$ is the encoder output dimension). An input text sequence enters the model as two sequences: the token sequence $(w_1, \ldots, w_n)$ and the corresponding absolute position sequence $(p_1, \ldots, p_n)$. The output of the embedding layer is a sequence of vectors $\{x_i\}_{i=1}^{n}$ where $x_i = w_i W_e + p_i W_p$. The vectors are packed into a matrix $H_0 = [x_1, \ldots, x_n] \in \mathbb{R}^{n \times d_{model}}$ and fed to an $L$-layer encoder.
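As a rough sketch, the embedding layer can be written as follows (all sizes are illustrative, not mBERT's actual dimensions, and the function name is our own):

```python
import numpy as np

def embed(token_ids, d_model=8, vocab_size=100, max_len=16, seed=0):
    """Sketch of the embedding layer: x_i = w_i W_e + p_i W_p.

    Token ids index the token embedding matrix W_e; absolute positions
    index the position embedding matrix W_p.
    """
    rng = np.random.default_rng(seed)
    W_e = rng.normal(size=(vocab_size, d_model))  # token embeddings
    W_p = rng.normal(size=(max_len, d_model))     # position embeddings
    positions = np.arange(len(token_ids))
    return W_e[token_ids] + W_p[positions]        # H_0, shape (n, d_model)

H0 = embed([5, 17, 42])
print(H0.shape)  # (3, 8)
```

The sum of the token and position lookups gives the matrix $H_0$ that enters the first encoder layer.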
Multi-head Attention allows the model to jointly attend to information from different representation subspaces, known as attention heads. A multi-head attention layer is composed of $h$ attention heads with the same parameterization structure. At each attention head, the output from the previous layer $H_{l-1}$ is first linearly projected into queries, keys, and values as follows:

$$Q_l = H_{l-1} W_l^Q, \quad K_l = H_{l-1} W_l^K, \quad V_l = H_{l-1} W_l^V, \qquad (1)$$

where $W_l^Q, W_l^K \in \mathbb{R}^{d_{model} \times d_k}$ and $W_l^V \in \mathbb{R}^{d_{model} \times d_v}$ are unique per attention head. Then scaled dot-product attention is performed to compute the output vectors

$$A_l = \mathrm{softmax}\left(\frac{Q_l K_l^\top}{\sqrt{d_k}} + M\right) V_l, \qquad (2)$$

where $M \in \mathbb{R}^{n \times n}$ is the masking matrix that determines whether a pair of input positions can attend each other. In classic multi-head attention, $M$ is a zero matrix (all positions can attend each other). The output vectors from all the attention heads are concatenated and projected into $d_{model}$ dimensions using the parameter matrix $W_o \in \mathbb{R}^{h d_v \times d_{model}}$. Finally, the vectors are passed through a feed-forward network to output $H_l \in \mathbb{R}^{n \times d_{model}}$.

[Figure 3: A simplified illustration of the multi-head self-attention in the graph attention network, wherein at each head attention is allowed between words within δ distance from each other in the dependency graph. For example, in one of the attention heads, the word "likes" is only allowed to attend to its adjacent (δ=1) words "dog" and "play".]
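A minimal single-head version of the masked scaled dot-product attention described above can be sketched as follows (the mask is additive; names and sizes are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(H, Wq, Wk, Wv, M):
    """Scaled dot-product attention for one head with additive mask M.

    M[i, j] = 0 lets position i attend to j; M[i, j] = -inf blocks it.
    In the classic encoder, M is all zeros.
    """
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    dk = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(dk) + M
    return softmax(scores) @ V

n, d_model, dk = 4, 8, 4
rng = np.random.default_rng(0)
H = rng.normal(size=(n, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, dk)) for _ in range(3))
out = attention_head(H, Wq, Wk, Wv, np.zeros((n, n)))
print(out.shape)  # (4, 4)
```

The same additive-mask mechanism is what the GAT below reuses, with $M$ derived from the dependency tree instead of being all zeros.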

Graph Attention Network
We embed the syntax structure of the input token sequences using their universal dependency parse. A dependency parse is a directed graph where the nodes represent words, and the edges represent dependencies (the dependency relation between the head and dependent words). We use a graph attention network (GAT) (Veličković et al., 2018) to embed the dependency tree structure of the input sequence. We illustrate GAT in Figure 3.
Given the input sequence, the words ($w_i$) and their part-of-speech tags ($pos_i$) are embedded into vectors using two parameter matrices: the token embedding matrix $W_e$ and the part-of-speech tag embedding matrix $W_{pos}$. The input sequence is then encoded into an input matrix $G_0 = [g_1, \ldots, g_n]$, where $g_i = w_i W_e + pos_i W_{pos} \in \mathbb{R}^{d_{model}}$. Note that the token embedding matrix $W_e$ is shared between GAT and the Transformer encoder. Then $G_0$ is fed into an $L_G$-layer GAT where each layer generates word representations by attending to their adjacent words.
GAT uses the multi-head attention mechanism and performs dependency-aware self-attention, computing the query and key matrices from the same layer input $G_{l-1} \in \mathbb{R}^{n \times d_g}$ and defining the mask $M$ by

$$M_{ij} = \begin{cases} 0, & \text{if } D_{ij} \le \delta \\ -\infty, & \text{otherwise,} \end{cases} \qquad (3)$$

where $D$ is the distance matrix and $D_{ij}$ indicates the shortest path distance between words $i$ and $j$ in the dependency graph structure. Typically in GAT, δ is set to 1, allowing attention between adjacent words only. However, in our study, we find setting δ in [2, 4] helpful for the downstream tasks. Finally, the vector representations from all the attention heads (as in Eq. (2)) are concatenated to form the output representations $G_l \in \mathbb{R}^{n \times k d_g}$, where $k$ is the number of attention heads employed. The goal of the GAT encoder is to encode the dependency structure into vector representations. Therefore, we design GAT to be lightweight, consisting of far fewer parameters than the Transformer encoder. Note that GAT does not employ positional representations and only consists of multi-head attention; there are no feed-forward sublayers or residual connections.
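The distance-based mask can be sketched as follows, assuming a toy dependency tree for "my dog likes to play" (indices and edges are our illustrative choices). With δ=1, "likes" may attend only to its neighbors "dog" and "play", matching the Figure 3 example:

```python
import numpy as np
from collections import deque

def tree_distances(n, edges):
    """Shortest-path distance matrix D of an (undirected) dependency tree."""
    adj = [[] for _ in range(n)]
    for h, d in edges:
        adj[h].append(d)
        adj[d].append(h)
    D = np.full((n, n), np.inf)
    for s in range(n):                      # BFS from each node
        D[s, s] = 0
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if D[s, v] == np.inf:
                    D[s, v] = D[s, u] + 1
                    q.append(v)
    return D

def gat_mask(D, delta=1):
    """M[i, j] = 0 if D[i, j] <= delta, else -inf (attention blocked)."""
    return np.where(D <= delta, 0.0, -np.inf)

# "my dog likes to play": likes->dog, dog->my, likes->play, play->to
edges = [(2, 1), (1, 0), (2, 4), (4, 3)]
D = tree_distances(5, edges)
M = gat_mask(D, delta=1)
```

Adding this `M` inside the softmax of Eq. (2) restricts each head's attention to the δ-neighborhood of each word.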
Dependency Tree over Wordpieces and Special Symbols mBERT tokenizes the input sequence into subword units, also known as wordpieces. Therefore, we modify the dependency structure of linguistic tokens to accommodate wordpieces. We introduce additional dependencies between the first subword (head) and the rest of the subwords (dependents) of a linguistic token. More specifically, we introduce new edges from the head subword to the dependent subwords. The inputs to mBERT use special symbols: [CLS] and [SEP]. We add edges from the [CLS] token to the root of the dependency tree and to the [SEP] tokens.
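A sketch of this re-wiring, under the assumed layout that [CLS] sits at position 0 and [SEP] at the last position (the function and variable names are our own):

```python
def extend_to_wordpieces(word_heads, word_to_pieces, n_pieces):
    """Re-wire word-level dependency heads to wordpiece-level edges.

    word_heads[i] is the head word of word i (-1 for the root).
    word_to_pieces[i] lists the wordpiece positions of word i; its first
    piece acts as the word's head piece. Position 0 is [CLS] and
    position n_pieces - 1 is [SEP] (assumed layout).
    Returns a list of (head, dependent) edges.
    """
    edges = []
    first = [pieces[0] for pieces in word_to_pieces]
    for i, pieces in enumerate(word_to_pieces):
        # the head subword governs the remaining subwords of the same word
        edges += [(pieces[0], p) for p in pieces[1:]]
        if word_heads[i] == -1:
            edges.append((0, pieces[0]))            # [CLS] -> tree root
        else:
            edges.append((first[word_heads[i]], pieces[0]))
    edges.append((0, n_pieces - 1))                 # [CLS] -> [SEP]
    return edges

# two words, the second split into two wordpieces:
# [CLS]=0, w0=(1,), w1=(2, 3), [SEP]=4; w0 is the root, w1 depends on w0
edges = extend_to_wordpieces([-1, 0], [[1], [2, 3]], 5)
```

The resulting edge list can be fed to the same shortest-path distance computation used for the GAT mask.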

Syntax-augmented Transformer Encoder
We want the Transformer encoder to consider syntax structure while performing self-attention between input sequence elements. We use the syntax representations produced by GAT (outputs from the last layer, denoted as $G$) to bias the self-attention:

$$\hat{Q}_l = H_{l-1} W_l^Q + G G_l^Q, \quad \hat{K}_l = H_{l-1} W_l^K + G G_l^K,$$

where $G_l^Q, G_l^K \in \mathbb{R}^{k d_g \times d_k}$ are new parameters that learn representations to bias the self-attention.
We consider the additive terms ($G G_l^Q$, $G G_l^K$) as a syntax-bias that provides syntactic clues to guide the self-attention. The high-level intuition behind the syntax bias is to attend to tokens with a specific part-of-speech tag sequence or dependencies.

Syntax-heads mBERT employs $h$ (=12) attention heads, and the syntax representations can be infused into one or more of these heads, which we refer to as syntax-heads. In our experiments, we observed that instilling structural information into many attention heads degenerates the performance. For the downstream tasks, we consider one or two syntax-heads, whichever gives the best performance.

Syntax-layers refers to the encoder layers that are infused with syntax representations from GAT. mBERT has a 12-layer encoder, and our study finds considering all of the layers as syntax-layers beneficial for cross-lingual transfer.
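A sketch of the syntax-biased attention scores, where the GAT outputs are projected into the query/key space and added to the token projections (the projection names `Gq`/`Gk` and all sizes are our illustrative assumptions):

```python
import numpy as np

def syntax_biased_scores(H, G, Wq, Wk, Gq, Gk):
    """Pre-softmax attention scores with a syntax bias.

    H: (n, d_model) token states from the previous encoder layer.
    G: (n, k*dg)    GAT output (syntax) representations.
    Gq, Gk project the syntax representations into the query/key space,
    so Q_hat = H @ Wq + G @ Gq and K_hat = H @ Wk + G @ Gk.
    """
    Q = H @ Wq + G @ Gq
    K = H @ Wk + G @ Gk
    return Q @ K.T / np.sqrt(Q.shape[-1])

n, d_model, dg, dk = 4, 8, 6, 4
rng = np.random.default_rng(1)
scores = syntax_biased_scores(
    rng.normal(size=(n, d_model)), rng.normal(size=(n, dg)),
    rng.normal(size=(d_model, dk)), rng.normal(size=(d_model, dk)),
    rng.normal(size=(dg, dk)), rng.normal(size=(dg, dk)),
)
```

Only the designated syntax-heads would use these biased scores; the remaining heads keep the standard projections of Eq. (1).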

Fine-tuning
We jointly fine-tune mBERT and GAT on downstream tasks in the source language (English in this work) following the standard procedure. However, the task-specific training may not guide GAT to encode the tree structure. Therefore, we adopt an auxiliary objective that supervises GAT to learn representations that can be used to decode the tree structure. More specifically, we use GAT's output representations $G = [g_1, \ldots, g_n]$ to predict the tree distance between all pairs of words $(g_i, g_j)$ and the tree depth $\|g_i\|$ of each word $w_i$ in the input sequence. Following Hewitt and Manning (2019), we apply a linear transformation $\theta_1 \in \mathbb{R}^{m \times k d_g}$ to compute squared distances as follows:

$$d_{\theta_1}(g_i, g_j)^2 = \left(\theta_1 (g_i - g_j)\right)^\top \left(\theta_1 (g_i - g_j)\right).$$
The parameter matrix $\theta_1$ is learnt by minimizing

$$L_{dist} = \sum_{s} \frac{1}{|s|^2} \sum_{i,j} \left| d_T(w_i, w_j) - d_{\theta_1}(g_i, g_j)^2 \right|,$$

where $s$ ranges over all the text sequences in the training corpus and $d_T$ denotes the gold tree distance. Similarly, we train another parameter matrix $\theta_2$ to compute squared vector norms, the depths of the words. We train GAT's parameters and $\theta_1, \theta_2$ by minimizing the loss

$$L = L_{task} + \alpha \left(L_{dist} + L_{depth}\right),$$

where $\alpha$ is the weight for the tree structure prediction loss.
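The distance part of the auxiliary objective, in the style of the structural probe of Hewitt and Manning (2019), can be sketched as follows (function names and sizes are illustrative, and the gold distance matrix here is a toy value rather than a real tree):

```python
import numpy as np

def probe_sq_distance(G, theta):
    """Predicted squared tree distance from GAT outputs G (n, k*dg):
    d(i, j)^2 = ||theta (g_i - g_j)||^2.
    """
    diff = G[:, None, :] - G[None, :, :]   # (n, n, k*dg) pairwise diffs
    proj = diff @ theta.T                  # (n, n, m) probe projection
    return (proj ** 2).sum(-1)             # (n, n) squared distances

def distance_loss(G, theta, D_gold):
    """L1 gap between gold tree distances and predicted squared distances,
    normalized by the squared sentence length, as in the probing objective."""
    n = G.shape[0]
    return np.abs(D_gold - probe_sq_distance(G, theta)).sum() / n**2

rng = np.random.default_rng(0)
G = rng.normal(size=(5, 12))               # toy GAT outputs
theta = rng.normal(size=(4, 12))           # probe matrix theta_1
D_gold = np.ones((5, 5)) - np.eye(5)       # toy gold distances
loss = distance_loss(G, theta, D_gold)
```

The depth term is analogous, regressing $\|\theta_2 g_i\|^2$ onto each word's gold tree depth.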
Pre-training GAT Unlike mBERT's parameters, GAT's parameters are trained from scratch during task-specific fine-tuning. For low-resource tasks, GAT may not learn to encode the syntax structure accurately. Therefore, we utilize the universal dependency parses (Nivre et al., 2019) to pre-train GAT on the source and target languages. Note that, the pre-training objective for GAT is to predict the tree distances and depths as described above.

Experiment Setup
To study syntax-augmented mBERT's performance in a broader context, we perform an evaluation on four NLP applications: text classification, named entity recognition, question answering, and task-oriented semantic parsing. Our evaluation focuses on assessing the usefulness of utilizing universal syntax in zero-shot cross-lingual transfer.

Evaluation Tasks
Text Classification We conduct experiments on two widely used cross-lingual text classification tasks: (i) natural language inference and (ii) paraphrase detection. We use the XNLI (Conneau et al., 2018) and PAWS-X (Yang et al., 2019) datasets for the tasks, respectively. In both tasks, a pair of sentences is given as input to mBERT. We combine the dependency tree structure of the two sentences by adding two edges from the [CLS] token to the roots of the dependency trees.
Named Entity Recognition is a structure prediction task that requires identifying the named entities mentioned in the input sentence. We use the Wikiann dataset (Pan et al., 2017) and a subset of two tasks from CoNLL-2002 (Tjong Kim Sang, 2002) and CoNLL-2003 NER (Tjong Kim Sang and De Meulder, 2003). We collect the CoNLL datasets from XGLUE (Liang et al., 2020). In both datasets, there are 4 types of named entities: Person, Location, Organization, and Miscellaneous.

Question Answering We evaluate on two cross-lingual question answering benchmarks, MLQA (Lewis et al., 2020) and XQuAD (Artetxe et al., 2020). We use the SQuAD dataset (Rajpurkar et al., 2016) for training and validation. In the QA task, the inputs are a question and a context passage that consists of many sentences. We formulate QA as a multi-sentence reading comprehension task; we jointly train the models to predict the answer sentence and extract the answer span from it. We concatenate the question and each sentence from the context passage and use the [CLS] token representation to score the candidate sentences. We adopt the confidence method from Clark and Gardner (2018) and pick the highest-scored sentence to extract the answer span during inference. We provide more details of the QA models in the Appendix.
Task-oriented Semantic Parsing The fourth evaluation task is cross-lingual task-oriented semantic parsing. In this task, the input is a user utterance, and the goal is to predict the intent of the utterance and fill the corresponding slots. We conduct experiments on two recently proposed benchmarks: (i) mTOP (Li et al., 2021) and (ii) mATIS++ . We jointly train the BERT models as suggested in Chen et al. (2019). We summarize the evaluation task benchmark datasets and evaluation metrics in Table 1.

[Table 2: Cross-lingual transfer results for all the evaluation tasks (on test set) across 17 languages. We report the F1 score for the question answering (QA) datasets (for other datasets, see Table 1). We train and evaluate mBERT on the same pre-processed datasets and consider its performance as the baseline (denoted by the "mBERT" rows in the table).]

Implementation Details
We collect the universal part-of-speech tags and the dependency parses of sentences by pre-processing the datasets using UDPipe. We fine-tune mBERT on the pre-processed datasets and consider it as the baseline for our proposed syntax-augmented mBERT. We extend the XTREME framework (Hu et al., 2020), which is developed based on the transformers API (Wolf et al., 2020). We use the same hyper-parameter setting for the mBERT models as suggested in XTREME. For the graph attention network, ...

s1/s2  en   de   es   fr   ja   ko   zh
en     -    0.7  1.6  1.4  4.7  2.5  5.4
de     0.5  -    2.0  2.1  5.1  3.5  5.9
es     1.0  2.1  -    1.7  4.6  3.0  6.6
fr     0.9  1.7  1.9  -    5.0  2.7  5.4
ja     5.2  5.3  5.6  5.1  -    5.9  5.1
ko     3.1  2.8  4.3  3.9  6.4  -    5.1
zh     5.8  5.5  6.3  6.0  6.1  4.5  -

Table 3: The performance difference between syntax-augmented mBERT and mBERT in the generalized cross-lingual transfer setting. The rows and columns indicate (a) the language of the first and second sentences in the candidate pairs and (b) the context and question languages. The gray cells have a value greater than or equal to the average performance difference, which is 3.9 and 3.1 for (a) and (b).

Experiment Results
We aim to address the following questions. 1. Does augmenting mBERT with syntax improve (generalized) cross-lingual transfer? 2. Does incorporating syntax benefit specific languages or language families? 3. Which NLP tasks or types of tasks get more benefits from utilizing syntax?

Cross-lingual Transfer
Experiment results comparing mBERT and syntax-augmented mBERT are presented in Table 2. Overall, the incorporation of language syntax in mBERT improves cross-lingual transfer on the downstream tasks, in many languages by a significant margin (p < 0.05, t-test). The average performances across all languages on the XNLI, PAWS-X, MLQA, and mTOP benchmarks improve significantly (by at least 1 point). On the other benchmarks, Wikiann, CoNLL, XQuAD, and mATIS++, the average performance improvements are 0.5, 0.2, 0.8, and 0.7 points, respectively. Note that the performance gains in the source language (English) for all the datasets except Wikiann are ≤ 0.3. This indicates that the cross-lingual transfer gains are not due to improvements on the downstream tasks themselves; instead, language syntax helps to transfer across languages.

Generalized Cross-lingual Transfer
In the generalized cross-lingual transfer setting (Lewis et al., 2020), the input text sequences for the downstream tasks (e.g., text classification, QA) may come from different languages. As shown in Figure 2, given the context passage in English, a multilingual QA model should answer the question written in Spanish. Due to the parallel nature of the existing benchmark datasets XNLI, PAWS-X, MLQA, and XQuAD, we evaluate mBERT and its syntax-augmented variant in the generalized cross-lingual transfer setting. The results for PAWS-X and MLQA are presented in Table 3 (results for the other datasets are provided in the Appendix).
In both the text classification and QA benchmarks, we observe significant improvements for most language pairs. In the PAWS-X text classification task, language pairs with different typologies (e.g., en-ja, en-zh) have the largest gains. When Chinese (zh) or Japanese (ja) is in the language pair, the performance is boosted by at least 4.5%. The dataset characteristics explain this; the task requires modeling structure, context, and word order information. On the other hand, in the XNLI task, the pattern of performance gains is scattered, perhaps because syntax plays a less significant role in XNLI. The largest improvements result when the languages of the premise and hypothesis sentences belong to {Bulgarian, Chinese} and {French, Arabic}.
In both QA datasets, syntax-augmented mBERT boosts performance when the question and context languages are typologically different, except for Hindi. Surprisingly, we observe a large performance gain when questions in Spanish and German are answered based on the English context. Based on our manual analysis of MLQA, we suspect that although the questions in Spanish and German are translated from English questions (by humans), the context passages are from Wikipedia and often are not exact translations of the corresponding English passage. Take the context passages in Figure 2 as an example. We anticipate that syntactic clues help a QA model identify the correct answer span when there is more than one semantically equivalent and plausible answer choice.

Analysis & Discussion
We discuss and analyze our findings on the following points based on the empirical results.

Impact on Languages
We study if fine-tuning syntax-augmented mBERT on English (source language) impacts specific target languages or families of languages. We show the performance gains on the target languages grouped by their families in four downstream tasks in Figure 4. There is no observable trend in the overall performance improvements across tasks. However, the XNLI curve weakly indicates that when target languages are typologically different from the source language, there is an increase in the transfer performance (comparing left half to the right half of the curve).
Impact of Pre-training GAT Before fine-tuning syntax-augmented mBERT, we pre-train GAT on the 17 target languages (discussed in § 2.4). In our experiments, we observe that such pre-training boosts semantic parsing performance, while there is little gain on the classification and QA tasks. We also observe that pre-training GAT diminishes the gain of fine-tuning with the auxiliary objective (predicting the tree structure). We hypothesize that pre-training or fine-tuning GAT using the auxiliary objective helps when there is limited training data. For example, the semantic parsing benchmarks have a small number of training examples, while XNLI has many. As a result, the improvement due to pre-training or fine-tuning GAT in the semantic parsing tasks is significant, while in the XNLI task it is marginal.
Discussion To foster research in this direction, we discuss additional experiment findings.
• A natural question is: instead of using GAT, why do we not modify the attention heads in mBERT to embed the dependency structure directly (as shown in Eq. 3)? We observed a consistent performance drop across all the tasks when we intervene in self-attention this way (blocking pair-wise attention). We anticipate that fusing GAT-encoded syntax representations helps because it adds a bias to the self-attention rather than a hard constraint. For future work, we suggest exploring other ways of adding structure bias, e.g., scaling attention weights based on the dependency structure (Bugliarello and Okazaki, 2020).
• Among the evaluation datasets, Wikiann consists of sentence fragments, and the semantic parsing benchmarks consist of user utterances that are typically short in length. Sorting and analyzing the performance improvements based on sequence lengths suggests that the utilization of dependency structure has limited scope for shorter text sequences. However, part-of-speech tags help to identify span boundaries improving the slot filling tasks.

Limitations and Challenges
In this work, we assume we have access to an off-the-shelf universal parser, e.g., UDPipe (Straka and Straková, 2017) or Stanza , to collect part-of-speech tags and the dependency structure of the input sequences. Relying on such a parser has the limitation that it may not support all the languages available in the benchmark datasets; e.g., we do not consider the Thai and Swahili languages in the benchmark datasets.
There are a couple of challenges in utilizing the universal parsers. First, universal parsers tokenize the input sequence into words and provide part-of-speech tags and dependencies for them. The tokenized words may not be a part of the input. As a result, tasks requiring the extraction of text spans (e.g., QA) need an additional mapping from input tokens to words. Second, the parser's output word sequence is tokenized into wordpieces, which often results in inconsistent wordpieces and thus degenerated performance in the downstream tasks.

Related Work
Encoding Syntax for Language Transfer Universal language syntax, e.g., part-of-speech (POS) tags and dependency parse structures and relations, has been shown to be helpful for cross-lingual transfer (Kozhevnikov and Titov, 2013; Pražák and Konopík, 2017; Wu et al., 2017; Subburathinam et al., 2019; Liu et al., 2019; Xie et al., 2020; Ahmad et al., 2021). Many of these prior works utilized graph neural networks (GNN) to encode the dependency graph structure of the input sequences. In this work, we utilize graph attention networks (GAT) (Veličković et al., 2018), a variant of GNN that employs the multi-head attention mechanism.
Syntax-aware Multi-head Attention A large body of prior work investigated the advantages of incorporating language syntax to enhance the self-attention mechanism (Vaswani et al., 2017). Existing techniques can be broadly divided into two types. The first type of approach relies on an external parser (or human annotation) to get a sentence's dependency structure during inference. These approaches embed the dependency structure into contextual representations (Wu et al., 2017; Chen et al., 2017; Wang et al., 2019a,b; Bugliarello and Okazaki, 2020; Sachan et al., 2021; Ahmad et al., 2021). Our proposed method falls under this category; however, unlike prior works, our study investigates whether fusing the universal dependency structure into the self-attention of existing multilingual encoders helps cross-lingual transfer. Graph attention networks (GATs) that use multi-head attention and have been adopted for NLP tasks (Huang and Carley, 2019) also fall into this category. The second category of approaches does not require the syntax structure of the input text during inference. These approaches are trained to predict the dependency parse via supervised learning (Strubell et al., 2018; Deguchi et al., 2019).

Conclusion

In this work, we augment mBERT with universal language syntax by infusing structured representations into its multi-head attention mechanism. We employ a modified graph attention network to encode the syntax structure of the input sequences. The results endorse the effectiveness of our proposed approach in cross-lingual transfer. We discuss limitations and challenges to drive future works.

Broader Impact
In today's world, the number of speakers of some languages is in the billions, while it is only a few thousand for many others. As a result, a few languages offer large-scale annotated resources, while for many languages there are limited or no labeled data. Due to this disparity, natural language processing (NLP) is extremely challenging in low-resource languages. In recent years, cross-lingual transfer learning has achieved significant improvements, enabling us to avail NLP applications to a wide range of languages that people use across the world. However, one of the challenges in cross-lingual transfer is to learn the linguistic similarities and differences between languages and their correlation with the target NLP applications. Modern transferable models are pre-trained on humongous unlabeled corpora such that they can learn language syntax and semantics and encode them into universal representations. Such pre-trained models can benefit from the explicit incorporation of universal language syntax during fine-tuning for different downstream applications. This work presents a thorough study analyzing the pros and cons of utilizing the Universal Dependencies (UD) framework, which provides grammar annotations across many human languages. Our work can broadly impact the development of cross-lingual transfer solutions and make them accessible to people across the globe. In this work, we discuss the limitations and challenges in utilizing universal parsers to benefit the pre-trained models. Among the negative aspects of our work is the lack of an explanation of why some languages benefit more than others from the incorporation of universal syntax knowledge.

B Hyper-parameter Details
We present the hyper-parameter details in Table 4.

C Additional Experiment Results
Cross-lingual Transfer We provide the exact match (EM) and F1 accuracy on the MLQA dataset in Table 5. Intent classification accuracy, slot F1, and exact match (EM) accuracy for task-oriented semantic parsing, as well as the zero-shot cross-lingual transfer results for the evaluation tasks, are reported in Table 6. We highlight the cross-lingual transfer gap for mBERT and syntax-augmented mBERT on the evaluation tasks in Table 7.
Generalized Cross-lingual Transfer In generalized cross-lingual transfer, we assume the task inputs are a pair of texts that belong to two different languages, e.g., answering a Spanish question based on an English context (Lewis et al., 2020). We present the generalized cross-lingual transfer performance of syntax-augmented mBERT on XNLI, MLQA, and XQuAD in Tables 8, 9, 10, and 11. The performance differences between syntax-augmented mBERT and mBERT on generalized cross-lingual transfer on XNLI and XQuAD are presented in Tables 12 and 13.
Different Source Languages In our study, we primarily use English as the source language, as the training examples in all the benchmarks are in English. However, the authors of many of these benchmarks released translated-train examples in the target languages. This allows us to train mBERT and syntax-augmented mBERT in different languages (as source) and examine how this impacts cross-lingual transfer. We perform experiments on the PAWS-X task and present the results in Figure 5. We observe the largest transfer performance improvements when English and German are used as the source language. The improvements are relatively smaller when the Japanese, Korean, and Chinese languages are used as the source language. We suspect that the dependency parser may not accurately parse translated sentences, and as a result, we do not see an explainable trend in the improvements.

[Table 4 (fragment): Hyper-parameters.
...: 4 (classification, QA), 1 (NER, semantic parsing)
Syntax-augmented mBERT:
# syntax-layers: 12 (tuned in the range 1-12)
# syntax-heads: 1 (XNLI, PAWS-X, Wikiann, CoNLL, mATIS++), 2 (MLQA, XQuAD, mTOP)
α: 0.0 (XNLI), 0.2 (mATIS++), 0.5 (PAWS-X, Wikiann, MLQA, XQuAD), 1.0 (CoNLL), 2.0 (mTOP)
# epochs: 3 (QA), 5 (classification), 10 (NER, semantic parsing)]

[Figure 5: Zero-shot cross-lingual transfer performance difference between syntax-augmented mBERT and mBERT for the PAWS-X task using different languages as source.]