Sharing, Teaching and Aligning: Knowledgeable Transfer Learning for Cross-Lingual Machine Reading Comprehension

In cross-lingual language understanding, machine translation is often utilized to enhance the transferability of models across languages, either by translating the training data from the source language to the target, or by translating from the target to the source to aid inference. However, in cross-lingual machine reading comprehension (MRC), it is difficult for translation to provide such deep assistance, because answer span positions vary across languages. In this paper, we propose X-STA, a new approach for cross-lingual MRC. Specifically, we leverage an attentive teacher to subtly transfer the answer spans of the source language to the answer output space of the target. A Gradient-Disentangled Knowledge Sharing technique is proposed as an improved cross-attention block. In addition, we force the model to learn semantic alignments at multiple granularities and calibrate the model outputs with teacher guidance to enhance cross-lingual transferability. Experiments on three multi-lingual MRC datasets show the effectiveness of our method, outperforming state-of-the-art approaches.


Introduction
Recently, significant progress has been made in NLP by pre-trained language models (PLMs) (Radford et al., 2018; Devlin et al., 2019; Zhang et al., 2022). Yet, these models often require a sufficient amount of training data to perform well, which is difficult to achieve in cross-lingual low-resource adaptation. Although many cross-lingual PLMs have been proposed to learn generic feature representations (Devlin et al., 2019; Conneau and Lample, 2019; Conneau et al., 2020; Xue et al., 2021; Liu et al., 2020), the performance gap between source and target languages is still relatively large, especially for token-level tasks such as machine reading comprehension (MRC). In addition, ultra-large PLMs such as ChatGPT (OpenAI, 2023) exhibit amazing zero-shot generation abilities over multiple languages. We observe that such models may still not be sufficient for cross-lingual MRC, due to the linguistic and cultural differences between these languages, together with the requirement of very fine-grained extraction of answer spans.

Figure 1: Machine translation as an aid for cross-lingual transfer. Above is a natural language inference (NLI) task: the probability distribution of the source language can be fitted by KL Divergence for teaching low-resource languages; during inference, the target language can be translated into the source language, with its output used for calibration. Below is an MRC task, where the knowledge is difficult to transfer directly.
One of the most significant challenges in cross-lingual MRC is the lack of annotated datasets in low-resource languages, which are difficult to obtain. As seen, most of the current MRC datasets are in English (Rajpurkar et al., 2016). Another challenge is the linguistic and cultural variation that exists across languages, which exhibit different sentence structures, word orders and morphological features. For instance, languages such as Japanese, Chinese, Hindi and Arabic have different writing systems and more complicated grammatical systems than English, making it challenging for MRC models to comprehend the texts.
In the literature, machine translation-based data augmentation is often employed to translate the dataset of the source language into each target language for model training (Conneau et al., 2018; Hu et al., 2020; Ruder et al., 2021). As shown in Figure 1, it is relatively easy to enhance the cross-lingual transferability of simple sequence classification tasks by directly fitting the output probability distribution of the source language via Kullback-Leibler Divergence (Fang et al., 2021; Zheng et al., 2021; Yang et al., 2022). However, for MRC, it is not possible to use the output distribution of the source language directly to teach the target language, due to the answer span shift caused by translation.
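For the classification case shown in Figure 1, the teaching step reduces to fitting the target-language output distribution to the source-language one via KL Divergence. A minimal sketch of that objective, with illustrative class names and probabilities (not taken from the cited works):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete probability distributions."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))

# Teacher: source-language NLI output; student: target-language output.
# Class order and probabilities here are illustrative only.
p_source = [0.7, 0.2, 0.1]   # entailment / neutral / contradiction
p_target = [0.5, 0.3, 0.2]

loss = kl_divergence(p_source, p_target)  # minimized so the student matches
```

This works because NLI labels are position-independent; for MRC, the same trick fails since answer span positions shift under translation, which motivates X-STA.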
Motivated by this, we propose X-STA, a new approach for cross-lingual MRC that follows three principles: Sharing, Teaching and Aligning. For sharing, we propose the Gradient-Disentangled Knowledge Sharing (GDKS) technique, which uses parallel language pairs as model inputs and extracts knowledge from the source language. It enhances the understanding of the target language while avoiding degradation of the source language representations. For teaching, our approach leverages an attention mechanism that finds answer spans in the target language's context that are semantically similar to the source language's output answers, in order to calibrate the output answers. For aligning, multi-granularity alignments are utilized to further enhance the cross-lingual transferability of the MRC model. In this way, we enhance the model's understanding of different languages through knowledge sharing, teacher-guided calibration and multi-granularity alignment.
In summary, the main contributions of this study are as follows:
• We propose X-STA, a new approach for cross-lingual MRC based on three principles: sharing, teaching, and aligning.
• In X-STA, a Gradient-Disentangled Knowledge Sharing technique is proposed for transferring language representations. Output calibration and semantic alignments are further leveraged to enhance the cross-lingual transferability of the model.
• Extensive experiments on three multi-lingual MRC datasets verify that our approach outperforms state-of-the-art methods. Thorough ablation studies are conducted to understand the impact of each component of our method.

Related Work
In this section, we summarize the related work in the following three aspects.

Cross-lingual Knowledge Transfer
It aims to transfer knowledge learned from a source language to target languages. An intuitive approach is to use machine translation for data augmentation (Conneau et al., 2018; Bornea et al., 2021; Hu et al., 2020). Under this setting, more transferable cross-lingual representations can be learned through feature fusion (Fang et al., 2021), consistency regularization (Zheng et al., 2021) and manifold mixup (Yang et al., 2022). However, these works do not sufficiently exploit translation data for MRC.

X-STA: Proposed Approach
In this section, we present the detailed techniques of X-STA for cross-lingual MRC.

Task Definition and Basic Notations
Given a context C and a question Q, the MRC task is to extract a sub-sequence from C as the correct answer to Q. Denote the input sequence as X = {Q, C} ∈ R^N, where N is the sequence length. We use p_start ∈ R^N and p_end ∈ R^N to denote the answer start and end position probability distributions. For simplicity, we concatenate the two into p ∈ R^{N×2}. Similarly, y ∈ R^{N×2} represents the one-hot golden label sequence. In addition, we use h^l to denote the hidden states of a sequence in layer l ∈ L, where L is the total number of transformer layers. To predict the start and end positions of the correct answer span in X, the probability distribution p is induced over the entire sequence by feeding h^L into a linear classification layer and a softmax function: p = softmax(W h^L + b), where W and b are the weights and bias of the linear classifier.
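The span-prediction head described above can be sketched as follows. Shapes follow the notation in this section; the random hidden states and weights are placeholders only:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def span_head(h_L, W, b):
    """h_L: (N, d) final-layer hidden states of X = {Q, C};
    W: (d, 2), b: (2,). Returns p: (N, 2), where column 0 is the
    start-position distribution and column 1 the end-position
    distribution, each normalized over the N tokens."""
    logits = h_L @ W + b            # (N, 2)
    return softmax(logits, axis=0)  # softmax over the sequence dimension

rng = np.random.default_rng(0)
N, d = 16, 8
p = span_head(rng.normal(size=(N, d)), rng.normal(size=(d, 2)), np.zeros(2))
```

The predicted answer span is then read off as the argmax (or a constrained search) over the start and end columns of p.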


Gradient-Disentangled Knowledge Sharing
Although machine translation from high-resource languages to low-resource ones can be used for training multi-lingual models, the drawbacks are evident: i) machine translation quality varies across languages; ii) the original semantics can easily be lost during translation; iii) task labels are relatively expensive to obtain, especially for token-level cross-lingual tasks. Thus, as shown in Figure 2, we leverage parallel language pairs as the input and fuse cross-lingual representations.
As in Yang et al. (2022), cross-attention can be leveraged for feature fusion. However, a performance loss can be observed in the source language, as shown in Figure 3. A reasonable conjecture is that helping the target language to extract target-related information from the hidden states of the source language leads to a degeneration of the source language representations. To alleviate this problem, we propose Gradient-Disentangled Knowledge Sharing (GDKS), an improved version of the cross-attention block. Specifically, we block the gradients from the target language output back to the source language hidden states h_S^l. As compensation, we add a trainable correction term:

    h̃_S^l = sg(h_S^l) + f(sg(h_S^l))

Here, sg(·) is used to stop back-propagating gradients, preventing interference with the source language representations, and f(·) refers to a trainable linear transformation with dropout. We then use the target hidden states as the query and the converted source hidden states h̃_S^l as key and value to perform cross-attention:

    h_{T|S}^l = MHA(h_T^l, h̃_S^l, h̃_S^l)

where MHA is multi-head attention (Vaswani et al., 2017). Finally, the target hidden states are fused with the source-aware target hidden states by the weight λ:

    h_T^l ← λ h_{T|S}^l + (1 − λ) h_T^l

where λ = w · λ_0 + b, with w and b trainable parameters. It is worth noting that GDKS is implemented in a single transformer layer only.
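A forward-pass sketch of GDKS under stated assumptions: single-head attention stands in for MHA, a plain linear map stands in for the correction f (dropout omitted), and in a real implementation the stop-gradient sg(·) would be, e.g., `tensor.detach()` in PyTorch (numpy has no autograd, so sg is implicit here):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gdks_forward(h_t, h_s, W_f, lam0=0.3, w=1.0, b=0.0):
    """h_t, h_s: (N, d) target/source hidden states of layer l.
    W_f: (d, d) weights of the trainable correction f (assumed form).
    In training, h_s would be detached (sg) so gradients from the target
    output cannot reach the source representations; only f is updated."""
    h_s_tilde = h_s + h_s @ W_f        # sg(h_s) + f(sg(h_s)) in the forward pass
    d = h_t.shape[-1]
    attn = softmax(h_t @ h_s_tilde.T / np.sqrt(d), axis=-1)  # target queries source
    h_t_given_s = attn @ h_s_tilde     # source-aware target hidden states
    lam = w * lam0 + b                 # fusion weight lambda
    return lam * h_t_given_s + (1.0 - lam) * h_t

rng = np.random.default_rng(0)
h_fused = gdks_forward(rng.normal(size=(6, 4)), rng.normal(size=(6, 4)),
                       0.01 * rng.normal(size=(4, 4)))
```

The key design point is that the fusion improves target-language states without ever pushing gradients into the source-language branch, which is what preserves source performance.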

Attentive Teacher-Guided Calibration
As GDKS focuses on transferring knowledge from hidden states of the teacher model (trained from the source language), we also calibrate the model output distributions with teacher guidance.
Normalization. The premise of obtaining good guidance is that the representations of different languages should be normalized first. Following Pires et al. (2019), we hypothesize that the representation of a multi-lingual model is composed of language-specific and language-agnostic components. We estimate the language-specific features as the mean of the language representations, and remove them by subtracting this mean so that only the generic semantic features remain. The intuition is that a certain language may have a large number of phenomena such as function words (Libovický et al., 2020), so the average representation of that language is prominent. Inspired by Batch Normalization (Ioffe and Szegedy, 2015), we then transform the generic semantic representation into the standard normal distribution space:

    ĥ = (h − µ_β) / √(σ_β² + ε)

where µ_β and σ_β² are the mean and variance of token-level representations in batch β, and ε is a constant for numerical stability. To facilitate its use during inference, the normalization is kept language-independent.
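A sketch of this normalization step, assuming the language mean has been precomputed over that language's representations (batch shape and values are placeholders):

```python
import numpy as np

def normalize_representations(h, lang_mean, eps=1e-8):
    """h: (B, N, d) token representations of one language in batch beta.
    lang_mean: (d,) precomputed mean representation of that language.
    Subtracting lang_mean drops language-specific features; the batch-norm
    style rescaling then maps the result toward standard-normal space."""
    h = h - lang_mean                          # keep generic semantic features
    mu = h.mean(axis=(0, 1), keepdims=True)    # token-level batch mean
    var = h.var(axis=(0, 1), keepdims=True)    # token-level batch variance
    return (h - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
h = rng.normal(loc=2.0, size=(4, 10, 8))
h_norm = normalize_representations(h, h.mean(axis=(0, 1)))
```

After this step, representations of different languages live in a comparable space, which is what makes the cross-lingual attention in the calibration step meaningful.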
Calibration. After normalization, we use the hidden states of the target language as the query, and the hidden states and the output distribution of the source language as key and value, respectively. We again leverage MHA and average the results over the multiple heads. Hence, the transferred output distribution p_{T|S} ∈ R^{N×2} is:

    p_{T|S} = MHA(h̃_T, h̃_S, p_S)

where h̃_T and h̃_S are the normalized hidden states of the target and source languages, respectively. During the model training phase, we incorporate a teacher-guided loss L_tg computed on p_{T|S}. Thus, tokens with the same semantics but in different languages can still be brought closer together by the annotated data, even if their representations differ significantly. Specifically, the sample-wise loss L_tg is the cross-entropy between the golden labels and the transferred distribution:

    L_tg = − Σ_{i=1}^{N} y_{T,i} · log p_{T|S,i}

For model inference, we leverage p_{T|S} to calibrate the output for the target language by averaging the two output distributions, i.e., p̂_T = (p_T + p_{T|S}) / 2.
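The calibration step can be sketched as follows, again with single-head attention standing in for the averaged multi-head transfer described in the text (all inputs are placeholders):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def teacher_guided_calibration(h_t_norm, h_s_norm, p_s):
    """h_t_norm, h_s_norm: (N, d) normalized target/source hidden states;
    p_s: (N, 2) source start/end distributions. Each target token attends
    to source tokens and pools their output probabilities, yielding the
    transferred distribution p_{T|S}."""
    d = h_t_norm.shape[-1]
    attn = softmax(h_t_norm @ h_s_norm.T / np.sqrt(d), axis=-1)  # (N, N)
    return attn @ p_s                                            # (N, 2)

rng = np.random.default_rng(0)
N, d = 12, 8
p_s = softmax(rng.normal(size=(N, 2)), axis=0)      # source output distribution
p_t = softmax(rng.normal(size=(N, 2)), axis=0)      # target output distribution
p_ts = teacher_guided_calibration(rng.normal(size=(N, d)),
                                  rng.normal(size=(N, d)), p_s)
p_hat = 0.5 * (p_t + p_ts)   # inference-time calibrated output
```

Attention here acts as a soft token aligner: probability mass placed on a source answer token is routed to the semantically closest target tokens, sidestepping the answer-span shift problem.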

Multi-Granularity Semantic Alignment
We further enhance the knowledge transfer of our model with the proposed Multi-Granularity Semantic Alignment (MGSA) technique.

Sentence-Level Alignment. A vanilla approach to learning alignments is at the sentence level. Here, we employ Contrastive Learning (CL; Hadsell et al., 2006; Chen et al., 2020) to strengthen the alignment across languages:

    L_align^s = −log [ exp(sim(r, r⁺)/τ) / (exp(sim(r, r⁺)/τ) + Σ_{r⁻} exp(sim(r, r⁻)/τ)) ]

where r is the mean-pooled sentence representation, and r⁺ and r⁻ represent a positive sample from the parallel translated data and a negative example in the mini-batch, respectively. sim(r₁, r₂) is the cosine similarity, i.e., sim(r₁, r₂) = r₁ᵀ r₂ / (‖r₁‖ ‖r₂‖), and τ is the temperature hyper-parameter, which we set to 0.05 by default.

Token-Level Alignment. In Fomicheva et al. (2020) and Yang et al. (2022), the entropy of the cross-attention distribution (ECA) is used to measure the quality of machine translation. A smaller entropy of the attention distribution, i.e., more focused attention, indicates a relatively higher translation quality (Rikters and Fishel, 2017). Similarly, ECA can also represent the cross-lingual alignment quality, which we use as a penalty term for training the cross-lingual model to avoid distraction in GDKS. The token-level alignment loss L_align^t is defined as:

    L_align^t = (1/I) Σ_{i=1}^{I} Σ_{j=1}^{J} −a_ij log a_ij

where a_ij = softmax_j(h_{T,i} h_{S,j}ᵀ / √n) represents the cross-attention weights, n is the hidden size, I is the number of target tokens and J is the number of source tokens. The total alignment loss is the sum of the two parts, with ς and η as coefficients:

    L_align = ς L_align^s + η L_align^t
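Both alignment losses can be sketched directly from the definitions above; the sample vectors and attention matrices are illustrative only:

```python
import numpy as np

def sentence_alignment_loss(r, r_pos, r_negs, tau=0.05):
    """Contrastive sentence-level alignment: pull the mean-pooled sentence
    vector r toward its parallel translation r_pos, push it away from
    in-batch negatives r_negs."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    pos = np.exp(cos(r, r_pos) / tau)
    negs = sum(np.exp(cos(r, rn) / tau) for rn in r_negs)
    return float(-np.log(pos / (pos + negs)))

def token_alignment_loss(attn):
    """ECA penalty: mean (over target tokens) entropy of the cross-attention
    rows; focused (well-aligned) attention gives a small value."""
    eps = 1e-12
    return float(-(attn * np.log(attn + eps)).sum(axis=-1).mean())

# Focused attention incurs (almost) no penalty, uniform attention the maximum.
focused = np.eye(3)
uniform = np.full((3, 4), 0.25)
```

As a sanity check, `token_alignment_loss(uniform)` equals log 4 per the entropy formula, while the one-hot rows of `focused` give a penalty near zero.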

Final Training Objective
In brief, the final training objective of X-STA is:

    L = L_MRC + γ L_tg + L_align

where γ is a factor for the teacher-guided loss L_tg, and L_MRC refers to the cross-entropy loss of the MRC task. Following Yang et al. (2022), we split the MRC loss L_MRC into the MRC losses of the source language and the target language with a balancing factor α:

    L_MRC = α L_MRC^S + (1 − α) L_MRC^T

Experiments
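The combination of loss terms can be sketched as plain arithmetic; the exact placement of the balancing factor α between the source and target MRC losses is our assumption:

```python
def total_loss(l_mrc_src, l_mrc_tgt, l_tg, l_align_s, l_align_t,
               alpha=0.2, gamma=0.1, sigma=0.05, eta=0.05):
    """X-STA training objective: L = L_MRC + gamma * L_tg + L_align, with
    L_MRC balanced between source and target by alpha (placement assumed)
    and L_align = sigma * L_align_s + eta * L_align_t. Default factors
    match the values reported in the experimental settings."""
    l_mrc = alpha * l_mrc_src + (1.0 - alpha) * l_mrc_tgt
    l_align = sigma * l_align_s + eta * l_align_t
    return l_mrc + gamma * l_tg + l_align

loss = total_loss(2.0, 1.0, 1.0, 1.0, 1.0)  # -> 1.4 with the default factors
```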

Experimental Settings
We conduct extensive experiments based on two multi-lingual pre-trained backbones: mBERT (Devlin et al., 2019) and XLM-R base (Conneau et al., 2020). The batch size is set to 32. The learning rate is set to 3e-5 and decreases linearly with warmup. Following Yang et al. (2022), α is set to 0.2 and we implement GDKS in the 8th layer. We set λ_0 to 0.3 and ε to 1e-8. We perform a grid search over ς, η and γ in {0.01, 0.05, 0.1, 0.5} on the validation set of MLQA, and finally set them to 0.05, 0.05 and 0.1, respectively. We save the model with the best average performance across all languages on the validation set for testing. Note that there are no validation sets in XQuAD and TyDiQA. Following

Baselines
We systematically compare our method with the following strong baselines:
• Zero-shot models are trained on labeled data in the source language only, and directly evaluated on target languages.
• Trans-train (Hu et al., 2020) translates training data in English into target languages. The model is trained on the combination of these original and translated training sets.
• LAKM (Yuan et al., 2020) leverages a language-agnostic knowledge masking task using knowledge phrases, based on mBERT.
• CalibreNet (Liang et al., 2021) employs an unsupervised phrase boundary recovery pre-training task to enhance the multi-lingual boundary detection capability of XLM-R base.
• AA-CL (Chen et al., 2022) is a two-stage step-by-step algorithm for finding the best answer for cross-lingual MRC over XLM-R base.
• X-MIXUP (Yang et al., 2022) is a cross-lingual manifold mixup method that learns compromised representations for target languages, which produces the state-of-the-art results for cross-lingual MRC.
For Zero-shot and Trans-train, we report the results of mBERT from Hu et al. (2020) and reproduce the results of XLM-R base. For LAKM, CalibreNet and AA-CL (which have been evaluated over part of our settings), we report the results from their original papers. As for X-MIXUP (the state-of-the-art method), we report both the results from the original paper and our re-implementation, in order to conduct a rigorous comparison. Among these methods, only Zero-shot and CalibreNet are under the zero-shot setting; for the rest, translated data are available.

General Experimental Results
As shown in Table 1, based on mBERT, we achieve an average of 71.2% F1 and 53.2% EM on MLQA, exceeding all strong baselines; this is a gain of 1.7/1.5% compared to the state-of-the-art X-MIXUP. As shown in Tables 2 and 3, our method also consistently outperforms all the strong baselines on XQuAD and TyDiQA, with average improvements of 1.8/2.2 and 2.6/3.5 in F1/EM scores compared to X-MIXUP. In conclusion, our method outperforms state-of-the-art methods on three datasets with both backbones, showing its effectiveness and generalization. In addition, X-MIXUP significantly reduces the performance gap between the source and target languages, but also compromises performance on the source language; our approach instead achieves performance comparable to translate-train on English without negatively affecting the source language representations.

Ablation Study
We conduct an ablation study by removing each key component individually to evaluate the effectiveness of our method. As shown in Table 4, there is a performance gap when removing any of the components. Although the removal of GDKS has little effect on the overall performance, it significantly affects the representation of the source language, resulting in an obvious performance drop in the source language (i.e., English).
Removing the Attentive Teacher-Guided Calibration (ATGC) component degrades the model performance the most, and the results demonstrate that mapping the output of the source language to the target language space is effective and feasible. In addition, there is still some performance loss compared to removing ATGC only at inference time, which suggests that the improvement from ATGC does not only come from weighting the outputs of the source and target languages: using the answer span as additional knowledge can enhance the cross-lingual alignment through the teacher-guided loss L_tg. Figure 5 shows the visualized distribution of sentence representations before and after normalization. In the original space, the distributions of the same language (same color) tend to cluster together, while after normalization, these representations are sparsely dispersed, which shows that normalization can indeed factor out some of the language-specific representations.
Finally, we analyze the effectiveness of MGSA. As seen, token-level alignment contributes more than sentence-level alignment. A reasonable speculation is that the MRC task is more concerned with token-level representations. As shown in Figure 4, without token-level alignment, attention is more distracted and not well aligned across languages. With token-level alignment, this behavior is penalized, allowing attention to be focused on the QA-related token (e.g., "many").

Case Study
To further demonstrate the output of our approach, we show the answer generation process of a Hindi example and the corresponding English example from the XQuAD dataset, and compare it with a powerful ultra-large language model (i.e., ChatGPT). Figure 6 shows that for English, both our approach and ChatGPT answer the question well. However, in a low-resource language setting such as Hindi, mBERT and ChatGPT exhibit capability limitations. Without ATGC, our method fails to find the correct answer. When mapping the source language output to the target language output space, it successfully calibrates the output and generates the correct answer after averaging the two outputs. ChatGPT, on the other hand, produces plausible but incorrect answers, showing a sign of hallucination (also reported in Bang et al. (2023)). More cases in low-resource languages can be found in Appendix B.

Conclusion
In this paper, we propose X-STA, which addresses the challenges of cross-lingual MRC in effectively utilizing translation data and handling linguistic and cultural differences. Our work follows three principles: sharing, teaching and aligning. Experimental results on three datasets show that our approach achieves state-of-the-art performance compared to strong baselines. We further analyze the effectiveness of each component. In the future, we will extend our work to other cross-lingual NLP tasks for low-resource languages.

Figure 6: An example from the XQuAD dataset; its ground-truth answer is marked with another color and underlined. The source language example (English) on the left corresponds to the low-resource target language (Hindi) example on the right. The ChatGPT used is the Mar 14 Version, and our method uses mBERT as the backbone.

Limitations
Our approach requires a translation system as an aid and incurs additional costs during inference (the sequences translated back to the source language also need to go through the model).
For other cross-lingual token-level tasks (e.g., POS, NER), it is difficult to obtain the labels of translate-train data directly. Previous approaches usually use trained models to generate pseudo-labels. These low-quality labels pose significant challenges to our approach. Extending our approach to these tasks is left to future work.
Passage: As northwest Europe slowly began to warm up from 22,000 years ago onward, frozen subsoil and expanded alpine glaciers began to thaw and fall-winter snow covers melted in spring. Much of the discharge was routed to the Rhine and its downstream extension. Rapid warming and changes of vegetation, to open forest, began about 13,000 BP. By 9000 BP, Europe was fully forested. With globally shrinking ice-cover, ocean water levels rose and the English Channel and North Sea re-inundated. Meltwater, adding to the ocean and land subsidence, drowned the former coasts of Europe transgressionally.

Passage: Through combining the definition of electric current as the time rate of change of electric charge, a rule of vector multiplication called Lorentz's Law describes the force on a charge moving in a magnetic field. The connection between electricity and magnetism allows for the description of a unified electromagnetic force that acts on a charge. This force can be written as a sum of the electrostatic force (due to the electric field) and the magnetic force (due to the magnetic field). Fully stated, this is the law:

Passage: The Scottish Parliament is unable to legislate on such issues that are reserved to, and dealt with at, Westminster (and where Ministerial functions usually lie with UK Government ministers). These include abortion, broadcasting policy, civil service, common markets for UK goods and services, constitution, electricity, coal, oil, gas, nuclear energy, defence and national security, drug policy, employment, foreign policy and relations with Europe, most aspects of transport safety and regulation, National Lottery, protection of borders, social security and stability of UK's fiscal, economic and monetary system.

For cross-lingual scenarios, only annotated training data from the source language D^{Train}_S = {X^{Train}_S, y^{Train}_S} and test data from the target language D^{Test}_T = {X^{Test}_T, y^{Test}_T} are available, where S and T denote the source and target languages. Machine translation can be used to obtain training data for the target language D^{Train}_T.

Figure 2: Model architecture. The cross-attention block (GDKS) is implemented only in a certain layer; in other layers, vanilla transformer layers are applied.
Figure 3: The performance of previous methods and our method on cross-lingual MRC. X-MIXUP (a cross-attention based approach) improves the performance on target languages, but with a performance drop on the source language (en). Our approach addresses this issue with GDKS.

Figure 4: The attention distribution heat map of the query part. We show the average result of multi-head attention.

Figure 10: A Vietnamese (vi) example from the XQuAD dataset.

Figure 11: A Chinese (zh) example from the XQuAD dataset.

Table 1: Overall evaluation (F1/EM) over the MLQA dataset. * denotes the results of our re-implementation.

Table 4: Ablation study of our method on MLQA and XQuAD. w/o. GDKS refers to vanilla cross-attention being used; Rep-Norm is Representation Normalization; alignment_s and alignment_t refer to sentence-level and token-level alignment. † and ‡ have a performance drop of 1.1/1.0 and 1.3/1.0 on English.