Cross-Lingual Knowledge Distillation for Answer Sentence Selection in Low-Resource Languages

While impressive performance has been achieved on the task of Answer Sentence Selection (AS2) for English, the same does not hold for languages that lack large labeled datasets. In this work, we propose Cross-Lingual Knowledge Distillation (CLKD) from a strong English AS2 teacher as a method to train AS2 models for low-resource languages without the need for labeled data in the target language. To evaluate our method, we introduce 1) Xtr-WikiQA, a translation-based WikiQA dataset for 9 additional languages, and 2) TyDi-AS2, a multilingual AS2 dataset with over 70K questions spanning 8 typologically diverse languages. We conduct extensive experiments on Xtr-WikiQA and TyDi-AS2 with multiple teachers, diverse monolingual and multilingual pretrained language models (PLMs) as students, and both monolingual and multilingual training. The results demonstrate that CLKD either outperforms or rivals supervised fine-tuning with the same amount of labeled data, as well as a combination of machine translation and the teacher model. Our method can potentially enable stronger AS2 models for low-resource languages, while TyDi-AS2 can serve as the largest multilingual AS2 dataset for further studies in the research community.


Introduction
Answer Sentence Selection (AS2) is the task of ranking a given set of answer candidates according to their probability of correctly answering a given question. This is a core task for retrieval-based web Question Answering (QA) systems. Indeed, AS2 models applied to the sentences of documents relevant to a question, e.g., retrieved by a search engine, provide accurate answers.
Figure 1: Cross-Lingual Knowledge Distillation (CLKD) in two different scenarios: (Top) using an unlabeled English AS2 dataset for a target low-resource language lacking any data and (Bottom) using unlabeled original low-resource language AS2 data. CLKD enables student AS2 models to learn from English teacher AS2 models without human-annotated datasets.
While AS2 has been extensively studied for English (Wang and Nyberg, 2015; Chen et al., 2017; Tan et al., 2017; Tymoshenko and Moschitti, 2018; Nicosia and Moschitti, 2018; Garg et al., 2020; Tian et al., 2020; Matsubara et al., 2020; Laskar et al., 2020; Bonadiman and Moschitti, 2020; Soldaini and Moschitti, 2020; Lauriola and Moschitti, 2021; Krishnamurthy et al., 2021; Han et al., 2021; Zhang et al., 2021; Mrini et al., 2021; Di Liello et al., 2022; Matsubara et al., 2022), much less research has been devoted to other languages. This is despite the rapidly increasing importance of multilingual QA with the proliferation of conversational agents and voice assistants that use multilingual content from the Web to target locales across the world (Li et al., 2022). A major barrier to matching English-level performance in other languages is the lack of large labeled datasets. However, labeling AS2 datasets for every language would be prohibitively expensive, as even a single AS2 instance can contain hundreds of candidate answers per question (see Table 2). This necessitates methods that do not require labeled target language data. A simple approach is to translate questions to English and then use an English AS2 model (Vu and Moschitti, 2021; Asai et al., 2021). While this pipeline can be quite accurate (Li et al., 2022), the need for machine translation makes inference slow and inefficient. An alternative approach is to train AS2 models on target language translations of English datasets. However, training on translationese seems sub-optimal due to errors and artifacts introduced by machine translation. Moreover, models trained on English questions can be ill-suited for answering target language questions due to information asymmetry (Asai et al., 2021), i.e., questions asked in English are likely to differ from those in the target language due to cultural bias, e.g., they can refer to different entities.
In this work, we propose Cross-Lingual Knowledge Distillation (CLKD) as a method to use readily available and highly accurate English AS2 models to train AS2 models for low-resource languages lacking labeled data. CLKD can use English datasets to train AS2 models for languages lacking any data, and can further leverage unlabeled original target language data without the need for costly manual annotation.
Figure 1 illustrates a high-level description of our approach. CLKD works similarly to classic Knowledge Distillation (KD) (Hinton et al., 2015) in that a student model is trained to mimic a teacher. The main novelty of our approach is the fact that the teacher and student models operate in different languages, namely the source and target languages. Thus, the input question-answer pairs are translated into both the source and the target languages. Additionally, to allow the use of original target language data, which is typically unlabeled, we use only the soft labels obtained from the teacher even when gold labels are available, i.e., given an unlabeled input question and candidate answer pair (q, a), the student is trained using the Kullback-Leibler divergence loss between the probability scores of the teacher and the student when applied to (q, a).
Using our two new datasets, Xtr-WikiQA and TyDi-AS2, we perform extensive experiments with multiple teacher models, about 20 different student models of varying sizes, including both monolingual and multilingual PLMs, and both monolingual and multilingual training. Additionally, to evaluate the utility of our method both for languages lacking any data and for those with some unlabeled data, we experiment with using only English data (Xtr-WikiQA and English TyDi-AS2) and with using only unlabeled target language data (TyDi-AS2). We show that CLKD consistently rivals or outperforms even supervised fine-tuning with the same amount of gold-labeled data, demonstrating the benefit of soft labels obtained from a strong English AS2 teacher model. In particular, we show that CLKD using original language unlabeled data outperforms 1) fine-tuning with gold-labeled translationese and, 2) for larger students, even the MT+English AS2 model pipeline, demonstrating the importance of original target language data.
We expect that the ability of CLKD to train AS2 models without a costly annotation process will enable stronger AS2 performance for the world's many low-resource languages. To support further studies on AS2 tasks for such languages, we will make the datasets introduced in this work and our trained models publicly available.

Related Work
We briefly summarize the related studies.

KD for Model Compression
KD was originally proposed as a method for model compression that improves the performance of a weaker model to be trained (the student) by learning from a strong but cumbersome model (the teacher) (Hinton et al., 2015). With large pretrained language models based on the Transformer (Vaswani et al., 2017) becoming the new paradigm for natural language processing (NLP) tasks, KD has gained greater attention from the NLP community, with many studies on KD for Transformer-based models (Sanh et al., 2019; Jiao et al., 2020; Lu et al., 2020; Park et al., 2021a). For AS2 tasks, Matsubara et al. (2022) propose a multi-head student model (CERBERUS) to distill knowledge from an ensemble of multiple diverse teacher models, improving model accuracy without significantly increasing model complexity.

Learning from Teacher Models in Different Domains/Tasks

There are also a few related studies on KD in cross-lingual settings. To address the lack of Chinese sentiment corpora, Wan (2009) leverages machine translation (English-to-Chinese and Chinese-to-English) and studies a cross-lingual sentiment classification problem. Xu and Yang (2017) also work on sentiment analysis tasks and propose cross-language distillation with feature adaptation. Reimers and Gurevych (2020) propose a method to extend existing (English) sentence embedding models to new languages for multilingual student models. Karamanolakis et al. (2020) present a text classification training method that uses a small budget of word-level translations for the words most indicative of the target task, together with unlabeled documents in the target language. Li et al. (2022) propose a multi-stage KD to learn a cross-lingual document retriever from an English retriever, which is the most relevant work to ours. While similar in learning from an English model, our approach differs significantly from theirs. First, Li et al. (2022) train cross-lingual document retrievers (i.e., the query and document differ in language), whereas we focus on AS2 models taking as input a question and answer in the same language, with the teacher in English and the student in another language. Second, while their multi-stage KD method requires the student and teacher to share the embedding size, our single-stage KD method has no such restriction. Finally, they evaluate their method on XLM-RoBERTa (Conneau et al., 2020) only (as both student and teacher models), whereas we perform a much more comprehensive study spanning multiple teachers and approximately 20 different pretrained language models.
Knowledge Distillation for AS2

AS2 Task
We consider the task of Answer Sentence Selection, where given a question q and a set of answer sentence candidates, S = {s_1, . . ., s_n}, the goal is to select the sentence s* that best answers the question. Following prior work (Garg et al., 2020), we frame this as a ranking task where we assign a score to each sentence s_i for the question q and then select the sentence with the highest score. Formally, given a question-sentence pair (q, s), the AS2 model M produces a score M(q, s) measuring the likelihood of s being the correct answer to q. We then select the sentence with the highest score as the answer, i.e., s* = argmax_{s∈S} M(q, s).
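As a minimal illustration of this ranking formulation (not the paper's actual models), the selection step can be sketched as follows, with `score` standing in for the AS2 model M and `toy_score` a hypothetical word-overlap scorer:

```python
# Sketch of AS2 inference: score each (question, sentence) pair and
# return the argmax candidate. `score` is a stand-in for M(q, s).

def select_answer(question, sentences, score):
    """Return the candidate sentence with the highest model score,
    i.e., s* = argmax_{s in S} M(q, s)."""
    return max(sentences, key=lambda s: score(question, s))

# Hypothetical toy scorer: counts word overlap with the question.
def toy_score(q, s):
    return len(set(q.lower().split()) & set(s.lower().split()))

candidates = ["The sky is blue.", "Paris is the capital of France."]
best = select_answer("What is the capital of France?", candidates, toy_score)
```

In practice, `score` would be a fine-tuned Transformer cross-encoder producing a probability for the positive class.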

Knowledge Distillation
Knowledge Distillation (Hinton et al., 2015) is an effective method to transfer knowledge from a strong teacher model T to a student model S by training the student to mimic the teacher. Formally, given inputs {x_i}^N_{i=1}, the distillation loss is a weighted sum of the cross-entropy (L_CE) of the student w.r.t. the gold labels and the KL divergence (L_KL) between the teacher's and student's class probabilities:

L = α L_CE(σ(z^S_i), y_i) + (1 − α) L_KL(σ(z^T_i / τ), σ(z^S_i / τ)),   (1)

where z^T_i = T(x_i) and z^S_i = S(x_i) are the logits from the teacher and student models, respectively, σ denotes the softmax function, y_i indicates a gold label (human annotation), and α and τ ("temperature") are hyperparameters.
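As an illustration, the distillation loss can be sketched in pure Python for a single example (the function and variable names are our own; a real implementation would operate on batched tensors, e.g., in PyTorch):

```python
import math

def softmax(logits, tau=1.0):
    """Softmax with temperature tau."""
    exps = [math.exp(z / tau) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, gold, alpha=0.5, tau=3.0):
    """Weighted sum of the cross-entropy w.r.t. the gold label and the
    KL divergence between softened teacher and student distributions."""
    ce = -math.log(softmax(student_logits)[gold])            # L_CE term
    p_t = softmax(teacher_logits, tau)                       # softened teacher
    q_s = softmax(student_logits, tau)                       # softened student
    kl = sum(p * math.log(p / q) for p, q in zip(p_t, q_s))  # L_KL term
    return alpha * ce + (1 - alpha) * kl
```

Setting α = 0 drops the gold-label term and leaves only the soft-label KL term, which is the form CLKD uses.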

Cross-Lingual Knowledge Distillation
In order to train AS2 models for low-resource languages lacking labeled data, we propose Cross-Lingual Knowledge Distillation (CLKD) from accurate and readily available English AS2 models.
CLKD assumes the absence of gold labels for target languages and, in general, teaches the student model for a "target" language to mimic the teacher from a different "source" language (English in this study), as illustrated in Fig. 1. In other words, the CLKD loss is the second term of Eq. 1 (α = 0; no gold labels are used), with the student and teacher logits obtained by feeding them the same input in the target and source language, respectively.
Formally, given a teacher T_l for source language l and two parallel unlabeled datasets, D_l = {x^l_i}^N_{i=1} and D_l′ = {x^{l′}_i}^N_{i=1}, for the source and target languages l and l′, respectively, CLKD trains a student model S_l′ using the same loss as in the monolingual distillation case (Eq. 1), but with the teacher and student logits obtained as

z^T_i = T_l(x^l_i) and z^S_i = S_l′(x^{l′}_i).

For the AS2 task, the input is a question-sentence pair, i.e., x_i = (q_i, s_i). Additionally, as we distill knowledge from English AS2 models, the source and target languages are English and a low-resource language, respectively. Also, since parallel datasets are likely unavailable, they are obtained using automatic machine translation.
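The cross-lingual pairing can be sketched as follows; `teacher`, `student`, and `translate` are hypothetical stand-ins for the English teacher T_l, the target-language student S_l′, and an MT system, each model returning classification logits for a question-sentence pair:

```python
import math

def softmax(logits, tau=1.0):
    exps = [math.exp(z / tau) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def clkd_loss(question, sentence, teacher, student, translate, tau=3.0):
    """KL divergence between the English teacher's and the target-language
    student's softened probabilities on the same (translated) pair.
    This is Eq. 1 with alpha = 0: no gold labels are used."""
    # Teacher scores the English side; student scores the target-language side.
    q_en, s_en = translate(question), translate(sentence)
    p_t = softmax(teacher(q_en, s_en), tau)          # soft labels from teacher
    q_s = softmax(student(question, sentence), tau)  # student distribution
    return sum(p * math.log(p / q) for p, q in zip(p_t, q_s))
```

When the student has learned to match the teacher's (translated) judgments, the loss goes to zero.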
Depending on the low-resource language data available, CLKD can be applied in two different ways: (1) In the absence of any target-language data, CLKD can be applied using an English AS2 dataset (see Fig. 1, top). In this scenario, the teacher and student are fed the original English and translationese instances, respectively. While this can be applied to any language, errors and artifacts inevitably introduced by machine translation, as well as information asymmetry due to cultural bias with respect to the target language (Asai et al., 2021), will limit the student's performance. (2) CLKD allows overcoming this limitation by utilizing original target language unlabeled data when available. As shown in Fig. 1 (bottom), this involves feeding the original language and English-translated input to the student and teacher models, respectively.
Note that the success of CLKD, particularly with original data, relies on two practical assumptions: (i) two AS2 models for two different languages should produce similar probability scores when applied to inputs that are translations of each other, and (ii) the teacher working on automatically translated data is still accurate enough to transfer useful knowledge to the student.

New Datasets

In this section, we introduce two new AS2 datasets, Xtr-WikiQA and TyDi-AS2. The datasets are constructed from the WikiQA (Yang et al., 2015) and TyDi-QA (Clark et al., 2020a) datasets, respectively, and the intended use of our new datasets follows the Community Data License Agreement (CDLA) - Permissive (Version 2.0).

TyDi-AS2
In addition to our translationese dataset above, we need a large and accurate multilingual AS2 dataset to evaluate our method and compare against supervised baselines on original target language data. Due to the lack of such datasets, we introduce TyDi-AS2, a large multilingual AS2 benchmark derived from TyDi-QA (Clark et al., 2020a), a multilingual Machine Reading dataset. TyDi-AS2 is a collection of AS2 datasets for eight typologically diverse languages: Bengali (bn), English (en), Finnish (fi), Indonesian (id), Japanese (ja), Korean (ko), Russian (ru), and Swahili (sw). The dataset was constructed from the data for the primary task in TyDi-QA, where each instance is accompanied by a Wikipedia article.
Conversion TyDi-QA is a QA dataset spanning questions from 11 typologically diverse languages.
Each instance comprises a human-generated question, a single Wikipedia document as context, and one or more spans from the document containing the answer. To convert each instance to the AS2 format, we split the context document into sentences and use the answer spans to identify the correct answer sentences.
To split documents, we use different sentence tokenizers for the diverse languages and omit languages for which we could not find a suitable sentence tokenizer: 1) bltk for Bengali, 2) blingfire for Swahili, Indonesian, and Korean, 3) pySBD (Sadvilkar and Neumann, 2020) for English and Russian, 4) NLTK (Bird et al., 2009) for Finnish, and 5) Konoha for Japanese.
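The span-to-sentence conversion described above can be sketched as follows, assuming a language-appropriate `split_sentences` tokenizer and gold answer spans given as character offsets into the document (the names and details are illustrative, not the exact conversion script):

```python
# Sketch of converting an MRC instance (document + answer spans) into
# AS2 instances: each sentence becomes a candidate, labeled positive
# if it overlaps any gold answer span.

def to_as2_instances(document, answer_spans, split_sentences):
    """Return (sentence, label) pairs; label is 1 if the sentence
    overlaps any gold answer span (character offsets), else 0."""
    instances, offset = [], 0
    for sent in split_sentences(document):
        start = document.index(sent, offset)  # locate sentence in document
        end = start + len(sent)
        offset = end
        # Positive if any gold answer span overlaps [start, end).
        label = int(any(s < end and e > start for s, e in answer_spans))
        instances.append((sent, label))
    return instances
```

Locating each sentence with `document.index` keeps the character offsets aligned even when the tokenizer drops inter-sentence whitespace.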
Translation For CLKD experiments with original target language data, we use Amazon Translate to translate the non-English corpora of the TyDi-AS2 datasets into English. Furthermore, to conduct additional translationese experiments, we also translate the English TyDi-AS2 dataset into all the other TyDi-AS2 languages, similar to Vu and Moschitti (2021). We refer to this dataset as Xtr-TyDi-AS2.

Dataset Statistics
As the original TyDi-QA test set is not publicly available, we repurposed the dev set as the test set and used an 80-20 split of the original training set to create TyDi-AS2's training and dev sets. Table 2 shows statistics of TyDi-AS2.

Experimental Setup
For a rigorous assessment of the efficacy of CLKD, we design various experiments with different teachers, students, and training data.

AS2 Models
This section describes the Transformer (Vaswani et al., 2017) models we use as AS2 models. Table 3 lists all the Hugging Face Transformers (Wolf et al., 2020) pretrained language models used in this study. We note that the teacher models in Table 3 are fine-tuned on the original English corpus of the target datasets; thus, there are two ELECTRA-Large models separately fine-tuned to serve as teachers for Xtr-WikiQA and TyDi-AS2.

English Teacher Models
To ensure that our results are not specific to a particular teacher, we experiment with two English AS2 models as teachers in CLKD for Xtr-WikiQA: RoBERTa-Large (Liu et al., 2019) and ELECTRA-Large (Clark et al., 2020b), both trained by Matsubara et al. (2022) for WikiQA using TANDA, a state-of-the-art AS2 model training method (Garg et al., 2020). For TyDi-AS2, we use an ELECTRA-Large model fine-tuned with TANDA on the TyDi-AS2 English dataset instead of WikiQA as the teacher.

Student Models
We experiment with both monolingual and multilingual pretrained language models (PLMs) as students. Additionally, while we use monolingual training for both types of student PLMs, for multilingual students we also experiment with multilingual training using data for all the languages in the corresponding dataset.

Table 3: List of pretrained language models used in this study.

Monolingual Student Models
For experiments with Xtr-WikiQA, we use ELECTRA-Base (Clark et al., 2020b) pretrained on a Hindi corpus and BERT-Base (Devlin et al., 2019) pretrained on Arabic, German, Italian, Japanese, Dutch, and Portuguese corpora, respectively. We could not find suitable monolingual PLMs for Spanish and French. For TyDi-AS2, we use ELECTRA-Base (Clark et al., 2020b) pretrained on a Bengali corpus, mBERT (Devlin et al., 2019) fine-tuned on a Swahili corpus, and BERT-Base pretrained on Finnish, Indonesian, Japanese, Korean, and Russian corpora, respectively. Table 3 includes the pretrained monolingual models used in this study.

Training Languages
In addition to using pretrained monolingual and multilingual student models, we also experiment with mono- and multilingual training. For monolingual training, we train the model using a single language's training data; we refer to this setting as SINGLE. For the multilingual setting, which is only possible for multilingual models, we use data for all the languages in a particular dataset; we refer to this setting as ALL.

Methods
For each dataset, student model, and training language setting (SINGLE or ALL), we compare two approaches: direct finetuning using gold labels and CLKD using a teacher's soft labels, which we refer to as FINETUNE and CLKD, respectively. In particular, we use CLKD[E] and CLKD[R] to denote CLKD with ELECTRA-Large and RoBERTa-Large as the English teacher, respectively. Finally, for experiments using original language data (i.e., with TyDi-AS2), we additionally compare with the MT+English AS2 pipeline, which directly feeds English translations of the test instances to the English teacher. This is considered a strong baseline by Asai et al. (2021) and Li et al. (2022). Note that a potential baseline could be to translate all the English data the teacher was trained on into the target language. However, this is not feasible, as (i) the data may not be available, and (ii) even if it were, it would be prohibitively expensive to translate and retrain for every language. Moreover, it would still suffer from the shortcomings of training on translationese, such as MT artifacts and cultural bias, as described in § 3.3.
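The MT+English AS2 pipeline amounts to translating the test instance to English and ranking with the English teacher. A hedged sketch, with `translate` and `teacher` as hypothetical stand-ins for the MT system and the English AS2 scorer:

```python
# Sketch of the MT + English-teacher baseline: translate the target-language
# question and candidates to English, then rank with the English teacher.

def mt_teacher_pipeline(question, sentences, teacher, translate):
    """Return the candidate the English teacher scores highest after
    translating the whole instance to English."""
    q_en = translate(question)
    return max(sentences, key=lambda s: teacher(q_en, translate(s)))
```

Note that this baseline pays the latency cost of machine translation at inference time, which is exactly what CLKD avoids by training a target-language student.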
Table 4: Xtr-WikiQA: Averaged test results (P@1) of models trained on datasets of 1) a single target language and 2) all the target languages. We highlight better results (FINETUNE vs. CLKD) and additionally use a bold font for the best student model for each of the nine target languages. Our English teacher models, ELECTRA-Large (E) and RoBERTa-Large (R), achieved 87.7% and 91.8% P@1, respectively.

Training Details
For every model and training configuration, we run three training sessions with different random seeds and present averaged results. Our implementation is based on PyTorch (Paszke et al., 2019) and Hugging Face Transformers (Wolf et al., 2020) with Python 3.7.
Unless specified otherwise, we use the same hyperparameters for both the supervised baselines and CLKD. To train AS2 models, we use AdamW (Loshchilov and Hutter, 2019) with an initial learning rate of 10^-6 and a linear scheduler with a warm-up for the first 2.5% of the training iterations. The number of training iterations (model updates) is set to 20,000 and 40,000 for Xtr-WikiQA and TyDi-AS2, respectively. For better training convergence with multilingual training, we increase the number of training iterations to 150,000. We use a single GPU to train each of the AS2 models.
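The learning-rate schedule described above can be sketched as a plain function of the iteration count (an illustration of the schedule's shape, not our training code; the default values mirror the Xtr-WikiQA setting):

```python
# Linear warm-up for the first 2.5% of iterations, then linear decay to zero.

def linear_schedule_lr(step, total_steps=20_000, base_lr=1e-6,
                       warmup_frac=0.025):
    """Learning rate at a given iteration under warm-up + linear decay."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return base_lr * step / warmup_steps          # linear warm-up
    # Linear decay from base_lr down to 0 over the remaining iterations.
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)
```

With 20,000 total iterations, the warm-up covers the first 500 updates; the peak rate of 10^-6 is reached at the end of warm-up and decays to zero at the final update.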
In this study, we select τ ∈ {1, 3, 5, 7} based on the results for the development split (Tables 7-9 in Appendix B show the selected temperatures) and report the averaged test results of the selected AS2 model individually trained with three different random seeds. To run the extensive number of experiments, we use Amazon EC2 instances of type p2.8xlarge, p3.8xlarge, and p3dn.24xlarge.

Evaluations
We now describe the results of our experiments. In § 6.1, we consider the setting in which no target language data is available for training and we use translations of the English data instead. In § 6.2, we use the TyDi-AS2 dataset to experiment with the setting in which some original target language unlabeled data is available.

Translationese
Tables 4 and 5 show the results for all the experiments with the Xtr-WikiQA and Xtr-TyDi-AS2 translationese datasets, respectively. Note that the experiments with the Xtr-TyDi-AS2 dataset use the test split of TyDi-AS2 for evaluation.

Table 5: Xtr-TyDi-AS2: Averaged test results (P@1) of models trained on translationese data for 1) a single target language and 2) all the target languages. We highlight better results (FINETUNE vs. CLKD) and additionally use a bold font for the best student model for each of the seven target languages.

It is clear that the performance improves with increasing student model size and when going from monolingual to multilingual training in all languages, even though the training datasets for the diverse languages are translationese from the English corpus. Nevertheless, CLKD consistently outperforms supervised finetuning with gold labels for all the target languages in both datasets, and we confirm that CLKD significantly improves over FINETUNE in nearly all the considered configurations of teacher, student, and both monolingual and multilingual training. This holds even for the Xtr-TyDi-AS2 dataset, which is nearly eight times larger than Xtr-WikiQA, making it even more challenging to reach the performance of the supervised baseline (FINETUNE) with an unsupervised method. Finally, the performance improvement is greater for smaller models and for monolingual training. This is expected, as the teachers have also been trained on the source English dataset, and their performance can be seen as an upper bound for the students'. The results demonstrate the benefits of soft labels from a strong English AS2 teacher when training AS2 models with no original language data.

Original Language Data
Table 6 shows the results of experiments with the original language datasets in TyDi-AS2. We observe similar trends as with translationese: the performance improves with bigger models and multilingual training, and CLKD clearly rivals, and regularly outperforms, supervised finetuning with gold-labeled target language data. These results are especially surprising for a method that does not require any manual annotation, as TyDi-AS2 has diverse questions and documents in native languages and orders of magnitude more data than Xtr-WikiQA.
Unlike the translationese setting, however, the English teacher's performance is no longer an upper bound, as the student models are trained on original language data. In fact, the XLM-R-Large model, with both supervised finetuning and CLKD using data from all the languages (ALL), consistently outperforms the MT+TEACHER pipeline for all the considered target languages. The improved results in Table 6 vs. Table 5 also confirm the importance of training on original target language data as opposed to translationese.
Since the cost of manual annotation often makes supervised finetuning infeasible for AS2 tasks, these results demonstrate the advantages of our proposed approach in being able to leverage strong English AS2 models and original target language data. While supervision from a strong English teacher precludes the need for costly manual annotation, when used with original target language data, the teacher absorbs the errors and artifacts introduced by machine translation, allowing the student to be trained directly on native text. Moreover, the soft-label supervision from the teacher appears even more useful than gold labels for monolingual training and/or smaller student models.

Table 6: TyDi-AS2: Averaged test results (P@1) of models trained on original language data for 1) a single target language and 2) all the target languages. We highlight better results (FINETUNE vs. CLKD) and additionally use a bold font for the best student model for each of the seven target languages.

Conclusion
In this work, we proposed Cross-Lingual Knowledge Distillation (CLKD) to leverage strong English AS2 models to train accurate models for low-resource languages without the need for costly manual annotation. Furthermore, we introduced 1) Xtr-WikiQA, a machine-translated WikiQA dataset in 9 additional languages, and 2) TyDi-AS2, a new multilingual AS2 benchmark spanning 8 languages. We conducted comprehensive experiments involving various teachers, students, and training settings to demonstrate the potential of CLKD. Our results show the benefits of using soft supervision from a strong English teacher to train student models for low-resource languages, and suggest the importance of original target language data compared to translationese, potentially due to cultural biases and noise introduced by machine translation.
Despite requiring no manual annotation, CLKD leverages both strong English teachers and original target language data, and outperforms or rivals strong baselines such as supervised finetuning with the same amount of data and direct use of a strong teacher model on English translations.
The results also suggest that CLKD has the potential to greatly reduce the cost of training strong AS2 models for languages lacking labeled training data. To encourage studies on AS2 for such languages, we publish Xtr-WikiQA and TyDi-AS2.

Limitations
The proposed CLKD is technically applicable to other NLP tasks, but we discuss its effectiveness only for question answering systems, specifically for answer sentence selection (AS2) tasks. In this study, we focus on AS2 because the research community has not widely discussed or proposed multilingual AS2 tasks/datasets. Using only English teacher models is another limitation of this study; however, the choice of teacher models in the proposed CLKD is not limited to English models. It would be interesting to examine the generalizability of the proposed CLKD beyond AS2 tasks, but we note that such a study would require much more space to conduct experiments for other NLP tasks as comprehensive as those we performed for AS2 tasks.

A Dataset Validation
Since sentence tokenization and identifying answer sentences can introduce errors, we conducted a manual validation of the TyDi-AS2 datasets. For each language, we randomly selected 50 instances and verified the accuracy of the answer sentences through manual inspection. Our findings revealed that the answer sentences were accurate in 98% of the cases.

B Temperatures for CLKD
Tables 7-9 present the best hyperparameter value of the temperature τ in CLKD for each configuration for the Xtr-WikiQA, Xtr-TyDi-AS2, and TyDi-AS2 datasets. Following Matsubara et al. (2022), we select the best temperature value in terms of mean average precision on the development split. These hyperparameter values are used to obtain the student models presented in Tables 4-6.

Table 8: Xtr-TyDi-AS2 (translationese): Best temperature τ for CLKD found in the search space (see § 5.4) with respect to the dev split and used to report test results in Table 5.

Table 1: Statistics of Xtr-WikiQA for each language.

Table 7: Xtr-WikiQA: Best temperature τ for CLKD found in the search space (see § 5.4) with respect to the dev split and used to report test results in Table 4.

Table 9: TyDi-AS2: Best temperature τ for CLKD found in the search space (see § 5.4) with respect to the dev split and used to report test results in Table 6.