Synthetic Data Augmentation for Zero-Shot Cross-Lingual Question Answering

Coupled with the availability of large-scale datasets, deep learning architectures have enabled rapid progress on the Question Answering task. However, most of those datasets are in English, and the performance of state-of-the-art multilingual models drops significantly when they are evaluated on non-English data. Given the high cost of data collection, it is not realistic to obtain annotated data for every language one wishes to support. We propose a method to improve cross-lingual Question Answering performance without requiring additional annotated data, leveraging Question Generation models to produce synthetic samples in a cross-lingual fashion. We show that the proposed method significantly outperforms baselines trained on English data only. We report a new state-of-the-art on four datasets: MLQA, XQuAD, SQuAD-it and PIAF (fr).


Introduction
Question Answering is a fast-growing research field, aiming to improve the capabilities of machines to read and understand documents. Significant progress has recently been enabled by the use of large pre-trained language models (Devlin et al., 2019; Raffel et al., 2020), which reach human-level performance on several publicly available benchmarks, such as SQuAD (Rajpurkar et al., 2016) and NewsQA (Trischler et al., 2017).
Given that the majority of large-scale Question Answering (QA) datasets are in English (Hermann et al., 2015; Rajpurkar et al., 2016; Choi et al., 2018), the development of QA systems targeting other languages is currently addressed via two cross-lingual QA datasets: XQuAD (Artetxe et al., 2020) and MLQA (Lewis et al., 2020a), covering respectively 10 and 7 languages. Due to the cost of annotation, both are limited to an evaluation set, comparable to the validation set of the original SQuAD (see more details in Section 3.3). In both datasets, each paragraph is paired with questions in various languages, allowing models to be evaluated in a cross-lingual experimental scenario: the input context and the question can be in two different languages. This scenario has important practical applications, such as querying a set of documents written in various languages.

*: equal contribution. The work of Arij Riabi was partly carried out while she was working at reciTAL.
Performing this cross-lingual task is complex and remains challenging for current models when only English training data is assumed: transfer results are shown to rank behind training-language performance (Artetxe et al., 2020; Lewis et al., 2020a). In other words, multilingual models fine-tuned only on English data perform significantly better on English than on other languages. Besides the almost simultaneous work of Shakeri et al. (2020), very few alternatives to such a simple zero-shot transfer method have been proposed so far.
In this paper, we propose to generate synthetic data in a cross-lingual fashion, borrowing the idea from monolingual QA research efforts (Duan et al., 2017). On English corpora, generating synthetic questions has been shown to significantly improve the performance of QA models (Du et al., 2017; Golub et al., 2017; Du and Cardie, 2018; Alberti et al., 2019). However, adapting this technique to cross-lingual QA is not straightforward: cross-lingual text generation is a challenging task per se which has not yet been extensively explored, in particular when no multilingual training data is available.
We explore two Question Generation scenarios: (i) requiring only SQuAD data; and (ii) using a translation tool to obtain translated versions of SQuAD. As expected, the method leveraging a translator performs best. Trained on such synthetic data, our best model obtains significant improvements on XQuAD and MLQA over the state-of-the-art, for both Exact Match and F1 scores. In addition, we evaluate the QA models on languages not seen during training (not even in the synthetic data), using SQuAD-it (for Italian), PIAF (for French), and KorQuAD (for Korean), reporting a new state-of-the-art for Italian and French, and observing significant improvements on Korean compared to zero-shot transfer without augmentation. This indicates that the proposed method captures better multilingual representations beyond the training languages. Our method paves the way toward multilingual QA domain adaptation, especially for under-resourced languages.
Our contributions can be summarized as follows: • We present a data augmentation approach for Cross-Lingual Question Answering based on synthetic Question Generation; • We report extensive experiments showing significant improvements on two multilingual evaluation datasets (XQuAD and MLQA); • We additionally evaluate the proposed methodology on languages unseen during training, thus showing the potential benefits for QA on low-resource languages.

Related Work
Question Answering (QA) QA is the task of finding the answer to a question given a context. The interest in Question Answering goes back a long way: in a 1965 survey, Simmons (1965) reported fifteen implemented English-language question-answering systems. More recently, with the rise of large-scale datasets (Hermann et al., 2015) and large pre-trained models (Devlin et al., 2019), performance has drastically increased, approaching human level on standard benchmarks (see for instance the SQuAD leaderboard). However, all these works focus on English. Another popular research direction is the development of multilingual QA models. For this purpose, the first step has been to provide the community with multilingual evaluation sets: Artetxe et al. (2020) and Lewis et al. (2020a) concurrently proposed two different evaluation sets, comparable to the SQuAD development set. Both reach the same conclusion: due to the lack of non-English training data, models do not achieve the same performance in non-English languages as they do in English. To the best of our knowledge, no method has been proposed to fill this gap.
Question Generation (QG) QG can be seen as the dual task of QA: the input is composed of the answer and the paragraph containing it, and the model is trained to generate the question. Proposed by Rus et al. (2010), the task has benefited from the development of new QA datasets (Zhou et al., 2017; Scialom et al., 2019). As with QA, significant performance improvements have been obtained using pre-trained language models (Dong et al., 2019). Still, due to the lack of multilingual datasets, most previous works have been limited to monolingual text generation. We note the exceptions of Kumar et al. (2019) and Chi et al. (2020), who resorted to multilingual pre-training before fine-tuning on monolingual downstream NLG tasks. However, the quality of the generated questions is still found to be inferior to that of the corresponding English ones.
Question Generation for Question Answering Data augmentation via synthetic data generation is a well-known technique to improve models' accuracy and generalisation. It has found successful application in several areas, such as time series analysis (Forestier et al., 2017) and computer vision (Buslaev et al., 2020). In the context of QA, generating synthetic questions to complete a dataset has been shown to improve QA performance (Duan et al., 2017; Alberti et al., 2019). So far, all these works have focused on English QA, given the difficulty of generating questions in other languages without available data. This lack of data, and the difficulty of obtaining it, constitutes the main motivation of our work and justifies exploring cost-effective approaches such as data augmentation via question generation.
In a very recent work, almost simultaneous to our previously submitted version, Shakeri et al. (2020) address multilingual QA with a similar approach. However, we argue that their experimental protocol does not fully answer the research question. We detail the differences in our discussion, Section 5.3.

English Training Data
SQuAD en The original SQuAD (Rajpurkar et al., 2016), which we refer to as SQuAD en for clarity in this paper. It is one of the first, and among the most popular, large-scale QA datasets. It contains about 100K question/paragraph/answer triplets in English, annotated via Mechanical Turk.
QG datasets Any QA dataset can be turned into a QG dataset by switching the generation targets from the answers to the questions. In this paper, we use the qg subscript to specify when a dataset is used for QG (e.g. SQuAD en,qg indicates the English SQuAD data in QG format).
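As an illustration, reversing a QA triplet into a QG sample can be sketched as below; the field names and the <SEP> serialization are illustrative assumptions, not the paper's exact format:

```python
# Sketch of "reversing" a QA dataset into a QG one: the answer and its
# paragraph become the model input, the question becomes the generation
# target. The serialization below (dict keys, the <SEP> token) is a
# hypothetical choice for illustration.
def to_qg_example(qa_example):
    return {
        "input": f"{qa_example['answer']} <SEP> {qa_example['context']}",
        "target": qa_example["question"],
    }

squad_qa = {
    "context": "Denver won Super Bowl 50.",
    "question": "Which team won Super Bowl 50?",
    "answer": "Denver",
}
squad_qg = to_qg_example(squad_qa)  # a SQuAD en,qg-style sample
```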

Synthetic Training Sets
SQuAD trans is a machine-translated version of the SQuAD train set into the seven languages of MLQA, released by its authors together with their paper.
WikiScrap We collected 500 Wikipedia articles for each of the languages present in MLQA. They are not paired with any question or answer; we use them as contexts to generate synthetic multilingual questions, as detailed in Section 4.2. Following the SQuAD en protocol, we used project Nayuki's code to parse the top 10K Wikipedia pages according to the PageRank algorithm (Page et al., 1999). We then filtered out paragraphs whose character length falls outside the [500, 1500] interval. Articles with fewer than 5 paragraphs are discarded, since they tend to be less developed, of lower quality, or simply redirection pages. From the filtered articles, we randomly selected 500 per language.
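The filtering protocol above can be sketched as follows, assuming each article is already parsed into a list of paragraph strings (the dump parsing itself, done with project Nayuki's code in the paper, is omitted here):

```python
import random

# Minimal sketch of the WikiScrap filtering, under the assumption that
# `articles` is a list of articles, each a list of paragraph strings.
def filter_articles(articles, min_len=500, max_len=1500,
                    min_paragraphs=5, sample_size=500, seed=0):
    kept = []
    for paragraphs in articles:
        # Keep only paragraphs within the [500, 1500] character interval.
        ps = [p for p in paragraphs if min_len <= len(p) <= max_len]
        # Discard articles with fewer than 5 surviving paragraphs:
        # they tend to be stubs, low quality, or redirection pages.
        if len(ps) >= min_paragraphs:
            kept.append(ps)
    # Randomly select up to `sample_size` articles per language.
    random.Random(seed).shuffle(kept)
    return kept[:sample_size]
```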

Multilingual Evaluation Sets
XQuAD (Artetxe et al., 2020) is a human translation of the SQuAD en development set in 10 languages (Arabic, Chinese, German, Greek, Hindi, Russian, Spanish, Thai, Turkish, and Vietnamese), providing 1k QA pairs for each language.
MLQA (Lewis et al., 2020a) is an evaluation dataset in 7 languages (English, Arabic, Chinese, German, Hindi, Spanish, and Vietnamese). The dataset is built from aligned Wikipedia sentences across at least two languages (full alignment between all languages being impossible), with the goal of providing natural rather than translated paragraphs. The QA pairs are manually annotated on the English sentences and then human-translated for the aligned sentences. The dataset contains about 46k aligned QA pairs in total.
Language-specific benchmarks In addition to the two aforementioned multilingual evaluation corpora, we benchmark our models on three language-specific datasets for French, Italian and Korean, as detailed below. We choose these datasets since none of these languages are present in XQuAD or MLQA; hence, they allow us to evaluate our models in a scenario where the target language is not available during training, even in the synthetic questions.

Models
Recent works (Raffel et al., 2020; Lewis et al., 2019) have shown that classification tasks can be framed as text-to-text problems, achieving state-of-the-art results on established benchmarks such as GLUE (Wang et al., 2018). Accordingly, we employ the same architecture for both the Question Answering and Question Generation tasks. This also allows fairer comparisons for our purposes, by removing differences between QA and QG architectures and their potential impact on the results. In particular, we use a distilled version of XLM-R (see Implementation details).

Baselines
QA No-synth Following previous works, we fine-tuned the multilingual models on SQuAD en and consider them as our baselines.
English as Pivot Leveraging translation models, we consider a second baseline, which uses English as a pivot. First, both the question in language L q and the paragraph in language L p are translated into English. We then invoke the baseline model described above, QA No-synth, to predict the answer. Finally, the predicted answer is translated back into the target language L p. We used the Google Translate API.
QA +SQuAD-trans The translated data SQuAD trans are used as additional training data, alongside SQuAD en, to train the QA model.
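Schematically, the pivot baseline chains three steps; `translate` and `qa_model` below are hypothetical stand-ins for the MT service (the paper used the Google Translate API) and the QA No-synth model:

```python
# Sketch of the English-as-pivot baseline: translate the question and
# context into English, answer with the English-only QA model, then
# translate the predicted answer back into the target language.
def answer_with_pivot(question, context, lang, qa_model, translate):
    q_en = translate(question, src=lang, tgt="en")
    c_en = translate(context, src=lang, tgt="en")
    answer_en = qa_model(q_en, c_en)
    # Back-translation step: this is where answer wording can drift,
    # penalizing EM/F1 even for semantically correct answers.
    return translate(answer_en, src="en", tgt=lang)
```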

Question Generation Data Augmentation
In this work we consider data augmentation via the generation of synthetic questions, with the aim of improving QA performance. Different training schemes are possible for the question generator, resulting in synthetic data of different quality. Before this work, its impact on the final QA system remained unexplored in a multilingual context.
For all the following experiments, only the synthetic data changes. Given a specific set of synthetic data, we always follow the same two-stage protocol, similar to Alberti et al. (2019): we first train the QA model on the synthetic QA data, then on SQuAD en. We also tried training the QA model in one stage, with all the synthetic and human data shuffled together, but observed no improvement over the baseline.
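The two-stage protocol can be sketched as follows, with a hypothetical `fine_tune` stub standing in for the actual training loop:

```python
# Sketch of the two-stage protocol: synthetic data first, human data
# last. `fine_tune` is a stand-in; here it only records the order in
# which the datasets are seen.
def fine_tune(model, dataset):
    model["schedule"].append(dataset)
    return model

def two_stage_training(model, synthetic_data):
    model = fine_tune(model, synthetic_data)  # stage 1: synthetic QA data
    model = fine_tune(model, "SQuAD_en")      # stage 2: human-annotated data
    return model
```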
We explored two different synthetic generation modes:
Synth The QG model is trained on SQuAD en,qg (i.e., on English data only), and the synthetic data are generated on WikiScrap. Under this setup, the only annotated samples the model has access to are those from SQuAD en.
Synth+trans The QG model is trained on SQuAD trans,qg in addition to SQuAD en,qg. The questions can thus be in a different language than the context. Hence, the model needs an indication of the language in which it is expected to generate the question. To control the target language, we use a specific prompt per language, defining a special token <LANG> corresponding to the desired target language. Thus, the input is structured as <LANG> <SEP> Answer <SEP> Context, where <LANG> indicates to the model in what language the question should be generated, and <SEP> is a special token acting as a separator. These attributes offer flexibility on the target language. Similar techniques are used in the literature to control the style of the output (Keskar et al., 2019; Scialom et al., 2020; Chi et al., 2020).
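A minimal sketch of this input construction (the exact spelling of the special tokens is an assumption):

```python
# Building the <LANG> <SEP> Answer <SEP> Context input described above.
# The token spellings are illustrative; in practice they are special
# tokens added to the model's vocabulary.
LANG_TOKENS = {"en": "<EN>", "es": "<ES>", "de": "<DE>"}

def qg_input(target_lang, answer, context):
    # <LANG> tells the model what language to generate the question in;
    # <SEP> separates the answer from its context paragraph.
    return f"{LANG_TOKENS[target_lang]} <SEP> {answer} <SEP> {context}"

qg_input("es", "Broncos", "Peyton Manning se convirtió en el primer ...")
```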

Implementation details
For all our experiments we use Multilingual MiniLM v1 (MiniLM-m) (Wang et al., 2020), a 12-layer architecture with hidden size 384, distilled from XLM-R Base (Conneau et al., 2020). With only 66M parameters, it is an order of magnitude smaller than state-of-the-art architectures such as BERT-large or XLM-large. We used the official Microsoft implementation. For all the experiments, both QG and QA, we trained the model for 5 epochs, using the default hyperparameters. We used a single NVIDIA GTX 2080 Ti with 11GB of RAM; training times amount to circa 4 hours for Question Generation and 2 hours for Question Answering. To evaluate our models, we used the official MLQA evaluation scripts. For reproducibility purposes, we make our code available.

Question Generation
We report examples of generated questions in Table 1.
Controlling the Target Language In the context of multilingual text generation, controlling the target language is not trivial. When a QA model is trained only on English data, at inference, given a non-English paragraph, it predicts the answer in the input language, as one would expect, since it is an extractive process. Ideally, we would like to observe the same behavior for a Question Generation model trained only on English data (such as Synth), leveraging the multilingual pre-training. But unlike QA, QG is a language generation task, and multilingual generation is much more challenging, as the model's decoding ability plays a major role. When a QG model is fine-tuned only on English data (i.e. SQuAD en), its control of the target language suffers from catastrophic forgetting: the input language does not propagate to the generated text. While still relevant to the context, the synthetic questions are generated in English: for instance, in Table 1 we observe that the QG synth model outputs English questions for the paragraphs in Chinese and Spanish. The same phenomenon was reported by Chi et al. (2020).

Table 1: Examples of questions generated by the different models on an XQuAD paragraph in different languages. For QG synth+trans, we report the outputs given two target languages: the one of the context, and English.

Paragraph (EN): Peyton Manning became the first quarterback ever to lead two different teams to multiple Super Bowls. He is also the oldest quarterback ever to play in a Super Bowl at age 39. The past record was held by John Elway, who led the Broncos to victory in Super Bowl XXXIII at age 38 and is currently Denver's Executive Vice President of Football Operations and General Manager.
Answer: Broncos
QG synth: What team did John Elway lead to victory at age 38?
QG synth+trans (target language = en): What team did John Elway lead to win in the Super Bowl?

Paragraph (ES): Peyton Manning se convirtió en el primer mariscal de campo de la historia en llevar a dos equipos diferentes a participar en múltiples Super Bowls. Ademas, es con 39 años, el mariscal de campo más longevo de la historia en jugar ese partido. El récord anterior estaba en manos de John Elway -mánager general y actual vicepresidente ejecutivo para operaciones futbolísticas de Denver- que condujo a los Broncos a la victoria en la Super Bowl XXXIII a los 38 años de edad.
Answer: Broncos
QG synth: Where did Peyton Manning condujo?
QG synth+trans (target language = es): Qué equipo ganó el récord anterior? (Which team won the previous record?)
QG synth+trans (target language = en): What team did Menning win in the Super Bowl?

Paragraph (ZH): (the same paragraph, in Chinese)
Answer: Broncos
QG synth: What is the name for the name that the name is used?
QG synth+trans (target language = zh): 约翰·埃尔维在13岁时带领哪支球队赢得第33届超级碗? (Which team did John Elvey lead to win the 33rd Super Bowl at the age of 13?)
QG synth+trans (target language = en): What team won the 33th Super Bowl?

Cross-Lingual Training
To overcome the aforementioned limitation on target-language controllability (i.e. to enable generation in languages other than English), multilingual data is needed. We can leverage the translated versions of the dataset to add the required non-English examples. As detailed in Section 4.2, we simply use a specific prompt corresponding to the target language (with N different prompts for the N languages present in the dataset). In Table 1, we show how QG synth+trans can generate questions in the same language as the input. These synthetic questions are much more relevant, coherent and fluent than those produced by QG synth: for the Spanish paragraph, the question is well formed and focused on the input answer; the Chinese question (see the bottom rows of Table 1 for QG synth+trans) is perfectly written.
In Table 2 we report the BLEU-4 scores for QG synth+trans, grouped by the language of the question (rows) and of the context (columns):

Table 2: BLEU-4 scores of QG synth+trans by question language (q) and context language (c).

q/c   en    es    de    ar    hi    vi    zh
en    14.5  8.9   7.2   5.9   6.5   8.4   6.0
es    9.0   10.   6.6   4.2   5.9   6.3   4.6
de    6.2   4.8   6.3   3.1   3.7   5.0   3.2
ar    2.8   2.2   2.4   3.3   2.0   2.3   2.1
hi    7.9   6.7   6.6   5.8   8.3   6.6   5.2
vi    9.1   7.3   7.2   6.0   6.5   12.3  6.1
zh    9.2   8.0   7.8   6.1   7.2   8.0   15.0

As expected, the score is maximized on the diagonal (same language for the context and the question). Still, most of these scores are lower for non-English languages. It is interesting to note that BLEU-4 correlates with the QA scores (0.51 Pearson coefficient). The reasons are twofold: 1) QA and QG share the same language model, which might struggle with the same languages; 2) the better the QG, the better the synthetic data, and therefore the better the QA performance. We discuss further in Section 5.3 how this impacts the QA performance.
In addition to BLEU, we also report in the supplementary material the QA F1 scores obtained by different QA models when applied to the generated questions. Yet, we warn the reader that these results should be taken with caution: evaluating NLG is known to be an open research problem, and BLEU suffers from important limitations (Novikova et al., 2017), which might be accentuated in a multilingual context (Lee et al., 2020). For this reason, we conducted a manual qualitative analysis on a small number of samples. Note that the annotators need a professional level in the language of the generated question to evaluate its fluency, and need to be bilingual to evaluate its relevance w.r.t. the input context in our cross-lingual scenario. This makes conducting a large-scale evaluation a significant challenge.
So far, our results (see the end of the Supplementary Material) for Arabic and German show an overall good quality of the questions: only one question for Arabic genuinely missed the point, while for German there were two questionable lexical choices that invalidate the question (out of 10 samples for both languages so far). This indicates that Arabic questions could actually be better than their low BLEU score suggests. Arabic has a very different morphological structure that could explain such a low BLEU (Bouamor et al., 2014). This emphasizes the limitations of current automatic metrics in a multilingual context.

Question Answering
We report the main results of our experiments on XQuAD and MLQA in Table 3. The scores correspond to the average over all the different possible combinations of languages (de-de, de-ar, etc.).
English as Pivot Using English as a pivot does not lead to good results. This may be due to the evaluation metrics, which are based on n-gram similarity. For extractive QA, the F1 and EM metrics measure the overlap between the predicted answer and the ground truth. Therefore, meaningful answers worded differently are penalized, a situation likely to occur because of the back-translation mechanism. This makes automatic evaluation challenging for this setup, as the metrics suffer from difficulties similar to those observed for text generation (Sulem et al., 2018). As an additional downside, this model requires multiple translations at inference time. For these reasons, we decided not to explore this approach further.
Synthetic without translation (+synth) Compared to the MiniLM baseline, we observe a small performance increase for MiniLM +synth (Exact Match increases from 29.5 to 33.1 on XQuAD and from 26.0 to 27.5 on MLQA).
During the self-supervised pre-training stage, the model was exposed to multilingual inputs. Yet, for a given input, the language was always consistent, so the model was never exposed to a cross-lingual scenario. The synthetic inputs are composed of questions in English (see examples in Table 1), while the contexts can be in any language. Therefore, the QA model is exposed for the first time to a cross-lingual scenario. We hypothesise that such a cross-lingual ability is not innate for a default multilingual model: exposing the model to this scenario allows it to develop this ability and contributes to improving its performance.
Synthetic with translation (+synth-trans) For MiniLM +synth-trans, we obtain a much larger improvement over the baseline MiniLM than with MiniLM +synth, on both MLQA and XQuAD. It also outperforms MiniLM +SQuAD-trans, indicating the benefit of our proposed approach. This supports the intuition developed in the previous paragraph: independently of the multilingual capacity of the model, a cross-lingual ability is developed when the two input components are not exclusively written in the same language. In Section 5.3, we discuss this phenomenon in more depth.

Discussion
Cross-Lingual Generalisation To explore the models' effectiveness in dealing with cross-lingual inputs, we report in Figure 1 the performance of our MiniLM +synth-trans setup, varying the number of samples and the languages present in the synthetic data. The abscissa x corresponds to the progressively increasing number of synthetic samples used; x = 0 corresponds to the MiniLM +trans baseline, where the model has access only to the original English data from SQuAD en. We explore two sampling strategies for the synthetic examples: (i) All Languages corresponds to sampling the examples from any of the different languages; (ii) conversely, for Not All Languages, we progressively add the different languages: for x = 50K, all the 50K synthetic samples have a unique input language, L1; for x = 100K, the synthetic data are from either L1 or an additional language L2; finally, for x = 250K, all MLQA languages are present. In Figure 1, we observe that the performance for All Languages increases sharply at the beginning, then remains mostly stable. Conversely, we note a gradual improvement for Not All Languages as languages are added. However, it appears that even with only one language pair present, the model is able to develop a cross-lingual ability that brings benefits to other languages: in Figure 2, we can see that most of the improvement happens given only one cross-lingual language pair (i.e. English and Spanish).
Unseen Languages To measure the benefit of our approach on unseen languages (i.e. not present in the synthetic data from MLQA/XQuAD), we test our models on three QA evaluation sets: PIAF (fr), KorQuAD and SQuAD-it (see Section 3.3). The results are consistent with the previous experiments on MLQA and XQuAD. Our MiniLM +synth-trans model outperforms its baseline by more than 4 Exact Match points, while XLM-R +synth-trans obtains a new state-of-the-art. Notably, our multilingual XLM-R +synth-trans outperforms CamemBERT on PIAF, even though the latter is a pure monolingual, in-domain language model.
On the correlation between BLEU-4 and QA scores To measure the impact of the quality of the generated questions on the QA performance, we computed the Pearson correlation between the BLEU-4 and the QA scores. The coefficient is equal to 0.65 (p < .001). When we compute the correlations grouping the samples by question language (i.e. the rows in Table 2), we obtain: en 0.94; es 0.84; de 0.46; ar 0.36; hi 0.33; vi 0.73; zh 0.92. We observe stronger correlation for languages with higher BLEU scores (i.e. en and zh), and lower for Arabic, which had the lowest BLEU, indicating an impact on the final QA score commensurate with the quality of the synthetic questions.
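For reference, the Pearson coefficient used here can be computed as follows (a plain stdlib implementation, not the paper's evaluation script):

```python
import math

# Plain Pearson correlation coefficient, the statistic used above to
# relate the BLEU-4 quality of the synthetic questions to the
# downstream QA scores.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```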

Differences with Shakeri et al. (2020)
A very recent work has addressed multilingual QA with a very similar approach. However, we note a major difference in our respective experiments regarding the choice of the QA and QG models. Shakeri et al. (2020) choose mBert for QA and T5-m for QG. We would like to emphasize that, because T5-m significantly outperforms mBert, it is not clear where the improvement comes from: is it due to the proposed approach, or simply to a distillation effect from T5-m to mBert? In our case, we use the same architecture for both QA and QG, ruling out such a hidden distillation effect.

Figure 1: Left: F1 score on MLQA for models trained with different amounts of synthetic data, in two setups: for All Languages, the synthetic questions are sampled among all the five languages in MLQA; for Not All Languages, the synthetic questions are sampled progressively from only one language, then two, and so on, up to all five for the last point, which corresponds to All Languages. We report the standard deviation over five different permutations of the language ordering. Note that, as expected, the more synthetic data, the lower the variance in the results. Right: same as on the left, but evaluated on XQuAD.

Hidden distillation effect The relative improvement of our best synthetic configuration, +synth-trans, over the baseline is above 60% EM for MiniLM (from 29.5 to 49.5 on XQuAD and from 26.0 to 41.4 on MLQA). This is significantly higher than that observed for XLM-R (+11.7% on XQuAD and +2.71% on MLQA), indicating that XLM-R provides superior cross-lingual transfer abilities compared to MiniLM, a fact we hypothesize is due to distillation. Such loss of generalisation can be difficult to identify, and opens questions for future work.

QA, an unsolved task for lower-resource languages Factoid QA has been criticized for being too easy a task: the answer can often be identified with simple heuristics, e.g. a "When" question is answered by one of the "date" spans in the context (Kočiský et al., 2018; Kwiatkowski et al., 2019).
SQuAD-v2 was for instance introduced to increase the difficulty of the task by adding unanswerable questions. The research community is now moving towards the construction of long-context and non-factoid QA datasets (Dulceanu et al., 2018; Hashemi et al., 2019; Fan et al., 2019; Lewis et al., 2020b). In any case, the motivation of this work was to cope with the lack of training data for under-served languages in the QA domain, which was severely impacting model performance. Therefore, potential criticisms regarding the simplicity of the task do not apply when seen from a lower-resource language perspective: our work deals with alleviating the lack of native training data, allowing us to focus our future work on further important issues such as domain adaptation, robustness and explainability in low-resource contexts.

Conclusion
In this work, we presented a method to generate synthetic QA datasets in a multilingual fashion, showing how QA models can benefit from them and reporting large improvements over the baselines. The proposed approach contributes to filling the gap between English and other languages, and is shown to generalize to languages not present in the synthetic corpus (e.g. French, Italian, Korean).
In future work, we plan to investigate whether the proposed data augmentation method could be applied to other multilingual tasks, such as classification. We will also experiment in more depth with different strategies to control the target language of a model, and to extrapolate to unseen ones.

Acknowledgments
Djamé Seddah was partly funded by the French National Research Agency via the ANR project ParSiTi (ANR-16-CE33-0021); Arij Riabi was partly funded by Benoît Sagot's chair in the PRAIRIE institute, as part of the French national agency ANR "Investissements d'avenir" programme (ANR-19-P3IA-0001), and by the Counter H2020 European project (grant 101021607).

References

Thomas Scialom, Benjamin Piwowarski, and Jacopo Staiano. 2019. Self-attention architectures for answer-agnostic neural question generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6027-6032, Florence, Italy. Association for Computational Linguistics.
Siamak Shakeri, Noah Constant, Mihir Sanjay Kale, and Linting Xue. 2020. Multilingual synthetic question and answer generation for cross-lingual reading comprehension. arXiv preprint arXiv:2010.12008.

A Controlling the Target Language

When relying on the translated versions of SQuAD, the target language for generating synthetic questions can easily be controlled, and results in fluent and relevant questions in the different languages. However, one limitation of this approach is that synthetic questions can only be generated in the languages available during training: the <LANG> prompts are special tokens that are randomly initialised when fine-tuning QG on SQuAD trans,qg. Before fine-tuning, they bear no semantic relation with the corresponding language names ("English", "Español", etc.); thus, the learned representations for the <LANG> tokens are limited to the languages present in the training set.

To the best of our knowledge, no existing method generalizes this target control to an unseen language. It would be valuable, for instance, to be able to generate synthetic data in Korean, French and Italian without having to translate the entire SQuAD en dataset into these three languages and then fine-tune the QG model.
To this purpose, we report (alas, as a negative result) the following attempt: instead of controlling the target language with a special, randomly initialised token, we used a token semantically related to the language word: "English", "Español" for Spanish, or "中文" for Chinese. The intuition is that the model might adopt the correct language at inference, even for a target language unseen during training. 9 A similar intuition has been explored in GPT-2, whose authors report an improvement for summarization when the input text is followed by "TL;DR" (i.e. Too Long; Didn't Read).
At inference time, we evaluated this approach on French with the prompt language=Français. Unfortunately, the model did not succeed in generating text in French. Controlling the target language in the context of multilingual text generation remains under-explored, and progress in this direction could have direct applications to improve this work, and beyond.

B Question Generation Scores
We report the BLEU-4 scores on MLQA for QG synth+trans in Table 5 and for QG synth in Table 6. In addition, we report the F1 scores for XLM-R fine-tuned on SQuAD en and for XLM-R +synth-trans on all the language pairs, on both MLQA and XQuAD, in Tables 7, 8, 9 and 10.

9 By unseen during training, we mean not present in the QG dataset; obviously, the language must have been present in the first self-supervised stage.

C Qualitative Evaluation
We report in Tables 11, 12, and 13 different examples that we analysed in our manual qualitative analysis, discussed at the end of Section 5.1 in the main paper.

Paragraph: Taiwan is an island country in East Asia. The main island, known historically as Formosa, makes up 99% of the area controlled by the ROC, measuring 35,808 square kilometres (13,826 sq mi) and lying some 180 kilometres (112 mi) across the Taiwan Strait from the southeastern coast of mainland China. The East China Sea lies to its north, the Philippine Sea to its east, the Luzon Strait directly to its south and the South China Sea to its southwest. Smaller islands include a number in the Taiwan Strait including the Penghu archipelago, the Kinmen and Matsu Islands near the Chinese coast, and some of the South China Sea Islands.

Answer: 99%
QG synth+trans (target language = de): Welcher Prozentsatz der Gebiete von Taiwan wird von der ROK kontrolliert? (tr: What percentage of the areas of Taiwan is controlled by the ROK?)
QG synth+trans (target language = ar): (tr: What is the percentage of lands controlled by the (Divan)?)

Answer: 35,808 square kilometres
QG synth+trans (target language = de): Wie groß ist die RAF? (tr: How big is the RAF?)
QG synth+trans (target language = ar): (tr: How many square miles is the island?)

Answer: Luzon Strait
QG synth+trans (target language = de): Welcher Fluss ist direkt zum Süden von Taiwan? (tr: Which river is directly to the south of Taiwan?)
QG synth+trans (target language = ar): (tr: What is the name of the nearby railway?)

Answer: South China Sea
QG synth+trans (target language = de): Welches Meer ist im Südwesten von Taiwan? (tr: Which sea is in the southwest of Taiwan?)
QG synth+trans (target language = ar): (tr: What is the ocean seen on the western side of Taiwan?)

Answer: 180 kilometres
QG synth+trans (target language = de): Wie weit ist die RAF von Taiwan aus der südlichen Küste von China? (tr: How far is the RAF from Taiwan from the southern coast of China?)
QG synth+trans (target language = ar): (tr: How long are the Taiwan Islands from the coast of China?)

Figure caption (unseen-language experiments): the synthetic questions are sampled among all the five languages in MLQA. The standard deviation over four different seeds for the sampling is displayed through the confidence interval around the averaged main curves. We observe that for SQuAD-it and KorQuAD the performance increases significantly at the beginning, then remains mostly stable, while for PIAF (fr) the best performance is obtained with 100k additional synthetic data, a slight improvement over 50k additional questions, before starting to decrease.