Multi-Lingual Question Generation with Language Agnostic Language Model

Question generation is the task of generating a coherent and relevant question given a context paragraph. Recently, with the development of large-scale question answering datasets such as SQuAD, English question generation has developed rapidly. However, for other languages such as Chinese, the available training data is limited, which hinders the development of question generation in those languages. To investigate multi-lingual question generation, in this paper we develop a language-agnostic language model, which learns a shared representation from several languages in a single architecture. We propose an adversarial training objective to encourage the model to learn both language-specific and language-independent information. We utilize abundant monolingual text to improve multi-lingual question generation via pre-training. With the language-agnostic language model, we achieve significant improvements in multi-lingual question generation over five languages. In addition, we propose a large-scale Chinese question generation dataset containing more than 220k human-generated questions to benefit multi-lingual question generation research.


Introduction
Question Generation (QG), also known as learning to ask, has attracted a lot of research interest in recent years. QG is regarded as the dual task of machine reading comprehension (Yuan et al., 2017; Xiao et al., 2018). Rather than answering a given question, learning to ask a coherent, relevant, and non-trivial question also requires a deep understanding of the context (Davey and McBride, 1986; Graesser et al., 2010), providing a good testbed for natural language understanding.
Conventional methods for question generation rely heavily on heuristic rules, and a standalone dependency parsing tool is needed to generate handcrafted templates (Mostow and Chen, 2009; Heilman and Smith, 2010; Rus et al., 2010; Hussein et al., 2014; Dhole and Manning, 2020). In recent years, with the development of deep learning and large-scale QA datasets, more and more neural network models have been proposed, which is also referred to as neural question generation. Neural QG shows great advantages compared with previous rule-based systems in terms of both the fluency and the diversity of the generated questions (Duan et al., 2017; Yuan et al., 2017).
However, most progress in QG has been made in English. For other languages such as Hindi, the lack of large-scale QG data limits its development. Recently, multi-lingual and cross-lingual language understanding has been studied in several NLP tasks, such as question answering (Cui et al., 2019), summarization (Zhu et al., 2019), natural language inference (Conneau et al., 2018), etc. For QG, Kumar et al. (2019) demonstrate that for low-resource Hindi, incorporating the large-scale English SQuAD (Rajpurkar et al., 2016) dataset could substantially boost QG results.
For multi-lingual QG, a key factor is to learn a model that can transfer knowledge across different languages. In this paper, we propose a language-agnostic language model: it consists of a specific low-level module for each language and a shared high-level module for multi-lingual information aggregation. Separating the language model into two levels enables us to learn the language-specific information of each language and the common information shared among languages. In this way, knowledge in multi-lingual QG can be transferred via the high-level module.
In the language-agnostic language model, however, the distributed representation of the low-level module could easily be mixed with language information, which makes the high-level module contain unnecessary language-specific features that are too specific to transfer across languages. Inspired by previous works on transfer learning (Liu et al., 2017), we propose an adversarial training objective to decouple the low-level module from the high-level module, which prevents the private and shared latent feature spaces from interfering with each other, making the high-level module language-invariant and thus achieving better transferability across languages.
To get a better initialization for our model, we develop two self-supervised methods to pre-train our model on abundant monolingual text. We apply our model to QG tasks in five languages that have human-labeled QG datasets. The experimental results demonstrate that QG in all languages benefits from multi-lingual training. Our models surpass previous monolingual or multi-lingual QG methods by a large margin. Even in the zero-shot setting, where we have no training data in the low-resource languages, our model achieves satisfactory results when trained merely on the English dataset, which shows the promising transferability of the proposed model.
Besides, we also propose a large-scale Chinese QG dataset containing more than 220k human-labeled questions. We hope the proposed Chinese dataset could benefit the community for more comprehensive multi-lingual QG research. The code and proposed datasets are available at https://github.com/benywon/LALM.
Our contributions are summarized as follows:
• We propose a novel language-agnostic language model which decouples the language-specific and language-independent information in QG.
• The proposed model achieves significant improvements over previous models in multi-lingual QG, and we analyze the transferability across multiple languages.
• We release a large-scale human-labeled Chinese QG dataset containing more than 220k questions. To the best of our knowledge, this is the largest dedicated question generation dataset so far.

Related Work
Question generation has received increasing attention from the research community. Traditional QG systems are mostly rule-based, sometimes utilizing off-the-shelf tools to obtain the syntactic structure, dependency relations, and semantic roles of the passage (Mostow and Chen, 2009; Heilman and Smith, 2010). First, the target answers are generated using rules or semantic roles; next, low-quality questions are generated using hand-crafted rules or templates. Finally, the generated questions are ranked by features such as keyword matching degree or sentence perplexity (Hussein et al., 2014). The main drawbacks of these symbolic systems are that the rules and templates are expensive to create manually and that the generated questions lack diversity.
With the development of deep learning and large-scale question answering datasets, and motivated by neural machine translation, Du et al. (2017) proposed a sequence-to-sequence (seq2seq) architecture combined with an attention mechanism, achieving promising results on the QA dataset SQuAD. Since then, many works have extended this preliminary framework with rich features, such as named entity tags or answer position features (Duan et al., 2017), and incorporated a copy mechanism to copy words from the context paragraph (Song et al., 2018). Other types of models have also been introduced, such as graph neural networks or the Transformer (Scialom et al., 2019). However, most of these works focus on English QG and have not been validated in other languages. Compared with previous multi-lingual methods, our method directly separates the language-dependent module and the language-independent module. We propose an adversarial decoupling module to improve the adaptive ability of the model. Besides, our model can be properly pre-trained on monolingual data, which obviates the need to construct back-translation or pseudo-parallel data.

Language-agnostic Language Model
The language-agnostic language model (LALM) consists of a low-level module and a high-level module. The whole architecture is illustrated in Figure 1(a); we describe it below.

Low-Level Module
The low-level module is built to perform the basic language understanding. In this paper, we adopt an LSTM (Hochreiter and Schmidhuber, 1997) encoder as the low-level language understanding module 1. The LSTM processes text in sequential order and embeds the language information into dense representations. We adopt the uni-directional LSTM in this paper to make the model auto-regressive. In the language-agnostic language model, each language has its own specific word embeddings and specific low-level language understanding LSTMs. This is different from some previous multi-lingual methods, in which a shared or aligned word embedding is utilized for different languages (Conneau et al., 2018; Lample and Conneau, 2019). Separating the language understanding module enables us to model the specific linguistic characteristics of different languages. In Section 4, we will show that separating the low-level module for each language substantially benefits multi-lingual QG.
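The separation described above can be sketched with toy stand-ins (plain Python callables instead of real embeddings, LSTMs, and Transformers; the class and function names are ours, not from the paper's released code):

```python
class LALMSketch:
    """Toy sketch: one low-level encoder per language, one shared high-level module."""

    def __init__(self, low_level, high_level):
        # low_level: dict mapping language code -> language-specific encoder
        # high_level: a single module shared by every language
        self.low_level = low_level
        self.high_level = high_level

    def forward(self, lang, tokens):
        # route through the language's own low-level module first,
        # then through the shared high-level module
        hidden = self.low_level[lang](tokens)
        return self.high_level(hidden)


# stand-ins: the "encoders" here just transform token lists
low = {
    "en": lambda toks: [t.upper() for t in toks],  # English-specific "LSTM"
    "zh": lambda toks: list(reversed(toks)),       # Chinese-specific "LSTM"
}
shared = lambda hidden: "|".join(hidden)           # shared "Transformer"

model = LALMSketch(low, shared)
```

Only `low_level` differs per language; adding a new language means adding one entry to the dict while `high_level` stays fixed, which is the structural property the knowledge transfer relies on.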

High-Level Module
The low-level module is built to perform the basic linguistic understanding, and the high-level module is built on top of the low-level module to perform higher-level information aggregation, which requires higher model capacity. In this paper, we use the Transformer (Vaswani et al., 2017) model as the high-level module.
The Transformer, with its core building block called multi-head attention, has shown great advantages in representing languages in many NLP tasks. Current state-of-the-art models on the natural language understanding benchmark GLUE 2 are almost all Transformer-based. In this paper, we focus on QG, which is a sequence-to-sequence problem, so we adopt a mask operation similar to , which is illustrated in Figure 1(b). For a pair of sequences (x, y), where x = x_1, ..., x_{|x|} is the source and y = y_1, ..., y_{|y|} is the target, we concatenate them together with a special token <sep>, forming a single sequence of length |x| + |y| + 1. We want all the positions in the source {1, 2, ..., |x|} to attend to each other so we can obtain bi-directional representations of the source, and all the positions in the target {|x| + 1, ..., |x| + |y| + 1} are forbidden to attend to future words:

M_{ij} = 0 if position i is allowed to attend to position j, and M_{ij} = -∞ otherwise.

This attention mask operation enables us to build a causal language model in which the generation of the current word depends only on its previous words. Therefore, the probability of y can be denoted as:

p(y | x) = ∏_{t=1}^{|y|} p(y_t | x, y_{<t})

And the loss for the whole model is the negative log-likelihood of the data:

L_LM = -∑_{t=1}^{|y|} log p(y_t | x, y_{<t})
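The mask can be made concrete with a small helper (a sketch under our own naming; a real implementation would add this as a bias of 0 / -∞ to the attention logits before the softmax):

```python
def build_attention_mask(src_len, tgt_len):
    """Return an n x n boolean matrix where entry [i][j] is True iff
    position i may attend to position j.

    src_len: length of the source x (fully bi-directional block)
    tgt_len: length of <sep> plus the target y (causal block)
    """
    n = src_len + tgt_len
    allowed = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j < src_len:
                # every position may attend to the whole source
                allowed[i][j] = True
            elif i >= src_len and j <= i:
                # target positions attend only to non-future positions
                allowed[i][j] = True
    return allowed
```

With this mask, a single Transformer stack behaves as a bi-directional encoder over the source and an auto-regressive decoder over the target at once.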

Adversarial Decoupling Module
In this paper, we want the representations of the low-level module in different languages to contain no language-specific information that is interleaved with the high-level module. In this way, the high-level module can focus on the semantic understanding shared across languages. We build a discriminator on top of the low-level module to determine whether the output of the low-level representations contains specific language information. The discriminator is a bi-directional LSTM that takes the output of the low-level module as input and tries to predict its language. Concretely, denote the output of the low-level module as S ∈ R^{n×d}, where n is the sequence length (i.e., |x| + |y| + 1) and d is the hidden size of the low-level module. The output of the discriminator can be represented as:

h = Pool(BiLSTM(S)),   ŷ = softmax(W h)

where h ∈ R^d is a pooled representation of the discriminator for classification, and ŷ is the language distribution in R^C, where C is the number of languages. For the discriminator, the target is to maximize the probability of the corresponding language, while the low-level module (generator) tries to minimize it. Therefore, they form an adversarial training objective in which the low-level module must produce representations without discriminative language information. In this way, the discriminator acts as an adversarial decoupling module (ADM) that encourages the low-level module to generate language-agnostic representations.
The architecture of the ADM is shown in Figure 1(c), and the loss functions for the discriminator and the low-level module (generator) are:

L_D = -log ŷ_i,   L_G = ∑_{c=1}^{C} ŷ_c log ŷ_c

where ŷ_i is the discriminator probability for the input language i. In fact, the objective of the generator is to maximize the entropy of the discriminator's output to make it less confident of the language.
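In a minimal form (our own function names; in the actual model ŷ comes from the pooled BiLSTM state), the two objectives look like:

```python
import math

def discriminator_loss(lang_probs, lang_id):
    """Cross-entropy: the discriminator tries to assign high probability
    to the true language of the input."""
    return -math.log(lang_probs[lang_id])

def generator_loss(lang_probs):
    """Negative entropy: minimizing this pushes the discriminator's output
    toward uniform, i.e. the low-level module hides the language identity."""
    return sum(p * math.log(p) for p in lang_probs if p > 0.0)
```

A uniform distribution over the C languages minimizes the generator loss, matching the statement that the generator maximizes the entropy of the discriminator's output.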

Pre-training
Recent works on NLP and language generation have shown the great advantage of large-scale pre-training (Devlin et al., 2018; Radford et al., 2019; Lewis et al., 2019; Roberts et al., 2020). In this paper, we also pre-train our model on massive multi-lingual text. Since our model is a sequence-to-sequence architecture, we develop two self-supervised objectives for language generation pre-training: Denoised Auto-Encoder (DAE): Most previous works on natural language generation pre-training resort to DAE to initialize the model. In DAE, a corrupted version of the original sentence is created as the source, and the model should reconstruct the original sentence. In this paper, we adopt a noising strategy similar to Lewis et al. (2019): (1) Token Masking: random tokens are sampled and replaced with a special [MASK] token. (2) Token Deletion: randomly deletes several tokens in the document. (3) Token Replacement: randomly replaces some tokens with other tokens in the vocabulary. (4) Sentence Permutation: randomly shuffles the order of sentences in the document.
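A simplified corruption function in the spirit of strategies (1)-(3) (our own parameter names and noise rates; sentence permutation is omitted for brevity):

```python
import random

def corrupt(tokens, vocab, rng, mask_p=0.15, del_p=0.1, repl_p=0.1,
            mask_token="[MASK]"):
    """Produce a noised copy of `tokens` for denoised auto-encoding.

    Each token is independently deleted, masked, replaced with a random
    vocabulary token, or kept unchanged.
    """
    out = []
    for tok in tokens:
        r = rng.random()
        if r < del_p:
            continue                           # (2) token deletion
        elif r < del_p + mask_p:
            out.append(mask_token)             # (1) token masking
        elif r < del_p + mask_p + repl_p:
            out.append(rng.choice(vocab))      # (3) token replacement
        else:
            out.append(tok)
    return out
```

The model is then trained to reconstruct the original `tokens` from the corrupted copy.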
Next Sentence Generation: One of the problems of the DAE is that the input is always a corrupted sentence, which is not the case during fine-tuning; this pretrain-finetune discrepancy may hurt performance on downstream tasks. Similar to Kiros et al. (2015) and , we sample a consecutive segment in the text and divide it into two parts, treating the first part as the source and the second part as the target. The objective is to generate the second part based on the first part.

Question Generation Fine-tuning
After pre-training, we suppose the low-level module of our model has learned the multi-lingual linguistic information. The fine-tuning objective is then to adjust the high-level module for question generation. Therefore, in this phase, we fix the low-level module, i.e., the word embeddings, LSTM, and output projection linear layer, and only update the parameters of the high-level module.
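This fine-tuning regime can be sketched as a flag toggle (illustrative parameter names of our own; in a framework such as PyTorch this would correspond to setting `requires_grad = False` on the low-level tensors):

```python
# assumed naming convention for the frozen low-level parameter groups
LOW_LEVEL_PREFIXES = ("embedding.", "lstm.", "output_projection.")

def freeze_low_level(params):
    """Mark low-level parameters as frozen so the optimizer skips them.

    params: dict mapping parameter name -> {"trainable": bool, ...}
    Returns the names that remain trainable (the high-level module).
    """
    for name, p in params.items():
        if name.startswith(LOW_LEVEL_PREFIXES):
            p["trainable"] = False
    return [n for n, p in params.items() if p["trainable"]]
```

Because only the shared high-level parameters receive gradients, the per-language low-level modules stay interchangeable, which later enables the zero-shot setup.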

Dataset
Question generation datasets are often directly derived from corresponding question answering datasets. Currently, most multi-lingual QA datasets are automatically derived by translation from the English SQuAD (Asai et al., 2018). However, using these datasets may reduce multi-lingual QG tasks to translation tasks. Therefore, we consider four QG datasets in different languages that were developed by native speakers of those languages.
• English (En): We use SQuAD (Rajpurkar et al., 2016) as the English question generation dataset. It is a standard machine reading comprehension dataset consisting of nearly 100k human-labeled questions from Wikipedia.
• Chinese (Zh): Since the sizes of the QG datasets other than English are comparatively small, we propose a new large-scale human-created Chinese QG dataset. First, we collect nearly 3.5m passages from Baike 3, a Chinese Wikipedia-like encyclopedia. To increase the diversity of the selected paragraphs, we cluster the passages based on bag-of-words representations, then use the Ward (Ward Jr, 1963) algorithm to select the centroid of each cluster, which results in nearly 100k passages. We ask volunteers to write no more than 5 questions for each paragraph. Since we did not give specific answer candidates for each paragraph, the annotators were encouraged to ask more general and comprehensive questions. We also ask other volunteers to check the quality and remove questions that are either unanswerable or contain grammar errors. Finally, we obtain 224,962 question-paragraph pairs. We randomly select 180k of them as the training data, 20k samples for development, and the remaining 24,962 for testing. We name the dataset LAB (Learning to Ask on Baike).
We adopt the 2020-05-20 data dumps of Wikipedia 4 in the corresponding languages as the pre-training data. The details of the training data are shown in Table 1.
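The diversity-selection step can be illustrated with a simplified centroid pick (bag-of-words over a toy vocabulary and squared Euclidean distance; the paper uses Ward clustering, which this sketch does not reproduce):

```python
def bow_vector(tokens, vocab_index):
    """Count-based bag-of-words vector over a fixed vocabulary."""
    vec = [0] * len(vocab_index)
    for tok in tokens:
        if tok in vocab_index:
            vec[vocab_index[tok]] += 1
    return vec

def most_central_passage(passages):
    """Return the index of the passage closest to the mean BOW vector,
    i.e. the most 'representative' member of a cluster."""
    vocab = sorted({tok for p in passages for tok in p})
    index = {tok: i for i, tok in enumerate(vocab)}
    vecs = [bow_vector(p, index) for p in passages]
    mean = [sum(col) / len(vecs) for col in zip(*vecs)]
    sq_dist = lambda v: sum((a - b) ** 2 for a, b in zip(v, mean))
    return min(range(len(passages)), key=lambda i: sq_dist(vecs[i]))
```

Applying such a selection per cluster keeps one representative passage each, which is how roughly 3.5m passages are reduced to about 100k.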

Implementation Details
In all experiments, we tokenize the text with sentencepiece (Kudo and Richardson, 2018). For all language datasets, we set the vocabulary size to 30,000. We use the Adam (Kingma and Ba, 2014) optimizer with 5k warm-up steps and linearly decay the learning rate. β1, β2, and ε were set to 0.9, 0.99, and 10^-6, respectively. For both pre-training and fine-tuning, the max learning rate was set to 10^-4. The batch size was 256 during pre-training and 64 during fine-tuning. We limit the max sequence length to 512. For the adversarial decoupling module training, following previous works on generative adversarial networks (Goodfellow et al., 2014; Salimans et al., 2016), the update rate for the discriminator and generator was set to 1:10.
LALM_share is a fully shared multi-lingual language model. It is similar to the proposed model but has no specific low-level LSTM for each language; that is, the low-level and high-level parameters are both shared across different languages. The hidden size was set to 768, the layer size was set to 12, and each layer consists of 12 heads. We set the shared vocabulary size to 100,000.
LALM_base is the base version of our model. It has the same hidden size as LALM_share. The low-level module is a single-layer uni-directional LSTM with hidden size 768. LALM_base has nearly 138m parameters, of which nearly half are low-level language understanding parameters.
LALM_large is the large version of our proposed model. The hidden size, layer size, and head size were set to 1024, 24, and 16, respectively. The low-level module consists of two-layer uni-directional LSTMs. LALM_large has 548m parameters, of which nearly a quarter are low-level module parameters.

Criterion:
Following previous works on QG , we adopt three widely used automatic metrics for evaluation: BLEU, METEOR, and ROUGE-L, which measure the n-gram similarities between the generated questions and the real questions.
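All three metrics reduce to n-gram matching; a minimal clipped n-gram precision (the core of BLEU, without brevity penalty or smoothing) can be written as:

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision between two token lists.

    Counts how many candidate n-grams also appear in the reference,
    with each reference n-gram usable at most as often as it occurs.
    """
    cand = Counter(tuple(candidate[i:i + n])
                   for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n])
                  for i in range(len(reference) - n + 1))
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return matched / total if total else 0.0
```

BLEU-4 combines these precisions for n = 1..4 with a brevity penalty; ROUGE-L instead scores the longest common subsequence between candidate and reference.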

Baselines
We adopt 5 baseline methods for comparison.
Transformer (Vaswani et al., 2017; Scialom et al., 2019) is the most widely used architecture in sequence-to-sequence learning. For each language, we train the corresponding Transformer model on its training data. We set the dropout ratio to 0.4 to prevent overfitting.
NQG++  is a popular neural QG model based on LSTMs. It is enhanced with attention and a copy mechanism 5.
Multi-BERT (Devlin et al., 2018) is a multi-lingual extension of the original BERT model. It was trained on multi-lingual Wikipedia, and all languages share the same vocabulary. We adopt the same approach as Rönnqvist et al. (2019) to extend BERT to language generation tasks.
CLQG (Kumar et al., 2019) is a cross-lingual QG method based on the Transformer. It is pre-trained with denoising auto-encoders along with back-translation. We use the public implementation 6 and adopt the same word tokenization and pre-training data as our model.
XNLG (Chi et al., 2019) is a multi-lingual language generation model that transfers monolingual supervision to all pre-trained languages. It was trained with English, Chinese, and French datasets. We use their public pre-trained models 7 and fine-tune them on the three QG datasets.

Multi-Lingual Question Generation
To evaluate the multi-lingual question generation ability of the proposed methods, we assemble all the QG data and train the LALM on it. For Transformer and NQG++, we initialize the word embeddings with the fasttext multi-lingual word embeddings (Grave et al., 2018). The results are shown in Table 2.
We can see from the table that our model excels at multi-lingual QG, achieving significant improvement over previous methods in all languages. Compared with other architectures such as Transformer, we explicitly separate the low-level and the high-level module in the proposed model and use adversarial networks to decouple them. Therefore, the shared high-level module is encouraged to learn more common representations across different languages, which is more transferable and benefits the downstream QG task a lot.
Besides, we can see that if we do not explicitly separate the low- and high-level parameters (LALM_share), the results drop a lot. We hypothesize that different languages carry different low-level language information, such as lexical and syntactic features. Embedding all language processing procedures into a single model may make it hard for the model to discriminate the language-specific information.
Moreover, the model trained with the adversarial decoupling module achieves further improvement; the ADM may impose an implicit regularization on the low-level module to make the representations more abstract, and therefore encourage the high-level module to learn more common representations (Liu et al., 2017).

Human Evaluation
The automatic metrics are sometimes biased toward a specific attribute of the generated question (Hosking and Riedel, 2019). So we conduct a human qualitative evaluation of the generated outputs. We consider three aspects of the generated questions:
Fluency: Whether the generated questions are well-posed and natural, in terms of both grammar and semantics.
Answerable: Whether the generated questions can be answered from the context paragraph.
Significance: Whether the generated question is just a simple syntactic transformation of a paragraph sentence or a trivial one that seems unlikely to be asked by a human.
We randomly sample 50 generated questions from English and 50 from Chinese and ask three volunteers to evaluate the sample quality. The results are shown in Table 5. The results show that our proposed model also excels in human evaluation, especially for significance, which is sometimes regarded as the most important factor in QG (Graesser et al., 2010). We also showcase some outputs of our model in Table 4. We can see that LALM can generate fluent and sound questions.

Multi-Lingual v.s. Mono-Lingual
Kumar et al. (2019) found that in QG the performance of Hindi could be improved by training with additional English data. In this section, we evaluate whether multi-lingual QG is superior to mono-lingual QG. We focus on two aspects:
(1) Pre-training. In contrast to the proposed multi-lingual pre-training, we adopt mono-lingual pre-training, where we only pre-train on a specific language 8 and fine-tune the QG model on the same language.
(2) Fine-tuning. Different from the setup in Sec. 4.5, where we aggregate all languages' QG data for training, we only fine-tune the model on a specific language.
We experiment on English and Chinese with the LALM_base model. The BLEU-4 and ROUGE-L scores are shown in Table 3. It is clear that for both pre-training and fine-tuning, multi-lingual training improves the model a lot. Moreover, multi-lingual training plays a more important role in pre-training than in fine-tuning. We suppose that during pre-training, multiple languages perform a type of regularization on the shared high-level module, while in fine-tuning the language-dependent supervision of QG is more specific, which makes transfer learning less useful.

Table 4: Some generated cases of the proposed model.

English
Context: The United Methodist Church opposes conscription as incompatible with the teaching of Scripture. Therefore, the Church supports and extends its ministry to those persons who conscientiously oppose all war, or any particular war, and who therefore refuse to serve in the armed forces or to cooperate with systems of military conscription. However, the United Methodist Church also supports and extends its ministry to those persons who conscientiously choose to serve in the armed forces or to accept alternative service. The church also states that "as Christians they are aware that neither the way of military action, nor the way of inaction is always righteous before God."
Original: The Church supports those persons who conscientiously oppose what?
LALM: what does the church states after they oppose the construction ?

Korean

Zero-Shot Learning
In this section, we study the zero-shot multi-lingual learning ability of our model. The previous section demonstrates that the English SQuAD could strengthen other languages a lot, so we choose SQuAD as the training data and evaluate on the other languages. We only update the parameters of the high-level module on SQuAD without modifying the low-level language understanding modules. Therefore, replacing the low-level module has little influence on the whole architecture, making zero-shot inference possible. We compare the zero-shot results of the LALM_base model with the supervised NQG++. The results are shown in Table 7.
We can see that the zero-shot version of our LALM appears to have equaled or eclipsed the QG ability of NQG++. This is an interesting result showing that our model can transfer the question generation ability of English to other languages even without supervision. However, pure zero-shot learning still struggles to achieve good results; supervision from the target language remains necessary.

The Effect of Pre-training
We propose the self-supervised denoised auto-encoding and next sentence generation objectives to pre-train the model. In this section, we construct a model that does not employ pre-training but is directly fine-tuned on the target data. We set the LALM hidden size to 256 and the numbers of layers and heads to 4 and 8, respectively, to prevent overfitting. The results on English, Chinese, and Hindi are shown in Table 6. The performance of our model drops a lot without pre-training. In particular, it performs poorly on the low-resource Hindi data because there are only 4,000 training instances. Nevertheless, when trained with the adversarial decoupling module, our model still achieves consistent improvement, demonstrating that the ADM is good at multi-lingual transfer learning.

Conclusion
In this paper, we propose a language-agnostic language model for multi-lingual question generation. The model consists of a low-level and a high-level module to explicitly represent the language-dependent and language-independent information, respectively. We operate on the attention mask matrix to fit our model to sequence-to-sequence learning. We propose an adversarial training mechanism to decouple the two-level modules, making the low-level module contain more abstract representations and the high-level module language-agnostic. We also propose a large-scale Chinese QG dataset containing more than 220k questions. Experiments on five languages demonstrate that our model achieves significant improvements over previous methods in multi-lingual QG. For future work, we would like to apply our proposed model to other multi-lingual tasks such as summarization and question answering.