Latent Reasoning for Low-Resource Question Generation

Multi-hop question generation requires complex reasoning and coherent language realization. Learning a generation model for this problem requires extensive multi-hop question answering (QA) data, which are limited due to the manual collection effort. A two-phase strategy addresses the insufficiency of multi-hop QA data by first generating and then composing single-hop sub-questions. Learning this generate-then-compose two-phase model, however, requires manually labeled question decomposition data, which is labor intensive. To overcome this limitation, we propose a novel generative approach that optimizes the two-phase model without question decomposition data. We treat the unobserved sub-questions as latent variables and propose an objective that estimates the true sub-questions via variational inference. We further generalize the generative modeling to single-hop QA data. We hypothesize that each single-hop question is a sub-question of an unobserved multi-hop question, and propose an objective that generates single-hop questions by decomposing latent multi-hop questions. We show that the two objectives can be unified and both optimize the two-phase generation model. Experiments show that the proposed approach outperforms competitive baselines on HOTPOTQA, a benchmark multi-hop question answering dataset.


Introduction
Question generation aims to automatically generate valid and coherent questions based on a given context, and is widely applied to enrich question answering (QA) datasets, facilitate text comprehension (Ko et al., 2020), seek clarification in conversation (Rao and Daumé III, 2019), etc. Recently, neural encoder-decoder based approaches have shown promising results for simple, single-hop question generation (Du and Cardie, 2018). Such approaches directly map context (e.g., text passages) to questions without reasoning, and thus struggle when generating multi-hop questions (Pan et al., 2020). Here, reasoning refers to identifying and aggregating the relevant information taken from multiple documents to derive the question. Table 1 illustrates the reasoning process of a multi-hop question; in this example, the entity that links the two passages, i.e., "AMG", is first identified, and the relations around it in the context are transformed into a question. Modeling such reasoning processes in an end-to-end manner requires extensive training data, and is thus impractical due to the extensive collection effort of multi-hop QA data.

* Rui Zhang is the corresponding author.

[Table 1 evidence excerpt: "[1] George ..., known professionally as Dario Franchitti, is a retired Scottish racing driver. [2] After Franchitti did not secure a single-seater drive in 1995, he was contracted by the AMG team to compete in touring cars in the DTM and its successor, the International Touring Car Championship."]
To address this problem, recent studies propose to augment the generation model with an explicit reasoning process. For example, a straightforward solution is to identify the anchoring entities via named entity recognition (NER) and find relations via relation extraction. The extracted structural reasoning path, in the form of subject-predicate-object triples as illustrated in Table 1, is then fed to the generation model as auxiliary features. However, the reasoning capability is constrained by the off-the-shelf extraction tools, which cannot be extended to arbitrary context (Yang et al., 2018; Dhingra et al., 2020).
Another line of recent studies on multi-hop question answering models the reasoning process by decomposing a multi-hop question into several sub-questions (Min et al., 2019; Wolfson et al., 2020). As illustrated in Table 1, the answer to the multi-hop question can be derived by answering a series of single-hop sub-questions. Ideally, question generation can also adopt this two-phase strategy, which first generates sub-questions and then composes them into a multi-hop question. However, this strategy requires a parallel corpus that pairs each multi-hop question with its corresponding sub-questions, and obtaining such annotations still requires extensive effort and cost.
To address these issues, we propose to jointly optimize the two-phase model using only non-parallel single-hop and multi-hop corpora, in which the questions are not paired. We propose a generative objective that models the multi-hop and single-hop question generation (QG) tasks in a unified way. The key idea is that each question, whether multi-hop or single-hop, can be considered a partially observed (multi-hop question, sub-question) pair, and we treat the unobserved part as a latent variable. In the generative modeling of multi-hop QG, we use the two-phase model as a generation model and introduce a posterior model to estimate the unobserved sub-questions. The generation and posterior models are jointly optimized via variational inference (Kingma and Welling, 2014). For generative single-hop QG, we instead use the two-phase model as a posterior model to estimate unobserved multi-hop questions, and this posterior model is jointly optimized with a generation model that decomposes a multi-hop question into sub-questions. In this way, we integrate the optimization of the two-phase model into both generative multi-hop and single-hop QG tasks, where it serves as the generation model and the posterior model, respectively.
Optimizing the generative objective in the text space is, however, prone to compounding errors due to the diversity of potential reasoning paths. There are multiple ways to raise a single-hop question given the same piece of information, and it is challenging to find a valid one given only the text passages. We address this challenge by equipping the generative modeling with a planning mechanism that uses a latent variable to encode the desired reasoning path. In this way, the inference of sub-questions is guided by a pre-sampled plan (i.e., the latent variable) and thus maintains consistency with the target multi-hop question. We achieve latent variable learning by incorporating an end-to-end differentiable bottleneck into the sub-question generation model, which can be naturally integrated into the overall objective. Moreover, the proposed planning mechanism also promotes more stable training. The original generative modeling involves sequential sampling of latent variables (i.e., sub-questions), which is known to cause high variance and result in unstable training (He et al., 2020). The planning mechanism relieves the sequential sampling requirement, since it encodes the high-level plan and covers the dependency between sub-questions.
Our contributions are summarized as follows:
• We propose a novel generative objective that unifies non-parallel question corpora and relieves the requirement of extensive annotations for learning a two-phase question generation model.
• We propose a planning mechanism to guide the generation towards sub-questions that are more probable to compose into a multi-hop question.
• We conduct experiments on a benchmark multi-hop question answering dataset. The results show that our approach outperforms the state-of-the-art under both language generation and question answering based evaluations.

Preliminaries
We consider a multi-hop QA dataset of question-answer-evidence triples, where the evidence is a set of potentially relevant sentences c_i = {d_1, d_2, ..., d_k}, and each multi-hop question q requires reasoning over multiple sentences to find the answer a. Multi-hop question generation (QG) aims to generate a question q that has the pre-selected answer a given the evidence set c. Existing studies adopt a strategy commonly used in single-hop question generation, which formulates multi-hop QG as a seq-to-seq problem. Since extensive annotation effort is needed to produce multi-hop QG examples, few such examples are available. Thus, a naive adoption of seq-to-seq learning may not yield an effective multi-hop QG model, especially in the low-resource scenario.
To address data insufficiency, a two-phase strategy is considered, based on the assumption that each multi-hop question q can be decomposed into two single-hop sub-questions s_1 and s_2.^2 Multi-hop question generation is then performed by a sub-question generation model p_S and a question composing model p_C as

  p(q|a, c) = p_C(q|s_1, s_2) p_S(s_1, s_2|a, c).   (1)
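The two-phase factorization above can be sanity-checked on a toy discrete space; all probability tables below are invented for illustration, standing in for the seq-to-seq models p_S and p_C:

```python
import numpy as np

# Toy check of the two-phase factorization
#   p(q | a, c) = sum_{s1,s2} p_C(q | s1, s2) * p_S(s1, s2 | a, c)
# over a tiny discrete space. All probability tables are random
# placeholders; the real models are seq-to-seq generators.

rng = np.random.default_rng(1)
n_s = 3   # possible sub-question pairs (s1, s2)
n_q = 4   # possible multi-hop questions

p_S = rng.random(n_s)
p_S /= p_S.sum()                        # p_S(s1, s2 | a, c), a distribution over pairs
p_C = rng.random((n_s, n_q))
p_C /= p_C.sum(axis=1, keepdims=True)   # p_C(q | s1, s2), one distribution per pair

p_q = p_S @ p_C                         # marginal p(q | a, c): mixture over pairs
```

Marginalizing over the sub-question pairs yields a proper distribution over multi-hop questions, which is what the latent-variable treatment in the next section exploits.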
Training these two models requires question decomposition data, i.e., pairs of a multi-hop question and its corresponding sub-question annotations {(q, s_1, s_2)}. However, such question decomposition data is non-trivial to obtain, as it requires extensive human annotation effort.

1 Note that the generation of single-hop questions s, the multi-hop question q, and the planning variable z is conditioned on the evidence set c and answer a, which is omitted in Fig. 1.
2 The formulation can be easily extended to more sub-questions.

Proposed Model
We take a two-phase approach for multi-hop question generation while not requiring a question decomposition dataset that contains pairs of multi-hop questions and sub-questions. We assume that a single-hop question answering dataset D_S and a multi-hop dataset D_M are available for training. Both datasets are non-parallel, i.e., they contain question-answer pairs but not sub-questions, and the evidence passages of both datasets come from the same source (e.g., Wikipedia articles).
Under these problem settings, we aim to learn the single-hop QG model p S and the question composing model p C using both D S and D M . To effectively train these two models in the absence of question decomposition data, we propose a unified generative formulation that naturally connects single-hop and multi-hop questions. Specifically, in modeling the generation process of multi-hop questions, we treat the corresponding sub-questions as latent variables and propose an objective that jointly optimizes p S and p C (Sec. 3.1). We further extend the generative formulation to model the generation of single-hop questions, and both generation processes together form the overall optimization objective (Sec. 3.2). Then, we propose a planning-aware generation strategy to better optimize the objective in Sec. 3.3. We summarize the overall learning and inference process in Sec. 3.4.

Generative Modeling of Multi-Hop QG
We now reconsider the two-phase question generation strategy in Eqn. 1. Since we do not have parallel data, it is infeasible to directly model the conditional probability p(q|s), where s = {s_1, s_2} is the set of sub-questions of q. We thus propose to treat the unobserved sub-questions as latent variables, and describe p(q|a, c) in a generative way as

  p(q) = Σ_s p(q, s) = Σ_s p_θ(q|s) p_ψ(s),   (2)

where p(q) and p(q, s) are shorthands for p(q|a, c) and p(q, s|a, c), p_ψ is a conditional prior model, and p_θ is a generation model for multi-hop questions. Since this likelihood is intractable, we instead derive and optimize its evidence lower bound (ELBO) (Kingma and Welling, 2014):

  L_ELBO(q) = E_{q_φ(s|q)}[log p_θ(q|s)] − KL(q_φ(s|q) || p_ψ(s)),   (3)

where q_φ is a posterior model for the latent variable s, and KL denotes the Kullback-Leibler divergence. We now substitute the latent variable s with the two sub-questions s_1 and s_2, and define the factorized form of the posterior and the prior in a hierarchical manner:

  q_φ(s_1, s_2|q) = q_φ(s_1|q) q_φ(s_2|s_1, q),
  p_ψ(s_1, s_2) = p_ψ(s_1) p_ψ(s_2|s_1).   (4)

We can now rewrite the ELBO in Eqn. 3 with this factorization and obtain

  L_ELBO(q) = E_{q_φ(s_1|q) q_φ(s_2|s_1,q)}[log p_θ(q|s_1, s_2)] − KL(q_φ(s_1, s_2|q) || p_ψ(s_1, s_2)).   (5)

Fig. 1(a) shows the directed graphical model of the generative modeling of multi-hop question generation. Specifically, given an evidence set and a pre-selected answer, a first single-hop question s_1 is sampled. Given s_1 and relevant information in the context, a second sub-question s_2 that satisfies a valid reasoning process is further sampled. Since both sub-questions are unobserved, we estimate s_1 and s_2 using the posterior model q_φ. The sub-questions then form the observed multi-hop question q via question composing as p_θ(q|s_1, s_2).
To perform effective optimization, we tie the parameters of the posterior model q_φ at different hierarchies, i.e., q_φ(s_1|·) and q_φ(s_2|·), as one single-hop QG model. The same parameter tying applies to the prior model p_ψ. We implement the generation model p_θ, the prior p_ψ, and the posterior q_φ in Eqn. 5 using pre-trained encoder-decoder models, which will be detailed in Sec. 3.4. We note that the prior p_ψ and the generation model p_θ play the same roles as the single-hop QG model p_S and the question composing model p_C in Eqn. 1. Thus, the generative modeling enables a joint optimization of p_S and p_C using multi-hop QA data only, without question decomposition data.
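The ELBO construction can be checked numerically on a toy discrete latent space (all distributions below are invented): any posterior q_φ yields a lower bound on log p(q), and the bound is tight at the true posterior.

```python
import numpy as np

# Numeric check that an ELBO lower-bounds log p(q) for a discrete
# latent s in {0, ..., K-1}. All distributions are random placeholders.

rng = np.random.default_rng(0)
K = 5
prior = rng.random(K)
prior /= prior.sum()                      # p_psi(s)
lik = np.clip(rng.random(K), 1e-6, 1.0)   # p_theta(q | s) for one fixed q
post = rng.random(K)
post /= post.sum()                        # an arbitrary posterior q_phi(s | q)

log_pq = np.log(np.dot(prior, lik))       # log p(q) = log sum_s p(q|s) p(s)
elbo = np.dot(post, np.log(lik)) - np.sum(post * np.log(post / prior))

# The bound is tight when q_phi equals the true posterior p(s | q):
true_post = prior * lik / np.dot(prior, lik)
elbo_tight = (np.dot(true_post, np.log(lik))
              - np.sum(true_post * np.log(true_post / prior)))
```

The gap between log p(q) and the ELBO is exactly KL(q_φ(s|q) || p(s|q)), which is why optimizing the bound drives q_φ toward the true posterior over sub-questions.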

Generative Modeling of Single-Hop QG
Considering that multi-hop QA data is limited, we propose to integrate single-hop QA data into the joint optimization objective. We extend the proposed generative modeling by assuming that each single-hop question is obtained by decomposing an unobserved multi-hop question. With a slight abuse of notation, we use (s, a, c) to denote a single-hop question-answer-evidence triple, and describe p(s|a, c) as

  p(s) = Σ_q p(s, q) = Σ_q p_θ(s|q) p_ψ(q),   (6)

where we omit the conditions as in Eqn. 2, and q is a multi-hop question that has s as a sub-question. The generation model p_θ and the prior model p_ψ are parameterized with θ and ψ, respectively. We treat the unobserved q as a latent variable and derive the evidence lower bound as

  L_ELBO(s) = E_{q_φ(q|s)}[log p_θ(s|q)] − KL(q_φ(q|s) || p_ψ(q)),   (7)

where q_φ is a posterior model to estimate the unobserved question q.

Fig. 1(b) illustrates the generative modeling of single-hop QG. Specifically, a multi-hop question is first sampled from the prior p_ψ, and we assume that its sub-question set includes the observed single-hop question s. The question s is then generated by decomposing the multi-hop question q via p_θ(s|q). We estimate the unobserved multi-hop question q using the posterior model q_φ.
We observe that the posterior approximation in single-hop QG (dashed line in Fig. 1(b)-left) is the same as the generative process in multi-hop QG (solid line in Fig. 1(a)-left). Thus, we can realize the posterior model q_φ(q|s) by reusing the prior p_ψ and the generation model p_θ in Eqn. 5 as

  q_φ(q|s) = p_θ(q|ŝ, s) p_ψ(ŝ),   (8)

where ŝ is the unobserved second sub-question that forms the multi-hop question together with s. Note that we no longer need a hierarchical form, since one sub-question is observed. Further, we observe that the generative process in single-hop QG (solid line in Fig. 1(b)-left) is part of the posterior approximation of multi-hop QG (dashed line in Fig. 1(a)-left). This way, we realize the generation model p_θ and the prior p_ψ using the models already present in multi-hop QG:

  p_ψ(q) = p_θ(q|s_1, s_2) p_ψ(s_1) p_ψ(s_2|s_1).   (9)

Planning Guided Question Generation
A challenge arises under the generative formulation: the diversity of feasible generated questions can impede model training. Given the same evidence set and pre-selected answer, there can be multiple ways to raise a question (Lee et al., 2020). However, not every potential single-hop question qualifies as a sub-question of the target multi-hop question, as illustrated in Table 2.
To address this challenge, we propose to learn a latent planning variable that serves as a generation plan to guide the generation process. The latent planning variable aims to capture the high-level reasoning required to answer the multi-hop question, which is abstracted as a reasoning path in existing studies. To model the decision making over reasoning paths, we define the latent variable z as a discrete variable. We now incorporate the latent variable into the generative modeling of multi-hop QG:

  L_plan(q) = E_{q_ω(z|q)}[log p(q|z)] − KL(q_ω(z|q) || p_ω(z)),

where q_ω and p_ω are posterior and prior models, respectively; the reason for sharing the parameters ω will be detailed later. The conditional probability p(q|z) is modeled by letting the terms of L_ELBO(q) in Eqn. 5 be additionally conditioned on the sampled latent variable z (as illustrated in Fig. 1(a)-right). The generation of sub-questions, in both the prior p_ψ and the posterior q_φ, is now aware of the plan as

  p_ψ(s_1|z), p_ψ(s_2|z) and q_φ(s_1|q, z), q_φ(s_2|q, z).

We no longer need a hierarchical form like Eqn. 4, since the latent planning variable already encodes the information of the other sub-question. Thus, this formulation also alleviates the high variance issue commonly encountered in hierarchical variational training (Vahdat and Kautz, 2020). We also apply the planning guided mechanism to the generative modeling of single-hop QG:

  L_plan(s) = E_{q_ω(z|s)}[log p(s|z)] − KL(q_ω(z|s) || p_ω(z)),

where p(s|z) is modeled by letting the prior and the posterior in L_ELBO(s) be additionally conditioned on z. The realizations in Eqn. 8 and Eqn. 9 are now formulated as

  q_φ(q|s, z) = p_θ(q|ŝ, s) p_ψ(ŝ|z),
  p_ψ(q|z) = p_θ(q|s_1, s_2) p_ψ(s_1|z) p_ψ(s_2|z).

We implement the latent variable as a discretized VAE (van den Oord et al., 2017) by adding a learnable codebook between the encoder and the decoder. The codebook is a set of prototype vectors e_k, k ∈ {1, 2, ..., K}, each having the same dimensionality as the encoder output. The discrete variable is obtained by a nearest-neighbor lookup that finds the codebook vector closest to the encoder output. The corresponding prototype vector is then fed into the decoder as an additional context embedding to which every decoding step can attend. With this discretization bottleneck, the encoder-decoder model and the codebook can be jointly optimized.
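The nearest-neighbor lookup of the discretization bottleneck can be sketched as follows; the shapes, codebook contents, and helper name are illustrative, and gradient handling (e.g., straight-through estimation, as in van den Oord et al., 2017) is omitted:

```python
import numpy as np

def codebook_lookup(encoder_out, codebook):
    """Map an encoder output to the nearest codebook prototype.

    encoder_out: (d,) vector; codebook: (K, d) prototype vectors e_k.
    Returns the index z (the discrete planning variable) and the
    prototype vector that is fed to the decoder.
    """
    dists = np.linalg.norm(codebook - encoder_out, axis=1)  # (K,)
    z = int(np.argmin(dists))
    return z, codebook[z]

rng = np.random.default_rng(0)
K, d = 100, 8                                    # K prototypes of dimension d
codebook = rng.normal(size=(K, d))
enc = codebook[42] + 0.01 * rng.normal(size=d)   # encoder output near prototype 42
z, proto = codebook_lookup(enc, codebook)
```

In training, the quantization is non-differentiable, which is why a straight-through gradient and codebook/commitment losses are typically used to optimize the encoder and codebook jointly.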

Learning and Inference
We initialize the generation model p_θ, the prior p_ψ, and the posterior q_φ using BART, a pre-trained seq-to-seq model. BART uses the standard Transformer encoder-decoder architecture (Vaswani et al., 2017) and is optimized by reconstructing intentionally corrupted documents. We adopt an initial fine-tuning step for all three models using the question answering data D_S and D_M, which adjusts the initialization, pre-trained on general texts, to better fit the question generation tasks. We then optimize p_θ, p_ψ, and q_φ, together with the discretization bottleneck q_ω, using the generative modeling of both multi-hop and single-hop question answering data:

  L = Σ_{(q,a,c)∈D_M} L_ELBO(q) + Σ_{(s,a,c)∈D_S} L_ELBO(s),

where each ELBO term is additionally guided by the planning variable z as described in Sec. 3.3. After training the single-hop QG model (i.e., p_ψ), the question composing model (i.e., p_θ), and the bottleneck q_ω, inference follows the two-phase strategy. We first infer a latent planning variable given the evidence set and the answer. The sub-questions are then generated based on the inferred planning variable and composed into a multi-hop question.
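The inference procedure described above can be sketched with stub models; every function body below is a placeholder invented for illustration, standing in for the fine-tuned BART generators and the bottleneck:

```python
# Sketch of two-phase inference: plan -> sub-questions -> composed question.
# All model calls are stubs; real models are fine-tuned BART generators.

def infer_plan(answer, evidence):
    # Stub for the planning prior p_omega(z | a, c): pick a discrete code.
    return 0

def generate_subquestions(z, answer, evidence):
    # Stub for plan-conditioned single-hop QG, p_psi(s1 | z) and p_psi(s2 | z).
    s1 = f"[s1 asking about the answer entity {answer!r}]"
    s2 = f"[s2 linking evidence {evidence[0]!r}]"
    return s1, s2

def compose(s1, s2):
    # Stub for the question composing model p_theta(q | s1, s2).
    return f"{s1} + {s2} -> multi-hop question"

evidence = ["passage 1", "passage 2"]
z = infer_plan("AMG", evidence)
s1, s2 = generate_subquestions(z, "AMG", evidence)
question = compose(s1, s2)
```

The point of the sketch is the ordering: the plan z is sampled once up front, and both sub-questions condition on it before composition.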

Experiments
To show the effectiveness of the proposed approach, planning guided latent reasoning (PLAR), we experiment on two multi-hop question generation settings (Sec. 4.1). We compare against state-of-the-art approaches in both settings (Sec. 4.2). We further consider a question answering based performance measure and analyze the effectiveness of the proposed generative modeling (Sec. 4.3).

Settings
We use HOTPOTQA (Yang et al., 2018), a crowd-sourced multi-hop question answering (QA) dataset, in our experiments. It contains over 90K question answering examples, and the evidence set of each question includes relevant paragraphs from Wikipedia. The question-relevant sentences within these paragraphs are further annotated as supporting facts. We follow the original data split of HOTPOTQA, which includes 90,440 / 6,072 examples for training and evaluation, respectively. We further hold out 6,072 examples from the training data as the validation set. We use SQuAD (Rajpurkar et al., 2016) as the single-hop QA dataset, which has over 100K questions, also crowd-sourced from Wikipedia articles. Following conventional evaluation metrics, we use n-gram BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), and ROUGE-L (Lin and Hovy, 2002) to evaluate question generation quality.
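ROUGE-L scores a generated question by its longest common subsequence (LCS) with the reference. A minimal sketch of the F1 variant, assuming simple whitespace tokenization rather than the official toolkit:

```python
def lcs_len(xs, ys):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if x == y
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[-1][-1]

def rouge_l_f1(candidate, reference):
    """ROUGE-L as the LCS-based F1 (beta = 1 variant)."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

score = rouge_l_f1("who founded the amg team",
                   "who founded the amg racing team")  # 10/11
```

Unlike n-gram BLEU, the LCS does not require the matched tokens to be contiguous, so it rewards questions that preserve the reference's word order even with insertions.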
We consider two input settings to thoroughly evaluate the multi-hop question generation (QG) performance: sentence-level and paragraph-level. In the first setting, following the existing multi-hop QG task formulation (Pan et al., 2020), we take the question-relevant sentences (i.e., supporting facts) along with the answer as inputs to generate the question. However, human annotated supporting facts are not always available, while identifying two relevant paragraphs is relatively achievable. Thus, we further consider a paragraph-level setting where, besides the answer, we instead use the paragraphs containing the supporting facts as part of the input. In both settings, in order to simulate a low-resource scenario, we train PLAR and the baselines on two different subsets of the question answering examples, HOTPOT-10K and HOTPOT-30K, containing 10K and 30K randomly sampled training examples, respectively.
Note that we do not utilize any annotated question decomposition dataset (e.g., QDMR (Wolfson et al., 2020)). This is because it is labour-intensive to obtain the extra question decomposition annotations, which are not present in HotpotQA. Thus, it is not practical to assume such decomposition annotations would be available in different QA tasks. We aim to tackle this challenge by utilizing non-parallel single-hop questions, which is relatively easy to acquire and do not require extra task-specific annotations.
We compare with three baselines that are based on seq-to-seq models and are competitive on single-hop question generation tasks: ASs2s (Kim et al., 2019), Maxout-QG (Zhao et al., 2018), and BART. We compare with two baselines that consume auxiliary reasoning path features for multi-hop QG: RC-QG uses reasoning chains built via named entity recognition and relation extraction, and SG-DQG (Pan et al., 2020) adopts semantic role labeling techniques to build semantic graphs. We also compare the full PLAR with two variants: Pipeline individually trains a single-hop QG model and a question composing model using synthetic question decomposition data obtained following Perez et al. (2020), and PLAR w/o plan uses the same generative objectives as PLAR but without the planning mechanism.

We further find that Pipeline suffers a heavier performance decrease than the baselines when less data is available. For example, Pipeline outperforms RC-QG under BLEU-4 on HOTPOT-30K, while it is outperformed by RC-QG on HOTPOT-10K. This is largely because the training of each phase is performed individually, which is prone to data insufficiency, especially in a more extreme low-resource scenario. For the paragraph-level input setting, Table 4 shows that PLAR consistently outperforms the baselines by a large margin. For example, PLAR (17.79) achieves a gain of more than 33% over SG-DQG (13.37) under METEOR on HOTPOT-30K. Comparing PLAR (27.64) with PLAR w/o plan (24.81) and Pipeline (23.04) under ROUGE-L on HOTPOT-10K, we find that the contribution of the planning mechanism is more significant than that of the generative modeling. This is largely because the diversity of potential sub-questions raises greater challenges in the paragraph-level setting. Using the planning variables, PLAR can effectively generate feasible sub-questions. We also provide qualitative examples in the Appendix to show the effectiveness of the planning variables.
We also find that the reasoning path augmented baselines are not as competitive as in the sentence-level input setting. For example, RC-QG outperforms all seq-to-seq based baselines under METEOR in the sentence-level setting, while it only outperforms ASs2s in the paragraph-level setting. The reason is that handcrafted reasoning features cannot generalize well to a larger evidence set. PLAR overcomes this limitation by optimizing its reasoning capability with both single-hop and multi-hop QA data.

Discussion
We first study whether the generated questions can boost question answering performance. We compare the performance of a BERT QA model (Devlin et al., 2019) on both subsets, where the QA model is trained using QA data generated by different QG models. The results in Table 5 show that the learning of multi-hop QA models relies heavily on sufficient supervision, since a significant performance reduction is observed when training on a subset only. PLAR achieves more effective training than the baselines and its variants, especially on the more challenging subset. It achieves the largest performance gain (17.3%) over the subset-only training result under F1 on HOTPOT-30K. We also find that the QG results of BART do not improve QA performance, even though BART performs comparably to other baselines (e.g., SG-DQG) on automatic evaluation metrics. This aligns with our intuition that text fluency alone is insufficient for obtaining multi-hop questions that benefit the QA task; it is essential to incorporate reasoning into the generation process.
We now study the effect of unified generative question generation. To investigate how the generative multi-hop (Eqn. 5) and single-hop (Eqn. 7) objectives contribute to the overall question generation training, we compare PLAR with a variant using the multi-hop objective only and a variant using the planning guided multi-hop objective, under varying sizes of single-hop question answering data. The results on the two subsets under ROUGE-L in the sentence-level setting are shown in Fig. 2(a) and Fig. 2(b).
We can see that both objectives are important. For example, when using the complete single-hop QA data on HOTPOT-30K, the multi-hop and single-hop generative objectives bring 7.3% and 11.5% improvements, respectively. We further find that the performance gain of PLAR is largely attributed to the single-hop generative objective when available single-hop questions are limited. The reason is that without the generative single-hop objective, training the sub-question generation model relies heavily on the initial fine-tuning step, and is thus prone to single-hop QA data insufficiency. The full PLAR model addresses this limitation by further training the sub-question generation model with supervision from both generative single-hop and multi-hop QG.

Related Work
Question generation has a wide range of applications beyond expanding question answering data, such as initiating conversations in dialogue systems (Mostafazadeh et al., 2017), providing practice exercises for educational purposes (Jia et al., 2020), and accelerating real-time question answering (Seo et al., 2019). It also has great potential for enriching task-oriented dialogue datasets (Sun et al.; Huang et al., 2020a; Kim et al., 2020b). Early studies build on encoder-decoder models and utilize different evidence information, e.g., Wikipedia passages (Du and Cardie, 2018), reviews, and dialogue history (Gao et al., 2019). These studies often assume that the questions are single-hop and can be answered by one piece of evidence. As more high-quality multi-hop question answering datasets become available (e.g., HOTPOTQA (Yang et al., 2018)), recent years have seen a growing interest in multi-hop question generation. Most recent approaches add heuristically extracted features to the encoder-decoder model, which relies on large-scale training data and can still suffer from error propagation (Pan et al., 2020). A recent study that also considers low-resource question generation assumes that a large amount of unanswered multi-hop questions are available, which is likewise difficult to obtain. We aim to overcome these limitations in this study.
Our study is also related to generative modeling, which treats unobserved variables (e.g., features or labels) as latent variables and approximates their distribution through variational inference (Kingma and Welling, 2014). Generative modeling has been applied to dialogue response generation (Zhao et al., 2019; Huang et al., 2020b; Yang et al., 2020), policy learning (Huang et al., 2019, 2020c), sentiment analysis (Xu et al., 2017), knowledge retrieval (Kim et al., 2020a; Su et al., 2021; Tan et al., 2021), and text style transfer (He et al., 2020). While these works focus on utilizing unlabeled data to boost model performance, we aim to unify non-parallel question corpora to enable joint learning.

Conclusions
We proposed a jointly optimized two-phase model named PLAR for low-resource question generation. PLAR effectively utilizes non-parallel single-hop and multi-hop question answering data for optimization. We further designed a planning mechanism to guide the generation of sub-questions so that the generated results are valid for composing a multi-hop question. Experimental results confirm that PLAR outperforms the state-of-the-art under various metrics, especially in a question answering based evaluation. For future work, we will explore the heterogeneous multi-hop QG task that requires reasoning beyond plain text, e.g., over tables.

A Implementation Details
We use the HOTPOTQA split of the original paper (Yang et al., 2018), and use the SQuAD v1.1 (Rajpurkar et al., 2016) training set only, since single-hop question answering data is only involved in training. We use the BART-base model implementation from the huggingface library as the single-hop question generation model and the question composing model. We set the batch size to 32 in the sentence-level setting and 16 in the paragraph-level setting. The models are trained with Adam (Kingma and Ba, 2015) with an initial learning rate of 3e-5 on an NVIDIA GeForce RTX 2080 Ti. We use grid search to find the best hyperparameters based on validation performance, measured by a combination of METEOR, ROUGE-L, and BLEU scores. We set the codebook size of the planning mechanism (i.e., K) to 100, chosen from {50, 75, 100, 150, 200}.

B Sub-Question Generation Qualitative Analysis

Tables 6 and 7 show question generation results from PLAR and the Pipeline model. Tables 8 and 9 show generation results for different sampled planning variables z in the paragraph-level setting. We can see that with different predicted z (denoted by different z_i), PLAR raises different sub-questions and presents different high-level reasoning types. We also find that some planning variables cannot lead to a reasonable multi-hop question, and that PLAR's prediction can well capture the correct plan (denoted by a higher p(z|a, c)).