Chain-of-Questions Training with Latent Answers for Robust Multistep Question Answering

We train a language model (LM) to robustly answer multistep questions by generating and answering sub-questions. We propose Chain-of-Questions, a framework that trains a model to generate sub-questions and sub-answers one at a time by leveraging human-annotated question decomposition meaning representation (QDMR). The key technical challenge is that QDMR only contains sub-questions but not answers to those sub-questions, so we treat sub-answers as latent variables and optimize them using a novel dynamic mixture of Hard-EM and MAPO. Chain-of-Questions greatly outperforms strong neuro-symbolic methods by 9.0 F1 on the DROP contrast set, and outperforms GPT-3.5 by 24.3 F1 on the HOTPOTQA adversarial set, thus demonstrating the effectiveness and robustness of our framework.


Introduction
Multistep question answering (QA) poses a reasoning challenge that current state-of-the-art QA models have not fully addressed. Strong finetuned QA models like UnifiedQA (Khashabi et al., 2020a) can achieve impressive results on various QA tasks through multitask training, but exhibit subpar performance on multistep reasoning. Moreover, because some multistep reasoning benchmarks contain annotation artifacts or reasoning shortcuts (Jiang and Bansal, 2019), dedicated models trained on these benchmarks often have much lower F1 on contrast sets (Gardner et al., 2020) and adversarial sets (Schlegel et al., 2021), indicating their lack of robustness.
Prior research has attempted to tackle this challenge with various question decomposition strategies to explicitly incorporate reasoning chains into the question answering process. However, as we show in our experiments, existing methods (Andor et al., 2019; Chen et al., 2020) that perform explicit reasoning steps still suffer from robustness issues. Moreover, multistep reasoning methods are often engineered for a specific domain or type of multistep QA (Fu et al., 2021; Perez et al., 2020), and thus cannot be easily extended to other multistep QA settings. Prompting methods (Chen et al., 2020; Dua et al., 2022) have shown promise in generating multistep solutions to questions, but they require very large language models (LMs) as well as careful prompt engineering, and still lag behind fine-tuned methods (OpenAI, 2023).
To develop a robust multistep QA system, we propose a novel framework, Chain-of-Questions training with latent answers. Our framework trains a model to generate sub-questions and their corresponding sub-answers one at a time, as shown in Fig. 1, then aggregates those sub-answers to answer the original question. To define an appropriate set of sub-questions, we use question decomposition meaning representation (QDMR), an existing dataset with human-annotated sub-questions for questions from multiple multistep QA benchmarks. While QDMR is helpful, it only contains annotated sub-questions, not sub-answers, which makes training a QA system to generate sub-answers technically challenging. We view the sub-answers in the intermediate steps as latent variables, and apply Hard-EM (Neal and Hinton, 1998) to optimize these latent variables during training. To further improve performance, we use a memory buffer to store trajectories with high F1 score, inspired by Memory-Augmented Policy Optimization (MAPO; Liang et al., 2018), previously used for semantic parsing. Because starting with MAPO alone does not converge well, we design a dynamic loss function that combines the Hard-EM and MAPO objectives for fast improvement at the beginning and better final convergence.
We conduct experiments on DROP (Dua et al., 2019), HOTPOTQA (Yang et al., 2018), and their contrast and adversarial sets to evaluate the performance of our proposed Chain-of-Questions framework. On the contrast set of DROP, Chain-of-Questions outperforms neuro-symbolic baselines by 9.0 F1, and outperforms Chain-of-Thought on GPT-3.5 by 16.8 F1 despite using a much smaller model (T5-Large, 770M parameters). On the adversarial set of HOTPOTQA, Chain-of-Questions outperforms Longformer by 5.5 F1, and outperforms Chain-of-Thought on GPT-3.5 by 24.3 F1. Our experimental results demonstrate that Chain-of-Questions successfully leverages existing QDMR annotations to train an effective and robust multistep QA model.

Background and Related Work
We introduce the multistep QA benchmarks we use, as well as other methods using question decomposition during training and prompting.
Multistep QA Benchmarks. We focus on two popular multistep QA benchmarks: DROP and HOTPOTQA. DROP (Dua et al., 2019) focuses on questions that require discrete and symbolic reasoning. Most of its questions require multiple steps of retrieval and numerical execution. HOTPOTQA (Yang et al., 2018) contains 2-hop questions over 10 paragraphs. Other work has constructed contrast and adversarial sets to evaluate the robustness of models trained on these datasets. Gardner et al. (2020) created a contrast set for DROP (DROP-CS) by modifying test instances in ways that often change the correct answer. HOTPOTQA-ADV (Jiang and Bansal, 2019) adds adversarial paragraphs that do not change the correct answer but fool models that rely too heavily on reasoning shortcuts. We experiment on DROP, HOTPOTQA, and their robustness evaluation sets.
Training with Question Decomposition. Wolfson et al. (2020) introduce QDMR, a human-annotated question decomposition format and dataset. Since QDMR's for each question were annotated without looking at evidence passages, the QDMR dataset does not include sub-answers to any sub-questions within each QDMR. Subsequent work has used QDMR to help train QA models. TeaBReaC (Trivedi et al., 2022b) uses the QDMR decomposition graph to generate a large multistep QA dataset with synthetic contexts for pretraining. We show that TeaBReaC has complementary benefits with Chain-of-Questions, which adds explicit multistep reasoning at inference time. Guo et al. (2022b) train a model to generate QDMR's and provide them as context to a single-step QA model. Unlike this model, our model explicitly attempts to generate sub-answers of QDMR sub-questions. We compare with a single-step baseline similar to Guo et al. (2022b), and show that learning to generate sub-answers improves performance.
Other work on multistep QA, such as DecompRC (Min et al., 2019b), ONUS (Perez et al., 2020) and RERC (Fu et al., 2021), generates a decomposition with one model and the answers to the sub-questions with another model, although none of these use QDMR. These methods require a single-step QA model trained with other QA data, whereas our approach does not. Moreover, they rely on entity matching to decompose questions; such approaches do not naturally extend to tasks requiring forms of multistep reasoning that are not entity-centric, such as numerical reasoning.
Neuro-symbolic methods such as BERT-Calculator (Andor et al., 2019) and NeRd (Chen et al., 2020) generate functional programs for multistep numerical reasoning. However, they require the model to generate accurate programs in a single run, without observing the results of intermediate stages of computation. This process can make them more susceptible to learning simple reasoning shortcuts, compared with Chain-of-Questions.

Prompting with Question Decomposition. Chain-of-Thought prompting (Wei et al., 2022) inserts explicit reasoning chains into prompts to help language models answer compositional and multistep questions, especially ones involving mathematical reasoning. Subsequent work such as successive (Dua et al., 2022; Zhou et al., 2023), iterative (Zelikman et al., 2022), modularized (Khot et al., 2022) or tool-based prompting (Yao et al., 2023) filters or refines the reasoning process. However, LLMs are computationally expensive and require careful prompt engineering. We show that smaller, more efficient LMs can outperform LLMs when trained to output reasoning chains.

Problem Formulation
We define notation for the question decomposition annotation QDMR, as well as the multistep QA training and testing setups.

QDMR
Formally, a QDMR for a question q is a corresponding list of natural language sub-questions q^sub = [q^sub_1, q^sub_2, ..., q^sub_n], where answering each sub-question would lead to answering q (Wolfson et al., 2020). The number of sub-questions n = |q^sub| varies across questions q; if q requires multistep reasoning, n is usually greater than 1.
QDMR Parser. We assume access to a QDMR parser model g(·; ϕ), parameterized by ϕ, that takes in a question q and generates a corresponding QDMR, as proposed by Wolfson et al. (2020). The parser is trained on the original QDMR annotation, and can predict a QDMR for any new question.
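As a concrete illustration, such a parser can be obtained by fine-tuning a sequence-to-sequence LM to map a question to its serialized decomposition. The sketch below is a minimal, hypothetical version assuming a HuggingFace-style T5 interface; the example pair and the "step ; step ; ..." serialization are illustrative, not the exact format or training setup of Wolfson et al. (2020).

```python
# Minimal sketch of a QDMR parser g(.; phi): fine-tune a seq2seq LM to map a
# question to its decomposition, serialized as "step1 ; step2 ; ...".
# Assumes HuggingFace transformers; `qdmr_pairs` is illustrative example data.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

tok = AutoTokenizer.from_pretrained("t5-base")
parser = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

qdmr_pairs = [
    ("How many yards was the longest field goal?",
     "return field goals ; return yards of #1 ; return the highest of #2"),
]

optimizer = torch.optim.AdamW(parser.parameters(), lr=1e-4)
for question, decomposition in qdmr_pairs:
    inputs = tok(question, return_tensors="pt")
    labels = tok(decomposition, return_tensors="pt").input_ids
    loss = parser(**inputs, labels=labels).loss   # teacher-forced NLL
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# At inference time, g predicts a (silver) QDMR for any new question.
pred_ids = parser.generate(
    **tok("Who scored the last touchdown?", return_tensors="pt"),
    max_new_tokens=64)
print(tok.decode(pred_ids[0], skip_special_tokens=True))
```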

Multistep Question Answering
A QA model takes in a question q and a context passage c, and outputs a predicted answer â. We train our model on a training dataset of question-context-answer triples (q, c, a). Our multistep QA model answers a question q by generating and answering QDMR sub-questions one at a time.
Training Data. We assume that all QA training examples have a QDMR q^sub corresponding to question q. However, the human-annotated QDMR dataset only has gold QDMR's for a small fraction of QA examples. We consider two settings: (1) use only the examples with gold QDMR data for QA training (D_QDMR); (2) use the QDMR parser to generate silver QDMR for the rest of the QA training data (D_QDMR+). The sub-answers a^sub = [a^sub_1, a^sub_2, ..., a^sub_n] to each sub-question are not included in the QDMR, as QDMR annotators only looked at the question without supporting passages. We use â^sub = [â^sub_1, â^sub_2, ..., â^sub_n] to denote the model-predicted sub-answers. At inference time the model is given only the question q and context c as input, and must generate both sub-questions and sub-answers.

Proposed Methods
In this section, we discuss how to use the QDMR annotation and the QDMR parser, and how to train the QA model with Hard-EM and reinforcement learning. We combine question decomposition and latent variable learning to construct a generalizable multistep reasoning framework.

Chain-of-Questions Framework
Our model f(·; θ) predicts each QDMR sub-question and its corresponding sub-answer one at a time. During training, we feed the question and context into our model f, as in the blue box of Fig. 2, and first ask it to predict the first sub-question, q^sub_1, along with its corresponding sub-answer, â^sub_1. To separate the sub-question and the sub-answer, we mark the sub-question with a special [QDMR] token and the sub-answer with a special [QDMR-ANS] token in the model input and output. In the second iteration, we append the gold q^sub_1 and the predicted â^sub_1 after the question and context, thereby allowing the model to predict the second sub-question and its sub-answer based on the previous steps. This process is repeated until all sub-questions and sub-answers have been predicted. In the final iteration, the model outputs an [END-QDMR] tag indicating the end of the iterations. We use the final sub-answer â^sub_n as the answer to the input question.
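The following is a minimal sketch of the resulting inference loop, assuming a HuggingFace-style seq2seq model whose vocabulary has been extended with the [QDMR], [QDMR-ANS], and [END-QDMR] tokens and that has already been fine-tuned with Chain-of-Questions; the input formatting, step cap, and parsing logic are illustrative rather than the authors' exact implementation.

```python
# Minimal sketch of Chain-of-Questions inference with a CoQ-trained seq2seq LM.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("t5-large")
tok.add_tokens(["[QDMR]", "[QDMR-ANS]", "[END-QDMR]"])
model = AutoModelForSeq2SeqLM.from_pretrained("t5-large")
model.resize_token_embeddings(len(tok))   # in practice, load CoQ fine-tuned weights

def chain_of_questions(question, context, max_steps=8):
    prefix = f"question: {question} context: {context}"
    answer = None
    for _ in range(max_steps):
        ids = tok(prefix, return_tensors="pt", truncation=True)
        step = tok.decode(model.generate(**ids, max_new_tokens=64)[0],
                          skip_special_tokens=True)
        # Expected step format: "[QDMR] <sub-question> [QDMR-ANS] <sub-answer>",
        # with "[END-QDMR]" emitted on the final step.
        prefix += " " + step              # condition the next step on this one
        if "[QDMR-ANS]" in step:          # keep the latest sub-answer
            answer = step.split("[QDMR-ANS]")[-1].replace("[END-QDMR]", "").strip()
        if "[END-QDMR]" in step:          # model signals the end of iterations
            break
    return answer                         # final sub-answer = answer to q
```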

Learning with Latent Answers
We illustrate how the model f learns to predict the sub-questions and sub-answers. The sub-questions are provided in the annotation, so we can simply apply supervised learning to optimize the likelihood of the ground-truth sub-questions given the question, the context, and any previously predicted sub-answers, as in Eq. (1):

\ell_{SL}(\theta, q, c, q^{sub}) = -\sum_{j=1}^{n} \log p_\theta\big(q^{sub}_j \mid q, c, q^{sub}_{1:j-1}, \hat{a}^{sub}_{1:j-1}\big) \qquad (1)

where q^sub_{1:j} denotes the set {q^sub_i}_{i=1}^{j}, and likewise for a^sub_{1:j}. Notice that the model is trained to generate the gold next sub-question regardless of whether its previously predicted sub-answers are correct (Bengio et al., 2015). Because ground-truth sub-answers are not provided at training time, we treat the intermediate sub-answers as latent variables and use Hard-EM and RL to optimize them.
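Below is a minimal sketch of this supervised sub-question loss (Eq. 1), assuming a HuggingFace-style seq2seq model; the helper name, input formatting, and the source of the predicted sub-answers (which would come from the sampling procedure described next) are all illustrative assumptions.

```python
# Minimal sketch of Eq. (1): at step j, train the model to emit the gold j-th
# sub-question given the question, context, and previously generated steps.
import torch

def sub_question_loss(model, tok, question, context, gold_subqs, pred_subans):
    loss = torch.tensor(0.0)
    prefix = f"question: {question} context: {context}"
    for j, gold_q in enumerate(gold_subqs):
        inputs = tok(prefix, return_tensors="pt", truncation=True)
        labels = tok("[QDMR] " + gold_q, return_tensors="pt").input_ids
        loss = loss + model(**inputs, labels=labels).loss   # -log p(q_j | ...)
        # Append the gold sub-question and the model's predicted sub-answer so
        # the next step is conditioned on them, as in the framework description.
        prefix += f" [QDMR] {gold_q} [QDMR-ANS] {pred_subans[j]}"
    return loss
```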

Hard-EM.
A variant of the EM algorithm, Hard-EM (Neal and Hinton, 1998) assigns the most likely values to all latent variables and maximizes the expected log-likelihood of the label based on these values. Hard-EM helps to filter spurious ways to derive the correct answer. Min et al. (2019a) use Hard-EM for weakly supervised training on multi-mention QA tasks.
Since it is computationally infeasible to enumerate all possible sets of sub-answers to find the best set, we approximately compute the best â^sub with beam search (see Appendix A.1 for details). In particular, we pick the sequence of sub-answers ã^sub whose final step ã^sub_n equals a and whose prefix ã^sub_{1:n-1} has the highest likelihood of predicting a:

\tilde{a}^{sub} = \arg\max_{\hat{a}^{sub}: \hat{a}^{sub}_n = a} \; p_\theta\big(a \mid q, c, q^{sub}, \hat{a}^{sub}_{1:n-1}\big)

Following Hard-EM, we train the model to maximize the probability of both q^sub and ã^sub, which is equivalent to minimizing the negative log-likelihood:

\ell_{H\text{-}EM}(\theta, q, c, a) = \ell_{SL}(\theta, q, c, q^{sub}) + \ell_{SL}(\theta, q, c, \tilde{a}^{sub}) \qquad (2)

where ℓ_SL(θ, q, c, ã^sub) is the analogous negative log-likelihood over the selected sub-answers ã^sub.

Reinforcement Learning. From another perspective, we view each sub-answer as an action and the whole sequence of sub-answers as a trajectory. This lets us use reinforcement learning methods such as Memory-Augmented Policy Optimization (MAPO; Liang et al., 2018), originally designed for semantic parsing, to optimize the latent sub-answers. MAPO reduces the variance of policy gradient estimates with a memory buffer that stores high-reward trajectories (in our case, sequences of predicted sub-answers). We adapt MAPO to multistep QA as follows. While the original MAPO algorithm samples many independent trajectories using the model f, we instead use the trajectories from the beam, which reduces sampling time and yields better-quality trajectories. During training, we maintain a replay buffer B of high-quality sequences of predicted sub-answers. For each example (q, c, a), we choose the top 5 trajectories from the beam that are not in the replay buffer to use as out-of-memory trajectories. Next, we sample at most 5 in-memory trajectories from the replay buffer. We thus have a total of m different sub-answer trajectories (5 ≤ m ≤ 10), denoted {â^sub_i}_{i=1}^{m}, for each (q, c) example. Finally, we use the F1 score of the final predicted sub-answer â^sub_n as the reward R(â^sub) of the trajectory. The MAPO training objective uses both sets of trajectories to derive an unbiased stratified-sampling estimator of the policy gradient objective:

\ell_{MAPO}(\theta, q, c, a) = \ell_{SL}(\theta, q, c, q^{sub}) - \Big( r_B \sum_{\hat{a}^{sub} \in B} \pi^{+}_{\theta}(\hat{a}^{sub}) R(\hat{a}^{sub}) + (1 - r_B) \sum_{\hat{a}^{sub} \notin B} \pi^{-}_{\theta}(\hat{a}^{sub}) R(\hat{a}^{sub}) \Big) \qquad (3)

where r_B is the ratio of the number of trajectories in the buffer to the total number of sampled trajectories, and π^+_θ and π^−_θ are the probability distributions normalized inside and outside the buffer (Appendix A.2). As in step 3 of Fig. 2, after computing the objectives, we update the replay buffer B with high-reward examples from the beam.
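As a minimal sketch, the stratified reward term of Eq. (3) (excluding the shared ℓ_SL term) can be computed from the sampled trajectories as follows; tensor names and shapes are illustrative, and this is a reconstruction of the estimator described above rather than the authors' released code.

```python
# Minimal sketch of the stratified MAPO reward term over sub-answer trajectories.
# `traj_logprobs`: list of sequence log-prob tensors under the current model;
# `rewards`: F1-based reward per trajectory; `in_buffer`: whether each trajectory
# came from the replay buffer B.
import torch

def mapo_reward_loss(traj_logprobs, rewards, in_buffer):
    logp = torch.stack(traj_logprobs)                  # [m]
    r = torch.tensor(rewards)                          # [m]
    mask = torch.tensor(in_buffer, dtype=torch.bool)   # [m]
    r_B = mask.float().mean()                          # buffer ratio

    def expected_reward(sel):
        if sel.sum() == 0:
            return torch.tensor(0.0)
        pi = torch.softmax(logp[sel], dim=0)           # normalize within the stratum
        return (pi * r[sel]).sum()

    # Stratified estimator: weight in-buffer / out-of-buffer strata by r_B / (1 - r_B).
    objective = r_B * expected_reward(mask) + (1 - r_B) * expected_reward(~mask)
    return -objective                                  # minimize negative expected reward
```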

Chain-of-Questions Training Algorithm
As mentioned in Agarwal et al. (2019), MAPO works poorly at the beginning of training, as initially no sampled trajectories receive high reward.
Hard-EM provides a useful training signal at the start of training, but MAPO can help training converge to a better final model once some successful trajectories have been added to the buffer. Thus, we apply a mixture weight λ to dynamically balance Eq. (2) and Eq. (3). The overall Chain-of-Questions (CoQ) loss for a given example (q, c, a) is defined as:

\ell_{CoQ}(\theta, q, c, a) = (1 - \lambda)\, \ell_{H\text{-}EM}(\theta, q, c, a) + \lambda\, \ell_{MAPO}(\theta, q, c, a) \qquad (4)

where λ is the proportion of training examples with at least one trajectory in the replay buffer. We rely on Hard-EM at the beginning of training, but switch to MAPO as the model finds successful trajectories.
Note that ℓ_SL is used in both ℓ_MAPO and ℓ_H-EM.
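A minimal sketch of this dynamic mixture (Eq. 4) is shown below; the function name and how λ is tracked per epoch are assumptions for illustration.

```python
# Minimal sketch of the dynamic Chain-of-Questions loss (Eq. 4): lambda is the
# fraction of training examples that already have at least one trajectory in the
# replay buffer, so training starts as pure Hard-EM and shifts toward MAPO.
def coq_loss(hard_em_loss, mapo_loss, n_examples_with_buffer_hit, n_examples):
    lam = n_examples_with_buffer_hit / max(n_examples, 1)
    return (1 - lam) * hard_em_loss + lam * mapo_loss
```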

Experiments
We present our experimental setup and show that CoQ outperforms other multistep fine-tuning and prompting baselines across multiple benchmarks.

Experimental Details
Datasets. For in-distribution evaluation, we use DROP and HOTPOTQA. DROP contains 77,400 training examples and 9,536 validation examples. HOTPOTQA contains 72,928 training examples and 5,901 validation examples. D_QDMR contains the question decompositions of 7,705 training examples from DROP and 6,233 training examples from HOTPOTQA. We use a T5-base (Raffel et al., 2020) QDMR parser to create D_QDMR+, which has question decompositions for all training examples. We do not use QDMR examples from other datasets during training; e.g., we only use the DROP examples in D_QDMR for training on DROP. For robustness evaluation, we use DROP-CS and HOTPOTQA-ADV (see Section 2 and Appendix B.1).

Fine-tuning Baselines. We compare Chain-of-Questions to several fine-tuning methods and prompting methods. For fine-tuning methods, we compare with BERT-Calculator (Andor et al., 2019), NeRd (Chen et al., 2020), and Longformer (Beltagy et al., 2020), as well as the single-step baselines described below.

Prompting Baselines. Standard Prompting (Brown et al., 2020): we prompt GPT-3.5 to generate the answer without providing any in-context examples. We engineered prompts to make these baselines as competitive as possible, as detailed in Appendix C.
Single-step Baselines. We compare to two single-step baselines. One is standard fine-tuning on the given dataset, and the other is fine-tuning with QDMR-augmented contexts. For the latter, we concatenate all sub-questions from the question decomposition to (q, c) as the input to the model and fine-tune it to perform single-step QA (i.e., directly generate the answer). At inference time, we first use the QDMR parser g to generate the sub-questions for each question, and input them along with (q, c) to the model. This baseline is similar to Guo et al. (2022b), who use the same QDMR parser and the same model; the only difference is that they train the QDMR parser with Hard-EM.

Task-Specific Modifications
To match the in-distribution performance of state-of-the-art systems, we make task-specific modifications for DROP and HOTPOTQA.
Modifications for DROP. Smaller language models such as T5-B and T5-L struggle with numerical operations. Since DROP focuses on numerical reasoning, analogous to how BERT-Calculator helps the model do arithmetic, we add a regular expression matching module that can handle basic numerical operations. We note that only the last sub-question of a DROP example may require a numerical operation. Hence, we add regular expression matching and the "[REGEX]" tag in the last step.
For each example, we take the last sub-question generated by the model f, parse it into a functional program based on keyword matching, and execute it. If both parsing and execution succeed, we put the "[REGEX]" token in front of the numerical execution result and feed them together into the answer generation process. Otherwise, we keep the answer generation process unchanged, as in Fig. 2.
Modifications for HOTPOTQA. Predicting the supporting facts is an auxiliary task of HOTPOTQA used in many models (Groeneveld et al., 2020; Beltagy et al., 2020). Following the same input format as Longformer, we add supporting fact (SF) prediction and span prediction (SP) as auxiliary tasks on the encoder. We use a two-layer feedforward network for supporting fact prediction and a one-layer classification head for span prediction. We perform these two tasks at each run of model f and add the two cross-entropy losses to ℓ_CoQ.
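The sketch below illustrates one way these two auxiliary heads could be implemented on top of the encoder states; the class and argument names, and the choice of sentence-level pooling for supporting facts, are assumptions rather than the authors' exact architecture.

```python
# Minimal sketch of the HOTPOTQA auxiliary heads: a two-layer feed-forward
# network for supporting-fact (SF) classification over sentence representations,
# and a single linear layer producing start/end logits for span prediction (SP).
import torch
import torch.nn as nn

class AuxiliaryHeads(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.sf_head = nn.Sequential(               # two-layer SF classifier
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 2),
        )
        self.span_head = nn.Linear(hidden_size, 2)  # start / end logits (SP)

    def forward(self, token_states, sentence_states):
        sf_logits = self.sf_head(sentence_states)            # [num_sents, 2]
        start_logits, end_logits = self.span_head(token_states).unbind(-1)
        return sf_logits, start_logits, end_logits

# The cross-entropy losses on these heads are added to the CoQ loss at each run of f.
```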

Results
We show results for DROP and DROP-CS in Table 1, and results for HOTPOTQA and HOTPOTQA-ADV in Table 2. The results indicate the effectiveness and robustness of Chain-of-Questions.
Chain-of-Questions outperforms other baselines.
In Table 1, CoQ w/ D_QDMR+ on DROP outperforms all baselines on F1 by 2.6% in-distribution and 7.8% on the contrast set. Moreover, the T5-L version scores 3.5% higher F1 than the recently released GPT-4 on DROP (OpenAI, 2023). The model can be further improved by initializing with TeaBReaC pretraining. Similarly, in Table 2, CoQ w/ D_QDMR+ on HOTPOTQA is on par with Longformer in-distribution and 5.5% higher on the adversarial set. On both datasets, CoQ has a smaller performance gap between the in-distribution dev set and the robustness evaluation set compared with the baselines, indicating its strong robustness.

Chain-of-Thought prompting is weaker than fine-tuning methods on multistep QA. On both DROP and HOTPOTQA, prompting methods have lower F1 than fine-tuning methods, which indicates the difficulty of prompting large language models to do multistep QA. Moreover, Chain-of-Thought is 2.9% worse than zero-shot Standard Prompting on HOTPOTQA F1. We find it difficult to design prompts for GPT-3.5 that lead to clean and concise answers under the Chain-of-Thought setup. HOTPOTQA-ADV benefits from question decomposition, which suggests Standard Prompting may take reasoning shortcuts on HOTPOTQA examples.
MAPO works better on HOTPOTQA than DROP. Using MAPO in addition to Hard-EM, as opposed to Hard-EM alone, leads to larger gains in F1 on HOTPOTQA (+1.2 to +2.5%) than on DROP (+0.1 to +0.9%). Since the sub-answers should be spans in the context for both DROP and HOTPOTQA, our hypothesis is that the span prediction task provides an inductive bias for actions to focus on spans in the context, which makes the model more likely to generate correct answers. The context of 2-paragraph HOTPOTQA is also shorter than that of DROP, which makes the action space smaller.
Task-specific modifications are necessary. Both the regular expression module for DROP and the auxiliary tasks (SF & SP) for HOTPOTQA improve F1 by more than 5% on the in-distribution dev sets and 3% on the robustness evaluation sets in Tables 1 and 2. This shows that task-specific modifications are an important part of Chain-of-Questions.
Chain-of-Questions can generalize to benchmarks with no QDMR annotation. To test whether our framework is still effective when annotated QDMR's are not available for the training dataset, we conduct additional experiments in which we omit the QDMR annotations for one dataset, then run CoQ on that dataset using only QDMR generated by a QDMR parser trained on other datasets. Recall that in our main DROP experiments, D_QDMR+ includes both human-annotated gold QDMR and silver QDMR generated by a QDMR parser trained with the QDMR annotation of DROP. Instead, we now train a QDMR parser with only the QDMR annotation of COMPLEXWEBQUESTIONS (Talmor and Berant, 2018) and HOTPOTQA. Then, we use that QDMR parser to generate bronze QDMR augmentation for the whole DROP dataset, and use these to run CoQ. In this setting, compared to training on D_QDMR+, CoQ training on bronze QDMR results in decreases of 4-5% F1 on DROP and DROP-CS and 1-2% F1 on HOTPOTQA and HOTPOTQA-ADV. Nonetheless, these F1 scores are still much better than the single-run baselines on the contrast and adversarial sets. These experiments show that CoQ can be effective even on benchmarks without QDMR annotations. Training on bronze QDMR is also much more effective than training only on gold D_QDMR, as shown in Table 3. This suggests that having a large amount of training data greatly helps CoQ, even if that data is of lower quality; future work could explore using more unannotated data to further improve performance.

Qualitative Analysis on QDMR
We conduct qualitative analysis on QDMR to check its quality and effectiveness.
How good are the generated sub-questions and sub-answers? We list three multistep QA examples in Table 4 to show the quality of the generated QDMR sub-questions and sub-answers. The generated sub-questions for the DROP (1st) and HOTPOTQA (3rd) examples are correct decompositions. In the DROP-CS (2nd) example, the generated QDMR is also a valid decomposition, although answering the first two generated sub-questions requires additional reasoning steps. Note that QDMR is annotated by only looking at the questions (Wolfson et al., 2020), and the QDMR sub-question generation in Chain-of-Questions is trained by supervised learning. The QDMR annotation may therefore cause the sub-question generation to focus only on the question, and thus generate sub-questions that require multiple reasoning steps over the passage, which are hard for the model to answer.
In the HOTPOTQA (3rd) example, the final answer to the question already appears as the first sub-answer, even though it is not a correct answer to the first sub-question. The model may use reasoning shortcuts to emit the final answer as an early sub-answer, which helps it produce the final answer at the last step but does not actually answer the sub-question.
How do the generated sub-questions help the QA model? We hypothesize that generated sub-questions help the model return the right answer for the right reason, and we test this by checking whether it can correctly identify supporting facts in HOTPOTQA-ADV. For each question, the model predicts two supporting facts in the context using the encoder output embedding. We observe that the accuracy of supporting fact prediction improves as we incorporate the question decomposition into the input.
For example, given the question "What government position was held by the woman who portrayed Corliss Archer in the film Kiss and Tell?", the ground truth supporting facts are two sentences, "Kiss and Tell is ... Shirley Temple as Corliss Archer." and "As an adult, she ... served as Chief of Protocol of the United States.", from two different paragraphs.
In the first run, the input to the encoder is only the question and context; the model predicts the first supporting fact correctly, but takes a sentence from an adversarial paragraph, "As an adult, she ... served as Chief of treaty of the United States.", which refers to a different person, as the second supporting fact. However, in the second run, when the input to the encoder contains the first sub-question and sub-answer ("[QDMR] return the woman who portrayed Corliss Archer in the film Kiss and Tell [QDMR-ANS] Shirley Temple"), the model selects the second sentence from the paragraph about Shirley Temple, thus predicting both supporting facts correctly. For all the 2-hop questions in HOTPOTQA-ADV, the supporting fact prediction accuracy is 75.6% during the first run and rises to 83.2% during the second run. This increase shows the effectiveness of generating sub-questions step-by-step, which helps the model gradually filter out close but irrelevant context.

Discussion
We present Chain-of-Questions (CoQ), a robust sub-question generation and answering framework that shows strong performance on DROP and HOTPOTQA. CoQ uses a combination of Hard-EM and MAPO for training, effectively optimizing the latent variables associated with sub-answers of intermediate questions.
We envision multiple directions for future work. CoQ requires supervision from QDMR; other families of RL methods that we did not explore might reduce our reliance on this supervision and instead allow the model to learn appropriate decompositions from scratch. On the other hand, we could also explore using different question decompositions, such as ones generated by LLMs like GPT-3.5. Either approach could help us extend CoQ to other multistep reasoning datasets with no QDMR annotation. Similar to DROP, FINQA (Chen et al., 2021) consists of numerical reasoning questions over financial data. In a similar format to HOTPOTQA, MUSIQUE (Trivedi et al., 2022a) contains 3-hop and 4-hop retrieval questions. ROPES (Lin et al., 2019) requires complex multistep reasoning between a background context paragraph and a situation context paragraph. We could either train models on these datasets if we can eliminate our reliance on QDMR data, or test whether models trained with CoQ transfer well to these other datasets.

Limitations
Due to GPU resource constraints, we were unable to scale our method up to larger models such as T5-3B. However, the smaller models we experimented with already show good performance. Similarly, for computational efficiency, we did not try more advanced on-policy reinforcement learning algorithms, but we find that MAPO already yields a good improvement in F1.
Chain-of-Questions still requires task-specific modifications for different multistep QA benchmarks; we did not find a good way to build a universal model that is highly effective on all datasets. UnifiedQA (Khashabi et al., 2020b) constructs a unified model for multiple QA benchmarks, but their largest model (T5-11B) still performs poorly on DROP, again suggesting the importance of dataset-specific modifications.
Finally, the best version of Chain-of-Questions requires QDMR annotation during training, which is only available for some datasets. To reduce this dependence, we show some generalization of CoQ by assuming no QDMR annotation on DROP and training on bronze question decompositions generated by the QDMR parser. CoQ could also be tested on other datasets without QDMR annotation to evaluate its transferability, by using the QDMR parser to produce bronze annotations. The method could be further explored with question decompositions generated by LLMs.

A Chain-of-Questions Algorithm Details
We list omitted details of Hard-EM and MAPO in the Chain-of-Questions algorithm.

A.1 Sub-Answer Sampling Strategy
Given q^sub and â^sub, we compute the likelihood of the interleaved sequence of sub-questions and sub-answers as:

p_\theta\big(q^{sub}, \hat{a}^{sub} \mid q, c\big) = \prod_{j=1}^{n} p_\theta\big(q^{sub}_j \mid q, c, q^{sub}_{1:j-1}, \hat{a}^{sub}_{1:j-1}\big)\, p_\theta\big(\hat{a}^{sub}_j \mid q, c, q^{sub}_{1:j}, \hat{a}^{sub}_{1:j-1}\big) \qquad (5)

We use beam search to sample the sub-answers, keeping a beam of size k. In each iteration, we expand each beam entry with b different answers and select the top-k highest-likelihood candidates to form the new beam, following Eq. (5). We choose k = 25 and b = 5 in our experiments. Thus, we have 5 different candidates for â^sub_1 after the first run, 25 different candidates for (â^sub_1, â^sub_2) after the second run, and in general 25 candidates for â^sub_{1:j} after the j-th run for j ≥ 2.
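The following is a minimal sketch of this beam procedure; the `propose_answers` callable stands in for generating b candidate sub-answers (with their step log-likelihoods) from the model and is hypothetical.

```python
# Minimal sketch of the sub-answer beam: keep k partial trajectories, expand each
# with b candidate sub-answers per step, and retain the k most likely prefixes.
import heapq

def beam_search_subanswers(propose_answers, num_steps, k=25, b=5):
    beam = [((), 0.0)]                            # (trajectory prefix, log-likelihood)
    for step in range(num_steps):
        candidates = []
        for prefix, logp in beam:
            for answer, step_logp in propose_answers(prefix, step, b):
                candidates.append((prefix + (answer,), logp + step_logp))
        beam = heapq.nlargest(k, candidates, key=lambda x: x[1])
    return beam
```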
The challenge is to ensure that the sampling provides some correct sub-answers. Notice, however, that roughly 10% of annotated QDMR's across DROP and HOTPOTQA are single-step decompositions (i.e., n = 1). Our hypothesis is that the model can learn to do single-step QA from these examples, which also helps it learn to produce meaningful sub-answers to sub-questions that come from multistep QDMR's.

A.2 Difference from Original MAPO
Given a policy model π(·; θ) and a replay buffer B, the original MAPO objective is:

\mathcal{O}_{MAPO}(\theta) = r_B\, \mathbb{E}_{\hat{a}^{sub} \sim \pi^{+}_{\theta}}\big[R(\hat{a}^{sub})\big] + (1 - r_B)\, \mathbb{E}_{\hat{a}^{sub} \sim \pi^{-}_{\theta}}\big[R(\hat{a}^{sub})\big]

where R(â^sub) denotes the reward of the trajectory, and r_B is the ratio of the number of trajectories in the buffer to the total number of sampled trajectories, used to derive an unbiased stratified sampling estimator of the gradient. π^+_θ(â^sub) and π^−_θ(â^sub) are the probability distributions normalized inside and outside the buffer, defined as:

\pi^{+}_{\theta}(\hat{a}^{sub}) = \frac{\pi_{\theta}(\hat{a}^{sub})}{\sum_{\hat{a}' \in B} \pi_{\theta}(\hat{a}')} \ \text{ for } \hat{a}^{sub} \in B, \qquad \pi^{-}_{\theta}(\hat{a}^{sub}) = \frac{\pi_{\theta}(\hat{a}^{sub})}{\sum_{\hat{a}' \notin B} \pi_{\theta}(\hat{a}')} \ \text{ for } \hat{a}^{sub} \notin B.

Notice that sampling a new trajectory is expensive in our scenario, especially with a large model. Instead of using samples from the policy model, we use the trajectories from the beam, which reduces sampling time and yields better-quality trajectories.
On the other hand, MAPO stores trajectories with rewards greater than 0 in B. However, wrong answers that contain some correct words can still receive an F1 score greater than 0 under our reward function. Hence, we select trajectories with R(â^sub) > 0.8 and store these in the replay buffer B.
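A minimal sketch of this thresholded buffer update is shown below; the function signature and the representation of trajectories as lists of sub-answers are assumptions for illustration.

```python
# Minimal sketch of the replay-buffer update: only trajectories whose final
# sub-answer scores F1 > 0.8 against the gold answer are stored, keeping
# partially-correct (spurious) answers out of the buffer.
def update_buffer(buffer, trajectories, gold_answer, f1_fn, threshold=0.8):
    for traj in trajectories:                      # traj = list of sub-answers
        if f1_fn(traj[-1], gold_answer) > threshold and traj not in buffer:
            buffer.append(traj)
    return buffer
```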

B Implementation Details
We list implementation details and hyperparameter choices below.

B.1 Datasets
Jiang and Bansal (2019) constructed HOTPOTQA-ADV by generating up to 8 adversarial paragraphs for a given HOTPOTQA example. Because we trained on the 2-paragraph HOTPOTQA, we generate a 4-paragraph version of HOTPOTQA-ADV to reduce the length generalization gap. We take the adversarial set from Jiang and Bansal (2019), keep the examples with at least 2 adversarial paragraphs, and select their intersection with the validation set of 2-paragraph HOTPOTQA. We randomly choose the 2 adversarial paragraphs, and randomly order them with the 2 supporting paragraphs.
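The sketch below illustrates this construction; the field names of the example dictionary are hypothetical, not the actual HOTPOTQA-ADV schema.

```python
# Minimal sketch of building the 4-paragraph HOTPOTQA-ADV split: keep examples
# with at least two adversarial paragraphs, sample two of them, and shuffle them
# with the two supporting paragraphs.
import random

def build_4para_adv(example):
    if len(example["adversarial_paragraphs"]) < 2:
        return None                                   # filtered out
    adv = random.sample(example["adversarial_paragraphs"], 2)
    paragraphs = example["supporting_paragraphs"] + adv
    random.shuffle(paragraphs)                        # random order of all 4
    return {"question": example["question"], "paragraphs": paragraphs}
```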

B.2 DROP Regular Expression
At execution time, we detect the output [END-QDMR] token that marks the last sub-question.
We search for specific keywords in the last sub-question and match them with numerical operations, e.g., we match "higher" to max and "less" to min. The keyword matching perfectly matches the ground truth QDMR annotation to numerical operations. We use 7 operators in total: max, min, sum, diff, mul, div, or.
We take the last sub-question generated by the model f, parse it into a functional program based on the keyword matching, and execute it. If both parsing and execution succeed, we put the "[REGEX]" token in front of the numerical execution result and feed them together into the answer generation process. For example, if the last sub-question generated is "[QDMR] return the largest of 4 and 3 [END-QDMR]", we input "[QDMR] return the largest of 4 and 3 [END-QDMR] [REGEX] 4" into the answer generation process.
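A minimal sketch of this module is given below; the keyword lists and the number-extraction regex are illustrative placeholders, not the exact rules used in our implementation.

```python
# Minimal sketch of the DROP regular-expression module: map keywords in the last
# sub-question to one of the seven operators, extract the numbers, and execute.
import re

OPS = {
    "max": max, "min": min, "sum": sum,
    "diff": lambda xs: xs[0] - xs[1], "mul": lambda xs: xs[0] * xs[1],
    "div": lambda xs: xs[0] / xs[1], "or": lambda xs: xs,   # kept for completeness
}
KEYWORDS = {"highest": "max", "largest": "max", "higher": "max",
            "lowest": "min", "smallest": "min", "less": "min",
            "total": "sum", "difference": "diff"}

def execute_last_subquestion(sub_question):
    numbers = [float(x) for x in re.findall(r"-?\d+\.?\d*", sub_question)]
    for keyword, op in KEYWORDS.items():
        if keyword in sub_question:
            needs_two = op in ("diff", "mul", "div")
            if len(numbers) >= (2 if needs_two else 1):
                return OPS[op](numbers)
    return None   # parsing failed: fall back to normal answer generation

# e.g. execute_last_subquestion("return the largest of 4 and 3") -> 4.0
```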

Figure 1: Single-step QA vs. Chain-of-Questions. We show that a single model with sub-question generation and answering works better than single-step QA on questions that require multistep reasoning.

Figure 2: Chain-of-Questions framework and its training process with Hard-EM and MAPO. In the left panel, the blue box contains the input question and context, the pink box shows the intermediate sub-questions and sub-answers, and the orange box shows the final sub-question and final answer. Words in red braces {} are sampled as latent variables; the other words are learned with supervision. In the right panel, during training, we first select the best of the top-k candidates to compute the Hard-EM loss, then combine these candidates with samples from the buffer to compute the MAPO loss. Finally, we add high-reward candidates to the buffer.

Table 1: F1 scores on the DROP dev set and DROP-CS. Blue bold is the best model, purple bold is the second best. Chain-of-Questions outperforms other baselines on both the in-distribution dev set and the contrast set. Integrating Chain-of-Questions with TeaBReaC pretraining further improves performance. ⋆ We report our reproduced results for GPT-3.5, which are slightly lower than the official report (64.1 on DROP with few-shot prompting).

Table 2: F1 scores on the HOTPOTQA dev set and HOTPOTQA-ADV. Blue bold is the best model, purple bold is the second best. Chain-of-Questions matches the performance of Longformer on the in-distribution dev set, and outperforms all baselines on the adversarial set.

Table 4: Qualitative analysis of model-generated reasoning chains. We use black bold to mark relevant information in the context, blue bold to mark correct final answers, and red bold to mark wrong final answers. We find that (1) most generated sub-questions are correct; (2) a sub-question generated by CoQ models can contain multiple reasoning steps; and (3) the model may use reasoning shortcuts to generate the final answer as a sub-answer for early sub-questions.