Teaching Broad Reasoning Skills for Multi-Step QA by Generating Hard Contexts

Question-answering datasets require a broad set of reasoning skills. We show how to use question decompositions to teach language models these broad reasoning skills in a robust fashion. Specifically, we use widely available QDMR representations to programmatically create hard-to-cheat synthetic contexts for real questions in six multi-step reasoning datasets. These contexts are carefully designed to avoid common reasoning shortcuts prevalent in real contexts that prevent models from learning the right skills. This results in a pretraining dataset, named TeaBReaC, containing 525K multi-step questions (with associated formal programs) covering about 900 reasoning patterns. We show that pretraining standard language models (LMs) on TeaBReaC before fine-tuning them on target datasets improves their performance by up to 13 F1 points across 4 multi-step QA datasets, with up to 21 point gain on more complex questions. The resulting models also demonstrate higher robustness, with a 5-8 F1 point improvement on two contrast sets. Furthermore, TeaBReaC pretraining substantially improves model performance and robustness even when starting with numerate LMs pretrained using recent methods (e.g., PReasM, POET). Our work thus shows how to effectively use decomposition-guided contexts to robustly teach multi-step reasoning.


Introduction
Multi-step Question Answering (QA) is a complex problem that requires a wide variety of reasoning skills. In addition to basic reading comprehension (RC), models must connect multiple pieces of information, sometimes employ numerical and other forms of discrete reasoning, and compose these skills as needed for the question.

Figure 1: TEABREAC Dataset Construction. We leverage widely available question decomposition annotations (QDMRs) for real questions from a broad range of datasets to carefully construct synthetic contexts such that answering the resulting question requires proper multi-step reasoning. These questions are further rebalanced to help teach a broad set of reasoning skills.

However, even though questions in multi-step datasets often cover a broad range of interesting reasoning patterns, most questions follow only a few patterns, which is what models trained on these datasets naturally focus on. Moreover, the contexts occurring in existing RC datasets often contain artifacts and reasoning shortcuts (Min et al., 2019a; Chen and Durrett, 2019; Trivedi et al., 2020). Such contexts allow models to find the answer while bypassing some reasoning steps, in turn preventing models from learning the intended reasoning skills.
How, then, can we teach models broad multi-step reasoning skills? One way is to have greater control over the distribution of reasoning patterns and the types of input contexts models see during training: contexts that don't allow models to easily succeed via shortcuts. We observe that questions in existing datasets (henceforth referred to as "real questions") already cover a wide variety of reasoning patterns. The challenge, then, is to teach these reasoning patterns robustly, even when they are relatively rare (e.g., 4-6 step reasoning). As a means to this end, we turn to synthetic context generation for real questions. Specifically, we propose to construct contexts for real questions synthetically from scratch (instead of perturbing existing contexts), resulting in much greater control over reasoning shortcuts. Further, context generation also enables us to balance out the distribution of reasoning patterns, e.g., by synthesizing additional contexts (and thereby examples) for questions from the long tail of underrepresented reasoning patterns.
Our use of synthetic contexts to reliably teach broad skills is inspired by three strands of recent RC QA research. One strand has shown that skills learnt over synthetic data can indeed transfer to real datasets (Geva et al., 2020; Yang et al., 2021; Yoran et al., 2022; Pi et al., 2022). A second strand has shown that perturbing the existing (natural) contexts of RC instances in a targeted fashion can reduce artifact-based reasoning (Jia and Liang, 2017; Trivedi et al., 2020). A third strand has shown that carefully constructing contexts (for synthetic questions) to have sufficient distractors can reduce artifacts (Trivedi et al., 2022; Khot et al., 2022a).
Building upon these three strands, we introduce TEABREAC, a teaching dataset that includes carefully constructed synthetic contexts for a broad set of real multi-step questions sourced from six existing datasets. TEABREAC was designed with the goals of strong control over cheatability and balanced coverage of reasoning patterns. To identify the intended reasoning, we leverage question decomposition annotations, specifically Question Decomposition Meaning Representation or QDMR annotations, which are widely available for a broad set of datasets (Wolfson et al., 2020).
Figure 1 shows the overview of our construction process for TEABREAC. Our approach relies on treating a question decomposition as an unambiguous typed program that can be used to generate a synthetic context and can be executed to provide an answer. To this end, we first turn natural language QDMRs into precise typed programs. We then construct a synthetic context by asserting a set of facts that relate to various parts of the multi-step question. We do this by grounding the predicates of the QDMR (e.g., field goals of Shayne Graham in Fig. 1) with randomly generated entities. We also add distractor statements to the context to ensure that bypassing reasoning steps results in an incorrect answer. The resulting contexts are hard to cheat on and thereby force models to learn the intended reasoning. We then add an outer loop around this process that ensures that the reasoning patterns, as measured by the program signatures of the questions, remain balanced in the final dataset. This forces models to learn a broad range of reasoning patterns instead of focusing on the few dominant ones. Finally, similar to prior work (Geva et al., 2020), we also add simpler single-step questions to teach individual primitive skills underlying our formal programs.
Our experiments demonstrate that pretraining large language models (LMs) on TEABREAC before fine-tuning on target multi-step QA datasets results in significant improvements, by up to 13 F1 points, on multiple in-distribution evaluation sets (DROP (Dua et al., 2019), TAT-QA (Zhu et al., 2021), IIRC (Ferguson et al., 2020), and NumGLUE (Mishra et al., 2022)), as well as improvements of 5-8 points on two contrastive evaluation sets of DROP. Furthermore, even if we start with numerate LMs already pretrained with methods from similar past work (Geva et al., 2020; Yang et al., 2021; Yoran et al., 2022; Pi et al., 2022), TEABREAC provides further improvement of up to 11 F1 points. Interestingly, TEABREAC is substantially more beneficial for more complex questions (those with more reasoning steps), improving the T5-Large model by about 20 F1 points on questions with 5 or more steps. More generally, we expect TEABREAC to be most valuable for datasets that require complex aggregation operations and their diverse compositions.
In summary, we make three contributions: (1) A novel methodology to create a teaching dataset (a) with broad reasoning skills covering a wide range of multi-step reasoning patterns and (b) leveraging existing QDMR annotations to carefully construct contexts that require true multi-step reasoning. (2) The TEABREAC teaching dataset with over 525K questions covering about 900 reasoning patterns or program signatures. (3) An empirical demonstration that pretraining on TEABREAC before fine-tuning makes both regular and numerate LMs much more effective and robust at multi-step reasoning, especially for more complex questions.

Related Work
Question Decompositions have been used to build stronger models (Talmor and Berant, 2018; Min et al., 2019b; Khot et al., 2021) and challenge evaluation sets by modifying the questions (Geva et al., 2022). In contrast, our goal in this work is to use decompositions to teach broad multi-step reasoning skills to any text-to-text model by creating challenging contexts for real questions.
Building synthetic datasets to teach requisite skills has been considered in prior work, but limited to only numeric reasoning skills (Geva et al., 2020; Yang et al., 2021) or a few templated multi-step reasoning patterns (Yoran et al., 2022; Pan et al., 2021). Even pretraining on program executions (arithmetic, logic-based, and SQL-based) has been shown to help on multi-step QA tasks (Pi et al., 2022). In this work, we use real questions from a wide variety of datasets and show larger gains than these prior models. We even improve these prior models by fine-tuning on our dataset.
We create more robust models by teaching reasoning skills via a dataset carefully designed to avoid shortcuts. Past work often focuses on identifying lack of robustness via analysis (Min et al., 2019a; Trivedi et al., 2020) or challenge evaluation sets (Jiang and Bansal, 2019; Geva et al., 2022).
Lastly, we define new conditions for constructing contexts for real questions with minimal reasoning shortcuts. This differs from prior work that only provides conditions to measure reasoning shortcuts in existing datasets (Trivedi et al., 2020). The "MuSiQue condition" of Trivedi et al. (2022) targets the construction of new non-cheatable multi-step datasets. We enforce this condition in TEABREAC and introduce two additional ones that are especially pertinent to our construction. Appendix A includes additional discussion.

Teaching Broad-Coverage Reasoning Skills in a Robust Fashion
Multi-step questions come in a wide variety. Some involve numeric operations (Dua et al., 2019), some involve assessing whether complete information is present or not (Ferguson et al., 2020), some involve tables and text (Zhu et al., 2021), and so on. One way to surface the reasoning needed for answering these questions is to look at their decomposition into smaller reasoning steps. E.g., consider the question in Fig. 1, From what yard-line did Shayne kick two field goals?. This can be decomposed as follows: list the field goals by Shayne Graham, identify the yard-lines for each of them, map each yard-line with the field goal and count them, and select the yard-line with two field goals. While questions in multi-step QA datasets are authored with the intent that such multi-step reasoning will be used to answer them, the context associated with the questions often allows models to cheat by taking shortcuts (Min et al., 2019a; Chen and Durrett, 2019; Trivedi et al., 2020). E.g., if the context mentions field goals only by Shayne Graham and no one else, models can ignore the player name and still succeed.
Our key observation is that the decomposition of a question can be leveraged to carefully design a synthetic context for it that is hard to cheat, thereby allowing us to teach models a broad range of reasoning skills in a robust fashion. To achieve this, we procedurally create a large pretraining RC dataset, TEABREAC, by using real multi-step questions (from existing datasets) and their decompositions (available in the form of QDMRs), and carefully building synthetic contexts.
QDMR or Question Decomposition Meaning Representation (Wolfson et al., 2020) is a common way to represent the reasoning in many types of multi-step questions as a structured decomposition graph. QDMR has standardized operators (represented as nodes) such as select, project, group, etc., that transform their input. These are connected together to a final node which produces the answer. Figure 1 shows the above example question paired with its QDMR graph. Importantly, QDMRs are already available for several multi-step QA datasets.
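To make the structure concrete, a QDMR like the one in Figure 1 can be sketched as a small step graph. This is our own illustration: the `QDMRStep` name, its fields, and the step texts below are assumptions, not the paper's code or its exact annotation.

```python
# Our own illustration of a QDMR as a step graph: each step applies an operator
# and may reference the answers of earlier steps by index.
from dataclasses import dataclass, field
from typing import List

@dataclass
class QDMRStep:
    operator: str                                     # e.g. "select", "project", "group", "filter"
    text: str                                         # natural-language description of the step
    inputs: List[int] = field(default_factory=list)   # indices of referenced earlier steps

# The Figure 1 question: "From what yard-line did Shayne kick two field goals?"
qdmr = [
    QDMRStep("select", "field goals by Shayne Graham"),
    QDMRStep("project", "yard-lines of #1", inputs=[0]),
    QDMRStep("group", "count of #1 for each #2", inputs=[0, 1]),
    QDMRStep("filter", "#2 where #3 is equal to two", inputs=[1, 2]),
]

final = qdmr[-1]   # the final node produces the answer
```

Each step only references earlier steps, so the graph is acyclic and can be evaluated front to back.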
Briefly, our method involves the following main steps; these are described in more detail in §4.
Making QDMRs more precise. To create QA instances that teach the precise reasoning in QDMRs, we need a precise and formal representation of the reasoning they capture. QDMRs, although structured, don't provide this, as they are written in natural language and don't specify the datatypes of their inputs/outputs. Since this is crucial for our approach, we convert QDMRs into formal programs with over 44 executable primitive operations along with their input/output types (§4.1).
Teaching robust compositional skills. Past work has shown that compositional questions don't necessitate multi-step reasoning as datasets often have reasoning shortcuts (Min et al., 2019a; Chen and Durrett, 2019; Trivedi et al., 2020). To teach the reasoning reflected in our formal programs robustly, our QA instances must be such that models cannot bypass the reasoning steps and still arrive at the correct answer. To achieve this goal, we create a synthetic QA instance from a question-program pair, where the question is the same as the original question, but the context is procedurally constructed by grounding the predicates in the QDMR in a careful way such that models can't cheat their way to the correct answer.

Teaching a broad range of reasoning patterns. Although QDMRs cover a broad range of reasoning patterns, we find that the natural distribution of reasoning patterns in QDMRs is extremely skewed towards popular reasoning patterns (§4.2). Training on QA instances generated from such a distribution leads models to overfit to only a few of the most representative reasoning patterns, and not learn broad-range reasoning skills. To ensure this doesn't happen, we make sure our synthetic dataset is more balanced in terms of reasoning patterns (§4.2).
Teaching a broad range of reasoning primitives. In addition to our process of constructing a pretraining dataset to teach compositional skills described thus far, we observe that it also helps if we teach models the constituent primitive reasoning skills. To achieve this, similar to prior work (Geva et al., 2020), we procedurally generate QA instances based on fixed templates for each of the 44 primitives present in our formal programs (§4.3).

TEABREAC Dataset Construction
An overview of the TEABREAC construction pipeline is shown in Fig. 2. We discuss the QA instance generator in §4.1 and the dataset generator in §4.2.

Instance Generator
The Instance Generator takes a question Q and its QDMR decomposition D as input, and generates a synthetic context C and the corresponding answer A as its output. The tuple (Q, C, A) is the generated RC QA instance. This conversion happens in two steps: (i) QDMR to Typed Program, (ii) Typed Program to Context and Answer.

QDMR to Typed Program:
Our goal is to generate a synthetic context C that can be used to answer the question Q (based on the QDMR D), and to also provide the answer A. To generate C and A, we must be able to create facts corresponding to steps in the QDMR reasoning graph (i.e., ground the QDMR predicates) and compute the final answer by stepping through it.
To achieve this, we need a formal representation (Program) that captures the precise reasoning implied by D, and that can be executed step-by-step (e.g., in a programming language like Python). This isn't possible directly via QDMRs as (i) although structured, they are written in natural language and carry the variation inherent in natural language; and (ii) they don't have input and output type information, e.g., it is unclear whether the project operator should generate a dictionary, a list, or a scalar, making them difficult to execute.
To convert a QDMR D into a Program P, we define a set of Python functions (primitives) like select, filter, grouped_count, etc., and parse QDMRs into these functions using rules and heuristics. An example conversion is shown in Fig. 3.
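To give a feel for what such primitives might look like, here is a minimal sketch executed on toy grounded facts in the spirit of Fig. 3. Only the primitive names (select, project, grouped_count, filter_a_where_b_is_min) come from the paper; the function bodies, signatures, and the facts dictionary are our own assumptions.

```python
# Hypothetical implementations of four primitives (a sketch, not the paper's code).
def select(facts, predicate):
    """select: return the entities the context asserts for `predicate`."""
    return facts[predicate]

def project(facts, template, items):
    """project: map each item to its related entity (e.g., the player who kicked it)."""
    return {item: facts[template, item] for item in items}

def grouped_count(mapping):
    """grouped_count: count how many items map to each group value."""
    counts = {}
    for group in mapping.values():
        counts[group] = counts.get(group, 0) + 1
    return counts

def filter_a_where_b_is_min(groups, counts):
    """filter_a_where_b_is_min: keep the groups whose associated count is smallest."""
    least = min(counts[g] for g in set(groups))
    return sorted(g for g in set(groups) if counts[g] == least)

# Toy grounded context: three field goals kicked by two (artificial) players.
facts = {
    "field goals": ["FG1", "FG2", "FG3"],
    ("player who kicked", "FG1"): "ABC",
    ("player who kicked", "FG2"): "ABC",
    ("player who kicked", "FG3"): "DXE",
}
s1 = select(facts, "field goals")                    # step #1
s2 = project(facts, "player who kicked", s1)         # step #2: field goal -> player
s3 = grouped_count(s2)                               # step #3: goals per player
s4 = filter_a_where_b_is_min(list(s2.values()), s3)  # step #4: player(s) with fewest goals
```

Chaining the four calls mirrors how a typed program would be stepped through to compute the final answer.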
Figure 3 (example conversion). QDMR: "#1: return field goals; #2: return players who kicked #1; #3: return number of #1 for each #2; #4: return number #2 where #3 is least". Typed program: #1: select("[...] are the field goals"); #2: project("What player kicked #1?"); #3: grouped_count("#2", "#1"); #4: filter_a_where_b_is_min("#2", "#3").

These primitives don't always have a clearly defined output type. While in most cases the output type is obvious (e.g., arithmetic_sum returns a number), for some of them (select, project, filter), it's under-defined. E.g., select("number of soldiers in USA") should output a number, select("when did India get independence") should output a date, and select("countries surrounding India") should output a list of named entities. For such primitives, we use heuristic rules and type propagation on the global structure of P to infer the expected types and structures of the output. We call the program having type information for each step a Typed Program P, an example of which is shown in Fig. 3.
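The local part of such type-inference heuristics for select can be sketched roughly as follows. The cue words below are our own guesses at plausible rules, not the paper's actual rule set, and we omit the global type-propagation step entirely.

```python
# A rough sketch of heuristic output-type rules for select (cue words are assumptions).
def infer_select_type(predicate: str) -> str:
    """Guess the output type of a select step from surface cues in its predicate."""
    p = predicate.lower()
    if p.startswith(("number of", "how many")):
        return "number"
    if p.startswith(("when", "what year", "what date")):
        return "date"
    # Default: a list of named entities, later refined by propagation over P.
    return "list_of_entities"
```

On the paper's three examples, these rules produce number, date, and list-of-entities respectively.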

Synthetic Context + Answer:
Next, we generate C and A from the typed program P. We generate C by grounding the predicates derived from the QDMR D with random entities. Fig. 4 shows an example of C for a program with three steps. Predicates: The predicates that need to be grounded belong to four primitives, i.e., select, project, filter, and boolean. The example in Fig. 4 uses select and filter; examples involving project and boolean are shown in App. H. Entities: The grounded entities are of 3 types: number, date, or named entity. Since our programs are typed, we know which predicate should be grounded with which entity type. E.g., select("number of soldiers in USA") should be grounded with a number.
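A minimal sketch of such type-driven grounding is below. It is our illustration, not the authors' generator; we assume named entities are short random uppercase strings, mirroring artificial entities like ABC and DXE in Fig. 4, and the value ranges are arbitrary.

```python
# Sketch of grounding a typed predicate with a random entity of the right type.
import datetime
import random
import string

def ground(entity_type: str, rng: random.Random):
    """Draw a random value of the requested type to ground a predicate with."""
    if entity_type == "number":
        return rng.randint(1, 100)
    if entity_type == "date":
        return datetime.date(1900, 1, 1) + datetime.timedelta(days=rng.randrange(40000))
    # Named entity: a short random string, so no prior knowledge can shortcut it.
    return "".join(rng.choices(string.ascii_uppercase, k=3))

rng = random.Random(0)
entity = ground("named_entity", rng)
```

Using random strings rather than real names is what makes it impossible for a model to answer from memorized world knowledge instead of the context.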
Minimizing reasoning shortcuts. Naively creating C using the QDMR D can introduce shortcuts that models can exploit to bypass the necessary reasoning. Note that a QDMR is a sequence of steps where each step s_i can use answers from zero or more previous steps; e.g., "return number #2 where #3 is least" in Fig. 3 (top) uses the answers from steps #2 and #3. However, if there is only one player who scored field goals, all the steps can be ignored. To ensure models learn the intended reasoning, our goal is to create C such that one can't bypass the intended reasoning (or program) steps and still arrive at the correct answer A.
To this end, we ground the predicates with entities such that the following three properties hold. P1: Answers to dependent steps can't be ignored. If step s_j depends on step s_i, then the answer to s_j can't be identified without knowing the answer to s_i. E.g., in Fig. 4, step #2 asks "which of the touchdowns by Edward are from the 1st quarter". Since there are many touchdowns "from the 1st quarter", and only some of them are "touchdowns by Edward" (indicated in blue), one can't narrow down the answer to step #2 without knowing step #1's answer. We ensure this property for different operators differently. E.g., for filter, we ensure the answer is always a proper subset of all the entities grounded with that predicate ({ABC, DXE} ⊂ {ABC, DXE, MNF, IOU} in Fig. 4).

P2: Steps can't be no-ops. The input and output of a step can't be the same, as otherwise the reasoning in that step can be bypassed. E.g., in Fig. 4, step #2 asks "which of the touchdowns by Edward are from the 1st quarter". There are many "touchdowns by Edward", but only some of them are "from the 1st quarter" (indicated in blue). So, ignoring step #2 (i.e., treating it as a no-op) would result in an incorrect answer being used for subsequent steps. We ensure this property for different operators differently. E.g., for the filter operator, we ensure the answer to the step is always a proper subset of the answer to the dependent step ({ABC, DXE} ⊂ {ABC, DXE, FGH, PQR} in Fig. 4).
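For a filter step, the two proper-subset requirements above can be expressed as a single check. This sketch is our own reading of P1 and P2 (the function name and signature are invented); the example values come directly from Fig. 4 as quoted in the text.

```python
# Sketch of the P1/P2 proper-subset checks for grounding a filter step.
def filter_grounding_ok(step_answer, predicate_entities, dependency_answer):
    """P1: answer is a proper subset of the predicate's grounding.
    P2: answer is a proper subset of the dependent step's answer (not a no-op)."""
    a, p, d = set(step_answer), set(predicate_entities), set(dependency_answer)
    proper_subset_of_predicate = a < p   # P1: predicate alone can't give the answer
    proper_subset_of_dependency = a < d  # P2: the step actually filters something out
    return proper_subset_of_predicate and proper_subset_of_dependency

# Fig. 4 values: answer {ABC, DXE}, predicate grounding {ABC, DXE, MNF, IOU},
# dependent step answer {ABC, DXE, FGH, PQR}.
ok = filter_grounding_ok({"ABC", "DXE"},
                         {"ABC", "DXE", "MNF", "IOU"},
                         {"ABC", "DXE", "FGH", "PQR"})
```

Python's `<` on sets tests for proper (strict) subset, which is exactly the relation both properties demand.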
Properties P1 and P2 ensure step-by-step execution will lead to the gold answer, but there is only one possible complete execution that leads to an answer. As a result, the question can be completely ignored. To fix this, we have a third property. P3: The context also supports a different answer to a contrastive question. Just as we generate facts for the gold chain of reasoning (upper half of Fig. 4), we also generate facts for a distractor chain (lower half of Fig. 4), using potentially perturbed predicates (e.g., Edward ⇒ Tom, 1st ⇒ 2nd). This ensures there is always one minimally different (contrastive (Gardner et al., 2020)) question that results in a different answer in the same context. E.g., "How many touchdowns did Tom throw in the 2nd quarter" results in the answer 1, different from the gold answer 2 in Fig. 4. To perturb predicates, we swap numbers, dates, and named entities (PERSON, ORG, etc.) with a similar entity of the same type. In cases where the predicate doesn't have an entity, we use a similar but different, type-consistent predicate from a different question as the perturbed predicate. E.g., "yards of rushing touchdowns" could be perturbed to "yards of passing touchdowns". To do this, we retrieve the top 30 type-consistent predicates with the highest word overlap not exceeding 75%, and sample one.
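The retrieval step for predicate perturbation can be sketched as below. The whitespace tokenization and the Jaccard-style overlap measure are assumptions on our part; the top-30 cutoff and 75% ceiling come from the text.

```python
# Sketch of predicate-perturbation retrieval (overlap measure is an assumption).
import random

def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap over lowercased word sets (our guess at the measure)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def sample_perturbed_predicate(predicate, candidates, rng, k=30, max_overlap=0.75):
    """Keep type-consistent candidates whose overlap is at most max_overlap,
    rank by overlap, and sample one from the top k."""
    eligible = [c for c in candidates
                if c != predicate and word_overlap(predicate, c) <= max_overlap]
    eligible.sort(key=lambda c: -word_overlap(predicate, c))
    top = eligible[:k]
    return rng.choice(top) if top else None

rng = random.Random(0)
swap = sample_perturbed_predicate(
    "yards of rushing touchdowns",
    ["yards of passing touchdowns", "yards of rushing touchdowns", "field goals kicked"],
    rng,
)
```

The overlap ceiling rules out near-duplicates of the original predicate, while ranking by overlap keeps the perturbation minimally different.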
We note that past work of Trivedi et al. (2022) has also considered similar properties to create hard-to-cheat multi-step QA datasets. Our P1 is similar to the first part of their MuSiQue condition (the 2nd part isn't needed here as artificial entities make it impossible to ignore the context). Our P2 is new and especially pertinent to TEABREAC because of its list-based filter operations. Our P3 is also new and results in stronger question dependence than MuSiQue because of the emphasis on a minimally contrastive reasoning chain (as opposed to any additional reasoning chain, which a context in MuSiQue often also supports).
To construct QA instances with properties P1-P3, we iterate through the program steps, maintaining the step-wise answers and distractors for the gold reasoning chain (upper half of Fig. 4) and the distractor reasoning chain (lower half of Fig. 4), respectively. For steps containing grounding predicates (select, filter, project, boolean), we ground the predicate with random entities of the appropriate type and cardinality as defined by the typed program. While doing such groundings, we make sure the aforementioned properties are satisfied. The final step's answer is the answer A for the QA instance. A detailed description and pseudo-code to generate QA instances are given in App. B.
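At a very high level, this iteration might look like the following sketch. It is a simplification of the pseudo-code in App. B, not the actual implementation: all names are our own, and the P1-P3 checks and context verbalization are abstracted into the `ground_step` callback.

```python
# High-level sketch of the dual-chain generation loop (our simplification).
def generate_instance(typed_program, ground_step, perturb):
    """Walk the program once, grounding a gold chain and a distractor chain."""
    gold, distractor = {}, {}
    for i, step in enumerate(typed_program):
        gold[i] = ground_step(step, gold)                       # P1/P2 enforced inside
        distractor[i] = ground_step(perturb(step), distractor)  # P3: contrastive chain
    answer = gold[len(typed_program) - 1]   # the final step's answer is A
    context = (gold, distractor)            # facts from both chains are verbalized into C
    return context, answer
```

With trivial callbacks (e.g., `ground_step` upper-casing the step text, `perturb` appending a marker), the loop runs end to end and returns the final step's grounding as the answer.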

Dataset Generator
Now that we have a way to generate a QA instance from a (question, QDMR) pair, we can generate a dataset by simply using questions from datasets with annotated QDMRs. However, we find that the natural distribution of the reasoning patterns in these datasets is extremely long-tailed. We define a reasoning pattern as a unique sequence of primitives in the program. E.g., the program in Fig. 4 has 3 steps with the select, filter, and count primitives, so the reasoning pattern is "select filter count".
Generating instances uniformly from such QDMRs would skew the distribution of questions towards the popular patterns and result in the model overfitting to these patterns. To fix this, our dataset generator: (i) samples a reasoning pattern, (ii) samples a question-QDMR pair from that reasoning pattern, (iii) possibly perturbs question entities (named entities, dates, numbers, ordinals) with a closely similar entity of the same type, and (iv) invokes the instance generator. The resulting training dataset has about 900 reasoning patterns, with the top 10 most common patterns having only 4% of examples (compared to 70% had we not done such balancing).
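The pattern-first sampling described above can be sketched as below. The function names, the uniform choice over patterns, and the toy pool are our own illustration of the idea, not the paper's generator (which also perturbs entities before invoking the instance generator).

```python
# Sketch of pattern-balanced sampling: choose a pattern first, then a question.
import random

def pattern_signature(program_ops):
    """A reasoning pattern is the sequence of primitives, e.g. 'select filter count'."""
    return " ".join(program_ops)

def sample_balanced(questions_by_pattern, rng, n):
    """Sample uniformly over patterns first, then over questions within a pattern."""
    patterns = sorted(questions_by_pattern)
    return [rng.choice(questions_by_pattern[rng.choice(patterns)]) for _ in range(n)]

rng = random.Random(0)
pool = {"select filter count": ["q1", "q2", "q3", "q4"], "select project": ["q5"]}
sample = sample_balanced(pool, rng, 200)
```

Even though the second pattern has only one question, it receives roughly half the samples, which is the rebalancing effect the outer loop provides.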

Additional QA Instances for Primitives
We also generate instances to teach the 44 individual primitives, using simple templates similar to Geva et al. (2020). E.g., for the primitive filter_a_where_b_is_compared_to, a question could be "Entities that have value larger than 948768.92?" and the context could be "Entity AFE has value 871781. Entity RQX has value 989,517.24.", resulting in the answer ['RQX']. App. I gives example instances for all the primitives. Each primitive has 30K training and 1K development instances.
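A template-based primitive instance of this kind can be generated with a few lines; this sketch mirrors the filter_a_where_b_is_compared_to example above, but the template wording and function name are our own, not the paper's exact templates.

```python
# Sketch of a template-based primitive QA instance (wording is an assumption).
def make_filter_instance(entities, threshold):
    """Build a (question, context, answer) triple for a value-comparison filter."""
    question = f"Entities that have value larger than {threshold}?"
    context = " ".join(f"Entity {name} has value {value}." for name, value in entities)
    answer = [name for name, value in entities if value > threshold]
    return question, context, answer

q, c, a = make_filter_instance([("AFE", 871781), ("RQX", 989517.24)], 948768.92)
```

On the example values from the text, only RQX exceeds the threshold, so the answer is ['RQX'].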

Experiments
To test the effectiveness of TEABREAC pretraining, we compare models directly fine-tuned on target datasets with models first pretrained on TEABREAC before fine-tuning.
Datasets. We evaluate in-distribution performance using DROP (Dua et al., 2019), TAT-QA (Zhu et al., 2021), IIRC (Ferguson et al., 2020), and NumGLUE (Mishra et al., 2022). For IIRC, we consider two settings: IIRC-G uses only gold supporting sentences as context, while IIRC-R uses paragraphs obtained using a retrieval marginalization method (Ni et al., 2021). We evaluate robustness using the DROP contrast set (Gardner et al., 2020) and the DROP BPB contrast set (Geva et al., 2022). To do this, we directly evaluate DROP fine-tuned models on the contrast sets.
We use author-provided checkpoints as our initial models and then fine-tune on the target datasets. Following NT5 and POET, we use character tokenization in all considered models during the fine-tuning stage. In some cases, prior work has also performed similar experiments (with different implementations and hyper-parameters), which we report in App. C for completeness. Our models are implemented using PyTorch (Paszke et al., 2019), Huggingface Transformers (Wolf et al., 2019), and AllenNLP (Gardner et al., 2017). §G includes implementation details and training hyperparameters.
Furthermore, TEABREAC-pretrained T5-3B achieves new state-of-the-art performance relative to the best previously published results on IIRC-G, IIRC-R, and NumGLUE, reported in the last row of Table 1. Moreover, even the smaller TEABREAC-pretrained PReasM-L and POET-L models improve over previously published numbers on IIRC-G and NumGLUE, respectively. On DROP and TAT-QA, specialized architectures (with special task-specific modules) developed for those datasets outperform TEABREAC-pretrained models.
Since numerate LMs are derived by pretraining plain LMs on their respective synthetic datasets, we can also directly compare such pretraining approaches with TEABREAC pretraining. From Table 1, we can see that the T5-L + TEABREAC model is better than the PReasM-L model (T5-L pretrained on PReasM data), and the Bart-L + TEABREAC model is better than the POET-L model (Bart-L pretrained on POET data). See App. D for additional comparisons.

TEABREAC improves model robustness
We evaluate robustness in Table 1 by comparing performance on the DROP contrast set and the DROP BPB set. For all plain language models (T5-L, T5-3B, and Bart-L), TEABREAC pretraining shows substantial improvements in robustness: 5-8 F1 points on the DROP contrast set and on the DROP BPB set. For numerate LMs (NT5-S, PReasM-L, and POET-L), TEABREAC pretraining results in 4-7 F1 points of improvement on the DROP contrast set and 2-8 points of improvement on the DROP BPB set.
TEABREAC improves more on more complex questions

We further investigate how the improvements provided by TEABREAC vary with the complexity, i.e., the number of reasoning steps, of the question. To obtain the number of reasoning steps, we use our programs. But since QDMRs, and as a result programs, are not available for all the questions, we use the number of reasoning steps in predicted programs (using a T5-Large model trained on the BREAK dataset, followed by conversion into our typed programs).
Figure 5 compares the performance with and without TEABREAC pretraining on questions with an increasing (estimated) number of steps. While the T5-L baseline model drops significantly from 79 to 58, T5-L with TEABREAC pretraining stays mostly invariant to the number of steps. We thus observe a significantly larger improvement for more complex questions, where the original T5-L model struggles (e.g., a 20-point gain on 4+ steps vs. a 5-point gain on average). Similarly, for the numeracy-aware language model PReasM-Large, we see more improvement on more complex questions (e.g., 9-10 points on 4+ steps vs. 3.2 points on average). We see similar trends for the other models as well.
We also observe that more complex questions are much less frequent in the DROP development set (e.g., 4+ steps constitute only 25%).This makes our large gains on more complex questions not quite visible in the aggregate metric (Table 1).

TEABREAC Ablations
To assess the contribution of various aspects of TEABREAC to the overall performance, we perform ablation experiments with T5-L on the DROP dataset. Fig. 6 shows the results for the DROP contrast set and the BPB set. Pretraining on just primitive QA instances helps by 0.5-2.3 points, which further improves by 2.7-3.5 points when adding multi-step QA instances without QDMR-balancing (§4.2). Finally, if we instead add multi-step instances with QDMR-balancing, we get an additional 1.7-2.8 points of improvement. The DROP development set shows similar trends but with lower absolute differences, potentially due to shortcuts (see App. F).

Conclusions
Despite large LMs' impressive reading abilities and the availability of large-scale multi-step QA datasets requiring a rich set of reasoning skills, LM-based QA models do not reliably learn to use such skills for answering complex questions. In this work, we show that the greater control that synthetic contexts offer can be leveraged to create a teaching dataset from which models can learn a broad range of reasoning skills in a reliable manner, especially for more complex questions.
Our transfer results from synthetic data to actual QA datasets add to the growing line of work showing that synthetic datasets can in fact be used to inject useful skills for real, natural language tasks. Given the artifact issues in real datasets (specifically, in their contexts) and the difficulty of controlling for them via perturbations, we present a viable alternative: leveraging existing multi-step questions for their broad reasoning patterns but using synthetic contexts to carefully construct teaching datasets, where models can learn the right way to reason.

Ethical Considerations
The source dataset that TEABREAC is created from, i.e., BREAK, is publicly available under the MIT license, which allows us to modify and release the dataset. TEABREAC models and datasets are released under the CC BY 4.0 License (https://creativecommons.org/licenses/by/4.0). Since TEABREAC uses questions and decompositions from existing datasets, it may also inherit the social biases present in these underlying datasets. We haven't taken any explicit steps to remove such potential biases as that is not in the scope of this work. But we advise users of the TEABREAC dataset and models to take appropriate caution when deploying them in any real user-facing application.

Limitations
We proposed a pretraining approach to teach a broad range of multi-step reasoning skills to language models. Even though such pretraining doesn't have to be repeated for each target dataset, there is a significant computational cost to pretraining. E.g., our T5-Large pretraining takes 5 days on an RTX A6000 GPU. This is precisely the reason why we haven't conducted experiments with even larger models such as T5-11B. Identifying more compute-efficient ways to teach models such skills remains an interesting open problem.
In general, we expect TEABREAC pretraining to improve downstream performance on datasets that require complex aggregation operations and diverse compositions of them. We have shown the effectiveness of TEABREAC pretraining on several multi-step QA datasets which fit this criterion. However, this is not the case for other multi-step QA datasets like QASC, HotpotQA, 2WikiMultihopQA, and MuSiQue, which involve simpler compositions; TEABREAC pretraining thus may not lead to similar gains on these datasets. More broadly, the multi-step QA datasets we have considered form only a small subset of the vast number of QA and NLU tasks the NLP community is interested in. It's possible that TEABREAC pretraining is unhelpful or even harmful to the performance of LMs on other tasks where our learned multi-step skills are not as relevant, such as commonsense understanding.
The skills taught in TEABREAC are limited by the skills captured (or capturable) by QDMRs. While expanding the scope of QDMR operators and the datasets annotated with them can automatically expand the scope of TEABREAC, the current approach is still limited to datasets where one can easily define and obtain QDMRs.
Lastly, while TEABREAC enables teaching reasoning skills to any text-to-text model, these black-box models don't provide explanations, making it hard to analyze their underlying reasoning. Hence, we are unable to check whether models trained on it are necessarily performing the required multi-step reasoning. We only provide indirect empirical evidence via evaluations on contrast sets.

A Related Work
Question Decomposition. Several recent multi-step QA datasets come with question decomposition annotations (Khot et al., 2020; Talmor and Berant, 2018; Geva et al., 2021; Trivedi et al., 2022; Khot et al., 2022a). These works have enabled the development of explicit multi-step reasoning systems that first decompose a question into sub-questions, and then answer the sub-questions step-by-step to arrive at the final answer (Min et al., 2019b; Khot et al., 2021; Trivedi et al., 2022; Patel et al., 2022; Khot et al., 2022b). In contrast, our goal is to use decompositions to teach language models multi-step reasoning implicitly (within the model).
Since each dataset has its own decomposition format, these annotations have led to narrow, dataset-specific solutions. In contrast, the BREAK dataset (Wolfson et al., 2020) defined a standardized format across several QA datasets, so in this work we use these standardized decompositions to build a teaching dataset for broad reasoning skills.
Robust Multi-step Reasoning. Past work has shown how to perturb existing multi-step QA instances to prevent shortcuts and incentivize robust reasoning. Jiang and Bansal (2019) and Ding et al. (2021) created adversarial multi-step questions by perturbing the reasoning chains in HotpotQA (Yang et al., 2018). Other datasets (Trivedi et al., 2020, 2022; Lee et al., 2021) incentivize robustness via minimally perturbed unanswerable questions. Our approach targets a broader set of questions and eliminates multiple reasoning shortcuts.
The closest work to ours is the Break-Perturb-Build (BPB) dataset (Geva et al., 2022). BPB also uses QDMRs, but to create contrastive questions via small question perturbations (Kaushik et al., 2019; Gardner et al., 2020). Unlike us, they use the existing context, whose reasoning shortcuts can be hard to eliminate with only question perturbation (e.g., no distractors). Additionally, BPB is mainly used for evaluation (as we also do) and hasn't been shown to improve models by training on it.
Data Augmentation for QA. Several past works have used data augmentation via synthetic datasets to improve QA performance. The following works are most relevant to our approach. Geva et al. (2020) created a synthetic dataset using a few handcrafted templates for injecting numerical reasoning skills (along with a specialized architecture). This dataset was also later used to build a numeracy-aware T5 (Raffel et al., 2020) model: NT5 (Yang et al., 2021). Yoran et al. (2022) created a synthetic dataset using 13 handcrafted multi-step QA reasoning patterns applied to Wikipedia tables. Lastly, Pi et al. (2022) showed that pretraining language models on synthetic datasets derived from the inputs and outputs of program executors (arithmetic, logic-based, and SQL-based) can also improve downstream QA performance. In contrast to these works, we use actual questions from a wide range of real datasets to teach a broad range of multi-step reasoning skills.

B Algorithm to Generate QA Instances
Algorithm 1 shows the pseudo-code for generating QA instances satisfying the three properties discussed in § 4. The GenQAInstance function takes a question Q, its QDMR D, and the expected cardinality N of the answer, and attempts to generate a QA instance with the desirable properties for a maximum of 200 tries. For a given (question, QDMR) pair, we vary N ∈ {1, 2, 3, 4}. The facts represent the list of grounded predicates that form the context, state.ans represents the stepwise answers for the gold reasoning chain (e.g., green boxes in Fig. 4), and state.dis represents the stepwise answers for the distractor reasoning chain (e.g., red boxes in Fig. 4). These are initialized to ∅ (L3) and updated during instance generation.
To construct a QA instance, we iterate through the program (or QDMR) steps. For each step, we create facts for the gold reasoning chain by grounding the predicate in the QDMR, and update the facts and answer state accordingly using the execute function. E.g., in step #2 in Fig. 4, the facts in the top half are added and {ABC, DXE} is marked as the current answer state. The execute function generates these facts and answers such that properties P1 and P2 are satisfied, or returns False if it can't. We similarly generate facts and update the state for the distractor reasoning chain (L7) by using a perturbed (L6) QDMR predicate (e.g., Edward ⇒ Tom, 1st ⇒ 2nd in Fig. 4). This generates the facts and reasoning chain shown in the lower half of Fig. 4, ensuring property P3 is satisfied.
The implementation of the execute function depends on the program primitives (Table 7) and will be provided in the released code. But broadly speaking, there are two classes of primitives: (1) primitives like select and filter that need to first add facts by grounding the predicate, and then update the answer state for that step (e.g., steps #1 and #2 in Fig. 4); (2) primitives like count that require no additional grounding of facts and only need to update the state based on the underlying computation (e.g., step #3 in Fig. 4).
Algorithm 1: Pseudo-code for generating QA instances from question Q, QDMR D, and answer cardinality N.
If all the steps finish successfully, we check whether the generation is acceptable (L14) before creating a QA instance. To be acceptable, the generated answer cardinality must match the expected value, the number of facts must be at most 25, and the final answers of the gold and distractor reasoning chains must be different. We then create a reading comprehension QA instance with the input question Q as the question, the facts as the context (concatenated after shuffling), and the answer at the final step as the gold answer.
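The generation loop described above can be sketched as follows. This is a minimal, self-contained illustration rather than the released implementation: the toy execute covers only three primitives, and the helper names (perturb, the entity generator) are our own simplifications of the algorithm's steps.

```python
import random
import string

MAX_TRIES = 200   # maximum attempts per (question, QDMR, N) triple
MAX_FACTS = 25    # a generation is rejected if the context grows beyond this

def _entity(rng):
    # random 3-letter placeholder entity, like "ABC" / "DXE" in Fig. 4
    return "".join(rng.choice(string.ascii_uppercase) for _ in range(3))

def execute(step, chain, facts, rng):
    """Toy executor covering only select / filter / count; the real executor
    implements every primitive in Table 7. Grounding-class primitives add
    facts to the context; computation-class primitives only update the state."""
    op, arg = step
    if op == "select":                      # class (1): grounds new facts
        ents = [_entity(rng) for _ in range(rng.randint(2, 4))]
        facts.extend(f"{arg} => {e}." for e in ents)
        return ents
    if op == "filter":                      # class (1): grounds new facts
        kept = [e for e in chain[-1] if rng.random() < 0.7]
        facts.extend(f"{arg} => {e}." for e in kept)
        return kept
    if op == "count":                       # class (2): pure computation
        return len(chain[-1])
    return None                             # unsupported primitive

def perturb(step):
    # In the paper an entity or ordinal is swapped (Edward -> Tom,
    # 1st -> 2nd); here we simply derive a distinct distractor predicate.
    op, arg = step
    return (op, "distractor " + arg)

def gen_qa_instance(question, program, n, seed=0):
    rng = random.Random(seed)
    for _ in range(MAX_TRIES):
        facts, gold, dis = [], [], []       # context, gold chain, distractor chain
        ok = True
        for step in program:
            ans = execute(step, gold, facts, rng)            # gold chain
            dans = execute(perturb(step), dis, facts, rng)   # distractor chain
            if ans is None or dans is None:
                ok = False
                break
            gold.append(ans)
            dis.append(dans)
        if not ok:
            continue
        final = gold[-1]
        card = len(final) if isinstance(final, list) else 1
        # acceptability checks (L14): cardinality, context size, property P3
        if card == n and len(facts) <= MAX_FACTS and final != dis[-1]:
            rng.shuffle(facts)              # shuffle facts before concatenation
            return {"question": question,
                    "context": " ".join(facts),
                    "answer": final}
    return None  # no acceptable instance found within MAX_TRIES
```

Note how the distractor chain writes its facts into the same context as the gold chain, so a model can only answer correctly by following the full gold reasoning chain.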

C Our Implementation vs Previously Reported Numbers
To test the effectiveness of TEABREAC pretraining, we compare models directly fine-tuned on target datasets with models first pretrained on TEABREAC and then fine-tuned on target datasets. For a fair comparison, we do the direct fine-tuning on the target datasets using our implementation instead of relying on previously reported numbers, which may have other differences. Moreover, previously reported numbers are only sparsely available across the model-dataset pairs we consider, which is another reason to use our implementation. Table 2 shows results obtained by our implementation vs. results reported by prior works (NT5 (Yang et al., 2021), PReasM (Yoran et al., 2022), and POET (Pi et al., 2022)), where available. Irrespective of implementation, models with TEABREAC pretraining outperform prior approaches. Note that following Yang et al. (2021) and Pi et al. (2022), we employ character tokenization for numbers, which wasn't employed by Yoran et al. (2022). Therefore, the results obtained by our implementation are significantly better than those reported in Yoran et al. (2022) for DROP, where numerical reasoning is crucial.

D Direct Comparison of TEABREAC vs. Previous Pretraining Methods

E Performance of LMs on TEABREAC
Since our goal is to teach models the reasoning skills in TEABREAC, we assess how well models do on the TEABREAC dataset itself. As shown in Table 4, models are able to learn both the primitive and multi-step QA skills required in TEABREAC.
On primitive instances, models get 92-98 F1, and on multi-step instances, models get 84-88 F1. We show in our experiments that these scores are good enough to make progress on real datasets. At the same time, these aren't perfect scores, demonstrating limitations of vanilla LM-based neural models. Thus, TEABREAC can also serve as a benchmark to help design better multi-step models.

F Results in Exact Match (EM) metric
In addition to the F1 results reported in Table 1, we also report the corresponding EM numbers in Table 5. We see the same trends discussed in § 5. The TEABREAC ablation on the DROP dev set is provided in Fig. 7.

G Implementation Details
We train all models on an RTX A6000 (48GB) GPU. We pretrain on TEABREAC by sampling a batch from multi-step and primitive synthetic instances in an alternating fashion. The hyperparameters for pretraining and fine-tuning are given in Table 6.
The only hyperparameter we swept over is the learning rate (10−5, 5 × 10−5, 10−4, 5 × 10−4, 10−3). The number of epochs was set to a large value, with early stopping based on validation score. We've used the Adafactor optimizer for all our experiments (Shazeer and Stern, 2018). We selected the training hyperparameter (learning rate) for each baseline model and dataset based on validation set performance. Our fine-tuning experiments using models pretrained on TEABREAC use this identical learning rate.

H Examples of Multi-Step QA Instances
Example multi-step QA instances with project and boolean primitives are shown in Fig. 8.

I List of Primitives (Python Functions)
The list of primitives (Python functions) and a corresponding example for each are given in Table 7.
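To illustrate the flavor of Table 7, a few primitives can be written as plain Python functions over grounded (predicate, entity) facts. These sketches are our own; the exact signatures and names in the released code may differ.

```python
def select(facts, predicate):
    """Ground `predicate` in the context: return all entities it maps to."""
    return [ent for pred, ent in facts if pred == predicate]

def filter_step(entities, facts, predicate):
    """Keep only the entities from a previous step that `predicate` also holds for."""
    allowed = {ent for pred, ent in facts if pred == predicate}
    return [e for e in entities if e in allowed]

def count(entities):
    """Number of entities produced by a previous step."""
    return len(entities)
```

The first two are grounding-class primitives (they read the context facts), while count is a pure computation over a previous step's answer, mirroring the two classes described in Appendix B.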

Figure 4 (program and synthetic context):
Question: How many touchdowns did Edwards throw in the 1st quarter?
Program: #1: select("touchdowns by Edwards") → list of named entities; #2: filter("#1 from 1st quarter") → list of named entities; #3: count("#2") → number.
Gold-chain facts: touchdowns by Edwards ⇒ ABC, DXE, FGH, PQR; what is from 1st quarter? ⇒ ABC, DXE, MNF, IOU.
Distractor-chain facts: touchdowns by Tom ⇒ MNO, IOH, DPE; what is from 2nd quarter? ⇒ MNO, XRT.

Figure 4 :
Figure 4: A simplified example of a QA instance in TEABREAC, with a (simplified) real question from the DROP dataset and the synthetic context we construct for it using the question's 3-step decomposition. Statements in red, yellow, and green form the synthetic context. The instance satisfies desirable properties P1, P2, and P3, and thus helps robustly teach multi-step reasoning skills.
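For concreteness, the three-step composition in Figure 4 can be replayed with ordinary Python list and set operations. The entity names are taken from the figure; this is purely an illustration of how the gold and distractor chains diverge, not TEABREAC code.

```python
# Gold chain: touchdowns by Edwards, restricted to the 1st quarter
touchdowns_by_edwards = ["ABC", "DXE", "FGH", "PQR"]   # step #1: select
from_1st_quarter = {"ABC", "DXE", "MNF", "IOU"}        # grounding for step #2

step1 = touchdowns_by_edwards
step2 = [e for e in step1 if e in from_1st_quarter]    # step #2: filter
step3 = len(step2)                                     # step #3: count

# Distractor chain (red in the figure): touchdowns by Tom, 2nd quarter
touchdowns_by_tom = ["MNO", "IOH", "DPE"]
from_2nd_quarter = {"MNO", "XRT"}
dis2 = [e for e in touchdowns_by_tom if e in from_2nd_quarter]
dis3 = len(dis2)   # differs from the gold answer, satisfying property P3
```

Here the gold chain yields {ABC, DXE} at step #2 and the answer 2 at step #3, while the distractor chain yields a different final count, so shortcutting onto the distractor facts produces a wrong answer.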
Table 1: F1 scores of in-distribution and robustness evaluation of language models (LMs) with and without TEABREAC pretraining on dev and test sets. Pretraining LMs on TEABREAC improves their in-distribution performance and robustness across multiple QA datasets, for both plain and numerate LMs. In-distribution evaluation scores are (dev | test) scores. Robustness evaluations are on test-only contrast sets. The suffixes '-3B', '-L', and '-S' refer to model sizes 3B, large, and small, respectively. Green (underlined) indicates TEABREAC pretraining improves the underlying model's performance, while red (not underlined) indicates it does not. Bold indicates that the TEABREAC-pretrained model sets a new state of the art among published models. EM scores are provided in Appendix F. a: Zhou et al. (2022), b: Zhou et al. (2022), c: Yoran et al. (2022), d: Ni et al. (2021), e: Mishra et al. (2022), f: Gardner et al. (2020), g: Geva et al. (2022).

Figure 5 :
Figure 5: F1 scores for plain and numerate LMs with and without TEABREAC pretraining on DROP across varying numbers of steps, as determined by our programs. TEABREAC pretraining helps more on more complex questions. The effect is more prominent on plain LMs like T5-L than on numerate LMs like PReasM-L. (Top Right) Histogram of the percentage of questions for each step count. Because more complex questions are less frequent, improvements by TEABREAC pretraining don't show up as well on the average metric for the entire dataset.

Figure 6 :
Figure 6: TEABREAC Ablations: All aspects of TEABREAC pretraining data contribute to the overall performance: (i) primitive QA instances (ii) multi-step QA instances (iii) balancing of QDMR distribution.

Figure 7 :
Figure 7: TEABREAC Ablations: All three aspects of TEABREAC pretraining data contribute to overall performance: (i) primitive QA instances, (ii) multi-step QA instances, (iii) balancing of QDMRs to construct the multi-step QA dataset. The results are F1 scores on the DROP dev set. The effect on the DROP dev set is less prominent than on the DROP CS and BPB sets, potentially due to shortcuts in the DROP dev set.

Table 4 :
F1 scores of models pretrained on TEABREAC on its Primitive and Multi-step dev sets. Models learn the skills required in TEABREAC well during pretraining, but achieving a perfect score is challenging for vanilla LM-based neural models.

Table 7 :
List of primitives (python functions) and a corresponding example.
