Is a Question Decomposition Unit All We Need?

Large Language Models (LMs) have achieved state-of-the-art performance on many Natural Language Processing (NLP) benchmarks. With the growing number of new benchmarks, we build bigger and more complex LMs. However, building new LMs may not be an ideal option owing to the cost, time, and environmental impact associated with it. We explore an alternative route: can we modify data by expressing it in terms of the model's strengths, so that a question becomes easier for models to answer? We investigate whether humans can decompose a hard question into a set of simpler questions that are relatively easier for models to solve. We analyze a range of datasets involving various forms of reasoning and find that it is indeed possible to significantly improve model performance (24% for GPT-3 and 29% for RoBERTa-SQuAD, along with a symbolic calculator) via decomposition. Our approach provides a viable option to involve people in NLP research in a meaningful way. Our findings indicate that Human-in-the-loop Question Decomposition (HQD) can potentially provide an alternate path to building large LMs.


Introduction
With the advent of large LMs, we have achieved state-of-the-art performance on many NLP benchmarks (Radford et al., 2019; Brown et al., 2020; Sanh et al., 2021a). Our benchmarks are evolving and becoming harder over time. To solve new benchmarks, we have been designing bigger and more complex LMs, at the cost of computational resources, time, and negative environmental impact. Building newer LMs to solve each new benchmark may not be an ideal or sustainable option over time. Inspired by humans, who often view new tasks as a combination of existing tasks, we explore whether we can mimic humans and help the model solve a new task by decomposing (Mishra et al., 2021a) it into a combination of tasks that the model already excels at.
As NLP applications become increasingly popular in people's daily activities, it is essential to develop methods that involve humans in NLP-powered applications in meaningful ways. Our approach attempts to fill this gap by providing a human-centric approach to modifying data. Solving complex QA tasks such as multi-hop QA and numerical reasoning has been a challenge for models. Question Decomposition (QD) has recently been explored to empower models to solve these tasks, with the added advantage of interpretability. However, previous studies on QD are limited to specific datasets (Khot et al., 2020b) such as DROP (Dua et al., 2019) and HOTPOTQA (Yang et al., 2018). We analyze a range of datasets involving various forms of reasoning to investigate the question: is a Question Decomposition unit all we need? Figure 1 shows a schematic representation of a QD unit. The original question is difficult for a model to answer. However, it becomes easier for the model when a human decomposes the question into a set of simpler questions.
We manually decompose 50 randomly selected samples of each dataset. The decompositions we perform are purely based on intuitions for reducing the complexity of the question, inspired by the success of task-level instruction decomposition (Mishra et al., 2021a) in improving model performance. We experiment with GPT-3 (Brown et al., 2020) and RoBERTa (Liu et al., 2019) fine-tuned on SQuAD 2.0 (Rajpurkar et al., 2018) and find that HQD significantly improves model performance (24% for GPT-3 and 29% for RoBERTa-SQuAD, along with a symbolic calculator). Here, evaluation happens on unseen tasks on which the model is not fine-tuned. Our findings indicate that Human-in-the-loop Question Decomposition (HQD) can potentially provide an alternate path to building large LMs. We hope our work will encourage the community to develop human-centric solutions that actively involve humans while leveraging NLP resources.

Related Work
A recent methodology for reasoning over multiple sentences in reading comprehension datasets is to decompose the question into single-hop questions (Talmor and Berant, 2018; Min et al., 2019). Min et al. (2019) decompose questions from HOTPOTQA using span predictions based on reasoning types and pick the best decomposition using a decomposition scorer. Khot et al. (2020b) generate decompositions by training a BART model on a question generation task, providing context, answers, and hints. Wolfson et al. (2020) crowd-source annotations for decompositions of questions. Perez et al. (2020), on the other hand, use an unsupervised mechanism for generating decompositions by mapping a hard question to a set of candidate sub-questions from a question corpus. Iyyer et al. (2017) answer a question sequentially using a neural semantic parsing framework over crowd-sourced decompositions of questions from WikiTableQuestions. Decomposition using text-to-SQL query conversion has also been studied (Guo et al., 2019), and knowledge graphs have been combined with neural networks to generate decompositions (Gupta and Lewis, 2018). Recently, Xie et al. (2022) presented another use case where decompositions are used to probe models to create explanations for their reasoning.

Decomposition Process
For each dataset, we randomly select 50 instances for manual decomposition. The question in each dataset is decomposed into two or more questions. Each instance i ∈ D can be represented as below:

(C_i, Q_i, A_i) → (C_i, Q_d, A_d)

where C_i is the set of context paragraphs, Q_i is the original question, Q_d is the set of decomposed questions, A_i is the original answer, and A_d is the set of answers for the corresponding decomposed questions. For questions that require arithmetic or logical operations, we use a computational unit, as suggested in Khot et al. (2020b), which takes a decomposed question as input in the following format:

O !#m_1 !#m_2

where O ∈ {summation, difference, division, multiplication, greater, lesser, power, concat, return, remainder}, #m_i refers to the answer of the m_i-th previous decomposed question, and ! separates the operands.
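To make the computational unit concrete, below is a minimal sketch of how such a unit might work; the operation set follows O above, but the function and variable names (and the interpretation of greater/lesser as max/min) are our own illustration, not the implementation from Khot et al. (2020b).

```python
# Minimal sketch of a symbolic computational unit. It parses a decomposed
# question of the form "operation !#1 !#2", substitutes the answers of
# earlier decomposed questions, and applies the operation.

OPERATIONS = {
    "summation": lambda a, b: a + b,
    "difference": lambda a, b: a - b,
    "division": lambda a, b: a / b,
    "multiplication": lambda a, b: a * b,
    "greater": lambda a, b: max(a, b),   # interpretation assumed
    "lesser": lambda a, b: min(a, b),    # interpretation assumed
    "power": lambda a, b: a ** b,
    "remainder": lambda a, b: a % b,
    "concat": lambda a, b: f"{a}{b}",
    "return": lambda a: a,
}

def run_computational_unit(question: str, previous_answers: list):
    """Evaluate e.g. 'difference !#1 !#2' given answers to earlier sub-questions."""
    op_name, *raw_operands = [tok.strip() for tok in question.split("!")]
    operands = []
    for tok in raw_operands:
        # "#m" refers to the answer of the m-th decomposed question.
        value = previous_answers[int(tok[1:]) - 1] if tok.startswith("#") else tok
        try:
            value = float(value)
        except (TypeError, ValueError):
            pass  # non-numeric operands (e.g. for concat) are kept as strings
        operands.append(value)
    return OPERATIONS[op_name](*operands)

# Example from Table 3: "multiplication !#1 !#2" with answers 56.0 and 9.0
print(run_computational_unit("multiplication !#1 !#2", [56.0, 9.0]))  # 504.0
```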

Experimental Setup
Models We use GPT-3 (Brown et al., 2020) to generate answers for original and decomposed questions. To show that QD significantly improves performance even with simpler models, we use RoBERTa-base finetuned on the SQuAD 2.0 dataset (i.e., RoBERTa-SQuAD). Additionally, we use RoBERTa-base finetuned on the BoolQ dataset (Clark et al., 2019) (i.e., RoBERTa-BoolQ) for original and decomposed questions in STRATEGYQA, since they are True/False questions.
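For illustration, such extractive readers can be loaded with the HuggingFace transformers library roughly as follows; the checkpoint name is a public stand-in (an assumption), not necessarily the exact model used in our experiments.

```python
# Sketch of loading a RoBERTa-base reader finetuned on SQuAD 2.0.
# The checkpoint name below is an assumed public stand-in.
from transformers import pipeline

roberta_squad = pipeline(
    "question-answering", model="deepset/roberta-base-squad2"
)

answer = roberta_squad(
    question="How many books are in each bookshelf?",
    context="Bryan has 56.0 books in each of his 9.0 bookshelves.",
)
print(answer["answer"])  # expected: a span such as "56.0"
```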
Experiments To create baselines, we evaluate all models on the original question along with the context. In the proposed method, we evaluate all models on the manually decomposed questions. We carry out all experiments in GPT-3 by designing prompts for each dataset². For RoBERTa-based models, we use RoBERTa-SQuAD for the MULTIRC, BREAK, HOTPOTQA, and DROP datasets, since SQuAD 2.0 is designed for a reading comprehension task. For STRATEGYQA, we use two RoBERTa-base models: (1) RoBERTa-BoolQ, which answers the final boolean question, and (2) RoBERTa-SQuAD, which answers the remaining decomposed questions. For SVAMP, we use the RoBERTa-SQuAD model to extract the necessary operands using decomposed questions and then use the computational module to perform the operations. In all experiments, we use the decomposition sequentially to reach the final answer, as sketched below.
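A minimal sketch of this sequential procedure, assuming illustrative qa_model and comp_unit interfaces (these names are ours, not from the paper's code):

```python
import re

def answer_decomposition(context, decomposed_questions, qa_model, comp_unit):
    """Answer decomposed questions sequentially; the last answer is final.

    qa_model(question, context) -> str is any reader (GPT-3 or RoBERTa-SQuAD);
    comp_unit(question, answers) -> value is the symbolic calculator above.
    """
    operations = ("summation", "difference", "division", "multiplication",
                  "greater", "lesser", "power", "concat", "return", "remainder")
    answers = []
    for question in decomposed_questions:
        # Substitute references to earlier answers, e.g. "When was #1 born?"
        question = re.sub(r"#(\d+)",
                          lambda m: str(answers[int(m.group(1)) - 1]),
                          question)
        if question.split("!")[0].strip() in operations:
            answers.append(comp_unit(question, answers))  # symbolic step
        else:
            answers.append(qa_model(question, context))   # reading step
    return answers[-1]
```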
Metrics For all our experiments, we use Rouge-L (Lin, 2004), F1-score, and Exact Match (EM) as the evaluation metrics.
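For reference, a minimal sketch of EM and token-level F1 in the style of the SQuAD evaluation (the official scripts' answer normalization is omitted for brevity):

```python
from collections import Counter

def exact_match(prediction: str, gold: str) -> float:
    return float(prediction.strip().lower() == gold.strip().lower())

def token_f1(prediction: str, gold: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("1 September 1864", "1 september 1864"))         # 1.0
print(round(token_f1("Roger Casement", "Sir Roger Casement"), 2))  # 0.8
```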
² See Appendix A for more details.

Results and Analysis
Here, we divide our datasets into four categories: (1) RC: HOTPOTQA, DROP, MULTIRC, and BREAK in Reading Comprehension; (2) MATH: MATHQA and SVAMP in Mathematical reasoning; (3) MC: QASC in Multi-Choice QA; and (4) SR: STRATEGYQA in Strategy Reasoning. All results presented in this section are averaged over the tasks in each category.

Experimental Results
GPT-3 Figure 3 shows the GPT-3 performance in terms of average F1-scores for each category. From Figure 3, we can observe that our proposed approach outperforms the baseline by ∼24%.
RoBERTa-based models Figure 2 shows the performance of the RoBERTa-based models in terms of average F1-scores for each category. On average, we achieve a significant improvement of ∼29% over the baseline. Appendix D presents all results in terms of F1-scores, EM, and Rouge-L for all datasets and categories.

Analysis
Customized Question Decomposition for Each Model There can be multiple ways to decompose a question based on the context, and multiple factors go into deciding how to break down a question. One factor is the strength of the model. For instance, if we use a model finetuned on SQuAD, it might be beneficial to ensure that the decompositions are more granular and are framed so that each answer is a span of the context. On the other hand, with a more sophisticated model like GPT-3, we might not necessarily need to do so. The results shown in Figure 2 are obtained with RoBERTa finetuned on SQuAD using decompositions originally designed for GPT-3; note that in this case, the answers to the decompositions might not always be a span of a particular sentence in the context. Even so, we achieve a decent performance improvement. We believe the performance gain will be greater if decompositions are designed to match the model's strengths. Examples of such decompositions are included in Appendix A.
Qualitative Analysis We conduct a qualitative analysis to capture evaluation aspects missed by the automated metrics. Here, we manually inspect each generated answer and consider it correct if it is semantically similar to the gold annotation. Figures 4 and 5 show the contribution of QD in correcting model predictions. We observe that the decompositions correct more than 60% of the errors made on the original questions.
Error Analysis We conduct an error analysis and observe that the major source of error is error propagated from one of the decomposed questions. Errors, in general, are of two types: (i) incorrect span selection and (ii) failure to collect all possible answers in the initial step of the decomposition; the latter often omits the actual correct answer, leaving no room for later decomposition steps to generate it. Errors occur in QASC because our method of context-independent decomposition (via intuition) sometimes leads to open-ended questions that models find hard to answer. Examples of errors are included in Appendix B.

Effect of Decomposition on Math Datasets
We observe that Math datasets benefit the most from decomposition. This may be for two reasons: (1) The majority of math questions can be decomposed into a combination of extractive QA (where the answer is a span) and a symbolic calculation, both of which are strengths of our setup (note that we use calculators that provide accurate answers consistently). This is not necessarily true for other QA tasks: in a decomposition chain, if the answer at one step goes wrong, the error propagates to the end and the final prediction becomes wrong. (2) Language models by default struggle with math tasks (Patel et al., 2021; Mishra et al., 2022), so the performance improvement is more prominent there.

Effect of Number of Decompositions on Results
We typically decompose a question based on the number of operations associated with it (e.g., a mathematical calculation or a single-hop operation). Increasing the number of decompositions has the advantage of simplifying the original question, but also the disadvantage that if the answer to one question in the chain is incorrect, the final answer becomes incorrect. This is also evident from our empirical analysis on the HOTPOTQA and SVAMP datasets, where we observe no direct correlation between the number of decompositions and the final performance. Figure 6 shows the variation in model performance improvement for questions with 2, 3, 4, and 5 decompositions.
Efforts to Automate Decomposition For HOTPOTQA, DROP, and SVAMP, we attempt to automate the decomposition process using GPT-3. A limitation in generating decompositions for HOTPOTQA is that the context length makes it difficult to provide sufficient examples in the prompt. With DROP and SVAMP, we observe that GPT-3 often generates incorrect arithmetic operations for the last sub-question, and it often fails to produce coherent decompositions of the questions. We also finetune a BART-base model (Lewis et al., 2020) to generate decompositions automatically.

Conclusion
The recent trend of building ever larger LMs may not be a sustainable way to solve evolving benchmarks. We believe that modifying data samples can significantly improve model performance. We study the effect of Question Decomposition (QD) on a diverse set of tasks. We decompose questions manually and significantly improve model performance (24% for GPT-3 and 29% for RoBERTa-SQuAD, along with a symbolic calculator). Our findings indicate that Human-in-the-loop Question Decomposition (HQD) can potentially provide an alternate path to building large LMs. Our approach provides a viable option for involving people in NLP research. We hope our work will encourage the community to develop human-centric solutions that actively involve humans while leveraging NLP resources.

Limitations
Our human-in-the-loop methodology shows promising results by decomposing questions; however, certain questions are difficult to decompose even for humans. For instance, the question "Which country is New York in?" is hard to decompose further. Determining which questions to decompose is also an important challenge that is under-explored in this work. Furthermore, decomposed questions in the chain that have more than one correct answer might lead to an incorrect final answer. Automating the process of decomposition while addressing these issues is a promising area for future work.

A Prompts
Owing to the success of large LMs, prompt-based learning is becoming popular as a way to achieve generalization and eliminate the need to create task-specific models and large-scale datasets (Liu et al., 2021).

A.6 MultiRC
Given a context-question pair, answer the question using information and facts present in the context. Keep your answers as short as possible.
Example:
Input: Context: Should places at the same distance from the equator have the same climate? You might think they should. Unfortunately, you would not be correct to think this. Climate types vary due to other factors besides distance from the equator. So what are these factors? How they have such a large impact on local climates? For one thing, these factors are big. You may wonder, are they as big as a car. Think bigger. Are they bigger than a house? Think bigger. Are they bigger than a football stadium? You are still not close. We are talking about mountains and oceans. They are big features and big factors. Oceans and mountains play a huge role in climates around the world. You can see this in Figure above. Only one of those factors is latitude, or distance from the equator. Question: Name at least one factor of climate
Output: Answer: Oceans

Example:
Input: Context: Earth processes have not changed over time. The way things happen now is the same way things happened in the past. Mountains grow and mountains slowly wear away. The same process is at work the same as it was billions of years ago. As the environment changes, living creatures adapt. They change over time. Some organisms may not be able to adapt. They become extinct. Becoming extinct means they die out completely. Some geologists study the history of the Earth. They want to learn about Earths past. They use clues from rocks and fossils. They use these clues to make sense of events. The goal is to place things in the order they happened. They also want to know how long it took for those events to happen. Question: What is one example of how the earth's processes are the same today as in the past?
Output: Answer: Things develop and then wither away

Input: Context: «CONTEXT» Question: «QUESTION»
Output: Answer: «ANSWER GENERATED BY GPT3»
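For illustration, the «CONTEXT»/«QUESTION» placeholders can be filled and sent to GPT-3 roughly as below; the engine name and decoding parameters are assumptions, shown with the legacy OpenAI completions API.

```python
# Sketch of filling the prompt template above and querying GPT-3.
import openai

PROMPT_TEMPLATE = (
    "Given a context-question pair, answer the question using information "
    "and facts present in the context. Keep your answers as short as possible.\n\n"
    "{examples}\n"
    "Input: Context: {context} Question: {question}\n"
    "Output: Answer:"
)

def ask_gpt3(context: str, question: str, examples: str) -> str:
    prompt = PROMPT_TEMPLATE.format(
        examples=examples, context=context, question=question
    )
    response = openai.Completion.create(
        engine="text-davinci-002",  # assumed engine; the paper says GPT-3
        prompt=prompt,
        max_tokens=32,
        temperature=0.0,  # deterministic answers for evaluation
    )
    return response["choices"][0]["text"].strip()
```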

B Error Examples
This section discusses the errors made when using decompositions. We observe two types of errors while answering decomposed questions: the final answer is wrong because a previous sub-question was answered incorrectly, either because that question has multiple correct answers or simply because the model could not understand the question correctly.

Question: When was the date of birth of one of the founder of Congo Reform Association?
True Answer: 1 September 1864
Generated Answer: 18 October 1914
Decomposed Question 1: Who is the founder of the Congo Reform Association?
True Answer: Roger Casement
Generated Answer: Henry Grattan Guinness
Decomposed Question 2: When was #1 born?
True Answer: 1 September 1864
Generated Answer: 1861

Above is an example from HotpotQA. As can be seen from the context, the Congo Reform Association had multiple founders. GPT-3 gave a correct answer from the set of correct answers, whereas the ground-truth answer provided by the dataset was a different correct option.
Below is an example of incorrect retrieval. The answer generated for the first decomposed question incorrectly includes cities taken by the Ottomans as well, instead of just the Venetians. Hence, the final decomposed question returns an incorrect count.
Decomposed Question: ... of stars be used for the following: #?

The decomposed question for each option is posed as a yes or no question to GPT-3. It returns yes for art and storytelling but not for travel.
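A minimal sketch of this option-substitution step (the yes/no reader interface and names are illustrative):

```python
def answer_multichoice(template: str, options: dict, yes_no_model) -> str:
    """Pose an option-substituted yes/no question for each candidate option.

    template contains the placeholder '#1'; options maps labels (A-H) to
    option text; yes_no_model(question) -> "yes" | "no" is an assumed
    interface, e.g. GPT-3 behind a yes/no prompt.
    """
    for label, option in options.items():
        question = template.replace("#1", option)
        if yes_no_model(question) == "yes":
            return label  # first option the model affirms
    return "no option affirmed"

# e.g. answer_multichoice("Are #1 beads formed from vapor condensing?",
#                         {"A": "h2o", "H": "Dew"}, model)
```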

C Examples, Results and Details for Automation
We attempt to automate the process of decomposition using GPT-3, using examples from our manual decompositions in the prompts, some of which are presented below. The results of these experiments are presented in Table 6. The generated decompositions are answered using RoBERTa-base finetuned on the SQuAD 2.0 dataset.
Below are the prompts we used while attempting to automatically generate decomposed questions using GPT-3.
The prompt for generating decompositions for DROP was as follows:

Decompose given question by breaking it into simpler sub-questions. The answer to each subsequent sub-question should lead towards the answer of the given question. To do so, use the context provided and look at the examples. Here are some helpful instructions:
1. If the given question compares two things, best strategy is to generate sub-questions that finds the answer to each of those things and compare them in the last sub-question.
2. Some sub-questions must contain phrases like "answer of sub-question 1".
3. If a sub-question is an arithmetic operation, then the sub-question should be framed as: operation !"answer of sub-question 1" !"answer of sub-question 2".
4. The operation used in 3) is always one of the following: summation, difference, greater, lesser.
The prompt for generating decompositions for SVAMP repeats the same instructions, followed by worked examples:

Example 3:
Context: Dave was helping the cafeteria workers pick up lunch trays, but he could only carry 9.0 trays at a time. If he had to pick up 17.0 trays from one table and 55.0 trays from another, how many trips will he make?
Sub-question 1: How many trays did Dave have to pick up from the first table?
Sub-question 2: How many trays did Dave have to pick up from the second table?

Example 4:
Context: Paco had 93.0 cookies. Paco ate 15.0 of them. How many cookies did Paco have left?
Sub-question 1: How many cookies did Paco start with?
Sub-question 2: How many cookies did Paco eat?
Sub-question 3: difference !"answer of sub-question 1" !"answer of sub-question 2"

Example 5:
Context: 43 children were riding on the bus. At the bus stop some children got off the bus. Then there were 21 children left on the bus. How many children got off the bus at the bus stop?
Sub-question 1: How many children were on the bus at the beginning?
Sub-question 2: How many children were left on the bus?
Sub-question 3: difference !"answer of sub-question 1" !"answer of sub-question 2"

Example 6:
Context: 28 children were riding on the bus. At the bus stop 82 children got on the bus while some got off the bus. Then there were 30 children altogether on the bus. How many more children got on the bus than those that got off?
Sub-question 1: How many children were on the bus at the beginning?
Sub-question 2: How many children were left on the bus?
Sub-question 3: difference !"answer of sub-question 1" !"answer of sub-question 2"

Example 7:
Context: They decided to hold the party in their backyard. If they have 11 sets of tables and each set has 13 chairs, how many chairs do they have in the backyard?
Sub-question 1: How many tables are there in the backyard?
Sub-question 2: How many chairs are on each table?
Sub-question 3: multiplication !"answer of sub-question 1" !"answer of sub-question 2"

Context: «CONTEXT + QUESTION»

The examples of decompositions generated for HotpotQA, DROP, and SVAMP are shown in Table 7.
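Once GPT-3 returns a completion for «CONTEXT + QUESTION», the generated sub-questions can be recovered with a simple pattern match; below is a sketch under the assumption that completions follow the "Sub-question N:" format of the prompt.

```python
import re

def parse_subquestions(completion: str) -> list:
    """Extract 'Sub-question N: ...' lines from a GPT-3 completion."""
    pattern = re.compile(r"Sub-question\s*\d+:\s*(.+)")
    return [m.group(1).strip() for m in pattern.finditer(completion)]

completion = (
    "Sub-question 1: How many tables are there in the backyard?\n"
    "Sub-question 2: How many chairs are on each table?\n"
    'Sub-question 3: multiplication !"answer of sub-question 1" '
    '!"answer of sub-question 2"\n'
)
print(parse_subquestions(completion))  # three sub-question strings
```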

D Results
We tabulate the results for all datasets, for both the baseline and our proposed method.

Figure 1: The original question is answered incorrectly by a model. A human then decomposes the question into a set of simpler questions, which the model then answers correctly.

Figure 2: Results in terms of F1-score across different categories for RoBERTa-based models. RC: Reading Comprehension, MATH: Mathematical reasoning, SR: Strategy Reasoning.
Figure 3: Results in terms of F1-score across different categories for GPT-3.
Figure 5: Contribution of QD in correcting model predictions.
QASC prompt: Given a question and options A-H, choose and return the correct option. Look at the examples given below.
Input: What are the vibrations in the ear called? (A) intensity (B) very complex (C) melanin content (D) lamphreys (E) Otoacoustic (F) weater (G) …

For the given yes or no question, return yes if the answer is yes. Otherwise return no.
«QUESTION»
Answer: «OUTPUT GENERATED BY GPT3»

SVAMP
Context: Bryan took a look at his books as well. If Bryan has 56.0 books in each of his 9.0 bookshelves.
Original Question: How many books does he have in total? Answer: 504.0
Decomposed Question 1: How many books in each bookshelf? Answer: 56.0
Decomposed Question 2: How many bookshelves? Answer: 9.0
Decomposed Question 3: multiplication !#1 !#2 Answer: 504.0

MATHQA
Problem: if a train, travelling at a speed of 180 kmph, crosses a pole in 6 sec, then the length of train is? Options: a) 300, b) 125, c) 288, d) 266, e) 121
Annotated Formula: multiply(multiply(180, const_0.2778), 6)
Answer: 300 Generated Answer: 266
Decomposed Question 1: multiplication !0.2778 !180 Answer: 50.004
Decomposed Question 2: multiplication !50.004 !6 Answer: 300

Table 3: Decomposition examples for SVAMP and MathQA. We use the annotated formula presented in the dataset to make our decompositions.

StrategyQA
Context: Mail carriers, also referred to as mailmen or letter carriers, . . . Clothing also provides protection from ultraviolet radiation.
Original Question: True or False: Mail carriers need multiple uniforms. Original Answer: True Generated Answer: False
Decomposed Question 1: What seasons do mail carriers work through? Generated Answer: All seasons
Decomposed Question 2: True or False: In order to make it through all of #1, one needs multiple clothing pieces. Generated Answer: True

QASC
Original Question: what kind of beads are formed from vapor condensing? (A) h2o (B) H20 (C) tiny (D) carbon (E) hydrogen (F) rain (G) oxygen (H) Dew
Answer: h2o
Decomposed Question 1: Are #1 beads formed from vapor condensing? Answer: yes

Table 4: Examples of decompositions for StrategyQA and QASC datasets. For each option in QASC, #1 is replaced with the option and posed to GPT-3 as a yes or no question.

Table 1: Type of QA task corresponding to each dataset. RC: Reading Comprehension.


Table 2: Examples for DROP and HotpotQA.

A.1 HOTPOTQA, DROP, BREAK
Context: The Larkspur Press is a small letter-press publisher based in Monterey, Kentucky .... The film also features appearances by Helen Keller, Anne Sullivan, Kate Adams Keller and Phillips Brooks Keller as themselves. The movie was directed by George Foster Platt and written by Francis Trevelyan Miller.
Original Question: Are John O'Hara and Rabindranath Tagore the same nationality?