Summarize-then-Answer: Generating Concise Explanations for Multi-hop Reading Comprehension

How can we generate concise explanations for multi-hop Reading Comprehension (RC)? The current strategies of identifying supporting sentences can be seen as an extractive question-focused summarization of the input text. However, these extractive explanations are not necessarily concise i.e. not minimally sufficient for answering a question. Instead, we advocate for an abstractive approach, where we propose to generate a question-focused, abstractive summary of input paragraphs and then feed it to an RC system. Given a limited amount of human-annotated abstractive explanations, we train the abstractive explainer in a semi-supervised manner, where we start from the supervised model and then train it further through trial and error maximizing a conciseness-promoted reward function. Our experiments demonstrate that the proposed abstractive explainer can generate more compact explanations than an extractive explainer with limited supervision (only 2k instances) while maintaining sufficiency.


Introduction
Recent approaches to multi-hop Reading Comprehension (RC) have greatly improved its explainability, i.e. models' ability to explain their own answers (Thayaparan et al., 2020). Some adopt a pipelined architecture, where they generate an explanation first and then use it to answer the question. This "faithful-by-construction" approach is aimed at ensuring that generated explanations are closer to the systems' internal reasoning (i.e. faithfulness). The explanation generation step is typically formulated as a sentence selection task over the input text: selecting a set of sentences which provide support for the answer output by the model (Yang et al., 2018; Groeneveld et al., 2020, etc.). However, the main problem with these approaches is that the explanations obtained from sentence selection are not always minimal, sufficient, and comprehensible. Extractive explanations can include extraneous or superfluous text which expresses information that is not necessary for answering questions. For example, as shown in Fig. 1 (a), fragments such as 2007 British-American fantasy adventure and Young Tommy in "Never Let Me Go" are not needed to explain the answer Northern Lights. Secondly, extractive explanations may also not be sufficient: the interpretation of an explanation may depend on its original paragraphs (e.g. for pronouns). In Fig. 1 (a), His film roles means Charlie Rowe's film roles, but this is not included in the extractive explanation. These types of gaps can also limit the comprehensibility of explanations.
In this work, we target concise explanations which provide minimal, sufficient and comprehensible information related to the answer. This can also be seen as targeting an abstractive question-focused summary. To this end, we propose SUmmarizer-augmented QA (SuQA), an RC system augmented with an abstractive explainer component that generates an abstractive summary of the input paragraphs as an explanation, which is then fed to a separate QA module to produce an answer. An abstractive explainer can summarize longer sentences into short phrases and replace pronouns with their referents, leading to more compact and sufficient explanations compared to extractive ones. For example, as shown in Fig. 1 (b), the abstractive explainer, unlike an extractive one, is allowed to remove unnecessary information such as 2007 British-American fantasy adventure, and to generate context-independent sentences such as Charlie Rowe plays Billy Costa in The Golden Compass, instead of His film roles includes....
However, developing such an abstractive explainer poses a significant challenge because of the limited amount of human-annotated abstractive explanations available and the prohibitively high cost of extending them (Inoue et al., 2020). Given this limited supervision, how can we ensure that generated explanations are sufficient while promoting compression?
Our solution is to teach an abstractive explainer through trial and error, maximizing a conciseness-promoting reward function in a reinforcement learning (RL) framework. The reward function assesses generated explanations against various criteria related to conciseness, such as linguistic acceptability, abstractiveness, and the accuracy of the RC module's prediction on the generated explanations. By doing so, the model gradually learns to extract and summarize information from input texts so that it helps the RC module arrive at the correct answers. Also, because the explainer aims to produce abstractive summaries, we can initialize it with an abstractive summarizer that is pretrained on standard summarization datasets.
We evaluate the proposed approach on HotpotQA (Yang et al., 2018), one of the most popular multi-hop RC datasets. The findings of this paper can be summarized as follows:
• The semi-supervised abstractive explainer can generate more compact and sufficient explanations than extractive explanations while keeping them informative for answering questions. Compared to extractive ones, the abstractive explanations have a 2.9× higher compression rate and improve human-judged sufficiency by 2.5 points, without incurring any significant drop in QA accuracy.
• Even small amounts of human-annotated explanation supervision significantly improve the conciseness of generated explanations. For example, incorporating as few as 298 instances of annotated explanations makes the compression rate 1.3× higher and improves human-judged sufficiency by +11.0 points compared to the setting with no supervision for explanations.
Conciseness, in contrast, has been relatively unexplored. One exception is Paranjape et al. (2020), who propose to learn to extract a minimal set of input sentences that are useful for solving downstream tasks by imposing an information bottleneck on the model. Although our work shares a similar spirit with theirs, their explainer is extractive, unlike ours. Our work is the first to incorporate abstractive explainers into RC systems.
Abstractive explainer A similar pipeline model has been proposed for textual entailment (Camburu et al., 2018) and commonsense QA (Rajani et al., 2019), where the model first generates an explanation, and then a downstream classifier consumes it to predict a task label. Although the architecture is the same as ours, the training process is different: they train the explainer in a fully supervised manner using input-explanation pairs, while our work additionally leverages a signal from the downstream QA model via RL. As demonstrated in §5.5, this additional training is crucial when few annotated explanations are available.
Generating abstractive explanations is closely related to query-focused summarization (QFS), for which a few datasets are publicly available (Dang, 2006; Baumel et al., 2016; Nema et al., 2017; Pasunuru et al., 2021). However, the task setting of QFS is radically different from ours, which makes it difficult to leverage those datasets and models in a straightforward manner. The QFS task typically involves non-question queries (e.g. keywords or complex sentences) or opinion-oriented questions (e.g. Is X a good idea?), and gold summaries are not guaranteed to contain all information required for answering questions. We leave it to future work to explore how to effectively use these datasets and models in our task.
3 SuQA: SUmmarizer-augmented QA

Extractive explanations may contain superfluous information that is not necessary for answering questions, or may not be sufficient for answering them. We address this issue by generating concise explanations, defined as follows.
Definition 1. An explanation is concise if it is (i) minimal, (ii) comprehensible, and (iii) sufficient for answering the question.

To ensure the faithfulness of explanations, we use a pipeline architecture consisting of two main components: (i) an abstractive explainer (AX) and (ii) a QA module (QAM) (§3.1). The AX takes a question and paragraphs as inputs and is responsible for generating a question-focused, abstractive summary of the input paragraphs. The QAM then answers the question solely based on the generated summary. This summary is supposed to contain the information necessary for answering the question and is the only evidence the QAM relies on. Thus, the generated summary can be interpreted as a faithful explanation of the model.

Architecture
First, we formalize the overall pipeline. Given a question q and paragraphs p, we first generate the most-likely explanation e as follows:

    e = argmax_e p_π(e|q, p),    (1)

where p_π is the AX. We then answer the question q solely based on the generated explanation e:

    a = argmax_a p_φ(a|q, e),    (2)

where p_φ is the QAM. Our architecture is agnostic to the implementation of the AX and QAM as long as they are differentiable. From the viewpoint of probabilistic models, this formulation is a special case of a probabilistic latent variable model of p(a|q, p) where explanations are treated as latent variables, similar to retrieval-augmented language models (Guu et al., 2020; Lewis et al., 2020b). Specifically, we have

    p(a|q, p) = Σ_e p_φ(a|q, e) p_π(e|q, p),    (3)

assuming p_φ(a|q, e, p) = p_φ(a|q, e). Replacing the sum with arg max yields Equation 2. The main challenge is that p_π(e|q, p) is not a retriever but a text generator.
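The pipeline can be sketched as two successive arg max steps. In this sketch the candidate sets and scoring functions are toy stand-ins for the AX and QAM, not the actual sequence-to-sequence models:

```python
# Minimal sketch of the SuQA pipeline: the explainer picks the most likely
# explanation, then the QA module answers from that explanation alone.
# The candidate pools and scorers below are hypothetical placeholders.

def argmax_pipeline(question, paragraphs, explainer_score, qa_score,
                    explanation_candidates, answer_candidates):
    """Return (explanation, answer) via two successive arg max steps."""
    # First arg max: pick the explanation the explainer scores highest.
    explanation = max(explanation_candidates,
                      key=lambda e: explainer_score(question, paragraphs, e))
    # Second arg max: answer conditioned on the explanation ONLY, so the
    # explanation is the sole evidence the QA module sees (faithfulness).
    answer = max(answer_candidates,
                 key=lambda a: qa_score(question, explanation, a))
    return explanation, answer
```

This makes the faithfulness argument concrete: because the second step never sees the paragraphs, the explanation is by construction the only factor behind the answer.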
Abstractive explainer (AX) It takes paragraphs p and a question q as input, and outputs an explanation e. We implement the AX using a sequence-to-sequence generation model as follows:

    p_π(e|q, p) = Π_t p_π(y_t | y_<t, q, p),    (4)

where e = (y_1, y_2, ..., y_m) is a sequence of explanation tokens. In our experiments, we use BART (Lewis et al., 2020a). We simply concatenate q and p into one text with a separator token to generate a question-focused summary of the paragraphs.

Reward Function
Figure 2: Training regime of the proposed method. We pretrain the AX with a large summarization dataset and finetune it on a limited amount of human-annotated explanations (§4.1). We then train it further through indirect supervision from the QAM using Reinforcement Learning (§4.2).
QA module (QAM) It takes a question q and an explanation e generated by the AX as inputs, and outputs an answer a. We implement the QAM as a generation-based question answering module.
4 Training

Fig. 2 shows an overview of our training regime. The main challenge of training the AX is that human-annotated explanations are rarely available for question-answer pairs, although the conciseness of explanations relies heavily on human judgement. To address this issue, we train the AX in a semi-supervised manner.

Supervised training with summarization and explanation generation
Because the AX aims to produce abstractive summaries, we initialize the AX with an abstractive summarizer that is pretrained on standard summarization datasets. As we will see later (§5.6.2), this initialization is one of the key ingredients for the AX. Given a training dataset consisting of QA pairs annotated with gold explanations, we train the AX with a standard teacher forcing approach. Specifically, we minimize the following loss:

    L_ML = − Σ_{t=1}^{n} log p_π(y*_t | y*_<t, q, p),    (5)

where q is a question, and (y*_1, y*_2, ..., y*_n) is a human-annotated explanation for the QA pair.
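The teacher-forcing objective amounts to summing the negative log-probabilities the explainer assigns to the gold tokens. A minimal sketch, assuming the per-token probabilities have already been read off the model's softmax:

```python
import math

def teacher_forcing_loss(token_probs):
    """Negative log-likelihood of a gold explanation under the explainer.

    token_probs[t] stands for p_pi(y*_t | y*_<t, q, p), the probability the
    model assigns to the t-th gold token. Under teacher forcing, the gold
    prefix y*_<t is always fed in, regardless of what the model would have
    generated on its own.
    """
    return -sum(math.log(p) for p in token_probs)
```

Minimizing this loss pushes every gold-token probability toward 1, which is why the loss is exactly 0 when the model is certain of each gold token.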

Semi-supervised training
Although fully supervised training provides the AX with direct signals, large-scale annotation of such abstractive explanations is prohibitively costly (Inoue et al., 2020). Thus, after training the AX in a supervised fashion, we further train it through indirect supervision from answers, which are much cheaper to annotate.
We use the RL framework and design a reward function that assesses the goodness of generated explanations based on answers and sentence-level supporting facts. A state here is the sequence of explanation tokens generated so far, y_<t; an action is to generate a token; and the policy function is the probability distribution p_π(y_t | y_<t, q) of tokens given by the AX, as in previous work on RL-based language generation (Rennie et al., 2017, etc.). Given a reward function r(·), which we describe later, we optimize the policy function via self-critical training (Rennie et al., 2017) as follows:

    L_RL = − (r(y) − r(ŷ)) Σ_t log p_π(y_t | y_<t, q),    (6)

where y is an explanation sampled according to the current policy, and ŷ is an explanation generated by greedy decoding. r(ŷ) is called a baseline reward; it stabilizes the training process by reducing the variance of the gradient. To prevent generated explanations from deviating too much from gold explanations, we jointly optimize the RL loss with the supervised loss: our final loss is L_RL + λL_ML, where λ is the weight of the ML loss. In our experiments, we used λ = 0.1.
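The joint objective can be sketched as follows, assuming the per-token probabilities of the sampled explanation and the two rewards have already been computed by the model and the reward function:

```python
import math

def self_critical_loss(sample_reward, greedy_reward, sample_token_probs,
                       ml_loss, lam=0.1):
    """Joint loss L_RL + lam * L_ML used to finetune the explainer.

    Self-critical baseline: the reward of the greedily decoded explanation
    r(y_hat) is subtracted from the sampled explanation's reward r(y), so
    samples that beat greedy decoding are reinforced while worse samples
    are suppressed. sample_token_probs[t] stands for the probability the
    policy assigned to the t-th sampled token.
    """
    log_prob = sum(math.log(p) for p in sample_token_probs)
    rl_loss = -(sample_reward - greedy_reward) * log_prob
    return rl_loss + lam * ml_loss
```

Note that when the sample and the greedy output earn the same reward, the RL term vanishes and only the supervised term λ·L_ML remains, which is what keeps generations anchored to the gold explanations.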

Reward function
Given a question q, input paragraphs c, and an explanation e, we define the reward function as a geometric mean of N elemental reward functions:

    r(q, c, e) = ( Π_{i=1}^{N} r_i(q, c, e) )^{1/N}.    (7)

The intuition here is that we combine elemental reward functions with an "AND" operator: if any elemental reward function gives zero, the explanation must not be rewarded. We introduce three types of elemental reward functions as follows.
Summarization rewards encourage the AX to generate more compact summaries. To keep the summary relevant to the question, we also incorporate the relevance of generated explanations to input paragraphs and questions. Let P, Q be sets of tokens, and let P's coverage of Q be cov(P, Q) = |P ∩ Q|/|Q|. Let ng(X, i) be the set of i-grams in X, and w(X) = ng(X, 1).
• Compression ratio of e w.r.t. input paragraphs: 1 − (# tokens in e / # tokens in c)
• Abstractiveness of e w.r.t. input paragraphs: 1 − cov(ng(c, 2), ng(e, 2)), i.e. the fraction of e's bigrams that do not appear in c
• Relevance of e to input paragraphs based on unigrams: cov(w(c), w(e))
• e's coverage of the question: cov(w(e), w(q))

Sufficiency rewards ensure that generated explanations are sufficient, i.e. useful for answering questions.
• F1 score of the QAM's predicted answer: we feed e into the QAM and calculate the answer F1 score of the predicted answer.
• Existence of gold answer span: 1 if e contains the gold answer span; 0 otherwise.
Comprehensibility rewards ensure the comprehensibility of generated explanations to humans.
• Linguistic acceptability of e: the acceptability score given by a sentence acceptability classifier. In our experiments, we use RoBERTa-base finetuned on the CoLA dataset.
• Sampling noisiness: 1 if log p_π(e|q, p) > T; 0 otherwise. This is to prevent noisy explanations from being rewarded. We use T = −50.
• Well-formedness: 0 if e contains repetition or overly long words, starts with a pronoun, or ends without a period; 1 otherwise.
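A minimal sketch of the token-overlap rewards above and their "AND"-style combination. The choice of bigrams for abstractiveness follows the n-gram machinery defined in the text, but is an assumption of this sketch:

```python
import math

def ng(tokens, i):
    """Set of i-grams in a token sequence."""
    return {tuple(tokens[j:j + i]) for j in range(len(tokens) - i + 1)}

def cov(P, Q):
    """P's coverage of Q: |P ∩ Q| / |Q| (0 if Q is empty)."""
    return len(P & Q) / len(Q) if Q else 0.0

def compression(e_tokens, c_tokens):
    # 1 - (# tokens in e / # tokens in c)
    return 1.0 - len(e_tokens) / len(c_tokens)

def abstractiveness(e_tokens, c_tokens, i=2):
    # Fraction of e's i-grams that do NOT appear in the input paragraphs.
    # Using bigrams (i=2) here is an assumption of this sketch.
    return 1.0 - cov(ng(c_tokens, i), ng(e_tokens, i))

def combined_reward(rewards):
    """Geometric mean acts as an 'AND': any zero reward kills the total."""
    if any(r <= 0.0 for r in rewards):
        return 0.0
    return math.exp(sum(math.log(r) for r in rewards) / len(rewards))
```

The geometric mean is what enforces the "AND" semantics: unlike an arithmetic mean, a single zero elemental reward zeroes out the whole reward, so the explainer cannot trade sufficiency away for extreme compression.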

Dataset
We use HotpotQA (Yang et al., 2018), which consists of 90,564 training and 7,405 development instances. All instances are annotated with extractive explanations called supporting facts (SFs): sentences from the input documents that are required to answer the question. We use the distractor setting in our experiments. For human-annotated abstractive explanations, we use R4C (Inoue et al., 2020), which annotates 2,379 training instances (3% of the training instances) and 2,541 development instances from HotpotQA with reasoning steps. The reasoning steps are abstractive explanations that describe the information necessary for deriving answers, consisting of entity-relation triplets in natural language (e.g. (Biden, is a president of, US)). We concatenate each triplet's entities and relation into one sentence for training the AX.

Relevant paragraph prediction
To select relevant paragraphs for the AX, we trained a ranker that ranks paragraphs according to their relevance to the question. The ranker takes a question and one paragraph as input and outputs a relevance score. To train the ranker, we used a binary cross-entropy loss, where paragraphs containing gold SFs (henceforth, supporting paragraphs) are used as positive instances and the other distractor paragraphs as negative instances. Following Kim et al. (2020), we also randomly sample one supporting paragraph from other questions for each question and use it as an additional negative instance.
At test time, we retain the top-k paragraphs and give them to the AX. We use k = 3 because HotpotQA always has exactly two supporting paragraphs. Our evaluation shows that all supporting paragraphs are included in the top-k ranked paragraphs for 97.4% of dev instances on HotpotQA. When training the AX, we gave gold supporting paragraphs and randomly selected distractor paragraphs to the AX. To implement the ranker, we use a standard sequence classifier on top of RoBERTa-large (Liu et al., 2019).
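The test-time selection step can be sketched as simple score-based top-k filtering; in practice the scores would come from the RoBERTa ranker:

```python
def top_k_paragraphs(paragraphs, scores, k=3):
    """Keep the k paragraphs the ranker scores as most question-relevant.

    k=3 follows the setting in the text: HotpotQA always has exactly two
    supporting paragraphs, so k=3 leaves headroom for one ranking mistake.
    """
    ranked = sorted(zip(paragraphs, scores), key=lambda x: x[1], reverse=True)
    return [p for p, _ in ranked[:k]]
```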

Setup
Models We create Extr, a simple baseline model that resembles a typical extraction-based explainable NLP architecture (Glockner et al., 2020; Paranjape et al., 2020). Here, we train the AX using Eq. 5 only, where we use SFs as supervision.
We denote our proposed model as SuQA. To see the effectiveness of RL, we also have SuQA-NoRL, a model trained with annotated explanations using Eq. (5) without additional RL training. SuQA-NoRL resembles the fully supervised, generation-based explain-then-predict models of Camburu et al. (2018) and Rajani et al. (2019).
AX We initialize the AX with DistilBART finetuned on CNN/Daily Mail, a large, standard summarization dataset (Shleifer and Rush, 2020). During training, we feed supporting paragraphs as input to the model. At test time, we use the relevant paragraphs predicted as described in §5.2 as input. For hyperparameter tuning, we reserve 500 training instances as a validation dataset. See §A in the Appendix for further details.
QAM We use UnifiedQA-base (Khashabi et al., 2020) as the QAM and freeze it during training. Ideally, the AX should learn from a "perfect" QA model that does not perform disconnected reasoning (Trivedi et al., 2020). However, such a QA model is not available at the moment. We thus simulate it by using UnifiedQA (Khashabi et al., 2020), a T5-based (Raffel et al., 2020) QA model finetuned on a diverse set of QA datasets (e.g. SQuAD, NarrativeQA, RACE) excluding HotpotQA. We expect this to discourage the QAM from giving correct answers for insufficient explanations via disconnected reasoning, which improves the quality of the reward function for RL. At test time, we use UnifiedQA finetuned on HotpotQA, whose performance is shown in Table 2 (see QAM w/o AX).

Evaluation measures
Conciseness To assess the compactness of generated explanations, we calculate (i) a compression ratio (Cm), the number of tokens in the input paragraphs divided by the number of tokens in the generated explanation, and (ii) abstractiveness (Abs) with respect to the paragraphs selected by the paragraph ranker, calculated by the equation from §4.3.
To assess the sufficiency of generated explanations, we use crowdsourcing. Given a generated explanation and its original question, five crowdworkers are asked to judge whether the generated explanation alone provides sufficient information for answering the question on a 3-point Likert scale (yes, likely, no) plus "unsure". To reliably estimate the quality of explanations, we additionally ask them for the answer they inferred from the given explanation.
To aggregate the annotators' judgements, we first replace a crowdworker's submission with 'no' when (i) the answer is different from the gold standard answer, or (ii) the judgement is 'unsure', and replace 'likely' with 'yes'. We then used MACE (Hovy et al., 2013) to aggregate all the judgements (Suf). Due to the cost, we evaluate 100 gold explanations and 200 generated explanations for each configuration. We obtained a Krippendorff's α of 0.298 on average, indicating fair agreement. See §D in the Appendix for further details of the crowdsourced judgement.
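The per-judgement preprocessing applied before MACE aggregation can be sketched as:

```python
def normalize_judgement(label, worker_answer, gold_answer):
    """Per-worker preprocessing applied before aggregation (e.g. MACE).

    A judgement is forced to 'no' when the worker's inferred answer does
    not match the gold answer or when they were unsure; 'likely' counts
    as 'yes'. This uses the worker's own answer as a sanity check on
    their sufficiency label.
    """
    if label == "unsure" or worker_answer != gold_answer:
        return "no"
    if label == "likely":
        return "yes"
    return label
```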
In some experiments, we report the similarity between generated explanations and human-annotated explanations as a proxy for sufficiency, due to the cost of human evaluation. We employ ROUGE-2 (Lin, 2004) (RG2), which has been shown to correlate highly with human ratings on several summarization datasets (Bhandari et al., 2020).
QA performance We report F1, one of the official evaluation measures of HotpotQA.
Given that our ultimate goal is to create an explainable RC system, we also introduce XF1, a new evaluation measure:

    XF1 = (1/N) Σ_{i=1}^{N} suf(i) · F1(i),

where N is the number of instances in the dataset, suf(i) is the crowdsourced sufficiency label of the i-th instance (yes=1, no=0), and F1(i) is the F1 score of the i-th instance. This captures how well the system generates sufficient explanations and predicts the correct answer.
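XF1 amounts to a sufficiency-masked average of per-instance F1 scores:

```python
def xf1(suf_labels, f1_scores):
    """XF1 = (1/N) * sum_i suf(i) * F1(i).

    Answer F1 is credited only on instances whose explanation was judged
    sufficient (suf = 1); a correct answer backed by an insufficient
    explanation contributes nothing.
    """
    assert len(suf_labels) == len(f1_scores)
    return sum(s * f for s, f in zip(suf_labels, f1_scores)) / len(f1_scores)
```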

Results and discussion
Abstractive explanations are more concise (i.e. compact and sufficient) than extractive ones. To understand the advantage of abstractive explanations, we compare gold extractive explanations (Gold SF) with gold abstractive explanations (Gold XP) in Table 1. It clearly indicates that abstractive explanations are more abstract and compact than extractive ones. Surprisingly, it also shows that extractive explanations are much less sufficient than abstractive ones. Our manual inspection of insufficient explanations reveals that 100% of them do contain gold answer spans, but their interpretation depends on context from the input paragraphs that is not included in the explanations (e.g. pronoun referents). In contrast, pronouns in abstractive explanations can be replaced with their actual referents, which allows explanations to be more self-contained and compressed. F1 also improves given the more sufficient explanations.
The abstractive explainer generates more concise explanations. Now we turn to the proposed models. The results are shown in Table 2. Consistent with Table 1, it shows that SuQA generates more abstractive, compact and sufficient explanations than the extractive baseline. Examples of sufficient explanations generated by SuQA are shown in Table 4 (see §E in the Appendix for more outputs with full input paragraphs). They show that the abstractive explainer successfully captures information about important entities in the question (e.g. the bridging entity World War II in (b)). One may wonder why the F1 of SuQA is lower than that of the extractive baseline (-1.8 points) despite its more sufficient and compressed explanations, which is inconsistent with Table 1. To obtain further insights, we investigated the relation between the sufficiency of explanations and the correctness of answers in Table 3, where "Correct" means the number of instances with > 0.5 Answer F1. The extractive baseline arrives at correct answers from insufficient explanations more often (27/151=17.9%) than SuQA does (17/145=11.7%). This suggests that the QA module relies on task-unrelated lexical cues (so-called disconnected reasoning; Trivedi et al., 2020), and that such cues become unavailable in SuQA's more compressed explanations, which undesirably degrades the QA performance. We also experimented with SAE-large (Tu et al., 2020), one of the strong QA models on HotpotQA, and observed a similar trend. See §B in the Appendix for further details. We believe that QA performance will improve if one can successfully develop a QA model that performs less shortcut reasoning, which is an emerging research topic in the QA community.
The proposed model generates more correct answers with sufficient explanations. Our ultimate goal is to predict correct answers and to generate sufficient explanations. Here we investigate for how many instances we generate a sufficient explanation and predict the correct answer. Table 3 shows that SuQA gets more correct answers with sufficient explanations (128/145=88%) than the extractive baseline (124/151=82%). XF1 in Table 2 reflects this tendency and now tells a different story from conventional F1: the extractive baseline is now behind the proposed model.

RL helps generate concise explanations. As described in §5.3, we pretrain the AX with explanations before applying RL. How much does the additional RL help the AX generate more concise explanations? The results are shown in Table 2 (SuQA-NoRL vs. SuQA). They indicate that RL is important for obtaining more concise explanations in all aspects of conciseness.

Role of explanation supervision
It is costly to manually annotate QA datasets with abstractive explanations (Inoue et al., 2020). The natural question is then: how much supervision do we need to generate concise explanations?

Table 5: Ablation of training strategy. Pretraining on the summarization task plays an important role in generating concise explanations. Using the seq2seq loss L_ML during RL prevents generated explanations from deviating too much from gold explanations. †: evaluated only on 2,541 dev instances annotated with explanations.
We pretrain and apply RL using various sizes of explanation supervision (0, 298, 595, 1,190, 2,379) and plot each result in Fig. 3. Due to the cost of human evaluation, we evaluated 100 generated explanations at sizes 0 and 298 only, and plot RG2 as a proxy for human-judged sufficiency.
The results indicate that incorporating even 298 explanations has a large impact on both the conciseness of explanations and the QA performance.
Our human-judged sufficiency is 55.0 for size 0 and 66.0 for size 298. Even with zero explanation supervision, the explainer still generates concise explanations to some extent. This indicates that the task of generating abstractive explanations matches the pretrained summarizer's original task. Thus, even with such small amounts of data, the AX can learn to produce question-focused summaries that are useful for answering questions. To see the benefit of RL in low-resource settings, we also repeated the same procedure with SuQA-NoRL and plot how each evaluation measure changes from SuQA-NoRL to SuQA in Fig. 4. We observe that the benefit in F1 and RG2 is more pronounced in lower-resource settings, which indicates the importance of RL for generating concise explanations. See §C in the Appendix for the absolute performance of SuQA-NoRL.

Training strategy
Pretraining tasks We pretrain the AX on the summarization task (SUM) and the explanation generation task (XG) (§4.1). To investigate the contribution of each factor, we conduct ablation experiments in Table 5. It shows that the summarization task is the most contributing factor: without this pretraining, we obtain more compact explanations, but fatally, they are less similar to the gold explanations and lead to more incorrect answers.

Table 6: Insufficiency types of explanations generated by SuQA, with example questions (e.g., for 'No answer span': In which city was this band formed, whose rhythm guitarist featured in "Cupid's Chokehold"?), generated explanations, gold answers, and frequencies.
Seq2seq loss We include the seq2seq loss (L_ML) along with the RL loss (§4.2). To see its effect, we conduct ablation experiments in Table 5. Without the seq2seq loss, the generated explanations become more compact, but dissimilar to the gold standard explanations. We speculate that the seq2seq loss is important for keeping the search space of the AX close to the gold explanations.

Error analysis
When the model's prediction is wrong, there are two possibilities: (A) the generated explanation is insufficient, or (B) the generated explanation is sufficient, but the QAM fails to find the correct answer. Table 3 indicates that case A is more frequent (69.1% (38/55)) than case B (30.9% (17/55)). We thus randomly sampled and manually analyzed 30 insufficient explanations generated by SuQA in Table 6. First of all, we found that 43.3% (13/30) of the explanations contain no gold answer span ('No answer span'). Among the rest, the AX successfully mentions important entities, but fails to generate some related information such as an entity type ('Partially missing', 26.7% (8/30)). We also observed that the AX fails to provide important information bridging two entities, such as a family relation ('Bridge fact missing', 10.0% (3/30)), and sometimes the AX invents a new fact that is not mentioned in the original input paragraphs ('Fact invented', 3.3% (1/30)).
The error analysis highlighted that a major source of errors is the explainer failing to include answer spans in generated explanations. One can possibly enhance our architecture with one more pass: before generating explanations, the QAM predicts candidate answers based on questions and input paragraphs, and feeds them into the explainer.

Conclusions
We have proposed SuQA, an RC system augmented with an abstractive explainer component. Our experiments have demonstrated that the abstractive explainer can generate more concise explanations than an extractive explainer with limited supervision, while keeping explanations sufficient for QA.
One limitation of our work is that the QA module is trained separately from the explainer. One can jointly optimize the AX and QAM by extending our framework. Finally, our abstractive explainer explains what facts were used for answering questions, but does not explain the inference process. It would be an interesting research direction to extend our work by explaining how these facts are combined to arrive at the answer.

A Training detail
For all experiments, we used public implementations from huggingface's transformers library available at https://huggingface.co/. We used roberta-large for the paragraph ranker, distilbart-cnn-12-6 for the AX, and unifiedqa-t5-base for UnifiedQA-base.
For Reinforcement Learning, we used AdamW with a learning rate of 2e-6 and a batch size of 8. We clipped the minimum reward to -0.001. For sampling, we used a temperature of 0.4. To prevent overfitting, we used early stopping with a patience of 5. Specifically, we monitored the Answer F1 on the validation set every 4,096 training steps and stopped training if the best F1 was not updated for five consecutive evaluations. The RL training took 10h31m on a single GPU (DGX A100).
For pretraining the AX, we used AdamW with the learning rate of 8e-6 and the batch size of 16.
In all experiments, we used a linear learning rate scheduler with 10% warmup and trained the models for 5 epochs. For the learning curve, we monitored the Answer F1 every 128 steps for size 298, every 256 steps for size 595, and every 512 steps for sizes 1,190 and 2,379, and used early stopping with a patience of 5. We used 512 as the maximum length of input subwords for both the AX and QAM. We used 256 as the maximum generation length for the AX. We used greedy decoding for both the AX and QAM.
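The early-stopping schedule used above can be sketched as follows; `val_f1_history` stands for the Answer F1 values observed at successive validation checks:

```python
def early_stopping_steps(val_f1_history, patience=5):
    """Return the number of validation checks run before stopping.

    Mirrors the schedule in the text: validation Answer F1 is evaluated at
    a fixed interval, and training stops once the best F1 has not improved
    for `patience` consecutive checks. If the history is exhausted first,
    all checks are run.
    """
    best, since_best = float("-inf"), 0
    for step, f1 in enumerate(val_f1_history, start=1):
        if f1 > best:
            best, since_best = f1, 0
        else:
            since_best += 1
        if since_best >= patience:
            return step
    return len(val_f1_history)
```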

B Experiments with stronger QA model
We conducted an additional analysis with SAE-large (Tu et al., 2020), one of the large QA models top-ranked on the leaderboard. We downloaded a publicly available pretrained model and ran exactly the same experiments as in Tables 1, 2, and 3, where we used SAE-large as the QAM at test time only. Note that during training, we used UnifiedQA-base not finetuned on HotpotQA (see §5.3 for further details).
The results are shown in Table 7 and Table 8. Overall, they show the same trend as Tables 1, 2, and 3: (i) gold abstractive explanations yield higher F1; (ii) SuQA achieved better XF1 than the extractive baseline; and (iii) there are more correct answers led by insufficient explanations in the extractive baseline.

C Learning curve of SuQA-NoRL
To see the effectiveness of RL in low-resource settings, we investigated the performance change from SuQA-NoRL to SuQA in Fig. 4. Here we plot the absolute performance of SuQA-NoRL in Fig. 5.

D Human evaluation
We use Mechanical Turk as the crowdsourcing platform for human evaluation. We hired five annotators per Human Intelligence Task (HIT) and rewarded them with $0.15. Our preliminary experiments showed that it takes about one minute to finish one HIT, i.e. $9.00 per hour, which is above $7.25, the minimum wage in the United States. To ensure the quality of annotations, we used crowdworkers with ≥5,000 completed HITs and ≥99% approval rates. Among them, we manually found a pool of high-quality workers and used the same pool throughout the experiments. The instructions to crowdworkers are shown in Fig. 6 and Fig. 7, and the task interface is shown in Fig. 8.

E Example of generated explanations with full inputs
Examples of generated explanations and predicted answers along with their full input paragraphs retrieved by the paragraph ranker are shown in Table 9, Table 10 and Table 11. One example from Table 9:

Question: Who was born first, Krzysztof Zanussi or Thom Andersen?

Input paragraphs from the ranker:
[P1] Krzysztof Zanussi (born 17 June 1939) is a Polish film and theatre director, producer and screenwriter. He is a professor of European film at the European Graduate School in Saas-Fee, Switzerland where he conducts a summer workshop. He is also a professor at the Silesian University in Katowice.
[P3] Weronika Anna Rosati (born 9 January 1984) is a Polish actress and a member of European Film Academy. She began her acting career in Polish soaps. In 2005, she starred as Dżemma in her first theatrical feature film "Pitbull". A year later, she launched her international career with a small uncredited role in "Inland Empire" directed by David Lynch. Since then she has appeared in many critically acclaimed Polish and international productions. In 2013, she has received her first Polish Academy Award nomination for Best Actress for her role in "Obława" (2012). A year later, she starred alongside Agnieszka Grochowska in "Obce ciało" directed by Krzysztof Zanussi. She also had a recurring role in the HBO TV series "Luck" (2012).

Generated explanation: Krzysztof Zanussi is born on 17 June 1939. Thom Andersen is born on 1943.

Predicted answer: Krzysztof Zanussi