Reinforcement Replaces Supervision: Query focused Summarization using Deep Reinforcement Learning

Query-focused Summarization (QfS) deals with systems that generate summaries from document(s) based on a query. Motivated by the insight that Reinforcement Learning (RL) provides a generalization to Supervised Learning (SL) for Natural Language Generation, and thereby performs better (empirically) than SL, we use an RL-based approach for this task of QfS. Additionally, we resolve the conflict of employing RL in Transformers with Teacher Forcing. We develop multiple Policy Gradient networks, trained on various reward signals: ROUGE, BLEU, and Semantic Similarity, which lead to a 10-point improvement over the State-of-the-Art approach on the ROUGE-L metric for a benchmark dataset (ELI5). We also show the performance of our approach in a zero-shot setting on another benchmark dataset (DebatePedia): our approach leads to results comparable to baselines that were specifically trained on DebatePedia. To aid the RL training, we propose a better semantic similarity reward, enabled by a novel Passage Embedding scheme developed using the Cluster Hypothesis. Lastly, we contribute a gold-standard test dataset to further research in QfS and Long-form Question Answering (LfQA). GitHub repo: github.com/rl-qfs


Introduction
Query-focused Summarization (QfS) (Tombros and Sanderson, 1998; Dang, 2005; Nema et al., 2017) advances text summarization by letting the user provide a query, pertaining to which the summary must be generated from the document(s). Specifically, we target QfS for questions on a single document (see Table 1 for an example). Our work is related to Long-form Question Answering (LfQA) (Fan et al., 2019; Krishna et al., 2021; Su et al., 2022). However, it differs significantly from LfQA research in that LfQA must also focus on the retrieval of relevant passage(s), whereas our system assumes the presence of a document, which has to be summarized.
Query: What is String Pool in Java?
Document: String pool is nothing but a storage area in the Java heap where string literals are stored. It is also known as String Intern Pool or String Constant Pool. It is just like object allocation. By default, it is empty and privately maintained by the Java • • •
Summary: String Pool, used to reduce the memory footprint, is a specific area in the memory allocated to the process, used to store String literals declared within a Java program.

Problem Statement: We develop a system which abstractively generates summaries (Nallapati et al., 2016; Rush et al., 2015; Xu and Lapata, 2021, 2022) from a single document given the query. Specifically, input: query and document; output: summary. We design a novel Reinforcement Learning algorithm for training and propose a novel passage embedding-based reward function.

Motivation: QfS lets the user drive the summarization. This improves the user experience by letting the user extract the needed information quickly. Additionally, present-day QA systems (Calijorne Soares and Parreiras, 2020) produce only short answers to factual questions. What they lack is the ability to consume a large piece of information and present a coherent, compressed summary based on the needs of the user. Our work aims to further research in LfQA by successfully building a QfS system for questions.
We utilize Reinforcement Learning (RL) for training the model, as RL provides a generalized loss to train our model. While Cross-Entropy loss utilizes a token-by-token match, RL helps us generalize this further to phrase/sentence/semantic match. Such a generalization gives the model more freedom on what to generate at each time step, as long as the final generation suffices. Inspired by this, we employ RL for QfS. We detail how RL is a generalization of Supervised Learning (SL) with Cross-Entropy loss in Section 3.1.
However, employing RL for text generation using Transformers is non-trivial. RL helps train agents that perform a sequence of actions, each deciding the next state and the available set of actions.
In contemporary text generation through Transformers, Teacher Forcing (Williams and Zipser, 1989) is used, where the generation (action) at time-step t − 1 has no influence on the generation at time-step t. Effective utilization of RL for text generation needs to model this influence, thus necessitating the omission of Teacher Forcing. However, this increases the training time and memory footprint significantly. We propose a way to employ RL for text generation without Teacher Forcing, based on Scheduled Sampling (Bengio et al., 2015) (Section 3.1). To the best of our knowledge, ours is the first work to resolve this conflict of employing RL in Transformers with Teacher Forcing.
Our contributions are: 1. An RL algorithm that resolves the conflict of RL-based training in Transformers with Teacher Forcing. We observe significant improvement over the State-of-the-Art SL models: 37.2% improvement in automatic evaluations (Table 6) and 19% improvement in Correctness (human evaluation; Table 8).
2. A human-curated test set, 250 instances, devoid of Topic Centralization (one document can cater to multiple queries, Baumel et al. (2016)), for analysis of QfS models.
3. A novel passage-embedding-based reward mechanism (2.21 ROUGE-L improvement over the ROUGE reward; Table 6) to score generated summaries, trained using the time-honored Cluster Hypothesis (Jardine and van Rijsbergen, 1971).
4. A new dataset, ∼8 million instances, scraped from Reddit, to train a passage embedder model.

Related Works
Abstractive QfS: Supervised Learning approaches to abstractive QfS have been primarily limited by the availability of large-scale datasets. Dang (2005) present the first QfS dataset with open-ended questions. However, the size of the dataset is insufficient to train neural models. Xu and Lapata (2021, 2022) and Laskar et al. (2020) tackle the scarcity of data through innovative approaches. Xu and Lapata (2021) generate proxy queries from generic summarization datasets by using a masked representation to train a model for QfS. Xu and Lapata (2022) model the document as an indicator of all the possible queries on it, and develop a QfS model by modeling latent queries through variational inference techniques. Laskar et al. (2020) take the approach of Transfer Learning: they fine-tune an abstractive summarizer, trained on XSum (Narayan et al., 2018), on a decent-sized (12,695 instances) dataset: DebatePedia (Nema et al., 2017). Despite the presence of DebatePedia, researchers have mainly focused on leveraging generic summarization datasets due to its size limitation. We circumvent this limitation by utilizing ELI5 (Fan et al., 2019). Specifically, we utilize the version by Fan et al. (2019), not the KILT version (Petroni et al., 2021), as the latter does not provide gold documents to generate the summary from. Although it is a dataset for LfQA, the format of the dataset makes it suitable for our use case: QfS for questions on a single document.
ELI5 is an automatically curated dataset with a few shortcomings, which make it unsuitable for testing models (Krishna et al., 2021). Motivated by these shortcomings, we propose a novel, human-curated test set for QfS on well-formed questions, consisting of 250 high-quality instances. RL for Summarization/QfS: The usage of RL in QfS has been limited to extractive summarization only (Mollá and Jones, 2019; Mollá et al., 2020; Chali and Mahmud, 2021; Shapira et al., 2022). Mollá and Jones (2019), Mollá et al. (2020), and Shapira et al. (2022) use RL to train sentence selector models, which select sentences to be incorporated into the summary. Chali and Mahmud (2021) present a hybrid summarization, where an extractive module selects text from the document, which is then used by the abstractive module to generate an abstractive summary. However, they use RL for the extractive module only. To the best of our knowledge, we are the first to utilize RL for abstractive QfS.
We find abundant precedent for RL in abstractive summarization. Paulus et al. (2017) is the first work employing RL to train LSTM models, without Teacher Forcing, for abstractive summarization. Pasunuru and Bansal (2018) and Li et al. (2019) provide follow-up works with better reward functions. Both works train LSTM models without Teacher Forcing. Laban et al. (2020) were the first to employ RL in a Transformer architecture for abstractive summarization. They fine-tune the GPT2-small (Radford et al., 2019) model without Teacher Forcing. However, the omission of Teacher Forcing led to a training time of 10 days for a generation length of just 10 tokens. This highlights the severity of the problem that arises from omitting Teacher Forcing and using sequential generation during training, and motivates the need for our approach to effectively incorporate RL.

Passage Embedding/Similarity:
A straightforward way to obtain paragraph/passage embeddings is by applying some compositionality to the embeddings of the constituent words (Kenter et al., 2016; Hill et al., 2016; Sinoara et al., 2019; Iyyer et al., 2015). Yang et al. (2020) and Jiang et al. (2019) attempt to generate passage embeddings by training dual encoder networks to match similar documents. They rely on citation and recommendation networks to derive whether or not two passages are similar. Obtaining passage similarity/embeddings is a common task in Dense Passage Retrieval (DPR) (Karpukhin et al., 2020). Ren et al. (2021) utilize the Cluster Hypothesis (Jardine and van Rijsbergen, 1971) to enhance passage retrieval. However, they do not train any passage embedding network separately. Ginzburg et al. (2021) utilize pairwise sentence similarity to obtain similarity between two paragraphs. They use a noisy training dataset where paragraphs from the same document are considered similar and those from different documents are considered dissimilar. We are primarily motivated by the work of Vikraman et al. (2021). They use the Cluster Hypothesis to group similar passages together in the DPR framework, for downstream tasks. Our work differs from Vikraman et al. (2021) in the following aspects: we explicitly train our model to output passage embeddings, and we train on a much larger dataset.

Modeling
In this section we provide the formulation for QfS and Passage Embedding. Section 3.1 discusses the application of Policy Gradient to QfS. Section 3.2 discusses our approach to obtain passage embeddings. We present architectures in Appendix G.

Policy Gradient for QfS
We model the task using the Reinforcement Learning (RL) framework. The Reward Hypothesis (Sutton and Barto, 2018; Silver et al., 2021) states that all goals can be formulated as the maximization of a reward signal. Following this, our goal is to represent QfS adequately using a reward. Equations 1 and 2 specify the losses from Maximum Likelihood Estimation (MLE; used by Supervised Learning through cross-entropy loss) and Policy Gradient training. Comparing the two equations, it is visible that MLE rewards only a token-by-token match. We take the RL path to generalize this rewarding mechanism using more sophisticated lexical and/or semantic rewards (Table 2). Terminology: 1[x] evaluates to 1 if x is true, else 0; y*_k denotes the token generated at time-step k; τ is the trajectory obtained from the sequence of token generation; q and d denote the query and document representations, respectively; and n is the length of the generation.
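For concreteness, the two losses compared above can be written as follows. This is a reconstruction consistent with the terminology in this section (the exact notation of the paper's Equations 1 and 2 may differ slightly):

```latex
% Equation 1 (MLE / cross-entropy): rewards an exact token-by-token match
% against the ground-truth token y_k at every time-step.
L_{\mathrm{MLE}} = -\sum_{k=1}^{n} \sum_{y \in V}
    \mathbb{1}[y = y_k] \, \log p_\theta\!\left(y \mid y_{<k}, q, d\right)

% Equation 2 (Policy Gradient): any scalar reward R(\tau) on the sampled
% trajectory \tau, with the greedy-decode baseline b (Paulus et al., 2017).
L_{\mathrm{PG}} = -\left(R(\tau) - b\right) \sum_{k=1}^{n}
    \log p_\theta\!\left(y^*_k \mid y^*_{<k}, q, d\right)
```

Setting R(τ) to an indicator of exact sequence match recovers a loss of the MLE flavor, which is the sense in which Policy Gradient generalizes the cross-entropy objective.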
We formulate QfS in the RL framework (the <S, A, R, P> tuple) as follows: 1. The agent (encoder-decoder network) exists in a state (S), defined by the thus-far generated summary, the input query, and the input document.
2. It takes an action (A) by sampling a token (y*_k) from the probability distribution over the vocabulary for the next time-step. The generation of this probability distribution constitutes the state transition dynamics (P).
3. The end of the episode (the sequence of generation) is reached either through the generation of the end-of-sequence token or by reaching the maximum episodic length.
4. At the end of the episode the agent receives a reward (R), which is then used by the Policy Gradient algorithm (Equation 2) to update the policy. We use the greedily decoded sequence to compute the baseline reward (b in Equation 2), following Paulus et al. (2017).
However, it is intractable to run this vanilla RL framework for long episodes (∼150 tokens; Appendix G, Table 10) when the policy has a huge number of parameters. It leads to long forward and backward passes through the network, in addition to large memory requirements. Refer to Appendix E for a space and time complexity analysis. We simulate sequential action sampling through scheduled sampling for Transformers (Mihaylova and Martins, 2019). Following this strategy, we conduct two passes through the decoder: the second pass uses the sampled embeddings, obtained using the Gumbel reparametrization trick (Goyal et al., 2017). This enables gradient backpropagation through the first pass of the decoder too. We find that an implementation where gradients backpropagate through both passes leads to better results in automatic evaluation (Table 6). With this formulation, we train the policy network using a mixed objective loss (Equation 3, η = 0.1), following Paulus et al. (2017).
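The two-pass scheme above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `decoder` interface and shapes are assumptions, and `F.gumbel_softmax` with `hard=True` stands in for the Gumbel reparametrization trick, giving straight-through gradients through the sampling step.

```python
import torch
import torch.nn.functional as F

def two_pass_decode(decoder, memory, tgt_embeds, embed_matrix, tau=1.0):
    """Two-pass scheduled sampling for a Transformer decoder (sketch).

    Pass 1 runs a standard teacher-forced decode. Its logits are turned into
    differentiable "sampled" token embeddings via Gumbel-softmax, and pass 2
    decodes again conditioned on those sampled embeddings, so gradients flow
    through both passes.
    """
    # Pass 1: teacher-forced decoding over the target embeddings.
    logits1 = decoder(tgt_embeds, memory)                        # (T, V)
    # Gumbel-softmax: hard one-hot samples with straight-through gradients.
    soft_onehot = F.gumbel_softmax(logits1, tau=tau, hard=True)  # (T, V)
    # Turn sampled one-hots into token embeddings (differentiably).
    sampled_embeds = soft_onehot @ embed_matrix                  # (T, E)
    # Pass 2: decode conditioned on the model's own sampled tokens.
    logits2 = decoder(sampled_embeds, memory)                    # (T, V)
    return logits1, logits2
```

Because both passes process the whole sequence in parallel, training avoids the token-by-token decoding loop that made earlier RL fine-tuning (e.g. Laban et al., 2020) so slow.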

Passage Embedding and Semantic Similarity
Contemporary works rely on composing sentence/word embeddings to obtain passage embeddings. This forms an approximate representation of the generated and the ground-truth summary; thus, relying on these representations leads to an approximate notion of similarity. We amend that by creating a passage embedder using the Cluster Hypothesis (Jardine and van Rijsbergen, 1971), which states: Passages that are clustered together answer to similar information needs. According to the hypothesis, summaries of the same query on the same document should be clustered together and have a high similarity score. We use this insight to reward the generated summaries while training, using the cosine similarity between the generated passage vectors. We find that similarity obtained from this embedding generation scheme leads to better results in human and automatic evaluation (Tables 6 and 8). We use the Cluster Hypothesis for training our passage embedding model, using a dual encoder architecture (Appendix G). Our training scheme considers a special token (for example, [CLS] in BERT) to represent the passage. We calculate the similarity, ŷ, between passages using the dot product of the embeddings (E_p and E_q) produced by the dual encoder architecture (Equation 4). This causes similar E_p and E_q to lie closer in the embedding space. While training, we attempt to reduce the cross-entropy loss (Equation 5) between ŷ and y (the true labels; 1 for similar passages, else 0).

We use the trained passage embedding model to obtain embeddings for the generated summaries and the ground-truth summaries in a batch while training the QfS model. We use cosine similarity as a reward signal in the RL framework.
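The objective in Equations 4 and 5 can be sketched as below. One assumption here: the dot-product similarity is squashed through a sigmoid before the binary cross-entropy, which is a common choice but is not stated explicitly in the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dual_encoder_loss(E_p, E_q, y):
    """Sketch of the passage-embedder objective (Equations 4-5).

    E_p, E_q: (B, D) batches of [CLS]-style passage embeddings from the
    two encoders. y: (B,) labels, 1 for similar passages, else 0.
    """
    # Equation 4: similarity as a dot product of the two embeddings
    # (sigmoid squashing is our assumption).
    y_hat = sigmoid(np.sum(E_p * E_q, axis=1))
    eps = 1e-12  # numerical safety for the logs
    # Equation 5: binary cross-entropy between y_hat and y.
    return -np.mean(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))
```

Minimizing this loss pushes embeddings of similar passages toward high dot products (closer in the embedding space) and dissimilar ones apart, which is exactly what the cosine-similarity reward later relies on.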

Datasets
We use three datasets in this work: (a) ELI5 (Fan et al., 2019), (b) Reliable QFS Tester, RQFT (our contribution), and (c) Reddit Passage Embedding DataseT, RPEDT (our contribution). In the following discussion, we provide a brief overview of the ELI5 dataset, then discuss RQFT in Section 4.1 and RPEDT in Section 4.2. Fan et al. (2019) proposed ELI5 for Long-form Question Answering (LfQA). It consists of triplets in the format <query, document, answer>, where the query is a well-formed, open-ended question, and the answer is not a span from the document. Given the nature of the query and the answer, it fits perfectly in our problem statement. The dataset contains 234,420, 9,930, and 24,820 samples for training, validation, and testing respectively.

Reliable QFS Tester (RQFT) Dataset
We note two shortcomings of the ELI5 dataset: 1. Overlap of instances between the training and validation sets, reported by Krishna et al. (2021).
2. <query, document, answer> triplets in the ELI5 test set where the document is irrelevant to the query. We observed this shortcoming while analyzing the trained models on random samples from the test set.

Reward | Description | Formula
ROUGE-L | Recall-oriented reward to improve coverage | ROUGE_L(GT, GN)
BLEU | Precision-oriented reward to generate concise summaries | BLEU(GT, GN)
SimCSE (Gao et al., 2021) | Semantic match obtained by averaging sentence embeddings | cos(mean_m(E_S), mean_n(E_T))
SBERT (Reimers and Gurevych, 2019) | Semantic match obtained by averaging sentence embeddings | cos(mean_m(E_S), mean_n(E_T))
SFPEG | Semantic match obtained using Passage Embedding | cos(E_GT, E_GN)
Table 2: Various rewards used to train the RL models. We use both lexical and semantic rewards to promote lexical and semantic similarity in generation. SFPEG: Similarity From Passage EmbeddinG. ROUGE_L and BLEU denote the standard ROUGE-L and BLEU functions; mean denotes average; cos denotes cosine similarity; S denotes a sentence from the Ground Truth (GT) summary; T denotes a sentence from the Generated (GN) summary.
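To make the lexical rewards of Table 2 concrete, here is a toy ROUGE-L, built on the longest common subsequence between the ground-truth and generated token sequences. This is an illustrative sketch only; the paper uses the standard ROUGE-L implementation, and the whitespace tokenization and beta value here are assumptions.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f(reference, candidate, beta=1.2):
    """Toy ROUGE-L F-score over whitespace tokens (recall-weighted)."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_len(ref, cand)
    if lcs == 0:
        return 0.0
    rec, prec = lcs / len(ref), lcs / len(cand)
    # Recall-oriented F-measure: beta > 1 weights recall more heavily.
    return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)
```

A recall-oriented score like this rewards covering more of the reference content, which is consistent with the verbosity of the ROUGE-trained models discussed in the analysis sections.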
Motivated by these, we curate a dataset manually, without any automation scripts, to ensure quality. We curate the dataset from two sources: (i) Wikipedia and (ii) high school textbooks. Both are excellent sources for a variety of concepts. We explain how the dataset is curated in Appendix K.
Table 3 presents statistics on the dataset. We can see that each document corresponds to more than one query on average. This was an intentional decision made while curating the dataset, to tackle topic centralization (Baumel et al., 2016). We include an example from the dataset in the Appendix (Table 17). We establish the quality of the dataset in Appendix L.
While the size of RQFT is small, such a size for evaluation is not unprecedented for summarization (Angelidis and Lapata, 2018; Chu and Liu, 2019; Bražinskas et al., 2020; Amplayo et al., 2021; Angelidis et al., 2021). Despite the small size, our dataset acts as a high-quality benchmark for QfS, covering 13 domains (listed in Appendix K) and devoid of topic centralization.

Reddit Passage Embedding DataseT (RPEDT)
The Cluster Hypothesis states: Passages that are clustered together answer to similar information needs. Subreddits contain data in the format of posts (queries) and comments (answers to the posts, often multiple in number). We scrape this data, forming a repository of queries and multiple answers to each query. We gather data from 39 subreddits; Appendix F describes the data-gathering scheme and lists the subreddits. While training, we transform the dataset into the following format: <p, q>, where p and q are passages that may or may not answer the same query. We perform in-batch negative random sampling to generate q (one q per p) that does not answer the same query as p.
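The pair-construction scheme above can be sketched as follows. The function name and the exact positive-pairing rule (pairing consecutive answers to the same post) are assumptions for illustration; only the overall scheme (same-query answers as positives, randomly drawn other-query answers as negatives) comes from the text.

```python
import random

def make_training_pairs(query_to_answers):
    """Build <p, q, y> triplets from scraped (post -> comments) data.

    Answers to the same post are treated as positives (y = 1), following
    the Cluster Hypothesis; for each post we also draw one answer to a
    different post as a negative (y = 0).
    """
    pairs = []
    queries = list(query_to_answers)
    for qid in queries:
        answers = query_to_answers[qid]
        # Positive pairs: answers to the same information need.
        for i in range(len(answers) - 1):
            pairs.append((answers[i], answers[i + 1], 1))
        # One random negative: an answer to some other query.
        other = random.choice([x for x in queries if x != qid])
        pairs.append((answers[0], random.choice(query_to_answers[other]), 0))
    return pairs
```

In practice the negatives would be drawn within each training batch rather than ahead of time, but the resulting <p, q, y> supervision is the same.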

Experiments and Results
In this section we provide details on our experiments. Section 5.1 presents the results obtained on all the experiments. Finally, Section 5.2 presents the results obtained from human evaluation on RQFT. SL represents our implementation of the model trained by Lewis et al. (2020) for ELI5.

Table 5: Index of all the models trained in our work. All models except BART SL have been trained using Reinforcement Learning; the rewards are listed in the adjacent column. BART SL has been trained using Cross-Entropy loss (Supervised Learning).
As the training and primary analysis involve only ELI5, we move the DebatePedia results to the Appendix (Appendix D). In EXPT-III, we replace the true document with another random document. The motivation behind this experiment was to rigorously test whether the trained models summarize the document based on the query, or generate tokens by ignoring the document altogether. We use the Fan et al. (2019) test set for evaluation. We test the performance of the trained models using Beam Search (beam size = 15, minimum tokens = 64, maximum tokens = 256). We find that these generation conditions work best (based on automatic evaluation). We present the results for all the experiments in Table 6. We also present the average length of generations in Table 7. Using these generation parameters, we see that the BART SL model obtains a better ROUGE-L (by 1.14 points) than the one presented by Lewis et al. (2020). We attribute this gain to the usage of Scheduled Sampling, leading to more robustness in generation than a model trained using Teacher Forcing.
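The decoding setup above (beam search with a minimum and maximum generation length) can be illustrated with a toy beam search. The `step_fn` interface and all names are made up for the sketch; a real system would call the decoder's softmax at each step.

```python
import heapq
import math

def beam_search(step_fn, bos, eos, beam_size=15, min_tokens=2, max_tokens=8):
    """Minimal beam search sketch with min/max length constraints.

    step_fn(seq) returns a dict token -> probability for the next step.
    Beams are (negative log-probability, token sequence) pairs; lower is
    better, so heapq.nsmallest keeps the most probable beams.
    """
    beams = [(0.0, [bos])]
    finished = []
    for _ in range(max_tokens):
        candidates = []
        for nll, seq in beams:
            if seq[-1] == eos:            # beam already ended
                finished.append((nll, seq))
                continue
            for tok, p in step_fn(seq).items():
                if tok == eos and len(seq) - 1 < min_tokens:
                    continue              # enforce the minimum length
                candidates.append((nll - math.log(p), seq + [tok]))
        if not candidates:
            break
        beams = heapq.nsmallest(beam_size, candidates)
    finished.extend(b for b in beams if b[1][-1] == eos)
    best = min(finished or beams)         # lowest total negative log-prob
    return best[1]
```

With beam size 15 and a minimum of 64 tokens, the real models are pushed away from degenerate short outputs, which interacts with the verbosity effects discussed in the Analysis section.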
We can see in Table 6 that the models trained using Reinforcement Learning (RL) perform significantly better than BART SL (the baseline), with a 10.62-point improvement on ROUGE-L for the ELI5 test set. We can also see that the models obtain significantly better results than the baseline on our test dataset (EXPT-II), achieving as high as a 14.89-point improvement on ROUGE-L. The RL-based models achieve better scores on our dataset as compared to the ELI5 test set; however, we fail to see such significant gains for BART SL. We present a possible reason for this in Section 6.3. Also, we see that the scores for EXPT-II and EXPT-III are closer for BART SL, in comparison to the RL-based models. This indicates two things: 1. The BART SL model generates content relevant to the query, irrespective of the document (random or true), but the RL models actually utilize the provided document. This shows that the RL models truly learn the task of QfS, justifying the significant boost (∼10 points).
2. While Krishna et al. (2021) point out several shortcomings of the ELI5 dataset, it can be seen that, using this dataset, RL models can learn to summarize according to a query given the true document.

Human Evaluation Results
We present human evaluations on the Fluency and Correctness of our models. We use two annotators for human evaluation. We ask them to rate Fluency on a Likert Scale ("Very Poor", 0 to "Very Good", 4). For Correctness, we follow a YES (1) or NO (0) marking scheme. We randomly sample 50 instances from the RQFT dataset, and present the generated summaries and the human-written summaries to the annotators. The summaries are anonymized, such that the evaluators have no understanding of the source: human written, automatically generated, or even the generator model. Table 8 presents the results from human evaluation. We can see a similar trend as in Table 6: the BART R-SFPEG model obtains the highest Fluency and Correctness.

Table 7: Length of generations (# of words) from the models. We use NLTK to tokenize the generations. We provide an analysis of how the RL models utilize more tokens for better summaries in Section 6.1.
We also compute the IAA scores for Fluency and Correctness for the evaluations of all the models. For Fluency, we map the scores to ratings: Positive (3 and 4), Neutral (2), and Negative (0 and 1), and compute the fraction of times the annotators agree on the ratings. We observe that the annotators agree almost always, with a minimum agreement of 0.88 (for BART SL). For Correctness, we use Cohen's kappa to report the agreement. We observe moderate agreement (McHugh, 2012), with a minimum score of 0.61 (for BART R-SEM). Through hypothesis testing, we validate that the BART R-SFPEG model is significantly better than the BART SL model. We find that BART R-SFPEG is significantly more correct than BART SL, at a 10% significance level (p-value = 0.077). In terms of Fluency, BART R-SFPEG is significantly more fluent than the BART SL model at a 1% significance level (p-value = 0.00012).
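The two agreement measures above can be sketched in a few lines; the bucketing of Likert scores follows the mapping stated in the text, while the function names are our own.

```python
def cohen_kappa(r1, r2):
    """Cohen's kappa for two annotators' labels (e.g. binary Correctness).

    Observed agreement p_o minus chance agreement p_e, normalized by 1 - p_e.
    """
    n = len(r1)
    po = sum(a == b for a, b in zip(r1, r2)) / n
    cats = set(r1) | set(r2)
    pe = sum((r1.count(c) / n) * (r2.count(c) / n) for c in cats)
    return (po - pe) / (1 - pe)

def fluency_agreement(s1, s2):
    """Fraction of items where 0-4 Likert scores fall in the same coarse
    rating: Negative (0-1), Neutral (2), Positive (3-4)."""
    bucket = lambda x: 0 if x <= 1 else (1 if x == 2 else 2)
    return sum(bucket(a) == bucket(b) for a, b in zip(s1, s2)) / len(s1)
```

Note that kappa is undefined when both annotators use a single category for every item (p_e = 1), which is one reason raw agreement is used for the near-unanimous Fluency ratings.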

Analysis
Our initial analyses on the ELI5 test set revealed a recurring problem: our models (all of the 7 models) would copy the query into the generated summary.
On further probing, we discovered that this correlated highly with the absence of content in the document relevant to the given query. Frequent occurrences motivated us to create a manually curated test set for reliable analyses. In this section we present a detailed analysis of our models. In Section 6.1 we present a comparative analysis of all the models, in Section 6.2 we analyze the models when multiple queries are posed on the same document, and in Section 6.3 we analyze our models when the document is replaced with a random document. Our analysis reveals that the RL models actually learn QfS, as compared to the BART SL model, which has a tendency to ignore the provided document. This justifies the significant boost (∼10 points) in performance.

Comparative Analysis
All the Reinforcement Learning (RL) based models generate self-contained, easy-to-follow summaries.
On the other hand, the BART SL model generates very crisp summaries, too frugal with words (Table 7). The RL-based models present a lot of relevant material in the generated summary, but the BART SL model jumps straight to the point. While the latter can often be desirable, the tendency to be too frugal with words can leave the user wanting more. We include an example in Appendix A to highlight the differences. We believe that this difference is a result of using ROUGE as a reward signal: ROUGE leads the model to choose more of the relevant content during training in order to increase the reward. However, we also observe a downside: as ROUGE rewards the same word stems, we see multiple tokens with the same stem, discussed in Appendix A. Although the BART R-SFPEG model does not use ROUGE as a reward, we see that it also produces verbose generations. This is also understandable: choosing more relevant content increases the semantic match too. We also note that the BART SL model hallucinates (Ji et al., 2022) significantly more than all the RL-based models, despite its crispness, which contributes to its lower Correctness score (Table 8). We highlight this phenomenon in Appendix A.

Multiple Queries Same Document
We also test the abilities of the models when they are probed with different queries on the same document. We observe that the Reinforcement Learning (RL) based models perform better in this arena too. Appendix B includes an example highlighting the difference between the BART SL and an RL-based model. We see that, although the BART SL model manages to generate different summaries for the two queries, they are either hallucinated or mostly unrelated to the query, whereas the RL-based model manages to understand the necessities of the queries and generates summaries for them.

Analysis for Random Document
Section 5.1 motivates the need for this experiment. Quantitative results show that BART SL does not have as significant a drop in performance as the Reinforcement Learning (RL) based models when we use a random document. While analyzing the generations, we observe that the BART SL model ignores the document. Although the generated text stays relevant to the query, it cannot be called a query-focused summary of the document, as the content is absent in the document. This leads us to believe that BART SL does not truly learn to solve QfS, which also explains the relatively insignificant change in scores when the quality of the dataset is improved (ELI5 test set vs RQFT, Table 6).
We include an example of generation from a random document for BART SL and an RL-based model in Appendix C. We see that the RL-based model generates content from the document, based on whatever it understands from the query, which strongly confirms our belief that the RL-based models learn a much better solution to QfS.

Conclusion and Future Work
In this work we present Reinforcement Learning (RL) based models for the task of QfS. We observe that these models perform significantly better than a model with the same architecture trained using Supervised Learning (SL). We also observe that the RL-based models generate summaries with much less hallucination than the Supervised Learning model. Additionally, we resolve the conflict of employing RL in Transformers with Teacher Forcing, by utilizing Scheduled Sampling. We present a novel reward derived using the Cluster Hypothesis. Through this work, we also contribute two datasets to the community: RPEDT, to train passage embedding models, and RQFT, a gold-standard test dataset for analyses of QfS models.
The takeaway from our work is that RL is a better framework to tackle QfS. We observe that RL helps the model learn QfS much better than SL, even with straightforward lexical rewards such as ROUGE. We also conclude, from the results, that the Cluster Hypothesis leads to much better semantic feedback than competing passage embedders.
Training RL models over a long horizon (greater generation length) poses challenges in terms of exploration of the action space and temporal credit/blame assignment (Jiang and Agarwal, 2018). In future work, we will focus on tackling this challenge. Our motivation is that it would reduce convergence time and provide even better results.

Ethics Statement
We present two new datasets in our work: RPEDT and RQFT. While curating RQFT, we attempted to refrain from including any sensitive topics in our dataset, based on the knowledge of the curators and the quality-check annotators. However, we cannot guarantee such claims for RPEDT. We understand that passage embedding models trained on the dataset can learn biases and stereotypes present in RPEDT. Hence, we urge the community to use our passage embedding models and RPEDT with caution.
We utilize a total of 4 annotators to conduct quality-check experiments and human evaluation. Two annotators are post-graduates and the other two are under-graduates in the Computer Science department of an organization (medium of communication: English). The annotators have been paid at an agreed-upon rate, sufficient for their geographical location.

Limitations
The most striking limitation of our work is the necessity of huge computational power, which has restricted us from extensive experimentation on hyperparameter search. In light of such limitations, we report results from a single run. We also acknowledge the presence of hyperparameter configurations that could lead to better performance of the models. However, we note that this limitation does not carry over to inference, and our models can be deployed relatively cheaply for public use.
Another limitation of our work is that the passage embedding models have been trained on relatively limited data gathered from only one source (Reddit). We acknowledge that training on varied sources can lead to better representations of passages; however, keeping time and resource constraints in mind, we train our models on data gathered from only one source.

A Comparison of Generations from the Models
We present a randomly sampled instance from our test dataset in Table 18: we include the query and the generations from all the models; we omit the document for readability. We can see that the BART SL model generates a very crisp response to the query, stating what population growth is. However, it fails to respond to the other aspect of the query: What does population change indicate for an area? It also moves on to generate an ill-stated, hallucinated (Ji et al., 2022) example (absent in the document). We note the absence of such hallucination in the Reinforcement Learning (RL) models. However, we note that only 3 of the 5 RL-based models are able to incorporate both aspects of the query. We also note digressions in the RL-based models, with BART R-SEM generating significantly more unrelated content. We observe that this is a general trend: BART R-SEM often generates content very haphazardly. We hypothesize that this is caused by using an approximate semantic similarity scheme as a reward signal.
We also note repeated generations (not necessarily consecutive) with the same word stems (such as percent, leading to repeated percentage in the BART R-SFPEG generation) in the RL-based models trained using ROUGE as a reward.

B Comparison of Models for Multiple Queries on Same Document

We present two randomly sampled instances, two queries with the same document, from our dataset in Table 19: we include the query and the generations from the BART SL and BART R models only; we omit the document for readability.
For the first query, BART SL hallucinates that the Green Revolution has not helped India, even though the document states otherwise. The BART R model picks up the key arguments from the document and indeed presents a decent summary pertaining to the Green Revolution's help to India.
For the second query, BART SL ignores the query altogether and generates only loosely related sentences, while BART R picks up key information related to Buffer Stock and again articulates a satisfactory summary.

C Comparison of Models for QfS with a Random Document
We present two random samples from our test set, where the documents have been replaced by random, unrelated documents, in Table 20. We see that for the given query, BART SL and BART R receive unrelated (and different) documents. BART SL manages to generate content very much related to the query, which indicates an unfairly strong influence of the query on the generation, so much so that the document gets ignored. However, BART R generates content from the document, indicating that the document indeed has appreciable influence on its generation, which is the ideal case in QfS.

D Results on DebatePedia
We generate results on the test set of DebatePedia using our best RL model, BART R-SFPEG. We use the model trained on ELI5 directly to generate results on the test set of DebatePedia. We observe that we perform at least as well as models trained/fine-tuned on DebatePedia, demonstrating the generalization of our approach.

E Time and Space Complexity Analysis
We present an analysis for the attention layer only, as it is the bottleneck for space utilization and time consumption as input length increases. Under Teacher Forcing, an L-layered decoder utilizes O(n^2) time and space for predicting the next token given the previous n tokens, since all n positions are processed in a single parallel pass. If decoding is done without Teacher Forcing, that is, the n-th token is generated and then passed back through the decoder as an input to generate the (n+1)-th token, then generating a sequence of n tokens requires n sequential forward passes, for a total time of Σ_{i=1}^{n} O(i^2) = O(n^3). On the other hand, if we employ the two-pass decoding using Scheduled Sampling, the time and space utilized is 2 · O(n^2) = O(n^2), the same order as Teacher Forcing.
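The gap between the decoding schemes can be sanity-checked with a small counting sketch. The function names below are our own illustration (not the paper's code); we count attention-score computations for a single layer and a single head.

```python
# Hypothetical counting sketch: number of attention-score computations for
# one layer, one head, under the three decoding schemes discussed above.

def teacher_forced_ops(n: int) -> int:
    # Single parallel pass over n tokens: token i attends to tokens 1..i.
    return sum(i for i in range(1, n + 1))  # O(n^2)

def free_running_ops(n: int) -> int:
    # Without Teacher Forcing, emitting token i+1 re-runs a forward pass
    # over the i tokens generated so far: a sum of O(i^2) terms, i.e. O(n^3).
    return sum(teacher_forced_ops(i) for i in range(1, n + 1))

def scheduled_sampling_ops(n: int) -> int:
    # Two-pass decoding with Scheduled Sampling: two parallel passes,
    # which stays at O(n^2), the same order as Teacher Forcing.
    return 2 * teacher_forced_ops(n)
```

The ratio between the free-running and two-pass counts grows linearly in n, which is why the two-pass scheme is the one that remains practical at summary-length outputs.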

F RPEDT Scraping Scheme
We generate RPEDT by scraping posts (queries) and comments (answers) from 39 subreddits. Table 21 presents the list of subreddits we use to generate the dataset. We gather data within the timeframe July 2011 to July 2022. We set the following criteria to filter the gathered posts and comments: 1. We consider a post (and its related comments) gatherable if and only if the post has a score of at least 2.
2. We consider a comment (under a gatherable post) if and only if the comment has a score of at least 2.
After scraping the dataset, we discard posts with no associated comments and comments with fewer than 50 words. Finally, we divide the dataset into train and test sets as follows: we place all the posts (and associated comments) from the explainlikeimfive subreddit into the test set, and use the rest for training.
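The filtering criteria above can be sketched as follows. The data layout and names (`filter_posts`, `MIN_SCORE`, etc.) are our assumptions for illustration, not the authors' actual scraping code.

```python
# Illustrative sketch of the RPEDT filtering criteria described above.

MIN_SCORE = 2           # criteria 1 and 2: minimum post/comment score
MIN_COMMENT_WORDS = 50  # comments shorter than this are discarded

def filter_posts(posts):
    """posts: list of {"score": int, "comments": [{"score": int, "text": str}]}."""
    kept = []
    for post in posts:
        if post["score"] < MIN_SCORE:
            continue  # post not gatherable
        comments = [
            c for c in post["comments"]
            if c["score"] >= MIN_SCORE and len(c["text"].split()) >= MIN_COMMENT_WORDS
        ]
        if comments:  # discard posts left with no associated comments
            kept.append({**post, "comments": comments})
    return kept
```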

G Architecture Details
QfS: We train a Sequence-to-Sequence model to generate summaries given the query and the document. Our model uses BART (Lewis et al., 2020) as the backbone to obtain representations of the query and document, and finally generate the summary. Figure 1 illustrates the input to and output from the architecture. The query and document are concatenated and fed to the encoder, and the decoder generates the summary. We use the pretrained BART-large model as the starting point for our QfS model.
Table 10 reports statistics on the lengths of inputs and outputs. Accordingly, we increase the length of the positional embeddings of the BART model in our implementation to 1568 (approximately equal to µ_valid + 2σ_valid, where µ and σ denote the average and standard deviation of the encoder input length, respectively; Table 10). The first 1024 positions of the new positional embeddings are initialized using the positional embeddings of the pretrained BART model.
Passage Embedding: We train an encoder-only model to generate passage embeddings. Specifically, we fine-tune the BERT-large model (Devlin et al., 2019) to output passage embeddings through the [CLS] token. Figure 2 depicts the dual encoder scheme used to fine-tune the BERT model. We use Equations 4 and 5 to train the model. In addition to the passage similarity task, we also employ the Masked Language Modelling (MLM) task during fine-tuning. We observed that the fine-tuned model experienced an increased validation perplexity for MLM, as compared to the pretrained BERT-large model. We mirror the MLM training schemes suggested by Devlin et al. (2019) during fine-tuning. We keep the default maximum length of BERT (512 tokens), as the average generations from the QfS model are expected to be ∼150 tokens.
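The positional-embedding extension can be sketched as below. This is a minimal numpy illustration, not the actual BART implementation (which stores the embeddings as a learned parameter tensor); the embedding dimension (32) and init scale (0.02) here are toy assumptions for speed.

```python
import numpy as np

# Sketch of extending positional embeddings from 1024 to 1568 positions:
# copy the pretrained rows into the first 1024 slots, freshly initialize
# the remaining rows.

rng = np.random.default_rng(0)

def extend_positional_embeddings(pretrained: np.ndarray, new_len: int = 1568) -> np.ndarray:
    old_len, dim = pretrained.shape
    extended = rng.normal(0.0, 0.02, size=(new_len, dim))  # new positions
    extended[:old_len] = pretrained                        # reuse pretrained rows
    return extended

old = rng.standard_normal((1024, 32))  # toy dim; BART-large uses 1024
new = extend_positional_embeddings(old)
```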

H Training Details for QfS
We use the pretrained BART-large model (12 layers, 16 attention heads, 1024-dimensional embeddings) as our starting point. We train on 8 x A100 machines, with all the models taking a total of 21 days to train until convergence. Table 11 outlines the hyperparameters used for training. We train all the models without any explicit learning rate scheduler, and find that this leads to faster convergence in our experiments.

I Training Details for Passage Embedding
We fine-tune both the BERT-base and BERT-large (both cased) pretrained models. We train on 2 x A100 machines, with the base model taking a total of 64 hours and the large model taking a total of 186 hours.

M Quality Check: RPEDT
In order to validate the quality of RPEDT, we provide <Q, P 1 , P 2 > triplets to annotators, where P 1 and P 2 answer the query Q. We ask them to rate how well each passage (P 1 and P 2 ) answers the query, on a scale of 0 (bad) to 2 (good). We employ two annotators for the task, who rate a total of 707 triplets (79 common). Table 16 presents the results from the annotation. We measure agreement by noting the number of times they rate passages with the same score. We observe that the annotators agree 78.4% of the time on the rating for P 1 and 86% of the time on the rating for P 2 .
Query: Why do stars twinkle? And, why don't planets twinkle?

document: The twinkling of a star is due to atmospheric refraction of starlight. The starlight, on entering the earth's atmosphere, undergoes refraction continuously before it reaches the earth.
The atmospheric refraction occurs in a medium of gradually changing refractive index. Since the atmosphere bends starlight towards the normal, the apparent position of the star is slightly different from its actual position. The star appears slightly higher (above) than its actual position when viewed near the horizon. Further, this apparent position of the star is not stationary, but keeps on changing slightly, since the physical conditions of the earth's atmosphere are not stationary. Since the stars are very distant, they approximate point-sized sources of light • • •
Summary: Stars appear to twinkle due to atmospheric refraction. The light of a star, after entering the earth's atmosphere, undergoes refraction continuously till it reaches the surface of the earth. Stars are far away, so they are point sources of light. As the path of light coming from the stars keeps changing, the apparent position of the stars keeps changing, and the amount of light from the stars entering the eye keeps fluctuating. Due to this, a star sometimes appears bright and sometimes dim, which is the effect of twinkling. The planets are much nearer to the earth than the stars, and because of this, they can be considered large sources of light. If a planet is considered a collection of a very large number of point sources of light, then the average value of the change in the amount of light entering the eye from all the point-sized light sources is zero. Due to this, the effect of twinkling is nullified.
BART SL: Population growth refers to the change in number of inhabitants of a territory during a specific period of time. It can be expressed either in absolute numbers or in terms of percentage of the total population. For example, let's say you have a region with a population of 1,000 people. Over the next 10 years, there will be an increase in population growth as more people move in and out of that area. This means that the population will increase by 10%.
BART R: Population change refers to the change in number of inhabitants of a territory during a specific period of time. This change may be positive as well as negative. Population growth occurs not only by increasing births rate but also due to decreasing death rate. There are two ways by which the population size changes - births and deaths. The crude birth rate (CBR) is a simple method of measuring the rate of births in a particular region. It can be expressed either in terms of absolute numbers or in numbers of live births per thousand of population. Similarly, the mortality rate is an important indicator of the region's demographic structure, social advancement and levels of its economic development.
BART R-SFPEG: Population change refers to the change in number of inhabitants of a territory during a specific period of time. This change may be positive as well as negative. Population growth occurs not only by increasing births rate but also due to decreasing death rate. Population change is an important indicator of economic development, social upliftment and historical and cultural background of the region. It can be expressed either in absolute numbers or in terms of percentage. Percentage is the percentage of population in a particular geographical area that is increasing or decreasing over a time period as a function of its demographic structure, socio-economic advancement and other factors such as its geographical location, population density, age, sex ratio, etc. (%) Population growth is a dynamic process that affects the population size over time as it changes from one place to another.
BART SFPEG: Population change refers to the change in number of inhabitants of a territory during a specific period of time. This change may be positive as well as negative. Population growth occurs not only by increasing births rate but also due to decreasing death rate. Population change is an important indicator of economic development, social upliftment and historical and cultural background of the region. It can be expressed either in absolute numbers or in terms of percentage. There are two ways by which the population size changes - births, deaths and migration. The crude birth rate (CBR) is expressed as the ratio of live births in a year per thousand of population. Conversely, the mortality rate is a simple method of measuring mortality of any area.
BART R-SEM: By and large mortality rates are affected by the region's demographic structure, social advancement and levels of its economic development. Population growth occurs not only by increasing births rate but also due to decreasing death rate. • • • When people move from one place to another places, their population may increase or decrease depending on their place of origin and the place they move to is called the place of destination. Population change refers to the change in the total population of a territory during a specific period of time. Population change is an important indicator of economic and social development, socio-economic status and historical and cultural background of the territory.
BART R-B: Population change refers to the change in number of inhabitants of a territory during a specific period of time. This change may be positive as well as negative. Population growth occurs not only by increasing births rate but also due to decreasing death rate. There are different ways by which the population size changes in a region depending on the region's demographic structure, social advancement and levels of its economic development. It can be expressed either in absolute numbers or in terms of percentage. The crude birth rate (CBR) is a simple method of measuring mortality of any area and it is expressed in numbers of live births per thousand of population for a particular region.
Table 18: Comparison of generations from all the trained models. The query has two aspects, color-coded by teal and violet. Text segments in the generations are colored accordingly, based on which part of the query they answer.
Query One: How has green revolution helped India, if at all? Query Two: Why is a buffer stock created by the government?
BART SL: I don't think the green revolution has helped at all. The food insecure people are disproportionately large in some regions of the country because of poverty, tribal and remote areas, regions more prone to natural disasters etc. In fact, the states of Uttar Pradesh (eastern and south-eastern parts of India), Orissa, Andaman and Nicobar Islands, and West Bengal have been experiencing famine since the 1970's.
BART SL: If there is a shortage of food in a region, the government can buy it from the market at a lower price than what people are willing to pay for it, and then distribute it to the people who need it. This way, they don't have to worry about people going hungry all the time.
BART R: The food insecure people are disproportionately large in some regions of the country, such as economically backward states with high incidence of poverty, tribal and remote areas, regions more prone to natural disasters etc. After Independence, Indian policymakers adopted all measures to achieve self-sufficiency in food grains. India adopted a new strategy in agriculture, which resulted in 'Green Revolution', especially in the production of wheat and rice. The farmers are paid a pre-announced price for their crops. This price is called Minimum Support Price (MSP). The MSP is declared by the government every year before the sowing season to provide incentives to farmers for raising the productivity of these crops and to encourage them to increase their production in order to reduce their dependence on imported foodgrains (such as wheat, rice, maize, sugar, pulses, oilseeds etc.).
In the mid-1960s, wheat was the most productive crop in India and it was grown in a large number of small and medium-scale farms in rural areas and in small towns and villages. These small farms were able to produce large quantities of foodgrain and the farmers were rewarded with a subsidy of Rs 1,000 per acre per hectare.
BART R: Since the advent of the Green Revolution in the early-1970s, the country has avoided famine even during adverse weather conditions. India is aiming at self-sufficiency in foodgrains since Independence. The attainment of food security therefore involves eliminating current hunger and reducing the risks of future hunger. After Independence, Indian policymakers adopted all measures to achieve self-sufficiency in food grains. This system has two components: (a) buffer stock, and (b) public distribution system. Buffer Stock is the stocks of grains procured by the government through the Food Corporation of India (FCI). The FCI purchases wheat and rice from the farmers in states where there is surplus production. These grains are sold to the consumers at a price lower than the market price of these grains, which is known as the Minimum Support Price (MSP). In the case of wheat or rice, it is sold at the MSP of Rs 2,000 per tonne and in case for rice at Rs 1,500/tonne of rice the price is Rs 3,200/towards Rs 4,300/tons for the same amount of grain. There are a buffer price to be paid to farmers for procuring the grains at these prices.

Figure 1: Architecture for the Query focused Summarizer.
Figure 2: Dual encoder scheme used to fine-tune the BERT model for passage embeddings.

Table 1 :
Example of Query-focused Summarization for a well-formed question. The document is shortened (marked by • • •) for space.

Table 3 :
Statistics of the RQFT dataset; we use NLTK to tokenize the strings.
Table 4 highlights the statistics; the last row denotes the total number of <p, q> samples generated during training. Our quality check experiment (Appendix M) validates the quality of RPEDT.
Table 5 lists all the trained models.

Table 4 :
Statistics of RPEDT; we use NLTK to tokenize into words. Q: Question, A: Answer, W: Words.

Table 6 :
Comparison of our models. For the purposes of comparison, we also provide the results obtained by Fan et al. (2019) and Lewis et al. (2020). We use the ROUGE metrics ROUGE-1 (R-1), ROUGE-2 (R-2), and ROUGE-L (R-L) to compare the models on both the ELI5 test dataset (EXPT-I) and RQFT (EXPT-II: with true document; EXPT-III: with random document). We use BART SL as the baseline (denoted by b) to judge the efficacy of training models under the Reinforcement Learning framework. BART SL is our implementation of Lewis et al. (2020); hence, row 3 represents the results for Lewis et al. (2020) for EXPT-II and EXPT-III.

Table 8 :
Results from human evaluations of the machine-generated text. The scores are reported in A / B format, where A is the average score computed from ratings by Annotator 1 and B is the average score computed from ratings by Annotator 2.

Table 9 :
Results on the DebatePedia test set.

Table 11 :
Values of various hyperparameters used while training QfS models.Effective batch size is the number of samples passed through the model before parameter updates.SL: Supervised Learning, RL: Reinforcement Learning.

Table 12 :
Values of various hyperparameters used while training Passage Embedding model.Effective batch size is the number of samples (positive labelled instances + in-batch sampled negative labelled instances) passed through the model before parameter updates.

Table 15 :
Annotation guidelines for the quality check of RPEDT.

Table 16 :
Average ratings provided by the two annotators for P 1 (R P1 ) and P 2 (R P2 ), higher is better.

Table 17 :
Example of a randomly sampled <query, document, summary> triplet from the annotated dataset.
Query: What is population growth? What does population change indicate for an area?

Table 19 :
Comparison of BART SL and BART R in the case of multiple queries over the same document. Text in red highlights the hallucinated segment in the generation.