PROTEGE: Prompt-based Diverse Question Generation from Web Articles



Introduction
In a data-rich era, identifying, extracting and generating responses to users' questions has become the next challenge. While search engines provide a simple interface for users to get responses to their queries, getting answers to complex queries still remains a challenge (Krishna et al., 2021). As a result, specialized knowledge bases that extract and store question-answer pairs have become prevalent.
Many applications rely on a knowledge base of generated question-answer pairs to provide reliable, accurate and close-to-human-quality information to their users. For instance, a key frustration in online shopping is the difficulty of identifying the right product for one's requirement. For high-consideration products such as laptops and smartphones, customers at times lack the human touch that they would otherwise experience in an offline store, where trained sales agents can explain the features of each product and provide high-level guidance to select the right one. The sales agent can proactively query the customer to understand her requirement, help refine her needs and finally recommend the right products. To bridge the gap between the online and offline shopping experience, multi-turn goal-oriented dialog systems, also known as chatbots, offer a promising direction. Chatbots help users familiarize themselves with technical concepts, acquire domain knowledge and get recommendations for products that they are likely to buy, closely mimicking the offline shopping experience.
Towards curating a large-scale knowledge bank, Large Language Models (LLMs) (Devlin et al., 2019; Raffel et al., 2020; Brown et al., 2020) have shown remarkable success in learning in-depth knowledge from data. They do so without access to any external memory, as the knowledge is embedded in the model parameters. While this is fascinating, on the downside the model may hallucinate (Marcus, 2020) and generate answers that are factually incorrect. As a result, modern chatbot systems employ a Retrieval-Augmented Generation (RAG) architecture (Lewis et al., 2020) that has two components: a) an encoder-decoder network that performs natural language understanding of user queries and language generation, and b) a back-end knowledge base (KB) that indexes relevant bits of information for the task at hand. The encoder maps user inputs to a dense representation which is used to query the KB and retrieve evidence. The evidence, as well as the input, is fed to the decoder to generate the final response.
Scalable generation of knowledge bases is fundamental to the success of the RAG and the underlying chatbot application. In this work, we investigate automatic generation of a knowledge base. Unlike traditional RAG systems (Lewis et al., 2020; Izacard et al., 2022) that index webpages and large documents, we look for a knowledge base in the form of question-answer pairs. Not only does this help us improve the accuracy of evidence retrieval, but it also allows rich applications to be easily built on top of it, e.g., suggesting related questions, navigation of the KB etc., for improved customer experience. Our focus is on educational questions, i.e., questions that help users familiarize themselves with product concepts ("What is the difference between SSD and HDD?") and use-case guidance ("What is the recommended configuration for a gaming laptop?").
Question generation is the task of generating questions given a paragraph of text as input. Question generation quality can be attributed to two characteristics: a) fidelity, which measures the semantic coherence of the generated questions and our ability to answer them from the input paragraph, and b) diversity, which measures lexical and semantic dissimilarity between generated questions. Many previous works (Rajpurkar et al., 2016, 2018; Kwiatkowski et al., 2019) have addressed the task of generating questions from text. While it is essential to generate questions that are of high fidelity, for knowledge-base completion it is imperative to have a diverse question set. To promote diversity, current question generation models rely on beam search. The resulting set, however, contains many structurally similar questions with minor lexical changes that warrant the same answer. There has been prior work (Elhamifar et al., 2012; Song et al., 2018; Vijayakumar et al., 2018) in NLP on diversity. In particular, Song et al. (2018) address diversity via Determinantal Point Processes (DPPs) for neural conversation models; this approach can be adapted to the question generation task.
While these approaches are helpful in maximizing diversity, they fall short in terms of generating high-fidelity output. Naively borrowing these techniques may allow the model to hallucinate and generate questions that are not answerable from the input paragraph. Addressing the task of diverse question generation through the lens of monotone sub-modular functions (Bach, 2013) alleviates this problem and provides additional benefits. On one hand, this formulation provides flexibility in controlling the diversity and fidelity of the output. On the other hand, we can leverage a well-known greedy algorithm (Nemhauser et al., 1978) to generate a near-optimal set of questions, therefore increasing yield and quality simultaneously.
We propose PROTEGE (PROmpT-based divErse question GEneration), a diverse question generation framework which consists of two stages: (1) a novel encoder-decoder based LLM architecture which can take a variety of prompts and generate a diverse set of candidate questions, and (2) a greedy hill-climbing algorithm that maximizes a sub-modular objective function to balance diversity with fidelity. We demonstrate that PROTEGE improves diversity by +16% and fidelity by +8% while also improving text generation metrics over strong baselines. Our experiments on three popular public Q&A datasets indicate that PROTEGE consistently outperforms both diverse beam search-based and prompt-based baselines.

PROTEGE: Prompted Question Generation
Question generation models take a source context x, represented as a sequence of sentences, and encode it with a Transformer network (Vaswani et al., 2017). The encoder consists of N_e layers, where each layer contains a self-attention and a feed-forward block. The encoder takes an input x ∈ R^{B×S×F} and passes it through all the encoder blocks to generate an output h = ENCODER(x) ∈ R^{B×S×F}. The n-th encoder block is a Transformer layer x^(n) = TRANSFORMER(x^(n−1)), which takes the input x^(n−1) from the previous layer and generates the output x^(n). The encoder blocks are applied in sequence and finally we get the output h = x^(N_e). To generate the (i+1)-th output word y_{i+1}, we take the previous words y_{≤i} = y_1 ··· y_i and the encoder output h and pass them through N_d decoder blocks. Each decoder block contains a self-attention, a cross-attention and a feed-forward layer. The decoder blocks are also applied in sequence and at the final layer the decoder emits the next word y_{i+1} = DECODER(y_{≤i}, h).
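The sequential application of encoder blocks and the autoregressive decoding loop described above can be sketched as follows. This is a toy schematic with stand-in block functions; the names `encode` and `decode` and the toy blocks are illustrative, not the paper's implementation:

```python
# Schematic of the encoder stack and autoregressive decoding loop.
# The "blocks" are stand-ins for real Transformer layers.

def encode(x, encoder_blocks):
    """Apply the N_e encoder blocks in sequence: h = x^(N_e)."""
    h = x
    for block in encoder_blocks:
        h = block(h)
    return h

def decode(h, decoder_step, start_token, max_len):
    """Autoregressively emit y_{i+1} from (y_1..y_i, h)."""
    y = [start_token]
    for _ in range(max_len):
        y.append(decoder_step(y, h))
    return y

# Toy example: "blocks" that add 1; a "decoder step" that reads h and the prefix.
blocks = [lambda v: v + 1, lambda v: v + 1]
h = encode(0, blocks)
out = decode(h, lambda y, h: h + len(y), "<s>", 3)
```

The structure mirrors the text: the encoder is a pure composition of layers, while the decoder consumes its own previous outputs at each step.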

Controlled Generation
For controlled generation of questions, we feed the input document along with various types of prompts to the encoder. This requires some modification to the standard encoder-decoder architecture. We use two encoders: one for the document (or, the context) and the other for the prompt signals (Dou et al., 2021). Similar to the Transformer architecture, each encoder has 1 + N_e layers, where the first N_e layers use shared parameters Θ_e for the context and the prompt. The final encoder layer consists of an additional Transformer block for the context and prompt inputs with individual parameters Θ_c and Θ_p, respectively. More specifically, given context x_c and prompt x_p, the encoder computes

h_c = TRANSFORMER_{Θ_c}(ENCODER_{Θ_e}(x_c)),  h_p = TRANSFORMER_{Θ_p}(ENCODER_{Θ_e}(x_p)).

Our decoder attends to both the context and prompt representations h_c, h_p. We achieve this by modifying the standard architecture as follows. Unlike a standard decoder, each decoder layer attends to both the context and prompt embeddings from the encoder via cross-attention layers, and their combined output is fed to the feed-forward network. More specifically, each decoder layer performs the following computation:

y' = LN(y + SA(y))    (1)
d_c = CA(y', h_c),  d_p = CA(y', h_p)
y'' = LN(y' + MIXUP(d_c, d_p))
out = LN(y'' + FF(y''))

Here LN, SA, CA, FF are abbreviations for layer-norm (Ba et al., 2016), self-attention (Vaswani et al., 2017), cross-attention (Vaswani et al., 2017) and feed-forward layers. MIXUP is an aggregation layer that combines the context and prompt cross-attention outputs, MIXUP(d_c, d_p) = λ · [d_c, d_p]. We propose various ways to implement MIXUP: a) treat λ as a tunable hyper-parameter, or b) learn λ as a free parameter or via attention (Lin et al., 2017) weights. Note that by setting λ = [1, 0], we recover the standard decoder. More details about the architecture choice are described in Appendix A.
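The MIXUP aggregation in the tunable-hyper-parameter variant reduces to an elementwise convex-style combination of the two cross-attention outputs. A minimal sketch, with toy vectors standing in for the actual attention outputs:

```python
def mixup(d_c, d_p, lam=(0.5, 0.5)):
    """Combine context and prompt cross-attention outputs elementwise:
    MIXUP(d_c, d_p) = lam[0]*d_c + lam[1]*d_p (fixed-hyper-parameter variant)."""
    return [lam[0] * c + lam[1] * p for c, p in zip(d_c, d_p)]

d_c = [1.0, 2.0]  # toy context cross-attention output
d_p = [3.0, 4.0]  # toy prompt cross-attention output

# Setting lam = (1, 0) ignores the prompt branch, recovering the standard decoder.
assert mixup(d_c, d_p, lam=(1.0, 0.0)) == d_c
```

In the learned variants, `lam` would instead be a trainable parameter or be computed from the attention outputs themselves.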

Prompt Signals
We use two types of prompts: a) keyword-based: we define an entity dictionary based on domain-knowledge search keywords such as brands, features etc.; entities from this dictionary can be used as a prompt; b) sentence-based: we identify informative sentences from the context and use them as the prompt input. Further, there are two strategies to compute the prompts: a) HEURISTIC: extract prompts from the context based on manually defined rules or ML models; b) ORACLE: extract prompts from both the context and the ground-truth question. Note that the ORACLE strategy requires the ground-truth and hence can be used only during training, whereas HEURISTIC can be used during both training and inference. While using ORACLE during training and HEURISTIC in inference leads to a train-test mismatch in the distribution of prompts, our hypothesis is that it will help establish a strong correlation between the prompts and the generated questions. In Appendix B we describe the various prompt signals we have used for our experiments.
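The contrast between the two strategies can be sketched for keyword-based prompts. The entity dictionary and the naive substring matcher below are illustrative stand-ins for the domain-specific dictionary and extraction rules described in the text:

```python
# Hypothetical domain dictionary (stand-in for the paper's entity dictionary).
ENTITY_DICT = {"ssd", "hdd", "ram", "gaming laptop"}

def extract_entities(text):
    """Return dictionary entities appearing in the text (case-insensitive)."""
    t = text.lower()
    return {e for e in ENTITY_DICT if e in t}

def heuristic_prompt(context):
    # HEURISTIC: looks only at the context, so it is usable at both
    # training and inference time.
    return extract_entities(context)

def oracle_prompt(context, gold_question):
    # ORACLE: may also look at the ground-truth question, so it is
    # usable only during training.
    return extract_entities(context) | extract_entities(gold_question)
```

The ORACLE prompt is a superset of the HEURISTIC prompt here, which is what lets it tie the generated question more tightly to the ground-truth during training.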

Balancing Diversity and Fidelity of Questions
At the end of the first stage of PROTEGE, we have generated a diverse set of questions by varying the prompt input to the model. However, in practice some of these questions may be irrelevant, i.e., they cannot be answered from the current context. In the second stage, we leverage an algorithm that selects a subset of questions which is both relevant and diverse. Let us assume that the previous step has generated a candidate question set Q for a context D. We seek a subset Q′ ⊆ Q that maximizes the objective

ζ(Q′, D) = relevance(Q′, D) + η · diversity(Q′).

Here η is a hyper-parameter that balances the relevance (i.e., fidelity) and the diversity. We discuss various choices for implementing the diversity and relevance functions. The relevance of a question set Q′ is determined via answerability, i.e., how likely each question can be answered from the given context. The answerability of a question set is

relevance(Q′, D) = Σ_{q∈Q′} AE(q, D),

where AE is an answerability model built on top of standard LLM encoders such as BERT (Devlin et al., 2019). We use n-grams to define the diversity of a question bank. Let z_n(q) denote the set of n-grams in q after removing stop-words. We define diversity as

diversity(Q′) = |∪_{q∈Q′} z_n(q)|.

Note that the diversity expression promotes unique n-grams across questions and has been used as a standard metric to measure the diversity of text generated by LLMs in prior works such as (Zhang et al., 2018). It can be noted that the diversity function is sub-modular (Bach, 2013), which makes the objective function ζ(Q′, D) sub-modular as well. Although maximization of a sub-modular function is NP-Hard, it is well known that the algorithm that greedily picks each item has a provably good approximation guarantee (Nemhauser et al., 1978).
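The greedy selection over this sub-modular objective can be sketched as below. The stop-word list is a toy stand-in, and the `relevance` scores stand in for the BERT-based answerability model AE:

```python
# Minimal sketch of the second-stage greedy selection over the
# sub-modular objective: relevance + eta * diversity.

STOPWORDS = {"a", "an", "the", "is", "what", "how", "of"}  # toy list

def ngrams(question, n=1):
    toks = [w for w in question.lower().split() if w not in STOPWORDS]
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def diversity(questions, n=1):
    """Count of unique n-grams across questions: a sub-modular set function."""
    grams = set()
    for q in questions:
        grams |= ngrams(q, n)
    return len(grams)

def greedy_select(candidates, relevance, k, eta=0.5):
    """Greedily add the question with the largest marginal gain
    relevance(q) + eta * (diversity gain), which enjoys the classic
    (1 - 1/e) approximation guarantee (Nemhauser et al., 1978)."""
    selected, pool = [], list(candidates)
    for _ in range(min(k, len(pool))):
        best = max(pool, key=lambda q: relevance[q]
                   + eta * (diversity(selected + [q]) - diversity(selected)))
        selected.append(best)
        pool.remove(best)
    return selected
```

Note how a question with lower answerability can still be picked first if it contributes many new n-grams; η controls exactly this trade-off.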

Datasets
A supervised dataset for the question generation task typically consists of question and answer pairs along with a "context" input. In order to prove the efficacy of our approach for the specific domain of shopping guidance, we curate a custom dataset, termed SEARCHQA, by extracting QA pairs from a third-party search engine. We submit customized shopping guidance queries to the search engine and extract questions, answer snippets and URLs from the search results. We further pre-process the extracted content to form question, answer, context triplets. We also leverage three popular benchmark Q&A datasets, namely (1) SQuAD 2.0 (Rajpurkar et al., 2018), (2) the Natural Questions (NQ) dataset (Kwiatkowski et al., 2019), and (3) MS MARCO (Nguyen et al., 2016a). In Appendix C we describe the pre-processing logic used to create the question, answer, context triplets from the raw datasets. Table 7 in Appendix C lists the dataset statistics.

Implementation details
Our models are based on the popular T5 (Text-to-Text Transfer Transformer) (Raffel et al., 2020) architecture. T5 models closely follow the encoder-decoder Transformer implementation originally proposed in (Vaswani et al., 2017) with minor modifications. For baseline models (section 3.3), we use the vanilla T5ForConditionalGeneration implementation from the HuggingFace Transformers library (Wolf et al., 2019). For our prompt-based controlled generation models, we extend the vanilla implementation (as described in section 2.1) by including (1) an additional encoder for the prompt input which shares parameters with the original encoder, and (2) a new cross-attention block in the decoder which is initialized with pre-trained weights from the original cross-attention block. Hyper-parameter settings. To make it feasible to train a large number of models, for all our experiments we use the t5-small variant with 60M parameters as the base implementation. We use a learning rate of 5e-5 and an epsilon of 1e-8 with the AdamW optimizer. We use a sequence length of 512. We train all models for up to 10 epochs with a training batch size of 4 and choose the checkpoint with the best performance on the validation set. We train our models on a single GPU of an AWS EC2 instance with 64GB of GPU memory.
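For reference, the hyper-parameter settings above can be collected into a single configuration block (values taken directly from the text; the dictionary keys are our own naming):

```python
# Training configuration used for all experiments, per the text above.
TRAIN_CONFIG = {
    "base_model": "t5-small",     # 60M-parameter T5 variant
    "optimizer": "AdamW",
    "learning_rate": 5e-5,
    "adam_epsilon": 1e-8,
    "max_sequence_length": 512,
    "train_batch_size": 4,
    "max_epochs": 10,             # best checkpoint chosen on the dev set
}
```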

Baselines
BASELINE-BEAM
Only the context is passed as input (without any additional prompts) and Diverse Beam Search (DBS) (Vijayakumar et al., 2016) is used to generate the top-k questions. The diversity parameters num_beam_groups and diversity_penalty are fine-tuned by optimizing for diversity metrics through a grid search.
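The grid search over the DBS diversity parameters can be sketched generically. Here `generate` stands in for a decoding call parameterized by num_beam_groups and diversity_penalty, and `score` for a diversity metric such as Dist-1; both are hypothetical placeholders:

```python
from itertools import product

def grid_search(generate, score, param_grid):
    """Return the parameter setting whose generated questions maximize
    the diversity score. param_grid maps parameter name -> list of values."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        s = score(generate(**params))
        if s > best_score:
            best_params, best_score = params, s
    return best_params
```

In practice, `generate` would wrap the decoding call on a held-out set and `score` would aggregate a metric like Dist-1 over its outputs.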

Metrics
Diversity metrics. We evaluate on two popular n-gram based lexical diversity metrics: (1) Distinct-n (Li et al., 2016), which measures the percentage of unique n-grams out of the total number of n-grams in a set of generated questions; we report the Dist-1, Dist-2 and Dist-3 metrics; (2) Entropy-n (Zhang et al., 2018), which measures how even the n-gram distribution is for a given question set. These two metrics are popularly used in the literature to evaluate the lexical diversity of generated responses (Zhang et al., 2018; Han et al., 2022; Stasaski and Hearst, 2022; Tevet and Berant, 2020). To measure semantic diversity we report a BERTScore (Zhang et al., 2020), measured as the average BERTScore of each pair of generated questions. BERTScore measures the semantic similarity between a pair of generated sentences; hence, the lower the average BERTScore, the better the diversity. Fidelity metrics. To report a fidelity (or "answerability") metric, we train a separate BERT-based model that takes a context and a question and outputs a probability score for the question being answerable from the context. The ROC AUC of this BERT model was observed to be 0.84. We tune the threshold of this model to operate at a precision of 85%, corresponding to a recall of 30%. A higher bar on the precision allows us to select questions which are highly likely to be answerable from the context, at the cost of missing out on other answerable questions. We compute the answerability score for each generated question and report the average. NLG metrics. Finally, to evaluate the "closeness" of the generated questions with respect to the ground-truth questions, we also report standard NLG metrics popular in the literature, namely: (1) METEOR (Banerjee and Lavie, 2005), which is measured as a harmonic mean of unigram precision and recall; (2) BLEU-4, a cumulative 4-gram BLEU (Papineni et al., 2002) score, which evaluates matching n-grams of specific orders (1-gram, 2-gram etc.);
(3) ROUGE-L, a version of ROUGE (Lin and Och, 2004) which measures the longest common subsequence (LCS) between the generated and reference text.
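The two lexical diversity metrics are straightforward to implement; the following is a self-contained sketch consistent with their standard definitions (whitespace tokenization here is a simplification):

```python
import math
from collections import Counter

def _ngrams(text, n):
    toks = text.lower().split()
    return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]

def distinct_n(questions, n):
    """Distinct-n: unique n-grams / total n-grams over the question set."""
    grams = [g for q in questions for g in _ngrams(q, n)]
    return len(set(grams)) / len(grams) if grams else 0.0

def entropy_n(questions, n):
    """Entropy-n: Shannon entropy of the n-gram distribution; higher means
    the n-grams are spread more evenly across the set."""
    counts = Counter(g for q in questions for g in _ngrams(q, n))
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```

Distinct-n rewards sets that avoid repeated n-grams, while Entropy-n additionally penalizes skew: a set that repeats one n-gram many times scores lower even if it also contains many unique ones.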

Results
Diversity results. PROTEGE generates more unique questions than BASELINE-PROMPT. Thus, given the same context, PROTEGE generates a higher number of unique questions which are better both in terms of diversity and fidelity, compared to the baselines.
For the benchmark datasets, we observe in Table 1 that across datasets PROTEGE improves on all the diversity metrics (Dist-n & Ent-n) when compared to both baselines. For example, on the Dist-1 metric, compared to the second best (which is consistently BASELINE-PROMPT), PROTEGE shows an improvement of 17%, 13% and 12%, respectively, for SQUAD, NQ and MS MARCO. On fidelity, compared to the second best, PROTEGE performs 9%, 9% and 4% better, respectively, for SQUAD, NQ and MS MARCO. A significant reduction is also observed on BERTScore. PROTEGE generates 1 to 3 more unique questions (on average) than BASELINE-PROMPT.
NLG results. We present metrics separately for the top-1, top-2 and top-3 generated questions. The metric for top-k is computed using the question (among the top-k) which results in the maximum METEOR score with reference to the ground-truth question. As described earlier, for BASELINE-BEAM we use beam search to generate the top-k questions, while for BASELINE-PROMPT we pick the top-k questions based on the generation score. For PROTEGE we select the top-k questions returned by our second-stage algorithm (section 2.3), which greedily selects the question that maximizes diversity and fidelity. For the SEARCHQA dataset, in table 2 we observe that for the top-1 question the best metrics are obtained from the BASELINE-BEAM model. From our model's point of view this is expected, as the topmost question is selected based on the diversity and fidelity objectives and hence need not be closest to the reference ground-truth. However, as we allow PROTEGE to select more questions (top-2 and top-3), the model often generates a question closer to the ground-truth, which shows in the top-2/top-3 results where PROTEGE does better than both baselines in matching the reference. In other words, if we allow top-2 questions, PROTEGE shows the best performance with an improvement of 1.1% in METEOR (but shows second-best performance in BLEU-4 and ROUGE-L). Similarly, for top-3 questions the corresponding improvements are +1.7% and +0.2% for the METEOR and BLEU-4 scores. We observe similar trends for SQUAD among the benchmark datasets.
For the NQ and MS MARCO datasets, although PROTEGE shows a significant improvement over the baselines on diversity metrics, improvements are not observed on NLG metrics. We explain our hypothesis for this observation in Appendix D.
Human evaluation. We performed a human evaluation to compare the quality of the top-k generated questions between PROTEGE and BASELINE-BEAM. Annotators were asked to label each set of generated questions (for a given context) with respect to: a) Readability (no. of readable & meaningful questions), b) Diversity (no. of semantically unique questions), and c) Fidelity (no. of questions answerable from the context). In figure 2 we observe that PROTEGE improves on BASELINE-BEAM with an absolute improvement of 5% on readability, 32% on diversity and 36% on answerability. Appendix I describes the details of the human audits.

Ablation studies
Effect of prompt signals. For the SEARCHQA dataset we experiment with a variety of (keyword-based and sentence-based) prompt signals as described in section 2.2. Across all prompt choices, PROTEGE does better than BASELINE-BEAM on all metrics. Answer text (with a context span of size 1 during inference) performs the best on the BERTScore and Fidelity metrics and second best on the Dist-1 metric. Question entities shows the best performance on the Dist-1 metric, which is due to the fact that the model is trained to generate a question around the specific entity passed as a prompt. Based on these results, we typically use answer text as the preferred choice for the prompt. Detailed metrics are in Appendix F.
Effect of ORACLE prompting. Across datasets, ORACLE prompting yields the best performance in terms of matching the ground-truth question (refer to figure 4 and table 11 in Appendix G). This ablation shows the efficacy of our architecture in incorporating the prompt when generating a question, i.e., providing the "exact" prompt elicits a question which is relatively closer to the ground-truth.
Effect of greedy algorithm. As described in section 2.3, our algorithm takes the candidate set of questions generated in the first stage (prompt-based controlled generation) and in the second stage runs a greedy algorithm, at each step optimizing for both diversity and fidelity. (In Table 4, Pre-Greedy denotes the output of prompt-based controlled generation (section 2.1), while Post-Greedy denotes the output of the greedy hill-climbing algorithm (section 2.3).)
In table 4 we see that, as an effect of this greedy algorithm, across datasets both diversity and fidelity metrics show a marked improvement. On average, the post-greedy Dist-1 metric improves by 13% and Fidelity improves by 9%. Further, in Appendix H we show the effects of the greedy algorithm on all the diversity and NLG metrics.
Diversity versus fidelity. Our algorithm to balance the diversity and fidelity of questions (section 2.3) allows us to control the trade-off between diversity and fidelity through the η parameter. Figure 3 shows how controlling the η parameter allows us to operate at different points for diversity and fidelity. A low η results in high fidelity, while a high η results in high diversity. For our experiments we used an η around 0.5 to achieve the right trade-off. Table 8 in Appendix E shows the full metrics as a result of varying η.

Qualitative study
In Table 5 we provide qualitative examples of questions generated by PROTEGE compared with the BASELINE-BEAM output. Due to paucity of space we do not include the context input which is fed to both models. The second column shows the output of BASELINE-BEAM given the context input alone. The third column is a sample of the prompts which are fed to the PROTEGE model (along with the context). We have shown examples of prompt keywords (e.g., first row), as well as prompt sentences (e.g., second row). Finally, the last column shows the output of the PROTEGE model given the prompt and context as input. We observe that, given a context, PROTEGE leverages the prompts effectively in generating diverse questions when compared to the BASELINE-BEAM output. Especially when sentences are passed as prompts, they often appear to be answers to the generated question.

Related Work
Rule-based (Heilman and Smith, 2010; Fabbri et al., 2020) and DNN-based (Sun et al., 2018; Yin et al., 2020) models are used for question generation from text corpora. Answer extraction (Rajpurkar et al., 2016; Kwiatkowski et al., 2019) or machine comprehension (Hermann et al., 2015; Jozefowicz et al., 2016) is a branch of NLP where the goal is to extract an answer snippet from text documents given a question as input. In both cases, either the question or the answer is given as input. QA extraction models (Alberti et al., 2019; Du et al., 2017; Reddy et al., 2017; Krishna and Iyyer, 2019) are generally pipeline-based, generating the question and the answer in sequence. Boros et al. (2021) use a question-answering system to detect specific events in textual content (e.g., tweets, blogs). In this context, the entity information is used to frame template-based questions (e.g., "Where did the [attack] happen?", where attack is an event of interest). Zhang et al. (2021) propose combining entity linkage with a QA system. However, our objective is different, as we enrich the QA extraction technique by augmenting it with entity-level metadata.

Conclusion and Future Work
In this paper we present PROTEGE, a transformer-based two-stage question generation framework based on prompts that balances the diversity of the generated questions with their fidelity. Through extensive experiments on multiple datasets we show that PROTEGE significantly improves diversity (by +16%) and fidelity (by +8%) compared to strong baselines. As future work, we will extend our models to simultaneously generate both questions and answers. In preliminary experiments on the task of extracting answers for questions from a given context, we have observed that providing the "entities" in the question as additional prompt signals to a BERT-based model improves the answer extraction quality by up to +4.2% in F1 score. Similar applications to other NLG tasks such as document summarization and FAQ creation are possible using the framework proposed in our paper. Extension of our work to non-English languages is also part of future work.

Limitations
One limitation of PROTEGE is that it is tightly integrated with the existing transformer architecture. Therefore, to test its efficacy with Large Language Models (LLMs), we would need access to the pre-trained model parameters. While this is possible for publicly available LLMs such as Vicuna (Chiang et al., 2023), Falcon (Penedo et al., 2023) and LLaMA (Touvron et al., 2023), we miss out on state-of-the-art LLMs such as GPT-4 (OpenAI, 2023) and ChatGPT. Further, our approach requires a large GPU cluster to train, which may lead to higher carbon emissions.
Experimental evidence suggests that when a context span is used as the prompt, our model may hallucinate or mention incomplete product names or product families. For example, instead of "Core i7 12700K CPU", it may generate a question with "Core i7 12700 CPU", which is ambiguous (the i7 12700K CPU has a base frequency of 3.6 GHz in comparison to 2.1 GHz for the i7 12700F). Generating questions with fully-qualified product names will be a direction of our future work.
The mixing of the context and prompt embedding happens throughout all the decoder layers.
One of our baselines, BASELINE-PROMPT, can be viewed as a combination of "context" and "prompt" tokens at the input level (i.e., early fusion) via cross-attention. We also performed limited experiments combining the encoder outputs for the context (h_c) and prompt (h_p) (i.e., mid-level fusion) and observed it to perform worse than BASELINE-PROMPT. Our experiments also suggested that a single cross-attention and MIXUP layer was not sufficient to guarantee faithfulness of the generated questions w.r.t. the prompt signal, and hence it needs to be repeated in each decoder layer.

B Prompt signals
In table 6 we describe the various prompt signals we have used for our experiments.

C Dataset details
Search QA (SEARCHQA): We start by creating a set of custom templates for shopping guidance queries (e.g., things to consider when buying a <category>, main features of a <usecase> <category>). We expand the templates by populating the slots to create a seed set of search queries (e.g., main features of a gaming laptop). For each query, we submit a search request to a third-party search engine and extract questions, answer snippets and URLs. Further, we extract the textual content from the URLs. We thus collect a dataset of size ~100K. After filtering out rows where we are (a) unable to extract the URL content or (b) unable to locate the answer in the extracted URL content, we are left with ~60K datapoints. Finally, for each question, answer, URL datapoint, starting from the textual content extracted from the URL we extract multiple paragraphs (each containing the answer in a different location) to create a "context" input (data augmentation). Thus, each question, answer, URL datapoint expands into N question, answer, context datapoints. From the URL set we create two mutually exclusive sets of domain names, one each for the train and test datasets (to ensure that models generalize across unseen domains), which allows us to create a training dataset with ~100K rows, and dev and test datasets with ~5K rows each.
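The data-augmentation step (expanding one question, answer, URL datapoint into several question, answer, context datapoints) can be sketched as below. `make_contexts` is a hypothetical helper, not our exact pre-processing code:

```python
def make_contexts(sentences, answer, window=3):
    """Build several fixed-size context windows from a page's sentences,
    each containing the answer at a different position (data augmentation)."""
    contexts = []
    for i, s in enumerate(sentences):
        if answer not in s:
            continue
        # Slide the window so the answer sentence lands at every offset.
        for start in range(max(0, i - window + 1), i + 1):
            chunk = sentences[start:start + window]
            if answer in " ".join(chunk):
                contexts.append(" ".join(chunk))
    return list(dict.fromkeys(contexts))  # de-duplicate, keep order
```

Each returned window pairs with the original question and answer to form one training triplet, so a single page yields up to `window` triplets per answer occurrence.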
SQuAD 2.0 (SQUAD): The Stanford Question Answering Dataset 2.0 (Rajpurkar et al., 2018) is a public dataset consisting of crowdsourced questions on a selection of Wikipedia articles. The dataset consists of a paragraph/context, a set of questions relevant to the context and, for each question, an answer which is a phrase from within the context. We ignore unanswerable questions (where is_impossible = True). We split the original train dataset into train (~80K rows) and dev (~5K rows) by splitting based on titles. We sample from the original validation set (consisting of ~10K unique datapoints) to create a test set (~5K rows).
Google Natural Questions (NQ): Natural Questions (Kwiatkowski et al., 2019) is a collection of real user questions submitted to Google, with answers gathered from Wikipedia by annotators. From the original dataset we parse the question_text, the long_answer (which we treat as a context) and the short_answer (which we treat as an answer), the latter usually being a phrase from within the context (except for yes/no answers). We filter out contexts that contain HTML tables (<Table>) and also filter out very long contexts (>= 20 sentences). We sample from the original training data of ~307K datapoints to create a train (~120K rows) and dev (~5K rows) set. We similarly parse the original ~7.8K validation set to create a test set (~3.5K rows).
MS MARCO: MS MARCO (Microsoft Machine Reading Comprehension) (Nguyen et al., 2016b) is a large-scale collection of datasets (machine reading comprehension, passage ranking, etc.), of which we leverage the question answering dataset. Queries (questions) are sampled from Bing logs and the 10 most relevant passages for each query are generated. Human annotators then tag passages that contain an answer to the question and identify the answers from the relevant passages. From the original dataset, for a given query we randomly select one answer and then randomly sample 3 passages (selecting one passage that contains the answer and two passages that do not), shuffle and concatenate the passages to form our input context. We sample from the original train and dev datasets to create a train (~100K rows), dev (~5K rows) and test (~5K rows) set.

D NQ & MS MARCO observations
For the NQ and MS MARCO datasets, although PROTEGE shows a significant improvement over the baselines on diversity metrics, improvements are not observed on NLG metrics. In the case of the NQ and MS MARCO datasets, the answer is often a short phrase (specifically, in NQ we use the "short answer" provided in the dataset). During inference,

Answer text. ORACLE: text of the ground-truth answer. HEURISTIC: iteratively select a window of k (=2) sentences from the context.

Answer keywords. ORACLE: keywords derived from the ground-truth answer using the RAKE (Rapid Automatic Keyword Extraction) algorithm. HEURISTIC: iteratively select a window of k (=2) sentences from the context and derive the keywords from the context window.

Question entities. ORACLE: entities from the ground-truth question identified using a pre-defined dictionary of domain-specific entities. HEURISTIC: iteratively select a window of k (=2) sentences from the context and derive the entities from the context window.
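As an illustration of how answer keywords are derived, the following is a simplified sketch of RAKE's degree/frequency scoring. The stop-word list is a toy stand-in and the real implementation differs in detail:

```python
from collections import defaultdict

STOPWORDS = {"a", "an", "and", "the", "of", "is", "for", "with", "to"}  # toy list

def rake_keywords(text):
    """RAKE-style extraction: split on stop-words/punctuation into candidate
    phrases, score each word by degree/frequency, sum word scores per phrase."""
    phrases, current = [], []
    for tok in text.lower().replace(",", " , ").replace(".", " . ").split():
        if tok in STOPWORDS or tok in {",", "."}:
            if current:
                phrases.append(current)
                current = []
        else:
            current.append(tok)
    if current:
        phrases.append(current)
    freq, degree = defaultdict(int), defaultdict(int)
    for ph in phrases:
        for w in ph:
            freq[w] += 1
            degree[w] += len(ph)  # word degree: co-occurrence within the phrase
    scores = {" ".join(ph): sum(degree[w] / freq[w] for w in ph) for ph in phrases}
    return sorted(scores, key=scores.get, reverse=True)
```

Longer multi-word phrases accumulate higher degree scores, which is why RAKE tends to surface compound terms (e.g., product names) over isolated frequent words.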

H Ablation: Effect of greedy algorithm
Table 12 shows the effect of the greedy algorithm on the full set of diversity metrics. As expected, across all datasets the diversity metrics improve with the greedy algorithm.
In table 13 we observe that the post-greedy top-1 METEOR reduces for some datasets. This is expected, as the generated question from the first stage is often replaced by a question which displays high diversity and fidelity. However, at top-2 and top-3 the METEOR slightly increases (except for NQ), indicating that the greedy algorithm implicitly favors the question closer to the ground-truth (which is also expected to be answerable) as long as it improves the diversity.

I Audit SOP
We perform human audits with 2 auditors to compare the generated questions between PROTEGE and BASELINE-BEAM (for a sample of ~200 contexts). For each data point, auditors record the following details regarding the top-k generated questions: (A) Are the questions readable and meaningful (i.e., well-formed and complete sentences)? (B) Out of the readable questions, how many are semantically unique (measures semantic diversity)? (C) Out of the readable questions, how many are answerable from the context (measures fidelity)? In case of conflicts on any of the labels, a third auditor re-verifies the decision to resolve the conflict. Finally, we take a cumulative count for each aspect and measure the percentage of readable, unique and answerable questions.
pass them through an encoder to learn its latent representation, and finally through a decoder to generate the output question y = (y_1, y_2, ···), one word y_i at a time. Given a training dataset D = {(x, y)}, the model parameters θ are learned by maximizing the likelihood Σ_{(x,y)∈D} log Pr(y | x; θ). The encoder and decoder are implemented as Transformer networks.

Figure 1 :
Figure 1: Model architecture of the PROTEGE encoder and decoder. The left figure (a) shows the modifications (components shown in color) made to the standard decoder architecture. As shown in figure (b), the encoder takes the context and prompt as input and generates the representations h_c and h_p. The decoder is modified to incorporate cross-attention with h_c and h_p and a mixup layer to aggregate the outputs.

Figure 2 :
Figure 2: Results of human evaluation.

Table 1 :
Diversity metrics for PROTEGE and baselines across QA datasets.
The exact same strategy is used for this baseline as well.

Table 2 :
NLG metrics for PROTEGE and baselines across QA datasets.

For the SEARCHQA dataset, among several choices for prompt signals (described in section 2.2), we highlight the best results obtained in this section and describe the trade-offs among the choices in section 4.1. In all our tables we highlight the first best result and underline the second best result. Note that ↑ for a metric indicates higher values are preferred, whereas ↓ indicates lower values are preferred.
In table 3 we present the effects of prompt signals on diversity metrics.

Table 3 :
Effect of prompt signals on diversity metrics.

Table 4 :
Effect of greedy algorithm on diversity metrics.

Table 5 :
Table showing anecdotes of questions generated by PROTEGE and BASELINE-BEAM.
E Ablation: Balancing diversity vs fidelity
Table 8 shows the full metrics as a result of varying η.

Table 8 :
Diversity vs. Fidelity with varying η.

In tables 9 and 10 we present the complete set of diversity and NLG metrics based on the choice of prompt signals. Specifically for the NLG metrics, regardless of the choice of prompt signal, for the top-1 question BASELINE-BEAM generates questions closest to the ground-truth, followed by answer keywords. For top-2 and top-3, the best strategy in general is to pass answer keywords as prompts during training and keywords from context spans of size 2 or 4 during inference. The best result with answer keywords is better than the best result with answer text, indicating that the model benefits more when passed keywords rather than full text as the guidance signal. Among different context span sizes, passing a larger window (2 or 4 sentences) leads to better results. The worst performing is question entities, possibly because the model tends to overfit on the specific prompted entities, while it generalizes when passed a larger window of keywords/sentences. Figure 4 and table 11 (with the complete set of metrics for ORACLE vs. HEURISTIC prompting) show that on average there is a 40+% improvement in METEOR metrics from ORACLE prompting compared to HEURISTIC prompting.
F Ablation: Effect of prompt signals