Selecting Better Samples from Pre-trained LLMs: A Case Study on Question Generation

Large Language Models (LLMs) have in recent years demonstrated impressive prowess in natural language generation. A common practice to improve generation diversity is to sample multiple outputs from the model. However, there is no simple and robust way of selecting the best output from these stochastic samples. As a case study framed in the context of question generation, we propose two prompt-based approaches to selecting high-quality questions from a set of LLM-generated candidates. Our method works under the constraints of 1) a black-box (non-modifiable) question generation model and 2) lack of access to human-annotated references, both of which are realistic limitations for real-world deployment of LLMs. With automatic as well as human evaluations, we empirically demonstrate that our approach can effectively select questions of higher quality than greedy generation.


Introduction & Related Work
Large Language Models (LLMs) have recently gained tremendous popularity in the NLP community (Devlin et al., 2019; Liu et al., 2019; Bao et al., 2020; Brown et al., 2020). The ever-increasing size of both models and training data renders many traditional learning methods impractical or intractable. As a result, prompt-based learning has emerged as a new paradigm tailored specifically towards leveraging the power of LLMs (Radford et al., 2019; Petroni et al., 2019; Raffel et al., 2020; Brown et al., 2020; Schick and Schütze, 2021b; Gao et al., 2021; Liu et al., 2021). In the zero-shot setting (such as in this study), a data sample is first "verbalized" into an input prompt and a ground-truth response in natural language. The prompt is then issued to a pre-trained LLM to obtain a predicted response, which is then compared to the ground truth for evaluation. This new technique has been successfully applied to many applications including text classification (Yin et al., 2019; Schick and Schütze, 2021a), QA (Jiang et al., 2021), natural language generation (Li and Liang, 2021) and NLG evaluation (Yuan et al., 2021).
Despite the impressive results on popular NLP benchmarks, however, the back-end LLMs are usually pre-trained with general-domain data, leading to sub-optimal performance in new domains for prompt-based learning. There are two major challenges in successfully applying general-purpose LLMs to specific domains. Firstly, aside from the many known issues of LLMs (Webson and Pavlick, 2021; Min et al., 2022; Zhao et al., 2021; Lampinen et al., 2022), their sheer size and/or limited accessibility (e.g., served via API over the internet) make domain adaptation prohibitively expensive and impractical. These limitations have inspired a recent line of work known as prompt editing/tuning (Gao et al., 2021; Li and Liang, 2021; Madaan et al., 2022). Secondly, prompt tuning often relies on the availability of ground-truth labels for the data, which imposes substantial additional resource requirements on the approach.
Given the ubiquity of these challenges, our study focuses on alleviating the constraints on both annotation availability and access to model parameters, making LLMs more accessible for deployment. We take a mainstream NLG task, namely question generation, as a case study (Du et al., 2017; Yuan et al., 2017; Du and Cardie, 2018; Pan et al., 2019; Liu et al., 2020; Pyatkin et al., 2021). In this task, a model is trained to generate a natural language question conditioned on a context and an answer, such that the generated question can be answered by the provided answer using the context as supporting evidence. Question generation is a cornerstone of many NLP applications in education (Kurdi et al., 2020; Abdelghani et al., 2022), FAQ generation (Mass et al., 2020), information seeking (Qi et al., 2020), etc. In an educational setting, for example, a question generation system can generate demonstrations that inspire students' curiosity and thinking (teaching), or help assess students' proficiency on certain knowledge or skills (examining). These use cases would benefit greatly from reduced dependency on computing resources, data availability, and the expertise required for fine-tuning an LM.
To align with these real-world scenarios, our goal is to obtain better outputs from an inference-only LLM (i.e., as a "black box", which is relatively more accessible, e.g., through online APIs). In particular, given the common practice of sampling multiple outputs to improve generation diversity, we propose a method that aims at selecting the best candidate based on multiple aspects of question quality in a zero-shot manner, notably without model adaptation or human annotations. Our method can be seen as a post-hoc selection process within a larger NLG pipeline, and thus is orthogonal and applicable to zero-shot and in-context learning methods (Rubin et al., 2021; Lu et al., 2022; Liu et al., 2022).

Problem Setting
Notations: Formally, we consider a dataset of context-answer pairs $(c, a)$, both as strings. The task of question generation is to generate a question $q$ that can be answered by $a$ using $c$ as supporting evidence. We use an off-the-shelf pre-trained LLM-based question generator in a zero-shot setting (prompt construction detailed in Appendix A). To simulate the black-box generator scenario, we refrain from any form of model tuning. We do, however, assume access to a set of output sequences stochastically sampled from the question generator. We thus ground our study in this application scenario by sampling $k$ questions $Q = \{q_i : i = 1, \ldots, k\}$. As a baseline for comparison, we also denote by $q_g$ the question generated with a greedy algorithm (i.e., generating the most probable token at each time step).
Our goal is to devise an algorithm $S$ which selects the best candidate $q_{i^*}$ that maximizes some evaluation metric $M : Q \mapsto \mathbb{R}$, i.e., $S(Q) = i^* = \arg\max_i M(q_i)$. We use $\bar{M}_s$, $M_s^{\min}$, and $M_s^{\max}$ to denote the mean, min, and max of $\{M(q) : q \in Q\}$, respectively, and $M_g$ for the greedy output $M(q_g)$. By definition, $M_s^{\min} \le \bar{M}_s \le M_s^{\max}$ always holds, and a positive result on the design of $S$ would translate to $M(q_{S(Q)})$ outperforming both $\bar{M}_s$ and $M_g$.
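The selection step and the three sample statistics above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the study: the function names are hypothetical, and the scorer passed in stands for any per-question score (in practice a reference-free proxy, since the evaluation metric $M$ is not available at selection time).

```python
from statistics import mean
from typing import Callable, Dict, Sequence


def select_best(candidates: Sequence[str], scorer: Callable[[str], float]) -> int:
    """Return i* = argmax_i scorer(q_i); ties go to the earliest candidate."""
    scores = [scorer(q) for q in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


def sample_statistics(candidates: Sequence[str],
                      metric: Callable[[str], float]) -> Dict[str, float]:
    """Mean, min, and max of the metric over the k sampled questions."""
    scores = [metric(q) for q in candidates]
    return {"mean": mean(scores), "min": min(scores), "max": max(scores)}
```

A selection method is useful only if the metric value of the selected question exceeds the `"mean"` entry (random choice among samples) and the score of the greedy output.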

Datasets and model: We adopt two question generation datasets with distinctive characteristics, namely SQuAD (Rajpurkar et al., 2016) and Fairytale QA (Xu et al., 2022). SQuAD was originally proposed as an extractive QA dataset. It has been used as a sentence-level question generation task in the question generation literature (Du and Cardie, 2018; Yuan et al., 2017; Bao et al., 2020), i.e., a context $c$ is a single sentence that contains the corresponding answer $a$ as a sub-string. Fairytale QA has also been used for both question answering and question generation. It features paragraph-level question generation (with $c$ being one or more paragraphs), and the answer $a$ is not necessarily a sub-string of $c$. Since we do not perform any form of model/prompt tuning, we use the test split of both datasets, which consists of 11,877 data points for SQuAD and 1,007 for Fairytale QA.
We prompt GPT-3 (Brown et al., 2020) in a zero-shot manner for both question generation and selection (detailed in §3). We provide all prompts in Appendix A.
Evaluation Metrics: We use two metrics to evaluate the selected question $q' = q_{S(Q)}$:
• Reference-based evaluation: Following prior works, we use BLEU-4 for SQuAD (Du and Cardie, 2018; Bao et al., 2020) and ROUGE-L for Fairytale QA (Xu et al., 2022). These metrics compare $q'$ against the reference question $q$ (a.k.a. the "ground-truth" question in the existing literature).
• Human evaluation: We solicit human annotations on a subset of the data. We postulate that an overall score given holistically to rate a question would be highly subjective and thus less conducive to annotator agreement. Accordingly, we decompose the quality of questions into seven dimensions, and ask human annotators to rate a question on each dimension, followed by an overall rating of the question. We collect three annotations from different annotators for each data point. We provide details of the human study in Appendix B.
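To make the reference-based comparison concrete, the following is a simplified sketch of a ROUGE-L style score based on the longest common subsequence (LCS) between a selected question and its reference. It assumes lowercased whitespace tokenization and an unweighted F-measure; standard ROUGE and BLEU toolkits differ in tokenization, stemming, and weighting, so this is illustrative only and will not reproduce the reported numbers.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbour.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```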
[context] Old Dragonbeard must have been a master swordsman standing midway between those of the first and of the second order. Molo, however, of whom this story tells, was a sword hero. At that time there lived a young man named Tsui, whose father was a high official and the friend of the prince. And the father once sent his son to visit his princely friend, who was ill. The son was young, handsome and gifted. He went to carry out his father's instructions. When he entered the prince's palace, there stood three beautiful slave girls, who piled rosy peaches into a golden bowl, poured sugar over them and presented them to him. After he had eaten he took his leave, and his princely host ordered one of the slave girls, Rose-Red by name, to escort him to the gate. As they went along the young man kept looking back at her. And she smiled at him and made signs with her fingers. First she would stretch out three fingers, then she would turn her hand around three times, and finally she would point to a little mirror which she wore on her breast. When they parted she whispered to him: "Do not forget me!"
[question] Who was Rose-Red?
[output1] Yes, the question is related to the context. Rose-Red is one of the three beautiful slave girls who served the prince. The young man, Tsui, was attracted to her and she seemed to be attracted to him as well.
[options] 1: They are not at all related; 2: They are remotely related; 3: They are somewhat related; 4: They are closely related.
Figure 1: Template for prompting GPT-3 to rate a question's relevance. GPT-3 output is highlighted in green.

Method
In this section we propose three question selection methods. As described in §2, each method is used to score the k sampled questions in Q and to select the candidate with the highest score.
n-gram similarity: We use n-gram similarity between a question and its context to measure their relevance. This method reflects the intuitive assumption that a favorable question should be closely related to the information provided by the context. Specifically, we extract all unique n-grams (see footnote 5) $s_n(c)$ from a given context $c$, and $s_n(q)$ from a question $q$. The n-gram similarity score is then defined in terms of these two sets, where $|s|$ indicates the size of set $s$.
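As an illustration of scoring with the n-gram sets $s_n(c)$ and $s_n(q)$, the sketch below computes the fraction of a question's unique n-grams that also occur in the context. The normalization by $|s_n(q)|$, the lowercased whitespace tokenization, and the function names are all assumptions made for this example; the exact formula is not reproduced in the text above.

```python
from typing import List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """All unique n-grams of a token list (empty if the list is shorter than n)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_similarity(context: str, question: str, n: int) -> float:
    """Assumed form: |s_n(c) & s_n(q)| / |s_n(q)|."""
    s_c = ngrams(context.lower().split(), n)
    s_q = ngrams(question.lower().split(), n)
    if not s_q:
        return 0.0  # question too short to contain any n-gram of this order
    return len(s_c & s_q) / len(s_q)
```

Under this assumed form, a question that copies a span of the context scores 1.0, while a question built from unseen words scores 0.0.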

Round-trip: Intuitively, the answer to a generated question should be semantically equivalent to the answer that has been used to generate the question. Formally, a question generation model QG and a QA model (both with reasonable performance) should satisfy the following constraint:

$q' = \mathrm{QG}(c, a), \quad a = \mathrm{QA}(c, q')$

This idea is closely related to cycle consistency in the existing literature on image generation (Zhu et al., 2017), machine translation (Artetxe et al., 2018), and QA (Alberti et al., 2019; Shah et al., 2019). Here, we use GPT-3 as an off-the-shelf QA model to obtain $a'$ for each pair of $c$ and $q'$, resulting in one answer $a'_i$ for each of the k sampled questions in Q. We then measure the similarity between each $a'_i$ and the ground-truth answer $a$ ($F_1$ for SQuAD and ROUGE-L for Fairytale QA, in accordance with the evaluation setup from the original papers for the two datasets). Finally, we select the question corresponding to the generated answer $a'_{i^*}$ that overlaps the most with $a$ (i.e., that can be best answered by GPT-3). Prompts used in these experiments are detailed in Appendix A.

Footnote 5: In all our experiments, n ranges from 1 to 5.

Table 1 (partial):
Prior works (models trained/fine-tuned on these datasets) | SQuAD (BLEU-4) | Fairytale QA (ROUGE-L)
(Du and Cardie, 2018) | 0.152 | -
(Zhang and Bansal, 2019) | 0.184 | -
UniLM Large (Bao et al., 2020) | 0.228 | -
UniLM v2 Base (Bao et al., 2020) | 0.244 | -
ERNIE-GEN Large (Xiao et al., 2021) | 0.254 | -
BART (Xu et al., …
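The round-trip selection described above can be sketched as follows. The `qa_model` argument is a hypothetical stand-in for any function that answers a question given a context (GPT-3 in the study); the token-level F1 shown here is a simplified version that skips the punctuation and article normalization of the official SQuAD evaluation script.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def token_f1(prediction: str, reference: str) -> float:
    """Simplified token-overlap F1 between two answers."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def round_trip_select(
    context: str,
    answer: str,
    questions: Sequence[str],
    qa_model: Callable[[str, str], str],  # hypothetical: (context, question) -> answer
    similarity: Callable[[str, str], float] = token_f1,
) -> Tuple[int, List[float]]:
    """Answer every candidate question, then pick the one whose answer best matches `answer`."""
    scores = [similarity(qa_model(context, q), answer) for q in questions]
    best = max(range(len(questions)), key=scores.__getitem__)
    return best, scores
```

Passing the similarity function as a parameter lets the same routine use F1 for SQuAD and a ROUGE-L style score for Fairytale QA.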
Prompt-based Score: We propose a two-step procedure (Figure 1) for prompting GPT-3 to answer the same set of meta-questions (i.e., questions about the quality of a given question) used for human evaluation (§2). In step 1, given a context-question pair, GPT-3 is prompted to answer a meta-question as an open question (as opposed to choosing among a list of options) as well as to verbalize a reason for its answer. In step 2, GPT-3 is prompted to choose from a list of options representing the rating scale of the meta-question.
We empirically observe that without the first step, GPT-3 output tends to have a low-entropy distribution, i.e., it often chooses the same option for a given meta-question regardless of the context-question pair. In contrast, the additional first step appears to improve prediction diversity, which is in line with observations made in some existing studies (Nye et al., 2021; Wei et al., 2022).
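The two-step procedure can be sketched as two chained calls to a text-completion function. Everything below is an assumption for illustration: `complete` is a hypothetical callable wrapping whatever LLM is available, and the prompt wording only mimics the field markers shown in Figure 1 rather than reproducing the exact prompts from Appendix A.

```python
import re
from typing import Callable, Optional, Sequence


def two_step_score(
    context: str,
    question: str,
    meta_question: str,
    options: Sequence[str],
    complete: Callable[[str], str],  # hypothetical LLM wrapper: prompt -> completion
) -> Optional[int]:
    """Return the 1-based index of the chosen option, or None if it cannot be parsed."""
    # Step 1: answer the meta-question openly, with a verbalized reason.
    step1_prompt = f"[context] {context}\n[question] {question}\n{meta_question}\n"
    rationale = complete(step1_prompt).strip()

    # Step 2: show the rationale and ask for one option on the rating scale.
    option_text = "; ".join(f"{i}: {opt}" for i, opt in enumerate(options, start=1))
    step2_prompt = (
        f"{step1_prompt}[output1] {rationale}\n"
        f"[options] {option_text}\n"
        "Answer with the number of the best option:"
    )
    reply = complete(step2_prompt)

    match = re.search(r"\d+", reply)
    if match and 1 <= int(match.group()) <= len(options):
        return int(match.group())
    return None
```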
Similar to human evaluation, we also prompt

Results and Discussion
To measure the performance of a selection method (§3), we use it to select one out of k = 5 questions sampled from GPT-3, and score the selection with the evaluation metrics outlined in §2. Additionally, we test the ensemble performance with multiple methods. To ensure comparability, we normalize the scores obtained from each selection method into the range between 0 and 1, and use their average score to perform question selection.
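A minimal sketch of the ensemble step is given below. It assumes min-max normalization applied per method over the k candidates of one example, followed by a uniform average; the text only states that scores are normalized to [0, 1], so the exact normalization scheme and the function names here are assumptions.

```python
from typing import List, Sequence, Tuple


def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; constant score lists map to all zeros (an assumption)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def ensemble_select(score_lists: Sequence[Sequence[float]]) -> Tuple[int, List[float]]:
    """Average normalized scores across methods and return the best candidate index."""
    normalized = [min_max_normalize(scores) for scores in score_lists]
    combined = [sum(column) / len(column) for column in zip(*normalized)]
    best = max(range(len(combined)), key=combined.__getitem__)
    return best, combined
```

Each inner list holds one method's scores for the same k candidates, in the same order, e.g. `ensemble_select([round_trip_scores, ngram_scores])`.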
Reference-based evaluation: Reference-based evaluation uses automatic metrics that are applied to the entire test sets of SQuAD and Fairytale QA.
Table 1 shows that on both datasets, all question selection methods outperform $\bar{M}_s$, the average score over all five sampled questions, validating the effectiveness of the proposed methods. While all individual methods outperform the greedy baseline $M_g$ on SQuAD, round-trip performs the best, outperforming $M_g$ on both datasets. It can be further improved via ensemble with n-gram and/or prompt-based scores (using uniform weights). Note that prior studies require a large amount of labeled data for model training/fine-tuning, while GPT-3 performs zero-shot inference. Despite this major difference in learning paradigm, most GPT-3-based models proposed here outperform previous results by significant margins on the SQuAD dataset; even the least performant samples $M_s^{\min}$ (lower bound) achieve competitive results. For Fairytale QA, however, only the best samples $M_s^{\max}$ (upper bound) outperform previous results (Xu et al., 2022), indicating room for improvement on question selection strategies in future work.
Human evaluation: Human evaluation consists of 16,800 annotations (from 87 annotators) evenly split across the two datasets (details in Appendix B). For question generation (among many NLG tasks), model outputs may exhibit linguistic diversity while maintaining semantic equivalence. It is thus highly problematic to evaluate such outputs against a single reference (i.e., "ground truth"). Figure 2 empirically shows that the ground truth (GT) provided in the datasets often fails to receive the highest human ratings, on many occasions scoring lower than stochastic samples from GPT-3 ($M_s$). Consequently, we strongly advocate for human evaluation, which we believe is highly effective in improving the generalizability of our results to real-world applications.
Another prominent observation is that n-gram and APS perform quite differently on the two datasets. On SQuAD, n-gram similarity outperforms other individual methods, with further noticeable improvements via ensemble with round-trip. APS, on the other hand, does not work nearly as well, performing the worst for almost all meta-questions. In contrast, n-gram (particularly trigram) similarity shows the worst performance on Fairytale QA, while APS outperforms all other methods by a noticeable margin.
We posit that the reversed trend in comparing n-gram and APS can be explained by the distinct natures of the datasets. For SQuAD, the sentence-level contexts are relatively short and simple with strictly extractive answers (i.e., the answers being sub-strings of the corresponding contexts). As a result, paraphrasing the context can be a fairly effective strategy to generate questions (hence the stronger correlation between question quality and the c-q n-gram similarity). In contrast, with multi-paragraph contexts and abstractive, open-ended answers, Fairytale QA questions are more likely to be posed about abstract ideas rather than simple context paraphrasing. Consequently, n-gram similarity, which favors local context paraphrasing, is less likely to perform well.

Conclusion
In this study, we investigate the practical problem of selecting the best output from multiple samples generated by an LLM. Using question generation as a case study, we propose two prompt-based question selection methods. To alleviate real-world constraints on using LLMs, the proposed methods require neither model fine-tuning nor human annotation. Extensive experiments with both automatic and human evaluations evince the effectiveness of our approach on question selection.

Limitations
We acknowledge that our system has some limitations that warrant further investigation. For example, one needs to be mindful of the specific downstream applications of the proposed methods, both in terms of 1) potentially large variance in out-of-distribution performance (e.g., divergent question generation applications that aim to spark children's curiosity-driven thinking (Abdelghani et al., 2022)); and 2) mitigating harmful/toxic content in educational applications (Bender et al., 2021). As a result, we believe such techniques and applications are neither suitable nor safe for direct interaction with children. We urge developers to use this technique in other ways, for instance in teaching-assistant applications (e.g., a system that suggests examples for teachers), where the teacher can filter and modify the examples, thus making sure the content children receive is proper and safe.
We also acknowledge the prohibitively restrictive access to the GPT-3 model at the time of writing. We do believe that this constraint will relax over time, and meanwhile we hope that our proposal can shed light on research and applications with more accessible LLMs such as GPT-J (Wang and Komatsuzaki, 2021) and BLOOM (BigScience, 2022) in future work.
While we acknowledge the many limitations with respect to accessing GPT-3, we are not advocating against using it. On the contrary, we believe GPT-3 is still among the most cost-effective solutions, especially in the context of natural language generation. The main goal of the study is thus to explore more data-efficient ways of using GPT-3 to generate and evaluate questions. We strive to share our experience and insights with the community, which we hope will prove valuable and helpful.

Contents in Appendices:
• In Appendix A, we report all prompt templates we used in this work.
• In Appendix B, we provide details on the human study.
• In Appendix C, we provide the full set of our experiment results.
• In Appendix D, we report implementation details.

A Prompt Designs
We report an example of our prompt for question generation in Figure 3.
We report an example of our prompt for QA (used in round-trip) in Figure 4.
We report an example of our prompt in obtaining prompt scores in Figure 1.

B Human Study
We randomly sample 50 documents from each of the two datasets SQuAD and Fairytale QA. Each document corresponds to one ground-truth question and six questions generated by GPT-3 (five by stochastic sampling and one by greedy search). Each question is then rated by three human annotators with respect to seven meta-questions and one overall rating, altogether constituting 50 × 2 × (1 + 5 + 1) × 3 × (7 + 1) = 16,800 annotations. There are in total 87 annotators involved in the annotation process. All annotators are English speakers, recruited from regions including Europe, the United States, and the United Kingdom. Each annotator on average performed 193 annotations and was paid on average $14.1 USD per hour.
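For reference, the annotation count above factors as follows, and dividing by the number of annotators recovers the stated per-annotator average:

```latex
\begin{align*}
\underbrace{50}_{\text{documents}} \times \underbrace{2}_{\text{datasets}}
\times \underbrace{(1 + 5 + 1)}_{\text{questions per document}}
\times \underbrace{3}_{\text{annotators}}
\times \underbrace{(7 + 1)}_{\text{ratings per question}}
&= 100 \times 7 \times 3 \times 8 = 16{,}800, \\
16{,}800 \,/\, 87 &\approx 193.1 .
\end{align*}
```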
We perform a basic spam filtering process on the raw annotations.We observe a 15.4% spam rate.All human scores reported in this paper are computed after spam removal.
We report the eight meta-questions we used for human annotation in Figure 5. The eight meta-questions correspond to columns in Figure 2. We collect three annotations from different annotators for every meta-question, and report the averaged human agreement rate in Table 2.

C Additional Results
In Table 3, we report the full experiment results for reference-based evaluation.
Table 2: Averaged human agreements among three annotators. An agreement indicates that all three annotators selected the same option for a meta-question. We show that decomposing a single-score metric (i.e., OHR) into scores measuring different aspects (listed above OHR) can significantly improve human agreement.
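The agreement criterion in the Table 2 caption (all three annotators select the same option) can be computed as in the sketch below. The data layout, one tuple of selected options per rated item, is an assumption for illustration.

```python
from typing import Hashable, Sequence


def agreement_rate(ratings: Sequence[Sequence[Hashable]]) -> float:
    """Fraction of items on which every annotator selected the same option."""
    if not ratings:
        return 0.0
    unanimous = sum(1 for item in ratings if len(set(item)) == 1)
    return unanimous / len(ratings)
```

Computing this rate separately per meta-question yields one agreement value per rating dimension.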
In Table 4, we report the full results for human evaluation on SQuAD.
In Table 5, we report the full results for human evaluation on Fairytale QA.

D Implementation Details
In all experiments, we use the text-davinci-002 (175B parameters) variant of GPT-3. At the time of writing, it is the most capable GPT-3 model variant. Compared to other variants, text-davinci-002's support for inserting completions better facilitates our question generation tasks (as shown in Figure 3).
We use a temperature of 0.7 during the sampling process of question generation. In all other use cases (e.g., QA round-trip, prompt score), we use greedy generation (temperature set to 0).

Story:
As soon as the lady had departed the fisher's son awoke, and the dark lad told him of her visit, and how he would never see her as long as he lived. At this the fisher's son felt the cold creeping up to his heart, yet he knew the fault had not been his that sleep had overtaken him. 'I will search the whole world through till I find her,' cried he, and the dark lad laughed as he heard him. But the fisher's son took no heed, and off he went, following the sun day after day, till his shoes were in holes and his feet were sore from the journey. Nought did he see but the birds that made their nests in the trees, not so much as a goat or a rabbit. On and on and on he went, till suddenly he came upon a little house, with a woman standing outside it.
Instruction: Read the above story, ask a question and answer it.
Question: GPT-3 FILLS IN THIS BLANK
Answer: search the whole world through till he found her
Figure 3: An example of prompting GPT-3 for question generation. We use the text before green as prompt, and text after green as suffix. We refer readers to the GPT-3 documentation for more details about GPT-3's inserting completion mode.
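The split into prompt and suffix described in the Figure 3 caption can be sketched as plain string construction. The function name and the exact whitespace between fields are assumptions; the two returned strings would then be passed to a completion endpoint that supports insertion, which fills in the question between them.

```python
from typing import Tuple


def build_insertion_prompt(story: str, answer: str) -> Tuple[str, str]:
    """Return (prompt, suffix) surrounding the blank that the model should fill."""
    prompt = (
        f"Story:\n{story}\n"
        "Instruction: Read the above story, ask a question and answer it.\n"
        "Question:"
    )
    # The known answer goes after the blank, so the generated question is conditioned on it.
    suffix = f"\nAnswer: {answer}"
    return prompt, suffix
```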
[Document]: His cheeks were red with passion, and his eyes were bright, for he could not but notice that, now that she was safe at Orphir under her true love's protection, the Lady Morna's manner had grown cold and distant again, and he was beginning to lose faith in Snorro's charm.
Angry and disappointed, he had sought his mother's room to pour out his story of vexation to her.
He stopped short, however, when he saw the wonderful waistcoat lying on the table, all gold and silver and shining colours. It was like a fairy garment, and its beauty took his breath away.
[Answer]: Harold lost faith in Snorro's charm because the Lady Morna's manner had grown cold and distant again.

1. Is the question grammatically correct?
   1) It is grammatically incorrect 2) It has some grammatical issues 3) It is grammatically correct
2. Is the question offensive to people?
   1) It is very offensive 2) It may be offensive 3) It is not at all offensive
3. Is the question clear?
   1) It is not at all clear 2) It is mostly clear 3) It is very clear
4. Is the question related to the context of the attached document?
   1) It is not at all related 2) It is somewhat related 3) It is closely related
5. Is the question asking about an important aspect of the context of the attached document?
   1) Not at all important 2) It may be important 3) It is very important
6. Is the question asking about a specific piece of information in the attached document?
   1) The question is very generic 2) The question is somewhat generic 3) The question is very specific
7. Can the question be answered using information in the attached document?
   1) No, answering the question requires completely different information 2) The question can be partially answered using information from the document 3) The question can be perfectly answered using information from the document
8. What is your overall rating of the question generated based on the attached document?
   1) The question is very bad 2) The question is okay 3) The question is very good

Abstract and Section 1 on page 1.

A4. Have you used AI writing assistants when working on this paper?
This paper discusses an application of GPT-3, but we did not use GPT-3 in writing any section of the paper.

B Did you use or create scientific artifacts?
We use an existing language model, namely GPT-3; we also use existing question generation datasets, namely SQuAD and Fairytale QA. We discuss them in Section 2 (page 2).

B1. Did you cite the creators of artifacts you used?
Yes, we cite the creators in Section 2, page 2.
B2. Did you discuss the license or terms for use and / or distribution of any artifacts? Not applicable. They are publicly available.
B3. Did you discuss if your use of existing artifact(s) was consistent with their intended use, provided that it was specified? For the artifacts you create, do you specify intended use and whether that is compatible with the original access conditions (in particular, derivatives of data accessed for research purposes should not be used outside of research contexts)? Yes, in Sections 1 and 5.
B4. Did you discuss the steps taken to check whether the data that was collected / used contains any information that names or uniquely identifies individual people or offensive content, and the steps taken to protect / anonymize it? Not applicable. Left blank.
B5. Did you provide documentation of the artifacts, e.g., coverage of domains, languages, and linguistic phenomena, demographic groups represented, etc.? Not applicable. Left blank.
B6. Did you report relevant statistics like the number of examples, details of train / test / dev splits, etc. for the data that you used / created? Even for commonly-used benchmark datasets, include the number of examples in train / validation / test splits, as these provide necessary context for a reader to understand experimental results. For example, small differences in accuracy on large test sets may be significant, while on small test sets they may not be. Yes, in Sections 2 and 3 and Appendix B.

C Did you run computational experiments?
Yes, described in Sections 2, 3, and 4.
C1. Did you report the number of parameters in the models used, the total computational budget (e.g., GPU hours), and computing infrastructure used? Not applicable. We do not propose a new model. We use an existing language model and properly cite the original work.
C2. Did you discuss the experimental setup, including hyperparameter search and best-found hyperparameter values? Not applicable. Left blank.
C3. Did you report descriptive statistics about your results (e.g., error bars around results, summary statistics from sets of experiments), and is it transparent whether you are reporting the max, mean, etc. or just a single run? Not applicable. Left blank.
C4. If you used existing packages (e.g., for preprocessing, for normalization, or for evaluation), did you report the implementation, model, and parameter settings used (e.g., NLTK, Spacy, ROUGE, etc.)? Not applicable. Left blank.
D Did you use human annotators (e.g., crowdworkers) or research with human participants?
Yes, section 4 and appendix B.
D1. Did you report the full text of instructions given to participants, including e.g., screenshots, disclaimers of any risks to participants or annotators, etc.? Yes, Appendix B.
D2. Did you report information about how you recruited (e.g., crowdsourcing platform, students) and paid participants, and discuss if such payment is adequate given the participants' demographic (e.g., country of residence)? Yes, Appendix B.
D3. Did you discuss whether and how consent was obtained from people whose data you're using/curating? For example, if you collected data via crowdsourcing, did your instructions to crowdworkers explain how the data would be used? Yes, Appendix B.
D4. Was the data collection protocol approved (or determined exempt) by an ethics review board?
Yes, appendix B.
D5. Did you report the basic demographic and geographic characteristics of the annotator population that is the source of the data? Not applicable. Left blank.

Figure 2: Human evaluation results, averaged over three annotators' scores, normalized per column. Left: SQuAD; right: Fairytale QA. Abbreviations on the x-axis denote Grammatical correctness, Offensiveness, Clarity, Relevance, Importance, Specificity, Answerability, Averaged Human Rating (over all dimensions to the left), Overall Human Rating (an overall score given by annotators). Exact scores are provided in Appendix C.

Figure 4: An example of prompting GPT-3 for QA. GPT-3 output is highlighted in green.

Figure 5: Meta-questions we designed for human evaluation.
A1. Did you describe the limitations of your work? Section 5 on page 5.
A2. Did you discuss any potential risks of your work? Section 5 on page 5.
A3. Do the abstract and introduction summarize the paper's main claims?

Table 1: Reference-based evaluation scores. Best and second best scores (excluding baselines) are highlighted with boldface and underline.

Table 4: Human eval results (SQuAD). Abbreviations in the first row denote Grammatical correctness, Offensiveness, Clarity, Relevance, Importance, Specificity, Answerability, Averaged Human Rating (over all dimensions to the left), Overall Human Rating (an overall score given by annotators). Best and second best numbers (excluding baselines) are highlighted with boldface and underline.

Table 5: Human eval results (Fairytale QA). Abbreviations in the first row denote Grammatical correctness, Offensiveness, Clarity, Relevance, Importance, Specificity, Answerability, Averaged Human Rating (over all dimensions to the left), Overall Human Rating (an overall score given by annotators). Best and second best numbers (excluding baselines) are highlighted with boldface and underline.