Distractor Generation based on Text2Text Language Models with Pseudo Kullback-Leibler Divergence Regulation

In this paper, we address the task of cloze-style multiple-choice question (MCQ) distractor generation. Our study features the following designs. First, we formulate cloze distractor generation as a Text2Text task. Second, we propose a pseudo Kullback-Leibler divergence for regulating the generation to account for the item discrimination index used in educational evaluation. Third, we explore a candidate augmentation strategy and multi-task training with cloze-related tasks to further boost generation performance. Through experiments on benchmark datasets, our best performing model advances the state-of-the-art result from 10.81 to 22.00 (P@1 score).


Introduction
Cloze-style multiple-choice questions (MCQs) are a common form of exercise used to assess the knowledge of learners. Manually crafting cloze questions demands significant time and effort from educators, which motivates the need for automatic cloze question generation.
An important challenge in the preparation of cloze questions lies in the selection of appropriate wrong options (distractors). Carefully designing distractors is crucial for enhancing the effectiveness of learner ability assessment, but it also requires significant time and effort. As a result, there has been growing motivation to explore automatic distractor generation (DG) techniques.
The prevailing paradigm for cloze DG is the candidate generating-and-ranking (CGR) framework. The CGR paradigm consists of two stages/components: (1) a candidate generator and (2) a candidate selector. The candidate generator is generally based on knowledge bases (such as Probase (Wu et al., 2012)) or pre-trained language models (Devlin et al., 2018) to produce a distractor candidate set, and the candidate selector ranks the candidates by linguistic features (e.g., morphological features, POS, word embedding similarity). The SOTA methods of recent years (Chiang et al., 2022; Ren and Zhu, 2021) are all based on the CGR paradigm.

Question Stem: I was in a _ to reach my office
Options: (a) hurry, (b) way, (c) dream, (d) deferral

Table 1: Item discrimination for distractor generation. To ensure the validity of test questions, distractors with different levels of difficulty are needed. In this example, hurry is the correct answer, dream is an obviously wrong option, and the rest are in the middle.
While the CGR framework shows promise, it overlooks the importance of the item discrimination index (Hingorjo and Jaleel, 2012) when evaluating the quality of questions. When teachers design multiple-choice questions (MCQs), it is crucial to consider the validity of the test questions by including distractors of varying difficulty levels. For example, in a four-option MCQ, one option may be easily eliminated, while the remaining two options pose a greater challenge in distinguishing the correct answer, as shown in Table 1. This allows for differentiation among students with varying levels of knowledge during the test. Therefore, the objective of this paper is to incorporate this factor into the process of distractor generation.
Our study incorporates the following notable designs. First, we introduce a formulation that treats cloze distractor generation as a Text2Text task. As demonstrated in the experiment section, this approach yields a significant improvement in performance compared to traditional CGR methods. Second, we propose the "pseudo Kullback-Leibler divergence" technique to regulate the inter-correlation between the generated distractors. This ensures the diversity and relevance of the distractors. Third, we investigate two additional strategies: the "candidate augmentation" strategy and the "multi-task training with cloze-related tasks" approach, both of which aim to further enhance generation performance.
The contributions of this paper are:
• Our best performing model achieves a significant advancement in state-of-the-art results, increasing the P@1 score from 10.81 to 22.00. This remarkable improvement represents an almost two-fold increase in performance compared to previous approaches.
• Our study demonstrates that the generative Text2Text framework outperforms the traditional candidate generating-and-ranking framework in the context of distractor generation. This finding suggests that the Text2Text approach serves as a superior alternative for generating high-quality distractors.
• We introduce the concept of pseudo Kullback-Leibler divergence as a means of regulating distractor generation. By incorporating this approach, we aim to address the item discrimination factor when designing multiple-choice questions (MCQs).
• Extensive experimental evaluations on the benchmark datasets are conducted, and insights on incorporating larger models, multi-task settings, and context-sentence provision are discussed.
The rest of this paper is organized as follows. Section 2 reviews the literature on automatic distractor generation. In Section 3 we present the proposed methods. Section 4 reports the performance evaluation, and Section 5 concludes this work and discusses future work.

Related Work
In this section, we review the literature related to this work.

Datasets
The available distractor datasets are CLOTH (Xie et al., 2017), MCQ (Ren and Zhu, 2021), SCDE (Kong et al., 2020), and RACE (Lai et al., 2017). The CLOTH dataset (Xie et al., 2017) collects word-level cloze questions from English exams designed by teachers. The MCQ dataset is a cross-domain cloze-style dataset covering the domains of science, vocabulary, common sense, and trivia. MCQ consists of various open-source multiple-choice question datasets, including SciQ (Welbl et al., 2017), MCQL (Liang et al., 2018), AI2 Science Questions, and vocabulary and trivia MCQs scraped from websites. SCDE (Kong et al., 2020) consists of cloze questions but with sentence-level distractors. Specifically, the SCDE question setting is to fill multiple blanks in a given passage from a shared candidate set of sentence-level distractors. The RACE dataset also consists of sentence-level distractors. However, the RACE question setting is a reading comprehension form (instead of a cloze form). As our goal is to generate word-level distractors for cloze questions, we mainly use the CLOTH and MCQ datasets for model learning and evaluation.
Distractor Generator
The methods for distractor generation (DG) can be divided into two categories: cloze distractor generation and reading comprehension (RC) distractor generation.
The cloze DG task is generally viewed as a word-filling problem. The first step is to extract distractor candidates from the context or a knowledge base, and the next step is to rank the extracted candidates to produce the final result. Along this direction, the models are mainly based on similarity heuristics (Sumita et al., 2005; Mitkov et al., 2006; Guo et al., 2016; Ren and Zhu, 2021) or supervised learning (Liang et al., 2018; Yeung et al., 2019; Ren and Zhu, 2021; Chiang et al., 2022).
The SOTA method for cloze distractor generation is the work by Chiang et al. (2022), which is also based on the CGR framework. Its major performance gain comes from employing pre-trained language models (PLMs) as the candidate generator. The idea is that PLMs are inherently equipped with fill-in-the-blank ability rooted in their MLM (masked token prediction) training process. However, as mentioned, CGR-based methods do not take into account the inter-relationship between generated distractors.
On the other hand, RC-type DG focuses on generating sentence-level distractors for reading-comprehension-level testing, such as summarizing an article or understanding the author's opinion (Gao et al., 2019; Zhou et al., 2019; Chung et al., 2020; Peng et al., 2022). For sentence-level distractor generation, neural models are commonly employed.
For clarity of comparison, we summarize the existing DG studies in Table 2.

Methodology
Our approach employs a two-stage training process. In the first stage (Subsection 3.1), we utilize a Text2Text framework to generate distractors. This involves training the model to generate plausible distractors based on a given cloze question and its corresponding answer.
In the second stage (Subsection 3.2), we introduce pseudo KL-divergence as a means to regulate the generation of distractors. This step is crucial for ensuring the validity of testing when designing multiple-choice questions (MCQs). By incorporating this technique, we aim to control the quality and relevance of the generated distractors.
Furthermore, we explore boosting techniques in Subsections 3.3 and 3.4. These techniques are intended to enhance our overall approach, improving the distractor generation process or optimizing the design of MCQs. As illustrated in Figure 1, the input text is a concatenation of a cloze stem $C$ and an answer phrase $A$ (separated by [Sep]). The output target is a distractor sequence $d_1 \oplus d_2 \oplus d_3$.
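To make the formulation concrete, the following is a minimal sketch of how an input/target pair could be built for a T5-style model. The "[Sep]" string, the checkpoint name, and the joining format are illustrative assumptions, not the exact template of our implementation.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Sketch of the Text2Text formulation: source = cloze stem [Sep] answer,
# target = distractor sequence d1 (+) d2 (+) d3. Separator string and
# checkpoint are assumptions for illustration.
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

stem = "I was in a _ to reach my office."
answer = "hurry"
distractors = ["way", "dream", "deferral"]

source = f"{stem} [Sep] {answer}"      # C [Sep] A
target = " [Sep] ".join(distractors)   # d1 + d2 + d3

batch = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids
loss = model(**batch, labels=labels).loss  # standard seq2seq cross-entropy
```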

Pseudo KL-Divergence Regulation
Let $M$ be a PLM and $C_{d_i}$ be the cloze question stem with $d_i$ placed in the blank gap. Please refer to the table below for an example.

$C$: I was in a _ to reach my office...
$d$: dream
$C_d$: I was in a dream to reach my office...

Furthermore, let the likelihood of $d_i$ conditioned on $C$ and $M$ be

$p_{d_i} = P_M(d_i \mid C)$.

Let $P_D$ be the probability distribution given by all the $p_{d_i}$'s. Given a ground-truth distractor set $D$ and the generated distractor set $\hat{D}$, our pseudo KL-divergence regulation is defined as

$\mathcal{L}_{PKL} = \mathrm{KL}(P_D \,\|\, P_{\hat{D}})$.
During the second-stage training, the training loss is set to the sum of the original Text2Text loss and the pseudo KL-divergence loss:

$\mathcal{L} = \mathcal{L}_{Text2Text} + \mathcal{L}_{PKL}$.
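As a minimal sketch of this regulation, the snippet below scores each distractor with an MLM at the blank position, normalizes the scores into a distribution over the set, and computes the KL term. It assumes single-token distractors; BERT stands in for the scoring PLM (our implementation uses BART), and the softmax normalization is our reading of "the distribution given by all $p_{d_i}$'s".

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Assumptions: single-token distractors; BERT as a stand-in scoring PLM;
# softmax normalization over the distractor set.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def distractor_distribution(stem, distractors):
    """P_D: normalized MLM likelihoods of each distractor at the blank."""
    ids = tok(stem.replace("_", tok.mask_token), return_tensors="pt").input_ids
    pos = (ids == tok.mask_token_id).nonzero()[0, 1]  # blank position
    with torch.no_grad():
        log_probs = mlm(ids).logits[0, pos].log_softmax(-1)
    scores = torch.stack([log_probs[tok.convert_tokens_to_ids(d)]
                          for d in distractors])
    return scores.softmax(-1)

stem = "I was in a _ to reach my office."
p_gold = distractor_distribution(stem, ["way", "dream", "delay"])
p_gen = distractor_distribution(stem, ["way", "dream", "mood"])
# KL(P_D || P_D_hat), added to the Text2Text loss in the second stage.
pkl_loss = F.kl_div(p_gen.log(), p_gold, reduction="sum")
```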

Candidate Augmentation
To further boost the performance, we propose the candidate augmentation strategy. The idea is to generate a set of candidate distractors $\{\hat{d}_1, ..., \hat{d}_k\}$ (the top-k results) with an MLM neural candidate generator (we use the candidate generator of the state-of-the-art CGR-based method by Chiang et al., 2022) and concatenate the candidates with the original input text as an augmented input for generation. Specifically, the loss is the Text2Text loss computed over the augmented input:

$\mathcal{L}_{CA} = -\log P(d_1 \oplus d_2 \oplus d_3 \mid C \oplus A \oplus \hat{d}_1 \oplus \cdots \oplus \hat{d}_k)$.

The intuition behind the candidate augmentation strategy is to inject more information into the generation process through the MLM candidate generator, in the hope of boosting performance.
As a concrete example, as illustrated in Figure 2, we form the input text by concatenating the original input with the candidates produced by the MLM neural candidate generator.
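A minimal sketch of this augmentation step follows; the fill-mask pipeline stands in for the neural candidate generator, and the concatenation format is an illustrative assumption.

```python
from transformers import pipeline

# Sketch of candidate augmentation: take the top-k fill-in-the-blank
# candidates from an MLM and append them to the Text2Text source.
# Checkpoint and concatenation format are illustrative assumptions.
fill = pipeline("fill-mask", model="bert-base-uncased")

stem = "I was in a _ to reach my office."
answer = "hurry"
k = 20
candidates = [r["token_str"]
              for r in fill(stem.replace("_", "[MASK]"), top_k=k)]

# Augmented source: C [Sep] A [Sep] d^1 ... d^k
augmented_source = f"{stem} [Sep] {answer} [Sep] {' '.join(candidates)}"
```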

Multi-tasking with Distractor-Related Tasks
To boost the performance, we also explore multi-task training with the following tasks (a data-construction sketch follows this list):
• Distractor Finding: The distractor finding task is to detect a distractor span in $C$. The idea is to place $d$ in the blank gap of question stem $C$, denoted as $C \otimes d$, and train $M$ to generate $d$ from input $C \otimes d$. Specifically, the distractor finding model has the generation objective $P_M(d \mid C \otimes d)$.
• Cloze Test Answering: The cloze test answering task is to answer cloze questions. We take $C$ and the option sequence $Opts$ (formed by a random permutation of $\{A, D_1, D_2, D_3\}$) as input. The output is the question answer $A$; that is, the objective is $P_M(A \mid C \oplus Opts)$.
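The sketch below illustrates how training pairs for the two auxiliary tasks could be constructed. The exact input templates are not specified beyond the notation above, so the formatting here is an assumption.

```python
import random

# Distractor Finding: place d in the blank (C (x) d) and train the model
# to recover d. Returns an illustrative (source, target) pair.
def distractor_finding_pair(stem: str, distractor: str):
    return stem.replace("_", distractor), distractor

# Cloze Test Answering: input is C plus a random permutation of the
# options {A, D1, D2, D3}; the target is the answer A.
def cloze_answering_pair(stem: str, answer: str, distractors: list):
    opts = random.sample([answer] + distractors, k=len(distractors) + 1)
    return f"{stem} [Sep] {' '.join(opts)}", answer

src, tgt = distractor_finding_pair("I was in a _ to reach my office.", "dream")
# src: "I was in a dream to reach my office."  tgt: "dream"
```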

Experiment
In this section, we introduce the training datasets, the automatic metrics, the implementation details, and the performance results of the compared methods.

Dataset
We use the CLOTH (Xie et al., 2017) and MCQ (the dataset released by Ren and Zhu, 2021) datasets for performance evaluation.
CLOTH dataset CLOTH is a cloze-test dataset; each instance contains an article, options, answers, and a source. The source is divided into middle (middle-school English exams) and high (high-school English exams). CLOTH contains 7,131 passages with 99,433 questions from Chinese entrance exams. The dataset is divided into train/dev/test splits of 5,513, 805, and 813 passages.
Note that the original CLOTH dataset contains two forms of cloze questions: the major form indicates cloze gaps by _ (a blank), and the other indicates them by _ together with a number (a question number). To avoid training data inconsistency, we remove the latter form (_ with a number). The remaining data for train/dev/test are 5,041, 720, and 739 passages, and we use these remaining data in our experiments. Detailed statistics of the dataset are presented in Table 3.
MCQ dataset The MCQ dataset is a cross-domain cloze-style dataset covering the domains of science, vocabulary, common sense, and trivia. Each instance is composed of a cloze stem sentence containing **blank**, an answer, and distractors. According to the setting reported by Ren and Zhu (2021), MCQ contains 2,880 questions and is randomly divided into train/dev/test with a ratio of 8:1:1. One thing to note is that MCQ is a sentence-level cloze test while CLOTH is a passage-level cloze test. We obtained the MCQ dataset from the GitHub link shared by Ren and Zhu (2021). However, we find a slight difference between the numbers in the shared dataset and those reported in the paper: the shared dataset contains only train and test data (2,321/258). We therefore use this data setting in our experiments and split the training data 9:1 to obtain dev data.

Evaluation Metrics
Automatic Metrics Following Chiang et al. (2022), we evaluate the quality of the generated distractors using several metrics, including F1 score (F1@3), precision (P@1, P@3), and recall (R@1, R@3). P@k represents the ratio of correctly labeled distractors among the top-k generated, while R@k indicates the ratio of correctly predicted labels among the ground truth. F1@k is the harmonic mean of P@k and R@k. Notably, when the label size is 3, P@3 and R@3 are the same, resulting in the same F1@3 score. Since both the CLOTH and MCQ test data contain 3 distractors, we report P@1 and F1@3 in the experiments.
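For concreteness, a small sketch of the metric computation as defined above (distractors are compared as exact string matches, which is our assumption about the matching criterion):

```python
def metrics_at_k(generated, gold, k):
    """P@k, R@k, and F1@k for a single question (exact-match assumption)."""
    top_k = generated[:k]
    hits = len(set(top_k) & set(gold))
    p = hits / len(top_k)
    r = hits / len(gold)
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1

# With 3 gold distractors and k = 3, P@3 == R@3 == F1@3, as noted above.
print(metrics_at_k(["way", "dream", "rush"], ["way", "dream", "deferral"], 3))
```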
Human Evaluation Metric Following Ren and Zhu (2021), we asked an English teacher to evaluate the reliability and plausibility of distractors by showing her the cloze passages and answers. We randomly selected 5 passages from the CLOTH-F test set; each passage contains multiple questions, and each question contains multiple distractors, including three generated by each T5-based method and three ground-truth distractors from the dataset. Each distractor was judged as correct or incorrect based on the context. A generated result considered a feasible distractor was given a reliability score of 1, and its plausibility was further assessed on a 3-point scale: "Obviously Wrong" (0 points), "Somewhat Plausible" (1 point), or "Plausible" (2 points).

Implementation Details
Our models are implemented based on models from Hugging Face (Wolf et al., 2019). We experiment with BART (Lewis et al., 2019) and T5 (Raffel et al., 2020) as base generation models. For the neural candidate generator, we use BERT. For pseudo KL-divergence regulation, we use BART to estimate the likelihood of $d_i$. During training, we use AdamW as the optimizer with an initial learning rate of 2e-5 for BERT and BART, and 1e-4 for T5. All experiments are conducted on two NVIDIA GeForce RTX 3090 GPUs.
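A minimal sketch of the optimizer setup mirroring these settings (the scheduler, warm-up, and weight decay are not reported, so none are shown; the checkpoint name is illustrative):

```python
from torch.optim import AdamW
from transformers import AutoModelForSeq2SeqLM

# Learning rates as reported: 2e-5 for BERT/BART, 1e-4 for T5.
model_name = "t5-base"  # or e.g. "facebook/bart-base"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
lr = 1e-4 if "t5" in model_name else 2e-5
optimizer = AdamW(model.parameters(), lr=lr)
```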
BART-based generator With CLOTH data, the maximum number of epochs is set to 20, with a batch size of for the Text2Text sentence-level (Len 1) and candidate augmentation (Len 1) methods, 8 for the Text2Text passage-level method, and 32 for the other methods. With MCQ data, the maximum number of epochs is set to 50, with a batch size of 64 for the Text2Text sentence-level generation method and 32 for the other methods. The average running time for BART-based generators is 5 hours (21 minutes) on CLOTH (MCQ).
T5-based generator With CLOTH data, the maximum number of epochs is set to 30, with a batch size of 8 for the Text2Text passage-level generation method and 16 for the other methods. With MCQ data, the maximum number of epochs is set to 50, with a batch size of 64 for the Text2Text sentence-level generation method and 32 for the other methods. The average running time for T5-based generators is 24 hours (39 minutes) on CLOTH (MCQ).

Multi-Tasking and Candidate Augmentation Setting
The default top-k for candidate augmentation is set to 20. In multi-task training, to balance the training data in the two-task setting, we train the sentence-level generation model with the full data and sample the same number of instances for the distractor finding task (as there are three distractors per question, the amount of data in the distractor finding task would otherwise be three times that of the first task; thus, we randomly select 1/3 of its data for training) to achieve a 50%:50% data balance. For the three-task setting, we randomly select 1/6 of the distractor finding data and 1/2 of the cloze test answering data to achieve a 50%:25%:25% data balance. The average running time for multi-tasking is 28.5 hours (37 minutes) on CLOTH (MCQ).

Evaluation Results
Table 4 presents the results of the compared methods on the two benchmark datasets. We make the following observations. First, Text2Text generation shows the best performing results. Comparing the MCQ results, all of our Text2Text generation methods surpass the SOTA result reported in (Chiang et al., 2022). Our best performing method (T5 with DF multi-task) advances the SOTA result from 10.81 to 22.00 in terms of P@1.
Second, using a larger model brings performance improvements. Comparing the results on CLOTH-F and MCQ, T5 (with more parameters) brings nearly two points of improvement.
Third, the candidate augmentation strategy plays a crucial role in reducing the occurrence of generated distractors that are identical to the answer or to previously generated distractors. Initially, it may seem that the candidate augmentation strategy is not effective based on a direct comparison with and without its implementation. However, upon further investigation, we observe that the candidate augmentation strategy leads to significant performance gains by addressing two critical issues: (1) the generation of distractors identical to the answer and (2) the repetition of the same distractors.
To illustrate this, Table 5 presents the percentage of these two cases in the generation results of the compared methods. Notably, in the CLOTH-F comparison, approximately 90.33% of the results obtained from BART with candidate augmentation do not contain distractors identical to the answer, and 85.09% of the results do not exhibit repeated distractors.
These findings highlight the effectiveness of the candidate augmentation strategy in mitigating the issues of generating redundant or answer-matching distractors, leading to improved overall performance.
Fourth, from the tables, we observe that PKL does not perform well. On the CLOTH dataset, its performance lags behind the best-performing method, T5 multi-task+CTA, by about two to three points. Moreover, in the MCQ comparison, PKL falls far behind the other methods. Regarding this issue, we offer the following observations. First, on the CLOTH dataset, we find that PKL generates higher-quality outputs that meet the item discrimination index for incorrect options (please refer to the case study in the Appendix). Second, in the MCQ task, we notice that the MCQ data often consist of more challenging words, which causes the language model tokenizer to split complex words into two or more tokens. As a result, our current regulation based on the MLM probability distribution is not effective, since we only calculate the PKL distribution for individual words.
Further, multi-tasking boosts both BART-based and T5-based performance. Comparing the results on CLOTH-F (MCQ), BART with multi-tasking advances the performance from 25.48 (14.28) to 25.64 (17.37) P@1, and T5 with multi-tasking advances the performance from 28.18 (19.30) to 28.75 (21.62) P@1.

Human Evaluation Results
Table 6 shows the results of the human evaluation on 5 passages randomly selected from the CLOTH-F test set. From the results, we find that the reliability of both the ground-truth and model-generated distractors is very high. For plausibility, neither the ground truth nor the generated distractors score highly, because the distractors are too simple and not very suitable for English test questions. Among all T5-based methods, the multi-task (+ CTA) method produces distractors with the highest reliability and plausibility, as well as scores closest to the ground truth.

Parameter Study: k Value in Candidate Augmentation
We also investigate the effect of the top-k distractor candidates in candidate augmentation. We experiment with top-1, 3, 5, 10, and 20 distractor candidates on CLOTH-F. Table 10 shows that top-5 candidates yield the highest scores on all metrics, meaning the generated distractors are closest to the labels. Table 7 shows that with top-20 candidates, a higher ratio of the generated distractors differ from the answer, and a higher ratio avoid repetition.

Impacts on Distractor Order
We also investigate whether the order of distractors affects model performance. We conduct experiments on CLOTH-F using lexicographically ordered and length-ordered (short-to-long) distractors, and compare them with the original dataset ordering. Table 8 shows that training with the original dataset order achieves the highest performance on most metrics, meaning the generated distractors are closest to the labels; the special orderings may make learning more difficult. Table 9 shows that with the dataset ordering, a higher proportion of the generated distractors differ from the answer, while the special orderings yield a higher proportion of non-repeated distractors.

Conclusion
In this paper, we introduce a Text2Text formulation for generating cloze-style multiple-choice question distractors. Our experimental results highlight a significant performance improvement achieved through the Text2Text formulation. Specifically, our approach yields a nearly two-fold increase in performance compared to the current state-of-the-art method. These results strongly suggest that the generative Text2Text framework represents a superior alternative to the traditional candidate generating-and-ranking (CGR) framework.

Limitations
We report the following limitations of the Text2Text-based distractor generator (the major proposal in this study):
• The Text2Text-based generator still suffers from generating distractors identical to the answer or to previously generated distractors. In fact, generating repeated, incoherent, or factually inconsistent results is a common concern for neural text generators (Durmus et al., 2020; Wang et al., 2020). Although the concern is mitigated by the candidate augmentation strategy, a certain portion of such distractors remains, as can be seen in Table 5.
• Although CGR-based methods show a disadvantage in our evaluation, they may be more practical for facilitating cloze-style MCQ preparation: a CGR-based method is able to generate ten or more candidates for educators to select from, while our Text2Text generators are only capable of generating three or four distractors.

Appendix A Qualitative Study
In Table 11 and Table 12 we present two generation results selected from the CLOTH test set. For each result, we present the cloze passage, the cloze answer, and three distractors. We list the distractor results generated by the T5 model using Text2Text (sentence-level), candidate augmentation, generation with pseudo KL-divergence regulation, and multi-task training (the distractor finding and cloze test answering tasks). In Example 1, we observe that T5 Text2Text produces effective distractors for certain questions, specifically questions 1, 2, 5, 6, 7, and 16. The generated distractors are distinct from the answers and vary among the three options. However, in other questions, we notice instances where Text2Text generates repeated or answer-matching distractors. In such cases, the distractors generated by the candidate augmentation and multi-task approaches exhibit less repetition, as seen in questions 10, 11, and 13. Notably, the multi-task-generated distractors outperform the Text2Text and candidate augmentation approaches in questions 3, 8, 9, 12, 14, and 17. These multi-task-generated distractors neither contain duplicates nor share the same part of speech as the answers. Additionally, we find positive outcomes with PKL regulation in questions 2, 7, 8, 9, 11, and 14. For instance, in question 2, "doctors" and "parents" are generated, providing discriminative distractors among the three options, with one being relatively straightforward while the other two pose more difficulty.
Moving to Example 2, we observe that the T5 Text2Text generator produces distinct distractors with the same part of speech as the answers for questions 1, 5, 6, 13, and 14. Candidate augmentation generates three distinct distractors for questions 2, 4, 8, 9, 10, 12, and 15, while the Text2Text generator occasionally produces duplicated or answer-matching distractors. When both Text2Text and candidate augmentation fail to provide satisfactory distractors, as in questions 3 and 11, the multi-task generator successfully generates three non-repetitive and non-answer-matching distractors. Furthermore, we note favorable outcomes with PKL regulation in questions 1, 3, 7, and 14, showcasing the desired discrimination among the options.

Passage
Carly's eyes filled with tears as the dusty bus drove down a dirt road in southern Vietnam. The 14-year-old girl and her _1_ had traveled by plane from Canton, Ohio, to Ho Chi Minh City and then by bus deep into the Mekong Delta. Now, as they reached the village, hundreds of cheering _2_ lined the entrance to the Hoa Lac School, a two-story building that Carly had _3_ money for. When Carly was eight, she started _4_ others by giving Thanksgiving baskets in the church to families in need. It was a snowy day, _5_ she saw that one girl was wearing only a shirt and that others didn't have _6_ coats. The next November, she went door to door asking for used coats, hats, gloves, and scarves, and then _7_ them out with the baskets. But Carly wanted to do more - she wanted to "change their lives". She _8_ that her grandmother's Rotary club had, years earlier, collected money to build a _9_ in Vietnam. That was it, she decided. She'd build a school too. She tried to let people _10_ more about Vietnam and the _11_ there. She gave speeches. She _12_ with enthusiasm. "The kids in rural Vietnam don't have beautiful schools," she told a room of 200 Rotarians. "That's not _13_. I want to give them a _14_ to make their lives better." That summer, Carly set off with her family across Ohio, _15_ three or four Rotary clubs a week. "We traveled like crazy people to all these _16_," recalled her mother, Kris. In two years, Carly had collected $50,000. At the dedication ceremony in Hoa Lac, the school principal was _17_ with the girl. "How wonderful it was that a girl of her age wanted to do something for kids so far away," he said through a translator.

Passage
Ellen Sims is an 18-year-old college student. She has an important history exam tomorrow morning. Ellen is going to study all night. She is not going to _1_ at all. Many college students, like Ellen, do this often. They think that in the morning, they will _2_ everything that they studied the night before. Ellen thinks that this is a good way to study, but many doctors _3_. They say that sleep is very important for memory and brain development. Scientists at Harvard Medical School in the USA studied sleep and memory. They studied 24 people. First, they asked the people to look at a picture and _4_ it. At night, they put the people in _5_ groups of 12. Group One went to sleep. Group Two did not. A few days later, scientists showed some _6_ to both groups. They asked the people to find the picture they _7_ before. The people in Group Two did not do so _8_ as those in Group One. It wasn't _9_ for them to remember the picture. What happened? Scientists say that sleep _10_ our memory. After we learn something new, sleep helps us remember it. And when we don't sleep, we can _11_ new things. Scientists say that many teenagers, like Ellen, sleep too _12_. They go to school and work, too. They also _13_ time with their friends. They're always _14_ and they think sleep isn't important. But scientists say the brains of teenagers are still _15_, and sleeping is a very important part of the development. When teens sleep less than six hours, they can't think clearly. That is not very helpful for a student who is taking an exam.

Table 2: An Overview of the Existing Distractor Generation Methods

Table 4: Distractor generation results on the compared datasets. In the table, DF denotes the distractor finding task, CTA denotes the cloze test answering task, and PKL denotes the pseudo KL-divergence regulation.

Table 5: Statistics on the percentage of generated distractors identical to the answer and of repeated distractors.

Table 6: Human evaluation on 5 randomly selected passages (60 questions in total) from the CLOTH-F test set. The value after ± in Plausibility is the standard deviation.

Table 7: Ratio of generated distractors identical to the answer and of repeated distractors for different top-k values in candidate augmentation (on the CLOTH-F dataset with length 1).