CDGP: Automatic Cloze Distractor Generation based on Pre-trained Language Model

Manually designing cloze tests is time-consuming and labor-intensive. The major challenge lies in wrong option (distractor) selection: carefully designed distractors improve the effectiveness of learner ability assessment. This motivates automatic cloze distractor generation. In this paper, we investigate cloze distractor generation by exploring the employment of pre-trained language models (PLMs) as an alternative for candidate distractor generation. Experiments show that the PLM-enhanced model brings a substantial performance improvement. Our best performing model advances the state-of-the-art result from 14.94 to 34.17 (NDCG@10 score). Our code and dataset are available at https://github.com/AndyChiangSH/CDGP.


Introduction
A cloze test is an assessment consisting of a portion of text with certain words removed (the cloze text), where the participant is asked to select the missing language item from a given set of options. Specifically, a cloze question (as illustrated in Figure 1) consists of a sentence with one word removed (a blank) and a list of options (one answer and three wrong options).
A cloze test with carefully designed distractors can improve the effectiveness of learner ability assessment. However, manually designing cloze tests is time-consuming and labor-intensive, and the major challenge lies in wrong option (distractor) selection. As a result, automatic cloze distractor generation has been proposed (Ren and Q. Zhu, 2021; Kumar et al., 2015; Narendra et al., 2013).
In this paper, we extend the candidate-ranking framework reported in (Ren and Q. Zhu, 2021) by exploring the employment of PLMs as an alternative for candidate distractor generation. We propose a cloze distractor generation framework called CDGP (automatic Cloze Distractor Generation based on PLMs), which incorporates a PLM-based candidate set generator and a feature-based distractor selector. The contributions of this work are as follows.
• We show that PLM-based methods bring significant performance improvement over the knowledge-driven methods (Ren and Q. Zhu, 2021), which generate candidates from Probase (Wu et al., 2012) or WordNet (Miller, 1995).
• We conduct evaluations on two benchmark datasets. The experiment results indicate that our CDGP significantly outperforms the state-of-the-art result (Ren and Q. Zhu, 2021), advancing the NDCG@10 score from 19.31 to 34.17 (an improvement of up to 77%).

Related Work
Methods for cloze distractor generation fall into two categories. The first category (Correia et al., 2010; Lee and Seneff, 2007) prepares cloze distractors based on linguistic heuristic rules. The problem with these methods is that the results are far from practically satisfactory. The second category (Kumar et al., 2015; Narendra et al., 2013) selects distractors based on similarity measures. To improve quality, (Ren and Q. Zhu, 2021) proposes to use knowledge bases (WordNet (Miller, 1995) and Probase (Wu et al., 2012)) to analyze word semantics and hypernym-hyponym relations for generating candidate distractors. In this paper, we explore the employment of PLMs as an alternative to the knowledge bases in (Ren and Q. Zhu, 2021) and also explore various linguistic features for candidate selection.

CDGP Framework
We extend the framework proposed by (Ren and Q. Zhu, 2021) by exploring the employment of pre-trained language models as an alternative for candidate distractor generation. Specifically, as illustrated in Figure 2, the framework consists of two stages: (1) a Candidate Set Generator (CSG) and (2) a Distractor Selector (DS). In this paper, we revisit the framework by considering (1) PLMs for the CSG and (2) various features for the DS.

Candidate Set Generator (CSG)
The input to the CSG is a question stem and the corresponding answer. The output is a distractor candidate set of size k.
In this study, we use a PLM to generate candidates. Let M(·) be the PLM. A training instance is a triple (S, A, D), where S is a cloze stem, A is the answer, and D is a distractor. We explore two training settings for generating distractor candidates. The idea is to guide the model to refer to A when generating D. Specifically, the training objective is to find a parameter set θ minimizing the loss function −log(p(D | S, A; θ)).
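As a toy illustration of this objective, the per-instance loss is the negative log-likelihood of the gold distractor D under the model's softmax distribution over the vocabulary. The sketch below is ours, not the paper's code; the vocabulary logits are hypothetical stand-ins for the PLM's output at the masked position.

```python
import math

def nll_loss(logits: dict, distractor: str) -> float:
    # -log p(D | S, A; theta): negative log-likelihood of the gold
    # distractor under the softmax over the model's vocabulary logits.
    z = sum(math.exp(v) for v in logits.values())
    p = math.exp(logits[distractor]) / z
    return -math.log(p)

# Hypothetical vocabulary logits for one masked position.
logits = {"east": 2.0, "west": 1.0, "north": 0.5}
loss = nll_loss(logits, "west")  # smaller when the model prefers "west"
```

The loss shrinks as the model places more probability mass on the gold distractor, which is what fine-tuning optimizes.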

Distractor Selector (DS)
The input to the DS is a question stem S, an answer A, and a candidate set {D_i} from the CSG. We investigate the following features for ranking candidates.
• Confidence score s_0: the confidence score of D_i given by the PLM at the CSG stage.
• Word embedding similarity s_1: the cosine similarity between the word embeddings of A and D_i.
• Contextual sentence embedding similarity s_2: the sentence-level cosine similarity between the stem with the blank filled with A (denoted S⊗A) and the stem with the blank filled with D_i (denoted S⊗D_i).
• POS match score s_3: a part-of-speech matching indicator; s_3 = 1 if A and D_i have the same POS tag, and s_3 = 0 otherwise.
The final score of a distractor D_i is then computed as a weighted sum over the individual scores with MinMax normalization:

score(D_i) = w_0 · s_0 + w_1 · s_1 + w_2 · s_2 + w_3 · s_3
Distractors with the Top-3 scores are selected as the final resultant distractors.
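The selection step can be sketched in Python as follows (a minimal sketch assuming the four feature scores have already been computed; function names are ours, and the default weights use the setting (0.6, 0.15, 0.15, 0.1) reported in the experiments):

```python
def minmax(xs):
    # MinMax-normalize raw feature scores into [0, 1]
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

def top3_distractors(s0, s1, s2, s3, weights=(0.6, 0.15, 0.15, 0.1)):
    # Weighted sum of the four DS features; s3 is the 0/1 POS indicator
    # and is used as-is, while the other features are MinMax-normalized.
    feats = [minmax(s0), minmax(s1), minmax(s2), list(s3)]
    scores = [sum(w * f[i] for w, f in zip(weights, feats))
              for i in range(len(s0))]
    # Return indices of the Top-3 candidates by final score.
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:3]
```

MinMax normalization keeps features with different raw scales (PLM confidence vs. cosine similarity) comparable before the weighted sum.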

Performance Evaluation

Dataset
To validate the performance of our methodology, we use the following two datasets.
Each dataset consists of passages with cloze stems, answers, and distractors. The data statistics are summarized in Table 1 and Table 2.
• DGen dataset (Ren and Q. Zhu, 2021): a dataset reorganized from SciQ (Welbl et al., 2017) and MCQL (Liang et al., 2018). We compare our methods with the SOTA method (Ren and Q. Zhu, 2021) on this dataset. The data statistics are listed in Table 3 and Table 4.

Implementation Details
We select bert-base-uncased (Devlin et al., 2018) as the default PLM. We use the Adam optimizer with an initial learning rate of 0.0001. We set the maximal PLM input length to 64, and the default batch size is 64. All models are trained on an NVIDIA Tesla T4 GPU.
For computing word embedding similarity in the DS, we use the fastText model (Bojanowski et al., 2016) as the default embedding model. The fastText model is trained with the CBOW setting. The minimal and maximal character n-gram lengths are set to 3 and 6, respectively. The vector dimension is set to 100, and the initial learning rate is 0.05. In addition, the size k of the distractor candidate set is set to 10 by default.
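For reference, these hyperparameters map onto gensim's FastText API roughly as follows. This is a configuration sketch under the assumption that gensim is used for training; the one-sentence corpus is only a placeholder.

```python
from gensim.models import FastText  # assumed embedding library

model = FastText(
    sentences=[["a", "cloze", "test", "example"]],  # placeholder corpus
    sg=0,             # CBOW setting
    min_n=3,          # minimal character n-gram length
    max_n=6,          # maximal character n-gram length
    vector_size=100,  # vector dimension
    alpha=0.05,       # initial learning rate
    min_count=1,      # keep every token in the toy corpus
)
```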

Results on DGen
In this set of experiments, our goal is to compare our method with the SOTA method (Ren and Q. Zhu, 2021). Table 5 shows the comparison results. In addition to the BERT model, we also report CDGP variants based on SciBERT, RoBERTa, and BART. From Table 5, it can be seen that the NDCG@10 of CDGP with SciBERT improves from 19.31 to 34.17, surpassing the existing SOTA method by 77%.
An interesting finding is that CDGP with SciBERT shows the best results in this set of experiments. We believe this confirms the domain match between SciBERT and DGen: SciBERT is pre-trained on scientific literature, and DGen is a dataset related to scientific domains.
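For readers unfamiliar with the ranking metric used above, NDCG@10 can be computed as below. This is a minimal sketch with binary relevance (1 if a ranked candidate matches a ground-truth distractor, else 0), not the paper's evaluation script.

```python
import math

def dcg(rels):
    # Discounted cumulative gain of a ranked relevance list
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg_at_k(ranked_rels, k=10):
    # Normalize by the DCG of the ideal (sorted) ranking
    ideal = dcg(sorted(ranked_rels, reverse=True)[:k])
    return dcg(ranked_rels[:k]) / ideal if ideal > 0 else 0.0

# e.g. ground-truth hits at ranks 1 and 3 among the returned candidates
score = ndcg_at_k([1, 0, 1], k=3)
```

A higher score means ground-truth distractors appear earlier in the ranked candidate list.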

Results on CLOTH dataset
In this experiment, we evaluate the performance of our models on the CLOTH dataset and conduct ablation studies for our CDGP model.
Comparing Fine-Tuning Strategies In this set of experiments, we compare the performance of naive fine-tuning and answer-relating fine-tuning. The results are presented in Table 6.
From the above results, it can be observed that the overall score of answer-relating fine-tuning is higher than that of naive fine-tuning. We therefore select answer-relating fine-tuning as the default fine-tuning strategy.
Comparing PLMs Table 7 shows the comparison results. Through this experiment, we see that the BERT model has the most outstanding performance, so we use the BERT model for the subsequent experiments.
Comparing DS Factors There are four scoring factors in the DS, namely s_0 (confidence score), s_1 (word embedding similarity), s_2 (contextual sentence similarity), and s_3 (part-of-speech match score). In this experiment, we adjust the weight of each scoring factor in the DS (w_0 to w_3) and compare different weight ratios. Table 8 shows the experiment results.
From the results in Table 8, we see that lowering the weights of s_1 and s_2 yields better distractor generation performance, but if they are set too low, the performance starts to degrade.
After these experiments, we find that setting the DS weights (w_0, w_1, w_2, w_3) to (0.6, 0.15, 0.15, 0.1) shows the best performance. We use this weighting as the default for the other experiments.
Comparing CDGP Components Through the above experiments, we obtain the best parameter settings for CDGP. To verify the effectiveness of the CDGP design, in this set of experiments we compare the framework with and without each component. Table 9 presents the experimental results.
From the results, we can see that the whole CDGP framework (CSG + DS with (w_0, w_1, w_2, w_3) = (0.6, 0.15, 0.15, 0.1)) shows the best results compared with the options using only one or neither of the components. Furthermore, we see that using only the CSG improves performance by 107.7% (in terms of NDCG@10) compared with the none scheme (which uses BERT's MLM capability to produce distractor candidates without any fine-tuning), while using only the DS brings a slight performance improvement (2.5%).
Such results indicate that the major performance improvement comes from the CSG employment.

Result on Human Evaluation
We also recruit 40 human evaluators from our campus. The evaluation process is as follows. First, the evaluator takes a cloze exam (a passage with 10 cloze multiple-choice questions). The passages are randomly selected from the CLOTH dataset. For a selected passage, we keep five original questions and replace the other five with questions generated by our model. Our goal is to observe the answering correct rate on the manually designed distractors versus the automatically generated distractors. Furthermore, we also ask the evaluators to examine the quality of the generated distractors. Specifically, after the exam, we ask the evaluators to (1) guess which questions are generated by CDGP and (2) rate the distractor difficulty on a Likert scale ranging from 1 to 5.
Answering Correct Rate We find that the correct rate on the human-designed cloze questions is 50.5%, while the correct rate on CDGP questions is 66%. The higher correct rate on CDGP questions shows that CDGP distractors are somewhat easier than human-designed ones. Improving and controlling the difficulty of automatically generated distractors is an interesting direction for future work.
Distinguishing Human-Designed from CDGP Questions In the test of judging whether a question is a CDGP question, the correct rate of the evaluators' guesses is 53%, which is close to random guessing, showing that evaluators cannot effectively distinguish between human-designed and CDGP questions.
Examining the Difficulty of Generated Distractors As shown in Figure 3, the testers' difficulty ratings are approximately normally distributed, indicating that the difficulty level of the questions is moderate. The performance of CDGP questions is close to that of manually designed questions, which confirms that CDGP can assist in cloze distractor preparation.

Conclusion
Our study indicates that a PLM-based candidate distractor generator is a better alternative to knowledge-based components. The experiment results show that our model significantly surpasses the SOTA method, demonstrating the effectiveness of PLM-based distractor generation for cloze tests. The results also show that using a domain-specific PLM further boosts generation quality.

Limitations
The major limitation of this study is that the current evaluation on the test dataset cannot fully reflect distractor generation quality: a mismatch with the ground-truth distractors does not imply that a generated distractor is infeasible. Also, we currently have no way to control the difficulty or the correctness of the generated distractors.

Figure 1 :
Figure 1: A cloze test example. The challenge of cloze test preparation lies in wrong option selection; a good wrong option selection improves the effectiveness of learner ability assessment.

1. Naive Fine-Tune: M(S⊗[MASK]) → D. The input is the stem S with the cloze blank filled with [MASK] (denoted S⊗[MASK]). The idea is to fine-tune the PLM to predict D. The training objective is to find a parameter set θ minimizing the loss function −log(p(D | S; θ)).
2. Answer-Relating Fine-Tune: M(S⊗[MASK], A) → D. The input is further concatenated with the cloze answer A.

Figure 3 :
Figure 3: The testers' feedback on the difficulty of the questions generated by CDGP (1: easiest, 5: most difficult)

Table 6 :
The Results of Naive and Answer-Relating Fine-Tuning Comparison

Table 7 :
Results on Comparing the Employment of Different Pre-trained Language Models (fine-tuned with CLOTH dataset)

Table 8 :
Distractor Selector Features Weighting Comparison

Table 9 :
Ablation study on CDGP components