Open-World Factually Consistent Question Generation

Question generation methods based on pre-trained language models often produce questions that are factually inconsistent, contain incorrect entities, or are not answerable from the input paragraph. Domain shift, where the test data comes from a different domain than the training data, further exacerbates hallucination. This is a critical issue for any natural language application that performs question generation. In this work, we propose an effective data processing technique based on de-lexicalization for consistent question generation across domains. Unlike existing approaches for remedying hallucination, the proposed approach does not filter training data and is generic across question-generation models. Experimental results across six benchmark datasets show that our model is robust to domain shift and produces entity-level factually consistent questions without significant impact on traditional metrics.


Introduction
Question generation is the task of generating a question that is relevant to and answerable by a piece of text (Krishna and Iyyer (2019), Chen et al. (2020), Zhu and Hauff (2021), Ushio et al. (2022)). It is an important task in language generation (Fabbri et al. (2020), Yu et al. (2020b)), education (Wang et al. (2022)), and information retrieval (Yu et al. (2020a)). A critical metric for question generation is factual consistency, i.e., whether the facts in the question are derivable from the input paragraph. This work proposes novel methods to improve entity-level factual consistency while remaining agnostic to the model and the underlying training data. Nan et al. (2021) and Xiao and Carenini (2022) solve a similar problem for summarization. However, to the best of our knowledge, no work addresses the issue of entity-level factual inconsistency for question generation. Nema and Khapra (2018) have shown that named entities are essential for a question's answerability; the presence of wrong entities may make the question nonsensical and unanswerable. Table 1 shows entity-level factual inconsistency in question generation by a fine-tuned PEGASUS (Zhang et al., 2019) model. In the first example, "Kim Jong Un", and in the second, "Chicago", are hallucinated.
Unlike previous work in the summarization field (Nan et al. (2021), Liu et al. (2021a), Xiao and Carenini (2022)), our work is independent of the model and the training process. We also do not reduce dataset size by filtering. Instead, we preprocess datasets to force the model to generate questions faithful to the input, using de-lexicalization and multi-generation strategies, and recommend the best strategy. The proposed method improves factual consistency by 84-100% across multiple datasets while having minimal impact on traditional performance metrics.
We experimented with two popular language models, viz. PEGASUS-large and BART-large (Lewis et al., 2020). For both models, our proposed approach consistently outperforms normal fine-tuning. We also compare our approach to recent methods for addressing hallucination in summarization, and our results show significantly better performance.
Input: As the president of South Korea, she turned out to be heavily influenced by a cult and giving them access to political documents and basically letting that cult run the country through her. One of her "speech-writers" was a religious figure in this cult that totally exploited the president, who pretty much only got elected because her father had been president, using her for money and power and wealth. The people of South Korea got upset when they learned that the real power that was running their country was the religious group that their president belonged to.
Normal Finetuning: What did Kim Jong Un do wrong?
Rare word delexicalization + Multiple (Ours): Why did the people of South Korea vote out the president of the South Korea?
Input: As I understand it, there's no established *process* for removing the mayor from his post. If he doesn't resign, there's no mechanism for removing him. Really, I don't understand how this guy hasn't lost his job yet.
Normal Finetuning: Why hasn't the mayor of Chicago been fired yet?
Rare word delexicalization + Multiple (Ours): Why hasn't the mayor been fired yet?
Table 1: Qualitative examples. For detailed analysis, refer to Section 6.4.
Previous work has explored entity-based de-lexicalization in settings such as adapting a parser to a new language (Zeman and Resnik, 2008), validating reasoning chains in multi-hop question answering (Jhamtani and Clark, 2020), and eliminating diachronic biases in fake news detection (Murayama et al., 2021).

Methodology
The objective is to generate relevant and entity-level factually consistent questions that generalise across domains. For this, we propose novel de-lexicalization strategies combined with a multi-generation strategy. De-lexicalization involves replacing named entities with a special token or rare words during training/inference and restoring the original words after generation. The model's vocabulary is expanded to account for the special tokens used in the de-lexicalization strategies.
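For illustration, the vocabulary expansion can be done by registering the de-lexicalization tokens as additional special tokens of the tokenizer and resizing the model's embedding matrix. The following is a minimal sketch using the Hugging Face interface; the checkpoint name, the exact token strings, and the number of slots per type are illustrative assumptions, not the exact configuration used in our experiments.

```python
from transformers import PegasusTokenizer, PegasusForConditionalGeneration

# Illustrative checkpoint choice.
tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-large")
model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-large")

# De-lexicalization tokens; the strings and the number of slots are assumptions.
name_tokens = [f"[Name {i}]" for i in range(10)]
typed_tokens = [f"[{t} {i}]"
                for t in ("PERSON", "GPE", "ORG", "MONEY", "DATE", "CARDINAL")
                for i in range(10)]

tokenizer.add_special_tokens({"additional_special_tokens": name_tokens + typed_tokens})
# Resize the embedding matrix so the new tokens receive trainable embeddings.
model.resize_token_embeddings(len(tokenizer))
```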

De-lexicalization Strategies During Training
[Name i] Token: This strategy replaces each named entity with a token [Name i], where i represents the order of the entity's first appearance in the paragraph and then in the question.
[Name i] Token with Push: This strategy is similar to the previous one. The difference is that if the question has a named entity that is not present in the input paragraph, we replace it with [Name j], where j is a random number between 0 and the total number of named entities in the input paragraph. The intuition is that we push, or explicitly ask, the model to generate a named entity already present in the input paragraph.
[Multiple i] Token: The previous two strategies treat all named entities alike. In contrast, this approach replaces each entity with its corresponding semantic tag, followed by an integer representing its order of appearance in the paragraph and then the question. A semantic tag specifies whether an entity is a name, organization, location, cardinal, etc.
[Multiple i] Token with Push and Delete: This approach is similar to the [Name i] Token with Push approach but with multiple entity types. However, if the question contains a named entity of a type not present in the paragraph, it is deleted.
Rare Word Token: This strategy de-lexicalizes only the questions. Here we replace named entities in the question that do not occur in the input paragraph with a rare word, i.e., a word that occurs 2 to 5 times in the entire training corpus. If an entity occurs in the input paragraph, it is left as is.
Examples showing different de-lexicalization strategies are present in the Appendix.
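As a concrete illustration of the preprocessing, a minimal sketch of the [Name i] strategy using spaCy for named-entity recognition is shown below; the choice of NER model, the plain string replacement, and the handling of repeated mentions are simplifications of the actual preprocessing.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # any spaCy NER model; this choice is illustrative

def delexicalize_name_i(paragraph: str, question: str):
    """Sketch of the [Name i] strategy: entities are indexed by first appearance
    in the paragraph and then in the question, and every occurrence of an
    entity maps to the same token."""
    mapping = {}  # entity surface form -> [Name i] token
    for text in (paragraph, question):
        for ent in nlp(text).ents:
            if ent.text not in mapping:
                mapping[ent.text] = f"[Name {len(mapping)}]"

    def replace(text: str) -> str:
        for surface, token in mapping.items():
            text = text.replace(surface, token)
        return text

    # The inverse mapping is kept for entity replacement after generation.
    inverse = {token: surface for surface, token in mapping.items()}
    return replace(paragraph), replace(question), inverse
```

Under this sketch, the Push variant would additionally remap the token of a question-only entity to a random index already used in the paragraph, and the Rare Word variant would leave the paragraph untouched and substitute a rare word only for question entities absent from the paragraph.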
Entity Replacement: During testing, entities in the generated questions are restored using a dictionary look-up of the special tokens. We treat an output as hallucinated if a special token has no corresponding named entity.
Multi-generation: Here, we generate multiple questions during inference by selecting the top five beams from the output of the language model and choosing the one that is factually consistent and has the lowest perplexity. If no question is consistent, the generation with the lowest perplexity is chosen.
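A sketch of the multi-generation step is given below. Here `is_consistent` stands for the entity-level consistency check described above and is a hypothetical helper, and the generation length limit is an assumption.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").eval()
gpt2_tok = GPT2TokenizerFast.from_pretrained("gpt2")

def perplexity(text: str) -> float:
    """GPT-2 perplexity of a candidate question."""
    ids = gpt2_tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = gpt2(ids, labels=ids).loss
    return torch.exp(loss).item()

def select_question(model, tokenizer, paragraph: str, is_consistent) -> str:
    """Keep the top-5 beams, prefer consistent candidates, break ties by perplexity."""
    inputs = tokenizer(paragraph, return_tensors="pt", truncation=True)
    beams = model.generate(**inputs, num_beams=5, num_return_sequences=5, max_length=64)
    candidates = [tokenizer.decode(b, skip_special_tokens=True) for b in beams]
    candidates.sort(key=perplexity)
    consistent = [q for q in candidates if is_consistent(q, paragraph)]
    return consistent[0] if consistent else candidates[0]
```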
Table 3 illustrates the different de-lexicalization strategies on an example in which the question contains the named entity "U.S.", which is not present in the input. In the [Name i] Token strategy, we replace all named entities with [Name i] tokens. Note that the entities 55,000 and 2018 each occur twice; every occurrence of an entity is replaced with the same token, i.e., both occurrences of 55,000 are replaced with [Name 3]. Since "U.S." does not occur in the input, it is replaced with [Name 5]. In contrast, in the [Name i] Token with Push strategy, we replace "U.S." with [Name 3], thereby pushing the model to be faithful to the source.
In the [Multiple i] Token strategy, instead of replacing named entities with a common [Name] token, we replace them with their semantic tokens. Thus, 55,000 is replaced with [MONEY 1], and so on. As before, each occurrence of an entity is replaced with the same token. "U.S." is replaced with [GPE 0], as no entity of type GPE occurs in the input. In contrast, the [Multiple i] Token with Push and Delete strategy deletes the entity "U.S.", as no GPE-type entity exists in the input; had a GPE entity (not necessarily "U.S.") occurred in the input, "U.S." would have been replaced with [GPE 0].
In the Rare Word Token strategy, the input is unchanged. Since "U.S." does not occur in the input, it is replaced with a rare word (aster).

Datasets
We use the supervised ELI5 dataset (Fan et al., 2019b) for training. To ensure that the data is of high quality, we remove all samples where the answer is short (fewer than 50 words) or the question does not contain a question mark.
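The filtering rule amounts to a simple per-sample check; a minimal sketch, assuming each sample exposes a question string and an answer string, is:

```python
def keep_eli5_sample(question: str, answer: str) -> bool:
    """Keep a training sample only if the answer has at least 50 words
    and the question contains a question mark."""
    return len(answer.split()) >= 50 and "?" in question
```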
We use three publicly available datasets for evaluation across different domains, viz. MS Marco (Bajaj et al., 2016), Natural Questions (Kwiatkowski et al., 2019), and SciQ (Welbl et al., 2017). We also scraped r/AskLegal and r/AskEconomics for testing on the finance and legal domains. Table 2 shows the statistics of these datasets.

Implementation Details
We use publicly available checkpoints of the language models and fine-tune them for 100k steps with a batch size of 12, using the Adam optimizer (Kingma and Ba, 2014). The learning rate is set to 10^-5, and the models are evaluated on the dev set every 10k steps; the best-performing checkpoint on the dev set is used. Training takes approximately 6 hours on an Nvidia A100 40 GB GPU. Following Nan et al. (2021), we use the spaCy library to identify named entities.
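For reference, this setup corresponds roughly to the following Hugging Face Seq2SeqTrainingArguments; values not stated above (e.g., warmup, weight decay) are left at library defaults, the output directory name is illustrative, and this interface defaults to AdamW rather than plain Adam.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="pegasus-eli5-delex",   # illustrative path
    max_steps=100_000,
    per_device_train_batch_size=12,
    learning_rate=1e-5,
    evaluation_strategy="steps",       # evaluate on the dev set ...
    eval_steps=10_000,                 # ... every 10k steps
    save_steps=10_000,
    load_best_model_at_end=True,       # keep the best dev-set checkpoint
    predict_with_generate=True,
)
```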

Evaluation Metrics
We evaluate both the quality and the factual consistency of the generated questions. Quality is reported using Rouge-1, Rouge-2, and Rouge-L (Lin, 2004) scores and the cosine similarity between the embeddings (from the all-mpnet-base-v2 sentence transformer model (Reimers and Gurevych, 2019)) of the generated questions and the ground truth. We also report the perplexity value suggested by Liu et al. (2021b), computed with GPT-2 (Radford et al., 2019). To evaluate factual consistency, we use two metrics. The first quantifies the degree of hallucination with respect to the ground-truth question; we use the entity-level precision, recall, and F1 score proposed by Nan et al. (2021). More details about the exact implementation are in the appendix and in their paper. The second quantifies the degree of hallucination with respect to the input paragraph: out of all the questions that contain named entities, it measures the percentage whose named entities are not present in the input. Let N_hne be the number of generated questions with a named entity, N_wne the number of generated questions with a wrong named entity, and N_total the total number of questions. Note that N_total ≠ N_hne, since some questions contain no named entity. Then P_ne = N_hne / N_total * 100 is the percentage of questions containing a named entity, and P_wne = N_wne / N_hne * 100 is the percentage of such questions containing a wrong named entity. A low P_wne value together with a high F1 score indicates that the system is not hallucinating. We want a system with high factual consistency that does not significantly degrade question quality as measured by the other metrics.
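A sketch of how P_ne and P_wne can be computed from the generated questions and their input paragraphs is shown below; the substring check for entity presence and the NER model choice are simplifying assumptions.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # NER model choice is illustrative

def entity_hallucination_rates(questions, paragraphs):
    """Return (P_ne, P_wne) as percentages, following the definitions above."""
    n_total = len(questions)
    n_hne = 0  # questions containing at least one named entity
    n_wne = 0  # questions containing an entity absent from the input
    for question, paragraph in zip(questions, paragraphs):
        entities = [ent.text for ent in nlp(question).ents]
        if not entities:
            continue
        n_hne += 1
        if any(ent not in paragraph for ent in entities):
            n_wne += 1
    p_ne = 100.0 * n_hne / n_total if n_total else 0.0
    p_wne = 100.0 * n_wne / n_hne if n_hne else 0.0
    return p_ne, p_wne
```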

Baseline
We compare our results with the Spancopy method proposed by Xiao and Carenini (2022) for summarization. We test Spancopy both with and without global relevance, with PEGASUS as the base language model.

Results and Analysis
Due to space constraints, we only present results for PEGASUS-large in the main text. Results for BART-large can be found in the appendix.
Table 4 shows the results on the test set of the ELI5 dataset. The results indicate that the rare word de-lexicalization plus multiple generation approach performs much better than the other methods. Compared to a normally fine-tuned PEGASUS model, the P_wne score decreases by about 98%, implying that the generated questions are faithful to the input text. Similarly, the F1 score increases by approximately 21%, implying that the generated questions are faithful to the ground truth. In contrast, the decrease in the other metric scores is less than 6.7%. Overall, rare word de-lexicalization plus multiple generation performs best in terms of factual consistency and is comparable on the other metrics. For detailed analysis, refer to Section 6.4.
The rare word de-lexicalization with multi-generation approach consistently performs better than all other approaches on all datasets. Table 5 compares rare word delexicalization + multiple generation with a normally fine-tuned PEGASUS and with Spancopy without global relevance across different datasets. Detailed results for all approaches across all datasets are in the appendix.
From the table, it can be seen that rare word de-lexicalization with multiple generations resolves entity-level inconsistency without a negative impact on the other metrics. The model was trained only on the ELI5 dataset and was used directly on the other datasets. Domain shift exacerbates entity hallucination, as shown by the P_wne value for a normally fine-tuned PEGASUS model, which is usually higher in the presence of domain shift. Thus, our proposed approach works across domains without re-training.
We see that the P_ne value decreases across all datasets for rare word de-lexicalization with multiple generations. However, this is not a problem: a question without a named entity can still be a valid question (Nema and Khapra, 2018).
Table 1 shows qualitative examples. In the first example, the fine-tuned PEGASUS produces the entity Kim Jong Un, which is unfaithful to the source and entirely unrelated to South Korea. Chicago is hallucinated in the second example. In both examples, our proposed approach generates meaningful and faithful questions. In the second example, our approach produces a question with no named entity, yet the question is meaningful and faithful to the source. This further reinforces our claim that a question without a named entity can still be valid. More outputs can be found in the appendix.
Our approach performs better than the Spancopy architecture, both with and without global relevance. This shows that simple de-lexicalization with multiple generations outperforms a more sophisticated architecture.

Conclusion
In this paper, we study entity-level factual inconsistency in question generation. Our proposed strategy, rare-word de-lexicalization with multi-generation, improves consistency without significantly affecting traditional metrics across data domains. Extensive experimental results further reinforce our claim.

A Processing Publicly Available Datasets
This section describes our processing of the MS Marco, Natural Questions, and SciQ datasets. Since these datasets are used exclusively for testing, we can even use their training sets for testing. For MS Marco, we use the train set due to the small size of the test set. Since MS Marco is a sentence-based dataset, input contexts are usually small, so we only include data points where the answer has at least 40 words and the question ends with a question mark. We also use the training set for Natural Questions, as it is a well-defined JSON file; we randomly select five thousand questions from it and ensure that the answer does not come from a table. For SciQ, we use the test set but filter out all documents for which the supporting text is missing; this supporting text is the input to the model.

B Precision, Recall and F1 Scores
Let q_gt and q_gen be the ground-truth and generated questions, respectively. Let N(q_gt ∩ q_gen) be the number of named entities common to the ground-truth and generated questions, and let N(q_gt) and N(q_gen) be the number of named entities in the ground-truth and generated question, respectively. Then precision = N(q_gt ∩ q_gen) / N(q_gen) and recall = N(q_gt ∩ q_gen) / N(q_gt). The F1 score is the harmonic mean of precision and recall.
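Assuming the named entities have already been extracted as sets of surface forms, the computation reduces to the following sketch:

```python
def entity_precision_recall_f1(gt_entities: set, gen_entities: set):
    """Entity-level precision, recall, and F1 as defined above."""
    common = len(gt_entities & gen_entities)
    precision = common / len(gen_entities) if gen_entities else 0.0
    recall = common / len(gt_entities) if gt_entities else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```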

C Results Across Multiple Datasets
This section presents the results of the different de-lexicalization strategies across datasets. Tables 6, 7, 8, 9, and 10 present the results on the MS Marco, Natural Questions, SciQ, AskEconomics, and AskLegal datasets for the PEGASUS model.

D More Qualitative Examples
Table 17 shows some more qualitative examples.

Table 2: Statistics for different datasets.

Table 3: Examples of different de-lexicalization strategies. For details refer to Section 4.
Original Input: One way would be to allow unlimited deductions of savings and tax withdrawals as income. So if you buy $50,000 in bonds in 2017, you deduct all that from your income. Then you sale those bonds for $55,000 in 2018, you would add that $55,000 to your 2018 income and it's taxed like any other income. The simplest way to implement that would be to eliminate penalities and caps on IRA accounts. Said my whole question, don't know what else to say.
Question: How can the aster tax system be reformed?

Table 5: Results of normally fine-tuned PEGASUS, Rare word delexicalization + Multiple (proposed), and Spancopy without global relevance.

Table 6: Results of various approaches on the MS Marco dataset for the PEGASUS model. C.S.: Cosine Similarity.

Table 7: Results of various approaches on the Natural Questions dataset for the PEGASUS model. C.S.: Cosine Similarity.

Table 10: Results of various approaches on the AskLegal dataset for the PEGASUS model. C.S.: Cosine Similarity.

Table 14: Results of various approaches on the SciQ dataset for the BART model. C.S.: Cosine Similarity | R-1: Rouge-1 | R-2: Rouge-2 | R-L: Rouge-L | PPL: Perplexity.