MQAG: Multiple-choice Question Answering and Generation for Assessing Information Consistency in Summarization

State-of-the-art summarization systems can generate highly fluent summaries. These summaries, however, may contain factual inconsistencies and/or information not present in the source. Hence, an important component of assessing the quality of summaries is to determine whether there is information consistency between the source and the summary. Existing approaches are typically based on lexical matching or representation-based methods. In this work, we introduce an alternative scheme based on standard information-theoretic measures in which the information present in the source and summary is directly compared. We propose a Multiple-choice Question Answering and Generation framework, MQAG, which approximates the information consistency by computing the expected statistical distance between summary and source answer distributions over automatically generated multiple-choice questions. This approach exploits multiple-choice answer probabilities, as predicted answer distributions can be compared. We conduct experiments on four summary evaluation datasets: QAG-CNNDM/XSum, XSum-Hallucination, Podcast Assessment, and SummEval. Experiments show that MQAG, using models trained on SQuAD or RACE, outperforms existing evaluation methods on the majority of tasks.


Introduction
The objective of summary evaluation is to quantify the quality of summaries, either on a relative or an absolute scale. Accurate and reliable automatic summary evaluation systems are useful to researchers, as they provide an easy and cheap way to compare new summarization models to existing ones. Although current summarization systems have improved dramatically in the last decade and are capable of generating highly fluent outputs (Lewis et al., 2020; Zhang et al., 2020a; Brown et al., 2020), it has been shown that generated summaries are prone to factual errors or hallucinations (Kryscinski et al., 2019; Huang et al., 2021; Nan et al., 2021; Ji et al., 2022). Thus, information consistency between the summary and source is an important assessment criterion. Existing methods that measure information consistency generally perform lexical matching, either directly in the form of ROUGE (Lin, 2004) and BLEU (Papineni et al., 2002), or indirectly using more complex representations such as triple matching (Goodrich et al., 2019). Some recent approaches adopt question answering (QA) pipelines to detect factual inconsistencies (Wang et al., 2020; Durmus et al., 2020; Deutsch et al., 2021; Scialom et al., 2021). They are based on the assumption that if the answer extracted from the source is consistent with the answer extracted from the summary, then the summary and source are consistent. The answers are compared using either lexical matching (Wang et al., 2020; Durmus et al., 2020) or representation-based matching (Deutsch and Roth, 2022). These span-based QA approaches (referred to as SpanQAG) can suffer from lexical biases, and struggle with highly abstractive summaries or when multiple answer spans are possible. The code is available at https://github.com/potsawee/mqag0.
In this work, a measure of consistency between the source and summary is defined from an information-theoretic perspective. We propose a Multiple-choice Question Answering and Generation framework, MQAG, in which, instead of comparing answer spans, multiple-choice questions are generated and the resulting answer distributions from the source and the summary are compared. The main contributions of this paper are: (i) we provide an alternative way of assessing information consistency based on probability distributions instead of lexical or representation-based methods; (ii) we show that our approach achieves the best performance on three out of the four evaluation datasets.

Background and Related Work
This work focuses on determining if information in the summary is consistent with information in the source, and does not consider 'factuality', where valid external facts are acceptable (Maynez et al., 2020). Existing methods that have been applied to evaluate information consistency include:
Text overlap scores. ROUGE and BLEU measure word overlap between two texts, while BERTScore (Zhang et al., 2020b) and BLEURT (Sellam et al., 2020) compare texts in their representation space. These metrics measure textual similarity, so they are not necessarily a good measure of consistency.
Knowledge representation. Goodrich et al. (2019) assess factual consistency by comparing relation triples from the source and the summary. The relation triples are in the Subject-Relation-Object format and can be obtained using a model-free method such as OpenIE (Etzioni et al., 2008) or a trained relation extraction model.
Textual Entailment. A textual entailment classifier trained on Multi-NLI (Williams et al., 2018) has been applied to assess summaries' consistency with the source (Maynez et al., 2020). Similarly, simulated data, such as real or fake summaries created by pre-defined transformations, have been used to train classifiers to detect inconsistent summaries (Kryscinski et al., 2020; Bao et al., 2022).
Question Answering. QAGS (Wang et al., 2020) and FEQA (Durmus et al., 2020) were among the first to utilise a QA framework. These approaches typically consist of a question generation model and a question answering model: QAGS and FEQA first generate questions based on the summary, and then measure consistency by comparing the source's answer with the summary's answer. As an extension, QuestEval (Scialom et al., 2021) also generates questions based on the source to assess the informativeness of a summary.

MQAG
Since current summarization systems generate highly fluent summaries, this work focuses on assessing whether summaries contain the same information as the source, or whether the information is contradictory. One way to view information is to consider the set of questions that are answerable given a certain passage. If a summary is consistent with the source, then one would expect the set of questions answerable from the summary to overlap with those answerable from the source, and to yield similar answers. Though SpanQAG approaches are similarly motivated, existing span-based frameworks rely on text similarity measures, either in lexical or in representation space. In contrast, we attempt to measure information using multiple-choice questions, which allows for a more abstract understanding of information and enables the convenient use of standard information-theoretic measures. Let $x$ denote the source, $y$ the summary, $q$ a question, and $o$ the options associated with the question $q$. We define the information consistency $\mathrm{IC}(x, y)$ as
$$\mathrm{IC}(x, y) = \int_{q}\int_{o} -\mathrm{KL}\big(P_a(o|q,x)\,\|\,P_a(o|q,y)\big)\, P_g(q, o|y)\, \mathrm{d}o\, \mathrm{d}q \;\approx\; \frac{1}{N}\sum_{i=1}^{N} -\mathrm{KL}\big(P_a(o^{(i)}|q^{(i)},x)\,\|\,P_a(o^{(i)}|q^{(i)},y)\big) \quad (1)$$
where $\{q^{(i)}, o^{(i)}\} \sim P_g(q, o|y)$, and $P_a(o^{(i)}|q^{(i)},x)$ and $P_a(o^{(i)}|q^{(i)},y)$ are the option distributions given the source and the summary respectively; the negative KL-divergence is used to measure distribution similarity. The approximation in Eq. 1 will be referred to as the MQAG-Sum score. Alternatively, it is possible to generate the questions and options $\{q, o\}$ using the source $x$ instead of the summary $y$, i.e. $\{q^{(i)}, o^{(i)}\}$ is sampled from $P_g(q, o|x)$. We will refer to this variant as the MQAG-Src score. MQAG-Src is expected to measure the amount of source information present in the summary, i.e. the informativeness of the summary, since the questions are now derived from the source.
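For concreteness, a minimal sketch of the Monte Carlo approximation in Eq. 1 is given below. This is illustrative only: the function names and the interfaces of the question generation and answering models (`generate_question`, `answer_distribution`) are our own assumptions, not the released MQAG implementation.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete option distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def mqag_sum_score(source, summary, generate_question, answer_distribution, n_questions=50):
    """Monte Carlo estimate of IC(x, y) in Eq. 1 (the MQAG-Sum score).

    generate_question(text) -> (question, options), a sample from P_g(q, o | text)
    answer_distribution(text, question, options) -> probability vector P_a(o | q, text)
    """
    scores = []
    for _ in range(n_questions):
        # For MQAG-Sum the questions/options are drawn from the summary;
        # drawing them from the source instead gives MQAG-Src.
        question, options = generate_question(summary)
        p_src = answer_distribution(source, question, options)
        p_sum = answer_distribution(summary, question, options)
        scores.append(-kl_divergence(p_src, p_sum))
    return float(np.mean(scores))
```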

MQAG Framework Implementation
Question Generation. The multiple-choice questions are generated in two stages. First, model $g_1$ generates the question $q$ and answer $a$; then model $g_2$ generates the distractors $o_{\setminus a}$ given $q$ and $a$:
$$P_g(q, o|y) = P_{g_2}(o_{\setminus a}|q, a, y)\, P_{g_1}(q, a|y) \quad (2)$$
where $o = \{a, o_{\setminus a}\}$. We set the number of options (answer and distractors) to four. Both $g_1$ and $g_2$ are sequence-to-sequence models based on the T5-large architecture (Raffel et al., 2020), fine-tuned on RACE (Lai et al., 2017).
Question Answering. The answering stage contains one model $a$, which uses the Longformer architecture (Beltagy et al., 2020) with a multiple-choice setup similar to Yu et al. (2020). The input to the model is a concatenation of context, question and option. The question answering model is also fine-tuned on RACE. See Appendix A for additional model details.
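A rough sketch of how such a two-stage pipeline could be wired together with HuggingFace transformers is shown below. The checkpoint paths are placeholders (not the released MQAG models), and the output format of the generators (a "<sep>"-separated question/answer string) as well as the way the question and options are concatenated for the answerer are assumptions; consult the released code for the actual implementation.

```python
import torch
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          LongformerTokenizer, LongformerForMultipleChoice)

# Placeholder checkpoints: g1/g2 are assumed to be T5 models fine-tuned on RACE
# for question+answer and distractor generation; the answerer is assumed to be
# a Longformer multiple-choice model fine-tuned on RACE.
g1_tok = AutoTokenizer.from_pretrained("path/to/t5-large-race-question-answer")
g1 = AutoModelForSeq2SeqLM.from_pretrained("path/to/t5-large-race-question-answer")
g2_tok = AutoTokenizer.from_pretrained("path/to/t5-large-race-distractors")
g2 = AutoModelForSeq2SeqLM.from_pretrained("path/to/t5-large-race-distractors")
qa_tok = LongformerTokenizer.from_pretrained("path/to/longformer-race-mc")
qa = LongformerForMultipleChoice.from_pretrained("path/to/longformer-race-mc")

def generate_question_and_options(text, num_options=4):
    # Stage 1: sample a question and its answer from g1
    # (assumed output format: "question <sep> answer").
    ids = g1_tok(text, return_tensors="pt", truncation=True).input_ids
    out = g1.generate(ids, do_sample=True, max_new_tokens=64)
    question, answer = g1_tok.decode(out[0], skip_special_tokens=True).split("<sep>")
    # Stage 2: sample distractors from g2, conditioned on question, answer and text.
    prompt = f"{question} <sep> {answer} <sep> {text}"
    ids = g2_tok(prompt, return_tensors="pt", truncation=True).input_ids
    out = g2.generate(ids, do_sample=True, max_new_tokens=64)
    distractors = g2_tok.decode(out[0], skip_special_tokens=True).split("<sep>")
    return question, [answer] + distractors[: num_options - 1]

def option_distribution(context, question, options):
    # Multiple-choice scoring: one (context, question + option) pair per option,
    # reshaped to (batch=1, num_choices, seq_len) as expected by the model.
    enc = qa_tok([context] * len(options),
                 [f"{question} {opt}" for opt in options],
                 return_tensors="pt", padding=True, truncation=True)
    enc = {k: v.unsqueeze(0) for k, v in enc.items()}
    with torch.no_grad():
        logits = qa(**enc).logits  # shape: (1, num_choices)
    return torch.softmax(logits, dim=-1).squeeze(0).tolist()
```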

Experimental Results
The baseline and MQAG results are shown in Table 1. The first observation is that MQAG-Sum outperforms SpanQAG on all tasks. This illustrates the benefit of directly comparing answer distributions rather than answer spans.
On the Podcast data, both extractive and abstractive summaries need to be assessed. Lexical-based evaluation methods will by default yield high scores for extractive summaries. This bias causes most assessment systems to have a negative correlation with human judgements; this effect is illustrated in the corresponding figure.
MQAG-Src, which assesses how much source information is contained within the summary by generating questions from the source, achieves lower PCCs than MQAG-Sum on all datasets. This finding aligns with our expectation, as the summaries (apart from those of Podcast) were graded by humans predominantly on the consistency aspect (which MQAG-Sum was designed to measure) rather than on the quantity of source information present (which MQAG-Src measures). On Podcast, where human evaluation is a combination of consistency, informativeness and fluency, we observe that the combination of MQAG-Sum and MQAG-Src (via the harmonic mean) yields the best performance (PCC=0.824). The results of the harmonic mean (referred to as MQAG-F1) are shown in Table 5 in Appendix B.4. As it is more challenging to generate questions using the source (due to the larger content space to be explored), MQAG-Src has a higher variance compared to MQAG-Sum.
Choice of Models. Results with alternative question generation and answering models are shown in Table 3. Using a smaller generation model does not result in lower performance. This could be because T5-base has higher perplexity, which could yield more diverse questions. When using RoBERTa, with a shorter input length, the performance on SummEval (where the input length is mostly shorter than 512) remains almost the same. However, as the input length is longer in QAG-XSum/Podcast, we observe a drop in PCC.
Number of Questions (N). We analyse the impact of N, the number of generated questions per summary (see Figure 3 for details). We observe a large improvement as N increases from 1 to 20, and then less significant performance gains as N increases to 50. Though the performance curve has not completely plateaued at N=50, since the computational cost of MQAG scales linearly with N, 50 questions seem to be a reasonable compromise between computational efficiency and performance. An interesting next step would be to investigate whether the same or similar performance can be achieved with as low an N as possible, e.g. by generating a smaller but more diverse set of questions and options.
System Combination. We investigate whether MQAG is complementary to alternative approaches; see Table 5 for detailed results. On QAG-CNNDM and XSum-Factual (where some baselines outperform MQAG), we find that system combination with MQAG further boosts performance, showing that MQAG is complementary. On the other datasets, where MQAG is already the best system, system combination again improves performance and sets new state-of-the-art results.
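As a small illustration of the combination strategies mentioned above, the sketch below computes MQAG-F1 as the harmonic mean of MQAG-Sum and MQAG-Src predictions, and a simple system combination by averaging normalized predictions. The min-max normalization and the averaging rule are our assumptions, as the paper only states that normalized predictions of two systems are combined.

```python
import numpy as np

def min_max_normalize(scores, eps=1e-12):
    """Map per-summary system scores to [0, 1] (assumed normalization)."""
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min() + eps)

def mqag_f1(sum_scores, src_scores, eps=1e-12):
    """MQAG-F1: harmonic mean of MQAG-Sum and MQAG-Src predictions
    (applied after normalization, since raw MQAG scores are negative KL values)."""
    a, b = min_max_normalize(sum_scores), min_max_normalize(src_scores)
    return 2 * a * b / (a + b + eps)

def combine_systems(scores_a, scores_b):
    """System combination: average of the normalized predictions of two systems."""
    return 0.5 * (min_max_normalize(scores_a) + min_max_normalize(scores_b))
```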
Comparing the two distributions. Distributions can be poorly calibrated due to over-confidence in training or domain shift, e.g. RACE → summarization. As an initial investigation, we show in Appendix B.3 that, first, the accuracy drops from around 80% on RACE to around 60% (or below) on the summarization datasets, and second, a performance gain is obtained by applying temperature annealing, as shown in Fig. 4. In addition, we observe that some questions can be answered irrespective of the context (as also observed by Pang et al. (2022)); if the answering system does not leverage the context, it will fail to measure consistency. These findings suggest possible directions for improving the current MQAG: 1) calibration; 2) uncertainty and unanswerability of the answer distribution (Raina and Gales, 2022); and 3) selection of questions and options.
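Below is a minimal sketch of the temperature annealing and the effective number of options N_o discussed here and in Appendix B.3. The annealing formula shown is the standard exponent-based one and is our assumption, since Eq. 3 itself is not reproduced in this extract; the N_o definition follows the Table 4 caption.

```python
import numpy as np

def anneal(probs, T):
    """Temperature annealing of an option distribution (assumed standard form):
    p_i^(1/T), renormalized. T < 1 sharpens the distribution (N_o decreases);
    T > 1 flattens it (N_o increases)."""
    p = np.asarray(probs, dtype=float) ** (1.0 / T)
    return p / p.sum()

def effective_num_options(probs, eps=1e-12):
    """N_o = 2^{H_2(p)}, ranging from 1 to the number of options."""
    p = np.asarray(probs, dtype=float)
    entropy = -np.sum(p * np.log2(p + eps))
    return float(2.0 ** entropy)

p = [0.7, 0.2, 0.05, 0.05]
print(effective_num_options(p))               # ~2.4 of 4 options
print(effective_num_options(anneal(p, 0.5)))  # sharper distribution, ~1.4
```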

Conclusion
This work demonstrates the potential of a novel scheme for assessing information consistency between the source and the summary, based on the distance between multiple-choice answer distributions. The current realization of the framework exploits existing multiple-choice question generation and answering systems, and it is expected to improve as these systems improve, for example in the diversity of the generated questions and the selection of options. The framework may also provide insight into how human assessment of summaries balances faithfulness and information content.

Limitations
Our approach is designed to assess information content, so it may not work well for other aspects of summary evaluation such as fluency or coherence. Our analysis is based on QG and QA systems trained on the RACE dataset, which is collected from English examinations in China; hence, the generated questions and options could be biased towards the style of these examinations.

A More Details on Experimental Setup
A.1 Baselines
ROUGE(x, y) is based on the rouge-score Python package. Note that ROUGE is used to compare the summary against the source. We implement the Entailment system by fine-tuning BERT (bert-large-uncased) (Devlin et al., 2019) on MNLI data (Williams et al., 2018). Given the context, the Entailment model classifies the hypothesis into one of three classes (entail/neutral/contradict). When applied to a summarization dataset, the context is the source document and the hypothesis is the summary. The probability of the entail class is used as the prediction of the Entailment system. OpenIE-triple, BERTScore, and SpanQAG are based on the FactSumm implementation (Heo, 2021) with the following models: Named Entity Recognition (NER), Question Generation (QG), and Question Answering (QA). The trained weights are from HuggingFace as follows: NER=flair/ner-english-ontonotes-fast; QG=mrm8488/t5-base-finetuned-question-generation-ap; QA=deepset/roberta-base-squad2; BERTScore=microsoft/deberta-base-mnli.

A.2 Training QG and QA systems
We train both question generation (QG) models and the question answering (QA) model on the RACE training set, applying early stopping when the performance on the RACE validation set stops improving. The train/validation/test split of RACE is 87866/4887/4934. We use a batch size of 8 for the QG models (T5) and 2 for the QA model (Longformer). The learning rate is set to 1e-6, and we use the Adam optimizer. Training is performed on one NVIDIA A100-80GB GPU. Training the QG model (T5-large) takes around 8 hours, and training the QA model (Longformer-4096) takes up to 2 days. Running MQAG inference (QG=T5-large, QA=Longformer-4096) on one NVIDIA P100 GPU takes around 3 seconds per question.

A.3 Data Statistics
The RACE corpus (train+validation+test) has 97687 examples. The average length of the contexts is 317.8, and the average length of the questions is 11.0. The statistics of summary evaluation datasets are provided in Table 2.

B.2 Number of Generated Questions
We investigate the impact of N . Based on 50 generated questions, we perform bootstrapping using 1000 iterations for each value of N from 1 to 50. We show the mean and the confidence interval (i.e. standard deviation) in Figure 3.
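A sketch of this bootstrapping procedure is shown below (our own illustrative code, assuming the 50 per-question scores for a given summary have already been computed):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_curve(question_scores, max_n=50, iterations=1000):
    """For each N, resample N per-question scores (with replacement) and average
    them; return the mean and standard deviation over the bootstrap iterations."""
    question_scores = np.asarray(question_scores, dtype=float)
    means, stds = [], []
    for n in range(1, max_n + 1):
        samples = rng.choice(question_scores, size=(iterations, n), replace=True)
        estimates = samples.mean(axis=1)
        means.append(estimates.mean())
        stds.append(estimates.std())
    return np.array(means), np.array(stds)
```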

B.3 Domain Shift and Calibration
Here, we investigate whether it is necessary to calibrate the probability distribution over the options, $P_a(o|q, \cdot)$. Because the answering model is trained on RACE, we expect some domain mismatch. In Tab. 4, we report its accuracy on the RACE test set (in-domain) and on the target summary evaluation datasets used in this paper (out-of-domain). We also try applying temperature annealing (Eq. 3).
When $T < 1.0$ the distribution over the options becomes sharper (i.e. $N_o$ decreases), and vice versa for $T > 1.0$. Tab. 4 also reports the confidence score and the effective number of options $N_o$ for each dataset.
Table 4: Performance of the answering model based on the Longformer architecture trained on RACE. Accuracy on the summarization datasets is based on the answers generated by the generation model. The confidence score is the average probability of the predicted option. $N_o$ is the effective number of options, which ranges from 1 to the maximum number of options (e.g. 4): $N_o = 2^{H_2(p)}$, where $H_2(\cdot)$ is the entropy and $p$ is the probability distribution over the options.
Table 5: MQAG-F1 is the harmonic mean of the predictions from MQAG-Sum and MQAG-Src. System combination results are obtained by combining the normalized predictions of two systems. The confidence intervals of the MQAG results are standard deviations obtained via bootstrapping. Underlining denotes the best single system on a particular test set, and the best system combination (or single system) is shown in bold.
Source: A G4S security van has been robbed outside a branch of royal bank of Scotland in Glasgow city centre. Police said three armed men took a five-figure sum from the vehicle in the city's Sauchiehall street on Monday at about 21:45. A spokesman said no-one had been injured although two security guards aged 47 and 49 were left badly shaken. The area around the bank, which is near the Buchanan galleries shopping centre, has been cordoned off by police. Police said the security guards had been making their delivery when they were approached by the three armed men, who threatened them and demanded they hand over a box of money. It is understood the cash taken was in the region of £50,000. Following the robbery, the three men got into a white seat Leon car, which sped off along west Nile street towards the cowcaddens area. [...] Summary: Two security guards have been threatened during a robbery at a bank in Edinburgh.