SWING: Balancing Coverage and Faithfulness for Dialogue Summarization

Missing information is a common issue of dialogue summarization where some information in the reference summaries is not covered in the generated summaries. To address this issue, we propose to utilize natural language inference (NLI) models to improve coverage while avoiding introducing factual inconsistencies. Specifically, we use NLI to compute fine-grained training signals to encourage the model to generate content in the reference summaries that have not been covered, as well as to distinguish between factually consistent and inconsistent generated sentences. Experiments on the DialogSum and SAMSum datasets confirm the effectiveness of the proposed approach in balancing coverage and faithfulness, validated with automatic metrics and human evaluations. Additionally, we compute the correlation between commonly used automatic metrics with human judgments in terms of three different dimensions regarding coverage and factual consistency to provide insight into the most suitable metric for evaluating dialogue summaries.


Introduction
Dialogue summarization is a text generation task that aims to produce a compact summary given a piece of conversation. Conventional approaches to dialogue summarization rely on features of conversation data (Goo and Chen, 2018; Li et al., 2019; Oya et al., 2014). Recently, the rise of large pre-trained language models (LMs) has enabled coherent and fluent summaries to be generated without these features. However, low coverage and factual inconsistency remain two pressing issues: studies have shown that summaries generated by these pre-trained LMs often do not fully cover the reference (Liu and Chen, 2021; Tang et al., 2022) and are often not factually consistent with the inputs (Zhang et al., 2020b; Maynez et al., 2020; Cao and Wang, 2021). If an unfaithful dialogue summarization model with low coverage were deployed for public use, it could spread misinformation and generate misleading content that covers only partial facts of a conversation. Hence, we urgently need a solution that improves coverage without negatively impacting faithfulness for dialogue summarization.

Figure 1: An illustration of how NLI can help determine whether a reference sentence is covered by the generated summary. We compute the entailment probability from each reference sentence (i.e., premise) to each generated sentence (i.e., hypothesis). By taking the max value along the row dimension, the resulting vector denotes the probability that each reference sentence entails a sentence in the generated summary. In this example, the entailment probability for the second reference sentence is low, indicating that this sentence is likely not covered by the generated summary.
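The row-wise max computation illustrated in Figure 1 can be sketched in a few lines; the probability values below are illustrative placeholders, not actual NLI model outputs:

```python
import numpy as np

# Entailment probabilities p_ent(reference_i -> generated_j): one row per
# reference sentence, one column per generated sentence (toy values).
ent = np.array([
    [0.95, 0.02, 0.03],  # reference sentence 1
    [0.04, 0.06, 0.05],  # reference sentence 2
])

# Max over the generated sentences: the probability that each reference
# sentence entails *some* sentence in the generated summary.
coverage = ent.max(axis=1)

tau = 0.5  # entailment threshold
uncovered = [i for i, p in enumerate(coverage) if p < tau]
print(uncovered)  # [1]: reference sentence 2 is likely not covered
```
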
Relatively little work addresses coverage and factual inconsistency for dialogue summarization. Some work addresses the issue of unfaithfulness with a controllable generation framework guided by person named entities (Liu and Chen, 2021) or summary sketches (Wu et al., 2021). Tang et al. (2022) categorize factual inconsistencies in dialogue summarization into different types of errors, such as missing information and wrong reference. Their framework integrates a contrastive loss and a self-supervised loss to reduce multiple types of errors. However, a great portion (> 40%) of their outputs does not cover the full content of the reference summary. Thus, it is important to address coverage and factual consistency synergistically in dialogue summarization. The issue where content in the reference does not occur in the generated summary is known as the missing information issue (Liu and Chen, 2021; Tang et al., 2022). In this work, we aim to mitigate missing information in the summary while remaining faithful to the dialogue.

Figure 2: Illustration of how an entailment-induced bipartite graph is built and how a MIXANDMATCH summary is derived. With the NLI model, we determine which sentences from each summary contain equivalent information by computing the entailment probabilities between pairs of generated sentences and reference sentences, as indicated by the purple edges. Based on the graph, we determine that the generated summary does not cover the first reference sentence and that the first generated sentence is not faithful. Hence, the MIXANDMATCH summary is formed by combining the first reference sentence and the second to the fourth generated sentences.
We propose SWING , Summarizing Dialogue With NLI Guidance. Our approach samples a summary from the model and utilizes natural language inference (NLI) to determine (1) the faithfulness of each generated sentence and (2) whether each reference sentence has been covered by the generated summary. An example is shown in Figure 1. Based on the results computed by NLI, two losses are proposed to encourage the model to generate missing information and distinguish between factually consistent and inconsistent generated sentences.
Our contributions can be summarized as follows:
• We propose SWING, a dialogue summarization framework that effectively addresses missing information through two losses computed using NLI. The first loss encourages the model to recover content missing from the reference summaries. The second loss instructs the model to differentiate between factually consistent and inconsistent generated sentences.
• Our approach achieves the best performance in mitigating missing information on two public dialogue summarization datasets, DIALOGSUM (Chen et al., 2021b) and SAMSUM (Gliwa et al., 2019), as validated by automatic metrics and human judges.
• We measure the correlation of human judgments with conventional and recently developed automatic metrics to provide intuition for future research on evaluating the faithfulness and coverage of dialogue summaries.

Method
Upon analyzing the dialogue summaries in SAMSUM, we observe that dialogues are often summarized linearly, consistent with the findings of Wu et al. (2021). Therefore, we segment the summaries into sentences and use a natural language inference (NLI) model to provide finer-grained training signals at the sentence level for two goals: (1) encourage generating sentences in the reference summaries that have not been covered by the generated sentences and (2) differentiate factually consistent generated sentences from inconsistent ones. To achieve these goals, we first determine the faithfulness of each sentence using an entailment-induced bipartite graph (§2.1). Then, we propose two new losses addressing each challenge in turn: an Uncovered Loss that encourages the model to recover missing information (§2.2) and a Contrastive Loss that brings closer the representations of the reference summary and the generated sentences that contain equivalent information to some sentences in the reference summary (§2.3). For the rest of this paper, we use reference sentence and generated sentence to refer to a sentence in the reference summary and the generated summary, respectively.

Entailment-induced Bipartite Graph
To determine which reference sentence has not been covered by the generated summary and which generated sentence is not faithful to the reference summary, we construct a bipartite graph that links sentences between a reference summary and a generated summary. An edge indicates the linked sentences contain equivalent information. If no edge connects to a reference sentence, we consider this sentence not covered by the generated summary. Similarly, if a generated sentence is not linked in the bipartite graph, this sentence is likely not faithful to the reference summary. We use the entailment probabilities computed by an NLI model to determine whether a pair of sentences contain equivalent information. The procedure of constructing the bipartite graph is shown in Algorithm 1.
The NLI model takes in two sentences, a premise (P) and a hypothesis (H), and computes whether P entails, contradicts, or is neutral to H. Here, we only focus on the entailment probability from the i-th reference sentence to the j-th generated sentence, p_ent(s*_i, s_j). We use the ROBERTA-LARGE model 2 trained on the MNLI dataset, which achieves an accuracy of around 91%, on par with the performance of state-of-the-art models. Let ϕ(i, j) denote the mapping between the i-th reference sentence and the j-th generated sentence: ϕ(i, j) = 1 if a link exists between s*_i and s_j; otherwise, ϕ(i, j) = 0. We first consider a simplified setting by assuming each reference sentence can be mapped to at most one generated sentence, and vice versa (i.e., 0 ≤ Σ_j ϕ(i, j) ≤ 1). In this setting, we can determine whether two sentences contain equivalent information by checking the entailment relation in both directions (lines 26-27).
Here, τ is a hyperparameter that indicates the entailment threshold. However, one reference sentence may contain information equivalent to multiple generated sentences (one-to-many mappings) and vice versa (many-to-one mappings). In Figure 2, for example, the second reference sentence contains information equivalent to the second and the third generated sentences combined. This relation cannot be discovered if we only check the entailment relation between pairs of individual sentences. Therefore, we must resolve one-to-many and many-to-one mappings before checking one-to-one mappings.
To find one-to-many mappings, for every reference sentence s*_i, we look for runs of consecutive generated sentences with indices {j, ..., j + k} that are each entailed by s*_i (lines 6-8). We only check consecutive sentences based on our previous observation that dialogues are often summarized linearly. For every match, we concatenate the generated sentences s_{j:j+k} = {s_j, s_{j+1}, ..., s_{j+k}} and check whether s_{j:j+k} entails the reference sentence s*_i (lines 8-9). If the entailment holds, we let ϕ(i, m) = 1 for all m ∈ {j, ..., j + k} (lines 11-12). The same approach is used to address many-to-one mappings (lines 14-22). Following Algorithm 1, a bipartite graph is built between the generated summary and the reference summary. Henceforth, we denote the reference sentences that have not been covered as S* = {s*_i | ϕ(i, j) = 0 ∀j} and the generated sentences that can be mapped to some reference sentence as S = {s_j | ∃i ϕ(i, j) = 1}.
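A compact sketch of the mapping procedure in Algorithm 1 follows. The function and variable names are our own, the stand-in `p_ent` is a lookup table rather than a real NLI model, and we assume (our reading of lines 6-8) that each sentence in a candidate window must itself be entailed by the reference sentence before the concatenation is checked:

```python
def build_bipartite(ref, gen, p_ent, tau=0.5):
    """Link reference and generated sentences that carry equivalent info."""
    phi = set()  # edges (i, j) between ref sentence i and gen sentence j

    # Resolve 1-to-many mappings: one reference sentence whose content is
    # spread over a run of consecutive generated sentences.
    for i, r in enumerate(ref):
        for j in range(len(gen)):
            for k in range(1, len(gen) - j):
                window = gen[j:j + k + 1]
                if (all(p_ent(r, s) >= tau for s in window)
                        and p_ent(" ".join(window), r) >= tau):
                    for m in range(j, j + k + 1):
                        phi.add((i, m))

    # Resolve many-to-1 mappings symmetrically.
    for j, g in enumerate(gen):
        for i in range(len(ref)):
            for k in range(1, len(ref) - i):
                window = ref[i:i + k + 1]
                if (all(p_ent(g, s) >= tau for s in window)
                        and p_ent(" ".join(window), g) >= tau):
                    for m in range(i, i + k + 1):
                        phi.add((m, j))

    # 1-to-1 mappings: entailment must hold in both directions.
    for i, r in enumerate(ref):
        for j, g in enumerate(gen):
            if p_ent(r, g) >= tau and p_ent(g, r) >= tau:
                phi.add((i, j))

    uncovered = [i for i in range(len(ref))
                 if not any((i, j) in phi for j in range(len(gen)))]
    faithful = [j for j in range(len(gen))
                if any((i, j) in phi for i in range(len(ref)))]
    return phi, uncovered, faithful

# Toy entailment scores (premise, hypothesis) -> probability; unseen pairs
# default to 0. A real system would query an MNLI-trained NLI model here.
scores = {("r2", "g2"): 0.9, ("r2", "g3"): 0.9,  # r2 entails each window item
          ("g2 g3", "r2"): 0.9}                  # concatenation entails r2
p_ent = lambda premise, hypothesis: scores.get((premise, hypothesis), 0.0)

phi, uncovered, faithful = build_bipartite(["r1", "r2"],
                                           ["g1", "g2", "g3"], p_ent)
print(uncovered, faithful)  # r1 is uncovered; g2 and g3 are faithful
```
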

Uncovered Loss
The objective of the uncovered loss is to encourage the model to generate information from the reference summary that the generated summary has not covered. To this end, we train the model with MIXANDMATCH summaries, which are constructed by combining reference sentences that are not covered by the generated summary and generated sentences that contain information equivalent to some of the reference sentences. An example is shown in Figure 2.
The MIXANDMATCH summary Ŝ is constructed by taking the union of S and S* and sorting the sentences by their index. The uncovered loss is effectively maximum likelihood estimation (MLE) with MIXANDMATCH summaries as the decoding targets:

L_UNCOV = − Σ_t log p(Ŝ_t | Ŝ_{<t}, D),     (3)

where D is the original dialogue and Ŝ_t denotes the t-th token in the MIXANDMATCH summary.
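The construction of a MIXANDMATCH summary can be sketched as follows; the helper name and the choice of joining sentences with spaces are illustrative assumptions:

```python
def mix_and_match(ref, gen, uncovered_ref_idx, faithful_gen_idx):
    """Combine uncovered reference sentences (S*) with faithful generated
    sentences (S), ordered by sentence index. Because dialogues are often
    summarized linearly, positions in the two summaries roughly align."""
    pool = [(i, ref[i]) for i in uncovered_ref_idx] + \
           [(j, gen[j]) for j in faithful_gen_idx]
    return " ".join(s for _, s in sorted(pool, key=lambda p: p[0]))

# Figure 2's case: the first reference sentence is uncovered and the
# first generated sentence is unfaithful.
ref = ["R1.", "R2."]
gen = ["G1.", "G2.", "G3.", "G4."]
mixed = mix_and_match(ref, gen, [0], [1, 2, 3])
print(mixed)  # "R1. G2. G3. G4."
```
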
The main advantage of constructing MIXANDMATCH summaries over other positive sample construction approaches, such as back translation and paraphrasing, lies in two desired properties of this formulation. First, the model already has a high probability of generating the sentences in S. Therefore, the loss function (Equation (3)) does not penalize the model much for generating these sentences. Second, the penalty for generating the sentences in S* is larger since the model has a lower probability of generating those sentences.

Contrastive Loss
In the early stage of our experiments, our original goal was to discourage the model from generating factually inconsistent sentences. We adopted unlikelihood training (Welleck et al., 2020) to decrease the probability of sampling these sentences from the model. However, we found that this objective causes the model to generate nonsensical sequences. This phenomenon was also observed when we experimented with CONSEQ (Nan et al., 2021), which also incorporates such a loss function into its training process, as shown in §4.1. We hypothesize that it results from the fact that sentences in dialogue summaries share similar structures; hence, the unlikelihood training objective confuses the model. Instead, we pivoted our focus to differentiating factually consistent sentences from their inconsistent counterparts with the proposed contrastive loss. For each summary, we use the factually inconsistent sentences as negative samples (i.e., s_j ∉ S) and the consistent sentences as positive samples (i.e., s_j ∈ S). The contrastive learning objective takes a form similar to the InfoNCE loss (Oord et al., 2018):

L_CL = − Σ_{s_j ∈ S} log [ exp(cos(h_j, h_S*)) / Σ_{s_i} exp(cos(h_i, h_S*)) ],     (4)

where h_i and h_j denote the representations of the generated sentences, h_S* denotes the representation of the reference summary, and cos(·, ·) denotes cosine similarity. The main difference between our contrastive objective and other work (Cao and Wang, 2021; Tang et al., 2022) is granularity: Equation (4) operates at the sentence level rather than the summary level and therefore provides finer-grained training signals.
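A numpy sketch of the sentence-level contrastive objective, under the simplifying assumptions that the partition function sums over all generated-sentence representations and that no temperature scaling is applied:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_loss(h_gen, h_ref, positive_idx):
    """Average InfoNCE-style loss: each factually consistent sentence
    (a positive, s_j in S) should be closer to the reference-summary
    representation h_ref than the inconsistent sentences are."""
    sims = np.array([cosine(h, h_ref) for h in h_gen])
    log_partition = np.log(np.exp(sims).sum())  # over all generated sentences
    return float(np.mean([log_partition - sims[j] for j in positive_idx]))

h_ref = np.array([1.0, 0.0])       # reference summary representation
h_gen = np.array([[1.0, 0.1],      # consistent sentence (close to h_ref)
                  [0.0, 1.0]])     # inconsistent sentence (far from h_ref)

# Treating the consistent sentence as the positive yields a lower loss.
print(contrastive_loss(h_gen, h_ref, [0]) < contrastive_loss(h_gen, h_ref, [1]))
```
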

Training
The final loss function that our model is optimized with is a weighted sum of the two aforementioned losses and MLE:

L = L_MLE + α L_UNCOV + β L_CL,

where L_MLE is the standard maximum likelihood loss computed with the reference summaries as the decoding targets.


Experiments

Metrics
Our evaluation focuses on measuring the factual consistency of the summarization models, particularly with respect to the missing information challenge. Therefore, we adopt recently developed metrics that have been shown to correlate well with human judgments in terms of faithfulness. BARTScore (Yuan et al., 2021) computes the semantic overlap between the generated summary and the reference summary by calculating the log probability of generating each summary conditioned on the other. Since our goal is to assess how well the model reduces information missing from the reference summary, we consider the Recall (R) setting, where we assess p(S*|S, θ), the likelihood of generating the reference summary S* given the generated summary S. FactCC (Kryscinski et al., 2020) is an entailment-based metric that predicts the faithfulness probability of a claim w.r.t. the source texts. Similar to BARTScore, we use FactCC in the Recall setting, where the claim is a reference sentence and the source text is the generated summary. We report the CORRECT probability averaged over the sentences of each generated summary and then over the test set.
In addition, we report the ROUGE-L metric (Lin, 2004), which has also been shown to better reflect faithfulness than ROUGE-1 and ROUGE-2 (Pagnoni et al., 2021). For these metrics, we also consider the F1 setting, where we compute each metric in the reverse direction (S* → S) and then average over both directions, to validate that the model is not generating too much redundant information. Finally, two recently introduced QA-based metrics that closely approximate human judgments of factuality, QUALS (Nan et al., 2021) and QAFACTEVAL (Fabbri et al., 2022a), are also used for evaluation.
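ROUGE-L's recall and F1 settings reduce to a longest-common-subsequence computation; a dependency-free sketch follows (whitespace tokenization and no stemming, unlike standard ROUGE implementations):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if x == y
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[-1][-1]

def rouge_l(generated, reference):
    """ROUGE-L recall (coverage of the reference) and F1."""
    gen, ref = generated.split(), reference.split()
    lcs = lcs_len(gen, ref)
    recall = lcs / len(ref)     # how much of the reference is covered
    precision = lcs / len(gen)
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return recall, f1

r, f1 = rouge_l("amy will meet bob at noon",
                "amy will meet bob at noon tomorrow")
print(round(r, 3), round(f1, 3))  # recall 6/7, F1 12/13
```
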

Implementation Details
We choose BART (Lewis et al., 2020) as the backbone seq2seq model as it has demonstrated better dialogue summarization performance than other pre-trained language models (Tang et al., 2022), such as PEGASUS (Zhang et al., 2020a) and T5 (Raffel et al., 2020). The proposed models are optimized using AdamW (Loshchilov and Hutter, 2019) with learning rate 3e-5 and weight decay 1e-3. The maximum input sequence length is set to 1024. For all baseline models, we use the best hyper-parameters reported in their papers. We fix τ to be 0.5 throughout all our experiments. α and β are both 1.0.

Baselines
We compare SWING with the following competitive baseline systems. TextRank (Mihalcea and Tarau, 2004) is a graph-based ranking algorithm that performs extractive summarization. BART (Lewis et al., 2020) is a seq2seq language model pre-trained on various denoising objectives. CTRL-DIASUMM (Liu and Chen, 2021) and CODS (Wu et al., 2021) are controllable generation frameworks that generate summaries guided by person named entities and summary sketches, respectively. CLIFF (Cao and Wang, 2021) and CONFIT (Tang et al., 2022) improve faithfulness with contrastive learning, except that CONFIT is optimized with an additional self-supervised loss that aims to reduce reference errors. CONSEQ (Nan et al., 2021) combines MLE with unlikelihood training on negative samples determined by QUALS. BART-LARGE is used across all experiments that involve pre-trained language models for fair comparison.

Table 1 summarizes the main results on DIALOGSUM and SAMSUM. SWING outperforms previous approaches on almost all metrics, especially recall measures. This result reflects that the proposed approach generates summaries that cover more content of the reference summaries, both lexically and semantically. One interesting observation is the deficient performance of CONSEQ on both datasets. We hypothesize that the poor performance was caused by the use of the unlikelihood training objective in their loss, as mentioned in §2.3. Since sentences of dialogue summaries often share similar structures, adopting such an objective could confuse the model. We verified this hypothesis with a small experiment: training BART-LARGE with MLE and negative samples determined by QUALS, similar to CONSEQ. The resulting model also produces significantly lower performance than training with MLE alone. This finding confirms that the poor performance of CONSEQ is caused by the unlikelihood training and that such a loss function is unsuitable for dialogue summarization.

Human Evaluation
To further validate the effectiveness of SWING, we use Amazon's Mechanical Turk (AMT) to recruit workers to conduct human evaluations of three methods: CLIFF, CONFIT, and SWING. We sampled 100 dialogues from the test sets of DIALOGSUM and SAMSUM, respectively. For each dialogue, human judges are presented with a pair of summaries produced by two different approaches and asked to select the better one with respect to three dimensions. RECALL assesses the portion of information in the reference summary covered by the generated summary. PRECISION considers whether all the content in the generated summary occurs in the reference summary. FAITHFULNESS examines whether the generated summary is factually consistent with the dialogue. "Tie" is selected if the judges consider the two summaries to be of equal quality. The final score of each system is calculated as the percentage of comparisons in which the system is selected as the better one minus the percentage in which it is selected as the worse one. To evaluate the annotation quality, we compute the inter-annotator agreement. The average Cohen's Kappa (Cohen, 1960) is 54.35%, indicating moderate agreement. Details of the human evaluation setup can be found in Appendix B.
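The pairwise scoring and the agreement statistic can be sketched as follows; the label names and the unweighted kappa formulation are our simplifying assumptions:

```python
def pairwise_score(judgments):
    """Final score for a system: fraction of comparisons in which it is
    preferred minus the fraction in which the other system is preferred
    ('tie' contributes to neither)."""
    wins = judgments.count("win")
    losses = judgments.count("loss")
    return (wins - losses) / len(judgments)

def cohens_kappa(a, b):
    """Agreement between two annotators' label lists, corrected for the
    agreement expected by chance."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    labels = set(a) | set(b)
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

score = pairwise_score(["win", "win", "tie", "loss"])
kappa = cohens_kappa(["w", "w", "t", "l"], ["w", "t", "t", "l"])
print(score, round(kappa, 3))
```
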
The human evaluation results are shown in Figure 3. We have the following observations. First, SWING achieves the highest RECALL scores on both datasets, indicating that our approach is the best at addressing the missing information issue for dialogue summarization. Second, while SWING does not score the highest on PRECISION, it achieves the highest scores on FAITHFULNESS. This implies that even though our approach often generates summaries with extra information, the additional content is likely still faithful to the input. To measure the amount of additional information produced, we compute the average number of tokens per summary for each model. As seen in Table 3, the summaries generated by SWING are only slightly longer than those produced by CLIFF and CONFIT. This suggests that SWING achieves significantly higher faithfulness and coverage than CLIFF and CONFIT while maintaining conciseness.

Summary texts from Table 2:

"Hilary has the keys to the apartment. Benjamin wants to get them and go take a nap. Hilary is having lunch with some French people at La Cantina. Hilary is meeting them at the entrance to the conference hall at 2 pm. Benjamin and Elliot might join them. They're meeting for the drinks in the evening."

"Benjamin, Elliot, Daniel and Hilary will meet at La Cantina at 2 pm to have lunch with some French people who work on the history of food in colonial Mexico. They will try to avoid talking about their subject of research."

"Hilary has the keys to Benjamin, Elliot and Daniel's apartment. They will meet at the entrance to the conference hall at 2 pm and go to La Cantina for lunch with some French people who work on the history of food in colonial Mexico."

Table 2: Qualitative analysis on the outputs of SWING and CONFIT. The two rows demonstrate the missing details and the missing sentences issues of the summaries generated by CONFIT, respectively. The extra information in the outputs of CONFIT that also occurs in the reference summaries is highlighted in blue. In both cases, SWING is able to cover more content presented in the reference summaries.

Qualitative Analysis
To provide better insight into the effectiveness of the proposed method, we conduct a qualitative analysis using the 100 dialogues randomly sampled from the SAMSUM dataset. Specifically, we further categorize missing information errors into two sub-types: (1) missing details, where partial information of a sentence in the reference summary is missing from the generated summary, and (2) missing sentences, where the model fails to generate an entire sentence of the reference summary. An example of each sub-type is shown in Table 2. Comparing the test set outputs of CONFIT and SWING, we find 10 improved cases with fewer missing details and 6 cases where missing sentences are mitigated by SWING. Meanwhile, our proposed approach introduces missing details errors in only 1 example and missing sentences errors in only 2 examples. This implies that our approach is effective in alleviating both sub-types of missing information errors and is particularly advantageous in reducing missing details errors.

Correlation with Human Judgements
Although recently proposed metrics have been shown to correlate highly with human judgments of factuality on news summarization (Kryscinski et al., 2020; Yuan et al., 2021), no previous work has studied whether these metrics transfer to dialogue summarization. We seek to answer this question by computing the correlation of the automatic metrics in Table 1 with the human annotations discussed in §4.2, using Kendall's Tau (Kendall, 1938) as the correlation measure. The results are summarized in Table 4. We observe that: (1) BARTSCORE R is the most consistent and reliable metric across the three dimensions. It performs the best on RECALL on both datasets, indicating that BARTSCORE R is most suitable for measuring how well a model resolves the missing information issue in dialogue summarization.
(2) Although a large number of invalid questions and answers are generated, QUALS is the best metric for assessing PRECISION overall.
(3) FACTCC F and FACTCC R are two of the worst metrics in general. This could be explained by the fact that FACTCC constructs negative samples with semantically variant transformations, which may not be comprehensive enough to cover all cases; hence the poor transferability of FACTCC to these two datasets.
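For reference, Kendall's Tau over paired metric and human scores can be sketched as follows (the simple tau-a variant; scipy's `kendalltau` implements tau-b, which additionally corrects for ties, a common situation with tie-heavy human judgments):

```python
def kendalls_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Toy example: metric scores vs. human scores over four summaries,
# with one swapped pair out of six.
tau = kendalls_tau([1, 2, 3, 4], [1, 2, 4, 3])
print(tau)  # (5 - 1) / 6
```
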

Remaining Challenges
We analyzed the remaining errors by comparing 100 generated summaries with the corresponding reference summaries on the SAMSUM dataset, using the categories of factual errors defined in Tang et al. (2022). The results are shown in Figure 4. We observe that missing information still accounts for the largest portion of factual errors, even though our approach significantly exceeds prior methods in mitigating this issue. This reflects that the issue is challenging to tackle and that substantial room remains for further reducing missing information. As a comparison, we manually inspected the outputs of BART-LARGE on the same 100 dialogues and found 42 cases where information is missing from the generated summaries. This observation further confirms the effectiveness of SWING in addressing insufficient coverage. In addition, redundant information is another major source of errors. Although we have shown in §4.2 that the additional information generated by SWING is likely still faithful to the input dialogue, compactness is one of the important qualities of a summary. This could be improved by using NLI to guide the model away from generating extra information. Other common mistakes are wrong reference and object errors, both of which can be addressed with the self-supervised loss discussed in Tang et al. (2022). 3

Related Work
Dialogue Summarization Early work on dialogue summarization focuses on the AMI meeting corpus (McCowan et al., 2005) due to the lack of dialogue summarization data. These studies enhance summarization performance by leveraging features of conversational data, such as dialogue acts (Goo and Chen, 2018), visual features (Li et al., 2019), and the relationships between summary and dialogue (Oya et al., 2014). Later, Gliwa et al. (2019) released the SAMSUM dataset, the first large-scale dialogue summarization dataset, enabling abstractive summarization research on casual chat dialogues. With the rise of large language models (LMs), recent work focuses on improving the controllability of sequence-to-sequence models built upon large LMs. For instance, Wu et al. (2021) utilize a summary sketch to control the granularity of the generated summary. Liu and Chen (2021) condition the generator on person named entities to control which people to include in the generated summary. Chan et al. (2021) improve controllability by formulating the summarization task as a constrained Markov Decision Process.
Factual Consistency Enhancement While factuality has been widely explored in the field of fact-checking and fake news detection (Thorne et al., 2018;Wadden et al., 2020;Huang et al., 2022b;Shu et al., 2018;Pan et al., 2021;Huang et al., 2022a), factual inconsistency remains a major challenge for abstractive summarization. One line of work attempts to improve the faithfulness of the generated summary with a separate correction model that corrects the errors made by the summarization model Cao et al., 2020;Fabbri et al., 2022b) or directly fix factual inconsistencies in the training data (Adams et al., 2022). Another line of work employs auxiliary loss functions to improve models' representations or discourage the model from generating unfaithful outputs (Cao and Wang, 2021;Chen et al., 2021a;Nan et al., 2021;Tang et al., 2022). The main advantage of these approaches is the efficiency in inference time.
Some studies have attempted to use NLI to detect factual inconsistency in generated summaries. Early approaches rely on out-of-the-box NLI models, which did not yield satisfactory results (Falke et al., 2019). Barrantes et al. (2020) improve the detection accuracy by using an NLI model fine-tuned on the Adversarial NLI dataset (Nie et al., 2020). Laban et al. (2022) address the mismatch in input granularity between NLI datasets and inconsistency detection by passing sentence pairs as inputs instead of document-summary pairs. Kryscinski et al. (2020) and Yin et al. (2021) train document-sentence entailment models to address the granularity mismatch issue. Utama et al. (2022) introduce a controllable generation framework that generates document-level NLI training data for identifying factual inconsistency. Our work leverages an NLI model to guide the dialogue summarization model to recover missing information.

Conclusion
We have proposed SWING, a dialogue summarization framework that generates summaries with less missing information and improved faithfulness. To instruct the model to generate content missing from the reference summaries and to differentiate factually consistent generated sentences from their inconsistent counterparts, we propose two losses based on NLI. Experimental results on the DIALOGSUM and SAMSUM datasets show that our approach achieves significantly higher faithfulness and coverage than prior methods while still maintaining conciseness. In addition, we measure the correlation between the reported automatic metrics and human judgments to provide insight into the most suitable metric for evaluating the coverage and factuality of dialogue summaries in future research.

Ethical Considerations
We acknowledge that the use of large language models pre-trained on the Web could lead to biased outputs. We did find that our model may sometimes generate incorrect pronouns for gender-neutral names. For example, in Figure 1, Charlee is referred to as male in the generated summary, while Charlee is actually female, as shown in the reference summary. Such an issue is often caused by under-specified context (e.g., Charlee's gender is not mentioned in the input dialogue). Fortunately, we found that such errors account for < 1% of the total outputs of our framework, and the issue can be largely alleviated when enough context is provided.

Limitations
While our proposed approach is effective in mitigating missing information, the issue is still far from resolved, as shown in Figure 4. Significant effort is needed to ensure dialogue summarization models produce completely factual content. In addition, our method works because most of the reference summaries in the two datasets we used are faithful to the corresponding dialogues. The proposed method may not work on other summarization datasets, such as XSum, in which about 70% of the reference summaries contain hallucinations (Maynez et al., 2020).

A Dataset Statistics
We present the detailed statistics of DIALOGSUM and SAMSUM in Table 5.

B Human Evaluation Details
In this section, we describe the details of our human evaluation. We recruit AMT workers from the United States to ensure language fluency. Qualification requirements are set such that only workers who have an acceptance rate greater than 99% and more than 10,000 accepted HITs are allowed to work on our annotation task.
To further ensure annotation quality, we conducted two rounds of annotations. In the first round, we launched 100 HITs to select high-quality annotators; 8 qualified annotators were selected to conduct the remaining evaluation in the second round. We set the reward to $0.8 per HIT to encourage experienced annotators to participate. Our annotation interface is displayed in Figure 5. For each HIT, annotators are provided with a piece of dialogue and a corresponding reference summary, as well as two summaries generated by different systems, shown on the left segment of the interface. Based on the summaries and the dialogue, annotators answer three questions shown on the right segment of the interface, corresponding to RECALL, PRECISION, and FAITHFULNESS. They need to determine which summary is better with regard to each prompt.

C Comparison with Other Data Augmentation Methods
We compared our MIXANDMATCH summary construction technique with other data augmentation methods, including back translation (BACKTRANSLATE) and paraphrasing (PARAPHRASING). For back translation, we use mBART-50 (Tang et al., 2020) to translate a summary from English to German and then back to English. For paraphrase generation, we use an open-source package. 4 The experimental results are summarized in Table 6. Training with MIXANDMATCH summaries achieves the highest scores on most metrics, indicating that our proposed method is the most effective in improving the factuality of the generated summaries.

D Hardware and Software configurations
All experiments are conducted on a Linux machine with NVIDIA V100. We use PyTorch 1.11.0 with CUDA 10.1 as the Deep Learning framework and utilize Transformers 4.19.2 to load all pre-trained language models.

E Validation Set Performance
We report the validation set performance of our proposed model in Table 7.

F Number of Parameters
We do not introduce additional parameters to the backbone language model, BART-LARGE. At training time, the number of parameters equals the sum of the numbers of parameters in BART-LARGE and ROBERTA-LARGE. At inference time, since the NLI component is not needed, the number of parameters is the same as that of BART-LARGE.