UMSE: Unified Multi-scenario Summarization Evaluation

Summarization quality evaluation is a non-trivial task in text summarization. Contemporary methods can be mainly categorized into two scenarios: (1) reference-based: evaluating against a human-labeled reference summary; (2) reference-free: evaluating the consistency of the summary with the document. Recent studies mainly focus on one of these scenarios and explore training neural models built on PLMs to align with human criteria. However, the models for different scenarios are optimized individually, which may result in sub-optimal performance since they neglect the shared knowledge across scenarios. Besides, designing individual models for each scenario causes inconvenience to users. Motivated by this, we propose the Unified Multi-scenario Summarization Evaluation Model (UMSE). More specifically, we propose a perturbed prefix tuning method to share cross-scenario knowledge and use a self-supervised training paradigm to optimize the model without extra human labeling. UMSE is the first unified summarization evaluation framework able to operate in three evaluation scenarios. Experimental results across three typical scenarios on the benchmark dataset SummEval indicate that UMSE achieves comparable performance with several existing strong methods that are specifically designed for each scenario.


Introduction
Quantitatively evaluating the quality of a generated summary is a non-trivial task that measures the performance of a summarization system (Lin, 2004; Ng and Abrecht, 2015; Zhang et al., 2020; Scialom et al., 2021), and can also serve as a reward model that provides an additional training signal for the summarization model (Wu and Hu, 2018; Narayan et al., 2018; Scialom et al., 2019; Gao et al., 2019a, 2020a). The dominant evaluation methods are traditional word-overlap-based metrics like ROUGE (Lin, 2004) and BLEU (Papineni et al., 2002). Although these metrics are very easy to use, they cannot evaluate semantic similarity. In recent years, many researchers have focused on semantic-based evaluation tools (Ng and Abrecht, 2015; Zhang et al., 2020; Zhao et al., 2019). Different from traditional metrics, which use a single score to measure the quality of the summary, Zhong et al. (2022) propose to evaluate summary quality along several dimensions (e.g., coherence, consistency, and fluency) by calculating the similarity between the generated summary and the human-annotated summary.
Summarization evaluation methods can be categorized into two scenarios based on the input data type: (1) reference-based methods require a human-annotated summary as input, and (2) reference-free methods only use the corresponding document. Reference-based methods (Lin, 2004; Papineni et al., 2002; Banerjee and Lavie, 2005; Ng and Abrecht, 2015; Zhang et al., 2020; Zhao et al., 2019; Yuan et al., 2021) usually use the human-written summary (a.k.a. reference summary) as the ground truth and calculate the similarity between the generated and reference summaries. With the help of pre-trained language models, these methods have a powerful ability to measure semantic similarity. However, not all real-world application scenarios have human-annotated summaries, and using a reference-based evaluation method with a human-annotated ground-truth summary is labor-consuming. Thus, reference-free methods (Wu et al., 2020; Gao et al., 2020b; Scialom et al., 2019, 2021) propose to evaluate the summary by modeling the semantic consistency between the generated summary and the document.
When evaluating a summarization system, even though we can individually select a proper evaluator conditioned on whether we have a reference summary, this is not very convenient. Moreover, since human annotation is costly, some summarization methods (Wu and Hu, 2018; Narayan et al., 2018; Scialom et al., 2019) use an automatic evaluator to provide an additional training signal instead of relying entirely on human-labeled document-summary pairs. In this type of usage, the evaluator needs to measure the quality of the model-generated summary with only partial human-labeled document-summary data. Besides, contemporary trainable evaluation models for the different scenarios (with or without a reference summary) are built on pre-trained language models, which may transfer knowledge across scenarios and provide a great opportunity to bridge these evaluation scenarios with the best of both worlds. Hence, it is valuable to build a unified multi-scenario summarization evaluator that can process both types of input data. This naturally leads to two questions: (1) How to build a unified multi-scenario evaluation model regardless of whether we have a reference summary? (2) How to train the evaluator so that it shares knowledge between scenarios while maintaining the exclusive knowledge of each specific task?
In this paper, we propose a unified multi-scenario summarization evaluation method, the Unified Multi-scenario Summarization Evaluation Model (UMSE). UMSE unifies three typical summary quality evaluation scenarios in one model: (1) Sum-Ref: evaluate using the reference summary. UMSE measures the similarity between the generated summary and the human-annotated reference summary.
(2) Sum-Doc: evaluate using the document. Since obtaining a reference summary is labor-consuming, UMSE can measure the consistency between the generated summary and the original document. (3) Sum-Doc-Ref: evaluate using both the document and the reference summary. This scenario incorporates the advantages of Sum-Ref and Sum-Doc. To process these different types of input, we propose a perturbed prefix method based on prefix tuning (Li and Liang, 2021; Liu et al., 2021, 2022) that shares a unified pre-trained language model across the three scenarios, using different continuous prefix tokens in the input to identify the scenario. Then, we propose two hard negative sampling strategies to construct a self-supervised dataset to train UMSE without additional human annotation. Finally, we propose an ensemble paradigm that combines these scenarios into a unified user interface.
To sum up, our UMSE brings the following benefits: • One model adaptable to multiple scenarios. UMSE uses only one model to evaluate the generated summary regardless of whether a reference summary is available.
• Mutually enhanced training.We propose a perturbed prefix method to transfer knowledge between scenarios, and it can boost the performance of each scenario.
• Self-supervised. UMSE can be trained in a fully self-supervised paradigm without requiring any human-labeled data, which gives UMSE strong generalization ability.
To verify the effectiveness of UMSE, we compare it with several baselines, including reference-based and reference-free methods. Specifically, UMSE outperforms all the strong reference-free evaluation methods by a large margin and achieves performance comparable to the state-of-the-art in a unified model. Ablation studies verify the effectiveness of our proposed perturbed prefix-tuning method.
Related Work

Reference-free Metrics
Reference-free metrics aim to evaluate summary quality without a human-labeled ground-truth summary as the reference. These methods can be categorized into two types: trained models and training-free models. Among the training-free methods, SUPERT (Gao et al., 2020b) first extracts salient sentences from the source document to construct a pseudo reference, then computes semantic similarity to obtain the evaluation score. Following SUPERT, Chen et al. (2021) propose a centrality-weighted relevance score and a self-referenced redundancy score; while computing the relevance score, the sentences of the pseudo reference are weighted by centrality, the importance of each sentence. Among the trained methods, LS-Score (Wu et al., 2020) is an unsupervised contrastive learning framework consisting of a linguistic quality evaluator and a semantic informativeness evaluator. The question-answering paradigm is also commonly used: it evaluates the factual consistency between summary and document with the help of well-trained question-answering models (Scialom et al., 2019; Gao et al., 2019b; Durmus et al., 2020; Scialom et al., 2021).

Reference-based Metrics
Reference-based metrics, which evaluate the quality of a summary by measuring its similarity to a human-written reference, can be divided into two categories: lexical overlap-based metrics and semantic-based metrics. ROUGE (Lin, 2004), the most commonly used metric for summary evaluation, measures the number of matching n-grams between the system output and the reference summary. Other popular lexical overlap-based metrics are BLEU (Papineni et al., 2002) and METEOR (Banerjee and Lavie, 2005), which are also commonly employed in other text generation tasks (e.g., machine translation). Since using lexical overlap to measure quality is sometimes too strict, many researchers have turned to semantic-based evaluation. ROUGE-WE (Ng and Abrecht, 2015) improves ROUGE by using Word2Vec (Mikolov et al., 2013) embeddings, and S3 (Peyrard et al., 2017) takes ROUGE and ROUGE-WE as input features and is trained on human-annotated datasets. With the prosperity of pre-trained language models (PLMs), more and more researchers have introduced these models for evaluation. BERTScore (Zhang et al., 2020) leverages the contextual embeddings of BERT (Devlin et al., 2019) and calculates the cosine similarity between the system output and the reference sentence. CTC (Deng et al., 2021) is based on information alignment along two dimensions: consistency and relevance. UniEval (Zhong et al., 2022) is a multi-dimensional evaluator based on T5 (Raffel et al., 2020); it formulates summary evaluation as a binary question-answering task and evaluates four dimensions: coherence, consistency, fluency, and relevance. However, existing summarization evaluation models usually focus on measuring summary quality from multiple aspects and transferring knowledge from PLMs; they ignore the shareable knowledge between different scenarios.
Evaluating the quality of generated text is also a crucial task in other generation settings. In machine translation evaluation, Wan et al. (2022) propose UniTE, a multi-scenario evaluation method. UniTE employs monotonic regional attention to conduct cross-lingual semantic matching and proposes a translation-oriented synthetic training data construction method. However, the summarization task does not share these characteristics, and directly applying UniTE to summarization evaluation cannot measure important aspects of a summary (e.g., coherence and relevance).

UMSE Model
Problem Formulation Given a model-generated summary X = {x_1, x_2, ..., x_{L_x}} with L_x tokens, our goal is to use a unified evaluation model to produce a score s ∈ R for X. For the Sum-Ref scenario, the model uses the generated summary X and the ground-truth summary Y = {y_1, y_2, ..., y_{L_y}} as input. For the Sum-Doc scenario, we evaluate summary quality using the generated summary X and the document D = {d_1, d_2, ..., d_{L_d}} with L_d tokens as input, which does not require any human annotation (e.g., the ground-truth summary Y). For the Sum-Doc-Ref scenario, the model uses the generated summary X, the ground-truth summary Y, and the document D as input. To train the evaluation model, we do not use any human-annotated summary quality dataset; instead, we construct the training dataset using several self-supervised training strategies.
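The three input configurations above can be expressed as a single scoring interface. A minimal sketch follows; `score_summary` and the per-scenario scorer arguments are hypothetical names for illustration, not the paper's code, and the arithmetic-mean fusion mirrors the combination the paper later reports as best:

```python
from typing import Callable, Optional

def score_summary(
    x: str,                                   # model-generated summary X
    reference: Optional[str] = None,          # ground-truth summary Y
    document: Optional[str] = None,           # source document D
    score_sum_ref: Callable[[str, str], float] = None,
    score_sum_doc: Callable[[str, str], float] = None,
) -> float:
    """Dispatch to the appropriate evaluation scenario based on available inputs."""
    if reference is not None and document is not None:
        # Sum-Doc-Ref: fuse both scenario scores (arithmetic mean)
        return 0.5 * (score_sum_ref(x, reference) + score_sum_doc(x, document))
    if reference is not None:
        return score_sum_ref(x, reference)    # Sum-Ref
    if document is not None:
        return score_sum_doc(x, document)     # Sum-Doc
    raise ValueError("need a reference summary and/or a document")
```

A caller thus never selects a model by hand; the available inputs determine the scenario.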

Overview
In this section, we detail the Unified Multi-scenario Summarization Evaluation Model (UMSE). An overview of UMSE is shown in Figure 2. UMSE has two main parts: (1) Data construction. We first construct two self-supervised datasets for the coherence and relevance evaluation scenarios. (2) Unified model. To handle the different input types in a single model, we propose a perturbed prefix-tuning method to train UMSE.

Data Construction
Employing human annotators to label the quality of generated summaries in order to train the evaluation model is labor-consuming and makes the evaluation model hard to use. We propose to use self-supervised tasks to construct the training dataset for the evaluator without any human annotation. Since measuring summary quality requires two main semantic matching abilities, (1) matching with the reference summary and (2) matching with the document, we propose two self-supervised tasks to construct the training dataset automatically: • Summary matching oriented data: The goal of this task is to construct positive and negative samples that differ in whether the summary contains the salient information. Given a document-summary pair (D, Y), the data sample to construct is a summary pair. The positive pair (Y, X_LD3) contains the reference summary Y and a candidate summary X_LD3 that contains relevant information, while the negative pair (Y, X_BM) contains the reference summary Y and a candidate summary X_BM that describes similar but not relevant information. In particular, if the negative data is very hard for the evaluation model to identify (e.g., it requires reasoning ability or is very similar to the positive sample), the evaluation model will achieve better performance than when trained with very simple negative data. Thus, we use the leading three sentences of the corresponding document D as the candidate summary X_LD3. For the candidate summary X_BM in the negative pair, we first use the BM25 retrieval model to retrieve the document D′ most similar to D and obtain the reference summary Y′ of D′. To make the negative sample harder, we randomly replace a sentence in Y′ with one sentence from X_LD3 as the final negative summary X_BM.
• Document matching oriented data: The golden criterion for evaluating summary quality is whether the summary describes the main facts of the document. Hence, we construct self-supervised data that aims to train the model to measure the semantic relevance between summary and document. The positive pair (D, Y) consists of the document D and its reference summary Y. The negative pair (D, X_BM) contains the document D and a false summary X_BM that is similar to Y. We employ the same BM25 retrieval method as in the coherence data construction to obtain Y′ and replace a sentence in Y with a sentence from Y′ to form the negative summary X_BM.
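The two construction strategies above can be sketched as follows. This is a simplified sketch: the retrieved reference Y′ is passed in directly, standing in for the BM25 retrieval step (the paper uses a PyLucene BM25 index), and the function names are illustrative, not the paper's code:

```python
import random

def lead3(doc_sents):
    """Positive candidate X_LD3 for summary matching: the leading three sentences of D."""
    return doc_sents[:3]

def summary_matching_negative(lead3_sents, retrieved_ref_sents, rng):
    """Negative X_BM for summary matching: take the reference Y' of the
    BM25-retrieved document D' and swap one of its sentences with a lead-3
    sentence, making the negative harder to identify."""
    neg = list(retrieved_ref_sents)
    neg[rng.randrange(len(neg))] = rng.choice(lead3_sents)
    return neg

def document_matching_negative(ref_sents, retrieved_ref_sents, rng):
    """Negative X_BM for document matching: take the true reference Y and
    swap one of its sentences with a sentence from the retrieved Y'."""
    neg = list(ref_sents)
    neg[rng.randrange(len(neg))] = rng.choice(retrieved_ref_sents)
    return neg
```

Both negatives stay close to a real summary, which is what makes them hard: the evaluator must detect a single intruding sentence rather than a wholly unrelated text.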
For brevity, we omit the superscript of X in the following sections.

Perturbed Prefix-Tuning
Although the three scenarios have different input types, we can directly concatenate the inputs into a text sequence that can be easily consumed by a pre-trained language model. Following previous work (Zhong et al., 2022), although our evaluation model does not require additional summarization-quality annotations, human-written summaries are still required to train the estimator. Therefore, reducing the dependence on human-written summaries can improve the applicability of our model in low-resource scenarios. Thus, we employ prefix-tuning to exploit the semantic understanding ability of large language models for the summarization evaluation task. Specifically, we append a different prefix sequence at the start of the input text sequence according to the scenario; for the Sum-Ref scenario:

H^SR = PLM([P^SR; [CLS]; X; [SEP]; Y]),

where [CLS] and [SEP] are special tokens of the PLM, H^SR ∈ R^{(L_x+L_y+L_p+2)×z} denotes the token-level representation of the Sum-Ref pair, and z is the hidden size of the PLM. P^* ∈ R^{L_p×z} denotes the prefix for each scenario, which is a continuous prompt of length L_p. The advantage of the unified evaluator is that we can use one large language model to conduct all three tasks, which reduces the size of the evaluation toolkit.
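At the embedding level, the prefixed input can be sketched with a toy numpy stand-in for the PLM's embedding layer. In the real model the prefix rows are learned parameters; the random values and the concrete lengths here are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
z, L_p = 16, 4                       # PLM hidden size z, prefix length L_p

# learned continuous prefix for the Sum-Ref scenario, P^SR in R^{L_p x z}
P_SR = rng.normal(size=(L_p, z))

def build_input(prefix, cls_emb, x_emb, sep_emb, y_emb):
    """Concatenate [prefix; CLS; X; SEP; Y] along the token axis."""
    return np.concatenate(
        [prefix, cls_emb[None], x_emb, sep_emb[None], y_emb], axis=0
    )

x_emb = rng.normal(size=(5, z))      # generated summary X, L_x = 5 tokens
y_emb = rng.normal(size=(7, z))      # reference summary Y, L_y = 7 tokens
cls_emb, sep_emb = rng.normal(size=z), rng.normal(size=z)

H_in = build_input(P_SR, cls_emb, x_emb, sep_emb, y_emb)
assert H_in.shape == (5 + 7 + L_p + 2, z)   # (L_x + L_y + L_p + 2) x z
```

The shape check mirrors the dimensionality stated in the text; the Sum-Doc input is built the same way with the document D in place of Y.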
Although these scenarios have their own exclusive task characteristics, there are also shared abilities and knowledge that can be transferred between them. To model the exclusive characteristics and transfer knowledge via the continuous prefix in a coordinated way, we propose a prefix perturbation method that uses the same tokens in different orders for different scenarios. Take the prefix of the Sum-Doc scenario as an example: P^SD contains L_p continuous prefix tokens, P^SD = {p_1, p_2, ..., p_{L_p}}. We perturb P^SD into {p_1, p_3, ..., p_{L_p}, p_2, ..., p_{L_p−1}} and use this perturbed sequence as the prefix for Sum-Doc-Ref, P^SDR. This perturbation keeps the prefixes of different scenarios composed of the same continuous tokens, merely in a different order. Thus, our model can simultaneously transfer knowledge between scenarios and keep the exclusive ability prompted by each prefix.
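One reading of the example permutation above is "odd-indexed tokens first, then even-indexed tokens"; the paper only illustrates the permutation, so this exact interleaving is an assumption:

```python
def perturb_prefix(prefix_tokens):
    """Reorder {p1, p2, ..., pL} as {p1, p3, ...} followed by {p2, p4, ...}.

    The perturbed prefix reuses exactly the same tokens, so parameters are
    shared across scenarios while the order identifies the scenario.
    """
    odd = prefix_tokens[0::2]    # p1, p3, p5, ...
    even = prefix_tokens[1::2]   # p2, p4, p6, ...
    return odd + even

P_SD = ["p1", "p2", "p3", "p4", "p5"]
P_SDR = perturb_prefix(P_SD)
# same multiset of tokens, different order
assert sorted(P_SDR) == sorted(P_SD) and P_SDR != P_SD
```

Because both scenarios index into the same embedding table, gradients from either scenario update the same prefix parameters, which is what enables the knowledge transfer.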
To obtain the summary-level overall representation, we apply a pooling operation to the token-level representation:

E^* = Pooling(H^*),

where E^* ∈ R^z denotes the summary-level representation. Then we employ a multi-layer perceptron (MLP) to conduct a binary classification and obtain the probability p:

p^* = softmax(MLP(E^*)),

where p^+_* ∈ R denotes the probability of the positive class in p^*. During training, we use the cross-entropy loss L_ce to optimize the model parameters to distinguish positive and negative samples:

L_ce = −Σ_i [c_i log p^+_i + (1 − c_i) log(1 − p^+_i)],

where c_i ∈ {0, 1} denotes the label of the i-th training sample, indicating whether it is a positive or negative sample. At the inference stage, we take the probability of the positive class p^+ as the final evaluation score s. We combine the scores of the Sum-Doc and Sum-Ref scenarios to obtain the score for Sum-Doc-Ref:

s^SDR = f(s^SD, s^SR),
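Assembled in numpy, the scoring head reads as follows. Mean pooling, the random weights, and the hidden width are assumptions for illustration; the paper's head uses tanh activations and larger layers:

```python
import numpy as np

rng = np.random.default_rng(0)
z = 16

def score(H, W1, b1, W2, b2):
    """Mean-pool token representations, then a 2-layer MLP + softmax."""
    E = H.mean(axis=0)                 # E in R^z: summary-level representation
    h = np.tanh(E @ W1 + b1)           # hidden layer (tanh activation)
    logits = h @ W2 + b2               # two classes: negative / positive
    p = np.exp(logits - logits.max())  # numerically stable softmax
    p = p / p.sum()
    return p[1]                        # p^+ used as the evaluation score s

def bce_loss(p_pos, c):
    """Cross-entropy on the positive-class probability (c in {0, 1})."""
    return -(c * np.log(p_pos) + (1 - c) * np.log(1 - p_pos))

H = rng.normal(size=(10, z))           # token-level representations H^*
W1, b1 = rng.normal(size=(z, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 2)), np.zeros(2)
s = score(H, W1, b1, W2, b2)
assert 0.0 < s < 1.0                   # a valid probability score
```

At inference time only `score` is needed; `bce_loss` is applied per sample during self-supervised training.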

where f denotes the ensemble strategy, such as min or max. In the experiments, we analyze the performance of different implementations of f.

Datasets
In the training phase, we construct the positive and negative data pairs using the CNN/DailyMail (Nallapati et al., 2016) dataset. The trained evaluators are then tested on the meta-evaluation benchmark SummEval (Fabbri et al., 2021) to measure the rank correlation between the evaluation model and human judgment.
CNN/DailyMail has 286,817 training document-summary pairs, 13,368 validation pairs, and 11,487 test pairs. The documents in the training set contain 766 words and 29.74 sentences on average, while the reference summaries contain 53 words and 3.72 sentences.
SummEval is a meta-evaluation benchmark. To collect human judgments of model-generated summaries, its authors first randomly selected 100 document-reference pairs from the test set of CNN/DailyMail, then generated summaries using 16 neural summarization models. Each summary is annotated by 3 experts and 5 crowd-sourced workers along four dimensions: coherence, consistency, fluency, and relevance. In total, there are 12,800 summary-level annotations.

Evaluation Metrics
Following previous work (Yuan et al., 2021; Zhong et al., 2022), we measure the rank correlation between the evaluation model and human judgment to represent the performance of the evaluator. In the experiments, we employ the Spearman (ρ) and Kendall-Tau (τ) correlations between the evaluator's output scores and human ratings. The statistical significance of differences between UMSE and the strongest baseline in each scenario is tested using a two-tailed paired t-test; ▲ (or ▼) denotes a significant difference at p < 0.05, with α = 0.01 for strong significance.
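In practice both correlations are usually computed with `scipy.stats.spearmanr` and `scipy.stats.kendalltau`; a pure-Python sketch for tie-free score lists makes the definitions concrete (real evaluations should use scipy's tie-handling implementations):

```python
def rankdata(xs):
    """Rank 1..n for tie-free data."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    for r, i in enumerate(order, start=1):
        ranks[i] = float(r)
    return ranks

def spearman(xs, ys):
    """Spearman rho via the rank-difference formula (no ties)."""
    rx, ry = rankdata(xs), rankdata(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sgn = (xs[i] - xs[j]) * (ys[i] - ys[j])
            concordant += sgn > 0
            discordant += sgn < 0
    return (concordant - discordant) / (n * (n - 1) / 2)

model_scores = [0.9, 0.4, 0.7, 0.2]
human_ratings = [5.0, 2.0, 4.0, 1.0]   # perfectly monotonic with the scores
assert spearman(model_scores, human_ratings) == 1.0
assert kendall_tau(model_scores, human_ratings) == 1.0
```

A perfectly monotonic relation yields 1.0 under both metrics; a perfectly reversed ranking yields −1.0.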

Comparisons
In the experiments, we compare the proposed UMSE with widely used and strong baselines. Reference-based methods: (1) ROUGE (Lin, 2004) is one of the most popular metrics; it computes n-gram overlap between the system output and the reference summary. We employ ROUGE-1, ROUGE-2, and ROUGE-L in our experiments. (2) BERTScore (Zhang et al., 2020) leverages contextual embeddings from the pre-trained language model BERT (Devlin et al., 2019) and calculates the cosine similarity between the system output and the reference. (3) MoverScore (Zhao et al., 2019) utilizes the Word Mover's Distance to compute the distance between the embeddings of the generated summary and the reference. (4) BARTScore (Yuan et al., 2021) uses the weighted log probability of the pre-trained language model BART (Lewis et al., 2020) to evaluate summary quality. (5) CTC (Deng et al., 2021) is a general evaluation framework for language generation tasks, including compression, transduction, and creation; it is designed around the concept of information alignment. (6) UniEval (Zhong et al., 2022) formulates summary evaluation as binary question answering and evaluates the summary along four dimensions: coherence, consistency, fluency, and relevance. Reference-free methods: (1) BLANC (Vasilyev et al., 2020) is defined as a measure of the helpfulness of a summary to a PLM performing the Cloze task on document sentences; specifically, the final score is the difference in accuracy with and without the summary concatenated to the masked sentence. (2) SummaQA (Scialom et al., 2019) is a QA-based evaluation metric: it generates questions from documents, answers the questions based on the summary using a QA model, and computes QA metrics as evaluation scores. (3) SUPERT (Gao et al., 2020b) constructs a pseudo reference by extracting salient sentences from the source document and computes the similarity between the generated summary and the pseudo reference. (4)
UniTE (Wan et al., 2022) is a unified evaluation model for machine translation in different scenarios: reference-only, source-only and source-reference-combined.
To demonstrate the effectiveness of the perturbed prefix-tuning, we design an ablation model, UMSE-PT (w/o Prefix-Tuning): we remove the prefix from the input and jointly fine-tune one pre-trained language model on the two datasets we constructed.

Implementation Details
Following Deng et al. (2021), we employ roberta-large (Liu et al., 2019) as the backbone of our model. The MLP consists of 3 linear layers with tanh activation, and the dimensions of each layer are 3072, 1024, and 2, respectively. Following Wan et al. (2022), the maximum length of the input sequence (including the prompt) is set to 512. We vary the prompt length over {8, 16, 32, 64, 128} and find that 128 is the best choice. We use AdamW as the optimizer with a learning rate of 3.0e-05, selected from {2.0e-05, 3.0e-05, 5.0e-05}. We train for up to 10 epochs with a batch size of 8, fix the random seed to 12, and train on an NVIDIA GeForce RTX 3090 GPU for 6-7 hours. We use PyLucene to implement the BM25 algorithm for retrieving similar documents. Each of the two training datasets contains 30K samples, with positive and negative samples in equal proportion.

Evaluation Results
We compare our UMSE with strong baselines in Table 1. UMSE achieves comparable performance with strong scenario-specific baselines while freeing users from having to use multiple models.

Table 1: Comparison with baselines on the SummEval dataset. We use the notation "(w/ *)" to denote which data is used as input; ρ denotes Spearman correlation and τ denotes Kendall-Tau correlation. Rows with a shaded background denote multi-dimensional metrics that output a score for each dimension, so direct comparison with these methods is unfair. Underlined numbers denote the maximum value within a scenario, and bold-face denotes the maximum over all three scenarios.
As discussed in the related work (§2), some evaluators (e.g., UniEval and BARTScore) focus on evaluating the summary along multiple dimensions, modeling dimension-specific features and outputting multiple scores. Different from these methods, we focus on an orthogonal aspect, using a unified model across multiple scenarios, and we use only one score to represent summary quality. Thus, directly comparing with these multi-dimensional metrics is not fair. Since our unified multi-scenario evaluator is orthogonal to these multi-dimensional evaluators, we will combine the multi-dimensional approach with UMSE in future work.
Similar to our UMSE, UniTE is also a multi-scenario unified evaluation method, but for machine translation. However, UniTE achieves worse performance than UMSE, which supports our assumption that the matching framework and the data construction method in UniTE mainly focus on the characteristics of translation, and that we cannot simply reuse UniTE for the summarization task.
From the results of UMSE(Fusion) (w/ SDR) and UMSE (w/ SDR), we find that the fusion model achieves better performance, and we will use the fusion method in the released version of UMSE. An extensive analysis of why the fusion method works better than directly concatenating Sum-Doc-Ref in the input of the PLM is given in the following section.

Discussions
Ablation Studies To verify the effectiveness of our proposed perturbed prefix tuning method, we evaluate the ablation model UMSE-PT in all three scenarios. In this model, we mix the training datasets we constructed and jointly fine-tune one PLM for all scenarios. From the results in Table 1, we find that UMSE-PT underperforms UMSE in all scenarios. Although using a shared pre-trained language model can also transfer knowledge among these scenarios, these ablation studies demonstrate that sharing continuous prefix tokens provides an explicit way to share common matching knowledge and boosts the performance of UMSE. Moreover, we run an intuitive experiment that separately fine-tunes a PLM for each scenario; the results are shown in Table 3. Although the performance on Sum-Ref drops slightly along two dimensions, our proposed UMSE boosts performance in the Sum-Doc scenario significantly, and improving the Sum-Doc scenario is the more valuable gain, since evaluation in this scenario does not require any human annotation.

We also conduct experiments to explore which fusion method leads to the best performance. We employ four different fusion methods: (1) max takes the maximum of s^SD and s^SR as s^SDR; (2) min takes the minimum of s^SD and s^SR; (3) geometric mean fusion uses sqrt(s^SD · s^SR) as s^SDR; and (4) arithmetic mean fusion employs (s^SD + s^SR)/2. From Table 2, we find that the arithmetic mean achieves the best performance, and we use arithmetic mean fusion in UMSE(Fusion).
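The four candidate implementations of f are one-liners (the function names are illustrative):

```python
import math

def fuse_max(s_sd, s_sr):        return max(s_sd, s_sr)
def fuse_min(s_sd, s_sr):        return min(s_sd, s_sr)
def fuse_geometric(s_sd, s_sr):  return math.sqrt(s_sd * s_sr)
def fuse_arithmetic(s_sd, s_sr): return (s_sd + s_sr) / 2   # best in Table 2

s_sd, s_sr = 0.4, 0.6
# for non-negative scores: min <= geometric mean <= arithmetic mean <= max
assert fuse_min(s_sd, s_sr) <= fuse_geometric(s_sd, s_sr) \
    <= fuse_arithmetic(s_sd, s_sr) <= fuse_max(s_sd, s_sr)
```

The ordering shown in the assertion (AM-GM inequality) explains why min and max are the extreme choices: min discounts a summary whenever either scenario is unsure, while max rewards it if either scenario is satisfied; the arithmetic mean balances the two signals.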

Analysis of Perturbed Prefix Length
To verify the effectiveness of our proposed perturbed prefix, we conduct experiments with different prefix lengths. From Figure 3, we find that the performance of UMSE gradually improves as the prefix length grows.

Analysis of Hallucination Detection
To analyze the effectiveness of our model in detecting hallucinations, we conduct experiments on the dataset released by Maynez et al. (2020); the results are shown in Table 4. According to the Spearman correlations on both faithfulness and factuality, UMSE outperforms baselines such as ROUGE, BERTScore, and QA, which demonstrates the ability of our model to detect hallucinations.

Conclusion
In this paper, we propose the Unified Multi-scenario Summarization Evaluation Model (UMSE), a unified multi-scenario summarization evaluation framework. UMSE performs semantic evaluation in three typical scenarios, (1) Sum-Ref, (2) Sum-Doc, and (3) Sum-Doc-Ref, using only one unified model. Since these scenarios have different input formats, we propose a perturbed prefix-tuning method that unifies them in one model and also transfers knowledge between them. To train UMSE in a self-supervised manner, we propose two training data construction methods that require no human annotation. Extensive experiments on the benchmark dataset SummEval verify that UMSE achieves comparable performance with existing baselines.

Limitations
In this paper, we propose the evaluation model UMSE, which can be used to evaluate summary quality in three typical scenarios. However, in the summarization task, different annotators have different writing styles, and there may exist more than one good summary for a document. Moreover, summaries can concentrate on different aspects of a document (e.g., describing the location and rooms of a hotel). In the future, we aim to incorporate more scenarios (e.g., multi-reference and multi-aspect evaluation) into our unified evaluation method.

Figure 2: Illustration of UMSE which tackles the summarization evaluation in three scenarios by a unified model trained with two self-supervised tasks.
Sum-Doc-Ref Evaluation Intuitively, the Sum-Doc-Ref scenario can be seen as a combination of the Sum-Doc and Sum-Ref scenarios. Hence, an intuitive way to conduct evaluation in the Sum-Doc-Ref scenario is to directly fuse the scores of the Sum-Doc and Sum-Ref scenarios. We therefore propose a variant implementation that conducts evaluation conditioned on the Sum-Doc-Ref input, named UMSE(Fusion).
Sum-Doc-Ref Fusion In § 3.4, we propose a variant model for the Sum-Doc-Ref scenario which directly fuses the scores of Sum-Doc and Sum-Ref to produce the score for the Sum-Doc-Ref scenario.


Table 2: Results of different fusion methods in the Sum-Doc-Ref scenario.

Table 3: Comparison between UMSE and separately fine-tuned PLMs.