Towards a Unified Multi-Dimensional Evaluator for Text Generation

Multi-dimensional evaluation is the dominant paradigm for human evaluation in Natural Language Generation (NLG), i.e., evaluating the generated text from multiple explainable dimensions, such as coherence and fluency. However, automatic evaluation in NLG is still dominated by similarity-based metrics, and we lack a reliable framework for a more comprehensive evaluation of advanced models. In this paper, we propose a unified multi-dimensional evaluator UniEval for NLG. We re-frame NLG evaluation as a Boolean Question Answering (QA) task, and by guiding the model with different questions, we can use one evaluator to evaluate from multiple dimensions. Furthermore, thanks to the unified Boolean QA format, we are able to introduce an intermediate learning phase that enables UniEval to incorporate external knowledge from multiple related tasks and gain further improvement. Experiments on three typical NLG tasks show that UniEval correlates substantially better with human judgments than existing metrics. Specifically, compared to the top-performing unified evaluators, UniEval achieves a 23% higher correlation on text summarization, and over 43% on dialogue response generation. Also, UniEval demonstrates a strong zero-shot learning ability for unseen evaluation dimensions and tasks. Source code, data, and all pre-trained evaluators are available at https://github.com/maszhongming/UniEval.


Introduction
The rapid development of Natural Language Generation (NLG) tasks with the support of pre-trained language models (Raffel et al., 2020; Brown et al., 2020; Lewis et al., 2020) calls for a higher-quality evaluation of generated texts. However, the evaluation process is still dominated by traditional similarity-based metrics (Kasai et al., 2021), exemplified by ROUGE (Lin, 2004) and BLEU (Papineni et al., 2002), which compute n-gram overlap between the model output and the reference text. These metrics are potentially misleading as NLG models have advanced to the point where discrepancies between them are unlikely to be detected based on surface-level features (Gehrmann et al., 2022). Although using pre-trained models to obtain embedding-based similarity may alleviate this issue (Zhang et al., 2019), these metrics still naturally lead to the question: does similarity to the reference text indicate the overall quality of the model output? Belz and Gatt (2008) referred to this similarity as "human-likeness" and pointed out that the ability to output human-like text may be completely unrelated to final performance on generation tasks.
Realizing that creating a one-size-fits-all score is infeasible, subsequent research has focused on a more comprehensive multi-dimensional evaluation for NLG tasks. It aims to evaluate the model output from multiple explainable dimensions and has been the dominant paradigm in human evaluation (Fabbri et al., 2021). For example, text summarization typically uses four dimensions for evaluation: coherence, consistency, fluency, and relevance (see Table 1). One way to achieve this fine-grained evaluation is to develop multiple evaluators dedicated to every single dimension (Dziri et al., 2019; Kryściński et al., 2020). However, this requires extensive effort to individually select and train an evaluator for each dimension when conducting multi-dimensional evaluations. On the other hand, several studies worked on building a unified evaluator, i.e., a single model that can produce multiple metrics (e.g., precision and recall) for the generated text (Yuan et al., 2021). Nevertheless, their evaluation scores cannot be directly aligned with the dimensions designed in human evaluation (e.g., consistency and coherence).
In this paper, we propose a unified multi-dimensional evaluator UniEval for text generation tasks. UniEval unifies all evaluation dimensions into a Boolean Question Answering (QA) problem (Clark et al., 2019a), thus enabling the evaluation of the generated text from different perspectives using only a single model. For instance, UniEval can evaluate coherence in summarization by inputting a specific question, such as "Is this a coherent summary to the document?". Moreover, thanks to the unified Boolean QA format, we are able to perform an intermediate training stage on four types of tasks related to NLG evaluation. This can be crucial for evaluation quality: since we lack large-scale human scores of model outputs to train an evaluator, a unified format that encompasses diverse existing tasks (namely, intermediate tasks) can substantially help UniEval incorporate external knowledge related to NLG evaluation.

Table 1: An example of evaluating the text summarization task. All metrics except BARTScore are scored in the range of 0 to 1, with higher scores indicating better quality. Our proposed UniEval is consistent with human evaluation, using multiple dimensions (coherence, consistency, fluency, and relevance) to evaluate the generated text. The scores predicted by UniEval are closer to human judgments.
Specifically, a unified framework brings the following benefits:
1) Ease of use. One model is sufficient, without the effort of picking multiple appropriate single-dimensional evaluators for all the dimensions.
2) Internal complementarity. Different dimensions in the same NLG task can be closely related to each other, so it is useful to perform joint training on these dimensions to share knowledge.
3) External knowledge incorporation. The unified Boolean QA format makes it possible to enhance the pre-trained language model by multi-task learning on diverse and relevant intermediate tasks before training on evaluation tasks.
4) Extensibility and transferability. A unified evaluator can achieve better extensibility and transferability with continual learning (Parisi et al., 2019) or prompting (Liu et al., 2021b; Chen et al., 2022), as it can accommodate more evaluation dimensions by modifying the input question.
Experimentally, UniEval surpasses advanced evaluators by a large margin when evaluating three typical NLG tasks. Concretely, compared to the best unified evaluators (Yuan et al., 2021; Mehri and Eskenazi, 2020), UniEval improves the correlation with human judgment by 23% on text summarization, and the improvement exceeds 43% on dialogue response generation. Ablation studies verify the effectiveness of our intermediate tasks. We also conduct transfer experiments and show that UniEval achieves better performance than strong baseline metrics on unseen dimensions and NLG tasks in a zero-shot setting.

Related Work
Similarity-based Metrics Similarity-based metrics refer to scores that evaluate NLG models by measuring the similarity between a generated text and a reference text. They can be divided into lexical overlap-based (Papineni et al., 2002; Lin, 2004; Banerjee and Lavie, 2005) and contextualized embedding-based (Zhang et al., 2019; Zhao et al., 2019; Clark et al., 2019b) evaluators. Although more than 60% of recent NLG papers solely use ROUGE or BLEU as the evaluation metric (Kasai et al., 2021), these metrics fail to measure content quality (Reiter and Belz, 2009) and syntactic correctness (Stent et al., 2005), and are thus insufficient to portray the reliability of NLG systems.

Single-dimensional Evaluator
To conduct more fine-grained evaluations for NLG, recent studies develop evaluators for a specific dimension, such as consistency in summarization (Kryściński et al., 2020; Wang et al., 2020; Cao et al., 2020; Durmus et al., 2020) and coherence in dialogue response generation (Dziri et al., 2019; Huang et al., 2020; Ye et al., 2021). These evaluators can help us better understand the characteristics of advanced NLG models from different perspectives. However, considering that most dimensions currently have no corresponding standard evaluators, performing multi-dimensional evaluation solely with single-dimensional evaluators is difficult in practice.
Unified Evaluator Several recent evaluators can predict multiple numbers for evaluating text by using different input and output contents (Yuan et al., 2021), multiple model variants (Mehri and Eskenazi, 2020), or different formulas (Scialom et al., 2021); we refer to them as unified evaluators. Their evaluation scores usually have no corresponding explanations or are simply categorized as precision, recall, and F1, which makes them difficult to interpret and use. Therefore, we propose a unified multi-dimensional evaluator that attempts to align the evaluation scores with the different dimensions used in human evaluation.

Method
In this section, we first introduce how to formulate multi-dimensional evaluation as a unified Boolean QA problem, and then describe in detail the training paradigm for UniEval.

Problem Formulation
Multi-dimensional evaluation of NLG requires evaluating n particular dimensions d = (d_1, ..., d_n) of the model output, where the input can include the candidate output x, the reference text y, and the context c. y is removed when evaluating reference-independent dimensions, such as consistency in summarization. Depending on the specific generation task, c can contain different content or even be omitted. Evaluators need to assess the quality of the model output on each dimension and output scores s = (s_1, ..., s_n) for all the dimensions.
To unify all evaluation dimensions into one evaluator, we transform each dimension into a Boolean question q_i. For example, for d_i = coherence in summarization, the transformed question q_i is "Is this a coherent summary to the document?". Then for each input (x, y, c, q_i), the evaluator should output "Yes" or "No" and calculate s_i as:

s_i = P("Yes" | x, y, c, q_i) / (P("Yes" | x, y, c, q_i) + P("No" | x, y, c, q_i)),   (1)

where P(·) denotes the probability of the model generating a specific word. In this way, a single evaluator can evaluate x on all dimensions by modifying the question description.
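Concretely, Equation 1 amounts to a two-way softmax over the scores the model assigns to the answers "Yes" and "No". The sketch below is ours, not the released implementation; in practice the two logits would come from the backbone T5 decoder at the first decoding step (an assumption on our part):

```python
import math

def boolean_qa_score(logit_yes: float, logit_no: float) -> float:
    """Equation 1: normalized probability of answering "Yes".

    logit_yes / logit_no: the (hypothetical) decoder logits for the
    tokens "Yes" and "No" given the input (x, y, c, q_i).
    """
    p_yes = math.exp(logit_yes)
    p_no = math.exp(logit_no)
    # Renormalize over the two answers only, ignoring the rest of the vocabulary.
    return p_yes / (p_yes + p_no)
```

With equal logits the score is exactly 0.5; a large margin in favor of "Yes" pushes the score toward 1, so the metric stays in the 0-to-1 range used in Table 1.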

Unsupervised Learning on Multiple Evaluation Dimensions
Since annotating large-scale human scores to judge the quality of generated text is unaffordable, we adopt an unsupervised setting to develop our evaluator. Using T5 (Raffel et al., 2020) as the backbone model, we first design specific rules for several commonly evaluated dimensions to construct pseudo data, and then combine them to train the evaluator.

Pseudo Data Construction
To train an evaluator, we need to construct positive and negative samples for different dimensions. The former implies high-quality generated text, so we use the ground truth, such as the reference summary in summarization. We then propose particular rules for each dimension to convert positive samples into negative ones.
Taking text summarization as an example, the specific rule-based transformations are as follows:
1) Coherence refers to whether all the sentences form a coherent body. To build incoherent summaries, we use BM25 (Robertson and Zaragoza, 2009) to retrieve similar summaries and randomly select a sentence from the retrieved summary to replace one of the sentences in the ground-truth summary.
2) Consistency is the factual alignment between the summary and the source document. We use the method in Chen et al. (2021) to construct inconsistent summaries by antonym substitution, numerical editing, entity replacement, and syntactic pruning.
3) Fluency represents the quality of individual sentences. We randomly draw a span from the positive sample and perform one of repeating, deleting, and shuffling to obtain disfluent summaries.
4) Relevance means whether the summary contains only the important information of the source document. The transformation rule is similar to coherence, except that we replace multiple sentences at random instead of one.
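The fluency rule can be sketched as a token-level span perturbation. The helper below is a hypothetical illustration of that rule only: the span position, span length, and operation are passed in by the caller (the paper samples them randomly), and "shuffle" is stood in for by a deterministic reversal so the behavior is reproducible:

```python
def perturb_span(tokens, start, span_len, op):
    """Turn a fluent token sequence into a disfluent negative sample
    by repeating, deleting, or (here, deterministically) shuffling a span."""
    toks = list(tokens)
    span = toks[start:start + span_len]
    if op == "repeat":
        # Duplicate the span in place.
        toks[start:start + span_len] = span + span
    elif op == "delete":
        # Remove the span entirely.
        toks[start:start + span_len] = []
    elif op == "shuffle":
        # Reverse as a deterministic stand-in for random shuffling.
        toks[start:start + span_len] = span[::-1]
    else:
        raise ValueError(f"unknown op: {op}")
    return toks
```

For example, repeating the span ["b", "c"] inside ["a", "b", "c", "d"] yields ["a", "b", "c", "b", "c", "d"], a plausible disfluent variant of the original sequence.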
We include the designed rules for other NLG tasks in Appendix A.2. The detailed descriptions and concrete examples for all dimensions can also be found in Appendices A.1 and A.3.
Training Strategy For each generation task, we attempt to build a single evaluator that can evaluate the NLG model from different dimensions. A straightforward approach is to perform multi-task learning on the synthetic data of all dimensions to obtain a unified evaluator. However, we observe the negative transfer problem in several dimensions (e.g., coherence in summarization and engagingness in dialogue generation; see Tables 3 and 4). To tackle this issue, we employ a simple and effective method from continual learning (Parisi et al., 2019): whenever a new dimension is introduced, we add a small portion of data from all previous dimensions to replay. The benefit is that we can easily extend our evaluator to new dimensions without training from scratch. Moreover, this method enables the model to explicitly learn dimensions related to linguistic features (e.g., fluency) first, and then move on to the dimensions that require a better understanding of the text (e.g., consistency). We show that this sequential training approach can alleviate the negative transfer problem in Section 4.
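The replay schedule can be sketched as follows. This is our own illustration of the data mixing, not the training loop itself; the 20% replay ratio follows the value stated in Appendix C, and the function simply returns, for each stage, the pseudo data the model would be fine-tuned on:

```python
import random

def continual_training_stages(dimensions, pseudo_data, replay_ratio=0.2, seed=0):
    """For each new dimension, build its training set: all of its own
    pseudo data plus a random replay sample from every earlier dimension.

    dimensions: ordered list of dimension names (e.g. linguistic ones first)
    pseudo_data: dict mapping dimension name -> list of training examples
    """
    rng = random.Random(seed)
    stages, seen = [], []
    for dim in dimensions:
        batch = list(pseudo_data[dim])
        for prev in seen:
            k = int(len(pseudo_data[prev]) * replay_ratio)
            batch += rng.sample(pseudo_data[prev], k)  # replay earlier dimensions
        stages.append((dim, batch))
        seen.append(dim)
    return stages
```

With two dimensions of 10 examples each and a 0.2 ratio, the first stage trains on 10 examples and the second on 12 (10 new plus 2 replayed), which is how the evaluator can later be extended to a new dimension without retraining from scratch.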

Intermediate Multi-task Learning
Benefiting from the unified Boolean QA format, we can additionally introduce intermediate tasks for UniEval to incorporate external knowledge from existing related datasets. As shown in Figure 1, this stage is placed before the unsupervised learning on evaluation tasks. Notably, the input here is (c, q), which no longer includes the candidate output x and the reference text y. In total, we collect four types of intermediate tasks as follows.

Natural Language Inference. The task of NLI is to determine whether a "hypothesis" is true (entailment), false (contradiction), or undetermined (neutral) under a "premise". We transform the NLI task into a question: "Is this hypothesis entailed in the premise?", and only convert entailment into the label "Yes" and the rest to "No". The context c consists of a hypothesis and a premise. We use the following three datasets: document-level NLI (Yin et al., 2021), the MRPC corpus (Dolan and Brockett, 2005), and QQP (Wang et al., 2017).

Self-Supervised Task. Based on the classical next sentence prediction task (Devlin et al., 2019), we propose a new opening sentence prediction task. The goal of this task is to determine whether a sentence is the starting sentence of a given news article. The motivation is that the first few sentences in a news article tend to be salient and informative (See et al., 2017; Zhong et al., 2019), so the task allows the model to learn inter-sentence coherence while also capturing the central idea of the document. We sample news from the CNN/DailyMail news corpus (Hermann et al., 2015) and randomly select the opening sentences of other news articles as negative samples.
Linguistics-Related Task. To facilitate the incorporation of linguistic knowledge into the unified model, we also include the CoLA dataset (Warstadt et al., 2019) as the linguistic task. It requires the model to judge whether a sentence is linguistically acceptable, so the input question is: "Is this a fluent and linguistically acceptable sentence?".
Generic QA. We collect the existing Boolean QA datasets BoolQ (Clark et al., 2019a), BoolQ-NP (Khashabi et al., 2020), BoolQ-CS (Gardner et al., 2020), and StrategyQA (Geva et al., 2021), and extract the questions in the MultiRC dataset (Khashabi et al., 2018) that can be answered with Yes/No as the data for the generic QA task. Introducing these diverse question descriptions enables the model to better understand the importance of the question in the input format, as well as to incorporate more open-ended external knowledge.
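Every intermediate task is cast into the same (c, q) → Yes/No shape. As a sketch, the NLI conversion might look like the following; the exact field names and separator are assumptions on our part (the paper's released preprocessing may format inputs differently):

```python
def nli_to_boolean_qa(premise: str, hypothesis: str, label: str):
    """Convert one NLI example into the unified Boolean QA format.

    Only "entailment" maps to the answer "Yes"; contradiction and
    neutral both map to "No", as described for the NLI intermediate task.
    """
    question = "Is this hypothesis entailed in the premise?"
    # The context c consists of the hypothesis and the premise.
    context = f"hypothesis: {hypothesis} premise: {premise}"
    answer = "Yes" if label == "entailment" else "No"
    return f"question: {question} </s> {context}", answer
```

Since this phase only needs the model to emit "Yes" or "No", the same converter pattern (a question template plus a task-specific context) covers the self-supervised, linguistic, and generic QA tasks as well.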
The statistics of the data can be found in Table 2, and concrete examples for each task are provided in Appendix B. Since this phase is not related to the evaluation metric, we train the model with cross-entropy loss without computing s_i.

Experiments
Following Deng et al. (2021), we classify NLG tasks into three types: compression, creation, and transduction, and select typical tasks from each category to conduct experiments. For compression and creation, we choose summarization and dialogue response generation to measure the performance of UniEval, as well as its ability to zero-shot to unseen dimensions. For transduction, we select data-to-text to test whether UniEval can transfer to a new NLG task.

Implementation Details
We use the "google/t5-v1_1-large" version of T5 as the backbone model in all the experiments. The number of pseudo samples for each dimension is 30k, with an equal number of positive and negative examples. The order for continual learning is coherence → fluency → consistency → relevance for summarization, and coherence → naturalness → groundedness → engagingness for dialogue generation. For the score calculation, we follow previous work to compute sentence-level average scores for fluency and consistency (Laban et al., 2021) in summarization, and sentence-level cumulative scores for engagingness (Deng et al., 2021), while the rest are calculated as in Equation 1. More details can be found in Appendix C.
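The two sentence-level aggregation schemes mentioned above can be written down in a few lines; the helper below is our illustration of that step only, taking per-sentence scores (each produced by Equation 1) as plain floats:

```python
def aggregate_sentence_scores(sentence_scores, mode):
    """Combine per-sentence scores into one dimension-level score.

    mode="average": sentence-level mean, used for fluency and consistency
                    in summarization (following Laban et al., 2021).
    mode="cumulative": sentence-level sum, used for engagingness
                       (following Deng et al., 2021).
    """
    if not sentence_scores:
        raise ValueError("need at least one sentence score")
    if mode == "average":
        return sum(sentence_scores) / len(sentence_scores)
    if mode == "cumulative":
        return sum(sentence_scores)
    raise ValueError(f"unknown mode: {mode}")
```

Note that the cumulative score for engagingness can exceed 1 for multi-sentence responses, unlike the averaged dimensions.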

Baselines
We compare UniEval with several state-of-the-art evaluators. Notably, all the single-dimensional and unified evaluators are built on the same corpus.
BERTScore (Zhang et al., 2019) is a similarity-based evaluator. It computes the similarity between two text sequences based on the contextualized embeddings obtained by BERT (Devlin et al., 2019).
MoverScore (Zhao et al., 2019) adds many-to-one alignment to BERTScore and introduces new aggregation methods to achieve a more powerful similarity-based evaluator.
CTC (Deng et al., 2021) utilizes information alignment to define metrics for several specific dimensions in NLG tasks, and proposes three model variants for each dimension. We compare against the best variant of CTC in each dimension as the single-dimensional evaluators in our experiments.
BARTScore (Yuan et al., 2021) is a unified evaluator that uses the average likelihood of the model output as the metric. It can predict different scores depending on the input and output. We follow the original paper, using c → x as the score for coherence, consistency, and fluency, and x → y as the score for relevance.
USR (Mehri and Eskenazi, 2020) is a unified evaluator designed for the dialogue response generation task. It uses different variants (e.g., MLM, dialogue retrieval, and an overall metric) to predict multiple scores for each generated response. We choose the score with the best correlation for each dimension for comparison in the experiments.

Benchmarks
We adopt four meta-evaluation benchmarks for various NLG tasks to measure the correlation between UniEval and human judgments. SummEval (Fabbri et al., 2021) is a meta-evaluation benchmark for summarization. For each summary to be evaluated, it provides human scores on four dimensions: fluency, coherence, consistency, and relevance. We use it to measure the performance of UniEval.
Topical-Chat (Mehri and Eskenazi, 2020) is a benchmark for the knowledge-based dialogue response generation task. It includes human scores from five dimensions: naturalness, coherence, engagingness, groundedness, and understandability. The first four dimensions are used to measure the performance of UniEval, and the last one is used for the transfer experiment.
SFRES and SFHOT (Wen et al., 2015) are meta-evaluation benchmarks for the data-to-text task. They provide information about restaurants and hotels in San Francisco and aim to let the model generate corresponding utterances. We leverage the annotations of the informativeness and naturalness dimensions to conduct the transfer experiment.
QAGS (Wang et al., 2020) is also a benchmark for summarization. It is designed to detect the consistency dimension on two summarization corpora (Narayan et al., 2018). We use it to test the performance of the single-dimensional version of UniEval, and the results are listed in Appendix D.

Results For Summarization
Following Liu et al. (2021a), we use summary-level Spearman and Kendall-Tau correlations to assess the performance of different evaluators for summarization. Results of similarity-based metrics are listed in the first part of Table 3. They are designed to measure the semantic overlap between the model output and the reference text, so they can obtain relatively high correlations on the relevance dimension. However, their poor correlations on the other dimensions show that they are not qualified metrics for those dimensions.
The second part contains the results of single-dimensional evaluators. CTC is currently the best evaluator of consistency and relevance, but it fails to excel on coherence and fluency. Here we also adapt UniEval to several single-dimensional variants by training the model on pseudo data of only one dimension. Our proposed evaluators exceed the CTC models and achieve the best correlation in all dimensions. This reveals that our proposed Boolean QA formulation can clearly enhance the backbone pre-trained model. Furthermore, we attempt to transfer the single-dimensional evaluators to other dimensions; the underlined numbers in Table 3 report these transfer results.

As shown in the last part, UniEval substantially surpasses the state-of-the-art unified evaluator BARTScore on the summarization task. Specifically, UniEval trained by multi-task learning brings an average improvement of more than 15% across all dimensions compared to BARTScore, and this gain is boosted to more than 23% by adopting continual learning in the unsupervised learning phase. The main gap between the two training strategies of UniEval is the negative transfer on coherence, which clarifies that explicitly learning basic language features before learning more complex dimensions can alleviate this problem. It is also notable that, compared with its single-dimensional version, the unified version of UniEval improves on both coherence and fluency, while having a slight decrement in the other two dimensions. This suggests that, following continual learning, we can sequentially extend our evaluator to a new dimension while preserving the performance on previous dimensions. Moreover, the clear performance drop after removing the intermediate tasks in the last row illustrates the importance and usefulness of this phase.

Results For Dialogue Generation
To test the performance of different evaluators on the dialogue response generation task, we compute turn-level Pearson and Spearman correlations on the Topical-Chat benchmark, as in Mehri and Eskenazi (2020). Table 4 shows that similarity-based metrics correlate relatively well on engagingness and groundedness while performing poorly on the remaining dimensions. With respect to the single-dimensional evaluators, we can reach the same conclusion as for the summarization task: the scores predicted by UniEval have the highest correlation with human judgments in all dimensions.
Compared to USR, the state-of-the-art unified evaluator for the dialogue response generation task, our evaluator demonstrates even more remarkable boosts. According to Pearson and Spearman correlations, UniEval (Continual) improves the results by an average of 48.9% and 43.2%, respectively. In comparison with the corresponding single-dimensional version, although there is a performance loss on naturalness, UniEval (Continual) brings improvements in the remaining dimensions based on Spearman correlation. Especially for groundedness, the unified version increases the correlation by 12.5% (0.511 ⇒ 0.575) compared to the single-dimensional version. Meanwhile, the intermediate tasks also play an indispensable role in evaluating dialogue generation, indicating that their benefits can span a variety of NLG tasks.

Transfer Experiments
We perform two zero-shot experiments to exhibit the transfer ability of UNIEVAL.

Zero-shot to Unseen Dimension
To meet the requirements of different users, new evaluation dimensions often emerge for particular NLG tasks. For instance, certain users may prefer a new "understandability" dimension over other dimensions for the dialogue generation task. Therefore, we conduct experiments on the Topical-Chat meta-evaluation benchmark to observe whether UniEval has transfer capability in this scenario. Concretely, we adjust the input question to "Is this an understandable response in the dialogue?", and calculate the metric based on Equation 1. As shown in Figure 2, although UniEval has not seen or been trained on this dimension before, its predicted score still correlates well with human judgments. It even outperforms the best USR metric for both Pearson (0.326 ⇒ 0.380) and Spearman (0.327 ⇒ 0.468) correlations, which indicates that UniEval is capable of transferring to unseen dimensions by modifying the prompt.
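Since an unseen dimension only requires a new question, the zero-shot setup reduces to swapping the prompt while keeping the trained evaluator fixed. The sketch below illustrates this; the question wording for the seen dimensions is taken from the examples in Appendix A.3, while the field layout and separator are our assumptions:

```python
# Questions for dimensions the evaluator was trained on, plus one
# unseen dimension added purely by writing a new question.
QUESTIONS = {
    "naturalness": "Is this a natural response in the dialogue?",
    "coherence": "Is this a coherent response given the dialogue history?",
    # Zero-shot: no retraining, only a new prompt.
    "understandability": "Is this an understandable response in the dialogue?",
}

def build_eval_input(dimension: str, response: str, dialogue_history: str) -> str:
    """Assemble the evaluator input for a given dimension.

    The same frozen model scores every dimension; only the leading
    question changes (hypothetical field layout).
    """
    q = QUESTIONS[dimension]
    return f"question: {q} </s> response: {response} </s> dialogue history: {dialogue_history}"
```

Adding a fourth dimension later would mean adding one more entry to `QUESTIONS`, which is exactly the extensibility argument made in the introduction.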
Zero-shot to Unseen Task In a more radical setting, we also transfer UniEval to a new NLG task, data-to-text generation, in the zero-shot setting. As annotated in the SFRES and SFHOT benchmarks, users emphasize the naturalness and informativeness of the generated utterance for this task. Therefore, we adapt the questions to "Is this a fluent utterance?" and "Is this sentence informative according to the reference?" to predict the evaluation scores for these two dimensions. "T5 + intermediate" in Table 5 represents the model obtained after the intermediate multi-task learning stage. While it has not been trained on any evaluation tasks, it performs on par with BARTScore based on average correlations and is particularly good at evaluating the naturalness of utterances. After training on multiple evaluation dimensions of summarization, UniEval (Summ) demonstrates better transfer ability and superior performance over BARTScore in most dimensions of both datasets. This illustrates the capability of UniEval to transfer to new NLG tasks without further adaptation.

Ablation Study of Intermediate Tasks
We conduct ablation studies on the single-dimensional version of UniEval to better investigate the contribution of each type of intermediate task to NLG evaluation. The results in terms of Spearman correlation are presented in Table 6. Because of the similar task requirements, NLI contributes most to consistency, while our proposed opening sentence prediction task helps the evaluator capture coherence between sentences. Due to the small data size of the linguistics-related task (see Table 2), removing it does not have a significant impact on performance, but it can still help the model better understand the fluency of individual sentences. Generic QA enhances each dimension by encouraging the evaluator to focus on the meaning of the input question. Overall, training on the combination of all four types of intermediate tasks leads to the best NLG evaluation performance.

Conclusion
In this paper, we emphasize the necessity of multi-dimensional evaluation in advancing the field of NLG. To promote this comprehensive and fine-grained evaluation approach, we propose a unified multi-dimensional evaluator, UniEval, for various NLG tasks. UniEval correlates well with human judgments on three typical generation tasks and exhibits excellent transfer performance.

Limitations
We state the limitations of this paper from the following four aspects: 1) Most current evaluators, including UniEval, are black-box models. With the support of pre-trained language models, neural evaluators can already correlate well with human judgments, but it is still unclear how the model predicts these evaluation scores. Therefore, a better understanding of the evaluation process of different evaluators, or the development of an interpretable multi-dimensional evaluator, may be the next stage for improving NLG evaluation.
2) Most neural evaluators are trained on synthetic data, and the pseudo data constructed in this paper still contain noise. For instance, for fluency in summarization, removing an unimportant span may not affect the fluency of the sentence, but we always treat the sentence after deletion as a negative sample. Thus, how to improve the quality of synthetic data could be an interesting topic.
3) We only use T5-large as the backbone model in the experiments due to limited computational resources. How to extend the use of neural evaluators by using smaller models while retaining similar performance, or how to introduce more data to build larger evaluators with better performance, could be two future research directions.
4) We follow the categorization of NLG tasks in Deng et al. (2021) and select three typical tasks for our experiments, but UniEval is still limited to English tasks. Generation tasks for cross-language scenarios are left for future work.

A.1 Explanation of Each Dimension
We introduce the different dimensions for text summarization in Section 3.2. Here we include detailed descriptions of the different dimensions in the dialogue response generation and data-to-text tasks.
For dialogue response generation (Mehri and Eskenazi, 2020):
• 1) Naturalness: judge whether a response is like something a person would naturally say.
• 2) Coherence: determine whether this response serves as a valid continuation of the previous conversation.
• 3) Engagingness: determine if the response is interesting or dull.
• 4) Groundedness: given the fact that this response is conditioned on, determine whether this response uses that fact.
For data-to-text (Wen et al., 2015): • 1) Naturalness: determine whether the utterance could plausibly have been produced by a human.
• 2) Informativeness: determine whether the utterance contains all the information in the given content.

A.2 Pseudo Data Construction for Dialogue Response Generation
We produce pseudo data for the four dimensions of the dialogue response generation task as follows: • 1) Naturalness: similar to fluency in summarization, except that we modify λ to 3.
• 2) Coherence: we randomly select gold response from other dialogues as the negative samples.
• 4) Groundedness: this dimension measures how well the response refers to the knowledge context in knowledge-based conversations (Dinan et al., 2019). Therefore, we randomly extract a sentence from the current knowledge context and use a paraphrase generator to rewrite it as a positive example, and sample a sentence from other knowledge contexts as a negative example.

A.3 Examples for Evaluation Tasks
We provide concrete examples for different dimensions of the evaluation tasks in Table 7. All the pseudo data is constructed on the CNN/DailyMail (Hermann et al., 2015) and Topical-Chat (Gopalakrishnan et al., 2019) corpora.
We input the reference text y (green text) to the model only when evaluating the relevance dimension in text summarization; for the other dimensions, UniEval is a reference-free evaluator. Depending on the specific dimension, we feed the model with different contexts c. In addition, we use "\n" to separate the different turns in the dialogue history and end it with "\n\n".
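The dialogue-history convention above is simple enough to pin down in code; this is a sketch of the stated separator rule only, and the function name is ours:

```python
def format_dialogue_history(turns):
    """Join dialogue turns with "\n" and terminate the history with "\n\n",
    following the input convention described for UniEval's dialogue contexts."""
    return "\n".join(turns) + "\n\n"
```

For example, a two-turn history becomes "hi! do you like basketball?\nyes, i am a big raptors fan.\n\n", matching the layout shown in the Table 7 examples.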

B Examples for Intermediate Tasks
We also include examples for each intermediate task in Table 8. We define the input as a (c, q) pair and let the model answer with "Yes" or "No".

C Implementation Details
We first train T5 on intermediate tasks for 2 epochs.
For the evaluation tasks, we construct pseudo data on the CNN/DailyMail (Hermann et al., 2015) and Topical-Chat (Gopalakrishnan et al., 2019) corpora for summarization and dialogue generation, respectively. The number of samples for each dimension is 30k, with an equal number of positive and negative examples. We set the batch size to 36 and the maximum learning rate to 5e-5 for both stages. Regarding continual learning, we randomly select 20% of the data from the previously learned tasks to replay.

Excerpts from Table 7 (evaluation task examples):

Relevance (Summarization), label: Yes
question: Is this summary relevant to the reference? </s> summary: The Met Office issued severe weather warnings across the country yesterday as experts predicted a weekend washout. Forecasters said wind and rain will continue to batter the country until Tuesday at the earliest, as fears grew that the Somerset Levels could be flooded again. The North will be wet and windy for the next three days, with showers also scattered across the South West. </s> reference: Heavy rain caused massive tailbacks yesterday as a pothole opened up across three lanes of the M25. Fear grow that the Somerset Levels could be flooded again. North will be wet and windy for next three days, with showers also scattered across the South West.

Naturalness (Dialogue), label: No
question: Is this a natural response in the dialogue? </s> response: yes and that launched the career of many people, the most notable being han solo. the acting in the first was the most notable being han solo. the acting atrocious but got better as more movies were made. all told a great movie.

Coherence (Dialogue), label: Yes
question: Is this a coherent response given the dialogue history? </s> response: wow that is a lot of money for a logo. did you know corproate sponsors pay $1.12 billion on the nba last year!? </s> dialogue history: hi! do you like basketball? \n yes, i am a big raptors fan. it's crazy how much companies are paying to put their logos on nba jerseys. \n i am sure it is extremely high to advertise on jerseys. do you know how much? \n different team to team but everyone seems to get them these days. i did see geico is paying 6.5 million per year for their patch on the wizards jersey! \n\n

Engagingness (Dialogue), label: No
question: Is this an engaging and informative response according to the dialogue history and fact? </s> response: i'm so glad i'm not the only one who thinks this. </s> dialogue history: do you follow american politics? \n some, i am not surprised that the first phone number in the white house was 1. lol \n it definitely helped people reach the white house the fastest. i am surprised they still use floppy disks for storage. \n\n </s> fact: president jimmy carter turned all white house thermostats down to 65 degrees during the winter of 1977. the very first phone number of the white house was "1". jimmy carter had solar panels installed on the white house... and ronald reagan had them removed. there is a replica of the white house in atlanta which was built as a private home. you can mail a birth announcement to the white house and they'll send you a congratulations card back.

Groundedness (Dialogue), label: (truncated in source)
question: Does this response use knowledge from the fact? </s> response: batman's city of gotham is located in new jersey and neither batman nor the villain the joker refer to one another by name. </s> fact: there was a batman villain named condiment king and he was defeated by slipping on his own ketchup. adam west has a batman logo on one of his molars. according to dc canon, batman's gotham city is located in new jersey. in their face to face confrontations, neither batman nor joker refer to one another by name. weird al yankovich did voiceover work in the most recent dc animated film "batman vs robin".

Yes
Table 7: Examples of different dimensions in evaluation tasks. The red text indicates the question q, the green text denotes the candidate output x, the yellow text is the reference text y, and the blue text represents the context c.

For the evaluation tasks, we train the evaluator for 1-3 epochs, depending on the NLG task. We train UNIEVAL on two A6000 GPUs for a total of 5 hours. If the meta-evaluation benchmark contains multiple references, we only use the first one as input.
In addition, although we can compute the scores for all dimensions directly from Equation 1, we slightly modify the score calculation for certain dimensions due to their characteristics. For example, for fluency and consistency in summarization, disfluency and inconsistency are usually detected using sentences as the basic unit (Fabbri et al., 2021; Laban et al., 2021), so we split the model output x into sentences and calculate the score s_{ij} for the j-th sentence as:

s_{ij} = P("Yes" | x_j, y, c, q_i) / ( P("Yes" | x_j, y, c, q_i) + P("No" | x_j, y, c, q_i) ).    (2)

The final score for x in these two dimensions is then s_i = (1/m) Σ_{j=1}^{m} s_{ij}, where m is the number of sentences in x. Another special dimension is engagingness in dialogue generation. Since it indicates the total volume of interesting facts presented in the response (Deng et al., 2021), we use a summation instead: s_i = Σ_{j=1}^{m} s_{ij}. Therefore, the scoring range for engagingness is [0, +∞), while that of all other dimensions is [0, 1].
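The sentence-level scoring and aggregation above can be sketched as follows. The probabilities are placeholder values standing in for the model's likelihoods of "Yes" and "No" on each sentence, and the function names are ours:

```python
def boolean_score(p_yes, p_no):
    # Equation (2): the normalized probability of answering "Yes".
    return p_yes / (p_yes + p_no)


def dimension_score(sentence_probs, aggregate="mean"):
    # sentence_probs: one (P("Yes"), P("No")) pair per sentence x_j.
    scores = [boolean_score(p_yes, p_no) for p_yes, p_no in sentence_probs]
    if aggregate == "sum":
        # Engagingness: total volume of interesting facts, range [0, +inf).
        return sum(scores)
    # Fluency / consistency: mean over sentences, range [0, 1].
    return sum(scores) / len(scores)


probs = [(0.9, 0.1), (0.6, 0.4)]
mean_score = dimension_score(probs)         # mean aggregation -> 0.75
sum_score = dimension_score(probs, "sum")   # sum aggregation  -> 1.5
```

The "sum" branch reproduces the unbounded engagingness score, while the default branch reproduces the [0, 1] scores used for the other dimensions.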

D Results on QAGS
Advanced NLG models suffer from the problem of generating text that is inconsistent with the source document (Cao et al., 2018), which has led recent research to develop evaluators for the consistency dimension in summarization (Kryściński et al., 2020; Wang et al., 2020; Cao et al., 2020; Durmus et al., 2020). Therefore, we specifically compare the single-dimensional version of UNIEVAL for consistency with state-of-the-art factuality checkers.
We conduct experiments on the QAGS meta-evaluation benchmark, which contains two different summarization corpora: CNN/DailyMail (Hermann et al., 2015) and XSum (Narayan et al., 2018). As shown in Table 9, BARTScore performs best on the more extractive part (QAGS-CNN), but shows poor correlation on the more abstractive subset (QAGS-XSum). UNIEVAL (Consistency) correlates well on both parts of the data, especially on the more challenging XSum subset, greatly outperforming all previous consistency detectors. On average, UNIEVAL (Consistency) outperforms the state-of-the-art evaluator CTC by more than 30% in terms of Spearman and Kendall-Tau correlations. Thus, a high-performance single-dimensional evaluator can also be developed under our proposed framework.

Figure 1: The overall framework of UNIEVAL. We convert all tasks into a Boolean QA problem and utilize the model to answer with "Yes" or "No". This unified QA format allows the model to incorporate external knowledge from multiple related tasks, i.e., intermediate multi-task learning. Then we construct pseudo data for each dimension and train on them sequentially to obtain UNIEVAL.

Figure 2: Zero-shot performance on the "understandability" dimension in dialogue response generation.

Table 2: Statistics for intermediate tasks. A positive sample indicates that the model should answer with "Yes".

Table 3: Summary-level Spearman (ρ) and Kendall-Tau (τ) correlations of different metrics on the SummEval benchmark. The underlined numbers indicate the results of transferring a single-dimensional evaluator to other dimensions.

Table 4: Turn-level Pearson (r) and Spearman (ρ) correlations of different metrics on the Topical-Chat benchmark. The underlined numbers indicate the results of transferring a single-dimensional evaluator to other dimensions.
The underlined numbers in Table 3 are transferred results. Overall, UNIEVAL is better than CTC, but we can see that no single-dimensional evaluator transfers well to all dimensions. For example, both consistency ⇒ coherence and fluency ⇒ relevance exhibit poor correlations, indicating that evaluators focusing solely on a single evaluation dimension lack acceptable transfer capability.

Table 6: Ablation study of UNIEVAL. '-' means we remove this task from intermediate multi-task learning.
Is this a coherent summary to the document?</s> summary: Theodore Wafer's statement contradicts his attorney's claim that he feared for his life and acted in self defense when he killed Renisha McBride.The shotgun is a Mossberg pump-action 12-gauge with a pistol grip.</s> document: On trial: Theodore Wafer, 55, initially told police that the shooting was an accident ...

Table 8: Examples for different intermediate tasks. Since these tasks are not directly relevant to evaluation, we recognize all parts except the question q as the context c.

Is this a claim consistent with the premise?</s> claim: The two victims were teenagers.</s> premise: 2 seriously wounded in Grand Crossing shooting Two men were seriously wounded in a shooting Thursday evening in the Grand Crossing neighborhood on the South Side.The men — ages 28 and 39 — were shot at by someone inside a vehicle that pulled up to them at 6:11 p.m. in the 7300 block of South Dante, Chicago Police said.The older man was shot in his face and taken to Northwestern Memorial Hospital in serious condition, police said.The younger man took himself to Jackson Park Hospital with a gunshot wound to his shoulder in serious condition.

Is this sentence the coherent first sentence of the document?</s> sentence: Diego Maradona thinks Steven Gerrard and England's defence should be held responsible for the defeat to Uruguay that crushed their World Cup hopes.</s> document: Bomb disposal experts examined the mortars and confirmed that two contained white phosphorous, a police spokesman said.It was not the first time that Hamas has attempted to target Israel using mortars containing white phosphorous, the spokesman said.White phosphorus ignites and burns, creating white smoke when it is exposed to oxygen.Militaries use it as a smoke screen to protect troops during combat ...

She has five friends.Sent 4: Her mom said that Susan can invite them all to the party.Sent 5: Her first friend could not go to the party because she was sick.Sent 6: Her second friend was going out of town.Sent 7: Her third friend was not so sure if her parents would let her.Sent 8: The fourth friend said maybe.Sent 9: The fifth friend could go to the party for sure.Sent 10: Susan was a little sad.Sent 11: On the day of the party, all five friends showed up.Sent 12: Each friend had a present for Susan.Sent 13: Susan was happy and sent each friend a thank you card the next week.