Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation

Natural language generation (NLG) spans a broad range of tasks, each of which serves specific objectives and desires different properties of the generated text. This complexity makes automatic evaluation of NLG particularly challenging. Previous work has typically focused on a single task and developed individual evaluation metrics based on specific intuitions. In this paper, we propose a unifying perspective based on the nature of information change in NLG tasks, including compression (e.g., summarization), transduction (e.g., text rewriting), and creation (e.g., dialog). _Information alignment_ between input, context, and output text plays a common central role in characterizing the generation. With automatic alignment prediction models, we develop a family of interpretable metrics that are suitable for evaluating key aspects of different NLG tasks, often without the need for gold reference data. Experiments show that the uniformly designed metrics achieve stronger or comparable correlations with human judgement than state-of-the-art metrics on each of several diverse tasks, including text summarization, style transfer, and knowledge-grounded dialog.


Introduction
Natural language generation (NLG) refers to the broad set of tasks that produce fluent text from input data and other contextual information. The diverse tasks serve vastly different uses in practice. For example, summarization compresses a source article into a short paragraph containing the most important information; translation transduces content expressed in one language into another; and a chatbot creates novel responses to drive the conversation. Recent years have seen remarkably fast progress in improving and making new models
Figure 1: Illustration of the three categories of NLG tasks in terms of information change: compression (e.g., summarization), transduction (e.g., style transfer), and creation (e.g., dialog). Task input is in the blue box and output in the orange box. Text in red in the dialog output box represents newly created information.
for NLG tasks. However, evaluation of NLG has long been considered difficult (Kryscinski et al., 2019;Mathur et al., 2020): human evaluation is often prohibitively expensive and slow, while accurate automatic evaluation is challenging given the complexity of text modeling and the diverse aspects to be measured for different NLG tasks.
Previous work has developed a large variety of automatic metrics. A popular general strategy is to measure the similarity of generated text against human-written references, such as the classical BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), and more recent variants based on neural models (e.g., Zhang et al., 2020a;Sellam et al., 2020). However, an NLG task typically involves multiple desirable properties (e.g., consistency, conciseness, richness) that may have different priorities and need trade-off depending on the application scenarios (Hashimoto et al., 2019;Mir et al., 2019;Mehri and Eskenazi, 2020b;Gehrmann et al., 2021). Thus a single score without multi-aspect interpretability is often inadequate to characterize generation quality.
A growing number of recent works have proposed aspect-based metrics for popular tasks such as summarization (Kryściński et al., 2019;Wang et al., 2020) and dialog (Mehri and Eskenazi, 2020b;Nie et al., 2020). These metrics are typically each designed for an individual task and aspect, based on specific intuitions. The lack of a common theoretical ground makes it difficult to share evaluation strengths across the diverse NLG problems, and offers little guidance for metric design for emerging tasks and aspects.
In this paper, we propose a more unifying perspective of NLG evaluation through the lens of information change, which offers a general framework to measure many key aspects of NLG tasks. In particular, based on the practical use of NLG, each task can be seen as one of (1) compression to express salient information in concise text, such as summarization and image captioning; (2) transduction to transform text while preserving content precisely, such as translation and style transfer; and (3) creation to produce new content from input context, such as dialog and story generation. A common concept underlying the three broad categories is information alignment, which we define as the extent to which the information in one generation component is grounded in another. Here the generation components include input, output, additional context, and references when available.
Inspired by recent work on model-based evaluation, we adopt contextualized language models to measure information alignment. We then demonstrate the framework by devising a family of highly intuitive metrics for a representative task (and its key aspects) in each of the three categories: summarization (relevance and consistency), style transfer (content preservation), and knowledge-based dialog (engagingness and groundedness). Experiments show that the uniformly designed metrics robustly outperform or compete with state-of-the-art metrics specifically designed for each task, in terms of correlations with human judgement. We also study different implementations of the central information alignment estimation model, showing that an improved alignment measure leads to better evaluation quality across all tasks and aspects.

Related Work
Task- and Aspect-Specific NLG Evaluation. Canonical automatic evaluation metrics (Papineni et al., 2002;Lin, 2004) often compute a single score measuring some form of similarity between outputs and human-written references. Later learning-based approaches aggregate multiple features to regress on human-rated quality scores for different tasks (Lowe et al., 2017;Peyrard et al., 2017;Sellam et al., 2020). Researchers have also identified that a single evaluation score cannot account for the variety of quality factors in multifaceted NLG applications. A number of metrics were then proposed for specific tasks, either to evaluate multiple aspects (Mehri and Eskenazi, 2020b;Egan et al., 2021) or to focus on one particular aspect (Kryściński et al., 2019;Mehri and Eskenazi, 2020a;Nie et al., 2020;Durmus et al., 2020;Wang et al., 2020). Our framework continues this line of research to produce interpretable metrics for multiple aspects. While recent evaluation frameworks each discussed the key evaluation aspects of one NLG task (Venkatesh et al., 2018;Mir et al., 2019;Yamshchikov et al., 2020;Fabbri et al., 2021), our framework provides a unified methodology that facilitates metric design for all three main categories of tasks. We also highlight that all of our metrics (except for the relevance metric for summarization) are reference-free once trained.
Several emerging NLG benchmarks (Gehrmann et al., 2021;Liu et al., 2021) collected existing metrics for various tasks, whereas we aim at developing new unified metrics with stronger performance. Belz et al. (2020) proposed a categorization for different NLG quality aspects. Our general framework covers all the described types of quality.
Text-to-Text Information Alignment. Measuring information overlap between texts is a recurring theme in designing NLG evaluation metrics. It has typically been approximated by n-gram overlap (Papineni et al., 2002;Popović, 2015), synonym matching (Banerjee and Lavie, 2005), and embedding similarities (Kusner et al., 2015). Recently, pretrained models (Devlin et al., 2019) were introduced to improve token-level embedding matching (Zhang et al., 2020a) and to leverage extrinsic capabilities such as question answering (Eyal et al., 2019;Wang et al., 2020) and entailment classification (Falke et al., 2019;Kryściński et al., 2019;Zhou et al., 2020) to align variable-length spans and entire sentences. Egan et al. (2021) proposed an automatic Shannon Game (Hovy and Lin, 1998) to measure the decrease in the information one can gain from a document after observing its summary; Peyrard (2019) conducted a theoretical analysis to characterize the information change among source document, background knowledge, and summaries. These methods are often restricted to a single task, while we offer a general framework adaptable to a wide range of tasks and aspects.

A Unified Evaluation Framework
We present the new framework that offers a common foundation for characterizing diverse NLG tasks and leads to a set of interpretable metrics for evaluating their key aspects.
As discussed in §1, NLG tasks can be categorized as performing compression, transduction, or creation based on the change in conveyed information from input to output. For a compression task (e.g., summarization), the goal is to concisely describe the most important information in the input (e.g., a document). That is, the output should only contain content from the input, namely "consistency" (Cao et al., 2018;Kryscinski et al., 2019;Zopf et al., 2016;Peyrard, 2019), and the included content must be salient, namely "relevance" (Nenkova and Passonneau, 2004;Zopf et al., 2016). Intuitively, with an "information alignment" measure that assesses how the information in a generated output overlaps with that in the input (and in references that offer clues for salience), we can readily evaluate the two key aspects. The same intuition applies to transduction tasks (e.g., style transfer), where the output must preserve the input content precisely. The evaluation of "preservation" (Mir et al., 2019) thus also boils down to measuring the information alignment between input and output. A creation task (e.g., dialog) generates output that adds new information (e.g., from external knowledge) on top of the input (e.g., dialog history). Information alignment between the output, input, and external sources is thus essential for evaluating how well the created content engages with the context (Venkatesh et al., 2018;See et al., 2019) and how meaningful the content is through grounding in the external sources (Dinan et al., 2019a;Smith et al., 2020).
From the above perspective, information alignment arises as a common central component that connects evaluations across the tasks. A single accurate alignment prediction model would enable us to reliably evaluate many relevant aspects in various applications.
Next, we first present our definition of information alignment (§3.1); then describe the details of how the aspect metrics for compression, transduction, and creation are built on the alignment (§3.2-3.4); we finally discuss different effective implementations of the underlying alignment estimation model based on neural networks (§3.5).

Preliminaries
For an NLG task, let x be the input, c be any additional context, and y be the output text generated conditioned on x and c. For example, in knowledge-based dialog, x is the dialog history, c is external knowledge such as a Wikipedia article, and y is the response. In the current work, we assume both x and c to be text, but the general framework is also applicable when x and c are in other modalities (e.g., images, tables), as long as we can measure their information alignment with y as defined below (e.g., using cross-modal models). In some tasks, a gold-standard output written by humans is available, which we denote as r.
As above, information alignment is the central module for NLG evaluation. We consider the alignment from arbitrary text a to b as token-level soft alignment. More formally:

Definition 3.1 (Information Alignment). Let a be a piece of text of length N, and let b be arbitrary data. The information alignment from text a to b is a vector of alignment scores:

align(a → b) = ⟨α_1, α_2, . . . , α_N⟩ , (1)

where α_n ∈ [0, 1] is the confidence that the information of the n-th token in a is grounded by b, i.e., the n-th token aligns with b.
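As a concrete (toy) illustration of Definition 3.1, the sketch below represents the alignment as one score per token of a. The lexical-overlap scoring used here is purely for demonstration; the actual model-based estimators are described in §3.5.

```python
from typing import List

def align(a_tokens: List[str], b: str) -> List[float]:
    """Information alignment (Definition 3.1): one score per token of `a`,
    each in [0, 1], giving the confidence that the token's information is
    grounded by `b`. This stub scores a token 1.0 if it occurs in `b`,
    else 0.0 -- a stand-in for the learned estimators of Section 3.5."""
    b_tokens = set(b.lower().split())
    return [1.0 if tok.lower() in b_tokens else 0.0 for tok in a_tokens]
```

Note that even this toy version is one-directional: it scores the tokens of a against b, not the reverse.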
Note that the alignment is "one-directional" from a to b: it does not measure how b aligns to a. We next show how the alignment scores can be used to define intuitive metrics for various tasks. Besides, the fine-grained alignment scores also offer a certain level of interpretability for the resulting metrics, as illustrated by the example in Table C.1.

Evaluation of "Compression" Tasks
We discuss compression evaluation in the context of text summarization, an extensively studied task for evaluation in previous work. The task aims to extract the most important information from document x and express it in summary y. As above, consistency and relevance have been widely identified as key aspects to characterize the content quality of generated summaries (Cao et al., 2018;Kryscinski et al., 2019;Zopf et al., 2016;Peyrard, 2019). We propose our metrics below.
Consistency We adopt the prevailing definition of consistency (Cao et al., 2018;Kryscinski et al., 2019), which dictates that the summary y should only contain information from x (instead of other sources or hallucinations). The aspect is also referred to as "factual correctness" or "faithfulness" in previous work 2 . For y to be fully consistent, all tokens in y should align with x. Therefore, we can straightforwardly devise the consistency metric based on the information alignment defined above: CONSISTENCY(y, x) = mean (align(y → x)) , (2) which is the average alignment scores of tokens in y w.r.t. x. Our metric offers a simpler solution than the recent QA-based metrics (Scialom et al., 2019;Durmus et al., 2020;Wang et al., 2020) that compare the answers extracted from y and x by a Question-Answering system, and is more interpretable than the black-box consistency classification models (Falke et al., 2019;Kryściński et al., 2019;Maynez et al., 2020). We also achieve stronger empirical performance ( §4.1).
Relevance As one of the most heavily studied aspects of summarization, relevance concerns how well the summary y retains important information in x (Nenkova and Passonneau, 2004;Zopf et al., 2016). As in previous work, the "importance" of information can be determined by human-written reference summaries r: a piece of information is considered important if it is mentioned in a reference. This intuition can readily be captured by the information alignment align(r → y), which measures the extent to which information in reference r is covered by the summary y. Additionally, we account for the criterion that any information in y should be precise, i.e., consistent with x. Combining the two considerations, the full definition of our relevance metric conveys the intuition that a fully relevant summary y should achieve and balance both reference alignment and consistency:

RELEVANCE(y, x, r) = mean (align(r → y)) × mean (align(y → x)) , (3)

which is the product of the two components. Traditional reference-based metrics consider only the reference text (rather than the input). For example, ROUGE (Lin, 2004) can be seen as measuring the alignment between y and r where the alignment is defined by text matching. Our metric, with the combination of both reference and input, plus better alignment modeling (§3.5), greatly outperforms those previous metrics (§4.1).

2 For the aspects studied in this paper, we summarize in Table B.1 the alternative names used in previous work.
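The consistency and relevance metrics above can be sketched as follows, using a hypothetical `align(a, b)` that returns per-token alignment scores for the tokens of a against b (here stubbed with simple lexical overlap for demonstration; the model-based estimators are described in §3.5).

```python
# Sketch of the consistency (Eq. 2) and relevance (Eq. 3) metrics.
# `align` is a toy stand-in for the estimators of Section 3.5: a token of
# `a` scores 1.0 if it also appears in `b`, else 0.0.
def align(a, b):
    b_tokens = set(b.lower().split())
    return [1.0 if tok in b_tokens else 0.0 for tok in a.lower().split()]

def mean(scores):
    return sum(scores) / len(scores)

def consistency(y, x):
    # Eq. 2: average alignment of summary tokens to the input document.
    return mean(align(y, x))

def relevance(y, x, r):
    # Eq. 3: reference coverage of y, times y's consistency with x.
    return mean(align(r, y)) * mean(align(y, x))
```

With this stub, a summary that copies from the document scores consistency 1.0, while a hallucinated token (one absent from the document) lowers the score proportionally.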

Evaluation of "Transduction" Tasks
We take style transfer as the example task to discuss semantic preservation of transduction tasks. The aim of style transfer is to generate text y that changes one or more stylistic attributes (e.g., formality) of source text x and completely preserve its style-independent information (Hu et al., 2017;Shen et al., 2017). Measuring content preservation is the core yet challenging problem for the evaluation.
Preservation A transduction result y is required to contain all and only information from x. In other words, all tokens in y should align with x, and vice versa. Considering the former the "precision" of y's information w.r.t. x, and the latter the "recall", we naturally arrive at the following "F1"-style definition of the preservation metric:

PRESERVATION(y, x) = 2 × mean (align(y → x)) × mean (align(x → y)) / (mean (align(y → x)) + mean (align(x → y))) , (4)

which is the harmonic mean of the two directions of information alignment. Note that the two-way alignments differ from the "consistency" and "relevance" metrics for compression, where we only required the output y to align with the input x. Our experiments show that it is crucial to account for alignments in both directions for transduction (§4.2).
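The two-way harmonic-mean aggregation of Eq. 4 can be sketched as follows, again with a toy lexical-overlap `align` standing in for the model-based estimators of §3.5.

```python
# Sketch of the F1-style preservation metric (Eq. 4).
def align(a, b):
    # Toy stand-in for the learned alignment estimators (Section 3.5).
    b_tokens = set(b.lower().split())
    return [1.0 if tok in b_tokens else 0.0 for tok in a.lower().split()]

def mean(scores):
    return sum(scores) / len(scores)

def preservation(y, x):
    precision = mean(align(y, x))  # y's information grounded in x
    recall = mean(align(x, y))     # x's information covered by y
    if precision + recall == 0.0:
        return 0.0
    # Harmonic mean of the two alignment directions.
    return 2 * precision * recall / (precision + recall)
```

An output that adds content not in the input is penalized through precision, while an output that drops input content is penalized through recall, matching the "all and only" requirement above.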

Evaluation of "Creation" Tasks
We formulate aspects of creation tasks using the example of knowledge-grounded dialog generation. In this task, an agent generates text y as a response to conversation history x while exhibiting information from knowledge context c, e.g., an external document (Qin et al., 2019;Guo et al., 2018) or a set of facts (Dinan et al., 2019b;Zhang et al., 2018). For the agent, sustaining an engaging conversation is considered an essential skill (Venkatesh et al., 2018;Guo et al., 2018;Mehri and Eskenazi, 2020b). Besides, the generated response must be grounded in the knowledge context by referring to its information as often as possible (Dinan et al., 2019a;Smith et al., 2020). We devise metrics for the two central aspects, respectively. A crucial property of creation tasks is that the agent is allowed to create new information beyond
Figure 2: We study three effective ways of information alignment prediction, i.e., embedding matching (left), discriminative model (upper right) and regression (lower right). The figure illustrates the estimation of alignment from output to input.
the input and context. Thus, to aggregate the information alignment vector, it is more suitable to consider the total volume rather than the density. That is, we would use sum(·) instead of the previous mean(·) to aggregate token-level alignment scores.
Engagingness We adopt the common definition of engagingness (e.g., Mehri and Eskenazi, 2020b): the response should not be generic or dull (e.g., "I don't know"), but should engage the partner in conversation, for instance by presenting an interesting fact. Therefore, an engaging response y should provide a high volume of information that acknowledges both the history x, to engage the partner, and the context c, which we assume contains relevant facts. This naturally leads to the following metric definition:

ENGAGINGNESS(y, x, c) = sum (align(y → [x, c])) , (5)

where we concatenate the history x and knowledge context c, and measure the extent of response y's acknowledgement of their information. Previous work has devised various metrics for this aspect, ranging from response-topic consistency (Guo et al., 2018) and conversation length (Venkatesh et al., 2018) to retrieval of reference responses (Mehri and Eskenazi, 2020b). Our metric is cleanly defined in line with all the other metrics we developed, and shows stronger human correlation than previous designs.
Groundedness As a widely studied aspect of knowledge-based dialog, groundedness measures how well the response refers to the knowledge context (Dinan et al., 2019b;Qin et al., 2019;Mehri and Eskenazi, 2020b). Straightforwardly, the aspect can be evaluated with the following metric: GROUNDEDNESS(y, c) = sum (align(y → c)) , which measures the alignment between the response y and knowledge context c.
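The two creation metrics differ from the earlier ones only in the sum aggregation, which rewards information volume rather than density. A toy sketch, with a hypothetical lexical-overlap `align` standing in for the estimators of §3.5:

```python
# Sketch of the engagingness and groundedness metrics for creation tasks.
# Note sum(...) instead of mean(...): creation rewards the total volume of
# aligned information, so longer grounded responses score higher.
def align(a, b):
    # Toy stand-in for the learned alignment estimators (Section 3.5).
    b_tokens = set(b.lower().split())
    return [1.0 if tok in b_tokens else 0.0 for tok in a.lower().split()]

def engagingness(y, x, c):
    # Alignment of the response to the concatenated history and knowledge.
    return sum(align(y, x + " " + c))

def groundedness(y, c):
    # Alignment of the response to the knowledge context only.
    return sum(align(y, c))
```

Under this stub, a longer response that keeps referring to the history and knowledge accumulates a higher score, while a short generic reply ("I don't know") scores near zero.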

Implementation of Alignment Estimation
We have presented the metrics for a range of key aspects in different tasks, building on the core information alignment measure (Definition 3.1). We next discuss different effective implementations for measuring the alignment scores between text, including embedding matching, discriminative model, and regression, all based on powerful pretrained language models ( Figure 2).
Embedding Matching (E) One simple way to estimate the alignment vector align(a → b) is by matching the embeddings of tokens in the two sequences. Specifically, we use either pretrained BERT (Devlin et al., 2019) or RoBERTa (Liu et al., 2019) to extract contextual embedding for each token in a and b, normalize each embedding vector to unit norm, and then use greedy matching following (Corley and Mihalcea, 2005;Zhang et al., 2020a). That is, the alignment score of each token in a is defined as its maximum cosine similarity with the tokens in b. We found in our empirical studies ( §4) that the E method seems to work better when a and b have similar volume of information (so that one-to-one token matching is suitable).
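The greedy-matching step can be sketched with plain NumPy over precomputed token embedding matrices; in our setting these matrices would come from BERT or RoBERTa contextual embeddings, but the matching logic is the same for any embeddings.

```python
import numpy as np

def greedy_match_alignment(emb_a, emb_b):
    """Alignment scores for tokens of `a` against `b` via greedy matching:
    each `a` token's score is its maximum cosine similarity over the
    tokens of `b`. emb_a: (N, d) token embeddings of `a`; emb_b: (M, d)."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sim = a @ b.T            # (N, M) cosine-similarity matrix
    return sim.max(axis=1)   # greedy: best matching `b` token per `a` token
```

Because each token of a is matched independently to its single best counterpart in b, this estimator is most reliable when the two texts carry similar volumes of information, consistent with the observation above.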

Discriminative Model (D)
To estimate the information alignment from arbitrary text a to b, we formulate the problem as sequence tagging: we train a model that labels each token in a with 1 if it aligns with b, and 0 otherwise. The predicted probability of label 1 for each token of a serves as its alignment score. We base our model on RoBERTa and train with automatically constructed weak supervision data (Appendix §A describes all details). For example, to learn to estimate the alignment of the output y to the input in an NLG task, we use the training corpus of the task: for each output y, we perturb it by masking out portions of tokens and using a pretrained BART (Lewis et al., 2020) to infill the masks. The infilled tokens are generated without conditioning on the input context (e.g., x), so they can be considered to not align with the input. We do the masking by first applying constituency parsing to the text and then randomly masking out a subtree of the parse. Besides the infilling data, we also augment the training with paraphrasing data: we apply a paraphrasing model to y and treat all tokens in the paraphrases as aligned with the input. Note that y need not be the gold output, but can be any automatically constructed output as long as it is guaranteed to align fully with the input. For example, an output y by an extractive summarization model aligns fully with the input article.

Figure 3: Correlations with human judgement on consistency in summarization (CNN/DM with SummEval annotations; XSUM with QAGS annotations). E denotes our metrics using embedding-matching alignment estimation, D using the discriminative model, and R using regression. Reference-based metrics are in blue, reference-free metrics in purple, and our metrics in red/orange.

Table 1: Variants of our relevance metric (Eq. 3) using different components and combination strategies. r → y corresponds to mean (align(r → y)) and similarly for y → x; + sums the two components and × is our design that takes the product.
We will see more examples in our experiments.
Aggregated Regression (R) Instead of estimating the per-token alignment vector as defined in Eq. (1), we may also directly estimate the single aggregated alignment score, such as mean (align(a → b)) (or sum), since all the metrics proposed above use only the aggregated score. To this end, we train a regression model using the same weak-supervision data as for D, with the aggregated alignment score as the regression target. Similar to Sellam et al. (2020), in our experiments we implement the regression model with BERT (Devlin et al., 2019); in particular, we initialize it with the intermediate BERT-base-midtrained model weights provided by Sellam et al. (2020). We note that this aggregated estimation may not be applicable to future metrics in our evaluation framework when fine-grained per-token alignment is required.

Experiments
We evaluate the proposed metrics on commonly used human annotation datasets for summarization ( §4.1), style transfer ( §4.2) and dialog ( §4.3), and study the effect of information alignment accuracy on the performance of metrics ( §4.4).
Evaluation Criteria To measure a metric's performance on an aspect, we compute the sample-level correlation between the metric scores and human judgments on generation samples. We also evaluate system-level correlation (based on the ranking of comparison systems) as a secondary criterion (Mathur et al., 2020) and report results in the appendix; it typically exhibits the same patterns as sample-level correlation. We measure Pearson and Spearman correlations whenever applicable. We also report Kendall-Tau correlation in the appendix when available.
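The sample-level criterion can be computed with a few lines of Python; the minimal sketch below implements Pearson correlation directly and Spearman as Pearson over ranks (ignoring tie handling, which library implementations such as SciPy's handle properly).

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman correlation = Pearson over rank-transformed scores.
    Simplified: assumes no ties among the values."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(xs), ranks(ys))
```

Spearman captures monotone agreement: metric scores [1, 4, 9] track human scores [1, 2, 3] perfectly in rank even though the relationship is nonlinear, so Spearman is 1 while Pearson falls below 1.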

Experiments for "Compression" Metrics
Datasets For the consistency aspect, we follow previous studies and evaluate metrics using human annotations from two commonly-used sources: (1) SummEval (Fabbri et al., 2021) on the CNN/DM summarization dataset; and (2) QAGS (Wang et al., 2020) (which names the aspect "correctness") on the XSUM dataset (Narayan et al., 2018), another summarization task with a strongly abstractive property. The latter contains 235 outputs from a fine-tuned BART model (Lewis et al., 2020). The QAGS dataset also contains another 239 outputs for CNN/DM, for which we report results in Table D. For the embedding-matching (E) alignment, we use a pretrained model whose training corpus (Nagel, 2016) is close to the summarization domains. For the discriminative-model (D) alignment, we train two RoBERTa-large token classifiers to compute align(y → x) and align(r → y), respectively, with training data automatically constructed for CNN/DM and XSUM according to Appendix §A.1. For the regression (R) alignment, we train the BERT models (§3.5) to estimate the respective mean alignment scores.

Results
We present the consistency results in Figure 3. On CNN/DM, our metrics based on the trained alignment models (D and R) both clearly outperform previous metrics. On XSUM, our D-based metric also achieves the best performance. The E-based metric sees a catastrophic drop in correlations, likely due to the higher abstractiveness of XSUM summaries, which renders embedding matching inadequate. The sentence-classifier-based FactCC metric (Kryściński et al., 2019), trained to distinguish paraphrases from artificially perturbed sentences, also achieves a decent correlation on XSUM. However, it seems unable to effectively model the summaries on CNN/DM, which tend to be longer and richer in information, and thus produces a lower correlation.

Figure 4 shows the results for relevance on CNN/DM. Our metrics strongly outperform all other baselines, showing that accounting for alignments with both references and the input article (Eq. 3) is superior to only considering the references (metrics in blue in the figure) or the input article (metrics in purple). This is further validated by the ablation studies in Table 1, which demonstrate that multiplying the two alignments, which emphasizes joint and balanced achievement of both, improves the correlations compared to individual alignments or simply summing them together. Figure 4 also shows our E-based implementation performs better than the D- and R-based variants, likely because the metric involves alignment between the generation and references, which tend to have similar information volume and thus favor one-to-one token mapping. We observe similar patterns for transduction below.

Experiments for "Transduction" Metrics
Datasets

Figure 6: Correlations with human judgement on the engagingness and groundedness aspects for knowledge-grounded dialog (PersonaChat and TopicalChat). The plot format is the same as Figure 3.
tuning the evaluation models. For our metrics, we use RoBERTa-large-MNLI for embedding matching (E), due to its fine-tuning on entailment detection, which is close to the domain under study. For the discriminative model (D), we train RoBERTa-large on Yelp alignment data created by paraphrasing and perturbing the inputs x. For regression (R), we train to estimate the mean alignment score computed from the same dataset as D.
Results We present preservation results in Figure 5. Our metric (E) achieves competitive or better performance than all previous metrics. MoverScore (Zhao et al., 2019), a strong baseline, computes the word mover's distance (Kusner et al., 2015) between input x and output y token embeddings. In contrast, our metric explicitly accounts for the two-way input-output alignments with an "F1"-style harmonic-mean aggregation (Eq. 4). Table 2 shows the two-way approach is effective and exhibits higher correlation than single-directional alignment, in line with the nature of transduction tasks. Similar to the relevance results in summarization, our D- and R-based implementations fall behind E, likely because token matching is more suitable for measuring alignments between two pieces of text with similar information volume.

Experiments for "Creation" Metrics
Datasets For the engagingness aspect, we use the latest human annotation data collected by Mehri and Eskenazi (2020b) (which names the aspect "interesting") on the PersonaChat and TopicalChat datasets. For the groundedness aspect, we again use the human annotations from Mehri and Eskenazi (2020b) (which names the aspect "uses knowledge") on both PersonaChat and TopicalChat.

Baselines and Setup
We compare with all the diverse metrics studied in Mehri and Eskenazi (2020b) and with FED (Mehri and Eskenazi, 2020a), a set of recent unsupervised dialogue metrics based on the DialoGPT model (Zhang et al., 2020b). We use FED-Interesting from the original paper for engagingness and FED-Informative for groundedness, respectively. We also add a particularly simple baseline, response length, which as we show performs surprisingly well. For our metrics, we use BERT-base for embedding matching (E); RoBERTa-large token classifiers trained to estimate align(y → [x, c]) and align(y → c) for the discriminative model (D); and BERT-base regressors on the sums of the respective alignment scores for regression (R). We create separate alignment datasets for PersonaChat and TopicalChat, as described in Appendix A.3.

Results
We present the results for engagingness in the top two plots of Figure 6. Our metrics with different implementations all improve over previous methods by large margins on the two datasets.
Many of the baseline metrics show decent correlations on TopicalChat but fail on the PersonaChat corpus. This is likely because PersonaChat requires strong dependency of responses on the dialog history and knowledge context, so metrics that do not directly model this dependency as ours do (e.g., USR-DR (Mehri and Eskenazi, 2020b), which is based on response retrieval) struggle to evaluate accurately. Noticeably, the simple response-length baseline performs consistently well on both datasets, far better than previous metrics on PersonaChat. This baseline can be considered a special case of ours that sets the alignment scores of all tokens to 1. The stronger correlations of our model-based metrics demonstrate the effect of accurate alignment.
Ablation studies in Table 3 show that measuring the volume (sum) instead of the density (mean) of aligned information is crucial for the superior performance of our metrics, highlighting the unique characteristics of "creation" tasks (§3.4).
The results for groundedness are shown in the bottom two plots of Figure 6. Our metrics again generally achieve strong correlations, with the R-based metric consistently outperforming the other implementations, likely because the estimation of grounded information volume (sum) benefits from the expressivity of end-to-end models. This is indicated by the underperformance of the D-based metric, which is trained on the same data but aggregates token-level predictions with more structure.
We provide more empirical studies in Appendix §F. In particular, we found that besides the two core aspects, our alignment based method also achieves stronger human correlations than existing metrics on other dialog aspects, such as the understandability and naturalness of responses (Table F.6).

Ablation: higher alignment estimation accuracy, better correlation
We study how the accuracy of information alignment estimation influences the performance of the metrics, and demonstrate a highly desirable pattern: higher alignment estimation accuracy usually leads to better correlation. This indicates that improvements to the single alignment estimation model could immediately benefit the broad range of aspect metrics defined in our unified framework. Specifically, we use the discriminative model (§3.5) for this study. First, we vary the number of training iterations to obtain different model checkpoints, and evaluate both the alignment estimation accuracy and the metric's human correlation at each checkpoint. We evaluate accuracy with the human-annotated token alignment labels on the XSUM summarization data (Maynez et al., 2020). Figure 7 (left) shows the consistency metric achieves better correlation as the alignment accuracy increases. We do the same on the TopicalChat dialog data, evaluating accuracy with our weak-supervision data (since no human labels are available); Figure 7 (right) shows similar trends for the groundedness metric. Second, we further use part of the XSUM human alignment annotations to finetune the alignment model, obtaining even higher accuracy, which in turn gives better correlation for consistency evaluation (star marks in the figure).

Conclusions
We have proposed a general evaluation framework for NLG tasks categorized as compression, transduction, and creation. Based on the concept of information alignment between input, context, and output, we devised a family of interpretable metrics for the key aspects of diverse tasks (summarization, style transfer, and dialog). The uniformly designed metrics achieve superior or comparable human correlations compared to existing metrics. The unified framework offers structured guidance for designing metrics for new aspects and tasks, which we are excited to explore further in the future.

A Implementation of Alignment Estimation Models
We train our alignment models by constructing weakly supervised data from texts in the domain of evaluation. The data construction process can be divided into three steps:
1. Retrieve or generate a target sentence y1 given the desired input x (e.g., the document in summarization tasks). All tokens in y1 are considered aligned with x.
2. Since y1 sometimes consists of sentences copied verbatim from x, we make the task non-trivial and the model more robust by generating a paraphrase y2 of y1 with a pretrained paraphrase generator 4 .
3. We then mask some portion of y2 and use a BART-large model (Lewis et al., 2020) to infill the masks. Because the infilled content is generated without conditioning on x, we label the infilled words as "not aligned" with x (BAD), and the other words of y2 as "aligned" (OK).
Finally, x, y2, and the alignment labels on y2's words constitute the desired training data. Regarding the paraphrasing operation: to make the generated paraphrase sufficiently different from the original text, we always generate 10 paraphrases and take the one with the largest edit distance from the original sentence. Regarding the masking mechanism: we randomly mask some sub-trees in the constituency parse tree of y2, obtained with a pretrained parser 5 . The only differences across tasks are the definitions of x and y1 in step (1), as detailed below.

A.1 Compression: Summarization
Our training for align(y → x) in the summarization domain is reference-free: we use the document as x, and generate its pseudo-summaries as y1 with a traditional unsupervised extractive summarizer based on TextRank (Mihalcea and Tarau, 2004). We do not use reference summaries because they can contain hallucinations that do not align with the article (Maynez et al., 2020). In an ablation study on the XSUM consistency data (Wang et al., 2020), a D model trained with reference summaries reaches a Pearson correlation of 0.2822, clearly lower than the 0.3222 obtained with auto-generated summaries. To train for align(r → y), we use the reference as both x and y1.
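To illustrate the pseudo-summary step, a TextRank-style extractor can be sketched in a few lines: sentences are graph nodes, edges are weighted by token-overlap similarity, and sentences are ranked by a PageRank-style iteration. This is a simplification of Mihalcea and Tarau (2004) for exposition; the actual pipeline presumably uses a full off-the-shelf implementation.

```python
import math
from collections import Counter

def textrank_extract(sentences, top_k=1, damping=0.85, iters=50):
    """Pick the top_k most central sentences as a pseudo-summary y1."""
    toks = [Counter(s.lower().split()) for s in sentences]
    n = len(sentences)

    def sim(a, b):
        # Token overlap, length-normalized as in the TextRank paper.
        overlap = sum((a & b).values())
        denom = math.log(sum(a.values()) + 1) + math.log(sum(b.values()) + 1)
        return overlap / denom if denom > 0 else 0.0

    w = [[sim(toks[i], toks[j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]

    scores = [1.0] * n
    for _ in range(iters):  # PageRank-style power iteration
        new = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                out = sum(w[j])
                if w[j][i] > 0 and out > 0:
                    rank += w[j][i] / out * scores[j]
            new.append((1 - damping) + damping * rank)
        scores = new

    top = sorted(range(n), key=lambda i: -scores[i])[:top_k]
    return [sentences[i] for i in sorted(top)]
```

Because the extracted sentences are copied verbatim from the document, every token of such a pseudo-summary is aligned with x by construction, which is exactly the property step (1) of the data construction relies on.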

A.2 Transduction: Text Style Transfer
In this domain, we simply set y 1 to be the original sentence x.

A.3 Creation: Dialog
When training for align(y → [x, c]), we use the reference response as y 1 and the concatenation of x and c as the input. For models that predict align(y → c), we set the knowledge context c as the input, and randomly extract sentences from it as y 1 . For PersonaChat, we sample 1-3 sentences at random, whereas for TopicalChat, we only sample 1 sentence because its c tends to be long. When aggregating the alignment vectors, we remove stopwords according to NLTK (Bird et al., 2009) to focus on important words.
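A small sketch of this pseudo-target sampling and the stopword-filtered aggregation follows. The stopword set here is a tiny placeholder standing in for NLTK's English list, and the function names are illustrative, not from the released code.

```python
import random

def sample_pseudo_target(context_sentences, max_sents=3, seed=None):
    """Sample a pseudo-target y1 from the knowledge context c:
    1-3 sentences for PersonaChat (max_sents=3), a single sentence
    for TopicalChat (max_sents=1), whose c tends to be long."""
    rng = random.Random(seed)
    k = rng.randint(1, min(max_sents, len(context_sentences)))
    picked = sorted(rng.sample(range(len(context_sentences)), k))
    return [context_sentences[i] for i in picked]

# Placeholder for NLTK's English stopword list.
STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "i", "you"}

def aggregate_alignment(tokens, scores):
    """Mean alignment confidence over content words only:
    stopwords are dropped so the aggregate focuses on important words."""
    kept = [s for t, s in zip(tokens, scores) if t.lower() not in STOPWORDS]
    return sum(kept) / len(kept) if kept else 0.0
```

Dropping stopwords keeps frequent function words, which align almost anywhere, from dominating the averaged alignment vector.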

B Key Aspects
Table B: Key aspects of each task category, with columns Task, Aspect, Alternative Names, and Considered By. [Table body not recovered in extraction.]

C Alignment Prediction Example
DOCUMENT: Darth vader and imperial stormtroopers have invaded a denbighshire seaside town to welcome the actor who plays the infamous villain. Spencer wilding, who hails from rhyl, was the guest of honour at a special screening of rogue one. He had to muster all powers of the force to keep his vader role secret until the film's release. "it's a hell of a secret to keep," said wilding, who was cast as the body actor for the role. "but when you're a professional actor -when you sign that black and white sheet of paper saying you cannot say a word... I'm true to my word and i didn't say anything." Speaking to bbc radio wales' good morning wales programme, the 44-year-old said it proved a tricky task after rumours of the role leaked a year ago. "i've been having hundreds of people every day for a year asking me if i'm vader," he said. "if i had a pound for everyone who asked i'd be buying myself a new death star -and it'd be gold plated." The 6ft 7in (2m) tall actor already has a string of hollywood appearances to his name, including guardians of the galaxy, green lantern, three harry potter films and the tv blockbuster game of thrones. He said the vader role came from a regular casting call, first with a self-filmed tape, then a recall to pinewood studios. "it's very, very secretive. We didn't even know exactly what the character was and what film it was until we got there," he said. "i opened up the curtain when i went in the dressing room and there he was -vader. "anybody out there who got into that costume and got an audition to be darth vader alone is very exciting, so to pull the character off as well, it's like 'what!' "i'm always pinching myself -i am definitely awake -it is not a dream, it is just another dream come true." While the actor has the body role, just like his predecessor in the original star wars films david prowse, the voice of lord vader is actor james earl jones.
That did not stop wilding trying out the voice during filming. "i'm not james earl jones -nowhere near him -but you know i got close to him i think, which helped the other actors -you know, you've got vader in front of you."

SUMMARY 1: [summary text not recovered in extraction]

Caption: An XSUM (Narayan et al., 2018) article. SUMMARY 1 is generated by BART (Lewis et al., 2020) and received a human consistency score of 0 according to Wang et al. (2020), meaning it contains hallucination; SUMMARY 2 is a repetition of "the". As the predictions show, our model assigns low scores to words in red, which either don't follow directly from the article ("latest", "the london film festival", "welsh"), or are meaningless repetitions (the "the"s).

Caption (style transfer results): Explicitly accounting for two-way input-output alignments in an "F1"-style harmonic-mean aggregation (Eq. 4), our metrics (E) achieve competitive or better performance than previous metrics. Our D- and R-based metrics fall behind slightly, likely because one-to-one token matching is more suitable for two text pieces with similar information volume.

Table F.5: Ablation studies: Pearson correlations with engagingness and groundedness for dialog tasks with swapped formulas vs. our definitions. By swapping, we use our engagingness metric to measure groundedness, and vice versa. PersonaChat swaps see across-the-board decreases in correlations, indicating the importance of using our designed formulas on this dataset. TopicalChat swaps see correlation increases more frequently, but the best methods still retain their edge.

Table F.6: Sample-level Pearson correlations for the remaining aspects in the annotations of Mehri and Eskenazi (2020b), including understandable (U), natural (N), maintains context (MC) and overall (O). Our metric here is the average alignment confidence from the response y to the dialogue history x and knowledge c, i.e., mean(align(y → [x, c])), which outperforms existing metrics on understandability and naturalness.