Language Model Augmented Relevance Score

Although automated metrics are commonly used to evaluate NLG systems, they often correlate poorly with human judgements. Newer metrics such as BERTScore have addressed many weaknesses of prior metrics such as BLEU and ROUGE, which rely on n-gram matching. These newer methods, however, are still limited in that they do not consider the generation context, so they cannot properly reward generated text that is correct but deviates from the given reference. In this paper, we propose Language Model Augmented Relevance Score (MARS), a new context-aware metric for NLG evaluation. MARS leverages off-the-shelf language models, guided by reinforcement learning, to create augmented references that consider both the generation context and available human references, which are then used as additional references to score generated text. Compared with seven existing metrics on three common NLG tasks, MARS not only achieves higher correlation with human judgements, but also differentiates well-formed candidates from adversarial samples to a greater degree.


Introduction
Automated metrics such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) are popular methods for evaluating natural language generation (NLG) systems. Compared with human evaluation, they are cheaper and faster, and accordingly they often serve as essential metrics for benchmarking the performance of NLG models (Novikova et al., 2017). Despite their widespread use, however, these automated metrics often correlate poorly with ratings given by human judges, particularly on datasets for which only a single human reference exists (Gupta et al., 2019; Novikova et al., 2017). Moreover, these metrics only capture similarities between generated sentences and reference candidates, crucially ignoring the provided contexts that are relevant for evaluating the answer in contextual NLG tasks such as story generation, news summarization, and question-answering (Tao et al., 2018; Nema and Khapra, 2018).

Table 1 shows a story generation example that illustrates weaknesses of several common metrics. Perplexity (PPL) (Brown et al., 1992) successfully detects ungrammatical sentences, but it fails to distinguish legitimate novel continuations from copy-and-pasted ones. Relying on surface-level n-gram matching, BLEU-1 and ROUGE-L cannot detect reordering effectively, and wrongly score the well-formed candidate lower than its retrieval-based adversarial example. BERTScore leverages contextual embeddings from BERT (Devlin et al., 2019), thus mitigating the above challenges, but still does not fairly evaluate candidates that correctly align with the context but happen to differ from the provided reference example. In our example, the candidate "... her engine was smoking" is reasonable but deviates from the human reference, so BERTScore rates it relatively low (0.338 out of 1.0), correlating poorly with the human rating, which was high (5.05 out of 6.00).

Table 1: In this story generation example, MARS is the only metric that gives the well-formed candidate a higher score than the two adversarial examples. Context: Wendy was driving down the road. She heard her car making a noise. She pulled over to examine the problem. There was nothing but oil all on the road from her car. The human rating of the candidate, averaged over 20 judgements, is 5.05 out of 6.00. The two adversarial examples are generated by Reordering the tokens of the candidate (simulating weak NLG systems whose generation is not readable) and by Retrieving a sentence from the context (simulating systems with no generation ability). We box the cases where the adversarial example does not score lower than the well-formed candidate.
To address the above issues, prior studies have proposed a number of promising remedies. One line of work combines human ratings with automated metrics (Durmus et al., 2020; Chaganty et al., 2018, inter alia). For instance, in the HUSE score, Hashimoto et al. (2019) leverage the differences between perplexity and human judgements to consider both the quality and diversity of generated text. Another line trains separate neural models to aid automated metrics (Mehri and Eskenazi, 2020; Yuma et al., 2020, inter alia). For instance, BLEURT (Sellam et al., 2020) fine-tunes BERT (Devlin et al., 2019) on synthetic reference-candidate pairs for machine translation. These methods, however, are often limited in practical use: high-cost human ratings are not available for every dataset, and data- or system-specific training is not easily extended to other domains and can even bias the evaluation (Freitag et al., 2020b).
In this paper, we present MARS (Language Model Augmented Relevance Score), a new NLG evaluation metric that requires neither supervision from human ratings nor additional training on specific domains. As shown in Figure 1, instead of comparing candidates only with human written references, as many prior metrics do, MARS uses a mixture of both human and augmented references. Specifically, MARS masks tokens in the reference to create templates, and then uses the context and templates to generate augmented references by infilling the masked parts with an LM guided by reinforcement learning. The augmented references thus incorporate information from both the context and the human reference, and are enriched with lexical and syntactic diversity, facilitating fairer evaluation of candidates. Finally, we compute the score as a weighted average of the similarity between the candidate and the set of augmented references in the contextual embedding space.
The advantages of MARS are three-fold. First, MARS correlates highly with human judgements. We apply MARS to three diverse NLG tasks, and demonstrate that, compared with seven popular NLG metrics, MARS better correlates with human judgements and is robust against adversarial attacks. Second, MARS is context-aware. Unlike existing metrics that only consider the given human reference, we use a constrained NLG approach to incorporate the generation context into augmented references, thus alleviating bias against diverse candidates. Third, MARS is easy to deploy and extend. Built on off-the-shelf LMs, MARS requires neither human supervision nor additional training for specific domains, and can therefore serve as a general-purpose metric for a broad range of NLG applications, as we will demonstrate for three common NLG tasks: story generation, news summarization, and question-answering.

Approach
MARS comprises three steps. First, we mask out non-important tokens from the human reference to produce templates for augmentation ( §2.1). Second, we guide off-the-shelf LMs to generate reference augmentation on these templates via a reinforced self-planning algorithm ( §2.2). Finally, we compute a weighted average score that reflects the overall similarity between the candidate and the set of augmented references ( §2.3).

Human Reference Token Masking
The first step in MARS is to take the given human reference and generate templates, i.e., masked versions of the human reference, which can then be used to generate augmented references. Our masking procedure can be viewed as a reversed version of prior insertion- and template-based generation approaches (Zhang et al., 2020; Miao et al., 2019): whereas those approaches start with templates of important tokens and then fill in the details to generate complete sentences, our masking procedure starts with the complete sentence (i.e., the human reference) and masks out unimportant tokens to generate templates. To better explain our masking procedure, we introduce two concepts, mask priority and mask cost.

Mask Priority. We compute a mask priority $p_i$ for each token $x_i$, which captures the priority of masking $x_i$; non-important words should receive higher priority. We compute $p_i$ as a function of two things: the inverse document frequency (IDF) of $x_i$ and the part-of-speech (POS) of $x_i$:

$$ p_i = \frac{f(\mathrm{POS}(x_i))}{\mathrm{IDF}(x_i)} \quad (1) $$

where $f(\cdot)$ is a function that assigns a weight to each POS tag. Common tokens across the corpus (e.g., stop words, with low IDF) receive high mask priority. Tokens responsible for descriptive details are also assigned high mask priority based on their part-of-speech (e.g., adjectives are mainly used for details, so they are given higher priority of being masked).
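As a concrete illustration, the sketch below computes mask priorities following Eq. 1 as reconstructed above. The POS weight table and the smoothed IDF formula are our own placeholder assumptions, not the paper's exact values; it is meant only to show the shape of the computation.

```python
import math

# Hypothetical POS weights (ours, not the paper's): detail-carrying tags such
# as adjectives get high weight, i.e., high priority of being masked.
POS_WEIGHT = {"ADJ": 1.0, "ADV": 0.9, "DET": 0.8, "NOUN": 0.3, "VERB": 0.4}

def idf(token, doc_freq, n_docs):
    # Smoothed inverse document frequency over the evaluation corpus.
    return math.log((1 + n_docs) / (1 + doc_freq.get(token, 0))) + 1

def mask_priority(token, pos_tag, doc_freq, n_docs):
    # Eq. 1 (as reconstructed above): common tokens (low IDF) and
    # detail-carrying POS tags receive high mask priority.
    return POS_WEIGHT.get(pos_tag, 0.5) / idf(token.lower(), doc_freq, n_docs)
```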
Mask Cost. For each token $x_i$, we also compute a mask cost $c_i$. Tokens that appear in both the context and the human reference should have a high mask cost, as they are deemed context-carrying. We use longest common subsequence (LCS) matching between the context and the human reference to identify these context-carrying tokens. In our experiments, we set the $c_i$ of these tokens to 10 and the default $c_i$ of all other tokens to 1. We use $r$ to denote the ratio of tokens to be masked in a sentence of $n$ tokens, and define $c_{\max} = r \cdot n$ as the maximum total cost allowed.

DP-based Token Masking. Now that each token has a mask priority and a mask cost, we aim to choose a set of tokens to mask with the highest possible sum of priorities, such that the sum of mask costs does not exceed $c_{\max}$. Given a function $m(x_i) \in \{1, 0\}$, where 1 means token $x_i$ is masked and 0 means it remains, the objective of token masking can be expressed as follows:

$$ \max_{m} \sum_{i=1}^{n} m(x_i)\, p_i \quad \text{s.t.} \quad \sum_{i=1}^{n} m(x_i)\, c_i \le c_{\max} \quad (2) $$

This is an instance of the Knapsack problem, an NP-complete combinatorial optimization problem (Pisinger, 1995), which we solve using dynamic programming (DP). In general, the masking strategy aggressively harvests tokens of high mask priority while keeping the total cost of masked tokens from exceeding the limit $c_{\max}$. The detailed DP algorithm for solving this problem is shown in Appendix A.
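A minimal sketch of such a solver follows. The paper's exact algorithm lives in its Appendix A, so this is a standard 0/1-knapsack DP written by us, assuming integer costs as described above (1 for ordinary tokens, 10 for context-carrying ones).

```python
def choose_mask(priorities, costs, c_max):
    """Standard 0/1 knapsack DP: select the token indices with maximum total
    mask priority whose total (integer) mask cost does not exceed c_max = r * n."""
    c_max = int(c_max)
    best = [(0.0, ())] * (c_max + 1)        # best[c] = (priority sum, indices)
    for i, (p, c) in enumerate(zip(priorities, costs)):
        for budget in range(c_max, c - 1, -1):   # descend: use each token once
            total, chosen = best[budget - c]
            if total + p > best[budget][0]:
                best[budget] = (total + p, chosen + (i,))
    return set(best[c_max][1])              # indices to replace with [MASK]

# e.g., a 10-token reference masked at ratio r = 0.4 has budget c_max = 4, so
# at most four ordinary tokens (cost 1) can be masked, and no context-carrying
# token (cost 10) ever fits within the budget.
```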

Self-planning Cloze Augmentation
After creating the templates described in §2.1, we produce augmented reference examples based on both the templates and the generation context. This procedure can be seen as a mixture of hard- and soft-constrained NLG, where the template tokens pre-exist with some blanks, and the system, conditioned on the context, aims to fill in the blanks. We henceforth refer to this process of creating augmented references as cloze augmentation.
Background. Masked language models (MLMs) such as RoBERTa (Liu et al., 2019) and BERT (Devlin et al., 2019) are trained to predict masked tokens within sentences, and thus are able to perform cloze augmentation off-the-shelf. However, without architecture-level modification, MLMs can only infill a pre-determined number of missing tokens (Zhu et al., 2019). This is especially problematic because, if they are used directly to augment references, all the augmented references will have the same number of tokens as the original human reference. We believe this unnecessarily constrains augmentation diversity, and thus consider it as a Naive method in our evaluations (§4).
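For illustration, the Naive baseline can be approximated with an off-the-shelf fill-mask pipeline. The snippet below is our sketch (the "roberta-base" checkpoint and the example sentence are our choices, not the paper's setup); it makes the fixed-length limitation concrete: each <mask> slot is filled by exactly one token, so every augmentation has the same length as the original reference.

```python
from transformers import pipeline

# Off-the-shelf MLM infilling: exactly one predicted token per <mask> slot.
fill = pipeline("fill-mask", model="roberta-base")

template = "She pulled over and saw that her <mask> was smoking."
for pred in fill(template, top_k=3):
    print(pred["sequence"], pred["score"])  # three same-length augmentations
```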
Figure 2: Compared with the Naive method, our reinforced self-planning approach infills blanks ([blk]) with varying-length token spans while considering both past and future tokens, which promote diversity and coherence respectively. The context is concatenated to the beginning of the reference template.

Autoregressive language models (ALMs) such as GPT-2 (Radford et al., 2019), on the other hand, are trained to predict the current-step token given past tokens. They can generate sequences of varying lengths, but they cannot effectively infill missing tokens within sentences, since they do not consider the future context. To enable ALMs to infill blanks of unspecified length, prior work has proposed either retraining a new LM from scratch (Shen et al., 2020) or fine-tuning on specially prepared data (Donahue et al., 2020), both of which are costly and not easy to extend to new NLG tasks. As shown in Figure 2, we instead take a reinforcement learning (RL) approach that uses the future words after the blank to guide the current-step infilling generation. Since such RL guidance relies only on the tokens within the to-be-infilled template itself, we call it reinforced self-planning. Our method combines the advantages of MLMs and ALMs, requiring neither re-training nor collecting new data, and is thus easier to extend to other off-the-shelf LMs.
Reinforced Self-planning. At each decoding step during generation, a vanilla ALM picks the token with the highest probability by applying an argmax over the softmax output of the hidden states. We add a self-planning stage between the softmax and the argmax. Following the RL framework, we define the state at step $t$ as the sequence generated before $t$ (i.e., $s_t = x_{<t}$), and the action at step $t$ as the $t$-th output token (i.e., $a_t = x_t$). We take the softmax output of the last hidden states (with parameters $\theta$) as the policy $\pi_\theta$, since it is the probability of picking token $x_t$ (action $a_t$) given the state $s_t = x_{<t}$. Similarly, we denote the policy after reinforced self-planning as $\pi_{\theta'}$. Typically, the RL objective is to maximize the expectation of the total reward $R$, summed over $T$ steps on the trajectory induced by $\pi_\theta$:

$$ J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \sum_{t=1}^{T} \gamma^{t} r_t \right] \quad (3) $$

where $\gamma \in (0, 1]$ is the discounting factor and $r_t$ is the single-step reward. In text generation, however, such a reward definition requires sampling over the future generated sequence to estimate the current-step reward (Gong et al., 2019), which may cause the policy to end up in a zero-reward region because of the high variance of the gradient (Pang and He, 2021). Since we guide the generation at every step of decoding, we instead derive the $t$-th step policy gradient $g_t$ as:

$$ g_t = \rho_t \, r_t \, \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \quad (4) $$

with an importance sampling weight $\rho_t$ to stabilize the optimization (Munos et al., 2016), which is:

$$ \rho_t = \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)} $$

If we denote a token in the future context as $x_f \in \{x_{\text{future}}\}$, the single-step self-planning reward $r_t$ can be approximated by the cosine similarity between the $t$-th step hidden state $h_t$ and the embedding vector $e_f$ of $x_f$ produced by the LM embedding layers:

$$ r_t = \cos(h_t, e_f) = \frac{h_t \cdot e_f}{\lVert h_t \rVert \, \lVert e_f \rVert} \quad (5) $$

Given all the above definitions, at the $t$-th step we update $\pi_\theta$ towards the self-planned $\pi_{\theta'}$ as:

$$ \theta' \leftarrow \theta + \frac{\alpha}{\tau} \, g_t \quad (6) $$

where $\alpha$ is the learning rate and $\tau$ is the temperature parameter that controls the stochastic sampling during token decoding (Keskar et al., 2019). After several iterations of reinforced self-planning, the updated policy should produce tokens that approach the future context in embedding space, since the future context contributes to the calculation of the reward (Eq. 5). More details about how we handle edge cases during reinforced self-planning are presented in Appendix B.
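To make one decoding step concrete, below is a heavily simplified sketch of ours (the paper's implementation details are in its Appendix B). It assumes that only the LM head plays the role of the policy parameters $\theta$, approximates the reward of Eq. 5 by the maximum cosine similarity between the current hidden state and any future-template token embedding, clips the importance weight for stability, and uses placeholder values for lr, tau, and k; none of these specific choices is confirmed by the source.

```python
import copy
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

def self_planned_token(model, input_ids, future_ids, lr=0.1, tau=0.8, k=3):
    emb = model.get_input_embeddings().weight     # (V, d) token embeddings
    head = copy.deepcopy(model.lm_head)           # theta': local copy of theta
    opt = torch.optim.SGD(head.parameters(), lr=lr)
    with torch.no_grad():
        out = model(input_ids, output_hidden_states=True)
        h = out.hidden_states[-1][:, -1]          # (1, d) current hidden state
        # Eq. 5 (approximated): reward = max cosine similarity between the
        # current hidden state and the future template token embeddings
        r = F.cosine_similarity(h, emb[future_ids]).max()
    for _ in range(k):                            # k rounds of self-planning
        logp = F.log_softmax(head(h), dim=-1)     # policy pi_theta'(a | s_t)
        a = logp.exp().multinomial(1)             # sample an action (token)
        with torch.no_grad():                     # importance weight (Eq. 4)
            base = F.log_softmax(model.lm_head(h), dim=-1)
            rho = (logp[0, a] - base[0, a]).exp().clamp(max=2.0)
        loss = -(rho * r * logp[0, a]).sum()      # ascend the policy gradient
        opt.zero_grad(); loss.backward(); opt.step()
    # tempered sampling from the self-planned policy (Eq. 6)
    return (head(h) / tau).softmax(-1).multinomial(1)

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2TokenizerFast.from_pretrained("gpt2")
prefix = tok("Wendy pulled over and saw that", return_tensors="pt").input_ids
future = tok(" was smoking .", return_tensors="pt").input_ids[0]
next_id = self_planned_token(model, prefix, future)
```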

Computing Contextual Similarity
After generating augmented reference sentences, the final MARS score is computed as a weighted average of the similarity between the candidate and each reference in the augmentation set (including the original human reference). One way to obtain similarity scores would be BERTScore, but BERTScore requires rescaling against baselines computed on external resources to make its outputs more readable. Therefore, to keep all the resources used by MARS off-the-shelf, we utilize Sentence-BERT (Reimers and Gurevych, 2019), which uses the mean of all token embeddings in a sentence as the overall sentence-level encoding. As the sentence encoder, we use RoBERTa-large (Liu et al., 2019), a common choice in the literature (Reimers and Gurevych, 2020). As shown in Eq. 7, we then compute the MARS score as the average of the cosine similarities, weighted using a geometric progression with common ratio $q \in (0, 1]$ and scale factor (start value) $a \ne 0$:

$$ \mathrm{MARS} = \frac{\sum_{j=0}^{\#r} a\, q^{j} \cos(\mathbf{e}_{\mathrm{cand}}, \mathbf{e}_{\mathrm{ref}_j})}{\sum_{j=0}^{\#r} a\, q^{j}} \quad (7) $$

where $\mathbf{e}_{\mathrm{cand}}$ is the candidate encoding, $\mathbf{e}_{\mathrm{ref}_j}$ are the reference encodings ($j$ indexes the augmented reference under a certain masking ratio $r$, and $\mathbf{e}_{\mathrm{ref}_0}$ marks the zero-mask human reference), and $\#r$ is the number of masking ratios we use in §2.1. The weights defined by the geometric progression determine how much each reference contributes. By default, Eq. 7 assigns the largest weight to the human reference, since it is the gold standard.
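Given the encoder and the reference set, the scoring step reduces to a few lines. The sketch below is ours: the checkpoint name "all-roberta-large-v1" is a stand-in for a RoBERTa-large-based Sentence-BERT encoder, and the defaults a = 1.0 and q = 0.5 are placeholders, since the paper does not pin down those values here.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# A RoBERTa-large-based Sentence-BERT encoder (stand-in checkpoint name).
encoder = SentenceTransformer("all-roberta-large-v1")

def mars_score(candidate, references, a=1.0, q=0.5):
    """Eq. 7: geometrically weighted average of cosine similarities.
    references[0] must be the zero-mask human reference so that it
    receives the largest weight a * q^0."""
    embs = encoder.encode([candidate] + references, normalize_embeddings=True)
    sims = embs[1:] @ embs[0]                # cosines (embeddings are unit-norm)
    w = a * q ** np.arange(len(references))  # geometric weights a, aq, aq^2, ...
    return float(w @ sims / w.sum())
```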

Tasks & Datasets
We evaluated MARS and compared it with several popular NLG metrics on the following three tasks:

Story Generation. We use the ROC Stories dataset for story generation, which requires candidate NLG systems to generate coherent endings to four-sentence stories (Mostafazadeh et al., 2016). We evaluate outputs from two systems: (1) an industry-level system based on Apache Solr, and (2) an OpenNMT model with global attention (McCann et al., 2017).
News Summarization. For the news summarization task, we use the Newsroom summary dataset, which contains 1.3 million articles from 38 major publications (Grusky et al., 2018); we use the subset with human ratings ($n$ = 540) released by the authors. This subset contains outputs from three summarization models: (1) TextRank, a sentence-level summarization system inspired by Google PageRank (Page et al., 1999); (2) a Seq2Seq model with attention (Rush et al., 2015); and (3) Pointer-N, a pointer-based neural model (See et al., 2017) trained on the Newsroom dataset.
Question Answering. For question answering, we use the MOCHA dataset, which includes human ratings on outputs of five models trained on six QA datasets (Chen et al., 2020). We consider a distributionally balanced subset ($n$ = 450) of these outputs from three systems: (1) GPT-2, (2) Back-Translation, and (3) MHPG.

The detailed statistics of the three datasets used in this work are shown in Table 2. For pre-processing, we removed hashtags and URLs in the text, but left punctuation and stop words, which can affect LCS matching when computing mask costs. For all tasks, we use GPT-2 (large, with 774M parameters) as the language model for cloze augmentation.

MARS Better Correlates With Humans
As automated metrics are only helpful if they correlate sufficiently with human judgements, in this section we examine how MARS correlates with human judgements compared with prior metrics.
System-level Correlation. Table 3 shows the correlations between human judgements and automated metrics for MARS and seven other unsupervised metrics, across all NLG systems studied in our three tasks. Compared with the other metrics, MARS achieves the highest correlation with human judgements for five of the seven systems (and is comparable to the best metric on the other two), making considerable improvements over the next-best metric for many of the NLG systems (e.g., 0.370 ↑ for Back-Translation, and 0.231 ↑ for Solr). We also notice that MARS shows greater improvements on more open-ended tasks (e.g., story generation, which has a low Ω, indicating a more open-ended task), which corroborates MARS's original objective of judging diverse candidates more fairly. As for the baselines, n-gram matching metrics such as BLEU correlate poorly with human ratings on such open-ended tasks; BERTScore performs better on short candidates and high-Ω tasks (e.g., QA); and perplexity, as expected, correlates weakly with human ratings. The Naive method, which uses multiple augmented references of the same length, improves over BERTScore, which only uses the original reference.
Ablation Study. As shown in the lower rows of Table 3, the performance of MARS drops substantially when its crucial components are removed. Specifically, removing self-planning hurts performance more for tasks with longer references (e.g., story generation), since self-planning is more helpful when there are more blanks to infill; removing context hurts performance more in tasks that are less open-ended (high Ω, such as QA), because there is then no adequate input for a reasonable augmentation. We take these ablation results as evidence that the techniques we propose in MARS are crucial for improving correlation with human judgements.

Figure 3: Correlation between BERTScore (left) and MARS (right) with human judgements for MOCHA QA. The x-axis is the automated metric score and the y-axis is the human judgement. Points in different colors represent generation outputs of three NLG systems: GPT-2 (red circles), Back-Translation (green triangles), and MHPG (blue squares).

To further examine the relationship between automated metrics and human judgements, we consider the MOCHA QA task as an example and plot the correlations of BERTScore (left) and MARS (right) with human judgements. As shown in Figure 3, compared with MARS, BERTScore has more candidates in the upper-left corner of the plot (i.e., low BERTScore but high human judgement). Many of these are generated by GPT-2 and MHPG, which, based on manual examination, tend to provide more details in the answer than the human reference. For instance, given a context about shopping, one question is "Did they need to buy any meat?". The human reference answer is simply "Yes, they did.", but GPT-2 returns "Yes, they bought chicken and a roast.", which is more detailed, even containing item names derived from the context. Whereas BERTScore cannot fairly evaluate such cases, where the generated candidate is over-described with respect to the human reference, MARS uses augmented references enriched with information from the context to provide a fairer judgement.

Is MARS robust?
Good evaluation metrics ought to also be able to detect adversarial examples by assigning them lower scores than well-formed candidates. As shown in Table 4, uni-gram matching BLEU-1 cannot detect reordered sequences, while ROUGE-L occasionally scores a reordered sequence higher if token swapping leads to a longer LCS. Sentence Mover's Similarity combines word and sentence embeddings and is thus more capable of recognizing reordered samples than MoverScore. Perplexity can detect reordered examples effectively, but is unable to detect retrieved sentences, as they are usually well-formed. MARS, on the other hand, shows the best robustness against adversarial samples, possibly because multiple context-infused augmented references help MARS detect adversarial samples more reliably. We also study the effect of the contextual embeddings used in §2.3: when switching to GloVe embeddings (Pennington et al., 2014), which are not contextual, MARS is less able to detect adversarial samples, especially reordered ones. The Naive method, which by default uses RoBERTa embeddings, achieves robustness comparable to MARS, but its task-level correlations with human judgements are generally lower than MARS's, potentially because its fixed-length cloze generation limits the diversity of the augmented references.
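For reference, the two adversarial perturbations from Table 1 can be generated as below. This is our sketch of the construction the paper describes (token reordering and context-sentence retrieval), not released evaluation code; the whitespace and period-based splitting are simplifying assumptions.

```python
import random

def reorder(candidate):
    """'Reorder' attack: shuffle the candidate's tokens, simulating a weak
    NLG system whose output is not readable."""
    tokens = candidate.split()
    random.shuffle(tokens)
    return " ".join(tokens)

def retrieve(context):
    """'Retrieve' attack: copy a sentence from the context verbatim,
    simulating a system with no generation ability."""
    sentences = [s.strip() for s in context.split(".") if s.strip()]
    return random.choice(sentences) + "."
```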

Choosing Masking Ratios for MARS
The masking ratios for MARS are set via the hyperparameter $r_{\max}$: MARS uses masking ratios from 0% to $r_{\max}$ in increments of 20%, e.g., $r_{\max} = 40\%$ indicates $r \in \{0\%, 20\%, 40\%\}$. In preliminary experiments, we observed that the best $r_{\max}$ varied across datasets. Thus, for our three generation tasks, we evaluate MARS performance under different $r_{\max}$, as shown in Table 5. We find that tasks that are more open-ended (low Ω; e.g., story generation) benefit from a higher $r_{\max}$, which creates a more diverse set of augmented references, whereas tasks that are less open-ended (high Ω; e.g., QA) work better with a lower $r_{\max}$, which keeps the augmented references more similar to the original.
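Concretely, the ratio set can be enumerated as follows (our sketch of the convention described above):

```python
def masking_ratios(r_max, step=0.2):
    # Ratios 0%, 20%, ..., r_max, in 20% increments.
    return [round(i * step, 2) for i in range(int(round(r_max / step)) + 1)]

masking_ratios(0.4)   # -> [0.0, 0.2, 0.4], i.e., three augmentation rounds
```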

Error Analysis
We analyzed cases where the MARS score substantially differed from human judgements. From the test set outputs, we found that errors could often be categorized into one of three types (shown in Table 6): (1) Out-of-Vocabulary errors, often induced by unknown tokens in the candidates; (2) Confusion errors, where candidates are simply copied from the context; and (3) Inference errors, where the candidates are further inferences from the context based on commonsense knowledge. In these cases, human annotators tended to assign higher scores, whereas MARS over-penalized them.

Table 6: Error analysis of MARS. We investigated three typical types of errors within samples that showed large differences between the MARS score and human ratings. Gold: human-written references.

Human Judgement
We conducted a human evaluation on Amazon Mechanical Turk (MTurk) to further study the quality of MARS augmentation. In total, 150 participants were randomly assigned to evaluate the three tasks. Participants (61.3% male, 38.7% female) were all from the United States and over 18 years old, with an average age of 34.7. Each participant was paid 75 cents for completing the 14 questions in each questionnaire (average completion time per questionnaire was about 5.11 minutes).
Results. We conducted paired-sample t-tests to examine how closely the augmented samples resemble the original human references in terms of relevance to context and readability. In terms of readability, both MARS and Naive augmentations were rated lower than the original, but not significantly so; we take this as a compromise of cloze-style augmentation. No statistically significant differences were observed between the original references and MARS augmentations in overall ratings across the three tasks. These results further confirm that augmented examples from MARS are of similar quality to the original human references.

Related Metrics
Unsupervised Metrics. In addition to the metrics we directly compared with previously, other unsupervised metrics have also been proposed. TER (Snover et al., 2006), CharacTer (Wang et al., 2016), and chrF (Popović, 2017) focus on character-level overlaps instead of n-gram matching. Similar to BERTScore, YiSi (Lo, 2019) and BERTr (Mathur et al., 2019) leverage pre-trained contextual embeddings to better capture similarity. ΔBLEU (Galley et al., 2015) adds human-annotated sentences as negative references. Bawden et al. (2020) find that the gain from multiple references can be limited by inherent weaknesses in BLEU. We considered lessons from many of the above works while designing MARS.
Learned Metrics. Compared with unsupervised metrics, learned metrics collect human supervision (Freitag et al., 2020a; Chaganty et al., 2018) or train on specially prepared data from a certain domain (Sellam et al., 2020; Rei et al., 2020). Other approaches train on related tasks and use these models as metrics for the original task (Goodrich et al., 2019; Eyal et al., 2019). Whereas learned metrics may have limited applicability to tasks where no such resources are available, MARS fully exploits the few-shot learning abilities of off-the-shelf LMs and therefore does not require additional training.
Task-specific Metrics. Finally, many metrics have been proposed for task-specific evaluation, such as LEIC (Cui et al., 2018) for image captioning.

Limitations
MARS can be limited by the LM that it uses: for instance, the total length of the context plus reference/candidate is limited by the maximum sequence length of the LM. Additionally, our work has focused on English, and MARS may require non-trivial modifications to handle cases where the context and reference/candidate are in different languages, as in machine translation. Future work could potentially extend MARS to these scenarios using multilingual sequence-to-sequence models such as multilingual T5 (Xue et al., 2020). We also analyzed errors and found that MARS sometimes under-scores candidates that contain unknown tokens or are copied directly from the context (see Appendix C for examples and further analysis).

Conclusion
We have proposed MARS, a context-aware and easy-to-deploy NLG metric built upon an off-the-shelf language model (GPT-2). On three contextual NLG tasks, we showed that MARS correlates better with human judgements than seven other unsupervised metrics. Requiring neither costly human supervision nor additional training, MARS can be applied to a broad range of NLG tasks.

Ethical Considerations
The goal of MARS is to aid the evaluation of NLG models, and hence we draw attention to several ethical considerations. First, the augmented references of MARS can be affected by certain biases from the LM it is based on (e.g., GPT-2) (Liu et al., 2021), though those biases may be partially mitigated by the relatively narrow scope of cloze completion and by generations being guided by given context and human references. Second, MARS facilitates evaluation and therefore development of NLG models, for which a major ethical consideration is that they can mimic target properties in training data that are undesirable. This is especially true of models trained on non-contemporary data that does not represent current norms and practices. These biases can lead to ethical concerns if users or deployers of models are not aware of these issues or do not account for them. More generally, NLG models can also be used in malicious ways such as to generate fake news or spam, which we strongly discourage. Finally, our experiments and analysis are done in English, and therefore we do not claim that our findings will generalize across all languages, although our framework has potential to be extended to other languages with necessary modifications.