MISMATCH: Fine-grained Evaluation of Machine-generated Text with Mismatch Error Types

With the growing interest in large language models, the need to evaluate the quality of machine-generated text against reference (typically human-generated) text has become a focal point of attention. Most recent works focus either on task-specific evaluation metrics or on studying the properties of machine-generated text captured by existing metrics. In this work, we propose a new evaluation scheme to model human judgments in 7 NLP tasks, based on the fine-grained mismatches between a pair of texts. Inspired by recent efforts toward fine-grained evaluation in several NLP tasks, we introduce a set of 13 mismatch error types, such as spatial/geographic errors and entity errors, to guide the model toward better prediction of human judgments. We propose a neural framework for evaluating machine texts that uses these mismatch error types as auxiliary tasks and repurposes existing single-number evaluation metrics as additional scalar features, alongside textual features extracted from the machine and reference texts. Our experiments reveal key insights about the existing metrics via the mismatch errors. We show that the mismatch errors between sentence pairs on held-out datasets from 7 NLP tasks align well with human evaluation.


Introduction
Large language models have pushed the boundaries of natural language generation (NLG). More and more, the generated machine texts look human-like. The need for evaluation metrics has never been as critical as in the recent decade. Typically, there are two ways to evaluate the quality of machine-generated text: automatic evaluation and human evaluation. In automatic evaluation, the quality of the machine-generated text is captured by a single number from a range of values, indicating how good the generated text is according to a (hand-coded rule-based or neural-based) model. Several NLP tasks still use metrics from two decades ago, for instance, ROUGE (Lin, 2004) and METEOR (Banerjee and Lavie, 2005) for abstractive summarization, BLEU (Papineni et al., 2002) for machine translation, etc.
It has been noted in several works that automatic evaluation metrics are incapable of capturing the different criteria for measuring the quality of text and often correlate poorly with human judgments (Sai et al., 2021; Callison-Burch et al., 2006). The current automatic evaluation metrics lack the ability to measure the quality of modern machine-generated text. In human evaluation, we evaluate the machine text based on human ratings, where we ask human annotators to judge a given pair of texts. The quality of the machine text is measured using different task-specific human evaluation criteria such as fluency, coherence, correctness, consistency, relevance, adequacy, etc. Human evaluations are often expensive, time-consuming, and subjective (low inter-annotator agreement), especially when broad criteria such as the fluency or the interestingness of the model-generated text are used for human judgment.
To address these challenges in automatic and human evaluations, there have been recent efforts toward fine-grained evaluation of generated text in several NLP domains (Callison-Burch et al., 2006; Ethayarajh and Jurafsky, 2020; Sai et al., 2021; See et al., 2019). In this paper, we are interested in utilizing fine-grained evaluation categories to guide the prediction of human judgments. Towards this goal, we introduce a task-agnostic list of 13 mismatch error types, such as grammatical errors, spatial/temporal errors, etc., that unifies several related task-specific efforts (Pagnoni et al., 2021; Glockner et al., 2018; Dou et al., 2022). These mismatch error types are comprehensive, interpretable, and useful for predicting human evaluation criteria. For example, an occurrence of a grammatical error in a machine-generated text can impact its fluency rating.
Figure 1 gives an overview of the proposed mismatch error types for fine-grained evaluation. We propose a neural framework for evaluation that uses these mismatch error types as auxiliary tasks to model human judgment and repurposes automated evaluation metrics as additional scalar features, concatenated to textual features extracted from the machine and reference texts via pre-trained LM text embeddings (Devlin et al., 2019). We show that pre-training our proposed model on synthetic data for the mismatch prediction task, and fine-tuning on real data for human evaluation criteria, for different NLP tasks, achieves state-of-the-art performance on the main downstream task of predicting human evaluation metrics. We provide several ablation studies showing the importance of each component of our architecture, and the correlations between the mismatch error types and the automatic and human evaluation metrics. We also show how our architecture is useful for predicting novel evaluation criteria, such as factuality in abstractive summarization.

NLG Evaluation
Given a pair of texts, a reference text and a machine-generated one, we are interested in evaluating the quality of the generated text using the reference text. We measure the quality of the generated text by estimating how a human would judge this text based on different evaluation criteria. Such evaluation is common in many ML/NLP tasks, e.g., machine translation, summarization, image captioning, etc. Unlike other automatic evaluation metrics, we consider fine-grained evaluation cues from 13 mismatch error types, inspired by several related task-specific efforts (Pagnoni et al., 2021; Glockner et al., 2018; Dou et al., 2022), to guide the main task of predicting human judgments. We propose a neural framework for evaluating machine-generated texts that uses these mismatch error type predictions as auxiliary tasks, and automated evaluation metrics as additional scalar features, along with the pair of pre-trained LM text embeddings extracted from the reference and generated texts. In this section, we discuss the role of mismatch error types as a good proxy for human judgments (Section 2.1) and the model architecture for the proposed approach (Section 2.2).

Mismatch Error Types
Recently, there has been growing interest in sets of measurable fine-grained evaluation criteria (Dou et al., 2022; Pagnoni et al., 2021; Glockner et al., 2018). Most of these recent works require human annotation. In this paper, we consider MISMATCH types, which identify a specific violation or mismatch between a pair of texts spanning various dimensions of semantic structure: whether the mismatch is within a semantic frame, including predicates, entities, and modifiers, or across multiple semantic frames, for instance a predicate ordering mismatch. These mismatch error types can be used as a proxy to measure the broad evaluation categories: a mismatch in sentence ordering can be a weak signal for the coherence of the generated text, and a change in object names, gender, or numbers can indicate the correctness of the generated text. Table 1 shows the list of mismatch error types used in this paper. We want to understand the relationships between the mismatch types, the evaluation metrics, and the human evaluation criteria by addressing the following three questions:
• Are mismatch types a good proxy for human evaluation criteria? We show that the fine-grained evaluation based on the mismatch types can be used to approximate the evaluation criteria used for human ratings.
• Can we predict a mismatch type between a given pair of texts? We demonstrate that, in addition to the BERT-based text representations, evaluation metrics computed from the input pair of texts can reliably identify these mismatch types (with relatively few examples for training).
• Can we use these evaluation metrics to predict mismatches on an unseen text pair? We study the predictive power of these evaluation metrics and demonstrate that even though the evaluation metrics do not agree with human evaluation criteria, they can easily identify these mismatch types between pairs of texts.
As we later see in Figure 3, our proposed mismatch error types correlate well with both the automatic evaluation metrics and the human evaluation criteria, demonstrating the relevance of these error types. In the next section, we show how we model the human ratings on 7 NLP tasks: Abstractive Summarization (AS), Image Caption generation (IC), Question Generation (QG), Machine Translation (MT), Dialogue Generation (DG), Data-to-Text generation (D2T), and Natural Language Inference (NLI), using the mismatch types.

Mismatch Error Types for NLG Evaluation
We now discuss the neural architecture for the proposed NLG evaluation and show how we model the human judgments using the mismatch error types.
A simple solution for modeling human judgments is to directly train a neural network to learn a function that maps the input pairs of texts to human ratings on the different evaluation criteria, but the number of human-annotated samples available for training in many NLP tasks is very limited. In this paper, we consider fine-grained evaluation cues based on mismatch error types to guide the model in predicting human judgments. One of the key advantages of using mismatch error types to approximate human ratings is that we can generate a large amount of synthetic data for these error types. In this paper, we generate ≈160K synthetic examples for the 13 mismatch error types and use publicly available task-specific data, with dataset sizes ranging from a few thousand to hundreds of thousands of examples with human annotation (details in Section 3).
Our approach to modeling human judgments involves two steps: 1) a task-agnostic pre-training step, where we use the synthetic examples of mismatch error types to train a shared (base) neural network model for all 7 NLP tasks, and 2) a task-specific finetuning step, where we finetune the pre-trained model for a specific task to predict the human ratings over different evaluation criteria. The output from the task-specific models approximates the human judgments along with the interpretable mismatches between the given pair of texts.
Figure 2 shows the architecture of the proposed mismatch-based evaluation model with its pre-training and finetuning steps. In both steps, we use pre-trained BERT (Devlin et al., 2018) to extract linguistic features via embeddings for both the reference and generated texts to predict the mismatch types and the human ratings (textual features). We generate (2x) 64-dimensional textual features, one for the machine text and the other for the reference text. It is common to generate millions of synthetic examples for pre-training the neural network (Sellam et al., 2020) or to use tens of thousands of human-annotated examples (Rei et al., 2020) to make the model robust to unseen texts. On the other hand, automatic evaluation metrics utilize handcrafted logic to compute a score on any pair of texts. In Section 3, we show that evaluation metrics as features can reliably predict these mismatch error types (scalar features). We demonstrate that even with few (synthetic and task-specific) samples, our approach benefits from the handcrafted logic in the evaluation metrics to boost prediction performance. We choose evaluation metrics from different NLP tasks to represent different properties of natural language text.
Unlike the textual features, the features from the automatic evaluation metrics are required to be invariant to different permutations. Traditional neural network-based models (including BERT) are very sensitive to permutations of the input sequence. We use SetTransformer (Lee et al., 2019) to extract permutation-invariant scalar features from the automatic evaluation metrics, so that the scalar feature does not change under any permutation of the evaluation metric scores. We scale the evaluation metric scores between 0 and 1 before passing them to SetTransformer. We believe that textual features are extremely useful for prediction when the reference and/or machine-generated texts are similar to the texts seen during the pre-training or finetuning steps, whereas scalar features are good for unseen texts. Based on this intuition, we combine the reference and generated texts with the scores computed from the automatic evaluation metrics for prediction. Both the textual and scalar features are concatenated and projected (via a linear layer) to either the 13 mismatch error types during the pre-training step or the human ratings on the task-specific evaluation criteria during the finetuning step.
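To make this feature pipeline concrete, the NumPy sketch below illustrates the flow of one forward pass. It is not the paper's implementation: mean pooling stands in for the SetTransformer (it shares the key permutation-invariance property), random vectors stand in for the BERT-derived textual features, and all names and hidden dimensions beyond the 2x64-dimensional textual features and the 13+1 output types are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def scale_01(scores):
    """Min-max scale metric scores into [0, 1], as done before the set encoder."""
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo + 1e-8)

def set_encode(scores, dim=16):
    """Permutation-invariant encoding of a set of metric scores.
    A toy stand-in for SetTransformer: embed each score, then mean-pool,
    so any permutation of the scores yields the same feature vector."""
    W = np.linspace(0.5, 1.5, dim)          # fixed toy per-dimension embedding
    embedded = scores[:, None] * W[None, :]  # (n_metrics, dim)
    return embedded.mean(axis=0)             # (dim,) -- order-independent

def evaluate_pair(ref_emb, gen_emb, metric_scores, W_out):
    """Concatenate textual features (2x64-d) with scalar features and
    project via a linear layer to 14 outputs (13 mismatch types + NoContr)."""
    scalar = set_encode(scale_01(metric_scores))
    feats = np.concatenate([ref_emb, gen_emb, scalar])
    return feats @ W_out                     # logits over mismatch types

ref_emb = rng.normal(size=64)   # placeholder for BERT textual features (reference)
gen_emb = rng.normal(size=64)   # placeholder for BERT textual features (machine text)
scores = np.array([0.41, 0.77, 0.12, 0.90, 0.33])  # e.g. ROUGE, BLEU, METEOR, ...
W_out = rng.normal(size=(64 + 64 + 16, 14))

logits = evaluate_pair(ref_emb, gen_emb, scores, W_out)
```

During finetuning, the same concatenated features would simply be projected to the task-specific human evaluation criteria instead of the 14 mismatch types.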

Experimental Results
In this section, we show experimental results validating the proposed model for predicting human judgments.

Datasets
To train our proposed model based on mismatch error types to predict human judgments, we use synthetic examples of the 13 mismatch error types during pre-training and real annotated examples from 7 NLP tasks for task-specific finetuning. We generate the synthetic examples by sampling the reference text from multiple NLP tasks (SQuAD (Rajpurkar et al., 2016), WebNLG (Gardent et al., 2017), MSCOCO (Lin et al., 2014)). Following previous works (Sai et al., 2021; Glockner et al., 2018), we use template-based perturbations on the reference text to generate the synthetic examples for each mismatch type, e.g., perturbation rules that introduce subject-verb disagreement or drop stopwords for GramErr, or that change names/gender or the object order for EntErr. In addition to the 13 error types, we include an additional No Contradiction type (NoContr) for machine-generated text that matches the reference text. We believe this additional category helps with better prediction of mismatch types during pre-training. We generate ≈200K synthetic examples in total for the pre-training task (160K for training and the rest for validation). We report results on datasets from 7 NLP tasks with human annotations: AS (Fabbri et al., 2021), IC (Aditya et al., 2015), QG (Nema and Khapra, 2018), MT (Bojar et al., 2017), DG (Mehri and Eskenazi, 2020), D2T (Gardent et al., 2017), and NLI (Williams et al., 2018) for the task-specific finetuning step. We show the number of human-annotated examples used for task-specific finetuning for all 7 NLP tasks in Table 2.
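To illustrate the template-based perturbation idea, here is a minimal sketch of how GramErr and EntErr examples might be generated from a reference sentence. The rules, stopword list, and entity-swap table below are hypothetical simplifications; the actual templates used for the 13 error types are more extensive.

```python
STOPWORDS = {"the", "a", "an", "is", "are", "was", "were", "of", "in", "on"}

def perturb_gram_err(reference):
    """GramErr: drop stopwords from the reference to break grammaticality."""
    return " ".join(w for w in reference.split() if w.lower() not in STOPWORDS)

def perturb_ent_err(reference, swaps):
    """EntErr: swap entity mentions (e.g. names, genders) via a lookup table."""
    return " ".join(swaps.get(w, w) for w in reference.split())

def make_example(reference, error_type, swaps=None):
    """Return a (reference, perturbed, label) triple for pre-training."""
    if error_type == "GramErr":
        return reference, perturb_gram_err(reference), "GramErr"
    if error_type == "EntErr":
        return reference, perturb_ent_err(reference, swaps or {}), "EntErr"
    return reference, reference, "NoContr"  # unperturbed pair matches reference
```

Applied at scale over references sampled from SQuAD, WebNLG, and MSCOCO, rules of this kind would yield the labeled synthetic pairs used for the mismatch-prediction pre-training task.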

Correlation with Mismatch Error Types
Since most automatic evaluation metrics correlate poorly with human evaluation criteria, we study how well the proposed mismatch error types correlate with both the human evaluation criteria and the automatic evaluation metrics. Figure 3 shows the correlation plots for the proposed mismatch error types (with the NoContr type). The correlations between mismatch types and automatic evaluation metrics reveal key insights that justify the use of evaluation metrics as scalar features in our model. For instance, OutofRef is negatively correlated but PredOrdErr is positively correlated with most metrics, n-gram-based metrics are highly correlated with RepErr, etc. The hardcoded logic-based evaluation metrics are as correlated with our mismatch types as the neural network-based evaluation metrics. We also show the correlations between the mismatch types and the human evaluation criteria for NLI (entailment, neutral, and contradiction). It is interesting to see that NoContr is positively correlated with entailment, while NegErr is positively correlated with contradiction. These correlations help guide the model toward better prediction of human judgments on these human evaluation criteria. We also include the correlation plots for the other NLP tasks in the supplementary material.

Model Performance
In this section, we evaluate our model at both the task-agnostic pre-training and task-specific finetuning steps. We use an 80/20 split for both steps. We precompute the automatic evaluation metrics for the pairs of texts in both the synthetic and finetuning datasets for faster computation. We use accuracy to evaluate the performance of the pre-trained model on predicting the mismatch types, and RMSE (lower is better), Kendall's τ correlation (higher is better), and Spearman's ρ correlation (higher is better) between the human ratings and the predicted ratings to evaluate the performance of the task-specific finetuned models. Since we have multiple human evaluation criteria per task (e.g., entailment, neutral, and contradiction in NLI), we report the results by averaging the performance of the finetuned model over the human evaluation criteria from that task. All the experimental results reported in this paper are averaged over 3 random runs.
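For reference, the three agreement measures can be computed as in the plain-Python sketch below. In practice a library implementation (e.g. scipy.stats) would be used; this Kendall variant is the simple tau-a, which ignores tie corrections, and the Spearman implementation assumes no tied values.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between human and predicted ratings."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs."""
    n, conc, disc = len(x), 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                conc += 1
            elif s < 0:
                disc += 1
    return (conc - disc) / (n * (n - 1) / 2)

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the ranks (no ties assumed)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = math.sqrt(sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry))
    return num / den
```

Both correlation measures range over [-1, 1], with 1 indicating perfect rank agreement between model predictions and human ratings.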
Table 2 shows the task-specific finetuned model performance on predicting the human ratings using both the mismatch error types and scalar features. Our pre-trained model predicts the mismatch types on the held-out synthetic data with 98% accuracy. We finetune the trained model on the task-specific data and achieve relatively low RMSE scores on most of the tasks. Kendall's τ and Spearman's ρ measure the correlation between the human ratings and the model-predicted ratings. We see that in all the NLP tasks, our model predictions align well (≈0.50 in correlation) with the human ratings. Since the NLI task involves classification labels (-1 for contradiction, 0 for neutral, and 1 for entailment) instead of human rating scores, we do not report an RMSE score for it. We see that the correlation (both τ and ρ) for NLI is high compared to the other tasks. We believe that the NLI task is relatively easy for our proposed model compared to the other tasks.
Figure 4 compares the proposed model based on the mismatch error types against the task-specific automatic evaluation metrics, both hardcoded logic-based and neural network-based. Kendall's τ correlation between the metrics and the human ratings is used for the performance comparison. We can see that the proposed model outperforms the other metrics significantly in AS, IC, and NLI. In addition, we outperform a popular neural network-based evaluation model for machine translation, BLEURT, on different language pairs from both WMT2018 and WMT2019 (see the supplementary material for more details).
Figure 5: (Top) Agreement with human ratings using Kendall's τ correlation for different settings of the proposed approach on 7 NLP tasks. We start with Textual Features + Without Mismatch (no pre-training step with mismatch error types) and Without Scalar Features (no evaluation metric scores as features during the pre-training and finetuning steps). (Bottom) Agreement with human ratings using Kendall's τ correlation with different subsets of automatic evaluation metrics used for scalar features. The two subsets are selected based on the overall cost (time and space complexity) of computing the metric scores (see Table 11 in the supplementary material).

Ablation Studies
In this section, we analyze the importance of the evaluation metrics and the mismatch types for predicting task-specific human judgments. First, we compare the proposed model architecture under different settings, such as with and without the mismatch error types in the pre-training step, and with and without the scalar features extracted from the automatic evaluation scores using SetTransformer. Figure 5 (top) shows Kendall's τ correlation for the different experimental setups. We start with the base model that uses BERT to extract the textual features and predict the human ratings, without the pre-training step for predicting mismatch error types and without the scalar features from evaluation metric scores. We write this setup as Textual Features + Without Mismatch. Next, a baseline considers the pre-training step with mismatch types but without any scalar features. We write this setup as Textual Features + With Mismatch. We consider an additional baseline that uses the scalar features but without the pre-training step for mismatch-type prediction. We call this baseline Textual Features + With Scalar Features. Finally, we have the proposed model that considers both the pre-training step for mismatch error types and the scalar features extracted from automatic evaluation metric scores.
We see that with text-only features, the pre-training step with mismatch error types significantly boosts the performance of the task-specific finetuning step, specifically in IC, QG, DG, and DT. In NLI, text-only features did not perform as well as expected (0.0 Kendall's τ correlation). We observe that using automatic evaluation metrics as scalar features significantly boosts the performance of the overall model. We believe that the evaluation metrics provide valuable properties of the input texts (both via hardcoded logic-based metrics such as ROUGE, METEOR, etc., and neural network-based evaluation models such as ANLI, FactCC, etc.) for better performance of the fine-tuned models. The proposed model with both the scalar features and the mismatch error types for pre-training outperforms all the other model setups. Our proposed model gets an additional small boost from the pre-training with mismatch error types on top of the scalar features.
Figure 5 (bottom) compares the importance of evaluation metric scores as features for the model prediction. We know from the previous experiment that evaluation metric scores as scalar features provide a significant boost to our proposed model. One of the key issues with using automatic evaluation metrics as features is the cost associated with computing the scores, in both space and time complexity. Time complexity measures how long it takes to compute the score for a given pair of texts, and space complexity measures the storage space occupied by the neural network-based evaluation model. We show the time and space complexity of each metric used in this paper in the supplementary material. To address this concern, we study the effect of this cost on our model prediction.
We choose 2 subsets of evaluation metrics with low and high costs. The subset with low-cost metrics includes hardcoded logic-based metrics such as ROUGE, METEOR, etc. The subset with high-cost metrics mostly contains neural network-based models such as ANLI, FactCC, SummaC, etc. We observe that the low-cost metrics perform comparably to the high-cost metrics as scalar features. In some tasks, such as IC, QG, MT, and DT, the difference is noticeable. In NLI, the difference is significantly higher compared to any other task. This reveals that a subset of evaluation metrics can be selected based on the computational constraints, trading off between cost and model performance. In Figure 6, we study the effect of evaluation metrics as scalar features on sample complexity during the pre-training step. We choose different sample sizes from the synthetic data for predicting the mismatch error types, ranging from 10K to 160K. We see that the model with both textual and scalar features achieves better performance with a limited number of samples to train the model. This shows that evaluation metrics as scalar features have likely improved the sample complexity of the proposed model. Finally, in Figure 7, we show some sample texts from 3 tasks (IC, QG, and DT) showing both the predicted mismatch error type and the predicted human evaluation criteria scores.

Related Work
Automatic evaluation metrics such as ROUGE (Lin, 2004), BLEU (Papineni et al., 2002), and METEOR (Banerjee and Lavie, 2005) have been proposed for different tasks as a substitute for human annotations. In NLP, text generation tasks have extensively used these metrics to measure the quality of the machine-generated text. Evaluation metrics are either task-specific (ANLI (Williams et al., 2022), SummaC (Laban et al., 2022), CIDER (Vedantam et al., 2015), SUPERT (Gao et al., 2020)) or task-agnostic (BERTScore (Zhang et al., 2019), BLEURT (Sellam et al., 2020)), and are based on either human handcrafted logic (ROUGE, BLEU, METEOR) or a neural framework (BERTScore, BLEURT). For human annotation, several dimensions such as coherence, consistency, and fluency are considered to measure the quality of the generated text, yet most evaluation metrics compute a single score to summarize the evaluation. Further, these single-score metrics often do not correlate well with human ratings. To address this problem, several attempts have been made to combine multiple evaluation metrics. ROSE (Conroy and Dang, 2008) uses a linear combination of ROUGE variations (ROUGE_1, ROUGE_2, ROUGE_L, ROUGE_Lsum) for the machine translation task, and the combined score is better in evaluation than the individual ROUGE scores. S³ (Peyrard et al., 2017) uses a combination of the ROUGE scores and Jensen-Shannon divergence to predict the human rating. Neural-based approaches such as BLEURT and COMET directly train on the overall human ratings for sample texts. Even though neural-based evaluation metrics seem promising, they often require tens of thousands of training samples to mimic human ratings, struggle with new domains/tasks and unseen samples, and still output a single score.
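A linear combination of metric scores, in the spirit of ROSE, can be sketched as a least-squares fit of combination weights against human ratings. All numbers below are illustrative placeholders, not values from the cited work.

```python
import numpy as np

# Rows: examples; columns: metric scores,
# e.g. ROUGE_1, ROUGE_2, ROUGE_L, ROUGE_Lsum (illustrative values).
scores = np.array([
    [0.80, 0.55, 0.70, 0.72],
    [0.40, 0.20, 0.35, 0.33],
    [0.60, 0.45, 0.50, 0.52],
    [0.90, 0.70, 0.85, 0.88],
    [0.30, 0.10, 0.25, 0.24],
])
human = np.array([4.1, 2.0, 3.2, 4.7, 1.5])  # illustrative human ratings

# Least-squares fit of the combination weights (with an intercept column).
X = np.hstack([scores, np.ones((len(scores), 1))])
w, *_ = np.linalg.lstsq(X, human, rcond=None)

# The combined score is a single weighted sum of the individual metrics.
combined = X @ w
```

This highlights both the appeal and the limitation of such approaches noted above: the fit can track human ratings on the training examples, but the output is still a single score per text pair.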
Understanding Evaluation Metrics: Recently, there has been growing interest in understanding what these evaluation metrics measure in terms of fine-grained evaluation criteria. This is done by studying different error categories (mismatches) in machine-generated texts; these works find that none of the existing evaluation metrics can predict all of the mismatches, without providing a solution. The perturbation checklist (Sai et al., 2021) uses template-based perturbations on multiple tasks based on the human evaluation criteria to study mismatches. FRANK (Pagnoni et al., 2021) studies different evaluation metrics on factuality in abstractive summarization using error types. Scarecrow (Dou et al., 2022) explores errors in prompt-based text generation by large language models. BreakingNLI (Glockner et al., 2018) evaluates different metrics on synthetic data created from the external knowledge graph WordNet (Miller, 1995). Tang et al. (2021) study different types of factuality errors and hallucinations in text generated by large language models.
In this work, we unify these task-specific research directions. We propose an evaluation model that combines the best of handcrafted logic for robust evaluation, a neural framework for text representations, and mismatch error types to measure the quality of the generated text based on the human evaluation criteria.

Conclusion
In this paper, we proposed a neural framework for evaluating the quality of machine-generated text w.r.t. the reference text. To achieve this, we defined a set of mismatch error types to approximate human ratings over a set of evaluation criteria. We showed that, in addition to the BERT-based text representation, permutation-invariant representations learned from the automatic evaluation metrics improve the prediction of both the mismatch types and the human ratings, with pre-training on only a limited amount of synthetic examples with mismatch error types. We further showed that mismatches between pairs of texts provide an interpretable way to explain human judgments, through a series of ablation studies and correlation analyses. Our proposed mismatch error types are a crucial bridge between automatic evaluation metrics and human evaluation criteria, leading to more interpretable predictions for NLP models.

Limitations
One limitation of our work, which is also an avenue for future work, is that it is not yet fully understood why the mismatch error types help much more in some tasks than in others. Developing a more task-specific or even instance-specific understanding of the benefits of mismatch error types would be very useful. We also want to try our proposed approach on a wider set of tasks, using different foundation models, and under distribution shift, to see if the mismatch error types as auxiliary supervision can improve the robustness of natural language processing systems.

Ethics Statement
With the ubiquity of natural language processing systems in real-world applications, especially in sensitive domains, it is very important that machine-generated text be of high quality, as measured by a list of human evaluation criteria such as coherence and consistency, among others. Thus, from a societal perspective, our proposed mismatch error types provide a way to evaluate the quality of machine-generated text with respect to the reference text. From an ecological perspective, our proposed model design only involves synthetic data for pre-training and minimal computational overhead. In addition, from a trustworthiness perspective, MISMATCH provides an interpretable scheme to identify the differences between pairs of texts, which makes it very suitable for sensitive applications in NLP.

D Additional Details
In Table 10, we show the list of all the automatic evaluation metrics used in our proposed model, along with the associated NLP task and their references. Table 11 shows the cost associated with computing the evaluation metric scores, including both the time complexity (in seconds) and the space complexity (in MBs). Time complexity measures how long the metric takes to compute the evaluation score, whereas space complexity measures how much storage space the metric consumes during the training process. We can see that the hardcoded logic-based metrics such as ROUGE, METEOR, etc., are relatively low-cost compared to the neural network-based models such as ANLI, FactCC, etc., which have a high cost.
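The time-cost measurement can be sketched as an average wall-clock time per text pair. The metric below is a toy unigram-overlap stand-in, not one of the metrics from Table 10, and the harness only illustrates the kind of measurement reported, not the exact benchmarking protocol.

```python
import time

def time_metric(metric_fn, pairs):
    """Measure average wall-clock seconds per pair for a metric (time cost)."""
    start = time.perf_counter()
    for ref, gen in pairs:
        metric_fn(ref, gen)
    return (time.perf_counter() - start) / len(pairs)

def toy_overlap(ref, gen):
    """A low-cost stand-in metric: unigram overlap (Jaccard) ratio."""
    r, g = set(ref.split()), set(gen.split())
    return len(r & g) / max(len(r | g), 1)

pairs = [("the cat sat", "a cat sat"), ("hello world", "hello there world")]
avg_seconds = time_metric(toy_overlap, pairs)
```

The space cost of a neural metric would instead be estimated from the size of its model checkpoint, which is why the logic-based metrics dominate the low-cost subset.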

Figure 1 :
Figure 1: Overview of the proposed fine-grained automatic evaluation of machine-generated text with mismatch error types, along with human evaluation scores. Sample examples are taken from the Natural Language Inference (NLI) and Question Generation (QG) tasks.

Figure 2 :
Figure 2: An overview of the mismatch-based evaluation architecture showing task-agnostic pre-training with mismatch error type prediction on synthetic data as the auxiliary task (left) and task-specific finetuning with human evaluation criteria on real annotated data as the main task (right). Dotted arrows indicate that the evaluation metric scores are pre-computed. Dotted blocks indicate that the modules are reused from the pre-training step.

Figure 3 :
Figure 3: (Top) Correlation between mismatch error types and automatic evaluation metrics from different NLP tasks. (Bottom) Correlation between mismatch error types and human evaluation criteria for the NLI task.
Kendall's τ correlation and Spearman's ρ correlation on 7 NLP tasks (averaged over human evaluation criteria). The top row shows the dataset (in parentheses) and the number of samples used for each task during the finetuning step. * indicates that RMSE is not available, as the human ratings are defined over three classes: Entailment, Neutral, Contradiction.

Figure 4 :
Figure 4: Comparison of the task-specific evaluation metrics with the proposed model. Kendall's τ correlation (agreement with human ratings) is used for the performance comparison. Averaged over 3 runs with 100 samples randomly taken from the task-specific test set.

Figure 6 :
Figure 6: Performance of the pre-trained model with and without scalar features for different synthetic sample sizes. Accuracy in predicting the mismatch error types is used for comparison.

Figure 7 :
Figure 7: Sample examples taken from 3 tasks (IC, QG, and DT) with both the predicted mismatch error type and the predicted human evaluation scores. Both the human-annotated (human) and our model-estimated (predicted) evaluation criteria scores are reported for comparison.

Figure 8 :
Figure 8: Correlation between mismatch error types and human evaluation criteria for 7 NLP tasks: Data-to-Text, Natural Language Inference, Abstractive Summarization, Image Captioning, Machine Translation, Dialogue Generation, and Question Generation.

Table 1 :
Fine-grained evaluation with Mismatch error types between the reference and model-generated texts.

Table 2 :
Model performance (agreement with human ratings) measured using Root Mean Squared Error (RMSE), Kendall's τ correlation, and Spearman's ρ correlation.

Table 4 :
Agreement with human ratings on the WMT18 Metrics Shared Task. Kendall's τ is used to evaluate results.

Table 5 :
Agreement with human ratings on the WMT19 Metrics Shared Task. Kendall's τ is used to evaluate results.

Table 6 :
Example sentences from Image Captioning task with predicted human evaluation criteria and mismatch type.

Table 7 :
Example sentences from Question Generation task with predicted human evaluation and mismatch type.

Table 8 :
Example sentences from the NLI task with predicted human evaluation and mismatch type.

Table 9 :
Example sentences from Data-to-Text task with predicted human evaluation criteria and mismatch type.