Is ChatGPT a Good NLG Evaluator? A Preliminary Study

Recently, the emergence of ChatGPT has attracted wide attention from the computational linguistics community. Many prior studies have shown that ChatGPT achieves remarkable performance on various NLP tasks in terms of automatic evaluation metrics. However, the ability of ChatGPT to serve as an evaluation metric is still underexplored. Since assessing the quality of natural language generation (NLG) models is an arduous task and NLG metrics notoriously correlate poorly with human judgments, we wonder whether ChatGPT is a good NLG evaluation metric. In this report, we provide a preliminary meta-evaluation of ChatGPT to show its reliability as an NLG metric. In detail, we regard ChatGPT as a human evaluator and give task-specific (e.g., summarization) and aspect-specific (e.g., relevance) instructions to prompt ChatGPT to evaluate the generated results of NLG models. We conduct experiments on five NLG meta-evaluation datasets (covering summarization, story generation and data-to-text tasks). Experimental results show that, compared with previous automatic metrics, ChatGPT achieves state-of-the-art or competitive correlation with human judgments in most cases. In addition, we find that the effectiveness of the ChatGPT evaluator might be influenced by the creation method of the meta-evaluation datasets. For meta-evaluation datasets whose creation depends heavily on the references and which are thus biased, the ChatGPT evaluator might lose its effectiveness. We hope our preliminary study can prompt the emergence of a general-purpose, reliable NLG metric.

Figure 1: Prompting ChatGPT as an evaluator to score the generated results of NLG models (taking news summarization as an example).

Introduction
Pre-trained large language models (LLMs; e.g., GPT-3.5, ChatGPT and GPT-4), which are operated by chatting (or asking questions) in natural language, have obtained promising results on various natural language understanding (NLU) and natural language generation (NLG) downstream tasks (Ouyang et al., 2022; Kocoń et al., 2023; Qin et al., 2023; Huang et al., 2023; Yang et al., 2023; Rao et al., 2023; Bang et al., 2023; Zuccon and Koopman, 2023). For example, Zhong et al. (2023) show that ChatGPT can attain understanding ability comparable to some fine-tuned BERT-style models on NLU tasks while failing to surpass current task-specific NLU models. Wei et al. (2023) prove that ChatGPT can achieve good performance and even surpass some full-shot models on several datasets through a multi-turn question-answering manner. For NLG tasks, Jiao et al. (2023) claim that ChatGPT performs competitively with commercial translation products (e.g., Google Translator) on high-resource European languages. Wang et al. (2023a) demonstrate that ChatGPT can balance informativeness and conciseness well, and generate good cross-lingual summaries. Although impressive performance on these tasks in terms of automatic evaluation metrics has been shown, it is still not clear whether ChatGPT can evaluate the quality of textual generations as a human does.
Recently, using pre-trained language models as NLG evaluation metrics, e.g., MoverScore (Zhao et al., 2019), BERTScore (Zhang et al., 2020), COMET (Rei et al., 2020), BLEURT (Sellam et al., 2020), BARTScore (Yuan et al., 2021) and MAUVE (Pillutla et al., 2022), has received increasing attention, since such metrics offer decent human-related judgments from a deep semantic perspective. Given the powerful ability of ChatGPT as an intelligent conversational LLM, researchers have also attempted to investigate whether it can evaluate translation quality as a human evaluator does (Kocmi and Federmann, 2023). However, the automated assessment of the general generation quality of NLG models still remains underexplored.
In this report, we aim to answer the following research question: Is ChatGPT a good NLG evaluator? To this end, we regard ChatGPT as a human evaluator and give task-specific (e.g., summarization) and aspect-specific (e.g., relevance) instructions to prompt ChatGPT to evaluate the generations of NLG models. As the example in Figure 1 shows, we also try different scoring criteria and whether to provide golden references in the prompts, to systematically test the reliability of the ChatGPT evaluator. We conduct experiments on five widely-used NLG meta-evaluation datasets (covering summarization, story generation and data-to-text tasks). Experimental results show that ChatGPT exhibits a high correlation with human judgment in most cases, especially for the story generation task, indicating its potential as an NLG metric. In addition, we find that the ChatGPT evaluator is sensitive to the prompts, and that for different tasks or aspects, the prompts should be carefully designed. Moreover, the creation method of the meta-evaluation datasets has a significant influence on the effectiveness of different evaluation metrics. If a meta-evaluation dataset is created in a way that depends heavily on the references, the similarity between model generations and references serves as a strong signal for human judgments, and simple similarity-based metrics (e.g., ROUGE) can achieve very strong performance. The ChatGPT evaluator might lose its effectiveness in such situations.
Our main contributions are as follows:
• To our knowledge, we are the first to utilize ChatGPT as a general NLG evaluation metric and to study its correlations with human judgments.
• We use task-specific and aspect-specific prompts to guide ChatGPT to perform as a reference-free or reference-based NLG metric, and evaluate its effectiveness on five widely-used meta-evaluation datasets covering three NLG tasks.
• We find that the ChatGPT evaluator has a high correlation with humans in most cases, especially for creative NLG tasks (e.g., story generation) where multiple generations can satisfy humans.
• We find that the ChatGPT evaluator is sensitive to the prompts. For different tasks and aspects, the prompt should be carefully designed.
• We find that the biases involved in the NLG meta-evaluation datasets also influence the effectiveness of NLG metrics, and might lead to limited effectiveness of the ChatGPT evaluator.
2 Related Work

NLG Metrics
A good automatic NLG metric can effectively indicate the quality of textual generations and thus save a great deal of human labor that would otherwise be spent on human evaluation. Therefore, it is vital to design automatic evaluation metrics for NLG tasks, e.g., text summarization, story generation, data-to-text generation, machine translation, and many others. Generally, the score that indicates how well a system performs on each task is computed by comparing the system texts with one or more reference texts for semantic matching. In the literature, the metrics can be roughly categorized into four types.

n-gram-based Metrics. Essentially, the n-gram-based metrics aim to measure the lexical overlap between a generated text and a reference text.
The standard n-gram overlap-based metrics include ROUGE (Lin, 2004), BLEU (Papineni et al., 2002), Distinct-n (Li et al., 2016), and METEOR (Denkowski and Lavie, 2011). For example, ROUGE is the dominant metric in the summarization evaluation area. Its variants consider the overlap of unigrams (ROUGE-1) and bigrams (ROUGE-2), among others. The BLEU metric is the common practice in the machine translation evaluation area. Although these metrics achieve good correlations (typically through large overlaps) with golden references, they are not general enough, because a system summary might convey the same meaning as the reference while using a different surface form.
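As an illustration, the core of ROUGE-N can be sketched in a few lines of pure Python. This is a simplified unigram/bigram overlap F1; real implementations (e.g., the `rouge-score` package) add stemming and other normalization:

```python
from collections import Counter


def rouge_n(candidate: str, reference: str, n: int = 1) -> float:
    """Simplified ROUGE-N F1: clipped n-gram overlap between candidate and reference."""
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    if not cand or not ref:
        return 0.0
    overlap = sum((cand & ref).values())  # each n-gram counted at most min(cand, ref) times
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, `rouge_n("the cat sat on the mat", "the cat is on the mat")` shares five of six unigrams with the reference, giving an F1 of about 0.83; surface-form differences with identical meaning are penalized, which is exactly the limitation noted above.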
Embedding-based Metrics. To better capture the semantic similarity between a generated text and a reference text, embedding-based metrics have been proposed based on word embeddings (e.g., WMD (Kusner et al., 2015)) or sentence embeddings (e.g., BERTScore (Zhang et al., 2020) and MoverScore (Zhao et al., 2019)). These metrics further bridge the gap with human judgment, but they depend heavily on the quality of the embeddings, which may limit their potential.
LLM-based Metrics. With the development of LLMs, some researchers have shown that LLMs can achieve great correlation with human judgment, e.g., BARTScore (Yuan et al., 2021) and GPTScore (Fu et al., 2023). However, ChatGPT, as a more powerful conversational LLM, has not been investigated as an evaluator of the quality of NLG model outputs.
Other Metrics. In different research fields, there are also paraphraser-based or task-specific metrics. For example, PRISM (Thompson and Post, 2020) evaluates translation outputs based on pre-trained paraphrase models. StoryER (Chen et al., 2022), a learned metric, mimics human preference when judging a story through three steps, Ranking, Rating, and Reasoning, based on a specific story-generation dataset. Besides, PARENT (Dhingra et al., 2019) is a metric specifically developed for table2text generation. Other statistical indicators, such as omission errors, hallucination errors, addition errors, duplication errors, and extrinsic errors, are also applied in the table2text task. Although these metrics have obtained impressive results, human evaluation is still inevitable in table2text.

Research on ChatGPT
In recent years, from BERT (Devlin et al., 2019) to ChatGPT (OpenAI, 2022), a large number of pre-trained language models have been proposed one after another. Both their parameter counts and their abilities have grown steadily, facilitating more advanced techniques. In particular, ChatGPT, an intelligent conversational large language model that represents a revolutionary change, has sent shock waves through the research community and industry that continue to reverberate to this day. With the emergence of ChatGPT, two research interests related to it are growing: (1) leveraging ChatGPT to deal with various NLP tasks and evaluating its performance using traditional task-specific metrics (i.e., evaluation), and (2) using it as a metric to evaluate the outputs of other task-specific models (i.e., evaluator) (Kocmi and Federmann, 2023).
Evaluation. Generally, the evaluation tasks on ChatGPT can be divided into two categories, i.e., natural language understanding (NLU) and natural language generation (NLG). For NLU tasks, some researchers find that ChatGPT covers almost all NLU tasks (e.g., sentiment analysis, textual similarity and textual entailment) and achieves competitive or even better performance (Qin et al., 2023; Bang et al., 2023; Zhong et al., 2023). For NLG tasks, machine translation (Jiao et al., 2023), summarization (Yang et al., 2023), query generation (Wang et al., 2023b), and radiology report simplification (Jeblick et al., 2022) have been involved. Different from them, we regard ChatGPT as a human evaluator that automatically assesses the quality of general textual generations, rather than using it to solve tasks.
Evaluator. As for using ChatGPT as an evaluator, two studies evaluate the quality of translations (Kocmi and Federmann, 2023) and human personalities (Rao et al., 2023) by prompting ChatGPT. In this work, however, we aim to evaluate more general textual outputs to further show the ability of ChatGPT as a general NLG metric.

ChatGPT for NLG Evaluation
In this section, we discuss how to prompt ChatGPT to serve as a reference-free NLG metric (§3.1) or a reference-based NLG metric (§3.2) to evaluate the generation quality of NLG models. We take the news summarization task as an example and give the details of the prompt templates.

Reference-free Metric
To evaluate the generation quality of NLG models, we regard ChatGPT as a human evaluator and give it evaluation instructions via different prompts. Each prompt should specify (1) which NLG task (e.g., summarization) is to be evaluated and (2) which aspect (e.g., fluency) of the generated result is currently being assessed.
Inspired by Kocmi and Federmann (2023), we utilize the following two prompts: direct assessment (DA) and one-to-five stars ranking (star); the full templates are given at the end of this report. In the templates, [task-ins] and [aspect-ins] are the instructions for the current task and aspect, respectively; [aspect] and [ant-aspect] denote the evaluated aspect and its antonym, respectively; [Conditioned Text] is the input of the NLG models, while [Generated Text] is their output. For example, when evaluating news summarization models in terms of fluency, the DA prompt may look like this:

Score the following news summarization given the corresponding news with respect to fluency on a continuous scale from 0 to 100, where a score of zero means "disfluency" and score of one hundred means "perfect fluency". Note that fluency measures the quality of individual sentences, are they well-written and grammatically correct. Consider the quality of individual sentences.
News: [a news article]
Summary: [one generated summary]
Scores:

In this manner, both the details of the task and the evaluation aspect are given to ChatGPT. Next, ChatGPT gives its judgment (e.g., "Score: 70") and a corresponding illustrative description (e.g., "the summary covers the main points of the news, but ..."). A specific example is shown in Figure 1. Finally, the numerical scores can be extracted via several simple heuristic rules.
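The heuristic extraction step can be as simple as a regular expression over the reply text. A minimal sketch follows; the exact wording of ChatGPT's replies varies, so the patterns here are illustrative, not the rules used in the original experiments:

```python
import re
from typing import Optional


def extract_score(reply: str) -> Optional[float]:
    """Pull a numerical score out of a free-form ChatGPT reply.

    Tries an explicit "score: NN" pattern first, then falls back to the
    first number anywhere in the text; returns None if no number is found.
    """
    match = re.search(r"scores?\s*[:=]?\s*(\d+(?:\.\d+)?)", reply, re.IGNORECASE)
    if match is not None:
        return float(match.group(1))
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is not None:
        return float(match.group())
    return None
```

For instance, a reply like "Score: 70. The summary covers the main points ..." yields 70.0, while "I would give this summary 4 stars." falls back to the first number and yields 4.0.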

Reference-based Metric
In addition to the reference-free setting, we can explicitly mention the golden references in the prompts to make ChatGPT a reference-based NLG metric; the corresponding templates are also given at the end of this report. In this way, the ChatGPT evaluator makes its judgment and gives the evaluation results under the consideration of the golden references.
(1) The sample-level evaluation strategy calculates the correlation scores as follows:

$$\mathrm{Corr}_{\mathrm{sample}} = \frac{1}{n}\sum_{i=1}^{n} \rho\big(\,[f_{\mathrm{auto}}(g_{i,1}), \ldots, f_{\mathrm{auto}}(g_{i,M})],\; [f_{\mathrm{human}}(g_{i,1}), \ldots, f_{\mathrm{human}}(g_{i,M})]\,\big) \qquad (1)$$

where $\rho$ denotes a correlation metric such as Spearman correlation, and $f_{\mathrm{auto}}$ and $f_{\mathrm{human}}$ indicate the automatic evaluation and human judgment functions, respectively.
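The sample-level strategy can be sketched directly in pure Python, with a minimal Spearman implementation standing in for ρ (library versions such as `scipy.stats.spearmanr` handle ties more carefully):

```python
def spearman(x, y):
    """Minimal Spearman correlation between two score lists (assumes no ties)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def sample_level_corr(auto_scores, human_scores):
    """Average per-sample correlation over n conditioned texts.

    auto_scores[i][m] / human_scores[i][m]: score of the m-th model's
    output for the i-th conditioned text, i.e. f(g_{i,m}) in Eq. (1).
    """
    corrs = [spearman(a, h) for a, h in zip(auto_scores, human_scores)]
    return sum(corrs) / len(corrs)
```

Each conditioned text contributes one correlation over its M model outputs, and these per-sample correlations are averaged, exactly as in Equation (1).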

Baselines
We compare the ChatGPT evaluator with the following widely-used automatic NLG metrics to provide deeper analyses:
• ROUGE-1, ROUGE-2 and ROUGE-L (Lin, 2004) measure the lexical overlap between the generated text and the corresponding references based on unigrams, bigrams and the longest common subsequence, respectively.
• BERTScore (Zhang et al., 2020) and MoverScore (Zhao et al., 2019) evaluate semantic similarity via the pre-trained BERT model (Devlin et al., 2019).
• PRISM (Thompson and Post, 2020) evaluates NLG models via pre-trained paraphrase models.
• BARTScore (Yuan et al., 2021) is a state-of-the-art NLG metric based on the vanilla pre-trained BART model (Lewis et al., 2020).
• BARTScore+CNN (Yuan et al., 2021) is a variant of BARTScore whose underlying BART model is fine-tuned on the CNN/DM summarization dataset.

Text Summarization
We conduct meta-evaluation on SummEval (Fabbri et al., 2021), NewsRoom (Grusky et al., 2018) and RealSumm (Bhandari et al., 2020) to evaluate the performance of ChatGPT as an NLG metric for text summarization. SummEval collects 16 model-generated summaries on the CNN/DM dataset and annotates human judgments on these summaries covering the aspects of coherence, relevance, consistency and fluency. NewsRoom, as a text summarization dataset, also provides human judgments on 7 model-generated summaries, covering coherence, relevance, informativeness and fluency. RealSumm evaluates the pyramid (Nenkova and Passonneau, 2004) recall of 25 model-generated summaries.
The Potentiality of ChatGPT. Table 1 and Table 2 show the sample-level evaluation results on SummEval and NewsRoom, respectively (the dataset-level evaluation results on SummEval and NewsRoom are also shown in Table 4 and the accompanying appendix table). RealSumm adopts the pyramid method (Nenkova and Passonneau, 2004). In detail, this method first requires human evaluators to extract semantic content units from the golden references, and then scores each system summary based on how many of the extracted semantic content units are mentioned in it.
In this manner, the more similar a generated summary is to its golden reference, the higher the human evaluation score it receives. Therefore, this reference-oriented annotation method means that traditional n-gram-based metrics (such as ROUGE) already achieve good correlations with human judgments; we refer to this as lexical bias. As for SummEval and NewsRoom, human evaluators are required to directly score different summaries without comparing them with the golden references, and these datasets thus do not involve such lexical bias.
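Under some simplifying assumptions, pyramid recall reduces to a coverage ratio over semantic content units (SCUs). The toy sketch below uses exact substring matching as a stand-in for the human judgment of whether an SCU is "mentioned", which real pyramid annotation does manually:

```python
def pyramid_recall(summary: str, scus: list) -> float:
    """Toy pyramid recall: fraction of reference SCUs found in the summary.

    Real pyramid annotation relies on human matching of semantic content,
    not substring search; this sketch only illustrates the coverage ratio.
    """
    if not scus:
        return 0.0
    summary_lower = summary.lower()
    covered = sum(1 for scu in scus if scu.lower() in summary_lower)
    return covered / len(scus)
```

Because each SCU is extracted from the golden reference, a summary that reuses the reference's wording is mechanically favored, which is the lexical bias discussed above.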
The Impact of Different Prompts. In this work, we attempt four prompts to guide ChatGPT to evaluate the generations of NLG models. As the results show, the performance of ChatGPT is sensitive to the prompt design. For different aspects, the prompt should be carefully designed, just like formulating instructions for human evaluators.

Story Generation
Story generation is another NLG task, with more emphasis on open-ended generation than text summarization: for a given beginning of a story, various generated storylines with different plots can satisfy people. Therefore, story generation models are extremely challenging to evaluate. Automatic similarity-based metrics between the generated storylines and so-called references cannot fully evaluate the quality of the storylines, since they do not consider creativity.
To show the effectiveness of ChatGPT as an NLG metric for the story generation task, we conduct experiments on OpenMEVA-ROC (Guan et al., 2021). The OpenMEVA-ROC dataset manually annotates five model-generated storylines with respect to their overall quality. Related observations have also been discussed by Zhu et al. (2023), and we think this under-explored LLM research direction deserves more research attention.

Data-to-Text Generation
Data-to-text generation aims at generating a fluent free-text description for a given structured table. We conduct experiments on BAGEL (Mairesse et al., 2010) to show the effectiveness of the ChatGPT evaluator on data-to-text generation.
Table 7 shows the experimental results, where ChatGPT achieves competitive correlations compared with the previous state-of-the-art baselines, indicating its strong potential as a metric for data-to-text generation. It is worth noting that we do not provide reference-free ChatGPT performance in terms of informativeness, because informativeness in BAGEL is defined as "whether the system generation contains all the information in the gold reference", which means that the golden references must be given when evaluating informativeness.

Conclusion
In this technical report, we explore a research question: "Is ChatGPT a good NLG evaluator?" To this end, we design task-specific as well as aspect-specific prompts to guide ChatGPT to perform as an NLG metric. Experimental results on five widely-used meta-evaluation datasets, covering text summarization, story generation and data-to-text tasks, show the potential of ChatGPT as an NLG metric. ChatGPT achieves new state-of-the-art correlations (with human judgments) on the SummEval and OpenMEVA meta-evaluation datasets, and obtains competitive results on the NewsRoom and BAGEL datasets.
In addition, we find that the lexical biases involved in the meta-evaluation datasets influence the effectiveness of NLG metrics and might lead to limited performance of the ChatGPT evaluator. Besides, the performance of ChatGPT as an NLG evaluator is sensitive to the format of the prompt; for different tasks and aspects, the prompt should be carefully designed.
We believe that ChatGPT will exceed its current performance and provide a reliable NLG metric for the research community in the near future.

Limitations
While we show that ChatGPT achieves state-of-the-art or competitive correlation with human judgments on various NLG tasks, there are limitations that provide avenues for future work: (1) ChatGPT's performance as an NLG metric depends on the prompts, and future work could explore more powerful prompts to achieve better performance. (2) This preliminary report misses experiments on some mainstream NLG tasks, e.g., dialogue generation and report generation. (3) When we did the experiments, OpenAI had not yet released the official ChatGPT API. Thus, we conducted the experiments on the ChatGPT website with the default temperature, making the results difficult to reproduce. All experiments related to ChatGPT were conducted between February 24 and February 27, 2023, and between March 17 and March 22, 2023. (4) The experiments are only conducted on English NLG meta-evaluation datasets, and future work could extend this method to other languages or cross-lingual scenarios. (5) The correlation between the ChatGPT evaluator and humans is also related to the quality and difficulty of the corresponding meta-evaluation datasets. Our experiments are conducted on traditional NLG meta-evaluation datasets (which appeared before the LLM era). Recently, Zeng et al. (2023) proposed LLM-BAR, a challenging meta-evaluation benchmark to test the ability of an LLM evaluator. Future work could adapt our method to other challenging datasets and study the performance of the ChatGPT evaluator.

(DA Prompt)
Score the following [task-ins] with respect to [aspect] on a continuous scale from 0 to 100, where a score of zero means "[ant-aspect]" and score of one hundred means "perfect [aspect]". Note that [aspect] measures [aspect-ins].
[Conditioned Text]
[Generated Text]
Scores:

(Star Prompt)
Score the following [task-ins] with respect to [aspect] with one to five stars, where one star means "[ant-aspect]" and five stars means "perfect [aspect]". Note that [aspect] measures [aspect-ins].
[Conditioned Text]
[Generated Text]
Stars:
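The DA template above can be filled in programmatically. A minimal sketch follows; the slot names mirror the bracketed placeholders in the template, and the helper name is illustrative rather than the paper's actual tooling:

```python
DA_TEMPLATE = (
    'Score the following {task_ins} with respect to {aspect} on a continuous '
    'scale from 0 to 100, where a score of zero means "{ant_aspect}" and score '
    'of one hundred means "perfect {aspect}". Note that {aspect} measures '
    "{aspect_ins}.\n\n{conditioned_text}\n{generated_text}\nScores:"
)


def build_da_prompt(task_ins, aspect, ant_aspect, aspect_ins,
                    conditioned_text, generated_text):
    """Fill the DA template's slots with task- and aspect-specific text."""
    return DA_TEMPLATE.format(
        task_ins=task_ins, aspect=aspect, ant_aspect=ant_aspect,
        aspect_ins=aspect_ins, conditioned_text=conditioned_text,
        generated_text=generated_text,
    )
```

Instantiating it with the fluency instruction from Section 3.1 reproduces the news summarization example prompt; the star and reference-based variants differ only in their template strings.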

(DA Prompt w/ Reference)
Score the following [task-ins] with respect to [aspect] on a continuous scale from 0 to 100, where a score of zero means "[ant-aspect]" and score of one hundred means "perfect [aspect]". Note that [aspect] measures [aspect-ins].
[Conditioned Text]
Human reference: [A Reference]
[Generated Text]
Scores:

The star prompt with reference is formed in a similar way:

(Star Prompt w/ Reference)
Score the following [task-ins] with respect to [aspect] with one to five stars, where one star means "[ant-aspect]" and five stars means "perfect [aspect]". Note that [aspect] measures [aspect-ins].
[Conditioned Text]
Human reference: [A Reference]
[Generated Text]
Stars:
Correlation Measures. We evaluate how well automatic metrics correlate with human judgment. Three widely-used correlation measures are adopted: (1) Spearman correlation (Zar, 2005) assesses the monotonic relationship between two variables; (2) Pearson correlation (Mukaka, 2012) measures the linear relationship between two sets of data; (3) Kendall's Tau (Kendall, 1938) evaluates the ordinal association between two measured quantities.

Evaluation Strategy. When calculating the correlation scores, there are different aggregation methods. Given a set of conditioned texts {c_1, c_2, ..., c_n} (e.g., source documents in the text summarization task) and M NLG models, the generated text of the m-th model for the i-th conditioned text is denoted as g_{i,m}.
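Of these correlation measures, Kendall's Tau is the easiest to write down: it compares concordant versus discordant pairs of scores. A minimal O(n²) sketch (the Tau-a variant, ignoring ties; `scipy.stats.kendalltau` implements the tie-aware Tau-b):

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's Tau-a: (concordant - discordant) / total pairs (no ties)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1    # the pair is ordered the same way in x and y
        elif sign < 0:
            discordant += 1    # the pair is ordered oppositely
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)
```

A value of 1 means the automatic metric ranks every pair of generations exactly as the human annotators do, and -1 means it reverses every pair.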

Table 2: Sample-level Spearman (Spear.), Pearson (Pear.) and Kendall's Tau (Kend.) correlations of different aspects on NewsRoom (a text summarization meta-evaluation dataset). "Avg." indicates the average performance. Bold indicates the best correlation.

Table 7: Dataset-level Spearman (Spear.), Pearson (Pear.) and Kendall's Tau (Kend.) correlations of different aspects on BAGEL (a data-to-text generation meta-evaluation dataset). "Avg." indicates the average performance. Bold indicates the best correlation.