Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks

Evaluation of Large Language Models (LLMs) is a patchy and inconsistent landscape, and it is becoming clear that the quality of automatic evaluation metrics is not keeping pace with the development of generative models. We aim to improve the understanding of current models' performance by providing a preliminary hybrid evaluation of a range of open- and closed-source generative LLMs on three NLP benchmarks: text summarisation, text simplification and grammatical error correction (GEC), using both automatic and human evaluation. We also explore the potential of the recently released GPT-4 to act as an evaluator. We find that ChatGPT consistently outperforms many other popular models according to human reviewers on the majority of metrics, while scoring much more poorly when using classic automatic evaluation metrics. We also find that human reviewers rate the gold reference as much worse than the best models' outputs, indicating the poor quality of many popular benchmarks. Finally, we find that GPT-4 is capable of ranking models' outputs in a way which aligns reasonably closely with human judgement despite task-specific variations, with lower alignment on the GEC task.


Introduction
In recent years, Large Language Models (LLMs), particularly Transformer-based ones (Vaswani et al., 2017; Devlin et al., 2019), have shown remarkable abilities across a wide range of NLP tasks. With the recent advances in the capabilities of general-purpose generative models (Brown et al., 2020; Touvron et al., 2023), a range of NLP tasks can be reformulated as generation tasks.
Robust evaluation is still an unsolved problem, and established automatic evaluation metrics have been found to be poor surrogates, correlating weakly with human judgement (Coyne et al., 2023). There is often no clear consensus on how these models should be evaluated (Mousavi et al., 2022). Human evaluation has often been considered the trusted evaluation method, though issues with human evaluation have also been widely acknowledged (Iskender et al., 2021); for example, it can be difficult to reproduce (Cumbicus-Pineda et al., 2021). Nonetheless, a human evaluation study remains one of the best tools to sensibly assess any bias or limitation of automatic metrics (Liang et al., 2022).
Recent evaluation work has often focused on a single task (Zhang et al., 2023; Coyne et al., 2023), a single model (Bang et al., 2023), a single dataset (Gilardi et al., 2023) or automatic evaluation only (Liang et al., 2022). In this work, we carry out a multi-dataset, multi-model, multi-task hybrid evaluation using automatic metrics, human evaluation, and model-to-model evaluation with GPT-4 (OpenAI, 2023). We explore the open- and closed-source LLM space to sample the current landscape of available models and evaluate them on the following sequence-to-sequence tasks, reframed as text generation tasks without the requirement for task-specific fine-tuning: text summarisation, text simplification, and grammatical error correction (GEC).
These are our main findings. Firstly, we show that traditional reference-based evaluation metrics are inadequate at predicting or replacing human judgement. It is unclear whether this is due to the limitations of the metrics, to the poor quality of references in large open-source datasets, or both. While automatic metrics might have been an adequate proxy to evaluate previous models, they seem unable to reliably capture the performance of latest-generation LLMs, which now generate acceptable output that is significantly different from the gold reference. Secondly, we show that even open-source models outperform the gold-standard references of large and well-established datasets according to human evaluators. This shows that data quality is now one of the main bottlenecks in evaluation research. Finally, we reveal that GPT-4 aligns reasonably well with human judgement when ranking different models on most tasks and metrics; we did, however, observe some variation, with lower alignment on some metrics than on others. Our code is available at https://github.com/protagolabs/seq2seq_llm_evaluation.

Datasets
For text simplification, we used the Newsela test set (Xu et al., 2015), in particular the version used by Jiang et al. (2020). We randomly selected 3,000 samples after removing redundant samples (see Appendix A). For text summarisation, experiments were run on 3,000 random samples taken from the CNN / DailyMail test set (Hermann et al., 2015; Nallapati et al., 2016). For GEC, we used the BEA-2019 Shared Task (Bryant et al., 2019) development set, comprising 4,384 samples.
For implementation details, prompt engineering and hyper-parameter tuning, refer to Appendix B.

Evaluation Metrics
We analysed models' outputs using both automatic metrics and human evaluation, and assessed the ability of the recently released GPT-4 model to act as a reviewer.

Automatic Evaluation
We used the most widely adopted reference-based metrics for each of the tasks. For text simplification, we report the SARI score (Xu et al., 2016). For text summarisation, we report the ROUGE score (Lin, 2004); following Phang et al. (2022), we compute the geometric mean of ROUGE-{1, 2, L} F1 scores.
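As an illustration, the geometric-mean aggregation of the three ROUGE F1 scores (following Phang et al., 2022) can be sketched as below; the F1 values here are placeholders and would in practice come from a ROUGE implementation such as the rouge-score package:

```python
def rouge_geometric_mean(rouge1_f1: float, rouge2_f1: float, rougeL_f1: float) -> float:
    """Geometric mean of ROUGE-1, ROUGE-2 and ROUGE-L F1 scores."""
    return (rouge1_f1 * rouge2_f1 * rougeL_f1) ** (1.0 / 3.0)

# Example with placeholder corpus-level F1 values:
score = rouge_geometric_mean(0.40, 0.18, 0.37)
```

Note that the geometric mean is zero whenever any single ROUGE variant is zero, so it penalises models that fail on any one granularity.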
For GEC, we report the F 0.5 score computed using the ERRANT toolkit (Bryant et al., 2017).
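ERRANT extracts edits and counts true/false positives and false negatives against the reference edits; for illustration, the F0.5 score itself (which weights precision twice as heavily as recall, as is conventional in GEC) reduces to the following, where the TP/FP/FN counts are assumed to come from ERRANT's comparison step:

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """F-beta score from edit counts; beta = 0.5 favours precision over
    recall, the standard choice for GEC evaluation."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```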

Human Evaluation
Due to budgetary and time constraints, we recruited 3 human reviewers through the Prolific platform and asked them to review the quality of the models' outputs, as well as the gold reference, on 100 randomly selected samples per dataset. All three reviewers were asked to annotate the same 100 samples for each of the three tasks. The studies were conducted on a customised version of the open-source POTATO annotation tool (Pei et al., 2022). For human evaluation of text summarisation, we followed the evaluation criteria and their definitions as adopted in Fabbri et al. (2021): Relevance, Fluency, Coherence and Consistency, on a 5-point Likert scale (Likert, 1932) from 1 to 5.
For text simplification, we followed the evaluation criteria and their definitions as adopted in Grabar and Saggion (2022): Semantics, Fluency and Simplicity, on a 5-point Likert scale. For GEC, we adopted the Over-correction criterion from Fang et al. (2023) and introduced two new criteria: Semantics and Grammaticality. The definitions and assessment scales for these GEC criteria are detailed in Appendix C. The full set of instructions given to human reviewers for all tasks can be found in our GitHub repository linked above.

GPT-4 as a Reviewer
We used GPT-4 as an additional reviewer to assess whether it can be reliably deployed in place of human reviewers. The definitions of the evaluation criteria and their assessment scales were included in the GPT-4 prompt together with the input text for each sample.10 GPT-4 was also asked to annotate the same 100 samples that were shown to human reviewers for each of the three tasks. The full prompts given to GPT-4 for all tasks can also be found in our GitHub repository linked above.

Automatic Evaluation Results
Results are shown in Table 1. In order to allow a comparison between open-source and paid-for models' performance, for each task we report the best open-source model and two commercial models from OpenAI.11 For text summarisation, T0pp significantly outperformed GPT-3 and ChatGPT (p < 0.001). For text simplification, Flan-T5 and InstructGPT yield the best results, significantly outperforming ChatGPT (p < 0.001).
We also observed that, for each task, the same prompt seemed to perform best for all models and temperature settings, with only one exception, suggesting that the quality of prompts is almost model-invariant. See Appendix D for more details.
10 Occasionally GPT-4 returned a score of 4.5, and we converted 4.5 to 4 for evaluation purposes (6 out of 3,000 cases).
11 More detailed results are in Appendix D.

Human and GPT-4 Evaluation Results
Human reviewers and GPT-4 were shown 4 outputs per sample: the outputs from the models in Table 1 and the gold standard. They were asked to score each model's output on the metrics and scales described in section 3.2. We then converted their scores to rankings for each model and each reviewer, from best (1) to worst (4), and took the average. The rankings from human evaluation and GPT-4 evaluation (in brackets) are shown in Table 2, alongside the interval Krippendorff α coefficient (Krippendorff, 2011) to express inter-annotator agreement.
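The score-to-ranking conversion can be sketched as follows (model names and scores are hypothetical; tied models receive the average of the ranks they span, matching the 1.50 convention used in Table 2):

```python
def scores_to_ranks(scores: dict[str, float]) -> dict[str, float]:
    """Convert per-model scores to ranks (1 = best); tied models share
    the average of the ranks they would occupy."""
    ranks = {}
    for model, s in scores.items():
        higher = sum(1 for v in scores.values() if v > s)
        ties = sum(1 for v in scores.values() if v == s)
        ranks[model] = higher + (ties + 1) / 2
    return ranks

# e.g. one reviewer's Fluency scores for four outputs (hypothetical values):
ranks = scores_to_ranks({"ChatGPT": 5, "T0pp": 4, "GPT-3": 4, "gold": 2})
```

Averaging these per-reviewer ranks across samples yields the values reported in Table 2.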
The raw scores and a more detailed set of Krippendorff α coefficients based on individual annotator pairs are shown in Appendix E. There is generally very good inter-annotator agreement, with an average Krippendorff α of 0.88 across all metrics, the lowest being 0.62. On text summarisation, most reviewers scored ChatGPT as the best for Relevance and Fluency, and all reviewers scored ChatGPT as the best model for Coherence and Consistency, while ChatGPT had a worse ROUGE score than other models under automatic evaluation (see Table 1). Interestingly, all human reviewers scored the gold reference summaries as the worst on all metrics. This reveals the poor quality of the reference summaries when compared to most models' outputs, meaning reference-based automatic metrics can produce unreliable results. It is therefore not surprising that ChatGPT outputs were ranked the worst by automatic metrics in text summarisation and simplification, but the best by human evaluators. For text simplification, ChatGPT was rated the best model by all reviewers for Fluency and Simplicity, while it was rated poorly for Semantics, for which the best model was Flan-T5. We observed that this was due to Flan-T5 returning many outputs which were identical to the inputs: the semantics was trivially fully preserved, but without any actual simplification. The gold standard was scored as worst by all reviewers.
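For reference, interval Krippendorff α can be computed from the coincidence-matrix formulation as sketched below; this is a minimal implementation assuming complete data (published packages such as `krippendorff` additionally handle missing ratings and other distance metrics):

```python
def krippendorff_alpha_interval(units: list[list[float]]) -> float:
    """Interval Krippendorff alpha for a list of units, each holding the
    ratings all annotators gave that unit (no missing values)."""
    pairable = [u for u in units if len(u) >= 2]
    n = sum(len(u) for u in pairable)
    # Observed disagreement: squared differences within each unit,
    # normalised by the number of pairable values per unit.
    d_obs = 0.0
    for u in pairable:
        m = len(u)
        d_obs += sum((a - b) ** 2 for a in u for b in u) / (m - 1)
    d_obs /= n
    # Expected disagreement: squared differences across all values pooled.
    values = [v for u in pairable for v in u]
    d_exp = sum((a - b) ** 2 for a in values for b in values) / (n * (n - 1))
    return 1.0 - d_obs / d_exp if d_exp > 0 else 1.0
```

Perfect agreement yields α = 1, while systematic disagreement can push α below 0.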
We had substantially different results for GEC, where ChatGPT was rated the best model by human reviewers for Grammaticality (meaning all or most errors were fixed) but was rated as the worst or second-worst model for Semantics and Over-correction, for which the best model was OPT-IML. This underlines how ChatGPT tends to over-correct, and in doing so may add information to the sentence that was not originally present, which is consistent with recent findings (Fang et al., 2023; Wu et al., 2023). The gold reference was scored mostly as second worst on most metrics and by most reviewers.
For both text summarisation and simplification, GPT-4 used as a reviewer produced surprisingly good results which correlate well, albeit not perfectly, with human reviewers. We observed a stronger disagreement between human reviewers and GPT-4 on GEC. It is also worth noting that we did not observe the systematic positional bias reported by Wang et al. (2023) when using GPT-4 as a reviewer; we postulate that averaging the scores across the samples and using rankings instead of absolute scores helped to dampen this effect. If we include the GPT-4 evaluation, the average Krippendorff α is 0.70 across all metrics, with the lowest being 0.34.

Conclusion
Model evaluation is a topic attracting increasing interest from the community. Liang et al. (2022) have recently published an extensive evaluation report on LLMs; however, they mostly focused on automatic evaluation. Prompted by the recent advances in the generative capabilities of the latest LLMs, we conducted this study to explore the drift between human judgement and automatic, reference-based evaluation of zero-shot model performance. We also explored model-to-model evaluation with GPT-4. The study was conducted using large, open-source datasets which often act as benchmarks for their respective tasks.
Our work reveals a systematic misalignment between reference-based automatic metrics and human evaluation on a range of generative tasks, highlighting the inadequacy of the gold references in public NLP benchmarks. It is not clear whether this misalignment is purely due to the limitations of automatic metrics, or whether poor reference quality makes any reference-based comparative metric unreliable. Not only was ChatGPT rated one of the best models on most metrics by human reviewers, but the best open-source LLMs also consistently outperformed the reference outputs. We also explored the potential of GPT-4 to act as a reviewer and found that it correlates strongly with human judgement for the summarisation and simplification tasks, and moderately for GEC.
Future work will look at improving the quality of prompts, providing few-shot in-context learning (Brown et al., 2020), or exploring the potential of chain-of-thought prompting (Wei et al., 2022) to improve models' outputs. Given the misalignment mentioned above, extending human evaluation to larger datasets and to a wider range of model settings will also be of particular interest, so as to minimise the bias introduced when using automatic metrics to select a subset for human evaluation. Finally, introducing multiple automatic evaluation metrics (e.g. reference-less ones) for each task might help deepen our understanding of the relation between such metrics and human judgement.

Limitations
This paper suffers from the following limitations:
• A limited amount of prompt tuning and prompt-space investigation was carried out. Between 2 and 5 different prompts per task were tried; a more focused study on prompt engineering could therefore bring significant improvements. However, this is a stand-alone exploration topic, which we leave for future work.
• We did not perform any in-context learning or chain-of-thought prompting, which have been shown to significantly improve the performance of generative models. As such, there may be room for improving the quality of the models' outputs, while the quality of the gold references will remain unchanged until new datasets become available.
• We used automatic metrics (SARI, ROUGE and F0.5) to determine the best combination of settings (model, prompt, temperature) for each task. However, since this study revealed poor correlation between human judgement and such metrics, we cannot exclude that the settings we chose for human evaluation were not the most appropriate; the study may therefore have suffered from some bias indirectly introduced by using automatic metrics to select outputs for the human evaluation study. This is further aggravated by traditional open-source datasets presenting only one gold reference output per sample when multiple equally valid outputs could exist, leading to unreliable scores; for example, two summaries of the same story can both be very good yet contain few common bigrams, leading to a poor ROUGE score under automatic evaluation.
• Given the wide variety of text corpora on which most of the models we used were pretrained, it is very likely that at least some of the models were trained on some of the open-source datasets we used to evaluate them. While this is difficult to mitigate (for example, OpenAI did not publish a list of the datasets used to train their models), our results might have been affected by it, and using new, unreleased datasets would have been preferable to reduce this bias. However, this was not possible due to the highly expensive and time-consuming nature of creating high-quality large datasets from scratch, which is a well-known issue across the research community.
• While we did not use the same model for both inference and evaluation, we used GPT-4 for the evaluation of all models, including the outputs from ChatGPT. Considering that they belong to the same family of OpenAI models, GPT-4 might have a bias towards rating ChatGPT's outputs higher than other models'. However, our results could neither validate nor refute this, as human reviewers also rated ChatGPT outputs as the best across most metrics.
• Due to time and budgetary constraints, we were only able to hire 3 reviewers (not including GPT-4) and asked them to annotate 100 samples per dataset, which is a small proportion of each dataset. Due to the small number of reviewers and reviewed samples, the low signal-to-noise ratio may affect the strength and generalisability of our findings. Furthermore, using human evaluation as the gold standard is also prone to introducing bias. However, we found that in most cases all annotators agreed that the gold standard was worse than the best models' outputs, so we believe this is a valid conclusion, given how consistent it was across different tasks and annotators.

Ethics Statement
Our work makes use of LLMs, and there are known concerns associated with such models (Bender et al., 2021), including data bias, toxicity of training content or outputs, their environmental impact, the lack of explainability of their outputs, and the potential to replace human workers with resulting job losses. We did not perform any fine-tuning as part of this project, and only used open-source datasets. Some of the OpenAI models we used are not open-source, and their overall impact on society is only starting to become apparent. Overall, we believe this research does not increase the risk of harm caused by these models or datasets, as we only explored their limitations and performance. We employed 3 human annotators through the Prolific platform for a 16-hour study. Reviewers were paid £13.20 per hour, not including Prolific's fees. We did not collect any personal information beyond the demographic data provided by Prolific, including age, profession and gender, amongst others.
While Prolific does provide such data, we did not use them as screening criteria, and only adopted the screening criteria mentioned in section 3.2.All annotators were provided with a detailed description of the study before committing to take part.

A Newsela Dataset Processing
We observed that the ACL 2020 version (Jiang et al., 2020) of the Newsela dataset (Xu et al., 2015) contains a number of samples where either the source (input) or the destination (reference) is duplicated. In such cases, based on our observations, it was appropriate to merge them into a single sample. If the source was a duplicate but the destination was not, we kept the source without duplication and created the destination by merging the two original destination samples, in the order in which they appear in the dataset. Likewise if the destination was a duplicate but the source was not.

See the example below:
• Original dataset, sample 1
  - Source: Ron Bee , a professor at San Diego State University , is worried that so few Americans serve in the military .
  - Destination: Ron Bee is a professor in California , and he is worried .
• Original dataset, sample 2
  - Source: Ron Bee , a professor at San Diego State University , is worried that so few Americans serve in the military .
  - Destination: Very few people join the military now .
• Our merged sample
  - Source: Ron Bee , a professor at San Diego State University , is worried that so few Americans serve in the military .
  - Destination: Ron Bee is a professor in California , and he is worried . Very few people join the military now .
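The merging procedure described above can be sketched as follows (a simplified illustration operating on (source, destination) pairs for the duplicate-source case; duplicate destinations with distinct sources were handled symmetrically):

```python
from collections import OrderedDict

def merge_duplicate_sources(pairs):
    """Merge samples sharing the same source by concatenating their
    destinations in the order they appear in the dataset."""
    merged = OrderedDict()
    for source, destination in pairs:
        if source in merged:
            # Duplicate source: append this destination to the existing one.
            merged[source] = merged[source] + " " + destination
        else:
            merged[source] = destination
    return list(merged.items())
```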

B Implementation Details
Due to time and budgetary constraints, the full-scale experiments were performed using the most promising settings after a preliminary study conducted on a subset of each dataset (100 samples) across a much broader range of settings. We experimented with a range of prompts and temperature values to better explore the capabilities of each model. The final settings are task-dependent; for example, we empirically observed that lower temperature values always gave the best outcomes for text summarisation and simplification, whereas for GEC it was beneficial to use higher values for some models.

B.1 Prompt Engineering
The following prompts were used, where \n indicates a newline and [...] indicates the input sample; for each of the three tasks, we report the best prompt, i.e. the prompt whose output was used for our evaluation work, at the top (prompt (a)).
The same prompt yielded best results regardless of model and temperature, with extremely limited exceptions.

1. Text summarisation
(a) Summarize the following text. [...] \n The summary is:
(b) [...] \n Summarize the text above.
(c) Summarize the following text. [...] \n The very short summary is:
(d) This is the main story: [...] \n The summarized version of the story is:

2. Text simplification
(a) Simplify the following text. [...] \n The simplified version is:
(b) This is the main story: [...] \n The simplified version of the story is:
(c) Simplify the following text. [...] \n The explanation to a 5 year old could be:

3. Grammatical error correction
(a) Reply with a corrected version of the input sentence with all grammatical and spelling errors fixed. If there are no errors, reply with a copy of the original sentence. \n\n Input sentence: [...] \n Corrected sentence:
(b) Correct the following to standard English: \n\n Sentence: [...] \n Correction:

When using GPT-4 as a reviewer, we prompted GPT-4 to output text following strict JSON formatting rules so that its output could be processed programmatically at scale. When it failed to do so, we re-ran the evaluation on that specific sample until the output was in the desired format; this succeeded mostly at the first attempt and occasionally after 2-3 attempts, as GPT-4's output is non-deterministic.
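The retry loop for enforcing well-formed JSON output can be sketched as below (the `query_model` callable is a stand-in for the actual GPT-4 API call):

```python
import json

def evaluate_with_retries(query_model, prompt, max_attempts=5):
    """Re-query the model until its reply parses as JSON, since the
    model's output is non-deterministic and occasionally malformed."""
    for _ in range(max_attempts):
        reply = query_model(prompt)
        try:
            return json.loads(reply)
        except json.JSONDecodeError:
            continue  # malformed reply: try again
    raise RuntimeError("no well-formed JSON reply after retries")
```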

B.2 Hyperparameter Tuning
We experimented with the following temperature values: 0.0 (we used 0.01 for HuggingFace models due to implementation requirements), 0.2, 0.5 and 0.7. We observed that for text simplification and summarisation the lowest value always yielded the best results, whereas for GEC some combinations of models and prompts yielded better results at temperatures of 0.2 or 0.5, despite the best overall combination being at a temperature of 0.0 even for GEC. For all other hyper-parameters, we used the default settings for each model without modifications.

B.3 Tokenization and Truncation
While the Newsela and BEA-2019 dataset samples are all below 512 tokens, the samples from CNN / DailyMail have a broader distribution, with 80.6% exceeding 482 tokens and 9.8% exceeding 1,506 tokens. Different models and implementations have different maximum sequence lengths. Furthermore, while OpenAI models count the total number of input and output tokens towards their maximum sequence length, HuggingFace models have two separate limits for input and output tokens respectively. In order to facilitate the inference process, we used the following heuristics to tailor design decisions to each model and maximise performance:
• For GPT-3, which accepts up to 4,000 combined input and output tokens, we did not perform any truncation, as the longest sample had 2,571 tokens.
• For InstructGPT, which accepts up to 2,049 combined input and output tokens, we truncated the input after 1,506 tokens. This leaves 512 tokens for the generated output, as well as a further 31 tokens for the prompt (it is imperative not to truncate the portion of the prompt at the end of the input).
• For HuggingFace models accepting inputs up to 512 tokens (excluding the output), we truncated at 482 tokens to leave space for the prompt; for HuggingFace models accepting inputs up to 2,048 tokens, we truncated at 2,018 tokens.
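The truncation heuristic can be illustrated as below; whitespace tokenisation stands in for each model's real tokeniser, and the default budgets mirror the InstructGPT case above:

```python
def truncate_input(text: str, max_total: int = 2049,
                   output_budget: int = 512, prompt_budget: int = 31) -> str:
    """Truncate the input so that the prompt and the generated output
    still fit within the model's combined token limit."""
    input_budget = max_total - output_budget - prompt_budget  # 1506 tokens
    tokens = text.split()  # crude stand-in for the model's tokeniser
    return " ".join(tokens[:input_budget])
```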

C Human Evaluation Criteria for GEC
The criteria and their definitions and assessment scales given to reviewers for the GEC task are reported below.
• Semantics. This assesses whether the meaning of the text is preserved following the GEC. Semantic preservation is assessed on a 5-point Likert scale from 1 (Meaning Not Preserved) to 5 (Meaning Fully Preserved). NOTE: You should penalise corrections which change the meaning unnecessarily. For example, the sentence "I wentt at Rome for my birthday" should be corrected to "I went to Rome for my birthday". A correction such as "I went to Rome for my anniversary" should be penalised in this category, as it introduces an unnecessary change to the meaning.
• Grammaticality. This assesses the quality of the correction and answers the question "How many errors are left in the corrected sentence?". Please provide a count of the remaining errors, regardless of whether they were present in the source or were newly introduced in the supposedly corrected version. The three options are "0", "1", and "2 or more".
• Over-correction. Since there can be multiple ways to correct a sentence, this assesses whether the correction is unnecessarily verbose or makes unnecessary syntax changes.
The best correction should be achieved with the minimum number of edits. For example, if the sentence "I wentt at Rome for my birthday" is corrected to "I decided to go to Rome for my birthday", this should be penalised under this category because it contains unnecessary syntax changes, even though the final sentence is grammatically correct. This metric answers the question: is the system over-correcting or making unnecessary syntax changes? The answers should be "No", "Minor over-correction", "Moderate over-correction" or "Substantial over-correction".
Note that a correction which results in a change of meaning will most likely also be an over-correction. We therefore expect that if a correction is given a poor score in the Semantics category, it will also receive a poor score in the Over-correction category, and as such there may be some overlap between these two metrics. However, the reverse is not necessarily true, as an over-correction can occur without a change of meaning. For example, correcting "I wentt at Rome for my birthday" to "I decided to go to Rome for my birthday" does not significantly affect the meaning of the sentence, but it nonetheless represents a clear case of over-correction, as "wentt at" should have been corrected to "went to" rather than "decided to go to". As such, we felt there was value in keeping these two metrics separate.

D Detailed Automatic Evaluation Results
Table 3 shows the average results of the experiments we ran on the summarisation dataset, for each model, temperature and prompt. Refer to Appendix B.1 for prompt details. Table 4 shows the average results of the experiments we ran on the simplification dataset.

Table 1: Automatic evaluation of the best open-source model and two commercial models from OpenAI. Results are shown both on the main subset and on the small subset used for human evaluation. † Due to the specifics of the HuggingFace implementation, a temperature of 0.0 cannot be used; we therefore used a value of 0.01 in such cases.

Table 2: Average human evaluation rankings per model, task and metric, where 1.00 means best model and 4.00 means worst model. GPT-4 rankings in brackets. When two models were ranked the same, results are shown as the average of the lower and upper bound (e.g. two tied best models are shown as 1.50 each). † α1 represents the interval Krippendorff α coefficient based on the 3 human annotators' rankings, while α2 includes GPT-4's rankings.

Table 3: Detailed automatic evaluation results on the text summarisation task.

Table 5 shows the average results of the experiments we ran on the GEC dataset.

Table 4: Detailed automatic evaluation results on the text simplification task.

Table 5: Detailed automatic evaluation results on the GEC task.