Exploring Automatic Evaluation Methods based on a Decoder-based LLM for Text Generation

Automatic evaluation of text generation is essential for improving the accuracy of generation tasks. In light of the current trend towards increasingly larger decoder-based language models, we investigate automatic evaluation methods based on such models for text generation. This paper compares various methods, including tuning with encoder-based models and large language models under equal conditions, on two different tasks, machine translation evaluation and semantic textual similarity, in two languages, Japanese and English. Experimental results show that compared to the tuned encoder-based models, the tuned decoder-based models perform poorly. The analysis of the causes for this suggests that the decoder-based models focus on surface word sequences and do not capture meaning. It is also revealed that in-context learning of very large decoder-based models such as ChatGPT makes it difficult to identify fine-grained semantic differences.


Introduction
Neural network-based text generation models are used in various natural language processing tasks, including machine translation, dialogue systems, and text summarization. However, the outputs from these models are open-ended, and there is no single correct answer, making the evaluation of generations difficult. Manual evaluation is often used due to its high accuracy but incurs significant temporal and financial costs. Therefore, automatic evaluation is essential for the rapid development of text generation models.
Automatic evaluation methods for text generation, such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004), have been based mainly on surface word overlaps between the generated text and the reference text. In recent years, with the development of self-supervised models such as BERT (Devlin et al., 2019) and BART (Lewis et al., 2020), more accurate automatic evaluation methods have been proposed. For example, BERTScore (Zhang et al., 2020) uses word embeddings obtained by these models. Such methods can be classified along two axes: whether the model used is an encoder-based, decoder-based, or encoder-decoder-based architecture of Transformer (Vaswani et al., 2017), and whether tuning is performed. While encoder-based methods with tuning are reported to be highly accurate (Rei et al., 2020), in-context learning without tuning is the mainstream in decoder-based methods.
In recent years, self-supervised decoder-based models have become larger and larger, as seen in GPT-4 (OpenAI, 2023), Megatron-Turing (Smith et al., 2022), and PaLM (Chowdhery et al., 2022). These decoder-based self-supervised large language models are referred to as LLMs in this paper. However, encoder-based models have remained relatively smaller than decoder-based ones.
Based on the above situation, this paper compares various methods, including tuning with encoder-based models and LLMs under equal conditions, on two different tasks, machine translation evaluation and semantic textual similarity (STS), in two languages, Japanese and English. The results revealed the following three observations.
1. When a decoder-based model is tuned, the accuracy increases with the model size up to a certain model size, but then reaches a ceiling.
2. Compared to tuned encoder-based models, tuned decoder-based models perform poorly.
3. In-context learning of very large decoder-based models such as ChatGPT (https://openai.com/chatgpt) has difficulty identifying fine-grained semantic differences.
The analysis of the causes for the poor performance of the tuned decoder-based models suggests that they focus on surface word sequences and do not capture meaning. Note that our study focuses on evaluation methods under the assumption that reference text is available.
arXiv:2310.11026v1 [cs.CL] 17 Oct 2023

Related Work
Automatic evaluation of text generation mainly requires the text generated by a model and the reference text. The classic automatic evaluation metrics, such as BLEU, ROUGE, METEOR (Banerjee and Lavie, 2005), and CIDEr (Vedantam et al., 2015), are based on the n-gram overlap between these two texts. The biggest disadvantage of these metrics is that they do not score well even when synonyms are included, as the n-grams must match exactly for a higher score. TER (Snover et al., 2006) and others that base their evaluation on edit distance have similar drawbacks. METEOR aims to overcome this drawback by using a synonym dictionary, but it is unable to perform context-sensitive synonym evaluation.
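The synonym problem of n-gram metrics can be illustrated with a minimal sketch. The code below computes clipped n-gram precision, one ingredient of BLEU, against a single reference; it is not the full BLEU metric (no brevity penalty, no geometric mean over n). The function name `ngram_precision` and the example sentences are assumptions made for illustration.

```python
from collections import Counter


def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference.

    Whitespace tokenization is an assumption made to keep the sketch short.
    """
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return matched / total


reference = "the film was great"
exact = "the film was great"
synonym = "the movie was great"  # same meaning, one word replaced by a synonym

p1_exact = ngram_precision(exact, reference, 1)
p1_syn = ngram_precision(synonym, reference, 1)
p2_syn = ngram_precision(synonym, reference, 2)
print(p1_exact, p1_syn, p2_syn)
```

A single synonym substitution removes one matching unigram and two of the three matching bigrams, even though the meaning is unchanged.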
Using embeddings derived from self-supervised models, synonyms can be judged to be similar based on their context. BERTScore (Zhang et al., 2020) is a method that embeds the generated text and the reference text respectively by an encoder-based model and calculates a score based on their similarity. BARTScore (Yuan et al., 2021) and T5Score (Qin et al., 2022) input the source text to the encoder and the target text to the decoder, and calculate a score based on the generation probability of the target text. GPTScore (Fu et al., 2023) calculates a score based on the generation probability of the target text by applying in-context learning (Brown et al., 2020) to an LLM. G-Eval (Liu et al., 2023) proposes a method to have an LLM generate scores directly. In addition, Chen et al. (2023) show that directly generated scores are more accurate than generation probability-based ones when using LLMs.
Other evaluation methods increase accuracy by fine-tuning a self-supervised model using datasets consisting of text pairs and their similarity labels. Models trained on translation evaluation datasets include BLEURT (Sellam et al., 2020) and COMET (Rei et al., 2020), while models trained on STS datasets include Sentence-BERT (Reimers and Gurevych, 2019). There are also methods such as SimCSE (Gao et al., 2021) that learn sentence embeddings by contrastive learning on natural language inference datasets and use them to calculate text pair similarity. Most of these self-supervised methods use encoder-based models. InstructScore (Xu et al., 2023) is a method of fine-tuning LLaMA (Touvron et al., 2023). However, the experiments of Xu et al. (2023) did not involve LLMs tuned on the target datasets and did not compare them to encoder-based models under equal conditions. In this study, we compare LLMs, which lack bidirectional attention but have a larger model size, with encoder-based models, which have bidirectional attention but a smaller model size, by tuning them under equal conditions.

Experimental Setup
We compare various methods for text generation evaluation, including tuned encoder-based models and LLMs under equal conditions, on two different tasks, machine translation evaluation and STS, in two languages, Japanese and English.

Datasets in English
For the experiments in English, we use WMT20 (Mathur et al., 2020) and WMT21 (Freitag et al., 2021) as the translation evaluation datasets, and STS-B (Cer et al., 2017) and SICK (Marelli et al., 2014) as the datasets for STS. WMT20 and WMT21 include human-translated texts, machine-translated texts, and their evaluation labels of Direct Assessment (DA) and Multidimensional Quality Metrics (MQM). In our experiments, we adopted the MQM labels that were evaluated by experts and native speakers. Since only the Chinese-to-English translation task is labeled with MQM, we use its datasets (WMT20 MQM and WMT21 MQM). STS-B and SICK consist of sentence pairs and their similarity labels. Note that for WMT20 and WMT21, the datasets were not pre-separated into train, valid, and test, and we randomly split these datasets with a ratio of 8:1:1.
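A random 8:1:1 split of this kind can be sketched as follows. The function name `split_8_1_1`, the fixed seed, and the integer rounding of the split sizes are assumptions for illustration; the paper does not specify these details.

```python
import random


def split_8_1_1(examples, seed=0):
    """Shuffle the examples and split them into train/valid/test at 8:1:1.

    The seed and the rounding rule are assumptions, not taken from the paper.
    """
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_valid = len(items) // 10
    train = items[:n_train]
    valid = items[n_train:n_train + n_valid]
    test = items[n_train + n_valid:]  # the remainder goes to the test set
    return train, valid, test


train, valid, test = split_8_1_1(range(1000))
print(len(train), len(valid), len(test))
```

Fixing the seed makes the split reproducible, which matters when several models are compared on the same test portion.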

Datasets in Japanese
The datasets used in the experiments in Japanese are the WMT20 English to Japanese translation task (WMT20 en-ja) and JSTS included in the Japanese General Language Understanding Evaluation (JGLUE) (Kurihara et al., 2022) benchmark. The WMT20 dataset includes human-translated texts, machine-translated texts, and their evaluation labels (Direct Assessment). JSTS is an STS dataset for Japanese, consisting of sentence pairs and their similarity labels. Note that WMT20 en-ja was randomly split at a ratio of train:valid:test=8:1:1 as in the English datasets.

Tuning of LLMs
For the LLM tuning method, we performed LoRA-tuning of LLMs using datasets of text pairs and their evaluation or similarity labels. We chose LoRA-tuning because it can achieve accuracy competitive with fine-tuning at a lower cost (Hu et al., 2021).
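As background, the core idea of LoRA (Hu et al., 2021) is to keep a pretrained weight matrix W frozen and learn a low-rank update, so that the layer computes W x + (alpha / r) B A x. The following is a minimal sketch of that forward pass in plain Python. The matrix values, the rank r = 1, and alpha = 2.0 are toy assumptions and are not the settings used in our experiments.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x); only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Toy values (assumptions): frozen W is 2 x 3, A is r x 3, B is 2 x r.
W = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]
A = [[0.5, -0.5, 1.0]]
B_zero = [[0.0], [0.0]]   # B starts at zero, so the output initially equals W x
B_tuned = [[1.0], [0.0]]  # a hypothetical value of B after some training
x = [1.0, 2.0, 3.0]

y_init = lora_forward(W, A, B_zero, x, alpha=2.0, r=1)
y_tuned = lora_forward(W, A, B_tuned, x, alpha=2.0, r=1)
print(y_init, y_tuned)
```

Because only A and B are updated, the number of trainable parameters grows with the rank r rather than with the size of W, which is the source of the lower tuning cost.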

Architecture and Input-Output Relationships
The architecture and input-output relationship of the LLM's tuning are shown in Figure 1. Given a text pair as an input to the model, their similarity value is returned as an output. The following procedure is used to calculate the similarity.
1. Feed each text of a text pair into an LLM.
2. Obtain the embedding corresponding to the token at the end of each text (the preceding token of the EOS token).
3. Calculate the cosine similarity between the two embeddings.
4. Pass the cosine similarity to a 1-layer FNN and regard its output as the similarity of the text pair.
The FNN layer is used to convert the cosine similarity values into a label distribution of the dataset. Based on the results of our preliminary experiments, we decided to use the embedding of the token at the end of a text instead of the special EOS token.
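The four steps above can be sketched in plain Python as follows. The hidden states are toy vectors standing in for LLM outputs, and the 1-layer FNN is modeled as a scalar affine map without an activation; both are assumptions for illustration, as are the function names.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def last_token_embedding(hidden_states):
    """Return the vector of the token preceding EOS.

    hidden_states holds one vector per token, with the EOS vector last.
    """
    return hidden_states[-2]


def pair_similarity(hidden_a, hidden_b, weight=1.0, bias=0.0):
    """Steps 2 to 4: last-token embeddings, cosine similarity, scalar FNN."""
    cos = cosine_similarity(last_token_embedding(hidden_a),
                            last_token_embedding(hidden_b))
    return weight * cos + bias


# Toy hidden states (assumptions): the last vector of each text plays the role of EOS.
hidden_a = [[0.1, 0.2], [1.0, 0.0], [9.0, 9.0]]
hidden_b = [[0.3, 0.3], [1.0, 1.0], [-9.0, 9.0]]
score = pair_similarity(hidden_a, hidden_b)
print(score)
```

In a decoder with unidirectional attention, only the final positions can attend to the whole text, which is one reason a single end-of-text embedding is used to represent it.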

Training Method
The gold labels (similarity values) in the dataset are normalized between 0 and 1 in advance. We calculate the similarity of a text pair using the procedure described in Section 3.2.1. Next, only the parameters newly added to the model (including the parameters of the FNN) are updated based on the mean squared error between the predictions and the gold labels. Furthermore, the initial values of the FNN are set to 1 for weight and 0 for bias. We employ LoRA-tuning as the tuning method of the LLM for its high performance.
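The training objective can be illustrated with a minimal sketch that normalizes labels, computes the mean squared error, and takes one gradient step on the FNN parameters only. The LoRA parameters are omitted, and the label range, cosine values, learning rate, and use of plain gradient descent are assumptions for illustration.

```python
def normalize_labels(labels, low, high):
    """Min-max normalize gold labels from [low, high] to [0, 1]."""
    return [(y - low) / (high - low) for y in labels]


def mse(preds, golds):
    """Mean squared error between predictions and gold labels."""
    return sum((p - g) ** 2 for p, g in zip(preds, golds)) / len(golds)


def fnn_sgd_step(cosines, golds, weight, bias, lr):
    """One gradient descent step on the scalar FNN (weight * cos + bias)."""
    n = len(golds)
    preds = [weight * c + bias for c in cosines]
    grad_w = sum(2 * (p - g) * c for p, g, c in zip(preds, golds, cosines)) / n
    grad_b = sum(2 * (p - g) for p, g in zip(preds, golds)) / n
    return weight - lr * grad_w, bias - lr * grad_b


# Hypothetical labels on a 0 to 5 scale and hypothetical cosine similarities.
golds = normalize_labels([0.0, 2.5, 5.0], low=0.0, high=5.0)
cosines = [0.6, 0.8, 0.95]

weight, bias = 1.0, 0.0  # initial FNN values as stated in the text
loss_before = mse([weight * c + bias for c in cosines], golds)
weight, bias = fnn_sgd_step(cosines, golds, weight, bias, lr=0.1)
loss_after = mse([weight * c + bias for c in cosines], golds)
print(loss_before, loss_after)
```

Starting from weight 1 and bias 0 means the FNN initially passes the cosine similarity through unchanged, so training begins from the raw embedding similarity.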
For experiments in English, we use the Cerebras-GPT models with parameter sizes ranging from 111M to 6.7B. These models are tuned on WMT20 MQM for the translation evaluation task and on STS-B for the STS tasks, respectively. In other words, the models trained with WMT20 MQM are evaluated on WMT20 MQM and WMT21 MQM, and the models trained with STS-B are evaluated on STS-B and SICK.
For experiments in Japanese, we use the GPT-2 and GPT-NeoX models developed by rinna, ranging from the 37M model to the 3.6B model. We trained models on each of the two datasets in Section 3.1.2.

Baselines
For comparison, we adopt the following baselines: BLEU, character edit distance, fine-tuned RoBERTa-large (Liu et al., 2019), BERTScore, BARTScore, OpenAI Embeddings (Neelakantan et al., 2022), in-context learning of ChatGPT (gpt-3.5-turbo), BLEURT, COMET, and InstructScore. For fine-tuned RoBERTa, as described in Section 3.2.2, we trained models on WMT20 MQM and STS-B for the English experiments and on the two datasets shown in Section 3.1.2 for the Japanese experiments, respectively. For BERTScore, the training data is used to select the best output layer to obtain the embeddings. For OpenAI Embeddings, the scores are the cosine similarity of the obtained embeddings. The prompt used in ChatGPT's in-context learning is shown in Appendix A. We also conducted a preliminary experiment with in-context learning of Cerebras-GPT as well as ChatGPT, but were unable to generate scores successfully. We assume that a model size of a few billion parameters is too small for in-context learning. We do not tune BLEURT, but instead use BLEURT-20 (Pu et al., 2021), which is trained in multiple languages. For COMET, we use the model trained on WMT21 MQM. We do not apply COMET to the STS datasets because COMET is a metric for automatic translation evaluation and requires three inputs: pre-translated text, human-translated text, and machine-translated text. Our hyperparameters for training are shown in Appendix B.
Note that BARTScore, COMET, and InstructScore only support English and hence are not used for experiments in Japanese.

Main Results
Kendall's correlation coefficients between the predictions by the automatic evaluation metrics and the gold labels in English and Japanese are shown in Tables 1 and 2, respectively. For all datasets in both languages, RoBERTa-large with fine-tuning achieved the highest accuracy. For LoRA-tuned LLMs, the accuracy tends to increase with the model size up to a certain model size, but then reaches a ceiling. Also, even models with overwhelmingly larger parameter sizes than RoBERTa-large showed low accuracy. For ChatGPT's in-context learning, the accuracy on the STS datasets was comparable to that of the tuning-based methods, but its accuracy on the translation evaluation datasets was low. Note that most of the p-values were very close to 0.
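For readers unfamiliar with the measure, Kendall's correlation compares the ordering of every pair of examples under the metric and under the gold labels. The sketch below implements the tau-b variant, which corrects for ties; which variant was used for the tables is not stated here, so this choice, the function name, and the toy scores are assumptions.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equally long score lists (O(n^2) pairs)."""
    n = len(x)
    concordant = discordant = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both lists: counted in neither term
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt((concordant + discordant + ties_x)
                      * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


# Toy values (assumptions), not results from the experiments.
gold = [0.1, 0.4, 0.6, 0.9]
preds = [0.2, 0.3, 0.8, 0.7]
tau = kendall_tau_b(preds, gold)
print(tau)
```

In this toy example five of the six pairs are ordered consistently and one is swapped, giving a coefficient of (5 - 1) / 6.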

Analysis of Why Tuned LLMs are Inferior
From Tables 1 and 2, we observe that LoRA-tuned LLMs, which have far more parameters than RoBERTa-large, are inferior in terms of performance. We analyze the causes of this from the experimental results in English.
The most significant difference between the two models is that RoBERTa, an encoder-based model, has bidirectional attention, while an LLM has unidirectional attention. Here, we hypothesized that unidirectional attention focuses more on surface word sequences as opposed to bidirectional attention. To confirm this hypothesis, we calculated the correlations of the predictions of RoBERTa and LLMs to BLEU and character edit distance, which are the metrics based on superficial word sequences. The results are shown in Table 3. As hypothesized, the results show that the correlations to both BLEU and edit distance are stronger for LLMs than for the encoder-based model. The fact that the correlation decreases as the model size increases in LLMs suggests that the larger the model size, the better the prediction is able to capture not only the surface word sequences but also the meaning of the text. However, even with a model size of 6.7B, the LLM is still not as accurate as RoBERTa.
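As a reference for the surface-level baseline used in this analysis, character edit distance can be computed with the standard Levenshtein dynamic program. The sketch below is a generic implementation; the function names and the conversion of the distance into a similarity in [0, 1] are assumptions, since the exact normalization is not described here.

```python
def char_edit_distance(a, b):
    """Levenshtein distance between two strings at the character level."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def surface_similarity(a, b):
    """Hypothetical normalization: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - char_edit_distance(a, b) / longest


distance = char_edit_distance("kitten", "sitting")
print(distance, surface_similarity("kitten", "sitting"))
```

A model whose predictions correlate strongly with such a score is, by construction, tracking character-level overlap rather than meaning.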

Analysis of the Inability of ChatGPT's In-context Learning
While ChatGPT's in-context learning showed high accuracy on the STS datasets, it did not perform well on the translation evaluation datasets. We analyze the causes of this from the experimental results in English. In our experiments, the prompts were created to score on a scale of 0 to 100. However, in the output scores, there were many cases where the last digit was 0 or 5 in both zero-shot and few-shot settings. Also, as shown in Figure 2, the label distributions of the translation evaluation datasets are skewed between 0.9 and 1.0, compared to the STS datasets, which have gently sloping distributions. Therefore, most of the predictions in the translation evaluation datasets fall on a few coarse values such as 95, and this is thought to have caused the accuracy drop. Thus, ChatGPT's in-context learning has difficulty in identifying fine-grained semantic differences.
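The effect of coarse scores on a skewed label range can be illustrated with a small synthetic sketch. The scores below are made-up values, not outputs from our experiments, and rounding to the nearest multiple of 5 is a simplified stand-in for the observed tendency of the last digit to be 0 or 5.

```python
def round_to_multiple_of_5(score):
    """Snap a 0 to 100 score to the nearest multiple of 5."""
    return 5 * round(score / 5)


def tied_pair_fraction(values):
    """Fraction of example pairs that receive exactly the same score."""
    n = len(values)
    pairs = n * (n - 1) // 2
    tied = sum(1 for i in range(n) for j in range(i + 1, n)
               if values[i] == values[j])
    return tied / pairs


# Synthetic fine-grained scores concentrated in the 90 to 100 range (assumption).
fine = [91, 92, 93, 94, 96, 97, 98, 99]
coarse = [round_to_multiple_of_5(s) for s in fine]

print(coarse)
print(tied_pair_fraction(fine), tied_pair_fraction(coarse))
```

When the gold labels are concentrated in a narrow range, coarse scores place many examples in the same bucket, and pairs within a bucket can no longer be ranked, which lowers a rank correlation such as Kendall's.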

Conclusion
In this paper, we compared various automatic evaluation methods for text generation in two languages, Japanese and English. We showed that fine-tuned encoder-based models are the strongest when training data is available, and in-context learning of ChatGPT is equally accurate when the variance of scores is large. Our analysis also revealed that tuned LLMs are less accurate than tuned encoder-based models because of their focus on surface word sequences.

Limitations
Our experiments assume the presence of a training dataset. If no dataset for training exists, refer to the results without the Tuning Method (Target Dataset) to compare the metrics in Tables 1 and 2.

Figure 1: The architecture and input-output overview of the LLM's tuning.

Figure 2: Label distribution of the test datasets used in the English experiments.

Table 1: Kendall's correlation coefficients between the predictions by the automatic evaluation metrics and the labels in the experiments in English.

Table 2: Kendall's correlation coefficients between the predictions by the automatic evaluation metrics and the labels in the experiments in Japanese.

Table 3: Kendall's correlations between the metrics based on superficial word sequences and the predictions by models with tuning in the experiments in English.