@inproceedings{sakai-etal-2024-toward,
title = "Toward the Evaluation of Large Language Models Considering Score Variance across Instruction Templates",
author = "Sakai, Yusuke and
Nohejl, Adam and
Hang, Jiangnan and
Kamigaito, Hidetaka and
Watanabe, Taro",
editor = "Belinkov, Yonatan and
Kim, Najoung and
Jumelet, Jaap and
Mohebbi, Hosein and
Mueller, Aaron and
Chen, Hanjie",
booktitle = "Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP",
month = nov,
year = "2024",
address = "Miami, Florida, US",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.blackboxnlp-1.31",
pages = "499--529",
abstract = "The natural language understanding (NLU) performance of large language models (LLMs) has been evaluated across various tasks and datasets. The existing evaluation methods, however, do not take into account the variance in scores due to differences in prompts, which leads to unfair evaluation and comparison of NLU performance. Moreover, evaluation designed for specific prompts is inappropriate for instruction tuning, which aims to perform well with any prompt. It is therefore necessary to find a way to measure NLU performance in a fair manner, considering score variance between different instruction templates. In this study, we provide English and Japanese cross-lingual datasets for evaluating the NLU performance of LLMs, which include multiple instruction templates for fair evaluation of each task, along with regular expressions to constrain the output format. Furthermore, we propose the Sharpe score as an evaluation metric that takes into account the variance in scores between templates. Comprehensive analysis of English and Japanese LLMs reveals that the high variance among templates has a significant impact on the fair evaluation of LLMs.",
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="sakai-etal-2024-toward">
<titleInfo>
<title>Toward the Evaluation of Large Language Models Considering Score Variance across Instruction Templates</title>
</titleInfo>
<name type="personal">
<namePart type="given">Yusuke</namePart>
<namePart type="family">Sakai</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Adam</namePart>
<namePart type="family">Nohejl</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jiangnan</namePart>
<namePart type="family">Hang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Hidetaka</namePart>
<namePart type="family">Kamigaito</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Taro</namePart>
<namePart type="family">Watanabe</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2024-11</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP</title>
</titleInfo>
<name type="personal">
<namePart type="given">Yonatan</namePart>
<namePart type="family">Belinkov</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Najoung</namePart>
<namePart type="family">Kim</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jaap</namePart>
<namePart type="family">Jumelet</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Hosein</namePart>
<namePart type="family">Mohebbi</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Aaron</namePart>
<namePart type="family">Mueller</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Hanjie</namePart>
<namePart type="family">Chen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Miami, Florida, US</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
<abstract>The natural language understanding (NLU) performance of large language models (LLMs) has been evaluated across various tasks and datasets. The existing evaluation methods, however, do not take into account the variance in scores due to differences in prompts, which leads to unfair evaluation and comparison of NLU performance. Moreover, evaluation designed for specific prompts is inappropriate for instruction tuning, which aims to perform well with any prompt. It is therefore necessary to find a way to measure NLU performance in a fair manner, considering score variance between different instruction templates. In this study, we provide English and Japanese cross-lingual datasets for evaluating the NLU performance of LLMs, which include multiple instruction templates for fair evaluation of each task, along with regular expressions to constrain the output format. Furthermore, we propose the Sharpe score as an evaluation metric that takes into account the variance in scores between templates. Comprehensive analysis of English and Japanese LLMs reveals that the high variance among templates has a significant impact on the fair evaluation of LLMs.</abstract>
<identifier type="citekey">sakai-etal-2024-toward</identifier>
<location>
<url>https://aclanthology.org/2024.blackboxnlp-1.31</url>
</location>
<part>
<date>2024-11</date>
<extent unit="page">
<start>499</start>
<end>529</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Toward the Evaluation of Large Language Models Considering Score Variance across Instruction Templates
%A Sakai, Yusuke
%A Nohejl, Adam
%A Hang, Jiangnan
%A Kamigaito, Hidetaka
%A Watanabe, Taro
%Y Belinkov, Yonatan
%Y Kim, Najoung
%Y Jumelet, Jaap
%Y Mohebbi, Hosein
%Y Mueller, Aaron
%Y Chen, Hanjie
%S Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP
%D 2024
%8 November
%I Association for Computational Linguistics
%C Miami, Florida, US
%F sakai-etal-2024-toward
%X The natural language understanding (NLU) performance of large language models (LLMs) has been evaluated across various tasks and datasets. The existing evaluation methods, however, do not take into account the variance in scores due to differences in prompts, which leads to unfair evaluation and comparison of NLU performance. Moreover, evaluation designed for specific prompts is inappropriate for instruction tuning, which aims to perform well with any prompt. It is therefore necessary to find a way to measure NLU performance in a fair manner, considering score variance between different instruction templates. In this study, we provide English and Japanese cross-lingual datasets for evaluating the NLU performance of LLMs, which include multiple instruction templates for fair evaluation of each task, along with regular expressions to constrain the output format. Furthermore, we propose the Sharpe score as an evaluation metric that takes into account the variance in scores between templates. Comprehensive analysis of English and Japanese LLMs reveals that the high variance among templates has a significant impact on the fair evaluation of LLMs.
%U https://aclanthology.org/2024.blackboxnlp-1.31
%P 499-529
Markdown (Informal)
[Toward the Evaluation of Large Language Models Considering Score Variance across Instruction Templates](https://aclanthology.org/2024.blackboxnlp-1.31) (Sakai et al., BlackboxNLP 2024)
ACL
Yusuke Sakai, Adam Nohejl, Jiangnan Hang, Hidetaka Kamigaito, and Taro Watanabe. 2024. Toward the Evaluation of Large Language Models Considering Score Variance across Instruction Templates. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 499–529, Miami, Florida, US. Association for Computational Linguistics.