David Cecchini
2024
Holistic Evaluation of Large Language Models: Assessing Robustness, Accuracy, and Toxicity for Real-World Applications
David Cecchini
|
Arshaan Nazir
|
Kalyan Chakravarthy
|
Veysel Kocaman
Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024)
Large Language Models (LLMs) have been widely used in real-world applications. However, as LLMs evolve and new datasets are released, it becomes crucial to build processes to evaluate and control the models’ performance. In this paper, we describe how to add Robustness, Accuracy, and Toxicity scores to model comparison tables, or leaderboards. We discuss the evaluation metrics, the approaches considered, and present the results of the first evaluation round for model Robustness, Accuracy, and Toxicity scores. Our results show that GPT 4 achieves top performance on robustness and accuracy test, while Llama 2 achieves top performance on the toxicity test. We note that newer open-source models such as open chat 3.5 and neural chat 7B can perform well on these three test categories. Finally, domain-specific tests and models are also planned to be added to the leaderboard to allow for a more detailed evaluation of models in specific areas such as healthcare, legal, and finance.