AveniBench: Accessible and Versatile Evaluation of Finance Intelligence

Mateusz Klimaszewski, Pinzhen Chen, Liane Guillou, Ioannis Papaioannou, Barry Haddow, Alexandra Birch


Abstract
Over the last few years, there has been great interest in applying large language models (LLMs) to problems in the finance industry, and the field needs a robust LLM benchmark to support this work. Current financial LLM benchmarks contain simple tasks which are not representative of real use cases and have test sets with licences that do not allow commercial use. In response, we release AveniBench, a permissively licensed benchmark that tests a group of six key finance-related skills: tabular reasoning, numerical reasoning, question answering, long context modelling, summarisation and dialogue. We refactor the test sets to ensure that metrics are comparable, providing a unified framework. Furthermore, AveniBench introduces two task difficulty modes, easy and hard, enabling scalable evaluation based on real-world deployment needs. We use our benchmark to evaluate a diverse set of 20 widely used LLMs, from small open-weight models to proprietary systems like GPT-4. This evaluation initiates our public leaderboard, providing valuable insights for future academic research and commercial development.
Anthology ID:
2025.finnlp-1.10
Volume:
Proceedings of the Joint Workshop of the 9th Financial Technology and Natural Language Processing (FinNLP), the 6th Financial Narrative Processing (FNP), and the 1st Workshop on Large Language Models for Finance and Legal (LLMFinLegal)
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Chung-Chi Chen, Antonio Moreno-Sandoval, Jimin Huang, Qianqian Xie, Sophia Ananiadou, Hsin-Hsi Chen
Venues:
FinNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
111–117
URL:
https://aclanthology.org/2025.finnlp-1.10/
Cite (ACL):
Mateusz Klimaszewski, Pinzhen Chen, Liane Guillou, Ioannis Papaioannou, Barry Haddow, and Alexandra Birch. 2025. AveniBench: Accessible and Versatile Evaluation of Finance Intelligence. In Proceedings of the Joint Workshop of the 9th Financial Technology and Natural Language Processing (FinNLP), the 6th Financial Narrative Processing (FNP), and the 1st Workshop on Large Language Models for Finance and Legal (LLMFinLegal), pages 111–117, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
AveniBench: Accessible and Versatile Evaluation of Finance Intelligence (Klimaszewski et al., FinNLP 2025)
PDF:
https://aclanthology.org/2025.finnlp-1.10.pdf