The Invalsi Benchmarks: measuring the Linguistic and Mathematical understanding of Large Language Models in Italian

Giovanni Puccetti, Maria Cassese, Andrea Esuli


Abstract
While Italian is a high-resource language, there are few Italian-native benchmarks to evaluate generative Large Language Models (LLMs) in this language. This work presents three new benchmarks: Invalsi MATE, to evaluate models' performance on mathematical understanding in Italian; Invalsi ITA, to evaluate language understanding in Italian; and Olimpiadi MATE, for more complex mathematical understanding. The first two benchmarks are based on the Invalsi tests, which are administered to students aged 6 to 18 within the Italian school system and have been validated by several experts in teaching and pedagogy; the third is drawn from the Italian high-school mathematics Olympiad. We evaluate 10 powerful language models on these benchmarks and find that their accuracy is bounded at 71% on Invalsi MATE, achieved by Llama 3.1 70b instruct, and at 88% on Invalsi ITA. For both Invalsi MATE and Invalsi ITA we compare LLMs with the average performance of Italian students, showing that Llama 3.1 is the only model to outperform them on Invalsi MATE, while most models do so on Invalsi ITA. We then show that Olimpiadi MATE is more challenging than Invalsi MATE: the highest accuracy, achieved by Llama 3.1 405b instruct, is 45%.
Anthology ID:
2025.coling-main.453
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
6782–6797
URL:
https://aclanthology.org/2025.coling-main.453/
Cite (ACL):
Giovanni Puccetti, Maria Cassese, and Andrea Esuli. 2025. The Invalsi Benchmarks: measuring the Linguistic and Mathematical understanding of Large Language Models in Italian. In Proceedings of the 31st International Conference on Computational Linguistics, pages 6782–6797, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
The Invalsi Benchmarks: measuring the Linguistic and Mathematical understanding of Large Language Models in Italian (Puccetti et al., COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.453.pdf