Evaluating Numeracy of Language Models as a Natural Language Inference Task

Rahmad Mahendra; Damiano Spina; Lawrence Cavedon; Karin Verspoor

doi:10.18653/v1/2025.findings-naacl.467

Abstract

While recent advances in large language models (LLMs) have enhanced their ability to solve mathematical problems, other aspects of numeracy remain underexplored. In this paper, we propose a benchmark to evaluate the ability of language models to perform basic numeracy tasks. We frame numeracy as a Natural Language Inference (NLI) task to assess the models’ ability to understand both numbers and language contexts. We evaluate 49 language models (LMs), including LMs fine-tuned on NLI datasets, instruction-tuned LLMs, and specialized math-LLMs. Our findings reveal three main insights: (1) LLMs clearly outperform smaller LMs only in arithmetic tasks, indicating that mathematical reasoning does not generalize to other numeracy skills such as number comparison and normalization; (2) while most language models achieve fair to good accuracy on NLI entailment cases, they still struggle to predict contradiction and neutral cases; and (3) the robustness of language models’ numeracy capabilities needs improvement, particularly in understanding the semantics and pragmatics of numbers in linguistic contexts.

Anthology ID: 2025.findings-naacl.467
Volume: Findings of the Association for Computational Linguistics: NAACL 2025
Month: April
Year: 2025
Address: Albuquerque, New Mexico
Editors: Luis Chiruzzo, Alan Ritter, Lu Wang
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 8351–8376
URL: https://aclanthology.org/2025.findings-naacl.467/
DOI: 10.18653/v1/2025.findings-naacl.467
Cite (ACL): Rahmad Mahendra, Damiano Spina, Lawrence Cavedon, and Karin Verspoor. 2025. Evaluating Numeracy of Language Models as a Natural Language Inference Task. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 8351–8376, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal): Evaluating Numeracy of Language Models as a Natural Language Inference Task (Mahendra et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-naacl.467.pdf
