Towards Robust Comparisons of NLP Models: A Case Study

Vicente Ivan Sanchez Carmona, Shanshan Jiang, Bin Dong


Abstract
Comparing the test scores of different NLP models across downstream datasets to determine which model yields the most accurate results is the final step in any experimental work. Doing so via a single mean score, however, may not accurately quantify the models' real capabilities. Previous work has proposed diverse statistical tests to improve comparisons of NLP models, yet a key statistical phenomenon remains understudied: variability in test scores. We propose a type of regression analysis that better explains this phenomenon by isolating the effects of nuisance factors (such as random seeds) and of datasets from the effects of the models' capabilities. We showcase our approach in a case study of some of the most popular biomedical NLP models: after isolating nuisance factors and datasets, our results show that the difference between BioLinkBERT and MSR BiomedBERT is actually 7 times smaller than previously reported.
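The regression analysis described in the abstract can be illustrated with a short sketch. The code below is only an assumption about the general setup, not the authors' specification: it fits an ordinary least-squares regression (via statsmodels) with dummy-coded factors for model, dataset, and random seed on synthetic scores, so that the model-to-model gap is read off a single coefficient after dataset and seed effects are accounted for. The dataset names, scores, and effect sizes are fabricated for illustration.

```python
# Illustrative sketch only: a fixed-effects regression separating a model's
# contribution to test scores from dataset and random-seed (nuisance) effects.
# Dataset names, scores, and effect sizes are synthetic and hypothetical; the
# paper's exact regression specification may differ.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

models = ["BioLinkBERT", "BiomedBERT"]
datasets = ["dataset_A", "dataset_B", "dataset_C"]  # hypothetical benchmarks
seeds = range(5)                                    # random seeds as a nuisance factor

# Build a table of (model, dataset, seed, score) observations.
rows = []
for m in models:
    for d in datasets:
        for s in seeds:
            # Leave the grid unbalanced: drop some runs of one model on the
            # hardest dataset, so a pooled mean overstates its advantage.
            if m == "BioLinkBERT" and d == "dataset_C" and s >= 2:
                continue
            base = {"dataset_A": 0.88, "dataset_B": 0.85, "dataset_C": 0.78}[d]
            model_effect = 0.004 if m == "BioLinkBERT" else 0.0  # assumed true gap
            score = base + model_effect + rng.normal(0, 0.01)    # seed-level noise
            rows.append({"model": m, "dataset": d, "seed": s, "score": score})
df = pd.DataFrame(rows)

# Naive comparison: difference of single pooled mean scores.
naive_gap = df.groupby("model")["score"].mean().diff().abs().iloc[-1]

# Regression comparison: dummy-coded dataset and seed factors absorb their
# share of the variability, so the model coefficient reflects the gap after
# controlling for them.
fit = smf.ols("score ~ C(model) + C(dataset) + C(seed)", data=df).fit()
adjusted_gap = abs(fit.params.filter(like="C(model)").iloc[0])

print(f"naive gap from pooled means:  {naive_gap:.4f}")
print(f"adjusted gap from regression: {adjusted_gap:.4f}")
```

In a perfectly balanced grid of runs the adjusted coefficient would coincide with the difference of pooled means; the synthetic data above is deliberately left unbalanced so that the adjustment visibly changes the estimate.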
Anthology ID: 2025.coling-main.332
Volume: Proceedings of the 31st International Conference on Computational Linguistics
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue: COLING
Publisher: Association for Computational Linguistics
Pages: 4973–4979
URL: https://aclanthology.org/2025.coling-main.332/
Cite (ACL): Vicente Ivan Sanchez Carmona, Shanshan Jiang, and Bin Dong. 2025. Towards Robust Comparisons of NLP Models: A Case Study. In Proceedings of the 31st International Conference on Computational Linguistics, pages 4973–4979, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal): Towards Robust Comparisons of NLP Models: A Case Study (Sanchez Carmona et al., COLING 2025)
PDF: https://aclanthology.org/2025.coling-main.332.pdf