Class of LLMs: Benchmarking Large Language Models on the Brazilian National Medical Examination

João Vitor Mariano Correia, Pedro Henrique Alves de Castro, Gabriel Lino Garcia, Pedro Henrique Paiola, João Paulo Papa


Abstract
The evaluation of Large Language Models (LLMs) in medicine has predominantly relied on English-language benchmarks aligned with North American clinical guidelines, limiting their applicability to other healthcare systems. In this paper, we evaluate twenty-two proprietary and open-weight LLMs on the 2025 National Examination for the Evaluation of Medical Training (ENAMED), a high-stakes, government-standardized assessment used to evaluate medical graduates in Brazil. The benchmark comprises 90 multiple-choice questions grounded in Brazilian public health policy, clinical practice, and Portuguese medical terminology, and is released as an open dataset. Model performance is measured using both standard accuracy and the official Item Response Theory (IRT) framework employed by ENAMED, enabling direct comparison with human proficiency thresholds. Results reveal a clear stratification of model capabilities: proprietary frontier models achieve the highest performance, whereas many open-weight and smaller domain-adapted models fail to meet the minimum proficiency criterion. Across comparable scales, large generalist models consistently outperform specialized medical fine-tunes, suggesting that general reasoning capacity is a stronger predictor of success than narrow domain adaptation in this setting. These findings establish ENAMED as a rigorous benchmark for evaluating medical LLMs in Portuguese and highlight both the potential and current limitations of such models for educational assessment.
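The abstract mentions scoring models with the Item Response Theory framework used by ENAMED. The exact parametrization is not specified here; as a hedged illustration only, the sketch below shows the standard three-parameter logistic (3PL) IRT model, where the probability of a correct response depends on examinee ability and per-item discrimination, difficulty, and guessing parameters. All names and values are illustrative, not taken from the paper.

```python
import math

def three_pl(theta: float, a: float, b: float, c: float) -> float:
    """Standard 3PL IRT item response function (illustrative, not the
    paper's exact scoring code).

    theta: examinee (or model) ability
    a: item discrimination
    b: item difficulty
    c: pseudo-guessing floor (lower asymptote)
    Returns the probability of answering the item correctly.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# At theta == b, probability is halfway between the guessing floor and 1:
p_mid = three_pl(theta=0.0, a=1.0, b=0.0, c=0.25)  # 0.625
# Higher ability yields a higher success probability on the same item:
p_high = three_pl(theta=2.0, a=1.0, b=0.0, c=0.25)
```

Under such a model, an examination's proficiency threshold corresponds to a cut point on the ability scale, which is how model scores can be compared against human proficiency criteria.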
Anthology ID:
2026.propor-2.17
Volume:
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 2
Month:
April
Year:
2026
Address:
Salvador, Brazil
Editors:
Marlo Souza, Iria de-Dios-Flores, Diana Santos, Larissa Freitas, Jackson Wilke da Cruz Souza, Eugénio Ribeiro
Venue:
PROPOR
Publisher:
Association for Computational Linguistics
Pages:
101–111
URL:
https://aclanthology.org/2026.propor-2.17/
Cite (ACL):
João Vitor Mariano Correia, Pedro Henrique Alves de Castro, Gabriel Lino Garcia, Pedro Henrique Paiola, and João Paulo Papa. 2026. Class of LLMs: Benchmarking Large Language Models on the Brazilian National Medical Examination. In Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 2, pages 101–111, Salvador, Brazil. Association for Computational Linguistics.
Cite (Informal):
Class of LLMs: Benchmarking Large Language Models on the Brazilian National Medical Examination (Correia et al., PROPOR 2026)
PDF:
https://aclanthology.org/2026.propor-2.17.pdf