NorBERTo: A ModernBERT Model Trained for Portuguese with 331 Billion Tokens Corpus

Lucas F. A. O. Pellicer, Guilherme Rinaldo


Abstract
High-quality corpora are essential for advancing Natural Language Processing (NLP) in Portuguese. Building on previous encoder-only models such as BERTimbau and Albertina PT-BR, we introduce NorBERTo, a modern encoder based on the ModernBERT architecture, featuring long-context support and efficient attention mechanisms. NorBERTo is trained on Aurora-PT, a newly curated Brazilian Portuguese corpus comprising 331 billion GPT-2 tokens collected from diverse web sources and existing multilingual datasets. We systematically benchmark NorBERTo against strong baselines on semantic similarity, textual entailment, and classification tasks using standardized datasets such as ASSIN 2 and PLUE. On PLUE, NorBERTo-large achieves the best results among the encoder models we evaluated, notably reaching 0.9191 F1 on MRPC and 0.7689 accuracy on RTE. On ASSIN 2, NorBERTo-large attains the highest entailment F1 (0.904) among the encoders considered and remains competitive overall, although Albertina-900M and BERTimbau-large still hold an advantage on the remaining metrics. To the best of our knowledge, Aurora-PT is currently the largest openly available monolingual Portuguese corpus, surpassing previous resources. NorBERTo provides a modern, mid-sized encoder designed for realistic deployment scenarios: it is straightforward to fine-tune, efficient to serve, and well suited as a backbone for retrieval-augmented generation and other downstream Portuguese NLP systems.
Anthology ID:
2026.propor-1.18
Volume:
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
Month:
April
Year:
2026
Address:
Salvador, Brazil
Editors:
Marlo Souza, Iria de-Dios-Flores, Diana Santos, Larissa Freitas, Jackson Wilke da Cruz Souza, Eugénio Ribeiro
Venue:
PROPOR
Publisher:
Association for Computational Linguistics
Pages:
183–193
URL:
https://aclanthology.org/2026.propor-1.18/
Cite (ACL):
Lucas F. A. O. Pellicer and Guilherme Rinaldo. 2026. NorBERTo: A ModernBERT Model Trained for Portuguese with 331 Billion Tokens Corpus. In Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1, pages 183–193, Salvador, Brazil. Association for Computational Linguistics.
Cite (Informal):
NorBERTo: A ModernBERT Model Trained for Portuguese with 331 Billion Tokens Corpus (Pellicer & Rinaldo, PROPOR 2026)
PDF:
https://aclanthology.org/2026.propor-1.18.pdf