Global vs. Local Sentence Embeddings for Brazilian Portuguese: Revisiting Monolingual Models in the Age of Foundation Models

Matheus Peixoto; Guilherme Silva; Giacomo Figueredo; Pedro Silva; Eduardo J. S. Luz

Abstract

The choice between large-scale multilingual foundation models and specialized monolingual models for languages like Brazilian Portuguese (PT-BR) presents a complex trade-off between generalization and specialization. This paper investigates this trade-off through an empirical study across a diverse suite of tasks. We evaluate multiple families of language models under both linear probing and fine-tuning regimes. We find that monolingual encoders exhibit greater "adaptation plasticity" during fine-tuning, improving on both classification and semantic similarity, where global (multilingual) models degrade. However, this plasticity comes at a cost: our tokenization analysis suggests that monolingual models struggle with foreign terms, whereas modern multilingual tokenizers show surprising morphological competence, challenging a long-standing assumption in the field. We conclude that the optimal model choice is a task-dependent trade-off between vocabulary coverage and adaptation flexibility.

Anthology ID: 2026.propor-1.52
Volume: Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
Month: April
Year: 2026
Address: Salvador, Brazil
Editors: Marlo Souza, Iria de-Dios-Flores, Diana Santos, Larissa Freitas, Jackson Wilke da Cruz Souza, Eugénio Ribeiro
Venue: PROPOR
Publisher: Association for Computational Linguistics
Pages: 529–539
URL: https://aclanthology.org/2026.propor-1.52/
Cite (ACL): Matheus Peixoto, Guilherme Silva, Giacomo Figueredo, Pedro Silva, and Eduardo J. S. Luz. 2026. Global vs. Local Sentence Embeddings for Brazilian Portuguese: Revisiting Monolingual Models in the Age of Foundation Models. In Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1, pages 529–539, Salvador, Brazil. Association for Computational Linguistics.
Cite (Informal): Global vs. Local Sentence Embeddings for Brazilian Portuguese: Revisiting Monolingual Models in the Age of Foundation Models (Peixoto et al., PROPOR 2026)
PDF: https://aclanthology.org/2026.propor-1.52.pdf
