Giacomo Figueredo
2026
Global vs. Local Sentence Embeddings for Brazilian Portuguese: Revisiting Monolingual Models in the Age of Foundation Models
Matheus Peixoto | Guilherme Silva | Giacomo Figueredo | Pedro Silva | Eduardo J. S. Luz
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
The choice between large-scale multilingual foundation models and specialized monolingual models for languages like Brazilian Portuguese (PT-BR) presents a complex trade-off between generalization and specialization. This paper investigates this trade-off through an empirical study across a diverse suite of tasks. We evaluate multiple families of language models under both linear probing and fine-tuning regimes. We find that monolingual encoders exhibit greater "adaptation plasticity" during fine-tuning, improving on both classification and semantic similarity, whereas global (multilingual) models degrade. However, this plasticity comes at a cost: our tokenization analysis suggests that monolingual models struggle with foreign terms, whereas modern multilingual tokenizers show surprising morphological competence, challenging a long-standing assumption in the field. We conclude that the optimal model choice is a task-dependent trade-off between vocabulary coverage and adaptation flexibility.
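The vocabulary-coverage comparison the abstract alludes to is often quantified via tokenizer "fertility" (average subword tokens per whitespace-delimited word). The sketch below illustrates the metric only; the toy tokenizations and the function names are invented for illustration and do not come from the paper's models.

```python
# Sketch of the "fertility" metric: average number of subword tokens a
# tokenizer produces per whitespace-delimited word. Lower fertility is a
# common proxy for better vocabulary coverage of a language.

def fertility(sentences, tokenize):
    """Return total subword tokens divided by total whitespace words."""
    n_words = sum(len(s.split()) for s in sentences)
    n_subwords = sum(len(tokenize(s)) for s in sentences)
    return n_subwords / n_words

# Hypothetical PT-BR sentence containing the English loanword "streaming".
sentence = "o streaming ficou instável"

def mono_tokenize(s):
    # Invented output: a monolingual vocabulary might fragment the
    # foreign term "streaming" into several pieces.
    return ["o", "stre", "##am", "##ing", "ficou", "instável"]

def multi_tokenize(s):
    # Invented output: a large multilingual vocabulary may keep the
    # loanword whole but split an accented Portuguese word.
    return ["o", "streaming", "ficou", "insta", "##vel"]

print(fertility([sentence], mono_tokenize))   # 6 tokens / 4 words = 1.5
print(fertility([sentence], multi_tokenize))  # 5 tokens / 4 words = 1.25
```

In practice the same metric would be computed over a large PT-BR corpus with real tokenizers (e.g. via Hugging Face `AutoTokenizer`), aggregating counts per word type rather than over a single sentence.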