Marcellus Amadeus
2026
CURUPIRA: Clever guard for harm and linguistic prompt mitigation in Brazilian Portuguese
Rogério Sousa | William Alberto Cruz-Castañeda | José Roberto Homeli Silva | Marcellus Amadeus
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
The safe deployment of Large Language Models remains challenging in multilingual settings, particularly when models are exposed to adversarial or malicious prompts in underrepresented languages. In this work, we present Curupira, a Brazilian Portuguese guard model designed to mitigate harmful prompt exploitation. To do this, we establish a three-step methodology involving adaptation, data generation, and fine-tuning. We also evaluate our model against two state-of-the-art open guardrail architectures. The results show that targeted fine-tuning leads to consistent improvements in safety classification for Portuguese prompts, with favorable efficiency–performance trade-offs for compact models and limited degradation in cross-lingual evaluation.
JabuticaBERT: Modern Portuguese Encoders from Scratch with RTD and Long-Context Training
Thiago Porto | Gabriel Gomes | Alexandre Bender | Ulisses Corrêa | Larissa Freitas | William Cruz | Marcellus Amadeus
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
Encoder-based language models remain essential for natural language understanding tasks such as classification, semantic similarity, and retrieval-augmented generation. However, the lack of high-quality monolingual encoders for Brazilian Portuguese poses a significant challenge to performance. In this work, we systematically explore the training of Portuguese-specific encoder models from scratch using two modern architectures: DeBERTa, trained with Replaced Token Detection (RTD), and ModernBERT, trained with Masked Language Modeling (MLM). All models are pre-trained on the large-scale Jabuticaba corpus. Our DeBERTa-Large model achieves results comparable to the state-of-the-art, with F1 scores of 0.920 on ASSIN2 RTE and 0.915 on LeNER. Crucially, it matches the performance of the 900M-parameter Albertina model while utilizing significantly fewer parameters. We also release custom tokenizers that reduce token fertility rates compared to multilingual baselines. These findings provide evidence that careful architectural choices and monolingual tokenization can yield competitive performance without massive model scaling.
2024
Bridging the Language Gap: Integrating Language Variations into Conversational AI Agents for Enhanced User Engagement
Marcellus Amadeus | Jose Roberto Homeli da Silva | Joao Victor Pessoa Rocha
Proceedings of the 1st Workshop on Towards Ethical and Inclusive Conversational AI: Language Attitudes, Linguistic Diversity, and Language Rights (TEICAI 2024)
This paper presents the initial steps taken to integrate language variations into conversational AI agents to enhance user engagement. The study builds upon sociolinguistic and pragmatic traditions and involves the creation of an annotation taxonomy. The taxonomy comprises eleven classes, ranging from concrete to abstract, covering the instance itself, time, sentiment, register, state, region, type, grammar, part of speech, meaning, and language. The paper discusses the challenges of incorporating vernacular language into AI agents, the procedures for data collection, and the organization of the taxonomy. It also outlines the next steps, including database expansion and computational implementation. The authors believe that integrating language variation into conversational AI will build near-real language inventories and boost user engagement. The paper concludes by discussing the limitations and the importance of building rapport with users through their own vernacular.