Larissa Freitas
2026
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 2
Marlo Souza | Iria de-Dios-Flores | Diana Santos | Larissa Freitas | Jackson Wilke da Cruz Souza | Eugénio Ribeiro
JabuticaBERT: Modern Portuguese Encoders from Scratch with RTD and Long-Context Training
Thiago Porto | Gabriel Gomes | Alexandre Bender | Ulisses Corrêa | Larissa Freitas | William Cruz | Marcellus Amadeus
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
Encoder-based language models remain essential for natural language understanding tasks such as classification, semantic similarity, and retrieval-augmented generation. However, the lack of high-quality monolingual encoders for Brazilian Portuguese poses a significant challenge to performance. In this work, we systematically explore the training of Portuguese-specific encoder models from scratch using two modern architectures: DeBERTa, trained with Replaced Token Detection (RTD), and ModernBERT, trained with Masked Language Modeling (MLM). All models are pre-trained on the large-scale Jabuticaba corpus. Our DeBERTa-Large model achieves results comparable to the state-of-the-art, with F1 scores of 0.920 on ASSIN2 RTE and 0.915 on LeNER. Crucially, it matches the performance of the 900M-parameter Albertina model while utilizing significantly fewer parameters. We also release custom tokenizers that reduce token fertility rates compared to multilingual baselines. These findings provide evidence that careful architectural choices and monolingual tokenization can yield competitive performance without massive model scaling.
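A minimal sketch of the token-fertility metric mentioned in the abstract: fertility is the average number of subword tokens produced per whitespace-delimited word, so lower values mean a tokenizer segments the language more efficiently. The two toy tokenizers below are hypothetical stand-ins for illustration, not the released JabuticaBERT tokenizers.

```python
def fertility(tokenize, sentences):
    """Average subword tokens emitted per whitespace-delimited word."""
    total_tokens = sum(len(tokenize(s)) for s in sentences)
    total_words = sum(len(s.split()) for s in sentences)
    return total_tokens / total_words

# Toy "tokenizers": one splits on whitespace (fertility 1.0); the other also
# breaks words longer than 4 characters in half, mimicking a subword vocabulary
# that covers the language poorly.
def word_tok(s):
    return s.split()

def subword_tok(s):
    out = []
    for w in s.split():
        if len(w) > 4:
            mid = len(w) // 2
            out.extend([w[:mid], "##" + w[mid:]])
        else:
            out.append(w)
    return out

sents = ["o tokenizador monolíngue reduz a fertilidade"]
print(fertility(word_tok, sents))     # 1.0
print(fertility(subword_tok, sents))  # ~1.67: more pieces per word
```

A multilingual tokenizer with a vocabulary skewed toward English behaves like the second case on Portuguese text, which is why a monolingual tokenizer can reduce fertility.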
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
Marlo Souza | Iria de-Dios-Flores | Diana Santos | Larissa Freitas | Jackson Wilke da Cruz Souza | Eugénio Ribeiro
2021
Utilizando BERTimbau para a Classificação de Emoções em Português
Luiz Hammes | Larissa Freitas
Proceedings of the 13th Brazilian Symposium in Information and Human Language Technology
Utilizando Pistas Linguísticas para Detectar Conteúdo Enganoso em Português
Rodrigo Rodrigues | Larissa Freitas
Proceedings of the 13th Brazilian Symposium in Information and Human Language Technology
2020
An Assessment of Language Identification Methods on Tweets and Wikipedia Articles
Pedro Vernetti | Larissa Freitas
Proceedings of the Fourth Widening Natural Language Processing Workshop
Language identification is the task of determining the language in which a given text is written. This task is important for Natural Language Processing and Information Retrieval activities. Two popular approaches to language identification are the N-gram and stopword models. In this paper, both models were tested on different types of documents: short, irregular texts (tweets) and long, regular texts (Wikipedia articles).
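The N-gram approach the abstract refers to can be sketched as follows: build a character-trigram frequency profile per language and pick the language whose profile overlaps most with the text. This is a minimal illustration with invented one-sentence profiles, not the paper's actual models or data.

```python
from collections import Counter

def ngrams(text, n=3):
    """Character n-gram frequency profile, padded with boundary spaces."""
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def overlap(a, b):
    # Size of the multiset intersection of two trigram profiles.
    return sum((a & b).values())

# Tiny stand-in profiles; real systems train on large corpora per language.
profiles = {
    "pt": ngrams("o processamento de linguagem natural estuda a língua portuguesa"),
    "en": ngrams("natural language processing studies the english language"),
}

def identify(text):
    probe = ngrams(text)
    return max(profiles, key=lambda lang: overlap(probe, profiles[lang]))

print(identify("a identificação de linguagem"))  # pt
print(identify("the language of this tweet"))    # en
```

Trigram profiles degrade gracefully on short, noisy inputs like tweets, while stopword models depend on whole function words appearing, which is one reason the two approaches behave differently across the document types compared in the paper.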
A Comparison of Identification Methods of Brazilian Music Styles by Lyrics
Patrick Guimarães | Jader Froes | Douglas Costa | Larissa Freitas
Proceedings of the Fourth Widening Natural Language Processing Workshop
In our work, we applied different techniques to the task of genre classification from lyrics. Using a dataset of lyrics from genres typical in Brazil, divided into seven classes, we apply models commonly used in machine-learning and deep-learning classification tasks. We evaluate the performance of standard text-classification models on Portuguese-language input, and we compare RNNs against classic machine learning approaches, covering the most widely used methods in the field.
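A classic machine-learning baseline of the kind compared against RNNs above can be sketched as bag-of-words vectors with a nearest-centroid classifier. The genre labels and lyric snippets below are invented for illustration; the paper's dataset and models are not reproduced here.

```python
from collections import Counter
import math

def bow(text):
    """Bag-of-words vector as a word-frequency Counter."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Invented toy training lyrics, two of the seven genre classes.
train = {
    "sertanejo": ["coração apaixonado na estrada",
                  "saudade do meu amor no interior"],
    "funk": ["baile na quebrada hoje à noite",
             "o baile vai ferver na quebrada"],
}

# One centroid per genre: the summed bag-of-words of its training lyrics.
centroids = {g: sum((bow(t) for t in texts), Counter())
             for g, texts in train.items()}

def classify(lyric):
    v = bow(lyric)
    return max(centroids, key=lambda g: cosine(v, centroids[g]))

print(classify("saudade do amor na estrada"))  # sertanejo
```

An RNN replaces the fixed bag-of-words vector with a learned sequence representation, which is the core of the comparison the abstract describes.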