Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
Marlo Souza, Iria de-Dios-Flores, Diana Santos, Larissa Freitas, Jackson Wilke da Cruz Souza, Eugénio Ribeiro (Editors)
- Anthology ID: 2026.propor-1
- Month: April
- Year: 2026
- Address: Salvador, Brazil
- Venue: PROPOR
- SIG:
- Publisher: Association for Computational Linguistics
- URL: https://aclanthology.org/2026.propor-1/
- DOI:
- ISBN: 979-8-89176-387-6
- PDF: https://aclanthology.org/2026.propor-1.pdf
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
Marlo Souza | Iria de-Dios-Flores | Diana Santos | Larissa Freitas | Jackson Wilke da Cruz Souza | Eugénio Ribeiro
Retrieval-Augmented Generation and Knowledge Graphs in Portuguese-Language Legal Documents
Vinícius Teles de Oliveira | Deivison Oliveira da Silva | Mateus de Almeida Souza | Maurício Rodrigues Lima | Sávio Salvarino Teles de Oliveira | Thierson Couto Rosa
This paper introduces a Graph Retrieval-Augmented Generation (GraphRAG) pipeline tailored for Question Answering (QA) within Portuguese legal documents. Applied to a corpus of 203 normative resolutions from Companhia Energética de Minas Gerais (CEMIG), the proposed approach addresses the structural complexity of legal texts, such as hierarchical dependencies and temporal modifications. By explicitly modeling documents as knowledge graphs with nodes representing structural units (Articles, Paragraphs, Items) and edges denoting normative relationships, the system preserves context and traceability. The retrieval mechanism reconstructs evidence paths from root to leaf, performing semantic re-ranking before generation. Evaluation using the RAGAS framework yielded a mean answer accuracy of 0.81, with a median of 1.00. Results indicate that the system performs robustly on short, focused queries, while intermediate-length questions present challenges related to semantic dispersion. The findings suggest that structurally aware retrieval significantly enhances the interpretability and precision of legal QA systems.
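The root-to-leaf evidence-path reconstruction described in this abstract can be sketched with a toy structural graph; the node identifiers and parent links below are invented for illustration and are not taken from the paper's pipeline.

```python
# Sketch of root-to-leaf evidence-path reconstruction over a document
# graph of structural units (Article > Paragraph > Item).
# Node names are hypothetical.
PARENT = {                      # child -> parent (structural edges)
    "art5.par2.itemA": "art5.par2",
    "art5.par2": "art5",
    "art5": None,               # root: the resolution itself
}

def evidence_path(leaf):
    """Walk parent links from a retrieved leaf up to the root,
    then reverse so the path reads root -> leaf."""
    path, node = [], leaf
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return list(reversed(path))

print(evidence_path("art5.par2.itemA"))
# ['art5', 'art5.par2', 'art5.par2.itemA']
```

In a full system each node would carry its text span, and the concatenated path would be re-ranked semantically before being handed to the generator.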
Gender Identification in Brazilian Portuguese Product Reviews: A Comparative Study of Classical Models, BERT, and LLMs
Tiago de Melo | Carlos M. S. Figueiredo
This study analyzes gender identification in Brazilian Portuguese using Amazon reviews drawn from ten product categories. Nine models were evaluated: three classical classifiers (Logistic Regression, Random Forest, and SVM), a multilingual BERT, and five LLMs (ChatGPT 4o, ChatGPT 3.5, DeepSeek, Sabia3, and Sabiazinho). Experiments show that BERT achieved the best performance (macro-F1 = 0.634), outperforming ChatGPT 4o and Logistic Regression by less than one percentage point. Reviews authored by women reach an average F1 of 0.654—four points higher than those by men. Performance also varies by domain: books and automotive are easier, whereas baby and pets are more challenging.
Leveraging political alignment information for stance detection
Matheus Camasmie Pavan | Ivandré Paraboni
Stance detection is the task of determining whether an input text expresses a stance in favour of or against a given target topic. This, in a standard supervised fashion, will typically require a new set of labelled training examples for each test topic. As an alternative to full supervision (or costly LLM-based methods), this study leverages political alignment information by assuming that stances on related moral or political issues tend to co-occur (e.g., support for a right-wing politician correlating with support for the death penalty or opposition to abortion). This alignment, presently treated as a form of distant labelling, enables stance inference without constructing new corpora and is evaluated against standard cross-domain and prompt-based methods using a large corpus of stances in the Portuguese language.
We investigate the effect of dependency distance and its directionality on eye-tracking measures in Brazilian Portuguese. Using the RastrOS corpus enriched with surprisal and syntactic annotations, we find that absolute dependency distance significantly improves the prediction of first fixation durations, supporting memory-based accounts of sentence processing. In contrast, the direction of the dependency (whether the dependent precedes or follows the head) shows weaker and less consistent effects. These results indicate that early lexical retrieval is sensitive to distance magnitude, while later reading measures reflecting integration are less affected, highlighting the complementary role of syntactic distance alongside surprisal in modelling reading behaviour.
Experimental Evaluation of Topic Modeling Methods for Categorizing Irregularities in Health-related news
Alysson Guimarães | Methanias Colaço Junior | Samuel Almeida | Raphael Fontes
Context: The increasing availability of textual data has driven the application of Natural Language Processing (NLP) techniques in public administration to improve public services. Objective: This study aims to analyze topic modeling methods in the context of public health audits conducted by the National Department of SUS Auditing (AudSUS). Methods: A controlled in vitro experiment was conducted to assess the performance of the methods in topic modeling tasks using coherence metrics. Results: The LSA method stood out among models with the highest average C_V and C_NPMI coherence. LSA-based models achieved superior performance compared to 215 other models in configurations with lower top-n and top-k values. Overall, the statistical analysis confirms that the observed differences among the models are not due to random variation. Conclusion: The results underscore the potential of topic modeling methods for clustering news articles that exhibit indications of irregularities, thereby guiding information retrieval during the analytical phase of the audit process. This approach enhances the overall effectiveness of audits and facilitates faster preparation of teams for the operational stage.
A Multilingual Voice Analytics Module for Contact-Center Hiring
Wagner W. Ávila Bombardelli | Vanessa Marquiafavel Serrani | Edgard Kuboo | Erica C. Marins Missão
Contact-center operations often face significant challenges in identifying candidates whose vocal performance aligns with high-quality customer interactions. Existing speech analytics tools typically assess only content, providing limited insight into how candidates speak. To address this gap, we introduce SR-Voice, a multilingual speech analytics module designed to support call-center hiring. SR-Voice extends a previous text-only auditor by integrating segment-level, audio-native analysis capable of generating judgments, concise evidence-based rationales, and 0–10 scores across three dimensions: Emotion, Communication, and Rhythm. Our two-stage architecture first applies an audio-native model to propose a label, which is then reassessed by a lightweight auditor that combines transcript cues with acoustic and timing indicators grounded in phonetic and prosodic theory. We evaluate SR-Voice on a production-like volunteer dataset, reporting strong agreement and calibration, with Macro-F1 = 0.83 and Expected Calibration Error (ECE) = 0.053. The hybrid system achieves state-of-the-art calibration without post-hoc adjustment, with the audio-only variant attaining the lowest Negative Log-Likelihood (NLL = 0.472). Designed for operational practicality, SR-Voice emphasizes traceability, short rationales, and well-calibrated probabilities suitable for threshold-based decisions and human-in-the-loop triage. We also discuss privacy-preserving storage and the prospective masking of Personally Identifiable Information (PII) for archival data.
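For readers unfamiliar with the calibration metric quoted above, Expected Calibration Error can be computed with a simple binning scheme; this is a generic sketch of the standard definition, not the paper's evaluation code.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: bin predictions by confidence and average the
    |accuracy - confidence| gap per bin, weighted by bin size."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(acc - conf)
    return ece

# Well-calibrated toy example: 80% confidence, 4 of 5 correct -> ECE ~ 0.
print(expected_calibration_error([0.8] * 5, [1, 1, 1, 1, 0]))
```

Low ECE means stated probabilities track observed accuracy, which is what makes threshold-based hiring decisions defensible.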
CLARIN-PT-LDB: An Open LLM Leaderboard for Portuguese to assess Language, Culture and Civility
João Ricardo Silva | Luís Gomes | António Branco
This paper reports on the development of a leaderboard of open Large Language Models (LLMs) for European Portuguese (PT-PT), and on its associated benchmarks. This leaderboard comes as a way to address a gap in the evaluation of LLMs for European Portuguese, which so far had no leaderboard dedicated to this variant of the language. The paper also reports on novel benchmarks, including some that address aspects of performance that so far have not been available in benchmarks for European Portuguese, namely model safeguards and alignment to Portuguese culture. The leaderboard is available at https://huggingface.co/spaces/PORTULAN/portuguese-llm-leaderboard.
A Comparison of Methods to Bias Translation Toward Portuguese Variants
Catarina Costa | Sebastian Padó
Portuguese serves as the official language of multiple countries across four continents. It is classified into two primary variants (European Portuguese and Brazilian Portuguese), but there is limited research on and resources for European Portuguese compared to the Brazilian variant. In this paper, we consider the task of Machine Translation (MT) into Portuguese. Given the resource imbalance, standard MT systems produce translations that are typically closer to the Brazilian standard. We compare four methods available to bias the translation toward the minority European Portuguese variant that target different places in the MT lifecycle: (1) reranking n-best MT outputs according to a variant classifier; (2) biasing hypothesis generation at inference time toward the target variant; (3) fine-tuning for the target variants; (4) moving completely to an LLM-based approach. We find that all methods can bias translation outputs to an extent. The LLM-based approach yields numerically the highest results, but the impact of memorisation remains unclear.
Que ao mestre vai matá-lo? The evolution of prepositional accusatives in Portuguese across time
Helena Rodrigues Menezes de Oliveira Vaz
This work investigates Differential Object Marking (DOM) in Brazilian Portuguese (BP), specifically a-marked objects, or prepositional accusatives (PP-ACCs), across four variables: semantic requirements, constituent order, verb semantics, and textual genre. An optimized parsing model was trained to recognize instances of PP-ACCs and to automatically annotate historical documents for these objects in the Tycho Brahe and Colonia corpora. Contrary to expectations based on the low frequency of these objects and prior diachronic studies on European Portuguese (EP), our results reveal that PP-ACCs remain present in BP from the 18th century onward. Our findings confirm previous patterns for EP and point to textual genre (specifically, narrative texts and theater plays) as a possibly relevant variable, though this warrants further investigation. Constituent order proved less significant than previously suggested. This work also reveals methodological challenges in using computational models and NLP tools for research in historical Portuguese.
Topic Modeling in Brazilian Portuguese Documents on Antimicrobial Resistance
Enrique Reis Susin | Lilian Berton
This study analyzes texts from multiple sources, including social media and news portals, to observe how different sectors of Brazilian society discuss antimicrobial resistance. The main goal is to support epidemiological surveillance and public policy decisions through computational tools. Three datasets were used: tweets collected between 2008 and 2025 (64,225 documents), news articles from G1 (4,363 documents), and official government publications (.gov.br, 1,515 documents). These sources enable comparative analysis between informal discourse (social media) and institutional or journalistic discourse (official and media outlets). The study applies and compares topic modeling techniques, particularly those designed for Short Text Topic Modeling (STTM), such as GSDMM and BERTopic, to identify discursive trends, semantic patterns, and emerging topics related to antimicrobial resistance. By exploring these distinct contexts, this work demonstrates the potential of Natural Language Processing (NLP) and AI methods as instruments for integrated analysis of public health data in both informal and formal environments.
Geological Text Summarization Using Generative Large Language Models
Matheus Stein de Aguiar | Rafael Oleques Nunes | Dennis Giovani Balreira
Large generative language models have demonstrated impressive performance in various Natural Language Processing (NLP) tasks. However, the geological domain presents unique challenges for NLP due to its specialized language, which is full of technical terms. Therefore, language models pre-trained on generic corpora may not be suitable for geological domain-specific tasks. This article compares several models to identify those with the best performance on a text summarization task in the Portuguese geological domain. We applied the models to a Revista Geologia USP dataset, which consists of abstracts of scientific texts and their respective titles; the models are asked to approximate the titles through summarization. We tested the models in various scenarios, with and without examples, and at two temperature levels. We then evaluated the models’ performance using quantitative metrics and a brief qualitative analysis comparing the titles proposed by the models with the originals. The results show that the Gemma3:27b model was better in some scenarios, while the Llama3:8b model performed best in others.
Retrieval-Augmented Generation with Small Language Models for Fake News Detection
Lucca Baptista Silva Ferraz | Jhúlia de Souza Leal | Anderson Raymundo Avila | Thiago Alexandre Salgueiro Pardo | Fernando Batista | Renato Moraes Silva
The spread of online misinformation has made fake news detection an essential tool for mitigating its negative impact, but many studies disregard temporal information, and existing datasets become outdated as news evolves. Some modern solutions using Retrieval-Augmented Generation (RAG) can solve the problem of unseen news events by providing context to the models. However, there are no studies evaluating the feasibility of web searches for obtaining context to decide whether a news article is true or not. This work aims to address this gap by conducting a comparative study between RAG-based solutions, traditional fake news classification methods, and deep learning-based methods. The results show that although RAG is a modern and promising technique, it cannot outperform techniques already adopted in the literature.
Bridging Citizens and Public Services: Improving Service Association with Retrieval-Augmented Generation (RAG) Labels
Ticiana L. Coelho da Silva | Celso França | Marcos André Gonçalves | Leonardo Rocha | Leonardo Alamy | Fernando Sola Pereira | Eduardo Soares de Paiva
Linking citizen complaints to the public services they concern remains a major challenge in the Brazilian federal administration. In 2025, over 1.2 million manifestations were submitted across 328 agencies, yet only about 1.8% are currently associated with a specific service, limiting large-scale monitoring and evidence-based management. We cast this task as an extreme multi-class text classification problem marked by severe class imbalance and strong lexical–semantic gaps between citizen language and official service descriptions. Building on recent work that reframes the task as information retrieval, we combine sparse retrieval with BM25 over representative complaint corpora and dense retrieval enriched with RAG-labels: semantically expanded label descriptions generated via Retrieval-Augmented Generation and Small Language Models. This approach markedly reduces vocabulary mismatch and semantic ambiguity, yielding substantial gains over direct text or embedding matching. To our knowledge, this is the first Portuguese-language application of RAG-labels for service–complaint association. In real operational data from the Federal Ombudsman Office, our method can automatically assign plausible services to roughly 73% of previously unlabeled cases, improving coverage and supporting more effective public service evaluation.
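The sparse-retrieval leg (BM25) mentioned in this abstract can be illustrated with a minimal self-contained scorer; the tokenized complaints and query below are invented examples, and a production system would use an inverted index rather than this linear scan.

```python
import math

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query
    with the classic BM25 formula."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = {}                              # document frequency per term
    for d in docs:
        for t in set(d):
            df[t] = df.get(t, 0) + 1
    scores = []
    for d in docs:
        s = 0.0
        for t in query:
            if t not in df:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            tf = d.count(t)
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

# Hypothetical complaint texts, pre-tokenized.
docs = [["benefício", "social", "atrasado"],
        ["passaporte", "emissão"],
        ["benefício", "cancelado"]]
print(bm25_scores(["benefício", "atrasado"], docs))
```

The dense-retrieval leg with RAG-labels would complement these lexical scores with embedding similarity over the expanded service descriptions.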
Negation-Aware Data Augmentation for Portuguese Natural Language Inference
Maria Cecília M. Corrêa | Felipe S. F. Paula | Matheus Westhelle | Viviane P. Moreira
Negation plays a fundamental role in human communication and logical reasoning, yet it remains underrepresented in natural language inference (NLI) datasets. This work investigates the impact of targeted data augmentation using negation cues on the main NLI datasets for Portuguese (InferBR, ASSIN and ASSIN2). By synthetically generating new instances with negated hypotheses, we create more diverse training and test sets. A BERT-based model was fine-tuned and tested on the combined datasets and augmented data. The results show that the model was heavily influenced by the bias in the use of negation, and increased data diversity improves the model’s handling of negation.
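As a deliberately naive illustration of negation-cue augmentation, one can insert "não" before the hypothesis verb and flip the entailment label. Real augmentation needs parsing and careful label logic (the flip shown does not hold in general), and nothing below is taken from the paper's actual generation rules.

```python
# Toy rule-based augmentation for Portuguese NLI hypotheses.
# The verb position is given explicitly; a real system would parse it.
def negate_hypothesis(tokens, verb_index, label):
    """Insert 'não' before the verb and flip the NLI label.
    The entailment<->contradiction flip is a simplification."""
    negated = tokens[:verb_index] + ["não"] + tokens[verb_index:]
    flipped = {"entailment": "contradiction",
               "contradiction": "entailment",
               "neutral": "neutral"}[label]
    return " ".join(negated), flipped

sent, lab = negate_hypothesis(["O", "menino", "corre"], 2, "entailment")
print(sent, "->", lab)   # O menino não corre -> contradiction
```

Generated pairs like this one would then be filtered or reviewed before joining the training set, since naive flips can produce incoherent labels.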
To Describe or Not to Describe? Benchmarking Database Representations for Schema Linking in Text-to-SQL
Daiane Ucceli Kreitlow | Hilário Tomaz Alves de Oliveira
Text-to-SQL systems aim to translate natural language questions into Structured Query Language (SQL) queries, enabling database access without requiring SQL expertise. In real-world scenarios, these systems often need to manage multiple databases with heterogeneous schemas, making Schema Linking a crucial preliminary step for identifying relevant databases, tables, and columns. This study investigates Schema Linking for questions written in Brazilian Portuguese and compares two schema representation strategies: natural-language descriptions generated by Large Language Models (LLMs) and representations based on Data Definition Language (DDL) and Data Manipulation Language (DML) commands. Experiments conducted on a Brazilian Portuguese version of the Spider dataset, with over 200 databases, evaluated several LLMs and embedding models. The experimental results based on Hit@k show that natural language descriptions consistently outperform DDL/DML-based representations, demonstrating the effectiveness of LLM-generated schema descriptions for Schema Linking tasks.
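The Hit@k metric used in this evaluation is straightforward to compute from ranked retrieval results; the database names below are hypothetical.

```python
def hit_at_k(ranked_ids, gold_id, k):
    """1 if the gold database appears among the top-k retrieved ids."""
    return int(gold_id in ranked_ids[:k])

# Rankings for three questions: (retrieved order, gold database).
rankings = [(["flights", "hotels", "cars"], "flights"),
            (["hotels", "flights", "cars"], "flights"),
            (["cars", "hotels", "flights"], "flights")]

for k in (1, 3):
    score = sum(hit_at_k(r, g, k) for r, g in rankings) / len(rankings)
    print(f"Hit@{k} = {score:.2f}")
```

Averaged over the question set, Hit@k measures how often the correct schema survives the linking step for a downstream SQL generator to use.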
Evaluating Automated Scoring Models on Official ENEM Essays
Laís Nuto Rossman | Igor Cataneo Silveira | Denis Deratani Mauá
Automated Essay Scoring systems can relieve teachers of the laborious task of grading essays and allow students to practice more frequently due to faster feedback cycles. In Brazilian Portuguese, there is growing interest in automatic scoring systems for the standardized ENEM exam. However, the only available datasets consist of essays written as practice for the official exam. In the literature, to the best of our knowledge, there is no work that evaluates official ENEM essays using mock-exam datasets. This work fills that gap by presenting a new labeled dataset composed of 157 essays written for the official ENEM exam. The analysis shows that this dataset shares characteristics similar to existing datasets of mock-exam essays. The results also indicate that, for small datasets such as this one, the use of LLMs pretrained on mock exams significantly improves the performance of automatic scorers for official ENEM essays, yielding an average gain of 0.27 points in the Quadratic Weighted Kappa metric compared to training solely on official data.
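Quadratic Weighted Kappa, the metric reported above, can be computed from scratch as follows; this is the standard definition, not the authors' evaluation script, and the toy score vectors are invented.

```python
def quadratic_weighted_kappa(a, b, n_classes):
    """QWK between two integer score vectors: 1 - weighted observed
    disagreement over weighted chance disagreement."""
    n = len(a)
    O = [[0] * n_classes for _ in range(n_classes)]   # observed matrix
    for x, y in zip(a, b):
        O[x][y] += 1
    hist_a = [a.count(i) for i in range(n_classes)]
    hist_b = [b.count(i) for i in range(n_classes)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2   # quadratic weight
            num += w * O[i][j]
            den += w * hist_a[i] * hist_b[j] / n      # expected by chance
    return 1 - num / den

print(quadratic_weighted_kappa([0, 1, 2, 2], [0, 1, 2, 2], 3))  # 1.0
```

QWK penalizes large score disagreements quadratically, which suits ordinal essay grades better than plain accuracy.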
Enhancing Brazilian Inflation Forecasts through Sentiment Analysis Using Large Language Models
Lucas Miranda Mendonça Rezende | Cézio Luiz Ferreira Junior | Mateus Tarcinalli Machado | Evandro Eduardo Seron Ruiz
Reliable inflation forecasts play a critical role in economic stability and policy decisions. Traditional econometric models perform well but often overlook qualitative signals that could improve predictive accuracy. Recent advances in AI-based Natural Language Processing enable the extraction of latent sentiment, offering a promising avenue for inflation forecasting. This study proposes a framework that combines Large Language Models (LLMs) to extract sentiment variables from the Brazilian Monetary Policy Committee (COPOM) minutes, optimize bias to match human-collected sentiment, and integrate them into ARIMA and LSTM models for one-step-ahead monthly IPCA prediction. Results show that LLM-generated sentiment trends are temporally coherent with historical inflation patterns and highly statistically significant (p < 0.001). Models whose sentiment evaluations aligned more closely with human assessments (e.g., grok-4-fast and llama-4-maverick) achieved superior forecasting performance. ARIMA models consistently benefited from sentiment inclusion, while LSTM results were more variable.
NorBERTo: A ModernBERT Model Trained for Portuguese with 331 Billion Tokens Corpus
Lucas F. A. O. Pellicer | Guilherme Rinaldo
High-quality corpora are essential for advancing Natural Language Processing (NLP) in Portuguese. Building on previous encoder-only models such as BERTimbau and Albertina PT-BR, we introduce NorBERTo, a modern encoder based on the ModernBERT architecture, featuring long-context support and efficient attention mechanisms. NorBERTo is trained on Aurora-PT, a newly curated Brazilian Portuguese corpus comprising 331 billion GPT-2 tokens collected from diverse web sources and existing multilingual datasets. We systematically benchmark NorBERTo against strong baselines on semantic similarity, textual entailment and classification tasks using standardized datasets such as ASSIN 2 and PLUE. On PLUE, NorBERTo-large achieves the best results among the encoder models we evaluated, notably reaching 0.9191 F1 on MRPC and 0.7689 accuracy on RTE. On ASSIN 2, NorBERTo-large attains the highest entailment F1 (0.904) among all encoders considered and remains competitive on the remaining tasks, where Albertina-900M and BERTimbau-large still hold an advantage. To the best of our knowledge, Aurora-PT is currently the largest openly available monolingual Portuguese corpus, surpassing previous resources. NorBERTo provides a modern, mid-sized encoder designed for realistic deployment scenarios: it is straightforward to fine-tune, efficient to serve, and well suited as a backbone for retrieval-augmented generation and other downstream Portuguese NLP systems.
AspectRAG: Uma Arquitetura de Recuperação e Geração para Análise de Sentimentos Baseada em Aspectos
Erick R. Ribeiro | André Carvalho | Rhedson Esashika
We propose AspectRAG, a Retrieval-and-Generation architecture for ASTE in Portuguese that operates without supervised training. The method extracts aspects with an LLM, encodes them as dense vectors, and uses only these vectors to retrieve highly specific evidence through approximate search and rank fusion. The retrieved evidence composes the context of the generator model, which produces the final triples. On the ReLi and ReHol datasets, AspectRAG achieves up to 93.47% on ATE, 80.68% on OTE, and 69.83% on ASTE, outperforming supervised models such as OTE-MTL, CMLA-MTL, and BOTE, the state of the art for Portuguese. The ablation study shows that aspect-guided semantic retrieval is the main factor behind the observed gains, while the size of the LLM has a secondary impact. The results show that the AspectRAG architecture is an efficient solution, competitive even without fine-tuning, relying solely on vector retrieval and contextualized inference.
Semantic Representation of Relative Clauses in Lexicalized Abstract Meaning Representation
Jorge Baptista | Sónia Reis
This paper analyzes the semantic parsing of relative clauses in Portuguese in two meaning representation frameworks: Abstract Meaning Representation (AMR) and Lexicalized Meaning Representation (LMR). While both treat relatives as noun modifiers, AMR fails to distinguish restrictive from appositive clauses, an important traditional grammatical distinction. We argue for explicitly encoding this difference. The study draws on annotated translations of The Little Prince (Saint-Exupéry, 1943) in Brazilian and European Portuguese, highlighting issues in the Brazilian AMR annotations.
Portuguese Sentiment Analysis with Open-Source LLMs: Models, Prompts, and Efficient Deployment
João V R J Lima | Vládia Pinheiro | Carlos Caminha
Robust sentiment analysis in Portuguese is central to applications across Lusophone contexts, yet systematic evaluations still focus predominantly on English and proprietary systems. This paper presents a comparative study of 29 open-source Large Language Models (LLMs) and two proprietary models on Portuguese sentiment classification under four prompting strategies: Zero-Shot, Few-Shot, Chain-of-Thought (CoT), and CoT with Few-Shot (CoT+FS). Experiments on a unified three-class benchmark built from three public review corpora (about 3,000 instances) comprise roughly 372,000 inferences, totaling approximately 150M input tokens and 65M output tokens. Results show that CoT+FS generally yields the best performance for larger models, while several compact open-source models obtain competitive F1-scores with substantially lower computational cost, making them suitable for real-world deployments. We identify concrete teacher–student configurations tailored for knowledge distillation in Portuguese sentiment analysis.
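A CoT with Few-Shot prompt of the kind compared in this study can be assembled as in this sketch; the Portuguese instructions and example reviews are invented, not the paper's actual prompts.

```python
# Hypothetical CoT + Few-Shot (CoT+FS) prompt assembly for
# three-class Portuguese sentiment classification.
FEW_SHOT = [("Produto excelente, recomendo!", "positivo"),
            ("Chegou quebrado, péssimo.", "negativo")]

def build_prompt(review):
    lines = ["Classifique o sentimento (positivo, negativo ou neutro).",
             "Pense passo a passo antes de responder."]   # CoT cue
    for text, label in FEW_SHOT:                          # few-shot block
        lines.append(f"Resenha: {text}\nSentimento: {label}")
    lines.append(f"Resenha: {review}\nSentimento:")       # target instance
    return "\n\n".join(lines)

print(build_prompt("Entrega rápida, mas a embalagem veio amassada."))
```

Zero-Shot drops both the reasoning cue and the examples; the study's comparison varies exactly these two components across 31 models.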
Auditing the Evaluators: How Far Can Automatic Evaluation Go in Assessing Portuguese Financial Texts?
Marina Ramalhete Masid | Gabriel Assis | Daniela Vianna | Aline Paes | Altigran Soares da Silva
Automatic metrics are widely used to evaluate text quality across various natural language processing tasks. Despite their convenience and scalability, the extent to which these metrics reliably reflect textual quality remains an open challenge. The LLM-as-a-judge paradigm has recently emerged, aligning more closely with human judgments by using LLMs themselves as evaluators. However, there is still a gap in such evaluations across specific domains and languages, as most prior work focuses on generic task benchmarks in English. In this paper, we examine the robustness of both traditional automatic metrics and the LLM-as-a-judge approach for assessing the quality of financial commentaries in Portuguese, an underexplored task and language that has been neglected in previous work. We introduce fine-grained perturbations into the texts generated by specialists to analyze which types of noise most significantly affect evaluation outcomes, using noise-free counterparts as references. The results highlight the weaknesses of classical metrics in this specific task and the limitations of even recent evaluation paradigms, underscoring the need to develop context- and domain-sensitive evaluation methods.
A Multimodal Framework for Financial Fake News Detection for Brazilian Portuguese
José Vitor Souza Cardoso Requena | João Victor Assaoka Ribeiro | Lilian Berton
The rapid dissemination of digital information has exposed financial markets to the risks of disinformation. Although numerous methods exist to detect fake news, they predominantly focus on textual features, often neglecting the significant role of image-based content. This paper introduces a novel framework for detecting financial fake news in Brazilian Portuguese that bridges this gap. The proposed system integrates Natural Language Processing (NLP) with an image-to-text classification strategy: using Tesseract-based OCR, the system extracts text from images and processes it through the same unified pipeline used for text classification. Experiments on the Fake.BR and FakeRecogna corpora and BBC News Brasil show that our approach achieves 98% accuracy using BERTimbau fine-tuned on financial news. These findings underscore the critical importance of analyzing visual text and demonstrate that the multimodal strategy is effective for disinformation detection.
A UD Parser to the Rescue: A Method for Bringing a Classical Annotated Corpus to Life Again
Lucelene Lopes | Magali S. Duran | Thiago A. S. Pardo
This paper reports on an effort to recover the classical morphosyntactically annotated corpus MacMorpho and realign it with the current version of the Universal Dependencies framework. We introduce a knowledge-rich approach grounded in a syntactic parser and a specially designed tagset compatibility strategy in order to generate a "silver-standard" resource: the MacMorpho-UD-2.17. We evaluate this resource through multiple complementary methods, providing evidence for the quality of both our approach and the resulting annotation.
Data Augmentation for Named Entity Recognition in Domain-Specific Scenarios in Portuguese
Higor Moreira | Patricia Ferreira da Silva | Luciana Bencke | Viviane Moreira
Named Entity Recognition (NER) is an important task of Natural Language Processing. Achieving good results in this task usually requires a large amount of labeled data to train models. This is especially difficult in domain-specific datasets and low-resourced languages. To mitigate the high cost of human-annotated data, data augmentation can be used. In this work, we evaluate Data Augmentation techniques for NER, focusing on domain-specific datasets in Portuguese. We employed augmentation techniques based on rules, back-translation, and large language models on four datasets of varying sizes to train Transformer-based NER models. The results showed that most techniques improved over the baseline, with the best results achieved using PP-LLM, SR, and MR.
Token-Level Pun Location Using Multi-Layer BERT with Mixture of Experts
Rafael Torres Anchiêta | Roney Lira de Sales Santos | Raimundo Santos Moura
Humor processing remains a complex challenge in Natural Language Processing, particularly the task of pun location, which involves identifying the specific "pivot word" that creates linguistic ambiguity. This paper presents a novel two-stage approach for token-level pun location in Portuguese, addressing the scarcity of research in this language. The first stage uses an ensemble of traditional classifiers to filter out non-pun sentences, thereby reducing class imbalance. The second stage employs a pre-trained BERT encoder combined with a Mixture-of-Experts (MoE) layer to capture specialized linguistic features for token classification. We validate our approach on the Puntuguese corpus, achieving an F-score of 0.74 without requiring post-processing heuristics. Interpretability analyses demonstrate that the MoE experts learn to specialize in distinct mechanisms, such as punchline detection and morphological patterns, thereby confirming the model’s capacity to capture the nuances of humor.
Language Effects in Text-to-SQL Across English and Portuguese
Lucas Nobre | Suele Sousa | Savio Teles | Anderson Soares
Text-to-SQL systems allow users to query relational databases using natural language, but accuracy remains sensitive to the choice of language, model architecture, and prompting strategy. Although recent Large Language Models (LLMs) incorporate reasoning mechanisms that improve multi-step problem solving in other domains, their effects on multilingual Text-to-SQL are not yet well understood. This work evaluates a diverse set of LLMs on the BIRD benchmark and BIRD_PT, a Portuguese version produced by translating the questions and external knowledge while keeping the original English database schema and values unchanged. We compare four controlled scenarios that vary internal reasoning and guided reasoning for SQL generation. The results show a consistent decrease in accuracy when switching from English to Portuguese, with large variations in robustness across models. Reasoning alone does not reliably improve execution accuracy and can reduce performance in Portuguese, while combining reasoning with a guided plan provides the most stable improvements, although still weaker than in English. These findings highlight ongoing challenges in multilingual Text-to-SQL and emphasize the need to jointly consider language understanding, reasoning activation, and task-aligned planning when designing future systems.
Specializing a Small Language Model for Closed-Domain Portuguese RAG using Knowledge Graph Supervision
Josué Caldas | Elvis de Souza | Patrícia Silva | Marco Pacheco
Fine-tuned small language models (SLMs) have emerged as effective alternatives for closed-domain tasks, where large language models (LLMs) often lack sufficient parametric knowledge. This study presents a methodology for adapting a small language model to a closed-domain question answering (QA) task. For each question, the model is trained to output an answer based on the most relevant context passage among ten provided candidates, thus reproducing the logic of a Retrieval-Augmented Generation (RAG) framework. The fine-tuning data were derived from PetroKGraph, an existing knowledge graph built from Portuguese-language resources in the oil and gas (O&G) domain. Experimental results show that the fine-tuned model achieves a 20-percentage-point accuracy improvement over the base model on closed-domain questions. It also surpasses GPT-4o and GPT-4o Mini by 12 and 25 points, respectively. Moreover, its performance on general-domain tasks remains comparable to that of the base model, indicating that the specialized model effectively learned domain-specific knowledge while maintaining general reasoning capabilities.
Modeling Linguistic Violence: An Ontology-Based Framework for the Computational Analysis of Violence Manifested in Language
Brenda Salenave Santana | Ana Marilza Pernas | Aline A. Vanin
The conceptual ambiguity among terms like ’hate speech’, ’toxic speech’, and ’dangerous speech’ creates a significant bottleneck for both research and automated moderation. Traditional NLP models, often focused on lexical cues, struggle to differentiate these nuanced forms of linguistic violence, especially when the harm is implicit. This paper addresses this gap with a twofold objective. First, we conduct a conceptual review and propose a unified ontology that differentiates these concepts—including verbal aggression and cyberbullying—based on their core attributes, such as their target, intent, and associated rhetorical hallmarks. Second, we propose a computational methodology designed to operationalize this ontology. Our framework uses a multi-stage NLP pipeline that leverages semantic analysis, specifically Semantic Role Labeling and Named Entity Recognition, to deconstruct speech acts into their core components (e.g., target and action). This component-based approach allows for a granular classification that can robustly distinguish between seemingly similar phenomena, such as a general insult and a targeted identity-based attack. This methodology is particularly promising for low-resource languages, such as Portuguese, as it relies on core semantic tasks for which multilingual models are available, rather than requiring massive, task-specific labeled datasets.
Software for Automatic Speech Recognition via Whisper models applied to Oral History interviews in the Portuguese language
Edgleide de Oliveira Clemente da Silva | Fernando Rezende Zagatti | Filipe Loyola Lopes | Anderson Dias Duarte | Rodrigo Bonacin | Angela Maria Alves
This paper presents Ethos AT, a desktop software for automatic transcription that uses OpenAI Whisper models, enabling local processing and ensuring data privacy and accessibility for users who are not necessarily programming experts, such as oral history researchers. A comparative analysis of six Whisper models (small, medium, large, large-v2, large-v3, and turbo) was conducted to analyze performance in terms of transcription accuracy, error types, and processing time. Results indicate that larger models achieve higher lexical accuracy, while smaller ones provide faster execution with acceptable quality for general use; the turbo model showed an effective balance between accuracy and speed. Overall, Ethos AT offers a secure, efficient, and user-friendly solution for academic and institutional contexts.
Uma Abordagem Híbrida para Predição de Faixa Etária de Autores de Textos Escritos na Língua Portuguesa
Alice Rezende Ribeiro | Luiz Henrique de Campos Merschmann
The growing amount of text available on the Web makes text mining tools essential for extracting valuable information for a variety of applications. Beyond the texts themselves, however, knowing the characteristics of their authors is crucial for some organizations. Since texts can be published anonymously, there is growing interest in research aimed at creating computational techniques to infer the demographic characteristics of their authors. Nevertheless, for the problem of predicting the age group of authors of texts written in Portuguese, the limited amount of resources and the low predictive performance highlight the need for more research focused on this task. This work therefore proposes and evaluates an approach that, in addition to a traditional classifier, uses word dictionaries to capture the specificities of the textual domain and improve the predictive performance of the age-group prediction task. The experimental results obtained with the proposed approach show that exploiting the characteristics of the texts' domain can contribute positively to the performance of this task.
Enhanced Universal Dependencies in the Wild: Evaluating Portuguese EUD Parsing in Realistic Scenarios
Elvis A. de Souza | Thiago A. S. Pardo
Enhanced Universal Dependencies (EUD) provide a more informative syntactic representation than Basic Universal Dependencies by relaxing tree constraints to allow for graph structures. While conversion rules from basic to enhanced relations have been established for Portuguese, they were previously evaluated only on journalistic text using gold-standard basic syntactic trees. This paper evaluates the robustness of these rules in diverse scenarios ("in the wild"), encompassing other text genres and domains, as well as realistic parsing pipelines that rely on automatically generated basic syntax. Our results demonstrate that Portuguese-specific rules consistently outperform universal rules. However, the reliance on automatic basic syntax significantly impacts performance. This degradation is particularly severe when the domain of the input text differs from the training data of the basic parser. We also provide a detailed error analysis, identifying specific difficult linguistic phenomena and scenarios.
UlyssesLegalNER-Br: from Legislative to Legal, a comprehensive corpus of Brazilian legal documents for Named Entity Recognition
Hidelberg O. Albuquerque | Ellen Souza | Danilo C. G. Lucena | Héldon J. O. Albuquerque | Nádia F. F. da Silva | Márcio de S. Dias | Rafael O. Nunes | Adriano L. I. Oliveira | André C. P. L. F. de Carvalho
The legal domain presents several challenges for Natural Language Processing (NLP), particularly due to its linguistic complexity and lack of public datasets. Named Entity Recognition (NER), a subarea of NLP, has been successfully used to extract useful knowledge from legal texts, but its widespread use is limited by the lack of legal text corpora. This paper introduces UlyssesLegalNER-Br, a comprehensive corpus of Brazilian legal documents for NER, covering bills, case laws, and laws, including the first NER corpus based exclusively on Brazilian laws. This research expands the UlyssesNER-Br corpus, previously focused only on the Brazilian legislative domain. The proposed corpus has 560 public documents annotated using a hybrid approach, organized into 9 categories and 23 fine-grained types, and experimentally evaluated with the CRF, BiLSTM, and BERTimbau architectures regarding predictive performance, computational cost, and label-level results. The best micro F1 of 96.18% was achieved by BERTimbau on the unified corpus, providing a strong baseline for Brazilian legal NER. At the label level, six categories and seven types presented an F1-score above 95%, while the lowest were distributed in the interval 71-82%.
The PROPOR Ecosystem: Structure, Roles, and Evolution of Portuguese-Language NLP
Rafael O. Nunes | Gustavo L. Tamiosso | Pedro L. C. de Andrade | Matheus S. de Aguiar | Rafael P. de Gouveia | Higor Moreira | Bruno Tavares | Laura P. de Gouveia | Felipe S. F. Paula | Andre Spritzer | Hidelberg O. Albuquerque | Nádia F. F. da Silva | Ellen P. R. S. Pereira | Dennis G. Balreira | Joel L. Carbonera
The PROPOR conference has been the main venue for Portuguese language Natural Language Processing (NLP) research for over two decades. This paper presents a longitudinal bibliometric analysis of PROPOR from 2003 to 2024, examining thematic evolution, community structure, and scientific impact. We identify a shift from speech-oriented research toward text-based tasks, alongside the sustained importance of resources and linguistic theory. The community exhibits a stable structure, with complementary leadership models centered on institutional hubs and brokerage roles. Scientific impact is highly concentrated, following a long tail distribution, and distinguishes between cumulative productivity-driven impact and rapidly accelerating citation uptake in recent editions. These findings characterize PROPOR as a resilient regional linguistic ecosystem evolving in dialogue with broader NLP paradigms.
Twenty Years of HAREM: A Reproducible Audit and Reassessment of Portuguese Named Entity Recognition
Rafael O. Nunes | André Spritzer | Carla M. D. S. Freitas | Dennis G. Balreira
For two decades, the HAREM corpus has served as the foundational benchmark for Portuguese Named Entity Recognition (NER), establishing its evaluation paradigm. Virtually all major progress has been measured against its fixed train/test split. This paper presents the first systematic audit of this split, revealing 153 overlapping (contaminated) sentences. We re-evaluate 13 NER models (ranging from CRFs to Transformers) on both the original and a new, decontaminated version of the corpus. Our statistical analysis reveals that decontamination has a significant (p < 0.05) and positive impact on the majority of models. We find that performance gains are most pronounced in the macro F1 score (up to +4 points), demonstrating that the contamination primarily harmed generalization on rare entity types. Furthermore, our audit reveals clear evidence of overfitting in some models that benefited from data leakage. We conclude that even minor contamination can distort performance metrics and mask true model generalization. We release our decontaminated benchmark to ensure more reliable future evaluations.
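A split audit of this kind boils down to checking test sentences against the training split after light normalization. A minimal sketch (illustrative, not the authors' actual tooling; the example sentences are made up):

```python
def find_contamination(train_sentences, test_sentences):
    """Return test sentences that also occur in the training split,
    after light normalization (case folding, whitespace collapsing)."""
    def norm(s):
        return " ".join(s.lower().split())
    train_set = {norm(s) for s in train_sentences}
    return [s for s in test_sentences if norm(s) in train_set]

train = ["O Porto fica em Portugal .", "Ela vive em Lisboa ."]
test = ["o porto  fica em portugal .", "O Brasil é grande ."]
overlap = find_contamination(train, test)  # flags the first test sentence
```

Decontamination then simply drops the flagged sentences from one of the splits before re-evaluation.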
Analysis of Machine Translators on Sentences Generated by Portuguese Image Captioning Models
Natan Moura | João Medrado Gondim | Daniela Barreiro Claro | Babacar Mane
Recent works in the fields of computer vision and natural language processing have enabled the recognition and identification of objects in images, generating automatic descriptions. Despite these advancements, the main research in this field is primarily related to the English language, requiring some adaptation when dealing with other languages, such as Portuguese. One of these methods is the translate-train approach, which involves translating the training dataset into the desired language. However, there are various translators with different levels of effectiveness available. The primary objective of this work is to evaluate the behavior of image captioning models when trained on datasets translated into Portuguese by different automatic translators, both quantitatively (cost, training time, metrics on the test set) and qualitatively (comparative evaluation form, error analysis). The results indicate that it is possible to obtain valid automatic descriptions in Portuguese from image captioning models trained on translated datasets, and that more robust translators produce more meaningful descriptions.
Analyzing Debate Dynamics in the Portuguese Parliament with Dialogue Action Flows
Patrícia Ferreira | Ana Alves | Catarina Silva | Hugo Gonçalo Oliveira
Analyzing how large-scale multi-party dialogues shape collective behavior is a central challenge in computational linguistics. However, traditional text-based methods often overlook the complex, non-linear turn-taking dynamics defining these interactions. To address this gap, we propose a framework based on Dialogue Action Flows (DAFs) that integrates verbal utterances and non-verbal actions into a unified probabilistic representation of interactional behavior. Interactions are encoded as speaker-action states, forming a probabilistic DAF that reveals dominant behavioral trajectories and recurrent patterns. We validate this framework on five years of Portuguese Parliament debates. Analysis reveals systematic behavioral asymmetries driven by party roles: while government parties exhibit increasing alignment, opposition forces, particularly the radical wing, maintain persistently high conflict. Additionally, the rising volume of interactions across legislative years indicates a progressively heated environment. Overall, our framework provides a quantitative and interpretable approach for modeling polarization, alignment, and interactional dynamics in multi-party political discourse.
AMALIA: A Fully Open Large Language Model for European Portuguese
Afonso Simplício | Gonçalo Vinagre | Miguel Moura Ramos | Diogo Tavares | Rafael Ferreira | Giuseppe Attanasio | Duarte M. Alves | Inês Calvo | Inês Vieira | Rui Guerra | James Furtado | Beatriz Canaverde | Iago Paulo | Vasco Ramos | Diogo Glória-Silva | Miguel Faria | Marcos Treviso | Daniel Gomes | Pedro Gomes | David Semedo | André Martins | João Magalhães
Despite rapid progress in open large language models (LLMs), European Portuguese (pt-PT) remains underrepresented in both training data and native evaluation, with machine-translated benchmarks likely missing the variant’s linguistic and cultural nuances. We introduce AMALIA, a fully open LLM that prioritizes pt-PT by using more high-quality pt-PT data during both the mid- and post-training stages. To evaluate pt-PT more faithfully, we release a suite of pt-PT benchmarks that includes translated standard tasks and four new datasets targeting pt-PT generation, linguistic competence, and pt-PT/pt-BR bias. Experiments show that AMALIA matches strong baselines on translated benchmarks while substantially improving performance on pt-PT-specific evaluations, supporting the case for targeted training and native benchmarking for European Portuguese.
LegalSim-PT: Building a Dataset for Legal Document Simplification in Portuguese Leveraging Linguistic Metrics
Arthur Scalercio | Maria José Finatto | Aline Paes
Document simplification has recently attracted increasing attention due to its broader practical applicability compared to sentence-level simplification. Beyond simplifying individual sentences, this task involves preserving fluency, conciseness, and coherence across the entire text, often incorporating summarization techniques. Despite its importance, research in this area remains largely concentrated on a few languages, particularly English. In this work, we introduce LegalSim-PT, the first large-scale Portuguese dataset for document simplification based on legal texts. To mitigate reliance on manual evaluation, we combined data augmentation strategies with readability, semantic similarity, and diversity metrics to select the most suitable document pairs. We conducted a comprehensive analysis of the resulting dataset, first characterizing its surface features and comparing them with those of existing simplification corpora. Next, we assessed its quality using automatic metrics, linguistic indicators, and human evaluations. Finally, we selected representative models as baselines and fine-tuned two models on LegalSim-PT, achieving improved performance in document-level simplification.
Portho: A Corpus-Based Resource of Orthographic Neighbors in European Portuguese
Eugénio Ribeiro | David Antunes | Nuno Mamede | Jorge Baptista
Orthographic neighbors (ONs) play a central role in models of visual word recognition and have been shown to influence reading speed, lexical access, and literacy development. Despite their importance, resources providing detailed and flexible ON information remain scarce for European Portuguese. This paper introduces Portho, a corpus-based lexical resource that provides multiple ON metrics for over 43,000 word forms, using several ON definitions. In addition to classical neighborhood size measures, Portho provides frequency-based statistics and graded orthographic distance (OD) features. We analyze the statistical properties of the resource and evaluate its empirical utility in automatic text complexity assessment using the iRead4Skills corpus. Results show that while ON features alone are insufficient to predict readability, they contribute complementary information and compare favorably with existing resources for Portuguese. Portho is made publicly available in different formats to support research in psycholinguistics, readability modeling, and Natural Language Processing (NLP) for Portuguese.
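One classical neighborhood-size measure that a resource of this kind provides is Coltheart's N: the number of same-length words differing from a target by exactly one substituted letter. A sketch of computing substitution neighbors over a small lexicon (illustrative, not Portho's implementation; the toy lexicon is made up):

```python
from collections import defaultdict

def coltheart_neighbors(lexicon):
    """Map each word to its substitution neighbors (same length,
    exactly one differing character)."""
    # Bucket words by the pattern obtained when one position is masked;
    # words sharing a bucket differ only at the masked position.
    buckets = defaultdict(set)
    for w in lexicon:
        for i in range(len(w)):
            buckets[(i, w[:i], w[i+1:])].add(w)
    neighbors = {w: set() for w in lexicon}
    for group in buckets.values():
        for w in group:
            neighbors[w] |= group - {w}
    return neighbors

lex = ["casa", "cama", "caso", "copa", "rato"]
n = coltheart_neighbors(lex)  # e.g. "casa" neighbors "cama" and "caso"
```

The masking trick avoids the quadratic all-pairs comparison, which matters for a lexicon of over 43,000 word forms; graded orthographic-distance features would instead relax the one-substitution constraint.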
Discovery of Legal Patterns in Civil Petitions via LLM-Based Fact Extraction and Density Clustering
Rhedson Esashika | Carlos M. S. Figueiredo | Tiago de Melo
The analysis of unstructured civil petitions is often hindered by procedural noise and verbose argumentation. To address this, we propose a pipeline composed of LLM-based fact extraction followed by legal-domain embeddings of texts for unsupervised density clustering. We employ Large Language Models to isolate factual narratives from raw texts, which are then encoded using domain-specific representations (Legal-BERT) and grouped via UMAP dimensionality reduction and the HDBSCAN algorithm. Comparative experiments on a Brazilian judicial corpus reveal that clustering based solely on the extracted facts yields significantly more cohesive and semantically well-defined groups than clustering the raw texts, which suffer from fragmentation due to content variability. Results indicate that the proposed method is a promising approach for thematic organization, procedural triage support, and large-scale discovery of legal patterns.
Automated Reformulation of Argumentative Essays to Improve Argument Organization and Development
Naomi James Sutcliffe de Moraes | Denis Deratani Mauá
This work presents a study of automated reformulation of argumentative essays written by college-bound native speakers of Brazilian Portuguese as a form of pedagogical feedback. We first evaluate the feasibility of using large language models (LLMs) to score argument quality with respect to three criteria: the defense of a point of view, organization, and development. We then employ an LLM to provide a reformulated version of the essay as feedback. As we discuss, the main challenge is to constrain the automated feedback to address only argument quality, rather than improving other aspects such as spelling or cohesion, and to modify the essay as little as possible. We achieve levels of agreement in automatic essay scoring comparable to human inter-rater agreement metrics, while increasing explainability. Instructing the LLM to add argument support (facts, examples, etc.) was the best way to get non-superficial changes to the arguments, and it was able to add true examples and facts to the essays even without being provided with background information on the topic.
Automatic Question classification in Portuguese: A Large-Scale Dataset and Comparative Evaluation of Classification Strategies
Murilo Boccardo | Valéria D. Feltrim
This paper presents a comparative evaluation of automatic classification strategies for Brazilian university entrance exam questions by subject and fine-grained topic. A central contribution of this study is the creation and curation of a large-scale Portuguese-language dataset comprising approximately 17,000 questions collected from the Agatha.edu platform, carefully cleaned and normalized. We investigated two alternative classification strategies: a single-step approach that directly predicts fine-grained topics and a two-stage approach in which an initial model predicts the subject, followed by specialized topic classifiers. These strategies were evaluated using both classical machine learning methods, such as Support Vector Machines, Naive Bayes, and Random Forest, and transformer-based language models pre-trained for Portuguese. Experimental results show the feasibility of large-scale automatic question classification and highlight the potential of NLP-based classification strategies to support the curation, analysis, and organization of educational question banks.
Exploring Knowledge Graphs for Automatic Fake News Detection in Portuguese
Lucas dos Santos | Manoel Rodrigues Euclides Santos | Yuri Silva Souza | João Pedro Holanda Sousa | Roney Lira de Sales Santos
The proliferation of fake news in digital environments poses serious challenges to democratic processes, particularly in morphologically rich languages such as Portuguese. While most existing approaches focus on stylistic cues or propagation patterns in social networks, this paper proposes an automated fake news verification methodology grounded in Knowledge Graphs (KGs). Instead of treating news as raw text, we represent each article as a set of factual events encoded as semantic triples of subject, predicate, and object. A proprietary knowledge graph is built from Brazilian data sources, and a verification algorithm is introduced to estimate the veracity of news articles based on graph connectivity evidence. Experimental results confirm the feasibility of the proposed approach and highlight its inherent explainability as a key advantage over deep learning black-box models. Error analysis further indicates that the main limitation stems from the syntactic complexity of Open Information Extraction in Portuguese, suggesting that improvements at this extraction stage are essential to increase system robustness.
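The connectivity-based verification step can be sketched as a breadth-first search over the entity graph induced by the triples: a claimed (subject, predicate, object) fact gains support if subject and object are connected within a few hops. The triples, entity names, and hop limit below are illustrative, not the paper's actual knowledge graph or threshold:

```python
from collections import defaultdict, deque

def build_graph(triples):
    """Adjacency list over entities extracted as (subject, predicate, object)."""
    adj = defaultdict(list)
    for s, p, o in triples:
        adj[s].append(o)
        adj[o].append(s)  # treat the graph as undirected for connectivity
    return adj

def connected(adj, source, target, max_hops=3):
    """True if target is reachable from source within max_hops edges (BFS)."""
    frontier, seen = deque([(source, 0)]), {source}
    while frontier:
        node, depth = frontier.popleft()
        if node == target:
            return True
        if depth < max_hops:
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, depth + 1))
    return False

kg = [("Brasília", "capital_de", "Brasil"),
      ("Brasil", "localizado_em", "América do Sul")]
adj = build_graph(kg)
```

The path itself doubles as an explanation of the verdict, which is the explainability advantage the abstract highlights over black-box classifiers.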
Anatomy of Data Repositories for the Analysis and Detection of Toxicity in Portuguese
Lorena Souza Moreira | Paula Teresa M. Gibrim | Leonardo Rocha | Julio C. S. Reis
The proliferation of online hate speech requires a rigorous examination of the datasets used to train detection models. In this work, we analyze six Brazilian Portuguese datasets annotated for hate speech or toxicity to investigate how their lexical "anatomy" and domain characteristics affect cross-domain generalization. We combine HurtLex-based lexical profiling with cross-dataset evaluation in a feature-based transfer-learning setup, using BERTimbau embeddings and an XGBoost classifier. Our analysis shows that, although the datasets share a broadly similar macro-level focus, they diverge substantially in how specific terms are used and labeled across platforms and topics. Results indicate that lexical breadth and annotation practices strongly predict transferability: datasets with broader and more heterogeneous toxic vocabulary yield better cross-domain performance, whereas resources with narrow, profanity-centered labeling lead to severe generalization gaps, even when lexical overlap is high. These findings underscore the impact of collection and labeling strategies on the curation and evaluation of Portuguese hate speech datasets. Warning! This work and the referenced datasets contain examples of offensive and hateful language.
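Cross-dataset lexical overlap, one ingredient of the transferability analysis above, can be approximated with a vocabulary Jaccard index (a simple illustration; the paper's profiling relies on HurtLex categories rather than raw type overlap):

```python
import re

def vocab(texts):
    """Lowercased word-type set for a list of documents."""
    return {w for t in texts for w in re.findall(r"\w+", t.lower())}

def jaccard(texts_a, texts_b):
    """Jaccard index of the two corpora's vocabularies (0.0 if both empty)."""
    a, b = vocab(texts_a), vocab(texts_b)
    return len(a & b) / len(a | b) if a | b else 0.0
```

As the abstract notes, a high value of such an overlap measure does not by itself guarantee transferability when labeling practices differ.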
Sintomas Linguísticos: Geração Aumentada por Recuperação e Raciocínio em LLMs sob a Variação Português-Inglês em Contextos Médicos
Guilherme Vianna de Moura | Gabriel Assis | Aline Paes
Large Language Models (LLMs) have shown impressive performance on medical reasoning tasks. However, their robustness to linguistic variation remains underexplored, especially in languages other than English, such as Portuguese. In this work, we investigate how the input language affects the performance and reasoning behavior of medical LLMs, and whether Retrieval-Augmented Generation (RAG) can mitigate limitations arising from such variation. To this end, we run experiments in Portuguese and English using two variants of the MedGemma model, with 4B and 27B parameters, evaluated on three medical datasets. The evaluation combines quantitative accuracy metrics with qualitative and structural analyses of the reasoning chains and answers generated by the models. The results indicate that linguistic variation has a stronger impact on smaller models: in particular, the 4B-parameter variant performs consistently worse when inputs are given in Portuguese. In contrast, the 27B-parameter variant is more robust across languages, maintaining similar levels of accuracy and reasoning structure in both Portuguese and English. Although the implemented RAG system retrieves documents of good quality, its integration does not yield consistent gains for the smaller model, suggesting limitations in the effective exploitation of the additional context. Overall, this work contributes to understanding the current limits of medical LLMs in multilingual settings, highlighting the challenges associated with performance in lower-resource languages.
BIPA: Brazilian Portuguese Phonetic Dataset with Dialectal Variations in IPA Standard
Thiago Monteles de Sousa | Lucas Rafael Gris | Nádia Félix Felipe da Silva
This work presents BIPA, a phonetic transcription corpus for Brazilian Portuguese that covers regional dialectal variations. The corpus was constructed through automated extraction from Wiktionary, resulting in 53,353 unique words and 350,021 transcriptions in IPA format, distributed across six dialects: general Brazilian, Rio de Janeiro, São Paulo, South Region, Northeast Region, and Center-West Region. The average density of 6.56 transcriptions per word reflects multiple regionally conditioned phonetic variations. To validate the utility of the corpus, the ByT5-small model was fine-tuned for grapheme-to-phoneme conversion, achieving a Minimum Phoneme Error Rate of 2.66% on the validation set. BIPA addresses the scarcity of computational linguistic resources for Brazilian Portuguese, enabling applications in regional speech synthesis, automatic accent recognition, and computational sociolinguistic analysis.
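The reported density of 6.56 transcriptions per word is simply the ratio of total transcriptions to unique words. Computed from a word-to-transcriptions mapping, it looks like the sketch below (the tiny lexicon is hypothetical; only the 350,021/53,353 figures come from the abstract):

```python
def transcription_density(lexicon):
    """Average number of IPA transcriptions per unique word.

    lexicon: dict mapping a word to its list of dialectal transcriptions.
    """
    total = sum(len(ts) for ts in lexicon.values())
    return total / len(lexicon)

# BIPA's reported figures: 350,021 transcriptions over 53,353 unique words.
density = 350021 / 53353  # rounds to the reported 6.56
```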
Avaliação Automática de Redações do Enem: Um Estudo Empírico sobre Representações Linguísticas e Contextuais
Gabriel Gonçalves de Matos | Valéria D. Feltrim
Automated Essay Scoring (AES) for Brazilian Portuguese remains a challenging task, particularly in the context of the Enem exam, in which text quality is assessed across multiple competencies and the scores are ordinal in nature. In this paper, we investigate hybrid modeling strategies for competency-level AES, combining explicit linguistic features with contextual representations. Using the Enem-AES corpus, the assessment of each competency was modeled as an ordinal prediction problem through the CORAL framework. We performed a controlled empirical comparison between traditional lexical representations, a broad set of linguistic metrics extracted with the NILC-Metrix system, task-oriented handcrafted features, contextual embeddings, and combinations of these representations. The results show that hybrid models achieve the highest average agreement with human scores, although performance varies across competencies and depends on the type of representation used. We also analyzed model behavior in scenarios of disagreement between human raters, which highlighted the impact of annotation variability on model performance. Overall, the results provide evidence that combining linguistic indicators with contextual embeddings is a promising strategy for AES in the Enem context.
Analysing LLMs for spelling normalization of 18th century Portuguese
Helena Freire Cameron | Aline Paes | Fernanda Olival | Renata Vieira
This paper presents an evaluation of large language models (LLMs) applied to the task of normalizing eighteenth-century written texts. Several LLMs were employed to process texts in pre-contemporary spellings and update them according to contemporary Portuguese orthography. Their outputs were rigorously compared against a curated reference corpus. The findings indicate marked disparities in model performance, with the Portuguese-specialized model Sabiá demonstrating a statistically significant advantage over multilingual alternatives.
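Evaluating a normalization system of this kind typically means comparing model output against the curated reference with an edit-distance metric such as character error rate (CER). A stdlib sketch (illustrative; the example word pair is made up, not drawn from the corpus):

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert/delete/substitute),
    computed row by row to keep memory linear in len(b)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def cer(hypothesis, reference):
    """Character error rate of a normalized output vs. the reference."""
    return levenshtein(hypothesis, reference) / max(len(reference), 1)

# Pre-contemporary "escripto" normalized to contemporary "escrito":
# one deletion, so CER = 1/7 against the reference.
```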
Detecting Stuttering with Artificial Intelligence: A Hybrid Method for Brazilian Portuguese
Rubens Perini Buzzeti | Paula Bianca Meireles de Moura Buzzeti | Roney Lira de Sales Santos
Speech-language assessment of stuttering is traditionally manual, subjective, and time-consuming. This paper presents the development of software for automatic detection and classification of stuttering-related disfluencies in Brazilian Portuguese, aiming to support clinical assessment. The system follows a two-stage hybrid approach. In the first stage, it applies deterministic algorithms based on automatic speech recognition (ASR) and temporal information to identify simple disfluencies, such as repetitions and pauses. In the second stage, it employs a hierarchical architecture combining a Kohonen network (Self-Organizing Map, SOM) and a Multilayer Perceptron (MLP) to classify complex disfluencies, specifically blocks and prolongations, using acoustic features. Because no publicly available annotated resources exist for this task in Brazilian Portuguese, we built an initial dataset annotated by specialists. The system achieved 89.5% accuracy in classifying complex disfluencies, with a Matthews Correlation Coefficient (MCC) of 0.812. These results indicate the feasibility of the tool as decision support for clinical assessment and establish a baseline for future research.
Levados em Consideração: Uma Avaliação de Vieses de Estima por Raça, Gênero e Região em Grandes Modelos de Linguagem em Português Brasileiro
João Lucas Lima de Melo | Marlo Souza
This work proposes the identification of social biases in Portuguese in the GPT-4o, GPT-4o-mini, Sabiá-3, and Sabiázinho-3 models, using an esteem metric to assess the level of respect and deference the models show toward different demographic groups. The evaluation covers subjects with explicit social markers of gender, race, and Brazilian region, under conditions with and without a technique for bypassing moderation restrictions (jailbreaking). The findings show that the evaluated language models reproduce systematic patterns of differential valuation across social groups, revealing esteem biases associated with markers of gender, race, and region in Brazilian Portuguese. Subjects with emphasized social markers, especially racial ones, tend to receive lower esteem. The use of the jailbreaking technique did not have a uniform impact, and could either widen or narrow the differences in esteem.
Global vs. Local Sentence Embeddings for Brazilian Portuguese: Revisiting Monolingual Models in the Age of Foundation Models
Matheus Peixoto | Guilherme Silva | Giacomo Figueredo | Pedro Silva | Eduardo J. S. Luz
The choice between large-scale, multilingual, foundation models and specialized monolingual models for languages like Brazilian Portuguese (PT-BR) presents a complex trade-off between generalization and specialization. This paper investigates this trade-off through an empirical study across a diverse suite of tasks. We evaluate multiple families of language models under both linear probing and fine-tuning regimes. We find that monolingual encoders exhibit greater "adaptation plasticity" during fine-tuning, improving on both classification and semantic similarity, whereas global (multilingual) models degrade. However, this plasticity comes at a cost: our tokenization analysis suggests that monolingual models struggle with foreign terms, whereas modern multilingual tokenizers show surprising morphological competence, challenging a long-standing assumption in the field. We conclude that the optimal model choice is a task-dependent trade-off between vocabulary coverage and adaptation flexibility.
LexIris-pt and LexBert-pt: Specialized Sentence Embeddings for Legal Similarity in Brazilian Portuguese
Willgnner Ferreira Santos | João Gabriel Grandotto Viana | Antônio Pires de Castro Júnior | Fernando Ribeiro Trindade | Nádia Félix Felipe da Silva
This work presents and evaluates two specialized sentence embedding models for the Portuguese legal domain, LexIris-pt and LexBert-pt, obtained through supervised fine-tuning of BERT-based models using pairs of initial petitions. We propose a comparative evaluation protocol along three fronts: (i) zero-shot inference with pretrained embeddings, (ii) supervised fine-tuning on these pairs, and (iii) vector retrieval with incremental clustering over a corpus of 20,000 initial petitions. The results show that fine-tuning consistently increases correlations with reference scores and improves performance in vector retrieval; additionally, the vector retrieval stage indicates that the metric configured in the index (cosine similarity or inner product) can change the granularity of the partitioning under a fixed threshold, reinforcing the need for joint calibration among the encoder, metric and threshold. After auditing by specialists from the partner institution, LexIris-pt and LexBert-pt were operationally adopted to support the screening and organization of repetitive claims and predatory litigation.
The Inadequacy of Automatic Evaluation Metrics in Question Answering: A Case-Study in Portuguese
Júlia da Rocha Junqueira | Viviane P. Moreira
Questions and answers are among the most fundamental forms of human communication. Question Answering (QA) is the task of correctly generating answers based on a context. To assess the success of the task, the answers are typically evaluated using traditional metrics such as BLEU, ROUGE, and METEOR. However, these metrics often fail to reflect the actual quality of the outputs. More recently, new evaluation metrics and the LLM-as-a-judge paradigm have also been applied to the evaluation of QA. To gain a deeper understanding of the capabilities and limitations of QA metrics, this work performs a comparative analysis of both traditional and more recent approaches for QA evaluation. Experiments were conducted on the Pirá dataset (in Portuguese) using four LLMs to generate answers. Additionally, human evaluation was performed to assess aspects such as correctness, completeness, clarity, and relevance of the generated content. We demonstrate that lexical metrics are limited in evaluating QA. We also observed that human evaluators favor models that provide higher information density, even when this contradicts prompt constraints, whereas lexical metrics penalize this verbosity. This divergence confirms that traditional metrics are insufficient for capturing the trade-off between instruction adherence and the semantic richness valued by native speakers.
Optimizing Efficiency in Multi-Stage Semantic Re-ranking Architectures
Artur M. A. Novais | Anna P. V. L. B. Moreira | Maria C. X. de Almeida | João P. C. Presa | Fernando M. Federson | Sávio S. T. de Oliveira
Semantic re-ranking architectures based on cross-encoders are essential for high-precision Information Retrieval (IR) in the legal domain, but they face a dilemma: their high computational latency renders large-scale applications challenging, particularly in resource-constrained environments. Traditional single-stage approaches force a choice between computational efficiency and ranking quality. This work presents an empirical evaluation of established cascade re-ranking architectures to optimize this balance through the adaptive application of off-the-shelf models of increasing complexity over progressively smaller sets of candidates. We validated the architecture on a corpus of 300,000 legal documents in Portuguese from the Court of Accounts of the State of Goiás (TCE-GO). Experiments demonstrate a 60.3% reduction in latency (from 11.75s to 4.66s per query) compared to the most precise single-stage baseline, with a marginal degradation of only 2 p.p. in R@avg and 0.0224 in MRR@avg. The results validate the semantic funnel as a computationally viable solution for semantic document-to-document search within the specific context of the TCE-GO repository, establishing a baseline for future transferability studies in broader Portuguese legal contexts.
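The cascade described above applies increasingly expensive scorers to progressively smaller candidate pools. A minimal sketch of such a semantic funnel, with toy scorers standing in for the off-the-shelf models (the function names and stage choices are ours, not the paper's):

```python
def cascade_rerank(query, candidates, stages):
    """Semantic funnel: apply increasingly expensive scorers to a
    progressively smaller candidate pool.

    `stages` is a list of (score_fn, keep_n) pairs, ordered from the
    cheapest/widest stage to the costliest/narrowest one."""
    pool = list(candidates)
    for score_fn, keep_n in stages:
        # Python's sort is stable, so ties keep their earlier order.
        pool.sort(key=lambda doc: score_fn(query, doc), reverse=True)
        pool = pool[:keep_n]
    return pool

# Toy scorers standing in for, e.g., a lexical retriever and a cross-encoder.
def cheap_overlap(query, doc):
    return len(set(query.split()) & set(doc.split()))

def exact_match(query, doc):
    return 1 if doc == query else 0

docs = ["a b", "a b c", "c", "a"]
top = cascade_rerank("a b c", docs, [(cheap_overlap, 3), (exact_match, 1)])
```

Because each later stage only scores the survivors of the earlier one, total latency is dominated by the cheap first pass rather than the expensive final model.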
Marcação de correferência para a caracterização de personagens em obras literárias em português
Diana Santos | Luisa Mara Silva Lima | Emanoel Pires
In this paper we describe the addition of coreference annotation to public literary corpora, for the task of character characterization in distant reading. We begin by motivating this task within computational literary studies; we explain how we made the task readable and revisable for scholars in Literary Studies by transferring it to BRAT; we describe the first results and a small publicly available annotated corpus; and we discuss the creation of two coreference modules.
Diálogos Tóxicos: Gatilhos e Padrões de Interação no Reddit Brasileiro
Giovana Piorino | Marco Antônio de Alcântara Machado | Luiz Henrique Quevedo Lima | Adriana Pagano | Ana Paula Couto da Silva
In this paper we analyze the structural and linguistic dynamics of online toxicity in Reddit discussion trees, focusing on how trigger comments escalate conflicts in Brazilian Portuguese. Using a fine-tuned BERTAbaporu model, we show that toxic discussions are deeper, more engaging, and initially semantically cohesive, but degrade over time, while non-toxic interactions emphasize social bonding. Our findings contribute to a better understanding of toxicity escalation and support early detection of discursive conflicts.
Contrastive and Adversarial Disentanglement for Speaker Representations in Brazilian Portuguese
Ariadne Nascimento Matos | Arnaldo Candido Junior | Moacir Antonelli Ponti
In this work, we study disentanglement between speaker and environment by combining an adversarial framework with contrastive learning objectives. We investigate supervised contrastive learning (SupCon), which exploits environment labels to structure the environment subspace, and self-supervised SimCLR, which learns invariance from augmented views. Experiments on a controlled synthetic dataset (ST1) and a more realistic corpus (CML-TTS) show that SupCon yields the most discriminative and stable speaker embeddings on ST1, achieving the best verification performance (EER=4.70%, MinDCF=0.24). Overall, our findings emphasize (i) the importance of synthetic benchmarks for diagnosing disentanglement under controlled factor variation and (ii) the effectiveness of combining contrastive and adversarial objectives to encourage speaker representations that are both discriminative and less sensitive to environmental factors.
Can I guess where you are from? Modeling dialectal morphosyntactic similarities in Brazilian Portuguese
Manoel Siqueira | Raquel Freitag
This paper investigates morphosyntactic covariation in Brazilian Portuguese (BP) to assess whether dialectal origin can be inferred from the combined behavior of linguistic variables. Focusing on four grammatical phenomena related to second-person pronouns, correlation and clustering methods are applied to model covariation and dialectal distribution. The results indicate that correlation captures only limited pairwise associations, whereas clustering reveals speaker groupings that reflect regional dialectal patterns. Despite the methodological constraints imposed by differences in sample size requirements between sociolinguistic and computational approaches, the study highlights the importance of interdisciplinary research: the benefits of developing fair and inclusive language technologies that respect dialectal diversity outweigh the challenges of integrating these fields.
A Lexicon-Grammar of Brazilian Portuguese Predicative Adjectives
Ryan Martinez | Jorge Baptista | Oto Araújo Vale
This paper presents a syntactic lexicon of Brazilian Portuguese predicative adjectives that are not regularly derived from verbs. From the 7,000 most frequent adjectives in a large web corpus, 3,161 lexical items were selected and annotated with 36 syntactic properties. These properties were established through introspection and corpus evidence, covering argument structure, copular verbs, prepositions, transformations (e.g., raising, nominalization), semantic roles, and others. The resulting resource constitutes a machine-readable lexicon of predicative adjectives for Brazilian Portuguese.
NLP-based Page Classification for Efficient LLM Extraction from Brazilian Public Tender Documents
Pedro Campos | Ivo de Medeiros | Adailton de Araújo
Extracting structured information from lengthy documents using Large Language Models (LLMs) is computationally expensive and prone to accuracy degradation as input size increases. We present a two-stage pipeline for extracting products from Brazilian tender documents (editais de licitação), combining NLP-based page classification with LLM extraction. We construct a novel dataset of 11,190 annotated pages from 350 documents across five product domains. Our experiments compare transformer-based classifiers (BERTimbau, DistilBERT) with classical machine learning approaches using engineered features. Results show that XGBoost with domain-specific features achieves 97.75% F1-score, outperforming fine-tuned BERT models by over 4 percentage points. The complete pipeline reduces LLM input tokens by 64-88% while maintaining extraction completeness, enabling cost-effective document processing at scale.
Lexical and Orthographic Variation in Portuguese Financial Tweets: Annotation, Analysis, and Implications for Embedding Models
Ariani Di Felippo | Norton Trevisan Roman | Bryan K. S. Barbosa | Gabriela Pinheiro de Oliveira | Clarissa Lenina Scandarolli
Twitter/X remains a key source of user-generated content, requiring Natural Language Processing tools capable of handling non-canonical language. This study presents a manual annotation of lexical and orthographic phenomena in DANTEStocks, a corpus of financial tweets in Brazilian Portuguese, using a hierarchical typology to capture both creative uses and deviations from the standard norm. Results show that orthographic variation is strongly influenced by creative forms, mainly driven by platform- and domain-specific innovations. Standard norm variation is systematic, mostly involving predictable omissions of diacritics and the cedilla, and most tokens exhibit only one phenomenon, reflecting stable and largely independent patterns of variation in this Twitter subgenre. The identified variant forms enabled the construction of a lexicon for evaluating embedding models. We assessed how BERTimbau, Word2Vec, and FastText handle lexical variation in raw, unnormalized data, showing that the lexicon reduces out-of-vocabulary rates and improves coverage. These results highlight model robustness and the value of curated lexical resources in complementing both fixed and data-driven vocabularies.
Síntese de Voz Emocional Multi-Idioma para Português Brasileiro: Uma Análise Comparativa de Abordagens de Ajuste Fino
Daniel Oliveira de Brito | Sidney Evaldo Leal | Arnaldo Candido Junior
Multilingual emotional speech synthesis for Brazilian Portuguese remains little explored. This work investigates different approaches to incorporating emotional control into Portuguese–English multilingual synthesis, comparing five variants: the base YourTTS model, fine-tuning with emotional data, conditioning via textual tokens, and the VECL-TTS architecture with emotional embeddings under different configurations. We used emotional datasets in English (RAVDESS, Emotional Speech Dataset) and Brazilian Portuguese (VERBO), totaling 14.4 hours, for fine-tuning from the pretrained YourTTS model. Evaluation combined objective metrics (emotional and speaker embedding similarity) with subjective assessment by ten participants. The results reveal that architecturally simple approaches can achieve perceptual performance comparable or superior to that of more complex methods: fine-tuned YourTTS obtained the best overall quality, token-based conditioning achieved the highest perceived emotional similarity, and VECL-TTS maximized objective emotional control at the cost of degraded quality and speaker similarity. We also observed a competition between emotional control and preservation of vocal identity, as well as discrepancies between objective metrics and human perception. This work demonstrates the feasibility of multilingual emotional transfer to Brazilian Portuguese via fine-tuning with limited resources.
CoDEl-BR: An Electoral Debate Corpus from Brazilian Municipal Elections
Alessandra Gomes | Aline Paes | Helena Caseli
Electoral debates are influential moments in public discourse, providing candidates with a high-visibility platform to present their proposals, contrast their positions, and engage in exchanges that shape voter decisions. In Brazil, these debates reach a broad and diverse audience, reflecting regional, social, and ideological variations that affect linguistic choices and thematic content. This paper presents CoDEl-BR (Corpus de Debates Eleitorais, in Portuguese), a corpus of transcripts from 22 second-round mayoral debates held in 13 Brazilian state capitals during the 2024 municipal elections. It comprises 2,943 transcript segments totaling approximately 32 hours. Exploratory analyses reveal differences in thematic priorities between candidates and voters’ questions, as well as variations by race and party affiliation. The corpus aims to enable research in discourse and argumentation analysis, stance and sentiment detection, polarization modeling, and other related NLP tasks. We demonstrate that this initial release provides a curated, high-quality subset of debates with significant potential for expansion.
Causal_QA.PT: A Human–LLM Co-Curated Benchmark for Causal Question Answering in Portuguese Language
Lia Furtado | Cíntia Araripe | Jocelani Castilhos | Lucas Holanda | Vladia Pinheiro
We present Causal_QA.PT, a human–LLM co-curated benchmark for causal question answering in Portuguese, addressing the lack of high-quality evaluation resources for causal reasoning in non-English languages. The dataset is developed through a hybrid human–LLM process with targeted generation, validation, and evaluation procedures, and is organized according to the PEARL causal typology. Using this resource, we evaluate the ability of Large Language Models to answer causal questions in Portuguese and examine the role of explicitly providing causal class information in prompt design. Our findings show that current LLMs are capable of producing high-quality causal responses in Portuguese, with GPT-5 Mini in particular demonstrating strong performance in judgment-based evaluation. Explicit causal class information yields model- and question-dependent benefits, particularly for interventional and counterfactual questions. Finally, we observe that human reference answers are not always superior, underscoring the importance of careful benchmark curation and robust evaluation for underrepresented languages.
ConsumerBR: A Large-Scale Corpus of Consumer Complaints in Brazilian Portuguese
Luis A. Duarte | Pedro Giacomin | Vitória Bispo | Mariana O. Silva | Adriano C. M. Pereira | Gisele L. Pappa
We present ConsumerBR, a large-scale corpus of consumer complaints and company responses in Brazilian Portuguese, compiled from publicly available data on the Consumidor.gov.br platform. The corpus comprises over 3.1 million consumer–company interactions collected between 2021 and 2025 and combines anonymized textual content with rich structured metadata, including temporal information, complaint outcomes, and consumer satisfaction indicators. We describe a data collection strategy tailored to the platform’s dynamic interface, a preprocessing pipeline that includes response clustering to identify template-based replies, and a hybrid anonymization approach designed to mitigate privacy risks. We also provide a detailed statistical characterization of the corpus, highlighting its scale, coverage, and distributional properties. ConsumerBR is publicly available for research purposes and supports a wide range of applications, including complaint analysis, sentiment modeling, dialogue and response generation, and preference-based evaluation.
Geração de consultas SPARQL a partir de linguagem natural
Heber Gustavo Xavier de Castro | Clever Ricardo Guareis de Farias
The Semantic Web aims to make web data understandable not only to humans but also to machines, enabling more efficient data integration, sharing, and reuse. Linked Open Data (LOD) initiatives have supported this vision by promoting the publication of semantically annotated and interconnected data. However, querying LOD repositories typically requires knowledge of SPARQL, a complex query language that limits access for non-expert users. Although several approaches have been proposed to automatically generate SPARQL queries from natural-language questions, most are designed for English and are tightly coupled to specific domains, which hinders reuse. This article presents a generic, domain-independent approach for generating SPARQL queries from questions written in Portuguese. The proposed method uses reference questions, parameterized query templates, and a synonym dictionary enriched by lexical resources and similarity metrics. The implementation is supported by the Natural2SPARQL tool, and the approach is validated through a case study in the financial domain using real data from the Brazilian stock exchange (B3). The results indicate that the method enables flexible and semantically accurate SPARQL query generation from natural-language input. Unlike learning-based approaches, our method avoids retraining and achieves up to 93.3% end-to-end success in controlled settings, demonstrating robustness and low adaptation cost.
Retrato_Cantado: Criação e Análise de um Corpus para Representações de Identidade em Letras de Músicas Brasileiras
Vitória P. Firmino | Janaina N. de S. Lopes | Bruno M. Nogueira | Valéria Q. dos Reis
This paper presents the development of Retrato_Cantado, a dataset of sentences extracted from Brazilian song lyrics and manually annotated to identify and categorize predicative constructions that describe individuals. The corpus findings validate the effectiveness of lexical-syntactic patterns for identifying predicative sentences, confirming their suitability for large-scale linguistic annotation tasks. The dataset also serves as a valuable resource for the analysis of textual discourse and the representation of social groups in Brazilian culture. We additionally trained a person-characterization classifier to illustrate the applicability of the dataset to the automatic detection of predicative descriptions, which achieved high accuracy and highlights the potential for creating more specialized models capable of detecting physical and sociocognitive categories, as well as performing sentiment polarity analysis.
ALBA: A European Portuguese Benchmark for Evaluating Language and Linguistic Dimensions in Generative LLMs
Inês Vieira | Inês Calvo | Iago Paulo | James Furtado | Rafael Ferreira | Diogo Tavares | Diogo Glória-Silva | David Semedo | João Magalhães
As Large Language Models (LLMs) expand across multilingual domains, evaluating their performance in under-represented languages becomes increasingly important. European Portuguese (pt-PT) is particularly affected, as existing training data and benchmarks are mainly in Brazilian Portuguese (pt-BR). To address this, we introduce ALBA, a linguistically grounded benchmark designed from the ground up to assess LLM proficiency in language-related tasks in pt-PT across eight linguistic dimensions: Language Variety, Culture-bound Semantics, Discourse Analysis, Word Plays, Syntax, Morphology, Lexicology, and Phonetics and Phonology. ALBA is manually constructed by language experts and paired with an LLM-as-a-judge framework for scalable evaluation of pt-PT generated language. Experiments on a diverse set of models reveal performance variability across linguistic dimensions, highlighting the need for comprehensive, variety-sensitive benchmarks that support further development of tools in pt-PT.
This paper investigates automatic multi-label classification of Indigenous letters to Brazil into thematic categories. Drawing on the digital archive "Cartas Indígenas ao Brasil", which constitutes a corpus of 871 letters annotated with 18 categories, we compare three classification approaches: a lexical model (TF-IDF + logistic regression), a contextual model (BERTimbau-base), and a classifier based on large language models (LLMs). To handle the corpus imbalance, we employed class-balancing strategies in the neural model. The results reveal a trade-off between precision and recall: the lexical baseline achieves higher precision (0.65), while BERTimbau shows higher recall (0.67), especially in minority categories. Both reach a macro-F1 of 0.42, showing that multi-label classification in this domain is a challenging task, in particular due to corpus imbalance and semantic overlap between categories. The LLM-based classifier attains high recall, especially in minority categories, but tends to overestimate the number of labels per document, reinforcing the precision–coverage trade-off observed in the other two approaches. A detailed per-class analysis reveals complementary behaviors across the models, suggesting that hybrid approaches may overcome the individual limitations of each method. The corpus and the experiment scripts will be made publicly available.
Identificação de notícias falsas em português: um olhar sobre a generalização de modelos
Raphael Guedes | Bruno Barros | Hugo do Nascimento
The spread of disinformation in digital media calls for robust detection mechanisms, a task on which language models perform satisfactorily. However, the literature contains analyses that disregard the degradation of models' generalization ability on real-world data different from that used for training or fine-tuning. This work investigates the behavior of the BERTimbau and mBERT models in cross-dataset generalization scenarios (test data different from the training and validation data). To this end, fine-tuning was performed using four Brazilian corpora (Fake.br, Fakepedia, FakeRecogna, and FakeTrueBR). The results confirm the hypothesis that within-dataset evaluations yield high performance, while cross-dataset evaluations yield low performance and strong degradation in generalization, even though the goal of fake news identification remains the same. Regarding predictive ability, BERTimbau proved slightly better on average, with 71% accuracy and 67% F1-score, against 69% and 64%, respectively, for mBERT.
Agent Orchestration - LLM for Legal Metadata Extraction: A Comparative Analysis of Efficiency and Precision
Luiz Anísio Batitucci | Luciane Inácia Lopes | Rhodie Ferreira | Emerson Cabrera Paraiso
This work introduces and evaluates JAMEX (Judicial Multi-Agent Metadata Extraction), a multi-agent pipeline for extracting structured metadata from Brazilian court decisions (Espelho do Acórdão), and compares it against a strong single-prompt baseline under an Information Retrieval-only (IR-only) setting. We first ran a pilot on 300 decisions and then reran the experiment on a stratified dataset of n=1,225; completion rates varied across executions, yielding between 779 and 1,216 successfully completed instances, with non-completion concentrated in agentic configurations. Across re-executions, the accuracy impact of agents was strategy-dependent: GPT-5 improves over the baseline in multiple agentic strategies but not across all orchestration variants, while smaller models (Gemma3-12B/Gemma3-27B) show no robust gains. Orchestration refinements motivated by the agent design literature (memory, planning, and directed review) improved traceability, but performance remained sensitive to task decomposition and context splitting. Overall, JAMEX increases token usage and operational complexity, so deployment must balance accuracy, completion reliability, and cost for Portuguese legal metadata extraction.
Avaliação Automática de Redações do Enem: Uma Análise Comparativa entre Engenharia de Características e Transformers
Pâmela Camilo Chalegre | Vitor da Rocha Machado | Valéria Delisandra Feltrim
Automatic Essay Scoring (AES) is a central challenge in large-scale educational assessment, such as the Brazilian National High School Exam (Enem), in which essays are evaluated across multiple competencies. This work presents a comparative analysis of textual representations for competency-level AES in Brazilian Portuguese. We evaluated feature-based models using TF-IDF, linguistic metrics extracted with NILC-Metrix, and a hybrid combination of both, as well as transformer-based models. Experiments were conducted on the Enem-AES corpus, considering both classification and regression formulations. The results indicate that regression formulations are, in general, more suitable than multiclass classification, as they better accommodate the ordinal structure of the scores. Transformer-based models achieved higher agreement on competencies related to language use and textual cohesion, while feature-based representations showed comparable performance on competencies associated with thematic relevance. Despite achieving high accuracy under the Enem tolerance criterion, all approaches struggled to predict extreme scores, mainly due to corpus imbalance. We therefore conclude that the methodologies are complementary and that hybrid systems are promising for AES.
Avaliação End-to-End de um Sistema RAG para Documentos Hospitalares em Português
Murilo Vargas da Cunha | Marília Rosa Silveira | César Brasil Sperb | Brenda Salenave Santana | Larissa Astrogildo Freitas | Ulisses Brisolara Corrêa
This paper evaluates an end-to-end Retrieval-Augmented Generation (RAG) system for querying regulatory hospital documents in Portuguese. The study analyzes the impact of optimizing each component (retrieval, re-ranking, and generation) in a resource-constrained setting. The methodology combined the creation of a hybrid dataset (synthetic and expert-validated) with quantitative evaluations using metrics such as MRR, NDCG@10, and BERTScore. The results show that the intfloat/multilingual-e5-small embedding model was the most robust, with a retrieval failure rate of only 1.4%. In the re-ranking stage, the RRF method stood out for its balance between computational cost and performance. We conclude that the optimized architecture, integrating these components with the Gemini 2.5 Flash generator, offers an efficient and accurate solution for decision support in hospital environments.
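The RRF (Reciprocal Rank Fusion) re-ranking mentioned in the abstract above fuses ranked lists from multiple retrievers by summing reciprocal ranks. A minimal sketch, assuming the standard formulation with the conventional constant k=60 (not a value taken from the paper):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: each document scores sum(1 / (k + rank))
    over the ranked lists it appears in; higher totals rank first."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only needs rank positions, it can fuse heterogeneous retrievers (lexical and dense) without score calibration, which helps explain its favorable cost/performance balance.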
Extending an Ensemble Baseline with Corpus-Based Graph Features for Portuguese Pun Detection
Avelar Rodrigues de Sousa | Camilla Soares Sousa | Carlos Henrique Santos Barros | Rafael Torres Anchiêta
Automatic pun detection remains challenging because it depends on lexical ambiguity and contextual interaction, which are not explicitly captured by linear text representations. In Portuguese, TF-IDF-based ensemble methods provide competitive and interpretable baselines, but remain limited by surface-level features. This work investigates whether corpus-based graph information can complement such methods. Three graph representations are constructed from the Puntuguese corpus: a Co-occurrence graph, a PPMI-weighted graph, and a Pun-Context graph. In the current pipeline, each graph is converted into low-dimensional node embeddings with TruncatedSVD, which are then aggregated into document-level features and concatenated with TF-IDF representations in a soft-voting ensemble. Experimental results on the test set show that graph-based enrichment does not uniformly improve performance: Pun-Context and PPMI yield the strongest graph-augmented results, whereas combining all graphs degrades performance. These findings indicate that the usefulness of graph-based information depends strongly on how lexical relations are encoded and aggregated at the document level.
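As a toy illustration of the PPMI weighting behind one of the graph representations above, the sketch below computes positive PMI weights from document-level co-occurrence counts. The paper's actual graph construction over the Puntuguese corpus (and the subsequent TruncatedSVD embedding step) is more elaborate; all names here are illustrative:

```python
import math
from collections import Counter
from itertools import combinations

def ppmi_weights(docs):
    """Positive PMI edge weights for word pairs co-occurring in a document.

    Co-occurrence is counted at the document level for brevity; a windowed
    count would be a natural refinement.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for doc in docs:
        tokens = set(doc.lower().split())
        word_counts.update(tokens)
        pair_counts.update(frozenset(p) for p in combinations(sorted(tokens), 2))
    total_pairs = sum(pair_counts.values())
    total_words = sum(word_counts.values())
    weights = {}
    for pair, n in pair_counts.items():
        a, b = tuple(pair)
        p_ab = n / total_pairs
        p_a = word_counts[a] / total_words
        p_b = word_counts[b] / total_words
        weights[pair] = max(math.log2(p_ab / (p_a * p_b)), 0.0)  # clip to PPMI
    return weights
```

Edges whose PMI is clipped to zero can simply be omitted when building the graph.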
RacismoBR: A Manually Annotated Dataset for Racist Discourse Detection in Brazilian Portuguese
João Vítor Vaz | Fabrício Benevenuto | Marcos André Gonçalves
Racist discourse on social media appears both through explicit attacks and subtle, context-dependent forms, remaining a challenge for Natural Language Processing. We introduce RacismoBR, a culturally grounded dataset for detecting racist discourse in Brazilian Portuguese, manually annotated exclusively by Black researchers to ensure sociolinguistic validity and epistemic representativeness. We conduct a controlled evaluation of binary racism classification on our dataset considering several classification modeling paradigms: classical machine learning, supervised Transformer-based (Small) Language Models, and Large Language Models under in-context, few-shot learning. Results show that GPT-4.1 and BERTimbau yield the highest Macro-F1 scores; however, Wilcoxon signed-rank tests reveal no statistically significant differences across models, mostly due to high variability. Across paradigms, classifiers consistently display higher precision for non-racist content and higher recall for racist content. A qualitative analysis highlights persistent difficulties with implicit, euphemized, and context-dependent racism. These findings indicate that culturally grounded annotation plays a more decisive role than architectural sophistication alone in advancing racism detection.
Unsupervised Evaluation of Explanations for Hate Speech Classification in Portuguese
Isabel Carvalho | Hugo Gonçalo Oliveira | Catarina Silva
Top-performing Artificial Intelligence models often operate as black boxes. Explainable AI (XAI) can increase transparency, but its evaluation is currently hindered by a lack of annotated explanation data and agreed-upon validation standards. We propose a framework for evaluating the faithfulness of explanations in Portuguese hate speech detection. Our approach is based on the premise that a faithful explanation should identify features whose removal degrades a model’s performance. We follow a three-step process: (i) prediction on the original input; (ii) identification and removal of explanatory keywords; and (iii) prediction on the modified input, with performance differences used as an evaluation signal. We conduct experiments using ensemble classifiers, multiple keyword selection strategies, and SHAP and LIME as XAI methods. In addition, Large Language Models (LLMs) are explored both as classifiers and as explainers. Results demonstrate that removing explanatory keywords degrades model performance more than random word removal, indicating explanation faithfulness. Notably, SHAP and LIME consistently provided more faithful explanations than LLM-generated or manual alternatives, although the impact depends on the keyword selection strategy. These findings highlight the importance of standardised, unsupervised evaluation protocols for XAI and the faithfulness limitations of current generative LLM explanations.
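The three-step protocol above can be sketched as follows; `classifier` and `explain_keywords` are hypothetical stand-ins for the trained hate-speech model and the SHAP/LIME explainer, and the flip rate is just one way of turning performance differences into an evaluation signal:

```python
def faithfulness_gap(classifier, texts, explain_keywords, n_remove=3):
    """Fraction of predictions that flip once top explanatory words are removed.

    classifier: maps a text to a label.
    explain_keywords: maps a text to a ranked list of explanatory words.
    """
    flips = 0
    for text in texts:
        original = classifier(text)                    # (i) original prediction
        drop = set(explain_keywords(text)[:n_remove])  # (ii) top explanatory words
        reduced = " ".join(w for w in text.split() if w.lower() not in drop)
        if classifier(reduced) != original:            # (iii) re-predict, compare
            flips += 1
    return flips / len(texts)
```

A faithful explainer should yield a larger gap than removing the same number of random words, which mirrors the random-removal control used in the paper.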
Neuro-symbolic Approaches for Rubric-Based Automatic Essay Evaluation of ENEM Essays
Igor Cataneo Silveira | Denis Deratani Mauá
Trait-specific automated scoring of essays written for the standardized Brazilian National Entrance Exam (ENEM) has received significant attention in recent years. The task is important both in a classroom setting, to provide timely and personalized learning feedback, and in the official exam, to make the scoring process more scalable and consistent. State-of-the-art systems approach the task as a purely statistical predictive problem, ignoring the knowledge provided to human graders and test takers in the form of rubrics and guidelines. Aiming to produce more interpretable and informative formative feedback, in this work we leverage the official ENEM Grader’s handbook and develop two neuro-symbolic approaches to trait-specific essay scoring. The first approach uses a Large Language Model (GPT-4o) to write an evaluative explanation of the essay score according to the subcriteria described in the guidelines; the explanation is then fed into a statistical model to predict the score, and the good scoring performance validates the quality of the explanations. The second approach formalizes the guideline grading rubrics as logical rules that derive the essay score as a function of subcriteria, mimicking the recommended human grader’s scoring approach. To provide weak supervision in training and to evaluate the quality of the model, we build a dataset of 63 essays annotated with their subcriteria by two expert human graders. Our empirical results suggest that both approaches perform on par with purely statistical methods while providing more helpful and fine-grained feedback.
dialect2vec: Um método baseado em vetores para transcrição dialetal do português a partir de questionários do ALiB
Laila Mota | Daniela Barreiro Claro | Eloize R. Marques Seno | Rerisson Cavalcante de Araújo
Modeling dialectal variation is challenging when it depends on subword-based language models, which often fail to process the complexity of phonetic transcriptions due to vocabulary constraints and semantic biases. This work introduces dialect2vec, a method for capturing the dialectal diversity of Brazilian Portuguese. Our proposal adopts the token-free ByT5 model to encode International Phonetic Alphabet (IPA) sequences at the byte level, mitigating the information loss caused by unknown tokens. Experiments were carried out on data from the Atlas Linguístico do Brasil (ALiB), in which the phonetic dimension alone proved viable for unsupervised clustering tasks, with performance close to the lexical state of the art (BERTimbau). This shows that byte-based architectures can recover complex dialectal structures exclusively from phonological cues, offering a more granular mapping of linguistic boundaries.
Compression-based Language Complexity under Register Variation in Portuguese
Felipe Ribas Serras | Marcelo Finger
Compression-based language complexity metrics show promise as holistic parameters for measuring linguistic complexity in intra- and cross-linguistic scenarios. Yet their sensitivity to specific forms of linguistic variation requires further experimental validation. We examine the sensitivity of this metric family to register variation in Portuguese, a phenomenon already established for English. We refine the validation process found in previous literature by introducing a more granular statistical analysis to evaluate both the individual and joint sensitivity of these metrics to register variation at the sentence level. Our results confirm that these metrics are highly sensitive to functional variation in Portuguese, exhibiting a structural morphosyntactic trade-off consistent with that observed in English and in cross-linguistic studies.
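A minimal member of the compression-based metric family under study is the compressed-to-raw length ratio of a sentence; this zlib sketch is only a crude proxy for the metrics and statistical analysis used in the paper:

```python
import zlib

def compression_complexity(sentence: str) -> float:
    """Compressed-to-raw byte-length ratio as a crude complexity proxy.

    Lower values indicate more internal redundancy (more compressible text);
    higher values indicate less predictable, 'more complex' sequences.
    """
    raw = sentence.encode("utf-8")
    if not raw:
        return 0.0
    return len(zlib.compress(raw, 9)) / len(raw)
```

Note that fixed compressor overhead inflates the ratio for very short inputs, which is one reason sentence-level studies need careful statistical controls.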
Certas Palavras: A 1980s-90s Brazilian Radio Corpus to Test TTS Models in Noisy Multi-Speaker Dialogues
Gustavo Evangelista Araújo | Moacir Ponti | Arnaldo Candido Junior | Sidney Leal | Edresson Casanova | Renato Moraes Silva | Miguel Oliveira Jr. | Adriana Barbosa Santos | Gustavo Wadas Lopes | Sandra Aluisio
Robust text-to-speech (TTS) systems must be trained on speech that mirrors the variability and imperfections of spontaneous dialogues. However, TTS systems trained on existing Brazilian Portuguese datasets are typically limited to clean, scripted, or studio-recorded speech. Certas Palavras (CP) bridges this gap with 70 hours of spontaneous, multi-speaker dialogues from a Brazilian radio program broadcast in the 1980s–1990s. The extensive manual annotation process captures conversational dynamics, including orality markers, filled pauses, and hesitations. Owing to the analog medium, we also annotated non-verbal phenomena such as musical interference, noise, and segmental corrections, characterizing a challenging acoustic environment for synthesis. Baseline YourTTS and F5-TTS models were trained on a 9-hour single-speaker subset corresponding to one of the two main program hosts. Objective evaluation shows that the synthesized speech remains intelligible, with moderate WER and CER. In contrast, subjective evaluation reveals a clear gap in perceived naturalness, with lower MOS scores and higher inter-rater variability compared to ground-truth audio. Together, these properties make the dataset a strong benchmark for TTS robustness.
Análise de Sentimento Baseada em Aspectos no Domínio de Acomodações Utilizando o modelo BERTimbau
Franco Noronha Pereira | Larissa Astrogildo de Freitas | Ulisses Brisolara Corrêa
This work investigates the application of the monolingual BERTimbau model to Aspect-Based Sentiment Analysis (ABSA) in Portuguese, aiming to establish a robust baseline for the hotel domain. Two fine-tuning strategies are compared: a pipeline approach (extraction followed by classification) and an end-to-end approach (multitask with a collapsed tagging scheme). Evaluated on the ABSAPT 2024 competition dataset, the results reveal an architectural trade-off: the pipeline favors recall in aspect extraction (F1: 0.840), while the end-to-end approach prioritizes precision but suffers from class dispersion. The combined analysis shows competitive performance (F-measure of 0.72 for both), offering a starting point for future investigations into hybrid and generative architectures for Portuguese.
Combining Real and Synthetic Speech for ASR Adaptation in Brazilian Portuguese
Daniel R. da Silva | Maria Eduarda S. Borba | Állan C. P. Silva | Maria Carolina S. Barreto | Arthur F. de Morais | Paulo V. dos Santos | Guilherme C. Dutra | Sávio S. T. de Oliveira | Anderson da S. Soares
Automatic Speech Recognition (ASR) systems require large amounts of annotated speech, which are difficult to obtain in specialized domains. This paper introduces GARAGEM (General Automotive Real and Artificial speech corpus for Garage Environments and Maintenance), a domain-specific ASR dataset for Brazilian Portuguese focused on automotive repair, combining real speech collected from online sources with synthetic speech generated from curated technical terminology. A reproducible methodology is proposed, encompassing real data acquisition, domain-guided synthetic data generation, dataset consolidation, and ASR model fine-tuning. Experiments conducted with the Whisper, Wav2vec 2.0, and Conformer models show that synthetic data provides improvements when used to complement real recordings. Quantitative and qualitative analyses show reductions in Word Error Rate (WER) and Character Error Rate (CER) and improved recognition of domain-specific terms absent from the real training set. The results indicate that domain-guided synthetic speech is an effective data augmentation strategy for ASR adaptation in specialized and low-resource scenarios.
Structured Summaries for Retrieval-Augmented Generation in Portuguese-Language Consumer Complaints
Rafael Sant'Ana | Pedro Garcia | Luis A. Duarte | Mariana O. Silva | Adriano C. M. Pereira | Gisele L. Pappa
Dense retrieval is a critical component of Retrieval-Augmented Generation (RAG) systems and is highly sensitive to document representations. In consumer complaint settings, raw interaction texts are often lengthy and noisy, which limits retrieval effectiveness. This paper investigates whether schema-guided structured summaries can improve dense retrieval in RAG. We compare embeddings derived from raw interaction texts and from LLM-generated structured summaries in a controlled evaluation on Portuguese-language consumer complaints. Summary-based retrieval achieves a Recall@1 of 0.527, compared to 0.001 when indexing raw interactions, and reaches Recall@10 of 0.610, demonstrating gains of more than two orders of magnitude. These results show that structured summaries enable more effective and reliable retrieval at low cutoffs, making them particularly suitable for RAG pipelines.
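The Recall@k figures above can be computed as in this minimal sketch, assuming one relevant document per query (the function names are illustrative):

```python
def recall_at_k(ranked_ids, gold_id, k):
    """Recall@k for a single query with exactly one relevant document."""
    return 1.0 if gold_id in ranked_ids[:k] else 0.0

def mean_recall_at_k(all_rankings, gold, k):
    """Average Recall@k over queries; all_rankings maps query -> ranked doc ids."""
    scores = [recall_at_k(all_rankings[q], gold[q], k) for q in gold]
    return sum(scores) / len(scores)
```

With one relevant document per query, Recall@1 equals the fraction of queries whose correct document is ranked first, which is why the low-cutoff numbers (0.527 vs. 0.001) are so informative for RAG.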
Uso de técnicas de Aprendizado de Máquina e Modelos de Língua de Larga Escala para avaliação automática de textos do exame Celpe-Bras
Rafael Oleques Nunes | Bernardo Cobalchini Zietolie | Ricardo Zanini De Costa | Rodrigo Brock da Silva | João Victor Piardi Pacheco | Rafaela Dall'Agnol da Rocha | Dennis Giovani Balreira | Elisa Marchioro Stumpf | Juliana Roquele Schoffen
Celpe-Bras is the official Brazilian proficiency exam for Portuguese as an Additional Language (Inep, 2020). The written part of the exam requires participants to produce four texts in response to tasks based on video, audio, and written prompts, so exam preparation relies heavily on text (re)writing practice. Teachers who prepare students for the exam face a high volume of texts to correct, and students have few accessible teaching resources aligned with the theoretical construct of Celpe-Bras. In this context, and driven by recent advances in Natural Language Processing (NLP), large language models (LLMs), and Artificial Intelligence, this study aims to map and compare methods for the automatic evaluation of texts produced for the Celpe-Bras exam. Several models are presented and tested, covering both traditional machine learning algorithms and pretrained language models such as BERT, BART, and T5. In the end, the best results were obtained by adaptations of the BERT model, slightly superior to the remaining models but at a considerably higher computational cost.
Biatron: A Parameter-Efficient Small Language Model for Brazilian Portuguese with Integrated Mathematical Reasoning
Daniel Fazzioni | Maria C. X. de Almeida | Anna P. V. L. B. Moreira | Anderson S. Soares | Sávio S. T. de Oliveira | Fernando M. Federson
The development of Small Language Models (SLMs) for Portuguese faces significant challenges in balancing parameter efficiency with specialized capabilities, particularly in mathematical reasoning domains where existing models demonstrate limited native competence. This work introduces the first model in the Biatron series, a 345-million-parameter language model specifically optimized for Brazilian Portuguese through strategic data curation rather than brute-force parameter scaling. Using a carefully designed 60-30-10 data mixture combining high-quality Portuguese text from GigaVerbo, chain-of-thought reasoning examples, and mathematical datasets, Biatron was trained on 300 billion tokens using the Megatron-LM framework, achieving 32% Model FLOP Utilization. The model attains an overall score of 0.245 (aggregate performance) on Portuguese-specific benchmarks, approaching within 1.6% of Tucano-630M’s performance while utilizing 45% fewer parameters. Most significantly, Biatron achieves 7.5% Pass@1 accuracy on mathematical reasoning tasks—more than doubling the performance of Tucano-2.4B (3.5%) despite being nearly seven times smaller. These results validate that strategic data mixing can rival parameter scaling for language model development, establishing a reproducible methodology for efficient AI development in resource-constrained language contexts. To support reproducibility and further research, the final model weights, training logs, and intermediate checkpoints are publicly available.
Exploring Sentiment Analysis Approaches in a Public Agency Security News Dataset
Thiago Ruiz Lobo | Claudia Aparecida Martins
As part of the institution’s 2024–2027 strategic plan, which includes the objective of understanding how the media portrays the organization to strengthen its public image, this paper investigates the application of deep learning algorithms in sentiment analysis of headline news about a public security institution. Four deep learning methods were applied in combination with three textual representations, resulting in twelve trained models. For each combination, a class-based analysis of the results was conducted. Models using BERT as the textual representation achieved strong performance, with an F1-score of approximately 90%.
Viés de gênero na tradução automática: uma avaliação no par linguístico inglês-português
Tayane A. Soares | Yohan B. Gumiel | Rafael Junqueira | Tácio Gomes | Adriana Pagano
This article presents an evaluation of gender bias in machine translation (MT) from English to Portuguese, analyzing the performance of three commercial translators (Google Translate, Microsoft Translator, Amazon Translate) and three general-purpose language models (GPT-3.5 Turbo, GPT-4o-mini, and Llama-3 8B-Instruct). Using the WinoMT test corpus (Stanovsky et al., 2019), the quantitative analysis measured accuracy and bias (ΔG and ΔS) in the translated corpus. The results show that all systems exhibit bias, performing better when translating masculine target entities (positive ΔG) and those that corroborate occupational stereotypes (positive ΔS). The qualitative analysis, grounded in Systemic Functional Theory and focusing on the professions ‘nurse’ and ‘physician’, reveals how gender bias constructs meanings that diverge from the source sentences with respect to the target entities and compromises referential cohesion. The study validates an evaluation algorithm adapted for Portuguese and reiterates the persistence of bias as a sociotechnical problem (Savoldi et al., 2025b). We conclude by noting the need for continuous evaluation and for the development of evaluation methods that consider different contexts of MT use, especially in critical domains, in order to weigh and mitigate harms.
Evaluating Reference-Free Summarization Quality Metrics for Portuguese: A Study with Human Judgments in Financial News
João Victor Assaoka Ribeiro | Thomas Pires Correia | José Vitor Souza Cardoso Requena | Lilian Berton
Automatic summarization of financial news in Portuguese lacks reliable reference-free evaluation metrics. While LLM-as-a-Judge approaches are gaining traction, their correlation with human perception in specialized domains remains under-explored. This work evaluates the efficacy of Question Answering (QA) based metrics against a direct LLM-as-a-Judge baseline for Portuguese financial news. We propose a pipeline comparing Lexical, Binary, and Semantic (LLM-based) QA scoring methods, validated against a human ground truth of 50 news items annotated for Faithfulness and Completeness. Our results show that granular QA metrics significantly outperform the monolithic LLM-Judge in evaluating Completeness, with QA-Binary achieving the highest rank correlation (ρ ≈ 0.49 with pessimistic human aggregation). For Faithfulness, we observe a strong ceiling effect in human evaluation, yet the Semantic QA metric demonstrated a "super-human" ability to detect subtle hallucinations (e.g., temporal shifts) missed by annotators. We conclude that decomposing evaluation into atomic QA pairs is superior to holistic judging for the Portuguese financial domain.
Formalizing the DATASUS RTS: An Ontological Model for a Resource Description Framework Knowledge Graph
Vitor Pires | Dalvan Griebler | Felipe Meneguzzi
The Brazilian DataSUS platform provides vast health databases in relational formats that, while operationally efficient, lack the robust representation needed for advanced scientific data management, restricting interoperability. In this paper, we develop a knowledge engineering pipeline using Scenario 2 of the NeOn methodology to extract, process, and transform knowledge from the DataSUS Health Terminology Repository into a formal knowledge graph that adheres to World Wide Web Consortium standards. We illustrate the potential of this formalization by showing how the graph captures the domain’s complex relationships. The resulting graph comprises over 1.4 million triples, with approximately 700,000 associations generated solely through logical inference. Our pipeline provides a foundational resource that enables advanced structural and semantic querying in Portuguese.
Efficient Fine-Tuning Methods for Portuguese Question Answering: A Comparative Study of PEFT on BERTimbau and Exploratory Evaluation of Generative LLMs
Mariela M. Nina | Caio Veloso Costa | Lilian Berton | Didier A. Vega-Oliveros
Although large language models have transformed natural language processing, their computational costs create accessibility barriers for low-resource languages such as Brazilian Portuguese. This work presents a systematic evaluation of Parameter-Efficient Fine-Tuning (PEFT) and quantization techniques applied to BERTimbau for Question Answering on SQuAD-BR, the Brazilian Portuguese translation of SQuAD v1. We evaluate 40 configurations combining four PEFT methods (LoRA, DoRA, QLoRA, QDoRA) across two model sizes (Base: 110M and Large: 335M parameters). Our findings reveal three critical insights: (1) LoRA achieves 95.8% of baseline performance on BERTimbau-Large while reducing training time by 73.5% (F1 = 81.32 vs. 84.86); (2) higher learning rates (2e-4) substantially improve PEFT performance, with F1 gains of up to +19.71 points compared to standard rates; and (3) larger models show twice the quantization resilience (loss of 4.83 vs. 9.56 F1 points). These results demonstrate that encoder-based models can be efficiently fine-tuned for extractive Brazilian Portuguese question answering with substantially lower computational cost than large generative LLMs, promoting more sustainable approaches aligned with Green AI principles. An exploratory evaluation of Tucano and Sabiá on the same benchmark shows that although generative models can achieve competitive F1 scores with LoRA fine-tuning, they require up to 4.2 times more GPU memory and three times more training time than BERTimbau-Base, reinforcing the efficiency advantage of smaller encoder-based architectures for this task.
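As background for the LoRA results above, the low-rank update can be sketched with NumPy; the shapes and names follow the generic LoRA formulation (W' = W + (α/r)·BA), not any specific configuration from the paper:

```python
import numpy as np

def lora_delta(W, A, B, alpha):
    """Apply a LoRA update: W' = W + (alpha / r) * (B @ A).

    W is the frozen pretrained weight (d_out x d_in); A (r x d_in) and
    B (d_out x r) are the small trainable matrices, with rank r << d.
    """
    r = A.shape[0]
    return W + (alpha / r) * (B @ A)
```

For a 100×100 weight with r=4, the trainable matrices hold 4·(100+100) = 800 parameters instead of 10,000, which is where LoRA's training-cost savings come from.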
Structured Sentiment Analysis in Brazilian Portuguese: An Exploratory Study Using BERTimbau
Andrew B. Campos | Ulisses B. Corrêa | Larissa A. de Freitas
Structured Sentiment Analysis (SSA) aims to extract fine-grained opinion structures as tuples (holder, target, expression, polarity). While recent advances have improved SSA for English, Brazilian Portuguese lacks dedicated resources. This paper presents an exploratory study introducing a manually annotated dataset of hotel reviews for SSA in Brazilian Portuguese. We propose a baseline approach fine-tuning the BERTimbau model under a BIO tagging scheme to extract sentiment spans. Unlike traditional approaches that model relations explicitly, we assess the viability of span-level extraction as a first step for SSA in this language. Experimental results using a strict train/validation/test split show that our approach achieves a span-level F1-score of 48.41 for holder extraction and a macro F1-score of 61.52. We also discuss the linguistic challenges of holder extraction in Portuguese, specifically regarding implicit subjects (pro-drop), and provide a detailed error analysis. These results establish a preliminary baseline for future relation-aware models in Portuguese.
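The BIO tagging scheme used for span extraction above can be decoded into labeled spans as in this sketch; the label names mirror the tuple elements (holder, target, expression), and the decoding heuristic is illustrative rather than the paper's exact post-processing:

```python
def bio_to_spans(tokens, tags):
    """Decode a BIO tag sequence into (label, span_text) pairs.

    A span opens at B-<label>, continues through matching I-<label> tags,
    and closes at O, a new B-, or a mismatched I- tag.
    """
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        if tag.startswith("B-") or tag == "O" or (
            tag.startswith("I-") and tag[2:] != label
        ):
            if start is not None:
                spans.append((label, " ".join(tokens[start:i])))
                start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans
```

Span-level F1 is then computed by matching these decoded (label, span) pairs against the gold annotations.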
JabuticaBERT: Modern Portuguese Encoders from Scratch with RTD and Long-Context Training
Thiago Porto | Gabriel Gomes | Alexandre Bender | Ulisses Corrêa | Larissa Freitas | William Cruz | Marcellus Amadeus
Encoder-based language models remain essential for natural language understanding tasks such as classification, semantic similarity, and retrieval-augmented generation. However, the lack of high-quality monolingual encoders for Brazilian Portuguese poses a significant challenge to performance. In this work, we systematically explore the training of Portuguese-specific encoder models from scratch using two modern architectures: DeBERTa, trained with Replaced Token Detection (RTD), and ModernBERT, trained with Masked Language Modeling (MLM). All models are pre-trained on the large-scale Jabuticaba corpus. Our DeBERTa-Large model achieves results comparable to the state-of-the-art, with F1 scores of 0.920 on ASSIN2 RTE and 0.915 on LeNER. Crucially, it matches the performance of the 900M-parameter Albertina model while utilizing significantly fewer parameters. We also release custom tokenizers that reduce token fertility rates compared to multilingual baselines. These findings provide evidence that careful architectural choices and monolingual tokenization can yield competitive performance without massive model scaling.
Evaluating Small Language Models for English-to-Portuguese Translation: Impact of Model Scale and Quantization
Gustavo Lopes Tamiosso | Rafael Oleques Nunes | Dennis Giovani Balreira
Small language models (SLMs) are increasingly adopted for machine translation due to their lower computational and deployment costs, yet a focused and systematic evaluation for English-to-Portuguese remains limited. We benchmarked dozens of SLMs (135M–20B parameters) across multiple architectures and quantization schemes (FP16, Q8_0, Q4_K_M) on two datasets: FLORES-101 (Portuguese subset, 1,012 sentences) and the multidomain OPUS-100 dataset (~10k sentences). We computed lexical and semantic metrics (BLEU, chrF, and BERTScore) and assessed statistical differences using non-parametric Friedman tests over paired sentence-level scores, followed by Wilcoxon signed-rank post-hoc comparisons with Holm correction. Normality assumptions were evaluated using the Shapiro–Wilk test. Our results strongly suggest that 8-bit quantization (Q8_0) preserves semantic quality with negligible average loss, while 4-bit quantization (Q4_K_M) reaches statistical significance in roughly half of the model configurations; however, paired effect sizes (Cliff’s δ) remain negligible to small in magnitude, with measurable degradation concentrated in lower-capacity models. Model scale exhibits only a weak correlation with translation quality: medium-sized models can match or outperform larger ones depending on model family and pretraining. These findings highlight trade-offs between efficiency and quality and inform the design of practical English-to-Portuguese translation pipelines based on SLMs.
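The post-hoc step above (Wilcoxon signed-rank comparisons with Holm correction) relies on the Holm step-down procedure, sketched here over a list of p-values; in practice the p-values would come from `scipy.stats.wilcoxon` on paired sentence-level scores:

```python
def holm_correction(pvalues, alpha=0.05):
    """Holm step-down multiple-comparison correction.

    Sort p-values ascending and compare the i-th smallest against
    alpha / (m - i); once one comparison fails, all larger p-values
    fail too. Returns a list of booleans (True = null rejected),
    aligned with the input order.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if pvalues[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: stop at the first non-significant comparison
    return reject
```

Holm controls the family-wise error rate like Bonferroni but is uniformly more powerful, which matters when comparing many model/quantization pairs.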
Think Portuguese with Bode Reasoning
Gabriel Lino Garcia | André da F. Schuck | João R. R. Manesco | Pedro Henrique Paiola | Leandro A. Passos | João Paulo Papa
Large Language Models (LLMs) have introduced reasoning capabilities through multi-step problem-solving processes. These models predominantly perform reasoning in English, limiting their effectiveness in other languages. This paper introduces Bode Reasoning, a Portuguese-language reasoning approach built upon fine-tuned Qwen3-4B and Qwen3-4B-Thinking models, and the Bode Reasoning Portuguese Dataset, comprising 13,961 instances from Brazilian examinations and translated datasets. Through supervised fine-tuning, the proposed approach successfully shifts the reasoning process to Brazilian Portuguese while reducing output verbosity. Experimental evaluation demonstrates that the fine-tuned models generate Portuguese reasoning in 86–98.7% of outputs and achieve superior lexical alignment with reference answers. However, this specialization results in moderate degradation in mean G-Eval scores and accuracy across diverse multiple-choice question types, highlighting inherent trade-offs in adapting multilingual reasoning models.
Rating–Text Mismatch in Brazilian Portuguese Reviews: How Reliable Are Zero-Shot LLMs?
Emanuelle Marreira | Carlos M. S. Figueiredo | Tiago de Melo
This study evaluates the ability of large language models (LLMs) to detect incoherence between the text of product reviews and their assigned rating (1 or 5 stars). Using popular LLMs such as GPT-5, Llama-4 and DeepSeek-3.2, as well as models optimized for Brazilian Portuguese, Sabiá-3.1 and Bode-3.1, we show that some are capable of detecting incoherence between texts and ratings (F1 > 90%) in a zero-shot protocol. The models also show high agreement across predictions, with several prediction rounds yielding low variability (Fleiss’ κ > 0.95). With the demonstrated incoherence present in all product categories (approx. 10% of comments), the results suggest that LLMs are very promising for this semantically demanding interpretation task and can serve as valuable tools for online monitoring and recommendation systems.
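The agreement figure above (Fleiss’ κ over repeated prediction rounds) has a short closed form; a minimal pure-Python sketch, where the rating matrix is a toy example, not data from the study:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for an item-by-category ratings matrix.

    counts[i][j] = number of raters (here: prediction rounds) assigning
    item i to category j; every row must sum to the same rater count n.
    """
    N = len(counts)            # number of items
    n = sum(counts[0])         # raters per item
    k = len(counts[0])         # number of categories
    # Per-item observed agreement P_i.
    P = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P) / N
    # Chance agreement from marginal category proportions.
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)

# Perfect agreement: 3 rounds, every review unanimously labeled.
print(fleiss_kappa([[3, 0], [0, 3], [3, 0]]))  # 1.0
```

Unanimous rounds give κ = 1 regardless of the category split, which is why κ > 0.95 indicates near-deterministic model behavior across runs.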
Accelerating Portuguese Masked Diffusion Models through Representation Alignment
Adalberto Ferreira Barbosa Junior | Lucas Lima Neves | Adriano César Santana
Masked Diffusion Language Models (MDLM) have recently demonstrated that discrete diffusion can achieve competitive performance in text generation. However, training these models remains computationally expensive, particularly for lower-resourced languages like Portuguese. In this work, we adapt REPresentation Alignment (REPA), a technique originally proposed for vision, to the textual domain. We systematically evaluate the impact of aligning the internal representations of a Portuguese MDLM with those of pretrained teacher encoders (e.g., Qwen, BERTimbau). Our experiments show that REPA significantly accelerates training and improves final perplexity by 28.6% compared to a baseline without alignment. We also identify optimal hyperparameters, finding that mid-level alignment with modern teacher encoders yields the best results.
Automatic Speech Recognition for Child Reading: A Phonemic Approach using Isolated Words in Brazilian Portuguese
Aline N. Rodrigues | Carlos H. C. Ribeiro
Automatic assessment of reading in children who are learning to read is challenging due to the lack of data and the high variability of children’s speech. This work investigates the improvement of Automatic Speech Recognition (ASR) models for the analysis of reading decoding of isolated words in Brazilian Portuguese. We propose a methodology based on fine-tuning Wav2Vec2.0 models, with a paradigm transformation from orthographic to phonemic transcription. Using a novel corpus of 5,400 audio word samples from children in the 2nd and 3rd grades of Elementary School, we compare Portuguese-language and multilingual pre-trained models. Results reveal that the phonemic approach, combined with fine-tuning strategies, data augmentation, and adapted tokenization, significantly reduces the Phoneme Error Rate (PER). This overcomes the limitations of commercial tools and validates the use of ASR for the detailed diagnosis of decoding errors and phonological acquisition.
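The Phoneme Error Rate reported above is a Levenshtein edit distance over phoneme sequences, normalized by the reference length; a minimal sketch, with phoneme strings that are illustrative rather than taken from the corpus:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (here: phoneme lists)."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n]

def per(ref, hyp):
    """Phoneme Error Rate: edit distance / reference length."""
    return edit_distance(ref, hyp) / len(ref)

# A child reads "casa" as /k a z a/ instead of /k a s a/: one substitution.
print(per(["k", "a", "s", "a"], ["k", "a", "z", "a"]))  # 0.25
```

Operating on phonemes rather than characters is what lets PER localize a decoding error to a single sound, which the abstract exploits for diagnosing phonological acquisition.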
Improving Machine Translation of Idioms: A Spanish–Galician Parallel Dataset and Synthetic Augmentation Approach
Lúa Santamaría Montesinos | Saúl Buján | Daniel Bardanca | Pablo Gamallo
Idiomatic expressions are a well-known challenge for neural machine translation, including both traditional sequence-to-sequence models and large language models (LLMs). This paper presents a systematic approach to improve idiom translation between Spanish and Galician. First, we build a high-quality parallel dataset of idioms manually aligned across both languages. Then, we automatically extend this dataset into a large synthetic parallel corpus using LLMs, following a strategy that prioritizes the most frequent idioms observed in authentic corpora. This augmented dataset is used to retrain a seq2seq translation model. We evaluate the resulting system and compare it both to the baseline model without idiom data and to state-of-the-art LLM-based translators such as SalamandraTA. Results show that the translation of idioms improves significantly after the training, alongside a slight boost in the model’s overall performance.
Multi-Agent Architecture with RAG and Dynamic Context Windows for Text-to-SQL Optimization
Willgnner Ferreira Santos | Paulo Victor dos Santos | Marcella Scoczynski Ribeiro Martins | Larissa Freire Lekakis | Frederico Lemes Rosa | Bruno Matheus Costa | Miguel Alves Pereira Filho | Isabella Alves Montalvão
Natural language interfaces supported by LLMs have been used to translate user questions into SQL queries, but sending the complete database schema in each prompt entails high token consumption and computational cost, especially in corporate databases with hundreds of tables. This work presents a multi-agent Text-to-SQL architecture with dynamic context windows, which combines RAG and metadata dictionaries to select, at query time, only the relevant tables and columns. In a case study with Firebird enterprise databases, the approach reduces the number of processed tokens by an average of 84.4%, resulting in more efficient queries without loss of quality, thereby contributing to the democratization of access to corporate databases.
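The dynamic-context idea (rank tables against the question and include only the top matches in the prompt) can be illustrated with a toy retrieval step; here relevance is naive word overlap with a metadata dictionary, standing in for the paper's RAG retrieval, and the table names and columns are invented:

```python
def select_tables(question, schema, top_k=2):
    """Rank tables by word overlap between the question and table metadata.

    schema: dict mapping table name -> list of column names.
    Returns the top_k table names; a stand-in for embedding-based retrieval.
    """
    q_words = set(question.lower().split())

    def score(table):
        name, cols = table
        meta_words = set(name.lower().split("_")) | {c.lower() for c in cols}
        return len(q_words & meta_words)

    ranked = sorted(schema.items(), key=score, reverse=True)
    return [name for name, _ in ranked[:top_k]]

# Hypothetical enterprise schema with more tables than one prompt should carry.
schema = {
    "sales_orders": ["id", "customer", "total", "date"],
    "customers": ["id", "name", "city"],
    "hr_payroll": ["id", "employee", "salary"],
}
print(select_tables("total sales by customer", schema))
# ['sales_orders', 'customers']
```

Only the selected tables' DDL would then be placed in the LLM prompt, which is where the token savings reported in the abstract come from.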
Automatic Metrical Scansion of Galician Poetry: First Results
Pablo Ruiz Fabo | Pauline Moreau | Anxo Alonso Pérez
We present the first public, user-friendly system for Galician poetry scansion, a symbolic system derived from a well-performing mixed-meter Spanish scansion library. We adapted its resources to Galician and added a preprocessing module. The system achieves 88% per-line accuracy in exact stress-pattern match on data unseen during development, and has practical value: First, it helps create a large annotated corpus to train scansion systems. Second, its web interface can help engage a non-specialist public. Third, its current accuracy is helpful for annotating large volumes of poetry and studying metrical trends in Computational Literary Studies use cases.
MATH-PT: A Math Reasoning Benchmark for European and Brazilian Portuguese
Tiago Teixeira | Ana Carolina Erthal | Juan Belieni | Beatriz Canaverde | Diego Mesquita | Miguel Faria | Eliezer de Souza da Silva | André F. T. Martins
The use of large language models (LLMs) for complex mathematical reasoning is an emergent area of research, with fast progress in methods, models, and benchmark datasets. However, most mathematical reasoning evaluations exhibit a significant linguistic bias, with the vast majority of benchmark datasets being exclusively in English or (at best) translated from English. We address this limitation by introducing MATH-PT, a novel dataset comprising 1,729 mathematical problems written in European and Brazilian Portuguese. MATH-PT is curated from a variety of high-quality native sources, including mathematical Olympiads, competitions, and exams from Portugal and Brazil. We present a comprehensive benchmark of current state-of-the-art LLMs on MATH-PT, revealing that frontier reasoning models achieve strong performance on multiple-choice questions compared to open-weight models, but that their performance decreases for questions with figures or open-ended questions. To facilitate future research, we release the benchmark dataset and model outputs.
Democratizing Legal Analytics: Resource-Efficient Information Extraction for Brazilian Case Law
Rodrigo Filippi Dornelles
Legal systems produce large volumes of high-stakes decisions in unstructured natural language, making large-scale empirical analysis costly, difficult to reproduce, and unevenly accessible. This bottleneck is especially acute for legal analytics and policy evaluation in low-resource languages such as Portuguese. To address it, we present a resource-efficient pipeline for information extraction from Brazilian criminal case law that reuses a legacy dataset to fine-tune open-weight LLMs with Q-LoRA. Operating in a small-data setting and using schema-constrained JSON generation, the pipeline extracts 47 legal variables spanning charges, evidence, and sentencing outcomes. In held-out evaluation, a fine-tuned Phi-4 (14B) model achieves 92.8% accuracy and 0.826 macro-F1, approaching proprietary baselines while retaining the cost and privacy benefits of local deployment. We then use the extracted data in a case study of the short-term effects of a recent Brazilian Supreme Court ruling on drug decriminalization, finding no statistically significant change in trafficking-conviction rates (p≥0.05), a pattern consistent with short-run institutional inertia. More broadly, the paper contributes a reproducible framework for legal NLP and shows how legacy empirical datasets can support scalable legal analytics under severe resource constraints.
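Schema-constrained JSON generation, as described above, implies a validation step on the model's raw output; a minimal sketch, where the field names are hypothetical and stand in for the paper's 47-variable schema:

```python
import json

# Hypothetical schema: a tiny stand-in for the paper's 47 legal variables.
ALLOWED = {"charge": str, "conviction": bool, "sentence_months": int}

def validate_extraction(raw):
    """Parse model output and keep only fields matching the expected schema.

    Returns (record, errors): unknown keys and type mismatches are
    reported, never silently accepted.
    """
    record, errors = {}, []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        return {}, [f"invalid JSON: {e}"]
    for key, value in data.items():
        if key not in ALLOWED:
            errors.append(f"unknown field: {key}")
        elif not isinstance(value, ALLOWED[key]):
            errors.append(f"wrong type for {key}: {type(value).__name__}")
        else:
            record[key] = value
    return record, errors

out = '{"charge": "trafficking", "conviction": true, "judge_mood": "stern"}'
record, errors = validate_extraction(out)
print(record)  # {'charge': 'trafficking', 'conviction': True}
print(errors)  # ['unknown field: judge_mood']
```

Rejecting rather than repairing off-schema fields keeps the downstream statistical analysis (such as the conviction-rate case study) auditable.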
Field of Science and Technology Classification of Academic Documents in Portuguese
Ivo Simões | Hugo Gonçalo Oliveira | João Correia
Towards improving metadata in academic repositories, this study evaluates the efficacy of different transformer-based models in the automatic classification of the Field of Science and Technology (FOS) of academic theses written in Portuguese. We compare the performance of four different encoder models, two multilingual and two Portuguese-specific, against five larger decoder-based LLMs, on a dataset of 9,696 theses characterized by their title, keywords, and abstract. Fine-tuned encoder-based models achieved the best scores (F1 = 88%), outperforming general-purpose decoder models prompted for the task. These results suggest that, for localized academic domains, task-specific fine-tuning remains more effective than general-purpose LLM prompting.
Quando as Máquinas “Pensam”: Antropomorfização no Discurso sobre IA - Riscos Conceptuais e Proposta de Léxico Não Antropomórfico para PLN em Português
Anabela Barreiro
The anthropomorphization of Artificial Intelligence systems has become particularly relevant in Portuguese Natural Language Processing contexts, where expressions such as "o modelo compreende" ("the model understands") or "o sistema alucina" ("the system hallucinates") can create conceptual misunderstandings, contributing to a mistaken perception of the models' capabilities. This paper proposes a terminological framework for describing Portuguese Natural Language Processing systems without resorting to anthropomorphic metaphors, presenting a set of linguistic reformulations intended to improve conceptual precision and Artificial Intelligence literacy.
Semantic adapters in text-to-SQL for low-resource languages: the importance of semantic information
Anton Bulle Labate | Fabio Gagliardi Cozman
This paper investigates whether injecting semantic structural knowledge of low-resource or unfamiliar languages into Large Language Models (LLMs) enhances performance on downstream Text-to-SQL tasks. We evaluate our approach on Galician, a Romance low-resource language, and, to demonstrate its generality, also on Guarani, a (very) low-resource language with an entirely distinct linguistic profile. Our empirical results show that semantically-aware models consistently outperform baselines across all benchmark metrics.
CURUPIRA: Clever guard for harm and linguistic prompt mitigation in Brazilian Portuguese
Rogério Sousa | William Alberto Cruz-Castañeda | José Roberto Homeli Silva | Marcellus Amadeus
The safe deployment of Large Language Models remains challenging in multilingual settings, particularly when models are exposed to adversarial or malicious prompts in underrepresented languages. In this work, we present Curupira, a Brazilian Portuguese-language guard model designed to mitigate harmful prompt exploitation. To do this, we establish a three-step methodology that involves adaptation, data generation, and fine-tuning. We also evaluate our model against two state-of-the-art open guardrail architectures. The results show that targeted fine-tuning leads to consistent improvements in safety classification for Portuguese prompts, with favorable efficiency–performance trade-offs for compact models and limited degradation in cross-lingual evaluation.
Lost in Quantization: Activation Outliers Explain Language-Specific FP8 Sensitivity in Llama-3
Guilherme Silva | Pedro Silva | Matheus Peixoto | Gladston Moreira | Eduardo Luz
Quantization is key for efficient LLM inference, but its language-specific effects are understudied. We compare INT8 and FP8 (E4M3) quantization for Meta-Llama-3-8B on English and Brazilian Portuguese (PT-BR). INT8 with outlier handling preserves perplexity in both languages, while naive FP8 casting degrades English far more than PT-BR (+18% vs. +3.9%). Activation analysis shows rarer, larger English spikes (>35) that are more prone to saturation under unscaled E4M3, whereas PT-BR activations are more concentrated. Our FP8 results reflect a naive casting stress test (no calibration/scaling), not an optimized FP8 recipe.
A Multitask Transformer for Offensive Language Detection and Target Identification in HateBR
Guilherme Silva | Pedro Silva | Matheus Peixoto | Gladston Moreira | Eduardo Luz
Hate speech detection is often treated as a binary task, ignoring the hierarchical nature of toxicity, such as severity levels and specific target groups. This work presents a Multitask Learning (MTL) approach for the HateBR dataset, utilizing a shared BERTimbau encoder to simultaneously predict binary offensiveness, ordinal severity, and hate speech targets. Our experiments demonstrate that the MTL architecture outperforms Single-Task baselines on the primary offensive detection task, increasing the Matthews Correlation Coefficient from 0.80 to 0.82. Beyond predictive performance, we show that joint training implicitly enforces hierarchical sanity: the unified model yields a 0% target-inconsistency rate (i.e., no cases where a comment is predicted Non-offensive while still assigned a hate target). However, we observe negative transfer in the fine-grained multilabel target task (Micro-F1 drops from 0.59 to 0.42), highlighting a trade-off between logical consistency and target attribution under extreme imbalance.
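The hierarchical-sanity result above (a 0% target-inconsistency rate) reduces to counting predictions labeled non-offensive yet assigned a hate target; a minimal sketch over invented predictions:

```python
def target_inconsistency_rate(predictions):
    """Fraction of predictions labeled non-offensive yet assigned a target.

    predictions: list of (is_offensive: bool, targets: list of labels).
    """
    if not predictions:
        return 0.0
    bad = sum(1 for offensive, targets in predictions
              if not offensive and targets)
    return bad / len(predictions)

preds = [
    (True, ["group"]),        # offensive with a target: consistent
    (False, []),              # non-offensive, no target: consistent
    (False, ["individual"]),  # non-offensive but targeted: inconsistent
    (True, []),               # offensive, untargeted: consistent
]
print(target_inconsistency_rate(preds))  # 0.25
```

Single-task pipelines can produce such contradictions because the offensiveness and target heads never see each other; the joint training in the abstract drives this rate to zero.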
LARI Dataset: A Native Portuguese Question Answering Dataset from Brasileiras em PLN
Júlia da Rocha Junqueira | Larissa A. de Freitas | Ulisses Brisolara Corrêa
Recent advances in the field have revolutionized Question Answering (QA). However, for languages like Portuguese, progress is often hindered by the lack of native training resources. To address this gap, this paper introduces LARI, a new dataset designed to benchmark and enhance QA in Portuguese. Our methodology combines the capabilities of the Sabiá-7B model, fine-tuned via QLoRA on a domain-specific corpus, with human validation. We utilized the book Natural Language Processing – Concepts, Techniques, and Applications in Portuguese (2nd Edition) as a case study for content extraction. The generated instances underwent expert human evaluation, achieving an average quality score of 4.47 out of 5. The final dataset, comprising 464 context-question-answer triples, is made publicly available to the community, offering a valuable resource for future research in low-resource settings.
Evolução de Padrões Linguísticos na Escrita Científica em Português: Uma Análise com NILC-Metrix
Thiago Ruiz Lobo | Claudia Aparecida Martins
This work analyzes the evolution of linguistic patterns in Portuguese-language abstracts of Brazilian Computer Society papers between 2020 and 2025, based on linguistic metrics from NILC-Metrix. A total of 72 metrics were applied to a set of over 10 thousand abstracts, and statistical comparisons (t-tests) were performed between the reference period (2020–2022) and subsequent years. The results indicate transformations from 2023 onward, including structural simplification, increased lexical density, reconfiguration of discursive strategies, and changes in the use of connectives. In 2024 and 2025, more than 95% of papers exhibit multiple metrics significantly different from the reference period.
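The period comparison described above rests on t statistics over per-abstract metric values; a minimal Welch's t statistic in plain Python, where the samples are synthetic rather than NILC-Metrix values:

```python
def welch_t(a, b):
    """Welch's t statistic for two independent samples of metric values."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    # Unbiased sample variances.
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / (va / na + vb / nb) ** 0.5

# Identical samples give t = 0; a clear shift gives a large |t|.
baseline = [10.0, 11.0, 9.0, 10.5, 9.5]   # e.g., a 2020-2022 metric
shifted = [x + 3.0 for x in baseline]     # e.g., the same metric in 2024
print(welch_t(baseline, baseline))  # 0.0
print(welch_t(baseline, shifted) < 0)  # True: the later mean is higher
```

Welch's variant is shown because it tolerates unequal variances between the reference period and later years; the degrees of freedom and p-value would come from its Satterthwaite approximation.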
Libras-UFPel Corpus: A Parallel Dataset of Brazilian Sign Language and Portuguese for Multimodal Research and Processing
Antonielle Martins | Brenda S. Santana | Francielle Martins | Tatiana Lebedeff | Darley Nunes | Luisa Bohm
The Libras-UFPel Corpus is a multimodal, multilayer parallel resource designed for the documentation and computational analysis of Brazilian Sign Language (Libras) in systematic alignment with written Portuguese. By integrating controlled recordings with naturalistic data from the Inventário Nacional de Libras-Pelotas, the corpus ensures interoperability through shared methodological standards. The dataset currently comprises 4,800 controlled audiovisual records (2,400 sentences and 2,400 isolated signs) fully paired with Portuguese translations, supplemented by approximately 10 hours of spontaneous interaction from three new naturalistic interviews, currently in the editing phase. To date, 1,200 controlled sentences have been lemmatized, gloss-annotated, and translated, providing a structured parallel subset for Libras-to-Portuguese Sign Language Processing tasks such as recognition and machine translation. The annotation model follows a hierarchical structure covering lexical, partially lexical, and non-lexical signs, including independent tiers for non-manual markers. By bridging descriptive linguistics and Natural Language Processing, the Libras-UFPel Corpus serves as a reference source for bilingual data-driven modeling, advancing digital inclusion and linguistic accessibility.
The Superficiality Bias: Community Votes and Answer Utility in Portuguese Health Question Answering
Carlos Henrique Santos Barros | Gustavo Figueredo Rodrigues de Sousa | Rogério Figueredo de Sousa
Supervised models trained on community-labeled data have shown promise in Health Question Answering (HQA), but relying on “likes” as a proxy for clinical usefulness remains controversial. This work investigates the alignment between automated predictions and human perception in Portuguese HQA. Using a subset of the SaudeBR-QA corpus, we compare a Random Forest classifier against a controlled evaluation conducted by laypeople and healthcare professionals. Our results reveal a recurring divergence that we term Superficiality Bias: human evaluators frequently validate very brief answers, whereas the classifier often labels these cases as non-useful under its learned criteria. Rather than indicating that the model is inherently more clinically accurate, this pattern suggests a misalignment between community feedback and feature-driven utility judgments. We argue that crowd-based labels in medical domains should be treated cautiously and complemented with more rigorous annotation protocols.
FlexQwen: Exploring Hybrid Objectives and Text Originality for Portuguese
Miguel de Mello Carpi | Marcelo Finger
While scaling laws suggest increasing model and dataset sizes for better results, efficient pre-training techniques for low-resource scenarios present unique challenges that require further investigation. This work introduces FlexQwen, a model based on the Qwen 3 architecture adapted for a hybrid causal-masked objective, and the Carolina Originality dataset, a subset of the Corpus Carolina tailored for efficient pre-training in Portuguese. We investigate two primary research questions: the influence of hybrid masked-causal modelling and the impact of text originality on model performance. Our experiments compare a high-originality Gold split against a length-matched control group. Results indicate that hybrid objectives may be viable for efficient training. Furthermore, we provide open access to our code, datasets, and training logs to foster further research in efficient Portuguese LLMs.
NormaTex-MapSNOMED: Bridging the Gap Between Brazilian Portuguese Clinical Narratives and SNOMED CT
Isabela Araujo | Claudia Moro | Layslla Martinez
Clinical narratives written in free text contain valuable information for patient care. However, their unstructured nature and linguistic variability pose significant challenges for automatic processing and interoperability. In particular, mapping clinical terms to standardized terminologies such as SNOMED Clinical Terms (SNOMED CT) remains difficult for languages other than English, including Brazilian Portuguese. This paper presents NormaTex-MapSNOMED, a proposed component of the NormaTex framework that focuses on mapping clinical terms to predefined categories aligned with SNOMED CT. Given previously extracted terms, the method leverages large language models (LLMs) guided by a structured prompt to assign terms to target categories. Experiments were conducted on Portuguese-language clinical narratives and evaluated using three complementary strategies: lexical similarity based on Levenshtein distance, contextual similarity using a BERT-based model, and semantic validation using LLMs. The results show that LLM-based evaluation consistently outperforms lexical and contextual baselines across different models, with higher precision observed for disease-related terms compared to symptom-related expressions. These findings indicate that LLMs are a promising approach for semantic mapping of clinical terms in Brazilian Portuguese and can support clinical term normalization and interoperability with standardized terminologies.
Prompt Engineering for Named Entity Extraction from Portuguese Legal Documents
Giovanni Maffeo | Catarina Silva | Hugo Gonçalo Oliveira
The growing volume and complexity of legal texts highlight the need for automatic methods capable of extracting structured information from unstructured documents. This challenge is even more severe for Portuguese, given the limited availability and high cost of annotated legal data. This work investigates whether prompt engineering over Large Language Models (LLMs) can effectively support legal Named Entity Recognition (NER) in low-supervision and low-resource settings through In-Context Learning (ICL). Using the LeNER-Br corpus, we evaluate category-specific prompts, different chunking sizes, and prompt engineering strategies. Entity-level evaluation using Exact Match Micro F1 shows that prompt engineering has a stronger impact on performance than the other strategies. The best results were obtained with larger models, the 4-bit quantised Qwen-2.5:32B and GPT-5.2, achieving scores of 57.9% and 71.9%, respectively, highlighting the potential of this approach as an alternative to traditional supervised NER pipelines.