Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 2

Marlo Souza, Iria de-Dios-Flores, Diana Santos, Larissa Freitas, Jackson Wilke da Cruz Souza, Eugénio Ribeiro (Editors)


Anthology ID:
2026.propor-2
Month:
April
Year:
2026
Address:
Salvador, Brazil
Venue:
PROPOR
Publisher:
Association for Computational Linguistics
URL:
https://aclanthology.org/2026.propor-2/
ISBN:
979-8-89176-387-6
PDF:
https://aclanthology.org/2026.propor-2.pdf

Reviewing academic papers is a crucial yet costly stage in the training of researchers. Previous work has achieved good results with automated review approaches in English. In this context, we present ARAMIS, a multi-agent tool based on open-source Large Language Models (LLMs), designed to review undergraduate theses (Trabalhos de Conclusão de Curso, TCC) in Portuguese. The solution focuses on three pillars: grammatical correction, logical flow, and methodological rigor, allowing the user to receive structured reviews for each chosen pillar. Even at an experimental stage, tests achieved excellent usability results when applying the System Usability Scale (SUS), obtaining a score of 90.5/100.
Adult Learning (AL) programmes need short, trustworthy texts that match learners’ reading abilities, but educators rarely have time, tools, or evidence-based guidelines to select and adapt materials consistently. We present a live demo of iRead4Skills for European Portuguese: a web-based system that (i) estimates readability/complexity for AL-oriented levels aligned with CEFR, (ii) highlights where complexity concentrates (lexical, grammatical, semantic), and (iii) supports rewriting by offering actionable, level-aware suggestions and curated lexical resources. The demo emphasises transparency and “trainer-first” workflows: users see *why* a text is complex and *how* to revise it without losing meaning.
This demo showcases a web-based interface that provides open, interactive access to a large-scale grammatical database of European Portuguese verbal constructions. Through a unified search and exploration environment, users can query, inspect, and compare more than 7,000 distributionally free verbal constructions and over 2,700 verbal idioms (frozen constructions), grounded in long-standing Lexicon–Grammar descriptions. For each construction, the interface exposes core linguistic properties such as argument structure, distributional constraints, semantic roles, major syntactic transformations, and curated usage examples with English translations. The demo illustrates how detailed, manually validated grammatical knowledge can be explored dynamically via the web, supporting linguistic research, language teaching, and NLP development. To the best of our knowledge, this is the largest publicly accessible, web-based grammatical resource dedicated to European Portuguese verbal constructions.
This paper describes Bruna, a data-centric smart voice assistant powered by multiple Large Language Models and designed to support Stilingue and Blip products. Our architecture provides an enriched conversational experience, delivering strategic insights in real time.
Analyzing large conversational datasets is often inefficient due to the linear nature of text, which hinders the tracking of interaction evolution over time. To address this, we present FlowDisco, an interactive platform for the automatic discovery and exploration of dialogue flows. The framework uses semantic embeddings and modular clustering to transform raw text into probabilistic dialogue flows. By providing a web interface with dynamic filtering and a suite of analytical metrics, FlowDisco simplifies the visual identification and validation of conversational behaviors at scale. The platform’s utility is demonstrated through real-world application scenarios, including customer support interactions and multi-party political debates, where it successfully uncovers complex patterns and sentiment shifts that traditional sequential analysis often overlooks.
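To make the flow-discovery idea above concrete, here is a minimal Python sketch, assuming TF-IDF vectors as a stand-in for FlowDisco's semantic embeddings and KMeans for its modular clustering: utterances are grouped into states, and transition probabilities are estimated from adjacent turns. All data and parameter choices are illustrative, not the paper's configuration.

```python
# Minimal sketch of dialogue-flow discovery: embed utterances, cluster them
# into dialogue states, and estimate state-transition probabilities.
# TF-IDF stands in for the semantic embeddings used by FlowDisco.
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

dialogues = [
    ["oi, preciso de ajuda", "qual é o problema?", "meu pedido atrasou", "vou verificar"],
    ["olá, bom dia", "como posso ajudar?", "quero cancelar o pedido", "cancelamento feito"],
]
utterances = [u for d in dialogues for u in d]

embeddings = TfidfVectorizer().fit_transform(utterances)
states = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(embeddings)

# Count state-to-state transitions within each dialogue, then normalize.
it = iter(states)
labeled = [[next(it) for _ in d] for d in dialogues]
counts = Counter((a, b) for d in labeled for a, b in zip(d, d[1:]))
totals = Counter()
for (a, _), n in counts.items():
    totals[a] += n
flow = {(a, b): n / totals[a] for (a, b), n in counts.items()}
print(flow)  # edge probabilities of the discovered dialogue flow
```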
This paper presents AttentionApp, an interactive demonstration system designed to support the inspection and linguistic analysis of attention mechanisms in Transformer-based language models for Portuguese. The tool allows users to input sentences in Portuguese and visualize attention distributions across layers and heads, enabling fine-grained qualitative analysis of syntactic and semantic patterns captured by the model. AttentionApp is intended as a research-oriented tool, facilitating exploratory analysis, hypothesis generation, and interpretability studies for Portuguese Natural Language Processing.
This paper presents NOAH, a multimodal computational system to support disaster risk management (DRM) in Brazilian cities, addressing the need for information exchange and communication between public DRM agents and members of the population in risk and disaster situations. The system is being developed through the application of artificial intelligence (AI), integrating a chatbot with natural language processing (NLP), speech recognition, image classification, and information retrieval via retrieval-augmented generation (RAG). It focuses on direct communication with the population via WhatsApp, enabling the collection of Portuguese-language reports in text, audio, and image formats. NOAH's practical contribution lies in combining a topic modeling technique (BERTopic) for text classification, Whisper Small for audio transcription, and ResNet50 convolutional neural networks for visual analysis of the incident type. This approach enables a practical and scalable tool to support decision-making by the municipal Civil Protection and Defense agencies responsible for DRM, contributing to a more efficient response to emergency situations in Portuguese-speaking localities.
This work presents Lispector, a family of language models specialized in grammar and spelling review for Brazilian Portuguese. We compare two inference strategies for the grammatical text correction task with large language models (LLMs): (1) supervised fine-tuning and (2) few-shot prompting on larger-scale models. Using a dataset of 4,500 pairs of real user texts (2,500 records for training, 1,000 for evaluation, and 1,000 for testing), with references corrected by linguists, we analyze two Lispector variants of different parameter sizes. Evaluation employed the BLEU, GLEU, METEOR, and ROUGE metrics. The results show that smaller models subjected to supervised fine-tuning consistently outperform larger prompting-only models on all metrics, with Lispector small achieving substantial gains on text-similarity metrics such as GLEU (+12%) and BLEU (+13%). Beyond the performance gains, the fine-tuned models also behave more predictably and conservatively, desirable traits in industrial assisted-writing applications. Regarding latency, Lispector small obtained the lowest median response time among all models and the lowest P95 among the fine-tuned ones; Lispector large was also competitive. These findings indicate that, for specific text-review tasks in Brazilian Portuguese, fine-tuning can offer significant advantages in performance and computational efficiency.
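As a concrete illustration of the evaluation described above, the following sketch scores a hypothetical corrected sentence against a linguist reference with two of the cited metrics, BLEU and GLEU, via NLTK; the example pair, tokenization, and smoothing choice are assumptions, not the paper's exact setup.

```python
# Illustrative sentence-level scoring with two of the metrics cited above.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.gleu_score import sentence_gleu

reference = "Nós fomos à feira ontem .".split()   # linguist-corrected reference
hypothesis = "Nós fomos a feira ontem .".split()  # model output (invented)

smooth = SmoothingFunction().method1
print("BLEU:", sentence_bleu([reference], hypothesis, smoothing_function=smooth))
print("GLEU:", sentence_gleu([reference], hypothesis))
```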
Large Language Models (LLMs) are effective text generators but fabricate legal citations at non-trivial rates, a failure mode with serious consequences in legal practice. In Brazilian Portuguese, the risk is amplified by citation variability (juridiquês), fragment-level references (article → paragraph → item), and the need to distinguish jurisdictions and court instances. We describe a production Retrieval-Augmented Generation (RAG) system deployed at a Brazilian legal-technology platform. The system combines (1) domain-tuned hybrid retrieval (lexical, dense, and cross-encoder reranking) over a large-scale legal corpus; (2) grounded generation with explicit citation constraints; and (3) a post-generation Reference Audit layer that extracts legislation and jurisprudence mentions via specialized taggers, normalizes them to a canonical schema, checks existence against authoritative databases at fragment granularity, verifies fidelity against official texts, and triggers targeted rewrites when inconsistencies are detected. We report production telemetry from 184,895 audited answers containing 43,175 extracted legal references. Legislation references resolve at 81.7%, while jurisprudence references resolve at only 47.1%, identifying case-law normalization as the primary bottleneck for practitioners. Fidelity verification corrected 6.5% of checked answers before delivery, preventing misrepresented legal claims from reaching end users. By converting silent hallucinations into explicit warnings with per-reference status, the system enables legal professionals to trust verified citations and efficiently review flagged ones, rather than manually checking every authority.
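The sketch below illustrates, in heavily simplified form, the Reference Audit idea: extract a legislation mention, normalize it to a canonical key, and check it against an authoritative index at fragment granularity. The regex pattern, schema, and stub database are illustrative assumptions, not the production taggers described in the paper.

```python
# Highly simplified sketch of a reference audit: extract, normalize, verify.
import re

KNOWN_LAWS = {("8112", "1990"): {"art. 5"}}  # stub for an authoritative database

PATTERN = re.compile(
    r"art\.\s*(?P<artigo>\d+).{0,40}?Lei\s+n[ºo°]?\s*(?P<lei>[\d.]+)/(?P<ano>\d{4})",
    re.IGNORECASE,
)

def audit(answer: str):
    for m in PATTERN.finditer(answer):
        lei = m.group("lei").replace(".", "")       # canonical law number
        key = (lei, m.group("ano"))
        ref = f"art. {m.group('artigo')}"            # fragment-level reference
        exists = key in KNOWN_LAWS and ref in KNOWN_LAWS[key]
        yield {"lei": lei, "ano": m.group("ano"), "fragmento": ref,
               "status": "verified" if exists else "flagged"}

print(list(audit("Conforme o art. 5 da Lei nº 8.112/1990, o servidor ...")))
```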
This Ph.D. dissertation advances the state of the art in Natural Language Processing (NLP) for Portuguese by proposing new and innovative data resources and explainable methods for hate speech detection and automated fact-checking. The thesis introduces several benchmark datasets for Brazilian Portuguese (HateBR, HateBRXplain, HateBRMoralXplain, MFTCXplain, MOL, and FactNews), which have been widely adopted by the research community and address critical gaps in the availability of high-quality annotated resources for Portuguese. In addition, this dissertation proposes novel post-hoc and self-explaining NLP methods: Sentence-Level Factual Reasoning (SELFAR), Social Stereotype Analysis (SSA), Contextual Bag-of-Words with Interpretable Input and Feature Optimization (B+M), Supervised Rational Attention (SRA), and Supervised Moral Rational Attention (SMRA). Across multiple tasks and datasets in Portuguese, these methods outperform baselines while improving interpretability and robustness, demonstrating that explainability and performance can be jointly optimized. Finally, this thesis has achieved significant national and international impact, being cited by leading universities and research institutes worldwide and fostering new M.Sc. and Ph.D. research projects in Brazil. Its scientific and social contributions have also been recognized with multiple prestigious national and international awards, including the Google LARA award, the Maria Carolina Monard Best Thesis Award in Artificial Intelligence, the Trevisan Prize for Students “AI for Good” from Bocconi University for rigorous computer science research in AI with social impact, and the Diversity and Inclusion Award from the Association for Computational Linguistics (ACL). Lastly, this thesis has received two nominations for the Brazilian Computer Society Thesis Awards, in Computer Science and in Multimedia, Hypermedia, and Web.
Brazil’s ENEM, a high-stakes assessment determining university admission for millions of students annually, creates an immense evaluation burden where human raters process hundreds of essays daily. Automated Essay Scoring (AES) offers a potential solution, yet Portuguese-language systems remain understudied due to fragmented datasets and the complexity of ENEM’s multi-trait rubric. This work investigated cross-prompt, trait-specific essay scoring using a corpus of 385 essays across 38 prompts, where models evaluated essays on unseen prompts across five traits scored on a six-point ordinal scale. We compared three model classes: feature-based methods (72 features), encoder-only transformers (109M–1.5B parameters), and decoder architectures (2.4B–671B parameters) with fine-tuned and zero-shot configurations. Experiments under varying information access and rubric conditioning revealed that no single approach serves all evaluation needs: encoder models excel at mechanical traits (fluency, cohesion) despite context limitations; decoder models achieve superior performance on argumentation (QWK 0.73) and writing style (QWK 0.60) when provided full context; and language-specific pretraining benefits only surface-level features without improving complex reasoning. Best-performing models achieved QWK scores of 0.60–0.73. Gaps to oracle bounds ranged from 0.15 (argumentation) to 0.29 (writing style), with the largest disparities in writing style and persuasiveness.
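For readers unfamiliar with the metric, the QWK values reported above can be computed as quadratically weighted Cohen's kappa; below is a minimal sketch with invented six-point trait scores (that the paper used scikit-learn is an assumption).

```python
# Quadratically weighted kappa (QWK) on invented ordinal essay scores.
from sklearn.metrics import cohen_kappa_score

human = [0, 2, 4, 5, 3, 1, 2, 4]  # rater scores on a six-point scale
model = [0, 2, 3, 5, 3, 2, 2, 5]  # model predictions (invented)
print(cohen_kappa_score(human, model, weights="quadratic"))
```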
Gender-based violence (GBV) is a major public health issue, with the World Health Organization estimating that one in three women experiences physical or sexual violence by an intimate partner during her lifetime. In Brazil, although healthcare professionals are legally required to report such cases, underreporting remains significant due to difficulties in identifying abuse and limited integration between public information systems. This study investigates whether FrameNet-based semantic annotation of open-text fields in electronic medical records can support the identification of patterns of GBV. We compare the performance of an SVM classifier for GBV cases trained on (1) frame-annotated text, (2) annotated text combined with parameterized data, and (3) parameterized data alone. Quantitative and qualitative analyses show that models incorporating semantic annotation outperform categorical models, achieving over 0.3 improvement in F1 score and demonstrating that domain-specific semantic representations provide meaningful signals beyond structured demographic data. The findings support the hypothesis that semantic analysis of clinical narratives can enhance early identification strategies and support more informed public health interventions.
Asthma is a chronic respiratory disease that affects breathing and may also influence speech and voice production. In this paper, we examine whether short mobile-recorded Brazilian Portuguese voice and speech audio contain cues that can be used to distinguish individuals with asthma from those without asthma. We approach this problem using transfer learning with pretrained neural audio models based on convolutional architectures trained on large-scale audio datasets (PANNs). We evaluate two recording types: sustained vowel phonation and read speech. Models are trained for a binary classification task and evaluated at both the segment level and the patient level. Read speech performs better than sustained vowels. The best configuration (CNN14 on speech) achieves 0.85 patient-level balanced accuracy (accuracy 0.85) with ROC-AUC 0.93 and PR-AUC 0.98, performing comparably to CNN10. Training from scratch performs worse than fine-tuning a pretrained model, showing that pretraining helps when data is limited. Performance also varies across age groups, suggesting demographic sensitivity. These findings support the feasibility of audio-based asthma classification from voice and speech and motivate further investigation of pretrained audio models in biomedical applications.
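A minimal sketch of the patient-level evaluation step mentioned above, assuming segment probabilities are averaged per patient and thresholded at 0.5 (the paper's exact aggregation rule is not stated here); all numbers are invented.

```python
# Segment-to-patient aggregation followed by balanced accuracy.
import numpy as np
from sklearn.metrics import balanced_accuracy_score

segment_probs = {"p1": [0.9, 0.8, 0.7], "p2": [0.2, 0.4], "p3": [0.6, 0.3]}
true_label = {"p1": 1, "p2": 0, "p3": 0}  # 1 = asthma, 0 = control

patients = sorted(segment_probs)
pred = [int(np.mean(segment_probs[p]) >= 0.5) for p in patients]
gold = [true_label[p] for p in patients]
print(balanced_accuracy_score(gold, pred))
```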
The adoption of LLMs in hospital environments demands solutions that ensure information security, computational efficiency, and rigorous control over sensitive institutional data. This work presents the development and evaluation of a chatbot based on RAG, using exclusively local LLMs, applied to internal documents of a university hospital in Portuguese, composed of Standard Operating Procedures and technical manuals. The methodology initially evaluates the quality of information retrieval through dense embedding models, measured by the Mean Reciprocal Rank (MRR) metric. Then, the generation stage is analyzed in two distinct scenarios: (i) RAG with fixed context, in which multiple chunks are provided simultaneously to the model, and (ii) Incremental page retrieval, in which chunks are sent sequentially according to the retrieval ranking. The generation assessment was conducted with four local LLMs — MedGemma3:27B, Gemma3:27B, Gpt-oss:20B, and Mistral Small 3.1 — using BERTScore as a quality metric. The results indicate that indiscriminate context increase in the fixed-context scenario degrades generation quality, even while increasing the probability of recovering the relevant chunk. In contrast, the incremental page retrieval technique showed improvements in BERTScore values, with the MedGemma3:27B model standing out with the best overall results. These findings demonstrate that adaptive context control is a critical factor in increasing the reliability and efficiency of RAG systems based on local LLMs in the healthcare domain.
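The MRR metric used in the retrieval evaluation above has a compact definition: for each query, take the reciprocal of the rank of the first relevant chunk (zero if none is retrieved) and average over queries. The sketch below computes it on invented data.

```python
# Mean Reciprocal Rank over per-query relevance lists ordered by rank.
def mrr(ranked_relevance):
    # ranked_relevance: per query, booleans ordered by retrieval rank
    total = 0.0
    for ranks in ranked_relevance:
        for i, hit in enumerate(ranks, start=1):
            if hit:
                total += 1.0 / i
                break
    return total / len(ranked_relevance)

print(mrr([[False, True, False], [True], [False, False, False]]))  # (1/2 + 1 + 0) / 3 = 0.5
```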
Anaphylaxis is an acute, potentially life-threatening allergic reaction that requires rapid recognition in clinical settings. Natural language processing (NLP) approaches for automatic detection of anaphylaxis in clinical narratives can support large-scale analysis of health records and retrospective clinical research. However, such approaches depend on high-quality labeled corpora, and resources for Portuguese remain scarce. This paper introduces a corpus of Brazilian Portuguese clinical notes annotated by domain specialists for the presence or absence of anaphylaxis. The dataset comprises 969 clinical narratives drawn from three sources: clinician-authored synthetic clinical notes designed to represent realistic scenarios, case reports from the medical literature rewritten into note-like format by specialists, and a subset of de-identified notes from the publicly available SemClinBr corpus. All texts were reviewed and labeled by allergists using established clinical diagnostic criteria, and the corpus reflects realistic prevalence conditions, with approximately 5% positive cases. We describe the corpus design, data sources, annotation methodology, and composition, discuss potential research applications, and address ethical considerations. The corpus is intended as a reusable resource for Portuguese clinical NLP, supporting future work on document classification, information extraction, and language modeling in the medical domain.
Ensuring safety in clinical applications of large language models (LLMs) remains an unresolved challenge, particularly for high-risk and underrepresented conditions such as Sickle Cell Disease (SCD). Consequently, these models may exhibit limited reliability for SCD, including hallucinations and clinically unsafe outputs. This paper proposes an LLM-based Multi-Agent System (MAS) enhanced by Retrieval-Augmented Generation (RAG) to support the generation of medical care plans for SCD. The MAS decomposes clinical reasoning into specialized agents responsible for diagnosis, investigation, and treatment planning. Retrieval is framed not as a performance optimization, but as a safety control mechanism. Three RAG strategies, namely LLM-Guided Tree Retrieval, Metadata-Filtered Retrieval, and Semantic Similarity Retrieval, are evaluated alongside a baseline. Our experiments considered LLM-as-a-Judge evaluations and independent assessments by physicians. The results demonstrate high clinical quality, with safety scores exceeding 4 on a 5-point scale. While average performance was similar between the RAG and baseline conditions, the Tree Retrieval strategy reduced the frequency of clinically unsafe outputs compared to conventional Semantic Retrieval. These findings provide evidence that average performance is insufficient to evaluate clinical AI systems, particularly in high-risk scenarios where retrieval serves as a safety control layer.
The evaluation of Large Language Models (LLMs) in medicine has predominantly relied on English-language benchmarks aligned with North American clinical guidelines, limiting their applicability to other healthcare systems. In this paper, we evaluate twenty-two proprietary and open-weight LLMs on the 2025 National Examination for the Evaluation of Medical Training (ENAMED), a high-stakes, government-standardized assessment used to evaluate medical graduates in Brazil. The benchmark comprises 90 multiple-choice questions grounded in Brazilian public health policy, clinical practice, and Portuguese medical terminology, and is released as an open dataset. Model performance is measured using both standard accuracy and the official Item Response Theory (IRT) framework employed by ENAMED, enabling direct comparison with human proficiency thresholds. Results reveal a clear stratification of model capabilities: proprietary frontier models achieve the highest performance, whereas many open-weight and smaller domain-adapted models fail to meet the minimum proficiency criterion. Across comparable scales, large generalist models consistently outperform specialized medical fine-tunes, suggesting that general reasoning capacity is a stronger predictor of success than narrow domain adaptation in this setting. These findings establish ENAMED as a rigorous benchmark for evaluating medical LLMs in Portuguese and highlight both the potential and current limitations of such models for educational assessment.
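ENAMED scoring relies on Item Response Theory; the sketch below shows the three-parameter logistic (3PL) item response function commonly used in Brazilian IRT-scored exams. That ENAMED uses exactly this parameterization is an assumption made here for illustration.

```python
# 3PL model: probability that an examinee of ability theta answers an item
# correctly, given discrimination a, difficulty b, and guessing parameter c.
import math

def p_correct(theta: float, a: float, b: float, c: float) -> float:
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

print(p_correct(theta=1.0, a=1.2, b=0.5, c=0.2))  # illustrative parameters
```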
Retrieval-Augmented Generation (RAG) is proposed to reduce hallucination and improve grounding in clinical language models, yet its effectiveness across different levels of clinical reasoning remains unclear. We conducted a controlled evaluation of medication-related question answering in Portuguese using over 7,000 Brazilian regulatory drug leaflets and a complementary clinical benchmark derived from national medical licensing examinations (Revalida and Fuvest). Retrieval substantially improved factual recall and clinical coherence in medication-specific queries, increasing F1 from 0.276 to 0.412. However, naive retrieval did not consistently improve complex clinical reasoning and sometimes reduced accuracy compared to a parametric-only baseline. We identify retrieval-induced anchoring bias, where partially relevant evidence shifts model decisions toward clinically incorrect conclusions. Critique-based and adaptive retrieval mitigated this effect and achieved the highest clinical benchmark accuracy (54.25%). Clinically grounded evaluation dimensions revealed safety-relevant differences beyond traditional NLP metrics. These results show that retrieval augmentation is effective in regulatory settings but requires adaptive control for higher-level clinical reasoning.
While most essential medicines have become widely accessible across all social strata in Brazil due to government initiatives and market shifts, a significant barrier remains: the technical complexity of medication leaflets. This pragmatic and linguistic gap hinders patient comprehension of critical risks and benefits. Thus, adapting these texts into plain language patterns is crucial for patient safety and treatment adherence. Large language models have been increasingly effective as practical solutions for text simplification, an important Natural Language Processing (NLP) task that serves as a basis for several other linguistic and computational tasks. However, the scarcity of annotated datasets remains a bottleneck for rigorous evaluation. To bridge this gap, we propose a streamlined pipeline for generating simplified medical leaflets and introduce an initial benchmark dataset of 30 expertly annotated samples. Our results, supported by semantic and morphosyntactic evaluations, demonstrate that the proposed method produces high-quality, simplified content suitable for health applications.
Clinical NLP for Brazilian Portuguese remains limited by the lack of semantically structured resources that support interoperability and downstream health applications. Although existing corpora provide annotated clinical narratives, their flat annotation schemes restrict semantic expressiveness and alignment with standardized terminologies. In this work, we present a lightweight domain ontology that models clinical entities, contextual qualifiers, and semantic relations in Brazilian Portuguese texts. The ontology is derived from the original corpus annotations and conceptually aligned with standards to enhance interoperability while preserving corpus-specific semantics. This work establishes foundational infrastructure for Portuguese clinical NLP, supporting tasks such as entity normalization, semantic search, and ontology-guided annotation.
Depressive symptomatology may be reflected in the language used by possible depressive profiles (PDP). This paper investigates to what extent symptoms of depression are manifested in Brazilian Portuguese narrative texts, and whether these can be used to identify relevant linguistic clues related to PDP. Moreover, the relation between these symptoms and PDP is explored, characterising the lexical, syntactic, and psycholinguistic aspects of texts produced by PDP. We found that texts associated with PDPs differed in some of these characteristics from non-PDP texts. The interactions between symptoms and PDP can also shed light on patterns of communication differentiation and the relationship between them. The results of this paper can help to characterise and understand the indicators that can be used to train more bespoke and accurate large language models.
Fake news is a major problem for society. With generative Artificial Intelligence, machine-produced fake news has proliferated, making the scenario even more challenging. Despite the relevance of this problem, in under-represented languages such as Portuguese, research seeking to differentiate human-written from machine-generated fake news is still incipient. To fill this gap, this paper explores the Fake.br and FakeTrueBR corpora, expanded with automatically generated fake news, characterizing lexically and syntactically the fake news produced by humans and by machines. The results show that machine-generated texts feature significantly longer words, greater use of adjectival modifiers, and lower syntactic diversity, despite using more syntactic rules per sentence. In contrast, human texts exhibit greater stylistic variability across all analyzed dimensions.
This work investigates symbolic methods for emotion detection in Portuguese texts, considering multiple corpora, domains, and different preprocessing configurations. The results show large variation in absolute performance across domains but stability in the relative performance of the methods, highlighting the influence of corpus properties and the trade-off between complexity and interpretability. Including the neutral class tends to degrade performance by increasing ambiguity and, frequently, class imbalance, while more extensive preprocessing especially benefits symbolic approaches. Qualitative analysis indicates that part of the errors stems from linguistic ambiguities, the large room for subjectivity in the annotation process, and emotional nuances themselves, reinforcing the importance of multi-domain comparative evaluations.
Prosodic segmentation is the task of dividing a sound unit into smaller units, distinguishing units that convey a completed idea, marked by terminal breaks (TBs), from non-autonomous units, marked by non-terminal breaks (NTBs). It is a useful task for enhancing the performance of ASR and TTS systems, and it remains relevant for Brazilian Portuguese due to the diversity of conditions and speaker-related factors that influence its performance. Here, we explore a low-impact, open-source approach based on a Random Forest classifier and a set of features that include fundamental frequency, speech rate, pauses, and energy (Craveiro et al., 2025). We perform a robustness evaluation of this ML model, modifying a few conditions of its training, comparing its performance when tested on other datasets, and comparing its results with those of other studies using the same data samples. We experiment with augmenting the training dataset and evaluating how bias related to speaker profile is affected when the size and diversity of the training set change. Although we do not obtain statistically significant values in the bias evaluation, we observe that inequalities grow as the training dataset is expanded with a much larger, but less diverse, sample of data.
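A minimal sketch of the classifier setup named above: a Random Forest over the listed prosodic feature types (fundamental frequency, speech rate, pause, energy) predicting TB vs. NTB. Feature values and labels are invented, and the feature engineering of Craveiro et al. (2025) is not reproduced here.

```python
# Toy Random Forest over prosodic features for boundary classification.
from sklearn.ensemble import RandomForestClassifier

# columns: [f0_mean, speech_rate, pause_duration, energy] (invented values)
X = [[180.0, 4.2, 0.60, 0.7],
     [210.0, 5.1, 0.05, 0.9],
     [170.0, 3.8, 0.80, 0.5],
     [205.0, 5.0, 0.02, 0.8]]
y = ["TB", "NTB", "TB", "NTB"]  # terminal vs. non-terminal break

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([[190.0, 4.0, 0.50, 0.6]]))
```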
Approaches based solely on textual representations have limitations in capturing structural relations between legal entities, particularly in documents with high lexical similarity. This paper presents ongoing work on a dynamic clustering system for judicial decisions that integrates hybrid representations, combining semantic embeddings from legal-domain Portuguese models with knowledge graphs automatically constructed from documents. The architecture supports incremental clustering and generates cluster justifications using Large Language Models grounded on knowledge graph relations. Preliminary evaluation relies on the quantitative metrics Silhouette Score, Davies-Bouldin Index, and Calinski-Harabasz Index.
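The three internal clustering metrics cited above are all available in scikit-learn; below is a toy sketch with random vectors standing in for the legal-domain document embeddings.

```python
# Internal clustering metrics on a toy embedding matrix.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

X = np.random.RandomState(0).rand(40, 8)  # stand-in for document embeddings
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("Silhouette:", silhouette_score(X, labels))
print("Davies-Bouldin:", davies_bouldin_score(X, labels))
print("Calinski-Harabasz:", calinski_harabasz_score(X, labels))
```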
The need for tools that assist in case management, automating tasks and mitigating delays in the judicial system, justifies improving traditional Information Retrieval systems, which are often limited by vocabulary mismatch and the length of legal texts. Although Transformer-based models capture semantic particularities, they face input-size constraints that make it difficult to process long texts without losing information. In this work, we propose a hybrid system applied to the legal domain, combining the BM25L algorithm and the BumbaLM language model.
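A hedged sketch of the hybrid scoring idea: BM25L lexical scores (here via the rank_bm25 package) fused with dense similarity scores by min-max normalization. The placeholder dense scores, the fusion weight, and the use of rank_bm25 are assumptions; the paper's BumbaLM integration may differ.

```python
# Hybrid lexical + dense scoring with a simple normalized fusion.
import numpy as np
from rank_bm25 import BM25L

docs = [["acórdão", "sobre", "dano", "moral"],
        ["sentença", "de", "execução", "fiscal"]]
bm25 = BM25L(docs)
lexical = np.array(bm25.get_scores(["dano", "moral"]))
dense = np.array([0.82, 0.15])  # placeholder cosine similarities from an LM

def norm(s):
    return (s - s.min()) / (s.max() - s.min() + 1e-9)

hybrid = 0.5 * norm(lexical) + 0.5 * norm(dense)  # arbitrary equal weighting
print(hybrid.argsort()[::-1])  # document ranking, best first
```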
The growing use of Large Language Models (LLMs) has amplified concerns about social bias and algorithmic fairness. This work presents a Systematic Literature Review of 60 studies published between 2020 and 2025, analyzing mitigation strategies, evaluation metrics, types of discrimination, and the languages considered. The results indicate a strong predominance of evaluations in English, a disproportionate focus on gender bias treated in binary terms, and greater emphasis on diagnosis than on mitigation. There is also a scarcity of intersectional, multilingual, and real-world-usage-oriented analyses, revealing methodological and sociocultural gaps in the current literature.
Large language models (LLMs) are increasingly used for Natural Language Inference (NLI), yet their ability to perform logic-sensitive semantic reasoning, especially outside English, remains underexplored. This paper presents a preliminary investigation into the feasibility and usefulness of developing FraCaS-BR, a Portuguese adaptation of the FraCaS benchmark for semantic inference. Using a small diagnostic subset of seven FraCaS problems focusing on generalized quantifiers, plurals, and nominal anaphora, we evaluate the behavior of three LLMs (ChatGPT, Maritalk, and Evaristo) on Brazilian Portuguese translations. Each problem is submitted multiple times to assess correctness, variance, and consistency relative to the original FraCaS gold labels. The results reveal systematic differences across models. While ChatGPT shows higher overall correctness and stability, all models exhibit limitations that undermine their reliability on logic-controlled inference tasks. The extent of manual correction required during translation further underscores the necessity of human-in-the-loop evaluation. Taken together, these findings support and motivate the development of FraCaS-BR as a controlled evaluation resource for assessing semantic reasoning in Portuguese.
This paper evaluates the impact of expanding the UD_Nheengatu-CompLin treebank on parsing performance for Nheengatu, a Brazilian endangered Indigenous language. We hypothesized that the inclusion of annotated data would result in a 10% improvement in the Labeled Attachment Score (LAS). To test this hypothesis, we conducted a 10-fold cross-validation experiment using UDPipe 1.4 under two conditions: parsing with gold tokenization and gold tags, and automatic parsing from raw text. Statistical significance was determined using the Mann-Whitney U test. Although the expected gain was not achieved, the results show improvements in parsing accuracy and reduced variance across folds. The findings highlight the importance of corpus expansion and standardized annotation workflows for improving parsing performance in low-resource language scenarios and for supporting reproducible evaluation methods in the computational modeling of minority languages.
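The significance test named above can be reproduced in outline as follows, with invented per-fold LAS values for the two corpus conditions (the paper's actual scores are not shown here).

```python
# Mann-Whitney U over per-fold LAS from two treebank conditions.
from scipy.stats import mannwhitneyu

las_before = [62.1, 60.4, 63.0, 61.7, 59.8, 62.5, 60.9, 61.2, 63.4, 60.1]
las_after  = [64.0, 62.8, 65.1, 63.3, 61.9, 64.4, 62.2, 63.0, 65.6, 62.4]
stat, p = mannwhitneyu(las_after, las_before, alternative="greater")
print(stat, p)  # one-sided test: does expansion improve LAS?
```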
Automated translation systems exhibit a tendency toward cultural drift when processing non-literal language, often favoring standardized outputs that diverge from the original pragmatic intent. Although Large Language Models (LLMs) have introduced more sophisticated context-handling capabilities, the transition from literal decoding to effective cultural adaptation remains inconsistent. This study investigates these linguistic detours by evaluating ChatGPT-4o, Gemini 1.5 Pro, and Google Translate using a corpus of 100 Brazilian Portuguese expressions. To ensure contemporary relevance, the expressions were validated through the Corpus Carolina and categorized into four groups: classical idioms, regionalisms, metaphors, and intensifiers. Translation quality was assessed using the Multidimensional Quality Metrics (MQM) framework, focusing on adequacy, fluency, and cultural adaptation. The analysis reveals that, even when grammatical accuracy is achieved, automated systems frequently overlook the socio-cultural weight embedded in the source language. Such semantic shifts pose significant challenges in high-stakes professional communication, where nuanced mediation is essential. The findings underscore the limitations of current AI systems in cultural competence and reinforce the ongoing necessity of human intervention to bridge the gap between algorithmic processing and regional identity.
We present an ongoing research project focused on the construction of a Universal Dependencies (UD) corpus of Portuguese epidemiological reports derived from documents published within the Brazilian public health system. We describe findings and challenges to build such a corpus from PDF reports processed through a controlled document extraction pipeline that contrasts layout-aware extraction with raw PDF text extraction, explicitly addressing the impact of tabular content on downstream syntactic analysis. Narrative text is annotated using multiple UD parsers for Portuguese, including widely used and state-of-the-art tools, and their outputs are systematically compared using descriptive structural indicators and targeted qualitative inspection. Our analysis highlights domain-specific challenges in epidemiological texts and shows that document extraction and representation choices have a stronger effect on parsing behavior than parser selection alone. Based on these findings, we identify robust preprocessing configurations and discuss design choices for a UD-epidemiological corpus to support future research on syntactic parsing, domain adaptation, and downstream natural language processing tasks in epidemiology and public health.
We study gender-associated stylistic variation in Brazilian Portuguese Google Play reviews. Using IBGE name frequencies, we infer binary gender from first names in 76.7M reviews (96 apps, 2011–2025), obtaining 22.25M high-confidence labels. Women-associated reviews show markedly higher paralinguistic expressivity (about 60% higher emoji density and more lengthening/punctuation), while lexical diversity (MTLD) is nearly identical across groups. Ratings are mostly positive, with men contributing relatively more 1-star reviews and women more 5-star reviews. These findings contribute to a deeper understanding of digital sociolinguistic behavior within the Brazilian context. We discuss limitations of name-based gender inference and future demographic extensions.
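A minimal sketch of name-based gender inference as described above: look up the first name in IBGE-style frequency counts and keep only labels above a confidence threshold. The counts and the 0.95 threshold are illustrative assumptions, not the paper's values.

```python
# High-confidence gender labeling from first-name frequency counts.
NAME_FREQ = {"maria": {"F": 11_600_000, "M": 30_000},   # invented counts
             "joão": {"F": 15_000, "M": 5_900_000}}

def infer_gender(full_name: str, threshold: float = 0.95):
    first = full_name.split()[0].lower()
    counts = NAME_FREQ.get(first)
    if not counts:
        return None  # unknown name: leave unlabeled
    total = counts["F"] + counts["M"]
    for g in ("F", "M"):
        if counts[g] / total >= threshold:
            return g
    return None  # ambiguous name: excluded from the high-confidence set

print(infer_gender("Maria Souza"), infer_gender("João Lima"))
```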
Coreference resolution is a crucial task in natural language processing (NLP) that aims to identify and link expressions in a text that refer to the same entity. However, the lack of annotated data for coreference resolution in Portuguese has hindered the development of robust and accurate systems for this language. In this paper, we present an assessment of coreference annotation using large language models (LLMs) for Portuguese: we propose LLM-PREF to annotate coreference in Portuguese texts, and evaluate it against a system previously proposed in the literature. The results show that, although the model’s world knowledge and inference capacity are quite rich, allowing it to recognize complex coreference patterns, including pronominal anaphora, it does not surpass the previously developed rule-based system.
Digital trace data have expanded empirical opportunities in the social sciences while intensifying the methodological challenge of scale: researchers increasingly face corpora too large and fast-moving to read exhaustively without sacrificing interpretive rigor. This article presents Social-RAG, a modular Retrieval-Augmented Generation (RAG) architecture designed to support scalable qualitative inquiry over large text corpora while preserving evidence traceability, auditability, and researcher control. Our empirical basis consists of messages from public Telegram groups and channels, organized into two thematic subsets: vaccine-related discourse and debates surrounding Brazil’s Lei Rouanet cultural funding policy. We detail key design decisions, including a “one post = one chunk” indexing strategy, semantic retrieval over vector embeddings with efficient ANN search, an Adaptive-K dynamic cutoff for context selection, MMR re-ranking for diversity, and structured analytical instructions that constrain generation to retrieved evidence. We evaluate system behavior using two complementary question blocks, hermeneutic (narrative) and factual, and compare outputs across three language models with distinct deployment profiles (a local open-weight model, a cloud open-weight model, and a commercial closed model), using an LLM-as-judge protocol with explicit qualitative criteria. Results show consistent behavior across both thematic corpora and highlight a key trade-off: the two larger/closed models perform similarly and robustly in both narrative and factual tasks when evidential discipline is maintained, whereas the smaller local model remains useful for exploratory narrative synthesis but is less reliable for strict factual extraction and attribution. We conclude by discussing methodological implications, limitations, and future directions, with a focus on scalability and extensibility to new data types and analytical problems.
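Of the design decisions listed above, MMR re-ranking is easy to sketch: iteratively select the chunk that maximizes a trade-off between query relevance and redundancy with already-selected chunks. The lambda weight and cosine similarity below are assumptions; Social-RAG's exact parameters are not specified here.

```python
# Compact Maximal Marginal Relevance (MMR) re-ranker over chunk embeddings.
import numpy as np

def mmr(query_vec, doc_vecs, k, lambda_=0.7):
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        best = max(
            candidates,
            key=lambda i: lambda_ * cos(query_vec, doc_vecs[i])
            - (1 - lambda_) * max((cos(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0),
        )
        selected.append(best)
        candidates.remove(best)
    return selected  # indices of chosen chunks, relevance-diverse order

rng = np.random.RandomState(0)
print(mmr(rng.rand(16), rng.rand(10, 16), k=3))
```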
In this article we briefly describe the ReadingFood project on the semantic field of food and drink in literature, which intends to compare works from four countries in the period 1840-1920, though here we restrict ourselves to Portugal and Brazil. After presenting the infrastructures already developed, which make research in this domain public, we present the work done so far and some preliminary studies: the creation of a taxonomy of the domain in literature, disambiguation in context, and the study of meals (or events related to food and drink).
This article addresses tasks preceding the computational processing of 18th-century historical sources in Portuguese. The work focused on highly specialized domains: fauna and flora. Given this characteristic, a low level of lexical ambiguity was expected, but that was not the case. We therefore present a roadmap of the orthographic normalization process; describe the construction of the annotated Named Entity corpus; and, above all, discuss problems related to lexical variation in these specialized thesauri and the constraints of the process. In this way, we aim to contribute to the reflection on what the normalization of historical sources entails and to draw attention to the importance of good practices in this setting.
This paper analyzes the performance of several terminology extraction methods when confronted with historical specialized texts that do not conform to modern orthographical norms. We tested two extraction methods based on linguistic patterns, four prompt-based generative artificial intelligence (GenAI) models, and one BERT-like model. Some of these models went through fine-tuning for terminology extraction, and one of them is specialized in the extraction of medical terms from documents written in Portuguese. For the GenAI models, we tested four different prompting strategies. As the test set, we used chapter fifteen of the second part of the book Aviso à Gente do Mar sobre a sua Saude [Advice to Sea People about their Health], originally written in French by G. Mauran at the end of the 18th century and translated and adapted into Portuguese in 1794. The chapter was annotated with terminology, and the evaluation was conducted independently both in terms of f-measure and in terms of pure precision, to observe whether the automatic extraction methods could complement the manual token-based annotation. Results show that using automatic extraction methods to complement the manual annotation can improve coverage, and that, while individual models do not achieve high extraction quality, combining two or more models can reach a recall of more than 90% on the test data.
This article presents the semantic modeling of named entities in Os Lusíadas, by Luís de Camões, based on the TEI P5 standard. We propose a hybrid annotation workflow combining NER (spaCy), an authority dictionary (gazetteer), and manual philological post-editing. Anthroponyms, mythonyms, and toponyms are typified through the elements <persName> (person name), <placeName> (place name), and <rs> (referencing string, for reference chains), with special attention to the markup of epithets. The study highlights the limits of models trained on journalistic corpora when faced with epic syntax and the orthography of the 1572 edition, demonstrating the need for a hybrid approach. We conclude that XML/TEI serves as a tool for modeling literary knowledge.
This article presents the Lusíadas Digital project, which proposes the development of a virtual philological edition of Os Lusíadas by Luís de Camões, integrating principles of textual criticism, Digital Humanities, and Natural Language Processing (NLP). The project aims to develop a digital platform bringing together facsimiles of the 1572 editions, diplomatic and modernized transcriptions, a dynamic critical apparatus, a lexical glossary with etymological information, historical and literary commentary, and translations aligned with the original text. The methodology combines traditional philological practices with XML-TEI text encoding, OCR techniques, automatic lemmatization, version alignment, and lexical mining. Initially focused on Canto I, the project seeks to establish a scalable and replicable model for the remaining cantos of the work. By proposing an open, interoperable, and data-oriented digital infrastructure, the initiative contributes to the advancement of e-Philology in Brazil and to the development of technologies applied to the digital critical editing of manuscripts and early printed editions.
Recorded interviews can capture their subjects’ memories, perceptions, and emotions. When conducted with notable figures, they also have the potential to serve as a resource for interdisciplinary research, impacting various branches of science. In this work, we mark the beginning of a significant project analyzing interviews from the Roda Viva program, the longest-running interview show on Brazilian television. In this initial study, we examined six memorable interviews with six Brazilian Formula One drivers to compare the performance of two named entity recognition methods: a statistical-neural method and large language models, both evaluated against manual annotations. Beyond the quantitative comparison, the study highlighted relevant qualitative distinctions: the statistical method showed a rigid dependence on capitalisation and lexical familiarity, leading to mechanical false positives and missed non-capitalised entities, while the LLM exhibited greater linguistic sensitivity, retrieving contextual entities and proving robust to transcription errors, though it still produced false positives. The LLM-based model appears more promising due to its flexibility and the potential for refinement via instructions to filter out ambiguities, favouring the automation of social network extraction in the corpus.
Uniform Meaning Representation (UMR) is a cross-linguistic semantic representation framework designed to encode sentence meaning in a structured and interpretable way. Building on the foundations of Abstract Meaning Representation (AMR), UMR extends semantic coverage to events, participants, semantic roles, temporal/aspectual information, modality, and discourse links. It is language-agnostic and therefore suitable for multilingual exploration. This tutorial provides a beginner’s introduction to UMR aimed at an audience with no prior experience with AMR, UMR, or meaning representations. The tutorial begins with a simple introduction to the essentials of Universal Dependencies (UD) needed to understand how UMR graphs can be constructed from syntactic information. Using simple Portuguese examples, the tutorial illustrates how basic UD structures guide the creation of UMR graphs. Participants will leave with a foundational understanding of what UMR is, how it relates to syntax and semantic roles, how to create minimal UMR graphs, and how Portuguese UD treebanks can support UMR annotation.