Nádia Félix Felipe Da Silva

Also published as: Nadia Félix Felipe da Silva, Nadia Felix Felipe da Silva, Nádia Félix Felipe da Silva, Nádia Félix Felipe da Silva

2026

pdf bib abs

LexIris-pt and LexBert-pt: Specialized Sentence Embeddings for Legal Similarity in Brazilian Portuguese
Willgnner Ferreira Santos | João Gabriel Grandotto Viana | Antônio Pires de Castro Júnior | Fernando Ribeiro Trindade | Nádia Félix Felipe da Silva
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1

This work presents and evaluates two specialized sentence embedding models for the Portuguese legal domain, LexIris-pt and LexBert-pt, obtained through supervised fine-tuning of BERT-based models using pairs of initial petitions. We propose a comparative evaluation protocol along three fronts: (i) zero-shot inference with pretrained embeddings, (ii) supervised fine-tuning on these pairs, and (iii) vector retrieval with incremental clustering over a corpus of 20,000 initial petitions. The results show that fine-tuning consistently increases correlations with reference scores and improves performance in vector retrieval; additionally, the vector retrieval stage indicates that the metric configured in the index (cosine similarity or inner product) can change the granularity of the partitioning under a fixed threshold, reinforcing the need for joint calibration among the encoder, metric and threshold. After auditing by specialists from the partner institution, LexIris-pt and LexBert-pt were operationally adopted to support the screening and organization of repetitive claims and predatory litigation.

pdf bib abs

BIPA: Brazilian Portuguese Phonetic Dataset with Dialectal Variations in IPA Standard
Thiago Monteles de Sousa | Lucas Rafael Gris | Nádia Félix Felipe da Silva
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1

This work presents BIPA, a phonetic transcription corpus for Brazilian Portuguese that covers regional dialectal variations. The corpus was constructed through automated extraction from Wiktionary, resulting in 53,353 unique words and 350,021 transcriptions in IPA format, distributed across six dialects: general Brazilian, Rio de Janeiro, São Paulo, South Region, Northeast Region, and Center-West Region. The average density of 6.56 transcriptions per word reflects multiple regionally conditioned phonetic variations. To validate the utility of the corpus, the ByT5-small model was fine-tuned for grapheme-to-phoneme conversion, achieving a Minimum Phoneme Error Rate of 2.66% on the validation set. BIPA addresses the scarcity of computational linguistic resources for Brazilian Portuguese, enabling applications in regional speech synthesis, automatic accent recognition, and computational sociolinguistic analysis.

2025

pdf bib abs

Labor Lex: A New Portuguese Corpus and Pipeline for Information Extraction in Brazilian Legal Texts
Pedro Vitor Quinta de Castro | Nádia Félix Felipe Da Silva
Proceedings of the Natural Legal Language Processing Workshop 2025

Relation Extraction (RE) is a challenging Natural Language Processing task that involves identifying named entities from text and classifying the relationships between them. When applied to a specific domain, the task acquires a new layer of complexity, handling the lexicon and context particular to the domain in question. In this work, this task is applied to the Legal domain, specifically targeting Brazilian Labor Law. Architectures based on Deep Learning, with word representations derived from Transformer Language Models (LM), have shown state-of-the-art performance for the RE task. Recent works on this task handle Named Entity Recognition (NER) and RE either as a single joint model or as a pipelined approach. In this work, we introduce Labor Lex, a newly constructed corpus based on public documents from Brazilian Labor Courts. We also present a pipeline of models trained on it. Different experiments are conducted for each task, comparing supervised training using LMs and In-Context Learning (ICL) with Large Language Models (LLM), and verifying and analyzing the results for each one. For the NER task, the best achieved result was 89.97% F1-Score, and for the RE task, the best result was 82.38% F1-Score. The best results for both tasks were obtained using the supervised training approach.

2024

pdf bib

Enhancing Stance Detection in Low-Resource Brazilian Portuguese Using Corpus Expansion generated by GPT-3.5
Dyonnatan Maia | Nádia Félix Felipe da Silva
Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 1

pdf bib

Natural Language Processing Application in Legislative Activity: a Case Study of Similar Amendments in the Brazilian Senate
Diany Pressato | Pedro Lucas Castro de Andrade | Flávio Rocha Junior | Felipe Alves Siqueira | Ellen Polliana Ramos Souza | Nádia Félix Felipe da Silva | Márcio de Souza Dias | André Carlos Ponce de Leon Ferreira de Carvalho
Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 1

2023

pdf bib

pdf bib abs

DeepLearningBrasil@LT-EDI-2023: Exploring Deep Learning Techniques for Detecting Depression in Social Media Text
Eduardo Garcia | Juliana Gomes | Adalberto Ferreira Barbosa Junior | Cardeque Henrique Bittes de Alvarenga Borges | Nadia Félix Felipe da Silva
Proceedings of the Third Workshop on Language Technology for Equality, Diversity and Inclusion

In this paper, we delineate the strategy employed by our team, DeepLearningBrasil, which secured us the first place in the shared task DepSign-LT-EDI@RANLP-2023 with the advantage of 2.4%. The task was to classify social media texts into three distinct levels of depression - “not depressed,” “moderately depressed,” and “severely depressed.” Leveraging the power of the RoBERTa and DeBERTa models, we further pre-trained them on a collected Reddit dataset, specifically curated from mental health-related Reddit’s communities (Subreddits), leading to an enhanced understanding of nuanced mental health discourse. To address lengthy textual data, we introduced truncation techniques that retained the essence of the content by focusing on its beginnings and endings. Our model was robust against unbalanced data by incorporating sample weights into the loss. Cross-validation and ensemble techniques were then employed to combine our k-fold trained models, delivering an optimal solution. The accompanying code is made available for transparency and further development.