Hidelberg O. Albuquerque

Also published as: Hidelberg O. Albuquerque

2026

UlyssesLegalNER-Br: from Legislative to Legal, a comprehensive corpus of Brazilian legal documents for Named Entity Recognition
Hidelberg O. Albuquerque | Ellen Souza | Danilo C. G. Lucena | Héldon J. O. Albuquerque | Nádia F. F. da Silva | Márcio de S. Dias | Rafael O. Nunes | Adriano L. I. Oliveira | André C. P. L. F. de Carvalho
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1

The legal domain presents several challenges for Natural Language Processing (NLP), particularly due to its linguistic complexity and lack of public datasets. Named Entity Recognition (NER), a subarea of NLP, has been successfully used to extract useful knowledge from legal texts. Its widespread use is limited by the lack of legal text corpora. This paper introduces UlyssesLegalNER-Br, a comprehensive corpus of Brazilian legal documents for NER, covering bills, case laws and laws, including the first NER corpus based exclusively on Brazilian laws. This research expand the UlyssesNER-Br corpus, previously focused only on the Brazilian legislative domain. The proposed corpus has 560 public documents annotated using a hybrid approach, organized in 9 categories and 23 fine-grained types, experimentally evaluated with the CRF, BiLSTM, and BERTimbau architectures. The corpus was experimentally evaluated regarding predictive performance, computational cost and label-level results. The best micro F1 96.18% was achieved by BERTimbau on the unified corpus, providing a strong baseline for Brazilian legal NER. At the label level, six categories and seven types presented a F1-score above 95%, while the lowest were distributed in the interval 71-82%.

The PROPOR conference has been the main venue for Portuguese language Natural Language Processing (NLP) research for over two decades. This paper presents a longitudinal bibliometric analysis of PROPOR from 2003 to 2024, examining thematic evolution, community structure, and scientific impact. We identify a shift from speech-oriented research toward text-based tasks, alongside the sustained importance of resources and linguistic theory. The community exhibits a stable structure, with complementary leadership models centered on institutional hubs and brokerage roles. Scientific impact is highly concentrated, following a long tail distribution, and distinguishes between cumulative productivity-driven impact and rapidly accelerating citation uptake in recent editions. These findings characterize PROPOR as a resilient regional linguistic ecosystem evolving in dialogue with broader NLP paradigms.

2024

pdf bib

pdf bib