Vinícius Vanzin
Also published as: Vinicius Vanzin
2026
A Dataset of Brazilian Portuguese Clinical Notes for Anaphylaxis Detection
Matheus Machado | Vinícius Vanzin | Dilvan Moreira | Luis Felipe Ensina | Fábio Lario
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 2
Anaphylaxis is an acute, potentially life-threatening allergic reaction that requires rapid recognition in clinical settings. Natural language processing (NLP) approaches for automatic detection of anaphylaxis in clinical narratives can support large-scale analysis of health records and retrospective clinical research. However, such approaches depend on high-quality labeled corpora, and resources for Portuguese remain scarce. This paper introduces a corpus of Brazilian Portuguese clinical notes annotated by domain specialists for the presence or absence of anaphylaxis. The dataset comprises 969 clinical narratives drawn from three sources: clinician-authored synthetic clinical notes designed to represent realistic scenarios, case reports from the medical literature rewritten into note-like format by specialists, and a subset of de-identified notes from the publicly available SemClinBr corpus. All texts were reviewed and labeled by allergists using established clinical diagnostic criteria, and the corpus reflects realistic prevalence conditions, with approximately 5% positive cases. We describe the corpus design, data sources, annotation methodology, and composition, discuss potential research applications, and address ethical considerations. The corpus is intended as a reusable resource for Portuguese clinical NLP, supporting future work on document classification, information extraction, and language modeling in the medical domain.