A Dataset of Brazilian Portuguese Clinical Notes for Anaphylaxis Detection

Matheus Machado, Vinícius Vanzin, Dilvan Moreira, Luis Felipe Ensina, Fábio Lario


Abstract
Anaphylaxis is an acute, potentially life-threatening allergic reaction that requires rapid recognition in clinical settings. Natural language processing (NLP) approaches for automatic detection of anaphylaxis in clinical narratives can support large-scale analysis of health records and retrospective clinical research. However, such approaches depend on high-quality labeled corpora, and resources for Portuguese remain scarce. This paper introduces a corpus of Brazilian Portuguese clinical notes annotated by domain specialists for the presence or absence of anaphylaxis. The dataset comprises 969 clinical narratives drawn from three sources: clinician-authored synthetic clinical notes designed to represent realistic scenarios, case reports from the medical literature rewritten into note-like format by specialists, and a subset of de-identified notes from the publicly available SemClinBr corpus. All texts were reviewed and labeled by allergists using established clinical diagnostic criteria, and the corpus reflects realistic prevalence conditions, with approximately 5% positive cases. We describe the corpus design, data sources, annotation methodology, and composition, discuss potential research applications, and address ethical considerations. The corpus is intended as a reusable resource for Portuguese clinical NLP, supporting future work on document classification, information extraction, and language modeling in the medical domain.
Anthology ID:
2026.propor-2.15
Volume:
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 2
Month:
April
Year:
2026
Address:
Salvador, Brazil
Editors:
Marlo Souza, Iria de-Dios-Flores, Diana Santos, Larissa Freitas, Jackson Wilke da Cruz Souza, Eugénio Ribeiro
Venue:
PROPOR
Publisher:
Association for Computational Linguistics
Pages:
78–87
URL:
https://aclanthology.org/2026.propor-2.15/
Cite (ACL):
Matheus Machado, Vinícius Vanzin, Dilvan Moreira, Luis Felipe Ensina, and Fábio Lario. 2026. A Dataset of Brazilian Portuguese Clinical Notes for Anaphylaxis Detection. In Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 2, pages 78–87, Salvador, Brazil. Association for Computational Linguistics.
Cite (Informal):
A Dataset of Brazilian Portuguese Clinical Notes for Anaphylaxis Detection (Machado et al., PROPOR 2026)
PDF:
https://aclanthology.org/2026.propor-2.15.pdf