A Dataset of Brazilian Portuguese Clinical Notes for Anaphylaxis Detection

Matheus Machado; Vinícius Vanzin; Dilvan Moreira; Luis Felipe Ensina; Fábio Lario

Abstract

Anaphylaxis is an acute, potentially life-threatening allergic reaction that requires rapid recognition in clinical settings. Natural language processing (NLP) approaches for automatic detection of anaphylaxis in clinical narratives can support large-scale analysis of health records and retrospective clinical research. However, such approaches depend on high-quality labeled corpora, and resources for Portuguese remain scarce. This paper introduces a corpus of Brazilian Portuguese clinical notes annotated by domain specialists for the presence or absence of anaphylaxis. The dataset comprises 969 clinical narratives drawn from three sources: clinician-authored synthetic clinical notes designed to represent realistic scenarios, case reports from the medical literature rewritten into note-like format by specialists, and a subset of de-identified notes from the publicly available SemClinBr corpus. All texts were reviewed and labeled by allergists using established clinical diagnostic criteria, and the corpus reflects realistic prevalence conditions, with approximately 5% positive cases. We describe the corpus design, data sources, annotation methodology, and composition, discuss potential research applications, and address ethical considerations. The corpus is intended as a reusable resource for Portuguese clinical NLP, supporting future work on document classification, information extraction, and language modeling in the medical domain.

Anthology ID: 2026.propor-2.15
Volume: Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 2
Month: April
Year: 2026
Address: Salvador, Brazil
Editors: Marlo Souza, Iria de-Dios-Flores, Diana Santos, Larissa Freitas, Jackson Wilke da Cruz Souza, Eugénio Ribeiro
Venue: PROPOR
Publisher: Association for Computational Linguistics
Pages: 78–87
URL: https://aclanthology.org/2026.propor-2.15/
Cite (ACL): Matheus Machado, Vinícius Vanzin, Dilvan Moreira, Luis Felipe Ensina, and Fábio Lario. 2026. A Dataset of Brazilian Portuguese Clinical Notes for Anaphylaxis Detection. In Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 2, pages 78–87, Salvador, Brazil. Association for Computational Linguistics.
Cite (Informal): A Dataset of Brazilian Portuguese Clinical Notes for Anaphylaxis Detection (Machado et al., PROPOR 2026)
PDF: https://aclanthology.org/2026.propor-2.15.pdf
