Towards a Universal Dependencies Corpus for Portuguese Epidemiological Reports

Christian Freitas, Livy Real, Lilian Berton, Valeria de Paiva


Abstract
We present an ongoing research project focused on the construction of a Universal Dependencies (UD) corpus of Portuguese epidemiological reports derived from documents published within the Brazilian public health system. We describe findings and challenges in building such a corpus from PDF reports processed through a controlled document extraction pipeline that contrasts layout-aware extraction with raw PDF text extraction, explicitly addressing the impact of tabular content on downstream syntactic analysis. Narrative text is annotated using multiple UD parsers for Portuguese, including widely used and state-of-the-art tools, and their outputs are systematically compared using descriptive structural indicators and targeted qualitative inspection. Our analysis highlights domain-specific challenges in epidemiological texts and shows that document extraction and representation choices have a stronger effect on parsing behavior than parser selection alone. Based on these findings, we identify robust preprocessing configurations and discuss design choices for a UD epidemiological corpus to support future research on syntactic parsing, domain adaptation, and downstream natural language processing tasks in epidemiology and public health.
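The abstract mentions comparing parser outputs through descriptive structural indicators without listing them here. As a rough illustration only, the sketch below computes a few plausible indicators (sentence count, mean sentence length, mean dependency distance, relation counts) from CoNLL-U text. The function name `structural_indicators`, the choice of indicators, and the sample sentence are all assumptions made for this example, not the authors' actual measures or data.

```python
from collections import Counter


def structural_indicators(conllu_text):
    """Compute simple descriptive indicators from CoNLL-U formatted text.

    Hypothetical helper: the indicators chosen here are illustrative
    assumptions, not the ones used in the paper.
    """
    sentences, current = [], []
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        cols = line.split("\t")
        # Keep only regular tokens: skip multiword ranges ("1-2"),
        # empty nodes ("1.1"), and rows without a numeric HEAD.
        if len(cols) != 10 or not cols[0].isdigit() or not cols[6].isdigit():
            continue
        current.append((int(cols[0]), int(cols[6]), cols[7]))
    if current:
        sentences.append(current)

    n_tokens = sum(len(s) for s in sentences)
    # Dependency distance |ID - HEAD|, computed over non-root tokens only.
    distances = [abs(i - h) for s in sentences for i, h, _ in s if h != 0]
    return {
        "sentences": len(sentences),
        "tokens": n_tokens,
        "mean_sentence_length": n_tokens / len(sentences) if sentences else 0.0,
        "mean_dependency_distance": (
            sum(distances) / len(distances) if distances else 0.0
        ),
        "deprel_counts": Counter(rel for s in sentences for _, _, rel in s),
    }


# Invented sample sentence ("Os casos aumentaram."), for illustration only.
SAMPLE = "\n".join(
    "\t".join(row)
    for row in [
        ["1", "Os", "o", "DET", "_", "_", "2", "det", "_", "_"],
        ["2", "casos", "caso", "NOUN", "_", "_", "3", "nsubj", "_", "_"],
        ["3", "aumentaram", "aumentar", "VERB", "_", "_", "0", "root", "_", "_"],
        [".".join(["4"]), ".", ".", "PUNCT", "_", "_", "3", "punct", "_", "_"],
    ]
)

stats = structural_indicators(SAMPLE)
```

Running the same function on the output of each parser, or on layout-aware versus raw extraction of the same report, would give comparable summaries; a table row mistaken for a sentence would typically show up as an unusual sentence length or relation distribution.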
Anthology ID:
2026.propor-2.31
Volume:
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 2
Month:
April
Year:
2026
Address:
Salvador, Brazil
Editors:
Marlo Souza, Iria de-Dios-Flores, Diana Santos, Larissa Freitas, Jackson Wilke da Cruz Souza, Eugénio Ribeiro
Venue:
PROPOR
Publisher:
Association for Computational Linguistics
Pages:
228–237
URL:
https://aclanthology.org/2026.propor-2.31/
Cite (ACL):
Christian Freitas, Livy Real, Lilian Berton, and Valeria de Paiva. 2026. Towards a Universal Dependencies Corpus for Portuguese Epidemiological Reports. In Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 2, pages 228–237, Salvador, Brazil. Association for Computational Linguistics.
Cite (Informal):
Towards a Universal Dependencies Corpus for Portuguese Epidemiological Reports (Freitas et al., PROPOR 2026)
PDF:
https://aclanthology.org/2026.propor-2.31.pdf