Sean S. Huang


2022

A New Public Corpus for Clinical Section Identification: MedSecId
Paul Landes | Kunal Patel | Sean S. Huang | Adam Webb | Barbara Di Eugenio | Cornelia Caragea
Proceedings of the 29th International Conference on Computational Linguistics

The process by which sections in a document are demarcated and labeled is known as section identification. Such sections help the reader search for information and contextualize specific topics. The goal of this work is to segment clinical documentation from the medical domain into sections. The primary contribution of this work is MedSecId, a publicly available set of 2,002 fully annotated medical notes from MIMIC-III. We include several baselines, source code, a pretrained model, and an analysis of the data that uses principal component analysis to show a relationship between medical concepts across sections.