Victor Rocco
2024
A Privacy-Preserving Corpus for Occupational Health in Spanish: Evaluation for NER and Classification Tasks
Claudio Aracena | Luis Miranda | Thomas Vakili | Fabián Villena | Tamara Quiroga | Fredy Núñez-Torres | Victor Rocco | Jocelyn Dunstan
Proceedings of the 6th Clinical Natural Language Processing Workshop
Annotated corpora are essential to reliable natural language processing. Although expensive to create, they are needed to build and evaluate systems. This study introduces a new corpus of 2,869 medical and admission reports collected by an occupational insurance and health provider. The corpus has been carefully annotated for personally identifiable information (PII) and is shared with this information masked. Two annotators followed annotation guidelines, and a referee later resolved their conflicts in a consolidation process to build a gold-standard subcorpus. Inter-annotator agreement, measured in F1, ranges between 0.86 and 0.93 depending on the selected subcorpus. The value of the corpus is demonstrated by evaluating its use for NER of PII and for a classification task. The evaluations find that fine-tuned models and GPT-3.5 reach F1 scores of 0.911 and 0.720 in NER of PII, respectively. In the insurance coverage classification task, using the original or the de-identified corpus results in similar performance. The annotated data are released in de-identified form.
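As an illustration of how inter-annotator agreement can be expressed as F1, the sketch below compares the PII spans marked by two annotators on one document, treating one annotator as the reference and the other as the prediction. It assumes exact matching on span boundaries and label; the abstract does not state the matching criterion the authors used, and the `Span` type, the `pairwise_f1` function, and the label names are hypothetical.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """A hypothetical PII annotation: character offsets plus an entity label."""
    start: int
    end: int
    label: str


def pairwise_f1(annotator_a: set[Span], annotator_b: set[Span]) -> float:
    """Exact-match F1 with annotator A as reference and B as prediction.

    Assumption: a span counts as agreed only if start, end, and label
    are all identical. Exact-match F1 is symmetric in A and B.
    """
    if not annotator_a and not annotator_b:
        return 1.0  # both annotators marked nothing: full agreement
    matched = len(annotator_a & annotator_b)
    if matched == 0:
        return 0.0
    precision = matched / len(annotator_b)
    recall = matched / len(annotator_a)
    return 2 * precision * recall / (precision + recall)


# Made-up annotations for a single document (illustrative labels only).
a = {Span(0, 12, "NAME"), Span(20, 30, "DATE"), Span(41, 50, "ID")}
b = {Span(0, 12, "NAME"), Span(20, 30, "DATE"), Span(60, 66, "LOCATION")}

print(round(pairwise_f1(a, b), 3))  # 0.667
```

Here the annotators agree on two of three spans each, so precision and recall are both 2/3 and so is F1. A corpus-level figure would aggregate matches over all documents before computing the score.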
2023
Pre-trained language models in Spanish for health insurance coverage
Claudio Aracena | Nicolás Rodríguez | Victor Rocco | Jocelyn Dunstan
Proceedings of the 5th Clinical Natural Language Processing Workshop
The field of clinical natural language processing (NLP) can extract useful information from clinical text. Since 2017, the NLP field has shifted towards using pre-trained language models (PLMs), improving performance in several tasks. Most of the research in this field has focused on English text, but some PLMs are available in Spanish. In this work, we use clinical PLMs to analyze text from admission and medical reports in Spanish for an insurance and health provider, in order to give a probability of no coverage in a labor insurance process. Our results show that fine-tuning a PLM pre-trained with the provider’s data leads to better results, but this process is time-consuming and computationally expensive. At least for this task, fine-tuning a publicly available clinical PLM leads to results comparable to those of a custom PLM, but in less time and with fewer resources. Analyzing large volumes of insurance requests is burdensome for employers, and models can ease this task by pre-classifying reports that are likely not to have coverage. Our approach of using only clinical text improves on the current models while reinforcing the idea of clinical support systems that simplify human labor but do not replace it. To our knowledge, the clinical corpus collected for this study is the largest one reported for the Spanish language.