Harnessing the Power of BERT in the Turkish Clinical Domain: Pretraining Approaches for Limited Data Scenarios

Hazal Türkmen, Oguz Dikenelli, Cenk Eraslan, Mehmet Calli, Suha Ozbek


Abstract
Recent advancements in natural language processing (NLP) have been driven by large language models (LLMs), revolutionizing the field. Our study investigates the impact of diverse pre-training strategies on the performance of Turkish clinical language models in a multi-label classification task involving radiology reports, with a focus on overcoming language resource limitations. Additionally, for the first time, we evaluate the simultaneous pre-training approach using limited clinical task data. We developed four models: TurkRadBERT-task v1, TurkRadBERT-task v2, TurkRadBERT-sim v1, and TurkRadBERT-sim v2. Our results reveal superior performance from BERTurk and TurkRadBERT-task v1, both of which leverage a broad general-domain corpus. Although task-adaptive pre-training can identify domain-specific patterns, it is prone to overfitting because of the small size of the task-specific corpus. Our findings highlight the importance of a domain-specific vocabulary during pre-training and affirm that combining general-domain knowledge with task-specific fine-tuning is crucial for optimal performance across categories. This study offers key insights for future research on pre-training techniques in the clinical domain, particularly for low-resource languages.
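
For readers unfamiliar with the task-adaptive pre-training the abstract refers to, the sketch below shows the general pattern: continued masked-language-model (MLM) pre-training of BERTurk on a small task-specific clinical corpus, after which the model would be fine-tuned for the classification task. This is a minimal illustration under stated assumptions, not the authors' exact configuration: the checkpoint name dbmdz/bert-base-turkish-cased is the public BERTurk release, while the corpus file and hyperparameters are hypothetical placeholders.

# Minimal sketch of task-adaptive pre-training with HuggingFace Transformers:
# continue MLM pre-training of BERTurk on a task-specific clinical corpus.
# The corpus path and hyperparameters are hypothetical, not the paper's setup.
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

checkpoint = "dbmdz/bert-base-turkish-cased"  # public BERTurk checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Hypothetical file of de-identified Turkish radiology reports, one per line.
dataset = load_dataset("text", data_files={"train": "radiology_reports.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Standard BERT MLM objective: randomly mask 15% of input tokens.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="turkradbert-task",
    num_train_epochs=3,              # a small corpus argues for few epochs,
    per_device_train_batch_size=16,  # consistent with the overfitting risk
    learning_rate=5e-5,              # noted in the abstract
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()

The resulting checkpoint would then be fine-tuned with a classification head on the labeled radiology reports; the simultaneous pre-training variants evaluated in the paper differ in how the clinical data enters pre-training, which this sketch does not attempt to reproduce.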
Anthology ID:
2023.clinicalnlp-1.22
Volume:
Proceedings of the 5th Clinical Natural Language Processing Workshop
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Tristan Naumann, Asma Ben Abacha, Steven Bethard, Kirk Roberts, Anna Rumshisky
Venue:
ClinicalNLP
Publisher:
Association for Computational Linguistics
Pages:
161–170
URL:
https://aclanthology.org/2023.clinicalnlp-1.22
DOI:
10.18653/v1/2023.clinicalnlp-1.22
Cite (ACL):
Hazal Türkmen, Oguz Dikenelli, Cenk Eraslan, Mehmet Calli, and Suha Ozbek. 2023. Harnessing the Power of BERT in the Turkish Clinical Domain: Pretraining Approaches for Limited Data Scenarios. In Proceedings of the 5th Clinical Natural Language Processing Workshop, pages 161–170, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Harnessing the Power of BERT in the Turkish Clinical Domain: Pretraining Approaches for Limited Data Scenarios (Türkmen et al., ClinicalNLP 2023)
PDF:
https://aclanthology.org/2023.clinicalnlp-1.22.pdf
Video:
https://aclanthology.org/2023.clinicalnlp-1.22.mp4