Thiusius Rajeeth Savarimuthu
2023
Benchmark for Evaluation of Danish Clinical Word Embeddings
Martin Sundahl Laursen | Jannik Skyttegaard Pedersen | Pernille Just Vinholt | Rasmus Søgaard Hansen | Thiusius Rajeeth Savarimuthu
Northern European Journal of Language Technology, Volume 9
In natural language processing, benchmarks are used to track progress and identify useful models. Currently, no benchmark for Danish clinical word embeddings exists. This paper describes the development of a Danish benchmark for clinical word embeddings. The benchmark consists of ten datasets: eight intrinsic and two extrinsic. We also use the benchmark to evaluate word embeddings trained on text from the clinical domain, the general practitioner domain, and the general domain. All the intrinsic tasks of the benchmark are publicly available.
MeDa-BERT: A medical Danish pretrained transformer model
Jannik Pedersen | Martin Laursen | Pernille Vinholt | Thiusius Rajeeth Savarimuthu
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
This paper introduces a medical Danish BERT-based language model (MeDa-BERT) and medical Danish word embeddings. The word embeddings and MeDa-BERT were pretrained on a new medical Danish corpus consisting of 133M tokens from medical Danish books and text from the internet. The models showed improved performance over general-domain models on medical Danish classification tasks. The medical word embeddings and MeDa-BERT are publicly available.
Danish Clinical Named Entity Recognition and Relation Extraction
Martin Laursen | Jannik Pedersen | Rasmus Hansen | Thiusius Rajeeth Savarimuthu | Pernille Vinholt
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
Electronic health records contain important information about patients’ medical history, but much of this information is stored in unstructured narrative text. This paper presents the first Danish clinical named entity recognition and relation extraction dataset, covering six types of clinical events, six types of attributes, and three types of relations. The dataset contains 11,607 paragraphs from Danish electronic health records with 54,631 clinical events, 41,954 attributes, and 14,604 relations. We detail the methodology used to develop the annotation scheme and train a transformer-based architecture on the dataset, which achieves macro F1 scores of 60.05%, 44.85%, and 70.64% for clinical events, attributes, and relations, respectively.