2024
pdf
bib
abs
Biomedical Entity Linking for Dutch: Fine-tuning a Self-alignment BERT Model on an Automatically Generated Wikipedia Corpus
Fons Hartendorp
|
Tom Seinen
|
Erik van Mulligen
|
Suzan Verberne
Proceedings of the First Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC-COLING 2024
Biomedical entity linking, a main component in automatic information extraction from health-related texts, plays a pivotal role in connecting textual entities (such as diseases, drugs and body parts mentioned by patients) to their corresponding concepts in a structured biomedical knowledge base. The task remains challenging despite recent developments in natural language processing. This report presents the first evaluated biomedical entity linking model for the Dutch language. We use MedRoBERTa.nl as basemodel and perform second-phase pretraining through self-alignment on a Dutch biomedical ontology extracted from the UMLS and Dutch SNOMED. We derive a corpus from Wikipedia of ontology-linked Dutch biomedical entities in context and fine-tune our model on this dataset. We evaluate our model on the Dutch portion of the Mantra GSC-corpus and achieve 54.7% classification accuracy and 69.8% 1-distance accuracy. We then perform a case study on a collection of unlabeled, patient-support forum data and show that our model is hampered by the limited quality of the preceding entity recognition step. Manual evaluation of small sample indicates that of the correctly extracted entities, around 65% is linked to the correct concept in the ontology. Our results indicate that biomedical entity linking in a language other than English remains challenging, but our Dutch model can be used to for high-level analysis of patient-generated text.
2013
pdf
bib
A fast rule-based approach for biomedical event extraction
Quoc-Chinh Bui
|
David Campos
|
Erik van Mulligen
|
Jan Kors
Proceedings of the BioNLP Shared Task 2013 Workshop
2012
pdf
bib
abs
CALBC: Releasing the Final Corpora
Şenay Kafkas
|
Ian Lewin
|
David Milward
|
Erik van Mulligen
|
Jan Kors
|
Udo Hahn
|
Dietrich Rebholz-Schuhmann
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
A number of gold standard corpora for named entity recognition are available to the public. However, the existing gold standard corpora are limited in size and semantic entity types. These usually lead to implementation of trained solutions (1) for a limited number of semantic entity types and (2) lacking in generalization capability. In order to overcome these problems, the CALBC project has aimed to automatically generate large scale corpora annotated with multiple semantic entity types in a community-wide manner based on the consensus of different named entity solutions. The generated corpus is called the silver standard corpus since the corpus generation process does not involve any manual curation. In this publication, we announce the release of the final CALBC corpora which include the silver standard corpus in different versions and several gold standard corpora for the further usage of the biomedical text mining community. The gold standard corpora are utilised to benchmark the methods used in the silver standard corpora generation process and released in a shared format. All the corpora are released in a shared format and accessible at www.calbc.eu.
2010
pdf
bib
abs
The CALBC Silver Standard Corpus for Biomedical Named Entities — A Study in Harmonizing the Contributions from Four Independent Named Entity Taggers
Dietrich Rebholz-Schuhmann
|
Antonio José Jimeno Yepes
|
Erik M. van Mulligen
|
Ning Kang
|
Jan Kors
|
David Milward
|
Peter Corbett
|
Ekaterina Buyko
|
Katrin Tomanek
|
Elena Beisswanger
|
Udo Hahn
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
The production of gold standard corpora is time-consuming and costly. We propose an alternative: the âsilver standard corpus (SSC), a corpus that has been generated by the harmonisation of the annotations that have been delivered from a selection of annotation systems. The systems have to share the type system for the annotations and the harmonisation solution has use a suitable similarity measure for the pair-wise comparison of the annotations. The annotation systems have been evaluated against the harmonised set (630.324 sentences, 15,956,841 tokens). We can demonstrate that the annotation of proteins and genes shows higher diversity across all used annotation solutions leading to a lower agreement against the harmonised set in comparison to the annotations of diseases and species. An analysis of the most frequent annotations from all systems shows that a high agreement amongst systems leads to the selection of terms that are suitable to be kept in the harmonised set. This is the first large-scale approach to generate an annotated corpus from automated annotation systems. Further research is required to understand, how the annotations from different systems have to be combined to produce the best annotation result for a harmonised corpus.