Frank Kramer
2024
GottBERT: a pure German Language Model
Raphael Scheible | Johann Frei | Fabian Thomczyk | Henry He | Patric Tippmann | Jochen Knaus | Victor Jaravine | Frank Kramer | Martin Boeker
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Creating Ontology-annotated Corpora from Wikipedia for Medical Named-entity Recognition
Johann Frei | Frank Kramer
Proceedings of the 23rd Workshop on Biomedical Natural Language Processing
Acquiring annotated corpora for medical NLP is challenging due to legal and privacy constraints and costly annotation efforts, and annotated public datasets may not align well with the desired target application in terms of annotation style or language. We investigate the approach of utilizing Wikipedia and WikiData jointly to acquire an unsupervised annotated corpus for named-entity recognition (NER). By controlling the annotation ruleset through WikiData’s ontology, we extract custom-defined annotations and dynamically impute weak annotations through adaptive loss scaling. Our validation on German medication detection datasets yields competitive results. The entire pipeline relies only on open models and data resources, enabling reproducibility and open sharing of models and corpora. All relevant assets are shared on GitHub.
Co-authors
- Johann Frei 2
- Raphael Scheible 1
- Fabian Thomczyk 1
- Henry He 1
- Patric Tippmann 1