Alexander Löser

2026

CliniBench: A Clinical Outcome Prediction Benchmark for Generative and Encoder-Based Language Models
Paul Grundmann | Jan Frick | Dennis Fast | Thomas Steffek | Felix Gers | Wolfgang Nejdl | Alexander Löser
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

With their growing capabilities, generative large language models (LLMs) are being increasingly investigated for complex medical tasks.However, their effectiveness in real-world clinical applications remains underexplored. To address this, we present CliniBench, the first benchmark that enables comparability of well-studied encoder-based classifiers and generative LLMs for discharge diagnosis prediction from admission notes in the MIMIC-IV dataset. Our extensive study compares 12 generative LLMs and 3 encoder-based classifiers and demonstrates that encoder-based classifiers consistently outperform generative models in diagnosis prediction. We assess several retrieval augmentation strategies for in-context learning from similar patients and find that they provide notable performance improvements for generative LLMs.

2024

pdf bib abs

Clinical Decision Support Systems assist medical professionals in providing optimal care for patients.A prominent data source used for creating tasks for such systems is the Medical Information Mart for Intensive Care (MIMIC).MIMIC contains electronic health records (EHR) gathered in a tertiary hospital in the United States.The majority of past work is based on the third version of MIMIC, although the fourth is the most recent version.This new version, not only introduces more data into MIMIC, but also increases the variety of patients.While MIMIC-III is limited to intensive care units, MIMIC-IV also offers EHRs from the emergency department.In this work, we investigate how to adapt previous work to update clinical outcome prediction for MIMIC-IV.We revisit several established tasks, including prediction of diagnoses, procedures, length-of-stay, and also introduce a novel task: patient routing prediction.Furthermore, we quantitatively and qualitatively evaluate all tasks on several bio-medical transformer encoder models.Finally, we provide narratives for future research directions in the clinical outcome prediction domain. We make our source code publicly available to reproduce our experiments, data, and tasks.

2022

pdf bib abs

What Do You See in this Patient? Behavioral Testing of Clinical NLP Models
Betty Van Aken | Sebastian Herrmann | Alexander Löser
Proceedings of the 4th Clinical Natural Language Processing Workshop

Decision support systems based on clinical notes have the potential to improve patient care by pointing doctors towards overseen risks. Predicting a patient’s outcome is an essential part of such systems, for which the use of deep neural networks has shown promising results. However, the patterns learned by these networks are mostly opaque and previous work revealed both reproduction of systemic biases and unexpected behavior for out-of-distribution patients. For application in clinical practice it is crucial to be aware of such behavior. We thus introduce a testing framework that evaluates clinical models regarding certain changes in the input. The framework helps to understand learned patterns and their influence on model decisions. In this work, we apply it to analyse the change in behavior with regard to the patient characteristics gender, age and ethnicity. Our evaluation of three current clinical NLP models demonstrates the concrete effects of these characteristics on the models’ decisions. They show that model behavior varies drastically even when fine-tuned on the same data with similar AUROC score. These results exemplify the need for a broader communication of model behavior in the clinical domain.

pdf bib abs

Attention Networks for Augmenting Clinical Text with Support Sets for Diagnosis Prediction
Paul Grundmann | Tom Oberhauser | Felix Gers | Alexander Löser
Proceedings of the 29th International Conference on Computational Linguistics

Diagnosis prediction on admission notes is a core clinical task. However, these notes may incompletely describe the patient. Also, clinical language models may suffer from idiosyncratic language or imbalanced vocabulary for describing diseases or symptoms. We tackle the task of diagnosis prediction, which consists of predicting future patient diagnoses from clinical texts at the time of admission. We improve the performance on this task by introducing an additional signal from support sets of diagnostic codes from prior admissions or as they emerge during differential diagnosis. To enhance the robustness of diagnosis prediction methods, we propose to augment clinical text with potentially complementary set data from diagnosis codes from previous patient visits or from codes that emerge from the current admission as they become available through diagnostics. We discuss novel attention network architectures and augmentation strategies to solve this problem. Our experiments reveal that support sets improve the performance drastically to predict less common diagnosis codes. Our approach clearly outperforms the previous state-of-the-art PubMedBERT baseline by up 3% points. Furthermore, we find that support sets drastically improve the performance for pregnancy- and gynecology-related diagnoses up to 32.9% points compared to the baseline.

pdf bib abs

KIMERA: Injecting Domain Knowledge into Vacant Transformer Heads
Benjamin Winter | Alexei Figueroa Rosero | Alexander Löser | Felix Alexander Gers | Amy Siu
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Training transformer language models requires vast amounts of text and computational resources. This drastically limits the usage of these models in niche domains for which they are not optimized, or where domain-specific training data is scarce. We focus here on the clinical domain because of its limited access to training data in common tasks, while structured ontological data is often readily available. Recent observations in model compression of transformer models show optimization potential in improving the representation capacity of attention heads. We propose KIMERA (Knowledge Injection via Mask Enforced Retraining of Attention) for detecting, retraining and instilling attention heads with complementary structured domain knowledge. Our novel multi-task training scheme effectively identifies and targets individual attention heads that are least useful for a given downstream task and optimizes their representation with information from structured data. KIMERA generalizes well, thereby building the basis for an efficient fine-tuning. KIMERA achieves significant performance boosts on seven datasets in the medical domain in Information Retrieval and Clinical Outcome Prediction settings. We apply KIMERA to BERT-base to evaluate the extent of the domain transfer and also improve on the already strong results of BioBERT in the clinical domain.

2020

pdf bib abs

We demonstrate TrainX, a system for Named Entity Linking for medical experts. It combines state-of-the-art entity recognition and linking architectures, such as Flair and fine-tuned Bi-Encoders based on BERT, with an easy-to-use interface for healthcare professionals. We support medical experts in annotating training data by using active sampling strategies to forward informative samples to the annotator. We demonstrate that our model is capable of linking against large knowledge bases, such as UMLS (3.6 million entities), and supporting zero-shot cases, where the linker has never seen the entity before. Those zero-shot capabilities help to mitigate the problem of rare and expensive training data that is a common issue in the medical domain.

pdf bib abs

Discovering Biased News Articles Leveraging Multiple Human Annotations
Konstantina Lazaridou | Alexander Löser | Maria Mestre | Felix Naumann
Proceedings of the Twelfth Language Resources and Evaluation Conference

Unbiased and fair reporting is an integral part of ethical journalism. Yet, political propaganda and one-sided views can be found in the news and can cause distrust in media. Both accidental and deliberate political bias affect the readers and shape their views. We contribute to a trustworthy media ecosystem by automatically identifying politically biased news articles. We introduce novel corpora annotated by two communities, i.e., domain experts and crowd workers, and we also consider automatic article labels inferred by the newspapers’ ideologies. Our goal is to compare domain experts to crowd workers and also to prove that media bias can be detected automatically. We classify news articles with a neural network and we also improve our performance in a self-supervised manner.

2019

pdf bib abs

SECTOR: A Neural Model for Coherent Topic Segmentation and Classification
Sebastian Arnold | Rudolf Schneider | Philippe Cudré-Mauroux | Felix A. Gers | Alexander Löser
Transactions of the Association for Computational Linguistics, Volume 7

When searching for information, a human reader first glances over a document, spots relevant sections, and then focuses on a few sentences for resolving her intention. However, the high variance of document structure complicates the identification of the salient topic of a given section at a glance. To tackle this challenge, we present SECTOR, a model to support machine reading systems by segmenting documents into coherent sections and assigning topic labels to each section. Our deep neural network architecture learns a latent topic embedding over the course of a document. This can be leveraged to classify local topics from plain text and segment a document at topic shifts. In addition, we contribute WikiSection, a publicly available data set with 242k labeled sections in English and German from two distinct domains: diseases and cities. From our extensive evaluation of 20 architectures, we report a highest score of 71.6% F1 for the segmentation and classification of 30 topics from the English city domain, scored by our SECTOR long short-term memory model with Bloom filter embeddings and bidirectional segmentation. This is a significant improvement of 29.5 points F1 over state-of-the-art CNN classifiers with baseline segmentation.

2018

pdf bib abs

Challenges for Toxic Comment Classification: An In-Depth Error Analysis
Betty van Aken | Julian Risch | Ralf Krestel | Alexander Löser
Proceedings of the 2nd Workshop on Abusive Language Online (ALW2)

Toxic comment classification has become an active research field with many recently proposed approaches. However, while these approaches address some of the task’s challenges others still remain unsolved and directions for further research are needed. To this end, we compare different deep learning and shallow approaches on a new, large comment dataset and propose an ensemble that outperforms all individual models. Further, we validate our findings on a second dataset. The results of the ensemble enable us to perform an extensive error analysis, which reveals open challenges for state-of-the-art methods and directions towards pending future research. These challenges include missing paradigmatic context and inconsistent dataset labels.

2017

pdf bib abs

Analysing Errors of Open Information Extraction Systems
Rudolf Schneider | Tom Oberhauser | Tobias Klatt | Felix A. Gers | Alexander Löser
Proceedings of the First Workshop on Building Linguistically Generalizable NLP Systems

We report results on benchmarking Open Information Extraction (OIE) systems using RelVis, a toolkit for benchmarking Open Information Extraction systems. Our comprehensive benchmark contains three data sets from the news domain and one data set from Wikipedia with overall 4522 labeled sentences and 11243 binary or n-ary OIE relations. In our analysis on these data sets we compared the performance of four popular OIE systems, ClausIE, OpenIE 4.2, Stanford OpenIE and PredPatt. In addition, we evaluated the impact of five common error classes on a subset of 749 n-ary tuples. From our deep analysis we unreveal important research directions for a next generation on OIE systems.

2016

pdf bib abs

Interactive Relation Extraction in Main Memory Database Systems
Rudolf Schneider | Cordula Guder | Torsten Kilias | Alexander Löser | Jens Graupmann | Oleksandr Kozachuk
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

We present INDREX-MM, a main memory database system for interactively executing two interwoven tasks, declarative relation extraction from text and their exploitation with SQL. INDREX-MM simplifies these tasks for the user with powerful SQL extensions for gathering statistical semantics, for executing open information extraction and for integrating relation candidates with domain specific data. We demonstrate these functions on 800k documents from Reuters RCV1 with more than a billion linguistic annotations and report execution times in the order of seconds.

pdf bib abs

TASTY: Interactive Entity Linking As-You-Type
Sebastian Arnold | Robert Dziuba | Alexander Löser
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

We introduce TASTY (Tag-as-you-type), a novel text editor for interactive entity linking as part of the writing process. Tasty supports the author of a text with complementary information about the mentioned entities shown in a ‘live’ exploration view. The system is automatically triggered by keystrokes, recognizes mention boundaries and disambiguates the mentioned entities to Wikipedia articles. The author can use seven operators to interact with the editor and refine the results according to his specific intention while writing. Our implementation captures syntactic and semantic context using a robust end-to-end LSTM sequence learner and word embeddings. We demonstrate the applicability of our system in English and German language for encyclopedic or medical text. Tasty is currently being tested in interactive applications for text production, such as scientific research, news editorial, medical anamnesis, help desks and product reviews.