Alan Akbik - ACL Anthology

2025

Evaluating Design Decisions for Dual Encoder-based Entity Disambiguation
Susanna Rücker | Alan Akbik
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Entity disambiguation (ED) is the task of linking mentions in text to corresponding entries in a knowledge base. Dual Encoders address this by embedding mentions and label candidates in a shared embedding space and applying a similarity metric to predict the correct label. In this work, we focus on evaluating key design decisions for Dual Encoder-based ED, such as its loss function, similarity metric, label verbalization format, and negative sampling strategy. We present the resulting model VerbalizED, a document-level Dual Encoder model that includes contextual label verbalizations and efficient hard negative sampling. Additionally, we explore an iterative prediction variant that aims to improve the disambiguation of challenging data points. To support our analysis, we first conduct comprehensive ablation experiments on specific design decisions using AIDA-Yago, followed by large-scale, multi-domain evaluation on the ZELDA benchmark.
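
The Dual Encoder prediction step described above can be sketched as follows. This is only a minimal illustration, not the VerbalizED implementation: the embeddings are hand-written toy vectors, cosine similarity stands in for whichever similarity metric the paper selects, and the names `cosine`, `disambiguate`, and `label_vecs` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def disambiguate(mention_vec, label_vecs, similarity=cosine):
    """Return the label whose embedding is most similar to the mention embedding."""
    return max(label_vecs, key=lambda label: similarity(mention_vec, label_vecs[label]))

# Toy embeddings (hypothetical values); a real system would obtain these
# from the mention encoder and the label encoder.
label_vecs = {
    "Paris (city in France)": [0.9, 0.1, 0.0],
    "Paris Hilton (person)": [0.1, 0.9, 0.2],
}
mention_vec = [0.8, 0.2, 0.1]

predicted = disambiguate(mention_vec, label_vecs)
```

Passing a different function as `similarity` (for example a dot product) is one way to compare similarity metrics in such a setup.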

Pre-Training Curriculum for Multi-Token Prediction in Language Models
Ansar Aynetdinov | Alan Akbik
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Multi-token prediction (MTP) is a recently proposed pre-training objective for language models. Rather than predicting only the next token (NTP), MTP predicts the next *k* tokens at each prediction step, using multiple prediction heads. MTP has shown promise in improving downstream performance, inference speed, and training efficiency, particularly for large models. However, prior work has shown that smaller language models (SLMs) struggle with the MTP objective. To address this, we propose a curriculum learning strategy for MTP training, exploring two variants: a forward curriculum, which gradually increases the complexity of the pre-training objective from NTP to MTP, and a reverse curriculum, which does the opposite. Our experiments show that the forward curriculum enables SLMs to better leverage the MTP objective during pre-training, improving downstream NTP performance and generative output quality, while retaining the benefits of self-speculative decoding. The reverse curriculum achieves stronger NTP performance and output quality, but fails to provide any self-speculative decoding benefits.
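
One way to picture the two curricula is as a schedule for the number of tokens *k* predicted at each training step. The sketch below assumes equal-length stages and integer steps; the function names and the stage layout are illustrative assumptions, and the schedule used in the paper may differ.

```python
def forward_curriculum_k(step, total_steps, max_k):
    """Tokens to predict at `step`: starts at 1 (NTP) and rises to max_k (MTP).

    Assumption: training is split into max_k stages of equal length.
    """
    stage = min(step * max_k // total_steps, max_k - 1)
    return stage + 1

def reverse_curriculum_k(step, total_steps, max_k):
    """Mirror image of the forward schedule: starts at max_k and falls to 1."""
    return max_k + 1 - forward_curriculum_k(step, total_steps, max_k)

# Example schedule for 100 training steps and at most 4 predicted tokens.
forward = [forward_curriculum_k(s, 100, 4) for s in range(100)]
reverse = [reverse_curriculum_k(s, 100, 4) for s in range(100)]
```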

Question Decomposition for Retrieval-Augmented Generation
Paul J. L. Ammann | Jonas Golde | Alan Akbik
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Grounding large language models (LLMs) in verifiable external sources is a well-established strategy for generating reliable answers. Retrieval-augmented generation (RAG) is one such approach, particularly effective for tasks like question answering: it retrieves passages that are semantically related to the question and then conditions the model on this evidence. However, multi-hop questions, such as “Which company among NVIDIA, Apple, and Google made the biggest profit in 2023?,” challenge RAG because relevant facts are often distributed across multiple documents rather than co-occurring in one source, making it difficult for standard RAG to retrieve sufficient information. To address this, we propose a RAG pipeline that incorporates question decomposition: (i) an LLM decomposes the original query into sub-questions, (ii) passages are retrieved for each sub-question, and (iii) the merged candidate pool is reranked to improve the coverage and precision of the retrieved evidence. We show that question decomposition effectively assembles complementary documents, while reranking reduces noise and promotes the most relevant passages before answer generation. We evaluate our approach on MultiHop-RAG and HotpotQA, showing gains in retrieval (MRR@10: +36.7%) and answer accuracy (F1: +11.6%) over standard RAG baselines. The pipeline is a practical, drop-in enhancement requiring no task-specific training or specialized indexing.
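
The three steps (i) to (iii) can be sketched as a small function that takes the decomposer, retriever, and reranker as arguments. Everything below is a toy illustration under stated assumptions: the `toy_*` helpers use word overlap instead of an LLM, a dense retriever, or a learned reranker, the corpus is invented, and reranking against the original question is an assumption rather than a detail confirmed by the abstract.

```python
def decomposed_retrieval(question, decompose, retrieve, rerank, top_k=10):
    """Decompose the question, retrieve per sub-question, merge, then rerank."""
    pool, seen = [], set()
    for sub_question in decompose(question):      # step (i)
        for passage in retrieve(sub_question):    # step (ii)
            if passage not in seen:               # merge without duplicates
                seen.add(passage)
                pool.append(passage)
    return rerank(question, pool)[:top_k]         # step (iii)

# --- Toy stand-ins (hypothetical; a real pipeline would call an LLM,
# --- a retriever, and a reranker here).
CORPUS = [
    "NVIDIA reported its 2023 profit.",
    "Apple reported its 2023 profit.",
    "Google reported its 2023 profit.",
    "A recipe for apple pie.",
]

def words(text):
    """Lowercased word set with basic punctuation removed."""
    for mark in "?.,":
        text = text.replace(mark, "")
    return set(text.lower().split())

def toy_decompose(question):
    return [f"What was the 2023 profit of {name}?" for name in ("NVIDIA", "Apple", "Google")]

def toy_retrieve(query, k=2):
    return sorted(CORPUS, key=lambda p: len(words(query) & words(p)), reverse=True)[:k]

def toy_rerank(question, passages):
    return sorted(passages, key=lambda p: len(words(question) & words(p)), reverse=True)

question = "Which company among NVIDIA, Apple, and Google made the biggest profit in 2023?"
evidence = decomposed_retrieval(question, toy_decompose, toy_retrieve, toy_rerank)
```

In this toy run, each sub-question surfaces the passage about its own company, so the merged pool covers all three companies while the unrelated passage is never retrieved.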

Sample-Efficient Language Modeling with Linear Attention and Lightweight Enhancements
Patrick Haller | Jonas Golde | Alan Akbik
Proceedings of the First BabyLM Workshop

We study architectural and optimization techniques for sample-efficient language modeling under the constraints of the BabyLM 2025 shared task. Our model, BLaLM, replaces self-attention with a linear-time mLSTM token mixer and explores lightweight enhancements, including short convolutions, sliding window attention with dynamic modulation, and Hedgehog feature maps. To support training in low-resource settings, we curate a high-quality corpus emphasizing readability and pedagogical structure. Experiments across both strict and strict-small tracks show that (1) linear attention combined with sliding window attention consistently improves zero-shot performance, and (2) the Muon optimizer stabilizes convergence and reduces perplexity over AdamW. These results highlight effective strategies for efficient language modeling without relying on scale.

Babies Learn to Look Ahead: Multi-Token Prediction in Small LMs
Ansar Aynetdinov | Alan Akbik
Proceedings of the First BabyLM Workshop

Multi-token prediction (MTP) is an alternative training objective for language models that has recently been proposed as a potential improvement over traditional next-token prediction (NTP). Instead of training models to predict only the next token, as is standard, MTP trains them to predict the next k tokens at each step. While MTP was shown to improve downstream performance and sample efficiency in large language models (LLMs), smaller language models (SLMs) struggle with this objective. Recently, a curriculum-based approach was offered as a solution to this problem for models as small as 1.3B parameters by adjusting the difficulty of the training objective over time. In this work we investigate the viability of MTP curricula in a highly data- and parameter-constrained setting. Our experimental results show that even 130M-parameter models benefit from including the MTP task in the pre-training objective. These gains hold even under severe data constraints, as demonstrated on both zero-shot benchmarks and downstream tasks.

Improving Online Job Advertisement Analysis via Compositional Entity Extraction
Kai Krüger | Johanna Binnewitt | Kathrin Ehmann | Stefan Winnige | Alan Akbik
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

We propose a compositional entity modeling framework for requirement extraction from online job advertisements (OJAs), representing complex, tree-like structures that connect atomic entities via typed relations. Based on this schema, we introduce GOJA, a manually annotated dataset of 500 German job ads that captures roles, tools, experience levels, attitudes, and their functional context. We report strong inter-annotator agreement and benchmark transformer models, demonstrating the feasibility of learning this structure. A focused case study on AI-related requirements illustrates the analytical value of our approach for labor market research.

Token-Level Metrics for Detecting Incorrect Gold Annotations in Named Entity Recognition
Elena Merdjanovska | Alan Akbik
Findings of the Association for Computational Linguistics: EMNLP 2025

Annotated datasets for supervised learning tasks often contain incorrect gold annotations, i.e., label noise. To address this issue, many noisy label learning approaches incorporate metrics to filter out unreliable samples, for example using heuristics such as high loss or low confidence. However, when these metrics are integrated into larger pipelines, it becomes difficult to compare their effectiveness and understand their individual contribution to reducing label noise. This paper directly compares popular sample metrics for detecting incorrect annotations in named entity recognition (NER). NER is commonly approached as token classification, so the metrics are calculated for each training token and we flag the incorrect ones by defining metric thresholds. We compare the metrics based on (i) their accuracy in detecting the incorrect labels and (ii) the test scores when retraining a model using the cleaned dataset. We show that training dynamics metrics work best overall. The best metrics effectively reduce the label noise across different noise types. The errors that the model has not yet memorized are more feasible to detect, and relabeling these tokens is a more effective strategy than excluding them from training.

Lemma Dilemma: On Lemma Generation Without Domain- or Language-Specific Training Data
Olia Toporkov | Alan Akbik | Rodrigo Agerri
Findings of the Association for Computational Linguistics: EMNLP 2025

Lemmatization is the task of transforming all words in a given text to their dictionary forms. While large language models (LLMs) have demonstrated their ability to achieve competitive results across a wide range of NLP tasks, there is no prior evidence of how effective they are in the contextual lemmatization task. In this paper, we empirically investigate the capacity of the latest generation of LLMs to perform in-context lemmatization, comparing it to the traditional fully supervised approach. In particular, we consider the setting in which supervised training data is not available for a target domain or language, comparing (i) encoder-only supervised approaches, fine-tuned out-of-domain, and (ii) cross-lingual methods, against direct in-context lemma generation with LLMs. Our experimental investigation across 12 languages of different morphological complexity finds that, while encoders remain competitive in out-of-domain settings when fine-tuned on gold data, current LLMs reach state-of-the-art results for most languages by directly generating lemmas in-context without prior fine-tuning, provided just with a few examples. Data and code will be made available upon publication.

From Data to Knowledge: Evaluating How Efficiently Language Models Learn Facts
Daniel Christoph | Max Ploner | Patrick Haller | Alan Akbik
Proceedings of the First Workshop on Large Language Model Memorization (L2M2)

Sample efficiency is a crucial property of language models with practical implications for training efficiency. In real-world text, information follows a long-tailed distribution. Yet, we expect models to learn and recall both frequent and infrequent facts. Sample-efficient models are better equipped to handle this challenge of learning and retaining rare information without requiring excessive exposure. This study analyzes multiple models of varying architectures and sizes, all trained on the same pre-training data. By annotating relational facts with their frequencies in the training corpus, we examine how model performance varies with fact frequency. Our findings show that most models perform similarly on high-frequency facts but differ notably on low-frequency facts. This analysis provides new insights into the relationship between model architecture, size, and factual learning efficiency.

Towards a Principled Evaluation of Knowledge Editors
Sebastian Pohl | Max Ploner | Alan Akbik
Proceedings of the First Workshop on Large Language Model Memorization (L2M2)

Model editing has been gaining increasing attention over the past few years. For Knowledge Editing in particular, more challenging evaluation datasets have recently been released. These datasets use different methodologies to score the success of editors. Yet, it remains under-explored how robust these methodologies are and whether they unfairly favor some editors. Moreover, the disruptive impact of these editors on overall model capabilities remains a constant blind spot. We address both of these problems and show that choosing different metrics and evaluation methodologies as well as different edit batch sizes can lead to a different ranking of knowledge editors. Crucially, we demonstrate this effect also on general language understanding tasks evaluated alongside the knowledge editing tasks. Further, we include a manual assessment of the string-matching-based evaluation method for knowledge editing that is favored by recently released datasets, revealing a tendency to produce false positive matches.

Measuring Label Ambiguity in Subjective Tasks using Predictive Uncertainty Estimation
Richard Alies | Elena Merdjanovska | Alan Akbik
Proceedings of the 19th Linguistic Annotation Workshop (LAW-XIX-2025)

Human annotations in natural language corpora vary due to differing human perspectives. This is especially prevalent in subjective tasks. In these datasets, certain data samples are more prone to label variation and can be indicated as ambiguous samples.

Familiarity: Better Evaluation of Zero-Shot Named Entity Recognition by Quantifying Label Shifts in Synthetic Training Data
Jonas Golde | Patrick Haller | Max Ploner | Fabio Barth | Nicolaas Jedema | Alan Akbik
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Zero-shot named entity recognition (NER) is the task of detecting named entities of specific types (such as Person or Medicine) without any training examples. Current research increasingly relies on large synthetic datasets, automatically generated to cover tens of thousands of distinct entity types, to train zero-shot NER models. However, in this paper, we find that these synthetic datasets often contain entity types that are semantically highly similar to (or even the same as) those in standard evaluation benchmarks. Because of this overlap, we argue that reported F1 scores for zero-shot NER overestimate the true capabilities of these approaches. Further, we argue that current evaluation setups provide an incomplete picture of zero-shot abilities since they do not quantify the label shift (i.e., the similarity of labels) between training and evaluation datasets. To address these issues, we propose Familiarity, a novel metric that captures both the semantic similarity between entity types in training and evaluation, as well as their frequency in the training data, to provide an estimate of label shift. It allows researchers to contextualize reported zero-shot NER scores when using custom synthetic training datasets. Further, it enables researchers to generate evaluation setups of various transfer difficulties for fine-grained analysis of zero-shot NER.

LM-Pub-Quiz: A Comprehensive Framework for Zero-Shot Evaluation of Relational Knowledge in Language Models
Max Ploner | Jacek Wiland | Sebastian Pohl | Alan Akbik
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)

Knowledge probing evaluates to what extent a language model (LM) has acquired relational knowledge during its pre-training phase. It provides a cost-effective means of comparing LMs of different sizes and training setups and is useful for monitoring knowledge gained or lost during continual learning (CL). In prior work, we presented an improved knowledge probe called BEAR (Wiland et al., 2024), which enables the comparison of LMs trained with different pre-training objectives (causal and masked LMs) and addresses issues of skewed distributions in previous probes to deliver a more unbiased reading of LM knowledge. With this paper, we present LM-Pub-Quiz, a Python framework and leaderboard built around the BEAR probing mechanism that enables researchers and practitioners to apply it in their work. It provides options for standalone evaluation and direct integration into the widely used training pipeline of the Hugging Face transformers library. Further, it provides a fine-grained analysis of different knowledge types to assist users in better understanding the knowledge in each evaluated LM. We publicly release LM-Pub-Quiz as an open-source project: https://lm-pub-quiz.github.io/

TransformerRanker: A Tool for Efficiently Finding the Best-Suited Language Models for Downstream Classification Tasks
Lukas Garbas | Max Ploner | Alan Akbik
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)

Classification tasks in NLP are typically addressed by selecting a pre-trained language model (PLM) from a model hub, and fine-tuning it for the task at hand. However, given the very large number of PLMs that are currently available, a practical challenge is to determine which of them will perform best for a specific downstream task. With this paper, we introduce TransformerRanker, a lightweight library that efficiently ranks PLMs for classification tasks without the need for computationally costly fine-tuning. Our library implements current approaches for transferability estimation (LogME, H-Score, kNN), in combination with layer aggregation options, which we empirically showed to yield state-of-the-art rankings of PLMs (Garbas et al., 2024). We designed the interface to be lightweight and easy to use, allowing users to directly connect to the HuggingFace Transformers and Dataset libraries. Users need only select a downstream classification task and a list of PLMs to create a ranking of likely best-suited PLMs for their task. We make TransformerRanker available as a pip-installable open-source library.

2024

Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions
Max Dallabetta | Conrad Dobberstein | Adrian Breiding | Alan Akbik
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)

This paper introduces Fundus, a user-friendly news scraper that enables users to obtain millions of high-quality news articles with just a few lines of code. Unlike existing news scrapers, we use manually crafted, bespoke content extractors that are specifically tailored to the formatting guidelines of each supported online newspaper. This allows us to optimize our scraping for quality such that retrieved news articles are textually complete and without HTML artifacts. Further, our framework combines both crawling (retrieving HTML from the web or large web archives) and content extraction into a single pipeline. By providing a unified interface for a predefined collection of newspapers, we aim to make Fundus broadly usable even for non-technical users. This paper gives an overview of the framework, discusses our design choices, and presents a comparative evaluation against other popular news scrapers. Our evaluation shows that Fundus yields significantly higher quality extractions (complete and artifact-free news articles) than prior work. The framework is available on GitHub at https://github.com/flairNLP/fundus and can be installed using pip.

BabyHGRN: Exploring RNNs for Sample-Efficient Language Modeling
Patrick Haller | Jonas Golde | Alan Akbik
The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning

This paper explores the potential of recurrent neural networks (RNNs) and other subquadratic architectures as competitive alternatives to transformer-based models in low-resource language modeling scenarios. We utilize HGRN2 (Qin et al., 2024), a recently proposed RNN-based architecture, and comparatively evaluate its effectiveness against transformer-based baselines and other subquadratic architectures (LSTM, xLSTM, Mamba). Our experimental results show that our HGRN2 language model outperforms transformer-based models in both the 10M and 100M word tracks of the challenge, as measured by their performance on the BLiMP, EWoK, GLUE and BEAR benchmarks. Further, we show the positive impact of knowledge distillation. Our findings challenge the prevailing focus on transformer architectures and indicate the viability of RNN-based models, particularly in resource-constrained environments.

Large-Scale Label Interpretation Learning for Few-Shot Named Entity Recognition
Jonas Golde | Felix Hamborg | Alan Akbik
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Few-shot named entity recognition (NER) detects named entities within text using only a few annotated examples. One promising line of research is to leverage natural language descriptions of each entity type: the common label PER might, for example, be verbalized as “person entity.” In an initial label interpretation learning phase, the model learns to interpret such verbalized descriptions of entity types. In a subsequent few-shot tagset extension phase, this model is then given a description of a previously unseen entity type (such as “music album”) and optionally a few training examples to perform few-shot NER for this type. In this paper, we systematically explore the impact of a strong semantic prior to interpret verbalizations of new entity types by massively scaling up the number and granularity of entity types used for label interpretation learning. To this end, we leverage an entity linking benchmark to create a dataset with orders of magnitude more distinct entity types and descriptions than currently used datasets. We find that this increased signal yields strong results in zero- and few-shot NER in in-domain, cross-domain, and even cross-lingual settings. Our findings indicate significant potential for improving few-shot NER through heuristic data-based optimization.

Less is More: Parameter-Efficient Selection of Intermediate Tasks for Transfer Learning
David Schulte | Felix Hamborg | Alan Akbik
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Intermediate task transfer learning can greatly improve model performance. If, for example, one has little training data for emotion detection, first fine-tuning a language model on a sentiment classification dataset may improve performance strongly. But which task to choose for transfer learning? Prior methods producing useful task rankings are infeasible for large source pools, as they require forward passes through all source language models. We overcome this by introducing Embedding Space Maps (ESMs), light-weight neural networks that approximate the effect of fine-tuning a language model. We conduct the largest study on NLP task transferability and task selection with 12k source-target pairs. We find that applying ESMs on a prior method reduces execution time and disk space usage by factors of 10 and 278, respectively, while retaining high selection performance (avg. regret@5 score of 2.95).

NoiseBench: Benchmarking the Impact of Real Label Noise on Named Entity Recognition
Elena Merdjanovska | Ansar Aynetdinov | Alan Akbik
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Available training data for named entity recognition (NER) often contains a significant percentage of incorrect labels for entity types and entity boundaries. Such label noise poses challenges for supervised learning and may significantly deteriorate model quality. To address this, prior work proposed various noise-robust learning approaches capable of learning from data with partially incorrect labels. These approaches are typically evaluated using simulated noise where the labels in a clean dataset are automatically corrupted. However, as we show in this paper, this leads to unrealistic noise that is far easier to handle than real noise caused by human error or semi-automatic annotation. To enable the study of the impact of various types of real noise, we introduce NoiseBench, an NER benchmark consisting of clean training data corrupted with 6 types of real noise, including expert errors, crowdsourcing errors, automatic annotation errors and LLM errors. We present an analysis that shows that real noise is significantly more challenging than simulated noise, and show that current state-of-the-art models for noise-robust learning fall far short of their achievable upper bound. We release NoiseBench for both English and German to the research community.

Parameter-Efficient Fine-Tuning: Is There An Optimal Subset of Parameters to Tune?
Max Ploner | Alan Akbik
Findings of the Association for Computational Linguistics: EACL 2024

The ever-growing size of pretrained language models (PLMs) presents a significant challenge for efficiently fine-tuning and deploying these models for diverse sets of tasks within memory-constrained environments. In light of this, recent research has illuminated the possibility of selectively updating only a small subset of a model’s parameters during the fine-tuning process. Since no new parameters or modules are added, these methods retain the inference speed of the original model and come at no additional computational cost. However, an open question pertains to which subset of parameters should best be tuned to maximize task performance and generalizability. To investigate, this paper presents comprehensive experiments covering a large spectrum of subset selection strategies. We comparatively evaluate their impact on model performance as well as the resulting model’s capability to generalize to different tasks. Surprisingly, we find that the gains achieved in performance by elaborate selection strategies are, at best, marginal when compared to the outcomes obtained by tuning a random selection of parameter subsets. Our experiments also indicate that selection-based tuning impairs generalizability to new tasks.

BEAR: A Unified Framework for Evaluating Relational Knowledge in Causal and Masked Language Models
Jacek Wiland | Max Ploner | Alan Akbik
Findings of the Association for Computational Linguistics: NAACL 2024

Knowledge probing assesses to what degree a language model (LM) has successfully learned relational knowledge during pre-training. Probing is an inexpensive way to compare LMs of different sizes and training configurations. However, previous approaches rely on the objective function used in pre-training LMs and are thus applicable only to masked or causal LMs. As a result, comparing different types of LMs becomes impossible. To address this, we propose an approach that uses an LM’s inherent ability to estimate the log-likelihood of any given textual statement. We carefully design an evaluation dataset of 7,731 instances (40,916 in a larger variant) from which we produce alternative statements for each relational fact, one of which is correct. We then evaluate whether an LM correctly assigns the highest log-likelihood to the correct statement. Our experimental evaluation of 22 common LMs shows that our proposed framework, BEAR, can effectively probe for knowledge across different LM types. We release the BEAR datasets and an open-source framework that implements the probing approach to the research community to facilitate the evaluation and development of LMs.
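
The ranking step can be illustrated with a short sketch: for each fact, the statement with the highest log-likelihood is taken as the model's answer and compared with the correct one. This is not the released BEAR framework or its API. The names `pick_answer` and `accuracy` are hypothetical, and the scores are invented numbers standing in for log-likelihoods that a real LM would assign.

```python
def pick_answer(statements, log_likelihood):
    """Return the statement to which the scorer assigns the highest log-likelihood."""
    return max(statements, key=log_likelihood)

def accuracy(instances, log_likelihood):
    """Fraction of instances where the top-scoring statement is the correct one."""
    correct = sum(
        pick_answer(inst["statements"], log_likelihood) == inst["correct"]
        for inst in instances
    )
    return correct / len(instances)

# Hypothetical log-likelihoods; in practice these would come from an LM,
# e.g. by summing token log-probabilities of each statement.
TOY_SCORES = {
    "The capital of France is Paris.": -4.2,
    "The capital of France is Rome.": -9.1,
    "The capital of France is Berlin.": -8.7,
}

instances = [
    {"statements": list(TOY_SCORES), "correct": "The capital of France is Paris."},
]
score = accuracy(instances, TOY_SCORES.get)
```

Because the comparison only needs a score for a full statement, the same procedure applies to any model that can provide such a score, which is what makes causal and masked LMs comparable.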

Choose Your Transformer: Improved Transferability Estimation of Transformer Models on Classification Tasks
Lukas Garbaciauskas | Max Ploner | Alan Akbik
Findings of the Association for Computational Linguistics: ACL 2024

There currently exists a multitude of pre-trained transformer language models (LMs) that are readily available. From a practical perspective, this raises the question of which pre-trained LM will perform best if fine-tuned for a specific downstream NLP task. However, exhaustively fine-tuning all available LMs to determine the best-fitting model is computationally infeasible. To address this problem, we present an approach that inexpensively estimates a ranking of the expected performance of a given set of candidate LMs for a given task. Following a layer-wise representation analysis, we extend existing approaches such as H-score and LogME by aggregating representations across all layers of the transformer model. We present an extensive analysis of 20 transformer LMs, 6 downstream NLP tasks, and various estimators (linear probing, kNN, H-score, and LogME). Our evaluation finds that averaging the layer representations significantly improves the Pearson correlation coefficient between the true model ranks and the estimate, increasing from 0.58 to 0.86 for LogME and from 0.65 to 0.88 for H-score.

PECC: Problem Extraction and Coding Challenges
Patrick Haller | Jonas Golde | Alan Akbik
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Recent advancements in large language models (LLMs) have showcased their exceptional abilities across various tasks, such as code generation, problem-solving and reasoning. Existing benchmarks evaluate tasks in isolation, yet the extent to which LLMs can understand prose-style tasks, identify the underlying problems, and then generate appropriate code solutions is still unexplored. Addressing this gap, we introduce PECC, a novel benchmark derived from Advent Of Code (AoC) challenges and Project Euler, comprising 2,396 problems. Unlike conventional benchmarks, PECC requires LLMs to interpret narrative-embedded problems, extract requirements, and generate executable code. A key feature of our dataset is the complexity added by natural language prompting in chat-based evaluations, mirroring real-world instruction ambiguities. Results show varying model performance between narrative and neutral problems, with specific challenges in the math-based Euler subset: GPT-3.5-Turbo passes 50% of the AoC challenges but only 8% of the Euler problems. By probing the limits of LLMs’ capabilities, our benchmark provides a framework to monitor and assess the subsequent progress of LLMs as universal problem solvers.

OpinionGPT: Modelling Explicit Biases in Instruction-Tuned LLMs
Patrick Haller | Ansar Aynetdinov | Alan Akbik
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: System Demonstrations)

Instruction-tuned Large Language Models (LLMs) have recently showcased remarkable ability to generate fitting responses to natural language instructions. However, an open research question concerns the inherent biases of trained models and their responses. For instance, if the data used to tune an LLM is dominantly written by persons with a specific political bias, we might expect generated answers to share this bias. Current research work seeks to de-bias such models, or suppress potentially biased answers. With this demonstration, we take a different view on biases in instruction-tuning: Rather than aiming to suppress them, we aim to make them explicit and transparent. To this end, we present OpinionGPT, a web demo in which users can ask questions and select all biases they wish to investigate. The demo will answer this question using a model fine-tuned on text representing each of the selected biases, allowing side-by-side comparison. To train the underlying model, we identified 11 different biases (political, geographic, gender, age) and derived an instruction-tuning corpus in which each answer was written by members of one of these demographics. This paper presents OpinionGPT, illustrates how we trained the bias-aware model and showcases the web application (available at https://opiniongpt.informatik.hu-berlin.de).

2023

ZELDA: A Comprehensive Benchmark for Supervised Entity Disambiguation
Marcel Milich | Alan Akbik
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Entity disambiguation (ED) is the task of disambiguating named entity mentions in text to unique entries in a knowledge base. Due to its industrial relevance, as well as current progress in leveraging pre-trained language models, a multitude of ED approaches have been proposed in recent years. However, we observe a severe lack of uniformity across experimental setups in current ED work, rendering a direct comparison of approaches based solely on reported numbers impossible: Current approaches widely differ in the data set used to train, the size of the covered entity vocabulary, and the usage of additional signals such as candidate lists. To address this issue, we present ZELDA, a novel entity disambiguation benchmark that includes a unified training data set, entity vocabulary, candidate lists, as well as challenging evaluation splits covering 8 different domains. We illustrate its design and construction, and present experiments in which we train and compare current state-of-the-art approaches on our benchmark. To encourage greater direct comparability in the entity disambiguation domain, we make our benchmark publicly available to the research community.

CleanCoNLL: A Nearly Noise-Free Named Entity Recognition Dataset
Susanna Rücker | Alan Akbik
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

The CoNLL-03 corpus is arguably the most well-known and utilized benchmark dataset for named entity recognition (NER). However, prior works found significant numbers of annotation errors, incompleteness, and inconsistencies in the data. This poses challenges to objectively comparing NER approaches and analyzing their errors, as current state-of-the-art models achieve F1-scores that are comparable to or even exceed the estimated noise level in CoNLL-03. To address this issue, we present a comprehensive relabeling effort assisted by automatic consistency checking that corrects 7.0% of all labels in the English CoNLL-03. Our effort adds a layer of entity linking annotation both for better explainability of NER labels and as additional safeguard of annotation quality. Our experimental evaluation finds not only that state-of-the-art approaches reach significantly higher F1-scores (97.1%) on our data, but crucially that the share of correct predictions falsely counted as errors due to annotation noise drops from 47% to 6%. This indicates that our resource is well suited to analyze the remaining errors made by state-of-the-art models, and that the theoretical upper bound even on high resource, coarse-grained NER is not yet reached. To facilitate such analysis, we make CleanCoNLL publicly available to the research community.

Fabricator: An Open Source Toolkit for Generating Labeled Training Data with Teacher LLMs
Jonas Golde | Patrick Haller | Felix Hamborg | Julian Risch | Alan Akbik
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Most NLP tasks are modeled as supervised learning and thus require labeled training data to train effective models. However, manually producing such data at sufficient quality and quantity is known to be costly and time-intensive. Current research addresses this bottleneck by exploring a novel paradigm called zero-shot learning via dataset generation. Here, a powerful LLM is prompted with a task description to generate labeled data that can be used to train a downstream NLP model. For instance, an LLM might be prompted to “generate 500 movie reviews with positive overall sentiment, and another 500 with negative sentiment.” The generated data could then be used to train a binary sentiment classifier, effectively leveraging an LLM as a teacher to a smaller student model. With this demo, we introduce Fabricator, an open-source Python toolkit for dataset generation. Fabricator implements common dataset generation workflows, supports a wide range of downstream NLP tasks (such as text classification, question answering, and entity recognition), and is integrated with well-known libraries to facilitate quick experimentation. With Fabricator, we aim to support researchers in conducting reproducible dataset generation experiments using LLMs and help practitioners apply this approach to train models for downstream tasks.
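
The teacher-student workflow described in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Fabricator's API: `generate_labeled_data`, `fake_teacher`, and the prompt wording are hypothetical names introduced here, and the teacher is a stub standing in for a real LLM call.

```python
from typing import Callable, List, Tuple


def generate_labeled_data(
    teacher: Callable[[str], List[str]],
    task: str,
    labels: List[str],
    per_label: int,
) -> List[Tuple[str, str]]:
    """Prompt a teacher model once per label and collect (text, label) pairs."""
    dataset = []
    for label in labels:
        # Hypothetical prompt template, modeled on the example in the abstract.
        prompt = f"Generate {per_label} {task} with {label} overall sentiment."
        for text in teacher(prompt)[:per_label]:
            dataset.append((text, label))
    return dataset


def fake_teacher(prompt: str) -> List[str]:
    """Stub teacher: a real setup would call an LLM here (assumption)."""
    return [f"sample {i} for: {prompt}" for i in range(3)]


data = generate_labeled_data(
    fake_teacher, "movie reviews", ["positive", "negative"], per_label=2
)
```

The resulting `data` list could then be used to train a smaller student classifier.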

2022

Medical Coding with Biomedical Transformer Ensembles and Zero/Few-shot Learning
Angelo Ziletti | Alan Akbik | Christoph Berns | Thomas Herold | Marion Legler | Martina Viell
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Track

Medical coding (MC) is an essential pre-requisite for reliable data retrieval and reporting. Given a free-text reported term (RT) such as “pain of right thigh to the knee”, the task is to identify the matching lowest-level term (LLT), in this case “unilateral leg pain”, from a very large and continuously growing repository of standardized medical terms. However, automating this task is challenging due to a large number of LLT codes (as of writing over 80,000), limited availability of training data for long tail/emerging classes, and the general high accuracy demands of the medical domain. With this paper, we introduce the MC task, discuss its challenges, and present a novel approach called xTARS that combines traditional BERT-based classification with a recent zero/few-shot learning approach (TARS). We present extensive experiments that show that our combined approach outperforms strong baselines, especially in the few-shot regime. The approach is developed and deployed at Bayer, live since November 2021. As we believe our approach is potentially promising beyond MC, and to ensure reproducibility, we release the code to the research community.

2021

Early Detection of Sexual Predators in Chats
Matthias Vogt | Ulf Leser | Alan Akbik
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

An important risk that children face today is online grooming, where a so-called sexual predator establishes an emotional connection with a minor online with the objective of sexual abuse. Prior work has sought to automatically identify grooming chats, but only after an incident has already happened in the context of legal prosecution. In this work, we instead investigate this problem from the point of view of prevention. We define and study the task of early sexual predator detection (eSPD) in chats, where the goal is to analyze a running chat from its beginning and predict grooming attempts as early and as accurately as possible. We survey existing datasets and their limitations regarding eSPD, and create a new dataset called PANC for more realistic evaluations. We present strong baselines built on BERT that also reach state-of-the-art results for conventional SPD. Finally, we consider coping with limited computational resources, as real-life applications require eSPD on mobile devices.

2020

Task-Aware Representation of Sentences for Generic Text Classification
Kishaloy Halder | Alan Akbik | Josip Krapac | Roland Vollgraf
Proceedings of the 28th International Conference on Computational Linguistics

State-of-the-art approaches for text classification leverage a transformer architecture with a linear layer on top that outputs a class distribution for a given prediction problem. While effective, this approach suffers from conceptual limitations that affect its utility in few-shot or zero-shot transfer learning scenarios. First, the number of classes to predict needs to be pre-defined. In a transfer learning setting, in which new classes are added to an already trained classifier, all information contained in a linear layer is therefore discarded, and a new layer is trained from scratch. Second, this approach only learns the semantics of classes implicitly from training examples, as opposed to leveraging the explicit semantic information provided by the natural language names of the classes. For instance, a classifier trained to predict the topics of news articles might have classes like “business” or “sports” that themselves carry semantic information. Extending a classifier to predict a new class named “politics” with only a handful of training examples would benefit from both leveraging the semantic information in the name of a new class and using the information contained in the already trained linear layer. This paper presents a novel formulation of text classification that addresses these limitations. It imbues the notion of the task at hand into the transformer model itself by factorizing arbitrary classification problems into a generic binary classification problem. We present experiments in few-shot and zero-shot transfer learning that show that our approach significantly outperforms previous approaches on small training data and can even learn to predict new classes with no training examples at all. The implementation of our model is publicly available at: https://github.com/flairNLP/flair.
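
The factorization into a generic binary problem can be illustrated with a small Python sketch. It shows how one multi-class example might be expanded into one (label name, text) pair per candidate class, each with a true/false target. The function name `to_binary_examples` and the `[SEP]` separator string are assumptions made for illustration; the actual implementation lives in the flair repository linked in the abstract.

```python
from typing import List, Tuple


def to_binary_examples(
    text: str, gold_label: str, label_names: List[str]
) -> List[Tuple[str, bool]]:
    """Expand one multi-class example into one binary example per label."""
    examples = []
    for label in label_names:
        # Prepend the natural-language label name so a model can read it.
        # The "[SEP]" separator is an illustrative assumption.
        paired_input = f"{label} [SEP] {text}"
        examples.append((paired_input, label == gold_label))
    return examples


labels = ["business", "sports"]
pairs = to_binary_examples("Shares fell sharply on Monday.", "business", labels)

# Adding a class only adds a label name; no output layer is resized.
pairs_with_new = to_binary_examples(
    "Parliament passed the bill.", "politics", labels + ["politics"]
)
```

Because the class enters as text rather than as an output dimension, the same binary decision head can in principle be reused when “politics” is added.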

2019

Pooled Contextualized Embeddings for Named Entity Recognition
Alan Akbik | Tanja Bergmann | Roland Vollgraf
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Contextual string embeddings are a recent type of contextualized word embedding that were shown to yield state-of-the-art results when utilized in a range of sequence labeling tasks. They are based on character-level language models which treat text as distributions over characters and are capable of generating embeddings for any string of characters within any textual context. However, such purely character-based approaches struggle to produce meaningful embeddings if a rare string is used in an underspecified context. To address this drawback, we propose a method in which we dynamically aggregate contextualized embeddings of each unique string that we encounter. We then use a pooling operation to distill a “global” word representation from all contextualized instances. We evaluate these “pooled contextualized embeddings” on common named entity recognition (NER) tasks such as CoNLL-03 and WNUT and show that our approach significantly improves the state-of-the-art for NER. We make all code and pre-trained models available to the research community for use and reproduction.
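
The aggregation idea in this abstract can be sketched as a memory that stores every contextualized embedding seen for a string and pools over them. This is a simplified sketch, not the authors' implementation: the class name `PooledEmbeddingMemory`, the plain-list vectors, and the choice of mean/min/max as pooling options are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List


class PooledEmbeddingMemory:
    """Keeps all contextual embeddings per unique string and pools them."""

    def __init__(self, pooling: str = "mean"):
        if pooling not in ("mean", "min", "max"):
            raise ValueError(f"unknown pooling operation: {pooling}")
        self.pooling = pooling
        self.memory: Dict[str, List[List[float]]] = defaultdict(list)

    def add(self, word: str, contextual_embedding: List[float]) -> List[float]:
        """Store a new contextual instance and return the updated pooled vector."""
        self.memory[word].append(list(contextual_embedding))
        return self.pooled(word)

    def pooled(self, word: str) -> List[float]:
        """Pool element-wise over all stored instances of the word."""
        columns = list(zip(*self.memory[word]))
        if self.pooling == "mean":
            return [sum(col) / len(col) for col in columns]
        if self.pooling == "min":
            return [min(col) for col in columns]
        return [max(col) for col in columns]


memory = PooledEmbeddingMemory(pooling="mean")
memory.add("Indra", [1.0, 0.0])               # first, possibly underspecified context
global_vec = memory.add("Indra", [3.0, 2.0])  # [2.0, 1.0]
```

A rare string seen in an uninformative sentence thus benefits from earlier, more informative occurrences stored in the memory.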

FLAIR: An Easy-to-Use Framework for State-of-the-Art NLP
Alan Akbik | Tanja Bergmann | Duncan Blythe | Kashif Rasul | Stefan Schweter | Roland Vollgraf
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations)

We present FLAIR, an NLP framework designed to facilitate training and distribution of state-of-the-art sequence labeling, text classification and language models. The core idea of the framework is to present a simple, unified interface for conceptually very different types of word and document embeddings. This effectively hides all embedding-specific engineering complexity and allows researchers to “mix and match” various embeddings with little effort. The framework also implements standard model training and hyperparameter selection routines, as well as a data fetching module that can download publicly available NLP datasets and convert them into data structures for quick set up of experiments. Finally, FLAIR also ships with a “model zoo” of pre-trained models to allow researchers to use state-of-the-art NLP models in their applications. This paper gives an overview of the framework and its functionality. The framework is available on GitHub at https://github.com/zalandoresearch/flair.

2018

Contextual String Embeddings for Sequence Labeling
Alan Akbik | Duncan Blythe | Roland Vollgraf
Proceedings of the 27th International Conference on Computational Linguistics

Recent advances in language modeling using recurrent neural networks have made it viable to model language as distributions over characters. By learning to predict the next character on the basis of previous characters, such models have been shown to automatically internalize linguistic concepts such as words, sentences, subclauses and even sentiment. In this paper, we propose to leverage the internal states of a trained character language model to produce a novel type of word embedding which we refer to as contextual string embeddings. Our proposed embeddings have the distinct properties that they (a) are trained without any explicit notion of words and thus fundamentally model words as sequences of characters, and (b) are contextualized by their surrounding text, meaning that the same word will have different embeddings depending on its contextual use. We conduct a comparative evaluation against previous embeddings and find that our embeddings are highly useful for downstream tasks: across four classic sequence labeling tasks we consistently outperform the previous state-of-the-art. In particular, we significantly outperform previous work on English and German named entity recognition (NER), allowing us to report new state-of-the-art F1-scores on the CoNLL03 shared task. We release all code and pre-trained language models in a simple-to-use framework to the research community, to enable reproduction of these experiments and application of our proposed embeddings to other tasks: https://github.com/zalandoresearch/flair

FEIDEGGER: A Multi-modal Corpus of Fashion Images and Descriptions in German
Leonidas Lefakis | Alan Akbik | Roland Vollgraf
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

ZAP: An Open-Source Multilingual Annotation Projection Framework
Alan Akbik | Roland Vollgraf
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

CROWD-IN-THE-LOOP: A Hybrid Approach for Annotating Semantic Roles
Chenguang Wang | Alan Akbik | Laura Chiticariu | Yunyao Li | Fei Xia | Anbang Xu
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Crowdsourcing has proven to be an effective method for generating labeled data for a range of NLP tasks. However, multiple recent attempts of using crowdsourcing to generate gold-labeled training data for semantic role labeling (SRL) reported only modest results, indicating that SRL is perhaps too difficult a task to be effectively crowdsourced. In this paper, we postulate that while producing SRL annotation does require expert involvement in general, a large subset of SRL labeling tasks is in fact appropriate for the crowd. We present a novel workflow in which we employ a classifier to identify difficult annotation tasks and route each task either to experts or crowd workers according to their difficulties. Our experimental evaluation shows that the proposed approach reduces the workload for experts by over two-thirds, and thus significantly reduces the cost of producing SRL annotation at little loss in quality.

The Projector: An Interactive Annotation Projection Visualization Tool
Alan Akbik | Roland Vollgraf
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Previous works proposed annotation projection in parallel corpora to inexpensively generate treebanks or propbanks for new languages. In this approach, linguistic annotation is automatically transferred from a resource-rich source language (SL) to translations in a target language (TL). However, annotation projection may be adversely affected by translational divergences between specific language pairs. For this reason, previous work often required careful qualitative analysis of projectability of specific annotation in order to define strategies to address quality and coverage issues. In this demonstration, we present THE PROJECTOR, an interactive GUI designed to assist researchers in such analysis: it allows users to execute and visually inspect annotation projection in a range of different settings. We give an overview of the GUI, discuss use cases and illustrate how the tool can facilitate discussions with the research community.

2016

K-SRL: Instance-based Learning for Semantic Role Labeling
Alan Akbik | Yunyao Li
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Semantic role labeling (SRL) is the task of identifying and labeling predicate-argument structures in sentences with semantic frame and role labels. A known challenge in SRL is the large number of low-frequency exceptions in training data, which are highly context-specific and difficult to generalize. To overcome this challenge, we propose the use of instance-based learning that performs no explicit generalization, but rather extrapolates predictions from the most similar instances in the training data. We present a variant of k-nearest neighbors (kNN) classification with composite features to identify nearest neighbors for SRL. We show that high-quality predictions can be derived from a very small number of similar instances. In a comparative evaluation we experimentally demonstrate that our instance-based learning approach significantly outperforms current state-of-the-art systems on both in-domain and out-of-domain data, reaching F1-scores of 89.28% and 79.91% respectively.
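
Instance-based prediction of this kind can be illustrated with a generic kNN sketch. It is not the K-SRL variant itself: the feature strings, the overlap-count similarity, and the function name `knn_predict` are assumptions chosen to keep the example small.

```python
from collections import Counter
from typing import List, Set, Tuple


def knn_predict(
    query_features: Set[str],
    training_instances: List[Tuple[Set[str], str]],
    k: int = 3,
) -> str:
    """Predict a label by majority vote among the k most similar training instances."""
    # Similarity here is simply the number of shared features (assumption).
    ranked = sorted(
        training_instances,
        key=lambda inst: len(query_features & inst[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical feature sets for arguments of a predicate.
train = [
    ({"pred=give", "dep=nsubj", "pos=NNP"}, "A0"),
    ({"pred=give", "dep=nsubj", "pos=PRP"}, "A0"),
    ({"pred=give", "dep=dobj", "pos=NN"}, "A1"),
    ({"pred=take", "dep=dobj", "pos=NN"}, "A1"),
]
prediction = knn_predict({"pred=give", "dep=nsubj", "pos=NN"}, train, k=3)
```

No model is fitted in advance; the prediction is read directly off the nearest stored instances, which is what lets rare, context-specific cases influence the output.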

Multilingual Aliasing for Auto-Generating Proposition Banks
Alan Akbik | Xinyu Guan | Yunyao Li
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Semantic Role Labeling (SRL) is the task of identifying the predicate-argument structure in sentences with semantic frame and role labels. For the English language, the Proposition Bank provides both a lexicon of all possible semantic frames and large amounts of labeled training data. In order to expand SRL beyond English, previous work investigated automatic approaches based on parallel corpora to automatically generate Proposition Banks for new target languages (TLs). However, this approach heuristically produces the frame lexicon from word alignments, leading to a range of lexicon-level errors and inconsistencies. To address these issues, we propose to manually alias TL verbs to existing English frames. For instance, the German verb drehen may evoke several meanings, including “turn something” and “film something”. Accordingly, we alias the former to the frame TURN.01 and the latter to a group of frames that includes FILM.01 and SHOOT.03. We execute a large-scale manual aliasing effort for three target languages and apply the new lexicons to automatically generate large Proposition Banks for Chinese, French and German with manually curated frames. We present a detailed evaluation in which we find that our proposed approach significantly increases the quality and consistency of the generated Proposition Banks. We release these resources to the research community.

Multilingual Information Extraction with PolyglotIE
Alan Akbik | Laura Chiticariu | Marina Danilevsky | Yonas Kbrom | Yunyao Li | Huaiyu Zhu
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

We present PolyglotIE, a web-based tool for developing extractors that perform Information Extraction (IE) over multilingual data. Our tool has two core features: First, it allows users to develop extractors against a unified abstraction that is shared across a large set of natural languages. This means that an extractor needs only be created once for one language, but will then run on multilingual data without any additional effort or language-specific knowledge on part of the user. Second, it embeds this abstraction as a set of views within a declarative IE system, allowing users to quickly create extractors using a mature IE query language. We present PolyglotIE as a hands-on demo in which users can experiment with creating extractors, execute them on multilingual text and inspect extraction results. Using the UI, we discuss the challenges and potential of using unified, crosslingual semantic abstractions as basis for downstream applications. We demonstrate multilingual IE for 9 languages from 4 different language groups: English, German, French, Spanish, Japanese, Chinese, Arabic, Russian and Hindi.

Towards Semi-Automatic Generation of Proposition Banks for Low-Resource Languages
Alan Akbik | Vishwajeet Kumar | Yunyao Li
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

POLYGLOT: Multilingual Semantic Role Labeling with Unified Labels
Alan Akbik | Yunyao Li
Proceedings of ACL-2016 System Demonstrations

2015

Generating High Quality Proposition Banks for Multilingual Semantic Role Labeling
Alan Akbik | Laura Chiticariu | Marina Danilevsky | Yunyao Li | Shivakumar Vaithyanathan | Huaiyu Zhu
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

SCHNÄPPER: A Web Toolkit for Exploratory Relation Extraction
Thilo Michael | Alan Akbik
Proceedings of ACL-IJCNLP 2015 System Demonstrations

2014

Exploratory Relation Extraction in Large Text Corpora
Alan Akbik | Thilo Michael | Christoph Boden
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

Nerdle: Topic-Specific Question Answering Using Wikia Seeds
Umar Maqsud | Sebastian Arnold | Michael Hülfenhaus | Alan Akbik
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations

The Weltmodell: A Data-Driven Commonsense Knowledge Base
Alan Akbik | Thilo Michael
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present the Weltmodell, a commonsense knowledge base that was automatically generated from aggregated dependency parse fragments gathered from over 3.5 million English language books. We leverage the magnitude and diversity of this dataset to arrive at close to ten million distinct N-ary commonsense facts using techniques from open-domain Information Extraction (IE). Furthermore, we compute a range of measures of association and distributional similarity on this data. We present the results of our efforts using a browsable web demonstrator and publicly release all generated data for use and discussion by the research community. In this paper, we give an overview of our knowledge acquisition method and representation model, and present our web demonstrator.

Freepal: A Large Collection of Deep Lexico-Syntactic Patterns for Relation Extraction
Johannes Kirschnick | Alan Akbik | Holmer Hemsen
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The increasing availability and maturity of both scalable computing architectures and deep syntactic parsers is opening up new possibilities for Relation Extraction (RE) on large corpora of natural language text. In this paper, we present Freepal, a resource designed to assist with the creation of relation extractors for more than 5,000 relations defined in the Freebase knowledge base (KB). The resource consists of over 10 million distinct lexico-syntactic patterns extracted from dependency trees, each of which is assigned to one or more Freebase relations with different confidence strengths. We generate the resource by executing a large-scale distant supervision approach on the ClueWeb09 corpus to extract and parse over 260 million sentences labeled with Freebase entities and relations. We make Freepal freely available to the research community, and present a web demonstrator to the dataset, accessible from free-pal.appspot.com.

Proceedings of the First AHA!-Workshop on Information Discovery in Text
Alan Akbik | Larysa Visengeriyeva
Proceedings of the First AHA!-Workshop on Information Discovery in Text

Extracting a Repository of Events and Event References from News Clusters
Silvia Julinda | Christoph Boden | Alan Akbik
Proceedings of the First AHA!-Workshop on Information Discovery in Text

2013

Effective Selectional Restrictions for Unsupervised Relation Extraction
Alan Akbik | Larysa Visengeriyeva | Johannes Kirschnick | Alexander Löser
Proceedings of the Sixth International Joint Conference on Natural Language Processing

Propminer: A Workflow for Interactive Information Extraction and Exploration using Dependency Trees
Alan Akbik | Oresti Konomi | Michail Melnikov
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations

2012

Unsupervised Discovery of Relations and Discriminative Extraction Patterns
Alan Akbik | Larysa Visengeriyeva | Priska Herger | Holmer Hemsen | Alexander Löser
Proceedings of COLING 2012

KrakeN: N-ary Facts in Open Information Extraction
Alan Akbik | Alexander Löser
Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX)

Co-authors

Ansar Aynetdinov 4

Laura Chiticariu 3

Felix Hamborg 3

Alexander Löser 3

Elena Merdjanovska 3

Thilo Michael 3

Larysa Visengeriyeva 3

Tanja Bergmann 2

Duncan Blythe 2

Christoph Boden 2

Marina Danilevsky 2

Holmer Hemsen 2

Johannes Kirschnick 2

Sebastian Pohl 2

Susanna Rücker 2

Rodrigo Agerri 1

Richard Alies 1

Paul J. L. Ammann 1

Sebastian Arnold 1

Christoph Berns 1

Johanna Binnewitt 1

Adrian Breiding 1

Daniel Christoph 1

Max Dallabetta 1

Conrad Dobberstein 1

Kathrin Ehmann 1

Lukas Garbaciauskas 1

Kishaloy Halder 1

Priska Herger 1

Thomas Herold 1

Michael Hülfenhaus 1

Nicolaas Jedema 1

Silvia Julinda 1

Oresti Konomi 1

Vishwajeet Kumar 1

Leonidas Lefakis 1

Marion Legler 1

Michail Melnikov 1

Marcel Milich 1

David Schulte 1

Stefan Schweter 1

Olia Toporkov 1

Shivakumar Vaithyanathan 1

Martina Viell 1

Matthias Vogt 1

Chenguang Wang 1

Stefan Winnige 1

Angelo Ziletti 1

Venues