Andreas Marfurt

2026

Open LLMs enable AI practitioners to control development costs by building on an existing foundation for downstream applications. While offering substantial promise, current models often fail to meet the needs of users needing open solutions aligned with responsible AI principles, including data compliance, transparency, and inclusivity. In this work, we present Apertus, a fully open suite of large language models (LLMs) designed to address responsibility shortcomings in today’s open model ecosystem, namely data responsibility and global representation. Unlike many prior models that release weights without reproducible data pipelines or regard for content-owner rights, Apertus models are pretrained exclusively on openly available data, retroactively respecting robots.txt exclusions and filtering for non-permissive, toxic, and personally identifiable content. To mitigate risks of data memorization, we also adopt the Goldfish objective during pretraining, strongly suppressing verbatim recall of data while retaining downstream task performance. Apertus also drastically expands multilingual coverage, training on 15T tokens from over approximately 1800 languages, with about 40% of pretraining data allocated to non-English content. Released at 8B and 70B scales, Apertus approaches state-of-the-art results among fully open models on multilingual benchmarks, rivaling or surpassing open-weight counterparts.

2025

pdf bib abs

Analyzing the Evolution of Scientific Misconduct Based on the Language of Retracted Papers
Christof Bless | Andreas Waldis | Angelina Parfenova | Maria A. Rodriguez | Andreas Marfurt
Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025)

Amid rising numbers of organizations producing counterfeit scholarly articles, it is important to quantify the prevalence of scientific misconduct.We assess the feasibility of automated text-based methods to determine the rate of scientific misconduct by analyzing linguistic differences between retracted and non-retracted papers.We find that retracted works show distinct phrase patterns and higher word repetition.Motivated by this, we evaluatetwo misconduct detection methods, a mixture distribution approach and a Transformer-based one.The best models achieve high accuracy (>0.9 F1) on detection of paper mill articles and automatically generated content, making them viable tools for flagging papers for closer review.We apply the classifiers to more than 300,000 paper abstracts, to quantify misconduct over time and find that our estimation methods accurately reproduce trends observed in the real data.

pdf bib abs

Text Annotation via Inductive Coding: Comparing Human Experts to LLMs in Qualitative Data Analysis
Angelina Parfenova | Andreas Marfurt | Jürgen Pfeffer | Alexander Denzler
Findings of the Association for Computational Linguistics: NAACL 2025

This paper investigates the automation of qualitative data analysis, focusing on inductive coding using large language models (LLMs). Unlike traditional approaches that rely on deductive methods with predefined labels, this research investigates the inductive process where labels emerge from the data. The study evaluates the performance of six open-source LLMs compared to human experts. As part of the evaluation, experts rated the perceived difficulty of the quotes they coded. The results reveal a peculiar dichotomy: human coders consistently perform well when labeling complex sentences but struggle with simpler ones, while LLMs exhibit the opposite trend. Additionally, the study explores systematic deviations in both human and LLM-generated labels by comparing them to the golden standard from the test set. While human annotations may sometimes differ from the golden standard, they are often rated more favorably by other humans. In contrast, some LLMs demonstrate closer alignment with the true labels but receive lower evaluations from experts.

2024

pdf bib

Can NLP models and methods be applied to EEG data?
Lino Casanova | Andreas Marfurt
Proceedings of the 9th edition of the Swiss Text Analytics Conference

pdf bib abs

Generating Interpretations of Policy Announcements
Andreas Marfurt | Ashley Thornton | David Sylvan | James Henderson
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities

Recent advances in language modeling have focused on (potentially multiple-choice) question answering, open-ended generation, or math and coding problems. We look at a more nuanced task: the interpretation of statements of political actors. To this end, we present a dataset of policy announcements and corresponding annotated interpretations, on the topic of US foreign policy relations with Russia in the years 1993 up to 2016. We analyze the performance of finetuning standard sequence-to-sequence models of varying sizes on predicting the annotated interpretations and compare them to few-shot prompted large language models. We find that 1) model size is not the main factor for success on this task, 2) finetuning smaller models provides both quantitatively and qualitatively superior results to in-context learning with large language models, but 3) large language models pick up the annotation format and approximate the category distribution with just a few in-context examples.

2022

pdf bib abs

A Corpus and Evaluation for Predicting Semi-Structured Human Annotations
Andreas Marfurt | Ashley Thornton | David Sylvan | Lonneke van der Plas | James Henderson
Proceedings of the Second Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

A wide variety of tasks have been framed as text-to-text tasks to allow processing by sequence-to-sequence models. We propose a new task of generating a semi-structured interpretation of a source document. The interpretation is semi-structured in that it contains mandatory and optional fields with free-text information. This structure is surfaced by human annotations, which we standardize and convert to text format. We then propose an evaluation technique that is generally applicable to any such semi-structured annotation, called equivalence classes evaluation. The evaluation technique is efficient and scalable; it creates a large number of evaluation instances from a comparably cheap clustering of the free-text information by domain experts. For our task, we release a dataset about the monetary policy of the Federal Reserve. On this corpus, our evaluation shows larger differences between pretrained models than standard text generation metrics.

pdf bib abs

Unsupervised Token-level Hallucination Detection from Summary Generation By-products
Andreas Marfurt | James Henderson
Proceedings of the Second Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

Hallucinations in abstractive summarization are model generations that are unfaithful to the source document. Current methods for detecting hallucinations operate mostly on noun phrases and named entities, and restrict themselves to the XSum dataset, which is known to have hallucinations in 3 out of 4 training examples (Maynez et al., 2020). We instead consider the CNN/DailyMail dataset where the summarization model has not seen abnormally many hallucinations during training. We automatically detect candidate hallucinations at the token level, irrespective of its part of speech. Our detection comes essentially for free, as we only use information the model already produces during generation of the summary. This enables practitioners to jointly generate a summary and identify possible hallucinations, with minimal overhead. We repurpose an existing factuality dataset and create our own token-level annotations. The evaluation on these two datasets shows that our model achieves better precision-recall tradeoffs than its competitors, which additionally require a model forward pass.

2021

pdf bib abs

Sentence-level Planning for Especially Abstractive Summarization
Andreas Marfurt | James Henderson
Proceedings of the Third Workshop on New Frontiers in Summarization

Abstractive summarization models heavily rely on copy mechanisms, such as the pointer network or attention, to achieve good performance, measured by textual overlap with reference summaries. As a result, the generated summaries stay close to the formulations in the source document. We propose the *sentence planner* model to generate more abstractive summaries. It includes a hierarchical decoder that first generates a representation for the next summary sentence, and then conditions the word generator on this representation. Our generated summaries are more abstractive and at the same time achieve high ROUGE scores when compared to human reference summaries. We verify the effectiveness of our design decisions with extensive evaluations.