Katja Markert

2026

Whose Facts Win? LLM Source Preferences under Knowledge Conflicts
Jakob Schuster | Vagrant Gautam | Katja Markert
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

As large language models (LLMs) are more frequently used in retrieval-augmented generation pipelines, it is increasingly relevant to study their behavior under knowledge conflicts. Thus far, the role of the source of the retrieved information has gone unexamined. We address this gap with a novel framework to investigate how source preferences affect LLM resolution of inter-context knowledge conflicts in English, motivated by interdisciplinary research on credibility. By using synthetic sources, we study preferences for different types of sources without inheriting the biases of specific real-world sources. With a comprehensive, tightly-controlled evaluation of 13 open-weight LLMs, we find that LLMs prefer institutionally-corroborated information (e.g., government or newspaper sources) over information from people and social media. However, these source preferences can be reversed by simply repeating information from less credible sources. To mitigate repetition effects and maintain consistent preferences, we propose a novel method that reduces repetition bias by up to 79.2%, while also maintaining at least 72.5% of original preferences. We release all data and code to encourage future work on credibility and source preferences in knowledge-intensive NLP.

2025

ReproHum #0744-02: A Reproduction of the Human Evaluation of Meaning Preservation in “Factorising Meaning and Form for Intent-Preserving Paraphrasing”
Julius Steen | Katja Markert
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)

Assessing and improving the reproducibility of human evaluation studies is an ongoing concern in the area of natural language processing. As a contribution to this effort and a part of the ReproHum reproducibility project, we describe the reproduction of a human evaluation study (Hosking and Lapata, 2021) that evaluates meaning preservation in question paraphrasing systems. Our results indicate that the original study is highly reproducible given additional material and information provided by the authors. However, we also identify some aspects of the study that may make the annotation task potentially much easier than those in comparable studies. This might limit the representativeness of these results for best practices in study design.

2024

Bias in News Summarization: Measures, Pitfalls and Corpora
Julius Steen | Katja Markert
Findings of the Association for Computational Linguistics: ACL 2024

Summarization is an important application of large language models (LLMs). Most previous evaluation of summarization models has focused on their content selection, faithfulness, grammaticality and coherence. However, it is well known that LLMs can reproduce and reinforce harmful social biases. This raises the question: Do biases affect model outputs in a constrained setting like summarization? To help answer this question, we first motivate and introduce a number of definitions for biased behaviours in summarization models, along with practical operationalizations. Since we find that biases inherent to input documents can confound bias analysis in summaries, we propose a method to generate input documents with carefully controlled demographic attributes. This allows us to study summarizer behavior in a controlled setting, while still working with realistic input documents. We measure gender bias in English summaries generated by both purpose-built summarization models and general purpose chat models as a case study. We find content selection in single document summarization to be largely unaffected by gender bias, while hallucinations exhibit evidence of bias. To demonstrate the generality of our approach, we additionally investigate racial bias, including intersectional settings.

2023

SimCSum: Joint Learning of Simplification and Cross-lingual Summarization for Cross-lingual Science Journalism
Mehwish Fatima | Tim Kolber | Katja Markert | Michael Strube
Proceedings of the 4th New Frontiers in Summarization Workshop

Cross-lingual science journalism is a recently introduced task that generates popular science summaries of scientific articles different from the source language for non-expert readers. A popular science summary must contain salient content of the input document while focusing on coherence and comprehensibility. Meanwhile, generating a cross-lingual summary from the scientific texts in a local language for the targeted audience is challenging. Existing research on cross-lingual science journalism investigates the task with a pipeline model to combine text simplification and cross-lingual summarization. We extend the research in cross-lingual science journalism by introducing a novel, multi-task learning architecture that combines the aforementioned NLP tasks. Our approach is to jointly train the two high-level NLP tasks in SimCSum for generating cross-lingual popular science summaries. We investigate the performance of SimCSum against the pipeline model and several other strong baselines with several evaluation metrics and human evaluation. Overall, SimCSum demonstrates statistically significant improvements over the state-of-the-art on two non-synthetic cross-lingual scientific datasets. Furthermore, we conduct an in-depth investigation into the linguistic properties of generated summaries and an error analysis.

Can current NLI systems handle German word order? Investigating language model performance on a new German challenge set of minimal pairs
Ines Reinig | Katja Markert
Proceedings of the 15th International Conference on Computational Semantics

Compared to English, German word order is freer and therefore poses additional challenges for natural language inference (NLI). We create WOGLI (Word Order in German Language Inference), the first adversarial NLI dataset for German word order that has the following properties: (i) each premise has an entailed and a non-entailed hypothesis; (ii) premise and hypotheses differ only in word order and necessary morphological changes to mark case and number. In particular, each premise and its two hypotheses contain exactly the same lemmata. Our adversarial examples require the model to use morphological markers in order to recognise or reject entailment. We show that current German autoencoding models fine-tuned on translated NLI data can struggle on this challenge set, reflecting the fact that translated NLI datasets will not mirror all necessary language phenomena in the target language. We also examine performance after data augmentation as well as on related word order phenomena derived from WOGLI. Our datasets are publicly available at https://github.com/ireinig/wogli.

Nut-cracking Sledgehammers: Prioritizing Target Language Data over Bigger Language Models for Cross-Lingual Metaphor Detection
Jakob Schuster | Katja Markert
Proceedings of the 2023 CLASP Conference on Learning with Small Data (LSD)

In this work, we investigate cross-lingual methods for metaphor detection of adjective-noun phrases in three languages (English, German and Polish). We explore the potential of minimalistic neural networks supported by static embeddings as a light-weight alternative for large transformer-based language models. We measure performance in zero-shot experiments without access to annotated target language data and aim to find low-resource improvements for them by mainly focusing on a k-shot paradigm. Even by incorporating a small number of phrases from the target language, the gap in accuracy between our small networks and large transformer architectures can be bridged. Lastly, we suggest that the k-shot paradigm can even be applied to models using machine translation of training data.

With a Little Push, NLI Models can Robustly and Efficiently Predict Faithfulness
Julius Steen | Juri Opitz | Anette Frank | Katja Markert
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Conditional language models still generate unfaithful output that is not supported by their input. These unfaithful generations jeopardize trust in real-world applications such as summarization or human-machine interaction, motivating a need for automatic faithfulness metrics. To implement such metrics, NLI models seem attractive, since they solve a strongly related task that comes with a wealth of prior research and data. But recent research suggests that NLI models require costly additional machinery to perform reliably across datasets, e.g., by running inference on a Cartesian product of input and generated sentences, or supporting them with a question-generation/answering step. In this work we show that pure NLI models _can_ outperform more complex metrics when combining task-adaptive data augmentation with robust inference procedures. We propose: (1) Augmenting NLI training data to adapt NL inferences to the specificities of faithfulness prediction in dialogue; (2) Making use of both entailment and contradiction probabilities in NLI; and (3) Using Monte-Carlo dropout during inference. Applied to the TRUE benchmark, which combines faithfulness datasets across diverse domains and tasks, our approach strongly improves a vanilla NLI model and significantly outperforms previous work, while showing favourable computational cost.

2022

The Chinese Causative-Passive Homonymy Disambiguation: an adversarial Dataset for NLI and a Probing Task
Shanshan Xu | Katja Markert
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The disambiguation of causative-passive homonymy (CPH) is potentially tricky for machines, as the causative and the passive are not distinguished by the sentences’ syntactic structure. By transforming CPH disambiguation to a challenging natural language inference (NLI) task, we present the first Chinese Adversarial NLI challenge set (CANLI). We show that the pretrained transformer model RoBERTa, fine-tuned on an existing large-scale Chinese NLI benchmark dataset, performs poorly on CANLI. We also employ Word Sense Disambiguation as a probing task to investigate to what extent the CPH feature is captured in the model’s internal representation. We find that the model’s performance on CANLI does not correspond to its internal representation of CPH, which is the crucial linguistic ability central to the CANLI dataset. CANLI is available on Hugging Face Datasets (Lhoest et al., 2021) at https://huggingface.co/datasets/sxu/CANLI

How to Find Strong Summary Coherence Measures? A Toolbox and a Comparative Study for Summary Coherence Measure Evaluation
Julius Steen | Katja Markert
Proceedings of the 29th International Conference on Computational Linguistics

Automatically evaluating the coherence of summaries is of great significance both to enable cost-efficient summarizer evaluation and as a tool for improving coherence by selecting high-scoring candidate summaries. While many different approaches have been suggested to model summary coherence, they are often evaluated using disparate datasets and metrics. This makes it difficult to understand their relative performance and identify ways forward towards better summary coherence modelling. In this work, we conduct a large-scale investigation of various methods for summary coherence modelling on an even playing field. Additionally, we introduce two novel analysis measures, _intra-system correlation_ and _bias matrices_, that help identify biases in coherence measures and provide robustness against system-level confounders. While none of the currently available automatic coherence measures are able to assign reliable coherence scores to system summaries across all evaluation metrics, large-scale language models fine-tuned on self-supervised tasks show promising results, as long as fine-tuning takes into account that they need to generalize across different summary lengths.

Biographically Relevant Tweets – a New Dataset, Linguistic Analysis and Classification Experiments
Michael Wiegand | Rebecca Wilm | Katja Markert
Proceedings of the 29th International Conference on Computational Linguistics

We present a new dataset comprising tweets for the novel task of detecting biographically relevant utterances. Biographically relevant utterances are all those utterances that reveal some persistent and non-trivial information about the author of a tweet, e.g. habits, (dis)likes, family status, physical appearance, employment information, health issues etc. Unlike previous research we do not restrict biographical relevance to a small fixed set of pre-defined relations. Next to classification experiments employing state-of-the-art classifiers to establish strong baselines for future work, we carry out a linguistic analysis that compares the predictiveness of various high-level features. We also show that the task is different from established tasks, such as aspectual classification or sentiment analysis.

2021

How to Evaluate a Summarizer: Study Design and Statistical Analysis for Manual Linguistic Quality Evaluation
Julius Steen | Katja Markert
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Manual evaluation is essential to judge progress on automatic text summarization. However, we conduct a survey on recent summarization system papers that reveals little agreement on how to perform such evaluation studies. We conduct two evaluation experiments on two aspects of summaries’ linguistic quality (coherence and repetitiveness) to compare Likert-type and ranking annotations and show that the best choice of evaluation method can vary from one aspect to another. In our survey, we also find that study parameters such as the overall number of annotators and distribution of annotators to annotation items are often not fully reported and that subsequent statistical analysis ignores grouping factors arising from one annotator judging multiple summaries. Using our evaluation experiments, we show that the total number of annotators can have a strong impact on study power and that current statistical analysis methods can inflate type I error rates up to eight-fold. In addition, we highlight that for the purpose of system comparison the current practice of eliciting multiple judgements per summary leads to less powerful and reliable annotations given a fixed study budget.

2020

Doctor Who? Framing Through Names and Titles in German
Esther van den Berg | Katharina Korfhage | Josef Ruppenhofer | Michael Wiegand | Katja Markert
Proceedings of the Twelfth Language Resources and Evaluation Conference

Entity framing is the selection of aspects of an entity to promote a particular viewpoint towards that entity. We investigate entity framing of political figures through the use of names and titles in German online discourse, enhancing current research in entity framing through titling and naming that concentrates on English only. We collect tweets that mention prominent German politicians and annotate them for stance. We find that the formality of naming in these tweets correlates positively with their stance. This confirms sociolinguistic observations that naming and titling can have a status-indicating function and suggests that this function is dominant in German tweets mentioning political figures. We also find that this status-indicating function is much weaker in tweets from users that are politically left-leaning than in tweets by right-leaning users. This is in line with observations from moral psychology that left-leaning and right-leaning users assign different importance to maintaining social hierarchies.

Dataset Reproducibility and IR Methods in Timeline Summarization
Leo Born | Maximilian Bacher | Katja Markert
Proceedings of the Twelfth Language Resources and Evaluation Conference

Timeline summarization (TLS) generates a dated overview of real-world events based on event-specific corpora. The two standard datasets for this task were collected using Google searches for news reports on given events. Not only is this IR method not reproducible at different search times, it also uses components (such as document popularity) that are not always available for any large news corpus. It is unclear how TLS algorithms fare when provided with event corpora collected with varying IR methods. We therefore construct event-specific corpora from a large static background corpus, the newsroom dataset, using differing, relatively simple IR methods based on raw text alone. We show that the choice of IR method plays a crucial role in the performance of various TLS algorithms. A weak TLS algorithm can even match a stronger one by employing a stronger IR method in the data collection phase. Furthermore, the results of TLS systems are often highly sensitive to additional sentence filtering. We consequently advocate for integrating IR into the development of TLS systems and having a common static background corpus for evaluation of TLS systems.

Context in Informational Bias Detection
Esther van den Berg | Katja Markert
Proceedings of the 28th International Conference on Computational Linguistics

Informational bias is bias conveyed through sentences or clauses that provide tangential, speculative or background information that can sway readers’ opinions towards entities. By nature, informational bias is context-dependent, but previous work on informational bias detection has not explored the role of context beyond the sentence. In this paper, we explore four kinds of context for informational bias in English news articles: neighboring sentences, the full article, articles on the same event from other news publishers, and articles from the same domain (but potentially different events). We find that integrating event context improves classification performance over a very strong baseline. In addition, we perform the first error analysis of models on this task. We find that the best-performing context-inclusive model outperforms the baseline on longer sentences, and sentences from politically centrist articles.

An analysis of language models for metaphor recognition
Arthur Neidlein | Philip Wiesenbach | Katja Markert
Proceedings of the 28th International Conference on Computational Linguistics

We conduct a linguistic analysis of recent metaphor recognition systems, all of which are based on language models. We show that their performance, although reaching high F-scores, has considerable gaps from a linguistic perspective. First, they perform substantially worse on unconventional metaphors than on conventional ones. Second, they struggle with handling rarer word types. These two findings together suggest that a large part of the systems’ success is due to optimising the disambiguation of conventionalised, metaphoric word senses for specific words instead of modelling general properties of metaphors. As a positive result, the systems show increasing capabilities to recognise metaphoric readings of unseen words if synonyms or morphological variations of these words have been seen before, leading to enhanced generalisation beyond word sense disambiguation.

Discrete Optimization for Unsupervised Sentence Summarization with Word-Level Extraction
Raphael Schumann | Lili Mou | Yao Lu | Olga Vechtomova | Katja Markert
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Automatic sentence summarization produces a shorter version of a sentence, while preserving its most important information. A good summary is characterized by language fluency and high information overlap with the source sentence. We model these two aspects in an unsupervised objective function, consisting of language modeling and semantic similarity metrics. We search for a high-scoring summary by discrete optimization. Our proposed method achieves a new state of the art for unsupervised sentence summarization according to ROUGE scores. Additionally, we demonstrate that the commonly reported ROUGE F1 metric is sensitive to summary length. Since this is unwittingly exploited in recent work, we emphasize that future evaluation should explicitly group summarization systems by output length brackets.

2019

Not My President: How Names and Titles Frame Political Figures
Esther van den Berg | Katharina Korfhage | Josef Ruppenhofer | Michael Wiegand | Katja Markert
Proceedings of the Third Workshop on Natural Language Processing and Computational Social Science

Naming and titling have been discussed in sociolinguistics as markers of status or solidarity. However, these functions have not been studied on a larger scale or for social media data. We collect a corpus of tweets mentioning presidents of six G20 countries by various naming forms. We show that naming variation relates to stance towards the president in a way that is suggestive of a framing effect mediated by respectfulness. This confirms sociolinguistic theory of naming and titling as markers of status.

Abstractive Timeline Summarization
Julius Steen | Katja Markert
Proceedings of the 2nd Workshop on New Frontiers in Summarization

Timeline summarization (TLS) automatically identifies key dates of major events and provides short descriptions of what happened on these dates. Previous approaches to TLS have focused on extractive methods. In contrast, we suggest an abstractive timeline summarization system. Our system is entirely unsupervised, which makes it especially suited to TLS where there are very few gold summaries available for training of supervised systems. In addition, we present the first abstractive oracle experiments for TLS. Our system outperforms extractive competitors in terms of ROUGE when the number of input documents is high and the output requires strong compression. In these cases, our oracle experiments confirm that our approach also has a higher upper bound for ROUGE scores than extractive methods. A study with human judges shows that our abstractive system also produces output that is easy to read and understand.

2018

A Temporally Sensitive Submodularity Framework for Timeline Summarization
Sebastian Martschat | Katja Markert
Proceedings of the 22nd Conference on Computational Natural Language Learning

Timeline summarization (TLS) creates an overview of long-running events via dated daily summaries for the most important dates. TLS differs from standard multi-document summarization (MDS) in the importance of date selection, interdependencies between summaries of different dates and by having very short summaries compared to the number of corpus documents. However, we show that MDS optimization models using submodular functions can be adapted to yield well-performing TLS models by designing objective functions and constraints that model the temporal dimension inherent in TLS. Importantly, these adaptations retain the elegance and advantages of the original MDS models (clear separation of features and inference, performance guarantees and scalability, little need for supervision) that current TLS-specific models lack.

Unrestricted Bridging Resolution
Yufang Hou | Katja Markert | Michael Strube
Computational Linguistics, Volume 44, Issue 2 - June 2018

In contrast to identity anaphors, which indicate coreference between a noun phrase and its antecedent, bridging anaphors link to their antecedent(s) via lexico-semantic, frame, or encyclopedic relations. Bridging resolution involves recognizing bridging anaphors and finding links to antecedents. In contrast to most prior work, we tackle both problems. Our work also follows a more wide-ranging definition of bridging than most previous work and does not impose any restrictions on the type of bridging anaphora or relations between anaphor and antecedent. We create a corpus (ISNotes) annotated for information status (IS), bridging being one of the IS subcategories. The annotations reach high reliability for all categories and marginal reliability for the bridging subcategory. We use a two-stage statistical global inference method for bridging resolution. Given all mentions in a document, the first stage, bridging anaphora recognition, recognizes bridging anaphors as a subtask of learning fine-grained IS. We use a cascading collective classification method where (i) collective classification allows us to investigate relations among several mentions and autocorrelation among IS classes and (ii) cascaded classification allows us to tackle class imbalance, important for minority classes such as bridging. We show that our method outperforms current methods both for IS recognition overall as well as for bridging, specifically. The second stage, bridging antecedent selection, finds the antecedents for all predicted bridging anaphors. We investigate the phenomenon of semantically or syntactically related bridging anaphors that share the same antecedent, a phenomenon we call sibling anaphors. We show that taking sibling anaphors into account in a joint inference model improves antecedent selection performance. 
In addition, we develop semantic and salience features for antecedent selection and suggest a novel method to build the candidate antecedent list for an anaphor, using the discourse scope of the anaphor. Our model outperforms previous work significantly.

Distinguishing affixoid formations from compounds
Josef Ruppenhofer | Michael Wiegand | Rebecca Wilm | Katja Markert
Proceedings of the 27th International Conference on Computational Linguistics

We study German affixoids, a type of morpheme in between affixes and free stems. Several properties have been associated with them – increased productivity; a bleached semantics, which is often evaluative and/or intensifying and thus of relevance to sentiment analysis; and the existence of a free morpheme counterpart – but these have not been validated empirically. In experiments on a new data set that we make available, we put these key assumptions from the morphological literature to the test and show that despite the fact that affixoids generate many low-frequency formations, we can classify these as affixoid or non-affixoid instances with a best F1-score of 74%.

2017

Automatic Extraction of News Values from Headline Text
Alicja Piotrkowicz | Vania Dimitrova | Katja Markert
Proceedings of the Student Research Workshop at the 15th Conference of the European Chapter of the Association for Computational Linguistics

Headlines play a crucial role in attracting audiences’ attention to online artefacts (e.g. news articles, videos, blogs). The ability to carry out an automatic, large-scale analysis of headlines is critical to facilitate the selection and prioritisation of a large volume of digital content. In journalism studies, news content has been extensively studied using manually annotated news values - factors used implicitly and explicitly when making decisions on the selection and prioritisation of news items. This paper presents the first attempt at a fully automatic extraction of news values from headline text. The news values extraction methods are applied to a large headlines corpus collected from The Guardian, and evaluated against a manually annotated gold standard. A crowdsourcing survey indicates that news values affect people’s decisions to click on a headline, supporting the need for automatic news values detection.

Improving ROUGE for Timeline Summarization
Sebastian Martschat | Katja Markert
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

Current evaluation metrics for timeline summarization either ignore the temporal aspect of the task or require strict date matching. We introduce variants of ROUGE that allow alignment of daily summaries via temporal distance or semantic similarity. We argue for the suitability of these variants in a theoretical analysis and demonstrate it in a battery of task-specific tests.

Fine Grained Citation Span for References in Wikipedia
Besnik Fetahu | Katja Markert | Avishek Anand
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Verifiability is one of the core editing principles in Wikipedia, where editors are encouraged to provide citations for the added content. For a Wikipedia article, determining what content is covered by a citation, i.e. the citation span, is not trivial, yet it is an important aspect of automated citation finding for uncovered content and of fact assessment. We address the problem of determining the citation span in Wikipedia articles. We approach this problem by classifying which textual fragments in an article are covered or hold true given a citation. We propose a sequence classification approach where, for a paragraph and a citation, we determine the citation span at a fine-grained level. We provide a thorough experimental evaluation and compare our approach against baselines adopted from the scientific domain, where we show improvement for all evaluation metrics.
