Matthias Hagen

2025

Segmentation of Argumentative Texts by Key Statements for Argument Mining from the Web
Ines Zelch | Matthias Hagen | Benno Stein | Johannes Kiesel
Proceedings of the 12th Argument mining Workshop

Argument mining is the task of identifying the argument structure of a text: claims, premises, support/attack relations, etc. However, determining the complete argument structure can be quite involved, especially for unpolished texts from online forums, while for many applications the identification of argumentative key statements would suffice (e.g., for argument search). To this end, we introduce and investigate the new task of segmenting an argumentative text by its key statements. We formalize the task, create a first dataset from online communities, propose an evaluation scheme, and conduct a pilot study with several approaches. Interestingly, our experimental results indicate that none of the tested approaches (even LLM-based ones) can actually satisfactorily solve key statement segmentation yet.

pdf bib abs

Reproducing the Argument Quality Prediction of Project Debater
Ines Zelch | Matthias Hagen | Benno Stein | Johannes Kiesel
Proceedings of the 12th Argument mining Workshop

A crucial task when analyzing arguments is to determine their quality. Especially when you have to choose from a large number of suitable arguments, the determination of a reliable argument quality value is of great benefit. Probably the best-known model for determining such an argument quality value was developed in IBM’s Project Debater and made available to the research community free of charge via an API. In fact, the model was never open and the API is no longer available. In this paper, IBM’s model is reproduced using the freely available training data and the description in the corresponding publication. Our reproduction achieves similar results on the test data as described in the original publication. Further, the predicted quality scores of reproduction and original show a very high correlation (Pearson’s r=0.9) on external data.

2024

pdf bib abs

DeepCT-enhanced Lexical Argument Retrieval
Alexander Bondarenko | Maik Fröbe | Danik Hollatz | Jan Heinrich Merker | Matthias Hagen
Proceedings of the 11th Workshop on Argument Mining (ArgMining 2024)

The recent Touché lab’s argument retrieval task focuses on controversial topics like ‘Should bottled water be banned?’ and asks to retrieve relevant pro/con arguments. Interestingly, the most effective systems submitted to that task still are based on lexical retrieval models like BM25. In other domains, neural retrievers that capture semantics are more effective than lexical baselines. To add more “semantics” to argument retrieval, we propose to combine lexical models with DeepCT-based document term weights. Our evaluation shows that our approach is more effective than all the systems submitted to the Touché lab while being on par with modern neural re-rankers that themselves are computationally more expensive.

pdf bib abs

Task-Oriented Paraphrase Analytics
Marcel Gohsen | Matthias Hagen | Martin Potthast | Benno Stein
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Since paraphrasing is an ill-defined task, the term “paraphrasing” covers text transformation tasks with different characteristics. Consequently, existing paraphrasing studies have applied quite different (explicit and implicit) criteria as to when a pair of texts is to be considered a paraphrase, all of which amount to postulating a certain level of semantic or lexical similarity. In this paper, we conduct a literature review and propose a taxonomy to organize the 25 identified paraphrasing (sub-)tasks. Using classifiers trained to identify the tasks that a given paraphrasing instance fits, we find that the distributions of task-specific instances in the known paraphrase corpora vary substantially. This means that the use of these corpora, without the respective paraphrase conditions being clearly defined (which is the normal case), must lead to incomparable and misleading results.

2023

pdf bib abs

Stance-Aware Re-Ranking for Non-factual Comparative Queries
Jan Heinrich Reimer | Alexander Bondarenko | Maik Fröbe | Matthias Hagen
Proceedings of the 10th Workshop on Argument Mining

We propose a re-ranking approach to improve the retrieval effectiveness for non-factual comparative queries like ‘Which city is better, London or Paris?’ based on whether the results express a stance towards the comparison objects (London vs. Paris) or not. Applied to the 26 runs submitted to the Touché 2022 task on comparative argument retrieval, our stance-aware re-ranking significantly improves the retrieval effectiveness for all runs when perfect oracle-style stance labels are available. With our most effective practical stance detector based on GPT-3.5 (F₁ of 0.49 on four stance classes), our re-ranking still improves the effectiveness for all runs but only six improvements are significant. Artificially “deteriorating” the oracle-style labels, we further find that an F₁ of 0.90 for stance detection is necessary to significantly improve the retrieval effectiveness for the best run via stance-aware re-ranking.

pdf bib abs

Paraphrase Acquisition from Image Captions
Marcel Gohsen | Matthias Hagen | Martin Potthast | Benno Stein
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

We propose to use image captions from the Web as a previously underutilized resource for paraphrases (i.e., texts with the same “message”) and to create and analyze a corresponding dataset. When an image is reused on the Web, an original caption is often assigned. We hypothesize that different captions for the same image naturally form a set of mutual paraphrases. To demonstrate the suitability of this idea, we analyze captions in the English Wikipedia, where editors frequently relabel the same image for different articles. The paper introduces the underlying mining technology, the resulting Wikipedia-IPC dataset, and compares known paraphrase corpora with respect to their syntactic and semantic paraphrase similarity to our new resource. In this context, we introduce characteristic maps along the two similarity dimensions to identify the style of paraphrases coming from different sources. An annotation study demonstrates the high reliability of the algorithmically determined characteristic maps.

pdf bib abs

SemEval-2023 Task 5: Clickbait Spoiling
Maik Fröbe | Benno Stein | Tim Gollub | Matthias Hagen | Martin Potthast
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

In this overview paper, we report on the second PAN~Clickbait Challenge hosted as Task~5 at SemEval~2023. The challenge’s focus is to better support social media users by automatically generating short spoilers that close the curiosity gap induced by a clickbait post. We organized two subtasks: (1) spoiler type classification to assess what kind of spoiler a clickbait post warrants (e.g., a phrase), and (2) spoiler generation to generate an actual spoiler for a clickbait post.

2022

pdf bib abs

Clickbait Spoiling via Question Answering and Passage Retrieval
Matthias Hagen | Maik Fröbe | Artur Jurk | Martin Potthast
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We introduce and study the task of clickbait spoiling: generating a short text that satisfies the curiosity induced by a clickbait post. Clickbait links to a web page and advertises its contents by arousing curiosity instead of providing an informative summary. Our contributions are approaches to classify the type of spoiler needed (i.e., a phrase or a passage), and to generate appropriate spoilers. A large-scale evaluation and error analysis on a new corpus of 5,000 manually spoiled clickbait posts—the Webis Clickbait Spoiling Corpus 2022—shows that our spoiler type classifier achieves an accuracy of 80%, while the question answering model DeBERTa-large outperforms all others in generating spoilers for both types.

pdf bib abs

Mining Health-related Cause-Effect Statements with High Precision at Large Scale
Ferdinand Schlatt | Dieter Bettin | Matthias Hagen | Benno Stein | Martin Potthast
Proceedings of the 29th International Conference on Computational Linguistics

An efficient assessment of the health relatedness of text passages is important to mine the web at scale to conduct health sociological analyses or to develop a health search engine. We propose a new efficient and effective termhood score for predicting the health relatedness of phrases and sentences, which achieves 69% recall at over 90% precision on a web dataset with cause-effect statements. It is more effective than state-of-the-art medical entity linkers and as effective but much faster than BERT-based approaches. Using our method, we compile the Webis Medical CauseNet 2022, a new resource of 7.8 million health-related cause-effect statements such as “Studies show that stress induces insomnia” in which the cause (‘stress’) and effect (‘insomnia’) are labeled.

pdf bib abs

At least 5% of questions submitted to search engines ask about cause-effect relationships in some way. To support the development of tailored approaches that can answer such questions, we construct Webis-CausalQA-22, a benchmark corpus of 1.1 million causal questions with answers. We distinguish different types of causal questions using a novel typology derived from a data-driven, manual analysis of questions from ten large question answering (QA) datasets. Using high-precision lexical rules, we extract causal questions of each type from these datasets to create our corpus. As an initial baseline, the state-of-the-art QA model UnifiedQA achieves a ROUGE-L F1 score of 0.48 on our new benchmark.

2020

pdf bib abs

Efficient Pairwise Annotation of Argument Quality
Lukas Gienapp | Benno Stein | Matthias Hagen | Martin Potthast
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We present an efficient annotation framework for argument quality, a feature difficult to be measured reliably as per previous work. A stochastic transitivity model is combined with an effective sampling strategy to infer high-quality labels with low effort from crowdsourced pairwise judgments. The model’s capabilities are showcased by compiling Webis-ArgQuality-20, an argument quality corpus that comprises scores for rhetorical, logical, dialectical, and overall quality inferred from a total of 41,859 pairwise judgments among 1,271 arguments. With up to 93% cost savings, our approach significantly outperforms existing annotation procedures. Furthermore, novel insight into argument quality is provided through statistical analysis, and a new aggregation method to infer overall quality from individual quality dimensions is proposed.

pdf bib abs

We propose a shared task on abstractive snippet generation for web pages, a novel task of generating query-biased abstractive summaries for documents that are to be shown on a search results page. Conventional snippets are extractive in nature, which recently gave rise to copyright claims from news publishers as well as a new copyright legislation being passed in the European Union, limiting the fair use of web page contents for snippets. At the same time, abstractive summarization has matured considerably in recent years, potentially allowing for more personalization of snippets in the future. Taken together, these facts render further research into generating abstractive snippets both timely and promising.

2019

pdf bib abs

Generalizing Unmasking for Short Texts
Janek Bevendorff | Benno Stein | Matthias Hagen | Martin Potthast
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Authorship verification is the problem of inferring whether two texts were written by the same author. For this task, unmasking is one of the most robust approaches as of today with the major shortcoming of only being applicable to book-length texts. In this paper, we present a generalized unmasking approach which allows for authorship verification of texts as short as four printed pages with very high precision at an adjustable recall tradeoff. Our generalized approach therefore reduces the required material by orders of magnitude, making unmasking applicable to authorship cases of more practical proportions. The new approach is on par with other state-of-the-art techniques that are optimized for texts of this length: it achieves accuracies of 75–80%, while also allowing for easy adjustment to forensic scenarios that require higher levels of confidence in the classification.

pdf bib abs

We present TARGER, an open source neural argument mining framework for tagging arguments in free input texts and for keyword-based retrieval of arguments from an argument-tagged web-scale corpus. The currently available models are pre-trained on three recent argument mining datasets and enable the use of neural argument mining without any reproducibility effort on the user’s side. The open source code ensures portability to other domains and use cases.

pdf bib abs

Categorizing Comparative Sentences
Alexander Panchenko | Alexander Bondarenko | Mirco Franzek | Matthias Hagen | Chris Biemann
Proceedings of the 6th Workshop on Argument Mining

We tackle the tasks of automatically identifying comparative sentences and categorizing the intended preference (e.g., “Python has better NLP libraries than MATLAB” → Python, better, MATLAB). To this end, we manually annotate 7,199 sentences for 217 distinct target item pairs from several domains (27% of the sentences contain an oriented comparison in the sense of “better” or “worse”). A gradient boosting model based on pre-trained sentence embeddings reaches an F1 score of 85% in our experimental evaluation. The model can be used to extract comparative sentences for pro/con argumentation in comparative / argument search engines or debating technologies.

pdf bib abs

Bias Analysis and Mitigation in the Evaluation of Authorship Verification
Janek Bevendorff | Matthias Hagen | Benno Stein | Martin Potthast
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

The PAN series of shared tasks is well known for its continuous and high quality research in the field of digital text forensics. Among others, PAN contributions include original corpora, tailored benchmarks, and standardized experimentation platforms. In this paper we review, theoretically and practically, the authorship verification task and conclude that the underlying experiment design cannot guarantee pushing forward the state of the art—in fact, it allows for top benchmarking with a surprisingly straightforward approach. In this regard, we present a “Basic and Fairly Flawed” (BAFF) authorship verifier that is on a par with the best approaches submitted so far, and that illustrates sources of bias that should be eliminated. We pinpoint these sources in the evaluation chain and present a refined authorship corpus as effective countermeasure.

pdf bib abs

Heuristic Authorship Obfuscation
Janek Bevendorff | Martin Potthast | Matthias Hagen | Benno Stein
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Authorship verification is the task of determining whether two texts were written by the same author. We deal with the adversary task, called authorship obfuscation: preventing verification by altering a to-be-obfuscated text. Our new obfuscation approach (1) models writing style difference as the Jensen-Shannon distance between the character n-gram distributions of texts, and (2) manipulates an author’s subconsciously encoded writing style in a sophisticated manner using heuristic search. To obfuscate, we analyze the huge space of textual variants for a paraphrased version of the to-be-obfuscated text that has a sufficient Jensen-Shannon distance at minimal costs in terms of text quality. We analyze, quantify, and illustrate the rationale of this approach, define paraphrasing operators, derive obfuscation thresholds, and develop an effective obfuscation framework. Our authorship obfuscation approach defeats state-of-the-art verification approaches, including unmasking and compression models, while keeping text changes at a minimum.

pdf bib abs

Unraveling the Search Space of Abusive Language in Wikipedia with Dynamic Lexicon Acquisition
Wei-Fan Chen | Khalid Al Khatib | Matthias Hagen | Henning Wachsmuth | Benno Stein
Proceedings of the Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda

Many discussions on online platforms suffer from users offending others by using abusive terminology, threatening each other, or being sarcastic. Since an automatic detection of abusive language can support human moderators of online discussion platforms, detecting abusiveness has recently received increased attention. However, the existing approaches simply train one classifier for the whole variety of abusiveness. In contrast, our approach is to distinguish explicitly abusive cases from the more “shadowed” ones. By dynamically extending a lexicon of abusive terms (e.g., including new obfuscations of abusive terms), our approach can support a moderator with explicit unraveled explanations for why something was flagged as abusive: due to known explicitly abusive terms, due to newly detected (obfuscated) terms, or due to shadowed cases.

2018

pdf bib abs

Clickbait has become a nuisance on social media. To address the urging task of clickbait detection, we constructed a new corpus of 38,517 annotated Twitter tweets, the Webis Clickbait Corpus 2017. To avoid biases in terms of publisher and topic, tweets were sampled from the top 27 most retweeted news publishers, covering a period of 150 days. Each tweet has been annotated on 4-point scale by five annotators recruited at Amazon’s Mechanical Turk. The corpus has been employed to evaluate 12 clickbait detectors submitted to the Clickbait Challenge 2017. Download: https://webis.de/data/webis-clickbait-17.html Challenge: https://clickbait-challenge.org

pdf bib abs

Modeling Deliberative Argumentation Strategies on Wikipedia
Khalid Al-Khatib | Henning Wachsmuth | Kevin Lang | Jakob Herpel | Matthias Hagen | Benno Stein
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

This paper studies how the argumentation strategies of participants in deliberative discussions can be supported computationally. Our ultimate goal is to predict the best next deliberative move of each participant. In this paper, we present a model for deliberative discussions and we illustrate its operationalization. Previous models have been built manually based on a small set of discussions, resulting in a level of abstraction that is not suitable for move recommendation. In contrast, we derive our model statistically from several types of metadata that can be used for move description. Applied to six million discussions from Wikipedia talk pages, our approach results in a model with 13 categories along three dimensions: discourse acts, argumentative relations, and frames. On this basis, we automatically generate a corpus with about 200,000 turns, labeled for the 13 categories. We then operationalize the model with three supervised classifiers and provide evidence that the proposed categories can be predicted.

2017

pdf bib abs

Patterns of Argumentation Strategies across Topics
Khalid Al-Khatib | Henning Wachsmuth | Matthias Hagen | Benno Stein
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

This paper presents an analysis of argumentation strategies in news editorials within and across topics. Given nearly 29,000 argumentative editorials from the New York Times, we develop two machine learning models, one for determining an editorial’s topic, and one for identifying evidence types in the editorial. Based on the distribution and structure of the identified types, we analyze the usage patterns of argumentation strategies among 12 different topics. We detect several common patterns that provide insights into the manifestation of argumentation strategies. Also, our experiments reveal clear correlations between the topics and the detected patterns.

2016

pdf bib

Cross-Domain Mining of Argumentative Text through Distant Supervision
Khalid Al-Khatib | Henning Wachsmuth | Matthias Hagen | Jonas Köhler | Benno Stein
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib abs

A News Editorial Corpus for Mining Argumentation Strategies
Khalid Al-Khatib | Henning Wachsmuth | Johannes Kiesel | Matthias Hagen | Benno Stein
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Many argumentative texts, and news editorials in particular, follow a specific strategy to persuade their readers of some opinion or attitude. This includes decisions such as when to tell an anecdote or where to support an assumption with statistics, which is reflected by the composition of different types of argumentative discourse units in a text. While several argument mining corpora have recently been published, they do not allow the study of argumentation strategies due to incomplete or coarse-grained unit annotations. This paper presents a novel corpus with 300 editorials from three diverse news portals that provides the basis for mining argumentation strategies. Each unit in all editorials has been assigned one of six types by three annotators with a high Fleiss’ Kappa agreement of 0.56. We investigate various challenges of the annotation process and we conduct a first corpus analysis. Our results reveal different strategies across the news portals, exemplifying the benefit of studying editorials—a so far underresourced text genre in argument mining.

Matthias Hagen

2025

2024

2023

2022

2020

2019

2018

2017

2016

2015

2014

2013

Co-authors

Venues