Johannes Kiesel

2025

Segmentation of Argumentative Texts by Key Statements for Argument Mining from the Web
Ines Zelch | Matthias Hagen | Benno Stein | Johannes Kiesel
Proceedings of the 12th Argument mining Workshop

Argument mining is the task of identifying the argument structure of a text: claims, premises, support/attack relations, etc. However, determining the complete argument structure can be quite involved, especially for unpolished texts from online forums, while for many applications the identification of argumentative key statements would suffice (e.g., for argument search). To this end, we introduce and investigate the new task of segmenting an argumentative text by its key statements. We formalize the task, create a first dataset from online communities, propose an evaluation scheme, and conduct a pilot study with several approaches. Interestingly, our experimental results indicate that none of the tested approaches (even LLM-based ones) can actually satisfactorily solve key statement segmentation yet.

pdf bib abs

Webis at CQs-Gen 2025: Prompting and Reranking for Critical Questions
Midhun Kanadan | Johannes Kiesel | Maximilian Heinrich | Benno Stein
Proceedings of the 12th Argument mining Workshop

This paper reports on the submission of team extitWebis to the Critical Question Generation shared task at the 12th Workshop on Argument Mining (ArgMining 2025). Our approach is a fully automated two-stage pipeline that first prompts a large language model (LLM) to generate candidate critical questions for a given argumentative intervention, and then reranks the generated questions as per a classifier’s confidence in their usefulness. For the generation stage, we tested zero-shot, few-shot, and chain-of-thought prompting strategies. For the reranking stage, we used a ModernBERT classifier that we fine-tuned on either the validation set or an augmented version. Among our submissions, the best-performing configuration achieved a test score of 0.57 and ranked 5th in the shared task. Submissions that use reranking consistently outperformed baseline submissions without reranking across all metrics. Our results demonstrate that combining openweight LLMs with reranking significantly improves the quality of the resulting critical questions.

pdf bib abs

Reproducing the Argument Quality Prediction of Project Debater
Ines Zelch | Matthias Hagen | Benno Stein | Johannes Kiesel
Proceedings of the 12th Argument mining Workshop

A crucial task when analyzing arguments is to determine their quality. Especially when you have to choose from a large number of suitable arguments, the determination of a reliable argument quality value is of great benefit. Probably the best-known model for determining such an argument quality value was developed in IBM’s Project Debater and made available to the research community free of charge via an API. In fact, the model was never open and the API is no longer available. In this paper, IBM’s model is reproduced using the freely available training data and the description in the corresponding publication. Our reproduction achieves similar results on the test data as described in the original publication. Further, the predicted quality scores of reproduction and original show a very high correlation (Pearson’s r=0.9) on external data.

2024

pdf bib abs

While human values play a crucial role in making arguments persuasive, we currently lack the necessary extensive datasets to develop methods for analyzing the values underlying these arguments on a large scale. To address this gap, we present the Touché23-ValueEval dataset, an expansion of the Webis-ArgValues-22 dataset. We collected and annotated an additional 4780 new arguments, doubling the dataset’s size to 9324 arguments. These arguments were sourced from six diverse sources, covering religious texts, community discussions, free-text arguments, newspaper editorials, and political debates. Each argument is annotated by three crowdworkers for 54 human values, following the methodology established in the original dataset. The Touché23-ValueEval dataset was utilized in the SemEval 2023 Task 4. ValueEval: Identification of Human Values behind Arguments, where an ensemble of transformer models demonstrated state-of-the-art performance. Furthermore, our experiments show that a fine-tuned large language model, Llama-2-7B, achieves comparable results.

2023

pdf bib abs

Unveiling the Power of Argument Arrangement in Online Persuasive Discussions
Nailia Mirzakhmedova | Johannes Kiesel | Khalid Al-Khatib | Benno Stein
Findings of the Association for Computational Linguistics: EMNLP 2023

Previous research on argumentation in online discussions has largely focused on examining individual comments and neglected the interactive nature of discussions. In line with previous work, we represent individual comments as sequences of semantic argumentative unit types. However, because it is intuitively necessary for dialogical argumentation to address the opposing viewpoints, we extend this model by clustering type sequences into different argument arrangement patterns and representing discussions as sequences of these patterns. These sequences of patterns are a symbolic representation of argumentation strategies that capture the overall structure of discussions. Using this novel approach, we conduct an in-depth analysis of the strategies in 34,393 discussions from the online discussion forum Change My View and show that our discussion model is effective for persuasiveness prediction, outperforming LLM-based classifiers on the same data. Our results provide valuable insights into argumentation dynamics in online discussions and, through the presented prediction procedure, are of practical importance for writing assistance and persuasive text generation systems.

pdf bib abs

Argumentation is ubiquitous in natural language communication, from politics and media to everyday work and private life. Many arguments derive their persuasive power from human values, such as self-directed thought or tolerance, albeit often implicitly. These values are key to understanding the semantics of arguments, as they are generally accepted as justifications for why a particular option is ethically desirable. Can automated systems uncover the values on which an argument draws? To answer this question, 39 teams submitted runs to ValueEval’23. Using a multi-sourced dataset of over 9K arguments, the systems achieved F1-scores up to 0.87 (nature) and over 0.70 for three more of 20 universal value categories. However, many challenges remain, as evidenced by the low peak F1-score of 0.39 for stimulation, hedonism, face, and humility.

pdf bib abs

Topic Ontologies for Arguments
Yamen Ajjour | Johannes Kiesel | Benno Stein | Martin Potthast
Findings of the Association for Computational Linguistics: EACL 2023

Many computational argumentation tasks, such as stance classification, are topic-dependent: The effectiveness of approaches to these tasks depends largely on whether they are trained with arguments on the same topics as those on which they are tested. The key question is: What are these training topics? To answer this question, we take the first step of mapping the argumentation landscape with The Argument Ontology (TAO). TAO draws on three authoritative sources for argument topics: the World Economic Forum, Wikipedia’s list of controversial topics, and Debatepedia. By comparing the topics in our ontology with those in 59 argument corpora, we perform the first comprehensive assessment of their topic coverage. While TAO already covers most of the corpus topics, the corpus topics barely cover all the topics in TAO. This points to a new goal for corpus construction to achieve a broad topic coverage and thus better generalizability of computational argumentation approaches.

pdf bib abs

This paper reports on the submissions of Webis to the two subtasks of ImageArg 2023. For the subtask of argumentative stance classification, we reached an F1 score of 0.84 using a BERT model for sequence classification. For the subtask of image persuasiveness classification, we reached an F1 score of 0.56 using CLIP embeddings and a neural network model, achieving the best performance for this subtask in the competition. Our analysis reveals that seemingly clear sentences (e.g., “I support gun control”) are still problematic for our otherwise competitive stance classifier and that ignoring the tweet text for image persuasiveness prediction leads to a model that is similarly effective to our top-performing model.

2022

pdf bib abs

This paper studies the (often implicit) human values behind natural language arguments, such as to have freedom of thought or to be broadminded. Values are commonly accepted answers to why some option is desirable in the ethical sense and are thus essential both in real-world argumentation and theoretical argumentation frameworks. However, their large variety has been a major obstacle to modeling them in argument mining. To overcome this obstacle, we contribute an operationalization of human values, namely a multi-level taxonomy with 54 values that is in line with psychological research. Moreover, we provide a dataset of 5270 arguments from four geographical cultures, manually annotated for human values. First experiments with the automatic classification of human values are promising, with F₁-scores up to 0.81 and 0.25 on average.

pdf bib abs

SCAI-QReCC Shared Task on Conversational Question Answering
Svitlana Vakulenko | Johannes Kiesel | Maik Fröbe
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Search-Oriented Conversational AI (SCAI) is an established venue that regularly puts a spotlight upon the recent work advancing the field of conversational search. SCAI’21 was organised as an independent online event and featured a shared task on conversational question answering, on which this paper reports. The shared task featured three subtasks that correspond to three steps in conversational question answering: question rewriting, passage retrieval, and answer generation. This report discusses each subtask, but emphasizes the answer generation subtask as it attracted the most attention from the participants and we identified evaluation of answer correctness in the conversational settings as a major challenge and acurrent research gap. Alongside the automatic evaluation, we conducted two crowdsourcing experiments to collect annotations for answer plausibility and faithfulness. As a result of this shared task, the original conversational QA dataset used for evaluation was further extended with alternative correct answers produced by the participant systems.

2021

pdf bib abs

Image Retrieval for Arguments Using Stance-Aware Query Expansion
Johannes Kiesel | Nico Reichenbach | Benno Stein | Martin Potthast
Proceedings of the 8th Workshop on Argument Mining

Many forms of argumentation employ images as persuasive means, but research in argument mining has been focused on verbal argumentation so far. This paper shows how to integrate images into argument mining research, specifically into argument retrieval. By exploiting the sophisticated image representations of keyword-based image search, we propose to use semantic query expansion for both the pro and the con stance to retrieve “argumentative images” for the respective stance. Our results indicate that even simple expansions provide a strong baseline, reaching a precision@10 of 0.49 for images being (1) on-topic, (2) argumentative, and (3) on-stance. An in-depth analysis reveals a high topic dependence of the retrieval performance and shows the need to further investigate on images providing contextual information.

2020

pdf bib abs

The automatic summarization of argumentative texts has hardly been explored. This paper takes a further step in this direction, targeting news editorials, i.e., opinionated articles with a well-defined argumentation structure. With Webis-EditorialSum-2020, we present a corpus of 1330 carefully curated summaries for 266 news editorials. We evaluate these summaries based on a tailored annotation scheme, where a high-quality summary is expected to be thesis-indicative, persuasive, reasonable, concise, and self-contained. Our corpus contains at least three high-quality summaries for about 90% of the editorials, rendering it a valuable resource for the development and evaluation of summarization technology for long argumentative texts. We further report details of both, an in-depth corpus analysis, and the evaluation of two extractive summarization models.

2019

pdf bib abs

Hyperpartisan news is news that takes an extreme left-wing or right-wing standpoint. If one is able to reliably compute this meta information, news articles may be automatically tagged, this way encouraging or discouraging readers to consume the text. It is an open question how successfully hyperpartisan news detection can be automated, and the goal of this SemEval task was to shed light on the state of the art. We developed new resources for this purpose, including a manually labeled dataset with 1,273 articles, and a second dataset with 754,000 articles, labeled via distant supervision. The interest of the research community in our task exceeded all our expectations: The datasets were downloaded about 1,000 times, 322 teams registered, of which 184 configured a virtual machine on our shared task cloud service TIRA, of which in turn 42 teams submitted a valid run. The best team achieved an accuracy of 0.822 on a balanced sample (yes : no hyperpartisan) drawn from the manually tagged corpus; an ensemble of the submitted systems increased the accuracy by 0.048.

pdf bib abs

Today’s widely used annotation tools were designed for annotating typically short textual mentions of entities or relations, making their interface cumbersome to use for long(er) stretches of text, e.g, sentences running over several lines in a document. They also lack systematic support for hierarchically structured labels, i.e., one label being conceptually more general than another (e.g., anamnesis in relation to family anamnesis). Moreover, as a more fundamental shortcoming of today’s tools, they provide no continuous quality con trol mechanisms for the annotation process, an essential feature to intrinsically support iterative cycles in the development of annotation guidelines. We alleviated these problems by developing WAT-SL 2.0, an open-source web-based annotation tool for long-segment labeling, hierarchically structured label sets and built-ins for quality control.

2018

pdf bib abs

A Stylometric Inquiry into Hyperpartisan and Fake News
Martin Potthast | Johannes Kiesel | Kevin Reinartz | Janek Bevendorff | Benno Stein
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We report on a comparative style analysis of hyperpartisan (extremely one-sided) news and fake news. A corpus of 1,627 articles from 9 political publishers, three each from the mainstream, the hyperpartisan left, and the hyperpartisan right, have been fact-checked by professional journalists at BuzzFeed: 97% of the 299 fake news articles identified are also hyperpartisan. We show how a style analysis can distinguish hyperpartisan news from the mainstream (F1 = 0.78), and satire from both (F1 = 0.81). But stylometry is no silver bullet as style-based fake news detection does not work (F1 = 0.46). We further reveal that left-wing and right-wing news share significantly more stylistic similarities than either does with the mainstream. This result is robust: it has been confirmed by three different modeling approaches, one of which employs Unmasking in a novel way. Applications of our results include partisanship detection and pre-screening for semi-automatic fake news detection.

2017

pdf bib abs

Unit Segmentation of Argumentative Texts
Yamen Ajjour | Wei-Fan Chen | Johannes Kiesel | Henning Wachsmuth | Benno Stein
Proceedings of the 4th Workshop on Argument Mining

The segmentation of an argumentative text into argument units and their non-argumentative counterparts is the first step in identifying the argumentative structure of the text. Despite its importance for argument mining, unit segmentation has been approached only sporadically so far. This paper studies the major parameters of unit segmentation systematically. We explore the effectiveness of various features, when capturing words separately, along with their neighbors, or even along with the entire text. Each such context is reflected by one machine learning model that we evaluate within and across three domains of texts. Among the models, our new deep learning approach capturing the entire text turns out best within all domains, with an F-score of up to 88.54. While structural features generalize best across domains, the domain transfer remains hard, which points to major challenges of unit segmentation.

pdf bib abs

WAT-SL: A Customizable Web Annotation Tool for Segment Labeling
Johannes Kiesel | Henning Wachsmuth | Khalid Al-Khatib | Benno Stein
Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics

A frequent type of annotations in text corpora are labeled text segments. General-purpose annotation tools tend to be overly comprehensive, often making the annotation process slower and more error-prone. We present WAT-SL, a new web-based tool that is dedicated to segment labeling and highly customizable to the labeling task at hand. We outline its main features and exemplify how we used it for a crowdsourced corpus with labeled argument units.

2016

pdf bib abs

A News Editorial Corpus for Mining Argumentation Strategies
Khalid Al-Khatib | Henning Wachsmuth | Johannes Kiesel | Matthias Hagen | Benno Stein
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Many argumentative texts, and news editorials in particular, follow a specific strategy to persuade their readers of some opinion or attitude. This includes decisions such as when to tell an anecdote or where to support an assumption with statistics, which is reflected by the composition of different types of argumentative discourse units in a text. While several argument mining corpora have recently been published, they do not allow the study of argumentation strategies due to incomplete or coarse-grained unit annotations. This paper presents a novel corpus with 300 editorials from three diverse news portals that provides the basis for mining argumentation strategies. Each unit in all editorials has been assigned one of six types by three annotators with a high Fleiss’ Kappa agreement of 0.56. We investigate various challenges of the annotation process and we conduct a first corpus analysis. Our results reveal different strategies across the news portals, exemplifying the benefit of studying editorials—a so far underresourced text genre in argument mining.

2015

pdf bib

Sentiment Flow - A General Model of Web Review Argumentation
Henning Wachsmuth | Johannes Kiesel | Benno Stein
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib

A Shared Task on Argumentation Mining in Newspaper Editorials
Johannes Kiesel | Khalid Al-Khatib | Matthias Hagen | Benno Stein
Proceedings of the 2nd Workshop on Argumentation Mining

Johannes Kiesel

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

Co-authors

Venues