Martin Potthast


Revisiting Uncertainty-based Query Strategies for Active Learning with Transformers
Christopher Schröder | Andreas Niekler | Martin Potthast
Findings of the Association for Computational Linguistics: ACL 2022

Active learning is the iterative construction of a classification model through targeted labeling, enabling significant labeling cost savings. As most research on active learning was carried out before transformer-based language models (“transformers”) became popular, comparatively few papers have so far investigated how transformers can be combined with active learning, despite the practical importance of this question. This can be attributed to the fact that using state-of-the-art query strategies for transformers induces a prohibitive runtime overhead, which effectively nullifies or even outweighs the desired cost savings. For this reason, we revisit uncertainty-based query strategies, which had been largely outperformed before, but are particularly suited in the context of fine-tuning transformers. In an extensive evaluation, we connect transformers to experiments from previous research, assessing their performance on five widely used text classification benchmarks. For active learning with transformers, several other uncertainty-based approaches outperform the well-known prediction entropy query strategy, thereby challenging its status as the most popular uncertainty baseline in active learning for text classification.
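For readers unfamiliar with the prediction entropy query strategy named in this abstract, the following is a minimal sketch of the general idea, not the authors' implementation. It assumes the classifier's class probabilities for each unlabeled instance are already available; the function names and the toy pool are hypothetical.

```python
import math


def prediction_entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def query_most_uncertain(pool_probs, k):
    """Return the indices of the k pool instances with the highest entropy."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: prediction_entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical class probabilities for three unlabeled instances.
pool = [
    [0.98, 0.01, 0.01],  # confident prediction, low entropy
    [0.40, 0.35, 0.25],  # near-uniform prediction, high entropy
    [0.70, 0.20, 0.10],
]
selected = query_most_uncertain(pool, 2)
```

In an active learning loop, the selected instances would be sent to an annotator and the model retrained on the enlarged labeled set.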

Differential Bias: On the Perceptibility of Stance Imbalance in Argumentation
Alonso Palomino | Khalid Al Khatib | Martin Potthast | Benno Stein
Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022

Most research on natural language processing treats bias as an absolute concept: Based on a (probably complex) algorithmic analysis, a sentence, an article, or a text is classified as biased or not. Given that for humans the question of whether a text is biased can be difficult to answer or is answered contradictorily, we ask whether an “absolute bias classification” is a promising goal at all. We see the problem not in the complexity of interpreting language phenomena but in the diversity of sociocultural backgrounds of the readers, which cannot be handled uniformly: To decide whether a text has crossed the proverbial line between non-biased and biased is subjective. By asking “Is text X more [less, equally] biased than text Y?” we propose to analyze a simpler problem, which, by its construction, is rather independent of standpoints, views, or sociocultural aspects. In such a model, bias becomes a preference relation that induces a partial ordering from least biased to most biased texts without requiring a decision on where to draw the line. A prerequisite for this kind of bias model is the ability of humans to perceive relative bias differences in the first place. In our research, we selected a specific type of bias in argumentation, the stance bias, and designed a crowdsourcing study showing that differences in stance bias are perceptible when (light) support is provided through training or visual aid.

Mining Health-related Cause-Effect Statements with High Precision at Large Scale
Ferdinand Schlatt | Dieter Bettin | Matthias Hagen | Benno Stein | Martin Potthast
Proceedings of the 29th International Conference on Computational Linguistics

An efficient assessment of the health relatedness of text passages is important to mine the web at scale to conduct health sociological analyses or to develop a health search engine. We propose a new efficient and effective termhood score for predicting the health relatedness of phrases and sentences, which achieves 69% recall at over 90% precision on a web dataset with cause-effect statements. It is more effective than state-of-the-art medical entity linkers and as effective but much faster than BERT-based approaches. Using our method, we compile the Webis Medical CauseNet 2022, a new resource of 7.8 million health-related cause-effect statements such as “Studies show that stress induces insomnia” in which the cause (‘stress’) and effect (‘insomnia’) are labeled.

CausalQA: A Benchmark for Causal Question Answering
Alexander Bondarenko | Magdalena Wolska | Stefan Heindorf | Lukas Blübaum | Axel-Cyrille Ngonga Ngomo | Benno Stein | Pavel Braslavski | Matthias Hagen | Martin Potthast
Proceedings of the 29th International Conference on Computational Linguistics

At least 5% of questions submitted to search engines ask about cause-effect relationships in some way. To support the development of tailored approaches that can answer such questions, we construct Webis-CausalQA-22, a benchmark corpus of 1.1 million causal questions with answers. We distinguish different types of causal questions using a novel typology derived from a data-driven, manual analysis of questions from ten large question answering (QA) datasets. Using high-precision lexical rules, we extract causal questions of each type from these datasets to create our corpus. As an initial baseline, the state-of-the-art QA model UnifiedQA achieves a ROUGE-L F1 score of 0.48 on our new benchmark.

Language Models as Context-sensitive Word Search Engines
Matti Wiegmann | Michael Völske | Benno Stein | Martin Potthast
Proceedings of the First Workshop on Intelligent and Interactive Writing Assistants (In2Writing 2022)

Context-sensitive word search engines are writing assistants that support word choice, phrasing, and idiomatic language use by indexing large-scale n-gram collections and implementing a wildcard search. However, search results become unreliable with increasing context size (e.g., n>=5), when observations become sparse. This paper proposes two strategies for word search with larger n, based on masked and conditional language modeling. We build such search engines using BERT and BART and compare their capabilities in answering English context queries with those of the n-gram-based word search engine Netspeak. Our proposed strategies score within 5 percentage points MRR of n-gram collections while answering up to 5 times as many queries.
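The MRR (mean reciprocal rank) figure quoted in this abstract can be illustrated with a short sketch. This is a generic formulation of the metric under the assumption that each query has exactly one expected word; it is not taken from the paper, and the function name and example data are hypothetical.

```python
def mean_reciprocal_rank(ranked_results, expected):
    """Average of 1/rank of the expected word per query (0 if it is absent)."""
    total = 0.0
    for results, target in zip(ranked_results, expected):
        if target in results:
            total += 1.0 / (results.index(target) + 1)
    return total / len(expected)


# Hypothetical ranked suggestions for three wildcard queries.
ranked = [
    ["a", "b"],       # expected word at rank 1
    ["x", "y", "z"],  # expected word at rank 3
    ["q"],            # expected word not returned
]
mrr = mean_reciprocal_rank(ranked, ["a", "z", "missing"])
```

A query whose expected word is never returned contributes zero, so the metric rewards both finding the word and ranking it highly.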

Clickbait Spoiling via Question Answering and Passage Retrieval
Matthias Hagen | Maik Fröbe | Artur Jurk | Martin Potthast
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We introduce and study the task of clickbait spoiling: generating a short text that satisfies the curiosity induced by a clickbait post. Clickbait links to a web page and advertises its contents by arousing curiosity instead of providing an informative summary. Our contributions are approaches to classify the type of spoiler needed (i.e., a phrase or a passage), and to generate appropriate spoilers. A large-scale evaluation and error analysis on a new corpus of 5,000 manually spoiled clickbait posts—the Webis Clickbait Spoiling Corpus 2022—shows that our spoiler type classifier achieves an accuracy of 80%, while the question answering model DeBERTa-large outperforms all others in generating spoilers for both types.


On Classifying whether Two Texts are on the Same Side of an Argument
Erik Körner | Gregor Wiedemann | Ahmad Dawar Hakimi | Gerhard Heyer | Martin Potthast
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

To ease the difficulty of argument stance classification, the task of same side stance classification (S3C) has been proposed. In contrast to actual stance classification, which requires a substantial amount of domain knowledge to identify whether an argument is in favor or against a certain issue, it is argued that, for S3C, only argument similarity within stances needs to be learned to successfully solve the task. We evaluate several transformer-based approaches on the dataset of the recent S3C shared task, followed by an in-depth evaluation and error analysis of our model and the task’s hypothesis. We show that, although we achieve state-of-the-art results, our model fails to generalize both within as well as across topics and domains when adjusting the sampling strategy of the training and test set to a more adversarial scenario. Our evaluation shows that current state-of-the-art approaches cannot determine same side stance by considering only domain-independent linguistic similarity features, but appear to require domain knowledge and semantic inference, too.

Summary Explorer: Visualizing the State of the Art in Text Summarization
Shahbaz Syed | Tariq Yousef | Khalid Al Khatib | Stefan Jänicke | Martin Potthast
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

This paper introduces Summary Explorer, a new tool to support the manual inspection of text summarization systems by compiling the outputs of 55 state-of-the-art single document summarization approaches on three benchmark datasets, and visually exploring them during a qualitative assessment. The underlying design of the tool considers three well-known summary quality criteria (coverage, faithfulness, and position bias), encapsulated in a guided assessment based on tailored visualizations. The tool complements existing approaches for locally debugging summarization models and improves upon them. The tool is available at

Counter-Argument Generation by Attacking Weak Premises
Milad Alshomary | Shahbaz Syed | Arkajit Dhar | Martin Potthast | Henning Wachsmuth
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Generating Informative Conclusions for Argumentative Texts
Shahbaz Syed | Khalid Al Khatib | Milad Alshomary | Henning Wachsmuth | Martin Potthast
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Beyond Metadata: What Paper Authors Say About Corpora They Use
Nikolay Kolyada | Martin Potthast | Benno Stein
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Casting the Same Sentiment Classification Problem
Erik Körner | Ahmad Dawar Hakimi | Gerhard Heyer | Martin Potthast
Findings of the Association for Computational Linguistics: EMNLP 2021

We introduce and study a problem variant of sentiment analysis, namely the “same sentiment classification problem”, where, given a pair of texts, the task is to determine if they have the same sentiment, disregarding the actual sentiment polarity. Among other things, our goal is to enable a more topic-agnostic sentiment classification. We study the problem using the Yelp business review dataset, demonstrating how sentiment data needs to be prepared for this task, and then carry out sequence pair classification using the BERT language model. In a series of experiments, we achieve an accuracy above 83% for category subsets across topics, and 89% on average.

Image Retrieval for Arguments Using Stance-Aware Query Expansion
Johannes Kiesel | Nico Reichenbach | Benno Stein | Martin Potthast
Proceedings of the 8th Workshop on Argument Mining

Many forms of argumentation employ images as persuasive means, but research in argument mining has been focused on verbal argumentation so far. This paper shows how to integrate images into argument mining research, specifically into argument retrieval. By exploiting the sophisticated image representations of keyword-based image search, we propose to use semantic query expansion for both the pro and the con stance to retrieve “argumentative images” for the respective stance. Our results indicate that even simple expansions provide a strong baseline, reaching a precision@10 of 0.49 for images being (1) on-topic, (2) argumentative, and (3) on-stance. An in-depth analysis reveals a high topic dependence of the retrieval performance and shows the need to further investigate images that provide contextual information.

Key Point Analysis via Contrastive Learning and Extractive Argument Summarization
Milad Alshomary | Timon Gurcke | Shahbaz Syed | Philipp Heinisch | Maximilian Spliethöver | Philipp Cimiano | Martin Potthast | Henning Wachsmuth
Proceedings of the 8th Workshop on Argument Mining

Key point analysis is the task of extracting a set of concise and high-level statements from a given collection of arguments, representing the gist of these arguments. This paper presents our proposed approach to the Key Point Analysis Shared Task, colocated with the 8th Workshop on Argument Mining. The approach integrates two complementary components. One component employs contrastive learning via a siamese neural network for matching arguments to key points; the other is a graph-based extractive summarization model for generating key points. In both automatic and manual evaluation, our approach was ranked best among all submissions to the shared task.


Crawling and Preprocessing Mailing Lists At Scale for Dialog Analysis
Janek Bevendorff | Khalid Al Khatib | Martin Potthast | Benno Stein
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

This paper introduces the Webis Gmane Email Corpus 2019, the largest publicly available and fully preprocessed email corpus to date. We crawled more than 153 million emails from 14,699 mailing lists and segmented them into semantically consistent components using a new neural segmentation model. With 96% accuracy on 15 classes of email segments, our model achieves state-of-the-art performance while being more efficient to train than previous ones. All data, code, and trained models are made freely available alongside the paper.

Target Inference in Argument Conclusion Generation
Milad Alshomary | Shahbaz Syed | Martin Potthast | Henning Wachsmuth
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

In argumentation, people state premises to reason towards a conclusion. The conclusion conveys a stance towards some target, such as a concept or statement. Often, the conclusion remains implicit, though, since it is self-evident in a discussion or left out for rhetorical reasons. However, the conclusion is key to understanding an argument and, hence, to any application that processes argumentation. We thus study the question to what extent an argument’s conclusion can be reconstructed from its premises. In particular, we argue here that a decisive step is to infer a conclusion’s target, and we hypothesize that this target is related to the premises’ targets. We develop two complementary target inference approaches: one ranks premise targets and selects the top-ranked target as the conclusion target, the other finds a new conclusion target in a learned embedding space using a triplet neural network. Our evaluation on corpora from two domains indicates that a hybrid of both approaches is best, outperforming several strong baselines. According to human annotators, we infer a reasonably adequate conclusion target in 89% of the cases.

Efficient Pairwise Annotation of Argument Quality
Lukas Gienapp | Benno Stein | Matthias Hagen | Martin Potthast
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We present an efficient annotation framework for argument quality, a feature that previous work has found difficult to measure reliably. A stochastic transitivity model is combined with an effective sampling strategy to infer high-quality labels with low effort from crowdsourced pairwise judgments. The model’s capabilities are showcased by compiling Webis-ArgQuality-20, an argument quality corpus that comprises scores for rhetorical, logical, dialectical, and overall quality inferred from a total of 41,859 pairwise judgments among 1,271 arguments. With up to 93% cost savings, our approach significantly outperforms existing annotation procedures. Furthermore, novel insight into argument quality is provided through statistical analysis, and a new aggregation method to infer overall quality from individual quality dimensions is proposed.

News Editorials: Towards Summarizing Long Argumentative Texts
Shahbaz Syed | Roxanne El Baff | Johannes Kiesel | Khalid Al Khatib | Benno Stein | Martin Potthast
Proceedings of the 28th International Conference on Computational Linguistics

The automatic summarization of argumentative texts has hardly been explored. This paper takes a further step in this direction, targeting news editorials, i.e., opinionated articles with a well-defined argumentation structure. With Webis-EditorialSum-2020, we present a corpus of 1330 carefully curated summaries for 266 news editorials. We evaluate these summaries based on a tailored annotation scheme, where a high-quality summary is expected to be thesis-indicative, persuasive, reasonable, concise, and self-contained. Our corpus contains at least three high-quality summaries for about 90% of the editorials, rendering it a valuable resource for the development and evaluation of summarization technology for long argumentative texts. We further report details of both an in-depth corpus analysis and the evaluation of two extractive summarization models.

Task Proposal: Abstractive Snippet Generation for Web Pages
Shahbaz Syed | Wei-Fan Chen | Matthias Hagen | Benno Stein | Henning Wachsmuth | Martin Potthast
Proceedings of the 13th International Conference on Natural Language Generation

We propose a shared task on abstractive snippet generation for web pages, a novel task of generating query-biased abstractive summaries for documents that are to be shown on a search results page. Conventional snippets are extractive in nature, which recently gave rise to copyright claims from news publishers as well as a new copyright legislation being passed in the European Union, limiting the fair use of web page contents for snippets. At the same time, abstractive summarization has matured considerably in recent years, potentially allowing for more personalization of snippets in the future. Taken together, these facts render further research into generating abstractive snippets both timely and promising.


SemEval-2019 Task 4: Hyperpartisan News Detection
Johannes Kiesel | Maria Mestre | Rishabh Shukla | Emmanuel Vincent | Payam Adineh | David Corney | Benno Stein | Martin Potthast
Proceedings of the 13th International Workshop on Semantic Evaluation

Hyperpartisan news is news that takes an extreme left-wing or right-wing standpoint. If one is able to reliably compute this meta information, news articles may be automatically tagged, thereby encouraging readers to consume the text or discouraging them from doing so. It is an open question how successfully hyperpartisan news detection can be automated, and the goal of this SemEval task was to shed light on the state of the art. We developed new resources for this purpose, including a manually labeled dataset with 1,273 articles, and a second dataset with 754,000 articles, labeled via distant supervision. The interest of the research community in our task exceeded all our expectations: The datasets were downloaded about 1,000 times, 322 teams registered, of which 184 configured a virtual machine on our shared task cloud service TIRA, of which in turn 42 teams submitted a valid run. The best team achieved an accuracy of 0.822 on a balanced sample (yes : no hyperpartisan) drawn from the manually tagged corpus; an ensemble of the submitted systems increased the accuracy by 0.048.

Generalizing Unmasking for Short Texts
Janek Bevendorff | Benno Stein | Matthias Hagen | Martin Potthast
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Authorship verification is the problem of inferring whether two texts were written by the same author. For this task, unmasking is one of the most robust approaches as of today with the major shortcoming of only being applicable to book-length texts. In this paper, we present a generalized unmasking approach which allows for authorship verification of texts as short as four printed pages with very high precision at an adjustable recall tradeoff. Our generalized approach therefore reduces the required material by orders of magnitude, making unmasking applicable to authorship cases of more practical proportions. The new approach is on par with other state-of-the-art techniques that are optimized for texts of this length: it achieves accuracies of 75–80%, while also allowing for easy adjustment to forensic scenarios that require higher levels of confidence in the classification.

Towards Summarization for Social Media - Results of the TL;DR Challenge
Shahbaz Syed | Michael Völske | Nedim Lipka | Benno Stein | Hinrich Schütze | Martin Potthast
Proceedings of the 12th International Conference on Natural Language Generation

In this paper, we report on the results of the TL;DR challenge, discussing an extensive manual evaluation of the expected properties of a good summary based on analyzing the comments provided by human annotators.

Heuristic Authorship Obfuscation
Janek Bevendorff | Martin Potthast | Matthias Hagen | Benno Stein
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Authorship verification is the task of determining whether two texts were written by the same author. We deal with the adversarial task, called authorship obfuscation: preventing verification by altering a to-be-obfuscated text. Our new obfuscation approach (1) models writing style difference as the Jensen-Shannon distance between the character n-gram distributions of texts, and (2) manipulates an author’s subconsciously encoded writing style in a sophisticated manner using heuristic search. To obfuscate, we analyze the huge space of textual variants for a paraphrased version of the to-be-obfuscated text that has a sufficient Jensen-Shannon distance at minimal cost in terms of text quality. We analyze, quantify, and illustrate the rationale of this approach, define paraphrasing operators, derive obfuscation thresholds, and develop an effective obfuscation framework. Our authorship obfuscation approach defeats state-of-the-art verification approaches, including unmasking and compression models, while keeping text changes at a minimum.
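The style measure named in this abstract, the Jensen-Shannon distance between character n-gram distributions, can be sketched as follows. This is an illustrative reading rather than the authors' code: the choice of n = 3, the base-2 logarithm, the absence of any text normalization, and the function names are all assumptions made for the example.

```python
import math
from collections import Counter


def char_ngram_distribution(text, n=3):
    """Relative frequencies of the character n-grams in a text (n=3 is assumed)."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def jensen_shannon_distance(p, q):
    """Square root of the base-2 Jensen-Shannon divergence, a value in [0, 1]."""
    divergence = 0.0
    for gram in set(p) | set(q):
        pg, qg = p.get(gram, 0.0), q.get(gram, 0.0)
        mg = 0.5 * (pg + qg)  # mixture distribution
        if pg > 0.0:
            divergence += 0.5 * pg * math.log2(pg / mg)
        if qg > 0.0:
            divergence += 0.5 * qg * math.log2(qg / mg)
    return math.sqrt(max(divergence, 0.0))


same = jensen_shannon_distance(
    char_ngram_distribution("the quick brown fox"),
    char_ngram_distribution("the quick brown fox"),
)
disjoint = jensen_shannon_distance(
    char_ngram_distribution("aaaa"),
    char_ngram_distribution("bbbb"),
)
```

Under these assumptions, identical texts have distance 0 and texts sharing no n-grams have distance 1; an obfuscator in the spirit of the abstract would search for paraphrases that push this value above some threshold.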

Celebrity Profiling
Matti Wiegmann | Benno Stein | Martin Potthast
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Celebrities are among the most prolific users of social media, promoting their personas and rallying followers. This activity is closely tied to genuine writing samples, which makes them worthy research subjects in many respects, not least profiling. With this paper we introduce the Webis Celebrity Corpus 2019. For its construction the Twitter feeds of 71,706 verified accounts have been carefully linked with their respective Wikidata items, crawling both. After cleansing, the resulting profiles contain an average of 29,968 words per profile and up to 239 pieces of personal information. A cross-evaluation that checked the correct association of Twitter account and Wikidata item revealed an error rate of only 0.6%, rendering the profiles highly reliable. Our corpus comprises a wide cross-section of local and global celebrities, forming a unique combination of scale, profile comprehensiveness, and label reliability. We further establish the state of the art’s profiling performance by evaluating the winning approaches submitted to the PAN gender prediction tasks in a transfer learning experiment. They are only outperformed by our own deep learning approach, which we also use to exemplify celebrity occupation prediction for the first time.

Bias Analysis and Mitigation in the Evaluation of Authorship Verification
Janek Bevendorff | Matthias Hagen | Benno Stein | Martin Potthast
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

The PAN series of shared tasks is well known for its continuous and high quality research in the field of digital text forensics. Among others, PAN contributions include original corpora, tailored benchmarks, and standardized experimentation platforms. In this paper we review, theoretically and practically, the authorship verification task and conclude that the underlying experiment design cannot guarantee pushing forward the state of the art—in fact, it allows for top benchmarking with a surprisingly straightforward approach. In this regard, we present a “Basic and Fairly Flawed” (BAFF) authorship verifier that is on a par with the best approaches submitted so far, and that illustrates sources of bias that should be eliminated. We pinpoint these sources in the evaluation chain and present a refined authorship corpus as effective countermeasure.


Task Proposal: The TL;DR Challenge
Shahbaz Syed | Michael Völske | Martin Potthast | Nedim Lipka | Benno Stein | Hinrich Schütze
Proceedings of the 11th International Conference on Natural Language Generation

The TL;DR challenge fosters research in abstractive summarization of informal text, the largest and fastest-growing source of textual data on the web, which has been overlooked by summarization research so far. The challenge owes its name to the frequent practice of social media users to supplement long posts with a “TL;DR”—for “too long; didn’t read”—followed by a short summary as a courtesy to those who would otherwise reply with the exact same abbreviation to indicate they did not care to read a post for its apparent length. Posts featuring TL;DR summaries form an excellent ground truth for summarization, and by tapping into this resource for the first time, we have mined millions of training examples from social media, opening the door to all kinds of generative models.

A Stylometric Inquiry into Hyperpartisan and Fake News
Martin Potthast | Johannes Kiesel | Kevin Reinartz | Janek Bevendorff | Benno Stein
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We report on a comparative style analysis of hyperpartisan (extremely one-sided) news and fake news. A corpus of 1,627 articles from 9 political publishers, three each from the mainstream, the hyperpartisan left, and the hyperpartisan right, has been fact-checked by professional journalists at BuzzFeed: 97% of the 299 fake news articles identified are also hyperpartisan. We show how a style analysis can distinguish hyperpartisan news from the mainstream (F1 = 0.78), and satire from both (F1 = 0.81). But stylometry is no silver bullet as style-based fake news detection does not work (F1 = 0.46). We further reveal that left-wing and right-wing news share significantly more stylistic similarities than either does with the mainstream. This result is robust: it has been confirmed by three different modeling approaches, one of which employs Unmasking in a novel way. Applications of our results include partisanship detection and pre-screening for semi-automatic fake news detection.

CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
Daniel Zeman | Jan Hajič | Martin Popel | Martin Potthast | Milan Straka | Filip Ginter | Joakim Nivre | Slav Petrov
Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

Every year, the Conference on Computational Natural Language Learning (CoNLL) features a shared task, in which participants train and test their learning systems on the same data sets. In 2018, one of two tasks was devoted to learning dependency parsers for a large number of languages, in a real-world setting without any gold-standard annotation on test input. All test sets followed a unified annotation scheme, namely that of Universal Dependencies. This shared task constitutes a 2nd edition—the first one took place in 2017 (Zeman et al., 2017); the main metric from 2017 has been kept, allowing for easy comparison, also in 2018, and two new main metrics have been used. New datasets added to the Universal Dependencies collection between mid-2017 and the spring of 2018 have contributed to increased difficulty of the task this year. In this overview paper, we define the task and the updated evaluation methodology, describe data preparation, report and analyze the main results, and provide a brief categorization of the different approaches of the participating systems.

Crowdsourcing a Large Corpus of Clickbait on Twitter
Martin Potthast | Tim Gollub | Kristof Komlossy | Sebastian Schuster | Matti Wiegmann | Erika Patricia Garces Fernandez | Matthias Hagen | Benno Stein
Proceedings of the 27th International Conference on Computational Linguistics

Clickbait has become a nuisance on social media. To address the urgent task of clickbait detection, we constructed a new corpus of 38,517 annotated Twitter tweets, the Webis Clickbait Corpus 2017. To avoid biases in terms of publisher and topic, tweets were sampled from the top 27 most retweeted news publishers, covering a period of 150 days. Each tweet has been annotated on a 4-point scale by five annotators recruited at Amazon’s Mechanical Turk. The corpus has been employed to evaluate 12 clickbait detectors submitted to the Clickbait Challenge 2017. Download: Challenge:


TL;DR: Mining Reddit to Learn Automatic Summarization
Michael Völske | Martin Potthast | Shahbaz Syed | Benno Stein
Proceedings of the Workshop on New Frontiers in Summarization

Recent advances in automatic text summarization have used deep neural networks to generate high-quality abstractive summaries, but the performance of these models strongly depends on large amounts of suitable training data. We propose a new method for mining social media for author-provided summaries, taking advantage of the common practice of appending a “TL;DR” to long posts. A case study using a large Reddit crawl yields the Webis-TLDR-17 dataset, complementing existing corpora primarily from the news genre. Our technique is likely applicable to other social media sites and general web crawls.

Building an Argument Search Engine for the Web
Henning Wachsmuth | Martin Potthast | Khalid Al-Khatib | Yamen Ajjour | Jana Puschmann | Jiani Qu | Jonas Dorsch | Viorel Morari | Janek Bevendorff | Benno Stein
Proceedings of the 4th Workshop on Argument Mining

Computational argumentation is expected to play a critical role in the future of web search. To make this happen, many search-related questions must be revisited, such as how people query for arguments, how to mine arguments from the web, or how to rank them. In this paper, we develop an argument search framework for studying these and further questions. The framework allows for the composition of approaches to acquiring, mining, assessing, indexing, querying, retrieving, ranking, and presenting arguments while relying on standard infrastructure and interfaces. Based on the framework, we build a prototype search engine, called args, that relies on an initial, freely accessible index of nearly 300k arguments crawled from reliable web resources. The framework and the argument search engine are intended as an environment for collaborative research on computational argumentation and its practical evaluation.

CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
Daniel Zeman | Martin Popel | Milan Straka | Jan Hajič | Joakim Nivre | Filip Ginter | Juhani Luotolahti | Sampo Pyysalo | Slav Petrov | Martin Potthast | Francis Tyers | Elena Badmaeva | Memduh Gokirmak | Anna Nedoluzhko | Silvie Cinková | Jan Hajič jr. | Jaroslava Hlaváčová | Václava Kettnerová | Zdeňka Urešová | Jenna Kanerva | Stina Ojala | Anna Missilä | Christopher D. Manning | Sebastian Schuster | Siva Reddy | Dima Taji | Nizar Habash | Herman Leung | Marie-Catherine de Marneffe | Manuela Sanguinetti | Maria Simi | Hiroshi Kanayama | Valeria de Paiva | Kira Droganova | Héctor Martínez Alonso | Çağrı Çöltekin | Umut Sulubacak | Hans Uszkoreit | Vivien Macketanz | Aljoscha Burchardt | Kim Harris | Katrin Marheinecke | Georg Rehm | Tolga Kayadelen | Mohammed Attia | Ali Elkahky | Zhuoran Yu | Emily Pitler | Saran Lertpradit | Michael Mandl | Jesse Kirchner | Hector Fernandez Alcalde | Jana Strnadová | Esha Banerjee | Ruli Manurung | Antonio Stella | Atsuko Shimada | Sookyoung Kwak | Gustavo Mendonça | Tatiana Lando | Rattima Nitisaroj | Josie Li
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

The Conference on Computational Natural Language Learning (CoNLL) features a shared task, in which participants train and test their learning systems on the same data sets. In 2017, the task was devoted to learning dependency parsers for a large number of languages, in a real-world setting without any gold-standard annotation on input. All test sets followed a unified annotation scheme, namely that of Universal Dependencies. In this paper, we define the task and evaluation methodology, describe how the data sets were prepared, report and analyze the main results, and provide a brief categorization of the different approaches of the participating systems.


Webis: An Ensemble for Twitter Sentiment Detection
Matthias Hagen | Martin Potthast | Michel Büchner | Benno Stein
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)


Improving Cloze Test Performance of Language Learners Using Web N-Grams
Martin Potthast | Matthias Hagen | Anna Beyer | Benno Stein
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers


Crowdsourcing Interaction Logs to Understand Text Reuse from the Web
Martin Potthast | Matthias Hagen | Michael Völske | Benno Stein
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)


An Evaluation Framework for Plagiarism Detection
Martin Potthast | Benno Stein | Alberto Barrón-Cedeño | Paolo Rosso
Coling 2010: Posters

Corpus and Evaluation Measures for Automatic Plagiarism Detection
Alberto Barrón-Cedeño | Martin Potthast | Paolo Rosso | Benno Stein
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Easy access to texts in digital libraries and on the World Wide Web has led to an increased number of plagiarism cases in recent years, which renders manual plagiarism detection infeasible at scale. Various methods for automatic plagiarism detection have been developed whose objective is to assist human experts in the analysis of documents for plagiarism. The methods can be divided into two main approaches: intrinsic and external. Unlike other tasks in natural language processing and information retrieval, it is not possible to publish a collection of real plagiarism cases for evaluation purposes since they cannot be properly anonymized. Therefore, current evaluations found in the literature are incomparable and, very often, not even reproducible. Our contribution in this respect is a newly developed large-scale corpus of artificial plagiarism useful for the evaluation of intrinsic as well as external plagiarism detection. Additionally, new detection performance measures tailored to the evaluation of plagiarism detection algorithms are proposed.

Evaluating Humour Features on Web Comments
Antonio Reyes | Martin Potthast | Paolo Rosso | Benno Stein
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Research on automatic humor recognition has developed several features which discriminate funny text from ordinary text. The features have been demonstrated to work well when classifying the funniness of single sentences up to entire blogs. In this paper we focus on evaluating a set of the best humor features reported in the literature over a corpus retrieved from the Slashdot Web site. The corpus is categorized in a community-driven process according to the following tags: funny, informative, insightful, offtopic, flamebait, interesting, and troll. These kinds of comments can be found on almost every large Web site; therefore, they pose a new challenge to humor retrieval since they come with unique characteristics compared to other text types. If funny comments were retrieved accurately, they would be of great entertainment value for the visitors of a given Web page. Our objective, thus, is to distinguish an implicitly funny comment from one that is not funny. Our experiments are preliminary but nonetheless large-scale: 600,000 Web comments. We evaluate the classification accuracy of naive Bayes classifiers, decision trees, and support vector machines. The results suggested interesting findings.