Torsten Zesch


2024

pdf bib
Text or Image? What is More Important in Cross-Domain Generalization Capabilities of Hate Meme Detection Models?
Piush Aggarwal | Jawar Mehrabanian | Weigang Huang | Özge Alacam | Torsten Zesch
Findings of the Association for Computational Linguistics: EACL 2024

This paper delves into the formidable challenge of cross-domain generalization in multimodal hate meme detection, presenting compelling findings. We provide evidence supporting the hypothesis that only the textual component of hateful memes enables the multimodal classifier to generalize across different domains, while the image component proves highly sensitive to a specific training dataset. The evidence includes demonstrations showing that hate-text classifiers perform similarly to hate-meme classifiers in a zero-shot setting. Simultaneously, the introduction of captions generated from images of memes to the hate-meme classifier worsens performance by an average F1 of 0.02. Through blackbox explanations, we identify a substantial contribution of the text modality (average of 83%), which diminishes with the introduction of meme’s image captions (52%). Additionally, our evaluation on a newly created confounder dataset reveals higher performance on text confounders as compared to image confounders with average ∆F1 of 0.18.

pdf bib
Unraveling the Dynamics of Semi-Supervised Hate Speech Detection: The Impact of Unlabeled Data Characteristics and Pseudo-Labeling Strategies
Florian Ludwig | Klara Dolos | Ana Alves-Pinto | Torsten Zesch
Findings of the Association for Computational Linguistics: EACL 2024

Despite advances in machine learning based hate speech detection, the need for larges amounts of labeled training data for state-of-the-art approaches remains a challenge for their application. Semi-supervised learning addresses this problem by leveraging unlabeled data and thus reducing the amount of annotated data required. Underlying this approach is the assumption that labeled and unlabeled data follow similar distributions. This assumption however may not always hold, with consequences for real world applications. We address this problem by investigating the dynamics of pseudo-labeling, a commonly employed form of semi-supervised learning, in the context of hate speech detection. Concretely we analysed the influence of data characteristics and of two strategies for selecting pseudo-labeled samples: threshold- and ratio-based. The results show that the influence of data characteristics on the pseudo-labeling performances depends on other factors, such as pseudo-label selection strategies or model biases. Furthermore, the effectiveness of pseudo-labeling in classification performance is determined by the interaction between the number, hate ratio and accuracy of the selected pseudo-labels. Analysis of the results suggests an advantage of the threshold-based approach when labeled and unlabeled data arise from the same domain, whilst the ratio-based approach may be recommended in the opposite situation.

pdf bib
Rainbow - A Benchmark for Systematic Testing of How Sensitive Visio-Linguistic Models are to Color Naming
Marie Bexte | Andrea Horbach | Torsten Zesch
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

With the recent emergence of powerful visio-linguistic models comes the question of how fine-grained their multi-modal understanding is. This has lead to the release of several probing datasets. Results point towards models having trouble with prepositions and verbs, but being relatively robust when it comes to color.To gauge how deep this understanding goes, we compile a comprehensive probing dataset to systematically test multi-modal alignment around color. We demonstrate how human perception influences descriptions of color and pay special attention to the extent to which this is reflected within the predictions of a visio-linguistic model. Probing a set of models with diverse properties with our benchmark confirms the superiority of models that do not rely on pre-extracted image features, and demonstrates that augmentation with too much noisy pre-training data can produce an inferior model. While the benchmark remains challenging for all models we test, the overall result pattern suggests well-founded alignment of color terms with hues. Analyses do however reveal uncertainty regarding the boundaries between neighboring color terms.

2023

pdf bib
Similarity-Based Content Scoring - A more Classroom-Suitable Alternative to Instance-Based Scoring?
Marie Bexte | Andrea Horbach | Torsten Zesch
Findings of the Association for Computational Linguistics: ACL 2023

Automatically scoring student answers is an important task that is usually solved using instance-based supervised learning. Recently, similarity-based scoring has been proposed as an alternative approach yielding similar perfor- mance. It has hypothetical advantages such as a lower need for annotated training data and better zero-shot performance, both of which are properties that would be highly beneficial when applying content scoring in a realistic classroom setting. In this paper we take a closer look at these alleged advantages by comparing different instance-based and similarity-based methods on multiple data sets in a number of learning curve experiments. We find that both the demand on data and cross-prompt performance is similar, thus not confirming the former two suggested advantages. The by default more straightforward possibility to give feedback based on a similarity-based approach may thus tip the scales in favor of it, although future work is needed to explore this advantage in practice.

pdf bib
Preserving the Authenticity of Handwritten Learner Language: Annotation Guidelines for Creating Transcripts Retaining Orthographic Features
Christian Gold | Ronja Laarmann-quante | Torsten Zesch
Proceedings of the Workshop on Computation and Written Language (CAWL 2023)

Handwritten texts produced by young learners often contain orthographic features like spelling errors, capitalization errors, punctuation errors, and impurities such as strikethroughs, inserts, and smudges. All of those are typically normalized or ignored in existing transcriptions. For applications like handwriting recognition with the goal of automatically analyzing a learner’s language performance, however, retaining such features would be necessary. To address this, we present transcription guidelines that retain the features addressed above. Our guidelines were developed iteratively and include numerous example images to illustrate the various issues. On a subset of about 90 double-transcribed texts, we compute inter-annotator agreement and show that our guidelines can be applied with high levels of percentage agreement of about .98. Overall, we transcribed 1,350 learner texts, which is about the same size as the widely adopted handwriting recognition datasets IAM (1,500 pages) and CVL (1,600 pages). Our final corpus can be used to train a handwriting recognition system that transcribes closely to the real productions by young learners. Such a system is a prerequisite for applying automatic orthography feedback systems to handwritten texts in the future.

pdf bib
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)
Ekaterina Kochmar | Jill Burstein | Andrea Horbach | Ronja Laarmann-Quante | Nitin Madnani | Anaïs Tack | Victoria Yaneva | Zheng Yuan | Torsten Zesch
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)

pdf bib
Recognizing Learner Handwriting Retaining Orthographic Errors for Enabling Fine-Grained Error Feedback
Christian Gold | Ronja Laarmann-Quante | Torsten Zesch
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)

This paper addresses the problem of providing automatic feedback on orthographic errors in handwritten text. Despite the availability of automatic error detection systems, the practical problem of digitizing the handwriting remains. Current handwriting recognition (HWR) systems produce highly accurate transcriptions but normalize away the very errors that are essential for providing useful feedback, e.g. orthographic errors. Our contribution is twofold:First, we create a comprehensive dataset of handwritten text with transcripts retaining orthographic errors by transcribing 1,350 pages from the German learner dataset FD-LEX. Second, we train a simple HWR system on our dataset, allowing it to transcribe words with orthographic errors. Thereby, we evaluate the effect of different dictionaries on recognition output, highlighting the importance of addressing spelling errors in these dictionaries.

pdf bib
Proceedings of the 1st Workshop on Linguistic Insights from and for Multimodal Language Processing
Piush Aggarwal | {\"O}zge Ala{\c{c}}am | Carina Silberer | Sina Zarrie{\ss} | Torsten Zesch
Proceedings of the 1st Workshop on Linguistic Insights from and for Multimodal Language Processing

2022

pdf bib
Proceedings of the 18th Conference on Natural Language Processing (KONVENS 2022)
Robin Schaefer | Xiaoyu Bai | Manfred Stede | Torsten Zesch
Proceedings of the 18th Conference on Natural Language Processing (KONVENS 2022)

pdf bib
Bye, Bye, Maintenance Work? Using Model Cloning to Approximate the Behavior of Legacy Tools
Piush Aggarwal | Torsten Zesch
Proceedings of the 18th Conference on Natural Language Processing (KONVENS 2022)

pdf bib
Bringing Automatic Scoring into the Classroom - Measuring the Impact of Automated Analytic Feedback on Student Writing Performance
Andrea Horbach | Ronja Laarmann-Quante | Lucas Liebenow | Thorben Jansen | Stefan Keller | Jennifer Meyer | Torsten Zesch | Johanna Fleckenstein
Proceedings of the 11th Workshop on NLP for Computer Assisted Language Learning

pdf bib
Evaluating Automatic Spelling Correction Tools on German Primary School Children’s Misspellings
Ronja Laarmann-Quante | Lisa Prepens | Torsten Zesch
Proceedings of the 11th Workshop on NLP for Computer Assisted Language Learning

pdf bib
LeSpell - A Multi-Lingual Benchmark Corpus of Spelling Errors to Develop Spellchecking Methods for Learner Language
Marie Bexte | Ronja Laarmann-Quante | Andrea Horbach | Torsten Zesch
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Spellchecking text written by language learners is especially challenging because errors made by learners differ both quantitatively and qualitatively from errors made by already proficient learners. We introduce LeSpell, a multi-lingual (English, German, Italian, and Czech) evaluation data set of spelling mistakes in context that we compiled from seven underlying learner corpora. Our experiments show that existing spellcheckers do not work well with learner data. Thus, we introduce a highly customizable spellchecking component for the DKPro architecture, which improves performance in many settings.

pdf bib
Improving Generalization of Hate Speech Detection Systems to Novel Target Groups via Domain Adaptation
Florian Ludwig | Klara Dolos | Torsten Zesch | Eleanor Hobley
Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH)

Despite recent advances in machine learning based hate speech detection, classifiers still struggle with generalizing knowledge to out-of-domain data samples. In this paper, we investigate the generalization capabilities of deep learning models to different target groups of hate speech under clean experimental settings. Furthermore, we assess the efficacy of three different strategies of unsupervised domain adaptation to improve these capabilities. Given the diversity of hate and its rapid dynamics in the online world (e.g. the evolution of new target groups like virologists during the COVID-19 pandemic), robustly detecting hate aimed at newly identified target groups is a highly relevant research question. We show that naively trained models suffer from a target group specific bias, which can be reduced via domain adaptation. We were able to achieve a relative improvement of the F1-score between 5.8% and 10.7% for out-of-domain target groups of hate speech compared to baseline approaches by utilizing domain adaptation.

pdf bib
Analyzing the Real Vulnerability of Hate Speech Detection Systems against Targeted Intentional Noise
Piush Aggarwal | Torsten Zesch
Proceedings of the Eighth Workshop on Noisy User-generated Text (W-NUT 2022)

Hate speech detection systems have been shown to be vulnerable against obfuscation attacks, where a potential hater tries to circumvent detection by deliberately introducing noise in their posts. In previous work, noise is often introduced for all words (which is likely overestimating the impact) or single untargeted words (likely underestimating the vulnerability). We perform a user study asking people to select words they would obfuscate in a post. Using this realistic setting, we find that the real vulnerability of hate speech detection systems against deliberately introduced noise is almost as high as when using a whitebox attack and much more severe than when using a non-targeted dictionary. Our results are based on 4 different datasets, 12 different obfuscation strategies, and hate speech detection systems using different paradigms.

pdf bib
A Legal Approach to Hate Speech – Operationalizing the EU’s Legal Framework against the Expression of Hatred as an NLP Task
Frederike Zufall | Marius Hamacher | Katharina Kloppenborg | Torsten Zesch
Proceedings of the Natural Legal Language Processing Workshop 2022

We propose a ‘legal approach’ to hate speech detection by operationalization of the decision as to whether a post is subject to criminal law into an NLP task. Comparing existing regulatory regimes for hate speech, we base our investigation on the European Union’s framework as it provides a widely applicable legal minimum standard. Accurately deciding whether a post is punishable or not usually requires legal education. We show that, by breaking the legal assessment down into a series of simpler sub-decisions, even laypersons can annotate consistently. Based on a newly annotated dataset, our experiments show that directly learning an automated model of punishable content is challenging. However, learning the two sub-tasks of ‘target group’ and ‘targeting conduct’ instead of a holistic, end-to-end approach to the legal assessment yields better results. Overall, our method also provides decisions that are more transparent than those of end-to-end models, which is a crucial point in legal decision-making.

pdf bib
Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022)
Ekaterina Kochmar | Jill Burstein | Andrea Horbach | Ronja Laarmann-Quante | Nitin Madnani | Anaïs Tack | Victoria Yaneva | Zheng Yuan | Torsten Zesch
Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022)

pdf bib
Similarity-Based Content Scoring - How to Make S-BERT Keep Up With BERT
Marie Bexte | Andrea Horbach | Torsten Zesch
Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022)

The dominating paradigm for content scoring is to learn an instance-based model, i.e. to use lexical features derived from the learner answers themselves. An alternative approach that receives much less attention is however to learn a similarity-based model. We introduce an architecture that efficiently learns a similarity model and find that results on the standard ASAP dataset are on par with a BERT-based classification approach.

pdf bib
‘Meet me at the ribary’ – Acceptability of spelling variants in free-text answers to listening comprehension prompts
Ronja Laarmann-Quante | Leska Schwarz | Andrea Horbach | Torsten Zesch
Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022)

When listening comprehension is tested as a free-text production task, a challenge for scoring the answers is the resulting wide range of spelling variants. When judging whether a variant is acceptable or not, human raters perform a complex holistic decision. In this paper, we present a corpus study in which we analyze human acceptability decisions in a high stakes test for German. We show that for human experts, spelling variants are harder to score consistently than other answer variants. Furthermore, we examine how the decision can be operationalized using features that could be applied by an automatic scoring system. We show that simple measures like edit distance and phonetic similarity between a given answer and the target answer can model the human acceptability decisions with the same inter-annotator agreement as humans, and discuss implications of the remaining inconsistencies.

2021

pdf bib
A Crash Course on Ethics for Natural Language Processing
Annemarie Friedrich | Torsten Zesch
Proceedings of the Fifth Workshop on Teaching NLP

It is generally agreed upon in the natural language processing (NLP) community that ethics should be integrated into any curriculum. Being aware of and understanding the relevant core concepts is a prerequisite for following and participating in the discourse on ethical NLP. We here present ready-made teaching material in the form of slides and practical exercises on ethical issues in NLP, which is primarily intended to be integrated into introductory NLP or computational linguistics courses. By making this material freely available, we aim at lowering the threshold to adding ethics to the curriculum. We hope that increased awareness will enable students to identify potentially unethical behavior.

pdf bib
Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications
Jill Burstein | Andrea Horbach | Ekaterina Kochmar | Ronja Laarmann-Quante | Claudia Leacock | Nitin Madnani | Ildikó Pilán | Helen Yannakoudakis | Torsten Zesch
Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications

pdf bib
C-Test Collector: A Proficiency Testing Application to Collect Training Data for C-Tests
Christian Haring | Rene Lehmann | Andrea Horbach | Torsten Zesch
Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications

We present the C-Test Collector, a web-based tool that allows language learners to test their proficiency level using c-tests. Our tool collects anonymized data on test performance, which allows teachers to gain insights into common error patterns. At the same time, it allows NLP researchers to collect training data for being able to generate c-test variants at the desired difficulty level.

pdf bib
VL-BERT+: Detecting Protected Groups in Hateful Multimodal Memes
Piush Aggarwal | Michelle Espranita Liman | Darina Gold | Torsten Zesch
Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021)

This paper describes our submission (winning solution for Task A) to the Shared Task on Hateful Meme Detection at WOAH 2021. We build our system on top of a state-of-the-art system for binary hateful meme classification that already uses image tags such as race, gender, and web entities. We add further metadata such as emotions and experiment with data augmentation techniques, as hateful instances are underrepresented in the data set.

pdf bib
Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021)
Kilian Evang | Laura Kallmeyer | Rainer Osswald | Jakub Waszczuk | Torsten Zesch
Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021)

pdf bib
Robustness of end-to-end Automatic Speech Recognition Models – A Case Study using Mozilla DeepSpeech
Aashish Agarwal | Torsten Zesch
Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021)

pdf bib
Effects of Layer Freezing on Transferring a Speech Recognition System to Under-resourced Languages
Onno Eberhard | Torsten Zesch
Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021)

pdf bib
Implicit Phenomena in Short-answer Scoring Data
Marie Bexte | Andrea Horbach | Torsten Zesch
Proceedings of the 1st Workshop on Understanding Implicit and Underspecified Language

Short-answer scoring is the task of assessing the correctness of a short text given as response to a question that can come from a variety of educational scenarios. As only content, not form, is important, the exact wording including the explicitness of an answer should not matter. However, many state-of-the-art scoring models heavily rely on lexical information, be it word embeddings in a neural network or n-grams in an SVM. Thus, the exact wording of an answer might very well make a difference. We therefore quantify to what extent implicit language phenomena occur in short answer datasets and examine the influence they have on automatic scoring performance. We find that the level of implicitness depends on the individual question, and that some phenomena are very frequent. Resolving implicit wording to explicit formulations indeed tends to improve automatic scoring performance.

2020

pdf bib
Chinese Content Scoring: Open-Access Datasets and Features on Different Segmentation Levels
Yuning Ding | Andrea Horbach | Torsten Zesch
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

In this paper, we analyse the challenges of Chinese content scoring in comparison to English. As a review of prior work for Chinese content scoring shows a lack of open-access data in the field, we present two short-answer data sets for Chinese. The Chinese Educational Short Answers data set (CESA) contains 1800 student answers for five science-related questions. As a second data set, we collected ASAP-ZH with 942 answers by re-using three existing prompts from the ASAP data set. We adapt a state-of-the-art content scoring system for Chinese and evaluate it in several settings on these data sets. Results show that features on lower segmentation levels such as character n-grams tend to have better performance than features on token level.

pdf bib
Decomposing and Comparing Meaning Relations: Paraphrasing, Textual Entailment, Contradiction, and Specificity
Venelin Kovatchev | Darina Gold | M. Antonia Marti | Maria Salamo | Torsten Zesch
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper, we present a methodology for decomposing and comparing multiple meaning relations (paraphrasing, textual entailment, contradiction, and specificity). The methodology includes SHARel - a new typology that consists of 26 linguistic and 8 reason-based categories. We use the typology to annotate a corpus of 520 sentence pairs in English and we demonstrate that unlike previous typologies, SHARel can be applied to all relations of interest with a high inter-annotator agreement. We analyze and compare the frequency and distribution of the linguistic and reason-based phenomena involved in paraphrasing, textual entailment, contradiction, and specificity. This comparison allows for a much more in-depth analysis of the workings of the individual relations and the way they interact and compare with each other. We release all resources (typology, annotation guidelines, and annotated corpus) to the community.

pdf bib
Don’t take “nswvtnvakgxpm” for an answer –The surprising vulnerability of automatic content scoring systems to adversarial input
Yuning Ding | Brian Riordan | Andrea Horbach | Aoife Cahill | Torsten Zesch
Proceedings of the 28th International Conference on Computational Linguistics

Automatic content scoring systems are widely used on short answer tasks to save human effort. However, the use of these systems can invite cheating strategies, such as students writing irrelevant answers in the hopes of gaining at least partial credit. We generate adversarial answers for benchmark content scoring datasets based on different methods of increasing sophistication and show that even simple methods lead to a surprising decrease in content scoring performance. As an extreme example, up to 60% of adversarial answers generated from random shuffling of words in real answers are accepted by a state-of-the-art scoring system. In addition to analyzing the vulnerabilities of content scoring systems, we examine countermeasures such as adversarial training and show that these measures improve system robustness against adversarial answers considerably but do not suffice to completely solve the problem.

pdf bib
Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications
Jill Burstein | Ekaterina Kochmar | Claudia Leacock | Nitin Madnani | Ildikó Pilán | Helen Yannakoudakis | Torsten Zesch
Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications

2019

pdf bib
Divide and Extract – Disentangling Clause Splitting and Proposition Extraction
Darina Gold | Torsten Zesch
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Proposition extraction from sentences is an important task for information extraction systems Evaluation of such systems usually conflates two aspects: splitting complex sentences into clauses and the extraction of propositions. It is thus difficult to independently determine the quality of the proposition extraction step. We create a manually annotated proposition dataset from sentences taken from restaurant reviews that distinguishes between clauses that need to be split and those that do not. The resulting proposition evaluation dataset allows us to independently compare the performance of proposition extraction systems on simple and complex clauses. Although performance drastically drops on more complex sentences, we show that the same systems perform best on both simple and complex clauses. Furthermore, we show that specific kinds of subordinate clauses pose difficulties to most systems.

pdf bib
ltl.uni-due at SemEval-2019 Task 5: Simple but Effective Lexico-Semantic Features for Detecting Hate Speech in Twitter
Huangpan Zhang | Michael Wojatzki | Tobias Horsmann | Torsten Zesch
Proceedings of the 13th International Workshop on Semantic Evaluation

In this paper, we present our contribution to SemEval 2019 Task 5 Multilingual Detection of Hate, specifically in the Subtask A (English and Spanish). We compare different configurations of shallow and deep learning approaches on the English data and use the system that performs best in both sub-tasks. The resulting SVM-based system with lexicosemantic features (n-grams and embeddings) is ranked 23rd out of 69 on the English data and beats the baseline system. On the Spanish data our system is ranked 25th out of 39.

pdf bib
LTL-UDE at SemEval-2019 Task 6: BERT and Two-Vote Classification for Categorizing Offensiveness
Piush Aggarwal | Tobias Horsmann | Michael Wojatzki | Torsten Zesch
Proceedings of the 13th International Workshop on Semantic Evaluation

We present results for Subtask A and C of SemEval 2019 Shared Task 6. In Subtask A, we experiment with an embedding representation of postings and use BERT to categorize postings. Our best result reaches the 10th place (out of 103). In Subtask C, we applied a two-vote classification approach with minority fallback, which is placed on the 19th rank (out of 65).

pdf bib
From legal to technical concept: Towards an automated classification of German political Twitter postings as criminal offenses
Frederike Zufall | Tobias Horsmann | Torsten Zesch
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Advances in the automated detection of offensive Internet postings make this mechanism very attractive to social media companies, who are increasingly under pressure to monitor and action activity on their sites. However, these advances also have important implications as a threat to the fundamental right of free expression. In this article, we analyze which Twitter posts could actually be deemed offenses under German criminal law. German law follows the deductive method of the Roman law tradition based on abstract rules as opposed to the inductive reasoning in Anglo-American common law systems. This allows us to show how legal conclusions can be reached and implemented without relying on existing court decisions. We present a data annotation schema, consisting of a series of binary decisions, for determining whether a specific post would constitute a criminal offense. This schema serves as a step towards an inexpensive creation of a sufficient amount of data for an automated classification. We find that the majority of posts deemed offensive actually do not constitute a criminal offense and still contribute to public discourse. Furthermore, laymen can provide sufficiently reliable data to an expert reference but are, for instance, more lenient in the interpretation of what constitutes a disparaging statement.

pdf bib
RELATIONS - Workshop on meaning relations between phrases and sentences
Venelin Kovatchev | Darina Gold | Torsten Zesch
RELATIONS - Workshop on meaning relations between phrases and sentences

pdf bib
Annotating and analyzing the interactions between meaning relations
Darina Gold | Venelin Kovatchev | Torsten Zesch
Proceedings of the 13th Linguistic Annotation Workshop

Pairs of sentences, phrases, or other text pieces can hold semantic relations such as paraphrasing, textual entailment, contradiction, specificity, and semantic similarity. These relations are usually studied in isolation and no dataset exists where they can be compared empirically. Here we present a corpus annotated with these relations and the analysis of these results. The corpus contains 520 sentence pairs, annotated with these relations. We measure the annotation reliability of each individual relation and we examine their interactions and correlations. Among the unexpected results revealed by our analysis is that the traditionally considered direct relationship between paraphrasing and bi-directional entailment does not hold in our data.

pdf bib
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications
Helen Yannakoudakis | Ekaterina Kochmar | Claudia Leacock | Nitin Madnani | Ildikó Pilán | Torsten Zesch
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

2018

pdf bib
Cross-Lingual Content Scoring
Andrea Horbach | Sebastian Stennmanns | Torsten Zesch
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications

We investigate the feasibility of cross-lingual content scoring, a scenario where training and test data in an automatic scoring task are from two different languages. Cross-lingual scoring can contribute to educational equality by allowing answers in multiple languages. Training a model in one language and applying it to another language might also help to overcome data sparsity issues by re-using trained models from other languages. As there is no suitable dataset available for this new task, we create a comparable bi-lingual corpus by extending the English ASAP dataset with German answers. Our experiments with cross-lingual scoring based on machine-translating either training or test data show a considerable drop in scoring quality.

pdf bib
The Role of Diacritics in Increasing the Difficulty of Arabic Lexical Recognition Tests
Osama Hamed | Torsten Zesch
Proceedings of the 7th workshop on NLP for Computer Assisted Language Learning

pdf bib
Quantifying Qualitative Data for Understanding Controversial Issues
Michael Wojatzki | Saif Mohammad | Torsten Zesch | Svetlana Kiritchenko
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
ESCRITO - An NLP-Enhanced Educational Scoring Toolkit
Torsten Zesch | Andrea Horbach
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
DeepTC – An Extension of DKPro Text Classification for Fostering Reproducibility of Deep Learning Experiments
Tobias Horsmann | Torsten Zesch
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Agree or Disagree: Predicting Judgments on Nuanced Assertions
Michael Wojatzki | Torsten Zesch | Saif Mohammad | Svetlana Kiritchenko
Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics

Being able to predict whether people agree or disagree with an assertion (i.e. an explicit, self-contained statement) has several applications ranging from predicting how many people will like or dislike a social media post to classifying posts based on whether they are in accordance with a particular point of view. We formalize this as two NLP tasks: predicting judgments of (i) individuals and (ii) groups based on the text of the assertion and previous judgments. We evaluate a wide range of approaches on a crowdsourced data set containing over 100,000 judgments on over 2,000 assertions. We find that predicting individual judgments is a hard task with our best results only slightly exceeding a majority baseline, but that judgments of groups can be more reliably predicted using a Siamese neural network, which outperforms all other approaches by a wide margin.

2017

pdf bib
Investigating neural architectures for short answer scoring
Brian Riordan | Andrea Horbach | Aoife Cahill | Torsten Zesch | Chong Min Lee
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

Neural approaches to automated essay scoring have recently shown state-of-the-art performance. The automated essay scoring task typically involves a broad notion of writing quality that encompasses content, grammar, organization, and conventions. This differs from the short answer content scoring task, which focuses on content accuracy. The inputs to neural essay scoring models – ngrams and embeddings – are arguably well-suited to evaluate content in short answer scoring tasks. We investigate how several basic neural approaches similar to those used for automated essay scoring perform on short answer scoring. We show that neural architectures can outperform a strong non-neural baseline, but performance and optimal parameter settings vary across the more diverse types of prompts typical of short answer scoring.

pdf bib
Fine-grained essay scoring of a complex writing task for native speakers
Andrea Horbach | Dirk Scholten-Akoun | Yuning Ding | Torsten Zesch
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

Automatic essay scoring is nowadays successfully used even in high-stakes tests, but this is mainly limited to holistic scoring of learner essays. We present a new dataset of essays written by highly proficient German native speakers that is scored using a fine-grained rubric with the goal to provide detailed feedback. Our experiments with two state-of-the-art scoring systems (a neural and a SVM-based one) show a large drop in performance compared to existing datasets. This demonstrates the need for such datasets that allow to guide research on more elaborate essay scoring methods.

pdf bib
The Influence of Spelling Errors on Content Scoring Performance
Andrea Horbach | Yuning Ding | Torsten Zesch
Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017)

Spelling errors occur frequently in educational settings, but their influence on automatic scoring is largely unknown. We therefore investigate the influence of spelling errors on content scoring performance using the example of the ASAP corpus. We conduct an annotation study on the nature of spelling errors in the ASAP dataset and utilize these finding in machine learning experiments that measure the influence of spelling errors on automatic content scoring. Our main finding is that scoring methods using both token and character n-gram features are robust against spelling errors up to the error frequency in ASAP.

pdf bib
Same same, but different: Compositionality of paraphrase granularity levels
Darina Benikova | Torsten Zesch
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

Paraphrases exist on different granularity levels, the most frequently used one being the sentential level. However, we argue that working on the sentential level is not optimal for both machines and humans, and that it would be easier and more efficient to work on sub-sentential levels. To prove this, we quantify and analyze the difference between paraphrases on both sentence and sub-sentence level in order to show the significance of the problem. First results on a preliminary dataset seem to confirm our hypotheses.

pdf bib
Do LSTMs really work so well for PoS tagging? – A replication study
Tobias Horsmann | Torsten Zesch
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

A recent study by Plank et al. (2016) found that LSTM-based PoS taggers considerably improve over the current state-of-the-art when evaluated on the corpora of the Universal Dependencies project that use a coarse-grained tagset. We replicate this study using a fresh collection of 27 corpora of 21 languages that are annotated with fine-grained tagsets of varying size. Our replication confirms the result in general, and we additionally find that the advantage of LSTMs is even bigger for larger tagsets. However, we also find that for the very large tagsets of morphologically rich languages, hand-crafted morphological lexicons are still necessary to reach state-of-the-art performance.

2016

pdf bib
FlexTag: A Highly Flexible PoS Tagging Framework
Torsten Zesch | Tobias Horsmann
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present FlexTag, a highly flexible PoS tagging framework. In contrast to monolithic implementations that can only be retrained but not adapted otherwise, FlexTag enables users to modify the feature space and the classification algorithm. Thus, FlexTag makes it easy to quickly develop custom-made taggers exactly fitting the research problem.

pdf bib
Predicting the Spelling Difficulty of Words for Language Learners
Lisa Beinborn | Torsten Zesch | Iryna Gurevych
Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications

pdf bib
Bundled Gap Filling: A New Paradigm for Unambiguous Cloze Exercises
Michael Wojatzki | Oren Melamud | Torsten Zesch
Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications

pdf bib
LTL-UDE @ EmpiriST 2015: Tokenization and PoS Tagging of Social Media Text
Tobias Horsmann | Torsten Zesch
Proceedings of the 10th Web as Corpus Workshop

pdf bib
Bridging the gap between computable and expressive event representations in Social Media
Darina Benikova | Torsten Zesch
Proceedings of the Workshop on Uphill Battles in Language Processing: Scaling Early Achievements to Robust Methods

pdf bib
Validating bundled gap filling – Empirical evidence for ambiguity reduction and language proficiency testing capabilities
Niklas Meyer | Michael Wojatzki | Torsten Zesch
Proceedings of the joint workshop on NLP for Computer Assisted Language Learning and NLP for Language Acquisition

pdf bib
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)
Steven Bethard | Marine Carpuat | Daniel Cer | David Jurgens | Preslav Nakov | Torsten Zesch
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf bib
ltl.uni-due at SemEval-2016 Task 6: Stance Detection in Social Media Using Stacked Classifiers
Michael Wojatzki | Torsten Zesch
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf bib
Assigning Fine-grained PoS Tags based on High-precision Coarse-grained Tagging
Tobias Horsmann | Torsten Zesch
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

We propose a new approach to PoS tagging where in a first step, we assign a coarse-grained tag corresponding to the main syntactic category. Based on this high-precision decision, in the second step we utilize specially trained fine-grained models with heavily reduced decision complexity. By analyzing the system under oracle conditions, we show that there is a quite large potential for significantly outperforming a competitive baseline. When we take error-propagation from the coarse-grained tagging into account, our approach is on par with the state of the art. Our approach also allows tailoring the tagger towards recognizing single word classes which are of interest e.g. for researchers searching for specific phenomena in large corpora. In a case study, we significantly outperform a standard model that also makes use of the same optimizations.

pdf bib
Predicting proficiency levels in learner writings by transferring a linguistic complexity model from expert-written coursebooks
Ildikó Pilán | Elena Volodina | Torsten Zesch
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

The lack of a sufficient amount of data tailored for a task is a well-recognized problem for many statistical NLP methods. In this paper, we explore whether data sparsity can be successfully tackled when classifying language proficiency levels in the domain of learner-written output texts. We aim at overcoming data sparsity by incorporating knowledge in the trained model from another domain consisting of input texts written by teaching professionals for learners. We compare different domain adaptation techniques and find that a weighted combination of the two types of data performs best, which can even rival systems based on considerably larger amounts of in-domain data. Moreover, we show that normalizing errors in learners’ texts can substantially improve classification when level-annotated in-domain data is not available.

2015

pdf bib
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)
Preslav Nakov | Torsten Zesch | Daniel Cer | David Jurgens
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

pdf bib
Candidate evaluation strategies for improved difficulty prediction of language tests
Lisa Beinborn | Torsten Zesch | Iryna Gurevych
Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications

pdf bib
Reducing Annotation Efforts in Supervised Short Answer Scoring
Torsten Zesch | Michael Heilman | Aoife Cahill
Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications

pdf bib
Task-Independent Features for Automated Essay Grading
Torsten Zesch | Michael Wojatzki | Dirk Scholten-Akoun
Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications

pdf bib
Counting What Counts: Decompounding for Keyphrase Extraction
Nicolai Erbs | Pedro Bispo Santos | Torsten Zesch | Iryna Gurevych
Proceedings of the ACL 2015 Workshop on Novel Computational Approaches to Keyphrase Extraction

2014

pdf bib
Sense and Similarity: A Study of Sense-level Similarity Measures
Nicolai Erbs | Iryna Gurevych | Torsten Zesch
Proceedings of the Third Joint Conference on Lexical and Computational Semantics (*SEM 2014)

pdf bib
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)
Preslav Nakov | Torsten Zesch
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

pdf bib
Automatic Generation of Challenging Distractors Using Context-Sensitive Inference Rules
Torsten Zesch | Oren Melamud
Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications

pdf bib
Towards Automatic Scoring of Cloze Items by Selecting Low-Ambiguity Contexts
Tobias Horsmann | Torsten Zesch
Proceedings of the third workshop on NLP for computer-assisted language learning

pdf bib
DKPro Keyphrases: Flexible and Reusable Keyphrase Extraction Experiments
Nicolai Erbs | Pedro Bispo Santos | Iryna Gurevych | Torsten Zesch
Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations

pdf bib
DKPro TC: A Java-based Framework for Supervised Learning Experiments on Textual Data
Johannes Daxenberger | Oliver Ferschke | Iryna Gurevych | Torsten Zesch
Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations

pdf bib
Predicting the Difficulty of Language Proficiency Tests
Lisa Beinborn | Torsten Zesch | Iryna Gurevych
Transactions of the Association for Computational Linguistics, Volume 2

Language proficiency tests are used to evaluate and compare the progress of language learners. We present an approach for automatic difficulty prediction of C-tests that performs on par with human experts. On the basis of detailed analysis of newly collected data, we develop a model for C-test difficulty introducing four dimensions: solution difficulty, candidate ambiguity, inter-gap dependency, and paragraph difficulty. We show that cues from all four dimensions contribute to C-test difficulty.

2013

pdf bib
Hierarchy Identification for Automatically Generating Table-of-Contents
Nicolai Erbs | Iryna Gurevych | Torsten Zesch
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

pdf bib
Cognate Production using Character-based Machine Translation
Lisa Beinborn | Torsten Zesch | Iryna Gurevych
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
SemEval-2013 Task 5: Evaluating Phrasal Semantics
Ioannis Korkontzelos | Torsten Zesch | Fabio Massimo Zanzotto | Chris Biemann
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

pdf bib
UKP-BIU: Similarity and Entailment Metrics for Student Response Analysis
Omer Levy | Torsten Zesch | Ido Dagan | Iryna Gurevych
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

pdf bib
Recognizing Partial Textual Entailment
Omer Levy | Torsten Zesch | Ido Dagan | Iryna Gurevych
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
DKPro WSD: A Generalized UIMA-based Framework for Word Sense Disambiguation
Tristan Miller | Nicolai Erbs | Hans-Peter Zorn | Torsten Zesch | Iryna Gurevych
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations

pdf bib
DKPro Similarity: An Open Source Framework for Text Similarity
Daniel Bär | Torsten Zesch | Iryna Gurevych
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations

2012

pdf bib
Measuring Contextual Fitness Using Error Contexts Extracted from the Wikipedia Revision History
Torsten Zesch
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib
Detecting Malapropisms Using Measures of Contextual Fitness
Torsten Zesch
Traitement Automatique des Langues, Volume 53, Numéro 3 : Du bruit dans le signal : gestion des erreurs en traitement automatique des langues [Managing noise in the signal: Error handling in natural language processing]

pdf bib
Text Reuse Detection using a Composition of Text Similarity Measures
Daniel Bär | Torsten Zesch | Iryna Gurevych
Proceedings of COLING 2012

pdf bib
Using Distributional Similarity for Lexical Expansion in Knowledge-based Word Sense Disambiguation
Tristan Miller | Chris Biemann | Torsten Zesch | Iryna Gurevych
Proceedings of COLING 2012

pdf bib
HOO 2012 Shared Task: UKP Lab System Description
Torsten Zesch | Jens Haase
Proceedings of the Seventh Workshop on Building Educational Applications Using NLP

pdf bib
UKP: Computing Semantic Textual Similarity by Combining Multiple Content Similarity Measures
Daniel Bär | Chris Biemann | Iryna Gurevych | Torsten Zesch
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

2011

pdf bib
A Reflective View on Text Similarity
Daniel Bär | Torsten Zesch | Iryna Gurevych
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

pdf bib
Wikulu: An Extensible Architecture for Integrating Natural Language Processing Techniques with Wikis
Daniel Bär | Nicolai Erbs | Torsten Zesch | Iryna Gurevych
Proceedings of the ACL-HLT 2011 System Demonstrations

pdf bib
Wikipedia Revision Toolkit: Efficiently Accessing Wikipedia’s Edit History
Oliver Ferschke | Torsten Zesch | Iryna Gurevych
Proceedings of the ACL-HLT 2011 System Demonstrations

pdf bib
Helping Our Own 2011: UKP Lab System Description
Torsten Zesch
Proceedings of the 13th European Workshop on Natural Language Generation

2010

pdf bib
The More the Better? Assessing the Influence of Wikipedia’s Growth on Semantic Relatedness Measures
Torsten Zesch | Iryna Gurevych
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Wikipedia has been used as a knowledge source in many areas of natural language processing. As most studies only use a certain Wikipedia snapshot, the influence of Wikipedia’s massive growth on the results is largely unknown. For the first time, we perform an in-depth analysis of this influence using semantic relatedness as an example application that tests a wide range of Wikipedia’s properties. We find that the growth of Wikipedia has almost no effect on the correlation of semantic relatedness measures with human judgments, while the coverage steadily increases.

pdf bib
Proceedings of the 2nd Workshop on The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources
Iryna Gurevych | Torsten Zesch
Proceedings of the 2nd Workshop on The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources

2009

pdf bib
Proceedings of the 2009 Workshop on The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources (People’s Web)
Iryna Gurevych | Torsten Zesch
Proceedings of the 2009 Workshop on The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources (People’s Web)

pdf bib
Approximate Matching for Evaluating Keyphrase Extraction
Torsten Zesch | Iryna Gurevych
Proceedings of the International Conference RANLP-2009

2008

pdf bib
Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary
Torsten Zesch | Christof Müller | Iryna Gurevych
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Recently, collaboratively constructed resources such as Wikipedia and Wiktionary have been discovered as valuable lexical semantic knowledge bases with a high potential in diverse Natural Language Processing (NLP) tasks. Collaborative knowledge bases however significantly differ from traditional linguistic knowledge bases in various respects, and this constitutes both an asset and an impediment for research in NLP. This paper addresses one such major impediment, namely the lack of suitable programmatic access mechanisms to the knowledge stored in these large semantic knowledge bases. We present two application programming interfaces for Wikipedia and Wiktionary which are especially designed for mining the rich lexical semantic information dispersed in the knowledge bases, and provide efficient and structured access to the available knowledge. As we believe them to be of general interest to the NLP community, we have made them freely available for research purposes.

2007

pdf bib
Cross-Lingual Distributional Profiles of Concepts for Measuring Semantic Distance
Saif Mohammad | Iryna Gurevych | Graeme Hirst | Torsten Zesch
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

pdf bib
What to be? - Electronic Career Guidance Based on Semantic Relatedness
Iryna Gurevych | Christof Müller | Torsten Zesch
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

pdf bib
Analysis of the Wikipedia Category Graph for NLP Applications
Torsten Zesch | Iryna Gurevych
Proceedings of the Second Workshop on TextGraphs: Graph-Based Algorithms for Natural Language Processing

pdf bib
Comparing Wikipedia and German Wordnet by Evaluating Semantic Relatedness on Multiple Datasets
Torsten Zesch | Iryna Gurevych | Max Mühlhäuser
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers

2006

pdf bib
Automatically Creating Datasets for Measures of Semantic Relatedness
Torsten Zesch | Iryna Gurevych
Proceedings of the Workshop on Linguistic Distances

Search
Co-authors