Sebastian Möller


2021

Reliability of Human Evaluation for Text Summarization: Lessons Learned and Challenges Ahead
Neslihan Iskender | Tim Polzehl | Sebastian Möller
Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval)

Only a small portion of research papers with human evaluation for text summarization provide information about the participant demographics, task design, and experiment protocol. Additionally, many researchers use human evaluation as a gold standard without questioning its reliability or investigating the factors that might affect it. As a result, there is a lack of best practices for reliable human summarization evaluation grounded in empirical evidence. To investigate human evaluation reliability, we conduct a series of human evaluation experiments, provide an overview of participant demographics, task design, and experimental set-up, and compare the results from different experiments. Based on our empirical analysis, we provide guidelines to ensure the reliability of expert and non-expert evaluations, and we determine the factors that might affect the reliability of the human evaluation.

Towards Hybrid Human-Machine Workflow for Natural Language Generation
Neslihan Iskender | Tim Polzehl | Sebastian Möller
Proceedings of the First Workshop on Bridging Human–Computer Interaction and Natural Language Processing

In recent years, crowdsourcing has gained much attention from researchers as a way to generate data for Natural Language Generation (NLG) tools or to evaluate them. However, the quality of crowdsourced data has been questioned repeatedly because of the complexity of NLG tasks and crowd workers’ unknown skills. Moreover, crowdsourcing can also be costly and is often not feasible for large-scale data generation or evaluation. To overcome these challenges and leverage the complementary strengths of humans and machine tools, we propose a hybrid human-machine workflow designed explicitly for NLG tasks with real-time quality control mechanisms under budget constraints. This hybrid methodology is a powerful tool for achieving high-quality data while preserving efficiency. By combining human and machine intelligence, the proposed workflow decides dynamically on the next step based on the data from previous steps and given constraints. Our goal is not only to provide the theoretical foundations of the hybrid workflow but also to release its implementation as open source in future work.

2020

Towards a Reliable and Robust Methodology for Crowd-Based Subjective Quality Assessment of Query-Based Extractive Text Summarization
Neslihan Iskender | Tim Polzehl | Sebastian Möller
Proceedings of the 12th Language Resources and Evaluation Conference

Intrinsic and extrinsic quality evaluation is an essential part of the summary evaluation methodology and is usually conducted in a traditional controlled laboratory environment. However, processing large text corpora using these methods proves expensive from both the organizational and the financial perspective. For the first time, and as a fast, scalable, and cost-effective alternative, we propose micro-task crowdsourcing to evaluate both the intrinsic and extrinsic quality of query-based extractive text summaries. To investigate the appropriateness of crowdsourcing for this task, we conduct intensive comparative crowdsourcing and laboratory experiments, evaluating nine extrinsic and intrinsic quality measures on 5-point MOS scales. Correlating results of crowd and laboratory ratings reveals high applicability of crowdsourcing for the factors overall quality, grammaticality, non-redundancy, referential clarity, focus, structure & coherence, summary usefulness, and summary informativeness. Further, we investigate the effect of the number of repetitions of assessments on the robustness of the mean opinion score of crowd ratings, measured against the increase of correlation coefficients between crowd and laboratory. Our results suggest that the optimal number of repetitions in crowdsourcing setups, beyond which additional repetitions no longer cause an adequate increase of overall correlation coefficients, lies between seven and nine for intrinsic and extrinsic quality factors.

An Empirical Comparison of Question Classification Methods for Question Answering Systems
Eduardo Cortes | Vinicius Woloszyn | Arne Binder | Tilo Himmelsbach | Dante Barone | Sebastian Möller
Proceedings of the 12th Language Resources and Evaluation Conference

Question classification is an important component of Question Answering Systems, responsible for identifying the type of answer a particular question requires. For instance, “Who is the prime minister of the United Kingdom?” demands the name of a PERSON, while “When was the queen of the United Kingdom born?” entails a DATE. This work presents an extensive review of the most recent methods for Question Classification, taking into consideration their applicability in low-resourced languages. First, we propose a manual classification of the current state-of-the-art methods in four distinct categories: low, medium, high, and very high level of dependency on external resources. Second, we apply this categorization in an empirical comparison in terms of the amount of data necessary for training and performance in different languages. In addition to complementing earlier works in this field, our study shows a boost for methods relying on recent language models, overcoming methods not suitable for low-resourced languages.

From Witch’s Shot to Music Making Bones - Resources for Medical Laymen to Technical Language and Vice Versa
Laura Seiffe | Oliver Marten | Michael Mikhailov | Sven Schmeier | Sebastian Möller | Roland Roller
Proceedings of the 12th Language Resources and Evaluation Conference

Many people share information on social media or in forums, such as the food they eat, the sports activities they do, or the events they have visited. The information we share online reveals, directly or indirectly, information about our lifestyle and health situation, particularly when text input gets longer or multiple messages can be linked to each other. This information can then be used to detect possible risk factors of diseases or adverse drug reactions of medications. However, as most people are not medical experts, the language used might be more descriptive than the precise medical expressions that medical professionals use. To detect and use this relevant information, laymen language has to be translated and/or linked to the corresponding medical concept. This work presents baseline data sources to address this challenge for the German language. We introduce a new dataset which annotates medical laymen and technical expressions in a patient forum, along with a set of medical synonyms and definitions, and present first baseline results on the data.

Claim extraction from text using transfer learning.
Acharya Ashish Prabhakar | Salar Mohtaj | Sebastian Möller
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

Building an end-to-end fake news detection system consists of detecting claims in text and later verifying them for their authenticity. Although most recent works have focused on political claims, fake news can also be propagated in the form of religious intolerance, conspiracy theories, etc. Since there is a lack of training data specific to all these scenarios, we compiled a homogeneous and balanced dataset by combining some of the currently available data. Moreover, the paper shows how recent advancements in transfer learning can be leveraged to detect claims in general. The obtained results show that recently developed transformers can shift the focus of research from claim detection to the problem of check-worthiness of claims in domains of interest.

Simulating Turn-Taking in Conversations with Delayed Transmission
Thilo Michael | Sebastian Möller
Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Conversations over the telephone require timely turn-taking cues that signal to the participants when to speak and when to listen. When a two-way transmission delay is introduced into such conversations, the immediate feedback is delayed, and the interactivity of the conversation is impaired. With delayed speech on each side of the transmission, different conversation realities emerge on both ends, which alters the way the participants interact with each other. Simulating conversations can give insights into turn-taking and spoken interactions between humans but can also be used for analyzing and even predicting human behavior in conversations. In this paper, we simulate two types of conversations with distinct levels of interactivity. We then introduce three levels of two-way transmission delay between the agents and compare the resulting interaction patterns with human-to-human dialog from an empirical study. We show how the turn-taking mechanisms modeled for conversations without delay perform in scenarios with delay and identify to what extent the simulation is able to model the delayed turn-taking observed in human conversation.

Fine-grained linguistic evaluation for state-of-the-art Machine Translation
Eleftherios Avramidis | Vivien Macketanz | Ursula Strohriegel | Aljoscha Burchardt | Sebastian Möller
Proceedings of the Fifth Conference on Machine Translation

This paper describes a test suite submission providing detailed statistics of linguistic performance for the state-of-the-art German-English systems of the Fifth Conference of Machine Translation (WMT20). The analysis covers 107 phenomena organized in 14 categories based on about 5,500 test items, including a manual annotation effort of 45 person hours. Two systems (Tohoku and Huoshan) appear to have significantly better test suite accuracy than the others, although the best system of WMT20 is not significantly better than the one from WMT19 in a macro-average. Additionally, we identify some linguistic phenomena where all systems suffer (such as idioms, resultative predicates and pluperfect), but we are also able to identify particular weaknesses for individual systems (such as quotation marks, lexical ambiguity and sluicing). Most of the systems of WMT19 which submitted new versions this year show improvements.

Best Practices for Crowd-based Evaluation of German Summarization: Comparing Crowd, Expert and Automatic Evaluation
Neslihan Iskender | Tim Polzehl | Sebastian Möller
Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems

One of the main challenges in the development of summarization tools is summarization quality evaluation. On the one hand, the human assessment of summarization quality conducted by linguistic experts is slow, expensive, and still not a standardized procedure. On the other hand, the automatic assessment metrics are reported not to correlate highly enough with human quality ratings. As a solution, we propose crowdsourcing as a fast, scalable, and cost-effective alternative to expert evaluations to assess the intrinsic and extrinsic quality of summarization by comparing crowd ratings with expert ratings and automatic metrics such as ROUGE, BLEU, or BertScore on a German summarization data set. Our results provide a basis for best practices for crowd-based summarization evaluation regarding major influential factors such as the best annotation aggregation method, the influence of readability and reading effort on summarization evaluation, and the optimal number of crowd workers to achieve comparable results to experts, especially when determining factors such as overall quality, grammaticality, referential clarity, focus, structure & coherence, summary usefulness, and summary informativeness.

2019

Train, Sort, Explain: Learning to Diagnose Translation Models
Robert Schwarzenberg | David Harbecke | Vivien Macketanz | Eleftherios Avramidis | Sebastian Möller
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations)

Evaluating translation models is a trade-off between effort and detail. On the one end of the spectrum there are automatic count-based methods such as BLEU, on the other end linguistic evaluations by humans, which arguably are more informative but also require a disproportionately high effort. To narrow the spectrum, we propose a general approach on how to automatically expose systematic differences between human and machine translations to human experts. Inspired by adversarial settings, we train a neural text classifier to distinguish human from machine translations. A classifier that performs and generalizes well after training should recognize systematic differences between the two classes, which we uncover with neural explainability methods. Our proof-of-concept implementation, DiaMaT, is open source. Applied to a dataset translated by a state-of-the-art neural Transformer model, DiaMaT achieves a classification accuracy of 75% and exposes meaningful differences between humans and the Transformer, amidst the current discussion about human parity.

2012

Position Paper: Towards Standardized Metrics and Tools for Spoken and Multimodal Dialog System Evaluation
Sebastian Möller | Klaus-Peter Engelbrecht | Florian Kretzschmar | Stefan Schmidt | Benjamin Weiss
NAACL-HLT Workshop on Future directions and needs in the Spoken Dialog Community: Tools and Data (SDCTD 2012)

2009

Modeling User Satisfaction with Hidden Markov Models
Klaus-Peter Engelbrecht | Florian Gödde | Felix Hartard | Hamed Ketabdar | Sebastian Möller
Proceedings of the SIGDIAL 2009 Conference

2008

A Framework for Model-based Evaluation of Spoken Dialog Systems
Sebastian Möller | Nigel Ward
Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue

Corpus Analysis of Spoken Smart-Home Interactions with Older Users
Sebastian Möller | Florian Gödde | Maria Wolters
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper, we present the collection and analysis of a spoken dialogue corpus obtained from interactions of older and younger users with a smart-home system. Our aim is to identify the amount and the origin of linguistic differences in the way older and younger users address the system. In addition, we investigate changes in the users’ linguistic behaviour after exposure to the system. The results show that the two user groups differ in their speaking style as well as their vocabulary. In contrast to younger users, who adapt their speaking style to the expected limitations of the system, older users tend to use a speaking style that is closer to human-human communication in terms of sentence complexity and politeness. However, older users are far less easy to stereotype than younger users.

2007

Pragmatic Usage of Linear Regression Models for the Prediction of User Judgments
Klaus-Peter Engelbrecht | Sebastian Möller
Proceedings of the 8th SIGdial Workshop on Discourse and Dialogue

2006

Set-up of a Unit-Selection Synthesis with a Prominent Voice
Stefan Breuer | Sven Bergmann | Ralf Dragon | Sebastian Möller
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper, we describe the set-up process and an initial evaluation of a unit-selection speech synthesizer. The synthesizer is specific in that it is intended to speak with a prominent voice. As a consequence, only very limited resources were available for setting up the unit database. These resources have been extracted from an audio book, segmented with the help of an HMM-based wrapper, and then used with the non-uniform unit-selection approach implemented in the Bonn Open Synthesis System (BOSS). In order to adapt the database to the BOSS implementation, the label files were amended by phrase boundaries, converted to XML, amended by prosodic and spectral information, and then further converted to a MySQL relational database structure. The BOSS system selects units on the basis of this information, adding individual unit costs to the concatenation costs given by MFCC and F0 distances. The paper discusses the problems which occurred during the database set-up, the invested effort, as well as the quality level which can be reached by this approach.

2005

Parameters for Quantifying the Interaction with Spoken Dialogue Telephone Services
Sebastian Möller
Proceedings of the 6th SIGdial Workshop on Discourse and Dialogue

2004

INSPIRE: Evaluation of a Smart-Home System for Infotainment Management and Device Control
Sebastian Möller | Jan Krebber | Alexander Raake | Paula Smeele | Martin Rajman | Mirek Melichar | Vincenzo Pallotta | Gianna Tsakou | Basilis Kladis | Anestis Vovos | Jettie Hoonhout | Dietmar Schuchardt | Nikos Fakotakis | Todor Ganchev | Ilyas Potamitis
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

A New ITU-T Recommendation on the Evaluation of Telephone-Based Spoken Dialogue Systems
Sebastian Möller
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

2002

Diagnostic Assessment of Telephone Transmission Impact on ASR Performance and Human-to-Human Speech Quality
Sebastian Möller | Ergina Kavallieratou
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

A new Taxonomy for the Quality of Telephone Services Based on Spoken Dialogue Systems
Sebastian Möller
Proceedings of the Third SIGdial Workshop on Discourse and Dialogue