Perceptual Quality Dimensions of Machine-Generated Text with a Focus on Machine Translation
Vivien Macketanz | Babak Naderi | Steven Schmidt | Sebastian Möller
Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval)

The quality of machine-generated text is a complex construct consisting of various aspects and dimensions. We present a study that aims to uncover relevant perceptual quality dimensions for one type of machine-generated text, that is, Machine Translation. We conducted a crowdsourcing survey in the style of a Semantic Differential to collect attribute ratings for German MT outputs. An Exploratory Factor Analysis revealed the underlying perceptual dimensions. As a result, we extracted four factors that operate as relevant dimensions for the Quality of Experience of MT outputs: precision, complexity, grammaticality, and transparency.

Using Neural Machine Translation Methods for Sign Language Translation
Galina Angelova | Eleftherios Avramidis | Sebastian Möller
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

We examine methods and techniques proven to be helpful for the text-to-text translation of spoken languages in the context of gloss-to-text translation systems, where the glosses are the written representation of the signs. We present one of the first works that include experiments on both parallel corpora of the German Sign Language (PHOENIX14T and the Public DGS Corpus). We experiment with two NMT architectures with optimization of their hyperparameters, several tokenization methods and two data augmentation techniques (back-translation and paraphrasing). Through our investigation we achieve a substantial improvement of 5.0 and 2.2 BLEU points for the models trained on the two corpora respectively. Our RNN models outperform our Transformer models, and the segmentation method we achieve best results with is BPE, whereas back-translation and paraphrasing lead to minor but not significant improvements.


Linguistic Evaluation for the 2021 State-of-the-art Machine Translation Systems for German to English and English to German
Vivien Macketanz | Eleftherios Avramidis | Shushen Manakhimova | Sebastian Möller
Proceedings of the Sixth Conference on Machine Translation

We use a semi-automated test suite to provide a fine-grained linguistic evaluation of state-of-the-art machine translation systems. The evaluation includes 18 German to English and 18 English to German systems, submitted to the Translation Shared Task of the 2021 Conference on Machine Translation. Our submission builds on the submissions of the previous years by creating and applying a wide-range test suite for English to German as a new language pair. The fine-grained evaluation allows spotting significant differences between systems that cannot be distinguished by the direct assessment of the human evaluation campaign. We find that most of the systems achieve good accuracies in the majority of linguistic phenomena, but there are a few phenomena with lower accuracy, such as the idioms, the modal pluperfect and the German resultative predicates. Two systems have significantly better macro-averaged test suite accuracy in each language direction: Online-W and Facebook-AI for German to English, and VolcTrans and Online-W for English to German. The systems show a steady improvement as compared to previous years.
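
The macro-averaged test suite accuracy mentioned above can be illustrated with a minimal sketch. It assumes that accuracy is first computed per category and then averaged with equal weight per category; the exact aggregation used by the authors is not stated in this abstract. The function name, category labels and counts below are hypothetical toy values, not results from the paper.

```python
from collections import defaultdict


def macro_average_accuracy(items):
    """Compute per-category accuracy and its unweighted mean.

    items: iterable of (category, is_correct) pairs, one per test item.
    Returns (per_category_accuracy_dict, macro_average).
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for category, is_correct in items:
        total[category] += 1
        passed[category] += int(is_correct)
    if not total:
        raise ValueError("no test items given")
    per_category = {c: passed[c] / total[c] for c in total}
    # Every category counts equally, regardless of how many items it has.
    macro = sum(per_category.values()) / len(per_category)
    return per_category, macro


# Hypothetical toy results for a single system (not data from the paper).
toy_items = (
    [("idiom", True)] * 1 + [("idiom", False)] * 3
    + [("negation", True)] * 8
    + [("verb tense", True)] * 6 + [("verb tense", False)] * 2
)

per_category, macro = macro_average_accuracy(toy_items)
# Micro-average for comparison: every test item counts equally.
micro = sum(ok for _, ok in toy_items) / len(toy_items)
print(per_category, round(macro, 3), micro)
```

In this toy example the small, difficult "idiom" category pulls the macro-average below the micro-average, which is why a macro-average is sensitive to rare phenomena.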

Efficient Explanations from Empirical Explainers
Robert Schwarzenberg | Nils Feldhus | Sebastian Möller
Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

Amid a discussion about Green AI in which we see explainability neglected, we explore the possibility to efficiently approximate computationally expensive explainers. To this end, we propose feature attribution modelling with Empirical Explainers. Empirical Explainers learn from data to predict the attribution maps of expensive explainers. We train and test Empirical Explainers in the language domain and find that they model their expensive counterparts surprisingly well, at a fraction of the cost. They could thus mitigate the computational burden of neural explanations significantly, in applications that tolerate an approximation error.

Towards Hybrid Human-Machine Workflow for Natural Language Generation
Neslihan Iskender | Tim Polzehl | Sebastian Möller
Proceedings of the First Workshop on Bridging Human–Computer Interaction and Natural Language Processing

In recent years, crowdsourcing has gained much attention from researchers to generate data for Natural Language Generation (NLG) tools or to evaluate them. However, the quality of crowdsourced data has been questioned repeatedly because of the complexity of NLG tasks and crowd workers’ unknown skills. Moreover, crowdsourcing can also be costly and often not feasible for large-scale data generation or evaluation. To overcome these challenges and leverage the complementary strengths of humans and machine tools, we propose a hybrid human-machine workflow designed explicitly for NLG tasks with real-time quality control mechanisms under budget constraints. This hybrid methodology is a powerful tool for achieving high-quality data while preserving efficiency. By combining human and machine intelligence, the proposed workflow decides dynamically on the next step based on the data from previous steps and given constraints. Our goal is to provide not only the theoretical foundations of the hybrid workflow but also, in future work, its implementation as open source.

Reliability of Human Evaluation for Text Summarization: Lessons Learned and Challenges Ahead
Neslihan Iskender | Tim Polzehl | Sebastian Möller
Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval)

Only a small portion of research papers with human evaluation for text summarization provide information about the participant demographics, task design, and experiment protocol. Additionally, many researchers use human evaluation as a gold standard without questioning its reliability or investigating the factors that might affect it. As a result, there is a lack of best practices for reliable human summarization evaluation grounded in empirical evidence. To investigate human evaluation reliability, we conduct a series of human evaluation experiments, provide an overview of participant demographics, task design, and experimental set-up, and compare the results from different experiments. Based on our empirical analysis, we provide guidelines to ensure the reliability of expert and non-expert evaluations, and we determine the factors that might affect the reliability of the human evaluation.

Thermostat: A Large Collection of NLP Model Explanations and Analysis Tools
Nils Feldhus | Robert Schwarzenberg | Sebastian Möller
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

In the language domain, as in other domains, neural explainability takes an ever more important role, with feature attribution methods at the forefront. Many such methods require considerable computational resources and expert knowledge about implementation details and parameter choices. To facilitate research, we present Thermostat, which consists of a large collection of model explanations and accompanying analysis tools. Thermostat allows easy access to over 200k explanations for the decisions of prominent state-of-the-art models spanning different NLP tasks, generated with multiple explainers. The dataset took over 10k GPU hours (> one year) to compile; compute time that the community now saves. The accompanying software tools allow users to analyse explanations instance-wise but also cumulatively on corpus level. Users can investigate and compare models, datasets and explainers without the need to orchestrate implementation details. Thermostat is fully open source, democratizes explainability research in the language domain, circumvents redundant computations and increases comparability and replicability.


Simulating Turn-Taking in Conversations with Delayed Transmission
Thilo Michael | Sebastian Möller
Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Conversations over the telephone require timely turn-taking cues that signal the participants when to speak and when to listen. When a two-way transmission delay is introduced into such conversations, the immediate feedback is delayed, and the interactivity of the conversation is impaired. With delayed speech on each side of the transmission, different conversation realities emerge on both ends, which alters the way the participants interact with each other. Simulating conversations can give insights on turn-taking and spoken interactions between humans but can also be used for analyzing and even predicting human behavior in conversations. In this paper, we simulate two types of conversations with distinct levels of interactivity. We then introduce three levels of two-way transmission delay between the agents and compare the resulting interaction patterns with human-to-human dialog from an empirical study. We show how the turn-taking mechanisms modeled for conversations without delay perform in scenarios with delay and identify to what extent the simulation is able to model the delayed turn-taking observed in human conversation.

Claim extraction from text using transfer learning.
Acharya Ashish Prabhakar | Salar Mohtaj | Sebastian Möller
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

Building an end-to-end fake news detection system consists of detecting claims in text and later verifying them for their authenticity. Although most recent works have focused on political claims, fake news can also be propagated in the form of religious intolerance, conspiracy theories, etc. Since there is a lack of training data specific to all these scenarios, we compiled a homogeneous and balanced dataset by combining some of the currently available data. Moreover, the paper shows how recent advancements in transfer learning can be leveraged to detect claims in general. The obtained results show that recently developed transformers can shift the focus of research from claim detection to the problem of check-worthiness of claims in domains of interest.

Towards a Reliable and Robust Methodology for Crowd-Based Subjective Quality Assessment of Query-Based Extractive Text Summarization
Neslihan Iskender | Tim Polzehl | Sebastian Möller
Proceedings of the 12th Language Resources and Evaluation Conference

The intrinsic and extrinsic quality evaluation is an essential part of the summary evaluation methodology usually conducted in a traditional controlled laboratory environment. However, processing large text corpora using these methods proves expensive from both the organizational and the financial perspective. For the first time, and as a fast, scalable, and cost-effective alternative, we propose micro-task crowdsourcing to evaluate both the intrinsic and extrinsic quality of query-based extractive text summaries. To investigate the appropriateness of crowdsourcing for this task, we conduct intensive comparative crowdsourcing and laboratory experiments, evaluating nine extrinsic and intrinsic quality measures on 5-point MOS scales. Correlating results of crowd and laboratory ratings reveals high applicability of crowdsourcing for the factors overall quality, grammaticality, non-redundancy, referential clarity, focus, structure & coherence, summary usefulness, and summary informativeness. Further, we investigate the effect of the number of repetitions of assessments on the robustness of the mean opinion score of crowd ratings, measured against the increase of correlation coefficients between crowd and laboratory. Our results suggest that the optimal number of repetitions in crowdsourcing setups, beyond which additional repetitions no longer cause an adequate increase of overall correlation coefficients, lies between seven and nine for intrinsic and extrinsic quality factors.
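
The repetition analysis described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: it assumes the crowd mean opinion score (MOS) of each summary is the mean of its first k ratings and that agreement with the laboratory is measured by the Pearson correlation (the abstract does not name the coefficient). All function names and ratings below are hypothetical toy values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def mos(ratings, k):
    """Mean opinion score from the first k ratings of one item."""
    return sum(ratings[:k]) / k


def correlation_by_repetitions(crowd_ratings, lab_mos, max_k):
    """Map each repetition count k to the crowd-vs-lab correlation."""
    return {
        k: pearson([mos(item, k) for item in crowd_ratings], lab_mos)
        for k in range(1, max_k + 1)
    }


# Hypothetical toy data: four summaries, three crowd ratings each (1-5 scale).
crowd_ratings = [[3, 1, 2], [2, 3, 2], [3, 4, 4], [5, 4, 5]]
lab_mos = [1.5, 2.5, 3.5, 4.5]

by_k = correlation_by_repetitions(crowd_ratings, lab_mos, max_k=3)
for k, r in by_k.items():
    print(k, round(r, 3))
```

A plot of such values against k would show where additional repetitions stop yielding a worthwhile gain in correlation; with real data the ratings would typically also be resampled rather than taken in a fixed order.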

An Empirical Comparison of Question Classification Methods for Question Answering Systems
Eduardo Cortes | Vinicius Woloszyn | Arne Binder | Tilo Himmelsbach | Dante Barone | Sebastian Möller
Proceedings of the 12th Language Resources and Evaluation Conference

Question classification is an important component of Question Answering Systems responsible for identifying the type of answer a particular question requires. For instance, “Who is the prime minister of the United Kingdom?” demands the name of a PERSON, while “When was the queen of the United Kingdom born?” entails a DATE. This work makes an extensive review of the most recent methods for Question Classification, taking into consideration their applicability in low-resourced languages. First, we propose a manual classification of the current state-of-the-art methods in four distinct categories: low, medium, high, and very high level of dependency on external resources. Second, we apply this categorization in an empirical comparison in terms of the amount of data necessary for training and performance in different languages. In addition to complementing earlier works in this field, our study shows a boost for methods relying on recent language models, overcoming methods not suitable for low-resourced languages.

From Witch’s Shot to Music Making Bones - Resources for Medical Laymen to Technical Language and Vice Versa
Laura Seiffe | Oliver Marten | Michael Mikhailov | Sven Schmeier | Sebastian Möller | Roland Roller
Proceedings of the 12th Language Resources and Evaluation Conference

Many people share information in social media or forums, such as the food they eat, the sports activities they do or the events they have visited. The information we share online reveals, directly or indirectly, information about our lifestyle and health situation, particularly when the text input gets longer or multiple messages can be linked to each other. This information can then be used to detect possible risk factors of diseases or adverse drug reactions of medications. However, as most people are not medical experts, the language they use might be more descriptive than the precise medical expressions used by medical professionals. To detect and use this relevant information, laymen language has to be translated and/or linked to the corresponding medical concept. This work presents baseline data sources in order to address this challenge for the German language. We introduce a new dataset which annotates medical laymen and technical expressions in a patient forum, along with a set of medical synonyms and definitions, and present first baseline results on the data.

Best Practices for Crowd-based Evaluation of German Summarization: Comparing Crowd, Expert and Automatic Evaluation
Neslihan Iskender | Tim Polzehl | Sebastian Möller
Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems

One of the main challenges in the development of summarization tools is summarization quality evaluation. On the one hand, the human assessment of summarization quality conducted by linguistic experts is slow, expensive, and still not a standardized procedure. On the other hand, the automatic assessment metrics are reported not to correlate highly enough with human quality ratings. As a solution, we propose crowdsourcing as a fast, scalable, and cost-effective alternative to expert evaluations to assess the intrinsic and extrinsic quality of summarization by comparing crowd ratings with expert ratings and automatic metrics such as ROUGE, BLEU, or BERTScore on a German summarization data set. Our results provide a basis for best practices for crowd-based summarization evaluation regarding major influential factors such as the best annotation aggregation method, the influence of readability and reading effort on summarization evaluation, and the optimal number of crowd workers to achieve comparable results to experts, especially when determining factors such as overall quality, grammaticality, referential clarity, focus, structure & coherence, summary usefulness, and summary informativeness.

Fine-grained linguistic evaluation for state-of-the-art Machine Translation
Eleftherios Avramidis | Vivien Macketanz | Ursula Strohriegel | Aljoscha Burchardt | Sebastian Möller
Proceedings of the Fifth Conference on Machine Translation

This paper describes a test suite submission providing detailed statistics of linguistic performance for the state-of-the-art German-English systems of the Fifth Conference on Machine Translation (WMT20). The analysis covers 107 phenomena organized in 14 categories based on about 5,500 test items, including a manual annotation effort of 45 person hours. Two systems (Tohoku and Huoshan) appear to have significantly better test suite accuracy than the others, although the best system of WMT20 is not significantly better than the one from WMT19 in a macro-average. Additionally, we identify some linguistic phenomena where all systems struggle (such as idioms, resultative predicates and pluperfect), but we are also able to identify particular weaknesses for individual systems (such as quotation marks, lexical ambiguity and sluicing). Most of the systems of WMT19 which submitted new versions this year show improvements.


Train, Sort, Explain: Learning to Diagnose Translation Models
Robert Schwarzenberg | David Harbecke | Vivien Macketanz | Eleftherios Avramidis | Sebastian Möller
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations)

Evaluating translation models is a trade-off between effort and detail. On the one end of the spectrum there are automatic count-based methods such as BLEU, on the other end linguistic evaluations by humans, which arguably are more informative but also require a disproportionately high effort. To narrow the spectrum, we propose a general approach on how to automatically expose systematic differences between human and machine translations to human experts. Inspired by adversarial settings, we train a neural text classifier to distinguish human from machine translations. A classifier that performs and generalizes well after training should recognize systematic differences between the two classes, which we uncover with neural explainability methods. Our proof-of-concept implementation, DiaMaT, is open source. Applied to a dataset translated by a state-of-the-art neural Transformer model, DiaMaT achieves a classification accuracy of 75% and exposes meaningful differences between humans and the Transformer, amidst the current discussion about human parity.


Position Paper: Towards Standardized Metrics and Tools for Spoken and Multimodal Dialog System Evaluation
Sebastian Möller | Klaus-Peter Engelbrecht | Florian Kretzschmar | Stefan Schmidt | Benjamin Weiss
NAACL-HLT Workshop on Future directions and needs in the Spoken Dialog Community: Tools and Data (SDCTD 2012)


Modeling User Satisfaction with Hidden Markov Models
Klaus-Peter Engelbrecht | Florian Gödde | Felix Hartard | Hamed Ketabdar | Sebastian Möller
Proceedings of the SIGDIAL 2009 Conference


A Framework for Model-based Evaluation of Spoken Dialog Systems
Sebastian Möller | Nigel Ward
Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue

Corpus Analysis of Spoken Smart-Home Interactions with Older Users
Sebastian Möller | Florian Gödde | Maria Wolters
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper, we present the collection and analysis of a spoken dialogue corpus obtained from interactions of older and younger users with a smart-home system. Our aim is to identify the amount and the origin of linguistic differences in the way older and younger users address the system. In addition, we investigate changes in the users’ linguistic behaviour after exposure to the system. The results show that the two user groups differ in their speaking style as well as their vocabulary. In contrast to younger users, who adapt their speaking style to the expected limitations of the system, older users tend to use a speaking style that is closer to human-human communication in terms of sentence complexity and politeness. However, older users are far less easy to stereotype than younger users.


Pragmatic Usage of Linear Regression Models for the Prediction of User Judgments
Klaus-Peter Engelbrecht | Sebastian Möller
Proceedings of the 8th SIGdial Workshop on Discourse and Dialogue


Set-up of a Unit-Selection Synthesis with a Prominent Voice
Stefan Breuer | Sven Bergmann | Ralf Dragon | Sebastian Möller
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper, we describe the set-up process and an initial evaluation of a unit-selection speech synthesizer. The synthesizer is specific in that it is intended to speak with a prominent voice. As a consequence, only very limited resources were available for setting up the unit database. These resources have been extracted from an audio book, segmented with the help of an HMM-based wrapper, and then used with the non-uniform unit-selection approach implemented in the Bonn Open Synthesis System (BOSS). In order to adapt the database to the BOSS implementation, the label files were amended by phrase boundaries, converted to XML, amended by prosodic and spectral information, and then further converted to a MySQL relational database structure. The BOSS system selects units on the basis of this information, adding individual unit costs to the concatenation costs given by MFCC and F0 distances. The paper discusses the problems which occurred during the database set-up, the invested effort, as well as the quality level which can be reached by this approach.


Parameters for Quantifying the Interaction with Spoken Dialogue Telephone Services
Sebastian Möller
Proceedings of the 6th SIGdial Workshop on Discourse and Dialogue


INSPIRE: Evaluation of a Smart-Home System for Infotainment Management and Device Control
Sebastian Möller | Jan Krebber | Alexander Raake | Paula Smeele | Martin Rajman | Mirek Melichar | Vincenzo Pallotta | Gianna Tsakou | Basilis Kladis | Anestis Vovos | Jettie Hoonhout | Dietmar Schuchardt | Nikos Fakotakis | Todor Ganchev | Ilyas Potamitis
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

A New ITU-T Recommendation on the Evaluation of Telephone-Based Spoken Dialogue Systems
Sebastian Möller
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)


Diagnostic Assessment of Telephone Transmission Impact on ASR Performance and Human-to-Human Speech Quality
Sebastian Möller | Ergina Kavallieratou
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

A new Taxonomy for the Quality of Telephone Services Based on Spoken Dialogue Systems
Sebastian Möller
Proceedings of the Third SIGdial Workshop on Discourse and Dialogue