Jan Milan Deriu

Also published as: Jan Deriu


Probing the Robustness of Trained Metrics for Conversational Dialogue Systems
Jan Deriu | Don Tuggener | Pius von Däniken | Mark Cieliebak
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

This paper introduces an adversarial method to stress-test trained metrics for the evaluation of conversational dialogue systems. The method leverages Reinforcement Learning to find response strategies that elicit optimal scores from the trained metrics. We apply our method to test recently proposed trained metrics. We find that they are all susceptible to giving high scores to responses generated by rather simple and obviously flawed strategies that our method converges on. For instance, simply copying parts of the conversation context to form a response yields competitive scores or even outperforms responses written by humans.


Are We Summarizing the Right Way? A Survey of Dialogue Summarization Data Sets
Don Tuggener | Margot Mieskes | Jan Deriu | Mark Cieliebak
Proceedings of the Third Workshop on New Frontiers in Summarization

Dialogue summarization is a long-standing task in the field of NLP, and several data sets with dialogues and associated human-written summaries of different styles exist. However, it is unclear for which type of dialogue which type of summary is most appropriate. For this reason, we apply a linguistic model of dialogue types to derive matching summary items and NLP tasks. This allows us to map existing dialogue summarization data sets into this model and identify gaps and potential directions for future work. As part of this process, we also provide an extensive overview of existing dialogue summarization data sets.


A Methodology for Creating Question Answering Corpora Using Inverse Data Annotation
Jan Deriu | Katsiaryna Mlynchyk | Philippe Schläpfer | Alvaro Rodrigo | Dirk von Grünigen | Nicolas Kaiser | Kurt Stockinger | Eneko Agirre | Mark Cieliebak
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

In this paper, we introduce a novel methodology to efficiently construct a corpus for question answering over structured data. For this, we introduce an intermediate representation that is based on the logical query plan in a database, called Operation Trees (OT). This representation allows us to invert the annotation process without losing flexibility in the types of queries that we generate. Furthermore, it allows for fine-grained alignment of the tokens to the operations. Thus, we randomly generate OTs from a context-free grammar and annotators just have to write the appropriate question and assign the tokens. We compare our corpus OTTA (Operation Trees and Token Assignment), a large semantic parsing corpus for evaluating natural language interfaces to databases, to Spider and LC-QuaD 2.0 and show that our methodology more than triples the annotation speed while maintaining the complexity of the queries. Finally, we train a state-of-the-art semantic parsing model on our data and show that our dataset is challenging and that the token alignment can be leveraged to significantly increase the performance.

DoQA - Accessing Domain-Specific FAQs via Conversational QA
Jon Ander Campos | Arantxa Otegi | Aitor Soroa | Jan Deriu | Mark Cieliebak | Eneko Agirre
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

The goal of this work is to build conversational Question Answering (QA) interfaces for the large body of domain-specific information available in FAQ sites. We present DoQA, a dataset with 2,437 dialogues and 10,917 QA pairs. The dialogues are collected from three Stack Exchange sites using the Wizard of Oz method with crowdsourcing. Compared to previous work, DoQA comprises well-defined information needs, leading to more coherent and natural conversations with fewer factoid questions, and is multi-domain. In addition, we introduce a more realistic information retrieval (IR) scenario where the system needs to find the answer in any of the FAQ documents. The results of an existing strong system show that, thanks to transfer learning from a Wikipedia QA dataset and fine-tuning on a single FAQ domain, it is possible to build high-quality conversational QA systems for FAQs without in-domain training data. The good results carry over into the more challenging IR scenario. In both cases, there is still ample room for improvement, as indicated by the higher human upper bound.

Spot The Bot: A Robust and Efficient Framework for the Evaluation of Conversational Dialogue Systems
Jan Deriu | Don Tuggener | Pius von Däniken | Jon Ander Campos | Alvaro Rodrigo | Thiziri Belkacem | Aitor Soroa | Eneko Agirre | Mark Cieliebak
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

The lack of time-efficient and reliable evaluation methods is hampering the development of conversational dialogue systems (chat bots). Evaluations that require humans to converse with chat bots are time and cost intensive, put high cognitive demands on the human judges, and tend to yield low-quality results. In this work, we introduce Spot The Bot, a cost-efficient and robust evaluation framework that replaces human-bot conversations with conversations between bots. Human judges then only annotate for each entity in a conversation whether they think it is human or not (assuming there are human participants in these conversations). These annotations then allow us to rank chat bots regarding their ability to mimic conversational behaviour of humans. Since we expect that all bots are eventually recognized as such, we incorporate a metric that measures which chat bot is able to uphold human-like behavior the longest, i.e., Survival Analysis. This metric has the ability to correlate a bot’s performance to certain of its characteristics (e.g., fluency or sensibleness), yielding interpretable results. The comparably low cost of our framework allows for frequent evaluations of chatbots during their evaluation cycle. We empirically validate our claims by applying Spot The Bot to three domains, evaluating several state-of-the-art chat bots, and drawing comparisons to related work. The framework is released as a ready-to-use tool.


Towards a Metric for Automated Conversational Dialogue System Evaluation and Improvement
Jan Milan Deriu | Mark Cieliebak
Proceedings of the 12th International Conference on Natural Language Generation

We present “AutoJudge”, an automated evaluation method for conversational dialogue systems. The method works by first generating dialogues based on self-talk, i.e., a dialogue system talking to itself. Then, it uses human ratings on these dialogues to train an automated judgement model. Our experiments show that AutoJudge correlates well with the human ratings and can be used to automatically evaluate dialogue systems, even in deployed systems. In a second part, we attempt to apply AutoJudge to improve existing systems. This works well for re-ranking a set of candidate utterances. However, our experiments show that AutoJudge cannot be applied as a reward for reinforcement learning, although the metric can distinguish good from bad dialogues. We discuss potential reasons, but state here already that this is still an open question for further research.


SB-CH: A Swiss German Corpus with Sentiment Annotations
Ralf Grubenmann | Don Tuggener | Pius von Däniken | Jan Deriu | Mark Cieliebak
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Twist Bytes - German Dialect Identification with Data Mining Optimization
Fernando Benites | Ralf Grubenmann | Pius von Däniken | Dirk von Grünigen | Jan Deriu | Mark Cieliebak
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

We describe the approaches we used in the German Dialect Identification (GDI) task at the VarDial Evaluation Campaign 2018. The goal was to identify which of four dialects spoken in the German-speaking part of Switzerland a sentence belonged to. We adopted two different meta classifier approaches and used some data mining insights to improve the preprocessing and the meta classifier parameters. In particular, we focused on using different feature extraction methods and on how to combine them, since they influenced the performance of the system very differently. Our system achieved second place out of 8 teams, with a macro-averaged F-1 of 64.6%.

Syntactic Manipulation for Generating more Diverse and Interesting Texts
Jan Milan Deriu | Mark Cieliebak
Proceedings of the 11th International Conference on Natural Language Generation

Natural Language Generation plays an important role in the domain of dialogue systems as it determines how users perceive the system. Recently, deep-learning-based systems have been proposed to tackle this task, as they generalize better and require less manual effort to implement for new domains. However, deep learning systems usually adopt a very homogeneous-sounding writing style which expresses little variation. In this work, we present our system for Natural Language Generation where we control various aspects of the surface realization in order to increase the lexical variability of the utterances, such that they sound more diverse and interesting. For this, we use a Semantically Controlled Long Short-term Memory Network (SC-LSTM), and apply its specialized cell to control various syntactic features of the generated texts. We present an in-depth human evaluation where we show the effects of these surface manipulations on the perception of potential users.


Potential and Limitations of Cross-Domain Sentiment Classification
Jan Milan Deriu | Martin Weilenmann | Dirk Von Gruenigen | Mark Cieliebak
Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media

In this paper we investigate the cross-domain performance of a current state-of-the-art sentiment analysis system. For this purpose we train a convolutional neural network (CNN) on data from different domains and evaluate its performance on other domains. Furthermore, we evaluate the usefulness of combining a large number of different smaller annotated corpora into one large corpus. Our results show that more sophisticated approaches are required to train a system that works equally well on various domains.

A Twitter Corpus and Benchmark Resources for German Sentiment Analysis
Mark Cieliebak | Jan Milan Deriu | Dominic Egger | Fatih Uzdilli
Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media

In this paper we present SB10k, a new corpus for sentiment analysis with approx. 10,000 German tweets. We use this new corpus and two existing corpora to provide state-of-the-art benchmarks for sentiment analysis in German: we implemented a CNN (based on the winning system of SemEval-2016) and a feature-based SVM and compare their performance on all three corpora. For the CNN, we also created German word embeddings trained on 300M tweets. These word embeddings were then optimized for sentiment analysis using distant-supervised learning. The new corpus, the German word embeddings (plain and optimized), and source code to re-run the benchmarks are publicly available.

SwissAlps at SemEval-2017 Task 3: Attention-based Convolutional Neural Network for Community Question Answering
Jan Milan Deriu | Mark Cieliebak
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

In this paper we propose a system for reranking answers for a given question. Our method builds on a Siamese CNN architecture which is extended by two attention mechanisms. The approach was evaluated on the datasets of the SemEval-2017 competition for Community Question Answering (cQA), where it achieved 7th place, obtaining a MAP score of 86.24 points on the Question-Comment Similarity subtask.

TopicThunder at SemEval-2017 Task 4: Sentiment Classification Using a Convolutional Neural Network with Distant Supervision
Simon Müller | Tobias Huonder | Jan Deriu | Mark Cieliebak
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

In this paper, we propose a classifier for predicting topic-specific sentiments of English Twitter messages. Our method is based on a 2-layer CNN. With a distant-supervised phase, we leverage a large amount of weakly-labelled training data. Our system was evaluated on the data provided by the SemEval-2017 competition in the Topic-Based Message Polarity Classification subtask, where it ranked 4th.


SwissCheese at SemEval-2016 Task 4: Sentiment Classification Using an Ensemble of Convolutional Neural Networks with Distant Supervision
Jan Deriu | Maurice Gonzenbach | Fatih Uzdilli | Aurelien Lucchi | Valeria De Luca | Martin Jaggi
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)