Verena Rieser


2021

MiRANews: Dataset and Benchmarks for Multi-Resource-Assisted News Summarization
Xinnuo Xu | Ondřej Dušek | Shashi Narayan | Verena Rieser | Ioannis Konstas
Findings of the Association for Computational Linguistics: EMNLP 2021

One of the most challenging aspects of current single-document news summarization is that the summary often contains ‘extrinsic hallucinations’, i.e., facts that are not present in the source document, which are often derived via world knowledge. This causes summarisation systems to act more like open-ended language models tending to hallucinate facts that are erroneous. In this paper, we mitigate this problem with the help of multiple supplementary resource documents assisting the task. We present a new dataset MiRANews and benchmark existing summarisation models. In contrast to multi-document summarization, which addresses multiple events from several source documents, we still aim at generating a summary for a single document. We show via data analysis that it’s not only the models which are to blame: more than 27% of facts mentioned in the gold summaries of MiRANews are better grounded on assisting documents than in the main source articles. An error analysis of generated summaries from pretrained models fine-tuned on MiRANews reveals that this has an even bigger effect on models: assisted summarisation reduces 55% of hallucinations when compared to single-document summarisation models trained on the main article only.

AggGen: Ordering and Aggregating while Generating
Xinnuo Xu | Ondřej Dušek | Verena Rieser | Ioannis Konstas
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

We present AggGen (pronounced ‘again’), a data-to-text model which re-introduces two explicit sentence planning stages into neural data-to-text systems: input ordering and input aggregation. In contrast to previous work using sentence planning, our model is still end-to-end: AggGen performs sentence planning at the same time as generating text by learning latent alignments (via semantic facts) between input representation and target text. Experiments on the WebNLG and E2E challenge data show that by using fact-based alignments our approach is more interpretable, expressive, robust to noise, and easier to control, while retaining the advantages of end-to-end systems in terms of fluency. Our code is available at https://github.com/XinnuoXu/AggGen.

OTTers: One-turn Topic Transitions for Open-Domain Dialogue
Karin Sevegnani | David M. Howcroft | Ioannis Konstas | Verena Rieser
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Mixed initiative in open-domain dialogue requires a system to pro-actively introduce new topics. The one-turn topic transition task explores how a system connects two topics in a cooperative and coherent manner. The goal of the task is to generate a “bridging” utterance connecting the new topic to the topic of the previous conversation turn. We are especially interested in commonsense explanations of how a new topic relates to what has been mentioned before. We first collect a new dataset of human one-turn topic transitions, which we call OTTers. We then explore different strategies used by humans when asked to complete such a task, and notice that the use of a bridging utterance to connect the two topics is the approach used the most. We finally show how existing state-of-the-art text generation models can be adapted to this task and examine the performance of these baselines on different splits of the OTTers data.

Alexa, Google, Siri: What are Your Pronouns? Gender and Anthropomorphism in the Design and Perception of Conversational Assistants
Gavin Abercrombie | Amanda Cercas Curry | Mugdha Pandya | Verena Rieser
Proceedings of the 3rd Workshop on Gender Bias in Natural Language Processing

Technology companies have produced varied responses to concerns about the effects of the design of their conversational AI systems. Some have claimed that their voice assistants are in fact not gendered or human-like—despite design features suggesting the contrary. We compare these claims to user perceptions by analysing the pronouns they use when referring to AI assistants. We also examine systems’ responses and the extent to which they generate output which is gendered and anthropomorphic. We find that, while some companies appear to be addressing the ethical concerns raised, in some cases, their claims do not seem to hold true. In particular, our results show that system outputs are ambiguous as to the humanness of the systems, and that users tend to personify and gender them as a result.

ConvAbuse: Data, Analysis, and Benchmarks for Nuanced Abuse Detection in Conversational AI
Amanda Cercas Curry | Gavin Abercrombie | Verena Rieser
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

We present the first English corpus study on abusive language towards three conversational AI systems gathered ‘in the wild’: an open-domain social bot, a rule-based chatbot, and a task-based system. To account for the complexity of the task, we take a more ‘nuanced’ approach where our ConvAI dataset reflects fine-grained notions of abuse, as well as views from multiple expert annotators. We find that the distribution of abuse is vastly different compared to other commonly used datasets, with more sexually tinted aggression towards the virtual persona of these systems. Finally, we report results from benchmarking existing models against this data. Unsurprisingly, we find that there is substantial room for improvement, with F1 scores below 90%.

What happens if you treat ordinal ratings as interval data? Human evaluations in NLP are even more under-powered than you think
David M. Howcroft | Verena Rieser
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Previous work has shown that human evaluations in NLP are notoriously under-powered. Here, we argue that there are two common factors which make this problem even worse: NLP studies usually (a) treat ordinal data as interval data and (b) operate under high variance settings while the differences they are hoping to detect are often subtle. We demonstrate through simulation that ordinal mixed effects models are better able to detect small differences between models, especially in high variance settings common in evaluations of generated texts. We release tools for researchers to conduct their own power analysis and test their assumptions. We also make recommendations for improving statistical power.

2020

SLURP: A Spoken Language Understanding Resource Package
Emanuele Bastianelli | Andrea Vanzo | Pawel Swietojanski | Verena Rieser
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Spoken Language Understanding infers semantic meaning directly from audio data, and thus promises to reduce error propagation and misunderstandings in end-user applications. However, publicly available SLU resources are limited. In this paper, we release SLURP, a new SLU package containing the following: (1) A new challenging dataset in English spanning 18 domains, which is substantially bigger and linguistically more diverse than existing datasets; (2) Competitive baselines based on state-of-the-art NLU and ASR systems; (3) A new transparent metric for entity labelling which enables a detailed error analysis for identifying potential areas of improvement. SLURP is available at https://github.com/pswietojanski/slurp.

Twenty Years of Confusion in Human Evaluation: NLG Needs Evaluation Sheets and Standardised Definitions
David M. Howcroft | Anya Belz | Miruna-Adriana Clinciu | Dimitra Gkatzia | Sadid A. Hasan | Saad Mahamood | Simon Mille | Emiel van Miltenburg | Sashank Santhanam | Verena Rieser
Proceedings of the 13th International Conference on Natural Language Generation

Human assessment remains the most trusted form of evaluation in NLG, but highly diverse approaches and a proliferation of different quality criteria used by researchers make it difficult to compare results and draw conclusions across papers, with adverse implications for meta-evaluation and reproducibility. In this paper, we present (i) our dataset of 165 NLG papers with human evaluations, (ii) the annotation scheme we developed to label the papers for different aspects of evaluations, (iii) quantitative analyses of the annotations, and (iv) a set of recommendations for improving standards in evaluation reporting. We use the annotations as a basis for examining information included in evaluation reports, and levels of consistency in approaches, experimental design and terminology, focusing in particular on the 200+ different terms that have been used for evaluated aspects of quality. We conclude that due to a pervasive lack of clarity in reports and extreme diversity in approaches, human evaluation in NLG presents as extremely confused in 2020, and that the field is in urgent need of standard methods and terminology.

Conversational Assistants and Gender Stereotypes: Public Perceptions and Desiderata for Voice Personas
Amanda Cercas Curry | Judy Robertson | Verena Rieser
Proceedings of the Second Workshop on Gender Bias in Natural Language Processing

Conversational voice assistants are rapidly developing from purely transactional systems to social companions with “personality”. UNESCO recently stated that the female and submissive personality of current digital assistants gives cause for concern as it reinforces gender stereotypes. In this work, we present results from a participatory design workshop, where we invite people to submit their preferences for what their ideal persona might look like, both in drawings and in a multiple choice questionnaire. We find no clear consensus, which suggests that one possible solution is to let people configure/personalise their assistants. We then outline a multi-disciplinary project of how we plan to address the complex question of gender and stereotyping in digital assistants.

Fact-based Content Weighting for Evaluating Abstractive Summarisation
Xinnuo Xu | Ondřej Dušek | Jingyi Li | Verena Rieser | Ioannis Konstas
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Abstractive summarisation is notoriously hard to evaluate since standard word-overlap-based metrics are insufficient. We introduce a new evaluation metric which is based on fact-level content weighting, i.e. relating the facts of the document to the facts of the summary. We follow the assumption that a good summary will reflect all relevant facts, i.e. the ones present in the ground truth (human-generated reference summary). We confirm this hypothesis by showing that our weightings are highly correlated to human perception and compare favourably to the recent manual highlight-based metric of Hardy et al. (2019).

History for Visual Dialog: Do we really need it?
Shubham Agarwal | Trung Bui | Joon-Young Lee | Ioannis Konstas | Verena Rieser
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Visual Dialogue involves “understanding” the dialogue history (what has been discussed previously) and the current question (what is asked), in addition to grounding information in the image, to accurately generate the correct response. In this paper, we show that co-attention models which explicitly encode dialogue history outperform models that don’t, achieving state-of-the-art performance (72% NDCG on val set). However, we also expose shortcomings of the crowdsourcing dataset collection procedure, by showing that dialogue history is indeed only required for a small amount of the data, and that the current evaluation metric encourages generic replies. To that end, we propose a challenging subset (VisdialConv) of the VisdialVal set and report a benchmark NDCG of 63%.

2019

A Crowd-based Evaluation of Abuse Response Strategies in Conversational Agents
Amanda Cercas Curry | Verena Rieser
Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue

How should conversational agents respond to verbal abuse from the user? To answer this question, we conduct a large-scale crowd-sourced evaluation of abuse response strategies employed by current state-of-the-art systems. Our results show that some strategies, such as “polite refusal”, score highly across the board, while for other strategies demographic factors, such as age, as well as the severity of the preceding abuse influence the user’s perception of which response is appropriate. In addition, we find that most data-driven models lag behind rule-based or commercial systems in terms of their perceived appropriateness.

User Evaluation of a Multi-dimensional Statistical Dialogue System
Simon Keizer | Ondřej Dušek | Xingkun Liu | Verena Rieser
Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue

We present the first complete spoken dialogue system driven by a multi-dimensional statistical dialogue manager. This framework has been shown to substantially reduce data needs by leveraging domain-independent dimensions, such as social obligations or feedback, which (as we show) can be transferred between domains. In this paper, we conduct a user study and show that the performance of a multi-dimensional system, which can be adapted from a source domain, is equivalent to that of a one-dimensional baseline, which can only be trained from scratch.

Automatic Quality Estimation for Natural Language Generation: Ranting (Jointly Rating and Ranking)
Ondřej Dušek | Karin Sevegnani | Ioannis Konstas | Verena Rieser
Proceedings of the 12th International Conference on Natural Language Generation

We present a recurrent neural network based system for automatic quality estimation of natural language generation (NLG) outputs, which jointly learns to assign numerical ratings to individual outputs and to provide pairwise rankings of two different outputs. The latter is trained using pairwise hinge loss over scores from two copies of the rating network. We use learning to rank and synthetic data to improve the quality of ratings assigned by our system: We synthesise training pairs of distorted system outputs and train the system to rank the less distorted one higher. This leads to a 12% increase in correlation with human ratings over the previous benchmark. We also establish the state of the art on the dataset of relative rankings from the E2E NLG Challenge (Dusek et al., 2019), where synthetic data lead to a 4% accuracy increase over the base model.
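
The pairwise ranking objective described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_score` is a hypothetical stand-in for the rating network (both outputs are scored by the same function, on the assumption that the two copies share parameters), and the margin of 1.0 is an illustrative choice rather than a value from the paper.

```python
def pairwise_hinge_loss(score_better: float, score_worse: float, margin: float = 1.0) -> float:
    """Zero once the better output outscores the worse one by at least `margin`."""
    return max(0.0, margin - (score_better - score_worse))


def toy_score(text: str) -> float:
    """Hypothetical stand-in for the rating network: counts distinct words."""
    return float(len(set(text.split())))


better = "the restaurant serves cheap italian food"
worse = "the the restaurant food"  # synthetically distorted output (illustrative)

# Correct ordering incurs no loss; the swapped ordering is penalised.
loss_correct = pairwise_hinge_loss(toy_score(better), toy_score(worse))
loss_swapped = pairwise_hinge_loss(toy_score(worse), toy_score(better))
print(loss_correct, loss_swapped)
```

In a trained system the loss would be averaged over many such pairs and minimised with respect to the network parameters; the sketch only shows how a single pair is scored.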

Semantic Noise Matters for Neural Natural Language Generation
Ondřej Dušek | David M. Howcroft | Verena Rieser
Proceedings of the 12th International Conference on Natural Language Generation

Neural natural language generation (NNLG) systems are known for their pathological outputs, i.e. generating text which is unrelated to the input specification. In this paper, we show the impact of semantic noise on state-of-the-art NNLG models which implement different semantic control mechanisms. We find that cleaned data can improve semantic correctness by up to 97%, while maintaining fluency. We also find that the most common error is omitting information, rather than hallucination.

2018

Better Conversations by Modeling, Filtering, and Optimizing for Coherence and Diversity
Xinnuo Xu | Ondřej Dušek | Ioannis Konstas | Verena Rieser
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We present three enhancements to existing encoder-decoder models for open-domain conversational agents, aimed at effectively modeling coherence and promoting output diversity: (1) We introduce a measure of coherence as the GloVe embedding similarity between the dialogue context and the generated response, (2) we filter our training corpora based on the measure of coherence to obtain topically coherent and lexically diverse context-response pairs, (3) we then train a response generator using a conditional variational autoencoder model that incorporates the measure of coherence as a latent variable and uses a context gate to guarantee topical consistency with the context and promote lexical diversity. Experiments on the OpenSubtitles corpus show a substantial improvement over competitive neural models in terms of BLEU score as well as metrics of coherence and diversity.
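
As a rough illustration of an embedding-based coherence measure like the one in this abstract, the sketch below computes the cosine similarity between averaged word vectors of the context and the response. Averaging and cosine similarity are assumptions made for this example, and `TOY_EMBEDDINGS` holds made-up 3-dimensional vectors standing in for pretrained GloVe embeddings.

```python
import math

# Made-up vectors standing in for pretrained GloVe embeddings (assumption).
TOY_EMBEDDINGS = {
    "pizza": [0.9, 0.1, 0.0],
    "pasta": [0.8, 0.2, 0.0],
    "hungry": [0.7, 0.3, 0.1],
    "football": [0.0, 0.1, 0.9],
}


def average_embedding(tokens):
    """Mean of the vectors of all known tokens, or None if none are known."""
    vectors = [TOY_EMBEDDINGS[t] for t in tokens if t in TOY_EMBEDDINGS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def coherence(context_tokens, response_tokens):
    """Cosine similarity between averaged context and response embeddings."""
    a = average_embedding(context_tokens)
    b = average_embedding(response_tokens)
    if a is None or b is None:
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


on_topic = coherence(["hungry", "pizza"], ["pasta"])
off_topic = coherence(["hungry", "pizza"], ["football"])
print(on_topic > off_topic)
```

A score of this kind could then be thresholded to filter context-response pairs; any particular threshold would be a further assumption.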

RankME: Reliable Human Ratings for Natural Language Generation
Jekaterina Novikova | Ondřej Dušek | Verena Rieser
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Human evaluation for natural language generation (NLG) often suffers from inconsistent user ratings. While previous research tends to attribute this problem to individual user preferences, we show that the quality of human judgements can also be improved by experimental design. We present a novel rank-based magnitude estimation method (RankME), which combines the use of continuous scales and relative assessments. We show that RankME significantly improves the reliability and consistency of human ratings compared to traditional evaluation methods. In addition, we show that it is possible to evaluate NLG systems according to multiple, distinct criteria, which is important for error analysis. Finally, we demonstrate that RankME, in combination with Bayesian estimation of system quality, is a cost-effective alternative for ranking multiple NLG systems.

#MeToo Alexa: How Conversational Systems Respond to Sexual Harassment
Amanda Cercas Curry | Verena Rieser
Proceedings of the Second ACL Workshop on Ethics in Natural Language Processing

Conversational AI systems, such as Amazon’s Alexa, are rapidly developing from purely transactional systems to social chatbots, which can respond to a wide variety of user requests. In this article, we establish how current state-of-the-art conversational systems react to inappropriate requests, such as bullying and sexual harassment on the part of the user, by collecting and analysing the novel #MeTooAlexa corpus. Our results show that commercial systems mainly avoid answering, while rule-based chatbots show a variety of behaviours and often deflect. Data-driven systems, on the other hand, are often non-coherent, but also run the risk of being interpreted as flirtatious and sometimes react with counter-aggression. This includes our own system, trained on “clean” data, which suggests that inappropriate system behaviour is not caused by data bias.

A Knowledge-Grounded Multimodal Search-Based Conversational Agent
Shubham Agarwal | Ondřej Dušek | Ioannis Konstas | Verena Rieser
Proceedings of the 2018 EMNLP Workshop SCAI: The 2nd International Workshop on Search-Oriented Conversational AI

Multimodal search-based dialogue is a challenging new task: It extends visually grounded question answering systems into multi-turn conversations with access to an external database. We address this new challenge by learning a neural response generation system from the recently released Multimodal Dialogue (MMD) dataset (Saha et al., 2017). We introduce a knowledge-grounded multimodal conversational model where an encoded knowledge base (KB) representation is appended to the decoder input. Our model substantially outperforms strong baselines in terms of text-based similarity measures (over 9 BLEU points, 3 of which are solely due to the use of additional information from the KB).

Improving Context Modelling in Multimodal Dialogue Generation
Shubham Agarwal | Ondřej Dušek | Ioannis Konstas | Verena Rieser
Proceedings of the 11th International Conference on Natural Language Generation

In this work, we investigate the task of textual response generation in a multimodal task-oriented dialogue system. Our work is based on the recently released Multimodal Dialogue (MMD) dataset (Saha et al., 2017) in the fashion domain. We introduce a multimodal extension to the Hierarchical Recurrent Encoder-Decoder (HRED) model and show that this extension outperforms strong baselines in terms of text-based similarity metrics. We also showcase the shortcomings of current vision and language models by performing an error analysis on our system’s output.

Findings of the E2E NLG Challenge
Ondřej Dušek | Jekaterina Novikova | Verena Rieser
Proceedings of the 11th International Conference on Natural Language Generation

This paper summarises the experimental setup and results of the first shared task on end-to-end (E2E) natural language generation (NLG) in spoken dialogue systems. Recent end-to-end generation systems are promising since they reduce the need for data annotation. However, they are currently limited to small, delexicalised datasets. The E2E NLG shared task aims to assess whether these novel approaches can generate better-quality output by learning from a dataset containing higher lexical richness, syntactic complexity and diverse discourse phenomena. We compare 62 systems submitted by 17 institutions, covering a wide range of approaches, including machine learning architectures – with the majority implementing sequence-to-sequence models (seq2seq) – as well as systems based on grammatical rules and templates.

2017

The E2E Dataset: New Challenges For End-to-End Generation
Jekaterina Novikova | Ondřej Dušek | Verena Rieser
Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue

This paper describes the E2E data, a new dataset for training end-to-end, data-driven natural language generation systems in the restaurant domain, which is ten times bigger than existing, frequently used datasets in this area. The E2E dataset poses new challenges: (1) its human reference texts show more lexical richness and syntactic variation, including discourse phenomena; (2) generating from this set requires content selection. As such, learning from this dataset promises more natural, varied and less template-like system utterances. We also establish a baseline on this dataset, which illustrates some of the difficulties associated with this data.

Why We Need New Evaluation Metrics for NLG
Jekaterina Novikova | Ondřej Dušek | Amanda Cercas Curry | Verena Rieser
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

The majority of NLG evaluation relies on automatic metrics, such as BLEU. In this paper, we motivate the need for novel, system- and data-independent automatic evaluation methods: We investigate a wide range of metrics, including state-of-the-art word-based and novel grammar-based ones, and demonstrate that they only weakly reflect human judgements of system outputs as generated by data-driven, end-to-end NLG. We also show that metric performance is data- and system-specific. Nevertheless, our results also suggest that automatic metrics perform reliably at system-level and can support system development by finding cases where a system performs poorly.

2016

The REAL Corpus: A Crowd-Sourced Corpus of Human Generated and Evaluated Spatial References to Real-World Urban Scenes
Phil Bartie | William Mackaness | Dimitra Gkatzia | Verena Rieser
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Our interest is in people’s capacity to efficiently and effectively describe geographic objects in urban scenes. The broader ambition is to develop spatial models with equivalent functionality, able to construct such referring expressions. To that end we present a newly crowd-sourced data set of natural language references to objects anchored in complex urban scenes (in short, the REAL Corpus: Referring Expressions Anchored Language). The REAL corpus contains a collection of images of real-world urban scenes together with verbal descriptions of target objects generated by humans, paired with data on how successfully other people were able to identify the same object based on these descriptions. In total, the corpus contains 32 images with on average 27 descriptions per image and 3 verifications for each description. In addition, the corpus is annotated with a variety of linguistically motivated features. The paper highlights issues posed by collecting data using crowd-sourcing with an unrestricted input format, as well as using real-world urban scenes.

iLab-Edinburgh at SemEval-2016 Task 7: A Hybrid Approach for Determining Sentiment Intensity of Arabic Twitter Phrases
Eshrag Refaee | Verena Rieser
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

Natural Language Generation enhances human decision-making with uncertain information
Dimitra Gkatzia | Oliver Lemon | Verena Rieser
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Proceedings of the 9th International Natural Language Generation conference
Amy Isard | Verena Rieser | Dimitra Gkatzia
Proceedings of the 9th International Natural Language Generation conference

The aNALoGuE Challenge: Non Aligned Language GEneration
Jekaterina Novikova | Verena Rieser
Proceedings of the 9th International Natural Language Generation conference

Crowd-sourcing NLG Data: Pictures Elicit Better Data.
Jekaterina Novikova | Oliver Lemon | Verena Rieser
Proceedings of the 9th International Natural Language Generation conference

2015

Benchmarking Machine Translated Sentiment Analysis for Arabic Tweets
Eshrag Refaee | Verena Rieser
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop

Generating and Evaluating Landmark-Based Navigation Instructions in Virtual Environments
Amanda Cercas Curry | Dimitra Gkatzia | Verena Rieser
Proceedings of the 15th European Workshop on Natural Language Generation (ENLG)

A Game-Based Setup for Data Collection and Task-Based Evaluation of Uncertain Information Presentation
Dimitra Gkatzia | Amanda Cercas Curry | Verena Rieser | Oliver Lemon
Proceedings of the 15th European Workshop on Natural Language Generation (ENLG)

From the Virtual to the Real World: Referring to Objects in Real-World Spatial Scenes
Dimitra Gkatzia | Verena Rieser | Phil Bartie | William Mackaness
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

2014

Evaluating Distant Supervision for Subjectivity and Sentiment Analysis on Arabic Twitter Feeds
Eshrag Refaee | Verena Rieser
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)

The PARLANCE mobile application for interactive search in English and Mandarin
Helen Hastie | Marie-Aude Aufaure | Panos Alexopoulos | Hugues Bouchard | Catherine Breslin | Heriberto Cuayáhuitl | Nina Dethlefs | Milica Gašić | James Henderson | Oliver Lemon | Xingkun Liu | Peter Mika | Nesrine Ben Mustapha | Tim Potter | Verena Rieser | Blaise Thomson | Pirros Tsiakoulis | Yves Vanrompay | Boris Villazon-Terrazas | Majid Yazdani | Steve Young | Yanchao Yu
Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL)

An Arabic Twitter Corpus for Subjectivity and Sentiment Analysis
Eshrag Refaee | Verena Rieser
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present a newly collected data set of 8,868 gold-standard annotated Arabic feeds. The corpus is manually labelled for subjectivity and sentiment analysis (SSA) (κ = 0.816). In addition, the corpus is annotated with a variety of motivated feature-sets that have previously shown positive impact on performance. The paper highlights issues posed by Twitter as a genre, such as mixture of language varieties and topic-shifts. Our next step is to extend the current corpus, using online semi-supervised learning. A first sub-corpus will be released via the ELRA repository as part of this submission.

Cluster-based Prediction of User Ratings for Stylistic Surface Realisation
Nina Dethlefs | Heriberto Cuayáhuitl | Helen Hastie | Verena Rieser | Oliver Lemon
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics

2013

Demonstration of the PARLANCE system: a data-driven incremental, spoken dialogue system for interactive search
Helen Hastie | Marie-Aude Aufaure | Panos Alexopoulos | Heriberto Cuayáhuitl | Nina Dethlefs | Milica Gasic | James Henderson | Oliver Lemon | Xingkun Liu | Peter Mika | Nesrine Ben Mustapha | Verena Rieser | Blaise Thomson | Pirros Tsiakoulis | Yves Vanrompay
Proceedings of the SIGDIAL 2013 Conference

2012

Optimising Incremental Dialogue Decisions Using Information Density for Interactive Systems
Nina Dethlefs | Helen Hastie | Verena Rieser | Oliver Lemon
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

Optimising Incremental Generation for Spoken Dialogue Systems: Reducing the Need for Fillers
Nina Dethlefs | Helen Hastie | Verena Rieser | Oliver Lemon
INLG 2012 Proceedings of the Seventh International Natural Language Generation Conference

2011

Adaptive Information Presentation for Spoken Dialogue Systems: Evaluation with real users
Verena Rieser | Simon Keizer | Oliver Lemon | Xingkun Liu
Proceedings of the 13th European Workshop on Natural Language Generation

Learning and Evaluation of Dialogue Strategies for New Applications: Empirical Methods for Optimization from Small Data Sets
Verena Rieser | Oliver Lemon
Computational Linguistics, Volume 37, Issue 1 - March 2011

2010

Optimising Information Presentation for Spoken Dialogue Systems
Verena Rieser | Oliver Lemon | Xingkun Liu
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

Generation Under Uncertainty
Oliver Lemon | Srini Janarthanam | Verena Rieser
Proceedings of the 6th International Natural Language Generation Conference

2009

Natural Language Generation as Planning Under Uncertainty for Spoken Dialogue Systems
Verena Rieser | Oliver Lemon
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

2008

Learning Effective Multimodal Dialogue Strategies from Wizard-of-Oz Data: Bootstrapping and Evaluation
Verena Rieser | Oliver Lemon
Proceedings of ACL-08: HLT

Automatic Learning and Evaluation of User-Centered Objective Functions for Dialogue System Optimisation
Verena Rieser | Oliver Lemon
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The ultimate goal when building dialogue systems is to satisfy the needs of real users, but quality assurance for dialogue strategies is a non-trivial problem. The applied evaluation metrics and resulting design principles are often obscure, emerge by trial-and-error, and are highly context dependent. This paper introduces data-driven methods for obtaining reliable objective functions for system design. In particular, we test whether an objective function obtained from Wizard-of-Oz (WOZ) data is a valid estimate of real users’ preferences. We test this in a test-retest comparison between the model obtained from the WOZ study and the models obtained when testing with real users. We can show that, despite a low fit to the initial data, the objective function obtained from WOZ data makes accurate predictions for automatic dialogue evaluation, and, when automatically optimising a policy using these predictions, the improvement over a strategy simply mimicking the data becomes clear from an error analysis.

2006

The SAMMIE Multimodal Dialogue Corpus Meets the Nite XML Toolkit
Ivana Kruijff-Korbayová | Verena Rieser | Ciprian Gerstenberger | Jan Schehl | Tilman Becker
Proceedings of the 5th Workshop on NLP and XML (NLPXML-2006): Multi-Dimensional Markup in Natural Language Processing

The SAMMIE Corpus of Multimodal Dialogues with an MP3 Player
Ivana Kruijff-Korbayová | Tilman Becker | Nate Blaylock | Ciprian Gerstenberger | Michael Kaißer | Peter Poller | Verena Rieser | Jan Schehl
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We describe a corpus of multimodal dialogues with an MP3 player collected in Wizard-of-Oz experiments and annotated with a rich feature set at several layers. We are using the Nite XML Toolkit (NXT) to represent and further process the data. We designed an NXT data model, converted experiment log file data and manual transcriptions into NXT, and are building tools for additional annotation using NXT libraries. The annotated corpus will be used to (i) investigate various aspects of multimodal presentation and interaction strategies both within and across annotation layers; (ii) design an initial policy for reinforcement learning of multimodal clarification requests.

Using Machine Learning to Explore Human Multimodal Clarification Strategies
Verena Rieser | Oliver Lemon
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

2005

Implications for Generating Clarification Requests in Task-Oriented Dialogues
Verena Rieser | Johanna Moore
Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05)

A Corpus Collection and Annotation Framework for Learning Multimodal Clarification Strategies
Verena Rieser | Ivana Kruijff-Korbayová | Oliver Lemon
Proceedings of the 6th SIGdial Workshop on Discourse and Dialogue

An Experiment Setup for Collecting Data for Adaptive Output Planning in a Multimodal Dialogue System
Ivana Kruijff-Korbayová | Nate Blaylock | Ciprian Gerstenberger | Verena Rieser | Tilman Becker | Michael Kaisser | Peter Poller | Jan Schehl
Proceedings of the Tenth European Workshop on Natural Language Generation (ENLG-05)