2024
Comparing Pre-Trained Embeddings and Domain-Independent Features for Regression-Based Evaluation of Task-Oriented Dialogue Systems
Kallirroi Georgila
Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue
We use Gaussian Process Regression to predict different types of ratings provided by users after interacting with various task-oriented dialogue systems. We compare the performance of domain-independent dialogue features (e.g., duration, number of filled slots, number of confirmed slots, word error rate) with pre-trained dialogue embeddings. These pre-trained dialogue embeddings are computed by averaging over sentence embeddings in a dialogue. Sentence embeddings are created using various models based on sentence transformers (appearing on the Hugging Face Massive Text Embedding Benchmark leaderboard) or by averaging over BERT word embeddings (varying the BERT layers used). We also compare pre-trained embeddings extracted from human transcriptions with pre-trained embeddings extracted from speech recognition outputs, to determine the robustness of these models to errors. Our results show that overall, for most types of user satisfaction ratings and advanced/recent (or sometimes less advanced/recent) pre-trained embedding models, using only pre-trained embeddings outperforms using only domain-independent features. However, this pattern varies depending on the type of rating and the embedding model used. Also, pre-trained embeddings are found to be robust to speech recognition errors, more advanced/recent embedding models do not always perform better than less advanced/recent ones, and larger models do not necessarily outperform smaller ones. The best prediction performance is achieved by combining pre-trained embeddings with domain-independent features.
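The dialogue-level embedding described in this abstract (an average over the sentence embeddings of a dialogue) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `mean_pool` and the 3-dimensional vectors are hypothetical, and real sentence embeddings would come from a sentence-transformer or BERT model rather than being typed in by hand.

```python
from typing import List, Sequence


def mean_pool(sentence_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average equal-length sentence vectors into a single dialogue vector."""
    if not sentence_embeddings:
        raise ValueError("need at least one sentence embedding")
    dim = len(sentence_embeddings[0])
    if any(len(vec) != dim for vec in sentence_embeddings):
        raise ValueError("all sentence embeddings must have the same dimension")
    n = len(sentence_embeddings)
    # Component-wise mean over all sentences in the dialogue.
    return [sum(vec[i] for vec in sentence_embeddings) / n for i in range(dim)]


# Hypothetical embeddings for a two-utterance dialogue (assumed values).
dialogue = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
dialogue_embedding = mean_pool(dialogue)
```

The resulting vector could then serve as the input features of a regressor such as Gaussian Process Regression, either on its own or concatenated with domain-independent features.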
2022
Strategy-level Entrainment of Dialogue System Users in a Creative Visual Reference Resolution Task
Deepthi Karkada | Ramesh Manuvinakurike | Maike Paetzel-Prüsmann | Kallirroi Georgila
Proceedings of the Thirteenth Language Resources and Evaluation Conference
In this work, we study entrainment of users playing a creative reference resolution game with an autonomous dialogue system. The language understanding module in our dialogue system leverages annotated human-wizard conversational data, openly available knowledge graphs, and crowd-augmented data. Unlike previous entrainment work, our dialogue system does not attempt to make the human conversation partner adopt lexical items in their dialogue, but rather to adapt their descriptive strategy to one that is simpler to parse for our natural language understanding unit. By deploying this dialogue system through a crowd-sourced study, we show that users indeed entrain on a “strategy-level” without the change of strategy impinging on their creativity. Our work thus presents a promising future research direction for developing dialogue management systems that can strategically influence people’s descriptive strategy to ease the system’s language understanding in creative tasks.
Evaluation of Off-the-shelf Speech Recognizers on Different Accents in a Dialogue Domain
Divya Tadimeti | Kallirroi Georgila | David Traum
Proceedings of the Thirteenth Language Resources and Evaluation Conference
We evaluate several publicly available off-the-shelf (commercial and research) automatic speech recognition (ASR) systems on dialogue agent-directed English speech from speakers with General American vs. non-American accents. Our results show that the performance of the ASR systems for non-American accents is considerably worse than for General American accents. Depending on the recognizer, the absolute difference in performance between General American accents and all non-American accents combined can vary approximately from 2% to 12%, with relative differences varying approximately between 16% and 49%. This drop in performance becomes even larger when we consider specific categories of non-American accents indicating a need for more diligent collection of and training on non-native English speaker data in order to narrow this performance gap. There are performance differences across ASR systems, and while the same general pattern holds, with more errors for non-American accents, there are some accents for which the best recognizer is different than in the overall case. We expect these results to be useful for dialogue system designers in developing more robust inclusive dialogue systems, and for ASR providers in taking into account performance requirements for different accents.
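The distinction between absolute and relative differences in this abstract can be made concrete with a short sketch. It assumes the performance metric is an error rate such as word error rate and that the relative difference is taken with respect to the General American baseline; the abstract states neither, and the numbers below are hypothetical, not results from the paper.

```python
def absolute_difference(baseline: float, other: float) -> float:
    """Difference in error rate, in the same units as the inputs."""
    return other - baseline


def relative_difference(baseline: float, other: float) -> float:
    """Difference expressed as a fraction of the baseline error rate."""
    if baseline == 0:
        raise ValueError("baseline error rate must be non-zero")
    return (other - baseline) / baseline


# Hypothetical error rates (assumed values, for illustration only).
general_american = 0.10
non_american = 0.15

abs_diff = absolute_difference(general_american, non_american)
rel_diff = relative_difference(general_american, non_american)
print(f"absolute: {abs_diff:.1%}, relative: {rel_diff:.1%}")
```

A small absolute gap can therefore correspond to a large relative gap when the baseline error rate is low, which is why the abstract reports both.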
2020
Predicting Ratings of Real Dialogue Participants from Artificial Data and Ratings of Human Dialogue Observers
Kallirroi Georgila | Carla Gordon | Volodymyr Yanov | David Traum
Proceedings of the Twelfth Language Resources and Evaluation Conference
We collected a corpus of dialogues in a Wizard of Oz (WOz) setting in the Internet of Things (IoT) domain. We asked users participating in these dialogues to rate the system on a number of aspects, namely, intelligence, naturalness, personality, friendliness, their enjoyment, overall quality, and whether they would recommend the system to others. Then we asked dialogue observers, i.e., Amazon Mechanical Turkers (MTurkers), to rate these dialogues on the same aspects. We also generated simulated dialogues between dialogue policies and simulated users and asked MTurkers to rate them again on the same aspects. Using linear regression, we developed dialogue evaluation functions based on features from the simulated dialogues and the MTurkers’ ratings, the WOz dialogues and the MTurkers’ ratings, and the WOz dialogues and the WOz participants’ ratings. We applied all these dialogue evaluation functions to a held-out portion of our WOz dialogues, and we report results on the predictive power of these different types of dialogue evaluation functions. Our results suggest that for three conversational aspects (intelligence, naturalness, overall quality) just training evaluation functions on simulated data could be sufficient.
Evaluation of Off-the-shelf Speech Recognizers Across Diverse Dialogue Domains
Kallirroi Georgila | Anton Leuski | Volodymyr Yanov | David Traum
Proceedings of the Twelfth Language Resources and Evaluation Conference
We evaluate several publicly available off-the-shelf (commercial and research) automatic speech recognition (ASR) systems across diverse dialogue domains (in US-English). Our evaluation is aimed at non-experts with limited experience in speech recognition. Our goal is not only to compare a variety of ASR systems on several diverse data sets but also to measure how much ASR technology has advanced since our previous large-scale evaluations on the same data sets. Our results show that the performance of each speech recognizer can vary significantly depending on the domain. Furthermore, despite major recent progress in ASR technology, current state-of-the-art speech recognizers perform poorly in domains that require special vocabulary and language models, and under noisy conditions. We expect that our evaluation will prove useful to ASR consumers and dialogue system designers.
2018
Edit me: A Corpus and a Framework for Understanding Natural Language Image Editing
Ramesh Manuvinakurike | Jacqueline Brixey | Trung Bui | Walter Chang | Doo Soon Kim | Ron Artstein | Kallirroi Georgila
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
DialEdit: Annotations for Spoken Conversational Image Editing
Ramesh Manuvinakurike | Jacqueline Brixey | Trung Bui | Walter Chang | Ron Artstein | Kallirroi Georgila
Proceedings of the 14th Joint ACL-ISO Workshop on Interoperable Semantic Annotation
A Dialogue Annotation Scheme for Weight Management Chat using the Trans-Theoretical Model of Health Behavior Change
Ramesh Manuvinakurike | Sumanth Bharawadj | Kallirroi Georgila
Proceedings of the 14th Joint ACL-ISO Workshop on Interoperable Semantic Annotation
Towards Understanding End-of-trip Instructions in a Taxi Ride Scenario
Deepthi Karkada | Ramesh Manuvinakurike | Kallirroi Georgila
Proceedings of the 14th Joint ACL-ISO Workshop on Interoperable Semantic Annotation
Conversational Image Editing: Incremental Intent Identification in a New Dialogue Task
Ramesh Manuvinakurike | Trung Bui | Walter Chang | Kallirroi Georgila
Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue
We present “conversational image editing”, a novel real-world application domain combining dialogue, visual information, and the use of computer vision. We discuss the importance of dialogue incrementality in this task, and build various models for incremental intent identification based on deep learning and traditional classification algorithms. We show how our model based on convolutional neural networks outperforms models based on random forests, long short term memory networks, and conditional random fields. By training embeddings based on image-related dialogue corpora, we outperform pre-trained out-of-the-box embeddings, for intention identification tasks. Our experiments also provide evidence that incremental intent processing may be more efficient for the user and could save time in accomplishing tasks.
2017
Using Reinforcement Learning to Model Incrementality in a Fast-Paced Dialogue Game
Ramesh Manuvinakurike | David DeVault | Kallirroi Georgila
Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue
We apply Reinforcement Learning (RL) to the problem of incremental dialogue policy learning in the context of a fast-paced dialogue game. We compare the policy learned by RL with a high-performance baseline policy which has been shown to perform very efficiently (nearly as well as humans) in this dialogue game. The RL policy outperforms the baseline policy in offline simulations (based on real user data). We provide a detailed comparison of the RL policy and the baseline policy, including information about how much effort and time it took to develop each one of them. We also highlight the cases where the RL policy performs better, and show that understanding the RL policy can provide valuable insights which can inform the creation of an even better rule-based policy.
2016
New Dimensions in Testimony Demonstration
Ron Artstein | Alesia Gainer | Kallirroi Georgila | Anton Leuski | Ari Shapiro | David Traum
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations
2015
Reinforcement Learning in Multi-Party Trading Dialog
Takuya Hiraoka | Kallirroi Georgila | Elnaz Nouri | David Traum | Satoshi Nakamura
Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Which Synthetic Voice Should I Choose for an Evocative Task?
Eli Pincus | Kallirroi Georgila | David Traum
Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Reinforcement Learning of Multi-Issue Negotiation Dialogue Policies
Alexandros Papangelis | Kallirroi Georgila
Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Evaluating Spoken Dialogue Processing for Time-Offset Interaction
David Traum | Kallirroi Georgila | Ron Artstein | Anton Leuski
Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue
2014
Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL)
Kallirroi Georgila | Matthew Stone | Helen Hastie | Ani Nenkova
Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL)
A Demonstration of Dialogue Processing in SimSensei Kiosk
Fabrizio Morbini | David DeVault | Kallirroi Georgila | Ron Artstein | David Traum | Louis-Philippe Morency
Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL)
Single-Agent vs. Multi-Agent Techniques for Concurrent Reinforcement Learning of Negotiation Dialogue Policies
Kallirroi Georgila | Claire Nelson | David Traum
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
2013
Reinforcement Learning of Two-Issue Negotiation Dialogue Policies
Kallirroi Georgila
Proceedings of the SIGDIAL 2013 Conference
Verbal indicators of psychological distress in interactive dialogue with a virtual human
David DeVault | Kallirroi Georgila | Ron Artstein | Fabrizio Morbini | David Traum | Stefan Scherer | Albert Skip Rizzo | Louis-Philippe Morency
Proceedings of the SIGDIAL 2013 Conference
2012
Reinforcement Learning of Question-Answering Dialogue Policies for Virtual Museum Guides
Teruhisa Misu | Kallirroi Georgila | Anton Leuski | David Traum
Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Practical Evaluation of Human and Synthesized Speech for Virtual Human Dialogue Systems
Kallirroi Georgila | Alan Black | Kenji Sagae | David Traum
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
The current practice in virtual human dialogue systems is to use professional human recordings or limited-domain speech synthesis. Both approaches lead to good performance but at a high cost. To determine the best trade-off between performance and cost, we perform a systematic evaluation of human and synthesized voices with regard to naturalness, conversational aspect, and likability. We vary the type (in-domain vs. out-of-domain), length, and content of utterances, and take into account the age and native language of raters as well as their familiarity with speech synthesis. We present detailed results from two studies, a pilot one and one run on Amazon's Mechanical Turk. Our results suggest that a professional human voice can supersede both an amateur human voice and synthesized voices. Also, a high-quality general-purpose voice or a good limited-domain voice can perform better than amateur human recordings. We do not find any significant differences between the performance of a high-quality general-purpose voice and a limited-domain voice, both trained with speech recorded by actors. As expected, the high-quality general-purpose voice is rated higher than the limited-domain voice for out-of-domain sentences and lower for in-domain sentences. There is also a trend for long or negative-content utterances to receive lower ratings.
2011
An Annotation Scheme for Cross-Cultural Argumentation and Persuasion Dialogues
Kallirroi Georgila | Ron Artstein | Angela Nazarian | Michael Rushforth | David Traum | Katia Sycara
Proceedings of the SIGDIAL 2011 Conference
2010
Learning Dialogue Strategies from Older and Younger Simulated Users
Kallirroi Georgila | Maria Wolters | Johanna Moore
Proceedings of the SIGDIAL 2010 Conference
Cross-Domain Speech Disfluency Detection
Kallirroi Georgila | Ning Wang | Jonathan Gratch
Proceedings of the SIGDIAL 2010 Conference
Practical Evaluation of Speech Recognizers for Virtual Human Dialogue Systems
Xuchen Yao | Pravin Bhutada | Kallirroi Georgila | Kenji Sagae | Ron Artstein | David Traum
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
We perform a large-scale evaluation of multiple off-the-shelf speech recognizers across diverse domains for virtual human dialogue systems. Our evaluation is aimed at speech recognition consumers and potential consumers with limited experience with readily available recognizers. We focus on practical factors to determine what levels of performance can be expected from different available recognizers in various projects featuring different types of conversational utterances. Our results show that there is no single recognizer that outperforms all other recognizers in all domains. The performance of each recognizer may vary significantly depending on the domain, the size and perplexity of the corpus, the out-of-vocabulary rate, and whether acoustic and language model adaptation has been used or not. We expect that our evaluation will prove useful to other speech recognition consumers, especially in the dialogue community, and will shed some light on the key problem in spoken dialogue systems of selecting the most suitable available speech recognition system for a particular application, and what impact training will have.
2009
Using Integer Linear Programming for Detecting Speech Disfluencies
Kallirroi Georgila
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers
Evaluating the Effectiveness of Information Presentation in a Full End-To-End Dialogue System
Taghi Paksima | Kallirroi Georgila | Johanna Moore
Proceedings of the SIGDIAL 2009 Conference
2008
Simulating the Behaviour of Older versus Younger Users when Interacting with Spoken Dialogue Systems
Kallirroi Georgila | Maria Wolters | Johanna Moore
Proceedings of ACL-08: HLT, Short Papers
A Fully Annotated Corpus for Studying the Effect of Cognitive Ageing on Users’ Interactions with Spoken Dialogue Systems
Kallirroi Georgila | Maria Wolters | Vasilis Karaiskos | Melissa Kronenthal | Robert Logie | Neil Mayo | Johanna Moore | Matt Watson
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
In this paper we present a corpus of interactions of older and younger users with nine different dialogue systems. The corpus has been fully transcribed and annotated with dialogue acts and Information State Update (ISU) representations of dialogue context. Users not only underwent a comprehensive battery of cognitive assessments, but they also rated the usability of each dialogue system on a standardised questionnaire. In this paper, we discuss the corpus collection and outline the semi-automatic methods we used for discourse-level annotations. We expect that the corpus will provide a key resource for modelling older people's interaction with spoken dialogue systems.
Hybrid Reinforcement/Supervised Learning of Dialogue Policies from Fixed Data Sets
James Henderson | Oliver Lemon | Kallirroi Georgila
Computational Linguistics, Volume 34, Number 4, December 2008
2006
An ISU Dialogue System Exhibiting Reinforcement Learning of Dialogue Policies: Generic Slot-Filling in the TALK In-car System
Oliver Lemon | Kallirroi Georgila | James Henderson | Matthew Stuttle
Demonstrations
2005
Quantitative Evaluation of User Simulation Techniques for Spoken Dialogue Systems
Jost Schatzmann | Kallirroi Georgila | Steve Young
Proceedings of the 6th SIGdial Workshop on Discourse and Dialogue
2004
A graphical Tool for Handling Rule Grammars in Java Speech Grammar Format
Kallirroi Georgila | Nikos Fakotakis | George Kokkinakis
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
2000
A Graphical Parametric Language-Independent Tool for the Annotation of Speech Corpora
Kallirroi Georgila | Nikos Fakotakis | George Kokkinakis
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)