2024
Revisiting Three Text-to-Speech Synthesis Experiments with a Web-Based Audience Response System
Christina Tånnander | Jens Edlund | Joakim Gustafson
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
In order to investigate the strengths and weaknesses of Audience Response Systems (ARS) in text-to-speech synthesis (TTS) evaluations, we revisit three previously published TTS studies and perform an ARS-based evaluation on the stimuli used in each study. The experiments are performed with a participant pool of 39 respondents, using a web-based tool that emulates an ARS experiment. The results of the first experiment confirm that ARS is highly useful for evaluating long and continuous stimuli, particularly if we wish for a diagnostic result rather than a single overall metric, while the second and third experiments highlight weaknesses of ARS with unsuitable materials as well as the importance of framing and instruction when conducting ARS-based evaluations.
The Role of Creaky Voice in Turn Taking and the Perception of Speaker Stance: Experiments Using Controllable TTS
Harm Lameris | Eva Szekely | Joakim Gustafson
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Recent advancements in spontaneous text-to-speech (TTS) have enabled the realistic synthesis of creaky voice, a voice quality known for its diverse pragmatic and paralinguistic functions. In this study, we used synthesized creaky voice in perceptual tests to explore how listeners without formal training perceive two distinct types of creaky voice. We annotated a spontaneous speech corpus using creaky voice detection tools and modified a neural TTS engine with a creaky phonation embedding to control the presence of creaky phonation in the synthesized speech. An objective analysis using a creak detection tool revealed significant differences in creaky phonation levels between the two creaky voice types and modal voice. Two subjective listening experiments were performed to investigate the effect of creaky voice on perceived certainty, valence, sarcasm, and turn finality. Participants rated non-positional creak as less certain, less positive, and more indicative of turn finality, while positional creak was rated as significantly more turn-final than modal phonation.
2022
Evaluating Sampling-based Filler Insertion with Spontaneous TTS
Siyang Wang | Joakim Gustafson | Éva Székely
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Inserting fillers (such as “um”, “like”) into clean speech text has a rich history of study. One major application is to make dialogue systems sound more spontaneous. The ambiguity of filler occurrence and inter-speaker differences make both modeling and evaluation difficult. In this paper, we study sampling-based filler insertion, a simple yet unexplored approach to inserting fillers. We propose an objective score called Filler Perplexity (FPP). We build three models trained on two single-speaker spontaneous corpora and evaluate them with FPP and perceptual tests. We implement two innovations in the perceptual tests: (1) evaluating filler insertion on dialogue system output, and (2) synthesizing speech with neural spontaneous TTS engines. FPP proves useful in analysis but does not correlate well with perceptual MOS. Perceptual results show little difference between the compared filler insertion models, including the ground truth, which may be due to the ambiguity of what constitutes good filler insertion and to a strong neural spontaneous TTS that produces natural speech irrespective of input. Results also show a preference for filler-inserted speech synthesized with spontaneous TTS. The same test using TTS based on read speech obtains the opposite result, which shows the importance of using spontaneous TTS in evaluating filler insertion. Audio samples: www.speech.kth.se/tts-demos/LREC22
Speech Data Augmentation for Improving Phoneme Transcriptions of Aphasic Speech Using Wav2Vec 2.0 for the PSST Challenge
Birger Moell | Jim O’Regan | Shivam Mehta | Ambika Kirkland | Harm Lameris | Joakim Gustafson | Jonas Beskow
Proceedings of the RaPID Workshop - Resources and ProcessIng of linguistic, para-linguistic and extra-linguistic Data from people with various forms of cognitive/psychiatric/developmental impairments - within the 13th Language Resources and Evaluation Conference
As part of the PSST challenge, we explore how data augmentation, data sources, and model size affect phoneme transcription accuracy on speech produced by individuals with aphasia. We evaluate model performance in terms of feature error rate (FER) and phoneme error rate (PER). We find that data augmentation techniques, such as pitch shift, improve model performance. Additionally, increasing the size of the model decreases FER and PER. Our experiments also show that adding manually transcribed speech from non-aphasic speakers (TIMIT) improves performance when room impulse response is used to augment the data. The best-performing model combines aphasic and non-aphasic data and has a 21.0% PER and a 9.2% FER, a relative improvement of 9.8% over the baseline model on the primary outcome measure. We show that data augmentation, larger model size, and additional non-aphasic data sources can help improve automatic phoneme recognition models for people with aphasia.
2020
Chinese Whispers: A Multimodal Dataset for Embodied Language Grounding
Dimosthenis Kontogiorgos | Elena Sibirtseva | Joakim Gustafson
Proceedings of the Twelfth Language Resources and Evaluation Conference
In this paper, we introduce a multimodal dataset in which subjects instruct each other how to assemble IKEA furniture. Using the concept of ‘Chinese Whispers’, an old children’s game, we employ a novel method to avoid implicit experimenter biases: we let subjects instruct each other on the nature of the task, namely the process of the furniture assembly. Uncertainty, hesitations, repairs and self-corrections are naturally introduced in the incremental process of establishing common ground. The corpus consists of 34 interactions, where each subject first assembles and then instructs. We collected speech, eye gaze, pointing gestures, and object movements, as well as subjective interpretations of mutual understanding, collaboration and task recall. The corpus is of particular interest to researchers working on multimodal signals in situated dialogue, especially referential communication and the process of language grounding.
Augmented Prompt Selection for Evaluation of Spontaneous Speech Synthesis
Eva Szekely | Jens Edlund | Joakim Gustafson
Proceedings of the Twelfth Language Resources and Evaluation Conference
By definition, spontaneous speech is unscripted and created on the fly by the speaker. It is dramatically different from read speech, where the words are authored as text before they are spoken. Spontaneous speech is emergent and transient, whereas text read out loud is pre-planned. For this reason, it is unsuitable to evaluate the usability and appropriateness of spontaneous speech synthesis by having it read out written texts sampled from, for example, newspapers or books. Instead, we need to use transcriptions of speech as the target, something that is much less readily available. In this paper, we introduce Starmap, a tool that allows developers to select a varied, representative set of utterances from a spoken genre, to be used for evaluation of TTS for a given domain. The selection can be done from any speech recording, without the need for transcription. The tool uses interactive visualisation of prosodic features with t-SNE, along with a tree-based algorithm, to guide the user through thousands of utterances and ensure coverage of a variety of prompts. A listening test has shown that, with a selection of genre-specific utterances, it is possible to show significant differences across genres between two synthetic voices built from spontaneous speech.
2018
A Multimodal Corpus for Mutual Gaze and Joint Attention in Multiparty Situated Interaction
Dimosthenis Kontogiorgos | Vanya Avramova | Simon Alexanderson | Patrik Jonell | Catharine Oertel | Jonas Beskow | Gabriel Skantze | Joakim Gustafson
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Crowdsourced Multimodal Corpora Collection Tool
Patrik Jonell | Catharine Oertel | Dimosthenis Kontogiorgos | Jonas Beskow | Joakim Gustafson
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2016
Hidden Resources ― Strategies to Acquire and Exploit Potential Spoken Language Resources in National Archives
Jens Edlund | Joakim Gustafson
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
In 2014, the Swedish government tasked a Swedish agency, the Swedish Post and Telecom Authority (PTS), with investigating how best to create and populate an infrastructure for spoken language resources (Ref N2014/2840/ITP). As part of this work, the Department of Speech, Music and Hearing at KTH Royal Institute of Technology has taken inventory of existing potential spoken language resources, mainly in Swedish national archives and other governmental or public institutions. In this position paper, we present key priorities, perspectives, and strategies that may be of general, rather than specifically Swedish, interest. We discuss the broad types of potential spoken language resources available; the extent to which these resources are free to use; and, thirdly, the main contribution: strategies to ensure the continuous acquisition of spoken language resources in a manner that facilitates speech and speech technology research.
2015
Automatic Detection of Miscommunication in Spoken Dialogue Systems
Raveesh Meena | José Lopes | Gabriel Skantze | Joakim Gustafson
Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue
2014
Proceedings of the EACL 2014 Workshop on Dialogue in Motion
Tiphaine Dalmas | Jana Götze | Joakim Gustafson | Srinivasan Janarthanam | Jan Kleindienst | Christian Mueller | Amanda Stent | Andreas Vlachos
Proceedings of the EACL 2014 Workshop on Dialogue in Motion
Human pause and resume behaviours for unobtrusive humanlike in-car spoken dialogue systems
Jens Edlund | Fredrik Edelstam | Joakim Gustafson
Proceedings of the EACL 2014 Workshop on Dialogue in Motion
Crowdsourcing Street-level Geographic Information Using a Spoken Dialogue System
Raveesh Meena | Johan Boye | Gabriel Skantze | Joakim Gustafson
Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL)
2013
Human Evaluation of Conceptual Route Graphs for Interpreting Spoken Route Descriptions
Raveesh Meena | Gabriel Skantze | Joakim Gustafson
Proceedings of the IWCS 2013 Workshop on Computational Models of Spatial Language Interpretation and Generation (CoSLI-3)
The Map Task Dialogue System: A Test-bed for Modelling Human-Like Dialogue
Raveesh Meena | Gabriel Skantze | Joakim Gustafson
Proceedings of the SIGDIAL 2013 Conference
A Data-driven Model for Timing Feedback in a Map Task Dialogue System
Raveesh Meena | Gabriel Skantze | Joakim Gustafson
Proceedings of the SIGDIAL 2013 Conference
2009
Eliciting Interactional Phenomena in Human-Human Dialogues
Joakim Gustafson | Miray Merkes
Proceedings of the SIGDIAL 2009 Conference
Attention and Interaction Control in a Human-Human-Computer Dialogue Setting
Gabriel Skantze | Joakim Gustafson
Proceedings of the SIGDIAL 2009 Conference
2005
How to do Dialogue in a Fairy-tale World
Johan Boye | Joakim Gustafson
Proceedings of the 6th SIGdial Workshop on Discourse and Dialogue
2004
The NICE Fairy-tale Game System
Joakim Gustafson | Linda Bell | Johan Boye | Anders Lindström | Mats Wirén
Proceedings of the 5th SIGdial Workshop on Discourse and Dialogue at HLT-NAACL 2004