Jens Edlund


pdf bib
Revisiting Three Text-to-Speech Synthesis Experiments with a Web-Based Audience Response System
Christina Tånnander | Jens Edlund | Joakim Gustafson
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In order to investigate the strengths and weaknesses of Audience Response System (ARS) in text-to-speech synthesis (TTS) evaluations, we revisit three previously published TTS studies and perform an ARS-based evaluation on the stimuli used in each study. The experiments are performed with a participant pool of 39 respondents, using a web-based tool that emulates an ARS experiment. The results of the first experiment confirms that ARS is highly useful for evaluating long and continuous stimuli, particularly if we wish for a diagnostic result rather than a single overall metric, while the second and third experiments highlight weaknesses in ARS with unsuitable materials as well as the importance of framing and instruction when conducting ARS-based evaluation.

pdf bib
The MEET Corpus: Collocated, Distant and Hybrid Three-party Meetings with a Ranking Task
Ghazaleh Esfandiari-Baiat | Jens Edlund
Proceedings of the 20th Joint ACL - ISO Workshop on Interoperable Semantic Annotation @ LREC-COLING 2024

We introduce the MEET corpus. The corpus was collected with the aim of systematically studying the effects of collocated (physical), remote (digital) and hybrid work meetings on collaborative decision-making. It consists of 10 sessions, where each session contains three recordings: a collocated, a remote and a hybrid meeting between three participants. The participants are working on a different survival ranking task during each meeting. The duration of each meeting ranges from 10 to 18 minutes, resulting in 380 minutes of conversation altogether. We also present the annotation scheme designed specifically to target our research questions. The recordings are currently being transcribed and annotated in accordance with this scheme


pdf bib
Augmented Prompt Selection for Evaluation of Spontaneous Speech Synthesis
Eva Szekely | Jens Edlund | Joakim Gustafson
Proceedings of the Twelfth Language Resources and Evaluation Conference

By definition, spontaneous speech is unscripted and created on the fly by the speaker. It is dramatically different from read speech, where the words are authored as text before they are spoken. Spontaneous speech is emergent and transient, whereas text read out loud is pre-planned. For this reason, it is unsuitable to evaluate the usability and appropriateness of spontaneous speech synthesis by having it read out written texts sampled from for example newspapers or books. Instead, we need to use transcriptions of speech as the target - something that is much less readily available. In this paper, we introduce Starmap, a tool allowing developers to select a varied, representative set of utterances from a spoken genre, to be used for evaluation of TTS for a given domain. The selection can be done from any speech recording, without the need for transcription. The tool uses interactive visualisation of prosodic features with t-SNE, along with a tree-based algorithm to guide the user through thousands of utterances and ensure coverage of a variety of prompts. A listening test has shown that with a selection of genre-specific utterances, it is possible to show significant differences across genres between two synthetic voices built from spontaneous speech.


pdf bib
Bringing Order to Chaos: A Non-Sequential Approach for Browsing Large Sets of Found Audio Data
Per Fallgren | Zofia Malisz | Jens Edlund
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)


pdf bib
Hidden Resources ― Strategies to Acquire and Exploit Potential Spoken Language Resources in National Archives
Jens Edlund | Joakim Gustafson
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In 2014, the Swedish government tasked a Swedish agency, The Swedish Post and Telecom Authority (PTS), with investigating how to best create and populate an infrastructure for spoken language resources (Ref N2014/2840/ITP). As a part of this work, the department of Speech, Music and Hearing at KTH Royal Institute of Technology have taken inventory of existing potential spoken language resources, mainly in Swedish national archives and other governmental or public institutions. In this position paper, key priorities, perspectives, and strategies that may be of general, rather than Swedish, interest are presented. We discuss broad types of potential spoken language resources available; to what extent these resources are free to use; and thirdly the main contribution: strategies to ensure the continuous acquisition of spoken language resources in a manner that facilitates speech and speech technology research.


pdf bib
Human pause and resume behaviours for unobtrusive humanlike in-car spoken dialogue systems
Jens Edlund | Fredrik Edelstam | Joakim Gustafson
Proceedings of the EACL 2014 Workshop on Dialogue in Motion


pdf bib
3rd party observer gaze as a continuous measure of dialogue flow
Jens Edlund | Simon Alexandersson | Jonas Beskow | Lisa Gustavsson | Mattias Heldner | Anna Hjalmarsson | Petter Kallionen | Ellen Marklund
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present an attempt at using 3rd party observer gaze to get a measure of how appropriate each segment in a dialogue is for a speaker change. The method is a step away from the current dependency of speaker turns or talkspurts towards a more general view of speaker changes. We show that 3rd party observers do indeed largely look at the same thing (the speaker), and how this can be captured and utilized to provide insights into human communication. In addition, the results also suggest that there might be differences in the distribution of 3rd party observer gaze depending on how information-rich an utterance is.


pdf bib
Spontal-N: A Corpus of Interactional Spoken Norwegian
Rein Ove Sikveland | Anton Öttl | Ingunn Amdal | Mirjam Ernestus | Torbjørn Svendsen | Jens Edlund
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Spontal-N is a corpus of spontaneous, interactional Norwegian. To our knowledge, it is the first corpus of Norwegian in which the majority of speakers have spent significant parts of their lives in Sweden, and in which the recorded speech displays varying degrees of interference from Swedish. The corpus consists of studio quality audio- and video-recordings of four 30-minute free conversations between acquaintances, and a manual orthographic transcription of the entire material. On basis of the orthographic transcriptions, we automatically annotated approximately 50 percent of the material on the phoneme level, by means of a forced alignment between the acoustic signal and pronunciations listed in a dictionary. Approximately seven percent of the automatic transcription was manually corrected. Taking the manual correction as a gold standard, we evaluated several sources of pronunciation variants for the automatic transcription. Spontal-N is intended as a general purpose speech resource that is also suitable for investigating phonetic detail.

pdf bib
Spontal: A Swedish Spontaneous Dialogue Corpus of Audio, Video and Motion Capture
Jens Edlund | Jonas Beskow | Kjell Elenius | Kahl Hellmer | Sofia Strönbergsson | David House
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We present the Spontal database of spontaneous Swedish dialogues. 120 dialogues of at least 30 minutes each have been captured in high-quality audio, high-resolution video and with a motion capture system. The corpus is currently being processed and annotated, and will be made available for research at the end of the project.

pdf bib
A Snack Implementation and Tcl/Tk Interface to the Fundamental Frequency Variation Spectrum Algorithm
Kornel Laskowski | Jens Edlund
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Intonation is an important aspect of vocal production, used for a variety of communicative needs. Its modeling is therefore crucial in many speech understanding systems, particularly those requiring inference of speaker intent in real-time. However, the estimation of pitch, traditionally the first step in intonation modeling, is computationally inconvenient in such scenarios. This is because it is often, and most optimally, achieved only after speech segmentation and recognition. A consequence is that earlier speech processing components, in today’s state-of-the-art systems, lack intonation awareness by fiat; it is not known to what extent this circumscribes their performance. In the current work, we present a freely available implementation of an alternative to pitch estimation, namely the computation of the fundamental frequency variation (FFV) spectrum, which can be easily employed at any level within a speech processing system. It is our hope that the implementation we describe aid in the understanding of this novel acoustic feature space, and that it facilitate its inclusion, as desired, in the front-end routines of speech recognition, dialog act recognition, and speaker recognition systems.