Jonas Beskow


Speech Data Augmentation for Improving Phoneme Transcriptions of Aphasic Speech Using Wav2Vec 2.0 for the PSST Challenge
Birger Moell | Jim O’Regan | Shivam Mehta | Ambika Kirkland | Harm Lameris | Joakim Gustafson | Jonas Beskow
Proceedings of the RaPID Workshop - Resources and ProcessIng of linguistic, para-linguistic and extra-linguistic Data from people with various forms of cognitive/psychiatric/developmental impairments - within the 13th Language Resources and Evaluation Conference

As part of the PSST challenge, we explore how data augmentations, data sources, and model size affect phoneme transcription accuracy on speech produced by individuals with aphasia. We evaluate model performance in terms of feature error rate (FER) and phoneme error rate (PER). We find that data augmentation techniques, such as pitch shift, improve model performance. Additionally, increasing the size of the model decreases FER and PER. Our experiments also show that adding manually transcribed speech from non-aphasic speakers (TIMIT) improves performance when Room Impulse Response is used to augment the data. The best-performing model combines aphasic and non-aphasic data and has a 21.0% PER and a 9.2% FER, a relative improvement of 9.8% compared to the baseline model on the primary outcome measurement. We show that data augmentation, larger model size, and additional non-aphasic data sources can be helpful in improving automatic phoneme recognition models for people with aphasia.
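For readers unfamiliar with the metric, the following is a minimal sketch of how a phoneme error rate is commonly computed: the Levenshtein (edit) distance between the reference and hypothesis phoneme sequences, divided by the reference length. This is an illustration under stated assumptions, not the challenge's official scoring code; the function names and the ARPAbet-style example phonemes are hypothetical, and the official evaluation may aggregate over utterances differently.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences
    (unit cost for substitution, insertion, and deletion)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference phonemes
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance normalised by reference length (assumed definition of PER)."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one substitution (AE -> AH) and one insertion (S).
reference = ["K", "AE", "T"]
hypothesis = ["K", "AH", "T", "S"]
per = phoneme_error_rate(reference, hypothesis)
```

Under this definition PER can exceed 1.0 when the hypothesis contains many insertions. A feature error rate can be sketched along the same lines by comparing phonological feature vectors instead of whole phoneme symbols; that variant is not shown here.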


A Multimodal Corpus for Mutual Gaze and Joint Attention in Multiparty Situated Interaction
Dimosthenis Kontogiorgos | Vanya Avramova | Simon Alexanderson | Patrik Jonell | Catharine Oertel | Jonas Beskow | Gabriel Skantze | Joakim Gustafson
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Crowdsourced Multimodal Corpora Collection Tool
Patrik Jonell | Catharine Oertel | Dimosthenis Kontogiorgos | Jonas Beskow | Joakim Gustafson
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)


A Multi-party Multi-modal Dataset for Focus of Visual Attention in Human-human and Human-robot Interaction
Kalin Stefanov | Jonas Beskow
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper describes a data collection setup and a newly recorded dataset. The main purpose of this dataset is to explore patterns in the focus of visual attention of humans under three different conditions: two humans involved in task-based interaction with a robot; the same two humans involved in task-based interaction where the robot is replaced by a third human; and a free three-party human interaction. The dataset contains two parts: 6 sessions with a duration of approximately 3 hours and 9 sessions with a duration of approximately 4.5 hours. Both parts of the dataset are rich in modalities and recorded data streams; they include the streams of three Kinect v2 devices (color, depth, infrared, body and face data), three high-quality audio streams, three high-resolution GoPro video streams, touch data for the task-based interactions, and the system state of the robot. In addition, the second part of the dataset introduces the data streams from three Tobii Pro Glasses 2 eye trackers. The language of all interactions is English, and all data streams are spatially and temporally aligned.


Talking Heads, Signing Avatars and Social Robots
Jonas Beskow
Proceedings of SLPAT 2015: 6th Workshop on Speech and Language Processing for Assistive Technologies


3rd party observer gaze as a continuous measure of dialogue flow
Jens Edlund | Simon Alexandersson | Jonas Beskow | Lisa Gustavsson | Mattias Heldner | Anna Hjalmarsson | Petter Kallionen | Ellen Marklund
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present an attempt at using 3rd party observer gaze to get a measure of how appropriate each segment in a dialogue is for a speaker change. The method is a step away from the current dependency on speaker turns or talkspurts towards a more general view of speaker changes. We show that 3rd party observers do indeed largely look at the same thing (the speaker), and how this can be captured and utilized to provide insights into human communication. In addition, the results suggest that there might be differences in the distribution of 3rd party observer gaze depending on how information-rich an utterance is.


Spontal: A Swedish Spontaneous Dialogue Corpus of Audio, Video and Motion Capture
Jens Edlund | Jonas Beskow | Kjell Elenius | Kahl Hellmer | Sofia Strönbergsson | David House
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We present the Spontal database of spontaneous Swedish dialogues. 120 dialogues of at least 30 minutes each have been captured in high-quality audio, high-resolution video and with a motion capture system. The corpus is currently being processed and annotated, and will be made available for research at the end of the project.