2020
From Speech-to-Speech Translation to Automatic Dubbing
Marcello Federico | Robert Enyedi | Roberto Barra-Chicote | Ritwik Giri | Umut Isik | Arvindh Krishnaswamy | Hassan Sawaf
Proceedings of the 17th International Conference on Spoken Language Translation
We present enhancements to a speech-to-speech translation pipeline in order to perform automatic dubbing. Our architecture features neural machine translation that generates output of preferred length, prosodic alignment of the translation with the original speech segments, neural text-to-speech with fine-tuning of the duration of each utterance, and, finally, audio rendering that enriches the text-to-speech output with background noise and reverberation extracted from the original audio. We report and discuss the results of a first subjective evaluation of automatic dubbing of excerpts of TED Talks from English into Italian, which measures the perceived naturalness of automatic dubbing and the relative importance of each proposed enhancement.
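As an illustration of the prosodic alignment step, here is a minimal Python sketch: it distributes the words of a translation over the detected speech intervals of the original utterance, in proportion to each interval's duration, so the dub respects the original pauses. The function name and the proportional-split heuristic are our own assumptions for illustration, not the paper's actual algorithm.

```python
# Toy sketch of prosodic alignment (assumptions ours, not the paper's
# method): split a translation across the original speech intervals,
# giving each interval a word count proportional to its duration.

def align_to_intervals(words, intervals):
    """Assign words to (start, end) speech intervals by duration share."""
    total = sum(end - start for start, end in intervals)
    phrases, i = [], 0
    for start, end in intervals:
        if (start, end) == intervals[-1]:
            take = words[i:]  # last interval absorbs any rounding leftover
        else:
            share = round(len(words) * (end - start) / total)
            take = words[i:i + share]
        phrases.append(((start, end), " ".join(take)))
        i += len(take)
    return phrases

# Example: a 10-word Italian translation over two speech intervals,
# separated by a pause from 2.0 s to 2.5 s in the original audio.
translation = "questa è una frase di esempio per il doppiaggio automatico".split()
print(align_to_intervals(translation, [(0.0, 2.0), (2.5, 4.5)]))
```

Each resulting phrase can then be synthesized with a target duration equal to its interval length, which is where the paper's per-utterance duration fine-tuning would apply.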
2019
In Other News: a Bi-style Text-to-speech Model for Synthesizing Newscaster Voice with Limited Data
Nishant Prateek | Mateusz Łajszczak | Roberto Barra-Chicote | Thomas Drugman | Jaime Lorenzo-Trueba | Thomas Merritt | Srikanth Ronanki | Trevor Wood
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Industry Papers)
Neural text-to-speech synthesis (NTTS) models have shown significant progress in generating high-quality speech; however, they require large quantities of training data, which makes creating models for multiple styles expensive and time-consuming. In this paper, different styles of speech are analysed based on prosodic variations, and from this analysis a model is proposed that synthesises speech in the style of a newscaster with just a few hours of supplementary data. We pose the problem of synthesising in a target style using limited data as that of creating a bi-style model that can synthesise both neutral-style and newscaster-style speech via a one-hot vector which factorises the two styles. We also propose conditioning the model on contextual word embeddings, and we evaluate it extensively against neutral NTTS and neutral concatenative-based synthesis. The model closes the gap in perceived style-appropriateness between natural recordings of newscaster-style speech and neutral speech synthesis by approximately two-thirds.
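The one-hot style factorisation can be pictured with a tiny sketch: a shared model receives the same linguistic or acoustic features for both styles, and only a two-dimensional one-hot vector distinguishes neutral from newscaster speech. Everything below (names, feature layout) is an illustrative assumption, not the paper's architecture.

```python
# Toy sketch of one-hot style conditioning (assumptions ours): the same
# shared model weights serve both styles, and a 2-d one-hot vector
# appended to the input factorises neutral vs. newscaster speech.

STYLES = {"neutral": [1.0, 0.0], "newscaster": [0.0, 1.0]}

def condition(frame_features, style):
    """Concatenate input features with the one-hot style vector."""
    return frame_features + STYLES[style]

# The model input for the same text differs only in the style bits,
# which is why a few hours of newscaster data can suffice: the style
# is learned as an offset on top of the shared neutral model.
print(condition([0.2, -0.1, 0.7], "neutral"))     # [0.2, -0.1, 0.7, 1.0, 0.0]
print(condition([0.2, -0.1, 0.7], "newscaster"))  # [0.2, -0.1, 0.7, 0.0, 1.0]
```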
2016
Continuous Expressive Speaking Styles Synthesis based on CVSM and MR-HMM
Jaime Lorenzo-Trueba | Roberto Barra-Chicote | Ascension Gallardo-Antolin | Junichi Yamagishi | Juan M. Montero
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers
This paper introduces a continuous system capable of automatically producing the most adequate speaking style for synthesizing a desired target text. This is achieved through joint modeling of the acoustic and lexical parameters of the speaker models, adapting the CVSM projection of the training texts using MR-HMM techniques. We consider that, as long as sufficient variety is available in the training data, a continuous lexical space can be modeled into a continuous acoustic space. The proposed continuous text-to-speech system was assessed in a perceptual evaluation against traditional approaches to the task. The system proved capable of conveying the correct expressiveness (average adequacy of 3.6) with an expressive strength comparable to oracle traditional expressive speech synthesis (average of 3.6), although with a drop in speech quality, mainly due to the semi-continuous nature of the data (average quality of 2.9). This means the proposed system can improve on traditional neutral systems without requiring any additional user interaction.
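To make the continuous lexical-to-acoustic mapping concrete, here is a deliberately simplified sketch: it places a target text in a continuous style space via normalised keyword overlap, yielding weights that could interpolate between speaker style models. This toy scoring is our own stand-in for the CVSM projection and MR-HMM adaptation, which it does not implement.

```python
# Toy illustration (our own simplification, not CVSM/MR-HMM): map a
# target text to continuous style coordinates by scoring it against
# per-style keyword sets and normalising, so the result can weight an
# interpolation between per-style acoustic models.

STYLE_LEXICON = {
    "happy": {"great", "wonderful", "win"},
    "sad": {"loss", "regret", "alone"},
}

def style_weights(text):
    """Return normalised per-style weights for the given text."""
    tokens = set(text.lower().split())
    raw = {s: len(tokens & kw) for s, kw in STYLE_LEXICON.items()}
    total = sum(raw.values()) or 1  # avoid division by zero
    return {s: c / total for s, c in raw.items()}

# e.g. {'happy': 0.67, 'sad': 0.33} -> weights for blending style models
print(style_weights("a wonderful win after such regret"))
```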
2010
HIFI-AV: An Audio-visual Corpus for Spoken Language Human-Machine Dialogue Research in Spanish
Fernando Fernández-Martínez | Juan Manuel Lucas-Cuesta | Roberto Barra Chicote | Javier Ferreiros | Javier Macías-Guarasa
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
In this paper, we describe a new multi-purpose audio-visual database in the context of speech interfaces for controlling household electronic devices. The database comprises speech and video recordings of 19 speakers interacting with a HIFI audio box by means of a spoken dialogue system. Dialogue management is based on Bayesian Networks, and the system is provided with contextual information handling strategies. Each speaker was asked to fulfil different sets of specific goals following predefined scenarios, varying in both complexity level and the degree of freedom or initiative allowed to the user. Thanks to its careful design and its size, the recorded database allows comprehensive studies on speech recognition, speech understanding, dialogue modeling and management, microphone-array-based speech processing, and both speech- and video-based acoustic source localisation. The database has been labelled for quality and efficiency studies of dialogue performance, and the whole database has been validated through both objective and subjective tests.
2009
Speeding Up the Design of Dialogue Applications by Using Database Contents and Structure Information
Luis Fernando D’Haro | Ricardo de Cordoba | Juan Manuel Lucas | Roberto Barra-Chicote | Ruben San-Segundo
Proceedings of the SIGDIAL 2009 Conference