Developing Text Normalization (TN) systems for Text-to-Speech (TTS) in new languages is hard. We propose a novel architecture that facilitates this for multiple languages while using less than 3% of the data used by the state-of-the-art results on English. We treat TN as a sequence classification problem and propose a granular tokenization mechanism that enables the system to learn the majority of the classes and their normalizations from the training data itself. This is further combined with minimal precoded linguistic knowledge for the other classes. We publish the first results on TN for TTS in Spanish and Tamil and also demonstrate that the performance of the approach is comparable with previous work on English. All annotated datasets used for experimentation will be released.
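To make the idea of granular tokenization more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that "granular" means splitting text into runs of letters, runs of digits, and single symbols, so that a sequence classifier can later assign a class to each small token. The function names `granular_tokenize` and `token_shape` and the shape labels are hypothetical. The pattern is illustrated on Latin-script input only; scripts that rely on combining marks, such as Tamil, would need a different pattern.

```python
import re

# Hypothetical granular tokenizer (an assumption, not the paper's method):
# runs of letters, runs of digits, and each remaining non-space symbol
# become separate tokens.
_TOKEN_RE = re.compile(r"[^\W\d_]+|\d+|[^\w\s]|_")


def granular_tokenize(text):
    """Split text into fine-grained tokens, discarding whitespace."""
    return _TOKEN_RE.findall(text)


def token_shape(token):
    """Return a coarse, hypothetical shape label usable as a classifier feature."""
    if token.isdigit():
        return "DIGITS"
    if token.isalpha():
        return "LETTERS"
    return "SYMBOL"


tokens = granular_tokenize("Call 555-1234 at 10:30am")
shapes = [token_shape(t) for t in tokens]
```

In this sketch a time expression such as "10:30am" is broken into four tokens, so a classifier sees the digits, the colon, and the suffix as separate items rather than one opaque string.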
This paper presents a web-based multimedia search engine built within the Buceador (www.buceador.org) research project. A proof-of-concept tool has been implemented that retrieves information from a digital library of multimedia documents in the four official languages of Spain (Spanish, Basque, Catalan and Galician). The retrieved documents are presented in the user's language (one of these four languages or English) after translation and dubbing. The paper presents the tool's functionality, the architecture and the digital library, and provides some information about the technology involved in the fields of automatic speech recognition, statistical machine translation, text-to-speech synthesis and information retrieval. Each technology has been adapted to the purposes of the presented tool and to interact with the other technologies involved.
METANET4U is a European project aiming at supporting language technology for European languages and multilingualism. It is a project in the META-NET Network of Excellence, a cluster of projects aiming at fostering the mission of META, the Multilingual Europe Technology Alliance, which is dedicated to building the technological foundations of a multilingual European information society. This paper describes the resources produced at our lab to provide synthetic voices. Using an existing 10 h corpus for a male and a female Spanish speaker, voices have been developed for use in Festival, with both unit-selection and statistical technologies. Furthermore, using data produced to support research on intra- and inter-lingual voice conversion, four bilingual voices (English/Spanish) have been developed. The paper describes these resources, which are available through META. In addition, an evaluation is presented that compares different synthesis techniques, the influence of the amount of data in statistical speech synthesis, and the effect of sharing data in bilingual voices.
This paper describes the first TTS evaluation campaign designed for Spanish. Seven research institutions took part in the evaluation campaign and developed a voice from a common speech database provided by the organisation. Each participating team had seven weeks to generate a voice. Next, a set of sentences was released and each team had to synthesise them within one week. Finally, some of the synthesised test audio files were subjectively evaluated via an online test according to the following criteria: similarity to the original voice, naturalness and intelligibility. Box plots, Wilcoxon tests and WER have been used to analyse the results. Two main conclusions can be drawn. On the one hand, there is considerable margin for improvement to reach the quality level of the natural voice. On the other hand, two systems obtain significantly better results than the rest: one is based on statistical parametric synthesis and the other is a concatenative system that uses a sinusoidal model both to modify prosody and to smooth spectral joints. Therefore, it seems that some kind of spectral control is needed when building voices with a medium-size database for unrestricted domains.
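As an illustration of the WER measure mentioned above, the following minimal sketch computes word error rate as the word-level edit distance divided by the number of reference words. It assumes that intelligibility is scored by comparing a listener's transcription with the reference sentence; the function name `word_error_rate` and the example sentences are hypothetical and are not taken from the campaign data.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution and one deletion over four words.
wer = word_error_rate("el gato come pescado", "el pato come")
```

Because insertions are counted as errors, this measure can exceed 1.0 when the transcription is much longer than the reference.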
In this paper we describe the design and production of a Catalan database for building synthetic voices. Two speakers have each recorded 10 hours of speech. The speaker selection and the corpus design aim to provide resources for high-quality synthesis. The resources have been used to build voices for the Festival TTS system. Both the original recordings and the Festival databases are freely available for research and for commercial use.
This paper describes an acceptance test procedure for evaluating a spoken language translation system between Catalan and Spanish. The procedure consists of two independent tests. The first test was an utterance-oriented evaluation for determining how the use of speech benefits communication. This test allowed the relative performance of the different system components to be compared, namely: source text to target text, source text to target speech, source speech to target text, and source speech to target speech. The second test was a task-oriented experiment for evaluating whether users could achieve some predefined goals for a given task with the current state of the technology. Eight subjects familiar with the technology and four subjects not familiar with it participated in the tests. From the results we can conclude that the state of the technology is getting closer to providing effective speech-to-speech translation systems, but there is still a lot of work to be done in this area. No significant differences in performance were observed between users who are familiar with the technology and users who are not. This constitutes, as far as we know, the first evaluation of a spoken translation system that considers performance at both the utterance level and the task level.
We present here an open-source software platform for the integration of speech translation components. This tool is useful for integrating different automatic speech recognition, spoken language translation and text-to-speech synthesis solutions into a common framework, as demonstrated in the evaluation of the European LC-STAR project and during the development of the national ALIADO project. Gaia operates with great flexibility, and it has been used to obtain the text and speech corpora needed when performing speech translation. The platform follows a modular distributed approach, with a specifically designed extensible network protocol handling the communication with the different modules. A well-defined and publicly available API facilitates the integration of existing solutions into the architecture. Fully functional audio and text interfaces, together with remote monitoring tools, are provided.
The newly founded European Centre of Excellence for Speech Synthesis (ECESS) is an initiative to promote the development of the European research area (ERA) in the field of Language Technology. ECESS focuses on the great challenge of high-quality speech synthesis, which is of crucial importance for future spoken-language technologies. The main goals of ECESS are to achieve the critical mass needed to substantially promote progress in TTS technology, to integrate basic research know-how related to speech synthesis and to attract public and private funding. To this end, a common system architecture based on exchangeable modules supplied by the ECESS members is to be established. The XML-based interface that connects these modules is the topic of this paper.
This paper deals with the design of a synthesis database for a high-quality corpus-based speech synthesis system in Spanish. The database has been designed for speech synthesis, speech conversion and expressive speech. The design follows the specifications of the TC-STAR project and has also been applied to collect equivalent English and Mandarin synthesis databases. The sentences of the corpus have been selected mainly from transcribed speech and novels. The selection criterion is phonetic and prosodic coverage. The corpus was completed with sentences specifically designed to cover frequent phrases and words. Two baseline speakers and four bilingual speakers were recorded. Recordings consist of 10 hours of speech for each baseline speaker and one hour of speech for each bilingual voice-conversion speaker. The database is labelled and segmented. Pitch marking and phonetic segmentation were done automatically, and up to 50% was manually supervised. The database will be available at ELRA.
In the framework of the EU-funded project TC-STAR (Technology and Corpora for Speech to Speech Translation), research on TTS aims at providing a synthesized voice that sounds like the source speaker speaking the target language. To progress in this direction, research within the TC-STAR framework is focused on naturalness, intelligibility, expressivity and voice conversion. For this purpose, specifications for large, high-quality TTS databases have been developed and the data have been recorded for UK English, Spanish and Mandarin. The development of speech technology in TC-STAR is evaluation driven. Assessment of speech synthesis is needed to determine how well a system or technique performs in comparison with previous versions as well as with other approaches (systems and methods). Apart from testing the whole system, all components of the system will be evaluated separately. This approach allows better assessment of each component as well as identification of the best techniques in the different speech synthesis processes. This paper describes the specifications of Language Resources for speech synthesis and the specifications for the evaluation of speech synthesis activities.