Mikel Penagarikano

Also published as: M. Peñagarikano

2014

KALAKA-3: a database for the recognition of spoken European languages on YouTube audios
Luis Javier Rodríguez-Fuentes | Mikel Penagarikano | Amparo Varona | Mireia Diez | Germán Bordel
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper describes the main features of KALAKA-3, a speech database specifically designed for the development and evaluation of language recognition systems. The database provides TV broadcast speech for training, and audio data extracted from YouTube videos for tuning and testing. The database was created to support the Albayzin 2012 Language Recognition Evaluation, which featured two language recognition tasks, both dealing with European languages. The first one involved six target languages (Basque, Catalan, English, Galician, Portuguese and Spanish) for which there was plenty of training data, whereas the second one involved four target languages (French, German, Greek and Italian) for which no training data was provided. Two separate sets of YouTube audio files were provided to test the performance of language recognition systems on both tasks. To allow open-set tests, these datasets included speech in 11 additional (Out-Of-Set) European languages. The paper also presents a summary of the results attained in the evaluation, along with the performance of state-of-the-art systems on the four evaluation tracks defined on the database, which demonstrates the extreme difficulty of some of them. As far as we know, this is the first database specifically designed to benchmark spoken language recognition technology on YouTube audios.

Basque Speecon-like and Basque SpeechDat MDB-600: speech databases for the development of ASR technology for Basque
Igor Odriozola | Inma Hernaez | María Inés Torres | Luis Javier Rodriguez-Fuentes | Mikel Penagarikano | Eva Navas
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper introduces two databases specifically designed for the development of ASR technology for the Basque language: the Basque Speecon-like database and the Basque SpeechDat MDB-600 database. The former was recorded in an office environment according to the Speecon specifications, whereas the latter was recorded through mobile telephones according to the SpeechDat specifications. Both databases were created under an initiative that the Basque Government started in 2005, a program called ADITU, which aimed at developing speech technologies for Basque. The databases belong to the Basque Government. A comprehensive description of both databases is provided in this work, highlighting the differences with regard to their corresponding standard specifications. The paper also presents several initial experimental results for both databases with the purpose of validating their usefulness for the development of speech recognition technology. Several applications already developed with the Basque Speecon-like database are also described. The authors also aim to make these databases widely known to the community and to foster their use by other groups.

2012

KALAKA-2: a TV Broadcast Speech Database for the Recognition of Iberian Languages in Clean and Noisy Environments
Luis Javier Rodríguez-Fuentes | Mikel Penagarikano | Amparo Varona | Mireia Diez | Germán Bordel
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper presents the main features (design issues, recording setup, etc.) of KALAKA-2, a TV broadcast speech database specifically designed for the development and evaluation of language recognition systems in clean and noisy environments. KALAKA-2 was created to support the Albayzin 2010 Language Recognition Evaluation (LRE), organized by the Spanish Network on Speech Technologies from June to November 2010. The database features 6 target languages: Basque, Catalan, English, Galician, Portuguese and Spanish, and includes segments in other (Out-Of-Set) languages, which allow open-set verification tests to be performed. The best performance attained in the Albayzin 2010 LRE is presented and briefly discussed. The performance of a state-of-the-art system in various tasks defined on the database is also presented. In both cases, results highlight the suitability of KALAKA-2 as a benchmark for the development and evaluation of language recognition technology.

2010

KALAKA: A TV Broadcast Speech Database for the Evaluation of Language Recognition Systems
Luis Javier Rodríguez-Fuentes | Mikel Penagarikano | Germán Bordel | Amparo Varona | Mireia Díez
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

A speech database, named KALAKA, was created to support the Albayzin 2008 Evaluation of Language Recognition Systems, organized by the Spanish Network on Speech Technologies from May to November 2008. This evaluation, designed according to the criteria and methodology applied in the NIST Language Recognition Evaluations, involved four target languages: Basque, Catalan, Galician and Spanish (official languages in Spain), and included speech signals in other (unknown) languages to allow open-set verification trials. In this paper, the process of designing, collecting data and building the training, development and evaluation datasets of KALAKA is described. Results attained in the Albayzin 2008 LRE are presented as a means of evaluating the database. The performance of a state-of-the-art language recognition system on a closed-set evaluation task is also presented for reference. Future work includes extending KALAKA by adding Portuguese and English as target languages and renewing the set of unknown languages needed to carry out open-set evaluations.