CORILGA: a Galician Multilevel Annotated Speech Corpus for Linguistic Analysis

Carmen García-Mateo, Antonio Cardenal, Xosé Luis Regueira, Elisa Fernández Rei, Marta Martinez, Roberto Seara, Rocío Varela, Noemí Basanta


Abstract
This paper describes CORILGA (“Corpus Oral Informatizado da Lingua Galega”), a large, high-quality corpus of spoken Galician from the 1960s to the present day. It includes both formal and informal spoken language, from both standard and non-standard varieties, and across different generations and social levels. The corpus will be available to the research community upon completion. Galician is one of the EU languages that needs further research before highly effective language technology solutions can be implemented. A software repository for speech resources in Galician is also described. The repository includes a structured database, a graphical interface, and processing tools. The database makes it possible to search quickly and simply according to a number of different criteria. The web-based user interface gives users easy access to the different materials. Last but not least, a set of transcription-based modules for automatic speech recognition has been developed, thus facilitating the orthographic labelling of the recordings.
Anthology ID:
L14-1579
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
2653–2657
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/739_Paper.pdf
Cite (ACL):
Carmen García-Mateo, Antonio Cardenal, Xosé Luis Regueira, Elisa Fernández Rei, Marta Martinez, Roberto Seara, Rocío Varela, and Noemí Basanta. 2014. CORILGA: a Galician Multilevel Annotated Speech Corpus for Linguistic Analysis. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 2653–2657, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
CORILGA: a Galician Multilevel Annotated Speech Corpus for Linguistic Analysis (García-Mateo et al., LREC 2014)