ParlamentParla: A Speech Corpus of Catalan Parliamentary Sessions

Baybars Kulebi, Carme Armentano-Oller, Carlos Rodriguez-Penagos, Marta Villegas


Abstract
Recently, various end-to-end architectures for Automatic Speech Recognition (ASR) have been showcased as an important step towards providing language technologies for all languages rather than a select few such as English. However, many languages still suffer from the "digital gap": they lack the thousands of hours of openly accessible transcribed speech data needed to train modern ASR architectures. Although Catalan already has various open speech corpora, these corpora lack diversity and are limited in total volume. To address this lack of resources for the Catalan language, we present ParlamentParla, a corpus of more than 600 hours of speech from Catalan Parliament sessions. The corpus has already been used to train state-of-the-art ASR systems and proof-of-concept text-to-speech (TTS) models. We explain in detail the pipeline that converts the information publicly available on the parliamentary website into a speech corpus suitable for training ASR and possibly TTS models.
Anthology ID:
2022.parlaclarin-1.18
Volume:
Proceedings of the Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Darja Fišer, Maria Eskevich, Jakob Lenardič, Franciska de Jong
Venue:
ParlaCLARIN
Publisher:
European Language Resources Association
Pages:
125–130
URL:
https://aclanthology.org/2022.parlaclarin-1.18
Cite (ACL):
Baybars Kulebi, Carme Armentano-Oller, Carlos Rodriguez-Penagos, and Marta Villegas. 2022. ParlamentParla: A Speech Corpus of Catalan Parliamentary Sessions. In Proceedings of the Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference, pages 125–130, Marseille, France. European Language Resources Association.
Cite (Informal):
ParlamentParla: A Speech Corpus of Catalan Parliamentary Sessions (Kulebi et al., ParlaCLARIN 2022)
PDF:
https://aclanthology.org/2022.parlaclarin-1.18.pdf
Data:
LibriSpeech