Carme Armentano-Oller


2022

pdf bib
ParlamentParla: A Speech Corpus of Catalan Parliamentary Sessions
Baybars Kulebi | Carme Armentano-Oller | Carlos Rodriguez-Penagos | Marta Villegas
Proceedings of the Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference

Recently, various end-to-end architectures of Automatic Speech Recognition (ASR) are being showcased as an important step towards providing language technologies to all languages instead of a select few such as English. However many languages are still suffering due to the “digital gap,” lacking thousands of hours of transcribed speech data openly accessible that is necessary to train modern ASR architectures. Although Catalan already has access to various open speech corpora, these corpora lack diversity and are limited in total volume. In order to address this lack of resources for Catalan language, in this work we present ParlamentParla, a corpus of more than 600 hours of speech from Catalan Parliament sessions. This corpus has already been used in training of state-of-the-art ASR systems, and proof-of-concept text-to-speech (TTS) models. In this work we explain in detail the pipeline that allows the information publicly available on the parliamentary website to be converted to a speech corpus compatible with training of ASR and possibly TTS models.

2021

pdf bib
Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? A Comprehensive Assessment for Catalan
Jordi Armengol-Estapé | Casimiro Pio Carrino | Carlos Rodriguez-Penagos | Ona de Gibert Bonet | Carme Armentano-Oller | Aitor Gonzalez-Agirre | Maite Melero | Marta Villegas
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2005

pdf bib
An Open-Source Shallow-Transfer Machine Translation Toolbox: Consequences of Its Release and Availability
Carme Armentano-Oller | Antonio M. Corbí-Bellot | Mikel L. Forcada | Mireia Ginestí-Rosell | Boyan Bonev | Sergio Ortiz-Rojas | Juan Antonio Pérez-Ortiz | Gema Ramírez-Sánchez | Felipe Sánchez-Martínez
Workshop on open-source machine translation

By the time Machine Translation Summit X is held in September 2005, our group will have released an open-source machine translation toolbox as part of a large government-funded project involving four universities and three linguistic technology companies from Spain. The machine translation toolbox, which will most likely be released under a GPL-like license includes (a) the open-source engine itself, a modular shallow-transfer machine translation engine suitable for related languages and largely based upon that of systems we have already developed, such as interNOSTRUM for Spanish—Catalan and Traductor Universia for Spanish—Portuguese, (b) extensive documentation (including document type declarations) specifying the XML format of all linguistic (dictionaries, rules) and document format management files, (c) compilers converting these data into the high-speed (tens of thousands of words a second) format used by the engine, and (d) pilot linguistic data for Spanish—Catalan and Spanish—Galician and format management specifications for the HTML, RTF and plain text formats. After describing very briefly this toolbox, this paper aims at exploring possible consequences of the availability of this architecture, including the community-driven development of machine translation systems for languages lacking this kind of linguistic technology.