EMPAC: an English–Spanish Corpus of Institutional Subtitles

Iris Serrat Roozen, José Manuel Martínez Martínez


Abstract
The EuroparlTV Multimedia Parallel Corpus (EMPAC) is a collection of subtitles in English and Spanish for videos from the European Parliament’s Multimedia Centre. The corpus has been compiled with the EMPAC toolkit. Its aim is, on the one hand, to provide a resource for studying institutional subtitling and, on the other, to facilitate the analysis of the web accessibility of institutional multimedia content. The corpus covers the period from 2009 to 2017 and is made up of 4,000 texts amounting to two and a half million tokens per language, corresponding to approximately 280 hours of video. This paper provides 1) a review of related corpora; 2) a review of typical compilation methodologies for subtitle corpora; 3) a detailed account of the corpus compilation methodology followed; and 4) a description of the corpus. The conclusion summarises the key findings on the formal aspects of the subtitles that condition the accessibility of EuroparlTV's multimedia content.
Anthology ID:
2020.lrec-1.498
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
4044–4053
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.498
Cite (ACL):
Iris Serrat Roozen and José Manuel Martínez Martínez. 2020. EMPAC: an English–Spanish Corpus of Institutional Subtitles. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4044–4053, Marseille, France. European Language Resources Association.
Cite (Informal):
EMPAC: an English–Spanish Corpus of Institutional Subtitles (Serrat Roozen & Martínez Martínez, LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.498.pdf