Petra Bago


2022

pdf bib
Challenges of Building Domain-Specific Parallel Corpora from Public Administration Documents
Filip Klubička | Lorena Kasunić | Danijel Blazsetin | Petra Bago
Proceedings of the BUCC Workshop within LREC 2022

PRINCIPLE was a Connecting Europe Facility (CEF)-funded project that focused on the identification, collection and processing of language resources (LRs) for four European under-resourced languages (Croatian, Icelandic, Irish and Norwegian) in order to improve translation quality of eTranslation, an online machine translation (MT) tool provided by the European Commission. The collected LRs were used for the development of neural MT engines in order to verify the quality of the resources. For all four languages, a total of 66 LRs were collected and made available on the ELRC-SHARE repository under various licenses. For Croatian, we have collected and published 20 LRs: 19 parallel corpora and 1 glossary. The majority of data is in the general domain (72 % of translation units), while the rest is in the eJustice (23 %), eHealth (3 %) and eProcurement (2 %) Digital Service Infrastructures (DSI) domains. The majority of the resources were for the Croatian-English language pair. The data was donated by six data contributors from the public as well as private sector. In this paper we present a subset of 13 Croatian LRs developed based on public administration documents, which are all made freely available, as well as challenges associated with the data collection, cleaning and processing.

pdf bib
Achievements of the PRINCIPLE Project: Promoting MT for Croatian, Icelandic, Irish and Norwegian
Petra Bago | Sheila Castilho | Jane Dunne | Federico Gaspari | Andre K | Gauti Kristmannsson | Jon Arild Olsen | Natalia Resende | Níels Rúnar Gíslason | Dana D. Sheridan | Páraic Sheridan | John Tinsley | Andy Way
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

This paper provides an overview of the main achievements of the completed PRINCIPLE project, a 2-year action funded by the European Commission under the Connecting Europe Facility (CEF) programme. PRINCIPLE focused on collecting high-quality language resources for Croatian, Icelandic, Irish and Norwegian, which are severely low-resource languages, especially for building effective machine translation (MT) systems. We report the achievements of the project, primarily, in terms of the large amounts of data collected for all four low-resource languages and of promoting the uptake of neural MT (NMT) for these languages.

2020

pdf bib
Progress of the PRINCIPLE Project: Promoting MT for Croatian, Icelandic, Irish and Norwegian
Andy Way | Petra Bago | Jane Dunne | Federico Gaspari | Andre Kåsen | Gauti Kristmannsson | Helen McHugh | Jon Arild Olsen | Dana Davis Sheridan | Páraic Sheridan | John Tinsley
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

This paper updates the progress made on the PRINCIPLE project, a 2-year action funded by the European Commission under the Connecting Europe Facility (CEF) programme. PRINCIPLE focuses on collecting high-quality language resources for Croatian, Icelandic, Irish and Norwegian, which have been identified as low-resource languages, especially for building effective machine translation (MT) systems. We report initial achievements of the project and ongoing activities aimed at promoting the uptake of neural MT for the low-resource languages of the project.