Enlarging Scarce In-domain English-Croatian Corpus for SMT of MOOCs Using Serbian

Maja Popović; Kostadin Cholakov; Valia Kordoni; Nikola Ljubešić

Abstract

Massive Open Online Courses have been growing rapidly in size and impact. Yet the language barrier constitutes a major impediment to reaching all people and educating all citizens. The vast majority of educational material is available only in English, and state-of-the-art machine translation systems have not yet been tailored to this peculiar genre. In addition, merely collecting appropriate in-domain training material is a challenging task. In this work, we investigate statistical machine translation of lecture subtitles from English into Croatian, which is morphologically rich and generally weakly supported, especially for the educational domain. We show that results comparable with publicly available systems trained on much larger data can be achieved if a small in-domain training set is used in combination with an additional in-domain corpus originating from the closely related Serbian language.

Anthology ID: W16-4813
Volume: Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)
Month: December
Year: 2016
Address: Osaka, Japan
Editors: Preslav Nakov, Marcos Zampieri, Liling Tan, Nikola Ljubešić, Jörg Tiedemann, Shervin Malmasi
Venue: VarDial
Publisher: The COLING 2016 Organizing Committee
Pages: 97–105
URL: https://aclanthology.org/W16-4813/
Cite (ACL): Maja Popović, Kostadin Cholakov, Valia Kordoni, and Nikola Ljubešić. 2016. Enlarging Scarce In-domain English-Croatian Corpus for SMT of MOOCs Using Serbian. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pages 97–105, Osaka, Japan. The COLING 2016 Organizing Committee.
Cite (Informal): Enlarging Scarce In-domain English-Croatian Corpus for SMT of MOOCs Using Serbian (Popović et al., VarDial 2016)
PDF: https://aclanthology.org/W16-4813.pdf
