Expanding Parallel Resources for Medium-Density Languages for Free

Georgi Iliev; Angel Genov

Abstract

We discuss a previously proposed method for augmenting parallel corpora of limited size for machine translation through monolingual paraphrasing of the source language. We develop a three-stage shallow paraphrasing procedure for the Swedish-Bulgarian language pair, for which limited parallel resources exist. The source language exhibits characteristics not typical of the high-density languages already studied in a similar setting. Paraphrases of a highly productive type of Swedish compound noun are generated by a corpus-based technique. Certain Swedish noun-phrase types are paraphrased using basic heuristics. Further, we introduce noun-phrase morphological variations for better word-form coverage. We evaluate the performance of a phrase-based statistical machine translation system trained on a baseline parallel corpus and on three stages of artificial enlargement of the source-language training data. Paraphrasing is shown to have no effect on performance for the Swedish-English translation task. We show a small, yet consistent, increase in the BLEU score of Swedish-Bulgarian translations of larger token spans at the first enlargement stage. A small improvement in the overall BLEU score of Swedish-Bulgarian translation is achieved at the second enlargement stage. We find that both improvements justify further research into the method for the Swedish-Bulgarian translation task.

Anthology ID: L12-1434
Volume: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month: May
Year: 2012
Address: Istanbul, Turkey
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 3937–3943
URL: http://www.lrec-conf.org/proceedings/lrec2012/pdf/743_Paper.pdf
Cite (ACL): Georgi Iliev and Angel Genov. 2012. Expanding Parallel Resources for Medium-Density Languages for Free. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 3937–3943, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal): Expanding Parallel Resources for Medium-Density Languages for Free (Iliev & Genov, LREC 2012)
