Creation of comparable corpora for English-Urdu, Arabic, Persian

Murad Abouammoh; Kashif Shah; Ahmet Aker

Abstract

Statistical Machine Translation (SMT) relies on the availability of rich parallel corpora. However, for under-resourced languages or some specific domains, parallel corpora are not readily available, which leads to under-performing machine translation systems in those sparse-data settings. To overcome the low availability of parallel resources, the machine translation community has recognized the potential of using comparable resources as training data. However, most efforts have focused on European languages, and fewer on Middle Eastern languages. In this study, we report comparable corpora created from news articles for the language pairs English-{Arabic, Persian, Urdu}. The data was collected over a period of a year and covers the Arabic, Persian and Urdu languages. Furthermore, using English as a pivot language, comparable corpora that involve more than one language can be created, e.g. English-Arabic-Persian, English-Arabic-Urdu, English-Urdu-Persian, etc. The data can be provided upon request for research purposes.
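
The pivot idea in the abstract can be illustrated with a minimal sketch. It assumes, hypothetically, that each bilingual corpus is available as a list of (English document ID, foreign document ID) pairs; the abstract does not describe the data format or the authors' actual procedure, so the function name `pivot_align` and the document IDs below are invented for illustration only.

```python
from collections import defaultdict


def pivot_align(en_ar, en_fa):
    """Join two English-keyed pair lists on the shared English document ID.

    en_ar: list of (english_id, arabic_id) pairs (hypothetical format)
    en_fa: list of (english_id, persian_id) pairs (hypothetical format)
    Returns a list of (english_id, arabic_id, persian_id) triples.
    """
    # Index the Persian documents by the English document they are paired with.
    fa_by_en = defaultdict(list)
    for en_id, fa_id in en_fa:
        fa_by_en[en_id].append(fa_id)

    # Every Arabic document sharing an English pivot yields one triple
    # per Persian document paired with that same English document.
    triples = []
    for en_id, ar_id in en_ar:
        for fa_id in fa_by_en.get(en_id, []):
            triples.append((en_id, ar_id, fa_id))
    return triples


# Made-up document IDs, not taken from the corpus.
en_ar = [("en-001", "ar-017"), ("en-002", "ar-020")]
en_fa = [("en-001", "fa-105"), ("en-003", "fa-230")]
triples = pivot_align(en_ar, en_fa)
```

Only English documents that appear in both pair lists produce a triple, so the resulting trilingual corpus is at most as large as the smaller overlap between the two bilingual corpora.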

Anthology ID: L16-1663
Volume: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month: May
Year: 2016
Address: Portorož, Slovenia
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 4193–4196
URL: https://aclanthology.org/L16-1663/
Cite (ACL): Murad Abouammoh, Kashif Shah, and Ahmet Aker. 2016. Creation of comparable corpora for English-Urdu, Arabic, Persian. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 4193–4196, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal): Creation of comparable corpora for English-Urdu, Arabic, Persian (Abouammoh et al., LREC 2016)
PDF: https://aclanthology.org/L16-1663.pdf
