Uzbek-English and Turkish-English Morpheme Alignment Corpora

Xuansong Li, Jennifer Tracey, Stephen Grimes, Stephanie Strassel


Abstract
Morphologically rich languages pose problems for machine translation (MT) systems, including word-alignment errors, data sparsity and multiple affixes. Current word-level alignment models do not distinguish between words and morphemes, which yields low-quality alignments and in turn degrades end translation quality. Models using morpheme-level alignment can reduce the vocabulary size of morphologically rich languages and overcome data sparsity. Alignment data based on the smallest units reveals subtle language features and enhances translation quality. Recent research shows such morpheme-level alignment (MA) data to be a valuable linguistic resource for statistical MT (SMT), particularly for languages with rich morphology. In support of this research trend, the Linguistic Data Consortium (LDC) created Uzbek-English and Turkish-English alignment data that are manually aligned at the morpheme level. This paper describes the creation of the MA corpora, including the alignment and tagging processes and approaches, highlighting annotation challenges and specific features of languages with rich morphology. The light tagging annotation on the alignment layer adds extra value to the MA data, letting users flexibly tailor the data for training various MT models.
Anthology ID:
L16-1467
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
2925–2930
URL:
https://aclanthology.org/L16-1467
Cite (ACL):
Xuansong Li, Jennifer Tracey, Stephen Grimes, and Stephanie Strassel. 2016. Uzbek-English and Turkish-English Morpheme Alignment Corpora. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 2925–2930, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
Uzbek-English and Turkish-English Morpheme Alignment Corpora (Li et al., LREC 2016)
PDF:
https://aclanthology.org/L16-1467.pdf