A Large-Scale Study of Machine Translation in Turkic Languages
Jamshidbek Mirzakhalov | Anoop Babu | Duygu Ataman | Sherzod Kariev | Francis Tyers | Otabek Abduraufov | Mammad Hajili | Sardana Ivanova | Abror Khaytbaev | Antonio Laverghetta Jr. | Bekhzodbek Moydinboyev | Esra Onal | Shaxnoza Pulatova | Ahsan Wahab | Orhan Firat | Sriram Chellappan
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Recent advances in neural machine translation (NMT) have pushed the quality of machine translation systems to the point where they are being widely adopted to build competitive systems. However, a large number of languages have yet to reap the benefits of NMT. In this paper, we provide the first large-scale case study of the practical application of MT in the Turkic language family, in order to realize the gains of NMT for Turkic languages under scenarios ranging from high-resource to extremely low-resource. In addition to presenting an extensive analysis that identifies the bottlenecks towards building competitive systems to ameliorate data scarcity, our study makes several key contributions: i) a large parallel corpus covering 22 Turkic languages, consisting of common public datasets combined with new datasets of approximately 1.4 million parallel sentences; ii) bilingual baselines for 26 language pairs; iii) novel high-quality test sets in three different translation domains; and iv) human evaluation scores. All models, scripts, and data will be released to the public.
Building a Morphological Analyser for Laz
Esra Onal | Francis Tyers
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
This study is an attempt to contribute to the documentation and revitalization of the endangered Laz language, a member of the South Caucasian language family spoken mainly on the northeastern coastline of Turkey. It constitutes the first steps towards a general computational model for word form recognition and production for Laz, by building a rule-based morphological analyser using the Helsinki Finite-State Toolkit (HFST). The evaluation results show that the analyser has 64.9% coverage over a corpus of 111,365 tokens collected for this study. We have also performed an error analysis on 100 randomly selected tokens from the corpus that are not covered by the analyser; the results show that the errors mostly stem from Turkish words in the corpus and from stems missing from our lexicon.