Toward Better Loanword Identification in Uyghur Using Cross-lingual Word Embeddings

Chenggang Mi; Yating Yang; Lei Wang; Xi Zhou; Tonghai Jiang

Abstract

To enrich the vocabulary in low-resource settings, we propose a novel method that identifies loanwords in monolingual corpora. More specifically, we first use cross-lingual word embeddings as the core feature to generate semantically related candidates based on comparable corpora and a small bilingual lexicon; then, a log-linear model combines several shallow features, such as pronunciation similarity and hybrid language model features, to predict the final results. In this paper, we use Uyghur as the recipient language and try to detect loanwords from four donor languages: Arabic, Chinese, Persian and Russian. We conduct two groups of experiments to evaluate the effectiveness of our proposed approach: loanword identification and OOV translation, in four language pairs and eight translation directions (Uyghur-Arabic, Arabic-Uyghur, Uyghur-Chinese, Chinese-Uyghur, Uyghur-Persian, Persian-Uyghur, Uyghur-Russian, and Russian-Uyghur). Experimental results on loanword identification show that our method significantly outperforms the baseline models. Neural machine translation models that integrate the loanword identification results achieve the best results on OOV translation (with 0.5-0.9 BLEU improvements).
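
The two-stage pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy vectors, the feature weights, and the use of `difflib.SequenceMatcher` as a stand-in for pronunciation similarity are all assumptions made for illustration, and the hybrid language model features are omitted. The sketch assumes that word vectors for both languages already live in one shared cross-lingual space.

```python
import math
from difflib import SequenceMatcher


def cosine(u, v):
    """Cosine similarity between two vectors in a shared embedding space."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_candidates(word_vec, donor_vecs, k=2):
    """Stage 1: rank donor-language words by embedding similarity, keep the top k."""
    ranked = sorted(donor_vecs,
                    key=lambda w: cosine(word_vec, donor_vecs[w]),
                    reverse=True)
    return ranked[:k]


def pronunciation_similarity(a, b):
    """Assumed stand-in: string similarity over romanized forms, in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def log_linear_score(features, weights):
    """Weighted feature sum; a log-linear model ranks candidates by exp() of
    this value, so the argmax over the sum is the same as over the probability."""
    return sum(weights[name] * value for name, value in features.items())


def identify(word, word_vec, donor_vecs, weights, k=2):
    """Stage 2: rescore the stage-1 candidates and return (best_candidate, score)."""
    best = None
    for cand in top_candidates(word_vec, donor_vecs, k):
        feats = {
            "embedding": cosine(word_vec, donor_vecs[cand]),
            "pronunciation": pronunciation_similarity(word, cand),
        }
        score = log_linear_score(feats, weights)
        if best is None or score > best[1]:
            best = (cand, score)
    return best
```

In practice the weights would be learned from labeled data rather than set by hand, and the candidate vocabulary would come from the donor-language side of the comparable corpora.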

Anthology ID: C18-1256
Volume: Proceedings of the 27th International Conference on Computational Linguistics
Month: August
Year: 2018
Address: Santa Fe, New Mexico, USA
Editors: Emily M. Bender, Leon Derczynski, Pierre Isabelle
Venue: COLING
Publisher: Association for Computational Linguistics
Pages: 3027–3037
URL: https://aclanthology.org/C18-1256/
Cite (ACL): Chenggang Mi, Yating Yang, Lei Wang, Xi Zhou, and Tonghai Jiang. 2018. Toward Better Loanword Identification in Uyghur Using Cross-lingual Word Embeddings. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3027–3037, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
Cite (Informal): Toward Better Loanword Identification in Uyghur Using Cross-lingual Word Embeddings (Mi et al., COLING 2018)
PDF: https://aclanthology.org/C18-1256.pdf
