Building Language Models for Morphological Rich Low-Resource Languages using Data from Related Donor Languages: the Case of Uyghur

Ayimunishagu Abulimiti, Tanja Schultz


Abstract
Huge amounts of data are needed to build reliable statistical language models. Automatic speech processing tasks in low-resource languages typically suffer from lower performances due to weak or unreliable language models. Furthermore, language modeling for agglutinative languages is very challenging, as the morphological richness results in higher Out Of Vocabulary (OOV) rate. In this work, we show our effort to build word-based as well as morpheme-based language models for Uyghur, a language that combines both challenges, i.e. it is a low-resource and agglutinative language. Fortunately, there exists a closely-related rich-resource language, namely Turkish. Here, we present our work on leveraging Turkish text data to improve Uyghur language models. To maximize the overlap between Uyghur and Turkish words, the Turkish data is pre-processed on the word surface level, which results in 7.76% OOV-rate reduction on the Uyghur development set. To investigate various levels of low-resource conditions, different subsets of Uyghur data are generated. Morpheme-based language models trained with bilingual data achieved up to 40.91% relative perplexity reduction over the language models trained only with Uyghur data.
Anthology ID:
2020.sltu-1.38
Volume:
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Dorothee Beermann, Laurent Besacier, Sakriani Sakti, Claudia Soria
Venue:
SLTU
SIG:
Publisher:
European Language Resources association
Note:
Pages:
271–276
Language:
English
URL:
https://aclanthology.org/2020.sltu-1.38
DOI:
Bibkey:
Cite (ACL):
Ayimunishagu Abulimiti and Tanja Schultz. 2020. Building Language Models for Morphological Rich Low-Resource Languages using Data from Related Donor Languages: the Case of Uyghur. In Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), pages 271–276, Marseille, France. European Language Resources association.
Cite (Informal):
Building Language Models for Morphological Rich Low-Resource Languages using Data from Related Donor Languages: the Case of Uyghur (Abulimiti & Schultz, SLTU 2020)
Copy Citation:
PDF:
https://aclanthology.org/2020.sltu-1.38.pdf