A Language Model for Spell Checking of Educational Texts in Kurdish (Sorani)
Roshna Abdulrahman | Hossein Hassani
Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages
Spell checkers are an integrated feature of most software applications that handle text input. When we write an email or compile a report in a desktop or smartphone editor, a spell checker can be activated to help us write more correctly. However, this assistance is not equally available for all languages. Kurdish, which is still considered a less-resourced language, currently lacks spell checkers for its various dialects. We present a trigram language model for the Sorani dialect of Kurdish that is built from educational text. We also showcase a spell checker for Sorani that assists in writing texts in the Persian/Arabic script. The spell checker was developed as a testing environment for the language model. For the spell checking algorithm, we primarily use a probabilistic method based on our trigram language model with Stupid Backoff smoothing. The spell checker was trained on the KTC (Kurdish Textbook Corpus) dataset, so the system is aimed at spell checking in that educational context. We test our approach in a text processing environment that checks for spelling errors at both the word and context level and suggests a list of corrections for misspelled words. The spell checker achieves 88.54% accuracy on texts in that context, with an F1 score of 43.33%, and the correct suggestion has an 85% chance of appearing among the top three corrections.
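To make the scoring idea in the abstract above concrete, the following is a minimal sketch of a trigram model with Stupid Backoff, not the authors' implementation. The function names (`train`, `score`, `rank`), the sentence padding symbols, the toy data, and the backoff factor of 0.4 (the value commonly used for Stupid Backoff) are all assumptions; the abstract does not state how the model was implemented or which factor was used.

```python
from collections import Counter

ALPHA = 0.4  # assumed backoff factor; the abstract does not give the value used


def train(sentences):
    """Count 1- to 3-grams and their histories over tokenized sentences."""
    ngrams = Counter()    # counts of n-grams, n = 1..3
    contexts = Counter()  # counts of the (n-1)-word histories; () holds the total
    for tokens in sentences:
        padded = ["<s>", "<s>"] + list(tokens) + ["</s>"]
        for i in range(2, len(padded)):
            for n in (1, 2, 3):
                gram = tuple(padded[i - n + 1 : i + 1])
                ngrams[gram] += 1
                contexts[gram[:-1]] += 1
    return ngrams, contexts


def score(word, history, ngrams, contexts, alpha=ALPHA):
    """Stupid Backoff score of `word` given up to two preceding words.

    The result is a relative-frequency score, not a normalized probability.
    """
    history = tuple(history)[-2:]
    factor = 1.0
    while True:
        gram = history + (word,)
        if ngrams[gram] > 0:
            return factor * ngrams[gram] / contexts[history]
        if not history:
            return 0.0  # word never seen in training
        history = history[1:]  # drop the oldest word and back off
        factor *= alpha


def rank(candidates, history, ngrams, contexts):
    """Order hypothetical correction candidates by their score in context."""
    return sorted(
        candidates,
        key=lambda w: score(w, history, ngrams, contexts),
        reverse=True,
    )


# Toy data for illustration only (placeholder tokens, not from KTC).
sentences = [["a", "b", "c"], ["a", "b", "d"], ["a", "c"]]
ngrams, contexts = train(sentences)
```

In a spell checker of the kind described, `rank` would be applied to candidate corrections generated for a misspelled word, using the two preceding words as context. How candidates are generated is outside this sketch.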
Developing a Fine-grained Corpus for a Less-resourced Language: the case of Kurdish
Roshna Abdulrahman | Hossein Hassani | Sina Ahmadi
Proceedings of the 2019 Workshop on Widening NLP
Kurdish is a less-resourced language consisting of different dialects written in various scripts. Approximately 30 million people in different countries speak the language. The lack of corpora is one of the main obstacles in Kurdish language processing. In this paper, we present KTC, the Kurdish Textbooks Corpus, which is composed of 31 K-12 textbooks in the Sorani dialect. The corpus is normalized and categorized into 12 educational subjects, and it contains 693,800 tokens (110,297 types). Our resource is publicly available for non-commercial use under the CC BY-NC-SA 4.0 license.
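The distinction between tokens (running words) and types (distinct word forms) in the corpus statistics above can be illustrated with a small sketch. The function name `token_type_counts` and the whitespace tokenization are assumptions made for illustration; the abstract does not describe how the corpus was tokenized.

```python
def token_type_counts(text):
    """Return (number of tokens, number of types) for a text.

    Splitting on whitespace is an assumption; a real pipeline for Sorani
    would likely need script-aware normalization and tokenization.
    """
    tokens = text.split()
    return len(tokens), len(set(tokens))


# Placeholder text: six running words, three distinct forms.
counts = token_type_counts("a b a c b a")
```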