A Language Model for Spell Checking of Educational Texts in Kurdish (Sorani)

Roshna Abdulrahman, Hossein Hassani


Abstract
Spell checkers are an integrated feature of most software applications that handle text input. When we write an email or compile a report in a desktop or smartphone editor, a spell checker can be activated to help us write more correctly. However, this assistance is not equally available for all languages. Kurdish, which is still considered a less-resourced language, currently lacks spell checkers for its various dialects. We present a trigram language model for the Sorani dialect of Kurdish, created from educational text. We also showcase a spell checker for Sorani that assists in writing texts in the Persian/Arabic script; it was developed as a testing environment for the language model. The spell checking algorithm primarily uses a probabilistic method based on our trigram language model with Stupid Backoff smoothing. The spell checker was trained on the KTC (Kurdish Textbook Corpus) dataset, so the system targets spell checking in that educational context. We test our approach by developing a text processing environment that checks for spelling errors on both a word and a context basis and suggests a list of corrections for misspelled words. The developed spell checker shows 88.54% accuracy on texts in the related context and an F1 score of 43.33%, and the correct suggestion has an 85% chance of appearing among the top three corrections.
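To make the scoring idea concrete, the following is a minimal sketch of Stupid Backoff over trigram counts, followed by ranking of correction candidates. It is not the authors' implementation: the function names (`train`, `stupid_backoff`, `rank`), the toy English tokens, and the backoff factor of 0.4 (the value commonly used with Stupid Backoff) are assumptions made for illustration, and candidate generation is left out entirely.

```python
from collections import Counter

ALPHA = 0.4  # assumed backoff factor; the abstract does not state the value used


def train(sentences):
    """Count unigrams, bigrams and trigrams over tokenized sentences."""
    counts = Counter()
    total = 0  # total number of tokens, used for the unigram fallback
    for tokens in sentences:
        total += len(tokens)
        for n in (1, 2, 3):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts, total


def stupid_backoff(word, context, counts, total, alpha=ALPHA):
    """Score `word` given up to two preceding words.

    The result is a relative score, not a normalized probability.
    """
    context = tuple(context[-2:])
    penalty = 1.0
    while context:
        ngram = context + (word,)
        if counts[ngram] > 0:
            # The context is a prefix of a seen n-gram, so its count is nonzero.
            return penalty * counts[ngram] / counts[context]
        context = context[1:]  # back off to a shorter context
        penalty *= alpha
    return penalty * counts[(word,)] / total if total else 0.0


def rank(candidates, context, counts, total, k=3):
    """Return the k best-scoring candidates for the given context."""
    scored = sorted(
        candidates,
        key=lambda w: stupid_backoff(w, context, counts, total),
        reverse=True,
    )
    return scored[:k]


# Toy corpus standing in for tokenized textbook sentences (hypothetical data).
corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]
counts, total = train(corpus)
top = rank(["dog", "sat", "zzz"], ["the", "cat"], counts, total)
```

In this sketch a candidate seen in the full trigram context outranks one that is only reachable after backing off, which is the property a context-sensitive checker relies on when ordering its suggestions.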
Anthology ID:
2022.sigul-1.25
Volume:
Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Maite Melero, Sakriani Sakti, Claudia Soria
Venue:
SIGUL
SIG:
SIGUL
Publisher:
European Language Resources Association
Pages:
189–198
URL:
https://aclanthology.org/2022.sigul-1.25
Cite (ACL):
Roshna Abdulrahman and Hossein Hassani. 2022. A Language Model for Spell Checking of Educational Texts in Kurdish (Sorani). In Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages, pages 189–198, Marseille, France. European Language Resources Association.
Cite (Informal):
A Language Model for Spell Checking of Educational Texts in Kurdish (Sorani) (Abdulrahman & Hassani, SIGUL 2022)
PDF:
https://aclanthology.org/2022.sigul-1.25.pdf
Code:
kurdishblark/ktc-language-model