Spell checkers are an integrated feature of most software applications that handle text input. When we write an email or compile a report in a desktop or smartphone editor, a spell checker can be activated to help us write more correctly. However, this assistance is not equally available for all languages. Kurdish, which is still considered a less-resourced language, currently lacks spell checkers for its various dialects. We present a trigram language model for the Sorani dialect of Kurdish, built from educational text. We also showcase a spell checker for Sorani that assists in writing texts in the Persian/Arabic script. The spell checker was developed as a testing environment for the language model. The spell checking algorithm primarily uses a probabilistic method based on our trigram language model with Stupid Backoff smoothing. The spell checker was trained on the KTC (Kurdish Textbook Corpus) dataset, so the system aims to assist spell checking in that educational context. We test our approach by developing a text processing environment that checks for spelling errors at both the word and the context level and suggests a list of corrections for misspelled words. The spell checker achieves 88.54% accuracy on texts in the related context and an F1 score of 43.33%, and the correct suggestion has an 85% chance of appearing among the top three corrections.
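To make the scoring idea concrete, the following is a minimal sketch of trigram scoring with Stupid Backoff and of ranking correction candidates by that score. It is an illustration under stated assumptions, not the implementation described in the abstract: the function names, the sentence padding symbols, and the backoff factor of 0.4 (a value commonly used with Stupid Backoff) are all assumptions, and candidate generation is left out.

```python
from collections import Counter

ALPHA = 0.4  # assumed backoff factor; the abstract does not state the value used


def train_counts(sentences):
    """Count unigrams, bigrams and trigrams over tokenized sentences."""
    counts = {1: Counter(), 2: Counter(), 3: Counter()}
    for tokens in sentences:
        # Hypothetical padding symbols so that sentence-initial words have context.
        padded = ["<s>", "<s>"] + list(tokens) + ["</s>"]
        for n in (1, 2, 3):
            for i in range(len(padded) - n + 1):
                counts[n][tuple(padded[i:i + n])] += 1
    return counts


def stupid_backoff(counts, w1, w2, w3, alpha=ALPHA):
    """Score w3 given the context (w1, w2); scores are not normalized probabilities."""
    tri = counts[3][(w1, w2, w3)]
    if tri > 0:
        return tri / counts[2][(w1, w2)]
    bi = counts[2][(w2, w3)]
    if bi > 0:
        return alpha * bi / counts[1][(w2,)]
    total = sum(counts[1].values())
    return alpha * alpha * counts[1][(w3,)] / total


def rank_candidates(counts, w1, w2, candidates):
    """Order correction candidates by their score in the given two-word context."""
    return sorted(
        candidates,
        key=lambda c: stupid_backoff(counts, w1, w2, c),
        reverse=True,
    )
```

In a spell checker built along these lines, `rank_candidates` would be called with the two preceding words and a list of candidate corrections for a flagged word, and the top few results would be offered as suggestions.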
Kurdish poetry and prose narratives were historically transmitted orally and less often in written form. As an essential medium of oral narration and literature, Kurdish lyrics are uniquely suited to serve as a vital resource for different types of studies, including Digital Humanities, Computational Folkloristics and Computational Linguistics. As an initial study of its kind for the Kurdish language, this paper presents our efforts in transcribing and collecting Kurdish folk lyrics as a corpus that covers various Kurdish musical genres, in particular Beyt, Gorani, Bend, and Heyran. We believe that this corpus contributes to Kurdish language processing in several ways: it compensates for the lack of a long history of written text by incorporating oral literature, it opens up an unexplored realm in Kurdish language processing, and it helps to initiate Kurdish computational folkloristics. Our corpus contains 49,582 tokens in the Sorani dialect of Kurdish. The corpus is publicly available in the Text Encoding Initiative (TEI) format for non-commercial use.
Resources and technologies for Sign language processing are emerging for well-resourced languages, while low-resource languages are falling behind. Kurdish is a multi-dialect language, and it is considered a low-resource language. It is spoken by approximately 30 million people in several countries, which suggests that it also has a large community of people with hearing impairments. This paper reports on a project that aims to develop the data and tools needed to process the Sign language for Sorani, one of the spoken Kurdish dialects. We present the results of developing a dataset in HamNoSys, with its corresponding SiGML form, for the Kurdish Sign lexicon. We use this dataset to implement a sign-supported Kurdish tool to check the accuracy of the Sign lexicon. We tested the tool by presenting it to hearing-impaired individuals. The experiment showed that 100% of the translated letters were understandable by a hearing-impaired person. The percentages were 65% for isolated words and approximately 30% for words in sentences. The data is publicly available at https://github.com/KurdishBLARK/KurdishSignLanguage for non-commercial use under the CC BY-NC-SA 4.0 licence.
Kurdish is a less-resourced language consisting of different dialects written in various scripts. Approximately 30 million people in different countries speak the language. The lack of corpora is one of the main obstacles in Kurdish language processing. In this paper, we present KTC, the Kurdish Textbooks Corpus, which is composed of 31 K-12 textbooks in the Sorani dialect. The corpus is normalized and categorized into 12 educational subjects, and it contains 693,800 tokens (110,297 types). Our resource is publicly available for non-commercial use under the CC BY-NC-SA 4.0 license.
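The distinction between tokens and types in the figures above can be illustrated with a short sketch. It assumes plain whitespace tokenization of already normalized text, which may differ from how the corpus statistics were actually computed; the function name is hypothetical.

```python
def token_type_counts(text):
    """Return (number of tokens, number of distinct types) for whitespace-split text."""
    tokens = text.split()  # assumption: tokens are separated by whitespace
    return len(tokens), len(set(tokens))


def type_token_ratio(num_types, num_tokens):
    """Ratio of distinct word forms to running words."""
    return num_types / num_tokens
```

With the reported figures, `type_token_ratio(110297, 693800)` gives a ratio of about 0.16, meaning roughly one new word form for every six running words.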
This research proposes a method for machine translation between two Kurdish dialects. We chose the two most widely spoken dialects, Kurmanji and Sorani, which are considered to be mutually unintelligible. Despite being spoken by about 30 million people in different countries, Kurdish is among the less-resourced languages. The research used bi-dialectal dictionaries and showed that the lack of parallel corpora is not a major obstacle to machine translation between the two dialects. The experiments showed that the machine-translated texts are comprehensible to those who do not speak the source dialect. The research is the first attempt at inter-dialect machine translation in Kurdish, and in particular it could help make online texts in one dialect comprehensible to those who only speak the target dialect. The translated texts were rated as understandable in 71% of cases for Kurmanji and 79% for Sorani, and as slightly understandable in the remaining 29% of cases for Kurmanji and 21% for Sorani.
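One simple way a bi-dialectal dictionary could be applied without a parallel corpus is word-for-word lookup, sketched below. The abstract does not describe the actual translation procedure, so this is only an illustration: the function name, the pass-through handling of unknown words, and the placeholder dictionary entries in the usage note are all assumptions.

```python
def translate_tokens(tokens, dictionary):
    """Replace each source-dialect token with its dictionary entry.

    Tokens missing from the dictionary are passed through unchanged
    (an assumed fallback, not taken from the source).
    """
    return [dictionary.get(token, token) for token in tokens]


def coverage(tokens, dictionary):
    """Fraction of tokens that have a dictionary entry; 0.0 for empty input."""
    if not tokens:
        return 0.0
    return sum(token in dictionary for token in tokens) / len(tokens)
```

For example, with a placeholder dictionary `{"src1": "tgt1", "src2": "tgt2"}`, calling `translate_tokens(["src1", "unknown", "src2"], ...)` would leave `"unknown"` in place, and `coverage` would report how much of the input the dictionary handled.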