Enhanced Urdu Word Segmentation using Conditional Random Fields and Morphological Context Features

Aamir Farhan, Mashrukh Islam, Dipti Misra Sharma


Abstract
Word segmentation is a fundamental task for most NLP applications. Urdu is written in the Nastalique style, which has no concept of a space between words. Furthermore, certain Urdu characters are inherently non-joining, which introduces spaces within a word in digital text. Thus, Urdu has both space omission and space insertion issues, which make word segmentation challenging. In this paper, we improve upon the results of Zia, Raza and Athar (2018) by using a manually annotated corpus of 19,651 sentences along with morphological context features. Using a Conditional Random Field sequence model, we achieve an F1 score of 0.98 for word boundary identification and 0.92 for sub-word boundary identification. These results outperform the state-of-the-art methods.
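The abstract reports F1 scores for boundary identification but does not spell out the scoring protocol. The following is a minimal sketch of one common way to score word boundaries, assuming the gold and predicted segmentations cover the same character sequence and that a boundary is counted as correct when its character offset matches. The function names `boundary_positions` and `boundary_f1` are hypothetical and are not taken from the paper.

```python
def boundary_positions(words):
    """Return the set of character offsets at which a word ends."""
    positions = set()
    offset = 0
    for word in words:
        offset += len(word)
        positions.add(offset)
    return positions


def boundary_f1(gold_words, predicted_words):
    """Precision, recall and F1 over word-end offsets of one sentence.

    Assumption (not from the paper): both segmentations are lists of
    strings that concatenate to the same underlying character sequence.
    """
    gold = boundary_positions(gold_words)
    pred = boundary_positions(predicted_words)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Placeholder Latin strings stand in for Urdu text.
gold = ["ab", "cd", "e"]       # three gold words
predicted = ["abcd", "e"]      # the segmenter missed one boundary
precision, recall, f1 = boundary_f1(gold, predicted)
```

Under the same assumptions, sub-word boundaries could be scored by passing sub-word units instead of words.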
Anthology ID:
2020.winlp-1.41
Volume:
Proceedings of the Fourth Widening Natural Language Processing Workshop
Month:
July
Year:
2020
Address:
Seattle, USA
Editors:
Rossana Cunha, Samira Shaikh, Erika Varis, Ryan Georgi, Alicia Tsai, Antonios Anastasopoulos, Khyathi Raghavi Chandu
Venue:
WiNLP
Publisher:
Association for Computational Linguistics
Pages:
156–159
URL:
https://aclanthology.org/2020.winlp-1.41
DOI:
10.18653/v1/2020.winlp-1.41
Cite (ACL):
Aamir Farhan, Mashrukh Islam, and Dipti Misra Sharma. 2020. Enhanced Urdu Word Segmentation using Conditional Random Fields and Morphological Context Features. In Proceedings of the Fourth Widening Natural Language Processing Workshop, pages 156–159, Seattle, USA. Association for Computational Linguistics.
Cite (Informal):
Enhanced Urdu Word Segmentation using Conditional Random Fields and Morphological Context Features (Farhan et al., WiNLP 2020)
Video:
http://slideslive.com/38929582