GujMORPH - A Dataset for Creating Gujarati Morphological Analyzer

Jatayu Baxi, Brijesh Bhatt


Abstract
Computational morphology deals with the processing of a language at the word level. A morphological analyzer is a key word-level linguistic tool that returns all the constituent morphemes of a given word form along with their grammatical categories. For highly inflectional, low-resource languages, creating computational morphology tools is challenging because the underlying key resources are unavailable. In this paper, we discuss the creation of GujMORPH, an annotated morphological dataset for Gujarati, an Indo-Aryan language. To create this dataset, we studied the language's grammar, word formation rules, and suffix attachments in depth. The dataset contains 16,527 unique inflected words along with their morphological segmentation and grammatical feature tags. It is the first dataset of its kind for Gujarati and can be used to develop morphological analyzer and generator models. The dataset is annotated in the standard UniMorph schema and evaluated on a baseline system. We also describe the tool used to annotate the data in the standard format. The dataset is released publicly along with a library; using this library, the data can be obtained in a format that can be used directly to train machine learning models.
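UniMorph data is conventionally distributed as tab-separated triples: lemma, inflected form, and a semicolon-separated feature bundle. The following is a minimal sketch of reading one such line, assuming GujMORPH follows that three-column layout. The released files may carry additional columns such as the segmentation, and the names `MorphEntry` and `parse_unimorph_line` are hypothetical, not part of the GujMORPH library. The sample entry is an illustrative English one, not taken from the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MorphEntry:
    """One UniMorph-style record (hypothetical container, not the GujMORPH API)."""
    lemma: str
    form: str
    features: tuple


def parse_unimorph_line(line: str) -> MorphEntry:
    """Parse a 'lemma<TAB>form<TAB>FEAT;FEAT' line into a MorphEntry."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        # Assumption: exactly three columns; adjust if the real files differ.
        raise ValueError(f"expected 3 tab-separated fields, got {len(fields)}")
    lemma, form, bundle = fields
    # Split the feature bundle on ';' and drop any empty tags.
    features = tuple(tag for tag in bundle.split(";") if tag)
    return MorphEntry(lemma, form, features)


# Illustrative English entry, not taken from GujMORPH.
entry = parse_unimorph_line("walk\twalked\tV;PST")
print(entry.lemma, entry.form, entry.features)
```

Keeping the features as a tuple of tags makes each entry hashable, which is convenient when deduplicating word forms or building label vocabularies for a model.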
Anthology ID:
2022.lrec-1.767
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
7088–7095
URL:
https://aclanthology.org/2022.lrec-1.767
Cite (ACL):
Jatayu Baxi and Brijesh Bhatt. 2022. GujMORPH - A Dataset for Creating Gujarati Morphological Analyzer. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 7088–7095, Marseille, France. European Language Resources Association.
Cite (Informal):
GujMORPH - A Dataset for Creating Gujarati Morphological Analyzer (Baxi & Bhatt, LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.767.pdf
Data
Universal Dependencies