A Universal Dependencies Treebank for Gujarati

Mayank Jobanputra, Maitrey Mehta, Çağrı Çöltekin


Abstract
The Universal Dependencies (UD) project has presented itself as a valuable platform to develop various resources for the languages of the world. We present and release a sample treebank for Gujarati, a widely spoken Indo-Aryan language with few linguistic resources. This treebank is the first labeled dataset for dependency parsing in the language and its script (the Gujarati script). The treebank contains 187 sentences from diverse genres, annotated with part-of-speech tags and dependency relations. We discuss various idiosyncratic examples and annotation choices, and present elaborate corpus and agreement statistics. We see this work as a valuable resource and a stepping stone for research in Gujarati Computational Linguistics.
Anthology ID:
2024.mwe-1.9
Volume:
Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Archna Bhatia, Gosse Bouma, A. Seza Doğruöz, Kilian Evang, Marcos Garcia, Voula Giouli, Lifeng Han, Joakim Nivre, Alexandre Rademaker
Venues:
MWE | UDW | WS
SIGs:
SIGLEX | SIGPARSE
Publisher:
ELRA and ICCL
Pages:
56–62
URL:
https://aclanthology.org/2024.mwe-1.9
Cite (ACL):
Mayank Jobanputra, Maitrey Mehta, and Çağrı Çöltekin. 2024. A Universal Dependencies Treebank for Gujarati. In Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024, pages 56–62, Torino, Italia. ELRA and ICCL.
Cite (Informal):
A Universal Dependencies Treebank for Gujarati (Jobanputra et al., MWE-UDW-WS 2024)
PDF:
https://aclanthology.org/2024.mwe-1.9.pdf