SiPOS: A Benchmark Dataset for Sindhi Part-of-Speech Tagging

Wazir Ali, Zenglin Xu, Jay Kumar


Abstract
In this paper, we introduce SiPOS, a dataset for part-of-speech tagging in the low-resource Sindhi language, together with quality baselines. The dataset consists of more than 293K tokens annotated with sixteen universal part-of-speech categories. Two experienced native annotators annotated SiPOS using the Doccano text annotation tool, with an inter-annotator agreement of 0.872. To evaluate the proposed dataset, we use a conditional random field, the popular bidirectional long short-term memory neural model, and a self-attention mechanism under various settings. Besides pre-trained GloVe and fastText representations, character-level representations are incorporated by extracting character-level information with a bidirectional long short-term memory encoder. A high accuracy of 96.25% is achieved with the task-specific joint word-level and character-level representations. The SiPOS dataset is likely to be a significant resource for the low-resource Sindhi language.
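The abstract reports two headline numbers: an inter-annotator agreement of 0.872 and a tagging accuracy of 96.25%. The minimal sketch below shows how figures of this kind are commonly computed. It assumes the agreement statistic is Cohen's kappa, which the abstract does not state, and the function names and toy tag sequences are hypothetical illustrations, not data or code from SiPOS.

```python
from collections import Counter


def cohen_kappa(tags_a, tags_b):
    """Cohen's kappa for two annotators who tagged the same tokens.

    Assumption: the abstract does not name its agreement statistic;
    kappa is used here only as a common choice for two annotators.
    """
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("tag sequences must be non-empty and equal in length")
    n = len(tags_a)
    # Observed agreement: fraction of tokens given the same tag by both.
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    # Chance agreement: product of each annotator's marginal tag frequencies.
    freq_a = Counter(tags_a)
    freq_b = Counter(tags_b)
    expected = sum(freq_a[t] * freq_b[t] for t in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical tag throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def token_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("tag sequences must be non-empty and equal in length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Hypothetical toy example using universal POS tag names.
annotator_a = ["NOUN", "VERB", "ADJ", "NOUN", "ADP", "NOUN"]
annotator_b = ["NOUN", "VERB", "NOUN", "NOUN", "ADP", "NOUN"]

kappa = cohen_kappa(annotator_a, annotator_b)
accuracy = token_accuracy(annotator_a, annotator_b)
print(f"kappa={kappa:.3f} accuracy={accuracy:.3f}")
```

Kappa discounts the agreement expected by chance, so it is lower than raw token-level agreement whenever a few tags dominate the data, as nouns do in the toy sequences above.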
Anthology ID:
2021.ranlp-srw.4
Volume:
Proceedings of the Student Research Workshop Associated with RANLP 2021
Month:
September
Year:
2021
Address:
Online
Editors:
Souhila Djabri, Dinara Gimadi, Tsvetomila Mihaylova, Ivelina Nikolova-Koleva
Venue:
RANLP
Publisher:
INCOMA Ltd.
Pages:
22–30
URL:
https://aclanthology.org/2021.ranlp-srw.4
Cite (ACL):
Wazir Ali, Zenglin Xu, and Jay Kumar. 2021. SiPOS: A Benchmark Dataset for Sindhi Part-of-Speech Tagging. In Proceedings of the Student Research Workshop Associated with RANLP 2021, pages 22–30, Online. INCOMA Ltd.
Cite (Informal):
SiPOS: A Benchmark Dataset for Sindhi Part-of-Speech Tagging (Ali et al., RANLP 2021)
PDF:
https://aclanthology.org/2021.ranlp-srw.4.pdf