An Annotated Dataset and Automatic Approaches for Discourse Mode Identification in Low-resource Bengali Language

Salim Sazzed


Abstract
Modes of discourse help in understanding the conventions and purposes of the various forms of language used in communication. In this study, we introduce a discourse-mode-annotated corpus for Bangla (also referred to as Bengali), a low-resource language. The corpus provides sentence-level annotations of three discourse modes (narrative, descriptive, and informative) for text excerpted from a number of Bangla novels. We analyze the annotated corpus to expose linguistic aspects of the discourse modes, such as class distributions and average sentence lengths. To determine the mode of discourse automatically, we apply classical machine learning (CML) classifiers with n-gram-based statistical features and a fine-tuned language model based on BERT (Bidirectional Encoder Representations from Transformers). We observe that the fine-tuned BERT-based approach yields more promising results than the n-gram-based CML classifiers. Our discourse-mode-annotated dataset, the first of its kind in Bangla, and the accompanying evaluation provide baselines for automatic discourse mode identification in Bangla and can assist various downstream natural language processing tasks.
Anthology ID:
2022.mia-1.2
Volume:
Proceedings of the Workshop on Multilingual Information Access (MIA)
Month:
July
Year:
2022
Address:
Seattle, USA
Editors:
Akari Asai, Eunsol Choi, Jonathan H. Clark, Junjie Hu, Chia-Hsuan Lee, Jungo Kasai, Shayne Longpre, Ikuya Yamada, Rui Zhang
Venue:
MIA
Publisher:
Association for Computational Linguistics
Pages:
9–15
URL:
https://aclanthology.org/2022.mia-1.2
DOI:
10.18653/v1/2022.mia-1.2
Cite (ACL):
Salim Sazzed. 2022. An Annotated Dataset and Automatic Approaches for Discourse Mode Identification in Low-resource Bengali Language. In Proceedings of the Workshop on Multilingual Information Access (MIA), pages 9–15, Seattle, USA. Association for Computational Linguistics.
Cite (Informal):
An Annotated Dataset and Automatic Approaches for Discourse Mode Identification in Low-resource Bengali Language (Sazzed, MIA 2022)
PDF:
https://aclanthology.org/2022.mia-1.2.pdf
Video:
https://aclanthology.org/2022.mia-1.2.mp4