Swapnil Dhanwal


2020

An Annotated Dataset of Discourse Modes in Hindi Stories
Swapnil Dhanwal | Hritwik Dutta | Hitesh Nankani | Nilay Shrivastava | Yaman Kumar | Junyi Jessy Li | Debanjan Mahata | Rakesh Gosangi | Haimin Zhang | Rajiv Ratn Shah | Amanda Stent
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper, we present a new corpus consisting of sentences from Hindi short stories annotated for five different discourse modes: argumentative, narrative, descriptive, dialogic, and informative. We present a detailed account of the entire data collection and annotation process. The annotations have very high inter-annotator agreement (Krippendorff's alpha of 0.87). We analyze the data in terms of label distributions, part-of-speech tags, and sentence lengths. We characterize the performance of various classification algorithms on this dataset and perform ablation studies to understand which linguistic models are suitable for capturing the nuances of the embedded discourse structures in the presented corpus.