CWID-hi: A Dataset for Complex Word Identification in Hindi Text

Gayatri Venugopal, Dhanya Pramod, Ravi Shekhar


Abstract
Text simplification is a method for improving the accessibility of text by converting complex sentences into simple sentences. Multiple studies have created datasets for text simplification. However, most of these datasets focus only on high-resource languages. In this work, we propose a complex word dataset for Hindi, a language largely ignored in the text simplification literature. We recruited annotators with varying levels of Hindi knowledge so that the annotations capture each annotator's language knowledge. Our analysis shows a significant difference between native and non-native annotators' perception of word complexity. We also built an automatic complex word classifier using a soft voting approach based on the predictions from tree-based ensemble classifiers. These models behave differently for annotations made by different categories of users, such as native and non-native speakers. Our dataset and analysis will help simplify Hindi text depending on the user's language understanding. The dataset is available at https://zenodo.org/record/5229160.
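The soft voting step mentioned in the abstract can be illustrated with a minimal sketch. It assumes each tree-based model has already produced a probability that a word is complex; the model names, the probabilities, the unweighted average, and the 0.5 threshold are all hypothetical choices for illustration, not the authors' configuration or results.

```python
from typing import Dict, List, Sequence


def soft_vote_scores(prob_by_model: Dict[str, Sequence[float]]) -> List[float]:
    """Average each model's P(complex) per word (unweighted soft voting)."""
    columns = list(prob_by_model.values())
    if not columns:
        raise ValueError("need at least one model")
    n_words = len(columns[0])
    if any(len(col) != n_words for col in columns):
        raise ValueError("all models must score the same number of words")
    return [sum(col[i] for col in columns) / len(columns) for i in range(n_words)]


def soft_vote_labels(
    prob_by_model: Dict[str, Sequence[float]], threshold: float = 0.5
) -> List[int]:
    """Label a word complex (1) when the averaged probability reaches the threshold."""
    return [1 if score >= threshold else 0 for score in soft_vote_scores(prob_by_model)]


# Hypothetical P(complex) for three words from three tree-based models.
probs = {
    "random_forest": [0.9, 0.2, 0.6],
    "extra_trees": [0.7, 0.4, 0.3],
    "gradient_boosting": [0.8, 0.1, 0.55],
}

labels = soft_vote_labels(probs)
print(labels)  # [1, 0, 0]
```

For the third word, two of the three models individually exceed 0.5, so a majority (hard) vote would label it complex, while the averaged probability stays below the threshold. This is the practical difference between soft and hard voting: soft voting takes each model's confidence into account.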
Anthology ID:
2022.lrec-1.604
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
5627–5636
URL:
https://aclanthology.org/2022.lrec-1.604
Cite (ACL):
Gayatri Venugopal, Dhanya Pramod, and Ravi Shekhar. 2022. CWID-hi: A Dataset for Complex Word Identification in Hindi Text. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 5627–5636, Marseille, France. European Language Resources Association.
Cite (Informal):
CWID-hi: A Dataset for Complex Word Identification in Hindi Text (Venugopal et al., LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.604.pdf