Benchmarking Hindi Term Extraction in Education: A Dataset and Analysis

Shubhanker Banerjee; Bharathi Raja Chakravarthi; John Philip McCrae

Abstract

This paper introduces the HTEC Hindi Term Extraction Dataset 2.0, a resource designed to support terminology extraction and classification tasks within the education domain. HTEC 2.0 has been developed with the objective of providing a high-quality benchmark dataset for the evaluation of term recognition and classification methodologies in Hindi educational discourse. The dataset consists of 97 documents sourced from Hindi Wikipedia, covering a diverse range of topics relevant to the education sector. Within these documents, 1,702 terms have been manually annotated, where each term is defined as a single-word or multi-word expression that conveys a domain-specific meaning. The annotated terms in HTEC 2.0 are systematically categorized into seven distinct classes. Furthermore, this paper outlines the development of annotation guidelines, detailing the criteria used to determine term boundaries and category assignments. By offering a structured dataset with clearly defined term classifications, HTEC 2.0 serves as a valuable resource for researchers working on terminology extraction, domain-specific named entity recognition, and text classification in Hindi.

Anthology ID: 2025.ldk-1.3
Volume: Proceedings of the 5th Conference on Language, Data and Knowledge
Month: September
Year: 2025
Address: Naples, Italy
Editors: Mehwish Alam, Andon Tchechmedjiev, Jorge Gracia, Dagmar Gromann, Maria Pia di Buono, Johanna Monti, Maxim Ionov
Venue: LDK
Publisher: Unior Press
Pages: 19–30
URL: https://aclanthology.org/2025.ldk-1.3/
Cite (ACL): Shubhanker Banerjee, Bharathi Raja Chakravarthi, and John P. McCrae. 2025. Benchmarking Hindi Term Extraction in Education: A Dataset and Analysis. In Proceedings of the 5th Conference on Language, Data and Knowledge, pages 19–30, Naples, Italy. Unior Press.
Cite (Informal): Benchmarking Hindi Term Extraction in Education: A Dataset and Analysis (Banerjee et al., LDK 2025)
PDF: https://aclanthology.org/2025.ldk-1.3.pdf