A Dataset for Term Extraction in Hindi

Shubhanker Banerjee, Bharathi Raja Chakravarthi, John Philip McCrae


Abstract
Automatic Term Extraction (ATE) is one of the core problems in natural language processing and forms a key component of text mining pipelines for domain-specific corpora. Complex low-level tasks such as machine translation and summarization for domain-specific texts necessitate the use of term extraction systems. However, developing these systems requires large annotated datasets, and so little progress has been made on this front for under-resourced languages. As part of ongoing research, we present a dataset for term extraction from Hindi texts. To the best of our knowledge, this is the first dataset that provides term-annotated documents for Hindi. Furthermore, we have evaluated statistical term extraction methods on this dataset, and the results indicate the problems associated with developing term extractors for under-resourced languages.
Anthology ID:
2022.term-1.4
Volume:
Proceedings of the Workshop on Terminology in the 21st century: many faces, many places
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Rute Costa, Sara Carvalho, Ana Ostroški Anić, Anas Fahad Khan
Venue:
TERM
Publisher:
European Language Resources Association
Pages:
19–25
URL:
https://aclanthology.org/2022.term-1.4
Cite (ACL):
Shubhanker Banerjee, Bharathi Raja Chakravarthi, and John Philip McCrae. 2022. A Dataset for Term Extraction in Hindi. In Proceedings of the Workshop on Terminology in the 21st century: many faces, many places, pages 19–25, Marseille, France. European Language Resources Association.
Cite (Informal):
A Dataset for Term Extraction in Hindi (Banerjee et al., TERM 2022)
PDF:
https://aclanthology.org/2022.term-1.4.pdf