NERvous About My Health: Constructing a Bengali Medical Named Entity Recognition Dataset

Alvi Khan, Fida Kamal, Nuzhat Nower, Tasnim Ahmed, Sabbir Ahmed, Tareque Chowdhury


Abstract
Named Entity Recognition (NER), the task of identifying important entities in a text, is useful in a wide variety of downstream tasks in the biomedical domain. The task is considerably harder for Consumer Health Questions (CHQs), which are written in the informal, day-to-day language of patients. These difficulties are amplified in Bengali, which allows great flexibility in sentence structure and varies significantly across regional dialects. Unfortunately, the limited amount of available data does not reflect this complexity, which makes it difficult to build a reliable decision-making system. To address this scarcity, this paper presents ‘Bangla-HealthNER’, a comprehensive dataset for identifying named entities in health-related Bengali text. It consists of 31,783 samples sourced from a popular online public health platform, which allows it to capture the diverse range of linguistic styles and dialects used by native speakers from various regions in their day-to-day lives. Insight into this linguistic diversity should prove useful to medical decision-making systems developed for real-world use. To highlight the difficulty of the dataset, it has been benchmarked on state-of-the-art token classification models, where BanglishBERT achieved the highest performance with an F1-score of 56.13 ± 0.75%. The dataset and all relevant code used in this work have been made publicly available.
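For readers unfamiliar with how an F1-score is typically computed for token classification, the following is a minimal sketch of entity-level F1 over BIO-tagged sequences. It is an illustration under stated assumptions, not the authors' evaluation code: the abstract does not say whether the reported score is entity-level or token-level, and the tag names used here (`Symptom`, `Medicine`) and the function names (`extract_spans`, `entity_f1`) are hypothetical, not taken from the dataset.

```python
from typing import List, Set, Tuple

# (entity type, start index, end index exclusive)
Span = Tuple[str, int, int]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sequence.

    A stray "I-X" with no preceding "B-X" is leniently treated as
    the start of a new span (an assumption of this sketch).
    """
    spans: Set[Span] = set()
    start, label = None, None
    # The trailing "O" is a sentinel that closes any span still open.
    for i, tag in enumerate(tags + ["O"]):
        inside = tag.startswith("I-") and label == tag[2:]
        if not inside:
            if label is not None:
                spans.add((label, start, i))
            if tag.startswith("B-") or tag.startswith("I-"):
                start, label = i, tag[2:]
            else:
                start, label = None, None
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 where a prediction counts only on exact span and type match."""
    tp = fp = fn = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = extract_spans(g_tags), extract_spans(p_tags)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels; the dataset's real tag set is not given in the abstract.
gold = [["B-Symptom", "I-Symptom", "O", "B-Medicine"]]
pred = [["B-Symptom", "I-Symptom", "O", "O"]]
print(round(entity_f1(gold, pred), 4))
```

In this toy case one of the two gold entities is recovered with no false positives, so precision is 1.0 and recall is 0.5. A figure such as "56.13 ± 0.75%" would usually be a mean and spread over several training runs, though the abstract does not state how the spread was obtained.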
Anthology ID:
2023.findings-emnlp.383
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2023
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
5768–5774
URL:
https://aclanthology.org/2023.findings-emnlp.383
DOI:
10.18653/v1/2023.findings-emnlp.383
Cite (ACL):
Alvi Khan, Fida Kamal, Nuzhat Nower, Tasnim Ahmed, Sabbir Ahmed, and Tareque Chowdhury. 2023. NERvous About My Health: Constructing a Bengali Medical Named Entity Recognition Dataset. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5768–5774, Singapore. Association for Computational Linguistics.
Cite (Informal):
NERvous About My Health: Constructing a Bengali Medical Named Entity Recognition Dataset (Khan et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-emnlp.383.pdf