Exploring Word Segmentation and Medical Concept Recognition for Chinese Medical Texts

Yang Liu, Yuanhe Tian, Tsung-Hui Chang, Song Wu, Xiang Wan, Yan Song


Abstract
Chinese word segmentation (CWS) and medical concept recognition are two fundamental tasks for processing Chinese electronic medical records (EMRs), and both play important roles in downstream tasks for understanding Chinese EMRs. One challenge for these tasks is the lack of medical-domain datasets with high-quality annotations, especially medical-related tags that reveal the characteristics of Chinese EMRs. In this paper, we collect a Chinese EMR corpus, named ACEMR, with human annotations for Chinese word segmentation and EMR-related tags. On the ACEMR corpus, we run well-known models (i.e., BiLSTM, BERT, and ZEN) and existing state-of-the-art systems (e.g., WMSeg and TwASP) for CWS and medical concept recognition. Experimental results demonstrate the necessity of building a dedicated medical dataset and show that models that leverage extra resources achieve the best performance on both tasks, which offers guidance for model selection in future studies in the medical domain.
Anthology ID:
2021.bionlp-1.23
Volume:
Proceedings of the 20th Workshop on Biomedical Language Processing
Month:
June
Year:
2021
Address:
Online
Venues:
BioNLP | NAACL
SIG:
SIGBIOMED
Publisher:
Association for Computational Linguistics
Pages:
213–220
URL:
https://aclanthology.org/2021.bionlp-1.23
DOI:
10.18653/v1/2021.bionlp-1.23
Cite (ACL):
Yang Liu, Yuanhe Tian, Tsung-Hui Chang, Song Wu, Xiang Wan, and Yan Song. 2021. Exploring Word Segmentation and Medical Concept Recognition for Chinese Medical Texts. In Proceedings of the 20th Workshop on Biomedical Language Processing, pages 213–220, Online. Association for Computational Linguistics.
Cite (Informal):
Exploring Word Segmentation and Medical Concept Recognition for Chinese Medical Texts (Liu et al., BioNLP 2021)
PDF:
https://aclanthology.org/2021.bionlp-1.23.pdf
Code:
cuhksz-nlp/acemr