COVID-19 Named Entity Recognition for Vietnamese

Thinh Hung Truong, Mai Hoang Dao, Dat Quoc Nguyen


Abstract
The current COVID-19 pandemic has led to the creation of many corpora that facilitate NLP research and downstream applications to help fight the pandemic. However, most of these corpora are exclusively for English. As the pandemic is a global problem, it is worth creating COVID-19 related datasets for languages other than English. In this paper, we present the first manually annotated COVID-19 domain-specific dataset for Vietnamese. In particular, our dataset is annotated for the named entity recognition (NER) task with newly defined entity types that can be used in other future epidemics. Our dataset also contains the largest number of entities compared to existing Vietnamese NER datasets. We conduct experiments using strong baselines on our dataset and find that automatic Vietnamese word segmentation helps improve the NER results, and that the highest performance is obtained by fine-tuning pre-trained language models, where the monolingual model PhoBERT for Vietnamese (Nguyen and Nguyen, 2020) produces higher results than the multilingual model XLM-R (Conneau et al., 2020). We publicly release our dataset at: https://github.com/VinAIResearch/PhoNER_COVID19
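To make the word segmentation point concrete, the following is a minimal sketch of how the same sentence can be tagged at the syllable level and at the word level, and how entity spans can be read off a BIO tag sequence. The example sentence, the `LOCATION` tag name, the underscore convention for joining syllables into words, and the helper `bio_to_spans` are all illustrative assumptions, not taken from the paper or from the released dataset's files.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert a BIO tag sequence into (entity_type, start, end) spans.

    `end` is exclusive. An I- tag that does not continue the current
    entity is treated leniently as the start of a new entity.
    """
    spans = []
    start, etype = None, None
    # A trailing "O" sentinel flushes an entity that ends the sentence.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if continues:
            continue
        if etype is not None:
            spans.append((etype, start, i))
        if tag.startswith("B-") or tag.startswith("I-"):
            start, etype = i, tag[2:]
        else:
            start, etype = None, None
    return spans


# Hypothetical sentence, tagged at the syllable level (assumed example).
syllables = ["Bệnh", "nhân", "ở", "Hà", "Nội"]
syllable_tags = ["O", "O", "O", "B-LOCATION", "I-LOCATION"]

# The same sentence after word segmentation: multi-syllable words are
# joined with underscores (an assumed convention for this sketch).
words = ["Bệnh_nhân", "ở", "Hà_Nội"]
word_tags = ["O", "O", "B-LOCATION"]

syllable_spans = bio_to_spans(syllable_tags)
word_spans = bio_to_spans(word_tags)

for etype, s, e in syllable_spans:
    print("syllable level:", etype, syllables[s:e])
for etype, s, e in word_spans:
    print("word level:", etype, words[s:e])
```

In this toy case the entity covers two tokens at the syllable level but a single token at the word level, which gives some intuition for why segmentation might make entity boundaries easier for a tagger to learn.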
Anthology ID:
2021.naacl-main.173
Volume:
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month:
June
Year:
2021
Address:
Online
Editors:
Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, Yichao Zhou
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
2146–2153
URL:
https://aclanthology.org/2021.naacl-main.173
DOI:
10.18653/v1/2021.naacl-main.173
Cite (ACL):
Thinh Hung Truong, Mai Hoang Dao, and Dat Quoc Nguyen. 2021. COVID-19 Named Entity Recognition for Vietnamese. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2146–2153, Online. Association for Computational Linguistics.
Cite (Informal):
COVID-19 Named Entity Recognition for Vietnamese (Truong et al., NAACL 2021)
PDF:
https://aclanthology.org/2021.naacl-main.173.pdf
Video:
https://aclanthology.org/2021.naacl-main.173.mp4
Code:
VinAIResearch/PhoNER_COVID19
Data:
PhoNER COVID19