COVID-19 Named Entity Recognition for Vietnamese

The current COVID-19 pandemic has led to the creation of many corpora that facilitate NLP research and downstream applications to help fight the pandemic. However, most of these corpora are exclusively for English. As the pandemic is a global problem, it is worth creating COVID-19 related datasets for languages other than English. In this paper, we present the first manually annotated COVID-19 domain-specific dataset for Vietnamese. In particular, our dataset is annotated for the named entity recognition (NER) task with newly-defined entity types that can be used in other future epidemics. Our dataset also contains the largest number of entities compared to existing Vietnamese NER datasets. We conduct experiments using strong baselines on our dataset and find that: (i) automatic Vietnamese word segmentation helps improve the NER results, and (ii) the highest performances are obtained by fine-tuning pre-trained language models, where the monolingual model PhoBERT for Vietnamese (Nguyen and Nguyen, 2020) produces higher results than the multilingual model XLM-R (Conneau et al., 2020). We publicly release our dataset at: https://github.com/VinAIResearch/PhoNER_COVID19


Introduction
As of early November 2020, the total number of COVID-19 cases worldwide has surpassed 50M. The world is once again hit by a new wave of COVID-19 infection, with record-breaking numbers of new cases reported every day. Along with the outbreak of the pandemic, information about COVID-19 is aggregated rapidly through different types of texts in different languages (Aizawa et al., 2020). In Vietnam in particular, text reports containing official information from the government about COVID-19 cases are presented in great detail, including de-identified personal information, travel history, and information about people who come into contact with the cases. The reports are frequently kept up to date at reputable online news sources, playing a significant role in helping the country combat the pandemic. It is thus essential to build systems that retrieve and condense information from those official sources so that the relevant people and organizations can promptly grasp the key information for epidemic prevention tasks. Such systems should also be able to adapt quickly to epidemics that take place in the future. One of the first steps in developing such systems is to recognize relevant named entities mentioned in the texts, which is known as the NER task.
Compared to other languages, data resources for the Vietnamese NER task are limited, including only two public datasets from the VLSP 2016 and 2018 NER shared tasks (Huyen and Luong, 2016; Nguyen et al., 2018b). Here, the VLSP-2018 NER dataset is an extension of the VLSP-2016 NER dataset with more data. These two datasets only focus on recognizing generic entities of person names, organizations, and locations in online news articles, which makes them difficult to adapt to the context of extracting key entity information related to COVID-19 patients. This leads to the main goals of our work: (i) to develop a NER task in the COVID-19 specific domain that potentially impacts research and downstream applications, and (ii) to provide the research community with a new dataset for recognizing COVID-19 related named entities in Vietnamese.
In this paper, we present a named entity annotated dataset with newly-defined entity types that can be applied to future epidemics. The dataset contains informative sentences related to COVID-19, extracted from articles crawled from reputable Vietnamese online news sites. Here, we do not consider other types of popular social media in Vietnam such as Facebook, as they contain much noisy information and are not as reliable as official news sources. We then empirically evaluate strong baseline models on our dataset. Our contributions are summarized as follows:
• We introduce the first manually annotated Vietnamese dataset in the COVID-19 domain. Our dataset is annotated with 10 different named entity types related to COVID-19 patients in Vietnam. Compared to the VLSP-2016 and VLSP-2018 Vietnamese NER datasets, our dataset has the largest number of entities, consisting of 35K entities over 10K sentences.
• We empirically investigate strong baselines on our dataset, including BiLSTM-CNN-CRF (Ma and Hovy, 2016) and the pre-trained language models XLM-R (Conneau et al., 2020) and PhoBERT (Nguyen and Nguyen, 2020). We find that: (i) automatic Vietnamese word segmentation helps improve the NER results, and (ii) the highest results are obtained by fine-tuning the pre-trained language models, where PhoBERT does better than XLM-R.
• We publicly release our dataset for research or educational purposes. We hope that our dataset can serve as a starting point for future COVID-19 related Vietnamese NLP research and applications.

Table 1: Entity type labels and their definitions.
Label: Definition
PATIENT_ID: Unique identifier of a COVID-19 patient in Vietnam. A PATIENT_ID annotation over "X" refers to the Xth patient having COVID-19 in Vietnam.
PERSON_NAME: Name of a patient or person who comes into contact with a patient.
AGE: Age of a patient or person who comes into contact with a patient.
GENDER: Gender of a patient or person who comes into contact with a patient.
OCCUPATION: Job of a patient or person who comes into contact with a patient.
LOCATION: Locations/places where a patient was present.
ORGANIZATION: Organizations related to a patient, e.g. company, government organization, and the like, with their own structures and functions.
SYMPTOM&DISEASE: Symptoms that a patient experiences, and diseases that a patient had prior to COVID-19 or complications that usually appear in death reports.
TRANSPORTATION: Means of transportation that a patient used. Here, we only tag the specific identifier of vehicles, e.g. flight numbers and bus/car plates.
DATE: Any date that appears in the sentence.

Related work
Most COVID-19 related datasets are constructed from two types of sources. The first is scientific publications, including the datasets CORD-19 (Wang et al., 2020) and LitCovid (Chen et al., 2020), which facilitate many types of research work, such as building search engines to retrieve relevant information from scholarly articles (Esteva et al., 2020; Zhang et al., 2020; Verspoor et al., 2020), question answering, and summarization (Lee et al., 2020; Su et al., 2020). Recently, Colic et al. (2020) fine-tune a BERT-based NER model on the CRAFT corpus (Verspoor et al., 2012) to recognize and then normalize biomedical ontology and terminology entities in LitCovid. The second type is social media data, particularly Tweets. COVID-19 related Tweet datasets are built for many analytic tasks such as identification of informative Tweets (Nguyen et al., 2020b), and disinformation detection and fact-checking (Shahi and Nandini, 2020; Alam et al., 2020; Alsudias and Rayson, 2020). The work most relevant to ours is that of Zong et al. (2020), which aims to extract COVID-19 events reporting test results, death cases, cures and prevention from English Tweets. As Twitter is rarely used by Vietnamese people, we could not use it for data collection.

Our dataset

Entity types
We define 10 entity types with the aim of extracting key information related to COVID-19 patients, which is especially useful in downstream applications. In general, these entity types can be used not only in the context of the COVID-19 pandemic but also in other future epidemics. Each entity type is briefly described in Table 1. See the Appendix for entity examples as well as some notes on the entity types.

COVID-19 related data collection
We first crawl articles tagged with "COVID-19" or "COVID" keywords from reputable Vietnamese online news sites, including VnExpress, ZingNews, BaoMoi and ThanhNien. These articles are dated between February 2020 and August 2020. We then segment the crawled news articles' primary text content into sentences using RDRSegmenter (Nguyen et al., 2018a) from VnCoreNLP (Vu et al., 2018).
To retrieve informative sentences about COVID-19 patients, we employ BM25Plus (Trotman et al., 2014) with search queries of common keywords appearing in sentences that report confirmed, suspected, recovered, or death cases as well as the travel history or location of the cases. From the top 15K sentences ranked by BM25Plus, we manually filter out sentences that do not contain information related to patients in Vietnam, thus resulting in a dataset of 10027 raw sentences.
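The ranking step above can be illustrated with a minimal sketch of BM25+ scoring over tokenized sentences. This is not the authors' retrieval code: the parameter values (k1, b, delta), the toy English sentences, and the query are all assumptions chosen for illustration, since the text does not specify them.

```python
import math
from collections import Counter

def bm25plus_scores(query, docs, k1=1.5, b=0.75, delta=1.0):
    """Score each tokenized document against a tokenized query with BM25+.

    k1, b and delta are assumed default values, not taken from the paper.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization component shared by all terms of this document.
        norm = k1 * (1.0 - b + b * len(d) / avg_len)
        score = 0.0
        for term in set(query):
            if tf[term] == 0:
                continue  # terms absent from the document contribute nothing
            idf = math.log((n_docs + 1) / df[term])
            score += idf * (tf[term] * (k1 + 1) / (tf[term] + norm) + delta)
        scores.append(score)
    return scores

# Toy whitespace-tokenized sentences (hypothetical stand-ins for news sentences).
sentences = [
    "patient 91 is a pilot treated at the hospital".split(),
    "the weather in hanoi is sunny today".split(),
    "a new patient tested positive after a flight".split(),
]
query = "patient tested positive".split()
scores = bm25plus_scores(query, sentences)
# Sentence indices ordered from most to least relevant to the query.
ranking = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
```

In a pipeline like the one described, the top-ranked sentences would then be passed on to manual filtering.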

Annotation process
We develop an initial version of our annotation guidelines and then randomly sample a pilot set of 1K sentences from the dataset of 10027 raw sentences for the first phase of annotation. Two of the guideline developers annotate the pilot set independently. Following Brandsen et al. (2020), we use the F1 score to measure the inter-annotator agreement between the two annotators at the entity span level, resulting in an F1 score of 0.88. We then host a discussion session to resolve annotation conflicts, identify complex cases, and refine the guidelines.
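As a minimal sketch of how a span-level F1 agreement score can be computed, the function below treats one annotator's spans as reference and the other's as prediction. The span representation (sentence id, start, end, label) and the requirement that both span and label match exactly are assumptions; the text only states that agreement is measured at the entity span level.

```python
def span_f1(spans_a, spans_b):
    """F1 between two collections of (sentence_id, start, end, label) spans.

    Exact match on both boundaries and label is an assumption of this sketch.
    """
    a, b = set(spans_a), set(spans_b)
    if not a and not b:
        return 1.0  # both annotators marked nothing: perfect agreement
    matched = len(a & b)
    if matched == 0:
        return 0.0
    precision = matched / len(b)
    recall = matched / len(a)
    return 2 * precision * recall / (precision + recall)

# Hypothetical annotations: the annotators disagree on one LOCATION boundary.
annotator_1 = [(0, 0, 2, "PATIENT_ID"), (0, 5, 7, "LOCATION"), (1, 3, 4, "DATE")]
annotator_2 = [(0, 0, 2, "PATIENT_ID"), (0, 5, 8, "LOCATION"), (1, 3, 4, "DATE")]
agreement = span_f1(annotator_1, annotator_2)
```

Because F1 is symmetric in precision and recall, swapping the two annotators gives the same score.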
In the second annotation phase, we divide the whole dataset of 10027 sentences into 10 non-overlapping and equal subsets. Each subset contains 100 sentences from the pilot set of the first annotation phase. For this second phase, we employ 10 annotators who are undergraduate students with strong linguistic abilities (each annotator annotates one subset and is paid 0.05 USD per sentence). The annotation quality of each annotator is measured by the F1 score calculated over the 100 sentences that already have gold annotations from the pilot set. All annotators are asked to revise their annotations until they achieve an F1 of at least 0.92. Finally, we revisit each annotated sentence to make further corrections if needed, resulting in a final gold dataset of 10027 annotated sentences.
Note that in written Vietnamese, in addition to marking word boundaries, white space is also used to separate the syllables that constitute words. Therefore, the annotation process is performed on syllable-level text for convenience. To obtain a word-level variant of the dataset, we apply RDRSegmenter to perform automatic Vietnamese word segmentation, e.g. the 4-syllable text "bệnh viện Đà Nẵng" (Da Nang hospital) is word-segmented into the 2-word text "bệnh_viện Đà_Nẵng", where "bệnh_viện" means "hospital" and "Đà_Nẵng" is "Da Nang". Here, the automatic Vietnamese word segmentation outputs do not affect the gold boundaries of entity mentions.
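The relation between the syllable-level and word-level variants can be sketched as a projection of syllable-level BIO tags onto segmented words. This is an illustrative assumption about how such a conversion could work, not the authors' procedure: it assumes words are syllables joined by underscores, that entity boundaries coincide with word boundaries, and the LOCATION label on the example is hypothetical.

```python
def syllable_to_word_tags(words, syllable_tags):
    """Project syllable-level BIO tags onto underscore-joined words.

    Each word receives the tag of its first syllable, which is only valid
    when entity boundaries never fall inside a word (an assumption here).
    """
    n_syllables = sum(len(word.split("_")) for word in words)
    if n_syllables != len(syllable_tags):
        raise ValueError("segmentation and tag sequence are misaligned")
    word_tags, i = [], 0
    for word in words:
        word_tags.append(syllable_tags[i])  # tag of the word's first syllable
        i += len(word.split("_"))
    return word_tags

# The example phrase from the text, with a hypothetical LOCATION annotation.
words = ["bệnh_viện", "Đà_Nẵng"]
syllable_tags = ["B-LOCATION", "I-LOCATION", "I-LOCATION", "I-LOCATION"]
word_tags = syllable_to_word_tags(words, syllable_tags)
# → ["B-LOCATION", "I-LOCATION"]
```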

Data partitions
We randomly split the gold annotated dataset of 10027 sentences into training/validation/test sets with a ratio of 5/2/3, ensuring comparable distributions of entity types across these three sets. Statistics of our dataset are presented in Table 2.
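A random 5/2/3 split can be sketched as follows. This only illustrates the ratio: the seed, the rounding of set sizes, and the function name are assumptions, and the sketch does not reproduce whatever procedure was used to keep entity-type distributions comparable across the three sets.

```python
import random

def split_indices(n, ratios=(5, 2, 3), seed=0):
    """Shuffle range(n) and cut it into three parts proportional to `ratios`."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seed value is an arbitrary assumption
    total = sum(ratios)
    n_train = n * ratios[0] // total
    n_valid = n * ratios[1] // total
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # remainder goes to the test part
    return train, valid, test

train, valid, test = split_indices(10027)
```

After such a split, counting entity types per part is a simple way to check that the distributions are comparable.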

Experimental setup
We formulate the COVID-19 NER task for Vietnamese as a sequence labeling problem with the BIO tagging scheme. We conduct experiments on our dataset using strong baselines to investigate: (i) the influence of automatic Vietnamese word segmentation (here, the input sentence can be represented at either the syllable or the word level), and (ii) the usefulness of pre-trained language models. The baselines include BiLSTM-CNN-CRF (Ma and Hovy, 2016) and the pre-trained language models XLM-R (Conneau et al., 2020) and PhoBERT (Nguyen and Nguyen, 2020). XLM-R is a multilingual variant of RoBERTa (Liu et al., 2019), pre-trained on a 2.5TB multilingual dataset that contains 137GB of syllable-level Vietnamese texts. PhoBERT is a monolingual variant of RoBERTa, pre-trained on a 20GB word-level Vietnamese dataset.

Table 4: Strict F1 score for each entity type (denoted by its first 3 characters), and Micro- and Macro-average F1 scores (denoted by Mic-F1 and Mac-F1, respectively). BiL-CRF abbreviates the baseline BiLSTM-CNN-CRF. Syllable and Word denote results obtained when using syllable- and word-level dataset settings, respectively.
980 0.944 0.967 0.968 0.791 0.940 0.876 0.885 0.967 0.989 0.945 0.931
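To make the BIO tagging scheme used in this sequence labeling formulation concrete, the sketch below converts entity spans into per-token tags. The example sentence, its spans, and its labels are hypothetical and not taken from the dataset.

```python
def spans_to_bio(n_tokens, spans):
    """Convert (start, end_exclusive, label) token spans into BIO tags."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"       # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"       # remaining tokens of the entity
    return tags

# Hypothetical word-level sentence:
# "Patient 91 is treated at Bach Mai Hospital on 18/3".
tokens = ["Bệnh_nhân", "91", "điều_trị", "tại", "Bệnh_viện", "Bạch_Mai", "ngày", "18/3"]
spans = [(1, 2, "PATIENT_ID"), (4, 6, "LOCATION"), (7, 8, "DATE")]
tags = spans_to_bio(len(tokens), spans)
```

A sequence labeling model then predicts one such tag per input token, at either the syllable or the word level.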

Hyper-parameter
We employ the BiLSTM-CNN-CRF implementation from AllenNLP (Gardner et al., 2018). Training BiLSTM-CNN-CRF requires pre-trained syllable- and word-level input embeddings for the syllable- and word-level settings, respectively. Thus we employ the pre-trained Word2Vec syllable and word embeddings for Vietnamese from Nguyen et al. (2020a). These embeddings are fixed during training. The optimal hyper-parameters that we grid-searched for BiLSTM-CNN-CRF are presented in Table 3. We use the transformers library (Wolf et al., 2020) to fine-tune XLM-R and PhoBERT for the syllable- and word-level settings, respectively, using Adam (Kingma and Ba, 2014) with a fixed learning rate of 5e-5 and a batch size of 32 (Liu et al., 2019).
The baselines are trained/fine-tuned for 30 epochs. We evaluate the Micro-average F1 score after each epoch on the validation set (here, we apply early stopping if we find no performance improvement after 5 consecutive epochs). We then choose the best model checkpoint to report the final score on the test set. Note that each F1 score reported is an average over 5 runs with different random seeds. Table 4 shows the final entity-level NER results of the baselines on the test set. In addition to the standard Micro-average F1 score, we also report the Macro-average F1 score.
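The difference between the two averages can be sketched as follows: Micro-average F1 pools true positives, false positives, and false negatives over all entity types, while Macro-average F1 is the unweighted mean of per-type F1 scores. The counts below are invented for illustration and are not results from the paper.

```python
def f1(tp, fp, fn):
    """F1 from true-positive, false-positive and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(counts):
    """`counts` maps each entity type to a (tp, fp, fn) tuple."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp, fp, fn)  # pooled over all entity types
    macro = sum(f1(*c) for c in counts.values()) / len(counts)  # mean of per-type F1
    return micro, macro

# Hypothetical counts: a frequent, easy type and a rare, harder type.
counts = {"DATE": (90, 5, 5), "OCCUPATION": (6, 2, 2)}
micro, macro = micro_macro_f1(counts)
```

With counts like these, the Macro-average is lower than the Micro-average because the rare, harder type weighs as much as the frequent one.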

Main results
We categorize the results under two comparable settings: using the syllable-level dataset and using its automatically segmented word-level variant for training and evaluation. We find that the performances of word-level models are higher than those of their syllable-level counterparts, showing that automatic Vietnamese word segmentation helps improve NER, e.g. BiLSTM-CNN-CRF improves from 0.906 to 0.910 Micro-F1 and from 0.858 to 0.875 Macro-F1.
We also find that fine-tuning the pre-trained language models XLM-R and PhoBERT produces better performances than BiLSTM-CNN-CRF. Here, PhoBERT outperforms XLM-R (Micro-F1: 0.945 vs. 0.938; Macro-F1: 0.931 vs. 0.911), thus reconfirming the effectiveness of pre-trained monolingual language models on language-specific downstream tasks (Nguyen and Nguyen, 2020).

Error analysis
We perform an error analysis using the best performing model PhoBERT large that produces 353 incorrect predictions in total on the validation set.
The first error group consists of 69/353 instances with correct entity boundaries (i.e. exact spans) and incorrect entity labels. This is largely because the model could not differentiate between LOCATION and ORGANIZATION entities. This is not surprising given the ambiguity between these two entity types: the same entity mention may act as either LOCATION or ORGANIZATION depending on the sentence context. Also, in terms of contact tracing, it would be more useful to label an organization-like entity mention as LOCATION if we can infer that a patient was present at that organization; however, such inference requires additional world knowledge about the entity. In addition, in this error group, the model also struggles to recognize OCCUPATION entities correctly. Recall that an OCCUPATION entity mention must represent the job of a particular person labeled with PERSON_NAME or PATIENT_ID. Therefore, within a single sentence context, it may be difficult for the model to decide whether an occupation is linked to a specific person or not.
The second error group contains 65/353 instances with inexact spans that overlap with gold spans but have correct entity labels. These errors generally happen with multi-word ORGANIZATION entity mentions, where (i) an ORGANIZATION entity contains a nested location inside its span, e.g. "Bệnh viện Lao và Bệnh phổi Cần Thơ" (Can Tho hospital for Tuberculosis and Lung disease; here, "Can Tho" is a province in Vietnam), or (ii) an organization is a subdivision of a larger organization, e.g. "Khoa tim mạch - Bệnh viện Bạch Mai" (Department of Cardiology - Bach Mai Hospital). The third group of 8/353 errors, with overlapping inexact spans and incorrect entity labels, does not provide us with any useful insight. The final group of the remaining 211/353 errors consists of predicted entities that correspond to gold O labels. This is particularly the case for LOCATION, where generic mentions such as "Bệnh viện tỉnh" (province hospital), "Trạm y tế xã" (commune medical station), and "chung cư" (apartment) are recognized as entities when in fact they are not.
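The four error groups described above can be sketched as a small classification routine over predicted and gold spans. The group names, the (start, end_exclusive, label) span representation, and the example spans are assumptions made for illustration; this is not the authors' analysis script.

```python
def error_group(pred, gold_spans):
    """Classify a predicted (start, end_exclusive, label) span against gold spans."""
    p_start, p_end, p_label = pred
    # Gold spans sharing at least one token with the prediction.
    overlapping = [g for g in gold_spans if g[0] < p_end and p_start < g[1]]
    if not overlapping:
        return "no_gold_entity"  # prediction over tokens whose gold label is O
    for g_start, g_end, g_label in overlapping:
        if (g_start, g_end) == (p_start, p_end):
            return "correct" if g_label == p_label else "exact_span_wrong_label"
    if any(g[2] == p_label for g in overlapping):
        return "inexact_span_correct_label"
    return "inexact_span_wrong_label"

# Hypothetical gold annotation of one sentence and several predictions.
gold = [(0, 2, "ORGANIZATION"), (5, 6, "DATE")]
groups = [
    error_group((0, 2, "LOCATION"), gold),      # first group in the text
    error_group((0, 3, "ORGANIZATION"), gold),  # second group
    error_group((1, 3, "LOCATION"), gold),      # third group
    error_group((8, 9, "LOCATION"), gold),      # final group
    error_group((5, 6, "DATE"), gold),          # not an error
]
```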

Conclusion
In this paper, we have presented the first manually annotated Vietnamese dataset in the COVID-19 domain, focusing on the named entity recognition task. We conduct experiments on our dataset to compare strong baselines and find that both the input representations and the pre-trained language models influence performance on this COVID-19 related NER task. We hope that our dataset can serve as a starting point for further Vietnamese NLP research and applications in fighting COVID-19 and other future epidemics.