Knowledge-Empowered Representation Learning for Chinese Medical Reading Comprehension: Task, Model and Resources

Machine Reading Comprehension (MRC) aims to extract answers to questions given a passage. It has been widely studied recently, especially in open domains. However, few efforts have been made on closed-domain MRC, mainly due to the lack of large-scale training data. In this paper, we introduce a multi-target MRC task for the medical domain, whose goal is to predict answers to medical questions and the corresponding support sentences from medical information sources simultaneously, in order to ensure the high reliability of medical knowledge serving. A high-quality dataset is manually constructed for the purpose, named Multi-task Chinese Medical MRC dataset (CMedMRC), with detailed analysis conducted. We further propose the Chinese medical BERT model for the task (CMedBERT), which fuses medical knowledge into pre-trained language models by the dynamic fusion mechanism of heterogeneous features and the multi-task learning strategy. Experiments show that CMedBERT consistently outperforms strong baselines by fusing context-aware and knowledge-aware token representations.


Introduction
Machine Reading Comprehension (MRC) has become a popular task in NLP, aiming to understand a given passage and answer the relevant questions. With the wide availability of MRC datasets (Rajpurkar et al., 2016; He et al., 2018; Cui et al., 2019) and deep learning models (Yu et al., 2018; Ding et al., 2019), including pre-trained language models such as BERT (Devlin et al., 2019), significant progress has been made.
Despite the success, a majority of MRC research has focused on open domains. For specific domains, however, the construction of high-quality MRC datasets, together with the design of corresponding models, is considerably deficient (Welbl et al., 2017, 2018). The causes behind this phenomenon are threefold. Take the medical domain as an example. i) Data annotators are required to have medical backgrounds with high standards. Hence, simple crowdsourcing (Rajpurkar et al., 2016; Cui et al., 2019) often leads to poor annotation results. ii) Due to the domain sensitivity, people are more concerned about the reliability of the information sources where the answers are extracted, and the explainability of the answers themselves (Lee et al., 2014; Dalmer, 2017). This is fundamentally different from the task requirements of open-domain MRC. iii) From the perspective of model learning, it is difficult for pre-trained language models to understand the meaning of questions and passages containing many specialized terms (Chen et al., 2016; Bauer et al., 2018). Without the help of domain knowledge, state-of-the-art models can perform poorly. As shown in Figure 1, BERT (Devlin et al., 2019) and MC-BERT (Zhang et al., 2020) only predict part of the correct answer, i.e., "torso" and "buttocks", instead of generating the complete answer to the medical question.
In this paper, we present a comprehensive study on Chinese medical MRC, including i) how the task is formulated, ii) the construction of the Chinese medical dataset and iii) the MRC model with rich medical knowledge injected. To meet the requirements of medical MRC, we aim to predict both the answer span to a medical question and the support sentence from the passage, indicating the source of the answer. The support sentences provide abundant evidence for users to learn medical knowledge, and for medical professionals to assess the trustworthiness of model output results. For the dataset, we construct a high-quality Chinese medical MRC dataset, named the Multi-task Chinese Medical MRC dataset (CMedMRC). It contains 18,153 <question, passage, answer, support sentence> quads. Based on the analysis of CMedMRC, we summarize four special challenges for Chinese medical MRC, including long-tail terminologies, synonym terminology, terminology combination and paraphrasing. In addition, we find that comprehensive skills are required for MRC models to answer medical questions correctly. For answer extraction in CMedMRC, direct token matching is required for answering 31% of the questions, co-reference resolution for 11%, multi-sentence reasoning for 18% and implicit causality for 22%. In addition, the answers to the remaining questions (16%) are extremely difficult to extract without rich medical background knowledge.
[Figure 1, passage excerpt: <1> Imbricate tinea is a special kind of body ringworm, mainly caused by the concentric trichoderma or imbricated versicolor bacterium… <14> It often occurs in the torso and buttocks. The disease can spread to the extremities after a long time. It can even spread to the lips, nail groove and scalp. <15> But the metatarsal is not affected and also do not damage the hair...]
To address the medical MRC task, we propose the multi-task dynamic heterogeneous fusion network (CMedBERT) based on the MC-BERT (Zhang et al., 2020) model and a Chinese medical knowledge base (see Appendix).
The technical contributions of CMedBERT are twofold:
• Heterogeneous Feature Fusion: We mimic humans' approach of reading comprehension (Wang et al., 1999) by learning attentively aggregated representations of multiple entities in the passage. Different from the knowledge fusion method used by KBLSTM (Yang and Mitchell, 2017) and KT-NET (Yang et al., 2019), we propose a two-level attention and a gated-loop mechanism to replace the knowledge sentinel, so that rich knowledge representations can be better integrated into the model.
• Multi-task Learning: Parameters of CMedBERT are dynamically learned by capturing the relationships between the two tasks via multi-task learning. We regard the semantic similarities between support sentences and answers to questions as the task similarities.
We compare CMedBERT against six strong baselines. For answer prediction, compared to the strongest competitor, the EM (Exact Match) and F1 scores are increased by +3.88% and +1.46%, respectively. Meanwhile, the support sentence prediction task result is increased by a large margin, i.e., +7.81% of EM and +4.07% of F1.

Related Work
MRC Datasets and Models.
Due to the popularity of the MRC task, there exist many types of MRC datasets, such as span-extraction (Rajpurkar et al., 2016; Yang et al., 2018), multiple-choice (Richardson et al., 2013; Lai et al., 2017), cloze-style (Hermann et al., 2015) and cross-lingual (Jing et al., 2019; Yuan et al., 2020). For specific domains, however, few MRC datasets are publicly available; they include SciQ (Welbl et al., 2017), Quasar-S (Dhingra et al., 2017) and Biology (Berant et al., 2014). CLiCR (Suster and Daelemans, 2018) is a cloze-style single-task English medical MRC dataset. However, it contains a relatively small variety of medical questions, automatically generated from clinical case reports. Recently, Li et al. (2020) propose a multi-choice Chinese medical QA dataset, which retrieves text snippets as the passage; the task only requires choosing an existing correct option from a candidate set. Our work specifically focuses on fine-grained medical MRC tasks and deep domain knowledge reasoning, with a manually constructed high-quality dataset released.
MRC model architectures mostly take advantage of neural networks to learn token representations of passages and questions jointly (Qiu et al., 2019a; Liu et al., 2019). The interaction between questions and passages is modeled based on attention mechanisms. The rapid development of deep learning has led to a variety of models, such as QANet (Yu et al., 2018) and SAN (Liu et al., 2018). Graph neural networks have recently been used in MRC to model the relations between entities in the passage (Ding et al., 2019) and multi-grained token representations (Zheng et al., 2020).

Pre-trained Language Models and Knowledge Fusion.
Pre-trained language models (e.g., BERT (Devlin et al., 2019), ERNIE-THU (Zhang et al., 2019), K-BERT (Liu et al., 2020a)) have successfully improved the performance of the MRC task, even exceeding the human level on some datasets. This is because these models obtain better token representations and capture lexical and syntactic knowledge in different layers (Guan et al., 2019). For specific domains, there are also some pre-trained models (Beltagy et al., 2019; Zhang et al., 2020). [...] (Chen et al., 2018), our proposed model encodes all the triples from a medical KG and then employs heuristic rules to retrieve relevant entities. This practice allows the model to acquire a deeper understanding of domain-specific terms.

The CMedMRC Dataset
In this section, we briefly describe the collection process and provide an analysis of various aspects of the CMedMRC dataset. For further details on dataset collection and statistical analysis, we refer readers to Appendix A and Appendix B.

Dataset Collection Process
The dataset collection process follows the SQuAD style (Rajpurkar et al., 2016) rather than collecting question-answer pairs as in Google Natural Questions (Kwiatkowski et al., 2019). Our medical text corpus is collected from DXY Medical, an authoritative medical knowledge source in China. The general data collection process of CMedMRC consists of four major steps: passage collection, question-answer pair collection, support sentence selection and additional answer construction. Briefly speaking, during the passage collection process, we filter the corpus to generate high-quality medical passages. A group of human annotators are required to ask questions on medical knowledge and annotate the answers from these passages. The annotation results are in the form of question-answer pairs. Following SQuAD, we ask annotators to provide 2 additional answers for each question in the DEV and TEST sets.
Since people are concerned about the scientific explanation and sources of answers in the medical domain, we ask annotators to select the support sentence of their annotated answer similar to those of CoQA (Reddy et al., 2019) and QuAC (Choi et al., 2018). Finally, CMedMRC consists of three parts: 12,700 training samples, 3,630 development samples and 1,823 testing samples.

Quality Control
During the dataset collection process, we take the following measures to ensure the quality of the dataset. i) The knowledge source (DXY Medical) contains high-quality medical articles which are written by medical personnel and organized based on different topics in the medical domain. ii) Our annotators are all engaged in medical-related professions rather than annotators with short-term guidance only. iii) We further hire 12 medical experts to check all the collected samples rather than checking a randomly selected sample only. The experts remove out-of-domain questions and questions that are unhelpful to medical practice. In this stage, the experts are divided into two groups and cross-check their judgments.

Challenges of Understanding Texts
Due to the closed-domain property of our dataset, there are some domain-specific textual features in both passages and questions that the model needs to understand. Based on our observations of CMedMRC, we summarize the following two major challenges. These challenges can also be regarded as key reasons why some recent state-of-the-art MRC models cannot address the medical MRC task on CMedMRC well.
Lexical-Level: i) Long-tail terminology means that these medical terms occur very infrequently and are prone to Out-Of-Vocabulary (OOV) problems. ii) Synonym terminology means that some medical terms may express the same meaning, but there is a distinction between colloquial expressions and professional terms. Solving the above two points requires the model to have rich domain knowledge. iii) Terminology combination means that these terms are usually formed by a combination of multiple terms, where one term is the attributive of another. This not only requires the model to have domain knowledge but also poses challenges to phrase segmentation in specific domains.
Sentence-Level: Paraphrasing means some words in questions are semantically related to certain tokens in passages, but are expressed differently. Consider the last question in Table 1. When the model tries to answer the "not-recommended" question, it should focus on negative terms ("cause infection" and "delay the healing of the wound").

Reasoning Skills for MRC Models
We randomly select 100 samples from the development set to analyze what skills the model should have in order to answer the questions correctly. We divide the reasoning skills corresponding to these samples into five major categories, namely token matching, co-reference resolution, multi-sentence reasoning, implicit causality and domain knowledge. Examples are shown in Table 2. It is particularly noteworthy that the fifth type is the need for domain knowledge to answer medical questions. Consider the example:
Passage: ... The incidence of thrombotic cell weakness is most common in childhood with an incidence rate of 0.01 / 10,000...
Question: What is the incidence rate of blood platelet weakness?
Answer: 0.01/10,000.
Through the medical knowledge base, we know that the blood platelet in the question refers to the thrombotic cell described in the passage. This shows that the rich information of the knowledge base can help the model obtain a better understanding of domain terms to improve the MRC performance.

Task Formulation and Model Overview
For our task, the input includes a medical question $Q$ together with the passage $P$. Let $\{p_1, p_2, \cdots, p_m\}$ and $\{q_1, q_2, \cdots, q_n\}$ represent the passage and question tokens, respectively. In the answer prediction task, the goal is to train an MRC model which extracts the answer span $\{p_i, p_{i+1}, \cdots, p_j\}$ ($0 \le i \le j \le m$) from $P$ that correctly answers the question $Q$. Additionally, the model is required to predict the support sentence tokens $\{p_k, p_{k+1}, \cdots, p_l\}$ ($0 \le k \le l \le m$) from $P$ to provide additional medical knowledge and to enhance interpretability of the extracted answers. We require that $\{p_k, p_{k+1}, \cdots, p_l\}$ form a complete sentence rather than an incomplete semantic unit, and that the support sentence tokens contain the answer span.
The high-level architecture of the CMedBERT model is shown in Figure 2. It mainly includes four modules: BERT encoding, knowledge embedding and retrieval, heterogeneous feature fusion and multi-task training.
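The two constraints on the targets stated above (the support sentence is a complete sentence, and it contains the answer span) can be illustrated with a small check. This is only a sketch: the function names, the 0-based inclusive indexing and the sentence-ending punctuation set are assumptions made for illustration, not part of the CMedMRC tooling.

```python
from typing import List, Tuple

def sentence_spans(tokens: List[str], enders: str = "。！？") -> List[Tuple[int, int]]:
    """Return inclusive (start, end) token-index spans, one per sentence."""
    spans, start = [], 0
    for idx, tok in enumerate(tokens):
        if tok in enders:                 # assumed sentence-ending punctuation
            spans.append((start, idx))
            start = idx + 1
    if start < len(tokens):               # trailing text without a final ender
        spans.append((start, len(tokens) - 1))
    return spans

def is_valid_target(answer: Tuple[int, int], support: Tuple[int, int],
                    sentences: List[Tuple[int, int]]) -> bool:
    """Support must be a whole sentence and must contain the answer span."""
    i, j = answer
    k, l = support
    return i <= j and support in sentences and k <= i and j <= l
```

A pair of labels that fails this check (for example, a support span cut in the middle of a sentence) would violate the task definition.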

BERT Encoding
This module is used to learn context-aware representations of the question and passage tokens. For each input pair (the question $Q$ and the passage $P$), we treat [CLS], $Q$, [SEP], $P$, [SEP] as the input sequence for BERT. We denote $\{h_i\}_{i=1}^{m+n+3}$ as the hidden layer representations of tokens, where $m$ and $n$ are the lengths of the passage tokens and question tokens, respectively.
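As a minimal sketch of the input layout described above, the following helper assembles the sequence of length m + n + 3 from already tokenized text. The function name `build_input` and the choice to truncate the passage when the sequence exceeds the maximum length are assumptions for illustration; the source does not say how over-long inputs are handled.

```python
from typing import List

def build_input(question: List[str], passage: List[str], max_len: int = 512) -> List[str]:
    """Concatenate question and passage as [CLS] Q [SEP] P [SEP]."""
    budget = max_len - len(question) - 3      # room left after three special tokens
    if budget < 0:
        raise ValueError("question alone exceeds max_len")
    # Assumption: the passage, not the question, is truncated to fit.
    return ["[CLS]"] + question + ["[SEP]"] + passage[:budget] + ["[SEP]"]
```

With a question of n tokens and a passage of m tokens that fits within the limit, the result has exactly m + n + 3 entries, matching the index range of the hidden representations.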

Knowledge Embedding and Retrieval
In the knowledge bases, relational knowledge is stored in the form of (subject, relation, object) triples. In order to fuse knowledge into token representations, we first encode all entities in the knowledge base into a low-dimensional vector space. Here, we employ PTransE (Lin et al., 2015a) to learn entity representations, and denote the underlying entity embedding as $e_i$. Because existing medical NER tools do not have high coverage over our corpus, we consider five types of token strings as candidate entities: noun, time, location, direction and numeric. Two matching strategies are then employed to retrieve relevant entities from the knowledge base: (i) the two strings match exactly; (ii) the number of overlapped tokens is larger than a threshold. After relevant entities are retrieved, we can fuse the knowledge into contextual representations, as introduced below.
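The two matching strategies can be sketched as follows. Several details here are assumptions rather than the authors' implementation: overlap is counted over distinct characters, the threshold is taken relative to the candidate string's length (Appendix C.1 gives it as half of "its own length"), and the function name `retrieve_entities` is hypothetical.

```python
from typing import Iterable, List

def retrieve_entities(candidate: str, kb_entities: Iterable[str],
                      ratio: float = 0.5) -> List[str]:
    """Return KB entity names that match a candidate string."""
    matched = []
    for entity in kb_entities:
        if entity == candidate:                        # strategy (i): exact match
            matched.append(entity)
            continue
        overlap = len(set(candidate) & set(entity))    # shared distinct characters
        if overlap > ratio * len(candidate):           # strategy (ii): overlap above threshold
            matched.append(entity)
    return matched
```

For example, the candidate "血小板" would retrieve both "血小板" and "血小板减少症" from a knowledge base that contains them, while "血栓" shares only one character and is rejected.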

Heterogeneous Feature Fusion
In this module, we fuse heterogeneous entity features retrieved from the knowledge base into the question and passage token representations.
Local Fusion Attention. We observe that each token is usually related to multiple entities of varying importance. Thus, we assign different weights to the entity embeddings $e_j$ corresponding to the token representation $h_i$ using an attention mechanism, where $K$ is the number of entities and $\alpha_{i,j}$ represents the similarity between the $j$-th entity in the retrieved entity set and the $i$-th token. $W \in \mathbb{R}^{d_2 \times d_1}$, where $d_1$ is the dimension of BERT's output and $d_2$ is the dimension of entity embeddings. After fusing, the representation of the $i$-th token is $\bar{e}_i = \sum_{k=1}^{K} \alpha_{i,k} e_k$. However, $\bar{e}_i$ is only related to retrieved entities, not other tokens in the question-passage pair.
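The weighted sum over retrieved entities can be illustrated in a few lines of dependency-free Python. The exact scoring function is not given in this text, so the sketch assumes a bilinear score $e_j^\top W h_i$ normalized with a softmax over the $K$ entities; this form is merely consistent with the stated shape of $W$ and should not be read as the paper's definition. The name `local_fusion` is hypothetical.

```python
import math
from typing import List

Vector = List[float]

def local_fusion(h_i: Vector, entities: List[Vector], W: List[Vector]) -> Vector:
    """Attention-weighted sum of entity embeddings for one token.

    h_i: token representation (d1); entities: K embeddings (d2 each);
    W: d2 x d1 matrix. The bilinear score is an assumed form.
    """
    Wh = [sum(w * h for w, h in zip(row, h_i)) for row in W]             # W h_i, length d2
    scores = [sum(e * p for e, p in zip(e_j, Wh)) for e_j in entities]   # e_j^T W h_i
    peak = max(scores)                                                   # numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    alphas = [x / total for x in exps]                                   # softmax over K entities
    d2 = len(entities[0])
    return [sum(a * e_j[d] for a, e_j in zip(alphas, entities)) for d in range(d2)]
```

Because the weights sum to one, the fused vector always lies in the convex hull of the retrieved entity embeddings, so an entity that scores higher against the token pulls the result toward itself.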
Global Fusion Attention. In BERT, the output of the [CLS] tag represents the entire sequence information learned by transformer encoders. We use the token output $h_{[CLS]}$ to model the knowledge fusion representation of the entire entity collection that each token recalls, and denote by $\hat{e}_i$ the global knowledge fusion result corresponding to the $i$-th token.
Gated Loop Layer. In order to fuse local and global results into token representations, we design a gated loop layer. The information of knowledge fusion is filtered through the gating mechanism in each loop of modeling. In the initialization stage, we simply have $h_i^0 = h_i$. In the $l$-th iteration, the token representation is updated through the gate. This process runs for $L$ loops, and the output of the fusion process is $h_i^L$. The loop process mimics the human's behavior of reading the passage repeatedly to find the most accurate answers.

Multi-task Training
The output layer of CMedBERT is extended from BERT. We first concatenate two types of token representations and calculate the probability of the $i$-th token being selected in the support sentence. We also calculate its probabilities as the starting and the ending positions of the answer span, respectively. The loss function of the answer prediction task is the negative log-likelihood of the starting and ending positions of the ground-truth answer tokens. For the extraction of support sentences, the loss function is defined by cross-entropy. In these losses, $N$ is the number of samples and $M$ is the length of input sequences. $y_j^{start}$ and $y_j^{end}$ are the ground-truth starting and ending positions for the $j$-th token. Furthermore, if the token is in the support sentence, the token label $y_j^{support}$ is set to 1, and 0 otherwise.
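The two loss terms can be sketched for a single sample as follows. This is a simplified illustration under stated assumptions: the probabilities are taken as given rather than produced by a model, the support loss is treated as a per-token binary cross-entropy averaged over the M tokens, and the function names are hypothetical. Averaging over the N samples would wrap these per-sample values.

```python
import math
from typing import List

def answer_nll(p_start: List[float], p_end: List[float],
               y_start: int, y_end: int) -> float:
    """Negative log-likelihood of the gold start and end positions."""
    return -(math.log(p_start[y_start]) + math.log(p_end[y_end]))

def support_bce(p_support: List[float], y_support: List[int],
                eps: float = 1e-12) -> float:
    """Binary cross-entropy over tokens, averaged over the sequence length."""
    total = 0.0
    for p, y in zip(p_support, y_support):
        p = min(max(p, eps), 1.0 - eps)       # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_support)
```

Both values shrink toward zero as the model assigns higher probability to the gold positions and to the correct support labels.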
The representations of the support sentence are related to the positions of the answer. In order to better model the relationship between the two tasks, we dynamically learn the coefficient between the loss values of the two tasks. Let $h_{su}$ and $o_{sp}$ be the self-attended, average-pooled representations of the support sentence and the answer span. $o_{st}$ and $o_{ed}$ are the start and end position token representations of the answer, respectively. $\gamma_{st}$, $\gamma_{ed}$ and $\gamma_{sp}$ are the weight coefficients between the support sentence and the start/end/total token representations of the answer span. We then define the loss coefficient $\lambda$ of the two tasks and the total loss $L$, and minimize $L$ to update our model parameters in the training process.

Experimental Setups
We evaluate CMedBERT on CMedMRC, and compare it against six strong baselines: DrQA (Chen et al., 2017), BERT base (Devlin et al., 2019), ERNIE (Zhang et al., 2019), KT-NET (Yang et al., 2019), MC-BERT (Zhang et al., 2020) and KMQA (Li et al., 2020). KT-NET is the first model to leverage rich knowledge to enhance pre-trained language models for MRC. MC-BERT is the first Chinese biomedical pre-trained model fine-tuned on BERT base. For KMQA, we only use the encoder layer and remove the answer layer, since the answer type is different. For evaluation, we use the EM (Exact Match) and F1 metrics for both the answer and support sentence tasks. We calculate character-level overlaps between prediction and ground truth for the Chinese language, rather than token-level overlaps as for English. To assess the difficulty of solving CMedMRC tasks, we select 100 testing samples to evaluate human performance. Human scores of EM and F1 are 85.00% and 96.69% for answer prediction, respectively.
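The character-level metrics can be sketched as below. This follows the SQuAD-style bag-of-units F1 applied to characters, which is an assumption: the text only says that character-level overlaps are used, and other Chinese MRC evaluations compute overlap differently. No text normalization (punctuation or whitespace stripping) is applied here, and taking the maximum over several gold answers is likewise assumed from SQuAD practice.

```python
from collections import Counter
from typing import Callable, List

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the strings are identical, else 0.0."""
    return float(prediction == reference)

def char_f1(prediction: str, reference: str) -> float:
    """F1 over the multiset of characters shared by prediction and reference."""
    overlap = sum((Counter(prediction) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(prediction)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

def best_score(metric: Callable[[str, str], float],
               prediction: str, references: List[str]) -> float:
    """Score against several gold answers by taking the maximum."""
    return max(metric(prediction, ref) for ref in references)
```

A partial answer such as "躯干" against the reference "躯干和臀部" gets full precision but low recall, which is why F1 can stay well above EM for span extraction.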
In the implementation, we set the learning rate to 5e-5, the batch size to 16 and the max sequence length to 512. Other BERT hyper-parameters are the same as in Google's settings. Each model is trained for 2 epochs by the Adam optimizer (Kingma and Ba, 2015). Results are averaged over 5 runs with different random seeds. Other implementation details are in Appendix C.
Table 3 and Table 4 show the multi-task and single-task results on the CMedMRC development and testing sets. CMedBERT has a great improvement compared to four strong baseline models in both tasks. Specifically, our CMedBERT outperforms the state-of-the-art model by a large margin in multi-task results, with +3.88% EM / +1.46% F1 improvements, which shows the effectiveness of our model. Meanwhile, in the support sentence task, our model also has the best performance, improving by +7.81% EM / +4.07% F1 over the testing set. In single-task evaluation, we remove the support sentence training module and the dynamic parameter for the loss function. Our model improves by +4.46% EM / +1.66% F1 over the best baseline model. In addition, we find that using support sentence prediction as an auxiliary task and the pre-training technique in the medical domain can further improve the performance of CMedBERT.

Ablation study
In Table 5, we choose three important model components for our ablation study and report the results over the testing set. When the dynamic parameter λ of the loss function is removed from the model, the performance on the two tasks decreases by (-1.00% EM and -0.55% F1) and (-4.42% EM and -3.61% F1), respectively. Without local fusion attention, the performance in the answer prediction task decreases by (-3.98% EM and -1.75% F1). For answer prediction, the model thus performs worse without the local fusion attention than without the global fusion attention or the dynamic parameter λ. However, the performance of the support sentence task decreases significantly, by (-7.92% EM and -3.61% F1), without global fusion attention. This shows that local fusion attention is more important for extracting answer spans, while global fusion attention plays a larger role in support sentence prediction.

Case Study
In Figure 3, we use our motivating example to conduct a case study. In BERT, we can see that the difference among the probability values of different words is small, especially when predicting the probability of ending positions. The ending position probabilities of the tokens "部" and "皮" are 0.3 and 0.25, leading the model to extract the wrong answer span. However, in the knowledge retrieval module of CMedBERT, multiple entity representations are fused into the context-aware latent space representation to enhance the semantic understanding of medical text. Therefore, in our CMedBERT model, the starting position probability is 0.95 and the ending position probability is 0.8. In this case, the CMedBERT model can easily choose the correct range of the answer span.

Discussion of Support Sentence Task
Compared with the answer prediction task, existing models have poor prediction results on the EM metric in the support sentence task. We randomly select 100 samples from the prediction results for analysis. We divide the errors into the following three main types (see Appendix): i) starting position cross, ii) ending position cross and iii) answer substring. The most common error type is the answer substring, accounting for 46%. In this error type, the predicted result of our model is part of the true result, which shows that the model cannot predict long answers completely (Yuan et al., 2020); this greatly reduces the accuracy of the results.

Conclusion
In this work, we address medical MRC with a new dataset CMedMRC constructed. An in-depth analysis of the dataset is conducted, including statistics, characteristics, required MRC skills, etc. Moreover, we propose the CMedBERT model, which can help the pre-trained model better understand domain terms by retrieving entities from medical knowledge bases. Experimental results confirm the effectiveness of our model. In the future, we will explore how knowledge can improve the performance of models in the medical domain.

A.1 Medical Passage Curation
We use the following rules to obtain 20,000 passages from DXY as the inputs to human annotators:
• We use regular expressions to filter out images, tables, hyperlinks, etc. The English-Chinese translations of medical terms are also provided if the passages contain medical terms in English.
• We find that if we follow (Rajpurkar et al., 2016) to limit the lengths of the passages within 500 tokens, our human annotators could not ask 4 high-quality medical questions easily. Hence, our passage length limit is 1000 tokens.
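The curation rules above can be sketched as a small cleaning routine. The regular expressions here are hypothetical stand-ins, since the actual patterns used for CMedMRC are not given, and they assume the raw passages are HTML-like text. The 1000-token limit is the one stated above.

```python
import re

# Hypothetical patterns; the expressions actually used are not specified.
HTML_TABLE = re.compile(r"<table\b.*?</table>", re.IGNORECASE | re.DOTALL)
HTML_TAG = re.compile(r"<[^>]+>")                  # also removes <img> and <a> tags
URL = re.compile(r"https?://[^\s<>\"']+")

def clean_passage(raw: str) -> str:
    """Strip tables, remaining HTML tags and bare URLs, then tidy spaces."""
    text = HTML_TABLE.sub("", raw)
    text = HTML_TAG.sub("", text)
    text = URL.sub("", text)
    return re.sub(r"[ \t]+", " ", text).strip()

def within_limit(tokens, max_tokens: int = 1000) -> bool:
    """Check the passage length limit (in tokens)."""
    return len(tokens) <= max_tokens
```

Tables are removed before generic tags so that cell contents disappear together with the markup instead of being left behind as stray text.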

A.2 Medical Question-Answer Pair Collection
We employ a group of annotators with professional medical backgrounds to generate question-answer pairs from the medical passages. Here are some general guidelines:
• We encourage our annotators to ask questions whose answers are uniformly distributed across different positions of the medical passages.
• For each medical passage, we limit the number of questions to 4.
• Each question should be strictly related to the medical domain. When creating the questions, no part of the text may be directly copied and pasted from the given medical passages.
• We limit the number of answer tokens to no more than 40.

A.3 Support Sentence Selection
In our dataset, we add an index to each sentence in the passages. Annotators are required to select the support sentence index and mark the range of the answer spans on the user interface.

A.4 Additional Answer Construction
To evaluate the human performance of our dataset and make our model more robust, we collect two additional answers for each question in the development and testing sets. We employ another 12 annotators for answer construction. Since our medical passage is relatively long, we show the questions and the passage contents again on the interface, together with the previously labeled support sentence indices.

B Statistical Analysis of the CMedMRC Dataset
B.1 Question and Answer Types
Due to the special characteristics of the Chinese language, the question types cannot be simply classified by prefix words of questions (Rajpurkar et al., 2016). Here, we manually define 8 common question types in the user annotation interface. The statistics of each question type are shown in Figure 4. The first seven question types usually correspond to special medical answers. For example, the What type refers to a question on the name of a drug or a disease, which accounts for more than half of the dataset. A third of the questions belong to the types of How and Why.
The statistics of answer types are also shown in Table 6. The proportions of Noun Phrase and Description types are relatively large. The results are consistent with Figure 4, since most What questions need to be answered with the above two answer types. Table 7 shows the text lengths of the four types of input data.

Table 8: Three typical error answer types in support sentence task. The blue and underscore contents in brackets indicate why the sample belongs to its corresponding category.
[...] 46%
Other (8%)
Ground-truth: 抗组胺药第一代的经典代表药「马来酸氯苯那敏」就是一个，它俗称扑尔敏，在多年临床应用中没有发现对胎儿有明显的致畸或其他严重危害。 (One of the classic representative drugs of the first generation of antihistamines is "Chlorpheniramine Maleate". It is commonly known as Chlorpheniramine. It has not been found to have obvious teratogenic or other serious harm to the fetus in many years of clinical application.)
Prediction: 但临床上也有一些药物是经过多年验证，只要注意把握用药时间和药量，即使让孕妇吃也不会有事的 (However, there are also some drugs that have been verified for many years in clinical practice. As long as you pay attention to the time and amount of medication, it will be fine even if pregnant women take it.)

B.2 Analysis of Domain Knowledge
We further analyze to what degree domain knowledge exists in CMedMRC, in terms of medical entities and other terms. In this study, we employ POS and NER toolkits to tag medical entities and terms from 100 samples in the development set of CMedMRC. We also compare the statistics against those of two other Chinese MRC datasets, namely CMRC (Cui et al., 2019) and DuReader (He et al., 2018). The proportions of entities and five frequent POS tags in the three datasets are summarized in Figure 5. Compared to the other two open-domain datasets, the proportion of entities in CMedMRC is very high (11%). In addition, the proportion of nouns (27%) is much higher than that of the other four POS tags in CMedMRC. The most likely cause is that existing models have difficulty recognizing all the medical terms, and treat them as common nouns. Among the three Chinese datasets, CMedMRC has the largest proportion (38%) of nouns and entities. Therefore, it is difficult for pre-trained language models to understand so many medical terms without additional medical background knowledge.
C Experimental Settings

C.1 Medical Knowledge Base and Corpora
The underlying medical knowledge base is constructed by DXY, containing 44 relation types and over 4M relation triples. The KG embeddings are trained by TransR (Lin et al., 2015b) on DXY-KG, which contains 152,508 entities. In knowledge retrieval, the threshold of overlapped tokens is set to half of its own length. The medical pre-training corpora used in ERNIE-THU (Zhang et al., 2019) contain 5,937,695 text segments with 3,028,224,412 tokens (4.9 GB) after pre-processing.

C.2 Additional Training Details
On average, the training times for DrQA, BERT base, MC-BERT, KT-NET, ERNIE, KMQA and CMedBERT are 10, 16, 16, 27, 29, 28 and 25 minutes per epoch, respectively, on a TITAN RTX GPU. All the models are implemented with the PyTorch deep learning framework.