A Multi-Task Approach for Improving Biomedical Named Entity Recognition by Incorporating Multi-Granularity Information

Neural biomedical named entity recognition (BioNER) methods usually require a large amount of annotated data, while annotated BioNER datasets are often difficult to obtain and small in scale due to limitations of privacy, ethics and the high degree of specialization. To alleviate the lack of training samples, and unlike conventional methods that only use token-level information, this paper proposes a method that simultaneously utilizes the latent multi-granularity information in the dataset. Concretely, the proposed model is based on a multi-task approach, which leverages different training objectives by introducing auxiliary tasks, i.e. binary classification, multi-class classification and multi-token classification. Experimental results on three BioNER datasets show that the proposed model outperforms the BioBERT baseline and obtains more than 3% F1-score improvement in low-resource scenarios. Our code is available at https://github.com/zgzjdx/MT-BioNER.


Introduction
Biomedical named entity recognition (BioNER) aims to identify entity mentions such as genes/proteins, diseases and chemicals in unstructured text. Such information is useful for downstream natural language processing (NLP) tasks like relation extraction (Zhou et al., 2014), automatic abstracting (Mishra et al., 2014) and question answering (Athenikos and Han, 2010). Different from named entity recognition (NER) tasks in general domains such as news, BioNER is particularly challenging due to naming complexity (Liu et al., 2015), large variations in the same entity names (Jia et al., 2019; Kim et al., 2019), and new entity mentions rapidly reported in scientific publications (Luo et al., 2018). These various factors lead to the small number and size of current BioNER datasets.

Figure 1: Examples from our constructed dataset. In our work, we designed three auxiliary tasks to help improve the main NER task. Two of them are sentence-level tasks and the other one is a token-level task. Concretely, the first sentence-level task predicts whether or not a sentence contains entities; the second sentence-level task predicts how many entities a sentence contains; and the token-level task predicts whether or not a given token belongs to a multi-token entity. Clearly, to support training the auxiliary tasks, additional labels have been added to our data. Note, however, that these additional labels can be derived from the original NER labels and do not require additional manual annotation. In short, what we do in this paper is use the multi-granularity information implied in the original dataset to improve the performance of BioNER.

In recent years, neural BioNER has become a mainstream approach because of its outstanding performance (Lample et al., 2016; Habibi et al., 2017; Yadav and Bethard, 2019). Some researchers have investigated introducing multi-task learning (Crichton et al., 2017; Khan et al., 2020) and pre-training (Peng et al., 2019; Lee et al., 2020) to address the lack of extensive training data and boost the performance of BioNER models. However, few of them have combined these two methods or tried to transfer sentence-level knowledge to tokens (Rei and Søgaard, 2019; Kruengkrai et al., 2020), which has proven effective in other domains (Abhishek et al., 2017).
In this paper, we focus on improving BioNER by exploiting the multi-granularity information implied in the dataset, without depending on additional manually annotated data. As shown in Figure 1, besides the main sequence labeling task, we employ three related classification tasks, i.e. a binary classification task that predicts whether a sentence contains entities or not, a multi-class classification task that predicts how many entities a sentence contains, and a multi-token entity classification task (Hu et al., 2020). In the rest of this paper, these three tasks are named bCLS, mCLS and mtCLS, respectively, while the main task is named NER. Our primary motivation is to mine useful training signals from coarse-grained classification to guide a more robust and interpretable token-level representation.
Our key contributions can be summarized as follows:
• To take full advantage of the implicit information contained in NER datasets, we present a multi-task model for jointly learning sentence-level and token-level labels, which incorporates BioBERT (Lee et al., 2020) as text encoding layers and shares hidden states between different tasks. To the best of our knowledge, we are the first to introduce information at different granularity levels in the BioNER domain.
• Experimental results on three datasets show that our proposed method is effective, especially in low-resource scenarios.
• We performed a preliminary pair-wise comparison analysis to investigate the relations between tasks and found that token-level labels are more helpful for sentence-level tasks, while at the same granularity, high-difficulty tasks are more helpful to low-difficulty tasks.

Related work
Traditional BioNER methods can be divided into rule- or dictionary-based approaches (Tjong Kim Sang and De Meulder, 2003; Kulick et al., 2004; Gerner et al., 2010). Recent works have shown that neural-network-based BioNER methods achieve promising results. Habibi et al. (2017) used an LSTM-CRF model, which was completely agnostic to entity types. Crichton et al. (2017), on the contrary, used a CNN-based model that takes tokens and their surrounding tokens as input. To solve the label inconsistency problem, Luo et al. (2018) proposed an Att-BiLSTM-CRF model and achieved better performance with little feature engineering. Neural BioNER systems are known to be extremely data-intensive, while the available training datasets are relatively small in scale. To tackle this problem, language models and multi-task learning have been shown to be effective (Peters et al., 2018; Liu et al., 2019). Jia et al. (2019) proposed a cross-domain NER model, which extracts knowledge from raw texts through a novel parameter sharing network. Yoon et al. (2019) proposed CollaboNet, which consists of multiple BiLSTM-CRF models that send information to one another for more accurate predictions, and obtained the best F1-score at that time. Although these studies exploit additional token-level information from auxiliary tasks or language models, they do not consider information from other levels contained in the NER dataset.
More recently, a transformer-based (Vaswani et al., 2017) large-scale pre-trained language model, BERT (Devlin et al., 2018), led to impressive gains on several NLP benchmarks, and domain-specific BERTs, such as blueBERT (Peng et al., 2019), BioBERT (Lee et al., 2020), SciBERT (Beltagy et al., 2019) and PubMedBERT (Gu et al., 2020), have largely outperformed previous state-of-the-art BioNER systems. However, research on BERT-based multi-task learning is still limited, and the associations between tasks need to be further explored (Khan et al., 2020; Vu et al., 2020).
The work most similar to ours is that of Kruengkrai et al. (2020). However, they focused on introducing a single auxiliary task that requires additional manual annotations, while we explore multiple auxiliary tasks, and our proposed method does not rely on any annotations beyond the original BioNER labels.

The proposed model

Tasks
As mentioned in Section 1, our model involves four tasks: bCLS, mCLS, mtCLS and NER. The goal is to optimize the token-level representation of BioBERT by introducing the auxiliary tasks (bCLS, mCLS, mtCLS) and thereby improve the performance of the main task (NER). The pre-trained BioBERT model is shared across all tasks through hard parameter sharing (Ruder, 2017). The input sequence and output labels of our proposed model are shown in Figure 2. Given a sentence X = {x_1, ..., x_i, ..., x_n}, where x_i is a token and n is the length of the input sequence, the first token of each X is always a special classification embedding [CLS], and the transformer encoder module maps X into a sequence of input embedding vectors, which are the sum of the token, segment and position embeddings. The detailed description of each task is as follows.

bCLS: This is a sentence-level binary classification task. Given X, the goal is to predict whether it contains entities or not. In some cases, for an X that does not contain entities, the model may incorrectly predict that it contains entities; or, for an X that contains entities, the model may incorrectly predict that it does not. We therefore design the bCLS task in the hope of solving this problem by introducing global guidance information.
mCLS: This is a sentence-level multi-class classification task. Given X, the goal is to predict how many entities it contains. To balance the label distribution, we formulate mCLS as a four-class classification task: an X containing 0, 1 or 2 entities is labeled 0, 1 or 2, respectively, while an X with more than 2 entities is labeled 3. Compared with bCLS, mCLS is more difficult, and we introduce this task to alleviate the problem of under- or over-recognizing entities.
mtCLS: Multi-token classification is a token-level three-class classification problem. Given x_i in X, the goal is to predict whether it belongs to a multi-token entity like "brain disease", a single-token entity like "peroxydase", or neither. Our motivation for introducing this task is that if the model knows whether x_i belongs to a multi-token entity, a single-token entity, or neither, it can alleviate the entity boundary problem.
NER: Given X, NER aims to predict the corresponding labels Y = {y_1, ..., y_i, ..., y_n}, where y_i is predefined and differs according to the annotation scheme, such as BIO and BIOES. We use this main task to measure the model performance and the effectiveness of the auxiliary tasks.
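Since the auxiliary labels are derived from the original NER labels rather than annotated manually, they can be generated mechanically from the BIO tags. The sketch below is a minimal illustration of one such derivation; the function name and the integer encoding of the mtCLS classes are our own choices, not taken from the released code:

```python
from typing import List, Tuple

def derive_auxiliary_labels(bio_tags: List[str]) -> Tuple[int, int, List[int]]:
    """Derive bCLS, mCLS and mtCLS labels from the BIO tags of one sentence."""
    # Collect entity spans as half-open (start, end) index ranges.
    spans = []
    i, n = 0, len(bio_tags)
    while i < n:
        if bio_tags[i].startswith("B-"):
            j = i + 1
            while j < n and bio_tags[j].startswith("I-"):
                j += 1
            spans.append((i, j))
            i = j
        else:
            i += 1

    bcls = 1 if spans else 0          # does the sentence contain any entity?
    mcls = min(len(spans), 3)         # 0, 1, 2, or 3 for "more than 2 entities"
    mtcls = [0] * n                   # 0 = not part of an entity
    for start, end in spans:
        token_label = 2 if end - start > 1 else 1   # 2 = multi-token, 1 = single-token
        for k in range(start, end):
            mtcls[k] = token_label
    return bcls, mcls, mtcls

# Example: two entities, one multi-token and one single-token.
print(derive_auxiliary_labels(["O", "B-Disease", "I-Disease", "O", "B-Disease"]))
# -> (1, 2, [0, 2, 2, 0, 1])
```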

Architecture
The overall architecture of our proposed model is shown in Figure 3, which mainly includes two parts: a shared encoder and task-specific layers. We use multi-task learning to jointly train the main task and the auxiliary tasks, which has been shown to be effective for transferring knowledge among multiple tasks. For the shared encoder, we take cased BioBERT-base v1.1 as the feature extractor and hard-share its parameters. Let X be an input sequence, where x_i denotes the i-th token in X. We represent each x_i using the pre-trained BioBERT embedding h_i ∈ R^d, where d is the dimension of the hidden states. The task-specific layers have independent parameters, which include a projection layer and a classifier for generating outputs. We use the output of the shared encoder, i.e. H = {h_1, ..., h_i, ..., h_n}, as the input of the task-specific layers for both sentence- and token-level tasks, as described in detail below.
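A minimal sketch of this hard-parameter-sharing setup, assuming PyTorch and the Hugging Face transformers library, is shown below; the class name and head layout follow the description in this section (including max pooling for the sentence-level heads, described next), but details may differ from the released implementation:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class MultiTaskBioNER(nn.Module):
    """Shared BioBERT encoder with one task-specific head per task."""

    def __init__(self, encoder_name="dmis-lab/biobert-base-cased-v1.1",
                 num_ner_labels=3, num_mtcls_labels=3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)  # shared across tasks
        d = self.encoder.config.hidden_size
        self.dropout = nn.Dropout(0.1)
        # Sentence-level heads operate on a pooled vector (max pooling over tokens).
        self.bcls_head = nn.Linear(d, 2)    # contains entities: yes / no
        self.mcls_head = nn.Linear(d, 4)    # 0, 1, 2, or >2 entities
        # Token-level heads operate on every hidden state.
        self.mtcls_head = nn.Linear(d, num_mtcls_labels)
        self.ner_head = nn.Linear(d, num_ner_labels)

    def forward(self, input_ids, attention_mask, task):
        H = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        H = self.dropout(H)
        if task in ("bCLS", "mCLS"):
            # Mask out padding before max pooling so pad positions never win the max.
            mask = attention_mask.unsqueeze(-1).bool()
            pooled = H.masked_fill(~mask, float("-inf")).max(dim=1).values
            head = self.bcls_head if task == "bCLS" else self.mcls_head
            return head(pooled)                    # (batch, num_classes)
        head = self.mtcls_head if task == "mtCLS" else self.ner_head
        return head(H)                             # (batch, seq_len, num_labels)
```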
Sentence-level tasks. As mentioned in Subsection 3.1, bCLS and mCLS are two sentence-level classification tasks. Different from standard BERT-based classification models, which use the [CLS] token (Sun et al., 2019) for classification, our model aims at optimizing the token representations of the shared encoder with sentence-level labels. Therefore, we create a fixed-size vector by applying mean/max pooling (Reimers and Gurevych, 2019) over H, which encourages the model to capture the most useful local features encoded in the hidden states. Finally, the probability of class k is predicted by a linear layer followed by a softmax:

$p(k \mid X) = \mathrm{softmax}(hW + b)_k,$
where h ∈ R^d is the pooled output of the model, and W ∈ R^{d×m} and b ∈ R^m are a trainable weight matrix and bias. m denotes the number of category labels, which is 2 for bCLS and 4 for mCLS. The loss for our sentence-level tasks is the cross-entropy

$\mathcal{L}_{\mathrm{CLS}} = -\sum_{m} \sigma(y_m = \hat{y}) \log p(m \mid X),$

where σ(y_m = ŷ) = 1 if the classification ŷ of X is the ground-truth label for class m, and σ(y_m = ŷ) = 0 otherwise.

Token-level tasks. As mentioned in Subsection 3.1, mtCLS and NER are two token-level classification tasks. (Generally, NER is treated as a sequence labeling problem; however, for a fair comparison with previous works, we use softmax instead of sequence labeling algorithms such as Conditional Random Fields (CRF) (Wallach, 2004) in the task-specific layers.) Given the dataset D, which consists of N training samples, i.e. $D = \{(x^j, y^j)\}_{j=1}^{N}$, where j denotes the sentence index in D, we train the token-level tasks by minimizing the negative log-likelihood of the correct label sequences over D:

$\mathcal{L}_{\mathrm{token}} = -\sum_{j=1}^{N} \sum_{i=1}^{n} \log p(y_i^j \mid H^j),$

where H^j ∈ R^{n×d} is the hidden state of x^j. Algorithm 1 provides the procedure for our cross-task joint training method, where α, β, γ and δ are hyper-parameters, and the final loss of the proposed model is calculated as a weighted sum of the losses of the different tasks.
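Algorithm 1 itself is not reproduced in this extract; the sketch below shows one plausible reading of the cross-task joint training procedure in PyTorch, interleaving mini-batches from all four tasks in MT-DNN style. The data loaders, the specific task-weight values (α, β, γ, δ) and their mapping to tasks are our assumptions:

```python
import random
import torch
import torch.nn.functional as F

# Hypothetical per-task weights (alpha, beta, gamma, delta in the paper's notation).
TASK_WEIGHTS = {"NER": 1.0, "bCLS": 0.2, "mCLS": 0.2, "mtCLS": 0.5}

def train_epoch(model, loaders, optimizer, device="cpu"):
    """One epoch of cross-task joint training over interleaved task batches.

    `loaders` maps a task name to an iterable of (input_ids, attention_mask, labels)
    batches; the shared encoder inside `model` is updated by every task.
    """
    # Interleave batches from all tasks in random order.
    batches = [(task, batch) for task, loader in loaders.items() for batch in loader]
    random.shuffle(batches)

    for task, (input_ids, attention_mask, labels) in batches:
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        labels = labels.to(device)
        logits = model(input_ids, attention_mask, task=task)

        if task in ("bCLS", "mCLS"):
            # Sentence-level cross-entropy over the pooled representation.
            loss = F.cross_entropy(logits, labels)
        else:
            # Token-level negative log-likelihood; padding positions labeled -100.
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100)

        (TASK_WEIGHTS[task] * loss).backward()
        optimizer.step()
        optimizer.zero_grad()
```

This loop can be used together with the MultiTaskBioNER sketch shown earlier.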

Datasets
We evaluated the performance of the proposed approach on three benchmark datasets, using the training and development sets for model training. As part of the data preprocessing step, token labels were encoded using the standard BIO scheme (Reimers and Gurevych, 2017). In this scheme, for example, a token describing a disease entity is tagged with "B-Disease" if it is at the beginning of the entity and "I-Disease" if it is inside the entity. Tokens that do not describe entities of interest are tagged as "O".

Settings
Following the work of Peng et al. (2020), all datasets are trained with a batch size of 32, a maximum sequence length of 256 and dropout (Srivastava et al., 2014) with probability 0.1 after the shared encoder. We use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 5e-5 for BC2GM and BC5CDR-chem and 1e-5 for NCBI-disease. The training procedure contains 50 epochs for BC2GM and BC5CDR-chem and 100 epochs for NCBI-disease. A linear learning rate decay schedule with a warm-up proportion of 0.1 and a weight decay of 0.01 are applied throughout training, following Liu et al. (2019). All models were trained on an NVIDIA TITAN RTX, and standard F1 metrics are used to evaluate the overall performance.
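As a rough reference, the optimizer and schedule described above can be set up along the following lines with the transformers library; this is our own sketch (AdamW is used here as a stand-in for Adam with weight decay, and the number of training steps is derived from the stated epochs):

```python
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model, dataset_name, steps_per_epoch):
    """Adam-style optimizer with linear decay and warm-up, per the settings above."""
    lr = 1e-5 if dataset_name == "NCBI-disease" else 5e-5   # 5e-5 for BC2GM / BC5CDR-chem
    epochs = 100 if dataset_name == "NCBI-disease" else 50
    total_steps = steps_per_epoch * epochs

    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.1 * total_steps),   # warm-up proportion of 0.1
        num_training_steps=total_steps,
    )
    return optimizer, scheduler
```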

Table 5: Case study on three datasets, where words in red and in green represent incorrectly and correctly recognized entities, respectively. (The table contrasts BioBERT's predictions with ours on example sentences from BC5CDR-chem and NCBI-disease; the colour highlighting cannot be reproduced in plain text.)

Results
We compare our approach with previous methods on the three benchmark datasets, where the current SOTA model is BioBERT. In line with expectations, Ours MAX, which uses the max pooling strategy, achieves the best results, with improvements of 0.40, 0.37 and 0.91 F1-score on the three datasets, respectively. On the contrary, Ours CLS and Ours MEAN yield negative results in our experiments. This phenomenon is consistent with Reimers and Gurevych (2019) and Kruengkrai et al. (2020). Another interesting result is that our best model also achieves a higher recall score than all the other approaches except the SOTA* result on BC2GM, which indicates that introducing coarse-grained tasks helps the model to predict more positive results.
To simulate low-resource scenarios, we also used reduced training datasets, obtained by randomly removing sentences from the training sets, while the test sets are left unmodified. As shown in Table 3, where CS-MTM is a multi-task model with a cross-sharing structure proposed by Wang et al. (2019a), we report the performance under different data sizes; the best F1-score for each resource size is bolded. When the training sets are reduced and the test sets are kept, the missing information in the removed sentences makes all models produce worse results. However, for the 50%-size, 25%-size and 10%-size datasets, our model obtains an average of 0.56, 0.79 and 1.72 F1-score improvements over BioBERT, which demonstrates that our designed auxiliary tasks can regularize the model to generate more robust token-level representations. Across all data sizes, our model obtains average improvements of 0.47, 0.95 and 1.26 F1-score on BC2GM, BC5CDR-chem and NCBI-disease, respectively, with the largest improvement observed on NCBI-disease. The smaller the training set, the larger the improvement achieved by our model. This finding shows that our method is more effective in low-resource scenarios. Specifically, on the 10%-size NCBI-disease dataset, our model obtains a 3.29 F1-score improvement over BioBERT.
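For reproducibility, the reduced training sets described above can be produced by simple sentence-level subsampling; the sketch below is our own illustration (the paper does not specify the random seed or sampling code):

```python
import random

def subsample_sentences(sentences, keep_ratio, seed=42):
    """Randomly keep a fraction of training sentences, e.g. keep_ratio=0.5 for 50%-size.

    `sentences` is a list of (tokens, bio_tags) pairs; test sets are never subsampled.
    """
    rng = random.Random(seed)
    keep = max(1, int(len(sentences) * keep_ratio))
    return rng.sample(sentences, keep)
```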
To prove that our joint training algorithm is effective, we plot the performance curves of different tasks in Figure 4. Moreover, different task combinations can produce different results in multi-task learning. To measure the impact of our designed auxiliary tasks and training algorithm, we conducted ablation studies and report the results in Table 4. From the ablation experiments, removing the joint training algorithm leads to a consistent drop in F1-score. Compared with the results of Liu et al. (2019), Khan et al. (2020) and Peng et al. (2020), we point out that multi-task learning algorithms such as MT-DNN require a large amount of training data to achieve improvements. Furthermore, all of the auxiliary tasks are helpful to the main task, but the impact of the different tasks varies. Specifically, mtCLS is the best partner for the BC2GM dataset, while bCLS brings the largest improvement for the BC5CDR-chem and NCBI-disease datasets. This phenomenon shows that different BioNER datasets have different recognition difficulties. For example, the recognition difficulty of BC2GM may be mainly related to the entity boundary problem, while the recognition difficulty of BC5CDR-chem and NCBI-disease lies in the entity sparsity problem. Therefore, for BC5CDR-chem and NCBI-disease, the model tends to incorrectly recognize entities in sentences that do not contain entities. This finding is consistent with our statistical results, where such cases account for 2.38%, 2.71% and 2.87% on the three datasets, respectively. Compared to bCLS, mCLS is less helpful, which implies that the effect of auxiliary tasks in multi-task learning is closely related to their performance. In fact, the classification performance of mCLS is lower than that of bCLS due to its higher difficulty.

Case study
Table 5 shows case studies on the three datasets. The BC2GM example shows the effect of the bCLS task: our model correctly recognizes the entity "TGF-beta1" while the BioBERT model fails. In the BC5CDR-chem example, the input sentence contains two entities, "propofol" and "serotonin"; the BioBERT model can only identify one of them, while our model correctly recognizes both by incorporating the mCLS task. For the NCBI-disease example, "congenital DM" is a multi-token entity and "DM" is a single-token entity. Without the help of the mtCLS task, the BioBERT model fails to capture this difference and recognizes both "DM" mentions incorrectly. Overall, these examples confirm that supervised objectives at different granularities, i.e. global information and local information, can be combined to help produce better representations.
Although the case studies show that our model with auxiliary tasks outperforms the BioBERT model, these tasks cannot completely solve the above problems due to their coarser granularities. Taking the bCLS task as an example, the model can notice from the sentence-level label that the current input sentence contains entities, but it may still struggle with the number of entities or the entity boundaries.

Impacts of the task relationship
In this subsection, we present a preliminary study of the relationships between different tasks in the same domain, such as the interaction between sentence-level and token-level tasks, and whether or not tasks can help one another. Therefore, we conducted pair-wise comparison experiments, as shown in Figure 5, where the x-axis is the secondary task and the y-axis is the main task.
First, we point out that token-level labels are more helpful for the sentence-level tasks. For mCLS, taking mtCLS, NER and bCLS as auxiliary tasks yields average improvements of 0.79%, 0.54% and 0.15% on the three datasets, respectively. Considering that mtCLS and NER are token-level tasks and bCLS is a sentence-level task, the results suggest that coarse-grained tasks can significantly benefit from fine-grained tasks. This finding can be used to guide the choice of tasks for multi-task learning.
Second, information at the same granularity also contributes to the other task. Concretely, bCLS and mCLS obtain average improvements of 0.39% and 0.15% from mCLS and bCLS, respectively, and mtCLS and NER obtain average improvements of 0.42% and 0.22% from NER and mtCLS, respectively. Meanwhile, the difficulty of a task is also a factor that affects the effectiveness of multi-task learning: bCLS gains 0.24% more improvement than mCLS, and mtCLS gains 0.20% more improvement than NER.
In addition, the same task combinations perform differently on different datasets. For example, the combination of mtCLS and mCLS yields negative results of -0.25% and -0.33% on the BC2GM and BC5CDR-chem datasets, while it achieves a 1.3% boost on the NCBI-disease dataset. We speculate that this may be related to the transferability of the specific dataset. Therefore, we visualized the task embeddings of the three datasets, which were generated with the method proposed by Vu et al. (2020) (https://github.com/tuvuumass/task-transferability), using the t-SNE (Belkina et al., 2019) dimensionality reduction algorithm, and show the results in Figure 6. From the visualization results, we find that the embedding distance between the same task across datasets (e.g., BC2GM-bCLS, BC5CDR-chem-bCLS, NCBI-disease-bCLS) or between tasks of the same type (e.g., NER and mtCLS, bCLS and mCLS) is closer, while the embedding distance between tasks of different types (e.g., bCLS and NER) is farther, but more specific relations need further exploration.
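For reference, the visualization step can be reproduced roughly as follows, assuming the task embeddings (one vector per dataset-task pair) have already been computed with the task-transferability code linked above; the function name, figure styling and perplexity value are illustrative choices:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_task_embeddings(embeddings, labels, perplexity=5, seed=0):
    """Project dataset-task embeddings to 2D with t-SNE and scatter-plot them.

    `embeddings` is an (n_tasks, dim) array, e.g. one row per pair such as
    "BC2GM-bCLS"; `labels` is the matching list of names.
    """
    points = TSNE(n_components=2, perplexity=perplexity,
                  random_state=seed, init="pca").fit_transform(np.asarray(embeddings))
    plt.figure(figsize=(5, 5))
    plt.scatter(points[:, 0], points[:, 1])
    for (x, y), name in zip(points, labels):
        plt.annotate(name, (x, y), fontsize=8)
    plt.title("t-SNE of task embeddings")
    plt.tight_layout()
    plt.show()
```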

Conclusion
In this work, we investigated whether coarse-grained labels can benefit the token-level representation for BioNER. We showed that the proposed BERT-based model, which jointly learns sentence-level and token-level labels, is effective without using external data or hand-crafted features on three datasets: BC2GM, BC5CDR-chem and NCBI-disease. Finally, we preliminarily discussed the correlation between the main task and the auxiliary tasks.
For multi-task learning, describing and reasoning about the relations between tasks through experiments requires a considerable amount of computational resources. In future work, keeping domain relatedness in mind, we will explore efficient methods for generating vectorial representations to measure the relationships between different NLP tasks.