Named Entity Recognition via Machine Reading Comprehension: A Multi-Task Learning Approach

Named Entity Recognition (NER) aims to extract and classify entity mentions in text into pre-defined types (e.g., organization or person name). Recently, many works have proposed to formulate NER as a machine reading comprehension problem (termed MRC-based NER), in which entity recognition is achieved by answering questions formulated from pre-defined entity types through MRC, given the context. However, these works ignore the label dependencies among entity types, which are critical for precisely recognizing named entities. In this paper, we propose to incorporate the label dependencies among entity types into a multi-task learning framework for better MRC-based NER. We decompose MRC-based NER into multiple tasks and use a self-attention module to capture label dependencies. Comprehensive experiments on both nested NER and flat NER datasets validate the effectiveness of the proposed Multi-NER. Experimental results show that Multi-NER achieves better performance on all datasets.


Introduction
Named Entity Recognition (NER), which aims to locate and classify entity mentions in text into pre-defined types, is a fundamental task in information extraction (Chinchor and Robinson, 1997; Nadeau and Sekine, 2007). Typically, NER is formulated as a sequence labeling task, where each token is classified as one of the pre-defined types. However, sequence labeling models can only assign one label to each token, and are therefore incapable of handling the overlapping entities that arise in nested NER (Finkel and Manning, 2009). Figure 1 shows an example of nested NER: in the sentence "The Secretary of Homeland Security was in attendance.", Homeland Security can be recognized as ORGANIZATION as well as PERSON.
To mitigate this issue, many works resort to formulating NER as a Machine Reading Comprehension (MRC) question answering task (termed MRC-based NER) (Wang et al., 2020; Li et al., 2020; Wang et al., 2022). For example, to recognize ORGANIZATION entities, a natural-language question "Which ORGANIZATION is mentioned in the text?" is formulated. The goal of NER is then transformed into answering the formulated questions through machine reading comprehension, given the context. MRC-based NER provides a unified solution for both flat and nested NER, since each entity type has its own entity span positions as the answer, and these answers are independent of each other.
Despite much progress having been made in MRC-based NER, existing approaches tend to ignore the label dependencies among entity types, which are critical for precise NER. Label dependencies indicate that different entities in the text are related to each other. For example, in the sentence "Ousted WeWork founder Adam Neumann lists his Manhattan penthouse for $37.5 million", once Adam Neumann is recognized as PERSON, it is expected to help with the recognition of WeWork as ORGANIZATION, because the word founder preceding a person's name implies an organization.
To leverage the label dependencies among entity types, we propose a novel multi-task learning framework (termed Multi-NER) for MRC-based NER. In Multi-NER, MRC-based NER is decomposed into multiple tasks, each focusing on one entity type. For each task, the input is the concatenation of an entity-type-related question and the context, and the expected output is the corresponding entity spans (i.e., start and end positions). The input is first encoded via a pre-trained BERT (Devlin et al., 2019). The concatenation of the embeddings of all tasks is then fed into a self-attention module, which preserves the label dependencies between different entity types. Finally, task-specific output layers are applied to each task.
To validate the effectiveness of our proposed Multi-NER, we conduct experiments on both flat NER and nested NER datasets. Experimental results show that Multi-NER benefits MRC-based NER in both settings, with the important label dependencies among entities preserved. Additionally, we visualize the self-attention maps to examine whether the label dependencies have been successfully captured.
Overall, the contributions of this paper are twofold: 1) We are the first to propose a multi-task learning framework for MRC-based NER that captures label dependencies between entity types; 2) We visualize the self-attention maps to verify that the self-attention module can capture label dependencies.
All the source code and datasets are available at https://github.com/YiboWANG214/MultiNER


Methodology

Problem Formulation
Given a sequence X = {x_1, x_2, ..., x_n}, where n denotes the length of X, NER aims to find every entity mention in X and assign it an entity type y ∈ Y. BERT-MRC (Li et al., 2020) transforms tagging-style NER into an MRC format with a triplet (QUESTION, ANSWER, CONTEXT). The natural-language question q^(y) = {q^(y)_1, ..., q^(y)_{l_y}}, where l_y denotes the length of q^(y), is related to the entity type y and serves as the QUESTION; the positions P^y_{start,end} of entity mentions of type y serve as the ANSWER; the input sequence X serves as the CONTEXT. Given X and q^(y), the goal of BERT-MRC is to predict P^y_{start,end}. Our Multi-NER applies the same MRC format but further decomposes BERT-MRC into multiple tasks, where each task i ∈ {1, 2, ..., |Y|} focuses on one entity type. Thus, the task set {1, 2, ..., |Y|} and the entity type set Y are in bijection. Instead of processing one QUESTION at a time, Multi-NER processes all QUESTIONs in a multi-task framework. Therefore, in the Multi-NER setting, given a CONTEXT X and multiple questions {q^(y)}_{y∈Y}, the goal is to predict {P^y_{start,end}}_{y∈Y}.
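As a concrete illustration, the triplet construction above can be sketched as follows; the question wordings, the toy sentence, and the gold spans are illustrative, not the datasets' actual annotations:

```python
# Sketch of the (QUESTION, ANSWER, CONTEXT) formulation used by MRC-based NER.

def build_mrc_examples(context_tokens, gold_spans, questions):
    """For each entity type y, pair its natural-language question with the
    shared context; the answer is the set of (start, end) spans of type y."""
    examples = []
    for ent_type, question in questions.items():
        answer_spans = [(s, e) for (s, e, y) in gold_spans if y == ent_type]
        examples.append({
            "question": question,        # QUESTION: q^(y)
            "context": context_tokens,   # CONTEXT: X
            "answer": answer_spans,      # ANSWER: P^y_{start,end}
        })
    return examples

tokens = ["Ousted", "WeWork", "founder", "Adam", "Neumann", "lists",
          "his", "Manhattan", "penthouse"]
spans = [(1, 1, "ORG"), (3, 4, "PER"), (7, 7, "LOC")]
qs = {"ORG": "Which organization is mentioned in the text?",
      "PER": "Which person is mentioned in the text?",
      "LOC": "Which location is mentioned in the text?"}

examples = build_mrc_examples(tokens, spans, qs)
```

Each entity type yields one example sharing the same context, which is exactly the decomposition Multi-NER exploits.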

Multi-NER
Figure 2 gives an overview of our proposed Multi-NER, which consists of |Y| tasks, where each task denotes the recognition of one specific entity type.
To share information between tasks, one shared encoder is used across tasks, and a self-attention module is employed to capture label dependencies.
Single Task Learning For every single task i ∈ {1, 2, ..., |Y|}, the input sequence is the concatenation of the natural-language question q^(i) and the context X: {[CLS], q^(i)_1, ..., q^(i)_{l_i}, [SEP], x_1, ..., x_n}. The output of task i has three components: start index prediction, end index prediction, and span matrix prediction. The start index prediction is the probability of each token being a start position; the end index prediction is the probability of each token being an end position; the span matrix prediction is the probability of each start-end pair being an entity mention position.
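The three output components for a single task can be sketched as follows, using plain NumPy with random weights standing in for trained parameters; the sigmoid heads and the simple start-end feature pairing are a simplification of the actual output layers:

```python
import numpy as np

rng = np.random.default_rng(0)

def prediction_heads(hidden, w_start, w_end, w_span):
    """hidden: (n, d) token representations for one task.
    Returns per-token start/end probabilities and an (n, n) span matrix."""
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))
    p_start = sigmoid(hidden @ w_start)   # (n,) start-index probabilities
    p_end = sigmoid(hidden @ w_end)       # (n,) end-index probabilities
    n, d = hidden.shape
    # Pair every candidate start i with every end j by concatenating features.
    pairs = np.concatenate(
        [np.repeat(hidden, n, axis=0), np.tile(hidden, (n, 1))], axis=1)
    p_span = sigmoid(pairs @ w_span).reshape(n, n)  # (n, n) span matrix
    return p_start, p_end, p_span

n, d = 5, 8
hidden = rng.standard_normal((n, d))
p_start, p_end, p_span = prediction_heads(
    hidden, rng.standard_normal(d), rng.standard_normal(d),
    rng.standard_normal(2 * d))
```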
Task Interactions Task interactions are twofold. First, one shared large language model, BERT (Devlin et al., 2019), is used as the encoder for all tasks to make the embedding space consistent. Thus, the embedding of task i is

E_i = BERT([CLS], q^(i), [SEP], X). (1)

Second, a self-attention module (Vaswani et al., 2017) is used across all tasks, accepting the concatenation of the embeddings of every task as input. The self-attention module incorporates information from every task and outputs the concatenation of hidden states:

E = SelfAttention([E_1; E_2; ...; E_|Y|]). (2)

The self-attention module endows E with the ability to capture the label dependencies between entity types, making E a better representation.
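A minimal single-head sketch of Eq. (2) in plain NumPy with random projection weights (the actual module may use multi-head attention); attending across all tasks' token positions is what lets each entity type's representation see every other type's:

```python
import numpy as np

def self_attention(E, Wq, Wk, Wv):
    """Single-head scaled dot-product attention (Vaswani et al., 2017)
    over the concatenated task embeddings E of shape (|Y| * n, d)."""
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Softmax over all positions of all tasks, so each token can attend
    # to tokens belonging to other entity-type tasks.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                     # (|Y| * n, d)

rng = np.random.default_rng(0)
num_types, n, d = 3, 4, 8                  # |Y| tasks, n tokens each
task_embeddings = [rng.standard_normal((n, d)) for _ in range(num_types)]
E_in = np.concatenate(task_embeddings, axis=0)   # [E_1; ...; E_|Y|]
E_out = self_attention(E_in, *(rng.standard_normal((d, d)) for _ in range(3)))
```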
Model Learning At training time, all tasks are trained jointly. The loss function for multi-task learning is defined as

L = Σ_{i=1}^{|Y|} (α L^i_start + β L^i_end + γ L^i_span),

where α, β, and γ are tunable weights for start index prediction, end index prediction, and span matrix prediction, and L^i_start, L^i_end, and L^i_span are the cross-entropy losses of task i.
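The joint objective can be sketched as follows; the per-task loss values and the weight settings here are placeholders for illustration:

```python
# Weighted sum of the three per-task losses over all |Y| tasks.

def multi_ner_loss(task_losses, alpha=1.0, beta=1.0, gamma=1.0):
    """task_losses: list of (L_start, L_end, L_span) tuples, one per entity type.
    Returns the joint multi-task training loss."""
    return sum(alpha * ls + beta * le + gamma * lsp
               for (ls, le, lsp) in task_losses)

# Two entity types with placeholder cross-entropy values.
losses = [(0.4, 0.3, 0.6), (0.2, 0.5, 0.1)]
total = multi_ner_loss(losses, alpha=1.0, beta=1.0, gamma=0.5)
```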

Experiments
To evaluate the performance of the proposed Multi-NER, we compare it with a state-of-the-art baseline, BERT-MRC (Li et al., 2020), on both flat NER and nested NER datasets. We also perform a case study with visualized attention maps to further analyze the ability of Multi-NER to capture label dependencies.

Datasets
We adopt three nested NER datasets (i.e., English ACE-2004 (Mitchell et al., 2005), English ACE-2005 (Walker et al., 2006), and GENIA (Ohta et al., 2002)) and one flat NER dataset (i.e., English CoNLL-2003 (Tjong Kim Sang and De Meulder, 2003)) to evaluate the performance of Multi-NER. ACE-2004 and ACE-2005 are two textual datasets drawn from broadcast, newswire, telephone conversations, and weblogs. GENIA is a collection of biomedical literature containing Medline abstracts. CoNLL-2003 is extracted from Reuters news stories published between August 1996 and August 1997. We conduct experiments on both nested and flat NER datasets since our model is based on BERT-MRC, which can be applied to both.

Experimental Settings
For a fair comparison, we select BERT-base as the backbone encoder for all models. We adopt a one-layer linear transformation to predict the start index and end index, and a two-layer MLP with the GELU activation function to predict the span matrix. We set the hidden size to 1,536 and the dropout rate to 0.1. More details of the hyperparameter settings are given in Appendix A.1. We follow the same question-generation process as Li et al. (2020), using annotation guideline notes, i.e., the guidelines provided to annotators when building the datasets, as references to construct questions.
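The span-matrix head described above can be sketched as a two-layer GELU MLP; the dimensions here are scaled down from the paper's hidden size of 1,536, and the weights are random, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def gelu(x):
    """GELU activation (tanh approximation)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)
                                    * (x + 0.044715 * x ** 3)))

def span_head(pair_feats, W1, b1, W2, b2):
    """Two-layer MLP with GELU: (start, end) pair features -> hidden layer
    (size 1,536 in the paper; smaller here) -> one span logit per pair."""
    return gelu(pair_feats @ W1 + b1) @ W2 + b2

d_pair, d_hidden = 16, 32
feats = rng.standard_normal((10, d_pair))   # 10 candidate (start, end) pairs
logits = span_head(feats,
                   rng.standard_normal((d_pair, d_hidden)),
                   np.zeros(d_hidden),
                   rng.standard_normal((d_hidden, 1)),
                   np.zeros(1))
```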

Results and Analysis
Table 1 shows the experimental results on both the nested NER datasets and the flat NER dataset. From this table, we observe that Multi-NER achieves 85.34% on ACE 2004, 84.25% on ACE 2005, 81.13% on GENIA, and 92.33% on CoNLL-2003, yielding improvements of +1.3%, +0.4%, +1.24%, and +1.25%, respectively, compared with BERT-MRC. The performance improvements of Multi-NER on all datasets indicate that formulating MRC-based NER as a multi-task learning framework to obtain label dependencies between different entity types does indeed improve model performance.
Furthermore, the idea behind our proposed multi-task framework is to use different output layers to disambiguate between entity types and a self-attention module to obtain label dependencies between them. To evaluate the contribution of the different output layers and the self-attention module, we conduct ablation studies on all datasets. The experimental results in Table 2 show that both the different output layers and the self-attention module contribute to Multi-NER. For a randomly selected example from ACE 2004, we show the ground-truth entities and the predicted results of BERT-MRC and Multi-NER in Figure 3. In BERT-MRC, etc is categorized as both ORG and PER, while in Multi-NER, as in the ground truth, etc is categorized as PER. Ambiguous tokens like etc are hard to categorize even with contextual information. However, Multi-NER also considers different entity types and label dependencies, which benefits such ambiguous tokens.
We also show the attention map of the mean scores by entity type for this example in Figure 4. We can see that PER has a relatively large impact on the other entity types, helping the model improve its performance on them with PER information. We attribute this to the label dependencies obtained by sharing information between entity types through the self-attention module.

Related Work
As language models have advanced (Devlin et al., 2019; Raffel et al., 2020; Zhao et al., 2023; Dong et al., 2023; Zhao et al., 2021; Liu et al., 2021), numerous efforts have emerged to enhance the performance of MRC-based NER. Zhang et al. (2022) incorporated different kinds of domain knowledge into the MRC-based NER task to improve model generalization. Liu et al. (2022) proposed to use graph attention networks to capture label dependencies between entity types when applying MRC-based NER to electronic medical records; however, they only use entity type embeddings to build the graph attention networks, ignoring the rich information in the context. MRC-based NER has also been applied to various domains. Du et al. (2022) designed an MRC-based method for medical NER through both sequence labeling and span boundary detection. Zhang and Zhang (2022) applied MRC-based NER to financial named entity recognition from literature. Wang et al. (2020) proposed MRC-based NER with the help of a distilled masked language model in e-commerce. Jia et al. (2022) applied MRC-based methods to multimodal named entity recognition.
Span-based methods (Eberts and Ulges, 2020), which formulate nested NER as a span classification task, are also worth mentioning. Wan et al. (2022) improved span representations using retrieval-based span-level graphs built on n-gram similarity. Yuan et al. (2022) integrated heterogeneous factors such as inside tokens, boundaries, labels, and related spans to improve span representation and classification. Shen et al. (2021) improved span-based NER with a two-stage entity identifier that filters out low-quality spans to reduce computational costs.

Conclusion
In this paper, we propose to incorporate the label dependencies among entity types into a novel multi-task learning framework (termed Multi-NER) for MRC-based NER. A self-attention mechanism is introduced to obtain label dependencies between entity types. Experimental results validate that Multi-NER outperforms BERT-MRC on both nested NER and flat NER. A case study and attention map visualizations show that the introduced self-attention module is able to capture label dependencies among entities, contributing to the performance improvement.

Limitations
One limitation of our proposed Multi-NER is that the number of tasks depends on the number of entity types, because each entity type is treated as a separate task. With the model structure used in Multi-NER, the number of parameters increases by 4M for each additional entity type when Max_Length = 128. One possible solution is to use parameter-efficient fine-tuning methods such as Hypernetworks (Ha et al., 2016) to generate the task-specific output layers. We leave this problem for future work.

Figure 3 :
Figure 3: The ground truth and the predictions of BERT-MRC and Multi-NER for a randomly selected example. The middle of the sentence is omitted for better presentation. The complete example is shown in Appendix A.2.

Table 1 :
Experimental results on nested NER datasets and flat NER dataset.

Table 2 :
Ablation studies evaluate the contribution of components of Multi-NER.