A Unified Generative Framework for Various NER Subtasks

Named Entity Recognition (NER) is the task of identifying spans that represent entities in sentences. Depending on whether the entity spans are nested or discontinuous, the NER task can be categorized into the flat NER, nested NER, and discontinuous NER subtasks. These subtasks have mainly been solved by token-level sequence labelling or span-level classification. However, these solutions can hardly tackle the three kinds of NER subtasks concurrently. To that end, we propose to formulate the NER subtasks as an entity span sequence generation task, which can be solved by a unified sequence-to-sequence (Seq2Seq) framework. Based on our unified framework, we can leverage a pre-trained Seq2Seq model to solve all three kinds of NER subtasks without any special design of the tagging schema or of the way spans are enumerated. We exploit three types of entity representations to linearize entities into a sequence. Our proposed framework is easy to implement and achieves state-of-the-art (SoTA) or near SoTA performance on eight English NER datasets, including two flat NER datasets, three nested NER datasets, and three discontinuous NER datasets.


Introduction
Named entity recognition (NER) has been a fundamental task of Natural Language Processing (NLP), and three kinds of NER subtasks have been recognized in previous work (Sang and Meulder, 2003; Pradhan et al., 2013a; Doddington et al., 2004; Karimi et al., 2015): flat NER, nested NER, and discontinuous NER. As shown in Figure 1, nested NER contains overlapping entities, and an entity in discontinuous NER may contain several nonadjacent spans.

Figure 1: Example entities: "Barack Obama" <Person>, "US" <Location>; S1: "The Lincoln Memorial" <Location>, "Lincoln" <Person>; S2: "muscle pain" <Disorder>, "muscle fatigue" <Disorder>. Panel (c): Transition-based method for discontinuous NER.

The sequence labelling formulation, which assigns a tag to each token in the sentence, has been widely used in the flat NER field (McCallum and Li, 2003; Collobert et al., 2011; Huang et al., 2015; Chiu and Nichols, 2016; Lample et al., 2016; Straková et al., 2019; Li et al., 2020a). Inspired by sequence labelling's success in the flat NER subtask, Metke-Jimenez and Karimi (2016); Muis and Lu (2017) tried to formulate nested and discontinuous NER as a sequence labelling problem. For the nested and discontinuous NER subtasks, instead of assigning labels to each token directly, Xu et al. (2017); Wang and Lu (2019); Yu et al. (2020) tried to enumerate all possible spans and conduct span-level classification. Another way to efficiently represent spans is to use a hypergraph (Lu and Roth, 2015; Katiyar and Cardie, 2018; Wang and Lu, 2018; Muis and Lu, 2016).
Although the sequence labelling formulation has dramatically advanced the NER task, it requires a different tagging schema for each NER subtask. One tagging schema can hardly fit all three NER subtasks (Ratinov and Roth, 2009; Metke-Jimenez and Karimi, 2016; Straková et al., 2019). Span-based models, in turn, need to enumerate all possible spans; the number of spans is quadratic in the length of the sentence, and enumeration is almost impossible in the discontinuous NER scenario. Therefore, span-based methods usually set a maximum span length (Xu et al., 2017; Luan et al., 2019; Wang and Lu, 2018). Although hypergraphs can efficiently represent all spans (Lu and Roth, 2015; Katiyar and Cardie, 2018; Muis and Lu, 2016), they suffer from the spurious structure problem and from structural ambiguity during inference, and their decoding is quite complicated (Muis and Lu, 2017). Because each formulation has its own problems, to the best of our knowledge no publication has tested a single model or framework on all three NER subtasks.
In this paper, we propose using a novel and simple sequence-to-sequence (Seq2Seq) framework with the pointer mechanism (Vinyals et al., 2015) to generate the entity sequence directly. On the source side, the model takes the sentence as input, and on the target side, the model generates the entity pointer index sequence. Since flat, nested, and discontinuous entities can all be represented as entity pointer index sequences, this formulation can tackle all three kinds of NER subtasks in a unified way. Besides, this formulation can even handle crossing-structure entities and multi-type entities. By converting the NER task into a Seq2Seq generation task, we can smoothly use the pre-trained Seq2Seq model BART (Lewis et al., 2020) to enhance our model. To better utilize the pre-trained BART, we propose three kinds of entity representations to linearize entities into entity pointer index sequences.
Our contributions can be summarized as follows:
• We propose a novel and simple generative solution to solve the flat NER, nested NER, and discontinuous NER subtasks in a unified framework, in which NER subtasks are formulated as an entity span sequence generation problem.
• We incorporate the pre-trained Seq2Seq model BART into our framework and exploit three kinds of entity representations to linearize entities into sequences. The results can shed some light on further exploration of BART for entity sequence generation.
• The proposed framework not only avoids the sophisticated design of tagging schema or span enumeration but also achieves SoTA or near SoTA performance on eight popular datasets, including two flat NER datasets, three nested NER datasets, and three discontinuous NER datasets.

NER Subtasks
The term "Named Entity" was coined in the Sixth Message Understanding Conference (MUC-6) (Grishman and Sundheim, 1996). After that, the release of the CoNLL-2003 NER dataset greatly advanced the flat NER subtask (Sang and Meulder, 2003). It was later found that, in the molecular biology domain, some entities could be nested. Karimi et al. (2015) provided a corpus of medical forum posts on patient-reported Adverse Drug Events (ADEs); some entities recognized in this corpus may be discontinuous. Despite the differences between the three kinds of NER subtasks, the methods adopted by previous publications can be roughly divided into three types.
Token-level classification The first line of work views the NER task as a token-level classification task, which assigns to each token a tag that usually comes from the Cartesian product between entity labels and the tag scheme, such as BIO and BILOU (Ratinov and Roth, 2009; Collobert et al., 2011; Huang et al., 2015; Chiu and Nichols, 2016; Lample et al., 2016; Alex et al., 2007; Straková et al., 2019; Metke-Jimenez and Karimi, 2016; Muis and Lu, 2017). Conditional Random Fields (CRF) (Lafferty et al., 2001) or tag sequence generation methods can then be used for decoding. Although the works of Straková et al. (2019), Zhang et al. (2018), and Chen and Moschitti (2018) are similar to our method, they all try to predict a tagging sequence. Therefore, they still need to design tagging schemas for different NER subtasks.
Span-level classification When applying the sequence labelling method to the nested NER and discontinuous NER subtasks, the tagging becomes complex (Straková et al., 2019; Metke-Jimenez and Karimi, 2016) or multi-level (Ju et al., 2018; Fisher and Vlachos, 2019; Shibuya and Hovy, 2020). Therefore, the second line of work directly conducts span-level classification. The main difference between publications in this line of work is how the spans are obtained. Finkel and Manning (2009) Wang et al. (2020a) tried to enumerate all spans. Following Lu and Roth (2015), hypergraph methods, which can effectively represent exponentially many possible nested mentions in a sentence, have been extensively studied in NER tasks (Katiyar and Cardie, 2018; Wang and Lu, 2018; Muis and Lu, 2016).

Sequence-to-Sequence Models
The Seq2Seq framework has long been studied and adopted in NLP (Sutskever et al., 2014; Cho et al., 2014; Luong et al., 2015; Vaswani et al., 2017; Vinyals et al., 2015). Gillick et al. (2016) proposed a Seq2Seq model to predict the entity's start, span length and label for the NER task. Recently, the performance gain achieved by pre-trained models (PTMs) (Peters et al., 2018; Devlin et al., 2019; Dai et al., 2021) has motivated several attempts to pre-train a Seq2Seq model (Song et al., 2019; Lewis et al., 2020; Raffel et al., 2020). We mainly focus on the newly proposed BART (Lewis et al., 2020) model because it can achieve better performance than MASS (Song et al., 2019). In addition, the SentencePiece tokenization used in T5 (Raffel et al., 2020) can produce different tokenizations for the same token, making it hard to generate pointer indexes for entity extraction.
BART is formed by several transformer encoder and decoder layers, like the transformer model used in machine translation (Vaswani et al., 2017). BART's pre-training task is to recover corrupted text into the original text: the encoder takes the corrupted sentence as input, and the decoder recovers the original sentence. BART has base and large versions. The base version has 6 encoder layers and 6 decoder layers, while the large version has 12 of each. Therefore, the number of parameters is similar to that of the equivalently sized BERT.

Proposed Method
In this part, we first introduce the task formulation, then we describe how we use the Seq2Seq model with the pointer mechanism to generate the entity index sequences. After that, we present the detailed formulation of our model with BART.

NER Task
The three kinds of NER tasks can all be formulated as follows. Given an input sentence of $n$ tokens $X = [x_1, x_2, ..., x_n]$, the target sequence is $Y = [s_{11}, e_{11}, ..., s_{1j}, e_{1j}, t_1, ..., s_{i1}, e_{i1}, ..., s_{ik}, e_{ik}, t_i]$, where $s$, $e$ are the start and end index of a span. Since an entity may contain one (for flat and nested NER) or more than one (for discontinuous NER) span, each entity is represented as $[s_{i1}, e_{i1}, ..., s_{ij}, e_{ij}, t_i]$, where $t_i$ is the entity tag index. We use $G = [g_1, ..., g_l]$ to denote the entity tag tokens (such as "Person", "Location", etc.), where $l$ is the number of entity tags. We make $t_i \in (n, n + l]$; the shift by $n$ ensures that $t_i$ cannot be confused with pointer indexes, which are in the range $[1, n]$.
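The formulation above can be illustrated with a small sketch. The entity layout (a list of 1-based inclusive spans plus a 1-based tag index) and the function names `linearize` and `delinearize` are our own illustrative choices, not part of the original method description; the sketch only shows how pointer indexes and shifted tag indexes share one target sequence.

```python
from typing import List, Tuple

# Hypothetical layout: an entity is (spans, tag_index), where spans are
# 1-based inclusive (start, end) pairs and tag_index is 1-based into G.
Entity = Tuple[List[Tuple[int, int]], int]


def linearize(entities: List[Entity], n: int) -> List[int]:
    """Flatten entities into [s, e, ..., s, e, t] blocks, shifting t by n."""
    target: List[int] = []
    for spans, tag_index in entities:
        for start, end in spans:
            target.extend([start, end])
        target.append(n + tag_index)  # tag indexes live in (n, n + l]
    return target


def delinearize(target: List[int], n: int) -> List[Entity]:
    """Invert linearize: any value greater than n closes the current entity."""
    entities: List[Entity] = []
    pointers: List[int] = []
    for y in target:
        if y > n:
            # Keep only well-formed entities (non-empty, start/end pairs).
            if pointers and len(pointers) % 2 == 0:
                spans = list(zip(pointers[0::2], pointers[1::2]))
                entities.append((spans, y - n))
            pointers = []
        else:
            pointers.append(y)
    return entities


# Six tokens, one tag; "muscle pain" is contiguous, "muscle ... fatigue" is not.
tokens = ["have", "muscle", "pain", "and", "fatigue", "."]
gold = [([(2, 3)], 1), ([(2, 2), (5, 5)], 1)]
target = linearize(gold, n=len(tokens))
```

With six tokens and a single tag, the tag index becomes 7, so the same target sequence can hold a contiguous entity and a discontinuous one without ambiguity.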

Seq2Seq for Unified Decoding
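Since we formulate the NER task in a generative way, we can view the NER task as the following equation:
$$P(Y \mid X) = \prod_{t=1}^{m} P(y_t \mid X, Y_{<t}),$$
where $m$ is the length of the target sequence $Y$ and $y_0$ is the special "start of sentence" control token.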
Figure 2: Model structure used in our method. The encoder encodes input sentences, and the decoder uses the pointer mechanism to generate indexes autoregressively. "<s>" and "</s>" are the predefined start-of-sentence and end-of-sentence tokens in BART. In the output sequence, "7" means the entity tag "<dis>", and other numbers indicate the pointer index (in range [1, 6]).

We use the Seq2Seq framework with the pointer mechanism to tackle this task. Therefore, our model consists of two components:
(1) Encoder encodes the input sentence $X$ into vectors $H^e$, which is formulated as follows:
$$H^e = \mathrm{Encoder}(X),$$
where $H^e \in \mathbb{R}^{n \times d}$, and $d$ is the hidden dimension.
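(2) Decoder is to get the index probability distribution for each step $P_t = P(y_t \mid X, Y_{<t})$. However, since $Y_{<t}$ contains the pointer and tag indexes, it cannot be directly inputted to the Decoder. We use the Index2Token conversion to convert indexes into tokens:
$$\hat{y}_t = \begin{cases} X_{y_t}, & \text{if } y_t \text{ is a pointer index}, \\ G_{y_t - n}, & \text{if } y_t \text{ is a tag index}. \end{cases}$$
After converting each $y_t$ this way, we can get the last hidden state $h^d_t \in \mathbb{R}^d$ with $\hat{Y}_{<t} = [\hat{y}_1, ..., \hat{y}_{t-1}]$ as follows:
$$h^d_t = \mathrm{Decoder}(H^e; \hat{Y}_{<t}).$$
Then, we can use the following equations to obtain the index probability distribution $P_t$:
$$E^e = \mathrm{TokenEmbed}(X),$$
$$\hat{H}^e = \mathrm{MLP}(H^e),$$
$$\bar{H}^e = \alpha \cdot \hat{H}^e + (1 - \alpha) \cdot E^e,$$
$$G^d = \mathrm{TokenEmbed}(G),$$
$$P_t = \mathrm{Softmax}([\bar{H}^e \otimes h^d_t ; G^d \otimes h^d_t]),$$
where TokenEmbed is the embedding shared between the Encoder and Decoder; $E^e, \hat{H}^e, \bar{H}^e \in \mathbb{R}^{n \times d}$; $\alpha \in \mathbb{R}$ is a hyper-parameter; $G^d \in \mathbb{R}^{l \times d}$; $[\,\cdot\,; \cdot\,]$ means concatenation in the first dimension; $\otimes$ means the dot product.

During the training phase, we use the negative log-likelihood loss and the teacher forcing method. During inference, we generate the target sequence autoregressively. We use the decoding algorithm presented in Algorithm 1 to convert the index sequence into entity spans.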
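The pointer step described above can be sketched in NumPy. This is only an illustration under stated assumptions: random arrays stand in for BART's encoder and decoder states, a single linear layer (`W`, `b`) stands in for the projection of the encoder states, and the way `alpha` mixes the projected states with the token embeddings follows our reading of the formulation. The names `pointer_distribution` and `index_to_token` are hypothetical.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()


def index_to_token(y: int, X: list, G: list):
    """Index2Token: pointer indexes (1..n) map to X, tag indexes (n+1..n+l) to G."""
    n = len(X)
    return X[y - 1] if y <= n else G[y - n - 1]


def pointer_distribution(H_e, E_e, G_d, h_t, W, b, alpha=0.5):
    """Return P_t over n pointer positions followed by l tag indexes.

    H_e: (n, d) encoder outputs; E_e: (n, d) token embeddings of X;
    G_d: (l, d) token embeddings of the tags; h_t: (d,) decoder state.
    W, b: one linear layer used here in place of an MLP (assumption).
    """
    H_hat = H_e @ W + b                              # projected encoder states
    H_bar = alpha * H_hat + (1.0 - alpha) * E_e      # mix with token embeddings
    scores = np.concatenate([H_bar @ h_t, G_d @ h_t])  # dot products, shape (n + l,)
    return softmax(scores)


rng = np.random.default_rng(0)
n, l, d = 6, 1, 4
P_t = pointer_distribution(
    rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=(l, d)),
    rng.normal(size=d), rng.normal(size=(d, d)), np.zeros(d),
)
next_index = int(P_t.argmax()) + 1  # back to 1-based indexes
```

Greedy decoding would repeat this step, feeding `index_to_token(next_index, X, G)` back to the decoder until the end-of-sentence token is produced.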

Detailed Entity Representation with BART
Figure 3: Three entity representations. Words in the boxes are entity words, words within the same color box belong to the same entity, and their corresponding entity representation is also shown in the same color. There are three entities, (x_1, x_3, PER), (x_1, x_2, x_3, x_4, LOC), (x_4, FAC), where LOC, PER, FAC are their corresponding entity tags. The underlined position index means this is the starting BPE of a word.

Since our model is a Seq2Seq model, it is natural to utilize the pre-trained Seq2Seq model BART to enhance our model. We present a visualization of our model based on BART in Figure 2. However, BART's adoption is non-trivial because the Byte-Pair-Encoding (BPE) tokenization used in BART might tokenize one token into several BPEs. To explore how to use BART efficiently, we propose three kinds of pointer-based entity representations to locate entities in the original sentence unambiguously. The three entity representations are as follows:
Span The position index of the first BPE of the starting entity word and the last BPE of the ending entity word. If this entity includes multiple discontinuous spans of words, each span is represented in the same way.
BPE The position indexes of all BPEs of the entity words.
Word Only the position index of the first BPE of each entity word is used.
In all cases, we append the entity tag to the entity representation. An example of the entity representations is presented in Figure 3. If a word does not belong to any entity, it will not appear in the target sequence. If a whole sentence has no entity, the prediction should be an empty sequence (containing only the "start of sentence" (<s>) token and the "end of sentence" (</s>) token).
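To make the three representations concrete, the following sketch derives all of them from a word-to-BPE segmentation. It assumes 1-based BPE positions that ignore special tokens, and entities given as 1-based inclusive word spans; the function name `entity_representations` and its arguments are hypothetical.

```python
from typing import Dict, List, Tuple


def entity_representations(
    word_bpe_lens: List[int],
    entity_word_spans: List[Tuple[int, int]],
    tag_index: int,
) -> Dict[str, List[int]]:
    """Build the Span, BPE and Word pointer sequences for one entity.

    word_bpe_lens[i] is the number of BPE pieces of word i + 1;
    entity_word_spans holds 1-based inclusive (start_word, end_word) pairs.
    """
    # 1-based position of the first BPE of every word.
    first_bpe, position = [], 1
    for length in word_bpe_lens:
        first_bpe.append(position)
        position += length
    n = position - 1  # total number of BPEs in the sentence

    span_rep, bpe_rep, word_rep = [], [], []
    for start_word, end_word in entity_word_spans:
        first = first_bpe[start_word - 1]
        last = first_bpe[end_word - 1] + word_bpe_lens[end_word - 1] - 1
        span_rep.extend([first, last])            # span boundaries only
        bpe_rep.extend(range(first, last + 1))    # every BPE of the span
        word_rep.extend(first_bpe[w - 1] for w in range(start_word, end_word + 1))

    tag = n + tag_index  # the entity tag is appended in all three cases
    return {"Span": span_rep + [tag], "BPE": bpe_rep + [tag], "Word": word_rep + [tag]}


# Two words split into 4 and 2 BPE pieces, forming a single entity.
reps = entity_representations([4, 2], [(1, 2)], tag_index=1)
```

For this two-word entity the "Span" sequence is the shortest, while the "Word" sequence skips over BPE positions, so it is no longer a run of consecutive indexes.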

Datasets
To show that our proposed method can be used in various NER subtasks, we conducted experiments on eight datasets.
Flat NER Datasets We adopt the CoNLL-2003 (Sang and Meulder, 2003) and OntoNotes datasets.

Nested NER Datasets We conduct experiments on ACE 2004 (Doddington et al., 2004), ACE 2005 (Walker and Consortium, 2005), and the Genia corpus. For ACE2004 and ACE2005, we use the same data split as Lu and Roth (2015); Muis and Lu (2017).

Table 2: Results for nested NER datasets. "†" means our rerun of their code. "‡" means our reproduction with only sentence-level context. For a fair comparison, we only present results with the BERT-Large model.

Experiment Setup
For all experiments, we use the BART-Large model, whose encoder and decoder each have 12 layers, giving it the same number of transformer layers as the BERT-Large and RoBERTa-Large models. We do not use any other embeddings, and the BART model is fine-tuned during optimization. More detailed experimental settings are given in the Supplementary Material. We report the span-level F1.
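As a reference point for the metric, here is a minimal sketch of a span-level F1. It assumes exact-match, micro-averaged scoring in which an entity counts as correct only when its sentence, all of its spans, and its label match; the exact evaluation script used for the reported numbers is not specified here, and the name `span_f1` is hypothetical.

```python
from typing import Hashable, Iterable, Tuple


def span_f1(gold: Iterable[Hashable], pred: Iterable[Hashable]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact matching of entity tuples."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if true_positives == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Entities as (sentence_id, spans, label); spans are tuples of (start, end).
gold_entities = [(0, ((2, 3),), "dis"), (0, ((2, 2), (5, 5)), "dis")]
pred_entities = [(0, ((2, 3),), "dis"), (0, ((5, 5),), "dis")]
scores = span_f1(gold_entities, pred_entities)
```

In this toy case the partially recovered discontinuous entity earns no credit, which is what makes the metric span-level rather than token-level.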

Results on Flat NER
Results are shown in Table 1. We do not compare with Yamada et al. (2020) since they added entity information during the pre-training process. On both datasets, our method achieves better performance. We will discuss the performance difference between our three entity representations in Section 5.4.

On the nested NER datasets (Table 2), our generative models are comparable to the token-level classification (Straková et al., 2019; Shibuya and Hovy, 2020) and span-level classification (Luan et al., 2019; Wang et al., 2020a) models.

Table 3 shows the comparison between our model and other models on the three discontinuous NER datasets. Although previous work tried to utilize BERT to enhance model performance, it found that ELMo worked better. On all three datasets, our model achieves better performance.

Comparison Between Different Entity Representations
In this part, we discuss the performance difference between the three entity representations. The "Word" entity representation achieves better performance on almost all datasets, while the comparison between the "Span" and "BPE" representations is more involved. To investigate the reason behind these results, we calculate the average and median length of entities under each entity representation; the results are presented in Table 4. For a generative framework, a shorter entity representation should generally be easier to generate and therefore perform better. Accordingly, as shown in Table 4, the "Word" representation, which has a smaller average entity length in CoNLL2003, OntoNotes, CADEC, and ShARe13, achieves better performance on these datasets. However, although the average entity length of the "BPE" representation is longer than that of the "Span" representation, it achieves better performance on CoNLL2003, OntoNotes, ACE2004, and ACE2005. This is because the "BPE" representation is more similar to the pre-training task, namely, predicting continuous BPEs. We believe this task similarity is also the reason why the "Word" representation achieves better performance than the "Span" representation on ACE2004, ACE2005, and ShARe14, although the former has a longer entity length: most words are tokenized into a single BPE, so the "Word" representation is still mostly continuous. A clear outlier is the Genia dataset, where the "Span" representation achieves better performance than the other two. We presume this is because, in this dataset, a word tends to be tokenized into a longer BPE sequence (this can be inferred from the large entity length gap between the "Word" and "BPE" representations), so that the "Word" representation also becomes dissimilar to the pre-training task.
For example, the protein "lipoxygenase isoforms" will be tokenized into the sequence "['Ġlip', 'oxy', 'gen', 'ase', 'Ġiso', 'forms']", which makes the target sequence of the "Word" representation "['Ġlip', 'Ġiso']", resulting in a discontiguous BPE sequence. Therefore, the shorter "Span" representation achieves better performance on this dataset.

Table 5: Different invalid prediction probabilities for the "Word" entity representation. E_1 means the predicted indexes contain an index which is not the start index of a word, E_2 means the predicted indexes within an entity are not increasing, E_3 means duplicated entity prediction.
0.05% 0.02% 0.30% 0.26% 0.06% 0.0% 0.08% 0.02%

Recall of Discontinuous Entities
Since only about 10% of entities in the discontinuous NER datasets are discontinuous, evaluating on the whole dataset alone may not show whether our model can recognize discontinuous entities. Therefore, following Muis and Lu (2016), we report our model's performance on the discontinuous entities in Table 6. As shown in Table 6, our model can predict discontinuous named entities and achieves better performance.

Invalid Prediction
In this part, we mainly focus on the analysis of the "Word" representation since it generally achieves better performance. We do not restrict the output distribution; therefore, the output may contain invalid predictions, as shown in Table 5. The table shows that the BART model learns the prediction representation quite well, since in most cases fewer than 1% of predictions are invalid. We exclude all these invalid predictions during evaluation.
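The three kinds of invalid predictions listed for the "Word" representation (an index that is not a word start, non-increasing indexes within an entity, and duplicated entities) can be filtered with a few lines of code. The sketch below is one possible implementation, not the evaluation code used for Table 5; the function name, the input layout, and the choice to count each entity under the first failing check are assumptions.

```python
from typing import Dict, List, Set, Tuple


def filter_invalid(
    predictions: List[Tuple[List[int], int]],
    word_starts: Set[int],
) -> Tuple[List[Tuple[Tuple[int, ...], int]], Dict[str, int]]:
    """Drop invalid "Word"-style predictions and count the error types.

    predictions: (pointer_indexes, tag) pairs in generation order.
    word_starts: positions that are the first BPE of some word.
    """
    counts = {"E1": 0, "E2": 0, "E3": 0}
    seen = set()
    valid = []
    for pointers, tag in predictions:
        key = (tuple(pointers), tag)
        if any(p not in word_starts for p in pointers):
            counts["E1"] += 1  # index is not the start of a word
        elif any(b <= a for a, b in zip(pointers, pointers[1:])):
            counts["E2"] += 1  # indexes within the entity are not increasing
        elif key in seen:
            counts["E3"] += 1  # the same entity was already predicted
        else:
            seen.add(key)
            valid.append(key)
    return valid, counts


predicted = [([1, 5], 8), ([2, 5], 8), ([5, 1], 8), ([1, 5], 8)]
kept, errors = filter_invalid(predicted, word_starts={1, 5, 7})
```

Only the first prediction survives here; the other three each trigger one of the error types.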

Entity Order Vs. Entity Recall
The entity order in the target sequence is determined by the order in which entities appear in the sentence, and we want to study whether entities that appear later in the target sequence have worse recall than entities that appear earlier. The results are provided in Figure 4. For flat NER and discontinuous NER, the later an entity appears, the higher the probability that it is recalled. For nested NER, the recall curve is more involved. We assume this is because, in the flat NER and discontinuous NER datasets (where more than 91.1% of entities are continuous), different entities depend less on each other. In the nested NER datasets, however, an entity in a later position may be the outermost entity that contains earlier entities, so a wrong prediction of an earlier entity may negatively influence later entities.

Conclusion
In this paper, we formulate NER subtasks as an entity span sequence generation problem, so that we can use a unified Seq2Seq model with the pointer mechanism to tackle flat, nested, and discontinuous NER subtasks. The Seq2Seq formulation enables us to smoothly incorporate the pre-trained Seq2Seq model BART to enhance the performance.
To better utilize BART, we test three types of entity representation methods to linearize the entity span into sequences. Results show that the entity representation with a shorter length and more similar to continuous BPE sequences achieves better performance. Our proposed method achieves SoTA or near SoTA performance for eight different NER datasets, proving its generality to various NER subtasks.

Acknowledgements
We would like to thank the anonymous reviewers for their insightful comments. The discussion with colleagues in AWS Shanghai AI Lab was quite fruitful. We also thank the developers of fastNLP and fitlog. We thank Juntao Yu for helpful discussion about dataset processing.

Ethical Considerations
To address ethical concerns, we provide the following details:
(1) All of the experiments are conducted on existing datasets, which are derived from public scientific papers.
(2) We describe the characteristics of the datasets in a specific section. Our analysis is consistent with the results.
(3) Our work does not contain identity characteristics. It does not harm anyone.
(4) Our experiments do not need a lot of computing resources compared with the pre-training of such models.

A.1 Hyper-parameters
The detailed hyper-parameters used for the different datasets are listed in Table 7.

A.2 Beam Search
Since our framework is based on generation, we want to study whether using beam search will increase the performance. The results are depicted in Figure 5, which shows that beam search has almost no effect on the model performance. The small effect on the F1 value might be caused by the small search space during generation.

A.3 Efficiency Metrics
In this section, we compare the memory footprint, training time, and inference time of our proposed model and BERT-based models. The experiments are conducted on the flat NER datasets CoNLL-2003 (Sang and Meulder, 2003) and OntoNotes (Pradhan et al., 2012). We use the BERT-MLP and BERT-CRF models as our baseline models; both are sequence labelling based models. For an input sentence $X = [x_1, ..., x_n]$, both models use BERT (Devlin et al., 2019) to encode $X$ as follows
$$H = \mathrm{BERT}(X),$$
where $H \in \mathbb{R}^{n \times d}$ and $d$ is the hidden state dimension. The BERT-MLP model then decodes the tags as follows
$$F = \mathrm{Softmax}(\max(H W_b + b_b, 0) W_a + b_a),$$
where $W_a \in \mathbb{R}^{d \times |T|}$, $|T|$ is the number of tags, $b_a \in \mathbb{R}^{|T|}$, $W_b \in \mathbb{R}^{d \times d}$, $b_b \in \mathbb{R}^{d}$, and $F \in \mathbb{R}^{n \times |T|}$ is the tag probability distribution. We then use the negative log-likelihood loss. During inference, for each token, the tag index with the largest probability is taken as the prediction.

For the BERT-CRF model, we use conditional random fields (CRF) (Lafferty et al., 2001) to decode tags. We assume the gold label sequence is $Y = [y_1, ..., y_n]$, then we use the following equations to get the probability of $Y$
$$M = \max(H W_b + b_b, 0) W_a + b_a,$$
$$P(Y \mid X) = \frac{\exp\left(\sum_{i=1}^{n} M[i, y_i] + \sum_{i=2}^{n} T[y_{i-1}, y_i]\right)}{\sum_{Y' \in \mathcal{Y}(s)} \exp\left(\sum_{i=1}^{n} M[i, y'_i] + \sum_{i=2}^{n} T[y'_{i-1}, y'_i]\right)},$$
where $M \in \mathbb{R}^{n \times |T|}$, $\mathcal{Y}(s)$ is the set of all valid label sequences, and $T \in \mathbb{R}^{|T| \times |T|}$ is the transition matrix, in which entry $(i, j)$ is the transition score from tag $i$ to tag $j$. After getting $P(Y \mid X)$, we use the negative log-likelihood loss to optimize the model. During inference, the Viterbi algorithm is used to find the label sequence that achieves the highest score.

We use the BERT-base and BART-base versions to measure the memory footprint during training, the seconds needed to iterate one epoch (one epoch means iterating over all training samples), and the seconds needed to evaluate the development set. The batch size is 16 for training and 48 for evaluation. The comparison is presented in Table 8.
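The Viterbi step of the BERT-CRF baseline can be sketched as follows. This is a generic linear-chain decoder over an emission matrix `M` of shape (n, |T|) and a transition matrix `T` of shape (|T|, |T|); it assumes no special start or end transitions and does not restrict the search to valid label sequences, both of which a full implementation might add.

```python
import numpy as np


def viterbi(M: np.ndarray, T: np.ndarray):
    """Return the highest-scoring tag path and its score.

    M[i, c] is the emission score of tag c at position i;
    T[p, c] is the transition score from tag p to tag c.
    """
    n, num_tags = M.shape
    score = M[0].copy()                        # best score ending in each tag
    back = np.zeros((n, num_tags), dtype=int)  # best previous tag per position
    for i in range(1, n):
        # cand[p, c]: best path ending in p, then moving to c at position i
        cand = score[:, None] + T + M[i][None, :]
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return path[::-1], float(score.max())


emissions = np.array([[2.0, 0.0], [0.0, 1.0], [3.0, 0.0]])
free_path, free_score = viterbi(emissions, np.zeros((2, 2)))
# Penalising the transition 0 -> 1 changes the best path.
penalised_path, penalised_score = viterbi(emissions, np.array([[0.0, -5.0], [0.0, 0.0]]))
```

The loop over positions is inherently sequential, which is the computation the efficiency comparison refers to for the CRF baseline.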
During the training phase, we can use the causal mask to train our model in parallel. Therefore, our proposed model can train faster than the BERT-CRF model, which needs sequential computation. During the evaluation phase, however, we have to generate tokens autoregressively, which makes inference slow. Therefore, future work could study non-autoregressive methods to speed up decoding (Gu et al., 2018).
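The role of the mask during training can be illustrated with a small NumPy sketch. It shows only the masking of attention scores so that each target position attends to itself and earlier positions; the function names are hypothetical and the scores are random placeholders rather than real decoder activations.

```python
import numpy as np


def causal_mask(m: int) -> np.ndarray:
    """mask[i, j] is True when target position i may attend to position j <= i."""
    return np.tril(np.ones((m, m), dtype=bool))


def masked_attention_weights(scores: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Row-wise softmax over the scores after hiding future positions."""
    hidden = np.where(mask, scores, -np.inf)
    hidden = hidden - hidden.max(axis=-1, keepdims=True)  # diagonal keeps max finite
    weights = np.exp(hidden)
    return weights / weights.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
mask = causal_mask(4)
weights = masked_attention_weights(rng.normal(size=(4, 4)), mask)
```

Because every row is computed at once under teacher forcing, all target positions are trained in one pass, whereas inference must produce one index at a time.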