Discontinuous Named Entity Recognition as Maximal Clique Discovery

Named entity recognition (NER) remains challenging when entity mentions can be discontinuous. Existing methods break the recognition process into several sequential steps. In training, they predict conditioned on the gold intermediate results, while at inference they rely on the model output of the previous steps, which introduces exposure bias. To solve this problem, we first construct a segment graph for each sentence, in which each node denotes a segment (a continuous entity on its own, or a part of a discontinuous entity), and an edge links two nodes that belong to the same entity. The nodes and edges can be generated separately in one stage with a grid tagging scheme and learned jointly using a novel architecture named Mac. Discontinuous NER can then be reformulated as a non-parametric process of discovering maximal cliques in the graph and concatenating the spans in each clique. Experiments on three benchmarks show that our method outperforms the state-of-the-art (SOTA) results, with up to 3.5 percentage points improvement on F1, and achieves a 5x speedup over the SOTA model.


Introduction
Named Entity Recognition (NER) is the task of detecting mentions of real-world entities in text and classifying them into predefined types. NER benefits many natural language processing applications, such as information retrieval (Berger and Lafferty, 2017), relation extraction, and question answering (Khalid et al., 2008).
NER methods have been extensively investigated, and researchers have proposed effective ones. Most prior approaches (Huang et al., 2015; Chiu and Nichols, 2016; Gridach, 2017; Zhang and Yang, 2018; Gui et al., 2019; Xue et al., 2020) cast this task as a sequence labeling problem where each token is assigned a label that represents its entity type. Their underlying assumption is that an entity mention should be a short span of text (Muis and Lu, 2016) and that mentions should not overlap with each other. While this assumption is valid in most cases, it does not always hold, especially in clinical corpora (Pradhan et al., 2015). For example, Figure 1 shows two discontinuous entity mentions with overlapping segments. Thus, there is a need to move beyond continuous entities and devise methods to extract discontinuous ones.

Figure 1: An example involving discontinuous mentions, based on the text "productive cough with white or bloody sputum" with two entities, E1 and E2. Entities are highlighted with colored underlines.

* The two authors contribute equally. † Corresponding author.
1 The source code of this paper is provided in supplementary material and also available at https://github.com/XXX
Towards this goal, current state-of-the-art (SOTA) models can be categorized into two classes: combination-based and transition-based. Combination-based models first detect all the overlapping spans and then learn to combine these segments with a separate classifier (Wang and Lu, 2019); transition-based models incrementally label the discontinuous spans through a sequence of shift-reduce actions (Dai et al., 2020b). Although these methods have achieved reasonable performance, they share the same problem: exposure bias. Specifically, combination-based methods use the gold segments to guide the classifier during training, while at inference the input segments are given by a trained model, leading to a gap between training and inference (Wang and Lu, 2019). For transition-based models, at training time the current action relies on the gold previous actions, while in the testing phase the entire action sequence is generated by the model (Wang et al., 2017). As a result, a skewed prediction further distorts the predictions of the follow-up actions, and the accumulated discrepancy may hurt performance.
To overcome the limitations of such prior work, we propose Mac, a Maximal clique discovery based discontinuous NER model. The core insight behind Mac is that all (potentially discontinuous) entity mentions in a sentence naturally form a segment graph if we interpret their contained continuous segments as nodes and connect segments of the same entity to each other with edges. The discontinuous NER task is then equivalent to finding the maximal cliques in the graph, which is a well-studied problem in graph theory. The remaining question is how to construct such a segment graph. In Mac, we decompose it into two uncoupled subtasks, segment extraction (SE) and edge prediction (EP). Given an n-token sentence, two n × n tag tables are formed for SE and EP respectively, where each entry captures the interaction between two individual tokens. SE is then regarded as a labeling problem where tags are assigned to distinguish the boundary tokens of each segment, which helps in identifying overlapping segments. EP is cast as the problem of aligning the boundary tokens of segments contained in the same entity. Overall, the tag tables of SE and EP are generated independently and are consumed together by a maximal clique search algorithm to recover the desired entities, so the model is immune to the exposure bias problem.
We conducted experiments on three standard discontinuous NER benchmarks. Experiments show that Mac can effectively recognize discontinuous entity mentions without sacrificing accuracy on continuous mentions. This leads to a new state-of-the-art (SOTA) on this task, with substantial gains of up to 3.5 absolute percentage points over the previous best reported result. Lastly, runtime experiments in GPU environments show that Mac is about five times faster than the SOTA model.

Related Work
Our work is inspired by three lines of research: discontinuous NER, joint extraction, and maximal clique discovery.
Discontinuous NER requires identifying all entity mentions that have discontinuous structures. To this end, several researchers introduced new position indicators into the traditional BIO tagging scheme so that sequence labeling models can be employed (Tang et al., 2013; Metke-Jimenez and Karimi, 2016; Dai et al., 2017; Tang et al., 2018). However, these models suffer from the label ambiguity problem due to the limited flexibility of the extended tag set. As an improvement, Muis and Lu (2016) used hyper-graphs to represent entity spans and their combinations, but did not completely resolve the ambiguity issue (Dai et al., 2020b). Wang and Lu (2019) presented a pipeline framework which first detects all the candidate spans of entities and then merges them into entities. By decomposing the task into two interdependent steps, this approach avoids the ambiguity issue but is susceptible to exposure bias. Recently, Dai et al. (2020b) constructed a transition action sequence for recognizing discontinuous and overlapping structures. At training time, it predicts conditioned on the ground-truth previous actions, while at inference it has to select the current action based on the results of previous steps, leading to exposure bias. In this paper, we propose for the first time a one-stage method that addresses discontinuous NER without suffering from the ambiguity issue, realizing the consistency of training and inference.
Joint extraction aims to detect entity pairs along with their relations using a single model. Discontinuous NER is related to joint extraction in that the discontinuous entities can be viewed as relation links between segments (Wang and Lu, 2019). Our model is motivated by TPLinker, which formulates joint extraction as a token pair linking problem by aligning the boundary tokens of entity pairs. The main differences between our model and TPLinker are two-fold: (1) we propose a tailored tagging scheme for recognizing discontinuous segments; (2) the maximal clique discovery algorithm is introduced into our model to accurately merge the discontinuous segments.
Maximal clique discovery aims to find the cliques in a given graph that cannot be extended by any further vertex (Dutta and Lauri, 2019). Here, a clique is a subset of the vertices all of which are pairwise adjacent. Maximal clique discovery finds extensive application across diverse domains (Stix, 2004; Boginski et al., 2005; Imbiriba et al., 2017). In this paper, we reformulate discontinuous NER as the task of maximal clique discovery by constructing a segment graph and leveraging the classic B-K backtracking algorithm (Bron and Kerbosch, 1973) to find all the maximal cliques as the entities.

Methodology
In graph theory, a clique is a vertex subset of an undirected graph in which every two vertices are adjacent, while a maximal clique is one that cannot be extended by including one more adjacent vertex. That is, the vertices in a maximal clique are all closely related to each other, and no other vertex can be added, which is similar to the relations between segments in a discontinuous entity. Based on this insight, we claim that discontinuous NER can be equivalently interpreted as discovering maximal cliques in a segment graph, where nodes represent segments that either form entities on their own or are parts of a discontinuous entity, and edges connect segments that belong to the same entity mention. Since the maximal clique search process is usually non-parametric (Bron and Kerbosch, 1973), discontinuous NER is in effect decomposed into two subtasks, segment extraction and edge prediction, which respectively create the nodes and edges of the segment graph. Their prediction results can be generated independently with our proposed grid tagging scheme and are consumed together to construct a segment graph, so that the maximal clique discovery algorithm can be applied to recover the desired entities. The overall extraction process is depicted in Figure 2. Next, we first introduce our grid tagging scheme and its decoding workflow. Then we detail Mac, a Maximal clique discovery based discontinuous NER model built on this tagging scheme.

Grid Tagging Scheme
Inspired by previous work, we implement single-stage segment extraction and edge prediction based on a novel grid tagging scheme. Given an n-token sentence, our scheme constructs an n × n tag table by enumerating all possible token pairs and giving each token pair the tag(s) based on their relation(s). Note that one token pair may have multiple tags according to the pre-defined tag set.

Segment Extraction
As demonstrated in Figure 1, entity mentions can overlap with each other. To make our model capable of extracting such overlapping segments, we construct a two-dimensional tag table. Figure 3 provides an example. A pair of tokens $(t_i, t_j)$ is assigned a set of labels if a segment from $t_i$ to $t_j$ belongs to the corresponding categories. Since $j \geq i$, we discard the lower triangle region of the tag table, so $\frac{n^2+n}{2}$ grids are actually generated for an n-token sentence. In practice, the BIS tagging scheme is adopted to represent whether a segment is a continuous entity mention (X-S) or is located at the beginning (X-B) or inside (X-I) of a discontinuous entity of type X. For example, (upper, body) is assigned the tag POB-S since "upper body" is a continuous entity of type Part of Body (POB). The tag of (Sever, joint) is ADE-B, as "Sever joint" is the beginning segment of the discontinuous mention "Sever joint pain" of type Adverse Drug Event (ADE). Meanwhile, "joint" is also recognized as an entity since there is a POB-S tag at (joint, joint), so the overlapping segment extraction problem is solved.
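The upper-triangular tag table described above can be sketched in a few lines of Python. This is only an illustration: the function name `build_segment_table`, the tokenization (punctuation as separate tokens), and the zero-based indices are assumptions, and only the three tags spelled out in the text are entered, whereas the full Figure 3 contains more.

```python
from collections import defaultdict


def build_segment_table(tokens, segments):
    """Map each cell (i, j) with i <= j to the tag set of segment tokens[i..j].

    Hypothetical helper: `segments` is a list of (start, end, tag) triples
    with inclusive, zero-based token indices.
    """
    table = defaultdict(set)
    for start, end, tag in segments:
        if not 0 <= start <= end < len(tokens):
            raise ValueError(f"invalid segment span: {(start, end)}")
        table[(start, end)].add(tag)
    return table


# Assumed tokenization of the running example sentence.
tokens = ["Sever", "joint", ",", "shoulder", "and", "upper", "body", "pain", "."]

# Only the tags mentioned in the text are listed here.
segments = [
    (0, 1, "ADE-B"),  # "Sever joint": beginning segment of a discontinuous ADE
    (1, 1, "POB-S"),  # "joint": continuous Part of Body mention
    (5, 6, "POB-S"),  # "upper body": continuous Part of Body mention
]

table = build_segment_table(tokens, segments)

# Only the upper triangle (j >= i) is kept, i.e. (n^2 + n) / 2 cells.
n = len(tokens)
num_cells = sum(1 for i in range(n) for j in range(i, n))
```

Because each cell holds a set, the overlapping segments "Sever joint" and "joint" simply occupy two different cells, and a single cell could carry several tags at once.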

Edge Prediction
Edge prediction constructs the links between segments of the same entity mention by aligning their boundary tokens. The tagging scheme is defined as follows: (1) head to head (X-H2H) is placed at $(t_i, t_j)$ where $t_i$ and $t_j$ are respectively the beginning tokens of two segments which constitute the same entity of type X; (2) tail to tail (X-T2T) is similar to X-H2H, but focuses on the ending tokens. As shown in Figure 4, "Sever" has the ADE-H2H and ADE-T2T relations to "shoulder" and "pain", because the type of the discontinuous entity mention "Sever shoulder pain" is Adverse Drug Event. The same logic goes for the other tags in the matrix.
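As a rough illustration of how gold edge tags could be derived from one entity, the sketch below links every pair of its segments with H2H and T2T tags. The helper name `edge_tags`, the zero-based inclusive indices, and the choice to store each pair only once (earlier segment first) are assumptions made for this example, not details confirmed by the text.

```python
from itertools import combinations


def edge_tags(entity_segments, entity_type):
    """Return {(i, j): {tags}} linking every pair of segments of one entity.

    Hypothetical helper: `entity_segments` holds inclusive (start, end)
    token indices of the segments that make up a single entity mention.
    """
    tags = {}
    for (s1, e1), (s2, e2) in combinations(sorted(entity_segments), 2):
        # Align the beginning tokens of the two segments.
        tags.setdefault((s1, s2), set()).add(f"{entity_type}-H2H")
        # Align the ending tokens of the two segments.
        tags.setdefault((e1, e2), set()).add(f"{entity_type}-T2T")
    return tags


# "Sever shoulder pain" in the assumed tokenization
# ["Sever", "joint", ",", "shoulder", "and", "upper", "body", "pain", "."]
sever_shoulder_pain = [(0, 0), (3, 3), (7, 7)]
tags = edge_tags(sever_shoulder_pain, "ADE")
```

Since all three segments are single tokens, the H2H and T2T tags of each pair fall into the same cell, which matches the statement that "Sever" carries both relations to "shoulder" and to "pain".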

Decoding Workflow
Formally, the decoding procedure is summarized in Algorithm 1. The segment tagging table S and edge tagging table E of a sentence T serve as the inputs. First, we extract all the typed segments by decoding S. Then we construct a segment graph G, in which segments that belong to the same entity (decoded from E) have edges with each other. Figure 2 gives an example. Correspondingly, we can obtain a continuous entity mention directly from a single-vertex clique, and concatenate the segments in each multiple-vertex clique following their original sequential order in T to recover discontinuous entity mentions. We choose the classic B-K backtracking algorithm (Bron and Kerbosch, 1973) for finding the maximal cliques in G, which takes $O(3^{m/3})$ time in the worst case, where m is the number of nodes.

Model Structure
With the grid tagging scheme, we propose an end-to-end neural architecture named Mac. Figure 5 shows the overall structure.

Algorithm 1 Decoding Procedure
Input: The segment tagging results S and edge tagging results E of sentence T. S(t_i, t_j) and E(t_i, t_j) respectively denote the tag set of token pair (t_i, t_j) in the two schemes.
Output: R = {(e_k, t_k)}, k = 1, ..., m, where e_k and t_k are respectively the text and the type of the k-th entity.
1: Initialize the edge set A and entity set R with ∅
2: Obtain the segment set N by decoding S
3: for segment s ∈ N do
4:     for segment g ∈ N do
5:         Define type ← the entity type of s or g
6:         if type-H2H ∈ E(s.start, g.start) & type-T2T ∈ E(s.end, g.end) then
7:             Add (s, g) to A
8:         end if
9:     end for
10: end for
11: Construct the segment graph G based on N and A
12: Find the maximal cliques C in G with the B-K algorithm
13: for clique c ∈ C do
14:     Define t ← the entity type of a random segment in c
15:     Concat the segments of c with their order in T as e
16:     Add (e, t) to R
17: end for
18: return R

Algorithm 2 B-K Backtracking Algorithm
Input: The graph G
Output: The set of all maximal cliques C
1: Initialize C and two vertex sets R, X with ∅
2: Define P ← the node set of G
3: function BRONKER(R, P, X)
4:     if P = ∅ and X = ∅ then
5:         Add R to C
6:     end if
7:     for v ∈ P do
8:         Define N(v) ← the neighbor set of v
9:         BRONKER(R ∪ {v}, P ∩ N(v), X ∩ N(v))
10:         P ← P \ {v}
11:         X ← X ∪ {v}
12:     end for
13: end function
14: BRONKER(R, P, X) // call the BronKer function
15: return C
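To make the two algorithms concrete, the following Python sketch combines a basic Bron-Kerbosch enumeration (without pivoting) with the decoding loop of Algorithm 1. It is a minimal illustration under stated assumptions, not the released implementation: the function names, the `(start, end, type)` segment triples, the dictionary form of the edge table, the requirement that linked segments share a type, and the toy tags below are all hypothetical.

```python
def bron_kerbosch(adj):
    """Enumerate all maximal cliques; `adj` maps a node to its neighbour set."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)  # r cannot be extended, so it is maximal
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}  # v has been fully explored
            x = x | {v}  # exclude v from later cliques in this branch

    if adj:
        expand(frozenset(), set(adj), set())
    return cliques


def decode(tokens, segments, edge_table):
    """Recover (text, type) entities from typed segments and edge tags."""
    adj = {seg: set() for seg in segments}
    for s in segments:
        for g in segments:
            if s == g or s[2] != g[2]:  # assumption: only link same-type segments
                continue
            etype = s[2]
            h2h = f"{etype}-H2H" in edge_table.get((s[0], g[0]), ())
            t2t = f"{etype}-T2T" in edge_table.get((s[1], g[1]), ())
            if h2h and t2t:
                adj[s].add(g)
                adj[g].add(s)
    entities = set()
    for clique in bron_kerbosch(adj):
        ordered = sorted(clique)  # original sequential order in the sentence
        words = [w for start, end, _ in ordered for w in tokens[start:end + 1]]
        entities.add((" ".join(words), ordered[0][2]))
    return entities


tokens = ["Sever", "joint", ",", "shoulder", "and", "upper", "body", "pain", "."]
# Toy inputs covering only part of the running example.
segments = [(0, 1, "ADE"), (0, 0, "ADE"), (3, 3, "ADE"), (7, 7, "ADE"), (1, 1, "POB")]
edge_table = {
    (0, 7): {"ADE-H2H", "ADE-T2T"},
    (1, 7): {"ADE-T2T"},
    (0, 3): {"ADE-H2H", "ADE-T2T"},
    (3, 7): {"ADE-H2H", "ADE-T2T"},
}
entities = decode(tokens, segments, edge_table)
```

In this toy graph, "pain" belongs to two maximal cliques, so it is reused by both "Sever joint pain" and "Sever shoulder pain", while the isolated "joint" node yields a continuous mention on its own.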

Token Representation
Given an n-token sentence $[t_1, \cdots, t_n]$, we first map each token $t_i$ into a low-dimensional contextual vector $h_i$ with a basic encoder. Then we generate two representations, $h^s_i$ and $h^e_i$, as the task-specific features for the segment extractor and the edge predictor, respectively:

$$h^s_i = W^s_h h_i + b^s_h, \qquad h^e_i = W^e_h h_i + b^e_h,$$

where $W^*_h$ is a parameter matrix and $b^*_h$ is a bias vector to be learned during training.

Segment Extractor
The probability that a pair of tokens are the boundary tokens of a segment can be represented as:

$$P(b = t_i, e = t_j) = P(b = t_i)\, P(e = t_j \mid b = t_i),$$

where b and e denote the beginning token and ending token. In our tagging scheme (Figure 3), we have a fixed beginning token $t_i$ at the i-th row, and take the given beginning token as the condition to label the corresponding ending token, so $P(b = t_i)$ in the i-th row is always 1. Hence, all we need to do is to calculate $P(e = t_j \mid b = t_i)$.
Inspired by Su (2019), we use the Conditional Layer Normalization (CLN) mechanism to model the conditional probability. That is, a conditional vector is introduced as extra contextual information to generate the gain parameter γ and bias λ of layer normalization (Ba et al., 2016) as follows:

$$\mathrm{CLN}(c, x) = \gamma_c \odot \frac{x - \mu}{\sigma} + \lambda_c, \qquad \mu = \frac{1}{d}\sum_{i=1}^{d} x_i, \qquad \sigma = \sqrt{\frac{1}{d}\sum_{i=1}^{d}(x_i - \mu)^2},$$

where c and x are the conditional vector and input vector respectively, $\gamma_c$ and $\lambda_c$ are the gain and bias generated from c, and d is the dimension of x. $x_i$ denotes the i-th element of x, and µ and σ are the mean and standard deviation taken across the elements of x, respectively. x is first normalized by fixing the mean and variance and then scaled and shifted by $\gamma_c$ and $\lambda_c$ respectively. Based on the CLN mechanism, the representation of token pair $(t_i, t_j)$ being a segment boundary can be defined as $\mathrm{CLN}(h^s_i, h^s_j)$, that is, with $h^s_i$ as the condition and $h^s_j$ as the input. In this way, different LN parameters are generated for different $t_i$, which effectively adapts $h_j$ to be more $t_i$-specific.
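The conditional layer normalization step can be sketched in plain Python as follows. Note the assumptions: the text only says that the gain and bias are generated from the conditional vector, so the linear maps used here (`w_gamma`, `b_gamma`, `w_lambda`, `b_lambda`) and the small `eps` for numerical stability are illustrative choices, and all names are hypothetical.

```python
import math


def affine(weight, bias, c):
    """Compute weight @ c + bias for a list-of-lists matrix."""
    return [sum(w * ci for w, ci in zip(row, c)) + b for row, b in zip(weight, bias)]


def conditional_layer_norm(c, x, w_gamma, b_gamma, w_lambda, b_lambda, eps=1e-12):
    """Normalize x, then scale and shift it with parameters generated from c."""
    # Assumption: gain and bias are linear functions of the condition c.
    gamma = affine(w_gamma, b_gamma, c)
    lam = affine(w_lambda, b_lambda, c)
    mu = sum(x) / len(x)
    sigma = math.sqrt(sum((xi - mu) ** 2 for xi in x) / len(x))
    return [g * (xi - mu) / (sigma + eps) + l for g, xi, l in zip(gamma, x, lam)]


x = [1.0, 2.0, 3.0]                      # stands in for the feature of t_j
w_gamma = [[1.0, 0.0]] * 3               # toy parameters, 3 outputs x 2 inputs
b_gamma = [1.0, 1.0, 1.0]
w_lambda = [[0.0, 1.0]] * 3
b_lambda = [0.0, 0.0, 0.0]

# Two different conditions (two different beginning tokens t_i).
out_a = conditional_layer_norm([1.0, 0.0], x, w_gamma, b_gamma, w_lambda, b_lambda)
out_b = conditional_layer_norm([0.0, 1.0], x, w_gamma, b_gamma, w_lambda, b_lambda)
```

The same input `x` yields different outputs under the two conditions (gain 2 and shift 0 versus gain 1 and shift 1), which is the sense in which the representation of the ending token becomes specific to the beginning token.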
Furthermore, besides the features of boundary tokens, we also consider inner tokens and segment length to learn a better segment representation. Specifically, we deploy an LSTM network (Hochreiter and Schmidhuber, 1997) to compute the hidden states of inner tokens, and use a lookup table to embed the segment length. Since the ending token is never before the beginning one, in each row $r_i$, only the tokens from $t_i$ onwards are fed into the LSTM. We take the hidden state output at each time step $t_j$ as the inner token representation $h^{in}_{i:j}$ of the segment $s_{i:j}$. The representation $h^s_{i:j}$ of a segment from $t_i$ to $t_j$ is then built from the boundary representation, the inner token representation $h^{in}_{i:j}$, and the length embedding $e^{len}_{i:j}$.

Edge Predictor
Edge prediction is similar to segment extraction since both need to learn the representation of each token pair. The key differences are twofold: (1) the distance between segments is usually not informative, so the length embedding $e^{len}_{i:j}$ is of little value in edge prediction; (2) encoding the tokens between segments may carry noisy semantics for correlation tagging and increase the burden of training, so no $h^{in}_{i:j}$ is required. Under these considerations, we represent each token pair for edge prediction using only the boundary token features:

$$h^e_{i:j} = \mathrm{CLN}(h^e_i, h^e_j).$$

Training and Inference
In practice, our grid tagging scheme aims to assign the most relevant labels to each token pair, so it can be seen as a multi-label classification problem. Once we have the comprehensive token pair representations ($h^s_{i:j}$ and $h^e_{i:j}$), we can build the multi-label classifier via a fully connected network. Mathematically, the predicted probability of each tag for $(t_i, t_j)$ can be estimated via:

$$p^I_{i,j} = \mathrm{sigmoid}\big(\mathrm{FC}^I(h^I_{i:j})\big),$$

where $\mathrm{FC}^I$ denotes the fully connected network, $I \in \{s, e\}$ is the subtask indicator, denoting segment extraction and edge prediction respectively, and each dimension of $p^I_{i,j}$ denotes the probability of a tag between $t_i$ and $t_j$. The sigmoid function is used to transform the projected value into a probability. In this case, the cross-entropy loss, which has proven suitable for multi-label classification tasks, can be used as the loss function:

$$\mathcal{L}^I = -\sum_{i=1}^{n}\sum_{j=s^I}^{n}\sum_{k=1}^{K^I}\Big(y^I_{i,j}[k]\log p^I_{i,j}[k] + \big(1 - y^I_{i,j}[k]\big)\log\big(1 - p^I_{i,j}[k]\big)\Big),$$

where $K^I$ is the number of pre-defined tags in I, $p^I_{i,j}[k] \in [0, 1]$ is the predicted probability of $(t_i, t_j)$ along the k-th tag, and $y^I_{i,j}[k] \in \{0, 1\}$ is the corresponding ground truth. $s^I$ equals 1 if $I = e$, or i if $I = s$. Then, the losses from segment extraction and edge prediction are aggregated to form the training objective $J(\theta)$.

At inference, the probability vector $p^I_{i,j}$ needs thresholding to be converted to tags. We enumerate several values in the range (0, 1) and pick the one that maximizes the evaluation metrics on the validation (dev) set as the threshold.
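The multi-label objective and the thresholding step can be illustrated with a small pure-Python sketch. It assumes the standard binary cross-entropy summed over all cells and tags, with the column index starting at the row index for segment extraction (upper triangle) and at the first column for edge prediction, as the definition of s^I suggests. The function names, the strict `>` comparison, and the toy numbers are assumptions for illustration.

```python
import math


def grid_bce_loss(probs, gold, upper_triangle):
    """Binary cross-entropy over an n x n x K grid of tag probabilities.

    upper_triangle=True mimics segment extraction (j starts at i);
    upper_triangle=False mimics edge prediction (j starts at the first column).
    """
    eps = 1e-12  # guards against log(0)
    n = len(probs)
    loss = 0.0
    for i in range(n):
        start = i if upper_triangle else 0
        for j in range(start, n):
            for p, y in zip(probs[i][j], gold[i][j]):
                loss -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return loss


def to_tags(probs, tag_names, threshold=0.5):
    """Convert probabilities to tag sets, keeping tags above the threshold."""
    tags = {}
    for i, row in enumerate(probs):
        for j, cell in enumerate(row):
            picked = {name for name, p in zip(tag_names, cell) if p > threshold}
            if picked:
                tags[(i, j)] = picked
    return tags


# Toy 2-token sentence with a single tag per cell.
probs = [[[0.9], [0.2]], [[0.6], [0.8]]]
gold = [[[1], [0]], [[0], [1]]]

full = grid_bce_loss(probs, gold, upper_triangle=False)
upper = grid_bce_loss(probs, gold, upper_triangle=True)
predicted = to_tags(probs, ["X-S"], threshold=0.5)
```

The two losses differ only by the contribution of the lower-triangle cell (1, 0), which the segment extraction table discards.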

Evaluation
In this section, after introducing the datasets and baseline models, we present our experimental results and detailed analysis.

Datasets
Following previous work (Dai et al., 2020b), we conduct experiments on three benchmark datasets from the biomedical domain: CADEC, ShARe 13, and ShARe 14. Their statistics are listed in Table 1.

Implementation Details
We implement our model upon in-field BERT base models: Yelp Bert (Dai et al., 2020a) for CADEC, and Clinical BERT (Alsentzer et al., 2019) for ShARe 13 and 14. The network parameters are optimized by Adam (Kingma and Ba, 2014) with a learning rate of 1e-5. The batch size is fixed to 12. The threshold for converting probabilities to tags is set to 0.5. All the hyper-parameters are tuned on the dev set. We run our experiments on an NVIDIA Tesla V100 GPU for at most 300 epochs, and choose the model with the best performance on the dev set to output results on the test set. We report the test score of the run with the median dev score among 5 randomly initialized runs.

Comparison Models
For comparison, we employ several models as baselines. Among them, Trans E (Dai et al., 2020b) is the current best discontinuous NER method, which generates a sequence of actions with the aid of buffer and stack structures to detect entities. Note that the original Trans E model is based on ELMo. For a fair comparison with our model, we also implement the in-field BERT-based Trans model, namely Trans B.

Table 3: Results on discontinuous entity mentions. In the table, two scores are reported and separated by a slash ("/"). The former is the score on sentences with at least one discontinuous entity mention. The latter is the score only considering discontinuous entity mentions.

Main results
2.6% in F1 score on average across the three datasets. Moreover, Wilcoxon's test shows that a significant difference (p < 0.05) exists between our model and Trans E. We attribute this to the fact that Trans E is inherently a multi-stage method, as it introduces several dependent actions and thus suffers from the exposure bias problem. Our Mac method, in contrast, decomposes the discontinuous NER task into two independent subtasks and learns them together with a joint model, realizing the consistency of training and inference. (4) Comb B can be approximately seen as the pipeline version of our method; their performance gap again confirms the effectiveness of our one-stage learning framework.

As shown in Table 1, only around 10% of the mentions are discontinuous in all three datasets, far fewer than the continuous entity mentions. To evaluate the effectiveness of our proposed model on recognizing discontinuous mentions, following Muis and Lu (2016), we report the results on sentences that include at least one discontinuous mention. We also report the evaluation results when only discontinuous mentions are considered. The scores in these two settings are separated by a slash in Table 3. Comparing Tables 2 and 3, we can see that the BIOE model performs better than the Graph model when testing on the full dataset but far worse on discontinuous mentions. Consistently, our model again outperforms the baseline models in terms of F1 score. Even though some models outperform Mac on precision or recall, they greatly sacrifice the other score, which results in a lower F1 score than Mac.

Model Ablation Study
To verify the effectiveness of each component, we ablate one component at a time to understand its impact on the performance. Concretely, we investigated the tagging scheme of segments, the segment length embedding, the CLN mechanism (by replacing it with vector concatenation), and the segment inner token representation.
From the ablations shown in Table 4, we find that: (1) When we take the B, I and S tags in segment extraction as one class, the score drops slightly by 0.5%, which indicates that segments in different positions of entities may have different semantic features, so distinguishing them can reduce confusion during recognition; (2) When we remove the segment length embedding (Formula 9), the overall F1 score drops by 0.6%, showing that it is necessary to make the segment extractor aware of the token pair distance information so that it can filter out impossible segments through an implicit distance constraint; (3) Compared with concatenation, using CLN (Formulas 7 and 11) to fuse the features of two tokens is a better choice, which brings a 1.9% improvement; (4) Removing the segment inner features (Formula 8) results in a remarkable drop in the overall F1 score but little drop in the scores on discontinuous mentions, which suggests that the information of inner tokens is essential for recognizing continuous entity mentions.

As discussed in the introduction, overlap is very common in discontinuous entity mentions. To evaluate the capability of our model in extracting overlapping structures, as suggested by Dai et al. (2020b), we divide the test set into four categories: (1) no overlap; (2) left overlap; (3) right overlap; and (4) multiple overlap. Figure 7 gives examples of each overlapping pattern. As illustrated in Figure 6, Mac outperforms Trans E on all the overlapping patterns. Trans E gets zero scores on some patterns. This might result from insufficient training, since these overlapping patterns have relatively few samples in the training sets (see Table 5), while the sequential action structure of the transition-based model is somewhat data hungry.
By contrast, Mac is more resilient to overlapping patterns. We attribute the performance gain to two design choices: (1) the grid tagging scheme is effective at accurately identifying overlapping segments and assembling them into a segment graph; (2) based on the graph, the maximal clique discovery algorithm can effectively recover all the candidate overlapping entity mentions.

Analysis on Running Speed
Table 6 shows the comparison of computational efficiency between the SOTA model Trans E, Trans B, and our proposed Mac. All of these models are implemented in PyTorch and run on a single Tesla V100 GPU. As we can see, the prediction speed of Mac is around 5 times faster than that of Trans E. Since the transition-based model employs a stack to store partially processed spans and a buffer to store unprocessed tokens (Dai et al., 2020b), it is difficult to utilize GPU parallel computing to speed up the extraction process. In the official implementation, Trans E is restricted to processing one token at a time, which makes it seriously inefficient and difficult to deploy in real-world environments. By contrast, Mac is capable of handling data in batch mode because it is in essence a single-stage sequence labeling model.

Conclusion
In this paper, we reformulate discontinuous NER as the task of discovering maximal cliques in a segment graph, and propose a novel Mac architecture. It decomposes the construction of the segment graph into two independent 2-D grid tagging problems and solves them jointly in one stage, addressing the exposure bias issue in previous studies. Extensive experiments on three benchmark datasets show that Mac beats the previous SOTA method by as much as 3.5 points in F1, while being 5 times faster. Further analysis demonstrates the ability of our model to recognize discontinuous and overlapping entity mentions. In the future, we would like to explore similar formulations in other information extraction tasks, such as event extraction and nested NER.

A Performance Analysis on Different Interval and Span Lengths
Intervals between segments usually make the total length of a discontinuous mention longer than that of a continuous one. Counting the involved segments as well, the whole span is even longer. That is, different words of a discontinuous mention may be distant from each other, which makes discontinuous NER harder than the conventional NER task and requires models to capture the semantic dependency between distant segments. To further evaluate the robustness of Mac in different settings, we analyse the results on the test sets for different interval and span lengths. The interval length refers to the number of words between discontinuous segments. The span length refers to the number of words in the whole span. For example, for the entity mention "Sever shoulder pain" in "Sever joint, shoulder and upper body pain.", the interval length is 5, and the span length is 8.

For the convenience of analysis, we report the distribution of all datasets over interval and span lengths in Tables 7 and 8, respectively. Figure 8 shows the F1 scores of Trans E and Mac on different interval and span lengths. As we can see, Mac outperforms Trans E on most interval and span lengths. Even though Mac is outperformed in some cases, the number of samples in those cases is too small to disprove the superiority of Mac. For example, on CADEC, Trans E outperforms Mac when the span length is 8, but there are only 10 such samples in the test set.

We observe an interesting phenomenon: both Mac and the transition-based model Trans E perform poorly when the interval length is 1 and the span length is 3, even though the corresponding training samples are sufficient (see length = 1 in Table 7 and length = 3 in Table 8). For example, ShARe 14 has over 200 training samples with an interval length of 1, but both models perform much worse on them than when the interval length is 3, which has fewer training samples. This might result from three factors: (1) even though the training samples are sufficient, their features and contexts differ substantially from those in the test sets; (2) the validation set is too small to choose a good model state for the samples with an interval length of 1; (3) discontinuous mentions with an interval length of 1 are harder cases than the others, since having only one word separating the segments makes these discontinuous mentions very similar to continuous ones, which can mislead the model into treating them as continuous mentions. We leave this problem to future work.
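The two length measures defined in this appendix can be computed as in the short sketch below. The helper name and the zero-based inclusive `(start, end)` token indices are assumptions, and punctuation is counted as a token, which is what reproduces the numbers quoted for the running example.

```python
def interval_and_span_length(segments):
    """Return (interval length, span length) of one entity mention.

    `segments` holds inclusive (start, end) token indices of the mention's
    segments (hypothetical input format for this illustration).
    """
    ordered = sorted(segments)
    # Span length: all tokens from the first segment start to the last segment end.
    span = ordered[-1][1] - ordered[0][0] + 1
    # Interval length: tokens lying between consecutive segments.
    interval = sum(nxt[0] - cur[1] - 1 for cur, nxt in zip(ordered, ordered[1:]))
    return interval, span


# "Sever shoulder pain" in
# ["Sever", "joint", ",", "shoulder", "and", "upper", "body", "pain", "."]
interval, span = interval_and_span_length([(0, 0), (3, 3), (7, 7)])
```

For this mention, the gaps "joint ," and "and upper body" contribute 2 and 3 tokens, giving an interval length of 5, and the span from "Sever" to "pain" covers 8 tokens.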