A dynamic programming algorithm for span-based nested named-entity recognition in O(n^2)

Span-based nested named-entity recognition (NER) has a cubic-time complexity using a variant of the CYK algorithm. We show that by adding a supplementary structural constraint on the search space, nested NER can be performed in quadratic time, that is, with the same asymptotic complexity as the non-nested case. The proposed algorithm covers a large part of three standard English benchmarks and delivers comparable experimental results.


Introduction
Named entity recognition (NER) is a fundamental problem in information retrieval that aims to identify mentions of entities and their associated types in natural language documents. As such, the problem can be reduced to the identification and classification of segments of texts. In particular, we focus on mentions that have the following properties: 1. continuous, i.e. a mention corresponds to a contiguous sequence of words; 2. potentially nested, i.e. one mention can be inside another, but they can never partially overlap.
Four examples are shown in Figure 1.
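The two structural properties above can be checked mechanically. Below is a minimal sketch (the function and variable names are ours, not from the paper) that tests whether a set of half-open spans forms a valid, possibly nested, analysis:

```python
def well_nested(span_a, span_b):
    """True if two half-open spans (i, j) are disjoint or one contains the other."""
    (i, j), (k, l) = span_a, span_b
    disjoint = j <= k or l <= i
    nested = (i <= k and l <= j) or (k <= i and j <= l)
    return disjoint or nested

def valid_analysis(spans):
    """An analysis is valid if every pair of mention spans is disjoint or nested,
    i.e. no two spans partially overlap."""
    return all(well_nested(a, b) for a in spans for b in spans if a != b)
```

For instance, `valid_analysis([(0, 1), (2, 8), (2, 3), (5, 6)])` holds, while an analysis containing the partially overlapping spans `(0, 4)` and `(2, 6)` is rejected.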
In a span-based setting, recognition for nested NER has a cubic-time complexity (Finkel and Manning, 2009; Fu et al., 2021) using variants of the Cocke-Younger-Kasami (CYK) algorithm (Kasami, 1965; Younger, 1967; Cocke, 1970). If we restrict the search space to non-nested mentions, then recognition can be realized in quadratic time using a semi-Markov model (Sarawagi and Cohen, 2004). An open question is whether it is possible to design algorithms with better time-complexity/search-space trade-offs.
In this paper, we propose a novel span-based nested NER recognition algorithm with a quadratic-time complexity, that is, with the same time complexity as the semi-Markov algorithm for the non-nested case. Our approach is based on the observation that many mentions contain at most one nested mention of length strictly greater than one. As such, we follow a trend in the syntactic parsing literature that studies search spaces that allow the development of more efficient parsing algorithms, both for dependency and constituency structures (Pitler et al., 2012, 2013; Satta and Kuhlmann, 2013; Gómez-Rodríguez et al., 2010; Corro, 2020), inter alia.
Our main contributions can be summarized as follows:
• We present the semi-Markov and CYK-like models for non-nested and nested NER, respectively - although we do not claim that these approaches for NER are new, our presentation of the CYK-like algorithm differs from previous work as it is tailored to the NER problem and guarantees uniqueness of derivations;
• We introduce a novel search space for nested NER that has no significant loss in coverage compared to the standard one;
• We propose a novel quadratic-time recognition algorithm for the aforementioned search space;
• We evaluate our quadratic-time algorithm on three English datasets (ACE-2004, ACE-2005 and GENIA) and show that it obtains comparable results to the cubic-time algorithm.

Related work

Span-based methods: Semi-Markov models were first proposed in the generative modeling framework for time-series analysis and word segmentation, inter alia (Janssen and Limnios, 1999; Ge, 2002). Sarawagi and Cohen (2004) first proposed a discriminative variant for NER. Arora et al. (2019) extended this approach with a task-tailored structured SVM loss (Tsochantaridis et al., 2004). Semi-Markov models are attractive because their inference algorithms have an O(n^2) time complexity, where n is the length of the input sentence. Unfortunately, semi-Markov models can only recognize non-nested mentions. Finkel and Manning (2009) proposed a representation of nested mentions (together with part-of-speech tags) as a phrase structure, giving the possibility to use the CYK algorithm for MAP inference. Influenced by recent work in the syntactic parsing literature on span-based models, i.e. models without an explicit grammar (Hall et al., 2014; Stern et al., 2017), Fu et al. (2021) proposed to rely on these span-based phrase structure parsers for nested NER. As the structures considered in NER are not, stricto sensu, complete phrase structures, they use a latent span model. Inference in this model has an O(n^3) time complexity.
Lou et al. (2022) extended this approach to lexicalized structures (i.e. where each mention has an explicitly identified head), leading to an O(n^4) time complexity for inference due to the richer structure.
Tagging-based methods: NER can be reduced to a sentence tagging problem using BIO and BILOU schemes (Ratinov and Roth, 2009) to bypass the quadratic-time complexity of semi-Markov models. MAP inference (resp. marginal inference) is then a linear-time problem using the Viterbi algorithm (resp. forward-backward algorithm). However, this approach can neither incorporate span features nor be used for nested entities. Alex et al. (2007) and Ju et al. (2018) proposed to rely on several tagging layers to predict nested entities. Shibuya and Hovy (2020) proposed an extension of the Viterbi algorithm that allows BIO tagging to be used for nested NER by considering second-best paths. To leverage the influence of outer entities, Wang et al. (2021) proposed to rely on different potential functions for inner entities. Note that these approaches to tagging-based nested NER come at the cost of O(n^2) inference algorithms, which is similar to the span-based algorithm we propose.
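As an illustration of the BIO reduction for the non-nested case, here is a minimal sketch (the helper name and encoding details are ours; the cited works use richer schemes such as BILOU and learned taggers):

```python
def spans_to_bio(n, mentions):
    """Encode non-nested typed mentions (t, i, j) over an n-word sentence as BIO tags.

    Mentions use fencepost indexing: (t, i, j) covers 0-based word positions i..j-1.
    The first word of a mention gets B-t, subsequent words get I-t, and words
    outside any mention get O.
    """
    tags = ["O"] * n
    for t, i, j in mentions:
        tags[i] = f"B-{t}"
        for k in range(i + 1, j):
            tags[k] = f"I-{t}"
    return tags
```

For example, `spans_to_bio(5, [("PER", 0, 2), ("GPE", 3, 4)])` yields `["B-PER", "I-PER", "O", "B-GPE", "O"]`; since each word receives exactly one tag, the scheme cannot represent nested mentions directly.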
Hypergraph-based methods: Lu and Roth (2015) proposed a hypergraph-based method for nested NER. Although this approach is appealing for its O(n) (approximate) inference algorithms, it suffers from two major issues: (1) the algorithm the authors proposed to approximate the partition function overestimates its true value; (2) the representation is ambiguous, that is, a single path in the hypergraph may represent different analyses of the same sentence. Muis and Lu (2017) proposed a different hypergraph with O(n^2) inference algorithms that solves issue (1) but still exhibits issue (2). Katiyar and Cardie (2018) extended hypergraph methods to rely on neural network scoring.
Unstructured methods: Several authors proposed to predict the presence of a mention on each span independently, sometimes with specialized neural architectures (Xu et al., 2017; Sohrab and Miwa, 2018; Zheng et al., 2019; Xia et al., 2019; Tan et al., 2020; Zaratiana et al., 2022), inter alia. Note that these approaches classify O(n^2) spans of text independently, hence their time complexity is similar to that of the approach proposed in this paper, but they cannot guarantee well-formedness of the prediction.

Nested named-entity recognition
In this section, we introduce the nested NER problem and the vocabulary we use throughout the paper.

Notations and vocabulary
Let s = s_1 ... s_n be a sentence of n words. Without loss of generality (wlog), we assume that all sentences are of the same size. We use interstice (or fencepost) notation to refer to spans of s, i.e. s_{i:j} = s_{i+1} ... s_j if 0 ≤ i < j ≤ n, the empty sequence if 0 ≤ i = j ≤ n, and undefined otherwise. We denote M the set of possible mentions in a sentence and T the set of mention types. A mention is a triple ⟨t, i, j⟩ where t ∈ T is its type and i (resp. j), with 0 ≤ i < j ≤ n, is called its left border (resp. right border). An analysis of sentence s is denoted y ∈ {0, 1}^M, where y_m = 1 (resp. y_m = 0) indicates that mention m ∈ M is included in the analysis (resp. is not included). For example, the analysis of sentence 1 in Figure 1 is represented by a vector y where y_⟨PER,0,1⟩ = 1, y_⟨PER,5,8⟩ = 1 and all other elements are equal to zero. A mention ⟨t, i, j⟩ is said to be inside a mention ⟨t′, k, l⟩ if k ≤ i, j ≤ l and ⟨i, j⟩ ≠ ⟨k, l⟩.

Let y be the analysis of a sentence. We call first-level mentions all mentions in y that are not inside another mention of the analysis. We call nested mentions all mentions that are not first-level mentions. For example, the first-level mentions of the analysis of sentence 2 in Figure 1 are ⟨PER, 0, 1⟩ and ⟨PER, 2, 8⟩. We call children of a mention m ∈ M the set C ⊂ M of mentions that are inside m but not inside another mention that is inside m. Conversely, m is said to be the parent of each mention in C. For example, in sentence 2 in Figure 1, the mention ⟨PER, 2, 8⟩ has two children, ⟨PER, 2, 3⟩ and ⟨PER, 5, 6⟩. In sentence 4 in Figure 1, ⟨GPE, 5, 6⟩ is a child of ⟨GPE, 4, 6⟩ but it is not a child of ⟨PER, 2, 6⟩. The left neighborhood (resp. right neighborhood) of a nested mention is the span between the left border of its parent and its left border (resp. between its right border and the right border of its parent). For example, in sentence 2 in Figure 1, mention ⟨PER, 5, 6⟩ has left neighborhood s_{2:5} and right neighborhood s_{6:8}.
The set of possible analyses is denoted Y. We will consider three different definitions of Y: 1. the set of analyses where mention spans are pairwise disjoint, corresponding to non-nested NER; 2. the set of analyses where one mention span can be inside another one but spans cannot partially overlap, corresponding to nested NER; 3. the set defined in 2. with the additional constraint that a mention must contain at most one child whose span length is strictly greater than one.
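The three definitions can be compared concretely by brute-force enumeration on short sentences. The sketch below is our own illustration (the proposed algorithms never enumerate Y, of course); it generates all unlabeled analyses of a length-n sentence under each definition:

```python
from itertools import combinations

def analyses(n, definition):
    """Enumerate unlabeled analyses of a length-n sentence under the three
    search-space definitions (1: non-nested, 2: nested, 3: restricted nesting)."""
    spans = [(i, j) for i in range(n) for j in range(i + 1, n + 1)]
    def ok(sel):
        for (i, j), (k, l) in combinations(sel, 2):
            disjoint = j <= k or l <= i
            nested = (i <= k and l <= j) or (k <= i and j <= l)
            if not (disjoint or nested):
                return False      # partial overlap is never allowed
            if definition == 1 and not disjoint:
                return False      # definition 1: no nesting at all
        if definition == 3:
            for m in sel:
                # children of m: spans inside m with no intermediate parent
                inside = [c for c in sel if c != m and m[0] <= c[0] and c[1] <= m[1]]
                children = [c for c in inside
                            if not any(d[0] <= c[0] and c[1] <= d[1]
                                       for d in inside if d != c)]
                if sum(1 for (i, j) in children if j - i > 1) > 1:
                    return False  # more than one non-unary child
        return True
    for r in range(len(spans) + 1):
        for sel in combinations(spans, r):
            if ok(sel):
                yield sel
```

On a four-word sentence, definition 1 yields strictly fewer analyses than definition 3, which yields strictly fewer than definition 2: for instance, a mention over s_{0:4} with the two non-unary children s_{0:2} and s_{2:4} belongs to search space 2 but not 3.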

Inference problems
The weight of an analysis y ∈ Y is defined as the sum of the weights of its mentions. Let w ∈ R^M be a vector of mention weights. The probability of an analysis is defined via the Boltzmann (or "softmax") distribution:

p(y) = exp(w^⊤ y) / Z(w),

where Z is the partition function that normalizes the distribution, defined as:

Z(w) = Σ_{y′ ∈ Y} exp(w^⊤ y′).

Note that, in general, the set Y is of exponential size, but Z(w) can nonetheless be efficiently computed via dynamic programming. The training problem aims to minimize a loss function over the training data. We focus on the negative log-likelihood loss function, defined as: ℓ(w, y) = −w^⊤ y + log Z(w).
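For a toy search space small enough to enumerate, the loss can be computed directly from its definition. This brute-force sketch (illustration only; all names are ours, and the paper computes Z(w) by dynamic programming, not enumeration) makes the roles of w^⊤y and log Z(w) explicit:

```python
import math

def nll(weights, gold, space):
    """Negative log-likelihood ℓ(w, y) = -w^T y + log Z(w), with Z computed by
    brute-force enumeration of the search space."""
    def score(analysis):
        return sum(weights[m] for m in analysis)
    log_z = math.log(sum(math.exp(score(y)) for y in space))
    return -score(gold) + log_z

# toy example: two candidate mentions, three possible analyses
weights = {("PER", 0, 1): 2.0, ("ORG", 2, 4): 1.0}
space = [frozenset(),
         frozenset({("PER", 0, 1)}),
         frozenset({("PER", 0, 1), ("ORG", 2, 4)})]
gold = space[2]
loss = nll(weights, gold, space)
```

Since the gold analysis belongs to the search space, exp(w^⊤y) ≤ Z(w) and the loss is always non-negative; increasing the weight of a gold mention decreases the loss.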
Note that this loss function is convex. This differentiates us from previous work that had to rely on non-convex losses (Fu et al., 2021; Lou et al., 2022). The difference lies in the fact that we use algorithms that are tailored for the considered search space Y, whereas Fu et al. (2021) and Lou et al. (2022) introduced latent variables in order to rely on algorithms designed for a different problem, namely syntactic constituency parsing. Note that the partial derivatives of log Z(w) are the marginal distributions of mentions (Wainwright et al., 2008). Hence, we refer to computing log Z(w) and its derivatives as marginal inference, a required step for gradient-based optimization at training time.
At test time, we aim to compute the highest-scoring structure given weights w:

ŷ = argmax_{y ∈ Y} w^⊤ y.

We call this problem MAP inference. For many problems in natural language processing, marginal inference and MAP inference can be computed via dynamic programming over different semirings (Goodman, 1999) or dynamic programming with smoothed max operators (Mensch and Blondel, 2018). However, we need to ensure the uniqueness of derivations property, so that a single analysis y ∈ Y has exactly one possible derivation under the algorithm. Otherwise, the same analysis would be counted several times when computing the partition function, leading to an overestimation of its value.
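The semiring view can be made concrete on a simplified segmentation chain (a sketch under our own assumptions: every word is covered by a segment, and the weights are arbitrary). The same recursion computes the MAP score under the (max, +) semiring and log Z under the (logsumexp, +) semiring:

```python
import math

def chain(n, seg_weight, plus, times, one):
    """Generic dynamic program over a semiring (plus, times):
    chart[j] = plus over i < j of times(chart[i], seg_weight(i, j)), chart[0] = one."""
    chart = [one] + [None] * n
    for j in range(1, n + 1):
        chart[j] = plus(times(chart[i], seg_weight(i, j)) for i in range(j))
    return chart[n]

w = {(0, 1): 0.5, (1, 2): 0.2, (0, 2): 1.0, (1, 3): 0.1, (2, 3): 0.3, (0, 3): 0.8}
sw = lambda i, j: w[(i, j)]

# MAP inference: (max, +) semiring
best = chain(3, sw, max, lambda a, b: a + b, 0.0)
# marginal inference: (logsumexp, +) semiring computes log Z
logz = chain(3, sw, lambda xs: math.log(sum(math.exp(x) for x in xs)),
             lambda a, b: a + b, 0.0)
```

Here the four segmentations of a three-word sentence have scores 0.8, 0.6, 1.3 and 1.0, so the MAP score is 1.3 and log Z is the log-sum-exp of the four scores; because the chart recursion has exactly one derivation per segmentation, Z is not overestimated.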

Related algorithms
In this section, we present semi-Markov and CYK-like algorithms for non-nested and nested NER, respectively. Our presentation is based on the weighted logic programming formalism, also known as parsing-as-deduction (Pereira and Warren, 1983). To the best of our knowledge, the presentation of the CYK-like algorithm is novel as previous work relied on the "actual" CYK algorithm (Finkel and Manning, 2009) or its variant for span-based syntactic parsing (Lou et al., 2022;Fu et al., 2021).

Non-nested named-entity recognition
The semi-Markov algorithm recognizes a sentence from left to right. Items are of the following forms: • [t, i, j] with t ∈ T and 0 ≤ i < j ≤ n: represents a mention ⟨t, i, j⟩; • [→, i] with 0 ≤ i ≤ n: represents a partial analysis of the sentence covering words s_{0:i}.
Axioms are items of the form [→, 0] and [t, i, j]. The first axiom form represents an empty partial analysis and the second set of axioms represents all possible mentions in the sentence. We assign weight w_{t,i,j} to axiom [t, i, j], for all t ∈ T, i, j ∈ N s.t. 0 ≤ i < j ≤ n. The goal of the algorithm is the item [→, n].
Deduction rules are defined as follows:

(a) [→, i], [t, i, j] ⊢ [→, j]
(b) [→, i] ⊢ [→, i+1]

Rule (a) appends a mention spanning words s_{i:j} to a partial analysis, whereas rule (b) advances one position by assuming word s_{i:i+1} is not covered by a mention.
A trace example of the algorithm is given in Table 1. Soundness, completeness and uniqueness of derivations can be directly induced from the deduction system. The time and space complexities are both O(n^2 |T|).
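A direct implementation of this deduction system for MAP inference can be sketched as follows (all names are ours; for simplicity, mentions absent from the weight dictionary are treated as unavailable, whereas a real model scores every span):

```python
from collections import defaultdict

def semi_markov_map(n, weights):
    """MAP inference for the semi-Markov deduction system.

    Items [->, j] are chart cells best[j]; rule (a) closes a mention (t, i, j),
    rule (b) skips word j. `weights` maps (t, i, j) to w_{t,i,j}. O(n^2 |T|) time.
    """
    by_end = defaultdict(list)
    for (t, i, j), w in weights.items():
        by_end[j].append((t, i, w))
    best = [0.0] + [float("-inf")] * n   # best[0] is the axiom [->, 0]
    back = [None] * (n + 1)
    for j in range(1, n + 1):
        best[j], back[j] = best[j - 1], None     # rule (b): no mention ends at j
        for t, i, w in by_end[j]:                # rule (a): mention (t, i, j)
            if best[i] + w > best[j]:
                best[j], back[j] = best[i] + w, (t, i, j)
    # recover the analysis from the goal item [->, n] via backpointers
    mentions, j = [], n
    while j > 0:
        if back[j] is None:
            j -= 1
        else:
            mentions.append(back[j])
            j = back[j][1]
    return best[n], mentions[::-1]

weights = {("PER", 0, 1): 1.5, ("ORG", 1, 3): 2.0,
           ("PER", 0, 3): 3.0, ("GPE", 3, 4): -0.5}
score, mentions = semi_markov_map(4, weights)
```

On this toy input the best analysis selects ("PER", 0, 1) and ("ORG", 1, 3) for a score of 3.5, skipping the last word since its only candidate mention has negative weight; replacing max with log-sum-exp in the same recursion would compute log Z(w).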

Nested named-entity recognition
We present a CYK-like algorithm for nested named-entity recognition. Contrary to algorithms proposed by Finkel and Manning (2009) and Fu et al. (2021), inter alia, our algorithm directly recognizes the nested mentions and does not require any "trick" to take into account non-binary structures, words that are not covered by any mention, or the fact that a word in a mention may not be covered by any of its children. As such, we present an algorithm that is tailored for NER instead of the usual "hijacking" of constituency parsing algorithms. This particular presentation of the algorithm will allow us to simplify the presentation of our novel contribution in Section 5.

Items are of the following forms:
• [t, i, j] as defined previously;
• [→, i] as defined previously;
• [→, i, j] with 0 ≤ i < j ≤ n: represents the partial analysis of a mention and its nested structure starting at position i;
• [↔, i, j] with 0 ≤ i < j ≤ n: represents the fully analyzed internal structure of a mention spanning s_{i:j}.

Deduction rules for the first step are as follows. Rule (c) concatenates an analyzed mention to a partial analysis of another mention - note that the side condition forbids the right antecedent from sharing its left border with its parent. Rule (d) advances one position in the partial structure, assuming the analyzed mention starting at i does not have a child mention covering s_{j−1:j}. Rules (e) and (f) are used to recognize the internal structure of a mention that has a child sharing the same left border. Although the latter two deduction rules may seem far-fetched, they cannot be simplified without breaking the uniqueness of derivations property or the prohibition of self-loop constructions of ↔ items. Finally, rule (g) finishes the analysis of a mention and its internal structure.
Deduction rules for the second step, (h) and (i), have an interpretation similar to the rules of the semi-Markov model, where mentions are replaced by possibly nested structures.
A trace example of the algorithm is given in Table 2. Although the algorithm is more involved than usual presentations, our approach directly maps a derivation to nested mentions and guarantees uniqueness of derivations. The space and time complexities are O(n^2 |T|) and O(n^3 |T|), respectively.

O(n^2) nested named-entity recognition
In this section, we describe our novel algorithm for quadratic-time nested named-entity recognition. Our algorithm limits its search space to mentions that contain at most one child of length strictly greater than one.
Items are of the following forms:
• [t, i, j] as defined previously;
• [→, i] as defined previously;
• [→, i, j] as defined previously;
• [↔, i, j] as defined previously;
• [←, i, j] with 0 ≤ i < j ≤ n: represents a partial analysis of a mention and its internal structure, where its content will be recognized by appending content on the left instead of the right.
Axioms and goals are the same as those of the CYK-like algorithm presented in Section 4.2 - importantly, there is no extra axiom for items of the form [←, i, j].
For the moment, assume we restrict nested mentions of length strictly greater than one to the ones that share their left boundary with their parent. We can re-use rules (d), (f), (g), (h) and (i) together with two new deduction rules, (j) and (k). More precisely, we removed the two rules inducing a cubic-time complexity in the CYK-like algorithm and replaced them with quadratic-time rules. This transformation is possible because our search space forces the rightmost antecedents of these two rules to cover a single word, hence we do not need to introduce an extra free variable. However, in this form, the algorithm only allows a child mention of length strictly greater than one to share its left boundary with its parent.

We now extend the algorithm to the full targeted search space. The intuition is as follows: for a given mention, if it has a child mention of length strictly greater than one that does not share its left border with its parent, we first recognize this child mention and its left neighborhood, and then move to the right neighborhood using previously defined rules. We start the recognition of the left neighborhood using two rules, (l) and (m), whose side constraints ensure that antecedents [↔, i + 1, j] are non-unary (otherwise we would break the uniqueness of derivations property). Rule (l) (resp. (m)) recognizes the case where span s_{i:i+1} contains (resp. does not contain) a mention. Two further rules are analogous to rules (d) and (j), but visit the left neighborhood instead of the right one. Finally, once the left neighborhood has been recognized, we move to the right one using a closing rule.

Using the aforementioned rules, our algorithm has time and space complexities of O(n^2 |T|). We illustrate the difference with the CYK-like algorithm with a trace example in Table 2: in this specific example, the two analyses differ only by the application of a single rule.
Table 3 contains a trace example where all nested mentions have length one, so the parent mention is visited from left to right. Table 4 contains a trace example where we need to construct one internal structure by visiting the left neighborhood of the non-unary child mention from right to left.
Soundness and completeness can be proved by observing that, for a given mention, any composition of children can be parsed with the deduction rules as long as there is at most one child with a span of length strictly greater than one. Moreover, these are the only compositions of children that can be recognized. Finally, uniqueness of derivations can be proved as there is a single construction order for the internal structure of a mention.
Infinite recursion. An important property of our algorithm is that it does not bound the number of allowed recursively nested mentions. For example, consider the phrase "[Chair of [the Committee of [Ministers of [the Council of [Europe]]]]]". Not only can this nested mention structure be recognized by our algorithm, but any supplementary "of" precision would also be recognized.

Experimental results
Data. We evaluate our algorithms on the ACE-2004 (Doddington et al., 2004), ACE-2005 (Walker et al., 2006) and GENIA (Kim et al., 2003) datasets. We split and pre-process the data using the tools distributed by Shibuya and Hovy (2020).
Data coverage. As our parsing algorithm considers a restricted search space, an important question is whether it has a good coverage of NER datasets. Table 5 shows the maximum recall we can achieve with the algorithms presented in this paper. Note that no algorithm achieves a coverage of 100%, as there is a small set of mentions with exactly the same span and mentions that overlap partially. We observe that the loss of coverage for our quadratic-time algorithm is negligible compared to the cubic-time algorithm on all datasets.

Table 6: Precision, recall and F1-measure results, compared to other BERT-based models - some of the cited papers include richer models that we omit for brevity, as our goal is only to assess the performance of our algorithm compared to the CYK-like one. Results marked with † are reproductions by Lou et al. (2022), as the original papers experimented on different data splits.
Timing. We implemented the three algorithms in C++ and compare their running time for MAP inference in Figure 2. The proposed algorithm is significantly faster than the CYK-like one. If we parsed only sentences of 300 words and only considered the time spent in the decoding algorithm (i.e. ignoring the forward pass in the neural network), the CYK-like algorithm could not decode even 50 sentences per second, whereas our algorithm could decode more than 1,500 sentences on an Intel Core i5 (2.4 GHz) processor. As such, we hope that our algorithm will allow future work to consider NER on longer spans of text.
Neural architecture and hyperparameters. Our neural network is composed of a finetuned BERT model (Devlin et al., 2019) followed by 3 bidirectional LSTM layers (Hochreiter and Schmidhuber, 1997) with a hidden size of 400. When the BERT tokenizer splits a word, we use the output embedding of the first token. Mention weights (i.e. values in vector w) are computed using two biaffine layers (Dozat and Manning, 2017), one labeled and one unlabeled, with independent left and right projections of dimension 500 and ReLU activation functions.
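The biaffine scoring step can be sketched as follows (a minimal sketch: the dimensions are illustrative, bias terms are omitted, and the actual model combines separate labeled and unlabeled biaffine layers):

```python
import numpy as np

def biaffine_scores(h_left, h_right, U):
    """Biaffine mention scoring in the style of Dozat and Manning (2017):
    w[t, i, j] = h_left[i]^T U[t] h_right[j], where h_left / h_right are
    independent projections of the boundary representations.

    Shapes: h_left (n+1, d), h_right (n+1, d), U (|T|, d, d) -> (|T|, n+1, n+1).
    """
    return np.einsum("id,tde,je->tij", h_left, U, h_right)

rng = np.random.default_rng(0)
n, d, num_types = 5, 8, 3
h_left = rng.normal(size=(n + 1, d))    # projection at left borders (fenceposts)
h_right = rng.normal(size=(n + 1, d))   # projection at right borders
U = rng.normal(size=(num_types, d, d))
w = biaffine_scores(h_left, h_right, U)  # w[t, i, j] scores mention (t, i, j)
```

Scoring all O(n^2 |T|) candidate mentions in one tensor contraction matches the input expected by the weight vector w used throughout the paper; entries with i ≥ j are simply never read by the decoding algorithms.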
We use a negative log-likelihood loss (i.e. CRF loss) with 0.1 label smoothing (Szegedy et al., 2016). The learning rate is 1 × 10^-5 for BERT parameters and 1 × 10^-3 for other parameters. We use an exponential decay scheduler for learning rates (decay rate of 0.75 every 5000 steps). We apply dropout with probability 0.1 at the output of BERT, LSTM layers and projection layers. We keep the parameters that obtain the best F1-measure on development data after 20 epochs.
Results. We report experimental results in Table 6. Note that our goal is not to establish a novel state of the art for the task, but to assess whether our quadratic-time algorithm is well-suited for the nested NER problem; therefore, we only compare our models with recent work using the same data split and comparable neural architectures (i.e. BERT-based and without lexicalization). Our implementation of the CYK-like cubic-time parser obtains results close to comparable work in the literature. Importantly, we observe that, with the proposed quadratic-time algorithm, F1-measure results are (almost) the same on GENIA and the degradation is negligible on ACE-2004 and ACE-2005 (the F1-measure decreases by less than 0.5).

Conclusion
In this work, we proposed a novel quadratic-time parsing algorithm for nested NER, an asymptotic improvement of one order of magnitude over previously proposed span-based algorithms. We showed that the search-space restriction has good coverage of English datasets for nested NER. Despite having the same time complexity as semi-Markov models, our approach achieves comparable experimental results to the cubic-time CYK-like algorithm.
As such, we hope that our algorithm will be used as a fast drop-in replacement in future work on nested NER, where the cubic-time algorithm has often been described as slow. Future work could consider the extension to lexicalized mentions.