Filtered Semi-Markov CRF

Semi-Markov CRF has been proposed as an alternative to the traditional Linear Chain CRF for text segmentation tasks such as Named Entity Recognition (NER). Unlike CRF, which treats text segmentation as token-level prediction, Semi-CRF considers segments as the basic unit, making it more expressive. However, Semi-CRF suffers from two major drawbacks: (1) quadratic complexity over sequence length, as it operates on every span of the input sequence, and (2) inferior performance compared to CRF for sequence labeling tasks like NER. In this paper, we introduce Filtered Semi-Markov CRF, a variant of Semi-CRF that addresses these issues by incorporating a filtering step to eliminate irrelevant segments, reducing complexity and search space. Our approach is evaluated on several NER benchmarks, where it outperforms both CRF and Semi-CRF while being significantly faster. The implementation of our method is available on \href{https://github.com/urchade/Filtered-Semi-Markov-CRF}{GitHub}.


Introduction
Sequence segmentation, the process of dividing a sequence into distinct, non-overlapping segments, has various applications, including Named Entity Recognition and Chinese Word Segmentation (Tjong Kim Sang and De Meulder, 2003; Li and Yuan, 1998). In the past, this task has been approached as a sequence labeling problem using pre-defined templates, such as the BIO and BILOU schemes (Ratinov and Roth, 2009). The Conditional Random Field (CRF) (Lafferty et al., 2001) has become a popular method for sequence labeling problems due to its ability to model the dependency between adjacent token tags. However, the CRF model may not efficiently capture the underlying structure of the sequence, as it is limited to modeling relationships between individual tokens rather than segments.
The Semi-Markov CRF (Sarawagi and Cohen, 2005) has been proposed as a variant of the CRF, allowing for the incorporation of higher-level segment features, such as segment width. While the Semi-CRF allows for a more natural approach to sequence segmentation, it suffers from slower learning and inference due to its quadratic complexity with respect to the sequence length. Additionally, the Semi-CRF often underperforms CRF, showing only marginal improvements in some cases (Liang, 2005; Daumé and Marcu, 2005; Andrew, 2006), which can be attributed to the Semi-CRF's significantly larger solution space, complicating the search for optimal solutions.
To address the limitations of Semi-CRF, we propose Filtered Semi-CRF, which introduces a filtering step to prune irrelevant segments using a lightweight local segment classifier. By leveraging transformer-based features, such as BERT (Devlin et al., 2019), this classifier can identify high-quality candidate segments. Consequently, the task of the Semi-CRF is simplified to selecting the best segments from the pool of high-quality candidates.
Our experiments demonstrate that this filtering step not only accelerates the decoding process but also improves the overall model performance.
Although pruning techniques have been applied to accelerate parsing algorithms (Roark and Hollingshead, 2008; Bodenstab et al., 2011), they often involve a trade-off between accuracy and inference speed. In contrast, our filtering approach is learned jointly and collaboratively with the Semi-CRF during training, resulting in a model that not only increases efficiency but also improves overall performance.
When evaluated on Named Entity Recognition, our model significantly outperforms both the CRF and Semi-CRF, achieving F1 score improvements of up to 2.5 and 1.1 points, respectively, on the CoNLL 2003 dataset. Additionally, our model accelerates decoding, making it up to 20 times and 137 times faster than CRF and Semi-CRF, respectively.

Probabilistic structured predictor
In this paper, we aim to produce a structured output y given an input sequence x. To assess the compatibility between the input and output, we employ a parameterized score function S_θ(y|x). The probability of a structure y given x is computed as follows:

p_θ(y|x) = exp(S_θ(y|x)) / Σ_{y′ ∈ Y(x)} exp(S_θ(y′|x))

where Y(x) represents the set of all possible outputs for x, and the denominator serves as a normalization constant, referred to as the partition function, denoted by Z_θ(x).
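As a concrete illustration, the probability above can be sketched in a few lines of Python, under the toy assumption that the output space Y(x) is small enough to enumerate explicitly (in practice, dynamic programming replaces the explicit sum):

```python
import math

def structure_probability(scores, y):
    """p(y|x) = exp(S(y|x)) / Z(x), where Z(x) sums exp(S(y'|x)) over
    every candidate output y' in Y(x).

    `scores` maps each candidate output to its score S(y'|x); enumerating
    Y(x) explicitly is a toy assumption for illustration only.
    """
    log_z = math.log(sum(math.exp(s) for s in scores.values()))  # partition function Z(x)
    return math.exp(scores[y] - log_z)

# Three candidate structures for a toy input
scores = {"y1": 2.0, "y2": 1.0, "y3": 0.0}
probs = {y: structure_probability(scores, y) for y in scores}
```

Higher-scoring structures receive higher probability, and the probabilities sum to one by construction.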
Training During training, the goal is to update the model's parameters θ to maximize the likelihood of the training data. The loss function for a pair of data points (x, y) is computed as follows:

L(x, y) = −log p_θ(y|x) = log Z_θ(x) − S_θ(y|x)

This loss function can be optimized using a stochastic gradient descent algorithm on the training data.
Computing the partition function Z_θ(x) can be challenging when the output space is large, but it can be calculated efficiently using dynamic programming in some cases.
Inference During inference, the goal is to produce the most likely output:

y* = argmax_{y ∈ Y(x)} S_θ(y|x)

All the models we present in this paper follow this type of probabilistic modeling. For the remainder of this paper, we omit the dependency on θ for better readability.

Linear Chain CRF
The Linear-Chain CRF (Lafferty et al., 2001) is a sequence labeling model that assigns a label to each token in the input sequence, taking into account dependencies between adjacent labels. The score function of the CRF has the following form:

S(y|x) = Σ_{i=1}^{L} (ψ(y_i|x) + T[y_{i−1}, y_i])

Here, ψ(y_i|x) ∈ R is the sequence label score at position i, and T ∈ R^{|Y|×|Y|} is a learnable label transition matrix. The partition function is computed using the Forward algorithm, and the Viterbi algorithm (Rabiner, 1989) is used to determine the optimal labeling, both with a computational complexity of O(L|Y|²) (more details in Appendix A.2).
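A minimal Viterbi decoder for this scoring function can be sketched as follows (plain Python lists stand in for the model's tensors; this is an illustration, not the paper's implementation):

```python
def viterbi(emissions, transition):
    """Most likely label sequence under the linear-chain CRF score.

    emissions[i][y] plays the role of psi(y_i = y | x) and
    transition[y_prev][y] the role of T[y_prev, y].
    Complexity O(L |Y|^2), matching the text.
    """
    n_labels = len(emissions[0])
    delta = list(emissions[0])        # best score of a prefix ending in label y
    backptrs = []
    for i in range(1, len(emissions)):
        new_delta, ptr = [], []
        for y in range(n_labels):
            prev = max(range(n_labels), key=lambda yp: delta[yp] + transition[yp][y])
            ptr.append(prev)
            new_delta.append(delta[prev] + transition[prev][y] + emissions[i][y])
        delta = new_delta
        backptrs.append(ptr)
    # backtrack from the best final label
    y = max(range(n_labels), key=lambda l: delta[l])
    path = [y]
    for ptr in reversed(backptrs):
        y = ptr[y]
        path.append(y)
    return path[::-1], max(delta)
```

Replacing the max with a log-sum-exp in the same recursion yields the Forward algorithm for the partition function.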

Semi-Markov CRF
The Semi-CRF, proposed by Sarawagi and Cohen (2005), operates at the segment level and allows for the modeling of features that cannot be captured by traditional linear-chain CRFs. It produces a segmentation y of length M for an input sequence x of length L (L ≥ M). A segmentation y = {s_1, . . ., s_M} ∈ Y(x) satisfies the following properties:
• Each segment s_k = (i_k, j_k, l_k) consists of a start position i_k, an end position j_k, and a label l_k ∈ Y.
• The segments have positive lengths and completely cover the input sequence positions 1, . . ., L without overlapping. In other words, the start and end positions satisfy i_1 = 1, j_M = L, and for every j_k and i_k we have 1 ≤ i_k ≤ j_k ≤ L, and i_{k+1} = j_k + 1.
Consider a sentence from a Named Entity Recognition (NER) task: "Alain Farley works at McGill University". It would be segmented as y = [(1,2,PER), (3,3,O), (4,4,O), (5,6,ORG)], following the assumption from Sarawagi and Cohen (2005) that non-entity segments (referred to as O or null segments) have unit length. Furthermore, the Semi-CRF score function is defined as follows:

S(y|x) = Σ_{k=1}^{M} (ϕ(s_k|x) + T[l_{k−1}, l_k])

Here, ϕ(s_k|x) ∈ R represents the score of the k-th segment of y, and T[l_{k−1}, l_k] denotes the label transition score. Additionally, T[l_0, l_1] = 0. The partition function of the Semi-CRF can be computed in polynomial time using a modified version of the Forward algorithm, and the segmental Viterbi algorithm is used to compute the optimal segmentation (see Appendix A.3 for details). The computational complexity of the Semi-CRF increases quadratically with both the sequence length and the number of labels, i.e., O(L²|Y|²).
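The segmental Viterbi recursion can be sketched as follows; for brevity this toy version omits labels and transition scores (a simplifying assumption), keeping only the segment scores ϕ(i, j):

```python
def semi_crf_viterbi(L, phi):
    """Best segmentation of positions 1..L under a label-free Semi-CRF.

    phi(i, j) scores the segment covering tokens i..j (1-based, inclusive).
    Labels and transitions are omitted for brevity. The double loop over
    (i, j) evaluates O(L^2) segments, matching the quadratic complexity
    discussed in the text.
    """
    NEG = float("-inf")
    best = [NEG] * (L + 1)   # best[j] = best score of a segmentation of 1..j
    best[0] = 0.0
    back = [0] * (L + 1)
    for j in range(1, L + 1):
        for i in range(1, j + 1):            # candidate last segment (i, j)
            s = best[i - 1] + phi(i, j)
            if s > best[j]:
                best[j], back[j] = s, i
    # recover the segments by backtracking
    segs, j = [], L
    while j > 0:
        i = back[j]
        segs.append((i, j))
        j = i - 1
    return best[L], segs[::-1]
```

For example, with L = 3 and a scoring function that rewards only the segment (1, 2), the decoder returns the segmentation [(1, 2), (3, 3)].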

Graph-based Formulation of Semi-CRF
In this section, we present a graph-based formulation of the Semi-CRF. As explained in § 2.3, a sequence x of length L is divided into M labeled segments (i_k, j_k, l_k), with i_k, j_k and l_k denoting respectively the start position, the end position and the label. We define a directed graph G(V_full, E), with V_full its set of nodes composed of all possible segments s_k of x:

V_full = {(i, j, l) | 1 ≤ i ≤ j ≤ L, l ∈ Y}

and an edge s_{k′} → s_k ∈ E exists if and only if the start position of s_k immediately follows the end position of s_{k′}, i.e., j_{k′} + 1 = i_k. The weight of an edge s_{k′} → s_k is defined as:

w(s_{k′} → s_k) = ϕ(s_k|x) + T[l_{k′}, l_k]

where ϕ(s_k|x) is the score of the segment s_k and T[l_{k′}, l_k] is the label transition score. Moreover, any directed path s_1, s_2, . . ., s_M of the graph G corresponds to a valid segmentation of x if it verifies the segmentation properties described in § 2.3. Additionally, the score of a valid path is computed as the sum of the edge scores, and is equivalent to the Semi-CRF score of the segmentation (Eq. 5). The search for the best segmentation of the sequence x is equivalent to finding the maximal weighted path of the graph G that starts at i_1 = 1 and ends at j_M = L. This search can be done using a generic shortest path algorithm such as Bellman-Ford, whose complexity is O(L³). Nevertheless, taking into account the lattice structure of the problem, the Viterbi algorithm (Viterbi, 1967; Rabiner, 1989) can achieve this while reducing the complexity to O(L²).
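The path-score equivalence above can be made concrete with a short sketch; here `phi` and `T` are toy dictionaries standing in for the learned segment and transition scores (illustration only):

```python
def path_score(path, phi, T):
    """Score of a directed path s_1 -> ... -> s_M in the Semi-CRF graph.

    Each edge s_{k-1} -> s_k contributes phi(s_k) + T[l_{k-1}, l_k], and
    the first node contributes phi(s_1) alone (the convention T[l_0, l_1] = 0).
    Segments are (start, end, label) triples; phi maps a segment to its
    score and T maps a label pair to its transition score.
    """
    total = phi[path[0]]                       # entry score: T[l_0, l_1] = 0
    for prev, cur in zip(path, path[1:]):
        assert prev[1] + 1 == cur[0], "segments must be adjacent"
        total += phi[cur] + T[(prev[2], cur[2])]
    return total
```

Summing edge weights along the path recovers exactly the Semi-CRF score of the corresponding segmentation.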

Filtered Semi-Markov CRF
In this section, we propose an alternative model to Semi-CRF, named Filtered Semi-CRF, which aims to address two fundamental weaknesses of the original model. First, the Semi-CRF is not well-suited for long texts due to its quadratic complexity and the prohibitively large search space. Secondly, in tasks such as Named Entity Recognition (NER), where certain segments are labeled as null (representing non-entity segments), the Semi-CRF graph can create multiple redundant paths, all leading to the same set of entities. For instance, consider the scenario described in § 2.3. In this scenario, multiple segmentations, such as y = [(1,2,PER), (3,3,O), (4,4,O), (5,6,ORG)] or y = [(1,2,PER), (3,4,O), (5,6,ORG)], would yield the same final set of labeled entities, specifically (1,2,PER) and (5,6,ORG) in this case. To remedy these shortcomings, our proposed model incorporates a filtering step that eliminates irrelevant segments using a lightweight local classifier. By leveraging transformer-based features, this classifier effectively selects high-quality candidate segments, significantly reducing the task of the Semi-CRF to merely choosing the best among already high-quality candidates.

Filtering
Local classifier We first define the local classifier ϕ_local as a model that assigns a score to a labelled segment s = (i, j, l) given an input sequence x:

ϕ_local(i, j, l|x) = w_l^⊤ f(h_i, . . ., h_j)

where h_i ∈ R^D is the token representation at position i (computed by a pretrained transformer such as BERT), and w_l ∈ R^D is a learnable weight associated with the label l. The function f represents the segment featurizer, which aggregates token representations into a single feature representation. We found that a simple sum operation provides strong performance across settings.
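With a sum-pooling featurizer, the local score reduces to a dot product between the label weight and the summed token vectors. A minimal sketch with plain Python lists (the real model operates on transformer tensors):

```python
def local_score(h, w, i, j):
    """phi_local((i, j, l) | x) = w_l . f(h_i, ..., h_j), with f a sum pool,
    the featurizer the text reports works well.

    h: list of token representation vectors; w: weight vector for label l;
    (i, j) is a 1-based inclusive span. Toy lists stand in for tensors.
    """
    dim = len(w)
    feat = [sum(tok[d] for tok in h[i - 1:j]) for d in range(dim)]  # sum over the span
    return sum(wd * fd for wd, fd in zip(w, feat))                  # dot product with w_l
```

Because the pooled feature is a plain sum, span scores can also be computed from prefix sums of the token vectors, which keeps scoring all spans cheap.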

Filtered graph
The filtering consists in keeping only the segments whose locally best label is not null:

V = {(i, j, l) ∈ V_full | l = argmax_{l′} ϕ_local(i, j, l′|x), l ≠ null}    (10)

This new set of filtered nodes V requires defining the set of edges E differently from the definition of § 2.4. Thus, we propose to define the edges following Liang et al. (1991): an edge s_{k′} → s_k exists whenever j_{k′} < i_k and no segment of V lies entirely between the positions j_{k′} and i_k. This definition means that s_{k′} → s_k is an edge if the start position of s_k follows the end position of s_{k′}, and no other segment lies completely in between these two positions (j_{k′}, i_k). This formulation generalizes the Semi-CRF to graphs with missing segments. However, with missing segments, the starting and ending positions of segmentations do not necessarily verify i_1 = 1 and j_M = L. Thus, we simply add two terminal nodes, start and end: start is connected to every node without a predecessor, and every node without a successor is connected to end. In this context, a segmentation is simply a path in the graph starting at the start node and ending at the end node (see Figure 1). Referring back to the example in § 2.3, the correct segmentation of "Alain Farley works at McGill University" using the Filtered Semi-CRF would be y = [start, (1, 2, PER), (5, 6, ORG), end], where all remaining parts of the segmentation are considered as having null labels.
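The filtering step of Eq. 10 can be sketched directly: keep a span only if its locally best label is non-null, and attach that label to the kept node (toy scoring function; the real ϕ_local is the transformer-based classifier above):

```python
def filter_segments(L, labels, phi_local, null="O"):
    """Nodes of the filtered graph V: a segment (i, j) survives only if
    argmax_l phi_local(i, j, l) is not the null label, and it carries
    that best label. phi_local(i, j, l) is any per-segment label scorer;
    this sketch enumerates all O(L^2) spans.
    """
    kept = []
    for i in range(1, L + 1):
        for j in range(i, L + 1):
            best = max(labels, key=lambda l: phi_local(i, j, l))
            if best != null:
                kept.append((i, j, best))
    return kept
```

Note that each surviving node carries exactly one label, which is why the filtered graph's size no longer depends on |Y|.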

Segmentation scoring
In the filtered graph, the score of a segmentation y = {start, s_1, . . ., s_M, end} is computed by summing its edge scores, as for the Semi-CRF described in § 2.4:

S(y|x) = w(start → s_1) + Σ_{k=2}^{M} w(s_{k−1} → s_k) + w(s_M → end), with w(s_{k−1} → s_k) = ϕ_global(s_k|x) + T[l_{k−1}, l_k]

where ϕ_global is a model that computes the score of the nodes/segments in the filtered graph, defined similarly to ϕ_local in § 3.1; they share the same featurizer f. T[l_{k−1}, l_k] represents the transition score between the adjacent labels. By default, we set w(start → s_1) = ϕ_global(s_1|x) and w(s_M → end) = 0. See Figure 1 for a visual example.

Training
In this section, we present our FSemiCRF training, which involves updating the whole model parameters to minimize the following loss function:

L = L_local + L_global

Here, L_local and L_global represent the filtering loss and the segmentation loss, respectively.

Filtering loss
The filtering loss is the sum of the negative log-probability of all gold-labeled segments V*:

L_local = − Σ_{(i,j,l) ∈ V*} log p_local(l | i, j, x)

where p_local is obtained by normalizing ϕ_local over the labels with a softmax. In practice, we assign a lower weight to the loss of null segments to account for the imbalanced nature of the task. For that, we down-weight the loss for the label l = null by a ratio β ∈ (0, 1], tuned on the dev set.
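The down-weighted filtering loss can be sketched as a per-segment softmax cross-entropy with null terms scaled by β (plain-Python illustration of the reweighting described above, not the paper's implementation):

```python
import math

def filtering_loss(logits, gold, null_idx, beta):
    """Sum of negative log-probabilities over gold-labeled segments, with
    the null-segment terms down-weighted by beta in (0, 1].

    logits: per-segment lists of label scores phi_local(i, j, l | x);
    gold: the gold label index for each segment.
    """
    loss = 0.0
    for scores, l in zip(logits, gold):
        log_z = math.log(sum(math.exp(s) for s in scores))  # softmax normalizer
        nll = log_z - scores[l]                             # -log p(l | i, j, x)
        loss += beta * nll if l == null_idx else nll        # down-weight null segments
    return loss
```

With β = 1 this reduces to the plain sum of negative log-probabilities; smaller β shrinks the contribution of the (far more numerous) null segments.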

Segmentation loss
The segmentation loss is the negative log-likelihood of the gold path y in the filtered graph:

L_global = log Z(x) − S(y|x)

S(y|x) is the segmentation score as per § 3.2, and the partition function Z(x) is the sum of exponentiated scores for all valid paths in the graph from start to end. It can be computed efficiently via a message-passing algorithm (Wainwright and Jordan, 2008). In practice, this implementation of Z(x) can be unstable; thus, all computations were performed in log space to prevent issues of overflow or underflow. The complexity of the algorithm is O(|V| + |E|), as it performs a topological sort (which visits each node and edge once) and then iterates over each node and its incoming edges exactly once, performing constant-time operations.
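A minimal log-space sketch of the message-passing computation of log Z(x) (assumptions: `nodes` is already topologically sorted, and the edge dictionary is scanned per node for brevity rather than using adjacency lists):

```python
import math

def log_partition(nodes, edges, start, end):
    """log Z(x) on a filtered graph, via message passing in topological
    order carried out in log space for numerical stability.

    nodes: topologically sorted node list; edges[(u, v)] = w(u -> v).
    alpha[v] accumulates the log-sum of scores of all start -> v paths.
    """
    alpha = {u: float("-inf") for u in nodes}
    alpha[start] = 0.0
    for v in nodes:
        incoming = [alpha[u] + w for (u, v2), w in edges.items()
                    if v2 == v and alpha[u] != float("-inf")]
        if incoming:                              # log-sum-exp over incoming messages
            m = max(incoming)
            alpha[v] = m + math.log(sum(math.exp(x - m) for x in incoming))
    return alpha[end]
```

On a toy graph with two start-to-end paths of scores 1 and 2, this returns log(e¹ + e²), as expected from the definition of Z(x).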
During training, we impose certain constraints to ensure that the gold segmentation y forms a valid path in the filtered graph (with nodes V), which is critical for maintaining a positive loss, i.e., log Z(x) > S(y|x): 1) all segments in V that do not overlap with at least one segment from the gold segmentation y are excluded; 2) all segments from the gold segmentation, even those not initially selected in the filtering step, are included in V.

Inference
During inference, the first step is to obtain the candidate segments V through filtering, and then to construct the graph G(V, E) (see § 3.1). The final result is obtained by identifying the path from start to end in the graph that has the highest score. We achieve this by using a max-sum dynamic programming algorithm, which has a similar structure to Algorithm 1. The highest scoring path y*, represented by argmax_y S(y|x), is identified by the path traced by δ[end], which can be obtained through backtracking. This algorithm has a computational complexity of O(|V| + |E|), the same as that of computing the partition function Z(x) in Algorithm 1.
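The max-sum recursion with backtracking can be sketched as follows (same assumptions as before: `nodes` is topologically sorted; scanning the edge dictionary per node is for brevity, whereas adjacency lists would give the stated O(|V| + |E|) bound):

```python
def best_path(nodes, edges, start, end):
    """Max-sum analogue of the partition computation: delta[v] holds the
    best score of any start -> v path, with backpointers to recover
    argmax_y S(y|x) by tracing back from the end node.

    nodes: topologically sorted node list; edges[(u, v)] = w(u -> v).
    """
    delta = {u: float("-inf") for u in nodes}
    back = {}
    delta[start] = 0.0
    for v in nodes:
        for (u, v2), w in edges.items():
            if v2 == v and delta[u] + w > delta[v]:
                delta[v], back[v] = delta[u] + w, u
    # backtrack from end to start
    path, v = [end], end
    while v != start:
        v = back[v]
        path.append(v)
    return path[::-1], delta[end]
```

Swapping the max for a log-sum-exp recovers the partition computation, which is why the two algorithms share the same structure and complexity.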

Complexity analysis
In this section, we analyze the complexity of Algorithms 1 and 2, O(|V| + |E|), as a function of the input sequence length L. Note that the size of V does not depend on the number of labels |Y|, since there is at most one label per segment due to the filtering step in Equation 10.

Proposition 4.1 The number of nodes in a Semi-CRF graph (as described in § 2.4) with an input length of L is given by L(L+1)/2.

Proposition 4.2 The number of edges in a Semi-CRF graph (as described in § 2.4) with an input length of L is given by L(L−1)(L+1)/6.
We employ these propositions to determine the complexity of the Filtered Semi-CRF model in the following. The proofs for these propositions can be found in Appendix A.1.
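Both closed forms are easy to check numerically by enumerating the unlabeled segments and their adjacency edges directly (a small sanity-check sketch, not part of the model):

```python
def count_nodes_edges(L):
    """Count the segments (i, j) with 1 <= i <= j <= L and the adjacency
    edges (j' + 1 = i) of the unlabeled Semi-CRF graph, to verify the
    closed forms L(L+1)/2 and L(L-1)(L+1)/6 by brute force.
    """
    segs = [(i, j) for i in range(1, L + 1) for j in range(i, L + 1)]
    n_edges = sum(1 for a in segs for b in segs if a[1] + 1 == b[0])
    return len(segs), n_edges
```

For instance, L = 3 gives 6 nodes and 4 edges, matching 3·4/2 and 3·2·4/6.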

Worst case complexity
In the worst-case scenario, the filtering model ϕ_local does not filter any segments, resulting in all segments being retained. By utilizing Propositions 4.1 and 4.2, we can deduce that in the worst case, O(|V|) = O(L²) and O(|E|) = O(L³). This implies that the complexity of our algorithm in the worst case is cubic with respect to the sequence length L, as O(|V| + |E|) = O(L³). However, it is worth noting that in this worst-case scenario, the resulting graph is the full Semi-CRF graph, and the complexity can be reduced to O(L²) by utilizing the Forward algorithm during training and the Viterbi algorithm during inference (Eq. 19 and Eq. 18).

Best Case Complexity
In the ideal scenario, the filtering process is optimal, resulting in the number of nodes in the graph |V| being equal to the true number of non-null segments in the input sequence, denoted by S. Furthermore, since S does not contain overlapping segments, |S| ≤ L, with |S| = L if all segments in S have unit length and cover the entire sequence, i.e., S = {(i, i, l_i) | i = 1, . . ., L, l_i ≠ null}. Additionally, |E| = |S| − 1 ≤ L − 1, as optimal filtering implies that the path is unique. As a result, in this best-case scenario, the complexity of the algorithm is linear with respect to the sequence length L, i.e., O(L).
Empirical Analysis In this study, we assess our model's empirical complexity by examining the correlation between the graph size (|V| + |E|) and the input sequence length L. We use three popular NER datasets for this analysis: CoNLL-2003, OntoNotes 5.0, and Arabic ACE. Our findings (shown in Figure 2) indicate a linear increase in the graph size as the sequence length increases. Interestingly, the graph size always stays smaller than the sequence length. This suggests that in practice, the computational complexity of the FSemiCRF model is at worst O(L). However, during the initial stages of model training, the graph size may be large because the filtering model, which is responsible for reducing the graph size, is not fully trained, as depicted in Figure 3. The graph size then decreases rapidly after a few training steps as the filtering classifier improves.

Experimental setups
Datasets and evaluation We evaluate our models on three diverse Named Entity Recognition (NER) datasets: CoNLL-2003 and OntoNotes 5.0, both English, and Arabic ACE (further details in Appendix A.4). We adopt the standard NER evaluation methodology, calculating precision (P), recall (R), and F1-score (F) based on the exact match between predicted and actual entities.
Hyperparameters To produce contextual token representations, we used bert-large-cased (Devlin et al., 2019), and we optimized the models with Adam (Kingma and Ba, 2015). We employed a learning rate of 2e-5 for the pre-trained parameters and a learning rate of 5e-4 for the other parameters. We used a batch size of 8 and trained for a maximum of 15 epochs. We keep the best model on the validation set for testing. In this work, for all segment-based models, we restrict segments to a maximum width K to reduce complexity without harming the recall score on the training set (however, some segments may be missed in the test set). By bounding the maximum width of the segments, we reduce the number of segments from L² to LK. Under this setup, the complexity of the Semi-Markov CRF becomes O(LK|Y|²). We implemented our model with PyTorch (Paszke et al., 2019). The pre-trained transformer models were loaded from HuggingFace's Transformers library; we used AllenNLP (Gardner et al., 2018) for data preprocessing and the seqeval library (Nakayama, 2018) for evaluating the sequence labeling models. Our Semi-CRF implementation is based on pytorch-struct (Rush, 2020). We trained all the models on a server with V100 GPUs.

Main results
FSemiCRF vs. CRF and Semi-CRF As shown in Table 1, our FSemiCRF model outperforms both the CRF and Semi-CRF reference models on all datasets, validating its effectiveness. The Semi-CRF model, while providing competitive results, often lags behind, either matching or slightly underperforming the CRF model. This observation is in line with the findings of Liang (2005).
Comparison to prior works In our work, we mainly compare our approach with previous work that we consider comparable, i.e., work that uses sentence-level context and the same backbone model. As shown in Table 1, our FSemiCRF achieves competitive results on all the datasets. For example, our approach outperforms a span-based model we proposed earlier (Zaratiana et al., 2022a), which uses the maximum weighted independent set to select the best spans.

Ablation study
Semi-CRF + Unit null We study a variation of the Semi-CRF that only allows null labels for unit-length segments. To do this, we simply modify the original Semi-CRF by eliminating/masking segmentation paths that contain null segments with a size greater than one. The motivation for this study is to fix the multiple-redundant-paths problem of the Semi-CRF (§ 3). The results show that this approach improves performance on most of the datasets, but it still does not perform as well as the other methods, thus validating the importance of segment filtering.

Efficiency analysis
This section focuses on the computational efficiency of different models, for both training and inference.For this experiment, all the models use a base size for the encoder to ensure a fair comparison.
Inference wall clock time The wall clock time analysis for scoring and decoding operations, summarized in Table 2, highlights subtle differences in scoring times across all models. However, when it comes to decoding, FSemiCRF significantly outperforms both CRF and Semi-CRF models on all datasets. Notably, FSemiCRF achieves a remarkable 137x speedup over Semi-CRF on OntoNotes 5.0. Overall, FSemiCRF demonstrates superior performance, being up to 6x and 2x faster than CRF and Semi-CRF, respectively.
Training throughput Figure 4 presents the training throughput of the models, which measures the number of batches processed per second using a batch size of 8. It reveals that, in general, CRF is the fastest during training, with FSemiCRF following closely as the second fastest model. This can be attributed to the larger graph size of FSemiCRF during training, particularly in the early stages, which can potentially slow down the process, as discussed in the complexity analysis (§ 4.4). However, the differences in training performance between the models are not as pronounced as during inference.

Related Work
Linear-chain CRF Numerous frameworks exist for text segmentation. The commonly used Linear-Chain CRF (Lafferty et al., 2001) treats this task as token-level prediction, training through sequence-level objectives and using the Viterbi algorithm (Viterbi, 1967; Forney, 2010) for decoding. Variants have evolved from using handcrafted features (Lafferty et al., 2001; Gross et al., 2006; Roth and tau Yih, 2005) to automated feature learning through neural networks (Do and Artières, 2010; van der Maaten et al., 2011; Kim et al., 2015; Huang et al., 2015; Lample et al., 2016). Higher-order dependencies (Markov order N > 1) have been explored for enhanced performance, but their adoption is limited due to complexity and marginal gains (Ye et al., 2009; Cuong et al., 2014).
Dynamic Programming Pruning Prior research has investigated the use of pruning techniques in dynamic programming to improve the efficiency of structured prediction tasks (Roark and Hollingshead, 2008; Rush and Petrov, 2012; Bodenstab et al., 2011; Vieira and Eisner, 2017). These approaches aim to optimize runtime by selectively discarding hypotheses during inference. However, these methods often involve a trade-off between efficiency and performance. In contrast, our Filtered Semi-CRF model introduces a learned filtering step that collaboratively improves both efficiency and overall model performance.
Named Entity Recognition NER is an important task in Natural Language Processing and is used in many downstream information extraction applications such as relation extraction (Zaratiana et al., 2023) and taxonomy construction (Zhang et al., 2018; Dauxais et al., 2022). Usually, NER tasks are designed as sequence labelling (Huang et al., 2015; Lample et al., 2016; Akbik et al., 2018), where the goal is to predict a tagged sequence (e.g., BIO tags). Recently, different approaches have been proposed to perform NER tasks that go beyond traditional sequence labelling. One approach that has been widely adopted is the span-based approach (Liu et al., 2016b; Fu et al., 2021; Li et al., 2021; Zaratiana et al., 2022a,b,c; Lou et al., 2022; Corro, 2023), where the prediction is done at the span level instead of the token level. Furthermore, the use of sequence-to-sequence models for Named Entity Recognition has become popular recently. For instance, Yan et al. (2021) use the BART (Lewis et al., 2019) model to generate named entities with an encoder-decoder and a copy mechanism.

Conclusion
In this paper, we introduce Filtered Semi-CRF, a novel algorithm for text segmentation tasks.By applying our method to NER, we show substantial performance gains over traditional CRF and Semi-CRF models on several datasets.Additionally, our algorithm exhibits improved efficiency, speed, and scalability compared to the baselines.As future work, we plan to investigate the extension of Filtered Semi-CRF to nested segment structures.

Limitations
While our Filtered Semi-CRF model offers several advantages, it also has limitations that should be considered:

Sensitivity to Filtering Quality The overall performance and efficiency heavily rely on the accuracy of the filtering process in identifying high-quality candidate segments. Inaccurate filtering or the introduction of errors during this step can adversely affect the model's performance.
Restriction to Non-overlapping Entities Our model is designed specifically for non-overlapping entity segmentation. It assumes that entities within the text do not overlap with each other. While this assumption is valid for many applications and datasets, scenarios exist where entity overlap occurs, such as nested entities.

Figure 1: Filtered Semi-CRF for NER. The model takes a text sequence as input and outputs the best entity segments.

Figure 2: Empirical complexity analysis. We conducted an empirical complexity analysis using trained Filtered Semi-CRF models. The plot showcases the relationship between the size of the filtered graph (|V| + |E|) and the input sequence length L on three NER datasets. As the length of the input sequence increases, the graph size grows in a roughly linear fashion.

Figure 3: Graph Size during Training. The graph size (|V| + |E| + 1) undergoes three stages during training: 1) initially large when the filtering classifier is untrained, 2) decreasing in the second stage as most segments in the training set have a null label (biasing the classifier toward this label), and 3) increasing again as the classifier improves, better aligning with the training dataset statistics.

Figure 4: Training throughput in batches per second.

Table 2: Inference Wall Clock Time (lower is better). Comparison of required wall-clock time for the scoring (tokens for CRF, segments for Semi-CRF/FSemiCRF) and decoding processes, measured in milliseconds per sample.