Deep Span Representations for Named Entity Recognition

Span-based models are among the most straightforward methods for named entity recognition (NER). Existing span-based NER systems shallowly aggregate token representations into span representations. However, this typically results in poor effectiveness on long-span entities, a coupling between the representations of overlapping spans, and ultimately performance degradation. In this study, we propose DSpERT (Deep Span Encoder Representations from Transformers), which comprises a standard Transformer and a span Transformer. The latter uses low-layered span representations as queries, and aggregates the token representations as keys and values, layer by layer from bottom to top. Thus, DSpERT produces span representations of deep semantics. With weight initialization from pretrained language models, DSpERT achieves performance higher than or competitive with recent state-of-the-art systems on eight NER benchmarks. Experimental results verify the importance of depth for span representations, and show that DSpERT performs particularly well on long-span entities and nested structures. Further, the deep span representations are well structured and easily separable in the feature space.


Introduction
As a fundamental information extraction task, named entity recognition (NER) requires predicting a set of entities from a piece of text. Thus, the model has to distinguish the entity spans (i.e., positive examples) from the non-entity spans (i.e., negative examples). In this view, it is natural to enumerate all possible spans and classify them into entity categories (including an extra non-entity category). This is exactly the core idea of span-based approaches (Sohrab and Miwa, 2018; Eberts and Ulges, 2020; Yu et al., 2020).
Analogously to how representation learning matters to image classification (Katiyar and Cardie, 2018; Bengio et al., 2013; Chen et al., 2020), it should be crucial to construct good span representations for span-based NER. However, existing models typically build span representations by shallowly aggregating the top/last token representations, e.g., pooling over the sequence dimension (Sohrab and Miwa, 2018; Eberts and Ulges, 2020; Shen et al., 2021), or integrating the starting and ending tokens (Yu et al., 2020; Li et al., 2020d). In that case, the token representations have not fully interacted with each other before they are fed into the classifier, which impairs the capability of capturing the information of long spans. If the spans overlap, the resulting span representations are technically coupled because of the shared tokens. This makes the representations of overlapping spans in nested structures less distinguishable from each other.
Inspired by (probably) the most sophisticated implementation of the attention mechanism, the Transformer and BERT (Vaswani et al., 2017; Devlin et al., 2019), we propose DSpERT, which stands for Deep Span Encoder Representations from Transformers. It consists of a standard Transformer and a span Transformer; the latter uses low-layered span representations as queries, and the token representations within the corresponding span as keys and values, and thus aggregates token representations layer by layer from bottom to top. Such multi-layered Transformer-style aggregation promisingly produces deep span representations of rich semantics, analogously to how BERT yields highly contextualized token representations.
With weight initialization from pretrained language models (PLMs), DSpERT performs comparably to recent state-of-the-art (SOTA) NER systems on six well-known benchmarks. Experimental results clearly verify the importance of depth for the span representations. In addition, DSpERT achieves particularly amplified performance improvements over its shallow counterparts on long-span entities and nested structures.
Different from most related work, which focuses on decoder designs (Yu et al., 2020; Li et al., 2020b; Shen et al., 2021; Li et al., 2022), we make an effort to optimize the span representations, but employ a simple and standard neural classifier for decoding. This exposes the pre-logit representations that directly determine the entity prediction results, and thus allows further representation analysis of the kind widely employed in the broader machine learning community (Van der Maaten and Hinton, 2008; Krizhevsky et al., 2012). This sheds light on neural NER systems towards higher robustness and interpretability (Ouchi et al., 2020).

Related Work
NER research long focused on recognizing flat entities. After the introduction of the linear-chain conditional random field (Collobert et al., 2011), neural sequence tagging models became the de facto standard solution for flat NER tasks (Huang et al., 2015; Lample et al., 2016; Ma and Hovy, 2016; Chiu and Nichols, 2016; Zhang and Yang, 2018).
Recent studies pay much more attention to nested NER, which a plain sequence tagging model struggles with (Ju et al., 2018). This has stimulated a number of novel NER system designs beyond the sequence tagging framework. Hypergraph-based methods extend sequence tagging by allowing multiple tags for each token and multiple tag transitions between adjacent tokens, which is compatible with nested structures (Lu and Roth, 2015; Katiyar and Cardie, 2018). Span-based models enumerate candidate spans and classify them into entity categories (Sohrab and Miwa, 2018; Eberts and Ulges, 2020; Yu et al., 2020). Li et al. (2020b) reformulate nested NER as a reading comprehension task. Shen et al. (2021, 2022) borrow methods from image object detection to solve nested NER. Yan et al. (2021) propose a generative approach, which encodes the ground-truth entity set as a sequence, and thus reformulates NER as a sequence-to-sequence task. Li et al. (2022) describe the entity set by word-word relations, and solve nested NER by word-word relation classification.
The span-based models are probably the most straightforward among these approaches. However, existing span-based models typically build span representations by shallowly aggregating the top token representations from a standard text encoder. Here, the shallow aggregation could be pooling over the sequence dimension (Eberts and Ulges, 2020; Shen et al., 2021), integrating the starting and ending token representations (Yu et al., 2020; Li et al., 2020d), or a concatenation of these results (Sohrab and Miwa, 2018). Apparently, shallow aggregation may be too simple to capture the information embedded in long spans; and if the spans overlap, the resulting span representations are technically coupled because of the shared tokens. These ultimately lead to a performance degradation.
Our DSpERT addresses this issue by a multi-layered, bottom-to-top construction of span representations. Empirical results show that such deep span representations outperform their shallow counterparts both qualitatively and quantitatively.

Methods
Deep Token Representations. Given a T-length sequence passed into an L-layered, d-dimensional Transformer encoder (Vaswani et al., 2017), the initial token embeddings, together with the potential positional and segmentation embeddings (e.g., BERT; Devlin et al., 2019), are denoted as H^0 ∈ R^{T×d}. Thus, the l-th (l = 1, 2, ..., L) token representations are:

    H^l = TrBlock(H^{l-1}, H^{l-1}, H^{l-1}),    (1)

where TrBlock(Q, K, V) is a Transformer block receiving Q, K and V as the query, key and value inputs, respectively. It consists of a multi-head attention module and a position-wise feed-forward network (FFN), both followed by a residual connection and a layer normalization. Passing the same matrix, i.e., H^{l-1}, for queries, keys and values exactly results in self-attention (Vaswani et al., 2017). The resulting top representations H^L, computed through L Transformer blocks, are believed to embrace deep, rich and contextualized semantics that are useful for a wide range of tasks. Hence, in the typical neural NLP modeling paradigm, only the top representations H^L are used for loss calculation and decoding (Devlin et al., 2019; Eberts and Ulges, 2020; Yu et al., 2020).

Deep Span Representations. Figure 1 presents the architecture of DSpERT, which consists of a standard Transformer encoder and a span Transformer encoder. In a span Transformer of size k (k = 2, 3, ..., K), the initial span representations S^{0,k} ∈ R^{(T+k-1)×d} are directly aggregated from the corresponding token embeddings:

    s_i^{0,k} = Aggregating(H^0_{[i:i+k]}),    (2)

where s_i^{0,k} ∈ R^d is the i-th vector of S^{0,k}, and Aggregating(·) is a shallowly aggregating function, such as max-pooling; see Appendix A for more details on the alternative aggregating functions used in this study. Technically, s_i^{0,k} covers the token embeddings in the span (i, i + k).
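A minimal sketch of this initial span aggregation, assuming max-pooling as the aggregating function and plain Python lists in place of tensors (for simplicity, the padding that keeps T + k - 1 span positions is omitted, so only windows fully inside the sequence are produced):

```python
def max_pool_spans(h0, k):
    """Build the initial span representations of size k by max-pooling
    each length-k window of the token embeddings (T x d nested lists)."""
    T = len(h0)
    spans = []
    for i in range(T - k + 1):  # windows fully inside the sequence
        window = h0[i:i + k]
        # element-wise max over the k token vectors in this window
        spans.append([max(col) for col in zip(*window)])
    return spans

tokens = [[1.0, 0.0], [0.5, 2.0], [3.0, 1.0]]  # T = 3, d = 2
s0 = max_pool_spans(tokens, k=2)
# s0[0] pools tokens 0..1 -> [1.0, 2.0]; s0[1] pools tokens 1..2 -> [3.0, 2.0]
```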
The computation of high-layered span representations imitates that of the standard Transformer. For each span Transformer block, the query is a low-layered span representation vector, and the keys and values are the aforementioned token representation vectors in the positions of that very span. Formally, the l-th layer span representations are:

    s_i^{l,k} = SpanTrBlock(s_i^{l-1,k}, H^{l-1}_{[i:i+k]}, H^{l-1}_{[i:i+k]}),    (3)

where SpanTrBlock(Q, K, V) shares the exact same structure with the corresponding Transformer block, but receives different inputs. More specifically, for span (i, i + k), the query is the span representation s_i^{l-1,k}, and the keys and values are the token representations H^{l-1}_{[i:i+k]}. Again, the resulting s_i^{l,k} technically covers the token representations in the span (i, i + k) on layer l - 1.
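The per-span cross-attention step can be sketched as follows, with a single head, no learned projections, and no FFN or layer normalization (all of which the full span Transformer block of course includes):

```python
import math

def span_attention(query, keys_values):
    """One cross-attention step: a span vector (query) attends over the
    token vectors of its own span (keys and values)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, kv)) / math.sqrt(d)
              for kv in keys_values]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    weights = [e / sum(exps) for e in exps]
    # weighted average of the value vectors
    return [sum(w * kv[j] for w, kv in zip(weights, keys_values))
            for j in range(d)]

span_vec = [1.0, 0.0]                   # s_i^{l-1,k}
token_vecs = [[1.0, 0.0], [0.0, 1.0]]   # H^{l-1} restricted to the span
out = span_attention(span_vec, token_vecs)
# the output leans toward the token most similar to the span query
```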
In our default configuration, the weights of the standard and span Transformers are independent, but initialized from the same PLM. Given the exact same structure, the weights can optionally be shared between the two modules. This reduces the model parameters, but empirically results in slightly lower performance (see Appendix F).
The top span representations S^{L,k} are built through L Transformer blocks, which are capable of enriching the representations towards deep semantics. Thus, the representations of overlapping spans are decoupled, and promisingly distinguishable from each other, although they are originally built from S^{0,k}, those shallowly aggregated from token embeddings. This is conceptually analogous to how BERT uses 12 or more Transformer blocks to produce highly contextualized representations from the original static token embeddings.
The top span representations are then passed to an entity classifier. Note that we do not construct a unigram span Transformer, but directly borrow the token representations as the span representations of size 1. In other words,

    S^{L,1} = H^L.    (4)

Entity Classifier. Following Dozat and Manning (2017) and Yu et al. (2020), we introduce a dimension-reducing FFN before feeding the span representations into the decoder. According to the preceding notations, the representation of span (i, j) is

    z_ij = FFN(s_i^{L,j-i} ⊕ w_{j-i}),    (5)

where w_{j-i} ∈ R^{d_w} is the (j - i)-th width embedding from a dedicated learnable matrix, and ⊕ means the concatenation operation. z_ij ∈ R^{d_z} is the dimension-reduced span representation, which is then fed into a softmax layer:

    ŷ_ij = softmax(W z_ij + b),    (6)

where W ∈ R^{c×d_z} and b ∈ R^c are learnable parameters, and ŷ_ij ∈ R^c is the vector of predicted probabilities over entity types. Note that Eq. (6) follows the form of a typical neural classification head, which receives a single vector z_ij and yields the predicted probabilities ŷ_ij. Here, the pre-softmax vector W z_ij is called the logits, and z_ij the pre-logit representation (Müller et al., 2019).
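The classification head can be sketched end-to-end as follows; the one-layer FFN with ReLU and the softmax output follow the description above, while the toy dimensions and weights are purely illustrative:

```python
import math

def relu(x):
    return [max(0.0, v) for v in x]

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(x):
    m = max(x)
    e = [math.exp(v - m) for v in x]
    return [v / sum(e) for v in e]

def classify_span(span_repr, width_emb, W_ffn, W_out, b_out):
    """Concatenate the span representation with its width embedding,
    reduce with a one-layer FFN, then map logits to probabilities."""
    z = relu(matvec(W_ffn, span_repr + width_emb))  # pre-logit representation
    logits = [l + b for l, b in zip(matvec(W_out, z), b_out)]
    return softmax(logits)

probs = classify_span([0.2, 0.8], [0.1],
                      W_ffn=[[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]],
                      W_out=[[1.0, 0.0], [0.0, 1.0]],
                      b_out=[0.0, 0.0])
# probs is a valid probability distribution over the entity types
```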
Given the one-hot encoded ground truth y_ij ∈ R^c, the model could be trained by optimizing the cross entropy loss over all enumerated spans:

    L = - Σ_{(i,j)} y_ij^⊤ log ŷ_ij.    (7)

We additionally apply the boundary smoothing technique (Zhu and Li, 2022), which is a variant of label smoothing (Szegedy et al., 2016) for span-based NER and brings performance improvements.
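The idea of boundary smoothing can be roughly sketched as below, assuming for illustration that a fraction eps of the gold span's probability mass is spread uniformly over spans whose start or end is off by one token; the actual recipe of Zhu and Li (2022) is more general than this toy version:

```python
def boundary_smooth(gold_span, candidate_spans, eps=0.1):
    """Return a smoothed target distribution over candidate spans:
    1 - eps on the gold span, eps shared by its boundary neighbors."""
    start, end = gold_span
    neighbors = [s for s in candidate_spans
                 if s != gold_span and abs(s[0] - start) + abs(s[1] - end) == 1]
    dist = {s: 0.0 for s in candidate_spans}
    if not neighbors:           # nothing to smooth toward
        dist[gold_span] = 1.0
        return dist
    dist[gold_span] = 1.0 - eps
    for s in neighbors:
        dist[s] = eps / len(neighbors)
    return dist

spans = [(1, 3), (0, 3), (1, 4), (1, 2), (5, 6)]
d = boundary_smooth((1, 3), spans)
# (0, 3), (1, 4) and (1, 2) share eps; (5, 6) keeps zero mass
```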

Experimental Settings
Datasets. We perform experiments on four English nested NER datasets: ACE 2004, ACE 2005, GENIA (Kim et al., 2003) and KBP 2017 (Ji et al., 2017); and two English flat NER datasets: CoNLL 2003 (Tjong Kim Sang and De Meulder, 2003) and OntoNotes 5. More details on data processing and descriptive statistics are reported in Appendix B.
Implementation Details. To save space, our implementation details are all placed in Appendix C.

Main Results
Table 1 shows the evaluation results on English nested NER benchmarks. For a fair and reliable comparison to previous SOTA NER systems, we run DSpERT five times on each dataset, and report both the best score and the average score with the corresponding standard deviation.

Table 2 presents the results on English flat NER datasets. The best F1 scores are 93.70% and 91.76% on CoNLL 2003 and OntoNotes 5, respectively. These scores are slightly higher than those reported in previous literature.

Appendix D further lists the category-wise F1 scores; the results show that DSpERT consistently outperforms the biaffine model, a classic and strong baseline, across most entity categories. Appendix E provides additional experimental results on Chinese NER, suggesting that the effectiveness of DSpERT generalizes across languages.

Overall, DSpERT shows strong and competitive performance on both nested and flat NER tasks. Given the long-term extensive investigation and experimentation on these datasets by the NLP community, the seemingly marginal performance improvements are still notable.

Ablation Studies
We perform ablation studies on three datasets, i.e., ACE 2004, GENIA and CoNLL 2003, covering flat and nested, common and domain-specific corpora.

Depth of Span Representations. As previously highlighted, our core argument is that the deep span representations, which are computed throughout the span Transformer blocks, embrace deep and rich semantics and thus outperform the shallow counterparts.
To validate this point, Table 3 compares DSpERT to models with a shallow setting, where the span representations are aggregated from the top token representations by max-pooling, mean-pooling, multiplicative attention or additive attention (see Appendix A for details). All the models are trained with the same recipe used in our main experiments. The 12-layer deep span representations achieve higher performance than their shallow counterparts equipped with any of these aggregating functions, across all datasets.
We further run DSpERT with L′ (L′ < L) span Transformer blocks, where the initial aggregation happens at the (L − L′)-th layer and the span Transformer corresponds to the top/last L′ Transformer blocks. These models may be thought of as intermediate configurations between fully deep span representations and fully shallow ones. As displayed in Table 3, the F1 score in general experiences a monotonically increasing trend as the depth L′ increases from 2 to 12; this pattern holds for all three datasets. These results further strengthen our argument that depth positively contributes to the quality of span representations. Appendix F provides extensive ablation studies evaluating other components.

Effect on Long-Span Entities
The recognition of long-span entities is a long-tail and challenging problem. Taking ACE 2004 as an example, the ground-truth entities longer than 10 tokens account for only 2.8%, while the maximum length reaches 57. Empirical evidence also shows that existing NER models perform relatively weakly on long entities (e.g., Shen et al., 2021; Yuan et al., 2022).
Figure 2 presents the F1 scores grouped by span length. In general, the models based on shallow span representations perform relatively well on short entities, but struggle with long ones. DSpERT, however, shows much higher F1 scores on long entities, without sacrificing any performance on short ones. For ACE 2004, DSpERT outperforms its shallow counterpart by 2%-12% absolute F1 score on spans shorter than 10, while the difference exceeds 30% for spans longer than 10. Similar patterns are observed on GENIA and CoNLL 2003.
Conceptually, a longer span contains more information, so it is more difficult to encode into a fixed-length vector, i.e., the span representation. According to our experimental results, shallow aggregation fails to fully preserve the semantics in the original token representations, especially for long spans. DSpERT, however, allows complicated interactions between tokens through multiple layers; in particular, longer spans experience more interactions. This mechanism amplifies the performance gain on long entities.

Effect on Nested Structures
Even in nested NER datasets, nested entities are fewer than flat ones (see Table 6). To deliberately investigate the performance on nested entities, we look into two subsets of spans that are directly related to nested structures: (1) Nested: the spans that are nested inside a ground-truth entity which covers other ground-truth entities; (2) Covering: the spans that cover a ground-truth entity which is nested inside other ground-truth entities.
For example, in the sentence "Mr. John Smith graduated from New York University last year", a location entity "New York" is nested in an organization entity "New York University". A model is regarded as handling nested structures well if it can: (1) distinguish "New York" from the other negative spans inside the outer entity "New York University", i.e., those in the Nested subset; and (2) distinguish "New York University" from the other negative spans covering the inner entity "New York", i.e., those in the Covering subset.
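The two subsets can be sketched with a simple membership check, assuming (start, end) spans with an exclusive end; note that this simplified version tests containment against any gold entity, whereas the definition above restricts attention to gold entities involved in nesting:

```python
def strictly_inside(inner, outer):
    """True if `inner` lies within `outer` and the two differ."""
    return outer[0] <= inner[0] and inner[1] <= outer[1] and inner != outer

def subset_of(span, gold_entities):
    """Assign a span to the Nested/Covering/Both/Flat analysis subsets."""
    nested = any(strictly_inside(span, e) for e in gold_entities)
    covering = any(strictly_inside(e, span) for e in gold_entities)
    if nested and covering:
        return "Both"
    if nested:
        return "Nested"
    if covering:
        return "Covering"
    return "Flat"

# token positions 4..7 spell "New York University", 4..6 "New York"
gold = [(4, 6), (4, 7)]
# an inner gold span is Nested; a span containing "New York" is Covering
```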
Figure 3 depicts the F1 scores grouped by nested structure. Consistent with common expectation, nested structures create significant difficulties for entity recognition. Compared to the flat ones, a shallow span-based model encounters a substantial performance degradation on nested structures, especially on the spans in both the Nested and Covering subsets. On the other hand, our deep span representations perform much better. For ACE 2004, DSpERT presents a 2% higher absolute F1 score than the shallow model on flat spans, but achieves an about 40% higher score on spans in both the Nested and Covering subsets. The experiments on GENIA report similar results, although the difference in performance gain is less substantial.
As previously emphasized, shallowly aggregated span representations are technically coupled if the spans overlap. This explains why such models perform poorly on nested structures. Our model addresses this issue by a deep, multi-layered construction of span representations. As implied by the experimental results, deep span representations are less coupled and more easily separable for overlapping spans.

Figure 3: F1 scores on spans with different nested structures. "Nested" means the spans that are nested inside a ground-truth entity which covers other ground-truth entities; "Covering" means the spans that cover a ground-truth entity which is nested inside other ground-truth entities; "Both" means the spans that are both "Nested" and "Covering"; "Flat" means the spans that are neither "Nested" nor "Covering". All the results are average scores of five independent runs.

Analysis of Pre-Logit Representations
For a neural classification model, the logits only relate to the pre-defined classification categories, while the pre-logit representations contain much richer information (Krizhevsky et al., 2012; Wu et al., 2018). Hence, the analysis of pre-logit representations has become a popular tool in machine learning research (Van der Maaten and Hinton, 2008; Müller et al., 2019; Chen et al., 2020).
However, such analysis is incompatible with most neural NER systems except the span-based ones. Our DSpERT employs a standard classification head, which exposes the pre-logit representations and thus allows this analysis. In this section, we investigate the pre-logit span representations, i.e., z_ij in Eq. (6), providing more insight into why the deep span representations outperform their shallow counterparts.

Decoupling Effect on Overlapping Spans
The effectiveness of DSpERT on nested structures is primarily attributable to its decoupling effect on the representations of overlapping spans. To support this argument, we compare the coupling strengths between span representations at different overlapping levels. Specifically, we define the overlapping ratio α of two given spans as the proportion of the shared tokens in the spans, and then categorize the span pairs into three scenarios: non-overlapping (α = 0), weakly overlapping (0 < α ≤ 0.5) and strongly overlapping (0.5 < α < 1).

Table 4 reports the cosine similarities of the representations between entities and their neighboring spans, categorized by overlapping ratio. In general, DSpERT and the shallow model have comparable similarity values on non-overlapping spans, and DSpERT shows slightly higher values on overlapping spans. However, the shallow model produces significantly higher similarities at stronger overlapping levels. Hence, shallow models yield coupled representations for overlapping spans, while DSpERT can effectively decouple the representations and thus improve performance, in particular on nested entities.

Table 4: Cosine similarities of the representations between entity spans and their neighboring spans. Non-/weakly/strongly overlapping means that the overlapping ratio is 0/0-0.5/0.5-1, respectively. All the metrics are first averaged within each experiment, and then averaged over five independent experiments, reported with subscript standard deviations.
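The overlapping ratio α used above can be sketched as follows, assuming it is computed as the number of shared tokens divided by the size of the union of the two spans (the exact normalization is an implementation detail):

```python
def overlap_ratio(a, b):
    """Shared tokens over the union of two (start, end) spans, exclusive end."""
    shared = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - shared
    return shared / union

def scenario(alpha):
    """Map a ratio to the three overlap scenarios used in this analysis."""
    if alpha == 0:
        return "non-overlapping"
    return "weakly overlapping" if alpha <= 0.5 else "strongly overlapping"

# spans (0, 4) and (2, 6) share 2 of 6 tokens -> weakly overlapping
# spans (0, 4) and (1, 4) share 3 of 4 tokens -> strongly overlapping
```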

ℓ 2 -Norm and Cosine Similarity
We calculate the ℓ2-norm and cosine similarity of the span representations. As presented in Table 5, the deep span representations have a larger ℓ2-norm than those of the shallow counterpart. Although the variance of representations inevitably shrinks during the aggregating process from a statistical perspective, this result implies that deep span representations are less restricted and thus able to flexibly represent rich semantics. In addition, the deep span representations are associated with higher within-class similarities and lower between-class similarities, suggesting that the representations are more tightly clustered within each category, but more separable between categories. Apparently, this contributes to the high classification performance.
We further investigate the pre-logit weight, i.e., W in Eq. (6). First, the trained DSpERT has a pre-logit weight with a smaller ℓ2-norm. According to a common understanding of neural networks, a smaller norm implies that the model is simpler and thus more generalizable.
Second, as indicated by Müller et al. (2019), a typical neural classification head can be regarded as a template-matching mechanism, where each row vector of W is a template. Under this interpretation, each template "stands for" the overall direction of the span representations of the corresponding category in the feature space. As shown in Table 5, the absolute cosine values between the templates of DSpERT are fairly small. In other words, the templates are approximately orthogonal, which suggests that different entity categories are uncorrelated and separately occupy distinct subareas of the feature space. This pattern, however, is not present for the shallow models.

Visualization
Figure 4 visualizes the span representations dimension-reduced by principal component analysis (PCA). The results are quite striking. The representations from shallow aggregation are scattered over the plane: although they are largely clustered by category, the boundaries are mixed and ambiguous. In contrast, the deep span representations group into relatively clear and tight clusters corresponding to the ground-truth categories. Except for some outliers, each pair of clusters can be easily separated in this projected plane. This is also consistent with the aforementioned finding that deep span representations have high within-class similarities but low between-class similarities. As a linear dimensionality reduction, PCA results indicate whether and how the features are linearly separable. Note that the pre-logit representations are the last ones before the logits, so linear separability is crucial to the classification performance.

Discussion and Conclusion
Neural NLP models have rigidly adhered to the paradigm where an encoder produces token-level representations, and a task-specific decoder receives these representations, computes the loss and yields the outputs (Collobert et al., 2011). This design works well on most, if not all, NLP tasks; it may also deserve credit for facilitating NLP pretraining (Peters et al., 2018; Devlin et al., 2019), since such a common structure bridges the pretraining and downstream phases in advance.
However, this paradigm may be sub-optimal for specific tasks. In span-based NER (or information extraction from a broader perspective), the smallest modeling unit should be spans instead of tokens, and thus the span representations should be crucial. This originally motivates our DSpERT. In addition, DSpERT also successfully shows how to exploit pretrained weights beyond the original Transformer structure, i.e., adapting weights trained for computing token representations to computing span representations. We believe that adding span representation learning to the pretraining stage would contribute further.
In conclusion, deep and span-specific representations can significantly boost span-based neural NER models. Our DSpERT achieves SOTA results on six well-known NER benchmarks; the model presents a pronounced effect on long-span entities and nested structures. Further analysis shows that the resulting deep span representations are well structured and easily separable in the feature space.

Limitations
To some extent, DSpERT pursues performance and interpretability over computational efficiency. The major computational cost of a Transformer encoder lies in the multi-head attention module and the FFN. As noted, we empirically choose the maximum span size K such that it covers most entities in the training and development splits. From the perspective of F1 score, this heuristic works well, and DSpERT performs favourably on long-span entities as long as they are covered. However, entities with extreme lengths beyond K are theoretically irretrievable.

A (Shallowly) Aggregating Functions
Given token representations H ∈ R^{T×d}, a model can shallowly aggregate them in the corresponding positions to construct span representations. Formally, the span representation of (i, j) can be built by:

    s_ij = Aggregating(H_{[i:j]}).

Max-pooling. Applying max-pooling to H_{[i:j]} over the first dimension.
Mean-pooling. Applying mean-pooling to H_{[i:j]} over the first dimension.

Multiplicative Attention.
Computing

    a = softmax(H_{[i:j]} W u),  s_ij = a^⊤ H_{[i:j]},

where W ∈ R^{d×d} and u ∈ R^d are learnable parameters.
Additive Attention. Computing

    a = softmax(tanh((H_{[i:j]} ⊕ V) W) u),  s_ij = a^⊤ H_{[i:j]},

where W, u and v are learnable parameters; ⊕ means concatenation over the second dimension, and V repeats the vector v for j − i times before the concatenation.
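The attention-based aggregation can be sketched in its multiplicative form as follows; W and u here are toy stand-ins for the learned parameters:

```python
import math

def softmax(x):
    m = max(x)
    e = [math.exp(v - m) for v in x]
    return [v / sum(e) for v in e]

def attention_pool(H_span, W, u):
    """Score each token vector via u . (W h_t), normalize the scores over
    the span, and return the weighted average of the token vectors."""
    scores = [sum(u[r] * sum(W[r][c] * h[c] for c in range(len(h)))
                  for r in range(len(u))) for h in H_span]
    weights = softmax(scores)
    d = len(H_span[0])
    return [sum(w * h[j] for w, h in zip(weights, H_span)) for j in range(d)]

H = [[1.0, 0.0], [0.0, 1.0]]          # two token vectors in the span
s = attention_pool(H, W=[[1.0, 0.0], [0.0, 1.0]], u=[1.0, 0.0])
# a convex combination of the token vectors, weighted toward the first
```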
In general, either multiplicative or additive attention computes normalized weights over the sequence dimension of span (i, j), where the weights depend on the values of H_{[i:j]}; it then applies the weights to H_{[i:j]}, resulting in weighted average values.


B Datasets
ACE 2004 and ACE 2005 are two English nested NER datasets, each of which contains seven entity types, i.e., Person, Organization, Facility, Location, Geo-political Entity, Vehicle and Weapon. Our data processing and splits follow Lu and Roth (2015).

GENIA (Kim et al., 2003) is a nested NER corpus of English biological articles. Our data processing follows Lu and Roth (2015), resulting in five entity types (DNA, RNA, Protein, Cell line, Cell type); data splits follow Yan et al. (2021) and Li et al. (2022).
KBP 2017 (Ji et al., 2017) is an English nested NER corpus including text from news, discussion forums, web blogs, tweets and scientific literature. It contains five entity categories, i.e., Person, Geo-political Entity, Organization, Location, and Facility. Our data processing and splits follow Lin et al. (2019) and Shen et al. (2022).

CoNLL 2003 (Tjong Kim Sang and De Meulder, 2003) is an English flat NER benchmark with four entity types, i.e., Person, Organization, Location and Miscellaneous. We use the original data splits for our experiments.

OntoNotes 5 is a large-scale English flat NER benchmark with 18 entity types. Our data processing and splits follow Pradhan et al. (2013).
Table 6 presents the descriptive statistics of the datasets.

C Implementation Details
Hyperparameters. We choose RoBERTa (Liu et al., 2019) as the PLM to initialize the weights in the Transformer blocks and span Transformer blocks. The PLMs used in our main experiments are all of the base size (768 hidden size, 12 layers).

For the span Transformer blocks, the maximum span size K is determined specifically for each dataset. In general, a larger K improves recall (entities longer than K can never be recalled), but significantly increases the computational cost. We empirically choose K such that it covers most entities in the training and development splits. For example, most entities are short in CoNLL 2003, so we use K = 10; entities are relatively long in ACE 2004 and ACE 2005, so we use K = 25. We use max-pooling as the initial aggregating function.
We find it beneficial to additionally include a BiLSTM (Hochreiter and Schmidhuber, 1997) before passing the span representations to the entity classifier. The BiLSTM has one layer and 400 hidden states, with a dropout rate of 0.5. In the entity classifier, the FFN has one layer and 300 hidden states, with a dropout rate of 0.4, and the activation is ReLU (Krizhevsky et al., 2012). In addition, boundary smoothing (Zhu and Li, 2022) with ε = 0.1 is applied to the loss computation.

We train the models with the AdamW optimizer (Loshchilov and Hutter, 2018) for 50 epochs with a batch size of 48. Gradients are clipped at an ℓ2-norm of 5 (Pascanu et al., 2013). The learning rates are 2e-5 and 2e-3 for pretrained weights and randomly initialized weights, respectively; a scheduler of linear warmup is applied in the first 20% of steps, followed by linear decay. For some datasets, a few hyperparameters are further tuned and thus differ slightly from the above.
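The learning-rate schedule described above (linear warmup over the first 20% of steps, then linear decay to zero) can be sketched as a simple function; the exact scheduler implementation is assumed:

```python
def lr_at(step, total_steps, peak_lr=2e-5, warmup_frac=0.2):
    """Piecewise-linear schedule: ramp up to peak_lr, then decay to zero."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)

# with 1000 total steps: 0 at step 0, peak 2e-5 at step 200, 0 at step 1000
```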
The experiments are run on NVIDIA RTX A6000 GPUs. More details on the hyperparameters and computational costs are reported in Table 7.
Evaluation. An entity is considered correctly recognized if its predicted type and boundaries exactly match the ground truth. The model checkpoint with the best F1 score on the development split throughout training is used for evaluation. The evaluation metrics are micro precision rate, recall rate and F1 score on the test split. Unless otherwise noted, we run each experiment five times and report the average metrics with corresponding standard deviations.
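The evaluation protocol can be sketched as a small function over (type, start, end) triples; an entity counts only on an exact match of both type and boundaries:

```python
def micro_prf(pred, gold):
    """Micro precision, recall and F1 over exact-match entities."""
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)  # exact matches of type and boundaries
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = [("PER", 0, 2), ("ORG", 4, 7)]
pred = [("PER", 0, 2), ("ORG", 4, 6)]  # wrong right boundary: not counted
# -> precision 0.5, recall 0.5, F1 0.5
```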

D Categorical Results
Table 8 lists the category-specific results on ACE 2004, GENIA and CoNLL 2003. As a strong baseline, the classic biaffine model (Yu et al., 2020) is re-implemented with the PLM and hyperparameters consistent with our DSpERT; note that our re-implementation achieves higher performance than the scores reported in the original paper. The categorical results show that DSpERT outperforms the biaffine model across almost all categories of the three datasets, except for Geo-political Entity and Vehicle in ACE 2004.

E Results on Chinese NER
Table 9 shows the experimental results on two Chinese flat NER datasets: Weibo NER (Peng and Dredze, 2015) and Resume NER (Zhang and Yang, 2018). DSpERT achieves best F1 scores of 72.64% and 96.72% on the two benchmarks, which are also quite close to recently reported SOTA results.

F Additional Ablation Studies
Weight Sharing. As described, the span Transformer shares the same structure with the Transformer, but their weights are independent and separately initialized from the PLM. A straightforward idea is to tie the corresponding weights between these two modules, which reduces the model parameters and conceptually acts as a regularization technique.
As reported in Table 10, sharing the weights results in higher F1 scores on ACE 2004, but slightly lower ones on the other datasets.

Pretrained Language Models. Since GENIA is a biological corpus, some previous studies use BioBERT on this benchmark (Shen et al., 2021, 2022). We also test BioBERT with DSpERT on GENIA. The results show that BioBERT can achieve performance competitive with RoBERTa.
BiLSTM and Boundary Smoothing. As presented in Table 11, removing the BiLSTM layer results in a drop of 0.2-0.4 percentage points in F1 score. In addition, replacing boundary smoothing (Zhu and Li, 2022) with the standard cross entropy loss reduces the F1 scores by similar magnitudes.

Figure 1 :
Figure 1: Architecture of DSpERT. It comprises: (Left) a standard L-layer Transformer encoder (e.g., BERT); and (Right) a span Transformer encoder, where the span representations are the query inputs, and the token representations (from the Transformer encoder) are the key/value inputs. There are K − 1 span Transformer encoders in total, where K is the maximum span size; each has L layers. The figure specifically displays the case of span size 3; the span of positions 1-3 is highlighted, whereas the others are in dotted lines.

Figure 2 :
Figure 2: F1 scores on spans of different lengths. All the results are average scores of five independent runs.

Figure 4 :
Figure 4: PCA visualization of pre-logit span representations of entities in the testing set.
For a T-length input and a d-dimensional Transformer encoder, the per-layer complexities of the multi-head attention and FFN are of order O(T^2 d) and O(T d^2), respectively. When the maximum span size K ≪ T, our span Transformer brings an additional O(K^2 T d) complexity on the attention module, and O(K T d^2) complexity on the FFN. Empirically, training DSpERT consumes about five times the time of a shallow model of the same scale. However, this issue can be mitigated by using fewer layers for the span Transformer (Subsection 4.3).

Table 1 :
Results of English nested entity recognition. * means that the model is trained with both the training and development splits. † means the best score; ‡ means the average score of multiple independent runs; the subscript number is the corresponding standard deviation.

Table 2 :
Results of English flat entity recognition.
* means that the model is trained with both the training and development splits. † means the best score; ‡ means the average score of multiple independent runs; the subscript number is the corresponding standard deviation.

Table 3 :
The effect of depth. The underlined specification is the one used in our main experiments. All the results are average scores of five independent runs, with subscript standard deviations.

Table 5 :
ℓ2-norm and cosine similarity of pre-logit representations and templates. "pos" means the positive types (i.e., entity types); "neg" means the negative type (i.e., the non-entity type). ↑/↓ indicates that DSpERT presents a metric higher/lower than its shallow counterpart. All the metrics are first averaged within each experiment, and then averaged over five independent experiments, reported with subscript standard deviations.

Table 6 :
Descriptive statistics of the datasets. #Sent. denotes the number of sentences; #Type denotes the number of entity types; #Token denotes the number of tokens; #Entity denotes the number of entities; #Nested denotes the number of entities that are nested in other entities.

Table 7 :
Hyperparameters and computational costs of main experiments.

Table 11 :
Results of ablation studies. "b" and "l" mean the PLM sizes of base and large, respectively; for the large PLM, the span Transformer has 12 layers. "BS" means boundary smoothing. The underlined specification is the one used in our main experiments. All the results are average scores of five independent runs, with subscript standard deviations.