Improving Span Representation by Efficient Span-Level Attention

Introduction
Many natural language processing tasks involve spans, making it crucial to construct high-quality span representations. In named entity recognition, spans are detected and typed with different labels (Yuan et al., 2022; Zhu et al., 2022); in coreference resolution, mention spans are located and grouped (Lee et al., 2017, 2018; Gandhi et al., 2021; Liu et al., 2022); in constituency parsing, spans are assigned scores for constituent labels, based on which a parse tree is derived (Stern et al., 2017; Kitaev and Klein, 2018; Kitaev et al., 2019).
Most existing methods compute span representations by shallowly aggregating token representations. They either pool over tokens within the span (Shen et al., 2021; Hashimoto et al., 2017; Conneau et al., 2017) or concatenate the starting and ending tokens (Ouchi et al., 2018; Zhong and Chen, 2021). These methods have two limitations: (i) span representations are dominated by a subset of tokens, so crucial information may be lost; (ii) span interactions, which should intuitively play an important role in span encoding, are completely ignored. For example, the meanings of spans, especially constituents, can be composed from their sub-spans and disambiguated by their neighbouring spans.
Inspired by the use of self-attention in the Transformer (Vaswani et al., 2017), we introduce span-level self-attention to capture span interactions and improve span representations. However, computing attention scores for all span pairs leads to O(n^4) complexity (n being the sequence length). In addition, not all span interactions are intuitively meaningful. Therefore, we design four span-level attention patterns (Inside-Token, Containment, Adjacency, and All-Token) to restrict the range of spans that a given span can attend to. These patterns lay more emphasis on meaningful span interactions while reducing the overall complexity to an acceptable level, thereby ensuring both effectiveness and efficiency.

Method
Fig. 2 illustrates the architecture of our model, which we describe from the bottom up.
Token representations. Given a sentence w = w_0, w_1, ..., w_n, we tokenize it and pass it through BERT (Devlin et al., 2019), obtaining contextualized token representations c = c_0, c_1, ..., c_T by taking a weighted average of the outputs of all layers. We then feed them into a linear projection layer to obtain the final token representations x = x_0, x_1, ..., x_T.

Initial span representations. We follow Toshniwal et al. (2020) to initialize span representations from contextualized token representations. In pilot experiments, we observe that among the five pooling methods (max pooling, average pooling, attention pooling, endpoint, diff-sum), max pooling performs best. We therefore choose max pooling as the default initialization. Specifically, given a span ⟨i, j⟩ and the corresponding token representations {x_i, ..., x_j} within the span, the initial span representation s_ij is computed by taking the maximum value over each dimension of the token representations.
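The max-pooling initialization can be sketched as follows (a minimal plain-Python illustration; the helper name and toy dimensions are ours, not from the paper):

```python
def max_pool_span(token_reprs, i, j):
    """Initialize s_ij for span <i, j> by taking, for each dimension,
    the maximum over the token representations x_i ... x_j."""
    span_tokens = token_reprs[i:j + 1]          # tokens inside the span
    dim = len(span_tokens[0])
    return [max(tok[d] for tok in span_tokens)  # elementwise max
            for d in range(dim)]

# toy token representations: 3 tokens, 2 dimensions
x = [[0.1, 0.9], [0.5, 0.2], [0.3, 0.4]]
s_02 = max_pool_span(x, 0, 2)   # span covering all three tokens
```

Unlike endpoint concatenation, every token inside the span can contribute to some dimension of the result.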
Span-level attention. We enumerate all spans and feed their representations into a Transformer encoder with span-level attention. Note that computing attention scores for all span pairs incurs O(n^4) time and memory complexity, because self-attention has quadratic complexity and there are O(n^2) spans in total. To reduce the complexity and to encourage more meaningful span interactions, we design different attention patterns that restrict the range of spans a given span can attend to (Fig. 1). We use rel(⟨i, j⟩) to denote the set of spans that span ⟨i, j⟩ can attend to.
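To make the complexity argument concrete: a sentence of length n has n(n+1)/2 spans, so fully-connected span-level attention scores one entry per span pair, which grows quartically. A quick sketch (helper names are ours):

```python
def num_spans(n):
    # every pair (i, j) with 0 <= i <= j < n is a span: n(n+1)/2
    return n * (n + 1) // 2

def num_attention_pairs(n):
    # fully-connected span-level attention scores one entry per
    # span pair: O(n^2) spans squared gives O(n^4) pairs
    return num_spans(n) ** 2

# doubling n multiplies the pair count by roughly 2^4 = 16
growth = num_attention_pairs(40) / num_attention_pairs(20)
```

This is why unrestricted span-level attention quickly becomes infeasible for longer sentences.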
Inside-Token. Each span attends to the tokens within it. This pattern maintains the connection between spans and their internal tokens.
rel(⟨i, j⟩) = {⟨k, k⟩ | k = i, ..., j}

Containment. The super-spans and sub-spans of a span may provide meaningful information about it. However, the total number of super-spans and sub-spans of a given span is O(n^2). Considering the importance of starting and ending positions in span encoding, we instead let each span attend to spans that share its starting or ending position. This pattern captures the containment relationship as well as the starting and ending positions of spans, while reducing the number of attendable spans to O(n).

rel(⟨i, j⟩) = {⟨i, k⟩ | k = i, ..., T} ∪ {⟨k, j⟩ | k = 0, ..., j}

Adjacency. Each span attends to its neighbouring spans, i.e., spans adjacent to its boundaries, which can help disambiguate the span's meaning.

All-Token. Each span attends to all single-token spans in the sentence, giving it access to the entire input text.

rel(⟨i, j⟩) = {⟨k, k⟩ | k = 0, ..., T}

It is worth noting that all four patterns ensure that the number of spans each span can attend to is O(n), which reduces the overall complexity to O(n^3). Moreover, these four patterns can be combined arbitrarily to form new patterns for different scenarios.
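The restricted patterns can be sketched as functions from a span to its attendable set (spans are (start, end) pairs over token positions 0..T; the helper names are ours, and the Containment set follows the paper's textual description of spans sharing a start or end position):

```python
def rel_inside_token(i, j, T):
    # Inside-Token: the single-token spans <k, k> inside <i, j>
    return {(k, k) for k in range(i, j + 1)}

def rel_containment(i, j, T):
    # Containment (restricted): spans sharing the same start or end
    same_start = {(i, k) for k in range(i, T + 1)}
    same_end = {(k, j) for k in range(0, j + 1)}
    return same_start | same_end

def rel_all_token(i, j, T):
    # All-Token: every single-token span in the sentence
    return {(k, k) for k in range(0, T + 1)}
```

Each set has O(T) elements, in line with the claim that restricting attention to these patterns reduces the overall complexity from O(n^4) to O(n^3).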
Inference and training. After span-level attention, we obtain an enhanced representation s_ij for each span. For single-span tasks, we feed the span representation into a two-layer MLP classifier. For tasks involving two spans, we concatenate the two span representations and feed the result into the MLP classifier. The classifier maps its input to a q-dimensional vector, where q is the size of the label set (including NoneType if necessary). We train the model directly with the loss function of the downstream task, such as the binary cross-entropy loss or the cross-entropy loss commonly used in multi-class classification.
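The classification head described above can be sketched with numpy (a minimal illustration; the weight shapes, toy sizes, and ReLU activation are our assumptions, not specified in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_classify(span_repr, W1, b1, W2, b2):
    """Two-layer MLP mapping a span representation to q label logits."""
    h = np.maximum(0.0, span_repr @ W1 + b1)  # hidden layer (ReLU assumed)
    return h @ W2 + b2                        # q-dimensional logits

d, hidden, q = 8, 16, 5  # toy sizes: span dim, hidden dim, label-set size
W1, b1 = rng.standard_normal((2 * d, hidden)), np.zeros(hidden)
W2, b2 = rng.standard_normal((hidden, q)), np.zeros(q)

# two-span tasks: concatenate the two span representations first
s_a, s_b = rng.standard_normal(d), rng.standard_normal(d)
logits = mlp_classify(np.concatenate([s_a, s_b]), W1, b1, W2, b2)
```

For a single-span task the same head would take a d-dimensional input and W1 would have shape (d, hidden).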

Setup
We use BERT-base-cased to obtain contextualized token representations and keep it frozen for the probing tasks. For span-level attention, we stack 4 Transformer encoder layers and set the number of heads in multi-head attention to 4. Dataset details and other hyper-parameters can be found in Tables 5 and 6 in Appendix A. We conduct all experiments on a single 24GB NVIDIA TITAN RTX and report micro-averaged F1 scores. All results are averaged over three runs with different random seeds.

Probing tasks results
Following Toshniwal et al. (2020), we conduct 6 probing tasks: named entity labeling (NEL), coreference arc prediction (REF), semantic role classification (SRC), constituent labeling (CTL), mention detection (MED) and constituent detection (CTD). In these tasks, we only need to classify or make predictions on given spans. Table 1 shows the probing results. We use four baselines: (i) max pooling; (ii) the best-performing pooling among the five methods mentioned in Section 2; (iii) max pooling after four additional layers of normal token-level attention; and (iv) fully-connected span-level attention (i.e., the O(n^4) full span-level attention without restriction). Overall, fully-connected span-level attention performs well compared to the pooling methods, validating the effectiveness of span-level attention. Furthermore, applying different attention patterns or pattern combinations not only reduces computational complexity but also significantly improves performance. This suggests that our proposed attention patterns capture more meaningful span interactions than unrestricted fully-connected span-level attention. Our method also outperforms token-level attention with additional layers, suggesting that the improvement is not merely due to having more parameters.
The optimal attention patterns vary across tasks. For tasks that emphasize structure, such as CTL, CTD and the detection task MED, attention patterns inspired by structural span interactions (Containment, Adjacency) perform better, as do pattern combinations involving them. This makes sense because grammatical structures are closely related to the structural span interactions within a sentence. For tasks that prioritize textual content, such as REF, the All-Token pattern performs better thanks to its attention over the entire input text. In SRC, we speculate that the Inside-Token pattern helps the model focus on the prefixes and suffixes produced by tokenization, improving performance on semantic roles. In NEL, a combination of the Containment and All-Token patterns strikes a balance between structure and semantics, leading to good performance.
In general, as shown in Table 8 in Appendix B, our method consistently outperforms the baseline models on all 6 tasks. Moreover, the best-performing attention pattern is the combination of the Inside-Token and Containment patterns. Because it considers both semantics and structure, this combination is a reliable choice across different tasks.
As Table 2 shows, significant improvements over max pooling are observed with span-level attention when BERT is frozen. Fine-tuning BERT further enhances overall performance. We speculate that combining span-level attention with stronger pretrained language models and carefully designed decoders would yield even better results. Specifically, the combination of Inside-Token and Containment/Adjacency performs well with a frozen BERT, which aligns with our observations from the probing tasks conducted under similar settings. When BERT is fine-tuned, token representations capture more comprehensive contextual information, allowing the All-Token pattern to enter the optimal combination. Table 9 in Appendix B also shows that the improvements brought by our method are consistent.

SpanBERT backbone
We also conduct experiments on the REF and SRC tasks with SpanBERT as the backbone to analyse the generalizability of our method. As Table 3 shows, span-level attention still brings performance gains after switching the backbone to SpanBERT, whether fine-tuned or not, further demonstrating the generalizability of our method. Note that fine-tuned SpanBERT with max pooling is a widely favored choice for span-related tasks, and our results show that applying span-level attention on top of it still brings improvements. More detailed results can be found in Table 10 in Appendix B.

Analysis
We analyse the effect of the number of span-level attention layers. We select three tasks, REF, SRC and CTL, to cover both semantic and structural tasks. For comparison, we compute the overall improvement of each attention pattern over the max pooling baseline and average it across the three tasks. As Table 4 shows, stacking 4 layers slightly outperforms the other two options.

Conclusion
We propose span-level attention to improve span representations. To reduce the O(n^4) complexity and encourage more meaningful span interactions, we incorporate different attention patterns that limit the scope of spans a particular span can attend to. Experiments on various tasks validate the efficiency and effectiveness of our method.

Limitations
We conduct an empirical study with extensive experiments to validate the effectiveness of our proposed method and attempt to derive further observations from the experiments. However, these observations lack solid theoretical explanations and insights. Moreover, it can be time-consuming to try different pattern combinations and pick the optimal one for a new task. One way to improve efficiency would be an automated attention pattern combiner based on reinforcement learning, which could serve as an important component of the model.

B Detailed results
We provide detailed experimental results in this section. Tables 7, 8 and 9 show averaged F1 scores along with standard deviations for the pilot experiments, probing tasks and nested NER, respectively.

Figure 1: Diagrams for the four attention patterns. Each cell represents a span, e.g., the orange cell in each diagram represents the span consisting of tokens x_1 to x_3. Orange cells represent target spans and blue cells represent the spans they can attend to.

Figure 2: Architecture of our model.

Table 2: Averaged F1 scores for nested NER with the baseline and different pattern combinations. We use the same indices as Table 1 in the first column to represent the same models.

Table 3: Averaged F1 scores for REF and SRC with the baseline and different pattern combinations when SpanBERT is used as the backbone. We use the same indices as Table 1 in the first column to represent the same models.

Table 4: Analysis of the number of span-level attention layers.