A Span-based Dynamic Local Attention Model for Sequential Sentence Classification

Sequential sentence classification aims to classify each sentence in a document based on the context in which the sentence appears. Most existing work addresses this problem with a hierarchical sequence labeling network but ignores the latent segment structure of the document, in which contiguous sentences often share coherent semantics. In this paper, we propose a span-based dynamic local attention model that explicitly captures this structural information through supervised dynamic local attention. We further introduce an auxiliary span-based classification task to exploit span-level representations. Extensive experiments show that our model achieves better or competitive performance against state-of-the-art baselines on two benchmark datasets.


Introduction
The goal of Sequential Sentence Classification (SSC) is to assign a rhetorical label to each sentence in a document (Jin and Szolovits, 2018). Unlike general sentence classification, which does not involve context, the rhetorical label of a sentence depends on the surrounding sentences. In the example shown in Figure 1, a five-sentence document from the NICTA dataset, each sentence carries a rhetorical label such as "background" or "outcome". The SSC task is crucial for downstream applications such as information retrieval (Edinger et al., 2017) and question answering.
Traditional statistical methods, such as HMMs (Lin et al., 2006) and CRFs (Hirohata et al., 2008; Hassanzadeh et al., 2014), rely heavily on numerous carefully hand-designed features. In contrast, recent end-to-end neural methods use hierarchical sequence encoders followed by a CRF layer to contextualize sentence representations and have achieved promising results. The first neural approach (Lee and Dernoncourt, 2016) combined an RNN with a CNN to incorporate preceding sentences as contextual content and used a CRF layer to optimize the predicted label sequence. Jin and Szolovits (2018) propose a hierarchical sequential labeling network that exploits contextual information from the surrounding sentences to help classify. Cohan et al. (2019) instead employ BERT (Devlin et al., 2018) to capture contextual dependencies without hierarchical encoding or a CRF layer. Yamada et al. (2020) introduce Semi-Markov CRFs (Ye and Ling, 2018) to assign a rhetorical label at the span level rather than to a single sentence.

Figure 1: An example from the NICTA dataset for the SSC task. The text has five sentences and is divided into two segments {(s_1, s_2), (s_3, s_4, s_5)} by its labels.

Nevertheless, the above-mentioned methods ignore the latent structural information of the document, i.e., the grouping of content into topically coherent segments. Intuitively, a segment of several contiguous sentences is expected to have more coherent semantics than text spanning different segments, e.g., the two segments in Figure 1. In this paper, we propose a novel span-based dynamic local attention model that explores the latent segment structure of a document for the SSC task. First, we introduce dynamic local attention guided by a segmentation supervision signal to focus on surrounding sentences with coherent semantics, which we call Supervised Dynamic Local Attention (SDLA).
Furthermore, we introduce an auxiliary task, span-based classification, which computes semantic representations of spans and classifies them to obtain predicted rhetorical labels. The dynamic local attention mechanism and the auxiliary task complement each other, enhancing the model's capacity to perceive segment structure and improving performance on the SSC task. Results on two benchmark datasets show that our method achieves better or competitive performance compared with state-of-the-art baselines.

Proposed Method
In this paper, we propose a span-based dynamic local attention model for sequential sentence classification with two novel components: supervised dynamic local attention and an auxiliary span-based classification task. The architecture of our model is shown in Figure 2.

Sentence Representations
For the SSC task, given a sequence of sentences X = {x_1, x_2, ..., x_N}, the model predicts the label of each sentence Y = {y_1, y_2, ..., y_N} based on the context in which the sentence appears, where N is the number of sentences. Following previous work (Yamada et al., 2020), we first feed each sentence into a BERT model pre-trained on PubMed (Peng et al., 2019) and extract the encoding of the [CLS] token as the sentence encoding S = {s_1, s_2, ..., s_N} (implemented with Sentence-BERT (Reimers and Gurevych, 2019)). We then employ two bidirectional LSTM layers to produce a context-informed sentence representation h_i^c ∈ R^d for the whole document:

H^c = {h_1^c, ..., h_N^c} = BiLSTM(s_1, ..., s_N).    (1)
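The encoding step above can be sketched as follows. This is an illustrative PyTorch implementation, not the authors' code: the sentence embeddings stand in for Sentence-BERT [CLS] vectors, and all class and variable names are our own.

```python
import torch
import torch.nn as nn

class SentenceContextEncoder(nn.Module):
    """Contextualize precomputed sentence embeddings with a 2-layer BiLSTM."""
    def __init__(self, embed_dim=768, hidden_dim=100):
        super().__init__()
        # bidirectional=True concatenates forward/backward states -> 2*hidden_dim
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                              bidirectional=True, batch_first=True)

    def forward(self, sent_embeds):            # (batch, N, embed_dim)
        h_c, _ = self.bilstm(sent_embeds)      # (batch, N, 2*hidden_dim)
        return h_c

encoder = SentenceContextEncoder()
S = torch.randn(1, 5, 768)                     # 5 sentences, as in Figure 1
H_c = encoder(S)
print(H_c.shape)                               # torch.Size([1, 5, 200])
```

With a hidden size of 100 per direction, the contextual representation has dimension 200, matching the hidden-state size reported in the implementation details.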

Supervised Dynamic Local Attention
In this section, we introduce dynamic local attention guided by a supervised segmentation signal to learn the latent segment structure of a document. First, we generate a sentence-level attention span for each sentence via trainable soft masking (Nguyen et al., 2020), using a pointing mechanism (Vinyals et al., 2015) to approximate the left and right boundary positions of the mask vector.

Figure 2: The overview of our model, exemplified by the sample in Figure 1. The labels 'b' and 'o' stand for "background" and "outcome", respectively. C_span denotes the auxiliary span-based classification task.

Given the query Q and key K, where Q = K = H^c,
we calculate the left and right boundary matrices φ_l, φ_r ∈ R^{N×N} for the query Q as follows:

φ_l = S((Q W_L^Q)(K W_L^K)^T + M_l),    (2)
φ_r = S((Q W_R^Q)(K W_R^K)^T + M_r),    (3)

where S is the softmax function (applied row-wise) and W_L^Q, W_L^K, W_R^Q, W_R^K ∈ R^{d×d} are trainable parameters. Eqs. (2)-(3) approximate the left and right boundary positions of the mask matrix for the query Q: each row approximates the mask vector over the entire document for the corresponding sentence in sequence order. Note that we additionally introduce the mask matrices M_l and M_r to ensure that the left boundary position l and the right boundary position r generated at position i satisfy 0 ≤ l ≤ i ≤ r ≤ N.
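The boundary computation can be sketched as follows. This is a minimal NumPy illustration under our own reading of the method: additive -inf masks enforce that, for the sentence at row i, the left boundary lies at j ≤ i and the right boundary at j ≥ i; the random weight matrices are stand-ins for the trainable parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def boundary_matrices(Q, K, W_Ql, W_Kl, W_Qr, W_Kr):
    """Row-wise distributions over left/right mask boundaries (a sketch)."""
    N = Q.shape[0]
    neg_inf = -1e9
    # positions strictly above the diagonal are invalid left boundaries (j > i)
    M_left = np.where(np.triu(np.ones((N, N)), k=1) == 1, neg_inf, 0.0)
    # positions strictly below the diagonal are invalid right boundaries (j < i)
    M_right = np.where(np.tril(np.ones((N, N)), k=-1) == 1, neg_inf, 0.0)
    phi_l = softmax((Q @ W_Ql) @ (K @ W_Kl).T + M_left)
    phi_r = softmax((Q @ W_Qr) @ (K @ W_Kr).T + M_right)
    return phi_l, phi_r

rng = np.random.default_rng(0)
N, d = 5, 8
H_c = rng.normal(size=(N, d))           # contextual sentence representations
Ws = [rng.normal(size=(d, d)) for _ in range(4)]
phi_l, phi_r = boundary_matrices(H_c, H_c, *Ws)
# row i places no left-boundary mass right of i, no right-boundary mass left of i
print(np.allclose(np.triu(phi_l, k=1), 0, atol=1e-6))   # True
print(np.allclose(np.tril(phi_r, k=-1), 0, atol=1e-6))  # True
```

Each row of φ_l and φ_r is a probability distribution over candidate boundary positions, which keeps the boundary choice differentiable.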
Given the above definitions, the attention span masking matrix M_a is obtained by compositing the left and right boundary matrices:

M_a = (φ_l L_N) ⊙ (φ_r L_N^T),    (4)

where L_N ∈ {0,1}^{N×N} denotes the upper-triangular matrix of ones and ⊙ is the element-wise product. We then combine self-attention with the attention span masking, enabling the model to focus on semantically related sentences around the target position and to eliminate noisy aggregations:

A = (H^c W^Q)(H^c W^K)^T ⊙ M_a,    (5)
H^att = S(A)(H^c W^H),    (6)

where W^Q, W^K, W^H are trainable parameters. However, in the absence of supervision, the dynamic local attention may fail to focus on the informative sentences around the target, especially with limited data, so we further introduce a segmentation signal to guide the learning of dynamic local attention toward capturing coherent semantics more accurately. Specifically, we employ a binary cross-entropy loss to measure the difference between the attention matrix A and the segmentation signal Y^att = [E_ij]:

L_att = -Σ_{i,j} [E_ij log σ(A_ij) + (1 - E_ij) log(1 - σ(A_ij))],    (7)

where σ is the sigmoid function and E_ij = 1 denotes that the i-th and j-th sentences are in the same segment (e.g., (s_1, s_2) and (s_4, s_5) in Figure 1). Finally, we concatenate H^c and H^att into the contextual representation H and add a CRF layer to classify each sentence.
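The mask composition and the supervised attention loss can be sketched as follows. This NumPy illustration follows our reading of the text: multiplying φ_l by the upper-triangular all-ones matrix L gives a factor that is ~1 at positions right of the left boundary, and φ_r L^T a factor that is ~1 at positions left of the right boundary, so their product softly selects the span [l, r] in each row. All names are our own.

```python
import numpy as np

def span_mask_and_loss(phi_l, phi_r, A, E):
    """Compose the attention-span mask M_a and the supervised BCE loss (a sketch)."""
    N = phi_l.shape[0]
    L = np.triu(np.ones((N, N)))               # upper-triangular matrix of ones
    M_a = (phi_l @ L) * (phi_r @ L.T)          # soft span mask per row
    p = 1.0 / (1.0 + np.exp(-A))               # sigmoid of attention scores
    eps = 1e-12                                # numerical safety for log
    L_att = -np.mean(E * np.log(p + eps) + (1 - E) * np.log(1 - p + eps))
    return M_a, L_att

# hard (one-hot) boundaries for illustration: every row spans sentences 1..3
N = 5
phi_l = np.zeros((N, N)); phi_l[:, 1] = 1.0
phi_r = np.zeros((N, N)); phi_r[:, 3] = 1.0
A = np.zeros((N, N))                           # raw attention scores
E = np.eye(N)                                  # same-segment indicator
M_a, L_att = span_mask_and_loss(phi_l, phi_r, A, E)
print(M_a[0])                                  # [0. 1. 1. 1. 0.]
```

With one-hot boundaries at positions 1 and 3, each row of M_a is exactly the binary window over sentences 1-3; with soft boundary distributions the window becomes a differentiable soft mask.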

Auxiliary Span-based Classification Task
Because sentences within a span exhibit obvious label consistency, we introduce an additional auxiliary task, span-based classification, to improve performance at the span level. To this end, we consider all possible spans of various lengths and propose a tagging scheme for span-based classification, which uses the same labels as the sentence level to represent the label of a span. First, we represent a span from the i-th to the j-th sentence as a vector h_ij, the concatenation of four vectors similar to Zhao et al. (2020):

h_ij = [h_i; h_j; ĥ_{i:j}; ϕ(j - i + 1)],    (8)

where ĥ_{i:j} is the attention output over the final sentence representations H within the span, and ϕ(j - i + 1) is a feature vector encoding the span size. We employ a cross-entropy loss for span-based classification:

L_span = -Σ_{i<j} F_ij log Ŷ_{ij}^span,    (9)

where Ŷ_{ij}^span is the output probability at the span level and F_ij = 1 denotes that the i-th and j-th sentences (with i < j) are in the same segment, i and j being the beginning and end of that segment, respectively (e.g., (s_1, s_2) and (s_3, s_5) in Figure 1).
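The four-way span representation can be sketched as follows. This is an illustrative NumPy version: the attention pooling and the span-size embedding are simplified stand-ins for the trainable components, and all names are our own.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def span_representation(H, i, j, w_att, size_embed):
    """Build h_ij for the span of sentences i..j (inclusive): [h_i; h_j; pooled; size]."""
    span = H[i:j + 1]                      # (span_len, d)
    alpha = softmax(span @ w_att)          # attention weights over the span
    h_hat = alpha @ span                   # attention-pooled span summary
    phi = size_embed[j - i + 1]            # span-size feature vector
    return np.concatenate([H[i], H[j], h_hat, phi])

rng = np.random.default_rng(0)
N, d, d_size = 5, 8, 4
H = rng.normal(size=(N, d))                    # final sentence representations
w_att = rng.normal(size=d)                     # stand-in attention parameter
size_embed = rng.normal(size=(N + 1, d_size))  # one embedding per span length
h_13 = span_representation(H, 1, 3, w_att, size_embed)
print(h_13.shape)                              # (28,) = 8 + 8 + 8 + 4
```

The resulting vector concatenates the two boundary sentences, a pooled summary of the span interior, and the span-size feature, so spans of any length map to a fixed-size input for the classifier.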

Objective Function
The overall objective function combines the cross-entropy losses L_sen and L_span for sentence- and span-based classification with the supervised attention loss L_att:

L = L_sen + λ_att L_att + λ_span L_span,    (10)

where λ_att and λ_span are hyperparameters balancing the strength of L_att and L_span.
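The combined objective amounts to a simple weighted sum, sketched below with the weights from the implementation details; the additive form is our reading of the text.

```python
def total_loss(L_sen, L_att, L_span, lam_att=0.3, lam_span=0.3):
    """Sentence-level CRF loss plus weighted supervised-attention and
    span-classification losses (illustrative values for the weights)."""
    return L_sen + lam_att * L_att + lam_span * L_span

print(total_loss(1.0, 0.5, 0.2))   # 1.0 + 0.3*0.5 + 0.3*0.2 = 1.21
```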

Implementation Details
We set the hidden-state size to 200 and apply dropout with probability 0.5 to the BiLSTM. Both hyperparameters λ_att and λ_span are set to 0.3. The batch size is 30. We use the Adam optimizer with learning rate 0.003 and weight decay 0.99 for training. For evaluation, we maximize the score from the sentence-level CRF to obtain the predicted labels of the corresponding sequence.

Experimental Results
Table 1 reports the performance of our approach against other methods on PubMed 20k RCT and NICTA-PIBOSO, respectively. The results of the other methods are taken from Yamada et al. (2020). Our model is significantly better than the other methods on NICTA-PIBOSO in both Sentence-F1 and Span-F1, and it achieves results comparable to Yamada et al. (2020) on PubMed 20k RCT. We attribute the remarkable performance on NICTA-PIBOSO, which has fewer training samples but a larger label space, to our model's ability to capture latent segment structure via the SDLA component and to improve span representations through the auxiliary span-based classification task.
In addition, Tables 2 and 3 show the detailed Span-F1 scores for each rhetorical label. Our model achieves better or similar performance compared with the other baselines, except for "other" on NICTA-PIBOSO and "background" on PubMed 20k RCT. We speculate that the sentence semantics corresponding to the "other" label are diverse and not clearly distinguishable from other labels, while "background" usually appears before "objective" and the sentence representations of the two are easily confused.

Segmentation Performance Evaluation
In particular, if we ignore the rhetorical labels of sentences and only consider the segment boundaries (i.e., binary classification of whether a position is a boundary), the task can be regarded as text segmentation (Koshorek et al., 2018). We evaluate the segmentation performance of our model using the probabilistic P_k error score (Beeferman et al., 1999), for which lower is better. The results are shown in the last column of Table 1. Our model consistently outperforms the other baselines, suggesting that it also contributes to the text segmentation task. (Please refer to Yamada et al. (2020) for the detailed calculation of Sentence-F1 and Span-F1.)
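The P_k error score can be sketched as follows: a window of width k slides over the document and, for each pair of sentences k apart, checks whether reference and hypothesis agree on their being in the same segment. This is a simplified illustration of the metric, not the evaluation code used in the paper.

```python
def pk_score(ref_segs, hyp_segs, k=None):
    """Probabilistic P_k error (Beeferman et al., 1999); lower is better.

    Segmentations are given as per-sentence segment ids.  By convention,
    k defaults to half the mean reference segment length (floored at 2).
    """
    n = len(ref_segs)
    if k is None:
        n_segments = len(set(ref_segs))
        k = max(2, int(round(n / n_segments / 2)))
    errors = 0
    for i in range(n - k):
        ref_same = ref_segs[i] == ref_segs[i + k]
        hyp_same = hyp_segs[i] == hyp_segs[i + k]
        errors += int(ref_same != hyp_same)   # penalize disagreement
    return errors / (n - k)

ref = [0, 0, 1, 1, 1]          # Figure 1: segments (s1, s2) and (s3, s4, s5)
print(pk_score(ref, ref))                 # 0.0 -- perfect segmentation
print(pk_score(ref, [0, 0, 0, 1, 1]))     # > 0 -- misplaced boundary penalized
```

Unlike exact boundary matching, P_k gives partial credit to near-miss boundaries, which makes it the standard metric for text segmentation.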

Ablation Study
To investigate the effectiveness of the designed components, we conduct an ablation study on the proposed model; the results are listed in Table 4. The SDLA component improves performance significantly, and the way we impose the supervised signal to guide the attention proves effective at yielding more true positives. The auxiliary span-classification task effectively improves Span-F1.

Attention Visualization and Case Study
As shown in Figure 3, by incorporating the supervised signal, the attention focuses on a continuous local span around the gold span. The visualization not only verifies the effectiveness of the supervised signal but also illustrates the interpretability of the proposed SDLA. Table 5 shows the output of the Base and Ours methods for an abstract from NICTA-PIBOSO. Our model correctly identifies the boundary between the spans labeled background (B) and other (O), showing that it benefits from capturing latent segment structure when identifying hard-to-distinguish segmentation boundaries.

Conclusion
In this paper, we propose a novel model for the SSC task that includes supervised dynamic local attention to explore the latent segment structure of a document and an auxiliary task to improve span-level representations. We demonstrate the effectiveness of our model on two datasets and find that it also performs well in the text segmentation scenario. In future work, we will consider jointly learning sequential sentence classification and text segmentation.