Bidirectional Masked Self-attention and N-gram Span Attention for Constituency Parsing



Introduction
The concept of attention has become a major aspect of deep learning, and improving attention is essential to enhancing model efficacy. In natural language processing (NLP), numerous studies that utilize the sequence-to-sequence model have achieved significant performance improvements by adapting the attention mechanism to specific tasks. Tasks such as summarization (Duan et al., 2019; Wang et al., 2018), translation (Zeng et al., 2021; Lu et al., 2021), question answering (Wang et al., 2021; Chen et al., 2019), and multi-modal learning (Nishihara et al., 2020; Liu et al., 2022) are examples of the efficacy of such mechanisms in improving model performance.
§ Work done while at Hanyang University.
In the constituency parsing task, which involves identifying constituent phrases and their relationships in a sentence, attention mechanisms, especially self-attention, improve the performance of a parser. Many studies on constituency parsing have emphasized the importance of comprehending sentence spans to improve parser performance (Cross and Huang, 2016; Stern et al., 2017; Gaddy et al., 2018). Recent studies that incorporate attention mechanisms train parsers to comprehend sentence spans by referring to the n-grams of a sentence as the span (Tian et al., 2020) or by considering the directional and positional dependencies of split word representations (Kitaev and Klein, 2018; Mrini et al., 2020).
However, because attention mechanisms compute the dependency of each element simultaneously, they can lack the directional information needed to form sentence spans. This contrasts with long short-term memory (LSTM) models, which consider directional information. In attention mechanisms that use attention weights between the query and key vectors as relational information between elements, the weights are computed regardless of the elements' relative positions. Previous studies (Kitaev and Klein, 2018; Mrini et al., 2020) acknowledged that this method could be problematic and made efforts to address it. However, such attempts were conducted under the assumption of ideal learning conditions, and the problem in the calculation process has persisted.
The purpose of this paper is to extend the attention mechanism with two capabilities. The first obtains explicit directional information for each word, similar to the approach used by bidirectional LSTMs (Figure 1(b)). The second enhances the representation of each word by incorporating information from spans, which is well suited to constituency parsing.
In this work, we propose a novel model called BNA (Bidirectional masked and Ngram span Attention). BNA employs a variant of masked self-attention (MSA) in which the attention weights consider the elements of a sequence in each direction sequentially, rather than all simultaneously. Moreover, BNA incorporates a novel span attention mechanism that builds a key-value matrix by subtracting the hidden states at the span boundaries. This approach enables the query (i.e., the word sequence) to access the contextual information of n-gram spans in a sentence.
Our parser achieves state-of-the-art performance with F1 scores of 96.47 and 94.15 on the Penn Treebank and Chinese Treebank datasets, respectively. In addition, through an ablation study and analysis, we demonstrate that our proposed BNA model effectively captures sentence structure by contextualizing each word in a sentence through bidirectional dependencies and by enhancing the span representation.

Related Work
In the field of constituency parsing, since the introduction of the span-based approach by Stern et al. (2017), chart-based neural parsers have outperformed transition-based ones (Zhang, 2020).
The span-based approach involves labeling specific text spans instead of individual tokens or words, enabling the parsers to consider the context and relationships between different spans of the sentence.
With the rise of the Transformer model (Vaswani et al., 2017) in NLP, attention mechanisms have become an attractive alternative to LSTM networks. In constituency parsing, attention mechanisms have shown promising results, as demonstrated by Kitaev and Klein (2018), who applied a self-attentive network to the span-based parser to improve performance. They split the input vector into content and position representations and performed self-attention on each component separately. Building on this work, Mrini et al. (2020) introduced label attention layers, a modified form of self-attention that enables the model to learn label-specific views of the input sentence. In this mechanism, the attention heads are split in half into forward and backward representations, which are then used to construct span vectors of the input sentence. More recently, Tian et al. (2020) proposed span attention, which assumes no strong dependency between the hidden vectors in a transformer-based encoder. Their method enhances the span representation by summing the attention vector of n-grams consisting of embedded word vectors with the span vector, without using directional vectors.
However, conventional attention mechanisms treat all elements simultaneously without considering directional dependencies, making it challenging to construct span vectors using an encoder based on the attention mechanism. Furthermore, constructing arbitrary span vectors from embedded words that lack sentence-level contextual information leaves room for improvement.
In this paper, we introduce two types of attention mechanisms that address the issue of directional dependencies and that strengthen span representation.

Background
Self-attention is a powerful mechanism that enables neural networks to capture dependencies between different parts of a sequence. The basic idea behind self-attention is to compute a representation of the entire sequence by weighting the importance of different elements in the sequence based on their similarity to each other.
In a typical self-attention sub-layer, the sequence of input vectors X = [x_1, ..., x_n] is transformed into three sequences of vectors: queries Q = [q_1, ..., q_n], keys K = [k_1, ..., k_n], and values V = [v_1, ..., v_n]. These sequences are computed using learned linear projections:

Q = XW_Q,  K = XW_K,  V = XW_V,   (1)

where W_Q, W_K, and W_V are learned weight matrices.
Attention weights α_{i,j} are computed as the dot product of the query vector q_i at position i and the key vector k_j at position j, normalized using the softmax function:

α_{i,j} = softmax_j(q_i · k_j / √d),   (2)

where d is the dimensionality of the key vectors; the √d factor prevents numerical instability. Finally, the weighted sum of the value vectors is computed using the attention weights:

h_i = Σ_j α_{i,j} v_j.   (3)

This weighted sum h_i can be seen as a hidden representation of the i-th vector that accounts for the importance of each of the other vectors in the sequence.
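As a concrete illustration, the projections and Eqs. (1)-(3) can be sketched in NumPy. This is a toy single-head version with random weights; all names and dimensions here are illustrative, not part of the paper's implementation.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (Eqs. (1)-(3))."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                # learned linear projections
    d = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # q_i . k_j / sqrt(d)
    alpha = np.exp(scores - scores.max(axis=-1, keepdims=True))
    alpha /= alpha.sum(axis=-1, keepdims=True)      # row-wise softmax
    return alpha @ V                                # h_i = sum_j alpha_ij v_j

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                         # 5 tokens, model dim 8
Wq, Wk, Wv = [rng.normal(size=(8, 4)) for _ in range(3)]
H = self_attention(X, Wq, Wk, Wv)
print(H.shape)                                      # (5, 4)
```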

Approach
Our approach is motivated by the problem that self-attention mechanisms struggle to encode the relative positions and sequential order of elements within the context of a sequence (Ambartsoumian and Popowich, 2018; Hahn, 2020). Studies have been conducted to resolve this issue in tasks that require bidirectional information, such as relation extraction (Du et al., 2018) and machine translation (Bugliarello and Okazaki, 2020). To address this issue, we propose the Bidirectional Masked Self-Attention (BiMSA) and N-gram Span Attention (NSA) mechanisms. Together, these two attention mechanisms comprise our Bidirectional masked and N-gram span Attention (BNA) model. Section 4.1 provides a brief overview of the constituency parsing process. Section 4.2 provides a more detailed explanation of BiMSA and NSA and how they are integrated into the BNA model.

Constituency Parsing
Constituency parsing is the process of analyzing the grammatical structure of a sentence by breaking it down into a set of labeled spans represented by the parse tree T. The tree T of a sentence is expressed as a set of labeled spans, where the fencepost positions of the t-th span are indicated by i_t and j_t, and the span has the label l_t.
The parser assigns a score s(T) to each parse tree T, which decomposes as

s(T) = Σ_{(i,j,l)∈T} s(i, j, l).

To generate the parse tree T for a given sentence X = [x_1, x_2, ..., x_n], the encoder first transforms the input sequence into a set of hidden representations H = [h_1, h_2, ..., h_n]. The hidden vector V_{i,j} for a span (i, j) is calculated as the difference between the start and end hidden vectors of that span, following the definition of Gaddy et al. (2018) and Kitaev and Klein (2018):

V_{i,j} = [h^f_j − h^f_i; h^b_i − h^b_j],

where h_k represents the hidden vector at position k and is constructed from two vectors from different directions, the forward vector h^f_k and the backward vector h^b_k. A multi-layer perceptron (MLP) classifier, which serves as the decoder, takes the hidden vector V_{i,j} as input and assigns a label score to each span. The optimal parse tree with the highest score can be identified efficiently through a variant of the CKY algorithm. To find the correct tree T*, the model is trained to satisfy the margin constraints for all trees T by minimizing the hinge loss

max(0, max_{T ≠ T*} [s(T) + ∆(T, T*)] − s(T*)),

where ∆ denotes the Hamming loss.
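The boundary-difference span feature above can be sketched in a few lines of NumPy. The hidden states here are random toy values and the dimensions are illustrative, not taken from the paper's configuration.

```python
import numpy as np

# Toy bidirectional hidden states for a 6-word sentence (dim 4 per direction),
# indexed by fencepost positions 0..n.
rng = np.random.default_rng(1)
n, d = 6, 4
h_f = rng.normal(size=(n + 1, d))   # forward-direction states
h_b = rng.normal(size=(n + 1, d))   # backward-direction states

def span_vector(i, j):
    """V_{i,j}: difference of the boundary hidden states in each direction,
    concatenated into one span feature."""
    return np.concatenate([h_f[j] - h_f[i], h_b[i] - h_b[j]])

V = span_vector(1, 4)
print(V.shape)  # (8,)
```

An empty span (i = j) yields the zero vector, which matches the intuition that it contains no words.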

BNA
The proposed BNA encoder is composed of two variants of the transformer encoder layer: a BiMSA layer and an NSA layer. The overall architecture of the parser is illustrated in Figure 2. The BiMSA layer is composed of BiMSA and the position-wise feed-forward network (FFN) with residual connections. The BiMSA layer is computed as follows:

Z^l = LN(H^{l−1} + BiMSA(H^{l−1})),
H^l = LN(Z^l + FFN(Z^l)),

where H^{l−1} is the hidden state of the previous encoder layer and LN(·) denotes layer normalization.
The NSA layer has the same structure as the BiMSA layer, but uses NSA instead of BiMSA:

Z^l = LN(H^{l−1} + NSA(H^{l−1})),
H^l = LN(Z^l + FFN(Z^l)).

Overall, BNA has a sequential structure that first contextualizes each word by leveraging both sequential and directional dependencies in the BiMSA layers and then enhances the span representation in the NSA layers.
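The residual-plus-layer-norm wrapping shared by both layers can be sketched as follows. The sub-layer is left as a placeholder (identity) and the FFN weights are random; everything here is a minimal illustration of the layer equations, not the paper's implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's vector to zero mean, unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def encoder_layer(H, sublayer, ffn):
    """Residual connection + layer norm around a sub-layer (BiMSA or NSA),
    followed by a position-wise FFN with the same wrapping."""
    Z = layer_norm(H + sublayer(H))
    return layer_norm(Z + ffn(Z))

rng = np.random.default_rng(5)
W1, W2 = rng.normal(size=(8, 16)), rng.normal(size=(16, 8))
ffn = lambda x: np.maximum(x @ W1, 0) @ W2   # position-wise FFN with ReLU
sublayer = lambda x: x                       # stand-in for BiMSA / NSA
H = encoder_layer(rng.normal(size=(5, 8)), sublayer, ffn)
print(H.shape)                               # (5, 8)
```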

Bidirectional Masked Self-Attention
BiLSTM uses forward and backward recurrent operations to produce an output vector with sequence information as an inductive bias. However, attention-based models compute attention weights solely based on the similarity between the query and key vectors and do not consider the order of elements in the sequence, making it challenging to incorporate sequence directionality.
To overcome this constraint, we introduce BiMSA, which captures the directional dependency of the context that is crucial for constructing a span vector, by adding a hard mask M to the scaled dot product of the query and key (Figure 1(a)). In this way, Eq. (2) is redefined as follows:

α_{i,j} = softmax_j(q_i · k_j / √d + M_{i,j}).

When M_{i,j} is negative infinity, the word at position j does not contribute to the representation at position i. Conversely, when M_{i,j} is 0, the mask does not influence the attention weights.
The mask is divided into two distinct directional segments, namely the forward mask M^F and the backward mask M^B:

M^F_{i,j} = 0 if j ≤ i, −∞ otherwise;
M^B_{i,j} = 0 if j ≥ i, −∞ otherwise.

We apply the forward and backward masks separately to split the directional representation of each word into its respective forward and backward components. Eq. (3) is redefined as follows:

h^F_i = Σ_j α^F_{i,j} v_j,   h^B_i = Σ_j α^B_{i,j} v_j.

The output of BiMSA is produced by concatenating the two directional hidden states into a single output representation. By using directional masks, words are constrained to attend solely to the preceding or subsequent words, enabling the model to capture temporal dependencies more effectively. We intentionally separate the bidirectional representations to construct spans from the hidden states of words. Further details are described in the following section.
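A minimal NumPy sketch of the directional masking above, assuming the convention that the forward mask permits attention to positions j ≤ i and the backward mask to j ≥ i; the weights and dimensions are toy values.

```python
import numpy as np

def directional_masks(n):
    """Forward mask lets position i attend to j <= i; backward to j >= i."""
    M_f = np.where(np.tril(np.ones((n, n), dtype=bool)), 0.0, -np.inf)
    M_b = np.where(np.triu(np.ones((n, n), dtype=bool)), 0.0, -np.inf)
    return M_f, M_b

def masked_attention(Q, K, V, M):
    # Scaled dot-product attention with an additive hard mask (redefined Eq. (2)).
    d = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d) + M
    scores -= scores.max(axis=-1, keepdims=True)   # exp(-inf) -> 0
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=-1, keepdims=True)
    return alpha @ V

rng = np.random.default_rng(2)
Q = K = V = rng.normal(size=(5, 4))
M_f, M_b = directional_masks(5)
H_f = masked_attention(Q, K, V, M_f)
H_b = masked_attention(Q, K, V, M_b)
H = np.concatenate([H_f, H_b], axis=-1)            # bidirectional output
print(H.shape)                                     # (5, 8)
```

Note that the first word's forward half can only attend to itself, so it reduces to its own value vector, mirroring the initial state of a forward LSTM.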

N-gram Span Attention
The key aspect of constituency parsing is to accurately predict the contextual features of a span, represented by V_{i,j}. Achieving this goal requires a more fine-grained approach to modeling the contextual features.
Previous studies in constituency parsing have empirically shown that encoding spans through the subtraction of bidirectional hidden states can be effective (Stern et al., 2017; Kitaev and Klein, 2018; Kitaev et al., 2019; Zhou and Zhao, 2019; Mrini et al., 2020); this approach corresponds to a bidirectional variant of the LSTM-Minus features proposed by Wang and Chang (2016). In addition, Tian et al. (2020) recently showed that span attention can be effective for enhancing span representation. Inspired by these empirical findings, our novel approach NSA enables each word to reference information from various sizes of n-gram spans created from contextualized hidden states.
NSA begins by constructing an n-gram span matrix. First, the hidden states H from the previous layer are split into the forward and backward representations H^F and H^B, respectively. Arbitrary span vectors are constructed by applying element-wise subtraction to the separated bidirectional hidden states, taking the same form as Eq. (6):

S_{i,j} = [h^F_j − h^F_i; h^B_i − h^B_j].

The n-gram of an arbitrary span is adjusted by varying the distance between the positions i and j.
The n-gram span matrix is constructed by concatenating the span vectors of all 1- to n-gram sequences, as follows:

Span_N = [Span_1; Span_2; ...; Span_n],

where Span_k stacks the vectors S_{i,i+k} for every k-gram in the sentence. A detailed computational process for constructing the n-gram span matrix is provided in Appendix A.3.
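The span-matrix construction can be sketched as a double loop over n-gram sizes and start positions. The hidden states are toy random values; the function name and shapes are illustrative.

```python
import numpy as np

def ngram_span_matrix(h_f, h_b, n_max):
    """Stack boundary-difference span vectors for all 1- to n_max-gram spans."""
    T = h_f.shape[0] - 1                 # number of words (fenceposts 0..T)
    spans = []
    for n in range(1, n_max + 1):        # span length in words
        for i in range(T - n + 1):       # span (i, i + n)
            spans.append(np.concatenate([h_f[i + n] - h_f[i],
                                         h_b[i] - h_b[i + n]]))
    return np.stack(spans)

rng = np.random.default_rng(3)
h_f = rng.normal(size=(7, 4))            # 6 words -> 7 fenceposts, dim 4
h_b = rng.normal(size=(7, 4))
S = ngram_span_matrix(h_f, h_b, 3)
print(S.shape)                           # (6 + 5 + 4, 8) = (15, 8)
```

For a 6-word sentence with n = 3 there are 6 unigram, 5 bigram, and 4 trigram spans, hence 15 rows.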
In NSA, the query is projected from the word representations, while the key and value are projected from the span representations. The attention process enables each word to reference the contextual features of its corresponding spans. Eq. (1) is redefined as:

Q = H W_Q,  K = Span_N W_K,  V = Span_N W_V.

The subsequent computations are carried out in the same manner as the self-attention process described in Section 3.
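The word-to-span attention step can be sketched as cross-attention in which the queries come from words and the keys/values from the span matrix. Weights and sizes are toy values for illustration only.

```python
import numpy as np

def nsa_attention(H_words, spans, Wq, Wk, Wv):
    """Words attend over n-gram span vectors: Q from words, K/V from spans."""
    Q = H_words @ Wq
    K, V = spans @ Wk, spans @ Wv
    d = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (words x spans)
    alpha = np.exp(scores - scores.max(axis=-1, keepdims=True))
    alpha /= alpha.sum(axis=-1, keepdims=True)
    return alpha @ V                               # span-enhanced word vectors

rng = np.random.default_rng(4)
H_words = rng.normal(size=(6, 8))                  # 6 contextualized words
spans = rng.normal(size=(15, 8))                   # e.g. all 1- to 3-gram spans
Wq, Wk, Wv = [rng.normal(size=(8, 4)) for _ in range(3)]
out = nsa_attention(H_words, spans, Wq, Wk, Wv)
print(out.shape)                                   # (6, 4)
```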
NSA allows each word to reference the contextual information of its corresponding spans. It can also handle the diverse tree structures of sentences by incorporating relational information with other spans within the sentence. For instance, in the sentence "The cat sat on the mat." the word "cat" incorporates span information that can be grouped as a constituent by referencing the contextual features of both the 2-gram span "The cat" and the 4-gram span "sat on the mat".

Datasets
To evaluate the performance of our constituency parsing model on different languages, we conduct experiments on the Penn Treebank (PTB) (Marcus et al., 1993) dataset for English and the Penn Chinese Treebank 5.1 (CTB5.1) (Xue et al., 2005) dataset for Chinese. We use the standard data splits for both PTB and CTB5.1.

Implementation details
To ensure a fair comparison with previous studies, we construct our model with and without the use of pre-trained models as the basic encoder. For the experiments on PTB, we utilize the cased large versions of the pre-trained BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019) models, while for CTB5.1, we use the pre-trained BERT base model. Following Tian et al. (2020), we use the default hyperparameter settings of the pre-trained models. Kitaev and Klein (2018) experimentally demonstrated that using a character LSTM (CharLSTM) instead of word embeddings can enhance parsing accuracy. Therefore, to provide a fair comparison, we compare the test performance of a model that incorporates CharLSTM when a pre-trained model is not used.
In line with Kitaev and Klein (2018), Mrini et al. (2020), and Tian et al. (2020), we compare the performance of our models with and without part-of-speech (POS) tags. The POS tags are predetermined for the input sentences using the Stanford tagger (Toutanova et al., 2003). The POS tags of a given sentence are passed through the embedding layer and added element-wise to the hidden word vectors of the sentence to form the input of the model.
In our proposed NSA approach, the length of the n-gram sequence, n, must be designated as a hyperparameter. We test the performance of our model by setting n to 2, 3, 4, and 5, respectively, and select the model with the highest performance to compare with those of previous studies. The experimental results when n is varied under the same parameter setting can be found in Section 5.5.3.
Further details on the setting of the hyperparameters for our models in all experiments are provided in Appendix A.1.

Performance comparison
The experimental results of our models and those of previous studies on the test sets are presented in Table 1 and Table 2. Our models outperform the previous state-of-the-art results on both datasets. Specifically, our BNA model, which does not use POS tags but employs a pre-trained XLNet model, achieves state-of-the-art performance with an F1 score improvement of 0.06, surpassing the improvement range of 0.01 to 0.02 observed in recent models. Furthermore, the recall and precision scores show uniform improvement without bias, resulting in the highest scores among all the methods.
In the CTB5.1 dataset experiments, our models outperform the previous results by a larger margin than in the PTB experiments. Our model that uses POS tags exceeds the previous best performance and achieves state-of-the-art performance with an F1 score improvement of 0.56.
These improved results demonstrate the effectiveness of our BNA model in resolving the critical problem of constructing span representations from the hidden states of words, a problem caused by the lack of dependencies between elements in attention mechanisms.

Ablation study
To evaluate the effectiveness of the BiMSA and NSA modules in the BNA model, we conduct an ablation study. We compare our models against a baseline composed solely of self-attention layers, which uses the same self-attention mechanism as the transformer encoder. The hyperparameters of each model in the ablation study follow the best-performing model in Table 1. The results for the PTB test split are presented in Table 3, while the results for the CTB test split can be found in Appendix A.2.
The results demonstrate a consistent improvement in performance. Specifically, while the performance of the BiMSA-only model is comparable or inferior to that of self-attention, the inclusion of NSA leads to a performance improvement that surpasses the self-attention-only model. Using a pre-trained model and POS tags is beneficial for improving performance, a finding consistent with the results of previous studies. In particular, POS tags lead to a greater performance improvement in Chinese than in English. We also observed a diminishing improvement when the model used a pre-trained model as the encoder, which suggests that the pre-trained model may already possess patterns or knowledge related to POS tags.
Overall, it can be observed that the BiMSA and NSA models complement each other while continuously improving performance on both datasets.

Directional feature for Parsing
In this section, we investigate whether BiMSA can address the lack of directional and relative positional dependencies between words. We conduct a performance comparison between the BiMSA single model and the self-attention model, evaluating their performance on the test dataset using the F1 score metric. The results for the PTB test split are presented in Table 4, while the results for the CTB test split can be found in Appendix A.2. Similar to the previous ablation study results, the single BiMSA model exhibits comparable or lower performance than the single self-attention model. However, the addition of NSA significantly improves performance. This suggests that combining a model with insufficient temporal dependency and NSA may lead to a decrease in performance, whereas the performance enhancement in BiMSA can be attributed to the synergistic effect between the BiMSA and NSA layers.
The directional and relative positional dependencies captured by the BiMSA module enable the BNA model to better handle complex syntactic structures, which is demonstrated by the higher F1 score on both the CTB5.1 and PTB datasets. This finding indicates that directional features are essential for improving parsing model performance, particularly for tasks with complex sentence structures. Moreover, the advantage of using the BNA model is even more significant for Chinese datasets, which are known for having more complex sentence structures than English.

Span Attention
In this section, we explore the impact of the number of NSA layers in the BNA model. Specifically, we train and evaluate models with 1, 3, 5, and 8 NSA layers, including a variant in which the order of the layers alternates between the BiMSA and NSA layers. We maintain the total number of layers in the model at 8, and we use the same hyperparameters as those of the single model. Figure 3 illustrates the experimental results, where "Alt" refers to the alternating model.
The results demonstrate that a reduced number of NSA layers leads to superior performance. This finding suggests that conducting span attention with a lack of dependency between the words in the given sentence may degrade performance. In particular, a model structure that alternates between the BiMSA and NSA layers shows no significant difference from one that consists entirely of NSA layers.
Overall, our experiments suggest that the selection of the number of NSA layers in the BNA model should be carefully considered, and a reduced number of layers may prove to be more effective.

Variations of the N-gram
To determine the optimal n-gram length for each language used in the NSA module, we conduct experiments using the best-performing BNA models in both English and Chinese. To compare the results, we vary n from 2 to 5 while keeping all other hyperparameters constant.
As shown in Figure 4, the results indicate that an n-gram length of 4 achieves the highest performance on PTB, while a 3-gram does on CTB5.1. However, extending the n-gram length beyond a certain point can decrease model performance. As n increases, the arbitrary span becomes more similar to the given sentence; as a result, referring to a broader range of spans can dilute the span information that corresponds to each word.
However, since constituents are hierarchically composed of 2-3 words or constituents, the NSA layer allows words to refer to arbitrary spans of various positions, enabling the representation of longer spans even with a shorter span length.While it may be necessary to adjust the arbitrary span length that each word refers to depending on the language, constructing a wide range of arbitrary spans is not essential for representing sentences as constituent trees.

Conclusions
The primary goal of this study was to design attention mechanisms that capture the explicit dependencies between words and enhance the representation of the output span vectors. Through our experiments, we demonstrated that our proposed BiMSA more effectively contextualizes each word in a sentence by considering bidirectional dependencies, while NSA improves the span representation by attending to arbitrary n-gram spans. Our findings have major implications for span-based approaches in constituency parsing tasks. Specifically, applying the span representation method to the attention mechanism leads to a significant performance improvement.
In conclusion, constructing a span representation from words contextualized within a given sentence can lead to additional improvement in parsing. Overall, our study contributes to the advancement of attention mechanisms in NLP. We hope that our findings will inspire further research in this area.

Limitations
The size of the model remains a significant issue for high-performance inference, especially for preprocessors that deconstruct and analyze the sentence structure before understanding it. Using a costly parser in real-time machine learning tasks can present limitations, as rapid data processing is a crucial objective in this area of research. To address this concern, future studies should focus on developing a lightweight span attention module that considers bidirectional dependencies.
Although the n-gram span attention operation can be robust for trees of various sizes and structures, it involves concatenating n-grams from 1 to n to create an n-gram span matrix, making it a heavy operation. This limitation becomes increasingly evident as sentences become longer, resulting in slower training when compared with existing parsers in our comparative experiments. Tian et al. (2020) suggested categorizing the n-grams extracted from a span (i, j) by their length so that n-grams in different categories are weighted separately, instead of using all n-grams. It may be helpful to modify the attention to focus only on a limited range of spans to improve the speed of the n-gram span attention module. This modification remains future work.

A Appendix
A.1 Further implementation details
We employ a grid search to identify the optimal parameter settings for our model with a random seed fixed at 42. The parameter tuning was conducted across various ranges, including learning rates of 1e-5, 2e-5, and 3e-5, batch sizes of 50, 100, and 200, n-gram values of 1, 2, 3, and 4, and dropout ratios of 0.1 and 0.2 on the development set.
In the PTB dataset experiments, the optimal model achieves the highest performance with a learning rate of 2e-5, a batch size of 200, and an n-gram value of 4 for the NSA layer. The dropout ratios for the residual connections, feed-forward layer, attention, and CharLSTM morphological representations were 0.2, 0.2, 0.2, and 0.1, respectively.
In the CTB5.1 dataset experiments, the most successful model uses a learning rate of 3e-5, a batch size of 50, and an n-gram value of 3 for the NSA layer. The dropout ratios for the residual connections, feed-forward layer, attention, and CharLSTM morphological representations were 0.1, 0.1, 0.1, and 0.2, respectively.
Both experiments employed identical model sizes, with a model dimensionality of 512 and a feed-forward layer size of 1024.The query/key/value sizes were set to 64, except in the BiMSA layer, where the value size was halved to 32 for split forward and backward computations.
When the parser utilizes a pre-trained model, the number of layers is set to 2. In contrast, when a single model is employed without a pre-trained model, the architecture employs 8 layers. Additionally, to enhance the training speed and performance of the single model, a batch size of 250 and a learning rate of 0.0008 are employed.
All parsers, including those utilizing pre-trained models, were trained within 12 hours. Training was conducted using a single NVIDIA RTX A5000 GPU for each parser. The parser without a pre-trained model has 15.9 million parameters, while the parser with a pre-trained model, which has 2 layers, has 4.7 million parameters.

A.2 Further experimental results
Table A1 presents the ablation study results conducted on the CTB dataset, while Table A2 shows the performance comparison between the BiMSA and self-attention models on the same dataset. The full results of our ablation experiments are given in Table A3 and Table A4.

A.3 Procedure of constructing arbitrary span matrix
The separated bidirectional word representations, namely H^F and H^B, are used to construct span matrices ranging from 1-gram to n-gram. The completed span matrices, Span^F_N and Span^B_N, are concatenated to form a single Span_N. The specific computation procedure for constructing an arbitrary n-gram span matrix from bidirectional word features is presented in Figure 5.

Figure 1: Comparison of the process of capturing directional information from words using the BiMSA (a) and BiLSTM (b) methods in a matrix representation. In BiMSA (a), the gray area in the attention score refers to the region where directional masking has been applied.

Figure 2: Our parser combines a chart decoder with an encoder, the proposed BNA model. The right side of the figure illustrates the procedure of each attention mechanism when the input sentence X is provided. The multiplication symbol denotes matrix multiplication, and the summation and subtraction symbols represent element-wise summation and subtraction, respectively.

Figure 3: Comparison of the variants in NSA layers of our best-performing model and their corresponding test set F1 scores.

Figure 4: Comparison of the variants in the n-grams of our best-performing model and their corresponding test set F1 scores. Red stars represent our best-performing result.

Figure 5: Detailed procedure for constructing an arbitrary n-gram span matrix in the NSA module.

Table 1: Comparison of labeled recall (LR), labeled precision (LP), and F1 scores of our models with those of previous studies on the PTB test dataset. Models with ♣ are trained in our experimental environment.

Table 3: Ablation study of the effectiveness of each approach on the PTB test split. The models that do not utilize BiMSA and NSA both employ a self-attention layer. PLM denotes the pre-trained XLNet model.

Table 4: Comparison between the BiMSA and self-attention approaches on the PTB test split. ∆ indicates the difference between the model performances. PLM denotes the pre-trained XLNet model.

Table A1: Ablation study of the effectiveness of each approach on the CTB test split. The models that do not utilize BiMSA and NSA both employ a self-attention layer. PLM denotes the pre-trained BERT model.

Table A2: Comparison between the BiMSA and self-attention approaches on the CTB test split. ∆ indicates the difference between the model performances. PLM denotes the pre-trained BERT model.

Table A3: Full results of the ablation study on the PTB test split. PLM denotes the pre-trained XLNet model.

Table A4: Full results of the ablation study on the CTB test split. PLM denotes the pre-trained BERT model.