Structure-aware Fine-tuning of Sequence-to-sequence Transformers for Transition-based AMR Parsing

Predicting linearized Abstract Meaning Representation (AMR) graphs using pre-trained sequence-to-sequence Transformer models has recently led to large improvements on AMR parsing benchmarks. These parsers are simple and avoid explicit modeling of structure, but lack desirable properties such as graph well-formedness guarantees or built-in graph-sentence alignments. In this work we explore the integration of general pre-trained sequence-to-sequence language models and a structure-aware transition-based approach. Starting from a pointer-based transition system, we propose a simplified transition set designed to better exploit pre-trained language models for structured fine-tuning. We also explore modeling the parser state within the pre-trained encoder-decoder architecture, and different vocabulary strategies for the same purpose. We provide a detailed comparison with recent progress in AMR parsing and show that the proposed parser retains the desirable properties of previous transition-based approaches, while being simpler and reaching the new parsing state of the art for AMR 2.0, without the need for graph re-categorization.


Introduction
The task of Abstract Meaning Representation (AMR) parsing translates a natural sentence into a rooted directed acyclic graph capturing the semantics of the sentence, with nodes representing concepts and edges representing their relations (Banarescu et al., 2013). Recent works utilizing pre-trained encoder-decoder language models show great improvements in AMR parsing results (Xu et al., 2020; Bevilacqua et al., 2021). These approaches avoid explicit modeling of the graph structure. Instead, they directly predict the linearized AMR graph, treated as free text. While the use of pre-trained Transformer encoders is widespread in AMR parsing, the use of pre-trained Transformer decoders is recent and has proven very effective, yielding the current state-of-the-art results (Bevilacqua et al., 2021).
These approaches however lack certain desirable properties. There are no structural guarantees of graph well-formedness, i.e. the model may predict strings that cannot be decoded into valid graphs, and post-processing is required. Furthermore, predicting AMR linearizations ignores the implicit alignments between graph nodes and words, which provide a strong inductive bias and are useful for downstream AMR applications (Mitra and Baral, 2016; Liu et al., 2018; Vlachos et al., 2018; Kapanipathi et al., 2021). On the other hand, transition-based AMR parsers (Wang et al., 2015; Ballesteros and Al-Onaizan, 2017a; Astudillo et al., 2020; Zhou et al., 2021) operate over the tokens of the input sentence, generating the graph incrementally. They implicitly model graph structural constraints through transitions and yield alignments by construction, thus guaranteeing graph well-formedness. However, it remains unclear whether explicit modeling of structure is still beneficial for AMR parsing in the presence of powerful pre-trained language models and their strong free text generation abilities.
In this work, we integrate pre-trained sequence-to-sequence (seq-to-seq) language models with the transition-based approach for AMR parsing, and explore to what degree they are complementary. To fully utilize the generation power of the pre-trained language models, we propose a transition system with a small set of basic actions, a generalization of the action-pointer transition system of Zhou et al. (2021). We use BART (Lewis et al., 2019) as our pre-trained language model, since it has shown significant improvements in linearized AMR generation (Bevilacqua et al., 2021). Unlike previous approaches that directly fine-tune the model with linearized graphs, we modify the model structure to work with our transition system, and encode parser states in BART's attention mechanism (Astudillo et al., 2020; Zhou et al., 2021). We also explore different vocabulary strategies for action generation. These changes convert the pre-trained BART into a transition-based parser where graph constraints and alignments are internalized.
We provide a detailed comparison with top-performing AMR parsers and perform ablation experiments showing that our proposed transition system and BART modifications are both necessary to achieve strong performance. Although BART has great language generation capacity, it still benefits from parser state encoding with hard attention, and can efficiently learn structural output. Our model establishes a new state of the art for AMR 2.0 while maintaining graph well-formedness guarantees and producing built-in alignments.

Intricacies of AMR Parsers
A frequent complaint about AMR parsers is that they combine many different techniques and hand-crafted rules, resulting in complex pipelines that are hard to analyze and that generalize poorly. This situation has notably improved in the past few years, but there are still two main sources of complexity present in almost all recent parsers: graph re-categorization and subgraph actions.
Graph re-categorization (Wang and Xue, 2017; Lyu and Titov, 2018; Zhang et al., 2019a) normalizes the graph prior to learning. This includes joining certain subgraphs, such as entities, dates and other constructs, into single nodes, removing special types of nodes like polarity, and normalizing PropBank names. An example of common normalizations is displayed in Figure 1. Training and decoding of models using this technique happen in this re-categorized space. Re-categorized graphs are expanded back to normal valid AMR graphs in a post-processing stage. The type and number of subgraphs normalized vary across implementations, but most high-performing approaches (Cai and Lam, 2020; Bevilacqua et al., 2021) utilize the re-categorization described in Appendix A.1 of Zhang et al. (2019a). This version requires an external Named Entity Recognition (NER) system to anonymize named entities, both at train time and test time. It also makes use of look-up tables for nominalizations (e.g. English to England) and other hand-crafted rules. Graph re-categorization has been criticised for its lack of generalization to new domains, such as the biomedical domain or even the AMR 3.0 corpus (Bevilacqua et al., 2021). Recent top-performing systems, e.g. Cai and Lam (2020) and Bevilacqua et al. (2021), also provide results without re-categorization, but this is shown to hurt performance notably on the AMR 2.0 corpus.
Subgraph actions (Ballesteros and Al-Onaizan, 2017b) are used in transition-based systems and play a role similar to re-categorization. Instead of normalizing and reverting, transition-based parsers apply a subgraph action that generates an entire subgraph at once. These subgraph actions coincide with many of the subgraphs collapsed in re-categorization. Subgraph actions, however, bring fewer external dependencies, since the parser learns to segment and identify subgraphs during training. They still suffer from data sparsity, however, since some subgraphs appear very few times. As in re-categorization, subgraph actions also make use of look-up tables for nominalization and similar constructs that hinder generalization. Furthermore, they create the problem of unattachable nodes. This was addressed in Zhou et al. (2021) by ignoring subgraphs for a set of heuristically determined cases. Subgraph actions have been used in all transition-based AMR systems (Naseem et al., 2019; Astudillo et al., 2020; Zhou et al., 2021).

A Simplified Transition System
In this section we propose a transition system for AMR parsing designed with two objectives: maximize the use of strong pre-trained decoders such as BART, and minimize the complexity and dependencies of the transition system compared to previous approaches. Similarly to Zhou et al. (2021), we scan the sentence from left to right and use a token cursor to point to a source token at each step. Parser actions can either shift the cursor one token forward or generate any number of nodes and edges while the cursor points to the same token. The proposed set of actions is as follows:

SHIFT moves the token cursor one word to the right.
COPY creates a node whose name is the token under the current cursor position.
LA(j,LBL) creates an arc with label LBL from the last generated node to the node generated at the j-th transition step.
RA(j,LBL) is the same as LA but with the arc direction reversed.
ROOT declares the last predicted node as the root.
Unlike previous transition-based approaches, we do not use a reserved action, such as PRED (Zhou et al., 2021) or CONFIRM (Ballesteros and Al-Onaizan, 2017b), to predict nodes; instead we directly use the node name <string> as the action symbol generating that node. This opens the possibility of utilizing BART's target-side pre-trained vocabulary. We avoid using any copy actions that involve copying from lemmatizer outputs or look-up tables. Our COPY action is limited to copying the lowercased word. We also eliminate the use of SUBGRAPH (Zhou et al., 2021) or ENTITY (Ballesteros and Al-Onaizan, 2017b) actions producing multiple connected nodes simultaneously, as well as the MERGE action creating spans of words. In previous approaches these actions were derived from alignments or hand-crafted, and thus did not cover all possible cases, limiting the scalability of the approach. Finally, we discard the REDUCE action previously used to delete a source token. The same effect can be achieved by simply using SHIFT without performing any other action. Figure 2 shows an example sentence with an action sequence and the corresponding graph. This can be compared with the handling of verbalization and named entities in Figure 1.
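As an illustration, the five actions above can be executed by a small state machine. The following sketch is ours, not the released implementation; the function name and data layout are assumptions made for exposition:

```python
# Minimal sketch of the proposed five-action transition system.
# A state machine consumes an action sequence and incrementally builds
# (nodes, edges, root) while a cursor walks the source tokens left to right.

def run_transitions(tokens, actions):
    nodes = {}    # creating step -> node name
    edges = []    # (source step, label, target step)
    root = None
    cursor = 0    # index of the token currently under the cursor
    last = None   # step of the most recently created node

    for step, action in enumerate(actions):
        if action == "SHIFT":
            cursor += 1
        elif action == "COPY":
            nodes[step] = tokens[cursor].lower()
            last = step
        elif action == "ROOT":
            root = last
        elif action.startswith("LA(") or action.startswith("RA("):
            j, label = action[3:-1].split(",", 1)
            j = int(j)
            if action.startswith("LA("):
                edges.append((last, label, j))  # arc from last node to step j
            else:
                edges.append((j, label, last))  # direction reversed
        else:
            # the node name itself is the action symbol
            nodes[step] = action
            last = step
    return nodes, edges, root
```

For the sentence "He works", the sequence COPY, SHIFT, work-01, LA(0,:ARG0), ROOT, SHIFT would yield the nodes he and work-01 connected by an :ARG0 arc, with work-01 as root.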
To train a parser with a transition system, we need an action sequence for each training sentence that produces the gold graph when executed. This action sequence then serves as the target for seq-to-seq models. A simple rule-based oracle algorithm creates these ground-truth sequences given a sentence, its AMR graph and node-to-word alignments. At each step, the oracle tries the following possibilities in the order listed and performs the first one that applies:
1. Create the next gold arc between the last created node and previously created nodes;
2. Create the next gold node aligned to the token under the cursor;
3. If not at the sentence end, SHIFT the source cursor;
4. Finish the oracle.
If possible, nodes are generated by COPY and otherwise with <string> actions. Arcs are generated with LA and RA, connecting the nodes closer to the current node before the ones that are farther away. Note that the arcs are created by pointing to positions in the action history, where a graph node is represented by the action that creates it, following Zhou et al. (2021). Multiple nodes can be generated at a single source word before the cursor is moved by SHIFT. When multiple nodes are aligned to the same token, these nodes are generated in a predetermined topological order of graph traversal, interleaved by edge creation actions. ROOT is applied as soon as the root node is generated.
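A simplified version of this oracle can be sketched as follows. This is our reading of the procedure, not the released code; the function name and data layout are assumptions, and the predetermined node ordering is reduced here to the order in which aligned nodes are listed:

```python
# Sketch of a rule-based oracle: given a sentence, its gold graph and
# complete node-to-word alignments, emit an action sequence that
# recreates the graph when executed.

def oracle(tokens, gold_nodes, gold_edges, align, root):
    # gold_nodes: node_id -> concept name; gold_edges: (parent, label, child)
    # align: node_id -> token index (every node assumed aligned)
    actions, created = [], {}  # created: node_id -> creating action step
    by_token = {}
    for n, t in align.items():
        by_token.setdefault(t, []).append(n)
    pending = list(gold_edges)
    for cursor, token in enumerate(tokens):
        for n in by_token.get(cursor, []):
            name = gold_nodes[n]
            actions.append("COPY" if name == token.lower() else name)
            created[n] = len(actions) - 1
            # attach all gold arcs between n and already-created nodes
            for e in list(pending):
                p, lbl, c = e
                if p == n and c in created:    # arc out of the last node
                    actions.append(f"LA({created[c]},{lbl})")
                    pending.remove(e)
                elif c == n and p in created:  # arc into the last node
                    actions.append(f"RA({created[p]},{lbl})")
                    pending.remove(e)
            if n == root:
                actions.append("ROOT")
        actions.append("SHIFT")
    return actions
```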
The above oracle circumvents the problem of unattachable nodes by avoiding the use of subgraph actions. This implies that the oracle will always produce a unique action sequence that fully recovers the gold graph, as long as every node in the graph is aligned to some token. To guarantee that all nodes are aligned, we improve upon the alignment system from Naseem et al. (2019) and Astudillo et al. (2020), which aligns a large majority, but not all, of the AMR nodes. In order to do this, we apply a heuristic based on graph proximity to maintain local correspondences between graph nodes and sentence words. If a node is not aligned, we copy the alignment from its first child node, if one exists, and otherwise the alignment from its first parent node. For example, in Figure 2, if the node person were not provided with an alignment, our oracle would have assigned it the aligned token of its child node employ-01. This is a recursive procedure: as long as there are some alignments to start with and the ground-truth graph is connected, all the nodes will get aligned.
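The alignment-completion heuristic can be sketched as an iterative fixed point (our reading, not the released code; the function name and data layout are assumptions):

```python
# Sketch of the alignment-completion heuristic: any unaligned node
# copies the alignment of its first aligned child, falling back to its
# first aligned parent, repeated until no node changes. For a connected
# graph with at least one seed alignment, every node ends up aligned.

def complete_alignments(nodes, edges, align):
    # nodes: list of node ids; edges: (parent, label, child)
    # align: partial map node id -> token index
    align = dict(align)
    children = {n: [c for p, _, c in edges if p == n] for n in nodes}
    parents = {n: [p for p, _, c in edges if c == n] for n in nodes}
    changed = True
    while changed and len(align) < len(nodes):
        changed = False
        for n in nodes:
            if n in align:
                continue
            # children are preferred over parents, as described above
            for m in children[n] + parents[n]:
                if m in align:
                    align[n] = align[m]
                    changed = True
                    break
    return align
```

In the Figure 2 example, an unaligned person node would receive the token index of its child employ-01.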

Parsing Model
We build our model on top of the pre-trained seq-to-seq Transformer, BART (Lewis et al., 2019). We modify its architecture to incorporate a pointer network and internalize parser states induced by our transition system, and fine-tune it for sentence-to-action generation.

Structure-aware Architecture
We adopt modifications of the Transformer architecture similar to those in Zhou et al. (2021), since our transition system is based on the same action-pointer mechanism. The modifications do not introduce new modules or extra parameters, which naturally fits our need to adapt BART into a transition-based parser with internal graph well-formedness.
In particular, the target actions are factored into two parts: bare action symbols (including labels when present) and pointer values for edges. We use the standard BART output for the former, and a pointer network for the latter. As pointing over the action history is essentially a self-attention mechanism, we re-purpose one decoder self-attention head as the pointer network. It is supervised with an additional cross-entropy loss during fine-tuning and decoded to build graph edges at inference time.
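The pointer supervision amounts to a cross-entropy loss over the re-purposed head's attention distribution at each edge-creating step. A minimal sketch, with assumed names and data layout (per-step attention rows given as plain probability lists rather than tensors):

```python
import math

# Sketch of the pointer loss: for each edge-creating action at target
# step t, the re-purposed self-attention head's distribution over the
# action history is trained to put its mass on the gold pointer
# position j (the step that created the target node).

def pointer_loss(attn, gold_pointers):
    # attn: step -> attention distribution over previous steps
    # gold_pointers: step -> gold history position (edge actions only)
    loss = 0.0
    for t, j in gold_pointers.items():
        loss += -math.log(attn[t][j])  # negative log-likelihood of gold position
    return loss / max(len(gold_pointers), 1)
```

Non-edge steps simply contribute nothing, so the loss can be added to the standard token cross-entropy without changing the rest of the model.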
We encode the monotonic action-source alignments induced by the parser state with hard attention, i.e. by masking some decoder cross-attention heads to only focus on aligned words. Since BART processes source sentences as subwords, we apply an additional average pooling layer on top of its encoder to recover states for the original source words, which are used by the decoder layers for our hard attention. Finally, as the set of valid actions at every step is constrained by the transition rules and parser state, we restrict the decoder output space via hard masking of the final BART softmax layer. For simplicity, we do not incorporate the GNN-style (Li et al., 2019) step-wise decoder graph embedding technique of Zhou et al. (2021), as its gain was shown to be modest.
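The hard attention itself reduces to an additive mask on the cross-attention logits of the selected heads. A minimal sketch under assumed shapes (plain lists instead of tensors; names are ours):

```python
import math

# Sketch of hard cross-attention for parser-state encoding: for the
# masked heads, each target step may only attend to the single source
# word the state machine currently aligns it to. The mask is added to
# the attention logits before the softmax, forcing a one-hot attention.

def hard_alignment_mask(alignments, src_len, neg=-1e9):
    # alignments: one source word index per target step
    return [[0.0 if s == a else neg for s in range(src_len)]
            for a in alignments]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    z = sum(exps)
    return [e / z for e in exps]
```

After the softmax, each masked head places essentially all of its probability mass on the aligned word, while the unmasked heads remain free to attend anywhere.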

Action Generation
Depending on how we treat the target-side vocabulary for action generation, we propose two variations of the model. The first uses a completely separate vocabulary for target actions, where the decoder input side and output side use stand-alone embeddings for actions, separate from the pre-trained BART subword embeddings. We denote this setup as our sep-voc model (abbreviated as StructBART-S).
However, this might not fully utilize the power of the pre-trained BART, since it is an encoder-decoder model with a single vocabulary and all embeddings shared. Although our generation targets are action symbols, the node-generating actions are closely related in their surface forms to natural words, which are what BART was pre-trained on. Therefore, we propose a second variation using a joint vocabulary for both the source tokens and target actions. Naively relying on the original BART subword vocabulary would split action symbols blindly, which is undesirable, as structures such as alignments and edge pointers would be disrupted. Inspired by Bevilacqua et al. (2021), we add frequent node-creating actions to the vocabulary, in order to keep common AMR concepts intact, and split the remaining concepts with the BART subword vocabulary. Non-node-creating actions such as SHIFT and COPY are added as-is to the BART vocabulary.
In this setup, a single node string can potentially be generated over multiple steps; we modify the arc transitions to always point to the beginning position of a node string for attachment. With the joint vocabulary setup, the model can learn to generate unseen nodes with BART's subword vocabulary, eliminating potential out-of-vocabulary problems. We refer to this setup as our joint-voc model (abbreviated as StructBART-J).
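The joint-vocabulary construction can be sketched as a simple frequency cutoff (a hypothetical helper, not the released code; `BASE_ACTIONS` is an illustrative placeholder for the structural action symbols):

```python
from collections import Counter

# Sketch of the joint-vocabulary strategy: node-creating action strings
# seen at least `min_freq` times in training are added to the BART
# vocabulary as whole symbols; everything else falls back to BART's
# subword tokenizer at training and decoding time.

BASE_ACTIONS = ["SHIFT", "COPY", "ROOT", "LA", "RA"]  # always added as-is

def extend_vocab(node_actions, min_freq=5):
    # node_actions: all node-creating action strings in the training data
    counts = Counter(node_actions)
    added = sorted(a for a, c in counts.items() if c >= min_freq)
    return BASE_ACTIONS + added
```

Raising `min_freq` shrinks the number of added concept symbols and pushes more node names into subword generation, which is the axis varied in the vocabulary analysis below.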

Training and Inference
We load the pre-trained BART parameters, except for the stand-alone vocabulary embeddings for the sep-voc model and the extended embeddings for the joint-voc model. We then fine-tune the model with the updated structure-aware architecture on sentence-action pairs, with the addition of the pointer loss.
For decoding, we use a constrained beam search algorithm similar to that of Zhou et al. (2021), but with our own transition set and rules. We run a state machine on the side to obtain the parser states used by the model. Note that for our joint-voc model, we only allow subword splits for node (<string>) actions. As our fine-tuned model is already structure-aware, graph well-formedness is always guaranteed and no post-processing is needed to return valid graphs, unlike Xu et al. (2020) and Bevilacqua et al. (2021). The only post-processing we use is to add wikification nodes, as done in all previous parsers.
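The constrained decoding idea can be sketched as follows, with deliberately simplified validity rules (the actual transition rules are richer; function names and the `<node>` placeholder are ours):

```python
# Sketch of hard output masking at decoding time: a side-running state
# machine determines which actions are valid in the current parser
# state; the logits of all other actions are set to a large negative
# value so the softmax can never produce an ill-formed graph.

def valid_actions(cursor, n_tokens, has_node, has_root):
    acts = set()
    if cursor < n_tokens - 1:
        acts.add("SHIFT")
    if cursor < n_tokens:
        acts.add("COPY")
        acts.add("<node>")  # stands in for any node-name action
    if has_node:
        acts.update({"LA", "RA"})  # arcs need previously created nodes
        if not has_root:
            acts.add("ROOT")       # the root is declared at most once
    return acts

def mask_logits(logits, vocab, allowed, neg=-1e9):
    return [x if vocab[i] in allowed else neg
            for i, x in enumerate(logits)]
```

At each beam step the state machine advances with the chosen action, and the mask is recomputed for the next step.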
Evaluation We assess our models with SMATCH (F1) scores. We also report the fine-grained evaluation metrics (Damonte et al., 2016) to further investigate different aspects of the parsing results, such as concept identification, entity recognition, re-entrancies, etc.

Model Configuration
We follow the original BART configuration (Lewis et al., 2019) and code. We use the large model configuration as default, and also the base model for ablation studies. The pointer network is always tied to one head of the top decoder layer, and the pointer loss is added to the model cross-entropy loss with a 1:1 ratio for training. Transition alignments are used to mask cross-attention in 2 heads of all decoder layers.
For the sep-voc model, we build separate embedding matrices for target actions from the training data for the decoder input and output spaces. For the joint-voc model, we add new embedding vectors for non-node action symbols and node action strings with a default minimum frequency of 5 (which only accounts for about one third of all nodes due to sparsity). We also do light cleaning of the decoded AMR when printing it in PENMAN format, such as removing reserved characters in node concepts and printing disconnected subgraphs.

Implementation Details Our models are trained with the Adam optimizer with a batch size of 2048 tokens and gradient accumulation of 4 steps. The learning rate is 1e−4 with 4000 warm-up steps using the inverse-sqrt scheduling scheme (Vaswani et al., 2017). The hyper-parameters are fixed and not tuned for different models and datasets, as we found results are not sensitive within small ranges. We train sep-voc models for 100 epochs and joint-voc models for 40 epochs, as the latter is found to converge faster. The best 5 checkpoints based on development set SMATCH from greedy decoding are averaged, and a default beam size of 10 is used for decoding for our final parsing scores. We implement our model with the FAIRSEQ toolkit (Ott et al., 2019). More details can be found in the Appendix.
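The inverse-sqrt schedule with the stated hyper-parameters can be sketched as follows (a common formulation of this scheduler; the exact toolkit implementation may differ in small details):

```python
# Sketch of the inverse-sqrt learning-rate schedule: linear warm-up to
# the peak rate over `warmup` steps, then decay proportional to
# 1/sqrt(step), so the rate at step k*warmup is peak_lr/sqrt(k).

def inverse_sqrt_lr(step, peak_lr=1e-4, warmup=4000):
    if step < warmup:
        return peak_lr * step / warmup        # linear warm-up
    return peak_lr * (warmup / step) ** 0.5   # inverse square-root decay
```

With peak_lr=1e-4 and warmup=4000, the rate reaches 1e-4 at step 4000 and halves by step 16000.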

Results
Main Results We present the parsing performance of our model (StructBART) in comparison with previous approaches in Table 1. For each model, we also list its features, such as utilization of pre-trained language models and graph simplification methods such as re-categorization. This gives a comprehensive overview of how systems compare in terms of complexity aside from performance. All recent systems rely on pre-trained language models, either as fixed features or through fine-tuning. The pre-trained BART is particularly beneficial due to its encoder-decoder structure. Among all the models, the graph linearization models (Xu et al., 2020; Bevilacqua et al., 2021) have the fewest extra dependencies when not using graph re-categorization. Our model only requires aligned training data, a trait common to all transition-based approaches. This bears the advantage of producing reliable alignments at decoding time, which are useful for downstream tasks and as an explanation of the graph construction process.
Both our sep-voc and joint-voc model variations work well on all datasets. Without using extra silver data, our model achieves a SMATCH score of 84.2 ±0.1 on AMR 2.0, which matches the previous best model (Bevilacqua et al., 2021) trained with 200K silver data. With only 47K silver data (consisting of ∼20K example sentences of PropBank frames and ∼27K randomly selected SQuAD-2.0 context sentences, from https://rajpurkar.github.io/SQuAD-explorer/), we achieve the highest score of 84.7 ±0.1 for AMR 2.0. We also attain a high score of 81.7 ±0.2 on the smallest AMR 1.0 benchmark, and the second-best score of 82.7 ±0.1 on the largest AMR 3.0 benchmark. An ensemble of the 3 models from silver training further improves performance to 84.9 for AMR 2.0.

Fine-grained Results
We further examine the fine-grained parsing results on AMR 2.0 in Table 2. We compare models relying neither on extra data nor on graph re-categorization, since silver data sets differ across methods, and re-categorization comes with the limitations outlined in Section 2. Our models achieve the highest scores across most of the categories, except for negation and wikification. The former may be due to alignment errors, and the latter is solved as a separate post-processing step independent of the parser. Compared with the closely related model from Bevilacqua et al. (2021), which also fine-tunes BART but directly on linearized graphs, we achieve significant gains on re-entrancy and SRL (:ARG-i arcs), showing that our model generates AMR graphs more faithful to their topological structures.

Analysis
Transition System Table 3 and Table 4 compare the transition systems used by recent transition-based AMR parsers with strong performance. Our proposed system has the smallest set of base actions and utilizes the action-side pointer mechanism for flexible edge creation as in Zhou et al. (2021), but does not rely on special treatment of certain subgraphs such as named entities and dates. This results in slightly longer action sequences compared to Zhou et al. (2021), but with almost 100% coverage (Table 4). Our transition system and oracle can always find action sequences that fully recover the original AMR graph, regardless of graph topology and alignments.
To assess whether our proposed transition system helps integration with pre-trained BART, we train both the APT model from Zhou et al. (2021) and our sep-voc model on the transition system of Zhou et al. (2021) and the one introduced in this work (Table 3, last 4 columns). The APT model, based on fixed RoBERTa features, does not benefit from the proposed transition system. However, our proposed model gains 0.6 on AMR 2.0 and 0.7 on AMR 3.0. This confirms the hypothesis that the proposed transitions are better able to exploit BART's powerful language generation ability.
Structural Alignment Modeling In Table 5, we evaluate the effects of our structural modeling of parser states within BART during fine-tuning. Action-source alignments are a natural byproduct of the parser state, providing structural information about where and how to generate the next graph component. Our default use of hard attention to encode such alignments works best. We explore two other strategies for modeling alignments. One is to supervise the cross-attention distributions of the same heads with inferred alignments during training, inspired by Strubell et al. (2018). The other is to directly add the aligned source contextual embeddings from the top encoder layer to the decoder input at every generation step. The former hurts model performance, indicating the model is unable to learn the underlying transition logic to infer correct alignments, while the latter does equally well as our default model. These results justify the modeling of structural constraints, even when fine-tuning strong pre-trained models such as BART.
We also ablate the use of the COPY action in our transition system. The sep-voc model suffers, but the joint-voc model is not affected. Without the COPY action, the joint-voc model relies more on BART's pre-trained subword embeddings to split node concepts more frequently, while the sep-voc model would need to learn to generate more rare concepts from scratch. This indicates that BART's strong generation power is fully used to tackle concept sparsity problems with its subwords.

Special Nodes in Joint-Voc In Figure 3, we show the joint-voc model performance with different sized (joint) vocabularies. The vocabulary size is controlled by specifying the minimum frequency of occurrence needed for an AMR concept to be added to the vocabulary. For instance, when the minimum frequency is 1, all 12475 AMR concepts from the training data are added to the BART vocabulary. The number of added concepts decreases as we increase the minimum frequency threshold. On the model performance side, we only observe ±0.2 score variations resulting from vocabulary expansion. More interestingly, the model works equally well when no special concepts are added to the BART vocabulary (minimum node frequency is ∞), where all the node names are split and generated with BART subword tokens. Although our default setup uses a frequency threshold of 5 in the joint-voc expansion, following Bevilacqua et al. (2021), this appears unnecessary for achieving good performance. This highlights the efficacy of utilizing the pre-trained BART's language generation power for AMR concepts, even with relatively small annotated training datasets.

Pre-trained Parameters
We study the contribution of different pre-trained BART components in Table 6. We also experiment with freezing BART parameters during training, in the bottom part of Table 6. Freezing the BART encoder gives results on a similar level to the previous best RoBERTa-feature-based models, which is behind full fine-tuning. Overall, full initialization from BART with structure-aware fine-tuning (#8) works best.

Related Work
Using seq-to-seq models to predict linearized graph sequences for parsing was proposed in Vinyals et al. (2015) and is currently a widely used approach (Van Noord and Bos, 2017; Ge et al., 2019; Rongali et al., 2020). However, it is only recently, with the rise of pre-trained Transformer decoders, that these techniques have become dominant in semantic parsing. Xu et al. (2020) proposed a custom multi-task pre-training and fine-tuning approach for conventional Transformer models (Vaswani et al., 2017). The massively pre-trained Transformer BART (Lewis et al., 2019) was used for executable semantic parsing in Chen et al. (2020) and AMR parsing in Bevilacqua et al. (2021). The importance of strongly pre-trained decoders also seems justified as BART gains popularity in various semantic generation tasks (Chen et al., 2020; Shi et al., 2020). Our work aims at capitalizing on the outstanding performance shown by BART, while providing a more structured approach that guarantees well-formed graphs and yields other desirable by-products such as alignments. We show that this is not only possible but also attains state-of-the-art parsing results without graph re-categorization. Our analysis also shows that, contrary to Xu et al. (2020), vocabulary sharing is not necessary for strong performance in our structural fine-tuning.
Encoding of the parser state into neural parsers has been undertaken in various works, including seq-to-seq RNN models (Buys and Blunsom, 2017), encoder-only Transformers (Ahmad et al., 2019), seq-to-seq Transformers (Astudillo et al., 2020; Zhou et al., 2021) and pre-trained language models (Qian et al., 2021). Here we explore the application of these approaches to pre-trained seq-to-seq Transformers. Borrowing ideas from Zhou et al. (2021), we encode alignment states into the pre-trained BART attention mechanism, and re-purpose its self-attention as a pointer network. We also rely on a minimal set of actions targeted at utilizing BART's generation capabilities with desirable guarantees, such as no unattachable nodes and full recovery of all graphs. We are the first to explore transition-based parsing via fine-tuning of strongly pre-trained seq-to-seq models, and we demonstrate that parser state encoding is still important for performance, even when implemented inside a powerful pre-trained decoder such as BART.

Conclusion
We explore the integration of pre-trained sequence-to-sequence language models and transition-based approaches for AMR parsing, with the purpose of retaining the high performance of the former and the structural advantages of the latter. We show that the two approaches are complementary, establishing a new state of the art for AMR 2.0. Our results indicate that, instead of simply converting structured data into unstructured sequences to fit the needs of the pre-trained model, it is possible to effectively re-purpose a generic pre-trained model into a structure-aware one that achieves strong performance. Similar principles can be applied to adapt other powerful pre-trained models, such as T5 (Raffel et al., 2019) and GPT-2 (Radford et al., 2019), to structured data prediction. It is worth exploring thoroughly the pros and cons of introducing structure to the model compared to removing structure from the data (linearization) in various scenarios.

A Dataset Statistics
We list the dataset sizes of the AMR benchmarks in Table 7. The sizes increase with the release version number. AMR 2.0 is by far the most used. AMR 2.0 shares the same set of sentences for development and test data with AMR 1.0, but with revised annotations and wikification links. AMR 3.0 is the most recent release and remains under-explored.

B Details of Model Structures and Number of Parameters
In Table 8, we list the detailed model configurations and numbers of parameters of the official pre-trained BART models. Our fine-tuned StructBART uses different action vocabulary strategies, which build additional embedding vectors for certain action symbols. The numbers vary with the training dataset. We list the detailed numbers of parameters of our fine-tuned models in Table 9. Fine-tuning only adds about 3%-8% more parameters for the sep-voc model and 0.4%-1% more parameters for the joint-voc model.