Stacked AMR Parsing with Silver Data

The lack of sufficient human-annotated data is a main challenge for abstract meaning representation (AMR) parsing. To alleviate this problem, previous works usually make use of silver data or pre-trained language models. In particular, one recent seq2seq work directly fine-tunes AMR graph sequences on an encoder-decoder pre-trained language model and achieves new state-of-the-art results, outperforming previous works by a large margin. However, this makes decoding relatively slow. In this work, we investigate alternative approaches that achieve competitive performance at faster speeds. We propose a simplified AMR parser and a pre-training technique for the effective usage of silver data. We conduct extensive experiments on the widely used AMR2.0 dataset, and the results demonstrate that our Transformer-based AMR parser achieves the best performance among seq2graph-based models. Furthermore, with silver data, our model achieves competitive results with the SOTA model at an order of magnitude faster speed. Detailed analyses are conducted to gain more insights into our proposed model and the effectiveness of the pre-training technique.


Introduction
Abstract meaning representation (AMR) parsing aims to abstract semantics from a natural language sentence into a rooted, directed, and labeled graph, where the nodes represent concepts and edges represent semantic relations (Banarescu et al., 2013). Figure 1 gives an example.
One main challenge of AMR parsing is the lack of large-scale annotated data, which limits the model's representation ability. To alleviate the problem and boost performance, early works propose to use silver (pseudo) data generated by released AMR parsing models (van Noord and Bos, 2017; Konstas et al., 2017). Apart from AMR silver data, Xu et al. (2020) use other kinds of large-scale silver data, such as constituent parsing data and machine translation data, to train a pre-trained model. With the development of pre-trained language models, recent works use them to enhance the model's input representations (Cai and Lam, 2019; Zhou et al., 2021). Most of them apply pre-trained models on the encoder side, since the encoder naturally provides powerful contextualized representations for sentences. Recently, Bevilacqua et al. (2021) propose a seq2seq AMR parser based on BART (Lewis et al., 2020), an encoder-decoder pre-trained language model. They first convert the AMR graph into a text sequence with symbols indicating the concepts' graph positions. Then, they fine-tune BART on the sentence sequence and AMR graph sequence, achieving large improvements over previous works, including those with BERT. However, this makes the model relatively slow, parsing only 31 tokens per second. We think there are two main reasons: 1) the 12-layer Transformer decoder and 2) the longer converted graph sequences that include the added symbols.
In this work, we investigate alternative approaches to achieve competitive performance at faster speeds. We propose a simplified AMR parser and a pre-training technique for the effective use of silver data. First, we propose a simple Transformer-based seq2graph AMR parser, denoted as TAMR, that only needs one external bi-affine scorer (Dozat and Manning, 2017) for relation classification. The remaining question is how to perform concept generation and edge classification with the Transformer, which is usually used for encoding sequences. Our answer is to give the Transformer attention mechanisms additional roles. In detail, we show that the self-attention in the decoder captures semantic relations that can guide establishing connections between concepts, and that the cross-attention implicitly links each concept with its surface word, which is similar to the core of attention-based machine translation. Based on these observations, we use the copy mechanism (Zhang et al., 2019b; Cai and Lam, 2020) to copy words or lemmas as candidate concepts for concept prediction, treating the cross-attention between the encoder and decoder as the copy probability. Another source of candidate concepts is the concept vocabulary extracted from the training data. For edge classification, we directly treat part of the decoder self-attention values as the edge scores between concept nodes.
Second, to achieve competitive performance with the current SOTA model, we seek to use silver data to enhance the model's representation ability. Specifically, we employ three AMR models of different performance (denoted as "father" models) to generate three silver datasets of different quality, and investigate several questions that are seldom discussed in previous works: 1) What are the best learning schedules for pre-training with silver data and for later fine-tuning with the gold-standard data? 2) Is every silver dataset beneficial for our model, even when its father model lags behind ours? 3) Can multiple silver datasets of different quality provide more information than the best one alone, i.e., can the higher-quality silver data benefit from the lower-quality silver data? Based on the answers to these questions, which are given in Section 6.2, we propose a stack pre-training technique for effectively using silver data.
We conduct extensive experiments on the commonly used AMR2.0 dataset. The experimental results show that our proposed model achieves the best results among seq2graph-based models. Utilizing the silver data, our final model achieves comparable results with the current SOTA model at an order of magnitude faster speed. Our contributions are threefold: (I) We propose a simple Transformer-based AMR parser, which only needs one external bi-affine scorer for relation classification. (II) We investigate how to ensemble different models via the proposed stack pre-training method. (III) Detailed analyses provide more insights into our model and several interesting findings on utilizing silver data.

Related Work
AMR parsing approaches can mostly be categorized into four classes: pipeline-based, transition-based, seq2seq-based, and seq2graph-based approaches.
Pipeline-based approaches mainly consist of two steps: 1) concept identification and 2) relation identification. Flanigan et al. (2014) present the first AMR parser (JAMR), which treats concept identification as a sequence labeling problem and relation identification as a maximum-scoring connected graph search problem; they also propose an influential rule-based aligner for aligning concepts and words. Lyu and Titov (2018) treat the alignment as latent variables and propose a joint model for AMR parsing. Zhang et al. (2019a) first use the attention-based copy mechanism to predict concepts in a BiLSTM encoder-decoder framework and then use a bi-affine scorer for edge and relation prediction based on the predicted concepts.
Transition-based methods design a series of actions to generate the AMR graph. Wang et al. (2016) propose to transform the sentence's dependency tree into its AMR graph. Ballesteros and Al-Onaizan (2017) and Naseem et al. (2019) use a Stack-LSTM transition-based AMR parser that transforms the sentence directly into the AMR graph, which differs from Wang et al. (2016). With the rise of the Transformer, Astudillo et al. (2020) and Zhou et al. (2021) propose to use a Stack-Transformer for transition-based AMR parsing.
Seq2seq-based approaches convert the AMR graph generation problem into a symbolic sequence generation problem, where the hierarchical structure is converted into human-defined symbols. Konstas et al. (2017) propose a seq2seq-based AMR parser that uses millions of unlabeled sentences with self-training. van Noord and Bos (2017) leverage a character-level seq2seq-based model and silver data, achieving promising improvements. Recently, Bevilacqua et al. (2021) fine-tune BART (Lewis et al., 2020) on the gold-standard data, achieving new SOTA performance.
Seq2graph-based methods generate a new concept node and its connections with previously generated concepts in a single time step, and are thus relatively faster than seq2seq-based methods. Zhang et al. (2019b) propose a BiLSTM encoder-decoder-based model for several semantic tasks, including AMR. Cai and Lam (2019) present a top-down AMR parser that generates the concept nodes in a root-to-leaf manner. Cai and Lam (2020) introduce iterative inference for the decoding process on the Transformer encoder-decoder architecture. Motivated by the generation process of seq2graph methods and the Transformer encoder-decoder framework, we adapt AMR parsing to the Transformer architecture. The main difference between our model and previous seq2graph models is that our model relies almost entirely on the Transformer, adding only one bi-affine scorer for relation classification.

Task Formulation.
Given a sentence $s = w_1, w_2, \ldots, w_n$, AMR parsing aims to parse the sentence into an AMR graph $G = \{N, E\}$, where $N = \{c_1, c_2, \ldots, c_m\}$ is the set of concept nodes in the AMR graph and $E = \{(c_i, c_j, r) \mid 1 \le i \le m, 1 \le j \le m, r \in R\}$ is the set of edges in the graph. $R$ is the set of AMR relations.
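As a concrete illustration of this formulation, the following minimal sketch shows one way the graph $G = \{N, E\}$ could be held in memory; the class and method names are illustrative and not taken from the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class AMRGraph:
    """Minimal container for G = {N, E}."""
    concepts: List[str] = field(default_factory=list)                # N = {c_1, ..., c_m}
    edges: List[Tuple[int, int, str]] = field(default_factory=list)  # E = {(c_i, c_j, r)}

    def add_edge(self, i: int, j: int, relation: str) -> None:
        """Connect concepts c_i and c_j with relation label r in R."""
        self.edges.append((i, j, relation))
```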
Overall, our Transformer-based model consists of the following modules, i.e., input layer, encoder layer, decoder layer, concept generator, edge generator, and relation classifier. We will describe the model architecture in detail and show how to adapt the AMR parsing process into Transformer in the following sections.

Input Layer.
Encoder Input. The model input of each word $w_i$ in the sentence $s$ is composed of its character representation, generated by a convolutional neural network (CNN) (Kalchbrenner et al., 2014), together with randomly initialized lemma, part-of-speech tag, named entity tag, and dependency label embeddings (Xia et al., 2019), denoted as
$$x_i = e^{char}_i \oplus e^{lemma}_i \oplus e^{pos}_i \oplus e^{ner}_i \oplus e^{dep}_i,$$
where $\oplus$ denotes the concatenation operation. We also use BERT (Devlin et al., 2019) to enhance the word representation. To obtain word-based representations, we average-pool the sub-word representations. Due to the GPU memory limitation, we fix the BERT model parameters, following Zhang et al. (2019b). The final model input representation for $w_i$ is computed as
$$\hat{x}_i = x_i \oplus \mathrm{BERT}_i.$$
Decoder Input. In the decoder, we use the concatenation of the concept character representation and a randomly initialized concept embedding as the concept representation, denoted as
$$y_j = e^{char}_{c_j} \oplus e^{concept}_{c_j}.$$
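The sketch below illustrates the two concatenations described above with PyTorch tensors; all embedding sizes are hypothetical placeholders, and the real model would draw these vectors from learned embedding tables and a character CNN rather than random initialization at call time.

```python
import torch

# Hypothetical dimensions for the encoder input of one word w_i.
char_repr = torch.randn(1, 100)   # from a character-level CNN
lemma_emb = torch.randn(1, 100)   # randomly initialized lemma embedding
pos_emb   = torch.randn(1, 32)    # part-of-speech tag embedding
ner_emb   = torch.randn(1, 16)    # named entity tag embedding
dep_emb   = torch.randn(1, 32)    # dependency label embedding
bert_repr = torch.randn(1, 768)   # average-pooled sub-word vectors, parameters frozen

x_i     = torch.cat([char_repr, lemma_emb, pos_emb, ner_emb, dep_emb], dim=-1)
x_hat_i = torch.cat([x_i, bert_repr], dim=-1)   # final encoder input for w_i
```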

Encoder Layer.
Given the encoder input representations, we use the Transformer encoder (Vaswani et al., 2017) to encode the sentence. Formally,
$$H^{enc} = \mathrm{TF}_{enc}(\hat{x}_1, \ldots, \hat{x}_n),$$
where $\mathrm{TF}_{enc}$ denotes the multi-layer Transformer encoder. As the Transformer has been widely used, we refer readers to the original paper for details. The left part of Figure 2 shows the process.
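A minimal sketch of the encoder using PyTorch's built-in Transformer modules is given below, assuming the hyper-parameters reported later (4 encoder blocks, 8 heads, hidden size 512, feed-forward size 1024); positional information and padding masks, which the real model also needs, are omitted here.

```python
import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=1024)
tf_enc = nn.TransformerEncoder(encoder_layer, num_layers=4)

inputs = torch.randn(20, 1, 512)   # (sentence length, batch size, hidden size)
h_enc = tf_enc(inputs)             # one contextualized vector per word
```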

Decoder.
For better understanding, we describe the decoder layer, concept generator, edge generator, and relation classifier together in this part. In general, given the sentence representation and a start concept node START, the decoder generates the AMR graph one concept at a time. Figure 2 shows the process at the second step. Decoder Layer. Given the concept node input representations of the decoder, we compute the output representations as
$$h_1, \ldots, h_t = \mathrm{TF}_{dec}(y_1, \ldots, y_t, H^{enc}),$$
where $\mathrm{TF}_{dec}$ is the multi-layer Transformer decoder.
Concept Generator. Following previous works (Zhang et al., 2019a; Cai and Lam, 2020), our model generates each concept node from two sources, i.e., the concept vocabulary and the source words (or lemmas) in the sentence. First, given the t-th decoder output representation $h_t$, we employ an MLP to re-encode it and discard irrelevant information (Dozat and Manning, 2017), denoted as $c_t = \mathrm{MLP}(h_t)$. Next, we compute the candidate concept probability distribution over the concept vocabulary as
$$P_{voc}(n_t) = \mathrm{softmax}(W_{c\_voc}\, c_t + b_{c\_voc}),$$
where $W_{c\_voc}$ and $b_{c\_voc}$ are linear projection parameters.
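A small sketch of this vocabulary branch is shown below; the hidden and vocabulary sizes are illustrative, and the single-layer MLP with ELU is only an example of the re-encoding step.

```python
import torch
import torch.nn as nn

hidden_size, vocab_size = 512, 20000

mlp = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.ELU())  # re-encoding MLP
vocab_proj = nn.Linear(hidden_size, vocab_size)                     # W_{c_voc}, b_{c_voc}

h_t = torch.randn(1, hidden_size)                 # t-th decoder output representation
c_t = mlp(h_t)                                    # re-encoded state
p_vocab = torch.softmax(vocab_proj(c_t), dim=-1)  # P_voc(n_t) over the concept vocabulary
```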
Second, we treat the cross-attention (Vaswani et al., 2017) $\alpha^c_t$ as the latent alignment between the currently predicted concept node $n_t$ and the surface words (or lemmas) in the sentence. Based on this assumption, we compute the probabilities of copying tokens and lemmas of the sentence as
$$P_{word}(n_t = w_i) = \alpha^c_t[i], \quad P_{lemma}(n_t = l_i) = \alpha^c_t[i],$$
where $w_i$ is the i-th word, $l_i$ is the i-th lemma, and $[i]$ means indexing the i-th token. The final probability of predicting concept $n_t$ is
$$P(n_t) = \lambda_1 P_{voc}(n_t) + \lambda_2 P_{word}(n_t) + \lambda_3 P_{lemma}(n_t),$$
where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are normalized weights, computed by a single MLP and the softmax function on $c_t$. Edge Generator. To connect the currently predicted concept node $n_t$ with the previously generated concepts $n_1, n_2, \ldots, n_{t-1}$, we directly use the self-attention $\alpha^s$ in the decoder. Intuitively, we can treat the self-attention $\alpha^s$ as the relevancy between the current node $n_t$ and the previous concept nodes. The edge prediction module in Figure 2 shows the workflow. Specifically, we use the self-attention of the topmost decoder layer, which is computed as
$$\alpha^s = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right).$$
Finally, to determine the edges between the current node and the previously generated nodes, we use half of the attention heads and compute the edge scores as
$$p^e_t = \frac{2}{H} \sum_{h=1}^{H/2} \alpha^s_h[t],$$
where $H$ is the number of attention heads. If $p^e_t[i] > 0.5$, we connect concepts $n_t$ and $n_i$. Notably, this strategy allows the current node to connect to multiple previous nodes, which is a convenient way to handle the reentrancy problem.
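The sketch below shows the shape of these two computations with random stand-in tensors: the gated mixture of the three concept distributions and edge scores taken from half of the self-attention heads. Averaging the selected heads is one plausible reading of the equation above rather than necessarily the paper's exact formula, and the copy branches are only indicated in comments.

```python
import torch
import torch.nn as nn

hidden_size, vocab_size, sent_len = 512, 20000, 20
num_heads, num_prev = 8, 5

# 1) Mixture weights lambda_1, lambda_2, lambda_3 from the re-encoded state c_t.
gate = nn.Sequential(nn.Linear(hidden_size, 3), nn.Softmax(dim=-1))
c_t = torch.randn(1, hidden_size)
lam = gate(c_t).squeeze(0)

p_vocab = torch.softmax(torch.randn(1, vocab_size), dim=-1)   # vocabulary branch
alpha_c = torch.softmax(torch.randn(1, sent_len), dim=-1)     # cross-attention over words
# The copy branches scatter alpha_c onto the word / lemma entries of the output
# space (omitted here); the final distribution is the weighted sum
#   p(n_t) = lam[0] * p_vocab + lam[1] * p_copy_word + lam[2] * p_copy_lemma.

# 2) Edge scores from half of the top-layer decoder self-attention heads.
alpha_s = torch.softmax(torch.randn(num_heads, 1, num_prev), dim=-1)
p_edge = alpha_s[: num_heads // 2].mean(dim=0).squeeze(0)     # one score per previous node
connected = (p_edge > 0.5).nonzero(as_tuple=True)[0]          # connect n_t to these nodes
```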
Relation Classifier. After the current concept node and its edges are generated, the remaining step is to assign an appropriate label to each edge. In this work, we directly use the bi-affine scorer (Dozat and Manning, 2017) to classify the semantic relation for each connected concept node pair. Formally, given the predicted concept $n_t$ and any previous concept $n_j$, we compute the relation scores as
$$s^{rel}_{t,j} = h_t^{\top} \mathbf{U}\, h^{(N/2)}_j + \mathbf{W} \left(h_t \oplus h^{(N/2)}_j\right) + \mathbf{b},$$
where $N$ is the number of decoder layers and $\mathbf{W}$, $\mathbf{U}$, and $\mathbf{b}$ are learnable parameters. Intuitively, at the i-th step in the decoder, the lower half of the decoder layers is used to represent the i-th concept, and the upper half is used to represent and predict the (i+1)-th concept. Therefore, we use the N/2-th layer representation $h^{(N/2)}_j$ to represent the other concepts that participate in relation classification.
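The following sketch shows a bi-affine scorer in the spirit of Dozat and Manning (2017); the MLP sizes are illustrative, and the two input tensors stand in for $h_t$ and the N/2-th layer representation of $n_j$.

```python
import torch
import torch.nn as nn

hidden_size, mlp_size, num_relations = 512, 256, 100

head_mlp = nn.Linear(hidden_size, mlp_size)   # re-encodes the current concept n_t
dep_mlp  = nn.Linear(hidden_size, mlp_size)   # re-encodes a previous concept n_j
U = nn.Parameter(torch.randn(num_relations, mlp_size, mlp_size))
W = nn.Linear(2 * mlp_size, num_relations)    # linear term W [h_t ; h_j] + b

h_t = torch.relu(head_mlp(torch.randn(1, hidden_size)))   # top-layer state of n_t
h_j = torch.relu(dep_mlp(torch.randn(1, hidden_size)))    # N/2-th layer state of n_j

bilinear = torch.einsum('bi,rij,bj->br', h_t, U, h_j)     # h_t^T U h_j, one score per relation
rel_scores = bilinear + W(torch.cat([h_t, h_j], dim=-1))  # bi-affine relation scores
```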
With the above-described model architecture, TAMR generates the AMR graph one node at a time, until a special ending node END is generated.
Training & Testing. In training, we employ masked self-attention in the decoder, which ensures that each node in the concept sequence attends only to itself and the preceding nodes. For the training objective, our model maximizes the log-likelihood of the gold-standard AMR graph given the sentence, which is the sum of the decomposed log-likelihoods of the individual model components.
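A minimal sketch of the training-time mask is shown below, using PyTorch's boolean attention-mask convention; the size is a toy example.

```python
import torch

num_concepts = 5
# True entries are masked out: row i (the i-th concept) may attend only to
# columns 0..i, i.e., itself and all preceding concepts.
attn_mask = torch.triu(torch.ones(num_concepts, num_concepts), diagonal=1).bool()
```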
During testing, we use the beam search method to search the highest-scoring AMR graph.

Stack Pre-training with Silver Data
Reviewing the progress of AMR parsing, we find that performance has improved along with the development of deep learning in the NLP community, especially through the usage of silver data and the evolution of pre-trained language models like BERT. Injecting BERT representations (Zhang et al., 2019b; Cai and Lam, 2020) into AMR parsing models brings significant improvements over previous BERT-free models (Lyu and Titov, 2018; Cai and Lam, 2019), and is an effective way to alleviate the data sparsity problem and enhance the model's representation ability. However, BERT only provides powerful representations for the sentence, and thus cannot directly benefit the decoder module in the encoder-decoder framework.
Recently, Bevilacqua et al. (2021) propose a BART-based (Lewis et al., 2020) seq2seq AMR parsing model (SPRING), where the hierarchical structure of the AMR graph is represented by human-defined symbols. BART is a pre-trained seq2seq model that provides powerful representations for both the encoder and the decoder. Thus, Bevilacqua et al. (2021) achieve new SOTA performance on AMR benchmarks by simply fine-tuning BART with gold-standard AMR data. We believe the powerful representations of both encoder and decoder contribute to this success. However, SPRING runs relatively slower than previous seq2graph-based methods because of the autoregressive generation of the symbol-based graph sequence; it needs 26 minutes to parse the AMR2.0 test data (about 0.88 sentences/second). Unfortunately, training seq2graph models on top of seq2seq pre-trained models is tricky because BART uses sub-word representations, which makes it difficult to establish edges between concept nodes. So, can we train a model that is competitive with SPRING and runs as fast as classic seq2graph models? The answer is yes, by utilizing large-scale silver data (Konstas et al., 2017). In this work, we first employ large-scale silver data to train pre-trained TAMR models, simulating the role of pre-trained language models. Then we fine-tune the pre-trained model with the gold-standard AMR data.
To better understand the effect of silver data and investigate some questions that are seldom discussed before, we conduct experiments with silver data generated by three models of different performance, i.e., JAMR, TAMR, and SPRING. Figure 3 shows the overall process for using these silver datasets. First, we use the three models to parse the large-scale BLLIP data, yielding three silver datasets of different quality. Second, we use each silver dataset to train a pre-trained AMR model with TAMR. Third, we fine-tune these pre-trained models with the gold-standard AMR data and investigate whether the model performance is boosted. Furthermore, we also investigate whether we can utilize all the silver data with some ensemble technique for further improvements, which can also be seen as an ensemble of different AMR parsing models. To this end, we propose a stack pre-training method that progressively learns from these silver datasets, depicted by the bottom workflow in Figure 3. We discuss the details together with the experimental results in Section 6.2.
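A high-level sketch of this schedule is given below; the function and corpus names are placeholders, and the training routine itself (optimizer, learning-rate schedule) is abstracted away as a callable.

```python
from typing import Callable, List


def stack_pretrain_then_finetune(
    model,
    silver_corpora: List[object],   # ordered from lowest- to highest-quality silver data
    gold_corpus: object,            # the gold-standard AMR2.0 training set
    train_fn: Callable[[object, object], None],
) -> None:
    """Stack pre-training (Figure 3): keep training the same model on each silver
    corpus in turn, then fine-tune on the gold-standard data."""
    for corpus in silver_corpora:
        train_fn(model, corpus)     # pre-training stage, continued from the previous stage
    train_fn(model, gold_corpus)    # final fine-tuning stage
```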

Settings.
Data. We conduct extensive experiments on the commonly used AMR2.0 (LDC2017T10) dataset. Following the standard data split, AMR2.0 contains 36,521, 1,368, and 1,371 samples in the training, development, and test sets, respectively. The BLLIP corpus is chosen as the large-scale unlabeled data, which contains 1,795,984 sentences from the newswire domain. We implement our model with PyTorch.
Hyper-parameters. BERT is used to enhance the word representation, and its parameters are fixed due to the GPU memory limitation. The encoder and decoder consist of 4 and 8 Transformer blocks, respectively. Each Transformer block has 8 attention heads, a feed-forward hidden size of 1024, and a hidden size of 512. Implementation Details. We use Stanford CoreNLP (Manning et al., 2014) for tokenization, lemmatization, part-of-speech tagging, and named entity tagging. The dependency relations are obtained with the bi-affine dependency parser (Dozat and Manning, 2017) implemented in SuPar. Previous works (Zhang et al., 2019b; Cai and Lam, 2020) usually use graph re-categorization to reduce the complexity and sparseness of the AMR graph; we use the same script as Cai and Lam (2020) for pre- and post-processing. Our models are trained with the Adam (Kingma and Ba, 2015) optimizer and a warm-up learning rate schedule, the same as Vaswani et al. (2017). We use the evaluation tool of Damonte et al. (2017) to test our model.
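For reference, the warm-up learning-rate schedule of Vaswani et al. (2017) can be written as below; the warm-up step count is the commonly used default and is an assumption, not a value reported here.

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```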
Training Criterion. We train our models for at most 2,020 epochs and choose the best model to evaluate the test data according to the performance on development data.

Experimental Results.
Results of TAMR. Our TAMR outperforms the previous best seq2graph-based model of Cai and Lam (2020) in Smatch score. Besides, it is interesting that our model achieves a better unlabeled Smatch score than Cai and Lam (2020), with a gain of +0.7. Cai and Lam (2020) conduct edge prediction based on an iterative state, while we directly model the edge information with the decoder self-attention.
Results of TAMR with Separate Silver Data. Table 3 shows the experimental results of our pre-trained models with each separate silver dataset and the corresponding fine-tuning results on the AMR2.0 dataset. The "Silver Data" column reports the quality of the silver data, i.e., the performance of the original model that generated it (trained on the AMR2.0 training set). The "Pre-train" column shows the results of the three models pre-trained on the separate silver datasets. We denote the model pre-trained with silver data d as TAMR^pre_d and the corresponding fine-tuned model as TAMR^ft_d; for example, the model pre-trained with JAMR silver data is TAMR^pre_JAMR. One interesting finding is that the pre-trained models TAMR^pre_JAMR and TAMR^pre_TAMR both outperform their original models, while TAMR^pre_SPRING does not. We think this is mainly because the average annotator vs. inter-annotator agreement (Smatch) is 83.0 (Banarescu et al., 2013), so the SPRING silver data can bring only limited benefit, with a Smatch ceiling of around 83.0. Besides, we can see that fine-tuning the pre-trained models with gold-standard AMR data consistently brings large improvements, even for TAMR^pre_SPRING, which already outperforms our base TAMR model. This indicates the effectiveness of silver data for AMR parsing. Another interesting point is that the fine-tuned TAMR^ft_SPRING does not outperform the pre-trained TAMR^pre_SPRING, which may indicate an upper bound on utilizing a single silver dataset.

Table 4: Smatch scores of our ensemble pre-training models on AMR2.0 development and test data with stack pre-training, where "→A→B" means we stack pre-training with silver data A and then silver data B.
Results of TAMR with Stacking Silver Data. Previous works (van Noord and Bos, 2017; Zhou et al., 2021) usually use a single silver dataset to improve AMR parsing performance. We think it is interesting to explore the effect of combining silver datasets of different quality, which is seldom discussed before. In this work, we propose a stack pre-training approach, i.e., pre-training on the silver datasets one by one, from the lowest-quality to the highest-quality data. Table 4 shows the results of the stack pre-training experiments. First, we can see that stacking TAMR silver data on TAMR^pre_JAMR brings a slight improvement of +0.4 Smatch score (81.9 - 81.5 = 0.4), indicating the usefulness of JAMR silver data even though it is generated by a relatively lower-performance model. The improvements are consistent across all the combinations, demonstrating the effectiveness of our proposed stack pre-training. Second, we observe two other interesting findings in the fine-tuning results: 1) Fine-tuning TAMR^pre_TAMR and TAMR^pre_{JAMR→TAMR} with the gold-standard AMR data achieves the same result of 82.4, even though the pre-trained models differ. We think this is because the TAMR silver data comes from the TAMR model itself and thus cannot provide additional useful information. 2) Fine-tuning the TAMR^pre_{*→SPRING} pre-trained models achieves promising improvements over TAMR^pre_SPRING. This leads to an interesting conclusion: lower-quality silver data from different sources can still provide beneficial information, which is also effective on top of higher-quality silver data. Thus, we break the "upper bound" with the stack pre-training method. In addition, we also try to generate silver data with the best-performing model, TAMR^ft_{JAMR→TAMR→SPRING}, and use it in the pre-training process for another iteration. The pre-training step reaches 83.9 Smatch score on the development data, outperforming TAMR^pre_{JAMR→TAMR→SPRING} by +0.5 Smatch score. However, compared with the 84.3 Smatch score of TAMR^ft_{JAMR→TAMR→SPRING}, the subsequent fine-tuning process does not bring further improvement under our current settings. We think our proposed stack pre-training technique has already tapped most of the potential of the silver data, leaving little space for further iterations.
The decoder plays the main role in TAMR, and the number of decoder layers matters a lot. Table 5 shows the results. We can see that the performance improves as the number of decoder layers increases. The 8-layer decoder achieves the best results on both the development and test data. We do not add more layers because of the GPU memory limitation.

Effect of Methods with Silver Data.
In order to find a better pre-training and fine-tuning pipeline strategy, we conduct detailed experiments.

Table 7: Results of different ensemble techniques on AMR2.0 development data. "AVG_P" means averaging the model parameters of the fine-tuned models, "AVG_S" means averaging the scores of the last decision layers of different models, and "SP" means stack pre-training.
Table 8: Speed comparison on the AMR2.0 test data.
Model    Speed (tokens/second)
SPRING   31
TAMR     300

Analyses of Pre-training. Konstas et al. (2017) propose to pre-train the model with a fixed learning rate of 1e-5. In this work, we compare two learning rate strategies on the TAMR silver data: 1) training with the base model's learning rate schedule (Ori.) and 2) training with a fixed learning rate of 1e-3, 1e-4, or 1e-5. The convergence curves are shown in Figure 4. We can see that the original learning rate schedule achieves the best results. Besides, using the fixed learning rate of 1e-3 makes the model crash, achieving only 0.16 Smatch score on the AMR2.0 development set.
Analyses of Fine-tuning. Given the pre-trained models, how to effectively fine-tune them with the gold data is also worth discussing. In this work, we investigate three learning rate settings. Table 6 shows the results. We can see that 1) fine-tuning with a small fixed learning rate achieves promising results, with 1e-5 slightly better than 5e-5, and 2) using the original learning rate schedule achieves better results on TAMR^pre_JAMR but lags behind on the other two pre-trained models. Thus, we use the fixed learning rate of 1e-5 in our experiments.
Analyses of Ensemble Techniques. To find a better way to use multiple silver datasets, we compare the widely used model ensemble technique of averaging model parameters (Zhou et al., 2021), averaging the last decision-layer scores, and our proposed stack pre-training technique. Table 7 shows the results for different model combinations. Surprisingly, the parameter averaging method crashes in all combinations. We think this is because the pre-trained models differ a lot in their representation spaces due to the different silver data. Averaging the decision-layer scores achieves reasonable results, but cannot outperform the better of the merged models. Our proposed stack pre-training technique achieves the best results consistently across all combinations. Therefore, we use stack pre-training for our ensemble experiments. Besides, we experiment with different kinds of combinations and find that stacking high-quality silver data on low-quality silver data benefits more. From these observations, we believe our proposed stack pre-training technique can also be applied to other models and other NLP tasks with limited human-annotated data.
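For clarity, the two baseline ensemble techniques compared above can be sketched as follows; the inputs are placeholders (state dicts of the fine-tuned models and their decision-layer score tensors), not the exact implementation used in the experiments.

```python
import torch


def average_parameters(state_dicts):
    """AVG_P: element-wise average of the fine-tuned models' parameters."""
    avg = {}
    for key in state_dicts[0]:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg


def average_scores(score_list):
    """AVG_S: average the last decision-layer scores (e.g., concept logits)."""
    return torch.stack(score_list).mean(dim=0)
```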

Speed Comparison.

Table 8 shows the speed comparison between our model and SPRING. To parse the AMR2.0 test data, our TAMR needs 2 min 41 s, while SPRING needs 20 min 2 s. We think there are two main reasons for the relatively low speed of SPRING. First, the BART backbone has a huge number of parameters to compute in the decoder. Second, the resulting AMR graph text sequence contains many external symbols, which greatly increases the number of decoding steps. In contrast, our seq2graph model TAMR has a relatively small decoder and only needs to generate the concepts and their connected relations at each step.
Final Results.

Table 9 shows the final results of our model in comparison with previous works. We can see that the progress of pre-trained language models has influenced the development of AMR parsing. Before the rise of seq2seq pre-trained language models, the BERT/RoBERTa-based seq2graph and transition-based approaches achieved the best results. With seq2seq pre-trained language models, Bevilacqua et al. (2021) achieve the new SOTA. Our final model achieves comparable results with Bevilacqua et al. (2021), yet has a faster speed.

Table 9: Smatch scores of our final models and comparison with previous works on AMR2.0 test data. "Pipe.", "Trans.", "S2S", and "S2G." represent the pipeline-based, transition-based, seq2seq-based, and seq2graph-based methods, respectively. "G.R." means using graph re-categorization and † means using silver data.

Conclusion
In this work, we propose a simple Transformer-based AMR parsing model that adapts AMR parsing to the Transformer architecture, in which the attention mechanisms are given additional roles: latent alignment and the connections between concepts. Based on our proposed model, we conduct detailed experiments to investigate several strategies for using silver data and propose an effective stack pre-training method for ensembling different models. The experimental results show that our proposed model achieves the best performance among seq2graph-based models and demonstrate the effectiveness of using silver data. Our final model achieves comparable results with the SOTA model at an order of magnitude faster speed. Detailed analyses show the effectiveness of our proposed stack pre-training ensemble technique for using multiple silver datasets.