Guiding AMR Parsing with Reverse Graph Linearization

Abstract Meaning Representation (AMR) parsing aims to extract an abstract semantic graph from a given sentence. Sequence-to-sequence approaches, which linearize the semantic graph into a sequence of nodes and edges and generate the linearized graph directly, have achieved good performance. However, we observed that these approaches suffer from structure loss accumulation during the decoding process, leading to a much lower F1-score for nodes and edges decoded later compared to those decoded earlier. To address this issue, we propose a novel Reverse Graph Linearization (RGL) enhanced framework. RGL defines both default and reverse linearization orders of an AMR graph, where most structures at the back of the default order appear at the front of the reversed order, and vice versa. RGL incorporates the reversed linearization into the original AMR parser through a two-pass self-distillation mechanism, which guides the model when generating the default linearization. Our analysis shows that our proposed method significantly mitigates the problem of structure loss accumulation, outperforming the previously best AMR parsing model by 0.8 and 0.5 Smatch scores on the AMR 2.0 and AMR 3.0 datasets, respectively. The code is available at https://github.com/pkunlp-icler/AMR_reverse_graph_linearization.


Introduction
Abstract Meaning Representation (AMR) (Banarescu et al., 2013) is a formalization of a sentence's meaning using a directed acyclic graph that abstracts away from shallow syntactic features and captures the core semantics of the sentence. AMR parsing involves transforming a textual input into its AMR graph, as illustrated in Figure 1.
In this study, we aim to address the issue of structure loss accumulation in seq2seq-based AMR parsing. Our analysis (Figure 2, computed with AMRBART (Bai et al., 2022) on the test set of AMR 2.0) shows that the F1-score of structure prediction (nodes and relations) decreases as generation progresses. This phenomenon is a consequence of error accumulation in the auto-regressive decoding process, a common problem in natural language generation (Ing, 2007; Zhang et al., 2019c; Liu et al., 2021).
However, unlike natural language, the linearization of AMR graphs does not follow a strict order, as long as the sequence preserves all nodes and relations in the AMR graph. To this end, we define two linearization orders based on the depth-first search (DFS) traversal, namely Left-to-Right (L2R) and Right-to-Left (R2L). The L2R order is the conventional linearization used in most previous works (Bevilacqua et al., 2021; Bai et al., 2022; Chen et al., 2022), where the leftmost child in the Penman annotation is traversed first. In contrast, the R2L order is its reverse, where the structures at the end of the L2R order appear at the beginning of the R2L order. Training AMR parsing models with the R2L linearization improves the prediction accuracy for the structures at the end of the L2R order, since in the R2L order they are less affected by the accumulation of structure loss.
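The two traversal orders can be illustrated with a small sketch; the graph representation, function name, and toy graph below are our own illustration, not the paper's code:

```python
def linearize(graph, root, reverse=False):
    """Depth-first linearization of a rooted AMR-like graph.

    `graph` maps a node to its ordered list of (relation, child) pairs,
    as in the Penman annotation. reverse=False yields the L2R order;
    reverse=True visits children right-to-left, yielding the R2L order.
    (Toy sketch: variables, re-entrancies, and brackets are omitted.)
    """
    tokens = [root]
    children = graph.get(root, [])
    for rel, child in (reversed(children) if reverse else children):
        tokens.append(rel)
        tokens += linearize(graph, child, reverse)
    return tokens

# toy graph for "study and learn": and :op1 study-01 :op2 learn-01
g = {"and": [(":op1", "study-01"), (":op2", "learn-01")]}
l2r = linearize(g, "and")        # ['and', ':op1', 'study-01', ':op2', 'learn-01']
r2l = linearize(g, "and", True)  # ['and', ':op2', 'learn-01', ':op1', 'study-01']
```

Note how structures at the end of the L2R sequence (here, learn-01) appear near the front of the R2L sequence.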
We propose to enhance AMR parsing with reverse graph linearization (RGL). Specifically, we incorporate an additional encoder to integrate the reverse linearization and replace the original transformer decoder with a mixed decoder that utilizes gated dual cross-attention, taking input from both the hidden states of the sentence encoder and the graph encoder. We design a two-pass self-distillation mechanism to prevent the model from overfitting to the gold reverse-linearized graph while still utilizing it to guide the model training. Our analysis shows that our proposed method significantly mitigates the problem of structure loss accumulation, outperforming the previously best AMR parsing model (Bai et al., 2022) by 0.8 Smatch score on the AMR 2.0 dataset and 0.5 Smatch score on the AMR 3.0 dataset.
Our contributions can be listed as follows:
1. We explore the structure loss accumulation problem in sequence-to-sequence AMR parsing.
2. We propose a novel RGL framework that alleviates structure loss accumulation by incorporating reverse graph linearization into the model, outperforming the previously best AMR parser.
3. Extensive experiments and analysis demonstrate the effectiveness and superiority of our proposed method.
Assuming that we have a training set containing N sentence-linearized-graph pairs (x^i, y^i), the total training loss of the model is the cross-entropy loss:

L_CE = - Σ_{i=1}^{N} Σ_{t=1}^{m_i} log p(y_t^i | y_{<t}^i, x^i),

where m_i is the length of the i-th linearized AMR graph and y_{<t}^i denotes the previously generated tokens.
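As a concrete sketch of this loss, assuming a model that exposes the probability assigned to each gold token (all names below are hypothetical):

```python
import math

def seq2seq_ce_loss(token_probs):
    """Cross-entropy over a batch of linearized AMR graphs.

    token_probs[i][t] is the model probability p(y_t^i | y_{<t}^i, x^i)
    assigned to the gold token at step t of the i-th target sequence;
    the sequence length m_i is implicit in len(token_probs[i]).
    """
    loss = 0.0
    for probs in token_probs:   # one inner sum per training pair
        for p in probs:         # one term per target token
            loss += -math.log(p)
    return loss

# toy example: two sequences with near-certain predictions give a small loss
loss = seq2seq_ce_loss([[0.9, 0.8], [0.95]])
```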

Graph Linearization Order
As shown in Table 1, we formalize two types of graph linearization; the corresponding AMR graph is shown in Figure 1. Left-to-Right (L2R) denotes that when we use depth-first search (DFS) to traverse the children of a node, we start from the leftmost child and traverse to the right, which is identical to the order of the Penman annotation and is the default order of sequence-to-sequence based AMR parsers (Bevilacqua et al., 2021; Bai et al., 2022; Chen et al., 2022). In contrast, Right-to-Left (R2L) traverses from the rightmost child to the leftmost child, the reverse of the standard traversal order. When the input is long or contains multiple sentences, most of the nodes and relations positioned late in the L2R sequence appear early in the R2L sequence.

Methodology

Overview
Our method is illustrated in Figure 3. In addition to the traditional encoder-decoder architecture, we incorporate a graph encoder to include the reverse linearization sequence; as a result, the model takes both the sentence and its reverse linearization as input. We modify the original transformer decoder into a mixed decoder that uses gated dual cross-attention in each decoder layer, allowing the integration of hidden representations from both the sentence encoder and the graph encoder. During inference, an additional R2L AMR parser generates the reverse linearization ŷ_r of the sentence, and we then feed both the input sentence x and ŷ_r to the model. To obtain the reverse linearization during training, an intuitive approach is to linearize the gold AMR graph into the gold reverse linearization, denoted by y_r. However, simply using y_r and the source sentence x as input for all training data can lead the model to overfit to y_r and ignore the source sentence. As a result, the model may simply copy from y_r when generating y during training, which limits its performance during inference because of the noise in the generated reverse linearization ŷ_r.
To prevent the model from overfitting to y_r, we introduce the silver linearization ŷ_r during training. Since we still wish to utilize the gold linearization y_r to guide the training, we design a two-pass self-distillation mechanism. Alongside y_r, we incorporate ŷ_r, which is parsed by the additional R2L AMR parser during training. The teacher model takes y_r and x as input, while the student model takes ŷ_r and x. During each training step, the model performs two forward passes and computes cross-entropy losses, L_CE^T for the teacher and L_CE^S for the student. We employ a KL-divergence loss L_KL to guide the student with the teacher's output. We also design a loss scheduler to balance the weight α_i between L_CE^T and L_CE^S at optimization step i.
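The two-pass step can be sketched as follows; the exact composition of the objective (in particular how α_i weights the terms) follows our reading of the training-objective section, and every function name here is a placeholder, not the authors' code:

```python
def train_step(model_forward, x, y, y_r_gold, y_r_silver, alpha, ce, kl_div):
    """One two-pass self-distillation step (sketch under assumptions).

    The same parameters act as teacher (fed the gold reverse linearization
    y_r) and as student (fed the silver linearization from the R2L parser).
    """
    p_teacher = model_forward(x, y_r_gold, y)    # first forward pass
    p_student = model_forward(x, y_r_silver, y)  # second forward pass
    return (alpha * ce(p_teacher)                # teacher cross-entropy
            + (1 - alpha) * ce(p_student)        # student cross-entropy
            + kl_div(p_teacher, p_student))      # distillation term
```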

Model Structure
As shown in Figure 3, our model mainly consists of three parts: a sentence encoder, a graph encoder, and a mixed decoder. The major structural difference from standard pretrained models, e.g., BART (Lewis et al., 2020), is that we use a graph encoder to integrate the reverse-linearized structural information to guide the model.

Sentence Encoder
The sentence encoder receives the given sentence s = (s_1, s_2, ..., s_N) and encodes it into the hidden representations H_s = (h_1^s, h_2^s, ..., h_N^s), in the same way as the encoder of pretrained transformer models.
Graph Encoder
Following Bevilacqua et al. (2021) and Bai et al. (2022), we adopt the standard transformer encoder to encode the structural information. Given the reverse-linearized AMR graph, the output of the graph encoder is H_g = (h_1^g, h_2^g, ..., h_M^g).

Mixed Decoder
Different from the traditional decoder, the mixed decoder takes the hidden states of the sentence, H_s, and the graph, H_g, via a gated dual cross-attention layer, as shown in Figure 4. The gated dual cross-attention layer contains two cross-attention modules, which integrate H_s and H_g respectively. In each decoder layer, the output of the self-attention module is S_z ∈ R^{k×d}, where k is the number of tokens in the decoder input and d is the size of the hidden state. The output of each cross-attention module is computed as

S_s = Attn(S_z, H_s, H_s),  S_g = Attn(S_z, H_g, H_g),

where the two cross-attention modules share the same query S_z but attend to different key-value pairs, H_s and H_g respectively.
The output of the gated dual cross-attention module, S_o, is the weighted sum of S_s and S_g:

S_o = g ⊙ S_g + (1 - g) ⊙ S_s,

where the gate g ∈ R^{k×1} is predicted by a feed-forward network over the cross-attention outputs, whose weight matrices and bias b_2 ∈ R are trainable parameters.
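A minimal NumPy sketch of the gated dual cross-attention (single head, no projections or layer normalization; the gate's exact FFN input is our assumption — here it reads the concatenation of the two attention outputs):

```python
import numpy as np

def attn(Q, K, V):
    """Scaled dot-product attention (single head, no masking)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def gated_dual_cross_attention(S_z, H_s, H_g, W, b):
    """Sketch of the gated dual cross-attention.

    S_z: (k, d) self-attention output; H_s, H_g: encoder hidden states.
    W: (2d,) and b: scalar stand in for the gate FFN's parameters.
    """
    S_s = attn(S_z, H_s, H_s)  # attend to the sentence encoder
    S_g = attn(S_z, H_g, H_g)  # attend to the reverse-graph encoder
    logits = np.concatenate([S_s, S_g], axis=-1) @ W + b
    g = 1.0 / (1.0 + np.exp(-logits))      # sigmoid gate, shape (k,)
    g = g[:, None]                          # broadcast as (k, 1)
    return g * S_g + (1 - g) * S_s          # higher g: more graph attention
```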

Training Objective
The training objective of RGL is:

L = α_i · L_CE^T + (1 - α_i) · L_CE^S + L_KL,

where α_i is a balancing weight at the i-th iteration, L_CE^T and L_CE^S are the cross-entropy losses of the teacher and the student respectively, and L_KL is the self-distillation loss. We regard the forward pass taking y_r as input as the teacher and the pass taking ŷ_r as the student. To obtain the output distributions of both the teacher and the student, the model performs two forward passes in one training step. Note that the teacher and the student share the same parameters.

Self-distillation
To distill knowledge from the teacher pass to the student pass, we guide the output of the student with the teacher's by minimizing the Kullback-Leibler divergence:

L_KL = Σ_{j=1}^{D} p_j log(p_j / q_j),

where p and q are the output probabilities of the teacher and the student respectively, and D is the number of classes, i.e., the size of the target vocabulary.
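The divergence above can be computed directly; a small sketch over one token's output distribution:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between the teacher (p) and student (q) output
    distributions over a vocabulary of size D = len(p)."""
    return sum(pj * math.log(pj / qj) for pj, qj in zip(p, q) if pj > 0)

p = [0.7, 0.2, 0.1]  # teacher probabilities
q = [0.6, 0.3, 0.1]  # student probabilities
loss = kl_divergence(p, q)  # small positive value; zero iff p == q
```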
Loss scheduler Inspired by the idea of curriculum learning, we introduce a loss scheduler to better balance the training process. We set an adaptive coefficient α_i to control the weights of L_CE^T and L_CE^S; α_i gradually decays as the training step i increases. The model should learn more from the gold linearization while its capability is weak, so that it converges quickly. Once its capability is strong, it should learn to infer from the noisy silver linearization, which makes it more capable and robust to noise during inference, where no gold linearization graph is available. α_i decays exponentially from an upper bound to a lower bound, controlled by the hyper-parameters k_1 and k_2 respectively. We set the upper bound of α_i to 0.8 and the lower bound to 0.2 without further tuning.
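One plausible form of this schedule is sketched below; the paper specifies only the bounds (0.8 and 0.2), so the decay constant tau is our own assumption:

```python
import math

def alpha_schedule(step, k1=0.8, k2=0.2, tau=1000.0):
    """Exponential decay of the teacher weight alpha_i from the upper
    bound k1 toward the lower bound k2 as training step increases.
    tau (the decay time constant) is a hypothetical hyper-parameter."""
    return k2 + (k1 - k2) * math.exp(-step / tau)
```

Early in training the teacher loss dominates (alpha near 0.8); late in training the student loss dominates (alpha near 0.2).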

Inference
Given a sentence, we first use the R2L AMR parser to generate its reverse linearization. The trained RGL model then takes the reverse linearization and the sentence as input and decodes the standard L2R AMR linearization.

Datasets
We conducted our experiments on two AMR benchmark datasets, AMR 2.0 and AMR 3.0. AMR 2.0 contains 36,521, 1,368, and 1,371 sentence-AMR pairs in the training, validation, and testing sets, respectively. AMR 3.0 has 55,635, 1,722, and 1,898 sentence-AMR pairs in the training, validation, and testing sets, respectively.

Evaluation Metrics
We use the Smatch score (Cai and Knight, 2013) and the fine-grained scores (Damonte et al., 2017) to evaluate the performance. Detailed explanations of the metrics are given in Appendix B.
BLINK (Wu et al., 2019) is used to add wiki tags to the predicted AMR graphs in all systems in our experiments. Following Bai et al. (2022), we do not apply any re-categorization or other post-processing methods to restore the AMR graph from the token sequence.

Main Compared Systems
AMRBART We use the current state-of-the-art sequence-to-sequence AMR parser proposed by Bai et al. (2022) as our main baseline model.
RGL We initialize our model from AMRBART (Bai et al., 2022). The sentence encoder and the graph encoder are both initialized from the AMRBART encoder, but they have individual gradients during training. Full details of the compared systems are listed in Appendix A.

Main Results
We report the results of our method along with several seq2seq baselines on the two major datasets, AMR 2.0 and AMR 3.0, in Table 2. Our method outperforms previous methods significantly and provides a state-of-the-art AMR parser.
In comparison with the baseline AMRBART, our method outperforms it by 0.8 Smatch points on AMR 2.0 and 0.5 Smatch points on AMR 3.0. Moreover, our method does not introduce any additional data and is compatible with existing methods such as Chen et al. (2022) and Bai et al. (2022).

Ablation Study
Model Training Table 3 presents the results of an ablation study in which we analyze how different training methods affect the performance of RGL.
We observed a significant drop in model performance when we removed the silver linearization from the training process. This setting feeds the model the gold linearization during training while using the silver linearization at inference. We believe this drop occurred for two reasons. First, since the gold reverse linearization and the target are highly similar in structure, the model can easily overfit to the gold reverse linearization and ignore the source sentence, causing it to simply replicate the input y_r as y instead of accurately parsing the sentence into an AMR graph. Second, without a loss over noisy reverse linearizations during training, the model does not learn to differentiate the correct parts of the graph from the noisy parts, which is required during inference. Therefore, without the silver graph during training, our model cannot be effectively trained.
We also observed a significant drop in performance when we removed self-distillation from the training process. This highlights the importance of self-distillation in our method, which helps the model avoid the erroneous information caused by noise in silver graphs during training. Nevertheless, our method still outperformed AMRBART even without self-distillation, which demonstrates the effectiveness of incorporating the reverse linearization into AMR parsing.
Finally, when we removed the loss scheduler, the performance of the model degraded. This emphasizes the importance of the loss scheduler in balancing the teacher and the student during training and enhancing the performance of our method.

Graph Encoder Size
We conduct an ablation experiment on how the size of the graph encoder influences the parsing performance. As shown in Table 4, we retain only the bottom few layers of the graph encoder and observe that the performance generally declines as the number of layers decreases. However, even when the graph encoder retains only four layers, our model still outperforms AMRBART, which demonstrates the effectiveness of incorporating the reverse graph linearization during training.

On the Effect of R2L Linearization
In this section, we replace the input of the graph encoder with different sequences to validate the effectiveness of the R2L linearization, as shown in the upper part of Table 5. ① is the proposed RGL and achieves the best performance of all methods. In ②, we replace the input of the graph encoder with the standard L2R linearization without changing other conditions. Inspired by the ideas of Zhou et al. (2019a,b), which explore decoding from both sides for machine translation, in ③ we directly reverse the entire L2R linearization token sequence as the input of the graph encoder instead of the R2L linearization, so that all nodes and relations appear strictly in the opposite order of the L2R linearization.
Comparing ① to ②, we observe a more significant improvement when using the R2L linearization. This is because some nodes and relations in the R2L linearization are predicted earlier by the R2L parser, with less structure loss and higher accuracy, and thus serve as a complementary source of information for the model. The result proves the effectiveness of incorporating the reverse linearization.
Comparing ① to ③, we find that the performance drops if we replace the R2L linearization with a simply reversed L2R token sequence. We believe the main reason is that the dependencies between nodes and relations within the linearized AMR graphs are highly intricate. Simply reversing the sequence can lead to unexpected changes, e.g., to referential variables, making it challenging for the model to predict accurately after the inversion. In fact, the parsing performance of the simple reverse parser is only 75.9 Smatch score, far below the baseline model. In contrast, the R2L linearization is a more reasonable reverse, since it is meaning-equivalent to the original L2R linearization and reaches parsing performance similar to the original L2R parser.
The combined findings demonstrate that incorporating a reverse order is advantageous for AMR parsing. Moreover, the R2L linearization proves to be a more suitable form than reversing the input sequence token by token.

On Incorporating R2L Linearization
In this section, we compare different methods of incorporating the R2L linearization, including several works from other fields adapted to the setting of AMR parsing, shown in the lower part of Table 5.
Double-decoder+KL Xie et al. (2021) use two decoders to generate two different linearizations, i.e., DFS and BFS, for code generation, and leverage mutual information to narrow the KL-divergence between the outputs. We adapt this method to the AMR parsing setting, where the two linearizations are L2R and R2L, and narrow the output distributions of the corresponding nodes and relations of the two linearizations.
Multitask A simple method to integrate an extra linearization order is multitask learning, where the model learns to predict both the L2R and the R2L AMR graph. During training, a task identifier <L2R> or <R2L> is added to the beginning of the input sentence to indicate the output order. During inference, we test the two orders individually and select the order with the higher Smatch score (L2R) as the final result. The difference from ① in model architecture is that we share a decoder which learns to generate the different linearizations simultaneously, rather than using an extra decoder.
Concatenate Input Another intuitive way to introduce the reverse linearization into the model is to concatenate it with the textual input. Compared with RGL, this method removes the additional graph encoder without changing other conditions.
Experimental results show that both ④ and ⑤ can benefit the model, as they implicitly incorporate the R2L linearization through the training loss. However, the proposed RGL explicitly integrates the reverse linearization into the model as an extra input, achieving more significant improvements. Integrating the R2L linearization by directly concatenating it with the model input is not as effective as RGL. One possible reason is that the linearized graph and the sentence have different structures, and simply concatenating them and letting a single encoder learn the extra structural information provided by the R2L linearization is challenging. Therefore, the extra graph encoder is necessary for encoding the R2L linearization.
Overall, this section demonstrates that RGL is an effective method for incorporating reverse linearization into the model.

Effect of RGL on structure loss
The decrease of F1 scores for nodes and relations with prediction length is shown in Figure 5. Compared with the baseline AMRBART, there is a significant improvement in the F1 scores of both node and relation prediction for RGL when the prediction length is greater than 30. To quantify the results, we measured the Pearson coefficients between the F1 scores of nodes and relations and the prediction position. Compared to AMRBART, the Pearson correlation between node F1 scores and prediction position weakened from -0.42 to -0.26, and that between relation F1 scores and prediction position weakened from -0.72 to -0.60. This shows that the RGL model can indeed alleviate the structure loss problem.
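The correlation measurement can be sketched as follows; the positions and F1 values below are toy illustrations, not the paper's data:

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series,
    e.g. prediction-position buckets vs. per-bucket F1-scores."""
    return float(np.corrcoef(x, y)[0, 1])

# toy illustration: an F1 that drops with position yields a negative r
positions = [25, 75, 125, 175, 225]
f1_scores = [0.85, 0.82, 0.80, 0.74, 0.70]
r = pearson(positions, f1_scores)  # negative: F1 declines with position
```

A correlation closer to zero (as reported for RGL) indicates a weaker dependence of F1 on decoding position.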

Our analysis also reveals that node prediction is less affected by structure loss accumulation than relation prediction. We believe this is mainly because node prediction in AMR parsing is relatively easier, whereas relation prediction requires correct node predictions as a precondition.

Balancing source and reverse linearization
Figure 6 shows the results of a quantitative analysis of the weight g in the gated dual cross-attention of RGL. We recorded the positions and gate values during model inference on the validation set.
The diagram reveals that the average value of the gate is less than 0.5, indicating that the model pays more attention to the source sentence than to the reverse linearization. This suggests that the model is performing sentence-to-AMR conversion rather than simply copying the reverse linearization.
Furthermore, there is a positive correlation between the gated weight and the position, which provides insight into how our method works. At positions closer to the beginning, the model has greater confidence, resulting in smaller structure loss, and can predict the AMR graph using only the original source sentence. As the position increases, the model needs to refer to the reverse linearization to compensate for the accumulation of structure loss. Consequently, the gated weight for the reverse linearization becomes larger as the position increases.
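The per-bucket averaging behind Figure 6 can be sketched as follows (bucket size 50, as the figure's description states; function and variable names are our own):

```python
import numpy as np
from collections import defaultdict

def bucketed_gate_means(positions, gate_values, bucket_size=50):
    """Average gate value per position bucket.

    positions and gate_values are parallel sequences recorded during
    inference; returns {bucket index: mean gate value}."""
    buckets = defaultdict(list)
    for pos, g in zip(positions, gate_values):
        buckets[pos // bucket_size].append(g)
    return {b: float(np.mean(v)) for b, v in sorted(buckets.items())}

# toy data: positions 10 and 40 fall in bucket 0, position 60 in bucket 1
means = bucketed_gate_means([10, 40, 60], [0.2, 0.4, 0.5])
```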

Related Work
AMR parsing aims to convert a textual input into an AMR semantic graph (Banarescu et al., 2013). Previous work mainly follows four strategies: two-stage approaches (Flanigan et al., 2014; Lyu and Titov, 2018; Zhang et al., 2019a; Zhou et al., 2020), graph-based approaches (Zhang et al., 2019b; Cai and Lam, 2020), transition-based approaches (Naseem et al., 2019; Lee et al., 2020; Fernandez Astudillo et al., 2020; Zhou et al., 2021), and sequence-to-sequence approaches (Ge et al., 2019; Xu et al., 2020a; Bevilacqua et al., 2021; Wang et al., 2021; Bai et al., 2022; Chen et al., 2022; Yu and Gildea, 2022b; Cheng et al., 2022). In terms of AMR graph linearization, Bevilacqua et al. (2021) explore which linearization method is better for AMR parsing, and Chen et al. (2022) study how to linearize different semantic resources such as SRL to enhance AMR parsing. Some methods have also been proposed to incorporate graph information into sequence-to-sequence models to compensate for the discrepancy between graph and sequence (Yu and Gildea, 2022a; Bai et al., 2022). While previous seq2seq-based AMR parsing models mostly adopt the L2R linearization order by default, our work is the first to explore how to leverage different graph linearization orders to enhance AMR parsing.

Conclusion
In this work, we propose a novel Reverse Graph Linearization (RGL) enhanced framework to address the structure loss accumulation problem observed in seq2seq-based AMR parsing. Extensive experiments and analysis show that RGL significantly mitigates the problem of structure loss accumulation and outperforms the previous state-of-the-art model on both the AMR 2.0 and AMR 3.0 datasets, which demonstrates the effectiveness of the proposed approach.

Limitation
Compared to a traditional sequence-to-sequence AMR parser, our model needs an additional R2L parser to generate the reverse linearizations, although such a parser can easily be obtained by fine-tuning an off-the-shelf AMR parser, e.g., AMRBART (Bai et al., 2022) or SPRING (Bevilacqua et al., 2021). Because the reverse linearization must be generated before AMR parsing, inference is twice as slow as a one-pass AMR parser.

B Detailed Evaluation Metrics
We use the Smatch score (Cai and Knight, 2013) to evaluate the performance. The fine-grained scores (Damonte et al., 2017) are as follows: i) No WSD, computed while ignoring Propbank senses (e.g., duck-01 vs. duck-02); ii) Wikification, F-score on the wikification (:wiki roles); iii) Concepts, F-score on the concept identification task; iv) NER, F-score on named entity recognition (:name roles); v) Negations, F-score on negation detection (:polarity roles); vi) Unlabel, computed on the predicted graphs after removing all edge labels; vii) Reentrancy, computed on reentrant edges only; viii) Semantic Role Labeling (SRL), computed on :ARG-i roles only.

D Error propagation vs. structure loss
Figure 8 highlights the distinction between error propagation and structure loss. Error propagation is typically evaluated position-wise or within a limited window (Liu et al., 2021), and is observed in almost every autoregressive method, including sequence-to-sequence based AMR parsing: once a previous prediction is misplaced or incorrect, subsequent predictions tend to follow the same pattern. In contrast, structure loss evaluates the validity of a node or relation based on its existence in the entire gold graph, rather than its position or window. We argue that structure loss more accurately reflects the challenges in AMR parsing and other structure generation tasks because it measures the overall quality of the generated AMR graph.

E Case Study
The example illustrated in Figure 9 shows the accumulation of structure loss more intuitively. We align the variables predicted by the model with the gold AMR graph and mark the prediction errors in red. From the figure, we can see that there are more errors in the later part of the predicted AMR graph. Moreover, the relation ":snt2" is wrongly predicted due to errors in the previous relations ":op1" and ":op2", which shows that the dependencies imposed by the sequence-to-sequence manner on AMR parsing have a negative effect.

Figure 1: An example of AMR Parsing of the sentence "Come to study and learn".

Figure 3: The overview of our method. In addition to the encoder-decoder model, an additional graph encoder is used to incorporate the reverse graph linearization. Following the paradigm of self-distillation, we regard the model with the gold linearization y_r and x as input as the teacher, and the model with ŷ_r (parsed by a pre-trained R2L parser) and x as the student. The model performs two forward passes to obtain the output probabilities of the teacher and the student in each training step. We calculate the cross-entropy losses of the teacher and the student as well as their KL divergence as the training loss. Given a sentence x during inference, the model generates the standard AMR linearization using x and its silver linearization ŷ_r.

Figure 4: The illustration of the mixed decoder in RGL. H_s and H_g are the hidden representations from the sentence encoder and graph encoder. The module enclosed by the dashed line is the gated dual cross-attention, which integrates the outputs of the dual attention through a gate predicted by an FFN. For brevity and focus, the residual connections and normalization are omitted from the figure.

Figure 5: F1-scores of nodes and relations as the predicted length increases, for AMRBART (Bai et al., 2022) (orange bars) and RGL (blue bars).

Figure 6: The histogram of the gated weight in the gated dual cross-attention as the position increases during inference. A higher value indicates that the model attends more to the output of the reverse graph encoder in the cross-attention layer. We divided the positions into buckets of size 50 and computed the average gate value across all positions and layers within each bucket, represented by the blue bars in the diagram.

Figure 7: The convergence curves of RGL and AMRBART.

Figure 7 presents the convergence curves of RGL and AMRBART on the AMR 2.0 dataset. The training process consists of 30 epochs. After each epoch, we compute the Smatch score of RGL and AMRBART on the validation set. The results in Figure 7 indicate that RGL outperforms AMRBART significantly.

Figure 8: The descent of (a) position-wise accuracy and (b) graph-wise F1-score of nodes and relations as decoding progresses. The results are from AMRBART (Bai et al., 2022) on the test set of AMR 2.0.

Table 3: Ablation study results on RGL. "w/o loss scheduler": remove the loss scheduler from the training process and simply add up all loss terms. "w/o self-distillation": remove L_KL and L_CE^T from the training objective. "w/o silver linearization": remove L_KL and L_CE^S from the training objective.