Exploiting Abstract Meaning Representation for Open-Domain Question Answering

The Open-Domain Question Answering (ODQA) task involves retrieving fine-grained relevant passages from a large database and subsequently generating answers from them. Current systems leverage Pretrained Language Models (PLMs) to model the relationship between questions and passages. However, diversity in surface-form expressions can hinder the model's ability to capture accurate correlations, especially within complex contexts. We therefore utilize Abstract Meaning Representation (AMR) graphs to help the model understand complex semantic information, introducing a method called Graph-as-Token (GST) to incorporate AMRs into PLMs. Results on Natural Questions (NQ) and TriviaQA (TQ) demonstrate that our GST method significantly improves performance, with up to 2.44/3.17 Exact Match score improvements on NQ/TQ, respectively. Furthermore, our method enhances robustness and outperforms alternative Graph Neural Network (GNN) methods for integrating AMRs. To the best of our knowledge, we are the first to employ semantic graphs in ODQA.


Introduction
Question Answering (QA) is a significant task in Natural Language Processing (NLP) (Rajpurkar et al., 2016). Open-domain QA (ODQA) (Chen et al., 2017), in particular, requires models to output a single answer in response to a given question using a set of passages that can total in the millions. ODQA presents two technical challenges: the first is retrieving (Karpukhin et al., 2020) and reranking (Fajcik et al., 2021) relevant passages from the dataset, and the second is generating an answer to the question using the selected passages. In this work, we focus on the reranking and reading processes, which necessitate fine-grained interaction between the question and passages.
Existing work attempts to address these challenges using Pretrained Language Models (PLMs) (Glass et al., 2022). However, diverse surface-form expressions often make it challenging for the model to capture accurate correlations, especially when the context is lengthy and complex. We present an example from our experiments in Figure 1. In response to the question, the reranker incorrectly ranks a confusing passage first, and the reader generates the answer "2015-16". The error arises from the PLMs' inability to effectively handle the complex semantic structure: although "MVP", "Stephen Curry" and "won the award" appear together, they are not semantically related. In contrast, the AMR graph makes it clear that "Stephen Curry" wins over "international players", not the "MVP", which helps the model avoid the mistake. The baseline model may also fail to associate "Most Valuable Player" in the passage with "MVP" in the question, which may be why the baseline does not rank the gold passage in the Top10. To address this issue, we adopt structured semantics (i.e., Abstract Meaning Representation (Banarescu et al., 2013) graphs, shown on the right of Figure 1) to enhance Open-Domain QA.
While previous work has integrated graphs into neural models for NLP tasks, adding additional neural architectures to PLMs is non-trivial, as training a graph network without compromising the original architecture of PLMs can be challenging (Ribeiro et al., 2021). Converting AMR graphs directly into text sequences and appending them is natural, but leads to excessively long sequences that exceed the maximum processing length of the transformer. To integrate AMR into PLMs without altering the transformer architecture and at a manageable cost, we treat nodes and edges of AMR Graphs aS Tokens (GST) in PLMs: we project the embeddings of each node/edge, which consist of multiple tokens, into a single token embedding and append them to the textual sequence embeddings. This allows integration into PLMs without altering the main model architecture, and does not require integrating a Graph Neural Network into the transformer, which is the common way of injecting graph information into PLMs (Yu et al., 2022; Ju et al., 2022). The GST method is inspired by Kim et al. (2022) in the graph learning domain, who use token embeddings to represent nodes and edges for the transformer architecture in graph learning tasks. However, their method is not tailored for NLP tasks, does not consider the textual sequence embeddings, and only handles a limited set of node/edge types, whereas we address unlimited types of nodes/edges consisting of various tokens.
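As a toy illustration of the Graph-aS-Token idea, the sketch below pools each node's/edge's internal token embeddings into a single vector, projects it with a shared matrix, and appends the results to the text sequence embeddings. This is a simplified NumPy stand-in, not the paper's implementation: mean-pooling plus one matrix substitutes for the MLP/attention projections described later, and all shapes are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 8  # hidden size (toy value)

def graph_as_token(text_emb, node_token_embs, edge_token_embs, W):
    """Append one projected embedding per AMR node/edge to the text embeddings.

    text_emb: (L, H) token embeddings of the question-passage pair.
    node_token_embs / edge_token_embs: lists of (k_i, H) arrays holding the
    internal token embeddings of each node/edge. W: (H, H) shared projection.
    """
    nodes = np.stack([toks.mean(axis=0) for toks in node_token_embs]) @ W
    edges = np.stack([toks.mean(axis=0) for toks in edge_token_embs]) @ W
    # result: [X_T, X_N, X_E] -- one extra "token" per node and per edge
    return np.concatenate([text_emb, nodes, edges], axis=0)

text = rng.normal(size=(12, H))                      # 12 text tokens
node_toks = [rng.normal(size=(3, H)), rng.normal(size=(1, H))]  # 2 nodes
edge_toks = [rng.normal(size=(5, H))]                # 1 edge
W = rng.normal(size=(H, H))
X = graph_as_token(text, node_toks, edge_toks, W)    # (12 + 2 + 1, H)
```

The appended sequence can then be fed to an unchanged transformer, which is the core of why GST needs no architectural modification.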
Specifically, we select BART and FiD as baselines for the reranking and reading tasks, respectively. To integrate AMR information, we first embed each question-passage pair into text embeddings. Next, we parse the pair into a single AMR graph using AMRBART (Bai et al., 2022a). We then employ the GST method to embed the graph nodes and edges into graph token embeddings and concatenate them with the text embeddings. Lastly, we feed the concatenated text-graph embeddings as the input embeddings to a BART-based (Lewis et al., 2020a) reranker for reranking or a FiD-based (Izacard and Grave, 2020b) reader to generate answers.
We validate the effectiveness of our GST approach on two datasets, Natural Questions (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017). Results indicate that AMR enhances the models' ability to understand complex semantics and improves robustness. BART-GST-reranker and FiD-GST outperform BART-reranker and FiD on the reranking and reading tasks, respectively, achieving up to a 5.9 Top5 score improvement, a 3.4 Top10 score improvement, and a 2.44 Exact Match increase on NQ. When the test questions are paraphrased, models equipped with GST prove more robust than the baselines. Additionally, GST outperforms alternative GNN methods, such as Graph-transformer and Relational Graph Convolution Network (RGCN) (Schlichtkrull et al., 2018), for integrating AMRs.
To the best of our knowledge, we are the first to incorporate semantic graphs into ODQA, thereby achieving better results than the baselines.

Related Work
Open-domain QA. Open-Domain Question Answering (ODQA) (Chen et al., 2017) aims to answer factual questions given a large-scale text database, such as Wikipedia. It consists of two steps. The first is dense passage retrieval (Karpukhin et al., 2020), which retrieves a certain number of passages that match the question; a reranking step can then be used to select the best-matching passages (Fajcik et al., 2021; Glass et al., 2022). The second is reading, which finds the answer by reading the best-matching passages (Izacard and Grave, 2020b; Lewis et al., 2020b). We focus on reranking and reading, and integrate AMR into those models.
Abstract Meaning Representation (AMR) (Banarescu et al., 2013) is a formalism for representing the semantics of a text as a rooted, directed graph. In this graph, nodes represent basic semantic units such as entities and predicates, and edges represent the relationships between them. Compared with free-form natural language, AMR graphs are more semantically stable, as sentences with the same semantics but different expressions can be mapped to the same AMR graph (Bai et al., 2021; Naseem et al., 2021). In addition, AMR graphs are believed to carry more structured semantic information than pure text (Naseem et al., 2021).
Previous work has integrated AMR graphs into neural network models. For example, Bai et al. (2021) adopt Graph-transformer (Yun et al., 2019) to integrate AMRs into the transformer architecture for dialogue understanding and generation. AMR-DA (Shou et al., 2022) uses AMRs for data augmentation, first parsing text into AMRs and then regenerating text from the AMRs. Bai et al. (2022b) use AMR graphs with rich semantic information to redesign pretraining tasks, which improves downstream dialogue understanding. However, none of these works targets open-domain QA or applies the GST technique, which requires no extra architectures in the PLMs and thus avoids the incompatibility of different model architectures.
Integrating Structures into PLMs for ODQA. Some work also integrates structure information into PLMs for ODQA. For example, GRAPE (Ju et al., 2022) inserts a relation-aware graph neural network into the T5 encoders of FiD to encode knowledge graphs and enhance the encoder output embeddings; KG-FiD (Yu et al., 2022) uses a knowledge graph to link different but correlated passages, reranks them before and during reading, and feeds only the output embeddings of the most correlated passages into the decoder. However, existing work concentrates on knowledge graphs as the source of structure information, and no previous work has considered AMRs for ODQA.
LLMs in Open-Domain Question Answering (ODQA). Prior research has utilized pre-trained language models (PLMs) to directly answer open-domain questions without retrieval (Yu et al., 2023; Wang et al., 2021; Ye et al., 2021; Rosset et al., 2021). The results, however, have traditionally not been as effective as those achieved by the combined application of DPR and FiD. It was not until the emergence of ChatGPT that direct answer generation via internal parameters appeared to be a promising approach.
In a study conducted by Wang et al. (2023), the performances of Large Language Models (LLMs), such as ChatGPT (versions 3.5 and 4), GPT-3.5, and Bing Chat, were manually evaluated and compared with that of DPR+FiD across NQ and TQ test sets.The findings demonstrated that FiD surpassed ChatGPT-3.5 and GPT-3.5 on the NQ test set and outperformed GPT-3.5 on the TQ test set, affirming the relevance and effectiveness of the DPR+FiD approach even in the era of LLMs.

Method
We introduce the Retrieval and Reading of Open-Domain QA and their baselines in Section 3.1, AMR graph generation in Section 3.2 and our method Graph-aS-Token (GST) in Section 3.3.

Baseline
Retrieval. The retrieval model aims to retrieve N_1 passages from M reference passages (N_1 ≪ M) given the question q. Only fast algorithms, such as BM25 and DPR (Karpukhin et al., 2020), can be used to retrieve from the large-scale database; complex but accurate PLMs cannot be directly adopted, so retrieval alone is often not very accurate. A commonly used remedy is a reranking process that refines the retrieval results, for which PLMs can encode the question-passage correlations more accurately. Formally, reranking requires the model to select the N_2 passages most correlated with q from the N_1 retrieved passages (N_2 < N_1). For each passage p in the retrieved set P_{N_1}, we concatenate q and p and embed them into text sequence embeddings X_qp ∈ R^{L×H}, where L is the max token length of the question-passage pair and H is the hidden dimension.
We use a pretrained language model to encode each X_qp and a classification head to calculate a correlation score between q and p:

s_qp = MLP(PLM(X_qp)), (1)

where PLM denotes the pretrained language model and a commonly used Multi-Layer Perceptron (MLP) serves as the classification head. We use cross entropy as the loss function:

L = - Σ_{i=1}^{N_pos} log ( e^{s_i} / ( e^{s_i} + Σ_{j=1}^{N_neg} e^{s_j} ) ), (2)

where N_pos and N_neg are the numbers of positive and negative passages for training one question, respectively. To assign the positive/negative label of each passage with respect to the question, we follow Karpukhin et al. (2020) and check whether at least one answer appears in the passage.
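The loss above can be sketched numerically. Note this is a hedged reconstruction: the exact form of Eq. (2) is partially garbled in the source, so we assume a standard DPR-style contrastive cross-entropy in which each positive score is normalized against all candidate scores.

```python
import numpy as np

def rerank_loss(pos_scores, neg_scores):
    """Contrastive cross-entropy over positive vs. negative passage scores.

    Assumed DPR-style form: each positive score s_i is normalized against
    all candidates, and the loss is the mean negative log-likelihood.
    """
    logits = np.concatenate([pos_scores, neg_scores])
    log_norm = np.log(np.exp(logits).sum())  # log of the softmax denominator
    return float(np.mean([log_norm - s for s in pos_scores]))

# a well-separated positive should give a much smaller loss
loss_good = rerank_loss(np.array([5.0]), np.array([0.0] * 7))
loss_bad = rerank_loss(np.array([0.0]), np.array([5.0] * 7))
```

The 1-positive/7-negative split mirrors the training setup reported later in the paper.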
We pass the Top-N_2 reranked passages to the reading process.
Reading. The reader needs to generate an answer a given the question q and N_2 passages. In this work, we choose the Fusion-in-Decoder (FiD) model (Izacard and Grave, 2020b) as the baseline reader. The FiD model uses N_2 separate T5 encoders (Raffel et al., 2020) to encode the N_2 passages and concatenates the encoder hidden states to feed into one T5 decoder, which generates the answer.
Similar to reranking, we embed the question q and each passage p into text sequence embeddings X_qp ∈ R^{L×H}, where L is the max token length of the question-passage pair and H is the hidden dimension. Next, we feed the embeddings into the FiD model to generate the answer a, which is a text sequence.
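The FiD encode-fuse-decode flow can be sketched as below. The toy encoder/decoder lambdas are placeholders for the real T5 modules (assumptions, not the actual API); the point is that each question-passage pair is encoded independently and the decoder attends over the concatenation of all encoder states.

```python
import numpy as np

def fid_forward(question, passages, encoder, decoder):
    """Fusion-in-Decoder sketch: encode each (question, passage) pair
    separately, concatenate all encoder states, decode a single answer."""
    states = [encoder(f"question: {question} context: {p}") for p in passages]
    fused = np.concatenate(states, axis=0)  # (N2 * L, H)
    return decoder(fused)

# toy stand-ins for the T5 encoder/decoder
toy_encoder = lambda text: np.ones((4, 8))  # every "passage" -> (L=4, H=8)
toy_decoder = lambda states: f"answer from {states.shape[0]} fused states"

out = fid_forward("who?", ["p1 text", "p2 text"], toy_encoder, toy_decoder)
# two passages of 4 states each -> the decoder sees 8 fused states
```

Because fusion happens only in the decoder, encoding cost grows linearly in N_2 rather than quadratically in total input length.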

AMR
We concatenate each question q and passage p and parse the resulting sequence into an AMR graph G_qp = {V, E}, where V and E are the nodes and edges, respectively. Each edge has a type, so E = {(u, r, v)}, where u, r and v represent the head node, the relation and the tail node, respectively.
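Concretely, the parsed graph can be represented as a node set plus typed triples. The sketch below builds such a G_qp = {V, E} from a toy AMR fragment; the node labels and relation names are illustrative only, not the parser's actual output.

```python
def build_amr_graph(triples):
    """G_qp = {V, E}: V is the node set, E the list of typed edges (u, r, v)."""
    V = sorted({u for u, _, _ in triples} | {v for _, _, v in triples})
    E = list(triples)
    return V, E

# toy AMR fragment for "Stephen Curry won the award" (labels illustrative)
triples = [("win-01", ":ARG0", "person_Stephen_Curry"),
           ("win-01", ":ARG1", "award")]
V, E = build_amr_graph(triples)
```

Each element of V and E is then mapped to token embeddings and projected to a single token, as described in the next section.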

Graph aS Token (GST)
As shown in Figure 2, we project each node n or edge e in an AMR graph G into a node embedding x_n or edge embedding x_e. We adopt two methods to project each node and edge into one token embedding: MLP projection and attention projection. After the projection, we append the node embeddings X_N = [x_{n_1}, ..., x_{n_{n_n}}] and edge embeddings X_E = [x_{e_1}, ..., x_{e_{n_e}}] to the corresponding text sequence embeddings X_T = [x_{t_1}, ..., x_{t_{n_t}}], so the resulting sequence embedding is X = [X_T, X_N, X_E].

Initialization. We explain here how we initialize the embeddings of nodes and edges.
For both edges and nodes, we first embed their internal tokens into token embeddings: for an edge e = (u, r, v), these are the token embeddings of u, r and v; for a node n, the token embeddings of n.

MLP Projection. The process is illustrated in the MLP Projection part of Figure 2. As each AMR node can contain more than one token, we first average its token embeddings. For example, for a head node u,

x_u = AVE([x_{u_1}, ..., x_{u_{n_u}}]).

The same is done for the relation r and the tail node v. Then, we concatenate the two node embeddings and the relation embedding into an intermediate edge embedding,

x_{e_2} = [x_u; x_r; x_v] ∈ R^{3d_H}.

Next, we use an MLP layer in R^{3d_H×d_H} to project x_{e_2} into the final edge embedding x_e ∈ R^{d_H}. Similarly, we first average the node token embeddings, x_{n_1} = AVE([x_{n^1}, ..., x_{n^{n_n}}]). To reuse the same MLP layer, we make two copies of the node embedding and concatenate all three, x_{n_2} = [x_{n_1}; x_{n_1}; x_{n_1}] ∈ R^{3d_H}, and adopt the MLP layer to obtain the final node embedding x_n ∈ R^{d_H}. We also tried assigning separate MLP layers to nodes and edges, but preliminary experiments showed that this does not improve the results.
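A minimal sketch of the MLP projection, assuming toy dimensions and a randomly initialized shared layer (bias omitted for brevity): edges concatenate the pooled head/relation/tail embeddings, while nodes tile their pooled embedding three times so the same R^{3d_H×d_H} layer can be reused.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8                                  # hidden size d_H (toy value)
W = rng.normal(size=(3 * d, d)) / d    # shared MLP layer, R^{3d_H x d_H}

def project_edge(head_toks, rel_toks, tail_toks):
    # average each component's token embeddings, concatenate to (3d,), project
    x_e2 = np.concatenate([head_toks.mean(0), rel_toks.mean(0),
                           tail_toks.mean(0)])
    return x_e2 @ W

def project_node(node_toks):
    # tile the pooled node embedding three times so the same MLP is reused
    pooled = node_toks.mean(0)
    return np.concatenate([pooled, pooled, pooled]) @ W

x_e = project_edge(rng.normal(size=(2, d)), rng.normal(size=(1, d)),
                   rng.normal(size=(3, d)))
x_n = project_node(rng.normal(size=(4, d)))  # both land in R^{d_H}
```

Both outputs are single d_H-dimensional vectors, i.e., one "token" each, ready to be appended to the text sequence.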
Attention Projection. We use one-layer self-attention to project nodes and edges into embeddings, as shown in the Attn Projection part of Figure 2. The edge embedding is calculated as

x_e = Att_E([x_{add}; x_{u_1}, ..., x_{u_{n_u}}; x_{r_1}, ..., x_{r_{n_r}}; x_{v_1}, ..., x_{v_{n_v}}])[0],

and similarly the node embedding as

x_n = Att_N([x_{add}; x_{n^1}, ..., x_{n^{n_n}}])[0],

where Att_E and Att_N each denote one self-attention layer, for edges and nodes respectively, x_{add} is an additional prepended token, and [0] takes the first (additional) token embedding from the self-attention output as the final embedding.
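The attention projection can be sketched as a single-head self-attention layer over the component tokens, with a prepended additional token whose output is read out. This is a simplified illustration with toy dimensions; the zero-initialized extra token and single head are assumptions, not the paper's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) / d for _ in range(3))

def attn_project(token_embs):
    """One self-attention layer over [additional token; component tokens];
    the additional (first) token's output is the final node/edge embedding."""
    x = np.vstack([np.zeros((1, d)), token_embs])  # prepend the extra token
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)    # numerically stable softmax
    att = np.exp(scores)
    att /= att.sum(axis=1, keepdims=True)
    return (att @ v)[0]                            # first-token readout

x_e = attn_project(rng.normal(size=(6, d)))  # e.g. tokens of (u, r, v)
```

Unlike mean-pooling, the attention weights let the projection emphasize the most informative internal tokens of a node or edge.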
We only modify the input embeddings from X = X_T to X = [X_T, X_N, X_E]. All other details of the models, such as the transformer architecture and the training paradigm, are kept the same as the baselines. Our models can thus directly use PLMs to encode AMR graphs, without incompatibility between GNN and PLM parameters.

Data
We choose two representative open-domain QA datasets, Natural Questions (NQ) and TriviaQA (TQ), for experiments. Data details are presented in Appendix Table 9.
Since retrieval results have a large impact on the performance of downstream reranking and reading, we follow Izacard and Grave (2020b) and Yu et al. (2022) in fixing the retrieval results for each experiment, so that the reranking and reading results are comparable across models. In particular, we use the DPR model initialized with the parameters of Izacard and Grave (2020a) to retrieve 100 passages for each question. We then rerank them into 10 passages, i.e., N_1 = 100, N_2 = 10.

Models Details
We choose the BART model as the reranker baseline and the FiD model (implemented on the T5 model (Raffel et al., 2020)) as the reader baseline, and apply the GST method to both. For each model in this work, we use its Large checkpoint, i.e., BART-large and FiD-large, for reranking and reading, respectively. In the reranking process, we evaluate the model on the dev set every epoch and use Top10 as the pivot metric to select the best-performing checkpoint for testing. For reading, we evaluate the model every 10000 steps and use Exact Match as the pivot metric. For training rerankers, we set the number of positive passages to 1 and the number of negative passages to 7. We run experiments on 2 Tesla A100 80G GPUs.

Metric
Following Glass et al. (2022) and Izacard and Grave (2020b), we use Top-N to measure reranking performance and Exact Match for reading performance. However, Top-N cannot reflect the overall ranking of all positive passages, so we also adopt two further metrics, Mean Reciprocal Rank (MRR) and Mean Hits@10 (MHits@10). MRR is the mean reciprocal rank of all positive passages; higher scores indicate that the positive passages are ranked higher overall. MHits@10 is the percentage of positive passages that are ranked in the Top10; higher scores indicate that more positive passages appear in the Top10. Their formulations are in Appendix Section A.5. Note that the MRR and MHits@10 metrics are comparable only when the retrieved data is exactly the same.
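A minimal sketch of the two ranking metrics, computed per question over the 1-based ranks of its positive passages (the paper's exact formulations are in its Appendix A.5; this follows the standard definitions):

```python
def mrr(positive_ranks):
    """Mean reciprocal rank over all positive passages (1-based ranks)."""
    return sum(1.0 / r for r in positive_ranks) / len(positive_ranks)

def mhits_at_k(positive_ranks, k=10):
    """Fraction of positive passages ranked within the top k."""
    return sum(r <= k for r in positive_ranks) / len(positive_ranks)
```

For example, positives at ranks 1 and 4 give MRR = (1 + 0.25) / 2 = 0.625, and positives at ranks 1, 4 and 12 give MHits@10 = 2/3.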

Preliminary Experiments
We present the reranking performance of four baseline PLMs, BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), ELECTRA (Clark et al., 2020) and BART (Lewis et al., 2020a), on NQ and TQ in Appendix Table 8. BART outperforms the other three models on every metric on both NQ and TQ, so we choose it as the reranker baseline and apply our Graph-aS-Token method to it in the following reranking experiments.

Main Results
The main results are presented in Table 1. Our method effectively boosts performance on both reranking and reading.
Overall, our GST models achieve up to 2.44/3.17 Exact Match improvements on NQ/TQ.

Reranking. As shown in the reranking columns of Table 1, BART-GST-M achieves 80.0/83.7 in Top5/Top10 on NQ-test, an improvement of 5.4/3.4 over DPR and 1.4/0.4 over BART-reranker. BART-GST-A achieves 79.3/83.3 in Top5/Top10, outperforming DPR by 4.7/3.0 on NQ-test, showing that our GST method is effective.
The overall reranking results also explain why, even when the Top10 results are similar and the readers are the same, passages reranked by BART-GST lead to better reading performance. For example, on NQ test, the reading performance of 'BART-GST-M + FiD' is 0.80 higher than that of 'BART-reranker + FiD'.

Analysis
Robustness. To evaluate the robustness of the baseline and our models, we paraphrase the test questions of NQ and TQ and evaluate both the paraphrased and the original test questions with the same model checkpoint. We use a widely-used paraphraser, Parrot (Damodaran, 2021), to paraphrase the test questions. The results are shown in Table 3. The performance drops in reranking and reading of our GST models are smaller than those of the baselines, even though our models start from higher performance. For reranking, the drop of our BART-GST-A is -1.9/-1.3/-1.4/-2.1 for Top5/Top10/MRR/MHits@10, smaller than the baseline's -2.6/-1.5/-1.8/-2.2. For reading, the -3.21 EM drop of FiD-GST-M is also smaller than the -3.90 of the baseline FiD. This shows that our GST method improves not only performance but also robustness, supporting the claim that structural information helps models resist surface-form variations of the same question.
Comparison with FiD-100. We also compare the reranking+reading paradigm with the directly-reading paradigm. For the latter, the FiD reader is directly trained and evaluated on 100 retrieved passages without reranking. The results are shown in Table 4.
To our knowledge, we are the first to make FiD-10 beat FiD-100.
Influence of AMR Quality. We explore how the quality of AMR graphs influences the performance of our models by using AMRBART-base-finetuned-AMR3.0-AMRParsing, a smaller version of the parser. We compare the reranking performance of BART-GST with either superior or inferior graphs on NQ and TQ, using each kind of graph to train its own reranking model. The results are shown in Table 5.
Our models still work with inferior AMR graphs, but the performance is not as good as with the superior ones in both reranking and reading. This indicates that the higher the quality of the AMR graphs, the better the performance the GST models can potentially achieve.
Ablation on Nodes/Edges. We ablate nodes and edges in our models to explore which contributes more to the results, conducting reranking experiments on NQ. The results are shown in Table 6. Nodes and edges are both useful for the GST method: 'BART-GST-M (only nodes)' and 'BART-GST-M (only edges)' both outperform the baseline BART-reranker in MRR/MHits@10 on NQ test (24.2/48.7 and 24.7/47.4 vs 23.3/45.8, respectively). However, 'BART-GST-M (only edges)' is better than 'BART-GST-M (only nodes)' in four metrics on NQ, partly because edges also contain node information.
Case Study. We present two cases from our experiments in Figure 3. In the upper one, for the negative passage, the baseline may treat "a ban on smoking in all closed public areas" as equivalent to "the smoking ban in public places", which are actually different; for the positive passage, the baseline may not recognize "act regulated smoking in public areas" as "the smoking ban in public places", while our model does.
In the lower one, the baseline reader ignores that the competition is "for the opportunity to play in Super Bowl" rather than "in the Super Bowl", and because there are more passages similar to the "Philadelphia Eagles" passage than to the positive passage, the baseline reader relies on the incorrect passage, which leads to the incorrect answer. In contrast, our model focuses on the only positive passage and answers the question correctly.

Alternative Graph Methods
We have also tried several other methods to integrate AMRs into PLMs, but their performance is worse than our Graph-aS-Token method. Here we take two representative examples: Relational Graph Convolution Network (RGCN) (Schlichtkrull et al., 2018) for the reranker and Graph-transformer (Yun et al., 2019) for FiD. Both methods require alignments between text tokens and graph nodes, and only some nodes can be successfully aligned.

Stacking RGCN above Transformer
The model architecture consists of a transformer encoder and an RGCN, with the RGCN stacked on top of the transformer. After the vanilla forward pass of the transformer encoder, AMR graphs parsed from questions and passages in advance are constructed with node embeddings initialized from the transformer output. They are then fed into the RGCN, and the final output of the [CLS] node is used for scoring.
For the text embeddings X_qp of one question-passage pair, the encoder hidden states are H = Encoder(X_qp). For one node n, its initial embedding is h_n^0 = MeanPooling(H_{start:end}), where start and end are the start and end positions of the text span aligned with the node.
The node embedding update at each layer l is

h_i^{l+1} = σ( W_0^l h_i^l + Σ_{r∈R} Σ_{j∈N_i^r} (1 / c_{i,r}) W_r^l h_j^l ),  with c_{i,r} = |N_i^r|,

where R is the set of edge types and N_i^r is the set of nodes connected to node i under relation r. The correlation score of q and p is then computed from the final-layer embedding of the [CLS] node. The results are presented in Table 7; the RGCN-stacking method is clearly inferior to the GST method. Several metrics of RGCN-stacking, including Top5, Top10 and MRR, are even worse than the baseline, suggesting that the RGCN method, although it looks reasonable and practical, is not effective for integrating AMRs into PLMs.
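The R-GCN update can be sketched as follows. This is a minimal NumPy illustration of the standard Schlichtkrull et al. (2018) layer with mean normalization (c_{i,r} = |N_i^r|) and a ReLU nonlinearity; the toy relations and identity weight matrices are assumptions for demonstration only.

```python
import numpy as np

def rgcn_layer(H, typed_edges, W_r, W0):
    """One R-GCN layer:
    h_i' = ReLU( W0 h_i + sum_r sum_{j in N_i^r} W_r h_j / c_{i,r} )."""
    out = H @ W0.T                       # self-loop term W0 h_i
    for i in range(H.shape[0]):
        for r, Wr in W_r.items():
            # incoming neighbors of node i under relation r
            nbrs = [u for (u, rel, v) in typed_edges if v == i and rel == r]
            if nbrs:                     # c_{i,r} = |N_i^r|
                out[i] += sum(H[j] @ Wr.T for j in nbrs) / len(nbrs)
    return np.maximum(out, 0.0)          # ReLU

rng = np.random.default_rng(3)
d = 4
H0 = rng.normal(size=(3, d))             # 3 nodes, initialized from the PLM
edges = [(0, "ARG0", 1), (2, "ARG1", 1)] # typed (head, relation, tail) edges
W_r = {"ARG0": np.eye(d), "ARG1": np.eye(d)}
H1 = rgcn_layer(H0, edges, W_r, np.eye(d))
```

Note that these relation-specific weight matrices W_r are exactly the newly initialized parameters that, per the discussion above, sit awkwardly beside the pretrained PLM parameters.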
Graph-transformer. We apply the Graph-transformer architecture to the FiD model for reading, following Bai et al. (2021), whose main idea is to use AMR information to modify the self-attention scores between text tokens. However, we find training challenging for PLMs because the newly initialized graph components are not compatible with the PLM architecture, leading to non-convergence during training. Although tricks such as incremental training and separate tuning can yield convergence, the results are still below the baseline model, let alone GST.
Flattening AMR Graphs. We have also tried to directly flatten AMR graphs into text sequences, but the resulting sequences are usually beyond the maximum processing length (1024) of the transformer. We therefore have to cut off some nodes and edges to fit the input into the transformer, but the results show that this does not work well: it yields only a very slight improvement while the computational cost is tens of times that of the baseline.

Conclusion
In this study, we incorporated Abstract Meaning Representation (AMR) into Open-Domain Question Answering (ODQA) by employing a Graph-aS-Token (GST) method to combine AMRs with pretrained language models. Reranking and reading experiments on the Natural Questions and TriviaQA datasets demonstrate that our approach notably enhances the performance and robustness of Pretrained Language Models (PLMs) in ODQA.
"Leading Goose" R&D Program of Zhejiang under Grant Number 2022SDXHDX0003.

Limitations
Our Graph-aS-Token (GST) method increases time and GPU memory costs; we provide a quantitative analysis in Appendix Section A.4. We train the models with only one random seed. We do not conduct extensive hyper-parameter tuning experiments, but use a fixed set of hyper-parameters to make the baseline and our models comparable.
[Figure 1 content] Question: "When did Stephen Curry win the MVP award?" Gold answer: 2014-15. Gold passage: "Stephen Curry... In 2014-15, Curry won the NBA Most Valuable Player Award and led the Warriors to their first championship since 1975..." Confusing passage: "...Dirk Nowitzki of Germany are the only MVP winners considered 'international players' by the NBA. Stephen Curry in 2015-16 is the only player to have won the award unanimously..." An AMR graph is parsed from the question and the blue passage.

Figure 1 :
Figure 1: An example from our experiments. The top-middle square contains the question and the gold-standard answer. The middle section shows a confusing passage with an incorrect answer generated by the baseline model and ranked first by the baseline reranker. The bottom-middle section presents a passage with the gold-standard answer, which is ranked within the top ten by our reranker but not by the baseline. Important information is highlighted.

Figure 2 :
Figure 2: The structure of our Graph-aS-Token method. The input consists of the text and the AMR graph of one passage; the output is a unified embedding sequence.
[Figure 3 content] Case A (reranking): Question: "When did the smoking ban in public places start?" Gold answer: 1995. Positive passage: "...Act in 1993 and started implementing the act in 1995. The act regulated smoking in public areas and prohibited tobacco sales to people under the age of 16..." Negative passage: "Smoking ban ... the consequences of smoking that introduced a ban on smoking in all closed public areas... took effect on 1 June 2013. At first smoking ban abusers were not fined..." The baseline ranker does not rank the positive passage into the Top10, while our model does. Case B (reading): Negative passage: "...went on to the NFC Championship for the opportunity to play in Super Bowl LII in their own stadium, only to lose 38-7 to the eventual Super Bowl champion Philadelphia Eagles..." Positive passage: "...he was unable to lead the team to victory in the Super Bowl, as the Vikings lost 23-7 to the Kansas City Chiefs..." Both passages are ranked into the Top10; the baseline reader answers from the wrong passage, while our model answers correctly.

Figure 3 :
Figure 3: Two cases from our experiments, for reranking and reading respectively. We highlight important information in questions and passages.

Table 1 :
Reranking and reading results on the dev/test sets of NQ and TQ. In each cell, the left number is on the dev set and the right on the test set. The BART/FiD models with GST-M/A in the first column are equipped with AMR graphs via the GST method; -M indicates MLP projection and -A attention projection.

Table 2 :
Overall reranking results on NQ and TQ. In each cell, the left number is on the dev set and the right on the test set.
Robustness of readers, with Exact Match as the metric. To avoid the influence of different reranking results, we use the same DPR results for training and evaluation.

Table 3 :
Robustness of rerankers and readers. We conduct experiments on NQ. Orig Test denotes the original test questions, while New Test denotes the paraphrased test questions. Drop is the difference from the original test to the paraphrased test; a smaller absolute value indicates better robustness.

Table 4 :
Reading experiments with and without reranking. The first two rows are trained/evaluated with DPR data, while the rest use reranked data.

Table 5 :
Influence of AMR graph quality: superior graphs are generated by a larger model and inferior graphs by a smaller model.

Table 6 :
Ablation of nodes and edges for our GST method on NQ. We choose BART-GST-M because it performs better on NQ.

Table 7 :
Comparison between the baseline, GST and RGCN-Stacking in reranking on NQ.