FactGraph: Evaluating Factuality in Summarization with Semantic Graph Representations

Despite recent improvements in abstractive summarization, most current approaches generate summaries that are not factually consistent with the source document, severely limiting the trust in them and their use in real-world applications. Recent works have shown promising improvements in factuality error identification using text or dependency arc entailments; however, they do not consider the entire semantic graph simultaneously. To this end, we propose FactGraph, a method that decomposes the document and the summary into structured meaning representations (MR), which are more suitable for factuality evaluation. MRs describe core semantic concepts and their relations, aggregating the main content of both document and summary in a canonical form and reducing data sparsity. FactGraph encodes such graphs using a graph encoder augmented with structure-aware adapters to capture interactions among the concepts based on the graph connectivity, along with text representations using an adapter-based text encoder. Experiments on different benchmarks for evaluating factuality show that FactGraph outperforms previous approaches by up to 15%. Furthermore, FactGraph improves performance on identifying content verifiability errors and better captures subsentence-level factual inconsistencies.


Introduction
Recent summarization approaches based on pretrained language models (LM) have established a new level of performance (Zhang et al., 2020), generating summaries that are grammatically fluent and capable of combining salient parts of the source document. However, current models suffer from a severe limitation: they generate summaries that are not factually consistent, that is, the content of the summary does not match the facts of the source document, an issue also known as hallucination. Previous studies (Cao et al., 2018; Falke et al., 2019; Maynez et al., 2020) report rates of hallucination in generated summaries ranging from 30% to over 70%. In the face of such a challenge, recent works employ promising ideas such as question answering (QA) (Durmus et al., 2020) and weakly supervised approaches (Kryscinski et al., 2020) to assess factuality. Another line of work explores dependency arc entailment to improve the localization of subsentence-level errors within generated summaries (Goyal and Durrett, 2020). However, these methods have a reduced correlation with human judgments and may not capture semantic errors well (Pagnoni et al., 2021). One reason for this poor performance is the lack of good-quality factuality training data. Second, it is challenging to properly encode the core semantic content of the document and summary and to reason over salient pieces of information in order to assess summary factuality. Third, previous work (DAE, Goyal and Durrett, 2021) treats semantic relations as isolated units, not simultaneously considering the entire semantic structure of both the document and summary texts.

* Work done as an intern at Amazon Alexa AI.
1 Our code will be publicly available at https://github.com/amazon-research/fact-graph
To mitigate the above issues, we explore meaning representations (MR) as a form of content representation for factuality evaluation. We present FACTGRAPH, a novel graph-enhanced approach that incorporates core information from the document and the summary into the factuality model using graph-based MRs, which are more suitable for factuality evaluation. As shown in Figure 1, graph-based MRs capture semantic relations between entities, abstracting away from syntactic structure and producing a canonical representation of meaning.
Different from previous methods (Kryscinski et al., 2020; Goyal and Durrett, 2021), FACTGRAPH is a dual approach which encodes both text and graph modalities, better integrating linguistic knowledge and structured semantic knowledge. As shown in Figure 2, it is composed of parameter-efficient text and graph encoders which share the same pretrained model and differ by their adapter weights (Houlsby et al., 2019). The texts from the document and summary are encoded using the adapter-based text encoder, whereas the semantic structures that represent document and summary facts are used as input to the graph encoder augmented with structure-aware adapters (Ribeiro et al., 2021b). The representations of the two modalities are then combined to generate the factuality score.
Intuitively, AMR provides important benefits. First, it encodes core concepts, as it strives for a more logical and less syntactic representation, which has been shown to benefit text summarization (Hardy and Vlachos, 2018; Dohare et al., 2018). Furthermore, AMR captures semantics at a high level of abstraction, explicitly modeling relations in the text and reducing the negative influence of diverse surface variations with the same meaning. Lastly, recent studies (Ladhak et al., 2021) demonstrate that there is a trade-off between factuality and abstractiveness. Structured semantic representations are potentially beneficial for reducing data sparsity and localizing generation errors in abstractive scenarios. Figure 1 shows examples of (c) document and (d) summary AMRs, where the summary AMR is missing a crucial modifying node present in the document AMR, which indicates a factual error in the summary.
We consolidate a factuality dataset with human annotations derived from previous works (Kryscinski et al., 2020; Maynez et al., 2020; Pagnoni et al., 2021). This dataset is constructed from the widely-used CNN/DM (Hermann et al., 2015) and XSum (Narayan et al., 2018) benchmarks. Extensive experimental results demonstrate that FACTGRAPH achieves substantial improvements over previous approaches, improving factuality performance by up to 15% and correlation with human judgments by up to 10%, capturing more content verifiability errors and better classifying factuality in semantic relations.

Related Work
Evaluating Factuality. Recently, there has been a surge of new methods for factuality evaluation in text generation, especially for summarization. Falke et al. (2019) propose to rerank summary hypotheses generated via beam search based on their entailment scores with the source document. Kryscinski et al. (2020) introduce FACTCC, a model-based approach trained on artificially generated data, which measures whether the summary can be entailed by the source document in order to assess summary factuality. QA-based methods (Durmus et al., 2020; Honovich et al., 2021) generate questions from the document and summary, and compare the corresponding answers in order to assess factuality. Xie et al. (2021) formulate causal relationships among the document, the summary, and the language prior, evaluating factuality via counterfactual estimation.
Categorizing Factual Errors. A thread of analysis work has focused on identifying different categories of factual errors in summarization. Maynez et al. (2020) show that semantic inference-based automatic measures are better indicators of summarization quality, whereas Pagnoni et al. (2021) propose a linguistically grounded typology of factual errors and develop a fine-grained benchmark for factuality evaluation, moving from binary to fine-grained evaluation. Fabbri et al. (2021) introduce different resources for summarization evaluation, including a toolkit for evaluating summarization models.
Factuality versus Abstractiveness. Recent works (Ladhak et al., 2021) investigate the trade-off between factuality and abstractiveness of summaries and observe that factuality tends to drop with increased abstractiveness. Semantic graphs are uniquely suitable for detecting factual errors in abstractive summaries, as they abstract away from the lexical surface forms of documents and summaries, enabling direct comparisons of the underlying semantic concepts and relations of a document-summary pair.
Graph-based Representations for Summarization. A growing body of work focuses on using graph-based representations for improving summarization. Whereas different approaches encode graphs into neural models for multi-document summarization (Fan et al., 2019; Li et al., 2020; Pasunuru et al., 2021; Chen et al., 2021), AMR structures have been shown to benefit both document representation and summary generation (Liu et al., 2015; Liao et al., 2018; Hardy and Vlachos, 2018; Dohare et al., 2018) and have the potential of improving controllability in summarization. The above works are related to FACTGRAPH in that they use semantic graphs for content representation, but they utilize graphs for the downstream summarization task, whereas FACTGRAPH employs them for factuality evaluation.
Semantic Representations for Factuality Evaluation. Most closely related to our work, Goodrich et al. (2019) extract tuples from the document and summary and measure factual consistency with overlap metrics. More recently, dependency arc entailment (DAE, Goyal and Durrett, 2020) has been used to measure subsentence-level factuality by classifying pairs of words defined by dependency arcs, which often describe semantic relations. However, FACTGRAPH is considerably different from these approaches, since it explicitly encodes the entire semantic graph structure into the model. Moreover, while DAE considers semantic edge relations of the summary only, FACTGRAPH encodes the semantic structures of both the input document and the summary, leading to better factuality performance at both sentence and subsentence levels.

FACTGRAPH Model
We introduce FACTGRAPH, a method that employs semantic graph representations for factuality evaluation in text summarization, describing its intuition (§3.3) and defining it formally (§3.4).

Problem Statement
Given a source document D and a sentence-level summary S, we aim to check whether S is factual with respect to D. For each sentence d ∈ D we extract a semantic graph G d . Similarly, for the summary sentence S we extract its semantic graph G s . We use texts and graphs from both document and summary for factuality evaluation. Sentence-level summary predictions can be aggregated to generate a factuality score for a multi-sentence summary.
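As a concrete illustration of the aggregation step, a summary-level score can be as simple as the average of per-sentence binary labels (this is the scheme later used for the correlation experiments; the function name below is ours):

```python
def summary_factuality_score(sentence_labels):
    """Aggregate per-sentence binary factuality labels (1 = factual,
    0 = non-factual) into a summary-level score by averaging."""
    if not sentence_labels:
        raise ValueError("summary has no sentences")
    return sum(sentence_labels) / len(sentence_labels)

# A three-sentence summary with one non-factual sentence:
score = summary_factuality_score([1, 1, 0])
```

Other aggregations (e.g., a summary is factual only if all sentences are) are equally easy to plug in.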

Extracting Semantic Graphs
We select AMR as our MR, but FACTGRAPH can be used with other graph-based semantic representations, such as OpenIE (Banko et al., 2007). AMR is a linguistically grounded semantic formalism that represents the meaning of a sentence as a rooted graph, where nodes are concepts and edges are semantic relations. AMR abstracts away from surface text, aiming to produce a more language-neutral representation of meaning. We use a state-of-the-art AMR parser (Bevilacqua et al., 2021) to extract an AMR graph G_a = (V_a, E_a, R_a) with a node set V_a and labeled edges (u, r, v) ∈ E_a, where u, v ∈ V_a and r ∈ R_a is a relation type. Each G_a aims to explicitly represent the core concepts in each sentence. Figure 1 shows an example of a (b) sentence and its (d) corresponding AMR graph. Following Beck et al. (2018), we convert each labeled edge into an additional relation node, transforming the graph into its unlabeled version. Pretrained models typically use a vocabulary with subword tokens, which makes it complicated to properly represent a graph using subword tokens as nodes. Inspired by Ribeiro et al. (2021b), when node labels are split into subword tokens, we expand each edge into a set of edges and connect every token of u to every token of v.
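The preprocessing just described can be sketched as follows; splitting labels on "-" is a toy stand-in for the pretrained model's subword tokenizer, and the function name is ours:

```python
def to_token_graph(nodes, edges, tokenize):
    """Sketch of the graph preprocessing above: each labeled edge
    (u, r, v) becomes an extra relation node r (yielding an unlabeled
    graph), node labels are split into (sub)tokens, and every token of
    a source node is connected to every token of the target node."""
    token_nodes, token_edges = [], []
    tokens_of = {}

    def add_node(label):
        idxs = []
        for tok in tokenize(label):
            idxs.append(len(token_nodes))
            token_nodes.append(tok)
        return idxs

    for n in nodes:
        tokens_of[n] = add_node(n)
    for u, r, v in edges:
        r_idxs = add_node(r)  # the relation label becomes a node
        token_edges += [(a, b) for a in tokens_of[u] for b in r_idxs]
        token_edges += [(b, c) for b in r_idxs for c in tokens_of[v]]
    return token_nodes, token_edges

# Toy example with a single AMR triple:
tok_nodes, tok_edges = to_token_graph(
    ["appeal-01", "police"], [("appeal-01", ":ARG0", "police")],
    lambda s: s.split("-"))
```

The resulting token-level graph is what the structure-aware graph encoder consumes.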

Intuition of Semantic Representation
In order to represent facts to better assess summary factuality, we draw inspiration from traditional approaches to summarization that condense the source document to a set of "semantic units" (Liu et al., 2015; Liao et al., 2018). Intuitively, the semantic graphs from the source document represent the core factual information, explicitly modeling relations in the text, whereas the semantic summary graph captures the essential content of the summary (Lee et al., 2021). The document graphs can be compared with the summary graph, measuring the degree of semantic overlap to assess factuality.
Recently, sets of fact triples from summaries were used to estimate factual accuracy (Goodrich et al., 2019). That approach is related to FACTGRAPH in that it uses graph-based MRs, but it compares the reference and the generated summary, whereas we compare the generated summary with the input document. Moreover, differently from Goodrich et al. (2019), FACTGRAPH explicitly encodes the semantic structures using a graph encoder and employs AMR as the semantic representation. Finally, in contrast to DAE (Goyal and Durrett, 2021), which extracts graph representations of the summary only, FACTGRAPH uses semantic graphs for both document and summary.

Figure 2 illustrates FACTGRAPH, which is composed of text and graph encoders. The text encoder, denoted by E_t, uses a pretrained encoder E augmented with adapter modules; it receives the summary S and document D and outputs a contextual text representation. Conversely, the graph encoder, denoted by E_g, uses the same E, but is augmented with structure-aware adapters. E_g receives the summary graph and the multiple document semantic graphs corresponding to the document sentences, and outputs graph-aware contextual representations that are used to generate the final graph representation. During training, only the adapter weights are trained, whereas the weights of E are kept frozen. Finally, the graph and text representations are concatenated and fed to a final classifier, which predicts whether the summary is factual or not.

Model
Text Encoder. We employ an adapter module before and after the feed-forward sub-layer of each layer of the encoder, modifying the adapter architecture from Houlsby et al. (2019). Given the representation h^l_i of token i at layer l, the adapter computes:

$$\hat{h}^l_i = W^l_o\,\sigma\big(W^l_p\,\mathrm{LN}(h^l_i)\big) + h^l_i,$$

where σ is the activation function and LN(·) denotes layer normalization. W^l_o ∈ R^{d×m} and W^l_p ∈ R^{m×d} are adapter parameters. The representation of the [CLS] token is used as the final textual representation, denoted by t.
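A minimal NumPy sketch of this bottleneck adapter (ReLU stands in for σ, and the tiny random weights are only for illustration; in the model these are the trained parameters):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Layer normalization over a single token vector."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def adapter(h, W_p, W_o):
    """Bottleneck adapter sketch: normalize, project the d-dim state
    down to m dims, apply a nonlinearity, project back up, and add a
    residual connection. Only W_p and W_o would be trained."""
    z = np.maximum(0.0, W_p @ layer_norm(h))  # down-projection, R^m
    return W_o @ z + h                        # up-projection + residual

d, m = 8, 2  # m << d keeps the adapter parameter-efficient
rng = np.random.default_rng(0)
h = rng.standard_normal(d)
out = adapter(h, rng.standard_normal((m, d)) * 0.01,
                 rng.standard_normal((d, m)) * 0.01)
```

With small weights the output stays close to the input, which is the usual near-identity initialization for adapters.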
Graph Encoder. In order to re-purpose the pretrained encoder for structured inputs, we employ a structural adapter (Ribeiro et al., 2021b). In particular, for each node v ∈ V, given the hidden representation h^l_v, the encoder layer l computes:

$$\hat{h}^l_v = W^l_e\,\sigma\big(\mathrm{GraphConv}^l(\mathrm{LN}(h^l_v), \mathcal{N}(v))\big) + h^l_v,$$

where N(v) is the neighborhood of node v in G and W^l_e ∈ R^{d×m} is an adapter parameter. GraphConv^l(·) is the graph convolution that computes the representation of v based on its neighbors in the graph. We employ a Relational Graph Convolutional Network (Schlichtkrull et al., 2018) as the graph convolution, which distinguishes incoming from outgoing relations. Since AMRs are directed graphs, encoding edge directions is beneficial for downstream performance. The structural adapter is placed before the feed-forward sub-layer of each encoder layer, whereas the standard adapter is kept after it.
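The structural adapter can be sketched analogously; the direction-aware convolution below is a simplified stand-in for the relational graph convolution (one weight matrix per edge direction, no self-loops or basis decomposition), and all names are ours:

```python
import numpy as np

def graph_conv(H, edges, W_in, W_out):
    """Direction-aware graph convolution sketch: each node aggregates
    messages from incoming and outgoing neighbors with separate weight
    matrices, since AMR edge direction is meaningful."""
    M = np.zeros((H.shape[0], W_in.shape[0]))
    for u, v in edges:
        M[v] += W_in @ H[u]   # message along the edge direction
        M[u] += W_out @ H[v]  # message against the edge direction
    return M

def structural_adapter(H, edges, W_in, W_out, W_e):
    """Structural adapter sketch: graph convolution over normalized
    states, a nonlinearity, an up-projection W_e, and a residual."""
    LN = (H - H.mean(axis=1, keepdims=True)) / H.std(axis=1, keepdims=True)
    Z = np.maximum(0.0, graph_conv(LN, edges, W_in, W_out))  # (n, m)
    return Z @ W_e.T + H                                     # W_e in R^{d x m}

rng = np.random.default_rng(1)
n, d, m = 3, 6, 3
H = rng.standard_normal((n, d))
out = structural_adapter(H, [(0, 1)],
                         rng.standard_normal((m, d)),
                         rng.standard_normal((m, d)),
                         rng.standard_normal((d, m)))
```

Note that a node with no edges receives no messages and passes through unchanged via the residual.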
We calculate the final representation z_G of each graph G by pooling its node representations. We then use a multi-head self-attention (Vaswani et al., 2017) layer to estimate to what extent each sentence graph contributes to the document semantic representation based on the summary graph. This mechanism allows encoding a global document representation conditioned on the summary graph. In particular, each attention head computes:

$$g = \sum_{i=1}^{k} \alpha_i\, W_r\, z_{G^d_i}, \qquad \alpha_i = \underset{i}{\mathrm{softmax}}\big(z_{G^s} \cdot z_{G^d_i}\big),$$

where z_{G^s} is the final representation of G^s, k is the number of considered sentence graphs from the input document, and W_r ∈ R^{d×d} is a parameter.
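A single-head sketch of this pooling step (assumptions: mean pooling already produced the graph vectors, and scaled dot-product scores; the model itself uses multi-head attention):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def pool_document_graphs(z_summary, Z_docs, W_r):
    """The summary-graph representation queries the k document
    sentence-graph representations; their attention-weighted sum forms
    the global document graph representation g."""
    d = z_summary.shape[0]
    scores = Z_docs @ z_summary / np.sqrt(d)  # one score per sentence graph
    alpha = softmax(scores)                   # attention over k graphs
    return alpha @ (Z_docs @ W_r.T)           # weighted sum of projections

# Sanity check: identical document graphs and an identity projection
# must return that shared representation.
v = np.array([1.0, 2.0, 3.0, 4.0])
g = pool_document_graphs(np.array([0.5, 0.1, -0.2, 0.3]),
                         np.tile(v, (3, 1)), np.eye(4))
```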
The final representation is derived from the text and graph representations, q = [t; g], and fed into a classification layer that outputs a probability distribution over the labels y = {Factual, Non-Factual}.

Edge-level Factuality Model
Inspired by Goyal and Durrett (2021), we evaluate factuality at the edge level. In this setup, we use the same text and graph encoders; however, we encode the semantic graphs differently. In particular, we concatenate G^s with each G^d ∈ D and feed the concatenation to the graph encoder. The representation of a node v ∈ G^s is calculated as:

$$r_v = \big[\, r_{t_v} \,;\, r_{g_v} \big], \qquad r_{t_v} = \frac{1}{|A(v)|} \sum_{w \in A(v)} r_w,$$

where A(v) is the set of all summary words aligned with v, and r_{t_v} and r_{g_v} are the word and node representations, respectively. The representation r_e of an edge is obtained by concatenating the representations of the two nodes it connects. r_e is fed into a classification layer that outputs a probability distribution over the output labels (y_e = {Factual, Non-Factual}). We assign the label non-factual to an edge in G^s if one of the nodes in this edge is aligned with a word that belongs to a span annotated as non-factual; otherwise, the edge is assigned the label factual. We call this variant FACTGRAPH-E.
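The edge-label derivation just described can be sketched in a few lines; the data structures (a node-to-word-index alignment map and word-index spans) are our own simplification:

```python
def label_summary_edges(edges, alignments, nonfactual_spans):
    """Derive edge-level training labels: an edge of the summary AMR
    graph is non-factual if either endpoint node aligns to a word
    position inside a span annotated as non-factual. `alignments` maps
    node -> set of word indices; spans are (start, end), end exclusive."""
    bad_words = set()
    for start, end in nonfactual_spans:
        bad_words.update(range(start, end))
    labels = {}
    for u, v in edges:
        aligned = alignments.get(u, set()) | alignments.get(v, set())
        labels[(u, v)] = "non-factual" if aligned & bad_words else "factual"
    return labels

# "missing for six years" with "six years" (words 5-6) marked non-factual:
labels = label_summary_edges(
    edges=[("miss-01", "temporal-quantity"), ("miss-01", "woman")],
    alignments={"miss-01": {4}, "temporal-quantity": {5, 6}, "woman": {3}},
    nonfactual_spans=[(5, 7)],
)
```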

Data
One of the main challenges in developing models for factuality evaluation is the lack of training data. Existing synthetic data generation approaches are not well-suited to factuality evaluation of current summarization models, and human-annotated data can improve factuality models (Goyal and Durrett, 2021). In order to have a more effective training signal, we gather human annotations from different sources and consolidate a factuality dataset that can be used to train FACTGRAPH and other models. The source collections of the dataset are presented in Table 1. The dataset comprises two parts, namely CNN/DM (Hermann et al., 2015) and XSum (Narayan et al., 2018). CNN/DM contains news articles from two providers, CNN and DailyMail, while XSum contains BBC articles. CNN/DM has considerably lower levels of abstraction, and its summaries exhibit high overlap with the articles; a typical CNN/DM summary consists of several bullet points. In XSum, the first sentence of an article is removed and used as the summary, making it highly abstractive. After removing duplicated annotations, the total number of datapoints is 9,567, which we divide into train (8,667), dev (300) and test (600) sets. We call this dataset FACTCOLLECT.

Method Details
Selecting the Document Semantic Graphs. We limit the number of considered document graphs for efficiency reasons. In particular, we compute the pairwise cosine similarity between the embeddings of each sentence d ∈ D and the summary sentence S, generated by Sentence Transformers (Reimers and Gurevych, 2019). We then select the k sentences from the source document with the highest scores to generate the document semantic graphs.

The model weights are initialized with ELECTRA (electra-base discriminator, 110M parameters, Clark et al., 2020), the structural adapters are pretrained using release 3.0 of the AMR corpus, containing 55,635 gold-annotated AMR graphs, and the text adapters are pretrained using synthetically generated data. The adapters' hidden dimension is 32, which corresponds to about 1.4% of the parameters of the original ELECTRA encoder. The number of considered document graphs (k) is 5. We report the test results when the balanced accuracy (BACC) on the dev set is optimal. Following previous work (Kryscinski et al., 2020; Goyal and Durrett, 2021), we evaluate our models using BACC and Micro F1 scores.
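The sentence-selection step can be sketched as follows, assuming precomputed embedding vectors (the model uses Sentence Transformers embeddings; the function name is ours):

```python
import numpy as np

def select_top_k_sentences(doc_embs, summary_emb, k=5):
    """Rank document sentences by cosine similarity between their
    embeddings and the summary-sentence embedding; keep the top k."""
    doc = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    summ = summary_emb / np.linalg.norm(summary_emb)
    sims = doc @ summ                       # cosine similarity per sentence
    return sorted(np.argsort(-sims)[:k].tolist())

# Toy embeddings standing in for Sentence Transformers vectors:
doc_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
selected = select_top_k_sentences(doc_embs, np.array([1.0, 0.0]), k=2)
```

Only the selected sentences are parsed into AMR graphs, which keeps preprocessing and encoding cost bounded.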

Results and Analysis
We compare FACTGRAPH with different methods for factuality evaluation: two QA-based methods, namely QAGS and QUALS, and FACTCC (Kryscinski et al., 2020). We fine-tune FACTCC using the training set; that is, it is trained on both synthetic data and FACTCOLLECT. We call this approach FACTCC+.

Correlation with Human Judgments
We also evaluate model performance using correlations with human judgments of factuality (Pagnoni et al., 2021). In this experiment, FACTCC+ and FACTGRAPH are trained on the FACTCOLLECT data without the Pagnoni et al. (2021) subset, which is used as dev and test sets according to its split. For both models, following Pagnoni et al. (2021), we obtain a binary factuality label for each sentence and take the average of these labels as the final summary score. We use the official script to calculate the correlations.

AMR and Factuality. We investigate whether SMATCH, a metric that measures the degree of overlap between two AMRs, correlates with factuality judgments. We calculate the SMATCH score between every summary sentence graph and k document sentence graphs, with k ∈ {1, 3, 5}. We obtain one score per summary sentence by taking the maximum over its scores with the document sentence graphs, and then average over the summary sentence scores to obtain the summary-level score. We also calculate SMATCH between the generated summary and the reference summary graphs.

As shown in Table 3, SMATCH approaches have a small but consistent correlation, slightly improving over n-gram based metrics (e.g., METEOR and ROUGE-L) on CNN/DM, suggesting that AMR, which has a higher level of abstraction than plain text, may be a semantic representation alternative for content verification. QA-based approaches have higher correlation on the CNN/DM dataset than on XSum, where their correlation is relatively reduced, and DAE shows higher Spearman correlation than FACTCC on XSum. FACTCC+ and FACTGRAPH, which are trained on data from FACTCOLLECT, have an overall higher performance than models trained on synthetic data, such as FACTCC, again demonstrating the importance of the human-annotation signal when training factuality evaluation approaches. Finally, FACTGRAPH has the highest correlations on both datasets, with a large improvement on XSum, suggesting that representing facts as semantic graphs is effective for more abstractive summaries.

Table 4: Sentence-level BACC on human-annotated XSum generated summaries (Maynez et al., 2020).
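The summary-level SMATCH aggregation described above (maximum over the k document sentence graphs, then mean over summary sentences) can be sketched as follows; `pair_score` and the toy `overlap` scorer are stand-ins for the actual SMATCH implementation:

```python
def summary_smatch(summary_graphs, doc_graphs, pair_score):
    """For each summary-sentence graph, take the maximum pairwise score
    against the document sentence graphs, then average over summary
    sentences to obtain the summary-level score."""
    per_sentence = [max(pair_score(s, d) for d in doc_graphs)
                    for s in summary_graphs]
    return sum(per_sentence) / len(per_sentence)

def overlap(g1, g2):
    """Toy scorer: Jaccard overlap of triple sets (not real SMATCH)."""
    return len(set(g1) & set(g2)) / max(len(set(g1) | set(g2)), 1)

# One summary sentence fully matches a document graph, one matches none:
score = summary_smatch(
    [[("a", ":ARG0", "b")], [("c", ":ARG1", "d")]],
    [[("a", ":ARG0", "b")], [("x", ":ARG0", "y")]],
    overlap,
)
```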
Figure 3 shows the influence of the different types of factuality errors (Pagnoni et al., 2021) on each approach. Semantic Frame Errors are errors in a frame's predicate, core, or non-core frame elements. Discourse Errors extend beyond a single semantic frame, introducing erroneous links between discourse segments. Content Verifiability Errors capture cases where it is not possible to verify the summary against the source document due to the difficulty of aligning it to the source. Note that whereas BERTSCORE correlates strongly with content verifiability errors, as it is a token-level similarity metric, the other methods improve on Semantic Frame Errors. FACTGRAPH has the highest performance, suggesting that graph-based MRs are able to capture different semantic errors well. In particular, FACTGRAPH improves on capturing content verifiability errors by 48.2%, suggesting that representing facts using AMR is helpful.

Edge-level Factuality Classification
We assess factuality beyond the sentence level with FACTGRAPH-E (§3.5). We train and evaluate on the human-annotated XSum data from Maynez et al. (2020), which provides annotations at both the sentence and span level. We derive the edge labels required for FACTGRAPH-E training as follows: for each edge in the summary graph, if one of the nodes connected by this edge is aligned with a word that belongs to a span labeled as non-factual, the edge is annotated as non-factual. 7 Summary-level labels are obtained from edge-level predictions: if any edge in the summary graph is classified as non-factual, the summary is labeled as non-factual. We use the same splits as Goyal and Durrett (2021). 8 We compare FACTGRAPH-E with DAE and additionally with a sentence-level baseline (Goyal and Durrett, 2021) and FACTGRAPH. Table 4 shows that edge-level factuality classification gives better performance than sentence-level classification, and that FACTGRAPH performs better at both the sentence and edge classification levels. FACTGRAPH-E outperforms DAE, demonstrating that training on subsentence-level factuality annotations enables it to accurately predict edge-level factuality and derive summary-level factuality.
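The edge-to-summary aggregation rule admits a one-line sketch (function name ours):

```python
def summary_label_from_edges(edge_labels):
    """A summary is non-factual if any edge of its AMR graph is
    predicted non-factual; otherwise it is factual."""
    return "non-factual" if "non-factual" in edge_labels else "factual"

label = summary_label_from_edges(["factual", "non-factual", "factual"])
```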
Finally, while the semantic representations contribute to overall performance, extracting them adds some overhead in preprocessing time (and slightly more at inference time), as shown in Appendix B.

Model Ablations
In Table 5, we report an ablation study on the impact of FACTGRAPH's distinct components. First, note that encoding only the textual information leads to better performance than encoding only graphs. This is expected, since pretrained encoders are known for good performance in textual NLP tasks due to their transfer learning capabilities, and the full document text encodes more information than the selected k document graphs. Moreover, AMR representations abstract away aspects such as verb tenses, making the graphs agnostic to such fine-grained information. However, this is compensated in FACTGRAPH, which captures those details from the text modality. Future work can consider incorporating such information into the graph representation in order to improve factuality assessment. Ultimately, FACTGRAPH, which uses both document and summary graphs, gives the overall best performance, demonstrating that semantic graph representations complement the text representation and are beneficial for factuality evaluation. Table 6 shows the influence of the number of considered document graphs, measured on FACTCOLLECT's dev set. Note that more document graphs generally lead to better performance, with a peak at 5. This suggests that using all sentence graphs from the source document is not required for good performance. Moreover, the results indicate that our strategy of selecting document graphs by comparing contextual representations of the document sentences with the summary performs well in practice.

7 We use the JAMR aligner (Flanigan et al., 2014) to obtain node-to-word alignments.
8 We sample 100 datapoints from the training set as a dev set for hyperparameter search.

We additionally present the performance of FACTGRAPH with other semantic representations in Appendix C.

Comparison to Full Fine-tuning
FACTGRAPH trains only the adapter weights placed into each layer of both the text and graph encoders. We compare FACTGRAPH with a model of similar architecture, with both text and graph encoders, but without (structural) adapter layers, for which we fine-tune all model parameters. Table 7 shows that FACTGRAPH performs better even though it trains only 1.4% of the parameters of the fully fine-tuned model, suggesting that the structural adapters help to adapt the graph encoder to semantic graph representations.

FACTGRAPH-E computes factuality scores for each edge of the AMR summary graph, and those predictions are aggregated to generate a sentence-level label (§5.2). Alternatively, it is possible to identify specific inconsistencies in the generated summary based on the AMR graph structure. Such subsentence-level factuality information can provide deeper insights into the kinds of factual inconsistencies made by different summarization models (Maynez et al., 2020) and can supply text generation approaches with localized signals for training (Cao et al., 2020). Figure 4 shows a document, its generated summary "police have appealed for help in tracing a woman who has been missing for six years.", and the factuality edge predictions made by DAE and FACTGRAPH-E.

Figure 4: An example of a document, its generated summary, and factuality predictions for word pairs, based on the dependency graph (DAE) versus the AMR graph (FACTGRAPH-E). +/− indicates the predicted label for each edge.

First, note that since DAE uses dependency arcs and FACTGRAPH-E is based on AMR, the sets of edges in the two approaches, that is, the relations between nodes and hence words, are different. Second, both methods are able to detect the hallucination "six years", which is never mentioned in the source document. However, DAE does not recognize that "police appealed for help in tracing" is factual, whereas FACTGRAPH-E captures it. This piece of information corresponds to a span in the document with a very different but semantically related surface form (highlighted in bold in Figure 4). This poses a challenge to DAE, since it classifies semantic relations independently and only considers the text surface. In contrast, FACTGRAPH-E matches the summary against the document not only at the surface level but also at the semantic level. Appendix D presents the complete AMR and dependency summary graphs.

Conclusion
We presented FACTGRAPH, a graph-based approach that explicitly encodes facts using meaning representations to identify factual errors in generated text. We provided an extensive evaluation of our approach and showed that it significantly improves results on different factuality benchmarks for summarization, indicating that structured semantic representations are beneficial to factuality evaluation. Future work includes (i) exploring approaches to develop document-level semantic graphs (Naseem et al., 2021), (ii) developing an explainable graph-based component to highlight hallucinations, and (iii) combining different meaning representations in order to capture distinct semantic aspects.

Impact Statement
In this paper, we study the problem of detecting factual inconsistencies in summaries generated from input documents. The proposed models better consider the internal meaning structure of the text and could benefit general generation applications by evaluating the factual consistency of their output, helping make these systems more trustworthy. This work is built on semantic representations extracted with AMR parsers, so the quality of the parser used to generate the semantic representations can significantly impact the results of our models. We mitigate this risk by employing a state-of-the-art AMR parser.

Appendices
In this supplementary material, we detail experiment settings, additional model evaluations, and additional information about semantic graph representations.

A Details of Models and Hyperparameters
The experiments were executed using version 3.3.1 of the transformers library released by Hugging Face (Wolf et al., 2019). In Table 8, we report the hyperparameters used to train FACTGRAPH. We use the Adam optimizer (Kingma and Ba, 2015) and employ a linearly decreasing learning rate schedule without warm-up. Mean pooling is used to calculate the final representation of each graph.
Structural Adapters' Pretraining. The structural adapters are pretrained using AMR graphs from release 3.0 (LDC2020T02) of the AMR annotation corpus (Knight et al., 2020). Similarly to the masked language modeling objective, we perform self-supervised node-level prediction, where we randomly mask and classify AMR nodes. The goal of this pretraining phase is to capture domain-specific AMR knowledge by learning the regularities of the node and edge attributes distributed over the graph structure.
Text Adapters' Pretraining. The text adapters are pretrained using synthetically created data, which is generated by applying a series of rulebased transformations to the sentences of source documents (Kryscinski et al., 2020). The pretraining task is to classify each summary sentence as factual or non-factual. The goal of this pretraining phase is to learn suitable text representations to better identify whether summary sentences remain factually consistent to the input document after the transformation.
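As an illustration only, a rule-based transformation in the spirit of Kryscinski et al. (2020) might swap entities to create non-factual training examples; the function below and its heuristic capitalized-word entity detection are hypothetical, not the actual transformation set used for pretraining:

```python
import re

def make_negative_example(sentence, entity_pool):
    """Hypothetical rule-based transformation: swap one capitalized
    entity-like token for a different entity from the document,
    yielding a (likely) non-factual sentence labeled 0."""
    entities = re.findall(r"\b[A-Z][a-z]+\b", sentence)
    for ent in entities:
        for other in entity_pool:
            if other != ent:
                return sentence.replace(ent, other, 1), 0  # 0 = non-factual
    return sentence, 1  # no swap possible; keep as a factual example

swapped, label = make_negative_example(
    "Margaret was last seen in Inverkip.", ["Margaret", "Inverkip"])
```

Real transformation suites also include negation, pronoun swaps, and sentence-level noise.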

B Speed Comparison
FACTGRAPH encodes structured semantic representations of the facts in the document and summary. Despite their effectiveness, extracting semantic graphs such as AMR is computationally expensive, because current parsers employ encoder-decoder architectures based on Transformers and pretrained language models.
In this experiment, we compare the execution time of FACTGRAPH-E and DAE on a sample of 1,000 datapoints extracted from the XSum test set. To extract the semantic graphs, we investigate two AMR parsers: Parser1, a dual graph-sequence parser that iteratively refines an incrementally constructed graph (Cai and Lam, 2020), and Parser2, a linearized graph model that employs BART (Bevilacqua et al., 2021). The execution of the AMR parsers is parallelized across four Tesla V100 GPUs. We use Parser2 for the experiments in this paper, since it is the current state of the art in AMR parsing, although its preprocessing is slower than Parser1's.
As shown in Table 9, DAE's preprocessing is much faster than FACTGRAPH-E's, since DAE employs the fast enhanced dependency parser from the Stanford CoreNLP toolkit (Manning et al., 2014), which builds a parse with a linear-time scan over the words of a sentence. Finally, note that FACTGRAPH is slower than DAE at inference because it employs adapters and encodes both graphs and texts from the document and summary, whereas the DAE model encodes only the texts.

C Comparing Semantic Representations for Factuality Evaluation
OpenIE graph-based structures were used in order to improve factuality in abstractive summarization (Cao et al., 2018), whereas dependency arcs were shown to be beneficial for evaluating factuality (Goyal and Durrett, 2020). We thus investigate different graph-based meaning representations using FACTGRAPH. AMR is a more logical representation that models relations between core concepts, and has a rough alignment between nodes and spans in the text. Conversely, dependencies capture more fine-grained relations between words, and all words are mapped into nodes in the dependency graph. OpenIE constructs a graph with node descriptions similar to the original text and uses open-domain relations, leading to relations that are hard to compare.
As shown in Table 10, whereas OpenIE performs slightly better than dependency graphs, AMR gives the best results according to both metrics, highlighting the potential of AMRs for representing salient pieces of information. Different from our work, Naseem et al. (2021) propose a graph construction approach which generates a single document-level graph from the individual sentences' AMR graphs by merging identical concepts; this is orthogonal to our sentence-level AMR representation and can be incorporated in future work.

D Semantic Representations
In Figure 5 we show AMR and dependency representations for the summary sentence "police have appealed for help in tracing a woman who has been missing for six years.". In §5.5 those semantic representations are used to predict subsentence-level factuality using edge-level information. In particular, FACTGRAPH-E employs AMR (Figure 5a) whereas DAE uses dependencies (Figure 5b).