Transformers as Graph-to-Graph Models

We argue that Transformers are essentially graph-to-graph models, with sequences just being a special case. Attention weights are functionally equivalent to graph edges. Our Graph-to-Graph Transformer architecture makes this ability explicit, by inputting graph edges into the attention weight computations and predicting graph edges with attention-like functions, thereby integrating explicit graphs into the latent graphs learned by pretrained Transformers. Adding iterative graph refinement provides a joint embedding of input, output, and latent graphs, allowing non-autoregressive graph prediction to optimise the complete graph without any bespoke pipeline or decoding strategy. Empirical results show that this architecture achieves state-of-the-art accuracies for modelling a variety of linguistic structures, integrating very effectively with the latent linguistic representations learned by pretraining.


Introduction
Computational linguists have traditionally made extensive use of structured representations to capture the regularities found in natural language. The huge success of Transformers (Vaswani et al., 2017) and their pre-trained large language models (Devlin et al., 2019; Zhang et al., 2022; Touvron et al., 2023a,b) has brought these representations into question, since these models are able to capture even subtle generalisations about language and meaning in an end-to-end sequence-to-sequence model (Wu et al., 2020; Michael et al., 2020; Hewitt et al., 2021). This raises issues for research that still needs to model structured representations, such as work on knowledge graphs, hyperlink graphs, citation graphs, or social networks.
In this paper we show that the sequence-to-sequence nature of most Transformer models is only a superficial characteristic; underlyingly they are in fact modelling complex structured representations. We survey versions of the Transformer architecture which integrate explicit structured representations with the latent structured representations of Transformers. These models can jointly embed both the explicit structures and the latent structures in a Transformer's sequence-of-vectors hidden representation, and can predict explicit structures from this embedding. In the process, we highlight evidence that the latent structures of pretrained Transformers already include much information about traditional linguistic structures. These Transformer architectures support explicit structures which are general graphs, making them applicable to a wide range of structured representations and their integration with text.
* Work done while working at Idiap Research Institute. † Now at Google.
The key insight of this line of work is that attention weights and graph structure edges are effectively the same thing. Linguistic structures are fundamentally an expression of locality in the interaction between different components of a representation. As Henderson (2020) argued, incorporating this information about locality in the inductive bias of a neural network means putting connections between hidden vectors if their associated components are local in the structure. In Transformers (Vaswani et al., 2017), these connections are learned in the form of attention weights. Thus, these attention weights are effectively the induced structure of the Transformer's latent representation.
However, attention weights are not explicitly part of a Transformer's hidden representation. The output of a Transformer encoder is a sequence of vectors, and the same is true of each lower layer of self-attention. The latent attention weights are extracted from these sequence-of-vector embeddings with learned functions of pairs of vectors. Edges in explicit graphs can be predicted in the same way (from pairs of vectors), assuming that these graphs have also been embedded in the sequence of vectors.
In recent years, the main innovation has been in how to embed explicit graphs in the hidden representations of Transformers. In our work on this topic, we follow the above insight and input the edges of the graph into the computation of attention weights. Attention weights are computed from an n × n matrix of attention scores (where n is the sequence length), so we input the label of the edge between nodes i and j into the score computation for the i,j cell of this matrix. Each edge label has a learned embedding vector, which is input to the attention score function in various ways depending on the architecture. This allows the Transformer to integrate the explicit graph into its own latent attention graph in flexible and powerful ways. This integrated attention graph can then determine the Transformer's sequence-of-vectors embedding in the same way as standard Transformers.
Researchers from the Natural Language Understanding group at Idiap Research Institute have developed this architecture for inputting and predicting graphs under the name of Graph-to-Graph Transformer (G2GT). G2GT allows conditioning on an observed graph and predicting a target graph. For the case where a graph is only observed at training time, we not only want to predict its edges, we also want to integrate the predicted graph into the Transformer embedding. This has a number of advantages, most notably the ability to jointly model all the edges of the graph. By iteratively refining the previous predicted graph, G2GT can jointly model the entire predicted graph even though the actual prediction is done independently for each edge. And this joint modelling can be done in conjunction with other explicit graphs, as well as with the Transformer's induced latent graph.
Our work on G2GT has included a number of different explicit graph structures. The original methods were developed on syntactic parsing (Mohammadshahi and Henderson, 2020, 2021; Mohammadshahi, 2023). The range of architectures was further explored for semantic role labelling (Mohammadshahi and Henderson, 2023) and collocation recognition (Espinosa Anke et al., 2022). G2GT's application to coreference resolution extended the complexity of graphs to two levels of representation (mention spans and coreference chains) over an entire document, which was all modelled with iterative refinement of a single graph (Miculicich and Henderson, 2022).
In the rest of this paper, we start with a review of related work on deep learning for graph modelling (Section 2). We then present the general G2GT architecture with iterative refinement (Section 3), before discussing the specific versions we have evaluated on specific tasks (Section 4). We then discuss the broader implications of these results (Section 5), and conclude with a discussion of future work (Section 6).

Deep Learning for Graphs
Graph Neural Networks. Early attempts at broadening the application of neural networks to graph structures were pursued by Gori et al. (2005) and Scarselli et al. (2008), who introduced the Graph Neural Networks (GNNs) architecture as a natural expansion of Recurrent Neural Networks (RNNs) (Hopfield, 1982). This architecture regained interest in the context of deep learning, expanded through the inclusion of spectral convolution layers (Bruna et al., 2013), gated recurrent units (Li et al., 2015), spatial convolution layers (Kipf and Welling, 2017), and attention layers (Veličković et al., 2018). GNNs generally employ the iterative local message passing mechanism to aggregate information from neighbouring nodes (Gilmer et al., 2017). Recent research, analysing GNNs through the lens of Weisfeiler and Leman (1968), has highlighted two key issues: over-smoothing (Oono and Suzuki, 2020) and over-squashing (Alon and Yahav, 2021). Over-smoothing arises from repeated aggregation across layers, leading to convergence of node features and loss of discriminative information. Over-squashing, on the other hand, results from activation functions during message aggregation, causing significant information and gradient loss. These issues limit the capacity of GNNs to effectively capture long-range dependencies and nuanced graph relationships (Topping et al., 2021). The Transformer architecture (Vaswani et al., 2017) can be seen as addressing these issues, in that its stacked layers of self-attention can be seen as a fixed sequence of learned aggregation steps.
Graph Transformers. Transformers (Vaswani et al., 2017), initially designed for sequence tasks, represent a viable and versatile alternative to GNNs due to their intrinsic graph processing capabilities. Through their self-attention mechanism, they can seamlessly capture global wide-ranging relationships, akin to handling a fully-connected graph. Shaw et al. (2018) explicitly input relative position relations as embeddings into the attention function, thereby effectively inputting the relative position graph, instead of absolute position embeddings, to represent the sequence. Generalising this explicit input strategy to arbitrary graphs (Henderson, 2020) has led to a general class of models which we will refer to as Graph Transformers (GT).
GT Evolution and Applications. The history of graph input methods used in GTs started with Transformer variations that experimented with relative positions to more effectively capture distance between input elements. Rather than adopting the sinusoidal position embedding introduced by Vaswani et al. (2017) or the absolute position embedding proposed by Devlin et al. (2019), Shaw et al. (2018) added relative position embeddings to attention keys and values, capturing token distance within a defined range. Dai et al. (2019) proposed Transformer-XL, which used content-dependent positional scores and a global positional score in attention weights. Mohammadshahi and Henderson (2020) demonstrated one of the earliest successful integrations of an explicit graph into the Transformer's latent attention graph. They introduced the Graph-to-Graph Transformer (G2GT) architecture and applied it to syntactic parsing tasks by effectively leveraging pre-trained models such as BERT (Devlin et al., 2019). Huang et al. (2020) introduced new methods to enhance interaction between query, key and relative position embeddings within the self-attention mechanism. Su et al. (2021) proposed RoFormer, which utilises a rotation matrix to encode absolute positions while also integrating explicit relative position dependencies into the self-attention formulation. Liutkus et al. (2021) and Chen (2021) extended Performer (Choromanski et al., 2020) to support relative position encoding while scaling Transformers to longer sequences with a linear attention mechanism. Graphormer (Ying et al., 2021) introduced node centrality encoding as an additional input-level embedding vector, and node distances and edges as soft biases added at the attention level, and obtained excellent results on a broad range of graph representation learning tasks. Mohammadshahi and Henderson (2021) built upon the G2GT architecture and proposed an iterative refinement procedure over previously predicted graphs, using a non-autoregressive approach. SSAN (Xu et al., 2021) leveraged the GT approach to effectively model mention dependencies for document-level relation extraction tasks. JointGT (Ke et al., 2021) exploited the GT approach for knowledge-to-text generation tasks via a joint graph-text encoding. Similarly, TableFormer (Yang et al., 2022) demonstrated the successful utilisation of the GT approach for combined text-table encoding in table-based question answering tasks. Espinosa Anke et al. (2022) proposed a GT architecture for simultaneous collocation extraction and lexical function typification, incorporating syntactic dependencies into the attention mechanism. Miculicich and Henderson (2022) showed that the G2GT iterative refinement procedure can be effectively applied to graphs at multiple levels of representation. Diao and Loynd (2022) further extended a GT architecture with new edge and node update methods and applied them to graph-structured problems. QAT (Park et al., 2022) substantially expanded upon GT models to jointly handle language and graph reasoning in question answering tasks. In the study conducted by Mohammadshahi and Henderson (2023), the G2GT model showed substantial improvements on semantic role labelling tasks. The multitude of successful applications and extensions firmly establishes Graph Transformers as a robust and adaptable framework for addressing complex challenges in language and graphs.

Graph-to-Graph Transformer Architecture
Our Graph-to-Graph Transformer (G2GT) architecture combines the idea of inputting graph edges into the self-attention function with the idea of predicting graph edges with an attention-like function. By encoding the graph relations into the self-attention mechanism of Transformers, the model has an appropriate linguistic bias, without imposing hard restrictions. Specifically, G2GT modifies the attention mechanism of Transformers (Vaswani et al., 2017) to input any graph. Given the input sequence $W = (x_1, x_2, \ldots, x_n)$, and graph relations $G = \{(x_i, x_j, l),\ 1 \le i,j \le n,\ l \in L\}$ (where $L$ is the set of labels), the modified self-attention mechanism is calculated as:¹

$$e_{ij} = \frac{1}{\sqrt{d}} \left[ x_i W^Q (x_j W^K)^\top + x_i W^Q (r_{ij} W^R_1)^\top + r_{ij} W^R_2 (x_j W^K)^\top \right] \qquad (1)$$

where $W^Q$ and $W^K$ are the query and key projection matrices of standard self-attention, $r_{ij} \in \{0,1\}^{|L|}$ is a one-hot vector which specifies the type of the relation between $x_i$ and $x_j$,² $W^R_1, W^R_2 \in \mathbb{R}^{|L| \times d}$ are matrices of graph relation embeddings which are learned during training, $|L|$ is the label size, and $d$ is the size of hidden representations. The value equation of Transformer (Vaswani et al., 2017) is also modified to pass information about graph relations to the output of the attention function:

$$z_i = \sum_{j} \alpha_{ij} \left( x_j W^V + r_{ij} W^R_3 \right) \qquad (2)$$

where $\alpha_{ij}$ are the attention weights obtained by applying a softmax over $j$ to the scores $e_{ij}$, $W^V$ is the value projection matrix, and $W^R_3 \in \mathbb{R}^{|L| \times d}$ is another learned relation embedding matrix.
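The modified attention described in this section can be sketched in NumPy. This is only an illustrative single-head version under stated assumptions: relation embeddings are added to the attention scores through one query-side term and one key-side term, and to the values through a third embedding matrix. The function name `graph_attention` and all parameter names are hypothetical and are not taken from the released G2GT code.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def graph_attention(X, R, Wq, Wk, Wv, Wr1, Wr2, Wr3):
    """Single-head self-attention with edge labels fed into scores and values.

    X: (n, d_model) node vectors; R: (n, n) integer edge labels in [0, |L|).
    Wq, Wk, Wv: (d_model, d) projections.
    Wr1, Wr2, Wr3: (|L|, d) relation embedding matrices.
    """
    d = Wq.shape[1]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                   # (n, d) each
    # Row lookup by label is the same as multiplying by a one-hot r_ij.
    R1, R2, R3 = Wr1[R], Wr2[R], Wr3[R]                # (n, n, d) each
    scores = Q @ K.T                                   # content-content term
    scores = scores + np.einsum("id,ijd->ij", Q, R1)   # query-relation term
    scores = scores + np.einsum("ijd,jd->ij", R2, K)   # relation-key term
    alpha = softmax(scores / np.sqrt(d), axis=-1)
    # Each output mixes values plus the embedding of the edge to each neighbour.
    Z = alpha @ V + np.einsum("ij,ijd->id", alpha, R3)
    return Z, alpha

# Toy usage with random parameters (illustrative sizes only).
rng = np.random.default_rng(0)
n, d_model, d, num_labels = 4, 8, 8, 3
X = rng.normal(size=(n, d_model))
R = rng.integers(0, num_labels, size=(n, n))
params = [rng.normal(size=s) * 0.1
          for s in [(d_model, d)] * 3 + [(num_labels, d)] * 3]
Z, alpha = graph_attention(X, R, *params)
```

Setting all three relation matrices to zero recovers standard scaled dot-product attention, which is one way to see why such a layer can be initialised from a pretrained Transformer.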
To extract the explicit graph from the sequence of vectors output by the Transformer, a classification module is applied to pairs of vectors and maps them into the label space L. Initially, the module transforms each vector into distinct head and tail representations using dedicated projection matrices. Subsequently, a classifier (linear, bilinear or MLP) is applied, to map the vector pair onto predictions over the label space. Notably, each edge prediction can be computed in parallel (i.e. in a non-autoregressive manner), as predictions for each pair are independent of one another. Given the discrete nature of the output, various decoding methods can be employed to impose desired constraints on the complete output graph. These can range from straightforward head-tail order constraints, to more complex decoding algorithms such as the Minimum Spanning Tree (MST) algorithm.
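The pairwise edge classifier described above can be sketched as follows. This is a minimal illustration that assumes the bilinear variant with one bilinear form per label and plain argmax decoding with no structural constraints; `predict_edges`, `W_head`, `W_tail` and `U` are hypothetical names.

```python
import numpy as np

def predict_edges(Z, W_head, W_tail, U):
    """Score every ordered node pair over the label set with a bilinear classifier.

    Z: (n, d) Transformer output vectors.
    W_head, W_tail: (d, k) role-specific projection matrices.
    U: (num_labels, k, k), one bilinear form per label.
    Returns predicted labels (n, n) and raw scores (n, n, num_labels).
    """
    H = Z @ W_head   # head-role representation of each node
    T = Z @ W_tail   # tail-role representation of each node
    # scores[i, j, l] = H[i] @ U[l] @ T[j]; each pair is scored independently,
    # so all edges are computed in parallel.
    scores = np.einsum("ik,lkm,jm->ijl", H, U, T)
    return scores.argmax(axis=-1), scores

# Toy usage with random parameters.
rng = np.random.default_rng(0)
n, d, k, num_labels = 5, 8, 4, 3
Z = rng.normal(size=(n, d))
W_head = rng.normal(size=(d, k))
W_tail = rng.normal(size=(d, k))
U = rng.normal(size=(num_labels, k, k))
labels, scores = predict_edges(Z, W_head, W_tail, U)
```

A constrained decoder such as MST would replace the final argmax and operate on `scores`.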
Having an architecture which can both condition on graphs and predict graphs gives us the powerful ability to do iterative refinement of arbitrary graphs. Even when graph prediction is non-autoregressive, conditioning on the previously predicted graph allows the model to capture between-edge correlations like an autoregressive model. As illustrated in Figure 1, we propose Recursive Non-autoregressive G2GT (RNGT), which predicts all edges of the graph in parallel, and is therefore non-autoregressive, but can still condition every edge prediction on all other edge predictions by conditioning on the previous version of the graph (using Equations 1 and 2).

[Figure 1: RNGT architecture diagram; recoverable panel labels: "Input Sentence", "Initial Graph Predictor".]

1 ... et al. (2022) provide a survey of previous proposals for relative position encoding. In ongoing work, we have found that using a relation embedding vector to reweight the dimensions in standard dot-product attention works well for some applications.
2 This formulation can be easily extended to multilabel graphs by removing the one-hot constraint. We are investigating the most effective method for doing this.
The input to the model is the input graph $W$ (e.g. a sequence of tokens), and the output is the final graph $G_T$ over the same set of nodes. First, we compute an initial graph $G_0$ over the nodes of $W$, which can be done with any model. Then each recursive iteration encodes the previous graph $G_{t-1}$ and predicts a new graph $G_t$. It can be formalised in terms of an encoder $E^{RNG}$ and a decoder $D^{RNG}$:

$$Z_t = E^{RNG}(W, G_{t-1}), \qquad G_t = D^{RNG}(Z_t), \qquad t = 1, \ldots, T$$

where $Z_t$ represents the set of vectors output by the model, and $T$ indicates the number of refinement iterations. Note that in each step of this iterative refinement process, the G2G Transformer first computes a set of vectors which embeds the predicted graph (i.e. $E^{RNG}(W, G_{t-1})$), before extracting the edges of the predicted graph from this set-of-vectors embedding (i.e. $D^{RNG}(Z_t)$).
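The refinement loop described above can be sketched as a short driver function. It is a schematic illustration only: `encode` and `decode` stand in for the encoder and decoder, the toy lambdas are placeholders with no linguistic meaning, and the early stop on an unchanged graph is an optional addition alongside the fixed iteration budget.

```python
def refine_graph(tokens, initial_graph, encode, decode, max_iters=3):
    """Iteratively re-predict a graph conditioned on its previous version.

    encode(tokens, graph) -> embedding Z; decode(Z) -> new graph.
    Runs at most max_iters refinement steps and stops early if the
    predicted graph stops changing.
    """
    graph = initial_graph
    for _ in range(max_iters):
        Z = encode(tokens, graph)      # embed the input and the previous graph
        new_graph = decode(Z)          # predict all edges in parallel
        if new_graph == graph:         # fixed point reached
            break
        graph = new_graph
    return graph

# Toy stand-ins: a "graph" is a tuple of head indices, and each step
# moves every index one position towards the last token.
toy_encode = lambda tokens, graph: (tokens, graph)
toy_decode = lambda Z: tuple(min(h + 1, len(Z[0]) - 1) for h in Z[1])

final = refine_graph(["a", "b", "c"], (0, 0, 0), toy_encode, toy_decode, max_iters=5)
print(final)  # (2, 2, 2)
```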

G2GT Models and Results
This section provides a more comprehensive explanation of each alternative G2GT model we have explored, along with an outline of how we have applied these models to address various graph modelling problems. The empirical success of these models demonstrates the computational adequacy of Transformers for extracting and modelling graph structures which are central to the nature of language.
The large further improvements gained by initialising with pretrained models demonstrate that Transformer pretraining encodes information about linguistic structures in its attention mechanisms.

Syntactic Parsing
Syntactic parsing is the process of analysing the grammatical structure of a sentence, including identifying the subject, verb, and object. Syntactic dependency parsing is a critical component in a variety of natural language understanding tasks, such as semantic role labelling (Henderson et al., 2013; Marcheggiani and Titov, 2017, 2020), machine translation (Chen et al., 2017), relation extraction (Zhang et al., 2018), and natural language inference (Pang et al., 2019). It is also a benchmark structured prediction task, because architectures which are not powerful enough to learn syntactic parsing cannot be computationally adequate for language understanding. Syntactic structure is generally specified in one of two popular grammar styles, constituency parsing (i.e. phrase-structure parsing) (Manning and Schutze, 1999; Henderson, 2003, 2004; Titov and Henderson, 2007a) and dependency parsing (Nivre, 2003; Titov and Henderson, 2007b; Carreras, 2007; Nivre and McDonald, 2008; Dyer et al., 2015; Barry et al., 2021). There are two main approaches to computing the dependency tree: transition-based and graph-based parsers. Transition-based parsers predict the dependency graph one edge at a time through a sequence of parsing actions (Yamada and Matsumoto, 2003; Nivre and Scholz, 2004; Titov and Henderson, 2007b; Zhang and Nivre, 2011; Weiss et al., 2015; Yazdani and Henderson, 2015), and graph-based parsers compute scores for every possible dependency edge and then apply a decoding algorithm to find the highest scoring total tree (McDonald et al., 2005; Koo and Collins, 2010; Kuncoro et al., 2016; Zhou and Zhao, 2019).
In the following, we outline our proposals for using G2GT for syntactic parsing tasks.

Transition-based Dependency Parsing
In Mohammadshahi and Henderson (2020), we integrate the G2GT model with two baselines, named StateTransformer (StateTr) and SentenceTransformer (SentTr). In the former model, we directly input the parser state into the G2GT model, while the latter takes the initial sentence as the input. For better efficiency of our transition-based model, we used an alternative version of G2GT, introduced in Section 3, where the interaction of graph relations with key matrices in Equation 1 is removed. Each parser decision is conditioned on the history of previous decisions by inputting an unlabelled partially constructed dependency graph to the G2GT model. Mohammadshahi and Henderson (2020) evaluate the integrated models on the English Penn Treebank (Marcus et al., 1993), and 13 languages of Universal Dependencies Treebanks (Nivre et al., 2018).
Results of our models on the Penn Treebank are shown in Table 1 (see Mohammadshahi and Henderson (2020) for further results on UD Treebanks). Integrating the G2GT model with the StateTr baseline achieves 9.97% LAS Relative Error Reduction (RER) improvement, which confirms the effectiveness of modelling the graph information in the attention mechanism. Furthermore, initialising our model weights with the BERT model (Devlin et al., 2019) provides significant improvement (27.65% LAS RER), which shows the compatibility of our modified attention mechanism with the latent representations learned by BERT pretraining. Integrating the G2GT model with the SentTr baseline results in a similar significant improvement (4.62% LAS RER).
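For readers unfamiliar with the metric, relative error reduction can be computed as below. The sketch assumes the usual definition, where the error is 100 minus an accuracy-style score in percent; the input scores are hypothetical and are not taken from Table 1.

```python
def relative_error_reduction(baseline_score, new_score):
    """RER in percent, for accuracy-style scores given in percent (e.g. LAS)."""
    baseline_error = 100.0 - baseline_score
    new_error = 100.0 - new_score
    return 100.0 * (baseline_error - new_error) / baseline_error

# Hypothetical scores: moving from 90.0 to 91.0 removes one tenth of the errors.
print(round(relative_error_reduction(90.0, 91.0), 2))  # 10.0
```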

Graph-based Dependency Parsing
The StateTr and SentTr models generate the dependency graph in an autoregressive manner, predicting each parser action conditioned on the history of parser actions. Many previous models have achieved better results with graph-based parsing methods, which use non-autoregressive computation of scores for all individual candidate dependency relations and then use a decoding method to reach the maximum scoring structure (McDonald et al., 2005; Koo and Collins, 2010; Ballesteros et al., 2016; Wang and Chang, 2016; Kuncoro et al., 2016; Zhou and Zhao, 2019). However, these models usually ignore correlations between edges while predicting the complete graph. In Mohammadshahi and Henderson (2021), we propose the Recursive Non-autoregressive Graph-to-Graph Transformer (RNGT) architecture, as discussed in Section 3. The RNGT architecture can be applied to any task with a sequence or graph as input and a graph over the same set of nodes as output. Here, we apply it to the syntactic dependency parsing task, and preliminary experiments showed that removing the interaction of graph relations with key vectors, in Equation 1, results in better performance and a more efficient attention mechanism. Mohammadshahi and Henderson (2021) evaluate this RNGT model on Universal Dependency (UD) Treebanks (Nivre et al., 2018), Penn Treebanks (Marcus et al., 1993), and the German CoNLL 2009 Treebank (Hajič et al., 2009) for the syntactic dependency parsing task.
Table 2 shows the results on 13 languages of UD Treebanks. First, we use UDify (Kondratyuk and Straka, 2019), the previous state-of-the-art multilingual dependency parser, as the initial parser for the RNGT model. The integrated model achieves significantly better LAS performance than the UDify model in all languages, which demonstrates the effectiveness of the RNGT model at refining a dependency graph. Then, we combine RNGT with Syntactic Transformer (SynTr), a stronger monolingual dependency parser, which has the same architecture as the RNGT model except without the graph input mechanism. The SynTr+RNGT model reaches further improvement over the strong SynTr baseline (four languages are significant), which is stronger evidence for the effectiveness of the graph refinement method. Interestingly, there is little difference between the performance with different initial parsers, implying that the RNGT model is effective enough to refine any initial graphs. In fact, even when we initialise with an empty parse, the Empty+RNGT model achieves competitive results with the other RNGT models, again confirming our powerful method of graph refinement.

Semantic Role Labelling
The semantic role labelling (SRL) task provides a shallow semantic representation of a sentence and builds event properties and relations among relevant words, and is defined in both dependency-based (Surdeanu et al., 2008) and span-based (Carreras and Màrquez, 2005; Pradhan et al., 2012) styles. Previous work (Marcheggiani and Titov, 2017; Strubell et al., 2018; Cai and Lapata, 2019; Fei et al., 2021; Zhou et al., 2020) showed that the syntactic graph helps SRL models to predict better output graphs, but finding the most effective way to incorporate the auxiliary syntactic information into SRL models was still an open question. In Mohammadshahi and Henderson (2023), we introduce the Syntax-aware Graph-to-Graph Transformer (SynG2G-Tr) architecture. The model conditions on the sentence's dependency structure and jointly predicts both span-based (Carreras and Màrquez, 2005) and dependency-based (Hajič et al., 2009) SRL structures. Regarding the self-attention mechanism, we remove the interaction of graph embeddings with value vectors in Equation 2, as it reaches better performance in this particular task (Mohammadshahi and Henderson, 2023).
Results for span-based SRL are shown in Table 3. Without initialising the models with BERT (Devlin et al., 2019), the SynG2G-Tr model outperforms a previous comparable state-of-the-art model (Strubell et al., 2018) in both end-to-end and given-predicate scenarios. The improvement indicates the benefit of encoding the graph information in the self-attention mechanism of Transformer with a soft bias, instead of hard-coding the graph structure into deep learning models (Marcheggiani and Titov, 2017; Strubell et al., 2018; Xia et al., 2019), as the model can still learn other attention patterns in combination with this graph knowledge. BERT (Devlin et al., 2019) initialisation results in further significant improvement in both settings, which again shows the compatibility of the G2GT modified self-attention mechanism with the latent structures learned by BERT pretraining.

Coreference Resolution
Coreference resolution (CR) is an important and complex task which is necessary for higher-level semantic representations. We show that it benefits from a graph-based global optimisation of all the coreference chains in a document.

CR Task Definition and Background
Coreference resolution is the task of linking all linguistic expressions in a text that refer to the same entity. Solutions for this task involve three parts: mention detection (Yu et al., 2020; Miculicich and Henderson, 2020), classification or ranking of mentions, and finally reconciling the decisions to create entity chains. These approaches fall within three principal categories: mention-pair models which perform binary decisions (McCarthy and Lehnert, 1995; Aone and William, 1995; Soon et al., 2001), entity-based models which focus on maintaining a single underlying entity representation, in contrast to the independent pair-wise decisions of mention-pair approaches (Clark and Manning, 2015, 2016), and ranking models which aim at ranking the possible antecedents of each mention instead of making binary decisions (Wiseman et al., 2016). A limitation of these methods lies in their bottom-up construction, which means that individual decisions make little use of global information about the coreference links among all mentions. Furthermore, these methods tend to exhibit significant complexity. Modelling coreference resolution as a graph prediction problem offers an alternative which addresses these limitations.

Iterative Graph-based CR
Miculicich and Henderson (2022) proposed a novel approach to modelling coreference resolution, treating it as a graph problem. In this framework, the tokens within the text serve as nodes, and the connections between them signify coreference links (see Figure 2). Given a document $D = [x_1, \ldots, x_N]$ with length $N$, the coreference graph is formally defined as the matrix $G \in \mathbb{N}^{N \times N}$, which represents the relationships between tokens. Specifically, the relationship type between any two tokens, $x_i$ and $x_j$, is labelled as $g_{i,j} \in \{0,1,2\}$ for the three distinct relation types: (0) no link, (1) mention link, and (2) coreference link. The primary objective of this approach is to learn the conditional probability distribution $p(G|D)$. To achieve this, an iterative refinement strategy is employed, which captures interdependencies among relations. The model iterates over the same document $D$ for a total of $T$ iterations. In each iteration $t$, the predicted coreference graph $G_t$ is conditioned on the previous prediction, denoted as $G_{t-1}$. Thus, the conditional probability distribution of the model is defined as follows, with each relation predicted independently given the document and the previous graph:

$$p(G_t \mid D, G_{t-1}) = \prod_{i,j} p(g_{i,j} \mid D, G_{t-1})$$

The proposed model operates on two levels of representation. In each iteration, it predicts the entire graph. However, during the first iteration, the model focuses on predicting edges that pinpoint mention spans, given that coreference links only have relevance when mentions are detected. From the second iteration, both mention links and coreference links are refined. This iterative strategy permits the model to enhance mention-related decisions based on coreference resolutions, and vice versa. This framework utilises iterative graph refinement as a substitute for conventional pipeline architectures in multi-level deep learning models. The iterative process concludes either when the graph no longer undergoes changes or when a predetermined maximum iteration count is attained (see Figure 3).
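To make the three-valued relation matrix concrete, the sketch below builds one from gold mention clusters. The cell layout is an assumption made purely for illustration: a mention link is stored at `[end][start]` of a span, and a coreference link at `[later_start][earlier_start]` for consecutive mentions of one entity. The actual encoding used by Miculicich and Henderson (2022) may differ (for instance, it may link mention heads), and this simple layout does not handle two relations falling on the same cell.

```python
NO_LINK, MENTION_LINK, COREF_LINK = 0, 1, 2

def build_coref_graph(num_tokens, clusters):
    """Build an N x N label matrix from clusters of (start, end) mention spans.

    Hypothetical encoding: G[end][start] marks a mention span, and
    G[later_start][earlier_start] links consecutive mentions of an entity.
    """
    G = [[NO_LINK] * num_tokens for _ in range(num_tokens)]
    for cluster in clusters:
        mentions = sorted(cluster)
        for start, end in mentions:
            G[end][start] = MENTION_LINK
        for (prev_start, _), (start, _) in zip(mentions, mentions[1:]):
            G[start][prev_start] = COREF_LINK
    return G

# "Mary said she left": tokens 0 and 2 are single-token mentions of one entity.
G = build_coref_graph(4, [[(0, 0), (2, 2)]])
for row in G:
    print(row)
```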
Ideally, encoding the entirety of the document in a single pass would be optimal. However, in practical scenarios, a constraint on maximum length arises due to limitations in hardware memory capacity. To address this challenge, Miculicich and Henderson (2022) introduce two strategies: overlapping windows and a reduced document approach. In the latter strategy, mentions are identified during an initial iteration with a focus on optimising recall, as previously suggested in Miculicich and Henderson (2020). Only the representations of these identified spans are subsequently used as inputs for the following iterations. Miculicich and Henderson (2022) conducted experiments on the CoNLL 2012 corpus (Pradhan et al., 2012) and showed improvements over relevant baselines and previous state-of-the-art methods, summarised in Table 4. We compare our model with three baselines: Lee et al. (2017) proposed the first end-to-end model for coreference resolution; Lee et al. (2018) extended the previous model by introducing higher order inference; and Xu and Choi (2020) used the span-based pretrained model SpanBERT (Joshi et al., 2020). The 'Baseline' of Lee et al. (2018) uses ELMo (Peters et al., 2018) to obtain token representations, so versions of this Baseline which use 'BERT-large' (Joshi et al., 2019) and 'SpanBERT-large' (Joshi et al., 2020) as their pretrained models are directly comparable to our 'G2GT BERT-large' and 'G2GT SpanBERT-large' models, respectively.
These results show that coreference resolution benefits from making global coreference decisions using document-level information, as supported by the G2GT architecture. Our model achieves its optimal solution within a maximum of three iterations. Notably, due to the model's ability to predict the entire graph in a single iteration, its computational complexity is lower compared to that of the baseline approaches.
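The overlapping-windows strategy mentioned above can be illustrated with a generic segmentation routine. This sketch only shows how a long document might be cut into overlapping token ranges; the window size, the overlap, and the way predictions from different windows are merged are assumptions that the text here does not specify, and `overlapping_windows` is a hypothetical helper.

```python
def overlapping_windows(num_tokens, window, overlap):
    """Return (start, end) token ranges that cover a document with overlap."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    stride = window - overlap
    spans = []
    start = 0
    while True:
        end = min(start + window, num_tokens)
        spans.append((start, end))
        if end == num_tokens:      # the last window reaches the document end
            break
        start += stride
    return spans

print(overlapping_windows(10, window=4, overlap=2))
# [(0, 4), (2, 6), (4, 8), (6, 10)]
```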

Discussion
The empirical success of Graph-to-Graph Transformers on modelling these various graph structures helps us understand how Transformers model language. This success demonstrates that Transformers are computationally adequate for modelling linguistic structures, which are central to the nature of language. The reliance of these G2GT models on using self-attention mechanisms to extract and encode these graph relations shows that self-attention is crucial to how Transformers can do this modelling. The large improvements gained by initialising with pretrained models indicate that pretrained Transformers are in fact using the same mechanisms to learn about this linguistic structure, but in an unsupervised fashion.
These insights into pretrained Transformers give us a better understanding of the current generation of Large Language Models (LLMs). It is not that these models do not need linguistic structure (since their attention mechanisms do learn it); it is that these models do not need supervised learning of linguistic structure. But perhaps in a low-resource scenario LLMs would benefit from the inductive bias provided by supervised learning of linguistic structures, such as for many of the world's languages other than English. And these insights are potentially relevant to the issues of interpretability and controllability of LLMs.
Table 4: Evaluation of CR on the test set (CoNLL 2012) in terms of precision (P), recall (R) and F1 score for three metrics, as well as the average F1 over metrics. * significant at p < 0.01 compared to (Joshi et al., 2020), † significant at p < 0.05 compared to (Xu and Choi, 2020).
These insights are also relevant for any applications which could benefit from integrating text with structured representations. Our current work investigates jointly embedding text and parts of a knowledge base in a single G2GT model, providing a way to integrate interpretable structured knowledge with knowledge in text. Such representations would be useful for information extraction, question answering and information retrieval, amongst many other applications. Other graphs we might want to model with a Transformer and integrate with text include hyperlink graphs, citation graphs, and social networks. An important open problem with such models is the scale of the resulting Transformer embedding.

Conclusion and Future Work
The Graph-to-Graph Transformer architecture makes explicit the implicit graph processing abilities of Transformers, but further research is needed to fully leverage the potential of G2GT.

Conclusions
The success of the above models of a variety of linguistic structures shows that Transformers are underlyingly graph-to-graph models, not limited to sequence-to-sequence tasks. The G2GT architecture with its RNGT method provides an effective way to exploit this underlying ability when modelling explicit graphs, effectively integrating them with the implicit graphs learned by pre-trained Transformers. Inputting graph relations as features to the self-attention mechanism enables the information input to the model to be steered by domain-specific knowledge or desired outcomes but still learned by the Transformer, opening up the possibility for a more tailored and customised encoding process. Predicting graph relations with attention-like functions and then re-inputting them for iterative refinement encodes the input, predicted and latent graphs in a single joint Transformer embedding which is effective for making global decisions about structure in a text.

Future Work
One topic of research where explicit graphs are indispensable is knowledge graphs. Knowledge needs to be interpretable, so that it can be audited, edited, and learned by people. And it needs to be integrated with existing knowledge graphs. Our current work uses G2GT to integrate knowledge graphs with knowledge conveyed by text.
One of the limitations of the models discussed in this paper is that the set of nodes in the output graph needs to be (a subset of) the nodes in the input graph. General purpose graph-to-graph mappings would require also predicting a set of new nodes in the output graph. One natural solution would be autoregressive prediction of one node at a time, as is done for text generation, but an exciting alternative would be to use methods from non-autoregressive text generation in combination with our iterative refinement method RNGT.
The excellent performance of the models presented in this paper suggests that many more problems can be successfully formulated as graph-to-graph problems and modelled with G2GT, in NLP and beyond. The code for G2GT and RNGT is open-source and publicly available at https://github.com/idiap/g2g-transformer.

Figure 2: Example of a graph structure for coreference. Mention spans are shown in bold, and colours represent entity clusters. The mention heads are underlined.

Figure 3: Example of iterations with G2GT in two stages.
Current work on knowledge extraction poses further challenges, most notably the issue of tractably modelling large graphs. The code for G2GT is open-source and available for other groups to use for other graph structures (at https://github.com/idiap/g2g-transformer).

Table 3: Comparing our SynG2G-Tr with the previous comparable SoTA model on CoNLL 2005 test sets for both in-domain (WSJ) and out-of-domain (Brown) sets. Boldfaced scores are significantly better.