HySPA: Hybrid Span Generation for Scalable Text-to-Graph Extraction

Text-to-Graph extraction aims to automatically extract information graphs, consisting of mentions and types, from natural language texts. Existing approaches, such as table filling and pairwise scoring, have shown impressive performance on various information extraction tasks, but they are difficult to scale to datasets with longer input texts because of their second-order space/time complexity with respect to the input length. In this work, we propose a Hybrid Span Generator (HySPA) that invertibly maps the information graph to an alternating sequence of nodes and edge types, and directly generates such sequences via a hybrid span decoder which decodes both the spans and the types recurrently with linear time and space complexity. Extensive experiments on the ACE05 dataset show that our approach significantly outperforms the state of the art on the joint entity and relation extraction task.


Introduction
Information Extraction (IE) can be viewed as a Text-to-Graph extraction task that aims to extract an information graph (Shi et al., 2017) consisting of mentions and types from unstructured texts, where the nodes of the graph are mentions or entity types and the edges are relation types that indicate the relations between the nodes. A typical approach to graph extraction is to break the extraction process into sub-tasks, such as Named Entity Recognition (NER) (Florian et al., 2006, 2010) and Relation Extraction (RE) (Sun et al., 2011; Jiang and Zhai, 2007), and either perform them separately (Chan and Roth, 2011) or jointly (Eberts and Ulges, 2019).
Recent joint IE models (Wang and Lu, 2020; Lin et al., 2020) have shown impressive performance on various IE tasks, since they can mitigate error propagation and leverage inter-dependencies between the tasks. Previous work often uses pairwise scoring techniques to identify relation types between entities. However, this approach is computationally inefficient because it needs to enumerate all possible entity pairs in a document, and the relation type is a null value in most cases due to the sparsity of relations between entities. Also, pairwise scoring techniques evaluate each relation type independently and thus fail to capture interrelations between relation types for different pairs of mentions.
Another approach is to treat the joint information extraction task as a table filling problem (Wang and Lu, 2020), and generate two-dimensional tables with a Multi-Dimensional Recurrent Neural Network (Graves et al., 2007). This can capture interrelations among entities and relations, but the space complexity grows quadratically with the length of the input text, making the approach impractical for long sequences.
Some attempts, such as Seq2RDF and IMoJIE (Kolluru et al., 2020), leverage the power of Seq2seq models (Cho et al., 2014) to capture the interrelations among mentions and types with first-order complexity, but they all use a pre-defined vocabulary for mention prediction, which depends heavily on the distribution of the target words and cannot handle unseen, out-of-vocabulary words.
To solve these problems, we propose a first-order approach that invertibly maps the target graph to an alternating sequence of nodes and edges, and applies a hybrid span generator that directly learns to generate such alternating sequences. Our main contributions are three-fold:

• We propose a general technique to invertibly map between an information graph and an alternating sequence (assuming a given graph traversal algorithm). Generating an alternating sequence is equivalent to generating the original information graph.
• We propose a novel neural decoder that is constrained to generate only alternating sequences by decoding spans and types in a hybrid manner. At each decoding step, our decoder has only linear space and time complexity with respect to the length of the input sequence, and it can capture inter-dependencies among mentions and types owing to its nature as a sequential decision process.
• We conduct extensive experiments on the Automatic Content Extraction (ACE) dataset, which show that our model achieves state-of-the-art performance on the joint entity and relation extraction task, i.e., extracting a knowledge graph from a piece of unstructured text.

Modeling Information Graphs as Alternating Sequences
An information graph can be viewed as a heterogeneous multigraph (Shi et al., 2017) G = (V, E), where V is a set of nodes (typically representing spans (t_s, t_e) in the input document) and E is a multiset of edges, with a node type mapping function φ : V → Q and an edge type mapping function ψ : E → R. Node and edge types are assumed to be drawn from a finite vocabulary. Node types can be used, e.g., to represent entity types (PER, ORG, etc.), while edge types may represent relations (PHYS, ORG-AFF, etc.) between the nodes. In this work, we represent node types as separate nodes that are connected to their mention node v by a special edge type, [TYPE].
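As a concrete illustration, such a multigraph can be stored as a node list plus a multiset of typed edges, with entity types appearing as ordinary nodes linked by [TYPE]. This is our own minimal sketch (the variable names and the toy sentence are illustrative, not from the paper's code):

```python
# Nodes are either text spans (start, end) or type labels; types are nodes too.
nodes = [(0, 1), "PER", (4, 5), "GPE"]

# Edges are (head, edge_type, tail) triples; a multiset, so duplicates are allowed.
edges = [
    ((0, 1), "[TYPE]", "PER"),   # the mention at span (0, 1) has entity type PER
    ((4, 5), "[TYPE]", "GPE"),   # the mention at span (4, 5) has entity type GPE
    ((0, 1), "PHYS", (4, 5)),    # a PHYS relation between the two mentions
]

def neighbors(node):
    """All (edge_type, tail) pairs leaving `node` -- a multigraph adjacency view."""
    return [(r, t) for h, r, t in edges if h == node]
```

With this representation, node typing and relation extraction reduce to the same operation: enumerating the typed out-edges of a node.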

Representing information graphs as sequences
Instead of directly modeling the space of heterogeneous multigraphs, G, we build a mapping s^π = f_s(G, π) from G to a sequence space S^π. f_s depends on a (given) ordering π of the nodes and their edges in G, constructed by a graph traversal algorithm like Breadth First Search (BFS) or Depth First Search (DFS), and an internal ordering of nodes and edge types. We assume that the elements s^π_i of the resulting sequences s^π are drawn from finite sets of node representations V (defined below), node types Q, edge types R (incl. [TYPE]), and "virtual" edge types U: the elements of U (the sequence start/end markers and the level separator [SEP]) do not represent edges in G, but serve to control the generation of the sequence, indicating the start/end of sequences and the separation of levels in the graph.
We furthermore assume that the sequences s^π = (s^π_0, ..., s^π_n) that represent graphs have an alternating structure, where s^π_0, s^π_2, s^π_4, ... represent nodes in V, and s^π_1, s^π_3, ... represent actual or virtual edges. In the case of BFS, we exploit the fact that it visits nodes level by level, i.e., in the order p_i, c_{i1}, ..., c_{ik}, p_j (where c_{ik} is the k-th child of parent p_i, connected by edge e_{ik}, and p_j may or may not be equal to one of the children of p_i), which we turn into a sequence in which the special edge type [SEP] delineates the levels of the graph. This representation allows us to unambiguously recover the original graph if we know which type of graph traversal is assumed (BFS or DFS). Algorithm 1 (which we use to translate graphs in the training data to sequences) shows how an alternating sequence for a given graph can be constructed with BFS traversal. Figure 1 shows the alternating sequence for an information multigraph. The length |s^π| is bounded linearly by the size of the graph, O(|s^π|) = O(|V| + |E|), which is also the complexity of typical graph traversal algorithms like BFS/DFS.
Algorithm 1: Alternating sequence construction with BFS
Input: ordered adjacency dictionary of an information graph G; positions of nodes in the input text, p_q; frequencies of edge types in the training set, p_r
Output: an alternating sequence y^π

  Sort the nodes in G according to p_q
  For each node v in G, sort the neighbors and the edges of v according to p_q and p_r, respectively
  Instantiate y^π as an empty list
  for u in G do
    if u is not visited then
      Initialize an empty queue q
      Mark u as visited and enqueue u to q
      while q is not empty do
        Dequeue a node w from q
        if w in G then
          Append w and all the neighbors of w with their edge types to y^π
          Append the separation edge type, [SEP], to y^π
          Mark all unvisited neighbors of w as visited and enqueue them to q
        end
      end
    end
  end
  Return y^π

Node and Edge Representations

Our node and edge representations (explained below) rely on the observation that there are only two kinds of objects in an information graph: spans (as addresses to pieces of the input text) and types (as representations of abstract concepts). Since we can view types as special spans of length 1 grounded on the vocabulary of all types, Q ∪ R ∪ U, we only need O(nm + |Q ∪ R ∪ U|) indices to unambiguously represent the spans grounded on a concatenated representation of the type vocabulary and the input text, where n is the maximum input length, m is the maximum span length, and m ≪ n. We call these indices hybrid spans because they consist of both text spans and the length-1 spans of types. These indices can be invertibly mapped back to types or text spans depending on their magnitudes (details of this mapping are explained in Section 3.2). With this joint indexing of spans and types, the task of generating an information graph is thus converted to generating an alternating sequence of hybrid spans.
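Algorithm 1 can be sketched in Python as follows. This is a simplified sketch: the function name and the toy graph are ours, and we assume the sorting by p_q and p_r has already been applied to the orderings passed in.

```python
from collections import deque

def alternating_sequence_bfs(adj, node_order):
    """Flatten a graph into an alternating node / edge-type sequence with BFS.
    `adj` maps a parent node to an ordered list of (edge_type, child) pairs
    (already sorted by text position and edge-type frequency); `node_order`
    gives the BFS start order of the nodes. [SEP] separates BFS levels."""
    seq, visited = [], set()
    for u in node_order:
        if u in visited:
            continue
        visited.add(u)
        q = deque([u])
        while q:
            w = q.popleft()
            if w in adj:                       # only parents emit a level
                seq.append(w)
                for edge_type, child in adj[w]:
                    seq.append(edge_type)      # edge positions (odd indices)
                    seq.append(child)          # node positions (even indices)
                seq.append("[SEP]")            # virtual edge: level separator
                for _, child in adj[w]:
                    if child not in visited:
                        visited.add(child)
                        q.append(child)
    return seq
```

Run on a two-mention graph ("He" typed PER, PHYS-related to "Baghdad" typed GPE), this yields a sequence that strictly alternates between nodes and (actual or virtual) edge types, from which the original graph is recoverable.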
Generating sequences. We model the distribution p(s^π) with a sequence generator h with parameters θ, factorized autoregressively over the elements of s^π (|s^π| is the length of s^π, and x is the input text):

p(s^π) = ∏_{i=1}^{|s^π|} h_θ(s^π_i | s^π_{<i}, x).

In the following sections we address how to enforce that the sequence generator h only generates sequences in the space S^π, since we do not want h to assign non-zero probabilities to arbitrary sequences that do not have a corresponding graph.

HySPA: Hybrid Span Generation for Alternating Sequences
In order to directly generate a target sequence that alternates between nodes that represent spans in the input and a set of node/edge types that depend on our extraction task, we first build a hybrid representation H that is a concatenation of the hidden representations of the edge types, node types and the input text. This representation functions as both the context space and the output space for our decoder. Then we invertibly map both the spans of the input text and the indices of the types to hybrid spans grounded on the representation H. Finally, hybrid spans are generated auto-regressively through a hybrid span decoder to form the alternating sequence y^π ∈ S^π. By translating the graph extraction task into a sequence generation task, we can easily use beam-search decoding to reduce possible exposure bias (Wiseman and Rush, 2016) of the sequential decision process and thus find a globally better graph representation.
High-level overview of HySPA: The HySPA model takes a piece of text (e.g., a sentence or passage) and the pre-defined node and edge types as input, and outputs an alternating sequence representation of an information graph. We enforce the alternation of the generated sequence by applying an alternating mask to the output probabilities. The detailed architecture is described in the following subsections.

Text and Types Encoder

We arrange the type list v as a concatenation of the label names of the edge types, virtual edge types and node types, i.e.,

v = R̂ ⊕ Û ⊕ Q̂,

where ⊕ denotes the concatenation of two lists, and R̂, Û, Q̂ are the lists of the type names in the sets R, U, Q, respectively (e.g., Q̂ = ["Geopolitics", "Person", ...]). Note that the concatenation order between the lists of type names can be arbitrary as long as it is kept consistent throughout the whole model. Then, as in the embedding part of the table-sequence encoder (Wang and Lu, 2020), for each type v_i we embed the label tokens of the type with the contextualized word embedding from a pre-trained language model, the GloVe embedding (Pennington et al., 2014), and a character embedding, where l_p = |R| + |U| + |Q| is the total number of types, W_0 ∈ R^{d_e × d_m} is the weight matrix of the linear projection layer, d_e = d_c + d_g + d_k is the total embedding dimension, and d_m is the hidden size of our model. After obtaining the contextualized embeddings of the tokens of each type v_i ∈ v, we take the average of these token vectors as the representation of v_i and freeze it during training. More details of the embedding pipeline can be found in Appendix A. This pipeline is also used to embed the words in the input text, x. Unlike for the type embedding, we represent each word by the contextualized embedding of its first sub-token from the pre-trained language model (LM, e.g., BERT (Devlin et al., 2018)), and fine-tune the LM in an end-to-end fashion.
After obtaining the type embedding E_v and the text embedding E_x, we concatenate them along the sequence-length dimension to form the hybrid representation H_0. Since H_0 is a concatenation of word vectors from four different kinds of tokens, i.e., edge types, virtual edge types, node types and text, a meta-type embedding is applied to indicate this difference between the blocks of vectors in H_0, as shown in Figure 2. The final context representation H is obtained by element-wise addition of the meta-type embedding and H_0, where l_h = l_p + |x| is the height of our hybrid representation matrix H.
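The construction of H can be sketched as below. This is our own simplification (function name and shapes are illustrative): one meta-type vector per segment is broadcast over that segment's rows and added to the concatenated embeddings.

```python
import numpy as np

def build_context(E_v, E_x, meta, segment_lengths):
    """Sketch of building the hybrid representation H:
    concatenate type embeddings E_v (l_p, d_m) and text embeddings
    E_x (|x|, d_m) along the length axis, then add one meta-type row per
    segment (edge types, virtual edge types, node types, text)."""
    H0 = np.concatenate([E_v, E_x], axis=0)                   # (l_h, d_m)
    rows = np.concatenate(
        [np.repeat(meta[i][None, :], n, axis=0)               # broadcast per segment
         for i, n in enumerate(segment_lengths)], axis=0)
    return H0 + rows                                          # H = H0 + meta-type
```

The four entries of `segment_lengths` must sum to l_h = l_p + |x|, so every row of H_0 receives exactly one meta-type vector.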

Invertible Mapping between Spans & Types and Hybrid Spans
Given a span in the text, t = (t_s, t_e) ∈ N^2, t_s < t_e, we convert the span t to an index k, k ≥ l_p, in the representation H via the mapping g_k:

k = g_k(t_s, t_e) = t_s · m + t_e − t_s − 1 + l_p ∈ N,

where m is the maximum length of spans and l_p = |R| + |U| + |Q|. We keep the type indices in the graph unchanged because they are smaller than l_p, while k ≥ l_p. Since, for an information graph, the maximum span length m of a mention is usually far smaller than the length of the text, i.e., m ≪ n, we can reduce the bound on the maximum magnitude of k from O(n^2) to O(nm) by only considering spans of length smaller than m, and thus maintain linear space complexity for our decoder with respect to the length of the input text, n. Figure 3 shows a concrete example of our alternating sequence for a knowledge graph in the ACE05 dataset.

Figure 3: An example of the alternating sequence representation (middle) of a knowledge graph (bottom) from the ACE05 training set, where A_1 denotes Algorithm 1. We take m = 16 and l_p = 19 for this example. In the alternating sequence, "19" is the index for the span (0, 1) of "He", "83" is the index for the span (4, 5) of "Baghdad", and "10" is the index of the virtual edge type [SEP]. The input text (top) for this graph is "He was captured in Baghdad late Monday night".
Since t_s, t_e, k are all natural numbers, we can construct an inverse mapping g_t that converts the index k in H back to t = (t_s, t_e):

g_t(k) = (⌊(k − l_p)/m⌋, ⌊(k − l_p)/m⌋ + ((k − l_p) mod m) + 1), for k ≥ l_p,

where ⌊·⌋ is the integer floor function and mod is the modulus operator. Note that g_t can also be applied directly to the indices in the types segment of H, leaving their values unchanged, i.e., g_t(k) = (k, k), ∀k < l_p, k ∈ N.
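The pair g_k / g_t can be written out directly; using the constants of the Figure 3 example (m = 16, l_p = 19), span (0, 1) for "He" maps to index 19 and span (4, 5) for "Baghdad" maps to index 83, and g_t recovers the spans exactly:

```python
def g_k(t_s, t_e, m=16, l_p=19):
    """Map a text span (t_s, t_e), t_s < t_e, to a hybrid-span index k >= l_p."""
    return t_s * m + t_e - t_s - 1 + l_p

def g_t(k, m=16, l_p=19):
    """Inverse mapping; type indices (k < l_p) pass through unchanged as (k, k)."""
    if k < l_p:
        return (k, k)
    t_s = (k - l_p) // m
    return (t_s, t_s + (k - l_p) % m + 1)
```

The round trip g_t(g_k(t_s, t_e)) = (t_s, t_e) holds for all spans of length at most m, which is what makes the mapping invertible.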
With this property, we can easily incorporate the mapping g_t into our decoder to map the alternating sequence y^π back to spans in the hybrid representation H. Figure 4 shows the general model architecture of our hybrid span decoder. Our decoder takes the context representation H as input and recurrently decodes the alternating sequence y^π given a start-of-sequence token.

Hybrid Span Decoder
Hybrid Span Encoding via Attention. Given the alternating sequence y^π and the mapping g_t (Section 3.2), our decoder first maps each index in y^π to a span, (t_{s_i}, t_{e_i}) = g_t(y^π_i), grounded on the representation H, and then converts the span to an attention mask, M_0, allowing the model to learn to represent a span as a weighted sum of the segment of contextualized word representations referred to by the span, where H_[CLS] ∈ R^{|y^π| × d_m} is the |y^π|-times repeated hidden representation of the start-of-sequence token, [CLS], from the text segment of H, and H_y is our final representation of the hybrid spans in y^π. W_1, W_2, b_1, b_2 are learnable parameters, and t_{s_i}, t_{e_i} are the start and end positions of the span that we are encoding. Note that for type spans, whose length is 1, the result of the softmax calculation is always 1, so their span representation is exactly their embedding vector, as desired.
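The masked attention pooling can be sketched as follows. This is a simplification of our own: the learned scorer (the [CLS] query with W_1, W_2, b_1, b_2) is replaced by a stand-in dot product, but the masking and the length-1 property are the same.

```python
import numpy as np

def encode_span(H, t_s, t_e):
    """Represent the span [t_s, t_e) as an attention-weighted sum of its rows
    in H. Positions outside the span get -inf scores, i.e. zero softmax
    weight; a stand-in dot product replaces the learned span scorer."""
    scores = np.full(H.shape[0], -np.inf)
    scores[t_s:t_e] = H[t_s:t_e] @ H.mean(0)   # stand-in for the learned scorer
    w = np.exp(scores - scores[t_s:t_e].max()) # masked softmax, numerically stable
    w /= w.sum()
    return w @ H                               # convex combination of span rows
```

For a length-1 span the single in-span weight is forced to 1, so the output equals the row's embedding, mirroring the type-span property noted above.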
Traversal Embedding. To distinguish the hybrid spans at different positions in y^π, a naive approach is to add a sinusoidal position embedding (Vaswani et al., 2017) to H_y. However, this treats the alternating sequence as an ordinary sequence and ignores the underlying graph structure it encodes. To alleviate this issue, we propose a novel traversal embedding which captures the traversal level information, the parent-child information, and the intra-level connection information as a substitute for the naive position embedding. Our traversal embedding can encode either the BFS or the DFS traversal pattern. As an example, we assume BFS traversal here and leave the details of the DFS traversal embedding to Appendix D.

Figure 4: The architecture of our hybrid span decoder. N is the number of decoder layers. ⊕ before the softmax function denotes the concatenation operator. H^N_y is the hidden representation of the sequence y^π from the last decoder layer. Our hybrid span decoder can be understood as an auto-regressive model that operates in a closed context space and output space defined by H.

Our BFS traversal embedding is a pointwise sum of the level embedding L, the parent-child embedding P, and the tree embedding T of a given alternating sequence y:

TravEmbed(y) = L(y) + P(y) + T(y) ∈ R^{|y| × d_m},

where the level embedding assigns the same embedding vector L_i to each position at BFS traversal level i, with the values of the embedding vectors filled according to the non-parametric sinusoidal position embedding, since we want our embedding to extrapolate to sequences longer than any sequence in the training set. The parent-child embedding assigns different randomly initialized embedding vectors to the positions of parent nodes and child nodes within the BFS traversal levels, to help the model distinguish between these two kinds of nodes.
For encoding the intra-level connection information, our insight is that the connections between the nodes in a BFS level can be viewed as a depth-3 tree, where the first level holds the parent node, the second level is filled with the edge types, and the third level consists of the corresponding child nodes for each of the edge types. Our tree embedding is then formed by encoding the position information of this depth-3 tree with a tree positional embedding (Shiv and Quirk, 2019) for each BFS level. Figure 5 shows a concrete example of how these embeddings function for a given alternating sequence. The obtained traversal embedding is then added pointwise to the hidden representation H_y of the alternating sequence, injecting the traversal information of the graph structure.
Inner blocks. With the input text representation H_text sliced from the hybrid representation H and the target sequence representation H_y, we apply an N-layer transformer structure with mixed-attention to allow our model to utilize features from different attention layers when decoding the edges or the nodes of an alternating sequence. Note that our hybrid span decoder is orthogonal to the actual choice of neural structure for the inner blocks; we choose the mixed-attention transformer design because its layer-wise coordination property is empirically more suitable for our heterogeneous decoding of two different kinds of sequence elements. The detailed structure of the inner blocks is explained in Appendix E.
Hybrid span decoding. For the hybrid span decoding module, we first slice off the hidden representation of the alternating sequence y^π from the output of the N-layer inner blocks and denote it H^N_y. Then, for each hidden representation h^N_{y_i} ∈ H^N_y, 0 ≤ i < |y^π|, we apply two different linear layers to obtain the start position representation s_{y_i} and the end position representation e_{y_i}, where W_5, W_6 ∈ R^{d_m × d_m} and b_5, b_6 ∈ R^{d_m} are learnable parameters. We then calculate the scores of the target spans separately for the types segment and the text segment of H, and concatenate them before the final softmax operator for a joint estimation of the probabilities of text spans and type spans, where h_i is the score vector of possible spans in the types segment of H and t_i is the score vector of possible spans in the text segment of H. Since type spans always have span length 1, we only need an element-wise addition of the start position scores h^s_i and the end position scores h^e_i to calculate h_i. The entries of t_i contain the scores for the text spans, t^s_{i,j} + t^e_{i,k}, ∀j ≤ k, k − j < m, which are calculated with the help of an unfold function that converts the vector t^e_i ∈ R^n into a stack of n sliding windows of size m (the maximum span length) with stride 1. The alternating masks m_a ∈ R^{l_p} and m̄_a ∈ R^n block either the edge-type positions or the node positions, depending on whether the current step decodes a node or an edge, where l_e = |R| + |U| is the total number of edge types. In this way, while we have a joint model of nodes and edge types, the output distribution is enforced by the alternating masks to produce an alternating decoding of nodes and edge types, which is the main reason we call this decoder a hybrid span decoder.
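The effect of the alternating masks can be sketched as a single additive mask over the concatenated output scores. The exact layout is our reading of the paper: the first l_e positions are edge types (actual and virtual), positions l_e..l_p are node types, and the remaining n positions are text-span scores; at node steps only node positions are allowed, at edge steps only edge-type positions.

```python
import numpy as np

def alternating_mask(step, l_p, l_e, n):
    """Additive mask over the l_p + n output scores (our own sketch of the
    mask layout). Even steps decode nodes (node types + text spans); odd
    steps decode edge types. Masked positions get -inf so the softmax
    assigns them zero probability."""
    mask = np.full(l_p + n, -np.inf)
    if step % 2 == 0:
        mask[l_e:] = 0.0     # node step: allow node types and text spans
    else:
        mask[:l_e] = 0.0     # edge step: allow the l_e edge-type positions
    return mask
```

Adding this mask to the concatenated score vector before the softmax guarantees that every generated sequence alternates between nodes and edge types, so it always corresponds to a valid graph traversal.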

Experimental Setting
We test our model on the ACE 2005 dataset distributed by LDC, which includes 14.5k sentences, 38.3k entities (with 7 types), and 7.1k relations (with 6 types), derived from the general news domain. More details can be found in Appendix C.
Following previous work, we use F1 as an evaluation metric for both NER and RE. For the NER task, a prediction is marked correct when both the type and the boundary span match those of the gold entity. For the RE task, a prediction is correct when both the relation type and the boundaries of the two entities are correct.
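These matching criteria amount to exact-match micro F1 over sets of predicted items (an entity item is a (span, type) pair; a relation item is a (head span, tail span, type) triple). A minimal sketch of the metric, with our own function name:

```python
def exact_match_f1(pred, gold):
    """Micro F1 over hashable prediction items; an item counts as correct
    only on exact match of all its fields (span boundaries and type)."""
    tp = len(set(pred) & set(gold))
    p = tp / len(pred) if pred else 0.0   # precision
    r = tp / len(gold) if gold else 0.0   # recall
    return 2 * p * r / (p + r) if p + r else 0.0
```

The same function scores both NER and RE; only the shape of the items differs.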

Implementation Details
When training our model, we apply the cross-entropy loss with a label smoothing factor of 0.1. The model is trained with 2048 tokens per batch (roughly a batch size of 28) for 25,000 steps using the AdamW optimizer (Loshchilov and Hutter, 2018) with a learning rate of 2e-4, a weight decay of 0.01, and an inverse square root scheduler with 2000 warm-up steps. Following the TabSeq model (Wang and Lu, 2020), we use RoBERTa-large (Liu et al., 2019) or ALBERT-xxlarge-v1 (Lan et al., 2020) as the pre-trained language model and slow its learning rate by a factor of 0.1 during training. A hidden-state dropout rate of 0.2 is applied to RoBERTa-large and a rate of 0.1 to ALBERT-xxlarge-v1. A dropout rate of 0.1 is also applied to our hybrid span decoder during training. We set the maximum span length m = 16, the hidden size of our model d_m = 256, and the number of decoder blocks N = 12. Even though beam search should in theory help reduce the exposure bias, we do not observe any performance gain during a grid search over the beam size and the length penalty on the validation set (the detailed grid search setting is in Appendix A). We thus use a beam size of 1 and a length penalty of 1, and leave this theory-experiment discrepancy for future research. Our model is built with the FAIRSEQ toolkit for efficient distributed training, and all experiments are conducted on two NVIDIA TITAN X GPUs.
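The inverse square root schedule with warm-up behaves approximately as in the following sketch (matching the common fairseq-style formulation; the function name is ours): linear warm-up to the base rate, then decay proportional to 1/√step.

```python
def inverse_sqrt_lr(step, base_lr=2e-4, warmup=2000):
    """Inverse square root LR schedule: linear warm-up over `warmup` steps
    to `base_lr`, then decay as base_lr * sqrt(warmup / step)."""
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * (warmup / step) ** 0.5
```

With the settings above, the learning rate peaks at 2e-4 at step 2000 and falls to 1e-4 by step 8000.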

Results
Table 1 compares our model with previous state-of-the-art results on the ACE05 test set. Compared with the previous SOTA, TabSeq (Wang and Lu, 2020) with the ALBERT pre-trained language model, our model with ALBERT performs significantly better on both the NER and RE scores, while maintaining a linear space complexity that is an order of magnitude smaller than TabSeq's. Our model is the first joint model with both linear space and time complexity among all joint IE models to date, and thus has the best scalability for large-scale real-world applications.

Ablation Study
To demonstrate the effectiveness of our approach, we conduct ablation experiments on the ACE05 dataset. As shown in Table 2, after we remove the traversal embedding, the RE F1 score drops significantly, which indicates that our traversal embedding helps encode the graph structure and improves relation predictions. If the alternating masking is dropped, the NER F1 and RE F1 scores both drop significantly, which shows the importance of enforcing the alternating pattern. We also observe that the mixed-attention layer contributes significantly to relation extraction. This is because the layer-wise coordination helps the decoder disentangle the source features and utilize different layer features for entity versus relation prediction. Finally, the DFS traversal performs worse than BFS. We suspect that this is because the alternating sequence resulting from DFS is often longer than the one from BFS, given the nature of the knowledge graphs, which increases the learning difficulty.

Error Analysis
After analyzing 80 remaining errors, we categorize and discuss the common cases below (Figure 6 plots the distribution of error types). Addressing them may require additional features and strategies.
Insufficient context. In many examples, the answer entity is a pronoun that cannot be accurately typed given the limited context: in "We notice they said they did not want to use the word destroyed, in fact, they said let others do that", it is difficult to correctly classify We as an organization. This could be mitigated by using entire documents as input, leveraging cross-sentence context.

Rare words. Rare-word errors arise when a word in the test set rarely appears in the training set and is often absent from the embedding vocabulary. In the sentence "There are also Marine FA-18s and Marine Heriers at this base", the term Heriers (a vehicle, incorrectly classified as a person by the model) neither appeared in the training set nor is it well understood by the pre-trained language model; in this case, the model can only rely on subword-level representations.
Background knowledge required. Often a sentence mentions entities that are difficult to infer from the context but are easily identified by consulting a knowledge base: in "but critics say Airbus should have sounded a stronger alarm after a similar incident occurred in 1997", our model incorrectly predicts Airbus to be a vehicle, while Airbus here refers to the European aerospace corporation. Our system also separated United Nations Security Council into two entities, United Nations and Security Council, generating a non-existent relation triple (Security Council part-of United Nations). Such mistakes could be avoided by consulting a knowledge base such as DBpedia (Bizer et al., 2009).

Related Work
NER is often performed jointly with RE in order to mitigate error propagation and learn inter-relations between the tasks. One line of approaches treats the joint task as a square table filling problem (Miwa and Sasaki, 2014; Gupta et al., 2016; Wang and Lu, 2020), where the i-th column or row represents the i-th token: the diagonal of the table holds the sequential tags for entities, and the other entries hold the relations between pairs of tokens. Another line of work performs RE after NER. Miwa and Bansal (2016) used a BiLSTM (Graves et al., 2013) for NER and subsequently a Tree-LSTM (Tai et al., 2015) over the dependency graph for RE. Other work takes the approach of constructing dynamic text span graphs to detect entities and relations. Extending this line of work, Lin et al. (2020) introduced ONEIE, which further incorporates global features based on cross-subtask and cross-instance constraints, aiming to extract IE results as a graph. Note that our model differs from ONEIE (Lin et al., 2020) in that it captures global relationships automatically through autoregressive generation, while ONEIE uses feature-engineered templates; moreover, ONEIE needs to perform pairwise classification for relation extraction, while our method efficiently generates only the existing relations and entities. While several Seq2Seq-based models (Zeng et al., 2018; Wei et al., 2019; Zhang et al., 2019) have been proposed to generate triples (i.e., node-edge-node), our model is fundamentally different from them in that: (1) it generates a BFS/DFS traversal of the target graph, which captures dependencies between nodes and edges and yields a shorter target sequence; (2) we model nodes as spans in the text, independently of the vocabulary, so even if the tokens of a node are rare or unseen words, we can still generate spans over them based on the context information.

Conclusion
In this work, we propose the Hybrid Span Generation (HySPA) model, the first end-to-end text-to-graph extraction model with linear space and time complexity at the graph decoding stage. Besides its scalability, the model also achieves state-of-the-art performance on the ACE05 joint entity and relation extraction task. Given the flexibility of the structure of our hybrid span generator, abundant future research directions remain, e.g., incorporating external knowledge into hybrid span generation, applying more efficient sparse self-attention, and developing better search methods to find more globally plausible graphs represented by the alternating sequence.