Relation Extraction with Word Graphs from N-grams

Most recent studies for relation extraction (RE) leverage the dependency tree of the input sentence to incorporate syntax-driven contextual information and improve model performance, yet little attention has been paid to the limitation that high-quality dependency parsers are unavailable in many cases, especially for in-domain scenarios. To address this limitation, in this paper, we propose attentive graph convolutional networks (A-GCN) to improve neural RE methods, where the context graph is built in an unsupervised manner without relying on the existence of a dependency parser. Specifically, we construct the graph from n-grams extracted from a lexicon built via pointwise mutual information (PMI) and apply attention over the graph, so that different word pairs from the contexts within and across n-grams are weighted in the model and facilitate RE accordingly. Experimental results and further analyses on two English benchmark datasets for RE demonstrate the effectiveness of our approach, where state-of-the-art performance is observed on both datasets.


Introduction
Recently, neural models (Zeng et al., 2014; Zhang and Wang, 2015; Xu et al., 2015; dos Santos et al., 2015; Wang et al., 2016; Zhou et al., 2016; Zhang et al., 2017; Soares et al., 2019) with powerful encoders (e.g., Transformers) have achieved promising performance for relation extraction (RE), because such encoders are superior in capturing contextual information and thus allow RE systems to better understand the text and correctly identify the relations between entities in it. To further improve the ability of RE models to understand the context, many studies (Zhang et al., 2018; Guo et al., 2019; Sun et al., 2020; Yu et al., 2020; Mandya et al., 2020; Tian et al., 2021d; Chen et al., 2021) leverage extra resources, such as auto-parsed word dependencies, through graph-based approaches, e.g., graph convolutional networks (GCN). In doing so, such studies learn long-distance connections among useful words from the dependency tree and extract relations between entity pairs accordingly. However, the dependency parsers required by these approaches are not always available. In this situation, one needs another way to extract useful word connections to build the graph for GCN-based models, whereas limited attention has been paid to this alternative in previous studies.¹

¹ The code involved in this paper is released at https://github.com/cuhksz-nlp/RE-NGCN.
In this paper, we propose attentive GCN (A-GCN) for relation extraction, where the input graph is built based on n-grams extracted with an unsupervised method, i.e., pointwise mutual information (PMI), rather than with an existing dependency parser. Specifically, two types of edges are introduced into the graph to model word connections within and across n-grams, and an attention mechanism is applied to GCN to weight these edges. In doing so, different contextual information is discriminatively learned to facilitate RE without requiring any external resources. We evaluate our approach on two English benchmark datasets, i.e., ACE2005EN and SemEval 2010 Task 8, where the results demonstrate the effectiveness of our approach, with state-of-the-art performance observed on both datasets.

The Approach
RE is often treated as a classification task, where the input is a sentence $\mathcal{X} = x_1, \cdots, x_n$ with two given entities (denoted by $E_1$ and $E_2$) in it. Our approach follows this paradigm and uses a variant of graph neural models, i.e., attentive GCN (A-GCN), to incorporate word pair information and predict the relation $\hat{r}$ between $E_1$ and $E_2$ by

$$\hat{r} = \arg\max_{r \in \mathcal{R}} p\left(r \mid \text{A-GCN}\left(\mathcal{X}, \mathcal{G}_{\mathcal{X}}, E_1, E_2\right)\right) \quad (1)$$

Figure 1: An overview of the architecture for A-GCN with the graph built upon n-grams illustrated in blue boxes. Two given entities (i.e., "Money" and "hedge funds") are shown in red and blue colors, respectively.
where $\mathcal{G}_{\mathcal{X}}$ is the graph built based on n-grams in $\mathcal{X}$, $\mathcal{R}$ is the relation type set, $p$ computes the probability of a particular relation type $r \in \mathcal{R}$ given the input (i.e., $\mathcal{X}$, $\mathcal{G}_{\mathcal{X}}$, $E_1$, and $E_2$), and $\hat{r}$ is the prediction of our A-GCN model. In the following text, we first elaborate how we construct the graph based on n-grams, and then illustrate the architecture of the A-GCN model for RE.

Graph Construction from N-grams
Conventionally, the graph used in GCN-based models for natural language understanding tasks (including RE) is constructed from the dependency tree of each input sentence. However, high-quality dependency parsers are not always available. Therefore, we do not want our model to rely on the existence of dependency parsers, and hence we need an alternative way to build the graph. Given that n-grams are widely used as effective features carrying contextual information to enhance model performance in many previous studies (Song et al., 2009; Song and Xia, 2012; Ishiwatari et al., 2017; Yoon et al., 2018; Tian et al., 2020a,b,c, 2021a), we propose to construct the graph for GCN-based models based on the n-grams in $\mathcal{X}$, which are extracted from a pre-constructed n-gram lexicon $\mathcal{N}$.

N-gram Lexicon Construction
Before we segment appropriate n-grams for each input sentence, an n-gram lexicon $\mathcal{N}$ is built over the entire corpus based on pointwise mutual information (PMI). Specifically, we first compute the PMI of any two adjacent words $x'$ and $x''$ for all data by

$$PMI(x', x'') = \log \frac{p(x'x'')}{p(x')\,p(x'')} \quad (2)$$

where $p$ is the probability of an n-gram (i.e., $x'$, $x''$, and $x'x''$) in the training set; thus a higher PMI score suggests a greater chance of the two words forming an n-gram. Therefore, for each pair of adjacent words $x_{i-1}, x_i$, we use a predefined threshold to determine whether the two words should be combined or split. Through this process, we segment all sentences in the training set into small text spans and collect them to construct the n-gram lexicon $\mathcal{N}$.

Figure 2: Examples of IN and CROSS edges for building the graph in an example input sentence. Herein, n-grams extracted from the lexicon $\mathcal{N}$ are shown in the bottom gray box; the two entities (i.e., "message" and "mail application") are highlighted in red and blue colors; example IN and CROSS edges are marked in yellow and green colors, respectively (for simplicity, we only show the CROSS edges associated with "message").
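As a concrete illustration of this lexicon-construction step, the following is a minimal Python sketch; the whitespace tokenization, the default threshold value, and the exact segmentation loop are illustrative assumptions rather than the settings used in the paper.

```python
import math
from collections import Counter

def build_ngram_lexicon(sentences, pmi_threshold=0.0):
    """Build an n-gram lexicon: merge adjacent words whose PMI exceeds
    a threshold, segment each sentence into spans, and collect the spans."""
    unigram_counts, bigram_counts = Counter(), Counter()
    for words in sentences:
        unigram_counts.update(words)
        bigram_counts.update(zip(words, words[1:]))
    total_uni = sum(unigram_counts.values())
    total_bi = max(sum(bigram_counts.values()), 1)

    def pmi(w1, w2):
        # PMI(x', x'') = log( p(x'x'') / (p(x') * p(x'')) )  -- Eq. (2)
        p_bi = bigram_counts[(w1, w2)] / total_bi
        p_w1 = unigram_counts[w1] / total_uni
        p_w2 = unigram_counts[w2] / total_uni
        return math.log(p_bi / (p_w1 * p_w2))

    lexicon = set()
    for words in sentences:
        if not words:
            continue
        span = [words[0]]
        for prev, curr in zip(words, words[1:]):
            if pmi(prev, curr) > pmi_threshold:
                span.append(curr)            # combine the adjacent words
            else:
                lexicon.add(" ".join(span))  # split here and start a new span
                span = [curr]
        lexicon.add(" ".join(span))
    return lexicon

# Example usage with whitespace-tokenized sentences:
# lexicon = build_ngram_lexicon([s.split() for s in raw_sentences])
```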

N-gram Extraction for Each Sentence
Based on the given entities (i.e., $E_1$ and $E_2$) and the n-gram lexicon $\mathcal{N}$, n-grams in a sentence are extracted as follows. First, each entity itself is considered to be an n-gram. Then, we extract n-grams appearing in $\mathcal{N}$ from the sentence, where, if there are overlaps between n-grams, we merge them into a larger n-gram. For example, we extract four n-grams (i.e., "message", "was delivered to", "mail application", and "two days ago", illustrated in blue boxes) from the example sentence shown in Figure 2. In these n-grams, "two days ago" is a non-overlapping n-gram included in the lexicon; "was delivered to" is the merger of the two overlapping n-grams "was delivered" and "delivered to"; "message" and "mail application", highlighted in red and blue respectively, are the n-grams for the given entities. In general, word-word connections between adjacent words in the same n-gram are strong in terms of co-occurrence, as are some connections between words across co-occurring n-grams, which motivates us to treat such connections as important edges in the graph for GCN-based models.
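A minimal sketch of this extraction step is given below; the maximum n-gram length and the treatment of entity spans as ordinary spans during merging are assumptions made for illustration, not details confirmed by the paper.

```python
def extract_ngrams(words, lexicon, entity_spans, max_len=5):
    """Extract n-gram spans (start, end) from a sentence: each entity span is
    kept as an n-gram, lexicon matches are collected, and overlapping spans
    are merged into larger n-grams."""
    spans = list(entity_spans)                      # each entity itself is an n-gram
    n = len(words)
    for i in range(n):                              # collect all lexicon matches
        for j in range(i + 1, min(i + max_len, n) + 1):
            if " ".join(words[i:j]) in lexicon:
                spans.append((i, j))

    spans.sort()
    merged = []
    for start, end in spans:                        # merge overlapping spans
        if merged and start < merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# e.g., extract_ngrams("the message was delivered to the mail application two days ago".split(),
#                      lexicon, entity_spans=[(1, 2), (6, 8)])
```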
Graph Construction

Given an input sentence $\mathcal{X}$ with extracted n-grams, we construct the graph for GCN-based models via two types of undirected edges, i.e., the "IN" and "CROSS" edges: the first type models local word pairs within n-grams, while the second type models word pairs across n-grams. For the first type, any two adjacent words within the same n-gram are connected. For the second type, inspired by the fact that English phrases tend to be either head-initial or head-final in many cases (e.g., the phrases "read some books" and "green apples", respectively), we connect the starting and ending words of any two n-grams provided that there are no more than two n-grams between them. As an illustration, Figure 2 shows all IN edges (highlighted in yellow) and some CROSS edges (highlighted in green) for an example sentence.
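The following sketch shows one way to materialize these two edge types as a typed adjacency matrix; the integer encoding of edge types, the use of all four boundary-word pairings for CROSS edges, and the self-loop edges are illustrative assumptions rather than the paper's exact construction.

```python
import numpy as np

SELF, IN, CROSS = 1, 2, 3   # illustrative edge-type codes (0 means no edge)

def build_graph(num_words, ngram_spans):
    """Build a typed adjacency matrix: IN edges connect adjacent words inside
    an n-gram; CROSS edges connect boundary words of two n-grams that have
    no more than two other n-grams between them."""
    adj = np.zeros((num_words, num_words), dtype=np.int64)
    np.fill_diagonal(adj, SELF)                       # self-connected edges

    spans = sorted(ngram_spans)
    for start, end in spans:                          # IN edges
        for i in range(start, end - 1):
            adj[i, i + 1] = adj[i + 1, i] = IN

    for a in range(len(spans)):                       # CROSS edges
        for b in range(a + 1, len(spans)):
            if b - a - 1 > 2:                         # more than two n-grams in between
                continue
            (s1, e1), (s2, e2) = spans[a], spans[b]
            for i in (s1, e1 - 1):                    # starting / ending word of n-gram a
                for j in (s2, e2 - 1):                # starting / ending word of n-gram b
                    if adj[i, j] == 0:
                        adj[i, j] = adj[j, i] = CROSS
    return adj
```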

Attentive Graph Convolutional Networks
Standard GCN models treat all word pairs in the graph equally and hence are not able to handle the possibility that different $x_i$ may contribute differently to $x_j$. This is especially important for n-gram based graph construction: since all n-grams and the graph are constructed automatically without any supervised guidance, it is of vital significance that the proposed model distinguishes different word pairs. Therefore, we apply an attention mechanism to the adjacency matrix, where a weight $p^{(l)}_{i,j}$ is attached to each $x_i$ and its associated $x_j$ in the $l$-th A-GCN layer. Formally, $p^{(l)}_{i,j}$ is computed by

$$p^{(l)}_{i,j} = \frac{a_{i,j} \cdot \exp\left(\mathbf{h}^{(l-1)}_i \cdot \mathbf{h}^{(l-1)}_j\right)}{\sum_{j'=1}^{n} a_{i,j'} \cdot \exp\left(\mathbf{h}^{(l-1)}_i \cdot \mathbf{h}^{(l-1)}_{j'}\right)} \quad (3)$$

where $a_{i,j} \in \{0, 1\}$ indicates whether there is an edge between $x_i$ and $x_j$ in the graph, $\mathbf{h}^{(l-1)}_i$ refers to the output vector for $x_i$ from the previous GCN layer, and "$\cdot$" denotes the inner product. Afterwards, we apply the weight $p^{(l)}_{i,j}$ to the connection between $x_i$ and $x_j$ and obtain the output representation of $x_i$ by

$$\mathbf{h}^{(l)}_i = \sigma\left(\sum_{j=1}^{n} p^{(l)}_{i,j}\left(\mathbf{W}^{(l)}_{t_{i,j}} \cdot \mathbf{h}^{(l-1)}_j + \mathbf{b}^{(l)}_{t_{i,j}}\right)\right) \quad (4)$$

where $\mathbf{W}^{(l)}_{t_{i,j}}$ and $\mathbf{b}^{(l)}_{t_{i,j}}$ are the trainable matrix and bias selected according to the edge type $t_{i,j}$ (i.e., IN, CROSS, and self-connected edges) between $x_i$ and $x_j$.³ Compared with standard GCN, our approach is able to attach different numerical weights to word pairs and distinguish their importance, so as to better leverage contextual information accordingly. Moreover, we integrate the edge type information into the output representation of $x_i$ (i.e., $\mathbf{h}^{(l)}_i$), so that different types of contextual information are separately modeled.
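A compact PyTorch sketch of one A-GCN layer is given below; it assumes the typed adjacency matrix from the previous sketch, implements Eq. (3) as a masked softmax for numerical stability, and uses one linear transformation per edge type for Eq. (4). It is a sketch under these assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class AGCNLayer(nn.Module):
    """One A-GCN layer: attention over graph edges (Eq. 3) followed by
    edge-type-specific transformations (Eq. 4)."""
    def __init__(self, dim, num_edge_types=4):    # 0: no edge, 1: self, 2: IN, 3: CROSS
        super().__init__()
        # one trainable matrix and bias per edge type (index 0 is unused)
        self.W = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_edge_types)])
        self.act = nn.ReLU()

    def forward(self, h, adj):
        # h:   (batch, n, dim) -- word representations from the previous layer
        # adj: (batch, n, n)   -- typed adjacency matrix (0 means no edge)
        scores = torch.bmm(h, h.transpose(1, 2))              # inner products h_i . h_j
        scores = scores.masked_fill(adj == 0, float('-inf'))  # keep connected words only
        p = torch.softmax(scores, dim=-1)                     # Eq. (3): normalize over neighbors

        out = torch.zeros_like(h)
        for t in range(1, len(self.W)):                       # Eq. (4), summed per edge type
            weighted = p * (adj == t).float()                 # attention restricted to type t
            out = out + torch.bmm(weighted, self.W[t](h))     # p * (W_t h_j + b_t)
        return self.act(out)
```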

Relation Extraction with A-GCN
To conduct relation extraction with A-GCN, we obtain the hidden vector $\mathbf{h}^{(0)}_i$ for each $x_i$ from BERT (Devlin et al., 2019) and feed it into the first A-GCN layer. Next, we apply the forward computation (i.e., Eq. (3)-(4)) in each A-GCN layer and obtain the output $\mathbf{h}^{(L)}_i$ from the last (i.e., the $L$-th) A-GCN layer. Then, we apply max pooling over all words, as well as over the words belonging to each entity, to obtain the representation of the entire sentence $\mathbf{h}_{\mathcal{X}}$ and of the two entities $\mathbf{h}_{E_k}$ ($k = 1, 2$), respectively. This process is formalized by

$$\mathbf{h}_{\mathcal{X}} = \text{MaxPooling}\left(\left\{\mathbf{h}^{(L)}_1, \cdots, \mathbf{h}^{(L)}_n\right\}\right) \quad (5)$$

and

$$\mathbf{h}_{E_k} = \text{MaxPooling}\left(\left\{\mathbf{h}^{(L)}_i \mid x_i \in E_k\right\}\right) \quad (6)$$

Afterwards, we concatenate the representations of the sentence (i.e., $\mathbf{h}_{\mathcal{X}}$) and the two entities (i.e., $\mathbf{h}_{E_1}$ and $\mathbf{h}_{E_2}$) and apply a trainable matrix $\mathbf{W}_R$ to the resulting vector to map it to the output space by

$$\mathbf{o} = \mathbf{W}_R \cdot \left(\mathbf{h}_{\mathcal{X}} \oplus \mathbf{h}_{E_1} \oplus \mathbf{h}_{E_2}\right) \quad (7)$$

where $\oplus$ denotes vector concatenation and $\mathbf{o}$ is a $|\mathcal{R}|$-dimensional vector whose values each refer to a relation type in the label set $\mathcal{R}$. Finally, we apply a softmax function to $\mathbf{o}$ to predict the relation $\hat{r}$ between $E_1$ and $E_2$.
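Putting the pieces together, the following sketch shows the full RE forward pass (Eq. 5-7) on top of a HuggingFace-style BERT encoder and the AGCNLayer from the previous sketch; the entity masks, the encoder interface, and the glossed-over alignment between words and subword tokens are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class AGCNRelationExtractor(nn.Module):
    """BERT + stacked A-GCN layers + max-pooling classification head."""
    def __init__(self, bert, dim, num_relations, num_layers=2):
        super().__init__()
        self.bert = bert                                     # e.g., a HuggingFace BertModel
        self.layers = nn.ModuleList([AGCNLayer(dim) for _ in range(num_layers)])
        self.classifier = nn.Linear(3 * dim, num_relations)  # W_R in Eq. (7)

    def forward(self, input_ids, attention_mask, adj, e1_mask, e2_mask):
        # h^(0): contextual representations from BERT
        h = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        for layer in self.layers:                            # Eq. (3)-(4)
            h = layer(h, adj)

        def masked_max(x, mask):                             # max pooling over selected positions
            return x.masked_fill(mask.unsqueeze(-1) == 0, float('-inf')).max(dim=1).values

        h_x = h.max(dim=1).values                            # Eq. (5): whole-sentence pooling
        h_e1 = masked_max(h, e1_mask)                        # Eq. (6): entity pooling
        h_e2 = masked_max(h, e2_mask)
        o = self.classifier(torch.cat([h_x, h_e1, h_e2], dim=-1))   # Eq. (7)
        return torch.softmax(o, dim=-1)                      # relation probabilities
```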
³ For example, if the edge type between $x_i$ and $x_j$ is CROSS, then $\mathbf{W}^{(l)}_{CROSS}$ and $\mathbf{b}^{(l)}_{CROSS}$ are used in Eq. (4).

Experiment Settings

We use the BERT-large encoder as our textual encoder. Moreover, we run standard GCN and our A-GCN models with two layers in the experiments. In addition to the proposed n-gram based graph construction, we also try a fully connected graph (where every two words are connected through an edge) for both GCN and A-GCN. For evaluation, we follow previous studies and assess all models with F1 scores on the test sets.

Results
Table 1 reports the F1 scores of different models on the test sets of ACE05 and SemEval, where the results from the BERT-large baseline (ID: 1) without GCN are also reported for reference. There are several observations. First, although the baseline (ID: 1) already achieves outstanding performance, our models with A-GCN (ID: 3, 5) still further improve it. This observation confirms the effectiveness of A-GCN and the graphs built from n-grams. Second, both standard GCN and A-GCN with the graphs from n-grams (ID: 4, 5) consistently outperform their counterparts with the fully connected graph (ID: 2, 3). Particularly, when the full graph is used, GCN obtains limited improvements over the baseline (ID: 1) or even worsens the performance, which is largely due to the noise introduced by the full graph. On the contrary, the graph built upon the n-grams only contains edges that connect important context words and thus allows the GCN and A-GCN models to outperform the BERT-large baseline and achieve higher performance than the models with the fully connected graph. Third, on both datasets, A-GCN performs better than GCN (ID: 2, 3, 4, 5) with the same graph (i.e., either "Full" or "N-gram"). This observation is explained by the fact that the attention mechanism distinguishes different edges of a graph by assigning higher weights to more important ones, thus facilitating relation extraction.

Table 1: F1 scores of our A-GCN models and the baselines (i.e., BERT-only and standard GCN). "Full" and "N-gram" represent the graph constructed based on all word connections and our approach, respectively.

In addition, we compare our best models (i.e., BERT-large + A-GCN (N-gram)) with previous studies and report the results in Table 2, where our model outperforms all previous studies and achieves state-of-the-art performance on the two benchmark datasets. The results confirm that, compared with the dependency tree, graphs from n-grams also have a strong ability to capture contextual information. Moreover, although a graph from n-grams potentially carries some noise, the attention mechanism significantly helps to identify useful connections and facilitates RE accordingly, where different word pairs within and across n-grams are weighted and the model thus discriminatively learns from contextual information.

Table 2: The comparison (F1 scores) between previous studies and our best models (i.e., BERT-large + A-GCN (N-gram)) on ACE05 and SemEval. Previous studies that leverage word dependencies are marked by "*".

Conclusion
In this paper, we propose A-GCN to leverage word pair information for RE, where the graph for A-GCN is built from n-grams without relying on syntactic parsing. Particularly, we use PMI to extract n-grams from all training data and apply different connections among n-grams for the graph networks, where attention is applied to further enhance model performance. In doing so, A-GCN is able to dynamically learn from different word pairs so that less-informative relations are smartly pruned. Experimental results and analyses on two English benchmark datasets for RE demonstrate the effectiveness of our approach, where state-of-the-art performance is obtained on both datasets.

Appendix: Experiment Details

For the BERT-large encoder, we use its default configuration (i.e., 1024-dimensional hidden vectors and 16 attention heads). For the other hyper-parameter settings used to train the models, we report them in Table 4. We test all combinations of them for each model and use the one achieving the highest accuracy score in our final experiments.

Hyper-parameters    Values
Learning Rate       1e-5, 3e-5, 5e-5
Warmup Rate         0.06, 0.1
Dropout Rate        0.1
Batch Size          2, 4, 8

Table 4: The hyper-parameters tested in tuning our models; the best ones used in our final experiments are highlighted in bold.

Table 5: The number of trainable parameters (Para.) and the inference speed (sentences per second) of the experimented models on the test sets of the ACE05 and SemEval datasets. "GCN" is the baseline; "A-GCN" refers to our approach. "Full" and "N-gram" are the graph construction methods based on all word connections and our approach, respectively.
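To illustrate how such a hyper-parameter grid is typically swept, here is a minimal sketch over the Table 4 values; the train_and_evaluate helper is hypothetical and stands in for a full training and evaluation run of one model configuration.

```python
from itertools import product

# the hyper-parameter grid from Table 4
grid = {
    "learning_rate": [1e-5, 3e-5, 5e-5],
    "warmup_rate": [0.06, 0.1],
    "dropout_rate": [0.1],
    "batch_size": [2, 4, 8],
}

best_score, best_config = float("-inf"), None
for values in product(*grid.values()):            # all combinations of the grid
    config = dict(zip(grid.keys(), values))
    score = train_and_evaluate(config)            # hypothetical helper: returns an F1 score
    if score > best_score:
        best_score, best_config = score, config
print(best_config, best_score)
```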