Relation Extraction with Type-aware Map Memories of Word Dependencies

Relation extraction is an important task in information extraction and retrieval that aims to extract relations among given entities from running texts. Previous studies have shown that good performance on this task requires careful modeling of contextual information, where the dependency tree of the input sentence is a particularly beneficial source among the different types of contextual information. However, most of these studies focus on the dependency connections between words, with limited attention paid to exploiting dependency types. In addition, they often treat different dependency connections equally in modeling and thus suffer from noise (inaccurate dependency parses) in the auto-generated dependency tree. In this paper, we propose a neural approach for relation extraction with type-aware map memories (TaMM), which encode dependency types obtained from an off-the-shelf dependency parser for the input sentence. Specifically, for each word in an entity, TaMM maps all associated words, along with the dependencies among them, to memory slots and then assigns a weight to each slot according to its contribution to relation extraction. Our approach not only leverages dependency connections and types between words, but also distinguishes reliable dependency information from noisy information and models them appropriately. The effectiveness of our approach is demonstrated by experiments on two English benchmark datasets, where our approach achieves state-of-the-art performance on both.

Figure 1: An illustration of an example sentence (including the entity terms "bone marrow" and "stem cells") with its dependency parsing result.
Relation extraction is a fundamental task in information extraction, and its results are beneficial to downstream tasks such as schema induction (Nimishakavi et al., 2016), knowledge graph construction (Yu et al., 2017), and question answering (Xu et al., 2016). Normally, relation extraction aims to predict the relation between each pair of entities in a given sentence. For example, in the sentence "the [bone marrow]_e1 produces [stem cells]_e2" with the entity terms "bone marrow" and "stem cells", the relation between the two entities is "Product-Producer". Therefore, the ability to model the context of the input is of great importance for guaranteeing the performance of relation extraction. To this end, approaches based on neural networks have achieved promising success for the task in the past decade (Socher et al., 2012; Zeng et al., 2014; Zhang and Wang, 2015; Xu et al., 2015; dos Santos et al., 2015; Wang et al., 2016; Zhou et al., 2016; Zhang et al., 2017; Wu and He, 2019; Soares et al., 2019; Fu et al., 2019; Aydar et al., 2020; Tian et al., 2021c) because of their effectiveness in capturing contextual information with powerful encoders.
In addition, previous studies try to improve relation extraction performance by incorporating extra knowledge into their models. Among such knowledge, syntactic information from the auto-generated dependency parse of the input sentence has proved helpful for model performance, because word dependencies provide long-distance contextual information (Xu et al., 2015). However, previous studies mainly focus on the dependencies among words, with little attention paid to dependency types, which are also essential for the relation extraction task. For example, Figure 1 shows the dependency tree of a sentence where the entities (i.e., "bone marrow" and "stem cells") are highlighted in red; the dependency type "nsubj" (nominal subject) between "bone marrow" and "produces", as well as the type "dobj" (direct object) between "stem cells" and "produces", indicates that the first entity (i.e., "bone marrow") and the second entity (i.e., "stem cells") are the subject and object of "produces", respectively, which provides important cues for predicting the relation between the two entities. Moreover, previous studies also suffer from the noise in the auto-generated dependency tree, since all dependencies are modeled equally without identifying their individual contributions to the task. Therefore, it is important to design an appropriate approach to leverage dependency information for relation extraction.
In this paper, we propose a neural approach for relation extraction with a type-aware map memory (TaMM) module that encodes dependency information obtained from an off-the-shelf dependency parser. Specifically, for each word in an entity, we first extract the dependency information associated with it, where two types of dependency information are considered: the first is the "in-entity" dependencies suggested by the governor and dependents of that word; the second is the "cross-entity" dependencies obtained from the dependency path between the entities. Then, TaMM maps the associated words, along with the dependency types between them, to memory slots and assigns a weight to each slot to distinguish its contribution to the relation extraction task. Compared with other approaches to leveraging dependency information, such as graph convolutional networks (GCN), our approach not only leverages dependency type information, but also distinguishes reliable dependency information from noisy information and models them accordingly. The evaluation of different models is performed on two English benchmark datasets, i.e., ACE2005 and SemEval 2010 Task 8 (Hendrickx et al., 2010), where our approach outperforms all baselines and previous studies, achieving state-of-the-art performance on both datasets.

Preliminaries
Relation extraction is conventionally regarded as a text classification task: for an input sentence X = x_1 ⋯ x_l with l words, the two entities in it, i.e., E_1 and E_2, are mapped to a particular relation class (denoted by y). In most cases, contextual information is of great importance for making a correct prediction of the relation. Therefore, it is straightforward to consider integrating extra features to enhance contextual modeling. Among such features, the syntactic information suggested by the dependency tree of the input sentence has been demonstrated to be useful for relation extraction in many studies (Xu et al., 2015; Zhang et al., 2018; Guo et al., 2019). However, most existing models that leverage dependency information are not naturally suited to modeling the dependency types among words; an appropriate approach to leveraging dependency type information is thus required.
Of all choices, key-value memory networks (KVMN) (Miller et al., 2016) are an effective solution for modeling pair-wisely organized information to improve many NLP tasks (Tapaswi et al., 2016; Das et al., 2017; Mino et al., 2017; Nie et al., 2020; Song et al., 2020; Tian et al., 2020a,d, 2021b). Specifically, KVMN maps the information instances into a list of memory slots s_i = (k_i, v_i) (i is the index of the memory slot s_i), with k_i referring to the key and v_i to the value. KVMN addresses the memory slot s_i by assigning a weight p_i to the value v_i through comparing the input (denoted by x) to the key k_i:

p_i = softmax( AΦ_X(x) · AΦ_K(k_i) )

where Φ_· are functions that map the input features into their embeddings and A is a matrix that maps the embeddings into another vector space. After addressing all memory slots, KVMN reads the values by computing the weighted sum of the value vectors (i.e., AΦ_V(v_i)) using the resulting probability weights (i.e., p_i), which is expressed by

a = Σ_i p_i · AΦ_V(v_i)

Then, a is incorporated into the input representation by an element-wise summation:

o = x + a

Thus, the resulting vector o contains the weighted information from all values in the memory slots and is finally used to predict the output.

Figure 2: The left part illustrates the backbone classification model; the right part shows the process of leveraging the in-entity and cross-entity memory slots associated with "bone" (highlighted in yellow) through the proposed type-aware map memories (TaMM). In-entity and cross-entity memory slots are written in blue and green, respectively.
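As a concrete reference, the KVMN addressing-and-reading computation described above can be sketched in plain Python. This is a minimal sketch, not Miller et al.'s implementation: plain lists stand in for the projected embeddings AΦ(·), and `kvmn_read` is an illustrative name.

```python
import math

def kvmn_read(x, keys, values):
    """Address memory slots by comparing x to each key, then read a
    weighted sum of the value embeddings (minimal KVMN sketch).
    x, keys[i], values[i] are plain lists standing in for A*Phi(.)."""
    # p_i = softmax(x . k_i) over all slots
    scores = [sum(a * b for a, b in zip(x, k)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    p = [e / z for e in exps]
    # a = sum_i p_i * v_i (weighted read of the values)
    dim = len(values[0])
    a = [sum(p[i] * values[i][d] for i in range(len(values)))
         for d in range(dim)]
    # o = x + a (element-wise residual sum)
    return [xi + ai for xi, ai in zip(x, a)]
```

Note that the keys only influence the weights p_i; they never reach the output o, which is exactly the limitation TaMM addresses later.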

The Proposed Approach
Although KVMN can be used to leverage extra information for relation extraction, it loses the information of the keys by using them only as a weighting component, as stated previously. Therefore, we propose type-aware map memories (TaMM) to leverage both context words (keys) and dependency types (values) to improve relation extraction, where two types of dependency information, i.e., "in-entity" and "cross-entity" dependencies, are considered. Figure 2 illustrates the architecture of our approach, in which the entities in the input X are highlighted in red; the left part illustrates the backbone classification model; the right part shows the process of constructing in-entity (S^(in)) and cross-entity (S^(cross)) memory slots from the dependency tree of the input and the process of incorporating them into the backbone model through TaMM. To summarize, our approach can be formalized as

ŷ = argmax_{y ∈ T} p(y | X, S)

where T is the set of entity relation types and S = (S^(in), S^(cross)) denotes the memory slots for TaMM.
The following text illustrates the details of our proposed approach, including how we construct the memory slots and how TaMM is computed and applied to relation extraction.

Memory Slot Construction
To construct the memory slots used in our approach, we first use an off-the-shelf toolkit to generate the dependency parsing results of the input X. In the parse tree, every word in X is connected with its governor and its dependents through labeled dependency connections; for any two words in X, there is exactly one path between them. For each word in an entity, e.g., the word x_{i_u} in E_u (where i_u is the index of x_{i_u} in X and u ∈ {1, 2}), we consider two types of dependency information suggested by the obtained dependency tree of X and construct their corresponding memory slots. The first type is "in-entity" memory slots, constructed upon the governor and all dependents of x_{i_u} (i.e., first-order dependencies). The second is "cross-entity" memory slots, constructed upon the words and dependency arcs along the dependency path between x_{i_u} and the words in the other entity. Figure 3 shows the process of constructing the two types of memory slots from the dependency tree of a sentence, where the entities are highlighted in red. In the following text, we illustrate the way to extract the in-entity and cross-entity memory slots for x_{i_u}.

Figure 3: An illustration of the construction process for two types of memory slots (i.e., in-entity memory slots (a) and cross-entity memory slots (b)) for "bone" (with a yellow background). Entities are presented in red.
In-entity Memory Slots In-entity memory slots focus on the contextual information from the words connected to x_{i_u} by dependency parses. To construct them, we first locate the governor and all dependents of x_{i_u} in X from the dependency tree. Then we regard the governor and dependents as the keys in the memory slots and their dependency relations with x_{i_u} as the corresponding values. Therefore, we obtain a list of memory slots, with the j-th of them denoted as s^{(in)}_{i_u,j} = (k^{(in)}_{i_u,j}, v^{(in)}_{i_u,j}), where k^{(in)}_{i_u,j} is the word connected with x_{i_u} by a dependency connection and v^{(in)}_{i_u,j} is the dependency relation type between them. For example, in Figure 3(a), for the word "bone" (highlighted with a yellow background) in the first entity "bone marrow", we find the word "marrow" connected to it and the dependency relation type compound between them (the dependency with its type is highlighted in blue) and obtain the in-entity memory slot list S^{(in)} = [(marrow, compound)]. In this case, there is only one word (i.e., "marrow") associated with "bone". Similarly, if the word we focus on is "marrow", its in-entity memory slots are built in the same way from all the words directly connected to it (e.g., "bone" and "produces").

Cross-entity Memory Slots Cross-entity memory slots aim to incorporate the contextual information along the dependency path between x_{i_u} and the words in the other entity. As illustrated in Figure 3(b), for "bone" (highlighted with a yellow background) in the first entity "bone marrow", we locate the dependency path between "bone" and the last word "cells" of the second entity "stem cells", i.e., "bone - marrow - produces - cells", as well as the dependency relation types along that path: "compound" for "bone - marrow", "nsubj" for "marrow - produces", and "dobj" for "produces - cells" (highlighted in green). Therefore, the cross-entity memory slots for "bone" are S^{(cross)} = [(marrow, compound), (produces, nsubj), (cells, dobj)].

In summary, for x_{i_u} in E_u, we obtain the in-entity memory slot list S^{(in)}_{i_u} and the cross-entity memory slot list S^{(cross)}_{i_u}, which are fed into the TaMM module as illustrated in Figure 2.
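The construction of the two slot types can be sketched as follows, using a hand-coded toy parse of the example sentence from Figure 1. `ARCS`, `WORDS`, and both helper functions are illustrative names (not from the paper), and the arc list is an assumed Stanford-style parse of "the bone marrow produces stem cells".

```python
from collections import deque

# Toy parse: (dependent, head, relation); head -1 marks the root.
ARCS = [(0, 2, "det"), (1, 2, "compound"), (2, 3, "nsubj"),
        (3, -1, "root"), (4, 5, "compound"), (5, 3, "dobj")]
WORDS = ["the", "bone", "marrow", "produces", "stem", "cells"]

def in_entity_slots(i):
    """Slots (word, dep-type) from the governor and dependents of word i."""
    slots = []
    for dep, head, rel in ARCS:
        if dep == i and head >= 0:
            slots.append((WORDS[head], rel))   # word i's governor
        elif head == i:
            slots.append((WORDS[dep], rel))    # one of word i's dependents
    return slots

def cross_entity_slots(i, j):
    """Slots (word, dep-type) along the unique tree path from i to j."""
    adj = {}
    for dep, head, rel in ARCS:
        if head < 0:
            continue
        adj.setdefault(dep, []).append((head, rel))
        adj.setdefault(head, []).append((dep, rel))
    # BFS over the undirected tree to find the path i -> j
    prev = {i: None}
    q = deque([i])
    while q:
        u = q.popleft()
        for v, rel in adj.get(u, []):
            if v not in prev:
                prev[v] = (u, rel)
                q.append(v)
    slots, u = [], j
    while prev[u] is not None:
        p, rel = prev[u]
        slots.append((WORDS[u], rel))
        u = p
    return slots[::-1]
```

For "bone" (index 1), `in_entity_slots(1)` yields [(marrow, compound)] and `cross_entity_slots(1, 5)` yields [(marrow, compound), (produces, nsubj), (cells, dobj)], matching the worked example above.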

Type-aware Map Memories
Previous approaches for relation extraction that leverage dependency information focus on the dependencies among words without considering their dependency types. When learning from such information, a non-negligible challenge is the noise in the auto-generated dependency results, which may hurt model performance by providing misleading contextual information. One straightforward way to address this issue is to weight different dependencies according to their contribution to the relation extraction task. As discussed in the previous section, although KVMN provides a way to selectively model dependency information, it is limited in that it omits the contextual information carried by the keys from the final output of the memories.
To address the aforementioned limitations of KVMN, we propose type-aware map memories (TaMM) to incorporate the dependency information carried by both the keys and the values (i.e., the memory slots), where the architecture of TaMM is illustrated in the top right of Figure 2. Specifically, for each word in an entity, e.g., the word x_{i_u} in E_u (where i_u is the index of x_{i_u} in X and u ∈ {1, 2}), we consider two types of dependency information, i.e., "in-entity" and "cross-entity" dependency information, and construct their corresponding memory slots. We denote the j-th in-entity and cross-entity memory slots as s^{(in)}_{i_u,j} = (k^{(in)}_{i_u,j}, v^{(in)}_{i_u,j}) and s^{(cross)}_{i_u,j} = (k^{(cross)}_{i_u,j}, v^{(cross)}_{i_u,j}), respectively, and use the same process to model them.
Taking the in-entity memory slots as an example, we first use two matrices to map the keys k^{(in)}_{i_u,j} and the values v^{(in)}_{i_u,j} to their embeddings e^{k,(in)}_{i_u,j} and e^{v,(in)}_{i_u,j}, respectively. Next, we compute the weight p^{(in)}_{i_u,j} assigned to each slot through the inner product between the key embedding e^{k,(in)}_{i_u,j} and the hidden vector of x_{i_u} (denoted as h_{i_u}) obtained from the encoder in the backbone model:

p^{(in)}_{i_u,j} = exp(h_{i_u} · e^{k,(in)}_{i_u,j}) / Σ_{j'=1}^{m^{(in)}_{i_u}} exp(h_{i_u} · e^{k,(in)}_{i_u,j'})

where m^{(in)}_{i_u} is the number of in-entity memory slots associated with x_{i_u}. Then, we apply the weights to the corresponding memory slots and obtain the weighted sum (denoted as a^{(in)}_{i_u}):

a^{(in)}_{i_u} = Σ_{j=1}^{m^{(in)}_{i_u}} p^{(in)}_{i_u,j} (e^{k,(in)}_{i_u,j} + e^{v,(in)}_{i_u,j})

where "+" refers to the element-wise sum of vectors. Therefore, compared to KVMN, our approach is able to leverage both the context words and the dependency types associated with x_{i_u}. With the same process as for the in-entity memory slots, we deal with the cross-entity ones and obtain the weighted sum a^{(cross)}_{i_u}. Finally, we concatenate the two resulting vectors by

a_{i_u} = a^{(in)}_{i_u} ⊕ a^{(cross)}_{i_u}

with a_{i_u} denoting the output of TaMM, which contains the weighted dependency information to enhance the backbone model.
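The TaMM addressing-and-reading step for one word can be sketched in plain Python. This is an illustrative sketch (not the authors' code): lists stand in for the embedding vectors, and the key difference from KVMN is that the key embeddings also reach the output.

```python
import math

def tamm_read(h, key_emb, val_emb):
    """TaMM read for one word (sketch): weight each slot by
    softmax(h . e_k), then sum the weighted (key + value) embeddings,
    so both context words and dependency types survive in the output."""
    scores = [sum(a * b for a, b in zip(h, k)) for k in key_emb]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    p = [e / z for e in exps]
    dim = len(h)
    # a = sum_j p_j * (e_k_j + e_v_j): unlike KVMN, keys reach the output
    return [sum(p[j] * (key_emb[j][d] + val_emb[j][d])
                for j in range(len(p))) for d in range(dim)]
```

Running this once over the in-entity slots and once over the cross-entity slots, then concatenating the two results, gives the TaMM output a_{i_u} described above.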

Relation Extraction with TaMM
Once TaMM is built, it is straightforward to apply it to relation extraction through a backbone classifier. In our approach, we use BERT (Devlin et al., 2019) to encode the input X and obtain the hidden vectors for all words. Note that we only use the hidden vectors of the words in the two entities to predict their relation. Therefore, for each word x_{i_u} in the entity E_u, we feed h_{i_u} into TaMM and obtain the corresponding output a_{i_u}. Then, we concatenate h_{i_u} and a_{i_u} and, for each entity E_u, use the max pooling strategy to obtain the vectorized representation o_u by

o_u = MaxPooling({h_{i_u} ⊕ a_{i_u} : x_{i_u} ∈ E_u})

Afterwards, we concatenate the representations of the two entities (i.e., o_1 for E_1 and o_2 for E_2) and pass the resulting vector through a fully connected layer (a classifier) to obtain the final prediction ŷ by

ŷ = argmax softmax(W · (o_1 ⊕ o_2) + b)

where W and b are the trainable weight matrix and bias vector of the fully connected layer.

For the datasets, we follow previous studies (Hendrickx et al., 2010; Zeng et al., 2014; Zhang and Wang, 2015; Xu et al., 2015; dos Santos et al., 2015; Wang et al., 2016; Zhou et al., 2016; Zhang et al., 2017; Soares et al., 2019) to use the official train/test split. The statistics of the two datasets are summarized in Table 1.
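The classification head above (max pooling over per-word [h ; a] vectors, concatenation of the two entity representations, and a linear layer followed by argmax) can be sketched as follows. This is a toy-dimensional sketch with illustrative names, not the authors' implementation.

```python
def predict_relation(entity1_vecs, entity2_vecs, W, b, relations):
    """Sketch of the classification head: max-pool the per-word
    [h ; a] vectors of each entity, concatenate the two entity
    representations, then apply a linear layer and take the argmax."""
    def max_pool(vecs):
        # element-wise max over the words of one entity
        return [max(v[d] for v in vecs) for d in range(len(vecs[0]))]
    o = max_pool(entity1_vecs) + max_pool(entity2_vecs)  # o_1 (+) o_2
    logits = [sum(w * x for w, x in zip(row, o)) + bi
              for row, bi in zip(W, b)]
    return relations[max(range(len(logits)), key=logits.__getitem__)]
```

The softmax is omitted since it does not change the argmax.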

Implementation
In our experiments, we use the Stanford CoreNLP Toolkit (SCT) to obtain the dependency tree for each input sentence. Since the quality of text representation plays an important role in the performance of NLP models (Komninos and Manandhar, 2016; Song et al., 2017, 2018; Liu and Lapata, 2018; Song and Shi, 2018; Song et al., 2021), we use BERT (Devlin et al., 2019), a pre-trained language model that achieves state-of-the-art results in many NLP tasks (Wu and He, 2019; Soares et al., 2019; Tian et al., 2020b,c, 2021a), as the encoder in our model. Specifically, we use the uncased version of BERT with its default settings (e.g., for BERT-base, 12 layers of multi-head attention with 768-dimensional hidden vectors; for BERT-large, 24 layers of multi-head attention with 1024-dimensional hidden vectors) and fine-tune all its trainable parameters in the training stage. For TaMM, we randomly initialize the embeddings of all keys and values, with their dimensions matching that of the hidden vectors from BERT. For evaluation, we follow previous studies and use the standard micro-F1 scores.

10. We use the dataset split from https://github.com/tticoin/LSTM-ER/tree/master/data/ace2005/split.
11. We use SCT version 3.9.2 from https://stanfordnlp.github.io/CoreNLP/.
12. We download different BERT models from https://github.com/huggingface/transformers.
13. We use the evaluation script from the sklearn framework.
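A pure-Python sketch of micro-F1 scoring is given below. The paper uses sklearn's evaluation script; this sketch, including the assumption that a null class (e.g., "Other") is excluded from precision and recall, is for illustration only and may differ in detail from the official scorers.

```python
def micro_f1(gold, pred, ignore="Other"):
    """Micro-averaged F1 over relation labels, skipping the null class
    (an illustrative sketch of SemEval-style relation scoring)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p and g != ignore)
    pred_pos = sum(1 for p in pred if p != ignore)   # predicted relations
    gold_pos = sum(1 for g in gold if g != ignore)   # gold relations
    if pred_pos == 0 or gold_pos == 0:
        return 0.0
    prec = tp / pred_pos
    rec = tp / gold_pos
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)
```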

Overall Performance
In the main experiments, we run our models using BERT-base and BERT-large encoders with and without TaMM, and try different combinations of in-entity and cross-entity dependency information (i.e., in-entity only, cross-entity only, and both). We also run baselines using standard graph convolutional networks (GCN), standard graph attention networks (GAT), and KVMN to leverage the dependency information. Table 3 shows the results (F1 scores) of different models.

Table 4: The comparison between our models (the ones using TaMM (Both)) and previous studies on ACE2005 and SemEval. Models with dependency features and BERT-large are marked by "†" and "*", respectively.

There are several observations. First, TaMM works well with both BERT-base and BERT-large, where consistent improvements are observed over the BERT-base and BERT-large baselines across all datasets, although those baselines already achieve very good performance. Second, TaMM outperforms standard GCN and GAT models, which can be attributed to our modeling of dependency type information in TaMM. Third, under all three settings for incorporating different types of dependency information (i.e., in-entity, cross-entity, and both), our models with TaMM outperform the BERT baseline, and the highest F1 score is achieved when both in-entity and cross-entity dependency information are used (i.e., + TaMM (Both)). This observation confirms the individual contributions of in-entity and cross-entity dependency information as well as the effectiveness of our approach in leveraging them together to improve model performance. Fourth, compared with our TaMM models using cross-entity dependency information only (i.e., + TaMM (Cross)), the models using in-entity dependency information only (i.e., + TaMM (In)) achieve higher results in most cases. One possible explanation is that there are overlaps between in-entity and cross-entity dependencies. For example, in Figure 3, the dependency between "bone" and "marrow" is shared by both. Therefore, with in-entity dependencies only, TaMM not only leverages the context words directly associated with the entities themselves, but can also still partially benefit from the contextual information along the dependency path, whereas TaMM with cross-entity dependencies only fails to leverage the context words directly associated with the entities, which leads TaMM (In) to achieve better performance than TaMM (Cross). Fifth, under all settings, our model with TaMM consistently outperforms the baselines with KVMN, which demonstrates the effectiveness of our approach in improving relation extraction. The explanation is that TaMM is able to leverage both context words (keys) and dependency types (values) at the same time, while KVMN fails to incorporate the context information carried by the keys, which leads it to omit some important features and thus obtain inferior results.

Moreover, we compare our model under the best setting (i.e., the one using TaMM to leverage both in-entity and cross-entity dependencies) with previous studies and report the results (F1 scores) in Table 4. Our model with the BERT-large encoder outperforms all previous studies (including the ones also using the BERT-large encoder).

Table 5: F1 scores of models using BERT-large and TaMM (In/Both) to leverage 1st-, 2nd-, and 3rd-order dependencies. "N/A" means no order applies.

The Effect of Dependency Information
To analyze the effect of using dependency information, we perform three investigations on models using BERT-large encoder.
The first investigation examines different orders of dependencies used in TaMM. Previous experiments showed the effectiveness of our model with TaMM on first-order word dependencies. We also try second- and third-order dependencies with the BERT-large model using TaMM (Both). The results (together with the scores from first-order dependencies) are reported in Table 5, where the corresponding results from the models with TaMM (In) as well as the BERT-large baseline are also reported. The observations are as follows. First, models with TaMM under all settings outperform the BERT-large baseline, which is confirmed by all results on both datasets. Second, models with TaMM (Both) consistently outperform the ones with TaMM (In) under the same setting, which indicates that the cross-entity dependencies bring further improvements. Third, for models with TaMM (Both), using higher-order dependencies often results in inferior performance, while the trend is the opposite for models with TaMM (In). One possible explanation is that, for TaMM (Both), most essential word dependencies between the two entities have already been encoded, so higher-order dependencies sometimes introduce noise rather than useful information; for TaMM (In), leveraging higher-order dependencies allows the model to cover more contextual information along the dependency path between the two entities.

Figure 4: The performance of the BERT-large baseline and our TaMM (Both) on test instances from SemEval grouped by the entities' distance (i.e., the number of words between the two entities).
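The notion of k-th order dependencies in this analysis can be sketched as a depth-bounded breadth-first search over the dependency graph, collecting all words within k hops of a given word. This is an illustrative helper (not the authors' code), with `adj` an assumed word-index adjacency map.

```python
from collections import deque

def kth_order_neighbors(adj, i, k):
    """Return the sorted indices of words within k dependency hops of
    word i (a sketch of what 'k-th order dependencies' covers when
    building memory slots)."""
    seen = {i: 0}            # word index -> hop distance from i
    q = deque([i])
    while q:
        u = q.popleft()
        if seen[u] == k:     # do not expand beyond k hops
            continue
        for v in adj.get(u, []):
            if v not in seen:
                seen[v] = seen[u] + 1
                q.append(v)
    return sorted(v for v, d in seen.items() if 0 < d <= k)
```

With k = 1 this reduces to the first-order (governor and dependents) case used in the main experiments.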
The second investigation explores the performance of our model on test instances grouped by their entity distance (i.e., the number of words between the two entities), to see whether our approach can capture long-distance word-word dependencies and help relation extraction. To do so, we split the test set of SemEval into three groups according to the entity distance (i.e., from 0 to 4, from 5 to 9, and 10 or higher) and run our best TaMM model and the BERT baseline on them. Figure 4 illustrates the performance of TaMM (the orange bars) and BERT (the blue bars). Our TaMM outperforms the BERT baseline on all three groups of test instances, with bigger gaps observed as the entity distance grows. This observation demonstrates the effectiveness of our approach in encoding dependency information to improve relation extraction.
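The grouping used in this analysis can be sketched as follows. The bucket boundaries follow the three groups above; the instance format (end index of the first entity, start index of the second) and the function name are assumptions for illustration.

```python
def group_by_entity_distance(instances):
    """Bucket test instances by the number of words between the two
    entities (0-4, 5-9, >=10). Each instance is a pair
    (end_of_first_entity, start_of_second_entity) of word indices."""
    buckets = {"0-4": [], "5-9": [], ">=10": []}
    for e1_end, e2_start in instances:
        dist = max(0, e2_start - e1_end - 1)  # words strictly between
        if dist <= 4:
            buckets["0-4"].append((e1_end, e2_start))
        elif dist <= 9:
            buckets["5-9"].append((e1_end, e2_start))
        else:
            buckets[">=10"].append((e1_end, e2_start))
    return buckets
```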
The third investigation explores the effect of TaMM with different dependency parsers. Specifically, in addition to the Stanford CoreNLP Toolkit (SCT) used in the main experiments, we also try spaCy to obtain the dependency trees and report the results (with the BERT-large encoder) in Table 6. Models with different dependency parsers consistently outperform the BERT-large baseline, which indicates the robustness of our model design in improving relation extraction.

Table 6: F1 scores of models using BERT-large and TaMM (In/Cross/Both) to leverage dependency information from different parsers (i.e., SCT and spaCy).

Case Study
To examine how TaMM leverages dependency information to improve model performance, in Figure 5 we show an example input where our approach successfully predicts the relation between the two entities (in red) to be "Entity-Destination", while the BERT-large baseline fails to do so and predicts "Component-Whole". In the figure, the dependencies between words are highlighted in different colors to represent the total weights assigned to their corresponding in-entity and cross-entity memory slots, where a darker color refers to a higher weight. Overall, we find that the most emphasized dependencies are along the dependency path connecting the two entities, where the memory slots for those dependencies receive the highest weights. For the first entity "treadmill", the dependency type nsubj:pass (passive nominal subject) in the highlighted memory slot (installed, nsubj:pass) suggests that the first entity is the patient of the action install; similarly, for the second entity "space station", the highlighted dependency type obj (object) suggests that this entity is the location of the action install, given that the input is a passive sentence. Therefore, our approach is able to leverage these cues learned from word dependencies and their dependency types to predict the correct relation between the two entities: "Entity-Destination".

Related Work
Relation extraction is an important task in NLP, which relies heavily on good modeling of contextual information to achieve outstanding performance. To improve the capability of context modeling for relation extraction, studies in the past decade leverage neural networks, such as CNN (Zeng et al., 2014; Wang et al., 2016), RNN (Socher et al., 2012; Xu et al., 2015; Zhou et al., 2016), and BERT encoders (Wu and He, 2019; Soares et al., 2019). To further enhance models for this task, incorporating extra knowledge has proved to be an effective method, where normally three types of extra knowledge are used: lexical, syntactic, and semantic knowledge; among them, syntactic knowledge has been shown to be useful for this task (Xu et al., 2015). With this finding, there are also studies using advanced neural architectures, such as graph convolutional networks, to incorporate syntactic knowledge from the auto-generated dependency parse of the input sentence (Zhang et al., 2018; Guo et al., 2019; Sun et al., 2020; Yu et al., 2020; Mandya et al., 2020). Compared to the aforementioned studies, TaMM offers a simple yet effective non-graph-based approach to leveraging dependencies for relation extraction. TaMM not only incorporates both word dependencies and their types into the model to improve relation extraction performance, but also discriminatively leverages the dependencies by assigning different weights to them, which addresses the potential noise in auto-generated dependencies and thus further improves model performance.

Figure 5: An example input fed into our model with TaMM (Both) and its correctly predicted relation between the two entities marked in red. Word dependencies are highlighted in different colors to visualize the total weights assigned to their corresponding in-entity and cross-entity memory slots, where a darker color refers to a higher weight.

Conclusion
In this paper, we proposed an effective method for relation extraction with word dependencies encoded by TaMM, whose keys and values are built upon the dependency tree of the input sentence obtained from off-the-shelf toolkits. Particularly, for each entity in the sentence, we extract the words associated with it according to the dependency parse of the input sentence, together with their corresponding dependency relation types. Then, we use TaMM to encode and weight such information and integrate it into the relation extraction task.

Table 7: F1 scores of models with different configurations (i.e., the ones using base or large BERT with KVMN or TaMM and different combinations of in-entity and cross-entity dependency information) on the development set of ACE2005 for relation extraction.

Appendix B. Mean and Deviation of the Results
In the experiments, we test models with different configurations. For each model, we train it with the best hyper-parameter setting using five different random seeds. We report the mean (µ) and standard deviation (σ) of the F1 scores on the test sets of ACE2005 and SemEval in Table 8.

17. SemEval does not have an official dev set.

Table 8: The mean (µ) and standard deviation (σ) of the accuracy and F1 scores of all models (i.e., the ones using base or large BERT with KVMN or TaMM and different combinations of in-entity and cross-entity dependency information) on the test sets of ACE2005 and SemEval for relation extraction.