CoRI: Collective Relation Integration with Data Augmentation for Open Information Extraction

Integrating extracted knowledge from the Web into knowledge graphs (KGs) can facilitate tasks like question answering. We study relation integration, which aims to align free-text relations in subject-relation-object extractions to relations in a target KG. To address the challenge that free-text relations are ambiguous, previous methods exploit neighbor entities and relations for additional context. However, the predictions are made independently of each other and can thus be mutually inconsistent. We propose a two-stage Collective Relation Integration (CoRI) model, where the first stage independently makes candidate predictions, and the second stage employs a collective model that accesses all candidate predictions to make globally coherent predictions. We further improve the collective model with augmented data from the portion of the target KG that is otherwise unused. Experimental results on two datasets show that CoRI significantly outperforms the baselines, improving AUC from .677 to .748 and from .716 to .780, respectively.


Introduction
With its large volume, the Web has been a major resource for knowledge extraction. Open information extraction (open IE; Sekine 2006; Banko et al. 2007) is a prominent approach that harvests subject-relation-object extractions from free text without assuming a predefined set of relations. One way to empower downstream applications like question answering is to integrate those free-text extractions into a knowledge graph (KG), e.g., Freebase. Relation integration is the first step in integrating those extractions, where their free-text relations (i.e., source relations) are normalized to relations in the target KG (i.e., target relations). Only after relation integration can entity linking proceed to resolve the free-text subjects and objects to their canonical entities in the target KG.

* This work was performed while at Amazon.

Figure 1: A motivating example. Trained on parallel data, a local model may suffer from sparse context for a new entity pair Nell-Marie at inference, wrongly disambiguating "parent" to father instead of mother.

Local Approaches. Relation integration has been studied by the natural language processing (NLP) community. With exact matching in literal form between entity names in the source graph and the target KG, previous methods obtain parallel data, i.e., common entity pairs, between the two graphs as in Fig. 1. Features of the entity pairs (e.g., Malia-Barack) in the source graph and their relations in the target KG (e.g., father) are used to train models to predict target relations for future extractions. A common challenge is the ambiguity of source relations, e.g., "parent" may correspond to father or mother in different contexts. Previous methods exploited contextual features including embeddings of seen entities (e.g., "Malia"; Riedel et al. 2013), middle relations in between (e.g., "parent"; Riedel et al. 2013; Toutanova et al. 2015; Verga et al. 2016, 2017), and neighbor relations around the entity pair (e.g., "gender"; Zhang et al. 2019).
Assuming rich contexts to address the ambiguity challenge, previous methods may fall short under the evolving and incomplete nature of the source graph.

Table 1: Comparison of CoRI with representative previous work (Riedel et al., 2013; Verga et al., 2017; Zhang et al., 2019) along four aspects: middle relations, no entity parameters, neighbor relations, and collective inference.

For example, in the lower part of Fig. 1, emerging entities may come from new extractions with sparse contextual information. For the pair Nell-Marie, a conventional model learned on the parallel data may have neither seen entities nor neighborhood information (e.g., "gender") to depend on, thus failing to disambiguate "parent" and wrongly predicting father. Due to the local nature of previous approaches, i.e., predictions for different entity pairs are made independently of each other, the model is unaware that "Nell" has two fathers in the final predictions. Such predictions are incoherent with the common-sense expectation that a person is more likely to have one father and one mother, which is indicated by the graph structure around Malia in the target KG part of the parallel data.

Our Collective Approach
To alleviate the incoherent prediction issue of local approaches, we propose Collective Relation Integration (CoRI), which exploits the dependency between predictions of adjacent entity pairs to enforce global coherence. Specifically, we follow two stages: candidate generation and collective inference. In candidate generation, we simply use a local model to make independent predictions as candidates, e.g., father for all three pairs in the lower part of Fig. 1. In collective inference, we employ a collective model that is aware of common substructures of the target graph, e.g., around Malia. The collective model makes predictions by taking as input not only all contextual features available to the local model but also the candidate predictions of the current pair and all its neighbor pairs. For the pair Nell-Marie, the collective model has access to the candidate prediction father of Nell-Burton, which helps flip its final prediction to the correct mother. Tab. 1 summarizes CoRI and representative previous work along four aspects. To the best of our knowledge, CoRI is the first to perform relation integration collectively rather than locally.
Being responsible for making globally consistent predictions, the collective model needs to be trained to encode common structures of the target KG, e.g., Malia having only one father/mother in the parallel data of Fig. 1. To this end, we train the collective model in a stacked manner (Wolpert, 1992). We first train the first-stage local model on the parallel data, then train the second-stage collective model by conditioning on the candidate predictions of neighbor entity pairs from the first stage (e.g., father for Malia-Barack) to make globally consistent predictions (e.g., mother for Malia-Michelle).
Parallel Data Augmentation. The parallel data may be bounded by the low recall of exact name matching or the limited extractions generated by open IE systems. We observe that, even without counterpart extractions, the unmatched part of the target graph (as in Fig. 1) may also have rich common structures to guide the training of the collective model. To this end, we propose augmenting the parallel data by sampling subgraphs from the unmatched KG and creating pseudo parallel data by synthesizing their extractions, so the collective model can benefit from additional training data characterizing the desired global coherence.
To summarize, our contributions are three-fold: (1) We propose CoRI, a two-stage framework that improves state-of-the-art methods by making collective predictions with global coherence.
(2) We propose using the unmatched target KG to augment the training data.
(3) Experimental results on two datasets demonstrate the superiority of our approaches, improving AUC from .677 to .748 and from .716 to .780, respectively.

Preliminaries
In this section, we first formulate the task of relation integration, then describe local methods, exemplified by the state-of-the-art approach OpenKI (Zhang et al., 2019).

Relation Integration
We treat subject-relation-object extractions from open IE systems as a source graph K(E, R) = {(s, r, o) | s, o ∈ E, r ∈ R}, where E denotes extracted textual entities, e.g., "Barack Obama", and R denotes extracted source relations, e.g., "parent". We denote by (s, o) a source entity pair. For (s, o), K_{s,o} = {r | (s, r, o) ∈ K} denotes all source relations between them. Similarly, K_r = {(s, o) | (s, r, o) ∈ K} denotes all entity pairs with relation r in between. We use the union K_R = ∪_{r∈R} K_r to refer to all extracted entity pairs.
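To make the notation concrete, the accessors above could be sketched as follows (the toy entity and relation names are illustrative, not from the datasets):

```python
# Toy source graph K: a set of (subject, relation, object) extractions.
K = {
    ("Malia", "parent", "Barack"),
    ("Malia", "parent", "Michelle"),
    ("Nell", "parent", "Marie"),
}

def K_so(K, s, o):
    """K_{s,o}: all source relations between the entity pair (s, o)."""
    return {r for (s2, r, o2) in K if (s2, o2) == (s, o)}

def K_r(K, r):
    """K_r: all entity pairs with relation r in between."""
    return {(s, o) for (s, r2, o) in K if r2 == r}

def K_R(K):
    """K_R: the union over all relations, i.e., all extracted entity pairs."""
    return {(s, o) for (s, _, o) in K}
```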
Definition 1 (Relation Integration). Given a source graph K and a target KG K′(E′, R′) with target entities E′ and target relations R′, the task of relation integration is to predict all applicable target relations for each extracted entity pair in K_R:

Γ = {(s, r′, o) | (s, o) ∈ K_R, r′ ∈ R′ holds for (s, o)},

where (s, r′, o) ∈ Γ is an integrated extraction indicating that a target relation r′ holds for (s, o).
To train relation integration models, all methods employ parallel data formalized as follows:

Definition 2 (Parallel Data). Parallel data T are the common entity pairs shared between K_R and K′_{R′}, together with their ground-truth target relations in K′:

T = {(s, r′, o) ∈ K′ | (s, o) ∈ K_R ∩ K′_{R′}}.

To obtain parallel data, a widely used approach is to find entities shared by E and E′ by exact name matching, then generate common entity pairs and their ground truth.
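A minimal sketch of how parallel data could be assembled by exact name matching (all triple values below are hypothetical):

```python
from collections import defaultdict

def build_parallel_data(K, K_prime):
    """Keep entity pairs shared by source extractions K and target KG K',
    labeled with their ground-truth target relations from K'."""
    src_pairs = {(s, o) for (s, _, o) in K}
    parallel = defaultdict(set)
    for (s, r_prime, o) in K_prime:
        if (s, o) in src_pairs:  # exact name match on both entities
            parallel[(s, o)].add(r_prime)
    return dict(parallel)

K = {("Malia", "parent", "Barack"), ("Nell", "parent", "Marie")}
K_prime = {("Malia", "father", "Barack"), ("Ann", "mother", "Barack")}
parallel = build_parallel_data(K, K_prime)
```

Only the pair Malia-Barack appears in both graphs, so it alone enters the parallel data.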

Local Approaches
Previous local methods score potential integrated extractions by assuming their independence:

P_θ(Γ | K) = Π_{(s, r′, o) ∈ Γ} P_θ((s, r′, o) | K),    (1)

where θ denotes the parameters of the local model. One representative local model achieving state-of-the-art performance is OpenKI (Zhang et al., 2019). It encodes the neighborhood of (s, o) in K by grouping and averaging embeddings of source relations in three parts. Let K_{s,·} be the set of source relations between s and neighbor entities other than o, and similarly for K_{·,o}. OpenKI represents (s, o) by concatenating the three averaged embeddings into a local representation t^l:

t^l = [A(K_{s,o}); A(K_{s,·}); A(K_{·,o})],    (2)

where l stands for local, and A(·) takes a set of relations and outputs the average of their embeddings. Then each integrated extraction is scored by:

P_θ((s, r′, o) | K) = σ(MLP^l(t^l)),    (3)

where MLP^l is a multi-layer perceptron producing a score for each target relation r′, and σ is the sigmoid function. Given parallel data T, the loss for each example is a weighted binary cross entropy:

L = −γ · y · log p − (1 − y) · log(1 − p),    (4)

where p = P_θ((s, r′, o) | K), y indicates whether r′ holds for (s, o), and γ is a hyperparameter to account for the imbalance between positive and negative relations, because the latter often outnumber the former. The final loss is the sum over all examples.
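As a rough sketch of this scoring pipeline (random toy embeddings, a single linear layer standing in for MLP^l, and a made-up relation vocabulary; not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # toy embedding size; the paper uses 32 dimensions

# Hypothetical source-relation vocabulary with random embeddings.
REL_EMB = {r: rng.normal(size=DIM) for r in ["parent", "gender", "born in"]}

def A(rels):
    """Average the embeddings of a relation set (zero vector if empty)."""
    if not rels:
        return np.zeros(DIM)
    return np.mean([REL_EMB[r] for r in rels], axis=0)

def local_repr(K_so, K_sdot, K_doto):
    """t^l: concatenation of the three averaged neighborhood embeddings."""
    return np.concatenate([A(K_so), A(K_sdot), A(K_doto)])

def score(t_l, W, b):
    """Sigmoid score, with one linear layer standing in for MLP^l."""
    return 1.0 / (1.0 + np.exp(-(W @ t_l + b)))

def weighted_bce(p, y, gamma=10.0):
    """Per-example loss; gamma up-weights the rarer positive relations."""
    eps = 1e-9
    return -(gamma * y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

t_l = local_repr({"parent"}, set(), {"gender"})
W, b = rng.normal(size=3 * DIM), 0.0
p = score(t_l, W, b)
```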

Collective Relation Integration
As discussed in § 1, the drawback of local methods is that predictions of different entity pairs are independently made. Neglecting their dependency may lead to predictions inconsistent with each other.
To address the issue, we propose a collective approach, CoRI, which achieves collective relation integration via two stages: candidate generation and collective inference. In this section, we describe the input and output of the two stages, as well as our current implementations.

Candidate Generation
As mentioned in § 1.1, candidate generation's responsibility is to provide candidate predictions to the collective inference stage. Formally, candidate predictions Γ^l (l means local) are generated by executing a local model on the source graph K:

Γ^l = {(s, r′, o) | (s, o) ∈ K_R, P_θ((s, r′, o) | K) ≥ τ},    (5)

where τ is a decision threshold. The candidate predictions in Γ^l may be partially wrong, but the correct ones can help adjust wrong predictions of their adjacent entity pairs in the collective inference stage, under the guidance of the collective model. For example, in the upper part of Fig. 2, we have a source graph K with three entity pairs. The input to candidate generation is the entire K. After applying the local model (OpenKI in our case), we obtain three additional edges as the output Γ^l in the lower part of Fig. 2. Note that the candidate prediction father for Nell-Marie (denoted by black outline) is incorrect due to insufficient information in its neighborhood in K, i.e., the relations both between and around the entity pair (denoted by solid edges) are ambiguous "parent"s.
Fortunately, the entity pair Nell-Burton is relatively easy for the local model to predict as father because it can leverage the neighbor relation "father" between Billy-Burton. Such correct candidate predictions are included in Γ l , provided to the collective inference stage as additional signals for later correction of the wrong predictions such as father for Nell-Marie.

Collective Inference
Collective inference's responsibility is to encode the structures of the target graph and use such information to refine the candidate predictions Γ^l by enforcing coherence among them. To this end, a collective model P_β (with parameters β) takes both the source graph K and the candidate predictions Γ^l as input, and outputs the final predictions Γ:

Γ = {(s, r′, o) | (s, o) ∈ K_R, P_β((s, r′, o) | K, Γ^l) ≥ τ}.    (6)

In the Nell-Marie case of Fig. 2, when making the final prediction, its own candidate predictions and those of the neighbor entity pairs (solid edges in Γ^l of the lower part of Fig. 2) are used to leverage the dependency among them. We concatenate the embeddings of candidate predictions to the local representation t^l obtained in the first stage, and represent each entity pair as follows:

t^c = [t^l; A(Γ^l_{s,o}); A(Γ^l_{s,·}); A(Γ^l_{·,o})],    (7)

where c means collective. Γ^l_{s,o} includes candidate target relations between s and o, and similarly for Γ^l_{s,·} and Γ^l_{·,o}. Then we use another multi-layer perceptron MLP^c to convert t^c to probabilities and minimize a loss function for P_β similar to that of the local model P_θ in Eq. 4.
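The collective representation t^c described above could be sketched as follows (the embedding tables and relation names are hypothetical toy values):

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 8  # toy embedding size

# Separate (hypothetical) embedding tables for source and target relations.
SRC_EMB = {r: rng.normal(size=DIM) for r in ["parent", "gender"]}
TGT_EMB = {r: rng.normal(size=DIM) for r in ["father", "mother"]}

def avg(emb, rels):
    """A(.): average embeddings of a relation set (zeros if empty)."""
    return np.mean([emb[r] for r in rels], axis=0) if rels else np.zeros(DIM)

def collective_repr(t_l, cand_so, cand_sdot, cand_doto):
    """t^c: the local representation t^l concatenated with averaged
    embeddings of candidate target relations for (s, o) and its neighbors."""
    return np.concatenate([
        t_l,
        avg(TGT_EMB, cand_so),    # candidates between s and o
        avg(TGT_EMB, cand_sdot),  # candidates around s
        avg(TGT_EMB, cand_doto),  # candidates around o
    ])

t_l = np.concatenate([avg(SRC_EMB, {"parent"}), np.zeros(DIM), np.zeros(DIM)])
t_c = collective_repr(t_l, {"father"}, {"father"}, set())
```

The candidate predictions of neighbor pairs thus become ordinary input features, letting the second-stage model condition on them.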

Training Collective Model
According to Eq. 6, we need Γ^l as features to train the collective model P_β. This is to ensure that P_β captures the dependencies among target relations. One may ask why we do not directly use the ground truth K′ instead of the predictions Γ^l. At test time, we can only use target relations predicted by P_θ as input to P_β, because the ground-truth target relations of neighbor entity pairs might not be available. If we trained P_β on the ground truth, there would be a discrepancy between training and testing, potentially hurting performance.
Specifically, we split the training set into T folds. We generate Γ^l by rotating over the folds and taking the union of a temporary local model's predictions on each held-out fold, where the temporary model is trained on the other folds. Then we train P_β on the parallel data together with Γ^l. In this manner, we can use the full dataset to optimize the collective model while avoiding generating candidates on the training data of the local model, which would lead to overfitting. The detailed training procedure is given in Alg. 1.
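The fold rotation can be sketched as below; `train_fn` and `predict_fn` are placeholders for training and applying the local model:

```python
def cross_fold_candidates(pairs, n_folds, train_fn, predict_fn):
    """Sketch of the fold rotation: candidates for each training pair come
    from a temporary model trained on the *other* folds, so no pair receives
    predictions from a model that saw it during training."""
    folds = [pairs[i::n_folds] for i in range(n_folds)]
    candidates = {}
    for i, held_out in enumerate(folds):
        train_pairs = [p for j, fold in enumerate(folds) if j != i for p in fold]
        model = train_fn(train_pairs)  # temporary local model
        for pair in held_out:
            candidates[pair] = predict_fn(model, pair)
    return candidates

# Dummy model: "training" just memorizes its training pairs, and "prediction"
# reports whether the pair was seen in training (it never should be).
pairs = [("s%d" % i, "o%d" % i) for i in range(10)]
cands = cross_fold_candidates(pairs, 5, train_fn=set,
                              predict_fn=lambda m, p: p in m)
```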

Data Augmentation w/ Unmatched KG
As in Def. 2, the volume of parallel data is limited by the number of shared entity pairs K_R ∩ K′_{R′} of the two graphs. In Fig. 1, the unmatched part of the target KG, containing entity pairs without extraction counterparts (i.e., K′_{R′} \ K_R) and their target relations, can also indicate common substructures of the target KG and guide the training of the collective model. To this end, we propose leveraging the unmatched KG to generate pseudo parallel data that augment the limited training data.

Synthesizing Pseudo Extractions. To leverage the unmatched KG, we need to synthesize pseudo extractions for the target entities and relations to add to K as features. Since we do not use entity-specific parameters, we only synthesize source relations like "parent", and keep the target entities unchanged, as illustrated in Fig. 3. Specifically, for each subject-relation-object tuple (s′, r′, o′) in the unmatched KG, we keep s′ and o′ unchanged, and synthesize source relations r by sampling from:

P(r | r′) = |K_r ∩ K′_{r′}| / |K′_{r′}|,    (9)

i.e., the conditional probability of observing r given r′ based on co-occurrences in the parallel data, where |K_r ∩ K′_{r′}| is the number of entity pairs with both r and r′ in between, and |K′_{r′}| is the number of entity pairs with r′ in between. In this way, we obtain a pseudo extraction (s, r, o), as detailed in Alg. 2.

Pseudo Data Selection. We regard all pseudo extractions as a graph K_p. Similar to Def. 2, we define pseudo parallel data as below.
Definition 3 (Pseudo Parallel Data). Pseudo parallel data T_p include the common entity pairs between the pseudo extractions K_p and the target KG K′, associated with their ground-truth target relations, i.e.,

T_p = {(s, r′, o) ∈ K′ | (s, o) ∈ K_{p,R} ∩ K′_{R′}}.

To make use of pseudo parallel data T_p, the most straightforward way is to use them together with the parallel data T to train the collective model P_β. However, not all substructures in the target graph K′ are useful for P_β. For example, when K′ covers other domains irrelevant to the source extraction graph, substructures in those domains may distract P_β from concentrating on the domains of the source graph. To mitigate this issue, we only use a subset of T_p similar to T, as shown by the black-outlined parts in Fig. 3. Specifically, we represent each entity pair (s, o) as a virtual document with its surrounding target relations K′_{s,o} ∪ K′_{s,·} ∪ K′_{·,o} as "tokens". For each entity pair from the parallel data T, we use BM25 (Robertson and Zaragoza, 2009) to retrieve its top-K most similar entity pairs from T_p, and add them to the selected pseudo parallel data for training, as detailed in Alg. 2.

Algorithm 2 (sketch): starting from empty K_p and T_p, for each (s′, r′, o′) in the unmatched KG (i.e., (s′, o′) ∈ K′_{R′} \ K_R), set s ← s′ and o ← o′, sample r ∼ P(r | r′), and add (s, r, o) to K_p; then retrieve the selected subset of T_p and train the collective model β on T together with it.
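The two steps of Alg. 2 could be sketched as below; the co-occurrence table, entity names, and the token-overlap scorer (a simple stand-in for BM25) are all illustrative:

```python
import random
from collections import Counter, defaultdict

def build_conditional(parallel):
    """Estimate P(r | r') from co-occurrence counts in the parallel data.
    `parallel` maps an entity pair to (source relations, target relations)."""
    joint = defaultdict(Counter)  # joint[r'][r]: pairs with both r and r'
    marginal = Counter()          # marginal[r']: pairs with r'
    for src_rels, tgt_rels in parallel.values():
        for rp in tgt_rels:
            marginal[rp] += 1
            for r in src_rels:
                joint[rp][r] += 1
    return {rp: {r: c / marginal[rp] for r, c in cnt.items()}
            for rp, cnt in joint.items()}

def synthesize(unmatched, cond, rng):
    """For each unmatched target triple, keep the entities and sample a
    pseudo source relation r ~ P(r | r')."""
    pseudo = []
    for (s, rp, o) in unmatched:
        if rp in cond:
            dist = cond[rp]
            r = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
            pseudo.append((s, r, o))
    return pseudo

def select_similar(parallel_docs, pseudo_docs, top_k):
    """Stand-in for BM25 retrieval: rank pseudo pairs by token overlap with
    each parallel-data pair's virtual document, keep the top-k."""
    selected = set()
    for doc in parallel_docs.values():
        ranked = sorted(pseudo_docs, key=lambda p: len(doc & pseudo_docs[p]),
                        reverse=True)
        selected.update(ranked[:top_k])
    return selected

parallel = {("Malia", "Barack"): ({"parent"}, {"father"}),
            ("Malia", "Michelle"): ({"parent"}, {"mother"})}
cond = build_conditional(parallel)
pseudo = synthesize([("Nell", "father", "Burton")], cond, random.Random(0))
```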

Datasets and Evaluation
We use the ReVerb dataset (Fader et al., 2011) as the source graph, and Freebase and Wikidata as the target KGs, respectively. We follow the same name matching approach as Zhang et al. (2019) to obtain parallel data. To simulate real scenarios where models are trained on limited labeled data but applied to a large testing set, we use 20% of entity pairs in the parallel data for training and the other 80% for testing, with no overlap. We also compare the performance under other ratios in § 6.3. Dataset statistics are listed in Tab. 2.

We evaluate by ranking all integrated extractions based on their probabilities, and report area under the curve (AUC). Considering real scenarios where we want to integrate as many extractions as possible while keeping high precision, we also report recall and F1 when precision is 0.8, 0.9, or 0.95.
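The recall-at-fixed-precision metric can be computed by scanning a ranked prediction list, e.g. (toy numbers):

```python
def recall_at_precision(ranked_correct, min_precision):
    """Given predictions ranked by probability (1 = correct, 0 = wrong),
    return the best recall achieved while precision >= min_precision."""
    total_pos = sum(ranked_correct)
    tp = 0
    best_recall = 0.0
    for i, correct in enumerate(ranked_correct, start=1):
        tp += correct
        if tp / i >= min_precision:  # precision at cutoff i
            best_recall = max(best_recall, tp / total_pos)
    return best_recall

# Ranked list with four correct predictions out of five.
r80 = recall_at_precision([1, 1, 1, 0, 1], 0.8)
r90 = recall_at_precision([1, 1, 1, 0, 1], 0.9)
```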

Compared Methods
We compare the following methods in our experiments.

Relation Translation is a simple method that maps source relations to target relations with the conditional probability P(r′ | r), similar to Eq. 9. For an entity pair (s, o), the predicted target relations are {argmax_{r′} P(r′ | r) | r ∈ K_{s,o}}.

Universal Schema (E-model) (Riedel et al., 2013) learns entity and relation embeddings through matrix factorization. It is a local model that scores each integrated extraction independently and cannot generalize to unseen entities.

Rowless Universal Schema (Verga et al., 2017) is a local model that improves over the E-model by eliminating entity-specific parameters, thus generalizing to unseen entities.

OpenKI (Zhang et al., 2019) is a local model that addresses the ambiguity of source relations by using neighbor relations for more context.

CoRI is our collective two-stage relation integration model trained with Alg. 1.

CoRI + DA is our model with the training data augmented by pseudo parallel data via Alg. 2. To verify the necessity of retrieval-based pseudo data selection, we also compare with a random DA baseline that selects K random entity pairs.

CoRI + KGE is another approach to exploiting the unmatched KG, with KG embeddings (KGE) trained on the entire target KG in an unsupervised manner. We initialize the embeddings of target relations averaged by A(·) in Eq. 7 with TransE embeddings trained on the target graph.
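The Relation Translation baseline reduces to an argmax over a conditional-probability table, e.g. (probabilities are made up):

```python
def relation_translation(K_so, cond):
    """Map each source relation between (s, o) to its most probable target
    relation under P(r' | r)."""
    return {max(cond[r], key=cond[r].get) for r in K_so if r in cond}

# Hypothetical translation table P(r' | r).
cond = {"parent": {"father": 0.6, "mother": 0.4},
        "was born in": {"place_of_birth": 0.9, "containedby": 0.1}}
preds = relation_translation({"parent", "was born in"}, cond)
```

Because the mapping ignores all context around the entity pair, ambiguous relations like "parent" are always translated the same way, which is exactly the weakness the local and collective models target.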

Implementation Details
We uniformly use 32-dimensional embeddings for all relations, and the AdamW (Loshchilov and Hutter, 2019) optimizer with learning rate 0.01 and epsilon 10^-8. The ratio γ in Eq. 4 is set to 10. We sample at most 30 neighbor source relations to handle entity pairs with too many neighbor relations. We use T = 5 folds in Alg. 1 to train our collective model. We retrieve the top K = 5 entity pairs in pseudo data selection, adding about 20K and 12K entity pairs to the two datasets in Tab. 2, respectively. We use the BM25 (Robertson and Zaragoza, 2009) implementation in ElasticSearch for pseudo data selection, and the KGE released by OpenKE. Our model is trained with 32 CPU cores and a single 2080Ti GPU, and takes 1-2 hours to converge.

Experimental Results
We aim to answer the following questions: (1) Is CoRI superior to local models? (2) Is CoRI robust w.r.t. varying size of training and testing data? (3) Is unmatched KG useful for CoRI? Is our parallel data augmentation approach the best choice?

Main Results
In Tab. 3, we show results comparing all methods on both datasets. Our observations are as follows.
Collective inference is beneficial. Among the baselines, OpenKI generally performs best because it leverages neighbor relations besides middle relations between entity pairs, without relying on entity parameters. Even without data augmentation, CoRI outperforms OpenKI by a large margin, improving AUC from .677 to .708 and from .716 to .746 on the two datasets, respectively, which demonstrates the effectiveness of collective inference.

Data augmentation further improves the performance. By comparing CoRI with CoRI + DA (retrieval), we observe that data augmentation further improves AUC from .708 to .748 and from .746 to .780, respectively, which indicates that using the unmatched KG can effectively augment the training of the collective model. We plot the precision-recall curves of the best three approaches in Fig. 4, which demonstrates the superiority of our methods across the whole spectrum.

Generalization to unseen entities is necessary. Among the baselines, the E-model uses entity-specific parameters, hindering it from generalizing to unseen entities and making it less competitive.

Effectiveness of Pseudo Data Selection
As shown in Tab. 3, the KGE, random, and retrieval-based data augmentation approaches all perform better than CoRI (without DA), indicating the effectiveness of using the unmatched KG. Our retrieval-based DA outperforms the random counterpart, which confirms the superiority of similarity-based data augmentation in choosing substructures that cover domains relevant to the original parallel data. Our DA approach also outperforms KGE, demonstrating the necessity of selectively using the unmatched KG to avoid discrepancies with the parallel data.

Table 3: Main experimental results. The best results are in bold, and the best external baselines are underlined. CoRI outperforms the best baseline OpenKI by a large margin, and parallel data augmentation (DA) further improves its performance. "-" indicates that the precision was not achieved.

Different Numbers of Pseudo Data Entity Pairs.
In Fig. 5, we compare the performance of DA w.r.t. different numbers of retrieved entity pairs K. We observe that K=5 yields better performance than K=1. However, further increasing K hurts the performance, which is probably due to pseudo entity pairs with lower similarity to the parallel data causing a domain shift. This validates the necessity of selectively using pseudo parallel data.

Impacts of Data Size on CoRI
Due to its collective nature, one may wonder about CoRI's performance w.r.t. other training and testing data sizes. We analyze these factors in this section. Our observations are similar on both datasets, so we only report the results on ReVerb + Freebase.
Varying Size of Training Data. In Fig. 6a, we compare CoRI (without DA) with OpenKI by varying the portion of the parallel data used for training from 20% (used in our main results in Tab. 3) to 80%. We observe that using more training data improves the performance, as shown by the increasing trends w.r.t. all metrics. Our method outperforms OpenKI in all settings, demonstrating that our method is effective in both high- and low-resource settings.
Varying % of Accessible Neighbor Entity Pairs.
Our collective framework is distinguished by its collective inference stage, where the collective model refines the candidate prediction of an entity pair by considering its neighbor entity pairs' candidates. We hypothesize that the more neighbor entity pairs the collective model has access to, the better the performance it should achieve. For example, with a portion of 50%, candidate predictions for only half of the neighbor entity pairs, rather than the entire Γ^l, are used in Eq. 7. We vary the portion from 25% to 100% (the setting used in our main experiments in Tab. 3). As shown in Fig. 6b, even accessing 25% makes CoRI outperform OpenKI. As the percentage increases, CoRI continues to improve, while OpenKI remains the same because it is local, i.e., it does not use candidate predictions.

Case Study
In Fig. 7, we show two cases from ReVerb + Freebase where CoRI corrects the mistakes of OpenKI in the collective inference stage. In the first case, the source relation "is in" between "Iowa" and "Mahaska County" is extracted in the wrong direction. OpenKI straightforwardly predicts containedby based on the surface form, failing to leverage the neighbor relations to infer that Iowa is the larger geographical area. With the collective model, CoRI is able to use the other two candidate predictions of containedby to flip the wrong prediction to contains. In the second case, a prediction is needed between "Billy Joel" and "Columbia". Here the source relation "was in" and the object entity "Columbia" are both ambiguous: they can refer to geographical containment in a place or membership in a company. OpenKI makes no prediction due to the ambiguity, while CoRI makes the right prediction music label by collectively working on the other entity pairs, whose predictions coherently indicate that "Columbia" is a music company.

Figure 7: Two cases from ReVerb + Freebase with predictions in this font. The wrong predictions of OpenKI are corrected by our collective model.

Related Work
Relation integration has been studied by both the database (DB) and the NLP communities. The DB community formulates it as schema matching that aligns the schemas of two tables, e.g., matching columns of an "is in" table to those of another "subarea of" table (Rahm and Bernstein, 2001; Cafarella et al., 2008; Kimmig et al., 2017). In the NLP community, apart from rule-based methods (Soderland et al., 2013), most works leverage the link structure between entities and relations. Universal schema (Riedel et al., 2013) learns embeddings of entities and middle relations between entity pairs by decomposing their co-occurrence matrix. However, the entity embeddings prevent it from generalizing to unseen entities. Other methods (Toutanova et al., 2015; Verga et al., 2016, 2017; Gupta et al., 2019) also exploit middle relations, but eliminate entity parameters. Zhang et al. (2019) move one step further by explicitly considering neighbor relations, leveraging more context from the local link structure. Some works (e.g., Angeli et al., 2015) directly minimize the distance between embeddings of relations sharing the same entity pairs. Yu et al. (2017) further leverage compositional representations of entity names instead of free parameters to deal with unseen entities at test time.
There are also works on open IE canonicalization that cluster source relations. Some use entity pairs as clustering signals (Yates and Etzioni, 2009; Nakashole et al., 2012; Galárraga et al., 2014), while others use lexical features or side information (Vashishth et al., 2018). However, the resulting clusters are not aligned to relations in target KGs, which differs from our problem.
The two-stage collective inference framework has been explored in other problems like entity linking (Cucerzan, 2007; Guo et al., 2013; Shen et al., 2012), where candidate entities are generated for each mention independently and then collectively ranked based on their compatibility in the second stage. In machine translation, an effective approach to leveraging a monolingual corpus in the target language is to back-translate it into the source language to augment the limited parallel corpus (Sennrich et al., 2016). The above works inspired us to use collective inference for relation integration and to leverage the unmatched KG for data augmentation. Another way to perform collective inference is to solve the learning problem with constraints, such as integer linear programming (Roth and Yih, 2004), posterior regularization (Ganchev et al., 2010), and conditional random fields (Lafferty et al., 2001). Compared to our approach, these methods usually involve heavy computation or are hard to optimize; examining their performance is an interesting future direction. Besides, we also adopted the idea of selecting out-of-domain samples similar to in-domain samples (Xu et al., 2020; Du et al., 2020) to select our pseudo parallel data.

Conclusion
In this paper, we proposed CoRI, a collective inference approach to relation integration. To the best of our knowledge, this is the first work exploring this idea. We devised a two-stage framework, where the candidate generation stage employs existing local models to make candidate predictions, and the collective inference stage refines the candidate predictions by enforcing global coherence. Observing that the target KG is rich in substructures indicating the desired global coherence, we further proposed exploiting the unmatched KG by selectively synthesizing pseudo parallel data to augment the training of our collective model. Our solution significantly outperforms all baselines on two datasets, indicating the effectiveness of our approaches.