Element Intervention for Open Relation Extraction

Open relation extraction (OpenRE) aims to cluster relation instances that refer to the same underlying relation, which is a critical step for general relation extraction. Current OpenRE models are commonly trained on datasets generated by distant supervision, which often makes them unstable and prone to collapse. In this paper, we revisit the procedure of OpenRE from a causal view. By formulating OpenRE using a structural causal model, we identify that the above-mentioned problems stem from spurious correlations from entities and context to the relation type. To address this issue, we conduct Element Intervention, which intervenes on the context and the entities respectively to obtain their underlying causal effects. We also provide two specific implementations of the interventions based on entity ranking and context contrasting. Experimental results on unsupervised relation extraction datasets show that our method outperforms previous state-of-the-art methods and is robust across different datasets.


Introduction
Relation extraction (RE) is the task of extracting the relation between an entity pair in plain text. For example, given the entity pair (Obama, the United States) in the sentence "Obama was sworn in as the 44th president of the United States", an RE model should accurately predict the relation "President of" and extract the corresponding triplet (Obama, President of, the United States) for downstream tasks. Despite the success of many RE models (Zeng et al., 2014; Baldini Soares et al., 2019), most previous RE paradigms rely on predefined relation types, which are often unavailable in open-domain scenarios, and this limits their applicability in real applications.
Open Relation Extraction (OpenRE), on the other hand, has been proposed to extract relation facts without pre-defined relation types or annotated data. Given a relation instance consisting of two entities and their context, OpenRE aims to identify other instances that mention the same relation. To achieve this, OpenRE is commonly formulated as a clustering or pair-matching task. The most critical challenge for OpenRE is therefore how to learn effective representations for relation instances and then cluster them. To this end, Yao et al. (2011) adopt a topic model (Blei et al., 2003) to generate latent relation types for unlabelled instances. Later works utilize datasets collected using distant supervision for model training. Along this line, Marcheggiani and Titov (2016) utilize an auto-encoder model and train it through self-supervised signals from an entity link predictor. Hu et al. (2020) encode each instance with a pretrained language model (Devlin et al., 2019; Baldini Soares et al., 2019) and learn the representation through self-supervised signals from pseudo labels.
Unfortunately, current OpenRE models are often unstable and collapse easily (Simon et al., 2019).
For example, OpenRE models frequently cluster all relation instances with the context "was born in" into the relation type BORN IN PLACE because they share similar context information. However, "was born in" can also refer to the relation BORN IN TIME. Furthermore, current models also tend to cluster two relation instances with the same entities (i.e., relation instances with the same head and tail entities) or the same entity types into one relation. This problem can be even more severe if the dataset is generated using distant supervision, because distant supervision relies heavily on prototypical context and entity information as supervision signals and therefore lacks diversity.
In this paper, we attempt to explain and resolve the above-mentioned problem in OpenRE from a causal view. Specifically, we formulate the process of OpenRE using a structural causal model (SCM) (Pearl, 2009), as shown in Figure 1. The main assumption behind the SCM is that distant supervision generates relation instances that are highly correlated with the original prototypical instance, and that each generated instance is strongly connected to the prototypical instance through either its entities or its context. For example, "[Jobs] was born in [California]" and "[Jobs] was born in [1955]" are highly correlated because they share the similar context "was born in" and the entity "Jobs". Such connections result in spurious correlations, which appear in the form of backdoor paths in the SCM. These spurious correlations then mislead OpenRE models, which are trained to capture the connection from entities and context to the relation type.
Based on the above observations, we propose element intervention, which conducts backdoor adjustment on entities and context respectively to block the backdoor paths. However, due to the lack of supervision signals, we cannot directly optimize towards the underlying causal effects. To this end, we further propose two surrogate implementations of the adjustments on context and entities, respectively. Specifically, we regard the instances in the original datasets as the relation prototypes. Then we implement the adjustment on context through Hierarchy-Based Entity Ranking (Hyber), which fixes the context, samples related entities from an entity hierarchy tree and learns the causal relation through rank-based learning. In addition, we implement the adjustment on entities through Generation-based Context Contrasting (Gcc), which fixes the entities, generates positive and negative contexts from a generation-based model and learns the causal effects through contrastive learning.
We conduct experiments on different unsupervised relation extraction datasets. Experimental results show that our method outperforms previous state-of-the-art methods by a large margin and suffers much less performance discrepancy between different datasets, which demonstrates the effectiveness and robustness of the proposed methods.

OpenRE from Causal View
In this section, we formulate OpenRE from the perspective of a structural causal model and give the theoretical justification for intervention methods that block the backdoor paths from relation elements (i.e., context and entity pair) to the latent relation types.

Task Definition
Relation extraction (RE) is the task of extracting the relationship between two given entities in their context. Consider a sentence S = [s_0, ..., s_{n−1}] containing n words, where e_1 = [i, j] and e_2 = [k, l] indicate the spans of the entity pair, with 0 ≤ i ≤ j < k ≤ l ≤ n − 1. A relation instance X is defined as X = (S, e_1, e_2), i.e., the tuple of the entity pair and the corresponding context. The elements of a relation instance are the entity pair and the corresponding context. The traditional RE task is to predict the relation type given X. However, the target relation types are not pre-defined in OpenRE. Consequently, OpenRE is commonly formulated as a clustering task or a pair-matching task that considers whether two relation instances X_i and X_j refer to the same relation.
Unfortunately, current OpenRE models are often unstable and collapse easily (Simon et al., 2019). In the next section, we formulate OpenRE using a structural causal model and then identify the reasons behind these deficiencies from the SCM.

Structural Causal Model for OpenRE
Figure 1 (a) shows the structural causal model for OpenRE. The main idea behind the SCM is that distant supervision generates relation instances that are highly correlated with the original prototypical instance, and that each generated instance is strongly connected to the prototypical instance through either its entities or its context. Specifically, in the SCM, we describe OpenRE with five critical variables: 1) the prototypical relation instance P, which is a representative relation instance of one relation type cluster; 2) the entity pair E, which encodes the entity information of one relation instance; 3) the context C, which encodes the context information of one relation instance; 4) a relation instance X (which can be generated by distant supervision or other strategies); and 5) the final pair-wise matching result Y, which indicates whether instance X and the prototypical relation instance P entail the same relation. Given the variables mentioned above, we formulate the process of generating OpenRE instances based on the following causal relations:
• E ← P → C formulates the process of sampling related entities and context respectively from the prototypical relation instance P.
• E → X ← C formulates the relation instance generating process. Given the context C and entities E from the prototypical relation instance P, a new relation instance X is generated based on the information in C and E. This process can be conducted through distant supervision.
• P → Y ← X formulates the OpenRE clustering or pair-wise matching process. Given a prototypical relation instance P and another relation instance X, this process determines whether X belongs to the relation cluster of P.
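As an illustration of how the three causal relations above give rise to backdoor paths, the following minimal sketch encodes the SCM as a list of directed edges and enumerates the paths between a treatment variable and Y that start with an arrow pointing into the treatment. The function name `backdoor_paths` and the enumeration strategy are assumptions made for this example; the sketch does not check whether a path is blocked by a collider or a conditioning set.

```python
from collections import defaultdict

# Directed edges of the SCM as listed in the text:
# E <- P -> C, E -> X <- C, P -> Y <- X.
EDGES = [("P", "E"), ("P", "C"), ("E", "X"), ("C", "X"), ("P", "Y"), ("X", "Y")]


def backdoor_paths(edges, treatment, outcome):
    """Enumerate simple paths from treatment to outcome in the undirected
    skeleton whose first edge points INTO the treatment.

    Illustrative helper only: it lists candidate backdoor paths and does
    not test whether they are blocked.
    """
    parents = defaultdict(set)
    neighbours = defaultdict(set)
    for src, dst in edges:
        parents[dst].add(src)
        neighbours[src].add(dst)
        neighbours[dst].add(src)

    found = []

    def walk(path):
        node = path[-1]
        if node == outcome:
            found.append(tuple(path))
            return
        for nxt in sorted(neighbours[node]):
            if nxt not in path:
                walk(path + [nxt])

    # A backdoor path must leave the treatment through one of its parents.
    for parent in sorted(parents[treatment]):
        walk([treatment, parent])
    return found


if __name__ == "__main__":
    for path in backdoor_paths(EDGES, "C", "Y"):
        print(" - ".join(path))
```

For the context C, one of the enumerated paths runs through P, E and X, which is the route by which the entities confound the learned effect of the context.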

Spurious Correlations in OpenRE
Given a prototypical relation instance P, the learning process of OpenRE commonly maximizes the probability P(y, P|X) = P(y, P|E, C). However, as can be observed from the SCM, there exists a backdoor path P → E → X when we learn the underlying effect of context C. That is to say, the learned effect of C on Y is confounded by E (through P). For example, when we learn the effect of the context "was born in" on the relation "BORN IN PLACE", the backdoor path leads the model to mistake the contribution of the entities (PERSON, PLACE) for the contribution of the context, which results in a spurious correlation. The same thing happens when we learn the effect of entities E, which is influenced by the backdoor path P → C → X. As a result, optimizing over these spurious correlations results in an unstable and collapsed OpenRE model.

Resolving Spurious Correlations via Element Intervention
To resolve the spurious correlations, we adopt the backdoor adjustment (Pearl, 2009) to block the backdoor paths. Specifically, we separately intervene on context C and entities E by applying the do-operation.
Entity Intervention. As shown in Figure 1 (b), to avoid the spurious correlations of entities to relation types, we conduct the do-operation by intervening on the entities E: Since P(P) is uniformly distributed in the real world, this equation can be rewritten as: This equation means that the causal effect from the entities E to the matching result Y can be estimated by considering the corresponding probability of each context given the prototypical relation instance P.
The detailed implementation will be described in the next section.
Context Intervention. Similarly, we conduct context intervention to avoid the spurious correlations of context to relation types, as shown in Figure 1, which means that the causal effect from the context C to the matching result Y can be estimated by considering the corresponding probability of each entity pair E given P. The detailed implementation will also be described in the next section.
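The two adjustments can be written out from the SCM alone. The block below is a sketch derived from the truncated factorization of the graph (under do(E = e) the edge P → E is removed, and Y depends on E and C only through X); the notation and the numbering (1) to (3), chosen to agree with the later references to Equations (2) and (3), are assumptions and may differ from the authors' displayed equations.

```latex
% Sketch: backdoor adjustment on the SCM  E <- P -> C,  E -> X <- C,  P -> Y <- X.
% Lowercase e and c denote values of the entity pair E and the context C.
\begin{align}
P(Y, P \mid do(E = e)) &= \sum_{c} P(Y \mid e, c, P)\, P(c \mid P)\, P(P) \tag{1} \\
% With P(P) uniform, the prior is a constant factor:
P(Y, P \mid do(E = e)) &\propto \sum_{c} P(c \mid P)\, P(Y \mid e, c, P) \tag{2} \\
% Symmetric adjustment for the context:
P(Y, P \mid do(C = c)) &\propto \sum_{e} P(e \mid P)\, P(Y \mid e, c, P) \tag{3}
\end{align}
```

In both cases the element that is not intervened on is averaged out according to its distribution given the prototype, which is what the surrounding text describes in words.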

Optimizing Causal Effects for OpenRE
To effectively capture the causal effects of entities E and context C for OpenRE, a matching model P(Y|C, E, P; θ) should be learned by optimizing the causal effects: where e(X) and c(X) represent the entities and context in relation instance X, and I(X, P) is an indicator of whether X and P belong to the same relation. P(Y|C, E, P; θ) = P(Y|X, P; θ) is a matching model, which is defined using a prototype-based measurement: where D is a distance measurement and R(X; θ) is a representation learning model parametrized by θ, which needs to be optimized during learning. In the following, we use D(X, P) = D(R(X; θ), R(P; θ)) for short. However, it is difficult to directly optimize the above loss function because 1) in unsupervised OpenRE, we are unable to know whether the relation instance X generated from (E, C) matches the prototypical relation instance P; and 2) we are unable to traverse all possible E and C in Equations (2) and (3). To resolve these problems, in the next section, we describe how we implement the context intervention via hierarchy-based entity ranking and the entity intervention via generation-based context contrasting.
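To make the second difficulty concrete, the sum over all values of the free element can be approximated by averaging over a handful of sampled instances. The sketch below is illustrative only: the Euclidean distance, the mapping exp(-D) from distance to a matching score, and the function names are assumptions, not the matching model defined by the authors.

```python
import math


def distance(u, v):
    """Euclidean distance between two representation vectors (assumed choice of D)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def match_score(x_repr, p_repr):
    """Prototype-based matching score.

    ASSUMPTION: exp(-D) is used only to turn a distance into a score in
    (0, 1]; the functional form used in the paper is not shown in the text.
    """
    return math.exp(-distance(x_repr, p_repr))


def intervention_effect(sampled_reprs, p_repr):
    """Monte Carlo estimate of an adjusted effect.

    `sampled_reprs` holds representations R(X) of instances in which one
    element (entities or context) is fixed and the other is sampled given
    the prototype P; the sample mean stands in for the weighted sum.
    """
    scores = [match_score(r, p_repr) for r in sampled_reprs]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    prototype = [0.0, 0.0]
    samples = [[0.0, 0.0], [3.0, 4.0]]
    print(intervention_effect(samples, prototype))
```

The first difficulty, the missing match labels, is not addressed by sampling alone, which is why surrogate supervision signals are needed.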

Element Intervention Implementation
As mentioned above, it is difficult to directly optimize the causal effects via Equation (4). To tackle this issue, this section provides a detailed implementation that approximates the causal effects. Specifically, we regard all relation instances in the original data as prototypical relation instances P, and then generate highly correlated relation instances X from P via hierarchy-based sampling and generation-based contrasting. We then regard structural signals from the entity hierarchy and confidence scores from the generator as distant supervision signals, and learn the causal effects via ranking-based learning and contrastive learning.

Hierarchy-based Entity Ranking for Context Intervention
To implement context intervention, we propose to formulate P(E|P) using an entity hierarchy, and approximately learn to optimize the causal effects of P(Y = 1, P|do(C)) and P(Y = 0, P|do(C)) in Equation (4) via a hierarchy-based entity ranking loss. Specifically, we first regard all relation instances in the data as prototypical relation instances P. Then we formulate the distribution P(E|P) by fixing the context in P and replacing entities by sampling from an entity hierarchy. Each sampled entity is regarded as having the same P(E|P). Intuitively, an entity closer to the original entities in P tends to generate a relation instance more consistent with P. To approximate this semantic similarity, we utilize the meta-information in Wikidata (i.e., the "instance of" and "subclass of" statements, which describe the basic property and concept of each entity), and construct a hierarchical entity tree for ranking the similarity between entities. In this work, we apply a three-level hierarchy through these two statements:
• Sibling Entities: The entities belonging to the same parent category as the original entity. For example, "Aube" and "Paris" are sibling entities since they are both child entities of "department of France", and both express the concepts of location and GPE. These sibling entities can be considered as golden entities for replacement.
• Cousin Entities: The entities belonging to the same grandparent category as the original entity but to a different parent category. For example, "Occitanie" and "Paris" have the same grandparent category "French Administrative Division", but different parent categories. These entities can be considered as silver entities since they are likely to be of the same type as the original one, but less likely than the sibling entities.
• Other Entities: The entities beyond the grandparent category, which are much less likely to be the same type as the original one.
For the example in Figure 2, the prototypical relation instance "Hugo was born in [Paris], [France]" is sampled to be intervened on. We first fix the context and randomly choose either the head or the tail entity to be replaced. In this case, we choose "Paris". Then, entities that correspond to different hierarchy levels are sampled to replace the original entity. In this case, "Aube" is sampled as the sibling entity, "Occitanie" as the cousin entity and "19th century" as the other entity.
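The three hierarchy levels can be sketched with a toy child-to-parent map. Only the Paris, Aube and Occitanie relations below come from the example in the text; the remaining category names, the `PARENT` dictionary and the function names are hypothetical stand-ins for the Wikidata "instance of" and "subclass of" statements.

```python
# Toy child -> parent category map. Entries marked "hypothetical" are
# invented for illustration and are not taken from Wikidata or the paper.
PARENT = {
    "Paris": "department of France",
    "Aube": "department of France",
    "Occitanie": "region of France",  # hypothetical parent category
    "department of France": "French Administrative Division",
    "region of France": "French Administrative Division",  # hypothetical
    "19th century": "century",  # hypothetical
    "century": "time period",  # hypothetical
}


def hierarchy_level(candidate, original, parent=PARENT):
    """Return 1 for a sibling, 2 for a cousin and 3 for any other entity."""
    cand_parent = parent.get(candidate)
    orig_parent = parent.get(original)
    if cand_parent is not None and cand_parent == orig_parent:
        return 1  # same parent category
    cand_grand = parent.get(cand_parent)
    orig_grand = parent.get(orig_parent)
    if cand_grand is not None and cand_grand == orig_grand:
        return 2  # same grandparent, different parent
    return 3  # beyond the grandparent category


def group_by_level(candidates, original, parent=PARENT):
    """Bucket replacement candidates by hierarchy level."""
    groups = {1: [], 2: [], 3: []}
    for cand in candidates:
        if cand != original:
            groups[hierarchy_level(cand, original, parent)].append(cand)
    return groups


if __name__ == "__main__":
    print(group_by_level(["Aube", "Occitanie", "19th century"], "Paris"))
```

One replacement entity would then be drawn from each bucket while the context of the prototypical instance is kept fixed.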
After sampling these intervened instances, we approximately optimize P(Y, P|do(C)) using a rank-based loss function: where θ denotes the model parameters, D(X_i, P) is the distance between the representations of the generated relation instance X_i and the prototypical relation instance P, 𝒳 is the intervened relation instance set, m_E is the margin for the entity ranking loss, and n = 3 is the depth of the entity hierarchy.
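A rank-based objective over the three hierarchy levels can be sketched as a margin ranking loss. The exact form below, a hinge over every ordered pair of levels, is an assumption made for illustration; the text only fixes the ingredients (the distances D(X_i, P), the margin m_E and the depth n = 3).

```python
def entity_ranking_loss(distances_by_level, margin):
    """Margin ranking loss over hierarchy levels.

    distances_by_level[i] is D(X_i, P) for the instance whose replaced
    entity comes from level i + 1 (index 0 = sibling, 1 = cousin,
    2 = other). An instance from a closer level should be nearer to the
    prototype than an instance from a farther level by at least `margin`.

    ASSUMPTION: all ordered pairs of levels are compared; comparing only
    adjacent levels would be an equally plausible reading.
    """
    loss = 0.0
    n = len(distances_by_level)
    for i in range(n - 1):
        for j in range(i + 1, n):
            loss += max(0.0, margin + distances_by_level[i] - distances_by_level[j])
    return loss


if __name__ == "__main__":
    # Correctly ordered distances incur no loss.
    print(entity_ranking_loss([0.1, 0.5, 2.0], margin=0.2))
    # A sibling that is farther than the cousin is penalized.
    print(entity_ranking_loss([1.0, 0.5, 2.0], margin=0.2))
```

A loss of this shape needs no relation labels: the ordering of the hierarchy levels is the only supervision signal.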

Generation-based Context Contrasting for Entity Intervention
Unlike context intervention, where entities can easily be replaced, it is more difficult to intervene on entities and modify the context. Fortunately, the rapid progress in pre-trained language models (Radford et al., 2019; Lewis et al., 2020; Raffel et al., 2020) makes language generation from RDF data feasible (Ribeiro et al., 2020). So in this work, we adopt a different paradigm named Generation-based Context Contrasting, which directly generates different relation instances from specifically designed relation triplets, and approximately learns to optimize the causal effects of P(Y = 1, P|do(E)) and P(Y = 0, P|do(E)) in Equation (4) via contrastive learning. Specifically, we first sample relation triplets from Wikidata as prototypical relation instances P, and then generate relation triplets with the same entities but different relation context using the following strategies:
• Relation Renaming, which keeps the same entity pair as the original triplet but uses an alias of the relation name, so that the generated sentence expresses the relation differently. This instance is considered a positive sample for the prototypical relation instance.
• Context Expansion, which extends the original relation instance with an additional triplet. The added triplet shares the head or tail entity with the original instance but differs in the relation and in the other entity. This variant aims to add irrelevant context, which forces the model to focus on the important part of the context, and is also considered a positive sample for the prototypical relation instance.
• Relation Replacing, which keeps the same entity pair as the original triplet but uses another relation between these two entities. This variant aims to avoid spurious correlations in which the relation is extracted based only on the entity pair, and is considered a negative sample for the prototypical relation instance.
Then we use the generator to generate texts based on these triplets. Specifically, we first wrap the triplets with the special markers "[H]", "[T]" and "[R]", which correspond to the head entity, the tail entity and the relation name. Then we input the concatenated texts for relation instance generation. In our implementation, we use T5 (Raffel et al., 2020; Ribeiro et al., 2020) as the base generator, and pre-train the generator on WebNLG data (Gardent et al., 2017). After sampling these intervened instances, we approximately optimize P(Y, P|do(E)) using the following contrastive loss function: where θ denotes the model parameters, 𝒳 is the intervened instance set, 𝒫 is the positive instance set generated from relation renaming and context expansion, 𝒩 is the negative instance set generated from relation replacing, P is the original prototypical relation instance, and m_C is the margin.
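The construction of the generator inputs can be sketched as follows. The marker order within a triplet, the helper names `linearize` and `build_variants`, and the example alias, extra triplet and replacement relation are all assumptions made for illustration; the text specifies only the three strategies and the "[H]", "[T]" and "[R]" markers.

```python
def linearize(triplets):
    """Wrap (head, relation, tail) triplets with [H]/[R]/[T] markers and
    concatenate them into one generator input string.

    ASSUMPTION: the head-relation-tail marker order is illustrative.
    """
    return " ".join(f"[H] {h} [R] {r} [T] {t}" for h, r, t in triplets)


def build_variants(triplet, alias, extra_triplet, other_relation):
    """Return the three intervened triplet sets with their labels.

    Each value is (list_of_triplets, is_positive), where is_positive says
    whether the variant should match the prototypical instance.
    """
    head, _relation, tail = triplet
    return {
        # Same entity pair, alias of the relation name: positive.
        "relation_renaming": ([(head, alias, tail)], True),
        # Original triplet plus an extra triplet sharing an entity: positive.
        "context_expansion": ([triplet, extra_triplet], True),
        # Same entity pair, different relation: negative.
        "relation_replacing": ([(head, other_relation, tail)], False),
    }


if __name__ == "__main__":
    variants = build_variants(
        ("Victor Hugo", "place of birth", "Paris"),
        alias="birthplace",  # hypothetical alias
        extra_triplet=("Victor Hugo", "occupation", "writer"),  # hypothetical
        other_relation="place of death",  # hypothetical replacement
    )
    for name, (triplets, is_positive) in variants.items():
        print(name, is_positive, linearize(triplets))
```

Each linearized string would be passed to the generator, and the resulting sentence inherits the positive or negative label of its variant.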

Surrogate Loss for Optimizing Causal Effects
Based on entity ranking and context contrasting, we approximate the causal effects optimized in Equation (4) with the following ranking and contrastive loss: which involves both the entity ranking loss and the context contrastive loss. During inference, we first encode each instance into its representation using the learned model. Then we apply a clustering algorithm to cluster the relation representations, and the relation for each instance is predicted through the clustering results.
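The contrastive term and its combination with the ranking term can be sketched as below. Both the hinge form of the contrastive loss and the unweighted sum are assumptions made for illustration; the text states only that positives come from relation renaming and context expansion, negatives from relation replacing, and that the final objective involves both losses.

```python
def context_contrastive_loss(pos_distances, neg_distances, margin):
    """Margin-based contrastive loss around the prototype P.

    pos_distances: D(X, P) for generated positive instances.
    neg_distances: D(X, P) for generated negative instances.

    ASSUMPTION: positives are pulled toward the prototype by penalizing
    their distance, and negatives are pushed at least `margin` away.
    """
    pull = sum(pos_distances)
    push = sum(max(0.0, margin - d) for d in neg_distances)
    return pull + push


def surrogate_loss(ranking_loss, contrastive_loss):
    """Combine the two surrogate terms (ASSUMPTION: plain unweighted sum)."""
    return ranking_loss + contrastive_loss


if __name__ == "__main__":
    gcc = context_contrastive_loss([0.1, 0.2], [0.5], margin=1.0)
    print(surrogate_loss(ranking_loss=0.7, contrastive_loss=gcc))
```

Here `ranking_loss` stands for the value of the entity ranking loss computed on the same prototype.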

Dataset
We conduct experiments on two OpenRE datasets, T-REx SPO and T-REx DS. These datasets come from the same data source and differ only in how they are constructed, which makes them well suited for evaluating the stability of OpenRE methods. Both are derived from T-REx (Elsahar et al., 2018; https://hadyelsahar.github.io/t-rex/), a dataset consisting of Wikipedia sentences that are distantly aligned with Wikidata relation triplets. The aligned sentences are further collected as T-REx SPO and T-REx DS according to whether they have surface-form relations or not. As a result, T-REx SPO contains 763,000 sentences of 615 relations, and T-REx DS contains nearly 12 million sentences of 1189 relations. For both datasets, we use 20% for validation and the remainder for model training, following Hu et al. (2020).

Baseline and Evaluation Metrics
Baseline Methods. We compare our model with the following baselines: 1) rel-LDA (Yao et al., 2011), a generative model that treats unsupervised relation extraction as a topic model. We choose the full rel-LDA with a total of 8 features for comparison in our experiments. 2) March (Marcheggiani and Titov, 2016), a VAE-based model learned through the self-supervised signal of an entity link predictor. 3) UIE (Simon et al., 2019), a discriminative model that adopts additional regularization to guide model learning. It has different versions depending on the choice of relation encoding model (e.g., PCNN). We report the results of the two versions with the highest performance, UIE-PCNN and UIE-BERT (i.e., using PCNN and BERT as the relation encoding models). 4) SelfORE (Hu et al., 2020), a self-supervised framework that bootstraps to learn a contextual relation representation through adaptive clustering and pseudo labels.
Evaluation Metrics. We adopt three commonly-used metrics to evaluate different methods: B³ (Bagga and Baldwin, 1998), V-measure (Rosenberg and Hirschberg, 2007) and Adjusted Rand Index (ARI) (Hubert and Arabie, 1985). Specifically, B³ contains precision and recall metrics, which respectively measure the rate of correctly putting each sentence in its cluster and of clustering all samples into a single class, and which are defined as follows: Then B³ F1 is computed as the harmonic mean of the precision and recall. Similar to B³, V-measure uses homogeneity and completeness metrics, but focuses more on small impurities in a relatively "pure" cluster than in a less "pure" cluster:
V Homo. = 1 − H(c(X)|g(X)) / H(c(X))
V Comp. = 1 − H(g(X)|c(X)) / H(g(X))
ARI is a normalization of the Rand Index, which measures the degree of agreement between the clustering and the gold distribution. This metric ranges in [-1, 1], and a more accurate clustering gets a higher score. Different from the previous metrics, ARI is less sensitive to precision/homogeneity and recall/completeness.
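For reference, B³ precision, recall and F1 can be computed directly from predicted cluster ids and gold relation labels. The sketch below follows the standard per-item formulation of B³ (Bagga and Baldwin, 1998); the function name `b_cubed` and the toy labels are chosen for this example and are not taken from the paper's evaluation code.

```python
from collections import Counter


def b_cubed(pred, gold):
    """Return (precision, recall, F1) of the B-cubed clustering metric.

    pred: predicted cluster id per instance.
    gold: gold relation label per instance.
    Per-item precision is the share of the item's predicted cluster that
    has the item's gold label; per-item recall is the share of the item's
    gold class that landed in the item's predicted cluster.
    """
    n = len(pred)
    cluster_size = Counter(pred)
    class_size = Counter(gold)
    joint = Counter(zip(pred, gold))

    precision = sum(joint[(p, g)] / cluster_size[p] for p, g in zip(pred, gold)) / n
    recall = sum(joint[(p, g)] / class_size[g] for p, g in zip(pred, gold)) / n
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    predicted = [0, 0, 1, 1]
    gold = ["born_in", "born_in", "born_in", "capital_of"]
    print(b_cubed(predicted, gold))
    # Putting everything into one cluster maximizes recall, not precision.
    print(b_cubed([0, 0, 0, 0], gold))
```

The second call illustrates why recall alone rewards a collapsed clustering, which is the failure mode the F1 score guards against.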

Hyperparameters and Implementation Details
In the training period, we manually search the learning rate in [5e-6, 1e-5, 5e-5] and find 1e-5 to be optimal, search the weight decay in [1e-6, 3e-6, 5e-5] and choose 3e-6, and use the other hyperparameters without search: a dropout rate of 0.6, a batch size of 32, and a linear learning schedule with a 0.85 decay rate per 1000 mini-batches. In the evaluation period, we simply adopt the pre-trained models for representation extraction, and then cluster the evaluation instances based on these representations. For clustering, we follow previous work (Simon et al., 2019; Hu et al., 2020) and set K=10 as the number of clusters.

Overall Results
1. Our method outperforms previous OpenRE models and achieves new state-of-the-art performance. Compared with all baseline models, our method achieves significant performance improvements: on T-REx SPO, our method improves the SOTA B³ F1 and V-measure F1 by at least 3.9%, and ARI by 2.9%; on T-REx DS, the improvements are more evident, where the SOTA B³ F1 and V-measure F1 are improved by at least 10.0%, and ARI is improved by 4.9%.
Metrics   Both   Seen   Unseen
BLEU      60.9   65.9   54.9
chrF++    76.0   79.2   72.5
Table 3: Quantitative performance of our generator on WebNLG. Seen stands for generating from seen relation triplets, Unseen stands for generating from unseen relation triplets, and Both stands for a combination of seen and unseen relation triplets.
contrast, our methods have marginal performance differences, which indicates both the effectiveness and robustness of our methods.

Detailed Analysis
In this section, we conduct several experiments for detailed analysis of our method.
Ablation Study. To study the effect of the different intervention modules, we conduct an ablation study by removing one module at a time. The other settings remain the same as in the main model. From Table 1, we can see that, on both T-REx SPO and T-REx DS, combining the two modules results in a noticeable performance gain, which demonstrates that both modules are important to the final model performance and that they are complementary in alleviating unnecessary co-dependencies: Hyber aims to alleviate the spurious correlations between the context and the final relation prediction, and Gcc aims to alleviate the spurious correlations between the entity pair and the final relation prediction. Besides, on T-REx DS, either Hyber or Gcc alone is effective enough to outperform previous SOTA methods, which indicates that element intervention clearly debiases the representation with respect to either the entity pair or the context.
Entity Ranking on Generated Texts. This experiment studies the effect of different data sources for the Hyber module. As shown in Table 2, Hyber based on the T-REx SPO dataset and Hyber based on the generated texts differ only marginally. This means that Hyber is robust to the source of the context. It also suggests that the quality of the generated texts is adequate for this task.
Quality of Context Generation (unseen relations). This experiment gives a quantitative analysis of the generator used in our work. We select WebNLG (Gardent et al., 2017) to test the generator, and adopt the widely-used metrics BLEU (Papineni et al., 2002) and chrF++ (Popović, 2017) for evaluation. As shown in Table 3, our generator is quite effective on seen relations. Though the generator suffers a performance drop on unseen relations, the scores are still acceptable. Combined with the results from the other experiments, this suggests that the generator is sufficient for this task.
Visualization of Relation Representations. In this experiment, we visualize the representations of the validation instances. We sample 10 relations from the T-REx SPO validation set, with 200 instances per relation, for visualization. To reduce the dimension, we use t-SNE (van der Maaten and Hinton, 2008) to map each representation to 2 dimensions. For convenience of comparison, we color each instance with its ground-truth relation label. Since the visualization results of Hyber only or Gcc only are marginally different from those of the full model, we only visualize the full model. As shown in Figure 3, each relation is mostly separate from the others. However, some instances are still misclassified due to overlap in the representation space.

Related Work
The current success of supervised relation extraction methods (Bunescu and Mooney, 2005; Qian et al., 2008; Zeng et al., 2014; Zhou et al., 2016; Veličković et al., 2018) depends heavily on large amounts of annotated data. Due to this data bottleneck, some weakly-supervised methods have been proposed to learn relation extraction models from distantly labeled datasets (Mintz et al., 2009; Hoffmann et al., 2011; Lin et al., 2016) or few-shot datasets (Han et al., 2018; Baldini Soares et al., 2019; Peng et al., 2020). However, these paradigms still require pre-defined relation types, which restricts their application to open scenarios. Open relation extraction, on the other hand, aims to cluster relation instances referring to the same underlying relation without pre-defined relation types. Previous methods for OpenRE can be roughly divided into two categories. The generative method (Yao et al., 2011) formulates OpenRE using a topic model, and the latent relations are generated based on hand-crafted feature representations of entities and context. The discriminative method was first proposed by Marcheggiani and Titov (2016), who learn the model through the self-supervised signal from an entity link predictor. Along this line, Hu et al. (2020) propose SelfORE, which learns the model through pseudo labels and bootstrapping. However, Simon et al. (2019) point out that previous OpenRE methods suffer severely from instability, and they propose two regularizers to guide the learning procedure. But the fundamental cause of the instability has remained undiscovered.
In this paper, we revisit the procedure of OpenRE from a causal view. By formulating OpenRE using a structural causal model, we identify the cause of the above-mentioned problems, and alleviate them with Element Intervention. Some recent studies also try to introduce causal theory to explain the spurious correlations in neural models (Feng et al., 2018; Gururangan et al., 2018; Tang et al., 2020; Qi et al., 2020; Zeng et al., 2020; Wu et al., 2020; Qin et al., 2020; Fu et al., 2020). However, to the best of our knowledge, this is the first work to revisit OpenRE from the perspective of causality.

Conclusions
In this paper, we revisit OpenRE from the perspective of causal theory. We find that the strong connections between a generated instance and the prototypical instance, through either their entities or their context, result in spurious correlations, which appear in the form of backdoor paths in the SCM. These spurious correlations then mislead OpenRE models. Based on these observations, we propose Element Intervention to block the backdoor paths, which intervenes on the context and the entities respectively to obtain their underlying causal effects. We also provide two specific implementations of the interventions based on entity ranking and context contrasting. Experimental results on two OpenRE datasets show that our methods outperform previous methods by a large margin and suffer the least performance discrepancy between datasets, which indicates both the effectiveness and the stability of our methods.