A Relation-Oriented Clustering Method for Open Relation Extraction

The clustering-based unsupervised relation discovery method has gradually become one of the important approaches to open relation extraction (OpenRE). However, high-dimensional vectors can encode complex linguistic information, which leads to the problem that the derived clusters cannot explicitly align with relational semantic classes. In this work, we propose a relation-oriented clustering model and use it to identify novel relations in unlabeled data. Specifically, to enable the model to learn to cluster relational data, our method leverages the readily available labeled data of pre-defined relations to learn a relation-oriented representation. We minimize the distance between instances of the same relation by gathering the instances towards their corresponding relation centroids to form a cluster structure, so that the learned representation is cluster-friendly. To reduce the clustering bias on pre-defined classes, we optimize the model by minimizing a joint objective on both labeled and unlabeled data. Experimental results show that our method reduces the error rate by 29.2% and 15.7% on two datasets respectively, compared with current SOTA methods.


Introduction
Relation extraction (RE), a crucial basic task in the field of information extraction, is of the utmost practical interest to various fields including web search (Xiong et al., 2017), knowledge base completion (Bordes et al., 2013), and question answering (Yu et al., 2017). However, conventional RE paradigms such as supervised and distantly supervised RE are generally designed for pre-defined relations, and cannot deal with new emerging relations in the real world.
* Corresponding authors.

Figure 1: Although both instances S2 and S3 express the founded relation while S1 expresses the CEO relation, the distance between S1 and S2 is still smaller than that between S2 and S3. This is because there may be more similar surface information (e.g., word overlapping) or syntactic structure between S1 and S2, thus the derived clusters cannot explicitly align with relations.

Under this background, open relation extraction (OpenRE) has been widely studied for its use
in extracting new emerging relational types from open-domain corpora. The approaches used to handle open relations roughly fall into one of two groups. The first group is open information extraction (OpenIE) (Etzioni et al., 2008; Yates et al., 2007; Fader et al., 2011), which directly extracts relational phrases as representations of different relational types. However, if not properly canonicalized, the extracted relational facts can be redundant and ambiguous. The second group is unsupervised relation discovery (Yao et al., 2011; Shinyama and Sekine, 2006; Simon et al., 2019). In this line of research, much attention has been focused on unsupervised clustering-based RE methods, which cluster and recognize relations from high-dimensional representations (Elsahar et al., 2017). Recently, the self-supervised signals in pretrained language models have been further exploited for clustering optimization (Hu et al., 2020).
However, many studies show that high-dimensional embeddings can encode complex linguistic information such as morphological (Peters et al., 2018), local syntactic (Hewitt and Manning, 2019), and longer-range semantic information (Jawahar et al., 2019). Consequently, distances between representations are not completely consistent with relational semantic similarity. Although Hu et al. (2020) use self-supervised signals to optimize clustering, there is still no guarantee that the learned clusters will explicitly align with the desired relational semantic classes (Xing et al., 2002). As shown in Figure 1, we use the method proposed by Hu et al. (2020) to get the instance representations. Although both instances S2 and S3 express the founded relation, the Euclidean distance between them is larger than that between S1 and S2, which express different relations. Obviously, the clustering algorithm tends to group instances S1 and S2 together, rather than S2 and S3, which express the same relation.
In this work, we propose a relation-oriented clustering method. To enable the model to learn to cluster relational data, pre-defined relations and their existing labeled instances are leveraged to optimize a non-linear mapping, which transforms high-dimensional entity pair representations into relation-oriented representations. Specifically, we minimize the distance between instances of the same relation by gathering the instance representations towards their corresponding relation centroids to form a cluster structure, so that the learned representation is cluster-friendly. To reduce the clustering bias on the pre-defined classes, we iteratively train the entity pair representations by optimizing a joint objective function on the labeled and unlabeled subsets of the data, improving both the supervised classification of the labeled data and the clustering of the unlabeled data. In addition, the proposed method can be easily extended to incremental learning by classifying the pre-defined and novel relations with a unified classifier, which is often desirable in real-world applications. Our experimental results show that our method outperforms current state-of-the-art methods for OpenRE. Our code is publicly available on GitHub*.
To summarize, the main contributions of our work are as follows: (1) we propose a novel relation-oriented clustering method, RoCORE, to enable the model to learn to cluster relational data; (2) the proposed method achieves incremental learning of unlabeled novel relations, which is often desirable in real-world applications; (3) experimental results show that our method reduces the error rate by 29.2% and 15.7% on two real-world datasets respectively, compared with current state-of-the-art OpenRE methods.

* https://github.com/Ac-Zyx/RoCORE

Related Work
Open Relation Extraction. To meet the need of extracting new emerging relation types, many efforts have been made to explore methods for open relation extraction (OpenRE). The first line of research is Open Information Extraction (Etzioni et al., 2008; Yates et al., 2007; Fader et al., 2011), in which relation phrases are extracted directly to represent different relation types. However, using surface forms to represent relations results in a lack of generality, since many surface forms can express the same relation. Recently, unsupervised clustering-based RE methods have been attracting much attention. Elsahar et al. (2017) proposed to extract and cluster open relations by re-weighting word embeddings and using the types of named entities as additional features. Hu et al. (2020) proposed to exploit weak, self-supervised signals in pretrained language models for adaptive clustering on contextualized relational features. However, the self-supervised signals are sensitive to the initial representation (Gansbeke et al., 2020) and there is still no guarantee that the learned clusters will align with the relational semantic classes (Xing et al., 2002). Wu et al. (2019) proposed to learn relation similarity metrics from labeled data and then transfer the relational knowledge to identify novel relations in unlabeled data. Different from them, we propose a relation-oriented method that explicitly clusters data based on relational information.

Knowledge in High-Dimensional Vectors. Pretrained static and contextual word representations can provide valuable prior knowledge for constructing relational representations (Soares et al., 2019; Elsahar et al., 2017). Peters et al. (2018) showed that different neural architectures (e.g., LSTM, CNN, and Transformers) can hierarchically structure linguistic information that varies with network depth.
Recently, many studies (Jawahar et al., 2019; Clark et al., 2019; Goldberg, 2019) have shown that such hierarchy also exists in pretrained models like BERT. These results suggest that high-dimensional embeddings, independent of model architecture, learn much about the structure of language. Directly clustering these high-dimensional embeddings can hardly produce ideal clusters in our desired way, which motivates us to extend current unsupervised clustering-based RE methods to learn representations tailored for clustering relational data.

Figure 2: Overview of our RoCORE method. In the first step, we encode both the labeled and unlabeled instances into entity pair representations. Then the entity pair representations are transformed into relation-oriented representations by gathering towards their relational centroids in the second step. Finally, based on the pseudo labels generated by clustering on unlabeled data, we optimize the entity pair representations and the classifier by minimizing a joint objective function to reduce the clustering bias on pre-defined classes. The above three steps are performed iteratively to gradually improve model performance on novel relations.

Approach
In this work, we propose a relation-oriented clustering method, which takes advantage of the relational information in existing labeled data to enable the model to learn to cluster relational data. To reduce the clustering bias on the pre-defined classes, we iteratively train the entity pair representations by optimizing a joint objective function on the labeled and unlabeled subsets of the data, improving both the supervised classification of the labeled data and the clustering of the unlabeled data. The proposed method is shown in Figure 2.
Specifically, given an unlabeled dataset D_u = {s^u_i}, i = 1, ..., M, of relational instances s^u_i, our goal is to automatically cluster the relational instances into a number of classes C_u, which we assume to be known a priori. To enable the model to learn to cluster data, we incorporate a second, labeled dataset of pre-defined relations D_l = {(s_i, y_i)}, i = 1, ..., N, where y_i ∈ {1, ..., C_l} is the relation label for instance s_i.
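As a concrete picture of this setup, a toy instantiation follows; all sentences, counts, and labels below are invented for illustration and are not drawn from FewRel or TACRED:

```python
# Hypothetical toy setup mirroring the paper's notation: a labeled set D_l of
# pre-defined relations (instances s_i with labels y_i) and an unlabeled set
# D_u whose C_u novel-relation classes must be discovered by clustering.
C_l, C_u = 3, 2  # numbers of pre-defined / novel relation classes (assumed)
D_l = [("[E1] Bezos [/E1] founded [E2] Amazon [/E2]", 0),
       ("[E1] Jobs [/E1] led [E2] Apple [/E2]", 1),
       ("[E1] Paris [/E1] is in [E2] France [/E2]", 2)]
D_u = ["[E1] Gates [/E1] founded [E2] Microsoft [/E2]",
       "[E1] Berlin [/E1] is in [E2] Germany [/E2]"]
N, M = len(D_l), len(D_u)  # labeled / unlabeled dataset sizes
```

The entity markers around head and tail entities are one common convention for flagging the entity pair positions that the encoder later reads out.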

Method Overview
We approach the problem by learning a relation-oriented representation, from which the derived clusters can be explicitly aligned with the desired relational semantic classes. As illustrated in Figure 2, we learn the representation and optimize the model by performing three iterative steps: (1) First, we encode the relation instances in D_l and D_u using the entity pair encoder, implemented as pretrained BERT (Devlin et al., 2018), which takes the relation instances {s_i}, i = 1, ..., N, and {s^u_j}, j = 1, ..., M, as input, and outputs relation representations h_i and h^u_j. However, the high-dimensional h can encode a mixture of various aspects of linguistic features, and the clusters derived from h cannot explicitly align with the desired relational classes.
(2) To make the distance between representations accurately reflect relational semantic similarity, the obtained h_i are transformed to low-dimensional relation-oriented representations z_i by a non-linear mapping g. Under the supervision of the labels y_i in D_l, g is optimized by gathering the z_i towards their relational centroids to form a cluster structure; we then obtain z^u_j from the unlabeled data using the optimized g and generate pseudo labels ŷ^u by clustering on the z^u_j. (3) Because using labeled data to guide the representations towards their relational centroids produces a clustering bias towards pre-defined relations, it is difficult to directly generate high-quality pseudo labels. To reduce the negative effect of errors in the pseudo labels, we optimize the classifier and the entity pair representations by minimizing a joint objective function, containing terms for both pre-defined and novel relations, using respectively the given labels y and the generated pseudo labels ŷ^u. Based on the refined entity pair representations h, which encode more contextual relational information, the above three steps are performed iteratively to gradually improve the quality of the pseudo labels ŷ^u and model performance.

Entity Pair Encoder
Given a relation instance marking the position of the entity pair, the entity pair encoder f aims to map a relation instance s_i to a fixed-length embedding h_i = f(s_i) ∈ R^d that encodes the contextual information in s_i. We adopt BERT (Devlin et al., 2018) as the implementation of our encoder f due to its strong performance on extracting contextual information. Formally,

h_i = h^r_{s_1} ⊕ h^r_{e_1} ⊕ h^r_{s_2} ⊕ h^r_{e_2},

where r is a hyperparameter that denotes the output layer of BERT, h^r_p is the layer-r hidden state at token position p, s and e represent the start and end positions of the corresponding entity respectively, and ⊕ denotes the concatenation operator. This structure of entity pair representation encoder has been widely used in previous RE methods (Wang et al., 2021; Hu et al., 2020).
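The readout described above, concatenating layer-r hidden states at the entity boundary positions, can be sketched as follows. This is a minimal illustration: `h_layer` is a random stand-in for real BERT outputs, and the exact composition of positions is an assumption consistent with the description above:

```python
import numpy as np

def entity_pair_repr(hidden_states, head_span, tail_span):
    """Concatenate the layer-r hidden states at the start/end token positions
    of the head and tail entities (a sketch of the encoder readout; the exact
    composition used in the paper may differ)."""
    (s1, e1), (s2, e2) = head_span, tail_span
    return np.concatenate([hidden_states[s1], hidden_states[e1],
                           hidden_states[s2], hidden_states[e2]])

# Toy "BERT layer-r output": 10 tokens, hidden size 8 (placeholder values).
h_layer = np.random.default_rng(0).normal(size=(10, 8))
h = entity_pair_repr(h_layer, head_span=(1, 2), tail_span=(5, 7))
# h is a single fixed-length entity pair representation of size 4 * 8 = 32.
```

In practice the hidden states would come from a BERT forward pass with hidden states exposed, and the spans from the entity markers in the input.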

Relation-Oriented Clustering Module
In order to make the distance between representations accurately reflect relational semantic similarity, the obtained h_i are transformed to low-dimensional relation-oriented representations z_i = g(h_i) by a non-linear mapping g. Under the supervision of the labels y_i in D_l, g is optimized by gathering the z_i towards their relational centroids as follows:

L_center = (1/N) Σ_{i=1,...,N} ||g(h_i) − c_{y_i}||²₂,

where c_r denotes the centroid of relation r. The center loss L_center seems reasonable, but is problematic: a globally optimal solution minimizing L_center is g(h_i) = 0, which is far from desired. This motivates us to incorporate a reconstruction term to prevent the semantic space from collapsing. Specifically, a decoding network d(·) is used to map the representation z_i back to the original representation h_i. Thus, we can derive the following loss function:

L = (1/N) Σ_{i=1,...,N} [ ℓ(d(g(h_i)), h_i) + λ ||g(h_i) − c_{y_i}||²₂ ],

where both the encoder g(·) and decoder d(·) are implemented as DNNs. The function ℓ(·, ·): R^d × R^d → R is the least-squares loss ℓ(x, y) = ||x − y||²₂ that measures the reconstruction error; other choices such as the ℓ1-norm can also be considered. λ is a hyperparameter that balances the reconstruction error versus the center loss.
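A minimal numpy sketch of the combined reconstruction-plus-center objective, assuming a per-instance sum of the squared reconstruction error and λ times the squared distance to the class centroid; `g` and `d` are placeholder identity maps here, not the paper's DNNs:

```python
import numpy as np

def relation_oriented_loss(h, y, g, d, centroids, lam):
    """Mean over instances of reconstruction error plus lam * center loss.
    The reconstruction term keeps the low-dimensional space from collapsing;
    the center term pulls same-relation instances toward their centroid."""
    z = g(h)                                  # relation-oriented representations
    rec = np.sum((d(z) - h) ** 2, axis=1)     # least-squares reconstruction
    ctr = np.sum((z - centroids[y]) ** 2, axis=1)  # distance to own centroid
    return float(np.mean(rec + lam * ctr))

rng = np.random.default_rng(1)
h = rng.normal(size=(6, 4))                   # toy entity pair representations
y = np.array([0, 0, 0, 1, 1, 1])              # two pre-defined relations
g = d = lambda x: x                           # identity stand-ins for the DNNs
centroids = np.stack([h[y == r].mean(0) for r in range(2)])
loss = relation_oriented_loss(h, y, g, d, centroids, lam=0.005)
```

With identity maps the reconstruction term vanishes and the loss reduces to λ times the within-class scatter, which is the quantity the center term shrinks during training.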
Finally, we obtain {z^u_j}, j = 1, ..., M, using the optimized g and generate the pseudo labels ŷ^u using the k-means algorithm as follows:

ŷ^u = KMeans({z^u_j}_{j=1,...,M}, C_u).
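This pseudo-labeling step can be illustrated with a bare-bones Lloyd's k-means. This is a sketch only; a library implementation with k-means++ initialization would normally be used, and the simple first-k initialization below is purely for determinism:

```python
import numpy as np

def kmeans_pseudo_labels(z, k, iters=20):
    """Plain Lloyd's k-means on relation-oriented representations z^u,
    returning cluster assignments used as pseudo labels."""
    centers = z[:k].astype(float)             # naive deterministic init
    for _ in range(iters):
        # squared Euclidean distance of every point to every center
        dists = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)              # assign to nearest center
        for j in range(k):                    # recompute non-empty centers
            if np.any(labels == j):
                centers[j] = z[labels == j].mean(0)
    return labels

# Two well-separated toy groups of low-dimensional representations.
z = np.array([[0., 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
pseudo = kmeans_pseudo_labels(z, k=2)
```

On this toy input the assignments recover the two groups, which is exactly the role the pseudo labels ŷ^u play for the unlabeled novel relations.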

Relation Classification Module
Based on the pseudo labels ŷ^u generated by clustering, we can train the classifier and refine the entity pair representations h to encode more contextual relational information. Since it is difficult to keep the order of clusters consistent across multiple clusterings, instead of using the standard cross-entropy loss, we propose to use pairwise similarities for novel relation learning.
q_ij = 1[ŷ^u_i = ŷ^u_j],

where the symbol q_ij denotes whether s^u_i and s^u_j belong to the same cluster. If a pair is from the same cluster, the classifier η^u: R^d → R^{C_u} should output similar distributions, and vice versa. Specifically, we use the pairwise KL-divergence to evaluate the distance between two relation instances. Given a pair of instances s^u_i, s^u_j, their corresponding output distributions are defined as P = η^u(f(s^u_i)) and Q = η^u(f(s^u_j)). For a pair from the same cluster, the cost is described as:

L⁺(s^u_i, s^u_j) = D_KL(P*||Q) + D_KL(Q*||P),

where P* denotes that P is assumed to be a constant, so that each KL-divergence factor D_KL(P*||Q) is a unary function whose gradient is simply ∂D_KL(P*||Q)/∂Q. If s^u_i, s^u_j come from different clusters, their output distributions are expected to be different, which can be defined as a hinge loss:

L⁻(s^u_i, s^u_j) = L_h(D_KL(P*||Q), σ) + L_h(D_KL(Q*||P), σ), where L_h(e, σ) = max(0, σ − e),

and the total loss can be defined as a contrastive loss:

L_BCE = q_ij L⁺(s^u_i, s^u_j) + (1 − q_ij) L⁻(s^u_i, s^u_j).

Note that L_BCE is a symmetric loss w.r.t. s^u_i, s^u_j since P and Q are alternately assumed to be constant in L⁺ and L⁻. Finally, we get the prediction for a relation instance s^u_i as follows:

ŷ^u_i = argmax_c η^u(f(s^u_i))_c.

Training Methods
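Before turning to the training procedure, the pairwise objective of the classification module can be sketched numerically. This is a hedged illustration: the symmetric-KL form of the positive cost and the hinge form of the negative cost with margin σ are assumptions consistent with the description above, not a verbatim transcription of the paper's equations:

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions (natural log)."""
    return float(np.sum(p * np.log(p / q)))

def pairwise_bce_loss(P, Q, same_cluster, sigma=2.0):
    """Contrastive pairwise cost: for a same-cluster pair, minimize the
    symmetric KL between the two predicted distributions; for a different-
    cluster pair, a hinge pushes each KL above the margin sigma."""
    l_pos = kl(P, Q) + kl(Q, P)   # P and Q alternately treated as constants
    l_neg = max(0.0, sigma - kl(P, Q)) + max(0.0, sigma - kl(Q, P))
    return l_pos if same_cluster else l_neg

P = np.array([0.7, 0.2, 0.1])    # toy output distributions over C_u clusters
Q = np.array([0.6, 0.3, 0.1])
same = pairwise_bce_loss(P, Q, same_cluster=True)
diff = pairwise_bce_loss(P, Q, same_cluster=False)
```

A same-cluster pair with identical distributions incurs zero cost, while mismatched distributions are penalized, mirroring how q_ij gates L⁺ versus L⁻.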

Iterative Joint Training
Because using labeled data to guide the representations h towards their relational centroids produces a clustering bias towards pre-defined relations, it is difficult to directly generate high-quality pseudo labels ŷ^u for novel relations. To reduce the negative effect of errors in the pseudo labels, we incorporate a classifier η^l: R^d → R^{C_l} for pre-defined relations and refine h by minimizing a joint objective function, containing terms for both pre-defined and novel relations, using respectively the given labels y and the generated pseudo labels ŷ^u:

L = L_CE + L_BCE,

where L_CE is the cross-entropy loss of η^l on the labeled data. The refined entity pair representations h encode more contextual relational information, which in turn promotes clustering optimization and generates pseudo labels ŷ^u with higher accuracy. We refine the representations h and optimize the clustering in an iterative manner to gradually improve the quality of the pseudo labels and model performance. This iterative procedure is detailed in Algorithm 1.
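The alternating procedure of Algorithm 1 can be sketched as a plain loop; the three callables below are stubs standing in for the encoder, the clustering step, and the joint-objective update (the interfaces are illustrative, not the paper's API):

```python
def iterative_joint_training(encode, cluster, train_step, epochs):
    """High-level sketch of the iterative procedure: each epoch alternates
    (1) encoding, (2) re-clustering to refresh pseudo labels, and
    (3) a joint supervised + pseudo-label training step."""
    history = []
    for _ in range(epochs):
        h_l, h_u = encode()                    # step 1: representations
        pseudo = cluster(h_u)                  # step 2: refresh pseudo labels
        history.append(train_step(h_l, h_u, pseudo))  # step 3: joint objective
    return history

# Toy stubs standing in for the real modules, purely to show the control flow.
losses = iterative_joint_training(
    encode=lambda: ([0.1, 0.2], [0.3]),
    cluster=lambda h_u: [0 for _ in h_u],
    train_step=lambda h_l, h_u, p: 0.5,
    epochs=3)
```

The key design choice the loop encodes is that pseudo labels are regenerated every epoch from the refined representations, rather than fixed once.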

Incremental Learning Scheme
In real-world settings, when facing a new sentence, we often do not know whether it expresses a pre-defined relation or a novel one. In this work, we explore the incremental learning of novel relations to enable η^l to discriminate both pre-defined and novel relations. Under the incremental learning setting, we extend the classifier η^l by C_u novel relation types, so that η^l: R^d → R^{C_l + C_u}. Then, the model is trained using a cross-entropy loss instead of Equation 15:

L = (1/N) Σ_{i=1,...,N} L_CE(η^l(f(s_i)), y_i) + μ(t) · (1/M) Σ_{j=1,...,M} L_CE(η^l(f(s^u_j)), ŷ^u_j),

where we obtain ŷ^u_j using Equation 14 and the coefficient μ(t) balances the cross-entropy losses of the pre-defined and novel relations. We implement μ(t) as a ramp-up function, where t is the current epoch, T is the ramp-up length, and μ_0 ∈ R⁺ is its maximum value.
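The text only states that μ(t) is a ramp-up function of the epoch t with length T and maximum μ_0; one commonly used schedule, shown here purely as an illustrative choice, is the Gaussian ramp-up:

```python
import math

def ramp_up(t, T=10, mu0=1.0):
    """Gaussian ramp-up schedule: starts near 0, reaches mu0 at epoch T.
    An illustrative choice only; the paper does not specify the exact form."""
    if t >= T:
        return mu0
    return mu0 * math.exp(-5.0 * (1.0 - t / T) ** 2)
```

Starting μ(t) near zero lets the early epochs rely on the trustworthy labeled data while the pseudo labels are still noisy, then gradually gives the novel-relation term full weight.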

Experimental Setup
In this section, we describe the datasets for training and evaluating the proposed method. We also detail the baseline models for comparison.
Finally, we clarify the implementation details and hyperparameter configuration of our method.

Datasets
We conduct experiments on two relation extraction datasets.
FewRel. The Few-Shot Relation Classification Dataset (Han et al., 2018). FewRel is a human-annotated dataset containing 80 types of relations, each with 700 instances. We follow the setting in (Wu et al., 2019), using the original train set of FewRel, which contains 64 relations, as the labeled set with pre-defined relations, and the original validation set of FewRel, which contains 16 new relations, as the unlabeled set with novel relations to extract. 1,600 instances were randomly selected from the unlabeled set as the test set. The rest of the labeled and unlabeled instances are considered the train set.

TACRED. The TAC Relation Extraction Dataset (Zhang et al., 2017). TACRED is a human-annotated large-scale relation extraction dataset that covers 41 relation types. We remove the instances labeled as no_relation and use the remaining 21,773 instances for training and evaluation. Similar to the setting for FewRel, we select relation types 0-30 as the labeled set with pre-defined relations and relation types 31-40 as the unlabeled set with novel relations. We randomly selected 15% of the instances from the unlabeled set as the test set. The rest of the labeled and unlabeled instances are considered the train set.

Compared Methods
To evaluate the effectiveness of our method, we select the following SOTA OpenRE models for comparison. Note that the first four methods are unsupervised, while RSN and RSN-BERT leverage labeled data of pre-defined relations.

HAC with Re-weighted Word Embeddings (RW-HAC) (Elsahar et al., 2017). RW-HAC is a feature clustering method for OpenRE. The model constructs relational features based on weighted word embeddings as well as entity types.

Table 1: Best hyperparameter settings. Pre-training epochs: 10; BCE loss coefficient σ: 2; center loss coefficient λ: 0.005 for FewRel and 0.001 for TACRED; ramp-up coefficient μ_0: 1.0; ramp-up length T: 10.

Implementation Details
Our entity pair encoder is implemented as bert-base-uncased, which consists of 12 layers; we use layer 8 as the output layer for best performance. Note that we only fine-tune the parameters of the output layer in the iterative training process to avoid overfitting. The non-linear mappings g(·) and d(·) are both implemented as DNNs with ReLU activations, specifically R^d-512-512-256 for g(·) and 256-512-512-R^d for d(·). All experiments are conducted on a GeForce GTX 1080Ti with 11GB memory, and Table 1 shows our best hyperparameter settings.
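The stated R^d-512-512-256 architecture of g(·) can be sketched as a plain ReLU MLP forward pass; the weights below are random placeholders and the input dimension d is an assumption (four concatenated 768-dim BERT states):

```python
import numpy as np

def mlp(x, weights):
    """Forward pass of a plain ReLU MLP, sketching the g(.) mapping with the
    stated d-512-512-256 layer sizes (no activation on the output layer)."""
    for i, W in enumerate(weights):
        x = x @ W
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers only
    return x

rng = np.random.default_rng(0)
d = 768 * 4                          # assumed entity pair representation size
sizes = [d, 512, 512, 256]           # the g(.) architecture from the text
weights = [rng.normal(scale=0.01, size=(a, b))
           for a, b in zip(sizes, sizes[1:])]
out = mlp(rng.normal(size=(2, d)), weights)   # two toy inputs -> 256-dim z
```

The decoder d(·) simply mirrors these sizes in reverse (256-512-512-R^d), mapping the relation-oriented representation back for the reconstruction term.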

Results and Analysis
In this section, we present the experimental results of our model on two real-world datasets to demonstrate the effectiveness of our method. We also provide additional experimental results on hyperparameter analysis and relation representation visualization in Appendices A and B. Table 2 reports model performance on the FewRel and TACRED datasets, showing that the proposed method achieves state-of-the-art results on the OpenRE task. Benefiting from the valuable information in the labeled instances of pre-defined relations, RoCORE effectively learns a relation-oriented representation from which the derived clusters explicitly align with relational semantic classes, thereby outperforming previous clustering-based baselines such as SelfORE by a large margin. In addition, although RSN and its variant RSN-BERT also leverage relational information in labeled data, their learning of similarity metrics and clustering are mutually independent. In our method, relation representation learning and cluster optimization are mutually dependent, so the learned representations are tailored for clustering. As a result, our method outperforms RSN and RSN-BERT on both datasets.
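The reported improvements are relative error-rate reductions, which can be computed as follows; the accuracies in the example are illustrative, not the paper's actual scores:

```python
def relative_error_reduction(acc_new, acc_old):
    """Relative error-rate reduction, the metric behind the reported
    29.2% / 15.7% improvements: (old_error - new_error) / old_error."""
    old_err, new_err = 1.0 - acc_old, 1.0 - acc_new
    return (old_err - new_err) / old_err

# Illustrative numbers: going from 80% to 90% accuracy halves the error rate.
r = relative_error_reduction(acc_new=0.90, acc_old=0.80)
```

Framing gains as error reduction rather than absolute accuracy makes improvements comparable across datasets with different baseline accuracies.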

Ablation Study
To study the contribution of each component in the proposed method, we conduct ablation experiments on the two datasets and display the results in Table 3. The results show that model performance degrades if L_center is removed, indicating that the guidance of supervision signals from pre-defined relations provides valuable information for learning relation-oriented representations. It is worth noting that the reconstruction term plays an important role in the clustering module. Without the reconstruction term, the semantic space collapses and performance is seriously hurt.
In addition, jointly optimizing on both the labeled and unlabeled data is also very important. The initial pseudo labels for novel relations are not accurate due to the unwanted clustering bias towards pre-defined relations. Without L_CE, the errors in the pseudo labels lead the refinement of the entity pair representations in a wrong direction, which hurts model performance.

The Influence of Pre-defined Relation Number on Performance
In this subsection, we conduct experiments on the two datasets to explore the influence of the number of pre-defined relations on the performance of our method. For the FewRel dataset, following the setting in (Wu et al., 2019), we change the number of pre-defined relations from 40 to 64 while fixing the total number of labeled instances at 25,000. Similarly, the settings for the TACRED dataset are 18, 31, and 12,000, respectively. From Figure 3 we can see the following: (1) Increasing the number of pre-defined relations does improve the generalization of our method on novel relations. The models trained on 64/31 relations perform slightly better than the models trained on 40/18 relations on the FewRel/TACRED dataset. (2) Our method consistently performs better than RSN and RSN-BERT as the number of pre-defined relations varies. This indicates the effectiveness of our method.

Cross Domain Analysis
In real-world settings, pre-defined relations and the novel relations of interest usually come from different domains. To study model performance in cross-domain settings, we conducted experiments on two cross-domain tasks, i.e., FewRel to TACRED and TACRED to FewRel. Pre-defined relations and their labeled instances come from the source-domain training dataset, and we evaluate performance on the target-domain test dataset. Table 4 shows the experimental results, from which we can observe that: (1) the change of domain increases the semantic gap between the pre-defined and novel relations. As a result, the performance of models using labeled data of pre-defined relations is degraded.
(2) Compared with RSN and RSN-BERT, our method shows better generalization performance on novel relations, which indicates that our proposed iterative joint training method effectively reduces the unwanted bias towards source-domain labeled data. (3) In addition, when a model has a tendency to cluster multiple relations into one, unbalanced precision and recall are produced (i.e., the high Rec. and low Prec. of RSN-BERT), which is undesirable in real-world applications.

Incremental Learning of Novel Relations
In this subsection, we evaluate the effectiveness of our incremental learning scheme and explore the influence of the amount of labeled data on model performance. We use BERT with a linear softmax classifier as the baseline for comparison.
We train the baseline model using the labeled data of both pre-defined and novel relations, following the supervised learning paradigm. For our method, we still use only the labels of the pre-defined relations. From Figure 4 we can observe the following: (1) The performance of the models improves gradually as the amount of labeled data increases. Our method still maintains good performance when labeled data are scarce, indicating that the proposed method is robust to the reduction of labeled data.
(2) Our method achieves performance similar to the supervised baseline in two experiments, which use 40% of the labels of novel relations on the FewRel dataset and 82% on TACRED, respectively. This indicates that we successfully achieve incremental learning of novel relations.

Conclusions
In this work, we introduce a relation-oriented clustering method that extends current unsupervised clustering-based OpenRE methods. The proposed method leverages the labeled data of pre-defined relations to learn a relation-oriented representation from which the derived clusters explicitly align with relational classes. The iterative joint training method effectively reduces the unwanted bias towards labeled data. In addition, the proposed method can be easily extended to incremental learning of novel relations. Experimental results show that our method outperforms SOTA methods for OpenRE.

A Hyperparameter Analysis
From the experimental results of the ablation study, it can be seen that the reconstruction loss and center loss have a great impact on the performance of the model. λ is a key hyperparameter that balances the reconstruction loss versus the center loss. In this section, we conduct experiments to study the influence of the value of λ on model performance. From Figure 5 we can see that: (1) As λ gradually increases from 0, the center loss begins to affect the optimization. The model learns that instances with the same relation should be mapped to relatively close positions in the representation space, and the performance of the model gradually improves.
(2) When λ exceeds a certain threshold, further increasing it leads to an unwanted bias towards the pre-defined relations, which degrades the performance of the model.

B Relation Representation Visualization
To intuitively show how the RoCORE method learns the constantly optimized relation-oriented representation, we visualize the relational representations with t-SNE (van der Maaten and Hinton, 2008). The visualization results are shown in Figure 6. It is apparent that, before training (left), the relational representations are distributed randomly at different locations in the semantic space. After pre-training (middle), the relational representations are still not tailored for the relations. For example, the instances with blue and light green colors may have similar syntactic or surface features, and clustering them directly would lead to a poor result. After training (right), the relational representations are well separated and the distribution is based on relation types.

Figure 6: Visualization of the relation representations after t-SNE dimension reduction. The representations are colored with their ground-truth relation labels. The three panels sequentially illustrate the feature representations at the initial state, after reconstruction pre-training, and after training. All figures visualize the clustering result for 600 instances of 6 randomly selected novel relations on the FewRel test set.

C Detailed Results of Other Experiments
In this section, the detailed results of ablation experiments and cross domain analysis are listed in Table 5 and Table 6 respectively.
