Open-world Semi-supervised Generalized Relation Discovery Aligned in a Real-world Setting

Open-world Relation Extraction (OpenRE) has recently garnered significant attention. However, existing approaches tend to oversimplify the problem by assuming that all unlabeled texts belong to novel classes, thereby limiting the practicality of these methods. We argue that the OpenRE setting should be more aligned with the characteristics of real-world data. Specifically, we propose two key improvements: (a) unlabeled data should encompass known and novel classes, including hard-negative instances; and (b) the set of novel classes should represent long-tail relation types. Furthermore, we observe that popular relations such as titles and locations can often be implicitly inferred through specific patterns, while long-tail relations tend to be explicitly expressed in sentences. Motivated by these insights, we present a novel method called KNoRD (Known and Novel Relation Discovery), which effectively classifies explicitly and implicitly expressed relations from known and novel classes within unlabeled data. Experimental evaluations on several Open-world RE benchmarks demonstrate that KNoRD consistently outperforms other existing methods, achieving significant performance gains.


Introduction
Relation extraction (RE) is a fundamental task in natural language processing (NLP), aiming to extract fact triples in the format ⟨head entity, relation, tail entity⟩ from textual data. Open-world RE (OpenRE) is a related research area that focuses on discovering novel relation classes from unlabeled data. Recent advancements in OpenRE have demonstrated impressive results by integrating prompting techniques with advanced clustering methods (Zhao et al., 2021; Li et al., 2022b; Wang et al., 2022a). However, current OpenRE methods face limitations due to assumptions about unlabeled data that do not align with the characteristics of real-world datasets. These assumptions include:

(1) the presumption that unlabeled data solely consists of novel classes or is pre-divided into sets of known and novel instances; (2) the absence of negative instances; (3) the random division of known and novel classes in a dataset; and (4) the availability of the ground-truth number of novel classes in unlabeled data.
In this work, we critically examine these assumptions and align the task of OpenRE with a real-world setting. We dispose of simplifying assumptions in favor of new assumptions that align with characteristics of real-world unlabeled data in hopes of increasing the practicality of these methods. We call our setting Generalized Relation Discovery and make the following claims: (a) Unlabeled data includes known, novel, and negative instances: Unlabeled data, by definition, lacks labels; we cannot assume it only consists of novel classes or is pre-divided into sets of known and novel instances. Our challenge is to accurately classify known classes and discover novel classes within unlabeled data. Additionally, many sentences with an entity pair do not express a relationship (e.g., negative instances, or the no relation class) (Zhang et al., 2017a). Neglecting negative instances in training leads to models with a positive bias, reducing their effectiveness in identifying relationships in real-world data. Hence, we opt to include negative instances in our setting. (b) Novel classes are typically rare and belong to the long-tail distribution: To define known and novel classes, we base our selection process on the intuition that known classes are more likely to be common, frequently appearing relationships. In contrast, unknown, novel classes are more likely to be rare (i.e., long-tail) relationships. Instead of randomly choosing the set of novel classes, we construct data splits based on class frequency. Although it is possible for frequently appearing classes to be unknown, we deliberately select rare classes for our novel classes to create a more challenging setting. Lastly, without labels, it is impossible to know a priori the ground-truth number of novel classes contained within unlabeled data; we do not assume we can access this information in our setting.
Our experimental results show that our proposed setting makes for a difficult task, ripe for advancements and future work.
State-of-the-art approaches in relation discovery leverage a prompt-learning method to predict relation class names, which are then embedded into a latent space and grouped via clustering (Zhao et al., 2021; Li et al., 2022b; Wang et al., 2022a). We propose a prompt method with two prediction settings: (1) constrained, where predictions are limited to words found within an inputted instance; and (2) unconstrained, where the model can predict any word within its vocabulary. Constrained predictions optimize for explicitly expressed relationships while unconstrained predictions optimize for implicitly expressed relationships. This prompt method forms the backbone of our proposed method, Known and Novel Relation Discovery (KNoRD), which can effectively classify explicitly and implicitly expressed relationships of known and novel classes from unlabeled data.
Another key aspect of our method is that it clusters labeled and unlabeled data within the same feature space. Each labeled instance serves as a "vote" for a cluster belonging to the set of known classes. We effectively bifurcate clusters into sets of known and novel classes by employing a majority-vote strategy. Novel-identified clusters are then utilized as weak labels, in combination with gold labels, to train a model via cross-entropy (see Figure 1). This methodology presents an innovative approach to relation discovery in open-world scenarios, offering potential applications across various NLP domains.
The main contributions of this work are:
• We critically examine the assumptions made in OpenRE and carefully craft a new setting, Generalized Relation Discovery, that aligns with characteristics of real-world data.
• We propose an innovative method to classify known and novel classes from unlabeled data.
• We illustrate the effectiveness of modeling implicit and explicit relations via prompting.

Related Work

Open-world RE: Open-world RE seeks to discover new relation classes from unlabeled data. Instances of relations are typically embedded into a latent space and then clustered via K-Means. These works often simplify the task by assuming all instances of unlabeled data belong to the set of novel classes and that unlabeled data contains no negative instances (Elsahar et al., 2017; Hu et al., 2020; Zhang et al., 2021). More recently, some OpenRE methods have been proposed to predict known and novel classes from unlabeled data (Zhao et al., 2021; Wang et al., 2022a; Li et al., 2022b). However, these works assume unlabeled data comes pre-divided into sets of known and novel instances, that negative instances are removed, and that the number of novel classes is known.
Open-world semi-supervised learning: The setting we propose for relation discovery is inspired by open-world semi-supervised learning (Open SSL) proposed by Cao et al. (2022), where the authors use a margin loss objective with adaptive uncertainty to predict known and novel classes from a set of unlabeled images. Besides the difference in domains, their setting differs from ours in that they assume unlabeled data has an equal number of known and novel instances, an assumption we cannot make when working with relation extraction datasets, which often have imbalanced, long-tail class distributions (Zhang et al., 2017b; Stoica et al., 2021; Yao et al., 2019; Amin et al., 2022). Furthermore, the approach of Cao et al. (2022) does not transfer to our task, where a prevalence of sentences do not express a relationship (Zhang et al., 2017b).
The setting proposed in Li et al. (2022c) is similar to ours; however, negative instances are removed, and their known/novel class splits are done randomly instead of by class frequency. Furthermore, their method relies on active learning, where human annotators annotate instances of novel classes. The annotations are then used to train a classifier. In contrast, our model and the baselines we evaluate do not require human annotation for novel classes. Like Cao et al. (2022), we assume there is no distribution shift between the labeled and unlabeled data (i.e., known relation classes found in labeled data also occur in the unlabeled data).
Task Formulation

We formulate our task in the following way: given labeled data D_l, whose instances all belong to the set of known classes, and unlabeled data D_u, a model must produce

Y = { y_j | x_j ∈ D_u },  with y_j ∈ {c_1, …, c_m} ∪ {c′_1, …, c′_n},

where x_j are unlabeled instances (e.g., sentences), y and y′ denote known and novel predictions, c and c′ denote known and novel classes, respectively, and Y is the set of all predictions for instances in D_u. Each sentence x_j contains an entity pair (a head entity e1 and a tail entity e2), and the predicted relationship y_j links the two entities, producing a fact triplet ⟨e1, y_j, e2⟩.

Prompt-based Training
We leverage the instances of labeled data to train a language model to predict the linking relationship between an entity pair found in a sentence via prompting. Specifically, given a sentence x_j, we construct a prompt template "⟨e1⟩ [MASK] ⟨e2⟩", where e1 and e2 are the two entities in x_j. The template is appended to x_j to obtain a contextualized relation instance. Then, a masked language model (e.g., BERT (Devlin et al., 2019)) learns to predict the masked tokens between the two entities. To alleviate the model overfitting on the relation token vocabulary, we randomly mask 15% of the tokens in x_j, and the model is jointly trained to predict the masked tokens in sentences and the masked relation names.
During inference, we feed the contextualized relation instance with only the relation tokens masked to the model and predict the masked tokens. Top-ranked tokens predicted for [MASK] by the model are used to create semantically-aligned representations in the subsequent stage.
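As a rough illustration, the prompt construction and joint masking described above can be sketched as follows; the exact template shape and the helper names `build_prompt` and `mask_random_tokens` are hypothetical stand-ins, not the paper's implementation:

```python
import random

def build_prompt(sentence, e1, e2, n_mask=3):
    """Append a cloze template '<e1> [MASK]... <e2>' to the sentence.
    The number of mask slots is an assumption for illustration."""
    template = f"{e1} {' '.join(['[MASK]'] * n_mask)} {e2}"
    return f"{sentence} {template}"

def mask_random_tokens(tokens, rate=0.15, seed=0):
    """Randomly replace ~15% of sentence tokens with [MASK] so the model
    is jointly trained on sentence tokens and relation-name tokens."""
    rng = random.Random(seed)
    return [("[MASK]" if rng.random() < rate else t) for t in tokens]
```

At inference time, only the template's mask slots would be masked, and the model's top-ranked fillers feed the next stage.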

Semantically-aligned Representations
Leveraging our observation that relationships are expressed explicitly or implicitly, we construct two settings for our prompt model: constrained and unconstrained predictions. Constrained predictions are [MASK] predictions (y_i) constrained to words found within the inputted instance x_i, i.e., y_i ∈ x_i, where x_i is the input sentence with words {w_1, ..., w_n}. In this setting, top tokens in the inputted instance are used to optimize for explicitly expressed relationships. In the unconstrained setting, we allow the model to use any word in its entire vocabulary (V) to predict the name of the relationship, i.e., y_i ∈ V, optimizing for implicitly expressed relationships.
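A minimal sketch of the two prediction settings, assuming we already have a score per vocabulary word for the [MASK] slot (the function name and interface are illustrative):

```python
def top_predictions(scores, vocab, sentence_words, k=3, constrained=True):
    """Rank [MASK]-slot scores over the vocabulary; in the constrained
    setting, only words appearing in the input sentence are eligible."""
    allowed = set(sentence_words)
    candidates = [
        (w, s) for w, s in zip(vocab, scores)
        if not constrained or w in allowed
    ]
    candidates.sort(key=lambda ws: ws[1], reverse=True)
    return [w for w, _ in candidates[:k]]
```

The constrained call surfaces explicit, in-sentence relation words; the unconstrained call may surface an implicit relation name absent from the sentence.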
We use the hidden representations of the top three tokens in each setting to construct the following representations:

z_i^con = (1/n) Σ_{j=1}^{n} z_{y_j}^con,   (1)
z_i^unc = (1/n) Σ_{j=1}^{n} z_{y_j}^unc,   (2)

where n = 3 and z_{y_j} is the j-th embedded representation (z_{y_j} ∈ R^D) of the prediction corresponding to instance x_i from a phrase embedding model (Li et al., 2022a). Note that we do not use the prompting model to produce z because this model is trained on only known classes and tends to overfit on known classes even if random tokens are masked and predicted in x_i during training.
Our final relationship representation is constructed by combining the constrained and unconstrained representations:

r_i = ⟨z_i^con, z_i^unc⟩,   (3)

where ⟨·, ·⟩ represents concatenation. The combined representation r_i models explicitly and implicitly expressed relationships in sentence x_i.
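Treating each prediction embedding as a plain vector, Equations 1-3 amount to mean-pooling each setting's top-n embeddings and concatenating the two pooled vectors, roughly:

```python
def relation_representation(con_embs, unc_embs):
    """Mean-pool the top-n constrained and unconstrained prediction
    embeddings (Eqs. 1-2), then concatenate them into r_i (Eq. 3)."""
    def mean(vecs):
        return [sum(col) / len(vecs) for col in zip(*vecs)]
    return mean(con_embs) + mean(unc_embs)  # list concatenation = <z_con, z_unc>
```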

Clustering with Majority-vote Bifurcation
Relationship representations from Equation 3 are clustered via Gaussian Mixture Models (GMM).
To improve the quality of the clusters, we adjust cluster membership according to entity meta-type pairs (e.g., [human, organization]). Specifically, we select the top 30% of relation instances in each cluster and use their majority entity meta-type pair as the meta-type of the cluster. Then, all relation instances are reassigned to the nearest cluster with a matching meta-type.
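The meta-type assignment step might be sketched as follows, where `scores` stands in for whatever membership confidence the GMM provides (a hypothetical simplification of the paper's procedure):

```python
from collections import Counter

def assign_cluster_meta_types(labels, meta_types, scores, top_frac=0.30):
    """For each cluster, take its top `top_frac` highest-scoring members
    and use their majority entity meta-type pair as the cluster's type."""
    cluster_meta = {}
    for c in set(labels):
        members = [i for i, l in enumerate(labels) if l == c]
        members.sort(key=lambda i: scores[i], reverse=True)
        top = members[: max(1, int(len(members) * top_frac))]
        cluster_meta[c] = Counter(meta_types[i] for i in top).most_common(1)[0][0]
    return cluster_meta
```

Instances would then be reassigned to the nearest cluster whose assigned meta-type matches their own.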
We cluster instances from both labeled and unlabeled data into the same feature-space.Intuitively, since all labeled instances are instances of known classes, each labeled instance acts as a "vote" voting for a cluster that corresponds to a known class.
We tally the votes of all the labeled instances and use the results to bifurcate the set of clusters into two subsets: known-class clusters G_K and novel-class clusters G_N. We call this method "majority-vote bifurcation" and use the novel-identified clusters G_N as weak labels for the subsequent classification module.
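A small sketch of majority-vote bifurcation under one plausible reading of the voting rule (the clusters receiving the most labeled votes, up to the number of known classes, are marked known):

```python
from collections import Counter

def bifurcate_clusters(cluster_ids, is_labeled, n_known):
    """Each labeled instance votes for its cluster; the n_known clusters
    with the most labeled votes become G_K, the rest G_N."""
    votes = Counter(c for c, lab in zip(cluster_ids, is_labeled) if lab)
    known = {c for c, _ in votes.most_common(n_known)}
    novel = set(cluster_ids) - known
    return known, novel
```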

Relation Classification
In the final stage of KNoRD, we use gold labels from labeled data and weak labels generated from the method described in Section 4.3 to train a relation classification model.
Since the clusters generate weak labels of varying degrees of accuracy, we select the top P% of weak labels for each cluster in G_N. In our model, we set P = 15. We retrospectively explore the effects of different P values and report the performance in Appendix A.5. We observe that the optimal value for P varies across datasets; we leave developing an advanced method of determining P for future work. For each relation instance, we obtain entity representations m_e1 and m_e2 from the encoder. To form a relation representation, we concatenate the representations of the two entities e1 and e2: r_e1e2 = ⟨m_e1, m_e2⟩. The relation representations are sent through a fully-connected linear layer which is trained using cross-entropy loss:

L_CE = − Σ_i log p(y_i | r_e1e2,i),

where y_i is the gold or weak label of instance i. We leave developing an automated method for class abstraction for future work.
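The top-P% weak-label selection could look roughly like this, with `instances` mapping each novel cluster to `(index, confidence)` pairs; both the data structure and the confidence source (e.g., GMM posteriors) are assumptions for illustration:

```python
def select_weak_labels(instances, p=0.15):
    """Keep the top fraction p of each novel cluster's members, ranked
    by confidence, as weak labels for cross-entropy training."""
    selected = []
    for cluster, members in instances.items():
        ranked = sorted(members, key=lambda m: m[1], reverse=True)
        k = max(1, int(len(ranked) * p))
        selected.extend((idx, cluster) for idx, _ in ranked[:k])
    return selected
```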

Datasets
We evaluate KNoRD on three RE datasets: TACRED (Zhang et al., 2017b), ReTACRED (Stoica et al., 2021), and FewRel (Han et al., 2018). For each dataset, we first construct splits of known and novel classes based on class frequency, assigning the top 50% most frequent relation classes to the set of known classes (C_K) and the lower 50% to the set of novel classes (C_N) (see Figure 2). Since FewRel is a balanced dataset with relationships defined from a subset of Wikidata relationships, we obtain real-world class frequencies based on their frequency within Wikidata. For more details on our FewRel pre-processing steps, see Appendix A. We include two versions of each dataset: with and without negative instances (e.g., the no relation class). The setting with negative instances best mirrors real-world data; however, as our experiments show, discovering novel classes in a sea of negative instances is difficult. We include results from both settings; the setting with negative instances is ripe for advancement and future work.
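The frequency-based known/novel split can be sketched as below; excluding `no_relation` from both class sets is our assumption, since negative instances are not a relation class:

```python
from collections import Counter

def split_known_novel(labels):
    """Assign the top 50% most frequent relation classes to C_K and the
    remaining (long-tail) classes to C_N."""
    freq = Counter(l for l in labels if l != "no_relation")
    ranked = [c for c, _ in freq.most_common()]
    half = len(ranked) // 2
    return set(ranked[:half]), set(ranked[half:])
```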
Our focus is to evaluate methods on data with no distribution drift (i.e., known classes from training occur in the unlabeled data along with novel classes). We leave an evaluation on out-of-distribution datasets (Gao et al., 2019; Bassignana and Plank, 2022) to future work.

Experiments
We compare KNoRD to state-of-the-art OpenRE baselines: (1) RoCORE (Zhao et al., 2021), (2) MatchPrompt (Wang et al., 2022a), and (3) TABs (Li et al., 2022b). Since OpenRE methods cannot naturally operate within the Generalized Relation Discovery setting, we extend the OpenRE baselines in the following ways: (i) RoCORE′, MatchPrompt′, TABs′: Given that OpenRE methods cannot identify previously seen (known) classes mixed with novel classes in unlabeled data, we evaluate their performance on novel classes and propose a method to adapt them for seen classes. To achieve this, we treat all classes as novel classes, enabling these methods to effectively cluster the unlabeled data. Subsequently, we employ the Hungarian Algorithm (Kuhn, 1955) to match some discovered classes to known classes from labeled data, facilitating performance evaluation on the known classes.
(ii) RoCORE†, MatchPrompt†, TABs†: Many leading OpenRE models assume unlabeled data comes pre-divided into sets of known and novel instances (Zhao et al., 2021; Li et al., 2022b; Wang et al., 2022a). A natural extension of these methods is to prepend a module that segments unlabeled data into known and novel instances. We pre-train models on known classes and then generate confidence scores for each unlabeled instance. We use the softmax function as a proxy for confidence (Hendrycks and Gimpel, 2016) and set the confidence threshold equal to the mean confidence from labeled instances of known classes. Instances with confidence scores below the threshold are assigned to novel classes. We report the accuracy of this method in bifurcating unlabeled data in Appendix A.4.4.
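A sketch of this confidence-threshold bifurcation, using max-softmax probability as the confidence proxy (the interface is illustrative, not the baselines' actual code):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def split_by_confidence(unlabeled_logits, labeled_logits):
    """Threshold = mean max-softmax confidence on labeled known-class
    data; unlabeled instances below it are routed to the novel set."""
    def conf(lg):
        return max(softmax(lg))
    threshold = sum(conf(lg) for lg in labeled_logits) / len(labeled_logits)
    known, novel = [], []
    for i, lg in enumerate(unlabeled_logits):
        (known if conf(lg) >= threshold else novel).append(i)
    return known, novel
```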
ORCA: We include ORCA (Cao et al., 2022), a computer vision model developed for a similar generalized open-world setting. ORCA is the only model architecture in our experiments that can predict known and novel classes from unlabeled data and, thus, requires no modification, beyond adaptation to the RE task, to function within our proposed setting. For more details about adapting ORCA to predict relationships, see Appendix A.4.5.
GPT 3.5: Given the zero-shot learning capabilities of Large Language Models (Kojima et al., 2022), we also include GPT 3.5 (OpenAI, 2021) as a baseline. To assess GPT 3.5, we leverage in-context learning: we provide examples of extracted relationships and a list of known relation classes. We instruct the model to predict the most appropriate relation class name or suggest a novel class name when an instance does not fit within the set of known classes (see Appendix A.4.1 for details).
GPT 3.5 +cos: Since responses from GPT 3.5 may not align perfectly with ground-truth labels, we use DeBERTa to map the responses (y_i) to ground-truth class names by embedding the predictions and the ground-truth class names (z_{y_i} and z_c, respectively) and identifying the ground-truth class that exhibits the highest cosine similarity with the predicted class:

y_i^m = argmax_c (z_{y_i} · z_c) / max(∥z_{y_i}∥∥z_c∥, ϵ),

where y_i^m is the mapped prediction of prediction y_i, and ϵ = 1e−8. We denote the GPT 3.5 baseline with mapped predictions as "GPT 3.5 +cos."
For all our baselines, we use identical settings for the number of known and novel classes and, save GPT 3.5, we use the same pre-trained model (He et al., 2021) as a base model and map predictions to ground-truth classes via the Hungarian Algorithm. We use Micro-F1 scores to assess relation classification performance and report overall performance as well as performance on known and novel classes to assess a model's ability to identify each class type from unlabeled data.
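The cosine-similarity mapping behind GPT 3.5 +cos reduces to an argmax over ground-truth class embeddings; a plain-Python sketch, where the embedding vectors are assumed to come from DeBERTa or any sentence encoder:

```python
def cosine(u, v, eps=1e-8):
    """Cosine similarity with the paper's epsilon guard on the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: sum(a * a for a in x) ** 0.5
    return dot / max(norm(u) * norm(v), eps)

def map_to_ground_truth(pred_emb, class_embs):
    """Return the ground-truth class name whose embedding is most
    cosine-similar to the predicted class-name embedding."""
    return max(class_embs, key=lambda name: cosine(pred_emb, class_embs[name]))
```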

Results
Our proposed method, KNoRD, outperforms the baseline models in all metrics (Table 2). We observe that the ORCA baseline demonstrates strong overall performance, and the OpenRE methods (RoCORE′, MatchPrompt′, TABs′) yield diverse results, which we attribute to the differences in their underlying architectures. Models such as TABs and MatchPrompt incorporate clustering methods that effectively develop relationship representations in an unsupervised setting. In contrast, RoCORE relies more heavily on supervised training to form high-quality relationship representations. This distinction is evident in our confidence-based adaptations (RoCORE†, MatchPrompt†, TABs†), where pre-dividing the unlabeled data benefits RoCORE significantly, while the results for MatchPrompt and TABs are mixed.
We observe that GPT 3.5 underperforms in this setting. Although mapping responses to ground-truth classes (GPT 3.5 +cos) yields a slight performance boost, the model still performs poorly relative to our other baselines. Given the unsatisfactory results from GPT 3.5 in our simplified experiment setting without negative instances, we exclude it from the more challenging setting where negative instances are present. We conclude that more advanced techniques are required to enable GPT 3.5 to accurately classify and discover relationships from textual data. A deeper examination of GPT 3.5's performance is provided in Appendix A.4.3.
In the setting with negative instances, all methods struggle to identify novel relation classes, indicating the difficulty of discovering new classes among instances with no relation. We attribute the lower overall performance on TACRED compared to ReTACRED to TACRED's wrong-labeling problem (Stoica et al., 2021).
The relatively small drop in performance of all models between FewRel with and without negative instances can be attributed to FewRel lacking annotated negative instances; we artificially augment the data with negatives from ReTACRED. We posit that the models can exploit the slight difference in the distribution of the augmented negative instances, thus reducing the task's difficulty. These results emphasize the importance of annotating negative instances in future RE dataset creation efforts.

Ablations
We conduct ablation studies to better understand the relative importance of each design choice behind KNoRD.
• KNoRD w/constrained: We only use constrained predictions from the prompt model to construct relationship representations (Equation 1) and keep all other modules unchanged for this ablation.
• KNoRD w/unconstrained: Similar to the aforementioned ablation, but we only leverage unconstrained predictions to construct relationship representations (Equation 2) for the GMM module.
• KNoRD without CE: We remove the cross-entropy (CE) module from KNoRD and allow the GMM to predict relation classes directly. We remap cluster predictions to ground-truth classes using the Hungarian Algorithm.
We also manually evaluate the accuracy of our prompt method in predicting relation class names. We evaluate the alignment of the top one and top three constrained and unconstrained predictions with ground-truth class names (Figure 3). Constrained predictions, designed to model explicit relationships, are generally more accurate for long-tail relation classes. Conversely, unconstrained predictions perform better on common relationships.
This observed phenomenon roughly aligns with Zipf's Law (Zipf, 1936), indicating that rare concepts and relations are more likely to appear explicitly in a long-form manner.In contrast, common relations tend to be expressed in a compressed form (e.g., implicitly).This insight lends additional evidence to designing a prompt method that captures explicit and implicit relationships.

Conclusion
In this work, we address the limitations of existing approaches in OpenRE and introduce the Generalized Relation Discovery setting to align the task with characteristics of data found in the real world. By expanding the scope of unlabeled data to include known and novel classes, as well as negative instances, and incorporating long-tail relation types in the set of novel classes, we aim to enhance the practicality of OpenRE methods. Furthermore, we propose KNoRD, a novel method that effectively classifies explicitly and implicitly expressed relations from known and novel classes within unlabeled data. Through comprehensive experimental evaluations on various Open-world RE benchmarks, we demonstrate that KNoRD consistently outperforms existing methods, yielding significant performance gains. These results highlight the efficacy and potential of our proposed approach in advancing the field of OpenRE and its applicability to real-world scenarios.

Limitations
The limitations of our method are as follows:
1. Our method requires human-annotated data, which is expensive and time-consuming to create.
2. Our method cannot automatically determine the ground-truth number of novel classes in unlabeled data. We leave this to future work.
3. Our method focuses on sentence-level relation classification, and without further testing, we cannot claim these methods work well for document-level relation classification.

The low F1 scores of our model and all leading OpenRE models within our experiments with negative instances highlight an area for growth in future works.

Ethical Concerns
We do not anticipate any major ethical concerns; relation discovery is a fundamental problem in natural language processing.A minor consideration is the potential for introducing certain hidden biases into our results (i.e., performance regressions for some subset of the data despite overall performance gains).However, we did not observe any such issues in our experiments, and indeed these considerations seem low-risk for the specific datasets studied here because they are all published.

A.2 Additional Related Work
Continual Relation Extraction: Continual relation extraction (CRE) is a relatively new task that focuses on continuously extracting relations, including novel relations, as new data arrives. CRE's main challenge is preventing the catastrophic forgetting of known classes (Hu et al., 2022; Zhao et al., 2022). In CRE, new data can contain known and novel classes, similar to our setting; however, CRE assumes that all new data is labeled, which fundamentally differs from our unlabeled setting.
Zero-shot Relation Extraction: Zero-shot relation extraction methods typically assume that test data only contains novel classes and that descriptions for those novel classes are readily available (Levy et al., 2017; Obamuyide and Vlachos, 2018; Lockard et al., 2020; Chen and Li, 2021). Generalized zero-shot relation extraction (ZSRE) removes the assumption that test data can only contain novel classes. However, ZSRE methods still heavily rely on descriptions of novel relation classes (Huang et al., 2018; Rahman et al., 2017), which is information that we do not assume is available in our unlabeled setting.
Prompt-based RE Methods: Prompt-based methods have shown promising results for both closed-world and open-world RE tasks (Jun et al., 2022; Wang et al., 2022a; Li et al., 2022b). Prompt-based methods for relation extraction involve constructing prompts, sometimes called "templates," that provide contextual cues for identifying relations between entities in text. These prompts typically comprise natural language phrases that capture the semantic relationship between entities.

A.3 Pre-processing and Augmenting FewRel

Special treatment is needed for the FewRel dataset since it is a uniform dataset without entity type information or annotated negative instances.
Frequency-based class splits: To conduct the frequency-based splits described in Section 5, we obtain the distribution of relation classes as they appear in real-world data. Given that the relationship IDs in FewRel correspond with relationships in Wikidata, we obtain class frequency information directly from Wikidata by aggregating counts of occurrences of each relationship.
Augmenting with negative instances: Unfortunately, FewRel does not provide annotated negative instances (e.g., the no relation class).To better simulate real-world data, we augment the FewRel dataset with negative instances from ReTACRED.We recognize that augmenting FewRel with data from another dataset is not ideal since distribution differences may exist.Future work in the Generalized Relation Discovery setting may focus on extending FewRel with domain-aligned human annotated negative instances.
Resolving entity type information: The role of entity type information in relation extraction has been widely acknowledged (Peng et al., 2020; Wang et al., 2022b). However, the FewRel dataset lacks explicit entity type information. To address this limitation and resolve entity types for all entities in the FewRel dataset, we employ the following two-phase approach:
1. Wikidata ontology traversal: FewRel provides a Wikidata entity ID for each entity. Leveraging the Wikidata API, we retrieve the metadata associated with each entity ID. Then, we recursively map entity types (e.g., the value of the property "subclass of" for each concept in Wikidata) to parent types until a root node is found. During this traversal, we encounter a few special cases: (1) concepts with missing values for the "subclass of" property; (2) concepts with multiple values for the "subclass of" property; and (3) values of "subclass of" that lead to looping paths in the ontology. For entities with missing values, typically found in the leaf nodes of the Wikidata knowledge graph, we default to the value of the Wikidata property "instance of" as the starting concept for our recursive traversal.
When a concept has multiple values for "subclass of," we select the first value unless that value leads to a looping path within the ontology (e.g., "make-up artist" is a subclass of "hair and make-up artist," which is a subclass of "make-up artist"). In these cases, we choose the next value of "subclass of" until we find a non-looping path to a root node.
2. Type binning: Using the raw values of subclasses results in thousands of fine-grained entity types. We iteratively bin entity types into parent entity types until each entity type has at least 1,000 entities to obtain broader, more generalized entity types. This method produced 23 distinct entity types (see Figure 5 for the names and distribution of entity types found in FewRel).
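The ontology traversal with loop handling might be sketched as follows; `subclass_of` and `instance_of` are hypothetical adjacency maps built from Wikidata API responses, not the paper's actual data structures:

```python
def trace_to_root(concept, subclass_of, instance_of=None):
    """Follow 'subclass of' edges toward a root, falling back to
    'instance of' for leaves and skipping edges that would revisit a
    previously seen concept (i.e., looping paths)."""
    instance_of = instance_of or {}
    seen, node = {concept}, concept
    while True:
        parents = subclass_of.get(node) or instance_of.get(node) or []
        nxt = next((p for p in parents if p not in seen), None)
        if nxt is None:
            return node  # root reached (no unseen parent remains)
        seen.add(nxt)
        node = nxt
```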

A.4 Baselines
In this section, we provide additional details about our baseline models.
A.4.1 Soliciting Predictions from GPT 3.5

GPT 3.5 often performs better on tasks with the help of in-context learning (Wei et al., 2023; Wang et al., 2023). We construct a prompt that lists all known relation classes and offers a couple of examples of extracted relationships. We use natural language class names to help the model understand and make predictions. The following is the prompt we used for soliciting predictions in our tests:
1. Select the correct relation between the head and tail entities in the following unlabeled examples.
2. Each example has the head and tail entities appended to the sentence in the form: (head entity) (tail entity).
3. There are 40 known relation classes, and up to 80 unknown, or novel, relation classes.
4. The following is the list of known relation classes: "instance of", "subject", "language", "country", "located in", "occupation", "constellation", "citizenship", "part of", "taxon rank", "location", "heritage", "has part", "sport", "genre", "child", "country of origin", "position", "follows", "followed by", "contains", "father", "jurisdiction", "field of work", "participant", "spouse", "mother", "participant", "operator", "performer", "member of party", "publisher", "owned by", "member org", "religion", "headquarters", "sibling", "position played", "work location", "original language"
An identical prompt was also used for TACRED and ReTACRED, with changes only made to the numbers of known and novel classes, the list of known classes, and the examples provided.
A.4.2 Probing GPT 3.5 for Prior Knowledge

One issue in evaluating GPT 3.5 is that the exact body of data used for training is unknown. Therefore, to ensure a fair comparison, we seek to determine whether GPT 3.5 has prior knowledge of the various datasets we use in our experiments. To do this, we ask GPT 3.5 to list all the classes in a specific dataset with the following prompt: "What are the relation classes found in the [DATASET_NAME] relation extraction dataset?" We report the response when asking about TACRED in Table 4. Note: GPT 3.5 also responded with accurate descriptions of the relation classes, but they are omitted for brevity.
For the TACRED dataset, GPT 3.5 responded with 37 correct responses of the 41 total relation classes.
We use these results to argue that GPT 3.5 has an unfair advantage in discovering novel classes in TACRED and ReTACRED; despite this advantage, GPT 3.5 did poorly compared to the other baselines we tested. In contrast, when asked about the relation classes in FewRel, it responded with only four correct responses of the 80 total relation classes in the dataset. This difference can partially explain why, in our tests, GPT 3.5 performs better on the TACRED and ReTACRED datasets compared to FewRel.

A.4.3 A Deeper Examination of GPT 3.5's Performance
The performance of GPT 3.5 has yielded results below our initial expectations. We carefully constructed our prompts in accordance with best practices drawn from recent studies that have showcased the efficacy of in-context learning with generative models (Wei et al., 2023; Wang et al., 2023). We leave exploring more advanced prompting techniques, such as Chain-of-Thought (CoT) or Self-Consistency Prompting, to future works.
To gain a more comprehensive understanding of the reasons behind GPT 3.5's suboptimal performance, we conducted an informal error analysis. Our investigation involved randomly selecting 40 instances of erroneous predictions across both known and novel classes generated by GPT 3.5. We present our observations below:
Errors within Known Classes: GPT 3.5's inaccuracies within known classes appear to stem from the difficulty in distinguishing between classes with subtle differences. For instance, in the context of ReTACRED, the model frequently confuses the following known classes: "org:top_members/employees,"

Figure 1: In KNoRD, we use labeled data to train a prompt model to predict relation class names. That model is then used to generate constrained (in-sentence) and unconstrained (all-vocabulary) predictions. We average and concatenate representations from the top three constrained and unconstrained predictions. Representations are clustered using Gaussian Mixture Models (GMM) and bifurcated into sets of known and novel instances via a majority vote. Novel-identified clusters provide weak labels in a cross-entropy training objective.
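The clustering-and-bifurcation step in Figure 1 can be sketched as follows. This is an illustrative sketch, not the paper's implementation: `cluster_and_bifurcate` is a hypothetical helper, and the exact majority-vote threshold is our assumption.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def cluster_and_bifurcate(reps, labeled_mask, n_components, seed=0):
    """Cluster relation representations with a GMM, then mark each
    cluster as known or novel by majority vote over how many of its
    members come from the labeled (known-class) data."""
    gmm = GaussianMixture(n_components=n_components, random_state=seed)
    cluster_ids = gmm.fit_predict(reps)

    novel_clusters = []
    for c in range(n_components):
        in_cluster = cluster_ids == c
        # Clusters where labeled instances are a minority are treated
        # as novel; these later supply weak labels for training.
        if labeled_mask[in_cluster].sum() <= in_cluster.sum() / 2:
            novel_clusters.append(c)
    is_novel = np.isin(cluster_ids, novel_clusters)
    return cluster_ids, is_novel
```

In the actual pipeline the inputs would be the averaged-and-concatenated prompt representations; here any point cloud with a boolean labeled mask suffices.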

Figure 2: Data splits used in our Generalized Relation Discovery setting. Given n total classes in a dataset, the set of known classes comprises the top ⌊n/2⌋ most frequent classes. The remaining classes are placed into the set of novel classes. Labeled data consists of 85% of the instances from known classes. Unlabeled data contains 15% of the instances from known classes and 100% of the instances from novel classes (*numbers do not include negative instances; †since FewRel has no annotated negative instances, we augment the dataset using negative instances from ReTACRED 5).
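The split construction described in Figure 2 can be sketched as follows; the function name and the `(text, label)` instance format are illustrative choices, not from the paper's released code.

```python
from collections import Counter

def make_splits(instances, rng, labeled_frac=0.85):
    """Sketch of the Generalized Relation Discovery splits: the top
    half of classes by frequency are 'known'; ~85% of known-class
    instances become labeled data, while the rest plus all
    novel-class instances become unlabeled data."""
    freq = Counter(label for _, label in instances)
    ranked = [c for c, _ in freq.most_common()]
    known = set(ranked[: len(ranked) // 2])  # top ⌊n/2⌋ classes

    labeled, unlabeled = [], []
    for inst in instances:
        _, label = inst
        if label in known and rng.random() < labeled_frac:
            labeled.append(inst)
        else:
            unlabeled.append(inst)
    return labeled, unlabeled, known
```

Note that negative (no-relation) instances would be handled separately, as the caption's footnotes indicate.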

Figure 3: Accuracy of top-1 and top-3 predictions from our prompt method in two settings: constrained (in-sentence words) and unconstrained (all vocabulary) predictions. Unconstrained predictions perform well on common relationships, while constrained predictions perform well on long-tail relationships.

Figure 4: Implicitly expressed relationships are conveyed through lexical patterns, where specific linguistic patterns indicate a relationship between entities. Explicitly expressed relationships are represented through class-indicative words.
Concurrent OpenRE works by Li et al. (2022b) and Wang et al. (2022a) introduce prompt-based frameworks for clustering unlabeled data. Prompt-based methods are used to generate relationship representations, which are then clustered in a high-dimensional space. The clusters are iteratively refined using the training signal from labeled data, with careful measures to ensure the model is not biased toward known classes. However, the aforementioned methods assume that unlabeled data is already divided into sets of known and novel classes, which is an unrealistic assumption for real-world unlabeled data. Furthermore, these works only report performance on novel classes, obscuring the model's overall performance in a real-world setting where the unlabeled data contains both known and novel classes.
• We openly provide all code, experimental settings, and datasets used to substantiate the claims made in this paper.1

Table 1
We encode $x_i$ and obtain the hidden states $\{h_1, h_2, \ldots, h_{|x_i|}\}$. Then, mean pooling is applied to the consecutive entity tokens to obtain representations for the head and tail entities ($e_1$ and $e_2$, respectively). Assuming $n_{start}$ and $n_{end}$ are the start and end indices of entity $e_1$, the entity representation is:

$$e_1 = \frac{1}{n_{end} - n_{start} + 1} \sum_{j=n_{start}}^{n_{end}} h_j$$

The cross-entropy objective for an observation $o$ is:

$$\mathcal{L}_{CE} = -\sum_{i=1}^{N} y_{o,i} \log p(y_{o,i})$$

where $y_{o,i}$ is a binary indicator that is 1 if and only if $i$ is the correct classification for observation $o$, $p(y_{o,i})$ is the Softmax probability that observation $o$ is of class $i$, and $N$ is the number of classes.
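Mean pooling over an entity's token span can be sketched in a few lines of NumPy; in practice the hidden states would come from the encoder, and the function name here is illustrative.

```python
import numpy as np

def entity_representation(hidden_states, n_start, n_end):
    """Mean-pool the hidden states of an entity's consecutive tokens
    (indices n_start..n_end, inclusive) into one vector."""
    return hidden_states[n_start : n_end + 1].mean(axis=0)
```

The same pooling is applied independently to the head and tail entity spans to produce $e_1$ and $e_2$.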
Li et al. (2022b) show that setting the number of novel classes to a large number corresponds to fine-grained novel class predictions which, depending on the task and desired outcome, can be grouped into more general classes via abstraction. Since we do not assume the ground-truth number of novel classes is available, we use a relatively high number of novel classes equal to twice the number of known classes ($2 \times |C_K|$), where $|C_K|$ is the number of known classes found in the labeled data. The $N$ in Equation 5 is set to $|C_K| + 2 \times |C_K|$.

Table 2: F1-micro scores reported on unlabeled data with and without negative (e.g., no relation) instances. F1 (known) and F1 (novel) report performance on ground-truth known and novel classes, respectively. OpenRE models are extended to operate in the Generalized Relation Discovery setting (see Section 6 for details). All scores average five runs, except the GPT 3.5 scores, which result from a single run.
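The known/novel breakdown reported in Table 2 can be computed as in the following sketch; the function name and subset logic are ours, and note that for single-label multiclass data, micro-F1 on a subset reduces to accuracy on that subset.

```python
from sklearn.metrics import f1_score

def subset_f1(y_true, y_pred, known_classes):
    """Overall micro-F1 on the unlabeled set, plus micro-F1 restricted
    to instances whose ground-truth class is known vs. novel."""
    overall = f1_score(y_true, y_pred, average="micro")
    known_idx = [i for i, y in enumerate(y_true) if y in known_classes]
    novel_idx = [i for i, y in enumerate(y_true) if y not in known_classes]
    f1_known = f1_score([y_true[i] for i in known_idx],
                        [y_pred[i] for i in known_idx], average="micro")
    f1_novel = f1_score([y_true[i] for i in novel_idx],
                        [y_pred[i] for i in novel_idx], average="micro")
    return overall, f1_known, f1_novel
```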

Table 3: Ablation experiments varying the relationship representation methods used in KNoRD, as well as removing the cross-entropy module and using cluster predictions directly ("w/o CE").

Table 4: Comparing responses from GPT 3.5 to the ground-truth classes in the TACRED dataset. GPT 3.5 correctly predicts 37 of the 41 relation classes in TACRED, receiving an F1 score of 0.91 for class name prediction.