Revisiting Large Language Models as Zero-shot Relation Extractors

Relation extraction (RE) consistently involves a certain degree of labeled or unlabeled data, even under the zero-shot setting. Recent studies have shown that large language models (LLMs) transfer well to new tasks out of the box simply given a natural language prompt, which opens the possibility of extracting relations from text without any data or parameter tuning. This work studies LLMs, such as ChatGPT, as zero-shot relation extractors. On the one hand, we analyze the drawbacks of existing RE prompts and attempt to incorporate recent prompt techniques, such as chain-of-thought (CoT), to improve zero-shot RE. We propose summarize-and-ask (SUMASK) prompting, a simple prompt that recursively uses LLMs to transform RE inputs into an effective question answering (QA) format. On the other hand, we conduct comprehensive experiments on various benchmarks and settings to investigate the capabilities of LLMs on zero-shot RE. Specifically, we have the following findings: (i) SUMASK consistently and significantly improves LLM performance across different model sizes, benchmarks and settings; (ii) zero-shot prompting with ChatGPT achieves competitive or superior results compared with zero-shot and fully supervised methods; (iii) LLMs deliver promising performance in extracting overlapping relations; (iv) performance varies greatly across relations. Different from small language models, LLMs are effective in handling the challenging none-of-the-above (NoTA) relation.


Introduction
Relation extraction (RE) aims to identify the relationships between entities in texts and plays an important role in information extraction (IE). Most existing RE methods (Zeng et al., 2014; dos Santos et al., 2015) require large amounts of labeled training data, which is labor-intensive and time-consuming in practice. Hence, extracting relations from texts with zero- or few-shot methodologies has garnered significant scholarly attention (Han et al., 2018; Chen and Li, 2021).
Recent studies (Wei et al., 2022; Wang et al., 2023b) on large-scale pre-trained language models (LLMs), such as GPT-3 (Brown et al., 2020), demonstrate that LLMs perform well on various downstream tasks without any training or fine-tuning, given only a few examples as instructions, which is called in-context learning. However, there is currently no consensus on whether LLMs are good few-shot information extractors (Agrawal et al., 2022; Jimenez Gutierrez et al., 2022). Unlike some other tasks, RE is more challenging for LLMs because the structured output contains multiple dependent elements that are difficult to extract directly and accurately. Although recent studies (Wang et al., 2023a) indicate that some conventional fine-tuned models still outperform LLMs in few-shot RE, we still want to explore whether LLMs can achieve competitive performance compared to fine-tuned models.
Similar to few-shot learning, LLMs also show promising performance in zero-shot settings (Kojima et al., 2022). Recent work on zero-shot RE via prompting LLMs has achieved remarkable progress. QA4RE (Zhang et al., 2023) is a multiple-choice question answering prompt format in which each relation is transformed into a template and LLMs are expected to predict only a single letter. This prompt is simple but requires manually crafted templates and cannot deal with overlapping relations, which motivates us to find more general and effective prompts. ChatIE (Wei et al., 2023) transforms the zero-shot IE task into a multi-turn question answering problem with a two-stage framework, and even surpasses some fully supervised models on several datasets. Nonetheless, ChatIE still performs worse than the state of the art and is only evaluated on limited benchmarks. Thus it remains unclear how to improve extraction performance by designing effective prompts, and whether LLMs are good zero-shot relation extractors. To this end, we revisit and investigate the potential of LLMs in zero-shot RE with the following research questions: • (RQ1) How do LLMs perform on RE when incorporating existing prompt techniques?
Previous work (Ma et al., 2023; Wang et al., 2023a) fails to achieve promising results on RE, since black-box LLMs such as ChatGPT make it difficult to ensure the reliability of outputs. To answer the first question, we investigate the feasibility of incorporating recent prompt techniques to improve the reliability of extracted results. For example, chain-of-thought (CoT) prompting (Wei et al., 2022) improves the reliability of model outputs by providing intermediate reasoning steps, and active prompting (Diao et al., 2023) is an uncertainty-based active learning method that quantifies uncertainty so as to select the most uncertain outputs. Specifically, we propose summarize-and-ask (SUMASK) prompting, which decomposes RE into two subtasks: text summarization (Liu and Lapata, 2019) and question answering (Chen et al., 2017a). We further introduce an uncertainty estimation method to approximately characterize the output probabilities of LLMs, which yields substantial improvements over VANILLA prompting.

Related Work
Few- and Zero-shot Relation Extraction Few-shot RE (Han et al., 2018) aims to predict novel relations by exploiting a few labeled instances. Prototypical networks (Snell et al., 2017) are widely used and combined with pre-trained language models (Devlin et al., 2019) in few-shot settings to achieve impressive results. To extract relations that were not specified in advance, zero-shot RE (Levy et al., 2017) was proposed to predict new relations with new models. However, existing zero-shot methods (Chen and Li, 2021) still require much labeled data. Recent studies (Zhang et al., 2023; Wei et al., 2023) instead prompt LLMs to extract relations without any labeled data.

Large Language Models and Prompting Besides the "pre-train and fine-tune" paradigm (Liu et al., 2023), pre-trained LLMs possess characteristics that are advantageous for few-shot (Brown et al., 2020) and zero-shot (Kojima et al., 2022) learning, whereby appropriate prompts effectively guide the model towards generating desired task outputs, thus beginning an era of "pre-train and prompt" (Liu et al., 2021). Prior works (Zhao et al., 2021; Liu et al., 2021) note the sensitivity of prompting to slight modifications.
Empirical results demonstrate that answering restrictive prompts is challenging due to biases acquired during pre-training (Zhao et al., 2021; Arora et al., 2023). In this study, we evaluate different prompt formats tailored to the particularities of RE and propose SUMASK prompting, which outperforms VANILLA prompting by a large margin.

Problem Definition
Previous zero-shot RE (Chen and Li, 2021) only involves single relation classification. We extend this zero-shot setting to multiple entities and relations. Given the pre-defined relation set R = {r_1, r_2, ..., r_N} and the sentence S containing the entity set E = {e_1, e_2, ..., e_M}, we aim to extract all the relations between these entities, composing the relational triple set Z = {(e_i, r_k, e_j)}, where N denotes the number of relations, M the number of entities, and e_i, e_j ∈ E, r_k ∈ R.
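This setting can be illustrated with a small data structure; the relation names and entities below are hypothetical examples, not taken from any of the paper's benchmarks:

```python
from dataclasses import dataclass
from itertools import permutations

@dataclass(frozen=True)
class Triple:
    subj: str  # e_i in E
    rel: str   # r_k in R
    obj: str   # e_j in E

# Hypothetical pre-defined relation set R (N = 3) and entity set E (M = 3).
relations = ["capital_of", "located_in", "performed_in"]
entities = ["Homage to Cambodia", "Phnom Penh", "Cambodia"]

# A zero-shot extractor must score every ordered entity pair against every
# candidate relation; the gold triple set Z is a subset of these candidates.
candidates = [Triple(s, r, o)
              for s, o in permutations(entities, 2)
              for r in relations]
assert len(candidates) == 6 * 3  # M(M-1) ordered pairs x N relations
```

The candidate enumeration makes explicit why multi-relation zero-shot RE is more expensive than single relation classification: the search space grows with both M and N.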
Prompt Design

VANILLA Prompting
Previous work (Ma et al., 2023; Wang et al., 2023a) claims that LLMs achieve poor results on IE tasks such as RE. We argue that one important reason why LLMs underperform the state of the art is poor prompt design, as different prompts for the same task can cause large variations in model predictions (Zhao et al., 2021; Arora et al., 2023). Figure 1 illustrates the most direct and common prompt strategy, which directly asks LLMs to extract relation labels from text through instructions. However, we empirically find this approach ineffective because it requires LLMs to accomplish three non-trivial reasoning processes in a single step: (i) extracting the relation semantics between the subject and object in the sentence; (ii) understanding the semantics of each relation label; (iii) matching the relation semantics between the entities against the given relation labels. Consistent with existing findings (Jimenez Gutierrez et al., 2022; Ma et al., 2023; Wang et al., 2023a), LLMs using VANILLA prompting are unable to achieve satisfactory performance on zero-shot RE.

SUMASK Prompting
Because LLMs struggle to complete the three reasoning processes in one step, we leverage the idea of CoT (Wei et al., 2022) and decompose this step to guide LLMs through understanding and reasoning. To classify the relation between the subject and object, a simple method is to sequentially ask LLMs whether each relation holds between the two entities. For the three reasoning processes, SUMASK first summarizes the relational semantics between the entities, then turns each candidate relation into a question, and finally answers the question with yes/no, as illustrated in Figure 2.

Uncertainty Estimation Generally, relation classification assumes that only one correct relation is extracted. However, SUMASK prompting may obtain multiple "yes" answers while querying all the relations in R, so the final prediction must be selected from multiple candidates. Given the sentence S, the subject e_s and the object e_o, the predicted relation is obtained by:

r̂ = argmax_{r_k ∈ R} p(r = r_k | S, e_s, e_o)    (1)

We then transform this multi-class formulation into N binary classifications. To this end, we define a one-hot random variable r as follows:

r = (1, 0, ..., 0) if r = r_1; (0, 1, ..., 0) if r = r_2; ...; (0, 0, ..., 1) if r = r_N    (2)

Here we assume that exactly one positive label exists among the N binary classifications. We denote the intermediate summarization and question corresponding to relation r_i as s_i and q_i. The probability of relation r_i is then obtained by:

p(r_i = 1 | S, e_s, e_o) = p(s_i | S, e_s, e_o) · p(q_i | s_i) · p(r_i = 1 | s_i, q_i)    (3)

where r_i = 1 indicates that the LLM answers "yes" based on the summarization and question of relation i. Based on the one-positive-label assumption above, we have:

p(r = r_i | S, e_s, e_o) = p(r_i = 1 | S, e_s, e_o)    (4)

The final predicted relation is selected from all the candidate relations with the maximum probability:

r̂ = argmax_{r_i ∈ R} p(r_i = 1 | S, e_s, e_o)    (5)

Unfortunately, it is difficult to obtain the conditional probability of each step from LLMs. For instance, the "gpt-3.5-turbo" model only provides the final natural-text output without any logit or probability. To this end, we introduce an uncertainty estimation method to approximately characterize the conditional probabilities. Finding the relation that satisfies Equation (5) is equivalent to:

r̂ = argmin_{r_i ∈ R} [U(s_i | S, e_s, e_o) + U(q_i | s_i) + U(r_i | s_i, q_i)]    (6)

where U(X | Y) represents the uncertainty of the random variable X given the random variable Y. Therefore, the relation with the smallest uncertainty is selected as the final prediction.
Inspired by Diao et al. (2023), we measure the uncertainty using the dispersion degree among the k generated answers A = {a_1, ..., a_k}, as shown in Figure 2. Specifically, we feed the answers A into a pre-trained Sentence-BERT encoder (Reimers and Gurevych, 2019) to generate the answer representations Z = {z_1, ..., z_k}. Then the uncertainty is calculated by:

U = (2 / (k(k - 1))) Σ_{1 ≤ i < j ≤ k} d(z_i, z_j)    (7)

where the function d(·) measures the distance between two representations. After obtaining the uncertainty of each step, we select the relation with the smallest total uncertainty over its summarization, question and answer steps, following Equation (6). We adopt the majority vote (Wang et al., 2023b) to determine the yes/no answer in the last step. If the system answers "no" for every relation r_i ∈ R, the prediction is NoTA. For overlapping relations, we simply consider all relations answered with "yes" as predictions.
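The dispersion-based uncertainty can be sketched as follows; here we use mean pairwise Euclidean distance as the d(·) function and plain NumPy vectors in place of Sentence-BERT embeddings (both are simplifying assumptions for illustration):

```python
import numpy as np

def dispersion_uncertainty(reps: np.ndarray) -> float:
    """Mean pairwise distance among k answer representations (shape k x dim).
    Near-identical answers give a value near 0; disagreeing answers give a
    large value, so argmin over relations favors consistent predictions."""
    k = len(reps)
    if k < 2:
        return 0.0
    dists = [np.linalg.norm(reps[i] - reps[j])
             for i in range(k) for j in range(i + 1, k)]
    return float(np.mean(dists))

# Toy check: a relation whose k answers agree should have lower uncertainty
# than one whose answers scatter, and therefore win the argmin selection.
consistent = np.array([[1.0, 0.0], [1.0, 0.01], [0.99, 0.0]])
inconsistent = np.array([[1.0, 0.0], [-1.0, 0.5], [0.2, -0.9]])
assert dispersion_uncertainty(consistent) < dispersion_uncertainty(inconsistent)
```

In practice the representations would come from the Sentence-BERT `encode` call, one array per intermediate step (summarization, question, answer).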
Entity-Relation Mapping Obviously, asking LLMs about every relation is inefficient. Inspired by Li et al. (2022), we adopt an entity-relation mapping mechanism to deal with relation redundancy. Specifically, once the entity types are determined, the relations possibly holding between them are also determined, so most impossible relations are discarded in advance. Note that VANILLA prompting also adopts this simple strategy. This mechanism not only improves efficiency but also benefits overall performance.
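The mapping can be sketched as a lookup from (subject type, object type) pairs to candidate relations; the type names and mapping below are illustrative, not the paper's actual schema:

```python
# Hypothetical (subject_type, object_type) -> candidate relations mapping.
RELATION_MAP = {
    ("PERSON", "ORG"): ["employee_of", "founded_by"],
    ("ORG", "LOC"): ["headquartered_in"],
    ("PERSON", "LOC"): ["born_in", "lives_in"],
}

def candidate_relations(subj_type: str, obj_type: str) -> list:
    """Discard relations that are impossible for this entity-type pair,
    so SUMASK only needs to query the surviving candidates."""
    return RELATION_MAP.get((subj_type, obj_type), [])

assert candidate_relations("PERSON", "ORG") == ["employee_of", "founded_by"]
assert candidate_relations("ORG", "PERSON") == []  # no mapped relations -> skip pair
```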
Overlapping Relation Extraction We adopt NYT (Riedel et al., 2010) to test the ability to extract overlapping relations. For NYT, we assume the entities and their types in the sentence are available, and models only extract the overlapping relations given the entities in its test set. We use micro-F1 for evaluation. NYT is only evaluated on LLMs, as existing baselines have not considered this zero-shot setting with multiple entities and relations.
To keep OpenAI API costs under control, we randomly select 1,000 samples from the corresponding test set according to the proportion of samples in each relation class. Specifically, the number of samples corresponding to each relation in FewRel is the same. Note that 78.56% of the samples in the TACRED test set belong to the NoTA relation, while the figure is 57.91% for Re-TACRED. We provide the statistics of the datasets in Appendix A.

Baselines
Zero-shot Baselines For FewRel and Wiki-ZSL, we choose R-BERT (Wu and He, 2019), ESIM (Chen et al., 2017b), CIM (Rocktäschel et al., 2016) and ZS-BERT (Chen and Li, 2021) as the zero-shot RE baselines. Note that RelationPrompt (Chia et al., 2022) uses seq2seq-based models to generate pseudo data of unseen relations to fine-tune the model, and the recent method RE-Matching (Zhao et al., 2023) requires elaborate relation descriptions to achieve superior performance but its source code is not yet available. These two methods are therefore not discussed in this study.

LLMs Baselines
We investigate open-source LLMs such as GPT-J (Wang and Komatsuzaki, 2021), BLOOM (Scao et al., 2022) and T0 (Sanh et al., 2022) with SUMASK prompting, and compare them with other state-of-the-art models in zero-shot settings. For the parameter scale, we choose GPT-J-6B, BLOOM-7.1B and T0pp-11B for experiments. For ChatGPT (Ouyang et al., 2022), we use "gpt-3.5-turbo-0301", the most capable GPT-3.5 model, optimized for chat. We denote the combinations of ChatGPT with the two prompts as VANILLA and SUMASK for brevity. Similar to Kojima et al. (2022), after the model outputs a text, our method picks up only the part of the answer text that first satisfies the answer format. The implementation details are provided in Appendix B.

Table 1: Main results on FewRel and Wiki-ZSL. To reduce the effect of experimental noise, the unseen-label selection process is repeated with five different random seeds to produce the test set. The results of the baselines are retrieved from Chen and Li (2021).

Main Results
The results obtained by varying m unseen relations on FewRel and Wiki-ZSL are summarized in Table 1. Generally, LLMs with zero-shot prompting achieve competitive results compared to existing zero-shot RE methods on both datasets across different numbers of unseen relations. For the NoTA relation results (Table 3), we make two observations. First, SUMASK prompting is not prominent on the 41 semantic relations, which demonstrates that the high micro-F1 score is mainly due to the high proportion of the NoTA relation. Second, like QA4RE, SUMASK also achieves significant improvements on NoTA-included metrics compared with the small LM-based NLI methods. For VANILLA, excluding the NoTA relation brings better results. This further demonstrates the sensitivity of prompting and the effectiveness of the proposed prompting.
Relation-Specific Analysis Due to biases acquired during pre-training, LLMs have different abilities to understand different relations, which leads to varying extraction results. We analyze these performance differences through experiments on the 80-relation dataset FewRel. Specifically, under the SUMASK framework, we ask the LLMs whether the answer to the question generated from the gold triple is "yes", and adopt accuracy to evaluate the performance on each relation. Finally, we select the 10 relations with the best and worst performance, as shown in Figure 4. Surprisingly, the accuracy difference between the best ("voice type") and the worst ("language of work or name") relation is 84.4%. We provide the detailed analysis in Appendix C. The semantic similarity between relations in the embedding space greatly impacts zero-shot RE performance. Following Chen and Li (2021), we select five semantically distant relations and five relations with similar semantics to evaluate our baselines, as illustrated in Figure 3. Obviously, dissimilar relations lead to better results. First, when enhanced by SUMASK prompting, LLMs deliver more stable results because of the smaller performance gap between the three settings. Second, the text-entailment-based methods are less affected by similar relations than embedding-based models such as R-BERT and ZS-BERT, because the predictions of the text-entailment-based methods ESIM and CIM, and of SUMASK prompting, do not resort to similarity search.

Prompt Strategy Analysis
We study the effectiveness of the proposed SUMASK prompting with an ablation study on summarization generation, question generation and uncertainty estimation. Specifically, we omit the summarization process, replace the LLM-generated questions with pre-defined question templates (Appendix D), and randomly select the relation from the candidates without uncertainty estimation, respectively. Table 5 shows the ablation results. Summarization consistently improves the overall performance under different settings, which indicates that incorporating reasoning steps before predicting the relation is reasonable. Compared to pre-defined templates, LLM-generated questions are not necessarily the best choice: manually designed templates make the semantic description of relations more accurate, but our simple method is convenient and requires no external interference. Uncertainty estimation has a significant impact on performance. Note that SUMASK still achieves 76.3% F1 on TACRED without it, because uncertainty estimation theoretically has no impact on the NoTA relation.
To understand the rationality of uncertainty estimation, we select 500 samples from each dataset to illustrate the correlation between uncertainty estimation and the ground truth. Intuitively, gold relations should have relatively low uncertainty. Figure 4 shows that most gold relations correspond to low uncertainty, while only a few correspond to large uncertainty, which is consistent with our intuition.

Overlapping Relation Extraction Results
The overlapping relation extraction results are illustrated in Table 6. VANILLA prompting struggles to handle overlapping relations, as LLMs tend to output only one relation. In contrast, SUMASK prompting is transferable and consistent across LLMs of different sizes. Note that, different from the relation classification results, the recall of SUMASK is higher than its precision, because regarding all the candidate relations as predictions brings many false positives. Setting thresholds on the uncertainty estimates might be a feasible solution.

Table 6: Main results on NYT (precision, recall and F1, followed by F1 on sentences containing N relational triples).

Model    P     R     F1    N=1   N=2   N=3   N=4   N≥5
GPT-J    31.4  53.6  39.6  39.5  36.9  39.3  39.8  30.3
BLOOM    35.2  56.9  43.5  39.2  38.4  42.1  44.6  33.6
T0       39.6  57.3  46.8  44.3  41.9  47.2  48.3  35.7
VANILLA  20.4  16.5  18.2  31.3  22.7  16.3  12.8   7.7
SUMASK   55.7  78.3  65.1  65.6  62.5  66.7  70.8  59.5
To further study the capability of LLMs in extracting overlapping relations, we conduct experiments on different types of sentences. We split the sentences into five classes and use N to denote the number of relational triples in a sentence. Again, SUMASK prompting achieves good performance over all five classes. Overlapping relational triples fall into four patterns: SEP (SingleEntityPair), NEO (NoEntityOverlap), SEO (SingleEntityOverlap) and EPO (EntityPairOverlap). Figure 5 shows that the performance of most baselines decreases across the four patterns, reflecting the increasing difficulty of extracting relational triples from sentences with different overlapping patterns and encouraging more consistent and effective methods in future research.

Conclusion
This work provides a comprehensive study of zero-shot RE with prompting-based LLMs. Besides VANILLA prompting, we introduce a novel SUMASK prompting to fully explore the power of LLMs. Our experiments on six benchmarks demonstrate the capability of LLMs in zero-shot RE and allow us to answer the aforementioned research questions. Recent prompt techniques such as CoT significantly improve zero-shot RE prompting. Properly instructed LLMs not only deliver competitive or superior results compared to state-of-the-art relation classification models, but are also promising for zero-shot overlapping RE.

Limitations
We only carry out comprehensive experiments on zero-shot RE, without few-shot or domain-specific exploration. It is still unclear what the capabilities of LLMs are on domain-specific datasets and how much performance could be improved by few-shot prompting. Our limited budget also restricted our study to a small set of prompt styles. A larger prompt design search space could possibly narrow the gap between model fine-tuning and LLM in-context learning.

B Implementation Details
For the hyper-parameters of SUMASK prompting, we set the number of generated answers k to 5, and we use the "bert-large-nli-mean-tokens" version of the Sentence-BERT encoder to generate the answer representations. For the open-source LLMs GPT-J, BLOOM and T0, we set the maximum generated length to 128 and the temperature to 0.3. Note that the results of VANILLA prompting with open-source LLMs are not discussed in our paper because they all achieve near-zero performance. Due to the high noise content in the output of these LLMs, we pick up only the part of the answer text that first satisfies the answer format to alleviate unexpected behaviors. For gpt-3.5-turbo-0301, we set the maximum length to 256 and the temperature to 0.7 according to the official default settings. We treat the outputs of ChatGPT as valid results without post-processing.
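Picking up the first part of the answer text that satisfies the answer format can be done with a simple pattern match; the regex below is a sketch of this post-processing, not the paper's exact rule:

```python
import re
from typing import Optional

def first_yes_no(text: str) -> Optional[str]:
    """Return the first standalone 'yes' or 'no' token in a model output, if any."""
    m = re.search(r"\b(yes|no)\b", text, flags=re.IGNORECASE)
    return m.group(1).lower() if m else None

assert first_yes_no("Yes, Gaetano Savi works in Botany.") == "yes"
assert first_yes_no("The answer is no.") == "no"
assert first_yes_no("Cannot be determined.") is None  # 'no' inside a word is ignored
```

The word boundaries matter: without them, noisy completions containing words like "cannot" or "yesterday" would be miscounted in the majority vote.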

C Relation Specific Analysis
We provide accuracy results for all relations in FewRel, summarized in Table 9. We also provide several case studies to analyze the performance of ChatGPT on different relations.
"language of work or name" We observe two important reasons for the poor performance on this relation, shown in Table 10. On the one hand, ChatGPT sometimes misunderstands the semantics of entities or relations, which leads to generated questions deviating from the originally expressed meaning. For example, Elizabeth is an English female name, but ChatGPT treats the name as a person (Case 2), which also indicates a drawback of this method: the generated questions might be unexpected without providing the context or a template. This also highlights the importance of incorporating entity types into relation extraction. On the other hand, prompting is a brittle process wherein small modifications to the prompt can cause large variations in model predictions. For example, we use "Answer the question from context" rather than "Answer the question from context with yes/no", expecting ChatGPT not only to give a "yes/no" answer but also to provide the reason for its judgment. We achieve the expected results for most relations. However, the answers for the relation "language of work or name" frequently do not contain "yes/no" while still expressing a positive judgment (Case 3), which makes automatic evaluation difficult. Therefore, the specific form of the stated answer is important; otherwise it may lead to unreasonable evaluation. Moreover, the annotation errors (Case 4) of this relation also lead to biased and unreliable evaluation.
"tributary" and "mouth of the watercourse" ChatGPT performs poorly on both relations. First, the two relations have very similar semantics because they are reciprocal in FewRel, as shown in Case 1. Unfortunately, ChatGPT frequently reverses the subject and object of these two relations during the question generation step. Specifically, ChatGPT treats the triple (lisava river, tributary, natra river) as (lisava river, tributary of, natra river). This phenomenon highlights the advantage of manually crafted question templates and motivates us to provide few-shot demonstrations to generate reliable questions. Second, we also find plenty of annotation errors for the two relations because of distant supervision, shown in Case 2. ChatGPT provides the correct judgments and reasons for this case, which suggests that LLMs could become reliable annotation inspectors.

D Question Templates
In the ablation study, we replace the LLM-generated questions with pre-defined question templates. For FewRel, we simply define a template of the form "The relation between 'subject' and 'object' is 'relation name'. Yes or No?". For TACRED, we follow Zhang et al. (2023) and use the templates shown in Table 12.
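Instantiating such a template is plain string substitution; a minimal sketch (the example relation and entities are illustrative):

```python
def fill_template(template: str, subj: str, obj: str) -> str:
    """Turn a pre-defined yes/no template into a concrete question.
    Assumes the literal placeholders 'subject' and 'object' do not occur
    inside the entity names themselves."""
    return template.replace("subject", subj).replace("object", obj)

tacred_spouse = "subject is the spouse of object, Yes or No?"  # per:spouse template
assert fill_template(tacred_spouse, "Alice", "Bob") == \
    "Alice is the spouse of Bob, Yes or No?"
```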

E Discussions
The empirical validation of SUMASK prompting The operational mechanism of SUMASK prompting relies on the strong capability of LLMs as zero-shot reasoners. Following the prompt instructions, LLMs attend to the entities of interest, then infer and summarize the relations between them. To the best of our knowledge, the logical reasoning faculties of LLMs have not been explicitly utilized in previous work on zero-shot relation extraction: from raw text inputs to extracted relation labels, that process lacks intermediate reasoning steps. We decompose zero-shot relation extraction into three steps to sensitize LLMs to semantic understanding and logical reasoning. With a proper instruction for summarization, LLMs are able to perform logical reasoning on specific entities and obtain the relations between them. LLMs can do this automatically via prompting, whereas small fine-tuned models cannot. SUMASK elicits the logical reasoning ability inside LLMs for relation deduction. The ablation results also show that without the summarization step, the overall performance drops by 2.5%-7.7% F1 on FewRel and TACRED under different settings. The experimental results on overlapping relation extraction also demonstrate the superiority of SUMASK prompting over VANILLA prompting.
Here we provide a case study on NYT. The input prompt is: Summarize the relations between "Cambodia" and "Penh" from context. Context: Homage to Cambodia was performed at Chaktomuk Conference Hall in Phnom Penh on Oct. 21, attended by the king. Summarization: The response from ChatGPT is: "Phnom Penh" is the capital city of "Cambodia," where the event "Homage to Cambodia" took place at the "Chaktomuk Conference Hall" on October 21. The first sentence generated by the LLM clearly elucidates the relationship between the two entities, facilitating the subsequent steps.
The advantages of SUMASK prompting Generally, SUMASK prompting does not require any prompt engineering or template writing to start using, which is one of our contributions. In particular, not all relations can be accurately described by templates, because the description of a relation may vary across different entities. For instance, the relation language of work or name in the dataset FewRel is hard to describe with a single template: it can describe the language versions of literary works, but also the language a name belongs to. Consider the following two templates: (1) The language of subject is object. and (2) subject is a name in object. The triple (Elizabeth, language of work or name, English) satisfies the second template but causes confusion with the first, while the triple (The Lord of the Rings, language of work or name, English) only satisfies the first template.
The complexity of SUMASK prompting Typically, SUMASK prompting suffers from relatively high inference complexity, as it needs to enumerate all possible triples to obtain summarizations, questions and answers k times each. Suppose we have n samples to be extracted and r candidate relations; the total complexity is O(k × r × n). First, using entity types to discard the most irrelevant relations is useful, reducing the complexity to O(k × r′ × n), where r′ denotes the maximum size of the mapped relation candidate set and is much smaller than r. Second, we can ask about multiple samples (within the model's maximum length) in one LLM call to improve inference speed. Suppose we concatenate k samples in a prompt; the complexity then becomes O(r × n). More efficient zero-shot prompting for relation extraction is worth exploring in future work.
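The cost argument above can be checked with a quick calculation; the concrete numbers (n, r, the mapped candidate set size) are illustrative assumptions:

```python
def num_llm_calls(n: int, r: int, k: int, batch: int = 1) -> int:
    """SUMASK chain invocations: k sampled chains per (sample, candidate relation),
    with `batch` samples packed into each prompt."""
    return (k * r * n) // batch

n, r, k = 1000, 80, 5   # samples, candidate relations, sampled chains per pair
r_mapped = 10           # hypothetical max size of the mapped relation candidate set

full = num_llm_calls(n, r, k)                       # O(k * r * n)
with_mapping = num_llm_calls(n, r_mapped, k)        # entity-relation mapping applied
batched = num_llm_calls(n, r_mapped, k, batch=k)    # k samples per prompt -> O(r * n)

assert full == 400_000
assert with_mapping == 50_000
assert batched == 10_000
```

Even under these rough assumptions, the two optimizations together cut the call count by more than an order of magnitude.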
Uncertainty estimation in all LLMs Uncertainty estimation is based on the assumption that the outputs of LLMs are more stable when the predictions match the ground truth. Using logits to compute the probabilities of intermediate results is relatively inefficient, because we would need the probability of the token at each position. Moreover, it is highly susceptible to extreme values: if a token with a low probability is sampled during the sampling process, it affects the probability value of the entire sentence. In addition, for long generated sequences, the differences in probability values between generated sentences are small, making it difficult to choose the relation with the highest probability. Using SUMASK with uncertainty estimation brings better outcomes, and we show this technique is suitable for both white-box (e.g., BLOOM) and black-box (e.g., ChatGPT) models.

Figure 1 :
Figure 1: Illustration of the VANILLA prompting. The output of LLMs is highlighted in color.

Figure 2 :
Figure 2: Illustration of the SUMASK prompting. The outputs of LLMs are highlighted in color. The probability of relation "residence" is 0 because the system answers "no" via majority vote. To estimate the uncertainty of relation "field of work", we generate k [SUMMARIZATION], [QUESTION] and [ANSWER] representations, respectively, and calculate the dispersion degree among these representations to approximate the uncertainty.

Figure 3 :
Figure 3: Performance comparison between five similar, random and dissimilar relations.

Figure 4 :
Figure 4: The correlations between uncertainty estimation and ground truth.

Figure 5 :
Figure 5: F1-score of extracting overlapping relations from sentences with different overlapping patterns.
The inferior performance of previous zero-shot methods (Zhang et al., 2023) can be largely attributed to their inability to handle the NoTA relation. To this end, we provide an evaluation of zero-shot methods on the NoTA relation. Following previous work (Sainz et al., 2021; Zhang et al., 2023), we report the NoTA-excluded micro-F1 and NoTA-included macro-F1 to investigate the ability to extract normal and NoTA relations. The NoTA relation results are shown in Table 3.

Table 5 :
Results of ablation study.

Table 9 :
Accuracy of each relation in FewRel.

Table 12: Question templates for TACRED.

| Relation | Template |
| per:stateorprovince_of_death | subject died in the state or province object, Yes or No? |
| per:title | subject is a object, Yes or No? |
| org:member_of | subject is the member of object, Yes or No? |
| per:other_family | subject is the other family member of object, Yes or No? |
| org:country_of_headquarters | subject has a headquarter in the country object, Yes or No? |
| org:parents | subject has the parent company object, Yes or No? |
| per:stateorprovince_of_birth | subject was born in the state or province object, Yes or No? |
| per:spouse | subject is the spouse of object, Yes or No? |
| per:origin | subject has the nationality object, Yes or No? |
| per:date_of_birth | subject has birthday on object, Yes or No? |
| per:schools_attended | subject studied in object, Yes or No? |
| org:members | subject has the member object, Yes or No? |
| org:founded | subject was founded in object, Yes or No? |
| per:stateorprovinces_of_residence | subject lives in the state or province object, Yes or No? |
| per:date_of_death | subject died in the date object, Yes or No? |
| org:shareholders | subject has shares hold in object, Yes or No? |
| org:website | subject has the website object, Yes or No? |
| org:subsidiaries | subject owns object, Yes or No? |
| per:charges | subject is convicted of object, Yes or No? |
| org:dissolved | subject dissolved in object, Yes or No? |
| org:stateorprovince_of_headquarters | subject has a headquarter in the state or province object, Yes or No? |
| per:country_of_birth | subject was born in the country object, Yes or No? |
| per:siblings | subject is the siblings of object, Yes or No? |
| org:top_members/employees | subject has the high level member object, Yes or No? |
| per:cause_of_death | subject died because of object, Yes or No? |
| per:alternate_names | subject has the alternate name object, Yes or No? |
| org:number_of_employees/members | subject has the number of employees object, Yes or No? |
| per:cities_of_residence | subject lives in the city object, Yes or No? |
| org:city_of_headquarters | subject has a headquarter in the city object, Yes or No? |
| per:children | subject is the parent of object, Yes or No? |
| per:employee_of | subject is the employee of object, Yes or No? |
| org:political/religious_affiliation | subject has political affiliation with object, Yes or No? |
| per:parents | subject has the parent object, Yes or No? |
| per:city_of_birth | subject was born in the city object, Yes or No? |
| per:age | subject has the age object, Yes or No? |
| per:countries_of_residence | subject lives in the country object, Yes or No? |
| org:alternate_names | subject is also known as object, Yes or No? |
| per:religion | subject has the religion object, Yes or No? |
| per:city_of_death | subject died in the city object, Yes or No? |
| per:country_of_death | subject died in the country object, Yes or No? |
| org:founded_by | subject was founded by object, Yes or No? |