Chain of Thought with Explicit Evidence Reasoning for Few-shot Relation Extraction

Few-shot relation extraction involves identifying the type of relationship between two specific entities within a text, using only a limited number of annotated samples. A variety of solutions to this problem have emerged by applying meta-learning and neural graph techniques, which typically necessitate a training process for adaptation. Recently, the strategy of in-context learning has demonstrated notable results without the need for training. A few studies have already utilized in-context learning for zero-shot information extraction. Unfortunately, the evidence for inference is either not considered or only implicitly modeled during the construction of chain-of-thought prompts. In this paper, we propose a novel approach for few-shot relation extraction using large language models, named CoT-ER, chain-of-thought with explicit evidence reasoning. In particular, CoT-ER first induces large language models to generate evidence using task-specific and concept-level knowledge. This evidence is then explicitly incorporated into chain-of-thought prompting for relation extraction. Experimental results demonstrate that our CoT-ER approach (with 0% training data) achieves competitive performance compared to the fully-supervised (with 100% training data) state-of-the-art approach on the FewRel 1.0 and FewRel 2.0 datasets.


Introduction
Relation extraction (RE) aims at identifying the relation between two given entities based on contextual semantic information (Cardie, 1997; Bach and Badaskar, 2007; Pawar et al., 2017). However, the performance of RE models often degrades significantly when labeled data is insufficient. The few-shot relation extraction (FSRE) task must therefore be addressed with a limited amount of annotated training data (Han et al., 2018; Gao et al., 2019b; Brody et al., 2021). Recently, numerous researchers have tackled this problem by employing meta-learning and neural graph techniques (Fangchao et al., 2021; Dou et al., 2022; Li and Qian, 2022; Zhang et al., 2021; Li et al., 2023). These methods have achieved satisfying results by meta-training the model on a large dataset or by incorporating external knowledge.
More recently, pre-trained Large Language Models (LLMs), such as GPT-series models, have exhibited significant in-context learning capabilities (Brown et al., 2020; Min et al., 2022), achieving promising results across many NLP tasks. These findings suggest that LLMs can effectively perform various tasks without the need for parameter optimization, a concept known as In-context Learning (Dong et al., 2022). Within the paradigm of in-context learning, LLMs demonstrate competitive performance compared to standard fully-supervised methods across many NLP tasks, even with just a few examples provided as few-shot demonstrations in the prompt (Wang et al., 2022).
Furthermore, the chain-of-thought prompting method (Wei et al., 2022; Kojima et al., 2022; Zhang et al., 2022) elicits an impressive reasoning capability from the LLM on mathematical problems and commonsense reasoning. In the RE task, an analogous reasoning process could guide the LLM in determining the relation label, yet research in this direction is still lacking. GPT-RE (Wan et al., 2023) introduces a golden-label-induced reasoning method that prompts the LLM to generate a suitable reasoning process solely based on the given golden label, but the performance improvement from this auto-generated reasoning process is marginal compared to its meticulously designed approach for few-shot demonstration retrieval.
We argue that the one-step reasoning process generated by the LLM does not fully unleash its potential: (1) Previous studies and our experiments indicate that the one-step auto-generated reasoning process does not emphasize the higher-level abstraction of entity types, specifically the concept-level entities, which has been proven to be beneficial for the FSRE task (Hao et al., 2019; Zhang et al., 2021). For instance, the relation between two location entities should not be categorized under a relation type between human beings. (2) Due to the huge amount of pre-training data, the LLM already possesses a considerable knowledge base (Petroni et al., 2019; Roberts et al., 2020), which can be beneficial when the LLM encounters an FSRE task. (3) The quality of the semantic representation of the relation label is not crucial in the fully-supervised setting, but in-context learning is sensitive to the relation label. For instance, given the relation labels {mother, child, sport} in FewRel 1.0 (Han et al., 2018), the labels "mother" and "child" would confuse the LLM without appropriate prompt design, as the primary distinction between these two relations is whether the parent entity is positioned as the head or tail entity. Moreover, the word "sport" barely contains enough relational information for the LLM to perform the RE task. We call this issue the semantic ambiguity of relation labels.
To this end, this paper presents a novel chain-of-thought prompting method for the FSRE task, Chain-of-Thought with Explicit Evidence Reasoning, which achieves competitive results compared to the state of the art on FewRel 1.0 and FewRel 2.0. Our method employs a 3-step reasoning approach to address the aforementioned issues. In the first and second steps, CoT-ER requires the LLM to output the concept-level entities corresponding to the head and tail entities, which serve as the foundation for RE-specific reasoning. In the third step, CoT-ER prompts the LLM to extract the relevant contextual spans as evidence that explicitly establishes a specific relationship between these two entities. By combining the head entity, tail entity, and relation label into a coherent sentence, the LLM can determine the relation label between two given entities on a more semantic level, addressing the issue of semantic ambiguity of relation labels in prompting methods. Figure 1 demonstrates the difference between Auto-CoT and our CoT-ER.

Related Work
Few-shot Relation Extraction. Few-shot relation extraction aims at predicting semantic relations between the head and tail entities indicated in a given instance based on a limited amount of annotated data. FewRel, a large-scale dataset introduced by Han et al. (2018), was the first to explore few-shot learning in relation extraction. Many approaches (Qu et al., 2020; Yang et al., 2021; Zhang et al., 2021) incorporate external knowledge to improve performance given the scarcity of training data. Another line of FSRE research (Gao et al., 2019a; Han et al., 2021; Liu et al., 2022b) relies solely on the input text and the provided relation description information, without incorporating external knowledge. Most previous methods adopt complicated neural network designs or introduce external knowledge, which can be labor-intensive in realistic scenarios.
In-context Learning. GPT-3 in-context learning (ICL) (Brown et al., 2020; Dong et al., 2022) has emerged as a novel paradigm in NLP and demonstrates competitive performance across various tasks when compared to fine-tuned models. It is much easier to introduce prior knowledge into LLMs by incorporating relevant text information into the prompt (Liu et al., 2022a; Lu et al., 2022; Wei et al., 2022). Furthermore, ICL is a training-free approach that directly prompts the LLMs, which means it is a ready-to-use method and can be easily applied to various tasks with a few demonstrations in the prompt.
Recently, most researchers have focused on the demonstration design of ICL to improve performance on NLP tasks, and this work has gradually developed into two categories (Dong et al., 2022). The first line of demonstration design seeks an optimal arrangement of the few-shot demonstrations in the prompt by selecting instances from the dataset (Liu et al., 2022a; Valmeekam et al., 2022; Wu et al., 2022; Wan et al., 2023) and ordering the selected demonstration examples (Lu et al., 2022; Liu et al., 2022a; Ma et al., 2023). Another line of demonstration design aims to discover an effective prompting method to unleash the potential of LLMs. Several studies (Honovich et al., 2022; Zhou et al., 2022) find that LLMs can generate the task instruction automatically. Furthermore, Wei et al. (2022) revealed the reasoning ability of LLMs by manually adding intermediate reasoning steps before giving the answer, a technique called chain-of-thought (CoT). Additionally, Kojima et al. (2022) show that by simply adding "Let's think step by step" before each answer, the LLM can perform zero-shot reasoning without manually annotated data. Based on this discovery, Zhang et al. (2022) proposed Auto-CoT, replacing the manually written reasoning process in CoT with a reasoning process automatically generated by the LLM.
Despite the CoT prompting method achieving promising results in many NLP tasks, it still lacks relevant exploration for RE. Hence, in this paper, we propose a novel CoT prompting method called CoT-ER to fill this gap.
True Few-Shot Learning. Perez et al. (2021) argue that prior research has achieved promising results by choosing prompts or tuning other hyperparameters on a large development set, for example by selecting few-shot demonstrations from a large training set, which does not truly demonstrate the few-shot learning capability of LLMs. Many works (Logan IV et al., 2022; Schick and Schütze, 2022; Lu et al., 2022; Jimenez Gutierrez et al., 2022) have adopted a stricter setting to obtain a more accurate measure of few-shot performance. We also adopt this setting in this paper.

Problem Formulation
Definition of Relation Extraction. Let x denote the input sentence and e_sub, e_obj denote the pair of subject and object entities in the given sentence. The RE task aims to identify the relation label r between the marked entities in sentence x, where r is an element of a predefined set of relations R.

Definition of Few-shot Relation Extraction.
Given an N-way K-shot RE task, the goal is to predict the relation for each instance in the query set based on the support set. The relation label set R consists of N types of relations. For each r ∈ R, the support set S_r includes K instances, represented as {s_r^1, s_r^2, ..., s_r^K}. The query set Q comprises the test input instances for each r ∈ R.
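The N-way K-shot setup above can be sketched as episode sampling. This is a minimal illustration; the data layout (a `{relation: [instances]}` dict with string instances) is a hypothetical stand-in, not FewRel's actual file format.

```python
import random

def sample_episode(dataset, N, K, queries_per_relation=1, seed=0):
    """Sample one N-way K-shot episode from a {relation: [instances]} dict.

    Illustrative sketch: each instance can be any object; in FewRel an
    instance would carry the text plus marked head and tail entities.
    """
    rng = random.Random(seed)
    relations = rng.sample(sorted(dataset), N)            # label set R, |R| = N
    support, query = {}, {}
    for r in relations:
        picked = rng.sample(dataset[r], K + queries_per_relation)
        support[r] = picked[:K]                           # S_r = {s_r^1, ..., s_r^K}
        query[r] = picked[K:]                             # held-out test inputs
    return relations, support, query
```

Because support and query instances are drawn without replacement, a query instance never appears among its own few-shot examples.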
Since N and K are usually quite small, predicting relations in query instances with limited labeled data presents a significant challenge. Previous studies tackled this problem by training a data-efficient network, specifically via meta-learning-based methods. In the subsequent sections, we discuss a training-free method that addresses this problem by leveraging the reasoning ability of the LLM.

Overview
An overview of our proposed CoT-ER is shown in Figure 2, which consists of three components: (1) the Human-Instructed Reasoning Module, which aims to associate a reasoning process with each instance from the support set by prompting the LLM with human-annotated data; (2) a Similarity-Based KNN Retrieval Module, which selects instances with reasoning processes from the support set based on their similarity to the query instance, and these serve as few-shot demonstrations in the ultimate prompt; (3) the Inference Module, which predicts the relation label of a query instance by instructing the LLM with the ultimate prompt, a concatenation of the task instruction, few-shot demonstrations, and a question about the instance.
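The interaction between the three modules can be sketched as glue code. The function arguments here are hypothetical stand-ins for the components described above; the internals of each module are covered in the following subsections.

```python
def cot_er_pipeline(query, support_set, generate_rationale, retrieve, infer):
    """Glue for the three CoT-ER modules in Figure 2 (illustrative sketch)."""
    # (1) Human-Instructed Reasoning: pair every support instance with an
    #     evidence reasoning process produced by the LLM.
    candidates = [(inst, generate_rationale(inst)) for inst in support_set]
    # (2) Similarity-based KNN retrieval: keep the candidates most similar
    #     to the query as few-shot demonstrations.
    demonstrations = retrieve(query, candidates)
    # (3) Inference: build the ultimate prompt and ask the LLM for the label.
    return infer(query, demonstrations)
```

Any callable with the right shape can be plugged in, which is convenient for swapping, say, the retrieval strategy without touching the rest of the pipeline.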

Human-Instructed Reasoning Module
Since the LLM has the ability of in-context learning (Brown et al., 2020), we propose a human-instructed approach to guide the LLM in performing accurate reasoning using a minimal amount of annotated data.
CoT-ER Designing. To fully leverage the knowledge stored in the LLM and facilitate step-by-step reasoning, we introduce a novel 3-step reasoning framework with concept-level knowledge and explicit evidence. In Step 1, the LLM infers concept-level knowledge related to the head entity, while Step 2 does the same for the tail entity. Through these steps, the LLM can easily exclude options with incorrect concept entities. In Step 3, to determine which relation label best fits this pair of entities within the given context, we explicitly highlight relevant text spans as evidence and subsequently construct a coherent expression that combines the two entities and the relation label. Table 6 shows an example with the relation label "crosses". To further illustrate our 3-step reasoning process, the few-shot demonstrations in Figure 3 show the template of this reasoning process.
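The 3-step rationale can be rendered with a simple template. The phrasing below mirrors the "crosses" example in Table 6; it is an illustrative template, not the paper's verbatim prompt wording.

```python
def cot_er_rationale(head, head_concept, tail, tail_concept, evidence, relation):
    """Render a 3-step CoT-ER reasoning text (illustrative template)."""
    return (
        # Step 1: concept-level abstraction of the head entity.
        f'1. Subject entity "{head}" refers to the entity of {head_concept} in the context.\n'
        # Step 2: concept-level abstraction of the tail entity.
        f'2. Object entity "{tail}" refers to the entity of {tail_concept} in the context.\n'
        # Step 3: explicit evidence span, then a coherent head-relation-tail sentence.
        f'3. According to the context, "{evidence}" indicates that "{head}" {relation} "{tail}". '
        f'So, the relation between "{head}" and "{tail}" is "{relation}".'
    )
```

Note how Step 3 verbalizes the label inside a full sentence, which is exactly what counters the semantic ambiguity of bare labels such as "crosses".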
CoT-ER Generating. We annotate one CoT-ER reasoning example for each relation class in the dataset to serve as seed examples. We then design an appropriate prompt that uses the annotated example as a few-shot demonstration to instruct the LLM in generating similar reasoning steps for each support instance. Each support instance with its CoT-ER reasoning steps is appended to the candidate set. Figure 3 shows a similar prompt designed for the Human-Instructed Reasoning Module.

Instance Retrieval Module
Several studies (Liu et al., 2022a; Lu et al., 2022; Wan et al., 2023) suggest that selecting few-shot demonstrations based on similarity yields strong improvements in in-context learning. Wan et al. (2023) achieved promising performance in RE by employing a task-specific fine-tuned model as the encoder, which means their approach does not fit the true few-shot setting mentioned in §2. Moreover, this advantage diminishes rapidly as the number of candidates decreases.
Because of the limited input tokens of the LLM, a single prompt may not hold all support instances of an N-way K-shot task. For instance, in the 10-way 5-shot case, the 50 candidate samples cannot all be appended to one single prompt. In this paper, we follow the similarity-based method for selecting few-shot demonstrations. To obtain a relation-specific similarity representation, we first reconstruct the input text as "Context: [text] Given the context, what is the relation between "[head entity]" and "[tail entity]"?" by incorporating entity-level information. Then, we utilize the GPT-series model "text-embedding-ada-002" as the encoder to obtain semantic embeddings. Subsequently, we compute the Euclidean distance between each instance in the candidate set and the query instance. Finally, M instances from the candidate set are selected as few-shot demonstrations based on their lower Euclidean distance to the query instance. Intuitively, we aim to provide as much information as possible for the LLM, so we follow the principle of filling the context window to increase M as much as possible.
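The retrieval step above can be sketched as follows. In the paper the vectors come from text-embedding-ada-002; here they are passed in precomputed so the selection logic stays self-contained, and the function names are our own.

```python
import math

def build_retrieval_text(text, head, tail):
    # Relation-aware reconstruction of the input, as described above.
    return (f'Context: {text} Given the context, what is the relation '
            f'between "{head}" and "{tail}"?')

def select_demonstrations(query_vec, candidates, M):
    """Pick the M candidates closest to the query in embedding space.

    `candidates` is a list of (instance, vector) pairs whose vectors are
    assumed to be precomputed by the chosen embedding model.
    """
    def euclidean(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    ranked = sorted(candidates, key=lambda c: euclidean(query_vec, c[1]))
    return [inst for inst, _ in ranked[:M]]
```

Since only the relative ordering of distances matters, squared Euclidean distance would give the same selection without the square root.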

Inference Module
To create the ultimate prompt, we simply concatenate a task instruction, few-shot demonstrations, and a question tailored to the query instance, using the support instances with CoT-ER reasoning as the few-shot demonstrations. Figure 3 shows the framework of the ultimate prompt. It is worth noting that LLMs have a strong inclination to wrongly output NULL in a general setting (Wan et al., 2023; Jimenez Gutierrez et al., 2022). Here, we force the LLM to select one of the provided relation labels, as we do not consider the "None-of-the-Above" scenario in the FewRel dataset (Han et al., 2018; Gao et al., 2019b).
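The prompt assembly and the in-set-label constraint can be sketched as below. The substring-matching heuristic for mapping a completion back to a provided label is our assumption; the paper does not spell out its exact enforcement rule.

```python
def build_ultimate_prompt(instruction, demonstrations, query_question):
    # Task instruction + retrieved CoT-ER demonstrations + query question,
    # concatenated in order as described above.
    return "\n\n".join([instruction, *demonstrations, query_question])

def force_in_set_label(completion, labels):
    """Map a raw LLM completion onto one of the provided relation labels,
    so that NULL or out-of-set answers are never returned.

    Simple illustrative heuristic: return the first provided label that
    appears (case-insensitively) in the completion; otherwise fall back
    to an in-set label rather than NULL.
    """
    for label in labels:
        if label.lower() in completion.lower():
            return label
    return labels[0]
```

A fallback like this guarantees a valid answer for every query, matching the no "None-of-the-Above" assumption in FewRel.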

Dataset For Few-shot Relation Extraction
There are two standard few-shot relation extraction datasets: FewRel 1.0 (Han et al., 2018) and FewRel 2.0 (Gao et al., 2019b). FewRel 1.0 is constructed from Wikipedia and consists of 70,000 sentences annotated with 100 relation labels; these 100 relation labels are divided into 64/16/20 splits for the train/validation/test sets. FewRel 2.0 extends FewRel 1.0 by introducing additional validation and test sets from the medical domain, which include 10 relation labels with 1,000 instances and 15 relation labels with 1,500 instances, respectively. Note also that FewRel 1.0 provides a description of each relation label in the dataset, but FewRel 2.0 does not. This difference is a crucial factor to consider when designing the seed CoT-ER reasoning processes.

Implementation Details
For the LLM, we select "text-davinci-003" and obtain responses from GPT by calling the OpenAI API with the parameter temperature = 0.
In a realistic scenario, it is reasonable to perform the RE task directly with fixed, manually annotated examples as few-shot demonstrations for each relation label. To this end, we evaluate performance by selecting the few-shot demonstrations from a predetermined human-annotated CoT-ER dataset (the seed examples), a setting we denote Manual-CoT-ER. In this setting, the few-shot demonstrations are independent of the support set, meaning the LLM performs the RE task using a smaller amount of annotated data. In contrast, Auto-CoT-ER uses the auto-generated CoT-ER reasoning processes based on the support set as the few-shot demonstrations, as described in §3.3.
Following the standard configuration of FewRel, we conducted experiments in four settings: 5-way 1-shot, 5-way 5-shot, 10-way 1-shot, and 10-way 5-shot. Because running CoT-ER on the GPT-3 API is costly and the golden labels of the test set are not publicly available, we evaluate all LLM-based methods by sampling 100 × N test queries for each N-way K-shot task from the validation sets.

Compared Methods
We consider two categories of methods for the FSRE task. Methods with 100% training data: MTB (Baldini Soares et al., 2019), CP (Peng et al., 2020), HCPR (Han et al., 2021), FAEA (Dou et al., 2022), GTPN (Fangchao et al., 2021), GM_GEN (Li and Qian, 2022), and KEFDA (Zhang et al., 2021). Generally, these methods train a model on the FewRel 1.0 training set and evaluate performance on the FewRel 1.0 and 2.0 validation and test sets. Methods with 0% training data: To the best of our knowledge, no evaluation has been conducted under the N-way K-shot setting on the FSRE dataset FewRel using the in-context learning approach. Thus we applied Vanilla-ICL (Brown et al., 2020) and Auto-CoT (Zhang et al., 2022; Wan et al., 2023) as the baseline prompt-formatting methods. These methods use a few examples as demonstrations and prompt the LLM to perform an NLP task. Vanilla-ICL designs a template that directly combines the text and relation label, such as "Context: [text], Given the context, the relation between [head entity] and [tail entity] is [relation label]". Auto-CoT extends Vanilla-ICL with auto-generated reasoning steps. Throughout the experiments, we noticed that whether the LLM is required to perform reasoning in the final answering stage can lead to inconsistent results, so we report both results in Table 1 and Table 2. Additionally, we utilize the pre-trained BERT-base model and the GPT-series model text-embedding-ada-002 as encoders to directly obtain a representation of the input text. For each N-way K-shot task, we obtain a prototype of each class by averaging the K instances that belong to that class. The predicted label of a query instance is then the class whose prototype has the closest Euclidean distance to the query instance. We denote these two methods as Bert-proto and GPT-proto.
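The Bert-proto / GPT-proto baselines can be sketched as nearest-prototype classification; embeddings are assumed precomputed by the chosen encoder.

```python
import math

def nearest_prototype(query_vec, support_vecs_by_class):
    """Prototype baseline: average the K support embeddings of each class
    into a prototype, then assign the query to the class whose prototype
    is closest in Euclidean distance."""
    def mean(vectors):
        # Dimension-wise average of the K support vectors.
        return [sum(dim) / len(vectors) for dim in zip(*vectors)]
    def euclidean(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    prototypes = {c: mean(vs) for c, vs in support_vecs_by_class.items()}
    return min(prototypes, key=lambda c: euclidean(query_vec, prototypes[c]))
```

This baseline needs no training at all, which makes it a natural point of comparison for the other 0%-training-data methods.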

Main Results
We present our main experimental results alongside previous methods in Table 1 and Table 2. From the tables, we can observe the following. First, Auto-CoT does not demonstrate significant improvement over Vanilla-ICL in the few-shot scenario. This could be attributed to the low quality of the reasoning process and the reduced number of instances in the few-shot demonstrations due to the maximum-token limitation. Furthermore, when it comes to generating a reasoning process in the ultimate answer, Auto-CoT with reasoning outperforms the version that directly generates a relation label on FewRel 1.0, while the opposite holds on FewRel 2.0. We offer an explanation: FewRel 1.0 draws instances from Wikipedia and often requires common sense for reasoning, whereas FewRel 2.0 necessitates medical expertise, which constitutes a smaller portion of the pre-training corpus than common sense. Consequently, the LLM encounters difficulties in performing reasoning tasks in the medical domain.
Second, both Manual-CoT-ER and Auto-CoT-ER outperform the training-free baselines while using fewer instances in the few-shot demonstrations, indicating the necessity of designing a CoT prompting method tailored to the RE task in order to achieve better performance in the few-shot scenario. Third, the CoT-ER prompting method achieves competitive performance compared to the state-of-the-art fully-supervised method and surpasses the majority of fully-supervised methods with minimal manual labor on both FewRel 1.0 and FewRel 2.0. This suggests that GPT-series LLMs have the potential to beat previous fully-supervised methods when high-quality relation information and well-designed reasoning processes are provided.

Ablation Study on CoT-ER
Does the incorporation of entity information significantly benefit CoT-ER? To answer this, we conducted ablation experiments to demonstrate the necessity of the 3-step reasoning process. In this experiment, we removed the first and second steps and compared the performance with Auto-CoT reasoning. For fairness, we implemented this experiment using Auto-CoT-ER, which also employs a reasoning process auto-generated by the LLM. Due to the limitation on maximum input and output tokens, we set the number of instances in the few-shot demonstrations to 13 for the ablation experiments. The results are presented in Figure 4.
We find that: (1) after removing the first and second steps, the performance of Auto-CoT-ER declines significantly, with accuracy reductions of 3.4, 2.2, 1.8, and 2.9 points on FewRel 1.0 and 5.2, 6.0, 5.3, and 7.6 points on FewRel 2.0, respectively. This means that a higher-level abstraction of entity types, specifically the concept-level entities, is beneficial to the LLM when performing the RE task in the few-shot scenario. (2) Although the third step of CoT-ER pairs the support instance with a simpler reasoning process than Auto-CoT, it achieves superior performance in certain challenging scenarios (10-way 1-shot and the medical domain, §5.1). This finding indicates that the semantic information provided by the relation label is more beneficial to the LLM than low-quality reasoning information.

The Stability of CoT-ER
Different Random Seeds for Task Sampling. Due to the high cost of "text-davinci-003", we sample a relatively small number of queries for testing, specifically 100 × N for each N-way K-shot task. This may raise concerns that the results would not hold up when evaluated on the full test sets. To this end, we evaluated CoT-ER and Vanilla-ICL using 8 random seeds for N-way K-shot task sampling. Table 3 and Table 4 show the experimental results with mean ± standard deviation on FewRel 2.0. Notably, CoT-ER consistently outperforms Vanilla-ICL across all N-way K-shot settings, with a lower standard deviation.
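The per-seed aggregation behind the mean ± standard deviation figures can be sketched as follows; whether sample or population standard deviation is used is not stated in the paper, so sample standard deviation is assumed here.

```python
import statistics

def aggregate_over_seeds(per_seed_accuracies):
    """Mean and (sample) standard deviation of accuracies across random
    seeds, the form of aggregate reported in Tables 3 and 4."""
    return (statistics.mean(per_seed_accuracies),
            statistics.stdev(per_seed_accuracies))
```

A lower standard deviation across seeds, as observed for CoT-ER, indicates the method's accuracy is less sensitive to which episodes happen to be sampled.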
Different Numbers of Few-shot Instances. To investigate how the number of selected demonstrations contributes to the performance of CoT-ER, we conducted experiments across different M under the 5-way 5-shot setting. A single prompt can hold 13 CoT-ER reasoning demonstrations at worst, whereas all the support instances (25) can be appended to the prompt in Vanilla-ICL. The results are presented in Table 5. We observe that both CoT-ER and Vanilla-ICL benefit from more few-shot examples, which aligns with the conclusion of previous work (Liu et al., 2022a). However, the performance of Vanilla-ICL deteriorates rapidly as the number of examples decreases, while CoT-ER effectively leverages the information from the provided instances and maintains strong performance even with a reduced number of instances. This demonstrates that CoT-ER is more stable than Vanilla-ICL when the number of few-shot instances varies.

Case Study
We select one typical reasoning example generated by the LLM to better demonstrate the superiority of our prompting method CoT-ER. As shown in Table 6, this instance requires the LLM to correctly identify the relation label "crosses" between "Railway Bridge" and "Daugava". In FewRel 1.0, the relation label "crosses" is described as "obstacle (body of water, road, ...) which this bridge crosses over or this tunnel goes under".
However, Auto-CoT prompting leads to a wrong prediction in which the model labels the relation as "located in or next to body of water", which pertains to the relationship between a location entity and a water-related entity (river, lake, ...). The primary reason for the failure of Auto-CoT is the absence of a higher-level abstraction of the entities in its reasoning, which is necessary to comprehend the entities involved in the relation. CoT-ER addresses this issue by incorporating concept-level information into the reasoning process through its first and second steps. Specifically, in this case, the LLM first reasons that the subject entity corresponds to a bridge and the object entity corresponds to a river based on its own knowledge base and contextual information, thereby excluding relations that require a location entity and a water-related entity (as demonstrated by other examples in the prompt). With this clue, the LLM can effectively perform the subsequent reasoning steps.
Furthermore, the presence of both the "crosses" and "located in or next to body of water" labels in an N-way K-shot task can indeed confuse the LLM due to the lack of semantic information in these two phrases. CoT-ER addresses this issue by integrating both entities and the relation label into a coherent expression, as exemplified here by "Railway Bridge" crosses "Daugava".

Auto-CoT: The clues that lead the relation between "Railway Bridge" and "Daugava" to be "located in or next to body of water" in the sentence are the words "crosses" and "river". These words indicate that the Railway Bridge is crossing the Daugava River, which is a body of water. ✗

CoT-ER: 1. Subject entity "Railway Bridge" is a bridge name, which refers to the entity of bridge in the context. 2. Object entity "Daugava" is a river name, which refers to the entity of river in the context. 3. According to the context, "The Railway Bridge is a bridge that crosses the Daugava river." indicates that "Railway Bridge" crosses "Daugava". ✓

Table 6: We compare our prompting method CoT-ER with Auto-CoT on an example from FewRel 1.0. The context of this example is "The Railway Bridge is a bridge that crosses the Daugava river in Riga, the capital of Latvia.", and the relation between the two highlighted entities is "crosses". The key reasoning processes of CoT-ER are highlighted in green.

Conclusion
In this paper, we explore the potential of LLM in-context learning for few-shot relation extraction. To remedy the performance loss caused by low-quality auto-generated reasoning processes, we introduce CoT-ER, a prompting method tailored to few-shot relation extraction. The core idea is to prompt the LLM to generate evidence using task-specific and concept-level knowledge acquired during its pre-training stage. These pieces of evidence are then utilized by the LLM during the RE task and facilitate the reasoning process. Additionally, we devise a label-verbalizing technique that integrates both entities and the relation label into a coherent expression. This technique addresses the semantic ambiguity of relation labels, a common challenge in relation extraction with in-context learning. The experimental results on FewRel 1.0 and FewRel 2.0 outperform all training-free baselines, demonstrating the effectiveness of our proposed approach. Moreover, achieving results comparable to the state-of-the-art fully-supervised method suggests that the paradigm of in-context learning holds promise as a novel solution for the few-shot relation extraction task.

Limitations
Although CoT-ER achieved decent results on FewRel 1.0 and FewRel 2.0, there is still potential for future improvement. Our proposed method does not fully utilize all instances when handling larger support sets, such as 5-way 5-shot and 10-way 5-shot, due to the constraint on maximum length. Although we adopt similarity-based KNN retrieval to select superior instances for few-shot demonstrations, we find it less effective in the few-shot setting compared to other works that perform well with a large candidate set available. Due to the high cost of employing reasoning-required ICL via the GPT-3 API, we have not evaluated CoT-ER on a superior LLM with longer maximum input tokens and a larger scale.
Our limited budget also restricted optimization of the seed-example construction; performance could likely be enhanced with a more informative and appropriate design.

Ethics Statement
It is known that pre-trained language models can capture biases reflected in their training data. Thus, our approach using LLMs can potentially generate offensive or biased content. We acknowledge this risk and suggest that practitioners carefully examine the potential bias before deploying our models in real-world applications.

Relation class gene plays role in process — Context: In the hippocampus, transcriptional upregulation was observed in nogo-a (one day post-injury), mag and pirb at seven days post-injury.
Reasoning process of CoT-ER: 1. Entity "transcriptional" refers to the process of transcribing DNA into RNA, which refers to the entity of biological process in the context. 2. Entity "mag" is the name of a gene, which refers to the entity of gene in the context. 3. According to the context, "transcriptional upregulation was observed in nogo-a (one day post-injury), mag and pirb at seven days post-injury" indicates that "mag" plays role in "transcriptional". So, the relation between "transcriptional" and "mag" is "gene plays role in process". ✓

Relation class spouse — Context: Their maternal grandparents were John II, Count of Holland and Philippa of Luxembourg.
Reasoning process of CoT-ER: 1. Subject entity "Philippa of Luxembourg" is a personal name, which refers to the entity of person in the context. 2. Object entity "John II, Count of Holland" is a personal name, which refers to the entity of person in the context. 3. According to the context, "maternal grandparents" indicates that Object entity "John II, Count of Holland" is a child of Subject entity "Philippa of Luxembourg". So, the relation between "Philippa of Luxembourg" and "John II, Count of Holland" is "child". ✗

Relation class is primary anatomic site of disease — Context: Loss of hair in the other areas of the skin is present in the majority of cases.
Reasoning process of CoT-ER: 1. Entity "Loss of hair" is the name of a condition, which refers to the entity of condition in the context. 2. Entity "skin" is the name of a body organ, which refers to the entity of physical location of body part in the context. 3. According to the context, "loss of hair in the other areas of the skin" indicates that "loss of hair" occurs in "skin". So, the relation between "loss of hair" and "skin" is "occurs in". ✗

Table 9: Representative scenarios where CoT-ER makes mistakes for different relation classes.
On FewRel 1.0, CoT-ER surpasses Vanilla-ICL 10 times, with 7 of them showing a relatively high improvement. However, on the 7th and 8th relations, CoT-ER still lags behind Vanilla-ICL by a few percentage points.
Table 8 shows the experimental results on FewRel 2.0. We can observe that CoT-ER surpasses Vanilla-ICL 6 times, with 4 of them showing a significant improvement of over 10%. However, on the 5th and 7th relations, CoT-ER still lags behind Vanilla-ICL by a few percentage points.

A.2 Error Analysis of Reasoning Process
CoT-ER shows significant improvements in scenarios where Vanilla-ICL performs poorly. Here, we present a few cases to illustrate the reasoning process of CoT-ER in some representative scenarios, in order to facilitate future research. Table 9 shows some correct and incorrect answers produced by CoT-ER.
Relation class gene plays role in process: In this case, CoT-ER not only recognizes what "transcriptional" and "mag" mean in the context but also draws on concept-level knowledge, and the final prediction is correct.
Relation class spouse: In this case, CoT-ER correctly recognizes the entity types and precisely extracts the crucial evidence "maternal grandparents". However, the LLM incorrectly interprets "maternal grandparents" as the relationship between the two entities, when in fact the two entities together form a couple, namely the maternal grandparents. This demonstrates that the LLM with CoT-ER may sometimes overlook contextual information.
Relation class is primary anatomic site of disease: In this case, CoT-ER also recognizes the entities well, and the conclusion is semantically right ("loss of hair" occurs in "skin"). However, the final prediction is incorrect, as the predicted relation label is "occurs in" while the ground truth label is "is primary anatomic site of disease". The reason is that the label "occurs in" in this dataset means "a condition occurs in a period of the lifetime", so it should be matched with entity pairs like "condition or disease" and "a period of a person's lifetime (such as congenital)". This issue indicates that the relation description would be a key component in such methods, but it is not included in the FewRel 2.0 dataset.

B Seed Examples
In this section, we give more details about how we construct the seed examples and present all seed examples in Table 10, Table 11, Table 12, Table 13 and Table 14.
We first randomly select one instance from each relation class to serve as the seed example. Then we outline the three steps of CoT-ER in each instance: by considering the contextual information of the two entities involved, we manually assign the entity types and identify relevant text spans as evidence. FewRel 1.0 provides a description of each relation (such as "sport: sport in which the subject participates or belongs to") in a separate file, whereas FewRel 2.0 does not; this difference affects the design of the seed examples. However, these descriptions are not directly used in the prompts. Furthermore, we have not optimized the seed examples, leaving room for further improvement.
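The three manually annotated steps above can be sketched as a simple template-filling function. This is a minimal illustration, not the exact wording of our prompts; the function name and the demonstration values are hypothetical, taken from the spouse example in Table 9.

```python
def build_seed_reasoning(subj, subj_type, obj, obj_type, evidence, relation):
    """Compose the three explicit reasoning steps of a CoT-ER seed example:
    1) type the subject entity, 2) type the object entity,
    3) cite an evidence span and conclude the relation."""
    step1 = (f'1. Subject entity "{subj}" refers to the entity of '
             f'{subj_type} in the context.')
    step2 = (f'2. Object entity "{obj}" refers to the entity of '
             f'{obj_type} in the context.')
    step3 = (f'3. According to the context, "{evidence}" indicates that the '
             f'relation between "{subj}" and "{obj}" is "{relation}".')
    return "\n".join([step1, step2, step3])

# Illustrative values only; the annotated evidence and label are filled in by hand.
demo = build_seed_reasoning(
    subj="Philippa of Luxembourg", subj_type="person",
    obj="John II, Count of Holland", obj_type="person",
    evidence="maternal grandparents", relation="child",
)
print(demo)
```

One such filled-in template per relation class is then used as a human-annotated demonstration when prompting the LLM.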

C Prompts
All prompt templates used in this paper are presented in Table 15.

Figure 1 :
Figure 1: The comparison between Auto-CoT and CoT-ER (ours) prompting methods. Specifically, CoT-ER leverages side information to induce LLMs to generate explicit evidence for relation reasoning.

Figure 2 :
Figure 2: An illustration of CoT-ER for few-shot RE. Different colored lines indicate the flow of support and query instances from an N-way K-shot task. a) The human-instructed reasoning module (§3.3) associates an evidence reasoning process with each instance from the support set by prompting the LLM with human-annotated demonstrations; b) the instance retrieval module (§3.4) selects the few-shot demonstrations for the ultimate prompt from the candidate set, based on their similarity to the query instance; c) the inference module (§3.5) utilizes the ultimate prompt, composed of M support instances with their associated reasoning processes, to derive an evidence reasoning process for the query instance.
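The instance retrieval step in b) can be sketched as ranking candidate demonstrations by embedding similarity to the query. This is a minimal sketch under the assumption that instances are already encoded as vectors; the encoder itself and the function names are placeholders, not part of our implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve_demonstrations(query_vec, candidates, m):
    """candidates: list of (instance_id, embedding) pairs.
    Return the ids of the m candidates most similar to the query."""
    ranked = sorted(candidates,
                    key=lambda c: cosine(query_vec, c[1]),
                    reverse=True)
    return [cid for cid, _ in ranked[:m]]

# Toy 2-d embeddings for illustration only.
cands = [("a", [1.0, 0.0]), ("b", [0.7, 0.7]), ("c", [0.0, 1.0])]
print(retrieve_demonstrations([1.0, 0.1], cands, 2))  # ['a', 'b']
```

The selected M instances, together with their generated reasoning processes, form the few-shot demonstrations of the ultimate prompt.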

Figure 3 :
Figure 3: Template of the ultimate prompt. M represents the number of few-shot demonstrations selected by the instance retrieval module, and Verbalize() denotes a transformation function that combines the components into a coherent expression. The prompt used in the human-instructed reasoning module follows a similar structure, but instead of few-shot demonstrations, it employs annotated examples.

Figure 4 :
Figure 4: Ablation study on the first and second reasoning steps of CoT-ER. Auto-CoT refers to the "with reasoning generation" version.

Table 1 :
Main results on the FewRel 1.0 validation/test set. All results are given by accuracy (%). (m) means there are m instances selected as few-shot demonstrations. +reasoning denotes that generation of a reasoning process is required.

Table 2 :
Main results on the FewRel 2.0 validation/test set. All results are given by accuracy (%). (m) means there are m instances selected as few-shot demonstrations. +reasoning denotes that generation of a reasoning process is required.

Table 7 :
Results of each label in FewRel 1.0. Definition of the numbered relation types: 1 voice type, 2 position played on team/speciality, 3 original language of film or TV show, 4 constellation, 5 military rank, 6 competition class, 7 member of, 8 spouse, 9 located in or next to body of water, 10 follows, 11 crosses, 12 sport, 13 main subject, 14 child, 15 part of, 16 mother.

Table 8 :
Results of each label in FewRel 2.0. Definition of the numbered relation types: 1 inheritance type of, 2 ingredient of, 3 classified as, 4 gene found in organism, 5 is primary anatomic site of disease, 6 causative agent of, 7 biological process involves gene product, 8 gene plays role in process, 9 is normal tissue origin of disease, 10 occurs in.