Aligning Instruction Tasks Unlocks Large Language Models as Zero-Shot Relation Extractors

Recent work has shown that fine-tuning large language models (LLMs) on large-scale instruction-following datasets substantially improves their performance on a wide range of NLP tasks, especially in the zero-shot setting. However, even advanced instruction-tuned LLMs still fail to outperform small LMs on relation extraction (RE), a fundamental information extraction task. We hypothesize that instruction-tuning has been unable to elicit strong RE capabilities in LLMs due to RE's low incidence in instruction-tuning datasets, making up less than 1% of all tasks (Wang et al., 2022). To address this limitation, we propose QA4RE, a framework that aligns RE with question answering (QA), a predominant task in instruction-tuning datasets. Comprehensive zero-shot RE experiments over four datasets with two series of instruction-tuned LLMs (six LLMs in total) demonstrate that our QA4RE framework consistently improves LLM performance, strongly verifying our hypothesis and enabling LLMs to outperform strong zero-shot baselines by a large margin. Additionally, we provide thorough experiments and discussions to show the robustness, few-shot effectiveness, and strong transferability of our QA4RE framework. This work illustrates a promising way of adapting LLMs to challenging and underrepresented tasks by aligning these tasks with more common instruction-tuning tasks like QA.


Introduction
Large language models (LLMs) (Brown et al., 2020; Chowdhery et al., 2022; Zhang et al., 2022) have been shown to achieve impressive performance on many NLP tasks. Using the in-context learning paradigm, without any parameter updates, LLMs are able to achieve performance comparable to small language models (LMs) fine-tuned on thousands of examples (Liu et al., 2022; Min et al., 2022a; Liang et al., 2022). More recently, fine-tuning LLMs on datasets containing thousands of downstream tasks transformed into an instruction-following format (i.e., instruction tuning) has been shown to improve LLMs considerably across the board, especially in the zero-shot setting (Iyer et al., 2022; Ouyang et al., 2022; Chung et al., 2022). We examine the capability of LLMs in identifying the relationship between entities in a sentence, i.e., relation extraction (RE), a fundamental task in information extraction. Recent work (Jimenez Gutierrez et al., 2022) has found that LLMs underperform fine-tuned small LMs for RE in the biomedical domain. Our results on general-domain RE in Fig. 1 reveal that even two of the most advanced instruction-tuned LLMs, FLAN-T5 XXL (Chung et al., 2022) and text-davinci-003 (Ouyang et al., 2022), fail to outperform the state-of-the-art (SoTA) zero-shot RE method based on small LMs (Sainz et al., 2021). Code and data are available at https://github.com/OSU-NLP-Group/QA4RE.
We hypothesize that the limited relation extraction capability of instruction-tuned LLMs could be a byproduct of the low incidence of RE tasks in instruction-tuning datasets (Ouyang et al., 2022; Sanh et al., 2022; Chung et al., 2022; Wang et al., 2022). To address the low-incidence issue, we propose the QA4RE framework, which aligns RE with multiple-choice question answering (QA), a task that appears much more frequently in most instruction-tuning datasets: around 12-15% of all tasks in both Wang et al. (2022) and Ouyang et al. (2022). Specifically, by casting the input sentence as a question and the possible relation types as multiple-choice options (Fig. 2), LLMs are able to perform RE by selecting the option representing the correct relation type.
Thorough evaluations on four real-world relation extraction datasets and six instruction-tuned models from two different series (OpenAI GPT-3.5 and FLAN-T5 (Chung et al., 2022)) show that QA4RE brings significant gains over the standard RE formulation, validating its effectiveness and our hypothesis concerning the low incidence of RE. More specifically, our framework enables text-davinci-003 and FLAN-T5 XXLarge to achieve an average of 8.2% and 8.6% absolute improvement in F1, respectively. For the first time, an LLM is able to outperform the prior small-LM-based SoTA in the zero-shot setting by a large margin. In-depth analyses further demonstrate the robustness and few-shot effectiveness of QA4RE. More importantly, our framework proves to be effectively transferable across instruction-tuned models of various sizes, ranging from 80M to 175B. Our contributions are summarized as follows: (1) We systematically investigate instruction-tuned LLMs on four real-world relation extraction datasets and note that their limited performance on RE might stem from the low incidence of RE tasks in instruction-tuning datasets.
(2) We reformulate RE as multiple-choice QA to leverage QA's much higher prevalence in instruction-tuning datasets, achieving significant improvements on six recent instruction-tuned LLMs and, for the first time, clearly outperforming previous SoTA zero-shot RE methods based on small LMs.
(3) In addition, we demonstrate our QA4RE method's robustness to diverse prompt designs as well as its promising results in the few-shot setting. (4) Finally, we show that the effectiveness of the QA4RE framework is transferable and consistent across instruction-tuned models of various sizes, from 80M to 175B. Our study illustrates the potential of aligning infrequent and challenging tasks with frequent instruction-tuning tasks and can guide others in exploring this direction.

Related Work
Instruction Tuning. Large language models originally obtained impressive zero- and few-shot performance by leveraging self-supervised next-token prediction at massive scale. More recently, supervised fine-tuning on a large number of downstream tasks has been shown to improve LLM accuracy, robustness, fairness, and generalization to unseen tasks (Ouyang et al., 2022; Iyer et al., 2022; Wei et al., 2022a; Chung et al., 2022; Sanh et al., 2022). Several strategies have been developed to align LLMs to human instructions, including Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022) as well as the more standard language modeling objective, used to fine-tune LLMs on a wide range of tasks reformulated as instruction-following tasks (Iyer et al., 2022; Wei et al., 2022a; Chung et al., 2022; Sanh et al., 2022).
Eliciting LLM Abilities. The high cost and increasingly private nature of LLM pre-training make it quite challenging to conclusively determine how different pre-training techniques bring about different LLM capabilities. Many factors involved in pre-training, such as simple self-supervised scaling or code and multilingual text pre-training (Chowdhery et al., 2022; Chen et al., 2021; Chung et al., 2022), as well as the distinct versions of instruction tuning mentioned above (Ouyang et al., 2022; Iyer et al., 2022; Wei et al., 2022a; Chung et al., 2022), can interact in a wide variety of ways to unleash the abilities LLMs display. Nonetheless, Fu and Khot (2022) hypothesize that the use of code during pre-training seems to improve an LM's reasoning ability, evidenced by the improved ability to leverage Chain-of-Thought prompting (Wei et al., 2022b) of models trained partially on code, such as PaLM (Chowdhery et al., 2022), code-davinci-002 (Chen et al., 2021), and text-davinci-002/003 (Ouyang et al., 2022), compared to text-only models like text-davinci-001 and OPT-175B (Zhang et al., 2022). Additionally, instruction tuning on a large set of tasks has been shown to improve generalization to unseen tasks, reduce the need for few-shot examples, and improve accuracy and robustness across many language tasks (Ouyang et al., 2022; Iyer et al., 2022; Chung et al., 2022).

Jimenez Gutierrez et al. (2022) report that LLMs underperform standard small-LM fine-tuning in the few-shot setting on a comprehensive set of biomedical RE datasets and show evidence that the poor handling of the none-of-the-above (NoTA) relation category is one of the major culprits. Furthermore, although a few RE-like tasks were included in Super-NaturalInstructions (Wang et al., 2022), these tasks constitute about 0.5% of the dataset and none of them were selected for model evaluation.

[Figure 2 example prompts]

Vanilla RE:
Given a sentence, and two entities within the sentence, classify the relationship between the two entities based on the provided sentence. All possible relationships are listed below:
- per:city_of_birth
- per:city_of_death
- per:cities_of_residence
- no_relation
Sentence: Wearing jeans and a white blouse, Amanda Knox of Seattle is being cross-examined by prosecutors.
Entity 1: Amanda Knox
Entity 2: Seattle
Relationship: per:city_of_birth

QA4RE:
Determine which option can be inferred from the given sentence.
Sentence: Wearing jeans and a white blouse, Amanda Knox of Seattle is being cross-examined by prosecutors.
Options:
A. Amanda Knox was born in the city Seattle
B. Amanda Knox died in the city Seattle
C. Amanda Knox lives in the city Seattle
D. Amanda Knox has no known relations to Seattle
Which option can be inferred from the given sentence?
Option: C

NLI RE:
Hypothesis: Amanda Knox lives in the city Seattle
Premise: Wearing jeans and a white blouse, Amanda Knox of Seattle is being cross-examined by prosecutors.
(Each sentence-relation pair is scored as Entailment / Neutral / Contradiction; a threshold on the entailment score decides the no-relation case.)

Methodology
In this section, we formally define the relation extraction problem and describe our multiple-choice QA approach to it in detail.

Problem Statement
Relation extraction (RE) aims to extract the relationship between two given entities based on a specific sentence. More concretely, a relation example contains a sentence S as well as a head entity E_h and a tail entity E_t within S. Given a relation example (S, E_h, E_t), models are required to identify the relation between E_h and E_t expressed in S from a set of pre-defined relation types.
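For concreteness, a relation example can be represented as a small record; the sketch below (with field names of our own choosing, not from the paper's code) shows the inputs a model receives:

```python
from dataclasses import dataclass

@dataclass
class RelationExample:
    sentence: str  # S: the context sentence
    head: str      # E_h: head entity mention
    tail: str      # E_t: tail entity mention

# The model must map an example to one label from a fixed relation set.
RELATION_TYPES = ["per:city_of_birth", "per:city_of_death",
                  "per:cities_of_residence", "no_relation"]

ex = RelationExample(
    sentence="Wearing jeans and a white blouse, Amanda Knox of Seattle "
             "is being cross-examined by prosecutors.",
    head="Amanda Knox",
    tail="Seattle",
)
```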
To ensure a fair comparison, we utilize the same templates developed in previous studies (Sainz et al., 2021; Lu et al., 2022) to generate answer options within our QA4RE framework. Furthermore, in Sec. 6.2 we discuss the possibility of directly applying the NLI formulation for RE to LLMs.

QA4RE Framework
As shown in Fig. 2 (right), we reformulate the relation extraction task as a multiple-choice QA problem. By integrating the given head and tail entities (E_h and E_t) into the relation templates and using them as multiple-choice options, LLMs are able to leverage the extensive QA instruction fine-tuning that has dramatically improved recent models. Additionally, our method allows LLMs to generate only an answer index instead of the verbalized relation as in previous work (Jimenez Gutierrez et al., 2022), also shown in Fig. 2 (center).
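The reformulation can be sketched as simple prompt construction plus answer-letter parsing. The instruction wording below follows the Fig. 2 example; the exact prompts used in the paper's experiments may differ:

```python
import string

def build_qa_prompt(sentence, options):
    """Cast an RE example as a multiple-choice question.

    `options` are relation templates already filled with the two entities;
    the model only has to emit a single option letter.
    """
    lines = ["Determine which option can be inferred from the given sentence.",
             f"Sentence: {sentence}", "", "Options:"]
    for letter, option in zip(string.ascii_uppercase, options):
        lines.append(f"{letter}. {option}")
    lines.append("Which option can be inferred from the given sentence?")
    lines.append("Option:")
    return "\n".join(lines)

def parse_answer(generated, relations):
    """Map a generated option letter (e.g. ' C.') back to a relation label."""
    letter = generated.strip().rstrip(".")[:1].upper()
    return relations[string.ascii_uppercase.index(letter)]
```

Because the model outputs a single letter rather than a verbalized relation name, decoding and evaluation become trivial string operations.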
Type-Constrained Answer Construction. To transform RE into a multiple-choice question, for a given relation example (S, E_h, E_t), we utilize sentence S as the context, as in standard QA, and create options composed of pre-defined templates filled with the E_h and E_t entities. To compare fairly with previous work, we apply type constraints (when applicable) to eliminate options for relation types that are not compatible with the entity types of the head and tail entities. For instance, if the type of E_h is PERSON, the relation "org:country_of_headquarters" would be deemed invalid, given that a person does not have headquarters.
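A sketch of type-constrained option construction; the template and constraint dictionaries here are illustrative stand-ins for the actual TACRED resources (Sainz et al., 2021; Lu et al., 2022), not the full inventories:

```python
# Hypothetical templates: relation -> verbalization with {head}/{tail} slots.
TEMPLATES = {
    "per:city_of_birth": "{head} was born in the city {tail}",
    "per:city_of_death": "{head} died in the city {tail}",
    "per:cities_of_residence": "{head} lives in the city {tail}",
    "org:country_of_headquarters": "{head} has its headquarters in {tail}",
    "no_relation": "{head} has no known relations to {tail}",
}

# Hypothetical constraints: relation -> allowed (head_type, tail_type) pairs.
TYPE_CONSTRAINTS = {
    "per:city_of_birth": {("PERSON", "CITY")},
    "per:city_of_death": {("PERSON", "CITY")},
    "per:cities_of_residence": {("PERSON", "CITY")},
    "org:country_of_headquarters": {("ORGANIZATION", "COUNTRY")},
}

def build_options(head, tail, head_type, tail_type):
    """Keep only relations compatible with the entity types, plus NoTA."""
    relations, options = [], []
    for rel, template in TEMPLATES.items():
        allowed = TYPE_CONSTRAINTS.get(rel)  # no_relation has no constraint
        if allowed is not None and (head_type, tail_type) not in allowed:
            continue
        relations.append(rel)
        options.append(template.format(head=head, tail=tail))
    return relations, options

rels, opts = build_options("Amanda Knox", "Seattle", "PERSON", "CITY")
# org:country_of_headquarters is filtered out for a PERSON head.
```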
For the NLI approach (Sainz et al., 2021), we report performance using their own templates on TACRED and TACREV. As this method does not provide templates for RETACRED and SemEval, we use the templates from the follow-up work, SuRE (Lu et al., 2022), on these two datasets instead. All zero-shot methods, including those based on LLMs, apply entity type constraints to reduce the relation label space. Since SemEval does not provide entity types, these methods use all possible relations as the label space for every instance.
Few-Shot. Though our main experiments focus on zero-shot RE, we further explore our method's capabilities by comparing its few-shot performance against several competitive small-LM-based methods on the TACRED dataset.

QA4RE Implementation Details
Our QA4RE framework utilizes the same templates and type constraints developed by prior work (Sainz et al., 2021; Lu et al., 2022). In particular, we use SuRE (Lu et al., 2022) templates for our QA4RE approach on all four datasets, since the NLI (Sainz et al., 2021) templates were only designed for TACRED. For prompt engineering, we explore prompt formats and task instructions for vanilla RE and QA4RE in pilot experiments, using text-davinci-002 on a 250-example subset of the TACRED dev set. We then use the same task instructions and prompt format for all four datasets and LLMs.

Table 1: Experimental results on four RE datasets (%). We omit 'davinci' within the names of GPT-3.5 series LLMs, and ChatGPT refers to gpt-3.5-turbo-0301. We mark the best results in bold, the second-best underlined, and the F1 improvement of our QA4RE over vanilla RE in green.
To systematically compare our QA4RE framework with the vanilla RE formulation, we evaluate both on two series of LLMs, seven models in total. Among the GPT-3.5 series LLMs, for those accessible via the Text Completion API (code-davinci-002, text-davinci-002, and text-davinci-003), we follow previous work (Jimenez Gutierrez et al., 2022) and use the logit bias option to constrain token generation to relation labels for vanilla RE and option indices for QA4RE. Due to the fewer control options available for LLMs behind the Chat Completion API (gpt-3.5-turbo-0301), we only set the temperature to 0 and use the default system prompt.
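Constraining generation with logit bias might look like the following sketch. The token ids are placeholders (in practice they come from the target model's tokenizer), and the commented-out call follows the v0-style OpenAI Completion API rather than the paper's actual scripts:

```python
def option_logit_bias(token_ids, bias=100):
    """Build an OpenAI `logit_bias` map that effectively restricts
    generation to the given tokens; keys must be token-id strings."""
    return {str(tid): bias for tid in token_ids}

# Placeholder ids for the option letters "A".."D"; real ids must be
# looked up with the tokenizer of the target model.
LETTER_TOKEN_IDS = {"A": 32, "B": 33, "C": 34, "D": 35}

bias = option_logit_bias(LETTER_TOKEN_IDS.values())

# The actual call (v0-style SDK), sketched and not executed here:
# response = openai.Completion.create(
#     engine="text-davinci-003", prompt=qa_prompt,
#     max_tokens=1, temperature=0, logit_bias=bias)
```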
We also examine the open-source FLAN-T5 series LLMs (Chung et al., 2022), which are trained on a mixture of tasks (Sanh et al., 2022; Wei et al., 2022a; Wang et al., 2022). The 1,836 tasks utilized in training include fewer than 0.5% RE-like tasks, making FLAN-T5 series LLMs ideal for verifying our hypothesis. Specifically, we use the XLarge (3B) and XXLarge (11B) models and adopt the same prompts and greedy decoding strategy as for the GPT-3.5 series LLMs to ensure a fair comparison.

Zero-Shot Results
Our main experimental results on the four relation extraction datasets can be found in Tab. 1. We make the following observations: (1) By reformulating RE as QA, our framework improves upon the vanilla RE formulation on all the LLMs and most datasets, making them much stronger zero-shot relation extractors. In particular, text-davinci-003 and FLAN-T5 XL and XXL are able to outperform the prior SoTA, NLI-DeBERTa, by a large margin. Notably, QA4RE brings the largest gain on the best LLM in each series (text-davinci-003 and FLAN-T5 XXL), suggesting that stronger LLMs may benefit more from our framework.
(2) The two FLAN-T5 LLMs in Tab. 1 benefit significantly from our QA4RE framework. Moreover, consistent and substantial improvements can also be observed on other FLAN-T5 models and the full test set, as discussed in Sec. 6.3 and Appendix C. Considering that relation extraction tasks account for less than 0.5% of the instruction tasks used to train FLAN-T5 models, these findings strongly support our hypothesis that aligning underrepresented tasks with more common instruction-tuning tasks, such as QA, unlocks LLMs' ability to solve low-frequency tasks.
(3) The SemEval dataset poses a significant challenge for all baselines given its lack of type constraints, particularly for SuRE (Lu et al., 2022). With such a large search space, generative LMs without fine-tuning tend to collapse most examples into the NoTA relation, resulting in systematic failure. Note that without type constraints, the RE problem becomes a 19-choice question answering task in our QA4RE framework. Despite this, our method still demonstrates substantial improvements for LLMs, particularly for text-davinci-003 and FLAN-T5 XXL.

Robustness to Verbalization Templates
For our experiments, we utilize manually written relation templates from previous work (Sainz et al., 2021; Lu et al., 2022). However, Lu et al. (2022) note that model performance may vary significantly with template design. Thus, to investigate the robustness of models to different templates, we conduct thorough experiments with four different templates, described in detail in Appendix B.3, across all zero-shot methods on the TACRED dataset. Tab. 2 compares these four templates on all methods used in our main experiments, including vanilla RE as a template-free reference. From Tab. 2, we observe the following: (1) Our method consistently outperforms small-LM baselines and the vanilla RE framework, regardless of the template. Notably, even with templates constructed from label name information only, without expert knowledge (TEMP3 and TEMP4), our QA framework still performs better than vanilla RE, indicating its effectiveness and consistency.

(2) NLI and SuRE performance is largely template-dependent. When using carefully crafted, high-quality templates (TEMP1 and TEMP2), several LM-based NLI methods outperform text-davinci-003 with vanilla RE. However, when equipped with templates created without expert knowledge (TEMP3 and TEMP4), the performance of both NLI and SuRE deteriorates dramatically. In contrast, QA4RE is more robust to variation in verbalization templates, reducing trial-and-error development effort and making it more readily transferable to settings where obtaining quality templates is limited by the high cost of expert annotation, such as the biomedical or financial domains.

None-of-the-Above Relation Evaluation
The none-of-the-above (NoTA) relation (Gao et al., 2019; Sabo et al., 2021; Jimenez Gutierrez et al., 2022) is defined as the case where no relation of interest exists between the given entities. Jimenez Gutierrez et al. (2022) demonstrate that the earlier inferior performance of LLMs on RE tasks can be largely attributed to their inability to handle the NoTA relation. To evaluate the efficacy of zero-shot methods on the NoTA relation, following previous work (Fei and Liu, 2016; Shu et al., 2017; Sainz et al., 2021), we apply the NoTA-included macro F1 metric as well as micro and macro P vs. N (all positive relations vs. the NoTA relation as binary classification) F1 metrics. From Tab. 3, we observe that, when enhanced by our QA framework, text-davinci-003 achieves significant improvement on the NoTA-included metrics, outperforming the small-LM-based NLI methods. This further demonstrates the effectiveness of our framework, even in handling the challenging NoTA relation. Notably, these superior results are achieved by simply adding an entity-filled NoTA relation template as an answer option, without the additional thresholding required by previous methods (Sainz et al., 2021; Lu et al., 2022). This eliminates the need for extra hyperparameter search, which can be tricky in low-resource settings.
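The P vs. N metrics reduce RE to a binary positive-vs-NoTA decision. A minimal sketch of such binary F1 scoring (our own implementation; the paper's exact evaluation scripts may differ in detail):

```python
def binary_f1(golds, preds, nota="no_relation"):
    """Positive-class F1, NoTA-class F1, and their macro average for the
    P vs. N (all positive relations vs. NoTA) binary evaluation."""
    def f1_for(is_target):
        tp = sum(is_target(g) and is_target(p) for g, p in zip(golds, preds))
        fp = sum((not is_target(g)) and is_target(p) for g, p in zip(golds, preds))
        fn = sum(is_target(g) and not is_target(p) for g, p in zip(golds, preds))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

    pos_f1 = f1_for(lambda r: r != nota)   # all positive relations as one class
    nota_f1 = f1_for(lambda r: r == nota)  # NoTA as the target class
    return pos_f1, nota_f1, (pos_f1 + nota_f1) / 2  # macro = mean of the two
```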

Few-Shot Results
While zero-shot RE is our main focus, we also evaluate our method in the few-shot setting. Results are shown in Tab. 4. Due to budget limitations, we restrict our case study to the 4-shot scenario (i.e., 4 labeled examples per relation) with the best-performing LLM in the zero-shot setting (text-davinci-003). After determining the optimal number of in-context examples on the dev set, we randomly select examples with the same entity type constraints from the given train set.
Interestingly, vanilla RE is unable to obtain any improvement from labeled examples, suggesting that it is also limited in the few-shot setting. This indicates that few-shot demonstrations might bias the model toward incorrect relations in the context rather than helping it perform the task more accurately. In Tab. 4, we use text-davinci-003 for both vanilla RE and QA4RE; for the best-performing baseline (NLI), as well as for vanilla RE and QA4RE, we mark results in bold when they improve over their zero-shot alternatives.
Even with our QA4RE framework, few-shot text-davinci-003 does not outperform the DeBERTa-based NLI method (Sainz et al., 2021) when the latter uses its own templates (TEMP1). However, fine-tuning the NLI model on RE data can be brittle even with careful hyperparameter tuning, as evidenced by the unstable gains seen as more data is added for both TEMP1 and TEMP2. Furthermore, few-shot NLI results drop substantially from TEMP1 to TEMP2, suggesting that this approach also lacks robustness to templates in the few-shot setting. Thus, considering that our QA approach enables LLMs to obtain few-shot improvements over zero-shot results with random in-context example selection, falls only around 2% short of the best NLI model, and is robust to different template designs, it is competitive on few-shot RE and has the potential to achieve even stronger performance with more exploration. We leave further investigation of how to improve LLMs for few-shot RE to future work.

Vanilla + Template RE
Given a sentence, and two entities within the sentence, classify the relationship between the two entities based on the provided sentence. All possible relationships are listed below:
- per:city_of_birth: Entity 1 was born in the city Entity 2
- per:city_of_death: Entity 1 died in the city Entity 2
- per:cities_of_residence: Entity 1 lives in the city Entity 2
- no_relation: Entity 1 has no known relations to Entity 2
Sentence: Wearing jeans and a white blouse, Amanda Knox of Seattle is being cross-examined by prosecutors.
Entity 1: Amanda Knox
Entity 2: Seattle
Relationship: per:city_of_birth

We conduct an ablation study to better understand how relation templates contribute to the performance improvement obtained by QA4RE. As illustrated in Fig. 3, we fill the relation verbalization templates with the markers Entity 1 and Entity 2 as relation explanations, thereby presenting the expert knowledge from the templates to the LLM. Using the same templates and type constraints, we compare this framework (termed Vanilla+TEMP) with vanilla RE and QA4RE on the TACRED dataset and GPT-3.5 series LLMs.
As shown in Tab. 5, introducing relation explanations based on the same templates does not result in consistent or significant performance improvement. In fact, adding extra information to the task instruction might make it more challenging for the LLM to understand the task. In contrast, with our QA4RE framework, we do not need to separately specify the entities of interest or relation explanations; both are naturally embedded in the answer options. These ablation results show that the gains from QA4RE mainly come from the QA reformulation, not simply from the relation explanations/templates.

QA4RE vs. NLI4RE
Given the strong performance obtained by small LMs using the NLI reformulation of RE, we apply the same formulation (Sainz et al., 2021) to LLMs (termed NLI4RE). More concretely, for each example, we use the LLM to predict whether the given sentence (the premise) entails each answer option from the QA4RE formulation (the hypothesis). We allow the LLM to generate entailment, neutral, or contradiction for each sentence-relation pair. If the maximum probability of entailment among all possible positive relations is below the threshold of 0.5, the example is classified as NoTA, as done by Sainz et al. (2021). As shown in Tab. 6, when using the NLI formulation, text-davinci-003 surprisingly underperforms the vanilla RE formulation. The reason for its poor performance is two-fold: (1) The heuristically pre-defined threshold of 0.5 is not ideal for LLMs, so many positive predictions are classified as NoTA; however, finding a good threshold is also difficult in the zero-shot setting. (2) Under NLI4RE, unlike vanilla RE or QA4RE, the LLM never sees the full relation space but assigns a probability to each candidate hypothesis individually. The final prediction is thus more sensitive to the LLM's bias over different relations.
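The aggregation step of NLI4RE can be sketched as follows, assuming per-relation entailment probabilities have already been obtained from the LLM (deriving them from token log-probs is omitted here):

```python
def nli4re_predict(entailment_probs, threshold=0.5, nota="no_relation"):
    """Aggregate per-relation entailment probabilities into one prediction,
    following the thresholding of Sainz et al. (2021): if no positive
    relation reaches the threshold, predict NoTA.

    `entailment_probs`: dict mapping each positive relation to the model's
    entailment probability for its entity-filled template.
    """
    best_rel = max(entailment_probs, key=entailment_probs.get)
    if entailment_probs[best_rel] < threshold:
        return nota
    return best_rel
```

Note that this requires one inference call per candidate relation, which is the source of NLI4RE's higher cost.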
NLI4RE also requires multiple inference runs for each relation example to evaluate all the candidate relations, incurring a significantly higher cost.

QA4RE & Model Size
To verify the effectiveness and transferability of our QA4RE framework on smaller instruction-tuned models, we further evaluate FLAN-T5 Small (80M), Base (250M), and Large (780M) on the full test sets of the four RE datasets. Tab. 7 shows that our QA4RE framework still brings considerable gains to instruction-tuned models of various sizes, even the smallest one (80M). Together with the consistent improvements of QA4RE on several GPT-3.5 models, this demonstrates that the effectiveness of QA4RE transfers across model sizes from 80M to 175B.
In the FLAN-T5 series, larger models benefit more from our framework. However, this trend does not continue when scaling up to the much larger GPT-3.5 models. In fact, all GPT-3.5 models except text-davinci-003 benefit less from QA4RE than the FLAN-T5 models do. The smaller improvements of QA4RE on these models make their overall RE performance only comparable with models that are approximately 20 and 50 times smaller. This indicates that the wide variety of alignment strategies used by the GPT-3.5 series models discussed in Sec. 2 might not be universally more effective than standard instruction tuning for improving model generalization on low-incidence tasks, even when those tasks are aligned to high-incidence ones. Nevertheless, the strong improvement observed in the strongest models tested, text-davinci-003 and FLAN-T5 XXL, suggests that QA4RE's effectiveness may persist as models become even more capable.

Conclusions and Future Work
In this work, we first show that even the most recent instruction-tuned LLMs underperform fine-tuned small LMs on the relation extraction (RE) task. To address this limitation, we reformulate RE as multiple-choice question answering (QA) in order to leverage QA, a task widely covered in instruction-tuning datasets, instead of RE, which is barely present in them. Comprehensive experiments demonstrate that our QA4RE framework unlocks the power of LLMs as zero-shot relation extractors, especially for two recent LLMs (text-davinci-003 and FLAN-T5 XXL). We also conduct thorough experiments to explore the robustness and few-shot effectiveness of our method, as well as to study in which LLM training scenarios it is most effective.
In future work, we hope to explore additional underrepresented tasks in instruction tuning that might be challenging for LLMs and could be successfully aligned with more widely adopted instruction-tuning tasks like QA. Additionally, we plan to continue this line of work by applying our QA4RE framework to other LLMs, such as the OPT series (Zhang et al., 2022; Iyer et al., 2022) and PaLM (Chowdhery et al., 2022), which are not included in this work due to limited computational resources and/or access.

Limitations
Even though our method helps unleash the power of six recent strong LLMs as zero-shot relation extractors, earlier LLMs without strong instruction tuning, such as text-davinci-001, saw no improvement from our framework. Additionally, although we carry out comprehensive experiments on the zero-shot RE setting, our few-shot exploration is more limited. It remains unclear whether including even more training examples can improve LLMs' RE performance, and to what extent the trends seen across GPT-3 models in the zero-shot setting hold in the few-shot setting. We leave these questions for future work.

Ethics Statement
In this work, we propose a method to improve LLM performance on the important and fundamental task of relation extraction.We do not anticipate any ethical issues regarding the topics of this research.

A Instruction Dataset Portion
Dataset | #Tasks | %RE | %QA
T0 (Sanh et al., 2022) | 62 | 0 | 27.4
FLAN (Wei et al., 2022a) | 62 | 0 | 21
MetaICL (Min et al., 2022b) | 142 | 0 | 28.9
NaturalInstruct (Wang et al., 2022) | 1731 | <0.5 | >12

As shown in Tab. 9, there is no RE task in the T0 (Sanh et al., 2022), FLAN (Wei et al., 2022a), or MetaICL (Min et al., 2022b) instruction-tuning datasets. Even in the largest available dataset, NaturalInstruct (Wang et al., 2022), RE tasks make up less than 0.5% of the total tasks. By contrast, QA is the most popular task format in all instruction-tuning datasets. These observations confirm the low incidence of RE tasks and the dominance of QA tasks in datasets used for instruction tuning.

B Experimental Details

B.1 Hyperparameters for Few-Shot Methods
In the few-shot setting, for each K, we randomly sample three training subsets, each of which is used as in-context demonstrations for LLMs or to train the small language models in the baselines. Reported results are averaged over the three subsets. To avoid over-estimating few-shot performance with too many dev examples (Perez et al., 2021), we use 100 randomly selected dev-set examples for all hyperparameter searching.
For LLMs, we use the dev set to search for the optimal number of in-context examples as a hyperparameter from {1, 2, 5}. We then randomly select the same type-constrained in-context examples from the given train set.
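The type-constrained demonstration sampling can be sketched as below; the dictionary keys are our own naming for illustration, not the datasets' actual field names:

```python
import random

def select_demonstrations(train_set, head_type, tail_type, k, seed=0):
    """Randomly pick k in-context examples whose entity types match the
    test example's type constraint."""
    pool = [ex for ex in train_set
            if ex["head_type"] == head_type and ex["tail_type"] == tail_type]
    rng = random.Random(seed)  # fixed seed for reproducible subsets
    return rng.sample(pool, min(k, len(pool)))
```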
For all small-LM-based baselines, we use their publicly available code and hyperparameters for training. Following the original papers of NLI (Sainz et al., 2021) and SuRE (Lu et al., 2022), we use the checkpoints available online and the hyperparameters reported for model training. Unfortunately, we were unable to reproduce SuRE results with the default hyperparameters. For standard fine-tuning (Jimenez Gutierrez et al., 2022), PTR (Han et al., 2022), and KnowPrompt (Chen et al., 2022), we perform a grid search over hyperparameters on the dev set with the ranges shown in Tab. 10.
We use 8 NVIDIA GeForce RTX 2080 Ti and 2 NVIDIA RTX A6000 GPUs to conduct all the experiments. The total GPU hours used and the cost of the OpenAI API are listed in Tab. 11.

B.2 Prompts for LLMs
As shown in Tab. 12, we list all templates used in this paper, including vanilla + TEMP in Tab. 5, NLI4RE in Tab. 6, and vanilla as well as QA4RE in all experiments.

B.3 Relation Verbalization Templates
In the relation verbalization template robustness experiment shown in Tab. 2, the differences between the four templates are described below, using the org:top_members/employees relation from the TACRED benchmark as an example:

NLI4RE
In this task, you will be presented with a premise and a hypothesis sentence. Determine whether the hypothesis sentence entails (implies), contradicts (opposes), or is neutral with respect to the given premise sentence. Please answer with "Contradiction", "Neutral", or "Entailment".
Premise

Figure 1: Main finding: Strong instruction-tuned LLMs underperform prior zero-shot RE methods using the standard (vanilla) RE formulation. Our QA4RE framework enables models in two sets of instruction-tuned LLMs (FLAN-T5 and GPT-3.5) to surpass the prior SoTA on 4 RE datasets by a large margin. Results are averaged over the 4 RE datasets. We omit the word 'davinci' from the displayed GPT-3.5 model names for brevity.

Figure 2: A schematic of the SoTA NLI zero-shot framework, in which each sentence must be compared with each relation template (left); the vanilla formulation for prompting GPT-3 for RE as done in Jimenez Gutierrez et al. (2022) (center); and our multiple-choice QA setting, in which each relation is transformed into a template and GPT-3 is expected to predict only a single letter (right).

Figure 3: The same example and templates as Fig. 2, but using the templates as relation explanations.
Please refer to Appendix B.2 and B.3 for prompt format and relation verbalization template details, respectively.

Table 3: NoTA-included 42-class macro F1 as well as macro and micro P vs. N (all positive relations vs. NoTA) F1 on TACRED (%). The best result for each metric is bolded. text-003 refers to text-davinci-003; Ma and Mi are short for macro and micro, respectively.

Table 4: Few-shot F1 on TACRED (%). All results are averaged over 3 different training subsets for each K.

Table 5: Evaluation on TACRED of whether incorporating relation explanations based on the same templates into vanilla RE bridges its gap to QA4RE (%).
Formulation | TACRED | RETACRED | TACREV | Eval. Avg.

Table 9: Popular instruction-tuning datasets and the proportion of RE and QA tasks in each.

Table 10: Hyperparameter ranges used for the grid search of few-shot methods. Learning Rate 2 is used for training new tokens in PTR (Han et al., 2022) and virtual tokens in KnowPrompt (Chen et al., 2022).

Table 11: Total GPU hours for open-source LMs and cost of using the OpenAI API (all versions included).
1. Concrete Examples: {E_h} is a chairman/president/director of {E_t}
2. Semantic Relationship: {E_h} is a high-level member of {E_t}
3. Straightforward: The relation between {E_h} and {E_t} is top members or employees
4. Word Translation: {E_h} organization top members or employees {E_t}

Vanilla RE:
Given a sentence, and two entities within the sentence, classify the relationship between the two entities based on the provided sentence. All possible relationships are listed below:

Vanilla + TEMP:
Given a sentence, and two entities within the sentence, classify the relationship between the two entities based on the provided sentence. All possible relationships are listed below with explanations:
- [Possible Relation 1]: [Relation 1 Template]
- [Possible Relation 2]: [Relation 2 Template]
- [NoTA Relation]: [NoTA Relation Template]
Sentence: [Sentence S]
Entity 1: [Head Entity E_h]
Entity 2: [Tail Entity E_t]
Relationship:

Table 12: Prompt formats of the frameworks for LLMs in this paper. We only demonstrate NLI4RE with one template for simplicity.