Revisiting Relation Extraction in the era of Large Language Models

Relation extraction (RE) is the core NLP task of inferring semantic relationships between entities from text. Standard supervised RE techniques entail training modules to tag tokens comprising entity spans and then predict the relationship between them. Recent work has instead treated the problem as a sequence-to-sequence task, linearizing relations between entities as target strings to be generated conditioned on the input. Here we push the limits of this approach, using larger language models (GPT-3 and Flan-T5 large) than considered in prior work and evaluating their performance on standard RE tasks under varying levels of supervision. We address issues inherent to evaluating generative approaches to RE by doing human evaluations, in lieu of relying on exact matching. Under this refined evaluation, we find that: (1) Few-shot prompting with GPT-3 achieves near SOTA performance, i.e., roughly equivalent to existing fully supervised models; (2) Flan-T5 is not as capable in the few-shot setting, but supervising and fine-tuning it with Chain-of-Thought (CoT) style explanations (generated via GPT-3) yields SOTA results. We release this model as a new baseline for RE tasks.


Introduction
Relation extraction (RE) is the task of identifying entities and their semantic relationships from texts. Standard supervised approaches (Eberts and Ulges, 2019a) to RE learn to tag entity spans and then classify relationships (if any) between these. More recent work has shown that conditional language models can capably perform this task, achieving SOTA or near-SOTA results, when trained to output linearized strings encoding entity pairs and their relations (Paolini et al., 2021; Lu et al., 2022b; Huguet Cabot and Navigli, 2021). However, to date such work has considered only moderately sized pre-trained models for RE such as BART (Paolini et al., 2021; Huguet Cabot and Navigli, 2021).

[Figure 1: Model performance on CoNLL (Micro-F1 score).]
In this work we investigate the use of very large language models, including GPT-3 (Brown et al., 2020b), for end-to-end relation extraction via generation. Our contributions are as follows.
1. We show that few-shot learning with GPT-3 yields near SOTA performance on standard RE datasets, roughly on par with (and in some cases marginally better than) existing fully supervised models.
2. We find that Flan-T5 (large; Chung et al., 2022) is not as capable, even when fine-tuned. But we then propose an approach to training Flan-T5 with Chain-of-Thought (CoT) style "explanations" (generated automatically by GPT-3) that support relation inferences; this achieves SOTA results.
3. Evaluating the performance of generative models for RE is non-trivial because one cannot rely on exact matches to targets. We address this by collecting a small amount of annotations scoring generated outputs against targets. We use these annotations to quantify the problem, identify erroneous gold references, and accurately evaluate our models.
Our results indicate that, in general, LLMs should be the default approach to RE, especially given that one can train Flan-T5 (which is dramatically smaller than GPT-3, and publicly available) to achieve SOTA performance (Figure 1).

RE via Text Generation
We treat RE as a conditional text generation task. Concretely, for a dataset of size N, we model the probability of generating a linearized string y of a relation triplet (entity_1, relation_type, entity_2) conditioned on a context string C. Specifically, C includes a chain of n linearized examples (x_i, y_i), with n << N. Formally, for an input x and a target y of T tokens:

$$p_{\mathrm{LM}}(y \mid C, x) = \prod_{t=1}^{T} p(y_t \mid C, x, y_{<t})$$

We provide examples of context strings in the Appendix. We conduct experiments over four standard RE datasets comprising varying numbers of entities and relation types, namely ADE (Gurulingappa et al., 2012), CoNLL (Roth and Yih, 2004), NYT (Riedel et al., 2010), and DocRED (Yao et al., 2019); details in Table 1 and Appendix A.
Following Huguet Cabot and Navigli (2021), we linearize our target relation triplets. However, we adopt a much simpler scheme than prior work: we linearize inputs with a single relation type (e.g., ADE) as a list of tuples:

[(drug, effect), ..., (drug, effect)]

For inputs with multiple relation types (as in CoNLL04 and NYT), we form triplets comprising a subject, relation, and object (along with their corresponding types), in the order of appearance of the subject entity.

Challenges inherent to evaluating generative large language models for RE
The expressivity of language models coupled with the open-endedness of RE makes evaluation difficult. This has led to inconsistent approaches to evaluation (Taillé et al., 2020). Past work, especially that pre-dating LLMs for the task, has tended to perform "strict" evaluation, requiring exact matches between generated linearized relation tuples and references. This may be appropriate when evaluating smaller conditional generation models (such as BART) for RE, which have been fine-tuned on large training sets, because after training such models consistently generate standardized outputs. By contrast, models like GPT-3 (or other large language models capable of zero- or few-shot application) can produce a wide variety of output formats which convey similar content. For example, given an input from ADE and prompted to list all drugs and associated adverse events, a large language model might yield "Aspirin: stomach pain, chest pain". Or it may instead output: "Side effects of aspirin include cramping and stomach pain, and pain in the chest." There are countless possible variants which may all communicate the correct answer; we provide additional real examples in Appendix D. The flexibility of language means that parsing out the structured result to compare it to a reference (to calculate standard metrics like precision, recall, and F-1) is a non-trivial problem. This is in stark contrast to traditional approaches to tasks like NER and RE, where models effectively classify input tokens instead of generating new ones from a vast vocabulary.
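To make the linearization scheme and the parsing problem concrete, the following is a minimal Python sketch rather than the implementation used in this work. The helper names `linearize` and `parse` are hypothetical, and the Python-literal target format (quoted strings inside a list of tuples) is an assumption chosen so that conforming outputs can be parsed back with the standard library.

```python
import ast


def linearize(relations):
    """Render relation tuples as one target string, keeping the given order."""
    return repr([tuple(r) for r in relations])


def parse(generated):
    """Recover a set of tuples from a generated string.

    Returns None when the text is not a list of tuples, which is what
    happens with free-form answers.
    """
    try:
        value = ast.literal_eval(generated.strip())
    except (ValueError, SyntaxError, TypeError):
        return None
    if not isinstance(value, list):
        return None
    if not all(isinstance(item, (list, tuple)) for item in value):
        return None
    try:
        return {tuple(item) for item in value}
    except TypeError:  # an item contained something unhashable
        return None


# Hypothetical ADE-style target with a single relation type.
gold = [("aspirin", "stomach pain"), ("aspirin", "chest pain")]
target = linearize(gold)

# A conforming generation round-trips; a free-form one does not parse,
# even though it conveys the same content.
conforming = parse(target)
free_form = parse("Aspirin: stomach pain, chest pain")
```

In this sketch a free-form answer is simply unparseable, so exact matching would count every relation it expresses as missed. That is the failure mode the human evaluation in this work is meant to correct.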
Training models, either via traditional supervised learning or in-context few-shot learning, encourages models to comport with the structure of training instances. We therefore focus our analysis on such supervised settings in this work, starting with an evaluation of few-shot learning with GPT-3 for RE. Nonetheless, even when supervised, LLMs used for RE are prone to generating outputs which may be accurate but differ from the target. To address this, we enlist human annotators to judge whether the model outputs convey the same information as the reference targets.

In-Context Few-Shot Learning with GPT-3 for RE
In this section we first describe our few-shot prompting strategy for GPT-3, and report the results realized by this approach across a set of RE corpora.
We adopt forms of instructional in-context few-shot prompting to GPT-3. (All references to GPT-3 in this work refer to the "text-davinci-002" variant.) Motivated by the preceding discussion regarding evaluation challenges, we collect human annotations judging the model's generations against the gold references. Finally, using these annotations we report results achieved using GPT-3 with few-shot prompting for RE (Table 2).

Prompts
We describe the prompts we use for each of the datasets considered in turn.
ADE To construct prompts for ADE, we use the instructional prompt "List all (drug: adverse effects) pairs in the following text", followed by an input text. We then select 12 examples ("shots") at random from the training set, and for each we append the corresponding input followed by linearized target relations to the instructional prompt; this yields a prompt featuring 12 examples, comprising 755 tokens. To make a prediction for a new example we append one last "List all (drug: adverse effects) pairs in the following text" instruction followed by the corresponding text and then ask GPT-3 to generate text conditioned on this final prefix. Specifically, we perform this generation using default parameters save for sampling temperature, which we set to 0.5. We impose a maximum output length of 256 tokens.
We next aim to evaluate the performance of GPT-3 for RE when provided the above prompts. But doing so requires addressing the challenges inherent to evaluating LLMs for RE outlined above (and in prior work; Taillé et al., 2020).
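The prompt assembly described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the exact code used here: the function names, the newline layout, the choice to repeat the instruction before every shot, and the toy training set are all hypothetical, and the `GENERATION_PARAMS` key names are assumed rather than taken from the paper. No API call is made.

```python
import random

INSTRUCTION = "List all (drug: adverse effects) pairs in the following text"

# Decoding settings stated in the text; the key names are an assumption.
GENERATION_PARAMS = {"temperature": 0.5, "max_tokens": 256}


def sample_shots(train_set, k=12, seed=0):
    """Pick k training examples at random (seeded for repeatability)."""
    return random.Random(seed).sample(train_set, k)


def build_prompt(shots, query_text, instruction=INSTRUCTION):
    """Concatenate instruction + input + target per shot, then the open query."""
    blocks = [f"{instruction}\n{text}\n{target}" for text, target in shots]
    blocks.append(f"{instruction}\n{query_text}\n")
    return "\n\n".join(blocks)


# Toy (input_text, linearized_target) pairs, invented for illustration.
train_set = [
    (f"Sentence {i} about drug{i}.", f"[(drug{i}, effect{i})]") for i in range(20)
]
prompt = build_prompt(sample_shots(train_set), "A new sentence to label.")
```

The resulting string ends with the unanswered query, so the model's continuation is read as the linearized relations for that input.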

Manually re-evaluating "errors"
We quantify the errors in evaluation that occur when one uses "strict" measures of performance while using few-shot prompted LLMs for RE across each dataset. We do this by acquiring human annotations (collected via Mechanical Turk; details in Appendix D) on model outputs, with respect to reference labels provided in the accompanying datasets. In particular, we show annotators ostensible "false positive" and "false negative" outputs produced by GPT-3 for these corpora (as would be computed using exact matching against references) and ask them to judge whether these are accurately categorized.
On ADE we find that 51.67% of "false positives" (a slight majority) are more accurately viewed as true positives, and 32.61% of "false negatives" are deemed as, in fact, true negatives. On CoNLL outputs, annotators marked 50.27% of "false positives" as valid, and 36.6% of "false negatives" as being accurate.
As mentioned above, we were unable to design a prompt for NYT that yielded reasonable few-shot results with GPT-3. So we instead ask annotators to evaluate outputs from Flan-T5 fine-tuned on the NYT train set. In this case, they deemed 36.9% and 22.97% of "false positives" and "false negatives", respectively, to in fact be accurate. We present some illustrative cases in Figure 2 and additional examples in Appendix Tables 7 and 8. These findings imply that strict (exact-matching) evaluation against references for RE will be inaccurate (and pessimistic). In the results we later report for LLMs, we therefore take into account these manual assessments.
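As a worked illustration of how such manual assessments can change reported scores, the sketch below recomputes precision, recall, and F1 after some strict-match errors are overturned. The re-counting rule is an assumption made for this example (the text does not spell out the exact bookkeeping), the function names are hypothetical, and the counts are invented rather than taken from our results.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def apply_judgments(tp, fp, fn, fp_overturned, fn_overturned):
    """Re-count after annotators overturn some strict-match errors.

    Assumption: an overturned FP is a valid relation and becomes a TP;
    an overturned FN is already covered by the output and is dropped.
    """
    if fp_overturned > fp or fn_overturned > fn:
        raise ValueError("cannot overturn more errors than were counted")
    return tp + fp_overturned, fp - fp_overturned, fn - fn_overturned


# Illustrative counts only (not taken from the paper).
strict = (50, 30, 20)
adjusted = apply_judgments(*strict, fp_overturned=15, fn_overturned=6)
strict_f1 = prf(*strict)[2]
adjusted_f1 = prf(*adjusted)[2]
```

With roughly half of the "false positives" overturned, as observed on ADE and CoNLL, precision rises sharply under this rule, which is why strict evaluation is pessimistic for generative models.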

Results
Using the prompts and manual annotation process just described, we find that in most cases GPT-3 performs comparably to current fully supervised SOTA RE models without fine-tuning and given only 12-20 training examples. This can be seen in Table 2 (2.a). We also find a substantial number of instances where the model correctly identifies relation pairs which are in fact incorrectly marked in the references (detailed below). We observe additional issues with the NYT and CoNLL datasets which we discuss below.
CoNLL We find a number of relation triplets where the output does not conform to the set of valid relation types (∼% of relation triplets in the validation set). Examining these triplets, we often find the out-of-domain relation types to be either closely related to a correct CoNLL relation type (e.g., shoot → kill) or otherwise correct even if not related to a CoNLL relation type. There were a total of 18 input validation instances in which at least one of the generated relation triplets did not conform to a valid CoNLL relation; we provide a full list of these instances and the generated relation triplets in Appendix D.1. (One could also train a model on manual assessments of "false positives" and "false negatives" to semi-automate this evaluation, avoiding the need to collect such judgments on entire testing sets; we provide results showing the feasibility of doing so in Appendix D.)
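Flagging non-conforming relation types of this kind can be sketched in a few lines. The relation inventory below is the CoNLL04 set listed in Appendix A; the normalization rule (case-folding and treating spaces, hyphens, and underscores alike) and the example triplets are assumptions introduced for illustration.

```python
CONLL04_RELATIONS = {"KILL", "WORK_FOR", "LIVE_IN", "LOCATED_IN", "ORG_BASED_IN"}


def normalize(relation):
    """Map surface variants such as 'Work_For' or 'live in' to 'WORK_FOR' / 'LIVE_IN'."""
    words = relation.replace("_", " ").replace("-", " ").split()
    return "_".join(word.upper() for word in words)


def out_of_domain(triplets):
    """Return (subject, relation, object) triplets whose relation is not a CoNLL04 type."""
    return [t for t in triplets if normalize(t[1]) not in CONLL04_RELATIONS]


# Hypothetical generations for illustration.
generated = [
    ("Vernon A. Walters", "Work_For", "U.S."),
    ("Charles J. Guiteau", "Shoot", "James A. Garfield"),
]
flagged = out_of_domain(generated)
```

A check like this only identifies out-of-domain types; deciding whether a flagged relation such as "Shoot" is nonetheless a reasonable reading of the text still requires manual review.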
NYT We find the strategy of omitting the relation descriptions in the prompt to be detrimental to the model's performance. Contrary to our findings on ADE and CoNLL, we observe a sharp decline in Micro-F1 scores in the case of NYT (∼30 point reduction) as compared to the fully supervised SOTA. Further, we observe a non-trivial number of invalid or empty output instances (∼10.6% of all generated sequences). These results highlight a remaining limitation of in-context learning with large language models: for datasets with long texts or a large number of targets, it is not possible to fit detailed instructions in the prompt. In such cases, traditional fine-tuning is the practical option. In light of these issues, we were unable to evaluate this approach on the DocRED dataset, which we leave for future work.
Despite these limitations, the fact that GPT-3 is able to (marginally) outperform the current SOTA with in-context learning from tens of examples is encouraging. But GPT-3 is a massive opaque model available only via OpenAI's API (at cost). Further, fine-tuning GPT-3 would incur additional cost, and one would have access to the resultant model only via the OpenAI interface. For these reasons, smaller, open-source LLMs for RE would be preferable. Next we show that by enriching supervision with Chain-of-Thought (CoT) outputs elicited from GPT-3, we can achieve SOTA performance using Flan-T5 (Large).
[Figure 2: Examples of misclassified FPs and FNs from GPT-3 (generated under the few-shot in-context prompting scheme) under traditional evaluation of generative output. In each instance, the entity type of the subject and object was correctly identified. Panel labels: "Wrong, but counted as a false negative"; "Correct, but counted as a false positive"; "Out-of-Domain (CoNLL04)".
Example (ADE). Input: Four days after the initial injection of 3.6 mg of goserelin acetate, severe dyspnea developed due to worsening pleuritis carcinomatosa, which was considered as a flare-up. Reference: [('goserelin acetate', 'flare')]. Generated: [('goserelin acetate', 'severe dyspnea')].
Example (input only). Some have called for a memorial to the lynched youth to join the many other shrines here in Waco, a city of 113,000 neighboring President Bush's ranch in Crawford, and home to Baylor University, founded in 1845, the first institution of higher learning in Texas and the largest baptist university in the world.
Example (generated relation only). [('Amb. Vernon A. Walters', 'Work_For', 'U.S')].
Example (out-of-domain, CoNLL04; input only). In 1881, President James A. Garfield was shot by Charles J. Guiteau, a disappointed office-seeker, at the Washington railroad station.]

SOTA RE Performance with Flan-T5
We use Flan-T5 (Large), an LLM trained on a large number of tasks with instructional prompts. We first evaluate this in a few-shot setting (Section 4.1), shortening prompts in light of T5's smaller size compared to GPT-3. We then consider fine-tuned variants, including a novel approach in which we train Flan-T5 using chain-of-thought (CoT) style explanations for RE elicited from GPT-3. The latter strategy yields SOTA results across all datasets considered.

Few-Shot RE with Flan-T5
For few-shot learning with Flan-T5, we use the same instructional prefixes (with examples) as we did for GPT-3 above, but we reduce the number of exemplars in the prompts to make them more concise. We summarize our findings from these experiments on ADE and CoNLL below, and provide a full set of results in Appendix B.
ADE We include 7 (instead of the 12 used for GPT-3) randomly selected in-context examples for ADE. We observe a significant increase in non-conforming relation pairs in outputs (13.9% of generations). These often include outputs where the model generates the same token (or a set of tokens) repeatedly, or where relation tuples contain more or fewer than 2 entities. Unsurprisingly given these qualitative impressions, the model fares poorly under strict evaluation on the validation set, resulting in a drop of ∼20 points in F1 score compared to GPT-3.
CoNLL The prompt for CoNLL consisted of 7 (in place of the 12 for GPT-3) exemplars inserted into the instructional prefix described above. Again we found that Flan-T5 generated many non-conforming outputs (12.5%). Additionally, we find that Flan-T5 generates a large number of out-of-domain relations between entities (over 120 unique relations), most of which are unrelated to CoNLL, making it impossible to meaningfully evaluate outputs (details in Appendix D).
NYT We exclude this dataset given the large set of relation and entity types, which (as discussed above) makes designing a prompt with sufficient instructions that also fits within the in-context window impossible. (We address this below via fine-tuning, which sidesteps the issue.)
These results indicate that few-shot learning with Flan-T5 is not competitive with GPT-3, and so is not comparable to SOTA RE models. However, we next show that fine-tuning Flan-T5 can yield substantially better results, especially if one includes reasoning about RE in the supervision.

Fine-tuning Flan-T5 for RE
We first perform standard fine-tuning for Flan-T5 (Large) using available training datasets. We report results from the test set in Table 2 (1.e.). This yields performance equivalent to, but not better than, existing fully supervised models such as REBEL.
As a potential mechanism to improve the performance of Flan-T5 for RE, we propose enriching the supervision used for fine-tuning with chain-of-thought (CoT) style explanations elicited from GPT-3.
[Figure 3. Example input (NYT): It will be the final movie credited to Debra Hill, a film producer and native of Haddonfield, who produced "Halloween" and was considered a pioneering woman in film.]
Next we evaluate the impact of CoT explanations in two settings: As additional context for prompting GPT-3, and then as additional supervision signal with which to train Flan-T5.

Eliciting CoT reasoning for RE
We use the same prompts from the few-shot experiments above but augment them with CoT-style explanations (one per shot) written by one of the authors. This yields moderate gains in the overall performance for GPT-3 (∼3 and ∼2.2 micro-F1 points for ADE and CoNLL, respectively; Table 2, 2.b), and also reduces the number of non-conforming relations generated (from 13.9% to 0.8% on ADE, and from 12.5% to 1.1% on CoNLL). Further, using CoT results in only one instance of an out-of-domain relation type generated on CoNLL, compared to over 120 relations generated without CoT explanations. In sum: using CoT in few-shot learning for RE with GPT-3 yields more standardized outputs, but does not much improve performance. Next we propose to capitalize on CoTs automatically generated over training sets to enrich the supervision with which we train Flan-T5.

Fine-tuning Flan-T5 with CoT explanations
We augment target relations used to train Flan-T5 with CoT strings automatically generated by GPT-3 over the training dataset. Specifically, we modify the prompt used in Section 3 to generate CoT-style explanations conditioned on the input and relation reference labels. The following is an example of the prompt we provide GPT-3 to elicit a CoT explanation:
Text: This April 14 is the 125th anniversary of the night when Lincoln, the 16th president, was assassinated by John Wilkes Booth in the presidential box at Ford's Theatre.
We then use these explanations along with reference relation labels as targets to fine-tune Flan-T5 (Large), as depicted in Figure 3. Overall, we found this strategy to be effective, obtaining state-of-the-art results across datasets while being much faster to train than existing fully supervised models. We summarize our findings below, and report results in Table 2 (1.f.).
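The two strings involved in this procedure, the prompt that elicits an explanation and the enriched fine-tuning target, might be assembled as in the sketch below. The field labels follow the TEXT / Relations / Explanation layout of the exemplars in Appendix C, but the exact layout, the ordering of explanation and relations within the target, the function names, and the example relation and explanation strings are all assumptions made for illustration.

```python
def elicitation_prompt(text, relations):
    """Prompt asking for an explanation, conditioned on the input and its reference relations."""
    return f"TEXT: {text}\nRelations: {relations}\nExplanation:"


def training_target(explanation, relations):
    """Fine-tuning target pairing the generated explanation with the reference relations."""
    return f"Explanation: {explanation.strip()}\nRelations: {relations}"


text = (
    "This April 14 is the 125th anniversary of the night when Lincoln, "
    "the 16th president, was assassinated by John Wilkes Booth in the "
    "presidential box at Ford's Theatre."
)
# Hypothetical reference label and stand-in model output, for illustration only.
relations = "[['John Wilkes Booth', 'Kill', 'Lincoln']]"
explanation = " John Wilkes Booth assassinated Lincoln. "

prompt = elicitation_prompt(text, relations)
target = training_target(explanation, relations)
```

Because the reference relations are supplied in the prompt, the large model only has to justify a known answer; the smaller model is then trained to produce both the justification and the answer from the input text alone.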
ADE We obtain explanations for the entire training set and fine-tune Flan-T5 (Large) with an instructional prefix, using a batch size of 8 and a learning rate of 3e-5 for 6 epochs. The dataset defines 10 folds of train/test splits, and we evaluate using the best checkpoint for each fold in the dataset. Our model yields a 9.97 point gain in micro-F1 score (averaged over the folds) over the existing fully supervised generative SOTA (REBEL; Huguet Cabot and Navigli, 2021).
CONLL For CONLL, we again obtain CoT-style explanations for the entire dataset via GPT-3.We then fine-tune with a batch size of 4 and learning rate 3e-5 for 10 epochs and evaluate using the best-performing checkpoint on the validation set.We see a 5.42 absolute point gain on the micro-F1 score over the existing fully-supervised generative SOTA.
NYT comprises 56k training examples.In this case we generate CoT explanations via GPT-3 for only a subset of 25k examples (about half of the train set), due to its large size and the associated cost.We fine-tune the model with a batch size of 4, learning rate 2e-5 for 4 epochs and then evaluate using the best performing checkpoint on the validation set.We obtain a 3.37 point gain on the micro-F1 score over the existing fully-supervised SOTA.
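For reference, the per-dataset fine-tuning settings reported above can be collected in one place, as in the sketch below. The `FineTuneConfig` container and `total_steps` helper are hypothetical conveniences, not part of any training library; the step count assumes no gradient accumulation and, for NYT, that training uses only the 25k examples for which explanations were generated.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    batch_size: int
    learning_rate: float
    epochs: int


# Values as reported in the text for each dataset.
CONFIGS = {
    "ADE": FineTuneConfig(batch_size=8, learning_rate=3e-5, epochs=6),
    "CoNLL": FineTuneConfig(batch_size=4, learning_rate=3e-5, epochs=10),
    "NYT": FineTuneConfig(batch_size=4, learning_rate=2e-5, epochs=4),
}


def total_steps(n_examples, config):
    """Optimizer steps for a full run, assuming no gradient accumulation."""
    return math.ceil(n_examples / config.batch_size) * config.epochs


# Assumption: NYT training uses only the 25k examples that have CoT explanations.
nyt_steps = total_steps(25_000, CONFIGS["NYT"])
```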
In sum, fine-tuning Flan-T5 (large) with both train labels and CoT explanations produced by GPT-3 yields SOTA performance across RE datasets by a considerable (5-10 points micro-F1) margin (Figure 1).

"Fully Supervising" Flan with GPT-3
Above we showed that Flan-T5 (large) outperforms existing RE methods by substantial margins when trained using CoTs from GPT-3.Now we ask whether we can take this approach of distillation from GPT-3 even further by eliciting both labels and CoT explanations from GPT-3 in a few-shot setting, and then using these to train Flan-T5.That is, above we used the reference labels for training, whereas here we use "labels" produced by GPT-3 given just a handful (10s) of training instances as shots.We run this experiment only on CoNLL due to the cost of processing datasets in this way (which requires running few shot inference in GPT-3 over entire training sets).
To generate the targets in this case, we start with an instructional prefix and 12 training instances from CoNLL and their corresponding human-written explanations; this is the same setup as the in-context GPT-3 model (Table 2, 2.b.), though here we apply this to the training instances. We then prompt GPT-3 on all training instances except for the 12 shots to produce pseudo labels (relations) and associated CoT explanations.
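The pseudo-labeling loop just described can be sketched as follows. Both `build_prompt` and `generate` are caller-supplied, hypothetical stand-ins (the latter for a call to the large model), and the stub usage at the end exists only to show the data flow.

```python
def pseudo_label(train_texts, shot_indices, build_prompt, generate):
    """Label every training text except the few-shot exemplars.

    build_prompt(query_text) -> str and generate(prompt) -> str are
    caller-supplied callables; both are hypothetical stand-ins.
    """
    held_out = set(shot_indices)
    labeled = []
    for index, text in enumerate(train_texts):
        if index in held_out:
            continue  # the shots already have reference labels and explanations
        labeled.append({"text": text, "target": generate(build_prompt(text))})
    return labeled


# Stub usage: 15 toy inputs, the first 12 serving as shots.
texts = [f"sentence {i}" for i in range(15)]
pseudo = pseudo_label(
    texts,
    shot_indices=range(12),
    build_prompt=lambda query: f"<12 shots>\n{query}",
    generate=lambda prompt: "Explanation: ...\nRelations: []",  # placeholder output
)
```

The returned records play the role that reference labels plus generated explanations played in the previous section; no gold relations are consulted for the non-shot instances.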
Using this new GPT-generated training data, we again fine-tune Flan-T5 (Large) as described above (Section 4.2.2), and evaluate it on the validation set. This approach marginally outperforms the existing fully-supervised SOTA (Huguet Cabot and Navigli, 2021), but underperforms fine-tuning Flan with references and GPT-generated explanations (Table 2, 2.c.).

Related work
Standard NLP methods for identifying relations in free text have included Conditional Random Fields (Lafferty et al., 2001), structured SVMs (Tsochantaridis et al., 2004), and, more recently, training large deep learning models with a joint objective (Eberts and Ulges, 2021, 2019a; Wang and Lu, 2020) to identify entities and relations simultaneously. The rise of massive language models (Radford and Narasimhan, 2018; Radford et al., 2019; Brown et al., 2020a) has also motivated research into prompt-based learning methods for structured prediction (Wang et al., 2022).

Relation extraction with pre-trained LMs
Several recently proposed RE approaches (which we have built upon here) address the task using conditional generative models to output string encodings, i.e., linearized forms, of target relations (Zeng et al., 2018, 2020; Nayak and Ng, 2020; Huguet Cabot and Navigli, 2021). Paolini et al. (2021) proposed a framework that formulated many structured prediction tasks, including relation extraction, as a seq2seq problem where they decode outputs into structured information. Huguet Cabot and Navigli (2021) extended this line of work by training a SOTA BART-style (Lewis et al., 2020) model specifically for relation extraction using a unique triplet linearization strategy. Beyond these task-specific models, Wang et al. (2022) proposed a task-agnostic structured pre-training scheme which enables zero-shot transfer to several structured prediction tasks.
These past efforts focused solely on fine-tuning seq2seq models, adopting standard supervised approaches to learning to generate the relations expressed in a given input. (REBEL incorporated a pretraining scheme designed for RE (Huguet Cabot and Navigli, 2021), but this was in addition to a fine-tuning step.) In this work we also evaluate the ability of large language models to perform few-shot relation extraction via in-context learning; to our knowledge this is the first such evaluation for RE specifically, although few-shot learning more generally is an active sub-area of research.

Few Shot In-Context Learning
Few shot in-context learning entails incorporating a few training examples into model prompts, effectively "learning" via the activations induced by passing these examples through the network at inference time. This has the advantage of completely forgoing model weight updates, which can be costly for LLMs (Wang et al., 2021). An active area of research concerns such cross-task generalization capabilities (Ye et al., 2021; Wei et al., 2022a; Min et al., 2022; Xu et al., 2022) of LLMs, where a model learns a new, previously-unseen task efficiently with just a few examples. Chen et al. (2022) also proposed a self-supervised objective as an intermediate stage between pre-training and downstream few-shot learning. Recent work on few shot in-context learning has largely focused on the selection (Liu et al., 2022) and ordering (Lu et al., 2022a) of exemplars included in the prompt provided to the model.

Conclusions and Future Directions
We have evaluated the capabilities of modern large language models (LLMs), specifically GPT-3 and Flan-T5 (Large), on the task of Relation Extraction (RE). We found that, when evaluated carefully, GPT-3 performs comparably to fully supervised state-of-the-art (SOTA) models, given only tens of examples. We then proposed a distillation technique in which we augmented target RE labels with Chain of Thought (CoT) style explanations elicited from GPT-3 and used this to fine-tune Flan-T5; this yielded SOTA performance across all datasets considered, often by wide margins (5-10 points in F1). Our results suggest that where feasible, LLMs should be a standard baseline for RE.

Future directions
We have left several avenues open for further exploration. For example, evaluating LLMs like GPT-3 for RE required collecting manual annotations to identify ostensible "false positive" and "false negative" model outputs which were in fact accurate. Designing models to automate this evaluation might provide similar reliability without the accompanying costs; we provide preliminary work in this direction through the use of simple BERT-style classifiers in Appendix D.

Limitations
We have demonstrated that across three standard RE datasets, LLMs achieve SOTA results. In particular, GPT-3 yields such performance even given only tens of training samples for in-context learning. We then showed that we can similarly achieve SOTA performance with the much smaller (and open-source) Flan-T5 (Large) model, when trained using CoT generations produced by GPT-3. We also highlighted key challenges for evaluation in this setting.
But there are important limitations to these contributions. First, here we considered three standard RE datasets with binary relations but, as we discussed, we excluded more complex RE datasets. For example, we did not consider corpora containing n-ary relations between entities (Taboureau et al., 2010). We were also unable to run experiments on datasets with lengthy texts and a large number of relations, such as DocRED (Yao et al., 2021), due to the necessary prompt lengths for such inputs.
Second, while we found that CoT-style explanations generated by GPT-3 can be fruitfully used as additional supervision to fine-tune smaller language models, we made no attempt to evaluate the quality of these generated explanations, which may have an impact on model performance.
Third, we did not fine-tune GPT-3 on the RE datasets, mainly due to the cost of doing so. It is likely that a fine-tuned GPT-3 would yield performance superior to the results we achieved with Flan-T5 (which constitute current SOTA). But, in addition to the costs necessary for fine-tuning this model, the resultant weights would not be accessible to run locally in any case; one would have access to it only via the OpenAI interface, which motivated our decision to fine-tune the smaller and open-source Flan-T5 instead.
Finally, we only experiment with datasets curated in the English language, and therefore we do not know whether the issues we have highlighted would replicate in the same way in other languages.

A Datasets
We evaluated our methods on the following datasets. Basic data statistics are also reported in Table 1.
ADE Adverse Drug Events (Gurulingappa et al., 2012) contains binary relations of (drug, adverse event) pairs. Drugs and adverse events are the only two entity types. This dataset provides a 10-fold split.
CONLL04 The CoNLL04 dataset consists of sentences from news articles that were annotated for the mentioned entities and relations between entities (Roth and Yih, 2004). It includes four entity types (PER, ORG, LOC, OTH) and five possible relations (KILL, WORK_FOR, LIVE_IN, LOCATED_IN, ORG_BASED_IN).
NYT The NYT dataset comprises sentences sampled from New York Times news articles published between 1987 and 2007 (Riedel et al., 2010). The data was distantly annotated with relation triplets from FreeBase. We use a processed version of NYT (Zeng et al., 2018)

B Models and Reproducibility
We provide average micro metrics over 5 seeds across each dataset in Table 3. For Flan-T5 (Large), where we do fine-tuning, some hyperparameters were manually tuned but most were left at their default values. The final values for the ones that were manually tuned are provided in Table 4. We perform all experiments with a single NVIDIA Quadro RTX 8000 with 64GB of RAM on an Intel Xeon E5-2680 v4 (2.4GHz).

B.1 Costs ($)
We provide details on the costs we incurred while running experiments on GPT-3 in Table 5.

C Prompts
We use the following prompt elements as few-shot exemplars corresponding to each dataset in our evaluation. Inputs and target references are directly extracted from the original training sets, while the explanations are human-written and were added when necessary for the experiments described in Sections 3 and 4.

ADE
Example Instructional Prefix: List all [drug, adverse effects] pairs in the TEXT provided below.
TEXT: We report on three observations of parkinsonian patients with levo-dopa-induced diphasic dyskinesias, who received subcutaneous apomorphine to reduce the duration of abnormal movements.

Relations: [['G-CSF', 'erythematous papular eruption']]
Explanation: G-CSF therapy caused erythematous papular eruption in a girl with cystic fibrosis.<s>TEXT: Hypersensitivity to carboplatin is a rare but real complication of therapy and should be considered in patients

D Learning to Identify False False Positives and Negatives
As discussed in the main paper, one common problem across datasets in generative RE is evaluation, given that LMs are flexible in how they might express entities and relations. Prior work in RE has tended to rely on standard metrics to quantify performance (precision, recall, micro-F1). These rely on matching classified (or in our case, generated) labels to reference labels to calculate the number of true positives (TPs), false positives (FPs), true negatives (TNs), and false negatives (FNs).
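For concreteness, the micro-averaged metrics under exact matching can be written as below. The notation is introduced here for illustration only: $\hat{Y}_i$ denotes the set of relations generated for instance $i$ and $Y_i$ the corresponding reference set, with counts pooled over all instances before the ratios are taken. True negatives do not enter these three quantities.

```latex
\mathrm{TP} = \sum_{i} \lvert \hat{Y}_i \cap Y_i \rvert, \qquad
\mathrm{FP} = \sum_{i} \lvert \hat{Y}_i \setminus Y_i \rvert, \qquad
\mathrm{FN} = \sum_{i} \lvert Y_i \setminus \hat{Y}_i \rvert

P = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad
R = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad
F_1 = \frac{2 P R}{P + R}
```

Under this definition, a generated relation that differs from its reference in any surface detail contributes one FP and one FN at the same time, which is why small formatting differences are penalized twice.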
Prior to the introduction of LLMs for generative RE, Taillé et al. (2020) attempted to unify evaluation and provide useful guidelines around issues associated with prior methods and how different evaluation strategies rendered an accurate comparison infeasible. They broadly recommended the use of a strict evaluation scheme where, for a relation triplet to be considered correct, the head and tail entity surface forms must be an exact match, as well as their corresponding types (when available). While this provides a standardized framework for traditional models where entities and relations are hard classification labels, in a generative setting we often find that LLMs, under varying levels of supervision, produce relation triplets (or pairs) that do not correspond exactly to their reference counterparts, but are nonetheless correct upon manual review. Consider the following example from CoNLL in Figure 2:
Text: On Friday, U.S. Ambassador Vernon A. Walters... fuselage.
Gold Reference: [(Vernon A. Walters, 'Live In', U.S.)]
Generated Relations: [[Vernon A. Walters, 'Works For', U.S.]]
In this example, one can reasonably infer that Vernon A. Walters is a U.S. Ambassador. By definition, a U.S. diplomat to another country cannot live inside the U.S., but such a person must work for the U.S. (commonsense dictates that a diplomat would work for a specific country).
To achieve a more accurate characterization of how LLMs perform on generative RE tasks, we hired human annotators on Amazon Mechanical Turk⁴ to manually re-assess all ostensible FPs and FNs from each of our datasets. To control for quality and recruit annotators, we ran pilot experiments on 50 instances of pre-annotated data.⁵ We required AMT workers to have an overall approval rating of > 95%, irrespective of geographic region. Based on this initial set of results, we hired a total of 9 workers who reliably followed our instructions. Recruited workers were paid periodic bonuses (equivalent to one hour of pay) based on the quality of their annotations.
To identify potentially faulty "false positives", we provided annotators with the input text along with the relation identified as a FP, and asked the following question: "Can the given relation be reasonably derived from the text?". Similarly, to identify erroneous "false negatives", we provided annotators with the input text, the full set of generated labels, and the ostensible FN from the reference set, and asked: "Can the reference relation triplet (or pair) be inferred from the generated set of relations?". Each instance was annotated by three different AMT workers, and we considered a potential FP/FN to be inaccurate only when all annotators agreed on a label.⁶ We provide specific examples of FPs and FNs in Tables 8 and 7. We summarize the dataset-specific findings in Table 6.
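The unanimity rule can be sketched as follows. This is an illustration under stated assumptions: each vote is taken to be a boolean "yes" answer to the annotation question, and the function name, instance identifiers, and votes are hypothetical.

```python
def is_overturned(votes: list[bool]) -> bool:
    """True only if every annotator answered 'yes' to the question asked.

    Assumption: a 'yes' means the annotator judged the ostensible FP to be
    derivable from the text (or the ostensible FN to be inferable from the
    generated relations).
    """
    return len(votes) > 0 and all(votes)


# Hypothetical annotations: instance id -> answers from three workers.
annotations = {
    "fp_001": [True, True, True],    # unanimous: treated as an inaccurate FP
    "fp_002": [True, False, True],   # not unanimous: remains an FP
    "fn_001": [False, False, False], # unanimous 'no': remains an FN
}

overturned = [key for key, votes in annotations.items() if is_overturned(votes)]
print(overturned)
```

Requiring all three votes is a conservative choice: a single dissenting annotator leaves the original FP/FN designation in place.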
In light of these findings, we make a first effort at using simple, learned models to classify false positives/negatives in generative RE. We experiment with a fine-tuned BERT (Devlin et al., 2019) classifier to classify "false positives" and "false negatives" as being accurate designations (or not). For FPs, we concatenate the input with a generated relation pair/triplet (potential FP) and classify using the [CLS] token. We analyze the effectiveness of this approach in Figure 4 using the AUC-ROC. We find that this approach is most effective at identifying potential false positives for CoNLL (AUC 0.88), while being least effective at identifying false negatives for CoNLL (AUC 0.73). This suggests that learning to identify erroneous "false positives" and "false negatives" may be a promising avenue to facilitate accurate automated evaluation of generative LLMs for RE.
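For readers unfamiliar with the metric, AUC-ROC can be read as the probability that a randomly chosen positive instance receives a higher classifier score than a randomly chosen negative one. The sketch below computes it by pairwise comparison; the function name, labels, and scores are hypothetical and unrelated to the reported results.

```python
def auc_roc(labels: list[int], scores: list[float]) -> float:
    """AUC-ROC via pairwise comparison of positive and negative scores.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores; label 1 marks an inaccurate FP designation.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(auc_roc(labels, scores))
```

The quadratic loop is fine for a few thousand annotated instances; a rank-based formulation would be preferable for larger sets.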

Figure 1: RE performance of LLMs on the CoNLL dataset. (1) Few-shot GPT-3 slightly outperforms the existing fully supervised SOTA method (Huguet Cabot and Navigli 2021; dotted horizontal line). (2) Eliciting CoT reasoning from GPT-3 further improves few-shot performance. (3) Fine-tuning Flan-T5 (large) is competitive with, but no better than, existing supervised methods, but (4) supervising Flan-T5 with CoT reasoning elicited from GPT-3 substantially outperforms all other models.
is then a pair of input text and a linearized target string:
Input: Bill Nelson, NASA administrator announced the mars mission today.
Target: [(Bill Nelson:Per, Work_For, NASA:Org)]
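A minimal sketch of this linearization is given below. The function name `linearize` and the nested-tuple representation of a triplet are assumptions made for illustration; only the target string format is taken from the example above.

```python
def linearize(triplets) -> str:
    """Render ((head, head_type), relation, (tail, tail_type)) triplets
    as a single bracketed target string."""
    parts = [
        f"({head}:{head_type}, {relation}, {tail}:{tail_type})"
        for (head, head_type), relation, (tail, tail_type) in triplets
    ]
    return "[" + ", ".join(parts) + "]"


source = "Bill Nelson, NASA administrator announced the mars mission today."
target = linearize([(("Bill Nelson", "Per"), "Work_For", ("NASA", "Org"))])
print(source)
print(target)
```

The resulting (source, target) pair is what a sequence-to-sequence model would be trained on or prompted with.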

Figure 3: We propose fine-tuning Flan-T5 (large) for relation extraction (RE) using standard supervision and Chain-of-Thought (CoT) reasoning elicited from GPT-3 for RE. This yields SOTA performance across all datasets considered, often by a substantial margin (∼5 points absolute gain in F1).

[CLS] Input Text [SEP] Potential FP
Similarly, for FNs we concatenate the input text with a potential FN and the full set of generated labels, and classify using the [CLS] token:
[CLS] Input Text [SEP] Potential FN [SEP] Generated Labels
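The two input layouts can be sketched as plain string templates. The function names are hypothetical, and writing the special tokens out literally is an assumption made for readability; in practice a BERT tokenizer typically inserts [CLS] and [SEP] itself when given text segments.

```python
def build_fp_input(text: str, potential_fp: str) -> str:
    """Layout for checking an ostensible false positive."""
    return f"[CLS] {text} [SEP] {potential_fp}"


def build_fn_input(text: str, potential_fn: str, generated: str) -> str:
    """Layout for checking an ostensible false negative."""
    return f"[CLS] {text} [SEP] {potential_fn} [SEP] {generated}"


# Abbreviated text from the CoNLL example discussed in this appendix.
text = "On Friday, U.S. Ambassador Vernon A. Walters..."
fp_example = build_fp_input(text, "(Vernon A. Walters, Works For, U.S.)")
fn_example = build_fn_input(
    text,
    "(Vernon A. Walters, Live In, U.S.)",
    "[[Vernon A. Walters, Works For, U.S.]]",
)
print(fp_example)
print(fn_example)
```

The FN layout carries the full set of generated labels because the annotation question for FNs asks whether the reference relation can be inferred from what the model generated.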

Table 1: Dataset statistics. Train, validation and test indicate the number of relation triplets in each dataset.
deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota. Association for Computational Linguistics.
containing three overlapping entity types (LOC, PER, ORG) and 24 relation types.

Table 3: Average micro metrics over 5 seeds for the test sets (10-folds for ADE).

Table 4: Hyperparameters and compute time for the fully fine-tuned Flan models (corresponding to main results Table 2).

Table 5: Summary of costs incurred by prompting and using GPT-3 as a labeler for RE.
Rome is in Lazio province and in France, added, "That a mogul like Sumner Redstone could make a statement so vicious, so pompous, so petulant as that he didn't want to make a deal with Tom Cruise because of his personal conduct - it tells you more about Sumner Redstone and Viacom, than about Tom Cruise".
Relations: [['Sumner Redstone:Per', '/business/company-shareholder/major-shareholder-of', 'Viacom:Org']]
Explanation: Sumner Redstone is a major shareholder of the company Viacom.<s>
TEXT: It is a room of paintings by Leonard Peltier , a citizen of the Anishinabe and Dakota and Lakota nations who is serving two consecutive life terms in Pennsylvania for the murder of two F.B.I. agents on the Pine Ridge Reservation in South Dakota.
Relations: [['Leonard Peltier:Per', '/people/person/ethnicity', 'Lakota:Per'], ['Lakota:Per', '/people/ethnicity/people', 'Leonard Peltier:Per']]
Explanation: Leonard Peltier is a member of the Lakota native-american tribe and consequently belongs to that ethnic group.<s>
TEXT: INSIDE THE N.B.A. Correction : February 9 , 2006 , Thursday A sports article on the Spotlight page on Sunday about Dick Bavetta , a longtime referee in the National Basketball Association, misstated the number he was approaching to set the record for regular-season games worked.
Relations: [['Dick Bavetta:Per', '/people/person/profession', 'National Basketball Association:Org']]
Explanation: Dick Bavetta is a person who's profession is that of a referee in National Basketball Association.<s>
TEXT: Now the United States Postal Service may be displaying a similar rebellious streak : tomorrow at the huge Sturgis motorcycle rally in the Black Hills of South Dakota, the Postal Service will issue a set of four stamps that depict classic American bikes.
Relations: [['United States Postal