Towards Realistic Few-Shot Relation Extraction

In recent years, few-shot models have been applied successfully to a variety of NLP tasks. Han et al. (2018) introduced a few-shot learning framework for relation classification, and since then, several models have surpassed human performance on this task, leading to the impression that few-shot relation classification is solved. In this paper, we take a deeper look at the efficacy of strong few-shot classification models in the more common relation extraction setting, and show that typical few-shot evaluation metrics obscure a wide variability in performance across relations. In particular, we find that state-of-the-art few-shot relation classification models overly rely on entity type information, and propose modifications to the training routine to encourage models to better discriminate between relations involving similar entity types.


Introduction
Few-shot approaches have been explored in a variety of natural language processing (NLP) tasks, such as machine translation (Gu et al., 2018) and textual entailment (Yin et al., 2020), as well as an assortment of classification and natural language inference tasks (Yan et al., 2018; Brown et al., 2020b). The introduction of large, pretrained transformer language models in NLP increased the promise of building systems that can perform complex NLP tasks from only a small number of training examples (Brown et al., 2020a). Han et al. (2018) introduced a few-shot learning framework for relation classification (FewRel), and recently several systems have achieved near-human performance on this task, in some settings even exceeding it.1 These stunning results might give the impression that FewRel has been solved, and that these systems can be used to extract any relation of interest from a collection of text (e.g., to populate a knowledge base from web documents) with only a few example instances.
In this paper, we take a deeper look at the applicability of high-performing few-shot relation classification models in a more realistic relation extraction (RE) setting. We find that although transformer-based models achieve high accuracy at the FewRel task, this obscures a wide variability in performance across relation types. In particular, we find that state-of-the-art FewRel models heavily rely on entity type information, and thus are unable to discriminate between many types of relations that are trivial for humans (e.g., spouse-of vs. child-of). However, we find that enriching the training data with relations involving similar entity types forces the model to attend less to entity type information, and in a ranking evaluation, improves performance on unseen relations by up to 24% precision at 50, absolute.
Few-shot Relation Extraction

FewRel 1.0
The FewRel challenge (Han et al., 2018) introduced the few-shot paradigm to relation classification. The authors provided a dataset and evaluation framework for measuring performance on the task. They also adapted several few-shot text classification systems to relation classification. The dataset comprised 700 example instances from Wikipedia for each of 100 relations. Of these, 64 relations were designated for training, 16 for validation, and 20 were withheld for testing. The evaluation was structured as follows: for each instance, N relations were chosen from the pool and, for each relation, K support examples were sampled. These examples were the only information provided to the model regarding these relations. In addition, Q query examples were selected from each of the N relations, and the task of the model was to decide the correct label for each query from among the N possible answers (N-class classification). This process was repeated a fixed number of times (1,000 by default) and the results were averaged.
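Concretely, the N-way K-shot episode construction can be sketched as follows. This is a minimal illustration with placeholder data, not the FewRel reference implementation:

```python
import random

def sample_episode(dataset, n_way=5, k_shot=5, n_query=1, rng=None):
    """Sample one N-way K-shot episode from a {relation: [instances]} pool."""
    rng = rng or random.Random()
    relations = rng.sample(sorted(dataset), n_way)
    support, queries = {}, []
    for rel in relations:
        # Draw K support + Q query instances per relation, without overlap.
        picked = rng.sample(dataset[rel], k_shot + n_query)
        support[rel] = picked[:k_shot]
        queries += [(rel, inst) for inst in picked[k_shot:]]
    return support, queries

# Toy pool: 3 relations with 10 placeholder instances each.
data = {f"P{i}": [f"sent_{i}_{j}" for j in range(10)] for i in range(3)}
support, queries = sample_episode(data, n_way=2, k_shot=3, rng=random.Random(0))
```

The model sees only `support` as supervision for the sampled relations, and must label each entry of `queries` with one of the N sampled relation names.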

State of the Art (SOTA)
Although multiple representation and modeling strategies were proposed by the authors and challenge participants, the authors' most successful CNN-based system, as well as the current top performing system (Baldini Soares et al., 2019) employed a simple prototype approach. In this approach, each example was encoded as a vector, and the prototype representation for each relation was derived as the average across all exemplar vectors. The representation of a query example was then compared to the prototypical representations of the candidate relations, and the most similar one (by inner product) was chosen as the label.
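Given vector encodings for the support examples, the prototype step reduces to a mean and an inner-product argmax. A minimal NumPy sketch, with random stand-in vectors in place of a learned encoder:

```python
import numpy as np

def prototypes(support):
    """support: {relation: (K, d) array of example encodings} -> {relation: (d,) prototype}."""
    return {rel: vecs.mean(axis=0) for rel, vecs in support.items()}

def classify(query_vec, protos):
    """Label the query with the relation whose prototype has the largest inner product."""
    return max(protos, key=lambda rel: query_vec @ protos[rel])

rng = np.random.default_rng(0)
# Two 5-shot support sets in 8 dimensions; P40's cluster is shifted by +2.
support = {"P26": rng.normal(size=(5, 8)), "P40": rng.normal(size=(5, 8)) + 2.0}
protos = prototypes(support)
label = classify(np.full(8, 2.0), protos)  # query lies near the shifted P40 cluster
```

In the actual systems, the encoder producing these vectors is trained end-to-end; only the averaging and inner-product comparison are shown here.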

FewRel 2.0
Two main concerns were raised about the FewRel 1.0 setup: the cross-domain applicability of the models, and their performance in the none-of-the-above (NOTA) scenario, where a portion of the query instances may belong to none of the candidate relations. FewRel 2.0 (Gao et al., 2019) was designed to address both of these concerns. Our work addresses orthogonal issues, and we limit ourselves to the FewRel 1.0 dataset.

Experimental Setup
Our first set of experiments was exploratory. We set out to reproduce the results of several high-scoring models on the FewRel 1.0 dataset, and to examine how they performed in a setting that more closely reflected a few-shot RE use case, where the model is expected to extract instances of a small number of relations from a large corpus, with high precision and recall. We refer to this scenario as "Realistic RE" because it is common in applied RE tasks such as knowledge-base population (KBP) and information extraction. Since prototypical models offered simplicity and best-in-class performance, we focused on those.
In addition to the CNN and BERT models, we evaluated RoBERTa-based encoders; we did not evaluate systems whose authors have not made their models or training code public. For representing individual relation instances, we followed two common strategies from the RE literature, both implemented in the FewRel codebase. In both strategies, the subject and object entities in the sentence are enclosed in special tokens. In the first strategy (CLS), the encoding of the [CLS] token of the subword-tokenized sentence is used as its representation. In the second strategy (ENT), the encodings of the two special tokens preceding the entities are concatenated to form the example representation.
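The marker-insertion step can be sketched as follows. The marker token names ([E1], [/E1], etc.) are illustrative; implementations differ in which reserved tokens they use:

```python
def insert_entity_markers(tokens, subj_span, obj_span,
                          subj_marks=("[E1]", "[/E1]"), obj_marks=("[E2]", "[/E2]")):
    """Enclose the subject and object token spans in marker tokens.
    Spans are (start, end) token offsets with end exclusive; marker names are illustrative."""
    inserts = sorted([
        (subj_span[0], subj_marks[0]), (subj_span[1], subj_marks[1]),
        (obj_span[0], obj_marks[0]), (obj_span[1], obj_marks[1]),
    ], key=lambda x: -x[0])  # insert right-to-left so earlier offsets stay valid
    out = list(tokens)
    for pos, mark in inserts:
        out.insert(pos, mark)
    return out

tokens = "Ann married Bob in 1990".split()
marked = insert_entity_markers(tokens, subj_span=(0, 1), obj_span=(2, 3))
# -> ['[E1]', 'Ann', '[/E1]', 'married', '[E2]', 'Bob', '[/E2]', 'in', '1990']
```

Under ENT, the encoder's hidden states at the positions of the two opening markers would be concatenated; under CLS, the hidden state at the [CLS] position is used instead.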
We averaged the representations of the examples to create a prototype representation for each relation, and computed similarity between prototype and query by inner product. All models were trained with the default set of hyperparameters provided in the FewRel repository, without any tuning. Since the official test set is not publicly available, we report validation set performance. This set was not used during training or tuning.
For evaluation that more closely mirrors the realistic few-shot RE scenario, we used a total ranking setting. Instead of randomly sampling K examples of N relations for each instance, we compiled a single test set containing 50 instances of each relation. We then evaluated performance on this test set on a relation-by-relation basis. For each relation, we sampled K = 5 examples and created a prototypical representation. We then ranked the entire test set by similarity to the prototype representation, and calculated precision at 50 (P@50) for this ranking (Järvelin and Kekäläinen, 2017).2 We repeated this process 10 times and report mean P@50.3 All models were trained using a single V100 GPU.
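The total-ranking evaluation amounts to sorting the pooled test set by similarity to the prototype and scoring the top of the ranking. A schematic sketch with placeholder vectors standing in for encoder outputs:

```python
import numpy as np

def precision_at_k(prototype, test_vecs, test_labels, target_rel, k=50):
    """Rank every test instance by inner product with the prototype and
    return the fraction of the top-k whose gold label is the target relation."""
    scores = test_vecs @ prototype
    top_k = np.argsort(-scores)[:k]
    return sum(test_labels[i] == target_rel for i in top_k) / k

rng = np.random.default_rng(1)
# Pooled test set: 3 relations x 50 instances, 16-dim placeholder encodings.
labels = ["P26"] * 50 + ["P40"] * 50 + ["P361"] * 50
vecs = rng.normal(size=(150, 16))
vecs[:50] += 3.0                # make the target relation's cluster separable
proto = vecs[:5].mean(axis=0)   # 5-shot prototype from the target relation
p_at_50 = precision_at_k(proto, vecs, labels, "P26", k=50)
```

With a well-separated cluster this yields near-perfect P@50; the variability reported in Table 2 arises because real relation clusters overlap heavily for some relation types.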

Results
Performance at FewRel and RE In Table 1 we can see the performance of several models in the FewRel 5-way-5-shot setup. We chose these values for K and N since they are the most similar to the few-shot RE scenario described above. All models were trained on the FewRel training set for 10,000 iterations with default parameters. As seen in the table, performance increases with model complexity and size, with a large gap between the CNN model and the transformer-based models, and smaller gaps within the latter group. Entity representations consistently outperform sentence-level, [CLS] token representations across model classes, suggesting that entity representations provide a powerful signal for FewRel models.

Table 2 shows P@50 model performance in the RE setting. Performance varies widely between relations and, for many relations, is much lower than one would expect from the numbers in Table 1. We see that the ordering of models shown in Table 1 is not maintained on individual relations. There is, however, a rough ordering of difficulty among relations, with all models achieving > 85% precision on the top half of relations.
For the rest of the paper, we report results from a single model, RoBERTa-base with ENT representation, due to the tractable size of the model and its strong overall performance. Table 3 displays the two most frequent confounders for the RoBERTa-base ENT representation for the least accurate relations. We can see that on these "hard" relations the model gets confused with one or two primary confounders a significant portion of the time.

Difficult Relations
For many of these relations, the confusion is justified, and probably does not represent actual false positives. For example, when extracting relation P206 (the relation between a location and a nearby body of water), the model selects instances of P177 (the relation between a road or bridge and the natural obstacle it crosses) as positives. The most difficult relation, P361 ("part of"), is confused with instances of P463 ("member of organization") and P59 ("star's constellation"), which are both subtypes of P361. These types of errors are also likely to be made by humans, and may account for the somewhat low human performance on the FewRel task. However, there is one group of relations (mother, child, spouse) which are easily distinguished by humans but confused by the models. These relations all share the same entity type signature: both entities are people. Since several recent papers (see Section 5) demonstrate that supervised RE models rely heavily on entity type information, we hypothesize that few-shot models do the same.
To test this hypothesis, we evaluated the models on TACRED data. In this evaluation, family relations are confused, as are other groups of relations which share a similar type signature. Table 4 shows the results for a subset of these groups. See Appendix C for the full confusion matrix for all TACRED person and organization relations. This confirms that even high-scoring few-shot RE models rely primarily on entity type information, and find it hard to distinguish between relations with similar type signatures.

Alternate Representations
Under the hypothesis that the choice of the ENT representation strategy was responsible for the entity-type bias, we experimented with other representations, including CLS (as described above), concatenation of ENT and CLS, and entity masking. Our results showed no significant improvement in distinguishing confusable relations. This indicates that the models are still focusing on the entity-type information, even if they are getting it indirectly (e.g., from the [CLS] tag attending to the entity parts of the sentence).

Data Augmentation
We note that the FewRel training dataset does not contain confusable relations. Even if a few such relations existed in the dataset, the training procedure, which randomly samples a small subset of relations from a large pool, would rarely produce a difficult episode in which two or more of the sampled relations share the same type signature.
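To illustrate how rare such episodes are, the probability that a random N-way episode contains two same-type relations follows a simple hypergeometric argument. The group size of 4 same-type relations below is a hypothetical count chosen for illustration, not a statistic of the FewRel training set:

```python
from math import comb

def prob_confusable_episode(pool=64, same_type=4, n_way=5):
    """Probability that a random N-way episode drawn from `pool` relations
    contains at least two relations from a same-type group of size `same_type`
    (complement of the hypergeometric 'at most one' event)."""
    total = comb(pool, n_way)
    at_most_one = (comb(pool - same_type, n_way)
                   + same_type * comb(pool - same_type, n_way - 1))
    return 1 - at_most_one / total

p = prob_confusable_episode()  # roughly 0.028 with these illustrative counts
```

So under these assumptions, fewer than 3% of 5-way training episodes would force the model to separate two relations with the same type signature, which is consistent with the model never learning to do so.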
In the supervised setting, Rosenman et al. (2020) show that model performance can be improved by introducing (manually-created) challenging examples into the training data. We attempted a similar remedy in the FewRel setting, by adding examples from TACRED, which contains a large number of confusable relations. In order to avoid overlap between training and test data, we split the TACRED relations into two groups: all person relations were added to the FewRel training set, and all organization relations were used for testing.
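The split itself is mechanical, since TACRED relation names carry per: and org: prefixes; a minimal sketch with a hypothetical subset of relation names:

```python
def split_by_entity_type(relations):
    """Partition TACRED relation names into person relations (used to augment
    FewRel training) and organization relations (held out for testing)."""
    person = sorted(r for r in relations if r.startswith("per:"))
    org = sorted(r for r in relations if r.startswith("org:"))
    return person, org

rels = ["per:spouse", "per:children", "org:founded_by", "org:subsidiaries"]
train_aug, test_rels = split_by_entity_type(rels)
# train_aug -> ['per:children', 'per:spouse']
# test_rels -> ['org:founded_by', 'org:subsidiaries']
```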

Table 5 shows the results of training on the FewRel dataset augmented with TACRED person relations, and testing on TACRED organization relations. We can see improved results and less confusion among the relations sharing similar entity type signatures. This means that with the addition of more challenging examples, the model was forced to look beyond the entity-type signature and incorporate other information suitable for distinguishing confusable relations. Note that the TACRED organization relations were never observed during training. In addition, augmenting the training data with TACRED person relations improves the overall accuracy from 95.18% to 96.16% on the original FewRel validation set, and from 82.54% to 85.48% on the TACRED organization relations.

Related Work
Several recent papers have analyzed the weaknesses of SOTA supervised relation extraction systems, primarily on the TACRED dataset (Zhang et al., 2017). Rosenman et al. (2020) list several "lazy" strategies employed by supervised SOTA models in the TACRED challenge, including the "entity-type heuristic" which relies solely on entity types, ignoring context. Alt et al. (2020) perform in-depth error analysis and show that many errors stem from confusing relations with identical (coarse) type signatures, and ignoring context. Tran et al. (2020) present a system that uses only entity information (in the form of unsupervised cluster IDs) to match SOTA results on TACRED. To the best of our knowledge, there has not been a similar error analysis for few-shot classification models.

Conclusions
In this work we explored the applicability of few-shot relation classification models in a relation extraction setting. We showed that high classification accuracy does not translate to high extraction performance, due to the reliance of few-shot models on entity type information. As a result, the models tend to perform poorly on relations involving broad entity types, such as people, locations, or dates. By explicitly adding confusable relations at training time, we force the model to rely less heavily on entity types, and consequently to discriminate better between relations with similar argument types. Further modifications to the training sampler that encourage the model to downweight entity type information are the subject of ongoing work.

P59: The area of the celestial sphere of which the subject is a part (from a scientific standpoint, not an astrological one).
P364: Language in which a film or a performance work was originally created.
P410: Military rank achieved by a person, or military rank associated with a position.
P155: Immediately prior item in a series of which the subject is a part.
P412: Person's voice type. Expected values: soprano, mezzo-soprano, contralto, countertenor, tenor, baritone, bass (and derivatives).
P2094: Official classification by a regulating body under which the subject (events, teams, participants, or equipment) qualifies for inclusion.
P413: Position or specialism of a player on a team, e.g., Small Forward.
P177: Obstacle (body of water, road, ...) which this bridge crosses over or this tunnel goes under.
P25: Female parent of the subject.
P921: Primary topic of a work.
P40: Subject has object as biological, foster, and/or adoptive child.
P641: Sport in which the subject participates or belongs to.
P463: Organization or club to which the subject belongs.
P206: Sea, lake or river.

P26: The subject has the object as their spouse (husband, wife, partner, etc.).
P361: Object of which the subject is a part.