Exploring Task Difficulty for Few-Shot Relation Extraction

Few-shot relation extraction (FSRE) focuses on recognizing novel relations by learning with merely a handful of annotated instances. Meta-learning has been widely adopted for such a task, which trains on randomly generated few-shot tasks to learn generic data representations. Despite impressive results achieved, existing models still perform suboptimally when handling hard FSRE tasks, where the relations are fine-grained and similar to each other. We argue this is largely because existing models do not distinguish hard tasks from easy ones in the learning process. In this paper, we introduce a novel approach based on contrastive learning that learns better representations by exploiting relation label information. We further design a method that allows the model to adaptively learn how to focus on hard tasks. Experiments on two standard datasets demonstrate the effectiveness of our method.


Introduction
Relation extraction aims to detect the relation between two entities contained in a sentence, which is the cornerstone of various natural language processing (NLP) applications, including knowledge base enrichment (Trisedya et al., 2019), biomedical knowledge discovery (Guo et al., 2020), and question answering. Conventional neural methods (Miwa and Bansal, 2016; Tran et al., 2019) train a deep network through a large amount of labeled data with extensive relations, so that the model can recognize these relations during the test phase. Although impressive performance has been achieved, these methods are difficult to adapt to novel relations that have never been seen in the training process. In contrast, humans can identify new relations with very few examples. It is thus of great interest to enable the model to generalize to new relations with a handful of labeled instances.

Figure 1: An example of an easy few-shot task (top) and a hard few-shot task (bottom). This is a 3-way-1-shot setup: each task involves three relations, and each relation has one supporting instance. Blue and red colors indicate head and tail entities respectively. For the easy task, the relations are very different, and it is easy to classify the query instance. However, due to the subtle differences among the relations in the hard task, it is challenging to correctly predict the true relation.

Inspired by the success of few-shot learning in the computer vision (CV) community (Sung et al., 2018; Satorras and Estrach, 2018), Han et al. (2018) first introduce the task of few-shot relation extraction (FSRE). FSRE requires models to be capable of handling classification of novel relations with scarce labeled instances. A popular framework for few-shot learning is meta-learning (Santoro et al., 2016; Vinyals et al., 2016), which optimizes the model through collections of few-shot tasks sampled from external data whose relations are disjoint from the novel relations, so that the model can learn cross-task knowledge and use it to adapt rapidly to new tasks. A simple yet effective meta-learning algorithm is the prototypical network (Snell et al., 2017), which learns a metric space in which a query instance is classified according to its distance to class prototypes. Recently, many FSRE works (Gao et al., 2019a; Qu et al., 2020; Yang et al., 2020) have followed prototypical networks and achieved remarkable performance. Nonetheless, the difficulty of distinguishing relations varies across tasks, depending on the similarity between relations. As illustrated in Figure 1, there are easy few-shot tasks whose relations are quite different, so that they can be consistently well-classified, and also hard few-shot tasks with subtle inter-relation variations, which are prone to misclassification. Current FSRE methods struggle with hard tasks given limited labeled instances for two main reasons. First, most works focus on general tasks to learn generalized representations and do not effectively model subtle, local differences between relations, which may hinder these models from dealing with hard tasks well. Second, current meta-learning methods treat all training tasks equally, even though the tasks are randomly sampled and differ in difficulty. The generated easy tasks can overwhelm the training process and lead to a degenerate model.
To fill this gap, this paper proposes a Hybrid Contrastive Relation-Prototype (HCRP) approach, which focuses on improving the performance on hard FSRE tasks. Concretely, we first propose a hybrid prototypical network, capable of capturing global and local features to generate informative class prototypes. Next, we present a novel relation-prototype contrastive learning method, which leverages relation descriptions as anchors, pulls the prototype of the same class closer in representation space, and pushes those of different classes away. In this way, the model gains diverse and discriminative prototype representations, which helps distinguish the subtle differences between confusing relations in hard few-shot tasks. Furthermore, we design a task-adaptive training strategy based on focal loss (Lin et al., 2017) to learn more from hard tasks, which allocates dynamic weights to different tasks according to task difficulty. Extensive experiments on two large-scale benchmarks show that our model significantly outperforms the baselines. Ablation and case studies demonstrate the effectiveness of the proposed modules. Our code is available at https://github.com/hanjiale/HCRP.
The contributions of this paper are summarized as follows:
• We present HCRP to explore task difficulty as useful information for FSRE, which boosts the hybrid prototypical network with relation-prototype contrastive learning to capture diverse and discriminative representations.
• We design a novel task adaptive focal loss to focus training on hard tasks, which enables the model to achieve higher robustness and better performance.
• Qualitative and quantitative experiments on two FSRE benchmarks demonstrate the effectiveness of our model.

Few-shot Relation Extraction
Relation extraction is a foundational and important task in NLP and has attracted much recent attention (Chen et al., 2021; Nan et al., 2020, 2021a). Recent works such as Peng et al. (2020) combine prototypical networks with pre-trained language models and achieve impressive results. However, the task difficulty of FSRE has not been explored. In this work, we focus on the hard tasks and propose a hybrid contrastive relation-prototype method to better model subtle variations across different relations.

Contrastive Learning
Contrastive learning (Jaiswal et al., 2021) has gained popularity recently in the CV community. The core idea is to contrast the similarities and dissimilarities between data instances, pulling the positives closer and pushing the negatives away simultaneously. CPC (van den Oord et al., 2018) proposes a universal unsupervised learning approach. MoCo presents a mechanism for building dynamic dictionaries for contrastive learning. SimCLR improves contrastive learning by using a larger batch size and data augmentation. Khosla et al. (2020) extend the self-supervised contrastive approach to the supervised setting. Nan et al. (2021b) propose a dual contrastive learning approach for video grounding. There are also some applications of contrastive learning in the field of NLP. Fang and Xie (2020) employ back translation and MoCo to learn sentence-level representations. Gunel et al. (2021) design supervised contrastive learning for pre-trained language model fine-tuning. Inspired by these works, we propose heterogeneous relation-prototype contrastive learning in a supervised way to obtain more discriminative representations.

Task Definition
We follow a typical few-shot task setting, namely the N-way-K-shot setup, which contains a support set $S$ and a query set $Q$. The support set $S$ includes $N$ novel classes, each with $K$ labeled instances. The query set $Q$ contains the same $N$ classes as $S$. The task is evaluated on the query set $Q$, where the model predicts the relations of the instances in $Q$. In addition, an auxiliary dataset $D_{base}$ is given, which contains abundant base classes, each with a large number of labeled examples. Note that the base classes and novel classes are disjoint. The few-shot learner aims to acquire knowledge from base classes and use the knowledge to recognize novel classes. One popular approach is the meta-learning paradigm (Vinyals et al., 2016), which mimics the few-shot learning setting at the training stage. Specifically, in each training iteration, we randomly select $N$ classes from the base classes, each with $K$ instances, to form a support set $S = \{s_k^i;\ i = 1, \dots, N,\ k = 1, \dots, K\}$. Meanwhile, $R$ instances are sampled from the remaining data of the $N$ classes to construct a query set $Q = \{q_j;\ j = 1, \dots, R\}$. The model is optimized by collections of few-shot tasks sampled from base classes, so that it can rapidly adapt to new tasks.
For an FSRE task, each instance is a triple $(x, e, y)$, where $x$ denotes a natural language sentence, $e = (e_h, e_t)$ indicates a pair of head entity and tail entity, and $y$ is the relation label. The name and description for each relation are also provided as auxiliary support evidence for relation extraction. For example, for a relation with its relation id "P726" in a dataset that we use, we can obtain its name "candidate" and description "person or party that is an option for an office in an election".
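
The episodic sampling described above can be illustrated with a short sketch. It is not taken from the released code: the function name `sample_episode`, the tuple layout, and the toy data are assumptions made for illustration only.

```python
import random
from collections import defaultdict

def sample_episode(dataset, n_way, k_shot, n_query, rng):
    """Sample one N-way-K-shot task from (sentence, entity_pair, relation) tuples.

    Returns (support, query): support maps each sampled relation to K instances;
    query holds R instances drawn from the remaining data of those relations.
    """
    by_relation = defaultdict(list)
    for instance in dataset:
        by_relation[instance[2]].append(instance)

    # Pick N relations, then split each relation's pool into support and rest.
    relations = rng.sample(sorted(by_relation), n_way)
    support, remaining = {}, []
    for rel in relations:
        pool = by_relation[rel][:]
        rng.shuffle(pool)
        support[rel] = pool[:k_shot]
        remaining.extend(pool[k_shot:])
    query = rng.sample(remaining, n_query)
    return support, query

# Hypothetical toy base-class data: 6 relations with 10 instances each.
toy = [(f"sentence {r}-{i}", ("head", "tail"), f"rel{r}")
       for r in range(6) for i in range(10)]
support, query = sample_episode(toy, n_way=3, k_shot=1, n_query=5,
                                rng=random.Random(0))
```

Drawing the query instances only from the data left over after the support split keeps the two sets disjoint within a task, which matches the "remaining data" wording in the task definition.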

Approach
In this section, we present the details of our proposed HCRP approach. The overall learning framework is illustrated in Figure 2. The inputs are N-way-K-shot tasks (sampled from the auxiliary dataset $D_{base}$), where each task contains a support set $S$ and a query set $Q$. Meanwhile, we take the names and descriptions of these $N$ classes (i.e., relations) as inputs as well. HCRP consists of three components. The hybrid prototype learning module generates informative prototypes by capturing global and local features, which can better capture the subtle differences of relations. The relation-prototype contrastive learning component is then used to leverage the relation label information to further enhance the discriminative power of the prototype representations. Finally, a task adaptive focal loss is introduced to encourage the model to focus training on hard tasks.

Hybrid Prototype Learning
We employ BERT (Devlin et al., 2019) as the encoder to obtain contextualized embeddings of query instances $\{Q_j \in \mathbb{R}^{l_{q_j} \times d};\ j = 1, \dots, R\}$ and support instances $\{S_k^i \in \mathbb{R}^{l_{s_k^i} \times d};\ i = 1, \dots, N,\ k = 1, \dots, K\}$, where $l_{q_j}$ and $l_{s_k^i}$ are the sentence lengths of the $j$-th query instance and the $k$-th support instance in class $i$ respectively, and $d$ is the size of the resulting contextualized representations. For each relation, we concatenate the name and description and feed the sequence into the BERT encoder to obtain relation embeddings $\{R^i \in \mathbb{R}^{l_{r^i} \times d};\ i = 1, \dots, N\}$, where $l_{r^i}$ is the length of relation description $i$.

Global Prototypes
For instances in $S$ and $Q$, the global features $\{s_k^i \in \mathbb{R}^{2d};\ i = 1, \dots, N,\ k = 1, \dots, K\}$ and $\{q_j \in \mathbb{R}^{2d};\ j = 1, \dots, R\}$ are obtained by concatenating the hidden states corresponding to the start tokens of the two entity mentions, following Baldini Soares et al. (2019). The global features of relations $\{r^i \in \mathbb{R}^{2d};\ i = 1, \dots, N\}$ are obtained from the hidden states corresponding to the [CLS] token (converted to $2d$ dimensions with a transformation). For each relation $i$, we average the global features of the $K$ supporting instances following the work of Snell et al. (2017), and further add the global feature of the relation to form the global prototype representation.

Local Prototypes
While global prototypes are capable of capturing general data representations, such representations may not readily capture useful local information within specific FSRE tasks. To better handle the hard FSRE tasks with subtle differences among highly similar relations, we further propose local prototypes to highlight key tokens in an instance that are essential to characterize different relations. For relation $i$, we first calculate the local feature of the $k$-th support instance as:

$$\hat{s}_k^i = \sum_{n=1}^{l_{s_k^i}} \alpha_k^n [S_k^i]_n, \qquad \alpha_k = \mathrm{softmax}\big(\mathrm{sum}(S_k^i (R^i)^\top)\big),$$

where $[\cdot]_n$ is the $n$-th row of a matrix, and sum() is an operation that sums all elements for each row in a matrix. Specifically, we allocate weights to different tokens according to their similarities with relation descriptions, and take the weighted sum to form such local features. Similarly, we calculate the similarity between relation embedding $R^i$ and each support instance embedding $S_k^i$ of relation $i$ and obtain $K$ features $\{\hat{r}_k^i;\ k = 1, \dots, K\}$:

$$\hat{r}_k^i = \sum_{n=1}^{l_{r^i}} \beta_k^n [R^i]_n, \qquad \beta_k = \mathrm{softmax}\big(\mathrm{sum}(R^i (S_k^i)^\top)\big).$$

The $K$ features are then averaged to arrive at the final local representation of relation $i$:

$$\hat{r}^i = \frac{1}{K} \sum_{k=1}^{K} \hat{r}_k^i.$$

The local feature of a query instance is calculated by the following formulas.
Finally, we generate the local prototype by averaging the local features of the support set, plus the local feature of the relation.
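
The token-weighting step for a support instance can be sketched in plain Python. This is a minimal illustration under the description in the text (row-wise sum of the token-to-relation similarity matrix, softmax over tokens, weighted sum of token embeddings). The function names and the toy matrices are hypothetical, and real embeddings would come from the BERT encoder.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def local_feature(tokens, reference):
    """Weighted sum of the rows of `tokens`, weighted by similarity to `reference`.

    tokens:    l_a x d matrix (list of rows), e.g. a support instance embedding.
    reference: l_b x d matrix, e.g. the relation description embedding.
    """
    # Entry n is the n-th row sum of tokens @ reference^T.
    scores = [sum(sum(a * b for a, b in zip(row, ref)) for ref in reference)
              for row in tokens]
    weights = softmax(scores)
    dim = len(tokens[0])
    return [sum(w * row[d] for w, row in zip(weights, tokens))
            for d in range(dim)]

# Toy example: only the first support token overlaps with the relation text.
support_tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
relation_tokens = [[2.0, 0.0], [1.0, 0.0]]
feat = local_feature(support_tokens, relation_tokens)
```

In the toy example the first token receives most of the weight, so the resulting feature is dominated by that token's embedding. Swapping the two arguments gives the relation-side feature computed against one support instance.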

Hybrid Prototypes
The model concatenates the global and local prototypes to form hybrid prototype representations:

$$p_h^i = [p_g^i; p_l^i],$$

where $p_g^i$ and $p_l^i$ are the global and local prototypes of relation $i$, and $[;]$ denotes column-wise concatenation. The hybrid representation of a query instance is also obtained by concatenating its global and local features. With the representation of the query and the prototypes of the $N$ relations, the model computes the probability of the relations for the query instance $q_j$ as follows:
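
A minimal sketch of the concatenation and classification step is given here. The scoring function is an assumption: dot-product scores followed by a softmax are used, chosen to match the dot product that the text specifies for the contrastive objective. The helper names `concat` and `classify` are illustrative only.

```python
import math

def concat(global_vec, local_vec):
    """Hybrid representation: global features followed by local features."""
    return list(global_vec) + list(local_vec)

def classify(query_hybrid, prototypes_hybrid):
    """Softmax over dot-product scores between the query and each prototype.

    Dot-product scoring is an assumption of this sketch.
    """
    scores = [sum(q * p for q, p in zip(query_hybrid, proto))
              for proto in prototypes_hybrid]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Three toy class prototypes and one query, each built from 2-d global
# and 2-d local features.
protos = [concat([1.0, 0.0], [1.0, 0.0]),
          concat([0.0, 1.0], [0.0, 1.0]),
          concat([1.0, 1.0], [0.0, 0.0])]
query = concat([0.9, 0.1], [0.8, 0.2])
probs = classify(query, protos)
```

Because the query agrees with the first prototype in both its global and local halves, that class receives the highest probability.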

Relation-Prototype Contrastive Learning
Hard tasks usually involve similar relations whose prototype representations are close, leading to increased challenges in classifying query instances.
To gain a more discriminative prototype representation, we design a novel Relation-Prototype Contrastive Learning (RPCL) method, which leverages the interpretable relation names and descriptions to calibrate the few-shot prototypes. Unlike conventional unsupervised or self-supervised contrastive learning, RPCL utilizes the labels of support instances in each task to perform supervised contrastive learning. Concretely, taking a relation representation as the anchor, the prototype of the same class as the positive, and the prototypes of different classes as negatives, RPCL aims to pull the positive closer to the anchor and push the negatives away. For a specific relation $i$ with its hybrid representation $r_h^i$, the model collects the positive prototype $p_h^i$ and the negative prototypes $\{p_h^n;\ n = 1, \dots, N,\ n \neq i\}$. The goal is to distinguish the positive from the negatives. We use the dot product to measure the similarities between the relation anchor and the selected prototypes.
The contrastive loss is calculated by the following formula:
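
A loss built from one anchor, one positive, and N-1 negatives with dot-product similarity is commonly written in the InfoNCE form. The sketch below assumes that form and averages over the N relation anchors. Both choices are assumptions of this illustration, as is the function name `rpcl_loss`, and the sketch uses no temperature term.

```python
import math

def rpcl_loss(relation_anchors, prototypes):
    """InfoNCE-style loss: anchor i should score prototype i above the others.

    relation_anchors[i] and prototypes[i] belong to the same class.
    """
    total = 0.0
    for i, anchor in enumerate(relation_anchors):
        scores = [sum(a * p for a, p in zip(anchor, proto))
                  for proto in prototypes]
        # Log-sum-exp over the positive and all negatives, computed stably.
        m = max(scores)
        log_denom = m + math.log(sum(math.exp(s - m) for s in scores))
        total += -(scores[i] - log_denom)
    return total / len(relation_anchors)

anchors = [[1.0, 0.0], [0.0, 1.0]]
# Well-separated prototypes give a small loss.
separated = rpcl_loss(anchors, [[3.0, 0.0], [0.0, 3.0]])
# Identical prototypes cannot be told apart, so the loss is log(2).
confusable = rpcl_loss(anchors, [[1.0, 1.0], [1.0, 1.0]])
```

The two toy cases show the intended behavior: prototypes that collapse onto each other are penalized, while prototypes aligned with their own relation anchor are not.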

Task Adaptive Focal Loss
We design a task adaptive focal loss to learn more from hard tasks, which is a modified cross entropy (CE) loss. The CE loss can be written as follows:

$$L_{CE} = -\log(z_y),$$

where $y$ is the class label, and $z_y$ is the estimated probability for the class $y$. Focal loss (Lin et al., 2017) adds a modulating factor to the CE loss:

$$L_F = -(1 - z_y)^{\gamma} \log(z_y),$$

where $\gamma \ge 0$ adjusts the rate at which easy examples are down-weighted. For an easy example, $z_y$ is almost 1, the factor goes to 0, and the loss for easy examples is down-weighted, which in turn increases the importance of correcting misclassified examples, which are potentially harder. We employ focal loss instead of cross entropy loss to focus more on hard query examples. Moreover, to focus more on hard tasks, we design a novel task adaptive focal loss, which introduces dynamic task-level weights. Specifically, for an N-way-K-shot task, the model calculates the class-wise similarity to estimate task difficulty. The higher the inter-class similarity, the harder the task. We first concatenate the hybrid features of prototype and relation to represent each class, $c_i = [r_h^i; p_h^i]$, and then define the task similarity matrix $S_\tau \in \mathbb{R}^{N \times N}$ as

$$S_\tau^{ij} = \frac{c_i \cdot c_j}{\|c_i\| \, \|c_j\|}$$

for $i, j \in \{1, \dots, N\}$, where $\|\cdot\|$ is the Euclidean norm. The task similarity scalar is obtained by the following formula:

$$s_\tau = \frac{\|S_\tau\|_F}{\sum_{t=1}^{T} \|S_t\|_F},$$

where $\|\cdot\|_F$ is the Frobenius norm, and $T$ is the number of tasks in a mini-batch. The scalar represents the degree of difficulty of the task. We add the task-level scalar to the focal loss, which not only focuses on the hard examples at the instance level, but also focuses more on the hard tasks at the task level. Formally, the task adaptive focal loss is defined as follows:

$$L_{TF} = -s_\tau (1 - z_y)^{\gamma} \log(z_y).$$

The final objective function of our model is defined as $L = L_{TF} + \lambda \times L_C$, where $\lambda$ is a hyperparameter to balance the two terms.

Datasets
FewRel 1.0 and FewRel 2.0 are large-scale few-shot relation extraction datasets, consisting of 100 relations, each with 700 labeled instances. The average number of tokens in each sentence instance is 24.99, and there are 124,577 unique tokens in total. Our experiments follow the splits used in official benchmarks, which split the dataset into 64 base classes for training, 16 classes for validation, and 20 novel classes for testing. FewRel 1.0 is trained and tested on the same Wikipedia domain.
In addition, the name and description of each relation are also given, providing additional interpretability for each relation. FewRel 2.0 with domain adaptation setting is trained on Wikipedia domain but tested on a different biomedical domain. Only the names of relation labels are given but descriptions are not available, which makes the task more challenging.
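
Returning to the task adaptive focal loss, the following sketch shows one way the task-level weights and the weighted focal loss could be computed. It rests on stated assumptions: cosine similarity between class vectors, Frobenius norms normalized over the mini-batch, and a multiplicative task weight. The function names and the default `gamma=1.0` are illustrative and not taken from the paper's hyperparameter table.

```python
import math

def cosine_similarity_matrix(class_vectors):
    """N x N matrix of cosine similarities between class vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in class_vectors]
    return [[sum(a * b for a, b in zip(u, v)) / (nu * nv)
             for v, nv in zip(class_vectors, norms)]
            for u, nu in zip(class_vectors, norms)]

def frobenius_norm(matrix):
    return math.sqrt(sum(x * x for row in matrix for x in row))

def task_weights(batch_class_vectors):
    """One weight per task: Frobenius norm of its similarity matrix,
    normalized so the weights in a mini-batch sum to 1."""
    norms = [frobenius_norm(cosine_similarity_matrix(cv))
             for cv in batch_class_vectors]
    total = sum(norms)
    return [n / total for n in norms]

def task_adaptive_focal_loss(z_y, task_weight, gamma=1.0):
    """Focal loss -(1 - z_y)^gamma * log(z_y), scaled by the task weight."""
    return -task_weight * (1.0 - z_y) ** gamma * math.log(z_y)

# Toy mini-batch of two 3-way tasks: near-parallel class vectors (hard)
# versus orthogonal class vectors (easy).
hard_task = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.8, 0.2, 0.0]]
easy_task = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
weights = task_weights([hard_task, easy_task])
```

With these toy inputs the task whose classes are nearly parallel receives the larger weight, and a confidently correct prediction contributes far less loss than an uncertain one.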

Evaluation
Consistent with the official evaluation scripts, we evaluate our model by randomly sampling 10,000 tasks from validation data. The performance of the model is evaluated as the averaged accuracy on the query set of multiple N-way-K-shot tasks. Following previous work (Han et al., 2018; Gao et al., 2019b), we choose N to be 5 and 10, and K to be 1 and 5 to form 4 scenarios. We report the final test accuracy by submitting the prediction of our model to the FewRel leaderboard.
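
As a small illustration of the metric, the sketch below averages per-task query accuracy over a set of sampled tasks. Averaging per task rather than pooling all query instances is an assumption of this sketch; the two coincide when every task has the same number of query instances. The function name is hypothetical.

```python
def mean_task_accuracy(task_results):
    """task_results: list of (predictions, gold_labels) pairs, one per task."""
    accuracies = []
    for preds, gold in task_results:
        correct = sum(p == g for p, g in zip(preds, gold))
        accuracies.append(correct / len(gold))
    return sum(accuracies) / len(accuracies)

# Two toy tasks: one fully correct, one half correct.
acc = mean_task_accuracy([(["a", "b"], ["a", "b"]),
                          (["a", "a"], ["a", "b"])])
```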

Implementation Details
The approach is implemented with PyTorch (Paszke et al., 2019) and trained on 1 Tesla P40 GPU. We adopt the Transformers library of Huggingface (Wolf et al., 2020) and take the uncased model of BERT base as the encoder for fair comparison. The AdamW optimizer (Loshchilov and Hutter, 2019) is applied to minimize the loss. We manually adjust the hyperparameters, which are listed in Table 1, based on the performance on the validation data. Specifically, we use the same hyperparameter values for the two datasets except for λ. For FewRel 1.0, we concatenate the name and description of each relation as inputs, and λ is set to 1. For FewRel 2.0, we only input the relation names, and λ is adjusted to 2.5. The number of parameters in our model is 110 million. The average runtime of training and evaluation under the 10-way-1-shot setting is 13.35 hours and 1.25 hours, respectively.

Comparison to Baselines
We compare our model with the following baseline methods: 1) Proto (Snell et al., 2017), the algorithm of prototypical networks. We employ CNN and BERT as encoders separately (Proto-CNN and Proto-BERT), and combine adversarial training

Qu et al. (2020), and are reported by Peng et al. (2020). Our method introduces additional relation label name and description information, which is the same as TD-Proto. Other baseline methods also use different external knowledge. See Section 5.2.1 for details.
Here we report the experimental results produced by the work of Peng et al. (2020), which is based on BERT base as the encoder for fair comparison. Table 2 presents the experimental results on the FewRel 1.0 validation set and test set. As shown in the upper part of Table 2, our method outperforms the strong baseline models by a large margin, especially in 1-shot scenarios. Specifically, we improve accuracy on 5-way-1-shot and 10-way-1-shot tasks by 3.46 points and 5.86 points respectively compared to the second best method, demonstrating superior generalization ability. Our method also achieves the best performance on FewRel 2.0, as shown in Table 3, which demonstrates the stability and effectiveness of our model. The performance gain mainly comes from three aspects. (1) The hybrid prototypical networks capture rich and subtle features. (2) The relation-prototype contrastive learning leverages the relation text to further gain discriminative prototypes. (3) The task-adaptive focal loss forces the model to learn more from hard few-shot tasks. In addition, we evaluate our approach based on the model CP, where the BERT encoder is initialized with their pre-trained parameters. The lower part of Table 2 shows that our approach achieves a consistent performance boost when using their pre-trained model, which demonstrates the effectiveness of our method, and also indicates the importance of good representations for few-shot tasks.

Performance on Hard Few-shot Tasks
To further illustrate the effectiveness of the developed method, especially for hard FSRE tasks, we evaluate the models on the FewRel 1.0 validation set with three different 3-way-1-shot settings, as shown in Table 4. Random is the general evaluation setting, which samples 10,000 test tasks randomly from validation relations, as detailed in section 5.1.2. Under the Easy setting, all evaluated tasks are easy: we fix the 3 relations in each task as 3 very different relations, which are "crosses", "constellation", and "military rank". Different tasks have different instances but the same relations. Similarly, under the Hard setting, we pick 3 similar relations, which are "mother", "child", and "spouse", and evaluate the performance of the models. As we can see, the baselines achieve good performance under the random and easy settings. However, their accuracy drops significantly under the hard setting, which illustrates that hard few-shot tasks are extremely challenging. HCRP achieves the best accuracy, especially under the hard setting, showing that our model can effectively handle hard few-shot tasks.

Analysis of Hybrid Prototype Learning
This section discusses the effect of hybrid prototype learning. As shown in Table 7, we conduct an ablation study to verify the effectiveness of hybrid prototypes. Removing either the local prototypes (Model 2) or the global prototypes (Model 3) decreases the performance, indicating that both prototypes are essential to represent relations. Furthermore, we present a 3-way-1-shot task sampled from the FewRel 1.0 validation set, as shown in Table 5.
The task can be regarded as a hard task because the three relations are highly similar. Our model correctly classifies the query instance as "mother". We visualize the similarity between the query instance and different relation prototypes, where different columns represent different models. Proto-BERT and HCRP without global prototypes tend to classify the query into the wrong relation "child". HCRP without local prototypes can correctly predict the relation "mother". HCRP further correctly predicts with a higher degree of confidence, which proves that hybrid prototypes can better model subtle inter-relation variations for hard tasks.

Analysis of Relation-Prototype Contrastive Learning
To demonstrate the effectiveness of relation-prototype contrastive learning (RPCL), we first conduct an ablation study, shown as Model 4 in Table 7. There is a severe decline in performance when the relation-prototype contrastive learning is removed in the 5-way-1-shot and 10-way-1-shot settings. As Figure 3 depicts, we visualize the learned embedding spaces with t-SNE (Maaten and Hinton, 2008) to intuitively characterize the resulting representations for similar relations. Specifically, we pick two similar relations "mother" and "child" from the FewRel 1.0 validation set, and randomly sample 100 instances for each relation.
We can see that embeddings trained with RPCL are clearly separated, which makes classification easier, while those trained without RPCL are lumped together. By using the relation-prototype contrastive learning, which regards the relation text as anchors and hybrid prototypes as positives and negatives, our model arrives at more discriminative representations, especially for hard tasks.

Analysis of Task-Adaptive Focal Loss
As shown in Table 7, we compare our designed task adaptive focal loss with cross entropy (CE) loss (Model 5), cross entropy loss with task weights (Model 6), and focal loss (Model 7). Comparing Model 5 and Model 7, we observe that focal loss achieves higher accuracy than CE loss, and adding the task adaptive weight (Model 1) further improves the performance. In addition, experiments show that the CE loss with task weights also improves performance compared to CE loss. Table 6 depicts a case study to show the task-adaptive weights. Specifically, we give a sampled mini-batch of tasks from the FewRel 1.0 training set, where the batch size is 4. Each task has 5 relations, which are also listed. The model allocates weights for each task according to the similarity of the support set. For example, the relations in task 1 are similar to each other, mainly describing relations in the art field, so the model assigns a relatively higher task weight. However, the relations in task 2 are very different, so the model allocates a lower weight. The ablation experiments and case study show that our proposed loss pays more attention to hard tasks in the training process, thus improving the performance.

Table 6: An example of a 4-task batch and the task-adaptive weights, where each task has 5 relations (N=5).
Tasks   Relations                                                                               Weights
task 1  performer, director, characters, composer, publisher                                    0.29
task 2  performer, has part, location, father, platform, religion                               0.19
task 3  located on terrain feature, location of formation, country, work location, location     0.30
task 4  has part, instrument, operating system, military branch, successful candidate           0.22

Conclusion
This paper focuses on hard few-shot relation extraction tasks and proposes a hybrid contrastive relation-prototype approach. The approach uses a hybrid prototype learning method that generates informative prototypes to model small inter-relation variations. A relation-prototype contrastive learning approach is also proposed. Using relation information as anchors, it pulls instances of the same relation class closer in the representation space while pushing dissimilar ones apart. This process further enables the model to acquire more discriminative representations. In addition, we introduce a task adaptive focal loss to focus more on hard tasks during training to achieve better performance. Experiments have demonstrated the effectiveness of our proposed model. There are multiple avenues for future work. One possible direction is to design a better mechanism for selecting tasks in the training process rather than using random sampling.

Figure 3: t-SNE plots of instance embeddings trained with or without (w/o) relation-prototype contrastive learning. Two easy-to-confuse relations ("mother" and "child") with 100 samples are adopted. Best viewed in color.