HyperNetwork-based Decoupling to Improve Model Generalization for Few-Shot Relation Extraction

Few-shot relation extraction (FSRE) aims to train a model that can deal with new relations using only a few labeled examples. Most existing studies employ Prototypical Networks for FSRE, which tend to overfit the relation classes in the training set and cannot generalize well to unseen relations. By investigating the class separation of an FSRE model, we find that its upper layers are prone to learning relation-specific knowledge. Therefore, in this paper, we propose a HyperNetwork-based Decoupling approach to improve the generalization of FSRE models. Specifically, our model consists of an encoder, a network generator (for producing relation classifiers) and the generated-then-finetuned classifiers for every N-way-K-shot episode. Meanwhile, we design a two-step training strategy along with a class-agnostic aligner, by which the generated classifiers focus on acquiring relation-specific knowledge and the encoder is encouraged to learn more general relation knowledge. In this way, the roles of the upper and lower layers in our FSRE model are explicitly decoupled, thus enhancing its generalizing capability during testing. Experiments on two public datasets demonstrate the effectiveness of our method. Our source code is available.


Introduction
Relation extraction is a fundamental task in information extraction that aims to identify the semantic relations between two entities in sentences (Zeng et al., 2020; Han et al., 2020). Typically, conventional approaches are highly dependent on a large amount of labeled data and cannot deal with unseen relations well. Therefore, recent studies (Han et al., 2018; Gao et al., 2019c) turn to Few-Shot Relation Extraction (FSRE), which aims to train a model to classify instances into novel relations with only a handful of training examples.

Current FSRE studies are usually conducted in an N-way-K-shot setting. In this setting, the model is trained on a series of episodes, each of which has N relation classes. In each episode, every relation class includes K support instances (as the support set) and Q query instances (as the query set). Under this setting, most existing studies (Gao et al., 2019a; Yang et al., 2020; Zhang and Lu, 2022) employ Prototypical Networks (Snell et al., 2017) for FSRE. A Prototypical Network is designed to learn a suitable prototypical vector for each relation using the instances from the support set. These vectors are then used to predict the relation of the query instances. Moreover, to further improve performance, some recent studies (Han et al., 2021a; Liu et al., 2022) introduce relation descriptions and adopt contrastive learning in Prototypical Networks. Although these methods have achieved improvements, their models still tend to overfit the relation classes appearing in the training set, and exhibit an unsatisfactory generalizing capability on unseen relations.
To further explore what restricts the generalizing capability of Prototypical Network models, inspired by Kornblith et al. (2021), we conduct a preliminary study to investigate the internal representations within different model layers. Specifically, we measure the layer-wise class separation between representations of instances belonging to different relation classes during training. We observe that the representations exhibit a higher class separation in the upper layers of the model than in the lower layers. This intuitively shows that the upper layers are likely to learn relation-specific knowledge and overfit the relation classes in the training set, while the lower layers mainly acquire more general relation knowledge. We speculate that this over-adaptation of the upper layers limits the generalizing capability of the model.
Based on the above finding, in this paper, we propose a HyperNetwork-based Decoupling approach for FSRE. As illustrated in Figure 1, our model consists of three components: an encoder at the bottom, a network generator in the middle (for generating the initialized relation classifiers) and the generated relation classifiers at the top. In each episode, the network generator takes in the encoder output and generates an initialized relation classifier for the current episode. Subsequently, the generated classifier is fine-tuned, so that the model can quickly learn relation-specific knowledge and adapt to new relations. Meanwhile, the encoder and generator are encouraged to learn more general relation knowledge from the training set. In this way, we explicitly decouple the roles of the lower and upper components in our FSRE model.
To this end, our model update procedure consists of two steps: the fine-tuning of the classifiers and the update of the encoder together with the generator. For classifier fine-tuning, we first use the generator to produce an initialized classifier for every episode to discriminate the relation classes within it. Then, we use the corresponding support set to optimize this generated classifier, which endows the classifier with relation-specific knowledge. Consequently, our model can effectively adapt to new relations through such a produce-then-finetune process. Moreover, the update of the encoder and generator aims at learning more general knowledge not specific to a certain episode, which is crucial for an FSRE model to avoid overfitting. To better learn such knowledge, we train the model across different episodes so that it is not biased towards a narrow set of relations. Particularly, we treat a collection of M sampled episodes as an updating interval for the encoder and generator. In each updating interval, the encoder and generator are jointly optimized by maximizing the overall performance of all trained classifiers on the query sets of these M episodes.
During testing, when an episode contains unseen relation types that do not exist in the training set, we use the generator to produce a fresh classifier, which is then fine-tuned with samples from the support set. Finally, only the encoder and the fine-tuned classifier are used to predict the relation of each query instance. By doing so, the model can quickly adapt to the new episode.
However, each updating interval contains only a handful of episodes, so our model is still prone to biasing towards the relation classes in these episodes. To address this issue, we additionally design a class-agnostic aligner, which undergoes all episodes in the training set throughout the training process. With the help of the aligner, our encoder is able to learn more global, general relation knowledge, further alleviating the overfitting to specific relations.
Experimental results on two public benchmarks show that our model consistently outperforms all competitive baselines. Extensive ablation and case studies demonstrate the effectiveness of the various components in our model.

Preliminary Study
To better understand the limited generalizing capability of current FSRE models, inspired by Kornblith et al. (2021), we take a BERT-based Prototypical Network model and investigate the layer-wise class separation between representations of instances belonging to different relations. Specifically, the class separation reflects the dispersion of representations belonging to the same class relative to the overall dispersion of all embeddings.
In each N-way-K-shot training episode, we measure the class separation in every layer of the model following the metrics in Kornblith et al. (2021). Particularly, we denote the average within-class cosine distance and the overall average cosine distance in the l-th layer as $d^l_{in}$ and $d^l_{overall}$, respectively. They are calculated using the following equations:

$$d^l_{in} = \frac{1}{N}\sum_{n=1}^{N}\frac{1}{K^2}\sum_{i=1}^{K}\sum_{j=1}^{K}\Big(1 - \cos\big(h^l_{n,i},\, h^l_{n,j}\big)\Big),$$

$$d^l_{overall} = \frac{1}{N^2 K^2}\sum_{n=1}^{N}\sum_{m=1}^{N}\sum_{i=1}^{K}\sum_{j=1}^{K}\Big(1 - \cos\big(h^l_{n,i},\, h^l_{m,j}\big)\Big),$$

where $h^l_{n,i}$ is the embedding of the i-th instance (from the n-th relation class) in the l-th layer, and $d^l_{in}/d^l_{overall}$ represents the relative within-class variance, a lower value of which corresponds to a higher degree of class separation in the l-th layer. Thus, the degree of class separation can be defined as

$$Sep^l = 1 - \frac{d^l_{in}}{d^l_{overall}}.$$

As explored in Kornblith et al. (2021), higher class separation means the model tends to overfit the classes in the training set and is less transferable to unseen classes. Hence, we depict the layer-wise change of class separation with respect to the number of training steps in Figure 2. We can observe that $Sep^l$ in upper layers (e.g., Layer-12 and Layer-10) quickly increases at early steps and consistently stays much higher than that in lower layers (e.g., Layer-1 and Layer-2). Therefore, Figure 2 shows that the upper layers of the Prototypical Network model are likely to learn relation-specific knowledge and overfit the relation classes during training.
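To make the metric above concrete, the following minimal sketch (an illustration rather than the authors' implementation; it assumes the per-layer instance embeddings have already been extracted from the encoder) computes $Sep^l$ for one layer.

```python
import torch

def class_separation(h, labels):
    """Compute class separation Sep = 1 - d_in / d_overall for one layer.

    h:      (num_instances, dim) tensor of instance embeddings at layer l.
    labels: (num_instances,) tensor of relation-class ids.
    d_in is the average within-class cosine distance, d_overall the average
    cosine distance over all pairs of instances.
    """
    h = torch.nn.functional.normalize(h, dim=-1)
    cos_dist = 1.0 - h @ h.t()                          # pairwise cosine distances
    d_overall = cos_dist.mean()

    within = []
    for c in labels.unique():
        mask = labels == c
        within.append(cos_dist[mask][:, mask].mean())   # average distance within class c
    d_in = torch.stack(within).mean()

    return (1.0 - d_in / d_overall).item()
```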

Methodology
In this section, we introduce our HyperNetwork-based Decoupling approach for FSRE. Specifically, we first provide the problem formulation of FSRE and give a detailed description of our model architecture. Then, we elaborate on the two-step training strategy of our model. Moreover, we additionally design a class-agnostic aligner to further enhance the generalizing capability of our model. Finally, we describe how our model is used for testing in an N-way-K-shot setting.

Problem Formulation
Current FSRE studies are usually conducted in an N-way-K-shot setting. In this setting, the model is trained and tested on a series of episodes, each of which is sampled from the training, validation or test set, whose relation classes are mutually exclusive (see Figure 3). Each episode contains N relation classes, and every class includes K support instances (as the support set) and Q query instances (as the query set).
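For illustration, a minimal sketch of N-way-K-shot episode sampling is given below; the dataset structure (a mapping from relation names to labeled instances) and function names are assumptions, not the authors' code.

```python
import random

def sample_episode(data, n_way=5, k_shot=1, q_query=1):
    """Sample one N-way-K-shot episode.

    data: dict mapping relation name -> list of labeled instances.
    Returns a support set of N*K instances and a query set of N*Q instances,
    each as (instance, episode-local label) pairs.
    """
    relations = random.sample(list(data.keys()), n_way)
    support, query = [], []
    for label, rel in enumerate(relations):
        instances = random.sample(data[rel], k_shot + q_query)
        support += [(x, label) for x in instances[:k_shot]]
        query += [(x, label) for x in instances[k_shot:]]
    return support, query
```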

Model Architecture
As shown in Figure 4, apart from the aligner, our model consists of three components: an encoder at the bottom, a network generator in the middle and the generated classifiers at the top.
Encoder. Following existing studies (Soares et al., 2019; Han et al., 2021a; Liu et al., 2022), we use BERT (Devlin et al., 2019) as the encoder E of our model to encode both instances and relation descriptions. Specifically, for each instance, we first use four special tokens "[P1]", "[/P1]", "[P2]" and "[/P2]" to mark the start and end positions of the head and tail entities, respectively. Then, the encoder is employed to obtain a contextual representation for each token. Finally, we concatenate the representations of the tokens at the start positions of the head and tail entities (i.e., [P1] and [P2]) to form the representation of this instance: $x = [h_{[P1]}; h_{[P2]}]$. Moreover, as in (Liu et al., 2022; Li and Qian, 2022), we also encode relation descriptions and produce the corresponding relation representations with the same dimensions as instance representations. Specifically, the representation of a relation description is obtained by concatenating the representation of the [CLS] token and the average representation of the remaining tokens: $r = [h_{CLS}; h_{avg}]$.

Network Generator and Classifiers. Given the relation representations of an episode, the network generator produces the parameters of an initialized N-way relation classifier (i.e., $W_{c_m}$ and $b_{c_m}$) for that episode. Unlike conventional HyperNetworks (Ha et al., 2016), where the generated module is directly used to make predictions, our produced classifiers are further fine-tuned before actually being used.
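A minimal sketch of the entity-marker encoding described above follows; the Hugging Face checkpoint name and the way the marker tokens are registered are assumptions made for illustration.

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
# Assumption: the four marker tokens are registered as additional special tokens.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[P1]", "[/P1]", "[P2]", "[/P2]"]}
)
encoder = BertModel.from_pretrained("bert-base-uncased")
encoder.resize_token_embeddings(len(tokenizer))

def encode_instance(marked_sentence):
    """marked_sentence already contains [P1]...[/P1] and [P2]...[/P2] around the entities."""
    inputs = tokenizer(marked_sentence, return_tensors="pt")
    hidden = encoder(**inputs).last_hidden_state[0]            # (seq_len, dim)
    ids = inputs["input_ids"][0]
    p1 = (ids == tokenizer.convert_tokens_to_ids("[P1]")).nonzero()[0, 0]
    p2 = (ids == tokenizer.convert_tokens_to_ids("[P2]")).nonzero()[0, 0]
    return torch.cat([hidden[p1], hidden[p2]], dim=-1)         # instance representation x
```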

Two-step Training
Based on the above model architecture, we expect the generated classifiers to acquire relation-specific knowledge while the encoder and generator learn more general relation knowledge. To this end, we separate the training of our model into two steps: the fine-tuning of the generated classifiers and the updating of the encoder and generator. Suppose we first sample M episodes from the training set and use the generator to produce an initialized relation classifier for each episode, i.e., $C_1, \cdots, C_M$.

Step 1: Classifier Fine-tuning. Let us denote the generated classifier in the m-th episode as $C_m$. The classifier is fine-tuned using the corresponding support set (see ② in Figure 4(a)). The training objective of $C_m$ is formalized as the following classification loss:

$$\mathcal{L}_{c_m} = -\sum_{(x_i, y_i)\in S_m} \log p\big(y_i \mid x_i;\, \theta_{C_m}\big),$$

where $S_m$ refers to the support set of the m-th episode, and $\theta_{C_m}$ is the parameter of $C_m$ produced by our generator (i.e., $W_{c_m}$ and $b_{c_m}$). In this way, the relation-specific knowledge in the m-th episode can be effectively learned by the classifier, resulting in a set of M fine-tuned classifiers $\tilde{C}_1, \cdots, \tilde{C}_M$.

Step 2: Encoder & Generator Updating. After obtaining all fine-tuned classifiers (i.e., $\tilde{C}_1, \cdots, \tilde{C}_M$), we optimize our encoder and generator using the following objective:

$$\min_{\theta_E,\, \theta_G}\; \sum_{m=1}^{M} \mathcal{L}_{c_m}\big(Q_m\big), \qquad (3)$$

where $Q_m$ denotes the query set of the m-th episode, $\mathcal{L}_{c_m}$ is the classification loss computed by $\tilde{C}_m$, and $\theta_E$ and $\theta_G$ represent the parameters of the encoder and generator, respectively. Since all fine-tuned classifiers have been updated in the first step, this parameter deviation prevents the gradient from being directly backpropagated to the generator. To solve this, we adopt a trick that rewrites the classifiers in an equivalent form, i.e., the parameter $\tilde{\theta}_{C_m}$ of the trained classifier $\tilde{C}_m$ is denoted as $\tilde{\theta}_{C_m} = \theta_{C_m} + (\tilde{\theta}_{C_m} - \theta_{C_m})$, where the deviation term in parentheses is treated as a constant. With the above loss, the encoder and generator are optimized to enhance the overall classification performance across multiple episodes. By doing so, they are encouraged to learn more general relation knowledge not specific to certain relations.
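The produce-then-finetune procedure and the gradient trick can be sketched as follows. This is only a sketch under assumptions: the generator architecture (a linear map per relation representation), the optimizer and the number of fine-tuning steps are illustrative choices, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassifierGenerator(nn.Module):
    """HyperNetwork mapping N relation representations to an N-way linear classifier."""
    def __init__(self, rep_dim, inst_dim):
        super().__init__()
        self.to_weight = nn.Linear(rep_dim, inst_dim)   # one weight row per relation
        self.to_bias = nn.Linear(rep_dim, 1)

    def forward(self, rel_reps):                         # rel_reps: (N, rep_dim)
        w = self.to_weight(rel_reps)                     # (N, inst_dim)
        b = self.to_bias(rel_reps).squeeze(-1)           # (N,)
        return w, b

def finetune_classifier(w, b, support_x, support_y, steps=10, lr=1e-2):
    """Step 1: fine-tune the generated classifier on the support set (encoder/generator frozen)."""
    w_ft = w.detach().clone().requires_grad_()
    b_ft = b.detach().clone().requires_grad_()
    opt = torch.optim.SGD([w_ft, b_ft], lr=lr)
    for _ in range(steps):
        loss = F.cross_entropy(support_x @ w_ft.t() + b_ft, support_y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w_ft.detach(), b_ft.detach()

def step2_loss(w, b, w_ft, b_ft, query_x, query_y):
    """Step 2: express fine-tuned parameters as generated parameters plus a constant deviation,
    so the query loss backpropagates through w, b into the generator and encoder."""
    w_eq = w + (w_ft - w).detach()
    b_eq = b + (b_ft - b).detach()
    return F.cross_entropy(query_x @ w_eq.t() + b_eq, query_y)
```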

Class-Agnostic Aligner
Although the above two-step training strategy can already enhance the generalizing capability of our model, only a limited number of episodes are involved in each update of the encoder and generator, as in Equation 3. Therefore, the encoder may still overfit the currently sampled M episodes. To avoid this problem, we additionally design a global class-agnostic aligner A, which is an MLP layer. Unlike the classifiers, the parameters of A are not freshly produced by the generator in every episode; they are randomly initialized once and then continually optimized across all episodes during the whole training. In this way, the aligner can further encourage the encoder to learn more global knowledge, alleviating its overfitting to the relations within the currently sampled M episodes.
Specifically, we first use our encoder E and its exponential moving average (EMA) counterpart $\bar{E}$ to obtain the representations of query and support instances in each of the sampled M episodes. Then, along with the fine-tuning of the generated classifiers, the aligner A is simultaneously trained using a contrastive loss as follows:

$$\mathcal{L}_{align} = -\sum_{x_i \in S_m} \log \frac{\exp\big(\mathrm{sim}(A(h_i),\, A(h_{i'}))\big)}{\sum_{x_j \in Q_m} \exp\big(\mathrm{sim}(A(h_i),\, A(h_j))\big)},$$

where $h$ stands for instance representations obtained from E or its EMA counterpart $\bar{E}$, $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity, and $h_{i'}$ refers to the representation of the positive query instance $x_{i'}$ belonging to the same class as the support instance $x_i$. Thus, corresponding to Section 3.3, the overall objective of the first-step training can be written as

$$\mathcal{L}_{step1} = \mathcal{L}_{c_m} + \mathcal{L}_{align}.$$

For the second-step training, the aligner further encourages the encoder to learn general relation knowledge using a similar contrastive loss:

$$\mathcal{L}'_{align} = -\sum_{x_i \in Q_m} \log \frac{\exp\big(\mathrm{sim}(A(h_i),\, A(h_{i'}))\big)}{\sum_{x_j \in S_m} \exp\big(\mathrm{sim}(A(h_i),\, A(h_j))\big)}.$$
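A rough PyTorch-style sketch of the aligner training is shown below. The InfoNCE-style formulation, the temperature and the EMA decay value are assumptions used for illustration, not the paper's exact hyperparameters.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(ema_encoder, encoder, decay=0.999):
    """Exponential moving average of the encoder parameters (decay value is illustrative)."""
    for p_ema, p in zip(ema_encoder.parameters(), encoder.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)

def aligner_loss(aligner, anchor_reps, anchor_labels, cand_reps, cand_labels, tau=0.1):
    """InfoNCE-style loss: pull instances of the same relation together in the aligner space.

    anchor_reps / cand_reps: (A, dim) / (B, dim) instance representations, e.g. support
    instances as anchors and query instances (from the EMA encoder) as candidates in
    step 1, and the reverse in step 2; tau is an assumed temperature.
    """
    a = F.normalize(aligner(anchor_reps), dim=-1)
    c = F.normalize(aligner(cand_reps), dim=-1)
    logits = a @ c.t() / tau                                   # (A, B) scaled similarities
    pos = anchor_labels.unsqueeze(1) == cand_labels.unsqueeze(0)
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # average log-probability of the positives for each anchor
    return -(log_prob * pos).sum(dim=1).div(pos.sum(dim=1).clamp(min=1)).mean()
```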
Thereby, the overall objective of the second-step training is formulated as

$$\mathcal{L}_{step2} = \sum_{m=1}^{M}\mathcal{L}_{c_m}(Q_m) + \beta \sum_{m=1}^{M}\mathcal{L}'_{align}, \qquad (7)$$

where, since both loss terms update the encoder, we balance them using a hyperparameter β.

Inference
During testing, for a new episode from the test set, we first use the generator to produce a freshly initialized classifier. Then, the classifier is fine-tuned using the support set to quickly learn the relation-specific knowledge within the current episode. Finally, only the encoder and the fine-tuned classifier are used to predict the relation of each query instance.
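The whole test-time procedure can be summarized by the following sketch. Here, encode_relations and encode_instances are hypothetical helpers wrapping the (frozen) encoder, and finetune_classifier refers to the fine-tuning sketch given earlier; none of these names come from the paper.

```python
import torch

def predict_episode(generator, finetune_classifier, encode_relations, encode_instances,
                    support, query, rel_descriptions):
    """Inference for one unseen N-way-K-shot episode."""
    rel_reps = encode_relations(rel_descriptions)        # representation r for each of the N relations
    w, b = generator(rel_reps)                           # freshly generated N-way classifier
    sx, sy = encode_instances(support)                   # support representations and labels
    w_ft, b_ft = finetune_classifier(w, b, sx, sy)       # adapt the classifier on the support set
    qx, _ = encode_instances(query)
    return (qx @ w_ft.t() + b_ft).argmax(dim=-1)         # predicted relation index per query instance
```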

Datasets
Our model is evaluated on two commonly-used datasets:
• FewRel 1.0 (Han et al., 2018). It is a large-scale human-annotated FSRE dataset constructed from Wikipedia articles, containing 100 relations with 700 instances each. The training, validation and test sets contain 64, 16 and 20 relations, respectively.
• FewRel 2.0 (Gao et al., 2019c). To evaluate the generalizing capability of our model, we also conduct experiments on FewRel 2.0, whose training set is the same as that of FewRel 1.0. In particular, the test set of FewRel 2.0 is constructed from the biomedical domain (with no overlap with the relations in the training set) and contains 25 relations, each with 100 instances.
In each of these two datasets, the training, validation and test sets contain mutually exclusive relation class sets.

Settings
In previous FSRE studies, there are mainly two types of model settings, differing in whether the model encoder is additionally pre-trained on a noisy relation extraction (RE) corpus provided by Zhang and Lu (2022). In our experiments, as in previous work, we use BERT (Devlin et al., 2019) to initialize the encoder of our model. For optimization, AdamW (Loshchilov and Hutter, 2019) is used with a linear warmup (Goyal et al., 2017) for the first 10% of steps. The learning rates of the encoder and generator are set to 1e-5, while those of the aligner A and the generated classifiers are set to 4e-5 and 1e-2, respectively. Following previous work, we set the number of sampled episodes M at every training step to 4. The whole training takes about ten hours on a single 24 GB NVIDIA RTX 3090 GPU. For testing, we use classification accuracy as the performance metric.
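For illustration, the optimizer setup described above could be configured as in the following sketch. The stand-in module definitions and the total number of training steps are assumptions; only the learning rates and the 10% warmup come from the settings reported here.

```python
import torch
import torch.nn as nn
from transformers import BertModel, get_linear_schedule_with_warmup

# Placeholder modules standing in for the actual components of the model.
encoder = BertModel.from_pretrained("bert-base-uncased")
generator = nn.Linear(2 * 768, 768)           # stand-in for the network generator
aligner = nn.Linear(2 * 768, 256)             # stand-in for the class-agnostic aligner A

optimizer = torch.optim.AdamW([
    {"params": encoder.parameters(), "lr": 1e-5},
    {"params": generator.parameters(), "lr": 1e-5},
    {"params": aligner.parameters(), "lr": 4e-5},
    # The generated classifiers use lr 1e-2 inside their own per-episode fine-tuning loop.
])

total_steps = 10_000                           # illustrative; depends on the actual training schedule
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),   # linear warmup over the first 10% of steps
    num_training_steps=total_steps,
)
```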

Baselines
We compare our model with the following baseline methods: 1) Proto-BERT (Snell et al., 2017), a BERT-based Prototypical Network model; 2) MAML (Finn et al., 2017), a typical meta-learning method; 3) BERT-PAIR (Gao et al., 2019c), a similarity-based prediction method, in which each query instance is paired with all support instances; 4) REGRAB (Qu et al., 2020), a relation graph-based approach; 5) TD-Proto (Yang et al., 2020); and the method of Zhang and Lu (2022), a label prompt dropout approach that effectively exploits the relation description and filters the pre-training data of CP to conduct a more rigorous few-shot evaluation.

Effect of Hyper-parameters β
The β in Equation 7 is an important hyperparameter, which balances the two loss terms involved in the optimization of our encoder. Thus, we conduct an experiment with different values of β on the validation set of FewRel 1.0. From Figure 5, we observe that our model achieves the best performance when β is set to 0.1. Hence, we use β = 0.1 in all subsequent experiments.

Main Results
Results on FewRel 1.0. The experimental results on the validation and test sets of FewRel 1.0 are presented in Table 1. In the first part of the table, all models directly use BERT to initialize their encoders without additional pre-training. Among these models, ours consistently outperforms all compared models on FewRel 1.0, especially surpassing the strongest GM_GEN, which also fine-tunes a relation classifier for each testing episode. The models in the second part also use BERT to initialize their encoders; notably, they are additionally pre-trained on the noisy RE corpus before being trained on FewRel 1.0. To conduct a more rigorous few-shot evaluation, Zhang and Lu (2022) filter the relations of FewRel 1.0 out of the noisy RE corpus. Among these models, ours performs significantly better than the compared models in all N-way-K-shot settings.
Results on FewRel 2.0. As mentioned in Section 4.1, FewRel 2.0 is a more difficult dataset, whose training and test sets not only contain mutually exclusive relation classes but also come from different domains. From Table 2, we observe that our model still consistently outperforms all compared models. These results further demonstrate the superiority of our model.

Ablation Study
We further conduct extensive ablation studies by removing different components of our model to understand their respective impacts. We compare our model with the following variants in Table 3.
(1) w/o. Generator. In this variant, we remove the network generator and employ a shared classifier across all episodes. As shown in Line 1, this leads to a significant performance drop of 1.29 and 1.59 points in the 5-way and 10-way settings, respectively. These results indicate that the generalizing capability of a shared classifier is limited, demonstrating the effectiveness of our network generator. Particularly, our generator can produce a suitable initialized relation classifier for each episode. Subsequently, the fine-tuning of these classifiers encourages the model to effectively learn relation-specific knowledge from only a handful of support instances, thus quickly adapting to unseen relations.
(2) w/o. Two-Step Training. During training, we separate the training of our model into two steps: the fine-tuning of the generated classifiers and the update of the encoder and generator. To verify the effectiveness of this strategy, in this variant, we instead simultaneously optimize the classifier, encoder and generator in each episode. As illustrated in Line 2, this variant causes a significant performance decline. This suggests that simultaneously optimizing the classifier, encoder and generator harms the generalizing capability of the model.

(3) w/o. Generator & Two-Step. When we remove the generator from our model and simultaneously optimize the whole model in each episode, the performance drops by 2.17 and 2.48 points in the 5-way and 10-way settings, respectively (see Line 3). These results suggest that our generator and two-step training strategy both crucially contribute to the generalizing capability of our model.
(4) w/o. Aligner A. The class-agnostic aligner A aims to provide our model with more global, general relation knowledge across the whole training process. To verify its effectiveness, we remove it from our model, and the performance also decreases (see Line 4). This suggests that the aligner A can indeed enhance model generalization.
(5) w/o. EMA. It is noteworthy that the EMA encoder $\bar{E}$ integrates historical relation knowledge from previously learned episodes during the whole training. In this variant, the EMA operation is removed and the performance becomes inferior to ours (see Line 5). This indicates that, with the help of $\bar{E}$, our aligner A is able to provide our encoder E with more global, general relation knowledge in a more efficient manner.

Comparison of Class Separation with Our Model
To further verify the effectiveness of our proposed HyperNetwork-based decoupling approach, we compare the layer-wise class separation $Sep^l$ between the encoder of a conventional Prototypical Network (PN) FSRE model and that of ours. For clarity, we only depict the degrees of class separation in the bottom-2 and top-2 layers (Layers 1, 2, 11 and 12) in Figure 6. From the figure, we can observe that, in the lower layers (i.e., Layers 1 and 2), the degree of class separation shows little difference between the PN model and ours. However, in the upper layers (i.e., Layers 11 and 12), the class separation of our model is consistently much lower and varies only moderately during the whole training process. This indicates that our entire encoder, not just its lower layers, focuses on learning more general relation knowledge, thus exhibiting less bias towards the relation classes in the training set. For details of all layer-wise class separations in our model, please refer to Appendix A.

Related Work
Relation extraction (RE) is a critical and fundamental task in natural language processing (NLP), which aims to identify the semantic relations between two entities within a given text (Xue et al., 2019; Han et al., 2020; Chen et al., 2022; Zhang et al., 2022, 2023a,b). However, conventional approaches are highly dependent on a large amount of labeled data and cannot deal with unseen relation classes well. Therefore, recent studies (Han et al., 2018; Gao et al., 2019c) turn to FSRE. Most existing studies (Gao et al., 2019a; Yang et al., 2020; Zhang and Lu, 2022) employ Prototypical Networks for FSRE, which aim to learn a suitable prototypical vector for each relation using a handful of annotated instances. Gao et al. (2019b) employ an attention mechanism to enhance the robustness of the Prototypical Network to noisy data. Qu et al. (2020) propose a Bayesian meta-learning method with an external global relation graph to model the posterior distribution of relational prototypes. Han et al. (2021b) focus on enhancing the performance of the Prototypical Network on complex relations through an adaptive focal loss and a hybrid network. Moreover, some studies (Yang et al., 2020; Wang et al., 2020; Han et al., 2021b) use supplementary information about entities and relations, such as relation descriptions, to enhance the prototype vectors of relations. Despite the impressive results achieved, these methods still tend to overfit the relation classes appearing in the training set, which limits their generalizing capability to new relations. In this paper, inspired by (He et al., 2020; Yin et al., 2022), we propose a HyperNetwork-based decoupling method along with a two-step training strategy to prevent an FSRE model from overfitting to the relations within the training set.
On the other hand, some studies (Soares et al., 2019; Peng et al., 2020; Dong et al., 2021; Wang et al., 2022) focus on further training pre-trained language models (PLMs) on noisy RE datasets. Soares et al. (2019) collect a large-scale pre-training dataset and propose a matching-the-blanks pre-training paradigm. Peng et al. (2020) propose an entity-masked contrastive pre-training framework for FSRE. Wang et al. (2022) introduce three structure pre-training tasks to pre-train a large language model (GLM with 10B parameters), allowing it to better comprehend structured information in text. Unlike these studies, Zhang and Lu (2022) introduce a more rigorous few-shot evaluation scenario by filtering the relations contained in FewRel 1.0 out of the pre-training corpus. Meanwhile, they propose a label prompt dropout method to prevent the model from overfitting to the relation descriptions. These methods are compatible with ours, as they can provide our model with a better pre-trained encoder.

Conclusion
In this paper, we propose a HyperNetwork-based Decoupling approach to improve the generalizing capability of FSRE models. Specifically, our model consists of an encoder, a network generator (for generating relation classifiers) and the generated classifiers. Our generator aims to generate a properly initialized relation classifier for each episode, allowing our model to quickly adapt to new relations. Meanwhile, we design a two-step training strategy along with a class-agnostic aligner, in which the generated classifiers focus on acquiring relation-specific knowledge while the encoder is encouraged to learn more general relation knowledge. In this way, the roles of the upper and lower layers in an FSRE model are explicitly decoupled, thus enhancing its generalizing capability during testing. Experiments on two public FSRE datasets and extensive ablation studies show that our model consistently outperforms all competitive baselines.

Figure 1 :
Figure 1: Illustration of the difference between a conventional Prototypical Network model and our model.

Figure 2 :
Figure 2: Layer-wise class separation $Sep^l$ in a 12-layer BERT-based Prototypical Network FSRE model with respect to the training steps.

Figure 3 :
Figure 3: Illustration of sampling an N-way-K-shot episode from the training/validation/test set.

Figure 4 :
Figure 4: The architecture and two-step training framework of our model. In each episode, for the first step, the network generator of our model produces an initialized N-way relation classifier that is subsequently fine-tuned by the classification loss $\mathcal{L}_c$ (i.e., $C_m \rightarrow \tilde{C}_m$). Meanwhile, the class-agnostic aligner is trained through the contrastive loss $\mathcal{L}_{align}$. For the second step, the encoder and generator are jointly optimized by $\mathcal{L}_c$ and $\mathcal{L}_{align}$. Gray indicates frozen modules in each step.

Figure 5 :
Figure 5: 10-way-1-shot accuracy of our model with different values of β on the FewRel 1.0 validation set.

Figure 6 :
Figure 6: Comparison between the conventional Prototypical Network (PN) model and ours in terms of class separation $Sep^l$ at lower and upper layers.

Figure 7 :
Figure 7: Layer-wise class separation $Sep^l$ in our 12-layer encoder E with respect to the training steps.

Table 1 :
Table 1: Results on the FewRel 1.0 validation / test set. "w/o. RE pre-train": the models without additional RE pre-training. "w/. RE pre-train": the models whose encoders are additionally pre-trained on the noisy RE corpus (Peng et al., 2020). To conduct a more rigorous few-shot evaluation, Zhang and Lu (2022) filter the relations contained in FewRel 1.0 out of the corpus. "†": the results are reported in (Zhang and Lu, 2022).