Few-shot Classification with Hypersphere Modeling of Prototypes

Metric-based meta-learning is one of the de facto standards in few-shot learning. It consists of representation learning and metric design. Previous works construct class representations in different ways, varying from mean output embeddings to covariances and distributions. However, a single embedding point lacks expressivity and cannot capture class information robustly, while complex statistical modeling makes metric design difficult. In this work, we use tensor fields ("areas") to model classes from a geometrical perspective for few-shot learning. We present a simple and effective method, dubbed hypersphere prototypes (HyperProto), where class information is represented by dynamically sized hyperspheres with two sets of learnable parameters: the hypersphere's center and its radius. Extending from points to areas, hyperspheres are much more expressive than embeddings. Moreover, metric-based classification is more convenient with hypersphere prototypes than with statistical modeling, as we only need to calculate the distance from a data point to the surface of the hypersphere. Following this idea, we also develop two variants of prototypes under other measurements. Extensive experiments and analyses on few-shot learning tasks across NLP and CV, and comparisons with 20+ competitive baselines, demonstrate the effectiveness of our approach.


Introduction
Constituting cognition of novel concepts from a few examples is crucial for machines to emulate human intelligence, and given the exorbitant cost of annotating large amounts of data, few-shot learning has garnered considerable attention in modern deep learning (Lu et al., 2020). Despite their success under ample supervision, traditional deep neural models still struggle with limited training examples. Consequently, various approaches have been proposed to extend the applicability of deep neural networks to scenarios with limited data availability. One significant line of research within this domain is metric-based meta-learning (Snell et al., 2017; Ren et al., 2018; Allen et al., 2019), where models are trained to generate expressive representations and perform classification through predefined metrics.
The success of metric-based meta-learning depends on both representation learning and the chosen metrics. One straightforward approach trains feature representations and adopts a nearest-neighbor classifier (Vinyals et al., 2016; Yang and Katiyar, 2020; Wang et al., 2019). Other works introduce additional parameters as class representations to achieve better generalization. A naive way to estimate a class representation is to use the mean embedding of the features, i.e., prototypes (Snell et al., 2017; Allen et al., 2019), while others use second-order moments (Li et al., 2019a) or reparameterize the learning process to generate class representations in a richer semantic space (Ravichandran et al., 2019) or in the form of probability distributions (Zhang et al., 2019). Apart from the traditional Euclidean and cosine distances, a variety of metric functions have also been proposed (Sung et al., 2018; Zhang et al., 2020a; Xie et al., 2022). Most existing works learn class representations from a statistical perspective, which makes the metrics more difficult to design and implement. For example, the covariance metric proposed in CovaMNet (Li et al., 2019a) theoretically requires a non-singular covariance matrix, which is awkward for neural feature extractors.
This paper revisits metric-based meta-learning and demonstrates that geometrical modeling can simultaneously enhance the expressivity of representations and reduce the difficulty of metric calculation, while yielding surprising performance in few-shot classification. Specifically, we propose HyperProto, a simple and effective approach that models class representations with hyperspheres. It has three advantages: (1) Characterizing a geometrical "area" as a manifold with a complex boundary is often difficult in deep learning. Instead, we only use two sets of learnable parameters, the center and the radius, to represent a hypersphere, which is straightforward and easy to learn. (2) A hypersphere is much more expressive than a single point in the representation space: the learnable radius parameter greatly expands the representative power. (3) Hyperspheres are well suited to constructing measurements in Euclidean space. To perform metric-based classification, we simply calculate the Euclidean distance from a feature point to the surface of the hypersphere, which is difficult for other manifolds.
Along with the simplicity of the metric design and the enhanced expressive power comes ease of optimization and learning. With the metric defined as the distance to the hypersphere surface, both the radius and the center of the hypersphere appear in the loss function and participate in backpropagation during optimization. Intuitively, classes with sparse feature distributions obtain prototypes with large radii, and vice versa. Beyond Euclidean space, we also develop two variants based on the same general idea: cone-like prototypes with cosine similarities and Gaussian prototypes from a probabilistic perspective (in Appendix A).
We conduct extensive experiments to evaluate the effectiveness of HyperProto on two classical information extraction tasks, few-shot named entity recognition (NER) (Ding et al., 2021c) and relation extraction (RE) (Han et al., 2018; Gao et al., 2019b). Despite its simplicity, our approach is exceedingly effective: it outperforms vanilla prototypes by 8.33% absolute in average F1 on FEW-NERD (INTRA), 6.55% absolute in average F1 on FEW-NERD (INTER), and 4.77% absolute in average accuracy on FewRel. The generality of our approach allows it to be easily integrated with other techniques. We combine our method with prompt-learning and task-specific pre-training to obtain high-quality representations, substantially outperforming many competitive baselines. We believe our approach can serve as a strong baseline for few-shot learning and inspire new ideas on representation learning from the research community.

Problem Setup
We consider the episodic N-way K-shot few-shot classification paradigm. In each episode, N classes are sampled; the support set contains K labeled examples per class, and the query set contains examples from the same N classes. Each input is a vector x_i ∈ R^L with dimension L, and y_i is the index of its class label. For each input in the query set, the model predicts a label based only on the support set of that episode.

Methodology
This section describes the mechanisms of hypersphere modeling of prototypes. One hypersphere prototype is represented by two parameters, the center and the radius, which are first initialized via estimation and then optimized by gradient descent in conjunction with the encoder parameters.

Overview
We now introduce HyperProto, a set of hyperspheres in the embedding space R^D that abstractly represent the intrinsic features of classes. Formally, one prototype is defined as

B = { p ∈ R^D | d(p, z) ≤ ϵ },    (1)

where d : R^D × R^D → [0, +∞) is the distance function of the metric space, f_ϕ is a neural encoder parameterized by ϕ, and z and ϵ denote the center and the radius of the hypersphere. We use M(·) to denote the measurement between a data point and a hypersphere prototype based on d(·).
The central idea is to learn a hypersphere prototype for each class with limited episodic supervision. Each example (x*, y*) in the query set is predicted by its measurement to the hypersphere prototypes, the Euclidean distance from the embedding to the surface of the hypersphere:

M(x*, B) = d(f_ϕ(x*), z) − ϵ.    (2)

Note that with this metric design, the value of M(·) may be negative; geometrically speaking, the point is then contained inside the hypersphere, which affects neither the loss calculation nor the prediction. Generally, the idea is to model prototypes with areas instead of points in the embedding space to enhance expressivity while preserving the convenience of Euclidean metric design. The advantages of the proposed method are twofold. First, as stated in § 1, one hypersphere prototype is uniquely determined by the center z and the radius ϵ, whereas characterizing manifolds with complex boundaries in the embedding space is intricate. Second, the parameters are easy to optimize through metric-based classification since they are naturally involved in the measurement calculation.
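As a concrete illustration, the measurement in Eq. 2 under the Euclidean distance amounts to one line of NumPy (a minimal sketch; the function name and array interface are ours, not from the paper):

```python
import numpy as np

def hypersphere_measure(emb, center, radius):
    """Measurement M(x, B): Euclidean distance from an embedding to the
    hypersphere surface. Negative values mean the point lies inside."""
    return np.linalg.norm(emb - center) - radius
```

For example, a query embedding at distance 3 from a center with radius 1 yields a measurement of 2, while one at distance 0.5 yields −0.5, i.e., the point lies inside the prototype.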

Hypersphere Prototypes
To construct hypersphere prototypes, the first step is to initialize the center z and the radius ϵ. To start from a reasonable approximation of the data distribution, we randomly select K instances from each class. Specifically, for the n-th class with sampled instance set S_n, the center is initialized as the mean of the K embeddings, as in Snell et al. (2017), and the radius as the mean distance from each sample to the center:

z_n = (1/K) Σ_{x ∈ S_n} f_ϕ(x),    ϵ_n = (1/K) Σ_{x ∈ S_n} d(f_ϕ(x), z_n).    (3)

Once initialized, a hypersphere prototype participates in the training process, where its center and radius are optimized simultaneously. During training, for each episode with sampled classes N = {n_1, n_2, ..., n_N}, the probability that a query point x ∈ Q belongs to class n is calculated by a softmax over the metrics to the corresponding N hypersphere prototypes:

p(y = n | x) = exp(−M(x, B_n)) / Σ_{n′ ∈ N} exp(−M(x, B_{n′})).    (4)
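The initialization in Equation 3 can be sketched in a few lines (NumPy; the function name is ours, assuming the K support embeddings are stacked row-wise):

```python
import numpy as np

def init_prototype(support_embs):
    """Initialize a hypersphere prototype from K support embeddings (K, D):
    center = mean embedding, radius = mean distance to the center (Eq. 3)."""
    center = support_embs.mean(axis=0)
    radius = np.linalg.norm(support_embs - center, axis=1).mean()
    return center, radius
```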
The parameters of f_ϕ and of the hypersphere prototypes are optimized by minimizing the metric-based cross-entropy objective

L = − Σ_{(x, y*) ∈ Q} log p(y = y* | x).    (5)

This objective combines the advantages of hypersphere prototypes: M is calculated from ϵ and z, which therefore participate in the optimization. The parameters ϕ of the neural network are optimized along with the centers and radii of the hypersphere prototypes through gradient descent.
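The per-query objective can be sketched as follows (NumPy, for exposition only; in practice the centers, radii, and encoder parameters would be framework tensors updated by an optimizer, and the names here are our own):

```python
import numpy as np

def episode_loss(query_emb, label, centers, radii):
    """Cross-entropy over a softmax of negative measurements M(x, B_n).
    centers: (N, D); radii: (N,); label: index of the true class."""
    m = np.linalg.norm(query_emb[None, :] - centers, axis=1) - radii
    logits = -m                      # closer prototype -> larger logit
    logits = logits - logits.max()   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[label]
```

Because both the centers and the radii enter the logits, their gradients are obtained for free from the cross-entropy loss, alongside those of the encoder producing `query_emb`.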
To sum up, in the initialization stage, the hypersphere prototypes of all classes in the training set, parameterized by z and ϵ, are estimated from the embeddings of randomly selected instances and stored for subsequent training and optimization.
In the training stage, the stored radii ϵ are optimized by an independent optimizer. Optimization yields the final locations and sizes of the hyperspheres in service of classification performance. More importantly, the involvement of the prototype centers and radii in training in turn affects the optimization of the encoder parameters, stimulating more expressive and distinguishable representations.
Algorithm 1 summarizes the initialization and learning stages of hypersphere prototypes. Although the centers and radii are stored and optimized continuously during training (in contrast to vanilla prototypes, whose centers are re-estimated at each episode), the whole process is still episodic: in each episode, the samples in the query set are evaluated only against the classes of that episode rather than the global training class set.
Meanwhile, a standard episodic evaluation process is adopted to handle unseen classes, where we estimate prototype centers and radii in closed form. In the episodic evaluation procedure, HyperProto directly takes the mean of the instance embeddings as the center and the mean distance of each instance to the center as the radius (as in Equation 3), following previous works (Vinyals et al., 2016; Snell et al., 2017; Zhang et al., 2020a).
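Putting the pieces together, episodic evaluation with closed-form estimates (Equation 3) reduces to a nearest-hypersphere rule; a self-contained sketch (function and variable names are ours):

```python
import numpy as np

def predict_episode(support, query_emb):
    """support: {class_label: (K, D) array of support embeddings}.
    Returns the class whose hypersphere surface is closest to the query."""
    scores = {}
    for label, embs in support.items():
        center = embs.mean(axis=0)                             # closed-form center
        radius = np.linalg.norm(embs - center, axis=1).mean()  # closed-form radius
        scores[label] = np.linalg.norm(query_emb - center) - radius
    return min(scores, key=scores.get)
```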
We also develop two variants that use "areas" to represent class-level information under other measurements; details can be found in Appendix A.

Experiments
To evaluate the effectiveness of the proposed method, we conduct experiments on few-shot named entity recognition (NER) and few-shot relation extraction (RE), both fundamental information extraction tasks with well-established datasets. Task descriptions, datasets, and implementation details are reported in Appendix B. Apart from the experimental study in this section, we also carry out additional experiments and analyses on image classification to demonstrate the generality of our method in Appendix C.

Combination with Orthogonal Techniques
Our experiments show that by simply adding a radius parameter to the learning process, HyperProto outperforms vanilla embedding prototypes by a large margin. Moreover, as a basic few-shot learning method, HyperProto can be combined with orthogonal enhancements to further boost performance. We choose two such techniques, prompting and task-specific pre-training, to combine with our approach on the NER and RE tasks, respectively. Essentially, both can be regarded as means of constructing high-quality initial representations for the task at hand, on top of which our approach performs metric-based few-shot learning.
For NER, we enhance the primitive HyperProto with prompts (Liu et al., 2023): in the support set, the label of each entity is inserted after the entity, and in the query set, the label candidates are concatenated and inserted as a prefix at the beginning of the input. For RE, we apply HyperProto to a task-specific pre-trained relation encoder (Peng et al., 2020). These two experiments further show the compatibility of the proposed method and indicate its potential as a fundamental modeling strategy for few-shot learning.

Overall Results
Few-shot Named Entity Recognition. Table 1 shows the performance on FEW-NERD. HyperProto has a considerable advantage over the vanilla ProtoNet, with an increase of at least 5% in F1 score across all settings. The success on both benchmarks demonstrates that HyperProto can learn the general distribution pattern of entities across different classes and can thus greatly improve performance when little information is shared between the training and test sets. NNShot and HyperProto perform comparably in the low-shot regime, probably because in a sequence labeling task it is harder to infer class-level information from very limited tokens; in this case, the modeling ability of hypersphere prototypes degenerates towards the nearest-neighbor strategy of NNShot. As the number of shots increases, the memory cost of NNShot grows quadratically and becomes unaffordable, while HyperProto keeps it at a reasonable magnitude. In this sense, HyperProto is more efficient.
When combined with prompts, HyperProto outperforms strong baselines such as CONTaiNER (Das et al., 2021) and ESD (Wang et al., 2022), which use pre-training and an additional span attention module to enhance class representations. HyperProto is particularly advantageous in the INTRA setting. This fits our intuition: with less information shared between the training and test sets, features learned during pre-training or by a trained attention module are less transferable, which further demonstrates the robustness of HyperProto's modeling. We also believe a carefully designed initialization strategy is vital for the performance of our model in low-shot settings. The impact of the number of shots is reported in Appendix C.4.
Few-shot Relation Extraction. Table 2 presents the results on the two FewRel tasks. Methods that use additional data or conduct task-specific encoder pre-training are not included.
HyperProto generally performs better than all baselines across all settings. In terms of backbone models, when combined with pre-trained models like BERT, hypersphere prototypes yield a larger advantage over vanilla prototypes, showing that hypersphere modeling better approximates the real data distribution and boosts the fine-tuning of BERT. It also sheds light on the untapped ability of large pre-trained language models and stresses that a proper assumption about the data distribution may help unlock their potential. HyperProto's outstanding performance on the domain adaptation task further validates the importance of a better abstraction of data in transfer learning. Meanwhile, the large performance variation on the domain adaptation task suggests that when the domain shifts, the estimation of hypersphere prototypes becomes less stable.
To further evaluate the compatibility of our approach with other orthogonal techniques, we replace the original BERT model with a version pre-trained on relation classification (Peng et al., 2020). With this pre-trained encoder, the performance of our method improves substantially, demonstrating the model-agnostic nature of our approach.

Experimental Analysis
Analysis of the Radius Dynamics. We demonstrate the mechanism of hypersphere prototypes by illustrating the change of radius for one specific hypersphere. In the learning phase, the radius of a hypersphere prototype changes according to the "density" of the sampled episode, which can be characterized by the mean distance of the samples to the corresponding prototype center. In practice, due to randomness in sampling, the mean distance may oscillate at a high frequency, and the radius changes accordingly. To better visualize the change of radius along with the mean distance at each update, for each round of training we fix one specific class as the anchor class for recording the mean distance and radius, and apply a special sampling strategy at each episode. Specifically, we take the FewRel training data and train in the 5-way 5-shot setting with a CNN encoder. During training, each episode contains the anchor class and 4 other randomly sampled classes. Training accuracy is logged every 50 steps. After a warmup of 500 steps, we alternately sample "good" and "bad" episodes every 50 steps: a "good" episode has higher accuracy on the anchor class than the previously logged accuracy, while a "bad" episode has lower accuracy than before. The mean distance to the prototype center and the radius are logged every 50 steps after the warmup.
Figure 2 shows the change of mean distance and radius for 8 classes during training steps 600∼2000. Although the numeric values of distance and radius differ greatly and oscillate at different scales, they share similar patterns of change. Moreover, there is often a small time lag in the change of radius, indicating that it is driven by the change in mean distance. This is in line with our expectations and clearly demonstrates the learning mechanism of hypersphere prototypes. The experiment also suggests a promising idea: if we can control the sampling strategy with a priori knowledge, we may be able to learn ideal hypersphere prototypes.
Visualization. We also use t-SNE (van der Maaten and Hinton, 2008) to visualize the embeddings before and after training by ProtoNet and HyperProto, respectively. 5 classes are sampled from the training set and test set of the FEW-NERD dataset, and for each class, 500 samples are randomly chosen and embedded by BERT trained on the 5-way 5-shot NER task.
Figure 3 shows the embeddings in a 2-dimensional space, where colors represent classes. Note that for the token-level NER task, the interaction between the target token and its context may result in a more mixed-up distribution than instance-level embeddings. For both models, the representations of the same training-set class become more compressed and easier to classify than their initial embeddings, while HyperProto produces even more compact clusters. The clustering effect is also observed for novel classes. We also calculate the difference between the mean Euclidean distances from the samples of each class to the (hypersphere) prototype of the target class and to those of the other classes; the larger the difference, the better the samples are distinguished. For ProtoNet, the difference is 2.33 and 1.55 on the training and test sets, while for HyperProto the results are 5.09 and 4.56, respectively. This can also be inferred from the t-SNE result: since samples from different classes are distributed at different densities, the extra radius parameter helps better distinguish between classes. The visualization and statistical results demonstrate the effectiveness of HyperProto in learning discriminative features, especially novel class representations, which considerably boosts model performance under few-shot settings.
Representation Analysis. To study whether the learned representations are discriminative enough for few-shot classification, we illustrate the normalized distances between the learned representations and the hypersphere prototypes in Figure 4.
Specifically, for each dataset we randomly sample 5 classes and 25 instances (5 per class) and produce representations for the instances and hypersphere prototypes for the classes. We then calculate the distance between each instance and each prototype (i.e., the distance from the point to the hypersphere surface) to produce the matrix. All values are normalized, since the absolute values vary across datasets; warmer colors denote shorter distances. The illustration shows that on all three datasets, our model effectively learns discriminative representations and achieves stable metric-based classification.
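Such an instance-to-prototype distance matrix can be computed as follows (a sketch with names of our own choosing; min-max normalization is one plausible scheme, since the paper does not specify which is used):

```python
import numpy as np

def normalized_distance_matrix(inst_embs, centers, radii):
    """inst_embs: (M, D); centers: (N, D); radii: (N,).
    Entry (i, j) is the distance from instance i to the surface of
    prototype j, min-max normalized for cross-dataset comparability."""
    d = np.linalg.norm(inst_embs[:, None, :] - centers[None, :, :], axis=2)
    d = d - radii[None, :]                       # distance to each surface
    return (d - d.min()) / (d.max() - d.min())   # normalize to [0, 1]
```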
Hypersphere Prototypes. To further analyze the representations produced by HyperProto, we study the similarities of randomly sampled instance embeddings. We randomly select 4 × 5 classes and 5 instances per class from FEW-NERD, FewRel, and miniImageNet, respectively. As illustrated in Figure 5, each subfigure is a 25 × 25 matrix based on 5 classes. We calculate the cosine similarities of these embeddings and observe clear intra-class similarity and inter-class distinctiveness. This result confirms the robustness of our model, since all classes and instances are sampled randomly.

Related Work
This work is related to studies of meta-learning, whose primary goal is to quickly adapt deep neural models to new tasks with a few training examples (Hospedales et al., 2020). To this end, two branches of studies have been proposed: optimization-based methods and metric-based methods. Optimization-based studies (Finn et al., 2017; Franceschi et al., 2018; Ravi and Beatson, 2018) regard few-shot learning as a bi-level optimization process, where a global optimization learns a good initialization across various tasks, and a local optimization quickly adapts the initialized parameters to a specific task with a few steps of gradient descent.
Compared to the aforementioned studies, our work is more related to metric-based meta-learning approaches (Vinyals et al., 2016; Snell et al., 2017; Satorras and Estrach, 2018; Sung et al., 2018), whose general idea is to learn to measure the similarity between representations and find the closest labeled example (or a derived prototype) for an unlabeled example. Typically, these methods learn a measurement function through episodic optimization. More specifically, we inherit the spirit of using prototypes to abstractly represent class-level information, which can be traced back to cognitive science (Reed, 1972; Rosch et al., 1976; Nosofsky, 1986), statistical machine learning (Graf et al., 2009), and the Nearest Mean Classifier (Mensink et al., 2013). In deep learning, Snell et al. (2017) propose the prototypical network, which exploits the average of example embeddings as a prototype to perform metric-based classification in few-shot learning. In their work, prototypes are estimated from the embeddings of instances; however, it is difficult to find a satisfying location for the prototypes based on the entire dataset. Ren et al. (2018) adapt such prototype-based networks to the semi-supervised scenario where the dataset is partially annotated. Moreover, a series of prototype-based networks have been proposed that concentrate on improved prototype estimation and application to various downstream tasks (Allen et al., 2019; Gao et al., 2019a; Li et al., 2019b; Pan et al., 2019; Seth et al., 2019; Ding et al., 2021b; Li et al., 2020c; Wertheimer and Hariharan, 2019; Xie et al., 2022; Zhang et al., 2020a). We discuss our method in the context of other prototype-enhanced methods in Section D.1. There has also been a growing body of work that considers the few-shot problem from multiple perspectives, bringing new thinking to the field (Tian et al., 2020; Yang et al., 2021; Laenen and Bertinetto, 2021; Zhang et al., 2020b; Wang et al., 2021; Das et al., 2021; Wertheimer et al., 2021; Ding et al., 2021a; Cui et al., 2022; Hu et al., 2022). A further series of works embed prototypes into a non-Euclidean output space (Mettes et al., 2019; Keller-Ressel, 2020; Atigh et al., 2021). Note that these studies regard hyperspheres or other non-Euclidean manifolds as a characterization of the embedding space, whereas our proposed method uses hyperspheres to represent prototypes and conducts metric-based classification in Euclidean space. The focus of HyperProto is therefore different from the above non-Euclidean prototype-based works.

Conclusion
This paper proposes hypersphere prototypes, a novel metric-based few-shot learning method. Unlike previous metric-based methods that use dense vectors to represent class-level semantics, we use hyperspheres to enhance the capability of prototypes to express the intrinsic information of the data. It is simple to model a hypersphere in the embedding space and conduct metric-based classification in few-shot learning. Our approach is easy to implement and empirically effective: we observe significant improvements over baselines and compatibility with other techniques on downstream tasks. For future work, such modeling could be extended to more generalized representation learning, such as word embeddings.

Limitations
Compared to vanilla prototypes, the advantage of HyperProto relies on the additional radius parameter. Under the 1-shot setting, however, hypersphere prototypes face challenges in estimating the radius from support sets, because the initial radius may be biased by sampling randomness. When the radius is set to exactly 0, the model reduces to a traditional prototypical network. Nevertheless, although the boost is not as large as in the multi-shot setting, we find that a consistently optimizable radius parameter at the training stage still delivers non-trivial results in the 1-shot scenario and exceeds most baselines (Tables 1, 2, 3). This further points to the positive influence of the added radius parameter on learning prototype representations and hints at a possible research direction: learning a transferable radius in the 1-shot scenario.

A Generalizations of Our Method
We have introduced the mechanisms of hypersphere prototypes in Euclidean space. In this section, we generalize the idea to construct variants under other measurements.
Cone-like HyperProto. Cosine similarity is a commonly used measurement in machine learning. Assume all data points are distributed on a unit ball, and the cosine of the intersection angle measures the similarity of two embeddings. Keeping the intuition of hypersphere prototypes, we introduce an additional angle parameter ϵ. Let θ_{a,b} denote the intersection angle between two embeddings a and b. The center point z and the angle ϵ then conjointly define a cone-like prototype,

B = { p | d(p, z) ≥ cos ϵ },    (6)

where d(f_ϕ(x), z) = cos θ_{f_ϕ(x),z}. The measurement M(·) is defined as the cosine of the angle between the instance embedding and the nearest point on the border of the prototype,

M(x, B) = cos(θ_{f_ϕ(x),z} − ϵ) if θ_{f_ϕ(x),z} > ϵ, and −1 otherwise.    (7)

As with vanilla hypersphere prototypes, z and ϵ participate in the learning process, and the angle θ_{x,z} is computed with the inverse trigonometric function, θ_{x,z} = arccos d(f_ϕ(x), z). The prediction for a training example is again based on the softmax over the measurements to the prototypes, as in Eq. 5. Note that, as shown in Eq. 7, the measurement becomes −1 when a data point lies "inside" the cone-like prototype, which makes the prediction ambiguous when an embedding lies inside two prototypes. We therefore require that prototypes do not intersect, i.e., that the angle between any two center points is larger than the sum of their angle parameters,

θ_{z_i, z_j} > ϵ_i + ϵ_j,  ∀ i ≠ j.

This constraint is imposed through an auxiliary loss L_dis, so the final loss function is L = L_cls + L_dis.
Gaussian HyperProto. From a probabilistic perspective, each class can be characterized by a distribution over a multi-dimensional feature space, and the measurement of a query sample to the n-th class can be represented by the negative log-likelihood of f_ϕ(x) under B_n. In line with other works (Zhang et al., 2019; Li et al., 2020d), we simply assume each class follows a Gaussian distribution B_n ∼ N(μ_n, Σ_n). To reduce the number of parameters and better guarantee positive semi-definiteness, we further restrict the covariance matrix to be isotropic, Σ_n = σ_n² I. The measurement then becomes

M(x, B_n) = ||f_ϕ(x) − μ_n||² / (2σ_n²) + d log σ_n + δ,    (10)

where δ = (d/2) log 2π and d is the feature dimension. The probability of the target class given a query sample is calculated by Eq. 4 in the same fashion: p(y = n | x) = p(f_ϕ(x); B_n) / Σ_{n′} p(f_ϕ(x); B_{n′}). Note that this derived form is the same as directly calculating p(y = n | x) under a uniform prior p(y). Compared with purely probabilistic approaches, such as variational inference, which treats B as a hidden variable and models p(B|S) and p(B|S, x) with neural networks (Zhang et al., 2019), under the framework of § 3.2, B is explicitly parameterized and optimized for each class during training. Moreover, comparing Eq. 10 with Eq. 2, it can be observed that when B is formalized as a distribution, the original radius parameter (now the variance) functions as a scaling factor on the Euclidean distance rather than as a bias term.
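The Gaussian variant's measurement, i.e., the negative log-likelihood under N(μ_n, σ_n² I), can be sketched as follows (the function name is ours):

```python
import numpy as np

def gaussian_measure(emb, mu, sigma):
    """Negative log-likelihood of emb under N(mu, sigma^2 I):
    ||x - mu||^2 / (2 sigma^2) + d log(sigma) + (d/2) log(2 pi)."""
    d = emb.shape[0]
    return (np.sum((emb - mu) ** 2) / (2.0 * sigma ** 2)
            + d * np.log(sigma) + 0.5 * d * np.log(2.0 * np.pi))
```

As in the Euclidean case, a softmax over the negative measurements yields class probabilities, with σ_n acting as a per-class scaling factor on the squared Euclidean distance.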

B Experimental Details
This section reports the experimental details of all three tasks in our evaluation. All experiments are conducted on NVIDIA A100 and V100 GPUs with CUDA. The main experiments cover three representative tasks in NLP and CV: few-shot named entity recognition (NER), relation extraction (RE), and image classification. The details are presented in the following sections.

B.1 Experimental Details for Few-shot Named Entity Recognition
We assess the effectiveness of hypersphere prototypes on NLP; specifically, the first task is few-shot named entity recognition (NER) on the FEW-NERD dataset (Ding et al., 2021c). NER aims at locating and classifying named entities (real-world objects that can be denoted with proper names) in an input sentence, and is typically regarded as a sequence labeling task. Given the input sentence "Bill Gates is a co-founder of the American multinational technology corporation Microsoft", a named entity recognition system should locate the named entities (Bill Gates, Microsoft) and classify them into specific types. A conventional schema uses coarse-grained labels, such as Person for Bill Gates and Organization for Microsoft. A more advanced schema like FEW-NERD asks models for more specific entity types, for example, Person-Entrepreneur for Bill Gates and Organization-Company for Microsoft.
Different from typical instance-level classification, few-shot NER is a sequence labeling task, where labels may share structural correlations. NER is the first step in automatic information extraction and the construction of large-scale knowledge graphs, and quickly detecting fine-grained unseen entity types is of significant importance in NLP. To capture the latent correlations, many recent efforts in this field use large pre-trained language models (Han et al., 2021b) like BERT (Devlin et al., 2019) as the backbone model and have achieved remarkable performance. The original prototypical network has also been applied to this task (Li et al., 2020b; Huang et al., 2020; de Lichy et al., 2021).
Dataset. The experiments are run on the FEW-NERD dataset (Ding et al., 2021c), which is distributed under the CC BY-SA 4.0 license. It is a large-scale NER dataset containing over 400,000 entity mentions across 8 coarse-grained types and 66 fine-grained types, which makes it an ideal dataset for few-shot learning. It has been shown that existing methods, including prototypes, are not effective enough on this dataset.
Baselines. We choose the following baselines:
• Proto (Snell et al., 2017) is the main baseline, which adapts the prototypical network to few-shot named entity recognition.
• NNShot (Yang and Katiyar, 2020) is a token-level metric-based method that is specifically designed for few-shot labeling.
• StructShot (Yang and Katiyar, 2020) adds a CRF layer at inference and further boosts the performance of NNShot.
• CONTaiNER (Das et al., 2021) uses a pre-trained backbone and further fine-tunes on the few-shot data.
• ESD (Wang et al., 2022) uses an attention mechanism to learn prototype representations.
Implementation Details. We run experiments under four settings on the two released benchmarks, FEW-NERD (INTRA) and FEW-NERD (INTER). Specifically, we use uncased BERT-base as the backbone encoder with an encoder learning rate of 1e-4. For the radius parameter, we use a learning rate of 20.0 for the HyperProto+Prompt 10-way 5-shot INTER setting and 10.0 for the other settings. AdamW is used as the BERT encoder optimizer, and Adam (Kingma and Ba, 2017) is used to optimize the prototype radius. The batch size is set to 2 across all settings. All models are trained for 10,000 steps and validated every 1,000 steps. The results are reported over 5,000 test episodes. For each setting, we run the experiment with 3 different random seeds and report the average F1-score and the standard error. We use PyTorch (Paszke et al., 2019) and Huggingface Transformers (Wolf et al., 2020) to implement the backbone encoder BERT-base.
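The separate-optimizer setup described above can be sketched as follows; the module and tensor shapes are illustrative stand-ins (a linear layer in place of BERT), not the released code:

```python
import torch

# Illustrative stand-ins: a linear layer replaces the BERT encoder, and one
# learnable radius per class in a 5-way episode. Hyperparameters mirror the
# appendix: AdamW at 1e-4 for the encoder, Adam at 10.0 for the radii.
encoder = torch.nn.Linear(768, 768)
radii = torch.nn.Parameter(torch.ones(5))

encoder_opt = torch.optim.AdamW(encoder.parameters(), lr=1e-4)
radius_opt = torch.optim.Adam([radii], lr=10.0)

# One hypothetical training step: a single loss drives both optimizers.
x = torch.randn(4, 768)
loss = (encoder(x).norm(dim=-1) - radii.mean()).pow(2).mean()
loss.backward()
encoder_opt.step()
radius_opt.step()
encoder_opt.zero_grad()
radius_opt.zero_grad()
```

The key point is that the radius parameters live in their own optimizer, so their much larger learning rate does not affect the encoder weights.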

B.2 Experimental Details for Few-shot Relation Extraction
The other common NLP task is relation extraction (RE), which aims at correctly classifying the relation between two given entities in a sentence. For example, given an input sentence with marked entities "[Bill Gates] is a co-founder of the American multinational technology corporation [Microsoft]", a relation extraction system should output the relationship between Bill Gates and Microsoft. RE is a fundamental task in information extraction and an important form of learning structured knowledge from unstructured text. We use FewRel (Han et al., 2018)³ and FewRel 2.0 (Gao et al., 2019b) as the datasets.
Dataset. We adopt the FewRel dataset (Han et al., 2018; Gao et al., 2019b), a relation extraction dataset specifically designed for few-shot learning. FewRel has 100 relations with 700 labeled instances each. The sentences are extracted from Wikipedia, and the relational entities are obtained from Wikidata. FewRel 1.0 (Han et al., 2018) is released as a standard few-shot learning benchmark. FewRel 2.0 (Gao et al., 2019b) adds a domain adaptation task and a NOTA (none-of-the-above) task on top of FewRel 1.0, with a newly released test set built on the PubMed corpus.
Baselines. In addition to the main baseline, the prototypical network (Snell et al., 2017), we choose the following few-shot learning methods as baselines for relation extraction.
• MLMAN (Ye and Ling, 2019) is a multi-level matching and aggregation network for few-shot relation classification. Note that Proto-HATT and MLMAN are not model-agnostic.
• GNN (Satorras and Estrach, 2018) is a meta-learning model with a graph neural network as the prediction head.
• SNAIL (Mishra et al., 2017) is a meta-learning model with attention mechanisms.
• Meta Net (Munkhdalai and Yu, 2017) is a classical meta-learning model with meta information.
• BERT-pair (Gao et al., 2019b) is a strong baseline that uses BERT for few-shot relation classification.
• MTB (Soares et al., 2019) pre-trains on sentence pairs constructed with an entity linking technique.
• CP (Peng et al., 2020) pre-trains a relation classification model with contrastive learning.
³ FewRel is distributed under the MIT license.
• COL (Ding et al., 2021b) assumes that prototypes are distributed uniformly on a unit ball surface and pre-trains the prototype representations.
• HCRP (Han et al., 2021a) uses contrastive learning to learn better prototype representations, while focusing more on hard cases.
• LPD (Zhang and Lu, 2022) adopts relation descriptions as prompts and randomly drops labels in the support set to derive better class prototypes.
Implementation Details. The experiments are conducted on the FewRel 1.0 and FewRel 2.0 domain adaptation tasks. For FewRel 1.0, we follow the official splits in Han et al. (2018). For FewRel 2.0, we follow Gao et al. (2019b), training the model on Wikipedia data, validating on SemEval data, and testing on the PubMed data. We use the same CNN structure and BERT as encoders. The learning rate for the hypersphere prototype radius is 0.1 for the CNN encoder and 0.01 for the BERT encoder. Adam (Kingma and Ba, 2017) is used as the radius optimizer. We train the model for 10,000 steps, validate every 1,000 steps, and test for 5,000 steps. The other hyperparameters are the same as in the original paper.
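As a reminder of the metric these experiments use, a minimal sketch of hypersphere-prototype classification follows. The paper describes the score as the distance from a query embedding to the hypersphere's surface; the exact form here, |‖q − c‖₂ − r|, is our assumption, and all names are illustrative:

```python
import torch

def hypersphere_logits(queries, centers, radii):
    """Score queries against hypersphere prototypes.

    Assumed metric: the (negative) distance from an embedding to each
    hypersphere's surface, | ||q - c||_2 - r |.
    """
    dist_to_center = torch.cdist(queries, centers)                 # (Q, N)
    dist_to_surface = (dist_to_center - radii.unsqueeze(0)).abs()  # (Q, N)
    return -dist_to_surface  # higher score = closer to the class "area"

queries = torch.randn(8, 64)   # 8 query embeddings
centers = torch.randn(5, 64)   # 5-way episode: one center per class
radii = torch.rand(5) + 0.5    # one learnable radius per class

logits = hypersphere_logits(queries, centers, radii)
pred = logits.argmax(dim=-1)   # predicted class per query
```

A query lying exactly on a class's surface gets the maximal score of zero for that class, which is what lets the radius absorb within-class variance.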

B.3 Experimental Details for Few-shot Image Classification
Image classification is one of the most classical tasks in few-shot learning research. Seeking a better solution for few-shot image classification is beneficial in two ways: (1) it alleviates the need for data augmentation, a standard technique that enriches the labeled data by performing transformations on a given image; (2) it facilitates applications where labeled images are expensive. To demonstrate the effectiveness of HyperProto, we also conduct experiments on few-shot image classification with the miniImageNet (Vinyals et al., 2016) dataset.
The results of the experiment are shown in Appendix C.

Dataset. miniImageNet (Vinyals et al., 2016) is a common benchmark for few-shot learning.
The dataset is extracted from the full ImageNet dataset (Deng et al., 2009) and consists of 100 randomly chosen classes with 600 instances each. Each image is of size 3×84×84. We follow the split in Ravi and Larochelle (2017) and use 64, 16, and 20 classes for training, validation, and testing, respectively.
Baselines. The baselines we choose are as follows:
• Prototypical network (Snell et al., 2017) uses the vanilla prototypes as representations and is our main baseline.
• IMP (Allen et al., 2019) is a prototype-enhanced method that models an infinite mixture of prototypes for few-shot learning.
• CovaMNet (Li et al., 2019a) is a few-shot learning method that uses covariance to model the distribution information to enhance few-shot learning performance.
• Variational FSL (Zhang et al., 2019) is a variational Bayesian framework for few-shot learning, which contains a pre-training stage.
• Activation to Parameter (Qiao et al., 2018) predicts parameters from activations in few-shot learning.
• TRAML (Li et al., 2020a) uses adaptive margin loss to boost few-shot learning, and Prototypes + TRAML is a strong baseline in recent years.
• AWGIM (Guo and Cheung, 2020) follows the idea of LEO (Rusu et al., 2018) but generates different classification representations for different queries by maximizing mutual information.
Implementation Details. The experiments are conducted under the 5-way 1-shot and 5-way 5-shot settings.
To ensure validity and fairness, we implement hypersphere prototypes with various backbone models, including CNN, ResNet-12, and WideResNet (Zagoruyko and Komodakis, 2016), to make them comparable to all baseline results. We also re-run some of the baselines, including the prototypical network (Snell et al., 2017), infinite mixture prototypes (Allen et al., 2019), and CovaMNet (Li et al., 2019a).

C Additional Experiments and Analysis
This section provides additional experiments and analysis. We first present results on image classification and then compare the generalized variants of HyperProto. We also experiment on the cross-dataset setting and analyze the impact of training data volume on model performance and instance-level representations.
Table 3 shows the results on miniImageNet few-shot classification under two settings. HyperProto substantially outperforms the primary baselines in most settings, displaying its ability to model the class distribution of images. We observe that, compared to NLP, image classification results are more stable for both vanilla prototypes and hypersphere prototypes. This observation may reflect the difference in encoding between the two modalities: token representations in BERT are contextualized and change across different contexts, whereas the image representations produced by deep CNNs aim to capture global and local features thoroughly. Under the 5-way 5-shot setting, the improvements of HyperProto are significant. The effectiveness of our method is also demonstrated by comparisons with other previous few-shot learning methods using the same backbones. In particular, HyperProto yields the best results of all the compared methods with the WideResNet (Zagoruyko and Komodakis, 2016) backbone, suggesting that the expressive capability of hypersphere prototypes can be enhanced with a more powerful encoder. Compared to the 5-shot setting, our model improves only modestly in the 1-shot setting with the ConvNet and ResNet-12 (He et al., 2015) backbones. This phenomenon is consistent with the intuition that more examples are more favorable to the learning of the radius.
We further analyze the dynamics of the radius of our method in Appendix 4.3.
To further compare the variants of our approach, we also conduct experiments for the cone-like and Gaussian HyperProto with WideResNet-28-10 on miniImageNet. Table 4 presents results across three measurement settings. Although the two variants do not perform better than our main method, they still considerably outperform many baselines in Table 3. While the three models' performance is close under the 1-shot setting, the Cone HyperProto model performs worse in the 5-shot setting. This could be attributed to unsatisfactory radius learning: we find that the Cone HyperProto model is sensitive to the radius learning rate and prone to overfitting.
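Under the assumption that the cone-like variant (Figure 6) replaces the Euclidean radius with an angular one, a hypothetical sketch of its scoring rule is below; the released implementation may parameterize the cone differently, so treat every name and formula here as illustrative:

```python
import torch
import torch.nn.functional as F

def cone_logits(queries, centers, angles):
    # Angle between each query and each class direction, then the
    # (negative) angular distance to the assumed cone surface.
    q = F.normalize(queries, dim=-1)
    c = F.normalize(centers, dim=-1)
    cos = (q @ c.t()).clamp(-1 + 1e-6, 1 - 1e-6)  # (Q, N) cosine similarities
    theta = torch.acos(cos)                        # (Q, N) angles in radians
    return -(theta - angles.unsqueeze(0)).abs()

queries, centers = torch.randn(8, 64), torch.randn(5, 64)
angles = torch.full((5,), 0.5)  # hypothetical angular "radius" per class
pred = cone_logits(queries, centers, angles).argmax(dim=-1)
```

Because the score depends on `theta` only through its gap to the learned angle, a poorly tuned learning rate on `angles` directly distorts every decision boundary, which is consistent with the sensitivity noted above.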

C.3 Cross-dataset Few-shot Learning
We also conduct experiments in the more difficult cross-dataset setting. Specifically, the model trained on miniImageNet is tested on the CUB dataset (Welinder et al., 2010) under the 5-way 5-shot setting. We use ResNet-12 (RN-12) (He et al., 2015) as the backbone in this experiment. Table 5 shows the results compared with several baselines. HyperProto outperforms the baselines by a large margin even with a less powerful encoder (RN-12), indicating its ability to learn representations that are transferable to new domains. The results also echo the performance of HyperProto on cross-domain relation extraction in Table 2.

C.4 Impact of Number of Shots
We conduct additional experiments on the FEW-NERD (INTRA) 5-way setting with 10, 15, and 20 shots. Since NNShot becomes too memory-intensive to run when the shot number reaches 15, we report results for Proto and HyperProto only. Figure 7 shows that both models perform better when more data are available, while HyperProto consistently outperforms vanilla prototypes.

D Discussion
This section discusses related prototype-based methods in detail, and the broader impact of our work.

D.1 Other Prototype-enhanced Methods
In this section, we discuss the differences between hypersphere prototypes and four prototype-enhanced methods in few-shot learning: infinite mixture prototypes (Allen et al., 2019), CovaMNet (Li et al., 2019a), Variational Few-Shot Learning (Zhang et al., 2019), and the Two-Stage approach (Das and Lee, 2020). Infinite mixture prototypes (Allen et al., 2019) model each class as an indefinite number of clusters, and the prediction is obtained by computing and comparing the distance to the nearest cluster in each class. This method is an intermediate model between prototypes and the nearest-neighbor model, whereas hypersphere prototypes alleviate the over-generalization problem of vanilla prototypes with a single additional parameter that turns single-point modeling into a hypersphere. The essentially prototype-based nature of hypersphere prototypes allows faster computation and easier training.
CovaMNet (Li et al., 2019a) calculates a local variance for each class based on support samples and conducts metric-based learning via a covariance metric, which essentially evaluates the cosine similarity between query samples and the eigenvectors of the local covariance matrix. To ensure the non-singularity of the covariance matrix, the feature of each sample is represented by a matrix generated by a number of local descriptors, each extracting a feature vector. Both methods attempt to model more variance on top of vanilla prototypes, but the idea of hypersphere prototypes is more straightforward and requires fewer parameters. Moreover, the multi-channel features adopted by CovaMNet are less natural for NLP tasks.
Variational Few-Shot Learning (Zhang et al., 2019) tackles the few-shot learning problem under a Bayesian framework. To improve single-point-based estimation, a class-specific latent variable representing the class distribution is introduced and assumed to be Gaussian. The method parameterizes the mean and variance of the latent variable distribution with neural networks that take the feature of a single instance as input.
The learning and inference processes are both conducted at the latent-variable level. The method adopts variational inference and is built on modeling the distribution as a latent variable, where the metric calculation relies heavily on the Gaussian assumption. Hypersphere prototypes, on the other hand, model the distribution with a center vector and a radius parameter in the actual embedding space, which is more tangible and easier to compute. It is worth noting that this work also points out that a single embedding is insufficient to represent a class, and samples the prototype from a high-dimensional distribution. This is similar to our starting point; the difference is that our approach considers the problem from a geometric point of view in the original embedding space, and shows that such simple geometric modeling can be very effective in few-shot scenarios.
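To make the contrast concrete, the two scoring rules can be sketched as follows. The hypersphere form |‖f(x) − c‖₂ − r| is our assumed reading of "distance to the surface", and the Gaussian term is the standard isotropic negative log-likelihood; the exact expressions in either paper may differ:

```latex
% Hypersphere prototype for class k: distance from the embedding f(x)
% to the surface of the hypersphere with center c_k and radius r_k
d_k(x) = \bigl|\, \lVert f(x) - c_k \rVert_2 - r_k \,\bigr|

% Variational FSL (Gaussian assumption, mean \mu_k, variance \sigma_k^2 I,
% embedding dimension D): negative log-likelihood used as the metric
-\log p\bigl(f(x) \mid k\bigr)
  = \frac{\lVert f(x) - \mu_k \rVert_2^2}{2\sigma_k^2}
  + \frac{D}{2}\log\bigl(2\pi\sigma_k^2\bigr)
```

The hypersphere score involves one norm and one subtraction per class, with no normalization constant, which is what makes it cheaper and free of distributional assumptions.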
The Two-Stage approach first trains a feature encoder and a variance estimator on the training data in an episodic manner with extracted absolute and relative features. In the second stage, the training data are split into "novel" classes and base classes, and novel-class prototypes are learned from both sample means and base-class features. The classification is carried out with the integrated prototypes. This method improves on vanilla prototypes by extracting more features and combining information from base classes, but still follows single-point-based metric learning. Our approach extends a single point to a hypersphere in the embedding space and therefore better captures within-class variance.

D.2 Broader Impact
Our work focuses on few-shot learning, which enables machine learning systems to learn with few examples and could be applied to many downstream applications. The technique itself does not have a direct negative impact, i.e., its impact stems primarily from the intent of the user, though there may be potential pitfalls when the method is applied to malicious applications.

E K∼2K Sampling for Few-NERD
In the sequence labeling task FEW-NERD, the sampling strategy is slightly different from that of other classification tasks. In named entity recognition, each token in a sequence must be labeled according to whether it is part of a named entity, and the context is crucial for classifying each entity, so examples are sampled at the sequence level. Under this circumstance, it is difficult to perform exact N-way K-shot sampling. Ding et al. (2021c) propose a greedy algorithm to conduct N-way K∼2K-shot sampling for the FEW-NERD dataset. We follow the strategy of the original paper (Ding et al., 2021c) and report it in Algorithm 2.
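A hedged sketch of such a greedy K∼2K sampler is shown below; the data format and function names are illustrative, and the acceptance rule follows our reading of the description above rather than the released code:

```python
import random

def greedy_sample(sentences, target_classes, K):
    """Sketch of N-way K~2K-shot greedy sampling (Ding et al., 2021c).

    `sentences` is a list of (tokens, entity_types) pairs. Sentences are
    added while any target class has fewer than K mentions, and a sentence
    is kept only if it does not push any class count above 2K.
    """
    counts = {c: 0 for c in target_classes}
    support = []
    pool = sentences[:]
    random.shuffle(pool)
    for tokens, types in pool:
        types_in = [t for t in types if t in target_classes]
        if not types_in:
            continue
        # skip if adding this sentence would exceed the 2K cap for any class
        if any(counts[t] + types_in.count(t) > 2 * K for t in set(types_in)):
            continue
        support.append((tokens, types))
        for t in types_in:
            counts[t] += 1
        # stop once every target class has at least K mentions
        if all(counts[c] >= K for c in target_classes):
            break
    return support, counts
```

Because a sentence can contain several entities of different types, the per-class counts land anywhere between K and 2K, which is exactly the K∼2K guarantee.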

Figure 1: The illustration of our proposed HyperProto, where the data are sampled in a 5-shot setting. The star symbol denotes the center of the hypersphere, the solid triangles denote the sampled examples, and the dotted triangles denote other examples in the whole dataset. The solid green line denotes the distance from a data embedding to the hypersphere's surface. The left part illustrates the initialization stage, where the sampled data estimate the center and radius; the right part illustrates the learning stage, where the center and radius are optimized simultaneously.

Figure 2: The illustration depicts the radius change according to the degree of sparsity of the sampled episode. Each subfigure represents a selected anchor class in FewRel. The horizontal axis represents the number of training steps.

Figure 3: t-SNE visualization of feature distributions. The six subfigures, from left to right, are the representations of seen data (in the training set) before training, produced by ProtoNet, and produced by HyperProto; and of novel data (in the test set) before training, produced by ProtoNet, and produced by HyperProto. Note that even after training, the neural network has never seen the novel data or their classes.

Figure 5: Representation similarity matrices produced by HyperProto on FEW-NERD, FewRel, and miniImageNet. Each row illustrates 20 classes and 100 instances in one dataset. Each subfigure contains 5 classes and 25 instances. Each unit denotes the cosine similarity of two embeddings, and each 5×5 cell indicates the comparison of two classes. The units on the diagonal represent the same instance, and the 5×5 cells on the diagonal represent the same class. Warmer colors mean higher similarity in this illustration.

Figure 6: Two variants according to different measurements. The left is the cone-like modeling with cosine similarities, and the right is the Gaussian modeling from the probability perspective.

Figure 7: Impact of shot number on model performance for the FEW-NERD (INTRA) 5-way setting.
Given a large-scale annotated training set D_train, our goal is to learn a model that can make accurate predictions for a set of new classes D_test, containing only a few labeled examples for training. The model is trained on episodes constructed from D_train and tested on episodes based on D_test. Each episode contains a support set S = {x_i, y_i}_{i=1}^{N×K} for learning, with N classes and K examples per class, and a query set Q for inference.

Algorithm 1: Training process. f_φ is the feature encoder, N_total is the total number of classes in the training set, N is the number of classes for the support and query sets, K is the number of examples per class in the support set, K′ is the number of examples per class in the query set, and M is a hyper-parameter. RANDOMSAMPLE(S, K) denotes a set of K elements chosen uniformly at random from set S, without replacement. λ_f and λ_ε are separate learning rates. Input: Training data D_train = {(x_1, y_1), ..., (x_T, y_T)}, y_i ∈ {1, ..., N_total}. D_k denotes the subset of D containing all elements (x_i, y_i) such that y_i = k.
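The episodic construction described above (RANDOMSAMPLE over classes, then over examples) can be sketched as follows, with illustrative names:

```python
import random

def sample_episode(dataset, N, K, K_query):
    """Sketch of episode construction as in Algorithm 1 (names illustrative).

    `dataset` maps class label -> list of examples. Returns a support set of
    N classes x K examples and a query set of K_query examples per class.
    """
    classes = random.sample(sorted(dataset), N)  # RANDOMSAMPLE over classes
    support, query = [], []
    for label, cls in enumerate(classes):
        # sample K + K' examples per class, without replacement
        examples = random.sample(dataset[cls], K + K_query)
        support += [(x, label) for x in examples[:K]]
        query += [(x, label) for x in examples[K:]]
    return support, query
```

Sampling support and query examples together without replacement guarantees the two sets are disjoint within an episode.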

Table 1: Performance (F1 score) on FEW-NERD. The standard deviation is reported over 3 runs with different random seeds for each model. Results with † are reported in Ding et al. (2021c), and other baseline results are from the original papers. Results with ♠ indicate that the approach involves a task-specific pre-trained encoder. Best results are in bold and second-best results are underlined.

Table 2: Results on FewRel 1.0 and FewRel 2.0 under 4 different settings. The standard deviation is reported over 3 runs with different random seeds for each model. Results with † are reported in Gao et al. (2019b) and Han et al. (2018). Results with ‡ are obtained by re-running the original code. Other baseline results are from the original papers. Results with ♠ indicate that the approach involves a task-specific pre-trained encoder. Results with ♣ indicate that the approach involves additional resources such as knowledge graphs and relation descriptions. Best results are in bold and second-best results are underlined.

Table 5: Results on cross-dataset classification.