Refinement Matters: Textual Description Needs to be Refined for Zero-shot Learning



Introduction
Humans can recognize samples from unseen classes by leveraging the visual information of seen categories and textual descriptions of seen and unseen classes (Larochelle et al., 2008; Palatucci et al., 2009; Lampert et al., 2009). Zero-Shot Learning (ZSL), inspired by this recognition ability of humans, learns unseen classes through a textual description (also referred to as side information, class attribute vector, or semantic information) (Xian et al., 2017). A typical ZSL algorithm does not need training samples from unseen classes. However, it requires the class description for both seen and unseen classes (Zhang and Saligrama, 2015; Reed et al., 2016).

* These authors contributed equally to this work.
1 The source code is available at https://github.com/Sethup123/CARNet

Figure 1: The attributes "small bird" and "long black beak" (words in bold) are common to all three species of hummingbird, whereas attributes like "ruby red throat" or "orange bird" (words in red) distinguish one species from another. Hence, the distinguishing attributes must be given more weight than the common attributes in the class attribute vector. We achieve this through the gating unit in the attribute refinement network.
Generative models have recently been the most popular approach for ZSL. They use generators like VAEs (Mishra et al., 2018; Schonfeld et al., 2019) or GANs (Narayan et al., 2020; Vyas et al., 2020) to synthesize samples for unseen classes from the class attribute vector. However, despite their promising results, such methods are not very efficient for the following reasons: (i) they require knowledge of the number of unseen classes and the respective attribute vectors during training, which is not always feasible, and (ii) the classifier must be retrained with seen and unseen samples for each new set of unseen classes. On the other hand, non-generative approaches for ZSL alleviate the above problems but show inferior accuracies. Typically, non-generative models learn a mapping in one of three ways: (i) visual to attribute space (Xian et al., 2016), (ii) attribute to visual space (Zhang et al., 2017; Li et al., 2019), or (iii) a joint embedding of attribute and visual space (Cacheux et al., 2019). It is to be noted that most existing non-generative or embedding-based ZSL approaches learn an embedding from visual to attribute space. They assume that the seen and unseen classes share the same representational characteristics and are linked in the attribute space (Frome et al., 2013; Wang et al., 2019; Chen et al., 2021a). However, this approach leads to the well-known hubness problem (Dinu et al., 2014), where the representations are skewed towards the seen classes (Zhang et al., 2017; Li et al., 2019). Another problem with this approach is that it implicitly loses the discriminative power of visual features that are generally extracted from a powerful pre-trained deep learning model (like ResNet (Xian et al., 2016, 2018a) or GoogleNet (Song et al., 2018)) but are then mapped to a different, smaller attribute space (Li et al., 2018, 2019).
Although the aforementioned issues of non-generative models are mitigated by mapping attributes to visual features (Zhang et al., 2017; Li et al., 2019; Skorokhodov and Elhoseiny, 2021), such methods have lower classification accuracies. In this work, we propose a non-generative method with an Attribute-Refinement Network (ARN) that leverages the gating mechanism. The ARN enables a highly robust representation of the description/attribute vector for the seen and unseen classes. In recent years, the gating mechanism has shown good performance without any complex architecture in supervised learning tasks (Srivastava et al., 2015; Dauphin et al., 2017; Sandler et al., 2018; Wu et al., 2018; Liu et al., 2021). In this paper, we propose a gating mechanism for refining the textual description in the ZSL task. The ARN learns to refine the attributes in a self-weighting manner from the seen class attributes (Fig. 1). These refined attributes are mapped onto the visual space using an attribute embedder (AE) to obtain the class prototype vector of each class. The class prototype vectors are then combined with the visual features in the feature-prototype combiner (FPC) to obtain classifications. The ARN, AE, and FPC are trained jointly using the circle loss and the standard softmax cross-entropy loss in such a way that training minimizes inter-class similarity and maximizes intra-class similarity. The circle loss achieves better within-class compactness and between-class discrepancy than the triplet loss and adaptive margin softmax loss, as it unifies the classification and pair-wise similarity objectives (Sun et al., 2020). We refer to our approach as the Circle loss guided gating-based Attribute-Refinement Network (CARNet).
We evaluate the performance of CARNet in three scenarios: (a) conventional Zero-Shot Learning (ZSL), where only the unseen classes are available during inference; (b) Generalized Zero-Shot Learning (GZSL), where both the seen and unseen classes are available during inference; and (c) Continual Generalized Zero-Shot Learning (CGZSL), where data arrives as a sequence of tasks and only the current task's data is available during training, with the challenge of handling catastrophic forgetting of past tasks. The performance of CARNet for conventional ZSL and GZSL is evaluated on five standard datasets. The CGZSL method is evaluated on the challenging CUB and SUN datasets. Extensive experiments show that CARNet outperforms recent generative models (unlike those models, we do not require the unseen class descriptions during training) and non-generative models by a significant margin. Our ablation study emphasizes the significance of each component of the proposed learning algorithm. The main contributions of our work are summarized as follows: (i) we propose a gating-based attribute-refinement network (ARN) to enhance the class description/attributes for zero-shot learning; (ii) the ARN and AE are guided by the circle loss to achieve better within-class compactness and between-class discrepancy; (iii) we propose a highly competitive, simple, and fast non-generative method; our model achieves a ∼70× speedup compared to generative ZSL methods.

Related Work
The proposed CARNet is evaluated in three kinds of ZSL settings: conventional ZSL, GZSL, and CGZSL. We provide a brief survey of all three settings. ZSL aims to construct a recognition model for samples from unseen classes using the textual description (i.e., attribute information) of the unseen classes. This attribute information can be obtained in various ways, such as human-annotated attributes (Farhadi et al., 2009), textual descriptions (Reed et al., 2016), and word vectors (Socher et al., 2013; Frome et al., 2013). In recent years, there has been a surge of interest in this area. The ZSL literature can be broadly categorized into two parts: generative and non-generative (i.e., embedding-based) approaches.
The first popular category is the generative approach, which solves the ZSL problem by synthesizing unseen class samples. To synthesize samples from the unseen classes, models leverage powerful generative models like the conditional variational autoencoder (VAE) (Mishra et al., 2018; Kumar Verma et al., 2018), the generative adversarial network (GAN) (Vyas et al., 2020; Xian et al., 2018b; Felix et al., 2018; Keshari et al., 2020; Verma et al., 2020), or a combination of VAE and GAN (Xian et al., 2019; Narayan et al., 2020).
Another popular category is the non-generative approach, which does not need the class attribute information of unseen classes during training. In early ZSL work (Lampert et al., 2009; Farhadi et al., 2009; Lampert et al., 2013), models directly predict attribute confidences from images. Methods based on this approach can be further divided into three groups. In the first group, visual features are projected into the attribute (i.e., semantic) space (Lampert et al., 2013; Socher et al., 2013; Frome et al., 2013; Akata et al., 2016; Fu and Sigal, 2016). In the second group, both visual and attribute data are projected into an intermediate space (Akata et al., 2015; Fu et al., 2014; Lei Ba et al., 2015; Romera-Paredes and Torr, 2015; Cacheux et al., 2019). In the third group, the visual space is spanned by an attribute-to-visual mapping (Zhang et al., 2017; Li et al., 2019; Skorokhodov and Elhoseiny, 2021). ZSL methods that project from the attribute space to the visual space are more suitable for mitigating the hubness problem, and recent works (Zhang et al., 2017; Li et al., 2019; Skorokhodov and Elhoseiny, 2021) show promising results in the ZSL and GZSL settings. Surprisingly, despite being fast, accurate, and realistic, this approach has not been explored much in the past. In this work, we consider the non-generative model for further exploration and learn the mapping from the attribute space to the visual space, similar to Li et al. (2019) and Skorokhodov and Elhoseiny (2021).
The above-discussed ZSL methods can handle data only in an offline setting and cannot be used with a streaming sequence of tasks (Delange et al., 2021), a setting known as Continual GZSL (CGZSL). Only a handful of works are available for CGZSL (Chaudhry et al., 2019a; Wei et al., 2020; Skorokhodov and Elhoseiny, 2021; Gautam et al., 2020, 2021a,b). For an extensive evaluation, apart from conventional ZSL and GZSL, CARNet is also evaluated in the CGZSL setting proposed by Skorokhodov and Elhoseiny (2021).

Problem Definition
In this section, we define the problem formally and introduce the notation. The objective of ZSL is to learn a model that can generalize to novel classes (i.e., unseen classes) with the help of side information (attributes/descriptions), without training data for the novel classes. The attribute vector of each class is constructed either from a word embedding vector generated by a language model or by manually defining key features like color, size, shape, pattern, etc. Primarily, the ZSL setting consists of two sets of classes, known as seen and unseen classes. Let D^s_tr and D_ts be the training and testing data, respectively, for the C_s seen and C_u unseen classes. We also have the sets of seen ({C_s}) and unseen ({C_u}) classes, where {C_s} ∩ {C_u} = ∅, i.e., the seen and unseen class sets are disjoint. It is to be noted that {C_s} and {C_u} denote the sets of seen and unseen classes, while C_s and C_u denote the numbers of seen and unseen classes, respectively. Corresponding to each seen class i (i ∈ {C_s}) and unseen class j (j ∈ {C_u}), there is a d-dimensional class attribute vector, i.e., A^s_i ∈ R^d and A^u_j ∈ R^d, respectively. In ZSL training, the data is represented as D^s_tr = {(x_i, y_i)}^N_{i=1} with attribute set A = A^s, where N is the number of seen class images and (x_i, y_i) is an image-label pair. During inference for conventional ZSL, we have D_ts = {x_j}^M_{j=1} with attribute set A = A^u, where every x_j belongs to an unseen class. However, in GZSL, we have D_ts = {x_j}^M_{j=1} with attribute set A = A^s ∪ A^u, where each x_j belongs to either a seen or an unseen class. Here, A^s and A^u are the seen and unseen class attribute information, respectively. Overall, our objective is to develop a model based on the training dataset D^s_tr (i.e., seen data) that generalizes over all class labels {C}, where {C} = {C_s} ∪ {C_u} and the total number of classes in {C} is C.
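As a minimal illustration of the notation above, the following sketch builds toy versions of the data structures involved; all shapes, class counts, and random values are illustrative placeholders, not the paper's dataset sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

C_s, C_u, d, f = 5, 3, 8, 16                  # seen/unseen class counts, attribute dim d, visual dim f
seen_classes = set(range(C_s))                # the set {C_s}
unseen_classes = set(range(C_s, C_s + C_u))   # the set {C_u}, disjoint from the seen set
assert seen_classes.isdisjoint(unseen_classes)

A_s = rng.standard_normal((C_s, d))   # seen class attribute matrix, one d-dim row per class
A_u = rng.standard_normal((C_u, d))   # unseen class attribute matrix (used only at inference)

N = 20                                # number of seen-class training images
X = rng.standard_normal((N, f))       # extracted visual features of the training images
y = rng.integers(0, C_s, size=N)      # training labels, restricted to seen classes
```

Only `A_s`, `X`, and `y` would be visible during training; `A_u` enters only when the head is extended to unseen classes at inference time.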
The ARN, the AE, and the FPC are trained end-to-end on the sum of two losses, namely the circle loss and the softmax cross-entropy loss. These two losses guide the gating unit to yield refined attributes, which lead to better class prototype vectors through the AE. In this section, we present a detailed description of our model, CARNet.

Gating-based Attribute Refinement Network (ARN)

The class attribute vector plays a crucial role in ZSL, as there are no visual samples for unseen classes during training. Moreover, the attribute vector is the only information that is available for both the seen and unseen classes. Therefore, it is highly important that the attribute representation has minimal noise and highlights its prime dimensions. The objective of the ARN is to obtain an accurately representative class attribute vector with high weight on its key dimensions, as shown in Fig. 1. Let A^s ∈ R^{C_s×d} be the class attribute matrix for the C_s seen classes, where each row is the d-dimensional class attribute vector of the corresponding class. The ARN consists of the following stack of operations. We first normalize the input A^s across the dimension d for each class independently using layer norm (Ba et al., 2016):

A^s_L = LayerNorm(A^s)

On top of the layer norm, we perform a linear projection followed by the Gaussian error linear unit (GELU) (Hendrycks and Gimpel, 2016) activation function:

A^s_P = GELU(A^s_L W_1)

Here, the linear projection expands the dimension of the class attribute vector: W_1 ∈ R^{d×h} denotes the weight of the linear projection and A^s_P ∈ R^{C_s×h}. Further, we apply the gating unit, which helps achieve a better representation of the attribute information. In the ARN, the gating unit performs cross-feature learning on the higher-dimensional class attribute matrix A^s_P. For this purpose, we split A^s_P into two halves, i.e., A^s_{P1} ∈ R^{C_s×h/2} and A^s_{P2} ∈ R^{C_s×h/2}. Both halves are processed as follows:

A^s_G = A^s_{P1} ⊙ Conv1D(LayerNorm(A^s_{P2}))    (1)

where Conv1D denotes a 1-D convolution, which enables the spatial projection, A^s_G ∈ R^{C_s×h/2}, and ⊙ denotes the Hadamard product (i.e., element-wise multiplication), which acts as a linear gate. We can also interpret this operation as self-weighting each dimension of the attribute vector: the important dimensions receive high weights while the others receive lower weights. The Hadamard product enables the refinement network to keep both pieces of information (i.e., the raw A^s_{P1} and the spatial projection of A^s_{P2}) in the output of the gating unit, A^s_G. During training, the Conv1D in Eq. (1) is initialized as an identity matrix. Finally, the output of the ARN is obtained through residual learning:

A^s_R = A^s ⊕ (A^s_G W_2)    (2)

where W_2 ∈ R^{h/2×d} denotes the weight of the final linear projection, ⊕ denotes element-wise addition, and A^s_R ∈ R^{C_s×d} is the final refined class attribute matrix. The linear projection gives A^s_G the same dimension as A^s.

Overall, the refinement network stacks the above operations one after the other, as shown in Fig. 2, and this set of operations can be repeated multiple times for better attribute refinement. Repetition can cause the vanishing gradient problem that is common in typical gating units; however, the residual learning in Eq. (2) helps alleviate this issue.
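The stack of operations above can be sketched as a single NumPy forward pass. This is a minimal illustration, not the paper's implementation: the expansion width h, the random weight scales, and the replacement of the 1-D convolution by a square spatial projection matrix (identity-initialized, as the text describes) are assumptions made for the sketch.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row (class attribute vector) to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of the Gaussian error linear unit
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def arn_forward(A_s, W1, W_spatial, W2):
    """One refinement pass in the spirit of Eqs. (1)-(2)."""
    A_L = layer_norm(A_s)                           # normalize each attribute vector
    A_P = gelu(A_L @ W1)                            # expand d -> h
    h = A_P.shape[-1]
    A_P1, A_P2 = A_P[:, : h // 2], A_P[:, h // 2:]  # split into two halves
    A_G = A_P1 * (layer_norm(A_P2) @ W_spatial)     # gating via Hadamard product
    return A_s + A_G @ W2                           # residual connection back to d dims

rng = np.random.default_rng(0)
C_s, d, h = 4, 6, 12
A_s = rng.standard_normal((C_s, d))
W1 = rng.standard_normal((d, h)) * 0.1
W_spatial = np.eye(h // 2)          # identity initialization, as described in the text
W2 = rng.standard_normal((h // 2, d)) * 0.1
A_R = arn_forward(A_s, W1, W_spatial, W2)
assert A_R.shape == A_s.shape       # refined attributes keep the original dimension d
```

Note how the residual connection guarantees that with a zero output projection the refined attributes reduce exactly to the input attributes.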

Attribute Embedder (AE)
After obtaining the refined attributes for the seen classes, we perform the attribute-to-visual mapping using the AE (as shown in Fig. 2) to obtain the class prototype matrix P^s ∈ R^{C_s×f} for the C_s seen classes, where each row is the class prototype vector p^s_c of the respective seen class c ∈ {C_s}:

P^s = AE(A^s_R)    (3)

Overall, the AE is a simple 3-layer multi-layer perceptron (MLP) used to perform the attribute-to-visual mapping. Here, f denotes the dimension of the visual feature vector.
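A minimal sketch of such an attribute embedder follows; the text specifies only that the AE is a 3-layer MLP from attribute space (d dims) to visual space (f dims), so the hidden width and the ReLU activations here are illustrative assumptions:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def attribute_embedder(A_R, W1, W2, W3):
    """3-layer MLP mapping refined attributes (d dims) to visual space (f dims).
    The final layer is linear so prototypes can take any real value."""
    return relu(relu(A_R @ W1) @ W2) @ W3

rng = np.random.default_rng(3)
C_s, d, f, h = 5, 8, 16, 32
A_R = rng.standard_normal((C_s, d))              # refined seen-class attributes
W1, W2, W3 = (rng.standard_normal(s) * 0.1 for s in [(d, h), (h, h), (h, f)])
P_s = attribute_embedder(A_R, W1, W2, W3)        # one prototype row per seen class
assert P_s.shape == (C_s, f)
```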

Feature-Prototype Combiner (FPC)
The visual features V^s_tr ∈ R^{N×f} are extracted by passing the images of the seen classes (D^s_tr) through a pretrained ResNet-101 model (no fine-tuning). These visual features are combined with the class prototype vectors in the FPC through a scaled cosine similarity between P^s and V^s_tr (Skorokhodov and Elhoseiny, 2021). The scaled cosine similarity (scos) scales and normalizes the class prototype vectors and the extracted visual features before computing the dot product between them:

scos(v^s_tr, p^s_c) = β · ⟨ v^s_tr / ∥v^s_tr∥ , p^s_c / ∥p^s_c∥ ⟩    (4)

where v^s_tr ∈ V^s_tr is the f-dimensional extracted visual feature of a sample, p^s_c is the class prototype vector of class c ∈ {C_s}, and β is a scaling hyperparameter, which has the same effect as setting a high temperature of β² in the softmax (Liu et al., 2018). Here, the normalization reduces the variance of the class prototype vectors and the visual features, which helps achieve better performance.
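The scaled cosine similarity can be sketched directly; the value of β below is a placeholder, not the paper's tuned hyperparameter:

```python
import numpy as np

def scos(v, P, beta=20.0):
    """Scaled cosine similarity between one visual feature v and each prototype
    row of P: both sides are L2-normalized before the dot product, then scaled."""
    v_n = v / np.linalg.norm(v)
    P_n = P / np.linalg.norm(P, axis=1, keepdims=True)
    return beta * (P_n @ v_n)

rng = np.random.default_rng(1)
P = rng.standard_normal((5, 16))   # one prototype per seen class
v = 3.0 * P[2]                     # a feature aligned with class 2, arbitrarily rescaled
scores = scos(v, P)
assert scores.argmax() == 2        # normalization makes the score scale-invariant
```

Because both vectors are normalized first, multiplying a feature by any positive constant leaves its scores unchanged, which is exactly the variance-reduction effect described above.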

Training of the CARNet using only Seen Classes
CARNet is trained by minimizing the circle loss and the softmax cross-entropy loss over the end-to-end network comprising the ARN and the AE. We present these loss functions and the learning algorithm of CARNet in this subsection. Without loss of generality, let v^s_tr ∈ V^s_tr be the extracted visual feature of a sample belonging to the seen class k ∈ {C_s}.
Circle Loss: Two kinds of losses are generally used in the literature: losses like L2-softmax, AM-softmax, and angular softmax are good candidates for classification, while losses like the triplet loss, N-pair loss, contrastive loss, and margin loss are good candidates for pair-wise similarity. The circle loss (Sun et al., 2020) unifies the classification and pair-wise similarity objectives, and is hence a good candidate for optimizing the ARN, AE, and FPC. Moreover, it enhances feature learning and separability through flexible optimization and a definite convergence target (Sun et al., 2020). The main objective of feature learning is to increase the within-class similarity s_p while reducing the between-class similarity s_n. The circle loss unifies class-level labels and pair-wise similarity with K within-class similarity scores (s_p) and L between-class similarity scores (s_n), and is defined as:

L_circle = log [ 1 + Σ^L_{j=1} exp(γ α^j_n (s^j_n − Δ_n)) · Σ^K_{i=1} exp(−γ α^i_p (s^i_p − Δ_p)) ]    (5)

where α^i_p = [1 + m − s^i_p]_+ and α^j_n = [s^j_n + m]_+ are weighting factors with α^j_n > 0 and α^i_p > 0, γ is a scaling factor, and [·]_+ denotes a cut-off at zero. Δ_n = m and Δ_p = 1 − m are the between-class and within-class margins, respectively.
In CARNet, we use the sample v^s_tr as the anchor, the corresponding class prototype vector p^s_k of class k as the positive, and the remaining seen class prototype vectors p^s_j (j ≠ k) as negatives. The cosine similarity determines the positive similarity s_p and the negative similarities s_n:

s_p = cos(v^s_tr, p^s_k)    (6)
s^j_n = cos(v^s_tr, p^s_j),  j ∈ {C_s} \ {k}    (7)

Hence, the circle loss in Eq. (5) is specialized in CARNet as:

L_circle = log [ 1 + Σ_{j≠k} exp(γ α^j_n (s^j_n − Δ_n)) · exp(−γ α_p (s_p − Δ_p)) ]    (8)

Softmax Cross-Entropy Loss: To improve classification, the softmax cross-entropy loss L_soft-ce is applied over the scaled cosine similarity computed in Eq. (4):

L_soft-ce = − log [ exp(scos(v^s_tr, p^s_k)) / Σ_{c∈{C_s}} exp(scos(v^s_tr, p^s_c)) ]    (9)

Thus, CARNet is trained by learning the weights of the ARN and the AE using the losses in Eq. (8) and Eq. (9):

L = L_circle + L_soft-ce    (10)

The loss in Eq. (10) is minimized during training. It is to be noted that only the seen class information (D^s_tr, {C_s}, A^s) is used during training.
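The circle loss can be sketched as follows; the margin m and scale γ values here are placeholder hyperparameters, not the paper's settings:

```python
import numpy as np

def circle_loss(s_p, s_n, m=0.25, gamma=32.0):
    """Circle loss (Sun et al., 2020) over arrays of within-class similarity
    scores s_p and between-class similarity scores s_n."""
    delta_p, delta_n = 1.0 - m, m                 # within/between-class margins
    alpha_p = np.clip(1.0 + m - s_p, 0.0, None)   # [1 + m - s_p]_+
    alpha_n = np.clip(s_n + m, 0.0, None)         # [s_n + m]_+
    logit_p = -gamma * alpha_p * (s_p - delta_p)
    logit_n = gamma * alpha_n * (s_n - delta_n)
    return np.log1p(np.exp(logit_n).sum() * np.exp(logit_p).sum())

# the loss shrinks as the positive (anchor-to-own-prototype) similarity grows
loose = circle_loss(np.array([0.4]), np.array([0.3, 0.2]))
tight = circle_loss(np.array([0.95]), np.array([0.3, 0.2]))
assert tight < loose
```

The self-paced weights α_p and α_n make the gradient focus on similarity scores that are still far from their optima, which is what gives the circle loss its "definite convergence target".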

Inference: Seen and Unseen Classes
The proposed CARNet follows a fixed-body, dynamic-head (classification layer) architecture. As the model is trained with only seen classes, the classification layer has neurons corresponding to the seen classes only, i.e., C_s neurons. We then simply modify the output head to enable it for the unseen classes ({C_u}) using their class attribute information (A^u), as per the following procedure:
1. Pass the unseen class attribute information A^u to the trained ARN and get the output A^u_R.
2. Pass A^u_R to the trained AE and get the unseen class prototype vectors P^u ∈ R^{C_u×f} for the C_u unseen classes.
3. Stack the unseen class prototype vectors (P^u) with the seen class prototype vectors (P^s) to obtain P = [P^s; P^u], where P ∈ R^{C×f}.
After computing P, we compute the scaled cosine similarity score scos(v_ts, p_i) as in Eq. (4), where p_i ∈ P is the class prototype vector of class i ∈ {C_s} ∪ {C_u}, and v_ts ∈ V^s_ts ∪ V^u_ts. Here, V^s_ts and V^u_ts are the visual features extracted with the pretrained ResNet-101 model for the test images of seen and unseen classes, respectively. Finally, we perform classification in the traditional way and choose the class with the highest scaled cosine similarity score.
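The inference procedure above can be sketched end to end; the prototypes are random stand-ins for AE(ARN(A)) outputs, and β is again a placeholder:

```python
import numpy as np

rng = np.random.default_rng(2)
C_s, C_u, f = 4, 2, 8

P_s = rng.standard_normal((C_s, f))   # seen prototypes, stand-ins for AE(ARN(A_s))
P_u = rng.standard_normal((C_u, f))   # unseen prototypes, stand-ins for AE(ARN(A_u))
P = np.vstack([P_s, P_u])             # step 3: stacked prototype matrix, (C_s + C_u) rows

def classify(v, P, beta=20.0):
    """Predict the class whose prototype has the highest scaled cosine similarity."""
    v_n = v / np.linalg.norm(v)
    P_n = P / np.linalg.norm(P, axis=1, keepdims=True)
    return int(np.argmax(beta * (P_n @ v_n)))

v_test = P_u[1] + 0.01 * rng.standard_normal(f)  # a feature near unseen class C_s + 1
assert classify(v_test, P) == C_s + 1            # GZSL: unseen classes are now reachable
```

Note that no retraining occurs: only the head (the prototype matrix) grows when unseen classes arrive, which is the source of the speed advantage over generative methods.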

Comparison with Baseline Methods
In this section, the performance of CARNet is evaluated against strong baseline models for the three ZSL settings. Results for the conventional ZSL and GZSL settings are provided in Table 1.
On the remaining datasets, the model outperforms all the generative models but shows competitive performance to IZF (Shen et al., 2020), which exploits the attribute space to achieve better performance than other generative ZSL methods. However, CARNet outperforms IZF on CUB, AWA1, and AWA2 by significant margins of 2.8%, 2.4%, and 3.7%, respectively, and yields a similar result on the aPY dataset (a difference of only 0.1%). In addition to the performance gains, the proposed CARNet is characteristically advantageous over generative ZSL approaches: while CARNet uses only the attribute vectors of the seen classes during training, generative ZSL methods use the attribute vectors of both seen and unseen classes during training, which is not a realistic scenario in a dynamic environment. Table 3 presents the computational time required to train the various ZSL methods. It can be observed that CARNet is at least 68×, 68×, 21×, and 31× faster than the generative methods on SUN, CUB, AWA1, and AWA2, respectively. This can be attributed to the fact that CARNet only needs to process class attribute vectors through the ARN and the AE.
Thus, the proposed CARNet is a desirable candidate for conventional and generalized ZSL, owing to its performance, data requirements, and computational speed.

Continual generalized zero-shot learning (CGZSL): While ZSL assumes the data for all tasks to be available a priori, data may arrive sequentially in the real world, and keeping all the data in memory is cumbersome. Hence, we further evaluate the performance of CARNet in the highly challenging CGZSL setting proposed by Skorokhodov and Elhoseiny (2021). This setting assumes that data arrives as a sequence of tasks and only the current task's data is available for training. Thus, after training on a sequence of tasks [1, . . ., t], all classes in tasks [1, . . ., t] are considered seen classes and classes from task (t + 1) onward are considered unseen classes. As experience replay-based methods generally outperform regularization-based methods in the literature (Delange et al., 2021), CARNet is equipped for CGZSL with an experience replay (ER) strategy (Chaudhry et al., 2019b) using class-balanced reservoir (CBR) sampling (Chrysakis and Moens, 2020). We measure the performance of a CGZSL method using SA, UA, and HM at each task. Further, we compute the means of SA, UA, and HM over all tasks, denoted mSA, mUA, and mHM (Skorokhodov and Elhoseiny, 2021). We present the CGZSL results in Table 2, along with the state-of-the-art CGZSL methods. Our method outperforms all existing methods by absolute gains of 7.1% and 0.4% on the CUB and SUN datasets, respectively. We also report the performance of CARNet with CBR sampling (CARNet-ER+CBR) and without CBR sampling (CARNet-ER) in Table 2.
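A minimal sketch of a class-balanced replay buffer follows, in the spirit of Chrysakis and Moens (2020); the capacity and the evict-from-largest-class policy are simplified placeholders, not the exact CBR algorithm:

```python
import random
from collections import defaultdict

class ClassBalancedBuffer:
    """Replay buffer that keeps roughly equal samples per class by evicting
    from the currently largest class once the buffer is full."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.per_class = defaultdict(list)
        self.size = 0

    def add(self, x, y):
        if self.size < self.capacity:
            self.per_class[y].append(x)
            self.size += 1
            return
        # buffer full: evict a random sample from the largest class
        largest = max(self.per_class, key=lambda c: len(self.per_class[c]))
        victims = self.per_class[largest]
        victims.pop(random.randrange(len(victims)))
        self.per_class[y].append(x)

random.seed(0)
buf = ClassBalancedBuffer(capacity=6)
for i in range(100):
    buf.add(i, i % 3)   # a stream cycling over 3 classes
assert sum(len(v) for v in buf.per_class.values()) == 6
```

The point of class balancing is that later tasks, which stream in many samples of new classes, cannot crowd earlier classes out of the replay memory.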

Ablation Study: Significance of Individual Components in CARNet
In this section, to emphasize the significance of individual components of CARNet, we perform an extensive ablation study over all the components and hyperparameters.
We study the effect of the individual components of CARNet, namely: (i) the ARN, (ii) the AE, (iii) the circle loss, (iv) the softmax cross-entropy loss, and (v) scaled cosine similarity versus the dot product. We present the results of this study in Table 4.

Conclusion
In this work, we developed a circle loss guided gating-based attribute-refinement network for handling ZSL, GZSL, and continual GZSL tasks. CARNet refines the attributes through a gating unit that improves the attribute representation by learning a self-weight for each attribute dimension in a projected space. These refined attributes improve the embedding, which helps to overcome the model's bias towards the seen classes. The whole model is guided by the circle loss along with the standard softmax cross-entropy loss, which maximizes inter-class separability and intra-class similarity. Also, unlike generative methods, CARNet does not require the attribute vectors of the unseen classes during training. The proposed method is quite fast, as the attribute refinement network and the attribute embedder need to process only the class attribute vectors during training. This work shows that a simple MLP-based architecture can outperform various computationally expensive ZSL methods. The approach remains to be explored with generative methods and in other applications of ZSL, such as zero-shot sketch-based image retrieval, action recognition, and natural language processing.

Limitations
One major limitation is that the inference data must come from the same domain: the proposed model cannot handle data from domains on which it was not trained. Another limitation is that the method requires the task id during training in the CGZSL setting, without which CARNet cannot be optimized properly. However, in realistic scenarios, the data need not arrive with well-defined task boundaries. Hence, the requirement of a task id during training is a drawback of our proposed model.

Figure 2: CARNet: circle loss guided gating-based attribute refinement network for ZSL. It primarily consists of three blocks: (i) ARN, (ii) AE, and (iii) FPC.

Table 2: Continual Generalized Zero-shot Learning Results

Table 3: (Skorokhodov and Elhoseiny, 2021) is a non-generative model; the remaining methods are generative models.

The softmax cross-entropy loss is imperative for the model, as the proposed CARNet has to perform classification; therefore, we keep it in all cases in the component analysis of Table 4. It is evident from the results that attribute refinement significantly boosts the performance of CARNet. Moreover, the scaled cosine similarity is another important component, helping the model outperform the other candidate, namely the dot product, by a large margin.

Table 4: Component Analysis