Fine-grained Category Discovery under Coarse-grained supervision with Hierarchical Weighted Self-contrastive Learning

Novel category discovery aims at adapting models trained on known categories to novel categories. Previous works only focus on the scenario where known and novel categories are of the same granularity. In this paper, we investigate a new practical scenario called Fine-grained Category Discovery under Coarse-grained supervision (FCDC). FCDC aims at discovering fine-grained categories with only coarse-grained labeled data, which can adapt models to categories of a granularity different from known ones and significantly reduce labeling costs. It is also a challenging task, since supervised training on coarse-grained categories tends to focus on inter-class distance (distance between coarse-grained classes) but ignores intra-class distance (distance between fine-grained sub-classes), which is essential for separating fine-grained categories. Considering that most current methods cannot transfer knowledge from the coarse-grained level to the fine-grained level, we propose a hierarchical weighted self-contrastive network that builds a novel weighted self-contrastive module and combines it with supervised learning in a hierarchical manner. Extensive experiments on public datasets show both the effectiveness and efficiency of our model over compared methods.


Introduction
Discovering novel categories based on some known categories has attracted much attention in both Natural Language Processing (Zhang et al., 2021; Zhao et al., 2021) and Computer Vision (Zhong et al., 2021; Han et al., 2019). Previous works assume that novel categories are of the same granularity (or at the same level of the class hierarchy) as known categories. However, in real-world scenarios, novel categories can be more fine-grained sub-categories of known ones (e.g., sports and tennis).

Figure 1: An example of the proposed FCDC task (fine-grained label names need to be assigned by experts).

A typical application of this scenario is when data analysts want to perform more fine-grained analysis on data with only coarse-grained annotations, where re-labeling fine-grained categories can be time-consuming and labour-intensive. For example, in the intent detection field, discovering more fine-grained user intents can help provide better services to customers, but labeling fine-grained intent categories is often much more difficult than labeling coarse-grained ones, since fine-grained annotation often requires higher expertise. To meet this requirement, we investigate a new scenario named Fine-grained Category Discovery under Coarse-grained supervision (FCDC). As shown in Figure 1, FCDC requires models to discover fine-grained categories (e.g., tennis and music) based only on coarse-grained (e.g., sports and arts) labeled data, which are easier and cheaper to obtain.
In addition to being in line with the above practical needs, FCDC is also a challenging task. Firstly, performing FCDC requires models to increase intra-class distance to ensure fine-grained separability with only coarse-grained supervision. However, coarse-grained classification only focuses on inter-class distance and does not care about intra-class distance (Bukchin et al., 2021), so samples with the same coarse-grained labels will be close to each other and hard to separate in the fine-grained feature space. Secondly, since fine-grained differentiation depends on correct coarse-grained classification, FCDC also requires models to control inter-class distance to ensure coarse-grained separability. Although increasing intra-class distance can contribute to the separability of fine-grained sub-classes, it will also decrease inter-class distance, which can result in overlapping between different coarse-grained classes and therefore lead to misclassification. So how to control and coordinate inter-class and intra-class distance to ensure both coarse-grained and fine-grained separability is the core challenge of FCDC.
To address the above challenges and transfer knowledge from the coarse-grained level to the fine-grained level, we propose a hierarchical weighted self-contrastive network. By performing different experiments on each layer of BERT, Jawahar et al. (2019) found that the bottom layers of BERT capture more surface features while the top layers capture more high-level semantic features, which means BERT can extract features of different granularities from shallow to deep (Xu et al., 2021). Inspired by this phenomenon, the core motivation of our model is to learn coarse-grained knowledge with the shallow layers of BERT and more fine-grained knowledge with the remaining deep layers, hierarchically. This motivation is not only consistent with the feature extraction process of BERT, but also corresponds to the shallow-to-deep learning process of humans. Specifically, we use the given coarse-grained labels to train the shallow layers of BERT to learn surface knowledge; then we propose a weighted self-contrastive module to train the deep layers of BERT to learn more fine-grained knowledge on top of the learned surface knowledge.
To ensure both coarse-grained and fine-grained separability, we further design the weighted self-contrastive module to better coordinate inter-class and intra-class distance in the fine-grained feature space. Specifically, given a query sample, we first propose a weighting strategy that weights different negative samples to control both inter-class and intra-class distance. Then we propose a self-contrastive strategy that generates positive samples to coordinate inter-class and intra-class distance and avoid overlapping between different coarse-grained classes. We further verify the effectiveness and efficiency of our model both theoretically (Section 3.2.4) and experimentally (Section 4.5).
The main contributions of our work can be summarized as threefold: • We propose to investigate a practical scenario called Fine-grained Category Discovery under Coarse-grained supervision (FCDC); we further propose a hierarchical model that learns fine-grained knowledge from shallow to deep to facilitate the FCDC task.
• To better coordinate inter-class and intra-class distance, we propose a novel weighted self-contrastive module to ensure both coarse-grained and fine-grained separability.
• Extensive experiments on public datasets show that our model significantly outperforms the best compared methods by a large margin and achieves twice the training efficiency of state-of-the-art contrastive learning methods.
Related Work

Contrastive Learning
Contrastive Learning (CL) aims at grouping similar samples closer and separating dissimilar samples far from each other in a self-supervised way (Jaiswal et al., 2021), and has gained popularity in both Natural Language Processing (NLP) (Wu et al., 2020) and Computer Vision (CV) (Chen et al., 2020). A critical point for CL is to build high-quality positive and negative samples. The simplest way to construct negative samples is to use other in-batch data as negatives (Chen et al., 2017). Further, He et al. (2020) built a dynamic queue with a momentum-updated encoder to keep the representations of negatives consistent. However, these methods consider all negatives equally important, which may lose discriminative information among different negatives. As for positive samples, in CV one common way is to take two different transformations of the same image as the query and positive sample (Dosovitskiy et al., 2014). In NLP, augmentation techniques such as word deletion (Meng et al., 2021), adversarial attack (Yan et al., 2021) and dropout (Gao et al., 2021) have been proposed to generate positives. Although some recent works (Bae et al., 2021) use outputs from different levels of a network as positives, our motivation is entirely different: they aim at providing more high-quality positives for representation learning, while we aim at better adjusting intra-class and inter-class distance.

Novel Category Discovery
As data volume increases, novel categories, especially novel fine-grained categories, may be introduced into datasets (Mekala et al., 2021). To discover novel categories without human annotation, most previous work adopted clustering and transfer learning methods to generate pseudo labels for unlabeled data to train their models (Zhan et al., 2020). For example, Zhang et al. (2021) proposed an alignment strategy to perform DeepCluster (Caron et al., 2018) to discover novel categories. Ge et al. (2020) proposed a mutual mean-teaching network to refine noisy pseudo labels for unsupervised person re-identification. Recently, two tasks similar to FCDC were proposed. Bukchin et al. (2021) proposed to perform fine-grained image classification under coarse-grained supervision with angular contrastive learning, but they performed this task in a few-shot learning way which needs extra fine-grained labels for each category. Mekala et al. (2021) proposed to perform fine-grained text classification with coarse-grained annotations, but they need an extra fine-grained label hierarchy and corresponding label names to assist in the task. These two tasks both rely on extra fine-grained knowledge from human annotations, which is usually unavailable when novel categories appear in real-world applications. Comparatively, our FCDC is a category discovery task which does not require fine-grained knowledge and is better adapted to real-world scenarios.

Problem Formulation
Denote by D_train the training set labeled only with coarse-grained categories. Different from previous tasks (Bukchin et al., 2021; Mekala et al., 2021), where the fine-grained label set Y_fine = {F_1, F_2, ..., F_K} is already known, FCDC assumes that we do not have any prior knowledge about fine-grained labels. So FCDC requires models to perform clustering (e.g., K-Means) to discover the fine-grained clusters Y_fine from D_train. Since clustering assigns each input a specific cluster, FCDC can also classify inputs into proper fine-grained categories {f_1, f_2, ..., f_N}. Although the number of fine-grained clusters K can be estimated with various methods from the clustering literature, we assume it is known in FCDC, following previous similar works (Zhang et al., 2021; Bukchin et al., 2021), to make a fair comparison.

Proposed Approach
As shown in Figure 2, our model mainly contains three components: BERT, a Dynamic Queue and a Momentum BERT. BERT is used to extract both coarse-grained and fine-grained features. The Dynamic Queue stores more negative samples, grouped by their coarse-grained labels following Bukchin et al. (2021). The Momentum BERT is used to update the representations of samples in the Dynamic Queue. Inspired by the shallow-to-deep learning process of humans and the ability of pre-trained models to extract features from coarse-grained to fine-grained (Jawahar et al., 2019; Xu et al., 2021), a core motivation of our model is to learn fine-grained knowledge in a progressive way. Specifically, our model learns coarse-grained knowledge in the shallow layers under coarse-grained supervision and more fine-grained knowledge in the deep layers with the proposed weighted self-contrastive learning.

Supervised Learning
We first perform supervised learning on Transformer layer L of BERT to learn coarse-grained knowledge. Given the i-th document D_i with its coarse-grained label c_i, we use all token embeddings from the L-th layer of BERT as its shallow features. Then we apply a mean-pooling layer to get its shallow feature representation h^L_i:

h^L_i = mean-pooling(BERT_L(D_i)),

where h^L_i ∈ R^h is the hidden state of the feature representation and h is the dimension of hidden representations. Then we perform supervised learning with the cross-entropy loss on coarse-grained labels to get the supervised loss L^L_sup at layer L:

z^L_i = W_a^T σ(h^L_i) + b_a,
L^L_sup = -(1/N) Σ_i log( exp((z^L_i)_{c_i}) / Σ_j exp((z^L_i)_j) ),

where z^L_i ∈ R^M is the output logits and M is the number of coarse-grained classes. σ is the Tanh activation function, W_a ∈ R^{h×M} and b_a ∈ R^M are learnable weights and bias terms, respectively. (z_i)_j is the j-th element of the output logits z_i.
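To make this step concrete, here is a minimal NumPy sketch of the mean-pooling and coarse-grained cross-entropy computation (a stand-in for real BERT hidden states with toy dimensions; the tanh-then-project layout follows our reading of the equations above and is an assumption, not the authors' exact code):

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_pool(token_embeddings):
    # token_embeddings: (seq_len, h) hidden states from BERT layer L
    return token_embeddings.mean(axis=0)

def coarse_ce_loss(h_L, W_a, b_a, coarse_label):
    # logits over M coarse-grained classes; sigma is Tanh as in the paper
    z = np.tanh(h_L) @ W_a + b_a              # (M,)
    z = z - z.max()                           # numerical stability
    log_probs = z - np.log(np.exp(z).sum())   # log-softmax
    return -log_probs[coarse_label]           # cross entropy for one sample

seq_len, h, M = 12, 8, 3                      # toy sizes (real h is 768)
tokens = rng.normal(size=(seq_len, h))        # fake layer-L token embeddings
W_a, b_a = rng.normal(size=(h, M)), np.zeros(M)
h_L = mean_pool(tokens)
loss = coarse_ce_loss(h_L, W_a, b_a, coarse_label=1)
```

In the full model this loss is back-propagated only through the first L Transformer layers, reserving the deeper layers for fine-grained learning.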

Weighted Self-contrastive Learning
As shown in Figure 3, denote the coarse-grained inter-class and intra-class distances by d_coarse and d_fine, respectively. Supervised learning on coarse-grained labels can ensure d_coarse ≫ 0 but will also make d_fine ≈ 0, which brings difficulties for fine-grained categorization. So how to increase d_fine to ensure the separability of fine-grained sub-classes is a severe challenge. Meanwhile, increasing d_fine without restraint will result in overlapping between different coarse-grained classes and therefore lead to misclassification. So how to constrain d_fine to ensure proper classification of coarse-grained classes is the other challenge. In summary, our total goal can be described as:

0 ≪ d_fine ≪ d_coarse and d_fine < d_boundary, (4)

where d_boundary is a threshold ensuring that samples fall into proper coarse-grained classes.
To achieve the above objectives, we propose a weighted self-contrastive module that introduces a novel generation strategy for positive samples and a weighting strategy for negative samples.

Negative Key Generation
Given the i-th document D_i, we use all token embeddings from the output layer of BERT as its deep features. Then we apply a mean-pooling layer to get its deep feature representation h^o_i ∈ R^h:

h^o_i = mean-pooling(BERT(D_i)).

In-batch negative keys: Given h^o_i with its coarse-grained label c_i as a query q, we treat the shallow and deep features of other in-batch samples as its in-batch negative keys k^in_−(i). In this way, we can increase the distance between different samples so that d_fine ≫ 0 and d_coarse ≫ 0. To satisfy d_coarse ≫ d_fine, we propose a weighting strategy that gives more weight to samples whose coarse-grained labels differ from that of the query q, to further increase their distance. So k^in_−(i) can be divided into two groups according to coarse-grained labels: k^same_−(i), containing negatives with the same coarse-grained label as the query, and k^diff_−(i), containing negatives with different coarse-grained labels.

Momentum negative keys: To provide more negative keys, we build a momentum BERT and a set of dynamic queues {Q_i}^M_{i=1} to store previous samples grouped by their coarse-grained labels following Bukchin et al. (2021), where M is the number of coarse-grained classes. Specifically, given h^o_i with its coarse-grained label c_i as a query, we treat samples from the queue Q_{c_i} as its momentum negative keys k^m_−(i). Feature representations of samples in the dynamic queues are extracted by the momentum BERT, whose parameters are updated in a momentum way (He et al., 2020). At the end of each iteration, the dynamic queues are updated by adding new samples and removing the earliest ones. Since samples in k^m_−(i) have the same coarse-grained labels as the query, they are much harder to separate and are beneficial to better representation learning.
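A compact sketch of the dynamic queues and the momentum update described above (the queue size, momentum factor, and class layout here are illustrative choices, not the paper's exact implementation):

```python
import numpy as np
from collections import deque

class MomentumQueues:
    """Per-coarse-class FIFO queues holding momentum-encoded features."""
    def __init__(self, num_coarse, queue_size, m=0.9):
        self.queues = [deque(maxlen=queue_size) for _ in range(num_coarse)]
        self.m = m  # momentum factor for the momentum encoder

    def momentum_update(self, enc_params, mom_params):
        # mom <- m * mom + (1 - m) * enc, applied parameter-wise
        return {k: self.m * mom_params[k] + (1 - self.m) * enc_params[k]
                for k in mom_params}

    def enqueue(self, feats, coarse_labels):
        # newest features in, oldest drop automatically (deque maxlen)
        for f, c in zip(feats, coarse_labels):
            self.queues[c].append(f)

    def negatives_for(self, coarse_label):
        # momentum negatives share the query's coarse label (hard negatives)
        return list(self.queues[coarse_label])

mq = MomentumQueues(num_coarse=3, queue_size=4)
for t in range(6):                      # 6 pushes into a size-4 queue
    mq.enqueue([np.full(8, float(t))], [0])
negs = mq.negatives_for(0)              # keeps only the 4 most recent entries
```

Because the queues are keyed by coarse-grained label, a query can draw its momentum negatives exclusively from its own coarse class, which is what makes them hard negatives.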
The overall negative keys for the query h^o_i are:

k_−(i) = k^same_−(i) ∪ k^diff_−(i) ∪ k^m_−(i).

Positive Key Generation
By weighting different negative samples, we can satisfy the condition 0 ≪ d_fine ≪ d_coarse. But increasing d_fine without restraint will violate the condition d_fine < d_boundary and make some samples fall into incorrect coarse-grained classes. To solve this problem, we propose a self-contrastive strategy that treats the shallow features of a query as its positive key. Specifically, given the deep feature representation h^o_i of document D_i as a query, we treat h^L_i as its positive key:

k_+(i) = h^L_i.

As shown in Figure 3, after supervised learning on coarse-grained labels at layer L, h^L_i is very close to the class center of c_i, so pulling h^o_i close to h^L_i will also pull h^o_i close to the class center of c_i. In this way, we can increase d_fine with restraint and satisfy the condition d_fine < d_boundary without computing the specific value of d_boundary. Another advantage of our self-contrastive strategy is that it achieves twice the training efficiency of traditional data-augmentation-based methods (Wu et al., 2020; Gao et al., 2021), since we only need to perform forward and backward propagation once to obtain and update both queries and positive keys (Section 5.2).
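The efficiency claim rests on the fact that a single forward pass through a deep encoder yields every layer's hidden state, so the shallow positive key and the deep query come from the same pass. A toy stand-in for that behaviour (the layer stack and sizes are illustrative, not BERT itself):

```python
import numpy as np

rng = np.random.default_rng(1)

def forward_all_layers(x, weights):
    """Toy stand-in for BERT that returns every layer's hidden state
    in a single forward pass."""
    states = [x]
    for W in weights:
        x = np.tanh(x @ W)
        states.append(x)
    return states

h, num_layers, L = 8, 12, 8
weights = [rng.normal(size=(h, h)) * 0.3 for _ in range(num_layers)]
states = forward_all_layers(rng.normal(size=h), weights)
query = states[-1]     # deep feature h^o_i (output layer)
positive = states[L]   # shallow feature h^L_i: no extra forward pass needed
```

Augmentation-based methods (e.g., dropout-based SimCSE) must instead encode each input twice to obtain a query and a positive, hence the roughly 2x cost difference.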

Weighted Self-contrastive Loss
Given the query h^o_i with its positive key k_+(i) and negative keys k_−(i), the overall loss of our weighted self-contrastive module is:

L_cl = -(1/N) Σ_i log[ exp(sim(h^o_i, k_+(i))/τ) / ( exp(sim(h^o_i, k_+(i))/τ) + Σ_{k_l ∈ k_−(i)} α_l exp(sim(h^o_i, k_l)/τ) ) ], (11)

where α_l ∈ {α_same, α_diff, α_m} are weighting factors for the different negative keys, sim(h_i, h_j) is the cosine similarity h_i^T h_j / (||h_i|| · ||h_j||), and τ is a temperature hyperparameter.
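A minimal NumPy sketch of this weighted loss for a single query (our reconstruction of Eq. 11; the exact placement of the weights α_l on the negative terms is an assumption):

```python
import numpy as np

def cos(a, b):
    # cosine similarity between two vectors
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def weighted_self_contrastive_loss(q, pos, negs, alphas, tau=0.1):
    """InfoNCE-style loss where each negative key is scaled by alpha_l."""
    pos_term = np.exp(cos(q, pos) / tau)
    neg_terms = sum(a * np.exp(cos(q, n) / tau)
                    for n, a in zip(negs, alphas))
    return -np.log(pos_term / (pos_term + neg_terms))

rng = np.random.default_rng(0)
q = rng.normal(size=8)                   # deep feature h^o_i
pos = q + 0.1 * rng.normal(size=8)       # shallow positive key, close to q
negs = [rng.normal(size=8) for _ in range(3)]
alphas = [1.4, 1.0, 1.0]                 # alpha_diff > alpha_same = alpha_m
loss = weighted_self_contrastive_loss(q, pos, negs, alphas)
```

Upweighting a negative (larger α_l) enlarges its share of the denominator, so its similarity to the query receives a larger gradient and is pushed down faster.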
By weighting different negative keys and selecting shallow features as the positive key, our model can satisfy the goal in Inequality 4 and provide the conditions for subsequent fine-grained categorization.

Theoretical Analysis
The effectiveness of our weighted self-contrastive learning compared with traditional contrastive learning is analyzed below from the gradient perspective.

Self-contrastive Strategy: Compared with the traditional contrastive loss, which only aims at grouping queries and their transformations closer, our self-contrastive strategy aims at pulling queries and their shallow features closer. Since τ is positive, the positive similarity will increase and h^o_i will be grouped closer to h^L_i. After supervised learning on coarse-grained labels at layer L, h^L_i is close to the class center of c_i, so pulling h^o_i closer to h^L_i will also pull h^o_i closer to the class center of c_i. Our self-contrastive strategy can therefore guarantee that queries fall into correct coarse-grained categories, and it achieves twice the training efficiency since we only need to perform forward and backward propagation once to obtain and update both queries and positive keys.

Weighting Strategy: Since negatives with the same coarse-grained labels as queries have larger gradients (Wang and Liu, 2021), the traditional contrastive loss pushes these negatives farther from queries than those with different coarse-grained labels, which leads to d_coarse < d_fine and is the opposite of what the FCDC task requires. To mitigate this limitation, we propose a weighting strategy that gives more weight to samples whose coarse-grained labels differ from the query's, to further increase their distance. By increasing the weighting factor α_l for negatives with different coarse-grained labels from queries, the corresponding similarity decreases faster. So negatives with different coarse-grained labels from queries are pushed farther than those with the same coarse-grained labels, which guarantees d_fine < d_coarse for the FCDC task.
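The gradient argument above can be made explicit. Under our reconstruction of the weighted loss (writing s_+ and s_l for the positive and negative similarities of one query), a short derivation:

```latex
% Per-query weighted self-contrastive loss (reconstruction of Eq. 11):
\mathcal{L}_i
  = -\log \frac{e^{s_+/\tau}}{Z},
\qquad
Z = e^{s_+/\tau} + \sum_{l} \alpha_l\, e^{s_l/\tau}.
% Gradients with respect to the similarities:
\frac{\partial \mathcal{L}_i}{\partial s_+}
  = -\frac{1}{\tau}\Bigl(1 - \frac{e^{s_+/\tau}}{Z}\Bigr) < 0,
\qquad
\frac{\partial \mathcal{L}_i}{\partial s_l}
  = \frac{\alpha_l}{\tau}\cdot\frac{e^{s_l/\tau}}{Z} > 0.
```

Gradient descent therefore increases s_+ (queries move toward their shallow features, hence toward their coarse class centers) and decreases each s_l at a rate proportional to α_l, so upweighted negatives are pushed away faster.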

Overall Loss
To further guarantee that samples are classified into proper coarse-grained categories, we also add supervised learning on coarse-grained labels at the output layer. So the overall loss of our hierarchical weighted self-contrastive network is:

L = L_cl + γ_1 L^L_sup + γ_2 L^o_sup,

where L^o_sup is the cross-entropy loss at the output layer, and γ_1 and γ_2 are weighting factors.
After representation learning, we simply perform the non-parametric clustering method K-Means to discover fine-grained categories based on features extracted by the output layer of BERT.
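A minimal Lloyd's K-Means in NumPy, standing in for the K-Means used in this step (the real experiments use a GPU implementation; the fixed initialization indices are only for this toy demo):

```python
import numpy as np

def kmeans(X, k, init_idx, iters=50):
    """Tiny Lloyd's K-Means: alternate assignment and center updates."""
    centers = X[list(init_idx)].copy()
    for _ in range(iters):
        # assign each point to its nearest center (squared Euclidean)
        assign = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1),
                           axis=1)
        # recompute each center as the mean of its assigned points
        for j in range(k):
            pts = X[assign == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return assign

# two well-separated blobs of "output-layer features"
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 4)),
               rng.normal(5.0, 0.1, (20, 4))])
labels = kmeans(X, k=2, init_idx=[0, 20])
```

The cluster assignments play the role of the discovered fine-grained categories; no fine-grained supervision enters this step.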

Experiments

Datasets
To evaluate the effectiveness of our model, we conduct experiments on three public datasets; statistics of the three datasets can be found in Table 1. CLINC is an intent classification dataset released by Larson et al. (2019). Web of Science (WOS) is a paper classification dataset released by Kowsari et al. (2017). HWU64 is a personal assistant query classification dataset released by Liu et al. (2021).

Implementation Details
We use the pre-trained BERT model (bert-base-uncased) implemented in PyTorch (Wolf et al., 2020) as our backbone and adopt most of its suggested hyper-parameters. We use the cuML library (Raschka et al., 2020) to perform K-Means on GPU to speed up calculations. We use the AdamW optimizer with 0.01 weight decay, and gradient clipping with norm 1.0. For hyper-parameters, the temperature τ is set to 0.1, the layer L is set to 8, the weighting factors α_l for {k^diff_−(i), k^same_−(i), k^m_−(i)} are set to {1.4, 1.0, 1.0}, and the weighting factors {γ_1, γ_2} are set to {0.001, 0.008}. The training batch size is 128 and the testing batch size is 64. The momentum queue size for each coarse-grained category is 128, and the momentum factor for the Momentum BERT is 0.9. The hidden dimension h is 768, the learning rate is 5e-5, the dropout rate is 0.1, and the number of training epochs is 20.
For a fair comparison, we use the same BERT model as ours to extract features for all compared methods and adopt hyper-parameters in their original papers.

Compared Methods
Baselines: We perform FCDC with BERT in unsupervised and coarse-grained supervised ways as baselines. Self-supervised methods: DeepCluster (Caron et al., 2018) and DeepAligned (Zhang et al., 2021) are self-supervised methods using self-training techniques that achieve state-of-the-art results in many category discovery tasks. Ancor (Bukchin et al., 2021) is a self-supervised method designed for few-shot fine-grained classification with coarse-grained labels. SimCSE (Gao et al., 2021) and Delete One Word (Wu et al., 2020) are contrastive learning methods in NLP with different data augmentation techniques. Self-supervised + Cross Entropy: To investigate the influence of coarse-grained supervision on the compared models, we further add the cross-entropy loss on coarse-grained labels, L^o_sup, to their loss functions.

Evaluation Metrics
We use fine-grained labels as ground truth to evaluate model performance on testing sets.Since no fine-grained knowledge is available for the FCDC task, we need to perform clustering to discover fine-grained categories.Clustering performance can reflect the quality of discovered fine-grained clusters (more compact clusters usually mean better discovered categories).And classification performance can reflect the semantic overlap between discovered clusters and real categories.
To evaluate clustering performance, we use two widely used external evaluation metrics. Adjusted Rand Index (ARI) evaluates the degree of agreement between cluster assignments and ground truth, and Normalized Mutual Information (NMI) evaluates the mutual information between cluster assignments and ground truth. To evaluate classification performance, we use Accuracy (ACC), obtained with the Hungarian algorithm (Kuhn, 1955) to align cluster assignments and ground truth. Model performance on fine-grained categories is reported in Table 2. From the results we can draw the following conclusions. Our model significantly outperforms the other compared methods across all datasets. We attribute the better performance of our model to the following two points. Firstly, we propose a hierarchical architecture that learns fine-grained knowledge from shallow to deep, which is consistent with the feature extraction process of BERT and the shallow-to-deep learning process of humans. Secondly, we propose a weighted self-contrastive module to coordinate inter-class and intra-class distance so that we can better learn both coarse-grained and fine-grained knowledge.
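For clarity, the aligned accuracy can be computed as below (the paper uses the Hungarian algorithm; this brute-force search over label permutations is equivalent for a small number of clusters and avoids extra dependencies):

```python
import numpy as np
from itertools import permutations

def clustering_accuracy(y_true, y_pred):
    """Accuracy after optimally mapping cluster ids onto ground-truth labels."""
    labels = sorted(set(y_true) | set(y_pred))
    best = 0.0
    for perm in permutations(labels):
        mapping = dict(zip(labels, perm))
        acc = np.mean([mapping[p] == t for p, t in zip(y_pred, y_true)])
        best = max(best, acc)
    return best

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 2, 2, 0, 0]   # a perfect clustering with permuted ids
acc = clustering_accuracy(y_true, y_pred)
```

Because cluster ids carry no semantics, ACC must always be computed after this alignment; without it, a perfect clustering could score zero.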
Self-training methods perform badly on all datasets and evaluation metrics since they rely on abundant labeled data to generate high-quality pseudo labels for unlabeled data. Contrastive learning methods perform better than self-training methods since they do not need fine-grained labels to initialize their models. However, their performance is still much worse than ours since they cannot fully utilize the given coarse-grained labels to control inter-class and intra-class distance between samples. We also find that the performance of most compared methods increases with the addition of coarse-grained supervision, which means coarse-grained supervision can boost model performance on fine-grained tasks.
Our model's performance on coarse-grained categories is shown in Table 3. From the table, we can see that our model achieves classification accuracy similar to the upper-bound coarse-supervised BERT, which means that our model can control not only intra-class distance to ensure fine-grained separability, but also inter-class distance to ensure coarse-grained separability.

Ablation Study
To investigate the contributions of different components to our model, we compare the performance of our model with its variants on the CLINC dataset. As shown in Table 4, removing different components affects model performance to varying degrees, which indicates the effectiveness of each component. Removing the Momentum Encoder has minimal impact, since our model is insensitive to the number of negative samples. Removing the weighting strategy or the cross-entropy loss at shallow layers also hurts model performance, since they help to learn coarse-grained knowledge and lay the foundation for learning fine-grained knowledge. Most notably, removing the self-contrastive strategy results in a significant decrease, since it is responsible for controlling intra-class and inter-class distance.

Training Efficiency
In this section, we compare the training efficiency of our model with the contrastive methods SimCSE and Delete One Word on the CLINC dataset. We test all methods using the BERT base model trained on the same hardware platform (an AMD EPYC 7702 CPU and an RTX 3090 GPU) with batch size 128. Average results over 100 epochs are shown in Figure 4. Compared with SimCSE and Delete One Word, our model achieves twice the training efficiency both with and without the Momentum Encoder, which benefits from our self-contrastive strategy. Specifically, our model utilizes the shallow features of queries as positive keys, so it only needs to perform forward and backward propagation once to obtain and update both queries and positive keys.

Visualization
We further visualize the learned embeddings of our model and SimCSE using t-SNE on the CLINC dataset in Figure 5. Our model separates different coarse-grained categories with a larger margin than SimCSE (top of Figure 5), which benefits from our strategy of combining supervised learning and contrastive learning in a hierarchical way. Furthermore, our model also separates different fine-grained categories with a larger margin (bottom of Figure 5), which benefits from the weighted self-contrastive module. In summary, our model can better control both inter-class and intra-class distance between samples to facilitate the FCDC task than traditional contrastive learning methods.

Choices of L and weighting factors
Effect of Shallow Layer L: The influence of the choice of shallow layer L on model performance is shown in Figure 6. Our model achieves the best performance when L = 8. In this way, our model can learn coarse-grained knowledge at shallow layers (L < 8) and retain enough model capacity to learn fine-grained knowledge at deeper layers (L > 8), which is consistent with the feature extraction process of BERT (Jawahar et al., 2019).

Effect of Weighting Factors
We investigate the influence of the ratio β = α_diff/α_same in Figure 7 (we fix α_m = 1 since it has little influence). As analyzed in Section 3.2.4, by giving more weight to negatives whose coarse-grained labels differ from the query's (β > 1), our weighting strategy keeps these negatives farther away from queries and guarantees d_fine < d_coarse. On the contrary, when β < 1, negatives with the same coarse-grained labels as queries are pushed farther from the queries, which hurts model performance.

Conclusion
In this paper, we investigate a novel task named Fine-grained Category Discovery under Coarse-grained supervision (FCDC), which can significantly reduce labeling costs and adapt models to novel categories of a granularity different from known ones. We further propose a hierarchical weighted self-contrastive model to approach the FCDC task by better controlling intra-class and inter-class distance. By performing supervised and contrastive learning on the shallow and deep layers of pre-trained models, our model can learn fine-grained knowledge from shallow to deep with only coarse-grained supervision. Extensive experiments on public datasets show that our approach is more effective and efficient than compared methods.

Limitations
The limitations of our method lie in two aspects. Firstly, following previous works, we need to know the number of fine-grained clusters K as prior knowledge, which is usually difficult to obtain in real-world scenarios. Secondly, our method cannot predict the semantic meanings (e.g., label names) of discovered fine-grained categories, which is also an unexplored question in the field of novel category discovery.

Figure 3: The effectiveness of our self-contrastive module, which can control both intra-class and inter-class distance.

Figure 6: Effect of shallow layer L.
Figure 7: Effect of the weighting ratio β = α_diff/α_same.

Table 1: Statistics of datasets. # indicates the number of samples. |C| and |F| denote the number of coarse-grained and fine-grained classes, respectively.

Table 2: Model comparison results (%) on fine-grained categories. Average results over 5 runs are reported. '+ CE' means adding coarse-grained supervision with the cross-entropy loss. We also performed statistical significance tests and all p-values are less than 10^-6, which means our improvement is significant.

Table 3: Classification accuracy (%) on coarse-grained categories on test sets.

Table 4: Results (%) of different model variants. '-' means that we remove the component from our model.