Coarse2Fine: Fine-grained Text Classification on Coarsely-grained Annotated Data

Existing text classification methods mainly focus on a fixed label set, whereas many real-world applications require extending to new fine-grained classes as the number of samples per label increases. To accommodate such requirements, we introduce a new problem called coarse-to-fine grained classification, which aims to perform fine-grained classification on coarsely annotated data. Instead of asking for new fine-grained human annotations, we opt to leverage label surface names as the only human guidance and weave rich pre-trained generative language models into an iterative weak supervision strategy. Specifically, we first propose a label-conditioned fine-tuning formulation to attune these generators to our task. Furthermore, we devise a regularization objective based on the coarse-fine label constraints derived from our problem setting, yielding further improvements over the prior formulation. Our framework uses the fine-tuned generative models to sample pseudo-training data for training the classifier, and bootstraps on real unlabeled data for model refinement. Extensive experiments and case studies on two real-world datasets demonstrate superior performance over SOTA zero-shot classification baselines.


Introduction
In traditional text classification problems, the label set is typically assumed to be fixed. However, in many real-world applications, new classes, especially more fine-grained ones, will be introduced as the data volume increases. One commonly used method is to extend the existing label set to a label hierarchy by expanding every original coarse-grained class into a few new, fine-grained ones, and then assign a fine-grained label to each document. Using the directory structure for a set of files on a computer as an example (see Figure 1), people usually start organizing the files in a coarse-grained fashion like "Music" and "Academics". Once the number of files in each of these coarse-grained directories increases, the categorization serves little purpose. Therefore, we would like to create new fine-grained sub-directories inside coarse-grained directories, like {"rap", "rock", "oldies"} for "Music", and similarly for "Academics". However, the process of assigning these files into fine-grained sub-directories typically begins with almost no supervision for fine-grained labels.
* Jingbo Shang is the corresponding author.
To accommodate such requirements, in this paper, we introduce a new, important problem called coarse-to-fine grained classification, which aims to perform fine-grained classification on coarsely annotated data without any fine-grained human annotations. There has been prior research on performing text classification using extremely weak supervision, i.e., only label surface names as the source of supervision. For example, X-Class (Wang et al., 2021) learns class-aligned document representations to generate pseudo-labels, and LOTClass (Meng et al., 2020) assumes replacements of label surface names in a sentence are related to the classes and leverages pre-trained language models to extract those words. Note that the coarse-to-fine setting differs from generic zero-shot text classification in having additional coarse supervision and a pre-conceived label hierarchy, though the final label set is available in either case. The coarse-to-fine setting also differs from hierarchical classification: we have no supervision for fine-grained labels other than the label names, whereas the few-shot hierarchical setting can have a few samples for fine-grained labels. Therefore, we want to capture the available coarse-grained supervision and label hierarchy to perform fine-grained classification.
In this paper, we propose a novel framework, C2F, as illustrated in Figure 2. In the absence of fine-grained human annotations, it uses fine-grained label surface names as weak-supervision signals and leverages pre-trained language models as data generators. Similar to previous work, we first generate weak supervision from the whole corpus by assuming label surface names are strong indicators of their respective labels. For two iterations, C2F fine-tunes a language model based on weak supervision and trains a classifier based on generated pseudo-training data to refine the weak supervision. We observe that raw weak supervision usually has a highly-skewed label distribution, especially at the beginning, because the popularity of the label names varies. Since we have no prior knowledge about the underlying label distribution, to avoid significant deviations from that distribution, we opt to draw a balanced, weakly annotated subset through stratified sampling before any model training. We propose to fine-tune language models in a label-conditioned, hierarchy-aware manner. Specifically, we inform the language models of label information by adding the label surface names at the beginning of each document. We further incorporate a regularization objective into the fine-tuning process that captures the constraints derived from the label hierarchy. Facilitated by this fine-tuning process, we then generate pseudo-training data for each fine-grained label and train a classifier. Next, using this fine-grained classifier's predictions over the coarsely annotated data, we select the samples with a high predicted probability for each respective fine-grained label.
We conduct experiments on two real-world datasets containing both coarse and fine-grained labels. The results demonstrate the effectiveness of our framework in leveraging a label hierarchy and a rich pre-trained language model to perform fine-grained text classification with no supervision. Via thorough ablation, we isolate the separate benefits accrued, initially just from using the label-conditioned, fine-tuned language model in the weak supervision pipeline, and the later incremental benefit once we incorporate our proposed regularization objective into the language model fine-tuning.
To the best of our knowledge, we are the first to work on coarse-to-fine grained text classification, which aims to perform fine-grained classification on coarsely annotated data without any fine-grained annotations. It is also worth mentioning that C2F is compatible with almost any generative language model and text classifier. Our contributions are summarized as follows:
• We develop a label-conditioned fine-tuning formulation for language models to facilitate conditional corpus generation.
• We devise a regularization objective based on the coarse-fine label constraints derived from the pre-conceived label hierarchy.
• We conduct extensive experiments to demonstrate the superiority of C2F.
Reproducibility. We will release the code and datasets on GitHub.

Another Example Application
Another example task which motivates how our framework could be deployed (besides the aforementioned directory example) is Intent Classification in Task-based Dialog (Chen et al., 2019; Kumar et al., 2019; Schuster et al., 2019; Gangal et al., 2020). This is often seen as a hierarchical classification problem (Gupta et al., 2018), with domains (e.g., movie, airline, shopping search) and intents (e.g., book-movie vs. check-reviews, book-flight vs. add-meal-options) forming the higher and lower levels of the label hierarchy. For a real-world task-based dialog system (e.g., Alexa), there is always a need over time to keep introducing both new domains (e.g., cruise, medical FAQ) and intents (order-popcorn, flight-entertainment), as both the data volume increases and the backend capabilities of the system expand.

Problem Formulation
The input of our problem contains: (1) a tree-structured label hierarchy T with coarse-grained labels C at the first level and fine-grained labels F as their children. The m coarse-grained classes are named {C_1, C_2, ..., C_m}, and the k fine-grained classes are named {F_1, F_2, ..., F_k}. All these class names are in natural language (e.g., words or phrases) and assumed to be informative; and (2) a collection of n text documents D = {D_1, D_2, ..., D_n} and their corresponding coarse-grained labels {c_1, c_2, ..., c_n}. We record the mapping from each fine-grained class to its corresponding coarse-grained parent class as f↑ : F → C. The fine-grained classes in a coarse-grained label are represented by the coarse-to-fine mapping f↓ : C → P(F), where P(·) is the powerset operator, which generates the set of all subsets. In this problem, each coarse class maps to a non-empty subset of fine classes, and all these subsets of fine classes taken together are mutually non-overlapping and exhaustive.
We aim to build a high-quality document classifier from these inputs, assigning a fine-grained class label F_j ∈ f↓(c_i) to each document D_i ∈ D.
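To make the formulation concrete, the inputs can be sketched as plain Python mappings. The toy labels and documents below are illustrative only, not drawn from the paper's datasets:

```python
# Fine-to-coarse mapping f_up : F -> C (toy hierarchy, for illustration only)
f_up = {
    "rap": "music", "rock": "music", "oldies": "music",
    "physics": "academics", "biology": "academics",
}

# Coarse-to-fine mapping f_down : C -> non-empty subset of F, derived from f_up
f_down = {}
for fine, coarse in f_up.items():
    f_down.setdefault(coarse, set()).add(fine)

# The corpus: documents paired with coarse-grained labels only
docs = [
    ("new rap single tops the charts", "music"),
    ("a field study of marine biology", "academics"),
]

# The fine-label subsets are mutually non-overlapping and exhaustive
all_fine = set().union(*f_down.values())
assert all_fine == set(f_up)
```

The target classifier then maps each (D_i, c_i) pair to some F_j ∈ f_down[c_i].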

Our C2F Framework
As visualized in Figure 2, C2F aims to build a text classifier that can assign fine-grained labels to a set of coarsely-annotated documents based on only the label surface names and their hierarchical relations. In the absence of fine-grained human annotations, it uses fine-grained label surface names as weak-supervision signals and leverages a pre-trained language model as the data generator. Following an iterative process, C2F fine-tunes a language model based on weak supervision. This fine-tuned language model is used to generate pseudo-training data to train a fine-grained text classifier. Based on the classifier's predictions, we select highly probable samples for each fine-grained class and repeat this process for one more iteration by replacing the weak supervision with these samples. This bootstrapping increases the quality of the weak supervision by eliminating mislabeled samples and improves the performance of the text classifier, as we show later in our case studies.
Our major contributions lie in how to better incorporate the label names and their hierarchical relations into the language model and therefore generate higher-quality pseudo-training data. Our framework is compatible with any generative language model, and we choose GPT-2 (Radford et al., 2019) in our implementation. We feed label names to the language model through a label-conditioned formulation. We further incorporate a regularization objective into the fine-tuning process that captures the constraints derived from the label hierarchy. The key components of C2F are discussed in detail in the following sections.

Initial Fine-grained Weak Supervision
We assume that user-provided label surface names are of high quality and are strong indicators of their respective classes, following the state-of-the-art weakly supervised text classification methods that only rely on label surface names (Wang et al., 2021; Meng et al., 2020). This assumption is intuitive and valid because there is no guidance other than class names from the user, and we expect them to be of high quality and indicative of the categories.
Ideally, the posterior probability of a document belonging to a class after observing the presence of strong indicators should be close to 1. Therefore, we take samples that exclusively contain a label surface name as the weak supervision for the respective class. Mathematically, let W(F_j) denote the weak supervision of fine-grained class F_j:

W(F_j) = { D_i ∈ D | D_i ∩ f↓(c_i) = {F_j} },

where D_i ∩ f↓(c_i) returns the set of fine-grained label names under the coarse-grained class c_i that appear in the document D_i. When this set only contains F_j, the document is "exclusive" to F_j among the fine-grained labels. This "exclusiveness" helps us improve the precision of the initial weak supervision. Note that it is implied that F_j ∈ f↓(c_i).
We observe that the initial weak supervision obtained usually has a highly-skewed label distribution, because the popularity of the label names varies. This difference in distribution could bias the generative language model towards the majority label and might affect the quality of generated samples, which in turn, would affect the performance of the text classifier. To address this problem, as there is no other prior knowledge, we opt to draw a balanced, weakly annotated subset through a stratified sampling before any model training. In other words, we make the size of weak supervision uniform for all labels, equal to the size for the minority label.
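This initialization step can be sketched in a few lines. The substring matching and the helper names below are our simplifications, not the paper's exact implementation:

```python
import random

def initial_weak_supervision(docs, f_down):
    """Assign a document to fine-grained label F_j only if F_j is the ONLY
    fine-grained label name (under the document's coarse label) that appears
    in its text -- the "exclusiveness" condition.  Simple substring matching
    stands in for whatever matching the full system uses."""
    weak = {}
    for text, coarse in docs:
        hits = [f for f in f_down[coarse] if f in text.lower()]
        if len(hits) == 1:                      # exclusive -> high precision
            weak.setdefault(hits[0], []).append(text)
    return weak

def balance(weak, seed=0):
    """Stratified downsampling: every fine label keeps as many examples as
    the minority label, so the generator is not biased toward popular names."""
    k = min(len(v) for v in weak.values())
    rng = random.Random(seed)
    return {f: rng.sample(v, k) for f, v in weak.items()}
```

Documents mentioning two fine-grained names under the same coarse label (e.g., both "rap" and "rock") are excluded, trading recall for precision.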

Tailored Language Model Training
In this section, we describe our label-conditioned, hierarchy-aware language model training formulation that facilitates conditional corpus generation. Specifically, we continuously train a pre-trained language model to capture the distribution P(D|l), where D is a document and l is a (coarse or fine) label surface name. Thus, this model can generate pseudo-training documents for fine-grained labels when we plug in fine-grained label surface names.

Label-Conditioned Generation
Before we describe our formulation, we briefly introduce GPT-2 and its pre-training objective.
GPT-2. GPT-2 is a large, pre-trained left-to-right language model which exhibits strong performance with minimal in-task fine-tuning on many generation tasks, such as dialog and story generation (See et al., 2019). Its strong zero-shot ability across tasks stems from its pre-training on the vast and diverse WebText corpus (≈8M documents), besides the good inductive bias of its transformer-based architecture. GPT-2 is trained with the standard language modeling objective to maximize the likelihood of a document D = (w_1, ..., w_T):

L(D) = Σ_{t=1..T} log P(w_t | w_1, ..., w_{t-1}; Θ),

where P(·) is modeled with a transformer-based architecture with parameters Θ.
To continuously train GPT-2 in a label-conditioned way, one has to maximize P(D|l) instead of P(D). We designate the label surface names as special token sequences and prepend them to their respective documents, with another special token <labelsep> separating the label sequence and the document. For example, a sample document "Messi plays for FC Barcelona" belonging to "soccer" is modified to "soccer <labelsep> Messi plays for FC Barcelona". Therefore, our objective is to maximize L(D|l) defined below:

L(D|l) = Σ_{t=1..T} log P(w_t | l, <labelsep>, w_1, ..., w_{t-1}; Θ).

Note that l here could be the label surface name of a coarse-grained or a fine-grained class. One can view our formulation as asking the label token sequence to play the role of a prompt and the document D to be the continuation, thus facilitating conditional corpus generation.
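The formatting step itself is tiny; a minimal sketch (the function name is ours):

```python
LABEL_SEP = "<labelsep>"

def label_condition(label, document):
    """Prepend the label surface name as a prompt, separated from the
    document by the special token, so fine-tuning learns P(D | l)."""
    return f"{label} {LABEL_SEP} {document}"

# label_condition("soccer", "Messi plays for FC Barcelona")
# -> "soccer <labelsep> Messi plays for FC Barcelona"
```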
During the continuous training process, we have access to both the gold, coarse-grained labels and the weak, fine-grained labels. Examples included in the weak supervision give rise to two label-conditioned documents: one by prefixing with the coarse-grained, gold label and the other with the weak, fine-grained one (due to the "exclusiveness" in the initial weak supervision). Those not in the weakly supervised set only give rise to the first kind. Since there is no conflict between these two labels, we simply treat a document as belonging independently to either of them.

Hierarchy-Aware Regularization
Our label-conditioned generation treats both fine- and coarse-grained labels as prompts and does not use any information from the hierarchy. Therefore, we propose to regularize the language model with constraints derived from the hierarchy.
Intuitively, fine-grained labels are more specific than coarse-grained labels, and therefore, when generating the same document, conditioning on its gold fine-grained label should yield a higher probability than conditioning on its coarse-grained label. We believe the same intuition is applicable to the high-quality weak supervision. Therefore, we seek to enforce the constraint while continuously training on the weak supervision. Specifically, a document should be more likely given its fine-grained (weak) label than given its coarse-grained label. Mathematically,

P(D_i | F_j) > P(D_i | c_i), ∀ D_i ∈ W(F_j),

where W(F_j) is the weak supervision for fine-grained label F_j. Note that it is implied that c_i = f↑(F_j). This inequality can be expressed in the form of a margin between P(D_i|F_j) and P(D_i|c_i), which can be implemented in practice through an additional hinge loss term:

L_hinge = Σ_{F_j} Σ_{D_i ∈ W(F_j)} max(0, ε + log P(D_i | c_i) − log P(D_i | F_j)),

where ε is a positive constant.
We incorporate this hierarchy-aware regularization into the final objective function as follows:

O = Σ_{(D,l)} L(D | l) − λ · L_hinge,

where λ controls the strength of the regularization. The final optimization aims to maximize O.
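On scalar log-likelihoods, the regularized objective can be sketched in a few lines. The function names and signatures here are ours; in practice the log-probabilities come from the fine-tuned language model:

```python
def hinge_term(logp_fine, logp_coarse, eps):
    """Penalty for violating log P(D|F_j) >= log P(D|c_i) + eps."""
    return max(0.0, eps + logp_coarse - logp_fine)

def objective(label_cond_logliks, weak_pairs, eps, lam):
    """O = sum of label-conditioned log-likelihoods L(D|l) minus lambda
    times the hinge penalty over weakly supervised pairs, where each pair
    is (log P(D_i|F_j), log P(D_i|c_i))."""
    penalty = sum(hinge_term(lf, lc, eps) for lf, lc in weak_pairs)
    return sum(label_cond_logliks) - lam * penalty
```

With the settings reported later in the paper, eps = log 5 and lam = 0.01; maximizing O trades off fluency under the label prompts against consistency with the hierarchy.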

Pseudo Training Data Generation, Text Classifier, & Weak Supervision Update
After continuously training the language model in a label-conditioned way, we generate data for each fine-grained category. Specifically, we send the corresponding label surface name as the prompt to our language model, which then generates samples for that respective class. Since we do not know the label distribution beforehand, we assume it is balanced, thus avoiding inducing potential bias in the classifier. We generate twice the required number of documents, divided equally among fine-grained labels. Specifically, for a fine-grained label F_j ∈ f↓(c), we generate 2·N_c / |f↓(c)| documents, where N_c is the number of documents that belong to coarse-grained label c.
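The per-label generation budget is straightforward to compute; a small sketch (names are ours):

```python
def generation_budget(coarse_counts, f_down):
    """For each coarse label c with N_c documents, generate 2*N_c pseudo
    documents, split equally among its fine-grained children (the fine
    label distribution within c is assumed uniform)."""
    budget = {}
    for c, n_c in coarse_counts.items():
        per_fine = 2 * n_c // len(f_down[c])
        for f in f_down[c]:
            budget[f] = per_fine
    return budget
```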
We train a text classifier over these generated documents and their corresponding fine-grained labels. Our framework is compatible with any text classifier, and we use a BERT (bert-base-uncased) (Devlin et al., 2019) classifier in our experiments.
After training the text classifier, we obtain fine-grained predictions and probability scores for all coarsely annotated documents D. Finally, we bootstrap on the unlabeled data by replacing the weak supervision W(F_j) with the top-k predictions, where k = |W(F_j)|, for every fine-grained label F_j, and repeat this process one more time. In our experiments, we observe that these top-|W(F_j)| predictions are of significantly higher quality than the initial weak supervision, thus improving the text classifier.
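The bootstrapping update can be sketched as a top-k selection per fine label. The function name and the prediction tuple format are our assumptions:

```python
def bootstrap_supervision(predictions, sizes):
    """Replace W(F_j) with the top-|W(F_j)| most confident predictions.

    predictions: list of (document, predicted_fine_label, probability)
    sizes:       {fine_label: |W(F_j)|} from the previous iteration
    """
    new_weak = {}
    for f, k in sizes.items():
        ranked = sorted((p for p in predictions if p[1] == f),
                        key=lambda p: p[2], reverse=True)
        new_weak[f] = [doc for doc, _, _ in ranked[:k]]
    return new_weak
```

Keeping the per-label sizes fixed at |W(F_j)| preserves the balanced label distribution established at initialization.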

Experiments
In this section, we start with introducing datasets, compared methods, and experimental settings. Next, we present quantitative evaluation results of C2F together with all compared methods. In the end, we show qualitative studies to analyze different aspects of our C2F framework.

Datasets
We evaluate our framework on two hierarchical datasets where each document has one coarse-grained label and one fine-grained label. The dataset statistics are provided in Table 1. The details of these datasets are as follows:
• The New York Times (NYT): Following previous work (Wang et al., 2021), we experiment on the NYT dataset. It is a collection of news articles written and published by The New York Times. Each news article is classified into one of 5 coarse-grained genres (e.g., arts, sports) and 25 fine-grained categories (e.g., movies, baseball).
• The 20 Newsgroups (20News): The 20News dataset is a collection of newsgroup documents partitioned widely into 6 groups (e.g., recreation, computers) and 20 fine-grained classes (e.g., graphics, windows, baseball, hockey). There are three miscellaneous labels (i.e., "misc.forsale", "talk.politics.misc", "talk.religion.misc"). As one can notice, their label names are about 'miscellaneous' and contain information of various types. Since these labels and label surface names have no focused meaning, we drop the documents annotated with these labels in our experiments.

Compared Methods
Since we aim to perform fine-grained classification with no fine-grained supervision, we compare our framework with a wide range of zero-shot and weakly supervised text classification methods described below:
• Word2Vec learns word vector representations (Mikolov et al., 2013) for all words in the corpus and takes the word vectors of label surface names as their respective label representations. In the case of multi-word label descriptors, the embeddings of individual words are averaged. Each document is labeled with the most similar label based on cosine similarity.
• ConWea (Mekala and Shang, 2020) is a seed-driven contextualized weak supervision framework. It leverages pre-trained language models to resolve the interpretation of seed words and make the weak supervision contextualized.
• LOTClass (Meng et al., 2020) uses a pre-trained language model like BERT (Devlin et al., 2019) to query replacements for class names and constructs a category vocabulary for each class. This vocabulary is further used to fine-tune the language model on a word-level category prediction task and to identify potential classes for documents via string matching. A classifier is trained on this pseudo-labeled data with further self-training.
• X-Class (Wang et al., 2021) learns class-oriented document representations that make it adaptive to the user-specified classes. These document representations are aligned to classes through PCA + GMM, harvesting pseudo-labels for supervised classifier training.
We also compare C2F with its ablated variants. C2F-NoHier uses label-conditioned generation alone, without the hierarchy-aware regularization. C2F-Ind and C2F-Ind-NoHier are run individually on each coarse-grained label c to assign a fine-grained label F_j ∈ f↓(c), and the predictions are accumulated at the end to compute aggregated results. C2F-Ind uses label-conditioned generation with the hierarchy-aware regularization, whereas C2F-Ind-NoHier uses label-conditioned generation alone. C2F-1IT is a BERT classifier trained on the initial fine-grained weak supervision. We also consider C2F with different generative LMs and classifiers: C2F-GPT-BERT and C2F-GPT-LR use GPT (Radford et al., 2018) as the generative language model, with BERT and Logistic Regression as classifiers, respectively.
For a fair comparison, we make the coarse-grained annotated data available to all baselines and run them individually on each coarse-grained label c to assign a fine-grained label F_j ∈ f↓(c); the predictions are accumulated at the end to compute aggregated results. We provide label surface names as seed words for seed-word-driven baselines like ConWea and WeSTClass.
We also present the performance of BERT in a supervised setting which is denoted as BERT-Sup. The results of BERT-Sup reported are on the test set which follows an 80-10-10 train-dev-test split.

Experimental Settings
While fine-tuning GPT-2, we experiment with learning rates α ∈ {5e-5, 5e-4, 5e-6}, with α = 5e-4 found optimal, and continue the label-conditioned language model training for 5 epochs. Generation from the model is done via nucleus sampling (Holtzman et al., 2020), with a budget of p = 0.95 and a length limit of 200 subwords. The prompt given for generation is simply the tag sequence corresponding to the intended fine-grained label of the sample to be generated. Since fine-grained class ratios are a priori unknown, an equal number of examples is sampled for each fine-grained class within the same coarse-grained class.
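For reference, the truncation rule behind nucleus (top-p) sampling keeps only the smallest high-probability head of the next-token distribution. A self-contained sketch over an explicit distribution (in practice this runs inside the decoder's sampling loop):

```python
def nucleus(probs, p=0.95):
    """Keep the smallest set of tokens whose cumulative probability reaches
    p, then renormalize; sampling proceeds from this truncated distribution
    (Holtzman et al., 2020)."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, pr in ranked:
        kept.append((token, pr))
        total += pr
        if total >= p:
            break
    return {token: pr / total for token, pr in kept}
```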
For the hierarchy-aware regularization, we set the hinge loss margin ε = log 5 and λ = 0.01. For the hyperparameter selection of ε, we sweep over the sequence of values {log n | n = 1, ..., 10}. Further searching is done through two levels of binary search. The decision to initially sweep over values in a logarithmic fashion is based on two intuitions: i) larger jumps were found to skip over the domain of variation of ε too quickly; ii) ε is essentially a margin on logarithmic probabilities.

Quantitative Results
We evaluate our framework using Micro-F1 (Mi-F1) and Macro-F1 (Ma-F1) as performance metrics. The evaluation results of all methods, run on three random seeds, are summarized in Table 2 along with their respective standard deviations. We can observe that our proposed framework achieves superior performance compared to all other baselines. We discuss the effectiveness of C2F as follows:
• C2F demonstrates the best performance among all compared baselines. By utilizing the generative language model through label-conditioned fine-tuning and regularizing it with the hierarchical hinge loss to leverage the hierarchy, it generates good-quality pseudo-training data, which helps achieve the best performance.
• C2F outperforms X-Class by a significant margin. X-Class does not take advantage of the label hierarchy and requires class names to be one word, whereas our framework has no such limitation and leverages rich language models to understand informative label surface names.
• We have to note the significantly low performance of LOTClass. LOTClass queries replacements of label surface names and considers them indicative of the label. This is a valid assumption for coarse-grained classification, but when the classes become fine-grained, the replacements may not be indicative of their respective class. For example, consider the sentence "I won a baseball game.". If "baseball" is replaced by "tennis", it is still a valid and meaningful statement, but "tennis" is not indicative of "baseball". Therefore, LOTClass performs poorly on the fine-grained text classification task. Our framework separates the weak supervision for each label initially and fine-tunes the language model in a label-conditioned way. Therefore, it is able to distinguish between fine-grained labels as well.
• The comparison between C2F, C2F-Ind and C2F-NoHier, C2F-Ind-NoHier shows that the hinge loss helped in leveraging the constraints from the hierarchy to improve the language model.
• The comparison between C2F and C2F-Ind shows that fine-grained classification benefits from the hierarchical structure and joint training with other coarse-grained classes.
• We can observe that C2F performs significantly better than C2F-1IT. This shows that fine-grained classification improves with bootstrapping, where the samples with high predicted probabilities are selected and used as weak supervision for the next iteration.
• The comparison between C2F and C2F-GPT-BERT, C2F-GPT-LR shows that the performance improves with larger language models. This also demonstrates that C2F is compatible with different generative language models and classifiers.
• We observe that the performance of C2F is quite close to the supervised method BERT-Sup, e.g., on the NYT dataset. This demonstrates that C2F is quite effective in closing the performance gap between the weakly supervised and supervised settings with just label surface names as supervision.

Performance increase with bootstrapping
The F1-scores of fine-grained labels in three coarse-grained labels, "computer", "politics", and "religion", across iteration-0 and iteration-1 are plotted in Figure 3. We see that performance increases significantly from iteration-0 (blue) to iteration-1 (red). We attribute this increase to our bootstrapping.

Sensitivity to ε
A potential concern with the experimental setup could be an overly high sensitivity of C2F to the hinge loss margin parameter ε. However, from the plot in Figure 4, we clearly see that the F1 scores are not drastically sensitive to ε, with standard deviations of 0.00515 and 0.00517 for the Macro and Micro-F1 scores, respectively.

Qualitative Analysis
Given a particular coarse label (say sports) and its data subsets X = {D_i | c_i = "sports"} and X_f = {D_i | c_i = "sports", f_i = f ∈ f↓("sports")}, as a matter of post-hoc analysis, we can compare three distinct "supervised" splits a classifier could have been trained on: 1. Gold: the data along with gold fine-grained labels, which is not actually available in our setting. 2. C2F-Init: the subset of X for which the initial weak supervision strategy assigns fine labels based on label surface names. 3. C2F-Gen: the data sampled from our trained language model as the generator for each of the respective fine-grained labels. Which supervision is more apt from the purview of training? To answer this, we examine the entropy H(·) of the word-frequency distribution of the three datasets. Specifically, we examine the reduction in value from the entropy of the overall set, H(X), to the mean entropy after partitioning further by fine label, i.e., the mean of H(X_f) over fine labels f. The larger this drop, the more internally coherent the label partitions.
As the entropy comparison shows, C2F-Gen yields the largest drop, with a mean per-label entropy of 6.924, being 4.21% smaller. In summary, we see that C2F-Gen provides a more discriminative training signal without reducing example diversity. A few samples of generated documents for fine-grained labels are shown in Table 3.
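This entropy comparison can be reproduced on any corpus partition; a minimal sketch, where whitespace tokenization and base-2 logarithms are our simplifying assumptions:

```python
import math
from collections import Counter

def word_entropy(docs):
    """Entropy H of the word-frequency distribution over a set of documents."""
    counts = Counter(w for d in docs for w in d.split())
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_drop(docs_by_fine):
    """Drop from H(X) to the mean of H(X_f) over the fine-label partitions;
    a larger drop means more internally coherent partitions."""
    all_docs = [d for v in docs_by_fine.values() for d in v]
    mean_hf = sum(word_entropy(v) for v in docs_by_fine.values()) / len(docs_by_fine)
    return word_entropy(all_docs) - mean_hf
```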

Related Work
We review the literature on different weakly supervised text classification methods.
There are three main sources of weak supervision: (1) a set of representative keywords for each class, (2) a few labeled documents (Tang et al., 2015; Miyato et al., 2017; Xu et al., 2017), and (3) label surface names (Tao et al., 2018; Meng et al., 2020; Wang et al., 2021). Typically, weakly supervised text classification frameworks obtain pseudo-labeled data, train a classifier, and improve the classifier by bootstrapping over unlabeled data. Seed-driven frameworks obtain pseudo-labeled data from user-provided seed words. When a few labeled documents are provided as weak supervision, the above-mentioned pipeline similarly starts with these as pseudo-labeled data. In this paper, we focus on label surface names as the source of weak supervision. Along this line, Doc2Cube (Tao et al., 2018) expands label keywords from label surface names and performs multidimensional document classification by learning dimension-aware embeddings; (Meng et al., 2020) identify keywords for classes by querying replacements for class names using BERT and pseudo-label the documents by string matching with the selected keywords. (Wang et al., 2021) proposed an adaptive representation learning method for obtaining label and document embeddings; these document embeddings are further clustered to pseudo-label the corpus. However, all these methods perform flat text classification. Although our method performs text classification using only fine-grained label surface names as supervision, we have coarse-grained annotated data and leverage it to improve fine-grained classification. There are a few methods that perform weakly supervised hierarchical classification (Meng et al., 2019; Zhang et al., 2021). However, our problem statement is different from hierarchical classification: we have coarse-grained annotated data, and our framework utilizes it together with the label hierarchy to perform fine-grained text classification. Recently, (Hsieh et al., 2019) introduced the coarse-to-fine weakly-supervised multi-label learning problem. However, they assume a few fine-grained labeled documents as supervision, whereas we require only label surface names. Additionally, our framework is generative in nature, i.e., instead of pseudo-labeling the corpus, we generate training data and train the classifier.

Table 3: Example generated samples for fine-grained labels hockey, basketball, cosmos, and economy in the NYT dataset and autos, atheism, and windows in the 20Newsgroup dataset.

hockey: the rangers' two injury-riddled offense and seven-game effort to defeat the florida panthers on saturday have prompted coach alain vigneault to call on the veteran right wing to start against the capitals. his message to the rookie goalie thomas vanek was that "he was fine today" and the rangers did not have to worry about his status in a game against another talented team.

basketball: the nets entered the n.b.a. season with such wide open space that even their most ardent supporters of the franchise might admit their disappointment. the season ended sunday with the los angeles lakers selecting the first-round draft pick, with the eventual no. 1 pick jason kidd, a dynamic penetrator who could challenge for a starting job, at least in a year.

cosmos: washington - the outermost reaches of space are barely perceptible from the surface, making it difficult to see the red planet, but a spacecraft based on this data could find something as yet elusive: an elusive galaxy that has just vanished. on thursday, kepler spacecraft, which is carrying back-to-back samples of solar starlight, will begin a journey that will lead the space agency's curiosity.

economy: washington - the number of americans seeking unemployment benefits fell less than expected last week, the latest evidence that steady job growth is holding steady. jobless claims rose,000 to a seasonally adjusted,000, the labor department said on thursday. that is 146 more than previously reported. last week's drop in jobless claims came two months after steady increases in the previous week, when claims rose.

autos: re manual shift bigots in article (james bruder) writes another question that hasn't been addressed yet is how come the auto mated to the lever controlling selector is not mounted to the transmission? i would think that the mated to the shifter would be mounted in the passenger compartment. is this a problem with the manual transmission? and if so, is it a problem with the shifter's mounting point?

atheism: re why do people become atheists? in article, (kent sandvik) writes in article, (robert beauchaine) writes and i suppose i would have better evidence if i could. why would it be any different, for one thing? i'm fairly new to this group, so perhaps this sort of question has already been asked, and answered before. but i've just started to think about it.

windows: re dos 6.0 in article 1qh61o, (russ sharp) wrote it's absolutely ludicrous for me to try and run dos 6.0 without the bloody help of at least 8 people. i've tried compiling it on several systems, and i've run it six times without a problem. dos 6.0 didn't mention a config.sys or anything else. there were a couple other windows' manuals which did mention about config.sys.

Conclusions and Future Work
Through this work, we introduced the task of coarse-to-fine grained classification and laid out its significance. Next, we showed the promise of incorporating pre-trained language models like GPT-2 into a weak supervision strategy which starts out with just label surface names. Finally, we showed a way to attune these models to our task even better, through explicit regularization based on coarse-fine label constraints which fall naturally out of our task definition. We outperform multiple SOTA zero-shot baselines on NYT and 20News, underscoring the utility both of incorporating pre-trained language models and of injecting task constraints.
We believe exploring newer ways of exploiting task agnostic knowledge sources and injecting task constraints into the weakly supervised learning process are promising avenues for future work.