Weakly-supervised Text Classification Based on Keyword Graph

Weakly-supervised text classification has received much attention in recent years, for it can alleviate the heavy burden of annotating massive data. Among the existing approaches, keyword-driven methods are the mainstream, where user-provided keywords are exploited to generate pseudo-labels for unlabeled texts. However, existing methods treat keywords independently and thus ignore the correlation among them, which would be useful if properly exploited. In this paper, we propose a novel framework called ClassKG that explores keyword-keyword correlation on a keyword graph via GNN. Our framework is an iterative process. In each iteration, we first construct a keyword graph, so that the task of assigning pseudo-labels is transformed into annotating keyword subgraphs. To improve the annotation quality, we introduce a self-supervised task to pretrain a subgraph annotator, and then finetune it. With the pseudo-labels generated by the subgraph annotator, we then train a text classifier to classify the unlabeled texts. Finally, we re-extract keywords from the classified texts. Extensive experiments on both long-text and short-text datasets show that our method substantially outperforms the existing ones.


Introduction
Text classification is one of the most fundamental tasks in natural language processing (NLP). In real-world scenarios, labeling massive texts is time-consuming and expensive, especially in specific areas that need domain experts to participate. Weakly-supervised text classification (WTC) has received much attention in recent years because it can substantially reduce the workload of annotating massive data. Among the existing methods, the mainstream form is keyword-driven (Agichtein and Gravano, 2000; Riloff et al., 2003; Kuipers et al., 2006; Meng et al., 2018, 2019, 2020; Mekala and Shang, 2020; Wang et al., 2021; Shen et al., 2021), where the users need only to provide some keywords for each class. Such class-relevant keywords are then used to generate pseudo-labels for unlabeled texts.

* The author did most of the work during an internship at Alibaba.
† Corresponding author

Figure 1: (a) Existing methods do not consider the correlation among keywords, which generates a wrong pseudo-label for text B. (b) Our method exploits the correlation among keywords by GNN over a keyword graph, and converts the task of assigning pseudo-labels for unlabeled texts to annotating subgraphs, which leads to much better performance.
Keyword-driven methods usually follow an iterative process: generating pseudo-labels using keywords, building a text classifier, and updating the keywords or self-training the classifier. Among them, the most critical step is generating pseudo-labels. Most existing methods generate pseudo-labels by counting keywords, with which the pseudo-label of a text is determined by the category having the most keywords in the text.
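The keyword-counting rule described above can be sketched in a few lines; this is an illustrative toy (class and keyword names taken from Fig. 1, not from any released code):

```python
from collections import Counter

def count_pseudo_label(text, keyword_sets):
    """Assign the class whose keywords occur most often in the text.

    keyword_sets maps class name -> set of keywords. Returns None when
    no keyword is hit (such texts are typically skipped)."""
    tokens = text.lower().split()
    counts = Counter({cls: sum(tokens.count(w) for w in kws)
                      for cls, kws in keyword_sets.items()})
    if max(counts.values()) == 0:
        return None
    return counts.most_common(1)[0][0]

kws = {"computer": {"windows", "microsoft"}, "traffic": {"car"}}
# Text A from Fig. 1: two "computer" keywords, so counting succeeds.
assert count_pseudo_label("microsoft windows runs fast", kws) == "computer"
# For text B ("windows" of a "car"), the two classes tie at one keyword
# each: counting cannot use the co-occurrence to resolve the ambiguity.
```

The tie in the second case is exactly the failure mode discussed next: counting treats each keyword in isolation.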
However, one major drawback of these existing methods is that they treat keywords independently and thus ignore their correlation. Actually, such correlation is important for the WTC task if properly exploited, as a keyword may imply different categories when it co-occurs with different keywords in a text. As shown in Fig. 1, suppose the users provide the keywords "windows" and "microsoft" for the class "computer" and "car" for the class "traffic". When "windows" and "microsoft" appear together in text A, "windows" means the operating system, and text A should be given the pseudo-label "computer". However, when "windows" meets "car" in text B, "windows" means the windows of a car, and text B should be given the pseudo-label "traffic". With simple keyword counting, text A gets a correct pseudo-label, but text B does not. Therefore, treating keywords independently is problematic.
In this paper, we solve the above problem with a novel iterative framework called ClassKG (the abbreviation of Classification with Keyword Graph), where the keyword-keyword relationships are exploited by GNN on a keyword graph. In our framework, the task of assigning pseudo-labels to texts using keywords is transformed into annotating keyword subgraphs. Specifically, we first construct a keyword graph G with all provided keywords as nodes, where each keyword node updates itself via its neighbors. With G, any unlabeled text T corresponds to a subgraph G_T of G, and assigning a pseudo-label to T is converted to annotating the subgraph G_T. To accurately annotate subgraphs, we adopt a paradigm of first self-supervised pretraining and then finetuning. During keyword interaction, keyword information is propagated and incorporated contextually. We design a self-supervised pretext task that is relevant to the downstream task, with which the finetuning procedure is able to generate more accurate pseudo-labels for unlabeled texts. Texts that contain no keywords are ignored. With the pseudo-labels, we train a text classifier to classify all the unlabeled texts. Based on the classification results, we re-extract the keywords, which are used in the next iteration.
Furthermore, we notice that some existing methods employ simple TF-IDF-like schemes for re-extracting keywords, which makes the extracted keywords have low coverage and discrimination over the unlabeled texts. Therefore, we develop an improved keyword extraction algorithm that can extract more discriminative keywords covering more unlabeled texts, with which more accurate pseudo-labels can be inferred.
In summary, our contributions are as follows:
• We propose a new framework, ClassKG, for weakly supervised text classification, where the correlation among different keywords is exploited via GNN over a keyword graph, and the task of assigning pseudo-labels for unlabeled texts is transformed into annotating keyword subgraphs on the keyword graph.
• We design a self-supervised training task on the keyword graph, which is relevant to the downstream task and thus can effectively improve the accuracy of subgraph annotating.
• We conduct extensive experiments on both long text and short text benchmarks. Results show that our method substantially outperforms the existing ones.

Related Work
Here we review the related works, including weakly-supervised text classification and self-supervised learning.

Weakly-Supervised Text Classification
Weakly-supervised text classification (WTC) aims to use various weak supervision signals to perform text classification. The weak supervision signals used by existing methods include external knowledge bases (Gabrilovich et al., 2007; Chang et al., 2008; Song and Roth, 2014; Yin et al., 2019), keywords (Agichtein and Gravano, 2000; Riloff et al., 2003; Kuipers et al., 2006; Tao et al., 2015; Meng et al., 2018, 2019, 2020; Mekala and Shang, 2020; Wang et al., 2021; Shen et al., 2021) and heuristic rules (Ratner et al., 2016, 2017; Badene et al., 2019; Shu et al., 2020). In this paper, we focus on keyword-driven methods. Among them, WeSTClass (Meng et al., 2018) introduces a self-training module that bootstraps on real unlabeled data for model refining. WeSHClass (Meng et al., 2019) extends WeSTClass to hierarchical labels. LOTClass (Meng et al., 2020) uses only label names as the keywords. ConWea (Mekala and Shang, 2020) leverages a contextualized corpus to disambiguate keywords. However, all these methods treat keywords independently and thus ignore their correlation, which is actually useful information for generating pseudo-labels. Different from the existing methods, we exploit keyword correlation by applying GNN to a keyword graph, which can significantly boost the quality of pseudo-labels.

Self-supervised Learning
Self-supervised learning exploits the internal structure of data and formulates predictive tasks to learn good data representations. The key idea is to define a pretext task and automatically generate surrogate training samples to train a model. A wide range of pretext tasks have been proposed. For images, self-supervised strategies include predicting missing parts of an image (Pathak et al., 2016), patch orderings (Noroozi and Favaro, 2016) and instance discrimination (He et al., 2020). For texts, the tasks can be masked language modeling (Devlin et al., 2019), sentence order prediction (Lan et al., 2020) and sentence permutation (Lewis et al., 2020). For graphs, the pretext tasks can be contextual property prediction (Rong et al., 2020) and attribute and edge generation (Hu et al., 2020b). However, up to now there are only a few works on self-supervised learning for subgraph representation (Jiao et al., 2020; Qiu et al., 2020), where a contrastive loss is used for subgraph instance discrimination, and the downstream tasks they serve are mainly whole-graph classification or node classification, instead of subgraph classification. Here, we design a new pretext task based on the keyword graph to learn better representations of keyword subgraphs, with which the accuracy of pseudo-label generation is improved, and consequently classification performance is boosted.

Problem Definition
The input data contain two parts: (a) a set of user-provided initial keywords S = {S_1, S_2, ..., S_C} for C categories, where S_i = {w_1^i, w_2^i, ..., w_{k_i}^i} denotes the k_i keywords of category i; (b) a set of n unlabeled texts U = {U_1, U_2, ..., U_n} falling into C classes. Our aim is to build a text classifier and assign labels to the unlabeled texts U.

Framework
Fig. 2 shows the framework of our method, which follows an iterative paradigm. In each iteration, we first build a keyword graph G based on the co-occurrence relationships between keywords from all classes. Each keyword node aggregates information from its neighbors. An unlabeled text then corresponds to a subgraph of G, and annotating unlabeled texts is converted to annotating the corresponding subgraphs. To obtain a high-quality subgraph annotator A, we first train A with a designed self-supervised task on G, then finetune it with noisy labels. After that, unlabeled texts containing keywords are mapped to subgraphs, which are annotated by A to generate pseudo-labels. With the pseudo-labels, we train a text classifier to classify all unlabeled texts. Based on the classification results, keywords are re-extracted and updated for the next iteration, until the keywords change little.

Keyword Graph Construction
To model the relationships among keywords, we construct a keyword graph G = (V, E) by representing keywords as vertices and co-occurrences between keywords as edges.
For the vertices V, the embedding of a keyword node is initialized as the concatenation

v = [v_class ; v_index],

where v_class is the one-hot embedding of the keyword's class, v_index is the one-hot embedding of the keyword's index, and C is the number of classes (the dimension of v_class).
For the edges E, if keywords w_i and w_j occur in that order in an unlabeled text, there is a directed edge from w_i to w_j, and we take their co-occurrence frequency F_ij in the unlabeled texts as the edge attribute. Considering the limited number of keywords contained in a text, we do not use any sliding window to limit the number of edges.
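The construction above can be sketched with plain dictionaries; this is a simplified illustration (whitespace tokenization, single-word keywords), not the released implementation:

```python
from collections import defaultdict

def build_keyword_graph(texts, keywords):
    """Directed co-occurrence graph: edge (w_i -> w_j) with weight F_ij,
    counting how often w_i precedes w_j in the same text. No sliding
    window is used, matching the construction described above."""
    kw = set(keywords)
    edges = defaultdict(int)                 # (w_i, w_j) -> F_ij
    for text in texts:
        hits = [t for t in text.lower().split() if t in kw]
        for i in range(len(hits)):
            for j in range(i + 1, len(hits)):
                if hits[i] != hits[j]:       # skip self-loops
                    edges[(hits[i], hits[j])] += 1
    return edges

texts = ["microsoft ships windows", "roll the car windows down"]
g = build_keyword_graph(texts, ["windows", "microsoft", "car"])
# g records the directed edges (microsoft -> windows) and (car -> windows).
```

The node set of a text's subgraph is simply the keywords it hits; the edge weights F_ij are reused later for the random-walk transition probabilities.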
With the keyword graph G, keyword feature information is propagated and aggregated by GNN.

Subgraph Annotator Training
With the keyword graph, an unlabeled text is converted into a subgraph of G. Specifically, the keywords in a text hit a set of vertices in G, and the subgraph is the induced subgraph of the hit vertices in G. Assigning pseudo-labels to unlabeled texts is thus equivalent to annotating the corresponding subgraphs, which is a graph-level classification problem. In practice, we employ the graph isomorphism network (GIN) (Xu et al., 2019) as our subgraph annotator to perform node feature propagation and subgraph readout. The keyword features are propagated and aggregated as follows:

h_v^{(k)} = MLP^{(k)}( (1 + ε^{(k)}) · h_v^{(k-1)} + Σ_{u ∈ N(v)} h_u^{(k-1)} ),

where h_v^{(k)} denotes the representation of node v after the k-th update, MLP^{(k)} is a multi-layer perceptron in the k-th layer, ε^{(k)} is a learnable parameter, and N(v) denotes all the neighbors of node v. Then, we perform readout to obtain the subgraph representation:

h_G = CONCAT( Σ_{v ∈ G} h_v^{(k)} | k = 0, 1, ..., K ),

i.e., GIN concatenates the sums of all node features from the same layer as the subgraph representation.
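The GIN update and readout can be illustrated with a small NumPy sketch. This is a toy: a single linear map with ReLU stands in for each MLP, ε is fixed to 0, and the weights are random rather than learned:

```python
import numpy as np

def gin_layer(H, A, eps, W):
    """One GIN update (Xu et al., 2019):
    h_v <- MLP((1 + eps) * h_v + sum of neighbor features).
    A single linear map + ReLU stands in for the MLP here."""
    agg = (1.0 + eps) * H + A @ H    # row v of A @ H sums the neighbors of v
    return np.maximum(agg @ W, 0.0)  # ReLU

def gin_readout(layer_outputs):
    """Graph-level readout: concatenate the per-layer sums of node features."""
    return np.concatenate([H.sum(axis=0) for H in layer_outputs])

rng = np.random.default_rng(0)
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])                  # 3-node path subgraph
H = rng.normal(size=(3, 4))                   # initial node embeddings
outputs = [H]                                 # layer 0 = raw features
for W in [rng.normal(size=(4, 4)) for _ in range(3)]:  # three GIN layers
    H = gin_layer(H, A, eps=0.0, W=W)
    outputs.append(H)
g_repr = gin_readout(outputs)                 # subgraph representation, dim 16
```

A classification head on top of g_repr then predicts the subgraph's class, which is the pseudo-label of the corresponding text.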
To train a subgraph annotator with high annotating accuracy, we first train a GIN via a designed self-supervised task, then finetune it.
Figure 2: Our framework follows an iterative paradigm. In each iteration, we first build a keyword graph G, with which unlabeled texts correspond to subgraphs of G, and assigning pseudo-labels to texts is transformed into annotating the corresponding subgraphs. To train the subgraph annotator, we pretrain it with a designed self-supervised pretext task and then finetune it. Then, a text classifier is trained with the pseudo-labels. Based on the classification results of the classifier, keywords are re-extracted and updated for the next iteration.

Self-supervised Training on Graph
As noted in previous self-supervised learning works (Hu et al., 2020a), successful pre-training needs examples and target labels that are correlated with the downstream task of interest; otherwise, it may harm generalization, which is known as negative transfer. Considering that the downstream task is graph-level classification, we design a graph-level self-supervised task that is highly relevant to subgraph annotation. Our self-supervised method is shown in Alg. 1, where the subgraph annotator A learns to predict the class of the start point of a random walk, and the subgraph derived from the random walk is similar to the subgraph generated by an unlabeled text.
To begin with, we randomly sample a keyword w_r from class C_r as the start point of a random walk. The number of random walk steps follows the same Gaussian distribution N(u_s, σ_s²) as the number of keywords appearing in an unlabeled text in U. Therefore, we estimate the parameters of the Gaussian distribution u_s, σ_s² based on U as follows:

u_s = (1/n) Σ_{i=1}^{n} kf(U_i),   σ_s² = (1/n) Σ_{i=1}^{n} (kf(U_i) − u_s)²,

where kf(U_i) is the number of keywords contained in text U_i. Then, we can sample the length L of the random walk from the distribution N(u_s, σ_s²). The probability of walking from node w_i to node w_j is derived from the co-occurrence frequencies by

p_ij = F_ij / Σ_{w_k ∈ N(w_i)} F_ik,

where F_ik is the co-occurrence frequency of w_i followed by w_k, and N(w_i) is the neighbor set of w_i.
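The two estimates above are plain sample statistics, and the transition probabilities are frequency-normalized edge weights; a minimal sketch (with edges stored as a dict, as in the earlier illustration):

```python
import numpy as np

def walk_length_params(texts, keywords):
    """Estimate (u_s, sigma_s^2) from kf(U_i), the number of keyword
    occurrences per unlabeled text."""
    kw = set(keywords)
    kf = np.array([sum(1 for t in text.lower().split() if t in kw)
                   for text in texts], dtype=float)
    return kf.mean(), kf.var()

def transition_probs(edges, w_i):
    """p_ij = F_ij / sum_k F_ik over the out-neighbors of w_i,
    with edges given as a dict (w_i, w_j) -> F_ij."""
    out = {j: f for (i, j), f in edges.items() if i == w_i}
    total = sum(out.values())
    return {j: f / total for j, f in out.items()}

u_s, var_s = walk_length_params(["the car windows", "my car"],
                                ["windows", "car"])   # kf = [2, 1]
p = transition_probs({("windows", "car"): 3,
                      ("windows", "microsoft"): 1}, "windows")
# p gives walking probabilities 0.75 to "car" and 0.25 to "microsoft".
```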

Algorithm 1 Self-supervision on Keyword Graph
Input: keyword graph G, unlabeled texts U, Gaussian parameters u_s, σ_s², edge probabilities p_ij
Output: pretrained subgraph annotator A
1: repeat
2:    Randomly sample a class C_r.
3:    Sample a keyword w_r from class C_r.
4:    Sample a walk length L from N(u_s, σ_s²).
5:    Perform a random walk on G, with w_r as the start point, p_ij as the transition probabilities, and L as the length, obtaining a subgraph G_r.
6:    With G_r as the input of A and C_r as the prediction target, compute the loss.
7:    Compute the gradient and update the parameters of A.
8: until convergence

Then, we start from node w_r to perform an L-step random walk. In each step, p_ij determines the probability of jumping from w_i to neighbor w_j. At the end of the random walk, we obtain a subgraph G_r, which is the induced subgraph of the traversed nodes in the keyword graph G.
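The walk-and-label sampling (steps 2-5 of Alg. 1) can be sketched as follows. For brevity this toy samples the start keyword uniformly rather than class-then-keyword, and uses the simplified dict-based graph from the earlier illustrations:

```python
import random

def sample_pretext_example(edges, class_of, length, rng):
    """One self-supervised training pair: a weighted random walk from a
    random keyword, returning the visited node set (which induces the
    subgraph G_r) and the start keyword's class as the prediction target.

    edges maps (w_i, w_j) -> F_ij; class_of maps keyword -> class id."""
    start = rng.choice(sorted(class_of))      # sample a start keyword
    visited = {start}
    cur = start
    for _ in range(length):
        out = [(j, f) for (i, j), f in edges.items() if i == cur]
        if not out:                           # dead end: stop early
            break
        nodes, weights = zip(*out)
        cur = rng.choices(nodes, weights=weights, k=1)[0]
        visited.add(cur)
    return visited, class_of[start]

edges = {("windows", "microsoft"): 2, ("microsoft", "windows"): 1,
         ("windows", "car"): 3}
class_of = {"windows": 0, "microsoft": 0, "car": 1}
nodes, target = sample_pretext_example(edges, class_of, length=4,
                                       rng=random.Random(0))
```

Each sampled pair (G_r, C_r) is one training example for the annotator, mimicking the subgraph a keyword-bearing text would induce.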
Our self-supervised task takes G_r as the input of A and makes A learn to predict the class of the start point w_r. The loss function is defined as the negative log likelihood of C_r:

L_self = − log P(C_r | A(G_r)).

Finetuning
After pre-training the subgraph annotator A, we finetune it for a few epochs. The labels for finetuning are generated by voting as follows:

y_i = argmax_k Σ_{w_j ∈ S_k} tf(w_j, U_i),

where tf(w_j, U_i) denotes the term frequency (TF) of keyword w_j in text U_i. The loss function is defined as:

L_ft = − Σ_i log P(y_i | A(G_i)),

where G_i is the subgraph of text U_i. Note that the number of epochs for finetuning cannot be too large, otherwise the annotator may degenerate into voting.
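The TF-voting rule for finetuning labels can be sketched directly (illustrative toy with whitespace tokenization, not the released code):

```python
def vote_label(text, keyword_sets):
    """Finetuning target by TF voting: the class k maximizing the summed
    term frequencies of class-k keywords in the text, as described above.

    keyword_sets maps class name -> set of keywords."""
    tokens = text.lower().split()
    scores = {cls: sum(tokens.count(w) for w in kws)
              for cls, kws in keyword_sets.items()}
    return max(scores, key=scores.get)

kws = {"computer": {"windows", "microsoft"}, "traffic": {"car"}}
# TF of "computer" keywords is 3 vs. 1 for "traffic", so the vote is "computer".
label = vote_label("microsoft microsoft windows car", kws)
```

These noisy voted labels only bootstrap the finetuning; the pretrained annotator is expected to improve on them by exploiting keyword co-occurrence.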

Text Classifier
After training the subgraph annotator A, we use it to annotate all the unlabeled texts U and generate pseudo-labels, which are used to train a text classifier. Texts containing no keywords are ignored.
Our framework is compatible with any text classifier. We use Longformer (Beltagy et al., 2020) as the long-text (document) classifier and BERT (Devlin et al., 2019) as the short-text (sentence) classifier. Following previous works (Meng et al., 2018, 2020), we self-train (Rosenberg et al., 2005) the classifier on all unlabeled texts. The labels predicted for all unlabeled texts by the text classifier are then used to re-extract keywords.

Keywords Extraction
Considering that the coverage and accuracy of user-provided keywords are limited, we re-extract keywords based on the predictions of the text classifier in each iteration. Existing methods use indicators such as term frequency (TF), inverse document frequency (IDF) and their combinations (Mekala and Shang, 2020) to rank words, and a few top-ranked ones are taken as keywords. However, they treat all indicators equally and are prone to selecting common, low-information words. Here, we employ an improved TF-IDF scheme that increases the significance of IDF to reduce the scores of common words. The score of word w_i in class C_k is evaluated as follows:

score(w_i, C_k) = tf(w_i, C_k) · idf(w_i)^M,

where tf(w_i, C_k) is the frequency of w_i in the texts classified into C_k, idf(w_i) is its inverse document frequency over all texts, and M is a hyperparameter. According to the score, we select the top Z words in each category as the keywords for the next iteration.
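A minimal sketch of this scoring idea is given below. It is an illustration of the TF·IDF^M principle described above, not the paper's exact formula: the smoothing term and the non-negativity clamp on IDF are assumptions made here to keep the toy well-behaved.

```python
import math

def keyword_score(word, cls_texts, all_texts, M=4):
    """Score a candidate keyword for one class: term frequency within the
    class's predicted texts times IDF raised to the power M. Raising the
    IDF power suppresses common words. Illustrative only: the +1 smoothing
    and the clamp of IDF at zero are assumptions of this sketch."""
    tf = sum(text.lower().split().count(word) for text in cls_texts)
    df = sum(1 for text in all_texts if word in text.lower().split())
    idf = max(0.0, math.log(len(all_texts) / (1 + df)))
    return tf * (idf ** M)

docs = ["gpu accelerates training", "the weather is nice"] + ["the sky"] * 8
# A rare, class-indicative word scores well; a common word is driven to zero.
rare = keyword_score("gpu", ["gpu accelerates training"], docs)
common = keyword_score("the", docs, docs)
```

Ranking all candidate words by this score per class and keeping the top Z implements the extraction step.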
To determine whether the model has converged, we define the change of keywords as:

∆ = 1 − |S_{T_i} ∩ S_{T_{i−1}}| / |S_{T_i}|,

where S_{T_i} is the keyword set of the i-th iteration. If ∆ < ε (a hyperparameter), the iteration stops.

Datasets

We follow (Meng et al., 2020) and use the label names as initial keywords. The evaluation is performed on the test set. For all classes, we use no more than four keywords per category.

Compared Methods
We compare our method with a wide range of weakly-supervised text classification methods: 1) IR-TF-IDF evaluates the relevance between documents and labels by the aggregated TF-IDF values of keywords; documents are assigned labels based on their relevance to the labels. 2) Dataless (Chang et al., 2008) … similarities. 5) BERT count simply counts keywords to generate pseudo-labels for training BERT. 6) WeSTClass (Meng et al., 2018) generates pseudo-documents to train a classifier and bootstraps the model with self-training. 7) LOTClass (Meng et al., 2020) utilizes only label names to perform classification; it uses a pre-trained LM to find class-indicative words and generalizes the model via self-training. 8) ConWea (Mekala and Shang, 2020) leverages BERT to generate contextualized representations of words, which are further utilized to train the classifier and expand the seed words.

Experimental Settings
The training and evaluation are performed on an NVIDIA RTX 2080Ti. In the subgraph annotator, we use a three-layer GIN (Xu et al., 2019). We first train it with our self-supervised task for 10^6 iterations and then finetune it for 10 epochs. We set the batch size of self-supervision/finetuning to 50/256. In classifier training, we set the batch size to 4/8 for long/short texts. Both the subgraph annotator and the text classifier use AdamW (Loshchilov and Hutter, 2019) as the optimizer, with learning rates of 1e-4 and 2e-6, respectively. The classifier uses bert-base-uncased for short texts and longformer-base-4096 for long texts. For keyword extraction, we select the top 100 keywords per class in each iteration. The hyperparameter M is set to 4. The keyword-set change threshold ε is set to 0.1. Our code has been released.

Performance Comparison
Long text datasets. The evaluation results are summarized in Table 2. Since the datasets are imbalanced, we use micro-f1 and macro-f1 as evaluation metrics. As we can see, our method achieves SOTA, outperforming existing weakly supervised methods in most cases. On 20Newsgroup, which is a much harder dataset, our method exceeds the previous SOTA on all metrics by a large margin: the gap is 13% in fine-grained classification and 18% in coarse-grained classification. Although the NYT dataset is relatively simple, our model still has an advantage on three of the four metrics, achieving over 1% improvement on three metrics and degrading slightly only on the macro-f1 of coarse-grained classification, due to the extreme imbalance of categories.

Short text datasets. Results on short text datasets are shown in Table 3. We follow previous works (Meng et al., 2020) in using accuracy as the metric. We can see that our method outperforms SOTA on all datasets; for DBPedia in particular, the improvement is up to 6.9%. With only label names as initial keywords, our method achieves almost 90% accuracy on all datasets.

Ablation Study
Here, we check the effects of various components and parameters in our framework. Experiments are conducted on 20News with fine-grained labels.

Effectiveness of Subgraph Annotator
To verify the effectiveness of the subgraph annotator A, we compare the results with/without A and with/without self-supervised learning (SSL). For the case without A, we use keyword counting to generate pseudo-labels, as widely done in previous works. For the case without self-supervision, we directly finetune the subgraph annotator without self-supervised pretraining. The results of the first 6 iterations are illustrated in Fig. 3.
We can see that: 1) our method with all components performs much better than the other variants, proving the effectiveness of exploiting the correlation among keywords; 2) for the case using keyword counting, since the correlation among keywords is ignored, the micro/macro-f1 of the pseudo-labels is the worst, which leads to the worst classification performance; 3) the case with finetuning but no self-supervised learning outperforms keyword counting by 2.5% and 1.7% on micro-f1 and macro-f1 of the pseudo-labels in the 6th iteration, respectively, which further leads to 3.6% and 3.2% gains on micro/macro-f1 of classification performance; 4) our self-supervised learning task further boosts performance, exceeding the case without SSL by a large margin of 3.5% and 4.4% in micro/macro-f1 of the pseudo-labels, and 4.0% and 4.9% in classification performance.

Subgraph Annotator Implementation
We use different GNNs to implement the subgraph annotator, including GCN (Kipf and Welling, 2017), GAT (Veličković et al., 2018) and GIN (Xu et al., 2019). For GCN and GAT, we read out the subgraph by averaging all node features in the last layer. For a fair comparison, the number of layers is set to 3 for all GNNs. The performance comparison is given in Table 4. We can see that the performance of the subgraph annotator is highly related to the selected GNN model, and a more powerful GNN model leads to higher annotation accuracy.

Effect of the Number of Keywords
Here, we check the effect of the number of extracted keywords per class Z. We vary Z and show the results in Fig. 4. We can see that: 1) since more keywords hit more texts, extracting more keywords results in higher text coverage; 2) the change (∆) of keywords falls below the threshold (0.1) at the 3rd update for all three keyword-number settings, suggesting that Z has little effect on the number of iterations needed for convergence; 3) increasing the number of keywords from 50 to 100 brings a great performance improvement, while more keywords (Z = 300) make little difference.

Effect of the IDF Power M
We check the effect of the IDF power M in Eq. (9) by changing its value from 1 to 7 for extracting keywords, based on which we train the subgraph annotator and report the micro-f1 of subgraph annotation and the coverage of unlabeled texts. Results of the 1st and 2nd iterations are shown in Fig. 5. As M increases, the labeling accuracy also increases, but the coverage decreases. This is because a larger M makes the algorithm extract more uncommon words, improving accuracy while reducing coverage.

Effect of Text Classifier Implementation
Our framework is compatible with any classifier. Here, we replace the Longformer classifier (Beltagy et al., 2020) for long texts with the HAN (Pappas and Popescu-Belis, 2017) classifier. Results are shown in Table 5. As we can see, our framework with the HAN classifier still achieves good performance, surpassing SOTA by 7% in micro/macro-f1.

Effects of Hyperparameters
Here, we check the effects of two hyperparameters: the number of GIN layers and the number of epochs for finetuning the subgraph annotator. The results of the 1st and 2nd iterations are shown in Fig. 6. We can see that the accuracy of the subgraph annotator decreases slightly as the number of GIN layers increases, which may be due to over-smoothing. As finetuning proceeds, the labeling accuracy also decreases slightly, which may be caused by overfitting.

Case Study
Here we present a case study to show the power of our framework, taking the technology class in the AG News dataset as an example. In the beginning, we take "technology" as the initial keyword. At the end of the 1st/2nd iteration, the keywords are updated, and the top 12 keywords are presented in Table 6. All 12 keywords extracted by our method are correct, belonging to the "technology" category. Furthermore, we check the annotation results in the 2nd iteration; some annotations are shown in Table 7. We can see that our annotator produces more accurate pseudo-labels than keyword counting.

Conclusion
In this work, we propose a novel method for weakly-supervised text classification that exploits the correlation among keywords. Our method follows an iterative paradigm. In each iteration, we first build a keyword graph, and the task of assigning pseudo-labels is transformed into annotating keyword subgraphs. To accurately annotate subgraphs, we first train the subgraph annotator with a designed pretext task and then finetune it. The trained subgraph annotator is used to generate pseudo-labels, with which we train a text classifier. Finally, we re-extract keywords from the classification results of the classifier. Our experiments on both long and short text datasets show that our method outperforms the existing ones. In future work, we will focus on improving the proposed method with new mechanisms and network structures.