Towards Open-Domain Topic Classification

We introduce an open-domain topic classification system that accepts a user-defined taxonomy in real time. Users can classify a text snippet with respect to any candidate labels they want and get an instant response from our web interface. To obtain such flexibility, we build the backend model in a zero-shot way. By training on a new dataset constructed from Wikipedia, our label-aware text classifier can effectively utilize implicit knowledge in the pretrained language model to handle labels it has never seen before. We evaluate our model across four datasets from various domains with different label sets. Experiments show that the model significantly improves over existing zero-shot baselines in open-domain scenarios, and performs competitively with weakly-supervised models trained on in-domain data.


Introduction
Text classification is a fundamental natural language processing problem, with one of its major applications in topic labeling (Lang, 1995; Wang and Manning, 2012). Over the past decades, supervised classification models have achieved great success in closed-domain tasks with large-scale annotated datasets (Zhang et al., 2015; Tang et al., 2015; Yang et al., 2016). However, they are no longer effective in open-domain scenarios where the taxonomy is unbounded. Retraining the model for every new label set often incurs prohibitively high cost in terms of both annotation and computation. By contrast, having one classifier that is flexible with unlimited labels can save such tremendous efforts while keeping the solution simple. Therefore, in this work, we build a system for open-domain topic classification that can classify a given text snippet into any categories defined by users.
At the core of our system is a zero-shot text classification model. While supervised models are typically insensitive to class names, a zero-shot model is usually label-aware, meaning that it can understand label semantics directly from the name or definition of the label, without accessing any annotated examples. Our model TE-Wiki combines a Textual Entailment (TE) formulation with Wikipedia finetuning. Specifically, we construct a new dataset that contains three million article-category pairs from Wikipedia's subcategory graph, and finetune a pretrained language model (e.g., BERT) to predict the entailment relations between articles and their associated categories. We simulate the diversity in open-domain classification with the wide coverage of Wikipedia, while preserving label-awareness through an entailment framework.
In our benchmarking experiments, TE-Wiki outperforms all previous zero-shot methods on four benchmarks from different domains. It also shows competitive performance against weakly-supervised models trained on in-domain data. By learning from Wikipedia, our method does not require any data that is specifically collected from the evaluation domains. On the other hand, since our model is label-aware, it can flexibly classify text pieces into any labels outside Wikipedia.
Finally, we compare our system against humans for further insights. Through a crowdsourcing study, we show that even humans are sometimes confused by ambiguous labels, which explains the performance gap between open-domain and supervised classification. The gap is reduced significantly when label meanings are clear and well aligned with the semantics of the text. We also use an example to illustrate the negative effect of a bad label name. Through the analysis, we demonstrate the importance of choosing proper label names in open-domain classification.

Related Work
Open-domain zero-shot text classification was first studied in the NLP domain by Chang et al. (2008) (under the name "dataless classification") as a method that classifies solely based on a general knowledge source and does not require any in-domain data, whether labeled or not. It was proposed to embed both the text and labels into the same semantic space, via Explicit Semantic Analysis, or ESA (Gabrilovich and Markovitch, 2007), and pick the label with the highest relevance score. This idea was further extended to hierarchical (Song and Roth, 2014) and cross-lingual (Song et al., 2016) text classification. Later on, Yin et al. (2019) called this protocol "label fully unseen" and proposed an entailment approach to transfer knowledge from textual entailment to text classification. It formulates an n-class classification problem as n binary entailment problems by converting labels into hypotheses and the text into the premise, and selects the premise-hypothesis pair with the highest entailment score. More recently, a concurrent work (Chu et al., 2021) proposed exploring resources from Wikipedia for zero-shot text classification, but with a different formulation.
There are many other methods that also require less labeling than supervised classification, though in slightly different settings. For example, previous works have explored generalizing from a set of known classes (with annotation) to unknown classes (without annotation) using word embeddings of label names (Pushp and Srivastava, 2017; Xia et al., 2018; Liu et al., 2019), class correlation on knowledge graphs (Rios and Kavuluru, 2018; Zhang et al., 2019), or joint embeddings of documents and labels (Nam et al., 2016). In addition, weakly-supervised approaches (Mekala and Shang, 2020; Meng et al., 2020) learn from an unlabeled, but in-domain, training set. Given a set of predefined labels, a label-aware knowledge mining step is first applied to find class-specific indicators from the corpus, followed by a self-training step that further enhances the model by propagating the knowledge to the whole corpus. However, none of these approaches is suitable for building an open-domain classification system: they either require domain-specific annotation or need to know the test labels beforehand.

System Description
We present details about our open-domain topic classification system, starting with an overview of our web interface, followed by the backend model.

User Interface
Figure 1 is a snapshot of our online demo. The system is supported by multiple backend models for testing and comparison. Among them, "Bert-Wiki", corresponding to TE-Wiki in this paper, is the best-performing one in our evaluation. After selecting the model(s), users can create their own taxonomy in the "Labels" column, and input the text snippet. The system will then classify the text with the user-defined taxonomy. Results are presented in two formats: a bar chart and a ranking table. The table on the right provides a clear view of rankings by each model, while the bar chart on the left is useful to compare the scale of the scores from different models for different labels. These scores, ranging from 0 to 1, are probabilities of the label being relevant to the text, which we will explain further in the next section.
Consider the example in Figure 1. The input text is most relevant to lifestyle, somewhat relevant to technology, and irrelevant to children, which aligns with the prediction of our "Bert-Wiki" model.

TE-Wiki
We now describe our best-performing model, TE-Wiki. Previous work (Yin et al., 2019) has demonstrated that an n-way classification problem can be converted into n binary entailment problems. Specifically, we can use the text as the premise, and candidate labels as the hypotheses, to generate n premise-hypothesis pairs. The motivation is that classification is essentially a special kind of entailment. Suppose we want to classify a document into 3 classes: politics, business, sports. We can ask three binary questions: "Is it about politics?", "Is it about business?", "Is it about sports?". By doing so, the model is no longer constrained to a fixed label set, as we can always ask more questions to handle new labels.
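The conversion described above can be illustrated with a minimal sketch. The function name `build_entailment_pairs` is hypothetical, and the label name itself is used as the hypothesis, which is an assumption made here for simplicity.

```python
from typing import List, Tuple


def build_entailment_pairs(text: str, labels: List[str]) -> List[Tuple[str, str]]:
    """Turn one n-way classification instance into n (premise, hypothesis) pairs.

    The text is the premise; each candidate label is one hypothesis.
    """
    return [(text, label) for label in labels]


document = "The senate passed the new budget bill after a long debate."
labels = ["politics", "business", "sports"]

# Three labels give three binary entailment questions.
pairs = build_entailment_pairs(document, labels)

# A new label only adds one more question; nothing else changes.
extended_pairs = build_entailment_pairs(document, labels + ["health"])
```

Each pair would then be scored independently by a binary entailment model, so the label set can grow without changing the model.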
With the above framework, it is straightforward to train a model on an entailment dataset (e.g., MNLI (Williams et al., 2018), FEVER (Thorne et al., 2018), RTE (Dagan et al., 2005; Wang et al., 2019)) and use it for classification. However, this may not be the optimal choice, as topic classification only focuses on high-level concepts, while textual entailment has a much wider scope and involves many other aspects (e.g., see (Dagan et al., 2013)). Therefore, we propose to construct a new dataset from Wikipedia with articles as premises, and categories as hypotheses. Our desired training pair should meet the following two criteria:
1. The hypothesis is consistent with the premise, i.e., the categorization is correct.
2. The hypothesis should be abstract and concise to reflect the high-level idea of the premise, rather than focus on certain details.
Directly using all the categories associated with an article satisfies the first criterion, but fails the second, as some of them do not represent the article well. For example, the page Bill Gates is assigned Category:Cornell family, which is correct about the person but probably not a suitable label for the whole article. To resolve the issue, we instead use higher-level categories on Wikipedia's subcategory graph to yield better hypotheses. The overview of TE-Wiki is illustrated in Figure 2. Specifically, we start with a set of 700 top-level categories from Wikipedia's overview page as roots. For each of them, we run a depth-first search (DFS) to find its subcategories. In our experiment, we set the max depth to 2 to ensure the subcategories found are strongly affiliated with the root. We collect all member articles of categories in the DFS tree, including both leaves and internal nodes, and pair them with the root to construct positive examples. In case an article can be reached from multiple root categories, we only pair it with the root(s) that have the smallest tree distance to the article to ensure supervision quality. Then, for each article, we randomly choose a different category to construct a negative example. While we have tried more sophisticated negative sampling strategies with the aim of confusing the model, none of them makes a significant improvement. Thus, we keep this simple version. The final training set $D = \{(x_i, c_i, p_i)\}_{i=1}^{n}$ consists of 3-tuples such that $x_i$ is a Wikipedia article, $c_i$ is the corresponding high-level category name, and $p_i \in \{+1, -1\}$ is the label. The procedure for constructing the training set is summarized in Algorithm 1.
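The data collection procedure can be sketched in a few lines of Python. This is an illustrative sketch on a toy graph, not our released implementation: the function name `collect_training_data`, the dictionary-based graph representation, and the toy categories and articles are all assumptions made for the example.

```python
import random
from typing import Dict, List, Tuple


def collect_training_data(
    roots: List[str],
    subcats: Dict[str, List[str]],   # category -> direct subcategories
    articles: Dict[str, List[str]],  # category -> member articles
    max_depth: int = 2,
    seed: int = 0,
) -> List[Tuple[str, str, int]]:
    rng = random.Random(seed)
    dist: Dict[str, Dict[str, int]] = {}  # article -> root -> tree distance

    for root in roots:
        # Depth-limited DFS, keeping the shallowest depth seen per category.
        depth_of: Dict[str, int] = {}
        stack = [(root, 0)]
        while stack:
            cat, depth = stack.pop()
            if cat in depth_of and depth_of[cat] <= depth:
                continue
            depth_of[cat] = depth
            if depth < max_depth:
                stack.extend((child, depth + 1) for child in subcats.get(cat, []))
        # Articles sit one step below the category that contains them.
        for cat, depth in depth_of.items():
            for art in articles.get(cat, []):
                per_root = dist.setdefault(art, {})
                per_root[root] = min(per_root.get(root, depth + 1), depth + 1)

    examples: List[Tuple[str, str, int]] = []
    for art in sorted(dist):
        closest = min(dist[art].values())
        positives = sorted(c for c, d in dist[art].items() if d == closest)
        examples.extend((art, c, 1) for c in positives)
        # One random negative drawn from the roots that are not positives.
        candidates = [c for c in roots if c not in positives]
        if candidates:
            examples.append((art, rng.choice(candidates), -1))
    return examples


# Toy graph (hypothetical categories and articles).
roots = ["Sports", "Technology"]
subcats = {
    "Sports": ["Ball games"],
    "Ball games": ["Football"],
    "Football": ["Football clubs"],  # depth 3 from "Sports": out of reach
    "Technology": ["Computing"],
}
articles = {
    "Sports": ["Esports"],
    "Football": ["FIFA World Cup"],
    "Football clubs": ["Some Club"],
    "Computing": ["Compiler", "Esports"],
}
data = collect_training_data(roots, subcats, articles)
```

In the toy graph, "Esports" is reachable from both roots but is closer to "Sports", so it is only paired positively with "Sports"; "Some Club" lies beyond the depth limit and yields no example.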
We then fine-tune the pre-trained BERT model (Devlin et al., 2019) with the collected dataset. Given a tuple $(x_i, c_i, p_i)$, the concatenation of $x_i$ and $c_i$ is passed to a BERT encoder, followed by a classification head to predict whether the article $x_i$ belongs to the category $c_i$. At test time, (i) for the single-labeled case, we pick the label with the highest predicted probability; (ii) for the multi-labeled case, we pick all labels predicted as positive (i.e., probability > 0.5). For consistency with training, we do not use any hypothesis template to convert label names into sentences as in (Yin et al., 2019).
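The two test-time decision rules can be written down directly. The sketch below assumes the per-label probabilities have already been produced by the model; the function names and the probability values are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def predict_single(probs: Dict[str, float]) -> str:
    """Single-labeled case: return the label with the highest probability."""
    return max(probs, key=lambda label: probs[label])


def predict_multi(probs: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Multi-labeled case: return every label whose probability exceeds the threshold."""
    return [label for label, p in probs.items() if p > threshold]


# Hypothetical model outputs for one text snippet.
probs = {"lifestyle": 0.91, "technology": 0.62, "children": 0.03}

top_label = predict_single(probs)
positive_labels = predict_multi(probs)
```

Because each label is scored independently, the probabilities need not sum to 1, which is what makes the thresholded multi-label rule possible.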

Evaluation
We evaluate all the backend models of our system on four classification benchmarks to compare their performance. We also compare them against weakly-supervised and supervised models to quantify how much we can achieve without any domain-specific training data.

Experiment setup
Datasets: We summarize all test datasets in Table 1. For Yahoo! Answers, we use the reorganized train/test split of Yin et al. (2019). All datasets are in English. Among the four, Situation Typing is a multi-labeled dataset with imbalanced classes, for which we report the weighted average of per-class F1 scores. We refer readers to (Yin et al., 2019) for the class distribution statistics. The other three are single-labeled and class-balanced, and we report the classification accuracy.
Models: Apart from TE-Wiki, we run five zero-shot models for open-domain evaluation, as well as a weakly-supervised and a supervised model for closed-domain comparison.
• Word2Vec (Mikolov et al., 2013): measures the cosine similarity between the embedding vectors of the text and the label.
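To make the embedding-based baseline concrete, the following sketch ranks labels by cosine similarity. It assumes that a text is embedded by averaging its word vectors, which is a common choice but an assumption here; the three-dimensional vectors in `TOY_VECTORS` are made up for illustration and are not real Word2Vec embeddings.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical toy embeddings; a real system would load pretrained vectors.
TOY_VECTORS: Dict[str, List[float]] = {
    "match": [0.9, 0.1, 0.0],
    "goal": [0.8, 0.2, 0.0],
    "sports": [1.0, 0.0, 0.0],
    "stock": [0.0, 0.9, 0.1],
    "business": [0.0, 1.0, 0.0],
}


def embed(text: str, vectors: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of known tokens (assumed pooling strategy)."""
    dim = len(next(iter(vectors.values())))
    tokens = [t for t in text.lower().split() if t in vectors]
    if not tokens:
        return [0.0] * dim
    return [sum(vectors[t][i] for t in tokens) / len(tokens) for i in range(dim)]


def cosine(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_labels(
    text: str, labels: List[str], vectors: Dict[str, List[float]]
) -> List[Tuple[str, float]]:
    """Score each label against the text and sort from most to least similar."""
    text_vec = embed(text, vectors)
    scored = [(label, cosine(text_vec, embed(label, vectors))) for label in labels]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_labels("late goal wins the match", ["business", "sports"], TOY_VECTORS)
```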

Result Analysis
The main results are presented in Table 2. It is possible that some of the testing labels also appear in the Wikipedia categories used for training. To ensure the quality and fairness of our zero-shot evaluation, we remove the overlapping categories from the Wikipedia training data, and retrain the TE-Wiki model for each test set. Specifically, we normalize labels and categories by their lower-cased, lemmatized names, and perform a token-based matching. We report in Table 3 the performance before and after deduplication.
We find that deduplication has little or even positive influence on performance, which shows that TE-Wiki does not rely on seeing test labels during training. In particular, the performance on Yahoo improves with deduplication. We suspect that an exact match between training and testing labels can lead to overfitting, since the same label may have different meanings in different contexts. Notice that this study is only for justifying our zero-shot evaluation. For real-world applications, excluding overlapping categories is neither necessary nor feasible, as the test labels are not known beforehand in zero-shot scenarios.
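A rough sketch of such a deduplication step is given below. Several details are assumptions made for illustration: the crude suffix-stripping `naive_lemma` stands in for a real lemmatizer, a small stop-word list is applied, and a category counts as overlapping if it shares at least one normalized token with any test label.

```python
import re
from typing import FrozenSet, List

STOP_WORDS = {"and", "of", "the", "in"}  # assumed, not from the paper


def naive_lemma(token: str) -> str:
    """Very rough stand-in for a lemmatizer: strips simple plural endings."""
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def normalize(name: str) -> FrozenSet[str]:
    """Lower-case, tokenize, drop stop words, and lemmatize a label or category name."""
    tokens = re.findall(r"[a-z]+", name.lower())
    return frozenset(naive_lemma(t) for t in tokens if t not in STOP_WORDS)


def remove_overlapping(categories: List[str], test_labels: List[str]) -> List[str]:
    """Keep only categories that share no normalized token with any test label."""
    label_tokens = set().union(*(normalize(label) for label in test_labels))
    return [c for c in categories if not (normalize(c) & label_tokens)]


# Hypothetical category and label names.
kept = remove_overlapping(
    ["Sports", "Business", "Chemistry", "Health sciences"],
    ["Sports", "Business & Finance", "Health"],
)
```

A stricter variant could require the full token sets to match instead of a single shared token; which rule is preferable depends on how aggressively overlap should be removed.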

Early stopping and knowledge transfer
To study the convergence of our model, we sample a small dev set of 1000 examples from Yahoo's original validation set. During training, we find that after 25 steps the TE-Wiki model already achieves reasonably good performance on the dev set. Training for more steps yields some additional, but not significant, gains. Since the model has only seen 25 × 64 = 1600 examples at that point, there is little chance for the model to acquire label-specific knowledge from such a small amount of data. Hence, we believe that during the early steps, the model actually learns "what topic classification is about", while the knowledge specific to different labels has already been implicitly stored in the pretrained BERT encoder. The category prediction task plays a minor role in transferring world knowledge. Rather, it teaches the model how to use existing knowledge to make a good inference.

Importance of label names
Since zero-shot classifiers understand a label by its name, the quality of label names can be an important performance bottleneck in designing open-domain text classification systems. To study this, we conduct crowdsourcing surveys on subsets of Yahoo and AG News. For each dataset, we randomly sample 1,000 documents while preserving class balance. Every document is independently annotated by five workers. In the survey question, we only provide the document to be classified and the names of candidate labels, without giving workers examples for each class. We consider an example to be correctly classified by humans only if at least three workers choose the gold label. Details about the survey are in the Appendix.

Table 3: Performance before and after removing the overlapping categories, as well as their difference. We also show the number of removed categories, and the percentage of test documents that belong to the overlapping labels.
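The majority rule used to score human annotations can be expressed as a short sketch. The function names and the toy votes are hypothetical; only the "at least three of five" criterion comes from the survey design.

```python
from collections import Counter
from typing import List, Tuple


def human_correct(votes: List[str], gold: str, min_votes: int = 3) -> bool:
    """An example counts as correct only if enough workers chose the gold label."""
    return Counter(votes)[gold] >= min_votes


def human_accuracy(annotated: List[Tuple[List[str], str]]) -> float:
    """Fraction of examples judged correct under the majority rule."""
    return sum(human_correct(votes, gold) for votes, gold in annotated) / len(annotated)


# Toy annotations: (five worker votes, gold label).
annotated = [
    (["A", "A", "A", "B", "C"], "A"),  # 3 of 5 agree with gold: correct
    (["B", "B", "B", "B", "A"], "A"),  # only 1 of 5 agrees with gold: wrong
]
accuracy = human_accuracy(annotated)
```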
We summarize the results in Table 4. Rows 1 and 3 report classification accuracy on the whole crowdsourcing datasets, and rows 2 and 4 on the subsets of examples where all 5 workers choose the same label. We observe that when including all examples, both TE-Wiki and humans perform much worse than the supervised method. The supervised approach has the advantage that it learns data-specific features to resolve ambiguity among different classes. On the other hand, humans only make judgements based on their understanding of the labels and a stand-alone test document, and so does our zero-shot algorithm. Ideally, this task should not be difficult for humans as long as the labels properly describe the text topics. However, in some cases the labels could be ambiguous and confusing. Figure 3 shows an example of a bad label name leading to a mistake. The word "Reference" in the correct label actually means "quoting other people's words". However, it is hard for an ordinary person to understand the meaning without any example as illustration. 4 out of 5 annotators instead chose "Entertainment & Music" due to the movie "Star Wars". By contrast, the supervised model has no difficulty in making the correct decision because it has seen plenty of quotation examples during training and can easily capture useful patterns like "Who said XXX". The main reason for humans' confusion here is that the label name does not directly reflect the semantics of the text. A better description of the class should be provided for classification without examples.
We also calculate the accuracy on examples where all 5 workers agree, as in rows 2 and 4 of Table 4. We believe the high inter-annotator agreement here indicates a better alignment between the semantics of the text and the label. We find a significant improvement in human performance on these less ambiguous cases. The same happens to our zero-shot model, but the supervised method benefits much less. Consequently, the performance gap between humans and the supervised model narrows, which demonstrates that ambiguous labels have a strongly negative impact on classification. Therefore, we believe picking good labels is crucial for open-domain topic classification.

Conclusion
We introduce a system for open-domain topic classification. The system allows users to define a customized taxonomy and classify text with respect to that taxonomy in real time, without changing the underlying model. To build a powerful model, we propose to utilize Wikipedia articles and categories and adopt an entailment framework for zero-shot learning. The resulting TE-Wiki outperforms all existing zero-shot baselines in open-domain evaluations. Finally, we demonstrate the importance of choosing proper label names in open-domain topic classification through a crowdsourcing study.

A Crowdsourcing Setup
We conduct crowdsourcing annotations for 1000 documents sampled from the Yahoo! Answers dataset and another 1000 from AG News on Amazon Mechanical Turk (AMTurk). Both crowdsourcing subsets preserve the class balance of the original datasets. We avoid using long documents, so that each document contains no more than 512 characters. The 1000 samples are split into 40 assignments, each containing 25 examples. We request 5 AMTurk workers for multiple-choice questions on each assignment. In order to ensure the response quality, we use anchor examples and gold annotations from the original datasets to filter out low-quality answers. Specifically, in each assignment we insert two anchor examples that we believe are easy enough for workers to choose the correct answer as long as they pay attention. We reject a submission if a worker's classification accuracy against gold annotations is below 30%, or both anchor examples are wrongly classified. With a small initial pilot, we estimate the average working time for labeling 25 examples to be 22 minutes, and we set the pay rate to be $1.5 per assignment for each valid submission. The overall cost is $300 for 200 valid submissions for each dataset.
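The acceptance rule and the cost arithmetic above can be checked with a small sketch. The function names are hypothetical, and it is assumed here that anchor examples are identified by their positions within a worker's list of answers.

```python
from typing import List


def accept_submission(
    answers: List[str],
    gold: List[str],
    anchor_positions: List[int],
    min_accuracy: float = 0.30,
) -> bool:
    """Reject if accuracy is below the threshold or every anchor example is wrong."""
    correct = [a == g for a, g in zip(answers, gold)]
    accuracy = sum(correct) / len(correct)
    all_anchors_wrong = bool(anchor_positions) and all(
        not correct[i] for i in anchor_positions
    )
    return accuracy >= min_accuracy and not all_anchors_wrong


def total_cost(
    n_documents: int = 1000,
    per_assignment: int = 25,
    workers: int = 5,
    pay_per_submission: float = 1.5,
) -> float:
    """Cost per dataset: assignments x workers x pay per valid submission."""
    assignments = n_documents // per_assignment  # 1000 / 25 = 40
    return assignments * workers * pay_per_submission


gold = ["a"] * 10
cost = total_cost()  # 40 assignments x 5 workers x $1.5
```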

Figure 1: An overview of our open-domain topic classification system. Users can choose multiple models (top), and define their own text input and candidate labels (middle). Prediction results from different models are displayed in the bar chart and the table (bottom).

Figure 2: An overview of our proposed TE-Wiki. Left: the data collection process. For each of the top-level categories, we run DFS to find its descendant categories as well as their member articles. These articles are paired with the root category for model input. Right: the model architecture. We use BERT for sequence classification. The article text is concatenated with the category name to feed into a BERT encoder. The classification head takes the output embedding of the "[CLS]" token to classify the input text-category pair.

Algorithm 1: Collect training data
Input: top-level category set S, Wikipedia subcategory graph G, max search depth r = 2
Initialize d(x, c) = ∞ for any article x ∈ X and c ∈ S; M = {}
for c in S do
    T = DFS(c, G, r)
    for t in T.nodes do
        for x in t.articles do
            d(x, c) = min{d(x, c), 1 + depth(t)}
        end
    end
end
for x in X do
    if min_{c∈S} d(x, c) < ∞ then
        P = argmin_{c∈S} d(x, c)
        for c in P do
            Add (x, c, 1) to M
        end
        Sample c′ from S − P
        Add (x, c′, −1) to M
    end
end
Output: M

Figure 3: An example with a bad label name. Annotators are confused by the word "Reference".

Table 2: Test results of all methods on four datasets. Compared with Word2Vec and ESA, ESA-WikiCate is overall the best among the three embedding-based methods. TE-Wiki outperforms all other zero-shot methods across all four datasets, and performs competitively against the weakly-supervised LOTClass.
Implementation: We finetune the bert-base-uncased model on the Wikipedia article-category dataset to train TE-Wiki. We removed 26 categories whose names start with "List of" from the 700 top-level categories, resulting in 674 categories

Table 4: Classification accuracy on crowdsourcing datasets. Yahoo-5 and AG News-5 count only examples for which all five workers choose the same label.