Structural Contrastive Representation Learning for Zero-shot Multi-label Text Classification



Introduction
Zero-shot multi-label text classification (ZMTC) (Chalkidis et al., 2020; Xiong et al., 2022; Song et al., 2021; Liu et al., 2021a; Lupart et al., 2022; Zhang et al., 2022) defines the following problem: given a set of documents with no labels and the full label description for each class, we would like to correctly classify unseen documents into these classes. ZMTC approaches can be leveraged to solve the cold start problem in e-commerce systems (Li et al., 2019; Chang et al., 2021). For instance, we can accurately retrieve newly added products with learning-based retrieval systems. With ZMTC, we do not have to worry about whether we have enough supervised data for the retrieval system. Without retraining the semantic matching model, ZMTC is capable of mapping a customer query to its matched product descriptions even if they were recently added.
Challenges of ZMTC: We observe two major challenges in ZMTC. First, learning the mapping between input text and its associated class descriptions is hard. In practice, both the input text and the class description can be structural documents with a short title sentence and a long descriptive paragraph (Bhatia et al., 2016). As a result, representation learning of such structural input text and class descriptions is likely to be hard. Secondly, the label space is enormous. Practitioners deploy ZMTC on tasks with numbers of classes in the millions or even billions (Medini et al., 2019, 2020; Liu et al., 2021b; Dahiya et al., 2021). Explicit zero-shot learning approaches that require learning softmax classifiers (Pourpanah et al., 2020) would become prohibitive due to the expensive overhead of computing billion-scale embeddings.
Exploiting Representation Learning in ZMTC: One ideal way of tackling ZMTC is to transform it into a nearest neighbor search problem. By learning meaningful representations for both input and class text, ZMTC becomes a similarity search over embeddings. However, existing representation approaches do not completely tackle the two major ZMTC challenges. Current ZMTC methods (Chalkidis et al., 2020; Xiong et al., 2022) divide the structural document into a set of sentences. Next, they perform sentence-level representation learning by modeling the pairwise similarity of input and label documents. This representation learning method neglects the paragraph-level information and the structural relationship between the title and contents. Moreover, current ZMTC approaches index the label space into clusters (Xiong et al., 2022), trees (Gupta et al., 2021), or graphs (Chen et al., 2021) to reduce computation. However, this procedure results in multi-stage training, which is generally hard to optimize end-to-end for ZMTC.
Our Proposal: This paper proposes an end-to-end structural contrastive representation learning approach for ZMTC. We propose a novel randomized text segmentation (RTS) method. We start by randomly chunking the document contents into subsequences. Next, we pair the generated chunks with the document title, as well as with other chunks from the description, to form positive pairs. The pairs are then used for contrastive representation learning. Our novel approach of combining titles and text introduces a data-dependent way of training the model to associate segments of the description with the title. As a result, the relationship between the short title sentence and the long content paragraphs is baked into the representation. We can think of it as a novel self-supervised auxiliary task. This method allows us to learn the representation without label engineering. In other words, we transform ZMTC into learning the similarity between different types and modalities of text within documents. Specifically, our proposal enables us to represent a long paragraph with random subsequence sampling. Our extensive experiments indicate that our approach leads to up to a 2.33% improvement in precision@1 and a 5.94× speedup in training time on state-of-the-art large-scale ZMTC benchmarks.

Zero-Shot Multi-label Text Classification
Zero-shot multi-label text classification (ZMTC) is a standard natural language processing (NLP) task with practical significance. In recommendation systems, efficient ZMTC leads users to new products (Li et al., 2019; Chang et al., 2021). In medical document analysis, ZMTC is the tool for tagging medical subject headings on a stream of related papers (Lupart et al., 2022). Current ZMTC methods focus on label modeling by shrinking the large label space for more expressivity and better efficiency. For instance, Chalkidis et al. (2020) use a hierarchy of labels to help improve ZMTC performance. Xiong et al. (2022) introduce multiscale label clustering to help the learning of both text and label representations. Liu et al. (2021a) introduce reasoning over the label hierarchy to boost the effectiveness of pre-trained language models in ZMTC. Zhang et al. (2022) introduce metadata such as label synonyms into contrastive learning for better ZMTC. In this paper, we aim at a label-engineering-free approach to ZMTC. We focus our research on modeling the correlation between the title and contents of documents. As a result, we directly generate meaningful representations for both input text and labels so that ZMTC can be solved with efficient near neighbor search engines (Johnson et al., 2019).

Contrastive Learning
Inspired by the recent success of contrastive representation learning methods in the field of computer vision (Chen et al., 2020; Khosla et al., 2020; He et al., 2020), multiple contrastive learning approaches have been proposed for sentence representation learning in NLP. Wu et al. (2020) leverage multiple data augmentation techniques for better sentence representation learning. Zhang et al. (2020) attempt to maximize mutual information between sentence-level and token-level representations. Giorgi et al. (2021) sample spans of text as positive pairs for contrastive learning. Gao et al. (2021) use different dropout masks as data augmentation. Aside from sentence representation learning, document representation learning is also seeing contrastive learning approaches gain traction. Xu et al. (2021) propose to represent documents as a graph attention network, in which each passage is a vertex, and perform contrastive learning on pairs of passage subsets to learn document representations. Luo et al. (2021) use data augmentation techniques such as synonym substitution and back-translation for better document representation.
Contrastive learning approaches have also been applied to ZMTC problems in prior works. Xu et al. (2022) propose to iteratively train the query encoder and document encoder using training pairs constructed with the Inverse Cloze Task (Lee et al., 2019) and dropout (Srivastava et al., 2014), and expand the set of negative instances with a cache queue. Xiong et al. (2022) construct positive pairs with the Inverse Cloze Task and augment the set of positive instances with pseudo-labels constructed with unsupervised clustering and TF-IDF. However, these contrastive learning approaches do not focus on modeling the structural information of both input and label documents. Moreover, the learning framework has multiple stages, making the training inefficient.

Methodology
In this section, we introduce our proposed structural contrastive learning approach for ZMTC. We start with our problem settings. Next, we introduce our approach to representation learning for structural text with title and content. Finally, we highlight the proposed randomized text segmentation with more intuition.

Problem Setting
Notations: In this paper, we denote X = {(t_1, c_1), ..., (t_|X|, c_|X|)} as a set of documents. Every (t_i, c_i) ∈ X is a title-content pair, where t_i and c_i represent the title and content text, respectively. Let Y = {y_1, ..., y_|Y|} be the set of labels. Each y_i ∈ Y can be a short sentence description or a structural document with a title and contents. Each (t_i, c_i) ∈ X corresponds to a subset of labels in Y. The set of documents X is split into disjoint subsets X_train and X_test for training and evaluation, respectively. We summarize the notations in Table 1.
The multi-label text classification problem is the problem of matching documents to their most relevant labels in a large pool of labels. ZMTC is an important subtask of this problem that focuses on unseen labels. In the ZMTC setup, we have access to X_train and Y for training a model to classify documents to labels. We would like to correctly classify each unseen document in X_test to labels in Y with the trained model. Due to the zero-shot nature of the problem, we do not have access to M, the ground truth mapping of documents to labels, for training. This problem formulation is general enough that many real-world problems can be modeled after it, for example, predicting which items are similar to an item on an e-commerce website (Chang et al., 2021), predicting which categories an article belongs to on an online encyclopedia (Bhatia et al., 2016), or predicting medical subject headings for COVID-19 related articles (Lupart et al., 2022).

Learning Text Representation
In this section, we introduce our structural contrastive representation learning approach for ZMTC. We present an overview of our method in Figure 1. For document data containing both title and content, we start with randomized text segmentation to generate subsequences for better paragraph-level representation learning of long text. Next, we pair the generated text segments with titles or with each other to construct positive pairs, and train the model using a contrastive representation learning framework. As a result, we obtain representations for both input and label text so that ZMTC becomes a nearest neighbor search problem. In the following subsections, we start by introducing the randomized text segmentation technique. Next, we introduce our contrastive learning framework.

Randomized Text Segmentation
We perform randomized text segmentation (RTS) on the contents to divide a long text into non-overlapping contiguous subsequences. We use these subsequences to generate positive pairs for contrastive representation learning.
The contents c of a document is a finite sequence of terms, c = (w_1, ..., w_|c|), where each term w is a textual entity such as a word. We segment the contents into non-overlapping contiguous subsequences by sampling lengths l_1, l_2, ... of the subsequences from the discrete uniform distribution U(L_min, L_max), where L_min and L_max are hyperparameters. We keep sampling from the distribution until we obtain k sampled lengths such that ∑_{i=1}^{k} l_i ≥ |c| and ∑_{i=1}^{k-1} l_i < |c|. Then, we segment the contents into k subsequences (w_1, ..., w_{l_1}), (w_{l_1+1}, ..., w_{l_1+l_2}), ..., (w_{1+∑_{i<k} l_i}, ..., w_|c|). To prevent the last subsequence from being too short, we merge the last two subsequences by concatenation if the length of the last subsequence is less than L_min/2. The randomized segmentation of contents is repeated independently every epoch, so the subsequences obtained by segmenting the same text can be completely different across epochs.
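As a concrete illustration, the segmentation procedure above can be sketched in Python. The paper does not provide code, so the function name and the token-list representation of the contents are our own:

```python
import random

def randomized_text_segmentation(tokens, l_min, l_max, rng=random):
    """Split a token sequence into non-overlapping contiguous
    subsequences with lengths drawn from U(l_min, l_max)."""
    lengths, total = [], 0
    # Keep sampling lengths until their sum covers the whole sequence,
    # i.e. sum(l_1..l_k) >= |c| while sum(l_1..l_{k-1}) < |c|.
    while total < len(tokens):
        l = rng.randint(l_min, l_max)  # inclusive uniform draw
        lengths.append(l)
        total += l
    # Cut the sequence at the sampled boundaries.
    segments, start = [], 0
    for l in lengths:
        segments.append(tokens[start:start + l])
        start += l
    # Merge a too-short final segment (< l_min / 2) into its predecessor.
    if len(segments) > 1 and len(segments[-1]) < l_min / 2:
        segments[-2] = segments[-2] + segments[-1]
        segments.pop()
    return segments
```

Re-running this function with a fresh random state each epoch yields the independent re-segmentation described above.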

Positive Pair Construction
In this section, we introduce how to construct positive pairs for contrastive representation learning given the RTS subsequences and the short title of the document. We construct two types of positive pairs for each input document and one or two types of positive pairs for each label document.

We perform RTS on the contents of every document. Given a document with content c and title t, we use RTS to obtain k subsequences s_1, ..., s_k of c. Next, we construct two types of positive pairs:
1. For each subsequence s_i, we pair it with t as (t, s_i). There are k such pairs.
2. For each subsequence s_i, we pair it with another subsequence s_j, where i ≠ j. We form ⌈k/2⌉ pairs (s_i, s_j) by sampling pairs from {s_1, ..., s_k} without replacement, pairing the last remaining subsequence with s_1 if k is odd.

For the label set Y, if it only contains a short description for each class, we directly construct |Y| positive pairs (y_i, y_i) and use dropout noise to prevent representation collapse (Gao et al., 2021). On the other hand, if elements in Y have both a short title and long contents, we apply the same pair construction method to labels as to input documents. It is worth noting that we do not model the correlation between the input document and the labels in the zero-shot learning setup. We directly use the pairs for training in a contrastive learning framework with a language model as the encoder. This procedure is end-to-end learning with only one training stage. Moreover, it does not involve any label engineering such as clustering.
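The two pair types can be sketched as follows; this is a minimal illustration assuming segments already produced by RTS, and the function name is hypothetical:

```python
import random

def build_positive_pairs(title, segments, rng=random):
    """Construct the two types of positive pairs for one document:
    k (t, s_i) pairs plus ceil(k/2) (s_i, s_j) pairs."""
    # Type 1: pair every segment with the title.
    pairs = [(title, s) for s in segments]
    # Type 2: pair segments with each other without replacement.
    shuffled = segments[:]
    rng.shuffle(shuffled)
    for i in range(0, len(shuffled) - 1, 2):
        pairs.append((shuffled[i], shuffled[i + 1]))
    if len(shuffled) % 2 == 1:
        # Odd k: pair the last remaining segment with s_1.
        pairs.append((shuffled[-1], segments[0]))
    return pairs
```

Shuffling and then pairing adjacent elements implements sampling pairs without replacement; for odd k, the leftover segment is paired with s_1 as described above.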

Training Loss
In this section, we introduce the contrastive loss we use for representation learning with the constructed positive pairs. Let E(·) denote an encoder with pre-trained weights. This encoder transforms input text into an embedding with fixed dimensions. We choose MPNet (Song et al., 2020) as the encoder and use the [CLS] representation as the text embedding. Next, following the contrastive learning framework in (Chen et al., 2020), in each iteration we sample a batch of positive pairs {(x_i, x̃_i) | i ∈ {1, ..., b}} of size b and minimize the following loss:

L = -(1/b) ∑_{i=1}^{b} log [ exp(f(E(x_i), E(x̃_i))/τ) / ∑_{j=1}^{b} exp(f(E(x_i), E(x̃_j))/τ) ],   (1)

where f(u, v) = (u · v) / (‖u‖‖v‖) is the cosine similarity and τ is the temperature hyperparameter. We train E(·) for a fixed number of epochs, updating its weights to minimize the loss.
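The loss in Equation 1 can be sketched with NumPy on precomputed embeddings. This is an illustration of the in-batch contrastive objective only, not the authors' PyTorch/MPNet implementation; the function name is ours:

```python
import numpy as np

def contrastive_loss(z, z_pos, tau=0.05):
    """In-batch contrastive (InfoNCE) loss over b positive pairs.
    z, z_pos: (b, d) arrays of embeddings E(x_i) and E(x~_i)."""
    # Cosine similarity f(x_i, x~_j) for all i, j, scaled by temperature.
    z_n = z / np.linalg.norm(z, axis=1, keepdims=True)
    p_n = z_pos / np.linalg.norm(z_pos, axis=1, keepdims=True)
    sim = (z_n @ p_n.T) / tau                    # (b, b) matrix
    # Row-wise log-softmax; the diagonal holds each positive pair.
    sim = sim - sim.max(axis=1, keepdims=True)   # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

The other in-batch positives serve as negatives for each pair, so no explicit negative mining is needed.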

Inference
Once training is finished, we perform inference within a nearest neighbor search framework. We first encode the labels into a set of embeddings {E(y) | y ∈ Y}. Next, given an input document with title t and contents c, we concatenate them as t ⊕ c and generate the document embedding E(t ⊕ c). Finally, we query and retrieve the k nearest neighbor embeddings of E(t ⊕ c) in the set {E(y) | y ∈ Y}. Here we use the same cosine similarity as our distance metric.
k-nearest neighbor search over dense embedding vectors can be greatly accelerated using the FAISS engine (Johnson et al., 2019). As a result, we obtain an efficient workflow for ZMTC.
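The retrieval step can be sketched as a brute-force exact search in NumPy; in practice an approximate index such as FAISS replaces this loop, and the helper name below is ours:

```python
import numpy as np

def knn_labels(doc_emb, label_embs, k=5):
    """Return indices (and similarities) of the k labels whose embeddings
    have the highest cosine similarity with the document embedding."""
    d = doc_emb / np.linalg.norm(doc_emb)
    L = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    sims = L @ d                      # cosine similarity to every label
    top = np.argsort(-sims)[:k]       # indices of the k most similar labels
    return top, sims[top]
```

Because both document and label vectors are normalized, maximizing cosine similarity is equivalent to maximizing inner product, which is the operation fast search engines accelerate.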

Discussion
Motivation: The motivation of our method is to leverage the inherent structure in the data to generate high-quality pairs for contrastive learning. A document is a title-content pair, where the title is short and expresses the main topic of the document, and the contents are long and describe multiple concepts of the topic in detail. With randomized text segmentation, we break up the long contents into short segments, each of which consists of one or two constituent concepts of the topic. By pairing these segments with the title or other segments for contrastive learning, the model captures the semantic similarity between texts from different categories within the same document. Moreover, the model learns to produce high-quality representations for both input documents and labels. Furthermore, by independently repeating the RTS process every epoch, the model is trained on a different set of pairs every epoch. This prevents the model from memorizing the training pairs and overfitting, and encourages the model to capture the underlying semantic similarity of concepts within the document.
RTS vs Sentence-level Separation: Previous approaches to contrastive learning for document data break up documents into natural sentences by splitting text at appropriate punctuation (Xiong et al., 2022; Lee et al., 2019). However, this method has multiple downsides. Natural sentences are not ideal training data for contrastive learning, since they may be too short to capture enough context, and they are static. Moreover, the model is prone to memorizing the training data or overfitting, since it is trained on the same set of pairs every epoch. Our method produces training data of much better quality and variety, and enables the model to learn underlying patterns in the data that are otherwise difficult or impossible to recognize. We demonstrate this empirically in Section 4.5.
Choices of Hyperparameter: Based on the motivation of our proposed method, we describe a method of setting values for the hyperparameters L_min and L_max. We set L_min = l and L_max = 2l such that, with high probability, a subsequence of length l, randomly sampled from the contents of any document in the dataset, would capture enough context for one to recognize an idea or a concept described in the document.

Experiment
In this section, we evaluate the performance of our method and compare it with competitive baselines on 4 ZMTC datasets. There are 2 product recommendation datasets, 1 article recommendation dataset, and 1 article categorization dataset in our experiments. We choose these datasets to simulate the cold start problem in large-scale recommendation systems, information retrieval tasks in search engines, and natural language processing of unseen documents. In the experimental evaluation, we would like to answer the following questions: (1) Does our approach of RTS and pair construction improve ZMTC accuracy? (2) Does our approach improve training efficiency? (3) How does each component of our method contribute to its performance?

Datasets
We conduct our experiments on 4 publicly available datasets for multi-label text classification. Table 2 presents the statistics of the datasets. All 4 datasets have a very large set of labels, ranging from 131K to 960K in size, which enables us to accurately evaluate model performance, since real-world ZMTC tasks usually have an enormous label space. We obtain the LF-Amazon-131K, LF-WikiSeeAlso-320K, and LF-Wikipedia-500K datasets from the extreme classification repository (Bhatia et al., 2016). LF-Amazon-1M is available in (Gupta et al., 2021). All 4 datasets use data collected from real-world applications: LF-Amazon-131K and LF-Amazon-1M contain item-to-item recommendation data from the e-commerce website Amazon, LF-WikiSeeAlso-320K contains data for related articles from the encyclopedic website Wikipedia, and LF-Wikipedia-500K contains article categorization data from Wikipedia. Since the datasets use data from large-scale recommendation systems, they are ideal for evaluating the real-world performance of models.

Testbed
We implement our approach with PyTorch (Paszke et al., 2019). Our experiments are conducted on a machine with 4 NVIDIA Tesla V100 32GB GPUs and 2 24-core/48-thread Intel Xeon Gold 5220R CPUs with 1.5TB of RAM.

Evaluation Metrics
We adopt precision at p (P@p), p ∈ {1, 3, 5}, and recall at r (R@r), r ∈ {1, 3, 5, 10, 100}, as the evaluation metrics for ZMTC tasks, defined as:

P@p = (1/n) ∑_{i=1}^{n} |Ŷ_i^p ∩ M_i| / p,    R@r = (1/n) ∑_{i=1}^{n} |Ŷ_i^r ∩ M_i| / |M_i|,

where n is the number of documents evaluated, Ŷ_i^p is the set of top-p predicted labels for the ith document, and M_i is the set of ground-truth labels for the ith document. The precision and recall metrics are frequently used for this setup in prior works (Xiong et al., 2022; Reddi et al., 2019; Chang et al., 2021).
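These metrics can be computed as follows; the sketch assumes each prediction list is already ranked by similarity, and the function name is ours:

```python
def precision_recall_at_k(predicted, ground_truth, k):
    """predicted: list of ranked label lists (one per document);
    ground_truth: list of ground-truth label sets (one per document)."""
    n = len(predicted)
    # P@k: fraction of the top-k predictions that are correct.
    p = sum(len(set(pred[:k]) & gt)
            for pred, gt in zip(predicted, ground_truth)) / (n * k)
    # R@k: fraction of the ground-truth labels found in the top k.
    r = sum(len(set(pred[:k]) & gt) / max(len(gt), 1)
            for pred, gt in zip(predicted, ground_truth)) / n
    return p, r
```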

Hyperparameter
The best hyperparameters we found in our experiments for each dataset are shown in Table 3. We adopt the same training procedure for each dataset.
We finetune the base version of MPNet (Song et al., 2020) with positive pairs constructed with our proposed method for 5 or 10 epochs, using the AdamW optimizer (Loshchilov and Hutter, 2019) with a decreasing learning rate to optimize the loss function in Equation 1 with τ = 0.05. The learning rate decays 10× over the epochs on a linear schedule.
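The linear 10× decay can be sketched as a simple interpolation; whether the paper steps the rate per batch or per epoch is not stated, so the granularity here is our assumption:

```python
def linear_decay_lr(step, total_steps, lr_init, decay_factor=10.0):
    """Linearly interpolate from lr_init down to lr_init / decay_factor
    over total_steps training steps."""
    lr_final = lr_init / decay_factor
    frac = step / max(total_steps, 1)   # progress through training in [0, 1]
    return lr_init + frac * (lr_final - lr_init)
```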
We carry out a grid search for the best learning rate in {5 × 10^-5, 5 × 10^-6, 5 × 10^-8} and (L_min, L_max) ∈ {(40, 80), (80, 160)}. We use fixed batch sizes that are large enough to take full advantage of GPU memory. We train the models for 5 or 10 epochs, depending on the size of the dataset.
Datasets that share the same source of data share almost identical hyperparameters; LF-WikiSeeAlso-320K and LF-Wikipedia-500K use identical hyperparameters, since their data are both sampled from Wikipedia. We avoid using (t, s_i) pairs for training on the Wikipedia datasets, for the following reasons. For a Wikipedia article, the text in the short title is usually repeated frequently throughout the contents. Therefore, maximizing the agreement of the title with content subsequences is unnecessary and redundant. Additionally, training with (t, s_i) pairs would cause the encoder to focus solely on the title keywords in the content subsequences, instead of capturing the semantic similarity of text segments.

Baselines
We provide an overview of the baseline methods evaluated. All methods except XR-Linear encode documents and labels into embedding vectors and, for a given document, retrieve the labels with the most similar embeddings in terms of cosine similarity. XR-Linear retrieves labels by querying a hierarchical tree structure.
• MACLR: (Xiong et al., 2022) A multi-stage contrastive learning method that uses clustering and TF-IDF to construct pseudo-labels.
• TF-IDF: (Ramos et al., 2003) Represents input and label documents as sparse TF-IDF feature vectors.
• GloVe: (Pennington et al., 2014) Represents input and label documents as GloVe embeddings.

Table 4: Precision and recall metrics of our method and other baselines on 4 datasets for ZMTC. For each metric, the best value is bolded and the second best value is underlined. RTS achieves state-of-the-art results on most of the metrics, with substantial improvements over the previous best on some.
• Sentence BERT (SBERT): (Reimers and Gurevych, 2019) A BERT model trained on extra data to specialize in producing high-quality sentence representations.
• SimCSE: (Gao et al., 2021) An unsupervised contrastive learning method that constructs positive pairs by pairing a sentence in the training corpus with itself, using dropout as data augmentation, and finetunes a BERT model with such pairs.
• MPNet: (Song et al., 2020) Represents input and label documents with MPNet, a BERT model pre-trained with the masked and permuted training objective.
• XR-Linear: (Yu et al., 2022) A model that organizes labels into a hierarchical tree and constructs pseudo-labels with TF-IDF to overcome the lack of training supervision.
• Inverse Cloze Task (ICT): (Lee et al., 2019) A BERT model trained with the ICT objective for title prediction.

Main Results
Accuracy: Table 4 shows the evaluation results on 4 ZMTC tasks. We report the precision and recall of the baselines from (Xiong et al., 2022). Our method attains the best results on most of the metrics and substantially improves over previous state-of-the-art results on some. P@1 is improved from 16.31% to 18.64% on LF-WikiSeeAlso-320K and from 28.44% to 30.67% on LF-Wikipedia-500K, and R@100 is improved from 54.99% to 59.34% on LF-Amazon-131K and from 53.83% to 57.30% on LF-WikiSeeAlso-320K. For the metrics averaged over all datasets, all precision and recall metrics are improved over previous state-of-the-art results, especially P@1 and R@100, which are improved by 1.4% and 1.59%, respectively. These results answer the first question: our approach consistently improves ZMTC performance on different tasks.

Efficiency: We compare the training time of RTS with the previous state-of-the-art method MACLR on ZMTC tasks. The training time statistics are shown in Table 6. We test both methods with the same hardware configuration. For MACLR, we use the code and the best hyperparameters provided in (Xiong et al., 2022). We train the model with our method until the evaluation metric P@1 reaches the highest P@1 achieved by MACLR. The results answer the second question: our proposed ZMTC method achieves a 2.98× to 5.94× speedup in training.

Ablation Study
In this section, we answer the third question with an ablation study. We investigate the impact of segmentation methods, pretraining, and types of positive pairs on model accuracy. All ablation experiments are based on LF-Amazon-131K and use the same hyperparameters described in Section 4.2.3 to ensure a fair comparison. Detailed results of the ablation study are shown in Table 5.
Segmentation Methods: We study the impact of different text segmentation methods by comparing RTS, natural, and fixed segmentation. Natural segmentation breaks up long text into natural sentences, while fixed segmentation breaks it up into subsequences of fixed length. For fixed segmentation, we choose 60 as the length of each subsequence, as it is the average length used in RTS. Natural and fixed segmentation perform similarly, while RTS outperforms both and achieves 1.4%-1.48% better precision@1.
Pretraining: We compare using BERT (Devlin et al., 2019) and MPNet (Song et al., 2020) as the starting points for training to study the impact of pretraining. We compare the base versions of BERT and MPNet, which have the same architecture and model size but different pretraining schemes. MPNet has been shown to outperform BERT on downstream tasks (Song et al., 2020). After the same amount of training time, BERT slightly underperforms MPNet, but it still achieves significantly better results than MPNet trained with naive segmentation methods. A better pretraining scheme produces a slightly better model for ZMTC, but it is not a significant contributing factor.
Positive Pairs: We remove each type of pair from training to investigate the impact different types of pairs have on model accuracy. First, we remove the label pairs (y, y) and train with only (t, s_i) and (s_i, s_j) pairs. The model retains good performance, so the label set is not necessary for producing high-quality models for ZMTC. Then we further remove the (s_i, s_j) pairs and train with only (t, s_i) pairs. The resulting model still outperforms the ones trained on all 3 types of pairs with naive segmentation methods, since RTS exploits the structure of document data and enables the model to learn the underlying semantic similarity between segments.

Conclusion
In this paper, we proposed Randomized Text Segmentation (RTS) and positive pair construction strategies to exploit the structure within document data for end-to-end contrastive learning, advancing state-of-the-art results on ZMTC tasks. Our proposed method achieves up to a 2.33% improvement in precision@1 and up to a 5.94× speedup in training time over the previous state-of-the-art. We show that it is feasible to efficiently train high-quality models for challenging ZMTC tasks without having to resort to time-consuming, multi-stage methods with label engineering or methods that rely on inefficient softmax learning. Through extensive ablation experiments, we demonstrate the superiority of RTS over naive segmentation methods, and show that the types of positive pairs we proposed are indeed effective for learning better representations. We believe our work has a substantial impact, as it can be applied to tackle many large-scale real-world problems such as cold-start recommendation, information retrieval, and medical document categorization and classification.

Limitations
A limitation of our approach is that it relies on complex pretrained transformer-based language models, such as BERT and MPNet, to achieve state-of-the-art results in ZMTC. Transformer-based models are computationally expensive, require specialized hardware such as GPUs for training, and are difficult to deploy in large-scale production. In the future, we would like to explore using simpler models, such as embedding models, for more efficient ZMTC training and inference.

Ethics Statement
We use GPUs to train transformer models, which have a notable carbon footprint.However, since our proposed approach improves training efficiency over previous methods by reducing multiple stages of training to one, we hope our work can help save energy in settings such as online recommendation systems.
Figure 1: a) Randomized Text Segmentation breaks up the contents of a document into non-overlapping subsequences with lengths sampled from a discrete uniform distribution. b) We exploit the structure of document data by constructing (t, s_i) and (s_i, s_j) positive pairs for contrastive learning.

Table 2 :
Statistics of the datasets used for evaluation. |X_train|, |X_test|, and |Y| denote the number of training instances, the number of test instances, and the number of labels, respectively.

Table 3 :
Best hyperparameters and training settings for each dataset.

Table 5 :
Experimental results of the ablation study. We study the impact of fixed and natural sentence segmentation, different pre-trained models, and different positive pairs.

Table 6 :
Training time (in hours) comparison between RTS and MACLR, the previous state-of-the-art method for ZMTC. RTS achieves a significant training speedup, up to a factor of 5.94×, on all datasets.