OTSeq2Set: An Optimal Transport Enhanced Sequence-to-Set Model for Extreme Multi-label Text Classification

Extreme multi-label text classification (XMTC) is the task of finding the most relevant subset of labels from an extremely large-scale label collection. Recently, some deep learning models have achieved state-of-the-art results on XMTC tasks. These models commonly predict scores for all labels with a fully connected layer as the last layer of the model. However, such models cannot predict a relatively complete, variable-length label subset for each document, because they select the positive labels relevant to a document by a fixed threshold or by taking the top k labels in descending order of scores. A less popular type of deep learning model, sequence-to-sequence (Seq2Seq), focuses on predicting variable-length positive labels in sequence style. However, the labels in XMTC tasks are essentially an unordered set rather than an ordered sequence, and the default order of labels constrains Seq2Seq models during training. To address this limitation of Seq2Seq, we propose an autoregressive sequence-to-set model for XMTC tasks named OTSeq2Set. Our model generates predictions in a student-forcing scheme and is trained with a loss function based on bipartite matching, which enables permutation invariance. Meanwhile, we use the optimal transport distance as a measurement that forces the model to focus on the closest labels in the semantic label space. Experiments show that OTSeq2Set outperforms other competitive baselines on 4 benchmark datasets. In particular, on the Wikipedia dataset with 31k labels, it outperforms the state-of-the-art Seq2Seq method by 16.34% in micro-F1 score. The code is available at https://github.com/caojie54/OTSeq2Set.


Introduction
Extreme multi-label text classification (XMTC) is a Natural Language Processing (NLP) task of finding the most relevant subset of labels from an extremely large-scale label set. It has many usage scenarios, such as item categorization in e-commerce and tagging Wikipedia articles. XMTC becomes more important with the fast growth of big data.
As in many other NLP tasks, deep learning based models also achieve state-of-the-art performance in XMTC. For example, AttentionXML (You et al., 2019), X-Transformer (Chang et al., 2020) and LightXML (Jiang et al., 2021) have achieved remarkable improvements in evaluation metrics over previous methods. These models are all composed of three parts (Jiang et al., 2021): text representation, label recalling, and label ranking. The first part converts the raw text into text representation vectors; the label recalling part then scores all cluster or tree nodes, each covering a portion of the labels; finally, the label ranking part predicts scores for all labels in descending order. Notice that the label recalling and label ranking parts both use fully connected layers. Although fully connected layer based models have excellent performance, they share a drawback: they cannot generate a variable-length and relatively complete label set for each document, because they select the positive labels relevant to a document by a fixed threshold or by taking the top k labels in descending order of label scores, which depends on a human decision. Another type of deep learning based model is Seq2Seq learning, which focuses on predicting variable-length positive labels only, such as MLC2Seq (Nam et al., 2017) and SGM (Yang et al., 2018). MLC2Seq and SGM enhance the Seq2Seq model for multi-label classification (MLC) tasks by changing label permutations according to label frequency. However, a pre-defined label order cannot solve the underlying problem of Seq2Seq based models: the labels in XMTC tasks are essentially an unordered set rather than an ordered sequence. Yang et al. (2019) address this problem on MLC tasks via reinforcement learning, designing a reward function that reduces the model's dependence on label order, but the model must first be pretrained with Maximum Likelihood Estimation (MLE). This two-stage training is not efficient for XMTC tasks, which have large-scale label sets.
To address the above problems, we propose an autoregressive sequence-to-set model, OTSeq2Set, which generates a subset of labels for each document and ignores the order of the ground truth during training. OTSeq2Set is based on Seq2Seq (Bahdanau et al., 2015), which consists of an encoder and a decoder with an attention mechanism. The bipartite matching method has been successfully applied to named entity recognition (Tan et al., 2021) and keyphrase generation (Ye et al., 2021) to alleviate the impact of target order. Chen et al. (2019) and Li et al. (2020) have successfully applied the optimal transport algorithm to enable sequence-level training for Seq2Seq learning. Both methods can achieve optimal matching between two sequences, but they differ: the former matches two sequences one-to-one, while the latter gives a matrix containing regularized scores of all connections. We combine the two methods in our model.
The contributions of this paper are summarized as follows: (1) We propose two schemes for using bipartite matching in XMTC tasks, which suit datasets with different label distributions. (2) We combine bipartite matching and the optimal transport distance to compute the overall training loss, using the student-forcing scheme when generating predictions in the training stage. Our model avoids exposure bias; besides, the optimal transport distance serves as a measurement that forces the model to focus on the closest labels in the semantic label space. (3) We add a lightweight convolution module to the Seq2Seq model, which achieves a stable improvement and requires only a few parameters. (4) Experimental results show that our model achieves significant improvements on four benchmark datasets. For example, on the Wikipedia dataset with 31k labels, it outperforms the state-of-the-art method by 16.34% in micro-F1 score, and on Amazon-670K, it outperforms the state-of-the-art model by 14.86% in micro-F1 score.

Overview
Here we define the necessary notations and describe the Sequence-to-Set XMTC task. Given a text sequence x containing l words, the task aims to assign a subset y^g containing n labels from the total label set L to x. Unlike fully connected layer based methods, which give scores to all labels, the Seq2Set XMTC task is modeled as finding an optimal positive label sequence y^g that maximizes the joint probability P(ŷ|x):

P(ŷ|x) = ∏_{t=1}^{N} P(ŷ_t | y^g_{<t}, x),  (1)

where y^g is the sequence generated by greedy search, y is the ground truth sequence with its default order, and ŷ is the most closely matched reordered sequence computed by bipartite matching. As described in Eq. (1), we use the student-forcing scheme (the decoder is conditioned on its own previous predictions y^g_{<t}) to avoid exposure bias (Ranzato et al., 2016) between the generation stage and the training stage. Furthermore, combining this scheme with bipartite matching eliminates the influence of the default order of labels.
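The student-forcing decode loop can be sketched as follows. This is a minimal illustration, not the paper's implementation: `greedy_decode`, `toy_step`, and the token ids are hypothetical stand-ins; the key point is that each step is fed the model's own previous argmax prediction rather than the ground truth.

```python
import numpy as np

def greedy_decode(step_fn, bos_id, stop_id, max_len):
    """Greedy student-forcing decoding: at every time step the decoder is fed
    its own previous prediction (not the ground truth), so training-time inputs
    match what the model sees at inference time."""
    prev = bos_id
    generated, distributions = [], []
    for _ in range(max_len):
        p = step_fn(prev)            # probability distribution over the label set
        prev = int(np.argmax(p))     # feed back the model's own argmax prediction
        generated.append(prev)
        distributions.append(p)
        if prev == stop_id:
            break
    return generated, distributions

# toy step function over a 4-token vocabulary: 0 = <bos>, 3 = <stop>
def toy_step(prev):
    table = {0: [0.1, 0.7, 0.1, 0.1],   # after <bos>, label 1 is most likely
             1: [0.1, 0.1, 0.6, 0.2],   # after 1, label 2 is most likely
             2: [0.1, 0.1, 0.1, 0.7]}   # after 2, <stop> is most likely
    return np.array(table[prev])

labels, dists = greedy_decode(toy_step, bos_id=0, stop_id=3, max_len=10)
# labels == [1, 2, 3]: a variable-length prediction, terminated by <stop>
```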

Sequence-to-Set Model
Our proposed Seq2Set model is based on the Seq2Seq model (Bahdanau et al., 2015) and consists of an encoder and a set decoder with an attention mechanism and an extra lightweight convolution layer (Wu et al., 2019), introduced in detail below.
Encoder We implement the encoder as a bidirectional GRU that reads the text sequence x from both directions and computes hidden states for each word:

→h_i = GRU(→h_{i−1}, e(x_i)),  ←h_i = GRU(←h_{i+1}, e(x_i)),

where e(x_i) is the embedding of x_i. The final representation of the i-th word is h_i = [→h_i; ←h_i], the concatenation of the hidden states from both directions.
Attention with lightweight convolution After the encoder computes h_i for all elements in x, we compute a context vector c_t that focuses on different portions of the text sequence when the decoder generates the hidden state s_t at time step t:

c_t = Σ_{i=1}^{l} α_ti h_i.

The attention score α_ti of each representation h_i is computed by

e_ti = v_a^T tanh(W_a s_t + U_a h_i),  α_ti = exp(e_ti) / Σ_{j=1}^{l} exp(e_tj),  (5)

where W_a, U_a, v_a are weight parameters. For simplicity, all bias terms are omitted in this paper.
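The additive attention above can be sketched in a few lines of numpy. This is a generic Bahdanau-style attention illustration under the paper's equations, with arbitrary toy dimensions; the parameter matrices here are random stand-ins, not learned weights.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())          # subtract max for numerical stability
    return e / e.sum()

def additive_attention(s_t, H, W_a, U_a, v_a):
    """e_ti = v_a^T tanh(W_a s_t + U_a h_i); alpha_t = softmax(e_t);
    c_t = sum_i alpha_ti h_i (the attention-weighted context vector)."""
    scores = np.array([v_a @ np.tanh(W_a @ s_t + U_a @ h_i) for h_i in H])
    alpha = softmax(scores)
    c_t = alpha @ H                  # (l,) @ (l, h) -> (h,)
    return c_t, alpha

rng = np.random.default_rng(0)
z, h, l = 4, 6, 5                    # decoder dim, encoder dim, sequence length
s_t, H = rng.normal(size=z), rng.normal(size=(l, h))
W_a = rng.normal(size=(h, z))
U_a = rng.normal(size=(h, h))
v_a = rng.normal(size=h)
c_t, alpha = additive_attention(s_t, H, W_a, U_a, v_a)
```

The same routine, with shared parameters, would be applied to the "label"-level hidden vectors ĥ_i to obtain the second context vector ĉ_t described below.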
To maximally utilize the hidden vectors {h_i}_{i=1,...,l} of the encoder, we use the lightweight convolution layer to compute "label"-level hidden vectors {ĥ_i}_{i=1,...,k}, then compute another context vector ĉ_t using the same attention parameters as c_t. The lightweight convolutions are depth-wise separable (Wu et al., 2019), softmax-normalized, and share weights over the channel dimension. Readers can refer to Wu et al. (2019) for more details.
Decoder The hidden state s_t of the decoder at time step t is computed as

s_t = GRU(s_{t−1}, [e(p_{t−1}); c_{t−1}; ĉ_{t−1}]),

where e(p_{t−1}) is the embedding of the label with the highest probability under the distribution p_{t−1}. Here p_{t−1} is the probability distribution over the total label set L at time step t−1, computed by a fully connected layer:

p_t = softmax(W_p s_t),

where W_p ∈ R^{V×z} is a weight parameter matrix.
The overall label size V of the total label set L is usually huge in XMTC. To let the model fit into limited GPU memory, we use a hidden bottleneck layer, as in XML-CNN (Liu et al., 2017), to replace the fully connected layer:

p_t = softmax(W_2 (W_1 s_t)),

where W_1 ∈ R^{b×z} and W_2 ∈ R^{V×b} are weight parameters, and the hyper-parameter b is the bottleneck size. The number of parameters in this part reduces from a vast O(V × z) to a much smaller O((V + z) × b). According to the label size of each dataset, we can set a different b to make good use of GPU memory.
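The saving from the bottleneck can be checked with simple arithmetic. The numbers below use the Amazon-670K label count and the paper's hidden size of 512; the bottleneck size b = 64 is purely illustrative (the actual per-dataset values are in Table 2).

```python
# Parameter count of a direct V x z projection vs. a hidden bottleneck layer.
V = 670_000   # label-set size (Amazon-670K)
z = 512       # decoder hidden size
b = 64        # illustrative bottleneck size

direct = V * z                 # one weight matrix W_p in R^{V x z}
bottleneck = b * z + V * b     # two matrices: R^{b x z} then R^{V x b}

# direct     = 343,040,000 parameters
# bottleneck =  42,912,768 parameters, roughly an 8x reduction
```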

Bipartite Matching
After generating N predictions in the student-forcing scheme, we need to find the most closely matched reordered ground truth sequence ŷ by bipartite matching between the ground truth sequence y (in its default order) and the sequence p of generated distributions. To find the optimal matching, we search for the permutation ρ with the lowest cost:

ρ* = argmin_{ρ ∈ O_N} Σ_{i=1}^{N} C_match(y_ρ(i), p_i),

where O_N is the space of all permutations of length N, and C_match(y_ρ(i), p_i) is the pair matching cost between the ground truth label with index ρ(i) and the generated distribution p_i at time step i. We use the Hungarian method (Kuhn, 1955) to solve this optimal assignment problem. The matching cost considers the generated distributions and is defined as

C_match(y_ρ(i), p_i) = −1{y_ρ(i) ≠ ∅} p_i(y_ρ(i)),

where p_i(y_ρ(i)) is the predicted probability of label y_ρ(i). The condition y_ρ(i) ≠ ∅ means that the distributions only match with non-∅ ground truth labels.
In practice, we design two assignment schemes for bipartite matching. The first scheme finds the optimal matching between the non-∅ ground truth labels and all generated distributions, then assigns ∅ labels to the remaining generated distributions one-to-one. The second scheme finds the optimal matching between the non-∅ ground truth labels and the first n generated distributions, then assigns ∅ labels to the remaining generated distributions one-to-one. Figure 1 shows the two assignment schemes.
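The optimal assignment can be illustrated with a brute-force search over permutations. This sketch only demonstrates the objective, minimizing −p_i(y_ρ(i)) over non-∅ matches; the paper uses the Hungarian algorithm, which scales polynomially, whereas exhausting permutations is feasible only for tiny N. The example data are made up.

```python
import numpy as np
from itertools import permutations

NULL = -1  # stands for the empty label (∅)

def bipartite_match(gt_labels, dists):
    """Pad the ground truth with ∅ up to N, then search all permutations for the
    one that maximizes the predicted probability of each matched non-∅ label
    (equivalently, minimizes the sum of matching costs -p_i(y))."""
    N = len(dists)
    padded = list(gt_labels) + [NULL] * (N - len(gt_labels))
    def cost(perm):
        return sum(-dists[i][y] for i, y in enumerate(perm) if y != NULL)
    return min(permutations(padded), key=cost)

# 2 ground-truth labels, 3 decoding steps over a 4-label vocabulary:
# step 0 favors label 0, step 1 favors label 2, step 2 favors label 1
dists = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.1, 0.1, 0.7, 0.1],
                  [0.1, 0.7, 0.1, 0.1]])
reordered = bipartite_match([1, 2], dists)
# reordered == (∅, 2, 1): labels 2 and 1 are assigned to the steps that
# predict them most confidently, and ∅ goes to the leftover step
```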
Finally, we obtain the bipartite matching loss based on the reordered sequence ŷ = {y_ρ(i)}_{i=1,...,N} to train the model, defined as

L_bm = −Σ_{i=1}^{N} [1{ŷ_i ≠ ∅} log p_i(ŷ_i) + λ_∅ 1{ŷ_i = ∅} log p_i(∅)],  (13)

where λ_∅ is a scale factor less than 1 that forces the model to concentrate more on non-∅ labels.
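A minimal sketch of this weighted negative log-likelihood follows. The choice of the last vocabulary index as the ∅ token is an assumption made for the example; the reordered targets and distributions would come from the matching step.

```python
import numpy as np

NULL = -1
LAMBDA_NULL = 0.2   # λ_∅, down-weighting the ∅ positions as in the paper

def bm_loss(reordered, dists):
    """Negative log-likelihood of the reordered targets; positions matched to ∅
    are scaled by λ_∅.  We assume here that the last vocabulary index plays the
    role of the ∅ token (an illustrative convention, not the paper's)."""
    total = 0.0
    for i, y in enumerate(reordered):
        if y == NULL:
            total += LAMBDA_NULL * -np.log(dists[i][-1])
        else:
            total += -np.log(dists[i][y])
    return total

# 2 steps over a 2-token vocabulary (index 1 acts as ∅)
dists = np.array([[0.5, 0.5],
                  [0.9, 0.1]])
loss = bm_loss([NULL, 0], dists)   # 0.2 * (-log 0.5) + (-log 0.9)
```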

Semantic Optimal Transport Distance
In XMTC, semantically similar labels commonly appear together in each sample. The bipartite matching loss described in Eq. (13) utilizes only one value of each predicted distribution. To utilize the predictions completely during training, we take the optimal transport (OT) distance in embedding space as a regularization term that makes all predictions close to positive labels, as shown in Figure 2.
OT distance on a discrete domain The OT distance is also known as the Wasserstein distance on a discrete domain X (the sequence space), defined as

L_ot(µ, ν) = min_{Γ ∈ Σ(µ,ν)} ⟨Γ, C⟩,

where µ, ν ∈ P(X) are two discrete distributions on X with ∥µ∥ = ∥ν∥ = 1. In our case, given realizations {p_i}_{i=1}^{n} and {y_j}_{j=1}^{m} of µ and ν, we can approximate them by the empirical distributions µ = (1/n) Σ_i δ_{p_i} and ν = (1/m) Σ_j δ_{y_j}. The supports of µ and ν are finite, so finally we have µ = (1/n) 1_{{p_i}} and ν = (1/m) 1_{{y_j}}. Σ(µ, ν) is the set of joint distributions whose marginals are µ and ν, i.e., Σ(µ, ν) = {Γ ∈ R_+^{n×m} : Γ 1_m = µ, Γ^T 1_n = ν}. C is the cost matrix, whose element c(p_i, y_j) denotes the distance between the i-th support point p_i ∈ X of µ and the j-th support point y_j ∈ X of ν. The notation ⟨·, ·⟩ represents the Frobenius dot-product. With the optimal solution Γ*, L_ot(p, y) is the minimum cost of transporting from distribution µ to ν.
We use a robust and efficient iterative algorithm to compute the OT distance, which is the recently introduced Inexact Proximal point method for exact Optimal Transport (IPOT) (Xie et al., 2020).
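A compact numpy sketch of the IPOT iteration is shown below, following the algorithm of Xie et al. (2020): a proximal kernel G = exp(−C/β) is combined with the current plan, and one Sinkhorn-style scaling step per outer iteration enforces the marginals. The toy cost matrix and iteration counts are illustrative choices, not values from the paper.

```python
import numpy as np

def ipot(C, mu, nu, beta=1.0, n_iter=100, inner=1):
    """Inexact Proximal point method for exact Optimal Transport (IPOT).
    Returns a transport plan Gamma whose marginals approach mu and nu and
    which converges toward the exact (unregularized) OT plan."""
    n, m = C.shape
    G = np.exp(-C / beta)            # proximal kernel
    Gamma = np.ones((n, m)) / (n * m)
    b = np.ones(m) / m
    for _ in range(n_iter):
        Q = G * Gamma                # elementwise product with the current plan
        for _ in range(inner):       # Sinkhorn-style marginal scaling
            a = mu / (Q @ b)
            b = nu / (Q.T @ a)
        Gamma = a[:, None] * Q * b[None, :]
    return Gamma

# toy problem: 3 predictions vs. 2 labels, uniform marginals
C = np.array([[0.0, 1.0],
              [1.0, 0.0],
              [0.5, 0.5]])
mu, nu = np.ones(3) / 3, np.ones(2) / 2
Gamma = ipot(C, mu, nu)
ot_dist = (Gamma * C).sum()          # <Gamma, C>, the transport cost
```

For this cost matrix the optimal plan sends rows 0 and 1 to their zero-cost columns and splits row 2, so the distance approaches 1/6.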

Semantic Cost Function in OT
In XMTC learning, it is intractable to directly compute the cost between predictions and ground-truth one-hot vectors because of the huge label space. A more natural choice is to compute the cost using label embeddings. Based on the rich semantics of the embedding space, we can compute the semantic OT distance between model predictions and the ground truth. The cosine cost is computed as

c(p_i, y_j) = 1 − cos(E^T p_i, E^T 1_{y_j}),

where E ∈ R^{V×d} is the label embedding matrix, V is the vocabulary size, and d is the dimension of the embedding vectors; 1_{y_j} is the one-hot vector whose value is 0 in all positions except position y_j, where it is 1.
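Building the cosine cost matrix reduces to two matrix products and a normalization, as sketched below; the dimensions, random embeddings, and target indices are toy stand-ins.

```python
import numpy as np

def cosine_cost(P, Y_onehot, E):
    """Cost matrix c(p_i, y_j) = 1 - cosine(E^T p_i, E^T 1_{y_j}): each predicted
    distribution p_i maps to an expected label embedding E^T p_i, each one-hot
    target selects its embedding row, and the two are compared by cosine."""
    pred_emb = P @ E                  # (n, d): probability-weighted embeddings
    tgt_emb = Y_onehot @ E            # (m, d): rows of E picked by the one-hots
    pn = pred_emb / np.linalg.norm(pred_emb, axis=1, keepdims=True)
    tn = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    return 1.0 - pn @ tn.T            # (n, m) cosine distances in [0, 2]

rng = np.random.default_rng(1)
V, d = 10, 8                          # toy label-set size and embedding dim
E = rng.normal(size=(V, d))           # stand-in for the label embedding matrix
P = rng.dirichlet(np.ones(V), size=4) # 4 predicted distributions over V labels
Y = np.eye(V)[[2, 5]]                 # one-hot vectors for target labels 2 and 5
C = cosine_cost(P, Y, E)              # feeds directly into the OT solver above
```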
In practice, we compute the semantic optimal transport distance L_ot(p, y) between the non-∅ ground-truth labels and all predictions, as shown in Figure 3.

Complete Seq2Set Training with OT regularization
The bipartite matching gives the optimal matching between ground truth and predictions, which enables the Seq2Set model to be trained with maximum likelihood estimation (MLE). In addition, the OT distance provides a measurement in the semantic label embedding space. We propose to combine the OT distance and the bipartite matching loss, which gives the final training objective:

L = L_bm + λ L_ot(p, y),  (17)

where λ is a hyper-parameter to be tuned.

Evaluation Metrics
Following previous work (Yang et al., 2018; Nam et al., 2017; Yang et al., 2019), we evaluate all methods in terms of the micro-F1 score; micro-precision and micro-recall are also reported for analysis.

Baselines
We compare our proposed methods with the following competitive baselines: Seq2Seq (Bahdanau et al., 2015) is a classic Seq2Seq model with an attention mechanism, which is also a strong baseline for XMTC tasks.
MLC2Seq (Nam et al., 2017) is based on Seq2Seq (Bahdanau et al., 2015) and enhances the Seq2Seq model for MLC tasks by changing label permutations according to label frequency. We use descending order, from frequent to rare, in our experiments.
SGM (Yang et al., 2018) builds on the previous Seq2Seq (Bahdanau et al., 2015) and MLC2Seq (Nam et al., 2017) models, and views the MLC task as a sequence generation problem to take correlations between labels into account.
To extend the above three models to the large-scale dataset Wiki10-31K and the extreme-scale dataset Amazon-670K, we use the bottleneck layer to replace the fully connected layer. The bottleneck sizes are the same as in our proposed model and are shown in Table 2.

Experiment Settings
For each dataset, the vocabulary size of input text is limited to 500,000 words according to the word frequency in the dataset.
The encoder has 2 GRU layers and the decoder has 1. The hidden sizes of the encoder and the decoder are both 512. We set the max length of generated predictions N to the max label length of each dataset, shown as L_max in Table 1.
For LightConv, we set the kernel sizes to 3, 7, 15, and 30 for the respective layers. To reduce GPU memory consumption, we set the convolution stride of the last layer to 3 for all datasets except Wiki10-31K, where it is 4.
We set the hyperparameter λ_∅ of the bipartite matching loss in Eq. (13) to 0.2, following Ye et al. (2021), and λ of the final loss in Eq. (17) to 8.
For word embeddings in the three baseline models and our proposed models, we use pre-trained 300-dimensional GloVe vectors (Pennington et al., 2014) for the input text of all datasets. For label embeddings, we use the mean of the embeddings of the words in each label for all datasets except Amazon-670K; the labels in Amazon-670K are item numbers, so we randomly initialize 100-dimensional embeddings for that dataset.
All models are trained with the Adam optimizer (Kingma and Ba, 2014) and a cosine annealing schedule. Besides, we use dropout (Srivastava et al., 2014) with a rate of 0.2 to avoid overfitting, and clip gradients (Pascanu et al., 2013) to a maximum norm of 8.
All models are trained on one Nvidia TITAN V or one Nvidia GeForce RTX 3090: the TITAN V for the small dataset Eurlex-4K, and the RTX 3090 for the others.
Other hyperparameters are given in Table 2: E is the number of epochs, B the batch size, lr the initial learning rate, b the bottleneck size (note that the bottleneck layer is not used on Eurlex-4K and AmazonCat-13K), and L_t the maximum length of input tokens.

Main Results
Table 3 compares the proposed methods with the three baselines on four benchmark datasets. We focus on the micro-F1 score, following previous Seq2Seq based works. The best score for each metric is in boldface. Our models outperform all baselines in micro-F1 score. We find that the two assignment schemes of bipartite matching each excel on different datasets. BM_all denotes the Seq2Seq model with the first scheme of bipartite matching, and BM_first-n the model with the second. BM_all performs better on Eurlex-4K and Amazon-670K, while BM_first-n is better on Wiki10-31K and AmazonCat-13K. Our complete method achieves a large improvement of 16.34% in micro-F1 score over the second-best baseline model (SGM) on Wiki10-31K, and 14.86%, 7.26%, and 1.01% on the other datasets respectively; the relative improvements are shown in Table 4.

BM f irst−n vs BM all
We find that the performance difference between BM_first-n and BM_all is related to the distribution of label set sizes in each dataset. Table 5 compares BM_first-n with BM_all in micro-F1 score on all datasets. The performance difference on Eurlex-4K is very small, since the two proportions shown in Figure 4 are nearly equal; Wiki10-31K is the same case. However, the two proportions differ substantially on AmazonCat-13K and Amazon-670K, which leads to the large performance difference between BM_first-n and BM_all. When the proportion of samples whose label size is smaller than the average is less than 50%, BM_first-n performs better than BM_all; when the proportion of samples whose label size is greater than the average is greater than 50%, BM_all performs better than BM_first-n.

Effect of Lightweight Convolution and Semantic Optimal Transport Distance
To examine the effect of LightConv and OT, we add LightConv to the Seq2Seq model and the BM model, and add OT to the BM model. Detailed results are shown in Table 3. The semantic optimal transport distance and LightConv achieve improvements over the original models on all datasets except Amazon-670K. This is not surprising: Amazon-670K has an extremely large label set consisting of item numbers, and the item numbers carry no semantic information, so OT has no effect. At the same time, Amazon-670K has the lowest average number of samples per label, which makes it very hard for LightConv to learn a proper context vector. However, the combination of OT and LightConv still yields a small improvement over the BM_all method on Amazon-670K.

Table 3: Comparison between our models and three baselines on Eurlex-4K, Wiki10-31K, Amazon-670K, and AmazonCat-13K. BM_all denotes the Seq2Seq model with the first scheme of bipartite matching loss, and BM_first-n the model with the second. OT denotes the semantic optimal transport distance loss. LC denotes the lightweight convolution layer. We report the average score over 4 runs for all models on Eurlex-4K and Wiki10-31K, and over 2 runs for the other two datasets.

Comprehensive Comparison
To assess the performance of our method on tail labels, we use the macro-averaged F1 (maF1) score, which treats all labels equally regardless of their support. For a more comprehensive comparison with the baselines, we also use the weighted-averaged F1 (weF1), which weights each label by its support, and the example-based F1 (ebF1), which calculates the F1 score for each instance and averages them. The results are shown in Table 6.

Performance over different λ
We conduct experiments on the Eurlex-4K dataset to evaluate performance with different values of the hyperparameter λ in the complete loss function L. The results are shown in Figure 5: the performance reaches its best at λ = 8 and is stable around that value.

Computation Time and Model Size
Table 7 shows the overall training time of all models compared in our experiments. MLC2Seq has the same model size and training time as Seq2Seq. The model size of OTSeq2Set is almost the same as that of Seq2Seq, but its overall training time is much longer than the other models', because the bipartite matching loss and the optimal transport distance must be computed for each sample individually.

Related Work
The most popular deep learning models in XMTC are fully connected layer based models. XML-CNN (Liu et al., 2017) uses a convolutional neural network (CNN) and a dynamic max pooling scheme to learn the text representation, and adds a hidden bottleneck layer to reduce model size as well as boost performance. AttentionXML (You et al., 2019) utilizes a probabilistic label tree (PLT) to handle millions of labels; in AttentionXML, a BiLSTM captures long-distance dependencies among words and a multi-label attention captures the parts of the text most relevant to each label. CorNet (Xun et al., 2020) proposes a correlation network capable of utilizing label correlations. X-Transformer (Chang et al., 2020) proposes a scalable approach to fine-tuning deep transformer models for XMTC tasks and is the first method to use deep transformer models in XMTC. LightXML (Jiang et al., 2021) combines the transformer model with generative cooperative networks to fine-tune the transformer. Another type of deep learning based method is Seq2Seq learning. MLC2Seq (Nam et al., 2017) is based on the classic Seq2Seq architecture (Bahdanau et al., 2015), which uses a bidirectional RNN to encode the raw text and an RNN with an attention mechanism to generate predictions sequentially; MLC2Seq improves performance by determining label permutations before training. SGM (Yang et al., 2018) proposes a novel decoder structure to capture correlations between labels. Yang et al. (2019) propose a two-stage training method: the model is first trained with MLE and then with a self-critical policy gradient training algorithm.
Regarding bipartite matching, Tan et al. (2021) treat named entity recognition as a Seq2Set task: the model generates an entity set with a non-autoregressive decoder and is trained with a loss function based on bipartite matching. ONE2SET (Ye et al., 2021) proposes a K-step target assignment mechanism via bipartite matching for keyphrase generation, also using a non-autoregressive decoder. Xie et al. (2020) propose a fast proximal point method, named IPOT, to compute the optimal transport distance. Chen et al. (2019) add the OT distance as a regularization term to the MLE training loss; the OT distance finds an optimal matching of similar words/phrases between two sequences, which promotes their semantic similarity. Li et al. (2020) introduce a method that combines the student-forcing scheme with the OT distance in text generation tasks, which can alleviate the exposure bias in Seq2Seq learning. The latter two methods both use IPOT to compute the OT distance.

Conclusion
In this paper, we propose an autoregressive Seq2Set model for XMTC, OTSeq2Set, which combines bipartite matching and the optimal transport distance to compute the overall training loss, and uses the student-forcing scheme in the training stage. OTSeq2Set not only eliminates the influence of label order but also avoids exposure bias. Besides, we design two schemes for bipartite matching that suit datasets with different label distributions. The semantic optimal transport distance enhances performance through the semantic similarity of labels. To take full advantage of the raw text information, we add a lightweight convolution module, which achieves a stable improvement and requires only a few parameters. Experiments show that our method gains significant performance improvements over strong baselines on XMTC.

Limitations
For better effect, we compute the bipartite matching and the optimal transport distance between non-∅ targets and predictions. This prevents us from using batch computation to improve efficiency, so the training time of OTSeq2Set is longer than that of the other baseline models.

Figure 2 :
Figure 2: Semantic optimal transport distance in embedding space.

Figure 4 :
Figure 4: Distributions of label set size on all datasets. le_average denotes the proportion of samples whose label size is less than or equal to the average label size; gt_average denotes the opposite proportion.

Figure 5 :
Figure 5: Performance of the OTSeq2Set using different λ in loss function.

Table 1 :
Data statistics. N_train and N_test refer to the number of training and test samples respectively. D refers to the total number of features. L is the total number of labels, L̄ the average number of labels per training sample, L_max the max length of labels over both training and test samples, and L̂ the average number of samples per training label. W_train and W_test refer to the average number of words per training and test sample respectively.

Table 4 :
Relative improvements of micro-F1 score over the second-best baseline model on the four datasets in Table 3. BM denotes BM_all or BM_first-n.

Table 5 :
The micro-F1 score of BM all and BM f irst−n and relative difference on four datasets.

Table 6 :
Performance comparison of all models on the macro-averaged F1, the weighted-averaged F1 and the example-based F1.
Table 7: Training Time and Model Size. T_train is the overall training time in hours. M is the total number of model parameters in millions.