Softmax Tree: An Accurate, Fast Classifier When the Number of Classes Is Large

Classification problems with thousands of classes or more occur naturally in NLP, for example in language modeling or document classification. A softmax or one-vs-all classifier naturally handles many classes, but it is very slow at inference time, because every class score must be calculated to find the top class. We propose the “softmax tree”, consisting of a binary tree having sparse hyperplanes at the decision nodes (which make hard, not soft, decisions) and small softmax classifiers at the leaves. This is much faster at inference because the input instance follows a single path to a leaf (whose length is logarithmic in the number of leaves) and the softmax classifier at each leaf operates on a small subset of the classes. Although learning accurate tree-based models has proven difficult in the past, we are able to overcome this by using a variation of a recent algorithm, tree alternating optimization (TAO). Compared to a softmax and other classifiers, the resulting softmax trees are both more accurate in prediction and faster in inference, as shown in NLP problems having from one thousand to one hundred thousand classes.


Introduction
Classification problems with thousands of classes or more (sometimes called extreme classification) occur naturally in NLP and other areas. One example is language modeling. There are about 171k words in the current edition of the Oxford English Dictionary, and many more if we include all forms of a word, names, technical acronyms, etc. Another example is document classification. The Open Directory Project (ODP) contains over 1M website categories organized in a hierarchical ontology scheme. In this many-class setting, it is difficult to learn a model that is both accurate and fast at inference time. The simplest and most widespread model is a linear (e.g. softmax) classifier, possibly as the output layer of a neural net.
One important problem with a softmax classifier is that one must compute the score or probability of (nearly) all classes, conditional on the input instance, in order to determine the (top-n) predicted class. This has a cost O(DK) where D is the input dimension of the softmax and K the number of classes, which is slow when K and D are large. This problem also occurs with other classifiers, such as soft decision trees. Indeed, computational constraints on the vocabulary size are a major challenge for neural machine translation (Koehn, 2020), for example.
We argue that having the classifier output a positive probability (however small) for each class is slow and unnecessary when K is large, because, for any given instance, the majority of classes should indeed have a negligible probability. A much faster classifier is a traditional decision tree, which makes hard decisions by thresholding an input feature at the decision nodes and outputs a single class at each leaf. This axis-aligned tree assigns zero probability to all classes except the predicted one, which is reached through a single root-leaf path very quickly (in log K time if the tree is balanced). However, such trees are known to be insufficiently accurate even if grown very deep.
We propose a softmax tree (ST), a binary tree having sparse hyperplanes at the decision nodes (which make hard, not soft, decisions) and a small softmax classifier outputting k < K classes at each leaf (a class may appear in more than one leaf). An ST is still very fast at inference: it sends the input instance to a single leaf via a path whose length is logarithmic in the number of leaves (for a complete tree), and it assigns (without computing them) probability zero to most classes (namely, all classes not in the leaf). Trading off the depth ∆ of the tree and the number of classes k per leaf can potentially result in fast, highly accurate classifiers. However, STs are still hard to train because they define a nonconvex, nondifferentiable problem. We solve this by modifying Tree Alternating Optimization (TAO), a recent algorithm for learning oblique decision trees (having hyperplane decision nodes and constant-label leaves), so that it can handle softmax leaves, and by using a good initialization.
Before describing the training algorithm (section 4), we review related work (section 2) and show (section 3) that oblique tree classifiers with constant-label leaves are intrinsically more powerful than linear classifiers, but possibly less efficient, which justifies our STs as hybrid tree-softmax classifiers. Finally, we show (section 5) that our STs have both higher accuracy and much faster inference than several previous models on high-dimensional problems having up to $10^5$ classes.

Related Work
Extreme multi-class classification problems have been addressed in both the machine learning and the natural language processing literature. The most basic method is one-versus-all (Bishop, 2006), where an independent binary classifier is learned per class. Another classical approach, error-correcting output codes (ECOC) (Dietterich and Bakiri, 1995), represents each class with a binary code and learns a separate binary classifier for each bit. However, these methods become costly or even intractable (for both training and prediction) when the number of classes is huge. Moreover, each binary classifier needs to handle a highly imbalanced dataset, as all classes except one are instances of the negative class.
Various approaches have been proposed to speed up training and prediction and to reduce the computational complexity. The most popular ones rely on tree-based structures, since these naturally lead to a logarithmic reduction in time. Decision trees have been actively used in this area (Bengio et al., 2010; Daumé III et al., 2017). Nevertheless, traditional axis-aligned decision trees, such as C4.5 (Quinlan, 1993) or CART (Breiman et al., 1984), have very low accuracy (Choromanska and Langford, 2015). Nested dichotomies (Frank and Kramer, 2004) rely on a tree structure to divide a set of classes into two disjoint subsets and learn a binary classifier to separate them. However, human expertise is necessary to obtain the tree structure and class assignments. Additionally, the total error of the model accumulates with depth, since there is no way to refine the binary classifiers once a split is performed. More recent works specifically designed to cope with a large number of classes (Beygelzimer et al., 2009; Bengio et al., 2010) employ a similar idea but take the class distributions into account to generate the tree structure. Other tree-based approaches perform global or partial optimization over the parameters of a tree. For instance, Daumé III et al. (2017) propose a tree of fixed structure where each node has a much smaller linear multi-class classifier. Sun et al. (2019) extend this work by allowing the tree structure to grow. Other works focus on generating "perfectly" balanced trees to guarantee logarithmic inference time (Jernite et al., 2017; Choromanska and Langford, 2015). In these methods, the tree parameters are typically optimized by approximating gradient information in some way (possibly in an "online" fashion). Other tree-based methods focus more on large-scale extreme multi-label classification and ranking (Prabhu and Varma, 2014; Bhatia et al., 2015).
Finally, there are works which employ non-tree-based approaches, such as hashing-based methods (Medini et al., 2019), subsampling of classes and of the training set (Joshi et al., 2017), etc.
In the context of NLP, most of the above methods are applicable in a number of practical applications, such as large-scale document classification and language modeling. Moreover, there are methods specifically designed for language modeling, where the vocabulary can be very large and the softmax outputs must be computed efficiently. Hierarchical softmax (HSM) (Goodman, 2001) is an approximation which employs a "soft" decision tree with linear nodes to address this issue. HSM has been actively used for learning distributed representations of words (Bengio et al., 2003; Mikolov et al., 2013b), where it can be jointly trained with neural nets of various complexity (Morin and Bengio, 2005). Follow-up works on this topic (Mnih and Hinton, 2009; Mikolov et al., 2013a) propose various initializations for the tree structure (e.g. random, Huffman tree, etc.). Although HSM can be trained efficiently using specific loss functions, at prediction time the input follows all children with a certain probability, which brings no speedup compared to the plain softmax. It is still possible to transform a soft tree into a "hard" tree once training is done (by choosing the child with the highest probability at each split). For example, recent work by Han et al. (2018) applies a similar approach. However, as we show experimentally later, this increases the error due to the further approximation. Similar observations were made by Mikolov et al. (2013b), where various subsampling techniques showed better results. Recently, certain pruning mechanisms have been proposed as an alternative way to speed up prediction (Bojanowski et al., 2017).

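Linear Classifiers and Oblique Classification Trees
A $K$-class linear classifier predicts $f(x) = \arg\max_k (a_k^T x + b_k)$ for an input $x \in \mathbb{R}^D$, where $a_k$ and $b_k$ are the weight vector and bias of class $k$. The parameters are typically learned from data (e.g. by using the one-vs-all scheme or by optimizing a loss such as the cross-entropy for a softmax).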
Theorem 3.1. Any K-class linear classifier can be exactly represented by an oblique classification tree with constant-label leaves. The converse is not true.
Proof. We give a constructive proof. Define $z = Ax + b \in \mathbb{R}^K$, so the linear classifier output is $f(x) = \arg\max(z_1, \dots, z_K)$. This argmax of $K$ arguments can be exactly computed by a complete binary tree of depth $K - 1$, as illustrated in fig. 1 for $K = 4$. Each decision node performs a comparison "$z_i \ge z_j$" and chooses the right child if it holds, else the left child. Each leaf's class label is the corresponding argmax value. This tree represents the iterative algorithm that computes $\arg\max(z_1, \dots, z_K)$ by scanning the elements left to right for the maximum: $\max(z_K, \dots \max(z_3, \max(z_2, z_1)))$. Specifically, each root-leaf path in the tree corresponds to one possible execution path, and the nodes at depth $j$ compare with $z_j$. Since a comparison "$z_i \ge z_j$" is equivalent to "$(a_i - a_j)^T x + (b_i - b_j) \ge 0$", each decision node is a linear decision function and the tree is oblique. This argument is valid for interior points, but can be made to work for points on the boundary between classes by appropriately breaking the ties (replacing "≥" with ">") as needed. The converse is not true because an oblique tree can dedicate more than one leaf to a class, resulting in a nonconvex class region (the union of two polytopes). This cannot be represented by a linear classifier, whose class regions are polytopes of the form $\{x \in \mathbb{R}^D : (a_k - a_j)^T x + (b_k - b_j) \ge 0 \ \ \forall j \ne k\}$.
Note that an axis-aligned decision tree (where each decision node tests a single input feature), however deep, cannot exactly represent a linear classifier unless the latter is itself axis-aligned.
The above theorem shows that oblique decision trees are a strictly more powerful family of classifiers than linear classifiers. The proof construction produces a large tree, having $2^{K-1}$ leaves (although its inference time is equal to that of the linear classifier, O(DK)). In practice, the tree need not exactly represent the linear classifier throughout the input space but just over the instances of a training set, and this can likely be achieved with far smaller trees. For example, if each class is linearly separable from the rest, the corresponding tree has just K leaves and each of the K − 1 decision nodes has exactly one leaf child (except the deepest decision node, which has two leaf children). Still, for a given dataset, the linear classifier may be a more efficient model in number of parameters than the tree. Which classifier (linear or oblique tree with constant-label leaves) is better is an empirical question. A further issue is that finding the global optimum of the training problem is easy for a linear classifier (e.g. the cross-entropy is convex for a softmax classifier) but NP-hard for a decision tree. This leads us to the model we propose in this paper, the softmax tree, which is a hybrid between a purely linear classifier and an oblique tree with constant-label leaves.
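The construction in the proof can be illustrated with a minimal sketch. The function names and the 0-based class indices are ours and purely illustrative; the sketch follows a single root-leaf path of the comparison tree and counts the paths, assuming ties are broken in favor of the earlier class.

```python
from itertools import product

def argmax_by_tree(z):
    """Follow one root-leaf path of the comparison tree from the proof.

    Each step compares the current winner with the next score. Because
    z = A x + b, every such comparison is a hyperplane test on x.
    Returns the leaf label (0-based class index) and the decisions taken.
    """
    winner = 0
    path = []
    for j in range(1, len(z)):
        new_wins = z[j] > z[winner]   # assumption: ties keep the earlier class
        path.append(new_wins)
        if new_wins:
            winner = j
    return winner, tuple(path)

def count_leaves(K):
    # K - 1 binary comparisons per path, hence 2^(K-1) distinct paths (leaves).
    return len(list(product([False, True], repeat=K - 1)))

label, path = argmax_by_tree([0.2, 1.5, -0.3, 0.9])
```

The path has K − 1 decisions, one per comparison, and several leaves carry the same class label, which is why the tree is large even though each prediction only visits one path.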

Softmax Trees: Definition and Training
Unlike soft decision trees, which can be readily optimized via gradient-based methods, hard decision trees pose a far more difficult optimization problem, not just nonconvex but nondifferentiable (and NP-hard). Traditional tree learning algorithms such as CART (Breiman et al., 1984) or C4.5 (Quinlan, 1993) are based on greedily and recursively partitioning the input space, and pruning the resulting tree to reduce overfitting. However, they are known to produce suboptimal trees (Hastie et al., 2009). In this work, we build on a recent algorithm, Tree Alternating Optimization (TAO) (Carreira-Perpiñán and Tavallali, 2018; Carreira-Perpiñán, 2021), which is a non-greedy optimization method for tree-based models. Originally described for oblique trees with constant-label leaves, TAO has been successfully used to train a wide range of other tree-based models: regression trees (Zharmagambetov and Carreira-Perpiñán, 2020), tree ensembles (Carreira-Perpiñán and Zharmagambetov et al., 2021a,b), hybrid models (Zharmagambetov and Carreira-Perpiñán, 2021a,b), etc. Moreover, these models have shown great potential for studying model interpretability and explainability.
TAO works very differently from CART and much more like a regular machine-learning optimization algorithm, but instead of gradients (which do not apply) it uses alternating optimization over groups of nodes of a fixed tree structure. This results in a monotonic decrease of the objective function over all the tree parameters and convergence to a local optimum. A detailed comparison of TAO against traditional trees can be found in (Zharmagambetov et al., 2021c). Next, we describe our softmax trees and training algorithm, noting the differences with Carreira-Perpiñán and Tavallali (2018).
Consider a $K$-class problem with training set $\{(x_n, y_n)\}_{n=1}^{N} \subset \mathbb{R}^D \times \{1, \dots, K\}$ of $D$-dimensional instances and labels. Let $T(x; \Theta)$ be a binary decision tree which produces a prediction for each input $x$ by routing $x$ from the root to exactly one leaf and applying a predictor function at that leaf. Each node (decision node or leaf) has learnable parameters $\theta_i$, and the total set of parameters of the tree is $\Theta = \{\theta_i\}_{i \in \mathcal{N}}$, where $\mathcal{N}$ is the set of nodes. Each decision node $i$ has a decision function $f_i(x; \theta_i)$: $\mathbb{R}^D \to \{\text{left}_i, \text{right}_i\}$ sending instance $x$ to the corresponding child of node $i$, and each leaf has a predictor function $g_i(x; \theta_i)$: $\mathbb{R}^D \to \{1, \dots, K\}$ that produces the actual output. In a softmax tree (ST):
• Each decision function uses a (sparse) hyperplane (oblique tree): "go to the right child if $w_i^T x + w_{i0} \ge 0$, else go to the left child".
• Each leaf predictor is a softmax classifier over $k \le K$ classes. This is unlike Carreira-Perpiñán and Tavallali (2018), which used a constant-label predictor.
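To make the prediction path concrete, the following is a minimal sketch of ST inference on sparse inputs. The nested-dictionary tree layout, the key names ("w", "b", "left", "right", "leaf") and the toy tree are hypothetical choices made for illustration, not the data structures used in our implementation.

```python
import math

def dot(w, x):
    """Sparse dot product; w and x map feature index -> value."""
    if len(w) > len(x):
        w, x = x, w
    return sum(v * x.get(j, 0.0) for j, v in w.items())

def predict_proba(node, x):
    """Route x to one leaf with hard decisions, then apply that leaf's softmax.

    Returns {class: probability} for the leaf's k classes only; every other
    class implicitly has probability zero and is never scored.
    """
    while "leaf" not in node:
        go_right = dot(node["w"], x) + node["b"] >= 0.0
        node = node["right"] if go_right else node["left"]
    scores = {c: dot(w, x) + b for c, (w, b) in node["leaf"].items()}
    m = max(scores.values())                     # stabilize the exponentials
    expd = {c: math.exp(s - m) for c, s in scores.items()}
    total = sum(expd.values())
    return {c: e / total for c, e in expd.items()}

# Toy tree of depth 1 with k = 2 classes per leaf (illustrative values only).
tree = {
    "w": {0: 1.0}, "b": -0.5,
    "left":  {"leaf": {"cat": ({1: 1.0}, 0.0), "dog": ({2: 1.0}, 0.0)}},
    "right": {"leaf": {"car": ({1: 1.0}, 0.0), "bus": ({2: 1.0}, 0.0)}},
}
probs = predict_proba(tree, {0: 1.0, 1: 2.0})
```

Only the hyperplanes on one path and the k class scores of one leaf are evaluated, which is where the inference speedup over a flat softmax comes from.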
TAO assumes a fixed tree structure (say, complete of depth ∆) and initial node parameters. Hence, the hyperparameters of an ST are ∆ and k. TAO optimizes the following objective function:
$$E(\Theta) = \sum_{n=1}^{N} L(y_n, T(x_n; \Theta)) + \alpha \sum_{i \in \mathcal{N}} \|\theta_i\|_1 \qquad (1)$$
where L(·, ·) is the cross-entropy, and the ℓ1 penalty over the weight vectors (of both decision nodes and leaves) promotes sparsity, via a hyperparameter α ≥ 0. TAO is based on two theorems. First, eq. (1) separates over any subset of non-descendant nodes (e.g. all the nodes at the same depth); this follows from the fact that the tree makes hard decisions. All such nodes may be optimized in parallel. Second, optimizing over the parameters of a single node i simplifies to a well-defined reduced problem over the instances that currently reach node i (the reduced set $R_i \subset \{1, \dots, N\}$). The form of the reduced problem depends on the type of node:
• For a decision node, it is a weighted 0/1 loss binary classification problem, where the two classes correspond to the left and right child, which are the only possible outcomes for an instance. Class $\text{left}_i$ ($\text{right}_i$) incurs a loss (weight) given by the prediction of the leaf reached from the left (right) child's subtree. Thus, each instance is assigned as pseudolabel the child with lower loss. The reduced problem takes the form (where $\overline{L}$ and $\overline{y}_n$ are the said loss and pseudolabel, resp.):
$$\min_{\theta_i} \sum_{n \in R_i} \overline{L}(\overline{y}_n, f_i(x_n; \theta_i)) + \alpha \|\theta_i\|_1 \qquad (2)$$
This is as in Carreira-Perpiñán and Tavallali (2018) except that in our STs the loss is the cross-entropy of the corresponding leaf. This problem is NP-hard but can be well approximated with a convex surrogate; we use ℓ1-regularized logistic regression where each instance is weighted by the loss difference between the winner child and the other child, and solve it using LIBLINEAR (Fan et al., 2008).
• For a leaf node, the reduced problem consists of optimizing the original loss but over the leaf classifier on its reduced set:
$$\min_{\theta_i} \sum_{n \in R_i} L(y_n, g_i(x_n; \theta_i)) + \alpha \|\theta_i\|_1 \qquad (3)$$
In our STs, $g_i$ is a k-class softmax classifier with an ℓ1 sparsity penalty. We first estimate the k classes (out of K possible classes) as the k most populous classes in $R_i$. Then we train the softmax, which is a convex problem. We solve it using SAG (Schmidt et al., 2017).
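The data preparation for the two reduced problems can be sketched as follows. This is an illustrative sketch only: the function names are ours, the per-instance losses are assumed to have been computed already by routing each instance down the left and right subtrees, and ties are sent to the left child by assumption.

```python
from collections import Counter

def decision_node_targets(loss_left, loss_right):
    """Pseudolabel and weight per instance for a decision node's reduced problem.

    loss_left[n] / loss_right[n]: loss incurred if instance n is sent to the
    left / right child and then routed down that subtree to a leaf.
    """
    targets = []
    for l, r in zip(loss_left, loss_right):
        label = "left" if l <= r else "right"   # child with the lower loss
        targets.append((label, abs(l - r)))     # weight = loss difference
    return targets

def leaf_training_subset(labels, k):
    """Keep only instances whose class is among the k most populous at the leaf."""
    top = {c for c, _ in Counter(labels).most_common(k)}
    keep = [n for n, y in enumerate(labels) if y in top]
    return top, keep

targets = decision_node_targets([0.25, 2.0, 1.0], [0.75, 0.5, 1.0])
top, keep = leaf_training_subset(["a", "b", "a", "c", "b", "a"], k=2)
```

Instances whose two losses coincide receive weight zero, so they have no influence on the weighted binary classifier fitted at the decision node.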
The resulting algorithm visits nodes in reverse breadth-first search order and is shown in Algorithm 1. Essentially, each iteration trains all nodes at the same depth (in parallel) from the leaves to the root, by solving either an ℓ1-regularized softmax classifier at each leaf, or an ℓ1-regularized logistic regression at each decision node. Note that the ℓ1 penalty on the decision nodes' weight vectors means that some of them may become zero, which makes the corresponding node redundant; such nodes can be pruned at the end, reducing the size of the tree.

Dealing with zero probabilities
In our STs, each leaf operates on k classes. If k = K, each possible class receives a positive probability, but if k ≪ K then many (K − k) classes receive exactly zero probability. This is necessary to achieve the fast prediction we seek, but it results in an infinite cross-entropy value whenever an instance with ground-truth class y is routed to a leaf that does not contain y. This causes no issue in the reduced problem over a leaf (the softmax uses only the top-k classes in that leaf), but it does cause an issue in the reduced problem over a decision node.
Here, we have to solve a weighted 0/1 loss binary classification problem where the weights are obtained by evaluating the prediction's loss from the left and right subtrees for each instance in the node, and some of those weights can be infinity.
Algorithm 1:
repeat
  for depth d = ∆ down to 0
    for each node i at depth d (in parallel)
      if i is a leaf then
        Ri ← instances of the most populous k classes in Ri;
        θi ← fit a linear classifier on Ri to minimize eq. (3);
      else
        generate pseudolabels for each point n ∈ Ri;
        θi ← fit a weighted binary classifier to minimize eq. (2);
      end
    end
  end
until max number of iterations;
postprocessing: remove dead or pure subtrees;

To make sure learning succeeds, we tried the following approaches and evaluate their performance in Table 1:
1. Remove from the reduced problem any instance with loss=∞ (in either the left or right subtree). This performs very badly.
2. Replace loss=∞ by loss=β, where β is typically a large value (e.g. 100 or $10^7$). This is the option that works best in a number of datasets we have tried (see Table 1), but it requires an extra hyperparameter β. This is essentially the same as using a leaf model which predicts class probabilities with a softmax for its k classes and a constant, small value exp(−β) for all other K − k classes.
3. Use the 0/1 loss instead of the cross-entropy in the overall objective function of eq. (1). This avoids the infinity issue altogether, since the pseudolabels' weight is either 0 or 1 (as in Carreira-Perpiñán and Tavallali, 2018). However, the reduced problem over a leaf must now optimize the 0/1 loss (which is NP-hard) rather than the cross-entropy; we approximate this by using the cross-entropy as surrogate loss, so we still learn a softmax as usual. This requires no additional hyperparameter and does quite well. It is our default option (unless otherwise specified in sec. 5).

Table 1: Top-1 errors for STs with different ways of handling the loss=∞ during the decision node optimization. "0/1 loss" refers to using the 0/1 loss instead of the cross-entropy, "remove loss=∞" refers to removing the instances with loss=∞, and "∞-to-β" refers to approximating the infinity loss as β (e.g. β = 100).
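Options 2 and 3 above differ only in how the per-instance loss of a leaf is evaluated when generating pseudolabels. A minimal sketch follows; the function name, the `mode` argument and the example probabilities are hypothetical, and the leaf's prediction under the 0/1 loss is assumed to be its most probable class.

```python
import math

def leaf_loss(leaf_probs, y, mode="0/1", beta=100.0):
    """Loss of routing an instance with true class y to a given leaf.

    leaf_probs: {class: probability} over the leaf's k classes.
    mode "0/1":  1 if the leaf's top class differs from y, else 0.
    mode "beta": cross-entropy, with the infinite loss of an absent class
                 replaced by the finite constant beta.
    """
    if mode == "0/1":
        top = max(leaf_probs, key=leaf_probs.get)
        return 0.0 if top == y else 1.0
    if y not in leaf_probs:
        return beta                       # would be -log(0) = infinity
    return -math.log(leaf_probs[y])

p = {"a": 0.75, "b": 0.25}
losses = [leaf_loss(p, "a"), leaf_loss(p, "c"), leaf_loss(p, "c", mode="beta")]
```

With either option every loss is finite, so the loss differences used as instance weights at the decision nodes are well defined.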

Obtaining an initial tree
While TAO monotonically decreases the objective function, it still converges to a local optimum. For the constant-label leaf oblique trees of Carreira-Perpiñán and Tavallali (2018), which were applied to problems with few classes, using as initial tree a complete tree of depth ∆ with random parameters worked well (we call this "random initialization"). However, with many classes we have observed that the following greedy hierarchical clustering initialization works considerably better. Assume a complete tree of depth ∆ having $L = 2^{\Delta}$ leaves (although the idea carries over to any binary tree structure). The following simple algorithm is guaranteed to assign classes to leaves in a way that respects the ST structure and keeps similar classes near each other in the tree (pseudocode can be found in Appendix A). First, we cluster the training instances into L clusters using k-means. The L clusters are then assigned one-to-one to the L leaves by a greedy hierarchical clustering, as follows. We greedily merge pairs of clusters to obtain $L/2$ "superclusters". That is, we first merge the two closest clusters into one supercluster (which becomes their parent node). Then, we merge the two closest of the remaining clusters, etc. Note that, unlike in regular hierarchical agglomerative clustering, the resulting supercluster is not considered for merging immediately; rather, each level is processed separately, so that we obtain a tree with the desired (balanced) structure. We define the distance between two (super)clusters as the Euclidean distance between their means. We repeat the greedy merging into $L/4$, $L/8$, etc. superclusters until we reach a single supercluster containing all training instances (the root of the tree). This gives the assignment of clusters to leaves of our tree.
(A faster version of this is obtained by first replacing all the training instances within each class with a "class prototype", weighted by the number of instances, and then proceeding as above to find a greedy hierarchical clustering of these K prototypes.) Now that each instance is assigned to one leaf, the first TAO iteration can start, in reverse BFS order.
The idea is that the tree leaves induce a hierarchical partition of the input space into polytopes, hence 1) the training instances within one leaf's polytope should generally be closer to each other than to instances in other polytopes, and 2) this remains true as clusters are merged according to the tree (i.e., the polytopes of two sibling leaves will be near each other, etc.).
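The level-wise greedy merging can be sketched as below. This is a simplified illustration and not the pseudocode of Appendix A: the function names are ours, the k-means step is assumed to have been run already (so the input is the list of cluster means and sizes), and the number of clusters is assumed to be a power of two.

```python
import math

def merge_level(nodes):
    """Greedily pair up the (super)clusters of one level.

    nodes: list of (mean, size, subtree). The closest remaining pair (Euclidean
    distance between means) is merged first; the new supercluster is set aside
    until the next level, which keeps the tree balanced.
    """
    remaining = list(nodes)
    parents = []
    while remaining:
        i, j = min(
            ((a, b) for a in range(len(remaining)) for b in range(a + 1, len(remaining))),
            key=lambda ab: math.dist(remaining[ab[0]][0], remaining[ab[1]][0]),
        )
        (m1, n1, t1), (m2, n2, t2) = remaining[i], remaining[j]
        # Mean of the merged instances = size-weighted mean of the two means.
        mean = tuple((n1 * u + n2 * v) / (n1 + n2) for u, v in zip(m1, m2))
        parents.append((mean, n1 + n2, (t1, t2)))
        remaining = [c for idx, c in enumerate(remaining) if idx not in (i, j)]
    return parents

def build_initial_tree(means, sizes):
    """Return nested pairs of leaf-cluster indices; len(means) must be a power of 2."""
    level = [(tuple(m), n, idx) for idx, (m, n) in enumerate(zip(means, sizes))]
    while len(level) > 1:
        level = merge_level(level)
    return level[0][2]

structure = build_initial_tree([(0, 0), (10, 10), (0, 1), (10, 11)], [1, 1, 1, 1])
```

In the toy example, clusters 0 and 2 end up as sibling leaves, as do clusters 1 and 3, so nearby clusters share a parent. The pairwise search is quadratic in the number of clusters per level, which is acceptable when it runs on cluster means or class prototypes rather than on individual instances.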

Computational complexity
Training Assuming that training each node (logistic regression or softmax) is linear in the sample size, the cost of training all the decision nodes at the same depth is approximately constant and equal to that of training one logistic regression on the whole training set; likewise, training all the leaves costs approximately the same as training one k-class softmax classifier on the whole training set. Thus, the total sequential cost of one iteration is approximately equal to that of one k-class softmax and ∆ logistic regressions on the whole dataset. As noted above, all the nodes at the same depth can be trained in parallel.
Inference Assuming the final tree is complete, an input instance spends O(∆D) to reach a leaf and O(kD) at its softmax (which typically dominates the path cost). This is a speedup of $O\left(\frac{K}{\Delta + k}\right) \approx O\left(\frac{K}{k}\right)$ compared to a single softmax over all classes, a remarkable speedup in practice. The inference time is actually smaller because 1) the final tree may be smaller because some nodes were pruned, and 2) the weight vectors at decision node hyperplanes and leaf softmaxes are typically sparse (this is particularly important with high-dimensional features such as bag-of-words).
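As a worked example of this estimate, the sketch below evaluates the dense speedup ratio for illustrative values. The number of classes is of the order of the largest problems we consider, while the depth and k are hypothetical; the ratio ignores sparsity and pruning, which reduce the ST cost further.

```python
def dense_speedup(K, depth, k):
    """Ratio of D*K (flat softmax) to D*(depth + k) (softmax tree); D cancels."""
    return K / (depth + k)

# Hypothetical setting: 100k classes, a tree of depth 10, 50 classes per leaf.
speedup = dense_speedup(K=100_000, depth=10, k=50)
```

Here the ratio is 100000/60, i.e. over three orders of magnitude fewer dense multiply-adds per prediction than scoring every class.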

Experiments
We demonstrate the performance of our method on two popular NLP tasks: (a) large-scale text classification, and (b) language modeling. The experiments suggest that our softmax trees outperform both simple and advanced baselines either in accuracy (while remaining very fast) or in prediction time (while remaining competitive in accuracy), and quite often in both. Moreover, the resulting models are compact in terms of memory requirements.

Setup
We initialize our softmax trees (ST) using the "clustering-based" method described in section 4.2 (unless otherwise specified). The sparsity penalty (α) is set by cross-validation (using 10% of the training data). Increasing the number of TAO iterations results in better performance but at the cost of slower training. The maximum number of classes (k) at each leaf is another tunable hyperparameter, and we report it for each experiment (e.g. ST(k=50)). Details and exact values for all other hyperparameters can be found in Appendix C.
As for the baselines, we use scikit-learn's (Pedregosa et al., 2011) implementation of the one-versus-all and softmax linear classifiers. Additionally, we compare our results with more recent baselines which show state-of-the-art performance on various extreme classification problems: LOMTree (Choromanska and Langford, 2015), RecallTree (Daumé III et al., 2017), (π, κ)-DS (Joshi et al., 2017) and MACH (Medini et al., 2019). Where applicable, we use the available implementations of these methods. Finally, we have implemented hierarchical softmax as a tree-based baseline for language modeling tasks. Further details can be found in Appendix C.
We report the top-1 and top-5 errors, maximum depth (∆), mean inference time per test sample (in ms) and uncompressed model sizes (in GB). We average the errors over 3 independent runs for softmax trees, whereas the best performance is reported for the other baselines. The inference time is measured on a single CPU without parallel processing, as follows: we sequentially pass each test sample to the trained model and measure its prediction time, then average the results over the whole test set. Also, we report the storage requirement for each model (uncompressed and stored in sparse format if applicable). Appendix D has additional metrics (e.g. number of leaves, number of classes per leaf, etc.). Our hardware setup is an Intel Xeon CPU E5-2699 v3 @ 2.30GHz with 256 GB RAM.

Table 2: "ST(k=·)" indicates our method, which uses at most k classes at each leaf. The results in brackets are taken from the corresponding papers. "+" shows the results of using the cross-entropy loss with β = 100 (see section 4.1).

Results: text classification
We perform the first set of experiments on two document categorization benchmarks with a large number of classes: the ODP website categorization problem, which has over 105k classes, and WIKI-Small (with > 36k classes). The input feature vector for each document is a normalized bag-of-words representation with around 400k dimensions. See Appendix B for details and additional benchmarks. Table 2 shows that the STs consistently outperform the other baselines by a considerable margin on these benchmarks. Moreover, they achieve faster inference than most of the baselines (e.g. one-vs-all, MACH) and a speed similar to that of RecallTree and LOMTree (i.e., the other tree-based methods). It is worth mentioning that the inference times we obtain for some baselines (e.g. RecallTree, MACH) differ from the results reported in other papers. We believe this is because 1) a different computing setup is used, and 2) the measuring methodology is somewhat different (see setup).
Additionally, fig. 2 shows the tradeoff between error and depth and between inference time and depth. It also examines different values of k. In general, increasing k results in better models in terms of error. On the other hand, it increases the inference time (right plot), although the difference is typically negligible. Finally, the results suggest that the depth (∆) should be sufficiently large, but overfitting may occur past a certain point (e.g. middle plot). Note that for this set of experiments, we use a random initialization for STs.
Table 2 also reports another critical aspect: the compactness of our models. Just as our STs are very fast, they also yield very compact models compared to the baselines (at least a 10x gain). This is due to the ℓ1 penalty applied at each node, which leads to sparse weights. Moreover, we observe that the best performance for STs is typically achieved with shallow trees (see ∆), which also helps to reduce the model size.

Table 3: As in Table 2 but on the PTB language modeling task. We also report the test perplexity (with the percentage of covered points) and the top-5 error. "*" indicates that smoothing was applied to replace zero probabilities with a small epsilon and renormalize the output.

Results: language modeling
We conduct experiments on the PTB dataset, which has been extensively used to study language modeling problems. The dataset description and our preprocessing steps can be found in Appendix B.
As for the baselines, we use the same one-vs-all classifier described earlier and the Hierarchical Softmax (HSM) model (we closely follow the setup of Mikolov et al., 2013a). Also, we have implemented "HSM-approx", which chooses the child with the highest probability at each split (i.e., it achieves a faster prediction time). The setup for one-vs-all and ST is the same as in section 5.1, except that we use the random initialization for ST. As for the HSM, we use our own implementation in PyTorch (see details in Appendix C).
We report the train/test perplexities (PPL), as is commonly done for such tasks: $\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log \Pr(y_i \mid x_i)\right)$, where N is the sample size (train or test), and $y_i$ and $x_i$ are the ground-truth label and input feature vector of instance i, respectively. Most of the baselines described in the previous section (especially the tree-based methods) do not produce class probabilities and cannot be directly applied to language modeling, so we omit them from the comparison. For ST, we calculate $\Pr(y_i \mid x_i)$ by routing instance $x_i$ to the corresponding leaf of the tree and taking the softmax of the output produced by that leaf. If $y_i$ (the correct class) is not present in that leaf (which may happen since a leaf stores k < K classes), then we do not include the instance in the calculation. Therefore, we also report the total number of points with non-zero probability predictions. Note that the fact that our STs output exactly zero probability for many classes is by design and is what makes inference fast. Also note that a softmax classifier will happily assign a positive probability to a class whose region is actually empty (i.e., no input $x \in \mathbb{R}^D$ ever results in that class winning). That said, we also provide the results of applying a smoothing technique (Eisenstein, 2019, section 6.2) to ensure positive probabilities for all classes, without any increase in inference time (denoted by "*" in the tables; more results can be found in Appendix D). Specifically, we assign some small ε to all instances with zero probability and renormalize the output probabilities. This requires an additional hyperparameter ε, which we tune using cross-validation.
Table 3 summarizes our results. First of all, both versions of HSM perform worse than one-vs-all (in both error and PPL), which agrees with previous findings (Mikolov et al., 2013b). As for the ST, it shows a decent test error (both top-1 and top-5) and a faster inference time than the other baselines.
Regarding the perplexity score, our method produces exactly zero probability for some instances, which makes the overall PPL unbounded (i.e., infinite). However, if we discard such cases and focus on the subset of the data for which the probability estimate is non-zero (see "% covered" in the table), then it achieves a markedly low PPL. Moreover, Table 3 shows that ST covers the majority of the points and that this coverage increases as we increase k. With smoothing (denoted by "*"), the PPL is still much lower than that of HSM but higher than that of one-vs-all. This is expected, since the instances that had zero probability before smoothing increase the PPL substantially.
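The perplexity computation discussed above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the evaluation code behind the reported numbers. The function names are hypothetical. The smoothing shown here (every class absent from the leaf receives mass ε and the distribution is renormalized) is only one possible reading of the smoothing step described in the text.

```python
import math

def leaf_probability(leaf_classes, leaf_scores, y, num_classes, eps=0.0):
    """Pr(y | x) from the softmax of the leaf that x was routed to.

    leaf_classes: the k class ids stored in the leaf; leaf_scores: their scores.
    With eps > 0 (assumed smoothing variant), each of the K - k classes absent
    from the leaf receives mass eps and the distribution is renormalized.
    """
    m = max(leaf_scores)
    exps = [math.exp(s - m) for s in leaf_scores]     # numerically stable softmax
    total = sum(exps)
    norm = 1.0 + eps * (num_classes - len(leaf_classes))
    if y in leaf_classes:
        return exps[leaf_classes.index(y)] / total / norm
    return eps / norm                                 # exactly zero when eps == 0

def perplexity(probs):
    """Return (PPL over instances with non-zero probability, fraction covered)."""
    covered = [p for p in probs if p > 0.0]
    if not covered:
        return math.inf, 0.0
    mean_log = sum(math.log(p) for p in covered) / len(covered)
    return math.exp(-mean_log), len(covered) / len(probs)
```

Without smoothing, the second returned value corresponds to the "% covered" quantity. With eps > 0 every instance has positive probability, so the coverage is 1 and the PPL is finite over the whole sample.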

Neural language modeling
Modern neural nets are well known to achieve state-of-the-art performance in language modeling problems. For comparison, simple RNNs can easily reach PPL = 101 (Mikolov et al., 2011) on the same problem as in the previous section. Therefore, we combine our softmax trees with the output of an LSTM and show that this achieves comparable performance with a faster inference time. Specifically, we use our PyTorch implementation (see details in Appendix C) of the RNN model for word-level language modeling on the same PTB dataset, with all 10k unique words as the vocabulary. Table 4 summarizes our findings. The neural net model achieves a perplexity of 96.33 on the test set using a softmax classifier as the last layer. Once training is done, we extract the last output of the LSTM layer and use it as input to the ST (i.e., the input is a vector $\in \mathbb{R}^{150}$). In other words, the ST is not trained in an end-to-end fashion but sequentially. Despite this, our method shows similar performance to the plain softmax in terms of train/test errors and is consistently faster at inference time (about 5.7 times). Regarding the perplexity score, as in the above case, we cover the majority of the data points, for which the PPL is markedly low compared to the baseline.
Table 5 gives representative runtimes for several datasets. We train all methods using at most 16 parallel threads. In general, all tree-based and hashing-based methods are faster to train than one-vs-all. For smaller datasets (see Appendix D), training softmax trees is as expensive as training RecallTree, but faster than training MACH. For larger datasets, ST requires more time to find a good solution. Even in that case, its runtime is comparable to that of MACH. Overall, the runtime of ST is reasonable and more than justified by the fast inference time and low test error it achieves.
Table 2. For ST, we report the training times for the best performing architecture (in terms of test error). For LOMTree, we report the results from (Daumé III et al., 2017) when applicable.

Conclusion
Softmax trees strike a balance between having a single softmax classifier, which is easy to optimize but slow at inference, and a decision tree with constant-label leaves, which is hard to optimize but fast at inference. Tuning the depth of the tree and the number of classes per leaf softmax results in classifiers that are both more accurate and much faster than a regular softmax or other hierarchical softmax approaches in many-class problems. Finding good local optima for softmax trees is possible with a modification of the tree alternating optimization (TAO) algorithm combined with a good initialization. We are now working on forests of softmax trees and on growing the tree structure adaptively.

Broader Impact
Our work is on the optimization and efficient training of machine learning models for classification tasks. We anticipate no impact beyond that of the models themselves.

Table 6 summarizes the characteristics of the datasets used in our experiments. Below we provide a description of each of them.
• ALOI (Amsterdam Library of Object Images) is a color image collection of one-thousand small objects, recorded for scientific purposes. Images for each object category are created by systematically changing viewing angle, illumination angle, and illumination color (see details at https://aloi.science.uva.nl/). We obtained the preprocessed form of the dataset from the LIBSVM multiclass data collection, where the extended color histogram with 128 dimensions is used to extract image features. We follow the same random partition of the data (90% train and 10% test) as in (Choromanska and Langford, 2015). As a preprocessing step, we subtract the mean.
• ODP (Open Directory Project) is a comprehensive human-edited directory of website categories. As of April 2013 there were over 1M categories organized in a hierarchical ontology scheme. We use the preprocessed version of it, which uses 105k categories, as in (Daumé III et al., 2017; Medini et al., 2019). For each document, the input feature vector is a (normalized) bag-of-words and the class label is the category associated with the document.
• WIKI-Small is another text categorization dataset, obtained from Joshi et al. (2017). It is a subset of the Large Scale Hierarchical Text Classification challenge (LSHTC) (Partalas et al., 2015). For each document, the feature vector is a bag-of-words and the class label is the category associated with the document, obtained from the DMOZ and DBpedia hierarchical ontology of the web.
• PTB (Penn Treebank) is a standard dataset used to evaluate the performance of language models. We use the preprocessed version from Mikolov et al. (2010), which is publicly available online. The dataset consists of plain-text sentences in English with approximately 1M tokens and 10k unique words (i.e. the vocabulary size). For the neural language modeling experiments (section 5.3.1), we use this dataset as is, without further modification. For section 5.3, we construct the dataset as follows.
We filter out words that appear fewer than 10 times, which leaves us with 5 970 unique words (= number of classes). We construct a multiclass classification task as predicting the next word given the previous 3 words. As input features, we use a pretrained version of GloVe (Pennington et al., 2014) to obtain a word representation in vector space. We downloaded pretrained word vectors ($\in \mathbb{R}^{50}$) which were trained on Wikipedia 2014 and Gigaword 5. We obtain a word vector for each context word and simply concatenate them. For example, consider the sequence "black lives matter protest". First, we extract the 50-dimensional GloVe vectors for "black", "lives" and "matter", and then concatenate them, which results in a 150-dimensional vector (i.e. this is the total number of input features). The ground-truth label is a single integer: 4011 (assuming the index of the word "protest" is 4011 in our vocabulary).
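The construction above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the preprocessing script used for the experiments. The function name and the `embeddings` dictionary (word to list of floats, standing in for the pretrained GloVe vectors) are hypothetical. The text does not say how context words without a pretrained vector are handled, nor whether contexts may cross sentence boundaries, so mapping missing vectors to zeros and treating the corpus as one token stream are assumptions.

```python
from collections import Counter

def build_next_word_dataset(tokens, embeddings, dim, context=3, min_count=10):
    """Turn a token stream into (features, labels) for next-word classification.

    Classes are the words that occur at least `min_count` times. Each feature
    vector is the concatenation of the `context` preceding word vectors.
    """
    counts = Counter(tokens)
    vocab = sorted(w for w, c in counts.items() if c >= min_count)
    class_index = {w: i for i, w in enumerate(vocab)}   # word -> integer label
    zero = [0.0] * dim                                  # assumed fallback vector
    X, y = [], []
    for t in range(context, len(tokens)):
        target = tokens[t]
        if target not in class_index:
            continue                                    # rare words are not classes
        features = []
        for w in tokens[t - context:t]:
            features.extend(embeddings.get(w, zero))    # concatenate context vectors
        X.append(features)
        y.append(class_index[target])
    return X, y, class_index
```

With `dim=50` and `context=3`, each row of `X` has 150 entries, matching the input dimension stated in the text.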

C Hyperparameter Tuning for ST and Baselines
• ST We have implemented our softmax trees in Python 3.8.3 with parallel processing at each level (using Ray (Moritz et al., 2018)). The sparsity penalty (α) is set by cross-validation (usually 10% of the training data). Experimentally, we have found that α = 0.1 (for ODP, ALOI) and α = 1.0 (for PTB, WIKI-Small) lead to the best performance. We report the mean error (training and test) and standard deviation over 3 independent runs (in most cases) or the result of a single run. The number of TAO iterations is set to 20 for PTB and ODP, 30 for ALOI, and 40 for WIKI-Small. Optimizing a decision node involves an ℓ1-regularized logistic regression, which is solved using LIBLINEAR (Fan et al., 2008). Similarly, optimizing a single leaf involves an ℓ1-regularized k-class linear classification, which is solved using SAGA (Defazio et al., 2014). Both SAGA and LIBLINEAR are available through the scikit-learn interface (Pedregosa et al., 2011). For SAGA, we set the maximum number of iterations to 20. For the K-means clustering algorithm, we use the Python implementation available in scikit-learn. We use default parameters, except for the number of different runs n_init, which we adjust for the ODP dataset to reduce the runtime. Finally, we experiment with several types of losses as explained in section 4.1. In most of our experiments, we use the 0/1 loss to approximate loss=∞ and this is our default option. However, we found that a carefully tuned β (e.g. 100) for the cross-entropy loss shows the best results for text classification tasks, and we use it to report our final results.
• One-vs-all and linear softmax We use scikit-learn's implementation for these baselines (with ℓ2 penalty and the "SAG" (Schmidt et al., 2017) solver, since it scales to larger datasets). We set the n_jobs parameter to 32 and C to 10 in all of our experiments. Since running one-vs-all takes an extremely long time, we limit the maximum number of iterations (10 for ODP, 100 for ALOI, 20 for WIKI-Small and PTB). We do not report results of the linear softmax for text classification since: 1) it shows similar performance to one-vs-all; and 2) it requires huge resources to run for ODP and WIKI-Small (simply infeasible for our hardware setup).
• MACH (Medini et al., 2019) We use their implementation available online and tune its most important hyperparameters (B, R) for each dataset. See Table 2 for the specific values for each problem.
• (π, κ)-DS (Joshi et al., 2017) We use their available implementation online with the set of hyperparameters suggested by authors.
• RecallTree (Daumé III et al., 2017) We use the version implemented inside Vowpal Wabbit. For ODP and ALOI, we use the suggested hyperparameters from the official web page. However, we tune its most important hyperparameters for WIKI-Small: max_candidates, max_depth, passes.
• Hierarchical Softmax (HSM) We use our own implementation in PyTorch. We closely follow the setup from (Mikolov et al., 2013a): the structure of the tree is obtained from Huffman's algorithm (the frequency of each word is calculated from the raw training data), each decision node applies a linear transformation followed by a sigmoid non-linearity, and the objective function to minimize is the negative log-likelihood. Training is done using SGD with a small learning rate of 0.005, multiplied by 0.995 after each step, and a fixed momentum of 0.9. HSM, in its pure implementation, does not support mini-batch updates (i.e. batch sizes > 1, although various methods exist to agglomerate gradients for each node), and thus we set the batch size to 1.
• Training LSTM for language modeling. Our implementation is similar to the one that can be found on the official PyTorch examples web page. We choose an LSTM model with two layers, an embedding size of 256 and 150 hidden states. The sequence length is fixed at 20, and the initial learning rate for SGD is set to 20, which is divided by 4 if there is no improvement in the validation loss. We train this model for 40 epochs using the negative log-likelihood as the criterion.

below, we denote by "ST†" our special initialization (Algorithm 2) and by "ST" a default random initialization (i.e., from a complete binary tree of depth ∆ and random node parameters with Gaussian (0,1) and normalized to unit length). Additionally, ST+ shows the results of using the cross-entropy loss with β = 100 (see section 4.2).

"random" refers to the initialization with a complete binary tree of depth ∆ and random node parameters. "class prototypes" refers to the initialization based on hierarchical clustering described in Algorithm 2. "whole data" is a slight variation of it where, instead of performing K-means on the class prototypes, it does so on the whole dataset to obtain the initial leaf clusters. Results show that clustering-based initializations can considerably boost the performance.

Table 9: Results on text classification datasets (sorted by decreasing test error): similar to Table 2 but with additional results on ALOI. Moreover, we report the train error, the std over 3 independent runs (when applicable), the average number of leaves of a tree and the average number of classes per leaf. "†" denotes an ST version with clustering-based initialization (Algorithm 2) and "+" shows the results of using the cross-entropy loss with β = 100 (see section 4.2).

[...] Table 3 where we additionally report the train perplexity, train errors, total number of leaves and average number of classes per leaf. Moreover, for our ST, we additionally report the PPL score where smoothing is applied to handle zero probabilities, i.e., we assign some small ε to all instances with zero probability and renormalize the output distribution (see PPL test smooth).

Table 11: Results on PennTreebank (but trained on the output of the LSTM): an extension of Table 4 in which the train errors/scores are additionally reported. Also, we provide PPL scores with smoothing (as in Table 3).