Block-wise Word Embedding Compression Revisited: Better Weighting and Structuring

Word embedding is essential for neural network models on various natural language processing tasks. Since a word embedding usually has a considerable size, it should be effectively compressed before a neural network model containing it can be deployed on edge devices. A previous study proposed a block-wise low-rank approximation method for word embedding, called GroupReduce. Although their structure is effective, the properties behind the concept of block-wise word embedding compression were not sufficiently explored. Motivated by this, we improve GroupReduce in terms of word weighting and structuring. For word weighting, we propose a simple yet effective method inspired by the term frequency-inverse document frequency method as well as a novel differentiable method. Based on them, we construct a discriminative word embedding compression algorithm. In the experiments, we demonstrate that the proposed algorithm finds word weights more effectively than its competitors in most cases. In addition, we show that the proposed algorithm can act as a framework through successful cooperation with quantization.


Introduction
Deep neural networks have attracted much attention due to their great success in many applications. Recently, deep learning has been actively applied to edge devices such as smartphones for important reasons including data privacy and low latency. However, deep neural networks usually have a tremendous number of parameters, so they cannot simply be deployed on such devices with limited resources. To resolve this issue, there is a line of research on compressing neural networks.
Existing works on neural network compression mainly focus on convolutional layers and fully connected layers. In addition to those layers, there is a special and important layer called word embedding, which has a considerable size and is commonly used in natural language processing (NLP) tasks. A word embedding is represented by a matrix where each row vector corresponds to a word and is used as a vector representation of that word. There are also many existing works for compressing a word embedding layer. Among them, (Chen et al., 2018) proposed an interesting compression method, named GroupReduce, which constructs word clusters and conducts low-rank approximation on the blocks (sub-embedding matrices) induced by them. They also proposed a low-rank approximation method working with specific weights on words. Although their structure is simple and effective, the properties behind the concept of block-wise word embedding compression were not sufficiently explored.
The major contribution of this work is to propose two effective word weighting methods for block-wise word embedding compression and to exploit a non-uniform partitioning method for a lightweight embedding structure. Based on them, we construct a Discriminative Block-wise word embedding compression algorithm (DiscBlock) which significantly outperforms GroupReduce. In addition, we show that it can cooperate with another compression technique, such as quantization, as a compression framework. Outline. In this work, we first introduce a block-wise word embedding structure inspired by GroupReduce of (Chen et al., 2018). Next, we discuss better word weighting and clustering to build the structure. After that, we conduct extensive experiments to demonstrate the effectiveness of DiscBlock on various downstream tasks such as language modeling, machine translation, text classification, and question answering.

Related Work
Word embedding is widely used in natural language processing, and it requires considerable size. Thus, many approaches have been proposed for compressing it. Several works proposed compact representations of word embedding. (Andrews, 2016) proposed a way of exploiting Lloyd's algorithm to get low-bit representations of embedding vectors. (Ling et al., 2016) studied 8-bit representations for word embedding with training. (Hubara et al., 2017) proposed low-bit quantized neural networks for convolutional neural networks and recurrent neural networks. More recent works focus on devising a better structure of word embedding or an optimized way of computing encodings. (Shu and Nakayama, 2018) proposed a method compressing word embedding through compositional discrete codes, which can be trained by gradient descent. (Shi and Yu, 2018) proposed a product quantization-based compression method, which divides an embedding matrix into sub-matrices via k-means clustering. In terms of word clustering, it is similar to our clustering method, but we do not use embedding vectors as the targets of clustering. Instead, we use real-valued word weights. In a slightly different line of research, (May et al., 2019) proposed an evaluation score for the downstream performance of compressed word embeddings, named the eigenspace overlap score. In addition, (May et al., 2019) showed a lower bound of the eigenspace overlap score for a uniform simple quantization-based compression method to explain its empirical effectiveness. We do not use the eigenspace overlap score in this work, but the quantization method will be used in the experiments. (Kim et al., 2020b) proposed a codebook-based compression method supporting word-level adaptive code lengths. The adaptive code length of a word can be considered a word importance measure, but the code lengths must be predefined over a domain of very limited size.
Since word embedding is generally represented by a matrix, decomposition-based compression techniques and efficient embedding structures have also been proposed. (Chen et al., 2018) proposed the block-wise low-rank approximation method for word embedding. (Hrinchuk et al., 2020) devised a way of interpreting an embedding matrix as a 3-dimensional tensor and proposed an embedding structure by decomposing it with tensor-train decomposition. (Panahi et al., 2020) proposed a small-size word embedding structure inspired by quantum entanglement. (Lioutas, 2020) proposed a recent study on word embedding factorization based on distillation. As (Lioutas, 2020) conducted experiments combining their approach with GroupReduce, it can also be applied to our algorithm.
There are many existing approaches for word embedding compression, but none of them deeply studies word weights in the sense of compressing word embedding. Decomposition. Since this work is based on low-rank approximation, we also review decomposition-based model compression approaches. (Kim et al., 2016) proposed a low-rank Tucker decomposition on kernel tensors. (Yu et al., 2017) proposed a framework unifying low-rank approximation and pruning of kernel tensors, which assumes that kernels are likely to be low-rank and sparse. (Astrid and Lee, 2018) proposed a canonical polyadic decomposition-based compression method for approximating a convolutional layer. (Ma et al., 2019) proposed a variation of the transformer of (Vaswani et al., 2017) by decomposing multi-linear attention with Block-Term tensor decomposition (De Lathauwer, 2008). Note that the transformer of (Ma et al., 2019) contains word embedding, but it is not compressed in their work.

Notations
The set of words, called a vocabulary, is denoted by V, and its size is denoted by n. We have an n × d embedding matrix E corresponding to V, where d is the dimension of each word embedding vector and n > d. log x stands for the natural logarithm of x. diag(x_1, ..., x_k) ∈ R^{k×k} is the diagonal matrix with the input arguments on its diagonal. For any vector v, v_i denotes the i-th element of v. In addition, when we conduct Singular Value Decomposition (SVD) on a matrix, the diagonal matrix of singular values is assumed to be already multiplied into one of the other factors for simplicity.

Weighted Singular Value Decomposition
Consider a list L consisting of all the words, sorted in a certain order. Consider a vector s such that s_i is the weight assigned to the i-th word in L. Let us define a diagonal matrix S = diag(√s_1, √s_2, ..., √s_n). Then, (Chen et al., 2018) introduced a rewritten form of the objective function of weighted SVD and how to obtain its solution as follows.
Suppose that we conduct SVD on SE instead of E and the result is UV^T. Then, the solution of the weighted SVD on E with weight vector s is (U*, V*) = (S^{-1}U, V), so that E ≈ U*(V*)^T.
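As a concrete illustration, the following NumPy sketch implements the procedure above: scale the rows of E by the square roots of the weights, run a truncated SVD, and unscale the left factor. The rank-r truncation and the names are assumptions made for illustration.

```python
import numpy as np

def weighted_svd(E, s, r):
    """Rank-r weighted SVD of E with per-word weights s: SVD of S E, then U* = S^{-1} U.
    (A minimal sketch; the rank-r truncation and names are assumptions.)"""
    sqrt_s = np.sqrt(s)
    U, sigma, Vt = np.linalg.svd(sqrt_s[:, None] * E, full_matrices=False)
    U_r = U[:, :r] * sigma[:r]          # fold the singular values into U, as noted in "Notations"
    V_r = Vt[:r].T
    U_star = U_r / sqrt_s[:, None]      # U* = S^{-1} U
    return U_star, V_r                  # E is approximated by U_star @ V_r.T
```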

Block-wise Low-Rank Approximation
Let us introduce GroupReduce, the block-wise low-rank approximation for word embedding of (Chen et al., 2018). GroupReduce works with a set of multiple groups G such that the groups are disjoint and their union is the entire word set. For grouping, (Chen et al., 2018) take a simple approach: sort words in descending order of frequency and partition them into g same-size groups. For each group G_i in G, we induce the sub-embedding matrix consisting of the embedding vectors of the words in G_i, denoted by E_i. In addition, suppose that each word w in G_i is associated with its frequency as a weight. Then, GroupReduce computes the weighted SVD of E_i with a certain rank r_i as U_i(V_i)^T. (Chen et al., 2018) set r_i to be (f_i / f_c) · r, where f_i is the average frequency of words in G_i, f_c is the average frequency of words in the group with the least frequent words, and r is a user-specified rank for the group with the least frequent words. Finally, GroupReduce approximates E as [U_1(V_1)^T, ..., U_g(V_g)^T], where [A, B] is the concatenation of sub-embedding matrices A and B over words. A Note on Refinement. (Chen et al., 2018) proposed an algorithm for refining this group assignment with consideration of minimizing the total reconstruction error of the weighted SVD. However, it may fail to find a better assignment, because words in a group with a low rank have a strong tendency to be moved to a group with a high rank. This is unintended and unhelpful given the meaning of word weights. Thus, their refinement algorithm is not used in this work.
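Building on the weighted SVD sketch above, the following sketch illustrates the block-wise approximation with frequency-based grouping and the rank rule r_i = (f_i / f_c) · r. The group count and base rank are placeholder values, not settings from the paper.

```python
import numpy as np

def group_reduce(E, freq, g=5, r=8):
    """Block-wise weighted low-rank approximation (a sketch of the scheme above;
    the uniform grouping and the default values are assumptions)."""
    order = np.argsort(-freq)                   # words in descending order of frequency
    groups = np.array_split(order, g)           # partition into g same-size groups
    f_c = freq[groups[-1]].mean()               # average frequency of the least frequent group
    blocks = []
    for idx in groups:
        r_i = max(1, min(E.shape[1], int(round(freq[idx].mean() / f_c * r))))
        U_i, V_i = weighted_svd(E[idx], freq[idx], r_i)   # reuse the weighted SVD sketch above
        blocks.append((idx, U_i, V_i))          # E[idx] is approximated by U_i @ V_i.T
    return blocks
```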

Proposed Algorithms
DiscBlock uses the same block-wise word embedding structure as (Chen et al., 2018). In addition, given word weights, we assign a rank to each group in the same way as (Chen et al., 2018). The differences between DiscBlock and GroupReduce lie in word weighting and clustering, which are explained in this section.

Beyond Frequency: Better Weighting
Although word frequency is quite simple and reasonably useful, it is not the best option in many cases. For example, when word frequency is used as the importance of a word to a document in information retrieval, the importance of unimportant words like 'is' can be overestimated. Since GroupReduce uses word frequency as a measure of word importance, it may have a similar problem: unimportant words are overestimated and may be falsely included in a high-rank word group. Motivated by this, we propose two different methods for word weighting.

Simple Yet Effective Word Weighting
To address the problem of word frequency in information retrieval, the Term Frequency-Inverse Document Frequency (TF-IDF) score was introduced. This scoring method determines the importance of a word to a document by considering both its frequency and the number of documents containing it. Inspired by the TF-IDF score, we define a TF-IDF based word importance, tf-idf_avg(w), as follows.
where α is a user-specific parameter for scaling, ε is a small value for avoiding zero, D is the entire document set, D_w is the set of documents containing word w, and f_{w,D} is the frequency of word w in document D. In this work, α and ε are set to 0.1 and 1/|D|, respectively. Rationale. tf_avg(w) is a normalized term frequency and idf(w) is the logarithm of the inverse document frequency of w. Note that for words frequent over many documents, like 'is', the inverse document frequency is likely to be low, so we can avoid assigning such words to a high-rank word group.
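Since the tf-idf_avg equation itself is omitted from this excerpt, the following sketch shows one plausible realization consistent with the description: a document-averaged normalized term frequency multiplied by a smoothed log inverse document frequency and scaled by α. The exact combination is an assumption.

```python
import math
from collections import Counter

def tfidf_avg_weights(documents, alpha=0.1):
    """Hypothetical tf-idf_avg weighting; assumes tf-idf_avg(w) = alpha * tf_avg(w) * idf(w)."""
    n_docs = len(documents)
    eps = 1.0 / n_docs                                   # epsilon = 1/|D|, a small value avoiding zero
    doc_freq, tf_sum = Counter(), Counter()
    for doc in documents:                                # each document is a list of tokens
        for w, f in Counter(doc).items():
            doc_freq[w] += 1                             # |D_w|
            tf_sum[w] += f / len(doc)                    # f_{w,D} normalized by document length
    weights = {}
    for w in doc_freq:
        tf_avg = tf_sum[w] / doc_freq[w]                 # average over the documents containing w
        idf = math.log(n_docs / doc_freq[w]) + eps       # epsilon keeps the weight nonzero
        weights[w] = alpha * tf_avg * idf
    return weights
```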

Differentiable Word Weighting
To obtain more effective word weights than the TF-IDF based method, we devise a trainable (differentiable) word weighting method. Given a trained (target) model M, this method modifies M and trains it to learn effective word weights. After the training process, the trained word weights are used to form the diagonal matrix S and to construct the block-wise low-rank approximation. Note that when training word weights, the other weight parameters in M are not re-trained. Word Importance through Masking. Consider a word w and its embedding vector v_w in the embedding matrix of M. Suppose that we assign a number of zeros to v_w uniformly at random and there is no loss of accuracy for M. Then, we can consider that the reason why v_w carries many zeros without loss of accuracy is that w is not an important word to M.
Based on this idea, we propose an advanced method for computing word weights based on masks on the elements of the embedding matrix. The masks are formulated to effectively cause an information bottleneck (low-rankness), which is helpful for a model selecting masks while minimizing task loss during training. Suppose that we have a positional index p_w on v_w, which is called a pivot and is determined by the importance of w. The proposed method makes v_w sparse by replacing the values after p_w in v_w with zeros, so that the resulting masks are aligned as depicted in Figure 1. Then, we claim the following proposition. Proposition 4.1. For an embedding matrix Ê masked by the proposed method, if the rank of E is r, the rank of Ê is bounded in terms of r and the pivots; in particular, rank(Ê) is at most the largest pivot. By manipulating pivots, the proposed method can therefore guarantee that the masked embedding matrix is low-rank (i.e., rank(Ê) < d).
Making it Trainable. The remaining problem is how to determine the pivot p_w for each word w, depending on the importance of w, in a differentiable way. To parameterize the pivot, consider a pivot function p(x_w), where x_w is a trainable parameter such that 0 < x_w ≤ 1, which maps x_w to a position in {1, ..., d}. Then, we formulate a masking function m : R → {0, 1}^d whose output keeps the first p(x_w) elements and zeroes out the rest. It is easy to see that the range of m(·) is the same as the output space of the pivot-based masking. In addition, since x_w is proportional to the number of non-masked elements, which is likely to be positively correlated with the importance of w, we use x_w as the word weight of w in this method.
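A minimal sketch of the hard pivot-based mask is given below, assuming p(x) = ceil(d · x), which is consistent with x_w being proportional to the number of non-masked elements; the paper's exact pivot function is not shown in this excerpt.

```python
import torch

def pivot_mask(x_w, d):
    """Hard pivot-based mask: keep the first p(x_w) elements of a d-dimensional vector,
    zero the rest. x_w is a scalar tensor in (0, 1]; p(x) = ceil(d * x) is an assumption."""
    p = int(torch.ceil(d * x_w).clamp(1, d))   # pivot position in {1, ..., d}
    mask = torch.zeros(d)
    mask[:p] = 1.0
    return mask
```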
The masking function m(·) is not differentiable at the points where the pivot changes, and otherwise its derivative is zero. This makes it hard to train x_w with gradient descent due to the zero derivative issue. (Kim et al., 2020a) addressed a similar problem by introducing a trainable gate function. We use the same gradient shaping function β : R → R proposed by (Kim et al., 2020a), where L is a large positive integer. Based on β(·), we define the trainable masking function m̃ : R → R^d. It is easy to prove the uniform convergence of m̃(·) to m(·) with a large value of L. The idea of this approach is that β(·) has an extremely small value near zero, but its derivative is one wherever β(·) is differentiable, so that we can train x_w. Learning from Hunger. In order to learn word importance properly during training, we need to define an additional loss term. This is because, without any regularization, the model will be trained to minimize the number of masked elements. The additional loss based on the sparsity of the masked embedding vectors is defined as L(x; γ) = λ ||(1 − x) − γ||_2^2, where λ and γ are real-valued user-specific parameters, ||·||_2^2 is the squared l2-norm, γ ∈ R^n is the vector whose elements are all γ, 1 ∈ R^n is the ones vector, and x ∈ R^n is the trainable word weight vector. λ is set with consideration of the ratio between the task loss and L(x; γ), and γ controls the desired sparsity. This loss function leads the model to learn word importance with a limited budget. One Step Further. If we use the masked embedding matrix Ê instead of E, the output distribution of the subsequent layers of M will change due to the rank reduction. To alleviate this unintended change, we propose a function c(·) which takes the masked word vector and the masking vector as inputs. This is called a compensation function, and it can be formulated in two ways. Given a masked word vector ṽ_w ∈ R^d in Ê, one way is to define g linear layers, select one of them depending on the value of p(x_w), and make ṽ_w pass through it. For simplicity and efficiency, the selection is conducted uniformly in the forward pass. That is, if d(i−1)/g < p(x_w) ≤ di/g, ṽ_w is passed through the i-th linear layer.
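Before turning to the second compensation variant, here is a minimal sketch of the gradient-shaped trainable mask and the sparsity loss described above. The exact forms of β(·), m̃(·), and L(x; γ) are omitted from this excerpt, so these are plausible realizations consistent with the stated properties.

```python
import torch

def grad_shaping(x, L=100000):
    """beta(x): value below 1/L, derivative 1 where differentiable
    (one plausible form; the exact definition from (Kim et al., 2020a) is an assumption)."""
    return (L * x - torch.floor(L * x)) / L

def trainable_mask(x_w, d, L=100000):
    """m~(x_w): hard prefix mask plus beta(x_w), so gradients reach x_w while m~ stays close to m."""
    p = int(torch.ceil(d * x_w).clamp(1, d))
    hard = torch.zeros(d)
    hard[:p] = 1.0
    return hard + grad_shaping(x_w, L)

def sparsity_loss(x, gamma=0.95, lam=1.0):
    """Assumed form of L(x; gamma): push the per-word sparsity (1 - x) toward gamma."""
    return lam * torch.sum(((1.0 - x) - gamma) ** 2)
```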
The other way is to let the model find an effective selection through training. Given a masked word vector ṽ_w ∈ R^d and its masking vector m̃_w ∈ R^d computed by m̃(·), c(·) is formulated as c(ṽ_w, m̃_w) = Σ_{i=1}^{g} σ(W_2 W_1 m̃_w + b)_i (C_i ṽ_w + b_i), where σ can be the softmax function or Gumbel-Softmax, W_1 ∈ R^{δ×d}, W_2 ∈ R^{g×δ}, b ∈ R^g, C_i ∈ R^{d×d}, and b_i ∈ R^d. C_i and b_i are the weight matrix and the bias of the i-th linear layer in c(·), respectively. W_1 and W_2 are weight matrices and b is a bias vector used to determine which linear layer is used. δ is designed to be smaller than g. W_1 and W_2 are initialized as all-ones matrices, while b and b_i are initialized as zero vectors. C_i is initialized as a matrix whose diagonal entries are one and whose off-diagonal entries are zero.
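The continuous compensation function can be sketched as a PyTorch module as follows. The shapes and the initialization follow the description above, but the exact equation is reconstructed from them and should be treated as an assumption rather than the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Compensation(nn.Module):
    """Continuous compensation c(v~, m~): a soft mixture of g linear layers,
    with the mixture weights computed from the mask vector (a sketch)."""
    def __init__(self, d, g=5, delta=1, use_gumbel=True):
        super().__init__()
        self.W1 = nn.Parameter(torch.ones(delta, d))          # W_1 in R^{delta x d}, all-ones init
        self.W2 = nn.Parameter(torch.ones(g, delta))          # W_2 in R^{g x delta}, all-ones init
        self.b = nn.Parameter(torch.zeros(g))                 # selection bias
        self.C = nn.Parameter(torch.eye(d).repeat(g, 1, 1))   # C_i initialized to identity
        self.b_i = nn.Parameter(torch.zeros(g, d))            # per-layer biases
        self.use_gumbel = use_gumbel

    def forward(self, v_masked, m):
        logits = self.W2 @ (self.W1 @ m) + self.b             # selection logits in R^g
        w = F.gumbel_softmax(logits, hard=False) if self.use_gumbel else F.softmax(logits, dim=-1)
        outs = torch.einsum('gij,j->gi', self.C, v_masked) + self.b_i   # C_i v~ + b_i for each i
        return (w.unsqueeze(-1) * outs).sum(dim=0)            # weighted sum over the g layers
```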
Including the case of no compensation, the two compensation functions are depicted in Figure 2.
The rationale behind the compensation functions is that we want to compute word weights while mimicking the block-wise low-rank approximation. Consider a sub-embedding matrix E_i induced by a group in G and its low-rank approximation U_i(V_i)^T. Since the rank r_i of U_i is smaller than d, we can view U_i as word embedding vectors projected into a lower-dimensional embedding space. On the other hand, (V_i)^T can be seen as a linear transformation matrix back toward the original space. Similarly, the compensation function acts like (V_i)^T, so it is trained to alleviate the impact of using masked word vectors on the subsequent layers.

Clustering
Recall that in GroupReduce, words are sequentially partitioned into same-size groups along the list sorted in descending order of frequency. Let us call this the uniform partitioning method. Although GroupReduce implicitly assumes that words in the same group have similar importance, due to the power-law frequency distribution of words, words with very different importance can be included in the same group. In addition, the uniform partitioning method hinders GroupReduce from achieving a high compression ratio. This issue will be discussed in the experiments.
To address this problem, one decent option is to use the k-means clustering method instead of the uniform partitioning method. With a proper word importance function, the k-means clustering method can group words of similar importance far more effectively than the uniform partitioning method, as sketched below.
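A minimal sketch of this non-uniform grouping, clustering the scalar word weights with k-means instead of the uniform frequency-sorted partitioning, is shown here; the library choice, seed, and function names are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def weight_clusters(weights, g=5, seed=0):
    """Group words by k-means over their scalar importance weights
    (not over the embedding vectors themselves)."""
    labels = KMeans(n_clusters=g, random_state=seed).fit_predict(
        np.asarray(weights, dtype=float).reshape(-1, 1))
    return [np.where(labels == i)[0] for i in range(g)]   # word indices per group
```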

Implementation Details
Tasks, Datasets, and Models. We conduct extensive experiments to demonstrate the effectiveness of DiscBlock on the following tasks: language modeling, question answering, text classification, and machine translation.
Since we deal with many different types of datasets and models, we use various open-sourced implementations as follows. Note that we do not make any change to the datasets or the preprocessing implementations. For language modeling, we use Penn Treebank, WikiText2, and WikiText103 of (Merity et al., 2017) as datasets. We use a 2-layered LSTM model with dropout after the word embedding layer for the encoder. Our implementation for this task comes from the language modeling codebase provided by PyTorch examples 1. For question answering, we use SQuAD (Stanford Question Answering Dataset) 1.0 of (Rajpurkar et al., 2016) and the DrQA model proposed in (Chen et al., 2017). Our implementation for handling this dataset is based on the codebase 2 used in (May et al., 2019). For text classification, we use two datasets: SNLI (Stanford Natural Language Inference) from (Bowman et al., 2015) and SST-1 (Stanford Sentiment Treebank) from (Socher et al., 2013). For SNLI, we use an open-sourced codebase 3 providing a bidirectional LSTM. For SST-1, another open-sourced codebase 4 is used with TextCNN of (Kim, 2014). For machine translation, we use the IWSLT14 (International Workshop on Spoken Language Translation 2014) German-to-English dataset of (Cettolo et al., 2015). For this dataset, we use JoeyNMT 5, a lightweight framework for machine translation proposed in (Kreutzer et al., 2019). In addition, a recurrent neural network based on GRU with attention is used for this task. The basic statistics of the datasets are presented in Table 1. All scores reported in this work come from the test sets except for SQuAD, whose scores are computed on the validation set. Training. To obtain the base models, whose word embeddings are the compression targets, we use the numbers of epochs specified in the open-sourced codebases except for WikiText103. Due to the huge size of WikiText103, we use 10 epochs for training on it.
The learning rate and the number of epochs for retraining vary across datasets. The retraining epochs are determined with consideration of the total retraining time. If the dataset and the model are not large, the number of epochs for retraining is the same as that for training the base model from scratch. In addition, the learning rate for retraining is scaled to half or 10% of the original rate.
Note that the learning rate for training word weights is also determined experimentally. The number of epochs for training word weights is usually set to be much smaller than that for training the base models. Trainable Weight Initialization. For the differentiable word weighting method, given word w, x_w is initialized to tf-idf_avg(w). Since the differentiable word weighting method is proposed to obtain better weights than tf-idf_avg, this is a good starting point. Compression Ratio. To fairly evaluate the effectiveness of each comparison method, we control the compression ratio to be approximately 50× for IWSLT14 and 20× for the other datasets unless the compression ratio is mentioned explicitly. Hyperparameters. We have several hyperparameters for the word weighting methods. For simplicity, δ is 1 and γ is set to 0.95 or 0.99.
λ is determined through multiple experiments and ranges from 0.5 to 25.0, except for SNLI, where λ is set to 0. In that case, the sparsity loss is not helpful for training effective masks.
The number of groups g is experimentally determined to be 5. We also conducted experiments with 10 and 20 groups, but both DiscBlock and GroupReduce show stable performance with g = 5. Competitors. We have three competitors: SVD, TensorTrain, and GroupReduce. SVD is the truncated singular value decomposition method. The compression ratio of SVD is controlled by manipulating the number of singular values to use.
TensorTrain is the tensor-train decomposition-based method of (Hrinchuk et al., 2020). In (Hrinchuk et al., 2020), TensorTrain is trained from scratch. Similarly, we train it from scratch, but the other trainable parameters are trained from the pretrained values provided by the base model. Note that for each dataset, the learning rate to train TensorTrain is experimentally selected between the learning rate for training the base model and that for retraining.
For GroupReduce, we use the refinement of (Chen et al., 2018). GroupReduce first constructs uniform partitions and refines them via local search heuristics, but the uniform partition construction sometimes fails due to the power-law frequency distribution of words. In that case, even if the rank assigned to the least frequent partition is 1, the compressed embedding size is much larger than the target compression ratio. To compare GroupReduce with our algorithm even in such a case, we add a base value to each frequency score to smooth the distribution. The base value is 2^k, where k is the minimum value for achieving the target compression ratio. We mark results to which this remedy is applied with * as a prefix.

Results
Let us denote DiscBlock with frequency, tf-idf_avg, and the differentiable word weighting method by DiscBlock-F, DiscBlock-T, and DiscBlock-D, respectively. The implementation is available at the repository 6.

Overall Performance
The overall compression performance of the comparison methods is presented in Table 2. TensorTrain is not tested on WikiText103 because its training cost is too high.
The table shows that DiscBlock is much more effective than SVD and GroupReduce in most cases. In particular, the gap between DiscBlock and the others is remarkable in terms of model performance before retraining. This merit is critical when the base model and the dataset are extremely large.
We also conducted experiments on stronger compression scenarios. The results are shown in Table 3. Compared to the other competitors, DiscBlock-D has the most stable performance even in such scenarios. Figure 3 depicts the distributions of word weights computed by frequency, tf-idf_avg, and the differentiable word weighting method. Compared to frequency, tf-idf_avg and the differentiable word weighting method provide smoother distributions. Such smoothed distributions help avoid over-estimating the words that a word weighting method considers important.

Effectiveness of Compensation Functions
We compared compensation functions for the differentiable word weighting method, and the results are presented in Table 4. Conti-Gumbel is the continuous selection with Gumbel-Softmax, while Conti-Softmax is the continuous selection with the softmax function.
The results show that Conti-Gumbel is the best except on PTB and SQuAD. Even for PTB and SQuAD, Conti-Gumbel achieves almost the same performance as the best competitor. Meanwhile, the discrete method is not as effective as Identity. This may be because the discrete method divides words by uniform selection, which differs from the partitions computed by the k-means clustering method.

Why Non-uniform Clustering Matters
The k-means (non-uniform) clustering method is necessary for achieving high-level compression performance. The first reason is shown in Table 5, which compares the methods in terms of architectural effectiveness. In this table, the uniform partitioning method does not have any result for IWSLT14, because it cannot achieve a compression ratio near 20× even if the rank assigned to the least frequent word group is 1. It does not have any result for SQuAD with frequency either, due to the power-law frequency distribution of words. In addition, the results imply that the k-means clustering method provides slightly better compressed word embedding structures in many cases. In Figure 4, Conti-Uniform stands for using the uniform partitioning method to construct word groups with word weights computed by the differentiable weighting method. Compared to the results with the k-means clustering method (DiscBlock), Conti-Uniform fails to find good word weights.

Another Application: Knowledge Embedding Compression
Knowledge embedding consists of numerical vectors representing the entities and relations in a knowledge graph. We conducted a toy experiment to demonstrate the effectiveness of our algorithm for knowledge embedding compression. The dataset used in this experiment is FB15K-237, which consists of 14.5K entities and 237 relations. Since the number of relations is negligible, we compress only the entity embedding matrix. In the table, H@10 is the proportion of correct entities ranked in the top 10, and MRR is the mean reciprocal rank of the correct entities. We implement this experiment based on the open-source codebase 7. The results are shown in Table 6. For this task, frequency is the number of occurrences of an entity in triples. tf-idf_avg is not applied because there is no concept similar to a document in this task. Compared to SVD, DiscBlock-D is somewhat behind in terms of MRR and H@10 after retraining. However, it still has better performance than the competitors before retraining. This result implies that DiscBlock-D approximates the original embedding matrix well, but the block-wise structure is not helpful for achieving high performance via retraining. We believe that this observation can be a good starting point for future research.

Toward a Compression Framework: Cooperating with Quantization
Since we cluster words by their importance, each sub-embedding matrix contains words of similar importance. That is, given a compression method whose compression strength is controllable, we can apply it to each sub-embedding matrix according to its average word importance. In this experiment, we use SmallFry of (May et al., 2019), a quantization method for word embedding. For each word group G_i, we apply SmallFry to it, assigning the number of bits depending on the average word importance. The number of bits q* is determined from a user-specific parameter q for the number of bits, the average score s of G_i, the maximum average score s_max over the groups in G, and a scaling parameter ω. For simplicity, ω and q are set to 1 and 2, respectively. The results are shown in Table 7, where BlockFry is a method which partitions word groups by a word importance measure and applies SmallFry to the sub-embedding matrices induced by the groups. In the results, BlockFry is more effective than SmallFry in many cases. In particular, in terms of model performance before retraining, the gap between them is considerable.
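Since the q* formula is omitted from this excerpt, the following sketch shows one plausible bit-allocation rule in which the number of bits scales with the relative average importance s / s_max; the exact rule used in the paper may differ.

```python
import math

def assign_bits(avg_scores, q=2, omega=1.0):
    """Per-group bit widths for BlockFry (an assumed rule:
    q* = ceil(omega * q * s / s_max), clipped to at least one bit)."""
    s_max = max(avg_scores)
    return [max(1, math.ceil(omega * q * s / s_max)) for s in avg_scores]
```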

Conclusions
The block-wise low-rank approximation of (Chen et al., 2018) is an effective method for word embedding compression. However, its word weighting and partitioning schemes are somewhat simple, and there is considerable room for improvement. Motivated by this, we propose a discriminative block-wise word embedding compression algorithm, named DiscBlock, based on two effective word weighting methods and the k-means clustering method. The experimental results show that DiscBlock significantly outperforms the competitors, including GroupReduce, in terms of accuracy loss in most cases. In addition, we explore the limitation of GroupReduce in terms of compression ratio due to the uniform partition construction. Finally, as a compression framework, we show that DiscBlock can cooperate with another compression method to achieve better compression performance than it can achieve alone.