The Chinese Remainder Theorem for Compact, Task-Precise, Efficient and Secure Word Embeddings

The growing availability of powerful mobile devices and other edge devices, together with increasing regulatory and security concerns about the exchange of personal information across networks of these devices, has challenged the Computational Linguistics community to develop methods that are at once fast, space-efficient, accurate and amenable to secure encoding schemes such as homomorphic encryption. Inspired by recent work that restricts floating point precision to speed up neural network training in hardware-based SIMD, we have developed a method for compressing word vector embeddings into integers using the Chinese Remainder Theorem that speeds up addition by up to 48.27% and at the same time compresses GloVe word embedding libraries by up to 25.86%. We explore the practicality of this simple approach by investigating the trade-off between precision and performance in two NLP tasks: compositional semantic relatedness and opinion target sentiment classification. We find that in both tasks, lowering floating point precision results in negligible changes to performance.


Introduction
In recent years, NLP models, particularly language models, have come under increasing scrutiny for their potential privacy leaks (e.g., Carlini et al. (2018)). One answer has been to push NLP models onto edge devices, such as mobile phones and browsers, for on-device inference using differentially private federated learning. Edge computing, in turn, requires smaller and faster models, such as DistilBERT (Sanh et al., 2019) or MobileBERT (Sun et al., 2020), which theoretically improve the viability of using BERT embeddings on these devices.
What has yet to take place, however, is careful numerical analysis of word embeddings at varying precisions for different NLP tasks. Numerical analysis can inform potential alternatives to defining the representations of word embeddings themselves. We must carefully orchestrate a balance among the size of these representations, their security, the algebra of operations that they enable, and, because many edge devices are resource-restricted relative to the GPU clusters that we do research on, the efficiency and power consumption of those operations. In particular, feature/basic phones are growing in popularity in developing nations, often with the possibility of expanding their memory to about 16GB with a MicroSD card. Many privacy-enabled language models, in particular, are simply out of reach for these low-resource devices.
No work to date has proposed using the Chinese Remainder Theorem (CRT) to create software-based compressed word embeddings, adapted to Single Instruction Multiple Data (SIMD). We propose here a CRT-based method which speeds up addition by up to 48.27% and compresses GloVe (Pennington et al., 2014) word embedding libraries by up to 25.86% (and by up to 68.68% relative to the original full-precision library), depending on the precision selected. We also explore different levels of numerical precision, both in a representative task and, more abstractly, by analysing the absolute error resulting from adding and multiplying truncated and rounded values.
Related Work. A number of recent papers have explored limiting input precision in order to speed up neural network training and inference. Zhang et al. (2018) and Ling et al. (2016) specifically looked at the effects of limiting word embedding precision. Zhang et al. (2018) evaluate embeddings of different sizes combined with hardware-implemented SIMD for an adaptation of stochastic gradient descent that achieves a 2x memory improvement and 1.2x faster training speed. Ling et al. (2016) demonstrate that word embeddings made up of 8-bit fixed-point values perform just as well as other embeddings on word similarity, phrase similarity, and dependency parsing tasks. Tissier et al. (2019), on the other hand, have shown how to generate binary word embeddings that fit within a CPU's cache to increase computing speed. With limited effect on task accuracy, they manage to generate 256-bit embeddings which are 37.5x smaller than traditional 300-dimensional word embeddings.
While we have demonstrated that the CRT can be used to increase the efficiency of word vector addition by transforming a vector into a single large number, it is usually used for the opposite purpose: to transform larger numbers into many smaller numbers in order to improve multiplication efficiency. For this purpose, the CRT is integrated into certain homomorphic encryption libraries.
The CRT has also been proposed as a method of encrypting multiple entries in a database (e.g., a_i) as one number, each decryptable by a secret modulus (e.g., m_i) (Yan, 2002). The security of such schemes is shaky, and more complexity is usually added to guarantee security (Liu et al., 2014; Lin et al., 1992; Hellerman, 1966). More recently, the Chinese Remainder Theorem was proposed as a method for attaining SIMD in order to optimize arithmetic for homomorphically encrypted values (Gentry et al., 2012; Smart and Vercauteren, 2014). We do not know of work that takes advantage of CRT-based SIMD for NLP, despite the repetitive tasks performed within certain algorithms.

Chinese Remainder Theorem
Theorem 2.1 (The Chinese Remainder Theorem). Let m_1, m_2, ..., m_r be pairwise relatively prime positive integers. Then the system of congruences

x ≡ a_i (mod m_i), for i = 1, 2, ..., r,

has a unique solution modulo M = m_1 m_2 ⋯ m_r. We use a to denote the vector of values a_1, a_2, ..., a_r and m to denote the vector m_1, m_2, ..., m_r.
The Chinese Remainder Theorem can also be used to separate very large numbers into smaller chunks on which, in certain cases, arithmetic operations can be performed more efficiently (Rosen, 2000).

Algorithm: Vec2int
First, we must determine a minimum floating point precision (φ) that the values within our word embeddings must have. Since the CRT only works with positive integers (and polynomials), we multiply each value in the word embeddings (a_i) by 10^φ and truncate (or round) the result, then add s, the magnitude of the lowest possible (negative) integer value found within our word embeddings, so that every entry is positive. Before running the CRT algorithm, we must pre-compute a vector m of moduli, which are pairwise coprime and strictly greater than any possible value of a_i. a and m can then be input into the CRT algorithm in order to produce an integer X. Since we know the values of m, we can easily perform a "reverse CRT" by calculating a_i ← X (mod m_i).
Algorithm 1 Vector to Integer
1: procedure VEC2INT(a, φ, m, s)
2:     for i ← 1 to r do
3:         a_i ← ⌊a_i · 10^φ⌋ + s
4:     end for
5:     return CRT(a, m)
6: end procedure

When two CRT-encoded vectors (say X_1 and X_2) are added together, their sum at index i can still be decoded by calculating (X_1 + X_2) (mod m_i), provided that the componentwise sum a_1[i] + a_2[i] < m_i; when a_1[i] + a_2[i] ≥ m_i, it must instead be decoded as (X_1 + X_2) (mod 2m_i). The same property holds for multiplication, i.e., decoding (X_1 · X_2) (mod m_i²) when a_1[i] · a_2[i] ≥ m_i. In other words, this representation supports vector addition, multiplication and, in a more restricted range, subtraction, by executing those operations directly on integers.
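To make the encoding concrete, the following is a minimal Python sketch of vec2int and its inverse. It follows the scale-shift-CRT recipe above, but the function names, the toy moduli, and the particular shift s are our own illustrative choices, not the authors' released code.

```python
from math import prod

def crt(residues, moduli):
    """Chinese Remainder Theorem: the unique X modulo prod(moduli)
    with X = residues[i] (mod moduli[i]) for every i."""
    M = prod(moduli)
    X = 0
    for a_i, m_i in zip(residues, moduli):
        M_i = M // m_i
        X += a_i * M_i * pow(M_i, -1, m_i)  # modular inverse, Python 3.8+
    return X % M

def vec2int(vec, phi, moduli, s):
    # Scale each entry to an integer at precision phi, then shift by s
    # so every residue is positive (s bounds the most negative entry).
    return crt([int(v * 10**phi) + s for v in vec], moduli)

def int2vec(X, phi, moduli, s):
    # "Reverse CRT": one modular reduction per component.
    return [((X % m_i) - s) / 10**phi for m_i in moduli]

# Adding two encoded vectors: componentwise sums decode correctly as
# long as they stay below each modulus; note the shift s is applied twice.
moduli = [10007, 10009, 10037]   # pairwise coprime, comfortably large
phi, s = 2, 100
v1, v2 = [0.12, -0.30, 0.25], [0.05, 0.10, -0.20]
X = vec2int(v1, phi, moduli, s) + vec2int(v2, phi, moduli, s)
sums = [((X % m) - 2 * s) / 10**phi for m in moduli]
assert sums == [0.17, -0.20, 0.05]
```

Note that the entire vector sum is performed as a single big-integer addition; the per-component decoding is deferred until the result is actually needed.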
Addition and multiplication are fundamental to most compositional semantics approaches, beginning with Mitchell and Lapata (2008, 2010). Addition is also central to measuring the relationship between word pairs, particularly when using the L1-norm as a metric, which is less sensitive to outliers than the L2-norm (Stratos, 2017). Word analogy is an example of such a task. Of course, these operations can also be applied to measuring the relationship between sentence pairs and document pairs.

Reasoning about Precision using Numerical Analysis
One might assume that we have to perform extensive empirical analysis to determine the minimum floating point precision necessary for the inputs of a particular NLP algorithm. However, there are generalizations we can make based on concepts as simple as the absolute error and relative error of the components of the operations, described in Heath (2018) as:

Absolute error = approximate value − true value    (1)

Relative error = absolute error / true value    (2)

For our operations, the vectors truncated or rounded at precision φ within a dataset D_φ are denoted v_{φi}, while the vectors of the original dataset are denoted v_i. If we want to refer to index k of vector v_i, we write it as v_i[k], where |v_1| is the size of the embeddings we are working with. The average absolute and relative errors of adding pairs of vectors drawn from the two datasets of size |D| are then computed as follows:

(1 / (|D|² · |v_1|)) Σ_{i=1}^{|D|} Σ_{j=1}^{|D|} Σ_{k=1}^{|v_1|} | (v_{φi}[k] + v_{φj}[k]) − (v_i[k] + v_j[k]) |    (3)

(1 / (|D|² · |v_1|)) Σ_{i=1}^{|D|} Σ_{j=1}^{|D|} Σ_{k=1}^{|v_1|} | (v_{φi}[k] + v_{φj}[k]) − (v_i[k] + v_j[k]) | / | v_i[k] + v_j[k] |    (4)

We have tested the efficiency of adding and multiplying integer representations of word embeddings compared to adding and multiplying their original vectors. On a 1.9GHz Intel i7-8650U with a 2.11GHz burst rate, 1024KB L2 cache and 8192KB L3 cache, 100K additions (Figure 1) generally take 3x longer to perform on vectors than on their integer encodings, regardless of precision and dimensionality, while the results for multiplication (Figure 2) are decidedly negative at larger precisions and dimensionalities.
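These error statistics are straightforward to estimate. Below is a short sketch of our own (not the authors' released code) that approximates Equations (3) and (4) over a random sample of embedding pairs, with truncation at precision phi:

```python
import numpy as np

def truncate(V, phi):
    # Truncate every entry to phi digits after the decimal point.
    return np.trunc(V * 10**phi) / 10**phi

def addition_errors(V, phi, pairs=10000, seed=0):
    """Average absolute and relative error of summing truncated
    vector pairs versus summing the originals (Eqs. 3 and 4)."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(V), pairs)
    j = rng.integers(0, len(V), pairs)
    true_sum = V[i] + V[j]
    approx_sum = truncate(V[i], phi) + truncate(V[j], phi)
    abs_err = np.abs(approx_sum - true_sum)
    rel_err = abs_err / np.maximum(np.abs(true_sum), 1e-12)  # avoid /0
    return abs_err.mean(), rel_err.mean()

# Example with random stand-in "embeddings":
V = np.random.default_rng(1).normal(size=(1000, 50))
for phi in (1, 2, 4):
    print(phi, addition_errors(V, phi))
```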
We may then theoretically expect the CRT-encoding of vectors as integers to result in a significant performance gain in the case of addition, but, for the most part, a significant performance loss in the case of multiplication. Nevertheless, in comparing the various pairs of rows in Tables 1 and 2, we notice that the computational gains made from adding CRT representations rather than word vectors can considerably outweigh the losses from multiplying integer representations when the number of additions is greater than or equal to the number of multiplications. In practice, as illustrated in several use cases below, the number of additions is far greater in many important tasks.
Recall also the findings of Ling et al. (2016), who did not observe a significant decrease in performance when using 8-bit fixed-point values for word embeddings in word similarity, phrase similarity, and dependency parsing tasks. This implies that there is some room for compromise on precision in practical NLP tasks as well.

Use Case 1: Compositional Semantics
A prime example of the significant performance improvements brought on by integer addition of word embeddings in the area of compositional semantics can be found in the vector addition model of Salton and McGill (1986). Asaadi et al. (2019) demonstrated that vector addition is more effective than other proposed unsupervised compositional models (multiplication (Mitchell and Lapata, 2010), tensor product with convolution (Widdows and Ferraro, 2008), and dilation (Mitchell and Lapata, 2010)) for determining semantic relatedness between bigrams and other bigrams or unigrams.
Dataset and previous results. In Asaadi et al. (2019), the authors introduce BiRD, a bigram relatedness dataset created using the Best-Worst Scaling annotation method. In this annotation scheme, annotators are provided with n samples, where n is often 4, and are asked which of the samples best represents a given property and which one represents it worst (Kiritchenko and Mohammad, 2016, 2017). Asaadi et al. (2019) compute bigram semantic relatedness using three different kinds of word embeddings; namely, pre-trained GloVe vectors, pre-trained fastText word embeddings, and word-context co-occurrence vectors extracted from a corpus of university websites (Turney et al., 2011). Specifically, the authors compute the relatedness score for the vectors representing the term pair AB-X, where AB is a bigram and X can be either a bigram or a unigram. The results of four unsupervised compositional models were compared:
• Weighted addition (Salton and McGill, 1986);
• Multiplication (Mitchell and Lapata, 2010);
• Tensor product with convolution (Widdows and Ferraro, 2008);
• Dilation (Mitchell and Lapata, 2010).
They then used Pearson correlation to compare the semantic relatedness scores computed by these unsupervised compositional models with the gold standard in BiRD. Addition turns out to be the composition method that yields the highest Pearson correlation scores.
Integrating vec2int. The task introduced by Asaadi et al. (2019) is a prime candidate for using vec2int to speed up computations and reduce space: vast numbers of additions need to be performed in order to compute bigram relatedness in large datasets. These datasets often occupy a large amount of space and would also benefit from a compression method that can support addition in the compressed domain. The catch is that, in order to convert from a vector to an integer, we must limit the precision of each of the entries of the vector. Clearly, this can lead to concerns regarding the levels of accuracy that can be guaranteed by functions taking these precision-limited vectors as inputs. For our calculations, we forgo any comparisons with floating point precisions beyond 8 digits (other than the original number of digits: the largest precision within the term context vectors is φ = 20, within fastText, φ = 12, and within GloVe, φ = 8), since we want to maximize computational efficiency and minimize storage size. Recall that φ is the number of digits we keep after the decimal point.

Figure 3 and Figure 4 display the effects of truncating and rounding floating point numbers at various precisions and then converting them to integers for determining bigram relatedness through the composition of fastText and GloVe vectors, respectively. Interestingly, even sticking to a floating point precision of 1 or 2 can lead to results which are just as good as with the original precision.

Figure 5 shows the average absolute error (see Equation 3) and Figure 6 the average relative error (see Equation 4) for each possible sum of two vectors within the term context, fastText, and GloVe datasets. We can see that an increase in average absolute and relative errors lines up with the decrease in correlation score, and so we can determine at which point bigram relatedness would start to falter without computing Pearson or Spearman correlations. In fact, it appears to suffice to calculate only the average absolute (Figure 9) and relative (Figure 10) errors of the truncated and rounded word vectors themselves (see Equations 1 and 2, respectively).
For interest, we show the average absolute error (Figure 7) and average relative error (Figure 8) for composition through multiplying fastText vectors at various precisions. fastText exhibits the most dramatic difference in performance between composition by addition and composition by multiplication at lower precisions. The correlation scores for term context vectors are nearly the same as those for GloVe at untruncated precision, but remain stable down to φ = 2. (We thank Shima Asaadi, Saif M. Mohammad, and Svetlana Kiritchenko for providing the code and data used in their BiRD paper.)
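As an illustration of this workflow, bigram relatedness by additive composition can be computed over CRT-encoded vectors with a single big-integer addition per bigram. This is our own sketch: relatedness and enc are hypothetical names, and it reuses the int2vec helper from the earlier sketch (repeated here so the snippet runs standalone).

```python
import numpy as np

def int2vec(X, phi, moduli, s):
    # "Reverse CRT" from the earlier sketch.
    return [((X % m_i) - s) / 10**phi for m_i in moduli]

def relatedness(enc, word_a, word_b, word_x, phi, moduli, s):
    # Compose the bigram AB with one big-integer addition, decode once,
    # then score against X with cosine similarity. `enc` maps each word
    # to its CRT encoding; the shift s is applied twice in a sum.
    ab = np.array(int2vec(enc[word_a] + enc[word_b], phi, moduli, 2 * s))
    x = np.array(int2vec(enc[word_x], phi, moduli, s))
    return float(ab @ x / (np.linalg.norm(ab) * np.linalg.norm(x)))
```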

Compression
We test the utility of the CRT as a method of compressing word embeddings. The results are shown in Table 3, reaching up to a 25.85% space reduction for GloVe vectors. We also calculate an upper bound on how large a dataset of word embeddings can become when compressed using CRT representations, using the same relatively prime moduli m as were used to calculate the results displayed in Table 3. Calculating the upper bound essentially comes down to associating each word in the dataset with the largest possible number N given m:

N = (∏_{i=1}^{r} m_i) − 1

To demonstrate that the compression would work no matter what the vectors were, we show the upper bounds compared to the size of the original dataset of 50-dimensional GloVe embeddings and the true CRT-compressed dataset of those embeddings in Table 4.
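A back-of-the-envelope version of that upper bound is just the byte length of N times the vocabulary size. The sketch below uses our own illustrative assumptions (a 400k-word vocabulary and fifty primes just above 2^17 as moduli; the paper's actual moduli are not listed here), with sympy used only as a convenient source of pairwise-coprime primes:

```python
from math import prod
from sympy import nextprime  # any source of pairwise-coprime moduli works

def crt_upper_bound_bytes(vocab_size, moduli):
    # Every CRT encoding is strictly less than prod(moduli), so
    # N = prod(moduli) - 1 bounds the bytes needed per word.
    N = prod(moduli) - 1
    return vocab_size * ((N.bit_length() + 7) // 8)

# Illustrative only: 400k words, fifty prime moduli just above 2**17.
moduli, m = [], 2**17
for _ in range(50):
    m = nextprime(m)
    moduli.append(m)
print(crt_upper_bound_bytes(400_000, moduli) / 2**20, "MiB")
```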

Use Case 2: Data Preprocessing for Private and Secure Computing
Another potential use case arises when one wants to convert a text to its corresponding word embeddings directly on device, before it is sent for cloud processing through homomorphic encryption or secure multiparty computation algorithms, which compute on precision-limited, obfuscated data. Arora et al. (2020) show that pre-trained embeddings perform within 5-10% of the accuracy of contextual embeddings (e.g., BERT) on benchmark tasks (NER, sentiment analysis, GLUE) and can even regularly match their performance when using industrial-scale data.
Directly converting words to their respective word embeddings on device is particularly useful for those algorithms, where table lookup over large amounts of data can be a fairly expensive operation. Not only can CRT representations compress the word embeddings to save valuable space on edge devices, but, unlike zip files, they can be selectively decompressed as needed. If one only wants to convert the words in a user query to embeddings, there is no need to decompress the entire dataset!
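Selective decompression is then a dictionary lookup plus one modular reduction per dimension. A minimal sketch (embed_query is a hypothetical name; table is an assumed word-to-integer mapping, and int2vec is the helper from the earlier sketch):

```python
def embed_query(query, table, phi, moduli, s):
    # Decode only the query's words; the rest of the compressed
    # table is never touched.
    return [int2vec(table[w], phi, moduli, s)
            for w in query.lower().split() if w in table]
```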
We compare the results of CNN-based sentence-level sentiment analysis using GloVe vectors as input (Kim, 2014) with those using BERT. Kim (2014) is the same work that Arora et al. (2020) used to compare with BERT over 6 different datasets. Of those datasets, we choose to analyze the variation in performance on the MR dataset (movie reviews, one sentence per review; Pang and Lee, 2005) and on the opinion polarity detection subtask of the MPQA dataset (Wiebe et al., 2005).
In addition to comparing BERT-based sentiment analysis results with CNNs that take full-precision 300-dimensional GloVe vectors as inputs (which Arora et al. (2020) do), we verify the feasibility of this use case by conducting experiments with varying precisions of 50-dimensional and 300-dimensional GloVe embeddings. Table 5 shows that sentence-level sentiment analysis with GloVe embeddings remains almost unchanged even at φ = 1 on the MR dataset, on which Arora et al. (2020) show that BERT outperforms CNNs by 6.9 points. BERT therefore outperforms the CNN with 50d GloVe embeddings at precision φ = 1 (which achieves a performance of 77.3) by 8.9 points. However, GloVe often matches BERT on this task when the BERT embeddings are trained on 16 times less data. Arora et al. (2020) also show that BERT outperforms GloVe on the MPQA dataset by a mere 0.9 points. We determine that BERT outperforms the CNN with 50d GloVe embeddings at precision φ = 1 (which achieves a performance of 88.46) by only 1.14 points.
As mentioned in Arora et al. (2020), it takes "440MB to store BERT-Base parameters, and on the order of 5-10 GB to store activations[, while p]retrained non-contextual embeddings (e.g., GloVe) require O(nd) to store an n-by-d embedding matrix (e.g., 480 MB to store a 400k by 300 GloVe embedding matrix)." Our method manages to reduce this embedding matrix size by 25%. MobileBERT uncased, which has 23.21% of the number of parameters of BERT-Base, ends up with a storage size of 139 MB (compared with 421 MB for BERT-Base uncased). That resource-constrained model is still 2.62x larger than our 53.66MB compressed GloVe embedding matrix, and without the benefit of our significant performance gain. Indeed, it is difficult to imagine how a method that manipulates 512-dimensional vectors could compete with adding integers. MobileBERT's performance scores are within a couple of percentage points of those of BERT-Base.

φ      50d      300d
-1     0.4908   0.5001
0      0.7571   0.7320
1      0.7730   0.7902
2      0.7745   0.7900
5      0.7730   —
full   0.769    0.7942

Table 5: Performance when varying the precision φ of input GloVe embeddings to sentence-level sentiment analysis using a CNN.

Discussion and Future Work
These preliminary results suggest that the vec2int algorithm would be an efficient way of encoding word vectors for specific NLP tasks, namely those which would benefit from arithmetic-supporting vector compression (which tar and zip are not). They also suggest that analysis of average relative and absolute error can be used to tune these representations.
We have yet to test the effectiveness of using integer representations of word2vec (Mikolov et al., 2013) and BERT (Devlin et al., 2018) embeddings, but expect the outcomes to be similar to the results for term context vectors, fastText, and GloVe. It would also be particularly interesting to see what kind of performance improvements can be obtained by applying the CRT to sentence- as well as document-level embeddings.
Relevant to privacy concerns, homomorphic encryption schemes typically constrain the number of multiplications that can be performed and the precision of the values being computed upon. The same sort of numerical precision analysis and attention to computational limitations that we have applied in this paper can inform how we as NLP researchers think about integrating our methods into such privacy technologies. Average relative and absolute error might be useful for creating better encrypted NLP algorithms using homomorphic encryption, since many homomorphic encryption schemes require integer or limited-precision inputs.