On-Device Text Representations Robust To Misspellings via Projections

Recently, there has been strong interest in developing natural language applications that live on personal devices such as mobile phones, watches and IoT, with the objective of preserving user privacy and keeping the memory footprint low. Advances in Locality-Sensitive Hashing (LSH)-based projection networks have demonstrated state-of-the-art performance in various classification tasks without explicit word (or word-piece) embedding lookup tables by computing on-the-fly text representations. In this paper, we show that projection-based neural classifiers are inherently robust to misspellings and perturbations of the input text. We empirically demonstrate that LSH projection based classifiers are more robust to common misspellings than BiLSTMs (with both word-piece and word-only tokenization) and fine-tuned BERT based methods. When subject to misspelling attacks, LSH projection based classifiers had a small average accuracy drop of 2.94% across multiple classification tasks, while the accuracy of the fine-tuned BERT model dropped by 11.44%.


Introduction
At the core of Natural Language Processing (NLP) neural models are pre-trained word embeddings like Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014a) and ELMo (Peters et al., 2018). They help initialize the neural models, lead to faster convergence and have improved performance for numerous applications such as Question Answering (Liu et al., 2018), Summarization (Cheng and Lapata, 2016) and Sentiment Analysis (Yu et al., 2017). While word embeddings are powerful when computation power and compute resources are unconstrained, it becomes challenging to deploy them on-device due to their huge size. (Ravi, 2017) representing the same word.
* Work done during internship at Google
† Work done while at Google AI
This led to interesting research by Ravi and Kozareva (2018) and Sankar et al. (2019), who showed that word embeddings can be replaced with lightweight binary Locality-Sensitive Hashing (LSH) based projections learned on-the-fly. The projection approach removes the need to store any embedding matrices, since the projections are dynamically computed. This further enables user privacy by performing inference directly on device without sending user data (e.g., personal information) to the server. The embedding memory size is reduced from O(V) to O(K), where V is the token vocabulary size and K << V is the binary LSH projection size. The projection representations can operate on either the word or the character level, and can be used to represent a sentence or a word depending on the NLP application. For instance, the recent Projection Sequence Networks (ProSeqo) used BiLSTMs over word-level projection representations to represent long sentences and achieved close to state-of-the-art results in both short and long text classification tasks with varying amounts of supervision and vocabulary sizes.
Figure 2: Memory for V look-up vectors, one for each token, vs storing K (<< V) vectors and linearly combining them for token representation. We consider K = 1120 following (Ravi and Kozareva, 2018) in this paper.
Despite the success of these approaches, no systematic research effort has so far evaluated the capabilities of LSH based projections for text representations. To that end, we empirically analyze the effectiveness and robustness of the LSH projection approach for text representation by conducting two types of studies in this paper.

1. Classification with perturbed inputs, where we show that projection based networks, namely (a) Projection Sequence Networks (ProSeqo) and (b) Self-Governing Neural Networks (SGNN), evaluated with perturbed LSH projections are robust to misspellings and transformation attacks, while we observe a significant drop in performance for BiLSTMs and fine-tuned BERT classifiers.
2. Perturbation Analysis, where we test the robustness of the projection approach by directly analyzing the changes in representations when the input words are subject to character misspellings. The purpose of this study is to examine whether words or sentences with misspellings stay nearby in the projection space instead of frequently colliding with the projection representations of other valid words.
Overall, our studies showcase the robustness of LSH projection representations and resistance to misspellings. Due to their effectiveness, we believe that in the future, text representations using LSH projections can go beyond memory constrained settings and even be exploited in large scale models like Transformers (Vaswani et al., 2017).

Binary LSH projections for text representations
The dependency on the vocabulary size $V$ is one of the primary reasons for the huge memory footprint of embedding matrices. It is common to represent a token $x$ by a one-hot representation $Y(x) \in \{0, 1\}^V$, and a distributed representation of the token is obtained by multiplying the one-hot representation with the embedding matrix $W_V \in \mathbb{R}^{d \times V}$, as in
$$W_V \cdot Y(x).$$
One way to remove the dependency on the vocabulary size is to learn a smaller matrix $W_K \in \mathbb{R}^{d \times K}$ ($K \ll V$), as shown in Figure 2. For instance, 300-dimensional GloVe embeddings $W_V$ (Pennington et al., 2014b) with a 400k vocabulary occupy > 1 GB, while $W_K$ occupies only ≈ 1.2 MB for $K = 1000$, yielding a 1000× reduction in size. Instead of learning a unique vector for each token in the vocabulary, we can think of the columns of the $W_K$ matrix as a set of basis vectors, and each token can be represented as a linear combination of basis vectors in $W_K$ as in Figure 1. We select the basis vectors from $W_K$ for each token with a fixed $K$-bit binary vector instead of a $V$-bit one-hot vector.
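The basis-vector view can be illustrated with a minimal sketch. It assumes 4-byte float entries and stores the small matrix as a list of columns; the helper names `matrix_bytes` and `embed` are hypothetical and are not taken from the cited work.

```python
BYTES_PER_FLOAT = 4  # assumption: dense float32 storage


def matrix_bytes(d, n_columns):
    """Size in bytes of a dense d x n_columns matrix of 4-byte floats."""
    return d * n_columns * BYTES_PER_FLOAT


def embed(bits, basis_columns):
    """Sum the basis columns selected by a K-bit binary vector.

    basis_columns holds K columns, each a list of d floats, so the result
    is the product of the d x K matrix with the binary vector.
    """
    d = len(basis_columns[0])
    out = [0.0] * d
    for bit, column in zip(bits, basis_columns):
        if bit:
            for i in range(d):
                out[i] += column[i]
    return out


# Toy example with d = 2 and K = 3: bits 1 and 3 select the first and last columns.
toy_basis = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]]
toy_vector = embed([1, 0, 1], toy_basis)

# Size of the small matrix for d = 300 and K = 1000 under the float32 assumption.
small_matrix_size = matrix_bytes(d=300, n_columns=1000)
print(toy_vector, small_matrix_size)
```

Under these assumptions the small matrix takes 1,200,000 bytes, which is consistent with the ≈ 1.2 MB figure quoted above.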
The LSH projection function $P$ (Figure 3) (Ravi, 2017, 2019) used in SGNN (Ravi and Kozareva, 2018) and ProSeqo does exactly this: it dynamically generates a fixed binary projection representation $P(x) \in \{0, 1\}^K$ for any token $x$ by extracting morphological input features such as character (or token) n-gram and skip-gram features, part-of-speech tags, etc. from $x$ and applying a modified Locality-Sensitive Hashing (LSH) based transformation $L$, as in
$$P(x) = L(F(x)),$$
where $F$ extracts n-grams (or skip-grams) $[f_1, \cdots, f_n]$ from the input text. Here, $[f_1, \cdots, f_n]$ could refer to either character-level or token-level n-gram (or skip-gram) features. Given the LSH projection representation $P(x)$, the distributed representation of the token $x$ is obtained as
$$W_K \cdot P(x).$$
It is worth noting that the projection operation $P$ can also be used to map an entire sentence directly to the $\{0, 1\}^K$ space.
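To make the idea of an on-the-fly binary projection concrete, the following is a minimal sketch. It substitutes a generic sign-random-projection (SimHash-style) hash for the modified LSH transformation of the cited work, whose exact form is not given here, and uses only character n-grams as features. The names `char_ngrams`, `feature_signs` and `project` are hypothetical.

```python
import hashlib
from collections import Counter

K = 1120  # projection size considered in the text


def char_ngrams(token, n_max=3):
    """Count character n-grams (n = 1..n_max) of a boundary-padded token."""
    padded = f"^{token}$"
    return Counter(
        padded[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(padded) - n + 1)
    )


def feature_signs(feature, k=K):
    """Derive k pseudo-random +1/-1 signs from a hash of the feature string."""
    digest = hashlib.shake_128(feature.encode("utf-8")).digest((k + 7) // 8)
    return [1 if (digest[i // 8] >> (i % 8)) & 1 else -1 for i in range(k)]


def project(token, k=K):
    """Map a token to k bits: bit i is 1 when its signed feature sum is positive."""
    totals = [0] * k
    for feature, count in char_ngrams(token).items():
        for i, sign in enumerate(feature_signs(feature, k)):
            totals[i] += count * sign
    return [1 if total > 0 else 0 for total in totals]


def hamming(a, b):
    """Number of positions at which two equal-length bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


p_clean = project("hello")
p_typo = project("helo")
p_other = project("world")
print(hamming(p_clean, p_typo), hamming(p_clean, p_other))
```

No lookup table is stored in this sketch: the signs for each feature are recomputed from its hash whenever they are needed, which is the property that removes the dependency on the vocabulary size.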
As for the projection based classifiers, the ProSeqo model runs a BiLSTM over word-level binary LSH projection representations to predict the correct classes, while the SGNN model (Ravi and Kozareva, 2018) computes a binary LSH projection representation for the entire input text, followed by a 2-layer MLP and a softmax layer on top of it for class prediction. SGNN was designed for short text, while ProSeqo is also suitable for long text classification tasks.
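The SGNN-style classifier described above can be sketched as a forward pass over a binary projection vector. This is an illustration only: the layer sizes, the ReLU activation, the random initialization and the names `init_layer` and `sgnn_like_forward` are assumptions, not details taken from the cited work.

```python
import math
import random


def dense(x, weights, bias):
    """Affine layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Random weights and zero bias (hypothetical initialization)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def sgnn_like_forward(bits, layers):
    """Two hidden layers with ReLU, then a softmax output layer."""
    hidden = [float(b) for b in bits]
    for weights, bias in layers[:-1]:
        hidden = relu(dense(hidden, weights, bias))
    weights, bias = layers[-1]
    return softmax(dense(hidden, weights, bias))


rng = random.Random(0)
# Toy sizes: 8 projection bits, two hidden layers of width 4, 3 classes.
layers = [init_layer(8, 4, rng), init_layer(4, 4, rng), init_layer(4, 3, rng)]
probs = sgnn_like_forward([1, 0, 1, 1, 0, 0, 1, 0], layers)
print(probs)
```

The only input to the network is the bit vector, so no embedding table appears anywhere in the model.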
There have been a number of research efforts (Sakaguchi et al., 2016; Edizel et al., 2019; Pruthi et al., 2019) to improve the robustness of neural classifiers to misspelling attacks and other text transformations. Recently, Pruthi et al. (2019) observed that fine-tuned BERT and BiLSTM based models are very brittle to adversarial misspelling attacks (e.g., accuracy drops from 90.3% to 45.8% in the SST (Socher et al., 2013) classification task). Contrary to intuition, they observed that word-piece and character-level models are more susceptible to spelling attacks than word-level models.
The LSH projection operation $P$ is a function of the n-grams (and skip-grams) of the input text $x$, and the fraction of n-grams affected by spelling attacks usually tends to be minimal, resulting in insignificant changes to the projection representation $P(x)$. Therefore, we hypothesize that projection based models like ProSeqo and SGNN are inherently robust to commonly occurring spelling attacks. In the following sections, we investigate the robustness of projection based classifiers by subjecting them to common misspellings, followed by an analysis of changes in the binary LSH projections of input text under such transformations.
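The claim that a misspelling touches only a small fraction of the n-grams can be checked with a short sketch. It assumes character n-grams of length 1 to 3 over a boundary-padded word; the helper names `char_ngrams` and `changed_fraction` are hypothetical.

```python
from collections import Counter


def char_ngrams(token, n_max=3):
    """Count character n-grams (n = 1..n_max) of a boundary-padded token."""
    padded = f"^{token}$"
    return Counter(
        padded[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(padded) - n + 1)
    )


def changed_fraction(original, misspelled, n_max=3):
    """Fraction of the original n-grams that no longer occur after the misspelling."""
    before = char_ngrams(original, n_max)
    after = char_ngrams(misspelled, n_max)
    kept = sum((before & after).values())  # multiset intersection
    return 1 - kept / sum(before.values())


# Dropping one internal character from a 14-letter word.
fraction = changed_fraction("classification", "classifcation")
print(f"{fraction:.3f}")
```

In this example only the n-grams that overlap the dropped character are lost, and the large majority of the features that feed the projection remain unchanged.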

Effect of Misspellings on Text Classification
We study the robustness of two types of projection based models, ProSeqo and SGNN. As baselines, we fine-tune the pretrained BERT-base model (with word-piece tokenization) (Devlin et al., 2018) and train two-layer BiLSTMs (with both word-only and word-piece tokenization) to accuracies comparable to those of the projection based models for a fair comparison. By word-only tokenization, we mean that models encode input words using a lookup table for each word. In our setup, we test the robustness of the neural classifiers by subjecting the corresponding test sets to common misspellings and omissions. We consider the following perturbation operations: randomly dropping, inserting, and swapping internal characters within words of the input sentences (Pruthi et al., 2019). We perturb each word in a sentence with a fixed probability, $P_{perturb}$. Following (Ravi and Kozareva, 2018), we fix the projection dimension to $K = 1120$.
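The perturbation procedure above can be sketched as follows. Several details are assumptions made for illustration: words shorter than four characters are left untouched, a swap exchanges two adjacent internal characters, inserted characters are lowercase ASCII letters, and the operation is chosen uniformly at random. The function names are hypothetical.

```python
import random
import string


def perturb_word(word, rng, op):
    """Apply one drop, add, or swap to the internal characters of a word."""
    if len(word) < 4:  # assumption: too short to perturb internally
        return word
    if op == "drop":
        i = rng.randrange(1, len(word) - 1)  # never the first or last character
        return word[:i] + word[i + 1:]
    if op == "add":
        i = rng.randrange(1, len(word))  # insert strictly between first and last
        return word[:i] + rng.choice(string.ascii_lowercase) + word[i:]
    if op == "swap":
        i = rng.randrange(1, len(word) - 2)  # swap internal positions i and i + 1
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    raise ValueError(f"unknown operation: {op}")


def perturb_sentence(sentence, p_perturb, rng, ops=("drop", "add", "swap")):
    """Perturb each whitespace-separated word independently with probability p_perturb."""
    words = []
    for word in sentence.split():
        if rng.random() < p_perturb:
            word = perturb_word(word, rng, rng.choice(ops))
        words.append(word)
    return " ".join(words)


example = "please book a flight to boston"
print(perturb_sentence(example, 0.2, random.Random(0)))
```

Passing a seeded `random.Random` instance keeps the perturbed test set reproducible across runs.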

Datasets
For evaluation purposes, we use the following text classification datasets: MRDA (Shriberg et al., 2004) and SWDA (Godfrey et al., 1992; Jurafsky et al., 1997) for dialog act classification, ATIS (Tür et al., 2010) for intent prediction, and Amazon Reviews (Zhang et al., 2015) and Yahoo! Answers (Zhang et al., 2015) for long text classification. Table 1 shows the characteristics of each dataset.
Table 3: Comparison of projection based models vs BiLSTMs subject to various types and amounts of perturbations. BiLSTM-wp and BiLSTM-w refer to models with word-piece and word-only tokenization respectively.
Table 2 reports the average classifier accuracy drops when all the models are subject to all types of perturbations (swap, drop, and add) on multiple classification tasks (two short text and two long text). We see that the accuracy drop for the projection based models is significantly lower across all datasets. It is also worth noting that the standard deviations across the 5 runs are minimal for the projection based models, further showcasing the stability of projection representations. In another experiment, shown in Table 3, we subject different models to varying types and amounts of perturbations. Again, we see that the accuracy drop for the projection based models is the smallest across all datasets and amounts of perturbation. We also observe that the word-piece models are about as susceptible to character perturbations as the word-only models, which agrees with the findings of Pruthi et al. (2019).

Perturbation Analysis
Apart from the classification experiments, we also directly analyze the changes in the binary LSH projection representations by subjecting input text to different types and amounts of perturbations. To that end, we take a large corpus, enwik9 (vocabulary size of 500k and 129M words), and analyze the average Hamming distance between LSH projections of the words in the corpus. Next, we compute the average changes in the projection representations by subjecting the words to the character perturbations from Section 3. Table 4 reports the results, and we make the following observations from our experiments:
1. The average Hamming distance between LSH projections of words is ≈ $K/2$, where $K$ is the projection dimension. This implies that the words are more or less uniformly spread out from each other, indicating that there are no bias issues in the $\{0, 1\}^K$ representation space.
2. With $P_{perturb} = 0.2$, we observe that the LSH projection changes by only ≈ 11% of the average Hamming distance between the words in the corpus when subject to misspellings. For instance, if the average Hamming distance between LSH projections of words is 100 bits, misspellings change the projections by only 11 bits on average. Intuitively, this suggests that neural layers on top of the LSH projection rarely confuse a misspelled word for another valid word.
Also from Table 4, we find that the change in the LSH projection due to perturbations, $\Delta P(x)$, is directly proportional to the LSH projection dimension $K$ and the perturbation probability $P_{perturb}$, as in $\Delta P(x) \propto K \cdot P_{perturb}$.
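The two quantities compared in this analysis, the average pairwise Hamming distance and the change caused by a perturbation relative to it, can be sketched on generic bit vectors. The toy 8-bit vectors below are made up for illustration and are not projections of real words; the function names are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length bit vectors differ."""
    if len(a) != len(b):
        raise ValueError("bit vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def mean_pairwise_hamming(vectors):
    """Average Hamming distance over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


def relative_change(clean, perturbed, baseline):
    """Change caused by a perturbation, as a fraction of the baseline distance."""
    return hamming(clean, perturbed) / baseline


# Toy vectors with K = 8 bits; every pair differs in exactly K / 2 = 4 positions.
toy_vectors = [
    [0, 0, 0, 0, 1, 1, 1, 1],
    [1, 1, 0, 0, 0, 0, 1, 1],
    [1, 0, 1, 0, 1, 0, 1, 0],
]
baseline = mean_pairwise_hamming(toy_vectors)

# A perturbed copy of the first vector with a single flipped bit.
perturbed = [0, 0, 0, 1, 1, 1, 1, 1]
change = relative_change(toy_vectors[0], perturbed, baseline)
print(baseline, change)
```

On a real corpus the same computation would be run on the projections of sampled word pairs and of each word against its misspelled variant; enumerating all pairs of a 500k vocabulary would be impractical, so sampling is the more realistic choice.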

Conclusion
In this work, we perform a detailed study analyzing the robustness of recent LSH-based projection neural networks for memory-efficient text representations. Based on multiple text classification tasks and perturbation studies, we find projection-based neural models to be more robust to text transformations than BERT or BiLSTMs with embedding lookup tables for words and word-pieces.