ProFormer: Towards On-Device LSH Projection Based Transformers

At the heart of text-based neural models lie word representations, which are powerful but occupy a lot of memory, making them challenging to deploy on devices with memory constraints such as mobile phones, watches and IoT. To surmount these challenges, we introduce ProFormer – a projection-based transformer architecture that is faster and lighter, making it suitable for deployment on memory-constrained devices while preserving user privacy. We use an LSH projection layer to dynamically generate word representations on-the-fly without embedding lookup tables, leading to a significant memory footprint reduction from O(V · d) to O(T), where V is the vocabulary size, d is the embedding dimension and T is the dimension of the LSH projection representation. We also propose a local projection attention (LPA) layer, which uses self-attention to transform the input sequence of N LSH word projections into a sequence of N/K representations, reducing the computations quadratically by O(K²). We evaluate ProFormer on multiple text classification tasks and observe improvements over prior state-of-the-art on-device approaches for short text classification and comparable performance for long text classification tasks. ProFormer is also competitive with other popular but highly resource-intensive approaches like BERT and even outperforms small-sized BERT variants with significant resource savings: it reduces the embedding memory footprint from 92.16 MB to 1.7 KB and requires 16x less computation overhead, making it the fastest and smallest on-device model.


Introduction
Transformer (Vaswani et al., 2017) based architectures like BERT (Devlin et al., 2018), XLNet (Yang et al., 2019), GPT-2 (Radford et al., 2019), MT-DNN (Liu et al., 2019a) and RoBERTa (Liu et al., 2019b) have reached state-of-the-art performance on tasks like machine translation (Arivazhagan et al., 2019), language modelling (Radford et al., 2019) and text classification benchmarks like GLUE (Wang et al., 2018). However, these models require huge amounts of memory and computation, making them hard to deploy on small memory-constrained devices such as mobile phones, watches and IoT. Recently, there has been interest in making BERT lighter and faster (Sanh et al., 2019; McCarley, 2019). In parallel, recent on-device works like SGNN (Ravi and Kozareva, 2018), SGNN++ and (Sankar et al., 2019) produce lightweight models with extremely low memory footprint. They employ a modified form of LSH projection to dynamically generate a fixed binary projection representation P(x) ∈ [0, 1]^T for the input text x using word or character n-gram and skip-gram features, and a 2-layer MLP + softmax layer for classification. As shown in (Ravi and Kozareva, 2018), these models are suitable for short sentence lengths as they compute a T-bit LSH projection vector to represent the entire sentence. However, later work showed that such models cannot handle long text due to significant information loss in the projection operation.
On the other side, recurrent architectures represent long sentences well, but the sequential nature of their computation increases latency and makes them difficult to launch on-device. Recently, self-attention based architectures like BERT (Devlin et al., 2018) have demonstrated remarkable success in capturing long-term dependencies in the input text via purely attention mechanisms. BERT's model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation in (Vaswani et al., 2017). The self-attention scores can be computed in parallel as there are no recurrent mechanisms. But usually these architectures are very deep, and the amount of computation is quadratic, in the order of O(L · N²), where L is the number of layers (Transformer blocks) and N is the input sentence length. Straightforward solutions like reducing the number of layers are insufficient to launch transformers on-device due to the large memory and quadratic computation requirements.
In this paper, we introduce a projection-based neural architecture, ProFormer, that is designed to (a) be efficient and learn compact neural representations, (b) handle out-of-vocabulary words and misspellings, (c) drastically reduce the embedding memory footprint from hundreds of megabytes to a few kilobytes, and (d) reduce the computation overhead quadratically by introducing a local attention layer which reduces the intermediate sequence length by a constant factor K. We achieve this by bringing together the best of both worlds: LSH projection based representations (for low memory footprint) and self-attention based architectures (to model dependencies in long sentences). To tackle the computation overhead in transformer based models, we reduce the number of self-attention layers and additionally introduce an intermediate local projection attention (LPA) layer to quadratically reduce the number of self-attention operations. The main contributions of our paper are:
• We propose a novel on-device neural network called ProFormer, which combines LSH projection based text representations with a transformer architecture and a locally projected self-attention mechanism that captures long-range sentence dependencies while yielding low memory footprint and low computation overhead.
• ProFormer reduces the computation overhead O(L · N²) and latency in multiple ways: by reducing the number of layers L from twelve to two, and by introducing a new local projection attention layer that decreases the number of self-attention operations by a quadratic factor.
• ProFormer is a lightweight, compact on-device model, while an on-device BERT still needs a huge embedding table (92.16 MB for V = 30k, d = 768) with a number of computation flops in the order of O(L · N²), where L is the number of layers and N is the number of words in the input sentence.
• We conduct empirical evaluations and comparisons against state-of-the-art on-device and prior deep learning approaches for short and long text classification. Our model ProFormer reached state-of-the-art performance for short text and comparable performance for long texts, while maintaining a small memory footprint and low computation requirements.

LSH Projection Layer
It is a common practice to represent each word in the input sentence x = [w_1, w_2, · · · , w_N] as an embedding vector based on its one-hot representation. Instead, we adopt the LSH projection layer from (Ravi, 2017, 2019), which dynamically generates a T-bit representation P(w_i) ∈ [0, 1]^T for the input word w_i based on its morphological features like n-grams and skip-grams from the current and context words, parts-of-speech tags, etc.
Since the LSH projection based approach does not rely on embedding lookup tables to compute word representations, we obtain significant memory savings of the order O(V · d), where V is the vocabulary size and d is the embedding dimension. For instance, the embedding lookup table occupies 92.16 MB (V = 30k, d = 768 (Devlin et al., 2018)), while the LSH projection layer requires only ≈ 1.7 KB (T = 420), as shown in Table 1.
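To make the idea concrete, here is a minimal, hypothetical sketch of a table-free projection: each of the T bits is a signed vote over hashed character n-gram features of the word. This is only an illustration of the principle; the actual ProFormer layer uses the LSH projection functions of (Ravi, 2017, 2019) with richer features (skip-grams, context words), and the function name and parameters below are our own.

```python
import hashlib

def lsh_projection(word, T=64, ngram=3):
    """Toy LSH projection: map a word to a T-bit binary vector computed
    from its character n-grams, with no embedding lookup table."""
    n = max(1, len(word) - ngram + 1)
    feats = [word[i:i + ngram] for i in range(n)]
    bits = []
    for t in range(T):
        # each bit is a signed vote over the hashed features
        vote = 0
        for f in feats:
            h = hashlib.md5(f"{t}:{f}".encode()).digest()[0]
            vote += 1 if h & 1 else -1
        bits.append(1 if vote > 0 else 0)
    return bits
```

Because the bits are computed on-the-fly from the surface form, misspellings and out-of-vocabulary words still receive a representation, and the only memory cost is the T-dimensional output itself.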

[Table 1: models, embedding memory, and computations]

Local Projection Attention (LPA) Layer
The LPA layer shown in Figure 2 consists of a single multi-headed self-attention layer, similar to the Transformer architecture in (Vaswani et al., 2017), followed by a max-pooling layer that yields a compressed representation of K input words [w_1, w_2, · · · , w_K], where K is a group factor. We equally divide the N word-level LSH projection representations into N/K groups of size K. The LPA layer compresses each group of K word representations into LPA(P(w_1:K)) ∈ R^d, yielding N/K representations in total. The LPA layer reduces the self-attention computation overhead in the subsequent transformer layer (Vaswani et al., 2017) by O(K²).
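The grouping, local attention, and max-pooling steps can be sketched as follows. This is a simplified single-head version under our own assumptions (the paper uses multi-headed attention, and the weight matrices Wq, Wk, Wv here are placeholders), intended only to show the shapes involved.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def lpa_layer(P, K, Wq, Wk, Wv):
    """Local projection attention: split the N word projections into N/K
    groups, run self-attention inside each group, and max-pool each group
    down to a single d-dimensional vector."""
    N, d = P.shape
    assert N % K == 0, "sentence is padded so that K divides N"
    out = []
    for g in P.reshape(N // K, K, d):          # one (K, d) block per group
        q, k, v = g @ Wq, g @ Wk, g @ Wv
        att = softmax(q @ k.T / np.sqrt(d))    # (K, K): attention stays local
        out.append((att @ v).max(axis=0))      # max-pool over the K positions
    return np.stack(out)                       # (N/K, d)
```

Each attention matrix is only K × K, so the layer performs N/K · K² = N · K score computations instead of the N² required by full self-attention over the whole sentence.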

Transformer Layer
This layer consists of a 2-layer bidirectional Transformer encoder based on the original implementation described in (Vaswani et al., 2017). It transforms the N/K input representations from the LPA layer described in the previous sub-section into N/K output representations. In this layer, we reduce both the computation overhead and the memory footprint by reducing the number of layers from L to 2, cutting the computation overhead by a factor of L/2 (6 times in the case of the 12-layer BERT-base model).
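A back-of-the-envelope count makes the combined savings concrete. The sketch below counts only self-attention score computations (layers · sequence-length², ignoring constants and the feed-forward sublayers), and the sentence length N = 128 and group factor K = 4 are illustrative values of our own choosing, not figures from the paper.

```python
def attention_ops(layers, seq_len):
    # self-attention score computations scale as layers * seq_len^2
    return layers * seq_len ** 2

N, K = 128, 4                                # assumed sentence length and group factor
bert = attention_ops(12, N)                  # 12-layer BERT-base over N tokens
lpa = (N // K) * K ** 2                      # local attention inside N/K groups of K
proformer = lpa + attention_ops(2, N // K)   # LPA + 2-layer transformer over N/K vectors
print(bert, proformer, bert / proformer)
```

Under these assumptions the 12-layer full-attention stack needs 196,608 score computations versus 2,560 for the LPA-plus-2-layer pipeline, roughly a 77x reduction in attention operations for this particular N and K.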

Max-Pooling and Classification Layer
We summarize the N/K representations from the transformer layer to get a single d dimensional vector by max-pooling across the N/K time-steps, followed by a softmax layer to predict the output class Y .
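The pooling-and-classification step above can be sketched in a few lines. The classifier weights Wc and bc below are hypothetical placeholders for the learned softmax layer.

```python
import numpy as np

def classify(H, Wc, bc):
    """Max-pool the N/K transformer outputs across time, then apply a
    softmax layer to produce class probabilities."""
    pooled = H.max(axis=0)           # (d,): one vector for the whole sentence
    logits = pooled @ Wc + bc        # (num_classes,)
    e = np.exp(logits - logits.max())
    return e / e.sum()               # probabilities over output classes
```

Max-pooling over the N/K time-steps keeps the summary vector a fixed size d regardless of sentence length, so the classifier head adds only O(d · num_classes) parameters.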

Datasets & Experimental Setup
In this section, we describe our datasets and experimental setup. We use text classification datasets from state-of-the-art on-device evaluations: MRDA (Shriberg et al., 2004), ATIS (Tür et al., 2010), AG News (Zhang et al., 2015a) and Yahoo! Answers (Zhang et al., 2015a). Table 2 shows the characteristics of each dataset. We compare our model with previous state-of-the-art neural architectures, including on-device approaches. We also fine-tune the pretrained 12-layer BERT-base model (Devlin et al., 2018) on all classification tasks and compare it to our model. BERT-base consists of 12 layers of transformer blocks (Vaswani et al., 2017) and is pretrained in an unsupervised manner on a large corpus (BooksCorpus (Zhu et al., 2015) and English Wikipedia) using a masked language model objective. For training, we use Adam with a learning rate of 1e-4, β1 = 0.9, β2 = 0.999, L2 weight decay of 0.01, learning rate warmup over the first 10,000 steps, and linear decay of the learning rate. We use a dropout probability of 0.1 on all layers and a training batch size of 256. For further comparison, we also trained much smaller BERT baselines with 2 layers of transformer blocks and smaller input embedding sizes.

Results
Tables 3 and 4 show the results on the ATIS and MRDA short text classification tasks and the AG and Y!A long text classification tasks. We compare our approach, ProFormer, against prior state-of-the-art on-device works, fine-tuned BERT-base, smaller 2-layer BERT variants and other non-on-device neural approaches. Overall, ProFormer improved upon non-on-device neural models while keeping a very small memory footprint, meaning it can be directly deployed to memory-constrained devices like phones, watches and IoT while still maintaining high accuracy. ProFormer also improved upon prior on-device state-of-the-art neural approaches like SGNN (Ravi and Kozareva, 2018) and SGNN++, reaching over 35% improvement on long text classification. Similarly, it improved over the on-device ProSeqo models and reached comparable performance on MRDA. In addition to the quality improvements, ProFormer also keeps a smaller memory footprint than ProSeqo, SGNN and SGNN++.

Conclusion
We proposed a novel on-device neural network, ProFormer, which combines LSH projection based text representations with a transformer architecture and a locally projected self-attention mechanism that captures long-range sentence dependencies. Overall, ProFormer yields a low memory footprint and reduces computations quadratically. In a series of experimental evaluations on short and long text classification, we showed that ProFormer improved upon prior neural models and on-device work like SGNN (Ravi and Kozareva, 2018), SGNN++ and ProSeqo. ProFormer reached comparable performance to our BERT-base implementation while producing models orders of magnitude more compact than BERT-base, demonstrating both the effectiveness and the compactness of our neural model.