HashFormers: Towards Vocabulary-independent Pre-trained Transformers

Transformer-based pre-trained language models are vocabulary-dependent, mapping each token by default to its corresponding embedding. This one-to-one mapping results in embedding matrices that occupy a lot of memory (i.e. millions of parameters) and grow linearly with the size of the vocabulary. Previous work on on-device transformers dynamically generates token embeddings on-the-fly without embedding matrices, using locality-sensitive hashing over morphological information. These embeddings are subsequently fed into transformer layers for text classification. However, these methods are not pre-trained. Inspired by this line of work, we propose HashFormers, a new family of vocabulary-independent pre-trained transformers that support an unlimited vocabulary (i.e. all possible tokens in a corpus) given a substantially smaller fixed-size embedding matrix. We achieve this by first introducing computationally cheap hashing functions that bucket together individual tokens to embeddings. We also propose three variants that do not require an embedding matrix at all, further reducing the memory requirements. We empirically demonstrate that HashFormers are more memory efficient compared to standard pre-trained transformers, while achieving comparable predictive performance when fine-tuned on multiple text classification tasks. For example, our most efficient HashFormer variant incurs only a negligible performance degradation (0.4% on GLUE) while using only 99.1K parameters to represent the embeddings, compared to the 12.3-38M parameters of state-of-the-art models.


Introduction
The majority of transformer-based (Vaswani et al., 2017) pre-trained language models (PLMs; Devlin et al. 2019; Liu et al. 2019; Dai et al. 2019; Yang et al. 2019) are vocabulary-dependent, with each single token mapped to its corresponding vector in an embedding matrix. This one-to-one mapping makes it impractical to support out-of-vocabulary tokens such as misspellings or rare words (Pruthi et al., 2019; Sun et al., 2020). Moreover, the memory required for the token embedding matrix increases linearly with the vocabulary size (Chung et al., 2021). For example, given a token embedding size of 768, BERT-BASE with a vocabulary of 30.5K tokens needs 23.4M out of 110M total parameters, while ROBERTA-BASE with 50K tokens needs 38M out of 125M total parameters. Hence, disentangling the design of PLMs from the vocabulary size and tokenization approach would inherently improve memory efficiency and pre-training, especially for researchers with access to limited computing resources (Strubell et al., 2019; Schwartz et al., 2020).
Previous efforts for making transformer-based models vocabulary-independent include dynamically generating token embeddings on-the-fly without embedding matrices, using hash embeddings (Svenstrup et al., 2017; Ravi, 2019) over morphological information (Sankar et al., 2021a). However, these embeddings are subsequently fed into transformer layers trained from scratch for on-device text classification without any pre-training. Clark et al. (2022) proposed CANINE, a model that operates on Unicode characters using a low-collision multi-hashing strategy to support ~1.1M Unicode codepoints as well as an unbounded set of character four-grams. This makes CANINE independent of tokenization while limiting the parameters of its embedding layer to 12.3M. Xue et al. (2022) proposed models that take as input byte sequences representing characters, without explicit tokenization or a predefined vocabulary, to pre-train transformers in multilingual settings.
In this paper, we propose HASHFORMERS, a new family of vocabulary-independent PLMs. Our models support an unlimited vocabulary (i.e. all possible tokens in a given pre-training corpus) with a considerably smaller fixed-size embedding matrix. We achieve this by employing simple yet computationally efficient hashing functions that bucket together individual tokens to embeddings, inspired by the hash embedding methods of Svenstrup et al. (2017) and Sankar et al. (2021a). Our contributions are as follows: 1. To the best of our knowledge, this is the first attempt towards reducing the memory requirements of PLMs using various hash embeddings with different hashing strategies, aiming to substantially reduce the size of the embedding matrix relative to the vocabulary size; 2. Three HASHFORMER variants further reduce the memory footprint by entirely removing the need for an embedding matrix; 3. We empirically demonstrate that our HASHFORMERS are consistently more memory efficient compared to vocabulary-dependent PLMs, while achieving comparable predictive performance when fine-tuned on a battery of standard text classification tasks.
2 Related Work

Tokenization and Vocabulary-independent Transformers
Typically, PLMs are pre-trained on text that has been tokenized using subword tokenization techniques such as WordPiece (Wu et al., 2016), Byte-Pair-Encoding (BPE; Sennrich et al. 2016) and SentencePiece (Kudo and Richardson, 2018). Attempts to remove the dependency of PLMs on a separate tokenization component include models that directly operate on sequences of characters (Tay et al., 2022; El Boukkouri et al., 2020). However, these approaches do not remove the requirement of an embedding matrix. Recently, Xue et al. (2022) proposed PLMs that take as input byte sequences representing characters, without explicit tokenization or a predefined vocabulary, in multilingual settings. The PLMs of Clark et al. (2022), which operate on Unicode characters or n-grams, achieve a similar goal. These methods improve memory efficiency but still rely on a complex process to encode the resulting, considerably longer byte or Unicode character sequences, which affects their computational efficiency.
In a different direction, Sankar et al. (2021b) proposed PROFORMER, an on-device vocabulary-independent transformer-based model. It generates token hash embeddings (Svenstrup et al., 2017; Shi et al., 2009; Ganchev and Dredze, 2008) on-the-fly by applying locality-sensitive hashing over morphological features. Subsequently, the hash embeddings are fed to transformer layers for text classification. However, PROFORMER is trained from scratch using task-specific data without any pre-training.

Compressing PLM Embeddings
A different line of work has focused on compressing the embedding matrix in transformer models (Ganesh et al., 2021). Prakash et al. (2020) proposed using compositional code embeddings (Shu and Nakayama, 2018).

HashFormers
In this section, we present HASHFORMERS, a family of vocabulary-independent hashing-based pre-trained transformers.

Many-to-One Mapping from Tokens to an Embedding
Given a token t, HASHFORMERS use a hash function H to map t into a value v. Using hashing allows our model to map many tokens into a single embedding and support an infinite vocabulary. We obtain the embedding index by squashing the hash value v into an index i ∈ {1, ..., N}, where e = E_i is the corresponding embedding from a matrix E ∈ R^{N×d}, N is the number of embeddings and d is their dimensionality. We assume that |V| ≫ N, where |V| is the size of the vocabulary. Subsequently, e is passed through a series of transformer layers for pre-training. This is our first variant, HASHFORMER-Emb, which relies on a look-up embedding matrix (see Figure 1). Our method is independent of tokenization choices.
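The many-to-one mapping above can be sketched in a few lines. This is an illustrative toy, not the authors' code: `H` is a placeholder hash (HASHFORMERS instantiate it with MD5 or LSH, described next), and the sizes are arbitrary.

```python
import zlib

import numpy as np

N, d = 1000, 768                     # N embeddings of dimensionality d, |V| >> N
rng = np.random.default_rng(0)
E = rng.standard_normal((N, d))      # fixed-size embedding matrix E in R^{N x d}

def H(token: str) -> int:
    """Placeholder hash function; the paper uses MD5 or LSH instead."""
    return zlib.crc32(token.encode("utf-8"))

def embed(token: str) -> np.ndarray:
    i = H(token) % N                 # squash hash value v into an index in [0, N)
    return E[i]                      # many tokens share one row: many-to-one

# Any token, even one never seen during pre-training, gets an embedding:
vec = embed("misspeling")
assert vec.shape == (d,)
```

Because the index is computed from the token itself, no vocabulary file is needed and out-of-vocabulary tokens are handled for free.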

Message-Direct Hashing (HashFormers-MD)
Our first approach hashes tokens using a Message-Digest (MD5) hash function (Rivest and Dusse, 1992) to map each token to its 128-bit output, v = H(t). The mapping can be reproduced given the same secret key. MD5 is a 'random' hashing approach, mostly returning different hashes for tokens with morphological or semantic similarities (e.g. 'play' and 'played'). It is simple and does not require any pre-processing to obtain the bit encoding for each token. To map the hash output v into its corresponding embedding, we transform its binary value into decimal and then compute the index i into E as i = v mod N.
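A minimal sketch of this MD-style hashing follows. The paper does not specify how the secret key enters the hash, so keyed HMAC-MD5 is our assumption here; the key name and N are hypothetical.

```python
import hashlib
import hmac

N = 1000                              # number of shared embeddings, |V| >> N
SECRET_KEY = b"fixed-secret"          # hypothetical key; makes the map reproducible

def md_index(token: str) -> int:
    digest = hmac.new(SECRET_KEY, token.encode("utf-8"), hashlib.md5).digest()
    v = int.from_bytes(digest, "big")  # the 128-bit hash value v as an integer
    return v % N                       # embedding index i = v mod N

# 'Random' hashing: morphologically similar tokens usually land in different rows.
print(md_index("play"), md_index("plays"), md_index("played"))
```

Note how the digest is treated purely as a large integer; no vocabulary or feature extraction is involved.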

Locality-Sensitive Hashing (HASHFORMERS-LSH)
Locality-sensitive hashing (LSH) hashes similar tokens into the same indices with high probability (Rajaraman and Ullman, 2011). HASHFORMER-LSH uses LSH to assign tokens with similar morphology (e.g. 'play', 'plays', 'played') to the same hash encoding. This requires an additional feature extraction step for token representation.
Token to Morphological Feature Vector: We want to represent each token with a vector x as a bag of morphological (i.e. character n-gram) features. For each token, we first extract character n-grams (n ∈ {1, 2, 3, 4}) to get a feature vector whose dimension is equal to the vocabulary size. Each element in the feature vector is weighted by the frequency of the corresponding character n-gram in the token.
Morphological Vector to Hash Index: Once we obtain the morphological feature vector of each token, we first define N random hyperplanes, each represented by a random unit vector r_i ∈ R^{d_x}, where d_x is the dimensionality of the morphological feature vector. Following a similar approach to Kitaev et al. (2020), we compute the hash value v as the index of the nearest random hyperplane vector to the token's feature vector x, obtained by computing v = H(x) = argmax(xR), where R = [r_1; ...; r_N] and [α; β] denotes the concatenation of two vectors. This approach results in bucketing together tokens with similar morphological vectors. Similar to HASHFORMER-MD-Emb, we compute the embedding index as i = v mod N.
To prevent storing a large projection matrix (R^{d_x×N}) to accommodate each unit vector, we design an on-the-fly computational approach. We only store a vector η ∈ R^{d_x} that is randomly initialized from the standard normal distribution, guaranteeing that each column r_i of the matrix R is a permutation of η with a unique offset value (e.g. r_1 = [η_2, ..., η_{d_x}, η_1]). Each offset value only relies on the index of the hyperplane. This setting ensures that each hyperplane has the same L2-norm.
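The argmax hashing with the permutation trick can be sketched as follows. This is a toy under our own assumptions: the offset is realized with `np.roll`, and the exact offset scheme and sizes are illustrative, not the paper's.

```python
import numpy as np

d_x, N = 64, 16                      # feature dimensionality, number of hyperplanes
rng = np.random.default_rng(0)
eta = rng.standard_normal(d_x)       # the only vector we ever store

def lsh_index(x: np.ndarray) -> int:
    # Hyperplane i is eta permuted by offset i; its score is x . r_i.
    # The hash value is v = H(x) = argmax(xR), computed without storing R.
    scores = [float(x @ np.roll(eta, i)) for i in range(N)]
    return int(np.argmax(scores))

x = rng.random(d_x)                  # a token's morphological feature vector
v = lsh_index(x)
assert 0 <= v < N                    # the hash value doubles as an embedding index
```

Tokens whose morphological feature vectors point in similar directions score highest against the same permuted hyperplane, so they tend to share a bucket.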

Compressing the Embedding Space
We also propose three embedding compression approaches that require an even smaller number of parameters to represent token embeddings and support unlimited tokens (i.e. very large |V|), without forcing a large number of tokens to share the same embedding. For this purpose, we first use a hash function H to map each token t into a T-bit value b, b ∈ [0, 2^T). Then, we pass b through a transformation procedure to generate the corresponding embedding (to facilitate computation, we cast b into a T-bit vector τ). This way, tokens with different values b will be assigned to different embeddings while keeping the number of parameters relatively small. Figure 2 shows an overview of this method.
Pooling Approach (Pool): Inspired by Svenstrup et al. (2017) and Prakash et al. (2020), we first create a universal learnable codebook, a matrix denoted as B ∈ R^{2^k×d}. Then, we split the hash bit vector τ into ⌈T/k⌉ successive non-overlapping chunks of k bits each, and cast each chunk into an integer value representing a codeword. Hence, each token is represented by a vector c with elements c_j ∈ [0, 2^k). For example, given k = 4 and a 12-bit vector [1,0,1,0,0,1,0,0,0,0,0,1], the 4-bit parts are treated as separate binary codewords [1010, 0100, 0001] and then transformed into their decimal format [10, 4, 1]. We construct the embedding e ∈ R^d for each token by looking up its ⌈T/k⌉ codewords in the codebook and extracting the corresponding ⌈T/k⌉ vectors. We then apply a weighted average pooling over them using a softmax function, where W ∈ R^{⌈T/k⌉×d} is a learnable weight matrix, learned jointly with the codebook B. The total number of parameters required for this pooling transformation is 2^k × d + ⌈T/k⌉ × d. This can be much smaller than the N × d parameters required for standard PLMs that use a one-to-one mapping between tokens and embeddings, where N ≫ 2^k + ⌈T/k⌉.

Additive Approach (Add): Different to the Pool method that uses a universal codebook, we create T different codebooks {B_1, B_2, ..., B_T}, each containing two learnable embedding vectors corresponding to codewords 0 and 1 respectively. We get a T-bit vector τ ∈ {0, 1}^T for each token, where each element of τ is treated as a codeword. We look up each codeword in its corresponding codebook to obtain T vectors and add them up to compute the token embedding e = γ Σ_{j=1}^{T} B_j[τ_j], where B_j ∈ R^{2×d}, j = 1, ..., T, and γ is a scaling factor. Hence, the total number of parameters the additive transformation approach requires is 2 × T × d. Similar to the Pool approach, the number of parameters required is far smaller than the vocabulary size, since N ≫ 2T.

Projection Approach (Proj): Finally, we propose a new, simpler approach compared to Pool and Add. We create learnable, randomly initialized vectors as pseudo-axes to trace the orientation of the T-bit vector τ corresponding to token t. Given a token bit vector τ, the j-th element of the embedding e is computed as the Pearson correlation coefficient (PCC) between τ and a learnable vector w_j ∈ R^T: e_j = PCC(τ, w_j), j = 1, ..., d. Hence, the total number of parameters the projection transformation approach requires is only T × d. Figure 2 depicts an overview of our HASHFORMER-Proj model.
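The three transformations above can be sketched compactly in NumPy. This is a hedged toy: shapes follow the text, but the initialization, the exact softmax weighting in Pool, and the form of the scaling factor γ in Add are our assumptions. The bit vector reuses the paper's worked example, so the Pool codewords should come out as [10, 4, 1].

```python
import numpy as np

T, k, d = 12, 4, 8                            # toy sizes; the paper uses T=128, d=768
rng = np.random.default_rng(0)

# The paper's example bit vector: codewords [1010, 0100, 0001] -> [10, 4, 1]
tau = np.array([1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1])

# --- Pool: one universal codebook B in R^{2^k x d} ------------------------
B = rng.standard_normal((2 ** k, d))          # learnable codebook
W = rng.standard_normal((T // k, d))          # learnable pooling weights
codewords = [int("".join(map(str, tau[i:i + k])), 2) for i in range(0, T, k)]
a = np.exp(W) / np.exp(W).sum(axis=0)         # softmax over codeword positions
e_pool = (a * B[codewords]).sum(axis=0)       # weighted average pooling

# --- Add: T two-row codebooks, one per bit --------------------------------
Bs = rng.standard_normal((T, 2, d))           # each B_j is in R^{2 x d}
gamma = 1.0 / np.sqrt(T)                      # scaling factor (assumed form)
e_add = gamma * sum(Bs[j, tau[j]] for j in range(T))

# --- Proj: each embedding element is a correlation with a learnable axis --
Wp = rng.standard_normal((d, T))              # d learnable vectors w_j in R^T
e_proj = np.array([np.corrcoef(tau, Wp[j])[0, 1] for j in range(d)])

assert codewords == [10, 4, 1]                # matches the worked example above
assert e_pool.shape == e_add.shape == e_proj.shape == (d,)
```

Note the parameter counts: Pool stores 2^k·d + ⌈T/k⌉·d values, Add stores 2·T·d, and Proj only T·d, all independent of the vocabulary size |V|.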

Hashing for Compressed Embeddings
Similar to the embedding-based HASHFORMERS-Emb, our embedding compression-based models also consider the same two hashing approaches (MD and LSH) for generating the T-bit vector of each token.

MD5:
We directly map each token to its 128-bit output b with a universal secret key.

LSH:
We repeat the same morphological feature extraction step to obtain a feature vector x corresponding to each token t. However, rather than using 2^T random hyperplanes, which would require storing a projection matrix with 2^T columns, we simply use T random hyperplanes, similar to Ravi (2019) and Sankar et al. (2021b). Each bit of b represents which side of the corresponding hyperplane r_j ∈ R^{d_x} the feature vector x is located on: b_j = sgn(sgn(x · r_j) + 1), j = 1, ..., T. This allows an on-the-fly computation without storing any vector (Ravi, 2019).
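The sign-based bit extraction can be sketched as follows. The hyperplanes are drawn from a seeded generator here (so, as in the text, nothing needs to be stored beyond the seed); sizes are illustrative.

```python
import numpy as np

T, d_x = 16, 64                      # bits per token, feature dimensionality
rng = np.random.default_rng(0)
R = rng.standard_normal((T, d_x))    # T hyperplanes, regenerable from the seed

def lsh_bits(x: np.ndarray) -> np.ndarray:
    # b_j = sgn(sgn(x . r_j) + 1): 1 if x lies on the (non-negative) side
    # of hyperplane r_j, 0 otherwise.
    return np.array([int(np.sign(np.sign(x @ R[j]) + 1)) for j in range(T)])

x = rng.random(d_x)                  # a token's morphological feature vector
b = lsh_bits(x)
assert b.shape == (T,) and set(b.tolist()) <= {0, 1}
```

The nested sign maps the raw dot-product sign {-1, 0, +1} into a clean {0, 1} bit, which is exactly what the formula above encodes.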

Pre-training Objective
Since our models support an arbitrary number of unique tokens, it is intractable to use a standard Masked Language Modeling (MLM; Devlin et al., 2019) pre-training objective, which requires computing a softmax over the entire vocabulary. We instead opt for the SHUFFLE + RANDOM objective (cf. the BERT-S+R baseline), which does not require predicting tokens over the vocabulary.

Implementation Details
Model Architecture Following the architecture of BERT-base, we use 12 transformer layers, an embedding size of 768 and a maximum sequence length of 512. For HASHFORMERS-LSH, we set T = 128 to make it comparable to HASHFORMERS-MD, as MD5 produces a 128-bit hash value. For HASHFORMER-MD-Pool and HASHFORMER-LSH-Pool, we choose k = 10 to keep the total number of embedding parameters relatively small. We also experiment with two sizes of the embedding matrix of HASHFORMERS-Emb for MD and LSH hashing. The first uses an embedding matrix of 50K vectors, matching the number of embedding parameters of BERT-base, while the second uses 1K, which is closer to the size of the smaller Pool, Add and Proj models.
Hyperparameters Hyperparameter selection details are in Appendix A.

Pre-training
We pre-train all HASHFORMERS, BERT-MLM and BERT-S+R on the English Wikipedia and BookCorpus (Zhu et al., 2015) from HuggingFace (Lhoest et al., 2021) for up to 500K steps with a batch size of 128. For our HASHFORMER models, we use white space tokenization, resulting in a vocabulary of 11,890,081 unique tokens. For BERT-MLM and BERT-S+R, we use a 50,000 BPE vocabulary (Liu et al., 2019).
Hardware For pre-training, we use eight NVIDIA Tesla V100 GPUs. For fine-tuning on downstream tasks, we use one NVIDIA Tesla V100 GPU.

Predictive Performance Evaluation
We evaluate all models on the GLUE benchmark (Wang et al., 2018). We report matched accuracy for MNLI, Matthews correlation for CoLA, Spearman correlation for STS, F1 score for QQP and accuracy for all other tasks.

Efficiency Evaluation
Furthermore, we use the following metrics to measure and compare the memory and computational efficiency of HASHFORMERS and the baselines.

Memory Efficiency Metrics
We define three memory efficiency metrics, together with a performance retention metric used as a point of reference: • Performance Retention Ratio (PRR): the ratio between the predictive performance of our target model and that of a baseline model. A higher PRR indicates better performance retention.
• Parameters Compression Ratio (All): the relative reduction in the total number of parameters of our target model compared to the baseline, measuring the memory efficiency of the target model. A higher PCR All score indicates better memory efficiency for the entire model.
• Parameters Compression Ratio (Emb): the relative reduction in the number of parameters required by a target model for representing embeddings compared to the baseline. A higher PCR Emb score indicates better memory efficiency for the embedding representation.
• Proportion of Embedding Parameters (PoEP): the proportion of embedding parameters out of the total parameters of each model, showing the memory footprint of the embedding space in each model.
Ideally, we expect a smaller PoEP, indicating that the embedding parameters occupy as little as possible of a model's total parameter budget. For details of the parameter count calculations, please see Appendix B.
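The metrics above are simple ratios and can be made concrete with a short snippet. The numbers used are made up, not the paper's measurements, and PCR is interpreted as a relative parameter reduction so that higher means better compression (consistent with the ~30% PCR All and 97-99% PCR Emb figures reported later).

```python
def prr(target_score: float, baseline_score: float) -> float:
    """Performance Retention Ratio (%): target performance vs. baseline."""
    return 100.0 * target_score / baseline_score

def pcr(target_params: int, baseline_params: int) -> float:
    """Parameters Compression Ratio (%): relative reduction vs. baseline."""
    return 100.0 * (1.0 - target_params / baseline_params)

def poep(emb_params: int, total_params: int) -> float:
    """Proportion of Embedding Parameters (%)."""
    return 100.0 * emb_params / total_params

# e.g. a model scoring 79.1 against a 79.5 baseline retains ~99.5% performance:
print(round(prr(79.1, 79.5), 1))      # -> 99.5
```

With these definitions, compressing an embedding table from 38M to 99.1K parameters yields a PCR Emb above 99%.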

Computational Efficiency Metrics
We also measure the computational efficiency of pre-training (PT) and inference (Infer). Each pre-training step is defined as a forward pass and a backward pass; inference is defined as a single forward pass.
Both HASHFORMERS-Emb models (MD and LSH) are comparable to the two BERT variants and CANINE-C on average GLUE score (79.9 and 76.0 vs. 79.5, 79.6 and 70.4 respectively). Surprisingly, the more sophisticated HASHFORMER-LSH-Emb, which takes the morphological similarity of tokens into account, does not outperform HASHFORMER-MD-Emb, which uses random hashing. We believe that HASHFORMER-MD generally outperforms HASHFORMER-LSH mainly due to its ability to map morphologically similar tokens to different vectors, which allows it to distinguish tenses and inflections. On the other hand, HASHFORMER-LSH conflates words with high morphological similarity (e.g. 'play', 'played') because it assigns them to the same embedding. However, LSH contributes to performance improvements in the smaller HASHFORMERS with compressed embedding spaces compared to their MD variants, i.e. Pool (78.9 vs. 75.3), Add (78.9 vs. 75.7) and Proj (79.1 vs. 75.7). The best performing compressed model, HASHFORMER-LSH-Proj, obtains a 79.1 average GLUE score, only 0.4 lower than the BERT baselines. Reducing the number of embedding vectors in the Emb (1K) models is detrimental to performance, leading to drastic drops of between 11.8% and 13.1%. This indicates that the model size plays a more important role than the choice of tokenization approach (i.e. white space or BPE) or the vocabulary size (i.e. 12M vs. 50K). At the same time, compared to Emb (1K), the Pool, Add and Proj approaches suffer far less predictive accuracy degradation, i.e. only 0.4-4.2%.
All our HASHFORMERS show clear advantages compared to the LSH-based PROFORMER, which is not pre-trained, across the majority of tasks (i.e. MNLI, QNLI, QQP, MRPC, CoLA and STS). However, PROFORMER's results on the relatively simpler sentiment analysis task (SST) suggest that pre-training might not always be necessary.

Memory Efficiency Comparison
Table 2 shows the results on memory efficiency and performance retention (%) on GLUE using BERT-MLM as the baseline. Notably, the Pool, Add and Proj models substantially compress the total number of embedding parameters compared to Emb, as well as to CANINE-C and the BERT variants: approximately a 30% PCR All and a 97-99% PCR Emb compared to BERT. These models also achieve very high performance retention (from 94.7% to 99.5%), which highlights their efficiency. In one case, HASHFORMER-LSH-Add even outperforms the BERT-MLM baseline on CoLA with a retention ratio of 105%. HASHFORMERS-Emb have an embedding matrix of equal size to BERT's (i.e. 50K embeddings). However, BERT only supports a vocabulary of 50K tokens, while HASHFORMERS-Emb support an unlimited vocabulary, e.g. the 12M unique tokens in our pre-training corpora. Using a smaller embedding matrix (i.e. 1K), performance retention drops by 20%~26%. Despite the fact that HASHFORMERS-Emb (1K) has a similar number of embedding parameters to the embedding compression approaches (i.e. Pool, Add, Proj), it falls far behind those models, by between 8.5% and 14.3% for both the MD and LSH variants. This demonstrates the effectiveness of our proposed embedding compression approaches.
Although the more lightweight PROFORMER, with only two transformer layers, consists of 15.1M parameters in total (approximately an 87.9% PCR All), its performance falls far behind even our worst model, HASHFORMER-MD-Pool, with a difference of 29.5% PRR on the GLUE average score. Moreover, PROFORMER requires more bits for hashing the tokens, resulting in more parameters for representing token embeddings (322.6K) compared to HASHFORMERS-Add and HASHFORMERS-Proj. Such memory efficiency gains substantially sacrifice the model's predictive performance.

Computational Efficiency Comparison
Table 3 shows the pre-training (PT) and inference (Infer) time per sample for HASHFORMERS, CANINE-C and BERT-S+R, using BERT-MLM as a baseline for reference. We note that HASHFORMERS have comparable pre-training time (PT) to the fastest BERT model (BERT-S+R). This highlights that the complexity of the pre-training objective matters more than the size of the embedding matrix for the computational efficiency of pre-training.
During inference, we observe speed-ups of up to 2.6x for HASHFORMERS compared to both BERT models. However, this is largely due to the tokenization approach: HASHFORMERS operate on the word level, so input sequences are shorter, leading to faster inference. Finally, we observe that CANINE-C has a slower inference time than both BERT models and HASHFORMERS. This is likely due to its relatively more complex approach for processing long Unicode character input sequences.

Conclusions
We have proposed HASHFORMERS, a family of vocabulary-independent hashing-based pre-trained transformers. We have empirically demonstrated that our models are computationally cheaper and more memory efficient compared to standard pre-trained transformers, requiring only a fraction of their parameters to represent token embeddings. Our HASHFORMER-LSH-Proj variant needs only 99.1K parameters for representing the embeddings, compared to the millions of parameters required by state-of-the-art models, with only a negligible performance degradation. For future work, we plan to explore multilingual pre-training with HASHFORMERS and to study their ability to encode linguistic properties (Alajrami and Aletras, 2022).

Limitations
We experiment only using English data to make comparisons with previous work easier. For languages without explicit white spaces (e.g. Chinese and Japanese), our methods can be applied with different tokenization techniques, e.g. using a fixed-length window of characters.
Figure 3 shows an overview of the Pool process.

Table 1 :
Results on the GLUE dev sets with standard deviations over three runs in parentheses. Bold values denote the best performing method in each task.
Table 1 presents results on GLUE for our HASHFORMERS models and all baselines. We first observe that the performance of both our HASHFORMERS-Emb models (MD and LSH) is comparable to the two BERT variants (MLM and S+R).

Table 2 :
Memory efficiency metrics and performance retention (%) on GLUE for HASHFORMER models, CANINE-C and ProFormer using BERT-MLM as a baseline.

Table 3 :
Results on pre-training speed and inference speed under different embedding compression strategies.We use BERT-MLM as the baseline model.The sequence length is fixed to 512 for pre-training.For inference, sequence length is equal to the length of the longest sequence in the batch.