Multi-word Tokenization for Sequence Compression

Large Language Models have proven highly successful at modelling a variety of tasks. However, this comes at a steep computational cost that hinders wider industrial uptake. In this paper, we present MWT: a Multi-Word Tokenizer that goes beyond word boundaries by representing frequent multi-word expressions as single tokens. MWTs produce a more compact and efficient tokenization that yields two benefits: (1) Increase in performance due to a greater coverage of input data given a fixed sequence length budget; (2) Faster and lighter inference due to the ability to reduce the sequence length with negligible drops in performance. Our results show that MWT is more robust across shorter sequence lengths, thus allowing for major speedups via early sequence truncation.


Introduction
The field of Natural Language Processing (NLP) has seen major breakthroughs with the advent of Large Language Models (LLMs) (Vaswani et al., 2017; Devlin et al., 2018; Touvron et al., 2023; OpenAI, 2023). Despite their successes, LLMs like ChatGPT (OpenAI, 2023; Brown et al., 2020) possess hundreds of billions of parameters that entail enormous computational cost by design. Traditional model compression methods such as Knowledge Distillation (Hinton et al., 2015), Pruning (Michel et al., 2019; Zhu and Gupta, 2017), and Quantization (Shen et al., 2020; Gupta et al., 2015) have focused on creating lighter models either by shrinking the architectural size or by reducing the number of FLOPs.
Recently, LLMs have been shown to produce impressive performance on inputs that have been carefully designed to contain all the necessary information for a given instruction. As such, there is an increasing trend in designing longer and longer prompts that has led to a significant rise in computational cost. To address this, interest has grown in compressing the input sequences from the tokenizer (Gee et al., 2022; Mu et al., 2023; Petrov et al., 2023). Indeed, various works have shown the importance of tokenization in determining the length of a sequence in specialized domains (Gee et al., 2022) or on underrepresented languages (Petrov et al., 2023).
In this paper, we propose a method for reducing the computational cost of an LLM by compressing the textual inputs using Multi-Word Tokenizers (MWTs). To achieve this, we enrich the vocabulary of the tokenizer with statistically determined multi-word expressions. By encoding the frequent n-grams with single tokens, the sequences produced are both shorter and more informative, thus allowing for major speedups via early sequence truncation. Additionally, MWTs are shown to be compatible with the aforementioned traditional compression methods. Experimentally, we assess MWTs on three text classification datasets. We show how our approach still performs well when combined with distilled models (Sanh et al., 2019) and other sequence compression techniques (Gee et al., 2022). The code for our paper is publicly available.
The rest of the paper is organized as follows. First, we review the related works in Section 2. Then, we describe our approach in Section 3 and present the experiments in Section 4. Finally, we draw our conclusions in Section 5.

Related Works
Most model compression research falls into one of the following categories: Knowledge Distillation (Hinton et al., 2015; Sanh et al., 2019; Jiao et al., 2020; Wang et al., 2020; Sun et al., 2020), Pruning (Zhu and Gupta, 2017; Michel et al., 2019), and Quantization (Shen et al., 2020). The family of approaches is somewhat complementary and can be applied individually or jointly. Each approach alters the model's size to obtain a more efficient architecture. Differently, other works such as FlashAttention (Dao et al., 2022) seek to optimize a model's implementation. In particular, LLMs are sped up by reducing the number of memory accesses for the self-attention mechanism.
Figure 1: Tokenization using generic $T_{gen}$ and adapted $T_{100}$ tokenizers. $T_{gen}^{1000}$ and $T_{100}^{1000}$ are extended with the top-1000 bigrams. Tokens obtained with domain-adaptation or MWT are highlighted in orange and blue respectively. MWTs are shown to be highly complementary to existing tokenizers for sequence compression. Example input: "an energizable member is operably coupled to the outer sleeve ."
Sequence Compression. An emerging direction for reducing the cost of LLMs involves designing shorter input sequences. Prompting techniques such as Mu et al. (2023) compress repetitive lengthy prompts into gist tokens. Other works emphasize the role of tokenization in sequence compression. In Petrov et al. (2023), the authors show how the tokenizers of most LLMs strongly favor the English language over other languages. For underrepresented languages, the same translated sentence may consist of inputs that are up to 15 times longer. Analogously, Gee et al. (2022) investigated the tokenization efficiency of general-purpose tokenizers in vertical domains such as medicine and law. They proposed a transfer learning technique that adapts the vocabulary of an LLM to specific language domains. An effect of a dedicated vocabulary is a more efficient tokenization that reduces the number of sub-word tokens in a sequence.
In this work, we push this effect further, going beyond word boundaries by introducing Multi-Word Expressions (MWEs) in the form of n-grams into the tokenizer as shown in Figure 1. The underlying intuition is that a more compact tokenization can save computations by allowing the model to process shorter sequences without a significant loss of information. The usage of MWEs is not novel, with several works (Lample et al., 2018; Otani et al., 2020; Kumar and Thawani, 2022) introducing phrases or n-grams to improve the quality of machine translation. In Kumar and Thawani (2022), the authors generalized BPE (Sennrich et al., 2016) to multi-word tokens. However, to the best of our knowledge, we are the first to investigate MWEs in the context of sequence compression.

Multi-word Tokenizer
Tokenization is a necessary step in feeding textual data to an LLM. Typically, tokenizers split a text into a sequence of symbols which can be entire words or only subparts. To do this, a vocabulary is first constructed by statistically learning the most frequent tokens from a large general-purpose corpus (Sennrich et al., 2016; Schuster and Nakajima, 2012; Kudo and Richardson, 2018). The resulting tokenizer can then be used to segment an input text by greedily looking for the solution with the fewest tokens. Building upon this, we inject into the tokenizer new symbols formed by n-grams of words. We do this by first selecting the most frequent n-grams to include in its vocabulary. Then, we place an n-gram merging step within the tokenization pipeline as sketched in Figure 2. The added n-grams will be treated as single tokens further down the tokenization pipeline.
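As a rough illustration of the merging step, the sketch below greedily joins known bigrams from left to right over a pre-tokenized word list. It is a minimal sketch under stated assumptions, not the released implementation: the function name `merge_bigrams`, the `_` separator, and the two example bigrams are hypothetical, and whitespace splitting stands in for the tokenizer's pre-tokenization.

```python
from typing import List, Set, Tuple


def merge_bigrams(
    words: List[str], bigrams: Set[Tuple[str, str]], sep: str = "_"
) -> List[str]:
    """Greedily merge known bigrams, scanning the sequence from left to right."""
    merged: List[str] = []
    i = 0
    while i < len(words):
        # If the current word and its successor form a known bigram, emit one symbol.
        if i + 1 < len(words) and (words[i], words[i + 1]) in bigrams:
            merged.append(words[i] + sep + words[i + 1])
            i += 2
        else:
            merged.append(words[i])
            i += 1
    return merged


# Whitespace splitting is an assumption standing in for real pre-tokenization.
words = "an energizable member is operably coupled to the outer sleeve .".split()
# Hypothetical bigrams; in practice these come from the n-gram selection step.
vocab_bigrams = {("coupled", "to"), ("the", "outer")}

merged = merge_bigrams(words, vocab_bigrams)
print(len(words), len(merged))
print(merged)
```

Because the scan is left to right, overlapping candidates are resolved in favor of the earlier bigram: with both ("a", "b") and ("b", "c") in the vocabulary, the words a b c become a_b followed by c.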
N-gram Selection. In order to maximize the sequence reduction, we statistically estimate the top-K most frequent n-grams in a reference training corpus. Although the approach is greedy, hence sub-optimal, it still effectively yields significant compression while being extremely fast and easy to compute. More formally, given a corpus $D$ and $N \geq 2$, we compute all the possible n-grams $g_n \in D$, where $n = 2, \ldots, N$. Then, we count their frequency $f(g_n)$, $\forall g_n \in D$. The $K$ most frequent n-grams $G_K$ are included in the vocabulary $V \leftarrow V \cup G_K$ of the tokenizer $T$.
Fast Vocabulary Transfer. Given that the vocabulary of the tokenizer has changed, the newly added symbols $G_K$ must be included into the embedding matrix of the language model as well. To avoid retraining the entire model from scratch, which is highly resource-demanding, or a random initialization of new tokens, which would perform poorly, we make use of Fast Vocabulary Transfer (FVT) instead (Gee et al., 2022).
FVT is a transfer learning technique that assigns embeddings to new tokens by combining existing elements of the embedding matrix as shown in Figure 3. After initializing the multi-word embeddings with FVT, we found it beneficial to tune the model with Masked-Language Modeling (MLM) as done by Gee et al. (2022). We believe this is helpful as it aids the model in further readjusting the embeddings of the new tokens.
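To make the idea of combining existing embeddings concrete, the sketch below initializes a multi-word embedding as the element-wise mean of the embeddings of the tokens it decomposes into under the original tokenizer. Averaging is an assumption made here for illustration, since the text above only states that existing elements are combined; the function name, the toy two-dimensional vectors, and the decomposition are hypothetical.

```python
from typing import Dict, List


def init_multiword_embedding(
    sub_tokens: List[str], embeddings: Dict[str, List[float]]
) -> List[float]:
    """Return the element-wise mean of the embeddings of the given sub-tokens.

    Averaging is an assumed combination rule, used only for illustration.
    """
    dim = len(embeddings[sub_tokens[0]])
    total = [0.0] * dim
    for token in sub_tokens:
        for j, value in enumerate(embeddings[token]):
            total[j] += value
    return [value / len(sub_tokens) for value in total]


# Toy embedding matrix with hypothetical values.
embeddings = {"coupled": [1.0, 3.0], "to": [3.0, 5.0]}

# The new symbol "coupled_to" decomposes into existing tokens "coupled" and "to".
embeddings["coupled_to"] = init_multiword_embedding(["coupled", "to"], embeddings)
print(embeddings["coupled_to"])
```

An initialization of this kind only gives the new token a reasonable starting point, which is consistent with the subsequent MLM tuning step described above.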

Experiments
Given a fixed number of tokens, a more compact input sequence preserves a greater amount of information. This can be used to either achieve a better performance with limited benefits in speedup, or vice versa, i.e. making the model faster with negligible drops in performance. The experiments aim to analyze how these two aspects interact with one another. We focus on text classification as it is a problem of particular interest for many industry-oriented applications.
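The coverage argument can be illustrated with a small sketch that counts how many original words survive truncation to a fixed token budget, with and without merged bigrams. This is a simplified sketch: words are treated as tokens (real tokenizers also split words into sub-words), and the `_` separator, the helper name `words_covered`, and the merged sequence are hypothetical.

```python
from typing import List


def words_covered(tokens: List[str], max_len: int, sep: str = "_") -> int:
    """Count the original words represented by the first max_len tokens."""
    return sum(len(token.split(sep)) for token in tokens[:max_len])


# The same sentence with and without hypothetical multi-word tokens.
plain = ["an", "energizable", "member", "is", "operably",
         "coupled", "to", "the", "outer", "sleeve", "."]
merged = ["an", "energizable", "member", "is", "operably",
          "coupled_to", "the_outer", "sleeve", "."]

budget = 8  # fixed sequence length budget, in tokens
plain_coverage = words_covered(plain, budget)
merged_coverage = words_covered(merged, budget)
print(plain_coverage, merged_coverage)
```

With a budget of 8 tokens, the plain sequence retains 8 of the 11 words, while the merged sequence retains 10, which is the effect that truncation experiments exploit.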

Experimental Setup
Our experiments were conducted on the cased versions of BERT base (Devlin et al., 2018) and DistilBERT base (Sanh et al., 2019). Additionally, we consider an adapted tokenizer with a vocabulary size equal to that of the generic tokenizer from a pre-trained model as done by Gee et al. (2022). We refer to the generic and adapted tokenizers as $T_{gen}$ and $T_{100}$ respectively. Both tokenizers are extended with the top-K n-grams, with K set to 1000, 2500, and 5000. Overall, we compare eight different tokenizers indicated as: $T_{gen}$, $T_{gen}^{1000}$, $T_{gen}^{2500}$, $T_{gen}^{5000}$ and $T_{100}$, $T_{100}^{1000}$, $T_{100}^{2500}$, $T_{100}^{5000}$.
Implementation Details. We train each model with 5 different random initializations. The macro-F1 and inference speedup are measured as metrics. The average of all 5 initializations is taken as the final value of each metric. The inference speedup measurements were done on a V100-PCIE GPU with 16 GB of dedicated RAM.
Following Gee et al. (2022), we first apply one epoch of MLM using the in-domain dataset. Next, the model is fine-tuned for 10 epochs with early stopping on the downstream task. We set the initial learning rate to $3 \cdot 10^{-5}$ for both MLM and downstream fine-tuning, while the batch size is set to 8 and 32 for MLM and downstream fine-tuning respectively.
Choice of N. An important hyperparameter is N, i.e. the maximum number of words constituting an n-gram. In our experiments, N is set to 2 as we believe that using bigrams only provides better generalization properties. Increasing the value of N may lead to an overspecialization of n-grams which could overfit on small textual corpora.
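For reference, the metric aggregation described in the implementation details can be sketched as follows: macro-F1 is the unweighted mean of per-class F1 scores, and the reported value is the mean over random initializations. This is a generic sketch of the standard definition rather than the evaluation code used in the experiments; the labels and per-seed scores are hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores: List[float] = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class with no support or predictions scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


def average_over_seeds(per_seed_scores: Sequence[float]) -> float:
    """Final metric value: the mean across random initializations."""
    return mean(per_seed_scores)


# Hypothetical predictions for a single run.
score = macro_f1([0, 0, 1, 1], [0, 1, 1, 1])
# Hypothetical per-seed macro-F1 values for 5 initializations.
final = average_over_seeds([0.80, 0.82, 0.81, 0.79, 0.83])
print(score, final)
```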
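With N = 2, n-gram selection reduces to counting bigrams and keeping the K most frequent ones. The sketch below shows one way this could be done with a frequency counter; it is a minimal sketch that assumes whitespace pre-tokenization and a toy corpus, and the function name `top_k_ngrams` and its parameters are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_k_ngrams(
    corpus: Iterable[List[str]], k: int, max_n: int = 2
) -> List[Tuple[str, ...]]:
    """Return the k most frequent word n-grams for n = 2, ..., max_n."""
    counts: Counter = Counter()
    for words in corpus:
        for n in range(2, max_n + 1):
            # Slide a window of n words over the sentence.
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return [ngram for ngram, _ in counts.most_common(k)]


# Toy corpus; whitespace splitting stands in for real pre-tokenization.
corpus = [
    "the outer sleeve is coupled to the inner sleeve".split(),
    "a member is coupled to the outer sleeve".split(),
]

selected = top_k_ngrams(corpus, k=3)
print(selected)
```

Raising `max_n` above 2 adds longer n-grams to the same frequency ranking, which is where the overspecialization concern mentioned above would apply.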

Datasets
To determine the effectiveness of MWTs, we select 3 different text classification tasks from diverse linguistic domains, namely medical (ADE), legal (LEDGAR), and tech (PATENT).

ADE. A sentence classification dataset for determining whether a sentence is Adverse Drug Event (ADE)-related or not (Gurulingappa et al., 2012). The sentences are characterized by the presence of medical terminologies of drugs and their adverse effects. We use the same train, validation, and test splits as in Gee et al. (2022).
LEDGAR. A document classification dataset of contracts obtained from the US Securities and Exchange Commission (SEC) filings (Tuggener et al., 2020). The task is to determine the main topic of the contract provision from a set of 100 mutually-exclusive labels. The dataset is also part of LexGLUE (Chalkidis et al., 2022), which is a benchmark for legal language understanding.
PATENT. A document classification dataset of US patent applications filed under the Cooperative Patent Classification (CPC) code (Sharma et al., 2019). A human-written abstractive summary is provided for each patent application. The task is to determine the category that a patent application belongs to from 9 unbalanced classes.

Results
Preliminary Analysis. Before measuring the effects of MWTs on LLMs, we analyze how the average sequence length changes for each dataset depending on the tokenizer. From Table 1, increasing the top-K most frequent n-grams naturally yields a greater compression. However, even 1000 bigrams are enough to achieve a reduction of about 20%. When multi-words are combined with an adapted tokenizer $T_{100}$, the joint sequence narrowing effects appear to be highly complementary, achieving a compression rate close to 50% in ADE.
In practice, a 50% reduction means that on average we can store the same amount of text in half the sequence length. Consequently, we could in principle reduce an LLM's maximum sequence length by a factor of 2.
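The arithmetic behind this statement can be sketched as follows. The average lengths below are hypothetical placeholders chosen for illustration, not values from Table 1, and the helper names are assumptions.

```python
def compression_rate(baseline_len: float, compressed_len: float) -> float:
    """Fraction by which the average sequence length is reduced."""
    return 1.0 - compressed_len / baseline_len


def equivalent_max_length(max_len: int, rate: float) -> int:
    """Maximum length that holds, on average, the same amount of text after compression."""
    return round(max_len * (1.0 - rate))


# Hypothetical average lengths (not taken from Table 1).
rate = compression_rate(baseline_len=40.0, compressed_len=20.0)
new_max_len = equivalent_max_length(max_len=128, rate=rate)
print(rate, new_max_len)
```

A rate of 0.5 maps a maximum length of 128 to 64, which corresponds to the factor of 2 mentioned above. This holds on average only, so individual sequences longer than the average may still be truncated.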
Multi-word Tokenization. As a first evaluation, we assess the macro-F1 and inference speedups achieved by fine-tuned BERT models with multi-word tokenizers: $T_{gen}^{1000}$, $T_{gen}^{2500}$, $T_{gen}^{5000}$. The pre-trained BERT with a generic tokenizer $T_{gen}$ is considered as the reference model. From Table 2, MWTs are shown to either improve the reference performance or induce a relatively negligible degradation. At the same time, the sequence compression from MWTs yields a natural speedup that, depending on the dataset, varies from about x1.1 to x1.4.
MWT and Domain Adaptation. Additionally, we investigate the application of MWTs with tokenizers adapted to the dataset: $T_{100}^{1000}$, $T_{100}^{2500}$, $T_{100}^{5000}$. From Table 2, with the exception of PATENT, most models achieve significant inference speedups of up to x1.8 with minimal degradation in performance. We hypothesize that this is due to the fact that the language domain of PATENT is not as specialized as ADE and LEDGAR, which reduces the benefits of using an adapted tokenizer.
MWT and Truncation. Based on the preliminary analysis, we analyze how truncating sequences with different maximum lengths affects both the performance and inference speedup. Reducing the maximum sequence length has a double impact on the inference speedup given a fixed amount of resources. First, latency grows linearly with respect to the sequence length. Second, reducing the sequence length releases GPU resources that can be used to enlarge the batch size. We consider 4 maximum sequence lengths for each dataset by progressively halving the initial maximum sequence length, i.e. {128, 64, 32, 16} for ADE, {256, 128, 64, 32} for LEDGAR, and {512, 256, 128, 64} for PATENT.
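The truncation schedule above can be reproduced with a short sketch that halves the initial maximum length a fixed number of times and truncates token sequences accordingly. The function names are hypothetical, and the sketch covers only the schedule and the truncation itself, not the latency or batch-size effects.

```python
from typing import List


def halving_schedule(initial_max_len: int, steps: int = 4) -> List[int]:
    """Maximum sequence lengths obtained by progressively halving the initial one."""
    return [initial_max_len // (2 ** i) for i in range(steps)]


def truncate(token_ids: List[int], max_len: int) -> List[int]:
    """Keep only the first max_len tokens of a sequence."""
    return token_ids[:max_len]


# Initial maximum lengths as stated in the text above.
schedules = {
    "ADE": halving_schedule(128),
    "LEDGAR": halving_schedule(256),
    "PATENT": halving_schedule(512),
}
print(schedules)
print(truncate(list(range(10)), 4))
```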
From Figure 4, we can see the performance of $T_{gen}$ dropping more rapidly than MWTs as truncation increases (maximum sequence length decreases). In the extreme 8-times truncation, the performance of $T_{gen}$ falls dramatically for both ADE and LEDGAR. However, MWTs are shown to be more robust to truncation, hence their degradation in performance is smoother and without sudden collapses. In both ADE and LEDGAR, a 4-times truncation leads to nearly identical or better performance, while bringing significant inference speedups of ∼x2.4 and ∼x4.4 respectively. If a certain performance degradation is acceptable, the inference speedup can be maximized, reaching up to ∼x9.4 in LEDGAR.
MWT and Distillation. Additionally, we investigate the interaction between sequence compression and knowledge distillation in Table 3. To this end, we utilize a DistilBERT model with MWTs. For simplicity, we restrict our analysis to LEDGAR and to a single multi-word tokenizer $T_{gen}^{2500}$ on different maximum sequence lengths. From the table, our MWT is shown to retain most of its performance with a quarter of the sequence length and an inference speedup of ∼x8.8. Even with an extreme sequence truncation to only 64 tokens, we can still achieve a ∼x18.1 inference speedup with only a 2.7% drop in relative performance.


Conclusion
In this work, we proposed a sequence compression approach that reduces textual inputs by exploiting the use of multi-word expressions drawn from the training set according to their top-K frequencies. We conducted an investigation on 3 different datasets by evaluating each model in conjunction with other compression methods (Gee et al., 2022; Sanh et al., 2019). Our approach is shown to be highly robust to shorter sequence lengths, thus yielding a more than x4 reduction in computational cost with negligible drops in performance. In the future, we expect to extend our analysis to other language models and tasks such as language generation in the scope of sequence compression.

Limitations
As

Figure 2: Sketch of the Multi-word Tokenizer pipeline. First, n-grams are statistically learned from the training set. Then, the top-K n-grams are added to the vocabulary of the tokenizer. N-grams are merged from left to right within a sequence after pre-tokenization.

Table 1: Average sequence length from tokenization. The generic $T_{gen}$ and adapted $T_{100}$ tokenizers are extended with varying top-Ks of 1000, 2500, and 5000.

Figure 4: Plot of macro-F1 against maximum sequence length. The generic $T_{gen}$ and adapted $T_{100}$ tokenizers are represented by solid and dashed lines respectively. MWTs are shown to be more robust on shorter sequence lengths, thus allowing for major speedups via early sequence truncation.

Table 2: Absolute values of BERT fine-tuned on the downstream task using a sequence length of 128, 512 and 256 for ADE, LEDGAR and PATENT respectively. $T_{gen}$ is shown on the first row, while relative values to $T_{gen}$ are shown on subsequent rows.

Table 3: The macro-F1 and inference speedup results on LEDGAR with DistilBERT. MWTs are shown to be highly compatible with distilled models.

Table 5: Model performance of DistilBERT averaged across 5 seeds.