A Formal Perspective on Byte-Pair Encoding

Byte-Pair Encoding (BPE) is a popular algorithm used for tokenizing data in NLP, despite being devised initially as a compression method. BPE appears to be a greedy algorithm at face value, but the underlying optimization problem that BPE seeks to solve has not yet been laid down. We formalize BPE as a combinatorial optimization problem. Via submodular functions, we prove that the iterative greedy version is a $\frac{1}{{\sigma(\boldsymbol{\mu}^\star)}}(1-e^{-{\sigma(\boldsymbol{\mu}^\star)}})$-approximation of an optimal merge sequence, where ${\sigma(\boldsymbol{\mu}^\star)}$ is the total backward curvature with respect to the optimal merge sequence $\boldsymbol{\mu}^\star$. Empirically the lower bound of the approximation is $\approx 0.37$. We provide a faster implementation of BPE which improves the runtime complexity from $\mathcal{O}\left(N M\right)$ to $\mathcal{O}\left(N \log M\right)$, where $N$ is the sequence length and $M$ is the merge count. Finally, we optimize the brute-force algorithm for optimal BPE using memoization.


Introduction
Byte-Pair Encoding (BPE) is a popular technique for building and applying an encoding scheme to natural language texts. It is one of the most common tokenization methods used for language models (Radford et al., 2019; Bostrom and Durrett, 2020; Brown et al., 2020; Scao et al., 2022) as well as for various other conditional language modeling tasks, e.g., machine translation (Ding et al., 2019) and chatbots (Zhang et al., 2020). Despite having been popularized by Sennrich et al. (2016) in NLP as a tokenization scheme, BPE has its roots in the compression literature, where Gage (1994) introduced the method as a faster alternative to Lempel-Ziv-Welch (Welch, 1984; Cover and Thomas, 2006, §13.4). However, the ubiquity of BPE notwithstanding, the formal underpinnings of the algorithm are underexplored, and there are no existing proven guarantees about BPE's performance.
The training and application of BPE are traditionally presented as greedy algorithms, but the exact optimization problems they seek to solve are presented neither in the original work of Gage (1994) nor in the work of Sennrich et al. (2016). We fill this void by offering a clean formalization of BPE training as maximizing a function we call compression utility over a specific combinatorial space, which we define in Definition 2.3. Unexpectedly, we are then able to prove a bound on BPE's approximation error using the total backward curvature σ(µ⋆) (Zhang et al., 2015). Specifically, we find that the ratio of compression utilities between the greedy method and the optimum is bounded below by $\frac{1}{\sigma(\boldsymbol{\mu}^\star)}(1-e^{-\sigma(\boldsymbol{\mu}^\star)})$, which we find empirically to be ≈ 0.37 for σ(µ⋆) = 2.5. Our proof of correctness hinges on the theory of submodular functions (Krause and Golovin, 2014; Bilmes, 2022). Indeed, we are able to prove that compression utility is a special kind of submodular function (Malekian, 2009) over a constrained space. And, despite the presence of the length constraint, which we expound upon formally in §3, we are able to prove a bound similar to the 1 − 1/e of the unconstrained case (Alaei et al., 2010).
Additionally, we give a formal analysis of greedy BPE's runtime and provide a speed-up over the original implementation (Gage, 1994; Sennrich et al., 2016). Our runtime improvement stems from the development of a nuanced data structure that allows us to share work between iterations of the greedy procedure and that lends itself to an amortized analysis. Specifically, given a string with N characters and a desired merge count of M (usually N ≫ M), our implementation runs in O(N log M), an improvement over the O(NM)-time algorithm presented by Sennrich et al. (2016) and the O(N log N) analysis presented by Kudo and Richardson (2018). Finally, our formalism allows us to construct an exact program for computing an optimal solution to the BPE training problem. Unfortunately, the algorithm runs in exponential time, but it is still significantly faster than a naïve brute-force approach.
Our work should give NLP practitioners confidence that BPE is a wise choice for learning a subword vocabulary based on compression principles. In general, such constrained submodular maximization problems are hard (Lovász, 1983). While we do not have a proof that the BPE problem specifically is NP-hard, it does not seem likely that we could find an efficient algorithm for the problem. Regarding the runtime, our implementation of greedy BPE runs in near-linear time in the length of the string, which would be hard to improve upon unless we were willing to not consider the entire string.

Formalizing Byte-Pair Encoding
We first provide a brief intuition for the BPE training problem and the greedy algorithm that is typically employed to solve it. Then, we develop a formalization of BPE using the tools of combinatorial optimization, rather than as a procedure. Consider the string in Example 1: picked pickled pickles. We wish to create a compact representation of this string, where compactness is quantified in terms of the number of symbols (i.e., vocabulary units) required to precisely encode the string. The free parameter is the vocabulary that we will use to construct this representation, although the total size of the chosen vocabulary is often a constraint. In our example, let us assume we are allowed a maximum of 13 symbols in the vocabulary with which we can encode our string. The question is: "How can we select these symbols to achieve our goal of compactness under this constraint?"

A Worked Example
Let us first consider the simple choice of using all the characters present in the string as our vocabulary: this scheme leads to a representation with a length of 22 units, including spaces. In order to decrease this length (while retaining all information present in the original string), we would need to add an additional symbol to our vocabulary: one with which we can replace co-occurrences of two symbols. But how should we choose this entry? One strategy, the one employed by the BPE algorithm, is to use the concatenation of the adjacent units a b that occur with the highest frequency in our string; all occurrences of these adjacent units can then be replaced with a single new unit ab. We refer to this as a merge, which we later define and denote formally as [a, b]. In Example 1, the first merge is [p, i], and it leads to a representation of length 19 with a vocabulary size of 9+1. We can iteratively repeat the same process; the application of 5 total merges results in the vocabulary units pick, pickl, ed, e, and s. These subwords allow us to represent our original string using just 9+1 symbols. If we continued merging, the text representation would become shorter (in terms of the number of symbols required to create the representation) but the merge count (and vocabulary size) would grow. Therefore, the number of merges M, also called the merge count, is a hyperparameter of the whole procedure. The procedure outlined above is exactly the greedy algorithm for BPE proposed by Gage (1994). We provide a minimal implementation in Python in Code 1.
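The procedure can be sketched as follows. This is a minimal, unoptimized sketch in which symbols are plain strings and a merge is a pair of symbol strings; ties among equally frequent pairs are broken by first occurrence in the string, which is an assumption of this sketch rather than a property guaranteed by Code 1.

```python
from collections import Counter

def greedy_bpe(text, num_merges):
    # Greedy BPE training: repeatedly merge the most frequent adjacent pair.
    seq = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        # max() is first-wins, so ties break by first occurrence in the string.
        (a, b), _count = max(pairs.items(), key=lambda kv: kv[1])
        merges.append((a, b))
        # Replace occurrences left to right, skipping over consumed symbols.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, merges

seq, merges = greedy_bpe("picked pickled pickles", 5)
```

On the worked example, this reproduces the first merge [p, i] and compresses the 22-symbol string to 9 symbols; the later merges may differ from the merge sequence shown in the text, since the order among equally frequent pairs is unspecified.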
We will define the compression gain of a merge at any given step of the algorithm as the number of occurrences at which the merge can be applied. The compression gain of a merge does not always correspond to the frequency of the adjacent merge components in the string, due to possible overlaps. Consider, for instance, the string aaa and the merge [a, a]. The frequency of aa is 2, but the merge can be applied only once ([a, a]a). While Gage (1994) and Sennrich et al. (2016) admit overlapping pair counts, Kudo and Richardson's (2018) popular implementation adjusts the algorithm to disregard the overlaps. We stick to the latter, which is more suitable from the optimization standpoint adopted here.
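The distinction between raw pair frequency and the number of applicable (non-overlapping) replacements can be made concrete with a small sketch; symbols are strings and a merge is a pair of strings, both simplifications of this illustration:

```python
def pair_frequency(seq, pair):
    # Raw frequency of the adjacent pair, counting overlapping occurrences.
    return sum(1 for i in range(len(seq) - 1) if (seq[i], seq[i + 1]) == pair)

def compression_gain_of_merge(seq, pair):
    # Number of replacements actually performed by a left-to-right
    # application of the merge: overlapping occurrences are skipped.
    count, i = 0, 0
    while i < len(seq) - 1:
        if (seq[i], seq[i + 1]) == pair:
            count += 1
            i += 2  # the matched symbols are consumed
        else:
            i += 1
    return count

# The example from the text: frequency 2, but only one applicable merge.
freq = pair_frequency(list("aaa"), ("a", "a"))
gain = compression_gain_of_merge(list("aaa"), ("a", "a"))
```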

Merges
The fundamental building block of the BPE problem is a merge, which we define formally below. Informally, a merge is the action of creating a new symbol out of two existing ones. By convention, we also refer to the resulting object as a merge.
Definition 2.1. Let Σ be an alphabet, i.e., a finite, non-empty set. The set of all merges over Σ is the smallest set of pairs Υ_Σ with the following closure property: every character σ ∈ Σ is a (trivial) merge in Υ_Σ, and for any two merges µ′, µ″ ∈ Υ_Σ, the pair [µ′, µ″] is also a merge in Υ_Σ.

A merge sequence is a sequence of merges, which we denote µ = ⟨µ_1, . . . , µ_N⟩ ∈ Υ_Σ*. It is perhaps easiest to understand the concept of a merge through an example.
Note that the strings corresponding to the merges in a merge sequence, along with the characters that make up the set of trivial merges, determine a vocabulary to be used in downstream applications. The greedy BPE algorithm constructs a merge sequence iteratively by picking each merge as the pairing of neighbouring symbols in the current sequence of symbols being processed. Not every merge sequence can arise this way: we call a merge sequence valid if every non-trivial constituent of each of its merges appears earlier in the sequence, and we denote the set of valid merge sequences by M_{Υ_Σ}. For instance, the sequence µ in Example 2.2 is not valid since it does not contain the merge [a, c] before the third element [[a, b], [a, c]].
Note that M_{Υ_Σ} is closed under concatenation, i.e., for two valid merge sequences µ′, µ″ ∈ M_{Υ_Σ}, we have that µ′µ″ ∈ M_{Υ_Σ}, where we use µµ′ to denote the sequence concatenation of µ and µ′.

Applying Merge Sequences. Given some string x ∈ Σ*, we can derive the representation of that string according to the merge sequence µ = ⟨µ_1, . . . , µ_N⟩ by iteratively applying each merge µ_n. Note that, by the definition of Υ_Σ*, we can trivially lift a string x = ⟨σ_1, σ_2, . . .⟩ to a merge sequence by treating each of its characters σ_i ∈ Σ as a trivial merge. Thus, we define this procedure more generally in terms of an arbitrary µ ∈ Υ_Σ*. Concretely, we denote the application of a merge µ_n to µ as APPLY_{µ_n}(µ). As suggested by Code 1 (line 11), this action consists of replacing all adjacent µ_k, µ_{k+1} in µ such that (µ_k, µ_{k+1}) = µ_n by µ_n itself, in a left-to-right fashion. We thus obtain a new µ ∈ Υ_Σ*, to which a further single merge can be applied. We lift APPLY to a merge sequence µ by repeating the application of APPLY on the intermediate result µ^(n) for the successive µ_n in µ; accordingly, we denote this procedure as APPLY_µ(µ). As a result, we obtain µ^(|µ|), which is a non-overlapping ordered forest, i.e., a partial bracketing of the original string x. We provide an example in Fig. 1. Note that the application of a merge sequence is deterministic.
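As a concrete sketch of APPLY (with merges simplified to pairs of yield strings rather than the nested merge objects of the formalization), the merge sequence from Fig. 1 can be applied as follows:

```python
def apply_merge(seq, pair):
    # Replace non-overlapping occurrences of `pair`, scanning left to right.
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def apply_sequence(text, merges):
    # APPLY lifted to a merge sequence: start from the characters and
    # apply each merge in order; the result is the leaves of the forest.
    seq = list(text)
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq

# The merge sequence of Fig. 1, written as pairs of yields.
mu = [("a", "b"), ("c", "b"), ("ab", "a"), ("aba", "cb")]
result = apply_sequence("abaabacbcb", mu)
```

The result is ["aba", "abacb", "cb"], matching the three subwords shown in Fig. 1.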
String Yields. We now define a conceptually reverse operation to applying merges, i.e., deriving a string from a structured µ^(n).
Definition 2.4. The yield of a single µ ∈ Υ_Σ, denoted YIELD(µ), is defined recursively: YIELD(σ) = σ for a trivial merge σ ∈ Σ, and YIELD([µ′, µ″]) = YIELD(µ′) YIELD(µ″), i.e., the concatenation of the constituents' yields.

For a given µ, YIELD is applied sequentially; the resulting characters can then be concatenated to derive a single string. The yield operation can also be used to derive vocabulary units, often referred to as subwords; explicitly, the yields of the individual merges in a sequence µ can be used to form a vocabulary. Strictly speaking, in Sennrich et al.'s (2016) implementation of BPE, the elements of the merge sequences are not of the form above: rather than consisting of prior merges as in our formalization, the merges of Sennrich et al. (2016) consist of the yields of those merges. This introduces an ambiguity with respect to our formalization, since more than one sequence µ ∈ Υ_Σ* could correspond to a given merge sequence in that implementation, some of which would not be valid. As an example, a merge with yield abcd could correspond to, among others, [[[a, b], c], d] or [[a, [b, c]], d], and depending on the preceding merges, the last of these may be invalid. However, it turns out that this is not an issue for us: by construction, the successive elements of the sequence are determined by the previous ones (cf. Alg. 1), which means that, in fact, there is no ambiguity, and the merge sequences in Sennrich et al.'s (2016) implementation always correspond to what our formalization defines as a valid merge sequence.

The BPE Training Optimization Problem
We now define the BPE training task as a combinatorial optimization problem. The objective we seek to optimize is the compression utility of the chosen merge sequence (taken with respect to a string), which we define below.
Definition 2.5. Let x ∈ Σ* be a string. We define the compression utility of a valid merge sequence µ applied to x as κ_x(µ) = |x| − |APPLY_µ(x)|, i.e., the number of symbols saved by applying µ to x.

Note that for any merge sequence µ, κ_x(µ) ≥ 0, and we take κ_x(⟨⟩) = 0. Then, for any merge sequence µ′ = ⟨µ′_1, . . . , µ′_{|x|−1}⟩ of length |x| − 1 where every merge produces replacements, we have κ_x(µ′) = |x| − 1, i.e., the string is compressed into a single symbol. We can further define the compression gain of two merge sequences with respect to each other.
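Taking κ_x(µ) to be the reduction in the number of symbols, |x| − |APPLY_µ(x)|, it can be computed with a short sketch (merges again simplified to pairs of yield strings, an assumption of this illustration):

```python
def apply_merge(seq, pair):
    # Replace non-overlapping occurrences of `pair`, scanning left to right.
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def kappa(text, merge_sequence):
    # Compression utility: number of symbols saved by applying the merges.
    seq = list(text)
    for pair in merge_sequence:
        seq = apply_merge(seq, pair)
    return len(text) - len(seq)
```

For example, kappa("aaa", [("a", "a")]) is 1, consistent with the overlap discussion in §2.1, and the empty merge sequence always has utility 0.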
Definition 2.6. The compression gain of µ′ with respect to a sequence µ, denoted κ_x(µ′ | µ), is defined as κ_x(µ′ | µ) = κ_x(µµ′) − κ_x(µ). (3) Similarly, the compression gain of a single merge µ with respect to a sequence µ, denoted κ_x(µ | µ), is κ_x(µ⟨µ⟩) − κ_x(µ).

We later use the compression gain to make a sequence of observations that leads to proving the submodularity of the compression utility function and, eventually, the approximation bound of the BPE training algorithm. Now, armed with Definition 2.5, we can formally state our optimization problem. In words, we seek a valid merge sequence µ of length M that maximizes the compression utility κ_x(·) for a pre-specified string x ∈ Σ*. We write this combinatorial optimization problem more formally as follows:

$\boldsymbol{\mu}^\star \in \operatorname{argmax}_{\boldsymbol{\mu} \in \mathcal{M}_{\Upsilon_\Sigma},\, |\boldsymbol{\mu}| = M} \kappa_x(\boldsymbol{\mu})$ (4)

The most common procedure found in the NLP literature for solving Eq. (4) is the greedy algorithm of Gage (1994), in the implementation presented by Sennrich et al. (2016). We describe this greedy algorithm in detail in §3 and provide a novel theoretical result: the algorithm comes with a $\frac{1}{\sigma(\boldsymbol{\mu}^\star)}(1-e^{-\sigma(\boldsymbol{\mu}^\star)})$ bound on its approximation error for Eq. (4). In §4, we further offer an asymptotic speed-up of Sennrich et al.'s (2016) algorithm, reducing its runtime to O(N log M). Finally, for completeness, we offer an exact program for finding an optimal valid merge sequence in §5. While this algorithm runs in exponential time, which prevents it from being used in real applications, it is still faster than the brute-force counterpart.

A Greedy Approximation of BPE
We demonstrate that, for any string x ∈ Σ*, the following bound holds:

$\kappa_x(\boldsymbol{\mu}^\dagger) \geq \frac{1}{\sigma(\boldsymbol{\mu}^\star)}\left(1-e^{-\sigma(\boldsymbol{\mu}^\star)}\right)\kappa_x(\boldsymbol{\mu}^\star)$

where µ† is the valid merge sequence output by the greedy algorithm and µ⋆ is an optimal valid merge sequence. To prove this bound, we rely heavily on the theory of submodularity (Krause and Golovin, 2014; Bilmes, 2022).

Properties of Compression Utility (κ)
We start by proving some useful facts about the compression utility function κ_x. Specifically, we first show that κ_x is a specific type of monotone non-decreasing submodular sequence function, which we make precise in the following definitions.
We first observe that κ_x is monotone non-decreasing, i.e., appending a merge to a valid merge sequence never decreases the compression utility.

Proof. For all n ∈ ℕ, we have that κ_x(µ_{n+1} | µ_{≤n}) ≥ 0, since applying an additional merge can only shorten, never lengthen, the representation. ■

Next, we turn to a definition of sequence submodularity from Alaei et al. (2010). In contrast to Alaei et al.'s (2010) definition, we add the additional constraint that a merge-sequence function must take a valid merge sequence as an argument.

Definition 3.3. A real-valued function f over valid merge sequences is sequence submodular if, for all valid merge sequences µ′, µ ∈ M_{Υ_Σ} where µ′ is a prefix of µ (written µ′ ≼ µ), and for all ν ∈ Υ_Σ such that both µ′ν and µν are valid, we have f(ν | µ) ≤ f(ν | µ′).

Proposition 3.4. Let κ_x be the compression utility function. Then, for a fixed x ∈ Σ*, κ_x(·) is submodular (Definition 3.3) when the domain is restricted to the set of valid merge sequences M_{Υ_Σ}.
Proof. Let µ, µ′ ∈ M_{Υ_Σ} such that µ′ ≼ µ, and let ν = [ν′, ν″] be any merge such that µν, µ′ν ∈ M_{Υ_Σ}. First, notice that, once a merge µ_n in a merge sequence µ is applied, the number of occurrences counted by κ_x(µ_{≤n}) cannot be increased by any sequence of further applications, because all submerges of µ_n were applied exhaustively (i.e., to all consecutive occurrences of their immediate submerges). Now, from µ′ν ∈ M_{Υ_Σ}, it follows that both ν′ and ν″ are in µ′. Therefore, the number of occurrences of ν′ and ν″, and a fortiori of successive occurrences of them, cannot be greater after applying µ than after applying µ′, and hence κ_x(ν | µ) ≤ κ_x(ν | µ′). ■

In the context of compression, the submodularity property means that the compression gain achieved by adding a specific merge to a merge sequence can never increase with merge sequence length. However, the requirement that the added merge not create an invalid merge sequence is important. In order to formally prove our desired guarantee regarding the approximation bound of the greedy BPE algorithm, it is not enough that compression utility is sequence submodular over valid merge sequences. For this reason, we identify another property of the compression utility function that allows us to push through our result.
Definition 3.6. We define the following partial order on merges. For merges µ, µ′ ∈ Υ_Σ, we say µ′ ⊂ µ if and only if µ′ is a submerge of µ, i.e., µ′ occurs as a constituent in the recursive construction of µ.

Definition 3.7. A real-valued function f over valid merge sequences is hierarchically sequence submodular if, for every valid merge sequence of the form µ′ν′µν where ν′ ⊂ ν according to the partial order given in Definition 3.6, we have that f(ν | µ′ν′µ) ≤ f(ν′ | µ′). (7)

Note that hierarchical sequence submodularity is a different concept from sequence submodularity, described in Definition 3.3. Indeed, in the case of functions over valid merge sequences, neither submodularity nor hierarchical sequence submodularity implies the other. To see this, note that, roughly speaking, submodularity bounds the difference in the value of a function when the same element is given as an argument, conditioned on the presence of two different (but related) other arguments. However, if the same argument ν′ = ν is considered in Eq. (7), we obtain κ_x(ν | µ′νµ) ≤ κ_x(ν | µ′), which is a trivial bound due to the non-negativity of κ_x(·). The naming is inspired by the fact that we require the partial order over merges, which creates the hierarchy.
Proposition 3.8. Let κ_x be the compression utility function. Then, for a fixed x ∈ Σ*, κ_x(·) is hierarchically sequence submodular (Definition 3.7) when the domain is restricted to the set of valid merge sequences M_{Υ_Σ}.
Finally, we adapt the definition of total backward curvature from Zhang et al. (2015) to our needs. Intuitively, the total backward curvature measures how much the utility of the optimal sequence µ⋆ can decrease when another sequence is applied before it, at the beginning.

Definition 3.9. The total backward curvature of the compression utility function κ with respect to an optimal merge sequence µ⋆ is denoted σ(µ⋆):

$\sigma(\boldsymbol{\mu}^\star) = \max_{\boldsymbol{\mu} \in \Upsilon_\Sigma^*,\, |\boldsymbol{\mu}| \leq M} \left\{ 1 - \frac{\kappa_x(\boldsymbol{\mu}^\star \mid \boldsymbol{\mu})}{\kappa_x(\boldsymbol{\mu}^\star)} \right\}$

The Greedy Algorithm for BPE
In words, the greedy algorithm proceeds as follows: for each of the M iterations, the algorithm chooses the next merge that is both valid and (locally) maximizes the objective in Eq. (4). We give pseudocode in Alg. 1. In practice, as shown in Code 1, this is done by choosing the merge that occurs most frequently (optionally adjusted for pair overlaps). The main loop executes M times. In the subsequent theorem, we show the approximation bound for the greedy algorithm.

Theorem 3.10. The greedy algorithm for BPE training, i.e., for learning a length-M merge sequence µ†, is a $\frac{1}{\sigma(\boldsymbol{\mu}^\star)}(1-e^{-\sigma(\boldsymbol{\mu}^\star)})$-approximation with respect to the optimal length-M merge sequence µ⋆.
Proof. The proof is shown in App. A. ■

Measuring Total Backward Curvature
We do not have a formal bound for σ(µ⋆), and we instead estimate it by enumerating all strings of length |x| ≤ 15 over a finite alphabet of size |Σ| = 5 with a maximum merge sequence size of |µ⋆| < 5. The maximum found is σ(µ⋆) = 2.5, from which follows an optimality bound of ≈ 0.37. When we restrict our search to texts from a natural language (English), we obtain a slightly lower estimate of σ(µ⋆) = 2.0 and hence an optimality bound of ≈ 0.43. We leave further study of the backward-curvature constant to future work. Notice that in the main proof of Theorem 3.10 in App. A, we used σ to bound only one particular type of sequence that becomes the prefix to µ⋆, namely µ†. We may then check for prefixing only greedy sequences, instead of taking the maximum across all µ ∈ Υ_Σ* with |µ| ≤ M as in Definition 3.9. This yields σ′(µ⋆, µ†) = 1.5 and therefore a bound of ≈ 0.52. More important than the particular bound values is that the bound is constant, i.e., greedy BPE training cannot become arbitrarily suboptimal with sequence length.

A Runtime Speed-up
We now introduce a speed-up of the greedy BPE algorithm. Assuming constant-time comparison of strings, finding the maximum pair count over the whole string is O(N), which is the same cost as applying one merge. Most of the runtime of the algorithm presented by Sennrich et al. (2016) and shown in Alg. 1 is spent on (1) recalculating the frequencies of pairs (Alg. 1, line 3) that are not affected by the most recent merge, and (2) scanning the whole string to apply a single merge (Alg. 1, line 4). Our idea to speed up Alg. 1 stems from the insight that we do not have to iterate over the entire sequence, an O(N) operation, on each of the M iterations. Indeed, on the t-th iteration, one only has to do work proportional to the number of new nodes that are added to the forest (Alg. 2, line 6). To achieve this, we introduce a more efficient data structure for BPE. Our first step is to treat the string as a linked list of subwords, initialized as a linked list of characters, which we destructively modify at each iteration. With each possible merge, we store the set of positions where the merge operation could happen; a max heap over candidate merges is then keyed by the sizes of these sets. Lines 6 to 14 in Alg. 2 show the operations that need to be performed on the linked list. Notably, REMOVEPOSITION removes a specific pair position from the corresponding set in the max heap, and ADDPOSITION adds one. See Fig. 2 for an illustration of applying a single merge in one place, based on the introductory Example 1 (in the figure, nodes in red will be removed in the next step, nodes in green were added relative to the previous step, and nodes in purple were just added but will be removed; black lines from the queue to the string show which nodes to merge, and grey lines show which pairs in the priority queue will have reduced frequencies). During one merge operation, we need to remove the top merge pair and add counts for the newly created candidate merges. The cost of one merge then becomes O(R_t log M), where R_t is the number of positions in the string where the merge occurs and log M is the complexity of adding or updating the frequency of a candidate merge pair. Note that it is not log N, because we keep only the top-M candidate pairs in the heap.
At first glance, this suggests an overall runtime of $O\left(\sum_{t=1}^{M} R_t \log M\right)$, with the worst case being a merge applied along the whole string, and therefore O(M N log M).
Theorem 4.2. Let N be the length of the string x ∈ Σ* that is given as input. Then, Alg. 2 runs in O(N log M) time.
Proof. At iteration t, let R_t be the number of positions at which the merge is applied; the work spent modifying the linked list is proportional to R_t, and we additionally do O(log M) work updating the priority queue on lines 6 to 14 in Alg. 2, since it has at most M elements. Thus, Alg. 2 runs in $O\left(\sum_{t=1}^{M} R_t \log M\right)$ time. We now perform an amortized analysis. Each application of a merge at a position reduces the length of the sequence by one, and a string x of length N therefore supports at most N − 1 merge applications in total. This implies $\sum_{t=1}^{M} R_t \leq N - 1$, and hence Alg. 2 runs in O(N log M) time. ■
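The crux of the amortized argument (each replacement shortens the sequence by one symbol, so the total replacement work over all iterations is at most N − 1) can be checked empirically with a small instrumented sketch of greedy BPE; this is an unoptimized illustration, not Alg. 2 itself:

```python
from collections import Counter

def greedy_bpe_replacements(text, num_merges):
    # Run greedy BPE and record R_t, the number of replacements at step t.
    seq = list(text)
    replacements = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _ = max(pairs.items(), key=lambda kv: kv[1])
        out, i, r = [], 0, 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
                r += 1
            else:
                out.append(seq[i])
                i += 1
        seq = out
        replacements.append(r)
    return replacements

text = "picked pickled pickles" * 3
work = greedy_bpe_replacements(text, 10)
# Every replacement shortens the sequence by one symbol, so
# sum(work) is bounded by len(text) - 1 regardless of the merge count.
```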

An Exact Algorithm
In this section, we turn to developing an algorithm for exactly solving the BPE training problem, i.e., Eq. (4). We change algorithmic paradigms and switch to memoization. While we are not able to devise a polynomial-time scheme, we find an exact algorithm that is, in some cases, faster than the brute-force technique of enumerating all valid merge sequences. We first analyze this brute-force method.
Proposition 5.1. The set of valid merge sequences of length M over a string x ∈ Σ* has size $O\left(\min\left(|\Sigma|^{2M}, N^M\right)\right)$.

Proof. The proof can be found in App. A. ■

A simple direct enumeration of all possible merge sequences, with the time complexity of applying one merge sequence being O(NM), gives us a brute-force algorithm that runs in $O\left(NM \min\left(|\Sigma|^{2M}, N^M\right)\right)$ time. The brute-force program explores all possible sequences of merges, including many that are redundant. For instance, both ⟨[p, o], [h, a]⟩ and ⟨[h, a], [p, o]⟩ induce the same partial bracketing when applied to another merge sequence, as in §2.2. Luckily, we are able to offer an exact characterization of when two merge sequences induce the same bracketing. To this end, we provide the following definitions. We use the term transposition to refer to the swapping of items; i.e., a transposition (i, j) over a merge sequence µ refers to the swapping of µ_i and µ_j.

Definition 5.2. A pair of merges µ = [µ_n, µ_m] and µ′ = [µ′_n, µ′_m] conflict if and only if one of them is a constituent of the other or they share a constituent.

Definition 5.3. A transposition (i, j) over a valid merge sequence µ is safe if and only if, for all k < j, µ_k does not conflict with µ_j and, for all k > i, µ_k does not conflict with µ_i. A permutation π = ⟨ρ_1 ρ_2 ⋯ ρ_n⟩, decomposed into transpositions, that maps one valid merge sequence µ to another valid merge sequence π(µ) = µ′ is safe if and only if all of its transpositions are safe.
Informally, Definition 5.3 says that for a permutation to produce a valid merge sequence, there should be no conflicts between the swapped merges and any of the merges in between. The reason for this definition is that safe permutations characterize when two merge sequences always give the same result. Indeed, for x = ddabcacab, applying a merge sequence and a safe permutation of it produces the same partial bracketing.

Definition 5.4. Two merge sequences µ and µ′ are equivalent if and only if, for all x ∈ Σ*, APPLY_µ(x) = APPLY_µ′(x). Symbolically, we write µ ≡ µ′ if µ and µ′ are equivalent.
Safe permutations preserve equivalence: if a valid merge sequence µ′ is obtained from a valid merge sequence µ by a safe permutation, then µ ≡ µ′.

Proof. The proof can be found in App. A. ■

Following the previous example, it is easy to verify that the permuted merge sequences yield the same encoding of x. In contrast to synthetic examples with a constrained alphabet of, e.g., {a, b, c}, far fewer merge conflicts arise in natural language. We can leverage this to develop a faster algorithm that only explores paths that are not equivalent to each other. We first define the concept of a partial ordering between merges.
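The role of conflicts can be checked directly by applying merge sequences as in Definition 5.4 (merges simplified to pairs of yield strings, an assumption of this sketch): transposing two non-conflicting merges leaves the encoding unchanged, while transposing conflicting merges does not.

```python
def apply_merge(seq, pair):
    # Replace non-overlapping occurrences of `pair`, scanning left to right.
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def apply_sequence(text, merges):
    seq = list(text)
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq

x = "ddabcacab"
# [d, d] and [a, b] share no constituent, so swapping them is safe:
same = apply_sequence(x, [("d", "d"), ("a", "b")]) == \
       apply_sequence(x, [("a", "b"), ("d", "d")])
# [a, b] and [c, a] conflict (both consume an `a`), so order matters:
differ = apply_sequence(x, [("a", "b"), ("c", "a")]) != \
         apply_sequence(x, [("c", "a"), ("a", "b")])
```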
All valid merge sequences are equivalent to some merge sequence that is partially ordered using ⋗ so that no neighbouring elements violate this partial ordering. The brute-force algorithm works as a depth-first search through an acyclic graph: each state corresponds to a unique sequence of merges, and each transition corresponds to appending a merge to the end of the current state's merge sequence. In the improved version, we make sure that only sequences ordered using ⋗ are searched; the rest are pruned. The pseudocode for the program is shown in Alg. 3 (Exact BPE with a memoization guard; removing the segments marked with X would result in the brute-force version). Even though the runtime is still prohibitively slow for practical application, Fig. 3 demonstrates how much speed is gained over the brute-force version, which explores all states.

Conclusion
In this paper, we developed the formalisms surrounding the training task of BPE, a very popular tokenization algorithm in NLP. This allowed us to prove that greedy BPE achieves at least a $\frac{1}{\sigma(\boldsymbol{\mu}^\star)}(1-e^{-\sigma(\boldsymbol{\mu}^\star)})$ fraction of the optimal compression utility. We further analyzed the runtime of the naïve and faster greedy BPE algorithms and provided a speed-up for finding an optimal BPE merge sequence. Future work should focus on providing formal guarantees for σ(µ⋆) or on studying σ′(µ⋆, µ†) across natural languages.

Limitations
Our work has focused strongly on the formal aspects of BPE. NLP practitioners should not be dissuaded from using BPE for subword tokenization, despite our presentation of examples where greedy BPE fails. Indeed, in contrast to synthetic examples on toy alphabets, on real data we observed that greedy BPE may be close to optimal.

A Proofs
Our proof of approximate optimality is based on the proofs for greedily maximizing sequence submodular functions by Alaei et al. (2010) and Zhang et al. (2015). However, we leverage a problem-specific property, which we dub hierarchical submodularity. We restate the definition here for ease of reference.

Definition 3.7. A real-valued function f over valid merge sequences is hierarchically sequence submodular if, for every valid merge sequence of the form µ′ν′µν where ν′ ⊂ ν according to the partial order given in Definition 3.6, we have that f(ν | µ′ν′µ) ≤ f(ν′ | µ′).

Lemma A.1. Let µ′, µ ∈ M_{Υ_Σ} be valid merge sequences. Then, there exists a merge ν in µ such that µ′ν is a valid merge sequence and κ_x(ν | µ′) ≥ κ_x(µ | µ′) / |µ|. In words, the compression gain of some element of µ with respect to µ′ is greater than or equal to the average compression gain per element of µ with respect to µ′.

Proof. Let us choose one of the possible maxima, t = argmax_{1 ≤ t′ ≤ |µ|} κ_x(µ_{t′} | µ′µ_{<t′}). Because the maximum is always greater than or equal to the average, and because the incremental gains telescope to κ_x(µ | µ′), we have κ_x(µ_t | µ′µ_{<t}) ≥ κ_x(µ | µ′) / |µ|. Then, we have that either:

• µ′µ_t ∈ M_{Υ_Σ}, in which case the result follows by submodularity, or

• µ′µ_t ∉ M_{Υ_Σ}, in which case there exists a merge µ_{t′} such that: (1) µ_{t′} ⊂ µ_t; (2) µ′µ_{t′} ∈ M_{Υ_Σ}; (3) µ_{t′} occurs in µ; and (4) κ_x(µ_{t′} | µ′) ≥ κ_x(µ_t | µ′µ_{<t}). In particular, all trivial submerges of µ_t (i.e., all submerges of µ_t whose constituents are in Σ) fulfill all four conditions: the first one by definition, the second by definition of M_{Υ_Σ}, the third because µ ∈ M_{Υ_Σ}, and the fourth by hierarchical submodularity (first inequality) and by submodularity (second inequality).

■
We now proceed with the proof of approximate optimality of the greedy BPE merge sequence.

Theorem 3.10. The greedy algorithm for BPE training, i.e., for learning a length-M merge sequence µ†, is a $\frac{1}{\sigma(\boldsymbol{\mu}^\star)}(1-e^{-\sigma(\boldsymbol{\mu}^\star)})$-approximation with respect to the optimal length-M merge sequence µ⋆.

Proof. We make use of the sequence µ†_{<M} (rather than µ†) for reasons that will subsequently become clear. From Lemma A.1, we know that we can find a merge µ⋆_j such that µ†_{<M} µ⋆_j is a valid merge sequence and κ_x(µ⋆_j | µ†_{<M}) ≥ κ_x(µ⋆ | µ†_{<M}) / M. From the greedy property of µ†, each greedily chosen merge gains at least as much as any single valid alternative, in particular µ⋆_j. Combining these two inequalities with the definition of total backward curvature (Definition 3.9) and unrolling the resulting recursion over the M iterations yields the stated bound. ■

We next show that safe permutations preserve the result of applying a merge sequence (cf. §5), proceeding by induction over the elements of π(µ).

• Base Case: Since π is safe, for [a, b] = π(µ)_1, a and b are necessarily characters in Σ.

• Inductive Step: Suppose that for k = n − 1, π(µ)_{≤k} applies the same merges as are applied by µ. We then show that π(µ)_n also applies the same merges as µ. Consider π(µ)_n = (µ_m, µ_{m′}); since π is safe, both µ_m and µ_{m′} already exist in APPLY_{µ_{≤n}}(x). Moreover, since there are no conflicts, applying π(µ)_n results in the same encoded sequence. ■

Proposition 5.1. The set of valid merge sequences of length M over a string x ∈ Σ* has size $O\left(\min\left(|\Sigma|^{2M}, N^M\right)\right)$.
Proof. On one hand, we note that we have an upper bound of N − 1 possible merges that can occupy the first element of the sequence, assuming every symbol in x is distinct. Next, we have N − 2 possible merges that can occupy the second element of the sequence, again assuming every symbol in x is distinct. Continuing this pattern, we arrive at a simple upper bound of $\prod_{m=0}^{M-1}(N-1-m)$ on the number of merge sequences. This quantity is recognizable as a falling factorial, which gives us the closed form $\frac{(N-1)!}{(N-1-M)!}$; this can be trivially bounded by $N^M$. On the other hand, after m merges there are at most |Σ| + m symbols available (the alphabet plus the m previously created merges), and hence at most (|Σ| + m)² candidate merges at step m + 1; bounding each factor yields at most $O\left(|\Sigma|^{2M}\right)$ valid merge sequences when M is small relative to |Σ|. Taking the minimum of the two bounds completes the proof. ■

B BPE Modifications
In this section, we describe multiple modifications of the greedy BPE algorithm that speed up its runtime. We do not address popular heuristic modifications, such as lowercasing the text or adding the top 20% most frequent words to the subword dictionary.

B.1 (Not) Merging Space
Currently, spaces are treated like any other characters and are allowed to be part of merges. Therefore, in the string "not_that_they_watch_the_watch", the first merge is [_, t] and the string becomes "not[_,t]hat[_,t]hey watch[_,t]he watch". The next merge may then span across tokens: [t, [_,t]]. This is not desirable if we only want to split tokens into subwords (i.e., use merges that do not contain spaces). Furthermore, the algorithm of §3 duplicates work by computing pair frequencies and merges multiple times across the repeated occurrences of the same token (see the previous string example). In practice (Tab. 1), only 1.5% of all tokens are unique. We may then speed up our computation by considering only unique tokens. The new runtime complexity is therefore O(V · |x_u|), where x_u = {t : t is a token in x} is the set of unique tokens, which is |x| / |x_u| times faster.
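The unique-token computation can be sketched as follows (tokens split on spaces and pair counts weighted by token frequency; the helper name is ours, not from any referenced implementation):

```python
from collections import Counter

def pair_counts_over_unique_tokens(text):
    # Split on spaces so that no merge crosses a token boundary, then
    # count adjacent pairs once per unique token, weighted by frequency.
    token_freq = Counter(text.split(" "))
    pair_counts = Counter()
    for token, freq in token_freq.items():
        for a, b in zip(token, token[1:]):
            pair_counts[(a, b)] += freq
    return pair_counts

counts = pair_counts_over_unique_tokens("not that they watch the watch")
```

Here the pair (t, h) gets count 3 (from that, they, the), and (w, a) gets count 2, even though the character pairs of watch are computed only once.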

B.2 Non-iterative BPE
A popular implementation of a BPE-like algorithm in Python uses a different speed-up mechanism to avoid the O(NV) runtime. This is done by (1) collecting all possible merges observed in the data up to some maximum yield size, which determines the maximum subword length, such as 5, and (2) taking the top-M most frequent pairs as the subword dictionary. Note that because of hierarchical submodularity (Definition 3.7), this produces valid merges: if µ = [µ′, µ″] is chosen, so must be µ′ and µ″, because they have at least the same frequency as µ. For example, for abcabcd and maximum yield width 3, the merges would be [a, b], [[a, b], c], [b, c], [a, [b, c]], . . . However, it is easy to see that this approximation algorithm is not bounded. For a constant maximum yield width w, consider x = a^{wn} and V = w + k. The shortest possible output of this algorithm will be µ^n. However, an optimal merge sequence can perform additional merges, producing a sequence ν whose ratio has infimum 0. This means that we can construct adversarial examples for which the compression given by this algorithm is arbitrarily suboptimal.
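The collect-then-rank step can be sketched as follows (a minimal sketch; we flatten merges to their yields, i.e., we count raw substrings of width 2..w rather than tracking the tree structure of merges, and the function name noniterative_vocab is ours). Ties in frequency are kept in first-occurrence order:

```python
from collections import Counter

def noniterative_vocab(text, max_width, top_m):
    """Collect every substring of width 2..max_width, keep the top-M by frequency."""
    counts = Counter(
        text[i:i + w]
        for w in range(2, max_width + 1)
        for i in range(len(text) - w + 1)
    )
    return [s for s, _ in counts.most_common(top_m)]
```

On abcabcd with maximum width 3, the ranking places "ab" and "bc" no lower than "abc", illustrating the hierarchical-submodularity point: a yield can never be more frequent than its constituent parts.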

C Additional Experimental Details and Results
Example 1: Compression of the text picked pickled pickles using 5 merges according to the greedy BPE algorithm. The most frequently occurring pair of vocabulary items is highlighted and subsequently merged. The merge sequence is ⟨[p,i], [c,k], [pi,ck], [e,d], [pick,l]⟩ (notation simplified for clarity).
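The greedy procedure this example illustrates can be sketched in a few lines (a minimal sketch without the overlap-adjusted counts mentioned for Code 2; ties between equally frequent pairs are broken by first occurrence, so the exact merge order may differ from the sequence in the caption while producing the same subwords):

```python
from collections import Counter

def greedy_bpe(text, num_merges):
    """Repeatedly merge the most frequent adjacent pair, `num_merges` times."""
    toks = list(text)
    merges = []
    for _ in range(num_merges):
        counts = Counter(zip(toks, toks[1:]))
        if not counts:
            break
        pair = max(counts, key=counts.get)  # ties: first-occurring pair wins
        merges.append(pair)
        out, i = [], 0
        while i < len(toks):  # apply the merge left to right, non-overlapping
            if i + 1 < len(toks) and (toks[i], toks[i + 1]) == pair:
                out.append(toks[i] + toks[i + 1])
                i += 2
            else:
                out.append(toks[i])
                i += 1
        toks = out
    return toks, merges

toks, merges = greedy_bpe("picked pickled pickles", 5)
```

After five merges the subword pick has been formed in all three words, and the 22-character input has been compressed to 9 tokens.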
A minimal implementation of Sennrich et al.'s (2016) greedy algorithm for BPE in Python.See Code 2 for a version with overlap-adjusted counts.

Example 2.2. Given the alphabet Σ = {a, b, c}, the following are some of the elements of Υ_Σ: [a, b], [a, [a, b]], and [[a, b], [a, c]]. We obtain a merge sequence by arranging these merges into an ordering.

Figure 1: Application of the merge sequence µ = ⟨[a, b], [c, b], [[a, b], a], [[[a, b], a], [c, b]]⟩ to the string x = abaabacbcb. The result can be represented as an ordered forest. Each tree is associated with a subword in the text: aba, abacb, and cb.
Example 4.1. Consider x = abba(cddc)^n and the merge [a, b] for n ≥ 1. We can only apply the merge at the beginning of the string, which results in the forest [a, b]ba(cddc)^n. However, Alg. 2 still scans the entirety of the sequence to recalculate the pair frequencies of [c, d], [d, c], and [c, c]. This additional work is unnecessary.
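One way to avoid the full rescan is to adjust the pair-frequency table only around the positions where a merge actually fires (a sketch of the idea, assuming the names merge_and_update, toks, and counts, which are ours): each merge site removes one count for the merged pair and swaps the two pairs that bordered it.

```python
from collections import Counter

def merge_and_update(toks, pair, counts):
    """Apply `pair` everywhere in `toks`, updating `counts` only at merge sites
    instead of rescanning the whole sequence."""
    merged = pair[0] + pair[1]
    out, i, prev_was_merge = [], 0, False
    while i < len(toks):
        if i + 1 < len(toks) and (toks[i], toks[i + 1]) == pair:
            counts[pair] -= 1  # this occurrence of the pair disappears
            if out:
                # the pair that bordered the merge on the left is replaced
                left_old = pair[1] if prev_was_merge else toks[i - 1]
                counts[(left_old, pair[0])] -= 1
                counts[(out[-1], merged)] += 1
            out.append(merged)
            prev_was_merge = True
            i += 2
        else:
            if prev_was_merge:
                # the pair that bordered the previous merge on the right is replaced
                counts[(pair[1], toks[i])] -= 1
                counts[(out[-1], toks[i])] += 1
            out.append(toks[i])
            prev_was_merge = False
            i += 1
    return out
```

On x = abba(cddc)^n, only the three counts around the single [a, b] site change; the (cddc)^n suffix is copied without touching its entries in the table.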

Figure 2: Visualization of the linked-list representation of the string and the associated priority queue (frequency values in dashed boxes) with merges. Nodes in red will be removed in the next step, nodes in green were added relative to the previous step, and nodes in purple were just added but will be removed. Black lines from the queue to the string show which nodes to merge. Grey lines show which pairs in the priority queue will have their frequencies reduced.

∑_{i=1}^{M} |Σ|^i ≤ M|Σ|^M. Choosing an ordered sequence of M merges from a pool of this size leads to the falling factorial (M|Σ|^M)! / (M|Σ|^M − M)!, which we can upper-bound by (M|Σ|^M)^M, which is in O(|Σ|^{2M^2}). Taking the min of these two upper bounds gives us the overall upper bound. ■
The runtime of this is O(|x| log M), because we scan the whole string and at each position modify a max-heap.

n/2^k. The compressions are wn − n and wn − n/2^k, and the ratio is (wn − n)/(wn − n/2^k).

Figure 3: Comparison of runtimes for brute-force DFS and DFS with memoization. Values above 1 correspond to DFS+memoization being that many times faster than DFS. Points show the average of 16 runs on 5 different input strings (each 2 randomly sampled English sentences of 64 characters).
which is the number of new nodes in the forest created by applying ν′, must be at least equal to κ_x(ν | µ′ν′µ), if not greater. ■ Proposition 3.8 gives us a different notion of submodularity, which is important for the proof of the greedy BPE training guarantee. As an illustrative example of the proposition, we return to Fig. 1. In this case, µ