Fast WordPiece Tokenization

Tokenization is a fundamental preprocessing step for almost all NLP tasks. In this paper, we propose efficient algorithms for the WordPiece tokenization used in BERT, from single-word tokenization to general text (e.g., sentence) tokenization. When tokenizing a single word, WordPiece uses a longest-match-first strategy, known as maximum matching. The best known algorithms so far are O(n^2) (where n is the input length) or O(nm) (where m is the maximum vocabulary token length). We propose a novel algorithm whose tokenization complexity is strictly O(n). Our method is inspired by the Aho-Corasick algorithm. We introduce additional linkages on top of the trie built from the vocabulary, allowing smart transitions when the trie matching cannot continue. For general text, we further propose an algorithm that combines pre-tokenization (splitting the text into words) and our linear-time WordPiece method into a single pass. Experimental results show that our method is 8.2x faster than HuggingFace Tokenizers and 5.1x faster than TensorFlow Text on average for general text tokenization.


Introduction
Tokenization is the process of splitting text into smaller units called tokens (e.g., words). It is a fundamental preprocessing step for almost all NLP applications: sentiment analysis, question answering, machine translation, information retrieval, etc.
Modern NLP models like BERT (Devlin et al., 2019), GPT-3 (Brown et al., 2020), and XLNet (Yang et al., 2019) tokenize text into subword units (Schuster and Nakajima, 2012; Sennrich et al., 2016; Kudo, 2018). As a midpoint between words and characters, subword units retain linguistic meaning (like morphemes), while alleviating out-of-vocabulary situations even with a relatively small vocabulary.

In this paper, we propose efficient algorithms for WordPiece, the subword tokenization used in BERT (Devlin et al., 2019). Given Unicode text that has already been cleaned up and normalized, WordPiece has two steps: (1) pre-tokenize the text into words (by splitting on punctuation and whitespaces), and (2) tokenize each word into wordpieces.
For single-word tokenization, WordPiece uses a greedy longest-match-first strategy: iteratively pick the longest prefix of the remaining text that matches a vocabulary token. This is well-known as Maximum Matching or MaxMatch (Palmer, 2000), which has also been used for Chinese word segmentation since 1980s (Liu and Liang, 1986).
Despite its wide use in NLP for decades, to the best of our knowledge, the most efficient MaxMatch algorithms so far are O(n^2) (where n is the input word length) or O(nm) (where m is the maximum vocabulary token length) (see Section 2). It's worth noting that the latter has a vocabulary-specific multiplicative factor m, which can be large when the vocabulary contains long words.
We propose LinMaxMatch, a novel MaxMatch algorithm for WordPiece tokenization, whose tokenization time is strictly O(n), without any vocabulary-specific multiplicative factor. Inspired by the Aho-Corasick algorithm (Aho and Corasick, 1975), we organize vocabulary tokens in a trie (Fredkin, 1960) and introduce precomputed failure links and failure pops. During tokenization, if an input character does not match any trie edge, we perform smart transitions to avoid backtracking to earlier input characters. This involves collecting the recognized tokens (i.e., failure pops) and moving to a trie node (via the failure link), from where we continue to match the same character (Section 3).
For general text tokenization, referred to as end-to-end tokenization in this paper, we propose E2E WordPiece, an end-to-end algorithm that combines pre-tokenization and WordPiece tokenization into a single, linear-time pass (Section 4).
Experimental results show that our method is 8.2x faster than HuggingFace Tokenizers (HuggingFace, 2020) and 5.1x faster than TensorFlow Text (Google, 2020) on average for general text tokenization (Section 5).
Although tokenization is relatively fast compared with other steps, it is still worth improving its performance: tokenization is a prerequisite step for almost all NLP tasks, and any improvement in its efficiency helps reduce the latency of the entire inference. One potential impact of this work, for example, is on mobile NLP applications. On-device models are generally highly optimized for reducing latency, e.g., by distilling or compressing larger models; thus, the relative cost of tokenization can be significant there. Another impact is on aggregate computational savings for Web services like Google, Facebook, and Twitter. For example, Google nowadays uses BERT to power its Web search. Google serves billions of search queries per day, and it processes hundreds of trillions of Web pages in index building. By employing a faster tokenization system, the aggregate computational savings would be material, which also benefits the environment through lower power consumption.
This paper also makes a theoretical contribution. The proposed LinMaxMatch algorithm solves the decades-old MaxMatch problem in the optimal O(n) time, and the idea is applicable to other string matching or rewriting problems (Section 3.6).
The code will be available at https://www.tensorflow.org/text.

Related Work
Maximum Matching (or MaxMatch) has been used for Chinese word segmentation (CWS) since the 1980s (Liu and Liang, 1986; Palmer, 2000). Recent CWS work focuses on machine learning-based segmentation approaches, but MaxMatch remains a commonly referenced baseline (Chang et al., 2008).
The widely-adopted MaxMatch algorithm, which is used in the original WordPiece implementation (Google, 2018), starts from the longest possible prefix and decrements the length in search of the longest-matching token (Jie et al., 1989). A variant starts from the shortest substring and increases the length (Webster and Kit, 1992; Reps, 1998; Sassano, 2014). The worst-case time complexity of these previous algorithms is O(n^2) or O(nm), or even higher. For example, the complexity of Sassano (2014) is O(nm) (in our notation), since Lookup(t,c,i,N) (Figure 1 in their paper) may take O(m) time (similar to the analysis in Section 3.2 of this paper). Reps (1998) recognizes maximum matching tokens using regular expressions in the context of compilers; their complexity is O(|Q|n), where |Q| is the number of states in the automaton built from the grammar/vocabulary. If applied to WordPiece tokenization, since vocabulary tokens are finite strings, their complexity can be refined as O(nm).
Our algorithm is inspired by the Aho-Corasick algorithm (Aho and Corasick, 1975), but the two algorithms are designed to address different problems. Aho-Corasick is not optimal for the MaxMatch problem. In the worst-case scenario where every substring in the input matches a vocabulary token, Aho-Corasick finds a quadratic number of matches, resulting in an overall quadratic complexity for MaxMatch. By comparison, our algorithm achieves worst-case linear complexity for MaxMatch due to a novel definition of failure links, the newly-introduced failure pops, as well as a different way of emitting tokens.
It's worth clarifying the difference between our failure links and the tabular solution of Reps (1998). In their work, a table called failed_previously is used to store whether a state <q,i> has been seen before in a failed attempt to match a token (where q is a state of the automaton and i is a position of the input); the table thus depends on the input and only avoids repeating the same failed attempt. By contrast, our failure links indicate the node to transit to when trie matching cannot continue (Definition 1), and they are precomputed based on the vocabulary only, independent of the input.
Finally, we discuss the complexity of algorithms for Byte-Pair Encoding (BPE) (Schuster and Nakajima, 2012; Sennrich et al., 2016) and SentencePiece (Kudo, 2018). Note that they are different problems from MaxMatch (the topic of this paper). SentencePiece is based on unigram language modeling, and the optimal segmentation can be found in O(n) time with the Viterbi algorithm (Viterbi, 1967). BPE algorithms can be implemented in two ways. One is to enumerate the symbol pairs in the order in which they were added to the vocabulary during the building phase. For each symbol pair, we scan the current sequence and replace all of its occurrences with the merged symbol. The complexity is O(|V|n), where |V| is the size of the vocabulary. The other approach is to repeatedly select the pair of symbols from the current sequence that has the highest priority (e.g., the maximum frequency). Using a heap, this approach can be done in O(n log n) time.

Linear-Time Single-Word Tokenization
In this section, we present LinMaxMatch, an O(n) algorithm for single-word WordPiece tokenization.

Background and Notations
Given a vocabulary, WordPiece tokenizes a word using the MaxMatch approach: iteratively pick the longest prefix of the remaining text that matches a vocabulary token until the entire word is segmented. If a word cannot be tokenized, the entire word is mapped to a special token <unk>. WordPiece tokenization distinguishes wordpieces at the start of a word from wordpieces starting in the middle. The latter start with a special symbol ## (in BERT), which is called the suffix indicator and is denoted as ♯ in this paper. Our method works with any suffix indicator: ##, an arbitrary string, or the empty string (i.e., no distinction between the two kinds of wordpieces).
For example, the word johanson may be tokenized as [johan, ##son].
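The greedy behavior described above can be sketched in a few lines. The following is a quadratic-time reference implementation for illustration only (the rest of this section develops the linear-time algorithm); the function name and toy vocabulary are ours, not from the paper.

```python
def wordpiece_naive(word, vocab, unk="<unk>", suffix="##"):
    """Greedy longest-match-first (MaxMatch) WordPiece, naive baseline.

    Pieces that start mid-word must appear in `vocab` with the suffix
    indicator prepended (e.g. "##son"). Worst case is quadratic in the
    word length because each position rescans prefixes from the longest.
    """
    tokens, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start and match is None:
            piece = ("" if start == 0 else suffix) + word[start:end]
            if piece in vocab:
                match = piece      # longest matching prefix at `start`
            else:
                end -= 1
        if match is None:          # no prefix matched: whole word is unknown
            return [unk]
        tokens.append(match)
        start = end
    return tokens
```

For instance, `wordpiece_naive("johanson", {"johan", "##son"})` yields `["johan", "##son"]`, matching the example above.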
We use the running example from Figure 1. Table 1 summarizes our notations. We construct a trie from the vocabulary V. (The construction of the vocabulary itself is outside the scope of this paper; we refer the interested reader to Google (2020).) We use δ(u, c) = v to denote a trie edge from node u to node v with character c as the label. If there is no outgoing edge from u with label c, δ(u, c) = ∅. Let str(v) be the string represented by the node v, that is, the string obtained by concatenating all the edge labels along the path from the root to node v. Let r be the root of the trie and r♯ be the node for the suffix indicator ♯. Obviously, str(r) = ε (where ε denotes the empty string) and str(r♯) = ♯. The depth of node v is defined as the number of characters in str(v) excluding the suffix indicator prefix (if any). Hence, the depth of r or r♯ is 0. In Figure 1a, nodes 0 and 2 have depth 0, nodes 1, 3, and 8 have depth 1, node 10 has depth 2, etc.

Symbol        Meaning
ε             The empty string
♯             The suffix indicator string
V             The vocabulary
<unk>         The unknown token
s, w          Strings
c             A character
␣             A whitespace character
r, r♯         The trie root and the node for ♯
u, v          Trie nodes; u is often the parent of v
∅             The null node
δ(u, c)       The trie edge from node u with label c
str(v)        The string represented by node v
F(v), E(v)    The failure link and failure pops of v
n             The length of the input
m             The maximum length of tokens in V
M             The sum of the lengths of tokens in V

Intuition
To motivate our linear algorithm, let's first consider an alternative approach to MaxMatch using a simple vocabulary trie: when searching for the longest token at a position, it starts from the shortest substring and scans the input from left to right, following trie edges to find the longest prefix that matches a vocabulary token.

Example 1. Consider the vocabulary and the trie from Figure 1a, with the input string abcdz. The expected output is [a, ##b, ##c, ##dz].
Starting from position 0, we follow the trie edges to match the input characters from a to d, arriving at node 6. No trie edge exits node 6 with character z as the label. The longest matching prefix seen so far is a, which is the first recognized token. ◊

The challenge of this approach is that, when the trie fails to match the next character, the longest vocabulary token match may be several characters back. As shown in Example 1, from position 0 we've matched the prefix abcd but found that the longest matching token is a. When looking for the next token, we reset the start position at character b and reprocess bcd.., resulting in repetitive and wasteful iterations. The time complexity is O(nm).
The idea of LinMaxMatch is to use precomputed information to avoid reprocessing the characters.
Example 2. For the same example as above, when the trie matching fails at character z, since abcd has been matched, given the vocabulary in use (Figure 1a), we should be able to know that the first two longest-matching tokens are [a, ##b]. After collecting the tokens, we should reset our state as if we had just matched ##cd and then continue to match the same character z. There is no need to reprocess bcd. ◊

Specifically, when trie matching arrives at node v but cannot continue further, it must have matched the string represented by v (i.e., str(v)). We consider the tokens that MaxMatch would generate for the beginning of str(v) (called "failure pops" E(v)), which should be popped off the beginning of str(v) and put into the result. After that, we should transit to a state (following the "failure link" F(v)) that corresponds to the remaining suffix of str(v), from which the algorithm continues to match the next character. F(v) and E(v) are defined as below and can be precomputed based on the vocabulary.
Definition 1. Failure links and pops. Given a node v and the corresponding string str(v), consider the shortest non-empty list of longest-matching-prefix tokens [p1, p2, ..., pk] (where pi ∈ V and pi ≠ ε or ♯, for 1 ≤ i ≤ k) that we can remove from str(v) (in order) until the remaining suffix can be represented by some node v′ on the trie. We define the failure pops of v as E(v) = [p1, p2, ..., pk] and the failure link as F(v) = v′. If such a list does not exist, we define F(v) = ∅.
Put another way, F(v) and E(v) are defined by finding the longest prefix of the string str(v) that matches a vocabulary token, popping it, and repeating this procedure until the remaining suffix string is found on the trie. Figure 1b shows F(v) and E(v) computed for the example vocabulary and trie.
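The popping procedure of Definition 1 can be sketched directly. The code below is a naive illustration (Section 3.4 computes the same information much faster via BFS); node strings stand in for trie nodes, the function name is ours, and the small vocabulary used in the test is assumed to be consistent with the paper's Figure 1a.

```python
def failure_pops(node_str, vocab, suffix="##"):
    """E(v) and str(F(v)) for the trie node representing `node_str`,
    computed by the popping procedure of Definition 1."""
    # the trie contains every prefix of every vocabulary token
    on_trie = {t[:i] for t in vocab for i in range(len(t) + 1)}
    on_trie |= {suffix[:i] for i in range(len(suffix) + 1)}
    pops, rest = [], node_str
    while True:
        ind = suffix if rest.startswith(suffix) else ""
        core = rest[len(ind):]
        # longest prefix of `rest` that is a vocabulary token
        tok = next((ind + core[:i] for i in range(len(core), 0, -1)
                    if ind + core[:i] in vocab), None)
        if tok is None:
            return None, None        # F(v) = null: no failure transition
        pops.append(tok)
        rest = suffix + core[len(tok) - len(ind):]
        if rest in on_trie:          # remaining suffix is on the trie
            return pops, rest
```

With a vocabulary like {a, abcdx, ##b, ##c, ##cdy, ##dz}, `failure_pops("abcd", ...)` returns `(["a", "##b"], "##cd")`, matching the transitions described in Example 2.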
For readers with a background in finite-state transducers (FSTs) (Mohri, 1997), it's helpful to see that F(v) is related to the state transition function and E(v) is related to the output function (more discussion in Section 3.6).

LinMaxMatch Tokenization
Assume that, based on the vocabulary, we have precomputed the trie, failure links, and failure pops (precomputation is discussed in Section 3.4). Given an input string, we follow the trie edges to process the input characters one by one. When trie matching cannot continue from node u, we make a failure transition in two steps: (1) retrieve the failure pops E(u) and append them to the end of the tokenization result, and (2) follow the failure link to node F(u). After that, we continue matching from the new node F(u).
Algorithm 1 shows the tokenization algorithm. For now, ignore lines 4-5; we explain them later.
The main function calls MATCHLOOP() with two inputs: the input string s appended with a whitespace, and the start position 0 (line 1). Inside that function, let's use the term step to denote an iteration of the loop on lines 8-13. Each step starts from the current node u and follows F(u) zero, one, or multiple times (line 12), appending the tokens in E(u) to the result along the way (line 11), until it finds a trie edge that matches the current character (line 9) or F(u) = ∅ (line 10).
On the other hand, if the word cannot be tokenized, when MATCHLOOP() returns on line 10, there are two cases: (1) some normal input character cannot be consumed after attempting failure transitions (i.e., i < |s|), or (2) all normal characters are consumed (i.e., i = |s|) but the appended whitespace cannot be, because the landing node represents a string that cannot be tokenized.
Line 15 is only there for safety reasons; it will never be visited, since a whitespace is appended at the end.

Example 3. Consider s1 = abcdz and s2 = abcd, using the vocabulary from Figure 1a. The expected tokenization of s1 is [a, ##b, ##c, ##dz]. Table 2 shows the sequence of node transitions and result tokens in MATCHLOOP(). The first row is the original state. Steps 1-4 are self-explanatory.
Step 5 is more complex: when we reach step 5, the prefix abcd has already been processed. The current node is node 6, and the next character is z. As δ(6, z) = ∅, we copy E(6) to the result (which becomes [a, ##b]) and follow F(6) to node 10. Next, as δ(10, z) = ∅, we copy E(10) to the result (which becomes [a, ##b, ##c]) and follow F(10) to node 12. Now, as δ(12, z) = 13, we follow the trie edge to node 13 and proceed to step 6.
For s2, MATCHLOOP() consumes all normal characters abcd but not the appended whitespace. When it returns on line 10, i = |s2|, u is node 12 (since F(12) = ∅), and the result tokens are [a, ##b, ##c], which do not cover character d. Indeed, the string ##d represented by node 12 cannot be tokenized.
The tokens are reset to [<unk>] in both cases. ◊

Corner cases. One behavior of the original WordPiece algorithm (Google, 2018) is that, if the input starts with the suffix indicator, the first result token may start with the suffix indicator. For example, in Figure 1, if the input is ##bc, the tokenization result is [##b, ##c]. By having r♯ as a descendant of r, LinMaxMatch follows the same behavior and returns the same result. However, because r♯ is set as a descendant of r, if the input is ♯ itself (e.g., ##), Algorithm 1 would normally return an empty list of tokens, which is inconsistent with Google (2018). We handle this as a special case. Line 4 checks whether s is ♯ as follows (instead of directly comparing the strings): s = ♯ if and only if the landing node is r♯ and the result tokens are empty after consuming all normal input characters (i.e., i = |s|). If so, the tokens are reset to the precomputed result of the original WordPiece algorithm on ♯ (line 5).
Algorithm 1 can be proved to be consistent with the original WordPiece algorithm (Google, 2018).
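A runnable sketch of the whole pipeline may help at this point. In the sketch below, the tables are precomputed naively per the procedure of Definition 1 (the paper's Algorithm 2 does this faster via BFS), strings stand in for trie nodes, and the matching loop mirrors the failure-transition loop of Algorithm 1. All names and the toy vocabulary are illustrative.

```python
def build_tables(vocab, suffix="##"):
    """Build the trie and naively precompute failure links/pops
    (Definition 1); Section 3.4 computes these faster via BFS."""
    strings = {t[:i] for t in vocab for i in range(len(t) + 1)}
    strings |= {suffix[:i] for i in range(len(suffix) + 1)}
    children = {s: {} for s in strings}
    for s in strings:
        if s:
            children[s[:-1]][s[-1]] = s
    fail, pops = {}, {}
    for s in strings:
        p, rest = [], s
        while True:
            ind = suffix if rest.startswith(suffix) else ""
            core = rest[len(ind):]
            tok = next((ind + core[:i] for i in range(len(core), 0, -1)
                        if ind + core[:i] in vocab), None)
            if tok is None:
                fail[s], pops[s] = None, []   # no failure transition
                break
            p.append(tok)
            rest = suffix + core[len(tok) - len(ind):]
            if rest in children:              # remaining suffix is on the trie
                fail[s], pops[s] = rest, p
                break
    return children, fail, pops

def lin_max_match(word, children, fail, pops, unk="<unk>", suffix="##"):
    """Algorithm 1 sketch: one left-to-right pass; on a mismatch, emit
    the failure pops and follow the failure link, never backtracking."""
    s = word + " "                    # sentinel whitespace ends the word
    tokens, u, i = [], "", 0
    while i < len(s):
        if s[i] in children[u]:       # normal trie transition
            u = children[u][s[i]]
            i += 1
        elif fail[u] is not None:     # failure transition
            tokens.extend(pops[u])
            u = fail[u]
        else:
            break                     # no transition possible
    if i < len(word) or u not in ("", suffix):
        return [unk]                  # the word cannot be tokenized
    return tokens
```

Note that an input equal to the suffix indicator itself (e.g., ##) comes back as an empty list here; the paper handles that corner case separately (lines 4-5 of Algorithm 1).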

LinMaxMatch Precomputation
Given a vocabulary, it is straightforward to build the trie. This section explains how to precompute the failure links F(⋅) and failure pops E(⋅).
We could compute F(⋅) and E(⋅) by directly using the procedure from Definition 1. Instead, we propose a faster algorithm (see Section 3.5 for complexity). Our algorithm computes F(v) and E(v) by leveraging F(u) and E(u) of the parent node u. Suppose δ(u, c) = v. Intuitively, as str(u) is a prefix of str(v), it is likely that E(u) and E(v) share some common longest-matching prefixes in the beginning. It can be proved that, when str(v) ∉ V, E(v) consists of (1) the tokens from E(u), followed by (2) the longest-matching prefixes that the procedure from Definition 1 generates for the string str(F(u)) extended with c. Otherwise, when str(v) ∈ V, it is trivial from Definition 1 that E(v) = [str(v)]. Notice that F(v) and E(v) are computed using similar information from nodes that have strictly smaller depth than v. Breadth-First Search (BFS) is therefore suitable for the computation.
Algorithm 2 is the precomputation algorithm. On line 1, the algorithm builds a trie for V and keeps track of r and r♯. These nodes have depth 0 and are the starting points for our BFS traversal (line 2). We assume that initially F(v) = ∅ and E(v) = [] for every node v. The core part is in lines 7-15, which computes F(v) and E(v) as discussed earlier.
The rest of the algorithm handles technical details. For example, if ♯ is the empty string, the nodes r and r♯ are identical; accordingly, line 2 avoids duplicate nodes. Otherwise, r♯ is a descendant of r, and we need line 6 to avoid revisiting it in the BFS traversal.
It can be proved that Algorithm 2 correctly precomputes F(v) and E(v) for each trie node v.
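In the same spirit, the BFS precomputation can be sketched as follows, reusing the parent's F and E as described above. This is an assumption-laden illustration, not the paper's Algorithm 2: nodes are keyed by the strings they represent, and proper prefixes of the suffix indicator (e.g., a lone #) are assumed not to start any vocabulary token.

```python
from collections import deque

def precompute(vocab, suffix="##"):
    """BFS precomputation of failure links F and failure pops E,
    extending the parent's values instead of re-popping from scratch."""
    strings = {t[:i] for t in vocab for i in range(len(t) + 1)}
    strings |= {suffix[:i] for i in range(len(suffix) + 1)}
    children = {s: {} for s in strings}
    for s in strings:
        if s:
            children[s[:-1]][s[-1]] = s
    fail = {s: None for s in strings}   # F(v); None plays the role of null
    pops = {s: [] for s in strings}     # E(v)
    queue = deque([""] if suffix == "" else ["", suffix])  # r and r#
    while queue:
        u = queue.popleft()
        for c, v in sorted(children[u].items()):
            if v == suffix:
                continue                # r# was enqueued as a root already
            if suffix.startswith(v):
                continue                # interior of the indicator path
            if v in vocab:              # str(v) is itself a token
                fail[v], pops[v] = suffix, [v]
            else:                       # extend the parent's failure pops
                z, p = fail[u], list(pops[u])
                while z is not None and c not in children[z]:
                    p += pops[z]        # chain nodes have smaller depth
                    z = fail[z]
                if z is not None:
                    fail[v], pops[v] = children[z][c], p
            queue.append(v)
    return children, fail, pops
```

The BFS order guarantees that whenever a node's F and E are computed, the values for every node of strictly smaller depth are already available.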

Complexity Analysis
The complexity of tokenization (Algorithm 1) can be proved to be O(n) in a similar way as for Aho-Corasick (Aho and Corasick, 1975). In brief, each step (an iteration of the loop on lines 8-13) makes zero or more failure transitions followed by exactly one normal (non-failure) transition. In each step, suppose we start at node u with depth d. We never follow more than d failure transitions in that step: each failure transition takes us to a node with a strictly smaller depth. Any normal transition along a trie edge increments the depth of the node by 1 (line 13). Therefore, the total number of failure transitions is no more than the total number of normal transitions, which is O(n). Each transition is O(1) plus the work to extend the list of tokens on line 11. As there are at most n result tokens in total, the total tokenization time is O(n).
Since at least n operations are required to read the entire input, our O(n) algorithm is asymptotically optimal. To the best of our knowledge, this is the first time that the optimal complexity for MaxMatch is proved to be strictly O(n), without a vocabulary-specific multiplicative factor.
For precomputation (Algorithm 2), the BFS traversal itself is O(M), where M is the sum of the lengths of the vocabulary tokens. A similar depth-based analysis (as in the case of the tokenization algorithm) shows that the total number of times we traverse a failure link on line 13 is O(M).
The non-trivial part is the construction of E(⋅) on lines 12 and 15. The total size of E(⋅) is O(Mm): there are O(M) lists, and the size of each list is O(m). A straightforward implementation needs O(Mm) time and space to construct and store E(⋅). This is good enough in practice, as the precomputation is performed offline before any tokenization process. We plan to discuss optimized implementations in a follow-up publication.

Connection with Other Methods / Tasks
LinMaxMatch can be turned into a finite-state transducer (FST) (Mohri, 1997) by eliminating the failure transitions in Algorithm 1. An FST extends a finite-state automaton (FSA) with an output tape. To turn LinMaxMatch into an FST, for each node u and character c, we define the state transition function δ′(u, c) and the output function σ′(u, c) as follows:
• δ′(u, c) precomputes the final state of lines 9-13 of Algorithm 1, which starts from u and follows failure transitions as needed, until it consumes c or meets a null failure link;
• σ′(u, c) consists of the failure pops collected along the way.
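This failure-transition elimination can be sketched compactly, assuming the trie is represented by plain dictionaries (children, fail, pops) keyed by node strings; the function name and the hand-built toy trie are hypothetical.

```python
def build_fst(children, fail, pops, alphabet):
    """Precompute delta'(u, c) and sigma'(u, c) for every (node, char)
    by replaying the failure loop of Algorithm 1 offline."""
    delta, sigma = {}, {}
    for u in children:
        for c in alphabet:
            v, out = u, []
            while v is not None and c not in children[v]:
                out = out + pops[v]   # failure pops collected along the way
                v = fail[v]
            delta[(u, c)] = None if v is None else children[v][c]
            sigma[(u, c)] = out
    return delta, sigma

# hand-built toy trie for vocab {a, ab, ##b}; nodes are keyed by string
children = {"": {"a": "a"}, "a": {"b": "ab"}, "ab": {},
            "#": {"#": "##"}, "##": {"b": "##b"}, "##b": {}}
fail = {"": None, "#": None, "##": None, "a": "##", "ab": "##", "##b": "##"}
pops = {"": [], "#": [], "##": [], "a": ["a"], "ab": ["ab"], "##b": ["##b"]}
delta, sigma = build_fst(children, fail, pops, "ab")
```

After this precomputation, tokenization consumes exactly one table lookup per input character, at the cost of a table whose size grows with the alphabet.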
LinMaxMatch extends the Aho-Corasick algorithm (Aho and Corasick, 1975). It can be applied to more string search or transducer problems; let us name a few here. LinMaxMatch can be adapted to solve the multi-keyword search problem that Aho-Corasick is designed for. It can also be adapted to address other MaxMatch variants, such as Backward MaxMatch (Webster and Kit, 1992), recognizing unseen characters as single-character tokens (Palmer, 2000), or combining with transformation rules (Sassano, 2014). Other potential applications include word segmentation in Asian languages (Sassano, 2014) and phonological or morphological analysis (Kaplan and Kay, 1994; Jurafsky and Martin, 2009).

Linear-Time End-to-End Tokenization
The existing BERT tokenization implementations (Google, 2018) pre-tokenize the input text (splitting it into words by punctuation and whitespace characters) and then call WordPiece tokenization on each resulting word. For example, the text john johanson's may be split into [john, johan, ##son, ', s].
We propose an end-to-end WordPiece tokenizer that combines pre-tokenization and WordPiece into a single, linear-time pass. It uses the LinMaxMatch trie matching and failure transition loop as much as possible and only checks for punctuation and whitespace characters among the relatively few input characters that are not handled by the loop. It is more efficient as it traverses the input only once, performs fewer punctuation / whitespace checks, and skips the creation of intermediate words.
Precomputation. We use the same process as in Section 3.4, with several differences. After the trie is constructed, we remove all trie links labeled with a punctuation character. Then, for every possible punctuation character c, we add a trie node vc with no descendants, along with a trie link from the root to vc labeled with c. The token associated with vc is c itself if c is part of the vocabulary, and <unk> otherwise. The resulting trie matches each punctuation character, either as itself or as <unk>, depending on the vocabulary. Punctuation characters are not part of longer tokens, and there is no suffix token for a punctuation character. This reflects the fact that each punctuation character is a word by itself.
We then run the rest of Algorithm 2 to compute the failure pops and failure links.
Finally, for punctuation nodes, we set their failure links to a special node vp; their failure pops are not changed. The special node vp has no parent and no descendants, and F(vp) = ∅. Node vp indicates that a punctuation character was matched.
Tokenization. Algorithm 3 tokenizes general text into wordpieces. It starts by appending a whitespace at the end of the input (line 1). In each iteration, it recognizes the wordpieces of the current word by employing (almost) the same routine as in single-word tokenization (lines 3-7 in Algorithm 3 versus lines 1-5 in Algorithm 1; the common routine can be factored out as a function). When returning from MATCHLOOP(), Algorithm 3 must have met a character that cannot be consumed after attempting failure transitions, such as a whitespace, a punctuation character, or some unseen character. Lines 4-5 examine whether the current word can be tokenized (by checking whether the current position is at a word boundary and which node the loop lands on) and reset the tokens as appropriate (see the related discussion in Section 3.3).
Lines 6-7 further handle the corner case where the word happens to be the suffix indicator itself (in the same way as Algorithm 1; see Section 3.3). Note that normally the suffix indicator contains only punctuation characters (e.g., ## in BERT); in that case lines 6-7 can be skipped, because the suffix indicator itself will never be tokenized as a single word.
The tokens of the current word are then appended to the result (line 8). Finally, the algorithm moves the cursor past the boundary of the current word (lines 9-10) and skips any following whitespaces (lines 11-12) to process the next word.
It can be shown that Algorithm 3 is consistent with Google (2018) for general text tokenization, and that its time complexity is O(n).
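The end-to-end behavior can be sketched as a single left-to-right scan. This is a behavioral illustration only: it treats Unicode category P* characters as punctuation (BERT's pre-tokenizer additionally treats some ASCII symbols as punctuation) and re-tokenizes each word greedily rather than using the trie with failure transitions; the function name is ours.

```python
import unicodedata

def e2e_tokenize(text, vocab, unk="<unk>", suffix="##"):
    """One pass over the text: skip whitespace, emit each punctuation
    character as its own word, and greedily tokenize everything else."""
    def is_punct(ch):
        return unicodedata.category(ch).startswith("P")

    tokens, i, n = [], 0, len(text)
    while i < n:
        ch = text[i]
        if ch.isspace():
            i += 1
        elif is_punct(ch):
            # each punctuation character is a word by itself
            tokens.append(ch if ch in vocab else unk)
            i += 1
        else:
            j = i
            while j < n and not text[j].isspace() and not is_punct(text[j]):
                j += 1
            word = text[i:j]
            pos, out = 0, []                # greedy MaxMatch on the word
            while pos < len(word):
                end = len(word)
                while end > pos:
                    piece = ("" if pos == 0 else suffix) + word[pos:end]
                    if piece in vocab:
                        break
                    end -= 1
                if end == pos:              # word cannot be tokenized
                    out = [unk]
                    break
                out.append(("" if pos == 0 else suffix) + word[pos:end])
                pos = end
            tokens.extend(out)
            i = j
    return tokens
```

For instance, `e2e_tokenize("john johanson's", ...)` with a suitable vocabulary reproduces the [john, johan, ##son, ', s] example from Section 4.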

Experiments
Experimental Setup. We benchmark our method against two widely-adopted WordPiece tokenization implementations:
• HuggingFace Tokenizers (HuggingFace, 2020), from the HuggingFace Transformers library;
• TensorFlow Text (Google, 2020), the official text processing library of TensorFlow.
In both cases, we use pre-tokenization and WordPiece tokenization, and skip other steps provided by those libraries (text cleanup, normalization, etc.) for a fair comparison. Both libraries implement the original WordPiece tokenization algorithm (Google, 2018). They both generate not only the numeric ids of the tokens, but also the token strings and the start/end offsets in the input word. We modify both libraries to generate only the token ids, for two reasons: (1) most downstream models (e.g., BERT) consume only the token ids, and (2) we want to focus on the core tokenization work, not on, e.g., string copying.
We implement LinMaxMatch and E2E WordPiece to return the numeric ids of the tokens, leveraging a double-array trie library (Yata et al., 2007).
We compare our algorithms with HuggingFace and TensorFlow Text on a large corpus (several million words) and find that the tokenization results are identical for both single-word and end-to-end tokenization. In the rest of this section, we focus on tokenization speed. All experiments are conducted on a Linux desktop with a six-core Intel Xeon CPU @ 3.60GHz and 64GB of memory. We iterate each benchmark (after warming up) until it has run for a long enough period of time, repeat each experiment 10 times, and report the average results. Our method is implemented and benchmarked in C++, as is TensorFlow Text. HuggingFace uses (and is benchmarked in) Rust.
We use the WordPiece vocabulary released with the BERT-Base, Multilingual Cased model, a model that supports 104 languages (Google, 2018).
To generate the test data, we sample 1,000 sentences from the multilingual Wikipedia dataset, covering 82 languages including English, Chinese, French, Russian, etc. On average, each word has 4 characters, and each sentence has 82 characters or 17 words. We found this dataset large enough: a much larger dataset (consisting of hundreds of thousands of sentences) generated similar results.
We run BERT's BasicTokenizer (Google, 2018) to clean up and normalize each sentence, including Unicode clean-up and normalization. Following the guidance for the BERT-Base Multilingual Cased model (Google, 2018), we do not instruct BasicTokenizer to do lower casing or accent stripping. In addition, preprocessing adds spaces around every CJK character, and thus Chinese is effectively character-tokenized. For simplicity, we keep Chinese in the test set, but keep in mind that each Chinese word is just one Chinese character, and any WordPiece implementation is efficient on such short words. Using a dataset with long words would emphasize the speed advantage of our algorithm even more than indicated below.
For single-word tokenization, we further used BasicTokenizer to pre-tokenize each sentence on punctuation and whitespace characters. This results in 17,223 words, 8,508 of them unique.
Results. Table 3 shows the mean and 95th-percentile running time of each system when tokenizing a single word or general text (end-to-end). (When computing the 95th percentile, the running time on each individual input is approximated by the average running time of all input examples of the same length.) For single-word tokenization, ours is 3x faster on average; the speedup is greater for long-tail inputs. For general text end-to-end tokenization, ours is 8.2x faster than HuggingFace and 5.1x faster than TensorFlow Text on average. Figure 2 shows how the running time grows with respect to the input length for single-word tokenization.

(Table 3: mean and 95th-percentile running time of each system, for single-word and end-to-end tokenization.)

Conclusion
We proposed LinMaxMatch for single-word WordPiece tokenization, which is asymptotically optimal: linear-time with respect to the input length, without a vocabulary-specific multiplicative factor. We also proposed E2E WordPiece, which combines pre-tokenization and WordPiece tokenization into a single, linear-time pass for even higher efficiency. Experimental results show that our approach is 8.2x faster than HuggingFace and 5.1x faster than TensorFlow Text on average for general text tokenization. For future work, we will adapt the proposed methods to more text processing techniques.

A Mathematical Formulations and Proofs of LinMaxMatch
In this section, we present the mathematical formulations of the proposed LinMaxMatch algorithm and prove its correctness. We introduce additional notations here.
Definition 2. The length of a string s is |s| (i.e., the number of characters in s) if s does not start with ♯; otherwise, its length is |s| − |♯|.
For example, the length of abc is 3, the length of ##d is 1 (where ## is the suffix indicator), and the length of ε or ♯ is 0.
Lemma 1. For a nonempty string s = wc, where c is the last character and w is the preceding prefix (w could be ε or ♯), if s ∉ V, we have ŝ = ŵ and s̃ = w̃c.
For example, suppose that the suffix indicator is ##, and that # (a single character) is in V but ##a is not. Then, by definition, the longest matching prefix of ##a is not # (the character); it is ε instead.

Let v_s denote the trie node that represents the string s (so str(v_s) = s), or ∅ if no such node exists. When v_s ≠ ∅, we say the string s is on the trie. For the example in Figure 1, v_abcd is node 6, while v_abcdz = ∅. Table 4 summarizes these additional notations.

Symbol    Meaning
ŝ         The longest prefix of s that is in V
s̃         The suffix of s after removing ŝ, with ♯ prepended
T(s)      MaxMatch result for s given V
v_s       The node that represents string s
P(s)      MinPop matching of s onto some node
Q(s)      Tokens popped when computing P(s)

A.1 MaxMatch in WordPiece
MaxMatch in WordPiece tokenization (Google, 2018) can be formalized as follows:

Definition 5. MaxMatch. Given vocabulary V, for a string s, T(s) is recursively defined as:

T(s) = [] if s ∈ {ε, ♯};
T(s) = [ŝ] ⊕ T(s̃) if s ∉ {ε, ♯}, ŝ ∉ {ε, ♯}, and <unk> ∉ T(s̃);
T(s) = [<unk>] otherwise.   (1)

Note that if the input is exactly the suffix indicator ♯ itself, by Definition 5, T(♯) = [], which may be different from the original MaxMatch algorithm (Google, 2018) (see Section 3.3). Throughout this section, we focus on Definition 5, but be aware that if the original input is exactly the suffix indicator, we resort to the original MaxMatch algorithm.

A.2 MinPop Matching
We introduce a few concepts and discuss their properties and relationships, as shown in Figure 3, which eventually lead to the mathematical formulation of the algorithm and the proofs.
The first concept is MinPop Matching, which means "minimally popping longest-matching prefixes off the beginning of a string until matching a trie node". The formal definition is as follows:

Definition 6. MinPop Matching
For a string $w$, define:
• $m(w)$: returns the node $t(w)$ that represents $w$ if possible, or the node representing the suffix of $w$ that remains after popping the least number of consecutive prefixes following the left-to-right longest-match-first process if possible, otherwise ∅.
• $p(w)$: returns the list of consecutive longest-matching prefix tokens that are popped when computing $m(w)$. Formally:

$$m(w),\ p(w) = \begin{cases} t(w),\ [\,] & \text{if } t(w) \neq \emptyset, \\ m(\bar{w}),\ [\hat{w}] \oplus p(\bar{w}) & \text{else if } \hat{w} \notin \{\epsilon, \sharp\}, \\ \emptyset,\ \text{undefined} & \text{otherwise.} \end{cases} \quad (2)$$

Example 5. Table 5 shows $m(w)$ and $p(w)$ of example strings using the vocabulary in Figure 1. Note that if $w$ is on the trie, no popping is needed when computing $m(w)$ and $p(w)$, as Example 5 illustrates.
MinPop Matching provides an alternative way to compute MaxMatch as shown in Lemma 2.
Lemma 2. For ease of presentation, we augment the trie by adding two nodes representing $s$ and ♯$s$, respectively, where $s$ is the whitespace character, which is not in the alphabet of the vocabulary. Note that although $s$ and ♯$s$ are on the trie, the two strings are not added to the vocabulary. Figure 4 shows the augmented trie built from the example vocabulary in Figure 1. Then MaxMatch $T(w)$ can be equivalently computed as:

$$T(w) = \begin{cases} p(ws) & \text{if } m(ws) \neq \emptyset, \\ [\langle\text{unk}\rangle] & \text{otherwise.} \end{cases} \quad (3)$$

Sketch of Proof. If $w$ is either $\epsilon$ or ♯, it's straightforward that $m(ws)$ is $t(s)$ or $t(\sharp s)$, which is not ∅ on the augmented trie, and $p(ws) = [\,] = T(w)$. Let $w \notin \{\epsilon, \sharp\}$. Since $s$ is not in the vocabulary alphabet, $ws$ is not on the trie (i.e., $t(ws) = \emptyset$). If $w$ can be successfully tokenized, according to Equation 2, computing $m(ws)$ will keep popping the longest-matching prefixes until the remaining suffix becomes ♯$s$, which is on the augmented trie. Hence, $m(ws)$ becomes $t(\sharp s)$ ($\neq \emptyset$), and $p(ws)$ equals $T(w)$. Otherwise, by Equation 2, at some point the longest-matching prefix will be $\epsilon$ or ♯; thus, $m(ws)$ will eventually be ∅. Equation 3 returns $[\langle\text{unk}\rangle]$, which equals $T(w)$.
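Lemma 2 suggests a simple (not yet linear-time) way to compute $T(w)$: append a whitespace and MinPop-match on the augmented trie. A minimal Python sketch, representing each trie node by the string it spells out; the vocabulary is our own illustrative choice:

```python
def build_trie(vocab, suffix="##"):
    # A node exists for every prefix of every vocabulary token, plus the
    # two augmented nodes " " and "## " from Lemma 2.
    nodes = set()
    for tok in vocab | {" ", suffix + " "}:
        for i in range(len(tok) + 1):
            nodes.add(tok[:i])
    return nodes

def min_pop(w, vocab, trie, suffix="##"):
    """(m(w), p(w)): pop longest-matching prefixes until the remaining
    suffix is on the trie, or fail with m(w) = None (i.e., ∅)."""
    pops = []
    while w not in trie:
        # Longest prefix of w in the vocabulary; the bare suffix indicator
        # never counts as a match (cf. Definition 2 and footnote 11).
        best = next((w[:i] for i in range(len(w), 0, -1)
                     if w[:i] in vocab and w[:i] != suffix), None)
        if best is None:
            return None, pops
        pops.append(best)
        w = suffix + w[len(best):]   # remaining suffix, with ## prepended
    return w, pops

def tokenize(word, vocab, suffix="##", unk="<unk>"):
    """T(word) via Lemma 2: MinPop-match word plus a trailing whitespace."""
    trie = build_trie(vocab, suffix)
    node, pops = min_pop(word + " ", vocab, trie, suffix)
    return pops if node is not None else [unk]

vocab = {"a", "abcdx", "##b", "##c", "##cdy", "##dz"}
print(tokenize("abcdz", vocab))   # ['a', '##b', '##c', '##dz']
```

Each popped token here is recomputed by scanning prefixes, so this sketch is still quadratic; the point is only to make Equations 2 and 3 concrete.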
Example 6. In Figure 4, for $w$ = abcdz, computing $m(ws)$ pops a, ##b, ##c, and ##dz in turn, ending at the node $t(\sharp s)$ ($\neq \emptyset$); hence, by Lemma 2, $T(\text{abcdz}) = p(ws) = [\text{a}, \text{##b}, \text{##c}, \text{##dz}]$.

A.3 One-Step MinPop Matching
The second concept extends MinPop Matching by a single character: starting from a trie node, consume one more character, minimally popping longest-matching prefixes whenever the trie matching cannot continue.

Definition 7. One-Step MinPop Matching
$h(u, c)$ and $p(u, c)$ capture this process: from node $u$, match one character $c$ by minimally popping longest-matching prefixes. Mathematically, identifying a node $u$ with the string it represents:

$$h(u, c) \triangleq m(uc), \qquad p(u, c) \triangleq p(uc). \quad (4)$$

Example 7. Table 6 shows $h(u, c)$ and $p(u, c)$ for example node-character pairs, based on the vocabulary in Figure 1.

One-Step MinPop Matching allows $m(\cdot)$ and $p(\cdot)$ to be computed incrementally, one character at a time:

Lemma 3. If the string is $\epsilon$ or ♯, then $m(\cdot)$ is the corresponding trie node and $p(\cdot) = [\,]$. Otherwise, the string contains at least one character. Let's denote the string as $wc$, where $w$ is its prefix and $c$ is the last character ($w$ could be $\epsilon$ or ♯). Let $u = m(w)$ (with the convention $h(\emptyset, c) \triangleq \emptyset$); we have:

$$m(wc) = h(u, c), \qquad p(wc) = p(w) \oplus p(u, c). \quad (5)$$

Sketch of Proof. We prove by induction on the length of the prefix string $w$. Note that the length of a string does not count the leading suffix indicator (Definition 2). The basis is when the length of $w$ is 0, i.e., $w$ is either $\epsilon$ or ♯. It's trivial to verify that Equation 5 holds for the basis case.
For the inductive step, let the length of $w$ be $n$ ($\geq 1$). Assume that Equation 5 holds for any string $w'$ and character $c'$ where the length of $w'$ is smaller than $n$. There are three cases to discuss.
The first case is when $w$ is on the trie (i.e., $t(w) \neq \emptyset$). Then $u = m(w) = t(w)$ and $p(w) = [\,]$, so Equation 5 follows directly from Definition 7.

In the remaining two cases, $t(w) = \emptyset$, hence $t(wc) = \emptyset$, which means $wc \notin \mathcal{V}$. Hence, by Lemma 1, $\widehat{wc} = \hat{w}$ and $\overline{wc} = \bar{w}c$. The second case is when $\hat{w} \notin \{\epsilon, \sharp\}$. By Equation 2, $m(wc) = m(\bar{w}c)$ and $p(wc) = [\hat{w}] \oplus p(\bar{w}c)$. Since $\bar{w}$ is a shorter string whose length is smaller than $n$, by the induction assumption, we have $m(\bar{w}c) = h(m(\bar{w}), c)$ and $p(\bar{w}c) = p(\bar{w}) \oplus p(m(\bar{w}), c)$; combined with $m(w) = m(\bar{w})$ and $p(w) = [\hat{w}] \oplus p(\bar{w})$ (Equation 2), this yields Equation 5. The third case is when $\hat{w} \in \{\epsilon, \sharp\}$: then $m(w) = \emptyset$ and $m(wc) = \emptyset$, so Equation 5 holds trivially.

Example 8. Take Figure 4 as an example; let $w$ = ##bcd and $c$ = z. We know $u = m(w)$ is node 10 and $p(w) = [\text{##b}]$ (Table 5). Given $u = 10$ and $c$ = z, we also know that $h(u, c) = h(10, \text{z}) = 13$ and $p(u, c) = p(10, \text{z}) = [\text{##c}]$ (Table 6). By Lemma 3, $m(\text{##bcdz}) = 13$ and $p(\text{##bcdz}) = [\text{##b}, \text{##c}]$.

If we precompute and store $h(u, c)$ and $p(u, c)$ for every pair of node $u$ and character $c$, then for an arbitrary string $w$ we can efficiently compute $m(w)$, $p(w)$ (Lemma 3) and MaxMatch $T(w)$ (Lemma 2). This results in an algorithm that can be formalized as a finite-state transducer (FST) (more discussion in Section A.7). However, it needs more space to store the $h(u, c)$ and $p(u, c)$ tables. For example, the size of the $h(u, c)$ table is $O(|T| \cdot |\Sigma|)$, where $|T|$ is the size of the trie and $|\Sigma|$ is the size of the alphabet. In the following sections, we show that, while maintaining the overall linear time complexity (Section 3.5), failure links $f(\cdot)$ and failure pops $F(\cdot)$ can be used to compute $h(u, c)$ and $p(u, c)$ efficiently, but with much less space. For example, the size of the $f(\cdot)$ table is $O(|T|)$, which is much less than the $O(|T| \cdot |\Sigma|)$ space needed for the $h(u, c)$ table. This eventually results in Algorithm 1, which is a more practical approach.
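The full-table variant can be sketched by brute force: precompute $h(u, c)$ and $p(u, c)$ for every node/character pair, then tokenize in one strict left-to-right pass. The vocabulary and helper names below are illustrative, and the $O(|T| \cdot |\Sigma|)$ table size shows up directly in the two dictionary comprehensions:

```python
vocab = {"a", "abcdx", "##b", "##c", "##cdy", "##dz"}
SUF = "##"
alphabet = {ch for tok in vocab for ch in tok.removeprefix(SUF)} | {" "}

# Trie nodes (identified by their strings), with Lemma 2's augmented
# nodes " " and "## ".
trie = {tok[:i] for tok in vocab | {" ", SUF + " "} for i in range(len(tok) + 1)}

def min_pop(w):   # (m(w), p(w)) per Equation 2; None stands for ∅
    pops = []
    while w not in trie:
        best = next((w[:i] for i in range(len(w), 0, -1)
                     if w[:i] in vocab and w[:i] != SUF), None)
        if best is None:
            return None, pops
        pops.append(best)
        w = SUF + w[len(best):]
    return w, pops

# O(|trie| * |alphabet|) tables: h[u][c] = m(uc), p_tab[u][c] = p(uc).
h = {u: {c: min_pop(u + c)[0] for c in alphabet} for u in trie}
p_tab = {u: {c: min_pop(u + c)[1] for c in alphabet} for u in trie}

def tokenize(word, unk="<unk>"):
    """Single pass over word (Lemma 3), with a trailing whitespace (Lemma 2).
    Assumes word's characters are in the vocabulary alphabet."""
    node, tokens = "", []
    for c in word + " ":
        if node is None:              # m became ∅: the word is untokenizable
            return [unk]
        tokens += p_tab[node][c]
        node = h[node][c]
    return tokens if node is not None else [unk]

print(tokenize("abcdz"))              # ['a', '##b', '##c', '##dz']
```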

A.6 Precomputation and its Correctness
Algorithm 2 precomputes failure links $f(\cdot)$ and failure pops $F(\cdot)$ based on the following lemma.
Lemma 5. The following process correctly computes $f(v)$, $F(v)$ for any trie node $v$. If $v \in \{t(\epsilon), t(\sharp)\}$, then $f(v) = \emptyset$ and $F(v) = [\,]$ (Definition 8). Otherwise, let $u$ be the parent of $v$ and $c$ be the label from $u$ to $v$ (i.e., $\delta(u, c) = v$), and let $w_v$ denote the string represented by $v$; we have:

$$f(v),\ F(v) = \begin{cases} t(\sharp),\ [w_v] & \text{if } w_v \in \mathcal{V}, \\ h(f(u), c),\ F(u) \oplus p(f(u), c) & \text{otherwise.} \end{cases}$$

A.7 LinMaxMatch as a Finite-State Transducer
In Section 3.6 we discussed that LinMaxMatch can be turned into a finite-state transducer (FST) by precomputing the transition function $\delta'(u, c)$ and the output function $\sigma'(u, c)$ to eliminate the failure transitions. As aforementioned in Section A.3, $\delta'(u, c)$ and $\sigma'(u, c)$ are essentially One-Step MinPop Matching:

$$\delta'(u, c) = h(u, c), \qquad \sigma'(u, c) = p(u, c).$$

If we precompute and store $\delta'(u, c)$ and $\sigma'(u, c)$, i.e., $h(u, c)$ and $p(u, c)$, Algorithm 1 can be rewritten as Algorithm 4 (according to Lemma 3), where the differences are lines 8-11 in bold. In Algorithm 4, we can see that the failure transitions are eliminated and LinMaxMatch works as an FST. The time complexity is trivially linear.
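To make Lemma 5's precomputation concrete, the following is a hedged Python sketch (not the paper's implementation): failure links and failure pops are filled in for every trie node, then tokenization follows goto edges and failure transitions. The paper's Algorithm 2 visits nodes by BFS; here we instead sort nodes by Definition-2 length, one simple way to guarantee that each node's parent and failure chain are ready when needed. The vocabulary and identifiers are illustrative:

```python
SUF, UNK = "##", "<unk>"
vocab = {"a", "abcdx", "##b", "##c", "##cdy", "##dz"}

# Trie over the vocabulary; nodes are identified by the string they represent.
goto, nodes = {}, {""}
for tok in vocab:
    for i in range(len(tok)):
        goto.setdefault(tok[:i], {})[tok[i]] = tok[:i + 1]
        nodes.add(tok[:i + 1])

def length(w):  # string length per Definition 2 (leading ## not counted)
    return len(w) - len(SUF) if w.startswith(SUF) else len(w)

# Precompute failure links f and failure pops F. The root and the ## node
# get f = ∅ per Lemma 5; "#" is seeded too since its chain is empty anyway.
f = {"": None, "#": None, SUF: None}
F = {"": [], "#": [], SUF: []}
for v in sorted(nodes - set(f), key=length):
    u, c = v[:-1], v[-1]          # parent and incoming edge label
    if v in vocab:
        f[v], F[v] = SUF, [v]     # pop the whole token, resume at the ## node
    else:
        # f(v) = h(f(u), c), F(v) = F(u) ⊕ p(f(u), c): follow u's failure
        # chain until a node with an outgoing c-edge is found.
        z, pops = f[u], list(F[u])
        while z is not None and c not in goto.get(z, {}):
            pops += F[z]
            z = f[z]
        f[v], F[v] = (goto[z][c], pops) if z is not None else (None, [])

def tokenize(word):
    """One word -> wordpieces, via goto edges and failure transitions."""
    u, tokens = "", []
    for c in word:
        while c not in goto.get(u, {}):
            if f[u] is None:      # no failure transition: untokenizable
                return [UNK]
            tokens += F[u]
            u = f[u]
        u = goto[u][c]
    while u not in ("", SUF):     # end of word: flush the final node
        if f[u] is None:
            return [UNK]
        tokens += F[u]
        u = f[u]
    return tokens

print(tokenize("abcdz"))          # ['a', '##b', '##c', '##dz']
```

With this vocabulary, the node for abcd ends up with failure pops [a, ##b] and a failure link to the node for ##cd, matching Example 8's tables. Each character is consumed once and each popped token is emitted once, which is the source of the linear bound.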