Alex Salcianu
2021
Fast WordPiece Tokenization
Xinying Song | Alex Salcianu | Yang Song | Dave Dopson | Denny Zhou
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Tokenization is a fundamental preprocessing step for almost all NLP tasks. In this paper, we propose efficient algorithms for the WordPiece tokenization used in BERT, from single-word tokenization to general text (e.g., sentence) tokenization. When tokenizing a single word, WordPiece uses a longest-match-first strategy, known as maximum matching. The best known algorithms so far are O(n^2) (where n is the input length) or O(nm) (where m is the maximum vocabulary token length). We propose a novel algorithm whose tokenization complexity is strictly O(n). Our method is inspired by the Aho-Corasick algorithm. We introduce additional linkages on top of the trie built from the vocabulary, allowing smart transitions when the trie matching cannot continue. For general text, we further propose an algorithm that combines pre-tokenization (splitting the text into words) and our linear-time WordPiece method into a single pass. Experimental results show that our method is 8.2x faster than HuggingFace Tokenizers and 5.1x faster than TensorFlow Text on average for general text tokenization.
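For orientation, the longest-match-first (maximum matching) strategy mentioned in the abstract can be sketched as follows. This is a minimal illustration of the quadratic baseline, not the linear-time algorithm the paper proposes. The function name `max_match_wordpiece`, the toy vocabulary, and the parameter names are hypothetical; the `##` continuation prefix and the `[UNK]` token follow the convention used by BERT vocabularies.

```python
def max_match_wordpiece(word, vocab, unk_token="[UNK]", suffix_prefix="##"):
    """Greedy longest-match-first tokenization of a single word.

    Baseline sketch only: at each position it tries the longest remaining
    substring first and shrinks it until a vocabulary entry is found, which
    gives quadratic behavior in the worst case.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            piece = word[start:end]
            if start > 0:
                # Pieces that do not begin the word carry the continuation prefix.
                piece = suffix_prefix + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk_token]
        tokens.append(match)
        start = end
    return tokens


# Toy vocabulary, invented for illustration.
toy_vocab = {"un", "##aff", "##able", "a", "##a", "##b"}
print(max_match_wordpiece("unaffable", toy_vocab))
print(max_match_wordpiece("unx", toy_vocab))
```

The inner loop is what makes this baseline expensive: each failed candidate is discarded and a shorter one is retried from scratch. According to the abstract, the paper avoids this by adding extra linkages to the vocabulary trie, in the spirit of Aho-Corasick.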
2017
Natural Language Processing with Small Feed-Forward Networks
Jan A. Botha | Emily Pitler | Ji Ma | Anton Bakalov | Alex Salcianu | David Weiss | Ryan McDonald | Slav Petrov
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
We show that small and shallow feed-forward neural networks can achieve near state-of-the-art results on a range of unstructured and structured language processing tasks while being considerably cheaper in memory and computational requirements than deep recurrent models. Motivated by resource-constrained environments like mobile phones, we showcase simple techniques for obtaining such small neural network models, and investigate different tradeoffs when deciding how to allocate a small memory budget.
Co-authors
- Jan A. Botha 1
- Emily Pitler 1
- Ji Ma 1
- Anton Bakalov 1
- David Weiss 1