A Formal Perspective on Byte-Pair Encoding

Vilém Zouhar, Clara Meister, Juan Gastaldi, Li Du, Tim Vieira, Mrinmaya Sachan, Ryan Cotterell


Abstract
Byte-Pair Encoding (BPE) is a popular algorithm used for tokenizing data in NLP, despite being devised initially as a compression method.BPE appears to be a greedy algorithm at face value, but the underlying optimization problem that BPE seeks to solve has not yet been laid down. We formalize BPE as a combinatorial optimization problem. Via submodular functions, we prove that the iterative greedy version is a 1/sigma*(1-e(-sigma))-approximation of an optimal merge sequence, where sigma is the total backward curvature with respect to the optimal merge sequence. Empirically the lower bound of the approximation is approx0.37.We provide a faster implementation of BPE which improves the runtime complexity from O(NM) to O(N log M), where N is the sequence length and M is the merge count. Finally, we optimize the brute-force algorithm for optimal BPE using memoization.
Anthology ID:
2023.findings-acl.38
Original:
2023.findings-acl.38v1
Version 2:
2023.findings-acl.38v2
Volume:
Findings of the Association for Computational Linguistics: ACL 2023
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
Findings
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
598–614
Language:
URL:
https://aclanthology.org/2023.findings-acl.38
DOI:
10.18653/v1/2023.findings-acl.38
Bibkey:
Cite (ACL):
Vilém Zouhar, Clara Meister, Juan Gastaldi, Li Du, Tim Vieira, Mrinmaya Sachan, and Ryan Cotterell. 2023. A Formal Perspective on Byte-Pair Encoding. In Findings of the Association for Computational Linguistics: ACL 2023, pages 598–614, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
A Formal Perspective on Byte-Pair Encoding (Zouhar et al., Findings 2023)
Copy Citation:
PDF:
https://aclanthology.org/2023.findings-acl.38.pdf
Video:
 https://aclanthology.org/2023.findings-acl.38.mp4