Should you marginalize over possible tokenizations?

Nadezhda Chirkova, Germán Kruszewski, Jos Rozen, Marc Dymetman


Abstract
Autoregressive language models (LMs) map token sequences to probabilities. The usual practice for computing the probability of a character string (e.g., an English sentence) is to first transform it into a single sequence of tokens, which is then scored by the model. However, exponentially many token sequences can represent any given string. To truly compute the probability of a string, one should marginalize over all of its tokenizations, which is typically intractable. Here, we analyze whether the practice of ignoring the marginalization is justified. To this end, we devise an importance-sampling-based algorithm that lets us estimate the marginal probabilities, and we compare these estimates to the default procedure across a range of state-of-the-art models and datasets. Our results show that the gap in log-likelihood is no larger than 0.5% in most cases, but that it becomes more pronounced for data with long, complex words.
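To make the distinction between the default score and the marginal concrete, here is a minimal Python sketch on a toy example. It is not the authors' importance-sampling algorithm: it enumerates every tokenization by brute force, which is only feasible for very short strings. The vocabulary, the probabilities in `TOKEN_PROB`, the independent-token scoring in `sequence_prob`, and the longest-match `greedy_tokenize` used as a stand-in for a default tokenizer are all hypothetical and chosen only for illustration; the toy scores are not a normalized distribution.

```python
import math

# Hypothetical toy vocabulary with made-up token probabilities (not from the paper).
TOKEN_PROB = {
    "undo": 0.05, "un": 0.10, "do": 0.10,
    "u": 0.02, "n": 0.02, "d": 0.02, "o": 0.02,
}


def tokenizations(text, vocab):
    """Yield every way to split `text` into tokens that all belong to `vocab`."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocab:
            for rest in tokenizations(text[end:], vocab):
                yield [head] + rest


def sequence_prob(tokens):
    """Toy stand-in for an LM score: product of independent token probabilities."""
    return math.prod(TOKEN_PROB[t] for t in tokens)


def greedy_tokenize(text, vocab):
    """Longest-match-first split, assumed here as a stand-in for a default tokenizer."""
    tokens = []
    while text:
        end = max(e for e in range(1, len(text) + 1) if text[:e] in vocab)
        tokens.append(text[:end])
        text = text[end:]
    return tokens


all_seqs = list(tokenizations("undo", TOKEN_PROB))

# Marginal probability: sum over every token sequence that spells the string.
marginal = sum(sequence_prob(seq) for seq in all_seqs)

# Default practice: score only the single sequence produced by the tokenizer.
default = sequence_prob(greedy_tokenize("undo", TOKEN_PROB))

# The marginal includes the default sequence as one term, so the gap is never negative.
gap = math.log(marginal) - math.log(default)
print(f"{len(all_seqs)} tokenizations, log-likelihood gap = {gap:.4f}")
```

Because the number of tokenizations grows exponentially with string length, this enumeration does not scale, which is why the abstract turns to sampling-based estimates instead.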
Anthology ID:
2023.acl-short.1
Volume:
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
1–12
URL:
https://aclanthology.org/2023.acl-short.1
DOI:
10.18653/v1/2023.acl-short.1
Cite (ACL):
Nadezhda Chirkova, Germán Kruszewski, Jos Rozen, and Marc Dymetman. 2023. Should you marginalize over possible tokenizations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1–12, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Should you marginalize over possible tokenizations? (Chirkova et al., ACL 2023)
PDF:
https://aclanthology.org/2023.acl-short.1.pdf
Video:
https://aclanthology.org/2023.acl-short.1.mp4