Tokenization via Language Modeling: the Role of Preceding Text

Rastislav Hronsky, Emmanuel Keuleers


Abstract
While language models benefit immensely from their capacity to model a large context (i.e., the sequence of preceding tokens), the role of context is unclear in text tokenization, which is, in many cases, language-model-driven to begin with. In this paper, we explore the role of preceding text in three different writing systems and with three different tokenization strategies (word-based, Morfessor, and BPE). In the first experiment, we examined how the size of the context used for predicting the next token affects the ranking of the segmentation strategies in terms of language model surprisal. The effect was highly writing-system-specific: minimal in English, but rank-reversing with increased context size and token granularity in Turkish and Chinese. In the second experiment, we examined how context alters segmentation hypotheses when language models are used to identify word boundaries. Here the effect was subtle: using context-aware rather than context-free segment scores improved boundary recognition accuracy by up to 0.5%, once baseline effects were exploited.
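
The first experiment can be illustrated with a minimal sketch (hypothetical Python, not the authors' code or data): a count-based language model of order k is fit over a given tokenization, and the mean per-token surprisal of a sequence is computed while varying k, the number of preceding tokens visible to the model. The toy corpus, the two tokenizations, and the add-one smoothing below are placeholder assumptions; the paper evaluates separate models per tokenization strategy, and a real comparison across strategies would also have to aggregate surprisal on a comparable basis, since token counts differ between segmentations.

from collections import Counter
from math import log2

def ngram_counts(tokens, k):
    # Count (context, token) pairs, where the context is the k preceding tokens.
    ctx_counts, pair_counts = Counter(), Counter()
    padded = ["<s>"] * k + tokens
    for i in range(k, len(padded)):
        ctx = tuple(padded[i - k:i])
        ctx_counts[ctx] += 1
        pair_counts[(ctx, padded[i])] += 1
    return ctx_counts, pair_counts

def mean_surprisal(train_tokens, test_tokens, k, vocab_size):
    # Mean -log2 P(token | k preceding tokens), with add-one smoothing.
    ctx_counts, pair_counts = ngram_counts(train_tokens, k)
    padded = ["<s>"] * k + test_tokens
    total = 0.0
    for i in range(k, len(padded)):
        ctx = tuple(padded[i - k:i])
        p = (pair_counts[(ctx, padded[i])] + 1) / (ctx_counts[ctx] + vocab_size)
        total += -log2(p)
    return total / len(test_tokens)

# Hypothetical tokenizations of the same text under two strategies.
word_tokens = "the cats sat on the mats".split()
subword_tokens = "the cat s sat on the mat s".split()

for name, toks in [("word", word_tokens), ("subword", subword_tokens)]:
    v = len(set(toks)) + 1  # +1 for the <s> padding symbol
    for k in (0, 1, 2):     # k = 0 is context-free; larger k uses more preceding text
        print(f"{name:8s} k={k}  mean surprisal = {mean_surprisal(toks, toks, k, v):.3f}")

The same machinery carries over to the second experiment in spirit: a context-free score for a candidate segment corresponds to k = 0, while a context-aware score conditions the segment's probability on the preceding segments.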
Anthology ID:
2024.cawl-1.4
Volume:
Proceedings of the Second Workshop on Computation and Written Language (CAWL) @ LREC-COLING 2024
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Kyle Gorman, Emily Prud'hommeaux, Brian Roark, Richard Sproat
Venues:
CAWL | WS
SIG:
SIGWrit
Publisher:
ELRA and ICCL
Pages:
23–35
URL:
https://aclanthology.org/2024.cawl-1.4
Cite (ACL):
Rastislav Hronsky and Emmanuel Keuleers. 2024. Tokenization via Language Modeling: the Role of Preceding Text. In Proceedings of the Second Workshop on Computation and Written Language (CAWL) @ LREC-COLING 2024, pages 23–35, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Tokenization via Language Modeling: the Role of Preceding Text (Hronsky & Keuleers, CAWL-WS 2024)
PDF:
https://aclanthology.org/2024.cawl-1.4.pdf