Interpreting Topic Models in Byte-Pair Encoding Space

Jia Peng Lim, Hady Lauw


Abstract
Byte-pair encoding (BPE) is pivotal for processing text into chunk-size tokens, particularly in Large Language Models (LLMs). From a topic modeling perspective, as these chunk-size tokens might be mere parts of valid words, evaluating and interpreting them for coherence is challenging. Most, if not all, coherence evaluation measures are incompatible, as they benchmark using valid words. We propose to interpret the recovery of valid words from these tokens as a ranking problem and present a model-agnostic, training-free recovery approach that maps the topic-token distribution onto a selected vocabulary space, after which existing evaluation measures can be applied. Results show that topic sets recovered from the BPE vocabulary space are coherent.
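As a rough illustration of the ranking view described in the abstract (not the paper's actual recovery function), one could score each valid word in a chosen vocabulary by aggregating the topic probabilities of its BPE subtokens and rank words by that score. In the sketch below, the toy vocabulary, the precomputed segmentations, and the mean-probability scoring are all illustrative assumptions; a real pipeline would obtain segmentations from the model's tokenizer.

```python
from typing import Dict, List, Tuple

def recover_topic_words(
    topic_token_probs: Dict[str, float],
    vocabulary: Dict[str, List[str]],
    top_n: int = 10,
) -> List[Tuple[str, float]]:
    """Rank valid words by an aggregate score over their BPE subtokens.

    topic_token_probs: a topic's distribution over BPE tokens.
    vocabulary: maps each valid word to its BPE segmentation
                (assumed precomputed; in practice from a tokenizer).
    Scoring here is the mean subtoken probability -- a simple
    illustrative choice, not the paper's recovery approach.
    """
    scores = []
    for word, subtokens in vocabulary.items():
        probs = [topic_token_probs.get(tok, 0.0) for tok in subtokens]
        scores.append((word, sum(probs) / len(probs)))
    # Highest-scoring words first: the "recovered" topic set.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]

# Toy usage: BPE tokens weighted under a "science" topic.
topic = {"bi": 0.12, "ology": 0.10, "chem": 0.09, "istry": 0.08, "the": 0.01}
vocab = {
    "biology": ["bi", "ology"],
    "chemistry": ["chem", "istry"],
    "the": ["the"],
}
print(recover_topic_words(topic, vocab, top_n=2))
# -> [('biology', 0.11), ('chemistry', 0.085)]
```

Once words are recovered into a valid-word vocabulary space this way, standard coherence measures (e.g., NPMI-based ones) can be applied to the ranked word sets, which is the point the abstract makes.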
Anthology ID:
2025.coling-main.720
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
10810–10838
URL:
https://aclanthology.org/2025.coling-main.720/
Cite (ACL):
Jia Peng Lim and Hady Lauw. 2025. Interpreting Topic Models in Byte-Pair Encoding Space. In Proceedings of the 31st International Conference on Computational Linguistics, pages 10810–10838, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Interpreting Topic Models in Byte-Pair Encoding Space (Lim & Lauw, COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.720.pdf