Unlike “Likely”, “Unlike” is Unlikely: BPE-based Segmentation hurts Morphological Derivations in LLMs

Paul Lerner, François Yvon


Abstract
Large Language Models (LLMs) rely on subword vocabularies to process and generate text. However, because subwords are marked as either word-initial or intra-word, we find that LLMs perform poorly at handling some types of affixation, which hinders their ability to generate novel (unobserved) word forms. The largest models trained on enough data can mitigate this tendency because their initial- and intra-word embeddings are aligned; in-context learning also helps when all examples are selected in a consistent way; but only morphological segmentation achieves near-perfect accuracy.
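To make the segmentation issue concrete, here is a minimal sketch; it assumes the Hugging Face transformers library and the GPT-2 byte-level BPE tokenizer (neither is prescribed by the paper), whose word-initial subwords carry a leading "Ġ". The same character string is a different vocabulary entry, with a different embedding, depending on whether or not it starts a word:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# "like" exists twice in the vocabulary: word-initially (with the "Ġ"
# space marker) and intra-word, with two distinct IDs and embeddings.
vocab = tokenizer.get_vocab()
print(vocab["Ġlike"], vocab["like"])

# How a derived form is split depends on the learned BPE merges, so a
# prefixed form such as " unlike" need not reuse the tokens of " like".
for word in [" like", " likely", " unlike", " unlikely"]:
    print(repr(word), "->", tokenizer.tokenize(word))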
Anthology ID:
2025.coling-main.348
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
5181–5190
URL:
https://aclanthology.org/2025.coling-main.348/
Cite (ACL):
Paul Lerner and François Yvon. 2025. Unlike “Likely”, “Unlike” is Unlikely: BPE-based Segmentation hurts Morphological Derivations in LLMs. In Proceedings of the 31st International Conference on Computational Linguistics, pages 5181–5190, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Unlike “Likely”, “Unlike” is Unlikely: BPE-based Segmentation hurts Morphological Derivations in LLMs (Lerner & Yvon, COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.348.pdf