LLM Neologism: Emergence of Mutated Characters due to Byte Encoding

Ran Iwamoto, Hiroshi Kanayama


Abstract
The process of language generation, which selects the most probable tokens one by one, may intrinsically result in output strings that humans never utter. We name this phenomenon “LLM neologism” and investigate it in Japanese, Chinese, and Korean, where tokens can be smaller than characters. Our findings show that LLM neologism occurs through the combination of two high-frequency words with common tokens. We also clarify the cause of LLM neologism in the tokenization process with limited vocabularies. The results of this study provide important clues for better encoding of multibyte characters, aiming to prevent catastrophic results in AI-generated documents.
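The abstract's point that tokens can be smaller than characters can be illustrated at the byte level. The following is a minimal sketch, not the authors' method: it assumes plain UTF-8 encoding and uses a hypothetical helper, `splice`, to join the leading bytes of one character with the trailing bytes of another. The two example characters are chosen here only because they share the same first byte; they are not taken from the paper.

```python
def utf8_bytes(ch):
    """Return the UTF-8 byte values of a string as a list of ints."""
    return list(ch.encode("utf-8"))


def splice(first, second, keep):
    """Hypothetical helper: keep the first `keep` bytes of `first`
    and take the remaining bytes from `second`, then decode as UTF-8.
    Invalid sequences are replaced rather than raising an error."""
    a = first.encode("utf-8")
    b = second.encode("utf-8")
    mixed = a[:keep] + b[keep:]
    return mixed.decode("utf-8", errors="replace")


# Two common three-byte characters that share the lead byte 0xE6.
a, b = "日", "本"
mutated = splice(a, b, keep=2)

for label, ch in (("first", a), ("second", b), ("spliced", mutated)):
    hex_bytes = " ".join(f"{x:02X}" for x in utf8_bytes(ch))
    print(f"{label:8s} U+{ord(ch):04X}  bytes: {hex_bytes}")
```

The spliced byte sequence is still valid UTF-8, but it decodes to a code point that matches neither input. If a model emits byte-level tokens from two different characters in sequence, a similar mixture could in principle appear in its output.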
Anthology ID:
2024.inlg-main.3
Volume:
Proceedings of the 17th International Natural Language Generation Conference
Month:
September
Year:
2024
Address:
Tokyo, Japan
Editors:
Saad Mahamood, Nguyen Le Minh, Daphne Ippolito
Venue:
INLG
SIG:
SIGGEN
Publisher:
Association for Computational Linguistics
Pages:
24–29
URL:
https://aclanthology.org/2024.inlg-main.3
Cite (ACL):
Ran Iwamoto and Hiroshi Kanayama. 2024. LLM Neologism: Emergence of Mutated Characters due to Byte Encoding. In Proceedings of the 17th International Natural Language Generation Conference, pages 24–29, Tokyo, Japan. Association for Computational Linguistics.
Cite (Informal):
LLM Neologism: Emergence of Mutated Characters due to Byte Encoding (Iwamoto & Kanayama, INLG 2024)
PDF:
https://aclanthology.org/2024.inlg-main.3.pdf