Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning

Zhu Xu; Zhiqiang Zhao; Zihan Zhang; Yuchi Liu; Quanwei Shen; Fei Liu; Yu Kuang; Jian He; Conglin Liu

doi:10.18653/v1/2025.acl-long.194

Abstract

Tokenization methods like Byte-Pair Encoding (BPE) enhance computational efficiency in large language models (LLMs) but often obscure internal character structures within tokens. This limitation hinders LLMs’ ability to predict precise character positions, which is crucial in tasks like Chinese Spelling Correction (CSC) where identifying the positions of misspelled characters accelerates correction processes. We propose Token Internal Position Awareness (TIPA), a method that significantly improves models’ ability to capture character positions within tokens by training them on reverse character prediction tasks using the tokenizer’s vocabulary. Experiments demonstrate that TIPA enhances position prediction accuracy in LLMs, enabling more precise identification of target characters in original text. Furthermore, when applied to downstream tasks that do not require exact position prediction, TIPA still boosts performance in tasks needing character-level information, validating its versatility and effectiveness.
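The abstract describes training on reverse character prediction tasks built from the tokenizer's vocabulary. As a rough illustration only, the sketch below builds one training pair per vocabulary token, where the target lists each character with its 1-based position, from the last character to the first. The JSON target format, the function names `reverse_position_target` and `build_examples`, and the toy vocabulary are assumptions made for this example; the abstract does not specify them.

```python
import json


def reverse_position_target(token: str) -> str:
    """Return a JSON string mapping 1-based positions to characters,
    ordered from the last character of the token to the first.

    The JSON layout is an assumption for illustration, not a format
    stated in the abstract.
    """
    n = len(token)
    # Walk the token backwards so the highest position is emitted first.
    mapping = {str(n - i): ch for i, ch in enumerate(reversed(token))}
    return json.dumps(mapping, ensure_ascii=False)


def build_examples(vocab):
    """Build one (input, target) pair per non-empty token in the vocabulary."""
    examples = []
    for token in vocab:
        if not token:
            continue  # skip empty strings, which carry no characters
        examples.append(
            {"input": token, "target": reverse_position_target(token)}
        )
    return examples


# Hypothetical toy vocabulary; a real run would iterate over the
# tokenizer's actual vocabulary entries instead.
toy_vocab = ["小说", "ing", "a"]
examples = build_examples(toy_vocab)

for example in examples:
    print(example["input"], "->", example["target"])
```

Emitting the highest position first means the target begins with the token's length, so the character positions are tied to explicit indices rather than left implicit in the output order.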

Anthology ID: 2025.acl-long.194
Original: 2025.acl-long.194v1
Version 2: 2025.acl-long.194v2
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 3839–3853
URL: https://aclanthology.org/2025.acl-long.194/
DOI: 10.18653/v1/2025.acl-long.194
Cite (ACL): Zhu Xu, Zhiqiang Zhao, Zihan Zhang, Yuchi Liu, Quanwei Shen, Fei Liu, Yu Kuang, Jian He, and Conglin Liu. 2025. Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3839–3853, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning (Xu et al., ACL 2025)
PDF: https://aclanthology.org/2025.acl-long.194.pdf
