Empowering Character-level Text Infilling by Eliminating Sub-Tokens

Houxing Ren, Mingjie Zhan, Zhongyuan Wu, Hongsheng Li


Abstract
In infilling tasks, sub-tokens, i.e., instances where a complete token is split into two parts, often emerge at the boundaries between prefixes, middles, and suffixes. Traditional methods train models purely at the token level, which leads to sub-optimal performance on character-level infilling tasks at inference time. Alternatively, some approaches target character-level infilling but rely on predicting sub-tokens during inference; this strategy degrades character-level infilling performance because the model assigns high perplexity to sub-tokens. In this paper, we introduce FIM-SE, which stands for Fill-In-the-Middle with both Starting and Ending character constraints. The proposed method addresses character-level infilling by adopting a line-level format that avoids predicting any sub-token at inference time. In addition, we incorporate two special tokens to signify the remainders of the incomplete lines, thereby providing stronger generation guidance. Extensive experiments demonstrate that our approach surpasses previous methods by a significant margin. Code is available at https://github.com/SenseLLM/FIM-SE.
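To illustrate the idea described in the abstract, below is a minimal sketch of a line-level prompt builder in the spirit of FIM-SE: the character-level hole is pushed back and forward to line boundaries so the model only ever sees and generates complete lines, while the leftover partial-line characters are exposed as explicit start and end constraints. The sentinel token names and the function are illustrative placeholders, not the vocabulary or API actually used in the paper or its released code.

```python
# Illustrative sketch only; sentinel strings are placeholders, not the paper's tokens.
L_PREFIX = "<|fim_prefix|>"   # complete lines before the hole
R_SUFFIX = "<|fim_suffix|>"   # complete lines after the hole
F_HEAD   = "<|first_head|>"   # trailing characters of the last (incomplete) prefix line
L_TAIL   = "<|last_tail|>"    # leading characters of the first (incomplete) suffix line
MIDDLE   = "<|fim_middle|>"   # generation starts here

def build_prompt(text: str, hole_start: int, hole_end: int) -> str:
    """Split `text` around the character-level hole [hole_start, hole_end) and
    rearrange it so the model never has to emit a sub-token: cut points are
    moved to line boundaries, and the leftover partial-line characters become
    explicit starting/ending constraints on the generated middle."""
    prefix, suffix = text[:hole_start], text[hole_end:]

    # Everything after the last newline in the prefix is an incomplete line.
    cut = prefix.rfind("\n") + 1
    complete_prefix, head_constraint = prefix[:cut], prefix[cut:]

    # Everything before the first newline in the suffix is an incomplete line.
    cut = suffix.find("\n")
    if cut == -1:
        tail_constraint, complete_suffix = suffix, ""
    else:
        tail_constraint, complete_suffix = suffix[:cut], suffix[cut:]

    # The model generates whole lines that must start with `head_constraint`
    # and end with `tail_constraint`; those fragments are stripped afterwards.
    return (
        f"{L_PREFIX}{complete_prefix}"
        f"{R_SUFFIX}{complete_suffix}"
        f"{F_HEAD}{head_constraint}"
        f"{L_TAIL}{tail_constraint}"
        f"{MIDDLE}"
    )
```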
Anthology ID:
2024.acl-long.179
Volume:
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
3253–3267
URL:
https://aclanthology.org/2024.acl-long.179
Cite (ACL):
Houxing Ren, Mingjie Zhan, Zhongyuan Wu, and Hongsheng Li. 2024. Empowering Character-level Text Infilling by Eliminating Sub-Tokens. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3253–3267, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Empowering Character-level Text Infilling by Eliminating Sub-Tokens (Ren et al., ACL 2024)
PDF:
https://aclanthology.org/2024.acl-long.179.pdf