Multi-Modal Multi-Granularity Tokenizer for Chu Bamboo Slips

Yingfa Chen; Chenlong Hu; Cong Feng; Chenyang Song; Shi Yu (于是); Xu Han (韩旭); Zhiyuan Liu; Maosong Sun

Abstract

This study presents a multi-modal multi-granularity tokenizer designed for analyzing ancient Chinese scripts, focusing on the Chu bamboo slip (CBS) script used during the Spring and Autumn and Warring States periods (771-256 BCE) in ancient China. Because ancient Chinese scripts have a complex hierarchical structure, in which a single character may combine multiple sub-characters, our tokenizer first applies character detection to locate character boundaries and then performs character recognition at both the character and sub-character levels. To support the academic community, we also assembled the first large-scale dataset of CBSs, with over 100K annotated character image scans. On a part-of-speech tagging task built on our dataset, our tokenizer yields a 5.5% relative improvement in F1-score over mainstream sub-word tokenizers. Our work not only aids further investigation of this script but also has the potential to advance research on other ancient Chinese scripts.
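
As a rough illustration of what "multi-granularity" could mean in practice, the following Python sketch combines character-level and sub-character-level recognition results into one token sequence. It assumes character detection has already produced one recognition result per character image. The `Recognition` class, the `tokenize` function, the confidence threshold, and the rule of falling back to sub-character tokens when the character-level prediction is uncertain are all assumptions made for illustration; the abstract does not specify how the two levels are combined, and the example values are made up.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Recognition:
    """Hypothetical recognizer output for one detected character image."""
    char: str              # predicted whole-character label
    char_conf: float       # confidence of that prediction, in [0, 1]
    components: List[str]  # predicted sub-character components (may be empty)


def tokenize(recognitions: List[Recognition],
             threshold: float = 0.5) -> List[Tuple[str, str]]:
    """Emit one character token when confident, else sub-character tokens.

    The fallback rule and the threshold are illustrative assumptions,
    not the method described in the paper.
    """
    tokens: List[Tuple[str, str]] = []
    for rec in recognitions:
        if rec.char_conf >= threshold or not rec.components:
            # Confident (or nothing to fall back to): keep the whole character.
            tokens.append(("char", rec.char))
        else:
            # Uncertain: represent the character by its components instead.
            tokens.extend(("sub", component) for component in rec.components)
    return tokens


# Made-up recognition results, for illustration only.
slip = [
    Recognition(char="之", char_conf=0.93, components=[]),
    Recognition(char="?", char_conf=0.21, components=["水", "口"]),
]
print(tokenize(slip))
```

Under this assumed design, a rare or damaged character that the recognizer cannot label confidently still contributes information through its components instead of collapsing into a single unknown token.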

Anthology ID: 2025.coling-main.414
Volume: Proceedings of the 31st International Conference on Computational Linguistics
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue: COLING
Publisher: Association for Computational Linguistics
Pages: 6201–6211
URL: https://aclanthology.org/2025.coling-main.414/
Cite (ACL): Yingfa Chen, Chenlong Hu, Cong Feng, Chenyang Song, Shi Yu, Xu Han, Zhiyuan Liu, and Maosong Sun. 2025. Multi-Modal Multi-Granularity Tokenizer for Chu Bamboo Slips. In Proceedings of the 31st International Conference on Computational Linguistics, pages 6201–6211, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal): Multi-Modal Multi-Granularity Tokenizer for Chu Bamboo Slips (Chen et al., COLING 2025)
PDF: https://aclanthology.org/2025.coling-main.414.pdf
