Speculating LLMs’ Chinese Training Data Pollution from Their Tokens

Qingjie Zhang; Di Wang; Haoting Qian; Liu Yan; Tianwei Zhang; Ke Xu; Qi Li; Minlie Huang; Hewu Li; Han Qiu

doi:10.18653/v1/2025.emnlp-main.1327

Abstract

Tokens are the basic elements of the datasets used for LLM training. It is well known that many tokens representing Chinese phrases in the vocabulary of GPT (4o/4o-mini/o1/o3/4.5/4.1/o4-mini) indicate content such as pornography or online gambling. Based on this observation, our goal is to locate Polluted Chinese (PoC) tokens in LLMs and study the relationship between the existence of PoC tokens and the training data. (1) We give a formal definition and taxonomy of PoC tokens based on GPT’s vocabulary. (2) We build a PoC token detector by fine-tuning an LLM to label PoC tokens in vocabularies, considering both each token’s semantics and related content from search engines. (3) We speculate on training data pollution from PoC tokens’ appearances (token IDs). Experiments on GPT and 23 other LLMs indicate that PoC tokens widely exist, while GPT’s vocabulary behaves the worst: more than 23% of long Chinese tokens (i.e., tokens with more than two Chinese characters) relate to either pornography or online gambling. We validate the accuracy of our speculation method on well-known pre-training datasets such as C4 and the Pile. Then, for GPT-4o, we speculate that the ratio of “波*野结衣”-related webpages in GPT-4o’s training data is around 0.5%.
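The abstract's notion of a "long Chinese token" (a token with more than two Chinese characters) can be illustrated with a minimal sketch. This is not the authors' PoC detector, which relies on a fine-tuned LLM and search-engine content; it only shows the length filter that would precede such labeling. The function names, the toy vocabulary, and the choice of the basic CJK Unified Ideographs block as the definition of "Chinese character" are all assumptions made for illustration.

```python
import re

# Assumption: a "Chinese character" is approximated by the basic CJK Unified
# Ideographs block (U+4E00 to U+9FFF). The paper may use a different range.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")


def count_chinese_chars(token: str) -> int:
    """Count the characters in `token` that fall in the assumed CJK range."""
    return len(CJK_CHAR.findall(token))


def is_long_chinese_token(token: str, min_chars: int = 3) -> bool:
    """True if the token has more than two Chinese characters.

    Assumption: non-Chinese characters in the token are ignored rather than
    disqualifying it; the abstract does not specify this detail.
    """
    return count_chinese_chars(token) >= min_chars


def long_chinese_tokens(vocab):
    """Return the long Chinese tokens of a vocabulary, preserving order."""
    return [t for t in vocab if is_long_chinese_token(t)]


# Hypothetical toy vocabulary, not taken from any real tokenizer.
toy_vocab = ["hello", "中国", "人工智能", " 的", "机器学习", "abc中文"]
print(long_chinese_tokens(toy_vocab))
```

In practice the input would be the decoded token strings of a real tokenizer vocabulary, and the resulting candidates would then be passed to a classifier to decide which ones are PoC tokens.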

Anthology ID: 2025.emnlp-main.1327
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 26113–26133
URL: https://aclanthology.org/2025.emnlp-main.1327/
DOI: 10.18653/v1/2025.emnlp-main.1327
Cite (ACL): Qingjie Zhang, Di Wang, Haoting Qian, Liu Yan, Tianwei Zhang, Ke Xu, Qi Li, Minlie Huang, Hewu Li, and Han Qiu. 2025. Speculating LLMs’ Chinese Training Data Pollution from Their Tokens. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 26113–26133, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Speculating LLMs’ Chinese Training Data Pollution from Their Tokens (Zhang et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-main.1327.pdf
Checklist: 2025.emnlp-main.1327.checklist.pdf
