Lost in Pronunciation: Detecting Chinese Offensive Language Disguised by Phonetic Cloaking Replacement

Haotan Guo; Jianfei He; Jiayuan Ma; Hongbin Na; Zimu Wang; Haiyang Zhang; Qi Chen; Wei Wang; Zijing Shi; Tao Shen; Ling Chen

doi:10.18653/v1/2025.emnlp-industry.172

Lost in Pronunciation: Detecting Chinese Offensive Language Disguised by Phonetic Cloaking Replacement

Haotan Guo, Jianfei He, Jiayuan Ma, Hongbin Na, Zimu Wang, Haiyang Zhang, Qi Chen, Wei Wang, Zijing Shi, Tao Shen, Ling Chen

Abstract

Phonetic Cloaking Replacement (PCR), defined as the deliberate use of homophonic or near-homophonic variants to hide toxic intent, has become a major obstacle to Chinese content moderation. While this problem is well-recognized, existing evaluations predominantly rely on rule-based, synthetic perturbations that ignore the creativity of real users. We organize PCR into a four-way surface-form taxonomy and compile PCR-ToxiCN, a dataset of 500 naturally occurring, phonetically cloaked offensive posts gathered from the RedNote platform. Benchmarking state-of-the-art LLMs on this dataset exposes a serious weakness: the best model reaches only an F1-score of 0.672, and zero-shot chain-of-thought prompting pushes performance even lower. Guided by error analysis, we revisit a Pinyin-based prompting strategy that earlier studies judged ineffective and show that it recovers much of the lost accuracy. This study offers the first comprehensive taxonomy of Chinese PCR, a realistic benchmark that reveals current detectors’ limits, and a lightweight mitigation technique that advances research on robust toxicity detection.

Anthology ID:: 2025.emnlp-industry.172
Volume:: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Month:: November
Year:: 2025
Address:: Suzhou (China)
Editors:: Saloni Potdar, Lina Rojas-Barahona, Sebastien Montella
Venue:: EMNLP
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 2538–2550
Language:
URL:: https://aclanthology.org/2025.emnlp-industry.172/
DOI:: 10.18653/v1/2025.emnlp-industry.172
Bibkey:
Cite (ACL):: Haotan Guo, Jianfei He, Jiayuan Ma, Hongbin Na, Zimu Wang, Haiyang Zhang, Qi Chen, Wei Wang, Zijing Shi, Tao Shen, and Ling Chen. 2025. Lost in Pronunciation: Detecting Chinese Offensive Language Disguised by Phonetic Cloaking Replacement. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 2538–2550, Suzhou (China). Association for Computational Linguistics.
Cite (Informal):: Lost in Pronunciation: Detecting Chinese Offensive Language Disguised by Phonetic Cloaking Replacement (Guo et al., EMNLP 2025)
Copy Citation:
PDF:: https://aclanthology.org/2025.emnlp-industry.172.pdf

PDF Cite Search Fix data