Shiuan Ni Liang
2025
CHIFRAUD: A Long-term Web Text Dataset for Chinese Fraud Detection
Min Tang
|
Lixin Zou
|
Zhe Jin
|
ShuJie Cui
|
Shiuan Ni Liang
|
Weiqing Wang
Proceedings of the 31st International Conference on Computational Linguistics
Detecting fraudulent online text is essential, as these manipulative messages exploit human greed, deceive individuals, and endanger societal security. Currently, this task remains under-explored on the Chinese web due to the lack of a comprehensive dataset of Chinese fraudulent texts. However, creating such a dataset is challenging because it requires extensive annotation within a vast collection of normal texts. Additionally, the creators of fraudulent webpages continuously update their tactics to evade detection by downstream platforms and promote fraudulent messages. To this end, this work firstly presents the comprehensive long-term dataset of Chinese fraudulent texts collected over 12 months, consisting of 59,106 entries extracted from billions of web pages. Furthermore, we design and provide a wide range of baselines, including large language model-based detectors, and pre-trained language model approaches. The necessary dataset and benchmark codes for further research are available via https://github. com/xuemingxxx/ChiFraud.