Zewen Bai
2025
Commonality and Individuality! Integrating Humor Commonality with Speaker Individuality for Humor Recognition
Haohao Zhu | Junyu Lu | Zeyuan Zeng | Zewen Bai | Xiaokun Zhang | Liang Yang | Hongfei Lin
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Humor recognition aims to identify whether a specific speaker’s text is humorous. Current methods for humor recognition mainly suffer from two limitations: (1) they solely focus on one aspect of humor commonalities, ignoring the multifaceted nature of humor; and (2) they typically overlook the critical role of speaker individuality, which is essential for a comprehensive understanding of humor expressions. To bridge these gaps, we introduce the Commonality and Individuality Incorporated Network for Humor Recognition (CIHR), a novel model designed to enhance humor recognition by integrating multifaceted humor commonalities with the distinctive individuality of speakers. The CIHR features a Humor Commonality Analysis module that explores various perspectives of multifaceted humor commonality within user texts, and a Speaker Individuality Extraction module that captures both static and dynamic aspects of a speaker’s profile to accurately model their distinctive individuality. Additionally, Static and Dynamic Fusion modules are introduced to effectively incorporate humor commonality with the speaker’s individuality in the humor recognition process. Extensive experiments demonstrate the effectiveness of CIHR, underscoring the importance of concurrently addressing both multifaceted humor commonality and distinctive speaker individuality in humor recognition.
STATE ToxiCN: A Benchmark for Span-level Target-Aware Toxicity Extraction in Chinese Hate Speech Detection
Zewen Bai | Liang Yang | Shengdi Yin | Junyu Lu | Jingjie Zeng | Haohao Zhu | Yuanyuan Sun | Hongfei Lin
Findings of the Association for Computational Linguistics: ACL 2025
The proliferation of hate speech has caused significant harm to society. The intensity and directionality of hate are closely tied to the target and argument it is associated with. However, research on hate speech detection in Chinese has lagged behind, and existing datasets lack span-level fine-grained annotations. Furthermore, the lack of research on Chinese hateful slang poses a significant challenge. In this paper, we provide two valuable fine-grained Chinese hate speech detection research resources. First, we construct a Span-level Target-Aware Toxicity Extraction dataset (STATE ToxiCN), which is the first span-level Chinese hate speech dataset. Second, we evaluate the span-level hate speech detection performance of existing models using STATE ToxiCN. Finally, we conduct the first study on Chinese hateful slang and evaluate the ability of LLMs to understand hate semantics. Our work contributes valuable resources and insights to advance span-level hate speech detection in Chinese.
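To illustrate what a span-level target-aware annotation of the kind the abstract describes might look like, here is a minimal sketch; the field names and example values are hypothetical, not the dataset's actual schema:

```python
# Hypothetical span-level target-aware annotation, illustrating the
# abstract's idea of extracting a target span and an argument span
# from a toxic sample (field names are illustrative only).
sample = {
    "text": "Some toxic comment about a group.",
    "annotations": [
        {
            "target": "a group",          # span naming who is attacked
            "argument": "toxic comment",  # span carrying the toxic content
            "targeted_group": "example-group",
            "hateful": True,
        }
    ],
}

def extract_spans(example):
    """Collect (target, argument) span pairs from one annotated sample."""
    return [(a["target"], a["argument"]) for a in example["annotations"]]

print(extract_spans(sample))
```

A span-level evaluation would then compare such extracted pairs against gold annotations rather than a single sentence-level label.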
Research on Legal Judgment Based on a Dual-System Reasoning Framework
Shengdi Yin | Zewen Bai | Hongfei Lin | Liang Yang
Proceedings of the 24th China National Conference on Computational Linguistics (CCL 2025)
Legal judgment prediction is an important task in legal artificial intelligence. This paper proposes an interpretable dual-system reasoning framework grounded in external knowledge to address the limited accuracy and weak interpretability of existing methods on the prison-term prediction task. Drawing on dual-system theory from cognitive science, the framework leverages the text understanding and generation capabilities of large language models to simulate the decision-making process of a human judge handling a case, ultimately producing prison-term predictions with a clear reasoning path. In addition, by constructing a high-quality reasoning-augmented dataset and an external knowledge base of legal articles, the framework improves the model's explanatory ability and effectively suppresses legal-article hallucinations in the article-prediction model. Experimental results show that the framework significantly improves accuracy and interpretability on the prison-term prediction subtask of the CAIL-small and CAIL-big datasets.
Overview of CCL25-Eval Task 10: Fine-grained Chinese Hate Speech Identification Evaluation Task
Junyu Lu | Zewen Bai | Shengdi Yin | Liang Yang | Hongfei Lin
Proceedings of the 24th China National Conference on Computational Linguistics (CCL 2025)
This paper provides an overview of the CCL25-Eval Task 10, i.e., Fine-grained Chinese Hate Speech Identification Evaluation. The primary objective of this task is to perform a fine-grained analysis of hateful samples. In addition to binary classification, systems are required to identify and extract the comment target, argument span, and the associated targeted group within each sample, thereby enhancing the model’s capability in fine-grained detection and improving the interpretability of its decisions. In total, more than 300 teams registered for the task, with 100 teams submitting valid results. We present the submitted results and provide a comprehensive analysis of the technical approaches adopted by the top-performing teams. The dataset used in this task has been made available.
2023
ZBL2W at SemEval-2023 Task 9: A Multilingual Fine-tuning Model with Data Augmentation for Tweet Intimacy Analysis
Hao Zhang | Youlin Wu | Junyu Lu | Zewen Bai | Jiangming Wu | Hongfei Lin | Shaowu Zhang
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)
This paper describes our system used in the SemEval-2023 Task 9 Multilingual Tweet Intimacy Analysis. There are two key challenges in this task: the complexity of multilingual and zero-shot cross-lingual learning, and the difficulty of semantic mining of tweet intimacy. To solve the above problems, our system extracts contextual representations from the pretrained language model XLM-T, and employs various optimization methods, including adversarial training, data augmentation, ordinal regression loss, and a special training strategy. Our system ranked 14th out of 54 participating teams on the leaderboard and ranked 10th on predicting languages not in the training data. Our code is available on Github.
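The abstract mentions an ordinal regression loss for the ordered intimacy scale. A minimal sketch of one common formulation, the binary-decomposition (cumulative "greater than threshold") trick, is shown below; this is a generic illustration of the technique, not necessarily the authors' exact loss:

```python
import math

def ordinal_targets(label, num_levels):
    """Encode an ordinal label k (0-indexed) as K-1 binary
    'label exceeds threshold j' targets."""
    return [1.0 if label > j else 0.0 for j in range(num_levels - 1)]

def ordinal_bce_loss(logits, label, num_levels):
    """Sum of binary cross-entropies over the K-1 thresholds.
    Unlike plain softmax classification, nearby levels share
    most of their binary targets, so the loss respects order."""
    targets = ordinal_targets(label, num_levels)
    loss = 0.0
    for z, t in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid
        loss -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return loss

# A 5-level scale: label 3 becomes threshold targets [1, 1, 1, 0]
print(ordinal_targets(3, 5))
```

Predicting all thresholds correctly with high confidence drives the loss toward zero, while confusing distant levels is penalized more than confusing adjacent ones.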
DUTIR at SemEval-2023 Task 10: Semi-supervised Learning for Sexism Detection in English
Bingjie Yu | Zewen Bai | Haoran Ji | Shiyi Li | Hao Zhang | Hongfei Lin
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)
Sexism is an injustice afflicting women and has become a common form of oppression in social media. In recent years, the automatic detection of sexist instances has been utilized to combat this oppression. Subtask A of SemEval-2023 Task 10, Explainable Detection of Online Sexism, aims to detect whether an English-language post is sexist. In this paper, we describe our system for the competition. The structure of the classification model is based on RoBERTa, and we further pre-train it on the domain corpus. For fine-tuning, we adopt Unsupervised Data Augmentation (UDA), a semi-supervised learning approach, to improve the robustness of the system. Specifically, we employ the Easy Data Augmentation (EDA) method as the noising operation for consistency training. We train multiple models based on different hyperparameter settings and adopt the majority voting method to predict the labels of test entries. Our proposed system achieves a Macro-F1 score of 0.8352 and a ranking of 41/84 on the leaderboard of Subtask A.
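A minimal sketch of the consistency-training idea the abstract describes: an unlabeled example and an EDA-noised copy should receive similar predictions. The toy synonym table and the squared-distance penalty below are stand-ins (UDA proper uses a KL divergence against the model's sharpened prediction, and EDA includes several noising operations beyond synonym replacement):

```python
import random

def eda_synonym_swap(tokens, synonyms, rng):
    """Toy EDA-style noising op: replace tokens that have an entry
    in the synonym table with one of their listed synonyms."""
    return [rng.choice(synonyms[t]) if t in synonyms else t for t in tokens]

def consistency_loss(p_orig, p_noised):
    """Penalize disagreement between predictions on the clean view
    and the noised view of the same unlabeled example."""
    return sum((a - b) ** 2 for a, b in zip(p_orig, p_noised))

rng = random.Random(0)
tokens = ["this", "post", "is", "offensive"]
synonyms = {"offensive": ["insulting", "abusive"]}  # hypothetical table
noised = eda_synonym_swap(tokens, synonyms, rng)
print(noised)
```

During training, this unsupervised consistency term is added to the ordinary supervised loss on the labeled data, letting unlabeled posts regularize the classifier.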