Geng Tu

2025

pdf bib abs
CoreEval: Automatically Building Contamination-Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation
Jingqian Zhao | Bingbing Wang | Geng Tu | Yice Zhang | Qianlong Wang | Bin Liang | Jing Li | Ruifeng Xu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Data contamination poses a significant challenge to the fairness of LLM evaluations in natural language processing tasks by inadvertently exposing models to test data during training.Current studies mitigate this issue by modifying existing datasets or generating new ones from freshly collected information. However, these methods fall short of ensuring contamination-resilient evaluation, as they fail to fully eliminate pre-existing knowledge from models or preserve the semantic complexity of the original datasets. To address these limitations, we propose CoreEval, a Contamination-resilient Evaluation strategy for automatically updating data with real-world knowledge. This approach begins by extracting entity relationships from the original data and leveraging the GDELT database to retrieve relevant and up-to-date knowledge. The retrieved knowledge is then recontextualized and integrated with the original data, which is refined and restructured to ensure semantic coherence and enhanced task relevance. Ultimately, a robust data reflection mechanism in a Chain-of-Thought manner to iteratively verify and refine labels, ensuring consistency between the updated and original datasets. Extensive experiments on updated datasets validate the robustness of CoreEval, demonstrating its effectiveness in mitigating performance overestimation caused by data contamination.

pdf bib abs
Large Language Models as Reader for Bias Detection
Xuan Luo | Jing Li | Zhong Wenzhong | Geng Tu | Ruifeng Xu
Findings of the Association for Computational Linguistics: EMNLP 2025

Detecting bias in media content is crucial for maintaining information integrity and promoting inclusivity. Traditional methods analyze text from the writer’s perspective, which analyzes textual features directly from the writer’s intent, leaving the reader’s perspective underexplored. This paper investigates whether Large Language Models (LLMs) can be leveraged as readers for bias detection by generating reader-perspective comments. Experiments are conducted on the BASIL (news bias) and BeyondGender (gender bias) datasets with LLMs Gemma-7B, Phi-3-3.8B, Llama3.1-8B, Llama3.1-70B, and GPT4. The results demonstrate the effectiveness of reader-perspective comments for open-source LLMs, achieving performance comparable to GPT4’s. The findings highlight the significance of emotion-related comments, which are generally more beneficial than value-related ones in bias detection. In addition, experiments on Llamas show that comment selection ensures consistent performance regardless of model sizes and comment combinations. This study is particularly beneficial for small-size open-source LLMs.

pdf bib abs
HITSZ-HLT at SemEval-2025 Task 8: Multi-turn Interactive Code Generation for Question Answering on Tabular Data
Jun Wang | Feng Xiong | Hongling Xu | Geng Tu | Ruifeng Xu
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

This paper introduces the system developed by the HITSZ-HLT team for SemEval-2025 Task 8: DataBench, Question-Answering over Tabular Data.The primary objective of Table Question Answering (TableQA) is to provide accurate answers to user queries by interpreting and understanding tabular data. To address this, we propose the Multi-turn Interactive Code GeneratiOn(MICO) framework. Specifically, MICO employs code generation as proxy task for TableQA and integrates feedback from the execution of the generated code via multi-turn dialogue process, thereby guiding the model towards self-correction.Experimental results demonstrate the effectiveness of our framework, achieving notable performance with a rank of 4/38 on the DataBench and 5/38 on the DataBench lite.

2024

Multimodal Emotion Recognition in Conversations (ERC) aims to identify emotions in conversational videos. Current efforts focus on modeling both context-sensitive and speaker-sensitive dependencies and multimodal fusion. Despite the progress, models in Multimodal ERC (MERC) still struggle due to a lack of CommonSense Knowledge (CSK). In contrast, models in textual ERC typically employ CSK to enhance emotion inference. However, in multimodal scenarios, relying solely on textual CSK while neglecting visual CSK may hinder the understanding of visual emotional cues. To address this, we introduce a novel approach called Multiple Knowledge Enhanced Interactive Graph Network (MKE-IGN) to integrate multiple knowledge, such as textual and visual CSK, into the edge representations, thereby facilitating the modeling of relations between utterances and different types of CSK. Furthermore, considering that irrelevant CSK might be retained as noise, MKE-IGN adaptively selects this CSK guided by the mood-congruent effect and refines it based on contexts. Experimental results show that MKE-IGN outperforms state-of-the-art methods on two popular datasets.

pdf bib abs
HITSZ-HLT at WASSA-2024 Shared Task 2: Language-agnostic Multi-task Learning for Explainability of Cross-lingual Emotion Detection
Feng Xiong | Jun Wang | Geng Tu | Ruifeng Xu
Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis

This paper describes the system developed by the HITSZ-HLT team for WASSA-2024 Shared Task 2, which addresses two closely linked sub-tasks: Cross-lingual Emotion Detection and Binary Trigger Word Detection in tweets. The main goal of Shared Task 2 is to simultaneously identify the emotions expressed and detect the trigger words across multiple languages. To achieve this, we introduce a Language-agnostic Multi Task Learning (LaMTL) framework that integrates emotion prediction and emotion trigger word detection tasks. By fostering synergistic interactions between task-specific and task-agnostic representations, the LaMTL aims to mutually enhance emotional cues, ultimately improving the performance of both tasks. Additionally, we leverage large-scale language models to translate the training dataset into multiple languages, thereby fostering the formation of language-agnostic representations within the model, significantly enhancing the model’s ability to transfer and perform well across multilingual data. Experimental results demonstrate the effectiveness of our framework across both tasks, with a particular highlight on its success in achieving second place in sub-task 2.

2023

pdf bib abs
A Training-Free Debiasing Framework with Counterfactual Reasoning for Conversational Emotion Detection
Geng Tu | Ran Jing | Bin Liang | Min Yang | Kam-Fai Wong | Ruifeng Xu
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Unintended dataset biases typically exist in existing Emotion Recognition in Conversations (ERC) datasets, including label bias, where models favor the majority class due to imbalanced training data, as well as the speaker and neutral word bias, where models make unfair predictions because of excessive correlations between specific neutral words or speakers and classes. However, previous studies in ERC generally focus on capturing context-sensitive and speaker-sensitive dependencies, ignoring the unintended dataset biases of data, which hampers the generalization and fairness in ERC. To address this issue, we propose a Training-Free Debiasing framework (TFD) that operates during prediction without additional training. To ensure compatibility with various ERC models, it does not balance data or modify the model structure. Instead, TFD extracts biases from the model by generating counterfactual utterances and contexts and mitigates them using simple yet empirically robust element-wise subtraction operations. Extensive experiments on three public datasets demonstrate that TFD effectively improves generalization ability and fairness across different ERC models.

Argument pair extraction (APE) aims to extract interactive argument pairs from two passages within a discussion. The key challenge of APE is to effectively capture the complex context-aware interactive relations of arguments between the two passages. In this paper, we elicit relational semantic knowledge from large-scale pre-trained language models (PLMs) via a probing technique. The induced sentence-level relational probing graph can help capture rich explicit interactive relations between argument pairs effectively. Since the relevance score of a sentence pair within a passage is generally larger than that of the sentence pair from different passages, each sentence would prefer to propagate information within the same passage and under-explore the interactive relations between two passages. To tackle this issue, we propose a graph decomposition method to decompose the probing graph into four sub-graphs from intra- and inter-passage perspectives, where the intra-passage graphs can help detect argument spans within each passage and the inter-passage graphs can help identify the argument pairs between the review and rebuttal passages. Experimental results on two benchmark datasets show that our method achieves substantial improvements over strong baselines for APE.

pdf bib abs
Context or Knowledge is Not Always Necessary: A Contrastive Learning Framework for Emotion Recognition in Conversations
Geng Tu | Bin Liang | Ruibin Mao | Min Yang | Ruifeng Xu
Findings of the Association for Computational Linguistics: ACL 2023

Emotion recognition in conversations (ERC) aims to detect the emotion of utterances in conversations. Existing efforts generally focus on modeling context- and knowledge-sensitive dependencies. However, it is observed that the emotions of many utterances can be correctly detected without context or external knowledge. In such cases, blindly leveraging the context and external knowledge may impede model training. Based on this, we propose a novel framework based on contrastive learning (CL), called CKCL (including the contrastive learning scenarios among Context and Knowledge), to distinguish the above utterances for better vector representations. The CKCL framework defines context- and knowledge-independent utterances, as the positive sample, whose predicted results are unchanged even masking context and knowledge representations, otherwise, the negative sample. This can obtain a latent feature reflecting the impact degree of context and external knowledge on predicted results, thus effectively denoising irrelevant context and knowledge during training. Experimental results on four datasets show the performance of CKCL-based models is significantly boosted and outperforms state-of-the-art methods.

pdf bib abs
An Empirical Study on Multiple Knowledge from ChatGPT for Emotion Recognition in Conversations
Geng Tu | Bin Liang | Bing Qin | Kam-Fai Wong | Ruifeng Xu
Findings of the Association for Computational Linguistics: EMNLP 2023

Multiple knowledge (e.g., co-reference, topics, emotional causes, etc) has been demonstrated effective for emotion detection. However, exploring this knowledge in Emotion Recognition in Conversations (ERC) is currently a blank slate due to the lack of annotated data and the high cost involved in obtaining such knowledge. Fortunately, the emergence of Large Language Models (LLMs) holds promise in filling this void. Therefore, we propose a Multiple Knowledge Fusion Model (MKFM) to effectively integrate such knowledge generated by LLMs for ERC and empirically study its impact on the model. Experimental results on three public datasets have demonstrated the effectiveness of multiple knowledge for ERC. Furthermore, we conduct a detailed analysis of the contribution and complementarity of this knowledge.

pdf bib abs
Learning More from Mixed Emotions: A Label Refinement Method for Emotion Recognition in Conversations
Jintao Wen | Geng Tu | Rui Li | Dazhi Jiang | Wenhua Zhu
Transactions of the Association for Computational Linguistics, Volume 11

One-hot labels are commonly employed as ground truth in Emotion Recognition in Conversations (ERC). However, this approach may not fully encompass all the emotions conveyed in a single utterance, leading to suboptimal performance. Regrettably, current ERC datasets lack comprehensive emotionally distributed labels. To address this issue, we propose the Emotion Label Refinement (EmoLR) method, which utilizes context- and speaker-sensitive information to infer mixed emotional labels. EmoLR comprises an Emotion Predictor (EP) module and a Label Refinement (LR) module. The EP module recognizes emotions and provides context/speaker states for the LR module. Subsequently, the LR module calculates the similarity between these states and ground-truth labels, generating a refined label distribution (RLD). The RLD captures a more comprehensive range of emotions than the original one-hot labels. These refined labels are then used for model training in place of the one-hot labels. Experimental results on three public conversational datasets demonstrate that our EmoLR achieves state-of-the-art performance.