Haiqin Yang - ACL Anthology

Haiqin Yang

2026

Sparse Adapter Fusion for Continual Learning in NLP
Min Zeng | Xi Chen | Haiqin Yang | Yike Guo
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Continual learning in natural language processing plays a crucial role in adapting to evolving data and preventing catastrophic forgetting. Despite significant progress, existing methods still face challenges, such as inefficient parameter reuse across tasks, risking catastrophic forgetting when tasks are dissimilar, and the unnecessary introduction of new parameters for each task, which hampers knowledge sharing among similar tasks. To tackle these issues, we propose a Sparse Adapter Fusion Method (SAFM), which dynamically fuses old and new adapters to address these challenges. SAFM operates in two stages: the decision stage and the tuning stage. In the decision stage, SAFM determines whether to incorporate a new adapter, reuse an existing one, or add an empty adapter. The architecture search procedure, designed to prioritize reusing or adding empty adapters, minimizes parameter consumption and maximizes reuse. In the tuning stage, SAFM especially facilitates a layer-wise loss to encourage differentiation between adapters, effectively capturing knowledge within the same task. Experimental results consistently show that SAFM outperforms state-of-the-art (SOTA) methods, achieving comparable performance while utilizing less than 60% of the parameters.

2025

Intrinsic Test of Unlearning Using Parametric Knowledge Traces
Yihuai Hong | Lei Yu | Haiqin Yang | Shauli Ravfogel | Mor Geva
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

The task of “unlearning” certain concepts in large language models (LLMs) has gained attention for its role in mitigating harmful, private, or incorrect outputs. Current evaluations mostly rely on behavioral tests, without monitoring residual knowledge in model parameters, which can be adversarially exploited to recover erased information. We argue that unlearning should also be assessed internally by tracking changes in the parametric traces of unlearned concepts. To this end, we propose a general evaluation methodology that uses vocabulary projections to inspect concepts encoded in model parameters. We apply this approach to localize “concept vectors” — parameter vectors encoding concrete concepts — and construct ConceptVectors, a benchmark of hundreds of such concepts and their parametric traces in two open-source LLMs. Evaluation on ConceptVectors shows that existing methods minimally alter concept vectors, mostly suppressing them at inference time, while direct ablation of these vectors removes the associated knowledge and reduces adversarial susceptibility. Our findings reveal limitations of behavior-only evaluations and advocate for parameter-based assessments. We release our code and benchmark at https://github.com/yihuaihong/ConceptVectors.
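As a rough illustration of the vocabulary-projection idea mentioned in this abstract, the sketch below scores every vocabulary token against a single parameter vector and lists the highest-scoring tokens. It is a minimal toy under stated assumptions, not the authors' implementation: the function name `project_to_vocab`, the 2-dimensional vectors, and the four-word vocabulary are all hypothetical.

```python
def project_to_vocab(vector, unembedding, vocab, top_k=2):
    """Rank vocabulary tokens by their dot product with a parameter vector.

    `unembedding` holds one row per vocabulary token (hypothetical toy values);
    a real model would supply its own output embedding matrix.
    """
    scores = [sum(w * x for w, x in zip(row, vector)) for row in unembedding]
    ranked = sorted(range(len(vocab)), key=lambda i: scores[i], reverse=True)
    return [(vocab[i], scores[i]) for i in ranked[:top_k]]


# Toy setup (assumed for illustration only): dimension 0 loosely encodes a
# "magic" concept and dimension 1 a "finance" concept.
vocab = ["wand", "spell", "tax", "bank"]
unembedding = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
candidate_vector = [1.0, 0.0]

top_tokens = project_to_vocab(candidate_vector, unembedding, vocab)
print(top_tokens)
```

If the top-ranked tokens cluster around one concept, the vector is a candidate trace of that concept; rerunning the same projection after unlearning would show whether the trace itself changed.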

Task-wrapped Continual Learning in Task-Oriented Dialogue Systems
Min Zeng | Haiqin Yang | Xi Chen | Yike Guo
Findings of the Association for Computational Linguistics: NAACL 2025

Continual learning is vital for task-oriented dialogue systems (ToDs), and AdapterCL, equipped with residual adapters, has proven effective in this domain. However, its performance is limited by training separate adapters for each task, preventing global knowledge sharing. To address this, we propose **Task-wrapped Continual Learning (TCL)**, a novel framework that employs **Task-Wrapped Adapters (TWAs)** to simultaneously learn both global and task-specific information through parameter sharing. TCL leverages task-conditioned hypernetworks to transfer global knowledge across tasks, enabling TWAs to start from a more informed initialization, efficiently learning task-specific details while reducing model parameters. Additionally, the simple, linear structure of both hypernetworks and TWAs ensures stable training, with task-free inference supported through effective loss utilization. Across 37 ToD domains, TCL consistently outperforms AdapterCL, significantly reducing forgetting. Remarkably, by setting the task embedding dimension to 1, TCL achieves a 4.76% improvement over AdapterCL while using only 46% of the parameters. These findings position TWA as a lightweight, powerful alternative to traditional adapters, offering a promising solution for continual learning in ToDs. The code is available at https://github.com/cloversjtu/TCL.

PARSQL: Enhancing Text-to-SQL through SQL Parsing and Reasoning
Yaxun Dai | Haiqin Yang | Mou Hao | Pingfu Chao
Findings of the Association for Computational Linguistics: ACL 2025

Large language models (LLMs) have made significant strides in text-to-SQL tasks; however, small language models (SLMs) are crucial due to their low resource consumption and efficient inference for real-world deployment. Due to resource limitations, SLMs struggle to accurately interpret natural language questions and may overlook critical constraints, leading to challenges such as generating SQL with incorrect logic or incomplete conditions. To address these issues, we propose PARSQL, a novel framework that leverages SQL parsing and reasoning. Specifically, we design PARSer, an SQL parser that extracts constraints from SQL to generate sub-SQLs for data augmentation and to produce step-by-step SQL explanations (reasons) via both rule-based and LLM-based methods. We define a novel text-to-reason task and incorporate it into multi-task learning, thereby enhancing text-to-SQL performance. Additionally, we employ an efficient SQL selection strategy that conducts direct similarity computation between the generated SQLs and their corresponding reasons to derive the final SQL for post-correction. Extensive experiments show that our PARSQL outperforms models with the same model size on the BIRD and Spider benchmarks. Notably, PARSQL-3B achieves 56.98% execution accuracy on BIRD, rivaling 7B models with significantly fewer parameters, setting a new state-of-the-art performance. Code can be found [here](https://github.com/yaxundai/parsql).

Let’s Play Across Cultures: A Large Multilingual, Multicultural Benchmark for Assessing Language Models’ Understanding of Sports
Punit Kumar Singh | Nishant Kumar | Akash Ghosh | Kunal Pasad | Khushi Soni | Manisha Jaishwal | Sriparna Saha | Syukron Abu Ishaq Alfarozi | Asres Temam Abagissa | Kitsuchart Pasupa | Haiqin Yang | Jose G Moreno
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Language Models (LMs) are primarily evaluated on globally popular sports, often overlooking regional and indigenous sporting traditions. To address this gap, we introduce CultSportQA, a benchmark designed to assess LMs’ understanding of traditional sports across 60 countries and 6 continents, encompassing four distinct cultural categories. The dataset features 33,000 multiple-choice questions (MCQs) across text and image modalities, categorized into primarily three key types: history-based, rule-based, and scenario-based. To evaluate model performance, we employ zero-shot, few-shot, and chain-of-thought (CoT) prompting across a diverse set of Large Language Models (LLMs), Small Language Models (SLMs), and Multimodal Large Language Models (MLMs). By providing a comprehensive multilingual and multicultural sports benchmark, CultSportQA establishes a new standard for assessing AI’s ability to understand and reason about traditional sports. The dataset will be publicly available, fostering research in culturally aware AI systems.

2024

A Two-Agent Game for Zero-shot Relation Triplet Extraction
Ting Xu | Haiqin Yang | Fei Zhao | Zhen Wu | Xinyu Dai
Findings of the Association for Computational Linguistics: ACL 2024

Relation triplet extraction is a fundamental task in natural language processing that aims to identify semantic relationships between entities in text. It is particularly challenging in the zero-shot setting, i.e., zero-shot relation triplet extraction (ZeroRTE), where the relation sets between training and test are disjoint. Existing methods deal with this task by integrating relations into prompts, which may lack sufficient understanding of the unseen relations. To address these limitations, this paper presents a novel Two-Agent Game (TAG) approach to deliberate and debate the semantics of unseen relations. TAG consists of two agents, a generator and an extractor. They iteratively interact in three key steps: attempting, criticizing, and rectifying. This enables the agents to fully debate and understand the unseen relations. Experimental results demonstrate consistent improvement over ALBERT-Large, BART, and GPT3.5, without incurring additional inference costs in all cases. Remarkably, our method outperforms strong baselines by a significant margin, achieving an impressive 6%-16% increase in F1 scores, particularly when dealing with FewRel with five unseen relations.

Dissecting Fine-Tuning Unlearning in Large Language Models
Yihuai Hong | Yuelin Zou | Lijie Hu | Ziqian Zeng | Di Wang | Haiqin Yang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Fine-tuning-based unlearning methods prevail for erasing targeted harmful, sensitive, or copyrighted information within large language models while preserving overall capabilities. However, the true effectiveness of the methods is unclear. In this paper, we delve into the limitations of fine-tuning-based unlearning through activation patching and parameter restoration experiments. Our findings reveal that these methods alter the model’s knowledge retrieval process, rather than genuinely erasing the problematic knowledge embedded in the model parameters. Furthermore, behavioral tests demonstrate that the unlearning mechanisms inevitably impact the global behavior of the models, affecting unrelated knowledge or capabilities. Our work advocates the development of more resilient unlearning techniques for truly erasing knowledge.

2023

A Diffusion Model for Event Skeleton Generation
Fangqi Zhu | Lin Zhang | Jun Gao | Bing Qin | Ruifeng Xu | Haiqin Yang
Findings of the Association for Computational Linguistics: ACL 2023

Event skeleton generation, aiming to induce an event schema skeleton graph with abstracted event nodes and their temporal relations from a set of event instance graphs, is a critical step in the temporal complex event schema induction task. Existing methods effectively address this task from a graph generation perspective but suffer from noise sensitivity and error accumulation, e.g., the inability to correct errors while generating the schema. We, therefore, propose a novel Diffusion Event Graph Model (DEGM) to address these issues. Our DEGM is the first workable diffusion model for event skeleton generation, where the embedding and rounding techniques with a custom edge-based loss are introduced to transform a discrete event graph into learnable latent representations. Furthermore, we propose a denoising training process to maintain the model’s robustness. Consequently, DEGM derives the final schema, where error correction is guaranteed by iteratively refining the latent representations during the schema generation process. Experimental results on three IED bombing datasets demonstrate that our DEGM achieves better results than other state-of-the-art baselines. Our code and data are available at https://github.com/zhufq00/EventSkeletonGeneration.

A Unified One-Step Solution for Aspect Sentiment Quad Prediction
Junxian Zhou | Haiqin Yang | Yuxuan He | Hao Mou | JunBo Yang
Findings of the Association for Computational Linguistics: ACL 2023

Aspect sentiment quad prediction (ASQP) is a challenging yet significant subtask in aspect-based sentiment analysis as it provides a complete aspect-level sentiment structure. However, existing ASQP datasets are usually small and low-density, hindering technical advancement. To expand the capacity, in this paper, we release two new datasets for ASQP, which have the following characteristics: larger size, more words per sample, and higher density. With such datasets, we unveil the shortcomings of existing strong ASQP baselines and therefore propose a unified one-step solution for ASQP, namely One-ASQP, to detect the aspect categories and to identify the aspect-opinion-sentiment (AOS) triplets simultaneously. Our One-ASQP holds several unique advantages: (1) by separating ASQP into two subtasks and solving them independently and simultaneously, we can avoid error propagation in pipeline-based methods and overcome slow training and inference in generation-based methods; (2) by introducing a sentiment-specific horns tagging schema in a token-pair-based two-dimensional matrix, we can exploit deeper interactions between sentiment elements and efficiently decode the AOS triplets; (3) we design a “[NULL]” token that helps us effectively identify implicit aspects or opinions. Experiments on two benchmark datasets and our two released datasets demonstrate the advantages of our One-ASQP. The two new datasets are publicly released at https://www.github.com/Datastory-CN/ASQP-Datasets.

2021

MagicPai at SemEval-2021 Task 7: Method for Detecting and Rating Humor Based on Multi-Task Adversarial Training
Jian Ma | Shuyi Xie | Haiqin Yang | Lianxin Jiang | Mengyuan Zhou | Xiaoyi Ruan | Yang Mo
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

This paper describes MagicPai’s system for SemEval 2021 Task 7, HaHackathon: Detecting and Rating Humor and Offense. This task aims to detect whether the text is humorous and how humorous it is. There are four subtasks in the competition. In this paper, we mainly present our solution, a multi-task learning model based on adversarial examples, for task 1a and 1b. More specifically, we first vectorize the cleaned dataset and add the perturbation to obtain more robust embedding representations. We then correct the loss via the confidence level. Finally, we perform interactive joint learning on multiple tasks to capture the relationship between whether the text is humorous and how humorous it is. The final result shows the effectiveness of our system.

Sattiy at SemEval-2021 Task 9: An Ensemble Solution for Statement Verification and Evidence Finding with Tables
Xiaoyi Ruan | Meizhi Jin | Jian Ma | Haiqin Yang | Lianxin Jiang | Yang Mo | Mengyuan Zhou
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

Question answering from semi-structured tables can be seen as a semantic parsing task and is significant and practical for pushing the boundary of natural language understanding. Existing research mainly focuses on understanding contents from unstructured evidence, e.g., news, natural language sentences, and documents. The task of verification from structured evidence, such as tables, charts, and databases, is still less explored. This paper describes the sattiy team’s system in SemEval-2021 Task 9: Statement Verification and Evidence Finding with Tables (SEM-TAB-FACT). This competition aims to verify statements and to find evidence from tables for scientific articles and to promote proper interpretation of the surrounding article. In this paper, we exploit ensemble models of pre-trained language models over tables, TaPas and TaBERT, for Task A and adjust the result based on some rules extracted for Task B. Finally, on the leaderboard, we attain F1 scores of 0.8496 and 0.7732 in Task A for the 2-way and 3-way evaluation, respectively, and an F1 score of 0.4856 in Task B.

PALI at SemEval-2021 Task 2: Fine-Tune XLM-RoBERTa for Word in Context Disambiguation
Shuyi Xie | Jian Ma | Haiqin Yang | Lianxin Jiang | Yang Mo | Jianping Shen
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

This paper presents the PALI team’s winning system for SemEval-2021 Task 2: Multilingual and Cross-lingual Word-in-Context Disambiguation. We fine-tune the XLM-RoBERTa model to solve the task of word-in-context disambiguation, i.e., to determine whether the target word in the two contexts has the same meaning or not. In implementation, we first specifically design an input tag to emphasize the target word in the contexts. Second, we construct a new vector on the fine-tuned embeddings from XLM-RoBERTa and feed it to a fully-connected network to output the probability of whether the target word in the contexts has the same meaning or not. The new vector is attained by concatenating the embedding of the [CLS] token and the embeddings of the target word in the contexts. In training, we explore several tricks, such as the Ranger optimizer, data augmentation, and adversarial training, to improve the model prediction. Consequently, we attain first place in all four cross-lingual tasks.
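The vector construction described in this abstract can be sketched as follows. This is a minimal toy under stated assumptions rather than the PALI system itself: the helper names are hypothetical, the embeddings are hand-written 2-dimensional lists instead of XLM-RoBERTa outputs, subword embeddings of the target word are mean-pooled (the abstract does not say how they are combined), and a single linear layer with a sigmoid stands in for the fully-connected network.

```python
import math


def mean_pool(rows):
    """Average a list of equal-length embedding vectors (an assumed pooling choice)."""
    dim = len(rows[0])
    return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]


def build_pair_vector(token_embeddings, span_a, span_b):
    """Concatenate the [CLS] embedding with the target-word embeddings of both contexts.

    Spans are half-open (start, end) token index ranges of the target word.
    """
    cls_embedding = token_embeddings[0]
    target_a = mean_pool(token_embeddings[span_a[0]:span_a[1]])
    target_b = mean_pool(token_embeddings[span_b[0]:span_b[1]])
    return cls_embedding + target_a + target_b


def same_meaning_probability(pair_vector, weights, bias):
    """One linear layer plus sigmoid, standing in for the fully-connected network."""
    logit = sum(w * x for w, x in zip(weights, pair_vector)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy embeddings for six tokens; index 0 plays the role of [CLS].
toy_embeddings = [[0.5, -0.5], [1.0, 0.0], [3.0, 2.0],
                  [0.0, 0.0], [2.0, 2.0], [4.0, 0.0]]
pair_vector = build_pair_vector(toy_embeddings, (1, 3), (4, 6))
probability = same_meaning_probability(pair_vector, [0.1] * len(pair_vector), 0.0)
print(pair_vector, round(probability, 3))
```

In a trained system the weights would be learned jointly with the encoder; the fixed values here only show how the shapes fit together.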

2020

FPAI at SemEval-2020 Task 10: A Query Enhanced Model with RoBERTa for Emphasis Selection
Chenyang Guo | Xiaolong Hou | Junsong Ren | Lianxin Jiang | Yang Mo | Haiqin Yang | Jianping Shen
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This paper describes the model we apply in SemEval-2020 Task 10. We formalize the task of emphasis selection as a simplified query-based machine reading comprehension (MRC) task, i.e., answering the fixed question “Find candidates for emphasis”. We propose a subword puzzle encoding mechanism and a subword fusion layer to align and fuse subwords. By introducing the semantic prior knowledge of the informative query and some other techniques, we attain 7th place during the evaluation phase and first place during the training phase.

UNIXLONG at SemEval-2020 Task 6: A Joint Model for Definition Extraction
ShuYi Xie | Jian Ma | Haiqin Yang | Jiang Lianxin | Mo Yang | Jianping Shen
Proceedings of the Fourteenth Workshop on Semantic Evaluation

Definition Extraction is the task of automatically extracting terms and their definitions from text. In recent years, it has attracted wide interest from NLP researchers. This paper describes the unixlong team’s system for SemEval-2020 Task 6: DeftEval: Extracting term-definition pairs in free text. The goal of this task is to extract definitions, word-level BIO tags, and relations. This task is challenging due to the free style of the text; in particular, the definitions of the terms range across several sentences and lack explicit verb phrases. We propose a joint model to train the tasks of definition extraction and word-level BIO tagging simultaneously. We design a creative input format for BERT to capture the location information between an entity and its definition. Then we adjust the result of BERT with some rules. Finally, we apply TAG_ID, ROOT_ID, and BIO tags to predict the relation and achieve a macro-averaged F1 score of 1.0, which ranks first on the official test set in the relation extraction subtask.

2019

HiGRU: Hierarchical Gated Recurrent Units for Utterance-Level Emotion Recognition
Wenxiang Jiao | Haiqin Yang | Irwin King | Michael R. Lyu
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

In this paper, we address three challenges in utterance-level emotion recognition in dialogue systems: (1) the same word can deliver different emotions in different contexts; (2) some emotions are rarely seen in general dialogues; (3) long-range contextual information is hard to capture effectively. We therefore propose a hierarchical Gated Recurrent Unit (HiGRU) framework with a lower-level GRU to model the word-level inputs and an upper-level GRU to capture the contexts of utterance-level embeddings. Moreover, we extend the framework to two variants, HiGRU with individual features fusion (HiGRU-f) and HiGRU with self-attention and features fusion (HiGRU-sf), so that the word/utterance-level individual inputs and the long-range contextual information can be sufficiently utilized. Experiments on three dialogue emotion datasets, IEMOCAP, Friends, and EmotionPush, demonstrate that our proposed HiGRU models attain at least 8.7%, 7.5%, and 6.0% improvement over the state-of-the-art methods on each dataset, respectively. In particular, by utilizing only the textual feature in IEMOCAP, our HiGRU models gain at least 3.8% improvement over the state-of-the-art conversational memory network (CMN) with the trimodal features of text, video, and audio.

2018

EmotionX-DLC: Self-Attentive BiLSTM for Detecting Sequential Emotions in Dialogues
Linkai Luo | Haiqin Yang | Francis Y. L. Chin
Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media

In this paper, we propose a self-attentive bidirectional long short-term memory (SA-BiLSTM) network to predict multiple emotions for the EmotionX challenge. The BiLSTM exhibits the power of modeling the word dependencies, and extracting the most relevant features for emotion classification. Building on top of BiLSTM, the self-attentive network can model the contextual dependencies between utterances which are helpful for classifying the ambiguous emotions. We achieve 59.6 and 55.0 unweighted accuracy scores in the Friends and the EmotionPush test sets, respectively.
