Dawei Song

2025

Language model (LM) distillation aims at distilling the knowledge in a large teacher LM to a small student one. As a critical issue facing LM distillation, a superior student often arises from a teacher of a relatively small scale instead of a larger one, especially in the presence of substantial capacity gap between the teacher and student. This issue, often referred to as the curse of capacity gap, suggests that there is likely an optimal teacher yielding the best-performing student along the scaling course of the teacher. Consequently, distillation trials on teachers of a wide range of scales are called for to determine the optimal teacher, which becomes computationally intensive in the context of large LMs (LLMs). This paper addresses this critical bottleneck by providing the law of capacity gap inducted from a preliminary study on distilling a broad range of small-scale (<3B) LMs, where the optimal teacher consistently scales linearly with the student scale across different model and data scales. By extending the law to LLM distillation on a larger scale (7B), we succeed in obtaining versatile LLMs that outperform a wide array of competitors.

pdf bib abs

Bridging External and Parametric Knowledge: Mitigating Hallucination of LLMs with Shared-Private Semantic Synergy in Dual-Stream Knowledge
Yi Sui | Chaozhuo Li | Chen Zhang | Dawei Song | Qiuchi Li
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Retrieval-augmented generation (RAG) aims to mitigate the hallucination of Large Language Models (LLMs) by retrieving and incorporating relevant external knowledge into the generation process. However, the external knowledge may contain noise and conflict with the parametric knowledge of LLMs, leading to degraded performance. Current LLMs lack inherent mechanisms for resolving such conflicts. To fill this gap, we propose a Dual-Stream Knowledge-Augmented Framework for Shared-Private Semantic Synergy (DSSP-RAG). Central to it is the refinement of the traditional self-attention into a mixed-attention that distinguishes shared and private semantics for a controlled knowledge integration. An unsupervised hallucination detection method that captures the LLMs’ intrinsic cognitive uncertainty ensures that external knowledge is introduced only when necessary. To reduce noise in external knowledge, an Energy Quotient (EQ), defined by attention difference matrices between task-aligned and task-misaligned layers, is proposed. Extensive experiments show that DSSP-RAG achieves a superior performance over strong baselines.

pdf bib abs

Expectation Preference Optimization: Reliable Preference Estimation for Improving the Reasoning Capability of Large Language Models
Zelin Li | Dawei Song
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Pairwise preference optimization, such as Direct Preference Optimization (DPO), was originally designed to align large language models (LLMs) with human values. It has recently been used to improve the supervised fine-tuning (SFT) performance of LLMs. Using pairs of single samples, DPO estimates the probability distribution of the preferences of picking one response over another. However, in tasks that involve more complicated preferences (e.g., reasoning tasks) than those in the human value alignment task, this sampling method is likely to bring deviations from the ground-truth distribution. To solve the problem, extra efforts (e.g., external annotations or amendment of the loss function) are often required. In this paper, we hypothesise that the preferences can be better estimated through a multi-sampling process. Accordingly, we propose an Expectation Preference Optimization (EPO) algorithm that takes pairs of sample groups, instead of pairs of single samples as in DPO, for preference learning. Compared to pairwise DPO, the proposed EPO tends to produce more reliable preference estimations. Applying different preference optimization methods in a self-training paradigm, we have conducted extensive experiments on various reasoning benchmarks. The results show that our EPO approach outperforms a range of baseline approaches in terms of zero-shot accuracy on all benchmarks.

pdf bib abs

Long-context efficiency has recently become a trending topic in serving large language models (LLMs). And mixture of depths (MoD) is proposed as a perfect fit to bring down both latency and memory. In this paper, however, we discover that MoD can barely transform existing LLMs without costly training over an extensive number of tokens. To enable the transformations from any LLMs to MoD ones, we showcase top-k operator in MoD should be promoted to threshold-p operator, and refinement to architecture and data should also be crafted along. All these designs form our method termed MoDification. Through a comprehensive set of experiments covering model scales from 3B to 70B, we exhibit MoDification strikes an excellent balance between efficiency and effectiveness. MoDification can achieve up to ~1.2× speedup in latency and ~1.8× reduction in memory compared to original LLMs especially in long-context applications.

2024

pdf bib abs

Recent studies have revealed that language model distillation can become less effective when there is a significant capacity gap between the teacher and the student models. In order to bridge the gap, teacher assistant-based distillation has been introduced, in which the selection of the teacher assistant plays a crucial role in transferring knowledge from the teacher to the student. However, existing approaches for teacher assistant-based distillation require numerous trials to find the optimal teacher assistant.In this paper, we propose a novel approach called Minimal Distillation Schedule (MiniDisc), which enables the scheduling of an optimal teacher assistant in just one trial for extreme model compression (e.g, to 5% scale). In particular, we empirically show that the performance of the student is positively correlated with the scale-performance tradeoff of the teacher assistant. We then introduce a new 𝜆-tradeoff metric that quantifies the optimality of the teacher assistant without the need for trial distillation to the student. By employing a sandwich framework, MiniDisc can select the optimal teacher assistant with the best 𝜆-tradeoff.We extensively evaluate MiniDisc through a series of experiments on the GLUE benchmark. The results demonstrate that our approach achieved an improved efficiency compared to various state-of-the-art baselines. Furthermore, we showcase the scalability of MiniDisc by applying it to a language model with billions of parameters.

pdf bib abs

Controllable Text Generation with Residual Memory Transformer
Hanqing Zhang | Si Sun | Haiming Wu | Dawei Song
Findings of the Association for Computational Linguistics: ACL 2024

Large-scale Causal Language Models (CLMs), e.g., GPT3 and ChatGPT, have brought great success in text generation. However, it is still an open challenge to effectively control the generation process of a CLM while balancing the flexibility, control granularity, and generation efficiency. In this paper, we provide a new alternative for controllable text generation (CTG), by designing a non-intrusive, lightweight control plugin, namely Residual Memory Transformer (RMT), to accompany the generation of CLM at arbitrary time steps. With an encoder-decoder setup, RMT can accept any types of control conditions and cooperate with the base CLM through a residual learning paradigm, to achieve a more flexible, general, and efficient CTG. Extensive experiments are carried out on various control tasks, in the form of both automatic and human evaluations. The results demonstrate the superiority of RMT over a wide range of state-of-the-art CTG approaches. The code implementation of our work is available at: https://github.com/Residual_Memory_Transformer.

pdf bib abs

Bi-DCSpell: A Bi-directional Detector-Corrector Interactive Framework for Chinese Spelling Check
Haiming Wu | Hanqing Zhang | Richeng Xuan | Dawei Song
Findings of the Association for Computational Linguistics: EMNLP 2024

Chinese Spelling Check (CSC) aims to detect and correct potentially misspelled characters in Chinese sentences. Naturally, it involves the detection and correction subtasks, which interact with each other dynamically. Such interactions are bi-directional, i.e., the detection result would help reduce the risk of over-correction and under-correction while the knowledge learnt from correction would help prevent false detection. Current CSC approaches are of two types: correction-only or single-directional detection-to-correction interactive frameworks. Nonetheless, they overlook the bi-directional interactions between detection and correction. This paper aims to fill the gap by proposing a Bi-directional Detector-Corrector framework for CSC (Bi-DCSpell). Notably, Bi-DCSpell contains separate detection and correction encoders, followed by a novel interactive learning module facilitating bi-directional feature interactions between detection and correction to improve each other’s representation learning. Extensive experimental results demonstrate a robust correction performance of Bi-DCSpell on widely used benchmarking datasets while possessing a satisfactory detection ability.

pdf bib abs

How Speculative Can Speculative Decoding Be?
Zhuorui Liu | Chen Zhang | Dawei Song
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Large language models (LLMs) have drawn great attention from the field of natural language processing and beyond, due to their impressive capability of autoregressive modeling, yet bringing an obvious problem, i.e., the largely increased latency. An emerging idea to alleviate this problem is speculative decoding, which first uses a draft model to draft tokens autoregressively and then makes the target model verify these tokens in parallel. The draft model is typically smaller than the target model, and it essentially trades generation quality for speed. Thereby, speculative decoding can be viewed as a speculative game for the target model in term of verification failures. That is, the lengthy draft tokens proposed by the small draft models could fail in the verification stage. Naturally, a critical question arises: how speculative can speculative decoding be, or in other words, how small can an adequate draft model be and how large can an appropriate number of draft tokens be? This work aims to investigate these questions and demonstrate how the scale of the draft model and the number of draft tokens would have an impact on the overall latency of the speculative decoding. We theoretically show that neither of above two factors will be infinitely speculative. Namely, there is a certain turning point for each of them. We then empirically show that the scale of the draft model could be 10-20× smaller than the target model and the optimal number of draft tokens should lie in 3-5.

pdf bib abs

Task-agnostic Distillation of Encoder-Decoder Language Models
Chen Zhang | Yang Yang | Qiuchi Li | Jingang Wang | Dawei Song
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Finetuning pretrained language models (LMs) have enabled appealing performance on a diverse array of tasks. The intriguing task-agnostic property has driven a shifted focus from task-specific to task-agnostic distillation of LMs. While task-agnostic, compute-efficient, performance-preserved LMs can be yielded by task-agnostic distillation, previous studies mainly sit in distillation of either encoder-only LMs (e.g., BERT) or decoder-only ones (e.g., GPT) yet largely neglect that distillation of encoder-decoder LMs (e.g., T5) can posit very distinguished behaviors. Frustratingly, we discover that existing task-agnostic distillation methods can fail to handle the distillation of encoder-decoder LMs. To the demand, we explore a few paths and uncover a path named as MiniEnD that successfully tackles the distillation of encoder-decoder LMs in a task-agnostic fashion. We examine MiniEnD on language understanding and abstractive summarization. The results showcase that MiniEnD is generally effective and is competitive compared to other alternatives. We further scale MiniEnD up to distillation of 3B encoder-decoder language models with interpolated distillation. The results imply the opportunities and challenges in distilling large language models (e.g., LLaMA).

pdf bib abs

DS-Group at SIGHAN-2024 dimABSA Task: Constructing In-context Learning Structure for Dimensional Aspect-Based Sentiment Analysis
Ling-ang Meng | Tianyu Zhao | Dawei Song
Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10)

Aspect-Based Sentiment Analysis (ABSA) is an important subtask in Natural Language Processing (NLP). More recent research within ABSA have consistently focused on conducting more precise sentiment analysis on aspects, i.e., dimensional Aspect-Based Sentiment Analysis (dimABSA). However, previous approaches have not systematically explored the use of Large Language Models (LLMs) in dimABSA. To fill the gap, we propose a novel In-Context Learning (ICL) structure with a novel aspect-aware ICL example selection method, to enhance the performance of LLMs in dimABSA. Experiments show that our proposed ICL structure significantly improves the fine-grained sentiment analysis abilities of LLMs.

2023

pdf bib abs

Pretrained language models (LMs) have shown compelling performance on various downstream tasks, but unfortunately they require a tremendous amount of inference compute. Knowledge distillation finds a path to compress LMs to small ones with a teacher-student paradigm. However, when the capacity gap between the teacher and the student is large, a curse of capacity gap appears, invoking a deficiency in distilling LMs. While a few studies have been carried out to fill the gap, the curse is not yet well tackled. In this paper, we aim at lifting the curse of capacity gap via enlarging the capacity of the student without notably increasing the inference compute. Largely motivated by sparse activation regime of mixture of experts (MoE), we propose a mixture of minimal experts (MiniMoE), which imposes extra parameters to the student but introduces almost no additional inference compute. Experimental results on GLUE and CoNLL demonstrate the curse of capacity gap is lifted by the magic of MiniMoE to a large extent. MiniMoE also achieves the state-of-the-art performance at small FLOPs compared with a range of competitive baselines. With a compression rate as much as ~50×, MiniMoE preserves ~95% GLUE score of the teacher.

2022

pdf bib abs

Structural bias has recently been exploited for aspect sentiment triplet extraction (ASTE) and led to improved performance. On the other hand, it is recognized that explicitly incorporating structural bias would have a negative impact on efficiency, whereas pretrained language models (PLMs) can already capture implicit structures. Thus, a natural question arises: Is structural bias still a necessity in the context of PLMs? To answer the question, we propose to address the efficiency issues by using an adapter to integrate structural bias in the PLM and using a cheap-to-compute relative position structure in place of the syntactic dependency structure. Benchmarking evaluation is conducted on the SemEval datasets. The results show that our proposed structural adapter is beneficial to PLMs and achieves state-of-the-art performance over a range of strong baselines, yet with a light parameter demand and low latency. Meanwhile, we give rise to the concern that the current evaluation default with data of small scale is under-confident. Consequently, we release a large-scale dataset for ASTE. The results on the new dataset hint that the structural adapter is confidently effective and efficient to a large scale. Overall, we draw the conclusion that structural bias shall still be a necessity even with PLMs.

pdf bib abs

Making Pretrained Language Models Good Long-tailed Learners
Chen Zhang | Lei Ren | Jingang Wang | Wei Wu | Dawei Song
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Prompt-tuning has shown appealing performance in few-shot classification by virtue of its capability in effectively exploiting pre-trained knowledge. This motivates us to check the hypothesis that prompt-tuning is also a promising choice for long-tailed classification, since the tail classes are intuitively few-shot ones. To achieve this aim, we conduct empirical studies to examine the hypothesis. The results demonstrate that prompt-tuning makes pretrained language models at least good long-tailed learners. For intuitions on why prompt-tuning can achieve good performance in long-tailed classification, we carry out in-depth analyses by progressively bridging the gap between prompt-tuning and commonly used finetuning. The summary is that the classifier structure and parameterization form the key to making good long-tailed learners, in comparison with the less important input structure. Finally, we verify the applicability of our finding to few-shot classification.

pdf bib abs

DisCup: Discriminator Cooperative Unlikelihood Prompt-tuning for Controllable Text Generation
Hanqing Zhang | Dawei Song
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Prompt learning with immensely large Casual Language Models (CLMs) has been shown promising for attribute-controllable text generation (CTG). However, vanilla prompt tuning tends to imitate training corpus characteristics beyond the control attributes, resulting in a poor generalization ability. Moreover, it is less able to capture the relationship between different attributes, further limiting the control performance. In this paper, we propose a new CTG approach, namely DisCup, which incorporates the attribute knowledge of discriminator to optimize the control-prompts, steering a frozen CLM to produce attribute-specific texts. Specifically, the frozen CLM model, capable of producing multitudinous texts, is first used to generate the next-token candidates based on the context, so as to ensure the diversity of tokens to be predicted. Then, we leverage an attribute-discriminator to select desired/undesired tokens from those candidates, providing the inter-attribute knowledge. Finally, we bridge the above two traits by an unlikelihood objective for prompt-tuning. Extensive experimental results show that DisCup can achieve a new state-of-the-art control performance while maintaining an efficient and high-quality text generation, only relying on around 10 virtual tokens.

pdf bib abs

Sparse Teachers Can Be Dense with Knowledge
Yi Yang | Chen Zhang | Dawei Song
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Recent advances in distilling pretrained language models have discovered that, besides the expressiveness of knowledge, the student-friendliness should be taken into consideration to realize a truly knowledgeable teacher. Based on a pilot study, we find that over-parameterized teachers can produce expressive yet student-unfriendly knowledge and are thus limited in overall knowledgeableness. To remove the parameters that result in student-unfriendliness, we propose a sparse teacher trick under the guidance of an overall knowledgeable score for each teacher parameter. The knowledgeable score is essentially an interpolation of the expressiveness and student-friendliness scores. The aim is to ensure that the expressive parameters are retained while the student-unfriendly ones are removed. Extensive experiments on the GLUE benchmark show that the proposed sparse teachers can be dense with knowledge and lead to students with compelling performance in comparison with a series of competitive baselines.

pdf bib abs

Prompt tuning learns soft prompts to condition the frozen Pre-trained Language Models (PLMs) for performing downstream tasks in a parameter-efficient manner. While prompt tuning has gradually reached the performance level of fine-tuning as the model scale increases, there is still a large performance gap between prompt tuning and fine-tuning for models of moderate and small scales (typically less than 11B parameters). In this paper, we empirically show that the trained prompt tokens can have a negative impact on a downstream task and thus degrade its performance. To bridge the gap, we propose a novel Prompt tuning model with an eXtremely small scale (XPrompt) under the regime of lottery tickets hypothesis. Specifically, XPrompt eliminates the negative prompt tokens at different granularity levels through a hierarchical structured pruning, yielding a more parameter-efficient prompt yet with a competitive performance. Comprehensive experiments are carried out on the SuperGLUE tasks, and the results indicate that XPrompt is able to close the performance gap at smaller model scales.

pdf bib abs

Beyond Emotion: A Multi-Modal Dataset for Human Desire Understanding
Ao Jia | Yu He | Yazhou Zhang | Sagar Uprety | Dawei Song | Christina Lioma
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Desire is a strong wish to do or have something, which involves not only a linguistic expression, but also underlying cognitive phenomena driving human feelings. As the most primitive and basic human instinct, conscious desire is often accompanied by a range of emotional responses. As a strikingly understudied task, it is difficult for machines to model and understand desire due to the unavailability of benchmarking datasets with desire and emotion labels. To bridge this gap, we present MSED, the first multi-modal and multi-task sentiment, emotion and desire dataset, which contains 9,190 text-image pairs, with English text. Each multi-modal sample is annotated with six desires, three sentiments and six emotions. We also propose the state-of-the-art baselines to evaluate the potential of MSED and show the importance of multi-task and multi-modal clues for desire understanding. We hope this study provides a benchmark for human desire analysis. MSED will be publicly available for research.

2021

pdf bib

How does Attention Affect the Model?
Cheng Zhang | Qiuchi Li | Lingyu Hua | Dawei Song
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib

Exploiting Position Bias for Robust Aspect Sentiment Classification
Fang Ma | Chen Zhang | Dawei Song
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib abs

What Does Your Smile Mean? Jointly Detecting Multi-Modal Sarcasm and Sentiment Using Quantum Probability
Yaochen Liu | Yazhou Zhang | Qiuchi Li | Benyou Wang | Dawei Song
Findings of the Association for Computational Linguistics: EMNLP 2021

Sarcasm and sentiment embody intrinsic uncertainty of human cognition, making joint detection of multi-modal sarcasm and sentiment a challenging task. In view of the advantages of quantum probability (QP) in modeling such uncertainty, this paper explores the potential of QP as a mathematical framework and proposes a QP driven multi-task (QPM) learning framework. The QPM framework involves a complex-valued multi-modal representation encoder, a quantum-like fusion subnetwork and a quantum measurement mechanism. Each multi-modal (e.g., textual, visual) utterance is first encoded as a quantum superposition of a set of basis terms using a complex-valued representation. Then, the quantum-like fusion subnetwork leverages quantum state composition and quantum interference to model the contextual interaction between adjacent utterances and the correlations across modalities respectively. Finally, quantum incompatible measurements are performed on the multi-modal representation of each utterance to yield the probabilistic outcomes of sarcasm and sentiment recognition. The experimental results show that our model achieves a state-of-the-art performance.

2020

pdf bib abs

A Multi-task Learning Framework for Opinion Triplet Extraction
Chen Zhang | Qiuchi Li | Dawei Song | Benyou Wang
Findings of the Association for Computational Linguistics: EMNLP 2020

The state-of-the-art Aspect-based Sentiment Analysis (ABSA) approaches are mainly based on either detecting aspect terms and their corresponding sentiment polarities, or co-extracting aspect and opinion terms. However, the extraction of aspect-sentiment pairs lacks opinion terms as a reference, while co-extraction of aspect and opinion terms would not lead to meaningful pairs without determining their sentiment dependencies. To address the issue, we present a novel view of ABSA as an opinion triplet extraction task, and propose a multi-task learning framework to jointly extract aspect terms and opinion terms, and simultaneously parses sentiment dependencies between them with a biaffine scorer. At inference phase, the extraction of triplets is facilitated by a triplet decoding method based on the above outputs. We evaluate the proposed framework on four SemEval benchmarks for ASBA. The results demonstrate that our approach significantly outperforms a range of strong baselines and state-of-the-art approaches.

2019

pdf bib abs

Aspect-based Sentiment Classification with Aspect-specific Graph Convolutional Networks
Chen Zhang | Qiuchi Li | Dawei Song
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Due to their inherent capability in semantic alignment of aspects and their context words, attention mechanism and Convolutional Neural Networks (CNNs) are widely applied for aspect-based sentiment classification. However, these models lack a mechanism to account for relevant syntactical constraints and long-range word dependencies, and hence may mistakenly recognize syntactically irrelevant contextual words as clues for judging aspect sentiment. To tackle this problem, we propose to build a Graph Convolutional Network (GCN) over the dependency tree of a sentence to exploit syntactical information and word dependencies. Based on it, a novel aspect-specific sentiment classification framework is raised. Experiments on three benchmarking collections illustrate that our proposed model has comparable effectiveness to a range of state-of-the-art models, and further demonstrate that both syntactical information and long-range word dependencies are properly captured by the graph convolution structure.

2018

pdf bib abs

Quantum-Inspired Complex Word Embedding
Qiuchi Li | Sagar Uprety | Benyou Wang | Dawei Song
Proceedings of the Third Workshop on Representation Learning for NLP

A challenging task for word embeddings is to capture the emergent meaning or polarity of a combination of individual words. For example, existing approaches in word embeddings will assign high probabilities to the words “Penguin” and “Fly” if they frequently co-occur, but it fails to capture the fact that they occur in an opposite sense - Penguins do not fly. We hypothesize that humans do not associate a single polarity or sentiment to each word. The word contributes to the overall polarity of a combination of words depending upon which other words it is combined with. This is analogous to the behavior of microscopic particles which exist in all possible states at the same time and interfere with each other to give rise to new states depending upon their relative phases. We make use of the Hilbert Space representation of such particles in Quantum Mechanics where we subscribe a relative phase to each word, which is a complex number, and investigate two such quantum inspired models to derive the meaning of a combination of words. The proposed models achieve better performances than state-of-the-art non-quantum models on binary sentence classification tasks.

Dawei Song

2025

2024

2023

2022

2021

2020

2019

2018

2015

2010

Co-authors

Venues