2024
Advancing Large Language Model Attribution through Self-Improving
Lei Huang | Xiaocheng Feng | Weitao Ma | Liang Zhao | Yuchun Fan | Weihong Zhong | Dongliang Xu | Qing Yang | Hongtao Liu | Bing Qin
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Teaching large language models (LLMs) to generate text with citations to evidence sources can mitigate hallucinations and enhance verifiability in information-seeking systems. However, improving this capability requires high-quality attribution data, which is costly and labor-intensive. Inspired by recent advances in self-improvement that enhance LLMs without manual annotation, we present START, a Self-Taught AttRibuTion framework for iteratively improving the attribution capability of LLMs. First, to prevent models from stagnating due to initially insufficient supervision signals, START leverages the model to self-construct synthetic training data for warming up. To further self-improve the model’s attribution ability, START iteratively utilizes fine-grained preference supervision signals constructed from its sampled responses to encourage robust, comprehensive, and attributable generation. Experiments on three open-domain question-answering datasets, covering long-form QA and multi-step reasoning, demonstrate significant performance gains of 25.13% on average without relying on human annotations and more advanced models. Further analysis reveals that START excels in aggregating information across multiple sources.
Python is Not Always the Best Choice: Embracing Multilingual Program of Thoughts
Xianzhen Luo | Qingfu Zhu | Zhiming Zhang | Libo Qin | Xuanyu Zhang | Qing Yang | Dongliang Xu | Wanxiang Che
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Program of Thoughts (PoT) is an approach characterized by its executable intermediate steps, which ensure the accuracy of the logical calculations in the reasoning process. Currently, PoT primarily uses Python. However, relying solely on a single language may result in suboptimal solutions and overlook the potential benefits of other programming languages. In this paper, we conduct comprehensive experiments on the programming languages used in PoT and find that no single language consistently delivers optimal performance across all tasks and models. The effectiveness of each language varies depending on the specific scenarios. Inspired by this, we propose a task and model agnostic approach called MultiPoT, which harnesses strength and diversity from various languages. Experimental results reveal that it significantly outperforms Python Self-Consistency. Furthermore, it achieves comparable or superior performance compared to the best monolingual PoT in almost all tasks across all models. In particular, MultiPoT achieves more than 4.6% improvement on average on ChatGPT (gpt-3.5-turbo-0701).
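As a rough illustration of combining results across languages, the sketch below takes one executed answer per language and returns the majority answer. The abstract does not say how MultiPoT integrates results, so simple majority voting is an assumption here, and the function name and sample answers are hypothetical.

```python
from collections import Counter


def vote_across_languages(answers_by_language):
    """Return the most common non-None answer.

    Ties go to the answer that was seen first (Counter keeps insertion order).
    None marks a program that failed to run or produced no answer.
    """
    valid = [a for a in answers_by_language.values() if a is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


# Hypothetical answers from running the same problem's program in each language.
answers = {"python": 42, "r": 42, "cpp": 41, "java": 42, "javascript": None}
result = vote_across_languages(answers)
print(result)
```

A failed execution in one language simply drops out of the vote, which is one way a multilingual ensemble can be more robust than repeated sampling in a single language.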
Extending Context Window of Large Language Models from a Distributional Perspective
Yingsheng Wu | Yuxuan Gu | Xiaocheng Feng | Weihong Zhong | Dongliang Xu | Qing Yang | Hongtao Liu | Bing Qin
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Scaling the rotary position embedding (RoPE) has become a common method for extending the context window of RoPE-based large language models (LLMs). However, existing scaling methods often rely on empirical approaches and lack a profound understanding of the internal distribution within RoPE, resulting in suboptimal performance in extending the context window length. In this paper, we propose to optimize the context window extending task from the view of rotary angle distribution. Specifically, we first estimate the distribution of the rotary angles within the model and analyze the extent to which length extension perturbs this distribution. Then, we present a novel extension strategy that minimizes the disturbance between rotary angle distributions to maintain consistency with the pre-training phase, enhancing the model’s capability to generalize to longer sequences. Experimental results compared to the strong baseline methods demonstrate that our approach reduces the distributional disturbance by up to 72% when extending LLaMA2’s context window to 8k, and by up to 32% when extending to 16k. On the LongBench-E benchmark, our method achieves an average improvement of up to 4.33% over existing state-of-the-art methods. Furthermore, our method maintains the model’s performance on the Hugging Face Open LLM benchmark after context window extension, with an average performance fluctuation ranging only from -0.12 to +0.22.
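For readers unfamiliar with RoPE scaling, the sketch below computes rotary angles and applies linear position interpolation, a common baseline scaling scheme. It is not the distribution-based strategy proposed in the paper; the lengths, the tiny dimension, and the function name are illustrative assumptions.

```python
def rotary_angles(position, dim, base=10000.0, scale=1.0):
    """Angles (scale * position) * theta_i for each of the dim // 2 rotary pairs,
    with theta_i = base ** (-2 * i / dim)."""
    return [(position * scale) * base ** (-2.0 * i / dim) for i in range(dim // 2)]


# Illustrative lengths: trained on 4096 positions, extended to 8192.
train_len, target_len = 4096, 8192
scale = train_len / target_len  # linear position interpolation factor

# A position beyond the training range, squeezed back into it by scaling.
extended = rotary_angles(8000, dim=8, scale=scale)
# The in-range position it is mapped onto.
in_range = rotary_angles(4000, dim=8)
print(extended == in_range)
```

Scaling keeps every angle inside the range seen during pre-training, at the cost of compressing the spacing between neighbouring positions.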
GlobeSumm: A Challenging Benchmark Towards Unifying Multi-lingual, Cross-lingual and Multi-document News Summarization
Yangfan Ye | Xiachong Feng | Xiaocheng Feng | Weitao Ma | Libo Qin | Dongliang Xu | Qing Yang | Hongtao Liu | Bing Qin
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
News summarization in today’s global scene can be daunting with its flood of multilingual content and varied viewpoints from different sources. However, current studies often neglect such real-world scenarios as they tend to focus solely on either single-language or single-document tasks. To bridge this gap, we aim to unify Multi-lingual, Cross-lingual and Multi-document Summarization into a novel task, i.e., MCMS, which encapsulates the real-world requirements all-in-one. Nevertheless, the lack of a benchmark inhibits researchers from adequately studying this invaluable problem. To tackle this, we have meticulously constructed the GLOBESUMM dataset by first collecting a wealth of multilingual news reports and restructuring them into event-centric format. Additionally, we introduce the method of protocol-guided prompting for high-quality and cost-effective reference annotation. In MCMS, we also highlight the challenge of conflicts between news reports, in addition to the issues of redundancies and omissions, further enhancing the complexity of GLOBESUMM. Through extensive experimental analysis, we validate the quality of our dataset and elucidate the inherent challenges of the task. We firmly believe that GLOBESUMM, given its challenging nature, will greatly contribute to the multilingual communities and the evaluation of LLMs.
Make Some Noise: Unlocking Language Model Parallel Inference Capability through Noisy Training
Yixuan Wang | Xianzhen Luo | Fuxuan Wei | Yijun Liu | Qingfu Zhu | Xuanyu Zhang | Qing Yang | Dongliang Xu | Wanxiang Che
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Existing speculative decoding methods typically require additional model structure and training processes to assist the model for draft token generation. This makes the migration of acceleration methods to a new model more costly and more demanding on device memory. To address this problem, we propose the Make Some Noise (MSN) training framework as a replacement for the supervised fine-tuning stage of the large language model. The training method simply introduces some noise at the input for the model to learn the denoising task. It significantly enhances the parallel decoding capability of the model without affecting the original task capability. In addition, we propose a tree-based retrieval-augmented Jacobi (TR-Jacobi) decoding strategy to further improve the inference speed of MSN models. Experiments in both the general and code domains have shown that MSN can improve inference speed by 2.3-2.7x without compromising model performance. The MSN model also achieves comparable acceleration ratios to the SOTA model with additional model structure on Spec-Bench.
Length Extrapolation of Transformers: A Survey from the Perspective of Positional Encoding
Liang Zhao | Xiachong Feng | Xiaocheng Feng | Weihong Zhong | Dongliang Xu | Qing Yang | Hongtao Liu | Bing Qin | Ting Liu
Findings of the Association for Computational Linguistics: EMNLP 2024
Built upon the Transformer, large language models (LLMs) have captured worldwide attention due to their remarkable abilities. Nevertheless, all Transformer-based models including LLMs suffer from a preset length limit and can hardly generalize from short training sequences to longer inference ones; that is, they cannot perform **length extrapolation** to handle long sequences. Thus, numerous methods have emerged to enhance the length extrapolation of Transformers. Despite the great research efforts, a systematic survey is still lacking. To fill this gap, we delve into these advances in a unified notation from the perspective of positional encoding (PE), as it has been considered the primary factor in length extrapolation. Specifically, we begin with extrapolatable PEs that have dominated this research field. Then, we dive into extrapolation methods based on them, covering position interpolation and randomized position methods. Finally, several challenges and future directions in this area are highlighted. Through this survey, we aim to enable the reader to gain a deep understanding of existing methods and provide stimuli for future research.
SAPT: A Shared Attention Framework for Parameter-Efficient Continual Learning of Large Language Models
Weixiang Zhao | Shilong Wang | Yulin Hu | Yanyan Zhao | Bing Qin | Xuanyu Zhang | Qing Yang | Dongliang Xu | Wanxiang Che
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The continual learning (CL) ability is vital for deploying large language models (LLMs) in the dynamic world. Existing methods devise a learning module to acquire task-specific knowledge with a parameter-efficient tuning (PET) block and a selection module to pick out the corresponding one for the testing input, aiming at handling the challenges of catastrophic forgetting and knowledge transfer in CL. However, these methods tend to address only one of the challenges, ignoring the potential of aligning the two modules to effectively address catastrophic forgetting and knowledge transfer simultaneously. To this end, we propose a novel Shared Attention Framework (SAPT) to align the PET learning and selection via the Shared Attentive Learning & Selection module. Extensive experiments on two CL benchmarks demonstrate the superiority of SAPT. Moreover, SAPT consistently demonstrates its superiority when we scale it to different model sizes (from 770M to 13B), different model architectures (T5 and LLaMA-2) and unseen tasks.
Improving Factual Consistency in Abstractive Summarization with Sentence Structure Pruning
Dingxin Hu | Xuanyu Zhang | Xingyue Zhang | Yiyang Li | Dongsheng Chen | Marina Litvak | Natalia Vanetik | Qing Yang | Dongliang Xu | Yanquan Zhou | Lei Li | Yuze Li | Yingqi Zhu
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
State-of-the-art abstractive summarization models still suffer from content contradictions between the summaries and the input text, which is referred to as the factual inconsistency problem. Recently, a large number of works have been proposed to evaluate factual consistency or improve it by post-editing methods. However, these post-editing methods typically focus on replacing suspicious entities, failing to identify and modify incorrect content hidden in sentence structures. In this paper, we first verify that the correctable errors can be enriched by leveraging a sentence structure pruning operation, and then we propose a post-editing method based on that. In the correction process, the pruning operation on possible errors is performed on the syntactic dependency tree with the guidance of multiple factual evaluation metrics. Experiments on the FRANK dataset show a great improvement in factual consistency compared with strong baselines, and our method achieves even better performance when combined with them. All code and data will be released upon paper acceptance.
SmartTrim: Adaptive Tokens and Attention Pruning for Efficient Vision-Language Models
Zekun Wang | Jingchang Chen | Wangchunshu Zhou | Haichao Zhu | Jiafeng Liang | Liping Shan | Ming Liu | Dongliang Xu | Qing Yang | Bing Qin
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Despite achieving remarkable performance on various vision-language tasks, Transformer-based Vision-Language Models (VLMs) suffer from redundancy in inputs and parameters, significantly hampering their efficiency in real-world applications. Moreover, the degree of redundancy in token representations and model parameters, such as attention heads, varies significantly for different inputs. In light of these challenges, we propose SmartTrim, an adaptive acceleration framework for VLMs, which adjusts the computational overhead per instance. Specifically, we integrate lightweight modules into the original backbone to identify and prune redundant token representations and attention heads within each layer. Furthermore, we devise a self-distillation strategy to enhance the consistency between the predictions of the pruned model and its full-capacity counterpart. Experimental results across various vision-language tasks consistently demonstrate that SmartTrim accelerates the original model by 2-3 times with minimal performance degradation, highlighting the effectiveness and efficiency compared to previous approaches. Code will be available at https://github.com/kugwzk/SmartTrim.
2023
Adaptive Attention for Sparse-based Long-sequence Transformer
Xuanyu Zhang | Zhepeng Lv | Qing Yang
Findings of the Association for Computational Linguistics: ACL 2023
Recently, Transformers have been widely used in various fields and have achieved remarkable results. But it is still difficult for Transformer-based models to process longer sequences because self-attention in them scales quadratically with the sequence length. Although some models attempt to use sparse attention to reduce computational complexity, hand-crafted attention patterns are unable to select useful tokens adaptively according to the context. Thus, in this paper, we propose a novel efficient Transformer model with adaptive attention, A2-Former, for long sequence modeling. It can select useful tokens automatically in sparse attention by learnable position vectors, which consist of meta position and offset position vectors. Because the learnable offset position is not an integer vector, we utilize the interpolation technique to gather corresponding vectors from the input embedding matrix by discrete indexes. Experiments on Long Range Arena (LRA), a systematic and unified benchmark with different tasks, show that our model has achieved further improvement in performance compared with other sparse-based Transformers.
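The idea of gathering vectors at non-integer positions can be sketched as below. The abstract only says that an interpolation technique is used, so linear interpolation between the two neighbouring rows is an assumption here, and the function name and toy matrix are hypothetical.

```python
import math


def gather_fractional(embeddings, position):
    """Linearly interpolate between the two rows that bracket a fractional position.

    Positions outside the matrix are clamped to the first or last row.
    """
    last = len(embeddings) - 1
    lo = max(0, min(last, math.floor(position)))
    hi = min(last, lo + 1)
    w = min(1.0, max(0.0, position - lo))  # weight of the upper row
    return [(1 - w) * a + w * b for a, b in zip(embeddings[lo], embeddings[hi])]


# Toy input embedding matrix with three token rows.
rows = [[0.0, 0.0], [2.0, 4.0], [4.0, 8.0]]
mixed = gather_fractional(rows, 1.25)  # 75% of row 1 plus 25% of row 2
print(mixed)
```

Because the interpolation weights depend smoothly on the position, a learnable fractional offset can receive gradients, which hard integer indexing would not allow.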
Pre-trained Personalized Review Summarization with Effective Salience Estimation
Hongyan Xu | Hongtao Liu | Zhepeng Lv | Qing Yang | Wenjun Wang
Findings of the Association for Computational Linguistics: ACL 2023
Personalized review summarization in recommender systems is a challenging task of generating condensed summaries for product reviews while preserving the salient content of reviews. Recently, Pretrained Language Models (PLMs) have become a new paradigm in text generation owing to their strong natural language comprehension ability. However, it is nontrivial to apply PLMs to personalized review summarization directly, since there is rich personalized information (e.g., user preferences and product characteristics) to be considered, which is crucial to the salience estimation of the input review. In this paper, we propose a pre-trained personalized review summarization method, which aims to effectively incorporate the personalized information of users and products into the salience estimation of the input reviews. We design a personalized encoder that identifies the salient contents of the input sequence by jointly considering the semantic and personalized information (i.e., ratings, user and product IDs, and linguistic features), yielding personalized representations for the input reviews and history summaries separately. Moreover, we design an interactive information selection mechanism that further identifies the salient contents of the input reviews and selects relevant information from the history summaries. The results on real-world datasets show that our method performs better than the state-of-the-art baselines and can generate more readable summaries.
Generating Extractive Answers: Gated Recurrent Memory Reader for Conversational Question Answering
Xuanyu Zhang | Qing Yang
Findings of the Association for Computational Linguistics: EMNLP 2023
Conversational question answering (CQA) is a more complicated task than traditional single-turn machine reading comprehension (MRC). Different from large language models (LLMs) like ChatGPT, the models of CQA need to extract answers from given contents to answer follow-up questions according to conversation history. In this paper, we propose a novel architecture, i.e., Gated Recurrent Memory Reader (GRMR), which integrates traditional extractive MRC models into a generalized sequence-to-sequence framework. After the passage is encoded, the decoder will generate the extractive answers turn by turn. Different from previous models that concatenate the previous questions and answers as context superficially and redundantly, our model can use less storage space and consider historical memory deeply and selectively. Experiments on the Conversational Question Answering (CoQA) dataset show that our model achieves comparable results to most models with the least space occupancy.
PUNR: Pre-training with User Behavior Modeling for News Recommendation
Guangyuan Ma | Hongtao Liu | Xing W | Wanhui Qian | Zhepeng Lv | Qing Yang | Songlin Hu
Findings of the Association for Computational Linguistics: EMNLP 2023
News recommendation aims to predict click behaviors based on user behaviors. How to effectively model the user representations is the key to recommending preferred news. Existing works are mostly focused on improvements in the supervised fine-tuning stage. However, there is still a lack of PLM-based unsupervised pre-training methods optimized for user representations. In this work, we propose an unsupervised pre-training paradigm with two tasks, i.e. user behavior masking and user behavior generation, both towards effective user behavior modeling. Firstly, we introduce the user behavior masking pre-training task to recover the masked user behaviors based on their contextual behaviors. In this way, the model could capture a much stronger and more comprehensive user news reading pattern. Besides, we incorporate a novel auxiliary user behavior generation pre-training task to enhance the user representation vector derived from the user encoder. We use the above pre-trained user modeling encoder to obtain news and user representations in downstream fine-tuning. Evaluations on the real-world news benchmark show significant performance improvements over existing baselines.
MSCFFN: A New FFN with Multi-Space Cross to Accelerate Transformer
Tang Dongge | Qing Yang
Findings of the Association for Computational Linguistics: EMNLP 2023
Transformer models have achieved impressive success in various natural language processing tasks. However, their use remains limited in some areas, and heavy computational complexity is one of the main limitations. Many model structures have been proposed to reduce the computational complexity, and some are quite effective. Previous research can be divided into two categories: one uses more effective training and inference strategies, and the other focuses on replacing the standard self-attention mechanism with linear attention methods. In contrast, we revisit the design of the Transformer and find that the feed-forward network (FFN) is also computationally expensive, especially when the hidden dimension is large. In this paper, we propose a new FFN structure, named MSCFFN, which splits the large matrix space into several smaller spaces to reduce the computational complexity and uses the Multi-Space Cross method to preserve accuracy. To the best of our knowledge, this is the first work to redesign the FFN to accelerate Transformers. We experimentally validate the effectiveness of the proposed method on the Long-Range Arena benchmark. The results show that MSCFFN achieves faster speed with similar or even better accuracy.
Contrastive Pre-training for Personalized Expert Finding
Qiyao Peng | Hongtao Liu | Zhepeng Lv | Qing Yang | Wenjun Wang
Findings of the Association for Computational Linguistics: EMNLP 2023
Expert finding can help route questions to suitable potential answerers on Community Question Answering (CQA) platforms. Hence, it is essential to learn accurate representations of experts and questions from the question text. Recently, the pre-training and fine-tuning paradigm has proven powerful for natural language understanding and has the potential to improve question modeling and expert finding. Inspired by this, we propose a CQA-domain Contrastive Pre-training framework for Expert Finding, named CPEF, which can learn more comprehensive question representations. Specifically, considering that question titles and bodies are semantically complementary, we propose a title-body contrastive learning task for the domain pre-training phase to enhance question representations; it directly treats the question title and the corresponding body as positive samples of each other, instead of designing extra data-augmentation strategies. Furthermore, a personalized tuning network is proposed to inject the personalized preferences of different experts during the fine-tuning phase. Extensive experimental results on six real-world datasets demonstrate that our method achieves superior performance for expert finding.
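A title-body contrastive objective can be sketched with an InfoNCE-style loss, in which each title should score highest against its own body. The abstract only states that a title and its body are positives of each other; the use of other bodies in the batch as negatives, the temperature value, and all names below are assumptions.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(titles, bodies, temperature=0.1):
    """Mean cross-entropy of matching title i to body i against all bodies in the batch."""
    losses = []
    for i, title in enumerate(titles):
        logits = [dot(title, body) / temperature for body in bodies]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Toy unit-length embeddings for two questions.
titles = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(titles, [[1.0, 0.0], [0.0, 1.0]])  # bodies match their titles
swapped = info_nce(titles, [[0.0, 1.0], [1.0, 0.0]])  # bodies paired with the wrong titles
print(aligned < swapped)
```

The loss is near zero when each title is closest to its own body and grows when the pairing is wrong, which is the signal that pulls title and body representations together.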
2022
TranS: Transition-based Knowledge Graph Embedding with Synthetic Relation Representation
Xuanyu Zhang | Qing Yang | Dongliang Xu
Findings of the Association for Computational Linguistics: EMNLP 2022
Knowledge graph embedding (KGE) aims to learn continuous vector representations of relations and entities in knowledge graph (KG). Recently, transition-based KGE methods have become popular and achieved promising performance. However, scoring patterns like TransE are not suitable for complex scenarios where the same entity pair has different relations. Although some models attempt to employ entity-relation interaction or projection to improve entity representation for one-to-many/many-to-one/many-to-many complex relations, they still continue the traditional scoring pattern, where only a single relation vector in the relation part is used to translate the head entity to the tail entity or their variants. And recent research shows that entity representation only needs to consider entities and their interactions to achieve better performance. Thus, in this paper, we propose a novel transition-based method, TranS, for KGE. The single relation vector of the relation part in the traditional scoring pattern is replaced by the synthetic relation representation with entity-relation interactions to solve these issues. And the entity part still retains its independence through entity-entity interactions. Experiments on a large KG dataset, ogbl-wikikg2, show that our model achieves state-of-the-art results.
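The limitation of a single relation vector can be seen in the well-known TransE scoring function, which rewards head + relation ≈ tail. The sketch below uses the L1 distance and toy two-dimensional vectors chosen for illustration; it shows the baseline scoring pattern only, not the synthetic relation representation proposed in TranS.

```python
def transe_score(head, relation, tail):
    """Negative L1 distance between head + relation and tail (higher is better)."""
    return -sum(abs(h + r - t) for h, r, t in zip(head, relation, tail))


# Toy embeddings for one entity pair.
head, tail = [0.0, 1.0], [1.0, 3.0]

# The only relation vector that scores this pair perfectly is tail - head.
best_relation = [t - h for h, t in zip(head, tail)]
other_relation = [0.0, 0.0]

print(transe_score(head, best_relation, tail))
print(transe_score(head, other_relation, tail))
```

If the same entity pair holds under two different relations, both relation vectors are pushed toward the same point, tail minus head, so the model struggles to tell the relations apart.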
Instance-Guided Prompt Learning for Few-Shot Text Matching
Jia Du | Xuanyu Zhang | Siyi Wang | Kai Wang | Yanquan Zhou | Lei Li | Qing Yang | Dongliang Xu
Findings of the Association for Computational Linguistics: EMNLP 2022
Few-shot text matching is a more practical technique in natural language processing (NLP) to determine whether two texts are semantically identical. Existing methods primarily design patterns to reformulate text matching into a pre-training task with uniform prompts across all instances, but they fail to take into account the connection between prompts and instances. This paper argues that dynamically strengthening the correlation between particular instances and the prompts is necessary because fixed prompts cannot adequately fit all diverse instances in inference. We propose IGATE: Instance-Guided prompt leArning for few-shoT tExt matching, a novel pluggable prompt learning method. The gate mechanism used by IGATE, which sits between the embedding and the PLM encoders, makes use of the semantics of instances to regulate the effects of the gate on the prompt tokens. Experimental results show that IGATE achieves SOTA performance on MRPC and QQP, outperforming strong baselines. The code will be released on GitHub.