Xuanyu Zhang


2024

Python is Not Always the Best Choice: Embracing Multilingual Program of Thoughts
Xianzhen Luo | Qingfu Zhu | Zhiming Zhang | Libo Qin | Xuanyu Zhang | Qing Yang | Dongliang Xu | Wanxiang Che
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Program of Thoughts (PoT) is an approach characterized by its executable intermediate steps, which ensure the accuracy of the logical calculations in the reasoning process. Currently, PoT primarily uses Python. However, relying solely on a single language may result in suboptimal solutions and overlook the potential benefits of other programming languages. In this paper, we conduct comprehensive experiments on the programming languages used in PoT and find that no single language consistently delivers optimal performance across all tasks and models; the effectiveness of each language varies with the specific scenario. Inspired by this, we propose a task- and model-agnostic approach called MultiPoT, which harnesses the strengths and diversity of multiple programming languages. Experimental results reveal that it significantly outperforms Python Self-Consistency. Furthermore, it achieves comparable or superior performance to the best monolingual PoT in almost all tasks across all models. In particular, MultiPoT achieves more than a 4.6% improvement on average on ChatGPT (gpt-3.5-turbo-0701).
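
The abstract does not spell out how answers from different languages are combined, so the following is only a minimal sketch of MultiPoT-style aggregation under the assumption of self-consistency-style majority voting over answers returned by programs in several languages; aggregate_answers and the example values are hypothetical.

from collections import Counter

# Hypothetical sketch: each candidate program (possibly in a different
# language) has been executed, and the final answers are combined by
# majority vote, analogous to self-consistency.
def aggregate_answers(results):
    """results: list of (language, answer) pairs from executed programs."""
    answers = [ans for _, ans in results if ans is not None]  # drop failed runs
    if not answers:
        return None
    # Majority vote over the answers produced by the different languages.
    return Counter(answers).most_common(1)[0][0]

# Example: three languages agree on 42, one run returned 41.
print(aggregate_answers([("python", 41), ("r", 42), ("js", 42), ("java", 42)]))  # 42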

Make Some Noise: Unlocking Language Model Parallel Inference Capability through Noisy Training
Yixuan Wang | Xianzhen Luo | Fuxuan Wei | Yijun Liu | Qingfu Zhu | Xuanyu Zhang | Qing Yang | Dongliang Xu | Wanxiang Che
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Existing speculative decoding methods typically require additional model structures and training processes to assist the model in draft token generation. This makes migrating such acceleration methods to new models more costly and more demanding on device memory. To address this problem, we propose the Make Some Noise (MSN) training framework as a replacement for the supervised fine-tuning stage of the large language model. The training method simply introduces some noise at the input so that the model learns a denoising task, which significantly enhances its parallel decoding capability without affecting the original task capability. In addition, we propose a tree-based retrieval-augmented Jacobi (TR-Jacobi) decoding strategy to further improve the inference speed of MSN models. Experiments in both the general and code domains show that MSN can improve inference speed by 2.3-2.7x without compromising model performance. The MSN model also achieves acceleration ratios comparable to SOTA models that add extra model structure on Spec-Bench.
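
The exact noise scheme is not described in the abstract; the sketch below only illustrates the general idea of corrupting input tokens while keeping the clean sequence as the training target. The 15% rate, random-token replacement, and all names are illustrative assumptions rather than the paper's recipe.

import random

# Minimal sketch of input-noising for a denoising objective: some input
# tokens are randomly corrupted while the training target stays the clean
# sequence.
def add_noise(token_ids, vocab_size, noise_rate=0.15, seed=0):
    rng = random.Random(seed)
    noisy = list(token_ids)
    for i in range(len(noisy)):
        if rng.random() < noise_rate:
            noisy[i] = rng.randrange(vocab_size)  # replace with a random token
    return noisy

clean = [101, 2054, 2003, 1996, 3007, 1029, 102]
print(add_noise(clean, vocab_size=30522))  # noisy inputs; `clean` remains the target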

SAPT: A Shared Attention Framework for Parameter-Efficient Continual Learning of Large Language Models
Weixiang Zhao | Shilong Wang | Yulin Hu | Yanyan Zhao | Bing Qin | Xuanyu Zhang | Qing Yang | Dongliang Xu | Wanxiang Che
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The continual learning (CL) ability is vital for deploying large language models (LLMs) in the dynamic world. Existing methods devise a learning module to acquire task-specific knowledge with a parameter-efficient tuning (PET) block and a selection module to pick out the corresponding block for the test input, aiming to handle the challenges of catastrophic forgetting and knowledge transfer in CL. However, these methods tend to address only one of the challenges, ignoring the potential of aligning the two modules to address catastrophic forgetting and knowledge transfer simultaneously. To this end, we propose a novel Shared Attention Framework (SAPT) that aligns PET learning and selection via a Shared Attentive Learning & Selection module. Extensive experiments on two CL benchmarks demonstrate the superiority of SAPT. Moreover, SAPT maintains this advantage when scaled to different model sizes (from 770M to 13B), different model architectures (T5 and LLaMA-2), and unseen tasks.
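
As a rough illustration of sharing one attention between learning and selection, here is a hedged numpy sketch in which a single attention distribution over task-specific PET blocks both mixes their outputs and picks a block at test time; the dot-product scoring, shapes, and names are assumptions, not the paper's architecture.

import numpy as np

# Hedged sketch: one attention distribution, computed from the input
# representation, weights the PET outputs during learning and selects a
# block at test time.
def shared_attention(instance_repr, task_keys, pet_outputs):
    scores = task_keys @ instance_repr                 # (num_tasks,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                           # softmax over tasks
    mixed = weights @ pet_outputs                      # attention-weighted PET output
    selected = int(weights.argmax())                   # hard selection at test time
    return mixed, selected, weights

rng = np.random.default_rng(0)
h = rng.normal(size=16)                # instance representation
keys = rng.normal(size=(4, 16))        # one key per learned task
outs = rng.normal(size=(4, 16))        # outputs of the 4 PET blocks
print(shared_attention(h, keys, outs)[1])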

Improving Factual Consistency in Abstractive Summarization with Sentence Structure Pruning
Dingxin Hu | Xuanyu Zhang | Xingyue Zhang | Yiyang Li | Dongsheng Chen | Marina Litvak | Natalia Vanetik | Qing Yang | Dongliang Xu | Yanquan Zhou | Lei Li | Yuze Li | Yingqi Zhu
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

State-of-the-art abstractive summarization models still suffer from content contradictions between the summaries and the input text, which is referred to as the factual inconsistency problem. Recently, a large number of works have been proposed to evaluate factual consistency or to improve it with post-editing methods. However, these post-editing methods typically focus on replacing suspicious entities, failing to identify and modify incorrect content hidden in sentence structures. In this paper, we first verify that the set of correctable errors can be enriched by a sentence-structure pruning operation, and then propose a post-editing method based on it. In the correction process, pruning of possible errors is performed on the syntactic dependency tree under the guidance of multiple factual evaluation metrics. Experiments on the FRANK dataset show a substantial improvement in factual consistency over strong baselines, and combining our method with them achieves even better performance. All code and data will be released upon paper acceptance.
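
A minimal sketch of metric-guided pruning, assuming a toy token list stands in for the dependency tree and a stand-in consistency_score plays the role of the factual evaluation metrics; a pruned variant is kept only if the scorer prefers it. All names and the scoring function are illustrative.

# Illustrative sketch: candidate modifiers are removed from a summary
# sentence one at a time, and a pruned variant is kept only if a
# factual-consistency scorer prefers it.
def prune_candidates(tokens, prunable_indices):
    """Yield sentence variants with one prunable token removed."""
    for i in prunable_indices:
        yield [t for j, t in enumerate(tokens) if j != i]

def best_variant(tokens, prunable_indices, consistency_score):
    best, best_score = tokens, consistency_score(tokens)
    for variant in prune_candidates(tokens, prunable_indices):
        score = consistency_score(variant)
        if score > best_score:          # keep the pruning only if it helps
            best, best_score = variant, score
    return " ".join(best)

tokens = ["The", "company", "reportedly", "fired", "its", "CEO"]
fake_score = lambda toks: 1.0 if "reportedly" not in toks else 0.5  # stand-in metric
print(best_variant(tokens, prunable_indices=[2], consistency_score=fake_score))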

2023

Adaptive Attention for Sparse-based Long-sequence Transformer
Xuanyu Zhang | Zhepeng Lv | Qing Yang
Findings of the Association for Computational Linguistics: ACL 2023

Recently, Transformers have been widely used in various fields and have achieved remarkable results. However, it is still difficult for Transformer-based models to process long sequences, because their self-attention scales quadratically with the sequence length. Although some models attempt to use sparse attention to reduce computational complexity, hand-crafted attention patterns cannot select useful tokens adaptively according to the context. Thus, in this paper, we propose a novel efficient Transformer model with adaptive attention, A2-Former, for long-sequence modeling. It selects useful tokens automatically in sparse attention via learnable position vectors, which consist of meta position and offset position vectors. Because the learnable offset position is not an integer vector, we use interpolation to gather the corresponding vectors from the input embedding matrix at discrete indices. Experiments on Long Range Arena (LRA), a systematic and unified benchmark with different tasks, show that our model achieves further improvements in performance compared with other sparse-based Transformers.
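
Since the abstract only names the technique, the following is a minimal sketch of an interpolation-based gather: a fractional (non-integer) position is resolved by linearly interpolating between the two neighboring rows of the input embedding matrix. Shapes and names are illustrative assumptions.

import numpy as np

# Gather vectors at fractional positions by linear interpolation between
# the two nearest integer indices of the embedding matrix.
def gather_interpolated(embeddings, positions):
    """embeddings: (seq_len, dim); positions: (k,) fractional indices."""
    lo = np.floor(positions).astype(int)
    hi = np.clip(lo + 1, 0, len(embeddings) - 1)
    frac = (positions - lo)[:, None]
    return (1.0 - frac) * embeddings[lo] + frac * embeddings[hi]

emb = np.arange(20, dtype=float).reshape(5, 4)   # toy (seq_len=5, dim=4) matrix
print(gather_interpolated(emb, np.array([0.0, 1.5, 3.25])))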

Generating Extractive Answers: Gated Recurrent Memory Reader for Conversational Question Answering
Xuanyu Zhang | Qing Yang
Findings of the Association for Computational Linguistics: EMNLP 2023

Conversational question answering (CQA) is a more complicated task than traditional single-turn machine reading comprehension (MRC). Unlike large language models (LLMs) such as ChatGPT, CQA models need to extract answers from the given content to answer follow-up questions according to the conversation history. In this paper, we propose a novel architecture, the Gated Recurrent Memory Reader (GRMR), which integrates traditional extractive MRC models into a generalized sequence-to-sequence framework. After the passage is encoded, the decoder generates the extractive answers turn by turn. Unlike previous models that superficially and redundantly concatenate the previous questions and answers as context, our model uses less storage space and considers the historical memory deeply and selectively. Experiments on the Conversational Question Answering (CoQA) dataset show that our model achieves comparable results to most models with the least space occupancy.
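
To make "generating extractive answers turn by turn" concrete, here is a toy sketch in which the passage is encoded once, a recurrent memory is updated with each question, and every turn's answer is a span over the passage; the memory update rule, scoring, and all names are illustrative assumptions, not the GRMR architecture.

import numpy as np

# Toy turn-by-turn extractive decoding: a memory vector carries the
# conversation history and is used to score passage tokens for each turn.
def answer_turns(passage_enc, question_encs, alpha=0.5):
    memory = np.zeros(passage_enc.shape[1])
    spans = []
    for q in question_encs:
        memory = alpha * memory + (1 - alpha) * q       # carry conversation history
        scores = passage_enc @ memory                    # score each passage token
        start = int(scores.argmax())
        end = start + int(scores[start:].argmax())       # best end at or after start
        spans.append((start, end))
    return spans

rng = np.random.default_rng(1)
passage = rng.normal(size=(30, 8))       # 30 passage tokens
questions = rng.normal(size=(3, 8))      # 3 conversation turns
print(answer_turns(passage, questions))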

2022

TranS: Transition-based Knowledge Graph Embedding with Synthetic Relation Representation
Xuanyu Zhang | Qing Yang | Dongliang Xu
Findings of the Association for Computational Linguistics: EMNLP 2022

Knowledge graph embedding (KGE) aims to learn continuous vector representations of the relations and entities in a knowledge graph (KG). Recently, transition-based KGE methods have become popular and achieved promising performance. However, scoring patterns like that of TransE are not suitable for complex scenarios where the same entity pair has different relations. Although some models employ entity-relation interaction or projection to improve entity representations for one-to-many, many-to-one, and many-to-many relations, they still follow the traditional scoring pattern, in which a single relation vector (or a variant of it) translates the head entity to the tail entity. Moreover, recent research shows that entity representations only need to consider entities and their interactions to achieve better performance. Thus, in this paper, we propose a novel transition-based method, TranS, for KGE. The single relation vector in the traditional scoring pattern is replaced by a synthetic relation representation built from entity-relation interactions, while the entity part retains its independence through entity-entity interactions. Experiments on a large KG dataset, ogbl-wikikg2, show that our model achieves state-of-the-art results.
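
For reference, the classic TransE pattern scores a triple by how well h + r approximates t. The abstract does not give TranS's exact composition, so the "synthetic relation" below is only an illustrative assumption of what replacing the single relation vector with entity-relation interaction terms could look like.

import numpy as np

# TransE-style scoring for reference, plus a hedged sketch of a synthetic
# relation built from entity-relation interaction terms.
def transe_score(h, r, t):
    return -np.linalg.norm(h + r - t)           # classic h + r ~ t pattern

def synthetic_relation_score(h, r, t, r_h, r_t):
    # r_h, r_t: auxiliary vectors capturing entity-relation interactions.
    synthetic_r = r + r_h * h + r_t * t          # illustrative composition, not TranS's exact form
    return -np.linalg.norm(h + synthetic_r - t)

rng = np.random.default_rng(2)
h, r, t, r_h, r_t = (rng.normal(size=8) for _ in range(5))
print(transe_score(h, r, t), synthetic_relation_score(h, r, t, r_h, r_t))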

Instance-Guided Prompt Learning for Few-Shot Text Matching
Jia Du | Xuanyu Zhang | Siyi Wang | Kai Wang | Yanquan Zhou | Lei Li | Qing Yang | Dongliang Xu
Findings of the Association for Computational Linguistics: EMNLP 2022

Few-shot text matching is a more practical setting for the natural language processing (NLP) task of determining whether two texts are semantically identical. Existing methods primarily design patterns to reformulate text matching as a pre-training task with uniform prompts across all instances, but they fail to take into account the connection between prompts and instances. This paper argues that dynamically strengthening the correlation between particular instances and the prompts is necessary, because fixed prompts cannot adequately fit all diverse instances at inference time. We propose IGATE: Instance-Guided prompt leArning for few-shoT tExt matching, a novel pluggable prompt learning method. The gate mechanism used by IGATE, which sits between the embedding layer and the PLM encoder, uses the semantics of each instance to regulate the effect of the gate on the prompt tokens. The experimental results show that IGATE achieves SOTA performance on MRPC and QQP, outperforming strong baselines. The code will be released on GitHub.
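
As a rough illustration of an instance-guided gate between the embedding layer and the encoder, the hedged sketch below scales each soft prompt embedding by a gate computed from the instance representation; the mean-pooled summary, gating weights, and all names are assumptions rather than IGATE's actual design.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A gate computed from the instance representation modulates each prompt
# token embedding before the (frozen) PLM encoder sees it.
def gate_prompts(prompt_emb, instance_emb, w_gate):
    instance_summary = instance_emb.mean(axis=0)             # (dim,)
    gate = sigmoid(prompt_emb @ w_gate @ instance_summary)   # one gate per prompt token
    return gate[:, None] * prompt_emb                         # modulated prompt embeddings

rng = np.random.default_rng(3)
prompts = rng.normal(size=(5, 16))      # 5 soft prompt tokens
instance = rng.normal(size=(12, 16))    # token embeddings of one input pair
w = rng.normal(size=(16, 16))
print(gate_prompts(prompts, instance, w).shape)   # (5, 16)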

2019

MC^2: Multi-perspective Convolutional Cube for Conversational Machine Reading Comprehension
Xuanyu Zhang
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Conversational machine reading comprehension (CMRC) extends traditional single-turn machine reading comprehension (MRC) with multi-turn interactions, which requires machines to consider the history of the conversation. Most models simply combine previous questions for conversation understanding and only employ recurrent neural networks (RNNs) for reasoning. To comprehend context profoundly and efficiently from different perspectives, we propose a novel neural network model, the Multi-perspective Convolutional Cube (MC^2). We regard each conversation as a cube, and 1D and 2D convolutions are integrated with an RNN in our model. To prevent the model from previewing future turns of the conversation, we also partially extend causal convolution to 2D. Experiments on the Conversational Question Answering (CoQA) dataset show that our model achieves state-of-the-art results.
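
To illustrate what "causal" means along the conversation-turn axis, the sketch below pads the input only on the past side so that the output at turn t never depends on later turns; the 1D convolution, kernel size, and shapes are illustrative simplifications of the 2D case.

import numpy as np

# Minimal sketch of a causal convolution over conversation turns: left-pad
# only, so turn t's output uses turns <= t.
def causal_conv_over_turns(x, kernel):
    """x: (turns, dim); kernel: (k, dim) -> (turns,) causal conv output."""
    k = kernel.shape[0]
    padded = np.concatenate([np.zeros((k - 1, x.shape[1])), x], axis=0)  # pad the past side only
    return np.array([np.sum(padded[t:t + k] * kernel) for t in range(x.shape[0])])

turns = np.arange(12, dtype=float).reshape(4, 3)   # 4 turns, 3 features each
kern = np.ones((2, 3))
print(causal_conv_over_turns(turns, kern))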