2024
SCALE: Synergized Collaboration of Asymmetric Language Translation Engines
Xin Cheng | Xun Wang | Tao Ge | Si-Qing Chen | Furu Wei | Dongyan Zhao | Rui Yan
Findings of the Association for Computational Linguistics: ACL 2024
In this paper, we introduce SCALE, a collaborative framework that connects a compact Specialized Translation Model (STM) and a general-purpose Large Language Model (LLM) as one unified translation engine. By introducing translations from the STM into triplet in-context demonstrations, SCALE unlocks the refinement and pivoting abilities of the LLM, thus 1) mitigating the language bias of LLMs and the parallel-data bias of STMs, 2) enhancing LLM speciality without sacrificing generality, and 3) facilitating continual learning in an LLM-tuning-free way. Our comprehensive experiments show that SCALE significantly outperforms both LLMs (GPT-4, GPT-3.5) and supervised models (NLLB, M2M) in both high-resource and challenging low-resource settings. Moreover, SCALE shows great scalability: updating only the lightweight STM yields consistent system improvements, an average gain of 4 BLEURT points across four languages without tuning the LLM. Interestingly, SCALE can also effectively exploit the existing language bias of LLMs by using an English-centric STM as a pivot to translate between any language pair, outperforming GPT-4 by an average of 6 COMET points across eight translation directions. Furthermore, we provide an in-depth analysis of SCALE’s robustness, translation characteristics, latency costs and inherent language bias, providing a solid foundation for future studies exploring the potential synergy between LLMs and more specialized models.
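To make the triplet-demonstration idea concrete, here is a minimal Python sketch (not the authors' released code) of how an STM draft might be wrapped into triplet in-context demonstrations for LLM refinement; `stm_translate`, `llm_complete`, and the prompt format are illustrative assumptions.

```python
# Illustrative sketch of SCALE-style prompting: each demonstration pairs a
# source sentence, a draft from the specialized translation model (STM),
# and a reference, and the LLM is asked to refine a new STM draft.
from typing import Callable, List, Tuple

Triplet = Tuple[str, str, str]  # (source, stm_draft, reference)

def build_scale_prompt(demos: List[Triplet], src: str, stm_draft: str) -> str:
    """Format triplet in-context demonstrations followed by the test input."""
    parts = []
    for demo_src, demo_draft, demo_ref in demos:
        parts.append(f"Source: {demo_src}\nDraft (STM): {demo_draft}\nRefined: {demo_ref}\n")
    parts.append(f"Source: {src}\nDraft (STM): {stm_draft}\nRefined:")
    return "\n".join(parts)

def scale_translate(
    src: str,
    demos: List[Triplet],
    stm_translate: Callable[[str], str],   # compact specialized model (hypothetical stand-in)
    llm_complete: Callable[[str], str],    # general-purpose LLM (hypothetical stand-in)
) -> str:
    draft = stm_translate(src)             # STM produces an initial draft
    prompt = build_scale_prompt(demos, src, draft)
    return llm_complete(prompt).strip()    # LLM refines (or pivots from) the draft
```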
2023
Causality-Guided Multi-Memory Interaction Network for Multivariate Stock Price Movement Prediction
Di Luo | Weiheng Liao | Shuqi Li | Xin Cheng | Rui Yan
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Over the past few years, we have witnessed enormous interest in stock price movement prediction using AI techniques. In recent literature, auxiliary data, such as textual news, has been used to improve prediction accuracy. When predicting a particular stock, we assume that information from other stocks should also be utilized as auxiliary data to enhance performance. In this paper, we propose the Causality-guided Multi-memory Interaction Network (CMIN), a novel end-to-end deep neural network for stock movement prediction which, for the first time, models the multi-modality between financial text data and causality-enhanced stock correlations to achieve higher prediction accuracy. CMIN transforms the basic attention mechanism into Causal Attention by calculating transfer entropy between multivariate stocks in order to avoid attention on spurious correlations. Furthermore, we introduce a fusion mechanism to model the multi-directional interactions, through which CMIN learns not only the self-influence but also the interactive influence in information flows representing the interrelationship between text and stock correlations. The effectiveness of the proposed approach is demonstrated by experiments on three real-world datasets collected from the U.S. and Chinese markets, where CMIN outperforms existing models and establishes a new state-of-the-art prediction accuracy.
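The following sketch illustrates, under assumptions of my own (bin counts, threshold, shapes), how pairwise transfer entropy between stock return series could gate an attention matrix in the spirit of Causal Attention; it is not the paper's implementation.

```python
# Histogram-based transfer entropy used to mask scaled dot-product attention,
# so that a stock attends only to stocks with measurable causal influence.
import numpy as np

def transfer_entropy(x: np.ndarray, y: np.ndarray, bins: int = 3) -> float:
    """Histogram estimate of T(X -> Y) with one-step history."""
    xd = np.digitize(x, np.histogram_bin_edges(x, bins)[1:-1])
    yd = np.digitize(y, np.histogram_bin_edges(y, bins)[1:-1])
    y_next, y_prev, x_prev = yd[1:], yd[:-1], xd[:-1]
    te = 0.0
    for yn in np.unique(y_next):
        for yp in np.unique(y_prev):
            for xp in np.unique(x_prev):
                p_joint = np.mean((y_next == yn) & (y_prev == yp) & (x_prev == xp))
                if p_joint == 0:
                    continue
                p_full = p_joint / np.mean((y_prev == yp) & (x_prev == xp))
                p_hist = np.mean((y_next == yn) & (y_prev == yp)) / np.mean(y_prev == yp)
                te += p_joint * np.log(p_full / p_hist)
    return te

def causal_attention(queries, keys, values, returns, threshold=0.01):
    """Scaled dot-product attention masked by pairwise transfer entropy."""
    n = returns.shape[0]                      # number of stocks, returns: (n, T)
    te = np.array([[transfer_entropy(returns[j], returns[i]) for j in range(n)]
                   for i in range(n)])        # te[i, j]: influence of stock j on stock i
    mask = (te > threshold) | np.eye(n, dtype=bool)   # always keep self-attention
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    scores = np.where(mask, scores, -1e9)     # drop weak (spurious) links
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values
```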
Dialogue Summarization with Static-Dynamic Structure Fusion Graph
Shen Gao | Xin Cheng | Mingzhe Li | Xiuying Chen | Jinpeng Li | Dongyan Zhao | Rui Yan
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Dialogue, the most fundamental and specially privileged arena of language, has become increasingly ubiquitous across the Web in recent years. Quickly going through a long dialogue context and capturing the salient information scattered over the whole dialogue session benefits users in many real-world Web applications, such as email thread summarization and meeting minutes drafting. Dialogue summarization is a challenging task in that dialogue has a dynamic interactive nature and presumably inconsistent information flow among various speakers. Many researchers address this task by modeling dialogue with a pre-computed static graph structure built using external linguistic toolkits. However, such methods heavily depend on the reliability of the external tools, and the static graph construction is disjoint from the graph representation learning phase, so the graph cannot be dynamically adapted to the downstream summarization task. In this paper, we propose a Static-Dynamic graph-based Dialogue Summarization model (SDDS), which fuses prior knowledge from human expertise and adaptively learns the graph structure in an end-to-end fashion. To verify the effectiveness of SDDS, we conduct experiments on three benchmark datasets (SAMSum, MediaSum, and DialogSum), and the results verify the superiority of SDDS.
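As a rough illustration of the static-dynamic fusion idea (my own sketch under assumed shapes and a simple gating scheme, not the SDDS architecture), a layer can treat the toolkit-derived adjacency as a prior and blend it with an attention-style adjacency learned end-to-end:

```python
# One message-passing layer that fuses a static utterance graph with a
# dynamically learned graph via a learnable gate.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StaticDynamicGraphLayer(nn.Module):
    def __init__(self, hidden: int):
        super().__init__()
        self.q = nn.Linear(hidden, hidden)
        self.k = nn.Linear(hidden, hidden)
        self.gate = nn.Parameter(torch.tensor(0.5))  # learnable fusion weight
        self.proj = nn.Linear(hidden, hidden)

    def forward(self, x: torch.Tensor, static_adj: torch.Tensor) -> torch.Tensor:
        # x: (num_utterances, hidden); static_adj: (num_utterances, num_utterances)
        scores = self.q(x) @ self.k(x).T / x.shape[-1] ** 0.5
        dynamic_adj = torch.softmax(scores, dim=-1)            # learned structure
        static_norm = static_adj / static_adj.sum(-1, keepdim=True).clamp(min=1e-6)
        g = torch.sigmoid(self.gate)
        fused = g * static_norm + (1 - g) * dynamic_adj        # prior + learned graph
        return F.relu(self.proj(fused @ x))                    # one message-passing step
```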
Decouple knowledge from parameters for plug-and-play language modeling
Xin Cheng | Yankai Lin | Xiuying Chen | Dongyan Zhao | Rui Yan
Findings of the Association for Computational Linguistics: ACL 2023
Pre-trained language models (PLMs) have achieved impressive results on a wide range of NLP tasks, and one of the key factors behind their success is that their parameters implicitly learn various types of knowledge from the pre-training corpus. However, encoding knowledge implicitly in the model parameters has two fundamental drawbacks. First, the knowledge is neither editable nor scalable once the model is trained, which is especially problematic given that knowledge is constantly evolving. Second, it lacks interpretability and prevents us from understanding what kind of knowledge a PLM needs to solve a certain task. In this paper, we introduce PlugLM, a pre-training model with a differentiable plug-in memory (DPM). The key intuition is to decouple knowledge storage from the model parameters with an editable and scalable key-value memory, and to leverage knowledge in an explainable manner via knowledge retrieval in the DPM. We conduct extensive experiments under various settings to justify this design choice. In the domain-adaptation setting, PlugLM can be easily adapted to different domains with a pluggable in-domain memory, obtaining 3.95 F1 improvements across four domains without any in-domain training. PlugLM can also keep absorbing new knowledge after pre-training is done via the knowledge-updating operation in the DPM, without re-training. Finally, we show that by incorporating training samples into the DPM with knowledge prompting, PlugLM can be further improved by the guidance of in-task knowledge.
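A minimal sketch of the editable key-value memory idea, under assumed shapes and class/method names of my own (not the paper's code): knowledge lives in a (keys, values) table outside the backbone parameters, is looked up by dense retrieval, and can be extended without retraining.

```python
# Plug-in key-value memory: add entries to absorb new or in-domain knowledge,
# retrieve an attention-weighted mix of values to condition the backbone on.
import torch

class PlugInMemory:
    def __init__(self, dim: int):
        self.keys = torch.empty(0, dim)     # retrieval keys
        self.values = torch.empty(0, dim)   # knowledge representations

    def add(self, keys: torch.Tensor, values: torch.Tensor) -> None:
        """Absorb new knowledge after pre-training; no gradient updates needed."""
        self.keys = torch.cat([self.keys, keys], dim=0)
        self.values = torch.cat([self.values, values], dim=0)

    def retrieve(self, query: torch.Tensor, top_k: int = 4) -> torch.Tensor:
        """Return an attention-weighted mix of the top-k value entries."""
        scores = self.keys @ query                       # (num_entries,)
        top = torch.topk(scores, k=min(top_k, len(scores)))
        weights = torch.softmax(top.values, dim=0)
        return (weights.unsqueeze(1) * self.values[top.indices]).sum(dim=0)

# Usage sketch: swap in a domain-specific memory at inference time.
memory = PlugInMemory(dim=768)
memory.add(torch.randn(100, 768), torch.randn(100, 768))  # pluggable in-domain knowledge
fused = memory.retrieve(torch.randn(768))                  # fused into the model's hidden state
```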
RWKV: Reinventing RNNs for the Transformer Era
Bo Peng | Eric Alcaide | Quentin Anthony | Alon Albalak | Samuel Arcadinho | Stella Biderman | Huanqi Cao | Xin Cheng | Michael Chung | Leon Derczynski | Xingjian Du | Matteo Grella | Kranthi Gv | Xuzheng He | Haowen Hou | Przemyslaw Kazienko | Jan Kocon | Jiaming Kong | Bartłomiej Koptyra | Hayden Lau | Jiaju Lin | Krishna Sri Ipsit Mantri | Ferdinand Mom | Atsushi Saito | Guangyu Song | Xiangru Tang | Johan Wind | Stanisław Woźniak | Zhenyuan Zhang | Qinghua Zhou | Jian Zhu | Rui-Jie Zhu
Findings of the Association for Computational Linguistics: EMNLP 2023
Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the performance of Transformers due to limitations in parallelization and scalability. We propose a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of Transformers with the efficient inference of RNNs. Our approach leverages a linear attention mechanism and allows us to formulate the model as either a Transformer or an RNN, thus parallelizing computations during training while maintaining constant computational and memory complexity during inference. We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find that RWKV performs on par with similarly sized Transformers, suggesting future work can leverage this architecture to create more efficient models. This work presents a significant step towards reconciling trade-offs between computational efficiency and model performance in sequence processing tasks.
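The constant-memory inference comes from a per-channel recurrence over a running numerator/denominator state. Below is a naive, unstabilized paraphrase of that WKV recurrence (not the released RWKV kernel, which adds numerical-stability bookkeeping); variable names and shapes are assumptions for illustration.

```python
# Naive WKV recurrence: the decayed running sums act like attention over all
# past tokens, so each step costs O(d) time and memory regardless of length.
import numpy as np

def wkv_recurrence(k: np.ndarray, v: np.ndarray, w: np.ndarray, u: np.ndarray) -> np.ndarray:
    """k, v: (T, d) keys/values; w: (d,) per-channel decay >= 0; u: (d,) bonus for the current token."""
    T, d = k.shape
    num = np.zeros(d)            # running sum of exp(k_i) * v_i, decayed over time
    den = np.zeros(d)            # running sum of exp(k_i), decayed over time
    out = np.zeros((T, d))
    for t in range(T):
        cur = np.exp(u + k[t])                       # current token gets the u bonus
        out[t] = (num + cur * v[t]) / (den + cur)    # weighted average over past + current
        decay = np.exp(-w)
        num = decay * num + np.exp(k[t]) * v[t]      # fold current token into the state
        den = decay * den + np.exp(k[t])
    return out
```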
2022
Neural Machine Translation with Contrastive Translation Memories
Xin Cheng | Shen Gao | Lemao Liu | Dongyan Zhao | Rui Yan
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Retrieval-augmented Neural Machine Translation models have been successful in many translation scenarios. Different from previous works that make use of mutually similar but redundant translation memories (TMs), we propose a new retrieval-augmented NMT framework that models contrastively retrieved translation memories which are holistically similar to the source sentence yet individually contrastive to each other, providing maximal information gain across three phases. First, in the TM retrieval phase, we adopt a contrastive retrieval algorithm to avoid the redundancy and uninformativeness of similar translation pieces. Second, in the memory encoding stage, given a set of TMs, we propose a novel Hierarchical Group Attention module to gather both the local context of each TM and the global context of the whole TM set. Finally, in the training phase, a Multi-TM contrastive learning objective is introduced to learn the salient features of each TM with respect to the target sentence. Experimental results show that our framework obtains substantial improvements over strong baselines on the benchmark dataset.
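For intuition, the contrastive retrieval phase can be sketched in the maximal-marginal-relevance spirit: each newly selected TM should be similar to the source but dissimilar to the memories already chosen. This is an illustrative sketch with an assumed trade-off weight `alpha`, not the paper's exact algorithm.

```python
# Greedy contrastive TM selection: relevance to the source minus redundancy
# with respect to the already-selected memories.
import numpy as np

def contrastive_tm_retrieval(src_emb, tm_embs, num_memories=4, alpha=0.7):
    """src_emb: (d,); tm_embs: (N, d) unit-normalized embeddings; alpha trades relevance vs. diversity."""
    relevance = tm_embs @ src_emb                 # similarity of each TM to the source
    selected = []
    candidates = list(range(len(tm_embs)))
    while candidates and len(selected) < num_memories:
        if selected:
            redundancy = np.max(tm_embs[candidates] @ tm_embs[selected].T, axis=1)
        else:
            redundancy = np.zeros(len(candidates))
        scores = alpha * relevance[candidates] - (1 - alpha) * redundancy
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return selected                               # indices of the retrieved TM set
```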