2024
CoCA: Fusing Position Embedding with Collinear Constrained Attention in Transformers for Long Context Window Extending
Shiyi Zhu | Jing Ye | Wei Jiang | Siqiao Xue | Qi Zhang | Yifan Wu | Jianguo Li
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Self-attention and position embedding are two crucial modules in transformer-based Large Language Models (LLMs). However, the potential relationship between them is far from well studied, especially for long context window extension. In fact, anomalous behaviors that hinder long context extrapolation exist between Rotary Position Embedding (RoPE) and vanilla self-attention: incorrect initial angles between Q and K can cause misestimation when modeling the rotary position embedding of the closest tokens. To address this issue, we propose the Collinear Constrained Attention mechanism, namely CoCA. Specifically, we enforce a collinear constraint between Q and K to seamlessly integrate RoPE and self-attention. While adding only minimal computational and spatial complexity, this integration significantly enhances long context window extrapolation ability. We provide an optimized implementation, making it a drop-in replacement for any existing transformer-based model. Extensive experiments demonstrate that CoCA excels at extending context windows. A CoCA-based GPT model, trained with a context length of 512, can extend the context window up to 32K (60×) without any fine-tuning. Additionally, incorporating CoCA into LLaMA-7B achieves extrapolation up to 32K with a training length of only 2K. Our code is publicly available at: https://github.com/codefuse-ai/Collinear-Constrained-Attention
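The abstract describes the collinear constraint only at a high level, so the following is a minimal PyTorch sketch of one way such a constraint could look: the key is built as a non-negative, element-wise scaling of the query (so their initial angle is zero) before RoPE is applied. All names here (CollinearAttentionSketch, coeff_proj, rope) are illustrative assumptions, not the authors' released implementation, which lives in the linked repository.

```python
# Hedged sketch only: one plausible way to tie K to Q so they are collinear before RoPE.
import torch
import torch.nn as nn
import torch.nn.functional as F

def rope(x, positions, base=10000.0):
    """Rotary position embedding (rotate-half variant) over the last dimension."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype, device=x.device) / half)
    angles = positions[..., None] * freqs          # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class CollinearAttentionSketch(nn.Module):
    """One attention head where K is a non-negative scaling of Q, so the initial
    Q-K angle is zero before the rotary rotation is applied."""
    def __init__(self, dim, head_dim):
        super().__init__()
        self.q_proj = nn.Linear(dim, head_dim)
        self.v_proj = nn.Linear(dim, head_dim)
        # Instead of an independent K projection, predict a non-negative coefficient.
        self.coeff_proj = nn.Linear(dim, head_dim)
        self.head_dim = head_dim

    def forward(self, x):
        seq_len = x.shape[1]
        pos = torch.arange(seq_len, dtype=x.dtype, device=x.device)
        q = self.q_proj(x)
        # Collinear constraint: k = relu(coefficient) * q, so q and k share direction.
        k = F.relu(self.coeff_proj(x)) * q
        v = self.v_proj(x)
        q, k = rope(q, pos), rope(k, pos)
        # Causal masking omitted for brevity.
        attn = torch.softmax(q @ k.transpose(-1, -2) / self.head_dim ** 0.5, dim=-1)
        return attn @ v
```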
Hide and Seek in Noise Labels: Noise-Robust Collaborative Active Learning with LLMs-Powered Assistance
Bo Yuan | Yulin Chen | Yin Zhang | Wei Jiang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Learning from noisy labels (LNL) is a challenge that arises in many real-world scenarios where the collected training data can contain incorrect or corrupted labels. Most existing solutions identify noisy labels and adopt active learning to query human experts on them for denoising. In the era of large language models (LLMs), although we can reduce the human effort needed by these methods, their performance still depends on accurately separating the clean and noisy samples in the noisy data. In this paper, we propose NoiseAL, an innovative collaborative learning framework based on active learning that combines LLMs and small models (SMs) for learning from noisy labels. During collaborative training, we first adopt two SMs to form a co-prediction network and propose a dynamic-enhanced threshold strategy to divide the noisy data into different subsets; we then select clean and noisy samples from these subsets to feed to the active annotator LLM, which rectifies the noisy samples. Finally, we employ different optimization objectives for subsets with different degrees of label noise. Extensive experiments on synthetic and real-world noise datasets further demonstrate the superiority of our framework over state-of-the-art baselines.
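As a rough illustration of the collaborative loop described above (not the authors' code), the sketch below uses two small models' predictions, a simplified fixed threshold in place of the paper's dynamic-enhanced strategy, and a placeholder LLM annotator (query_llm_for_label is hypothetical) to relabel a small budget of the most uncertain noisy samples.

```python
# Hedged sketch: split data by co-prediction agreement, then ask an LLM to relabel
# a budget of uncertain noisy samples. All names and thresholds are illustrative.
import numpy as np

def query_llm_for_label(text):
    # Placeholder for an LLM call; a real system would prompt an LLM here.
    return 0

def split_by_agreement(probs_a, probs_b, labels, threshold=0.8):
    """Mark a sample as 'clean' when both small models confidently agree with its label."""
    idx = np.arange(len(labels))
    conf = np.minimum(probs_a[idx, labels], probs_b[idx, labels])
    return conf >= threshold

def collaborative_round(texts, labels, probs_a, probs_b, llm_budget=10):
    clean_mask = split_by_agreement(probs_a, probs_b, labels)
    noisy_idx = np.where(~clean_mask)[0]
    # Query the LLM annotator on the most uncertain noisy samples, up to the budget.
    uncertainty = 1.0 - np.maximum(probs_a.max(1), probs_b.max(1))
    query_idx = noisy_idx[np.argsort(-uncertainty[noisy_idx])][:llm_budget]
    corrected = labels.copy()
    for i in query_idx:
        corrected[i] = query_llm_for_label(texts[i])
    return clean_mask, corrected
```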
2023
AraMUS: Pushing the Limits of Data and Model Scale for Arabic Natural Language Processing
Asaad Alghamdi | Xinyu Duan | Wei Jiang | Zhenhai Wang | Yimeng Wu | Qingrong Xia | Zhefeng Wang | Yi Zheng | Mehdi Rezagholizadeh | Baoxing Huai | Peilun Cheng | Abbas Ghaddar
Findings of the Association for Computational Linguistics: ACL 2023
Developing monolingual large Pre-trained Language Models (PLMs) has proven very successful for handling a wide range of Natural Language Processing (NLP) tasks. In this work, we present AraMUS, the largest Arabic PLM to date, with 11B parameters trained on 529GB of high-quality Arabic textual data. AraMUS achieves state-of-the-art performance on a diverse set of Arabic classification and generative tasks. Moreover, AraMUS shows impressive few-shot learning abilities compared with the best existing Arabic PLMs.
2019
HLT@SUDA at SemEval-2019 Task 1: UCCA Graph Parsing as Constituent Tree Parsing
Wei Jiang | Zhenghua Li | Yu Zhang | Min Zhang
Proceedings of the 13th International Workshop on Semantic Evaluation
This paper describes a simple UCCA semantic graph parsing approach. The key idea is to convert a UCCA semantic graph into a constituent tree, in which extra labels are deliberately designed to mark remote edges and discontinuous nodes for later recovery. In this way, we can make use of existing syntactic parsing techniques. Based on the data statistics, we recover discontinuous nodes directly from the output labels of the constituent parser and use a biaffine classification model to recover the more complex remote edges. The classification model and the constituent parser are trained simultaneously under a multi-task learning framework. We use multilingual BERT as extra features in the open tracks. Our system ranks first among the seven participating systems in the six English/German closed/open tracks. For the seventh, cross-lingual track, where there is little training data for French, we propose a language embedding approach to utilize the English and German training data, and our result ranks second.
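For readers unfamiliar with biaffine classification, the following is a minimal, hypothetical PyTorch sketch of a biaffine scorer of the kind used above to recover remote edges between node representations; the shapes and names are assumptions, not the authors' implementation.

```python
# Hedged sketch of a biaffine pairwise scorer for remote-edge recovery.
import torch
import torch.nn as nn

class BiaffineEdgeScorer(nn.Module):
    """Scores every (head, dependent) node pair with a biaffine form h_i^T U d_j."""
    def __init__(self, hidden_dim, mlp_dim=256):
        super().__init__()
        self.head_mlp = nn.Sequential(nn.Linear(hidden_dim, mlp_dim), nn.ReLU())
        self.dep_mlp = nn.Sequential(nn.Linear(hidden_dim, mlp_dim), nn.ReLU())
        # The +1 column on the dependent side acts as a per-head bias term.
        self.U = nn.Parameter(torch.randn(mlp_dim, mlp_dim + 1) * 0.01)

    def forward(self, node_repr):
        # node_repr: (batch, num_nodes, hidden_dim)
        h = self.head_mlp(node_repr)
        d = self.dep_mlp(node_repr)
        d = torch.cat([d, torch.ones_like(d[..., :1])], dim=-1)
        # scores[b, i, j]: score of a remote edge from node i to node j.
        return torch.einsum('bim,mn,bjn->bij', h, self.U, d)
```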
SUDA-Alibaba at MRP 2019: Graph-Based Models with BERT
Yue Zhang | Wei Jiang | Qingrong Xia | Junjie Cao | Rui Wang | Zhenghua Li | Min Zhang
Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning
In this paper, we describe our participating systems in the shared task on Cross-Framework Meaning Representation Parsing (MRP) at the 2019 Conference on Computational Natural Language Learning (CoNLL). The task includes five frameworks for graph-based meaning representation, i.e., DM, PSD, EDS, UCCA, and AMR. One common characteristic of our systems is that we employ graph-based methods instead of transition-based methods when predicting edges between nodes. For SDP, we jointly perform edge prediction, frame tagging, and POS tagging via multi-task learning (MTL). For UCCA, we also jointly model constituent tree parsing and a remote edge recovery task. For both EDS and AMR, we produce nodes first and edges second in a pipeline fashion. External resources such as BERT are found helpful for all frameworks except AMR. Our final submission ranks third on the overall MRP evaluation metric, first on EDS, and second on UCCA.
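The SDP setup described above (joint edge prediction, frame tagging, and POS tagging via MTL) can be pictured with the hypothetical sketch below: a shared encoder feeds three heads whose losses are summed. The encoder choice (a BiLSTM stand-in rather than BERT), the bilinear edge scorer, and all names are assumptions for illustration only.

```python
# Hedged sketch of a multi-task SDP-style model: shared encoder, three heads, summed loss.
import torch
import torch.nn as nn

class SDPMultiTaskSketch(nn.Module):
    def __init__(self, vocab_size, hidden=256, n_frames=50, n_pos=20):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.LSTM(hidden, hidden // 2, batch_first=True, bidirectional=True)
        self.edge_scorer = nn.Bilinear(hidden, hidden, 1)   # stand-in for a biaffine scorer
        self.frame_head = nn.Linear(hidden, n_frames)
        self.pos_head = nn.Linear(hidden, n_pos)

    def forward(self, token_ids):
        h, _ = self.encoder(self.embed(token_ids))           # (batch, seq, hidden)
        b, n, d = h.shape
        heads = h.unsqueeze(2).expand(b, n, n, d).reshape(-1, d)
        deps = h.unsqueeze(1).expand(b, n, n, d).reshape(-1, d)
        edge_logits = self.edge_scorer(heads, deps).view(b, n, n)
        return edge_logits, self.frame_head(h), self.pos_head(h)

def multi_task_loss(edge_logits, frame_logits, pos_logits, edge_gold, frame_gold, pos_gold):
    # edge_gold: (batch, seq, seq) 0/1; frame_gold, pos_gold: (batch, seq) class indices.
    bce, ce = nn.BCEWithLogitsLoss(), nn.CrossEntropyLoss()
    return (bce(edge_logits, edge_gold.float())
            + ce(frame_logits.flatten(0, 1), frame_gold.flatten())
            + ce(pos_logits.flatten(0, 1), pos_gold.flatten()))
```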
2006
A Pragmatic Chinese Word Segmentation Approach Based on Mixing Models
Wei Jiang | Yi Guan | Xiao-Long Wang
International Journal of Computational Linguistics & Chinese Language Processing, Volume 11, Number 4, December 2006
A Pragmatic Chinese Word Segmentation System
Wei Jiang | Yi Guan | Xiao-Long Wang
Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing
2005
Chinese Word Segmentation based on Mixing Model
Wei Jiang | Jian Zhao | Yi Guan | Zhiming Xu
Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing