2024
pdf
bib
abs
A Learning Rate Path Switching Training Paradigm for Version Updates of Large Language Models
Zhihao Wang
|
Shiyu Liu
|
Jianheng Huang
|
Wang Zheng
|
YiXuan Liao
|
Xiaoxin Chen
|
Junfeng Yao
|
Jinsong Su
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Due to the continuous emergence of new data, version updates have become an indispensable requirement for Large Language Models (LLMs). The training paradigms for version updates of LLMs include pre-training from scratch (PTFS) and continual pre-training (CPT). Preliminary experiments demonstrate that PTFS achieves better pre-training performance, while CPT has lower training cost. Moreover, their performance and training cost gaps widen progressively with version updates. To investigate the underlying reasons for this phenomenon, we analyze the effect of learning rate adjustments during the two stages of CPT: preparing an initialization checkpoint and continual pre-training based on this checkpoint. We find that a large learning rate in the first stage and a complete learning rate decay process in the second stage are crucial for version updates of LLMs. Hence, we propose a learning rate path switching training paradigm. Our paradigm comprises one main path, where we pre-train a LLM with the maximal learning rate, and multiple branching paths, each of which corresponds to an update of the LLM with newly-added training data. Extensive experiments demonstrate the effectiveness and generalization of our paradigm. Particularly, when training four versions of LLMs, our paradigm reduces the total training cost to 58% compared to PTFS, while maintaining comparable pre-training performance.
pdf
bib
abs
DAGCN: Distance-based and Aspect-oriented Graph Convolutional Network for Aspect-based Sentiment Analysis
Zhihao Wang
|
Bo Zhang
|
Ru Yang
|
Chang Guo
|
Maozhen Li
Findings of the Association for Computational Linguistics: NAACL 2024
Aspect-based sentiment analysis (ABSA) is a task that aims to determine the sentiment polarity of aspects by identifying opinion words. Recent advancements have predominantly been rooted either in semantic or syntactic methods. However, both of them tend to interference from local factors such as irrelevant words and edges, hindering the precise identification of opinion words. In this paper, we present Distance-based and Aspect-oriented Graph Convolutional Network (DAGCN) to address the aforementioned issue. Firstly, we introduce the Distance-based Syntactic Weight (DSW). It focuses on the local scope of aspects in the pruned dependency trees, thereby reducing the candidate pool of opinion words. Additionally, we propose Aspect-Fusion Attention (AF) to further filter opinion words within the local context and consider cases where opinion words are distant from the aspect. With the combination of DSW and AF, we achieve precise identification of corresponding opinion words. Extensive experiments on three public datasets demonstrate that the proposed model outperforms state-of-the-art models and verify the effectiveness of the proposed architecture.
2023
pdf
bib
abs
A Sequence-to-Sequence&Set Model for Text-to-Table Generation
Tong Li
|
Zhihao Wang
|
Liangying Shao
|
Xuling Zheng
|
Xiaoli Wang
|
Jinsong Su
Findings of the Association for Computational Linguistics: ACL 2023
Recently, the text-to-table generation task has attracted increasing attention due to its wide applications. In this aspect, the dominant model formalizes this task as a sequence-to-sequence generation task and serializes each table into a token sequence during training by concatenating all rows in a top-down order. However, it suffers from two serious defects: 1) the predefined order introduces a wrong bias during training, which highly penalizes shifts in the order between rows; 2) the error propagation problem becomes serious when the model outputs a long token sequence. In this paper, we first conduct a preliminary study to demonstrate the generation of most rows is order-insensitive. Furthermore, we propose a novel sequence-to-sequence&set text-to-table generation model. Specifically, in addition to a text encoder encoding the input text, our model is equipped with a table header generator to first output a table header, i.e., the first row of the table, in the manner of sequence generation. Then we use a table body generator with learnable row embeddings and column embeddings to generate a set of table body rows in parallel. Particularly, to deal with the issue that there is no correspondence between each generated table body row and target during training, we propose a target assignment strategy based on the bipartite matching between the first cells of generated table body rows and targets. Experiment results show that our model significantly surpasses the baselines, achieving state-of-the-art performance on commonly-used datasets.
pdf
bib
abs
Revisiting Non-Autoregressive Translation at Scale
Zhihao Wang
|
Longyue Wang
|
Jinsong Su
|
Junfeng Yao
|
Zhaopeng Tu
Findings of the Association for Computational Linguistics: ACL 2023
In real-world systems, scaling has been critical for improving the translation quality in autoregressive translation (AT), which however has not been well studied for non-autoregressive translation (NAT). In this work, we bridge the gap by systematically studying the impact of scaling on NAT behaviors. Extensive experiments on six WMT benchmarks over two advanced NAT models show that scaling can alleviate the commonly-cited weaknesses of NAT models, resulting in better translation performance. To reduce the side-effect of scaling on decoding speed, we empirically investigate the impact of NAT encoder and decoder on the translation performance. Experimental results on the large-scale WMT20 En-De show that the asymmetric architecture (e.g. bigger encoder and smaller decoder) can achieve comparable performance with the scaling model, while maintaining the superiority of decoding speed with standard NAT models. To this end, we establish a new benchmark by validating scaled NAT models on the scaled dataset, which can be regarded as a strong baseline for future works. We release code and system outputs at
https://github.com/DeepLearnXMU/Scaling4NAT.
2022
pdf
bib
abs
Learning to Detect Noisy Labels Using Model-Based Features
Zhihao Wang
|
Zongyu Lin
|
Junjie Wen
|
Xianxin Chen
|
Peiqi Liu
|
Guidong Zheng
|
Yujun Chen
|
Zhilin Yang
Findings of the Association for Computational Linguistics: EMNLP 2022
Label noise is ubiquitous in various machine learning scenarios such as self-labeling with model predictions and erroneous data annotation. Many existing approaches are based on heuristics such as sample losses, which might not be flexible enough to achieve optimal solutions. Meta learning based methods address this issue by learning a data selection function, but can be hard to optimize. In light of these pros and cons, we propose SENT (Selection-Enhanced Noisy label Training) that does not rely on meta learning while having the flexibility of being data-driven. SENT transfers the noise distribution to a clean set and trains a model to distinguish noisy labels from clean ones using model-based features. Empirically, on a wide range of tasks including text classification and speech recognition, SENT improves performance over strong baselines under the settings of self-training and label corruption.