2025
Revisiting Scaling Laws for Language Models: The Role of Data Quality and Training Strategies
Zhengyu Chen | Siqi Wang | Teng Xiao | Yudong Wang | Shiqi Chen | Xunliang Cai | Junxian He | Jingang Wang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Traditional scaling laws in natural language processing suggest that increasing model size and training data enhances performance. However, recent studies reveal deviations, particularly in large language models, where performance improvements decelerate—a phenomenon known as sub-scaling. This paper revisits these scaling laws by examining the impact of data quality and training strategies on model performance. Through extensive empirical analysis of over 400 models, we identify high data density and non-optimal resource allocation as key factors contributing to sub-scaling. High data density leads to diminishing returns due to redundant information, while optimal resource allocation is crucial for sustained performance improvements. We propose a sub-optimal scaling law that better predicts performance in sub-scaling regimes, highlighting the importance of data quality and diversity.
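To make the fitting procedure behind such analyses concrete, here is a minimal, hypothetical sketch (not the paper's sub-optimal scaling law, whose exact functional form is not given in this abstract): it fits a classical power law to synthetic (compute, loss) points and inspects the residuals where sub-scaling would show up. The data points and parameter values are illustrative assumptions.

```python
import numpy as np

# Synthetic (compute, loss) pairs standing in for real training runs.
compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])
loss = np.array([3.40, 2.95, 2.56, 2.22, 1.93])

# Classical power law L(C) = a * C^(-b): a straight line in log-log space.
slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), deg=1)
a, b = 10 ** intercept, -slope
print(f"power-law fit: L(C) = {a:.2f} * C^(-{b:.3f})")

predicted = a * compute ** (-b)
residuals = loss - predicted
print("residuals:", np.round(residuals, 3))
# In a sub-scaling regime, the observed losses at large C sit above the
# power-law line (positive residuals), i.e. returns diminish faster than
# the classical law predicts.
```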
Explaining Length Bias in LLM-Based Preference Evaluations
Zhengyu Hu | Linxin Song | Jieyu Zhang | Zheyuan Xiao | Tianfu Wang | Zhengyu Chen | Nicholas Jing Yuan | Jianxun Lian | Kaize Ding | Hui Xiong
Findings of the Association for Computational Linguistics: EMNLP 2025
The use of large language models (LLMs) as judges, particularly in preference comparisons, has become widespread, but this practice reveals a notable bias towards longer responses, undermining the reliability of such evaluations. To better understand this bias, we propose to decompose the preference evaluation metric, specifically the win rate, into two key components: desirability and information mass. The former is length-independent and relates to trustworthiness factors such as correctness, toxicity, and consistency; the latter is length-dependent and represents the amount of information in the response. We empirically validate this decomposition through controlled experiments and find that response length affects evaluations by influencing information mass. To derive a reliable evaluation metric that assesses content quality without being confounded by response length, we propose AdapAlpaca, a simple yet effective adjustment to win rate measurement. Specifically, AdapAlpaca ensures a fair comparison of response quality by aligning the lengths of reference and test model responses within equivalent length intervals.
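As a rough illustration of the length-controlled idea, the sketch below computes a win rate only over comparisons whose responses fall in the same length interval. The bin edges, the `length_bin` and `adjusted_win_rate` helpers, and the toy judge are assumptions for illustration, not the published AdapAlpaca procedure.

```python
# Minimal sketch of a length-controlled win rate: pairs are only compared when
# the reference and test responses fall in the same length interval.
# Bin edges and helper names are illustrative assumptions.
from collections import defaultdict

def length_bin(text, edges=(50, 100, 200, 400)):
    """Map a response to a length interval by word count."""
    n = len(text.split())
    for i, edge in enumerate(edges):
        if n < edge:
            return i
    return len(edges)

def adjusted_win_rate(pairs, judge):
    """pairs: list of (test_response, reference_response).
    judge: callable returning True if the test response is preferred."""
    wins, total = defaultdict(int), defaultdict(int)
    for test, ref in pairs:
        b_test, b_ref = length_bin(test), length_bin(ref)
        if b_test != b_ref:          # skip length-mismatched comparisons
            continue
        total[b_test] += 1
        wins[b_test] += judge(test, ref)
    # Average per-bin win rates so no single length interval dominates.
    rates = [wins[b] / total[b] for b in total if total[b]]
    return sum(rates) / len(rates) if rates else float("nan")

# Toy usage with a trivially length-biased "judge" that prefers longer output.
pairs = [("short answer", "short reply"),
         ("a much longer detailed answer " * 10, "brief")]
print(adjusted_win_rate(pairs, judge=lambda t, r: len(t) > len(r)))
```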
SampleMix: A Sample-wise Pre-training Data Mixing Strategy by Coordinating Data Quality and Diversity
Xiangyu Xi | Deyang Kong | Jian Yang | Jiawei Yang | Zhengyu Chen | Wei Wang | Jingang Wang | Xunliang Cai | Shikun Zhang | Wei Ye
Findings of the Association for Computational Linguistics: EMNLP 2025
Existing pretraining data mixing methods for large language models (LLMs) typically follow a domain-wise methodology, a top-down process that first determines domain weights and then performs uniform data sampling within each domain. However, these approaches neglect significant inter-domain overlaps and commonalities, failing to control the global diversity of the constructed training dataset. Furthermore, uniform sampling within domains ignores fine-grained, sample-specific features, potentially leading to a suboptimal data distribution. To address these shortcomings, we propose SampleMix, a sample-wise data mixing approach based on a bottom-up paradigm. SampleMix performs global cross-domain sampling by systematically evaluating the quality and diversity of each sample, thereby dynamically determining the optimal domain distribution. Comprehensive experiments across multiple downstream tasks and perplexity assessments demonstrate that SampleMix surpasses existing domain-based methods. Moreover, SampleMix requires 1.4x to 2.1x fewer training steps to reach the baselines' performance, highlighting its substantial potential for optimizing pre-training data.
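A minimal sketch of what sample-wise, bottom-up mixing could look like is given below: each sample receives a weight derived from toy quality and diversity scores, and the training pool is drawn globally across domains. The scores, the combination formula, and the temperature are illustrative assumptions rather than SampleMix's actual algorithm.

```python
# Illustrative sketch of sample-wise mixing: each sample gets a weight from its
# quality and diversity scores, and the pool is sampled globally rather than
# per domain. All scores and the weighting scheme here are assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: (domain, quality score in [0, 1], diversity score in [0, 1]).
samples = [
    ("web", 0.9, 0.3), ("web", 0.4, 0.8), ("code", 0.8, 0.9),
    ("code", 0.6, 0.2), ("books", 0.7, 0.7), ("books", 0.5, 0.6),
]

quality = np.array([q for _, q, _ in samples])
diversity = np.array([d for _, _, d in samples])

# Combine the two signals and turn them into a sampling distribution.
temperature = 0.5
scores = quality * diversity
weights = np.exp(scores / temperature)
probs = weights / weights.sum()

picked = rng.choice(len(samples), size=4, replace=False, p=probs)
print("sampled domains:", [samples[i][0] for i in picked])
# The resulting domain proportions emerge bottom-up from per-sample scores
# instead of being fixed top-down.
```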
2024
Scaling Laws Across Model Architectures: A Comparative Analysis of Dense and MoE Models in Large Language Models
Siqi Wang | Zhengyu Chen | Bei Li | Keqing He | Min Zhang | Jingang Wang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
The scaling of large language models (LLMs) is a critical research area for the efficiency and effectiveness of model training and deployment. Our work investigates the transferability and discrepancies of scaling laws between dense models and Mixture-of-Experts (MoE) models. Through a combination of theoretical analysis and extensive experiments, covering loss scaling, optimal batch size/learning rate scaling, and resource allocation strategies, our findings reveal that the power-law scaling framework also applies to MoE models, indicating that the fundamental principles governing their scaling behavior are preserved even though the architecture differs. Additionally, MoE models demonstrate superior generalization, achieving lower test losses than dense models under the same training compute budget. These findings indicate the scaling consistency and transfer generalization capabilities of MoE models, providing new insights for optimizing MoE model training and deployment strategies.
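For reference, the power-law framework the abstract appeals to is conventionally parameterized as below; this is the standard form from the scaling-law literature, not an equation quoted from the paper.

```latex
% Standard power-law parameterization of loss versus training compute C
% (illustrative; not quoted from the paper). The abstract's claim is that both
% model families follow this form, with the MoE fit below the dense fit at
% matched compute:
\[
  L_{\text{dense}}(C) = L_\infty^{\text{dense}} + a_d\, C^{-b_d},
  \qquad
  L_{\text{MoE}}(C) = L_\infty^{\text{MoE}} + a_m\, C^{-b_m},
  \qquad
  L_{\text{MoE}}(C) < L_{\text{dense}}(C) \ \text{at matched } C .
\]
```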
Let’s Ask GNN: Empowering Large Language Model for Graph In-Context Learning
Zhengyu Hu | Yichuan Li | Zhengyu Chen | Jingang Wang | Han Liu | Kyumin Lee | Kaize Ding
Findings of the Association for Computational Linguistics: EMNLP 2024
Textual Attributed Graphs (TAGs) are crucial for modeling complex real-world systems, yet leveraging large language models (LLMs) for TAGs presents unique challenges due to the gap between sequential text processing and graph-structured data. We introduce AskGNN, a novel approach that bridges this gap by leveraging In-Context Learning (ICL) to integrate graph data and task-specific information into LLMs. AskGNN employs a Graph Neural Network (GNN)-powered, structure-enhanced retriever to select labeled nodes across graphs, incorporating complex graph structures and their supervision signals. Our learning-to-retrieve algorithm optimizes the retriever to select example nodes that maximize LLM performance on graphs. Experiments across three tasks and seven LLMs demonstrate AskGNN's superior effectiveness in graph task performance, opening new avenues for applying LLMs to graph-structured data without extensive fine-tuning.
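The sketch below illustrates the general shape of structure-aware example retrieval for graph in-context learning: node features are smoothed over the graph as a crude stand-in for a trained GNN encoder, and the labeled nodes most similar to a query node are returned as in-context examples. The functions, toy graph, and similarity measure are assumptions; this is not the paper's learning-to-retrieve algorithm.

```python
# Minimal sketch of structure-aware example retrieval for graph ICL.
# Feature smoothing stands in for a trained GNN encoder; everything here is
# an illustrative assumption, not the paper's method.
import numpy as np

def smooth_features(features, adjacency, hops=2):
    """Average each node's features with its neighbors' (simple propagation)."""
    norm_adj = adjacency / np.maximum(adjacency.sum(axis=1, keepdims=True), 1)
    smoothed = features
    for _ in range(hops):
        smoothed = 0.5 * smoothed + 0.5 * norm_adj @ smoothed
    return smoothed

def retrieve_examples(query_idx, labeled_idx, embeddings, k=2):
    """Return the k labeled nodes closest to the query node by cosine similarity."""
    q = embeddings[query_idx]
    sims = embeddings[labeled_idx] @ q / (
        np.linalg.norm(embeddings[labeled_idx], axis=1) * np.linalg.norm(q) + 1e-9
    )
    return [labeled_idx[i] for i in np.argsort(-sims)[:k]]

# Toy graph: 5 nodes on a ring, random text features, 3 labeled nodes.
rng = np.random.default_rng(0)
features = rng.normal(size=(5, 8))
adjacency = np.eye(5, k=1) + np.eye(5, k=-1)
adjacency[0, 4] = adjacency[4, 0] = 1

embeddings = smooth_features(features, adjacency)
examples = retrieve_examples(query_idx=2, labeled_idx=[0, 1, 4], embeddings=embeddings)
print("in-context example nodes for node 2:", examples)
```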
FinNLP-AgentScen-2024 Shared Task: Financial Challenges in Large Language Models - FinLLMs
Qianqian Xie | Jimin Huang | Dong Li | Zhengyu Chen | Ruoyu Xiang | Mengxi Xiao | Yangyang Yu | Vijayasai Somasundaram | Kailai Yang | Chenhan Yuan | Zheheng Luo | Zhiwei Liu | Yueru He | Yuechen Jiang | Haohang Li | Duanyu Feng | Xiao-Yang Liu | Benyou Wang | Hao Wang | Yanzhao Lai | Jordan Suchow | Alejandro Lopez-Lira | Min Peng | Sophia Ananiadou
Proceedings of the Eighth Financial Technology and Natural Language Processing and the 1st Agent AI for Scenario Planning
2023
MAPO: Boosting Large Language Model Performance with Model-Adaptive Prompt Optimization
Yuyan Chen | Zhihao Wen | Ge Fan | Zhengyu Chen | Wei Wu | Dayiheng Liu | Zhixu Li | Bang Liu | Yanghua Xiao
Findings of the Association for Computational Linguistics: EMNLP 2023
Prompt engineering, as an efficient and effective way to leverage Large Language Models (LLMs), has attracted considerable attention from the research community. Existing research primarily emphasizes adapting prompts to specific tasks rather than to specific LLMs. However, a good prompt is not solely defined by its wording; it is also bound to the nature of the LLM in question. In this work, we first quantitatively demonstrate that different prompts should be adapted to different LLMs to enhance their capabilities across various downstream NLP tasks. We then propose a model-adaptive prompt optimizer (MAPO) that optimizes the original prompts for each specific LLM in downstream tasks. Extensive experiments indicate that the proposed method effectively refines prompts for an LLM, leading to significant improvements across various downstream tasks.
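To see why prompt adaptation is per-model, consider the toy selection sketch below: the same candidate prompts are scored separately for each LLM on a small dev set, and each model keeps the prompt that suits it best. MAPO optimizes prompts rather than merely selecting among fixed candidates, so this is only an illustration of the underlying motivation; all names, models, and prompts here are hypothetical.

```python
# Toy sketch of model-adaptive prompt selection: score the same candidate
# prompts separately per LLM and keep the best one for each model.
# Illustrative only; MAPO optimizes prompts rather than selecting them.
def pick_prompt_per_model(models, candidate_prompts, dev_set, score_fn):
    """models: dict name -> callable(prompt, x) -> prediction.
    score_fn: callable(prediction, reference) -> 0/1 or float."""
    best = {}
    for name, model in models.items():
        scored = []
        for prompt in candidate_prompts:
            acc = sum(score_fn(model(prompt, x), y) for x, y in dev_set) / len(dev_set)
            scored.append((acc, prompt))
        best[name] = max(scored)[1]  # highest-accuracy prompt for this model
    return best

# Toy usage: two fake "models" that respond well to different phrasings.
dev_set = [("2+2", "4"), ("3+3", "6")]
models = {
    "model_a": lambda p, x: "4" if "step by step" in p and x == "2+2" else "6",
    "model_b": lambda p, x: "4" if "directly" in p and x == "2+2" else "6",
}
prompts = ["Answer step by step:", "Answer directly:"]
print(pick_prompt_per_model(models, prompts, dev_set, lambda pred, ref: pred == ref))
```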