Xudong Zhou
2026
Learning Temporally-Aware Sample Weights for Preference Optimization
Mengyang Li | Xudong Zhou | Pinlong Zhao
Findings of the Association for Computational Linguistics: ACL 2026
Preference optimization is fundamental for aligning large language models. While existing methods use sample weighting, they typically rely on static functions of instantaneous model states and ignore temporal learning dynamics. We contend that a sample’s value evolves throughout training, characterized by patterns such as stable convergence or noisy oscillation. We propose MetaPO, a framework that meta-learns adaptive weights using three temporal features: reward margin evolution, learning volatility, and reference deviation. Through bilevel optimization on validation data, MetaPO automatically discovers weighting strategies tailored to specific datasets. Experiments on models ranging from 7B to 70B parameters demonstrate statistically significant improvements over strong baselines, achieving gains of up to 2.4 points on AlpacaEval 2.0 and Arena-Hard. Interpretability analysis confirms that temporal features drive over 70% of the weighting decisions and that the learned weights correlate strongly with sample quality.
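To make the three temporal features named in the abstract concrete, here is a minimal Python sketch of how they might be computed for a single preference pair from a logged history of reward margins. The abstract does not give formulas, so everything here is an assumption made for illustration and not MetaPO's actual definitions: the function names `temporal_features` and `sample_weight`, the use of average per-step change for margin evolution, the standard deviation of step-to-step changes for volatility, the mean absolute policy-to-reference log-ratio for reference deviation, and the logistic weighting with parameters `theta`. In MetaPO the weighting function is meta-learned through bilevel optimization on validation data; that outer loop is not shown, and `theta` is simply fixed by hand.

```python
import math
from statistics import pstdev


def temporal_features(margins, chosen_logratio, rejected_logratio):
    """Illustrative temporal features for one preference pair.

    margins: reward margins recorded at successive training checkpoints.
    chosen_logratio / rejected_logratio: latest log(pi / pi_ref) values
    for the chosen and rejected responses.
    All three feature definitions are assumptions, not taken from the paper.
    """
    if len(margins) < 2:
        raise ValueError("need at least two recorded margins")
    # Step-to-step changes in the reward margin.
    diffs = [b - a for a, b in zip(margins, margins[1:])]
    # Margin evolution: average change per step over the window.
    trend = (margins[-1] - margins[0]) / (len(margins) - 1)
    # Learning volatility: spread of the step-to-step changes.
    volatility = pstdev(diffs)
    # Reference deviation: mean absolute log-ratio to the reference model.
    ref_deviation = 0.5 * (abs(chosen_logratio) + abs(rejected_logratio))
    return trend, volatility, ref_deviation


def sample_weight(features, theta, bias=0.0):
    """Map features to a weight in (0, 1) with a logistic function.

    theta is hand-set here; a meta-learned weighting network would
    replace this linear form.
    """
    z = bias + sum(t * f for t, f in zip(theta, features))
    return 1.0 / (1.0 + math.exp(-z))


# A steadily converging sample versus a noisily oscillating one.
stable = temporal_features([0.0, 0.5, 1.0, 1.5], 0.4, -0.6)
noisy = temporal_features([0.0, 1.0, -1.0, 1.0], 0.4, -0.6)

# Hypothetical parameters: reward upward trend, penalize volatility.
theta = (1.0, -1.0, 0.0)
w_stable = sample_weight(stable, theta)
w_noisy = sample_weight(noisy, theta)
```

With these hand-picked parameters the steadily converging sample receives a larger weight than the oscillating one, which mirrors the intuition in the abstract. A learned weighting could of course settle on a different strategy for a given dataset.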