Pinlong Zhao

2026

Learning Temporally-Aware Sample Weights for Preference Optimization
Mengyang Li | Xudong Zhou | Pinlong Zhao
Findings of the Association for Computational Linguistics: ACL 2026

Preference optimization is fundamental for aligning large language models. While existing methods use sample weighting, they typically rely on static functions of instantaneous model states and ignore temporal learning dynamics. We contend that a sample’s value evolves throughout training, characterized by patterns such as stable convergence or noisy oscillation. We propose MetaPO, a framework that meta-learns adaptive weights using three temporal features: reward margin evolution, learning volatility, and reference deviation. Through bilevel optimization on validation data, MetaPO automatically discovers weighting strategies tailored to specific datasets. Experiments on models ranging from 7B to 70B parameters demonstrate statistically significant improvements over strong baselines, achieving gains of up to 2.4 points on AlpacaEval 2.0 and Arena-Hard. Interpretability analysis confirms that temporal features drive over 70% of the weighting decisions and that the learned weights correlate strongly with sample quality.

B-APO: Bias-Targeted Adversarial Preference Optimization for Debiasing Multimodal Large Language Models
Pinlong Zhao | Zike Ding | Zengshu Ye | Zhou Zhaoting
Findings of the Association for Computational Linguistics: ACL 2026

Multimodal Large Language Models (MLLMs) often suffer from modality bias, where the model disproportionately relies on one modality while neglecting critical information from others. Existing debiasing methods via modality masking create biased responses by completely removing an entire modality, forming an extreme and static training environment. However, real-world multimodal bias often emerges under subtle perturbations (e.g., mild occlusion, noisy instructions), where both modalities are present but the model is tempted to rely on spurious shortcuts. We propose B-APO (Bias-Targeted Adversarial Preference Optimization), which casts debiasing as a bias-targeted min-max game: we generate hard negatives by applying small adversarial perturbations in the latent space to maximally induce language-vision-prior reliance, and then perform preference alignment to enlarge the margin between clean and adversarial responses. This encourages the model to anchor on true cross-modal evidence even under the most adversarial conditions. Extensive experiments on bias and hallucination benchmarks demonstrate that B-APO achieves superior debiasing performance while maintaining general capabilities.

What Tokens Truly Matter? The Logit Conflation Problem in LLM Sampling
Pinlong Zhao | Huijun Tang | Pengfei Jiao | Mengyang Li
Findings of the Association for Computational Linguistics: ACL 2026

Sampling methods for large language models select candidate tokens based on logit statistics, implicitly assuming that high logits indicate desirable outputs. We identify the Logit Conflation Problem, where a token’s logit aggregates prompt-independent factors, including linguistic fluency and parametric associations, with prompt-relevance. However, only prompt-relevance determines instruction-following quality. We propose SEAL-Sampling (Signal Extraction for Active ReLevance) to isolate this component through attention-weighted attribution. Our framework defines prompt-relevance as the causal effect of prompt content on token logits and establishes attention patterns as an efficient proxy. Experiments on LLaMA-3 demonstrate significant improvements over top-nσ, with gains of 1.8% on AlpacaEval 2.0 and 2.2% on IFEval. Furthermore, attribution scores correlate weakly with raw logits, confirming the extraction of an orthogonal signal. The method is training-free and introduces minimal latency, adding less than 12ms overhead per token.

What Do LLMs Learn First? Asymmetric Learning Dynamics of Input Complexity and Output Ambiguity in Preference Alignment
Mengyang Li | Jingwen Wang | Pinlong Zhao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Direct Preference Optimization (DPO) has become a standard approach for aligning large language models with human preferences, yet existing methods treat all preference pairs uniformly during training. We identify two distinct sources of learning difficulty: Input Complexity (IC), capturing prompt understanding challenges, and Output Ambiguity (OA), measuring preference discrimination difficulty. Through systematic analysis, we demonstrate that these dimensions induce asymmetric learning dynamics, with IC-related competencies developing rapidly in early training while OA-related competencies emerge more gradually. Building on this observation, we propose DECOPO, a training framework that maintains separate, adaptive pacing schedules for each dimension. Experiments on UltraFeedback show that DECOPO achieves 42.3% length-controlled win rate on AlpacaEval 2.0 and 7.66 on MT-Bench, outperforming curriculum baselines by 2.1% and 0.21 points respectively, while matching full-data baseline performance with only 75% of training samples.

Co-authors

Zengshu Ye 1

Zhou Zhaoting 1

Xudong Zhou 1

Venues

Findings 3
ACL 1
