2025
HAF-RM: A Hybrid Alignment Framework for Reward Model Training
Shujun Liu | Xiaoyu Shen | Yuhang Lai | Siyuan Wang | Shengbin Yue | Zengfeng Huang | Xuanjing Huang | Zhongyu Wei
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The reward model has become increasingly important in alignment, assessment, and data construction for large language models (LLMs). Most existing research focuses on enhancing reward models through data improvements, following the conventional training framework that directly optimizes the predicted rewards. In this paper, we propose **HAF-RM**, a hybrid alignment framework for reward model training that introduces an additional constraint on token-level policy probabilities alongside the reward score. It simultaneously supervises the internal preference model at the token level and optimizes the mapping layer of the reward model at the sequence level. Experimental results on five datasets demonstrate the validity and effectiveness of our proposed hybrid framework for training a high-quality reward model. By decoupling the reward modeling procedure and incorporating hybrid supervision, our **HAF-RM** framework offers a principled and effective approach to enhancing the performance and alignment of reward models, a critical component in the responsible development of powerful language models. We release our code at [https://haf-rm.github.io](https://haf-rm.github.io).
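The hybrid objective can be pictured as a sequence-level ranking loss on the reward head combined with a token-level constraint on the policy's log-probabilities. The sketch below is a minimal illustration under assumptions of ours (a Bradley-Terry ranking term plus a DPO-style token-level term balanced by a weight `lam`), not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def hybrid_loss(r_chosen, r_rejected,                 # scalar rewards, shape (batch,)
                logp_chosen, logp_rejected,           # summed token log-probs under the model
                logp_chosen_ref, logp_rejected_ref,   # same under a frozen reference model
                beta=0.1, lam=0.5):
    # Sequence-level ranking loss: the reward head should score the
    # chosen response above the rejected one (Bradley-Terry).
    reward_loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    # Token-level policy constraint: a DPO-style implicit preference
    # margin over the model's token log-probabilities.
    margin = beta * ((logp_chosen - logp_chosen_ref)
                     - (logp_rejected - logp_rejected_ref))
    policy_loss = -F.logsigmoid(margin).mean()
    # lam trades off sequence-level and token-level supervision.
    return reward_loss + lam * policy_loss
```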
Multi-Agent Simulator Drives Language Models for Legal Intensive Interaction
Shengbin Yue | Ting Huang | Zheng Jia | Siyuan Wang | Shujun Liu | Yun Song | Xuanjing Huang | Zhongyu Wei
Findings of the Association for Computational Linguistics: NAACL 2025
Large Language Models (LLMs) have significantly advanced legal intelligence, but the scarcity of scenario data impedes progress toward interactive legal scenarios. This paper introduces a Multi-agent Legal Simulation Driver (MASER) that scalably generates synthetic data by simulating interactive legal scenarios. Leveraging real legal case sources, MASER ensures the consistency of legal attributes between participants and introduces a supervisory mechanism to align participants’ characters and behaviors and to address distractions. A Multi-stage Interactive Legal Evaluation (MILE) benchmark is further constructed to evaluate LLMs’ performance in dynamic legal scenarios. Extensive experiments confirm the effectiveness of our framework.
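At its core, such a driver runs role-playing agents through a dialogue while a supervisor checks each turn for character and behavioral consistency. The sketch below is a hypothetical illustration of that loop; the agent and supervisor interfaces (`respond`, `consistent`) are our assumptions, not MASER's actual API:

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces, for illustration only.
Agent = Callable[[str, List[Tuple[str, str]]], str]   # (case, transcript) -> utterance
Checker = Callable[[str, str], bool]                  # (role, utterance) -> in character?

def simulate_case(case: str, agents: List[Tuple[str, Agent]],
                  consistent: Checker, max_turns: int = 10,
                  max_retries: int = 3) -> List[Tuple[str, str]]:
    """Run a multi-agent dialogue; a supervisor re-samples off-character turns."""
    transcript: List[Tuple[str, str]] = []
    for _ in range(max_turns):
        for role, respond in agents:
            utterance = respond(case, transcript)
            # Supervisory mechanism: re-sample a bounded number of times
            # until the utterance stays consistent with the assigned role.
            for _ in range(max_retries):
                if consistent(role, utterance):
                    break
                utterance = respond(case, transcript)
            transcript.append((role, utterance))
    return transcript
```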
How Jailbreak Defenses Work and Ensemble? A Mechanistic Investigation
Zhuohan Long | Siyuan Wang | Shujun Liu | Yuhang Lai
Findings of the Association for Computational Linguistics: EMNLP 2025
Jailbreak attacks, where harmful prompts bypass generative models’ built-in safety, raise serious concerns about model vulnerability. While many defense methods have been proposed, the trade-offs between safety and helpfulness, and their application to Large Vision-Language Models (LVLMs), are not well understood. This paper systematically examines jailbreak defenses by reframing the standard generation task as a binary classification problem to assess model refusal tendencies for both harmful and benign queries. We identify two key defense mechanisms: safety shift, which increases refusal rates across all queries, and harmfulness discrimination, which improves the model’s ability to differentiate between harmful and benign inputs. Using these mechanisms, we develop two ensemble defense strategies—inter-mechanism and intra-mechanism ensembles—to balance safety and helpfulness. Experiments on the MM-SafetyBench and MOSSBench datasets with LLaVA-1.5 models show that these strategies effectively improve model safety or optimize the trade-off between safety and helpfulness.
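Viewing each defense as a binary refusal classifier makes the two ensemble strategies easy to state. The sketch below is a minimal illustration under our own assumptions about the aggregation rules (an OR across the two mechanisms, a majority vote within one), not the paper's exact procedure:

```python
from typing import Callable, List

# A defense, seen as a refusal classifier: query -> True if the model refuses.
Defense = Callable[[str], bool]

def inter_mechanism(safety_shift: Defense, discriminator: Defense, query: str) -> bool:
    # Combine the two mechanisms: refuse if either the safety-shift
    # defense or the harmfulness-discrimination defense flags the query.
    return safety_shift(query) or discriminator(query)

def intra_mechanism(defenses: List[Defense], query: str) -> bool:
    # Majority vote among defenses that share the same mechanism.
    votes = sum(d(query) for d in defenses)
    return votes * 2 > len(defenses)
```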
2024
ALaRM: Align Language Models via Hierarchical Rewards Modeling
Yuhang Lai | Siyuan Wang | Shujun Liu | Xuanjing Huang | Zhongyu Wei
Findings of the Association for Computational Linguistics: ACL 2024
We introduce ALaRM, the first framework modeling hierarchical rewards in reinforcement learning from human feedback (RLHF), which is designed to enhance the alignment of large language models (LLMs) with human preferences. The framework addresses the limitations of current alignment approaches, which often struggle with the inconsistency and sparsity of human supervision signals, by integrating holistic rewards with aspect-specific rewards. This integration enables more precise and consistent guidance of language models towards desired outcomes, particularly in complex and open text generation tasks. By employing a methodology that filters and combines multiple rewards based on their consistency, the framework provides a reliable mechanism for improving model alignment. We validate our approach through applications in long-form question answering and machine translation tasks, employing gpt-3.5-turbo for pairwise comparisons, and demonstrate improvements over existing baselines. Our work underscores the effectiveness of hierarchical rewards modeling in refining LLM training processes for better human preference alignment. We release our code at https://ALaRM-fdu.github.io.
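One way to read the combination step: aspect-specific rewards are filtered for consistency with the holistic reward before being added to the training signal. The sketch below illustrates that idea under our own assumptions (a sign-agreement filter and per-aspect weights); it is not the paper's published formula:

```python
from typing import Dict

def hierarchical_reward(holistic: float, aspects: Dict[str, float],
                        weights: Dict[str, float]) -> float:
    """Combine a holistic reward with consistency-filtered aspect rewards."""
    total = holistic
    for name, score in aspects.items():
        # Consistency filter (our assumption): keep an aspect reward only
        # when it points the same way as the holistic signal, so sparse or
        # contradictory aspect judgments cannot dominate.
        if holistic * score > 0:
            total += weights.get(name, 1.0) * score
    return total

# Example: holistic reward agrees with "factuality" but not "fluency",
# so only the factuality aspect contributes.
r = hierarchical_reward(0.8, {"factuality": 0.5, "fluency": -0.2},
                        {"factuality": 0.3, "fluency": 0.3})
```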