Hongyang Chen
2025
When Evolution Strategy Meets Language Models Tuning
Bo Huang | Yuxin Jiang | Mingyang Chen | Yi Wang | Hongyang Chen | Wei Wang
Proceedings of the 31st International Conference on Computational Linguistics
Supervised Fine-tuning has been pivotal in training autoregressive language models, yet it introduces exposure bias. To mitigate this, Post Fine-tuning, which includes on-policy and off-policy methods, has emerged as a way to further enhance models. However, each family of methods has limitations in terms of performance gains and susceptibility to overfitting. In this paper, we introduce a novel on-policy approach called Evolution Strategy Optimization (ESO), which draws on the principle of biological evolution, namely survival of the fittest. Specifically, we treat model tuning as an evolutionary process in which each output sentence generated by the model provides a perturbation signal to the model parameter space. The fitness of each perturbation signal is then quantified as the difference between its score and the average score, both given by a reward function, and this fitness guides the optimization process. Empirically, the proposed method achieves superior performance on various tasks and comparable performance on the human alignment task.
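The fitness idea in the abstract can be illustrated with a minimal sketch. The abstract only states that fitness is a sampled output's reward score minus the averaged score; everything else here is an assumption made for illustration, including the function names `fitness_scores` and `surrogate_loss`, the toy reward and log-probability values, and the choice to use fitness as a weight on each output's log-probability. The paper's actual objective may differ.

```python
from statistics import fmean


def fitness_scores(rewards):
    """Fitness of each sampled output: its reward minus the batch average."""
    baseline = fmean(rewards)
    return [r - baseline for r in rewards]


def surrogate_loss(log_probs, rewards):
    """Fitness-weighted negative log-likelihood.

    Assumed objective for illustration only: outputs with above-average
    reward are pushed up, outputs with below-average reward are pushed down.
    """
    fitness = fitness_scores(rewards)
    weighted = sum(f * lp for f, lp in zip(fitness, log_probs))
    return -weighted / len(rewards)


# Hypothetical rewards and sequence log-probabilities for three sampled outputs.
rewards = [1.0, 2.0, 3.0]
log_probs = [-2.0, -1.0, -0.5]

fitness = fitness_scores(rewards)  # [-1.0, 0.0, 1.0]
loss = surrogate_loss(log_probs, rewards)
```

Because the average is subtracted, the fitness values sum to zero within a batch, so an output is favored only relative to the other samples drawn from the same model.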