Zhiyi Chen
2025
EcoTune: Token-Efficient Multi-Fidelity Hyperparameter Optimization for Large Language Model Inference
Yuebin Xu
|
Zhiyi Chen
|
Zeyi Wen
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Tuning inference hyperparameters, such as temperature and maximum output tokens, on downstream tasks can enhance inference performance. However, directly applying hyperparameter optimization to these hyperparameters is token-expensive. Multi-fidelity optimization improves HPO efficiency with low-fidelity evaluations, but its static scheduling strategies ignore token consumption, leading to high costs. To address these limitations, we propose a token-efficient multi-fidelity optimization method, which enhances inference performance and minimizes token usage. Our method is empowered by (i) a token-based fidelity definition with explicit token cost modeling on configurations; (ii) a novel Token-Aware Expected Improvement acquisition function that selects configurations based on performance gain per token; and (iii) a dynamic fidelity scheduling mechanism that adapts to real-time budget status. We evaluate our method on LLaMA-2 and LLaMA-3 series across MMLU, Humaneval, MedQA, and OpenBookQA. Our method improves over the HELM leaderboard by 7.1%, 24.3%, 21.9%, and 4.6%, respectively. Compared to existing multi-fidelity HPO baselines, our method reduces token consumption by over 80% while maintaining or surpassing performance, demonstrating the state-of-the-art token efficiency for inference-time optimization.