@inproceedings{aissi-etal-2025-reinforcement,
title = "Reinforcement Learning for Aligning Large Language Models Agents with Interactive Environments: Quantifying and Mitigating Prompt Overfitting",
author = "Aissi, Mohamed Salim and
Romac, Cl{\'e}ment and
Carta, Thomas and
Lamprier, Sylvain and
Oudeyer, Pierre-Yves and
Sigaud, Olivier and
Soulier, Laure and
Thome, Nicolas",
editor = "Chiruzzo, Luis and
Ritter, Alan and
Wang, Lu",
booktitle = "Findings of the Association for Computational Linguistics: NAACL 2025",
month = apr,
year = "2025",
address = "Albuquerque, New Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-naacl.390/",
doi = "10.18653/v1/2025.findings-naacl.390",
pages = "7030--7046",
ISBN = "979-8-89176-195-7",
abstract = "Reinforcement learning (RL) is a promising approach for aligning large language models (LLMs) knowledge with sequential decision-making tasks. However, few studies have thoroughly investigated the impact on LLM agents capabilities of fine-tuning them with RL in a specific environment. In this paper, we propose a novel framework to analyze the sensitivity of LLMs to prompt formulations following RL training in a textual environment. Our findings reveal that the performance of LLMs degrades when faced with prompt formulations different from those used during the RL training phase. Besides, we analyze the source of this sensitivity by examining the model{'}s internal representations and salient tokens. Finally, we propose to use a contrastive loss to mitigate this sensitivity and improve the robustness and generalization capabilities of LLMs."
}<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods ID="aissi-etal-2025-reinforcement">
    <titleInfo>
      <title>Reinforcement Learning for Aligning Large Language Models Agents with Interactive Environments: Quantifying and Mitigating Prompt Overfitting</title>
    </titleInfo>
    <name type="personal">
      <namePart type="given">Mohamed</namePart>
      <namePart type="given">Salim</namePart>
      <namePart type="family">Aissi</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Clément</namePart>
      <namePart type="family">Romac</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Thomas</namePart>
      <namePart type="family">Carta</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Sylvain</namePart>
      <namePart type="family">Lamprier</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Pierre-Yves</namePart>
      <namePart type="family">Oudeyer</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Olivier</namePart>
      <namePart type="family">Sigaud</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Laure</namePart>
      <namePart type="family">Soulier</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Nicolas</namePart>
      <namePart type="family">Thome</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <originInfo>
      <dateIssued>2025-04</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
      <titleInfo>
        <title>Findings of the Association for Computational Linguistics: NAACL 2025</title>
      </titleInfo>
      <name type="personal">
        <namePart type="given">Luis</namePart>
        <namePart type="family">Chiruzzo</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Alan</namePart>
        <namePart type="family">Ritter</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Lu</namePart>
        <namePart type="family">Wang</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <originInfo>
        <publisher>Association for Computational Linguistics</publisher>
        <place>
          <placeTerm type="text">Albuquerque, New Mexico</placeTerm>
        </place>
      </originInfo>
      <genre authority="marcgt">conference publication</genre>
      <identifier type="isbn">979-8-89176-195-7</identifier>
    </relatedItem>
    <abstract>Reinforcement learning (RL) is a promising approach for aligning the knowledge of large language models (LLMs) with sequential decision-making tasks. However, few studies have thoroughly investigated the impact of fine-tuning LLM agents with RL in a specific environment on their capabilities. In this paper, we propose a novel framework to analyze the sensitivity of LLMs to prompt formulations following RL training in a textual environment. Our findings reveal that the performance of LLMs degrades when faced with prompt formulations different from those used during the RL training phase. In addition, we analyze the source of this sensitivity by examining the model’s internal representations and salient tokens. Finally, we propose to use a contrastive loss to mitigate this sensitivity and improve the robustness and generalization capabilities of LLMs.</abstract>
<identifier type="citekey">aissi-etal-2025-reinforcement</identifier>
<identifier type="doi">10.18653/v1/2025.findings-naacl.390</identifier>
<location>
<url>https://aclanthology.org/2025.findings-naacl.390/</url>
</location>
<part>
<date>2025-04</date>
<extent unit="page">
<start>7030</start>
<end>7046</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Reinforcement Learning for Aligning Large Language Models Agents with Interactive Environments: Quantifying and Mitigating Prompt Overfitting
%A Aissi, Mohamed Salim
%A Romac, Clément
%A Carta, Thomas
%A Lamprier, Sylvain
%A Oudeyer, Pierre-Yves
%A Sigaud, Olivier
%A Soulier, Laure
%A Thome, Nicolas
%Y Chiruzzo, Luis
%Y Ritter, Alan
%Y Wang, Lu
%S Findings of the Association for Computational Linguistics: NAACL 2025
%D 2025
%8 April
%I Association for Computational Linguistics
%C Albuquerque, New Mexico
%@ 979-8-89176-195-7
%F aissi-etal-2025-reinforcement
%X Reinforcement learning (RL) is a promising approach for aligning the knowledge of large language models (LLMs) with sequential decision-making tasks. However, few studies have thoroughly investigated the impact of fine-tuning LLM agents with RL in a specific environment on their capabilities. In this paper, we propose a novel framework to analyze the sensitivity of LLMs to prompt formulations following RL training in a textual environment. Our findings reveal that the performance of LLMs degrades when faced with prompt formulations different from those used during the RL training phase. In addition, we analyze the source of this sensitivity by examining the model’s internal representations and salient tokens. Finally, we propose to use a contrastive loss to mitigate this sensitivity and improve the robustness and generalization capabilities of LLMs.
%R 10.18653/v1/2025.findings-naacl.390
%U https://aclanthology.org/2025.findings-naacl.390/
%U https://doi.org/10.18653/v1/2025.findings-naacl.390
%P 7030-7046
Markdown (Informal)
[Reinforcement Learning for Aligning Large Language Models Agents with Interactive Environments: Quantifying and Mitigating Prompt Overfitting](https://aclanthology.org/2025.findings-naacl.390/) (Aissi et al., Findings 2025)
ACL
- Mohamed Salim Aissi, Clément Romac, Thomas Carta, Sylvain Lamprier, Pierre-Yves Oudeyer, Olivier Sigaud, Laure Soulier, and Nicolas Thome. 2025. Reinforcement Learning for Aligning Large Language Models Agents with Interactive Environments: Quantifying and Mitigating Prompt Overfitting. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 7030–7046, Albuquerque, New Mexico. Association for Computational Linguistics.