CONTRANS: Weak-to-Strong Alignment Engineering via Concept Transplantation

Weilong Dong, Xinwei Wu, Renren Jin, Shaoyang Xu, Deyi Xiong


Abstract
Ensuring that large language models (LLMs) behave consistently with human goals, values, and intentions is crucial for their safety, yet computationally expensive. To reduce the cost of alignment training, especially for LLMs with a huge number of parameters, and to reuse learned value alignment, we propose ConTrans, a novel framework that enables weak-to-strong alignment transfer via concept transplantation. From the perspective of representation engineering, ConTrans first refines concept vectors associated with value alignment from a source LLM (usually a weak yet aligned LLM). The refined concept vectors are then reformulated via an affine transformation to fit the target LLM (usually a strong yet unaligned base LLM). In the third step, ConTrans transplants the reformulated concept vectors into the residual stream of the target LLM. Experiments demonstrate the successful transplantation of a wide range of aligned concepts from 7B models to 13B and 70B models across multiple LLMs and LLM families. Remarkably, ConTrans even surpasses instruction-tuned models in terms of truthfulness. The results validate the effectiveness of both inter-LLM-family and intra-LLM-family concept transplantation. Our work demonstrates an alternative way to achieve weak-to-strong alignment generalization and control.
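
The abstract outlines a three-step pipeline: extract a value-related concept vector from the weak, aligned source model; map it into the stronger target model's representation space with an affine transformation; and add the mapped vector to the target model's residual stream at inference time. Below is a minimal PyTorch sketch of that pipeline, assuming Transformers-style decoder models; the function names, the mean-difference extraction, and the least-squares fit of the affine map are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the pipeline described in the abstract (illustrative only).
import torch

def extract_concept_vector(hidden_pos, hidden_neg):
    """Refine a concept vector at a chosen layer of the weak, aligned source LLM.
    hidden_pos / hidden_neg: (num_examples, d_source) activations for prompts that
    do / do not express the aligned concept. Mean-difference is an assumed choice."""
    return hidden_pos.mean(dim=0) - hidden_neg.mean(dim=0)

def fit_affine_map(src_acts, tgt_acts):
    """Fit an affine map (W, b) from the source hidden space (d_source) to the
    target hidden space (d_target) via least squares over paired activations
    collected on shared prompts (an assumed way to obtain the transformation)."""
    n = src_acts.shape[0]
    X = torch.cat([src_acts, torch.ones(n, 1)], dim=1)   # append bias column
    sol = torch.linalg.lstsq(X, tgt_acts).solution        # (d_source + 1, d_target)
    W, b = sol[:-1], sol[-1]
    return W, b

def add_steering_hook(target_layer, concept_vec_tgt, alpha=1.0):
    """Transplant the reformulated concept vector into the target LLM's residual
    stream by adding it to the layer output during the forward pass."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * concept_vec_tgt.to(hidden.dtype).to(hidden.device)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return target_layer.register_forward_hook(hook)
```

In this sketch, the mapped vector `concept_vec_src @ W + b` would be passed to `add_steering_hook` on a chosen decoder block of the strong target model, with `alpha` controlling the strength of the intervention; all names here are hypothetical.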
Anthology ID: 2025.coling-main.279
Volume: Proceedings of the 31st International Conference on Computational Linguistics
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue: COLING
Publisher: Association for Computational Linguistics
Pages: 4130–4148
URL: https://aclanthology.org/2025.coling-main.279/
Cite (ACL): Weilong Dong, Xinwei Wu, Renren Jin, Shaoyang Xu, and Deyi Xiong. 2025. CONTRANS: Weak-to-Strong Alignment Engineering via Concept Transplantation. In Proceedings of the 31st International Conference on Computational Linguistics, pages 4130–4148, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal): CONTRANS: Weak-to-Strong Alignment Engineering via Concept Transplantation (Dong et al., COLING 2025)
PDF: https://aclanthology.org/2025.coling-main.279.pdf