@inproceedings{bhattarai-etal-2025-enhancing,
title = "Enhancing Cross-Language Code Translation via Task-Specific Embedding Alignment in Retrieval-Augmented Generation",
author = "Bhattarai, Manish and
Vu, Minh and
E. Santos, Javier and
Boureima, Ismael and
O{'}Malley, Daniel",
editor = "Shi, Weijia and
Yu, Wenhao and
Asai, Akari and
Jiang, Meng and
Durrett, Greg and
Hajishirzi, Hannaneh and
Zettlemoyer, Luke",
booktitle = "Proceedings of the 4th International Workshop on Knowledge-Augmented Methods for Natural Language Processing",
month = may,
year = "2025",
address = "Albuquerque, New Mexico, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.knowledgenlp-1.8/",
doi = "10.18653/v1/2025.knowledgenlp-1.8",
pages = "107--117",
ISBN = "979-8-89176-229-9",
abstract = "We introduce a novel method to enhance cross-language code translation from Fortran to C++ by integrating task-specific embedding alignment into a Retrieval-Augmented Generation (RAG) framework. Unlike conventional retrieval approaches that utilize generic embeddings agnostic to the downstream task, our strategy aligns the retrieval model directly with the objective of maximizing translation quality, as quantified by the CodeBLEU metric. This alignment ensures that the embeddings are semantically and syntactically meaningful for the specific code translation task. Our methodology involves constructing a dataset of 25,000 Fortran code snippets sourced from Stack-V2 dataset and generating their corresponding C++ translations using the LLaMA 3.1-8B language model. We compute pairwise CodeBLEU scores between the generated translations and ground truth examples to capture fine-grained similarities. These scores serve as supervision signals in a contrastive learning framework, where we optimize the embedding model to retrieve Fortran-C++ pairs that are most beneficial for improving the language model{'}s translation performance. By integrating these CodeBLEU-optimized embeddings into the RAG framework, our approach significantly enhances both retrieval accuracy and code generation quality over methods employing generic embeddings. On the HPC Fortran2C++ dataset, our method elevates the average CodeBLEU score from 0.64 to 0.73, achieving a 14{\%} relative improvement. On the Numerical Recipes dataset, we observe an increase from 0.52 to 0.60, marking a 15{\%} relative improvement. Importantly, these gains are realized without any fine-tuning of the language model, underscoring the efficiency and practicality of our approach."
}
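The abstract above describes aligning a retrieval encoder with translation quality by using pairwise CodeBLEU scores as soft supervision in a contrastive objective. Below is a minimal illustrative sketch of what such an objective could look like, assuming a PyTorch encoder that produces embeddings for Fortran snippets and a precomputed matrix of pairwise CodeBLEU scores; the function name, the softmax-based soft-label formulation, and the temperature value are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only -- not the authors' code. Assumes embeddings from a
# retrieval encoder and a precomputed (B x N) matrix of pairwise CodeBLEU
# scores between translations generated for the B queries and the N candidates.
import torch
import torch.nn.functional as F

def codebleu_weighted_contrastive_loss(query_emb: torch.Tensor,
                                       cand_emb: torch.Tensor,
                                       codebleu: torch.Tensor,
                                       tau: float = 0.07) -> torch.Tensor:
    """Soft contrastive loss: candidates with higher CodeBLEU supervision
    receive more probability mass in the target retrieval distribution."""
    q = F.normalize(query_emb, dim=-1)           # (B, d) Fortran query embeddings
    c = F.normalize(cand_emb, dim=-1)            # (N, d) candidate pool embeddings
    logits = q @ c.T / tau                       # (B, N) retrieval similarities
    targets = F.softmax(codebleu / tau, dim=-1)  # (B, N) soft labels from CodeBLEU
    # Cross-entropy between the CodeBLEU-derived target distribution and the
    # encoder's retrieval distribution; minimizing it nudges the encoder toward
    # retrieving examples that correlate with higher translation quality.
    return -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```

At inference time, per the abstract, the tuned encoder would simply replace the generic embedding model in the RAG pipeline: embed the Fortran query, retrieve the nearest Fortran-C++ pairs, and prepend them as few-shot examples to the translation prompt, with no fine-tuning of the language model itself.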