Introducing Compiler Semantics into Large Language Models as Programming Language Translators: A Case Study of C to x86 Assembly

Shuoming Zhang; Jiacheng Zhao; Chunwei Xia; Zheng Wang; Yunji Chen; Huimin Cui

doi:10.18653/v1/2024.findings-emnlp.55

Abstract

Compilers are complex software containing millions of lines of code, taking years to develop. This paper investigates to what extent Large Language Models (LLMs) can replace hand-crafted compilers in translating high-level programming languages to machine instructions, using C to x86 assembly as a case study. We identify two challenges of using LLMs for code translation and introduce two novel data pre-processing techniques to address the challenges: numerical value conversion and training data resampling. While only using a 13B model, our approach achieves a behavioral accuracy of over 91%, outperforming the much larger GPT-4 Turbo model by over 50%. Our results are encouraging, showing that LLMs have the potential to transform how compilation tools are constructed.

Anthology ID: 2024.findings-emnlp.55
Volume: Findings of the Association for Computational Linguistics: EMNLP 2024
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 996–1011
URL: https://aclanthology.org/2024.findings-emnlp.55/
DOI: 10.18653/v1/2024.findings-emnlp.55
Cite (ACL): Shuoming Zhang, Jiacheng Zhao, Chunwei Xia, Zheng Wang, Yunji Chen, and Huimin Cui. 2024. Introducing Compiler Semantics into Large Language Models as Programming Language Translators: A Case Study of C to x86 Assembly. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 996–1011, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): Introducing Compiler Semantics into Large Language Models as Programming Language Translators: A Case Study of C to x86 Assembly (Zhang et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-emnlp.55.pdf
Software: 2024.findings-emnlp.55.software.zip
Data: 2024.findings-emnlp.55.data.zip
