Unsupervised Binary Code Translation with Application to Code Clone Detection and Vulnerability Discovery

Iftakhar Ahmad, Lannan Luo


Abstract
Binary code analysis is of immense importance in software security research. Today, software is very often compiled for various Instruction Set Architectures (ISAs). As a result, cross-architecture binary code analysis has become an emerging problem. Recently, deep learning-based binary analysis has shown promising success. It is widely known that training a deep learning model requires a massive amount of data. However, for some low-resource ISAs, an adequate amount of data is hard to find, preventing deep learning from being widely adopted for binary analysis. To overcome the data scarcity problem and facilitate cross-architecture binary code analysis, we propose to apply the ideas and techniques of Neural Machine Translation (NMT) to binary code analysis. Our insight is that a binary, after disassembly, is represented in some assembly language. Given a binary in a low-resource ISA, we translate it to a binary in a high-resource ISA (e.g., x86). Then we can use a model that has been trained on the high-resource ISA to test the translated binary. We have implemented the model, called UNSUPERBINTRANS, and conducted experiments to evaluate its performance. Specifically, we evaluated it on two downstream tasks: code similarity detection and vulnerability discovery. In both tasks, we achieved high accuracy.
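
The translate-then-reuse idea in the abstract can be illustrated with a minimal sketch. This is not the UNSUPERBINTRANS implementation: the opcode lookup table below is a hypothetical stand-in for the learned translation model, the "x86 model" is a toy function, and every name in the block is an assumption made for illustration only.

```python
from typing import Callable, List

# Hypothetical token-level lookup standing in for the learned NMT model.
# The mappings are illustrative and are not taken from the paper.
TOY_ARM_TO_X86 = {
    "ldr": "mov",
    "str": "mov",
    "add": "add",
    "sub": "sub",
    "bl": "call",
}


def translate_to_high_resource(instructions: List[str]) -> List[str]:
    """Translate each opcode via the toy table.

    Operands are copied verbatim, so the output is not valid x86;
    a real translator would rewrite operands as well.
    """
    translated = []
    for ins in instructions:
        opcode, _, operands = ins.partition(" ")
        new_opcode = TOY_ARM_TO_X86.get(opcode, "<unk>")
        translated.append(f"{new_opcode} {operands}".strip())
    return translated


def analyze_low_resource(
    instructions: List[str],
    high_resource_model: Callable[[List[str]], int],
) -> int:
    """Translate first, then reuse a model built for the high-resource ISA."""
    return high_resource_model(translate_to_high_resource(instructions))


def toy_x86_model(instructions: List[str]) -> int:
    """Hypothetical stand-in for an x86-trained model: counts call instructions."""
    return sum(1 for ins in instructions if ins.startswith("call"))
```

The point of the structure is that `analyze_low_resource` never needs a model trained on the low-resource ISA: only the translation step is ISA-specific, and the downstream model is reused unchanged.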
Anthology ID: 2023.findings-emnlp.971
Volume: Findings of the Association for Computational Linguistics: EMNLP 2023
Month: December
Year: 2023
Address: Singapore
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 14581–14592
URL: https://aclanthology.org/2023.findings-emnlp.971
DOI: 10.18653/v1/2023.findings-emnlp.971
Cite (ACL): Iftakhar Ahmad and Lannan Luo. 2023. Unsupervised Binary Code Translation with Application to Code Clone Detection and Vulnerability Discovery. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14581–14592, Singapore. Association for Computational Linguistics.
Cite (Informal): Unsupervised Binary Code Translation with Application to Code Clone Detection and Vulnerability Discovery (Ahmad & Luo, Findings 2023)
PDF: https://aclanthology.org/2023.findings-emnlp.971.pdf