Learning Cross-Architecture Instruction Embeddings for Binary Code Analysis in Low-Resource Architectures

Junzhe Wang, Qiang Zeng, Lannan Luo


Abstract
Binary code analysis is indispensable for a variety of software security tasks. Applying deep learning to binary code analysis has drawn great attention because of its notable performance. Today, source code is frequently compiled for various Instruction Set Architectures (ISAs). It is thus critical to expand binary analysis capabilities to multiple ISAs. Given a binary analysis task, the scale of available data varies across ISAs. As a result, the rich datasets (e.g., malware) available for certain ISAs, such as x86, lead to a disproportionate focus on these ISAs and neglect of others, such as PowerPC, which suffer from the “data scarcity” problem. To address the problem, we propose to learn cross-architecture instruction embeddings (CAIE), where semantically similar instructions, regardless of their ISAs, have close embeddings in a shared space. Consequently, we can transfer a model trained on a data-rich ISA to another ISA with less available data. We consider four ISAs (x86, ARM, MIPS, and PowerPC) and conduct both intrinsic and extrinsic evaluations (including malware detection and function similarity comparison). The results demonstrate the effectiveness of our approach in generating high-quality CAIE with good transferability.
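
To make the idea of a shared embedding space concrete, the following is a minimal sketch, not the authors' method or code. It assumes a small lookup table of instruction vectors and uses cosine similarity to find, for an instruction in one ISA, the closest instruction in another. The vectors, the instruction strings, and the names `EMBEDDINGS`, `cosine`, and `nearest_in_isa` are all hypothetical and chosen by hand for illustration; a real system would learn the vectors from data.

```python
import math

# Hypothetical toy vectors, hand-picked for illustration only.
# They are NOT taken from the paper; learned embeddings would be
# higher-dimensional and produced by a trained model.
EMBEDDINGS = {
    ("x86", "add eax, ebx"): [0.9, 0.1, 0.0],
    ("x86", "jmp label"): [0.0, 0.1, 0.9],
    ("arm", "add r0, r0, r1"): [0.8, 0.2, 0.1],
    ("arm", "b label"): [0.1, 0.0, 0.95],
}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_in_isa(query_key, target_isa, table=EMBEDDINGS):
    """Return the key of the target-ISA instruction closest to the query."""
    query = table[query_key]
    candidates = [
        (key, cosine(query, vec))
        for key, vec in table.items()
        if key[0] == target_isa
    ]
    # Pick the candidate with the highest cosine similarity.
    return max(candidates, key=lambda item: item[1])[0]


if __name__ == "__main__":
    print(nearest_in_isa(("x86", "add eax, ebx"), "arm"))
    print(nearest_in_isa(("arm", "b label"), "x86"))
```

If instructions with similar semantics do land close together in such a space, a classifier trained on vectors from a data-rich ISA could in principle be applied to vectors from a low-resource ISA without retraining, which is the transfer setting the abstract describes.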
Anthology ID:
2024.findings-naacl.84
Volume:
Findings of the Association for Computational Linguistics: NAACL 2024
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Kevin Duh, Helena Gomez, Steven Bethard
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
1320–1332
URL:
https://aclanthology.org/2024.findings-naacl.84
DOI:
10.18653/v1/2024.findings-naacl.84
Cite (ACL):
Junzhe Wang, Qiang Zeng, and Lannan Luo. 2024. Learning Cross-Architecture Instruction Embeddings for Binary Code Analysis in Low-Resource Architectures. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 1320–1332, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
Learning Cross-Architecture Instruction Embeddings for Binary Code Analysis in Low-Resource Architectures (Wang et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-naacl.84.pdf