CP-BCS: Binary Code Summarization Guided by Control Flow Graph and Pseudo Code

Tong Ye, Lingfei Wu, Tengfei Ma, Xuhong Zhang, Yangkai Du, Peiyu Liu, Shouling Ji, Wenhai Wang


Abstract
Automatically generating function summaries for binaries is an extremely valuable but challenging task, since it involves translating the execution behavior and semantics of low-level assembly code into human-readable natural language. However, most current work on understanding assembly code is oriented towards generating function names, which contain numerous abbreviations and therefore remain confusing. To bridge this gap, we focus on generating complete summaries for binary functions, especially for stripped binaries (which, in practice, carry no symbol table or debug information). To fully exploit the semantics of assembly code, we present CP-BCS, a binary code summarization framework guided by control flow graphs and pseudo code. CP-BCS utilizes a bidirectional instruction-level control flow graph and pseudo code that incorporates expert knowledge to learn the comprehensive execution behavior and logic semantics of binary functions. We evaluate CP-BCS on three binary optimization levels (O1, O2, and O3) for three computer architectures (X86, X64, and ARM). The evaluation results demonstrate that CP-BCS is superior and significantly improves the efficiency of reverse engineering.
Anthology ID:
2023.emnlp-main.911
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
14740–14752
URL:
https://aclanthology.org/2023.emnlp-main.911
DOI:
10.18653/v1/2023.emnlp-main.911
Cite (ACL):
Tong Ye, Lingfei Wu, Tengfei Ma, Xuhong Zhang, Yangkai Du, Peiyu Liu, Shouling Ji, and Wenhai Wang. 2023. CP-BCS: Binary Code Summarization Guided by Control Flow Graph and Pseudo Code. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14740–14752, Singapore. Association for Computational Linguistics.
Cite (Informal):
CP-BCS: Binary Code Summarization Guided by Control Flow Graph and Pseudo Code (Ye et al., EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-main.911.pdf
Video:
https://aclanthology.org/2023.emnlp-main.911.mp4