Xiangyu Zhang


2024

pdf bib
Threat Behavior Textual Search by Attention Graph Isomorphism
Chanwoo Bae | Guanhong Tao | Zhuo Zhang | Xiangyu Zhang
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Cyber attacks cause over $1 trillion loss every year. An important task for cyber security analysts is attack forensics. It entails understanding malware behaviors and attack origins. However, existing automated or manual malware analysis can only disclose a subset of behaviors due to inherent difficulties (e.g., malware cloaking and obfuscation). As such, analysts often resort to text search techniques to identify existing malware reports based on the symptoms they observe, exploiting the fact that malware samples share a lot of similarity, especially those from the same origin. In this paper, we propose a novel malware behavior search technique that is based on graph isomorphism at the attention layers of Transformer models. We also compose a large dataset collected from various agencies to facilitate such research.Our technique outperforms state-of-the-art methods, such as those based on sentence embeddings and keywords by 6-14%. In the case study of 10 real-world malwares, our technique can correctly attribute 8 of them to their ground truth origins while using Google only works for 3 cases.

2023

pdf bib
A Quantitative Approach to Understand Self-Supervised Models as Cross-lingual Feature Extracters
Shuyue Stella Li | Beining Xu | Xiangyu Zhang | Hexin Liu | Wenhan Chao | Paola Garcia
Proceedings of the 6th International Conference on Natural Language and Speech Processing (ICNLSP 2023)

pdf bib
Backdooring Neural Code Search
Weisong Sun | Yuchen Chen | Guanhong Tao | Chunrong Fang | Xiangyu Zhang | Quanjun Zhang | Bin Luo
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Reusing off-the-shelf code snippets from online repositories is a common practice, which significantly enhances the productivity of software developers. To find desired code snippets, developers resort to code search engines through natural language queries. Neural code search models are hence behind many such engines. These models are based on deep learning and gain substantial attention due to their impressive performance. However, the security aspect of these models is rarely studied. Particularly, an adversary can inject a backdoor in neural code search models, which return buggy or even vulnerable code with security/privacy issues. This may impact the downstream software (e.g., stock trading systems and autonomous driving) and cause financial loss and/or life-threatening incidents. In this paper, we demonstrate such attacks are feasible and can be quite stealthy. By simply modifying one variable/function name, the attacker can make buggy/vulnerable code rank in the top 11%. Our attack BADCODE features a special trigger generation and injection procedure, making the attack more effective and stealthy. The evaluation is conducted on two neural code search models and the results show our attack outperforms baselines by 60%. Our user study demonstrates that our attack is more stealthy than the baseline by two times based on the F1 score.

pdf bib
Syntax-Aware Retrieval Augmented Code Generation
Xiangyu Zhang | Yu Zhou | Guang Yang | Taolue Chen
Findings of the Association for Computational Linguistics: EMNLP 2023

Neural code generation models are nowadays widely adopted to generate code from natural language descriptions automatically. Recently, pre-trained neural models equipped with token-level retrieval capabilities have exhibited great potentials in neural machine translation. However, applying them directly to code generation experience challenges: the use of the retrieval-based mechanism inevitably introduces extraneous noise to the generation process, resulting in even syntactically incorrect code. Computationally, such models necessitate frequent searches of the cached datastore, which turns out to be time-consuming. To address these issues, we propose kNN-TRANX, a token-level retrieval augmented code generation method. kNN-TRANX allows for searches in smaller datastores tailored for the code generation task. It leverages syntax constraints for the retrieval of datastores, which reduces the impact of retrieve noise. We evaluate kNN-TRANX on two public datasets and the experimental results confirm the effectiveness of our approach.