VulLibGen: Generating Names of Vulnerability-Affected Packages via a Large Language Model

Tianyu Chen, Lin Li, ZhuLiuchuan ZhuLiuchuan, Zongyang Li, Xueqing Liu, Guangtai Liang, Qianxiang Wang, Tao Xie


Abstract
Security practitioners maintain vulnerability reports (e.g., GitHub Advisory) to help developers mitigate security risks. An important task for these databases is automatically extracting structured information mentioned in the report, e.g., the affected software packages, to accelerate the defense of the vulnerability ecosystem.However, it is challenging for existing work on affected package identification to achieve high precision. One reason is that all existing work focuses on relatively smaller models, thus they cannot harness the knowledge and semantic capabilities of large language models.To address this limitation, we propose VulLibGen, the first method to use LLM for affected package identification. In contrast to existing work, VulLibGen proposes the novel idea to directly generate the affected package. To improve the precision, VulLibGen employs supervised fine-tuning (SFT), retrieval augmented generation (RAG) and a local search algorithm. The local search algorithm is a novel post-processing algorithm we introduce for reducing the hallucination of the generated packages. Our evaluation results show that VulLibGen has an average precision of 0.806 for identifying vulnerable packages in the four most popular ecosystems in GitHub Advisory (Java, JS, Python, Go) while the best average precision in previous work is 0.721. Additionally, VulLibGen has high value to security practice: we submitted 60 <vulnerability, affected package> pairs to GitHub Advisory (covers four ecosystems) and 34 of them have been accepted and merged.
Anthology ID:
2024.acl-long.527
Volume:
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
ACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
9767–9780
Language:
URL:
https://aclanthology.org/2024.acl-long.527
DOI:
Bibkey:
Cite (ACL):
Tianyu Chen, Lin Li, ZhuLiuchuan ZhuLiuchuan, Zongyang Li, Xueqing Liu, Guangtai Liang, Qianxiang Wang, and Tao Xie. 2024. VulLibGen: Generating Names of Vulnerability-Affected Packages via a Large Language Model. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9767–9780, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
VulLibGen: Generating Names of Vulnerability-Affected Packages via a Large Language Model (Chen et al., ACL 2024)
Copy Citation:
PDF:
https://aclanthology.org/2024.acl-long.527.pdf