Code Vulnerability Detection via Nearest Neighbor Mechanism



1 Introduction
Code vulnerabilities generally denote security bugs or weaknesses in software. The presence of vulnerabilities in software is an inevitable problem because of design flaws in programming languages and faults introduced by programmers. Traditional detection methods, such as static methods (Kim et al., 2017; Roy et al., 2009) or dynamic methods (Serebryany, 2015; Cadar et al., 2008; Sen et al., 2005), generally require a lot of human labor to summarize vulnerability rules or massive computational resources to trigger potential vulnerabilities by executing programs. To improve detection efficiency, multiple efforts (Li et al., 2018; Zhou et al., 2019; Feng et al., 2020) have attempted to introduce deep learning and NLP techniques for automated vulnerability detection. Deep learning-based detectors can extract semantic features from the source code and automatically learn potential vulnerability patterns. Li et al. (2018) proposed a vulnerability detection framework named VulDeePecker, which used a binary classifier based on LSTM to detect whether a piece of code is vulnerable or not. To learn comprehensive program semantics that characterize vulnerabilities of high diversity and complexity in source code, Zhou et al. (2019) utilized graph neural networks to extract code semantic representations. Inspired by the great success of pre-training models on numerous natural language processing (NLP) tasks (Liu et al., 2019; Raffel et al., 2020), code pre-training models for programming languages, such as CodeBERT (Feng et al., 2020) and GraphCodeBERT (Guo et al., 2020), were proposed and applied to a variety of code-related tasks, such as code search, code completion, and code vulnerability detection. Although the above methods have achieved remarkable improvements, they still easily generate a large number of false predictions because of relatively large classification uncertainty.
In vulnerability detection tasks, vulnerabilities are sometimes subtle flaws.
Fig. 1 illustrates a comparison between vulnerable and non-vulnerable code snippets. We can see that the difference between vulnerable and non-vulnerable samples is very small. This means that some vulnerable and non-vulnerable samples lie close to the classifier boundary, where classification uncertainty is large and false positive/negative predictions easily arise (as depicted in Fig. 2). To alleviate this problem and improve classification performance, we argue that fully exploiting the label information of samples of the same class can help the model classify samples more accurately. That is, when predicting the label of a sample, the model can obtain useful guidance from its neighbor samples. Motivated by this observation, we present a novel code vulnerability detection method that introduces the k-nearest neighbor (kNN) mechanism, which fully utilizes the label information of neighbor samples to perform more accurate vulnerability classification. Specifically, it first retrieves multiple neighbor samples according to the representation of a sample and then integrates their labels into the final prediction of the model. Furthermore, in order to retrieve more similar samples and make the label information of retrieved neighbor samples more effective, we introduce supervised contrastive learning (SCL) to learn discriminative code representations. SCL pulls samples of the same class closer to each other and pushes away samples of different classes. In short, our contributions are summarized as follows:
• We propose a novel vulnerability detection method by introducing the k-nearest neighbor mechanism, which fully utilizes the label information of neighbor samples to assist the model in making predictions.
• We introduce supervised contrastive learning to learn more discriminative code representations, so as to retrieve more similar neighbor samples to improve detection performance.
• We conduct extensive experiments to demonstrate the effectiveness of the proposed method. Experimental results show that our method achieves universal and significant performance improvements compared to baseline methods.
2 Related Work

2.1 Code Vulnerability Detection
To alleviate the reliance on the intense labor of security experts and improve detection efficiency, recent research (Lin et al., 2017; Grieco et al., 2016; Li et al., 2018; Zou et al., 2019) has attempted to introduce deep learning techniques for vulnerability detection. Multiple works pinpoint vulnerabilities at different levels of granularity, e.g., program (Grieco et al., 2016), package (Neuhaus and Zimmermann, 2009), file (Shin et al., 2010), and function (Yamaguchi et al., 2014; Zhou et al., 2019), by combining deep learning with static bug detection.

2.2 K-Nearest Neighbor
Some research works have applied kNN to NLP in recent years. Khandelwal et al. (2019) proposed kNN-LM, which augments a pre-trained language model by interpolating its output distribution with a nearest-neighbor distribution retrieved from a datastore of cached training representations.

3 Methodology
In this section, we elaborate on the details of the proposed method. An overview of the proposed method is illustrated in Fig. 3.

3.1 Problem Formulation
Let a sample be denoted as $(x_i, y_i)$, where $x_i$ represents a code sample and $y_i \in \{0, 1\}$ is its label (1: vulnerable; 0: non-vulnerable). The goal of vulnerability detection is to learn a mapping $f: x_i \rightarrow y_i$ that predicts whether a code sample is vulnerable or not. The prediction function $f$ can be learned by minimizing the loss function below:

$$\min_{f} \sum_{i=1}^{n} \mathcal{L}\left(f(x_i), y_i\right),$$

where $\mathcal{L}(\cdot)$ is the cross-entropy loss function.
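To ground the formulation, here is a minimal PyTorch sketch of such a classifier. The encoder, the hidden dimension, and all identifiers are illustrative assumptions rather than the paper's actual implementation; any pre-trained code encoder such as CodeBERT could serve as the backbone.

```python
import torch.nn as nn

# A minimal sketch of the detection objective, assuming `encoder` is any
# code encoder that maps a tokenized sample to a hidden vector; the names
# and the hidden dimension are illustrative, not from the paper.
class VulnerabilityDetector(nn.Module):
    def __init__(self, encoder, hidden_dim=768):
        super().__init__()
        self.encoder = encoder                      # produces h_i
        self.classifier = nn.Linear(hidden_dim, 2)  # vulnerable vs. non-vulnerable

    def forward(self, x):
        h = self.encoder(x)            # latent representation h_i
        return self.classifier(h), h   # logits for y_i, plus h_i for reuse

criterion = nn.CrossEntropyLoss()      # the loss L(.) minimized over f
```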

3.2 K-Nearest Neighbor for Vulnerability Detection
As mentioned before, to further improve detection performance, we propose a novel detection method based on the k-nearest neighbor mechanism, which utilizes the label information of neighbor samples to assist the model in making predictions. The proposed method consists of two components: the construction of a memory bank for storing sample representations, and prediction based on the k-nearest neighbor mechanism.
The construction of the memory bank. Given a code sample $(x_i, y_i)$, let $h_i$ denote the latent feature representation output by the vulnerability detection model. The memory bank is a data structure used to store the latent representation and the corresponding label of each training sample. Formally, the memory bank is built as $MB = \{(h_1, y_1), \ldots, (h_n, y_n)\}$.
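As an illustration, a memory bank of this form can be built with a single pass over the training data. The sketch below assumes the detector returns its latent representation alongside the logits, as in the earlier sketch; the function name and tensor layout are our assumptions.

```python
import torch

@torch.no_grad()
def build_memory_bank(model, train_loader):
    # Store (h_i, y_i) for every training sample; `model` is assumed to
    # return (logits, h) as in the sketch above.
    reps, labels = [], []
    for x, y in train_loader:
        _, h = model(x)
        reps.append(h)
        labels.append(y)
    # MB = {(h_1, y_1), ..., (h_n, y_n)}, stored as two aligned tensors
    return torch.cat(reps, dim=0), torch.cat(labels, dim=0)
```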
Prediction based on the k-nearest neighbor mechanism. Given a code sample $x_i$, its latent representation is obtained from the model:

$$h_i = f(x_i).$$

The representation $h_i$ is then used as the key to query the memory bank for the $k$ nearest neighbors by computing the distance of each pair $(h_i, h_j)$, where $h_j$ denotes the representation of a sample stored in the memory bank $MB$. The prediction based on the k-nearest neighbor mechanism is formulated as

$$y_{kNN} = \sum_{j \in \mathcal{N}_k(i)} \alpha_j y_j,$$

where $\mathcal{N}_k(i)$ denotes the set of retrieved $k$ nearest neighbors and $\alpha_j$ denotes the weight of the $j$-th neighbor sample. In our work, $\alpha_j$ is calculated as

$$\alpha_j = \frac{\exp\left(-D(h_i, h_j)\right)}{\sum_{m \in \mathcal{N}_k(i)} \exp\left(-D(h_i, h_m)\right)},$$

where $D(\cdot)$ denotes the Euclidean distance between the pair $(h_i, h_j)$.
The final prediction is formulated as follows:

$$y = \gamma \, y_{Model} + \delta \, y_{kNN},$$

where $y_{Model}$ denotes the model prediction, and $\gamma$ and $\delta$ represent the weights of the model prediction and the kNN prediction, respectively.
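A possible implementation of the retrieval, weighting, and interpolation steps is sketched below, assuming the softmax-over-negative-distances weighting reconstructed above; the value of $k$ and the function names are illustrative.

```python
import torch
import torch.nn.functional as F

def knn_predict(h_i, bank_reps, bank_labels, k=8):
    # h_i: (d,) query representation; bank_reps: (n, d); bank_labels: (n,).
    # k=8 is an illustrative choice, not the paper's tuned k*.
    dists = torch.cdist(h_i.unsqueeze(0), bank_reps).squeeze(0)  # D(h_i, h_j)
    knn_d, knn_idx = dists.topk(k, largest=False)    # k nearest neighbors
    alpha = F.softmax(-knn_d, dim=0)                 # closer neighbors weigh more
    y_nn = F.one_hot(bank_labels[knn_idx], num_classes=2).float()
    return (alpha.unsqueeze(1) * y_nn).sum(dim=0)    # y_kNN over {0, 1}

def final_predict(p_model, p_knn, gamma=0.7, delta=0.3):
    # y = gamma * y_Model + delta * y_kNN, with the weights reported in
    # the Experimental Setup section.
    return gamma * p_model + delta * p_knn
```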

3.3 Supervised Contrastive Learning
As mentioned before, some vulnerable code is textually similar to non-vulnerable code, which easily causes the labels of the retrieved samples to differ from the label of the test sample.
That is, for a non-vulnerable code sample, quite a few of its retrieved neighbor samples may be vulnerable, which not only fails to provide effective help but may even mislead the final prediction. To alleviate this problem, supervised contrastive learning (SCL) (Khosla et al., 2020) is introduced to learn more discriminative code representations. Supervised contrastive learning aims to pull samples belonging to the same class together in the representation space and push apart samples from different classes. In this way, vulnerable samples become closer to each other in the representation space, so that the label information of retrieved samples can provide more effective help for the model prediction (Khosla et al., 2020). Within a minibatch $B = \{x_1, \ldots, x_M\}$ consisting of $M$ samples, let $P(i)$ be the index set of all other samples whose class is the same as that of sample $i$, and let $A(i)$ be the index set of all samples except sample $i$. The supervised contrastive loss is calculated as

$$L_{sup} = \sum_{i=1}^{M} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)},$$

where $\tau$ is the contrastive learning temperature and $z_i = f(x_i)$ denotes the latent representation.
The overall loss function can be formulated as $L = L_{CE} + \lambda L_{sup}$, where $L_{CE}$ denotes the cross-entropy loss function.
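For reference, a self-contained PyTorch sketch of the supervised contrastive loss over a minibatch is given below, following the formulation above; the temperature default and tensor shapes are assumptions, since the paper does not show its implementation.

```python
import torch
import torch.nn.functional as F

def supcon_loss(z, y, tau=0.07):
    # z: (M, d) latent representations z_i = f(x_i); y: (M,) class labels.
    # tau=0.07 is a common default for SCL, assumed here.
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / tau                            # z_i . z_a / tau for all pairs
    self_mask = torch.eye(len(y), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))  # A(i) excludes i itself
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    pos_mask = (y.unsqueeze(0) == y.unsqueeze(1)) & ~self_mask   # P(i)
    log_prob = log_prob.masked_fill(~pos_mask, 0.0)  # keep positive pairs only
    # -1/|P(i)| * sum over positives, then average over the batch
    return (-log_prob.sum(1) / pos_mask.sum(1).clamp(min=1)).mean()
```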

4 Experimental Setup
Our experiments are implemented with the open-source deep learning framework PyTorch (Paszke et al., 2019). All models are trained for 50 epochs with a batch size of 64. We optimize the model weights using SGD with an initial learning rate of 0.01. As for the hyper-parameters of our proposed method, $\gamma$ is set to 0.7, $\delta$ is set to 0.3, and $\lambda$ is set to 0.2. We conduct experiments on the published code vulnerability dataset QEMU+FFmpeg (Zhou et al., 2019), which collects real-world vulnerabilities from GitHub repositories; the labelling is based on commit messages and domain experts. Statistics on the QEMU+FFmpeg dataset are summarized in Table 1. Following previous work (Zhou et al., 2019), we adopt the F1-score as our evaluation metric.
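For convenience, the hyper-parameters listed in this section can be gathered in one place; the config dict below is purely illustrative and not part of the paper's code.

```python
config = {
    "epochs": 50,           # training epochs for all models
    "batch_size": 64,
    "optimizer": "SGD",
    "learning_rate": 0.01,  # initial learning rate
    "gamma": 0.7,           # weight of the model prediction
    "delta": 0.3,           # weight of the kNN prediction
    "lambda": 0.2,          # weight of the SCL loss term
    "metric": "F1-score",   # following Zhou et al. (2019)
}
```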

4.1 Baseline
In our experiments, the adopted baselines are listed as follows: Devign (Zhou et al., 2019) uses gated graph neural networks to extract the semantic information of the source code and utilizes a 1-D CNN-based pooling module ("Conv") to make the final prediction.
CodeBERT (Feng et al., 2020) is a powerful pre-trained model for programming languages, trained on six programming languages.
GraphCodeBERT (Guo et al., 2020) is a newer pre-trained programming language model that extends CodeBERT by incorporating the inherent structure of code, i.e., data flow, into the training objective.

5 Results
The experimental results are reported in Table 2. We find that our proposed method achieves universal and clear performance improvements over the baselines. CodeBERT with our proposed prediction method achieves the best performance on FFmpeg, obtaining an F1-score gain of 2.69% over the original CodeBERT model. GraphCodeBERT with our proposed method achieves an F1-score of 72.47% on QEMU, surpassing the original GraphCodeBERT model by 2.46%.

6 Ablation Study
In our proposed method, we introduce the k-nearest neighbor mechanism to fully utilize the label information of neighbor samples, and we use supervised contrastive learning to learn more discriminative code representations. In this section, we conduct experiments to verify the effectiveness of each component. The results of the ablation studies are reported in Table 3. We find that both the k-nearest neighbor mechanism and supervised contrastive learning yield performance improvements individually. The k-nearest neighbor mechanism brings an average gain of 1.68%. In addition, compared to the k-nearest neighbor mechanism alone, combining it with supervised contrastive learning achieves better classification performance, demonstrating that supervised contrastive learning learns more discriminative code representations that better assist the kNN prediction.

7 Conclusion
In this paper, we propose a novel code vulnerability detection method based on the k-nearest neighbor mechanism, which fully utilizes the label information of neighbor samples to assist the model in predicting code vulnerabilities. Moreover, to ensure that the label information of retrieved neighbor samples provides more effective help for model predictions, we introduce supervised contrastive learning to bring vulnerable samples closer to each other in the representation space. Finally, we conduct extensive experiments to verify the effectiveness of our proposed method.

Limitations
Here we summarize the limitations for further discussion and investigation by the community. Our proposed k-nearest neighbor mechanism requires manually setting the optimal k* to achieve the best classification performance, which is inefficient. A better solution would be an adaptive k-nearest neighbor mechanism that seeks optimal neighbor samples according to the distribution of the training samples.

Figure 1: Illustrations of the vulnerable and non-vulnerable code snippets.

Figure 3: The framework of the proposed method.

Table 1: Statistics on the QEMU+FFmpeg dataset.

Table 2: The final performance on the QEMU+FFmpeg dataset.

Table 3: F1-scores of the ablation studies. kNN indicates the proposed k-nearest neighbor mechanism and SCL indicates supervised contrastive learning.