DeeDP: vulnerability detection and patching based on deep learning

We present the DeeDP system for automatic vulnerability detection and patch generation. DeeDP detects vulnerabilities in C/C++ source code and generates a patch to fix each detected issue. The system uses deep learning methods to organize rules for deciding whether a code fragment is vulnerable. Patch generation can be performed with neural network and rule-based approaches. The system uses abstract syntax tree (AST) representations of the source code fragments. We have tested the effectiveness of our approach on different open source projects. For example, Microsoft/Terminal (https://github.com/microsoft/Terminal) was analyzed with DeeDP: our system detected a security issue and generated a patch that was successfully approved and applied by Microsoft maintainers.


Introduction
Many cyber attacks are rooted in software vulnerabilities. Preventing the compromise of software products involves applying different techniques, e.g., the Microsoft Security Development Lifecycle (SDL) and deep software analysis at early stages of the development process. But involving a large number of security experts in software analysis is too expensive, so the most promising way is to automate each step, from code examination to error correction. This paper presents the DeeDP technology and system for detecting vulnerabilities in source code and providing a patch to fix the detected errors. Our technology is based on a deep learning approach [1] for extracting vulnerable code fragments represented as ASTs [2] and on automatic patch generation.
This work continues our research on automating the detection and fixing of vulnerabilities in software. The vulnerability detection method is described in detail in [3]. Here we consider the procedure for fixing errors: generating patches.
In the Related works section we review existing solutions in the field of patch generation and their disadvantages. Next, we describe the overall design and the approaches used to build the system, as well as the results of applying the technology.

Related works
Automatic patch generation allows vulnerabilities to be fixed without the time and money developers need to understand and correct the detected defects [4]. There are two main methods of patch generation: 1) based on the study of valid code (human patches) (Prophet, SPR, RSRepair, GenProg, AE); 2) based on the use of fixed patterns (Senx, PAR) [5].
The best-known approach, "generate-and-validate", starts with a collection of test inputs, at least one of which exposes a vulnerability in the software. The patch generation system modifies the program to generate a space of candidate patches, then searches this space for plausible patches, i.e., patches that produce the correct output for all test inputs [4]. Prophet, SPR, GenProg, RSRepair, and AE are based on this approach [6,7,8,9,10].
GenProg, AE, and RSRepair use various search algorithms (genetic programming, stochastic search, random search) in combination with transformations that remove, insert, or change existing program operators [11]. Prophet focuses on learning from existing valid human patches [12]. It uses a parameterized log-linear probabilistic model based on two kinds of features extracted from the abstract syntax tree (AST) of each patch: 1) how the patch changes the source program, and 2) how the values related to the patch are used in the original and in the patched program. Prophet ranks the candidate patches generated for a defect by their probability of correctness [5].
SPR [13] uses a set of transformation schemas to generate the patch set. It then uses a staged program repair process to validate the generated patches by checking them against the original test suite, in which at least one test exposes the vulnerability in the original program. Prophet works with the same search space as SPR but differs in that it uses a learned model of patch correctness and, according to the study [4], shows better results than SPR's hand-coded heuristics.
This approach has a significant drawback, confirmed by research [4]: the validity of patches is hard to evaluate correctly given the small amount of test input data. As a result, these systems generate incorrect patches that pass the initial test cases but remove program functionality or create new vulnerabilities.
In the second method, patches are generated by applying fixed patterns written by a person based on generalized rules for correcting common vulnerabilities. PAR and Senx work in this way [5], [14]. The disadvantage of this method is the need to review and generalize a large number of existing patches, as well as to write a large number of patterns and correctly handle all the variables influenced by a patch [15].

Task statement
Based on the preceding discussion of fixing issues in software, let us state our approach. First we define several entities: P is a product, a software project in C/C++ with available source code; X is a fragment of vulnerable source code, which includes the target function and its execution context (the target function is a function that, when used in the wrong way, leads to a vulnerability in the software); X̃ is a patch, a code fragment corresponding to X but without the vulnerability (the target function is used in the right way). The general feature set and functionality of product P must remain the same as before applying the changes X̃.
We want to build the function X̃ = F(X, A), where F(·) is a transformation of the source code X (C/C++), represented as an AST, into the corresponding source code X̃ without the weakness, and A is a set of additional parameters.
We consider two approaches:
• creating the patch X̃ with a deterministic (rule-based) approach, where F(·) is a rule (pattern) describing how to transform X according to the CWE type [16];
• creating the patch X̃ with a neural network. In this case training is supervised, based on samples collected from existing open-source repositories: code with detected weaknesses together with the subsequent fix from a contributor. The transformation function needs no additional parameters, X̃ = F(X), so the neural network should learn from the dataset how such issues were fixed.
The first approach is more reliable, because it assumes that the developer who creates a rule is highly experienced, but it cannot be scaled. The second approach can be fully automated and extended to different cases: fixing known weaknesses; automatically converting code to an approved project style (exception handling, use of specific functions, etc.); applying a specific code style. The trained neural network F(·) applies the dependencies extracted from the dataset. The second approach is much more promising, but it needs enough examples for the training procedure. In the Results section we provide examples of patches generated by a rule and by a neural network.
This paper discusses technology for automating source code analysis to find vulnerabilities and fix the detected errors. There are two main steps: vulnerability detection and patch generation (figure 1).
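As a toy illustration of the rule-based case X̃ = F(X), a pattern F for the «fclose» example discussed in this paper can be sketched as a text rewrite. This is only a sketch: the actual DeeDP rules operate on clang ASTs, not on raw text, and the regex below ignores already-guarded calls.

```python
import re

def fix_unchecked_fclose(x: str) -> str:
    """Toy rule F: wrap an unguarded fclose(arg) call in a NULL check.
    Real DeeDP rules transform the clang AST, not text."""
    return re.sub(r"fclose\((\w+)\);", r"if (\1) fclose(\1);", x)

vulnerable = "fclose(arg_1);"
patched = fix_unchecked_fclose(vulnerable)
print(patched)  # if (arg_1) fclose(arg_1);
```

The rule is deterministic: the same vulnerable fragment always yields the same patch, which is why this approach is reliable but must be written by hand for every CWE type.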

General architecture of proposed method
The DeeDP subsystem for vulnerability detection is based on a deep learning approach and performs the following steps: source code preprocessing, AST creation, code gadget extraction, code gadget vectorization (word2vec), application of a trained BLSTM neural network, and preparation of a human-readable report. We started from the concepts for source code analysis described in [14] and improved the code representation steps.
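The tokenization and vectorization steps can be illustrated with a minimal sketch. The tokenizer, vocabulary, embedding size, and gadget length below are hypothetical stand-ins (randomly initialized embeddings instead of a trained word2vec model); the real pipeline feeds the resulting matrix to a BLSTM.

```python
import re
import numpy as np

EMBED_DIM = 8   # illustrative; the real word2vec dimensionality is a training choice
MAX_LEN = 16    # gadgets are padded/truncated to a fixed token length

def tokenize(gadget: str) -> list:
    # Crude C token split: identifiers/numbers and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", gadget)

def vectorize(gadget: str, vocab: dict, emb: np.ndarray) -> np.ndarray:
    """Map each token to its embedding row; index 0 is the out-of-vocabulary/pad row."""
    ids = [vocab.get(tok, 0) for tok in tokenize(gadget)][:MAX_LEN]
    ids += [0] * (MAX_LEN - len(ids))  # right-pad to the fixed length
    return emb[ids]                    # shape (MAX_LEN, EMBED_DIM)

rng = np.random.default_rng(0)
tokens = ["<pad>", "fclose", "(", ")", ";", "arg_1", "if"]
vocab = {t: i for i, t in enumerate(tokens)}
emb = rng.normal(size=(len(tokens), EMBED_DIM))

x = vectorize("fclose(arg_1);", vocab, emb)
print(x.shape)  # (16, 8)
```

The fixed-size matrix `x` is the kind of input a BLSTM layer consumes, one row per token.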
The DeeDP subsystem for patch generation transforms a code fragment (code gadget), identified as vulnerable because of a specific function, into a patch with an improved code fragment. The transformation can be done with a rule-based approach (a specific AST transformation for the detected issue) or with code generation by a neural network (LSTM).

Patch generation based on rules
The procedure for rule-based patch generation consists of four steps: collecting data from a static analyzer, source code preprocessing, patch generation, and final verification. The current investigation targeted the following list of weaknesses: division by zero, use of a closed resource, double free, out-of-bounds access, and string buffer overflow. The rule-based approach applies a specific transformation, corresponding to the detected issue, to the source code represented as an AST.
1) Collecting data from a static analyzer: the source code is checked with a static analyzer to detect vulnerabilities/weaknesses. At this step we used VulDetect (our own deep-learning-based solution for vulnerability detection [3]) and SVACE [17]. Based on the analysis results, we create a file with meta-information about the issue location.
2) Source code preprocessing: an AST representation of the code is built (with clang) and a fragment of the AST (code gadget) is extracted according to the issue location. A code gadget is a fragment of the full AST representation of the source code that includes the detected issue and the context of the function, based on data-flow analysis. Useful information is extracted for each token of the AST related to the detected issue; the AST representation is then transformed for the next step (replacing user-defined names, functions, etc.).
3) Patch generation: for each type of detected weakness there are specific rules (patterns) that determine how the AST must be transformed to remove the issue. The AST is transformed according to the pattern, the resulting AST is converted into a patch, and the patch is applied to the product.
4) Final verification: the resulting code is checked again with the static analyzer to determine whether the weakness has been fixed and whether new ones have been introduced.
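The four steps above can be sketched as a driver loop. The report format, the pattern table, and the text-level analyzer below are all toy stand-ins: the actual system consumes VulDetect/SVACE reports and applies clang AST transformations.

```python
import re

# Toy pattern table keyed by issue type; real DeeDP rules are AST transformations.
PATTERNS = {
    "unchecked-fclose": (r"fclose\((\w+)\);", r"if (\1) fclose(\1);"),
}

def analyze(source: str) -> list:
    """Stand-in static analyzer: returns (issue_type, line_no) meta-information."""
    issues = []
    for n, line in enumerate(source.splitlines()):
        if re.search(PATTERNS["unchecked-fclose"][0], line) and "if" not in line:
            issues.append(("unchecked-fclose", n))
    return issues

def generate_patch(source: str) -> str:
    lines = source.splitlines()
    for issue, n in analyze(source):            # step 1: collect issue locations
        pat, repl = PATTERNS[issue]             # step 3: pick the rule for the issue
        lines[n] = re.sub(pat, repl, lines[n])  # transform the flagged fragment
    patched = "\n".join(lines)
    assert not analyze(patched)                 # step 4: re-check the patched code
    return patched

code = 'FILE *f = fopen(p, "r");\nfclose(f);'
print(generate_patch(code))  # second line becomes the guarded fclose
```

Step 2 (AST extraction and normalization) is collapsed into plain line handling here; in DeeDP it is where clang and the code gadget extraction do the heavy lifting.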

Patch generation based on neural network
After building patches using the rule-based approach, we realized that creating and applying patterns is monotonous and requires intensive manual labor, so we decided to automate the collection and application of rule-based patterns. According to the Microsoft SDL procedure, there may exist examples of good and bad usage of some target calls; from these examples we can generate a dataset and fit a neural network. Another dataset can be generated from the project source code, for example, the most common usages of target calls, style patterns, etc.
First of all, the idea was to replace bad code with good code, or to add some code so that bad code becomes good code; that is, we need to replace one piece of text with another. On the one hand, we can create a bijection (one-to-one correspondence) that maps bad code to good code.
Eventually, we need an application that can understand the context of the code and transform it into good code, so we tried a method that translates sentences from one language to another, with some modifications. One standard method of translation is the seq2seq (encoder-decoder) model. This model can be split into two parts: the encoder and the decoder. Figure 2 illustrates the standard encoder-decoder architecture. The encoder takes raw input text, just like any other RNN architecture, and at the end outputs a neural representation (a 'thought' vector). The encoder's output becomes the input for the decoder, which transforms the neural representation into words.
All models vary in terms of their architecture. A natural choice for sequential data is the recurrent neural network (RNN); usually an RNN is used for both the encoder and the decoder [18]. RNN models differ in several aspects:
• directionality: unidirectional or bidirectional;
• depth: single- or multi-layer;
• type: Long Short-Term Memory (LSTM) or gated recurrent unit (GRU).
In this article we used a single unidirectional RNN with GRU as the recurrent unit.
Fig. 2. Standard encoder-decoder architecture.
Figure 3 shows an example of a model that translates the source code «fclose(arg_1);» into «if (arg_1) fclose(arg_1);». Here «<s>» marks the start of the decoding process, while «</s>» tells the decoder to stop.
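A single GRU step, the recurrent unit we used, can be written out in a few lines. This numpy sketch uses random weights only to show the gate arithmetic; any deep learning framework provides an equivalent, trained cell.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, W, U, b):
    """One GRU update. W, U, b stack the update (z), reset (r), and candidate gates."""
    z = sigmoid(W[0] @ x + U[0] @ h + b[0])              # update gate
    r = sigmoid(W[1] @ x + U[1] @ h + b[1])              # reset gate
    h_tilde = np.tanh(W[2] @ x + U[2] @ (r * h) + b[2])  # candidate state
    return (1.0 - z) * h + z * h_tilde                   # interpolated new hidden state

rng = np.random.default_rng(1)
d_in, d_h = 8, 16
W = rng.normal(size=(3, d_h, d_in))
U = rng.normal(size=(3, d_h, d_h))
b = np.zeros((3, d_h))

# Encoding a token sequence = folding gru_step over the token embeddings.
h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):  # five toy token embeddings
    h = gru_step(x, h, W, U, b)
print(h.shape)  # (16,)
```

The final `h` plays the role of the 'thought' vector handed from the encoder to the decoder; note that the tanh candidate keeps every hidden component in [-1, 1].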
To create a higher-quality model we need to add an attention mechanism [19]. The main idea of the attention mechanism is to establish direct short-cut connections between the target and the source by paying 'attention' to relevant source content as we translate. A nice byproduct of the attention mechanism is an easy-to-visualize alignment matrix between the source and target sentences.
In the simple seq2seq model we pass the last source state from the encoder to the decoder when starting the decoding process. This works well for short and medium-length sentences, but for long sentences the single fixed-size hidden state becomes an information bottleneck. Instead of discarding all of the hidden states computed in the source RNN, the attention mechanism allows the decoder to peek at them (treating them as a dynamic memory of the source information). By doing so, the attention mechanism improves the translation of longer sentences. Nowadays, attention mechanisms are the de facto standard and have been successfully applied to many other tasks (including image caption generation, speech recognition, and text summarization).
The attention computation happens at every decoder time step and consists of the following stages: 1) the current target hidden state is compared with all source states to derive attention weights; 2) based on the attention weights, a context vector is computed as the weighted average of the source states; 3) the context vector is combined with the current target hidden state to yield the final attention vector; 4) the attention vector is fed as input to the next time step (input feeding).
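These four stages map directly onto a few lines of numpy. This is a sketch of dot-product (Luong-style) attention with random states; `W_a` is a hypothetical projection producing the final attention vector.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(h_t, H_s, W_a):
    """One decoder time step of dot-product attention.
    h_t: current target hidden state (d,); H_s: all source states (T, d)."""
    scores = H_s @ h_t         # 1) compare target state with every source state
    weights = softmax(scores)  #    -> attention weights over source positions
    context = weights @ H_s    # 2) weighted average of the source states
    attn = np.tanh(W_a @ np.concatenate([context, h_t]))  # 3) final attention vector
    return attn, weights       # 4) attn is fed as input to the next time step

rng = np.random.default_rng(2)
d, T = 16, 5
H_s = rng.normal(size=(T, d))   # source RNN states kept as dynamic memory
h_t = rng.normal(size=d)
W_a = rng.normal(size=(d, 2 * d))

attn, weights = attention_step(h_t, H_s, W_a)
print(np.isclose(weights.sum(), 1.0))  # True: the weights form a distribution
```

The `weights` vector for each decoder step is exactly one row of the alignment matrix mentioned above.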

Results

Weaknesses detection based on deep learning
The results of the detection procedure were described in detail in the previous paper [3], but we have updated our detector by retraining it on an expanded dataset.
Training of the BLSTM neural network for the vulnerability detection module was performed on a dataset with more than 15000 code gadgets (code samples containing a buffer overflow vulnerability, as well as samples with vulnerabilities related to incorrect resource management). Training samples were created from source code taken from the National Vulnerability Database (NVD) and from the NIST Software Assurance Reference Dataset (SARD) [21]. Moreover, we improved the step of converting the code gadget representation to a vector based on the word2vec method. The estimated accuracy of weakness detection is shown in table 1.

Results of rule-based patch generation
Proposed technology was verified on different open source projects, for example, jsoncpp, Microsoft/Terminal, and others.
DeeDP detected a security issue with resource management in Microsoft/Terminal and created a patch that was successfully approved and applied by Microsoft maintainers (https://github.com/microsoft/Terminal/commit/99555ef9e9ba89b03bbeedf238b7e65375775b56).

Results of patch generation based on a neural network
We have tested the approach of creating patches with the seq2seq concept, in which the neural network itself extracts the statistical dependency between code with weaknesses and improved code. We have developed the DeeDP system with a web UI into which direct links to GitHub can be pasted for analysis. After analysis, the system shows the results of vulnerability verification and the suggested patches. The next steps of our investigation are expanding the list of detectable vulnerabilities and improving the patch generation techniques.

Conclusion
This paper describes approaches for automating the detection and fixing of vulnerabilities in C/C++ source code. The presented rule-based approach to fixing issues is robust, but it cannot be easily scaled. We implemented this approach first of all as a baseline and to collect the dataset that will be used in the second method.
We have verified the ability to use a neural network for patch generation and obtained very good results. In the future we plan to expand the set of weaknesses that can be detected by our VulDetect module and automatically fixed by the patch generation module. Moreover, we want to use a generative adversarial network (GAN) to generate issue-free code from the original source code. In general, we need to add the ability to continuously train both neural networks (for detection and for generation) on newly appearing samples.