Detecting vulnerabilities in code has been a problem facing the software development community for decades. Undetected weaknesses in production code can become attack entry points if attackers find and exploit them. Such vulnerabilities can greatly damage the reputation of the company releasing the software and, potentially, the operational and financial well-being of the companies that installed the software and suffered from the attack. The magnitude of this problem keeps growing. In 2020, the US-CERT database confirmed 17,447 new vulnerabilities, a record number for the fourth year running.
The software development community has developed a variety of methods for detecting those weaknesses before they get into production code. Every piece of code goes through thorough static scanning, dynamic scanning and penetration testing before it is released into a product. But these scans still suffer from false positives, false negatives and long run times, making the security process a burden on the development team.
Deeper Into Transfer Learning
Recently, extensive research has been conducted on how to leverage artificial intelligence (AI) and deep learning techniques to analyze and generate code. The main challenge in this domain has been figuring out how to leverage and include years of knowledge amassed by code experts into deep learning models. Some of the research approaches the challenge from the data generation point of view to solve the problem of creating and labeling samples; some design specific deep networks to solve the problems that arise from code structure and semantics; while others design feature extraction techniques to solve the problems of parsing code using AI.
Transfer learning is one of the most promising deep learning approaches for leveraging existing expert knowledge. It has demonstrated success in overcoming a lack of samples by using existing pre-trained models for problems in a similar domain. For example, transfer learning for medical imaging leverages pre-trained models for image classification to classify medical images.
The transfer learning approach proves to be successful in this case because the layers of a pre-trained model can extract features of a ‘general’ image, while the transfer layer can make the final classification for the medical image domain. However, in the software development domain, there are no pre-trained models that can successfully extract features from code.
3 Steps to Transfer Learning
To solve the code classification problem developers may run into, we suggest the following three-step transfer learning approach:
- Use an existing code analyzer to parse the code, run an initial analysis and create an internal-state representation of the code.
- Feed this internal representation to a pre-trained image classification convolutional neural network (CNN) and use the network to extract features from it.
- Train a classic machine learning model (such as a support vector machine) on the extracted features and the existing labels.
The image below describes the training process. For each labeled code sample, use the analyzer tool to create an internal-state representation of the sample, feed that representation to a CNN, take the output of the CNN’s penultimate layer and feed it to a support vector machine (SVM). Then, train the SVM on this input and the original sample’s label.
Figure 1: Three Steps to Transfer Learning
Using this approach, we solved the feature extraction problem by using an existing tool that can parse code, analyze it and create a new representation of it (such as a call graph). This new representation is fed into a pre-trained model that helps solve the data generation problem by leveraging transfer learning techniques.
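One way to picture the hand-off from the analyzer to the CNN is to render a call graph as a small grayscale image. The sketch below is an assumption made for illustration only: the post does not say how the analyzer's internal state is encoded, and the `call_graph_to_grid` helper, the adjacency-matrix layout and the sample function names are all hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[str, str]  # (caller, callee)


def call_graph_to_grid(edges: Iterable[Edge], size: int = 8) -> List[List[int]]:
    """Render a call graph as a size x size grid of pixel intensities (0 or 255).

    Hypothetical encoding: row = caller, column = callee. Functions are
    numbered in order of first appearance; edges that touch a function
    numbered `size` or higher are dropped.
    """
    index: Dict[str, int] = {}
    grid = [[0] * size for _ in range(size)]
    for caller, callee in edges:
        for name in (caller, callee):
            if name not in index:
                index[name] = len(index)
        row, col = index[caller], index[callee]
        if row < size and col < size:
            grid[row][col] = 255  # a bright pixel marks "caller calls callee"
    return grid


# Hypothetical call graph for a four-function program.
edges = [("main", "init"), ("main", "use_data"), ("use_data", "log")]
grid = call_graph_to_grid(edges, size=4)
for row in grid:
    print(row)
```

Any encoding that produces a fixed-size 2-D array would serve the same purpose here. The adjacency matrix is simply the shortest one to write down.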
Test It!
To test the above approach, we used the Juliet data set developed by NIST. This set contains 64K labeled C/C++ code samples. These samples target specific Common Weakness Enumerations (CWEs), and some are tailored to deceive security scans. We used a state-of-the-art static analysis tool that parses and analyzes C/C++ code for weaknesses to create an internal representation of these code samples. We then fed the internal representation to a MobileNetV2 model (a 53-layer CNN pre-trained for image classification). To apply transfer learning, we removed the last layer of MobileNetV2 and fed the output of the remaining network to an SVM classifier.
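A minimal sketch of this setup with Keras and scikit-learn might look like the following. It rests on several assumptions that the post does not confirm: the internal representation is taken to be a 2-D grid of values in the 0 to 255 range, the helper names (`resize_nearest`, `build_extractor`, `extract_features`, `train_svm`) are hypothetical, and the RBF kernel is an arbitrary choice. In Keras, `include_top=False` with `pooling="avg"` drops the final classification layer and returns the 1280 pooled features that would otherwise feed it.

```python
from typing import List, Sequence

INPUT_SIDE = 224  # default input resolution of the Keras MobileNetV2 model


def resize_nearest(grid: Sequence[Sequence[int]], side: int = INPUT_SIDE) -> List[List[int]]:
    """Nearest-neighbour resize of a 2-D grid to side x side."""
    rows, cols = len(grid), len(grid[0])
    return [[grid[r * rows // side][c * cols // side] for c in range(side)]
            for r in range(side)]


def build_extractor():
    """MobileNetV2 without its classification layer (1280 features per image)."""
    # Imported lazily: TensorFlow is heavy, and the ImageNet weights are
    # downloaded on first use.
    from tensorflow.keras.applications import MobileNetV2
    return MobileNetV2(weights="imagenet", include_top=False, pooling="avg",
                       input_shape=(INPUT_SIDE, INPUT_SIDE, 3))


def extract_features(grids, extractor):
    """Turn each grid into a 224 x 224 x 3 image and return the CNN features."""
    import numpy as np
    from tensorflow.keras.applications.mobilenet_v2 import preprocess_input
    gray = np.asarray([resize_nearest(g) for g in grids], dtype="float32")
    rgb = np.repeat(gray[..., np.newaxis], 3, axis=-1)  # grayscale -> 3 channels
    return extractor.predict(preprocess_input(rgb), verbose=0)


def train_svm(features, labels):
    """Fit an SVM on the extracted features and the samples' original labels."""
    from sklearn.svm import SVC
    return SVC(kernel="rbf").fit(features, labels)


# The resize step on its own, on a tiny 2 x 2 grid:
print(resize_nearest([[0, 255], [255, 0]], side=4))
```

With labeled grids in hand, training would be `train_svm(extract_features(grids, build_extractor()), labels)`. Only the SVM is fitted; the CNN weights stay frozen.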
We chose SVM after running a lightweight grid search on several classifiers available in scikit-learn, a free software machine learning library for the Python programming language. We chose MobileNetV2 after running some experiments on pre-trained classifiers available in Keras, an open-source software library that provides a Python interface for artificial neural networks. The analysis tool was left at its default settings.
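A lightweight grid search of this kind could be sketched with scikit-learn's `GridSearchCV` as shown below. The candidate classifiers, their hyperparameter grids, the `select_classifier` helper and the F1 scoring choice are all assumptions made for illustration; the post names none of them. The `"f1"` scorer also assumes binary labels.

```python
from math import prod

# Hypothetical search space: the post lists neither candidates nor grids.
CANDIDATE_GRIDS = {
    "svm": {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    "random_forest": {"n_estimators": [50, 200], "max_depth": [None, 10]},
    "logistic_regression": {"C": [0.1, 1, 10]},
}


def n_configurations(grid):
    """Number of hyperparameter combinations in one grid."""
    return prod(len(values) for values in grid.values())


def select_classifier(features, labels, cv=3):
    """Return (name, fitted GridSearchCV) for the best cross-validated F1."""
    # Imported here so the grids above can be inspected without scikit-learn.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    estimators = {
        "svm": SVC(),
        "random_forest": RandomForestClassifier(random_state=0),
        "logistic_regression": LogisticRegression(max_iter=1000),
    }
    best_name, best_search = None, None
    for name, estimator in estimators.items():
        search = GridSearchCV(estimator, CANDIDATE_GRIDS[name], scoring="f1", cv=cv)
        search.fit(features, labels)
        if best_search is None or search.best_score_ > best_search.best_score_:
            best_name, best_search = name, search
    return best_name, best_search


# The whole search is small: count the configurations it would try.
total = sum(n_configurations(grid) for grid in CANDIDATE_GRIDS.values())
print(total)
```

Keeping the grids small is what makes the search lightweight: each configuration is fitted only `cv` times on features that the CNN has already computed.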
We ran classifications using our method and compared them to the analyzer’s prediction results. Our method was more effective in detecting CWEs than the analyzer. It achieved a higher F1 score and was able to detect weaknesses that the analysis tool missed.
An example of this is CWE-476, a null pointer dereference, in code that is split into two functions connected by a global variable. This CWE-476 instance was not detected by the static analysis tool but was detected by our transfer learning approach.
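The kind of comparison described above can be sketched in a few lines. The labels and predictions below are invented purely to show the arithmetic and are not the experiment's results; `f1` and `found_only_by_model` are hypothetical helper names, and 1 marks a sample that contains a weakness.

```python
from typing import List, Sequence


def f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 score for binary labels: the harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def found_only_by_model(y_true: Sequence[int], analyzer_pred: Sequence[int],
                        model_pred: Sequence[int]) -> List[int]:
    """Indices of real weaknesses that the model flags and the analyzer misses."""
    return [i for i, (t, a, m) in enumerate(zip(y_true, analyzer_pred, model_pred))
            if t == 1 and a == 0 and m == 1]


# Invented example data, NOT the Juliet results.
y_true        = [1, 1, 1, 1, 0, 0, 0, 0]
analyzer_pred = [1, 1, 0, 0, 0, 0, 0, 1]
model_pred    = [1, 1, 1, 0, 0, 0, 1, 0]

print(f"analyzer F1: {f1(y_true, analyzer_pred):.3f}")
print(f"model F1:    {f1(y_true, model_pred):.3f}")
print("found only by the model:", found_only_by_model(y_true, analyzer_pred, model_pred))
```

F1 is a reasonable single number for this comparison because it penalizes both missed weaknesses (false negatives) and spurious findings (false positives), the two scan problems the post calls out.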
Smarter Testing, Better Code
The effectiveness of this approach comes from leveraging a tool that codifies years of code analysis domain expertise. This tool provided us with strong feature extraction capabilities that were built, for a different purpose, over many years. Using transfer learning and replacing the network’s final layer with an SVM allowed us to overcome the lack of labeled data samples and of effective data generation techniques for code. Finally, we believe the pre-trained MobileNetV2 model was successful at extracting features from the analysis tool’s intermediate representation due to the nature of this representation.
In the future, we plan to expand our experiments to real-life examples and research possible enhancements to the approach described above. One of the main challenges is to understand how to choose a better pre-trained model to replace the MobileNetV2 CNN. An interesting direction is to train a model on a helper problem and then use it in place of the CNN.
Hardening Code for Better Security
The number of vulnerabilities has set a new record for the fourth consecutive year, as attackers keep exploiting newly found security flaws in code to compromise organizations and data. These flaws are discovered after the fact: in production code, in firmware and product code, in websites and in the logic that connects the dots.
Finding flaws in code before it is released is a priority, but it can also delay deadlines and release expectations. In some cases, companies struggle to find the right skillset or fund secure code reviews for the project. Many reasons can work against code becoming more secure, and that is precisely what makes it critical that we find new, smarter ways to enable developers to check for security flaws before code is released.
Using AI to solve business issues can accelerate solutions for all of society. Just as we applied AI to better analyze code, enterprises need AI that is fluid, adaptable and capable of applying knowledge acquired for one purpose to new domains and challenges. They need AI that can combine different forms of knowledge, unpack causal relationships and learn new things on its own. In short, enterprises need AI with fluid intelligence — and that’s exactly what we’re building.
Learn more
This blog is based on a patent filed by IBM in May 2020: “Leveraging AI for vulnerability detection using internal representation of code analysis results,” by Fady Copty, Shai Doron and Reda Igbaria.
Research Staff Member, IBM