Detecting vulnerabilities in code has been a problem facing the software development community for decades. Weaknesses that slip into production code undetected can become attack entry points if attackers find and exploit them. Such vulnerabilities can greatly damage the reputation of the company releasing the software and, potentially, the operational and financial well-being of the companies that installed the software and suffered from the attack. The magnitude of this problem keeps growing: in 2020, the US-CERT database confirmed 17,447 new vulnerabilities, a record number for the fourth year running.

The software development community has developed a variety of methods for detecting those weaknesses before they get into production code. Every piece of code goes through thorough static scanning, dynamic scanning and penetration testing before it is released into a product. But these scans still suffer from false positives, false negatives and long run times, making the security process a burden on the development team.

Deeper Into Transfer Learning

Recently, extensive research has been conducted on how to leverage artificial intelligence (AI) and deep learning techniques to analyze and generate code. The main challenge in this domain has been figuring out how to leverage and include years of knowledge amassed by code experts into deep learning models. Some of the research approaches the challenge from the data generation point of view to solve the problem of creating and labeling samples; some design specific deep networks to solve the problems that arise from code structure and semantics; while others design feature extraction techniques to solve the problems of parsing code using AI.

Transfer learning is one of the most promising deep learning approaches for leveraging existing expert knowledge. It has demonstrated success in overcoming a lack of samples by using existing pre-trained models for problems in a similar domain. For example, transfer learning for medical imaging leverages pre-trained models for image classification to classify medical images.

The transfer learning approach proves to be successful in this case because the layers of a pre-trained model can extract features of a ‘general’ image, while the transfer layer can make the final classification for the medical image domain. However, in the software development domain, there are no pre-trained models that can successfully extract features from code.

3 Steps to Transfer Learning

To solve the code classification problem developers may run into, we suggest the following three-step transfer learning approach:

  • Leverage an existing code analyzer: parse the code with this tool, run its initial analysis and create an internal-state representation of the code.
  • Use a pre-trained image classification convolutional neural network (CNN) model to extract features from this internal representation and apply transfer learning to it.
  • Use transfer learning to train a classic machine learning model (such as a support vector machine) on existing data.

The image below describes the training process. For each labeled code sample, create the analyzer tool's internal-state representation of the sample, feed that internal state to a CNN, take the output of the penultimate CNN layer and feed it to a support vector machine (SVM). Then, train the SVM using this input and the original sample's label.

Figure 1: Three Steps to Transfer Learning
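As a rough illustration of this pipeline, here is a minimal Python sketch, assuming the analyzer's internal-state representations have already been exported as fixed-size, image-like arrays. The file names, array shape and the 224x224x3 layout are hypothetical placeholders, not part of our actual setup:

```python
import numpy as np
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input
from sklearn.svm import SVC

# Hypothetical inputs: internal-state representations exported by the code
# analyzer, rendered as 224x224 RGB arrays, plus the original weakness labels.
states = np.load("analyzer_states.npy")   # shape: (n_samples, 224, 224, 3)
labels = np.load("labels.npy")            # shape: (n_samples,)

# Pre-trained MobileNetV2 with its classification head removed; global average
# pooling yields the same 1280-dimensional vector as the penultimate layer of
# the full model.
feature_extractor = MobileNetV2(weights="imagenet", include_top=False,
                                pooling="avg", input_shape=(224, 224, 3))

# Extract one feature vector per sample from the frozen CNN.
features = feature_extractor.predict(preprocess_input(states.astype("float32")))

# Train the SVM on the extracted features and the original sample labels.
clf = SVC(kernel="rbf")
clf.fit(features, labels)
```

At prediction time, the same two stages are applied to an unlabeled sample: extract features with the frozen CNN, then classify with the trained SVM.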

Using this approach, we solved the feature extraction problem by using an existing tool that can parse code, analyze it and create a new representation of it (such as a call graph). This new representation is fed into a pre-trained model that helps solve the data generation problem by leveraging transfer learning techniques.

Test It!

To test the above approach, we used the Juliet data set developed by NIST. This set contains 64K labeled C/C++ code samples. These samples are targeted at specific Common Weakness Enumerations (CWEs), and some are tailored to deceive security scans. We used a state-of-the-art static analysis tool that parses and analyzes C/C++ code for weaknesses to create an internal representation of these code samples. We then fed the internal representation to a MobileNetV2 model (a 53-layer CNN pre-trained for image classification), applied transfer learning by removing the last layer in MobileNetV2 and fed its output to an SVM classifier.

We chose SVM after running a lightweight grid search over several classifiers available in scikit-learn, a free machine learning library for the Python programming language. We chose MobileNetV2 after running experiments on pre-trained classifiers available in Keras, an open-source library that provides a Python interface for artificial neural networks. The analysis tool was configured with its default settings.
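For context, a lightweight classifier comparison in scikit-learn might look something like the sketch below. The candidate models, parameter grids and randomly generated stand-in features are illustrative only and are not the exact search we ran:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Illustrative stand-ins for the CNN feature vectors and binary weakness labels
# (in practice these come from the feature-extraction step shown earlier).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 1280))
labels = rng.integers(0, 2, size=200)

# A small, illustrative search space over a few candidate classifiers.
candidates = {
    "svm": GridSearchCV(SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
                        scoring="f1", cv=3),
    "random_forest": GridSearchCV(RandomForestClassifier(),
                                  {"n_estimators": [100, 300]},
                                  scoring="f1", cv=3),
    "logistic_regression": GridSearchCV(LogisticRegression(max_iter=1000),
                                        {"C": [0.1, 1, 10]},
                                        scoring="f1", cv=3),
}

# Fit each candidate and report its best cross-validated F1 score.
for name, search in candidates.items():
    search.fit(features, labels)
    print(name, round(search.best_score_, 3), search.best_params_)
```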

We ran classifications using our method and compared them to the analyzer's prediction results. Our method was more effective at detecting CWEs than the analyzer: it achieved a higher F1 score and detected weaknesses that the analysis tool missed.
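The comparison itself boils down to scoring each method's predictions against the ground-truth labels, for example with scikit-learn's f1_score. The tiny vectors below are illustrative stand-ins, not our actual results:

```python
import numpy as np
from sklearn.metrics import f1_score

# Illustrative stand-ins: ground-truth labels for a handful of samples,
# alongside the static analyzer's verdicts and the transfer-learning model's.
y_true        = np.array([1, 1, 0, 1, 0, 0, 1, 0])
analyzer_pred = np.array([1, 0, 0, 0, 0, 1, 1, 0])
model_pred    = np.array([1, 1, 0, 1, 0, 0, 1, 1])

print("analyzer F1:         ", f1_score(y_true, analyzer_pred))
print("transfer-learning F1:", f1_score(y_true, model_pred))
```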

An example of this is CWE476, a null pointer dereference caused by code that is split into two functions connected by a global variable. CWE476 was not detected by the static analysis tool but was detected by our transfer learning approach.

Smarter Testing, Better Code

The effectiveness of this approach comes from leveraging a tool that codifies years of code analysis domain expertise. This tool provided us with strong feature extraction capabilities that had been built, for a different purpose, over the years. Using transfer learning and replacing the final neural network layer with an SVM allowed us to overcome the lack of labeled data samples and of effective data generation techniques for code. Finally, we believe the pre-trained MobileNetV2 model was successful at extracting features from the analysis tool's intermediate representation because of the nature of that representation.

In the future, we plan to expand our experiments to real-life examples and research possible enhancements to the approach described above. One of the main challenges is understanding how to choose a better pre-trained model to replace the MobileNetV2 CNN. An interesting direction is to train a model on a helper problem and then use it to replace the CNN layer.

Hardening Code for Better Security

The number of reported vulnerabilities has set a new record for the fourth consecutive year, and attackers keep exploiting newly discovered security flaws to compromise organizations and data. These are flaws found after the fact: in production code, in firmware and product code, in websites and in the logic that connects them all.

Finding flaws in code before it is released is a priority, but it can also delay deadlines and release expectations. In some cases, companies struggle to find the right skill set or to fund secure code reviews for a project. Many factors work against code becoming more secure, and that is precisely why it is critical to find new, smarter ways for developers to check for security flaws before code is released.

Using AI to solve business issues can accelerate solutions for all of society. Just as we applied AI to better analyze code, enterprises need AI that is fluid, adaptable and capable of applying knowledge acquired for one purpose to new domains and challenges. They need AI that can combine different forms of knowledge, unpack causal relationships and learn new things on its own. In short, enterprises need AI with fluid intelligence — and that’s exactly what we’re building.

Learn more

This blog is based on a patent filed by IBM in May 2020: "Leveraging AI for vulnerability detection using internal representation of code analysis results," by Fady Copty, Shai Doron and Reda Igbaria.
