December 5, 2016 By Brad Harris 3 min read

Machine learning is everywhere in the world of cybersecurity these days. It is often thought of as the magic bullet to secure systems and networks: a tool able to identify previously invisible attacks through an opaque set of learned functions, as in neural networks. Transparency aside, neural networks and other algorithms have indeed proven very effective.

Security professionals run into a distinct problem when applying machine learning, however. Classifiers perform much better in the supervised case, where labeled data is available: attack examples are clearly distinguished from normal ones and marked for training the classifier.

In this case, the classifier learns from the features of the training set to determine the differences between attack and normal examples. Features are typically variables that are either extracted directly or computed, depending on the data. They are usually identified by domain knowledge experts who carefully choose which variables make the most sense for the given challenge.
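As a concrete illustration of supervised learning over expert-chosen features, here is a minimal nearest-centroid classifier sketch. The two features (failed logins per minute, bytes sent) and all the numbers are hypothetical; real systems use far richer feature sets and stronger algorithms.

```python
# Minimal supervised classifier sketch: nearest-centroid over two
# hypothetical features (failed logins per minute, bytes sent).
# Labels: 0 = normal, 1 = attack.
import math

train = [
    ((2, 500), 0), ((1, 300), 0), ((3, 450), 0),          # normal traffic
    ((40, 9000), 1), ((55, 12000), 1), ((38, 8500), 1),   # attack traffic
]

def centroid(points):
    # Per-dimension mean of a list of feature vectors.
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

centroids = {
    label: centroid([x for x, y in train if y == label])
    for label in {y for _, y in train}
}

def classify(x):
    # Assign the label whose training centroid is nearest in feature space.
    return min(centroids, key=lambda lab: math.dist(x, centroids[lab]))

print(classify((50, 10000)))  # an unseen, attack-like example -> 1
```

The classifier never sees the unseen example during training; it generalizes purely from the feature geometry of the labeled set, which is the essence of the supervised case described above.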

One problem with supervised models is that attacks change. After enough time passes, a supervised model is no longer useful unless the chosen algorithm is designed to adapt as the data drifts. Another difficulty is class imbalance: in a cybersecurity setting, there are far more normal examples than abnormal ones. Because attacks are relatively rare, the system is presented with a great deal of data that illustrates only the nonmalicious case. Dozens of these classification algorithms exist, ranging from simple decision trees to highly complex deep neural networks.
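The imbalance problem can be sketched with a bit of arithmetic. The event counts below are assumptions chosen for illustration, but they show why raw accuracy is a misleading metric on imbalanced security data:

```python
# Why class imbalance makes raw accuracy misleading: with attacks at
# 0.1% of traffic, a "detector" that labels everything as normal still
# scores 99.9% accuracy -- while catching zero attacks.
events = 100_000
attacks = 100  # hypothetical ratio: 1 attack per 1,000 events

always_normal_accuracy = (events - attacks) / events
print(f"{always_normal_accuracy:.1%}")  # 99.9%
```

This is why imbalanced problems are usually evaluated with detection rate and false positive rate rather than accuracy alone.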

Anomaly Detectors

But what happens in the real world, where these labeled examples are not available? Ideally, a machine learning algorithm should identify unknown attacks and alert analysts. Systems built for this purpose are called anomaly detectors. The approach requires a set of algorithms known as unsupervised learners, and it only works if there are commonalities among the features that can be used to group the data points.

One way to do this is to assume that the training data is all nominal; that is, to train the unsupervised algorithm on data where the noise from attacks is minimal. This allows the learner to group the normal data by the features extracted from the data set. These algorithms range from simple k-means clustering to self-organizing maps and beyond. One huge problem with these algorithms is that they lead to high false positive rates. This often means their output is ignored, because analysts simply do not have the time to chase all the false positives.

Human Training Boosts Machine Learning

There is a third set of algorithms known as semi-supervised. This set uses some labeled data and some unlabeled data. In a typical case, one class of data is labeled and the others are not. For example, the system may be exposed to labeled normal data alongside other data that may be either normal or attack data. In this scenario, the semi-supervised algorithm first finds commonalities in the features of the labeled data, then tries to label the rest of the data set accordingly.
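A minimal sketch of this scenario, assuming only the normal class is labeled: model the labeled class statistically, then propagate labels to the unlabeled points by how far they fall from it. The single feature and the three-sigma cutoff are illustrative assumptions, not a prescribed method.

```python
# Semi-supervised sketch: only the "normal" class is labeled. Model it
# with a mean and standard deviation, then label the remaining data by
# its distance from that model. Feature values are hypothetical
# (e.g., login attempts per minute).
import statistics

labeled_normal = [2.0, 1.0, 3.0, 2.5, 1.5, 2.0]
unlabeled = [2.2, 1.8, 45.0, 2.7]

mu = statistics.mean(labeled_normal)
sigma = statistics.stdev(labeled_normal)

def propagate_label(x, k=3.0):
    # Points within k standard deviations of the labeled class are
    # assumed normal; the rest become candidate attack labels.
    return "normal" if abs(x - mu) <= k * sigma else "attack"

print([propagate_label(x) for x in unlabeled])
# ['normal', 'normal', 'attack', 'normal']
```

The candidate attack labels would then be confirmed or rejected before being fed back as training data, which leads directly to the human-in-the-loop idea below.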

One way to solve the problem of data labeling is to introduce a human into the training loop. People can help the machine learning system by labeling attack data, and perhaps normal data as well. Their labels become supervised data that can be used to train the next iteration of the machine learning system. This technique is extremely powerful but often underutilized due to the push to create the perfect unsupervised machine learner. Yet human expertise may be the only way to teach an effective anomaly detector.
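The feedback loop can be sketched as follows. Everything here is a toy assumption: the scoring function, the stand-in `analyst_labels` helper (which in reality is a human reviewing events) and the IP addresses are all hypothetical, but the loop structure is the point: rank, review, relabel, retrain.

```python
# Analyst-in-the-loop sketch: the detector surfaces its most anomalous
# events, a human labels them, and those labels feed the next
# iteration of the model. Scoring and data are purely illustrative.

def anomaly_score(event, known_attacks):
    # Toy score: the event's raw rate feature, boosted if its source
    # was previously confirmed malicious by the analyst.
    bonus = 50 if event["src"] in known_attacks else 0
    return event["rate"] + bonus

def analyst_labels(events):
    # Stand-in for human review; a real analyst inspects each event.
    return {e["src"]: e["rate"] > 30 for e in events}

events = [{"src": s, "rate": r} for s, r in
          [("10.0.0.5", 2), ("10.0.0.9", 55), ("10.0.0.7", 4), ("10.0.0.3", 41)]]

known_attacks = set()
for _ in range(2):  # two feedback rounds
    ranked = sorted(events, key=lambda e: anomaly_score(e, known_attacks),
                    reverse=True)
    feedback = analyst_labels(ranked[:2])  # analyst reviews the top 2
    known_attacks |= {src for src, bad in feedback.items() if bad}

print(sorted(known_attacks))  # ['10.0.0.3', '10.0.0.9']
```

Limiting review to the top-ranked events is what makes the analyst's time scale: the model does the triage, the human supplies the ground truth.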

AI2: Looking Ahead

The goal of a machine learning system such as an anomaly detector should be to augment, not replace, the human analyst. Analysts have to sort through a massive number of suspected attacks, often generated by signature-based detectors, which identify only known attacks. This leads to high rates of missed attacks (false negatives), and attackers simply learn how to avoid the detectors.

Furthermore, to reduce the number of false positives the analyst must reject, detectors are often tuned so that only certain signatures set off alerts. If a signature produces a high false positive rate, its alerts are often noted but not analyzed. A machine learning system should help eliminate these false positives and surface attacks that signatures miss. This would give the analyst a more holistic view of the strikes attempted against a network or host.
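A rough, illustrative calculation shows why this tuning pressure exists: even a modest false positive rate swamps analysts at network scale. The event counts below are assumptions, not measurements from any real deployment.

```python
# Alert-volume arithmetic: at 1 million benign events per day, a false
# positive rate of just 1% yields 10,000 false alerts daily -- far more
# than a human team can triage. All figures are hypothetical.
benign_events_per_day = 1_000_000
false_positive_rate = 0.01

false_alerts = int(benign_events_per_day * false_positive_rate)
print(false_alerts)  # 10000
```

Cutting that rate by even a small constant factor translates directly into thousands fewer alerts per day, which is why false positive reduction is a headline metric for systems in this space.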

AI2, created by researchers at MIT and PatternEx, attempts to be just such a system. It is described as an “analyst-in-the-loop” technology designed to improve detection rates by a factor of 3.41 and reduce false positives by a factor of five compared with an unsupervised detector. We’ll examine AI2 more completely in the next part of this series.
