December 5, 2016 By Brad Harris 3 min read

Machine learning is everywhere in the world of cybersecurity these days. It is often thought of as a magic bullet for securing systems and networks: a tool able to identify previously invisible attacks through an opaque set of functions, as in neural nets. Transparency aside, neural nets and other algorithms have indeed proven very effective.

Security professionals run into a distinct problem when attempting to apply these techniques, however. Machine learning classifiers perform much better in the supervised case, where labeled data is available: attack data is clearly distinguished from normal data and marked for training the classifier.

In this case, the classifier learns from the features of the training set to determine the differences between attack and normal examples. Features are typically variables that are either extracted directly or computed, depending on the data. They are usually identified by domain knowledge experts who carefully choose which variables make the most sense for the given challenge.
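A minimal sketch of the supervised case, not from the original article, might look like the following. The features and labels here are hypothetical stand-ins for expert-extracted variables such as bytes transferred, connection duration or failed login counts; any real deployment would use domain-specific features.

```python
# Hedged sketch: supervised training on labeled attack vs. normal examples.
# X holds pre-extracted features (one row per event); y holds labels
# (0 = normal, 1 = attack). The data below is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                 # hypothetical flow features
y = (rng.random(1000) < 0.05).astype(int)      # attacks are rare (~5%)
X[y == 1] += 2.0                               # give attacks a detectable signature

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```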

One problem with supervised models is that attacks change. After enough time passes, a trained model is no longer useful unless the chosen algorithm is designed to adapt as the data changes over time. Another difficulty is class imbalance: in cybersecurity settings there are many more normal examples than abnormal ones. Attacks are relatively rare, so the system is presented with a great deal of data that illustrates the nonmalicious case. There are literally dozens of classification algorithms to choose from, ranging from simple decision trees to highly complex deep neural nets.
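As one hedged illustration of the class-imbalance point, many scikit-learn classifiers accept a class_weight parameter that up-weights the rare attack class. The decision tree and synthetic data below are illustrative choices, not the article's.

```python
# Hedged sketch: compensating for class imbalance with class weights.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 4))
y = (rng.random(5000) < 0.02).astype(int)   # heavily imbalanced: ~2% attacks
X[y == 1] += 1.5                            # shift attack features so there is signal

# "balanced" reweights each class inversely to its frequency, so the few
# attack examples are not drowned out by the normal majority.
tree = DecisionTreeClassifier(class_weight="balanced", max_depth=5, random_state=1)
tree.fit(X, y)
print(dict(zip(*np.unique(tree.predict(X), return_counts=True))))
```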

Anomaly Detectors

But what happens in the real world, where these labeled examples are not available? Ideally, a machine learning system should identify unknown attacks and alert analysts. Such systems are called anomaly detectors. The process requires a set of algorithms known as unsupervised learners, and it only works if there are commonalities among the features that can be used to group the data points.

One way to do this is to assume that the training data is all nominal; that is, to train the unsupervised algorithm in a situation where the noise from attacks is minimal. This allows the learner to group the normal data by the features extracted from the data set. These algorithms range from simple k-means clustering to self-organizing maps and beyond. One huge problem with them is that they tend to produce high false positive rates, which often means their output is ignored because analysts simply do not have the time to chase every alert.
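Here is a minimal sketch of that idea, assuming mostly attack-free training traffic: fit k-means to the presumed-clean data, then flag new points that sit far from every learned cluster. The feature set and the 99th-percentile threshold are illustrative assumptions, and the threshold directly controls the false positive trade-off described above.

```python
# Hedged sketch: k-means as a simple anomaly detector over nominal data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X_nominal = rng.normal(size=(2000, 3))            # assumed mostly attack-free features

km = KMeans(n_clusters=5, n_init=10, random_state=2).fit(X_nominal)

# Distance from each training point to its nearest centroid sets a baseline.
train_dist = np.min(km.transform(X_nominal), axis=1)
threshold = np.quantile(train_dist, 0.99)         # illustrative cutoff

def is_anomalous(X_new):
    """Flag points that sit unusually far from every learned cluster."""
    return np.min(km.transform(X_new), axis=1) > threshold

X_new = rng.normal(size=(10, 3)) + np.array([4.0, 0.0, 0.0])  # shifted, "unusual" points
print(is_anomalous(X_new))
```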

Human Training Boosts Machine Learning

There is a third set of algorithms known as semi-supervised, which use some labeled data and some unlabeled data. In a typical case, one class of data is labeled and the others are not. For example, the system may be given labeled normal data along with other data that may be either normal or attack traffic. The semi-supervised algorithm then finds commonalities in the features of the labeled data and uses them to identify and label the rest of the data set.
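One way to realize this, sketched below under my own assumptions rather than the article's, is to fit a one-class model to the labeled normal data and use it to pseudo-label the unlabeled pool. The one-class SVM and its parameters are illustrative choices.

```python
# Hedged sketch: using only labeled *normal* data to pseudo-label a mixed pool.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(3)
X_labeled_normal = rng.normal(size=(1000, 3))                 # analyst-confirmed normal events
X_unlabeled = np.vstack([rng.normal(size=(480, 3)),           # unlabeled: mostly normal...
                         rng.normal(loc=3.0, size=(20, 3))])  # ...plus a few outliers

# Learn the "shape" of normal behavior from the labeled class alone.
ocsvm = OneClassSVM(nu=0.01, gamma="scale").fit(X_labeled_normal)

# +1 = looks like the labeled normal class, -1 = candidate attack.
pseudo_labels = ocsvm.predict(X_unlabeled)
print((pseudo_labels == -1).sum(), "events flagged for review")
```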

One way to solve the problem of data labeling is to introduce a human into the training loop. People can help the machine learning system by labeling attack data, and perhaps normal data as well. These labels become supervised data that can be used to train the next iteration of the machine learning system. This technique is extremely powerful but often underutilized due to the push to create the perfect unsupervised machine learner. In practice, human expertise may be the only way to teach an effective anomaly detector.
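The sketch below shows the general shape of such a loop; it is not AI2's method. An unsupervised detector surfaces the most suspicious events, a human labels them, and the accumulated labels seed a supervised model for the next iteration. The analyst_labels function and the choice of isolation forest and random forest are stand-ins of my own.

```python
# Hedged sketch of an analyst-in-the-loop cycle on a synthetic event stream.
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(size=(1950, 3)),
               rng.normal(loc=3.0, size=(50, 3))])   # rare, shifted "attacks"
true_labels = np.r_[np.zeros(1950, dtype=int), np.ones(50, dtype=int)]

def analyst_labels(indices):
    """Hypothetical stand-in for a human analyst reviewing flagged events."""
    return true_labels[indices]

detector = IsolationForest(random_state=4).fit(X)
scores = detector.score_samples(X)                   # lower score = more anomalous

labeled_idx, labels = [], []
for _ in range(3):                                   # a few labeling rounds
    candidates = np.argsort(scores)[:10]             # ten most suspicious, not yet reviewed
    labeled_idx.extend(candidates.tolist())
    labels.extend(analyst_labels(candidates).tolist())
    scores[candidates] = np.inf                      # do not surface the same events again

# The accumulated analyst labels train the next, supervised iteration.
clf = RandomForestClassifier(random_state=4).fit(X[labeled_idx], labels)
```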

AI2: Looking Ahead

The goal of a machine learning system such as an anomaly detector should be to augment, not replace, the human analyst. Analysts have to sort through a massive number of suspected attacks, often generated by signature-based detectors, which identify only known attacks. This leads to high rates of missed attacks (false negatives), and attackers simply learn how to evade the detectors.

Furthermore, to reduce the number of false positives the analyst must reject, detectors are often tuned so that only certain signatures set off alerts. If a signature produces a high false positive rate, alerts from that signature are often noted but not analyzed. A machine learning system should help to eliminate those false positives and surface attacks that signatures miss. This would give the analyst a more holistic view of the attacks attempted against a network or host.

AI2, created by researchers at MIT and PatternEx, attempts to be just such a system. It is described as an “analyst-in-the-loop” technology designed to improve detection rates by a factor of 3.41 and reduce false positives by a factor of five compared to an unsupervised detector. We’ll examine AI2 more completely in the next part of this series.

Read the IBM Executive Report: Cybersecurity in the cognitive era
