Machine learning is everywhere in cybersecurity these days. It is often treated as a magic bullet for securing systems and networks: a tool that can identify previously invisible attacks through an opaque set of functions, as in neural nets. Transparency aside, neural nets and other algorithms have indeed proven very effective.

Security professionals run into a distinct problem when they try to apply machine learning to attack detection, however. Machine learning classifiers perform much better in the supervised case, where labeled data is available: attack data is clearly distinguished from normal data and marked for training the classifier.

In this case, the classifier learns from the features of the training set to determine the differences between attack and normal examples. Features are typically variables that are either extracted directly or computed, depending on the data. They are usually identified by domain knowledge experts who carefully choose which variables make the most sense for the given challenge.
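As a minimal sketch of the difference between directly extracted and computed features, the Python snippet below turns a single connection record into a feature dictionary. The record fields and feature names are hypothetical, chosen only for illustration; a real feature set depends on the data source and on expert judgment.

```python
from dataclasses import dataclass


@dataclass
class ConnectionRecord:
    """Hypothetical log record; field names are assumptions for this sketch."""
    duration_s: float
    bytes_sent: int
    login_attempts: int
    failed_logins: int


def extract_features(rec):
    """Return features taken directly from the record plus two computed ones."""
    return {
        # Extracted directly from the record
        "duration_s": rec.duration_s,
        "bytes_sent": rec.bytes_sent,
        # Computed from several raw fields (guarding against division by zero)
        "bytes_per_second": rec.bytes_sent / rec.duration_s if rec.duration_s > 0 else 0.0,
        "failed_login_ratio": rec.failed_logins / rec.login_attempts if rec.login_attempts else 0.0,
    }


record = ConnectionRecord(duration_s=4.0, bytes_sent=2000, login_attempts=5, failed_logins=4)
features = extract_features(record)
print(features)
```

A computed feature such as the failed login ratio can separate the classes more clearly than either raw count alone, which is the kind of choice a domain expert would make.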

Dozens of supervised classification algorithms exist, ranging from simple decision trees to highly complex deep neural nets. One problem they all share is that attacks change. After enough time passes, a supervised model is no longer useful unless the chosen algorithm is designed to adapt as the data changes. Another difficulty is class imbalance. In a cybersecurity setting there are many more normal examples than abnormal ones: attacks are relatively rare, so the system sees a great deal of data illustrating the nonmalicious case and very little illustrating the malicious one.
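To illustrate why class imbalance is a problem, the sketch below scores a trivial classifier that labels every event as normal. The 990-to-10 split is made-up illustrative data, not a measurement.

```python
def confusion_counts(y_true, y_pred):
    """Count outcomes, treating 1 (attack) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(y_true, y_pred):
    """Fraction of all events labeled correctly."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of real attacks that were detected."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Illustrative data: 990 normal events (0) and 10 attacks (1).
y_true = [0] * 990 + [1] * 10
# A useless classifier that always answers "normal".
y_pred = [0] * len(y_true)

print(f"accuracy: {accuracy(y_true, y_pred):.2f}")
print(f"recall:   {recall(y_true, y_pred):.2f}")
```

This classifier is 99 percent accurate yet detects none of the ten attacks, which is why recall and false positive counts say more than accuracy when attacks are rare.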

Anomaly Detectors

But what happens in the real world, where these labeled examples are not available? Ideally, a machine learning system would identify unknown attacks and alert analysts to them. Such systems are called anomaly detectors. They rely on a set of algorithms known as unsupervised learners, which only work if there are commonalities among the features that can be used to group the data points.

One way to do this is to assume that the training data is all nominal. That is, train the unsupervised algorithm in a setting where the noise from attacks is minimal. This allows the learner to group the normal data by the features extracted from the data set; points that fall outside those groups are then flagged as anomalies. These algorithms range from simple k-means clustering to self-organizing maps and beyond. One major problem with them is that they produce high false positive rates. Their output is therefore often ignored, because analysts simply do not have the time to chase all the false positives.
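The idea of training on nominal data and flagging whatever falls outside the learned groups can be sketched with a small k-means detector. This is an illustrative toy under stated assumptions, not a production design: the two-dimensional points, the farthest-point initialization, and the `slack` multiplier on the alert threshold are all choices made for this example.

```python
import math


def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def init_centroids(points, k):
    """Deterministic start: first point, then repeatedly the farthest point."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist(p, c) for c in centroids)))
    return centroids


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm: assign to nearest centroid, then recompute means."""
    centroids = init_centroids(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


def fit_detector(nominal, k, slack=1.5):
    """Learn clusters from nominal data; threshold = largest training distance * slack."""
    centroids = kmeans(nominal, k)
    radius = max(min(dist(p, c) for c in centroids) for p in nominal)
    return centroids, radius * slack


def is_anomaly(point, centroids, threshold):
    """Flag a point that is far from every cluster of normal behavior."""
    return min(dist(point, c) for c in centroids) > threshold


# Made-up nominal data forming two groups of normal behavior.
NOMINAL = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, threshold = fit_detector(NOMINAL, k=2)

print(is_anomaly((5, 5), centroids, threshold))      # far from both groups
print(is_anomaly((0.5, 0.5), centroids, threshold))  # inside a normal group
```

The `slack` value makes the false positive trade-off visible: a tighter threshold flags more unusual but harmless points, while a looser one lets more real attacks through.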

Human Training Boosts Machine Learning

There is a third set of algorithms known as semi-supervised. This set uses some labeled data and some unlabeled data. In a typical case, one of the classes of data is labeled and others are not. For example, the system may be exposed to labeled normal data and other data that may be either normal or attack data. In this scenario, the semi-supervised algorithm tries to find commonalities in the features of the labeled data, then tries to identify and label data from the rest of the data set.
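One simple way to picture this scenario is self-training: build a profile from the labeled normal data, label the unlabeled points that match it, and repeat with the enlarged set. The sketch below assumes a single numeric feature and a fixed `tolerance`; both the data and the tolerance value are hypothetical, and real semi-supervised algorithms are considerably more sophisticated.

```python
from statistics import mean


def self_train(labeled_normal, unlabeled, tolerance):
    """Iteratively pseudo-label unlabeled values that sit close to the normal profile.

    Returns (accepted_as_normal, still_suspect).
    """
    normals = list(labeled_normal)
    remaining = list(unlabeled)
    accepted = []
    changed = True
    while changed:
        changed = False
        center = mean(normals)  # profile recomputed once per round
        for value in list(remaining):
            if abs(value - center) <= tolerance:
                accepted.append(value)
                normals.append(value)
                remaining.remove(value)
                changed = True
    return accepted, remaining


# Made-up values of one feature, e.g. requests per minute.
labeled_normal = [10, 11, 12]
unlabeled = [11.5, 13, 14.5, 40]

accepted, suspect = self_train(labeled_normal, unlabeled, tolerance=3.2)
print("pseudo-labeled normal:", accepted)
print("left as suspect:", suspect)
```

In this run the value 14.5 is accepted only in the second round, after the profile has shifted toward it. That shows both the benefit of using unlabeled data and the risk that early labeling mistakes compound.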

One way to solve the problem of data labeling is to introduce a human into the training loop. People can help the machine learning system by labeling attack data and perhaps normal data as well. They then teach the algorithm and create supervised data that can be used to train the next iteration of the machine learning system. This technique is extremely powerful but often underutilized due to the push to create the perfect unsupervised machine learner. But human expertise may be the only way to teach an effective anomaly detector.
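A human-in-the-loop cycle of this kind can be sketched in three steps: score events without labels, send the most anomalous ones to an analyst, and train a supervised model on the answers. Everything below is an assumption made for illustration, including the median-based score, the nearest-mean classifier, the labeling budget, and the `simulated_analyst` function that stands in for a person. It is not the method of any particular product.

```python
from statistics import mean, median


def anomaly_scores(events):
    """Unsupervised score: absolute deviation from the median."""
    center = median(events)
    return [abs(e - center) for e in events]


def query_analyst(events, budget, ask_analyst):
    """Send the `budget` highest-scoring events to the analyst and collect labels."""
    scores = anomaly_scores(events)
    ranked = sorted(range(len(events)), key=lambda i: scores[i], reverse=True)
    return [(events[i], ask_analyst(events[i])) for i in ranked[:budget]]


def train_nearest_mean(labeled):
    """Supervised step: one mean per label; predict the label whose mean is closest."""
    by_label = {}
    for value, label in labeled:
        by_label.setdefault(label, []).append(value)
    means = {label: mean(values) for label, values in by_label.items()}
    return lambda value: min(means, key=lambda label: abs(value - means[label]))


def simulated_analyst(value):
    """Stand-in for a human: in this toy data, only values of 45 or more are attacks."""
    return "attack" if value >= 45 else "normal"


# Made-up single-feature events; 30 is unusual but benign.
events = [10, 11, 9, 10, 12, 50, 52, 30, 10, 11]

labeled = query_analyst(events, budget=3, ask_analyst=simulated_analyst)
predict = train_nearest_mean(labeled)

print(labeled)
print(predict(48), predict(31))
```

The unsupervised score alone would keep alerting on the event with value 30. Once the analyst labels it as normal, the retrained model stops treating similar events as attacks, which is how human feedback can cut false positives in the next iteration.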

AI2: Looking Ahead

The goal of a machine learning system such as an anomaly detector should be to augment, not replace, the human analyst. Analysts have to sort through a massive number of supposed attacks, often generated by signature-based detectors, which only identify known attacks. This leads to high rates of missed attacks (false negatives), and attackers simply learn how to avoid the detectors.

Furthermore, to reduce the number of false positives the analyst must reject, detectors are often tuned so that only certain signatures set off alerts. If a signature produces a high false positive rate, it is often noted but not analyzed. A machine learning system should help to eliminate the false positives and flag attacks that signatures do not identify. This would give the analyst a more holistic view of the attacks attempted against a network or host.

AI2, created by authors at MIT and PatternEx, attempts to be just such a system. It is described as an “analyst-in-the-loop” technology designed to improve detection rates by a factor of 3.41 and reduce false positives by a factor of five compared to an unsupervised detector. We’ll examine AI2 more completely in the next part of this series.
