Machine learning is everywhere in the world of cybersecurity these days. It is often thought of as the magic bullet for securing systems and networks: a tool able to identify previously invisible attacks through an opaque set of functions, as in neural nets. Transparency aside, neural nets and other algorithms have indeed proven very effective.
Security professionals run into a distinct problem when they try to apply machine learning to attack detection, however. Machine learning classifiers perform much better in the supervised case, where labeled data is available. In that setting, attack data is clearly distinguished from normal data and marked for training of the classifier.
In this case, the classifier learns from the features of the training set to determine the differences between attack and normal examples. Features are typically variables that are either extracted directly or computed, depending on the data. They are usually identified by domain knowledge experts who carefully choose which variables make the most sense for the given challenge.
One problem with supervised models is that attacks change. After enough time passes, a supervised model is no longer useful unless the chosen algorithm is designed to adapt as the data changes. Another difficulty is class imbalance. In a cybersecurity setting there are many more normal examples than abnormal ones: attacks are relatively rare, so the system sees a great deal of data that illustrates the nonmalicious case and very little that illustrates an attack. Supervised classification algorithms number in the dozens, ranging from simple decision trees to highly complex deep neural nets, but all of them face these two problems.
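To see why class imbalance matters, consider a minimal Python sketch. The event counts (990 normal, 10 attacks) and the always-normal "classifier" are hypothetical, chosen only for illustration. The point is that plain accuracy can look excellent while every attack is missed, which is why recall on the attack class is the more telling number.

```python
def confusion_counts(y_true, y_pred):
    """Count outcomes, treating label 1 as 'attack' and 0 as 'normal'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(y_true, y_pred):
    """Fraction of all events that were labeled correctly."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)

def recall(y_true, y_pred):
    """Fraction of real attacks that were detected."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical data set: 990 normal events (0) and 10 attacks (1).
y_true = [0] * 990 + [1] * 10
# A useless "classifier" that labels every event as normal.
y_pred = [0] * 1000

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
print(f"accuracy={acc:.3f} recall={rec:.3f}")  # accuracy=0.990 recall=0.000
```

A model that never raises an alert scores 99 percent accuracy here, yet it detects none of the attacks.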
But what happens in the real world, where these labeled examples are not available? Ideally, a machine learning system would identify previously unknown attacks and alert analysts. Such systems are called anomaly detectors. They rely on a set of algorithms known as unsupervised learners, which only work if there are commonalities among the features that can be used to group the data points.
One way to build such a detector is to assume that the training data is all nominal. That is, try to train the unsupervised algorithm in a situation where the noise from attacks is minimal. This allows the learner to group the normal data by the features extracted from the data set. These algorithms range from simple k-means clustering to self-organizing maps and beyond. One huge problem with them is that they produce high false positive rates. This often means that their output is ignored, because analysts simply do not have the time to chase all the false positives.
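The idea of training on nominal data can be sketched with a small k-means anomaly detector in plain Python. Everything specific here is an assumption made for illustration: the two-feature synthetic events, the farthest-first initialization, and the rule that an event is anomalous when it lies farther from every centroid than any training point did. A production detector would choose its features and threshold far more carefully.

```python
import math

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def init_centroids(points, k):
    """Farthest-first initialization (deterministic, chosen for this sketch)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    return centroids

def kmeans(points, k, iterations=10):
    """Lloyd's algorithm: assign each point to its nearest centroid, then re-average."""
    centroids = init_centroids(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids

def anomaly_score(point, centroids):
    """Distance to the nearest cluster of normal behavior."""
    return min(dist(point, c) for c in centroids)

# Hypothetical nominal training data: two tight groups of two-feature events.
grid = [(x * 0.5, y * 0.5) for x in range(-2, 3) for y in range(-2, 3)]
nominal = grid + [(10 + x, 10 + y) for x, y in grid]

centroids = kmeans(nominal, k=2)
# Assumed threshold: the largest score seen on the (assumed clean) training data.
threshold = max(anomaly_score(p, centroids) for p in nominal)

def is_anomaly(point):
    return anomaly_score(point, centroids) > threshold

print(is_anomaly((0.2, -0.3)))  # False: close to a normal cluster
print(is_anomaly((5.0, 5.0)))   # True: far from both clusters
```

The threshold is where the false positive problem shows up. Any benign event that happens to look unlike the training data scores above it and raises an alert, and loosening the threshold to silence those alerts lets real attacks through.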
Human Training Boosts Machine Learning
There is a third set of algorithms known as semi-supervised. This set uses some labeled data and some unlabeled data. In a typical case, one of the classes of data is labeled and others are not. For example, the system may be exposed to labeled normal data and other data that may be either normal or attack data. In this scenario, the semi-supervised algorithm tries to find commonalities in the features of the labeled data, then tries to identify and label data from the rest of the data set.
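One common semi-supervised approach is self-training, sketched below in plain Python. Note the assumptions: unlike the one-class scenario just described, this sketch starts with a few labeled examples of both classes, uses a simple nearest-centroid rule as the underlying classifier, and runs on made-up two-feature points. Each round it pseudo-labels the unlabeled point it is most confident about, then folds that point into the labeled set before continuing.

```python
import math

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    """Per-feature mean of a list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def self_train(labeled, unlabeled):
    """Pseudo-label the unlabeled points, most confident first.

    labeled:   dict mapping class name -> list of labeled points (two or more classes)
    unlabeled: list of points with no label
    Returns a dict mapping each unlabeled point to its pseudo-label.
    """
    labeled = {name: list(points) for name, points in labeled.items()}
    remaining = list(unlabeled)
    pseudo_labels = {}
    while remaining:
        centers = {name: centroid(points) for name, points in labeled.items()}

        def margin(p):
            # Confidence: gap between the nearest and second-nearest centroid.
            d = sorted(dist(p, c) for c in centers.values())
            return d[1] - d[0]

        best = max(remaining, key=margin)
        label = min(centers, key=lambda name: dist(best, centers[name]))
        labeled[label].append(best)  # the pseudo-label now shapes later rounds
        pseudo_labels[best] = label
        remaining.remove(best)
    return pseudo_labels

# Hypothetical data: a handful of labels and a larger unlabeled pool.
seed_labels = {"normal": [(1, 1), (2, 1)], "attack": [(9, 9)]}
pool = [(1, 2), (2, 2), (3, 3), (8, 9), (9, 8), (7, 7)]

result = self_train(seed_labels, pool)
for point in pool:
    print(point, result[point])
```

The risk with self-training is that an early wrong pseudo-label pulls a centroid in the wrong direction and later labels inherit the error, which is one reason a human check on the labels is valuable.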
One way to solve the problem of data labeling is to introduce a human into the training loop. People can help the machine learning system by labeling attack data and perhaps normal data as well. In doing so, they teach the algorithm and create labeled data that can be used to train the next iteration of the machine learning system. This technique is extremely powerful but often underutilized because of the push to create the perfect unsupervised machine learner. Yet human expertise may be the only way to teach an effective anomaly detector.
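A human-in-the-loop cycle of this kind can be sketched as follows. This is an illustrative toy, not the design of any particular product: the single numeric feature, the median-based outlier score, the nearest-centroid classifier, the labeling budget, and the `simulated_analyst` function standing in for a real person are all assumptions. Each round, the system ranks unlabeled events, shows the top few to the analyst, and uses the resulting labels in the next round.

```python
import statistics

def centroid(values):
    return sum(values) / len(values)

def attack_score(event, events, labeled):
    """Higher means more attack-like. Supervised once both classes have labels."""
    normal = [e for e, y in labeled.items() if y == 0]
    attack = [e for e, y in labeled.items() if y == 1]
    if normal and attack:
        # Supervised: far from labeled normal, close to labeled attacks.
        return abs(event - centroid(normal)) - abs(event - centroid(attack))
    # Unsupervised fallback: distance from the median of all events.
    return abs(event - statistics.median(events))

def analyst_in_the_loop(events, ask_analyst, budget, rounds):
    """Each round, send the `budget` highest-scoring unlabeled events to the analyst."""
    labeled = {}
    for _ in range(rounds):
        unlabeled = [e for e in events if e not in labeled]
        ranked = sorted(
            unlabeled,
            key=lambda e: attack_score(e, events, labeled),
            reverse=True,
        )
        for event in ranked[:budget]:
            labeled[event] = ask_analyst(event)  # 1 = attack, 0 = normal
    return labeled

def classify(event, labeled):
    """Nearest-centroid prediction from analyst labels (needs both classes labeled)."""
    normal = [e for e, y in labeled.items() if y == 0]
    attack = [e for e, y in labeled.items() if y == 1]
    closer_to_attack = abs(event - centroid(attack)) < abs(event - centroid(normal))
    return 1 if closer_to_attack else 0

# Hypothetical one-feature events; -3.0 is an unusual but benign outlier.
events = [1.0, 1.1, 0.9, 1.2, 0.8, 1.3, 0.7, 6.0, 6.5, 7.0, -3.0]
true_attacks = {6.0, 6.5, 7.0}

def simulated_analyst(event):
    """Stand-in for a human analyst who knows the ground truth."""
    return 1 if event in true_attacks else 0

labeled = analyst_in_the_loop(events, simulated_analyst, budget=2, rounds=3)
found = sorted(e for e, y in labeled.items() if y == 1)
print(found)                    # [6.0, 6.5, 7.0]
print(classify(-3.0, labeled))  # 0: the benign outlier is no longer flagged
```

In this toy run the analyst reviews six of the eleven events. The unsupervised score alone would keep flagging the benign outlier at -3.0, but once the analyst has labeled it normal, the supervised model stops treating it as an attack.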
AI2: Looking Ahead
The goal of a machine learning system such as an anomaly detector should be to augment, not replace, the human analyst. Analysts have to sort through a massive number of supposed attacks, often generated by signature-based detectors, which only identify known attacks. This leads to high rates of missed attacks (false negatives), and attackers simply learn how to avoid the detectors.
Furthermore, to reduce the number of false positives the analyst must reject, detectors are often tuned so that only certain signatures set off alerts. If a signature produces a high false positive rate, it is often noted but not analyzed. A machine learning system should help to eliminate the false positives and indicate attacks not identified by signatures. This would give the analyst a more holistic view of the strikes attempted against a network or host.
AI2, created by authors at MIT and PatternEx, attempts to be just such a system. It is described as an “analyst-in-the-loop” technology designed to improve detection rates by a factor of 3.41 and reduce false positives by a factor of five compared to an unsupervised detector. We’ll examine AI2 more completely in the next part of this series.