With AI2, Machine Learning and Analysts Come Together to Impress, Part 3: The Experiment

This is the third and final installment in a series covering AI2 and machine learning. Be sure to read Part 1 for an introduction to AI2 and Part 2 for background on the algorithms used in the system.

Machine Learning, Human Teaching

The data set researchers Kalyan Veeramachaneni and Ignacio Arnaldo used to produce their paper, “AI2: Training a Big Data Machine to Defend,” is quite impressive. Experiments are too often based on data that is either unrepresentative of the real world or too brief to offer a realistic perspective. The authors evaluated their system on three months’ worth of enterprise platform logs comprising 3.6 billion log lines, or tens of millions a day. This is far more representative of what we would see in real life.

It does raise one question, however: What environment did the data come from? Many of IBM’s customers see millions of attack incidents a day. In the paper, the authors reported that malicious activity accounted for less than 0.1 percent of events, putting the number of malicious attacks they detected in the thousands.

Class Imbalance

The authors detailed this dearth of malicious activity and the so-called “class imbalance” problem that arises when there are far more normal events than malicious ones. While this holds even for large customers, there are still many examples of malicious activity in large enterprises, especially at the border.
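To make the imbalance concrete, one common mitigation — not necessarily the one AI2 uses — is to weight each class inversely to its frequency during training, so the rare malicious examples are not drowned out by the normal majority. A minimal sketch, with hypothetical counts matching the sub-0.1-percent malicious rate described above:

```python
from collections import Counter

def class_weights(labels):
    """Weight each class inversely to its frequency so rare malicious
    examples carry as much total weight as the normal majority."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}

# Hypothetical data: 10 malicious events in 10,000 (a 0.1 percent rate).
labels = ["normal"] * 9990 + ["malicious"] * 10
weights = class_weights(labels)
# Each malicious example now counts roughly 1,000 times as much as a
# normal one, offsetting the 999:1 imbalance in the raw counts.
```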

The paper also analyzed the ratio of normal to malicious users. It is somewhat unusual that the researchers chose users as the unique entity in their analysis — typically, network attacks are measured by IP addresses. This approach is more reminiscent of the DARPA intrusion detection challenge, but on a much grander scale and not nearly as preprocessed. As with most anomaly detectors, there is noise in the normal data.

The Experiment

In section 8.1 of the paper, the authors outlined the types of attacks they looked for. Note that, once again, their unique entities were users, which by necessity narrowed the attack types they could target. This is more complicated because they needed to watch for multistep behaviors: User-level attacks usually involve several actions, one after the other, that the system must spot. They also used IP addresses as a feature here, observing trends in the number of IP addresses linked with the user entities during attacks.
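A feature of that kind can be sketched in a few lines. The log format and helper below are hypothetical illustrations, not the paper's actual pipeline — they simply show how per-user aggregation turns raw log lines into an IP-count feature:

```python
from collections import defaultdict

def per_user_ip_counts(log_lines):
    """Aggregate one simple user-level feature: the number of distinct
    IP addresses each user has been seen from."""
    ips = defaultdict(set)
    for user, ip in log_lines:
        ips[user].add(ip)
    return {user: len(addresses) for user, addresses in ips.items()}

# Invented sample log lines: (user, source IP) pairs.
logs = [
    ("alice", "10.0.0.1"), ("alice", "10.0.0.1"),
    ("bob", "10.0.0.2"), ("bob", "198.51.100.7"), ("bob", "203.0.113.9"),
]
features = per_user_ip_counts(logs)
# "bob" appearing from three addresses is the kind of trend the
# detector can learn to associate with account takeover.
```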

First, they tried to identify account takeover attacks, in which an attacker guesses a user’s credentials to access the system. Even more impressively, the researchers also searched for fraudulent account creation using a stolen credit card, which is extremely difficult to catch.

Lastly, the authors identified terms of service violations. This one is a bit more straightforward in a signature-based system, but it presents challenges in an anomaly detector. In a signature-based system, one can program a set of rules to determine what defines the terms of service. In an anomaly detector, the system must search for different behaviors from a normal user, which might represent a violation.
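The contrast can be sketched in a few lines. Both functions below are illustrative stand-ins, not the paper's implementation: the first hard-codes a hypothetical terms-of-service rule, while the second flags any value that strays too far from a user's own history.

```python
import statistics

# Signature approach: an explicit rule encodes the terms of service.
def violates_tos(event):
    # Hypothetical rule: more than 100 downloads per hour is a violation.
    return event["downloads_per_hour"] > 100

# Anomaly approach: flag behavior that deviates from the user's norm,
# here anything more than k standard deviations from the mean.
def is_anomalous(value, history, k=3.0):
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history) or 1.0  # guard against zero spread
    return abs(value - mean) > k * stdev
```

The signature rule is precise but only catches what it was written for; the anomaly check needs no rule, but it will also fire on benign users who simply behave unusually.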

Hiding in the Noise

Many anomaly detectors are based purely on unsupervised algorithms, which have no access to labels clearly identifying attacks versus normal traffic. The algorithm never knows if it is “right” — it strictly evaluates based on what it sees most of the time, which it calls normal.

There are severe problems with this approach. By carefully introducing malicious traffic in a low and slow manner, the attacker can force a recalibration of what is considered normal. If this happens, the fraudster can then perform attacks with impunity. It is also possible for attackers to hide in the noise. For example, a command-and-control (C&C) protocol that uses typical Transport Layer Security (TLS) traffic may not be flagged as abnormal unless the infected computer communicates too often or too quickly.
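A toy sketch makes the failure mode visible. This is not AI2's detector — it is a simple rolling-window baseline with invented numbers — but it shows how a sudden jump gets flagged while an attacker who drifts upward slowly enough stays inside the band and drags "normal" along:

```python
from collections import deque
import statistics

class RollingDetector:
    """Toy unsupervised detector: 'normal' is whatever the last `window`
    observations looked like, so it can be retrained by slow drift."""

    def __init__(self, window=50, k=3.0):
        self.history = deque(maxlen=window)
        self.k = k

    def observe(self, value):
        flagged = False
        if len(self.history) >= 10:
            mean = statistics.mean(self.history)
            stdev = statistics.pstdev(self.history) or 1.0
            flagged = abs(value - mean) > self.k * stdev
        self.history.append(value)
        return flagged

# Seed both detectors with noisy "normal" traffic centered on 10.
sudden = RollingDetector()
slow = RollingDetector()
for v in [9.0, 11.0] * 25:
    sudden.observe(v)
    slow.observe(v)

jump_flagged = sudden.observe(30.0)  # a sudden jump is caught: True

# Low and slow: drift upward 0.05 per event, keeping the same jitter.
caught = False
for step in range(400):
    jitter = 1.0 if step % 2 == 0 else -1.0
    caught |= slow.observe(10.0 + step * 0.05 + jitter)

drift_flagged = slow.observe(30.0)
# caught is False: no single step of the drift was ever flagged,
# and drift_flagged is False: a value of 30 now looks normal.
```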

The authors tested the use of labeled data from the past. This is a reasonable test, since the enterprise may have had logs that had already been filtered by their security operations center (SOC) analysts. This can happen if an enterprise wants to store the logs for trend analysis, for example. In this case, however, the enterprise may only keep attack data, leading to the flip side of the aforementioned class imbalance problem: There are more maliciously labeled examples than normal ones. The labeled data may also have noise in it, meaning there could be misidentified examples in the data.

The Results

As for the results, Figure 11 in the paper showed a graphical view of just how well the system did. Having historically labeled data definitely helps bootstrap the system. With no historical data, the system detected 143 of 318 total attacks. With historical data, it found 211. As the active model is continuously trained, it will keep improving.

This demonstrates the importance of domain expertise in the system’s feedback. Unlike many unsupervised anomaly detectors, the system gets better with time as long as there are experts to help teach it. The system is not meant to solve the problem by itself, but rather to learn from the labeled examples provided by the SOC analysts. In fact, the authors claimed that at the end of the 12 weeks, the performance with and without historical data was the same.

Finally, the authors reported that the system with no historical data performed 3.41 times better than the unsupervised detector and reduced false positives fivefold. This means that analysts can focus on, say, 200 events per day instead of a thousand. This is quite an improvement in efficiency.
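A quick sanity check on those figures, using only the numbers quoted above:

```python
# Headline detection numbers reported in the paper.
total_attacks = 318
detected_cold = 143   # no historical labels
detected_warm = 211   # bootstrapped with historical labels

recall_cold = detected_cold / total_attacks   # about 0.45
recall_warm = detected_warm / total_attacks   # about 0.66

# A fivefold false-positive reduction turns the illustrative 1,000-event
# daily review queue into a 200-event one.
reduced_queue = 1000 // 5
```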

The technique shows very real promise and emphasizes the usefulness of domain knowledge in machine learning analysis. Machine learning can’t be the only tool in the arsenal — it needs human oversight to succeed.


Brad Harris

Security Researcher, IBM X-Force

Brad has worked in the network and computer security field in both the public and private sectors. He has done...