December 19, 2016 By Brad Harris 4 min read

This is the third and final installment in a series covering AI2 and machine learning. Be sure to read Part 1 for an introduction to AI2 and Part 2 for background on the algorithms used in the system.

Machine Learning, Human Teaching

The data set researchers Kalyan Veeramachaneni and Ignacio Arnaldo used to produce their paper, “AI2: Training a Big Data Machine to Defend,” is quite impressive. Experiments are too often based on data that is either unrepresentative of the real world or too brief to offer a realistic perspective. The authors claimed they evaluated their system on three months’ worth of enterprise platform logs with 3.6 billion log lines, working out to millions a day. This is far more representative of what we would see in real life.

It does raise one question, however: What environment did the data come from? Many of IBM’s customers see millions of attack incidents a day. In the paper, the authors claimed a value of less than 0.1 percent, putting the number of malicious attacks they detected in the thousands.

Class Imbalance

The authors detailed a dearth of malicious activity and the so-called “class imbalance” problem that arises when there are far more normal events than malicious events. While this is true even with large customers, there are many examples of malicious activity in large enterprises, especially at the border.

The paper explained the analysis of the ratio of normal to malicious users. It is somewhat unusual that the researchers chose to designate users as the unique entity in their analysis — typically, network attacks are measured by IP addresses. This approach is more reminiscent of the DARPA intrusion detection challenge, but on a much grander scale and not nearly as preprocessed. As with most anomaly detectors, there is noise in the normal data.

The Experiment

In section 8.1 of the paper, the authors outlined the types of attacks they look for. Note that, once again, their unique entities were users, which focused their attack types by necessity. This is more complicated, since they needed to watch for multistep behaviors. User-level attacks usually involve several actions, one after the other, that the system must spot. Also note that they also used IP addresses as a feature here, observing trends in attacks involving the number of IP addresses linked with the user entities.

First, they tried to identify account takeover attacks. This generally involves an attacker guessing the credentials of a user to access the system. Even more impressive, the researchers also searched for fraudulent account creation using a stolen credit card, which is extremely difficult to catch.

Lastly, the authors identified terms of service violations. This one is a bit more straightforward in a signature-based system, but it presents challenges in an anomaly detector. In a signature-based system, one can program a set of rules to determine what defines the terms of service. In an anomaly detector, the system must search for different behaviors from a normal user, which might represent a violation.

Hiding in the Noise

Many anomaly detectors are based purely on unsupervised algorithms, which have no access to clearly identified attacks verses normal labels. The algorithm never knows if it is “right” — it strictly evaluates based on what it sees most of the time, which it calls normal.

There are severe problems with this approach. By carefully introducing malicious traffic in a low and slow manner, the attacker can force a recalibration of what is considered normal. If this happens, the fraudster can then perform attacks with impunity. It is also possible for attackers to hide in the noise. For example, a command-and-control (C&C) protocol that uses typical Transport Layer Security (TLS) traffic may not be flagged as abnormal unless the infected computer does this often and too quickly.

The authors tested the use of labeled data from the past. This is a reasonable test, since the enterprise may have had logs that had already been filtered by their security operations center (SOC) analysts. This can happen if an enterprise wants to store the logs for trend analysis, for example. In this case, however, the enterprise may only keep attack data, leading to the flip side of the aforementioned class imbalance problem: There are more maliciously labeled examples than normal ones. The labeled data may also have noise in it, meaning there could be misidentified examples in the data.

The Results

As for the results, Figure 11 in the paper showed a graphical view of just how well the system did. Having historically labeled data definitely helps bootstrap the system. With no historical data, the system detected 143 of 318 total attacks. With historical data, it found 211. As the active model is continuously trained, the model will improve.

This demonstrates the importance of the domain expertise in the system’s feedback. Unlike many unsupervised anomaly detectors, the system gets better with time as long as there are experts to help teach it. The system is not meant to solve the problem by itself, but rather learn from the labeled examples provided by the SOC analysts. In fact, the authors claimed that at the end of the 12 weeks, the performance with and without historical data was the same.

Finally, the authors reported that the system with no historical data performed 3.41 times better than the unsupervised detector and reduced false positives fivefold. This means that analysts can focus on, say, 200 events per day instead of a thousand. This is quite an improvement in efficiency.

The technique shows very real promise and emphasizes the usefulness of domain knowledge in machine learning analysis. Machine learning can’t be the only tool in the arsenal — it needs human oversight to succeed.

More from Artificial Intelligence

Autonomous security for cloud in AWS: Harnessing the power of AI for a secure future

3 min read - As the digital world evolves, businesses increasingly rely on cloud solutions to store data, run operations and manage applications. However, with this growth comes the challenge of ensuring that cloud environments remain secure and compliant with ever-changing regulations. This is where the idea of autonomous security for cloud (ASC) comes into play.Security and compliance aren't just technical buzzwords; they are crucial for businesses of all sizes. With data breaches and cyber threats on the rise, having systems that ensure your…

Cybersecurity Awareness Month: 5 new AI skills cyber pros need

4 min read - The rapid integration of artificial intelligence (AI) across industries, including cybersecurity, has sparked a sense of urgency among professionals. As organizations increasingly adopt AI tools to bolster security defenses, cyber professionals now face a pivotal question: What new skills do I need to stay relevant?October is Cybersecurity Awareness Month, which makes it the perfect time to address this pressing issue. With AI transforming threat detection, prevention and response, what better moment to explore the essential skills professionals might require?Whether you're…

3 proven use cases for AI in preventative cybersecurity

3 min read - IBM’s Cost of a Data Breach Report 2024 highlights a ground-breaking finding: The application of AI-powered automation in prevention has saved organizations an average of $2.2 million.Enterprises have been using AI for years in detection, investigation and response. However, as attack surfaces expand, security leaders must adopt a more proactive stance.Here are three ways how AI is helping to make that possible:1. Attack surface management: Proactive defense with AIIncreased complexity and interconnectedness are a growing headache for security teams, and…

Topic updates

Get email updates and stay ahead of the latest threats to the security landscape, thought leadership and research.
Subscribe today