This is the third and final installment in a series covering AI2 and machine learning. Be sure to read Part 1 for an introduction to AI2 and Part 2 for background on the algorithms used in the system.

Machine Learning, Human Teaching

The data set researchers Kalyan Veeramachaneni and Ignacio Arnaldo used to produce their paper, “AI2: Training a Big Data Machine to Defend,” is quite impressive. Experiments are too often based on data that is either unrepresentative of the real world or too brief to offer a realistic perspective. The authors evaluated their system on three months’ worth of enterprise platform logs: 3.6 billion log lines, which works out to roughly 40 million a day. This is far more representative of what we would see in real life.

It does raise one question, however: What environment did the data come from? Many of IBM’s customers see millions of attack incidents a day. In the paper, the authors reported a malicious rate of less than 0.1 percent, putting the number of malicious attacks they detected in the thousands.

Class Imbalance

The authors detailed a dearth of malicious activity and the so-called “class imbalance” problem that arises when there are far more normal events than malicious events. While this is true even with large customers, there are many examples of malicious activity in large enterprises, especially at the border.
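One common mitigation for this class imbalance is to weight the rare class more heavily during training. Below is a minimal sketch of inverse-frequency class weighting, using made-up event counts purely for illustration (the paper does not specify this exact scheme):

```python
from collections import Counter

# Hypothetical event labels: "normal" vastly outnumbers "malicious".
labels = ["normal"] * 9990 + ["malicious"] * 10

counts = Counter(labels)
total = len(labels)

# Inverse-frequency class weights: the rarer class gets a larger weight,
# so a classifier is not rewarded for simply predicting "normal" always.
weights = {cls: total / (len(counts) * n) for cls, n in counts.items()}

print(weights["normal"])     # ~0.5005
print(weights["malicious"])  # 500.0
```

With weights like these, a single missed malicious event costs the model as much as misclassifying roughly a thousand normal ones.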

The paper analyzed the ratio of normal to malicious users. It is somewhat unusual that the researchers chose to designate users as the unique entity in their analysis — typically, network attacks are measured by IP addresses. This approach is more reminiscent of the DARPA intrusion detection challenge, but on a much grander scale and not nearly as preprocessed. As with most anomaly detectors, there is noise in the normal data.

The Experiment

In section 8.1 of the paper, the authors outlined the types of attacks they looked for. Note that, once again, their unique entities were users, which focused their attack types by necessity. This complicates detection, since user-level attacks usually involve multistep behaviors: several actions, one after the other, that the system must spot as a sequence. They also used IP addresses as a feature here, observing trends such as the number of IP addresses linked with a given user entity.
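To make this concrete, here is a minimal sketch of per-user feature aggregation from parsed log lines. The event tuples and field names are hypothetical, not taken from the paper, but the distinct-IP count illustrates the kind of user-level signal described above:

```python
from collections import defaultdict

# Hypothetical parsed log lines: (user, action, source IP)
events = [
    ("alice", "login_fail", "10.0.0.1"),
    ("alice", "login_fail", "10.0.0.2"),
    ("alice", "login_ok",   "10.0.0.3"),
    ("bob",   "login_ok",   "10.0.0.9"),
]

# Aggregate simple per-user features; a high distinct-IP count paired
# with repeated failures is one multistep account-takeover signal.
features = defaultdict(lambda: {"events": 0, "fails": 0, "ips": set()})
for user, action, ip in events:
    f = features[user]
    f["events"] += 1
    f["fails"] += action == "login_fail"
    f["ips"].add(ip)

for user, f in features.items():
    print(user, f["events"], f["fails"], len(f["ips"]))
```

Here “alice” logs in from three different IP addresses with two failures along the way, the sort of sequence a single-event detector would miss.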

First, they tried to identify account takeover attacks, in which an attacker guesses a user’s credentials to access the system. Even more impressively, the researchers also searched for fraudulent account creation using stolen credit cards, which is extremely difficult to catch.

Lastly, the authors identified terms of service violations. This one is a bit more straightforward in a signature-based system, but it presents challenges in an anomaly detector. In a signature-based system, one can program a set of rules to determine what defines the terms of service. In an anomaly detector, the system must search for different behaviors from a normal user, which might represent a violation.
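The signature-based side of that contrast is easy to picture: a terms-of-service check is just a set of explicit, hand-coded rules. The thresholds and action names below are invented for illustration, not drawn from the paper:

```python
# A signature-style terms-of-service check is a set of hard-coded rules.
# Both the threshold and the banned-action list are illustrative only.
MAX_REQUESTS_PER_MIN = 60
BANNED_ACTIONS = {"bulk_scrape", "resell_data"}

def violates_tos(requests_per_min, actions):
    """Flag a user if any explicit rule is broken."""
    if requests_per_min > MAX_REQUESTS_PER_MIN:
        return True
    return bool(BANNED_ACTIONS & set(actions))

print(violates_tos(10, ["search", "download"]))  # False
print(violates_tos(500, ["search"]))             # True
print(violates_tos(10, ["bulk_scrape"]))         # True
```

An anomaly detector has no such rulebook; it can only notice that a user’s behavior has drifted away from the normal profile and let an analyst decide whether that drift is a violation.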

Hiding in the Noise

Many anomaly detectors are based purely on unsupervised algorithms, which have no access to labels clearly identifying attack versus normal traffic. The algorithm never knows if it is “right” — it strictly evaluates based on what it sees most of the time, which it calls normal.
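A toy sketch makes the point. The simple z-score detector below (with made-up activity counts, and not the paper’s actual method) has no labels at all; “normal” is simply whatever dominates the data:

```python
import statistics

# Unlabeled per-user activity counts; the detector never sees labels.
activity = [12, 9, 11, 10, 13, 10, 11, 95]  # one obvious outlier

mean = statistics.mean(activity)
stdev = statistics.pstdev(activity)

# Flag anything more than 2 standard deviations from the mean.
# "Normal" here is just whatever most of the data looks like.
anomalies = [x for x in activity if abs(x - mean) / stdev > 2]
print(anomalies)  # [95]
```

The detector flags 95 only because it is rare, not because it is known to be malicious; a rare-but-benign event would be flagged just the same.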

There are severe problems with this approach. By carefully introducing malicious traffic in a low and slow manner, the attacker can force a recalibration of what is considered normal. If this happens, the fraudster can then perform attacks with impunity. It is also possible for attackers to hide in the noise. For example, a command-and-control (C&C) protocol that uses typical Transport Layer Security (TLS) traffic may not be flagged as abnormal unless the infected computer does this often and too quickly.
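The recalibration attack can be sketched in a few lines. This toy detector (a sliding-window average with an illustrative threshold, not anything from the paper) is gradually poisoned until a once-obvious attack blends in:

```python
def is_anomalous(value, window):
    """Flag traffic well above the recent average (threshold is illustrative)."""
    baseline = sum(window) / len(window)
    return value > 1.5 * baseline

window = [10.0] * 50             # stable "normal" traffic volume
print(is_anomalous(30, window))  # True: 30 stands out against the baseline

# Low and slow: each injected value sits just under the threshold,
# so the detector absorbs it into its notion of normal.
for _ in range(200):
    baseline = sum(window) / len(window)
    window.append(1.4 * baseline)  # accepted, then treated as normal
    window.pop(0)                  # sliding window recalibrates

print(is_anomalous(30, window))  # False: the baseline has drifted upward
```

After 200 patient steps, the baseline has drifted far enough that the same burst of 30 no longer trips the detector, which is exactly the "attack with impunity" scenario described above.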

The authors tested the use of labeled data from the past. This is a reasonable test, since the enterprise may have had logs that had already been filtered by their security operations center (SOC) analysts. This can happen if an enterprise wants to store the logs for trend analysis, for example. In this case, however, the enterprise may only keep attack data, leading to the flip side of the aforementioned class imbalance problem: There are more maliciously labeled examples than normal ones. The labeled data may also have noise in it, meaning there could be misidentified examples in the data.

The Results

As for the results, Figure 11 in the paper showed a graphical view of just how well the system did. Having historically labeled data definitely helps bootstrap the system. With no historical data, the system detected 143 of 318 total attacks (about 45 percent). With historical data, it found 211 (about 66 percent). As the active model is continuously trained, it will keep improving.

This demonstrates the importance of the domain expertise in the system’s feedback. Unlike many unsupervised anomaly detectors, the system gets better with time as long as there are experts to help teach it. The system is not meant to solve the problem by itself, but rather learn from the labeled examples provided by the SOC analysts. In fact, the authors claimed that at the end of the 12 weeks, the performance with and without historical data was the same.

Finally, the authors reported that the system with no historical data performed 3.41 times better than the unsupervised detector and reduced false positives fivefold. This means that analysts can focus on, say, 200 events per day instead of a thousand. This is quite an improvement in efficiency.

The technique shows very real promise and emphasizes the usefulness of domain knowledge in machine learning analysis. Machine learning can’t be the only tool in the arsenal — it needs human oversight to succeed.
