# With AI2, Machine Learning and Analysts Come Together to Impress, Part 2: The Algorithms

*This is the second installment in a three-part series covering AI2 and machine learning. Be sure to read* *Part 1* *for an introduction to AI2.*

AI2 is an “analyst-in-the-loop” system, meaning that it exploits the expertise of a security analyst to improve itself. A “human-in-the-loop” system is used to generate more supervised examples for the machine learning stage to use in an iterative training algorithm. This is exactly what AI2 does, allowing feedback to make the machine gradually smarter in the security domain.

## AI2: All About Algorithms

According to the paper “AI2: Training a big data machine to defend,” the system consists of four components:

- A big data behavioral analytics platform;
- An ensemble of outlier detection methods;
- A mechanism to obtain feedback from security analysts; and
- A supervised learning module.

According to the paper, the system mines logs from web servers and firewalls. The web logs are to prevent web attacks while specifically indicating the use of firewall logs to stop data exfiltration. This is an interesting use case and is often difficult to detect. There are extremely sophisticated ways of exfiltrating data, many of which would not be noted in firewall logs. The use of covert channels can be nearly impossible to detect.

It would be easy to extrapolate this system’s consumption of intrusion prevention system (IPS) logs or other log sources in addition to the logs described above. This would allow for the assistance of analysts in a more traditional configuration as well. As for sources of unknown attacks, IT managers could use a protocol decoding engine, which would allow other protocols to be monitored. This analysis could lead to a much more thorough attack discovery system. However, these are items for future work.

The paper’s authors describe two metrics for their data: the size of the raw data set, which in this case over a three month period was 3.6 billion log lines, and the number of unique entities per day, which represents the number of unique IP addresses, users or sessions. The data here involves hundreds of thousands of unique users per day.

## Big Data Behavioral Analytics

The authors, using their domain knowledge, chose users as the analytical unique entity and computed 24 features on a daily basis from the user data. The design goals for the big data platform were to support the analysis of 10 million or more users and to be able to retrieve behavioral signatures for 50,000 entities at a time. A behavioral signature is the series of steps required to perform an attack and may involve several related variables.

The big data system scales by first looking at the log streams, extracting the entities from it and then updating activity records, which are summaries of activities over a very short period of time. In the AI2 platform, the time window is one minute. Then the activity aggregator consolidates the activity records over a larger time window, such as a day. From this, the entity-feature matrix can be populated. The entity-feature matrix is exactly what it sounds like: a matrix with entities in one dimension, and a feature vector in the other. A novel way to accelerate performance is to keep activity records, which overlap (i.e., by the minute, hour, day and week). This reduces the number of records required to compute features for any arbitrary amount of time.

## Outlier Ensemble Detector

The paper describes the ensemble use of three outlier detectors. Outlier detectors are, in this case, equivalent to unsupervised detectors. Outliers are representative of possible attacks in these algorithms. Note that ensemble unsupervised detectors are rare for two reasons.

First, it is difficult to compare the answers given by two detectors because there is no consistent unit of measurement between many detectors. Some algorithms use the same distance metric, but even then, the relative scores are hard to compare. Second, there is no ground truth in the unsupervised case in general. The challenge is to determine just how confident one can be in the answer given by each unsupervised detector. Without ground truth, this is extremely problematic.

The first detector the authors used was Principal Component Analysis (PCA), which is often used as a feature reduction technique, eliminating features that have a lower impact on the final outcome. This method is a matrix decomposition method that looks for significant differences in correlation between the decomposed matrix and the reconstruction of the input matrix using only the principal components. Choosing the first principal components (primary features) to deconstruct the matrix, then using only those components to attempt to reconstruct the matrix, creates a condition where the reconstruction error is low for the normal cases and high for the outliers. The authors used this reconstruction error as the outlier score for the PCA method.

The authors also employed a replicator neural network (RNN), also known as an autoencoder. They used a feed-forward neural net that has three hidden layers, the second of which contains half as many neurons as the first and third, to create a nonlinear transformation of the data, which compresses the feature space. The network then decompresses it, creating a reconstruction. Just as with the PCA method, the reconstruction error is used as the outlier score. Just as in PCA, because the RNN was trained to minimize the reconstruction error, the higher error scores represent the outliers.

Finally, the authors used a density-based algorithm. Density-based methods are usually clustering techniques. In this case, examples are used to form a multivariate joint probability distribution. Where probabilities are less dense, there are typically outliers — that is, high-density regions are typically representative of normal data, whereas low-density regions indicate abnormal data. Once again, this approach is driven by the observation that attacks are rare and, therefore, form low-density areas of the distribution. The outlier score is the probability density at a given point of the distribution.

## The Feedback Loop and the Supervised Learning Module

At a high level, the authors first trained the supervised and unsupervised components, evaluated a day’s worth of data, displayed highly scored outliers and then collected feedback from the analyst. They then repeated this procedure, each time training a new model for the next day. Because the system only displays a small fraction of the total number of events, the analyst can focus on what are hopefully events more likely to be attacks. Note that these outliers are not necessarily malicious, they are just different from the rest of the events. This is why analyst feedback is required. The results of this feedback are used to increase the amount of labeled data for a supervised model.

The paper describes the potential benefit of categorizing attacks and, subsequently, creating different models per category. While this is interesting, the extra dependence on a category selection algorithm may introduce extra noise into the system, as the algorithm is bound to produce some errors. However, more specific, supervised models should produce better results with attack prediction.