This is the second installment in a three-part series about machine learning. Be sure to read part one for the full story.
When designing machine learning solutions for security, it’s important to decide on a classifier that will perform the best with minimal error. Given the sheer number of choices available, it’s easy to get confused. Let’s explore some tips to help security leaders select the right machine learning algorithm for their needs.
4 Types of Machine Learning Techniques
If you know which category your problem falls into, you can narrow down your choices. Machine learning algorithms are broadly categorized under four types of learning problems.
1. Supervised Learning
Supervised learning trains the algorithm based on example sets of input/output pairs. The goal is to develop new inferences based on patterns inferred from the sample results. Sample data must be available and labeled. For example, designing a spam detector model by learning from samples of labeled spam/nonspam is supervised learning. Another example is a problem such as the Defense Advanced Research Projects Agency (DARPA)’s Knowledge Discovery and Data Mining 1999 challenge (KDD-99), in which contestants competed to design a machine learning-based intrusion detection system (IDS) from a set of 41 features per instance labeled either “attack” or “normal.”
2. Unsupervised Learning
Unsupervised learning uses data that has not been labeled, classified or categorized. The machine is challenged to identify patterns through processes such as clustering, and the outcome is usually unknown. Clustering is a task in which samples are compared with each other in an attempt to find examples that are close to each other, usually by either a measure of density or a distance metric, such as Euclidian distance when projected into a high-dimensional space.
A security problem that falls into this category is network anomaly detection, which is a different method of designing an IDS. In this case, the algorithm doesn’t assume that it knows an attack from a normal input. Instead, the algorithm tries to understand what normal traffic is by watching the network in a (hopefully) clean state so that it can learn the patterns of traffic. Then, anything that falls outside of this “normal” region is a possible attack. Note that there is a great deal of uncertainty with these algorithms because they do not actually know what an attack looks like.
3. Semisupervised Learning
This type of learning uses a combination of labeled and unlabeled data, typically with the majority being unlabeled. It is primarily used to provide some concept of a known classification to unsupervised algorithms. Several techniques, such as label spreading and weakly supervised learning, can also be employed to augment a supervised training set with a small number of samples. This requires a great deal of work because it is an extremely common scenario.
For the challenge of exploit kit identification, for example, we can find some known exploit kits to train our model, but there are many variants and unknown kits that can’t be labeled. Semisupervised learning can help solve this problem. Note that semisupervised and fully supervised learning can often be differentiated by the choice of features to learn against. Depending on the features, you can often label much more data than you can with another selected set of features.
4. Reinforcement Learning
Unlike the other three types of learning problems, reinforcement learning seeks the optimal path to a desired result by rewarding improvement. The problem set is generally small and the training data well-understood. An example is a generative adversarial network (GAN) such as this experiment, in which the distance, measured in correct and incorrect bits, was used as a loss function to encrypt messages between two neural networks.
Another example is PassGAN, where a GAN was trained to guess passwords more efficiently and the reinforcement function, at a very high level, was how many passwords it guessed correctly. Specifically, PassGAN learned rules to take a dictionary and create likely passwords based on an actual set of leaked passwords. It modeled human behavior in guessing the ways that humans transformed passwords into nondictionary character strings.
Problem Type and Training Data
Another, broader categorization of algorithm is based on the type of problem, such as classification, regression, anomaly detection or dimensionality reduction. There are specific machine learning algorithms designed for each type of problem.
Once you have narrowed down your choices based on the above broader categories, you should consider the algorithm’s bias or variance.
Bias is defined in practical effect as the ability of an algorithm to model the training data accurately. Bias represents the proximity of the model to the training data. A high bias results in the model missing relationships between features and the target variable, which leads to what is known as underfitting.
Variance is how accurately the model measures noise in the distribution of training samples. A high variance means that random variables are captured more effectively. Obviously, we do not want this either. Therefore, we seek to minimize both of these values. We can control bias and variance through different parameters. For example, the k-means algorithm with a high value of k gives us high bias and low variance. This makes sense because the value of k means that we have allowed more clusters in space to form, and more of the relationships among the features will be missed as the algorithm fits the data to a larger number of clusters.
A deeper decision tree has more variance because it contains more branches (decision points) and therefore has more false relationships that model noise in the training data. For artificial neural networks (ANNs), variance increases and bias decreases with the increase in the number of layers. Therefore, deep learning has a very low bias. However, this is at the cost of more noise being represented in the deep learning models.
Other Considerations for Selecting Machine Learning Algorithms
Data type also dictates the choice of algorithm because some algorithms work better on certain data types than others. For example, support vector machines (SVMs), linear and logistic regression, and neural networks require the feature vector to be numerical. On the other hand, decision trees can be more flexible to different data types, such as nominal input.
Some algorithms perform poorly when there is correlation between the features — meaning that multiple features demonstrate the same patterns. A feature that is defined as the result of calculations based on other features would be highly correlated with those input features. Linear regression and logistic regression, along with other algorithms, require regularization to avoid numerical instabilities that come from redundancy in data.
The relationship between independent and dependent variables can also help us determine which algorithm to choose. Because naive Bayes, linear regression and logistic regression perform well if each feature has an independent contribution to the output, these algorithms are poor choices for correlated data. Unfortunately, when features are extracted from voluminous data, it is often impossible to know whether the independence assumption holds true without further analysis. If the relationship is more complex, decision trees or neural networks may be a better choice.
Noise in the output values can also inform which algorithm to choose, as well as the parameters for your selected algorithm. However, you are best served by cleaning the data of noise. If you can’t clean the data, you will most likely receive results with poor classification boundaries. An SVM, for example, is very sensitive to noise because it attempts to draw a margin, or boundary, between or among the classes of the data as they are labeled. A decision tree or bagged tree such as a random forest might be a better choice here since these allow for the tree(s) to be pruned. A shallower tree models less noise.
Dimensionality of the input space is critical to your decision as well. In dimensionality, there are two common phenomena that are directly at odds with each other. First, the so-called curse of dimensionality occurs when the data contains a very large number of features. If you think about this in special terms, a two-dimensional figure verses a three-dimensional figure can look very different. A very packed set of points on a plane can become spread apart once we add the third dimension. If we try to cluster these points, the distances between two arbitrary points will dramatically increase.
Alternatively, the blessing of dimensionality means that having more dimensions helps us model the problem more completely. In this sense, the plane may pack completely unrelated points together because the two-dimensional coordinates are close to each other even though they have nothing to do with each other. In three dimensions, these points might be spread farther apart, showing us a more complete separation of the unrelated points.
These two ideas cause us, as domain experts and deep learning designers, to pick our features carefully. We do not want to encounter the sparsity of the curse of dimensionality because there might not be a clear boundary, but we also do not want to pack our unrelated examples too close to each other and create an unusable boundary. Some algorithms work better in higher dimensions. In very high dimensional space, for example, you might choose a boosted random forest or a deep learning model.
Transparency and Functional Complexity
The level of visibility into the model’s decision process is a very important criterion for selecting an algorithm. Algorithms that provide decision trees show clearly how the model reached a decision, whereas a neural network is essentially a black box.
Similarly, you should consider functional complexity and the amount of training data. Functional complexity can be understood in terms of things like speed and memory usage. If you lack sufficient computational power or memory in the machine on which your model will be run, you should choose an algorithm such as naive Bayes or one of the many rules generator-based algorithms. Naive Bayes counts the frequencies of terms, so its model is only as big as the number of features in the data. Rules generators essentially create a series of if/then conditions that the data must satisfy to be classified correctly. These are very good for low-complexity devices. If, however, you have a good deal of power and memory, you might go so far as deep learning, which (in most but not all configurations) requires many more resources.
How Much Training Data Do You Need?
You might also opt for naive Bayes if you have a smaller set of training data available. A small training data size will severely limit you, so it is always best to acquire as much data as you can. A few hundred might be able to construct a shallow decision tree or a naive Bayesian model. A few thousand to tens of thousands is usually enough to create an effective random forest, an algorithm that performs very well in most learning problems. Finally, if you want to try deep learning, you may need hundreds of thousands — or, preferably, millions — of training examples, depending on how many layers you want your deep learning model to have.
Try Several Classifiers and Pick the Best
When all else fails, try several classifiers that are suitable for your problem and then compare the results. Key metrics include accuracy of the model and true and false positive rates; derived metrics such as precision, recall and F1; and even metrics as complex as area under the receiver operating characteristic curve (AUC) and area under the precision-recall curve (AUPRC). Each measurement tells you something different about the model’s performance. We’ll define these metrics in greater detail in the final installment of our three-part series.
Watson for Cybersecurity Researcher
Security Researcher, IBM X-Force