Artificial intelligence (AI) brings a powerful new set of tools to the fight against threat actors, but choosing the right combination of libraries, test suites and trading models when building AI security systems is highly dependent on the situation. If you’re thinking about adopting AI in your security operations center (SOC), the following questions and considerations can help guide your decision-making.
What Problem Are You Trying to Solve?
Spam detection, intrusion detection, malware detection and natural language-based threat hunting are all very different problem sets that require different AI tools. Begin by considering what kind of AI security systems you need.
Understanding the outputs helps you test data. Ask yourself whether you’re solving a classification or regression problem, building a recommendation engine or detecting anomalies. Depending on the answers to those questions, you can apply one of four basic types of machine learning:
- Supervised learning trains an algorithm based on example sets of input/output pairs. The goal is to develop new inferences based on patterns inferred from the sample results. Sample data must be available and labeled. For example, designing a spam detection model by learning from samples labeled spam/nonspam is a good application of supervised learning.
- Unsupervised learning uses data that has not been labeled, classified or categorized. The machine is challenged to identify patterns through processes such as cluster analysis, and the outcome is usually unknown. Unsupervised machine learning is good at discovering underlying patterns and data, but is a poor choice for a regression or classification problem. Network anomaly detection is a security problem that fits well in this category.
- Semisupervised learning uses a combination of labeled and unlabeled data, typically with the majority being unlabeled. It is primarily used to improve the quality of training sets. For exploit kit identification problems, we can find some known exploit kits to train our model, but there are many variants and unknown kits that can’t be labeled. We can use semisupervised learning to address the problem.
- Reinforcement learning seeks the optimal path to a desired result by continually rewarding improvement. The problem set is generally small, and the training data well-understood. An example of reinforcement learning is a generative adversarial network (GAN), such as this experiment from Cornell University in which distance, measured in the form of correct and incorrect bits, is used as a loss function to encrypt messages between two neural networks and avoid eavesdropping by an unauthorized third neural network.
Artificial Intelligence Depends on Good Data
Machine learning is predicated on learning from data, so having the right quantity and quality is essential. Security leaders should ask the following questions about their data sources to optimize their machine learning deployments:
- Is there enough data? You’ll need a sufficient amount to represent all possible scenarios that a system will encounter.
- Does the data contain patterns that machine learning systems can learn from? Good data sets should have frequently recurring values, clear and obvious meanings, few out-of-range values and persistence, meaning that they change little over time.
- Is the data sparse? Are certain expected values missing? This can create misleading results.
- Is the data categorical or numeric in nature? This dictates the choice of the classifier we can use.
- Are labels available?
- Is the data current? This is particularly important in AI security systems because threats change so quickly. For example, a malware detection system that has been trained on old samples will have difficulty detecting new malware variations.
- Is the source of the data trusted? You don’t want to train your model from publicly available data of origins you don’t trust. Data sample poisoning is just one attack vector through which machine learning-based security models are compromised.
Choosing the Right Platforms and Tools
There is a wide variety of platforms and tools available on the market, but how do you know which is the right one for you? Ask the following questions to help inform your choice:
- How comfortable are you in a given language?
- Does the tool integrate well with your existing environment?
- Is the tool well-suited for big data analytics?
- Does it provide built-in data parsing capabilities that enable the model to understand the structure of data?
- Does it use a graphical or command-line interface?
- Is it a complete machine learning platform or just a set of libraries that you can use to build models? The latter provides more flexibility, but also has a steeper learning curve.
What About the Algorithm?
You’ll also need to select an algorithm to employ. Try a few different algorithms and compare to determine which delivers the most accurate results. Here are some factors that can help you decide which algorithm to start with:
- How much data do you have, and is it of good quality? Data with many missing values will deliver lower-quality results.
- Is the learning problem supervised, unsupervised or reinforcement learning? You’ll want to match the data set to the use case as described above.
- Determine the type of problem being solved, such as classification, regression, anomaly detection or dimensionality reduction. There are different AI algorithms that work best for each type of problem.
- How important is accuracy versus speed? If approximations are acceptable, you can get by with smaller data sets and lower-quality data. If accuracy is paramount, you’ll need higher quality data and more time to run the machine learning algorithms.
- How much visibility do you need into the process? Algorithms that provide decision trees show you clearly how the model reached a decision, while neural networks are a bit of a black box.
How to Train, Test and Evaluate AI Security Systems
Training samples should be constantly updated as new exploits are discovered, so it’s often necessary to perform training on the fly. However, training in real time opens up the risk of adversarial machine learning attacks in which bad actors attempt to disrupt the results by introducing misleading input data.
While it is often impossible to perform training offline, it is desirable to do so when possible so the quality of the data can be regulated. Once the training process is complete, the model can be deployed into production.
One common method of testing trained models is to split the data set and devote a portion of the data — say, 70 percent — to training and the rest to testing. If the model is robust, the output from both data sets should be similar.
Choice of evaluation metrics also depends on the type of problem you’re trying to solve. For example, a regression problem tries to find the range of error between the actual value and the predicted value, so the metrics you might use include mean absolute error, root mean absolute error, relative absolute error and relative squared error.
For a classification problem, the objective is to determine which categories new observations belong in — which requires a different set of quality metrics, such as accuracy, precision, recall, F1 score and area under the curve (AUC).
Deployment on the Cloud or On-Premises?
Lastly, you’ll need to select a location for deployment. Cloud machine learning platforms certainly have advantages, such as speed of provisioning, choice of tools and the availability of third-party training data. However, you may not want to share data in the cloud for security and compliance reasons. Consider these factors before choosing whether to deploy on-premises or in a public cloud.
These are just a few of the many factors to consider when building security systems with artificial intelligence. Remember, the best solution for one organization or security problem is not necessarily the best solution for everyone or every situation.