Artificial intelligence (AI) brings a powerful new set of tools to the fight against threat actors, but choosing the right combination of libraries, test suites and training models when building AI security systems is highly dependent on the situation. If you’re thinking about adopting AI in your security operations center (SOC), the following questions and considerations can help guide your decision-making.

What Problem Are You Trying to Solve?

Spam detection, intrusion detection, malware detection and natural language-based threat hunting are all very different problem sets that require different AI tools. Begin by considering what kind of AI security systems you need.

Understanding the outputs you need helps you frame the problem and the data required to solve it. Ask yourself whether you’re solving a classification or regression problem, building a recommendation engine or detecting anomalies. Depending on the answers to those questions, you can apply one of four basic types of machine learning:

  1. Supervised learning trains an algorithm based on example sets of input/output pairs. The goal is to develop new inferences based on patterns inferred from the sample results. Sample data must be available and labeled. For example, designing a spam detection model by learning from samples labeled spam/nonspam is a good application of supervised learning.
  2. Unsupervised learning uses data that has not been labeled, classified or categorized. The machine is challenged to identify patterns through processes such as cluster analysis, and the outcome is usually unknown. Unsupervised machine learning is good at discovering underlying patterns in data, but is a poor choice for a regression or classification problem. Network anomaly detection is a security problem that fits well in this category.
  3. Semisupervised learning uses a combination of labeled and unlabeled data, typically with the majority being unlabeled. It is primarily used to improve the quality of training sets. For exploit kit identification problems, we can find some known exploit kits to train our model, but there are many variants and unknown kits that can’t be labeled. We can use semisupervised learning to address the problem.
  4. Reinforcement learning seeks the optimal path to a desired result by continually rewarding improvement. The problem set is generally small, and the training data well-understood. An example of reinforcement learning is a generative adversarial network (GAN), such as one experiment from Cornell University in which distance, measured in the form of correct and incorrect bits, is used as a loss function to encrypt messages between two neural networks and avoid eavesdropping by an unauthorized third neural network.
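To make the supervised case concrete, here is a minimal sketch of the spam-detection example from point 1, using scikit-learn. The sample messages and labels are invented for illustration; a real system would train on a large, curated corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labeled input/output pairs (message -> spam/ham)
messages = [
    "win a free prize now", "claim your free reward", "urgent: wire money",
    "meeting at 10am tomorrow", "please review the attached report",
    "lunch on friday?",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Bag-of-words features feeding a naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)

# The trained model infers a label for a message it has never seen
print(model.predict(["free money prize"])[0])
```

The key property of supervised learning is visible here: every training example carries a label, and the model generalizes from those labeled pairs to new inputs.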

Artificial Intelligence Depends on Good Data

Machine learning is predicated on learning from data, so having the right quantity and quality is essential. Security leaders should ask the following questions about their data sources to optimize their machine learning deployments:

  • Is there enough data? You’ll need a sufficient amount to represent all possible scenarios that a system will encounter.
  • Does the data contain patterns that machine learning systems can learn from? Good data sets should have frequently recurring values, clear and obvious meanings, few out-of-range values and persistence, meaning that they change little over time.
  • Is the data sparse? Are certain expected values missing? This can create misleading results.
  • Is the data categorical or numeric in nature? This dictates the choice of the classifier we can use.
  • Are labels available?
  • Is the data current? This is particularly important in AI security systems because threats change so quickly. For example, a malware detection system that has been trained on old samples will have difficulty detecting new malware variations.
  • Is the source of the data trusted? You don’t want to train your model from publicly available data of origins you don’t trust. Data sample poisoning is just one attack vector through which machine learning-based security models are compromised.
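Several of the checks above — sparsity, missing values and label availability — can be automated before training begins. The sketch below uses invented network-flow records and a hypothetical `quality_report` helper to show the idea:

```python
# Hypothetical flow records; None marks a missing field or missing label
records = [
    {"bytes": 1200, "proto": "tcp", "label": "benign"},
    {"bytes": None, "proto": "udp", "label": "benign"},
    {"bytes": 560,  "proto": "tcp", "label": None},
    {"bytes": 880,  "proto": "tcp", "label": "malicious"},
]

def quality_report(rows, feature, label_key="label"):
    """Report the fraction of missing feature values and available labels."""
    n = len(rows)
    missing = sum(1 for r in rows if r[feature] is None)
    labeled = sum(1 for r in rows if r[label_key] is not None)
    return {
        "missing_frac": missing / n,   # sparsity check
        "labeled_frac": labeled / n,   # label availability
    }

print(quality_report(records, "bytes"))
```

A high `missing_frac` warns of the sparse-data problem described above, while a low `labeled_frac` may push you toward unsupervised or semisupervised approaches.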

Choosing the Right Platforms and Tools

There is a wide variety of platforms and tools available on the market, but how do you know which is the right one for you? Ask the following questions to help inform your choice:

  • How comfortable are you in a given language?
  • Does the tool integrate well with your existing environment?
  • Is the tool well-suited for big data analytics?
  • Does it provide built-in data parsing capabilities that enable the model to understand the structure of data?
  • Does it use a graphical or command-line interface?
  • Is it a complete machine learning platform or just a set of libraries that you can use to build models? The latter provides more flexibility, but also has a steeper learning curve.

What About the Algorithm?

You’ll also need to select an algorithm to employ. Try a few different algorithms and compare to determine which delivers the most accurate results. Here are some factors that can help you decide which algorithm to start with:

  • How much data do you have, and is it of good quality? Data with many missing values will deliver lower-quality results.
  • Is the learning problem supervised, unsupervised or reinforcement learning? You’ll want to match the data set to the use case as described above.
  • Determine the type of problem being solved, such as classification, regression, anomaly detection or dimensionality reduction. There are different AI algorithms that work best for each type of problem.
  • How important is accuracy versus speed? If approximations are acceptable, you can get by with smaller data sets and lower-quality data. If accuracy is paramount, you’ll need higher quality data and more time to run the machine learning algorithms.
  • How much visibility do you need into the process? Algorithms that provide decision trees show you clearly how the model reached a decision, while neural networks are a bit of a black box.
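The visibility trade-off in the last point is easy to demonstrate. A decision tree can print the exact rules behind its predictions, something a neural network cannot do directly. This sketch uses scikit-learn's bundled iris data purely as a stand-in for a real security data set:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Shallow tree: less accurate, but every decision path is inspectable
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text renders the learned split rules in human-readable form
print(export_text(tree))
```

For a SOC, this kind of transparency matters when an analyst must justify why an alert fired; if raw accuracy matters more than explainability, a black-box model may still be the better fit.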

How to Train, Test and Evaluate AI Security Systems

Training samples should be constantly updated as new exploits are discovered, so it’s often necessary to perform training on the fly. However, training in real time opens up the risk of adversarial machine learning attacks in which bad actors attempt to disrupt the results by introducing misleading input data.

While it is not always possible to perform training offline, doing so is desirable when you can, because it allows you to control the quality of the training data. Once the training process is complete, the model can be deployed into production.

One common method of testing trained models is to split the data set and devote a portion of the data — say, 70 percent — to training and the rest to testing. If the model is robust, the output from both data sets should be similar.
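The 70/30 split described above is a one-liner with scikit-learn. The data here is synthetic; `stratify` keeps the class balance identical in both partitions:

```python
from sklearn.model_selection import train_test_split

# Synthetic data set: 100 samples, two balanced classes
X = list(range(100))
y = [v % 2 for v in X]

# Hold out 30 percent for testing, preserving the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

print(len(X_train), len(X_test))
```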

A somewhat more refined approach called cross-validation divides the data set into groups of equal sizes and trains on all but one of the groups. For example, if the number of groups is “n,” then you would train on n-1 groups and test with the one set that is left out. This process is repeated many times, leaving out a different group for testing each time. Performance is measured by averaging results across all repetitions.
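The n-fold procedure above can be sketched with scikit-learn's `cross_val_score`, which handles the train-on-n-1-groups, test-on-the-rest rotation and returns one score per fold. The breast-cancer data set stands in for a real security data set here:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# 5 groups: each pass trains on 4 and tests on the held-out group
folds = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=folds)

# Performance is the average across all five repetitions
print(scores.mean())
```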

Choice of evaluation metrics also depends on the type of problem you’re trying to solve. For example, a regression problem tries to find the range of error between the actual value and the predicted value, so the metrics you might use include mean absolute error, root mean squared error, relative absolute error and relative squared error.
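Two of these regression metrics can be computed directly from the definitions. The actual and predicted values below are invented for illustration:

```python
import math

# Hypothetical actual vs. predicted values from a regression model
actual    = [3.0, 5.0, 2.5, 7.0]
predicted = [2.5, 5.0, 3.0, 8.0]

errors = [a - p for a, p in zip(actual, predicted)]

# Mean absolute error: average magnitude of the errors
mae = sum(abs(e) for e in errors) / len(errors)

# Root mean squared error: penalizes large errors more heavily
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

print(round(mae, 3), round(rmse, 3))
```

Note that RMSE exceeds MAE whenever the errors vary in size, which is why it is the stricter metric when occasional large misses are costly.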

For a classification problem, the objective is to determine which categories new observations belong in — which requires a different set of quality metrics, such as accuracy, precision, recall, F1 score and area under the curve (AUC).
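Precision, recall and F1 fall straight out of the confusion-matrix counts. This sketch uses invented predictions from a hypothetical malware classifier (1 = malicious):

```python
# Hypothetical ground truth and model predictions
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false alarms
fn = sum(t == 1 and p == 0 for t, p in pairs)  # missed detections

precision = tp / (tp + fp)  # of the alerts raised, how many were real?
recall    = tp / (tp + fn)  # of the real threats, how many were caught?
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(precision, recall, f1)
```

In a SOC, precision and recall pull in opposite directions: tuning for recall catches more threats but floods analysts with false positives, which is why the F1 score is a common single-number compromise.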

Deployment on the Cloud or On-Premises?

Lastly, you’ll need to select a location for deployment. Cloud machine learning platforms certainly have advantages, such as speed of provisioning, choice of tools and the availability of third-party training data. However, you may not want to share data in the cloud for security and compliance reasons. Consider these factors before choosing whether to deploy on-premises or in a public cloud.

These are just a few of the many factors to consider when building security systems with artificial intelligence. Remember, the best solution for one organization or security problem is not necessarily the best solution for everyone or every situation.
