Artificial intelligence (AI) brings a powerful new set of tools to the fight against threat actors, but choosing the right combination of libraries, test suites and training models when building AI security systems is highly dependent on the situation. If you’re thinking about adopting AI in your security operations center (SOC), the following questions and considerations can help guide your decision-making.

What Problem Are You Trying to Solve?

Spam detection, intrusion detection, malware detection and natural language-based threat hunting are all very different problem sets that require different AI tools. Begin by considering what kind of AI security systems you need.

Understanding the outputs you need helps you determine what data to collect and how to test it. Ask yourself whether you’re solving a classification or regression problem, building a recommendation engine or detecting anomalies. Depending on the answers to those questions, you can apply one of four basic types of machine learning:

  1. Supervised learning trains an algorithm based on example sets of input/output pairs. The goal is to develop new inferences based on patterns inferred from the sample results. Sample data must be available and labeled. For example, designing a spam detection model by learning from samples labeled spam/nonspam is a good application of supervised learning (a minimal code sketch follows this list).
  2. Unsupervised learning uses data that has not been labeled, classified or categorized. The machine is challenged to identify patterns through processes such as cluster analysis, and the outcome is usually unknown. Unsupervised machine learning is good at discovering underlying patterns in data, but is a poor choice for a regression or classification problem. Network anomaly detection is a security problem that fits well in this category.
  3. Semisupervised learning uses a combination of labeled and unlabeled data, typically with the majority being unlabeled. It is primarily used to improve the quality of training sets. For exploit kit identification problems, we can find some known exploit kits to train our model, but there are many variants and unknown kits that can’t be labeled. We can use semisupervised learning to address the problem.
  4. Reinforcement learning seeks the optimal path to a desired result by continually rewarding improvement. The problem set is generally small, and the training data well-understood. An example of reinforcement learning is a generative adversarial network (GAN), such as this experiment from Cornell University in which distance, measured in the form of correct and incorrect bits, is used as a loss function to encrypt messages between two neural networks and avoid eavesdropping by an unauthorized third neural network.
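To make the supervised case from item 1 concrete, here is a minimal sketch of a spam classifier, assuming scikit-learn is available; the sample messages and labels are invented purely for illustration:

```python
# A toy supervised spam classifier: learn from labeled input/output pairs.
# The messages and labels below are invented for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

messages = [
    "Win a free prize now", "Lowest price guaranteed, click here",
    "Meeting moved to 3pm", "Please review the attached report",
]
labels = ["spam", "spam", "ham", "ham"]  # the labeled sample data

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(messages, labels)  # infer patterns from the labeled samples

# New inference on unseen input; likely "spam" given the training examples.
print(model.predict(["Claim your free prize today"]))
```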

Artificial Intelligence Depends on Good Data

Machine learning is predicated on learning from data, so having the right quantity and quality is essential. Security leaders should ask the following questions about their data sources to optimize their machine learning deployments:

  • Is there enough data? You’ll need a sufficient amount to represent all possible scenarios that a system will encounter.
  • Does the data contain patterns that machine learning systems can learn from? Good data sets should have frequently recurring values, clear and obvious meanings, few out-of-range values and persistence, meaning that they change little over time.
  • Is the data sparse? Are certain expected values missing? This can create misleading results (a quick check is sketched after this list).
  • Is the data categorical or numeric in nature? This dictates the choice of classifiers you can use.
  • Are labels available? Supervised learning requires labeled samples, so missing labels may push you toward unsupervised or semisupervised approaches.
  • Is the data current? This is particularly important in AI security systems because threats change so quickly. For example, a malware detection system that has been trained on old samples will have difficulty detecting new malware variations.
  • Is the source of the data trusted? You don’t want to train your model from publicly available data of origins you don’t trust. Data sample poisoning is just one attack vector through which machine learning-based security models are compromised.
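Several of these questions can be answered with a few lines of exploratory analysis before any model is trained. Here is a minimal sketch assuming pandas is available; the column names and values are hypothetical:

```python
# Quick data-quality checks on a hypothetical network-traffic data set.
import pandas as pd

df = pd.DataFrame({
    "bytes_sent": [512, 1024, None, 2048],             # numeric feature
    "label": ["benign", "benign", None, "malicious"],  # may be missing
})

print(df.isna().mean())            # fraction of missing values per column
print(df["label"].value_counts())  # are labels available, and how balanced?
print(df.dtypes)                   # categorical vs. numeric columns
```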

Choosing the Right Platforms and Tools

There is a wide variety of platforms and tools available on the market, but how do you know which is the right one for you? Ask the following questions to help inform your choice:

  • How comfortable are you with a given programming language?
  • Does the tool integrate well with your existing environment?
  • Is the tool well-suited for big data analytics?
  • Does it provide built-in data parsing capabilities that enable the model to understand the structure of data?
  • Does it use a graphical or command-line interface?
  • Is it a complete machine learning platform or just a set of libraries that you can use to build models? The latter provides more flexibility, but also has a steeper learning curve.

What About the Algorithm?

You’ll also need to select an algorithm to employ. Try a few different algorithms and compare them to determine which delivers the most accurate results. Here are some factors that can help you decide which algorithm to start with:

  • How much data do you have, and is it of good quality? Data with many missing values will deliver lower-quality results.
  • Is the learning problem supervised, unsupervised or reinforcement learning? You’ll want to match the data set to the use case as described above.
  • What type of problem are you solving: classification, regression, anomaly detection or dimensionality reduction? Different algorithms work best for each type of problem.
  • How important is accuracy versus speed? If approximations are acceptable, you can get by with smaller data sets and lower-quality data. If accuracy is paramount, you’ll need higher quality data and more time to run the machine learning algorithms.
  • How much visibility do you need into the process? Decision tree algorithms show you clearly how the model reached a decision, while neural networks are a bit of a black box (see the sketch after this list).
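To illustrate that visibility trade-off, the rules a decision tree learns can be printed directly, something a neural network does not offer. A minimal sketch, assuming scikit-learn and using its bundled iris data set purely as a stand-in for security features:

```python
# Print the human-readable rules behind a trained decision tree.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)  # stand-in data set
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree))           # shows how decisions are reached
```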

How to Train, Test and Evaluate AI Security Systems

Training samples should be constantly updated as new exploits are discovered, so it’s often necessary to perform training on the fly. However, training in real time opens up the risk of adversarial machine learning attacks in which bad actors attempt to disrupt the results by introducing misleading input data.

While offline training is not always feasible, it is preferable when possible because the quality of the training data can be controlled. Once the training process is complete, the model can be deployed into production.

One common method of testing trained models is to split the data set and devote a portion of the data — say, 70 percent — to training and the rest to testing. If the model is robust, the output from both data sets should be similar.
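A minimal sketch of that 70/30 split, assuming scikit-learn; its bundled breast cancer data set stands in for whatever security data you have prepared:

```python
# Hold out 30 percent of the data for testing; compare the two scores.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # stand-in data set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy:", model.score(X_test, y_test))  # similar if robust
```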

A somewhat more refined approach called cross-validation divides the data set into groups of equal size and trains on all but one of the groups. For example, if the number of groups is “n,” you would train on n-1 groups and test on the one group that is left out. This process is repeated n times, leaving out a different group for testing each time, and performance is measured by averaging the results across all n rounds.
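The same idea in code, again as a sketch with scikit-learn; cross_val_score handles the splitting, rotation and averaging for n = 5 groups:

```python
# 5-fold cross-validation: train on 4 groups, test on the 5th, rotate.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(model, X, y, cv=5)  # one score per fold
print("mean:", scores.mean(), "std:", scores.std())
```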

Choice of evaluation metrics also depends on the type of problem you’re trying to solve. For example, a regression problem measures the error between the actual value and the predicted value, so the metrics you might use include mean absolute error, root mean squared error, relative absolute error and relative squared error.
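For instance, the first two of those regression metrics can be computed as follows; the actual and predicted values here are invented:

```python
# Mean absolute error and root mean squared error on invented values.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

actual = np.array([3.0, 5.0, 2.5, 7.0])
predicted = np.array([2.5, 5.0, 4.0, 8.0])

print("MAE: ", mean_absolute_error(actual, predicted))
print("RMSE:", np.sqrt(mean_squared_error(actual, predicted)))
```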

For a classification problem, the objective is to determine which categories new observations belong in — which requires a different set of quality metrics, such as accuracy, precision, recall, F1 score and area under the curve (AUC).
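A matching sketch for the classification metrics, again with invented labels and scores:

```python
# Common classification metrics on invented true/predicted labels.
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]                   # predicted classes
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.7, 0.6, 0.3]  # predicted probabilities

print("accuracy: ", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
print("AUC:      ", roc_auc_score(y_true, y_score))  # uses scores, not classes
```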

Deployment on the Cloud or On-Premises?

Lastly, you’ll need to select a location for deployment. Cloud machine learning platforms certainly have advantages, such as speed of provisioning, choice of tools and the availability of third-party training data. However, you may not want to share data in the cloud for security and compliance reasons. Consider these factors before choosing whether to deploy on-premises or in a public cloud.

These are just a few of the many factors to consider when building security systems with artificial intelligence. Remember, the best solution for one organization or security problem is not necessarily the best solution for everyone or every situation.
