Trying to discern drug smugglers passing through customs presents much the same problem as trying to discern security threats passing through our networks. Machine learning has been applied to both with varying degrees of success, but the technology ultimately runs into the same limitations. Machine learning has two basic elements: feature vectors (the data that is gathered about each event) and classification exemplars (the labeled examples that show how that data should be classified).

In the case of drug smugglers, we might observe the number of travelers, point of origin, point of destination, number of bags, length of stay and weight of the bags. We might then flag any traveler or pair of travelers with two or more bags whose combined weight is greater than 150 pounds, whose stay is less than a week and who originated from a climate conducive to poppies.
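The flagging rule described above can be written down as a short sketch. The field names, the `TravelParty` type and the contents of `POPPY_ORIGINS` are hypothetical placeholders chosen for illustration; only the thresholds (two or more bags, more than 150 pounds, a stay of less than a week) come from the text.

```python
from dataclasses import dataclass

# Hypothetical placeholder values: the article only says
# "a climate conducive to poppies" and names no specific origins.
POPPY_ORIGINS = {"origin_a", "origin_b"}


@dataclass
class TravelParty:
    """One feature vector: a traveler or group of travelers at customs."""
    travelers: int
    origin: str
    destination: str
    bags: int
    stay_days: int
    combined_bag_weight_lbs: float


def flag_party(party: TravelParty) -> bool:
    """Return True when the party matches every condition of the rule."""
    return (
        party.travelers <= 2                      # a traveler or pair of travelers
        and party.bags >= 2                       # two or more bags
        and party.combined_bag_weight_lbs > 150   # combined weight over 150 pounds
        and party.stay_days < 7                   # stay shorter than a week
        and party.origin in POPPY_ORIGINS         # assumed origin list
    )


suspicious = TravelParty(2, "origin_a", "home", 3, 4, 180.0)
tourist = TravelParty(2, "origin_a", "home", 3, 10, 180.0)
print(flag_party(suspicious), flag_party(tourist))
```

A hand-written rule like this is rigid: every legitimate traveler who happens to match all five conditions is flagged, which is the false positive problem the rest of the article discusses.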

In the case of threat analytics, we observe the addresses of the endpoints, the amount of data that flows to and from each endpoint, the geolocated countries of the endpoints and the frequency of communications. Given this context, should it not then be easier to discern between malicious and benign behavior?

Reducing False Positives

The following graph is an expected results curve for machine learning:

The expectation is that the more feature vectors (data) we can collect, combined with continual and up-to-date classification examples, the more easily machine learning can discern between malicious and benign behavior, with the false positive rate trending to near zero. False positives, meaning benign events incorrectly classified as malicious, are the bane of all machine learning, whether it’s used to detect drug smugglers or malicious network traffic.
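For reference, the false positive rate is conventionally computed as the share of truly benign events that were flagged as malicious, FP / (FP + TN). The sketch below uses made-up labels purely to show the arithmetic; it is not data from the article.

```python
def false_positive_rate(y_true, y_pred):
    """FP / (FP + TN), where True means 'malicious' and False means 'benign'."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)      # benign, flagged
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)  # benign, passed
    if fp + tn == 0:
        raise ValueError("no benign events to measure against")
    return fp / (fp + tn)


# Illustrative labels only: four benign events and two malicious ones.
actual = [False, False, False, False, True, True]
predicted = [False, True, False, False, True, False]
print(false_positive_rate(actual, predicted))  # 1 of 4 benign events flagged -> 0.25
```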

A traveling aid worker might carry unusual tools in his luggage. Similarly, an unusual level of file transfer traffic from an untrustworthy geolocation might simply be a large download of vacation photos. False positives consume human resources to investigate events that can be commonplace, thus diminishing the effectiveness of machine learning. These are still early days.

Infinite Exemplars and Finite Data

So how do we decrease security false positives? It’s important to keep in mind that classification exemplars are effectively unlimited, while the data describing each event is finite. If our traveler is flagged as malicious and subsequently proves to be legitimate, feed that case back as another training exemplar of a valid traveler. The same holds true for network traffic: Simply classify the photo downloads as valid traffic. The more exemplars, the better the classification, right? This is true only to the extent that the feature vectors are granular enough to keep exemplars distinct, so that the same data point does not map to more than one classification.

Feature vectors are limited, however, and in many instances exhausted. In the case of a traveler, we might add the color of his or her bags, style of clothes or frequency of travel. In the case of threat analytics, we might add file content type and file entropy. Still, at some point the data is finite. Some vacationers simply look like drug smugglers, and vice versa.
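File entropy, mentioned above as a candidate feature, is commonly measured as the Shannon entropy of a file's byte distribution, in bits per byte. The following is a minimal sketch of that standard calculation; how such a value would be fed into a particular classifier is not specified in the article.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, from 0.0 to 8.0 bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # occurrences of each byte value
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


print(shannon_entropy(b"aaaaaaaa"))        # a single repeated byte: no uncertainty
print(shannon_entropy(bytes(range(256))))  # every byte value equally likely
```

Compressed or encrypted content tends to score near the upper end of the range and plain text noticeably lower, which is why entropy can help separate content types that look identical at the flow level.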

Experts have conducted an abundance of research on network flow information. Unfortunately, flow information is easily exhausted because it only provides a few dozen features and suffers greatly from false positives, per the following diagram:

Here, exemplars continue to increase, but feature vectors are limited. The key to improving machine learning is to increase feature vectors so that exemplars correctly classify events as malicious or benign. The traveler might be wearing additional attire, like sunglasses or a hat. Similarly, network traffic might include tunneling or proxy knowledge content. An unfortunate reality for network analytics is that the desired data is typically not available or simply unattainable. So where does this leave us in terms of machine learning for security?
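The limitation described above can be illustrated with a toy sketch: when two exemplars share an identical feature vector but carry different labels, no number of additional exemplars can separate them, whereas one extra feature can. The feature values and labels below are invented for illustration and do not come from any real dataset.

```python
from collections import defaultdict


def ambiguous_vectors(exemplars):
    """Return feature vectors that appear with more than one label."""
    labels_by_vector = defaultdict(set)
    for features, label in exemplars:
        labels_by_vector[tuple(features)].add(label)
    return {vec for vec, labels in labels_by_vector.items() if len(labels) > 1}


# Flow-only features: (transfer size bucket, geolocated country code).
flow_only = [
    (("large", "XX"), "malicious"),
    (("large", "XX"), "benign"),      # vacation photos look like exfiltration
    (("small", "YY"), "benign"),
]

# Same events with one hypothetical extra feature: content type.
enriched = [
    (("large", "XX", "tunnel"), "malicious"),
    (("large", "XX", "image"), "benign"),
    (("small", "YY", "html"), "benign"),
]

print(ambiguous_vectors(flow_only))  # one vector carries both labels
print(ambiguous_vectors(enriched))   # the added feature separates them
```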

The Future of Machine Learning

Machine learning is very effective in eliminating white noise and classifying benign traffic with a high degree of accuracy; that is, what it classifies as benign is almost always benign. Still, the false positive rates for predicting malicious events have been pretty disappointing.

For machine learning to be effective, IT professionals must switch from an analytics-based approach to a data-centric one, as more analytics on the same data produces the same results. We need to derive better data if we are going to rely on machine learning alone for security assessments.

IBM has done extensive research using machine learning for security analytics in the areas of Domain Name System (DNS) and the insider threat. IBM is currently beta testing machine learning-based applications for DNS that detect tunneling, beaconing, fluxing, squatting and exfiltration.

Here, false positives are reduced by increasing feature vectors with supplemental WHOIS and URL/website analytics. The insider threat technology is based on user behavior analytics that classify typical and anomalous behavior. For example, logging into an application from the same endpoint with a different identifier would be an anomaly upon first occurrence.
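The login example can be sketched as a simple first-occurrence check. This is an illustrative toy, not IBM's implementation: the `LoginBaseline` class and its method names are hypothetical, and a real user behavior analytics system would weigh many more signals.

```python
class LoginBaseline:
    """Tracks which identifiers have logged into each application per endpoint."""

    def __init__(self):
        # (endpoint, application) -> set of identifiers already seen there
        self._seen = {}

    def observe(self, endpoint: str, application: str, user_id: str) -> bool:
        """Record a login and return True if it is anomalous.

        A login is anomalous when the endpoint/application pair already has
        a baseline and this identifier has never been seen for that pair.
        """
        known = self._seen.setdefault((endpoint, application), set())
        is_anomaly = bool(known) and user_id not in known
        known.add(user_id)  # after the first occurrence it becomes typical
        return is_anomaly


baseline = LoginBaseline()
print(baseline.observe("laptop-17", "payroll", "alice"))  # establishes the baseline
print(baseline.observe("laptop-17", "payroll", "bob"))    # new identifier, same endpoint
print(baseline.observe("laptop-17", "payroll", "bob"))    # seen before, now typical
```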

Machine learning can reduce the white noise, but without an injection of security-relevant data, it has a long way to go before it can be considered a generational leap in analytics. The next generation of machine learning-based security analytics will undoubtedly have a heavy focus on data acquisition.
