October 17, 2016 By Russell Couturier 4 min read

Trying to discern drug smugglers passing through customs presents exactly the same problem as trying to discern security threats passing through our networks. Machine learning has been applied to both with varying degrees of success, but ultimately the technology reaches the same limitations. Machine learning has two basic elements: feature vectors and classification exemplars — the data that is gathered and the corresponding classification examples.

In the case of drug smugglers, we might observe number of travelers, point of origin, point of destination, number of bags, length of stay and weight of the bags. We might also flag any traveler or pair of travelers with two or more bags whose combined weight is greater than 150 pounds, whose stay is less than a week and who originated from a climate conducive to poppies.

In the case of threat analytics, we observe the addresses of the endpoints, the amount of data that flows to and from each endpoint, the geolocated countries of the endpoints and the frequency of communications. Given this context, should it not then be easier to discern between malicious and benign behavior?

Reducing False Positives

The following graph is an expected results curve for machine learning:

The more feature vectors (data) we can collect, combined with continual and up-to-date classification examples, the more easily machine learning can discern between malicious and benign behavior, with the false positive rate trending to near zero. False positives, or incorrectly classified events, are the bane of all machine learning, whether it’s used to detect drug smugglers or malicious network traffic.

A traveling aid worker might carry unusual tools in his luggage. Similarly, an unusual level of file transfer traffic from an untrustworthy geolocation might simply be a large download of vacation photos. False positives consume human resources to investigate events that can be commonplace, thus diminishing the effectiveness of machine learning. These are still early days.

Infinite Exemplars and Finite Data

So how do we decrease security false positives? It’s important to keep in mind that classification exemplars are infinite and data is finite. If our traveler is denoted as malicious and subsequently proves to be legitimate, feed this as just another training exemplar of valid travelers. The same holds true for network traffic: Simply classify the photo downloads as valid traffic. The more exemplars, the better the classification, right? This is true to the extent that the feature vectors are granular enough to allow discrete exemplars, so the data points don’t apply to multiple classifications.

Feature vectors are limited, however, and in many instances exhausted. In the case of a traveler, we might add the color of his or her bags, style of clothes or frequency of travel. In the case of threat analytics, we might add file content type and file entropy. Still, at some point the data is finite. Some vacationers simply look like drug smugglers, and vice versa.

Experts have conducted an abundance of research on network flow information. Unfortunately, flow information is easily exhausted because it only provides a few dozen features and suffers greatly from false positives, per the following diagram:

Here, exemplars continue to increase, but feature vectors are limited. The key to improving machine learning is to increase feature vectors so that exemplars correctly classify events as malicious or benign. The traveler might be wearing additional attire, like sunglasses or a hat. Similarly, network traffic might include tunneling or proxy knowledge content. An unfortunate reality for network analytics is that the desired data is typically not available or simply unattainable. So where does this leave us in terms of machine learning for security?

The Future of Machine Learning

Machine learning is very effective in eliminating white noise and classifying benign traffic with a high degree of accuracy — that is, what it believes to be benign is absolutely benign. Still, the false positive rates for predicting malicious events have been pretty disappointing.

For machine learning to be effective, IT professionals must switch from an analytics-based approach to a data-centric one, as more analytics on the same data produces the same results. We need to derive better data if we are going to rely on machine learning alone for security assessments.

IBM has done extensive research using machine learning for security analytics in the areas of Domain Name System (DNS) and the insider threat. IBM is currently beta testing machine learning-based applications for DNS that detect tunneling, beaconing, fluxing, squatting and exfiltration.

Here, false positives are reduced by increasing feature vectors with supplemental WHOIS and URL/website analytics. The insider threat technology is based on user behavior analytics that classify typical and anomalous behavior. For example, logging into an application from the same endpoint with a different identifier would be an anomaly upon first occurrence.

Machine learning can reduce the white noise, but without an injection of security-relevant data, it has a long way to go before it can be considered a generational leap in analytics. The next generation of machine learning-based security analytics will undoubtedly have a heavy focus on data acquisition.

More from Intelligence & Analytics

X-Force Threat Intelligence Index 2024 reveals stolen credentials as top risk, with AI attacks on the horizon

4 min read - Every year, IBM X-Force analysts assess the data collected across all our security disciplines to create the IBM X-Force Threat Intelligence Index, our annual report that plots changes in the cyber threat landscape to reveal trends and help clients proactively put security measures in place. Among the many noteworthy findings in the 2024 edition of the X-Force report, three major trends stand out that we’re advising security professionals and CISOs to observe: A sharp increase in abuse of valid accounts…

Web injections are back on the rise: 40+ banks affected by new malware campaign

8 min read - Web injections, a favored technique employed by various banking trojans, have been a persistent threat in the realm of cyberattacks. These malicious injections enable cyber criminals to manipulate data exchanges between users and web browsers, potentially compromising sensitive information. In March 2023, security researchers at IBM Security Trusteer uncovered a new malware campaign using JavaScript web injections. This new campaign is widespread and particularly evasive, with historical indicators of compromise (IOCs) suggesting a possible connection to DanaBot — although we…

Accelerating security outcomes with a cloud-native SIEM

5 min read - As organizations modernize their IT infrastructure and increase adoption of cloud services, security teams face new challenges in terms of staffing, budgets and technologies. To keep pace, security programs must evolve to secure modern IT environments against fast-evolving threats with constrained resources. This will require rethinking traditional security strategies and focusing investments on capabilities like cloud security, AI-powered defense and skills development. The path forward calls on security teams to be agile, innovative and strategic amidst the changes in technology…

Topic updates

Get email updates and stay ahead of the latest threats to the security landscape, thought leadership and research.
Subscribe today