October 17, 2016 By Russell Couturier 4 min read

Trying to discern drug smugglers passing through customs presents exactly the same problem as trying to discern security threats passing through our networks. Machine learning has been applied to both with varying degrees of success, but ultimately the technology reaches the same limitations. Machine learning has two basic elements: feature vectors, which are the data that is gathered, and classification exemplars, which are the corresponding classification examples.

In the case of drug smugglers, we might observe the number of travelers, point of origin, point of destination, number of bags, length of stay and weight of the bags. We might also flag any traveler or pair of travelers with two or more bags whose combined weight is greater than 150 pounds, whose stay is less than a week and who originated from a climate conducive to poppies.
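To make the flagging rule concrete, here is a minimal sketch of it as a hand-written check. The field names, the `TravelParty` type and the placeholder set of origins are all assumptions for illustration; the article names only the observed features and the thresholds.

```python
from dataclasses import dataclass

# Hypothetical placeholder values: the article only says
# "a climate conducive to poppies" and names no specific origins.
POPPY_CLIMATE_ORIGINS = {"origin_a", "origin_b"}


@dataclass
class TravelParty:
    """One feature vector: the observations gathered for a traveler or group."""
    travelers: int
    origin: str
    destination: str
    bags: int
    combined_bag_weight_lb: float
    stay_days: int


def is_flagged(party: TravelParty) -> bool:
    """Apply the rule described above: a traveler or pair of travelers,
    two or more bags, combined weight over 150 pounds, a stay shorter
    than a week, and an origin in the assumed set."""
    return (
        party.travelers <= 2
        and party.bags >= 2
        and party.combined_bag_weight_lb > 150
        and party.stay_days < 7
        and party.origin in POPPY_CLIMATE_ORIGINS
    )
```

A fixed rule like this is the starting point that machine learning tries to improve on: the same fields become the feature vector, and labeled outcomes become the exemplars.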

In the case of threat analytics, we observe the addresses of the endpoints, the amount of data that flows to and from each endpoint, the geolocated countries of the endpoints and the frequency of communications. Given this context, should it not then be easier to discern between malicious and benign behavior?

Reducing False Positives

The following graph is an expected results curve for machine learning:

The more feature vectors (data) we can collect, combined with continual and up-to-date classification examples, the more easily machine learning can discern between malicious and benign behavior, with the false positive rate trending toward zero. False positives, or benign events incorrectly classified as malicious, are the bane of all machine learning, whether it’s used to detect drug smugglers or malicious network traffic.
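For readers who want the metric pinned down, the false positive rate is commonly computed as the share of truly benign events that were flagged as malicious. The sketch below assumes that standard definition and a simple boolean encoding (True means malicious); neither comes from the article itself.

```python
def false_positive_rate(actual, predicted):
    """Fraction of truly benign events that were flagged as malicious.

    Both arguments are equal-length sequences of booleans where True
    means "malicious" (an encoding assumed for this sketch).
    """
    pairs = list(zip(actual, predicted))
    false_positives = sum(1 for a, p in pairs if not a and p)
    true_negatives = sum(1 for a, p in pairs if not a and not p)
    benign_total = false_positives + true_negatives
    # With no benign events there is nothing to misclassify.
    return false_positives / benign_total if benign_total else 0.0


# Four benign events, one of them wrongly flagged, plus one true detection.
actual = [False, False, False, False, True]
predicted = [True, False, False, False, True]
print(false_positive_rate(actual, predicted))  # 0.25
```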

A traveling aid worker might carry unusual tools in his luggage. Similarly, an unusual level of file transfer traffic from an untrustworthy geolocation might simply be a large download of vacation photos. False positives consume human resources to investigate events that can be commonplace, thus diminishing the effectiveness of machine learning. These are still early days.

Infinite Exemplars and Finite Data

So how do we decrease security false positives? It’s important to keep in mind that classification exemplars are infinite and data is finite. If our traveler is flagged as malicious and subsequently proves to be legitimate, feed that case back as another training exemplar of a valid traveler. The same holds true for network traffic: Simply classify the photo downloads as valid traffic. The more exemplars, the better the classification, right? This is true only to the extent that the feature vectors are granular enough to allow discrete exemplars, so that the same data points don’t apply to multiple classifications.
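One way to see when feature vectors are not granular enough is to look for identical vectors that carry conflicting labels. The following is a rough sketch under assumed names and an assumed data layout (a list of feature tuple and label pairs), not a method described in the article.

```python
from collections import defaultdict


def ambiguous_vectors(exemplars):
    """Return the feature vectors that appear under more than one label.

    `exemplars` is an iterable of (features, label) pairs, where
    `features` is any sequence of hashable values.
    """
    labels_by_vector = defaultdict(set)
    for features, label in exemplars:
        labels_by_vector[tuple(features)].add(label)
    return {
        vector: labels
        for vector, labels in labels_by_vector.items()
        if len(labels) > 1
    }


# Hypothetical exemplars: (travelers, bag weight in pounds, stay in days).
exemplars = [
    ((2, 160, 5), "malicious"),
    ((2, 160, 5), "benign"),   # same observations, different outcome
    ((1, 40, 14), "benign"),
]
print(ambiguous_vectors(exemplars))
```

Any vector reported here cannot be separated by more exemplars alone. Only an additional feature can tell the two cases apart.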

Feature vectors are limited, however, and in many instances exhausted. In the case of a traveler, we might add the color of his or her bags, style of clothes or frequency of travel. In the case of threat analytics, we might add file content type and file entropy. Still, at some point the data is finite. Some vacationers simply look like drug smugglers, and vice versa.
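The article does not say how file entropy would be measured. A common interpretation is Shannon entropy over byte frequencies, measured in bits per byte, and the sketch below assumes that reading. High values tend to indicate compressed or encrypted content, though that alone does not make a transfer malicious.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # frequency of each byte value
    return -sum(
        (count / total) * math.log2(count / total)
        for count in counts.values()
    )


print(shannon_entropy(b"ab"))               # 1.0
print(shannon_entropy(bytes(range(256))))   # 8.0
```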

Experts have conducted an abundance of research on network flow information. Unfortunately, flow information is easily exhausted because it only provides a few dozen features and suffers greatly from false positives, per the following diagram:

Here, exemplars continue to increase, but feature vectors are limited. The key to improving machine learning is to increase feature vectors so that exemplars correctly classify events as malicious or benign. The traveler might be wearing additional attire, like sunglasses or a hat. Similarly, network traffic might include tunneling or proxy knowledge content. An unfortunate reality for network analytics is that the desired data is typically not available or simply unattainable. So where does this leave us in terms of machine learning for security?

The Future of Machine Learning

Machine learning is very effective in eliminating white noise and classifying benign traffic with a high degree of accuracy — that is, what it believes to be benign is absolutely benign. Still, the false positive rates for predicting malicious events have been pretty disappointing.

For machine learning to be effective, IT professionals must switch from an analytics-based approach to a data-centric one, as more analytics on the same data produces the same results. We need to derive better data if we are going to rely on machine learning alone for security assessments.

IBM has done extensive research using machine learning for security analytics in the areas of Domain Name System (DNS) and the insider threat. IBM is currently beta testing machine learning-based applications for DNS that detect tunneling, beaconing, fluxing, squatting and exfiltration.

Here, false positives are reduced by increasing feature vectors with supplemental WHOIS and URL/website analytics. The insider threat technology is based on user behavior analytics that classify typical and anomalous behavior. For example, logging into an application from the same endpoint with a different identifier would be an anomaly upon first occurrence.
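The first-occurrence example can be illustrated with a very small baseline that remembers which identifiers have logged in from each endpoint. This is only a toy sketch with hypothetical names. It is not a description of IBM's user behavior analytics, which the article does not detail.

```python
class LoginBaseline:
    """Tracks which identifiers have been seen on each endpoint."""

    def __init__(self):
        self._seen = {}  # endpoint -> set of identifiers observed there

    def observe(self, endpoint: str, identifier: str) -> bool:
        """Record a login and return True if it is anomalous.

        A login is treated as anomalous when the endpoint already has a
        history and this identifier has not been seen on it before.
        """
        known = self._seen.setdefault(endpoint, set())
        is_anomaly = bool(known) and identifier not in known
        known.add(identifier)
        return is_anomaly


baseline = LoginBaseline()
print(baseline.observe("host-1", "alice"))  # False: first login establishes the baseline
print(baseline.observe("host-1", "bob"))    # True: new identifier on a known endpoint
print(baseline.observe("host-1", "bob"))    # False: no longer a first occurrence
```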

Machine learning can reduce the white noise, but without an injection of security-relevant data, it has a long way to go before it can be considered a generational leap in analytics. The next generation of machine learning-based security analytics will undoubtedly have a heavy focus on data acquisition.
