Trying to discern drug smugglers passing through customs presents much the same problem as trying to discern security threats passing through our networks. Machine learning has been applied to both with varying degrees of success, but ultimately the technology reaches the same limitations. Machine learning has two basic elements: feature vectors, the data that is gathered about each event, and classification exemplars, the labeled examples that say how such data should be classified.
In the case of drug smugglers, we might observe the number of travelers, point of origin, point of destination, number of bags, length of stay and weight of the bags. We might then flag any traveler or pair of travelers with two or more bags whose combined weight is greater than 150 pounds, whose stay is less than a week and who originated from a climate conducive to poppies.
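The flagging rule described above can be written down as a short Python sketch. The class name, field names and the boolean `origin_grows_poppies` are hypothetical stand-ins chosen for illustration; only the thresholds (two or more bags, more than 150 pounds, less than a week) come from the text.

```python
from dataclasses import dataclass


@dataclass
class TravelParty:
    """Feature vector for one traveler or group (field names are illustrative)."""
    travelers: int
    bags: int
    combined_bag_weight_lb: float
    stay_days: int
    origin_grows_poppies: bool  # assumed to be derived from the point of origin


def should_flag(party: TravelParty) -> bool:
    """Apply the hand-written rule from the text to one feature vector."""
    return (
        1 <= party.travelers <= 2              # a traveler or pair of travelers
        and party.bags >= 2                    # two or more bags
        and party.combined_bag_weight_lb > 150
        and party.stay_days < 7                # stay of less than a week
        and party.origin_grows_poppies
    )


suspicious = TravelParty(travelers=2, bags=3, combined_bag_weight_lb=180.0,
                         stay_days=4, origin_grows_poppies=True)
long_stay = TravelParty(travelers=2, bags=3, combined_bag_weight_lb=180.0,
                        stay_days=10, origin_grows_poppies=True)
print(should_flag(suspicious), should_flag(long_stay))
```

A rule like this is a classifier over a handful of features, which is why it flags every party that happens to share those few values, smuggler or not.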
In the case of threat analytics, we observe the addresses of the endpoints, the amount of data that flows to and from each endpoint, the geolocated countries of the endpoints and the frequency of communications. Given this context, shouldn’t it be easier to distinguish malicious from benign behavior?
Reducing False Positives
The following graph is an expected results curve for machine learning:
The more feature vectors (data) we can collect, combined with continual and up-to-date classification examples, the more easily machine learning can discern between malicious and benign behavior, with the false positive rate trending to near zero. False positives, or benign events incorrectly classified as malicious, are the bane of all machine learning, whether it’s used to detect drug smugglers or malicious network traffic.
A traveling aid worker might carry unusual tools in his or her luggage. Similarly, an unusual level of file transfer traffic from an untrustworthy geolocation might simply be a large download of vacation photos. False positives consume human resources to investigate events that turn out to be commonplace, thus diminishing the effectiveness of machine learning. These are still early days.
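To make the cost concrete, the false positive rate is commonly computed as the share of benign events that were flagged as malicious. The sketch below assumes that standard definition; the function name and the counts are illustrative, not measurements.

```python
def false_positive_rate(false_positives: int, true_negatives: int) -> float:
    """Fraction of benign events that were wrongly flagged as malicious."""
    benign_total = false_positives + true_negatives
    if benign_total == 0:
        raise ValueError("no benign events to evaluate")
    return false_positives / benign_total


# Illustrative counts only: 50 benign events flagged out of 10,000 benign events.
rate = false_positive_rate(false_positives=50, true_negatives=9_950)
print(f"{rate:.2%} of benign events need a human to clear them")
```

Even a rate that looks small becomes a heavy workload once it is multiplied by the volume of benign travelers or benign network events.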
Infinite Exemplars and Finite Data
So how do we decrease security false positives? It’s important to keep in mind that classification exemplars are effectively infinite, while the data describing each event is finite. If our traveler is flagged as malicious and subsequently proves to be legitimate, feed this back as another training exemplar of a valid traveler. The same holds true for network traffic: Simply classify the photo downloads as valid traffic. The more exemplars, the better the classification, right? This is true only to the extent that the feature vectors are granular enough to keep exemplars distinct, so that the same data points don’t end up carrying more than one classification.
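The limit described here can be checked directly on a labeled data set: when identical feature vectors carry different labels, no classifier using only those features can get all of them right. The following is a minimal sketch with made-up exemplars; the function name and the three coarse features are hypothetical.

```python
from collections import Counter, defaultdict


def unavoidable_errors(exemplars):
    """Count exemplars that any classifier must get wrong.

    `exemplars` is an iterable of (feature_tuple, label) pairs. For each
    distinct feature tuple, the best possible choice is the majority label,
    so every minority-labeled copy is an unavoidable error.
    """
    labels_by_features = defaultdict(Counter)
    for features, label in exemplars:
        labels_by_features[features][label] += 1
    return sum(
        sum(counts.values()) - max(counts.values())
        for counts in labels_by_features.values()
    )


# Made-up flows described by (source country, destination country, volume bucket).
exemplars = [
    (("US", "XX", "large"), "malicious"),
    (("US", "XX", "large"), "benign"),   # the vacation-photo download
    (("US", "XX", "large"), "benign"),
    (("US", "US", "small"), "benign"),
]
print(unavoidable_errors(exemplars))
```

Adding more exemplars with the same three features does not remove the conflict; only an additional feature that separates the photo download from the malicious transfer can.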
Feature vectors are limited, however, and in many instances exhausted. In the case of a traveler, we might add the color of his or her bags, style of clothes or frequency of travel. In the case of threat analytics, we might add file content type and file entropy. Still, at some point the data is finite. Some vacationers simply look like drug smugglers, and vice versa.
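File entropy is one of the few extra features that is cheap to compute. The sketch below assumes the usual Shannon entropy over byte frequencies, measured in bits per byte; the text does not say which entropy measure is meant, so treat this definition and the function name as assumptions.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


uniform = bytes(range(256))   # every byte value exactly once
repetitive = b"a" * 1024      # a single repeated byte value
print(shannon_entropy(uniform), shannon_entropy(repetitive))
```

Values near 8 bits per byte are typical of compressed or encrypted content, which describes both legitimate archives and exfiltrated data, so the feature helps but remains one more finite data point.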
Experts have conducted an abundance of research on network flow information. Unfortunately, flow information is easily exhausted because it only provides a few dozen features and suffers greatly from false positives, per the following diagram:
Here, exemplars continue to increase, but feature vectors are limited. The key to improving machine learning is to increase the number of features so that exemplars correctly classify events as malicious or benign. The traveler might be wearing additional attire, like sunglasses or a hat. Similarly, network traffic might be enriched with knowledge of tunneling or proxy use. An unfortunate reality for network analytics is that the desired data is typically not available or simply unattainable. So where does this leave us in terms of machine learning for security?
The Future of Machine Learning
Machine learning is very effective in eliminating white noise and classifying benign traffic with a high degree of accuracy: What it classifies as benign is reliably benign. Still, the false positive rates for predicting malicious events have been disappointing.
For machine learning to be effective, IT professionals must switch from an analytics-based approach to a data-centric one, as more analytics on the same data produces the same results. We need to derive better data if we are going to rely on machine learning alone for security assessments.
IBM has done extensive research using machine learning for security analytics in the areas of Domain Name System (DNS) and the insider threat. IBM is currently beta testing machine learning-based applications for DNS that detect tunneling, beaconing, fluxing, squatting and exfiltration.
Here, false positives are reduced by increasing feature vectors with supplemental WHOIS and URL/website analytics. The insider threat technology is based on user behavior analytics that classify typical and anomalous behavior. For example, logging into an application from the same endpoint with a different identifier would be an anomaly upon first occurrence.
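The login example can be sketched as a first-occurrence check. This is a minimal illustration of the idea, not the behavior analytics described above; the class name, method name and identifiers are hypothetical.

```python
class FirstSeenDetector:
    """Flags the first login to an endpoint by an identifier not seen there before."""

    def __init__(self):
        self._seen = {}  # endpoint -> set of identifiers already observed on it

    def observe(self, endpoint: str, user_id: str) -> bool:
        """Record a login and return True if it is anomalous."""
        known = self._seen.setdefault(endpoint, set())
        # Anomalous only if the endpoint already has an established identifier
        # and this one is new; once recorded, it becomes typical behavior.
        is_anomaly = bool(known) and user_id not in known
        known.add(user_id)
        return is_anomaly


detector = FirstSeenDetector()
results = [
    detector.observe("laptop-17", "alice"),  # first login establishes the baseline
    detector.observe("laptop-17", "alice"),  # same identifier, typical
    detector.observe("laptop-17", "bob"),    # different identifier, first occurrence
    detector.observe("laptop-17", "bob"),    # already seen, no longer anomalous
]
print(results)
```

Treating the second identifier as typical after one sighting is a simplification; a real system would presumably weigh how often and in what context each identifier appears.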
Machine learning can reduce the white noise, but without an injection of security-relevant data, it has a long way to go before it can be considered a generational leap in analytics. The next generation of machine learning-based security analytics will undoubtedly have a heavy focus on data acquisition.
CTO Security Intelligence, IBM