Machine Learning: When It Works and When It Doesn’t
It’s difficult to talk about security analytics without considering machine learning. Machine learning is used to detect malicious websites, flow anomalies, infectious files, infected endpoints and user behavior anomalies. It’s applied to big data repositories to glean information and insights that may otherwise go undetected.
Multiple industries are using machine learning to better automate security screening, border entry, college applicant selection, loan analytics and health care. Behind the scenes, almost every industry that affects our daily lives involves some type of machine learning.
Training the System
Machine learning is based on statistical analysis of existing data, with the resulting model applied to new data sets. In the case of college applicants, admission analysts train the system by feeding it transcripts, financial information, demographic information, high school information, SAT scores and any other data that seems relevant to the admission decision. In the case of network security, security analysts train the system by examining web browsing tendencies, entry/exit data, email tendencies, login authentication data and any other available user behavioral analytics.
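To make the training step concrete, here is a minimal sketch of one of the simplest such models: a nearest-centroid classifier built from scratch. The feature names and numbers are purely illustrative applicant data, not a real admissions system.

```python
# A minimal sketch of "training the system": a nearest-centroid
# classifier built from scratch on hypothetical applicant features.
# All feature names and values here are illustrative, not real data.

def train(examples):
    """Compute the mean feature vector (centroid) of each label."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}

def classify(centroids, features):
    """Assign the label whose centroid is closest (squared distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, features))
    return min(centroids, key=lambda label: dist(centroids[label]))

# Features: [GPA, SAT score / 1600, extracurricular count / 10]
training_data = [
    ([3.9, 0.93, 0.5], "accept"),
    ([3.7, 0.88, 0.7], "accept"),
    ([2.1, 0.55, 0.2], "reject"),
    ([2.4, 0.60, 0.1], "reject"),
]
model = train(training_data)
print(classify(model, [3.8, 0.90, 0.6]))  # → accept
```

Note that the model never records why a centroid sits where it does; it only averages the examples it was fed, which is exactly where the trouble described below begins.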
The goal is to identify and classify anomalous situations that serve to train the system. This sounds great, doesn't it? Not so fast: in its infancy, machine learning can produce errant results.
Machine Learning Flunks Its First Tests
For example, my son was recently denied a home mortgage loan. He has good credit, a steady job and met the minimum down payment requirement. His application was simply denied, with little explanation other than that the computer had determined he was a high risk. After weeks of significant pressure on the bank, we discovered that his job was classified as a high-risk occupation for long-term employment.
One of my colleagues was also recently flagged as a high-risk user based on his browsing habits and voice-over-IP (VoIP) usage. A full disclosure review identified his geolocated communications as the culprit, coupled with his web browsing habits. The browsing habits themselves could not be identified or fully disclosed; they were simply "anomalous." I suspect it was because he communicated with his family in his homeland often, and the geolocation was a country associated with cybercrime.
Accuracy and Classification
Machine learning makes complex statistical decisions on data based solely on the accuracy of classification. It recursively quantifies and correlates millions of potential decision trees until it finds the most accurate classification. In human terms, it does not understand why these decisions make sense, only which decisions yield the most accurate classifications. This is a real problem.
The above diagram is a very simple decision tree derived using machine learning classifiers; the ellipses represent the different data sets used by the classifiers. If this were a classifier for loan applications, would it make sense to a human? Machine learning makes decisions based on best-guess algorithms, but more importantly, it makes decisions that have no apparent human explanation.
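The tree-building process described above can be sketched in a few lines: greedily search every (feature, threshold) split and keep whichever one maximizes classification accuracy, with no notion of whether the split makes human sense. The loan rows and features below are invented for illustration.

```python
# A toy illustration of accuracy-driven tree building: exhaustively try
# every (feature, threshold) split and keep the most accurate one. The
# data is hypothetical; real systems repeat this search recursively.

def best_split(rows):
    """Search every (feature, threshold) split; keep the most accurate."""
    best = None
    for f in range(len(rows[0][0])):
        for features, _ in rows:
            t = features[f]
            left = [lbl for feat, lbl in rows if feat[f] <= t]
            right = [lbl for feat, lbl in rows if feat[f] > t]
            if not left or not right:
                continue
            # Predict the majority label on each side of the split.
            correct = (max(left.count(l) for l in set(left)) +
                       max(right.count(l) for l in set(right)))
            acc = correct / len(rows)
            if best is None or acc > best[0]:
                best = (acc, f, t)
    return best  # (accuracy, feature index, threshold)

# Hypothetical loan rows: [credit score / 850, years employed / 40]
rows = [
    ([0.85, 0.05], "deny"),    # good credit, but short job tenure
    ([0.80, 0.30], "approve"),
    ([0.60, 0.25], "approve"),
    ([0.55, 0.03], "deny"),
]
print(best_split(rows))  # splits on job tenure, not credit score
```

On this data the search settles on the employment-tenure feature because it happens to separate the labels perfectly, echoing the mortgage anecdote above: the model has a numerically optimal rule, but no human rationale attached to it.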
In fact, the main benefit of machine learning, the ability to make decisions that are not humanly evident, is also its potential danger. Imagine machine learning inaccurately identifying a website as malicious and then blocking access to it. The owner wants an explanation and remediation, but the classifier cannot explain its decision.
Machine learning is gradually touching every part of our lives and making decisions of which we may not be fully aware. There is a significant need to disclose both the underlying data and the classification schemes of these processes. IBM is currently working with machine learning analytics to determine domain or website maliciousness. With this comes the ethical responsibility to disclose the information and decision analytics that determine benign or malicious intent. If we deny access to a website, we must then provide the human explanation and core data that drove this action.
In security, this level of responsibility also has a legal dimension. What is the damage to the website owner if access is inappropriately denied? IBM is building both traceability and disclosure into our Domain Name System (DNS) security analytics and believes it will be a significant differentiator. It also carries an interesting side effect that involves human interaction to reclassify incorrect data: explain the decision in human terms, then allow someone to educate the classifier with new data. Maybe we add a "like" or "don't like" button for misclassified data.
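The reclassification loop described above can be sketched as a classifier that keeps an audit trail and accepts human corrections. All names here are hypothetical illustrations, not IBM's actual DNS analytics API.

```python
# A minimal sketch of the human feedback loop: when a classification is
# disputed, record the correction and keep an audit trail so the decision
# can later be explained. Names are hypothetical, not a real product API.

class AuditableClassifier:
    def __init__(self):
        self.labels = {}      # domain -> current label
        self.audit_log = []   # (domain, old_label, new_label, reason)

    def classify(self, domain, label, reason):
        """Record an automated verdict along with its stated rationale."""
        self.labels[domain] = label
        self.audit_log.append((domain, None, label, reason))

    def dispute(self, domain, corrected_label, reason):
        """The 'don't like' button: a human reclassifies with evidence."""
        old = self.labels.get(domain)
        self.labels[domain] = corrected_label
        self.audit_log.append((domain, old, corrected_label, reason))

clf = AuditableClassifier()
clf.classify("example.org", "malicious", "anomalous traffic pattern")
clf.dispute("example.org", "benign", "owner verified; traffic was benign")
print(clf.labels["example.org"])  # → benign
print(len(clf.audit_log))         # → 2
```

The point of the design is that every verdict, automated or human, carries a human-readable reason, so the "core data that drove this action" is always available for disclosure.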
IBM prides itself on business ethics as one of its core foundations to help build trust with consumers. Therefore, we're building transparency into our machine learning analytics and striving to be right far more often than we're wrong. I would encourage anyone evaluating machine learning products to challenge the transparency of the offering and demand humanly interpretable audits of its outcomes.