Whether we realize it or not, machine learning touches our daily lives in many ways. When you upload a picture on social media, for example, you might be prompted to tag other people in the photo. That’s called image recognition, a machine learning capability by which the computer learns to identify facial features. Other examples include number and voice recognition applications.
From an intrusion detection perspective, analysts can apply machine learning, data mining and pattern recognition algorithms to distinguish between normal and malicious traffic.
Boosting Intrusion Detection With Machine Learning
One way that a computer can learn is by examples. For instance, a computer can learn to recognize a specific object, such as a car:
The computer can extract features from the car such as its color — in this case, red. If we classify the object by its color, we can model it as follows:
The algorithm then generates the following learning/classifying/decision tree:
After the computer learns the above, you can ask it to classify the following object:
The computer will classify the rose as a car because it is also red. We need to extract more valuable and discriminate features, such as shape, to help the computer differentiate the car from any other red object.
The Need for Intelligent IDS
An intrusion detection system (IDS) monitors the network traffic looking for suspicious activity, which could represent an attack or unauthorized access. Traditional systems were designed to detect known attacks but cannot identify unknown threats. They most commonly detect known threats based on defined rules or behavioral analysis through baselining the network.
A sophisticated attacker can bypass these techniques, so the need for more intelligent intrusion detection is increasing by the day. Researchers are attempting to apply machine learning techniques to this area of cybersecurity.
The foundation of any intelligent IDS is a robust data set to provide examples from which the computer can learn. Today, however, very little security data is publicly available. That’s why I conducted an experiment in which I created a small, new data set with discernible features that can help analysts train computers to detect the most serious threats, even zero-day attacks.
Network Traffic Analysis
Network traffic can be analyzed at the packet, connection or session level. In general, the connection represents a bidirectional flow and the session represents multiple connections between the same source and destination.
In my prototype system, I used the powerful network analysis platform Bro to analyze traffic based on the connection level. Bro can monitor Transmission Control Protocol (TCP), User Datagram Protocol (UDP) and Internet Control Message Protocol (ICMP), and write the analyzed traffic to well-structured, tab-separated files suitable for post-processing. The platform interprets UDP and ICMP connection using flow semantics.
Bro writes several log files about network traffic. The conn.log file, for example, contains generic information about each connection, such as the time stamp, connection ID, source IP, source port, destination IP and destination port. This information is not enough. To extract more features from the network traffic, we need to create features and attributes to help us distinguish between normal and harmful traffic.
It is challenging to stick with generic features. It is not useful to extract features for each application-layer protocol, since there are thousands. In his paper, “Machine Learning for Application-Layer Intrusion Detection,” researcher Konrad Rieck explained the benefits of selecting generic features, such as those as shown below:
This is a great start, but we still need more features to help the machine recognize attacks. To add more depth to the analysis, we should determine whether the payload contains:
- SQL command or SQL injection queries; and
- Command injection.
Those features can help the machine detect zero-day and web application attacks. To extract all the features, I limit the extraction process to the data sent by the source of the connection.
Most features can be extracted using a regular expression or calculated directly from the connection content. Shellcode is a notable exception, because attackers can encrypt, compress or encode it. I used Libemu, a x86 emulation and shellcode detection library, which works well but still can’t detect unencypted shellcodes. To solve this problem, at the suggestion of Dr. Ali Hadi, I used malware analysis platform Cuckoo Sandbox. Hadi suggested extracting more features from the traffic, such as the sequence of application program interfaces (APIs).
Both features are important for detecting shellcode and malware. By running the whole payload as a sequence of instructions in Cuckoo Sandbox, I can determine whether it represents an attack based on whether the system calls for a Windows Sockets 2 (Winsock) API.
Creating Useful Data Sets
So we’ve captured and analyzed the network traffic. How do we label it as normal or malicious traffic?
For my experiment, I installed Ubuntu to be used as a target machine, as well as the Damn Vulnerable Web Application (DVWA), a dummy application designed to help security professionals test their cyberdefense skills. I launched several attacks against the DVWA from a different computer, then used Bro to analyze the traffic between the two machines. I also configured Bro to extract the connection content as binary files.
I launched several types of attacks, such as SQL injection, command injection and cross-site scripting (XSS), against the vulnerable web application on the target machine. To conduct an SQL injection from the attacking machine, for example, open the target web app, navigate to the SQL injection tab and write the following in the text field:
(%’ or 0=0 union select null, table_name from information_schema. tables #)
If the web app is vulnerable, the result will look like this:
Bro then outputs several log files, including conn.log, which contains general information about each network connection. Each row in the image below represents a connection
I also configured Bro to extract the content of the connection in a separate file as I performed the attacks. This way, I know what attack data was sent to the vulnerable web application. To classify the connections, I used a hex dump to see each connection content file:
According to the content, I classified the connection to the corresponding attack type. I then inspected the content of each connection file using the hex dump tool to find the exact attack traffic. We can see that the user sent the following in a GET request:
After decoding the request, you will see the following:
%’ or 0=0 union select null, table_name from information_schema.tables #
Now that we’ve identified this connection content as an attack connection, specifically an SQL attack, we will label it as such in the spreadsheet.
The data set contains 41 instances with 33 attributes, as illustrated below.
The following figure shows the newly created data set.
Now that we have a good data set with features to detect advanced attacks, we can use it to train the computer to classify new connections.
Selecting and Classifying Features
I selected nine of the most important and generic features out of 33 to train the computer to recognize the attacks:
- Number of nonprintable characters;
- Number of punctuation characters;
- Contains SQL statement;
- Contains command injection; and
For the classification, I used Weka, a collection of machine learning algorithms for data mining tasks. For the testing, I used a cross-validation with 10 folds.
The table below shows the classification accuracy using several machine learning algorithms.
Intrusion Detection in the Cognitive Era
Security analysts can use machine learning to build an effective intrusion detection capability. The trick is to select the right features to create the most effective data set with which to train the machine to distinguish between normal and malicious traffic.
This is just one of the many ways IT professionals can apply cognitive computing to cybersecurity. You can even combine machine learning with your existing IDS by importing the induced rules from the classification tree into the system.