Applying Machine Learning to Improve Your Intrusion Detection System

Whether we realize it or not, machine learning touches our daily lives in many ways. When you upload a picture to social media, for example, you might be prompted to tag other people in the photo. That’s called image recognition, a machine learning capability by which the computer learns to identify facial features. Other examples include number and voice recognition applications.

From an intrusion detection perspective, analysts can apply machine learning, data mining and pattern recognition algorithms to distinguish between normal and malicious traffic.

Boosting Intrusion Detection With Machine Learning

One way that a computer can learn is by example. For instance, a computer can learn to recognize a specific object, such as a car:

Red Car

The computer can extract features from the car such as its color — in this case, red. If we classify the object by its color, we can model it as follows:

Object ID    Color    Class
1            Red      car
2            Blue     not car
3            Red      car

The algorithm then generates the following learning/classifying/decision tree:

decision tree

After the computer learns the above, you can ask it to classify the following object:

red rose

The computer will classify the rose as a car because it is also red. We need to extract more valuable and discriminative features, such as shape, to help the computer differentiate the car from any other red object.
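The limitation described above can be reproduced in a few lines. This is only a minimal sketch, assuming a one-level tree that splits on color alone; the function names `train_by_color` and `classify` are hypothetical and not part of any library.

```python
from collections import Counter, defaultdict

# Training examples from the table above: (color, class).
TRAINING = [("red", "car"), ("blue", "not car"), ("red", "car")]


def train_by_color(examples):
    """Learn a one-level decision tree: color -> most common class."""
    votes = defaultdict(Counter)
    for color, label in examples:
        votes[color][label] += 1
    return {color: counts.most_common(1)[0][0] for color, counts in votes.items()}


def classify(tree, color, default="not car"):
    """Follow the single split on color; unseen colors get the default."""
    return tree.get(color, default)


tree = train_by_color(TRAINING)
# A red rose has the same color as the cars, so it is labeled "car" too.
print(classify(tree, "red"))
```

Because color is the only feature, nothing in the model can separate a red rose from a red car. Adding a second feature such as shape would give the tree another split to work with.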

The Need for Intelligent IDS

An intrusion detection system (IDS) monitors network traffic for suspicious activity, which could represent an attack or unauthorized access. Traditional systems were designed to detect known attacks and cannot identify unknown threats. They most commonly rely on defined rules or on behavioral analysis that baselines the network.

A sophisticated attacker can bypass these techniques, so the need for more intelligent intrusion detection is increasing by the day. Researchers are attempting to apply machine learning techniques to this area of cybersecurity.

The foundation of any intelligent IDS is a robust data set to provide examples from which the computer can learn. Today, however, very little security data is publicly available. That’s why I conducted an experiment in which I created a small, new data set with discernible features that can help analysts train computers to detect the most serious threats, even zero-day attacks.

Network Traffic Analysis

Network traffic can be analyzed at the packet, connection or session level. In general, the connection represents a bidirectional flow and the session represents multiple connections between the same source and destination.

In my prototype system, I used the powerful network analysis platform Bro to analyze traffic at the connection level. Bro can monitor Transmission Control Protocol (TCP), User Datagram Protocol (UDP) and Internet Control Message Protocol (ICMP), and write the analyzed traffic to well-structured, tab-separated files suitable for post-processing. The platform interprets UDP and ICMP connections using flow semantics.

Feature Extraction

Bro writes several log files about network traffic. The conn.log file, for example, contains generic information about each connection, such as the time stamp, connection ID, source IP, source port, destination IP and destination port. This information is not enough. To extract more features from the network traffic, we need to create features and attributes to help us distinguish between normal and harmful traffic.
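As a rough illustration of post-processing such a log, the sketch below reads a tab-separated Bro log by taking the column names from its `#fields` header line. The sample log content is made up for the example, and `read_bro_log` is a hypothetical helper, not a Bro utility.

```python
import csv
import io

# A tiny, made-up conn.log excerpt for illustration only (tab-separated).
SAMPLE_LOG = (
    "#separator \\x09\n"
    "#fields\tts\tuid\tid.orig_h\tid.orig_p\tid.resp_h\tid.resp_p\tproto\tservice\n"
    "1465000000.000000\tCabc123\t192.168.1.10\t51000\t192.168.1.20\t80\ttcp\thttp\n"
    "#close\t2016-06-04-00-00-00\n"
)


def read_bro_log(stream):
    """Yield one dict per connection, keyed by the names on the #fields line."""
    fields = None
    for row in csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row:
            continue
        if row[0] == "#fields":
            fields = row[1:]
        elif row[0].startswith("#"):
            continue  # other metadata lines (#separator, #types, #close, ...)
        elif fields is not None:
            yield dict(zip(fields, row))


connections = list(read_bro_log(io.StringIO(SAMPLE_LOG)))
print(connections[0]["id.orig_h"], "->", connections[0]["id.resp_h"])
```

Reading the column names from the header rather than hard-coding them keeps the sketch independent of which fields a particular Bro configuration writes.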

The challenge is to stick with generic features. Extracting features for each application-layer protocol is impractical, since there are thousands of them. In his paper, “Machine Learning for Application-Layer Intrusion Detection,” researcher Konrad Rieck explained the benefits of selecting generic features, such as those shown below:

Generic features suggested by Prof. K. Rieck

This is a great start, but we still need more features to help the machine recognize attacks. To add more depth to the analysis, we should determine whether the payload contains:

  • Shellcode;
  • JavaScript code;
  • SQL command or SQL injection queries; and
  • Command injection.

Those features can help the machine detect zero-day and web application attacks. To extract all the features, I limited the extraction process to the data sent by the source of the connection.

Most features can be extracted using a regular expression or calculated directly from the connection content. Shellcode is a notable exception, because attackers can encrypt, compress or encode it. I used Libemu, an x86 emulation and shellcode detection library, which works well but still can’t detect unencrypted shellcode. To solve this problem, at the suggestion of Dr. Ali Hadi, I used the malware analysis platform Cuckoo Sandbox. Hadi suggested extracting more features from the traffic, such as the sequence of application program interfaces (APIs).
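The regular-expression approach could look something like the sketch below. The patterns are hypothetical and deliberately simple; they are not the expressions used in the experiment, and a real detector would need far broader coverage. The sketch also assumes the payload has already been URL-decoded.

```python
import re

# Hypothetical, deliberately simple patterns for illustration only.
PATTERNS = {
    "contains_javascript": re.compile(rb"<script\b|javascript:", re.IGNORECASE),
    "contains_sql": re.compile(
        rb"\bunion\s+select\b|\bor\s+\d+\s*=\s*\d+", re.IGNORECASE
    ),
    "contains_command_injection": re.compile(
        rb"[;&|]\s*(cat|ls|id|whoami|wget)\b"
    ),
}


def payload_flags(payload: bytes) -> dict:
    """Return one boolean feature per pattern for a decoded payload."""
    return {name: bool(p.search(payload)) for name, p in PATTERNS.items()}


sql_payload = (
    b"id=%' or 0=0 union select null, table_name "
    b"from information_schema.tables #"
)
print(payload_flags(sql_payload))
print(payload_flags(b"ip=127.0.0.1; cat /etc/passwd"))
```

Each pattern yields a yes/no attribute, which suits features such as "contains SQL statement" that only record presence or absence.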

Both features are important for detecting shellcode and malware. By running the whole payload as a sequence of instructions in Cuckoo Sandbox, I can determine whether it represents an attack based on whether it calls a Windows Sockets 2 (Winsock) API.

Creating Useful Data Sets

So we’ve captured and analyzed the network traffic. How do we label it as normal or malicious traffic?

For my experiment, I installed Ubuntu to be used as a target machine, as well as the Damn Vulnerable Web Application (DVWA), a dummy application designed to help security professionals test their cyberdefense skills. I launched several attacks against the DVWA from a different computer, then used Bro to analyze the traffic between the two machines. I also configured Bro to extract the connection content as binary files.

I launched several types of attacks, such as SQL injection, command injection and cross-site scripting (XSS), against the vulnerable web application on the target machine. To conduct an SQL injection from the attacking machine, for example, open the target web app, navigate to the SQL injection tab and write the following in the text field:

%' or 0=0 union select null, table_name from information_schema.tables #

If the web app is vulnerable, the result will look like this:

DVWA SQL Injection example

Bro then outputs several log files, including conn.log, which contains general information about each network connection. Each row in the image below represents a connection:

Bro conn.log file

I also configured Bro to extract the content of the connection in a separate file as I performed the attacks. This way, I know what attack data was sent to the vulnerable web application. To classify the connections, I used a hex dump to see each connection content file:

hex dump of a connection content

Based on the content, I assigned each connection to the corresponding attack type. I then inspected the content of each connection file using the hex dump tool to find the exact attack traffic. We can see that the user sent the following in a GET request:

%25%27+or+0%3D0+union+select+null%2C+table_name+from+information_schema.tables+%23

After decoding the request, you will see the following:

%' or 0=0 union select null, table_name from information_schema.tables #
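The decoding step can be done with Python's standard library. The sketch below is a round trip on the injection string rather than a replay of the captured traffic: it encodes the payload the way an HTML form GET request would (spaces become `+`, reserved characters become `%XX`) and then decodes it again.

```python
from urllib.parse import quote_plus, unquote_plus

payload = "%' or 0=0 union select null, table_name from information_schema.tables #"

# Form-style encoding: spaces -> '+', reserved characters -> %XX.
encoded = quote_plus(payload)
# Decoding reverses both substitutions.
decoded = unquote_plus(encoded)

print(encoded)
print(decoded == payload)
```

`unquote_plus` is used instead of `unquote` because only the former turns `+` back into a space, which matters for query strings submitted by forms.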

Now that we’ve identified this connection content as an attack connection, specifically an SQL attack, we will label it as such in the spreadsheet.

The data set contains 41 instances with 33 attributes, as illustrated below.

Dataset attack classes

The following figure shows the newly created data set.

final classified dataset

Now that we have a good data set with features to detect advanced attacks, we can use it to train the computer to classify new connections.

Selecting and Classifying Features

I selected nine of the 33 attributes, the eight most important and generic features plus the class label, to train the computer to recognize the attacks:

  • Protocol;
  • Service;
  • Entropy;
  • Number of nonprintable characters;
  • Number of punctuation characters;
  • Contains JavaScript;
  • Contains SQL statement;
  • Contains command injection; and
  • Class.
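Three of the attributes listed above are numeric and can be calculated directly from the connection content. The sketch below shows one plausible way to do so; it assumes Shannon entropy over bytes and ASCII definitions of "printable" and "punctuation", which may differ from the exact definitions used to build the data set.

```python
import math
import string
from collections import Counter

# ASCII punctuation as byte values (assumption: ASCII-only definition).
PUNCTUATION = set(string.punctuation.encode("ascii"))


def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte (0.0 for empty input)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())


def count_nonprintable(data: bytes) -> int:
    """Bytes outside printable ASCII (0x20-0x7E), excluding tab, CR and LF."""
    return sum(
        1 for b in data if not (0x20 <= b <= 0x7E or b in (0x09, 0x0A, 0x0D))
    )


def count_punctuation(data: bytes) -> int:
    """Number of ASCII punctuation bytes in the payload."""
    return sum(1 for b in data if b in PUNCTUATION)


sample = b"id=1' or 0=0 #"
print(shannon_entropy(sample), count_nonprintable(sample), count_punctuation(sample))
```

High entropy and many nonprintable bytes tend to point to binary, compressed or encrypted content, while a high punctuation count is common in injection strings.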

For the classification, I used Weka, a collection of machine learning algorithms for data mining tasks. For the testing, I used 10-fold cross-validation.
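To make the testing procedure concrete, here is a minimal sketch of how 10-fold cross-validation partitions a data set of 41 instances. It is not Weka's implementation; it uses a plain shuffled split without stratification, and the helper names are hypothetical.

```python
import random


def k_fold_indices(n_instances, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds of near-equal size."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validation_splits(n_instances, k=10, seed=0):
    """Yield (train_indices, test_indices); each fold is the test set once."""
    folds = k_fold_indices(n_instances, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(cross_validation_splits(41, k=10))
# With 41 instances, one fold holds 5 test instances and the other nine hold 4.
print([len(test) for _, test in splits])
```

Every instance is used for testing exactly once and for training in the other nine rounds, so the reported accuracy is an average over all ten rounds rather than the result of a single lucky split.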

The table below shows the classification accuracy using several machine learning algorithms.

classification results using multiple learning algorithms

Intrusion Detection in the Cognitive Era

Security analysts can use machine learning to build an effective intrusion detection capability. The trick is to select the right features to create the most effective data set with which to train the machine to distinguish between normal and malicious traffic.

This is just one of the many ways IT professionals can apply cognitive computing to cybersecurity. You can even combine machine learning with your existing IDS by importing the induced rules from the classification tree into the system.

Mutaz Alsallal

MSS SIEM Analyst, IBM

Mutaz Alsallal is an MSS SIEM Analyst with IBM. In this role, he works to detect intruders based on analysis of security and network events. Prior to his role at IBM, he co-founded Jamalon, the largest online bookstore in the Middle East, and was a member of the Security Operation Center Support for Umniah Belong. Mutaz holds dual Computer Science degrees from Petra University and Wroclaw University of Technology.