Whether we realize it or not, machine learning touches our daily lives in many ways. When you upload a picture on social media, for example, you might be prompted to tag other people in the photo. That’s called image recognition, a machine learning capability by which the computer learns to identify facial features. Other examples include number and voice recognition applications.

From an intrusion detection perspective, analysts can apply machine learning, data mining and pattern recognition algorithms to distinguish between normal and malicious traffic.

Boosting Intrusion Detection With Machine Learning

One way that a computer can learn is by examples. For instance, a computer can learn to recognize a specific object, such as a car:

The computer can extract features from the car such as its color — in this case, red. If we classify the object by its color, we can model it as follows:

Object ID Color Class
Red car
Blue not car
Red car
Scroll to view full table

The algorithm then generates the following learning/classifying/decision tree:

After the computer learns the above, you can ask it to classify the following object:

The computer will classify the rose as a car because it is also red. We need to extract more valuable and discriminate features, such as shape, to help the computer differentiate the car from any other red object.

The Need for Intelligent IDS

An intrusion detection system (IDS) monitors the network traffic looking for suspicious activity, which could represent an attack or unauthorized access. Traditional systems were designed to detect known attacks but cannot identify unknown threats. They most commonly detect known threats based on defined rules or behavioral analysis through baselining the network.

A sophisticated attacker can bypass these techniques, so the need for more intelligent intrusion detection is increasing by the day. Researchers are attempting to apply machine learning techniques to this area of cybersecurity.

The foundation of any intelligent IDS is a robust data set to provide examples from which the computer can learn. Today, however, very little security data is publicly available. That’s why I conducted an experiment in which I created a small, new data set with discernible features that can help analysts train computers to detect the most serious threats, even zero-day attacks.

Network Traffic Analysis

Network traffic can be analyzed at the packet, connection or session level. In general, the connection represents a bidirectional flow and the session represents multiple connections between the same source and destination.

In my prototype system, I used the powerful network analysis platform Bro to analyze traffic based on the connection level. Bro can monitor Transmission Control Protocol (TCP), User Datagram Protocol (UDP) and Internet Control Message Protocol (ICMP), and write the analyzed traffic to well-structured, tab-separated files suitable for post-processing. The platform interprets UDP and ICMP connection using flow semantics.

Feature Extraction

Bro writes several log files about network traffic. The conn.log file, for example, contains generic information about each connection, such as the time stamp, connection ID, source IP, source port, destination IP and destination port. This information is not enough. To extract more features from the network traffic, we need to create features and attributes to help us distinguish between normal and harmful traffic.

It is challenging to stick with generic features. It is not useful to extract features for each application-layer protocol, since there are thousands. In his paper, “Machine Learning for Application-Layer Intrusion Detection,” researcher Konrad Rieck explained the benefits of selecting generic features, such as those as shown below:

This is a great start, but we still need more features to help the machine recognize attacks. To add more depth to the analysis, we should determine whether the payload contains:

  • Shellcode;
  • JavaScript code;
  • SQL command or SQL injection queries; and
  • Command injection.

Those features can help the machine detect zero-day and web application attacks. To extract all the features, I limit the extraction process to the data sent by the source of the connection.

Most features can be extracted using a regular expression or calculated directly from the connection content. Shellcode is a notable exception, because attackers can encrypt, compress or encode it. I used Libemu, a x86 emulation and shellcode detection library, which works well but still can’t detect unencypted shellcodes. To solve this problem, at the suggestion of Dr. Ali Hadi, I used malware analysis platform Cuckoo Sandbox. Hadi suggested extracting more features from the traffic, such as the sequence of application program interfaces (APIs).

Both features are important for detecting shellcode and malware. By running the whole payload as a sequence of instructions in Cuckoo Sandbox, I can determine whether it represents an attack based on whether the system calls for a Windows Sockets 2 (Winsock) API.

Creating Useful Data Sets

So we’ve captured and analyzed the network traffic. How do we label it as normal or malicious traffic?

For my experiment, I installed Ubuntu to be used as a target machine, as well as the Damn Vulnerable Web Application (DVWA), a dummy application designed to help security professionals test their cyberdefense skills. I launched several attacks against the DVWA from a different computer, then used Bro to analyze the traffic between the two machines. I also configured Bro to extract the connection content as binary files.

I launched several types of attacks, such as SQL injection, command injection and cross-site scripting (XSS), against the vulnerable web application on the target machine. To conduct an SQL injection from the attacking machine, for example, open the target web app, navigate to the SQL injection tab and write the following in the text field:

(%’ or 0=0 union select null, table_name from information_schema. tables #)

If the web app is vulnerable, the result will look like this:

Bro then outputs several log files, including conn.log, which contains general information about each network connection. Each row in the image below represents a connection

I also configured Bro to extract the content of the connection in a separate file as I performed the attacks. This way, I know what attack data was sent to the vulnerable web application. To classify the connections, I used a hex dump to see each connection content file:

According to the content, I classified the connection to the corresponding attack type. I then inspected the content of each connection file using the hex dump tool to find the exact attack traffic. We can see that the user sent the following in a GET request:

%25%27+0%3D0+union+select+null%2C+table_name+from+information_schema.tables+%23

After decoding the request, you will see the following:

%’ or 0=0 union select null, table_name from information_schema.tables #

Now that we’ve identified this connection content as an attack connection, specifically an SQL attack, we will label it as such in the spreadsheet.

The data set contains 41 instances with 33 attributes, as illustrated below.

The following figure shows the newly created data set.

Now that we have a good data set with features to detect advanced attacks, we can use it to train the computer to classify new connections.

Selecting and Classifying Features

I selected nine of the most important and generic features out of 33 to train the computer to recognize the attacks:

  • Protocol;
  • Service;
  • Entropy;
  • Number of nonprintable characters;
  • Number of punctuation characters;
  • Contains JavaScript;
  • Contains SQL statement;
  • Contains command injection; and
  • Class.

For the classification, I used Weka, a collection of machine learning algorithms for data mining tasks. For the testing, I used a cross-validation with 10 folds.

The table below shows the classification accuracy using several machine learning algorithms.

Intrusion Detection in the Cognitive Era

Security analysts can use machine learning to build an effective intrusion detection capability. The trick is to select the right features to create the most effective data set with which to train the machine to distinguish between normal and malicious traffic.

This is just one of the many ways IT professionals can apply cognitive computing to cybersecurity. You can even combine machine learning with your existing IDS by importing the induced rules from the classification tree into the system.

Read the white paper: Cybersecurity in the cognitive era

More from Intelligence & Analytics

ITG10 Likely Targeting South Korean Entities of Interest to the Democratic People’s Republic of Korea (DPRK)

7 min read - In late April 2023, IBM Security X-Force uncovered documents that are most likely part of a phishing campaign mimicking credible senders, orchestrated by a group X-Force refers to as ITG10, and aimed at delivering RokRAT malware, similar to what has been observed by others. ITG10's tactics, techniques and procedures (TTPs) overlap with APT37 and ScarCruft. The initial delivery method is conducted via a LNK file, which drops two Windows shortcut files containing obfuscated PowerShell scripts in charge of downloading a…

7 min read

SOCs Spend 32% of the Day On Incidents That Pose No Threat

4 min read - When it comes to the first line of defense for any company, its Security Operations Center (SOC) is an essential component. A SOC is a dedicated team of professionals who monitor networks and systems for potential threats, provide analysis of detected issues and take the necessary actions to remediate any risks they uncover. Unfortunately, SOC members spend nearly one-third (32%) of their day investigating incidents that don't actually pose a real threat to the business according to a new report…

4 min read

BlackCat (ALPHV) Ransomware Levels Up for Stealth, Speed and Exfiltration

9 min read - This blog was made possible through contributions from Kat Metrick, Kevin Henson, Agnes Ramos-Beauchamp, Thanassis Diogos, Diego Matos Martins and Joseph Spero. BlackCat ransomware, which was among the top ransomware families observed by IBM Security X-Force in 2022, according to the 2023 X-Force Threat Intelligence Index, continues to wreak havoc across organizations globally this year. BlackCat (a.k.a. ALPHV) ransomware affiliates' more recent attacks include targeting organizations in the healthcare, government, education, manufacturing and hospitality sectors. Reportedly, several of these incidents resulted…

9 min read

Despite Tech Layoffs, Cybersecurity Positions are Hiring

4 min read - It’s easy to read today’s headlines and think that now isn’t the best time to look for a job in the tech industry. However, that’s not necessarily true. When you read deeper into the stories and numbers, cybersecurity positions are still very much in demand. Cybersecurity professionals are landing jobs every day, and IT professionals from other roles may be able to transfer their skills into cybersecurity relatively easily. As cybersecurity continues to remain a top business priority, organizations will…

4 min read