Everything you do in threat intelligence is about indicators or patterns. In a binary world, patterns are actually just how different indicators work together in the chain of a malicious event.

Working with threat intelligence for years now, I’ve often asked myself several fundamental cyberthreat intelligence questions:

  • What exactly is this attack and how can I identify it?
  • Is this attack description related to the technology I use?
  • Is the company I want to protect a possible target?
  • What would this attack look like in my environment?

While thinking about the right answers to these questions and creating your own attack model and hypothesis can be a lot of fun, extracting indicators from intelligence and enriching them is usually not. To speed up this task, I always prefer to write small scripts or code snippets to do these things for me, so I can focus on the fun part.

Extracting Different Types of Indicators From Cyberthreat Intelligence

All code examples below can be found on my public GitHub repository. In this blog, we will mainly look at code snippets used in indifetch.py, which is exactly what it sounds like: fetching indicators from a text or string. As a disclaimer, you should always review the regex used and not trust the code blindly. I always advise some cross-checking and keeping an eye out for a better regex or faster way to do a task — the following is just one way to do this.

Hashes

Hashes are fairly easy. They normally come in two flavors, MD5 and SHA. A function that covers MD5 could look like this:


<p class=""><em>import re
def getMD5(text):
    # A set stores each hash only once
    thisset = set()
    # 32 consecutive hexadecimal characters
    md5_r = re.compile(r"([a-fA-F\d]{32})")
    for item in md5_r.findall(text):
        thisset.add(item)
    return thisset</em></p>

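As you can see, we use the simple regex [a-fA-F\d]{32} to fetch the indicator out of a given text. The regex matches any run of 32 hexadecimal characters: the letters "a" to "f" in either case, and digits. Because it has no boundary checks, it also matches the first 32 characters of a longer hex string such as a SHA256 hash, so consider wrapping the pattern in \b on both sides. We use a Python set because items in a set are unique, eliminating duplicates right from the start.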

Changing this function to cover SHA256 is an easy next step. Although SHA256 is a totally different algorithm, its hexadecimal representation looks the same, only twice as long: 64 characters instead of 32 ([a-fA-F\d]{64}).
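As a minimal sketch of that next step, the function below generalizes the MD5 example to several hash lengths. The name get_hashes, the HASH_LENGTHS table and the added \b word boundaries are my own assumptions for illustration and are not taken from indifetch.py; the boundaries keep a 64-character SHA256 string from also being reported as an MD5.

```python
import re

# Hex digest lengths: MD5 = 32, SHA-1 = 40, SHA-256 = 64 characters.
# This table and the function name are illustrative assumptions.
HASH_LENGTHS = {"md5": 32, "sha1": 40, "sha256": 64}


def get_hashes(text):
    """Return a dict mapping each hash type to the set of matches in text."""
    result = {}
    for name, length in HASH_LENGTHS.items():
        # \b on both sides rejects substrings of longer hex strings
        pattern = re.compile(r"\b[a-fA-F\d]{%d}\b" % length)
        result[name] = set(pattern.findall(text))
    return result
```

Calling get_hashes(report_text) returns one set per hash type, so each indicator is still deduplicated from the start.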

IP Addresses

Next up are IPs. IPv4, in particular, follows a pretty simple pattern: four numbers, none higher than 255, separated by dots.


<p class="left"><em>import re
def getIP(text):
    IPlist = set()
    # Four octets in the range 0-255, separated by dots
    ip = re.compile(r"\b(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\b")
    for item in ip.findall(text):
        IPlist.add(item)
    return IPlist</em></p>

We use the same style of function, as I always try to reuse code if I can. One potential problem is that version numbers sometimes follow the same dotted pattern (for example, 2.0.50.1), so the regex can report them as IP addresses.
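One possible way to reduce these false positives is to look at the text just before each match and skip candidates that follow a word such as "version" or "build". This is only a heuristic sketch: the function name get_ip_filtered, the hint words and the lookbehind length are assumptions of mine, and it will miss version numbers that appear without such a hint.

```python
import re

IP_RE = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\b"
)
# Hypothetical hint words that often precede a version number
VERSION_HINT = re.compile(r"\b(?:version|ver\.?|build|release)\s*$", re.IGNORECASE)


def get_ip_filtered(text, lookbehind=12):
    """Return IPv4-looking strings that are not preceded by a version hint."""
    found = set()
    for match in IP_RE.finditer(text):
        # Inspect a few characters of context before the candidate
        context = text[max(0, match.start() - lookbehind):match.start()]
        if VERSION_HINT.search(context):
            continue
        found.add(match.group(0))
    return found
```

Anything this filter drops is worth a manual look, since a real address can also follow one of the hint words.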

URLs and Domains

The last common indicators we will cover in a function are URLs and domains. Many reports will have a large set of URLs since most malware has to connect to a command-and-control (C&C) server or exfiltrate data. In addition, because proxies and firewalls log them, URLs are often among the easier indicators to catch.

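<em>import re
def getURL(text):
    URLlist = set()
    # The http[s]?:// prefix anchors matches to URLs; without it the
    # character class matches almost any token in the text
    urls = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    for item in urls.findall(text):
        URLlist.add(item)
    return URLlist</em>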

You are familiar with the style by now. The regex above has worked well for me, but I have also used simpler representations in the past, such as:


<p class="left"><em>http[a-zA-Z0-9./:]*</em></p>
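Domains can then be derived from the extracted URLs rather than matched with a second regex. The sketch below uses urlsplit from Python's standard library; the function name get_domains is an assumption for illustration and is not part of indifetch.py.

```python
from urllib.parse import urlsplit


def get_domains(urls):
    """Return the set of hostnames found in an iterable of URL strings."""
    domains = set()
    for url in urls:
        try:
            # hostname is lowercased and excludes the port and path
            host = urlsplit(url).hostname
        except ValueError:
            # Malformed URLs (for example, broken IPv6 brackets) are skipped
            continue
        if host:
            domains.add(host)
    return domains
```

Passing it the set returned by the URL function gives a deduplicated set of hostnames to check against proxy or DNS logs.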

Speed Up Your Threat Analysis

Python is a great tool for scripting these little code snippets and speeding up threat analysis. Some scripts can even extract indicators and make an API call to resources like the X-Force Exchange to get the current scoring of the indicator, further speeding up the process.

One regex you may want to look into is for Common Vulnerabilities and Exposures (CVE) numbers. I tend to use something simple, such as:


<p class="left"><em>CVE[^\w]*\d{4}[^\w]+\d{4,}</em></p>
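Written in the same style as the other functions, a CVE extractor could look like the sketch below. It uses the regex above with two capture groups added so that differently written identifiers (such as "CVE 2017 0144") are normalized to one form; the function name get_cves and the normalization step are my assumptions.

```python
import re

# Same pattern as above, with capture groups for the year and the number
CVE_RE = re.compile(r"CVE[^\w]*(\d{4})[^\w]+(\d{4,})")


def get_cves(text):
    """Return the set of CVE identifiers in text, normalized to CVE-YYYY-NNNN."""
    return {"CVE-%s-%s" % (year, number) for year, number in CVE_RE.findall(text)}
```

Normalizing before adding to the set means the same CVE written two different ways is only reported once.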

Note: Remember to always use set() in Python instead of lists to remove duplicates right from the start. This comes in handy especially when you automate API calls as part of the script.
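To illustrate why this matters for API calls, here is a small sketch in which each unique indicator is looked up exactly once. The lookup argument is a hypothetical callable that you would supply yourself (for example, a wrapper around whatever reputation service you use); it does not refer to any real API.

```python
def enrich(indicators, lookup):
    """Call lookup once per unique indicator and return {indicator: result}.

    lookup is a caller-supplied, hypothetical function; no real API is assumed.
    """
    # set() collapses duplicates, so repeated indicators cost one call
    return {indicator: lookup(indicator) for indicator in set(indicators)}
```

With a report that mentions the same IP address ten times, this makes one request instead of ten.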

There are many cyberthreat intelligence tools and platforms that can do this dismantling of information for you, but it can be extremely useful to understand the magic behind the process before relying on a tool.
