Almost since their inception, threat intelligence and network operations have faced the problem of having more than one name for a single thing. In other words, each thing often has more than one alias.
For instance, is Biscuit the name of a malware family or a baked good? Is Silicon a threat actor or a chemical element? Is this string of digits a message digest algorithm (MD5) hash or a cryptocurrency wallet address? Is example.com a filename or a domain name? With apologies to “Saturday Night Live“: “It’s a floor wax … it’s a dessert topping … new Shimmer is a floor wax and a dessert topping!”
Newcomers to the field of information security quickly notice that the threat intelligence discipline names many things. The number of things with names — and the variety of different names (aliases) for each thing — can often confuse attempts to investigate, defend and respond to the multitude of threats out there. Some areas where this ambiguity can create problems in threat intelligence include:
- Identifying which malware family and variant a sample represents;
- Connecting an antivirus (AV) signature to a malware family, and to the signatures produced by other AV tools for the same and similar samples;
- Mapping network events and malware into the stages of a kill chain;
- Linking malware samples, variants and families with threat actors and campaigns that use them; and
- Associating threat actions with exploits with vulnerabilities with malware with remedies.
A Threat by Any Other Name…
Since the relationships between the things and their names change over time, attribution is even more difficult to manage, and investigators must account for the passage of time when correlating the names with elements from their everyday work.
No single approach successfully deals with all the different alias problems in threat intelligence. In fact, it can be one of the most treacherous and aggravating aspects of the business. The table below provides some examples of multiple aliases used by different organizations. Each list contains different names for the same thing, but the lists are not related:
- Same malware family, multiple names: Derusbi, Photo, Shyape, Sakula, Mivast, Sakurel.
- Same ransomware variant: WannaCry.A, WCry.A, WannaCrypt.A.
- AV signatures for the same sample: PSW.ILUSpy, Generic.Malware.SL!.36EAE0BB, W32/Application.FURU-4724, MSIL/Autorun.Spy.KeyLogger.AWTrojan.Win32.Reconyc.cwyboi.
- Filenames for the same data stream: dispci.exe, abcd.jpeg, abcd.txt, 669629.exe.
- Different hashes for the same file: FD4FC9439E932952DFB9EF5CE25312AEB70358B1, 065aa01311ca8f3e0016d8ae546d30a4.
- Command-and-control (C&C): hxxp://22.214.171.124/quantum/ghsnls.php, hxxp://126.96.36.199/yumhong/ghsnls.php, hxxp://188.8.131.52/quantum/ghsnls.php.
- Same threat actor, multiple names: APT28, FANCY BEAR, Sofacy.
Though the dictionary definition of the word “aliasing” involves signal sampling, for our purposes it refers to the state of one thing having multiple aliases, or one name representing multiple things. Let’s examine some other areas in threat intelligence where aliasing creates ambiguity.
Structural (or Implicit) Aliasing
Some aliasing in threat intelligence stems from technical considerations. For example, domain names exist to make life easier for users; users remember names more easily than strings of digits. The benefit clearly outweighs the disadvantage of both the multiple names and the additional infrastructure — i.e., Domain Name System (DNS) services — necessary to correlate them in practice.
Data streams such as phishing emails and malware programs provide another example — actually, they provide two examples. Many threat intelligence situations draw our attention to a specific stream of data (or file), whether it’s a malware program, a spam email message or a stolen document. Each file is unique, just like each person is unique. However, just as with people, we can use multiple names to identify any given file.
First and foremost, file names impose “implicit” aliasing on the data they identify. The same malware sample or phishing email message might exist with several different names. Users, analysts, legitimate software, malware and other agents can all assign arbitrary names to any file, within the limits of a file system or protocol.
Even if the file always has a particular name on one file system, multiple file systems can receive and store the same sequence of bytes, but sometimes not the same filenames. If the file systems have incompatible naming rules, users and agents have no choice but to create a new name for the file when adding it to those systems.
In addition, threat intelligence operators often need to uniquely identify a specific malware sample, malicious PDF file, phishing message or similar threat. We generally don’t want to include the complete file to uniquely identify it — if for no other reason than to help reduce bandwidth and storage costs. Since the filename cannot uniquely identify it, we fall back on mathematics for help in finding unique identifiers.
Frequently, threat intelligence specialists identify files by a number related to the contents of the file. We call these numbers hash values, or simply hashes. Mathematical functions called hashing algorithms analyze the contents of a file and return a single (usually large) number we can use to identify those contents, regardless of the filename.
Hashes allow us to “almost-uniquely” identify the contents of a file by a value that is considerably more compact and easier to handle than the entire file contents. For example, if the contents of two files with different names produce the same hash value, we should be able to say very confidently that they have the same contents. In fact, that’s exactly what the industry often does to manage blacklists for malware programs and white lists for benign programs.
Exactly how almost-uniquely a given hashing algorithm identifies the contents of a file depends on two primary factors: the design of the hashing algorithm and the size of the results it produces. Obviously, hash algorithms that produce results with more digits can identify a file more almost-uniquely than those that produce results with fewer digits.
Not all hashing algorithms are created equal, however. Even though two different algorithms produce the same number of digits in their output, they can have distinctive design goals. The difference in goals often changes how uniquely they represent any file’s contents. Many hashing algorithms focus on performance and adequate uniqueness within a specific domain. Threat intelligence, however, typically uses hash algorithms that are cryptographically secure because those algorithms prioritize uniqueness of the results over performance.
Time Marches On
As general computing power increases, the cost required to fraudulently duplicate a hash value decreases and can eventually reach the point where the hash algorithm becomes economically feasible to evade. That is, at some point, it may become inexpensive for an attacker to create a malicious file that has the same hash value as a benign file, which is called a hash collision. Intentional hash collisions allow malicious files to masquerade as benign files to bypass defensive measures such as white-listing or blacklisting systems. Once an attacker can economically cause a collision and masquerade a malicious file as a benign file, threat intelligence operators can no longer trust that hash algorithm to produce “unique enough” values to identify specific threats or separate malicious files from benign ones.
In the past decade, both MD5 and Secure Hash Algorithm 1 (SHA-1) fell victim to advances in general computing power. Once that happens, the security industry has little choice but to move away from the less nearly unique results of the older hashing algorithm and move toward the more nearly unique results of newer algorithms.
Therefore, both technology and circumstance have forced threat intelligence specialists to deal with hash value aliasing. Historical records still use the old algorithms to identify threats for a long time, while current records use the new algorithms and, for continuity, the old ones too. This situation directly gives rise to multiple hash values for a single file because each file produces different MD5, SHA-1 and SHA-3 hash values.
As if the instances of implicit aliasing weren’t enough, we also find many instances of explicit aliasing. Some explicit aliasing is coincidental (e.g., due to temporal issues) and some is discretionary (e.g., due to business reasons). Higher-level concepts are also often more susceptible to aliasing, such as naming threat actors and threat groups.
Threat actors are shadowy entities to begin with, and they strive to stay in the shadows. Victims rarely want publicity either. Together, those facts make it easier for the siloed work of multiple investigators to overlap and result in different names for the same findings when looking into incidents or campaigns.
Any malware family or threat actor might end up with a number of names for the following reasons:
- Different investigators encounter them without knowing about, or making the connection to, each other’s work.
- Different investigators have access to different aspects of an operation or platform due to differences in the “footprint” of the sensor network(s) providing them data, and they don’t make the connection.
- Different investigators have commercial reasons for using a different name even when they’re aware of the existing name(s) and work(s).
Similarly, antivirus signatures from different vendors for the same malware typically have different names. About 50 AV engines compete in the market, and they use about 45 different naming conventions among them. Even from the same vendor, signature naming can change over time, especially for signatures based on heuristics. Plus, each AV vendor encounters a different subset of all the malware that’s out there, known as the “footprint problem.”
The malicious actors don’t stand still either. When investigations of an attack become public, the threat actor or malware author behind it often scurries away, quickly wiping his or her traces and changing infrastructure such as IP addresses, domain names, malware internal naming and malware file names. In some cases, threat actors change their techniques more fundamentally, or even emulate others to throw investigators off track, causing even more names to appear for the exact same group and the exact same malicious tools they use.
Another form of explicit aliasing affects system vulnerabilities. A variety of industry organizations issue security advisories and messages to alert defenders. The organizations name the advisories based on their internal control procedures. For example, Mitre, a not-for-profit organization, manages the public catalog of Common Vulnerabilities and Exposures (CVE) identifiers, which are also known as CVE IDs.
Other commercial entities, and even open source projects, issue advisories covering overlapping topics, but they’re named according to their own process controls. This means that a specific vulnerability could be identified by a CVE ID like CVE-2012-0158, by a vendor report ID from each vendor with affected products, or by IDs from other organizations, like the now-defunct Open Source Vulnerability Database (OSVDB).
One way our industry tries to alleviate the problem is to include both the vendor designation for a vulnerability and its CVE ID in advisories so that humans and automated systems can directly connect the reports and the identifiers.
The commercialization of both malware and attack platforms also contributes to the problem. It creates an “inverted” aliasing effect, in a sense, compared to what we’ve discussed so far. Since a variety of threat actors can purchase the same malware, hire the same attack infrastructure or both, investigators cannot firmly connect activity from that malware or infrastructure to activity instigated by threat actor A because threat actors B, C and D all have commercialized access to the same assets.
Some geographies suffer more from this aspect of the problem than others. Japan and Brazil, for example, have deeply commercialized threat ecosystems that lead to many insignificant changes to the malware as a wide variety of different actors integrate it into their campaigns. These changes occur frequently and produce new hash values after each change, but the changes don’t rise to the level of branching into a new malware family name. This drastically increases the number of hash values in the ecosystem and the number of variants in each family, while also diluting our ability to connect a variant to an actor.
How Can Companies Manage Multiple Threat Intelligence Aliases?
Due to both technical and nontechnical factors, a lot of things in the threat intelligence world will continue to have multiple names, and this will likely continue for a long time. So how do your organization’s processes and people prepare to deal with all these forms of aliasing? Robust defenders must:
- Be aware of the ways aliases can impact your processes;
- Be able to determine and remember as many of the names for each thing as possible; and
- Ensure that searching for any alias for something returns all results for that thing, regardless of the alias.
The IBM X-Force Exchange platform can help defenders answer some of these questions. Users can leverage this cloud-based threat intelligence sharing platform to research threats and vulnerabilities, collaborate with peers and respond effectively to threats. X-Force Exchange makes many of its features public and available without registration, but also offers additional features to registered users. IBM Security also helps customers make sense of threat intelligence by building Watson for Cybersecurity into IBM QRadar SIEM.