The Limits of Linguistic Analysis for Security Attribution

Everyone wants to know who was behind the latest audacious cyberattack. Security professionals have long attempted to identify threat actors through linguistic analysis, but this method is limited when it comes to attribution.

Part of the problem is that cybercriminals purposely build deception into their code. “Deception is always a major part of an attack,” according to Network World. “The attackers want to make sure that if the operation is discovered, any evidence that’s unearthed points toward someone else.” In practice, this often means deliberately using servers or domain names located elsewhere, or routing communications through paths that have nothing to do with the attackers’ own country or place of origin.

As Fahmida Y. Rashid explained on CSO Online, “Linguistic analysis will very rarely lead to the smoking gun. At the very least, it will uncover a whole set of clues for researchers to track down, and at the best, it will support (or confirm) other pieces of evidence.”

Two Kinds of Linguistic Analysis

There are generally two kinds of linguistic analysis: one that looks at how the source code was written, and another that examines the natural-language text the malware contains. What’s the difference? The first examines the coding style and determines whether it resembles code found in other malware samples. The second focuses on word choices in user dialogues, comments within the code, input screens or other displays visible to the end user. Ransomware, for example, typically includes a ransom note. Are the same words consistently misspelled across these notes, or do the notes share the same typographic conventions?
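The misspelling question can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three notes, the tiny reference vocabulary and the function names are all hypothetical, and a real analysis would use a full dictionary and far more samples.

```python
import re
from collections import Counter


def tokens(text):
    """Return the set of lowercase alphabetic words in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def shared_misspellings(notes, vocabulary):
    """Return out-of-vocabulary words that occur in more than one note."""
    counts = Counter()
    for note in notes:
        # Count each unknown word once per note, not once per occurrence.
        counts.update(tokens(note) - vocabulary)
    return sorted(word for word, n in counts.items() if n > 1)


# Hypothetical reference vocabulary (a real one would be a full dictionary).
vocabulary = {
    "your", "files", "are", "encrypted", "send", "payment",
    "to", "recover", "them", "have", "been", "now",
}

# Hypothetical ransom notes, invented for this example.
notes = [
    "Your files are encripted. Send paymant to recover them.",
    "Your files have been encripted. Send payment now.",
    "Files encripted. Send paymant.",
]

print(shared_misspellings(notes, vocabulary))
```

Words that recur across notes but are missing from the dictionary are candidates for a shared authorship clue. They are only candidates, since a copied template or a deliberately planted error would produce the same result.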

Part of linguistic analysis is understanding how native speakers use their language. If a threat actor regularly omits definite articles, for example, this suggests that he or she is probably not a native English speaker. However, many people speak multiple languages, and attackers can also use machine translation, both of which can cloud the results.
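As a rough sketch of the definite-article signal, the snippet below measures how often "the" appears among the words of a text. The function name and both sample sentences are hypothetical, and a low rate is at most a weak hint that would need to hold across much longer texts.

```python
import re


def definite_article_rate(text):
    """Return the share of word tokens that are the definite article 'the'."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return words.count("the") / len(words)


# Hypothetical sentences, invented for this example.
with_articles = "Send the payment to the address shown in the window."
without_articles = "Send payment to address shown in window."

print(definite_article_rate(with_articles))
print(definite_article_rate(without_articles))
```

The first sentence uses "the" for 3 of its 10 words, while the second never uses it. Multilingual authors and machine translation can both shift this rate, which is why the measure cannot stand on its own.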

An Inconclusive Method

The challenge with linguistic analysis is that it isn’t enough to be conclusive on its own; it needs to be combined with other evidence to point the way toward attribution. In the case of WannaCry, the ransom notes were written in 27 different languages. One analyst concluded that a Chinese-speaking author was behind the original ransom messages, but the finding wasn’t ironclad.

Despite its inconclusiveness, linguistic analysis is a fascinating field of study. It’s also one that can improve as big data models mature, making the future of this security research bright.

David Strom

Security Evangelist

Security Intelligence
