Everyone wants to know who was behind the latest audacious cyberattack. Security professionals have long attempted to identify threat actors through linguistic analysis, but this method is limited when it comes to attribution.
Part of the problem is that cybercriminals purposely build deception mechanisms into their code. “Deception is always a major part of an attack,” according to Network World. “The attackers want to make sure that if the operation is discovered, any evidence that’s unearthed points toward someone else.” This often means deliberately using servers or domain names located elsewhere, or routing communications through paths that have nothing to do with the attackers’ own country or place of origin.
As Fahmida Y. Rashid explained on CSO Online, “Linguistic analysis will very rarely lead to the smoking gun. At the very least, it will uncover a whole set of clues for researchers to track down, and at the best, it will support (or confirm) other pieces of evidence.”
Two Kinds of Linguistic Analysis
There are generally two kinds of linguistic analysis: one looks at how the source code was written, and the other examines the human-readable text that accompanies it. What’s the difference? The first examines coding style and determines whether it resembles other code found in malware samples. The second is more about word choices found in user dialogues, comments within the code, input screens or other displays visible to the end user. Ransomware typically includes a ransom note, for example. Are the same words in these notes consistently misspelled, or do the notes share the same typographic conventions?
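The misspelling question can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the three ransom notes are invented, KNOWN_WORDS is a tiny hypothetical vocabulary standing in for a real dictionary, and the function names are not taken from any actual tool. It flags words that fall outside the vocabulary in at least two notes, which is the kind of repeated quirk an analyst might follow up on.

```python
import re
from collections import Counter

# Hypothetical reference vocabulary; a real analysis would use a full dictionary.
KNOWN_WORDS = {
    "your", "files", "are", "encrypted", "send", "the", "payment", "to",
    "receive", "key", "all", "you", "will", "after", "locked", "pay", "now",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())


def unknown_words(text, vocabulary):
    """Return the set of words in the text that are not in the vocabulary."""
    return {word for word in tokenize(text) if word not in vocabulary}


def shared_misspellings(notes, vocabulary, min_notes=2):
    """Return unknown words that appear in at least min_notes different notes."""
    counts = Counter()
    for note in notes:
        # Each note contributes a word at most once, so counts are per note.
        counts.update(unknown_words(note, vocabulary))
    return {word for word, seen in counts.items() if seen >= min_notes}


# Invented sample notes, for illustration only.
note_a = "Your files are encrypted. Send the payment to recieve your key."
note_b = "All files are encrypted. You will recieve the key after payment."
note_c = "Your files are locked. Pay now to receive the key."

print(shared_misspellings([note_a, note_b, note_c], KNOWN_WORDS))
```

Here the first two notes share the misspelling "recieve" while the third does not. A shared quirk like this is only a lead: a common misspelling can appear in unrelated notes, and an author can plant one deliberately.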
Part of linguistic analysis is understanding how native speakers use their language. If a threat actor regularly omits definite articles, for example, this suggests that he or she is probably not a native English speaker. However, people speak multiple languages and can also use machine translations, both of which can cloud the results.
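As a rough illustration of the definite-article signal, the sketch below measures how often "the" appears in a piece of text. The function name and both sample sentences are hypothetical, and no threshold is implied: a low rate is at most a weak hint, since short imperative messages, machine translation and deliberate imitation can all produce the same pattern.

```python
import re


def definite_article_rate(text):
    """Return the fraction of words in the text that are the word 'the'."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0  # Avoid dividing by zero on empty input.
    return sum(1 for word in words if word == "the") / len(words)


# Invented sample sentences, for illustration only.
with_articles = "Send the payment to the address shown in the window."
without_articles = "Send payment to address shown in window."

print(definite_article_rate(with_articles))
print(definite_article_rate(without_articles))
```

Comparing the rate across many messages from the same suspected author is more informative than any single value, and even then the result only adds one more clue to weigh against other evidence.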
An Inconclusive Method
The challenge with linguistic analysis is that it isn’t enough to be conclusive on its own; it needs to be combined with other evidence to point the way toward attribution. In the case of WannaCry, the ransom notes were written in 27 different languages. One analyst concluded that a Chinese-speaking author was behind the original ransom messages, but the finding wasn’t ironclad.
Despite its inconclusiveness, linguistic analysis is a fascinating field of study. It’s also one that can improve as big data models mature, making the future of this security research bright.