The Art and Science of How Spam Filters Work
If you want to know how important spam filters are to your online experience, try turning them off for a day. You’ll quickly see why these tools we tend to take for granted are so essential. We may not know how spam filters work, but we’re grateful that they do.
Spam volumes have been dropping in recent years, but there’s still plenty of junk out there. According to Trend Micro’s Global Spam Map, volumes exceed 400 billion messages on some days, yet we almost never see spam in our inboxes. Why is that?
In the cat-and-mouse game of cybersecurity, spam is one area where the good guys have kept reasonably well ahead of the bad. And the outlook for the future is bright: Machine learning could take spam filtering to a new level.
There are many approaches to catching spam, but they all do basically the same thing: scan header information for evidence of malice, look up senders on blacklists of known spammers and filter content for patterns that point to junk mail. The first two tasks are mostly science — the third is art.
Deciphering Header Data
Header information is that long river of text at the top of an email that you thankfully never have to see. It looks like this:
Received: by 10.107.191.69 with SMTP id p66csp1537538iof
X-Received: by 10.107.175.218 with SMTP id p87mr2784731ioo.80.1477075567036
Fri, 21 Oct 2016 11:46:07 -0700 (PDT)
Buried beneath all that gobbledygook is important information. It shows the IP address of every server that handled the email, date and time stamps, security signatures and other details you rarely need to read yourself but that help establish where the mail came from. Spam filters look for attempts to deceive the recipient (e.g., g00gle.com instead of google.com) and compare addresses against blacklists of known spammers, automatically filtering out messages that match.
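To make the header checks more concrete, here is a minimal Python sketch of the two ideas in that paragraph: pulling relay IP addresses out of the Received lines and spotting a look-alike sender domain. The sample message, the brand list and the character substitutions are all assumptions chosen for illustration; real filters use far richer rules.

```python
import re
from email import message_from_string
from email.utils import parseaddr

# Hypothetical sample message; the second relay and the sender are made up.
RAW = """\
Received: by 10.107.191.69 with SMTP id p66csp1537538iof
Received: from mail.example.net (mail.example.net [203.0.113.7])
From: "Account Team" <support@g00gle.com>
Subject: Verify your account

Hello
"""

IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
# Assumed digit-for-letter swaps; a real filter would cover many more tricks.
LOOKALIKES = str.maketrans({"0": "o", "1": "l", "3": "e", "5": "s"})
# Assumed list of domains worth protecting.
KNOWN_BRANDS = {"google.com", "paypal.com"}

def relay_ips(msg):
    """Collect every IPv4 address mentioned in the Received headers."""
    ips = []
    for value in msg.get_all("Received", []):
        ips.extend(IPV4.findall(value))
    return ips

def sender_domain(msg):
    """Return the domain part of the From address."""
    _, address = parseaddr(msg.get("From", ""))
    return address.rpartition("@")[2].lower()

def looks_like_spoof(domain):
    """True if the domain matches a known brand only after undoing swaps."""
    if domain in KNOWN_BRANDS:
        return False
    return domain.translate(LOOKALIKES) in KNOWN_BRANDS

msg = message_from_string(RAW)
```

With this sample, relay_ips(msg) returns ['10.107.191.69', '203.0.113.7'], and looks_like_spoof(sender_domain(msg)) is True because g00gle.com turns into google.com once the zeros are swapped back.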
Blacklists are lists of known spammers collected by internet service providers (ISPs), email providers and server administrators. Anyone can create and publish a blacklist, but the most popular ones, such as SpamCop, Spamhaus and URIBL, have the most credibility. Publishers create these lists by monitoring spam reports from users. That’s why it’s important to label unwanted email as spam. When you do so, you’re helping to keep everyone’s mailbox pristine.
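Many blacklists are published as DNS zones, so a filter can check a sender with an ordinary DNS lookup: it reverses the octets of the IP address, appends the list’s zone name and treats any answer as "listed." The sketch below shows that convention in Python. The zone name dnsbl.example and the resolver argument are placeholders, not any real list’s details, and each list publishes its own zone and usage terms.

```python
import ipaddress

def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up: reversed IPv4 octets plus the zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on a bad address
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"

def is_listed(ip, zone, resolve):
    """Return True if the lookup succeeds, False if the name does not exist.

    `resolve` is any callable that raises OSError when a name cannot be
    resolved; socket.gethostbyname behaves that way.
    """
    try:
        resolve(dnsbl_query_name(ip, zone))
        return True
    except OSError:
        return False

# Stand-in resolver so the example runs without network access.
FAKE_LISTINGS = {"7.113.0.203.dnsbl.example": "127.0.0.2"}

def fake_resolve(name):
    if name not in FAKE_LISTINGS:
        raise OSError("name not found")
    return FAKE_LISTINGS[name]
```

Passing socket.gethostbyname as the resolve argument would turn this into a live lookup. The fake resolver keeps the sketch self-contained.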
Smart spammers have ways of disguising header information to make their messages look genuine. Not all spammers are smart, however, so header analysis alone catches a lot of the most obvious spam. Even spammers who are good at cloaking information may overlook some telltale details. If delivery reporting is disabled, for example, it’s a sign that the sender is transmitting a large volume of mail and doesn’t want to be bothered with bounce messages. That’s a possible spammer.
There’s no single rule for how spam filters work; each engine has its own quirks. Some frown on email sent from free services like Hotmail and Gmail, for example, or downgrade messages addressed to a bare email address with no accompanying name. Fortunately, email administrators can adjust most of these settings to their liking.
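One common way to picture these adjustable quirks is as a set of weighted rules whose scores add up and are compared with a threshold. The Python sketch below assumes that design. The rule names, weights and threshold are invented for illustration and are not taken from any particular product.

```python
from dataclasses import dataclass
from email.utils import parseaddr

@dataclass
class Envelope:
    sender: str     # address in the From header
    to_header: str  # raw To header, e.g. "Bob <bob@example.com>"

FREE_MAIL_DOMAINS = {"hotmail.com", "gmail.com"}

# Hypothetical default weights; an administrator can override any of them.
DEFAULT_WEIGHTS = {"free_mail_sender": 1.0, "recipient_without_name": 0.5}

def fired_rules(env):
    """Return the names of the rules this message triggers."""
    rules = []
    if env.sender.rpartition("@")[2].lower() in FREE_MAIL_DOMAINS:
        rules.append("free_mail_sender")
    display_name, _ = parseaddr(env.to_header)
    if not display_name:
        rules.append("recipient_without_name")
    return rules

def spam_score(env, overrides=None):
    """Sum the weights of the fired rules, applying any admin overrides."""
    weights = {**DEFAULT_WEIGHTS, **(overrides or {})}
    return sum(weights[name] for name in fired_rules(env))

def is_spam(env, threshold=1.5, overrides=None):
    return spam_score(env, overrides) >= threshold
```

For a message from promo@hotmail.com addressed to a bare bob@example.com, both rules fire and the score is 1.5, which meets the default threshold. An administrator who trusts free mail services could pass overrides={"free_mail_sender": 0.0} and the same message would score 0.5.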
The art of spam filtering comes into play when analyzing the contents of a message. This is where the best filters shine, but it’s also where legitimate messages can end up in spam purgatory.
Some content tactics are almost certain to land a message in the spam folder. Emails containing attached executable files or links to blacklisted websites are sure giveaways, as are those with common spam keywords. A few years ago, many spam filters flagged emails containing shortened links from services like bit.ly and 3.ly. With the profusion of short links spawned by Twitter, however, that tactic is less common today.
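The three giveaways above translate naturally into simple checks. Here is a rough Python sketch, assuming the filter already has the message body and a list of attachment file names. The extension list, blocked host and keywords are placeholders, not a real rule set.

```python
import re
from urllib.parse import urlparse

# Placeholder lists; real filters maintain much larger, frequently updated ones.
EXECUTABLE_EXTENSIONS = (".exe", ".scr", ".bat", ".js")
BLOCKED_HOSTS = {"malware.example"}
SPAM_KEYWORDS = {"free money", "act now"}

URL = re.compile(r"https?://[^\s<>\"']+")

def content_flags(body, attachment_names):
    """Return the content rules that a message triggers."""
    flags = []
    # Rule 1: any attachment whose name ends in an executable extension.
    if any(name.lower().endswith(EXECUTABLE_EXTENSIONS) for name in attachment_names):
        flags.append("executable_attachment")
    # Rule 2: any link whose host appears on the blocked list.
    hosts = {urlparse(url).hostname for url in URL.findall(body)}
    if hosts & BLOCKED_HOSTS:
        flags.append("blacklisted_link")
    # Rule 3: any known spam phrase, ignoring case.
    text = body.lower()
    if any(keyword in text for keyword in SPAM_KEYWORDS):
        flags.append("spam_keyword")
    return flags
```

A message reading "Act now! Visit http://malware.example/win for free money" with an attachment named invoice.pdf.exe trips all three rules, while an ordinary note with a PDF menu attached trips none.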
If those schemes are so easily detected, you might wonder why spammers continue to use them. Unfortunately, there are enough gullible people out there that even a very low hit rate can be profitable. High-volume spammers don’t expect more than about a 0.1 percent open rate, but that still translates to 1,000 people for every 1 million messages sent.
“When you get a reply, it’s 70 percent sure that you’ll get the money,” one spammer told the Los Angeles Times in a 2005 interview. Although much has changed since then, even a minuscule response rate can be profitable if the volumes are large enough, and spam is free and easy to send.
Machine Learning: Changing How Spam Filters Work
With the advent of powerful machine learning algorithms and big data economics, there’s potential to change how spam filters work.
Apache SpamAssassin is a widely used platform that incorporates advanced statistical techniques to score incoming messages. The same tactics that are applied to detecting fraudulent reviews on travel and e-commerce sites can work in spam analysis as well. When you mark a message as spam, it goes into a hopper with millions of messages that others have flagged. Algorithms churn through these messages to find similar characteristics, such as word proximity or misspellings, that show up frequently in spam.
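To give a feel for how flagged messages can train a filter, here is a small naive Bayes-style classifier in Python. It is a generic textbook sketch, not SpamAssassin’s actual implementation: it only counts which words appear in messages labeled spam or ham (legitimate mail) and combines the per-word odds for the words present in a new message. The class name, tokenizer and smoothing constants are assumptions made for the example.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"[a-z0-9']+")

def tokens(text):
    """Lowercase the text and return its distinct word tokens."""
    return set(TOKEN.findall(text.lower()))

class NaiveBayesFilter:
    def __init__(self):
        # Number of messages containing each token, per label.
        self.token_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record one user-labeled message ('spam' or 'ham')."""
        self.message_counts[label] += 1
        self.token_counts[label].update(tokens(text))

    def spam_probability(self, text):
        """Combine smoothed per-token odds into a probability of spam."""
        n_spam = self.message_counts["spam"]
        n_ham = self.message_counts["ham"]
        # Prior odds from how often each label was seen (add-one smoothing).
        log_odds = math.log((n_spam + 1) / (n_ham + 1))
        for tok in tokens(text):
            p_spam = (self.token_counts["spam"][tok] + 1) / (n_spam + 2)
            p_ham = (self.token_counts["ham"][tok] + 1) / (n_ham + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))

spam_filter = NaiveBayesFilter()
spam_filter.train("win free money now", "spam")
spam_filter.train("free prize claim now", "spam")
spam_filter.train("meeting agenda for monday", "ham")
spam_filter.train("lunch on monday", "ham")
```

After those four training messages, "claim your free money" scores well above 0.5 and "monday meeting agenda" well below it. Production systems train on millions of messages and combine this kind of score with many other signals.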
Cloud computing is also changing the rules of spam filtering by making more powerful filters available to a broader audience at lower cost. Cloud services are increasingly displacing on-premises filters, bringing the benefits of economies of scale. Because cloud providers collect data from many sources, they can compile large databases for machine learning processing. The result should be better content filtering.
You can fine-tune your own spam settings by specifying senders or domains to exclude. Some email administrators even like to loosen controls to be sure legitimate messages don’t get caught. Either way, it’s a good idea to check your spam folder every few days to ensure messages you’ve been waiting for aren’t lurking there. Spam filters are pretty good these days, but nothing’s perfect.
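Excluding trusted senders usually boils down to a check that runs before any other filtering. A minimal Python sketch, assuming the exclusions are kept as two plain sets of lowercase addresses and domains (the function name and sample values are hypothetical):

```python
def is_allowlisted(sender, allowed_senders, allowed_domains):
    """True if the sender, its domain or a parent domain is trusted."""
    sender = sender.strip().lower()
    domain = sender.rpartition("@")[2]
    if sender in allowed_senders:
        return True
    # Accept exact domain matches and subdomains of an allowed domain.
    return any(domain == d or domain.endswith("." + d) for d in allowed_domains)

# Hypothetical exclusions an administrator might configure.
ALLOWED_SENDERS = {"billing@vendor.example"}
ALLOWED_DOMAINS = {"partner.example"}
```

Mail from news@mail.partner.example passes because its domain sits under partner.example, while notpartner.example does not match by accident.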