November 16, 2003

Advanced Spam Filtering using Spamassassin and Exim

I have the most advanced spam filtering system on the planet. I feel like I've actually beaten the spam problem. More details can be found on Computer Tyme Hosting.

How do I do it? What is the magic? Well - there is no magic. I'm using a combination of the Exim MTA and Spamassassin with a bunch of my own custom rules and tricks.

Two Spam Piles

Spamassassin is very good by itself - but not good enough. One thing that the Spamassassin folks haven't quite grasped is sorting spam into two piles - high scoring spam and low scoring spam.

The high spam is almost surely spam. The low spam is probably spam - but if there is a false positive, it will be low scoring, so it lands in the small pile where it is easy to find. By using this system the high spam can be ignored or trashed without losing anything. I get about 300-400 spams a day. Almost all are caught as high spam.
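The two-pile idea comes down to a pair of score cutoffs. Here is a minimal sketch in Python. The cutoff values and the names `SPAM_THRESHOLD`, `HIGH_THRESHOLD` and `classify` are assumptions for illustration - the post does not say which numbers are actually used (5.0 is Spamassassin's default required score).

```python
# Hypothetical cutoffs; the post does not state the real ones.
SPAM_THRESHOLD = 5.0   # Spamassassin's default required_score
HIGH_THRESHOLD = 10.0  # assumed cutoff for the "high" pile


def classify(score: float) -> str:
    """Sort a message into one of three piles by its spam score."""
    if score >= HIGH_THRESHOLD:
        return "spam-high"   # almost surely spam, safe to ignore
    if score >= SPAM_THRESHOLD:
        return "spam-low"    # probably spam, worth a glance for false positives
    return "ham"             # not tagged as spam
```

Any false positive that just barely crosses the spam line ends up in `spam-low`, which is the only pile anyone has to look through.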

Direct IMAP folder delivery

Once the spam is tagged - if the user is using IMAP and has folders named spam-high and spam-low - the Exim MTA delivers the spam directly into those folders rather than the Inbox. In this way the inbox is spam free and can be downloaded without downloading the spam, which stays on the server. This makes downloading much quicker.
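The delivery decision itself is simple: use the spam folder only if the user has created it, otherwise fall back to the inbox. The real thing lives in the Exim configuration, which is not shown here, so the following Python is only a sketch of the logic and `choose_folder` is a made-up name.

```python
from typing import Iterable


def choose_folder(pile: str, user_folders: Iterable[str]) -> str:
    """Pick the delivery folder for a tagged message.

    pile         -- "spam-high", "spam-low" or anything else for clean mail
    user_folders -- names of the IMAP folders the user has created
    """
    spam_piles = ("spam-high", "spam-low")
    if pile in spam_piles and pile in set(user_folders):
        return pile   # deliver straight into the matching spam folder
    return "INBOX"    # no such folder (or not spam): normal delivery
```

A user who never creates the two folders simply gets tagged spam in the inbox as before.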

The spam folders are still accessible - so you can look at the spam you are missing. You can check the spam-low folder for false positives. And - IMAP allows you to create more server side folders for other important information. With a Squirrelmail interface, you can access your email from any browser.

Making the Spam Filter Smarter

Spamassassin uses a Bayesian filter that allows it to learn from spam and nonspam and get smarter. Very high scoring spam (+15 points) and very low scoring mail (-2 points) are autolearned. But - I provide two other IMAP folders to train the filter on missed spam. Just drag spam-low and missed spam into the spam-missed folder and - every 15 minutes - the learn bot comes along and learns it. Next time that spam comes in it is caught.
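Here is a rough sketch of both halves of the learning setup, assuming the +15 and -2 cutoffs mentioned above. The function names and the folder path in the test are hypothetical, and the post does not show the actual learn bot. The sketch only builds the `sa-learn --spam` command for the spam-missed folder - it does not run it - and it ignores the extra conditions Spamassassin applies before autolearning.

```python
from typing import List, Optional

AUTOLEARN_SPAM = 15.0   # at or above this score: learn as spam
AUTOLEARN_HAM = -2.0    # at or below this score: learn as nonspam


def autolearn_label(score: float) -> Optional[str]:
    """Return "spam", "ham", or None if the score is too ambiguous to learn."""
    if score >= AUTOLEARN_SPAM:
        return "spam"
    if score <= AUTOLEARN_HAM:
        return "ham"
    return None


def learn_command(spam_missed_dir: str) -> List[str]:
    """Command a learn bot could run from cron every 15 minutes."""
    return ["sa-learn", "--spam", spam_missed_dir]
```

Only the extremes are learned automatically. Everything in between is left for the user to sort by dragging it into spam-missed.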

Exim Rules for Blacklisting

One of the major advances I made over Spamassassin is adding blacklists to Exim. These lists - just text files - add headers if there is a match. One of the things I list is the sites that spam links to. Spam wants you to do something and often that means clicking on a link. I have a list of about 400 sites, and if spam links to one of them - I flag it. I add Spamassassin rules to score these extra headers. This trick proves to be extremely effective.
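To illustrate the link blacklist: pull the hostnames out of the links in a message body, compare them to the list, and add a header on a match. This is a sketch in Python, not the Exim rules themselves. The header name `X-Blacklist-Link` and the function names are invented for the example.

```python
import re
from typing import Iterable, Optional, Set
from urllib.parse import urlparse

# Deliberately simple URL matcher: http(s) links up to whitespace or a quote.
URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def linked_hosts(body: str) -> Set[str]:
    """Collect the hostnames of all links found in a message body."""
    hosts = set()
    for url in URL_RE.findall(body):
        try:
            host = urlparse(url).hostname
        except ValueError:      # malformed URL, skip it
            continue
        if host:
            hosts.add(host.lower())
    return hosts


def blacklist_header(body: str, blacklist: Iterable[str]) -> Optional[str]:
    """Return a header line if the body links to a listed site, else None."""
    listed = [b.lower() for b in blacklist]
    hits = sorted(
        h for h in linked_hosts(body)
        if any(h == b or h.endswith("." + b) for b in listed)
    )
    if hits:
        return "X-Blacklist-Link: " + ", ".join(hits)
    return None
```

A Spamassassin header rule can then add points whenever that header is present. Subdomains match too, so listing `pills.example` also catches `www.pills.example`.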

I have other lists too. I blacklist based on received strings so that sending hosts are blocked. I have a list of misspelled words like p0rn that spammers use to get around spam filters. I have a blacklist of dead email targets that no one is really mailing to. If the spam CCs any of these nonexistent people - it gets flagged.
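The dead address check can be sketched the same way. The function name and the sample addresses in the test are made up. The idea is just to look at the To and Cc headers and see if any recipient is on the dead list.

```python
from email import message_from_string
from email.utils import getaddresses
from typing import Iterable


def hits_dead_target(raw_message: str, dead_addresses: Iterable[str]) -> bool:
    """True if the message is addressed or CCed to a known dead address."""
    dead = {a.lower() for a in dead_addresses}
    msg = message_from_string(raw_message)
    # Gather every recipient from the To and Cc headers.
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    return any(addr.lower() in dead for _name, addr in recipients)
```

Real people almost never write to an address that has been dead for years, so a hit here is a strong hint that the recipient list came from a spammer's database.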

I also have whitelists that whitelist various hosts, newsgroups, words, etc. Whitelisting creates a negative score, bringing the spam score below 0. This creates a good stream of non-spam for the autolearning system so that it knows what spam and nonspam look like.

Taking out the Trash

The spam does not accumulate on the server forever. Once a week the trash bot comes along and empties out old messages from the spam and trash folders. Anything over 15 days is gone. So - you don't even have to delete your spam. Just leave it on the server and the trash bot will clean it up for you.
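A trash bot along these lines can be very small. The sketch below assumes one file per message (as in a Maildir-style folder) and uses the file's modification time as the message age. Both are assumptions, since the post does not describe the mail store or the real script.

```python
import os
import time
from typing import List, Optional

MAX_AGE_DAYS = 15  # from the post: anything over 15 days is gone


def expired_messages(folder: str, max_age_days: int = MAX_AGE_DAYS,
                     now: Optional[float] = None) -> List[str]:
    """List message files in folder older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400   # seconds per day
    old = []
    for entry in os.scandir(folder):
        if entry.is_file() and entry.stat().st_mtime < cutoff:
            old.append(entry.path)
    return sorted(old)


def take_out_trash(folder: str, max_age_days: int = MAX_AGE_DAYS) -> List[str]:
    """Delete expired messages and return the paths that were removed."""
    removed = expired_messages(folder, max_age_days)
    for path in removed:
        os.remove(path)
    return removed
```

Run weekly from cron against each spam and trash folder, this keeps the folders from growing without the user ever touching them.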

Summary of Enhancements:


  • Two levels of Spam Tagging
  • Direct Delivery to IMAP Folders
  • Learning System for User Feedback
  • Multiple Exim Blacklist Front end
  • Server side Trash Collection

How well does it work?

When I started spam filtering I thought that 75% would be real good and that 80% was a theoretical maximum. I am now running about 99% accurate, so of the 300-400 spams I get every day - only 3 or 4 get through. This saves me a hell of a lot of time. If not for this spam filtering - I wouldn't be able to get nearly as much done. I don't have a lot of hours to devote to deleting spam.

Where can I get this?

Well - I do email hosting as well as web hosting. So - if you have a domain and you want this - I can fix you up. If I like your cause - I might even host it for free.

Posted by marc at November 16, 2003 12:52 PM
Comments

Check out Dspam at nuclearelephant. I trained it with about 2500 spam and about 500 ham messages. Very accurate. Written in C, uses MySQL. Very fast.

Using this in conjunction with Spamassassin and Spamcop or Spamhaus can get you 99.99999% accuracy.

Craig Jackson

Posted by: craig jackson at June 20, 2004 02:42 PM