Stories of Spam Part 2

Automatically Scanning Email with SpamAssassin

Introduction

In Part One, I explained how to help SpamAssassin learn about new spam. Now I'll put that acquired data to use when scanning messages.

There are several ways to scan email with SpamAssassin. Personally, I like having the message automatically scanned right at the mail server. This ensures every message that passes through the server is scanned and properly dealt with.

Tools

Postfix will be the MTA, although it's possible to use any other major mail server for this project. The SpamAssassin command line utilities, spamd and spamc, will be used to scan the messages. Finally, Debian will be the Linux distribution.

Installing SpamAssassin, spamd, and spamc

With Debian, spamd is included with the main SpamAssassin package. To install, simply do:

# apt-get install spamassassin

And to enable spamd, edit the /etc/default/spamassassin file:

# Change to one to enable spamd
ENABLED=1

Finally, start the server:

# /etc/init.d/spamassassin start

spamc has its own package. No configuration is needed; just install it with apt-get:

# apt-get install spamc
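Once both packages are installed, a quick sanity check is to hand spamd a message it is designed to flag. The sketch below is not part of the original setup: it builds a small test message containing the GTUBE string (SpamAssassin's standard test pattern) and pipes it through spamc if spamc is on the PATH. The function names and the example.com addresses are made up for illustration.

```shell
#!/bin/sh
# Print a minimal test message containing the GTUBE string, which
# SpamAssassin scores as spam by design. Addresses are placeholders.
gtube_message() {
    cat <<'EOF'
From: sender@example.com
To: recipient@example.com
Subject: GTUBE test

XJS*C4JDBQADN1.NSBN3*2IDNEN*GTUBE-STANDARD-ANTI-UBE-TEST-EMAIL*C.34X
EOF
}

# Pipe the test message through spamc and show only the status header.
# Prints nothing useful unless spamd is running and reachable.
check_spamd() {
    if command -v spamc >/dev/null 2>&1; then
        gtube_message | spamc | grep '^X-Spam-Status:'
    else
        echo "spamc not found" >&2
        return 1
    fi
}
```

If spamd is up, running check_spamd should print a header that begins with "X-Spam-Status: Yes". No output at all usually means spamc could not reach spamd and passed the message through untouched.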

Configuring Postfix

The only file to edit is /etc/postfix/master.cf. This file contains definitions for all of Postfix's services and daemons. master.cf is actually a really interesting file and I'm going to cover it in more detail in another article.

The first line to edit is the first uncommented line that starts with smtp. The edited version looks like this:

smtp  inet  n   -   -  -  -  smtpd -o content_filter=spamc

The -o argument tells Postfix to override a configuration parameter from the main.cf file. In this case, content_filter is the parameter being overridden. The job of content_filter is to define which transport the mail is sent to for optional filtering. spamc is the name of that transport. The name is completely arbitrary, but since the actual spamc program will be used, it makes sense for consistency.

The actual transport definition is made at the very bottom of the file on a new line:

spamc   unix    -       n       n     -     -   pipe \
    user=spam argv=/usr/bin/spamc -f \
    -e /usr/sbin/sendmail -oi -f  ${sender} ${recipient}

If you're familiar with any type of programming, think of this configuration as working with functions: the first edit is the function call while the second edit is the function definition.
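To make the "call" and "definition" pairing concrete, here is a small sketch that checks both halves are present in a master.cf file. It is only an illustration: the function names are invented, and it assumes the -o content_filter option sits on the same line as the smtp service, as in the edit shown earlier.

```shell
#!/bin/sh
# Print the transport name that the smtp/inet service hands mail to,
# i.e. the value of "-o content_filter=..." on that line.
# $1: path to master.cf
filter_name() {
    sed -n 's/^smtp[[:space:]][[:space:]]*inet[[:space:]].*content_filter=\([^[:space:]]*\).*/\1/p' "$1" | head -n 1
}

# Succeed only if a service with that name is also defined at the
# start of some line in the same file (the "function definition").
# $1: path to master.cf
filter_is_defined() {
    name=$(filter_name "$1")
    [ -n "$name" ] && grep -q "^${name}[[:space:]]" "$1"
}
```

For example, `filter_is_defined /etc/postfix/master.cf && echo "filter wired up"` prints the message only when both edits are in place.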

While content_filter could have been defined in the main.cf file instead, a second entry would then have been needed in /etc/postfix/transport, as well as the transport definition in /etc/postfix/master.cf. Making two changes in one file is much simpler than making three changes in three files.

There's a lot of information here, but it's not that hard to understand. Postfix spawns an instance of spamc as the user "spam". The job of spamc is to connect to spamd. spamd runs the email against the SpamAssassin database to check for spam. Once spamd is finished, it returns the newly tagged email to spamc. Instead of printing the result, spamc pipes it to sendmail (that's what the -e option does), which re-injects the email into the mail system.

Here's a more visual representation:

[Figure Spam2-1: diagram of the mail flow described above]

The spam user that spamc runs as should be a simple, unprivileged account.
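On Debian, one way to create such an account is something like `adduser --system --group spam`, though any unprivileged account will do. As a small sketch (the function name is made up), the check below confirms that an account exists and is not root before you rely on it in master.cf:

```shell
#!/bin/sh
# Succeed if the named account exists and is not root (UID 0).
# $1: account name, e.g. "spam"
is_unprivileged() {
    uid=$(id -u "$1" 2>/dev/null) || return 1
    [ "$uid" -ne 0 ]
}
```

For example, `is_unprivileged spam || echo "fix the spam account first"` prints the warning if the account is missing or is root.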

Now that the transport system is in place, it's time to configure SpamAssassin.

Configuring SpamAssassin

Since there's enough information on configuring SpamAssassin to fill a book (which has been done), just a few basics will be covered. The main SpamAssassin config file is located at /etc/spamassassin/local.cf. A simple config could look like this:

rewrite_header Subject *****SPAM*****
required_score 7.5
use_bayes 1
bayes_path /var/spamassassin/bayes
bayes_auto_learn 1

rewrite_header: This will tag a header of your choice with whatever text you specify. The most common way to use this is for subject rewriting -- which is what I've shown.

required_score: SpamAssassin runs various tests on an email to check whether or not it's spam. Each test returns a score and all the scores are added up. If the total is greater than or equal to your required score, the email is considered spam. The default required score is 5.
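The arithmetic behind required_score is simple enough to sketch in a few lines. This is only an illustration of the comparison, not how SpamAssassin is implemented; the function name and the scores in the example are made up.

```shell
#!/bin/sh
# Add up per-test scores and compare the total with the required score.
# $1: required score; remaining arguments: individual test scores.
verdict() {
    required=$1
    shift
    printf '%s\n' "$@" | awk -v required="$required" '
        { total += $1 }
        END { if (total >= required) print "spam"; else print "ham" }'
}
```

With a required score of 7.5, `verdict 7.5 2.5 5.0` prints "spam" (the total is exactly 7.5), while `verdict 7.5 0.7` prints "ham".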

use_bayes: SpamAssassin stores its knowledge of collected spam in a Bayes database. This is also where sa-learn stores its knowledge when using the sweep script from Part One. By turning this on, we're giving SpamAssassin access to that database.

bayes_path: The name of this setting is a bit of a misnomer. /var/spamassassin/bayes is not a file or directory, but a prefix that SpamAssassin appends suffixes to when naming its database files. The resulting files look like this:

/var/spamassassin/bayes_seen
/var/spamassassin/bayes_toks
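In other words, the setting behaves like a filename prefix. A tiny sketch of that expansion, covering only the two files named above (the function name is invented):

```shell
#!/bin/sh
# List the database files derived from a bayes_path prefix.
# $1: the bayes_path value from local.cf
bayes_files() {
    for suffix in seen toks; do
        printf '%s_%s\n' "$1" "$suffix"
    done
}
```

Running `bayes_files /var/spamassassin/bayes` prints the two paths shown above, which is handy for checking that the files exist and are readable by the user spamd runs as.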

bayes_auto_learn: If turned on, SpamAssassin will automatically add an email to the Bayes database when its score is high enough to be confidently spam, or low enough to be confidently ham. This is similar to automatically calling sa-learn on the email.

Results

When all of this is put together, here's what a non-spam (ham) message will look like once it's received:

Return-Path: <joe@powerbook-g4-12.local>
X-Original-To: jane@terrarum.net
Delivered-To: jane@terrarum.net
Received: by terrarum.net (Postfix, from userid 1001)
    id D89E92A9; Tue, 14 Mar 2006 19:15:23 -0500 (EST)
X-Spam-Checker-Version: SpamAssassin 3.1.0 (2005-09-13)
    on terrarum.net
X-Spam-Level:
X-Spam-Status: No, score=0.7 required=7.5 tests=
    ALL_TRUSTED,AWL autolearn=ham version=3.1.0
Received: from powerbook-g4-12.local ( local [192.168.1.250])
    by terrarum.net (Postfix) with ESMTP id C47072A7
    for <jane@terrarum.net>; Tue, 14 Mar 2006 19:15:20 -0500 (EST)
Received: by powerbook-g4-12.local (Postfix, from userid 501)
    id 0FF341A9126; Tue, 14 Mar 2006 19:13:38 -0500 (EST)
To: jane@terrarum.net
Subject: Hellooo
Message-Id: <20060315001338.0FF341A9126@powerbook-g4-12.local>
Date: Tue, 14 Mar 2006 19:13:38 -0500 (EST)
From: joe@powerbook-g4-12.local (Joe Topjian)

hello

As you can see, SpamAssassin added three headers starting with X-Spam-: X-Spam-Checker-Version, X-Spam-Level, and X-Spam-Status.
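Because the verdict lives in an ordinary header, it is easy to pull out with standard tools. The following is a rough sketch (the function name is made up) that prints the Yes/No verdict and the score from a saved message, looking only at the header block:

```shell
#!/bin/sh
# Print the verdict ("Yes" or "No") and the score from a message's
# X-Spam-Status header, separated by a space.
# $1: path to a message file; reading stops at the first blank line,
# so lines in the body are never examined.
spam_status() {
    sed -n -e '/^$/q' \
        -e 's/^X-Spam-Status: \([A-Za-z]*\), score=\(-\{0,1\}[0-9.]*\).*/\1 \2/p' "$1"
}
```

Run against the message above, this would print "No 0.7".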

Issues

With this method, all mail is still delivered to the end user, ensuring no false positives are lost. In the case of a false positive, the message is rewritten by SpamAssassin, but the original contents are preserved. While this might cause some annoyance for the end user, they still receive their message.

The second issue is the flip side of the first: the end user receives all their mail, including real spam. Most email clients support rules for handling messages tagged as spam, but some end users might see this as too much work. It's therefore up to the administrator to put more controls in place to reduce the amount of spam the user receives. Future articles in this series will cover this.
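To give a sense of what such a client-side rule amounts to, here is a minimal sketch that files a saved message into a spam folder when its header says "Yes". It stands in for whatever rule mechanism a real mail client offers; the function name and both paths are hypothetical.

```shell
#!/bin/sh
# Move a message into a spam folder if SpamAssassin tagged it as spam.
# $1: message file, $2: spam directory (created if missing).
# Only the header block (up to the first blank line) is checked.
file_if_spam() {
    if sed '/^$/q' "$1" | grep -q '^X-Spam-Status: Yes'; then
        mkdir -p "$2" && mv "$1" "$2"/
    fi
}
```

For example, `file_if_spam ./message.eml ./Spam` leaves ham where it is and moves tagged spam into ./Spam.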

Conclusion

There's no use in keeping a spam database that sa-learn contributes to without actually using it. Here, I've presented a method to efficiently detect and tag spam at the mail server level. The detected messages show up in the end user's inbox marked as "spam", but with the original message still intact.
