stories of spam part 3

Introduction

By using Transports with Postfix, you can add a lot of flexibility and customization to how your mail is handled. In Part Two of this series, I described how to use Transports in conjunction with the spamc program to tag email as either spam or ham. While spamc does a great job at tagging, it doesn't provide any features beyond that -- such as delivery options, quarantining, and virus scanning. As a result, more robust scripts and packages have been created to provide these extra features. Before I cover some of those packages, I'd like to explain how to write a custom spam filtering script.

Tools

All the same tools from Part Two will be used in this part. The Spam Filter script will be written in Python -- specifically version 2.4 since I'll be using the subprocess module.

Installation

Python 2.3 is installed by default on Debian Sarge. To replace it with Python 2.4, run these commands:

# apt-get remove --purge python2.3
# apt-get install python2.4 python2.4-dev

How the Spam Filter Script Will Work

The Spam Filter script will have two objectives: to detect if an incoming email is spam and determine if it should still be delivered to the mail box. If the filter decides not to deliver the message, it will simply place the email in a separate mail box which I'll call the spam pit.

The script will read the incoming message via STDIN, enabling it to be invoked in several different ways:

$ cat message.txt | spamfilter.py
$ spamfilter.py < message.txt

Or in the case of Postfix, it will replace spamc in the transport:

spamd  unix  -    n    n    -    -    pipe \
  user=spam argv=/usr/bin/spamfilter.py ${sender} ${recipient}
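
Postfix substitutes ${sender} and ${recipient} before it runs the command, so the script sees them as ordinary command-line arguments. According to pipe(8), ${recipient} expands to one argument per recipient, so the script may receive more than one. Here is a minimal sketch of reading them; parse_args and the addresses are hypothetical names used for illustration, not part of the original script:

```python
def parse_args(argv):
    """Split a pipe(8)-style argument list into (sender, recipients)."""
    if len(argv) < 3:
        raise ValueError("usage: spamfilter.py SENDER RECIPIENT [RECIPIENT ...]")
    # argv[0] is the script name, argv[1] the envelope sender,
    # and everything after that is an envelope recipient.
    return argv[1], argv[2:]


# Hypothetical addresses, for illustration only.
sender, recipients = parse_args(
    ["spamfilter.py", "alice@example.org", "bob@example.com", "carol@example.com"])
```

In the real script the list passed in would be sys.argv.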

Dissecting the Script

The modules used in the script will be declared at the very beginning:

import sys, re
import subprocess
import datetime, time
import socket

The sys module allows the message to be read via STDIN. re adds support for Regular Expressions which will be used to determine if the message was tagged as spam. Next, the subprocess module (new to Python 2.4) allows for easy access to external programs. Finally, the datetime, time, and socket modules will be used together to create a unique name for a message dropped in the spam pit (for example, terrarum.1143163137).

The next step of the script is to define some variables:

spampit = "/var/spool/vmail/terrarum.net/spampit/new/"
hostname = socket.gethostname()
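# Adjacent string literals are joined by Python. A backslash continuation
# inside a raw string would become part of the pattern and break the match.
header = re.compile(r'^X-Spam-Status: (No|Yes), '
    r'score=(-?\d+\.\d+) required=(-?\d+\.\d+).*$')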
message = []
is_spam = 0

Most of this is self-explanatory: spampit is the directory where the script will dump spam; hostname holds the hostname of the server; message starts out as an empty Python list; and is_spam is a boolean switch that is initially false.

The line that deserves more detail is the header declaration. This regular expression detects whether a message has been tagged as spam. It matches the X-Spam-Status header of the email and captures three values: the Yes or No answer, the score the email received, and the score the server requires before it calls a message spam.
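
To make the three captured values concrete, here is a small sketch that runs the pattern against a single header line. The sample line is made up for illustration, and the pattern is written as two adjacent string literals so that it fits on short lines:

```python
import re

# The script's pattern, split across two adjacent raw-string literals.
header = re.compile(r'^X-Spam-Status: (No|Yes), '
                    r'score=(-?\d+\.\d+) required=(-?\d+\.\d+).*$')

# Hypothetical header line, invented for this example.
sample = "X-Spam-Status: Yes, score=7.2 required=5.0 tests=EXAMPLE_RULE"

m = header.match(sample)
verdict = m.group(1)          # the Yes or No answer
score = float(m.group(2))     # the score the message received
required = float(m.group(3))  # the score the server treats as spam
```

The groups come back as strings, which is why the script converts them with float() before comparing them.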

Next, the script reads the message in via STDIN:

for line in sys.stdin:
    message.append(line)

Appending each line to a list and joining the list once at the end is the recommended way to build up a large string in Python.
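
The same pattern can be tried without a mail server. The sketch below is written for a current Python and uses io.StringIO as a stand-in for sys.stdin; read_message is a hypothetical helper name, not something from the original script:

```python
import io


def read_message(stream):
    """Collect every line of a file-like object into a list."""
    lines = []
    for line in stream:
        lines.append(line)
    return lines


# io.StringIO plays the role of sys.stdin here (an assumption for the demo).
fake_stdin = io.StringIO(u"Subject: hello\n\nJust a test.\n")
lines = read_message(fake_stdin)

# Join once at the end to get the whole message back as a single string.
text = "".join(lines)
```

Each line keeps its trailing newline, so joining with an empty string reproduces the message exactly.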

Next, spamc will be invoked to check the message:

p = subprocess.Popen(["spamc"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE)
# communicate() writes the message, closes stdin, and reads the reply,
# which avoids a deadlock on large messages
message = p.communicate("".join(message))[0]

The subprocess module enables the script to open the spamc program as a file that can be written to and read from. The message variable is overwritten with the contents of the new, tagged email.
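
The write-then-read round trip can be tried on a machine without SpamAssassin. In this sketch, written for a current Python, a one-line Python child process stands in for spamc: it copies its input to its output and adds a made-up X-Spam-Checked header. Both the stand-in command and the run_filter helper are assumptions for illustration only:

```python
import subprocess
import sys

# Stand-in for spamc (an assumption): a child process that prepends one
# header line to whatever it reads on stdin.
fake_spamc = [
    sys.executable, "-c",
    "import sys; sys.stdout.write('X-Spam-Checked: yes\\n' + sys.stdin.read())",
]


def run_filter(command, text):
    """Pipe text through an external command and return what it prints."""
    p = subprocess.Popen(command,
                         stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                         universal_newlines=True)  # exchange str, not bytes
    # communicate() sends the input, closes stdin, and collects stdout.
    out, _ = p.communicate(text)
    return out


tagged = run_filter(fake_spamc, "Subject: hello\n\nbody\n")
```

Swapping fake_spamc for ["spamc"] gives the call used in the script.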

In the next part, the script will loop through the tagged email line by line to search for the X-Spam-Status header. Once it finds the line, it will check to see if the score is greater than or equal to the required score. If it is, the is_spam boolean will be set to True.

for line in message.split("\n"):
    r = header.match(line)
    if r:
        if float(r.group(2)) >= float(r.group(3)):
            is_spam = 1
        break
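
The check is easier to experiment with when it is wrapped in a function. This is a sketch rather than the script's own code: looks_like_spam is a hypothetical name and the two messages are invented samples.

```python
import re

header = re.compile(r'^X-Spam-Status: (No|Yes), '
                    r'score=(-?\d+\.\d+) required=(-?\d+\.\d+).*$')


def looks_like_spam(text):
    """Return True when the first X-Spam-Status header meets its threshold."""
    for line in text.split("\n"):
        r = header.match(line)
        if r:
            # group(2) is the message score, group(3) the required score.
            return float(r.group(2)) >= float(r.group(3))
    # No X-Spam-Status header at all: treat the message as ham.
    return False


# Invented sample messages, for illustration only.
spam_msg = "X-Spam-Status: Yes, score=12.4 required=5.0\nSubject: win big\n\nbuy now\n"
ham_msg = "X-Spam-Status: No, score=-1.3 required=5.0\nSubject: lunch?\n\nnoon?\n"

spam_result = looks_like_spam(spam_msg)
ham_result = looks_like_spam(ham_msg)
```

Comparing the two numbers, rather than trusting the Yes or No answer, is what lets the script apply its own idea of what counts as spam.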

Finally, the script will determine what to do with the message based on the answer from the previous block. If the message is spam, it will create a unique name for the message and dump it in the spam pit. However, if the message is not spam, the script will simply pass the message back to Postfix for delivery.

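if is_spam:
    # Name the file hostname.epoch, for example terrarum.1143163137
    epoch = int(time.mktime(datetime.datetime.now().timetuple()))
    f = open("%s%s.%s" % (spampit, hostname, epoch), 'w')
    f.write(message)
    f.close()
else:
    # Hand the message back to Postfix; argv[1] is the sender and
    # argv[2:] holds every recipient Postfix passed in
    sendmail = subprocess.Popen(
        ["/usr/sbin/sendmail", "-oi", "-f", sys.argv[1]] + sys.argv[2:],
        stdin=subprocess.PIPE)
    # communicate() sends the message and closes stdin so sendmail can finish
    sendmail.communicate(message)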
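
The spam pit half of this block can be tested on its own. The sketch below writes into a temporary directory instead of the real spool; pit_path and drop_in_pit are hypothetical helper names, and the message text is invented:

```python
import os
import socket
import tempfile
import time


def pit_path(spampit, hostname, now=None):
    """Build a hostname.epoch file name inside the spam pit directory."""
    if now is None:
        now = time.time()
    return os.path.join(spampit, "%s.%d" % (hostname, int(now)))


def drop_in_pit(spampit, text):
    """Write the message into the spam pit and return the path used."""
    path = pit_path(spampit, socket.gethostname())
    f = open(path, "w")
    try:
        f.write(text)
    finally:
        f.close()
    return path


# A temporary directory stands in for the real spam pit (an assumption).
demo_pit = tempfile.mkdtemp()
saved = drop_in_pit(demo_pit, "Subject: win big\n\nbuy now\n")
```

Because the name only has one-second resolution, two spam messages handled by the same host within the same second would get the same name. A busier server would need something more unique than the epoch alone.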

Issues

While playing Delivery God with the mail server can give an Administrator total control of the messages passing through the system, it can also backfire. A false-positive message -- one that has been marked as spam but is really legit -- will never reach the intended recipient. As a countermeasure, the email can be dug up from the spam pit. By then, however, the end-user has probably lost all trust in the delivery system and will constantly wonder whether other messages have been wrongly tagged. A good rule of thumb is to set the required score high enough that legitimate mail rarely reaches it. Some spam will still slip by and be delivered to the recipient, but the chance of losing a real message is lower.
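
One way to apply that rule of thumb inside the script, rather than in the SpamAssassin configuration, is to divert only the messages that clear the required score by some margin. This is a variation on the script, not something it does as written; should_pit and PIT_MARGIN are hypothetical names, and the value 3.0 is an arbitrary placeholder rather than a recommendation:

```python
# Hypothetical safety margin; pick a value that suits your own mail.
PIT_MARGIN = 3.0


def should_pit(score, required, margin=PIT_MARGIN):
    """Divert a message only when it beats the threshold by a clear margin."""
    return score >= required + margin


# A borderline message is still delivered, an obvious one is diverted.
borderline = should_pit(5.2, 5.0)
obvious = should_pit(9.1, 5.0)
```

Messages that land between the two thresholds are delivered with their spam headers intact, so the recipient's mail client can still sort them.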

Conclusion

The script outlined in this part is nowhere near as feature-complete as other packages. It simply serves as a short example of how an Administrator can add more control to mail delivery. Understanding the basic routines of mail control also provides a solid basis for understanding those more robust packages.

View the completed Spam Filter script.
