Articles May 12, 2006 at 5:04 pm

Easy way for users to tweak Bayes filtering

A perl script to read individual user spam and ham folders.

After I got Bayes learning to work correctly in SpamAssassin, I wanted a more convenient way for users to mark messages that were incorrectly classified. The Apple recommendation, as I’m sure you know, requires the user to redirect the message to a particular email address, where a nightly script will process it.

Read on for more…

For each of my users, I created two new folders inside their “Deleted Messages” directory (which shows up as the Trash folder in Apple Mail): “Relearn as Spam” and “Relearn as Not Spam.” This lets the user simply drag the message to the appropriate folder. Because the new folders are subdirectories of Trash, the user understands that the message will be deleted, and she had better option-drag if she wants to keep it.

Each night, a Perl script runs and processes the messages in those folders. Those who know far more than this newb will be able to write a superior script, but I include mine here for any help it might offer. Here’s how it works: it iterates through /var/spool/imap/user/ and looks for files in the the “relearn” folders. If it finds something, it copies it to an archive of all relearned messages (I keep them in case I need to reset the Bayes filter someday) and purges the messages from the “relearn” folders. Then it goes through the newly archived files and runs them through sa-learn. It also keeps a log.

This might not be suitable for many; I have a very small number of mail users and they all use Apple Mail. I haven’t tried it in any other environment. But it works for me, and I hope someone might find this concept useful.

Here’s the script:

#! /usr/bin/perl
# my_sa_learn.pl
# 05 Mar 06, 20 Mar 06

# optional flags: -q for quiet; -v for verbose; -t for test
use strict;
use File::Copy;

########################
# adjust these values if needed:
my $mailstorepath = "/var/spool/imap/";
my $mailstoreuserpath = $mailstorepath . "user";
my $cyrususer = "cyrusimap";
my $clamavuser = "clamav";
my $ipurgepath = "/usr/bin/cyrus/bin/ipurge";
my $spamarchivedir = "/var/sa-learn-corpus/spam/";
my $hamarchivedir = "/var/sa-learn-corpus/ham/";
my $userspamdir = "/Deleted Messages/Relearn as Spam";
my $escapeduserspamdir = "/Deleted Messages/Relearn as Spam";
my $userhamdir = "/Deleted Messages/Relearn as Not Spam";
my $escapeduserhamdir = "/Deleted Messages/Relearn as Not Spam";
my $logfile = "/var/log/my.log";
########################

my $needSpamUpdate = 0;     my $needHamUpdate = 0;
my $corpuspath;             my $userhadfiles = 0;
my $spamdir;                my $hamdir;
my $thisuser;               my $mailfile;
my $dirname;                my $filecount = 0;
my $newfile;                my $oldfile;
my $didarchive = 0;         my $result;
my $args;                   my $thetype;
my $mailfilepath;           my $seconds = time;
my $thisuserspamdir;        my $thisuserhamdir;
my $shortdir;               my $longdir;
my $devnull = " >> /dev/null";
my $mode;

($args) = @ARGV[0];
# in verbose mode, reset $devnull to keep shell output:
if ($args =~ /-v/ ) {
    $devnull = "";
    $mode = "verbose";
} elsif ($args =~/-q/ ) {
    $mode = "quiet";
} elsif ($args =~/-t/ ) {
    $mode = "test";
}
# get yyyymmdd to use as a directory name:
my $day = "00" . (localtime)[3];
$day = substr($day,-2);
my $month = "00" . ((localtime)[4] + 1);
$month = substr($month,-2);
my $year = (localtime)[5] + 1900;
my $datestring = "$year$month$day";
my $spamcorpuspath = "$spamarchivedir$datestring/";
my $hamcorpuspath = "$hamarchivedir$datestring/";

##############################################################
# Main loop: iterate through /var/spool/imap/user/ and copy
# items to the archive
Qprint("my_sa_learn.pl running...n");
opendir(USERS, $mailstoreuserpath) or die "Could not open $mailstoreuserpath.";
while (defined( $thisuser = readdir(USERS) )) {
    if ($thisuser !~ /^./ ) {
        Vprint("Checking directory $mailstoreuserpath/$thisuser...n");
        #$thisuserspamdir = "/$thisuser$userspamdir";
        #$thisuserhamdir = "/$thisuser$userhamdir";
        Checkuser('spam');
        Checkuser('ham');
        Vprint("Finished checking directory $mailstoreuserpath/$thisusern");
    }
}
closedir(USERS);

# Now process anything that we copied to the archive
if ($needSpamUpdate) {
    DoLearn('spam');
}

if ($needHamUpdate) {
    DoLearn('ham');
}

# Sync the db if needed
if ($didarchive) {
    Vprint("Calling  sa-learn --sync...n");
    if ($mode =~ /test/) {
        $result = 0;
    } else {
        $result = system "su - $clamavuser -c "sa-learn --sync $devnull"";
    }
    if ($result) {
        Qprint("sa-learn --sync failed with a result of $resultn");
    } else {
        Vprint("sa-learn --sync sucessfuln");
    }
    Vprint("Finished calling sa-learn --syncn");
}

$seconds = time - $seconds;
Qprint("Learned $filecount items in $seconds seconds.n");

###########################################
sub Makecorpusdir {
    ($dirname) = @_ ;
    Vprint("Looking for $dirname...n");
    if ( -d $dirname) {
        Vprint("Found $dirnamen");
    } else {
        mkdir $dirname, 700 || die "Could not create $dirname";
        Vprint("Created $dirnamen");
    }
}

###############################################
sub Checkuser {
    #($shortdir) = @_[0];
    ($thetype) = @_ ;
    $userhadfiles = 0;
    Vprint("Looking for $thetype filesn");

    # Because the copy operation and the ipurge operation use different paths, we
    # have to build two paths using only $thisuser and $thetype
    if ($thetype =~ /spam/) {
        $longdir = "$mailstoreuserpath/$thisuser$userspamdir";
        $shortdir = "user/$thisuser$escapeduserspamdir";
    } else {
        $longdir = "$mailstoreuserpath/$thisuser$userhamdir";
        $shortdir = "user/$thisuser$escapeduserhamdir";
    }
    Vprint("Looking for $longdir...n");
    if ( -d $longdir) {
        Vprint("Found $longdirn");
        opendir(SPAMDIR, $longdir) or die "Could not open $longdir.";
        while (defined( $mailfile = readdir(SPAMDIR) )) {
            if ($mailfile !~ /^./ ) {
                if ($mailfile !~ /^cyrus/ ) {
                    Vprint("File found at $mailfilen");
                    $userhadfiles = 1;

                    if ($thetype =~ /spam/) {
                        $corpuspath = $spamcorpuspath;
                        if ($needSpamUpdate == 0) {
                            $needSpamUpdate = 1;
                            Makecorpusdir($corpuspath);
                        }
                    } else {
                        $corpuspath = $hamcorpuspath;
                        if ($needHamUpdate == 0) {
                            $needHamUpdate = 1;
                            Makecorpusdir($corpuspath);
                        }
                    }
                    $filecount++;
                    $newfile = "$corpuspath$thisuser-$filecount";
                    $oldfile = "$longdir/$mailfile";
                    if ($mode !~ /test/) {
                        copy($oldfile, $newfile) || die "Could not copy $oldfile to $newfile";
                    }
                    Vprint("Copied $oldfile to $newfilen");
                }
            }
        }

        if ($userhadfiles) {
            Vprint("Purging $thetype files in $shortdir...n");
            if ($mode =~ /test/) {
                $result = 0;
            } else {
                $result = system "sudo -u $cyrususer $ipurgepath -d 0 -f $shortdir $devnull";
            }
            # ipurge needs this path: user/ross/Deleted Messages/Relearn as Spam
            # no trailing slash, path starts at user/
            # -d 0 means all files, regardless of age; -f means force; this line
            # was adapted from spamtrainer by Athanasios Alexandrides
            if ($result) {
                Qprint("Purge failed with a result of $resultn");
            } else {
                Vprint("Purge sucessfuln");
            }
        } else {
            Vprint("No $thetype files found for $thisusern");
        }
        closedir(SPAMDIR);
    } else {
        Vprint("Did not find $longdirn");
    }
}

sub DoLearn {
    ($thetype) = @_ ;
    if ($thetype =~ /spam/) {
        $corpuspath = $spamcorpuspath;
    } else {
        $corpuspath = $hamcorpuspath;
    }
    Vprint("Calling sa-learn --$thetype on files in $corpuspath...n");
    opendir(SPAMCORPUS, $corpuspath) or die "Could not open $corpuspath.";
    while (defined( $mailfile = readdir(SPAMCORPUS) )) {
        $mailfilepath = "$corpuspath$mailfile";
        if ($mailfile !~ /^./ ) {
            Vprint("Running sa-learn on $mailfile...n");
            if ($mode =~ /test/) {
                $result = 0;
            } else {
                $result = system "cat $mailfilepath | su - clamav -c "sa-learn --$thetype --no-sync $devnull "";
            }
            if ($result) {
                Qprint("sa-learn on $mailfile failed with a result of $resultn");
            } else {
                Vprint("sa-learn on $mailfile sucessfuln");
            }
        }
    }
    closedir(SPAMCORPUS);
    Vprint("Finished calling sa-learn --$thetypen");
    $didarchive = 1;
}

sub Vprint {
    ($_) = @_;
    if (($mode =~ /verbose/) || ($mode =~ /test/)) {
        WriteToLog($_);
    }
}

sub Qprint {
    ($_) = @_;
    if ($mode !~ /quiet/) {
        WriteToLog($_);
    }
}

sub WriteToLog {
    ($_) = @_;
    my $logstring = "$datestring " . (localtime[2]) . ":" . (localtime)[1] . ":" . (localtime)[0];
    $logstring .= " $_";
    open(LOGFILE, ">>$logfile") ;
    print LOGFILE $logstring;
    close LOGFILE;
}

Leave a reply

You must be logged in to post a comment.