Official Versions: English or French
Maintainer: David Relson <relson@osagesoftware.com>
This document is intended to answer frequently asked questions about bogofilter.
Bogofilter is a fast Bayesian spam filter along the lines suggested by Paul Graham in his article A Plan For Spam. Bogofilter uses Gary Robinson's geometric-mean algorithm, with Fisher's method modification, to classify email as spam or non-spam.
The bogofilter home page at SourceForge is the central clearinghouse for bogofilter resources.
Bogofilter was started by Eric S. Raymond on August 19, 2002. It gained popularity in September 2002, and a number of other authors have started to contribute to the project.
The NEWS file describes bogofilter's version history.
Bogofilter is a kind of bogometer or bogon filter, i.e., it tries to identify bogus mail by measuring its bogosity.
There are currently four mailing lists for bogofilter:
List Address | Description
---|---
bogofilter-announce@aotto.com | An announcement-only list where new versions are announced.
bogofilter@aotto.com | A discussion list where any conversation about bogofilter may take place.
bogofilter-dev@aotto.com | A list for sharing patches, development, and technical discussions.
bogofilter-cvs@lists.sourceforge.net | A list for announcing code changes to the CVS archive.
"Training on error" involves scanning a corpus of known spam and non-spam messages; only those that are misclassified, or classed as unsure, get registered in the training database. It's been found that sampling just messages prone to misclassification is an effective way to train; if you train bogofilter on the hard messages, it learns to handle obvious spam and non-spam too.
This method can be enhanced by using a "security margin". By increasing the spam cutoff value and decreasing the ham cutoff value during training, messages which score close to a cutoff are also treated as errors and used for training. Using security margins has been shown to improve results when training on error.
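For illustration, here is a minimal train-on-error sketch with a security margin. It assumes spam_cutoff=0.6 in normal use, widens the margins to 0.8/0.3 for the training run via -o, and uses hypothetical spam/ and ham/ directories holding one known message per file:

# Train-on-error with a security margin (sketch; spam/, ham/ and the 0.8,0.3
# margins are illustrative choices, not bogofilter defaults).
for msg in spam/*; do
    bogofilter -o 0.8,0.3 < "$msg"      # exit code: 0=spam, 1=ham, 2=unsure
    if [ $? -ne 0 ]; then
        bogofilter -s < "$msg"          # not confidently spam yet: register it as spam
    fi
done
for msg in ham/*; do
    bogofilter -o 0.8,0.3 < "$msg"
    if [ $? -ne 1 ]; then
        bogofilter -n < "$msg"          # not confidently ham yet: register it as ham
    fi
done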
"Training to exhaustion" is repeating training on error, with the same message corpus, until no errors remain.
A basic assumption of Bayes' theorem is that the messages used for training are a randomly chosen sample of the messages received. This assumption is violated when messages are selected for training by analyzing them first. Though theoretically wrong, in practice "training on error" seems to work.
Registering messages different numbers of times (as can happen with training to exhaustion) changes the distribution of token scores in the training database so that it's different from the distribution in the input messages. This violates a basic assumption of Bayesian classification, and may lead to unpredictable results. Though theoretically wrong, in practice "training to exhaustion" seems to work.
Note: bogominitrain.pl has a -f option to do "training to exhaustion". If you choose to use it, you need to be aware of its possible effects. Using -fn avoids repeated training for each message.
To classify messages as ham (non-spam) or spam, bogofilter needs to learn from your mail. To start with, it is best to have collections (as large as possible) of messages you know for sure are ham or spam. (Errors here will cause problems later, so try hard ;-).) Warning: only use your own mail; using other collections (like a spam collection found on the web) might cause bogofilter to draw wrong conclusions; after all, you want it to understand your mail.
Once you have the spam and ham collections, you have basically four choices. In all cases it works better if your training base (the above collections) is bigger, rather than smaller. The smaller your training collection is, the higher the number of errors bogofilter will make in production. Let's assume your collection is two mbox files: ham.mbx and spam.mbx.
Method 1) Full training. Train bogofilter with all your messages. In our example:
bogofilter -s < spam.mbx
bogofilter -n < ham.mbx
Note: bogofilter's contrib directory includes two scripts that both use a train-on-error technique. This technique scores each message and adds to the database only those messages that were scored incorrectly (messages scored as uncertain, ham scored as spam, or spam scored as ham). The goal is to build a database of those words needed to correctly classify messages. The resulting database is smaller than the one built using full training.
Method 2) Use the script bogominitrain.pl (in the contrib directory). It checks the messages in the same order as your mailbox files. You can use the -f option, which will repeat this until all messages in your training collection are classified correctly (you can even adjust the level of certainty). Since the script makes sure the database understands your training collection "exactly" (with your chosen precision), it works very well. You can use -o to create a security margin around your spam_cutoff. Assuming spam_cutoff=0.6, you might want to score all ham in your collection below 0.3 and all spam above 0.8. Our example is:
bogominitrain.pl -fnv ~/.bogofilter ham.mbx spam.mbx '-o 0.8,0.3'
Method 3) Use the script randomtrain (in the contrib directory). The script generates a list of all the messages in the mailboxes, randomly shuffles the list, and then scores each message, with training as needed. In our example:
randomtrain -s spam.mbx -n ham.mbx
As with method 4, it works better if you start with full training using several thousand messages. This will give a database that is more comprehensive and significantly bigger.
Method 4) If you have enough spams and non-spams in your training collection, separate out some 10,000 spams and 10,000 non-spams into separate mbox files, and train as in method 1. Then use bogofilter to classify the remaining spams and non-spams. Take any messages that it classifies as unsure or classifies incorrectly, and train with those. Here are two little scripts you can use to classify the train-on-error messages:
#! /bin/sh
# class3 -- classify one message as bad, good or unsure
cat >msg.$$
bogofilter $* <msg.$$
res=$?
if [ $res = 0 ]; then
    cat msg.$$ >>corpus.bad
elif [ $res = 1 ]; then
    cat msg.$$ >>corpus.good
elif [ $res = 2 ]; then
    cat msg.$$ >>corpus.unsure
fi
rm msg.$$
#! /bin/sh
# classify -- put all messages in mbox through class3
src=$1; shift
formail -s class3 $* <$src
In our example (after the initial full training):
classify spam.mbx [bogofilter options]
bogofilter -s < corpus.good
rm -f corpus.*
classify ham.mbx [bogofilter options]
bogofilter -n < corpus.bad
rm -f corpus.*
It is important to understand the consequences of the methods just described. Doing full training as in methods 1 and 4 produces a larger database than does training with methods 2 or 3. If your database size needs to be small (for example due to quota limitations), use methods 2 or 3.
Full training with method 1 is fastest. Training on error (as in methods 2, 3 and 4) is effective, but learning is pretty slow.
Bogofilter will make mistakes once in a while. So ongoing training is important. There are two main methodologies for doing this. First, you can train with every incoming message (using the -u option). Second, you can train on error only.
Since you might want to rebuild your database at some point, for example when a major new feature is implemented in bogofilter, it can be very useful to update your training collection continuously.
Bogofilter always does the best it can with the information available to it. However, it will make mistakes, i.e., classify ham as spam (false positives) or spam as ham (false negatives). To reduce the likelihood of repeating the mistake, it is necessary to train bogofilter with the errant message. If a message is incorrectly classified as spam, use switch -n to train with it as ham. Use switch -s to train with a spam message.
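For example, to correct two misclassified messages that you have saved to files (the file names here are only placeholders):

bogofilter -n < false-positive.msg    # ham that was wrongly classified as spam
bogofilter -s < false-negative.msg    # spam that was wrongly classified as ham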
Bogofilter has a -u switch that automagically updates the wordlists after scoring each message. As bogofilter sometimes misclassifies a message, monitoring is necessary to correct any mistakes. Corrections can be done using -Sn to change a message's classification from spam to non-spam and -Ns to change it from non-spam to spam.
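For example, if you run with -u and later spot a mistake (again, the file names are placeholders):

bogofilter -Sn < was-registered-as-spam.msg    # undo the spam registration, register as ham
bogofilter -Ns < was-registered-as-ham.msg     # undo the ham registration, register as spam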
Correcting a misclassified message may affect classification of other messages. The smaller your database is, the higher the likelihood that a training error will cause a misclassification.
Using a method like #2 or #3 (above) can compensate for this effect. Repeat the training with your complete training collection (including all the new messages added since the earlier training). This will add messages to the database that offset the adverse effects on both sides, until a new equilibrium is reached.
An alternative strategy, based on method 4 in the previous section, is the following: periodically take blocks of messages and use the scripts from method 4 above to classify them. Then manually review the good, bad, and unsure files, correct any errors, and split the unsures into spam and non-spam. Until you have accumulated some 10,000 spam and 10,000 non-spam in your training database, train with the good, the bad, and the separated errors and unsures; thereafter, train only with the corrected errors and the separated unsures, discarding the messages that bogofilter already classifies correctly.
Note that you should periodically run:
bogoutil -d wordlist.db | bogoutil -l wordlist.db.new
mv wordlist.db wordlist.db.prv
mv wordlist.db.new wordlist.db
or the equivalent commands for spamlist.db and goodlist.db (if you are using bogofilter with separate spam and ham wordlists). This will compact the database so that it occupies the minimum of disk space.
If you have a working SpamAssassin installation (or care to create one), you can use its return codes to train bogofilter. The easiest way is to create a script for your MDA that runs SpamAssassin, tests the spam/non-spam return code, and runs bogofilter to register the message as spam (or non-spam). The sample procmail recipe below shows one way to do this:
BOGOFILTER = "/usr/bin/bogofilter"
BOGOFILTER_DIR = "training"
SPAMASSASSIN = "/usr/bin/spamassassin"

:0 HBc
* ? $SPAMASSASSIN -e
#spam yields non-zero
#non-spam yields zero
| $BOGOFILTER -n -d $BOGOFILTER_DIR

#else (E)
:0Ec
| $BOGOFILTER -s -d $BOGOFILTER_DIR

:0fw
| $BOGOFILTER -p -e

:0:
* ^X-Bogosity:.Yes
spam

:0:
* ^X-Bogosity:.No
non-spam
Bogofilter can be instructed to display information on the scoring of a message by running it with flags "-v", "-vv", "-vvv", or "-R".
X-Bogosity: No, tests=bogofilter, spamicity=0.500000
X-Bogosity: No, tests=bogofilter, spamicity=0.500000
  int  cnt      prob  spamicity  histogram
 0.00   29  0.000209   0.000052  #############################
 0.10    2  0.179065   0.003425  ##
 0.20    2  0.276880   0.008870  ##
 0.30   18  0.363295   0.069245  ##################
 0.40    0  0.000000   0.069245
 0.50    0  0.000000   0.069245
 0.60   37  0.667823   0.257307  #####################################
 0.70    5  0.767436   0.278892  #####
 0.80   13  0.836789   0.334980  #############
 0.90   32  0.984903   0.499835  ################################
Each row shows an interval, the count of tokens with scores in that interval, the average spam probability for those tokens, the message's spamicity score (for those tokens and all lesser valued tokens), and a bar graph corresponding to the token count.
In the above histogram there are a lot of low scoring tokens and a lot of high scoring tokens. They "balance" one another to give the spamicity score of 0.500000.
X-Bogosity: No, tests=bogofilter, spamicity=0.500000
                      n    pgood      pbad        fw  U
"which"              10  0.208333  0.000000  0.000041  +
"own"                 7  0.145833  0.000000  0.000059  +
"having"              6  0.125000  0.000000  0.000069  +
...
"unsubscribe.asp"     2  0.000000  0.095238  0.999708  +
"million"             4  0.000000  0.190476  0.999854  +
"copy"                5  0.000000  0.238095  0.999883  +
N_P_Q_S_s_x_md      138  0.00e+00  0.00e+00  5.00e-01  1.00e-03  4.15e-01  0.100

The columns show each token, its count (n), its frequencies in the ham and spam wordlists (pgood and pbad), its spam score (fw), and a flag marking whether the token was used in the calculation.
The final line, labeled N_P_Q_S_s_x_md, gives the corresponding values of N, P, Q, S, s, x, and md.
The "-R" output is formatted for use with the R language for statistical computing. More information is available at The R Project for Statistical Computing.
Many people get unsolicited email using Asian language charsets. Since they don't know the languages and don't know people there, they assume it's spam.
The good news is that bogofilter does detect them quite successfully. The bad news is that this can be expensive. You have basically two choices:
You can simply let bogofilter handle it. Just train bogofilter with the Asian language messages identified as spam. Bogofilter will parse the messages as best it can and will add tokens to the spam wordlist. The wordlist will contain many tokens which don't make sense to you (since the charset cannot be displayed), but bogofilter can work with them and successfully identify Asian spam.
A second method is to use the "replace_nonascii_characters" config file option. This will replace high-bit characters, i.e. those between 0x80 and 0xFF, with question marks, '?'. This keeps the database much smaller. Unfortunately, this conflicts with European languages, which have many accented vowels and consonants in the high-bit range.
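If you want to try this option, the sketch below shows the line to add to bogofilter's configuration file; the exact file location depends on your installation, so treat the path as an assumption:

# e.g. /etc/bogofilter.cf or your per-user config file (location varies by installation)
replace_nonascii_characters=yes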
If you are sure you will not receive any legitimate messages in those languages, you can kill them right away. This will keep the database smaller. You can do this with an MDA script.
Here's a procmail recipe that will sideline messages written with Asian charsets:
## Silently drop all Asian language mail
UNREADABLE='[^?"]*big5|iso-2022-jp|ISO-2022-KR|euc-kr|gb2312|ks_c_5601-1987'

:0:
* 1^0 $ ^Subject:.*=\?($UNREADABLE)
* 1^0 $ ^Content-Type:.*charset="?($UNREADABLE)
spam-unreadable

:0:
* ^Content-Type:.*multipart
* B ?? $ ^Content-Type:.*^?.*charset="?($UNREADABLE)
spam-unreadable
With the above recipe, bogofilter will never see the message.
To find the spam and ham counts for a token (word) use bogoutil's '-w' option. For example, "bogoutil -w $BOGOFILTER_DIR example.com" gives the good and bad counts for "example.com".
If you want the spam score in addition to the spam and ham counts for a token (word), use bogoutil's '-p' option. For example, "bogoutil -p $BOGOFILTER_DIR example.com" gives the good and bad counts and the spam score for "example.com".
To find out how many messages are in your wordlists query the special token .MSG_COUNT, i.e., run command "bogoutil -w $BOGOFILTER_DIR .MSG_COUNT" to see the counts for the spam and ham word lists.
To tell how many tokens are in your wordlists pipe the output of bogoutil's dump command to command "wc", i.e. use "bogoutil -d $BOGOFILTER_DIR/wordlist.db | wc -l " to display the count. (If you've got spamlist.db and goodlist.db, run the command for each of them).
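Putting these queries together, a quick inspection session might look like the following sketch (it assumes $BOGOFILTER_DIR points at your bogofilter directory and a single wordlist.db):

bogoutil -w $BOGOFILTER_DIR example.com           # ham and spam counts for one token
bogoutil -p $BOGOFILTER_DIR example.com           # the same counts plus the spam score
bogoutil -w $BOGOFILTER_DIR .MSG_COUNT            # number of ham and spam messages trained
bogoutil -d $BOGOFILTER_DIR/wordlist.db | wc -l   # number of tokens in the wordlist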
If you think your word lists are hosed, you can see what BerkeleyDB thinks by running:
db_verify wordlist.db
If there is a problem, you may be able to recover some (or all) of the tokens and their counts with the following commands:
bogoutil -d wordlist.db | bogoutil -l wordlist.db.new
or with
db_dump -r wordlist.db > wordlist.txt
db_load wordlist.new < wordlist.txt
If you've got two wordlists (spamlist.db and goodlist.db), run the above commands for each of them.
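For instance, a small sketch that runs the bogoutil-based recovery over both lists:

for db in spamlist.db goodlist.db; do
    bogoutil -d $db | bogoutil -l $db.new    # rebuild each wordlist into a .new file
done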
If you don't already have a v3.0+ version of BerkeleyDB, then download it, unpack it, and do these commands in the db directory:
$ cd build_unix
$ sh ../dist/configure
$ make
# make install
Next, download a portable version of bogofilter.
Unpack it, and then do:
$ ./configure --with-db=/usr/local/BerkeleyDB.4.1
$ make
# make install-strip
You will either want to put a symlink to libdb.so in /usr/lib, or use a modified LD_LIBRARY_PATH environment variable before you start bogofilter.
$ LD_LIBRARY_PATH=/usr/lib:/usr/local/lib:/usr/local/BerkeleyDB.4.1
Note that some make versions shipped with Solaris break when you try to build bogofilter outside of its source directory. Either build in the source directory (as suggested above) or use GNU make (gmake).
The FreeBSD ports and packages carry very recent versions of bogofilter. This approach uses the highly recommended portupgrade and cvsup software packages. To install these two fine pieces, type (you need to do this only once):
# pkg_add -r portupgrade cvsup
To install or upgrade bogofilter, just upgrade your portstree using cvsup, then type:
# portupgrade -N bogofilter
See the file doc/programmer/README.hp-ux in the source distribution.
If you're only reading from them, there are no problems. If you're updating them, you need to use correct file locking to avoid data corruption. When you compile bogofilter, verify that the configure script has set "#define HAVE_FCNTL 1" in your config.h file. Popular UNIX operating systems all support this. If you are running an unusual or older operating system, make sure it supports fcntl(). If "#define HAVE_FCNTL 1" is set, which indicates fcntl() is supported on your system, then comment out "#define HAVE_FLOCK 1" so that the locking code uses fcntl() locking instead of the default flock() locking. If your system does not support fcntl(), you will not be able to share wordlist files over NFS without the risk of data corruption.
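As a quick check after running configure, something like the following sketch can confirm what ended up in config.h (the exact lines may differ on your system):

grep -nE 'HAVE_FCNTL|HAVE_FLOCK' config.h
# expect:  #define HAVE_FCNTL 1
# then edit config.h and comment out the HAVE_FLOCK line, e.g.:
#   /* #define HAVE_FLOCK 1 */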
Next, make sure you have NFS set up properly, with "lockd" running. Refer to your NFS documentation for more information about running "lockd" or "rpc.lockd". Most operating systems with NFS turn this on by default.
Likely the return codes are being reformatted by waitpid(2). In C, use WEXITSTATUS(status) from sys/wait.h, or a comparable macro, to get the correct value. In Perl you can just use 'system("bogofilter $input") >> 8'. If you want more info, run "man waitpid".
With version 0.11 bogofilter's options for registering mail as ham or spam have been changed. They now allow registering (or unregistering) messages in the ham and spam word lists. Prior to this, there was no way to unregister a message from a word list (without registering it in the other word list).
Bogofilter has four registration options: '-s', '-n', '-S', and '-N'. With the release of version 0.11 the meaning of '-S' and '-N' has been changed to allow unregistering messages from the word lists. Here's what the four options mean: '-s' registers a message as spam, '-n' registers it as ham, '-S' unregisters it from the spam word list, and '-N' unregisters it from the ham word list.
Prior to version 0.11, the '-S' option was used to move a message from the ham word list to the spam word list, i.e. there were two actions. Now with 0.11 each of the two actions is invoked by its own option. To get the same effect as the old '-S', you should use '-N -s' (or '-Ns' which means the same thing).
Similarly, the old '-N' option is now '-Sn' (or '-S -n').
MDA scripts typically use '-s' and '-n' and don't need to change. Other scripts which use '-S' and '-N' for fixing registration errors do need to be changed.
Bogofilter-0.14 introduced an additional exit code to support its third mail classification (of spam, ham, and unsure). Prior to 0.14, the exit codes were 0 for spam, 1 for ham, and 2 for error. They are now 0 for spam, 1 for ham, 2 for unsure, and 3 for error.
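For example, a small shell sketch that branches on the current exit codes (msg.txt is just a placeholder):

bogofilter < msg.txt
case $? in
    0) echo "spam"   ;;
    1) echo "ham"    ;;
    2) echo "unsure" ;;
    *) echo "error"  ;;
esac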
Prior to version 0.15.4, bogofilter added special prefixes to tokens from "Subject:", "From:", "To:", and "Return-Path:" header lines. As of version 0.15.4, "Received:" line tokens are also specially tagged and all other header line tokens are given a "head:" prefix. Since bogofilter hasn't previously seen "head:" tokens, these tokens don't contribute to either the ham or spam scores. This loss of information causes bogofilter to perform less well.
There are two ways to deal with this problem. The first is simply to retrain bogofilter with all the old ham and spam. This will create the needed "head:" entries and all will be fine.
The second is to use the new header-degen ("-H") option. When classifying a message, header-degen causes bogofilter to look up both "head:token" and "token" in the wordlist. The two ham and spam scores are combined to give a cumulative result for scoring. Registering messages will create "head:token" entries in the wordlist. After a month or so of using '-H', the wordlist should contain plenty of the new "head:" entries and using '-H' should no longer be necessary.