Bogofilter FAQ

Official Versions: In English or French
Maintainer: David Relson <relson@osagesoftware.com>

This document is intended to answer frequently asked questions about bogofilter.


What is bogofilter?

Bogofilter is a fast Bayesian spam filter along the lines suggested by Paul Graham in his article "A Plan For Spam". Bogofilter uses Gary Robinson's geometric-mean algorithm with the Fisher's method modification to classify email as spam or non-spam.

The bogofilter home page at SourceForge is the central clearinghouse for bogofilter resources.

Bogofilter was started by Eric S. Raymond on August 19, 2002. It gained popularity in September 2002, and a number of other authors have started to contribute to the project.

The NEWS file describes bogofilter's version history.


Bogo-what?

Bogofilter is some kind of a bogometer or bogon filter, i.e., it tries to identify bogus mail by measuring the bogosity.


How does bogofilter work?

See the man page's THEORY OF OPERATION section for an introduction. The main source for understanding this is Gary Robinson's Linux Journal article "A Statistical Approach to the Spam Problem".

After you read all this you might ask some questions. The first could be "Is bogofilter really a Bayesian spam filter?" Bogofilter is based on Bayes' theorem, uses it in the initial calculations, and applies other statistical methods later. Without doubt it is a statistical spam filter with a Bayesian flavor.

Other questions might concern the basic assumptions of Bayes' theory. Two short answers are: "No, they are not satisfied" and "We don't care as long as it works". A longer answer will mention that the basic assumption that "an e-mail is a random collection of words, each independent of the others" is violated. There are several places where practice doesn't follow theory. Some are always present, and some depend on the way you use bogofilter:

As the man page explains, bogofilter tries to understand how badly the null hypothesis fails. Some people argue that "those departures from reality usually work in our favor" (from Gary's article). Some argue that, even then, we should not violate the assumptions too much. Nobody really knows. Just keep in mind that problems might occur if you push too hard. The key to bogofilter's approach is: What matters most is simply what works in the real world.

Now that you have been warned, have fun and use bogofilter as suits you best.


Mailing Lists

There are currently four mailing lists for bogofilter:

List Address                          Links                  Description
bogofilter-announce@aotto.com         [subscribe] [archive]  An announcement-only list where new versions are announced.
bogofilter@aotto.com                  [subscribe] [archive]  A discussion list where any conversation about bogofilter may take place.
bogofilter-dev@aotto.com              [subscribe] [archive]  A list for sharing patches, development, and technical discussions.
bogofilter-cvs@lists.sourceforge.net  [subscribe] [archive]  Mailing list for announcing code changes to the CVS archive.

What are "training on error" and "training to exhaustion"?

"Training on error" involves scanning a corpus of known spam and non-spam messages; only those that are misclassified, or classed as unsure, get registered in the training database. It's been found that sampling just messages prone to misclassification is an effective way to train; if you train bogofilter on the hard messages, it learns to handle obvious spam and non-spam too.

This method can be enhanced by using a "security margin". By increasing the spam cutoff value and decreasing the ham cutoff value, messages which are close to a cutoff will also be used for training. Using security margins improves results when training on error. In general, greater margins help more (although margins that are too large are not optimal either). As a rule of thumb, spam cutoff +/- 0.3 gives good results. For tristate mode, you might try the middle of the unsure interval +/- 0.3 for training.
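
The margin rule can be sketched as a small shell helper. This is only an illustration: the function name needs_training, the example cutoff of 0.6, and the exact comparison directions are assumptions of this sketch, not bogofilter options.

```shell
# needs_training CLASS SCORE CUTOFF MARGIN
#   CLASS  = known class of the message: "spam" or "ham"
#   SCORE  = score bogofilter gave the message (0..1)
#   CUTOFF = spam cutoff used in production (hypothetical value in the examples)
#   MARGIN = security margin, e.g. 0.3
# Exit status 0 means "register this message", 1 means "skip it".
needs_training() {
    awk -v class="$1" -v score="$2" -v cutoff="$3" -v margin="$4" 'BEGIN {
        if (class == "spam") {
            # Known spam is trained unless it clears the raised cutoff.
            exit !(score < cutoff + margin)
        } else {
            # Known ham is trained unless it stays below the lowered cutoff.
            exit !(score > cutoff - margin)
        }
    }'
}
```

With a cutoff of 0.6 and a margin of 0.3, "needs_training spam 0.75 0.6 0.3" succeeds: the message would be classified correctly in production, but it is close enough to the cutoff to be used for training.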

Repeating training on error on the same message corpus can improve accuracy. The idea is that messages which were rated correctly at first might be rated wrongly after some more training, and a further pass corrects this.

"Training to exhaustion" is repeating training on error, with the same message corpus, until no errors remain. This method can also be improved with security margins. See Gary Robinson's Rants on this topic for more details.

Note: bogominitrain.pl has a -f option to do "training to exhaustion". Using -fn avoids repeated training for each message.
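
For illustration only, here is a minimal shell sketch of training to exhaustion without security margins. It is not how bogominitrain.pl is implemented. It assumes one message per file in two directories, and it relies on bogofilter's exit codes as documented in the man page (0 for spam, 1 for non-spam, other values for unsure or errors). The function name and the pass limit are made up for this sketch.

```shell
# train_to_exhaustion HAM_DIR SPAM_DIR [MAX_PASSES]
# Repeats training on error until a full pass produces no errors.
train_to_exhaustion() {
    ham_dir=$1
    spam_dir=$2
    max_passes=${3:-10}
    pass=0
    while [ "$pass" -lt "$max_passes" ]; do
        pass=$((pass + 1))
        errors=0
        for msg in "$ham_dir"/*; do
            [ -f "$msg" ] || continue
            rc=0
            bogofilter < "$msg" > /dev/null || rc=$?
            # Anything other than "non-spam" (1) counts as an error for ham.
            if [ "$rc" -ne 1 ]; then
                bogofilter -n < "$msg"
                errors=$((errors + 1))
            fi
        done
        for msg in "$spam_dir"/*; do
            [ -f "$msg" ] || continue
            rc=0
            bogofilter < "$msg" > /dev/null || rc=$?
            # Anything other than "spam" (0) counts as an error for spam.
            if [ "$rc" -ne 0 ]; then
                bogofilter -s < "$msg"
                errors=$((errors + 1))
            fi
        done
        echo "pass $pass: $errors message(s) trained"
        if [ "$errors" -eq 0 ]; then
            return 0
        fi
    done
    # Gave up before reaching a pass without errors.
    return 1
}
```

Note that this sketch treats a bogofilter failure (for example an I/O error) like a misclassification, so check that bogofilter itself works before running a loop like this.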


How do I start my bogofilter training?

To classify messages as ham (non-spam) or spam, bogofilter needs to learn from your mail. To start with, it is best to have collections (as large as possible) of messages you know for sure are ham or spam. (Errors here will cause problems later, so try hard ;-) Warning: Only use your mail; using other collections (like a spam collection found on the web) might cause bogofilter to draw the wrong conclusions. After all, you want it to understand your mail.

Once you have the spam and ham collections, you have basically four choices. In all cases it works better if your training base (the above collections) is bigger, rather than smaller. The smaller your training collection is, the higher the number of errors bogofilter will make in production. Let's assume your collection is two mbox files: ham.mbox and spam.mbox.

Note: Bogofilter's contrib directory includes two scripts that both use a train-on-error technique. This technique scores each message and adds to the database only those messages that were scored incorrectly (messages scored as uncertain, ham scored as spam, or spam scored as ham). The goal is to build a database of those words needed to correctly classify messages. The resulting database is smaller than the one built using full training.

Comparing these methods

It is important to understand the consequences of the methods just described. Doing full training as in methods 1 and 4 produces a larger database than does training with methods 2 or 3. If your database size needs to be small (for example due to quota limitations), use methods 2 or 3.

Full training with method 1 is fastest. Training on error (as in methods 2, 3 and 4) is effective, but the initial training takes longer.


How can I keep the scoring accuracy high?

Bogofilter will make mistakes once in a while, so ongoing training is important. There are two main methodologies for doing this. First, you can train with every incoming message (using the -u option). Second, you can train on error only.

Since you might want to rebuild your database at some point, for example when a major new feature is implemented in bogofilter, it can be very useful to update your training collection continuously.

Bogofilter always does the best it can with the information available to it. However, it will make mistakes, i.e., classify ham as spam (false positives) or spam as ham (false negatives). To reduce the likelihood of repeating the mistake, it is necessary to train bogofilter with the errant message. If a message is incorrectly classified as spam, use switch -n to train with it as ham. Use switch -s to train with a spam message.

Bogofilter has a -u switch that automatically updates the wordlists after scoring each message. As bogofilter sometimes misclassifies a message, monitoring is necessary to correct any mistakes. Corrections can be done using -Sn to change a message's classification from spam to non-spam and -Ns to change it from non-spam to spam.

Correcting a misclassified message may affect classification for other messages. The smaller your database is, the higher the likelihood that a training error will cause a misclassification.

Using a method like #2 or #3 (above) can compensate for this effect. Repeat the training with your complete training collection (including all the new messages added since the earlier training). This will add messages to the database which show that adverse effect on both sides until you have a new equilibrium.

An alternative strategy, based on method 4 in the previous section, is the following: Periodically take blocks of messages and use the scripts in method 4 above to classify them. Then manually review the good, bad and unsure files, correct any errors, and split the unsures into spam and non-spam. Until you have accumulated some 10,000 spam and 10,000 non-spam in your training database, train with the good, the bad, and the separated errors and unsures; thereafter, train with only the separated errors and unsures, discarding the messages that bogofilter already classifies correctly.

Note that you should periodically run:

	bogoutil -d wordlist.db | bogoutil -l wordlist.db.new
	mv wordlist.db wordlist.db.prv
	mv wordlist.db.new wordlist.db

Run the same commands for spamlist.db and goodlist.db if you use bogofilter with separate spam and ham wordlists. This compacts the database so that it occupies the minimum of disk space.
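
If you want to script this step, a slightly more defensive variant goes through a text dump, so that a failed dump or load is noticed before the old database is moved away. The function name compact_wordlist and the temporary file name are assumptions of this sketch; only "bogoutil -d" and "bogoutil -l" are used.

```shell
# compact_wordlist DB_FILE
# Rebuilds DB_FILE from a dump and keeps the old file as DB_FILE.prv.
compact_wordlist() {
    db=$1
    dump=$db.txt
    # Dump the tokens to a text file; stop if the dump fails.
    bogoutil -d "$db" > "$dump" || return 1
    # Load the dump into a fresh database; stop if the load fails.
    bogoutil -l "$db.new" < "$dump" || return 1
    # Keep the old database as .prv and move the new one into place.
    mv "$db" "$db.prv" || return 1
    mv "$db.new" "$db" || return 1
    rm -f "$dump"
}
```

For example, "compact_wordlist wordlist.db" does the same job as the three commands above. Make sure nothing is updating the wordlist while it runs.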


How can I use SpamAssassin to train Bogofilter?

If you have a working SpamAssassin installation (or care to create one), you can use its return codes to train bogofilter. The easiest way is to create a script for your MDA that runs SpamAssassin, tests the spam/non-spam return code, and runs bogofilter to register the message as spam (or non-spam). The sample procmail recipe below shows one way to do this:

	BOGOFILTER     = "/usr/bin/bogofilter"
	BOGOFILTER_DIR = "training"
	SPAMASSASSIN  = "/usr/bin/spamassassin"

	:0 HBc
	* ? $SPAMASSASSIN -e
	#spam yields non-zero
	#non-spam yields zero
	| $BOGOFILTER -n -d $BOGOFILTER_DIR
	#else (E)
	:0Ec
	| $BOGOFILTER -s -d $BOGOFILTER_DIR

	:0fw
	| $BOGOFILTER -p -e

	:0:
	* ^X-Bogosity:.Yes
	spam

	:0:
	* ^X-Bogosity:.No
	non-spam
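
If your MDA is not procmail, the training step of the recipe can be written as a plain shell function. This is a sketch under a few assumptions: the message is available as a file, the function name train_with_spamassassin is made up, and any non-zero exit from "spamassassin -e" is taken to mean spam, so a SpamAssassin failure would be registered as spam too.

```shell
# train_with_spamassassin MESSAGE_FILE
# Registers the message with bogofilter according to SpamAssassin's verdict.
train_with_spamassassin() {
    msg=$1
    dir=${BOGOFILTER_DIR:-training}
    # With -e, spamassassin exits non-zero for spam and zero for non-spam.
    if spamassassin -e < "$msg" > /dev/null; then
        bogofilter -n -d "$dir" < "$msg"
    else
        bogofilter -s -d "$dir" < "$msg"
    fi
}
```

Set BOGOFILTER_DIR to the directory holding your wordlist before calling the function; it defaults to "training" as in the recipe above.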

What mailbox (file) formats does bogofilter understand?

Bogofilter understands the traditional Unix mbox format as well as the Maildir and MH formats. Note, though, that bogofilter does not descend into subfolders; you will have to list subfolders of MH or Maildir++ folders explicitly, giving the full path to each subfolder.

For unsupported formats, you will have to convert the mailbox to a format bogofilter understands. Mbox is often convenient because it can be piped into bogofilter.

The following list may help you:

UW-IMAP/PINE mbx format
To convert to mbox: mailtool copy /full/path/to/mail.mbx '#driver.unix//full/path/to/mbox'

What does bogofilter's verbose output mean?

Bogofilter can be instructed to display information on the scoring of a message by running it with flags "-v", "-vv", "-vvv", or "-R".


What can I do about Asian spam?

Many people get unsolicited email using Asian language charsets. Since they don't know the languages and don't know people there, they assume it's spam.

The good news is that bogofilter does detect them quite successfully. The bad news is that this can be expensive. You have basically two choices:


How do I manually query the database?

To find the spam and ham counts for a token (word) use bogoutil's '-w' option. For example, "bogoutil -w $BOGOFILTER_DIR example.com" gives the good and bad counts for "example.com".

If you want the spam score in addition to the spam and ham counts for a token (word) use bogoutil's '-p' option. For example, "bogoutil -p $BOGOFILTER_DIR example.com" gives the good and bad counts and the spam score for "example.com".

To find out how many messages are in your wordlists query the special token .MSG_COUNT, i.e., run command "bogoutil -w $BOGOFILTER_DIR .MSG_COUNT" to see the counts for the spam and ham word lists.

To tell how many tokens are in your wordlists pipe the output of bogoutil's dump command to command "wc", i.e., use "bogoutil -d $BOGOFILTER_DIR/wordlist.db | wc -l" to display the count. (If you've got spamlist.db and goodlist.db, run the command for each of them).


Why am I getting DB_PAGE_NOTFOUND messages?

You have a problem with your BerkeleyDB database. There are two likely causes: either you've hit a max size limit or the database is corrupt.

Some mail transfer agents, such as Postfix, impose file size limits. When bogofilter's database reaches that limit, write problems will occur.

To show the database size use:

	ls -lh $BOGOFILTER_DIR/wordlist.db

To show the postfix setting:

	postconf | grep mailbox_size_limit

To set the limit to 73MB (or whatever size is right for you):

	postconf -e mailbox_size_limit=73000000
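
To check whether the wordlist has already reached such a limit, the two values can be compared in a script. The function name db_hits_limit is made up for this sketch, and it assumes that a limit of 0 means "no limit", which is how Postfix documents mailbox_size_limit.

```shell
# db_hits_limit FILE LIMIT_BYTES
# Exit status 0 if FILE is at least LIMIT_BYTES long, 1 if it is smaller
# or the limit is 0 (no limit), 2 if FILE does not exist.
db_hits_limit() {
    [ -f "$1" ] || return 2
    # Some wc implementations pad the number with spaces; strip them.
    size=$(wc -c < "$1" | tr -d ' ')
    limit=$2
    [ "$limit" -gt 0 ] && [ "$size" -ge "$limit" ]
}
```

With Postfix the limit can be read with "postconf -h mailbox_size_limit", so a check might look like: db_hits_limit "$BOGOFILTER_DIR/wordlist.db" "$(postconf -h mailbox_size_limit)" && echo "wordlist has reached the size limit".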

If you think your database may be corrupt, read the "How can I tell if my word lists are corrupted?" FAQ entry.


How can I tell if my word lists are corrupted?

If you think your word lists are hosed, you can see what BerkeleyDB thinks by running:

	db_verify wordlist.db

If there is a problem, you may be able to fix it with:

	db_recover wordlist.db

Alternatively, you may be able to recover some (or all) of the tokens and their counts with the following commands:

	bogoutil -d wordlist.db | bogoutil -l wordlist.db.new

or with

	db_dump -r wordlist.db | db_load wordlist.new

You can also use a text file instead of a pipe, as in:

	bogoutil -d wordlist.db > wordlist.txt
	bogoutil -l wordlist.db.new < wordlist.txt

If you've got two wordlists (spamlist.db and goodlist.db), run the above commands for each of them.


How do I upgrade from separate word databases to the new combined wordlist format?

Run script bogoupgrade. For more info, run "bogoupgrade -h" to see its help message or read its man page.


Can I share word lists over NFS?

If you're just reading from them, there are no problems. When you're updating them, you need to use the correct file locking to avoid data corruption. When you compile bogofilter, you will need to verify that the configure script has set "#define HAVE_FCNTL 1" in your config.h file. Popular UNIX operating systems will all support this. If you are running an unusual or older operating system, make sure it supports fcntl(). If "#define HAVE_FCNTL 1" is set, which indicates fcntl() is supported on your system, then comment out "#define HAVE_FLOCK 1" so that the locking system uses fcntl() locking instead of the default of flock() locking. If your system does not support fcntl(), then you will not be able to share word list files over NFS without the risk of data corruption.
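
A quick way to see which locking method a given config.h selects is to look for the active #define lines. The helper below is a sketch based only on the description above (flock() wins while HAVE_FLOCK is defined, otherwise fcntl() is used if HAVE_FCNTL is defined); the function names are made up, and the real selection logic lives in bogofilter's source.

```shell
# has_define NAME CONFIG_H
# Succeeds if CONFIG_H contains an uncommented "#define NAME 1" line.
has_define() {
    grep -Eq "^[[:space:]]*#[[:space:]]*define[[:space:]]+$1[[:space:]]+1" "$2"
}

# locking_method CONFIG_H
# Prints "flock", "fcntl", or "none".
locking_method() {
    if has_define HAVE_FLOCK "$1"; then
        echo flock
    elif has_define HAVE_FCNTL "$1"; then
        echo fcntl
    else
        echo none
    fi
}
```

For sharing word lists over NFS, "locking_method config.h" should print "fcntl" before you build.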

Next, make sure you have NFS set up properly, with "lockd" running. Refer to your NFS documentation for more information about running "lockd" or "rpc.lockd". Most operating systems with NFS turn this on by default.


Why does bogofilter give return codes like 0 and 256 when it's run from inside a program?

Likely the return codes are being reformatted by waitpid(2). In C use WEXITSTATUS(status) in sys/wait.h, or a comparable macro, to get the correct value. In Perl you can just use 'system("bogofilter $input") >> 8'. If you want more info, run "man waitpid".
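
The arithmetic behind this can be tried out in the shell. The helper below mimics WEXITSTATUS for a process that exited normally, assuming the conventional layout in which the exit code sits in bits 8 to 15 of the raw status (this layout is common on Unix systems but is not something every platform promises). The function name is made up for this sketch.

```shell
# exit_code_from_wait_status STATUS
# Prints the exit code contained in a raw wait status such as 256.
exit_code_from_wait_status() {
    echo $(( ($1 >> 8) & 255 ))
}
```

For example, "exit_code_from_wait_status 256" prints 1. In a shell script no such decoding is needed, because "$?" already holds the decoded exit code.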


Now that I've upgraded why are my scripts broken?

Over time bogofilter accumulated a large number of functions. Some of those were discontinued or changed. Please read the NEWS file for details.


Now that I've upgraded why is bogofilter working less well?

The lexer, i.e., that part of bogofilter which extracts tokens from a message, evolves. This results in different readings of messages with the consequence that some tokens in the database can no longer be used.

If you encounter this problem, you are strongly advised to rebuild your database. If this is not an option for you, you might want to use version 0.15.13 and read the documentation which comes with it for how to migrate your database.


How can I delete all the spam (or non-spam) tokens?

Bogoutil lets you dump a wordlist and load the tokens into a new wordlist. With the added use of awk and grep, counts can be zeroed and tokens with zero counts for both spam and non-spam can be deleted.

The following commands will delete the tokens from spam messages:

	bogoutil -d wordlist.db | \
	awk '{print $1 " " $2 " 0"}' | grep -v " 0 0" | \
	bogoutil -l wordlist.new.db

The following commands will delete the tokens from non-spam messages:

	bogoutil -d wordlist.db | \
	awk '{print $1 " 0 " $3}' | grep -v " 0 0" | \
	bogoutil -l wordlist.new.db

How do I get bogofilter working on Solaris, BSD, etc?

If you don't already have a v3.0+ version of BerkeleyDB, then download it, unpack it, and do these commands in the db directory:

	$ cd build_unix
	$ sh ../dist/configure
	$ make
	# make install

Next, download a portable version of bogofilter.

On Solaris

Unpack it, and then do:

	$ ./configure --with-db=/usr/local/BerkeleyDB.4.1
	$ make
	# make install-strip

You will either want to put a symlink to libdb.so in /usr/lib, or use a modified LD_LIBRARY_PATH environment variable before you start bogofilter.

	$ LD_LIBRARY_PATH=/usr/lib:/usr/local/lib:/usr/local/BerkeleyDB.4.1
	$ export LD_LIBRARY_PATH

Note that some make versions shipped with Solaris break when you try to build bogofilter outside of its source directory. Either build in the source directory (as suggested above) or use GNU make (gmake).

On FreeBSD

The FreeBSD ports and packages carry very recent versions of bogofilter. This approach uses the highly recommended portupgrade and cvsup software packages. To install these two fine pieces, type (you need to do this only once):

	# pkg_add -r portupgrade cvsup

To install or upgrade bogofilter, just upgrade your portstree using cvsup, then type:

	# portupgrade -N bogofilter

On HP-UX

See the file doc/programmer/README.hp-ux in the source distribution.


Can I use the make command on my operating system?

Bogofilter has been successfully built on many operating systems using GNU make and the native make commands. However, bogofilter's Makefile doesn't work with some make commands.

GNU make is recommended for building bogofilter because we know it works. We cannot support less capable make commands. If your non-GNU make command can successfully build bogofilter, that's great. If you encounter problems, the right thing to do is install GNU make. If your non-GNU make can't build bogofilter, we're sorry but you're on your own.