The magazine of the Melbourne PC User Group

Getting Rid Of SPAM — The Bayesian Approach
Roger Brown

Roger Brown reports on some tests he did recently with Bayesian SPAM filters - they work for him and they sound exciting


Right now SPAM, and how to get rid of it, seems to be the hottest topic of concern to everyone using the Internet. We all hate the stuff, but how can we deal with it without the cumbersome and annoying task of manually identifying and deleting messages every time we open the in-box?

Various filtering approaches have been used over the years but the stuff still arrives apace as spammers evade black lists, avoid concept filters and hijack other users with undaunted fervour.

But one approach that is showing signs of finally being able to deal with SPAM is the Bayesian filter. It is a statistical approach that comes closest to identifying SPAM the same way people do - by learning to associate certain words and phrases with SPAM, it can identify it with remarkable accuracy.

Who was Bayes - and what is Bayesian analysis?

A detailed discussion of Bayes and his contribution to statistical analysis may or may not be within the scope of this article, but it is well outside the competence of the writer. However, readers may note that Bayes was not a modern computer scientist - he was an 18th century English mathematician.

To quote briefly from http://www.bayesian.org/bayesian/bayes.html: "Bayes, Thomas (b.1702, London - d.1761, Tunbridge Wells, Kent), mathematician who first used probability inductively and established a mathematical basis for probability inference (a means of calculating, from the number of times an event has not occurred, the probability that it will occur in future trials)."
 



Figure 1. Part of PopFile's SPAM database


Bayesian SPAM filtering uses the statistical approach first promulgated by Bayes to calculate, from an analysis of words appearing in a given e-mail message, whether it is spam. This calculation is based on the occurrence of these words in other messages which have been identified as either SPAM or good.

See http://email.about.com/cs/bayesianfilters/a/bayesian_filter.htm for a more detailed explanation.
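As a rough illustration of the calculation described above, the following Python sketch scores a message from word counts. The word counts, the function name spam_probability, the equal starting odds and the add-one smoothing are all assumptions made for this example - it is not claimed that K9 or PopFile work in exactly this way.

```python
import math

# Hypothetical word counts gathered from previously classified mail.
spam_counts = {"viagra": 40, "free": 30, "meeting": 1}
good_counts = {"viagra": 0, "free": 5, "meeting": 50}


def spam_probability(words, spam_counts, good_counts):
    """Return the estimated probability that a message is SPAM.

    Uses Bayes' rule with equal priors, treats words as independent
    (the "naive" assumption) and applies add-one smoothing so that a
    word never seen in one category does not force a probability of zero.
    """
    vocab = set(spam_counts) | set(good_counts)
    spam_total = sum(spam_counts.values()) + len(vocab)
    good_total = sum(good_counts.values()) + len(vocab)

    # Work with logarithms to avoid underflow on long messages.
    log_spam = math.log(0.5)
    log_good = math.log(0.5)
    for word in words:
        if word not in vocab:
            continue  # ignore words never seen in training
        log_spam += math.log((spam_counts.get(word, 0) + 1) / spam_total)
        log_good += math.log((good_counts.get(word, 0) + 1) / good_total)

    # Convert the two log scores back into a probability of SPAM.
    return 1.0 / (1.0 + math.exp(log_good - log_spam))


print(spam_probability(["free", "viagra"], spam_counts, good_counts))
print(spam_probability(["meeting"], spam_counts, good_counts))
```

The first message scores close to 1 and the second close to 0. A message made up only of unknown words scores 0.5, meaning the sketch has no evidence either way.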

How?

How do filter programs of this type work? Bayesian filter programs work as local e-mail proxies - they take over the task of downloading your e-mail from the e-mail server. This is necessary because the Bayesian process requires that the entire e-mail be read for analysis. E-mail client programs (such as Outlook Express) need to be slightly reconfigured to download mail via the filter.

At first the filter does not know whether a given e-mail is SPAM or not (the program comes with no preset database). Initially the user needs to point out which e-mails are spam. But as soon as this is done the filter program starts to build up a database of words associated with SPAM and with good messages, and it quickly starts to identify such messages without user intervention and with remarkable accuracy. These databases are entirely local and are specific to the actual mail received by the individual user.

Over the initial "training" period the user will need to make corrections or reclassifications, and each time this is done the filter databases change. This means that as the filter "learns", it becomes increasingly accurate in identifying SPAM according to each user's concept of what constitutes SPAM.
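The training and correction steps can be pictured with a small Python sketch. The class WordDatabase and its methods are hypothetical names invented for this example, and the way the counts are moved on reclassification is an assumption rather than a description of either program tested.

```python
from collections import Counter


class WordDatabase:
    """Hypothetical per-user store of word counts for each category."""

    def __init__(self):
        self.counts = {"spam": Counter(), "good": Counter()}

    def train(self, words, category):
        """Add the words of a message to the chosen category."""
        self.counts[category].update(words)

    def reclassify(self, words, wrong, right):
        """Move a message's words from the wrong category to the right one."""
        self.counts[wrong].subtract(words)
        # Unary plus drops the zero or negative entries left by subtract().
        self.counts[wrong] = +self.counts[wrong]
        self.counts[right].update(words)


db = WordDatabase()
db.train(["cheap", "pills"], "good")               # the filter guessed wrongly
db.reclassify(["cheap", "pills"], "good", "spam")  # the user corrects it
print(db.counts["spam"]["pills"], db.counts["good"]["pills"])
```

After the correction the words count towards SPAM only, so the next message containing them is more likely to be tagged.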

The databases consist mainly of single words and word counts but, as can be seen from the screen shots with this article, portions of e-mail addresses and other header material can also be included. Commonly occurring words are normally excluded.
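A minimal Python sketch of how a message might be broken into database entries follows. The list of common words, the "from:" prefix for header fragments and the function name tokenize are all assumptions for illustration - the actual rules used by K9 and PopFile are not described in this article.

```python
import re
from collections import Counter

# Hypothetical list of commonly occurring words to leave out.
STOP_WORDS = {"the", "and", "a", "to", "of", "is", "in"}


def tokenize(sender, subject, body):
    """Split a message into lower-case tokens, skipping common words.

    The sender address is broken into fragments and given a "from:"
    prefix so that header material is counted separately from body text.
    """
    tokens = []
    for part in re.split(r"[@.]", sender.lower()):
        if part:
            tokens.append("from:" + part)
    for word in re.findall(r"[a-z0-9']+", (subject + " " + body).lower()):
        if word not in STOP_WORDS:
            tokens.append(word)
    return tokens


counts = Counter(tokenize("offers@example.com", "Free pills", "The best pills in town"))
print(counts)
```

Here "pills" is counted twice, "the" and "in" are dropped, and the sender address contributes three separate header tokens.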

Conventional blacklists and whitelists may be added by the user to supplement the action of the filter but these are not generally helpful - usually Bayesian filters are better when left to build their own databases unhindered.

When SPAM messages are identified, the filter does not delete them - they are tagged. This may be done by adding the tag "[Spam]" to the subject line or by adding an extra mail header.

This allows such messages to be readily identified by filters set up within the e-mail client program (e.g. within Outlook Express) - the user can decide whether such messages should be quarantined or deleted outright.
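The tagging step and the matching client-side rule might look something like the following Python sketch, which uses the standard email module. The header name X-Example-Spam and the function name tag_as_spam are made up for this example and are not the headers actually written by K9 or PopFile.

```python
from email import message_from_string

raw = "From: someone@example.com\nSubject: Cheap pills\n\nBuy now\n"


def tag_as_spam(raw_message, subject_tag="[Spam]", header_name="X-Example-Spam"):
    """Return the message text with the subject tagged and a header added.

    The header name is a hypothetical one chosen for this sketch.
    """
    msg = message_from_string(raw_message)
    subject = msg.get("Subject", "")
    del msg["Subject"]  # remove the old header before writing the new one
    msg["Subject"] = f"{subject_tag} {subject}".strip()
    msg[header_name] = "YES"
    return msg.as_string()


tagged = message_from_string(tag_as_spam(raw))
print(tagged["Subject"])

# A mail-client rule then only needs to test for the tag or the header.
is_spam = tagged["Subject"].startswith("[Spam]") or tagged["X-Example-Spam"] == "YES"
```

Because the message is only tagged, not removed, the decision to quarantine or delete stays with the rule in the mail client.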

As Bayesian filter programs usually retain a copy of processed messages for a short period, outright deletion of SPAM is perfectly practicable - in the event of a false positive, a copy of the message is still available.

What about some practical results?

I trialled two popular freeware Bayesian filters - K9 and PopFile. Each program was tested over a minimum of 6000 messages - a period of just under four weeks for each. All my mail from two POP accounts and one direct SMTP account was directed through the filter - on average around 250 messages per day, much of it list-generated. During the trial, SPAM content was somewhere between 7 and 10%.

K9 http://keir.net/k9.html

I started with this program because it is a native Windows application with a pleasant interface and a well thought out and intuitive layout. Clear set up and operating instructions are available from the Web site.

Getting started was simple and the initial process of classifying e-mail messages was quick and intuitive.

The results of the trial are shown in Figure 4. The left-hand column shows the full period of the trial. The right-hand column shows the last few days. Although the overall accuracy is recorded as 96.6% (lower than expected), it must also be noted that of 542 SPAM e-mails received, 207 were initially wrongly identified. That puts the actual SPAM detection rate at just under 62%.
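The detection rate quoted above follows from simple arithmetic, shown here as a short Python sketch. The function name detection_rate is just a label chosen for this example; the figures are the ones given in the text.

```python
def detection_rate(total_spam, missed):
    """Return the percentage of SPAM messages that were caught."""
    return 100.0 * (total_spam - missed) / total_spam


# Figures quoted for the K9 trial: 542 SPAM messages, 207 missed.
k9_rate = detection_rate(542, 207)
print(f"{k9_rate:.1f}%")  # prints 61.8%
```

The overall accuracy figure looks much healthier only because it is measured against all mail, most of which was not SPAM.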

Generally I felt that results for this program were disappointing - SPAM messages escaped detection even late in the trial period.


 



Figures 2 and 3. Example of K9 Filter in TheBat!

PopFile http://popfile.sourceforge.net/old_index.html

Initially I was a little reluctant to try PopFile because it is not a native Windows application but is instead a Perl script. Administration is via a Web interface. Installation is, however, quite simple as PopFile provides a Windows installer that looks after all the details without the user having to be concerned with technicalities. The Web interface is simple and quite well designed but, like most Web interfaces, it is slower than the native Windows interface provided by K9.

It became apparent immediately that PopFile reached good accuracy rates much faster than K9 did. Its overall accuracy rate is much better than K9's, as is the actual SPAM detection rate (91%) - only 43 out of 459 SPAM e-mails were missed, nearly all of these in the initial training period. Over the closing weeks of the trial, wrongly classified e-mails were a rarity.

A significant factor in this appeared to be PopFile's more efficient manner of building its databases. K9 builds its databases from all e-mails received - that results (as can be seen from the screen shot) in the database for SPAM being much smaller than that for good e-mails.

PopFile builds its database only when corrections are made. That makes the SPAM database much larger than that for good e-mails and appears to provide much quicker SPAM recognition.

It should be noted, however, that this apparent advantage might disappear if a larger percentage of SPAM were received.

Essentially, for much of the trial the only maintenance required was to briefly peruse the record of what PopFile had done - in case there were false positives (also rare). As mentioned earlier, any false positives found can still be read from the PopFile stored copy.

As a result PopFile is now detecting SPAM messages with as close to 100% accuracy as could reasonably be expected - certainly just the sort of performance I was hoping for.

Security

Any type of proxy can represent a security hazard if supplied in an open state or misconfigured. As an example of this, some readers will be aware that the otherwise very useful local proxy server AnalogX (used for Internet connection sharing) has in recent times attracted quite wide notoriety, including a report in The New York Times, as being one of the programs most used by spammers as a backdoor to send SPAM via other unsuspecting users. This came about because the program author chose to provide the AnalogX software completely open to the Internet by default.

With this in mind it was good to see that both the programs tested were supplied in a secure state. K9 appears to be hardwired to only accept connections from the computer on which it is installed - PopFile can accept connections from other computers but its default configuration also limits connections to localhost.
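The difference between a proxy limited to localhost and one left open can be seen in how its listening socket is bound. The Python sketch below is a generic illustration using the standard socket module - the function name open_listener is invented for this example and this is not the code used by either program.

```python
import socket


def open_listener(host="127.0.0.1", port=0):
    """Open a listening TCP socket bound to the given address.

    Binding to 127.0.0.1 means only programs on the same computer can
    connect; binding to 0.0.0.0 would accept connections from any machine.
    Port 0 asks the operating system for any free port.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind((host, port))
    listener.listen(1)
    return listener


listener = open_listener()
address, port = listener.getsockname()
print(address)
listener.close()
```

A proxy that binds to 0.0.0.0 by default is reachable from the network as soon as it starts, which is the kind of open state described above.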

Potential users should note that the only connection either of the programs makes to the Internet is to download mail. No use is made of any external server for processing purposes - all mail is entirely secure within the user's system.

I wanted to use both programs across a home network (my mail client is on my laptop) and that was only possible by use of a port mapper to simulate a localhost connection.



Figure 4. The K9 statistics screen.



Figure 5. PopFile's statistics screen

Conclusion

This is a promising technique as it cannot be readily fooled by existing spammers' tricks such as innocuous subject lines. Whilst I personally found one of the two programs tested much better than the other, quite different results have been reported elsewhere. Much could depend on the nature of mail received. Either program is worth trying. My preference is PopFile - for me it has certainly got rid of the stuff.

Reprinted from the August 2003 issue of PC Update, the magazine of Melbourne PC User Group, Australia
