The magazine of the Melbourne PC User Group
Getting Rid Of SPAM — The Bayesian Approach
Roger Brown
Roger Brown reports on some tests he did recently with Bayesian SPAM filters -
they work for him and they sound exciting
Just now, SPAM and how to get rid of it seems to be the hottest topic of
concern to everyone using the Internet. We all hate the stuff, but how can we
deal with it without the cumbersome and annoying task of manually identifying
and deleting messages every time we open that in-box?
Various filtering approaches have been used over the years but the stuff still
arrives apace as spammers evade black lists, avoid concept filters and hijack
other users with undaunted fervour.
But one approach that is showing signs of finally being able to deal with spam
is the Bayesian filter. It's a statistical approach that comes closest to
identifying SPAM the same way people do - by learning to associate certain words
and phrases with SPAM so as to identify it with remarkable accuracy.
Who was Bayes - and what is Bayesian analysis?
While a detailed discussion of Bayes and his contribution to statistical
analysis may or may not be within the scope of this article, it is well outside
the competence of the writer. However, readers may note that Bayes is not a
modern computer scientist - he was an 18th century English mathematician.
To briefly quote http://www.bayesian.org/bayesian/bayes.html: Bayes, Thomas (b.1702, London - d.1761, Tunbridge Wells, Kent), mathematician who first used
probability inductively and established a mathematical basis for probability
inference (a means of calculating, from the number of times an event has not
occurred, the probability that it will occur in future trials).
Figure 1. Part of PopFile's SPAM database
Bayesian SPAM filtering uses the statistical approach first promulgated by Bayes
to calculate, from an analysis of words appearing in a given e-mail message,
whether it is spam. This calculation is based on the occurrence of these words
in other messages which have been identified as either SPAM or good.
See http://email.about.com/cs/bayesianfilters/a/bayesian_filter.htm for a more
detailed explanation.
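For readers who like to see the arithmetic, Bayes' rule for a single word w can be written as below. This is the textbook form only; the article does not say exactly which formula K9 or PopFile uses, and the numbers in the worked line are invented for illustration (a word seen in 40% of spam and 1% of good mail, with 10% of all mail being spam).

```latex
P(\text{spam} \mid w) =
  \frac{P(w \mid \text{spam})\,P(\text{spam})}
       {P(w \mid \text{spam})\,P(\text{spam}) + P(w \mid \text{good})\,P(\text{good})}

% Worked example with the assumed figures:
P(\text{spam} \mid w) = \frac{0.40 \times 0.10}{0.40 \times 0.10 + 0.01 \times 0.90}
                      = \frac{0.040}{0.049} \approx 0.82
```

So even when only one message in ten is spam, a single strongly spam-associated word can push the probability above 80%. A filter combines this kind of evidence across many words in the message.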
How?
How do filter programs of this type work? Bayesian filter programs work as local
e-mail proxies - they take over the task of downloading your e-mail from the
e-mail server. This is necessary because the Bayesian process requires that the
entire e-mail be read for analysis. E-mail client programs (such as Outlook
Express) need to be slightly reconfigured to download mail via the filter.
At first the filter does not know whether a given e-mail is SPAM or not (the
program comes with no preset database). Initially the user needs to point out
which e-mails are spam. But as soon as this is done the filter program starts to
build up a database of words associated with SPAM and with good messages, and it
quickly starts to identify such messages without user intervention and with
remarkable accuracy. These databases are entirely local and are specific to the
actual mail received by the individual user.
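The training idea described above can be sketched in a few lines of Python. This is a toy illustration, not the code used by K9 or PopFile: the class name WordCountFilter, the two bucket names and the add-one smoothing are all assumptions made for the example.

```python
import math
from collections import Counter


class WordCountFilter:
    """Toy two-bucket Bayesian filter built from per-bucket word counts."""

    def __init__(self):
        # Starts empty, like a filter shipped with no preset database.
        self.word_counts = {"spam": Counter(), "good": Counter()}
        self.message_counts = {"spam": 0, "good": 0}

    def train(self, words, bucket):
        """Record one message the user has labelled 'spam' or 'good'."""
        self.word_counts[bucket].update(words)
        self.message_counts[bucket] += 1

    def classify(self, words):
        """Return the bucket with the higher log-probability score."""
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["good"])
        total_messages = sum(self.message_counts.values())
        scores = {}
        for bucket, counts in self.word_counts.items():
            # Prior: share of training messages in this bucket (add-one smoothed).
            score = math.log((self.message_counts[bucket] + 1) / (total_messages + 2))
            bucket_total = sum(counts.values())
            for word in words:
                # Add-one smoothing so an unseen word never produces log(0).
                score += math.log(
                    (counts[word] + 1) / (bucket_total + len(vocabulary) + 1)
                )
            scores[bucket] = score
        return max(scores, key=scores.get)


# Usage: two hand-labelled messages are enough to see the effect.
spam_filter = WordCountFilter()
spam_filter.train(["cheap", "pills", "offer"], "spam")
spam_filter.train(["meeting", "agenda", "tuesday"], "good")
print(spam_filter.classify(["cheap", "offer", "today"]))  # prints "spam"
print(spam_filter.classify(["meeting", "tuesday"]))       # prints "good"
```

Logarithms are used so that multiplying many small probabilities does not underflow; the bucket with the larger sum wins.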
Over the initial "training" period the user will need to make corrections or
reclassifications and each time this is done, changes to the filter databases
result. This means that as the filter "learns", it becomes increasingly
accurate in identifying SPAM according to each user's concept of what
constitutes spam.
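What a correction might do to the databases can be sketched as follows. This assumes a filter that has already counted the message under the wrong heading; the function name reclassify and the dictionary layout are hypothetical, and neither program's actual bookkeeping is described here.

```python
from collections import Counter


def reclassify(word_counts, words, wrong_bucket, right_bucket):
    """Move one message's words from the bucket it was filed under to the correct one."""
    word_counts[wrong_bucket].subtract(words)
    # Drop entries that fell to zero or below so the database stays tidy.
    for word in [w for w, n in word_counts[wrong_bucket].items() if n <= 0]:
        del word_counts[wrong_bucket][word]
    word_counts[right_bucket].update(words)


# Usage: a spam message that was wrongly counted as good mail.
counts = {"spam": Counter(), "good": Counter(["cheap", "pills", "hello"])}
reclassify(counts, ["cheap", "pills"], "good", "spam")
```

After the call, "cheap" and "pills" count towards SPAM only, which is why each correction makes the next similar message easier to catch.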
The databases consist mainly of single words and word counts but as will be
noted from the screen shots with this article, portions of e-mail addresses and
other header material can also be included. Commonly occurring words are
normally excluded.
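As a rough illustration of how a message might be broken into database entries, the Python sketch below pulls words from the body, tags words taken from two headers, and skips a handful of common words. The stop-word list, the choice of headers and the "header:word" token format are assumptions for the example, not the behaviour of either program tested.

```python
import re
from email import message_from_string

# Illustrative stop-word list only; a real filter would ship its own.
STOP_WORDS = {"the", "and", "to", "of", "a", "in", "is", "for", "you", "it"}


def extract_tokens(raw_message):
    """Return header tokens (prefixed with the header name) followed by body words."""
    msg = message_from_string(raw_message)
    tokens = []
    for header in ("From", "Subject"):
        value = msg.get(header, "")
        # Splitting on non-alphanumerics breaks an address into its portions.
        for part in re.findall(r"[a-z0-9]+", value.lower()):
            tokens.append(f"{header.lower()}:{part}")
    # Keep the sketch simple: only single-part plain-text bodies are read.
    body = "" if msg.is_multipart() else msg.get_payload()
    for word in re.findall(r"[a-z]+", body.lower()):
        if word not in STOP_WORDS:
            tokens.append(word)
    return tokens


raw = "From: offers@example.com\nSubject: Cheap pills\n\nBuy the cheap pills for you\n"
print(extract_tokens(raw))
```

Prefixing header tokens keeps "subject:cheap" distinct from "cheap" in the body, so the two can carry different weights.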
Conventional blacklists and whitelists may be added by the user to supplement
the action of the filter but these are not generally helpful - usually Bayesian
filters are better when left to build their own databases unhindered.
When SPAM messages are identified, the filter does not delete them - they are
tagged. This may be by adding the tag "[Spam]" to the subject line or by adding
an additional mail header.
This allows such messages to be readily identified by filters set up within the
e-mail client program (e.g. within Outlook Express) - the user can decide whether
such messages should be quarantined or deleted outright.
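The two tagging styles can be sketched in Python as below. The function name tag_message and the header name X-Example-Classification are made up for the illustration; the programs tested use their own header names, which are not given in this article.

```python
from email import message_from_string


def tag_message(raw_message, is_spam, use_subject_tag=True):
    """Return the message text, tagged if it was classified as SPAM."""
    msg = message_from_string(raw_message)
    if not is_spam:
        return msg.as_string()
    if use_subject_tag:
        subject = msg.get("Subject", "")
        del msg["Subject"]  # no error if the header is absent
        msg["Subject"] = f"[Spam] {subject}"
    else:
        # Hypothetical header name, chosen for this sketch only.
        msg["X-Example-Classification"] = "spam"
    return msg.as_string()


raw = "From: offers@example.com\nSubject: Cheap pills\n\nBuy now\n"
print(tag_message(raw, is_spam=True))
```

Either way the message itself is delivered untouched, and a simple rule in the mail client keyed on the subject tag or the extra header decides what happens next.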
As Bayesian filter programs usually retain a copy of processed messages for a
short period, outright deletion of SPAM is perfectly practicable: in the event
of a false positive, a copy of the message is still available.
What about some practical results?
I trialled two popular freeware Bayesian filters - K9 and PopFile. Each program
was tested over a minimum of 6000 messages - a period of just under four weeks
for each. All my mail from two POP accounts and one direct SMTP account was
directed through the filter - on average around 250 messages per day, much of it
list generated. During the trial, SPAM content was somewhere between 7 and 10%.
K9 http://keir.net/k9.html
I started with this program because it is a native Windows application with a
pleasant interface and a well thought out and intuitive layout. Clear set up and
operating instructions are available from the Web site.
Getting started was simple and the initial process of classifying e-mail
messages was quick and intuitive.
The results of the trial are shown in Figure 4. The left hand column
shows the full period of the trial. The right hand column shows the last few
days. Although the overall accuracy is recorded as 96.6% (lower than expected),
it must also be noted that of 542 SPAM e-mails received, 207 were initially
wrongly identified. That puts the actual SPAM detection rate at just under 62%.
Generally I felt that results for this program were disappointing - SPAM
messages escaped detection even late in the trial period.

Figures 2 and 3. Example of K9 Filter in TheBat!
PopFile http://popfile.sourceforge.net/old_index.html
Initially I was a little reluctant to try PopFile because it is not a native
Windows application but is instead a Perl script. Administration is via a Web
interface. Installation is however quite simple as PopFile provides a Windows
installer that looks after all the details without the user having to be
concerned with technicalities. The Web interface is simple and quite well
designed but like most Web interfaces it is slower than the native Windows
interface provided by K9.
Immediately it became apparent that PopFile was much faster to produce good
accuracy rates than was K9. The overall accuracy rate is much better than K9 as
is the actual SPAM detection rate (91%) - only 43 out of 459 SPAM e-mails were
missed, nearly all of these in the initial training period. Over the closing
weeks of the trial, wrongly classified e-mails were a rarity.
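The SPAM detection rates quoted for the two programs follow directly from the counts given in the text. A small Python check (the function name detection_rate is just a label for this example, and the 207 wrongly identified K9 messages are treated as missed SPAM):

```python
def detection_rate(spam_received, spam_missed):
    """Percentage of SPAM messages that the filter caught."""
    return 100.0 * (spam_received - spam_missed) / spam_received


k9_rate = detection_rate(542, 207)      # about 61.8, i.e. "just under 62%"
popfile_rate = detection_rate(459, 43)  # about 90.6, rounded to 91% in the text
print(f"K9: {k9_rate:.1f}%  PopFile: {popfile_rate:.1f}%")
```

Because SPAM was only 7 to 10% of the mail, a filter can record a high overall accuracy while still missing a large share of the SPAM, which is why the detection rate is the more telling figure.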
A significant factor in this appeared to be PopFile's more efficient manner of
building its databases. K9 builds its databases from all e-mails received -
that results (as can be seen from the screen shot) in the database for SPAM
being much smaller than that for good e-mails.
PopFile builds its database only when corrections are made. That makes the SPAM
database much larger than that for good e-mails and appears to provide much
quicker SPAM recognition.
It should be noted however that this apparent advantage might disappear if a
larger percentage of SPAM were received.
Essentially, for much of the trial the only maintenance required was to briefly
peruse the record of what PopFile had done - in case there were false positives
(also rare). As mentioned earlier, any false positives found can still be read
from the PopFile stored copy.
As a result PopFile is now detecting SPAM messages with as close to 100%
accuracy as could be hoped for - certainly just the sort of performance I was
hoping for.
Security
Any type of proxy can represent a security hazard if supplied in an open state
or misconfigured. As an example of this, some readers will be aware that the
otherwise very useful local proxy server AnalogX (used for Internet connection
sharing) has in recent times attracted quite wide notoriety, including a report
in The New York Times, as being one of the programs most used by spammers as a
backdoor to send SPAM via other unsuspecting users. This came about because the
program author chose to provide the AnalogX software completely open to the
Internet by default.
With this in mind it was good to see that both the programs tested were supplied
in a secure state. K9 appears to be hardwired to only accept connections from
the computer on which it is installed - PopFile can accept connections from
other computers but its default configuration also limits connections to
localhost.
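Limiting a proxy to localhost comes down to which address it listens on. A minimal Python sketch of the idea follows; it is not code from K9 or PopFile, and the function name open_listener is invented for the illustration.

```python
import socket


def open_listener(host="127.0.0.1", port=0):
    """Open a TCP listener; the default address accepts local connections only."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Binding to 127.0.0.1 means other machines cannot connect at all.
    # Binding to "0.0.0.0" instead would listen on every network interface,
    # which is the "open" state warned about above.
    sock.bind((host, port))  # port 0 asks the system for any free port
    sock.listen(1)
    return sock


listener = open_listener()
print(listener.getsockname()[0])  # prints "127.0.0.1"
listener.close()
```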
Potential users should note that the only connection either of the programs
makes to the Internet is to download mail. No use is made of any external server
for processing purposes - all mail is entirely secure within the user's system.
I wanted to use both programs across a home network (my mail client is on my
laptop) and that was only possible by use of a port mapper to simulate a
localhost connection.
Figure 4. The K9 statistics screen.

Figure 5. PopFile's statistics screen
Conclusion
This is a promising technique as it cannot be readily fooled by existing
spammers' tricks such as innocuous subject lines. Whilst I personally found one
of the two programs tested much better than the other, quite different results
have been reported elsewhere. Much could depend on the nature of mail received.
Either program is worth trying. My preference is PopFile - for me it has
certainly got rid of the stuff.
Reprinted from the August 2003 issue of PC Update, the magazine of Melbourne PC User Group, Australia