
Advances in Spam Filtering Techniques

Tiago A. Almeida and Akebo Yamakami

Abstract Nowadays e-mail spam is not a novelty, but it is still an important and growing problem with a large economic impact on society. Fortunately, there are different approaches able to automatically detect and remove most of those messages, and the best-known ones are based on machine learning techniques, such as Naïve Bayes classifiers and Support Vector Machines. However, there are several different models of Naïve Bayes filters, something the spam literature does not always acknowledge. In this work, we present and compare seven different versions of Naïve Bayes classifiers, the well-known linear Support Vector Machine, and a new method based on the Minimum Description Length principle. Furthermore, we have conducted an empirical experiment on six public and real non-encoded datasets. The results indicate that the proposed filter is fast to construct, incrementally updatable, and clearly outperforms state-of-the-art spam filters.

1 Introduction

E-mail is one of the most popular, fastest and cheapest means of communication. It has become a part of everyday life for millions of people, changing the way we work and collaborate. The downside of such success is the constantly growing volume of e-mail spam we receive.
The term spam is generally used to denote an unsolicited commercial e-mail. Spam messages are annoying to most users because they clutter their mailboxes. Spam can also be quantified in economic terms, since many working hours are wasted every day: not just the time spent reading spam, but also the time spent removing those messages.

Tiago A. Almeida and Akebo Yamakami


School of Electrical and Computer Engineering, University of Campinas – UNICAMP, 13081-970,
Campinas, Sao Paulo, Brazil, e-mail: {tiago,akebo}@dt.fee.unicamp.br


The amount of spam is increasing at a frightening rate. The average number of spam messages sent per day increased from 2.4 billion in 2002¹ to 300 billion in 2010², representing more than 90% of all incoming e-mail. On a worldwide basis, the total cost of dealing with spam was estimated to rise from US$ 20.5 billion in 2003 to US$ 198 billion in 2009.

¹ See http://www.spamlaws.com/spam-stats.html
² See www.ciscosystems.cd/en/US/prod/collateral/cisco_2009_asr.pdf
Many methods have been proposed to automatically classify messages as spam or legitimate. Among all techniques, machine learning algorithms have achieved the most success [8]. Those methods include approaches that are considered top performers in text categorization, like Support Vector Machines and Naïve Bayes classifiers.
A relatively recent method for inductive inference which is still rarely employed
in text categorization tasks is the Minimum Description Length principle. It states
that the best explanation, given a limited set of observed data, is the one that yields
the greatest compression of the data [7, 11, 22].
In this work, we present a spam filtering approach based on the Minimum Description Length principle and compare its performance with seven different models of Naïve Bayes classifiers and the linear Support Vector Machine. Here, we carry out an evaluation with the practical purpose of filtering e-mail spam, in order to review and compare the current top-performing spam filters. We have conducted an empirical experiment using six well-known, large, and public databases, and the reported results indicate that our approach outperforms currently established spam filters.
Separate pieces of this work were presented at IEEE ICMLA 2009 [2], ACM SAC 2010 [3, 4] and IEEE IJCNN 2010 [1]. Here, we connect all the ideas in a consistent way, offer many more details about each study, and significantly extend the performance evaluation.
The remainder of this chapter is organized as follows. Section 2 presents basic concepts regarding the main spam filtering techniques. In Section 3, we describe a new approach based on the Minimum Description Length principle. Section 4 presents details of the Naïve Bayes algorithms applied in the spam filtering domain. The linear Support Vector Machine classifier is described in Section 5. Experimental results are shown in Section 6. Finally, Section 7 offers conclusions and outlines future work.

2 Basic concepts

In general, the machine learning algorithms applied to spam filtering can be summarized as follows.
Given a set of messages $\mathcal{M} = \{m_1, m_2, \ldots, m_j, \ldots, m_{|\mathcal{M}|}\}$ and a category set $\mathcal{C} = \{\text{spam } (c_s), \text{legitimate } (c_l)\}$, where $m_j$ is the $j$th mail in $\mathcal{M}$ and $\mathcal{C}$ is the set of possible labels, the task of automated spam filtering consists in building a Boolean categorization function $\Omega(m_j, c_i) : \mathcal{M} \times \mathcal{C} \rightarrow \{\mathit{True}, \mathit{False}\}$. When $\Omega(m_j, c_i)$ is $\mathit{True}$, it indicates that message $m_j$ belongs to category $c_i$; otherwise, $m_j$ does not belong to $c_i$.
In the setting of spam filtering there exist only two category labels: spam and legitimate (also called ham). Each message $m_j \in \mathcal{M}$ can be assigned to only one of them, but not to both. Therefore, we can use a simplified categorization function $\Omega_{\text{spam}}(m_j) : \mathcal{M} \rightarrow \{\mathit{True}, \mathit{False}\}$. Hence, a message is classified as spam when $\Omega_{\text{spam}}(m_j)$ is $\mathit{True}$, and legitimate otherwise.
The application of supervised machine learning algorithms to spam filtering consists of two stages:
1. Training. A set of labeled messages ($\mathcal{M}$) must be provided as training data, which are first transformed into a representation that can be understood by the learning algorithms. The most commonly used representation for spam filtering is the vector space model, in which each document $m_j \in \mathcal{M}$ is transformed into a real vector $x_j \in \Re^{|\Phi|}$, where $\Phi$ is the vocabulary (feature set) and the coordinates of $x_j$ represent the weight of each feature in $\Phi$. Then, we can run a learning algorithm over the training data to create a classifier $\Omega_{\text{spam}}(x_j) \rightarrow \{\mathit{True}, \mathit{False}\}$.
2. Classification. The classifier $\Omega_{\text{spam}}(x_j)$ is applied to the vector representation of a message $x$ to produce a prediction of whether $x$ is spam or not. A minimal sketch of this two-stage pipeline is given after this list.
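To make the two stages concrete, the sketch below shows a minimal Boolean instance of the vector space model in Python. The helper names (build_vocabulary, vectorize, OmegaSpam) are ours, introduced only for illustration; they do not come from the original text.

```python
from typing import Callable

def build_vocabulary(messages: list[list[str]]) -> dict[str, int]:
    """Map each distinct term seen in training to a vector coordinate."""
    vocab: dict[str, int] = {}
    for tokens in messages:
        for t in tokens:
            vocab.setdefault(t, len(vocab))
    return vocab

def vectorize(tokens: list[str], vocab: dict[str, int]) -> list[int]:
    """Boolean vector space model: x_k = 1 iff term t_k occurs in the message."""
    x = [0] * len(vocab)
    for t in tokens:
        if t in vocab:
            x[vocab[t]] = 1
    return x

# Training produces a classifier with this interface; classification applies it.
OmegaSpam = Callable[[list[int]], bool]  # True = spam, False = legitimate
```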

3 Spam filter based on data compression model

The Minimum Description Length (MDL) principle is a formalization of Occam's Razor, in which the best hypothesis for a given set of data is the one that yields the most compact representation. The traditional MDL principle states that the preferred model results in the shortest description of the model and the data, given this model. In other words, the model that best compresses the data is selected. This model selection criterion naturally balances the complexity of the model and the degree to which this model fits the data. This principle was first introduced by Rissanen [22] and has become an important concept in information theory.
The main concept behind the MDL principle can be presented as follows. Let $\mathcal{Z}$ be a finite or countable set and let $P$ be a probability distribution on $\mathcal{Z}$. Then there exists a prefix code $C$ for $\mathcal{Z}$ such that for all $z \in \mathcal{Z}$, $L_C(z) = \lceil -\log_2 P(z) \rceil$. $C$ is called the code corresponding to $P$. Similarly, let $C'$ be a prefix code for $\mathcal{Z}$. Then there exists a (possibly defective) probability distribution $P'$ such that for all $z \in \mathcal{Z}$, $-\log_2 P'(z) = L_{C'}(z)$. $P'$ is called the probability distribution corresponding to $C'$. Thus, a large probability according to $P$ means a small code length according to the code corresponding to $P$, and vice versa [7, 11, 22].
The goal of statistical inference may be cast as trying to find regularity in the data, and regularity may be identified with the ability to compress. MDL combines these two insights by viewing learning as data compression: it tells us that, for a given set of hypotheses $\mathcal{H}$ and data set $\mathcal{D}$, we should try to find the hypothesis or combination of hypotheses in $\mathcal{H}$ that compresses $\mathcal{D}$ most [7, 11, 22].

In essence, compression algorithms can be applied to text categorization by building one compression model from the training documents of each class and using these models to evaluate the target document.
In this way, given a set of pre-classified training messages $\mathcal{M}$, the task is to assign a target e-mail $m$ with an unknown label to one of the classes $c \in \{\text{spam}, \text{ham}\}$. The method measures the increase in the description length of the data set caused by the addition of the target document, and it chooses the class for which the description length increase is minimal.
Assuming that each class (model) $c$ is a sequence of terms extracted from the messages inserted into the training set, each term (token) $t$ from $m$ has a code length $L_t$ based on the sequence of terms present in the training messages of $c$. The length of $m$ when assigned to class $c$ corresponds to the sum of the code lengths associated with each term of $m$, $L_m = \sum_{i=1}^{|m|} L_{t_i}$. We calculate $L_{t_i} = \lceil -\log_2 P_{t_i} \rceil$, where $P$ is a probability distribution related to the terms of the class. Let $n_c(t_i)$ be the number of times $t_i$ appears in messages of class $c$; then the probability that any term belongs to $c$ is given by the maximum likelihood estimation
\[
P_{t_i} = \frac{n_c(t_i) + \frac{1}{|\Phi|}}{n_c + 1},
\]
where $n_c$ corresponds to the sum of $n_c(t_i)$ over all terms appearing in messages belonging to $c$, and $|\Phi|$ is the vocabulary size. In this work, we assume that $|\Phi| = 2^{32}$, that is, each term in uncompressed form is a symbol with 32 bits. This estimate reserves a "portion" of probability for words the classifier has never seen before.
Basically, the MDL spam filter classifies a message by following these steps:
1. Tokenization: the classifier extracts all terms of the new message $m = \{t_1, \ldots, t_{|m|}\}$;
2. Compute the increase in description length when $m$ is assigned to each class $c \in \{\text{spam}, \text{ham}\}$:
\[
L_m(\text{spam}) = \sum_{i=1}^{|m|} \left\lceil -\log_2 \left( \frac{n_{\text{spam}}(t_i) + \frac{1}{|\Phi|}}{n_{\text{spam}} + 1} \right) \right\rceil
\]
\[
L_m(\text{ham}) = \sum_{i=1}^{|m|} \left\lceil -\log_2 \left( \frac{n_{\text{ham}}(t_i) + \frac{1}{|\Phi|}}{n_{\text{ham}} + 1} \right) \right\rceil
\]
3. If $L_m(\text{spam}) < L_m(\text{ham})$, then $m$ is classified as spam; otherwise, $m$ is labeled as ham.
4. Training: the class models are updated according to the training method described below.
In the following, we offer more details about steps 1 and 4. A sketch of steps 2 and 3 is given below.
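The following Python sketch mirrors steps 2 and 3 directly from the formulas above. It is our illustration, not the authors' implementation; the class models are plain term-frequency tables, as described in the text.

```python
import math
from collections import Counter

PHI = 2 ** 32  # vocabulary size |Phi|: each uncompressed term is a 32-bit symbol

def description_length(tokens: list[str], counts: Counter) -> int:
    """Increase in description length when the message is assigned to the class
    whose term-frequency table is `counts` (i.e., n_c(t) for each term t)."""
    n_c = sum(counts.values())
    return sum(
        math.ceil(-math.log2((counts[t] + 1.0 / PHI) / (n_c + 1)))
        for t in tokens
    )

def classify(tokens: list[str], spam_counts: Counter, ham_counts: Counter) -> str:
    """Choose the class with the minimal description-length increase (step 3)."""
    l_spam = description_length(tokens, spam_counts)
    l_ham = description_length(tokens, ham_counts)
    return "spam" if l_spam < l_ham else "ham"
```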

3.1 Preprocessing and tokenization

We did not perform language-specific preprocessing techniques such as word stemming, stop-word removal, or case folding, since other works have found that such techniques tend to hurt spam-filtering accuracy [8, 21, 31].
Tokenization is the first stage in the classification pipeline; it involves breaking the text stream into terms ("words"), usually by means of a regular expression. In this work, we consider that terms start with a printable character, followed by any number of alphanumeric characters, excluding dots, commas and colons from the middle of the pattern. With this pattern, domain names and mail addresses are split at dots, so the classifier can recognize a domain even if subdomains vary [27].
As proposed by Drucker et al. [9] and Metsis et al. [21], we do not consider the number of times a term appears in each message. Each term is counted only once per message in which it appears, as illustrated below.
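One plausible rendering of this tokenizer in Python is shown below. The exact regular expression is our assumption, since the chapter describes the pattern only in words.

```python
import re

# A printable, non-whitespace first character followed by alphanumerics.
# Dots, commas and colons cannot occur inside a term, so domain names and
# mail addresses split at the dots.
TOKEN_RE = re.compile(r"[\x21-\x7e][A-Za-z0-9]*")

def tokenize(text: str) -> set[str]:
    # Returning a set counts each term only once per message; the filter
    # drops punctuation-only matches such as a lone '.'.
    tokens = TOKEN_RE.findall(text)
    return {t for t in tokens if any(ch.isalnum() for ch in t)}

# tokenize("Visit www.example.com now") -> {'Visit', 'www', 'example', 'com', 'now'}
```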

3.2 Training method

Spam filters generally build their prediction models by learning from examples. A basic training method is to start with an empty model, classify each new sample, and train on it in the right class if the classification is wrong. This is known as train on error (TOE). An improvement to this method is to train also when the classification is right but the score is near the boundary, that is, train on or near error (TONE). This method is also called thick threshold training [27].
The advantage of TONE over TOE is that it accelerates the learning process by exposing the filter to additional hard-to-classify samples in the same training period. Therefore, we employ TONE as the training method of the proposed MDL anti-spam filter.
A good point of the MDL classifier is that we can start with an empty training set and, according to the user feedback, the classifier builds the models for each class. Moreover, it is not necessary to keep the messages used for training, since the models are built incrementally from the term frequencies, as sketched below.
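The sketch below combines the MDL scorer with TONE training. The width of the "near error" band (THICK) is our assumption; the chapter does not fix a value for it.

```python
import math
from collections import Counter

PHI = 2 ** 32  # as before: each uncompressed term is a 32-bit symbol

class MDLFilterTONE:
    """Incremental MDL filter: starts with empty class models and learns
    from user feedback. A sketch, not the authors' implementation."""
    THICK = 20  # bits; "near the boundary" band (assumed value)

    def __init__(self) -> None:
        self.counts = {"spam": Counter(), "ham": Counter()}

    def _length(self, tokens: list[str], c: str) -> int:
        n_c = sum(self.counts[c].values())
        return sum(
            math.ceil(-math.log2((self.counts[c][t] + 1.0 / PHI) / (n_c + 1)))
            for t in tokens
        )

    def learn(self, tokens: list[str], label: str) -> str:
        l_spam, l_ham = self._length(tokens, "spam"), self._length(tokens, "ham")
        predicted = "spam" if l_spam < l_ham else "ham"
        # TONE: train on error, or when the two code lengths are nearly equal.
        if predicted != label or abs(l_spam - l_ham) < self.THICK:
            self.counts[label].update(tokens)  # only term frequencies are kept
        return predicted
```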

4 Naïve Bayes spam filters

Probabilistic classifiers were historically the first proposed filters. From Bayes' theorem and the theorem of total probability, the probability that a message with vector $\mathbf{x} = \langle x_1, \ldots, x_n \rangle$ belongs to a category $c_i \in \{c_s, c_l\}$ is:
\[
P(c_i|\mathbf{x}) = \frac{P(c_i) \cdot P(\mathbf{x}|c_i)}{P(\mathbf{x})}.
\]

Since the denominator does not depend on the category, the Naïve Bayes (NB) filter classifies each message in the category that maximizes $P(c_i) \cdot P(\mathbf{x}|c_i)$. In the spam filtering domain this is equivalent to classifying a message as spam ($c_s$) whenever
\[
\frac{P(c_s) \cdot P(\mathbf{x}|c_s)}{P(c_s) \cdot P(\mathbf{x}|c_s) + P(c_l) \cdot P(\mathbf{x}|c_l)} > T,
\]
with $T = 0.5$. By varying $T$, we can opt for more true negatives (legitimate messages correctly classified) at the expense of fewer true positives (spam messages correctly classified), or vice versa. The a priori probabilities $P(c_i)$ can be estimated as the frequency of occurrence of documents belonging to category $c_i$ in the training set $\mathcal{M}$, whereas $P(\mathbf{x}|c_i)$ is practically impossible to estimate directly, because we would need $\mathcal{M}$ to contain messages identical to the one we want to classify. However, the NB classifier makes the simplifying assumption that the terms in a message are conditionally independent and that the order in which they appear is irrelevant. The probabilities $P(\mathbf{x}|c_i)$ are estimated differently in each NB model.
Despite the fact that its independence assumption is usually oversimplistic, several studies have found the NB classifier to be surprisingly effective in the spam filtering task [5, 18].
NB classifiers are the ones most commonly employed in proprietary and open-source spam filtering systems [18, 21, 28]. However, there are different models of Naïve Bayes filters, something the spam literature does not always acknowledge. In the following, we describe seven different models of NB spam filters available in the literature.

4.1 Basic Naïve Bayes

We call Basic NB the first NB spam filter, proposed by Sahami et al. [23]. Let $\Phi = \{t_1, \ldots, t_n\}$ be the set of terms; each message $m$ is represented as a binary vector $\mathbf{x} = \langle x_1, \ldots, x_n \rangle$, where each $x_k$ indicates whether or not $t_k$ occurs in $m$. The probabilities $P(\mathbf{x}|c_i)$ are calculated by
\[
P(\mathbf{x}|c_i) = \prod_{k=1}^{n} P(t_k|c_i),
\]
and the criterion for classifying a message as spam is
\[
\frac{P(c_s) \cdot \prod_{k=1}^{n} P(t_k|c_s)}{\sum_{c_i \in \{c_s, c_l\}} P(c_i) \cdot \prod_{k=1}^{n} P(t_k|c_i)} > T.
\]
Here, the probabilities $P(t_k|c_i)$ are estimated by
\[
P(t_k|c_i) = \frac{|\mathcal{M}_{t_k,c_i}|}{|\mathcal{M}_{c_i}|},
\]

where $|\mathcal{M}_{t_k,c_i}|$ is the number of training messages of category $c_i$ that contain the term $t_k$, and $|\mathcal{M}_{c_i}|$ is the total number of training messages belonging to category $c_i$. A small illustration of this estimate follows.
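For illustration, the Basic NB estimate can be computed directly from Boolean message representations. This is a sketch under our naming, not Sahami et al.'s code.

```python
def term_prob(term: str, class_msgs: list[set[str]]) -> float:
    """P(t_k | c_i) = |M_{t_k,c_i}| / |M_{c_i}|: the fraction of training
    messages of category c_i that contain the term (no smoothing)."""
    return sum(1 for m in class_msgs if term in m) / len(class_msgs)

spam_msgs = [{"buy", "cheap"}, {"cheap", "meds"}]  # toy training data
print(term_prob("cheap", spam_msgs))  # -> 1.0
```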

4.2 Multinomial term frequency Naïve Bayes

The multinomial term frequency NB (MN TF NB) represents each message as a bag of terms $m = \{t_1, \ldots, t_n\}$, counting how many times each $t_k$ appears in $m$. In this sense, $m$ can be represented by a vector $\mathbf{x} = \langle x_1, \ldots, x_n \rangle$, where each $x_k$ corresponds to the number of occurrences of $t_k$ in $m$. Moreover, each message $m$ of category $c_i$ can be interpreted as the result of independently picking $|m|$ terms from $\Phi$ with replacement, with probability $P(t_k|c_i)$ for each $t_k$ [20]. Hence, $P(\mathbf{x}|c_i)$ is the multinomial distribution
\[
P(\mathbf{x}|c_i) = P(|m|) \cdot |m|! \cdot \prod_{k=1}^{n} \frac{P(t_k|c_i)^{x_k}}{x_k!}.
\]
Thus, the criterion for classifying a message as spam becomes
\[
\frac{P(c_s) \cdot \prod_{k=1}^{n} P(t_k|c_s)^{x_k}}{\sum_{c_i \in \{c_s, c_l\}} P(c_i) \cdot \prod_{k=1}^{n} P(t_k|c_i)^{x_k}} > T,
\]
and the probabilities $P(t_k|c_i)$ are estimated with a Laplacian prior:
\[
P(t_k|c_i) = \frac{1 + N_{t_k,c_i}}{n + N_{c_i}},
\]
where $N_{t_k,c_i}$ is the number of occurrences of term $t_k$ in the training messages of category $c_i$, and $N_{c_i} = \sum_{k=1}^{n} N_{t_k,c_i}$.

4.3 Multinomial Boolean Naïve Bayes

The multinomial Boolean NB (MN Boolean NB) is similar to the MN TF NB, including the estimates of $P(t_k|c_i)$, except that each attribute $x_k$ is Boolean. Note that these approaches do not take into account the absence of terms ($x_k = 0$) from the messages.
Schneider [24] demonstrates that MN Boolean NB may perform better than MN TF NB. This is because the multinomial NB with term frequency attributes is equivalent to an NB version with the attributes modeled as following Poisson distributions in each category, assuming that the message length is independent of the category. Therefore, the multinomial NB may achieve better performance with Boolean attributes if the term frequency attributes do not actually follow Poisson distributions. The sketch below illustrates the distinction.
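The distinction can be illustrated with scikit-learn (our illustration; the chapter does not say it uses this library): the same multinomial model is fed raw term frequencies in one case and Boolean occurrence indicators in the other. The toy messages are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train = ["buy cheap meds now", "cheap cheap meds", "meeting agenda attached"]
labels = ["spam", "spam", "ham"]

tf_vec = CountVectorizer()               # x_k = term frequency (MN TF NB)
bool_vec = CountVectorizer(binary=True)  # x_k in {0, 1} (MN Boolean NB)

mn_tf = MultinomialNB().fit(tf_vec.fit_transform(train), labels)
mn_bool = MultinomialNB().fit(bool_vec.fit_transform(train), labels)

test = ["cheap meds tonight"]
print(mn_tf.predict(tf_vec.transform(test)))      # -> ['spam']
print(mn_bool.predict(bool_vec.transform(test)))  # -> ['spam']
```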

4.4 Multivariate Bernoulli Naïve Bayes

Let $\Phi = \{t_1, \ldots, t_n\}$ be the set of terms. The multivariate Bernoulli NB (MV Bernoulli NB) represents each message $m$ by computing both the presence and the absence of each term. Therefore, $m$ can be represented as a binary vector $\mathbf{x} = \langle x_1, \ldots, x_n \rangle$, where each $x_k$ indicates whether or not $t_k$ occurs in $m$. Moreover, each message $m$ of category $c_i$ is seen as the result of $n$ Bernoulli trials, where at each trial we decide whether or not $t_k$ will appear in $m$. The probability of a positive outcome at trial $k$ is $P(t_k|c_i)$. Then, the probabilities $P(\mathbf{x}|c_i)$ are computed by
\[
P(\mathbf{x}|c_i) = \prod_{k=1}^{n} P(t_k|c_i)^{x_k} \cdot (1 - P(t_k|c_i))^{(1-x_k)}.
\]
The criterion for classifying a message as spam becomes
\[
\frac{P(c_s) \cdot \prod_{k=1}^{n} P(t_k|c_s)^{x_k} \cdot (1 - P(t_k|c_s))^{(1-x_k)}}{\sum_{c_i \in \{c_s, c_l\}} P(c_i) \cdot \prod_{k=1}^{n} P(t_k|c_i)^{x_k} \cdot (1 - P(t_k|c_i))^{(1-x_k)}} > T,
\]
and the probabilities $P(t_k|c_i)$ are estimated with a Laplacian prior:
\[
P(t_k|c_i) = \frac{1 + |\mathcal{M}_{t_k,c_i}|}{2 + |\mathcal{M}_{c_i}|},
\]
where $|\mathcal{M}_{t_k,c_i}|$ is the number of training messages of category $c_i$ that contain the term $t_k$, and $|\mathcal{M}_{c_i}|$ is the total number of training messages of category $c_i$. For more theoretical explanation, consult Metsis et al. [21] and Losada and Azzopardi [17].

4.5 Boolean Naïve Bayes

We denote as Boolean NB a classifier similar to the MV Bernoulli NB, with the difference that it does not take into account the absence of terms. Hence, the probabilities $P(\mathbf{x}|c_i)$ are estimated only by
\[
P(\mathbf{x}|c_i) = \prod_{k=1}^{n} P(t_k|c_i),
\]
and the criterion for classifying a message as spam becomes
\[
\frac{P(c_s) \cdot \prod_{k=1}^{n} P(t_k|c_s)}{\sum_{c_i \in \{c_s, c_l\}} P(c_i) \cdot \prod_{k=1}^{n} P(t_k|c_i)} > T,
\]
where the probabilities $P(t_k|c_i)$ are estimated in the same way as in the MV Bernoulli NB.

4.6 Multivariate Gauss Naïve Bayes

Multivariate Gauss NB (MV Gauss NB) uses real-valued attributes by assuming that each attribute follows a Gaussian distribution $g(x_k; \mu_{k,c_i}, \sigma_{k,c_i})$ for each category $c_i$, where the $\mu_{k,c_i}$ and $\sigma_{k,c_i}$ of each distribution are estimated from the training set $\mathcal{M}$. The probabilities $P(\mathbf{x}|c_i)$ are calculated by
\[
P(\mathbf{x}|c_i) = \prod_{k=1}^{n} g(x_k; \mu_{k,c_i}, \sigma_{k,c_i}),
\]
and the criterion for classifying a message as spam becomes
\[
\frac{P(c_s) \cdot \prod_{k=1}^{n} g(x_k; \mu_{k,c_s}, \sigma_{k,c_s})}{\sum_{c_i \in \{c_s, c_l\}} P(c_i) \cdot \prod_{k=1}^{n} g(x_k; \mu_{k,c_i}, \sigma_{k,c_i})} > T.
\]

4.7 Flexible Bayes

Flexible Bayes (FB) works similarly to MV Gauss NB. However, instead of using a single normal distribution for each attribute $X_k$ per category $c_i$, FB represents the probabilities $P(x_k|c_i)$ as the average of $L_{k,c_i}$ normal distributions with different values for $\mu_{k,c_i}$ but the same $\sigma_{c_i}$:
\[
P(x_k|c_i) = \frac{1}{L_{k,c_i}} \sum_{l=1}^{L_{k,c_i}} g(x_k; \mu_{k,c_i,l}, \sigma_{c_i}),
\]
where $L_{k,c_i}$ is the number of different values that the attribute $X_k$ takes in the training set $\mathcal{M}$ of category $c_i$. Each of these values is used as the $\mu_{k,c_i,l}$ of a normal distribution of category $c_i$. However, all distributions of a category $c_i$ are taken to have the same $\sigma_{c_i} = \frac{1}{\sqrt{|\mathcal{M}_{c_i}|}}$.
The distribution of each category becomes narrower as more training messages of that category are accumulated. By averaging several normal distributions, FB can approximate the true distributions of real-valued attributes more closely than the MV Gauss NB when the assumption that attributes follow a normal distribution is violated. For further details, consult John and Langley [13] and Androutsopoulos et al. [5]. A sketch of the FB density estimate is given below.
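The following Python sketch (our code, not from [13] or [5]) computes the FB density estimate above: one Gaussian kernel per distinct training value of the attribute, all sharing $\sigma_{c_i} = 1/\sqrt{|\mathcal{M}_{c_i}|}$.

```python
import math

def gauss(x: float, mu: float, sigma: float) -> float:
    """Normal density g(x; mu, sigma)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def flexible_bayes_density(x_k: float, train_values: list[float], m_ci: int) -> float:
    """P(x_k | c_i): average of L_{k,c_i} Gaussians, one per distinct value."""
    sigma = 1.0 / math.sqrt(m_ci)   # shared standard deviation for category c_i
    mus = set(train_values)         # each distinct training value supplies a mean
    return sum(gauss(x_k, mu, sigma) for mu in mus) / len(mus)

# Example: attribute values observed in 9 training messages of the category.
print(flexible_bayes_density(0.5, [0.0, 0.5, 0.5, 1.0], m_ci=9))
```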

Table 1 summarizes all NB spam filters presented in this section.³

Table 1 Naïve Bayes spam filters.

NB Classifier     $P(\mathbf{x}|c_i)$                                                                          Training                    Classification
Basic NB          $\prod_{k=1}^{n} P(t_k|c_i)$                                                                 $O(n \cdot |\mathcal{M}|)$  $O(n)$
MN TF NB          $\prod_{k=1}^{n} P(t_k|c_i)^{x_k}$                                                           $O(n \cdot |\mathcal{M}|)$  $O(n)$
MN Boolean NB     $\prod_{k=1}^{n} P(t_k|c_i)^{x_k}$                                                           $O(n \cdot |\mathcal{M}|)$  $O(n)$
MV Bernoulli NB   $\prod_{k=1}^{n} P(t_k|c_i)^{x_k} (1 - P(t_k|c_i))^{(1-x_k)}$                                $O(n \cdot |\mathcal{M}|)$  $O(n)$
Boolean NB        $\prod_{k=1}^{n} P(t_k|c_i)$                                                                 $O(n \cdot |\mathcal{M}|)$  $O(n)$
MV Gauss NB       $\prod_{k=1}^{n} g(x_k; \mu_{k,c_i}, \sigma_{k,c_i})$                                        $O(n \cdot |\mathcal{M}|)$  $O(n)$
Flexible Bayes    $\prod_{k=1}^{n} \frac{1}{L_{k,c_i}} \sum_{l=1}^{L_{k,c_i}} g(x_k; \mu_{k,c_i,l}, \sigma_{c_i})$  $O(n \cdot |\mathcal{M}|)$  $O(n \cdot |\mathcal{M}|)$

³ The computational complexities are according to Metsis et al. [21]. At classification time, the complexity of FB is $O(n \cdot |\mathcal{M}|)$ because it needs to sum over the $L_{k,c_i}$ distributions.

5 Support Vector Machines

Support vector machine (SVM) is one of the most successful techniques used in text classification [8, 10]. In this method, a data point is viewed as a $p$-dimensional vector, and the approach aims to separate such points with a $(p-1)$-dimensional hyperplane; this is called a linear classifier. There are many hyperplanes that might classify the data. One reasonable choice for the best hyperplane is the one that represents the largest separation, or margin, between the two classes. Therefore, the SVM chooses the hyperplane so that the distance from it to the nearest data point on each side is maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane, and the linear classifier it defines is known as a maximum margin classifier (Figure 1) [29].

[Fig. 1: Maximum-margin hyperplane and margins for an SVM trained with samples from two classes.]

SVMs belong to a family of generalized linear classifiers. A special property is that they simultaneously minimize the empirical classification error and maximize the geometric margin; hence they are also known as maximum margin classifiers. For further details about the implementation of SVMs in the spam filtering domain, consult Cormack [8], Drucker et al. [9], Hidalgo [12], Kolcz and Alspector [14], Sculley and Wachman [25], Sculley et al. [26], and Liu and Cui [16]. A minimal example follows.
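The sketch below builds a minimal linear-SVM spam filter with scikit-learn, for illustration only; the chapter does not state which SVM implementation its experiments used, and the toy messages are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

train = ["win a free prize now", "cheap meds online", "project review at 10am"]
labels = [1, 1, 0]  # 1 = spam, 0 = legitimate

vec = CountVectorizer(binary=True)  # Boolean features, as in Sect. 3.1
clf = LinearSVC(C=1.0).fit(vec.fit_transform(train), labels)

print(clf.predict(vec.transform(["free meds prize"])))  # -> [1]
```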

6 Experimental results

We carried out this study on the six well-known, large, real and public Enron datasets⁴. The corpora are composed of legitimate messages extracted from the mailboxes of six former employees of the Enron Corporation. For further details about the dataset statistics and composition, refer to Metsis et al. [21].

⁴ The Enron datasets are available at http://www.iit.demokritos.gr/skel/i-config/.
Tables 2, 3, 4, 5, 6, and 7 present the performance achieved by each classifier on each Enron dataset. Bold values indicate the highest score. In order to provide a fair evaluation, we consider the most important measures to be the Matthews correlation coefficient (MCC) [1–4] and the weighted accuracy rate (Accw%) [5] achieved by each filter. Additionally, we present other well-known measures, such as spam recall (Sre%), legitimate recall (Lre%), spam precision (Spr%), legitimate precision (Lpr%), and total cost ratio (TCR) [5]. It is important to note that TCR offers an indication of the improvement provided by the filter. A greater TCR indicates better performance, and for TCR < 1, not using the filter is better. The MCC, in turn, returns a real value between −1 and +1. A coefficient equal to +1 indicates a perfect prediction; 0, an average random prediction; and −1, an inverse prediction. It can be calculated using the following equation [6, 19]:
\[
\text{MCC} = \frac{|TP| \cdot |TN| - |FP| \cdot |FN|}{\sqrt{(|TP|+|FP|)(|TP|+|FN|)(|TN|+|FP|)(|TN|+|FN|)}},
\]
where $|TP|$, $|FP|$, $|TN|$ and $|FN|$ correspond to the number of true positives, false positives, true negatives and false negatives, respectively. A small helper for computing it is sketched below.
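The coefficient is straightforward to compute from the confusion matrix. The helper below (our code) uses the usual convention of returning 0 when a marginal is empty; the counts in the example are hypothetical.

```python
import math

def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient as defined above; returns 0.0 for the
    degenerate case of an empty marginal (a common convention)."""
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / den if den else 0.0

# Hypothetical confusion matrix: 138 spams caught, 12 missed, 4 false alarms.
print(round(mcc(tp=138, fp=4, tn=346, fn=12), 3))
```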

Table 2 Enron 1 – Results achieved by each filter

Measures  Basic   Bool    MN TF   MN Bool  MV Bern  MV Gauss  Flex Bayes  SVM     MDL
Sre(%)    91.33   96.00   82.00   82.67    72.00    78.67     87.33       83.33   92.00
Spr(%)    85.09   51.61   75.00   62.00    61.71    87.41     86.18       87.41   92.62
Lre(%)    93.48   63.32   88.86   79.35    81.79    95.38     94.29       95.11   97.01
Lpr(%)    96.36   97.49   92.37   91.82    87.76    91.64     94.81       93.33   96.75
Accw(%)   92.86   72.78   86.87   80.31    78.96    90.54     92.28       91.70   95.56
TCR       4.054   1.064   2.206   1.471    1.376    3.061     3.750       3.488   6.552
MCC       0.831   0.540   0.691   0.578    0.516    0.765     0.813       0.796   0.892

Table 3 Enron 2 – Results achieved by each filter

Measures  Basic   Bool    MN TF   MN Bool  MV Bern  MV Gauss  Flex Bayes  SVM     MDL
Sre(%)    80.00   95.33   75.33   74.00    65.33    62.67     68.67       90.67   91.33
Spr(%)    97.57   81.25   96.58   98.23    81.67    94.95     98.10       90.67   99.28
Lre(%)    99.31   92.45   99.08   99.54    94.97    98.86     99.54       96.80   99.77
Lpr(%)    93.53   98.30   92.13   91.77    88.87    88.52     90.25       96.80   97.10
Accw(%)   94.38   93.19   93.02   93.02    87.39    89.61     91.65       95.23   97.61
TCR       4.545   3.750   3.659   3.659    2.027    2.459     3.061       5.357   10.714
MCC       0.850   0.836   0.812   0.814    0.652    0.717     0.776       0.875   0.937

Table 4 Enron 3 – Results achieved by each filter

Measures  Basic   Bool    MN TF   MN Bool  MV Bern  MV Gauss  Flex Bayes  SVM     MDL
Sre(%)    57.33   99.33   57.33   62.00    100.00   52.67     52.00       91.33   90.00
Spr(%)    100.00  99.33   100.00  100.00   84.75    89.77     96.30       96.48   100.00
Lre(%)    100.00  99.75   100.00  100.00   93.28    97.76     99.25       98.76   100.00
Lpr(%)    86.27   99.75   86.27   87.58    100.00   84.70     84.71       96.83   96.40
Accw(%)   88.41   99.64   88.41   89.67    95.11    85.51     86.41       96.74   97.28
TCR       2.344   75.000  2.344   2.632    5.556    1.875     2.000       8.333   10.000
MCC       0.703   0.991   0.703   0.737    0.889    0.613     0.644       0.917   0.931

Table 5 Enron 4 – Results achieved by each filter

Measures  Basic   Bool    MN TF   MN Bool  MV Bern  MV Gauss  Flex Bayes  SVM     MDL
Sre(%)    94.67   98.00   93.78   96.89    98.22    94.44     94.89       98.89   97.11
Spr(%)    100.00  100.00  100.00  100.00   100.00   100.00    100.00      100.00  100.00
Lre(%)    100.00  100.00  100.00  100.00   100.00   100.00    100.00      100.00  100.00
Lpr(%)    86.21   94.34   84.27   91.46    94.94    85.71     86.71       96.77   92.02
Accw(%)   96.00   98.50   95.33   97.67    98.67    95.83     96.17       99.17   97.83
TCR       18.750  50.000  16.071  32.143   56.250   18.000    19.565      90.00   34.615
MCC       0.903   0.962   0.889   0.941    0.966    0.900     0.907       0.978   0.945

Table 6 Enron 5 – Results achieved by each filter

Measures  Basic   Bool    MN TF   MN Bool  MV Bern  MV Gauss  Flex Bayes  SVM     MDL
Sre(%)    89.67   87.23   88.86   94.29    98.10    86.68     88.86       89.40   99.73
Spr(%)    98.80   100.00  100.00  100.00   92.56    96.37     98.79       99.70   98.39
Lre(%)    97.33   100.00  100.00  100.00   80.67    92.00     97.33       99.33   96.00
Lpr(%)    79.35   76.14   78.53   87.72    94.53    73.80     78.07       79.26   99.31
Accw(%)   91.89   90.93   92.08   95.95    93.05    88.22     91.31       92.28   98.65
TCR       8.762   7.830   8.976   17.524   10.222   6.033     8.178       9.200   52.571
MCC       0.825   0.815   0.835   0.909    0.828    0.743     0.814       0.837   0.967

Table 7 Enron 6 – Results achieved by each filter

Measures  Basic   Bool    MN TF   MN Bool  MV Bern  MV Gauss  Flex Bayes  SVM     MDL
Sre(%)    86.00   66.89   76.67   92.89    96.22    92.00     89.78       89.78   98.67
Spr(%)    98.98   99.67   99.42   97.21    92.32    94.95     98.30       95.28   95.48
Lre(%)    97.33   99.33   98.67   92.00    76.00    85.33     95.33       86.67   86.00
Lpr(%)    69.86   50.00   58.50   81.18    87.02    78.05     75.66       73.86   95.56
Accw(%)   88.33   75.00   82.17   92.67    91.17    90.33     91.17       90.05   95.50
TCR       6.716   3.000   4.206   10.227   8.491    7.759     8.491       6.818   16.667
MCC       0.757   0.574   0.661   0.816    0.757    0.751     0.793       0.727   0.878

Regarding the results achieved by the classifiers, the MDL spam filter outperformed the other classifiers on the majority of the e-mail collections used in our empirical evaluation. It is important to realize that in some situations MDL performs much better than the SVM and NB classifiers. For instance, on Enron 1 (Table 2), MDL achieved a spam recall rate of 92% while SVM attained 83.33%, while also presenting better legitimate recall. This means that on Enron 1, MDL was able to recognize more than 8% more spam than SVM, representing an improvement of 10.40%. In a real situation, this difference would be extremely important. Note that the same result can be found for Enron 2 (Table 3), Enron 5 (Table 6) and Enron 6 (Table 7). The two methods, MDL and SVM, achieved similar performance, with no statistically significant difference, only on Enron 3 (Table 4) and Enron 4 (Table 5).
The results indicate that the data compression model is more effective at distinguishing spam from legitimate messages. It attained an accuracy rate higher than 95% and high precision × recall rates for all datasets, indicating that the MDL classifier makes few mistakes. We also verify that the MDL classifier achieved an MCC score higher than 0.87 for every corpus tested. This indicates that the proposed filter comes close to a perfect prediction (MCC = 1.000) and is much better than not using a filter (MCC = 0.000).
Among the evaluated NB classifiers, the results indicate that all of them achieved similar performance, with no statistically significant difference. However, they achieved lower results than MDL and the linear SVM, which attained accuracy rates higher than 90% for all Enron datasets.
Moreover, in agreement with the results found by Schneider [24], in our experiments the NB filters that use real and integer attributes did not achieve better results than the Boolean ones. However, Metsis et al. [21] showed that Flexible Bayes is less sensitive to the threshold T, which indicates that it is able to attain a high spam recall even when a high legitimate recall is required.

7 Conclusions

In this chapter, we have presented a new spam filtering approach based on the Minimum Description Length principle. We have also compared its performance with the linear Support Vector Machine and seven different models of Naïve Bayes classifiers, something the spam literature does not always acknowledge.
We have conducted an empirical experiment using six well-known, large, and public databases, and the reported results indicate that the proposed classifier outperforms currently established spam filters. It is important to emphasize that the MDL spam filter achieved the best average performance across all analyzed databases, presenting an accuracy rate higher than 95% for all e-mail datasets.
Currently, we are conducting more experiments using larger datasets, such as the TREC05, TREC06 and TREC07 corpora [8], in order to reinforce the validation. We also intend to compare the approaches with other commercial and open-source spam filters, such as Bogofilter, SpamAssassin and OSBF-Lua, among others.
Future work should take into consideration that spam filtering is a coevolutionary problem: while the filter tries to evolve its prediction capacity, spammers try to evolve their spam messages in order to outwit the classifiers. Hence, an efficient approach should have an effective way to adjust its rules in order to detect changes in spam features. In this regard, collaborative filters [15] could be used to assist the classifier by accelerating the adaptation of the rules and increasing the classifier's performance. Moreover, spammers generally insert a large amount of noise in spam messages in order to make the probability estimation more difficult. Thus, the filters should have a flexible way to compare terms in the classification task. Approaches based on fuzzy logic [30] could be employed to make the comparison and selection of terms more flexible.

Acknowledgements The authors would like to thank J. Almeida for his very constructive sugges-
tions and the Brazilian Coordination for the Improvement of Higher Level Personnel (Capes) for
financial support.

References

[1] Almeida, T. and Yamakami, A. (2010). Content-Based Spam Filtering. In Proceedings of the 23rd IEEE International Joint Conference on Neural Networks, pages 1–7, Barcelona, Spain.
[2] Almeida, T., Yamakami, A., and Almeida, J. (2009). Evaluation of Approaches
for Dimensionality Reduction Applied with Naive Bayes Anti-Spam Filters. In
Proceedings of the 8th IEEE International Conference on Machine Learning and
Applications, pages 517–522, Miami, FL, USA.
[3] Almeida, T., Yamakami, A., and Almeida, J. (2010a). Filtering Spams using
the Minimum Description Length Principle. In Proceedings of the 25th ACM
Symposium On Applied Computing, pages 1856–1860, Sierre, Switzerland.

[4] Almeida, T., Yamakami, A., and Almeida, J. (2010b). Probabilistic Anti-Spam
Filtering with Dimensionality Reduction. In Proceedings of the 25th ACM Sym-
posium On Applied Computing, pages 1804–1808, Sierre, Switzerland.
[5] Androutsopoulos, I., Paliouras, G., and Michelakis, E. (2004). Learning to Filter
Unsolicited Commercial E-Mail. Technical Report 2004/2, National Centre for
Scientific Research “Demokritos”, Athens, Greece.
[6] Baldi, P., Brunak, S., Chauvin, Y., Andersen, C., and Nielsen, H. (2000). As-
sessing the Accuracy of Prediction Algorithms for Classification: An Overview.
Bioinformatics, 16(5), 412–424.
[7] Barron, A., Rissanen, J., and Yu, B. (1998). The Minimum Description Length
Principle in Coding and Modeling. IEEE Transactions on Information Theory,
44(6), 2743–2760.
[8] Cormack, G. (2008). Email Spam Filtering: A Systematic Review. Foundations
and Trends in Information Retrieval, 1(4), 335–455.
[9] Drucker, H., Wu, D., and Vapnik, V. (1999). Support Vector Machines for Spam
Categorization. IEEE Transactions on Neural Networks, 10(5), 1048–1054.
[10] Forman, G., Scholz, M., and Rajaram, S. (2009). Feature Shaping for Lin-
ear SVM Classifiers. In Proceedings of the 15th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pages 299–308, Paris,
France.
[11] Grünwald, P. (2005). A Tutorial Introduction to the Minimum Description
Length Principle. In P. Grünwald, I. Myung, and M. Pitt, editors, Advances in
Minimum Description Length: Theory and Applications, pages 3–81. MIT Press.
[12] Hidalgo, J. (2002). Evaluating Cost-Sensitive Unsolicited Bulk Email Cate-
gorization. In Proceedings of the 17th ACM Symposium on Applied Computing,
pages 615–620, Madrid, Spain.
[13] John, G. and Langley, P. (1995). Estimating Continuous Distributions in
Bayesian Classifiers. In Proceedings of the 11th International Conference on
Uncertainty in Artificial Intelligence, pages 338–345, Montreal, Canada.
[14] Kolcz, A. and Alspector, J. (2001). SVM-based Filtering of E-mail Spam with
Content-Specific Misclassification Costs. In Proceedings of the 1st International
Conference on Data Mining, pages 1–14, San Jose, CA, USA.
[15] Lemire, D. (2005). Scale and Translation Invariant Collaborative Filtering
Systems. Information Retrieval, 8(1), 129–150.
[16] Liu, S. and Cui, K. (2009). Applications of Support Vector Machine Based on
Boolean Kernel to Spam Filtering. Modern Applied Science, 3(10), 27–31.
[17] Losada, D. and Azzopardi, L. (2008). Assessing Multivariate Bernoulli Mod-
els for Information Retrieval. ACM Transactions on Information Systems, 26(3),
1–46.
[18] Marsono, M., El-Kharashi, N., and Gebali, F. (2009). Targeting Spam Control
on Middleboxes: Spam Detection Based on Layer-3 E-mail Content Classifica-
tion. Computer Networks, 53(6), 835–848.
[19] Matthews, B. (1975). Comparison of the Predicted and Observed Secondary
Structure of T4 Phage Lysozyme. Biochimica et Biophysica Acta, 405(2), 442–
451.

[20] McCallum, A. and Nigam, K. (1998). A Comparison of Event Models for Naive Bayes Text Classification. In Proceedings of the 15th AAAI Workshop on Learning for Text Categorization, pages 41–48, Menlo Park, CA, USA.
[21] Metsis, V., Androutsopoulos, I., and Paliouras, G. (2006). Spam Filtering with
Naive Bayes - Which Naive Bayes? In Proceedings of the 3rd International
Conference on Email and Anti-Spam, pages 1–5, Mountain View, CA, USA.
[22] Rissanen, J. (1978). Modeling by Shortest Data Description. Automatica, 14,
465–471.
[23] Sahami, M., Dumais, S., Hecherman, D., and Horvitz, E. (1998). A Bayesian
Approach to Filtering Junk E-mail. In Proceedings of the 15th National Confer-
ence on Artificial Intelligence, pages 55–62, Madison, WI, USA.
[24] Schneider, K. (2004). On Word Frequency Information and Negative Evidence
in Naive Bayes Text Classification. In Proceedings of the 4th International Con-
ference on Advances in Natural Language Processing, pages 474–485, Alicante,
Spain.
[25] Sculley, D. and Wachman, G. (2007). Relaxed Online SVMs for Spam Fil-
tering. In Proceedings of the 30th International ACM SIGIR Conference on Re-
search and Development in Information Retrieval, pages 415–422, Amsterdam,
The Netherlands.
[26] Sculley, D., Wachman, G., and Brodley, C. (2006). Spam Filtering using Inex-
act String Matching in Explicit Feature Space with On-Line Linear Classifiers.
In Proceedings of the 15th Text REtrieval Conference, pages 1–10, Gaithersburg,
MD, USA.
[27] Siefkes, C., Assis, F., Chhabra, S., and Yerazunis, W. (2004). Combining Win-
now and Orthogonal Sparse Bigrams for Incremental Spam Filtering. In Proceed-
ings of the 8th European Conference on Principles and Practice of Knowledge
Discovery in Databases, pages 410–421, Pisa, Italy.
[28] Song, Y., Kolcz, A., and Giles, C. (2009). Better Naive Bayes Classification
for High-precision Spam Detection. Software – Practice and Experience, 39(11),
1003–1024.
[29] Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer-
Verlag, New York, NY, USA.
[30] Zadeh, L. (1965). Fuzzy sets. Information and Control, 8(3), 338–353.
[31] Zhang, L., Zhu, J., and Yao, T. (2004). An Evaluation of Statistical Spam
Filtering Techniques. ACM Transactions on Asian Language Information Pro-
cessing, 3(4), 243–269.
