
SE/SS * ZG 548

Advanced Data Mining


Anomaly Detection
Revision 2.0

BITS Pilani Prof Vineet Garg


Bangalore Professional Development Center
Work Integrated Learning Programmes
Statistical Approaches
Univariate Normal Distribution

Statistical approaches are model-based approaches:

• A model (or distribution) is created for the data.
• Objects are evaluated with respect to how well they fit the model.
• For univariate data, the Gaussian (normal) distribution is used to identify outliers.
• The normal distribution N(μ, σ) has two parameters: mean (μ) and standard deviation (σ). The plot shows the probability density function for N(0, 1).
• An object's distance from the center can be used to test whether it is an outlier. For N(0, 1), the probability of an object lying more than 4 units from the center is roughly one in ten thousand (extremely low).
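
The "one in ten thousand" figure can be checked numerically. The following is a minimal sketch, not part of the original slides: it computes the two-sided tail probability P(|x| ≥ c) for N(0, 1) with the standard-library function `math.erfc`. The helper name `two_sided_tail` is hypothetical.

```python
import math


def two_sided_tail(c: float) -> float:
    """Return P(|x| >= c) for x drawn from the standard normal N(0, 1)."""
    # For N(0, 1): P(|x| >= c) = 2 * (1 - Phi(c)) = erfc(c / sqrt(2)).
    return math.erfc(c / math.sqrt(2.0))


for c in (1.0, 2.0, 3.0, 4.0):
    print(f"c = {c:.1f}   P(|x| >= c) = {two_sided_tail(c):.6f}")
```

For c = 4 the value is about 0.00006, which is the same order of magnitude as the rounded figure quoted on the slide.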

BITS Pilani, WILP


Normal (Gaussian) Density Function

$$P(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\; e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

where
P(x) = probability of occurrence for x in the distribution
μ = mean
σ² = variance
e is Euler's number ≈ 2.71828



Exercise

1. A univariate data vector, assumed to be normally distributed, is given as {12, 24, 3, 18, 19, 21, 13}. Find the probability (density) of the value 15, considering that the vector contains the entire population.
Mean (μ) = 15.71
Variance (σ²) = 42.20
Standard Deviation (σ) = 6.50
A = 1/((2π)^(1/2) · σ) = 0.061
x = 15
B = -(x - μ)² / (2σ²) = -0.006
C = e^B = 2.71828^(-0.006) ≈ 0.994 ≈ 1
So the probability = A × C = 0.061 × 1 = 0.061

2. [Optional] Z-normalize the above vector and, using an Excel sheet, draw the probability density plot. You can use Excel's normdist() function to calculate the probability of each data point.
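
The arithmetic of exercise 1 can be reproduced with a few lines of Python. This is a minimal sketch under the exercise's own assumption that the vector is the entire population (so the variance divides by n); the function name `normal_pdf` is hypothetical. The last two lines also z-normalize the vector, as exercise 2 suggests, and evaluate the N(0, 1) density at each z-score.

```python
import math


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of N(mu, sigma) at x, following the formula on the previous slide."""
    coeff = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)      # term A
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)      # term B
    return coeff * math.exp(exponent)                     # A * e^B


data = [12, 24, 3, 18, 19, 21, 13]
n = len(data)
mu = sum(data) / n
variance = sum((v - mu) ** 2 for v in data) / n  # population variance (divide by n)
sigma = math.sqrt(variance)
density_at_15 = normal_pdf(15, mu, sigma)

print(f"mean = {mu:.2f}, variance = {variance:.2f}, sigma = {sigma:.2f}")
print(f"density at x = 15: {density_at_15:.3f}")

# Exercise 2: z-normalize, then evaluate the standard normal density per point.
z_scores = [(v - mu) / sigma for v in data]
z_densities = [normal_pdf(z, 0.0, 1.0) for z in z_scores]
```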


Multivariate Normal Distribution
Probability Density Function (PDF)

• x1 and x2 are the variables in the bivariate normal distribution.
• How is this density identified?
• How can outliers be identified in a multivariate normal distribution in general?

Outlier in Multivariate Normal Distribution
• For a univariate dataset, outliers are detected with the probability density function built from μ and σ, assuming the points are normally distributed.
• The question is how to adopt a similar approach for a multivariate normal distribution. The answer is to take the same approach, and this is where the covariance comes into the picture.
• When the normally distributed attributes are correlated, the concept of Mahalanobis distance applies. It uses the covariance in calculating the distance.
• It was formalized by P.C. Mahalanobis, the famous Indian statistician who is remembered as the founder of the Indian Statistical Institute, Kolkata, and as a member of the first Planning Commission of India.

$$\text{Mahalanobis}(X, \bar{X}) = (X - \bar{X})\; S^{-1}\; (X - \bar{X})^{T}$$

where
$\bar{X}$ is the mean of X
$S^{-1}$ is the inverse of the covariance matrix
$(X - \bar{X})^{T}$ is the transpose of the matrix $(X - \bar{X})$

Note: X is a vector of coordinates for a point.

[Photo: P.C. Mahalanobis, 1893-1972]



Probability Density Function
Multivariate Normal Distribution

If S is the covariance matrix of the multivariate data (m dimensions), then the probability density function for a data point x is given by:

$$P(x) = \frac{1}{(\sqrt{2\pi})^{m}\, |S|^{1/2}}\; e^{-\frac{1}{2}\,(x - \bar{X})\, S^{-1}\, (x - \bar{X})^{T}}$$

Note: the exponent is the Mahalanobis distance multiplied by -1/2.

If the natural log of this probability is taken, the result is, apart from an additive constant, proportional to the Mahalanobis distance (ln e^(-x) = -x).

So, in one way, it is sufficient to use the Mahalanobis distance to find the outliers instead of calculating the actual probability.
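
The relationship between the log of the density and the Mahalanobis distance can be illustrated numerically. The sketch below is illustrative only: the mean, the covariance matrix S = [[2, 1], [1, 2]] and the two test points are hypothetical values chosen for the example, and the Mahalanobis distance is used in the form given on the earlier slide (without a square root).

```python
import math

# Hypothetical bivariate example (m = 2): mean at the origin and
# covariance S = [[2, 1], [1, 2]], i.e. positively correlated attributes.
MEAN = (0.0, 0.0)
S_DET = 2.0 * 2.0 - 1.0 * 1.0          # |S| = 3
S_INV = ((2.0 / 3.0, -1.0 / 3.0),      # S^-1 from the 2x2 inverse formula
         (-1.0 / 3.0, 2.0 / 3.0))


def mahalanobis(x):
    """(x - mean) S^-1 (x - mean)^T, as defined on the slide (no square root)."""
    dx, dy = x[0] - MEAN[0], x[1] - MEAN[1]
    return (dx * (S_INV[0][0] * dx + S_INV[0][1] * dy)
            + dy * (S_INV[1][0] * dx + S_INV[1][1] * dy))


def log_density(x):
    """Natural log of the bivariate normal density at x."""
    m = 2
    constant = -0.5 * m * math.log(2.0 * math.pi) - 0.5 * math.log(S_DET)
    return constant - 0.5 * mahalanobis(x)


near, far = (1.0, 1.0), (3.0, -3.0)
for point in (near, far):
    print(point, "distance =", round(mahalanobis(point), 3),
          "log density =", round(log_density(point), 3))
```

Because the constant term is the same for every point, ranking points by decreasing Mahalanobis distance gives the same order as ranking them by increasing density.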
Covariance Matrix & Inverse
If there are two attributes (X, Y), then the covariance matrix is defined as:

$$S = \begin{bmatrix} S_{XX} & S_{XY} \\ S_{YX} & S_{YY} \end{bmatrix}$$

The inverse of a 2x2 matrix is calculated as shown below. The inverse of the above covariance matrix ($S^{-1}$) can be calculated similarly:

$$\begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$$

Inverse of a 3x3 matrix.
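
The 2x2 inverse formula translates directly into code. This is a minimal sketch with hypothetical helper names (`inverse_2x2`, `matmul_2x2`) and an arbitrary example matrix; it multiplies the matrix by its computed inverse to confirm that the product is the identity.

```python
def inverse_2x2(m):
    """Invert [[a, b], [c, d]] as 1/(ad - bc) * [[d, -b], [-c, a]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular; no inverse exists")
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul_2x2(p, q):
    """Plain 2x2 matrix product, used here only to check the inverse."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


example = [[4.0, 7.0], [2.0, 6.0]]          # arbitrary example, determinant 10
inverse = inverse_2x2(example)
product = matmul_2x2(example, inverse)      # should be close to the identity

print("inverse:", inverse)
print("example x inverse:", product)
```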
Mahalanobis Distance
Illustration

Given 15 points (A to O), an outlier needs to be found using the Mahalanobis distance.

#   X   Y   (X-X')   (Y-Y')   Mahalanobis Dist
A   2   2   -2.7     -2.6     4.00
B   2   5   -2.7      0.4     2.06
C   6   5    1.3      0.4     0.44
D   7   3    2.3     -1.6     3.32
E   4   7   -0.7      2.4     3.30
F   6   4    1.3     -0.6     0.77
G   5   3    0.3     -1.6     1.41
H   4   6   -0.7      1.4     1.27
I   2   5   -2.7      0.4     2.06
J   1   3   -3.7     -1.6     3.66
K   6   5    1.3      0.4     0.44
L   7   4    2.3     -0.6     1.80
M   8   7    3.3      2.4     4.31
N   5   6    0.3      1.4     0.94
O   5   4    0.3     -0.6     0.24

Mean (X', Y') = (4.7, 4.6)

Inverse Covariance Matrix

 0.25   -0.09
-0.09    0.51

Review the Excel sheet provided through eLearn for the calculations.
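
For readers without the Excel sheet, the table can be approximately reproduced in Python. The sketch below makes two assumptions that are not stated on the slide: the covariance is the population covariance (dividing by n = 15), which is what yields the inverse matrix shown above after rounding, and the distance is used without a square root, as in the formula given earlier. Under these assumptions the computed values agree with the table to within about 0.01.

```python
points = {
    "A": (2, 2), "B": (2, 5), "C": (6, 5), "D": (7, 3), "E": (4, 7),
    "F": (6, 4), "G": (5, 3), "H": (4, 6), "I": (2, 5), "J": (1, 3),
    "K": (6, 5), "L": (7, 4), "M": (8, 7), "N": (5, 6), "O": (5, 4),
}
n = len(points)
mean_x = sum(x for x, _ in points.values()) / n
mean_y = sum(y for _, y in points.values()) / n

# Population covariance entries (dividing by n is an assumption, see lead-in).
s_xx = sum((x - mean_x) ** 2 for x, _ in points.values()) / n
s_yy = sum((y - mean_y) ** 2 for _, y in points.values()) / n
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points.values()) / n

# Inverse of the 2x2 covariance matrix [[s_xx, s_xy], [s_xy, s_yy]].
det = s_xx * s_yy - s_xy ** 2
s_inv = [[s_yy / det, -s_xy / det],
         [-s_xy / det, s_xx / det]]


def mahalanobis(x, y):
    """(p - mean) S^-1 (p - mean)^T for the point p = (x, y), no square root."""
    dx, dy = x - mean_x, y - mean_y
    return (s_inv[0][0] * dx * dx
            + 2.0 * s_inv[0][1] * dx * dy
            + s_inv[1][1] * dy * dy)


scores = {label: mahalanobis(x, y) for label, (x, y) in points.items()}
outlier = max(scores, key=scores.get)

for label, score in scores.items():
    print(f"{label}: {score:.2f}")
print("largest Mahalanobis distance:", outlier)
```

The point with the largest distance, M, is the strongest outlier candidate in this dataset.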



A Question!

In the previous example, the outlier can be identified visually, or perhaps by using the Euclidean distance. Why do we need the Mahalanobis distance?

L norms provide spatial distance only. When we have more dimensions and the attributes have a probabilistic relationship, the Mahalanobis distance is required.

[Figure: 2-D view of the dataset points]
[Figure: A possible higher-dimensional view of the same dataset]
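
A small numeric example may help with this question. The sketch below is hypothetical and not taken from the slides: it assumes two positively correlated attributes with covariance S = [[2, 1], [1, 2]] and mean (0, 0), and compares two points that are equally far from the mean in Euclidean terms. One point lies along the direction of correlation and the other lies against it.

```python
import math

# Hypothetical setup: mean at the origin, S = [[2, 1], [1, 2]].
# Its inverse, from the 2x2 formula, is (1/3) * [[2, -1], [-1, 2]].
S_INV = [[2.0 / 3.0, -1.0 / 3.0],
         [-1.0 / 3.0, 2.0 / 3.0]]


def euclidean(p):
    """L2 distance of p from the mean (0, 0)."""
    return math.sqrt(p[0] ** 2 + p[1] ** 2)


def mahalanobis(p, s_inv):
    """p S^-1 p^T for a point p measured from the mean (0, 0), no square root."""
    x, y = p
    return (s_inv[0][0] * x * x
            + (s_inv[0][1] + s_inv[1][0]) * x * y
            + s_inv[1][1] * y * y)


along = (2.0, 2.0)     # follows the positive correlation
against = (2.0, -2.0)  # contradicts the positive correlation

for name, p in (("along", along), ("against", against)):
    print(f"{name}: Euclidean = {euclidean(p):.3f}, "
          f"Mahalanobis = {mahalanobis(p, S_INV):.3f}")
```

Both points have the same Euclidean distance, but the point that goes against the correlation receives a Mahalanobis distance three times larger, so only the Mahalanobis distance flags it as the more unusual observation.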



Density Based Outlier Detection
• The DBSCAN clustering algorithm identifies outliers with a global view of the dataset.
• In practice, datasets can have a more complex structure (for example, regions of different density), where objects may be outliers only with respect to their local neighbourhood.
• In the data distribution shown, there are two clusters, C1 and C2.
• Object O3 can be declared a distance-based outlier because it is far from the majority of the objects.
• What about objects O1 and O2?
• The distance of O1 and O2 from the objects of cluster C1 is smaller than the average distance of an object from its nearest neighbour in cluster C2.
• O1 and O2 are not distance-based outliers, but they are outliers with respect to cluster C1 because they deviate significantly from the other objects of C1.
• Similarly, the distance between O4 and its nearest neighbour in C2 is higher than the distance between O1 or O2 and their nearest neighbours in C1. Still, O4 may not be an outlier because C2 is sparse. Distance-based detection does not capture local outliers, so a different approach is needed.
Density Based Outliers
For a dataset containing a few objects, let:
• distance(x, y) = the distance between object x and object y using some norm.
• N(x, k) = the set containing the k nearest neighbours of an object x.
• density(x, k) = the density of object x with respect to its k nearest neighbours. It is defined as the reciprocal of the average distance of the k nearest neighbours from x:

$$\text{density}(x, k) = \frac{|N(x, k)|}{\sum_{y \in N(x, k)} \text{distance}(x, y)}$$

• average relative density(x, k), or outlier score = the ratio of the density of an object x to the average density of its k nearest neighbours:

$$\text{average relative density}(x, k) = \frac{\text{density}(x, k)}{\sum_{y \in N(x, k)} \text{density}(y, k) \,/\, |N(x, k)|}$$



Illustration

Objects and coordinates:

Object   X      Y
A        1.00   2.00
B        2.00   1.50
C        1.00   1.50
D        2.00   2.75
E        7.00   2.25
F        7.00   2.50
G        7.00   2.00
H        7.50   2.25
I        6.00   2.50

Pairwise distances (L1 norm):

Object pair   distance(x, y)
A-B           1.50
A-C           0.50
A-D           1.75
B-C           1.00
B-D           1.25
C-D           2.25
E-F           0.25
E-G           0.25
E-H           0.50
E-I           1.25
F-G           0.50
F-H           0.75
F-I           1.00
G-H           0.75
G-I           1.50
H-I           1.75

Densities and outlier scores (k = 3):

Object   k=3 nearest neighbours   density(x, k=3)   average relative density(x, k=3) (outlier score)
A        B, C, D                  0.80              1.11
B        A, C, D                  0.80              1.11
C        A, B, D                  0.80              1.11
D        A, B, C                  0.57              0.71
E        F, G, H                  3.00              1.64
F        E, G, H                  2.00              0.92
G        E, F, H                  2.00              0.92
H        E, F, G                  1.50              0.64
I        E, F, G                  0.80              0.34

• Notice that D and I are the lowest-density objects in their clusters, and that I is in a denser region than D.
• When the average relative density is taken into account, I (score 0.34) is identified as a more likely outlier than D (score 0.71).
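
The densities and outlier scores in this illustration can be recomputed from the coordinates alone. The sketch below applies the two definitions from the previous slide with the L1 norm and k = 3; the function names are hypothetical, and ties in distance are broken alphabetically by object label, which is an assumption (no tie affects the result for this dataset).

```python
coords = {
    "A": (1.00, 2.00), "B": (2.00, 1.50), "C": (1.00, 1.50),
    "D": (2.00, 2.75), "E": (7.00, 2.25), "F": (7.00, 2.50),
    "G": (7.00, 2.00), "H": (7.50, 2.25), "I": (6.00, 2.50),
}
K = 3


def l1_distance(p, q):
    """L1 (Manhattan) distance between two 2-D points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def nearest_neighbours(label, k):
    """Labels of the k objects closest to `label` (ties broken by label)."""
    others = [o for o in coords if o != label]
    others.sort(key=lambda o: (l1_distance(coords[label], coords[o]), o))
    return others[:k]


def density(label, k):
    """Reciprocal of the average distance from `label` to its k neighbours."""
    nbrs = nearest_neighbours(label, k)
    return len(nbrs) / sum(l1_distance(coords[label], coords[o]) for o in nbrs)


def average_relative_density(label, k):
    """Density of `label` divided by the mean density of its k neighbours."""
    nbrs = nearest_neighbours(label, k)
    mean_neighbour_density = sum(density(o, k) for o in nbrs) / len(nbrs)
    return density(label, k) / mean_neighbour_density


scores = {label: average_relative_density(label, K) for label in coords}
most_outlying = min(scores, key=scores.get)  # lowest score = strongest outlier

for label in coords:
    print(f"{label}: density = {density(label, K):.2f}, score = {scores[label]:.2f}")
print("lowest average relative density:", most_outlying)
```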
Review Points

• Datasets with a probabilistic distribution among attributes vs. datasets where attributes represent spatial coordinates.
• Identify the similarities between the univariate and multivariate probability density functions for the normal distribution.
• How do we find outliers in these two different kinds of datasets?
• What are the limitations of DBSCAN in finding local outliers?
• Can you think of a scenario where identifying local outliers could be useful? (e.g. rural and urban areas in the same dataset)


Thank You

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956