
Uploaded by Divya Gn


Anomaly Detection

Revision 2.0

Bangalore Professional Development Center

Work Integrated Learning Programmes

Statistical Approaches

Univariate Normal Distribution

A model (or distribution) is created for the data.

Objects are evaluated with respect to how well they fit the model.

For univariate data, the Gaussian (normal) distribution is used to identify outliers.

The normal distribution N(μ, σ) has two parameters: the mean (μ) and the standard deviation (σ). The plot shows the probability density function for N(0, 1).

An object's distance from the center can be used to test whether it is an outlier. For an object more than 4 standard deviations from the mean, the probability is roughly one in ten thousand (extremely low).
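The tail probability quoted here can be checked numerically. The following is a minimal sketch, assuming a standard normal N(0, 1); the helper name `two_sided_tail` is hypothetical and only illustrates the idea of testing an object by its distance from the center.

```python
import math


def two_sided_tail(c: float) -> float:
    """Return P(|Z| >= c) for Z ~ N(0, 1).

    Uses the identity P(|Z| >= c) = erfc(c / sqrt(2)).
    """
    return math.erfc(c / math.sqrt(2.0))


# Probability of landing at least c standard deviations from the mean.
for c in (1.0, 2.0, 3.0, 4.0):
    print(f"c = {c:.0f}: P(|Z| >= c) = {two_sided_tail(c):.6f}")
```

For c = 4 the result is about 0.00006, which is the same order of magnitude as one in ten thousand.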

Normal (Gaussian) Density Function

$$P(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\; e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$$

where,

P(x) = probability density at x, i.e. how likely x is to occur in the distribution

μ = mean

σ² = variance

Exercise

24, 3, 18, 19, 21, 13}. Find the probability of the value 15 occurring, assuming the vector contains the entire population.

Mean (μ) = 15.71

Variance (σ²) = 42.20

Standard Deviation (σ) = 6.50

A = 1 / ((2π)^(1/2) · σ) = 0.061

x = 15

B = -(x - μ)² / (2σ²) = -0.006

C = e^B = 2.71828^(-0.006) = 0.994 ≈ 1

So the probability = A × C = 0.061 × 1 = 0.061
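The same arithmetic can be reproduced in a few lines of Python. This is a sketch that takes the mean and variance stated in the exercise as given (the full input vector is not repeated here); `normal_pdf` is a hypothetical helper name. Strictly speaking, the value returned is a probability density rather than a probability.

```python
import math


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Normal density, split into the same factors A and C as the exercise."""
    a = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)   # A = 1 / (sqrt(2*pi) * sigma)
    b = -((x - mu) ** 2) / (2.0 * sigma ** 2)      # B = -(x - mu)^2 / (2 * sigma^2)
    return a * math.exp(b)                         # A * C, with C = e^B


mu = 15.71            # mean, as stated in the exercise
variance = 42.20      # population variance, as stated in the exercise
sigma = math.sqrt(variance)

density = normal_pdf(15, mu, sigma)
print(round(sigma, 2), round(density, 3))
```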

2. [Optional] Z-normalize the above vector and use an Excel sheet to draw the probability density plot. You can use Excel's normdist() function to calculate the probability of each data point.


Multivariate Normal Distribution

Probability Density Function (PDF)

How is this density identified?

How can outliers be identified in a multivariate normal distribution in general?

BITS Pilani, WILP

Outliers in Multivariate Normal Distribution

For a univariate dataset, outlier detection uses the probability density function built from μ and σ, assuming the points are normally distributed.

The question is how to adopt a similar approach for the multivariate normal distribution. The answer is to follow the same approach, which is where the covariance comes into the picture.

When the attributes are normally distributed and correlated with each other, the concept of the Mahalanobis distance applies. It uses the covariance in calculating the distance.

It was formalized by P.C. Mahalanobis, the famous Indian statistician who is remembered as the founder of the Indian Statistical Institute, Kolkata, and as a member of the first Planning Commission of India.

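$$\text{Mahalanobis}(X, \bar{X}) = [X - \bar{X}] \cdot S^{-1} \cdot [X - \bar{X}]^{T}$$

where $\bar{X}$ is the mean of X, $S^{-1}$ is the inverse of the covariance matrix S, and $[X - \bar{X}]^{T}$ is the transpose of the matrix $[X - \bar{X}]$.

[Photo: P.C. Mahalanobis, 1893-1972]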
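For two attributes, the matrix product in the Mahalanobis formula can be written out term by term. As a sketch, write the deviation of a point from the mean as $(d_X, d_Y)$ and the entries of the symmetric inverse covariance matrix as $p$, $q$ and $r$ (labels introduced here only for illustration):

$$\begin{bmatrix} d_X & d_Y \end{bmatrix} \begin{bmatrix} p & q \\ q & r \end{bmatrix} \begin{bmatrix} d_X \\ d_Y \end{bmatrix} = p\,d_X^{2} + 2q\,d_X d_Y + r\,d_Y^{2}$$

When the attributes are uncorrelated and have unit variance, $p = r = 1$ and $q = 0$, so the expression reduces to the squared Euclidean distance $d_X^{2} + d_Y^{2}$. The cross term $2q\,d_X d_Y$ is where the covariance between the attributes enters.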

Probability Density Function

Multivariate Normal Distribution

Then the probability density function for a data point x is given by:

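$$P(x) = \frac{1}{(2\pi)^{m/2}\,|S|^{1/2}}\; e^{-\frac{1}{2}\,(x - \bar{X})\, S^{-1}\, (x - \bar{X})^{T}}$$

where m is the number of attributes, S is the covariance matrix and |S| is its determinant.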

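Taking the logarithm of this density shows that it decreases in proportion to the magnitude of the Mahalanobis distance (ln e^(-x) = -x).

The Mahalanobis distance can therefore be used directly to find the outliers instead of calculating the actual probability.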
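The relationship between the density and the Mahalanobis distance can be made explicit by taking the natural logarithm of the multivariate normal density. In the sketch below, $m$ denotes the number of attributes, $S$ the covariance matrix (assumed invertible) and $\bar{X}$ the mean:

$$\ln P(x) = -\tfrac{1}{2}\,(x - \bar{X})\, S^{-1}\, (x - \bar{X})^{T} \;-\; \tfrac{m}{2}\ln(2\pi) \;-\; \tfrac{1}{2}\ln|S|$$

Only the first term depends on $x$, and it is minus one half of the Mahalanobis distance. Ranking points by increasing density therefore gives the same order as ranking them by decreasing Mahalanobis distance.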


Covariance Matrix & Inverse

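If there are two attributes (X, Y), then the covariance matrix is defined as:

$$S = \begin{bmatrix} S_{XX} & S_{XY} \\ S_{YX} & S_{YY} \end{bmatrix}$$

The inverse of a 2x2 matrix is calculated as shown below. The inverse of the above covariance matrix ($S^{-1}$) can be calculated similarly:

$$\begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$$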

Inverse of a 3x3 matrix.
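The 2x2 inverse rule translates directly into code. The sketch below is a plain-Python illustration; the function name `inverse_2x2` and the example matrix are hypothetical and not taken from the slides.

```python
def inverse_2x2(m):
    """Invert a 2x2 matrix [[a, b], [c, d]] using 1/(ad - bc) * [[d, -b], [-c, a]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular; no inverse exists")
    return [[d / det, -b / det],
            [-c / det, a / det]]


# Hypothetical symmetric matrix, shaped like a covariance matrix.
example = [[4.0, 1.0],
           [1.0, 2.0]]
inv = inverse_2x2(example)
print(inv)
```

A quick sanity check is to multiply the matrix by its inverse and confirm that the result is the identity matrix.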


Mahalanobis Distance

Illustration

Mahalanobis Distance.

#    X    Y    (X-X')   (Y-Y')   Mahalanobis Dist
A    2    2    -2.7     -2.6     4.00
B    2    5    -2.7      0.4     2.06
C    6    5     1.3      0.4     0.44
D    7    3     2.3     -1.6     3.32
E    4    7    -0.7      2.4     3.30
F    6    4     1.3     -0.6     0.77
G    5    3     0.3     -1.6     1.41
H    4    6    -0.7      1.4     1.27
I    2    5    -2.7      0.4     2.06
J    1    3    -3.7     -1.6     3.66
K    6    5     1.3      0.4     0.44
L    7    4     2.3     -0.6     1.80
M    8    7     3.3      2.4     4.31
N    5    6     0.3      1.4     0.94
O    5    4     0.3     -0.6     0.24

Mean (X', Y') = (4.7, 4.6)

Inverse covariance matrix (S⁻¹):

 0.25   -0.09
-0.09    0.51

Review the Excel sheet provided through eLearn for the calculations.
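The calculation behind the table can be sketched in plain Python as well. The code below assumes the population covariance (dividing by n rather than n - 1), which is the choice that reproduces the 0.25 / -0.09 / 0.51 matrix, and it reports the quadratic form from the Mahalanobis formula without taking a square root. Individual values may differ from the table in the last decimal place because the table works from rounded intermediates.

```python
points = {
    "A": (2, 2), "B": (2, 5), "C": (6, 5), "D": (7, 3), "E": (4, 7),
    "F": (6, 4), "G": (5, 3), "H": (4, 6), "I": (2, 5), "J": (1, 3),
    "K": (6, 5), "L": (7, 4), "M": (8, 7), "N": (5, 6), "O": (5, 4),
}

n = len(points)
mean_x = sum(x for x, _ in points.values()) / n
mean_y = sum(y for _, y in points.values()) / n

# Population covariance entries (assumption: divide by n, not n - 1).
sxx = sum((x - mean_x) ** 2 for x, _ in points.values()) / n
syy = sum((y - mean_y) ** 2 for _, y in points.values()) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in points.values()) / n

# Inverse of the 2x2 covariance matrix.
det = sxx * syy - sxy ** 2
inv = [[syy / det, -sxy / det],
       [-sxy / det, sxx / det]]


def mahalanobis(x, y):
    """Quadratic form [d] . S^-1 . [d]^T for the deviation d of (x, y) from the mean."""
    dx, dy = x - mean_x, y - mean_y
    return inv[0][0] * dx * dx + 2 * inv[0][1] * dx * dy + inv[1][1] * dy * dy


scores = {name: mahalanobis(x, y) for name, (x, y) in points.items()}

# List objects from most to least outlying.
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(name, round(score, 2))
```

Sorting by the score puts the objects farthest from the bulk of the data, allowing for the correlation between X and Y, at the top of the list.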

A Question!

can be visually identified or may be found using the Euclidean distance.

Why do we need the Mahalanobis distance?

only. When we have more dimensions and the attributes have a probabilistic relationship, the Mahalanobis distance is required.

[Figure: 2-D view of the dataset points]

[Figure: a possible higher-dimensional view of the same dataset]

Density Based Outlier Detection

The DBSCAN clustering algorithm identifies outliers with a global view of the data set.

In practice, datasets can have a more complex structure, where objects may be considered outliers with respect to their local neighbourhood (for example, a data set with regions of different density).

In the data distribution shown, there are two clusters, C1 and C2.

Object O3 can be declared a distance-based outlier because it is far from the majority of the objects.

What about objects O1 and O2?

The distance of O1 and O2 from the objects of cluster C1 is smaller than the average distance of an object from its nearest neighbour in cluster C2.

O1 and O2 are therefore not distance-based outliers. But they are outliers with respect to cluster C1, because they deviate significantly from the other objects of C1.

Similarly, the distance between O4 and its nearest neighbour in C2 is greater than the distance between O1 or O2 and their nearest neighbours in C1, yet O4 may not be an outlier because C2 is sparse. Distance-based detection does not capture local outliers, so a different approach is needed.


Density Based Outliers

For a dataset containing a few objects, let:

distance(x, y) = the distance between object x and object y under some norm.

N(x, k) = the set containing the k nearest neighbours of object x.

density(x, k) = the density of object x with respect to its k nearest neighbours. It is defined as the reciprocal of the average distance from x to its k nearest neighbours:

$$\text{density}(x, k) = \frac{|N(x, k)|}{\sum_{y \in N(x, k)} \text{distance}(x, y)}$$

average relative density(x, k), or outlier score = the ratio of the density of object x to the average density of its k nearest neighbours:

$$\text{average relative density}(x, k) = \frac{\text{density}(x, k)}{\sum_{y \in N(x, k)} \text{density}(y, k) \,/\, |N(x, k)|}$$

Illustration

Objects:

Object   X      Y
A        1.00   2.00
B        2.00   1.50
C        1.00   1.50
D        2.00   2.75
E        7.00   2.25
F        7.00   2.50
G        7.00   2.00
H        7.50   2.25
I        6.00   2.50

Object pairs and distance (x, y), L1 norm:

Pair   Distance
A-B    1.50
A-C    0.50
A-D    1.75
B-C    1.00
B-D    1.25
C-D    2.25
E-F    0.25
E-G    0.25
E-H    0.50
E-I    1.25
F-G    0.50
F-H    0.75
F-I    1.00
G-H    0.75
G-I    1.50
H-I    1.75

Density and average relative density (outlier score), k = 3:

Object   k=3 nearest neighbours   density (x, k=3)   average relative density (x, k=3)
A        B, C, D                  0.80               1.11
B        A, C, D                  0.80               1.11
C        A, B, D                  0.80               1.11
D        A, B, C                  0.57               0.71
E        F, G, H                  3.00               1.64
F        E, G, H                  2.00               0.92
G        E, F, H                  2.00               0.92
H        E, F, G                  1.50               0.64
I        E, F, G                  0.80               0.34

objects in their clusters, and I is in a denser region than D.

When the average relative density is taken into account, I is identified as the stronger outlier candidate.
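The density and average relative density values in the illustration can be recomputed from the object coordinates. The sketch below uses the L1 norm and k = 3 as in the illustration; the function names are hypothetical, and the neighbour search is a simple sort that is adequate only for a handful of objects.

```python
coords = {
    "A": (1.00, 2.00), "B": (2.00, 1.50), "C": (1.00, 1.50),
    "D": (2.00, 2.75), "E": (7.00, 2.25), "F": (7.00, 2.50),
    "G": (7.00, 2.00), "H": (7.50, 2.25), "I": (6.00, 2.50),
}
K = 3


def l1(p, q):
    """L1 (Manhattan) distance between two 2-D points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def neighbours(name, k=K):
    """Names of the k objects closest to `name`, excluding `name` itself."""
    others = [o for o in coords if o != name]
    others.sort(key=lambda o: l1(coords[name], coords[o]))
    return others[:k]


def density(name, k=K):
    """Reciprocal of the average distance to the k nearest neighbours."""
    nbrs = neighbours(name, k)
    return len(nbrs) / sum(l1(coords[name], coords[o]) for o in nbrs)


def average_relative_density(name, k=K):
    """Density of `name` divided by the mean density of its k nearest neighbours."""
    nbrs = neighbours(name, k)
    mean_neighbour_density = sum(density(o, k) for o in nbrs) / len(nbrs)
    return density(name, k) / mean_neighbour_density


for name in coords:
    print(name, sorted(neighbours(name)),
          round(density(name), 2), round(average_relative_density(name), 2))
```

Under this definition a score well below 1 means the object sits in a sparser spot than its neighbours, so the object with the lowest score is the strongest outlier candidate.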


Review Points

vs. datasets where attributes represent spatial coordinates.

Identify the similarities between the univariate and multivariate probability density functions for the normal distribution.

How do we find outliers in these two different kinds of datasets?

What are the limitations of DBSCAN in finding local outliers?

Can you think of a scenario where identifying local outliers could be useful? (e.g. rural and urban areas in the same dataset)


Thank You
