Professional Documents
Culture Documents
Lunch Break
Data Mining/Database Approaches.
1 −[(x−µ)/σ]2 /2
f (x) = 1 e
(2πσ 2 ) 2
f (x) > 0
and
Z ∞
f (x)dx = 1
−∞
−2 np
P (X ≥ (1 + )np) ≤ e 3
1
P (dl+1 > max di ) ≤ (by symmetry and iid)
1≤i≤l l+1
Now,
dl+1 = |xl+1 − µ| ≥ |xl+1 − s| − |s − µ|
and for all i = 1..n
di = |xi − µ| ≤ |xi − s| + |s − µ|
(x − µ)0 Σ−1 (x − µ)
σ11 σ12 . . . σ1d
σ21 σ22 . . . σ2d
Σ= .. .. . . . ..
. . .
σd1 σd2 . . . σdd
2 0 1 −3 " #
dotp/2 1.0 −3.5
1 2 ; m = [1, 3] ; 0 −1 →
−3.5 13
0 7 −1 4
10
4 61
2 29
0
14
−2
16
−4
−6
−8
−10
−10 −5 0 5 10
(x − µ)0 Σ−1 (x − µ)
(x − µ)0 Σ−1 (x − µ)
0.6 0.15
0.4 0.1
0.2 0.05
0 0
0 5 10 0 5 10
0.06
0.04
0.04
0.02
0.02
0 0
0 20 40 60 0 50 100
det(S2 ) ≤ det(S1 )
34
30
16
20
14
12
Robust distance
10
0
0 0.5 1 1.5 2 2.5 3 3.5
Mahalanobis distance
2
Dimension 2
−2
−4
−6
−8
−10
−10 −5 0 5 10
Dimension 1
*
* * *
* *
Cluster C1 and C2 have
*
* *
*
different densities
* * *
C1 C2 *
o3 *
*
* *** *
* ** * ***
** * ** * **
*
* *
*
For O3 to be a distance-
* * *
******** * * *
* *
* ** *** ** * * * *
* * * * * *
*
* *
based outliers, all points
* *** * * ** *
*** * * * *
*
* of C2 will have to be out-
*
*
liers
o2
* * *
density(x, k)
reldensity(x, k) = P
y∈N (x,k) density(y, k)/|N (x, k)|
The estimate of ai is
Now,
n
X 1
E[Xij ] = E(Xijk ak ) ≤ kak1
w
k=1 Pakdd 2006, Singapore – p. 48/5
Analysis of the Count-Min Sketch
∀j, P (âi > ai + kak1 ) = ∀j, P (COU N T (j, hj (i)) > ai + kak1 )
= ∀j, P (ai + Xij > ai + kak1 )
E[Xij ]
≤ kak1 ∀j
1 d
≤ ( w ) ≡δ
X O
O X
This data is not linearly separable (i.e., there does not exist
a straight line separating the Crosses(x) and the Circles(o).
Pakdd 2006, Singapore – p. 51/5
XOR Example
(
1 if(x1 = 0 ∧ x2 = 0) ∨ (x1 = 1 ∧ x2 = 1)
Let y =
−1 otherwise
This is the XOR function.
Define a function φ on the space X = (x1 , x2 ) as
φ(x1 , x2 ) = (x1 , x2 , x1 x2 )
Can check that image of X under φ is linearly separable
l l
2 1 X 2 X
kφ(x) − φs k = k(x, x) + 2 k(xi , xj ) − k(x, xi )
l l
i,j=1 i=1
Summary
MINDS System Association
of attacks
Outlier Pattern Analysis
Network scores
Human
Outlier Detection ...
Analyst
... Detected novel attacks
Data Capturing
Device Scan Detector
Labels
57-1
[6] G. Kollios, D. Gunopulos, N. Koudas, and S. Berchtold.
Efficient biased sampling for approximate clustering and
outlier detection in large datasets. IEEE Transactions on
Knowledge and Data Engineering, 2003.
57-2
[12] P. Sun, S. Chawla, and B. Arunasalam. Outlier detection
in sequential databases. In Proceedings of SIAM Interna-
tional Conference on Data Mining, 2006.
57-3