instance
Dimensionality
Curse of dimensionality
Sparsity vs Density
Density counts only presence; sparsity counts only absence.
Data Value Conflict Detection and Resolution
Patterns depend on the scale
Similarity
Distance measure
Ratio
Ratio is a numeric attribute with an inherent zero-point.
That is, if a measurement is ratio-scaled, we can speak of
a value as being a multiple (or ratio) of another value. In
addition, the values are ordered, and we can also compute
the difference between values, as well as the mean,
median, and mode.
E.g., temperature in Kelvin (0 K = −273.15 °C, the point at
which the particles that comprise matter have zero kinetic
energy) and counts (years_of_experience, number_of_words
in a document)
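As a small illustration of why the inherent zero point matters, the sketch below compares the same two temperatures on the Celsius and Kelvin scales. The function name and the two readings are made up for this example.

```python
def celsius_to_kelvin(c):
    """Convert degrees Celsius to kelvins (0 K = -273.15 °C)."""
    return c + 273.15

# Two hypothetical readings in Celsius (no inherent zero point).
t1_c, t2_c = 10.0, 20.0
t1_k, t2_k = celsius_to_kelvin(t1_c), celsius_to_kelvin(t2_c)

# Differences agree on both scales because the unit size is the same.
diff_c = t2_c - t1_c
diff_k = t2_k - t1_k

# Ratios are only meaningful on the ratio scale (Kelvin).
ratio_c = t2_c / t1_c   # 2.0, yet 20 °C is not "twice as hot" as 10 °C
ratio_k = t2_k / t1_k   # roughly 1.035

print(diff_c, round(diff_k, 6), ratio_c, round(ratio_k, 3))
```

The Celsius ratio changes if the zero point is shifted, while the Kelvin ratio does not, which is what "we can speak of a value as being a multiple of another value" requires.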
Attribute type: Nominal
Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
Operations: mode, entropy, contingency correlation, χ² test

Attribute type: Ratio
Description: For ratio variables, both differences and ratios are meaningful. (*, /)
Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
Operations: geometric mean, harmonic mean, percent variation
Record
Data Matrix
Document Data
Transaction Data
Graph
World Wide Web
Molecular Structures
Ordered
Spatial Data
Temporal Data
Sequential Data
Genetic Sequence Data
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
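A minimal sketch of how a document can be turned into a row of term counts like the ones above. The vocabulary, the sample texts, and the whitespace tokenization are assumptions chosen for brevity, not the procedure used to build the slide's table.

```python
from collections import Counter

def term_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in `text`, in vocabulary order."""
    counts = Counter(text.lower().split())  # naive whitespace tokenization
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary and documents, for illustration only.
vocabulary = ["team", "coach", "game", "score"]
docs = [
    "team game team score game game",
    "coach coach game",
]

# Each document becomes one row; each term is one column.
matrix = [term_vector(d, vocabulary) for d in docs]
for row in matrix:
    print(row)
```

Most entries of such a matrix are zero for realistic vocabularies, which is why document data is usually treated as sparse.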
[Figure: sequences of transactions. Labels: items/events; an element of the sequence]
Spatio-Temporal Data
[Figure: average monthly temperature of land and ocean]
Motivation
To better understand the data: central tendency, variation
and spread
Data dispersion characteristics
median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
Data dispersion: analyzed with multiple granularities of
precision
Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
Folding measures into numerical dimensions
Boxplot or quantile analysis on the transformed cube
Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Median: A holistic measure
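Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
Population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$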
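The weighted mean and the two forms of the variance can be checked numerically. The sketch below is one possible implementation; the function names and the sample data are illustrative, and Python's `statistics` module is used only as a cross-check.

```python
import math
import statistics

def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

def sample_variance(values):
    """Definition form: squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_variance_shortcut(values):
    """Computational form: uses only the sum of x and the sum of x squared."""
    n = len(values)
    return (sum(x * x for x in values) - sum(values) ** 2 / n) / (n - 1)

def population_variance(values):
    """Mean of the squares minus the square of the mean (divides by N)."""
    n = len(values)
    mu = sum(values) / n
    return sum(x * x for x in values) / n - mu ** 2

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample

print(weighted_mean([1, 2, 3], [3, 1, 1]))
print(statistics.median(data))
print(sample_variance(data), sample_variance_shortcut(data))
print(population_variance(data))
```

The shortcut forms need only running sums, so they can be computed in a single pass over the data; with floating-point values of very different magnitude, the definition form is the numerically safer choice.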
[Figure: news articles visualized as a landscape]
Dissimilarity between objects: lower when objects are more alike
[Figure: plot of the points p1 to p4 in the x-y plane]
Data Matrix
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1
p1 p2 p3 p4
p1 0 2.828 3.162 5.099
p2 2.828 0 1.414 3.162
p3 3.162 1.414 0 2
p4 5.099 3.162 2 0
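The Euclidean distance matrix for the four points p1 to p4 can be reproduced with a few lines of code. This is a minimal sketch; the helper name `euclidean` and the rounding to three decimals are choices made for this example.

```python
import math

# Coordinates of the four points from the data matrix.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def euclidean(a, b):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

names = list(points)
# Pairwise distances, rounded to three decimals for display.
matrix = [
    [round(euclidean(points[r], points[c]), 3) for c in names]
    for r in names
]

for name, row in zip(names, matrix):
    print(name, row)
```

The matrix is symmetric with zeros on the diagonal, so in practice only the lower (or upper) triangle needs to be stored.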
$d(i, j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q}$
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two
p-dimensional data objects, and q is the order
Properties
d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
d(i, j) = d(j, i) (Symmetry)
d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
A distance that satisfies these properties is a metric
April 7, 2019 Data Mining: Concepts and Techniques 51
Special Cases of Minkowski Distance
L∞ p1 p2 p3 p4
p1 0 2 3 5
p2 2 0 1 3
p3 3 1 0 2
p4 5 3 2 0
Distance Matrix
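The special cases can be computed from one general function by varying the order q. The sketch below uses the points p1 to p4 and treats q = infinity as the maximum coordinate difference (the supremum distance); the function names are illustrative.

```python
import math

def minkowski(a, b, q):
    """Minkowski distance of order q; q = math.inf gives the supremum distance."""
    diffs = [abs(ai - bi) for ai, bi in zip(a, b)]
    if math.isinf(q):
        return max(diffs)
    return sum(d ** q for d in diffs) ** (1 / q)

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}
names = list(points)

def distance_matrix(q):
    """All pairwise Minkowski distances of order q, in the order of `names`."""
    return [[minkowski(points[r], points[c], q) for c in names] for r in names]

def satisfies_triangle_inequality(matrix, tol=1e-12):
    """Check d(i, j) <= d(i, k) + d(k, j) for every triple of objects."""
    n = len(matrix)
    return all(
        matrix[i][j] <= matrix[i][k] + matrix[k][j] + tol
        for i in range(n) for j in range(n) for k in range(n)
    )

l1 = distance_matrix(1)           # Manhattan (city block) distance
l2 = distance_matrix(2)           # Euclidean distance
linf = distance_matrix(math.inf)  # supremum distance

for name, row in zip(names, linf):
    print(name, row)
print(all(satisfies_triangle_inequality(m) for m in (l1, l2, linf)))
```

For q = infinity the rows match the distance matrix above, and all three orders pass the triangle inequality check on these points.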
Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
gender is a symmetric attribute
the remaining attributes are asymmetric binary
let the values Y and P be set to 1, and the value N be set to 0
$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$
$d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$
$d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$
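The three dissimilarities can be recomputed from the table. In the sketch below, only the asymmetric attributes (Fever, Cough, Test-1 to Test-4) are used, with Y and P encoded as 1 and N as 0; positions where both objects are 0 are ignored. The function and variable names are illustrative.

```python
def asymmetric_binary_dissimilarity(a, b):
    """Mismatches divided by all positions where at least one object is 1.

    Positions where both objects are 0 are ignored, which is what makes
    the measure asymmetric. Undefined (division by zero) if neither
    object has any 1.
    """
    both_one = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    only_a = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    only_b = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    return (only_a + only_b) / (both_one + only_a + only_b)

# Columns: Fever, Cough, Test-1, Test-2, Test-3, Test-4 (gender left out).
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim = [1, 1, 0, 0, 0, 0]

print(round(asymmetric_binary_dissimilarity(jack, mary), 2))
print(round(asymmetric_binary_dissimilarity(jack, jim), 2))
print(round(asymmetric_binary_dissimilarity(jim, mary), 2))
```

Jim and Mary come out as the least similar pair, Jack and Mary as the most similar.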
Nominal Variables
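$r_{p,q} = \frac{\sum (p - \bar{p})(q - \bar{q})}{(n-1)\,\sigma_p \sigma_q} = \frac{\sum (pq) - n\,\bar{p}\,\bar{q}}{(n-1)\,\sigma_p \sigma_q}$
$\mathrm{correlation}(p, q) = p' \cdot q'$, where $p'$ and $q'$ are the standardized objects, $p'_k = (p_k - \bar{p}) / (\sqrt{n-1}\,\sigma_p)$ and likewise for $q'$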
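Both forms of the correlation coefficient can be compared numerically. The sketch below assumes the standard deviations are sample standard deviations (divisor n − 1); the function names and the sample vectors are made up for illustration.

```python
import math

def sample_std(values):
    """Sample standard deviation (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

def pearson(p, q):
    """Definition form: summed products of deviations from the means."""
    n = len(p)
    mean_p, mean_q = sum(p) / n, sum(q) / n
    cross = sum((x - mean_p) * (y - mean_q) for x, y in zip(p, q))
    return cross / ((n - 1) * sample_std(p) * sample_std(q))

def pearson_shortcut(p, q):
    """Computational form: sum of products minus n times the product of means."""
    n = len(p)
    mean_p, mean_q = sum(p) / n, sum(q) / n
    cross = sum(x * y for x, y in zip(p, q)) - n * mean_p * mean_q
    return cross / ((n - 1) * sample_std(p) * sample_std(q))

# Hypothetical vectors: one perfectly increasing with p, one decreasing.
p = [1, 2, 3, 4, 5]
q_up = [2, 4, 6, 8, 10]
q_down = [10, 8, 6, 4, 2]

print(pearson(p, q_up), pearson_shortcut(p, q_up))
print(pearson(p, q_down), pearson_shortcut(p, q_down))
```

A perfectly linear increasing relationship gives a value at (or within rounding error of) 1, and a perfectly linear decreasing one gives −1.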
[Figure: scatter plots showing the similarity from −1 to 1]