A Framework For Clustering Evolving Data Streams

A Framework for Clustering
Evolving Data Streams

Charu C. Aggarwal, Jiawei Han, Jianyong
Wang, Philip S. Yu
Presented by: Di Yang

Charudatta Wad
Outline
Background of Clustering
Motivation for Clustering over Streaming
Data.
Overall Solution
Micro Clusters
Pyramid Time Frame
Macro Cluster
Cluster Maintenance
Definition of Clustering
For a given set of data points, partitioning them
into one or more groups of similar objects.
Similarity is often defined with the use of some
distance measure.
Difference between group by queries
and clustering.
Some of the most popular clustering
algorithms:
K- Means, BIRCH, CURE, Density Based
Clustering.
Clustering has many applications in data
bases, information visualization, data
mining.
What are Oultiers?
Motivation
Challenge in Streaming Environment:
Clustering is an expensive process.
Resource constraints.
Infinite streams.
Can simply extending one pass algorithms

for static databases to stream processing
suffice?
Motivation
Requirements of clustering for stream
processing:
Statistical summary information storage.
Efficient update process.
Ability to cluster for a specific time horizon,

Overall Solution of the Paper
Divide the clustering process to two
phases
Online Component:
periodically stores detailed summary statistics
Offline Component
uses only the summary statistics to do clustering
Micro-Clusters
What is a Micro-Cluster
A Micro-Cluster is a set of individual data points that are
close to each other and will be treated as a single unit in
further offline Macro-clustering.
View of Micro-Cluster View of Macro-Cluster

Micro-Clusters
What to Store in a Micro-Cluster
Key idea: Additivity Property

Pyramidal Time Frame
The micro-clusters are stored at snapshots.

Snapshot
When should we make the snapshot?
The snapshots follow a pyramidal pattern

Pyramidal Time Frame
Snapshots are classified into different orders which can vary from 1 to
log (T). For example, T is 55, =2, then we have orders 0 with interval
2^0=1, order 1 with interval 2^1=2, order 2 with interval 2^2=4, order 3
with interval 2^3=8, order 4 with interval 2^4=16, order 5 with interval
2^5=32.
For a data stream the maximum number of snapshots maintained at T
time units since the beginning of the stream mining process is
( + 1) log (T). ( + 1 for each order)
Why Pyramidal Pattern?
For any user-specified time window of h, at least one
stored snapshot can be found within 2 h units of the
current time.
Please Note: Only Approximate Answers!!!

Micro Cluster Creation
Itis assumed that a total of q micro-
clusters are maintained at any moment by
the algorithm.
This is done using an offline process (k-
means) at the very beginning of the data
stream computation process.
Online Micro Cluster Maintenance
How to deal with a new coming point?
1. Join one of the old cluster
2. Create a new cluster by its own
How to deal with the old clusters

1. Delete them (based on relevance stamp)
2. Merge them (merge the closest two)
A merged cluster will have all the IDs its components have
Macro-Cluster Creation
Based on the Additivity Property of cluster
feature vector
Current Time T, the window size is h. That means the user want to
find the clusters formed in (T-h, T).
Approach:
1. 1st step: Find the snapshot for T, get the micro-cluster set S(T).
2. 2nd step: Find the snapshot for T-h, get the micro-cluster set S(T-h).
3. Use S(T)-S(T-h)
Specifically, we have a merged cluster with Id list (C1, C2, C3) in S(T)
and a cluster with Id C1 in S(T-h). Then the we use
CFT(C1,C2,C3)-CFT(C1)=CFT(C2,C3), because C1 are formed before
T-h, thus should not contribute to the micro-cluster formed in (T-h,T)
Example
C_ID: [C1, C2, C3] C_ID: C_ID: [C2, C3]

[C1]
Time: T Time: T-h Result: T-h

Run K-means on Micro-Clusters
How do you feel about this paper?
My feeling:
Quite Fuzzy Results:
Approximation is every where.
Nothing New:
Micro-Clusters, K-means, Cluster Feature
Vectors, Pyramidal Time Frame are all old
stuffs.
Counter Example
C_ID: [C1, C2, C3] C_ID: [C2] C_ID: [C1, C3]
Result
Time: T Time: T-h
Advertisement
Di and Charus project deals with:
1. Deterministic Clusters
2. Clusters with Arbitrary Shapes
3. Real Expirations
4. Disk Version
5. Outlier Detection by Free

A Framework For Clustering Evolving Data Streams

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

A Framework For Clustering Evolving Data Streams

Uploaded by

Copyright:

Available Formats

A Framework for Clustering

Evolving Data Streams

Presented by: Di Yang

Can simply extending one pass algorithms

Ability to cluster for a specific time horizon,

View of Micro-Cluster View of Macro-Cluster

Key idea: Additivity Property

When should we make the snapshot?

The snapshots follow a pyramidal pattern

Please Note: Only Approximate Answers!!!

How to deal with the old clusters

C_ID: [C1, C2, C3] C_ID: C_ID: [C2, C3]

Time: T Time: T-h Result: T-h

C_ID: [C1, C2, C3] C_ID: [C2] C_ID: [C1, C3]

You might also like