
CS512 (Spring 2012) “Advanced Data Mining”: Midterm Exam I

(Thursday, March 1, 2012, 90 minutes, 100 marks; brief answers written directly on the exam paper)
Note: Closed book and notes, but one reference sheet is allowed; a basic calculator is permitted but other
electronic devices are not; scratch paper need not be returned. The last question is an opinion
collection: any answer will receive three bonus points.

Name: NetID: Score:

Note: Here we publish sample answers for the Midterm I questions, worked out by Mr. Xiaolong Wang.
Xiaolong not only received full points for this midterm exam but also provided very thorough answers to
all the questions. This is a shining example of the high-quality work in this class and sets a high
standard for our study of the course materials. We hope you enjoy reading it and obtain excellent
results on Midterm II.

1. [40] Advanced Clustering

(a) [10] What are the major differences in concepts between fuzzy clustering and probabilistic
model-based clustering? What are their major differences in computational methods?
Answer:
Fuzzy clustering is based on fuzzy mathematics, in which a fuzzy set maps each object in X to a
real value in [0, 1]; this mapping, however, is not required to be probabilistically
meaningful. Computationally, fuzzy clustering minimizes the sum of squared error (SSE)
to obtain a heuristically optimal assignment.
In contrast, the probabilistic model-based approach follows the maximum-likelihood
principle, which favors the model that best explains the data, and its results often take
the form of a probability distribution over the clusters for each data point.
Both methods can be used for soft clustering, but their objective functions, as described
above, are different, and so are their optimization methods. The two objective functions
are contrasted below.
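
As a concrete contrast (a standard textbook formulation, not part of the original answer): fuzzy c-means minimizes a membership-weighted SSE, while a Gaussian mixture model maximizes the data log-likelihood, typically via EM:

$$SSE = \sum_{i=1}^{n} \sum_{j=1}^{k} w_{ij}^{\,m}\, \lVert x_i - c_j \rVert^2, \qquad \log L(\Theta) = \sum_{i=1}^{n} \log \sum_{j=1}^{k} \pi_j\, p(x_i \mid \theta_j)$$

where $w_{ij} \in [0, 1]$ is the fuzzy membership of point $x_i$ in cluster $j$ (with $\sum_j w_{ij} = 1$), $m > 1$ is the fuzzifier, and $\pi_j$, $\theta_j$ are the weight and parameters of mixture component $j$.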

(b) [10] Taking micro-array data clustering as an example, explain (i) why δ-pClustering
may generate better quality clusters than δ-bi-Clustering, and (ii) why δ-pClustering is
more scalable in high-dimensional space.
Answer:
• δ-pClustering seeks to find the largest bi-cluster in which the p-score of every 2 × 2
submatrix is less than a predefined threshold δ. For a 2 × 2 submatrix
$\begin{pmatrix} a & b \\ c & d \end{pmatrix}$, the p-score is defined as $|(a - c) - (b - d)|$,
which has the monotonicity property.
δ-bi-clustering instead needs to compute the mean squared residue
$H(I, J) = \frac{1}{|I||J|} \sum_{i \in I,\, j \in J} (e_{ij} - e_{iJ} - e_{Ij} + e_{IJ})^2$,
which is a global measure and cannot preserve local bi-cluster properties.
• The p-score can be exploited via its downward-closure property in the same manner as
frequent-pattern growth, and thus the algorithm is scalable (a small sketch of the p-score
test follows this answer). δ-bi-clustering, however, requires explicit deletion/addition
phases, both of which are computationally costly, since the averaging must be performed
per column/row in each addition/deletion phase.
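
A minimal Python sketch of the p-score test under the definition above (illustrative only; the function names, the toy matrix, and the δ value are mine, not from any course code):

```python
from itertools import combinations

# p-score of a 2x2 submatrix [[a, b], [c, d]]: |(a - c) - (b - d)|
def p_score(a, b, c, d):
    return abs((a - c) - (b - d))

def is_delta_pcluster(matrix, rows, cols, delta):
    """Check that every 2x2 submatrix induced by `rows` x `cols`
    has p-score <= delta. Monotonicity: if this fails, every
    superset of (rows, cols) fails too, which enables pruning."""
    for i1, i2 in combinations(rows, 2):
        for j1, j2 in combinations(cols, 2):
            if p_score(matrix[i1][j1], matrix[i1][j2],
                       matrix[i2][j1], matrix[i2][j2]) > delta:
                return False
    return True

# Example: two rows that differ by a near-constant shift pass the test
M = [[1.0, 2.0, 3.0],
     [2.1, 3.0, 4.2]]
print(is_delta_pcluster(M, [0, 1], [0, 1, 2], delta=0.5))  # True
```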

(c) [10] Explain why Sparsest Cut and SCAN can both generate quality clusters in graph
data, and why SCAN is more scalable than Sparsest Cut.
Answer:
• Sparsest Cut finds in the graph the cut with the smallest sparsity
$\Phi(S, T) = \frac{\text{cut size}}{\min(|S|, |T|)}$. The optimal cut is discovered by
computing the Φ value for every possible cut, and thus it is rather time-consuming.
For a large graph, this can be an unaffordable overhead.
• SCAN evaluates the structural similarity
$\sigma(u, w) = \frac{|\Gamma(u) \cap \Gamma(w)|}{\sqrt{|\Gamma(u)|\,|\Gamma(w)|}}$,
where Γ(v) is the closed neighborhood of v. The algorithm finds the clusters one by one
using direct structure reachability,
$DirReach_{\epsilon,\mu}(v, w) = CORE_{\epsilon,\mu}(v) \wedge w \in N_\epsilon(v)$:
it continuously checks whether the surrounding neighbors meet the thresholds, and if so,
they become part of the cluster.
• Both of them can find quality clusters because both use a sound structural measure, and
SCAN has a running time of $O(|E|)$ ($O(|V|)$ when the graph is sparse), making it
scalable (a short similarity sketch follows).
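
A small sketch of SCAN's similarity and core test (a simplified illustration; the function names and the ε, µ values are my own choices):

```python
import math

def gamma(adj, v):
    """Closed neighborhood of v: its neighbors plus v itself."""
    return adj[v] | {v}

def sigma(adj, u, w):
    """Structural similarity between u and w."""
    gu, gw = gamma(adj, u), gamma(adj, w)
    return len(gu & gw) / math.sqrt(len(gu) * len(gw))

def is_core(adj, v, eps=0.7, mu=2):
    """v is a core if at least mu of its neighbors are eps-similar to it."""
    return sum(sigma(adj, v, w) >= eps for w in adj[v]) >= mu

# Tiny graph: a triangle 0-1-2 plus a pendant node 3
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(sigma(adj, 0, 1))  # 1.0: identical closed neighborhoods
print(is_core(adj, 0))   # True for eps=0.7, mu=2
```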

(d) [10] Why is it often necessary to do constraint-based clustering? Why is user-guided
clustering in CrossClus often more desirable than using must-link and cannot-link?

Answer:

• Constraints are a convenient way to inject human knowledge into clustering. For example,
when clustering geographical data we need to take obstacles such as rivers and hills into
consideration, which makes simple distance-based algorithms ineffective. To address this
problem, constraint-based algorithms impose the constraints while clustering.
• Constraint-based clustering with must-link/cannot-link requires annotated data, which can
be expensive in real problems. Alternatively, by giving a user-guided hint (the pertinent
attributes), we can effectively identify the relevant features and hence implement
constraint-based clustering, just as CrossClus does. Besides, the cluster similarity can
often be computed from feature similarity, which makes user-guided clustering appealing:
• $V^f \cdot V^g = \sum_i \sum_j sim_f(t_i, t_j)\, sim_g(t_i, t_j) = \sum_k \sum_q sim(f_k, g_q)^2$ (a toy computation is sketched below)
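
A toy numpy sketch of this inner product (the similarity matrices are fabricated for illustration and are not CrossClus code):

```python
import numpy as np

# Toy tuple-similarity matrices induced by two features f and g:
# sim_f[i, j] = 1 if tuples i and j share the same value of f, else 0.
sim_f = np.array([[1, 1, 0],
                  [1, 1, 0],
                  [0, 0, 1]], dtype=float)
sim_g = np.array([[1, 0, 0],
                  [0, 1, 1],
                  [0, 1, 1]], dtype=float)

# V^f . V^g = sum_{i,j} sim_f(t_i, t_j) * sim_g(t_i, t_j):
# large when f and g partition the tuples similarly.
print(float(np.sum(sim_f * sim_g)))
```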

2. [30] Outlier Analysis

(a) [10] There are three categories of outlier detection methods: supervised, unsupervised,
and semi-supervised. Give one named method in each category. In what situation is
semi-supervised outlier detection most desirable?
Answer:
• Supervised: SVM, MaxEnt (logistic regression)
• Unsupervised: statistical mixture modeling (mixture model)
• Semi-supervised: self-training (using the small amount of labeled data as a constraint
and performing clustering)
When only a small portion of the data is labeled and we care about collective outlier
detection, we can perform clustering and find the clusters containing the labeled outliers;
thus semi-supervised outlier detection is most effective for finding collective outliers
(a toy sketch follows).
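
One hedged sketch of the self-training idea (my own illustration, not a method prescribed by the course): cluster all points, then flag every cluster that contains a labeled outlier.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 40 normal points plus a small collective anomaly of 5 points
normal = rng.normal(loc=0.0, scale=1.0, size=(40, 2))
anomaly = rng.normal(loc=6.0, scale=0.3, size=(5, 2))
X = np.vstack([normal, anomaly])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Suppose a single point of the anomaly (index 40) is labeled an outlier:
labeled_outliers = [40]
outlier_clusters = {labels[i] for i in labeled_outliers}

# Every point sharing a cluster with a labeled outlier is flagged.
flagged = [i for i, c in enumerate(labels) if c in outlier_clusters]
print(flagged)  # should recover the whole collective anomaly (40..44)
```

Because the anomaly is collective, a single labeled point is enough to recover the whole group.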

(b) [10] What are the differences between distance-based and density-based outlier
detection? Give one case where density-based outlier detection is effective but
distance-based methods are not.
Answer:
• Distance-based outlier detection, such as the grid-based (cell) method, works by
computing distances among instances. In contrast, methods like LOF (or CBLOF) compute
each node's local density and check whether its value is abnormal relative to its
surrounding neighbors. When we want to identify local/contextual outliers, density-based
algorithms are the better choice, since they can detect local changes in density that
distance-based approaches cannot.
• For example, a group of nodes may have the same k-NN distance, yet their LOF values
may vary a lot depending on whether they are outliers or not (see the LOF illustration below).
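
A quick illustration using scikit-learn's LOF implementation (the data set is fabricated; the score comparison is informal):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
# A dense cluster, a sparse cluster, and one point just outside the dense one
dense = rng.normal(0.0, 0.1, size=(30, 2))
sparse = rng.normal(8.0, 2.0, size=(30, 2))
suspect = np.array([[0.8, 0.8]])  # small absolute distance to the dense cluster
X = np.vstack([dense, sparse, suspect])

lof = LocalOutlierFactor(n_neighbors=10)
lof.fit(X)
scores = -lof.negative_outlier_factor_  # larger => more outlying

# The suspect is close to the dense cluster in raw distance, but its local
# density is far below its neighbors', so LOF flags it while points in the
# sparse cluster (with similar k-NN distances) score as normal.
print(scores[-1], scores[:30].mean(), scores[30:60].mean())
```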

(c) [10] In a high-dimensional space, it is often desirable to find contextual outliers. Suppose
Walmart issues its customers loyalty cards and stores its customers' shopping
records, together with product information. Outline a method that may find extremely
valuable customers in different contexts.

Answer:

• For high-dimensional outlier detection, ABOF (the angle-based outlier factor) is one good
choice. For a point O, it computes the variance, over pairs of points X, Y in the data set D, of
$ABOF(O) = VAR_{X,Y \in D}\; \frac{\langle \overrightarrow{OX}, \overrightarrow{OY} \rangle}{\lVert \overrightarrow{OX} \rVert^2\, \lVert \overrightarrow{OY} \rVert^2}$
(a naive computation is sketched below).

Another possible solution is to evaluate clustering-based LOF (CBLOF). First we cluster
customers based on their attributes (age, salary, etc.); fixed-width clustering can be used
to improve computational efficiency. The CBLOF score is then computed to find the small
portion of extremely valuable customers (the outliers). After the algorithm returns its
result, we still need to check whether they really are valuable customers by other
measurements, automatically or manually.
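
A naive O(n³) ABOF sketch matching the formula above (for illustration only; the variable names and toy data are mine):

```python
import numpy as np
from itertools import combinations

def abof(o, others):
    """Angle-based outlier factor of point o against `others`:
    variance of <ox, oy> / (|ox|^2 * |oy|^2) over all pairs (x, y).
    A low variance means o sees the rest of the data within a narrow
    angular range, i.e., o likely sits on the data's periphery."""
    vals = []
    for x, y in combinations(others, 2):
        ox, oy = x - o, y - o
        vals.append(np.dot(ox, oy) /
                    (np.dot(ox, ox) * np.dot(oy, oy)))
    return np.var(vals)

rng = np.random.default_rng(2)
cloud = rng.normal(0, 1, size=(20, 2))
outlier = np.array([10.0, 10.0])
print(abof(outlier, cloud))                             # small variance
print(abof(cloud[0], np.vstack([cloud[1:], outlier])))  # larger variance
```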

3. [30] Introduction to Networks

(a) [10] What are the differences between eigenvector centrality and PageRank centrality?
Answer:
• Eigenvector centrality: $x = \kappa^{-1} A x$
– no free-to-go setting ($x_i = 0$ if the graph is reducible and node $i$ has only outlinks)
– no degree division (degree normalization); the leading eigenvalue $\lambda$ is not
necessarily 1
• PageRank centrality: $x = D(D - \alpha A)^{-1} \mathbf{1}$
– has a free-to-go setting (the restart vector $\beta \mathbf{1}$)
– has degree division, which ensures the principal eigenvalue is 1 (a power-iteration
sketch follows)
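
A compact power-iteration sketch of PageRank with damping α (the 4-node graph and parameter values are invented for illustration):

```python
import numpy as np

def pagerank(A, alpha=0.85, iters=100):
    """Power iteration on x = alpha * A D^{-1} x + (1 - alpha)/n * 1,
    where A[i, j] = 1 if j links to i and D is the out-degree matrix."""
    n = A.shape[0]
    out_deg = np.maximum(A.sum(axis=0), 1)  # avoid division by zero
    M = A / out_deg                          # divide each column by its out-degree
    x = np.full(n, 1.0 / n)
    for _ in range(iters):
        x = alpha * M @ x + (1 - alpha) / n  # restart ("free-to-go") term
    return x / x.sum()

# 4-node example: an edge j -> i is encoded as A[i, j] = 1
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(pagerank(A))
```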

(b) [10] What are the differences between co-citation and bibliographic coupling? Why do
people claim that HITS explores both concepts?
Answer:
• Suppose we have the adjacency-matrix representation A:
– $AA^T$ is the co-citation coupling matrix
– $A^T A$ is the bibliographic coupling matrix
• In HITS, $x = \alpha A y$ and $y = \beta A^T x$, which together imply
$AA^T x = \lambda x$ and $A^T A y = \lambda y$, where $\lambda = \frac{1}{\alpha\beta}$.
By running HITS, both x and y are computed, so HITS explores both co-citation and
bibliographic coupling (see the iteration sketch below).
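
A short numpy sketch mirroring these equations (the 3-node graph is invented; the per-step normalization plays the role of α and β):

```python
import numpy as np

def hits(A, iters=50):
    """Iterate x = A y, y = A^T x with normalization. x converges to the
    leading eigenvector of A A^T and y to that of A^T A, matching the
    derivation above."""
    n = A.shape[0]
    x, y = np.ones(n), np.ones(n)
    for _ in range(iters):
        x = A @ y
        y = A.T @ x
        x /= np.linalg.norm(x)
        y /= np.linalg.norm(y)
    return x, y

A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
x, y = hits(A)
print(x, y)
```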

(c) [10] What are the major differences between the scale-free network model and the Erdős–
Rényi model? Explain why the WWW is scale-free and has a small diameter.
Answer:
• The random-graph (Erdős–Rényi) model is not likely to generate high clustering or a
heavy-tailed degree distribution.
• The scale-free network model is (more) likely to produce a heavy-tailed degree
distribution, but not very likely to produce high clustering.
• Both of them generate networks with few components and a small diameter.
• The scale-free model uses preferential attachment, which favors linking new edges to
nodes with larger degree (“the rich get richer”), and consequently generates networks
obeying a power law. Since it requires neither a fixed number of nodes nor uniform
attachment, the generated networks are scale-free.
• Because of preferential attachment, the WWW likewise has a heavy-tailed (power-law)
degree distribution, and it has a small diameter owing to the existence of hubs connecting
a large number of nodes (an empirical check follows).
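
A quick empirical check of the heavy tail using networkx's preferential-attachment generator (parameter values are arbitrary):

```python
import networkx as nx

# Barabasi-Albert: each new node attaches m edges preferentially
# to existing high-degree nodes ("the rich get richer").
G = nx.barabasi_albert_graph(n=10_000, m=3, seed=42)

degrees = sorted((d for _, d in G.degree()), reverse=True)

# A heavy tail: a few hubs with very large degree, many low-degree nodes.
print("max degree:", degrees[0])
print("median degree:", degrees[len(degrees) // 2])
print("nodes with degree >= 100:", sum(d >= 100 for d in degrees))
```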

4. [3] (Opinion). [3 bonus points]

(a) I ☐ like ☐ dislike the exams in this style.

(b) In general, the exam questions are ☐ too hard ☐ too easy ☐ just right.
(c) I ☐ have plenty of time ☐ have just enough time ☐ do not have enough time to
finish the exam questions.
