[Figure: (a) Polar Transformation, (b) Subspace Estimation, (c) Subspace Pruning, (d) Cluster Generation (axes: time, semantics).]
threshold used to decide the rank of the polynomial embedding matrix. The former can be decided easily. For example, since noise and outliers are usually far away from true data points, any small value of K can be used. The selection of K can be fixed for the computation of weights for all data points. The decision of the parameter ε depends on the data. As shown in our experiments (Section 5.2), by demoting the impact of outliers and noise, the difference between the r-th and (r+1)-th singular values of the polynomial embedding matrix is enlarged and observable, which makes the decision of ε straightforward.
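To make this concrete, the following sketch (Python/NumPy) builds a degree-2 Veronese (polynomial) embedding for synthetic 2-D points drawn from two lines and compares the ratio σ_{r+1}/(σ_1+···+σ_r) with and without down-weighting isolated points. The k-NN-style row weights, the synthetic data, and the embedding degree are our own illustrative assumptions rather than the exact KNN-GPCA construction; the point is only that demoting outliers shrinks the trailing singular value relative to the leading ones, enlarging the observable gap.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D points from two lines through the origin (two 1-D subspaces),
# with small noise and a few gross outliers.
t = rng.uniform(-1.0, 1.0, 100)
lines = np.vstack([np.c_[t[:50], 0.5 * t[:50]],
                   np.c_[t[50:], -2.0 * t[50:]]]) + 0.01 * rng.normal(size=(100, 2))
outliers = rng.uniform(-3.0, 3.0, (10, 2))
X = np.vstack([lines, outliers])

def veronese_deg2(X):
    """Degree-2 polynomial (Veronese) embedding of 2-D points: [x^2, xy, y^2]."""
    x, y = X[:, 0], X[:, 1]
    return np.c_[x * x, x * y, y * y]

def gap_ratio(M, r):
    """sigma_{r+1} / (sigma_1 + ... + sigma_r); small when rank(M) is close to r."""
    s = np.linalg.svd(M, compute_uv=False)
    return s[r] / s[:r].sum()

V = veronese_deg2(X)

# Hypothetical k-NN-style weights: points far from their 5th nearest neighbour
# (i.e. isolated outliers) are demoted.
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
knn_dist = np.sort(D, axis=1)[:, 5]
w = 1.0 / (1.0 + knn_dist)

r = 2  # two lines -> the clean embedding matrix has rank 2
print("gap ratio, unweighted:", gap_ratio(V, r))
print("gap ratio, weighted:  ", gap_ratio(w[:, None] * V, r))  # smaller: outlier energy in the tail direction is demoted
```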
4 Event Detection

After estimating subspaces from the transformed polar space, each subspace contains query sessions of a similar topic. However, not every subspace is interesting in the sense that it contains clusters corresponding to real events. For example, consider the query sessions with queries on popular public portals (e.g., google). Due to their high frequency throughout the whole observed time period, such query sessions form a subspace although they do not represent any real event. For example, the subspace s1 in Figure 4 represents such an uninteresting subspace. In this section, we first discuss pruning uninteresting subspaces. Then, various events are detected from the interesting subspaces.

4.1 Subspace pruning

We distinguish interesting and uninteresting subspaces based on the intuitive premise used by [8]: the appearance of an event is signaled by a "burst of activity", with certain features rising sharply in frequency when the event emerges. In our work, we simultaneously consider two features: the occurring time of query sessions and the semantics of query sessions. In particular, if a subspace is interesting in the sense that it contains query sessions corresponding to events, both the occurring time and the semantics of the query sessions in the subspace should exhibit certain "bursts". Recall that our polar transformation maps the occurring time and the semantics of a query session to the radius and the angle, respectively, of a point in polar space. The temporal "burst" and the semantic "burst" should therefore be reflected by concentrated distributions of data points along the subspace direction and along the orthogonal direction of the subspace, respectively. For example, as shown in Figure 4, the subspace s2 might be an interesting subspace since its data points are concentrated along both the subspace direction and the orthogonal direction of the subspace.

[Figure 4. Interestingness of subspaces.]

Note that we cannot simply use the variance measure to define the interestingness of a subspace. The reason is as follows. A subspace may contain multiple events (e.g., periodical events). If there are multiple events contained in a subspace, then the variance of data points in the orthogonal direction (e.g., the "semantic" dimension) will be small, while the variance of data points in the subspace direction (e.g., the "temporal" dimension) will be large. Hence, in our approach, we employ the entropy measure to define the interestingness of a subspace. In particular, we project data points onto the two directions respectively and calculate the respective histograms of the distributions. Let ⟨h1, h2, ..., hm⟩ and ⟨v1, v2, ..., vn⟩, where hi and vi are individual bins, be the two corresponding histograms. The interestingness of a subspace can be computed as follows:

$$ I(s_i) = 1 - \Big[ -p \sum_{i=1}^{m} h_i \log h_i \; - \; (1-p) \sum_{i=1}^{n} v_i \log v_i \Big] \qquad (12) $$

where p ∈ [0, 1] is a weight which can be determined in experiments to assign different importance to the entropy values in the two directions. For example, if p = 1, then only the temporal "burst" is considered. The interestingness measure takes values from 0 to 1: the more concentrated the distributions in the two directions, the smaller the entropies in the brackets of Equation (12), and the greater the value of the interestingness. Given some threshold ζ, subspace s_i is pruned as an uninteresting subspace if I(s_i) < ζ. We observed in our experimental results (Section 5.2) that I(s_i) of an interesting subspace s_i is usually much greater than I(s_j) of an uninteresting subspace s_j, which makes it easy to decide the value of ζ.

Note that the calculation of the histograms may be biased by noisy data points. In our implementation, we remove noise by an inlier growing method before projecting data points. The details are given in a full version of this paper (the reference is omitted temporarily to comply with the double blind reviewing guidelines).
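As one concrete reading of Equation (12), the sketch below (Python/NumPy) projects polar-space points onto an estimated subspace direction and its orthogonal direction, histograms both projections, and combines the two entropies into an interestingness score compared against ζ. The bin count, the fixed histogram ranges, and the normalization of each entropy by log(#bins) (so the score stays in [0, 1]) are our own assumptions; the paper does not fix these details.

```python
import numpy as np

def normalized_entropy(samples, bins, value_range):
    """Histogram entropy over a fixed range, scaled to [0, 1].
    Fixing the range (e.g. the whole observation period) is our assumption."""
    hist, _ = np.histogram(samples, bins=bins, range=value_range)
    prob = hist / hist.sum()
    prob = prob[prob > 0]
    return float(-(prob * np.log(prob)).sum() / np.log(bins))

def interestingness(points, direction, h_range, v_range, p=0.5, bins=20):
    """Eq. (12)-style score: 1 - [ p*H(temporal proj.) + (1-p)*H(semantic proj.) ]."""
    direction = direction / np.linalg.norm(direction)
    orthogonal = np.array([-direction[1], direction[0]])
    h = points @ direction      # along the subspace   ("temporal" burst)
    v = points @ orthogonal     # across the subspace  ("semantic" burst)
    return 1.0 - (p * normalized_entropy(h, bins, h_range)
                  + (1 - p) * normalized_entropy(v, bins, v_range))

# Toy check: a single burst vs. sessions spread over the whole period.
rng = np.random.default_rng(1)
direction = np.array([1.0, 0.0])
burst = np.c_[rng.normal(5.0, 0.1, 200), rng.normal(0.0, 0.02, 200)]
spread = np.c_[rng.uniform(0.0, 10.0, 200), rng.normal(0.0, 0.02, 200)]
zeta = 0.5  # pruning threshold (illustrative value)
for name, pts in (("burst", burst), ("spread", spread)):
    score = interestingness(pts, direction, h_range=(0, 10), v_range=(-1, 1))
    print(name, round(score, 3), "kept" if score >= zeta else "pruned")
```

The "burst" points are concentrated in both directions and score high, while the "spread" points have a near-uniform temporal histogram and fall below ζ, mirroring the distinction between s2 and s1 in Figure 4.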
Algorithm 1 DECK
Input: A set of query sessions
Output: A set of query session clusters corresponding to real events
Description:
1: Transform query sessions to polar space
2: Estimate subspaces of query sessions using KNN-GPCA
3: for each estimated subspace do
4:     Project inliers onto the subspace direction and the orthogonal direction respectively
5:     Compute the entropy of the distribution histogram in the two directions
6:     if the interestingness of the subspace is less than some threshold then
7:         Prune the subspace
8:     end if
9: end for
10: for each interesting subspace do
11:     Perform mean shift clustering
12: end for
13: Return the clustering results
4.2 Query session clustering

After pruning uninteresting subspaces, events can be detected from the remaining subspaces by clustering data points, based on the cluster consistency property of the polar transformation. Although there is a plethora of published clustering techniques, methods relying upon a priori knowledge of the number of clusters are not applicable here because the number of events in each subspace cannot be decided easily. Hence, in our approach, we employ a non-parametric clustering technique called mean shift clustering [4]. Mean shift clustering is an application of the mean shift procedure, which successively computes the mean shift vector, a vector that always points toward the direction of the maximum increase in density, and converges to a point where the gradient of the density function is zero.

Based on the mean shift procedure, we perform mean shift clustering on the data points in each subspace as follows. First, the mean shift procedure is run with all the data points to find the stationary points of the density estimate. Second, the discovered stationary points are pruned such that only local maxima are retained. The set of all points that converge to the same mode defines one cluster. The returned clusters are expected to represent real events. The complete algorithm of DECK is shown in Algorithm 1.
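The following is a minimal, self-contained mean shift sketch (Python/NumPy) rather than the exact procedure of [4]: each point iteratively follows the mean shift vector under a Gaussian kernel until the shift vanishes, and points whose trajectories converge to (numerically) the same mode are grouped into one cluster. The bandwidth, the Gaussian kernel, and the mode-merging tolerance are illustrative choices; the pruning of non-maximum stationary points is approximated here by merging nearby convergence points.

```python
import numpy as np

def mean_shift_modes(points, bandwidth=0.5, max_iter=200, tol=1e-5):
    """Plain Gaussian-kernel mean shift: every point climbs the kernel density
    estimate until the mean shift vector (the gradient direction) vanishes."""
    shifted = points.astype(float).copy()
    for i in range(len(shifted)):
        x = shifted[i]
        for _ in range(max_iter):
            w = np.exp(-np.sum((points - x) ** 2, axis=1) / (2 * bandwidth ** 2))
            new_x = (w[:, None] * points).sum(axis=0) / w.sum()
            if np.linalg.norm(new_x - x) < tol:   # stationary point reached
                break
            x = new_x
        shifted[i] = x

    # Points whose trajectories converge to the same mode form one cluster.
    labels = -np.ones(len(points), dtype=int)
    modes = []
    for i, x in enumerate(shifted):
        for j, m in enumerate(modes):
            if np.linalg.norm(x - m) < bandwidth / 2:
                labels[i] = j
                break
        else:
            modes.append(x)
            labels[i] = len(modes) - 1
    return np.array(modes), labels

# Toy data: two "events" (temporal bursts) within one subspace.
rng = np.random.default_rng(2)
pts = np.vstack([rng.normal([2.0, 0.0], 0.1, (100, 2)),
                 rng.normal([6.0, 0.0], 0.1, (100, 2))])
modes, labels = mean_shift_modes(pts, bandwidth=0.5)
print(len(modes), "clusters; modes ~", np.round(modes, 2))
```

No number of clusters is supplied: the two modes, and hence the two clusters, emerge from the density estimate itself, which is why the non-parametric procedure fits this setting.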
5 Performance Study

In this section, we study the performance of DECK. We first describe the data set used in our experiments. Then, we present and analyze the results of the conducted experiments.

5.1 Data Set

We use the real-life Web click-through data collected by AOL [13] from March 2006 through May 2006. According to [13], if a user clicked on more than one page from the list returned for a single query, these pages appear as successive entries in the data. Hence, in our experiments, we simply extract successive pages corresponding to the same query and the same user as a query session. The timestamp of the first clicked page of a query session is taken as the occurring time of the session. We manually identified a set of events from the data set, including both predictable events (e.g., Memorial Day on May 29, 2006) and unpredictable events (e.g., the death of Dana Reeve, an American actress, on March 6, 2006). For each event, we identified a set of query keywords and selected all query sessions which contain the query keywords and happen close to the date of the event. After filtering out events which are represented by fewer than 50 query sessions, a total of 35 events are used in our experiments. The complete list of events is given in the full version of this paper. We then randomly select query sessions which do not represent any real events and, together with the query sessions corresponding to real events, generate five data sets, which contain 5K, 10K, 20K, 50K and 100K query sessions respectively.
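A small sketch of this session-extraction step is shown below (Python). The record layout (`Click`) and its field names are hypothetical, not the actual AOL [13] schema; the grouping rule follows the description above: successive entries with the same user and query form one session, and the first click's timestamp is the session's occurring time.

```python
from dataclasses import dataclass, field

# Hypothetical layout for one click-through log entry (illustration only).
@dataclass
class Click:
    user: str
    query: str
    timestamp: float   # e.g. seconds since the start of the observation period
    url: str

@dataclass
class Session:
    user: str
    query: str
    occurring_time: float          # timestamp of the first clicked page
    urls: list = field(default_factory=list)

def extract_sessions(clicks):
    """Group *successive* entries with the same user and query into one session."""
    sessions = []
    for c in clicks:
        if sessions and sessions[-1].user == c.user and sessions[-1].query == c.query:
            sessions[-1].urls.append(c.url)
        else:
            sessions.append(Session(c.user, c.query, c.timestamp, [c.url]))
    return sessions

log = [Click("u1", "dana reeve", 100.0, "a.com"),
       Click("u1", "dana reeve", 101.0, "b.com"),    # successive -> same session
       Click("u1", "memorial day", 500.0, "c.com"),  # new query  -> new session
       Click("u2", "dana reeve", 120.0, "a.com")]
for s in extract_sessions(log):
    print(s.user, repr(s.query), s.occurring_time, s.urls)
```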
5.2 Result Analysis

Performance of DECK. We first compare the performance of DECK with the existing two-phase-clustering algorithm [18]. Given the set of clusters returned as detected events, the existing algorithm [18] finds a best match between the discovered clusters and the true events. A best match of a true event is defined as a cluster that has the maximum overlap with the true event in terms of the number of common query-page pairs (query sessions). We further require that the number of common query-page pairs (query sessions) be no less than a specified threshold (in our experiments, we set the threshold to 50% of the query-page pairs or query sessions representing the true event). Then, the evaluation metrics, precision and recall, can be computed as follows. Precision is the ratio of the number of correctly detected events to the overall number of discovered clusters. Recall is the ratio of the number of correctly detected events to the total number of events.

The experimental results are shown in Figures 5 (a) and (b) respectively. The existing algorithm is referred to as 2PClustering in the figures (please ignore the other two algorithms, DECK-GPCA and DECK-NP, for now). We observe that DECK outperforms 2PClustering in both precision and recall. Since the algorithm 2PClustering does not specify how to decide the number of clusters, we use the number of events generated by DECK. Then, since many clusters generated by 2PClustering do not represent any events, both the precision and recall values are low.

Since the best-match approach used by the existing algorithm may not be objective enough, we further evaluate the performance using an entropy measure. For each generated cluster i, we compute p_ij as the fraction of query-page pairs (query sessions) representing the true event j. Then, the entropy of cluster i is $E_i = -\sum_j p_{ij} \log p_{ij}$. The total entropy can be calculated as the sum of the entropies of the clusters weighted by the size of each cluster: $E = \sum_{i=1}^{m} \frac{n_i \times E_i}{n}$, where m is the number of clusters, n is the total number of query-page pairs (query sessions), and n_i is the size of cluster i.
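The evaluation protocol can be sketched as follows (Python), treating clusters and true events as sets of session identifiers. The 50% best-match threshold, E_i, and the size-weighted total entropy follow the definitions above; taking n as the sum of the cluster sizes and breaking ties by the first maximal overlap are our own simplifications.

```python
import numpy as np

def precision_recall(clusters, true_events, min_overlap_frac=0.5):
    """Best-match evaluation: for every true event, the cluster with the largest
    overlap is its best match; the event counts as correctly detected if that
    overlap covers at least `min_overlap_frac` of the event's sessions."""
    detected = 0
    for e in true_events:
        best = max(len(e & c) for c in clusters)
        if best >= min_overlap_frac * len(e):
            detected += 1
    precision = detected / len(clusters)     # detected events / discovered clusters
    recall = detected / len(true_events)     # detected events / all true events
    return precision, recall

def total_entropy(clusters, true_events):
    """E = sum_i (n_i / n) * E_i  with  E_i = -sum_j p_ij log p_ij."""
    n = sum(len(c) for c in clusters)        # simplification: n = sum of cluster sizes
    E = 0.0
    for c in clusters:
        E_i = 0.0
        for e in true_events:
            p_ij = len(c & e) / len(c)
            if p_ij > 0:
                E_i -= p_ij * np.log(p_ij)
        E += len(c) / n * E_i
    return E

# Toy example with session ids.
events = [set(range(0, 50)), set(range(50, 100))]
clusters = [set(range(0, 45)),
            set(range(50, 100)) | set(range(100, 120)),
            set(range(120, 140))]
print(precision_recall(clusters, events))
print(round(total_entropy(clusters, events), 3))
```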
[Figure 5. Precision, recall and entropy of DECK, 2PClustering, DECK-GPCA and DECK-NP (x-axis: data set size, 5K–100K query sessions).]
The experimental results are shown in Figure 5 (c). Again, DECK works better than 2PClustering, and the reason is similar to before: given the number of clusters generated by DECK, 2PClustering generates clusters of larger size, and since it does not prune any data, the entropy of 2PClustering is higher than that of DECK.

$\frac{\sigma_{r+1}}{\sigma_1 + \cdots + \sigma_r} < \varepsilon$. They recursively increase the number of subspaces until the condition is satisfied. Hence, in our experiments, the value r is automatically decided by fixing the value of ε to 1.0E-3 for both GPCA and KNN-GPCA. Then, the ratios of the gap computed by KNN-GPCA to the gap computed by GPCA are shown in Figure 6. We notice that KNN-GPCA does enlarge the gap and improves the possibility of selecting an appropriate ε.
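As a minimal sketch of the rank-selection rule just described (not the full GPCA recursion), the helper below returns the smallest r for which σ_{r+1}/(σ_1+···+σ_r) < ε, given the singular values of an embedding matrix; the synthetic rank-2 matrix is only for illustration.

```python
import numpy as np

def estimate_rank(M, eps=1e-3):
    """Smallest r with sigma_{r+1} / (sigma_1 + ... + sigma_r) < eps.
    Mirrors the stopping rule described above: r is increased until it holds."""
    s = np.linalg.svd(M, compute_uv=False)
    for r in range(1, len(s)):
        if s[r] / s[:r].sum() < eps:
            return r
    return len(s)   # no sufficient drop found: matrix is numerically full rank

# Synthetic embedding matrix of rank 2 plus tiny noise.
rng = np.random.default_rng(3)
M = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 6)) + 1e-5 * rng.normal(size=(200, 6))
print(estimate_rank(M, eps=1e-3))   # expected: 2
```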
[Figures: y-axes "Singular Value Gap Ratio" (Figure 6, referenced above) and "Interestingness Ratio"; x-axis: data set size, 5K–100K query sessions; series for 10, 20 and 35 events.]