
DECK: Detecting Events from Web Click-through Data

Ling Chen1, Yiqun Hu2, Wolfgang Nejdl1

1 L3S Research Center, University of Hannover, Germany
2 School of Computer Engineering, Nanyang Technological University, Singapore
{lchen,nejdl}@L3S.de, yiqun.hu@gmail.com

Abstract

In the past few years, there has been increased research interest in detecting previously unidentified events from Web resources. Our focus in this paper is to detect events from the click-through data generated by Web search engines. Existing event detection algorithms, which mainly study news archive data, cannot be employed directly because of the following two unique features of click-through data: 1) the information provided by click-through data is quite limited; 2) not every query issued to a Web search engine corresponds to an event in the real world. In this paper, we address this problem by proposing an effective algorithm which Detects Events from ClicK-through data (DECK). We first transform click-through data to the 2D polar space by considering the semantic dimension and the temporal dimension of queries. Robust subspace estimation is performed to detect subspaces such that each subspace consists of queries of similar semantics. Next, we prune uninteresting subspaces, which do not contain queries corresponding to real events, by simultaneously considering the respective distributions of queries along the semantic dimension and the temporal dimension in each subspace. Finally, events are detected from the interesting subspaces using a non-parametric clustering technique. Our experimental results on real-life data show that, compared with an existing approach, the proposed approach is more accurate and effective in detecting real events from click-through data.

1. Introduction

With the prevailing publishing activities over the Internet, the Web nowadays covers almost every object and event in the real world [18]. This phenomenon has motivated the data mining community to discover knowledge such as topics, events and stories from large volumes of Web data [11, 3, 10, 15]. Although most of the existing work focuses on analyzing and extracting events from traditional Web data such as Web content and structures, a recent research direction is to detect events from user-generated Web data. For example, the effort presented in [19] aims to detect events from Weblogs, which have become one of the dominant forms of self-publication on the Internet. In [20], tags indicating real events are extracted from a set of user-supplied tag data annotating pictures at the Flickr site (http://www.flickr.com/). Our focus in this paper is to detect events from Web click-through data, which is generated (passively) by users when interacting with Web search engines. An excerpt of click-through data containing five entries is shown in Table 1. Each entry of Web click-through data basically records the following four types of information: an anonymous user identity, the query issued by the user, the time at which the query was submitted, and the URL of the clicked search result [13].

The reason that click-through data serves as a promising data source for event detection is threefold. Firstly, similar to other resources on the Web, click-through data can be regarded as a sensor of the real world. Click-through data, containing the query keywords and clicked pages of users, often reflects users' responses to recent real-world events. Secondly, the sharply accentuated role of Web search engines as an entry point to the Web has given rise to a huge volume of click-through data, which enables effective knowledge discovery. Thirdly, click-through data is well formatted, so the complicated preprocessing required for Web content and structure data can be avoided.

However, detecting events from click-through data is a nontrivial problem because of the following two observations. 1) The information provided is limited, so the semantics cannot be identified easily and clearly. As shown in Table 1, the URLs which represent the search results clicked by users are incomplete: instead of referring to the addresses of pages, they refer to the addresses of sites. Consequently, entries having the same URL may have different semantics and correspond to different events. Similarly, the same query keywords may have been issued by users to inquire about different events. 2) A large amount of click-through data does not represent any real event in the world. Table 2 lists the top 10 most frequent entries in the click-through data logged by AOL in March 2006; none of them is correlated with any real event.
Table 1. Example of click-through data.

    User  Query  Time             Page
    337   jojo   3/19/2006 15:02  http://www.jojoonline.com
    337   jojo   3/19/2006 15:12  http://www.azlyrics.com
    178   poker  3/14/2006 21:44  http://www.pokerroom.com
    178   poker  3/14/2006 21:58  http://www.pokerstars.com
    178   poker  3/14/2006 21:58  http://www.partypoker.com

Table 2. Frequent query-page pairs.

    ID  Query            Page                          Freq.
    1   google           http://www.google.com         38608
    2   mapquest         http://www.mapquest.com       16518
    3   yahoo            http://www.yahoo.com          13860
    4   ebay             http://www.ebay.com           11669
    5   google.com       http://www.google.com          9284
    6   yahoo.com        http://www.yahoo.com           8361
    7   myspace.com      http://www.myspace.com         7169
    8   myspace          http://www.myspace.com         6678
    9   www.google.com   http://www.google.com          4739
    10  bank of america  http://www.bankofamerica.com   3957

Current research in this area is still in its infancy. To the best of our knowledge, the work done by Zhao et al. [18] is the first and only one which detects events from Web click-through data. In their work, they presented a two-phase-clustering approach to detect events. In the first phase, entries of click-through data, represented as query-page pairs, are clustered based on their semantics. In the second phase, for each semantically similar cluster, query-page pairs are clustered again based on their frequency evolution over some predefined time intervals. The clustering results are returned as detected events. Nevertheless, we observed that not every cluster returned by their approach can be regarded as the representation of one real event, since a large portion of click-through data is not related to events. Therefore, pruning irrelevant data is an important step in effective event detection from click-through data.

In this paper, we propose a novel algorithm DECK, Detecting Events from ClicK-through data, which proceeds in the following four steps: polar transformation, subspace estimation, subspace pruning, and cluster generation, as shown in Figure 1. Given a collection of Web click-through data, we first transform it to the 2D polar space. Each query session of Web click-through data (a query session, which will be formally defined in Section 2, contains a query and a set of corresponding pages clicked by a user) is mapped to a point in polar space such that the angle θ and the radius r of the point respectively reflect the semantics and the occurring time of the query session. Then, we propose a new method, KNN-GPCA, to estimate the subspaces of the transformed click-through data (e.g., the dotted lines in Figure 1 (b)). Each estimated subspace contains query sessions of some broad topic. (TDT has its own topic definition; in this paper, we refer to a topic as some general theme, which may or may not be related to an event.) Next, by considering both the distribution along the semantic dimension and the distribution along the temporal dimension of the query sessions in each subspace, we prune uninteresting subspaces which do not contain query sessions corresponding to real events. Finally, a non-parametric clustering method is used to cluster the data points in each interesting subspace. The generated clusters are returned as detected events.

The main contributions of this paper are summarized as follows.

• We develop a novel algorithm, KNN-GPCA, to segment Web click-through data into broad topics. KNN-GPCA improves the robustness of the existing Generalized Principal Component Analysis (GPCA) [16] by embedding a K-nearest-neighbor distribution constraint within the GPCA framework, and estimates subspaces with a weighted least square approximation.

• We propose an interestingness measure, which simultaneously considers the semantic certainty and the temporal certainty of click-through data, to justify the likelihood that a subspace contains query sessions corresponding to events.

• We evaluate the performance of DECK with extensive experiments on real-life click-through data collected by AOL, and compare it with the existing two-phase-clustering approach [18].

The rest of the paper is organized as follows. In Section 2, we illustrate the polar transformation of Web click-through data. We describe our linear subspace estimation algorithm, KNN-GPCA, in Section 3. In Section 4, we first define the interestingness measure of subspaces and then discuss the detection of events from interesting subspaces. Experimental results based on real-life data sets are presented in Section 5. We review related work in Section 6 and conclude this paper in Section 7.

2 Polar Space Representation

Given a collection of Web click-through data, Zhao et al. [18] considered the data in units of query-page pairs, each consisting of a user query and a page clicked by the user. For example, the click-through data in Table 1 would be considered as five distinct query-page pairs. In DECK, however, we study Web click-through data in units of query sessions, where a query session refers to an episode of interaction between a user and a Web search engine. A query session consists of a query issued by the user and the set of pages clicked by the user on the search result. The reason we adopt the unit of query session instead of query-page pair is given below.

Definition 1 (Query Session) A query session S = (Q, P), where Q = {q1, q2, ..., qm} is a bag of keywords representing a user query issued to a search engine and P = {p1, p2, ..., pn} refers to the set of corresponding pages clicked by the user on the search result. In addition, a query session S is associated with a time point, T(S), at which the query is issued.
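To make the session unit concrete, the following is a minimal Python sketch of Definition 1 and of the session grouping rule used in our experiments (Section 5): successive entries with the same user and query form one session, and the timestamp of the first entry is taken as T(S). The tuple layout of the log entries and the whitespace tokenization of queries are illustrative assumptions, not part of the AOL log specification.

    from dataclasses import dataclass, field
    from datetime import datetime

    @dataclass
    class QuerySession:
        """A query session S = (Q, P) with occurring time T(S) (Definition 1)."""
        query: frozenset                         # Q: bag of query keywords
        time: datetime                           # T(S): time of the first entry
        pages: set = field(default_factory=set)  # P: pages clicked by the user

    def extract_sessions(entries):
        """Group successive (user, query, time, url) entries, assumed sorted
        by user and time, into query sessions."""
        sessions, current_key = [], None
        for user, query, time, url in entries:
            if (user, query) != current_key:     # a new interaction episode starts
                sessions.append(QuerySession(frozenset(query.split()), time))
                current_key = (user, query)
            sessions[-1].pages.add(url)          # accumulate the clicked pages P
        return sessions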
Figure 1. Overview of DECK: (a) Polar Transformation, (b) Subspace Estimation, (c) Subspace Pruning, (d) Cluster Generation. (Each panel is a polar plot in which the radius r encodes time and the angle θ encodes semantics.)

For example, the first two entries in Table 1 are considered as one query session because they indicate that, after issuing the query "jojo", user 337 clicked the two pages "http://www.jojoonline.com" and "http://www.azlyrics.com". Since the entries of a query session are temporally close to each other, in this paper we take the timestamp of the first entry as the occurring time of the query session for simplicity. The extraction of query sessions is explained in Section 5.

It can be noticed that a query session usually contains multiple query-page pairs which share the same user query. The reason we adopt the unit of query session rather than query-page pair is based on the following intuition: if a user inquires about some event, the multiple query-page pairs corresponding to the query session usually represent the same event. Thus, instead of clustering these query-page pairs afterwards to discover events, we can group them together and study them in units of query sessions.

Before illustrating how to transform query sessions into the 2D polar space, we first explain why we choose the polar space instead of the Cartesian coordinate system. Given a 2D space, 1D subspaces are defined to be lines passing through the origin. If we transform each query session S to a point (θ, r) in polar space such that the angle θ and the radius r respectively reflect the semantics and the occurring time of S, then query sessions of similar semantics are mapped to points which have similar angles and lie along a line (subspace) passing through the origin. However, if we transform a query session S to a point (x, y) in the Cartesian coordinate system such that the x and y values respectively reflect the semantics and the occurring time of S, then query sessions of similar semantics are mapped to points which have similar x values and lie along a line parallel to the y axis. Lines parallel to the y axis (except the y axis itself) are not subspaces and cannot be detected by subspace estimation algorithms.

Particularly, given two query sessions S1 and S2, we map them to two points (θ1, r1) and (θ2, r2) respectively such that the more similar the two query sessions are in semantics, the smaller the angle |θ1 − θ2| is. Furthermore, if the two query sessions are exactly the same in semantics, then the closer their occurring times, the smaller the distance |r1 − r2|.

We first consider mapping the semantics of a query session S to an angle θ. Since the angle between two query sessions reflects how similar they are in semantics, we need to define the semantic similarity between two query sessions. Recall that a query session contains a query and a set of clicked pages. We define the semantic similarity between two query sessions by considering their similarities in not only query keywords but also clicked pages. For both queries and clicked pages, we use an adapted Jaccard coefficient to measure the similarity.

Definition 2 (Semantic Similarity) Given two query sessions S1 = (Q1, P1) and S2 = (Q2, P2), the semantic similarity between S1 and S2, denoted as Sim(S1, S2), is

    Sim(S1, S2) = α × |Q1 ∩ Q2| / max{|Q1|, |Q2|} + (1 − α) × |P1 ∩ P2| / max{|P1|, |P2|}

The weight coefficient α ∈ [0, 1] can be determined experimentally to assign different importance to the similarities of queries and clicked pages. For example, consider the query sessions in Figure 2 (a), where each row of the table represents a query session. Let α = 0.5. Then Sim(S1, S2) = 1/2 × 1/2 + 1/2 × 2/3 = 7/12.

Given a set of n query sessions {S1, S2, ..., Sn}, an n × n semantic similarity matrix M can be computed such that each element mij = Sim(Si, Sj). In other words, the relative semantics of a query session Si is represented by the n-dimensional row vector Ri = ⟨mi1, mi2, ..., min⟩ of M. In order to map the semantics of Si to an angle θi in polar space, we need to reduce the dimension of Ri to 1. For dimension reduction, we perform Principal Component Analysis (PCA) on the semantic similarity matrix M; the first principal component is used to preserve the dominant variance in the semantic similarities. Let {f1, f2, ..., fn} be the first principal component corresponding to the set of query sessions {S1, S2, ..., Sn}. A query session Si can then be mapped to a point (θi, ri), where θi is computed as

    θi = (fi − minj(fj)) / (maxj(fj) − minj(fj)) × π/2

where minj(fj) and maxj(fj) are respectively the minimum and maximum values in the first principal component. Obviously, θi is restricted to [0, π/2].
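The sketch below implements Sim of Definition 2 and the angle mapping, and also applies the time-to-radius normalization defined in the next paragraph. We read {f1, ..., fn} as the scores of the rows of M on the first principal component; since the text leaves the exact PCA convention (and the value of α) open, this is one plausible reading rather than the authors' reference implementation. Downstream, each (θ, r) pair is treated as the polar coordinates of the 2D point (r cos θ, r sin θ), so that semantically similar sessions lie on a line through the origin.

    import numpy as np

    def sim(s1, s2, alpha=0.5):
        """Adapted Jaccard similarity of Definition 2."""
        q = len(s1.query & s2.query) / max(len(s1.query), len(s2.query))
        p = len(s1.pages & s2.pages) / max(len(s1.pages), len(s2.pages))
        return alpha * q + (1 - alpha) * p

    def polar_transform(sessions, alpha=0.5):
        """Map each session S_i to a point (theta_i, r_i) in polar space."""
        M = np.array([[sim(a, b, alpha) for b in sessions] for a in sessions])
        Mc = M - M.mean(axis=0)                  # PCA on the similarity matrix
        _, _, vt = np.linalg.svd(Mc, full_matrices=False)
        f = Mc @ vt[0]                           # scores on the first component
        theta = (f - f.min()) / (f.max() - f.min()) * np.pi / 2
        t = np.array([s.time.timestamp() for s in sessions])
        r = (t - t.min()) / (t.max() - t.min())  # radius from occurring time
        return np.column_stack([theta, r])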
The mapping from the occurring time of a query session to the radius of a point can be handled directly. Given a set of query sessions {S1, S2, ..., Sn}, the radius of the point (θi, ri) corresponding to the query session Si is given by

    ri = (T(Si) − minj(T(Sj))) / (maxj(T(Sj)) − minj(T(Sj)))

where minj(T(Sj)) and maxj(T(Sj)) are respectively the earliest and the latest occurring times of all query sessions. ri takes values in the range [0, 1].

    ID  Query  Pages     Time
    S1  q1 q2  p1 p2     2006-03-15 14:08:16
    S2  q2     p1 p2 p3  2006-03-15 21:20:09
    S3  q2     p2 p3     2006-03-16 10:42:25
    S4  q3 q4  p4 p5     2006-03-09 19:22:18
    S5  q4     p4 p6 p7  2006-03-18 11:35:52
    S6  q3 q4  p4 p8     2006-03-29 12:05:31
    S7  q5     p9 p10    2006-03-22 20:41:11
    (a) Example query sessions
    (b) Example polar transformation (a polar plot of the points of S1 to S7)

Figure 2. Example polar transformation.

Consider the set of example query sessions in Figure 2 (a) again. Query sessions S1 to S3 are similar in semantics as well as in occurring time. Query sessions S4 through S6 are similar in semantics only. Query session S7 is dissimilar to all other query sessions in semantics. Figure 2 (b) shows the polar transformation of the set of query sessions. (In order to show the polar transformation clearly, we constrain the angles of the query sessions to [π/9, 4π/9] in this example.) It can be observed from the figure that our polar transformation has the following two features:

• Subspace consistency. The mapping from the semantics of a query session to the angle of a point in polar space causes query sessions of similar semantics to lie on one and only one 1D subspace. For example, the points of S1 to S3 and the points of S4 to S6 lie around the two dotted lines in Figure 2 respectively, whereas the point of S7 appears as an outlier.

• Cluster consistency. The mapping from the occurring time of a query session to the radius of a point in polar space forces query sessions of similar semantics and similar occurring time to appear as clusters inside subspaces. For example, the points of S1 to S3 form a cluster on the lower dotted line, while the points of S4 to S6 are spread along the upper dotted line.

3 Subspace Estimation

According to the subspace consistency of the polar transformation, our objective now is to estimate subspaces from the transformed data such that each subspace contains query sessions of similar semantics. For this purpose, we propose a new subspace estimation algorithm, KNN-GPCA, based on Generalized Principal Component Analysis (GPCA) [16]. GPCA is an algebra-geometric approach which simultaneously estimates subspace bases and assigns data points to subspaces. The reasons why we develop our algorithm based on GPCA are as follows: unlike prior work on subspace analysis [5, 6], GPCA neither requires initialization, which usually results in a local optimum, nor restricts the subspaces to be either orthogonal or trivially intersecting. Furthermore, GPCA is capable of estimating subspaces without prior knowledge of the number of subspaces.

However, as analyzed in the following subsection, the performance of GPCA degrades in the presence of outliers. In order to improve the robustness of GPCA, KNN-GPCA takes into account the distribution of the K nearest neighbors of the data points, which yields a weighted least square estimation of the subspaces. In this section, we first review the GPCA algorithm. Then, the embedding of the K-nearest-neighbor distribution is described. Finally, we estimate subspaces using the weighted least square technique.

3.1 Review of GPCA

Given a set of sample data points, GPCA estimates a mixture of linear subspaces by the following three steps: fitting polynomials to the data, estimating the number of subspaces, and estimating the normal vectors of the subspaces. Based on the fact that each data point x ∈ R^D satisfies bi^T x = 0, where bi is the normal vector of the subspace to which x belongs, a data point lying on one of the n subspaces satisfies the following equation:

    pn(x) = ∏_{i=1}^{n} (bi^T x) = 0    (1)

where {bi}_{i=1}^{n} are the normal vectors of the subspaces. This can be converted to a linear expression by expanding the product of all bi^T x and viewing all monomials x^n = x1^{n1} x2^{n2} ··· xD^{nD} (0 ≤ nj ≤ n, j = 1, ..., D, n1 + n2 + ··· + nD = n) as system unknowns. By introducing the Veronese map νn : [x1, ..., xD]^T ↦ [..., x^n, ...], where x^n is a monomial of the form x1^{n1} x2^{n2} ··· xD^{nD}, equation (1) becomes the following linear expression:

    pn(x) = νn(x)^T c = Σ c_{n1···nD} x1^{n1} ··· xD^{nD} = 0    (2)

where the entries c_{n1···nD} ∈ R of c are the coefficients of the corresponding monomials x^n.
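For our D = 2 polar points, the Veronese map takes a particularly simple form: νn lists the n + 1 monomials of degree n in two variables. The helper below (a hypothetical name) builds the embedded data matrix whose rows are νn(xj)^T, as used in equation (3) below.

    import numpy as np
    from math import comb

    def veronese_2d(X, n):
        """Veronese embedding for D = 2: rows nu_n(x_j)^T holding the n + 1
        monomials x1^a * x2^(n - a), a = n, ..., 0, of degree n."""
        x1, x2 = X[:, 0], X[:, 1]
        return np.column_stack([x1**a * x2**(n - a) for a in range(n, -1, -1)])

    # sanity check: M_n = C(n + D - 1, D - 1) = n + 1 monomials for D = 2
    X = np.random.rand(5, 2)
    assert veronese_2d(X, 3).shape == (5, comb(3 + 2 - 1, 2 - 1))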
Since each polynomial pn(x) = νn(x)^T c must be satisfied by every data point, given a collection of N sample data points {xj}_{j=1}^{N}, a linear system can be generated as

    Ln c = [νn(x1), νn(x2), ..., νn(xN)]^T c = 0    (3)

After fitting polynomials to the data, the number of subspaces is estimated based on the condition that there is a unique solution for c. Thus, when there is no noisy data, the number of subspaces can be computed as the minimum value i such that the rank of the polynomial embedding matrix Li equals Mi − 1, where Mi = C(i + D − 1, D − 1) is the number of different monomials in Li. In the presence of noise, GPCA relies on some pre-defined threshold ε to combine subspaces that are close to each other. For example, the rank of Li is determined as r if σ_{r+1} / (σ1 + ··· + σr) < ε, where σj is the j-th singular value of Li.

Once the number of subspaces is determined, the coefficient vector c can be computed accordingly. Then, the normal vectors {bi}_{i=1}^{n} can be solved for by polynomial differentiation in the absence of noise. Taking noise into account, the estimation of the normal vectors is cast as a constrained nonlinear optimization problem which is initialized using the normal vectors obtained by polynomial differentiation.

While GPCA provides an elegant solution to the problem of linear subspace estimation, the algorithm has some inherent limitations. Particularly, we observed that when outliers are present, the accuracy of the following steps of GPCA is impaired:

• Estimation of the number of subspaces. In the presence of noise, GPCA determines the rank r of Li based on some threshold ε. When noise and outliers are moderate, the difference between the r-th singular value and the (r+1)-th singular value of Li might be slight. Thus, the appropriate threshold is hard to specify and the number of subspaces can be determined erroneously.

• Estimation of the normal vectors {bi}_{i=1}^{n}. GPCA estimates the coefficients c before estimating the normal vectors of the subspaces. In the presence of noise, c is computed from the singular vectors of Li associated with its smallest singular values. However, when outliers are included in the calculation, the computed singular values and vectors are prone to error. Consequently, the coefficients and the normal vectors may not be determined appropriately.

Therefore, in order to improve the robustness of GPCA, the impact of noise and outliers in these two estimation steps should be demoted. We achieve this purpose by assigning weight coefficients to the data points to distinguish true data from noise and outliers. Particularly, we weigh data points based on the distribution of their K nearest neighbors.

Figure 3. Distribution of 3 nearest neighbors (of the three data points x1, x2 and x3; the svar and nvar directions are indicated for x1).

3.2 Weight Coefficient Assignment

The Kth Nearest Neighborhood Distance (kNND) metric proposed in [7] detects and removes outliers based on the following fact: in a cluster containing more than K points, the kNND of a data point is small; otherwise, it is large. Recall that our polar transformation has the cluster consistency property, which indicates that true data points (e.g., query sessions corresponding to real events) lie in clusters inside subspaces. The kNND of true data points should thus be small while the kNND of noise and outliers should be large.

However, instead of simply using the kNND to differentiate outliers from inliers, we consider the distribution of the K nearest neighbors of the data points. Given a data point xi, let its K nearest neighbors be NNk(xi). Both the variance of the K nearest neighbors along the direction of the subspace of xi (i.e., the direction from the origin to the point xi), denoted as svar(NNk(xi)), and the variance of the K nearest neighbors along the direction orthogonal to the subspace direction, denoted as nvar(NNk(xi)), can be computed by PCA. For example, Figure 3 shows the distribution of the 3 nearest neighbors of three data points x1, x2 and x3 respectively. The two directions along which svar(NN3(x1)) and nvar(NN3(x1)) are computed are indicated by svar and nvar in the figure.

If xi is a true data point, it forms some cluster together with its K nearest neighbors. Thus, both svar(NNk(xi)) and nvar(NNk(xi)) should be small. Consequently, the sum S(NNk(xi)) = svar(NNk(xi)) + nvar(NNk(xi)) will be small as well. On the contrary, if xi is an outlier (e.g., x2 in Figure 3), S(NNk(xi)) will be large. However, even if a data point forms a cluster together with its K nearest neighbors, it may not be a true data point if its neighbors spread along the orthogonal direction of the subspace (e.g., x3 in Figure 3). That is, the cluster does not exist inside any subspace. (Note that, directly using the kNND, this type of noisy data point cannot be identified.) In this case, svar(NNk(xi)) is small but nvar(NNk(xi)) is large, so the ratio R(NNk(xi)) = nvar(NNk(xi)) / svar(NNk(xi)) is large. Hence, we assign a weight W(xi) to a data point xi based on S(NNk(xi)) and R(NNk(xi)) as follows:

    W(xi) = 1 / (1 + S(NNk(xi)) × R(NNk(xi)))    (4)

The value of W(xi) ranges from 0 to 1. When the data point xi lies in a cluster whose data points spread along the direction of the subspace, the value of R(NNk(xi)) is 0 in the absence of noise and very small in the presence of noise, and the value of S(NNk(xi)) is small as well; hence, the weight W(xi) is close to 1. In other words, we assign a large weight to a true data point. Otherwise, R(NNk(xi)) and/or S(NNk(xi)) are large, which results in a small W(xi). Thus, the impact of noise and outliers is reduced by small weight values. The choice of the parameter K is discussed below.
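The weight assignment can be sketched as follows for 2D points. The paper computes svar and nvar by PCA of the neighborhood; for 2D data, projecting the centered neighbors onto the subspace direction of xi (from the origin to xi) and onto its orthogonal complement is equivalent, which is what we do here. The value of K and the small eps guard are illustrative choices.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def knn_weights(X, k=8, eps=1e-12):
        """W(x_i) = 1 / (1 + S(NN_k(x_i)) * R(NN_k(x_i))), equation (4)."""
        _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
        W = np.empty(len(X))
        for i, x in enumerate(X):
            nbrs = X[idx[i, 1:]]                 # NN_k(x_i), excluding x_i itself
            nbrs = nbrs - nbrs.mean(axis=0)
            u = x / (np.linalg.norm(x) + eps)    # subspace direction of x_i
            v = np.array([-u[1], u[0]])          # orthogonal direction
            svar, nvar = np.var(nbrs @ u), np.var(nbrs @ v)
            S, R = svar + nvar, nvar / (svar + eps)
            W[i] = 1.0 / (1.0 + S * R)
        return W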
    
3.3 KNN-GPCA

Taking into account the weight of each data point, the linear system of equation (3) is modified as

    W Ln c = diag(W(x1), ..., W(xN)) [νn(x1), ..., νn(xN)]^T c = 0    (5)

where W(xi) is the weight of the data point xi. We then perform a Singular Value Decomposition (SVD) on W Ln to estimate the number of subspaces. Note that, if xi is an outlier, its small weight pulls it back towards the origin; since the origin lies in every null space, xi will not increase the singular values. Consequently, the selection of the threshold ε is no longer as crucial as before.

After estimating the number of subspaces, the coefficient vector c can be computed using the weighted least square technique. Particularly, let νn(xi)(2..M) be the vector containing all elements of νn(xi) except for the first element νn(xi)(1), where M = C(r + D − 1, D − 1) and r is the estimated number of subspaces, and let A be the matrix whose rows are νn(xi)(2..M)^T, i = 1, 2, ..., N. We can then express the left side of equation (5) as

    W A [c2, ..., cM]^T + W [νn(x1)(1) c1, ..., νn(xN)(1) c1]^T    (6)

In order to calculate a basis of the coefficient vector, let c1 = 1. Then, equation (5) can be expressed as

    W A [c2, ..., cM]^T = −W [νn(x1)(1), ..., νn(xN)(1)]^T    (7)

The above equation can be succinctly written as W A c̃ = d, where c̃ = [c2, ..., cM]^T and d is the right side of the above equation. By minimizing the objective function ‖d − Ac̃‖_W, we can obtain the weighted least square approximation of the coefficients as

    c1 = 1 and [c2, ..., cM]^T = (A^T W^T W A)^{−1} (A^T W^T W d)    (8)

Note that, since we use the diagonal matrix of weight coefficients W to demote the impact of noise and outliers, the estimation error of the coefficient vector c is reduced.

To estimate the normal vectors {bi}_{i=1}^{n}, we first calculate them using polynomial differentiation, as in the original GPCA in the absence of noise. The computed vectors serve to initialize the following constrained nonlinear optimization, which differs from that of GPCA in the introduction of the weight coefficients:

    min Σ_{j=1}^{N} W(xj) ‖x̃j − xj‖²
    subject to ∏_{i=1}^{n} (bi^T x̃j) = 0, j = 1, ..., N    (9)

where x̃j is the projection of xj onto its closest subspace. By using a Lagrange multiplier λj for each constraint, the above optimization problem is equivalent to minimizing the following function:

    Σ_{j=1}^{N} ( W(xj) ‖x̃j − xj‖² + λj ∏_{i=1}^{n} (bi^T x̃j) )    (10)

Taking partial derivatives with respect to x̃j and equating them to 0, we can solve for λj/2 and W(xj) ‖x̃j − xj‖². Substituting them back into the objective function (9), the simplified objective function on the normal vectors can be derived as

    En(b1, ..., bn) = Σ_{j=1}^{N} W(xj) (∏_{i=1}^{n} bi^T xj)² / ‖Σ_{i=1}^{n} bi ∏_{l≠i} (bl^T xj)‖²    (11)

After obtaining an initial estimate of the normal vectors using polynomial differentiation, one can use standard nonlinear optimization techniques to optimize equation (11).
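The two weighted estimation steps can be sketched as follows, reusing veronese_2d and knn_weights from above. The rank test transcribes the σ_{r+1} / (σ1 + ··· + σr) < ε criterion, and the solver transcribes equation (8); the polynomial differentiation and the iterative refinement of the normal vectors are not reproduced here.

    import numpy as np

    def estimate_rank(WL, eps=1e-3):
        """Smallest r with sigma_{r+1} / (sigma_1 + ... + sigma_r) < eps."""
        s = np.linalg.svd(WL, compute_uv=False)
        for r in range(1, len(s)):
            if s[r] / s[:r].sum() < eps:
                return r
        return len(s)

    def weighted_coefficients(V, w):
        """Weighted least square solve of equation (8), with c_1 fixed to 1.
        V holds the rows nu_n(x_j)^T and w the weights W(x_j)."""
        A = V[:, 1:]                   # columns 2..M of the embedding
        d = -(w * V[:, 0])             # entries -W(x_j) * nu_n(x_j)(1)
        WtW = np.diag(w ** 2)          # W^T W for the diagonal weight matrix
        c_rest = np.linalg.solve(A.T @ WtW @ A, A.T @ WtW @ d)
        return np.concatenate(([1.0], c_rest))

For example, with points X and a candidate number of subspaces n, one would apply estimate_rank to np.diag(knn_weights(X)) @ veronese_2d(X, n) while increasing n, in the spirit of the procedure described in Section 5.2.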
We now discuss some implementation and parameter issues in this subspace estimation step. In order to compute the weight of each data point, we need to find its K nearest neighbors. Due to the high complexity of discovering K nearest neighbors directly, in our implementation we first perform a bisecting k-means clustering and then discover the K nearest neighbors within each cluster only. In estimating the set of normal vectors {bi}_{i=1}^{n}, we observed that the convergence of equation (11) is slow. Hence, in our implementation, we employed a weighted k-means iteration method to estimate the optimal vectors: we assign the weighted data points to their nearest subspaces and update the normal vectors of the subspaces, and this process is performed iteratively until there is no change to the subspaces. According to our experimental results, this method achieves the same performance as equation (11) but with a faster convergence rate.

There are two parameters involved in this step: K, the number of nearest neighbors, and ε, the threshold used to decide the rank of the polynomial embedding matrix. The former can be decided easily: since noise and outliers are usually far away from true data points, any small value of K can be used, and the selected K can be fixed for the computation of the weights of all data points. The decision of the parameter ε depends on the data. As shown in our experiments (Section 5.2), by demoting the impact of outliers and noise, the difference between the r-th and (r+1)-th singular values of the polynomial embedding matrix is enlarged and observable, which makes the decision of ε straightforward.

4 Event Detection

After estimating subspaces from the transformed polar space, each subspace contains query sessions of a similar topic. However, not every subspace is interesting in the sense that it contains clusters corresponding to real events. For example, consider the query sessions with queries on popular public portals (e.g., google). Due to their intense frequency during the whole observed time period, such query sessions form a subspace although they do not represent any real event; the subspace s1 in Figure 4 represents such an uninteresting subspace. In this section, we first discuss pruning uninteresting subspaces. Then, events are detected from the interesting subspaces.

Figure 4. Interestingness of subspaces (s1 is an uninteresting subspace and s2 an interesting one; ⟨h1, ..., hm⟩ and ⟨v1, ..., vn⟩ are the histograms of the projections onto the two directions).

4.1 Subspace pruning

We distinguish interesting and uninteresting subspaces based on the intuitive premise used by [8]: the appearance of an event is signaled by a "burst of activity", with certain features rising sharply in frequency when the event emerges. In our work, we simultaneously consider two features: the occurring time of the query sessions as well as their semantics. Particularly, if a subspace is interesting in the sense that it contains query sessions corresponding to events, both the occurring time and the semantics of the query sessions in the subspace should exhibit certain "bursts". Recall that our polar transformation respectively maps the occurring time and the semantics of a query session to the radius and the angle of a point in polar space. The temporal "burst" and the semantic "burst" should therefore be reflected by concentrated distributions of the data points along the subspace direction and along the orthogonal direction of the subspace, respectively. For example, as shown in Figure 4, the subspace s2 might be an interesting subspace since its data points are distributed with high certainty along both the subspace direction and the orthogonal direction.

Note that we cannot simply use a variance measure to define the interestingness of a subspace. The reason is as follows. A subspace may contain multiple events (e.g., periodical events). If there are multiple events contained in a subspace, then the variance of the data points in the orthogonal direction (i.e., the "semantic" dimension) will be small, while the variance of the data points in the subspace direction (i.e., the "temporal" dimension) will be large. Hence, in our approach, we employ an entropy measure to define the interestingness of a subspace. Particularly, we project the data points onto the two directions respectively and calculate the respective histograms of the distributions. Let ⟨h1, h2, ..., hm⟩ and ⟨v1, v2, ..., vn⟩, where hi and vi are individual bins, be the two corresponding histograms. The interestingness of a subspace can then be computed as

    I(si) = 1 − [ −p Σ_{i=1}^{m} hi log hi − (1 − p) Σ_{i=1}^{n} vi log vi ]    (12)

where p ∈ [0, 1] is a weight which can be determined experimentally to assign different importance to the entropy values in the two directions. For example, if p = 1, then only the temporal "burst" is considered. The interestingness measure takes values from 0 to 1: the more certain the distributions in the two directions, the smaller the entropies in the brackets of Equation (12) and the greater the value of the interestingness. Given some threshold ζ, a subspace si is pruned as uninteresting if I(si) < ζ. We observed from our experimental results (Section 5.2) that the I(si) of an interesting subspace si is usually much greater than the I(sj) of an uninteresting subspace sj, which makes it easy to decide the value of ζ.

Note that the calculation of the histograms may be biased by noisy data points. In our implementation, we remove noise by an inlier growing method before projecting the data points. The details are given in a full version of this paper (the reference is omitted temporarily to comply with the double blind reviewing guidelines).
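A sketch of equation (12) for one estimated 2D subspace follows. The number of histogram bins is an illustrative choice, and we normalize each entropy by log(bins) so that I(si) stays within [0, 1], which the text asserts but leaves implicit.

    import numpy as np

    def entropy(values, bins=20):
        """Normalized entropy of the histogram of the projected points."""
        hist, _ = np.histogram(values, bins=bins)
        q = hist[hist > 0] / hist.sum()
        return -(q * np.log(q)).sum() / np.log(bins)

    def interestingness(X, u, p=0.5, bins=20):
        """I(s_i) of equation (12) for the inliers X of a subspace with unit
        direction u; the p-weighted histogram is the temporal one (along u),
        the other is the semantic one (across u)."""
        v = np.array([-u[1], u[0]])
        return 1.0 - (p * entropy(X @ u, bins) + (1 - p) * entropy(X @ v, bins))

Subspaces whose interestingness falls below the threshold ζ are then pruned before the clustering step.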
4.2 Query session clustering

After pruning the uninteresting subspaces, events can be detected from the remaining subspaces by clustering data points, based on the cluster consistency property of the polar transformation. Although there is a plethora of published clustering techniques, methods relying upon a priori knowledge of the number of clusters are not applicable here, because the number of events in each subspace cannot be decided easily. Hence, in our approach, we employ a non-parametric clustering technique called Mean Shift Clustering [4]. Mean shift clustering is an application of the mean shift procedure, which successively computes the mean shift vector, which always points toward the direction of the maximum increase in density, and converges to a point where the gradient of the density function is zero.

Based on the mean shift procedure, we perform mean shift clustering on the data points in each subspace as follows. Firstly, the mean shift procedure is run with all the data points to find the stationary points of the density estimate. Secondly, the discovered stationary points are pruned such that only local maxima are retained. The set of all points that converge to the same mode defines one cluster. The returned clusters are expected to represent real events. The complete algorithm of DECK is shown in Algorithm 1.

Algorithm 1 DECK.
Input: A set of query sessions
Output: A set of query session clusters corresponding to real events
Description:
 1: Transform the query sessions to polar space
 2: Estimate subspaces of query sessions using KNN-GPCA
 3: for each estimated subspace do
 4:   Project the inliers onto the subspace direction and the orthogonal direction respectively
 5:   Compute the entropy of the distribution histogram in the two directions
 6:   if the interestingness of the subspace is less than some threshold then
 7:     Prune the subspace
 8:   end if
 9: end for
10: for each interesting subspace do
11:   Perform mean shift clustering
12: end for
13: Return the clustering results

5 Performance Study

In this section, we study the performance of DECK. We first describe the data set used in our experiments and then present and analyze the experimental results.

5.1 Data Set

The real-life Web click-through data collected by AOL [13] from March 2006 through May 2006 is used. According to [13], if a user clicked on more than one page from the list returned for a single query, these pages appear as successive entries in the data. Hence, in our experiments, we simply extract successive pages corresponding to the same query and the same user as a query session. The timestamp of the first clicked page of a query session is taken as the occurring time of the session. We manually identified a set of events from the data set, including both predictable events (e.g., Memorial Day on May 29, 2006) and unpredictable events (e.g., the death of Dana Reeve, an American actress, on Mar 6, 2006). For each event, we identified a set of query keywords and selected all query sessions which contain the query keywords and happen close to the date of the event. After filtering out events which are represented by fewer than 50 query sessions, a total of 35 events are used in our experiments. The complete list of events is given in the full version of this paper. We then randomly select query sessions which do not represent any real events and combine them with the query sessions corresponding to real events to generate five data sets, which respectively contain 5K, 10K, 20K, 50K and 100K query sessions.

5.2 Result Analysis

Performance of DECK. We first compare the performance of DECK with the existing two-phase-clustering algorithm [18]. Given the set of clusters returned as detected events, the existing algorithm [18] finds a best match between the discovered clusters and the true events. A best match of a true event is defined as a cluster that has the maximum overlap with the true event in terms of the number of common query-page pairs (query sessions). We further constrain that the number of common query-page pairs (query sessions) should be no less than some specified threshold (in our experiments, we set the threshold to 50% of the query-page pairs or query sessions representing the true event). Then, the evaluation metrics, precision and recall, can be computed as follows: precision is the ratio of the number of correctly detected events to the overall number of discovered clusters, and recall is the ratio of the number of correctly detected events to the total number of true events.

The experimental results are shown in Figures 5 (a) and (b) respectively. The existing algorithm is referred to as 2PClustering in the figures (please ignore the other two algorithms, DECK-GPCA and DECK-NP, for now). We observe that DECK outperforms 2PClustering in both precision and recall. Since 2PClustering does not discuss how to decide the number of clusters, we use the number of events generated by DECK; then, since many clusters generated by 2PClustering do not represent any events, both its precision and recall values are low.

Since the best match approach used by the existing algorithm may not be objective enough, we further evaluate the performance using an entropy measure. For each generated cluster i, we compute pij as the fraction of its query-page pairs (query sessions) representing the true event j. Then, the entropy of cluster i is Ei = −Σj pij log pij. The total entropy can be calculated as the sum of the entropies of the clusters weighted by the cluster sizes: E = Σ_{i=1}^{m} ni × Ei / n, where m is the number of clusters, n is the total number of query-page pairs (query sessions) and ni is the size of cluster i.
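These metrics reduce to a few lines of code; the sketch below assumes that, for each cluster, its size ni and the fractions pij of its sessions belonging to each true event are available (the argument names are ours).

    import numpy as np

    def precision_recall(n_matched, n_clusters, n_events):
        """Correctly detected events over discovered clusters / true events."""
        return n_matched / n_clusters, n_matched / n_events

    def total_entropy(sizes, fractions):
        """E = sum_i n_i * E_i / n with E_i = -sum_j p_ij log p_ij."""
        n = float(sum(sizes))
        E = 0.0
        for n_i, p in zip(sizes, fractions):
            p = np.asarray([x for x in p if x > 0])
            E += n_i * -(p * np.log(p)).sum()
        return E / n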
Figure 5. Precision (a), recall (b) and entropy (c) of DECK, 2PClustering, DECK-GPCA and DECK-NP on data sets of 5K to 100K query sessions.
The experimental results are shown in Figure 5 (c). Again, DECK works better than 2PClustering, and the reason is similar to before: given the number of clusters generated by DECK, 2PClustering generates clusters of larger size, and since it does not prune any data, its entropy is higher than that of DECK.

Performance of Subspace Estimation. We also study the performance of the subspace estimation step of DECK; particularly, we examine the effectiveness of KNN-GPCA. Hence, we implemented an alternative version of DECK, referred to as DECK-GPCA, which employs the original GPCA algorithm to estimate subspaces. As shown in Figures 5 (a), (b) and (c), DECK-GPCA performs worse than DECK. This indicates that KNN-GPCA effectively demotes the impact of outliers and improves the robustness of the subspace estimation.

As analyzed in Section 3, after assigning lower weights to outliers, the parameter ε, which determines the number of subspaces, can be decided more easily. Hence, we further investigate how KNN-GPCA enlarges the gap between the r-th and (r+1)-th singular values of the polynomial embedding matrix. We conduct the experiments on the five datasets by varying the number of true events. Note that, in the presence of noise, both GPCA and KNN-GPCA decide the number of subspaces by first assuming that there are two subspaces, then checking whether σ_{r+1} / (σ1 + ··· + σr) < ε, and recursively increasing the number of subspaces until the condition is satisfied. Hence, in our experiments, the value r is automatically decided by fixing the value of ε to 1.0E-3 for both GPCA and KNN-GPCA. The ratios of the gap computed by KNN-GPCA to the gap computed by GPCA are shown in Figure 6. We notice that KNN-GPCA does enlarge the gap and improves the possibility of selecting an appropriate ε.

Figure 6. Performance of subspace estimation (singular value gap ratio over the 5K to 100K data sets, for 10, 20 and 35 events).

Performance of Subspace Pruning. In order to examine the subspace pruning step of DECK, we implemented another alternative version of DECK, referred to as DECK-NP (DECK with No Pruning), which skips the subspace pruning step. The performance of DECK-NP is also shown in Figures 5 (a), (b) and (c). Since DECK-NP does not prune any uninteresting subspaces, more clusters are generated by this algorithm. Hence, although it achieves similar recall to DECK, its precision is even lower than that of 2PClustering. On the other hand, since it employs KNN-GPCA to estimate subspaces, the correctly discovered clusters are of high quality; hence, its entropy is better than that of DECK-GPCA.

We further conduct experiments to examine whether the threshold ζ, which is used to prune uninteresting subspaces, can be selected easily. The experiments are conducted on the five datasets by varying the number of events. For each dataset, we order the estimated subspaces according to their interestingness values. We then compute the interestingness ratio between the pair of successive subspaces with the greatest difference in their interestingness values. The experimental results are shown in Figure 7, where the high ratios indicate that appropriate ζ values can be decided with ease.

Figure 7. Performance of subspace pruning (interestingness ratio over the 5K to 100K data sets, for 10, 20 and 35 events).
6 Related Work

The problem of event detection is part of a broader initiative called Topic Detection and Tracking [2]. Particularly, event detection can be divided into two categories: retrospective detection and on-line detection [17]. The former refers to the detection of previously unidentified events from an accumulated historical collection, while the latter entails the discovery of the onset of new events from live feeds in real time. Our approach detects events from a collection of Web click-through data and thus belongs to the first category. We therefore review some related retrospective event detection algorithms. In [17], Yang et al. proposed to use an agglomerative clustering algorithm, augmented Group Average Clustering, to discover events from a corpus. As a recent example, Li et al. [11] proposed a multi-model approach which explicitly models both the content and the time information of documents. The particular feature of their algorithm is the use of an auto-adaptive sliding window on the time line, which overcomes the inflexible usage of timestamps in traditional retrospective detection algorithms. The most significant difference between our work and the existing work stems from the data source we consider, Web click-through data. Direct clustering is not applicable for our problem because Web click-through data is not necessarily correlated with events. The time window is no longer a critical issue here, as the temporal burst can be observed clearly with a window size of one day.

Recently, there has been significant research interest in detecting events from text streams using feature-pivot approaches. This line of research is inspired by Kleinberg's seminal work, which describes extracting bursty features using an infinite automaton model [8]. Fung et al. [21] proposed to identify bursty features and then cluster the features to generate bursty events. Their approach has the restriction that each feature exclusively belongs to one event; in our problem, however, a bursty query keyword (or clicked page) may belong to different events. The work presented by He et al. [22] also detects events by examining features first. They analyzed every feature using the Discrete Fourier Transform and classified the features into categories which correspond to different types of events (e.g., periodic and aperiodic events). Their approach cannot be used directly in our problem, as the semantic information of click-through data cannot be captured and the irrelevant data cannot be pruned.

7 Conclusions

Web click-through data was recently identified as a potential source for event detection. In this paper, we proposed a novel approach, called DECK, for detecting events from click-through data. The main features of DECK are summarized as follows. Firstly, DECK simultaneously considers the semantic information and the temporal information of click-through data. By transforming click-through data to points in the 2D polar space such that the angles and radii of the points respectively reflect the semantics and the timestamps of the queries, the information of the two dimensions is considered at the same time in both the subspace estimation step (each data point is weighted based on the distribution of its K nearest neighbors in both the subspace direction and the orthogonal direction) and the subspace pruning step (the entropies in both the subspace direction and the orthogonal direction are used to measure the interestingness of a subspace). Secondly, in the subspace estimation step, data points are weighed based on the distribution of their K nearest neighbors instead of the kNND, so that a particular type of outlier can be identified. Thirdly, in the subspace pruning step, an interestingness measure based on entropy instead of variance was proposed, so that subspaces containing multiple events can be identified as interesting. The experimental results on real-life Web click-through data show that our approach is accurate and effective in detecting events from Web click-through data.

References

[1] J. Allan, J. G. Carbonell, G. Doddington, J. Yamron, and Y. Yang. Topic detection and tracking pilot study: Final report. In DARPA Broadcast News Transcription and Understanding Workshop, 1998.
[2] J. Allan, R. Papka, and V. Lavrenko. On-line new event detection and tracking. In SIGIR, 1998.
[3] J. Allan, C. Wade, and A. Bolivar. Retrieval and novelty detection at the sentence level. In SIGIR, 2003.
[4] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 2002.
[5] J. Ho, M.-H. Yang, J. Lim, K.-C. Lee, and D. J. Kriegman. Clustering appearances of objects under varying illumination conditions. In CVPR, 2003.
[6] K. Kanatani. Motion segmentation by subspace separation and model selection. In ICCV, 2001.
[7] Q. Ke and T. Kanade. Robust subspace clustering by combined use of kNND metric and SVD algorithm. In CVPR, 2004.
[8] J. M. Kleinberg. Bursty and hierarchical structure in streams. In KDD, 2002.
[9] R. Kumar, U. Mahadevan, and D. Sivakumar. A graph-theoretic approach to extract storylines from search results. In KDD, 2004.
[10] W.-S. Li, K. S. Candan, Q. Vu, and D. Agrawal. Retrieving and organizing web pages by "information unit". In WWW, 2001.
[11] Z. Li, B. Wang, M. Li, and W.-Y. Ma. A probabilistic model for retrospective news event detection. In SIGIR, 2005.
[12] Q. Mei and C. Zhai. Discovering evolutionary theme patterns from text: an exploration of temporal text mining. In KDD, 2005.
[13] G. Pass, A. Chowdhury, and C. Torgeson. A picture of search. In The First International Conference on Scalable Information Systems, 2006.
[14] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 2000.
[15] A. Sun and E.-P. Lim. Web unit mining: finding and classifying subgraphs of web pages. In CIKM, 2003.
[16] R. Vidal, Y. Ma, and S. Sastry. Generalized principal component analysis. In CVPR, 2003.
[17] Y. Yang, T. Pierce, and J. G. Carbonell. A study of retrospective and on-line event detection. In SIGIR, 1998.
[18] Q. Zhao, T.-Y. Liu, S. S. Bhowmick, and W.-Y. Ma. Event detection from evolution of click-through data. In KDD, 2006.
[19] Q. Zhao, P. Mitra, and B. Chen. Temporal and information flow based event detection from social text streams. In AAAI, 2007.
[20] T. Rattenbury, N. Good, and M. Naaman. Towards automatic extraction of event and place semantics from Flickr tags. In SIGIR, 2007.
[21] G. P. Fung, J. X. Yu, P. S. Yu, and H. Lu. Parameter free bursty events detection in text streams. In VLDB, 2005.
[22] Q. He, K. Chang, and E.-P. Lim. Analyzing feature trajectories for event detection. In SIGIR, 2007.
