JACOB KOGAN
University of Maryland, Baltimore County
cambridge university press
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo
ISBN-13 978-0-521-61793-2 paperback
ISBN-10 0-521-61793-6 paperback
Cambridge University Press has no responsibility for the persistence or accuracy of URLs
for external or third-party internet websites referred to in this publication, and does not
guarantee that any content on such websites is, or will remain, accurate or appropriate.
God is a comedian playing to an audience
too afraid to laugh.
– Voltaire (1694–1778)
Contents

3 BIRCH
3.1 Balanced iterative reducing and clustering algorithm
Bibliography
Index
Foreword
Michael W. Berry
University of Tennessee
Preface
Theorem. There does not exist a best method, that is, one which is superior to all other methods, for solving all problems in a given class of problems.
Proof: By contradiction.
Corollary. Those who believe, or claim, that their method is the best
one suffer from that alliterative affliction, ignorance/arrogance.
Proof: By observation.
¹ While some of the problems are straightforward, others may require an investment of time and effort.
Clustering has attracted research attention for more than 50 years, and many important results available in the literature are not covered in this book. A partial list of excellent publications on the subject is provided in Section 1.3 and [79]. This book's references stretch from the 1956 work of Steinhaus [125] to the recent work of Teboulle [128] (who brought [125] and a number of additional relevant references to the author's attention). Undoubtedly, many important relevant contributions are missing.
In spite of my efforts, many typos, errors, and mistakes surely remain in the book. While it is my unfortunate duty to accept full responsibility for them, I would like to encourage readers to email suggestions and comments to kogan@umbc.edu.
Acknowledgments
1 Introduction and motivation
Step 1. Embed the documents and the query into a metric space.
Step 2. Handle problems 1 and 2 above as problems concerning
points in the metric space.
¹ See, for example, http://www.groxis.com, http://www.iboogie.tv, http://www.kartoo.com, http://www.mooter.com/, and http://vivisimo.com
1.1. A way to embed ASCII documents
² A stop list is a list of words that are believed to have little or no value as search or discrimination terms. For example, the stop list from the SMART system at Cornell University can be found at ftp://ftp.cs.cornell.edu/pub/smart/english.stop
³ The text is borrowed from [16].
The simple example indicates that one can expect sparse high-dimensional vectors (this is indeed the case in IR applications). A "term by document" matrix is the n × m matrix with columns being the vectors a_i, i = 1, ..., m. Each column of the matrix A = [a_1, ..., a_m] represents a document and each row corresponds to a unique term in the collection. While this book is mainly concerned with document clustering, one can consider word/term clustering by focusing on rows of the "term by document" matrix. In fact, simultaneous co-clustering of columns and rows may benefit separate clustering of columns only or rows only.

It is not uncommon that the coordinates of a_i are weighted frequencies (and not just raw counts of word occurrences). Vectors a_i are usually L2 normalized so that $\|a_i\|_2 = \left(\sum_j a_i^2[j]\right)^{1/2} = 1$. Often words are stemmed, that is, suffixes and, sometimes, also prefixes are removed so that, for example, an application of the Porter stemming algorithm to the words "study", "studying", and "studied" produces the term (not even an English word!) "studi." Porter stemming reduces the dictionary by about 30% and saves memory space without seriously compromising clustering quality of English texts (handling Semitic languages like Arabic or Hebrew is much more difficult in this respect). In an attempt to further reduce the dimension of the vector space model, often only a subset of the most "meaningful" words is selected for the document–vector construction.
A typical vector resulting from mapping a text into a finite-dimensional Euclidean space is high dimensional, sparse, and L2 normalized. So, for example, the vector a_1 after normalization becomes
$$a_1 = \left(0, 0, \tfrac{1}{2}, \tfrac{1}{2}, \tfrac{1}{2}, 0, 0, 0, 0, 0, 0, \tfrac{1}{2}, 0, 0, 0, 0, 0\right)^T.$$
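To make the construction concrete, here is a small illustrative sketch (not from the book) that builds a "term by document" matrix from a toy collection and L2-normalizes its columns. The two-document collection and the vocabulary are invented for illustration only.

```python
import numpy as np

# Toy collection (hypothetical, for illustration only).
documents = [
    "we like to study clustering",
    "clustering of documents is studied here",
]
vocabulary = sorted({word for doc in documents for word in doc.split()})

# Build the n x m "term by document" matrix: one column per document,
# one row per unique term; entries are raw term counts.
A = np.zeros((len(vocabulary), len(documents)))
for j, doc in enumerate(documents):
    for word in doc.split():
        A[vocabulary.index(word), j] += 1.0

# L2-normalize every column so that ||a_j||_2 = 1.
A = A / np.linalg.norm(A, axis=0)

print(np.round(A, 3))
print(np.linalg.norm(A, axis=0))   # each column now has unit length
```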
1.2. Clustering and this book
Unlike many books on the subject this book does not attempt to cover a wide range of clustering techniques, but focuses on only three basic crisp or exclusive clustering algorithms⁴ and their extensions. These algorithms are:
1. k-means,
2. Principal Direction Divisive Partitioning (PDDP),
3. Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH).

⁴ That is, each object is assigned to a single cluster.
1.3. Bibliographic notes
2 Quadratic k-means algorithm
This chapter focuses on the basic version of the k-means clustering algorithm equipped with the quadratic Euclidean distance-like function. First, the classical batch k-means clustering algorithm with a general distance-like function is described and a "best representative" of a cluster, or centroid, is introduced. This completes the description of batch k-means with a general distance-like function, and the rest of the chapter deals with k-means clustering algorithms equipped with the squared Euclidean distance.
Elementary properties of quadratic functions are reviewed, and the classical quadratic k-means clustering algorithm is stated. The following discussion of the algorithm's advantages and deficiencies results in the incremental version of the algorithm (the quadratic incremental k-means algorithm). In an attempt to address some of the deficiencies of batch and incremental k-means we merge both versions of the algorithm, and the combined algorithm is called the k-means clustering algorithm throughout the book.
The analysis of the computational cost associated with the merger of the two versions of the algorithm is provided, and convexity properties of partitions generated by the batch k-means and k-means algorithms are discussed. A definition of "centroids" as affine subspaces of R^n and a brief discussion of connections between quadratic and spherical k-means (formally introduced in Chapter 4) complete the chapter.
1. Given a partition Π = {π_1, ..., π_k} of the set A one can define the corresponding centroids c(π_1), ..., c(π_k) by:
$$c(\pi_i) = \arg\min\left\{\sum_{a \in \pi_i} d(x, a),\ x \in C\right\}. \quad (2.1.4)$$
2. For a set of k "centroids" {c_1, ..., c_k} one can define a partition Π = {π_1, ..., π_k} of the set A by:
$$\pi_i = \{a : a \in A,\ d(c_i, a) \le d(c_l, a) \text{ for each } l = 1, \ldots, k\}. \quad (2.1.5)$$
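A minimal sketch of one batch iteration built directly on (2.1.4) and (2.1.5), assuming the squared Euclidean distance so that each centroid is the arithmetic mean of its cluster. The function and variable names are ours, not the book's.

```python
import numpy as np

def batch_kmeans_iteration(A, centroids):
    """One batch k-means step: (2.1.5) assignment followed by (2.1.4) centroids.

    A         : (m, n) array, one data vector per row.
    centroids : (k, n) array of current centroids c_1, ..., c_k.
    """
    # Step (2.1.5): assign every vector a to the nearest centroid.
    distances = ((A[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = distances.argmin(axis=1)

    # Step (2.1.4): for the quadratic distance the minimizer is the mean.
    new_centroids = centroids.copy()
    for i in range(centroids.shape[0]):
        members = A[labels == i]
        if len(members) > 0:          # an empty cluster keeps its old centroid
            new_centroids[i] = members.mean(axis=0)
    return labels, new_centroids

A = np.array([[0.0, 1.0], [0.0, -1.0], [2.0, 1.0], [2.0, -1.0]])
labels, centroids = batch_kmeans_iteration(A, np.array([[-1.0, 0.0], [3.0, 0.0]]))
print(labels, centroids)
```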
2.1. Classical batch k-means algorithm
$$\delta(x) = \sum_{i=1}^{m}|x - a_i|^2 = m x^2 - 2\left(\sum_{i=1}^{m} a_i\right)x + \sum_{i=1}^{m} a_i^2. \quad (2.1.6)$$
Note that for each x one has $\delta'(x) = 2mx - 2\sum_{i=1}^{m} a_i$. Since δ(x) is a convex function, δ(c) = min_x δ(x) if and only if 0 = δ′(c), hence
$$c = \frac{a_1 + \cdots + a_m}{m}. \quad (2.1.7)$$
If
$$\sum_{i=1}^{m}\|x - a_i\|^2 = \sum_{i=1}^{m}\sum_{j=1}^{n}\left[x[j] - a_i[j]\right]^2 = \sum_{j=1}^{n}\sum_{i=1}^{m}\left[x[j] - a_i[j]\right]^2, \quad (2.1.8)$$
then application of (2.1.7) separately to each coordinate yields
$$c = c(A) = \frac{a_1 + \cdots + a_m}{m}. \quad (2.1.9)$$
$$\min_{x \in \mathbb{R}}\sum_{i=1}^{m}|x - a_i|. \quad (2.1.10)$$
Indeed,
If
$$q(\Pi) = \sum_{l=1}^{k}\sum_{a \in \pi_l'}\|c_{\min(a)} - a\|^2 \ge \sum_{l=1}^{k}\sum_{a \in \pi_l'}\|c_l' - a\|^2$$
Figures 2.1, 2.2, and 2.3 illustrate the basic steps of Algorithm 2.1.1.
[Figure 2.1: Vector set and initial two-cluster partition; "zeros" are in the "left" cluster and "dots" are in the "right" cluster.]
[Figures 2.2 and 2.3: the partition and centroids after the subsequent batch k-means steps.]
Note that the right-hand side equals the left-hand side if and only if c(A) = a_m.
[Figure: "Convex clusters" — the hyperplane H_ij with half-spaces H_ij^- and H_ij^+ separating the centroids c_i and c_j.]
Next we turn to a special case of a scalar data set A = {a_1 < a_2 < ··· < a_m}. Let π_1^l = {a_1, ..., a_l}, π_2^l = {a_{l+1}, ..., a_m}, l = 0, ..., m. Due to Problem 2.1.4
$$Q(\pi_1^1, \pi_2^1) \le Q(\pi_1^0, \pi_2^0).$$
Furthermore, an optimal two-cluster partition is {π_1^l, π_2^l} for some 0 < l < m (see Figure 2.5).
is suggested in [20]).
[Figure 2.5: "Optimal partition" — the scalar data a_1 < a_2 < ··· < a_m on the real line.]
Example 2.1.1. Consider the scalar data set {−7, −5, −4, 4, 5, 7} and the initial three-cluster partition Π^(0) = {π_1^(0), π_2^(0), π_3^(0)} with
π_1^(0) = {−7, −5},  π_2^(0) = {−4, 4},  π_3^(0) = {5, 7}.
[Figure 2.6: Example 2.1.1 provides initial (on the left) and final (on the right) partitions with different numbers of nonempty clusters.]
[Figure 2.7: Initial partition for the three scalar data set of Example 2.1.2.]
so that the second cluster becomes empty, and the number of nonempty clusters is reduced from three to two.

Example 2.1.2. Consider the three scalar data set A = {0, 2, 3}, and the initial partition Π^(0) = {π_1^(0), π_2^(0)} where π_1^(0) = {0, 2}, and π_2^(0) = {3}. An iteration of the batch k-means algorithm applied to Π^(0) does not change the partition. On the other hand, the partition Π^(1) = {{0}, {2, 3}} is superior to Π^(0). The better partition Π^(1) is undetected by the algorithm.
Project 2.1.1. Code the batch k-means algorithm. Keeping in mind the sparsity of the data, use the sparse format (a brief description of the format follows).
A 3 × 2 matrix (3 rows, 2 columns, 4 nonzero entries) is stored as three arrays:
1. value = [1, 7, 3, 2] (the nonzero entries, column by column),
2. rowind = [1, 2, 0, 1] (the row index of each nonzero entry),
3. pointr = [0, 2, 4] (the position in value where each column starts).
2.2. Incremental algorithm
Proof:
$$Q(A \cup B) = \sum_{i=1}^{p}\|c - a_i\|^2 + \sum_{i=1}^{q}\|c - b_i\|^2$$
due to (2.2.2)
$$= \sum_{i=1}^{p}\|c(A) - a_i\|^2 + p\|c - c(A)\|^2 + \sum_{i=1}^{q}\|c(B) - b_i\|^2 + q\|c - c(B)\|^2$$
$$= Q(A) + Q(B) + p\|c - c(A)\|^2 + q\|c - c(B)\|^2.$$
If
We now focus on two special cases of Lemma 2.2.1. First, consider the
set B = {b} being a singleton and denote the set A ∪ B = {a1 , . . . , a p , b}
by A+ . Due to Lemma 2.2.1 one has
one gets
$$p\,\|c(A^+) - c(A)\|^2 = \frac{p}{(p+1)^2}\|c(A) - b\|^2$$
and
$$\|c(A^+) - b\|^2 = \frac{p^2}{(p+1)^2}\|c(A) - b\|^2$$
and finally
$$Q(A^+) = Q(A) + \frac{p}{p+1}\|c(A) - b\|^2. \quad (2.2.6)$$
Next consider the set B = {b_1, ..., b_{q−1}, b_q}, remove the vector b_q from B, and denote the resulting set {b_1, ..., b_{q−1}} by B^−. An application of Lemma 2.2.1 yields:
may trigger the move. As (2.2.9) shows, the change in the value of the objective function caused by the move is
$$\Delta = \frac{m_i}{m_i - 1}\|c(\pi_i) - a\|^2 - \frac{m_j}{m_j + 1}\|c(\pi_j) - a\|^2.$$
The difference between the expressions
$$\Delta - \Delta_k = \frac{1}{m_i - 1}\|c(\pi_i) - a\|^2 + \frac{1}{m_j + 1}\|c(\pi_j) - a\|^2 \ge 0 \quad (2.2.10)$$
is negligible when the clusters π_i and π_j are large. However, Δ − Δ_k may become significant for small clusters. In particular, it is possible that Δ_k is negative, and the batch k-means iteration leaves a in cluster π_i. At the same time the value of Δ is positive, and reassigning a to π_j would decrease Q. Indeed, for the data set in Example 2.1.2 and a = a_2 one has
$$\Delta_k = \|c_1 - a_2\|^2 - \|c_2 - a_2\|^2 = 1 - 1 = 0,$$
and
$$\Delta = \frac{2}{1}\|c_1 - a_2\|^2 - \frac{1}{2}\|c_2 - a_2\|^2 = 2 - \frac{1}{2} = \frac{3}{2}.$$
This remark suggests a refinement of the batch k-means algorithm that incorporates the incremental step. This incremental step is formally introduced next.
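A minimal sketch of a single first variation (incremental) pass under the quadratic distance, using the change formula Δ = m_i/(m_i−1)·‖c_i − a‖² − m_j/(m_j+1)·‖c_j − a‖² discussed above; the function name and the policy of applying the first profitable move are our own choices.

```python
import numpy as np

def first_variation_step(A, labels, k):
    """Move a single vector to another cluster if that strictly decreases Q."""
    centroids = np.array([A[labels == i].mean(axis=0) for i in range(k)])
    sizes = np.bincount(labels, minlength=k)
    for idx, a in enumerate(A):
        i = labels[idx]
        if sizes[i] <= 1:                       # do not empty a cluster
            continue
        gain_remove = sizes[i] / (sizes[i] - 1) * np.sum((centroids[i] - a) ** 2)
        for j in range(k):
            if j == i:
                continue
            cost_add = sizes[j] / (sizes[j] + 1) * np.sum((centroids[j] - a) ** 2)
            if gain_remove - cost_add > 1e-12:  # Delta > 0: the move improves Q
                labels[idx] = j
                return labels, True             # apply one move per call
    return labels, False

A = np.array([[0.0], [2.0], [3.0]])
labels = np.array([0, 0, 1])                    # the partition {{0, 2}, {3}}
print(first_variation_step(A, labels, 2))       # moves a_2 = 2 into the second cluster
```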
Note that:
Problem 2.2.2. Show that the k-means clustering algorithm does not
necessarily lead to the optimal partition.
Problem 2.2.3. Assume that the data set A contains two identical vectors
a′ = a′′ .
[Figure: "Optimal partition" of a three-point scalar data set.]
Throughout the book we assume that the data set contains distinct vec-
tors only. This assumption does not hold in many practical applications.
Assume that the data set A contains wi > 0 “copies” of ai (here wi is
just a positive scalar).
2.3. Quadratic k-means: summary
3891 documents) with 600 "best" terms selected (see [37] for selection procedure details). The iteration statistics are reported below:

does not change Π. At the same time removal of a_2 from π_2 and assignment of a_2 to π_1 leads to a better quality partition. In particular, an application of Algorithm 2.2.1 does change Π and leads to the partition Π′ = {{a_1, a_2}, {a_3, a_4}}.
[Two figures: "Partition Π" and the improved partition of the four-point data set {a_1, a_2, a_3, a_4}.]
This shows that when c(π_1) = c(π_2) = c the partition may not be k-means stable, and the centroids c(π_1), c(π_2) of a stable k-means two-cluster partition Π = {π_1, π_2} with |π_1| > 1 do not coincide.
Since the assumption a_i ≠ a_j when i ≠ j holds throughout the text, the result does not depend on the assumption |π_1| > 1. Indeed, if p = q = 1, then the clusters are singletons, and c(π_1) ≠ c(π_2). If at least one cluster, say π_1, contains more than one vector, then the choice of a vector different from the centroid c(π_1) is possible. The above discussion leads to the following result.
and
then
Due to Theorem 2.3.1 one has c(π_i²) ≠ c(π_j²), and there is a hyperplane H_ij that separates the space into two disjoint half-spaces H_ij^− and H_ij^+ so that
Problem 2.3.1. Use Theorem 2.3.2 to prove the first distance property
listed above.
[Figure: "Line approximation" — a one-dimensional line approximating a two-dimensional data set.]
2.4. Spectral relaxation
where
$$Y = \begin{bmatrix} \frac{e_{m_1}}{\sqrt{m_1}} & \ldots & \ldots & \ldots \\ \ldots & \frac{e_{m_2}}{\sqrt{m_2}} & \ldots & \ldots \\ \ldots & \ldots & \ldots & \ldots \\ \ldots & \ldots & \ldots & \frac{e_{m_k}}{\sqrt{m_k}} \end{bmatrix}. \quad (2.4.5)$$
λ1 ≥ λ2 ≥ · · · ≥ λn ,
then
2.5. Bibliographic notes

For a discussion of means see, for example, the classical monograph [67] and [13, 132]. The classical "quadratic" batch k-means algorithm is attributed to [57]; the problem is mentioned already in the work of Hugo Steinhaus [125]. The incremental version of the algorithm is discussed in [47], and the combination of the two versions is described in [66, 78, 145]. The idea of changing cluster affiliation for a subset of the data set A only, as well as a modification of the objective function that penalizes partitions with a large number of clusters, is also discussed in [78].
The compressed column storage format (CCS), which is also called the Harwell-Boeing sparse matrix format, is described in [49]. For an objective-function-independent distance on the set of partitions of a given data set A see [104].
3 BIRCH
For a data set A = {a_1, ..., a_m} too large to fit into the available computer memory consider a partition Π = {π_1, ..., π_M} of A. We would like to consider each cluster π_i ∈ Π as a single "feature" in such a way that for each subset of p clusters π_{i_1}, ..., π_{i_p} in Π computation of
$$Q(\pi_{i_1} \cup \cdots \cup \pi_{i_p}) \quad (3.1.1)$$
then
$$Q(\pi_{i_1} \cup \cdots \cup \pi_{i_p}) = \sum_{j=1}^{p} q_{i_j} + \sum_{j=1}^{p} m_{i_j}\|c - b_{i_j}\|^2, \quad \text{where } c = \frac{m_{i_1} b_{i_1} + \cdots + m_{i_p} b_{i_p}}{m_{i_1} + \cdots + m_{i_p}}. \quad (3.1.2)$$
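A small sketch of formula (3.1.2): each cluster is summarized by the triple (m_i, b_i, q_i) of size, centroid, and quality, and the quality of a union is computed from these summaries alone. The data structures below are our own illustration.

```python
import numpy as np

def merge_quality(features):
    """Quality of the union of clusters given only (size, centroid, quality) triples,
    following formula (3.1.2)."""
    sizes = np.array([m for m, _, _ in features], dtype=float)
    centroids = np.array([b for _, b, _ in features])
    qualities = np.array([q for _, _, q in features], dtype=float)

    c = (sizes[:, None] * centroids).sum(axis=0) / sizes.sum()   # merged centroid
    return qualities.sum() + (sizes * ((centroids - c) ** 2).sum(axis=1)).sum()

# Two summarized clusters: {(0,0), (2,0)} and {(4,0)}.
features = [(2, np.array([1.0, 0.0]), 2.0), (1, np.array([4.0, 0.0]), 0.0)]
print(merge_quality(features))   # equals Q({(0,0), (2,0), (4,0)}) = 8.0
```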
3.1. Balanced iterative reducing and clustering algorithm

[Figure: two panels showing the five-point data set {a_1, ..., a_5}.]
then
set t = t + 1
go to Step 3
7. Stop.
The above clustering scheme requires a single scan of the data set, and its I/O cost is linear in the data set size. We denote the final partition generated by this scheme by
$$\Pi = \{\pi_1, \ldots, \pi_M\}. \quad (3.1.3)$$
[Figure: the original data set with centroids b_1 and b_2 (left) and the reduced data set B = {b_1, b_2} (right).]
In this section we shall be concerned with the data set A = {a_1, ..., a_m}, the partition Π = {π_1, ..., π_M} of A, and the data set B = {b_1, ..., b_M} where b_i = c(π_i). The subset {b_1, b_2} ⊆ B can be associated with a subset π_1 ∪ π_2 ⊆ A. In general, the M cluster partition {π_1, ..., π_M} of A associates each subset π^B ⊆ B with a subset π^A ⊆ A through
$$\pi^A = \bigcup_{b_j \in \pi^B} \pi_j.$$
3.2. BIRCH-like k-means
$$\pi_i^A = \bigcup_{b_j \in \pi_i^B} \pi_j, \quad i = 1, \ldots, k. \quad (3.2.1)$$
where Π_A^(t) is a partition of A associated with Π_B^(t) through (3.2.1). Theorem 2.2.1 and formula (3.1.2) furnish the technical tools for the algorithm design. Keeping in mind the partition of A given by (3.1.3), for a cluster π^B = {b_1, ..., b_p} we define the centroid c(π^B) as c(π^A), that is
$$c(\pi^B) = \frac{m_1 b_1 + \cdots + m_p b_p}{m_1 + \cdots + m_p}, \quad \text{so that} \quad c(\pi^B) = \arg\min\left\{\sum_{i=1}^{p} m_i\|x - b_i\|^2 : x \in \mathbb{R}^n\right\}. \quad (3.2.2)$$
$$\pi_i^{B'} = \{b : \min(b) = i\}.$$
Hence for the partition Π_B = {π_1^B, ..., π_k^B} and the associated partition Π_A = {π_1^A, ..., π_k^A} one has
$$Q(\Pi_A) = \sum_{i=1}^{k} Q(\pi_i^A) = \sum_{i=1}^{k}\left[\sum_{b_j \in \pi_i^B} Q(\pi_j) + \sum_{b_j \in \pi_i^B} m_j\|c(\pi_i^B) - b_j\|^2\right]$$
$$= \sum_{i=1}^{k}\sum_{b_j \in \pi_i^B} Q(\pi_j) + \sum_{i=1}^{k}\sum_{b_j \in \pi_i^B} m_j\|c(\pi_i^B) - b_j\|^2 = Q(\Pi) + Q_B(\Pi_B).$$
1. Start with an arbitrary k-cluster partition Π_B^(0) of the data set B. Set the index of iteration t = 0.
2. Generate the partition nextBKM(Π_B^(t)).
   if [Q_B(Π_B^(t)) − Q_B(nextBKM(Π_B^(t))) > tol_B]
      set Π_B^(t+1) = nextBKM(Π_B^(t))
      increment t by 1
      go to 2
3. Generate the partition nextFV(Π_B^(t)).
   if [Q(Π_B^(t)) − Q(nextFV(Π_B^(t))) > tol_I]
      set Π_B^(t+1) = nextFV(Π_B^(t))
      increment t by 1
      go to 2
4. Stop.
At each iteration the algorithm generates the partition Π_B^(t+1) from the partition Π_B^(t) so that Q_B(Π_B^(t)) ≥ Q_B(Π_B^(t+1)) and, due to Lemma 3.2.1, Q(Π_A^(t)) ≥ Q(Π_A^(t+1)). This yields the concluding statement.

Theorem 3.2.2. Let Π_B^(t), t = 0, 1, ..., be the sequence of partitions generated by Algorithm 3.2.1. If Π_A^(t), t = 0, 1, ..., are the associated partitions of A, then {Q(Π_A^(t))} is a nonincreasing sequence of nonnegative numbers.
and m^− = m − m_p, then
$$Q_B(\pi^{B-}) - Q_B(\pi^B) = -\frac{m \cdot m_p}{m^-}\|c - b_p\|^2. \quad (3.2.6)$$
For a vector b_i we denote the numbers q_i and m_i by q(b_i) and m(b_i). Consider now two clusters π_1^B and π_2^B. Let M_1 = Σ_{b∈π_1^B} m(b), and M_2 = Σ_{b∈π_2^B} m(b). Select b ∈ π_2^B. The formula
$$\left[Q_B(\pi_1^B) - Q_B(\pi_1^{B+})\right] + \left[Q_B(\pi_2^B) - Q_B(\pi_2^{B-})\right] = \frac{M_2 \cdot m(b)}{M_2 - m(b)}\|c(\pi_2^B) - b\|^2 - \frac{M_1 \cdot m(b)}{M_1 + m(b)}\|c(\pi_1^B) - b\|^2 \quad (3.2.7)$$
follows straightforwardly from (3.2.5) and (3.2.6).
3.3. Bibliographic notes

Problem 3.3.1. Show that the triplets (|π|, Q(π), c(π)) and (|π|, Σ_{a∈π} a, Σ_{a∈π} ‖a‖²) contain identical information. That is, show that if one triplet is available, then the other one can be computed.
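A short sketch of the equivalence claimed in Problem 3.3.1 for the quadratic case: the triple (|π|, Σa, Σ‖a‖²) determines (|π|, Q(π), c(π)) and vice versa. This is our illustration of one direction and its inverse, not the book's solution.

```python
import numpy as np

def sums_to_summary(size, linear_sum, squared_sum):
    """(|pi|, sum a, sum ||a||^2)  ->  (|pi|, Q(pi), c(pi)) for the quadratic distance."""
    c = linear_sum / size
    Q = squared_sum - size * np.dot(c, c)     # sum ||a - c||^2 = sum ||a||^2 - |pi| ||c||^2
    return size, Q, c

def summary_to_sums(size, Q, c):
    """Inverse map: recover the linear and squared sums from (|pi|, Q(pi), c(pi))."""
    return size, size * c, Q + size * np.dot(c, c)

pi = np.array([[0.0, 0.0], [2.0, 0.0], [4.0, 0.0]])
size, Q, c = sums_to_summary(len(pi), pi.sum(axis=0), (pi ** 2).sum())
print(Q, c)                        # 8.0 and [2. 0.]
print(summary_to_sums(size, Q, c)) # recovers (3, [6., 0.], 20.0)
```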
4 Spherical k-means algorithm
$$S_2^{n-1} = \{x : x \in \mathbb{R}^n,\ x^T x = 1\}$$
centered at the origin (when it does not lead to ambiguity we shall denote the sphere just by S).
The chapter is structured similarly to Chapter 2. First, we introduce the batch spherical k-means algorithm, then the incremental version of the algorithm is described. Finally, the batch and incremental iterations are combined to generate the spherical k-means algorithm. We conclude the chapter with a short discussion that relates the quadratic and spherical k-means algorithms.
$$c(A) = \begin{cases} \dfrac{a_1 + \cdots + a_m}{\|a_1 + \cdots + a_m\|} & \text{if } a_1 + \cdots + a_m \ne 0, \\[4pt] 0 & \text{otherwise.} \end{cases} \quad (4.1.2)$$
Note that:
3. While the motivation for spherical k-means is provided by IR applications dealing with vectors with nonnegative coordinates residing on the unit sphere, formula (4.1.2) provides solutions to maximization problem (4.1.1) for any set A ⊂ R^n.

Problem 4.1.1. Let d(x, a) = (x^T a)². Define the centroid c = c(π) as a solution of the maximization problem
$$\arg\max\left\{\sum_{a \in \pi} d(x, a),\ x \in S\right\}.$$
Identify c.
The goal now is to find a partition Π^max = {π_1^max, ..., π_k^max} that maximizes
$$Q(\Pi) = \sum_{i=1}^{k}\left\|\sum_{a \in \pi_i} a\right\|. \quad (4.1.3)$$
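A minimal sketch of (4.1.2) and (4.1.3): the spherical centroid is the normalized sum of the cluster's unit vectors, and the objective sums the norms ‖Σ_{a∈π_i} a‖. The names and the toy data are ours.

```python
import numpy as np

def spherical_centroid(vectors):
    """Formula (4.1.2): normalized sum of the (unit) vectors, or 0 if the sum vanishes."""
    s = vectors.sum(axis=0)
    norm = np.linalg.norm(s)
    return s / norm if norm > 0 else np.zeros_like(s)

def spherical_quality(A, labels, k):
    """Objective (4.1.3): sum over clusters of || sum of member vectors ||."""
    return sum(np.linalg.norm(A[labels == i].sum(axis=0)) for i in range(k))

theta = np.array([0.0, 0.1, 1.5])                      # three unit vectors on the circle
A = np.column_stack([np.cos(theta), np.sin(theta)])
labels = np.array([0, 0, 1])
print(spherical_centroid(A[labels == 0]))
print(spherical_quality(A, labels, 2))
```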
4.1. Spherical batch k-means algorithm
[Figure: "Initial partition" of the three-vector set {a_1, a_2, a_3} on the unit sphere.]
batch k-means algorithm does not change the initial partition {π_1^(0), π_2^(0)}. At the same time it is clear that the partition {π_1′, π_2′} with π_1′ = {a_1}, π_2′ = {a_2, a_3} is "better" than the initial partition.
$$a^T c_1 \le a^T c_2 \ \text{if and only if}\ \|a - c_1\|^2 \ge \|a - c_2\|^2$$
there exists a hyperplane passing through the origin and cutting the sphere S into the two semispheres S^− and S^+ so that π_1 ⊂ S^− and π_2 ⊂ S^+.
When the data set resides in a plane the separating hyperplane becomes a straight line passing through the origin (see Figure 4.1). We remark that the assumption c_1 ≠ c_2 can be removed, and the sets conv π_1 and conv π_2 become disjoint when Algorithm 4.1.1 is augmented by incremental iterations (see Algorithm 4.3.2).
[Figure 4.1: the vectors z_1, ..., z_4 on the unit circle with centroids c_1 and c_2 and the separating line through the origin.]
$$a^T c_l^{(t)} - \|c_l^{(t)} - c_l^{(t+1)}\| \le a^T c_l^{(t+1)} \le a^T c_l^{(t)} + \|c_l^{(t)} - c_l^{(t+1)}\|. \quad (4.1.6)$$
for a^T c_l^{(t+m)}, m = 1, .... Indeed
$$a^T c_l^{(t+m)} \le a^T c_l^{(t)} + \sum_{i=0}^{m-1}\|c_l^{(t+i)} - c_l^{(t+i+1)}\|. \quad (4.1.7)$$
$$B_{ij} = a_i^T c_j^{(t)} + \sum_{l=0}^{m-1}\|c_j^{(t+l)} - c_j^{(t+l+1)}\|, \quad i = 1, \ldots, d;\ j = 1, \ldots, k$$
so that
$$a_i^T c_j^{(t+m)} \le B_{ij}. \quad (4.1.8)$$
then
if (t ≤ t_min)
   perform Step 2 of the spherical k-means algorithm
if (t = t_min)
   set B_ij = a_i^T c_j^(t), j = 1, ..., k
if (t > t_min)
   set B_ij = B_ij + ‖c_j^(t−1) − c_j^(t)‖, j = 1, ..., k
   compute a_i^T c_present(a_i)^(t)
   if (B_ij > a_i^T c_present(a_i)^(t))
      compute a_i^T c_j^(t) and set B_ij = a_i^T c_j^(t)
   From among all a_i^T c_j^(t) computed above find max(a_i)
Build the new partition Π^(t+1) = nextBKM(Π^(t)).

4.2. Spherical two-cluster partition of one-dimensional data
This section deals with building an optimal two cluster partition for vec-
tor sets residing on a one-dimensional unit circle. While the algorithm
capable of finding this partition is presented below, it turns out not to
be as straightforward as the one that finds the optimal two cluster par-
tition for one-dimensional data sets in the case of quadratic objective
function discussed in Chapter 2.
0 ≤ θ_1 ≤ θ_2 ≤ ··· ≤ θ_m < 2π.
"Reasonable" partition
1.5
z2 * z1
0.5 *
z4 * z3
*
0 *
z3 * *
z4
−0.5 z1 * *
z2
−1
−1.5
−1.5 −1 −0.5 0 0. 5 1 1. 5
z j and z j+1 recovers the optimal partition. The next example shows
that this is not necessarily the case.
[Figure: the vectors ±z_1, ..., ±z_4 on the unit circle and a "separating" line through the origin.]
[Figures: the "optimal partition" and candidate partitions 1 and 2 of the vectors z_1, ..., z_4.]
[Figures: candidate partitions 3 and 4 of the vectors z_1, ..., z_4.]
• For two unit vectors z′ = e^{iθ′} and z′′ = e^{iθ′′} we denote the "midway" vector e^{i(θ′+θ′′)/2} by mid(z′, z′′).
x′ = e^{iθ′}, x′′ = e^{iθ′′}, with θ_j < θ′ ≤ θ′′ < θ_{j+1}.
[Figures: the midway vectors mid(z_2, z_1) and mid(z_2, −z_4) on the unit circle.]
the partitions
4.3. Spherical batch and incremental clustering algorithms
In other words
$$\Delta = \sum_{i=1}^{q-1} b_i^T\left[c(B^-) - c(B)\right] + \sum_{i=1}^{p} a_i^T\left[c(A^+) - c(A)\right] + b_q^T\left[c(A^+) - c(A)\right] + \Delta_k. \quad (4.3.2)$$
By definition $\sum_{b\in B^-} b^T c(B^-) \ge \sum_{b\in B^-} b^T c(B)$, and, therefore, $\sum_{i=1}^{q-1} b_i^T[c(B^-) - c(B)] \ge 0$. For the same reason
$$\sum_{i=1}^{p} a_i^T\left[c(A^+) - c(A)\right] + b_q^T\left[c(A^+) - c(A)\right] \ge 0.$$
These two inequalities along with (4.3.2) imply Δ ≥ Δ_k. The last inequality shows that even when 0 ≥ Δ_k, and the cluster affiliation of b_q is not changed by the batch spherical k-means algorithm, the quantity Δ may still be positive. Thus, while Q({B^−, A^+}) > Q({B, A}), the partition {B^−, A^+} will be missed by the batch spherical k-means algorithm (see Example 4.1.1).
[Figure: "Final partition" of the three-vector set {a_1, a_2, a_3}.]
and
$$Q(\pi_j^+) - Q(\pi_j) = \left[\|s(\pi_j)\|^2 + 2\|s(\pi_j)\|\,a^T c(\pi_j) + 1\right]^{1/2} - \|s(\pi_j)\|. \quad (4.3.5)$$
We remark that computation of the quantities ‖s(π_l)‖ and a^T c(π_l), a ∈ A, l = 1, ..., k, is needed for iterations of the spherical batch k-means algorithm as well. Furthermore, a straightforward computation shows that
$$Q(\pi_j^+) - Q(\pi_j) = a^T c(\pi_j)\,\frac{2}{1 + \sqrt{1 + \frac{2a^T c(\pi_j)}{\|s(\pi_j)\|} + \frac{1}{\|s(\pi_j)\|^2}}} + \frac{1}{\|s(\pi_j)\|}\,\frac{1}{1 + \sqrt{1 + \frac{2a^T c(\pi_j)}{\|s(\pi_j)\|} + \frac{1}{\|s(\pi_j)\|^2}}}$$
and
$$Q(\pi_j^+) - Q(\pi_j) \le a^T c(\pi_j) + \frac{1}{2\|s(\pi_j)\|}. \quad (4.3.6)$$
Hence, if a = a_i and
$$B_{ij} + \frac{1}{2\|s(\pi_j)\|} \le -\left[Q(\pi_i^-) - Q(\pi_i)\right] \quad (4.3.7)$$
the procedure runs spherical batch k-means iterations. When the first step fails to alter a partition, the second step runs a single incremental spherical k-means iteration. The proposed refinement algorithm is given next.
1. Start with an arbitrary partitioning Π^(0) = {π_1^(0), ..., π_k^(0)}. Set the index of iteration t = 0.
2. Generate the partition nextBKM(Π^(t)).
   if [Q(nextBKM(Π^(t))) − Q(Π^(t)) > tol_B]
      set Π^(t+1) = nextBKM(Π^(t))
      increment t by 1
      go to 2
3. Generate the partition nextFV(Π^(t)).
   if [Q(nextFV(Π^(t))) − Q(Π^(t)) > tol_I]
      set Π^(t+1) = nextFV(Π^(t))
      increment t by 1
      go to 2
4. Stop.
Hence, when constrained to the unit sphere, centroids for the quadratic and spherical k-means coincide. Since for the unit length vectors a and x one has
$$\|x - a\|^2 = 2 - 2x^T a$$
5 Linear algebra techniques
$$\sum_{i=1}^{m}\|a_i - P_{l_d}(a_i)\|^2 \le \sum_{i=1}^{m}\|a_i - P_l(a_i)\|^2 \quad (5.1.1)$$
and
$$\sum_{i=1}^{m}\|P_{l_v}(a_i) - c(P_{l_v}(A))\|^2 \ge \sum_{i=1}^{m}\|P_l(a_i) - c(P_l(A))\|^2. \quad (5.1.2)$$
Problem 5.1.2. Find out how the lines ld and lv are related.
The line l that solves Problem 5.1.1 solves the following constrained optimization problem
$$\min_{x,y}\left\{\sum_{i=1}^{m}\left(\|a_i - y\|^2 - (x^T a_i)^2\right),\ \text{subject to } \|x\|^2 = 1,\ y^T x = 0\right\}. \quad (5.2.2)$$
We assume now that x is known, and will try to identify y = y(x). The assumption about x reduces (5.2.2) to
$$\min_{y}\left\{\sum_{i=1}^{m}\left(\|a_i - y\|^2 - (x^T a_i)^2\right),\ \text{subject to } y^T x = 0\right\}. \quad (5.2.3)$$
passes through the mean. We now turn to (5.2.2), which after the substitution y = m − xx^T m becomes
$$\min_{x}\left\{\sum_{i=1}^{m}\left(\|(a_i - m) + xx^T m\|^2 - (x^T a_i)^2\right),\ \text{subject to } \|x\|^2 = 1\right\}. \quad (5.2.6)$$
Note that
$$x^T B B^T x = \lambda^2. \quad (5.2.9)$$
To address Problem 5.1.2 we note that due to (5.2.1) the distance between projections of two vectors a and b is
[Figure 5.1: The multidimensional data set and the "best" line.]
5.3. Principal directions divisive partitioning

$$P_1 = \{p : p \in P,\ p \le \mu\}, \quad \text{and} \quad P_2 = \{p : p \in P,\ p > \mu\} \quad (5.3.1)$$
For example, the arithmetic mean is a good candidate for μ (see Problem 2.1.6).
The basic idea of the Principal Direction Divisive Partitioning algorithm (PDDP) is the following:
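A compact sketch of the split just described: project the centered data onto its principal direction and cut at a threshold μ (here the mean of the projections). SVD is used for the principal direction, and all names are ours.

```python
import numpy as np

def pddp_split(A):
    """One PDDP step: split the rows of A by their projection onto the principal direction."""
    mean = A.mean(axis=0)
    # Principal direction = first right singular vector of the centered data matrix.
    _, _, Vt = np.linalg.svd(A - mean, full_matrices=False)
    direction = Vt[0]
    p = (A - mean) @ direction          # one-dimensional projections
    mu = p.mean()                       # threshold; here the mean of the projections
    return np.where(p <= mu, 0, 1)

A = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.8]])
print(pddp_split(A))                    # e.g. [0 0 1 1]
```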
[Figure: "One-dimensional projection" — the projections p_1, ..., p_4 of the data set {a_1, ..., a_4} onto the principal direction.]
$$\frac{\lambda_n}{\lambda_1 + \lambda_2 + \cdots + \lambda_n} \quad (5.3.3)$$
[Figure: the data set {a_1, ..., a_4} and its one-dimensional projections p_1, ..., p_4 (two panels).]
cluster 0 1004 5 4
cluster 1 18 1440 16
cluster 2 11 15 1380
Our first goal is to select "good" index terms. We argue that for recovering the three document collections the term "blood" is much more useful than the term "case". Indeed, while the term "case" occurs in 253 Medlars documents, 72 CISI documents, and 365 Cranfield documents, the term "blood" occurs in 142 Medlars documents, 0 CISI documents, and 0 Cranfield documents. With each term t we associate a three-dimensional "direction" vector d(t) = (d_0(t), d_1(t), d_2(t))^T, so that d_i(t) is the number of documents in the collection DC_i containing the term t. So, for example, d("case") = (253, 72, 365)^T, and d("blood") = (142, 0, 0)^T. In addition to "blood", terms like "layer" (d("layer") = (6, 0, 358)^T), or "retriev" (d("retriev") = (0, 262, 0)^T) seem to be much more useful than the terms "case", "studi" and "found" with d("studi") = (356, 341, 238)^T, and d("found") = (211, 93, 322)^T respectively.
When only the "combined" collection DC of 3891 documents is available the above-described construction of "direction" vectors is not possible, and we employ a simple term selection algorithm based on term occurrence across a document collection. To exploit statistics of term occurrence throughout the corpus we remove terms that occur in fewer than r sentences across the collection, and denote the remaining terms by slice(r) (r should be collection dependent; the experiments in this chapter are performed with r = 20). The first l best quality terms that belong to slice(r) define the dimension of the vector space model. We denote the frequency of a term t in the document d_j by f_j and measure the quality of the term t by
$$q(t) = \sum_{j=1}^{m} f_j^2 - \frac{1}{m}\left(\sum_{j=1}^{m} f_j\right)^2, \quad (5.3.6)$$
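A small sketch of the quality measure (5.3.6) applied to a toy term-frequency matrix; higher q(t) flags terms whose frequency varies a lot across documents. The toy counts below are our own illustration.

```python
import numpy as np

def term_quality(f):
    """Formula (5.3.6): q(t) = sum_j f_j^2 - (1/m) (sum_j f_j)^2 for one term."""
    m = len(f)
    return np.sum(f ** 2) - (np.sum(f) ** 2) / m

# Rows: terms, columns: documents (toy frequencies).
freq = np.array([
    [5, 0, 0, 4],    # concentrated term -> high quality
    [1, 1, 1, 1],    # uniform term      -> zero quality
])
scores = np.array([term_quality(row.astype(float)) for row in freq])
best_terms = np.argsort(scores)[::-1]
print(scores, best_terms)
```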
suppos 21.875 6 7 9
nevertheless 21.875 6 11 5
retain 21.875 9 4 9
art 21.875 0 20 2
compos 21.875 5 5 12
ago 21.875 2 18 2
elabor 21.875 3 16 3
obviou 21.897 4 9 6
speak 20.886 6 12 3
add 20.886 3 14 4
understood 20.886 2 14 5
pronounc 20.886 18 0 3
pertain 19.897 3 8 9
merit 19.897 1 9 10
provis 19.897 1 18 1
[Figure 5.5: Two-cluster partition {Z_1, Z_2} of the "circular" set Z (left), and induced two-cluster partition of the two-dimensional classic3 projection P (right).]
cluster 0 1000 3 1
cluster 1 8 10 1376
cluster 2 25 1447 21
“empty” documents
cluster 3 0 0 0
sPDDP 228 88 76 68
sPDDP + k-means 100 80 62 62
5.4. Largest eigenvector

Hence if $c_n \ne 0$, then $\lim_{k\to\infty} \dfrac{M^k x}{\|M^k x\|} = v_n$.
1. Set t = 0.
2. Assign initial values a^(t) and h^(t).
3. Normalize the vectors a^(t) and h^(t) so that $\sum_p (a^{(t)}[p])^2 = \sum_p (h^{(t)}[p])^2 = 1$.
4. Set $a^{(t+1)}[p] = \sum_{(q,p)\in E} h^{(t)}[q]$, and $h^{(t+1)}[p] = \sum_{(p,q)\in E} a^{(t+1)}[q]$.
Note that
$$a^{(t+1)} = \frac{A^T h^{(t)}}{\|A^T h^{(t)}\|}, \quad \text{and} \quad h^{(t+1)} = \frac{A a^{(t+1)}}{\|A a^{(t+1)}\|}.$$
This yields $a^{(t+1)} = \frac{A^T A a^{(t)}}{\|A^T A a^{(t)}\|}$ and $h^{(t+1)} = \frac{A A^T h^{(t)}}{\|A A^T h^{(t)}\|}$. Finally
$$a^{(t)} = \frac{(A^T A)^t a^{(0)}}{\|(A^T A)^t a^{(0)}\|}, \quad \text{and} \quad h^{(t)} = \frac{(A A^T)^t h^{(0)}}{\|(A A^T)^t h^{(0)}\|}. \quad (5.4.1)$$
Let v and w be unit eigenvectors corresponding to the maximal eigenvalues of the symmetric matrices A^T A and A A^T respectively. The above arguments lead to the following result:
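A minimal power-iteration sketch in the spirit of (5.4.1): repeated multiplication by A^T A and normalization converges, for a generic starting vector, to the dominant eigenvector. The tolerance-free loop and the names are ours.

```python
import numpy as np

def dominant_eigenvector(M, iterations=200, seed=0):
    """Power iteration for a symmetric positive semidefinite matrix M."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(M.shape[0])
    for _ in range(iterations):
        x = M @ x
        x = x / np.linalg.norm(x)
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0], [0.0, 1.0]])
v = dominant_eigenvector(A.T @ A)          # "authority"-type vector
print(v)
print(np.linalg.eigh(A.T @ A)[1][:, -1])   # exact dominant eigenvector (up to sign)
```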
6 Information theoretic clustering
(when it does not lead to ambiguity, we shall drop the subscript 1 and the superscript n − 1 and denote the set just by S_+). Although each vector a ∈ S_+ can be interpreted as a probability distribution, in an attempt to keep the exposition of the material as simple as possible we shall turn to probabilistic arguments only when the deterministic approach fails to lead to the desired results.
The chapter focuses on k-means clustering of a data set A = {a_1, ..., a_m} ⊂ S_+ with the Kullback–Leibler divergence.
1. f(x) ≤ 0.
2. Since a[j] > 0 for at least one j = 1, ..., n, for convenience we may assume that a[1] > 0. Let 0 ≤ ε ≤ 1. When x_ε is defined by x_ε[1] = ε, x_ε[i] = (1 − ε)/(n − 1), i = 2, ..., n, one has lim_{ε→0} f(x_ε) = −∞, hence sup_{x∈S_+} KL(a, x) = ∞.
3. To find max_{x∈S_+} f(x) we should focus on the S_+ vectors whose coordinates are strictly positive only when the corresponding coordinates of a are strictly positive. Indeed, if a[1] = 0 and x[1] > 0, then f(x) < f(x_1) where
$$x_1[1] = 0, \quad x_1[j] = x[j] + \frac{x[1]}{n-1}, \quad j = 2, \ldots, n.$$
To simplify the exposition we assume that a[j] > 0, j = 1, ..., n. To eliminate the constraint $\sum_{i=1}^{n} x[i] = 1$ we turn to Lagrange multipliers and reduce the constrained optimization problem to the equation
$$\nabla_x\left(f(x) + \lambda\left(1 - \sum_{i=1}^{n} x[i]\right)\right) = 0.$$
$$KL(a, b) = \sum_{i=1}^{n} a[i]\,f\!\left(\frac{b[i]}{a[i]}\right) \ge f\!\left(\sum_{i=1}^{n} a[i]\frac{b[i]}{a[i]}\right) = f\!\left(\sum_{i=1}^{n} b[i]\right) = f(1) = -\log 1 = 0.$$
Problem 6.1.1. Verify that the inequality (6.1.1) becomes an equality if and only if a = λb.
Problem 6.1.2. Use the convex function f(t) = t log t to build a one-line proof of the log sum inequality.
Problem 6.1.3. Use the log sum inequality to show that KL(a, b) is convex with respect to (a, b) (i.e., KL(λa_1 + (1 − λ)a_2, λb_1 + (1 − λ)b_2) ≤ λKL(a_1, b_1) + (1 − λ)KL(a_2, b_2)).
Problem 6.2.1. Show that $c = \frac{a_1 + \cdots + a_m}{m}$ is the unique solution for (6.2.1).
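To illustrate k-means with the Kullback–Leibler divergence, here is a small sketch of the distance KL(a, x) and an assignment step; the convention 0·log(0/·) = 0 is handled by masking, and everything here (names, toy data) is our own illustration.

```python
import numpy as np

def kl_divergence(a, x):
    """KL(a, x) = sum_j a[j] log(a[j]/x[j]) with the convention 0 log 0 = 0."""
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / x[mask])))

def assign(A, centroids):
    """Assign each probability vector to the centroid with the smallest KL(a, c)."""
    return np.array([np.argmin([kl_divergence(a, c) for c in centroids]) for a in A])

A = np.array([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7]])
centroids = np.array([[0.6, 0.25, 0.15], [0.15, 0.25, 0.6]])
print(assign(A, centroids))        # [0 0 1]
# For this divergence the centroid of a cluster is the arithmetic mean (Problem 6.2.1).
print(A[:2].mean(axis=0))
```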
[Figure: "Initial partition" of the three-vector set {a_1, a_2, a_3}.]
cluster 0 1010 6 0
cluster 1 3 4 1387
cluster 2 20 1450 11
“empty” docs
cluster 3 0 0 0
6.3. Numerical experiments
[Figure: partition quality (y-axis) vs. iteration number 0–30 (x-axis).]
and the dotted line does the same for incremental iterations. The fig-
ure shows that the lion’s share of the work is done by the first three
batch iterations. From iteration 4 to iteration 16 values of the objective
function drop due to incremental iterations only. At iterations 17, 18,
and 19 the batch k-means kicks in. For the rest of the run the objective
function changes are due to incremental iterations only.
An inspection reveals that at iteration 4 a vector a was moved from
cluster π3 to cluster π2 by the incremental iteration, and a was “missed”
by the batch iteration 4 because of the infinite distance between a and
centroids c1 and c2 of clusters π1 and π2 respectively (for the very same
reason vectors were “missed” by the batch iterations of the algorithm
at iterations 5 and 19). The reasons for this behavior are described at
the end of Section 6.2.
Note that I(Π_1, Π_2) = KL(p(i, j), p_1(i)p_2(j)); in particular I(Π_1, Π_2) ≥ 0.
Variation of information between two partitions Π_1 and Π_2 is defined by
6.5. Bibliographic notes
7 Clustering with optimization techniques
7.2. Smoothing k-means algorithm
[Figure: the nonsmooth function F(x) = |x|.]
[Figure: "Smoothing" — the nonsmooth function F(x) = |x| and its smooth approximation F_s(x) = (x² + s)^{1/2}.]
$$F_s(x) = -s\sum_{i=1}^{m}\log\left(\sum_{l=1}^{k} e^{-\frac{\|x_l - a_i\|^2}{s}}\right),$$
$$\sum_{i=1}^{m}\frac{(x_l - a_i)\,e^{-\frac{\|x_l - a_i\|^2}{s}}}{\sum_{j=1}^{k} e^{-\frac{\|x_j - a_i\|^2}{s}}} = 0. \quad (7.2.2)$$
$$\rho^{il}(x, s) = \frac{e^{-\frac{\|x_l - a_i\|^2}{s}}}{\sum_{j=1}^{k} e^{-\frac{\|x_j - a_i\|^2}{s}}}. \quad (7.2.3)$$
and yields
$$x_l = \frac{\sum_{i=1}^{m}\rho^{il}(x, s)\,a_i}{\sum_{j=1}^{m}\rho^{jl}(x, s)}. \quad (7.2.5)$$
$$x_l(t + 1) = T_l(x(t)), \quad l = 1, \ldots, k. \quad (7.2.9)$$
1. Set t = 0.
2. For each l = 1, ..., k compute
$$x_l(t + 1) = T_l(x(t)) = \sum_{i=1}^{m}\lambda^{il}(x(t), s)\,a_i.$$
Note that formula (7.2.7) indicates that x_l belongs to the interior of the convex hull of the data set A. In the next section we discuss convergence of the sequence {F_s(x(t))}_{t=0}^∞.
The above observations indicate that, exactly like k-means, smoka may
generate less than k nonempty clusters from initial k distinct centroids.
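A minimal sketch of the smoka fixed-point iteration just described, combining the soft membership weights (7.2.3) with the weighted-mean update (7.2.5)/(7.2.9); the stopping rule and names are ours.

```python
import numpy as np

def smoka(A, X, s=0.1, tol=1e-3, max_iter=100):
    """Smoothed k-means: iterate x_l(t+1) = T_l(x(t)) until F_s stops decreasing."""
    def objective(X):
        d = ((A[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        return -s * np.log(np.exp(-d / s).sum(axis=1)).sum()

    prev = objective(X)
    for _ in range(max_iter):
        d = ((A[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # (m, k) distances
        w = np.exp(-d / s)
        rho = w / w.sum(axis=1, keepdims=True)                   # (7.2.3)
        X = (rho.T @ A) / rho.sum(axis=0)[:, None]               # (7.2.5)
        cur = objective(X)
        if prev - cur < tol:
            break
        prev = cur
    return X

A = np.array([[0.0], [2.0], [3.0]])
print(smoka(A, X=np.array([[1.0], [3.0]]), s=0.1))   # centroids near 0 and 2.5
```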
[Figure: the data points a_1, a_2 and the initial centroids x_1, x_2.]
is missed by the algorithm. The example illustrates that the final partition
generated by smoka depends on the initial centroids.
7.3. Convergence
1. T_l is a continuous map.
2. x(t) = x if and only if T_l(x(t)) = x_l(t), l = 1, ..., k (here x is a stationary point of F_s).
3. For each l = 1, ..., k, t = 1, 2, ..., the vector x_l(t) belongs to the compact convex set conv{a_1, ..., a_m}.
Proof
$$\rho^{il}(x(t), s) = \frac{e^{-\frac{w_t^{il}}{s}}}{\sum_{j=1}^{k} e^{-\frac{w_t^{ij}}{s}}} \quad (7.3.7)$$
Proof: The first two items above follow from the fact that the sequence {x(t)}_{t=0}^∞ belongs to a compact set in R^N, and the monotonicity of the bounded below function F_s is given by Proposition 7.3.2.
Let {x(t_i)} be a convergent subsequence of {x(t)}_{t=0}^∞. Denote lim_{i→∞} x(t_i) by x. Since lim_{t→∞} F_s(x(t)) exists one has
Finally, we obtain
$$F(x) - F_s(x) = s\sum_{i=1}^{m}\log\frac{\sum_{l=1}^{k} e^{-\frac{\|x_l - a_i\|^2}{s}}}{\max_{1\le l\le k} e^{-\frac{\|x_l - a_i\|^2}{s}}} \le s\,m\log k \quad (7.3.11)$$
as requested.
We now focus on the upper bound sm log k for the nonnegative expres-
sion F(x) − Fs (x) and pick i ∈ {1, . . . , m}. To simplify the exposition
we assume that
113
Writing d_l for ‖x_l − a_i‖² and assuming d_1 ≤ d_2 ≤ ··· ≤ d_k,
$$-s\log\sum_{l=1}^{k} e^{-d_l/s} = -s\log\left[e^{-d_1/s}\left(1 + e^{\frac{d_1-d_2}{s}} + \cdots + e^{\frac{d_1-d_k}{s}}\right)\right] = d_1 - s\log\left(1 + e^{\frac{d_1-d_2}{s}} + \cdots + e^{\frac{d_1-d_k}{s}}\right) \le \min_{1\le l\le k}\|x_l - a_i\|^2 - s\log\left(1 + e^{\frac{d_1-d_2}{s}} + \cdots + e^{\frac{d_1-d_k}{s}}\right).$$
We remark that in many cases the first inequality in (7.3.12) is strict (i.e., ‖x_1 − a_i‖² < ‖x_2 − a_i‖²), and as s → 0 the expression $\varepsilon_i(s) = \log\left(1 + e^{\frac{d_1-d_2}{s}} + \cdots + e^{\frac{d_1-d_k}{s}}\right)$ is much smaller than log k. Since
$$-s\log\sum_{l=1}^{k} e^{-\frac{\|x_l - a_i\|^2}{s}} = \min_{1\le l\le k}\|x_l - a_i\|^2 - s\,\varepsilon_i(s), \quad (7.3.13)$$
one has
$$F_s(x) = -s\sum_{i=1}^{m}\log\sum_{l=1}^{k} e^{-\frac{\|x_l - a_i\|^2}{s}} = \sum_{i=1}^{m}\min_{1\le l\le k}\|x_l - a_i\|^2 - s\sum_{i=1}^{m}\varepsilon_i(s) = F(x) - s\sum_{i=1}^{m}\varepsilon_i(s).$$
7.4. Numerical experiments
changes Π^(0) into the optimal partition Π^(1) = {{0}, {2, 3}}. An application of smoka with parameters s = 0.1 and tol = 0.001 to the partition Π^(0) produces centroids x_1 = 0 and x_2 = 2.5, thus recovering Π^(1).
Next we report final partition results for the three clustering algorithms (quadratic batch k-means, quadratic k-means, and smoka) and the following document collections:
cluster 0 907 91 13
cluster 1 120 7 1372
cluster 2 6 1362 13
“empty” documents
cluster 3 0 0 0
cluster 0 984 50 12
cluster 1 42 2 1368
cluster 2 7 1408 18
“empty” documents
cluster 3 0 0 0
cluster 0 1018 21 19
cluster 1 5 1 1356
cluster 2 10 1438 23
“empty” documents
cluster 3 0 0 0
cluster 0 1019 22 15
cluster 1 5 1 1362
cluster 2 9 1437 21
“empty” documents
cluster 3 0 0 0
iterations 3 87 7
Q 3612.61 3608.06 3605.5 3605.5
iterations builds a partition with the confusion matrix given in Table 7.4.
While the quality of the final partitions generated by k-means and smoka is almost identical, the number of iterations performed by smoka is significantly smaller. This information is collected in Table 7.5. To highlight smoka's performance, Figure 7.4 compares the quality of the first
[Figure 7.4: objective function value vs. iteration number.]
iterations 9 550 31
Q 3379.17 3365.3 3341.06 3341.07
iterations 11 473 15
Q 1758.39 1737.73 1721.9 1726.29
partition Π^(0) with PDDP, and apply batch k-means, k-means, and smoka to Π^(0). The clustering results are reported in Table 7.12. Time and time again smoka performs about the same number of iterations as batch k-means does, and generates partitions of about the same quality as k-means does.
iterations 47 5862 51
Q 18156.5 17956.1 17808 17808
7.5. Bibliographic notes
8 k-means clustering with divergences
[Figure: the Bregman distance D_ψ(x, y) as the vertical gap between ψ(x) and the tangent line to ψ at y.]
8.1. Bregman distance
we obtain
$$\overleftarrow{D}_\psi(x, y) = D_\psi(y, x) = \frac{\nu}{2}\|y - x\|^2 + \mu\sum_{j=1}^{n}\left(y[j]\log\frac{y[j]}{x[j]} + x[j] - y[j]\right). \quad (8.1.5)$$
While in general $\overleftarrow{D}_\psi(x, y)$ given by (8.1.3) is not necessarily convex in x, when ψ(x) is given either by ‖x‖² or by $\sum_{j=1}^{n}\left(x[j]\log x[j] - x[j]\right)$ the resulting functions $\overleftarrow{D}_\psi(x, y)$ are strictly convex with respect to the first variable.
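A small sketch of the (ν, μ) distance-like function (8.1.5), which blends the squared Euclidean distance with the reversed-order KL term; the handling of zero coordinates and the toy vectors are our own choices.

```python
import numpy as np

def nu_mu_divergence(x, y, nu=1.0, mu=1.0):
    """Formula (8.1.5): (nu/2)||y - x||^2 + mu * sum(y log(y/x) + x - y)."""
    quad = 0.5 * nu * np.sum((y - x) ** 2)
    mask = y > 0
    kl = np.sum(y[mask] * np.log(y[mask] / x[mask])) + np.sum(x) - np.sum(y)
    return quad + mu * kl

x = np.array([0.5, 0.3, 0.2])
y = np.array([0.4, 0.4, 0.2])
print(nu_mu_divergence(x, y, nu=0.0, mu=1.0))   # pure KL-type term
print(nu_mu_divergence(x, y, nu=2.0, mu=0.0))   # pure quadratic term ||y - x||^2
```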
Lemma 8.1.1. For any three points a, b ∈ int(dom ψ) and c ∈ dom ψ the following identity holds true:

given by (8.1.3):

Theorem 8.1.1. If $z = \frac{a_1 + \cdots + a_m}{m}$, then $\sum_{i=1}^{m} D_\psi(a_i, z) \le \sum_{i=1}^{m} D_\psi(a_i, x)$.

1. Bregman distances,
2. Bregman distances with reversed order of variables.

8.2. ϕ-divergences

¹ Note that this distance-like function is not necessarily convex with respect to x.
Problem 8.2.1. Show that dϕ(x, y) is convex with respect to x and with respect to y.²
Problem 8.2.2. Show that dϕ(x, y) is convex with respect to (x, y).

² In [115, 136] the function d(x, y) is required to be convex with respect to x. Csiszar divergences are natural candidates for distance-like functions satisfying this requirement.
word/term does not occur in both documents and motivates the convention 0 log 0/0 ≡ 0). The Kullback–Leibler divergence is one of the most popular distances used in data mining research.
The second example ϕ₂ also yields the KL distance but with reversed order of variables, i.e.,
$$d_{\varphi_2}(x, y) \equiv KL(y, x) = \sum_{j=1}^{n}\left(y[j]\log\frac{y[j]}{x[j]} + x[j] - y[j]\right) \quad (8.2.4)$$
(compare this formula with (8.1.5) with ν = 0 and μ = 1). The centroid c = c(π) of a p element set π is given by the arithmetic mean, that is $c = \frac{1}{p}\sum_{a\in\pi} a$.
Note that the obtained distance is identical to the one given by (8.1.5).
Show that the only kernel that generates identical separable Bregman and Csiszar divergences is the entropy kernel c[t log t − t + 1], c > 0.
$$\Delta_k = d(c_i, a) - d(c_j, a). \quad (8.3.1)$$
$$\Delta = \left[Q(\pi_i) - Q(\pi_i^-)\right] + \left[Q(\pi_j) - Q(\pi_j^+)\right]$$
$$= \left[\sum_{a'\in\pi_i^-} d(c_i, a') - \sum_{a'\in\pi_i^-} d(c_i^-, a')\right] + d(c_i, a) + \left[\sum_{a'\in\pi_j^+} d(c_j, a') - \sum_{a'\in\pi_j^+} d(c_j^+, a')\right] - d(c_j, a).$$
Hence Δ − Δ_k is given by
$$\left[\sum_{a'\in\pi_i^-} d(c_i, a') - \sum_{a'\in\pi_i^-} d(c_i^-, a')\right] + \left[\sum_{a'\in\pi_j^+} d(c_j, a') - \sum_{a'\in\pi_j^+} d(c_j^+, a')\right]. \quad (8.3.2)$$
$$\Delta - \Delta_k \ge 0. \quad (8.3.3)$$
$$\cdots - \sum_{a'\in\pi^-}\left[\psi(a') - \psi(c^-) - \nabla\psi(c^-)(a' - c^-)\right] + d(c, a) = (p-1)\left[\psi(c^-) - \psi(c)\right] - \nabla\psi(c)\left(\sum_{a'\in\pi^-} a' - (p-1)c\right) + d(c, a).$$
we get
$$\cdots - \sum_{a'\in\pi^+}\left[\psi(a') - \psi(c^+) - \nabla\psi(c^+)(a' - c^+)\right] - d(c, a) = (p+1)\left[\psi(c^+) - \psi(c)\right] - \nabla\psi(c)\left(\sum_{a'\in\pi^+} a' - (p+1)c\right) - d(c, a).$$
we obtain
8.4. BIRCH-type clustering with entropy-like distances
these special cases one needs to know the clusters' centroids c(π_i), and only two additional scalars for each cluster. Hence, an application of a "BIRCH" type approach to k-means may lead to significant memory savings.
We note that if π = {a_1, ..., a_p}, c(π) is the centroid, and x is a vector, then
$$\sum_{i=1}^{p} d(x, a_i) = \sum_{i=1}^{p} d(c(\pi), a_i) + \left[\sum_{i=1}^{p} d(x, a_i) - \sum_{i=1}^{p} d(c(\pi), a_i)\right] = Q(\pi) + \left[\sum_{i=1}^{p} d(x, a_i) - \sum_{i=1}^{p} d(c(\pi), a_i)\right]. \quad (8.4.1)$$
Proof:
$$Q(A) = \sum_{i=1}^{k}\sum_{a\in\pi_i} d(c, a) = \sum_{i=1}^{k}\left[Q(\pi_i) + m_i d(c, c_i)\right]$$
$$= \sum_{i=1}^{k} Q(\pi_i) + \sum_{i=1}^{k} m_i\left[\psi(c_i) - \psi(c) - \nabla\psi(c)(c_i - c)\right]$$
$$= \sum_{i=1}^{k}\left\{Q(\pi_i) + m_i\left[\psi(c_i) - \psi(c)\right]\right\} - \nabla\psi(c)\left(\sum_{i=1}^{k} m_i c_i - c\sum_{i=1}^{k} m_i\right)$$
$$= \sum_{i=1}^{k}\left\{Q(\pi_i) + m_i\left[\psi(c_i) - \psi(c)\right]\right\}.$$
Problem 8.4.1. Let π^B = {b_1, ..., b_{p−1}, b}, π^{B−} = {b_1, ..., b_{p−1}}, and $m^- = \sum_{i=1}^{p-1} m(b_i)$. Show that
$$Q(\pi^{B-}) - Q(\pi^B) = q(b) + m^- d(c, c^-) - m(b)\,d(c, b),$$
where c = c(π^B) and c^− = c(π^{B−}).

Problem 8.4.2. Let π^B = {b_1, ..., b_p}, π^{B+} = {b_1, ..., b_p, b}, and $m^+ = \sum_{i=1}^{p} m(b_i) + m(b)$. Show that
$$Q(\pi^{B+}) - Q(\pi^B) = -q(b) + m^+ d(c, c^+) + m(b)\,d(c, b),$$
where c = c(π^B) and c^+ = c(π^{B+}).
then
$$Q(A) = \sum_{i=1}^{k}\{Q(\pi_i) + m_i d(c, c_i)\} = \sum_{i=1}^{k} Q(\pi_i) + \sum_{i=1}^{k} m_i e^T c_i - m\,e^T c \quad (8.4.6)$$
where e is the vector of ones.
Proof:
$$Q(A) = \sum_{a\in A} d(c, a) = \sum_{i=1}^{k}\sum_{a\in\pi_i} d(c, a) = \sum_{i=1}^{k}\left[\sum_{a\in\pi_i} d(c_i, a) + m_i d(c, c_i)\right] = \sum_{i=1}^{k}\{Q(\pi_i) + m_i d(c, c_i)\}. \quad (8.4.7)$$
Further
$$\sum_{i=1}^{k} m_i d(c, c_i) = \sum_{i=1}^{k} m_i\sum_{j=1}^{n}\left(c[j]\log\frac{c[j]}{c_i[j]} - c[j] + c_i[j]\right) = \sum_{j=1}^{n}\sum_{i=1}^{k} m_i\left(c[j]\log\frac{c[j]}{c_i[j]} - c[j] + c_i[j]\right)$$
$$= \sum_{j=1}^{n} c[j]\log\frac{(c[j])^m}{\prod_{i=1}^{k}(c_i[j])^{m_i}} - \sum_{i=1}^{k}\sum_{j=1}^{n} m_i c[j] + \sum_{i=1}^{k}\sum_{j=1}^{n} m_i c_i[j] = \sum_{i=1}^{k} m_i e^T c_i - m\,e^T c.$$
The formula shows that in order to employ the “BIRCH type” cluster-
ing procedure one needs to have the following information concerning
each cluster π :
1. centroid c (π ),
2. cluster size |π |,
3. e^T c(π).

Cluster 0 1010 6 0
Cluster 1 2 4 1387
Cluster 2 21 1450 11
"Empty" documents
Cluster 3 0 0 0
8.5. Numerical experiments with (ν, μ) k-means
Cluster 0 1010 5 1
Cluster 1 3 6 1384
Cluster 2 20 1449 13
“Empty” documents
Cluster 3 0 0 0
Cluster 0 1011 6 2
Cluster 1 8 12 1386
Cluster 2 14 1442 10
“Empty” documents
Cluster 3 0 0 0
Table 8.4 summarizes clustering results for the sPDDP algorithm and
the combinations of “sPDDP+(ν, μ) k-means” algorithm for the three
selected values for (ν, μ) and different choices of index terms. Note
that zero document vectors are created when the number of selected
terms is less than 300. Table 8.4 indicates consistent superiority of the
“sPDDP + (0, 1) k-means” algorithm. We next show by an example
that this indication is not necessarily always correct. Figure 8.2 shows
the number of misclassified documents for 600 selected terms and 25
values of the (ν, μ) pair. While μ is kept 1, ν varies from 100 to 2500
with step 100. The graph indicates the best performance for ν/μ = 500, 600, 700, 1000, and 1100.
We complete the section with clustering results generated by the
“sPDDP + quadratic k-means” and “sPDDP + spherical k-means”.
The algorithms are applied to the three document collections DC0,
DC1, and DC2. We use the same vector space construction as for the
“sPDDP+(ν, μ) k-means” procedure, but do not change the document
vectors unit L2 norm at the second stage of the clustering scheme. The
results of the experiment are summarized in Table 8.5 for different
choices of index terms.
[Figure 8.2: Documents misclassified by the (ν, μ) algorithm — misclassification vs. ν/μ ratio.]
8.6. Smoothing with entropy-like distances
$$\nabla F_s(x) = 0, \quad (8.6.3)$$
$$\sum_{i=1}^{m}\frac{\nabla_{x_l} d(x_l, a_i)\,e^{-\frac{d(x_l, a_i)}{s}}}{\sum_{j=1}^{k} e^{-\frac{d(x_j, a_i)}{s}}} = 0. \quad (8.6.4)$$
$$\rho^{il}(x, s) = \frac{e^{-\frac{d(x_l, a_i)}{s}}}{\sum_{j=1}^{k} e^{-\frac{d(x_j, a_i)}{s}}}. \quad (8.6.5)$$
$$\sum_{i=1}^{m}\nabla_{x_l} d(x_l, a_i)\,\rho^{il}(x, s) = 0. \quad (8.6.6)$$
We now compute the partial derivative $\sum_{i=1}^{m}\frac{\partial}{\partial x[j]} d(x_l, a_i)\,\rho^{il}(x, s)$ and obtain the following quadratic equation for x_l[j]:
$$\nu\left[\sum_{i=1}^{m}\rho^{il}(x, s)\right](x_l[j])^2 + \left[\mu\sum_{i=1}^{m}\rho^{il}(x, s) - \nu\sum_{i=1}^{m} a_i[j]\rho^{il}(x, s)\right]x_l[j] - \mu\sum_{i=1}^{m} a_i[j]\rho^{il}(x, s) = 0.$$
We denote $\frac{\rho^{il}(x, s)}{\sum_{p=1}^{m}\rho^{pl}(x, s)}$ by $\lambda^{il}(x, s)$ and obtain the following equation for x_l:
$$x_l = \sum_{i=1}^{m}\lambda^{il}(x, s)\,a_i, \quad \lambda^{il}(x, s) > 0, \quad \sum_{i=1}^{m}\lambda^{il}(x, s) = 1. \quad (8.6.7)$$
8.7. Numerical experiments with (ν, μ) smoka
Cluster 0 7 1379 14
Cluster 1 113 7 1372
Cluster 2 913 74 12
“Empty” documents
Cluster 3 0 0 0
Cluster 0 1018 21 19
Cluster 1 5 1 1356
Cluster 2 10 1438 23
“Empty” documents
Cluster 3 0 0 0
Cluster 0 1019 22 15
Cluster 1 5 1 1362
Cluster 2 9 1437 21
“Empty” documents
Cluster 3 0 0 0
Algorithm          PDDP     (2,0) batch k-means   (2,0) k-means   (2,0) smoka
Iterations                  3                     87              7
Misclassifications 250      131                   79              73
Q                  3612.61  3608.1                3605.5          3605.5

Algorithm          PDDP     (0,1) batch k-means   (0,1) k-means   (0,1) smoka
Iterations                  4                     100             4
Misclassifications 250      117                   43              66
Q                  44138.9  43759.1               43463.8         43496.5
Algorithm   PDDP     (20,1) batch k-means   (20,1) k-means   (20,1) smoka
Iterations           0                      246              5
Misclassifications 250  250                 39               66
Q           80265.1  80265.1                79520.4          79561.3

Algorithm   PDDP     (2,0) batch k-means    (2,0) k-means    (2,0) smoka
Iterations           47                     5862             51
Q           18156.5  17956                  17808            17810

Algorithm   PDDP     (0,1) batch k-means    (0,1) k-means    (0,1) smoka
Iterations           18                     7737             23
Q           261907   256472                 250356           252708
Algorithm   PDDP     (20,1) batch k-means   (20,1) k-means   (20,1) smoka
Iterations           1                      9988             25
Q           443472   443427                 429921           432781

[Figure: "(0,1) smoka vs. (0,1) k-means" — objective function value vs. iteration.]
8.8. Bibliographic notes
9 Assessment of clustering results
1. Compute $I(a) = \frac{1}{|\pi_i|}\sum_{x\in\pi_i} d(x, a)$, the average distance from a to

9.2. External criteria
1. Confusion matrix
We assume now that k_m = k. If, for example, c_{1r_1} = max{c_{11}, ..., c_{1k}}, then most of the elements of π_1 belong to π_{r_1}^{min} and one could claim that the other elements of π_1 are "misclassified". Repetition of this argument for clusters π_i, i = 2, ..., k, leads to the numbers c_{2r_2}, ..., c_{kr_k}. The total number of "misclassifications" is, therefore, $\sum_{i,j=1}^{k} c_{ij} - \sum_{i=1}^{k} c_{ir_i} = m - \sum_{i=1}^{k} c_{ir_i}$. The fraction $0 \le \frac{m - \sum_{i=1}^{k} c_{ir_i}}{m} < 1$ indicates a measure of "disagreement" between Π and the optimal partition Π^min. We shall call this fraction cm(Π). When the partitions coincide, cm(Π) vanishes. Values of cm(Π) near 1 indicate a high degree of disagreement between the partitions.
2. Entropy
The entropy e_i = e(π_i) of cluster π_i is computed as $e_i = -\sum_{j=1}^{k_m}\left(\frac{c_{ij}}{|\pi_i|}\right)\log\left(\frac{c_{ij}}{|\pi_i|}\right)$. The entropy e = e(Π) of the partition Π is given by $e = \sum_{i=1}^{k}\frac{|\pi_i|}{m}e_i$.
3. Purity
The purity p_i = p(π_i) of cluster π_i is given by $p_i = \max\left\{\frac{c_{i1}}{|\pi_i|}, \ldots, \frac{c_{ik_m}}{|\pi_i|}\right\}$, so that 0 ≤ p_i ≤ 1 and high values for p_i indicate that most of the vectors in π_i come from the same cluster of the optimal partition. The overall purity p = p(Π) of the partition is $p = \sum_{i=1}^{k}\frac{|\pi_i|}{m}p_i$, and p = 1 corresponds to the optimal partition. When k_m = k one has $p(\Pi) = \frac{1}{m}\sum_{i=1}^{k} c_{ir_i} = 1 - cm(\Pi)$; in other words p(Π) + cm(Π) = 1. (A small computational sketch of these confusion-matrix-based measures is given after this list.)
4. Precision, recall and F-measure
Precision and recall are standard IR measures. Suppose a document set D is retrieved from a set A in response to a query. Suppose that only a subset D_r ⊆ D contains documents relevant to the query, while the subset A_r ⊆ A contains all documents relevant to the query. Precision is the fraction of correctly retrieved objects in the retrieved set, that is
$$\text{precision} = \frac{|D_r|}{|D|}.$$
Recall is the fraction of correctly retrieved objects out of all documents relevant to the query:
$$\text{recall} = \frac{|D_r|}{|A_r|}.$$
If cluster π_i is considered as the result of a query for a particular "label" j, then
$$\text{precision}(i, j) = \frac{c_{ij}}{|\pi_i|}, \quad \text{and} \quad \text{recall}(i, j) = \frac{c_{ij}}{|\pi_j^{min}|}.$$
The F-measure combines both precision and recall. The F-measure of cluster π_i with respect to "label" j is
$$F(i, j) = 2\,\frac{\text{precision}(i, j)\times\text{recall}(i, j)}{\text{precision}(i, j) + \text{recall}(i, j)} = 2\,\frac{c_{ij}}{|\pi_i| + |\pi_j^{min}|}.$$
5. Mutual information
Mutual information I(Π_1, Π_2) between two partitions is given by Definition 6.4.2. Application of the mutual information formula to the
6. Variation of Information
Variation of Information as defined by (6.4.1) is a distance function on the set of partitions and can be used to evaluate the proximity between Π and Π^min.
7. Rand statistic and Jaccard coefficient
Consider a pair (a_i, a_j) so that a_i ∈ π ∈ Π, and a_i ∈ π^min ∈ Π^min. The vector a_j may or may not belong to π and/or π^min. The set A × A can, therefore, be divided into four subsets which we denote by (A × A)_{ij}:
and
(a_j ∈ π and a_j ∈ π^min)
or
(A × A)_{00} consists of all pairs (a_i, a_j) so that a_i and a_j belong to different clusters of the partition Π and a_i and a_j belong to different clusters of the partition Π^min.
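A compact sketch of the confusion-matrix-based measures above (cm, entropy, purity) for a labeled toy partition; everything here, including the variable names, is our own illustration of the definitions.

```python
import numpy as np

def confusion_matrix(labels, truth, k, k_m):
    C = np.zeros((k, k_m), dtype=int)
    for l, t in zip(labels, truth):
        C[l, t] += 1
    return C

def external_measures(C):
    m = C.sum()
    sizes = C.sum(axis=1)
    cm = 1.0 - C.max(axis=1).sum() / m                 # misclassification fraction
    with np.errstate(divide="ignore", invalid="ignore"):
        P = C / sizes[:, None]
        ent = np.where(P > 0, -P * np.log(P), 0.0).sum(axis=1)
    entropy = (sizes / m * ent).sum()
    purity = (sizes / m * C.max(axis=1) / sizes).sum()
    return cm, entropy, purity

labels = np.array([0, 0, 0, 1, 1, 1])                   # computed partition
truth  = np.array([0, 0, 1, 1, 1, 1])                   # "optimal" partition
C = confusion_matrix(labels, truth, k=2, k_m=2)
print(C)
print(external_measures(C))   # cm = 1/6, purity = 5/6
```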
10 Appendix: Optimization and linear algebra background
Proof
$$v_1^T M v_2 = v_1^T(\lambda_2 v_2) = \lambda_2 v_1^T v_2, \quad \text{and} \quad v_2^T M v_1 = v_2^T(\lambda_1 v_1) = \lambda_1 v_2^T v_1.$$
Since $(v_1^T M v_2)^T = v_2^T M v_1$ one has $\lambda_2 v_1^T v_2 = \lambda_1 v_2^T v_1$, and $(\lambda_1 - \lambda_2)v_1^T v_2 = 0$. Due to the assumption λ_1 − λ_2 ≠ 0, this yields $v_1^T v_2 = 0$, and completes the proof.
$$Mx = \lambda x, \quad x \ne 0. \quad (10.1.1)$$
and the left multiplication of the first equation by wT and the left
multiplication of the second equation by vT yield
Problem 10.1.1. Let M be a real symmetric matrix. Show that there are
n real eigenvalues
10.2. Lagrange multipliers
when h is small. Equation (10.2.2) shows that when ∇f(x_0) ≠ 0 the space R^n is partitioned into two half spaces
so that
(see Figure 10.1). Let now x_0 be the point of the level set g(x) = 0 where the minimum of f(x) is attained (see Figure 10.2). We assume that the gradients ∇g(x_0) and ∇f(x_0) are two nonzero and nonproportional vectors (see Figure 10.3). We now focus on vectors x_0 + h that belong to the level set g(x) = 0 with h "small." Note that ∇g(x_0)h ≈ 0, and, due to the assumption about ∇g(x_0) and ∇f(x_0), ∇f(x_0)h ≠ 0. Hence there is a "small" h ≠ 0 such that ∇f(x_0)h < 0, h ∈ H_−, and f(x_0 + h) < f(x_0). The assumption that ∇g(x_0) and ∇f(x_0) are linearly independent leads to a contradiction. We, therefore, conclude that the vectors ∇g(x_0) and ∇f(x_0) are proportional, i.e. there is a scalar λ such that
[Figure 10.1: the level set g(x) = 0, the gradient ∇f(x_0) at x_0, and the half-spaces H_+ and H_−.]
∇ F(x, λ) = 0. (10.2.4)
∃x such that f (x) < +∞, and ∀x one has f (x) > −∞.
[Figure 10.2: "Tangent plane" — the level set g(x) = 0 and the gradient ∇g(x_0) at x_0.]
[Figure 10.3: "Nonoptimal point" — the level set g(x) = 0 with nonproportional gradients ∇f(x_0) and ∇g(x_0) at x_0.]
[Figure: "Critical point" — the level set g(x) = 0 with proportional gradients ∇f(x_0) and ∇g(x_0) at x_0.]
10.3. Elements of convex analysis
Problem 10.3.1. Let
$$f(x) = \begin{cases} \sum_{i=1}^{n} x[i]\log x[i] & \text{if } x \in \mathbb{R}^n_+, \\ +\infty & \text{otherwise.} \end{cases}$$
Compute f*(y) and prove that $g(y) = \log\left(\sum_{i=1}^{n} e^{y[i]}\right)$ is convex.
Proof
[Figure: the two points a and b in the plane.]
This shows that (10.3.7) holds for each γ ≥ T = min_{i≥N} t_i, and completes the proof.
[Figure 10.6: "Independent directions".]
(see Figure 10.6). Analogously one can show that D(c2 ) ⊆ D(c1 ). This
completes the proof.
From now on we shall denote D(c) just by D. Due to the lemma one
has d + C ⊆ clC for each d ∈ D. Simple continuity arguments show that
are two nonempty level sets of f, then the recession cones (C_1)_∞ and (C_2)_∞ are equal.
$$(x_i, \mu_i) \in \operatorname{epi} f \ \text{and}\ t_i \to +\infty \ \text{so that}\ \lim_{i\to\infty}\frac{(x_i, \mu_i)}{t_i} = (x, \mu).$$
If μ^+ > μ, then
1. f_∞ is positively homogeneous,
2. f_∞ is lower semicontinuous.
[Figure: the epigraph epi f with the points (z_i, λ_i) and (x, μ).]
$$t_i \to +\infty, \quad \text{and} \quad \frac{(z_i, \lambda_i)}{t_i} \to (x, \mu) \ \text{as}\ i \to \infty. \quad (10.3.11)$$
Denote $\frac{z_i}{t_i}$ by $y_i$ and $\frac{1}{t_i}$ by $s_i$ and observe that
1. $y_i \to x$, $s_i \to 0$;
2. $f\left(\frac{y_i}{s_i}\right) = f(z_i) \le \lambda_i$;
3. $\liminf s_i f\left(\frac{y_i}{s_i}\right) = \liminf \frac{f(z_i)}{t_i} \le \liminf \frac{\lambda_i}{t_i}$, and $\frac{\lambda_i}{t_i} \to \mu$.
This shows that lim inf s f ( ys ) ≤ f∞ (x). To complete the proof we have
to show that the strict inequality is not possible. Assume the contrary,
The sequence (y_i/s_i, f(y_i/s_i)) ∈ epi f, and the vector lim_{i→∞} (y_i, s_i f(y_i/s_i)) = (x, α) ∈ (epi f)∞. This contradicts the definition of f∞(x) (see Definition 10.3.8), and completes the proof.
f∞(x) = lim inf { s Σ_{j=1}^n √(1 + (y[j]/s)²) : s → 0, y → x } = Σ_{j=1}^n |x[j]| = ‖x‖₁.
n
Example 10.3.2. If f (x) = log j=1 ex[ j] , then
n
y[ j]
f∞ (x) = lim inf s log e s : s → 0, y → x = max x[ j].
j
j=1
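Both limits can be observed numerically. The sketch below is an added illustration (it assumes NumPy): it evaluates s·f(x/s) for the two functions above at a fixed x and a decreasing sequence of values s, and the values approach ‖x‖₁ and max_j x[j] respectively.

```python
import numpy as np

x = np.array([1.0, -2.0, 0.5])

def f1(y):                       # the first function above: sum of sqrt(1 + y[j]^2)
    return np.sqrt(1.0 + y ** 2).sum()

def f2(y):                       # the function of Example 10.3.2: log-sum-exp
    m = y.max()
    return m + np.log(np.exp(y - m).sum())

for s in [1.0, 0.1, 0.01, 0.001]:
    print(f"s = {s:6}:  s*f1(x/s) = {s * f1(x / s):.4f},  s*f2(x/s) = {s * f2(x / s):.4f}")

print("||x||_1    =", np.abs(x).sum())   # limit of s*f1(x/s) as s -> 0
print("max_j x[j] =", x.max())           # limit of s*f2(x/s) as s -> 0
```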
10.3.4. Smoothing
Let f1 , . . . , fm be real functions defined in Rn , and f be a convex
function in Rm. We shall be concerned with the function ψ(x) =
f∞ ( f1 (x), . . . , fm(x)) and the optimization problem
Table 10.1 indicates that for the functions given in Example 10.3.1 and
Example 10.3.2 the smooth function ψs approximates the nondifferen-
tiable function ψ and s f(y/s) ≥ f∞(y). Denote by x and xs minimizers of ψ(x) and ψs(x), respectively. The following result describes the approximation of ψ by ψs.
and
0 ≤ ψ(xs) − ψ(x) ≤ s · f(0).  (10.3.17)
Proof: Note that the left side of (10.3.16) follows from (10.3.15). Due to Theorem 10.3.2
f(f1(x)/s, . . . , fm(x)/s) − (1/s) f∞(f1(x), . . . , fm(x)) ≤ f(0).
In other words, ψs(x) − ψ(x) ≤ s · f(0) for each x.
1. Due to (10.3.15) one has ψ(xs) ≤ ψs(xs); this implies the left inequality.
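For the log-sum-exp function of Example 10.3.2, taking fi(x) = x[i] for illustration, one has ψ(x) = max_i x[i], ψs(x) = s log Σ_i e^{x[i]/s}, and f(0) = log m. The short sketch below is an added illustration (assuming NumPy); it checks the pointwise bound 0 ≤ ψs(x) − ψ(x) ≤ s · f(0) at random points.

```python
import numpy as np

def lse(y):                     # f = log-sum-exp, so f(0) = log m
    m = y.max()
    return m + np.log(np.exp(y - m).sum())

def psi(x):                     # psi(x) = max_i x[i]  (the nondifferentiable function)
    return x.max()

def psi_s(x, s):                # smooth approximation: s * f(x / s)
    return s * lse(x / s)

rng = np.random.default_rng(2)
m, s = 4, 0.05
bound = s * np.log(m)           # s * f(0)
for _ in range(5):
    x = rng.normal(size=m)
    gap = psi_s(x, s) - psi(x)
    print(f"gap = {gap:.6f}, bound = {bound:.6f}, within bounds: {0.0 <= gap <= bound}")
```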
The exposition of the linear algebra results follows [103, 110]. The
discussion of Lagrange multipliers follows [112]. For recession and
asymptotic cones one can consult [114]. In this chapter we closely follow the exposition of the relevant material given in [5]. Smoothing results are
reported in [12].
11 Solutions to selected problems
Problem 2.1.6. True or False? If {π1^l, π2^l} is the optimal partition, then x_l ≤ c(A) ≤ x_{l+1}.
Problem 2.2.2. Show that the k-means clustering algorithm does not
necessarily lead to the optimal partition.
1. Π^(0) = {{0, 1, 1 + ε}, {3}} with Q(Π^(0)) = (12 + 8ε + 18ε²)/18.
2. Π^(1) = {{0, 1}, {1 + ε, 3}} with Q(Π^(1)) = (18 − 18ε + 9ε²)/18.
3. Π^(2) = {{0}, {1, 1 + ε, 3}} with Q(Π^(2)) = (12 − 12ε + 18ε²)/18.
Hence, when ε is a small positive number (for example 0 < ε < 0.05), one has Q(Π^(2)) < Q(Π^(0)) < Q(Π^(1)).
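The conclusion of Problem 2.2.2, that the batch k-means iteration can stop at a partition that is not optimal, is easy to reproduce on a toy example. The sketch below is an added illustration; the four-point planar data set is invented here and is not the data set of the problem. Lloyd's iteration started from the top/bottom split of a wide rectangle keeps that split, although the left/right split has a much smaller objective Q.

```python
import numpy as np

def lloyd(A, labels, iters=100):
    """Batch k-means (Lloyd's algorithm) started from a given labeling."""
    k = labels.max() + 1
    for _ in range(iters):
        centroids = np.array([A[labels == j].mean(axis=0) for j in range(k)])
        new_labels = np.argmin(((A[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
        if np.array_equal(new_labels, labels):
            break                              # fixed point: no reassignment changes the partition
        labels = new_labels
    return labels

def Q(A, labels):
    """Sum of squared distances of the points to their cluster centroids."""
    return sum(((A[labels == j] - A[labels == j].mean(axis=0)) ** 2).sum()
               for j in range(labels.max() + 1))

# Four points at the corners of a wide rectangle (illustrative data only).
A = np.array([[-3.0, 1.0], [3.0, 1.0], [-3.0, -1.0], [3.0, -1.0]])
top_bottom = np.array([0, 0, 1, 1])            # initial partition: top row vs bottom row
left_right = np.array([0, 1, 0, 1])            # the better partition: left vs right column

final = lloyd(A, top_bottom)
print("k-means stops at", final, "with Q =", Q(A, final))     # stays top/bottom, Q = 36
print("left/right partition has Q =", Q(A, left_right))       # Q = 4 < 36
```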
Problem 2.2.3. Assume that the data set A contains two identical vec-
tors a′ = a′′ .
Figure 11.1: First variation of a two cluster partition with identical centroids.
∂ψ/∂θ (θ, a, b) = ( b/√(b² + 1 − 2b cos θ) − a/√(a² + 1 + 2a cos θ) ) sin θ.
The function φ is defined by
φ(θ, a, b) = b/√(b² + 1 − 2b cos θ) − a/√(a² + 1 + 2a cos θ)  (11.0.4)
so that ∂ψ/∂θ (θ, a, b) = φ(θ, a, b) sin θ and, therefore, the sign of ∂ψ/∂θ (θ, a, b) is determined by φ(θ, a, b). It is easy to see that φ(θ, a, b) is monotonically decreasing on (0, π/2), and, as we will see later, it is of interest to examine
φ(0, a, b) = b/|b − 1| − a/(a + 1)  and  φ(π/2, a, b) = b/√(b² + 1) − a/√(a² + 1)  (11.0.5)
(when φ (0, a, b) exists) along with φ (0, s, s). The proof of the following
inequalities is straightforward
1. p > 1,
2. q > 1,
3. ‖a1 + · · · + a_{p−1}‖ = ‖a_p‖ = 1.
This contradicts the fact that {π1, π2} is a two cluster stable spherical k-means partition and shows that at least one centroid should be different from zero.
Next assume that ‖a1 + · · · + a_p‖ = 0 and ‖b1 + · · · + b_q‖ > 0. The assumption implies that π1 contains two or more vectors, hence selection of a vector a_i ≠ −c(π2) is possible (the condition yields ‖b1 + · · · + b_q + a_i‖ > ‖b1 + · · · + b_q‖ − 1). For simplicity's sake we assume that a_i = a_p. Again consider the first variation {π1−, π2+} of {π1, π2}. Note that
and
Due to (11.0.6) one has ψ(θ0 , s, s) > ψ(0, s, s) ≥ Q(π1 ) + Q(π2 ). This
contradicts stability of {π1 , π2 }, and completes the proof.
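The effect of a first variation on the spherical k-means objective can also be seen numerically. The sketch below is an added illustration; it assumes, as suggested by the notation above, that s(π) is the sum of the (unit) vectors in π and that the quality of a cluster is Q(π) = ‖s(π)‖. Moving one vector from π1 to π2 increases Q(π1) + Q(π2), which is exactly the kind of first variation used in the argument.

```python
import numpy as np

def Q(cluster):
    """Assumed spherical k-means quality of a cluster: the norm of its vector sum s(pi)."""
    return np.linalg.norm(np.sum(cluster, axis=0))

# Three unit vectors; pi1 contains two antipodal vectors, so s(pi1) = 0.
a1, a2, b1 = np.array([1.0, 0.0]), np.array([-1.0, 0.0]), np.array([0.0, 1.0])

pi1, pi2 = [a1, a2], [b1]
print("Q(pi1)  + Q(pi2)  =", Q(pi1) + Q(pi2))                  # 0 + 1 = 1

# First variation: move a1 from pi1 to pi2.
pi1_minus, pi2_plus = [a2], [b1, a1]
print("Q(pi1-) + Q(pi2+) =", Q(pi1_minus) + Q(pi2_plus))       # 1 + sqrt(2), an improvement
```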
Figure 11.2: Spherical k-means stable two cluster partition with a^T c(π1) < 0.
Case 1. Since ‖s(π1)‖ < 1 the cluster π1 contains two or more vectors. Due to Problem 4.3.4 one has a^T c(π1) ≥ 0 for each a ∈ π1. This yields existence of a ∈ π1 so that 0 ≤ a^T c(π1) ≤ ‖s(π1)‖/2, and ‖s(π1) − a‖ ≥ 1. Note that a^T s(π2) ≥ 0, and ‖s(π2) + a‖ > ‖s(π2)‖. Let {π1−, π2+} be the first variation obtained from {π1, π2} by removing a from π1 and assigning it to π2. The above discussion shows that
Case 2. Since 1 ≤ ‖s(π1)‖ < ‖s(π2)‖, the cluster π2 contains two or more vectors. We select b ∈ π2 so that 0 ≤ b^T c(π2) < 1 and consider
[Figure: the vector a and the norm ‖s(π1)‖.]
the first variation {π1+ , π2− } obtained by removing b from π2 and as-
signing it to π1 . If the acute angle between b and s(π2 ) is θ0 , then
Note that ψ(0, ‖s(π1)‖, ‖s(π2)‖) = Q(π1) + Q(π2), and
∂ψ/∂θ (θ, ‖s(π1)‖, ‖s(π2)‖) = φ(θ, ‖s(π1)‖, ‖s(π2)‖) sin θ
with (see (11.0.6))
φ(θ, ‖s(π1)‖, ‖s(π2)‖) ≥ φ(π/2, ‖s(π1)‖, ‖s(π2)‖) > 0.
This yields
Q(π1+) + Q(π2−) = ψ(θ0, ‖s(π1)‖, ‖s(π2)‖) > ψ(0, ‖s(π1)‖, ‖s(π2)‖) = Q(π1) + Q(π2),
This implies
φ″(1) = (1/y) φ″(1/y),  (11.0.11)
that is, φ″(t) = φ″(1)/t with t = 1/y. The general solution of this second order ODE is given by φ″(1)[t log t − t + 1] + c1[t − 1] + c2. The initial conditions (11.0.10) yield c1 = c2 = 0, and φ(t) = φ″(1)[t log t − t + 1].
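The closing formula can be verified symbolically. The sketch below is an added check (it assumes SymPy): φ(t) = φ″(1)[t log t − t + 1] indeed satisfies φ″(t) = φ″(1)/t together with φ(1) = φ′(1) = 0, the conditions that force c1 = c2 = 0 above.

```python
import sympy as sp

t = sp.symbols('t', positive=True)
c = sp.symbols('c')                               # c plays the role of phi''(1)

phi = c * (t * sp.log(t) - t + 1)

print(sp.simplify(sp.diff(phi, t, 2) - c / t))    # 0: phi''(t) = phi''(1) / t
print(phi.subs(t, 1))                             # 0: phi(1)  = 0
print(sp.diff(phi, t).subs(t, 1))                 # 0: phi'(1) = 0
print(sp.diff(phi, t, 2).subs(t, 1))              # c: phi''(1) = c, as required
```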
Bibliography
[92] J. Lafferty, S.D. Pietra, and V.D. Pietra. Statistical learning algorithms
based on Bregman distances. In Proceedings of the Canadian Work-
shop on Information Theory, 1997.
[93] Man Lan, Chew-Lim Tan, Hwee-Boon Low, and Sam-Yuan Sung.
A comprehensive comparative study on term weighting schemes for
text categorization with support vector machines. In Proceedings of
the 14th International Conference on World Wide Web, pp. 1032–1033,
Chiba, Japan, 2005.
[94] M. Larkey, L. Connell, and J. Callan. Collection selection and results
merging with topically organized U.S. patents and TREC data. In
Proceedings of the 9th International Conference on Information and
Knowledge Management (CIKM), pp. 282–289, McLean, VA, 2000.
ACM, New York, NY.
[95] G. Leitmann. On one approach to the control of uncertain system.
Transactions of ASME, Journal of Dynamic Systems, Measurement
and Control, 50(5):373–380, 1993.
[96] A.V. Leouski and W.B. Croft. An evaluation of techniques for clus-
tering search results. Technical Report IR-76, Department of Com-
puter Science, University of Massachusetts, CIIR Technical Report,
1996.
[97] Ciya Liao, Shamim Alpha, and Paul Dixon. Feature preparation in
text categorization. www.oracle.com/technology/products/text/pdf/feature_preparation.pdf.
[98] F. Liese and I. Vajda. Convex Statistical Distances. Teubner, Leipzig,
1987.
[99] D. Littau and D. Boley. Using low-memory representations to cluster
very large data sets. In D. Barbará and C. Kamath (eds.), Proceed-
ings of the Third SIAM International Conference on Data Mining,
pp. 341–345, 2003.
[100] D. Littau and D. Boley. Clustering very large data sets with PDDP.
In J. Kogan, C. Nicholas, and M. Teboulle (eds.), Grouping Multidi-
mensional Data: Recent Advances in Clustering, pp. 99–126. Springer-
Verlag, New York, 2006.
[101] Tao Liu, Shengping Liu, Zheng Chen, and Wei-Ying Ma. An evalu-
ation on feature selection for text clustering. In Proceedings of the
2003 IEEE International Conference on Data Mining, pp. 488–495.
IEEE Computer Society Press, Piscataway, NJ, 2003.