A Modified K-Means Algorithm for Big Data Clustering

B.Sc. Dissertation

Submitted by

Sk. Ahammad Fahad
ID: 0401113502

A Dissertation Submitted in partial fulfillment of the requirements of the degree of Bachelor of Science in Computer Science and Information Technology from IBAIS University

Department of Computer Science and Engineering
IBAIS University
Dhaka, Bangladesh
April, 2015
A Modified K-Means Algorithm for Big Data Clustering

B.Sc. Dissertation

Supervised By

Md. Mahbub Alam
Assistant Professor, Department of Computer Science and Engineering
Dhaka University of Engineering and Technology (DUET), Gazipur-1700

Submitted by

Sk. Ahammad Fahad
B.Sc. in Computer Science and Information Technology
ID: 0401113502

A Dissertation Submitted in partial fulfillment of the requirements of the degree of Bachelor of Science in Computer Science and Information Technology from IBAIS University

Department of Computer Science and Engineering
IBAIS University
Dhaka, Bangladesh
April, 2015
Certificate

I, Md. Mahbub Alam, hereby certify that the work entitled "A Modified K-Means Algorithm for Big Data Clustering" was prepared by Sk. Ahammad Fahad, ID: 0401113502, and was submitted to the Department of Computer Science and Engineering as a partial fulfillment for the conferment of the degree of Bachelor of Science in Computer Science and Information Technology, and that the aforementioned work, to the best of my knowledge, is the said student's own work.

It was not copied from any other work or from any other sources except where due reference or acknowledgement is made explicitly in the text, nor has any part been written for the student by another person, and it is an authentic record of the dissertation carried out under my supervision.

Signature of Supervisor

Md. Mahbub Alam
Assistant Professor
Department of Computer Science and Engineering
Dhaka University of Engineering and Technology (DUET)
Gazipur-1700, Bangladesh
DECLARATION

I hereby declare that the work presented in this thesis entitled "A Modified K-Means Algorithm for Big Data Clustering" is submitted to the Department of Computer Science and Engineering of IBAIS University, Dhaka, Bangladesh in partial fulfillment of the requirements of the degree of Bachelor of Science in Computer Science and Information Technology, and that this thesis, or any material, data and information related to it, shall not be distributed, published or disclosed to any party by the student except with IBAIS permission. I, Sk. Ahammad Fahad, ID: 0401113502, hereby declare that this is my original work, that I have not copied from any other work or from any other sources except where due reference or acknowledgement is made explicitly in the text, that no part has been written for me by another person, and that it is an authentic record of my work carried out under the supervision of Md. Mahbub Alam.

Signature of Author

Sk. Ahammad Fahad
ID NO: 0401113502
Program: Bachelor of Science in Computer Science and Information Technology
Department of Computer Science and Engineering
IBAIS University, Dhaka, Bangladesh
ACKNOWLEDGEMENTS

First and foremost, I would like to pay my sincerest gratitude to the almighty ALLAH for keeping me in sound health during this work and giving me the ability to work hard successfully. I would also like to show gratitude to my research supervisor Md. Mahbub Alam. Without his assistance and dedicated involvement in every step throughout the process, this thesis would never have been accomplished. I would like to thank him very much for his support and understanding over these past seven months. His guidance helped me throughout the research and the writing of this thesis. I could not have imagined having a better advisor and mentor for my B.Sc. study. His cheerful speech and inspiration encourage me every time, and I feel proud to be his student.

I would like to give thanks to all the teachers of the Department of Computer Science and Engineering for their constructive criticism, valuable advice and cordial co-operation during my study at the university. Especially, I would like to thank the Head of the Department of Computer Science and Engineering, Dr. Akter Hossain. His friendly guidance helped me throughout the whole undergraduate academic session.

I have to thank my parents for their love and support throughout my life. Thank you both for giving me the strength to reach for the stars and chase my dreams. I am very much thankful to almighty ALLAH for giving me such parents; they are the idols of my life. A brother like Faisal is one of the best gifts from the Almighty, and my sister Theba is the best sister in the universe; from childhood I have grown up with her love and care. Thanks to Ferdaus uncle and Mollika aunty for their support during the thesis period.

I dedicate my thesis to my parents, Sk. Abul Kashem and Ainun Naher.

Sk. Ahammad Fahad
ID NO: 0401113502
Program: Bachelor of Science in Computer Science and Information Technology
Department of Computer Science and Engineering
IBAIS University, Dhaka, Bangladesh
ABSTRACT

The emergence of modern techniques for scientific data collection has resulted in large-scale accumulation of data pertaining to diverse fields. Conventional database querying methods are inadequate to extract useful information from these huge amounts of data. The absence of category information distinguishes data clustering (unsupervised learning) from classification or discriminant analysis (supervised learning). The aim of clustering is exploratory in nature: to find structure in data. Machine learning algorithms provide an automatic and easy way to accomplish such tasks. These algorithms are classified into supervised, unsupervised and semi-supervised algorithms. One of the most popular, simple and widely used clustering (unsupervised) algorithms is k-means. The original k-means algorithm is computationally expensive, and several methods have been proposed for improving its performance. This thesis proposes a modified k-means clustering algorithm, where the modification refers to the number of clusters and the running time. According to our observation, the quality of the resulting clusters heavily depends on the selection of the initial centroids, and after a certain number of iterations only a small part of the data elements change their cluster, so there is no need to re-distribute all data elements. Therefore, the proposed method first finds the initial centroids and then puts an interval between those data elements which will not change their cluster during the next iteration and those which may change their cluster, which reduces the workload significantly in the case of very big data sets. We evaluate our method with different sets of data and compare it with other methods as well.
TABLE OF CONTENTS

CERTIFICATE
DECLARATION
ACKNOWLEDGEMENTS
ABSTRACT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES

CHAPTER 1: Introduction
1.1 Introduction
1.2 Big Data
1.2.1 Data Volume
1.2.2 Data Velocity
1.2.3 Data Variety
1.3 Clustering
1.3.1 Classification and Clustering
1.3.2 Organ of Clustering
1.3.3 Different Types of Clustering
1.4 Challenges
1.4.1 Challenges of Defining Big Data
1.4.2 Clustering Challenges of Big Data
1.4.3 Challenges to Finding the Fittest Clustering Algorithms for Large Datasets
1.5 Acquainting of this Book

CHAPTER 2: Literature Review
2.1 Data Sources
2.1.1 Web Data & Social Media
2.1.2 Machine-to-machine Data
2.1.3 Big Transaction Data
2.1.4 Biometrics
2.1.5 Human-generated Data
2.2 Data Veracity
2.3 Data Value
2.4 Different Types of Clusters
2.4.1 Well-Separated
2.4.2 Prototype-Based
2.4.3 Graph-Based
2.4.4 Density-Based
2.4.5 Shared-Property
2.5 Clustering Method
2.5.1 Partitioning Based
2.5.2 Hierarchical Based
2.5.3 Density Based
2.5.4 Grid Based
2.5.5 Model Based
2.6 Clustering Validity
2.7 K-Means Algorithm
2.7.1 Basic K-means Algorithm
2.7.2 Algorithm Steps with Example
2.7.3 K-means Clustering Properties
2.7.4 Advantages
2.7.5 Disadvantages

CHAPTER 3: Background and Related Work
3.1 Background of K-means Clustering
3.2 Previous Works

CHAPTER 4: Proposed Methodology
4.1 Modified Algorithm
4.2 Fields of Modification
4.3 Algorithm Evaluation
4.3.1 Initial Sets and Member Selection
4.3.2 Finalized Cluster with Inner Interval Calculation
4.3.3 Obtain Final Cluster

CHAPTER 5: Experimental Result
5.1 Environmental Setup and Data Input
5.2 Algorithm Testing
5.3 Decision

CHAPTER 6: Conclusion and Further Research

REFERENCES
LIST OF TABLES

Table 2.1: Food items with fat and protein content for clustering
Table 2.2: Food items with fat and protein content in the initial cluster
Table 2.3: Food items with fat and protein content in the final cluster
Table 4.1: 50 products with their size (x) and weight (y) values
Table 4.2: After assigning values from the input product set U
Table 4.3: With initial cluster
Table 4.4: Cluster and members with centroids
Table 4.5: Cluster and interval with inner interval points
Table 4.6: Calculation result of an iteration
Table 4.7: Final clusters and their associated members
Table 5.1: Four trained algorithms for testing
Table 5.2: Result for 500 data points with different values of k
Table 5.3: Result for 500 data points with different values of k and f
Table 5.4: Result for different data points
LIST OF FIGURES

Figure 1.1: Data clustering
Figure 1.2: Classification tree
Figure 1.3: Steps of clustering
Figure 1.4: Hierarchical sequence of various clustering
Figure 2.1: Different types of clustering as illustrated by sets of two-dimensional points
Figure 2.2: Overview of clustering algorithms taxonomy
Figure 2.3: Position of food items with fat and protein content
Figure 2.4: Flow-chart of the standard k-means clustering algorithm
Figure 4.1: Flow-chart of the modified k-means algorithm
Figure 4.2: 50 data points in (x, y) coordinates
Figure 4.3: Elements of U move to N1, N2, N3, ..., Nn
Figure 4.4: Final clusters with thresholds
Figure 5.1: Process time of four algorithms
CHAPTER 1
INTRODUCTION

1.1 INTRODUCTION
Big Data has become one of the buzzwords in IT during the last couple of years. In the current digital era, owing to the massive progress and development of the internet and online technologies such as big and powerful data servers, we face a huge volume of information. Data size has increased dramatically with the advent of today's technology in many sectors such as manufacturing, business, science and web applications. With the rise of data-sharing websites such as Facebook, Flickr, YouTube, Twitter, Google and Google News, there is a dramatic growth in the amount of data. For example, Facebook reports 2.5 billion content items and 105 terabytes of data each half hour, with 300 million photos and 4 million videos posted per day [1]. Twitter has over 651 million users generating over 6,000 tweets per second [44]. 300 hours of video are uploaded to YouTube every minute, with more than 1 trillion video views [45]. Google has indexed 50 billion pages and handles more than 2.4 million queries every minute [43]. Google News aggregates articles from over 10,000 sources in real time [1]. More than 4.5 million photos are uploaded to Flickr in a day [1]. It is estimated that all the global data generated from the beginning of time until 2003 represented about 5 exabytes, that over 1.8 zettabytes (ZB) were created in 2011, and that approximately 8 ZB will have been created by 2015 [1].
Although this massive volume of data can be really useful for people and corporations, it can be problematic as well. A big volume of data, or big data, has its own deficiencies: it needs big storage, and that volume makes operations (analytical operations, processing operations, retrieval operations) very difficult and hugely time consuming. The pressure to handle the growing amount of data on the web, for example, led Google to develop the Google File System [2] and MapReduce [3]. Efforts were made to rebuild those technologies as open source software. This resulted in Apache Hadoop and the Hadoop Distributed File System (HDFS) [4] and laid the foundation for technologies summarized today as big data. Map/Reduce is a model for software in a distributed computing environment based on two functions called map and reduce. The functions are designed to work with a list of inputs: the map function produces an output for each item in the list, while the reduce function produces a single output for the entire list. Hadoop is a software library framework for developing highly scalable distributed computing applications; the Hadoop framework handles the processing details, leaving developers free to focus on application logic. However, the applications of Map/Reduce and Hadoop are still in the early stages of manipulating big data.
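To make the map/reduce idea above concrete, the following is a minimal sketch in Python of a word-count job expressed with a map function and a reduce function. It runs in a single process and is only illustrative; it is not Hadoop's actual API, and the function names and the grouping step are our own assumptions.

    from collections import defaultdict

    def map_function(document):
        # Map: produce one (key, value) pair for each item, here one pair per word.
        return [(word, 1) for word in document.split()]

    def reduce_function(key, values):
        # Reduce: produce a single output for the entire list of values of a key.
        return key, sum(values)

    def map_reduce(documents):
        grouped = defaultdict(list)
        for document in documents:
            for key, value in map_function(document):
                grouped[key].append(value)   # shuffle step: group values by key
        return [reduce_function(key, values) for key, values in grouped.items()]

    # Word counts over a tiny collection of two "documents".
    print(map_reduce(["big data needs big storage", "big data clustering"]))

In a real Hadoop deployment the shuffle and the distribution of work across machines are handled by the framework; only the map and reduce functions are supplied by the developer.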


Some aspects need careful attention when dealing with big data, and this work will therefore help researchers as well as practitioners in selecting techniques and algorithms that are suitable for big data. Volume is the first and most obvious characteristic to deal with when clustering big data compared to conventional data clustering, as it requires substantial changes in the architecture of storage systems. The other important characteristic of big data is velocity. This requirement leads to a high demand for online processing of data, where processing speed is required to deal with the data flows. Variety is the third characteristic, where different data types, such as text, image, and video, are produced from various sources, such as sensors, mobile phones, etc. These three Vs (Volume, Velocity, and Variety) are the core characteristics of big data which must be taken into account when selecting appropriate clustering techniques. One way to overcome these difficult problems is to have big data clustered in a compact format that is still an informative version of the entire data. Such clustering techniques aim to produce a good quality of clusters/summaries.

1.2 BIG DATA


Large-scale data is called big data, but how much data counts as "big" cannot be fixed by any single person or organization, because it depends on context. A small company, such as a bank or a hospital, may handle gigabytes of data; for it, terabytes of data are big data. Organizations such as Google, Amazon and Wikipedia handle far more; for them, petabytes (10^15 bytes) or exabytes (10^18 bytes) are big data. In its report on big data as the next frontier for innovation, competition, and productivity, the McKinsey Global Institute (MGI) used the following definition:
"Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. This definition is intentionally subjective and incorporates a moving definition of how big a dataset needs to be in order to be considered big data." [5]
Another definition is given in IDC's "The Digital Universe" study:

"IDC defines Big Data technologies as a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data by enabling high-velocity capture, discovery, and/or analysis. There are three main characteristics of Big Data: the data itself, the analytics of the data, and the presentation of the results of the analytics." [6]

When we are unable to handle our large-scale data with our traditional data handling systems, we call those data big data for that particular field. Big data is mainly identified by a set of characteristics known as the "triple V": data volume, data velocity and data variety [7].
1.2.1 Data Volume

There is no clear or concrete quantification of when volume should be considered big. This is rather a moving target, increasing with available computing power. While 300 terabytes were considered big ten years ago, today petabytes are considered big, and the target is moving more and more toward exabytes and even zettabytes [6]. There are estimates that Walmart processes more than 2.5 petabytes per hour, all from customer transactions [8]. Google processes around 24 petabytes per day, and this is growing [9]. To account for this moving target, big volume can be considered as data whose size forces us to look beyond the tried-and-true methods that are prevalent at that time [7].

Big volume is not only dependent on the available computing power, but also on the other characteristics and the application of the data.
It is also obvious that big volume problems are interdependent with velocity and variety. The volume of a data set might not be problematic if it can be bulk-loaded and a processing time of one hour is fine. Handling the same volume might be a really hard problem if it is arriving fast and needs to be processed within seconds. At the same time, handling volume might get harder as the data set to be processed becomes unstructured. This adds the necessity of pre-processing steps to extract the needed information from the unstructured data into a structured form, and therefore leads to more complexity and a heavier workload for processing that data set. This exemplifies why volume, or any other of big data's characteristics, should not be considered in isolation, but in dependence on the other data characteristics.

Looking at this interdependence, we can also try to explain the increase of data volume due to variety. After all, variety also means that the number of sources from which organizations leverage, extract, integrate and analyze data grows. Adding additional data sources to your pool also means increasing the volume of the total data you try to leverage. Both the number of potential data sources and the amount of data they generate are growing. Sensors in technological artifacts or used for scientific experiments create a lot of data that needs to be handled. There is also a trend to "datafy" our lives. People, for example, increasingly use body sensors to guide their workout routines. Smartphones gather data while we use them or even just carry them with us.

It is not only additional data sources, but also a change in mindset that leads to increased data volume. Or, maybe expressed better, it is that change of mindset that also leads to the urge to add new data sources. Companies nowadays consider data an important asset and its leverage a possible competitive differentiator [10]. This leads, as mentioned above, to an urge to unlock new sources of data and to utilize them within the organization's analytics process. Examples are the analysis of clickstreams and logs for web page optimization and the integration of social media data and sentiment analysis into marketing efforts.
1.2.2 Data Velocity:

Velocity refers to the speed of data. It has two characteristics. First, it describes the rate at which new data flows in and existing data gets updated, called the acquisition rate challenge. Second, it corresponds to the time acceptable to analyze the data and act on it while it is flowing in, called the timeliness challenge [7]. These are essentially two different issues that do not necessarily need to occur at the same time, but often they do.
The first of these problems is the acquisition rate challenge. Typically the workload is transactional, and the challenge is to receive, maybe filter, manage and store fast and continuously arriving data. So the task is to update a persistent state in some database, and to do that very fast and very often. In this case, traditional relational database management systems are not sufficient for the task, as they inherently incur too much overhead in the sense of locking, logging, buffer pool management and latching for multi-threaded operation.

An example of this problem is the inflow of data from sensors or RFID systems, which typically create an ongoing stream and a large amount of data. If the measurements from several sensors need to be stored for later use, this is an OLTP-like problem involving thousands of write or update operations per second. Another example is massively multiplayer games on the internet, where commands from millions of players need to be received and handled while maintaining a consistent state for all players [11].

The second problem regards the timeliness of information extraction, analysis (that is, identifying complex patterns in a stream of data [11]) and reaction to incoming data. This is often called stream analysis or stream mining. The ability to react to inflowing data and events in (near) real time allows organizations to become more agile than the competition [8]. In many situations real-time analysis is indeed necessary to act before the information becomes worthless [7]. As mentioned, it is not sufficient to analyze the data and extract information in real time; it is also necessary to react on it and apply the insight, e.g. to the ongoing business process. This cycle of gaining insight from data analysis and adjusting a process or the handling of the current case is sometimes called the feedback loop, and the speed of the whole loop (not of parts of it) is the decisive issue.

Strong examples of this are often customer-facing processes. One of them is fraud detection in online transactions. Fraud is often conducted not by manipulating one transaction, but within a certain sequence of transactions. Therefore it is not sufficient to analyze each transaction on its own; rather, it is necessary to detect fraud patterns across transactions and within a user's history. Furthermore, it is necessary to detect fraud while the transactions are processed, in order to deny the transactions or at least some of them [7]. Another example is electronic trading, where data flows get analyzed to automatically make buy or sell decisions [11].
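As a small illustration of analyzing patterns across a user's transactions rather than each transaction in isolation, the following sketch keeps a short per-user history and flags a user when several high-value transactions arrive within a narrow time window. The rule, the thresholds and the field names are purely illustrative assumptions, not a real fraud model.

    from collections import defaultdict, deque

    WINDOW_SECONDS = 60      # look at transactions within the last minute (assumed)
    HIGH_AMOUNT = 500.0      # "high-value" threshold (assumed)
    MAX_HIGH_IN_WINDOW = 3   # this many high-value hits in the window is suspicious

    history = defaultdict(deque)   # per-user timestamps of recent high-value transactions

    def process(transaction):
        # Called for each incoming transaction while it is being processed.
        user, amount, ts = transaction["user"], transaction["amount"], transaction["time"]
        recent = history[user]
        if amount >= HIGH_AMOUNT:
            recent.append(ts)
        while recent and ts - recent[0] > WINDOW_SECONDS:
            recent.popleft()                      # drop events outside the window
        return len(recent) >= MAX_HIGH_IN_WINDOW  # True means flag (e.g. deny) this one

    # Example stream: the third high-value transaction within a minute gets flagged.
    stream = [
        {"user": "u1", "amount": 600.0, "time": 0},
        {"user": "u1", "amount": 700.0, "time": 20},
        {"user": "u1", "amount": 650.0, "time": 40},
    ]
    print([process(t) for t in stream])  # [False, False, True]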

1.2.3 Data Variety:

One driver of big data is the potential to use more diverse data sources, data sources that were hard to leverage before, and to combine and integrate data sources as a basis for analytics. There has been a rapid increase of publicly available, text-focused sources due to the rise of social media several years ago: all the blog posts, community pages, messages and images from social networks. But there is also a rather new source of data from sensors, mobile phones and GPS [8]. Companies, for example, want to combine sentiment analysis from social media sources with their customer master data and transactional sales data to optimize marketing efforts. Variety hereby refers to a general diversity of data sources. This not only implies an increased number of different data sources but obviously also structural differences between those sources. At a higher level this creates the requirement to integrate structured data, semi-structured data and unstructured data. At a lower level this means that, even if sources are structured or semi-structured, they can still be heterogeneous: the structure or schema of different data sources is not necessarily aligned, various data formats can be used, and the semantics of data can be inconsistent.

Managing and integrating this collection of multi-structured data from a wide variety of sources poses several challenges. One of them is the actual storage and management of this data in database-like systems. Relational database management systems (RDBMS) might not be the best fit for all types and formats of data. Stonebraker et al. state that they are, for example, particularly ill-suited for array or graph data [11]. Array-shaped data is often used for scientific problems, while graph data is important because connections in social networks are typically shaped as graphs, but also because the Linked Open Data project [12] uses RDF and therefore graph-shaped data.
1.3 Clustering
a
In today's highly competitive business environment, clustering plays an important role. Clustering is an essential task in the data mining process, used to make groups or clusters of a given data set based on the similarity between the items. Clustering is an important concept for uniting objects in groups (clusters) according to their similarity. Clustering is similar to classification, except that the groups are not predefined but rather defined by the data alone.

Cluster analysis is one of the most important data mining methods and a central problem in data management. Document clustering is the act of collecting similar documents into classes, where similarity is some function on a document. Document clustering does not need a separate training process or manual tagging of groups in advance. It is the process of partitioning or grouping a given set of patterns into disjoint clusters: the documents in the same cluster are more similar, while the documents in different clusters are more dissimilar.

Figure 1.1: Data clustering: (a) original points; (b) two clusters; (c) four clusters; (d) eight clusters


Data clustering is a data exploration technique that allows objects with similar characteristics to be grouped together in order to facilitate their further processing. Most of the initial clustering techniques were developed by the statistics or pattern recognition communities [18], where the goal was to cluster a modest number of data instances. In more recent years, clustering has been identified as a key technique in data mining tasks. This fundamental operation can be applied to many common tasks such as unsupervised classification, segmentation and dissection. In an unsupervised technique, the correct answers are not known, or simply not given to the learning algorithm.

1.3.1 Classification & Clustering:

Classification is supervised learning: it learns a method for predicting the instance class from pre-labeled (classified) instances. Clustering, on the other hand, is unsupervised learning: it finds a natural grouping of instances given unlabeled data. Cluster analysis is the process of classifying objects into subsets that have meaning in the context of a particular problem. Basically, clustering is a special kind of classification: a clustering is a type of classification imposed on a finite set of objects. Figure 1.2 shows the tree of classification problems as suggested by Lance and Williams [19]. Each leaf in the tree in Figure 1.2 defines a different genus of classification problem. The nodes in the tree of Figure 1.2 are defined below.
Exclusive versus nonexclusive: An exclusive classification is a partition of the set of objects; every object belongs to exactly one subset or cluster. In a nonexclusive, or overlapping, classification, each object can be assigned to several classes. An example makes this clear. If we classify some people as male or female, it is an exclusive classification. If we classify the same people by their diseases, it is a nonexclusive classification, because in the first case each person belongs to exactly one class or cluster, whereas in the second case one person can belong to more than one class or cluster, since one person can have several diseases.
ntic up
Figure 1.2: Classification tree (Classification divides into Exclusive and Non-Exclusive (overlapping); Exclusive divides into Supervised and Unsupervised; Unsupervised divides into Hierarchical and Partitional)


Supervised versus unsupervised: An unsupervised classification uses only the proximity matrix to perform the classification. It is called "unsupervised learning" in pattern recognition because no category labels denoting a partition of the objects are used. A supervised classification uses category labels on the objects as well as the proximity matrix; the problem is then to establish a discriminant surface that separates the objects according to category. For example, suppose that various indices of personal health were collected from smokers and nonsmokers. An unsupervised classification would group the individuals based on similarities among the health indices and then try to determine whether smoking was a factor in the propensity of individuals toward various diseases. A supervised classification would study ways of discriminating smokers from nonsmokers based on the health indices.
Hierarchical versus partitional: Unsupervised classifications are subdivided into hierarchical and partitional classifications by the type of structure imposed on the data. A hierarchical classification is a nested sequence of partitions, whereas a partitional classification is a single partition; thus a hierarchical classification is a special sequence of partitional classifications. We will use the term clustering for an exclusive, unsupervised, partitional classification, and the term hierarchical clustering for an exclusive, unsupervised, hierarchical classification [19].

1.3.2 Organ of Clustering

Jain and Dubes state that there are three compulsory and two optional steps in typical pattern clustering [20]. They are given below:

I. Pattern representation.
II. Declaration of a pattern proximity measure appropriate to the data domain.
III. Clustering or grouping based on the proximity measure.
IV. Data abstraction (optional, if required).
V. Assessment of output (optional, if required).

The first three steps are mandatory. Figure 1.3 shows the typical sequence of these steps, including a feedback path, through which the output of the grouping process could affect subsequent feature extraction and similarity computation.
Figure 1.3: Steps of clustering (patterns pass through feature selection/extraction and pattern representation, interpattern similarity is computed, and grouping produces clusters; a feedback loop runs from the output back to the earlier steps)

Pattern representation is the combination of the number of classes, the number of available patterns, and the number, type and scale of the features available to the clustering algorithm [20]. Feature selection is the process of identifying the most effective subset of the original features to use in clustering. Feature extraction is the use of one or more transformations of the input features to produce new salient features.

Pattern proximity is usually measured by a distance function defined on pairs of patterns; a variety of distance measures are in use in the various communities [20]. A simple distance measure like the Euclidean distance can often be used to reflect dissimilarity between two patterns, whereas other similarity measures can be used to characterize the conceptual similarity between patterns [20].
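As a small, hedged illustration of such a proximity measure, the following sketch computes the Euclidean distance between two patterns represented as feature vectors; the function name and the sample vectors are only illustrative.

    import math

    def euclidean_distance(a, b):
        # Dissimilarity between two patterns given as equal-length feature vectors.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Two patterns with two features each; the distance is 5.0.
    print(euclidean_distance((1.0, 2.0), (4.0, 6.0)))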
The grouping step can be performed in a number of ways. The output clustering can be hard or fuzzy. Hierarchical clustering algorithms produce a nested series of partitions based on a criterion for merging or splitting clusters based on similarity. Partitional clustering algorithms identify the partition that optimizes a clustering criterion. Additional techniques for the grouping operation include probabilistic [20] and graph-theoretic [20] clustering methods.

Data abstraction is the process of extracting a simple and compact representation of a data set. Simplicity is either from the perspective of automatic analysis or it is human-oriented. In the clustering context, a typical data abstraction is a compact description of each cluster, usually in terms of cluster prototypes or representative patterns such as the centroid [20].

How is the output of a clustering algorithm evaluated? What characterizes a good clustering result and a poor one? All clustering algorithms will, when presented with data, produce clusters, regardless of whether the data contain clusters or not. If the data do contain clusters, some clustering algorithms may obtain better clusters than others. The assessment of a clustering procedure's output therefore has several facets. One is actually an assessment of the data domain rather than the clustering algorithm itself: data which do not contain clusters should not be processed by a clustering algorithm.

1.3.3 Different Types of Clustering:

Different approaches to clustering data can be described with the help of the hierarchy shown in Figure 1.4 [19]. There are various methods of taxonomic representation of clustering; ours is based on the discussion in Jain and Dubes [20]. At the top level, there is a distinction between hierarchical and partitional approaches: hierarchical methods produce a nested series of partitions, while partitional methods produce only one.
The taxonomy shown in Figure 1.4 must be supplemented by a discussion of cross-cutting issues that may affect all of the different approaches, regardless of their placement in the taxonomy.

Agglomerative vs. divisive: This aspect relates to algorithmic structure and operation. An agglomerative approach begins with each pattern in a distinct cluster and successively merges clusters together until a stopping criterion is satisfied. A divisive method begins with all patterns in a single cluster and performs splitting until a stopping criterion is met.

Monothetic vs. polythetic: This aspect relates to the sequential or simultaneous use of features in the clustering process. Most algorithms are polythetic; that is, all features enter into the computation of distances between patterns, and decisions are based on those distances. A simple monothetic algorithm considers features sequentially to divide the given collection of patterns.

Hard vs. fuzzy: A hard clustering algorithm allocates each pattern to a single cluster during its operation and in its output. A fuzzy clustering method assigns degrees of membership in several clusters to each input pattern. A fuzzy clustering can be converted to a hard clustering by assigning each pattern to the cluster with the largest measure of membership.
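As a small illustration of that conversion, the following sketch assigns each pattern to the cluster with its largest membership degree; the membership values are made up for the example.

    # Fuzzy memberships: one row per pattern, one degree per cluster (rows sum to 1).
    memberships = [
        [0.7, 0.2, 0.1],
        [0.1, 0.3, 0.6],
        [0.4, 0.5, 0.1],
    ]

    # Hard clustering: each pattern goes to the cluster with the largest membership.
    hard_labels = [row.index(max(row)) for row in memberships]
    print(hard_labels)  # [0, 2, 1]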
Figure 1.4: Hierarchical sequence of various clustering approaches (Clustering divides into Hierarchical and Partitional; hierarchical methods include single link and complete link, while partitional methods include squared error, graph theoretic, mixture resolving and mode seeking)

Deterministic vs. stochastic: This issue is most relevant to partitional approaches designed to optimize a squared error function. This optimization can be accomplished using traditional techniques or through a random search of the state space consisting of all possible labelings.
Incremental vs. non-incremental: This issue arises when the pattern set to be clustered is large, and constraints on execution time or memory space affect the architecture of the algorithm. The early history of clustering methodology does not contain many examples of clustering algorithms designed to work with large data sets, but the advent of data mining has fostered the development of clustering algorithms that minimize the number of scans through the pattern set, reduce the number of patterns examined during execution, or reduce the size of the data structures used in the algorithm's operations.
1.4 Challenges

1.4.1 Challenges of Defining Big Data

Despite the realization that Big Data holds the key to many new research opportunities, there is no consistent definition of Big Data. So far, it has been described only in terms of its promises and features (volume, velocity, variety, value, veracity). Given below are a few definitions by leading experts and consulting companies:
The IDC definition of Big Data (rather strict and conservative): "A new generation of technologies and architectures designed to economically extract value from very large volumes of a wide variety of data by enabling high-velocity capture, discovery, and/or analysis" [23].

A simple definition by Jason Bloomberg [24]: "Big Data: a massive volume of both structured and unstructured data that is so large that it's difficult to process using traditional database and software techniques." This is also in accordance with the definition given by Jim Gray in his seminal book [25].

The Gartner definition of Big Data, which is termed a three-part definition: "Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making." [26]

Demchenko et al. [27] proposed a Big Data definition having the following 5V properties: Volume, Velocity and Variety, which constitute the native/original Big Data properties, and Value and Veracity, which are acquired as a result of initial data classification and processing in the context of a specific process or model.
It can be concluded that Big Data is different from the data stored in traditional warehouses. The data stored there first needs to be cleansed, documented and even trusted, and it should fit the basic structure of the warehouse to be stored. This is not the case with Big Data, which covers not only the data stored in traditional warehouses but also data not suitable to be stored in those warehouses. Thus there is a requirement for entirely new frameworks and methods to deal with Big Data. Currently, the main challenges identified for IT professionals in handling Big Data are:
I) The design of systems which would be able to handle such large amounts of data efficiently and effectively.

II) To filter the most important data from all the data collected by the organization; in other words, adding value to the business.

1.4.2 Clustering Challenges of Big Data

Clustering in Big Data is required to identify the existing patterns which are not clear at first glance. The properties of big data pose some challenges to adopting traditional clustering methods:

Type of dataset: The data collected in the real world often contains both numeric and categorical attributes. Clustering algorithms work effectively either on purely numeric data or on purely categorical data; most of them perform poorly on mixed numeric and categorical data types.
Size of dataset: The size of the dataset affects both the time-efficiency of clustering and the clustering quality (indicated by the precision). Some clustering methods are more efficient than others when the data size is small, and vice versa.

Handling outliers/noisy data: Data from real applications suffers from noise, which pertains to faults and misreported readings from sensors. Noise (very high or very low values) makes it difficult to cluster an object, thereby affecting the results of clustering. A successful algorithm must be able to handle outliers/noisy data.
Time complexity: Most clustering methods must be repeated several times to improve the clustering quality. Therefore, if the process takes too long, it can become impractical for applications that handle big data.
Stability: Stability corresponds to the ability of an algorithm to generate the same partition of the data irrespective of the order in which the data are presented to the algorithm. That is, the results of clustering should not depend on the order of the data.

High dimensionality: The "curse of dimensionality", a term coined by Richard E. Bellman, is relevant here. As the number of dimensions increases, the data become increasingly sparse, so the distance measurement between pairs of points becomes less meaningful and the average density of points anywhere in the data is likely to be low. Therefore, algorithms which partition data based on the concept of proximity may not be fruitful in such situations.
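To make the effect above tangible, the following sketch (our own illustration, with arbitrarily chosen dimensions and sample sizes) measures the relative contrast between the farthest and the nearest of a set of random points. The contrast shrinks as the dimensionality grows, which is why proximity-based partitioning degrades.

    import math
    import random

    def relative_contrast(dim, n=200):
        # Relative gap between farthest and nearest neighbour of a random query point.
        points = [[random.random() for _ in range(dim)] for _ in range(n)]
        query = [random.random() for _ in range(dim)]
        dists = [math.dist(query, p) for p in points]
        return (max(dists) - min(dists)) / min(dists)

    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))  # contrast drops as dim grows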

Cluster shape: A good clustering algorithm should be able to handle real data and its wide variety of data types, which will produce clusters of arbitrary shape. Many algorithms are able to identify only convex-shaped clusters.
1.4.3 Challenges to Finding the Fittest Clustering Algorithms for Large Datasets

Clustering is a division of data into groups of similar objects. Each group, called a cluster, consists of objects that are similar to one another and dissimilar to objects of other groups. How this similarity is measured accounts for the difference between various algorithms.

The properties of clustering algorithms to be considered for their comparison, from the point of view of utility in Big Data analysis, include:

- Type of attributes the algorithm can handle
- Scalability to large datasets
- Ability to work with high-dimensional data
- Ability to find clusters of irregular shape
- Handling of outliers
- Time complexity
- Data order dependency
- Labeling or assignment (hard or strict vs. soft or fuzzy)
- Reliance on a priori knowledge and user-defined parameters
- Interpretability of results
Very large data sets containing millions of objects described by tens or even hundreds of attributes of various types (e.g., interval-scaled, binary, ordinal, categorical, etc.) require that a clustering algorithm be scalable and capable of handling different attribute types. However, most classical clustering algorithms either can handle various attribute types but are not efficient when clustering large data sets, or can handle large data sets efficiently but are limited to interval-scaled attributes. There are thousands of clustering algorithms, hence we pick a representative algorithm from each category of partitioning-based, hierarchical, density-based and grid-partitioning algorithms. CLARA (Clustering LARge Applications) relies on a sampling approach to handle large data sets [28]. To alleviate sampling bias, CLARA repeats the sampling and clustering process a pre-defined number of times and subsequently selects as the final clustering result the set of medoids with the minimal cost. CLARANS (Clustering Large Applications based on RANdomized Search) views the process of finding k medoids as searching in a graph [29], which is a serial randomized search. FCM [30] is a representative algorithm of fuzzy clustering which is based on k-means concepts to partition a dataset into clusters. The FCM algorithm is a soft clustering method in which the objects are assigned to the clusters with a degree of belief; hence, an object may belong to more than one cluster with different degrees of belief. The BIRCH algorithm [31] builds a dendrogram known as a clustering feature tree (CF tree). The CF tree can be built by scanning the dataset in an incremental and dynamic way; thus, it does not need the whole dataset in advance. The DENCLUE algorithm [32] analytically models the cluster distribution according to the sum of the influence functions of all of the data points. The influence function can be seen as a function that describes the impact of a data point within its neighborhood. Local maxima of the overall density function behave like density attractors forming clusters of arbitrary shapes. The OptiGrid algorithm [33] is a dynamic programming approach to obtain an optimal grid partitioning. This is achieved by constructing the best cutting hyperplanes through a set of selected projections. These projections are then used to find the optimal cutting planes, each plane separating a dense space into two half spaces. It reduces one dimension at each recursive step and hence is good for handling a large number of dimensions.



1.5 Acquainting of this Book


In this thesis we discuss the whole big data clustering process. We try to touch on every step of data clustering briefly but firmly, with a crystal-clear concept. In every part of this thesis we faced new challenges, and we solved those problems carefully. For every new environment we met new scenarios. In this thesis we present only the most essential material, and every chapter is well organized. Finally, in this dissertation we explain the contributions we have made to the field of Big Data clustering.

Chapter 1 is the introductory portion. After the introduction, the first part is about Big Data and its properties. Then we present clustering and the components one should know about clustering. Then we set out the challenges of Big Data clustering; these are the complications we are going to untangle with a simplified technique.

Chapter 2 contains the literature review on Big Data clustering. Sources of big data are introduced in the initial portion of this chapter. Data value and veracity are presented in this chapter. Clusters and the types of clusters are briefly described here. Chapter two also includes methods of clustering, and cluster validity is presented in this part. At the end of chapter two, we briefly discuss the standard k-means algorithm, with an example and its advantages and disadvantages.

Chapter 3 contains the background of Big Data clustering. The work already done on Big Data clustering with the k-means algorithm is described here. Some algorithms with their steps are found in chapter 3.

Chapter 4 is named Proposed Methodology. In chapter 4 we discuss our modified k-means algorithm and describe every single step of the algorithm with all additional resources.
Experiments with our modified algorithm are presented in chapter 5. The decisions we reach from the experiments are presented there, and comparisons between the k-means algorithm and other modified k-means algorithms are analyzed and shown with data and figures.
Chapter 6 is the last chapter of our thesis. The conclusion and future work are given here; we also mention which parts of our algorithm need further updating.
CHAPTER 2
Literature Review

2.1 Data Sources:

As mentioned, with the growing ability to leverage semi-structured and unstructured data, the amount and variety of potential data sources is growing as well. This section is intended to give an overview of this variety.

Soares classifies typical sources for big data into five categories: web and social media, machine-to-machine data, big transaction data, biometrics and human-generated data [13].

2.1.1 Web Data & Social Media:

The web is a rich, but also very diverse, source of data for analytics. The main part of these sources is typically unstructured, including text, videos and images. However, most of these sources have some structure; they are, for example, related to each other through hyperlinks or provide categorization through tag clouds.
hyperlinks or provide categorization through tag clouds.

There is web content and knowledge structured to provide machine-readability. It is intended to enable applications to access the data, understand the data through semantics, and allow them to integrate data from different sources, set the data into context and infer new knowledge. Such sources are machine-readable metadata integrated into web pages, initiatives such as the Linked Open Data project using data formats from the semantic web standard, but also publicly available web services. This type of data is often graph-shaped and therefore semi-structured.


Web sources also deliver navigational data that provide information about how users interact with the web and how they navigate through it. This data encompasses logs and clickstreams gathered by web applications as well as search queries. Companies can, for example, use this information to get insight into how users navigate through a web shop and optimize its design based on the buying behavior. This data is typically semi-structured.

Social interactions or communicational data come, for example, from instant messaging services or status updates on social media sites. At the level of single messages this data is typically unstructured text or images, but one can impose semi-structure on a higher level, e.g. indicating who is communicating with whom. Furthermore, social interaction also encompasses data describing a more structural notion of social connections, often called the social graph or the social network. An example of this kind is the friendship relations on Facebook. This data is typically semi-structured and graph-shaped. One thing to note is that communicational data is exactly that: the information people publish about themselves on social media serves the purpose of communication and self-presentation. It is aiming at prestige and can therefore be biased, flawed or simply lies.

2.1.2 Machine-to-machine data:

Machine-to-machine communication describes systems communicating with technical devices that are connected via some network. The devices are used to measure a physical phenomenon like movement or temperature and to capture events within this phenomenon. Via the network the devices communicate with an application that makes sense of the measurements and captured events and extracts information from them. One prominent example of machine-to-machine communication is the idea of the internet of things [13].
Devices used for measurements are typically sensors, RFID chips or GPS receivers. They are often embedded into some other system, e.g. sensors for technical diagnosis embedded into cars, or smart meters in the context of ambient intelligence in houses. The data created by these systems can be hard to handle; the BMW Group, for example, predicts its Connected Drive cars will produce one petabyte per day in 2017 [14]. Another example is GPS receivers, often embedded into mobile phones but also other mobile devices. The latter is an example of a device that creates location-based data, also called spatial data. Machine-to-machine data is usually semi-structured.
other mobile devices. The later is an example of a device that creates location base
data, also called spatial data. Machine to machine data is usually semi-structured.
2.1.3 Big transaction data:

Transactional data grew with the dimensions of the systems recording it and the massive number of operations they conduct. Transactions can, for example, be purchases of items from large web shops, call detail records from telecommunication companies or payment transactions from credit card companies. These typically create structured or semi-structured data. Furthermore, big transactions can also refer to transactions that are accompanied or formed by human-generated, unstructured, mostly textual data. Examples here are call centre records accompanied by personal notes from the service agent, insurance claims accompanied by a description of the accident, or health care transactions accompanied by diagnosis and treatment notes written by the doctor.

2.1.4 Biometrics:

Biometric data in general is data describing a biological organism and is often used to identify individuals (typically humans) by their distinctive anatomical and behavioral characteristics and traits. Examples of anatomical characteristics are fingerprints, DNA or retinal scans, while behavioral refers, for example, to handwriting or keystroke analysis [13]. One important example of using large amounts of biometric data is scientific applications for genomic analysis.


2.1.5 Human-generated data:

According to Soares, human-generated data refers to all data created by humans. He mentions emails, notes, voice recordings, paper documents and surveys [13]. This data is mostly unstructured. It is also apparent that there is a strong overlap with two of the other categories, namely big transaction data and web data. Big transaction data that is categorized as such because it is accompanied by textual data, e.g. call centre agents' notes, has an obvious overlap. The same goes for some web content, e.g. blog entries and social media posts. This shows that the categorization is not mutually exclusive, and data can be categorized in more than one category.

2.2 Data Veracity:

According to a dictionary, veracity means conformity with truth or fact. In the context of big data, however, the term rather describes the absence of this characteristic. Put differently, veracity refers to the trust in the data, which might be impaired by the data being uncertain or imprecise.
There are several reasons for uncertainty and untrustworthiness of data. For one, when incorporating different data sources it is likely that the schema of the data varies: the same attribute name or value might relate to different things, or different attribute names or values might relate to the same thing. Jeff Jones, chief scientist at IBM's Entity Analytics Group, therefore claims that there is no such thing as a "single version of the truth" [15]. In fact, in the case of unstructured data there is not even a schema, and in the case of semi-structured data the schema of the data is not as exact and clearly defined as it used to be in more traditional data warehousing approaches, where data is carefully cleaned, structured and adheres to a relational specification [16]. In the case of unstructured data, where information first needs to be extracted, this information is often extracted with some probability and is therefore not completely certain. In that sense, variety directly works against veracity.

There are several possibilities to handle imprecise, unreliable, ambiguous or uncertain


d ca .

data. The first approach is typically used in traditional data warehousing efforts and
implies a thorough data cleansing and harmonization effort during the ETL process
that is at the time of extracting the data from its sources and loading it into the
analytic system. That way data quality and trust is ensured up front and the data
analysis itself is based on a trusted basis. In the face of big data this is often not
feasible, especially when hard velocity requirements are present, and sometimes
simply infeasible, as (automatic) information extraction from unstructured data is


always based on probability. Given the variety of data it is likely, that there still

remains some incompleteness and errors in data, even after data cleaning and error
correction [7].

Therefore it is always necessary to handle some errors and uncertainty during the
actual data analysis task and manage big data in the context of noise, heterogeneity and
uncertainty [7]. There are again essentially two options. The first option is, to do a
data cleaning and harmonization step directly before or during the analysis task. In
that case, the pre-processing can be done more specifically for the analysis task at hand
and therefore often be leaner. Not every analysis task needs to be based on completely
consistent data and retrieve completely exact results. Sometimes trends and
approximated results suffice [16].
The second option to handle uncertain data during the analysis task at hand is also
based on the notion, that some business problems do not need exact results, but results
good enough - that is with a probability above some threshold - are good enough
[16]. So, uncertain data can be analyzed without cleaning, but the results are presented
with some probability or certainty value, which is also impacted by the trust into the
underlying data sources and their data quality. This allows users to get an impression
of how trustworthy the results are. For this option it is even more crucial to thoroughly
track data provenance and its processing history [7].
2.3 Data Value:


Data is typically gathered with some immediate goal. Put differently, gathered data
offers some immediate value due to the first-time use for which it was initially collected.
Of course, data value is not limited to a one-time use or to the initial analysis
goal. The full value of data is determined by possible future analysis tasks, how they
get realized and how the data is used over time. Data can be reused, extended and
newly combined with another data set [17]. This is the reason why data is more and
more seen as an asset for organizations and the trend is to collect potential data even if
it is not needed immediately and to keep everything assuming that it might offer value
in the future.

One reason why data sets in the context of big data offer value is simply that some
of them are underused owing to the difficulty of leveraging them, given their volume,
velocity or lack of structure. They include information and knowledge which just was
not practical to extract before. Another reason for value in big data sources is their
interconnectedness, as claimed by Boyd and Crawford. They emphasize that data sets
are often valuable because they are relational to other data sets about a similar
phenomenon or the same individuals and offer insights when combined, which both
data sets do not provide if they are analyzed on their own. In that sense, value can be
provided when pieces of data about the same or a similar entity or group of entities

are connected across different data sets. Boyd and Crawford call this "fundamentally
networked".

According to the McKinsey Global Institute there are five different ways how this
data creates value. It can create transparency, simply by being more widely available
due to the new potential to leverage and present it. This makes it accessible to more
people who can get insights and draw value out of it.

It enables organizations to set up experiments, e.g. for process changes, and create
and analyze large amounts of data from these experiments to identify and understand
possible performance improvements [7].
2.4 Different Types of Clusters:
Clustering aims to find useful groups of objects (clusters), where usefulness is defined
by the goals of the data analysis. Not surprisingly, there are several different notions
of a cluster that prove useful in practice. In order to visually illustrate the differences
among these types of clusters, we use two-dimensional points, as shown in Figure 2.1,
as our data objects. We stress, however, that the types of clusters described here are
equally valid for other kinds of data.
2.4.1 Well-Separated:
A cluster is a set of objects in which each object is closer (or more similar) to every
other object in the cluster than to any object not in the cluster. Sometimes a threshold
is used to specify that all the objects in a cluster must be sufficiently close (or similar)
to one another. This idealistic definition of a cluster is satisfied only when the data
contains natural clusters that are quite far from each other. Figure 2.1(a) gives an
example of well separated clusters that consists of two groups of points in a two-
dimensional space. The distance between any two points in different groups is larger
than the distance between any two points within a group. Well-separated clusters do
not need to be globular, but can have any shape.

2.4.2 Prototype-Based:
A cluster is a set of objects in which each object is closer (more similar) to the
prototype that defines the cluster than to the prototype of any other cluster. For data
with continuous attributes, the prototype of a cluster is often a centroid, i.e., the
average (mean) of all the points in the cluster. When a centroid is not meaningful,
such as when the data has categorical attributes, the prototype is often a medoid, i.e.,
the most representative point of a cluster. For many types of data, the prototype can
be regarded as the most central point, and in such instances, we commonly refer to
prototype based clusters as center-based clusters. Not surprisingly, such clusters tend
to be globular. Figure 2.1(b) shows an example of center-based clusters.

2.4.3 Graph-Based:

If the data is represented as a graph, where the nodes are objects and the links

represent connections among objects, then a cluster can be defined as a connected
component; i.e., a group of objects that are connected to one another, but that have no
connection to objects outside the group. Important examples of graph-based clusters
are contiguity-based clusters, where two objects are connected only if they are within
a specified distance of each other.
(a) Well-separated clusters. Each point is closer to all of the points in its cluster than to any point in another cluster.

(b) Center-based clusters. Each point is closer to the center of its cluster than to the center of any other cluster.

(c) Contiguity-based clusters. Each point is closer to at least one point in its cluster than to any point in another cluster.

(d) Density-based clusters. Clusters are regions of high density separated by regions of low density.

(e) Conceptual clusters. Points in a cluster share some general property that derives from the entire set of points. Points in the intersection of the circles belong to both.

Figure 2.1: Different types of clusters as illustrated by sets of two-dimensional points.
Figure 2.1(c) shows an example of such clusters for two-dimensional points. This
definition of a cluster is useful when clusters are irregular or intertwined, but can have
trouble when noise is present since, as illustrated by the two spherical clusters of
Figure 2.1(c), a small bridge of points can merge two distinct clusters.

Other types of graph-based clusters are also possible. One such approach defines a
cluster as a clique; i.e., a set of nodes in a graph that are completely connected to each

other. Specifically, if we add connections between objects in the order of their
distance from one another, a cluster is formed when a set of objects forms a clique.
Like prototype-based clusters, such clusters tend to be globular.
2.4.4 Density-Based:

A cluster is a dense region of objects that is surrounded by a region of low density.


Figure 2.1(d) shows some density-based clusters for data created by adding noise to
the data of Figure 2.1(c). The two circular clusters are not merged, as in Figure 2.1
(c), because the bridge between them fades into the noise. Likewise, the curve that is
present in Figure 2.1(c) also fades into the noise and does not form a cluster in Figure
2.1(d). A density based definition of a cluster is often employed when the clusters are
irregular or intertwined, and when noise and outliers are present. By contrast, a
contiguity based definition of a cluster would not work well for the data of Figure
2.1(d) since the noise would tend to form bridges between clusters.

2.4.5 Shared-Property:
We can define a cluster as a set of objects that share some property. This definition
encompasses all the previous definitions of a cluster; e.g., objects in a center-based
cluster share the property that they are all closest to the same centroid or medoid.
However, the shared-property approach also includes new types of clusters. Consider
the clusters shown in Figure 2.1(e). A triangular area (cluster) is adjacent to a


rectangular one, and there are two intertwined circles (clusters). In both cases, a
clustering algorithm would need a very specific concept of a cluster to successfully
detect these clusters. The process of finding such clusters is called conceptual
clustering.
2.5 Clustering Method:


In this section we describe the most well-known clustering algorithms. The main
reason for having many clustering methods is the fact that the notion of cluster is
not precisely defined (Estivill-Castro, 2000) [21]. Consequently many clustering


methods have been developed, each of which uses a different induction principle.
Fraley and Raftery (1998) [21] suggest dividing the clustering methods into two main
groups: hierarchical and partitioning methods. Han and Kamber (2001) [21] suggest
categorizing the methods into additional three main categories: density-based
methods, model-based clustering and grid- based methods. An alternative
categorization based on the induction principle of the various clustering methods is


presented in (Estivill-Castro, 2000) [21].

As there are so many clustering algorithms, this section introduces a categorizing
framework that groups the various clustering algorithms found in the literature into

distinct categories. The proposed categorization framework is developed from an
algorithm designer's perspective that focuses on the technical details of the general
procedures of the clustering process. Accordingly, the processes of different
the he s
clustering algorithms can be broadly classified.
2.5.1 Partitioning Based:
Partitioning methods relocate instances by moving them from one cluster to another,
starting from an initial partitioning. Such methods typically require that the number of
clusters will be pre-set by the user. To achieve global optimality in partitioned-based
clustering, an exhaustive enumeration process of all possible partitions is required.
Because this is not feasible, certain greedy heuristics are used in the form of iterative
optimization. Namely, a relocation method iteratively relocates points between the k
clusters. In such algorithms, all clusters are determined promptly. Initial groups are
specified and reallocated towards a union. In other words, the partitioning algorithms
divide data objects into a number of partitions, where each partition represents a
cluster. These clusters should fulfill the requirements: (1) Each group must contain at
least one object, and (2) Each object must belong to exactly one group. In the K-
means algorithm, for instance, a center is the average of all points and coordinates
representing the arithmetic mean. In the K-medoids algorithm, objects which are near
the center represent the clusters.
There are many other partitioning algorithms such as K-modes, PAM, CLARA,
CLARANS and FCM.

2.5.2 Hierarchical Based:


Data are organized in a hierarchical manner depending on the medium of proximity.


Proximities are obtained by the intermediate nodes. A dendrogram represents the


datasets, where individual data is presented by leaf nodes. The initial cluster gradually
divides into several clusters as the hierarchy continues. These methods construct the
clusters by recursively partitioning the instances in either a top-down or bottom-up


fashion. These methods can be subdivided as follows:
Agglomerative hierarchical clustering; Each object initially represents a cluster of its


own. Then clusters are successively merged until the desired cluster structure is
obtained.
Divisive hierarchical clustering; All objects initially belong to one cluster. Then the
cluster is divided into sub-clusters, which are successively divided into their own sub
clusters. This process continues until the desired cluster structure is obtained.
The result of the hierarchical methods is a dendrogram, representing the nested


grouping of objects and similarity levels at which groupings change. A clustering of

the data objects is obtained by cutting the dendrogram at the desired similarity level.
The merging or division of clusters is performed according to some similarity

measure, chosen so as to optimize some criterion (such as a sum of squares). The
hierarchical method has a major drawback though, which relates to the fact that once
a step (merge or split) is performed, this cannot be undone.
BIRCH, CURE, ROCK and Chameleon are some of the well-known algorithms of
this category.
Partitioning Based: 1. K-means  2. K-modes  3. K-medoids  4. PAM  5. CLARANS  6. CLARA  7. FCM

Hierarchical Based: 1. BIRCH  2. CURE  3. ROCK  4. Chameleon  5. Echidna

Density Based: 1. DBSCAN  2. OPTICS  3. DBCLASD  4. DENCLUE

Grid Based: 1. Wave-Cluster  2. STING  3. CLIQUE  4. OptiGrid

Model Based: 1. EM  2. COBWEB  3. CLASSIT  4. SOMs

Figure 2.2: Overview of the clustering algorithms taxonomy

2.5.3 Density Based:


Density-based methods assume that the points that belong to each cluster are drawn
from a specific probability distribution (Banfield and Raftery, 1993) [21]. The overall

distribution of the data is assumed to be a mixture of several distributions. Here, data
objects are separated based on their regions of density, connectivity and boundary.

They are closely related to point-nearest neighbors. A cluster, defined as a connected
dense component, grows in any direction that density leads to. Therefore, density
based algorithms are capable of discovering clusters of arbitrary shapes. Also, this
provides a natural protection against outliers. Thus the overall density of a point is
analyzed to determine the functions of datasets that influence a particular data point.
DBSCAN, OPTICS, DBCLASD and DENCLUE are algorithms that use such a
method to filter out noise and discover clusters of arbitrary shape.

2.5.4 Grid Based:


These methods partition the space into a finite number of cells that form a grid
structure on which all of the operations for clustering are performed. The main
advantage of the approach is its fast processing time (Han and Kamber, 2001) [21].
The space of the data objects is divided into grids. The main advantage of this
approach is its fast processing time, because it goes through the dataset once to
compute the statistical values for the grids. The accumulated grid-data make grid-
based clustering techniques independent of the number of data objects that employ a
uniform grid to collect regional statistical data, and then perform the clustering on the
grid, instead of the database directly. The performance of a grid-based method
depends on the size of the grid, which is usually much less than the size of the
database. However, for highly irregular data distributions, using a single uniform grid
may not be sufficient to obtain the required clustering quality or fulfill the time
requirement. Wave-Cluster and STING are typical examples of this category.

2.5.5 Model Based:


These methods attempt to optimize the fit between the given data and some
mathematical models. Unlike conventional clustering, which identifies groups of


objects; model-based clustering methods also find characteristic descriptions for each
group, where each group represents a concept or class. The most frequently used
induction methods are decision trees and neural networks.

Decision Trees; In decision trees, the data is represented by a hierarchical tree, where
each leaf refers to a concept and contains a probabilistic description of that concept.
Several algorithms produce classification trees for representing the unlabelled data.

Neural Networks; This type of algorithm represents each cluster by a neuron or


prototype. The input data is also represented by neurons, which are connected to the
prototype neurons. Each such connection has a weight, which is learned adaptively
during learning.
MCLUST is probably the best-known model-based algorithm, but there are other
good algorithms, such as EM (which uses a mixture density model), conceptual

clustering (such as COBWEB), and neural network approaches (such as self-
organizing feature maps).

2.6 Clustering Validity:
Assessing the validity of a clustering structure is a statistical problem. By clustering
structure we mean a hierarchy, a partition, or an individual cluster. A statistic is
ntic up
chosen, a null hypothesis of randomness is adopted, and a baseline distribution, or a
distribution for the statistic under the null hypothesis, is derived. A threshold for the
statistic is then selected from the baseline distribution to assure that the probability of
rejecting the null hypothesis when it is true is not greater than a predefined significance level α.
The random graph and random permutation hypotheses have been widely adopted for
ordinal data, while the random position hypothesis, corresponding to a Poisson
process, has been applied when the objects are described by pattern matrices. One
must pay close attention to the details and be sure that all assumptions are appropriate
if a theoretical null distribution for the statistic is to be used. Otherwise, this
distribution must be established with Monte Carlo methods.

The distinction between external and internal indices of cluster validity has often been
blurred and confused in the literature. An external index evaluates a clustering
structure in terms of prior information, such as category labels which have been
assigned without reference to the pattern or proximity matrix. An internal index, on
the other hand, uses only the proximity matrix itself and information from the cluster
analysis. A relative criterion compares two statistics, clustering methods, or
characteristics.
The clustering tendency problem has not received a great deal of attention but is
certainly an important problem. One wants to believe that data are clustered and is
naturally biased toward believing the results of a cluster analysis. Testing the data for
randomness before actually clustering the data should help remove this bias. A
number of tests were cataloged and literature references were provided.


The validation of clustering structures is the most difficult and frustrating part of
cluster analysis. Without a strong effort in this direction, cluster analysis will remain a
black art accessible only to those true believers who have experience and great
courage.
2.7 K-Means Algorithm:


In the world of machine learning and data mining there are many algorithms in
supervised, unsupervised and semi-supervised learning. They are used for different
goals such as pattern recognition, prediction, classification, information retrieval and

clustering. The practical applications of these algorithms are endless and can permeate
any field of study. Our thesis focuses on unsupervised learning, specifically

clustering. The concept of clustering in data mining deals with identifying similar
information without prior knowledge about the data classes. One issue of clustering is
that it is prone to various interpretations and does not have an explicit definition. A
the he s
famous quote mentioned by Estivill-Castro is that "clustering is in the eye of the
beholder", making an allusion to the popular phrase "Beauty is in the eye of the
beholder", where many different approaches can be taken to analyze clustering. This
is also a reason why there are many different kinds of algorithms for clustering.

A widely used algorithm for clustering is k-means, whose purpose is to split a
dataset into distinct groupings. The goal of k-means is to identify similarities
between items in a dataset and group them together based on a similarity function.
The most commonly used function is the Euclidean distance function, but any other
similarity function can be applied. K-means is an interesting option due to its
simplicity and speed. In essence it is an optimization problem where a locally optimal
clustering can be achieved, whereas a global optimum is not guaranteed. It is a
heuristic algorithm in which the final result will be highly dependent on the initial
settings, although a fairly good approximation is possible. For this reason, choosing
the exact solution for k-means is an NP-hard problem. K-means takes as input a
number of desired clusters k and a data set X. The goal is to choose k centers to
minimize the sum of squared Euclidean distance as presented in the following


function.
$$\phi = \sum_{x \in X} \min_{c \in C} \left( \lVert x - c \rVert^{2} \right)$$

Historically k-means was discovered by some researchers from different disciplines.


The researcher most famously credited as the author is Lloyd (1957, 1982), along with
Forgy (1965), Friedman and Rubin (1967), and MacQueen (1967).


2.7.1 Basic K-means Algorithm:


K-means is one of the simplest unsupervised learning algorithms that solve the well-
known clustering problem. The procedure follows a simple and easy way to classify a
given data set through a certain number of clusters (assume R clusters) fixed a priori.
The main idea is to define k centers, one for each cluster. These centers should be
placed in a cunning way because a different location causes a different result. So, the
better choice is to place them as far away from each other as possible. The next
step is to take each point belonging to a given data set and associate it to the nearest
center. When no point is pending, the first step is completed and an early grouping is
done. At this point we need to re-calculate k new centroids as barycenters of the
clusters resulting from the previous step. After we have these k new centroids, a new
binding has to be done between the same data set points and the nearest new center.
A loop has been generated. As a result of this loop we may notice that the k centers
change their location step by step until no more changes are done or in other words

centers do not move any more. Finally, this algorithm aims at minimizing an objective
function known as the squared error function, given by:

$$J(C) = \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left\lVert x_i^{(j)} - c_j \right\rVert^{2}$$

Standard K-means clustering algorithm:

Input:

U = {u1, u2, u3, ..., un}    // set of n data points

Value of k                   // number of desired clusters

Get random centroids or define initial centroids.

Output:

A set of clusters.           // k clusters

Steps:

I) Input the value of k and the initial centroids, or get random centroids.

II) Calculate the distance between each data point and each centroid.

III) Assign each data point to the cluster of its closest centroid.

IV) Recalculate the centroids and repeat step II) until no change happens in the clusters.

Figure 2.3: Standard K-means clustering algorithm

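For illustration, the listing in Figure 2.3 can be turned into code. The following is a minimal C++ sketch for two-dimensional points; the Point type, the choice of the first k points as initial centroids and the iteration cap are assumptions made for this example and are not prescribed by the listing above.

#include <cstddef>
#include <vector>

struct Point { double x, y; };

// Squared Euclidean distance between two 2-D points.
static double dist2(const Point& a, const Point& b) {
    double dx = a.x - b.x, dy = a.y - b.y;
    return dx * dx + dy * dy;
}

// Standard k-means: returns the cluster index assigned to each point.
std::vector<int> kmeans(const std::vector<Point>& data, int k, int maxIter = 100) {
    // Step I: take the first k points as initial centroids (an arbitrary choice).
    std::vector<Point> centroids(data.begin(), data.begin() + k);
    std::vector<int> label(data.size(), -1);

    for (int iter = 0; iter < maxIter; ++iter) {
        bool changed = false;

        // Steps II-III: assign each data point to its closest centroid.
        for (std::size_t i = 0; i < data.size(); ++i) {
            int best = 0;
            for (int c = 1; c < k; ++c)
                if (dist2(data[i], centroids[c]) < dist2(data[i], centroids[best]))
                    best = c;
            if (label[i] != best) { label[i] = best; changed = true; }
        }

        // Step IV: recalculate each centroid as the mean of its members.
        std::vector<Point> sum(k, Point{0.0, 0.0});
        std::vector<int> count(k, 0);
        for (std::size_t i = 0; i < data.size(); ++i) {
            sum[label[i]].x += data[i].x;
            sum[label[i]].y += data[i].y;
            ++count[label[i]];
        }
        for (int c = 0; c < k; ++c)
            if (count[c] > 0)
                centroids[c] = Point{sum[c].x / count[c], sum[c].y / count[c]};

        if (!changed) break;   // no point changed its cluster: clusters are stable
    }
    return label;
}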

2.7.2 Algorithm steps with example:

This non-hierarchical method initially takes the number of components of the


population equal to the final required number of clusters. In this step itself the final
required number of clusters is chosen such that the points are mutually farthest apart.
Next, it examines each component in the population and assigns it to one of the
clusters depending on the minimum distance. The centroid's position is recalculated

every time a component is added to the cluster and this continues until all the
components are grouped into the final required number of clusters.

Let us apply the k-means clustering algorithm to the following example and obtain
four clusters:
Figure 2.5: Position of each food item by protein content (P) and fat content (F).
Food Item #       Protein Content (P)    Fat Content (F)
Food Item # 01    1.1                    60
Food Item # 02    8.2                    30
Food Item # 03    4.2                    35
Food Item # 04    1.5                    21
Food Item # 05    7.6                    15
Food Item # 06    2.0                    55
Food Item # 07    3.9                    39

Table 2.1: Food items with fat & protein content for clustering.
Cluster Number    Protein Content (P)    Fat Content (F)
C1                1.1                    60
C2                8.2                    30
C3                4.2                    35
C4                1.5                    21

Table 2.2: Food items with fat & protein content in the initial clusters.
Let us plot these points so that we can have a better understanding of the problem. Also,
we can select the four points which are farthest apart.

Figure 2.4: Flow-chart of the standard k-means clustering algorithm (start; input the data, the number of clusters R and the centroid information C; calculate the centroids; compute the distance between each data point and the centroids; assign each point to the cluster with the minimum distance; while any object moves to a new centroid, repeat; otherwise end).


We see from the graph that the distance between the points 1 and 2, 1 and 3, 1 and 4,
1 and 5, 2 and 3, 2 and 4, 3 and 4 is maximum.

Also, we observe that point 1 is close to point 6. So, both can be taken as one cluster.
The resulting cluster is called C16 cluster. The value of P for C16 centroid is (1.1 +

2.0)/2 = 1.55 and F for C16 centroid is (60 + 55)/2 = 57.50.

Upon closer observation, the point 2 can be merged with the C5 cluster. The resulting
cluster is called C25 cluster. The values of P for C25 centroid is (8.2 + 7.6)/2 = 7.9
the he s
und
and F for C25 centroid is (20 + 15)/2 = 17.50.

The point 3 is close to point 7. They can be merged into C37 cluster. The values of P
ntic up
for C37 centroid is (4.2 + 3.9)/2 = 4.05 and F for C37 centroid is (35 + 39)/2 = 37.

The point 4 is not close to any point. So, it is assigned to cluster number 4 i.e., C4
reco ervis
with the value of P for C4 centroid as 1.5 and F for C4 centroid is 21.

Finally, four clusters with four centroids have been obtained.


Cluster Number    Protein Content (P)    Fat Content (F)
C16               1.55                   57.50
C25               7.9                    17.5
C37               4.05                   37
C4                1.5                    21

Table 2.3: Food items with fat & protein content in the final clusters.
2.7.3 K-means clustering properties:

Finally, we can say that the k-means clustering properties are:


There are always K clusters.


There is always at least one item in each cluster.


The clusters are non-hierarchical and they do not overlap.


Every member of a cluster is closer to its cluster than to any other cluster, because
closeness does not always involve the 'center' of clusters.


2.7.4 Advantage:
K-means has some advantages. Because of those advantages, k-means has been the most
used clustering algorithm over the last fifty years. The advantages are:

With a large number of variables, K-Means may be computationally faster


than hierarchical clustering (if K is small).


K-Means may produce tighter clusters than hierarchical clustering, especially
if the clusters are globular.


Fast, robust and easier to understand.
Gives the best results when the data sets are distinct or well separated from each other.

2.7.5 Disadvantages:

K-means also has some disadvantages. Those are:


Difficulty in comparing the quality of the clusters produced (e.g. different
initial partitions or values of K affect the outcome).
A fixed number of clusters can make it difficult to predict what K should be.
The learning algorithm requires a priori specification of the number of
cluster centers.
The use of Exclusive Assignment - If there are two highly overlapping
data then k-means will not be able to resolve that there are two clusters.
The learning algorithm is not invariant to non-linear transformations, i.e.
with different representations of the data we get different results (data
represented in the form of Cartesian co-ordinates and polar co-ordinates will
give different results).
Euclidean distance measures can unequally weight underlying factors.
The learning algorithm provides the local optima of the squared error
function.
Randomly choosing the cluster centers may not lead us to a fruitful result.
Applicable only when the mean is defined, i.e. it fails for categorical data.
Unable to handle noisy data and outliers.


The algorithm fails for non-linear data sets.
CHAPTER 3
Background and Related Work

3.1 Background of K-means Clustering


The development of clustering methodology has been a truly interdisciplinary
endeavor. Taxonomists, social scientists, psychologists, biologists, statisticians,
mathematicians, engineers, computer scientists, medical researchers, and others who
collect and process real data have all contributed to clustering methodology.
According to JSTOR [jst, n.d.], data clustering first appeared in the title of a 1954
article dealing with anthropological data. Data clustering is also known as Q-analysis,
typology, clumping, and taxonomy [Jain & Dubes, 1988] depending on the field
where it is applied.
There are several books published on data clustering; classic ones are by Sokal and
Sneath [Sokal & Sneath, 1963], Anderberg [Anderberg, 1973], Hartigan [Hartigan,
1975], Jain and Dubes [Jain & Dubes, 1988] and Duda et al. [Duda et al. , 2001].
Clustering algorithms have also been extensively studied in data mining (see books by
Han and Kamber [Han & Kamber, 2000] and Tan et al. [Tan et al., 2005]) and
machine learning [Bishop, 2006].
K-means has a rich and diverse history as it was independently discovered in different
scientific fields by Steinhaus (1955) [Steinhaus, 1956], Lloyd (1957) [Lloyd, 1982],
Ball & Hall (1965) [Ball & Hall, 1965] and MacQueen (1967) [MacQueen, 1967].
Even though K-means was first proposed over 50 years ago, it is still one of the most
widely used algorithms for clustering. Ease of implementation, simplicity, efficiency,
and empirical success are the main reasons for its popularity. Details of K-means are
summarized below.
The basic K-means algorithm has been extended in many different ways. Some of
these extensions deal with additional heuristics involving the minimum cluster size
and merging and splitting clusters.
Two well-known variants of K-means in pattern recognition literature are Isodata


[Ball & Hall, 1965] and Forgy [Forgy, 1965]. In K-means, each data point is assigned
to a single cluster (called hard assignment). Fuzzy c-means, proposed by Dunn


[Dunn, 1973] and later improved by Bezdek [Bezdek, 1981], is an extension of K-
means where each data point can be a member of multiple clusters with a membership
value (soft assignment). A good overview of fuzzy set based clustering is available in
Backer (1978) [Backer, 1978]. Some of the other significant modifications are
summarized below. Steinbach et al. [Steinbach et al. , 2000] proposed a hierarchical

divisive version of K-means, called bisecting K-means, that recursively partitions the
data into two clusters at each step. A significant speed up of the K-means pass [Jain &

Dubes, 1988] in the algorithm (computation of distances from all the points to all the
cluster centers) involves representing the data using a kd-tree and updating cluster
means with groups of points instead of a single point [Pelleg & Moore, 1999].
the he s
Bradley et al. [Bradley et al. , 1998] presented a fast scalable and single-pass version
of K-means that does not require all the data to be fit in the memory at the same time.
X- means [Pelleg & Moore, 2000] automatically finds K by optimizing a criterion
such as Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC).
In K- medoid [Kaufman & Rousseeuw, 2005], clusters are represented using the
median of the data instead of the mean. Kernel K-means [Scholkopf et al. , 1998] was
proposed to detect arbitrary shaped clusters, with an appropriate choice of the
kernel similarity function. Note that all these extensions introduce some additional
algorithmic parameters that must be specified by the user.

3.2 Previous Works


To overcome the abovementioned problems, different approaches and techniques have


been proposed.
Krishna and Murty [37] proposed the genetic K-means (GKA) algorithm, integrating
a genetic algorithm with K-means, in order to achieve a global search and fast
convergence.

Jain and Dubes [19] recommend running the algorithm several times with random
initial partitions. The clustering results on these different runs provide some insights
into the quality of the ultimate clusters.


Forgy's method [35] generates the initial partition by first randomly selecting K
points as prototypes and then separating the remaining points based on their distance
from these seeds.
Likas et al. [38] proposed a global K-means algorithm consisting of a series of K-


means clustering procedures with the number of clusters varying from 1 to K. One
disadvantage of the algorithm lies in the requirement for executing K-means N times
for each value of K, which causes high computational burden for large data sets.

Bradley and Fayyad [36] presented a refined algorithm that applies K-Means M times
to M random subsets sampled from the original data.

The most common initialization was proposed by Pena, Lozano et al. [39]. This
method randomly selects K points from the data set as centroids. The main
advantage of the method is its simplicity and the opportunity to cover the solution
space rather well by initializing the algorithm multiple times.

Ball and Hall proposed the ISODATA algorithm [40], which estimates K
dynamically. For the selection of a proper K, a sequence of clustering structures can be

obtained by running K-means several times from the possible minimum Kmin to the
maximum Kmax [34]. These structures are then evaluated based on constructed indices
and the expected clustering solution is determined by choosing the one with the best
the he s
index [41].
The popular approach for evaluating the number of clusters in K-means is the Cubic
Clustering Criterion [42] used in SAS Enterprise Miner.

Another modified k-means algorithm was presented by Mahbub and Fahad. They
worked on the value of K and calculated an initial centroid. Making a feasible data
points list is one of their important contributions.
CHAPTER 4
Proposed Methodology
4.1 Modified Algorithm
There has been a lot of experimentation and research to improve the standard k-means.
Researchers and scientists have worked on improving the native k-means: for simplicity,
to save time, and to make it more accurate. They have tried to fix limitations such as the
number of clusters and the way to find the initial centroids. They have also worked on
simplifying the computation, minimizing the calculations needed to get the appropriate
clusters and cluster members in each iteration. Influenced by those theses and that
research, we decided to make some modifications that can be more efficient and faster.

We worked on fixing the problem of getting the initial number of clusters (R). We also
worked on finding the initial centroids. In the last part of our algorithm, we try to
minimize calculation by finding the feasible points to work on. When we combined all
of those, we obtained an effective modified k-means algorithm.
Modified K-means algorithm:

Input:

U = {u1, u2, u3, ..., un}    // set of n data points

Output:

A set of clusters.           // R clusters

Steps:

I) Calculate R = √(n/2).

II) Allocate x = 1 and assign a value to f. // if the value of f is not given then it is 1 by default, and f > 0

III) Find the closest pair of data points in U. Move those points to a new set Nx.

IV) Find the closest point of Nx in U and move it from U to Nx.

V) Repeat step IV) until the number of elements of Nx reaches (n · f) / R.

VI) When Nx is full and the number of elements of U ≠ 0, increment the value of x (new x = x + 1) and repeat step III).

VII) Calculate the center of gravity of each Nx. Those are the initial centroids Cm, and the elements of Nx are the elements of Rm.

VIII) Find the closest centroid for each data point and allocate each data point to that centroid's cluster. Calculate a new centroid Cm for each Rm.

IX) Calculate Dm (Dm = the distance of the farthest data point from the centroid of each cluster).

X) Get the data points in the interval (4/9) · Dm to Dm.

XI) Find the closest centroid for those points and allocate them to the closest centroid's cluster.

XII) Find the new centroid Cm for each Rm.

XIII) If there is any change in any Rm, repeat step IX). Otherwise, go to end.
rd o ion
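To make the set-building phase (steps I-VII) concrete, the following is a minimal C++ sketch for two-dimensional points. It is only an interpretation of the listing above: the rounding of the set capacity, the use of the closest pair's mean as the reference point for filling a set (as in the worked example of Section 4.3.1), and the handling of a single leftover point are assumptions of this sketch.

#include <cmath>
#include <cstddef>
#include <vector>

struct Point { double x, y; };

static double dist(const Point& a, const Point& b) {
    return std::hypot(a.x - b.x, a.y - b.y);   // Euclidean distance
}

// Mean (center of gravity) of a non-empty set of points.
static Point centerOfGravity(const std::vector<Point>& s) {
    Point g{0.0, 0.0};
    for (const Point& p : s) { g.x += p.x; g.y += p.y; }
    g.x /= s.size(); g.y /= s.size();
    return g;
}

// Steps I-VII: split U into sets N1, N2, ... of capacity (n*f)/R each and
// return the center of gravity of every set as the initial centroids.
std::vector<Point> buildInitialSets(std::vector<Point> U, double f,
                                    std::vector<std::vector<Point>>& sets) {
    const std::size_t n = U.size();
    const double R = std::sqrt(n / 2.0);                                    // step I
    const std::size_t cap = static_cast<std::size_t>(std::ceil(n * f / R)); // elements per set, rounded up

    while (!U.empty()) {                                  // step VI: keep starting new sets until U is empty
        std::vector<Point> N;
        if (U.size() == 1) {                              // a single leftover point forms its own set
            N.push_back(U.back());
            U.pop_back();
        } else {
            // Step III: move the globally closest pair of remaining points into N.
            std::size_t bi = 0, bj = 1;
            for (std::size_t i = 0; i < U.size(); ++i)
                for (std::size_t j = i + 1; j < U.size(); ++j)
                    if (dist(U[i], U[j]) < dist(U[bi], U[bj])) { bi = i; bj = j; }
            N.push_back(U[bi]); N.push_back(U[bj]);
            U.erase(U.begin() + bj); U.erase(U.begin() + bi);   // bj > bi, erase the larger index first

            // Steps IV-V: pull the point nearest to the pair's mean until N holds cap elements.
            const Point ref = centerOfGravity(N);
            while (!U.empty() && N.size() < cap) {
                std::size_t best = 0;
                for (std::size_t i = 1; i < U.size(); ++i)
                    if (dist(U[i], ref) < dist(U[best], ref)) best = i;
                N.push_back(U[best]);
                U.erase(U.begin() + best);
            }
        }
        sets.push_back(N);
    }

    std::vector<Point> centroids;                         // step VII: initial centroids
    for (const std::vector<Point>& N : sets) centroids.push_back(centerOfGravity(N));
    return centroids;
}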
In the first part, we find the number of clusters and then assign the data points to initial
clusters. Then we find effective and useful initial centroids. These two parts involve
some loops and calculations, but they produce a better clustering than other clustering
methods. To make the algorithm more intelligent, some procedures must be added; this
intelligence helps the algorithm determine the number of clusters and which points are
the initial centroids. In the last portion of the modified algorithm, we work only with
those feasible data points which have a chance to change their current cluster and move
to a new cluster. We make a short list of points to calculate. This minimizes calculation,
which is why it can save time compared with the original standard k-means algorithm.


In step I, we calculate the equation and get an initial idea of the number of clusters;
this calculation is not the final decision.
In step II, we assign a value to x to maintain the cluster number. In steps III, IV and
V, the algorithm creates a new cluster and assigns data points to this cluster.

In step VI, the algorithm finalizes the member data points of the current initial cluster
and decides to start a new cluster.

Step VII finds the initial centroids for the clusters. With the help of step VII, step VIII
finalizes a stable cluster and centroid.
Steps IX, X, XI and XII contain the trick for finding the feasible data points that have a
chance to change their current cluster. We work only with those interval points;
normally the other points do not move between clusters. Because of this step, the
algorithm saves a lot of time and avoids a lot of calculation.

Step XIII is used to make the stopping decision: the algorithm decides whether to
continue or whether all clustering is finished.

Figure 4.1: Flow-chart of the modified k-means algorithm (start; input the data points U; calculate R and allocate a value to x; move the closest data point pair in U to Nx; keep moving the data point of U closest to Nx into Nx until Nx holds (n·f)/R elements; when no element is left in U, calculate the initial centroids Cm and clusters Rm; calculate Dm; get the data points in the interval (4/9)·Dm to Dm; recalculate Cm and Rm for the inner interval points; if any Rm changed, repeat from the Dm calculation; otherwise end).

The main working policy of the modified algorithm is the same as that of the original
k-means algorithm. The modified algorithm adds some mechanisms that make it more
intelligent, effective and time-saving.

4.2 Field of Modification


The main limitation areas of the standard k-means algorithm are: fixing the value of K,
making the initial centroids, and the fact that every iteration calculates against each and
every data point. Many researchers have worked on those fields individually. Our point
of view is to modify k-means so that it can overcome all three limitations; that is why
we work on those three fields. Our concern in the modification is to combine all the
limitations and solve them for maximum benefit.
We solve the number-of-clusters problem: the algorithm can determine its own number
of clusters from the working data points. Our expression for determining the number of
clusters is R = √(n/2), where R is the number of clusters and n is the number of data
points inputted. This is only the initial concept of the number of clusters; the number of
clusters is fixed by the next factor. The value of f is determined on the basis of the
input: if we apply k-means to humans, the value of f is not equal to the value used when
we apply k-means to musical instruments. If no value of f is given, then f is 1 by
default. A value of f smaller than 1 increases the number of clusters, and if the value of
f is bigger than 1, the number of clusters is reduced.
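As a small numerical illustration of how R and f interact, the following sketch computes R = √(n/2) and the resulting number of elements per set Nx. The rounding behaviour is an assumption taken from the worked example in Section 4.3, where n = 50 and f = 8/10 give 8 elements per set.

#include <cmath>
#include <cstdio>

int main() {
    const int n = 50;               // number of data points
    const double f = 8.0 / 10.0;    // scaling factor chosen for this data

    const double R = std::sqrt(n / 2.0);                          // initial concept of the number of clusters
    const int perSet = static_cast<int>(std::ceil(n * f / R));    // elements per Nx, fractions rounded up

    std::printf("R = %.2f, elements per Nx = %d\n", R, perSet);   // prints: R = 5.00, elements per Nx = 8
    return 0;
}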
When we get the number of clusters R, our next target is to get proper initial centroids.
After getting the value of f, we compute the closest pair, keep it in Nx, and delete those
points from the input set U. Then we fill Nx by finding the points nearest to that pair;
whenever a data point goes into an Nx it is deleted from U. When Nx is full we increase
the value of x and continue the same process as a loop until no data point is left in U.
The mean value of the closest pair points serves as the initial reference. The points
closest to these initial points are assigned into Rm; those are the initial clusters. We then
calculate the center of gravity of each Rm to find the centroids; those centers of gravity
are the centroids of each Rm. Now we have initial clusters and centroids.
When the clusters and centroids are in hand, we continue a loop to finalize the clusters.
In the traditional k-means algorithm, all points are recalculated to get the final clusters
and their members. In practice, the centroids move only a little and the points closest to
a center remain unchanged. We keep this in mind and provide an interval: calculate the
largest-distanced data point D; the interval is from (4/9)·D to D. The inner interval
points go for
calculation. In each loop, we assign the inner interval points to their clusters and
calculate new centroids. If any change happens in any cluster, the same process is done
again and again. When there is no change in any cluster, we have found our desired
clusters; those are the final clusters we want.
In the first two portions we make our algorithm capable of finding the number of clusters
and the initial centroids with the initial clusters. In the last part of the algorithm we introduce an interval.

Only the inner interval points go for calculation and cluster reallocation. We keep a lot
of points outside the interval; those points' clusters are unchanged. This reduces
calculation and makes our algorithm faster than the traditional k-means clustering algorithm.

4.3 Algorithm Evaluation


When an algorithm is created or modified, nobody accepts it without evaluation. We
have modified the k-means algorithm; now we evaluate our algorithm with a 2-D
distance method.

We take 50 products. Two characteristics are considered here: size and weight. Let us
consider size as the X-axis and weight as the Y-axis. The products are numbered 1 to 50.
The product details are given in Table 4.1.

SL No: Size (X) Weight(Y) SL No: Size (X) Weight(Y)


1 23 56 26 29 30
2 11 49 27 62 28
3 37 19 28 67 11
4 28 52 29 15 13
5 65 31 30 12 17
6 44 33 31 5 20
7 7 48 32 14 55
8 13 39 33 42 26
9 32 8 34 66 14
10 40 35 35 64 3
11 47 57 36 53 7
12 60 18 37 25 16
13 55 50 38 36 2
14 17 5 39 38 6
15 46 15 40 10 32
16 34 45 41 61 44
17 58 10 42 41 54
18 50 23 43 57 42
19 8 25 44 35 22
20 26 51 45 27 40
21 52 38 46 2 27
22 19 43 47 16 41
23 59 47 48 20 36
24 56 21 49 54 46
25 3 37 50 33 53

Table 4.1: 50 products with their size (X) and weight (Y) values.



In primary step, calculate for take concept about number of cluster. Here we
2
d

work with 50 products. So, number of content = 50. Than we get 5. It is not
the final cluster number. Because of, number of cluster is also depended on value of
out

value of . If I dont assign any value to . By default it will be 1. We assign value

f = 8/10. Each Nx then has (n · f)/R = (50 × 0.8)/5 = 8 elements. If we get a fractional
number here, we round it up. We can see the position of each product in the (x, y)
coordinate plane in

Figure 4.2.

Figure 4.2: The 50 data points plotted in the (x, y) coordinate plane.


4.3.1 Initial Sets and Member Selection


To calculate the initial sets and their members, we need to find the closest pair of data
points in the whole data set. We have 50 products and their positions.
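As a quick check of the distances used below, a tiny C++ helper reproduces the first value: using the coordinates from Table 4.1, the distance between product 4 at (28, 52) and product 20 at (26, 51) is √5 ≈ 2.236068.

#include <cmath>
#include <cstdio>

// Euclidean distance between two products given as (size, weight) pairs.
static double distance(double x1, double y1, double x2, double y2) {
    return std::hypot(x1 - x2, y1 - y2);
}

int main() {
    // Product 4 = (28, 52) and product 20 = (26, 51), taken from Table 4.1.
    std::printf("%f\n", distance(28.0, 52.0, 26.0, 51.0));   // prints 2.236068
    return 0;
}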
By calculating, we find that product 4 and product 20 are the closest pair in the whole
data set U. The distance between them is 2.236068. Assign them to set N1. Find the
mean value of the (x, y) coordinates of product 4 and product 20: (27, 51.5) is the mean
value of 4 and 20. Now find the closest point to 4 and 20 by calculating the distances
from the other points to (27, 51.5). Product 1 is the closest to (27, 51.5); the distance is
6.020797. Move product 1 from U to N1. Now we have 3 members in N1. We keep
seeking members until the number of members is equal to 8. The next closest product is
50, at distance 6.184658; move product 50 from U to N1. The next closest product is 16,
at distance 9.552487; move product 16 from U to N1. The next closest product is 45, at
distance 11.5; move product 45 from U to N1. The next closest product is 22, at distance
11.67262; move product 22 from U to N1. The next closest product is 32, at distance
13.46291; move product 32 from U to N1. Now the number of members of N1 = 8, so
N1 is full. Members of N1 = {4, 20, 1, 50, 16, 45, 32, 22}.
After assigning products to N1, increase the value of x; now we calculate for N2. By
calculating, we find that product 28 and product 34 are the closest pair in the remaining
data set U. The distance between them is 3.162278. Assign them to set N2. Find the
mean value of the (x, y) coordinates of product 28 and product 34: (66.5, 12.5) is the
mean value of 28 and 34. Now find the closest point to 28 and 34 by calculating the
distances from the other points to (66.5, 12.5). Product 12 is the closest to (66.5, 12.5);
the distance is 8.514693. Move product 12 from U to N2. The next closest product is 17,
at distance 8.860023; move product 17 from U to N2. The next closest product is 35, at
distance 9.823441; move product 35 from U to N2. The next closest product is 24, at
distance 13.50926; move product 24 from U to N2. The next closest product is 36, at
distance 14.57738; move product 36 from U to N2. The next closest product is 27, at
distance 16.14001; move product 27 from U to N2. Now the number of members of
N2 = 8, so N2 is full. Members of N2 = {28, 34, 12, 17, 35, 24, 36, 27}.
After assigning products to N1 and N2, increase the value of x; now we calculate for N3.
By calculating, we find that product 3 and product 44 are the closest pair in the
remaining data set U. The distance between them is 3.605551. Assign them to set N3.
Find the mean value of the (x, y) coordinates of product 3 and product 44: (36, 20.5) is
the mean value of 3 and 44. Now find the closest point to 3 and 44 by calculating the
distances from the other points to (36, 20.5). Product 33 is the closest to (36, 20.5); the
distance is 8.13941. Move product 33 from U to N3. The next closest product is 15, at
distance 11.41271; move product 15 from U to N3. The next closest product is 26, at
distance 11.80042; move product 26 from U to N3. The next closest product is 37, at
distance 11.88486; move product 37 from U to N3. The next closest product is 9, at
distance 13.1244; move product 9 from U to N3. The next closest product is 18, at
distance 14.22146; move product 18 from U to N3. Now the number of members of
N3 = 8, so N3 is full. Members of N3 = {3, 44, 33, 15, 26, 37, 9, 18}.
After assigning products to N1, N2 and N3, increase the value of x; now we calculate for
N4. By calculating, we find that product 8 and product 47 are the closest pair in the
remaining data set U. The distance between them is 3.605551. Assign them to set N4.
Find the mean value of the (x, y) coordinates of product 8 and product 47: (14.5, 40) is
the mean value of 8 and 47. Now find the closest point to 8 and 47 by calculating the
distances from the other points to (14.5, 40). Product 48 is the closest to (14.5, 40); the
distance is 6.800735. Move product 48 from U to N4. The next closest product is 40, at
distance 9.17878; move product 40 from U to N4. The next closest product is 2, at
distance 9.656604; move product 2 from U to N4. The next closest product is 7, at
distance 10.96586; move product 7 from U to N4. The next closest product is 25, at
distance 11.88486; move product 25 from U to N4. The next closest product is 19, at
distance 16.34778; move product 19 from U to N4. Now the number of members of
N4 = 8, so N4 is full. Members of N4 = {8, 47, 48, 40, 2, 7, 25, 19}.

Figure 4.3: Elements of U move to N1, N2, N3, ..., Nn.


After assigning products to N1, N2, N3 and N4, increase the value of x; now we calculate
for N5. By calculating, we find that product 23 and product 41 are the closest pair in the
remaining data set U. The distance between them is 3.605551. Assign them to set N5.
Find the mean value of the (x, y) coordinates of product 23 and product 41: (60, 45.5) is
the mean value of 23 and 41. Now find the closest point to 23 and 41 by calculating the
distances from the other points to (60, 45.5). Product 43 is the closest to (60, 45.5); the
distance is 4.609772. Move product 43 from U to N5. The next closest product is 49, at
distance 6.020797; move product 49 from U to N5. The next closest product is 13, at
distance 6.726812; move product 13 from U to N5. The next closest product is 21, at
distance 10.96586; move product 21 from U to N5. The next closest product is 5, at
distance 15.33786; move product 5 from U to N5. The next closest product is 11, at
distance 17.35655; move product 11 from U to N5. Now the number of members of
N5 = 8, so N5 is full. Members of N5 = {23, 41, 43, 49, 13, 21, 5, 11}.
After assigning products to N1, N2, N3, N4 and N5, increase the value of x; now we
calculate for N6. By calculating, we find that product 38 and product 39 are the closest
pair in the remaining data set U. The distance between them is 4.472136. Assign them to
set N6. Find the mean value of the (x, y) coordinates of product 38 and product 39:
(37, 4) is the mean value of 38 and 39. Now find the closest point to 38 and 39 by
calculating the distances from the other points to (37, 4). Product 14 is the closest to
(37, 4); the distance is 20.02498. Move product 14 from U to N6. The next closest
product is 29, at distance 23.76973; move product 29 from U to N6. The next closest
product is 30, at distance 28.17801; move product 30 from U to N6. The next closest
product is 6, at distance 29.83287; move product 6 from U to N6. The next closest
product is 10, at distance 31.14482; move product 10 from U to N6. The next closest
product is 31, at distance 35.77709; move product 31 from U to N6. Now the number of
members of N6 = 8, so N6 is full. Members of N6 = {38, 39, 14, 29, 30, 6, 10, 31}.
After assigning products to N1, N2, N3, N4, N5 and N6, increase the value of x; now we
calculate for N7. By calculating, we find that product 42 and product 46 are the closest
pair in the remaining data set U. The distance between them is 61.40033. Assign them to
set N7. After assigning product 42 and product 46 to N7, the input product set U is
empty. Now we can say that we have completed assigning values to Nx. Members of
N7 = {42, 46}.
Name of N    Product SL
N1           4, 20, 1, 50, 16, 45, 32, 22
N2           28, 34, 12, 17, 35, 24, 36, 27
N3           3, 44, 33, 15, 26, 37, 9, 18
N4           8, 47, 48, 40, 2, 7, 25, 19
N5           23, 41, 43, 49, 13, 21, 5, 11
N6           38, 39, 14, 29, 30, 6, 10, 31
N7           42, 46

Table 4.2: After assigning values to Nx from the input product set U.



4.3.2 Finalized Cluster with Inner Interval Calculation

We have the sets Nx and their members. From those members' position coordinates we
can calculate the center of gravity of each set; those centers of gravity are the initial
centroids. Then we calculate the closest centroid for each data point. For each initial
centroid we get some products; those are the initial clusters Rm.
Members of Nx                     Initial Centroids
4, 20, 1, 50, 16, 45, 32, 22      (25.5, 49.375)
28, 34, 12, 17, 35, 24, 36, 27    (60.75, 14)
3, 44, 33, 15, 26, 37, 9, 18      (37, 19.875)
8, 47, 48, 40, 2, 7, 25, 19       (11, 38.375)
23, 41, 43, 49, 13, 21, 5, 11     (56.25, 44.375)
38, 39, 14, 29, 30, 6, 10, 31     (25.875, 16.375)
42, 46                            (21.5, 40.5)

Table 4.3: Nx with initial centroids.

From the initial centroids we calculate which data point goes with which centroid. For
each product, we calculate which centroid is the nearest; the product goes with that
centroid and is assigned to its associated cluster Rm. Then we calculate the centroid Cm
for each Rm.
. Than calculate the centroid for each .
Cluster    Products in Cluster               Centroids (x, y)
R1         1, 4, 16, 20, 32, 42, 50          (28.42857, 52.28571)
R2         12, 17, 24, 27, 28, 34, 35, 36    (60.75, 14)
R3         3, 6, 10, 15, 18, 33, 39, 44      (41.5, 22.375)
R4         2, 7, 8, 19, 25, 31, 40, 46       (7.375, 34.625)
R5         5, 11, 13, 21, 23, 41, 43, 49     (56.25, 44.375)
R6         9, 14, 29, 30, 37, 38             (21.28571, 11.14286)
R7         22, 26, 45, 47, 48                (28.8, 37.6)

Table 4.4: Clusters Rm and their members with centroids Cm.


When we get the complete centroids, we calculate the largest-distanced data point and
its distance Dm from the centroid Cm. Now we find the data points of each cluster that
lie in the interval (4/9)·Dm to Dm, and we calculate the closest centroids only for these
inner interval products. Because of this interval, the algorithm gets faster: the points
which are close to the centroids remain unchanged, so some products are fixed to their
current cluster. The calculation time for those points is saved, reducing both hardware
cost and time.
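To illustrate the interval test, here is a minimal C++ sketch of steps IX and X for one cluster. It assumes two-dimensional points and Euclidean distance; only the points it returns need to be compared against all centroids in step XI.

#include <algorithm>
#include <cmath>
#include <vector>

struct Point { double x, y; };

static double dist(const Point& a, const Point& b) {
    return std::hypot(a.x - b.x, a.y - b.y);
}

// Step IX: distance of the farthest member from the cluster centroid (Dm).
double farthestDistance(const std::vector<Point>& cluster, const Point& centroid) {
    double d = 0.0;
    for (const Point& p : cluster) d = std::max(d, dist(p, centroid));
    return d;
}

// Step X: collect the members lying in the interval (4/9)*Dm to Dm.
// Points closer to the centroid than (4/9)*Dm are treated as fixed to this cluster.
std::vector<Point> innerIntervalPoints(const std::vector<Point>& cluster, const Point& centroid) {
    const double Dm = farthestDistance(cluster, centroid);
    const double lower = 4.0 * Dm / 9.0;
    std::vector<Point> candidates;
    for (const Point& p : cluster) {
        const double d = dist(p, centroid);
        if (d >= lower && d <= Dm) candidates.push_back(p);   // feasible point: may change cluster
    }
    return candidates;
}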

Cluster    Centroids (x, y)         Interval                 Inner interval points
R1         (28.42857, 52.28571)     7.189026 to 16.17531     16, 32, 42
R2         (60.75, 14)              6.246975 to 14.05569     24, 27, 28, 35, 36
R3         (41.5, 22.375)           6.853178 to 15.41965     6, 10, 18, 33, 39
R4         (7.375, 34.625)          8.591016 to 19.32979     2, 7, 19, 31
R5         (56.25, 44.375)          7.103512 to 15.9829      5, 11, 21
R6         (21.28571, 11.14286)     7.814595 to 17.58284     9, 14, 29, 30, 38
R7         (28.8, 37.6)             5.734884 to 12.90349     26

Table 4.5: Clusters and intervals with the inner interval points.

Here we get 26 products in the interval, while the other 24 products stay prefixed to their own current cluster. We therefore worked on 52% of the data points; 48% of the data points do not need any calculation, so 48% of the computation is removed in this step. The inner-interval products are assigned to the cluster associated with their closest centroid, and by assigning them we obtain new clusters and new centroids.

Centroids (x, y)        Cluster members           Interval                Inner Interval points
(27.5, 45.85714)        1,4,20,32,42,50           6.52518 to 14.68166     1,32,42
(60.75, 14)             12,17,24,34,35,36         6.246975 to 14.05569    24,27,28,35,36
(41.71429, 20.57143)    3,6,15,18,33,39,44        7.442164 to 16.74487    6,15,18,39
(7.375, 34.625)         2,7,8,19,25,31,40,46      6.58515 to 14.81659     2,7,8,19,31,46
(56.25, 44.375)         5,11,13,21,23,41,43,49    7.103512 to 15.9829     5,11,21
(21.28571, 11.14286)    9,14,29,30,37,38          7.699313 to 17.32345    9,30,38
(31.14286, 38.28571)    10,16,22,26,45,47,48      5.886163 to 13.24387    10,16,22,26,47,48

Table 4.6: Calculation result of one iteration.



4.3.3 Obtain Final Cluster:

In Section 4.3.2 (Finalized Cluster with Inner Interval Calculation) we presented one complete iteration of the process. This process continues until we obtain the final clusters: the steps of 4.3.2 are repeated as long as the centroids keep changing, and when the centroids are stable and no cluster changes any more, the process is complete.
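The stopping rule can be sketched in C++ as below; runUntilStable and onePass are illustrative names of ours, where onePass stands for the assignment-plus-inner-interval step of Section 4.3.2 and is assumed to update the labels and the centroids in place:

    #include <cmath>
    #include <vector>

    struct Point { double x, y; };

    // Repeat the 4.3.2 pass until no centroid moves any more.
    void runUntilStable(const std::vector<Point>& products,
                        std::vector<Point>& centroids,
                        std::vector<int>& labels,
                        void (*onePass)(const std::vector<Point>&,
                                        std::vector<Point>&, std::vector<int>&)) {
        const double eps = 1e-9;  // tolerance for "no change"
        bool changed = true;
        while (changed) {
            std::vector<Point> before = centroids;
            onePass(products, centroids, labels);
            changed = false;
            for (std::size_t c = 0; c < centroids.size(); ++c)
                if (std::fabs(centroids[c].x - before[c].x) > eps ||
                    std::fabs(centroids[c].y - before[c].y) > eps)
                    changed = true;
        }
    }
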

In this state every product is in its proper cluster and we get the final result, shown in Table 4.7 below.

Centroids (x, y)        Cluster members
(27.22222, 49.88889)    1,4,16,20,22,32,42,45,50
(60.75, 14)             12,17,24,27,28,34,35,36
(41.33333, 18.5)        3,15,18,33,39,44
(8.333333, 35.33333)    2,7,8,19,25,31,40,46,47
(56.25, 44.375)         5,11,13,21,23,41,43,49
(21.28571, 11.14286)    9,14,29,30,37,38
(33.25, 33.5)           6,10,26,48

Table 4.7: Final clusters and their associated members.

Figure 4.4: Final clusters with thresholds.

CHAPTER 5
Experimental Results
This chapter discusses the results generated by running the accelerated algorithms described in the Proposed Methodology (Chapter 4). The experiments compare the time consumption of the following algorithms: the Standard K-means Algorithm, the Improved k-means Clustering Algorithm by Nazeer and Sebastian [46], the Optimized Version of the K-Means Clustering Algorithm by Poteras, Mihaescu and Mocanu [47], and our Modified K-Means Algorithm. These algorithms were chosen for comparison because the native k-means algorithm is the source we modified, the Improved k-means Clustering Algorithm by Nazeer and Sebastian is a centroid-based algorithm and our algorithm also works on centroids, and the Optimized Version of the K-Means Clustering Algorithm by Poteras, Mihaescu and Mocanu is a recent IEEE paper focused on a border list, a concept that has some similarity to our feasible data-point interval.

5.1 Environmental Setup and Data Input:


We used the C++ language to implement our algorithm. Experiments were conducted on a machine running the Windows 8.1 operating system with an Intel i7-4700MQ CPU and 8 GB of RAM.
The four algorithms were trained simultaneously on the same data sets. The Standard K-means Algorithm, the Improved k-means Clustering Algorithm by Nazeer and Sebastian, and the Optimized Version of the K-Means Clustering Algorithm by Poteras, Mihaescu and Mocanu [47] need the value of k, whereas our algorithm can obtain the value of k by itself. The Standard K-means Algorithm and the Optimized Version of the K-Means Clustering Algorithm by Poteras and Mihaescu also need the initial centroid information, while our algorithm and the Improved k-means Clustering Algorithm by Nazeer and Sebastian are capable of finding the initial centroids.

Algorithm                                               Data points      Number of cluster          Initial Cluster
Standard K-means Algorithm                              Input by user    Input by user              Input by user or random
Improve k-means Clustering Algorithm                    Input by user    Input by user              Calculated by algorithm
Optimized Version of the K-Means Clustering Algorithm   Input by user    Input by user              Input by user or random
Our Algorithm                                           Input by user    Calculated by algorithm    Calculated by algorithm

Table 5.1: Training the four algorithms for testing.

We used several datasets to test the algorithms; from 500 up to 5,00,000 data points were input for testing.

5.2 Algorithm Testing:
At first we tested all the algorithms with 500 data points. We assigned the number of clusters for those algorithms; for 500 data points we assigned the value of K and obtained the results.

Algorithm        K     Time: ms    K     Time: ms
Standard         8     257         16    374
Improve          8     258         16    330
Optimized        8     238         16    290
Our Algorithm    --    256         --    256

Table 5.2: Results for 500 data points with different values of k.


From Table 5.2 we observed that our algorithm is not the best in every environment. Next we tried the same algorithms on the same 500 data points, but this time we assigned the appropriate value of f.

Algorithm        K     Time: ms    K     Time: ms
Standard         8     257         16    374
Improve          8     258         16    330
Optimized        8     238         16    310
Our Algorithm    --    229         --    288

Table 5.3: Results for 500 data points with different values of k and f.

Our algorithm's performance is much better on larger data sets. Now observe Table 5.4 for the larger datasets.



Algorithm        5,000 data points (sec)    10,000 data points (sec)    50,000 data points (sec)    1,00,000 data points (sec)
Standard         50.38                      157.44                      656.57                      1137.42
Improve          39.29                      111.78                      485.53                      818.94
Optimized        35.77                      100.76                      446.68                      737.05
Our Algorithm    35.36                      92.88                       405.80                      667.67

Table 5.4: Results for different numbers of data points.

By observing Table 5.4 we can say that, for large amounts of data, our algorithm's performance is much better than the other modifications of k-means and the original k-means algorithm.

5.3 Decision:
After testing our algorithm in the lab, we obtained different results when working on different numbers of data points. The testing outcomes help us make the right decision when choosing an algorithm, and the performance analysis guides further research work. In Figure 5.1 we can see the performance of our algorithm compared with the other modified algorithms and the standard k-means algorithm.
Figure 5.1: Process time of the four algorithms (processing time on the y-axis versus the number of data points, 500 to 1,00,000, on the x-axis; series: Standard, Improve, Optimized, Our Algorithm).



A close inspection discloses that, at the very beginning, the four algorithms have very similar running times, but when working at a large scale the differences between their process times become clearly distinguishable. For 500 data points the optimized algorithm is the fastest, the standard algorithm is in second position, our algorithm holds third position, and the improved algorithm comes last. But when we enter large-scale data, our algorithm holds first position, the optimized algorithm takes second position, the improved algorithm holds third place, and the standard k-means algorithm is in last position.

After analyzing the experimental results, we can conclude that the performance of our algorithm on large-scale data clustering is much better than the standard k-means algorithm, and it is also more effective than the modern modifications of the k-means clustering algorithm.
CHAPTER 6
Conclusion and Further Research
In this thesis we have studied improvements to k-means clustering. The k-means clustering algorithm is simple, but it is time consuming. There have been various algorithms that aim to improve the k-means algorithm and to work around its limitations. The focus of this thesis is mainly on algorithms that decrease the time consumption of the k-means algorithm; the number of clusters, another major limitation of k-means, is also addressed in this work. The initialization of the cluster centers plays an important role in the convergence of the algorithm, and incorrect initialization leads to incorrect results. Earlier research has been directed at solving this issue and a large amount of results is available in the literature. All of the algorithms compared in our study are sensitive to the initialization of the centers. Our main aim was to minimize calculation by presenting a method that can prefix some data to a cluster before calculation; this is the key to saving both time and processing. Usually the k-means algorithm is run with different initializations and the one with the minimum mean squared distance is selected.

In 50 years k-means has been updated a lot, and it still offers many areas for research and modification. We could not fix a method for choosing the value of f; in our current modification the value of f remains in an awkward position. By the interval we estimate the feasible data points, but the interval itself could be refined. Those two areas are open for further research.
References:

[1] Anil K. Jain with Radha Chitta and Rong Jin. Clustering Big Data. Department of Computer Science, Michigan State University. URL: http://www.cse.nd.edu/Fu_Prize_Seminars/jain/slides.pdf
[2] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, SOSP '03, pages 29-43, New York, NY, USA, October 2003. ACM. ISBN 1-58113-757-5. doi: 10.1145/945445.945450
[3] Jeffrey Dean and Sanjay Ghemawat. MapReduce: A Flexible Data Processing Tool. Communications of the ACM, 53(1):72-77, January 2010. ISSN 0001-0782. doi: 10.1145/1629175.1629198
[4] Hadoop 1.2.1 Documentation. Online documentation, Apache Software Foundation, 2013. URL: http://hadoop.apache.org/docs/r1.2.1/index.html
[5] James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, and Angela Hung Byers. Big data: The next frontier for innovation, competition, and productivity. Analyst report, McKinsey Global Institute. URL: http://www.mckinsey.com/insights/mgi/research/technology_and_innovation/big_data_the_next_frontier_for_innovation
[6] John Gantz and David Reinsel. THE DIGITAL UNIVERSE IN 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East. Study report, IDC, December 2012. URL: www.emc.com/leadership/digital-universe/index.htm
[7] Markus Maier. Towards a Big Data Reference Architecture. Eindhoven University of Technology, Department of Mathematics and Computer Science. URL: http://www.win.tue.nl/~gfletche/Maier_MSc_thesis.pdf
[8] Andrew McAfee and Erik Brynjolfsson. Big Data: The Management Revolution. Harvard Business Review.
[9] Viktor Mayer-Schönberger and Kenneth Cukier. Big Data - A Revolution That Will Transform How We Live, Work and Think. John Murray (Publishers).
[10] T. Kraska. Finding the Needle in the Big Data Systems Haystack. IEEE Internet Computing, 17(1):84-86, 2013. ISSN 1089-7801. doi: 10.1109/MIC.2013.10
[11] Michael Stonebraker, Sam Madden, and Pradeep Dubey. Intel Big Data Science and Technology Center Vision and Execution Plan. SIGMOD Record, 42(1):44-49.
[12] Linked Data - Connect Distributed Data across the Web. Website, Linked Open Data Project. URL: http://linkeddata.org/ Accessed: 04-02-2015.
[13] Sunil Soares. Big Data Governance - An Emerging Imperative. MC Press Online, LLC, 1st edition.
[14] Camille Mendler. M2M and big data. Website, The Economist: Intelligence Unit, 2013. URL: http://digitalresearch.eiu.com/m2m/from-sap/m2m-and-big-data
[15] Jeff Jonas. There Is No Such Thing As A Single Version of Truth. URL: http://jeffjonas.typepad.com/jeff_jonas/2006/03/there_is_no_suc.html
[16] Pat Helland. If You Have Too Much Data, then Good Enough Is Good Enough. Communications of the ACM, 54(6):40-47. doi: 10.1145/1953122.1953140
[17] Viktor Mayer-Schönberger and Kenneth Cukier. Big Data - A Revolution That Will Transform How We Live, Work and Think.
[18] Amanpreet Kaur Toor and Amarpreet Singh, Amritsar College of Engineering & Technology, Punjab, India. An Advanced Clustering Algorithm (ACA) for Clustering Large Data Set to Achieve High Dimensionality. Computer Science Systems Biology, Toor and Singh, J Comput Sci Syst Biol 2014, 7:4. URL: http://dx.doi.org/10.4172/jcsb.1000146
[19] Anil K. Jain and Richard C. Dubes, Michigan State University. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, New Jersey 07632. ISBN: 0-13-0222278-X
[20] Anil K. Jain (Michigan State University), M.N. Murty (Indian Institute of Science), and Flynn (The Ohio State University). Data Clustering: A Review. ACM Computing Surveys, Vol. 31, No. 3, September 1999.
[21] Lior Rokach and Oded Maimon, Department of Industrial Engineering, Tel-Aviv University. Clustering Methods. Data Mining and Knowledge Discovery Handbook, Chapter 15, pages 321-352.
[22] A. Fahad, N. Alshatri, Z. Tari, A. Alamri, I. Khalil, A. Zomaya, S. Foufou, and A. Bouras. A Survey of Clustering Algorithms for Big Data: Taxonomy & Empirical Analysis. IEEE Transactions on Emerging Topics in Computing.
[23] Gantz, J. and Reinsel, D. Extracting Value from Chaos. IDC IVIEW, June 2011. URL: http://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf
[24] Bloomberg, Jason. The Big Data Long Tail. Blog post, January 17, 2013. URL: http://www.devx.com/blog/the-big-data-long-tail.html
[25] Hey, T., Tansley, S. and Tolle, K. (editors). The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Corporation, October 2009. ISBN 978-0-9825442-0-4.
[26] Svetlana Sicular, Gartner, Inc. Gartner's Big Data Definition Consists of Three Parts, Not to Be Confused with Three "V"s. 27 March 2013. URL: http://www.forbes.com/sites/gartnergroup/2013/03/27/gartners-big-datadefinition-consists-of-three-parts-not-to-be-confused-with-three-vs/
[27] Demchenko, Y., Membrey, P., Grosso, P., and de Laat, C. Addressing Big Data Issues in Scientific Data Infrastructure. First International Symposium on Big Data and Data Analytics in Collaboration (BDDAC 2013), part of the 2013 Int. Conf. on Collaboration Technologies and Systems (CTS 2013), May 20-24, 2013, San Diego, California, USA.
[28] Kaufman, L., and Rousseeuw, P. J. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, Inc., New York, NY, 1990.
[29] Ng, R. T., and Han, J. CLARANS: A method for clustering objects for spatial data mining. IEEE Transactions on Knowledge and Data Engineering (TKDE), 14(5):1003-1016, 2002.
[30] Bezdek, J. C., Ehrlich, R., and Full, W. FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences, 10(2):191-203, 1984.
[31] Zhang, T., Ramakrishnan, R., and Livny, M. BIRCH: an efficient data clustering method for very large databases. ACM SIGMOD Record, volume 25, pp. 103-114, 1996.
[32] Hinneburg, A., and Keim, D. A. An efficient approach to clustering in large multimedia databases with noise. Proc. of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pp. 58-65, 1998.
[33] Hinneburg, A., and Keim, D. A. Optimal Grid-Clustering: Towards breaking the curse of dimensionality in high-dimensional clustering. In Proceedings of the 25th Conference on VLDB, pp. 506-517, 1999.
[34] Rui X, Wunsch DC II (2009) Clustering. IEEE Press series on computational intelligence, John Wiley & Sons.
[35] Forgy E (1965) Cluster analysis of multivariate data: efficiency vs. interpretability of classifications. Biometrics, 21: pp 768-780.
[36] Bradley P, Fayyad U (1998) Refining initial points for K-means clustering. International conference on machine learning (ICML-98), pp 91-99.
[37] Krishna K, Murty M (1999) Genetic K-Means algorithm. IEEE Transactions on systems, man, and cybernetics - part B: Cybernetics, 29(3): pp 433-439.
[38] Likas A, Vlassis N, Verbeek J (2003) The global K-means clustering algorithm. Pattern recognition, 36(2), pp 451-461.
[39] Pena JM, Lozano JA, Larranaga P (1999) An empirical comparison of four initialization methods for the K-means algorithm. Pattern recognition letters 20: pp 1027-1040.
[40] Ball G, Hall D (1967) A clustering technique for summarizing multivariate data. Behavioral science, 12: pp 153-155.
[41] Milligan G, Cooper M (1985) An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50: pp 150-179.
[42] SAS Institute Inc., SAS technical report A-108 (1983) Cubic clustering criterion. Cary, NC: SAS Institute Inc., 56 pp.
[43] http://www.internetlivestats.com/google-search-statistics/
[44] http://www.internetlivestats.com/twitter-statistics/
[45] https://www.youtube.com/yt/press/statistics.html
[46] K. A. Abdul Nazeer and M. P. Sebastian. Improving the Accuracy and Efficiency of the k-means Clustering Algorithm. WCE 2009, July 1-3, 2009, London, U.K.
[47] Cosmin Marian Poteras, Marian Cristian Mihaescu, and Mihai Mocanu. An Optimized Version of the K-Means Clustering Algorithm. University of Craiova. Proceedings of the 2014 Federated Conference on Computer Science and Information Systems, pp. 695-699.