INTRODUCTION
Due to the rapid growth and innovation in communications and computer technologies, smart
cities are the subject of ongoing research in industry and academia. The final goal is to
provide numerous services such as real-time traffic monitoring, healthcare assistance,
security, and safety. It should be noted that smart buildings with many Internet-enabled
devices can be controlled from remote locations and communicate with each other, thus
becoming parts of smart cities. The Internet of Things (IoT), as a collection of smart
appliances sharing data across the globe, introduces a vision of future smart cities in which
users, computing systems and everyday objects [2] [1] cooperate with economic benefits.
One of the best options for processing the large collections of data produced by different
buildings is cloud computing. The ability to share resources, services, responsibility and
management among cloud providers is the fundamental assumption from the viewpoint of
cloud interoperability.
1.1 RESEARCH PROBLEM
In the previous section, it was discussed that big data introduces new security and
privacy issues. For the healthcare sector these issues are amplified, because healthcare data
are considered privacy-sensitive and traditional security and privacy methods to protect them
seem insufficient or even obsolete. This is a problem for patients, as personal information can
unwillingly be derived from these health information systems and end up in the wrong hands.
Besides the fact that individuals have certain rights against intrusion into their personal
information, in the wrong hands that information can potentially harm them.
On the other hand, weak security and privacy methods can hinder the adoption of big
data in healthcare. There can be public resistance from individuals or government against the
use of big data in healthcare when there is no trust in the protection of personal information.
Hindering the adoption of big data in healthcare could also forfeit the potential benefits it
could bring, such as improved quality of care. The owners of the problem are therefore the
hospitals and other healthcare organizations that could benefit from adopting big data. These
organizations have to deal with hurdles such as privacy legislation and the public perception
of privacy before they can successfully adopt big data.
The objective is to select the optimal cloud server for a mobile VM while minimizing
the total number of VM migrations and reducing task-execution time. A Honey Bee
Optimization (HBO) algorithm is used to identify the optimal target cloudlet.
OBJECTIVES
• Predictive accuracy
Fig. 1.2. Big Data Mining Platform
1.2.2 THESIS CONTRIBUTION
Seamless access to smart healthcare services requires resource migration, in the form
of VM migration during the offloading process, to ensure QoS for the user. This thesis
proposes a joint VM migration technique based on the Honey Bee Optimization (HBO)
algorithm in which user mobility is also considered.
Every generation introduces new data types, which require new capabilities to deal
with them. The first generation of BI&A applications and research focused mostly on
structured data collected by companies through legacy systems and stored in relational
database management systems (RDBMS).
Analytical techniques used in this generation of BI&A are rooted in statistical
methods and data mining techniques developed in the 1970s and 1980s, respectively.
The second generation of BI&A is a result of the development of the internet; it
encompasses analysis of web-based unstructured content.
The third generation of BI&A is emerging as a result of smartphones, tablets and
other sensor-based information sources, and includes location-based, person-centered
and context-relevant analysis.
Unlike RDBMS and NoSQL, Hadoop does not refer to a type of database but to a
software platform that allows for massively parallel computing. Hadoop is an open-source
software framework consisting of several software modules targeted at processing big data:
large volumes and a high variety of data. The core modules of the Hadoop ecosystem are the
Hadoop Distributed File System (HDFS) and Hadoop MapReduce. Below we describe the
most popular modules of the Hadoop framework.
HDFS is the software module that arranges storage in a Hadoop big data ecosystem.
HDFS breaks data into pieces and distributes these pieces over multiple nodes of physical
storage in the system. Its main advantages are that it is designed to be scalable and
fault-tolerant. Additionally, by dividing data into pieces, HDFS prepares the data for parallel
processing. Other modules in the Hadoop framework are designed to take advantage of data
distributed over multiple nodes.
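The splitting and replication described above can be sketched in a few lines. This is a toy model, not HDFS itself: real HDFS uses a 128 MB default block size, a replication factor of 3, and rack-aware placement rather than simple round-robin.

```python
# Sketch of how an HDFS-style file system splits a file into fixed-size
# blocks and spreads replicas over data nodes. Block size and placement
# policy are illustrative only.

def split_into_blocks(data: bytes, block_size: int):
    """Cut the byte stream into consecutive fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, nodes, replication=3):
    """Round-robin placement of each block's replicas on distinct nodes."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"x" * 1000, block_size=256)   # 4 blocks
layout = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
```

Because every block lives on several nodes, losing one node costs no data, and independent nodes can read different blocks in parallel, which is exactly what MapReduce exploits.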
MAP REDUCE is a software framework that provides a programming model that takes
full advantage of parallel processing. Tasks programmed in MapReduce are divided into
smaller tasks, which are sent to the relevant nodes in the system. The MapReduce framework
takes care of the whole process: managing communication between nodes, running tasks in
parallel and providing redundancy and fault tolerance.
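The programming model above is easiest to see on the classic word-count example: a map phase emits (key, value) pairs, a shuffle groups pairs by key, and a reduce phase aggregates each group. The sketch below runs sequentially; a real Hadoop job distributes the same three steps over many nodes.

```python
# Minimal word-count in the MapReduce style.
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in the input line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Group emitted pairs by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Aggregate each key's values; for word count, just sum them."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data on the cloud", "big data mining"]
pairs = [p for line in lines for p in map_phase(line)]
counts = reduce_phase(shuffle(pairs))   # {'big': 2, 'data': 2, ...}
```

Because the map calls are independent per line and the reduce calls are independent per key, both phases parallelize naturally across the nodes holding the HDFS blocks.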
HBASE is a software module that runs as a non-relational database on top of HDFS. HBase is
a NoSQL database that stores data according to a key-value model. As a NoSQL type of
database, it requires low-level programming to query. Like other software modules of Hadoop,
HBase is open source and is modeled after Google's BigTable database.
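The key-value access pattern this implies looks roughly like the toy store below: values are addressed by (row key, column) and reads are low-level get/put calls rather than SQL. This is purely illustrative; real HBase additionally groups columns into column families and versions every cell by timestamp.

```python
# Toy illustration of the key-value access pattern a NoSQL store like
# HBase exposes. Row keys and columns here are made-up examples.

class ToyKeyValueStore:
    def __init__(self):
        self._rows = {}

    def put(self, row_key, column, value):
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column):
        """Return the stored value, or None if the cell does not exist."""
        return self._rows.get(row_key, {}).get(column)

store = ToyKeyValueStore()
store.put("patient:42", "vitals:heart_rate", 72)
store.put("patient:42", "info:name", "J. Doe")
rate = store.get("patient:42", "vitals:heart_rate")   # 72
```

Note there is no query planner: anything beyond a lookup by key (a filter, a join) must be written by hand, which is the "low-level programming" referred to above.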
HIVE is essentially a data warehouse that runs on top of HDFS. Hive structures data into
concepts like tables, columns, rows and partitions, similar to a relational database. Data in a
Hive database can be queried using a (limited) SQL-like language named HiveQL.
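Because HiveQL closely resembles SQL, the flavour of a Hive query can be shown with Python's built-in sqlite3 as a stand-in engine; the table and columns below are hypothetical, and a real Hive deployment would compile the same query into MapReduce jobs over HDFS.

```python
# sqlite3 used as a stand-in to show the kind of query HiveQL supports.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (city TEXT, sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [("Oslo", "traffic", 120.0), ("Oslo", "traffic", 80.0),
     ("Bergen", "traffic", 60.0)],
)
# The equivalent HiveQL would read almost identically:
#   SELECT city, AVG(value) FROM readings GROUP BY city;
rows = conn.execute(
    "SELECT city, AVG(value) FROM readings GROUP BY city ORDER BY city"
).fetchall()
# rows == [('Bergen', 60.0), ('Oslo', 100.0)]
```

The point of Hive is exactly this familiarity: analysts can express aggregations declaratively instead of hand-writing MapReduce code as in the HBase case.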
The phrase "big data" is often used in enterprise settings to describe large amounts of
data. It does not refer to a specific amount of data, but rather describes a dataset that cannot
be stored or processed using traditional database software.
Examples of big data include the Google search index, the database of Facebook user
profiles, and Amazon.com's product list. These collections of data (or "datasets") are so large
that the data cannot be stored in a typical database or even a single computer [2] [4]. Instead,
the data must be stored and processed using a highly scalable database management
system. Big data is often distributed across multiple storage devices, sometimes in several
different locations.
Many traditional database management systems have limits to how much data they
can store. For example, an Access 2010 database can only contain two gigabytes of data,
which makes it infeasible to store several petabytes or exabytes of data. Even if a DBMS can
store large amounts of data, it may operate inefficiently if too many tables or records are
created, which can lead to slow performance. Big data solutions solve these problems by
providing highly responsive and scalable storage systems.
There are several different types of big data software solutions, including data storage
platforms and data analytics programs. Some of the most common big data software products
include Apache Hadoop, IBM’s Big Data Platform, Oracle NoSQL Database, Microsoft
HDInsight, and EMC Pivotal One.
1.3.1 OVERVIEW OF RECENT BIG DATA STUDIES
Since the publication of the benchmark report on big data by the McKinsey Global
Institute in June 2011, a plethora of reports has been published that seek to define the term
'big data', establish potential user benefits, and forecast future uptake within the business
community. In view of this large volume of readily available supporting research [2] [1], we
have elected not to go into great depth about the benefits and pitfalls of big data adoption,
taking it as a well-identified emerging trend with well-recognized potential for business
creation and development. It was thought pertinent, however, to provide a brief overview of
some of the generic findings arising from research [2] [1] in this field and to highlight some
important caveats that have tended to be overlooked by many of those reporting on big data
developments in the media and elsewhere.
Fig. 1.3. The Big Data Stack Divided Into Three Different Layers
The challenges in big data projects typically include a combination of the following issues:
1.3.3 BIG DATA MANAGEMENT TECHNOLOGIES
Given the high volume, velocity and variety of big data, the traditional Data
Warehouse (DWH) and Business Intelligence (BI) architectures already existing in
companies need to be enhanced to meet the new requirements of storing and processing big
data. To optimize the performance of the big data analytics pipeline, it is important to select
the appropriate big data technology for the given requirements. This section contains an
overview of the various big data technologies and gives recommendations on when to use
them.
DISADVANTAGES
Big risks to security and privacy.
Challenges arise: it is expensive, and much must be spent to get it working.
A lot of analysis is needed: uncovering patterns, applying algorithms, finding connections and relationships.
Analysts still need specialization; the right skill set is hard to find.
1.3.5 BIG DATA AND CLOUD COMPUTING
Big Data is an umbrella term that encompasses all sorts of data that exist today.
From hospital records and digital data to the overwhelming amount of archived government
paperwork, there is more of it than we officially know.
Big Data cannot be captured in a single definition or description, because we are still
working on it. The great thing about information technology is that it has always been
available to technology companies, businesses and all types of institutions.
It was the emergence of cloud computing which made it easier to provide the best of
technology in the most cost-effective packages. Cloud computing not only reduced costs, but
also made a wide array of applications available to the smaller companies.
Just as the cloud is growing steadily, we are also noticing an explosion of information
across the web. Social media is a completely different world, where both marketers and
common users generate loads of data every day.
Organizations and institutions are also creating data on a daily basis, which can
eventually become difficult to manage. Take a look at these statistics on Big Data generation
over the last five years:
2.5 quintillion bytes (about 2.5 billion gigabytes) of data are created every day.
Most companies in the US have at least 100 terabytes (100,000 gigabytes) of stored
data.
Cloud computing and big data seem an ideal combination. Together, they provide a
solution that is both scalable and accommodating for big data and business analytics. The
analytics advantage is a huge benefit in today's world: imagine all the information resources
that will become easily accessible. Every field of life can benefit from this information. Let's
look at these advantages in detail:
AGILITY
AFFORDABILITY
Cloud computing is a blessing for a company that wishes to have up-to-date
technology on a budget. Companies can pick what they want and pay for it as they go. The
resources required to manage Big Data are easily available and do not cost big bucks.
Before the cloud, companies used to invest huge sums of money in setting up IT departments
and then paid more to keep that hardware updated. Now companies can host their Big Data
on off-site servers, or pay only for the storage space and power they use every hour.
DATA PROCESSING
The explosion of data leads to the issue of processing it. Social media alone generates
loads of unstructured, chaotic data such as tweets, posts, photos, videos and blogs, which
cannot be processed under a single category. With big data analytics platforms like Apache
Hadoop, both structured and unstructured data can be processed. Cloud computing makes the
whole process easier and more accessible to small, medium and large enterprises.
FEASIBILITY
While traditional solutions would require the addition of more physical servers to the
cluster in order to increase processing power and storage space, the virtual nature of the cloud
allows for seemingly unlimited resources on demand. With the cloud, enterprises can scale up
or down to the desired level of processing power and storage space easily and quickly. Big
Data analytics require new processing requirements for large data sets. The demand for
processing this data can raise or fall at any time of the year, and cloud environment is the
perfect platform to fulfill this task. There is no need for additional infrastructure, since the
cloud can provide most solutions through SaaS models.
Just as Big Data has provided organizations with terabytes of data, it has also
presented the issue of managing this data under a traditional framework. Analyzing these
large volumes of data to extract only the most useful bits often becomes a difficult task as
well. Even in the high-speed connectivity era, moving large data sets and providing the
details needed to access them is a problem. These large data sets often carry sensitive
information such as credit/debit card numbers, addresses and other details, raising data
security concerns.
Security issues in the cloud are a major concern for businesses and cloud providers
today. It seems the attackers are relentless, and they keep inventing new ways to find entry
points into a system. Other issues include ransomware, which deeply affects a company's
reputation and resources, denial-of-service attacks, phishing attacks and cloud abuse.
Globally, 40% of businesses experienced a ransomware incident during the past year.
Both clients and cloud providers carry their own share of risk when entering an
agreement on cloud solutions. Insecure interfaces and weak APIs can give away valuable
information to hackers, who can misuse it for the wrong reasons. Some cloud models are still
in the deployment stage, and basic DBMSs are not tailored for cloud computing. Data
protection acts are also a serious issue, as they may require data centers to be closer to the
user than to the provider. Data replication must be done in a way that leaves zero room for
error; otherwise it can affect the analysis stage. It is crucial to make the searching, sharing,
storage, transfer, analysis and visualization of this data as smooth as possible.
1.4 OVERVIEW OF MOBILE CLOUD
Here we assume a three-tier Mobile Cloud Computing (MCC) environment, where a
set of M access points (APs) comprises the backbone network. Tier one represents the master
cloud, which consists of several public cloud providers, such as Google App Engine,
Microsoft Azure and Amazon EC2. A set of high-speed interconnected cloudlets constitutes
tier two, the backbone layer of the mobile cloud architecture. Smartphones, wearable devices
and other mobile devices constitute tier three, the user layer. Users access the nearest cloud
resources using devices from tier three. A set of cloudlets is controlled and monitored by the
master cloud (MC). All cloudlets route their hypervisor information to the master cloud, and
they are connected to the MC with a high-speed network connection.
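In this three-tier model, a tier-three device picks the "nearest" cloudlet; one simple and common proxy for nearness is measured network latency. The sketch below is illustrative only, with made-up cloudlet names and latency values.

```python
# Sketch of a tier-three device choosing the nearest cloudlet, where
# "nearest" is approximated by measured round-trip latency in ms.
# Names and values are hypothetical.

def nearest_cloudlet(latencies_ms):
    """Return the cloudlet identifier with the lowest measured latency."""
    return min(latencies_ms, key=latencies_ms.get)

measured = {"cloudlet-1": 18.5, "cloudlet-2": 7.2, "cloudlet-3": 25.0}
target = nearest_cloudlet(measured)   # 'cloudlet-2'
```

When the user moves, these latencies change, which is precisely what makes VM migration between cloudlets necessary in the later chapters.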
1.5 MOBILE CLOUD ARCHITECTURE
Each of the considered smart city applications (energy, mobility, healthcare, disaster
recovery) can be defined through the services provided to citizens, with requirements in
terms of:
1.5.1 LATENCY
The amount of time that elapses between an event happening and the event being
acquired by the system.
1.5.3 THROUGHPUT
The amount of bandwidth required by a specific application to be reliably executed in
the smart city environment.
1.5.6 STORAGE
The amount of storage space required for storing the sensed data and/or the
processing application.
1.5.7 USERS
The number of users needed to achieve reliable service.
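The four requirement dimensions above can be captured in a simple per-application record, which makes comparisons between applications mechanical. The classes and threshold values below are illustrative, not taken from the thesis.

```python
# One record per smart city application, covering the four requirement
# dimensions defined above. All numbers are hypothetical examples.
from dataclasses import dataclass

@dataclass
class AppRequirements:
    latency_ms: float       # max tolerated event-to-system delay
    throughput_mbps: float  # bandwidth needed for reliable execution
    storage_gb: float       # space for sensed data and processing code
    users: int              # users needed to achieve a reliable service

healthcare = AppRequirements(latency_ms=50, throughput_mbps=2.0,
                             storage_gb=500, users=1000)
mobility = AppRequirements(latency_ms=200, throughput_mbps=10.0,
                           storage_gb=2000, users=50000)

# Healthcare tolerates far less delay than mobility monitoring.
stricter_latency = healthcare.latency_ms < mobility.latency_ms   # True
```

Such records are what a placement or migration algorithm would consume when matching applications to cloudlet resources.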
1.6 CHALLENGES OF PRESENT CLOUD
1.7 OBJECTIVES
The Mobile Cloud Computing (MCC) term was introduced after the concept of cloud
computing. Basically, MCC refers to an infrastructure where both the data storage and the
data processing happen outside the mobile device. Under this definition, mobile applications
move computation power and storage from mobile phones to the cloud. MCC can be thought
of as a combination of cloud computing and the mobile environment. The cloud can be used
for power and storage, as mobile devices lack powerful resources compared to traditional
computation devices.
1.9 MOBILE HEALTHCARE
There are many reasons to use cloud computing with mobile applications. MCC
provides solutions to the obstacles that mobile subscribers usually face. These advantages
are:
Battery life is one of the main concerns in the mobile environment. There are already
several solutions for extending battery life by enhancing CPU performance and using the
disk and screen efficiently to reduce power consumption.
But these solutions generally require changes to the mobile device's structure or new
hardware, which increases cost. Computation or data offloading techniques have been
suggested to migrate large and complex computations from resource-limited devices such as
mobile phones to powerful machines such as cloud servers. This avoids long application
execution times on mobile devices, which result in large amounts of power and/or read-write
time consumption [4]. Many evaluations show the effectiveness of these techniques.
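The offloading trade-off described above is often captured by a simple back-of-the-envelope model (a common textbook formulation, not taken from this thesis): offloading pays off when remote execution time plus data transfer time beats local execution time.

```python
# Common offloading rule of thumb: offload a task of `cycles`
# instructions with `data_bytes` of input when cloud execution plus
# transfer time beats local execution. All parameters are illustrative.

def should_offload(cycles, data_bytes, local_ips, cloud_ips, bandwidth_bps):
    local_time = cycles / local_ips
    remote_time = cycles / cloud_ips + data_bytes / bandwidth_bps
    return remote_time < local_time

# 10^9 instructions, 1 MB input, a 10x faster cloud, 1 MB/s uplink:
decision = should_offload(1e9, 1e6, local_ips=1e8, cloud_ips=1e9,
                          bandwidth_bps=1e6)   # True: 2 s remote vs 10 s local
```

The same formula shows why small tasks with large inputs stay local: the transfer term dominates, so offloading would waste both time and transmission energy.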
Another obstacle is the storage capacity of mobile devices, which is generally limited.
To overcome this problem, MCC can be used to access, query or store large data on the cloud
through wireless networks. Widely used examples include Amazon Simple Storage Service
(Amazon S3), which provides file storage on the cloud. In addition, MCC reduces the time
and energy consumption of compute-intensive applications, which is especially applicable to
limited-resource devices.
1.11.3. RELIABILITY
With the help of the CC paradigm, reliability can be improved, since data and
applications are stored and backed up on a number of computers in the cloud. This reduces
the chance of data being lost on mobile devices. In addition, copyrighting digital content and
preventing illegal distribution of music and video become more practical in this model.
Security services such as virus detection can also be provided and used efficiently without
affecting mobile device performance. Furthermore, the scalability and elasticity advantages
of CC apply to MCC as well, since cloud flexibility extends to the infrastructure as a whole.
1.11.4. PRIVACY
Privacy is an important issue when dealing with private data. As in the CC era, the
same trust problem arises with mobile network providers and cloud providers: they can
monitor all communication and data stored in the cloud or at the network provider, although
encryption mechanisms exist for data in transit and at rest. From this perspective, it remains
a major problem to be solved.
1.11.5. COMMUNICATION
Communication involves multiple parts, from the mobile subscriber to the cloud
provider. Therefore, there can be problems such as poor network speed or limited bandwidth.
This is a big concern because the number of mobile and cloud users is increasing
dramatically.
As mentioned in the previous section, Mobile Cloud Computing has many benefits and good
application examples for mobile users and service providers. On the other hand, there are
also challenges related to cloud computing and mobile network communication. This section
explains these obstacles and possible solutions. On the mobile network side, the main
obstacles and solutions are listed below:
1.12.2. AVAILABILITY
Network failures, out-of-signal errors and poor performance due to high traffic are the
main threats preventing users from connecting to the cloud. However, there are solutions that
help mobile users in case of disconnection from the cloud. One of them is the Wi-Fi-based
multihop MANET, a distributed content-sharing protocol for situations without any
infrastructure [7]. In this mechanism, nearby nodes are detected when a direct connection to
the cloud fails; instead of a direct link, the mobile user connects to the cloud through
neighboring nodes. Although there are some concerns about the security of such
mechanisms, these issues can also be solved.
1.12.3. HETEROGENEITY
Different types of networks are used simultaneously in the mobile environment, such
as WCDMA, GPRS, WiMAX, CDMA2000 and WLAN. As a result, handling such
heterogeneous network connectivity becomes very hard while satisfying mobile cloud
computing requirements such as always-on connectivity, on-demand scalable connectivity,
and energy efficiency of mobile devices. This problem can be addressed by using
standardized interfaces and messaging protocols to reach, manage and distribute content.
1.12.4. PRICING
Using multiple services on a mobile device requires agreements with both the mobile
network provider and the cloud service provider. However, these providers have different
payment methods and prices for services, features and facilities. This can lead to many
problems, such as how to determine the price, how the price should be shared among the
providers or parties, and how subscribers should pay. For example, when a mobile user wants
to run a paid mobile application on the cloud, three stakeholders participate: the application
provider for the application license, the mobile network provider for the data communication
from user to cloud, and the cloud provider for providing and running the application on the
cloud.
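The three-party billing described above can be made concrete with a toy split of one user's charges. All tariffs below are hypothetical; the point is only that a single user action produces amounts owed to three different stakeholders.

```python
# Toy illustration of splitting one bill across the three stakeholders
# named above. All prices are made-up examples.

def split_bill(license_fee, mb_transferred, price_per_mb,
               cpu_hours, price_per_cpu_hour):
    return {
        "application_provider": license_fee,                    # license
        "network_provider": mb_transferred * price_per_mb,      # data
        "cloud_provider": cpu_hours * price_per_cpu_hour,       # compute
    }

bill = split_bill(license_fee=2.0, mb_transferred=50, price_per_mb=0.01,
                  cpu_hours=1.5, price_per_cpu_hour=0.40)
total = sum(bill.values())   # 2.0 + 0.5 + 0.6
```

Even this trivial model shows why pricing is hard in practice: each provider sets its tariff independently, yet the subscriber sees (and must accept) only the combined total.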
CHAPTER 2
LITERATURE REVIEW
Marcos D. Assuncao et al. [1] have discussed approaches and environments for
carrying out analytics on clouds for big data applications. They identified possible gaps in
technology and provided recommendations to the research community on future directions
for cloud-supported big data computing and analytics solutions.
Khairul Munadi et al. [2] have proposed a conceptual image trading framework that
enables secure storage and retrieval over internet services. The aim is to facilitate secure
storage and retrieval of original images for commercial transactions, while preventing
untrusted server providers and unauthorized users from gaining access to the true contents.
Haibo Hu et al. [8] have proposed a holistic and efficient solution that comprises a
secure traversal framework and an encryption scheme based on privacy homomorphism.
The framework is scalable to large datasets by leveraging an index-based approach. Based
on this framework, they devise secure protocols for processing typical queries such as
k-nearest-neighbor (kNN) queries on an R-tree index.
Ku Rahane et al. [9] have proposed a framework for big data clustering that utilizes
grid technology and an ant-based algorithm.
Sudipto Das et al. [10] have discussed and clarified some of the critical concepts in
the design space of big data and cloud computing, such as the appropriate systems for a
specific set of application requirements on mobile data.
2.1 RELATED PAPERS
TITLE: A Genetic Algorithm for Virtual Machine Migration in Heterogeneous Mobile Cloud
Computing.
AUTHOR: Md. Mofijul Islam, Md. Abdur Razzaque and Md. Jahidul Islam
YEAR: 2016
DESCRIPTION:
Mobile Cloud Computing (MCC) improves the performance of a mobile application
by executing it at a resourceful cloud server.
Virtual Machine (VM) migration in MCC brings cloud resources closer to a user so as
to further minimize the response time of an offloaded application.
The key challenge is to find an optimal cloud server for migration that offers the
maximum reduction in computation time.
The goal of GAVMM is to select the optimal cloud server for a mobile VM and to
minimize the total number of VM migrations, resulting in a reduced task execution
time.
ADVANTAGES:
Mass storage capacity and high-speed computing power.
It assigns more tasks to VMs with larger bandwidth, while VMs with smaller
bandwidth are rarely assigned tasks.
Load balancing of the entire system can be handled dynamically by using
virtualization technology.
DISADVANTAGES:
The VM placement problem is central to scheduling and management in cloud data
centers.
Limited bandwidth and other limited resources.
The VM placement problem needs to consider the influence of network factors.
ALGORITHM:
Genetic Algorithm based Virtual Machine Migration (GAVMM).
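A genetic algorithm of this kind can be sketched in miniature: chromosomes encode VM-to-server assignments, fitness rewards low total execution time, and each generation keeps the best half and mutates it. The cost matrix, parameters and operators below are hypothetical, not GAVMM's actual implementation.

```python
# Toy GA in the spirit of GAVMM: evolve VM-to-server assignments
# toward low total execution time. All numbers are illustrative.
import random

random.seed(0)
EXEC_TIME = [[4, 2, 7], [3, 5, 1], [6, 2, 2]]   # time of VM i on server j

def fitness(chromosome):
    # Higher is better: negate the total execution time of the assignment.
    return -sum(EXEC_TIME[vm][srv] for vm, srv in enumerate(chromosome))

def evolve(pop_size=20, generations=40, n_vms=3, n_servers=3):
    pop = [[random.randrange(n_servers) for _ in range(n_vms)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)      # elitism: best first
        survivors = pop[:pop_size // 2]
        children = []
        for parent in survivors:                 # one mutation per child
            child = parent[:]
            child[random.randrange(n_vms)] = random.randrange(n_servers)
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()   # best achievable total time in this tiny instance is 5
```

The real algorithm additionally uses crossover and encodes migration costs in the fitness; this sketch only shows the selection-mutation loop converging on a low-cost assignment.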
TITLE: A Survey of Mobile Cloud Computing Application Models.
AUTHOR: Atta ur Rehman Khan, Mazliza Othman, Sajjad Ahmad Madani and Samee Ullah
Khan.
YEAR: 2014
DESCRIPTION:
Smartphones are now capable of supporting a wide range of applications, many of
which demand an ever increasing computational power.
This poses a challenge because smartphones are resource-constrained devices with
limited computation power, memory, storage, and energy.
The cloud computing technology offers virtually unlimited dynamic resources for
computation, storage, and service provision.
The traditional smartphone application models do not support the development of
applications that can incorporate cloud computing features, so specialized mobile
cloud application models are required.
ADVANTAGES:
EXCloud transfers only the top stack frames, unlike traditional process migration
techniques in which full state migrations are performed.
MAUI provides a programming environment where independent methods can be
marked for remote execution.
The model offers a wide range of elasticity patterns to optimize the execution of
applications according to the users' desired objectives.
DISADVANTAGES:
The sharing of data and states between weblets that execute at distributed locations is
prone to security issues.
The data replication may give rise to data synchronization and integrity issues.
The latency issue is very crucial in mobile cloud application models.
ALGORITHM:
Application partitioning algorithms such as
All-step
K-step
TITLE: Big Data-Driven Service Composition Using Parallel Clustered Particle Swarm
Optimization in Mobile Environment.
AUTHOR: M. Shamim Hossain, Mohd Moniruzzaman, Ghulam Muhammad, Ahmed Ghoneim
and Atif Alamri.
DESCRIPTION:
Mobile service providers support numerous emerging services with differing quality
metrics but similar functionality.
The mobile environment is ambient and dynamic in nature, requiring more efficient
techniques to deliver the required service composition promptly to users.
Selecting the optimum required services in a minimal time from the numerous sets of
dynamic services is a challenge.
By using parallel processing, the optimum service composition is obtained in
significantly less time than alternative algorithms.
ADVANTAGES:
The performance of this algorithm can be improved by using efficient optimization
techniques like PSO.
Qualities of the mobile environment demand efficient optimization and clustering
techniques.
DISADVANTAGES:
The issue of parallel and distributed data operations where the structure of data is
multi-dimensional.
Dynamic QoS and the rapidly changing nature of services in the mobile environment.
ALGORITHM:
Particle swarm optimization
k-means clustering
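The core PSO loop used in work like this can be sketched in its simplest sequential form: particles track a personal best and a swarm-wide best and are pulled toward both. This is a generic, illustrative PSO (standard inertia and attraction coefficients), not the paper's parallel clustered variant, and the cost function is a stand-in.

```python
# Minimal particle swarm optimisation sketch, minimising f(x) = (x - 3)^2.
# Parameters are conventional illustrative choices (w=0.7, c1=c2=1.5).
import random

random.seed(1)

def pso(cost, n_particles=15, iters=60, lo=-10.0, hi=10.0):
    pos = [random.uniform(lo, hi) for _ in range(n_particles)]
    vel = [0.0] * n_particles
    pbest = pos[:]                      # each particle's best-known position
    gbest = min(pos, key=cost)          # swarm-wide best-known position
    for _ in range(iters):
        for i in range(n_particles):
            r1, r2 = random.random(), random.random()
            vel[i] = (0.7 * vel[i]
                      + 1.5 * r1 * (pbest[i] - pos[i])    # pull to own best
                      + 1.5 * r2 * (gbest - pos[i]))      # pull to swarm best
            pos[i] += vel[i]
            if cost(pos[i]) < cost(pbest[i]):
                pbest[i] = pos[i]
            if cost(pos[i]) < cost(gbest):
                gbest = pos[i]
    return gbest

best_x = pso(lambda x: (x - 3.0) ** 2)   # converges near x = 3
```

In the service composition setting, a "position" would encode a candidate set of services and the cost would combine the QoS metrics; the update rule stays the same, which is why the loop parallelizes well across clustered sub-swarms.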
TITLE: Clone Cloud: Elastic Execution between Mobile Device and Cloud
AUTHOR: Byung-Gon Chun, Sunghwan Ihm, Petros Maniatis, Mayur Naik and Ashwin Patti.
DESCRIPTION:
Mobile applications are becoming increasingly ubiquitous and provide ever richer
functionality on mobile devices.
Such devices often enjoy strong connectivity with more powerful machines ranging
from laptops and desktops to commercial clouds.
Clone Cloud uses a combination of static analysis and dynamic profiling to partition
applications automatically at a fine granularity while optimizing execution time and
energy use for a target computation and communication environment.
At runtime, the application partitioning is effected by migrating a thread from the
mobile device at a chosen point to the clone in the cloud, executing there for the
remainder of the partition, and re-integrating the migrated thread back to the mobile
device.
ADVANTAGES:
Unlike desktops and laptops, mobile devices place demands on an extremely limited
supply of energy.
The granularity of partitioning is coarse since it is at class level, and it focuses on
static partitioning.
Supporting native method calls was an important design choice we made, which
increases its applicability.
DISADVANTAGES:
Web page Consistency problem.
Optimization problem.
ALGORITHM:
DEFLATE compression algorithm
TITLE: Federated Internet of Things and Cloud Computing Pervasive Patient Health
Monitoring System
AUTHOR: Jemal H. Abawajy and Mohammad Mehedi Hassan
YEAR: 2017
DESCRIPTION:
In the conventional hospital-centric healthcare system, patients are often tethered to
several monitors.
It develops an inexpensive but flexible and scalable remote health status monitoring
system that integrates the capabilities of IoT and cloud technologies for remote
monitoring of a patient's health status.
It addresses healthcare spending challenges by substantially reducing inefficiency and
waste, as well as enabling patients to stay in their own homes and receive the same or
better care.
To demonstrate the suitability of the proposed PPHM infrastructure, a case study of
real-time ECG monitoring of a patient suffering from congestive heart failure is
presented.
ADVANTAGES:
A flexible, energy-efficient, and scalable remote patient health status monitoring
framework.
A health data clustering and classification mechanism to enable good patient care.
Performance analysis of the PPHM framework to show its effectiveness.
DISADVANTAGES:
IoT-cloud convergence is a crucial issue in healthcare applications.
Access control, location privacy, data confidentiality.
ALGORITHM:
Rank correlation coefficient algorithm.
Classification algorithm.
TITLE: Healthcare Big Data Voice Pathology Assessment Framework.
AUTHOR: M. Shamim Hossain and Ghulam Muhammad
YEAR: 2016
DESCRIPTION:
Healthcare big data comprise data from different structured, semi-structured, and
unstructured sources.
A framework is required that facilitates collection, extraction, storage, classification,
processing, and modeling of this vast heterogeneous volume of data.
The machine learning algorithms in the form of a support vector machine, an extreme
learning machine and a Gaussian mixture model are used as the classifier.
The proposed VPA system shows its efficiency in terms of accuracy and time
requirement.
ADVANTAGES:
We are likely to see an increasingly diverse set of stakeholders involved, spanning the
technical, health, and policy domains.
Big data tools with their merits that facilitate the execution of specified tasks in the
healthcare ecosystem.
DISADVANTAGES:
Security, integrity and privacy violations of these data can cause irremediable damage
to the health, or even death, of the individual and loss to society.
The standardization and format of big data, big data transfer and processing, searching
and mining of big data, and management of services.
Patients with similar symptoms and diseases can share their experiences through
social media to get ad-hoc counseling, which constitutes a big data problem.
ALGORITHM:
Support vector machines (SVM)
Extreme learning machine (ELM)
Gaussian mixture model (GMM)
TITLE:Migrate or not? Exploiting dynamic task migration in Mobile cloud computing
systems.
AUTHOR: Lazaros Gkatzikis and Iordanis Koutsopoulos
YEAR: 2013
DESCRIPTION:
Contemporary mobile devices generate heavy loads of computationally intensive
tasks, which cannot be executed locally due to the limited processing and energy
capabilities of each device.
Cloud facilities enable mobile devices-clients to offload their tasks to remote cloud
servers, giving birth to Mobile Cloud Computing (MCC).
The challenge for the cloud is to minimize the task execution and data transfer time to
the user, whose location changes due to mobility.
Providing quality-of-service guarantees is particularly challenging in the dynamic
MCC environment, due to the time-varying bandwidth of the access links.
ADVANTAGES:
The elasticity of resource provisioning and the pay-as-you-go pricing model.
We delineate the performance benefits that arise for mobile applications and identify
the peculiarities of the cloud that introduce significant challenges in deriving optimal
migration strategies.
Reducing the energy consumption of individual servers by moving the processes from
heavily loaded to less loaded servers (load balancing).
DISADVANTAGES:
A strategy that does not consider migration cost and download time.
No migration.
TITLE:Mobiles on Cloud Nine: Efficient Task Migration Policies for Cloud Computing
Systems.
AUTHOR: Lazaros Gkatzikis and Iordanis Koutsopoulos
YEAR: 2014
DESCRIPTION:
Due to limited processing and energy resources, mobile devices outsource their
computationally intensive tasks to the cloud.
Clouds are shared facilities and hence task execution time may vary significantly.
It investigates the potential of task migrations to reduce contention for the shared
resources of a mobile cloud computing architecture in which local clouds are attached
to wireless access infrastructure.
It devises online migration strategies that at each time make migration decisions
according to the instantaneous load and the anticipated execution time.
ADVANTAGES:
The modification of programs to incorporate state-capture and recovery functions.
Simplified IT management and maintenance capabilities.
Enormous computing resources available on demand.
DISADVANTAGES:
Classifying current computation offloading frameworks. Analyzing them by
identifying their approaches and crucial issues.
Process migration applications are strongly connected with the system in the form of
sockets.
Application development complexity and unauthorized access to remote data
demand a systematic, comprehensive solution.
TITLE: Smart City Solution for Sustainable Urban Development
AUTHOR: Mostafa Basiri, Ali Zeynali Azim, Mina Farrokhi
YEAR: 2017
DESCRIPTION:
Large, dense cities can be highly resource-efficient, which makes them desirable
from the standpoint of green and sustainable urban development.
The influx of new citizens confronts city administrations with rapidly advancing
challenges.
The globalization of urban economics, cities increasingly have to compete directly
with worldwide and regional economies for international investment to generate
employment, revenue and funds for development.
Smart Cities are those towns which use information technology to improve both the
quality of life and accessibility for their inhabitants.
ADVANTAGES:
Reducing resource consumption, notably energy and water, hence contributing to
reductions in CO2 emissions.
Improving commercial enterprises through the publication of real-time data on the
operation of city services.
The growing penetration of fixed and wireless networks that allow such sensors
and systems to be connected to distributed processing centers and for these centers
in turn to exchange information among themselves.
DISADVANTAGES:
Where there are threats of serious or irreversible damage, lack of full scientific
certainty shall not be used as a reason for postponing cost-effective measures to
prevent environmental degradation.
The substitutability of capital.
Sustainable development problem.
TECHNIQUE:
Information management technique
CHAPTER 3
HEALTHCARE BIG DATA SOURCE ECO SYSTEM
Healthcare big data is a revolutionary tool in the healthcare industry, and is becoming
vital in current patient-centric care. Owing to the massive growth of data in the healthcare
industry, diverse data sources have been aggregated into the healthcare big data ecosystem.
These data sources are used by healthcare providers to make decisions and provide
appropriate care. Major data sources, along with the challenges
involved, are discussed below:
3.1. PHYSIOLOGICAL SIGNALS
These data are huge in terms of both volume and velocity. Regarding data volume, a
variety of signals is collected from heterogeneous sources to monitor patient characteristics,
including blood pressure, blood glucose, and heart rate. Sources include the
electroencephalogram, electrocardiogram, and electroglottogram. Data velocity can be
observed in the growing rate of data generation from continuous monitoring: especially for
patients in a critical condition, these signals must be processed in real time for decision
making. These signals need to be extracted efficiently and processed with a suitable
machine learning algorithm to provide meaningful data for effective patient care. Efficient
and comprehensive methods are also required to analyze and process the collected signals to
provide useable data to the healthcare professionals and other related stakeholders. The
combination of EHR and physiological signals may increase the precision of data based on
the surrounding context of the patient.
3.2. EHRS/EMRS
EHRs or electronic medical records (EMRs) are digitized structured healthcare data
from a patient. The EHRs are collected from and shared among hospitals, research centers,
government agencies, and insurance companies. Security, integrity and privacy violations of
these data can cause irremediable damage to the health, or even death, of the individual and
loss to society. Thus, big healthcare data security is now a key topic of research.
3.3. MEDICAL IMAGES
These images generate a huge volume of data that assists healthcare professionals in
identifying and detecting disease, planning treatment, and predicting and monitoring patient
status. Medical imaging techniques such as X-ray, ultrasound, and computed tomography
play a crucial role in diagnosis and prognosis. Owing to the complexity, dimensionality and
noise of the collected images, efficient image processing methods are required to provide
clinically suitable data for patient care.
3.4. SENSED DATA
Sensed data from patients are collected using different wearable or implantable
devices, environment-mounted devices, ambulatory devices, sensors, and smart phones,
at home or in hospitals. The sensed data form a key part of healthcare big data, as these
sensors are used to capture critical events or provide continuous monitoring. However, sensed
data must be collected, pre-processed, stored, shared and delivered correctly in a reasonable
time to be of use to healthcare providers when making clinical decisions. Owing to the
enormous volume of data collected, automated algorithms are required to reduce noise and to
allow for the deployment with big data analytics so that computation time can be reduced.
Moreover, it is a challenge to collect and collate multimodal sensed data from multiple
sources at the same time.
3.5. CLINICAL NOTES
Clinical notes, claims, recommendations, and decisions constitute one of the
largest unstructured sources of healthcare big data. Owing to the variety in format, reliability,
completeness, and accuracy of the clinical notes, it is challenging to ensure the health care
provider has the correct information. Efficient data mining and natural language processing
techniques are required to provide meaningful data.
Fig: 3.1. Big healthcare data source eco system.
3.6. SOCIAL MEDIA DATA
Owing to the heterogeneous nature of social healthcare media data, it is difficult to
conduct data analysis and provide meaningful data to healthcare big data stakeholders. Thus,
these data need to be appropriately mined, analyzed and processed to improve the quality of
the healthcare services delivered by healthcare providers.
The MapReduce programming model divides computation into map and reduce
phases, as shown below.
The map phase partitions input data into many input splits and automatically stores
them across a number of machines in the cluster. Once input data are distributed across the
cluster, the runtime creates a large number of map tasks that execute in parallel to process the
input data. The map tasks read in a series of key-value pairs as input and produce one or more
intermediate key-value pairs. A key-value pair is the basic unit of input to the map task. We
use the Word Count application, which counts the number of occurrences of each word in a
series of text documents, as an example. In the case of processing text documents, a key-
value pair can be a line in a text document. The user can customize the definition of a key-
value pair.
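The WordCount example can be sketched as a minimal in-memory simulation of the map and reduce phases (plain Python rather than the Hadoop API; function names are illustrative):

```python
from collections import defaultdict

def map_phase(documents):
    """Map task: treat each line as an input key-value pair, emit (word, 1) pairs."""
    for line in documents:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce task: group intermediate pairs by key and sum the counts."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data in health care", "big data analytics"]
print(reduce_phase(map_phase(docs)))
# {'big': 2, 'data': 2, 'in': 1, 'health': 1, 'care': 1, 'analytics': 1}
```

In the real framework the shuffle between the two phases is done by the runtime across machines; this sketch only shows the programming model.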
The MapReduce runtime implements the MapReduce model first proposed by
Google. It automatically handles task distribution, fault tolerance and other
aspects of distributed computing, making it much easier for programmers to write data
parallel programs. It also enables Google to exploit a large number of commodity computers
to achieve high performance at a fraction of the cost of a system built from fewer but more
expensive high-end servers. Map Reduce scales performance by scheduling parallel tasks on
nodes that store the task inputs. Each node executes the tasks with loose communication with
other nodes. Hadoop is an open source implementation of MapReduce. To use the
Hadoop MapReduce framework, the user first writes a MapReduce application using the
programming model we described in the previous section. The user then submits the
MapReduce job to a job tracker, which is a Java application that runs in its own dedicated
JVM. The job tracker is responsible for coordinating the job run. It splits the job into a
number of map/reduce tasks and schedules the execution of the tasks.
Fig: 3.3. Hadoop runs a Map Reduce job
Task trackers have a fixed number of slots for map tasks and for reduce tasks. Each
slot corresponds to a JVM executing a task. Each JVM only employs a single computation
thread. To utilize more than one core, the user needs to configure the number of map/reduce
slots based on the total number of cores and the amount of memory available on each node.
The configuration can be set in the mapred-site.xml file. The relevant properties are
mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum.
The example in Figure 3.5 shows how to set four map slots and two reduce slots on
each compute node. The setting can be used to express heterogeneity of the machines in the
cluster.
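As an illustrative sketch (using the pre-YARN, MR1 property names; newer Hadoop releases replaced fixed slots with YARN containers), a mapred-site.xml fragment along these lines sets four map slots and two reduce slots per node:

```xml
<configuration>
  <!-- Maximum number of map tasks run simultaneously by a task tracker -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <!-- Maximum number of reduce tasks run simultaneously by a task tracker -->
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>
```

Because the file is per-node, heterogeneous machines can carry different values.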
This setting can be different for each compute node. The reason is that different
machines in the cluster can have a different number of cores and differing amounts of
memory.
For example, a typical hash join application requires each map task to store a copy of
the lookup table in memory. Duplicating the lookup table will decrease the amount of
memory available to each map task. To make sufficient memory available to each map task,
memory intensive applications are often forced to restrict the number of JVMs created to be
smaller than the number of cores in a node at the expense of reducing CPU utilization. For
example, in a machine with four cores and 4 GB of RAM, the system needs to create four
map tasks to use the four cores. However, if 1 GB of RAM is insufficient for each map task, the
Hadoop MapReduce system can create only two map tasks, with 2 GB of RAM available to
each task. With two map tasks, the runtime system utilizes only two of the four available
cores, or 50 percent of the CPU resources.
Fig: 3.4. Hadoop MapReduce on a four-core system.
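The slot arithmetic above can be written as a small helper (a sketch; `per_task_gb` stands for whatever memory one map task actually needs):

```python
def usable_map_slots(cores, ram_gb, per_task_gb):
    """Slots are limited by both the core count and the memory available per task."""
    by_memory = ram_gb // per_task_gb
    return min(cores, by_memory)

# Machine with four cores and 4 GB of RAM:
print(usable_map_slots(4, 4, 1))  # 4 slots -> full CPU utilization
print(usable_map_slots(4, 4, 2))  # 2 slots -> only 50 percent of the cores busy
```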
3.12. MAP REDUCE FOR BIG DATA ANALYSIS
• There is a growing trend of applications that should handle big data. However,
analyzing big data is a very challenging problem today.
• For such applications, the Map Reduce framework has recently attracted a lot of
attention. Google’s Map Reduce or its open source equivalent Hadoop is a powerful
tool for building such applications.
• Effective management and analysis of large-scale data poses an interesting but critical
challenge.
Recently, big data has attracted a lot of attention from academia, industry as well as
government.
3.13. HEALTHCARE PROVIDER
This case study describes a big data proof-of-value and data lake build-out for a healthcare
provider. The goal was a big data solution for predictive analytics and enterprise reporting. The challenge was
that the organization did not have an enterprise data warehouse to consolidate data from their
enterprise operational systems. Further, key business stakeholders were not provided with
tools needed to access data. Lastly, teams tasked with creating analytics spent 90% of their
time integrating data in SAS, leaving them little time to analyze the data or create predictive
analytics. The successful project involved meeting with key business and IT stakeholders to
determine reporting and analytic challenges and priorities, and also performing a Current
State Assessment, along with metadata discovery, profiling and outlier analysis of source
data. Dell EMC proposed a data lake architecture to address enterprise reporting and predictive
analytic needs. The solution also initiated a governance program to ensure data quality and to
establish stewardship procedures. Finally, the project identified federated business data lake
hardware and Pivotal big data suite software as the target platform for the data lake.
The results of the project included a new client analytics environment that facilitated
the execution of analytics and reporting activities to reduce time to insight. Further, a client
governance structure ensured that metadata for new data sources entering the data lake was
shared with users. The environment also supported the rapid creation of sandboxes to support
analytics projects, boosting patient care, service levels and efficiency by simplifying data
access.
Hadoop is a strong example of a technology that allows healthcare to store data in its
native form. If Hadoop didn’t exist, decisions would have to be made about what can be
incorporated into the data warehouse or the electronic medical record (and what cannot).
Now everything can be brought into Hadoop, regardless of data format or speed of ingest. If
a new data source is found, it can be stored immediately. No data is left behind.
By the end of 2017, the number of health records of millions of people is likely to
increase into tens of billions. Thus, the computing technology and infrastructure must be able
to render a cost-efficient implementation of large-scale data storage and processing.
Hadoop technology is successful in meeting the above challenges faced by the healthcare
industry as the Map Reduce engine and Hadoop Distributed File System (HDFS) have the
capability to process thousands of terabytes of data. Hadoop makes use of highly optimized,
yet inexpensive commodity hardware making it a budget friendly investment for the
healthcare industry.
3.13.1. BIG DATA ANALYTICS IS MOTIVATED IN HEALTHCARE
THROUGH THE FOLLOWING ASPECTS
Healthcare data is now growing very rapidly in terms of size, complexity, and speed
of generation, and traditional database and data mining techniques are no longer
efficient in storing, processing and analyzing these data. New innovative tools are
needed in order to handle these data within a tolerable elapsed time.
The patient’s behavioral data is captured through several sensors; patients' various
social interactions and communications.
The standard medical practice is now moving from relatively ad-hoc and subjective
decision making to evidence-based healthcare.
Inferring knowledge from complex heterogeneous patient sources and leveraging the
patient/data correlations in longitudinal records.
Understanding unstructured clinical notes in the right context.
Efficiently handling large volumes of medical imaging data and extracting
potentially useful information and biomarkers.
Analyzing genomic data is a computationally intensive task and combining with
standard clinical data adds additional layers of complexity.
Fig: 3.6. Mobile Cloud Computing and Big data Analytics
A lot of data is produced on a routine basis by hospitals, laboratories, retail and non-retail
medical operations, and promotional activities. But most of it gets wasted because the
people responsible are not able to figure out what to do with that data. This is where cloud-
based big data comes into the picture.
Big data analytics tools and repositories take over the heavy lifting, generating
reliable, quantitative insights out of huge volumes of data within a matter of seconds. This
means that in the future we will need more doctors who are trained to work with big data.
The big data revolution is bringing up sophisticated methods of consolidating
information from tons of sources. The focus is on providing the most relevant and updated
information to doctors and medical practitioners in real time while they are consulting their
patients.
Until now, the collection of data has been limited to the major available resources in the
healthcare sector. However, with the advent of smartphone apps and wearables, data is now
everywhere. And this allows practitioners to know patients’ health conditions in a more
precise manner. Apps that act like pedometers to measure your steps, the calorie counter for
your diet, the app for monitoring and recording heart rate, blood pressure and blood sugar
levels, and wearable devices like Fitbit, Jawbone etc. are all sources of data nowadays. In the
near future, the patient will share this data with the doctor who can utilize it as a diagnostic
toolbox to provide better treatment in less time.
CHAPTER 4
BIG DATA HEALTHCARE USING ANT COLONY
OPTIMIZATION
Big Data Healthcare is the drive to capitalize on growing patient and health system
data availability to generate healthcare innovation. By making smart use of the ever-
increasing amount of data available, we can find new insights by re-examining the data or
combining it with other information. In healthcare this means not just mining patient records,
medical images, bio banks, test results , etc., for insights, diagnoses and decision support
advice, but also continuous analysis of the data streams produced for and by every patient in
a hospital, a doctor’s office, at home and even while on the move via mobile devices.
Current medical hardware, monitoring everything from vital signs to blood chemistry,
is beginning to be networked and connected to electronic patient records, personal health
records, and other healthcare systems.
Big Data has been characterized as raising five essentially independent challenges:
Volume,
Velocity,
Variety,
Veracity, and
Value.
As elsewhere, in Big Data Healthcare the data volume is increasing, and so is data
velocity as continuous monitoring technology becomes ever cheaper. With so many types of
tests, and the existing wide range of medical hardware and personalized monitoring devices,
healthcare data could not be more varied; yet data from this variety of sources must be
combined for processing to reap the expected rewards. In healthcare, veracity of data is of
paramount importance, requiring careful data curation and standardization efforts, while at
the same time seeming to be in opposition to the enforcement of privacy rights.
Finally, extracting value out of big healthcare data for all its beneficiaries (clinicians,
clinical researchers, pharmaceutical companies, healthcare policy-makers, etc.)
demands significant innovations in data discovery, transparency and openness, explanation
and provenance, summarization and visualization, and will constitute a major step towards
the coveted democratization of data analytics.
4.1 ANT Colony Optimization Technique
The Ant Colony Optimization (ACO) algorithm is a metaheuristic initially proposed
by Marco Dorigo in his PhD dissertation in 1992. “The original idea comes from observing the
exploitation of food resources among ants, in which ants’ individually limited cognitive
abilities have collectively been able to find the shortest path between a food source and the
nest”.
It was first used to solve the traveling salesman problem (TSP). Because of its
characteristics of distributed computing, self-organization and positive feedback, ACO has
been used in prior works for routing in Sensor Networks. “Node Potential” is the heuristic
used to evaluate the potential of next hop selection based on three factors: the candidate’s
distance to the sink node, its distance to the nearest aggregation node and its data correlation
with the current node.
In this algorithm, random searching for the destination (sink node) is needed in early
iterations. Some variants use a simpler heuristic that considers only the distance to the sink
node. An algorithm can be composed of path construction, path maintenance, and aggregation
schemes, including a synchronization scheme, a loop-free scheme, and a collision avoidance
scheme.
There is a problem ignored by the algorithms above. Although ACO aggregation
algorithms converge to a route very close to the optimum route, most of them only use a
single path to transfer data until an active node in the path runs out of battery. Then the path
construction and data delivery cycle starts again.
Although route discovery overhead can be reduced, those algorithms do not take into
consideration the limitations of WSNs, especially the energy limits of Sensor nodes and the
number of agents required to establish the routing. Repeatedly using the same optimal path
exhausts the relaying nodes’ energy quickly.
Relatively frequent efforts to maintain the Network and to explore new paths are
needed. Therefore, this approach is not energy efficient and results in shorter Sensor nodes’
lifetime and consequently Network lifetime. Algorithms that separate path establishment and
data delivery processes suffer from this problem.
Data aggregation approach improves energy efficiency in Wireless Sensor Networks
by eliminating redundant packets, reducing end-to-end delay and Network traffic. This
research studies the effect of combining data aggregation technique and multi-path ACO
algorithm with different heuristics on Network lifetime and end-to-end delay.
4.2 VIRTUALIZATION RESOURCE ACCESS
Consider the following scenario: a blind user is executing an application that takes an
image from his surroundings. Then, the application processes the image in the cloudlet and
gives a response to the user’s local client. That is, the application continuously uploads some
data and the cloud server processes this data to provide responses back to the user.
Fig: 4.1. Ant colony optimization
Now, if the blind user moves away from the current cloudlet, then he or she will
experience a delayed response from the mobile application executing in the cloudlet,
degrading the overall performance of the application. To avoid this performance degradation,
it is necessary for the system to adopt a VM migration method to choose a cloudlet that is
currently closer to the user to which to migrate the VM. User mobility is not the only reason
forcing a VM to migrate. Migration can be initiated to minimize the over provisioned
resources and thus improve the overall system objectives. For instance, if a VM is required to
be migrated from a cloudlet to any of the candidate cloudlets, the new cloudlet may not have
the same type of VM. In that case, a VM with more resources than the current one must be
chosen and provisioned in order to migrate the VM and thus minimize task-execution time.
Fig: 4.2. Double bridge experiment. (a) Ants start exploring the double bridge. (b)
Eventually most of the ants choose the shortest path. While each single ant is in
principle capable of building a solution (i.e., of finding a path between nest and food
source), it is only the colony of ants that presents the “shortest path finding” behavior.
In a sense, this behavior is an emergent property of the ant colony.
A VM migration may be provisioned more resources than required. This
over-provisioned resource greatly decreases the system objectives, as it reduces the number
of provisioned VMs in the cloudlets. Furthermore, the joint VM migration approach, where a
set of VMs is remapped based on the VM task execution time and over-provisioned resources
can help to effectively increase the overall system objectives. In contrast to the joint VM
migration approach, single VM migration can only improve particular user objectives but not
the system objectives.
4.3 ANT COLONY ALGORITHM
Step 1: while (termination criterion not satisfied)
Step 2:     ant generation and activity();
Step 3:     pheromone evaporation();
Step 4:     daemon actions(); “optional”
Step 5: end while
Step 6: end Algorithm
7. When moving from node f to neighbor node g, the agent updates the pheromone trail t(fg)
on the edge (f, g).
8. Once the data is retrieved from the cloud, the agent can retrace the same path backward,
update pheromone trails and close the operation.
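A minimal runnable sketch of this skeleton, applied to a tiny double-bridge-style graph (all parameter values and names are illustrative, not taken from the text):

```python
import random

# Tiny double-bridge-style graph: two routes from the nest to the food source.
graph = {
    "nest": {"a": 1.0, "b": 2.0},  # edge weights are distances
    "a": {"food": 1.0},
    "b": {"food": 2.0},
}
pheromone = {(u, v): 1.0 for u in graph for v in graph[u]}

def choose(node, beta=2.0):
    """Node selection: probability proportional to pheromone * (1/distance)^beta."""
    nbrs = list(graph[node])
    weights = [pheromone[(node, v)] * (1.0 / graph[node][v]) ** beta for v in nbrs]
    return random.choices(nbrs, weights=weights)[0]

def run(iterations=50, rho=0.1):
    random.seed(1)
    for _ in range(iterations):
        # Ant generation and activity: one ant builds a nest-to-food path.
        path, node = [], "nest"
        while node != "food":
            nxt = choose(node)
            path.append((node, nxt))
            node = nxt
        # Pheromone evaporation on every edge.
        for e in pheromone:
            pheromone[e] *= (1.0 - rho)
        # Deposit on the traversed path, inversely proportional to its cost.
        cost = sum(graph[u][v] for u, v in path)
        for e in path:
            pheromone[e] += 1.0 / cost
    # Report which branch out of the nest accumulated the most pheromone.
    return max(graph["nest"], key=lambda v: pheromone[("nest", v)])

print(run())  # the colony concentrates on the shorter branch: 'a'
```

With the distance heuristic and per-path deposits, the pheromone converges on the cheaper branch, mirroring the double-bridge behavior shown in Fig. 4.2.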
4.4 ALGORITHM OVERVIEW
In ACO algorithms, a colony of artificial ants is used to construct solutions guided by
the pheromone trails and heuristic information. The original idea of ACO comes from
observing the exploitation of food resources among ants. Ants explore the area surrounding
their nest initially in a random manner. As soon as an ant finds a source of food (source
node), it evaluates the quantity and quality of the food and carries some of it to the nest (sink
node). During the backtracking, the ant deposits a pheromone trail on the ground. The
quantity of deposited pheromone, which may depend on the quantity and quality of the food,
will guide other ants to the food source. The pheromone trails are simulated via a
parameterized probabilistic model. The pheromone model consists of a set of parameters. In
general, the ACO approach attempts to find the optimal routing by iterating the following two
steps:
1. Solutions are constructed using a node selection model based on a predetermined
heuristic and the pheromone model, a parameterized probability distribution over the solution
space.
2. The solutions that were constructed in earlier iterations are used to modify the
pheromone values in a way that is deemed to bias the search toward high quality solutions.
The algorithm runs in two passes: forward and backward. In the forward pass, the
route is constructed by a group of ants, each of which starts from a unique source node. In the
first iteration, an ant searches for a route to the destination randomly. Later, an ant searches for
the nearest point of the previously discovered route. This can take many iterations before the
ant finds a correct path of reasonable length. One solution is to flood the sink node ID from
the sink to all the Sensor nodes in the Network before any ant starts. The points where
multiple ants join are aggregation nodes. In the backward pass, every ant starts from the sink
node and travels back to the corresponding source node by following the path discovered in
the forward pass. Pheromone is deposited hop by hop during the traversal.
Nodes of the discovered path are given weights as a result of node selection,
depending on the node potential, which indicates the heuristic for reaching the destination.
Pheromone trails are the means by which ants communicate the discovered route to other ants.
The trail followed by ants most often gets more and more pheromone and eventually
converges to the optimal route. Pheromone on non-optimal routes evaporates with time.
The aggregation points on the optimal tree identify where data aggregation occurs.
4.4.1 PATH DISCOVERY PROCEDURE
The procedure is mainly composed of forward and backward passes. In the forward
pass, an ant tries to explore a new path based on the heuristic rule and the pheromone amount
on the edges. Backtracking is used in the forward pass when an ant finds a dead end or is
running into a loop. In the backward pass, the ant updates the pheromone amount on the path
constructed in the forward pass. Other important components in the algorithms include data
aggregation, loop control, and Network maintenance. In WSN, each node has a unique
identity. Every node is able to calculate and remember its current heuristic value. Initially, the
sink node floods its identity to all the nodes in the Network. After a node receives the packet,
it computes its hop-count to the sink node and correspondingly its initial heuristic value.
Each ant is assigned a source node. After that, an ant starts from the source node and
moves towards the sink node using ad-hoc routing. The forward pass ends only if all the ants
have arrived at the sink node. Single ant-based solution construction uses the following steps:
If the node has been visited in the same iteration, follow a previous ant’s path
Use a node selection rule
If all the neighbors have been visited, use the shortest path
If no neighbor nodes, backtrack to the previous node
If no neighbor nodes and the previous node is dead, record the Network
Lifetime and exit the program.
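A sketch of one selection step implementing the fallback rules above (names are illustrative; `select_rule` stands in for the node selection rule of Section 4.6):

```python
def next_hop(current, visited, neighbors, prev_path, select_rule):
    """One forward-pass selection step following the rules above (a sketch).

    Returns a chosen neighbor, "BACKTRACK", or "NETWORK_DEAD".
    """
    nbrs = neighbors.get(current, [])
    if not nbrs:
        # No neighbor nodes: backtrack, or give up if there is no previous node.
        return "BACKTRACK" if prev_path else "NETWORK_DEAD"
    unvisited = [n for n in nbrs if n not in visited]
    if not unvisited:
        # All neighbors already visited: fall back to the shortest path
        # (represented here by simply taking the first neighbor).
        return nbrs[0]
    # Normal case: apply the node selection rule to the unvisited neighbors.
    return select_rule(unvisited)

# Usage: prefer the rule's pick among unvisited neighbors.
pick_first = lambda options: options[0]
print(next_hop("n1", {"n2"}, {"n1": ["n2", "n3"]}, [], pick_first))  # n3
```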
The current node sends the packet. The selected node receives the packet. Both Nodes
update the residual energy after transmission. If the current node does not have enough
energy to send, this transmission fails. The Network is maintained afterwards. Transmission
failure is mostly prevented by doing a receiving and sending energy check in the node
selection step.
Ants start from the sink node and move towards their source nodes. The ants follow
the paths discovered in the forward pass. Before an ant arrives at its source node, the
algorithm repeats:
Each Sensor node maintains two queues to store packets: a receiving queue and a
sending queue. The packet sending process includes:
For ”SinkDistNoAggre”, push all the packets into the sending queue.
For other aggregation algorithms, use the predefined function to aggregate all the
received packets into one packet and push it into the sending queue.
Among all the ants that arrived at this node, select the earliest ant as the aggregating ant.
The aggregating ant will finish the rest of the routing construction in this iteration.
All the later arrived ants become aggregated ants. They remember the aggregating
ant.
Each aggregated ant shares its path with the aggregating ant. The aggregating ant
updates its subsequent hops with all the aggregated ants.
4.5 LOOP CONTROL AND FAILURE HANDLING
“Loop” is defined as the situation in which an ant revisits an already-visited node in the
same forward pass. Since each ant remembers its path, it can avoid running into a loop by
comparing the candidate node’s ID with the visited nodes’ IDs.
An ant is considered to have failed its task in an iteration if all the neighborhood nodes of the
current node have been visited. In that case, the ant uses the shortest path to deliver the
packet to the sink node. The node’s previous visiting history is not considered when choosing
the next node. A path resulting in “failure” is discouraged.
If the “dead node” is the sink node, recharge the node with more energy. Sink node is
different from other nodes because it needs to perform more frequent transmission and
computation for the purposes of the application. Therefore, it is assumed that the sink node has
plenty of energy to last until the Network dies.
4.6.3 LEADING EXPLORATION
Among all the neighborhood nodes, select the first node with the highest probability,
even if there are multiple nodes with the same probability. This method is deterministic. In
every iteration, an ant always discovers the same path to the sink node until one of the
intermediate nodes dies. If the same Network topology is tested repeatedly, the total energy
cost and Network lifetime are the same.
p_ij^k = [ τ_ij · (η_j)^β ] / Σ_{l ∈ N_i} [ τ_il · (η_l)^β ],  j ∈ N_i        (1)
In equation (1), an ant k holding a data packet at node i chooses the next node j to move to, repeating the choice until it reaches the sink node, where τ is the pheromone, η is the heuristic, Ni is the set of neighbors of node i, and β is a parameter that determines the relative importance of pheromone versus distance (β > 0). The value η is calculated using equation (2); multiple factors can be used, each with its own weight.
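The transition rule of equation (1) can be evaluated as in the sketch below, where the arrays hold τ and η for each neighbor of node i; the class and method names are invented for the example and are not part of the project code.

```java
public class Transition {
    // Equation (1): p(i -> j) = tau_ij * eta_ij^beta / sum over l in N_i of tau_il * eta_il^beta
    static double[] probabilities(double[] tau, double[] eta, double beta) {
        double[] p = new double[tau.length];
        double sum = 0.0;
        for (int j = 0; j < tau.length; j++) {
            p[j] = tau[j] * Math.pow(eta[j], beta);  // unnormalized weight for neighbor j
            sum += p[j];
        }
        for (int j = 0; j < p.length; j++) p[j] /= sum;  // normalize over the neighborhood
        return p;
    }
}
```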
4.6.7 EVAPORATION ON ALL EDGES
After all the ants finish the forward pass, and before they start the backward pass, the pheromone values on all the edges in the network evaporate at rate ρ, so every value is consistently reduced. Equation (3) shows how the evaporated pheromone value is calculated.
τij ← (1 − ρ) · τij        (3)
In equation (4), ρ is the pheromone decay parameter, τij is the pheromone value on the edge between nodes i and j, and e0 is the encouraging or discouraging rate derived from the forward pass. A path resulting in less energy consumption and a smaller total hop-count is preferred. The best iteration is the one with the least energy consumption and hop-count among all previous iterations; it is used as the reference for calculating e0 in the current iteration. If the forward pass is a failed path exploration, or used more hops and energy than the best iteration, the path is discouraged: a very small amount of pheromone is deposited on the edge to differentiate it from links that have not been visited, and e0 is set to a predetermined “PunishRate,” a relatively low rate between 0 and 1.
If the forward pass found a path with the same hop-count and energy consumption as the best iteration, e0 is set to a relatively higher rate between 0 and 1, the “encourageRate.” If the forward pass found a path with the same hop-count but less energy consumption than the best iteration, e0 = 1.5 × encourageRate. If the forward pass found a path with fewer hops and less energy consumption than the best iteration, e0 = hop-count difference × encourageRate.
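The evaporation step and the e0 selection rules above can be sketched as follows. The ordering of the checks and the handling of cases the text does not cover (e.g. fewer hops but equal energy) are assumptions; punishRate and encourageRate are the tuning constants named in the text.

```java
public class PheromoneUpdate {
    // Equation (3): evaporation on every edge at decay rate rho.
    static double evaporate(double tau, double rho) {
        return (1.0 - rho) * tau;
    }

    // e0 selection per the rules in the text; punishRate and encourageRate
    // are assumed tuning constants in (0, 1).
    static double e0(boolean failed, int hops, int bestHops,
                     double energy, double bestEnergy,
                     double punishRate, double encourageRate) {
        if (failed || hops > bestHops || energy > bestEnergy)
            return punishRate;                         // discourage the path
        if (hops == bestHops && energy < bestEnergy)
            return 1.5 * encourageRate;                // same hops, less energy
        if (hops < bestHops)
            return (bestHops - hops) * encourageRate;  // fewer hops: scale by the difference
        return encourageRate;                          // matches the best iteration
    }
}
```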
In equation (5), ζ is a positive number, hi is the hop-count between node i and the sink, and hj is the hop-count between node j and the sink. If the value of (hi − hj) is greater than zero, node j is closer to the sink node than node i, so the algorithm rewards the path from node i to node j by depositing more pheromone. If the value equals zero, nodes i and j have the same hop-count to the sink, and the algorithm lays only a little pheromone on the path. If the value is less than zero, the algorithm lays no pheromone on this path. In equation (6), Rj is the total hop-count of these sources before visiting node j; therefore, Δωj is the total hop-count of some sources to the sink through node j. The smaller the total hop-count, the larger the amount of pheromone added on the path from node i to node j, as shown in equation (5).
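The hop-count-based deposit of equation (5) can be sketched as below. The small deposit used for the equal-hop-count case (a tenth of ζ here) is an assumed placeholder, since the text only says “little pheromone”; the names are illustrative.

```java
public class HopDeposit {
    // Equation (5) sketch: deposit proportional to how much closer node j is to the sink.
    static double deposit(double zeta, int hi, int hj) {
        int d = hi - hj;               // positive when j is closer to the sink
        if (d > 0) return zeta * d;    // reward moving toward the sink
        if (d == 0) return 0.1 * zeta; // same distance: lay only a little pheromone (assumed fraction)
        return 0.0;                    // moving away from the sink: no deposit
    }
}
```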
This means that more ants are encouraged to follow this path. When an ant moves to an aggregation node, that node updates the pheromone levels of all its neighbors using equation (4). If a node is not visited by any ants within a limited time, its pheromone evaporates according to equation (3).
CHAPTER 5
HONEY POTS DETECTION USING SMART HEALTH
CARE
In 1999 the idea was picked up again by the Honeynet Project, founded and led by Lance Spitzner. Over years of development, the Honeynet Project produced several papers on Honeypots and introduced techniques for building efficient Honeypots. The Honeynet Project is a non-profit research organization of security professionals dedicated to information security.
Honeypots fall into two categories:
Production Honeypots
Research Honeypots
5.2.1 PRODUCTION HONEYPOTS
Production honeypots are easy to use, capture only limited information, and are used primarily by corporations. An organization places production honeypots inside the production network alongside other production servers to improve its overall state of security. Normally, production honeypots are low-interaction honeypots, which are easier to deploy, but they give less information about attacks or attackers than research honeypots.
5.2.2 RESEARCH HONEYPOTS
Research honeypots are run to gather information about the motives and tactics of the black hat community targeting different networks. These honeypots do not add direct value to a specific organization; instead, they are used to research the threats that organizations face and to learn how to better protect against those threats. Research honeypots are complex to deploy and maintain, capture extensive information, and are used primarily by research, military, or government organizations.
Fig: 5.1. Single Honey pot detection in system
A common setup is to deploy a Honeypot within a production system. In the figure above, the Honeypot is colored orange. It is not registered in any naming servers or other production systems, e.g. a domain controller, so no one should know about the existence of the Honeypot. This is important, because only within a properly configured network can one assume that every packet sent to the Honeypot is suspected to be an attack.
If misconfigured packets arrive, the number of false alerts rises and the value of the Honeypot drops. Production Honeypots are primarily used for detection. Typically they work as an extension to Intrusion Detection Systems, performing an advanced detection function. They also prove whether existing security functions are adequate: if a Honeypot is probed or attacked, the attacker must have found a way to the Honeypot. This could be a known way, which is hard to close, or even an unknown hole. However, measures should be taken to avoid a real attack. With the knowledge of the attack on the Honeypot it is easier to determine and close security holes.
A Honeypot also helps justify the investment in a firewall. Without any evidence of attacks, someone in management could assume that there are no attacks on the network and suggest that investment in security be stopped, as there are no threats. With a Honeypot there is recorded evidence of attacks, and the system can provide statistics of the attacks that occur each month.
5.4 LOG RHYTHM’S HONEYPOT SECURITY ANALYTICS
SUITE
Log Rhythm’s Honeypot Security Analytics Suite allows customers to centrally
manage and continuously monitor honeypot event activity for adaptive threat defense. When
an attacker begins to interact with the honeypot, Log Rhythm’s Security Intelligence Platform
begins tracking the attacker’s actions, analyzing the honeypot data to create profiles of
behavioral patterns and attack methodologies based on the emerging threats. This automated
and integrated approach to honeypots eliminates the need for the manual review and
maintenance associated with traditional honeypot deployments.
The Honeypot Security Analytics Suite provides AI Engine rules that perform real-
time, advanced analytics on all activity captured in the honeypot, including successful logins
to the system, observed successful attacks, and attempted/successful malware activity on the
host. As a result, the Honeypot suite allows AI Engine to also detect when similar activity
captured from the honeypot is observed on the production network. For example, if an
observed attacker interaction on the honeypot is followed by a subsequent interaction with
legitimate hosts within the environment such as production web servers, Log Rhythm can
generate an alarm alerting IT and security personnel to the suspicious activity.
5.5 PREVENT COMPROMISED CREDENTIALS
5.5.1 CHALLENGE
The majority of attacks exploit valid user credentials to gain unrestricted access to the
corporate network. Organizations need an effective means of monitoring for insecure
accounts and passwords to prevent credentials from being compromised.
5.5.2 SOLUTION
Log Rhythm’s Honeypot Security Analytics Suite provides AI Engine rules that
monitor for successful and unsuccessful logon attempts to honeypot servers, capturing details
on the username and password. This allows analysts to see commonly attempted username
and password combinations on the honeypot hosts.
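The tallying of attempted credential pairs described above can be sketched as follows. This is an illustrative helper, not LogRhythm's actual API; the class name and the "username:password" string format are invented for the example.

```java
import java.util.*;

public class CredentialTally {
    // Count attempted "username:password" combinations observed on honeypot hosts,
    // so analysts can see which combinations are tried most often.
    static Map<String, Integer> tally(List<String> attempts) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String attempt : attempts)
            counts.merge(attempt, 1, Integer::sum);  // increment the count for this pair
        return counts;
    }
}
```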
Fig: 5.3 MCC User Authentications
Transferring data from one remote system to another under the control of a local system is called remote uploading. Remote uploading is used by some online file hosting services. It is also used when the local computer has a slow connection to the remote systems, but the remote systems have a fast connection between them. Without remote uploading functionality, the data would first have to be downloaded to the local host and then uploaded to the remote file hosting server, both times over slow connections.
5.6 ALGORITHM IMPLEMENTATION
INPUT: VM set V and accessible cloudlet set Cu for each VM u; system parameters.
Step 13: k = k + 1
Step 20: iteration = iteration + 1
Step 21: end while
Honeypots turn the tables for hackers and computer security experts. While in the classical field of computer security a computer should be as secure as possible, in the realm of Honeypots the security holes are opened on purpose. In other words, Honeypots welcome hackers and other threats.
The purpose of a Honeypot is to detect and learn from attacks and use that information to improve security. A network administrator obtains first-hand information about the current threats on his network, and undiscovered security holes can be closed using the information gained from a Honeypot. A Honeypot is a computer connected to a network, and it can be used to examine vulnerabilities of the operating system or network.
Depending on the setup, security holes can be studied in general or in particular. Moreover, a Honeypot can be used to observe the activities of an individual who gained access to it. Honeypots are a unique tool for learning about the tactics of hackers. So far, network monitoring techniques have used passive devices such as Intrusion Detection Systems (IDS). An IDS analyzes network traffic for malicious connections based on patterns, which can be particular words in packet payloads or specific sequences of packets. However, there is the possibility of false positive alerts due to a pattern mismatch or, even worse, false negative alerts on actual attacks. On a Honeypot, every packet is suspicious.
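The detection principle, that any packet addressed to the unadvertised Honeypot is suspicious without any pattern matching, can be sketched as below; the class, method names, and addresses are invented for the example.

```java
import java.util.*;

public class HoneypotFilter {
    // In a properly configured network the Honeypot is never referenced by
    // production systems, so any packet addressed to it is suspicious.
    static boolean suspicious(String dstIp, String honeypotIp) {
        return dstIp.equals(honeypotIp);
    }

    // Collect the source addresses of all packets sent to the Honeypot.
    static List<String> alerts(List<String[]> packets, String honeypotIp) {
        List<String> sources = new ArrayList<>();
        for (String[] p : packets)             // p = {srcIp, dstIp}
            if (suspicious(p[1], honeypotIp)) sources.add(p[0]);
        return sources;
    }
}
```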
Fig: 5.4 Honey pot setup
As the name suggests, these honeypots are deployed and used by Smart health care organizations or curious individuals. They are used to gain knowledge about the methods used by the black hat community. They help Smart health care security teams learn more about attack methods and help in designing better security tools. They can also help detect new attack methods or bugs in existing protocols or software, they can be used to strengthen or verify existing intrusion detection systems, and they can provide valuable data for forensic or statistical analysis.
For example, a large number of HTTP scans detected by a honeypot is an indicator that a new HTTP exploit might be in the wild. Commercial servers normally have to deal with large amounts of traffic, and it is not always possible for intrusion detection systems to detect all suspicious activity. Honeypots can function as early warning systems and give security administrators hints and directions on what to look out for.
The real value of a honeypot lies in it being probed, scanned and even compromised,
so it should be made accessible to computers on the Internet or at least as accessible as other
computers on the network. As far as possible the system should behave as a normal system on
the Internet and should not show any signs of it being monitored or of it being a honeypot.
Even though we want the honeypot to be compromised it shouldn’t pose a threat to other
systems on the Internet. To achieve this, network traffic leaving the honeypot should be
regulated and monitored.
5.8.1 HONEYPOTS THAT FAKE OR SIMULATE
There are honeypot tools that simulate or fake services or even fake vulnerabilities.
They deceive any attacker to think they are accessing one particular system or service. A
properly designed tool can be helpful in gathering more information about a variety of servers
and systems. Such systems are easier to deploy and can be used as alerting systems and are
less likely to be used for further illegal activities.
A contrasting viewpoint states that honeypots should not be anything different from actual systems, since the main idea is to secure the systems that are in use. These honeypots do not fake or simulate anything and are implemented using actual systems and servers that are in use in the real world. Such honeypots reduce the chance of the hacker knowing that he
is on a honeypot. These honeypots have a high risk factor and cannot be deployed
everywhere. They need a controlled environment and administrative expertise. A
compromised honeypot is a potential risk to other computers on the network or for that matter
the Internet.
5.10 ROLE OF HONEYPOTS IN NETWORK SECURITY
Honeypots and related technologies have generated a great deal of interest in the past two years, and honeypots can be considered one of the latest technologies in network security today. The Honeynet Project is actively involved with the deployment and study of honeypots. Honeypots are used extensively in research, and it is only a matter of time before they are used in production environments as well.
To assess the value of Honeypots, we break security down into the three categories defined by Bruce Schneier in Secrets and Lies: prevention, detection, and response.
5.10.2 PREVENTION
Prevention means keeping the bad guys out. Normally this is accomplished by firewalls and well-patched systems. The value Honeypots can add to this category is small: if a random attack is performed, Honeypots can detect that attack, but not prevent it, as the targets are not predictable.
One case where Honeypots help with prevention is when an attacker is directly hacking into a server. In this case a Honeypot would cause the hacker to waste time on a worthless target and help prevent an attack on a production system. But this means that the attacker has attacked the Honeypot before attacking a real server, and not the other way around.
5.10.3 DETECTION
Detection is typically performed by Intrusion Detection Systems. The problems with these systems are false alarms and undetected attacks. A system might alert on suspicious or malicious activity even if the data was valid production traffic. Due to the high network traffic on most networks it is extremely difficult to process every packet, so the chance of false alarms increases with the amount of data processed. High traffic also leads to undetected attacks: when the system is not able to process all data, it has to drop certain packets, which leaves them unscanned. An attacker could benefit from such high loads on network traffic.
5.10.4 RESPONSE
This chapter defines concepts, architecture and terms used in the realm of Honeypots.
It describes the possible types of Honeypots and the intended usage and purpose of each type.
Further auxiliary terms are explained to gain a deeper understanding about the purpose of
Honeypot concepts.
In the computer security community, a Black hat is a skilled hacker who uses his or her ability to pursue his or her interests illegally. Black hats are often economically motivated, or may represent a political cause; sometimes, however, it is pure curiosity. The term “Black hat” is derived from old Western movies, where outlaws wore black hats and outfits and heroes typically wore white outfits with white hats.
White hats are ethically opposed to the abuse of computer systems. A White hat generally concentrates on securing IT systems, whereas a Black hat would like to break into them. Both Black hats and White hats are hackers, and both are skilled computer experts, in contrast to the so-called "script kiddies". Script kiddies could arguably be referred to as Black hats, but this would be a compliment to such individuals: script kiddies extract discovered and published exploits from the work of real hackers and merge them into a script.
thus the chance of failure is higher, which makes the use of medium-interaction Honeypots more risky.
Most attacks on the internet are performed by automated tools, often used by unskilled users, the so-called script kiddies, which search for vulnerabilities or already-installed backdoors (see introduction). This is like walking down a street and trying to open every car by pulling the handle: by the end of the day at least one car will be discovered unlocked. Most of these attacks are preceded by scans on the entire IP address range, which means that any device on the net is a possible target.
touched and often with unknown vulnerabilities. A good example of this is the theft of 40 million credit card details at MasterCard International. Card Systems Solutions, a third-party processor of payment data, encountered a security breach which potentially exposed more than 40 million cards of all brands to fraud: "It looks like a hacker gained access to Card Systems' database and installed a script that acts like a virus, searching out certain types of card transaction data." Direct attacks, by contrast, are performed by skilled hackers and require experienced knowledge. In contrast to the tools used for random attacks, the tools used by experienced Black hats are not common; often the attacker uses a tool which is not published in the Black hat community. This increases the threat of those attacks.
5.13.1 PREVENTION
Prevention means keeping the bad guys out. Normally this is accomplished by firewalls and well-patched systems. The value Honeypots can add to this category is small: if a random attack is performed, Honeypots can detect that attack, but not prevent it, as the targets are not predictable. One case where Honeypots help with prevention is when an attacker is directly hacking into a server. In this case a Honeypot would cause the hacker to waste time on a worthless target and help prevent an attack on a production system. But this means that the attacker has attacked the Honeypot before attacking a real server, and not the other way around. Also, if an institution publishes the information that it uses a Honeypot, this might deter attackers from hacking, but this is more in the field of psychology and too abstract to add proper value to security.
separate the demands on Honeypots. The use of a Honeypot poses risk and needs exact planning ahead to avoid damage. Therefore it is necessary to consider what environment will be the basis for the installation. Depending on the setup, the results are quite different and need to be analyzed separately. For example, the number of attacks occurring in a protected environment should be lower than the number of attacks coming from the internet. Therefore a comparison of results afterwards needs to take the environment into account.
In every case there is a risk in using a Honeypot; risk is added on purpose by the nature of a Honeypot. A compromised Honeypot, in hacker terms an “owned box,” needs intensive monitoring but also strong controlling mechanisms. Scenario VI discusses requirements for a Honeypot-out-of-the-box solution and elaborates the different functions which have to be provided.
CHAPTER 6
EXPERIMENTAL RESULT
6.1 HADOOP SERVER
The flexible nature of a Hadoop system means companies can add to or modify their data system as their needs change, using cheap and readily available parts from any IT vendor.
Just about all of the big online names use it, and as anyone is free to alter it for their own purposes, modifications made to the software by expert engineers at, for example, Amazon and Google are fed back to the development community, where they are often used to improve the "official" product. This form of collaborative development between volunteer and commercial users is a key feature of open source software.
In its "raw" state, using the basic modules supplied by Apache (http://hadoop.apache.org/), it can be very complex, even for IT professionals, which is why various commercial versions such as Cloudera have been developed to simplify the task of installing and running a Hadoop system, as well as offering training and support services.
6.3 ECLIPSE
1. Open Eclipse
2. Click File -> New Project -> Java Project
3. Copy all the Jar files from the location “D:\hadoop-2.6.0\”:
a. \share\hadoop\common\lib
b. \share\hadoop\mapreduce
c. \share\hadoop\mapreduce\lib
d. \share\hadoop\yarn
e. \share\hadoop\yarn\lib
Fig: 6.4 To Detect The Hadoop location
4 Configuration for Hadoop 1.x: fetch Hadoop using a version control system (subversion or git) and check out branch-1 or the particular release branch. Otherwise, download a source tarball from the CDH3 releases or Hadoop releases.
5 Generate Eclipse project information using Ant via the command line:
6 For Hadoop (1.x or branch-1), “ant eclipse”
7 For Smart City releases, “ant eclipse-files”
8 Pull sources into Eclipse:
9 Go to File -> Import.
10 Select General -> Existing Projects into Workspace.
11 For the root directory, navigate to the top directory of the above downloaded sources.
Fig: 6.5 Hadoop Initialized
2. Download the MR1 source tarball from CDH4 Downloads and untar it into a folder different from the one from Step 1.
3. Within the MR1 folder, generate Eclipse project information using Ant via the command line (ant eclipse-files).
4. Configure .classpath using this perl script to make sure all classpath entries point to the local Maven repository:
3. For the root directory, navigate to the top directory of the above downloaded sources.
Fig:6.6 Hadoop-0.19.1tar.gz
1. Generate Eclipse project information using Maven: mvn clean && mvn install -DskipTests && mvn eclipse:eclipse. Note: mvn eclipse:eclipse generates a static .classpath file that Eclipse uses; this file isn’t automatically updated as the project/dependencies change.
3. For the root directory, navigate to the top directory of the above downloaded source.
Execute tar -xzf hadoop-0.19.1.tar.gz in the cygwin prompt; this will start the process of unpacking the Hadoop distribution. Once this is done, it will display a newly created directory called hadoop-0.19.1.
81
Verify whether unpacking is success by executing cd Hadoop-0.19.1 and then -1,
which provides the output as mentioned below which tells that everything is unpacked
correctly.
In the next step, click on Configure Hadoop Installation link, displayed on the right
side of the project configuration window. Project preferences window display is shown in the
image below. Fill in the location of Hadoop directory in Hadoop Installation Directory in
preferences and click OK, and then close the project window after clicking on finish
6.4 SCENARIO I – UNPROTECTED ENVIRONMENT
In this scenario the Honeypot is connected to the internet through a firewall. The firewall limits the access to the Honeypot: not every port is accessible from the internet, and not every IP address on the internet is able to initiate connections to the Honeypot. This scenario does not state the degree of connectivity; it only states that there are some limitations. Those limitations can be either strict, allowing almost no connections, or loose, denying only a few connections.
The firewall can be a standard firewall or a firewall with NAT capabilities (see chapter 3.3). However, a public IP address is always assigned to the firewall.
This scenario focuses on the IP address of the Honeypot. In this scenario the Honeypot is assigned a public address. The Internet Assigned Numbers Authority (IANA) maintains a database [IANA 05] which lists the address ranges of publicly available addresses; all previous RFCs have been replaced by this database [RFC 3232]. A public IP can be addressed from any other public IP on the internet, which means that IP datagrams targeting a public IP are routed through the internet to the target. A public IP must occur only once; it may not be assigned twice.
Applications on the Honeypot can directly communicate with the internet as they have knowledge of the public internet address. This is in contrast to scenario IV, where an application on the Honeypot is not aware of the public IP. It is further possible to perform a query on the responsible Regional Internet Registry to look up the name of the address registrar; this is called a “whois search.”
6.6.1 REGIONAL INTERNET REGISTRIES
AfriNIC (African Network Information Centre) - Africa Region
http://www.afrinic.net/
APNIC (Asia Pacific Network Information Centre) - Asia/Pacific Region
http://www.apnic.net/
ARIN (American Registry for Internet Numbers) - North America Region
http://www.arin.net/
LACNIC (Regional Latin-American and Caribbean IP Address Registry) – Latin
America and some Caribbean Islands http://lacnic.net/en/index.html
RIPE NCC (Réseaux IP Européens) - Europe, the Middle East, and Central Asia
http://www.ripe.net/
Gateway which rewrites the IP in the payload. Therefore the applications on the Honeypot are not aware of the public IP and are limited by the functionality of the intermediate network device.
Security mechanisms need to make sure that this traffic does not affect the production systems. Moreover, the amount of traffic needs to be controlled: a hacker could use the Honeypot to launch a DoS or DDoS attack. Another possibility would be to use the Honeypot as a file server for stolen software, in hacker terms called warez. Both cases would increase bandwidth usage and slow production traffic.
As hacking techniques evolve, an experienced Black hat could launch a new kind of
attack which is not recognized automatically. It could be possible to bypass the controlling
functions of the Honeypot and misuse it. Such activity could escalate the operation of a
Honeypot and turn it into a severe threat. A Honeypot operator needs to be aware of this risk
and therefore control the Honeypot on a regular basis.
CHAPTER 7
SOURCE CODE
MapAndReduceJob.java
package cloudmapreduce;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
String line = value.toString();
System.out.println("Line == : " + line);
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
output.collect(word, one);
}
}
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(MapAndReduceJob.class);
conf.setJobName("wordcountMapAndReduce");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path("/hadoop/mapred/system/workLoadFile.1.txt", "/hadoop/mapred/system/workLoadFile.2.txt"));
FileOutputFormat.setOutputPath(conf, new Path("/hadoop/mapred/system/wcMROutput.txt"));
FileSystem fs = FileSystem.get(conf);
if (fs.exists(new Path("/hadoop/mapred/system/wcMROutput.txt")))
fs.delete(new Path("/hadoop/mapred/system/wcMROutput.txt"));
JobClient.runJob(conf);
}
MabJob.java
package cloudmapreduce;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
String line = value.toString();
System.out.println("Line == : " + line);
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
output.collect(word, one);
}
}
}
conf.setJobName("wordcountMap");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path("/hadoop/mapred/system/workLoadFile.2.txt"));
FileOutputFormat.setOutputPath(conf, new Path("/hadoop/mapred/system/wcMapOutput2.txt"));
FileSystem fs = FileSystem.get(conf);
if (fs.exists(new Path("/hadoop/mapred/system/wcMapOutput2.txt")))
fs.delete(new Path("/hadoop/mapred/system/wcMapOutput2.txt"));
JobClient.runJob(conf);
Preprocessing2.java
package cloudmapreduce;
import java.io.*;
import java.util.ArrayList;
import java.util.StringTokenizer;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.swing.JFileChooser;
import javax.swing.JOptionPane;
public Preprocessing2() {
initComponents();
}
@SuppressWarnings("unchecked")
// <editor-fold defaultstate="collapsed" desc="Generated Code">//GEN-BEGIN:initComponents
setDefaultCloseOperation(javax.swing.WindowConstants.EXIT_ON_CLOSE);
jPanel1.setLayout(null);
jSeparator1.setOpaque(true);
jPanel1.add(jSeparator1);
jSeparator2.setOpaque(true);
jPanel1.add(jSeparator2);
jLabel1.setHorizontalAlignment(javax.swing.SwingConstants.CENTER);
jLabel1.setText("MapReduce Across Datacenters");
jPanel1.add(jLabel1);
jSeparator3.setOpaque(true);
jPanel1.add(jSeparator3);
jSeparator4.setOpaque(true);
jPanel1.add(jSeparator4);
jButton1.setText("Display Data");
jButton1.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
jButton1ActionPerformed(evt);
}});
jPanel1.add(jButton1);
jTextArea1.setColumns(20);
jTextArea1.setRows(5);
jScrollPane1.setViewportView(jTextArea1);
jPanel1.add(jScrollPane1);
jButton2.setText("Proceed");
jButton2.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
jButton2ActionPerformed(evt);
}});
jPanel1.add(jButton2);
getContentPane().setLayout(layout);
layout.setHorizontalGroup(
layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
layout.setVerticalGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
pack();
}// </editor-fold>//GEN-END:initComponents
BufferedReader br = null;
try {
int countToken = 0;
StringTokenizer st;
int j = 0, line = 0;
if (line == 0) {
dataTable.add(new ArrayList());
dataTable1.add(new ArrayList());
token = st.nextToken();
if (line > 0) {
if (!((ArrayList) dataTable.get(j)).contains(token)) {
((ArrayList) dataTable.get(j)).add(token);
((ArrayList) dataTable1.get(j)).add(token);
} else {
attributes.add(token);
jTextArea1.append(token + "\t");
countToken++;
j++;
countToken = 0;
j = 0;
jTextArea1.append("\n\n");
line++;
}
int size;
if (size == 1) {
redundantIndex.add(i);
redundantData.add(attributes.get(i));
} finally {
try {
br.close();
}}
}//GEN-LAST:event_jButton1ActionPerformed
new Preprocessing3().setVisible(true);
}//GEN-LAST:event_jButton2ActionPerformed
//<editor-fold defaultstate="collapsed" desc=" Look and feel setting code (optional) ">
try {
for (javax.swing.UIManager.LookAndFeelInfo info :
javax.swing.UIManager.getInstalledLookAndFeels()) {
if ("Nimbus".equals(info.getName())) {
javax.swing.UIManager.setLookAndFeel(info.getClassName());
break;
}}
java.util.logging.Logger.getLogger(Preprocessing2.class.getName()).log(java.util.logging.Level.SEVERE, null, ex);
java.util.logging.Logger.getLogger(Preprocessing2.class.getName()).log(java.util.logging.Level.SEVERE, null, ex);
java.util.logging.Logger.getLogger(Preprocessing2.class.getName()).log(java.util.logging.Level.SEVERE, null, ex);
java.util.logging.Logger.getLogger(Preprocessing2.class.getName()).log(java.util.logging.Level.SEVERE, null, ex);
java.awt.EventQueue.invokeLater(new Runnable() {
new Preprocessing2().setVisible(true);
}});}
private javax.swing.JButton jButton1;
private javax.swing.JButton jButton2;
private javax.swing.JLabel jLabel1;
private javax.swing.JPanel jPanel1;
private javax.swing.JScrollPane jScrollPane1;
private javax.swing.JSeparator jSeparator1;
private javax.swing.JSeparator jSeparator2;
private javax.swing.JSeparator jSeparator3;
private javax.swing.JSeparator jSeparator4;
private javax.swing.JTextArea jTextArea1;
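The redundancy check in the listing above boils down to dropping any column whose distinct-value list has size one. A minimal, GUI-free sketch of that idea follows; the class and method names here are illustrative, not taken from the project.

```java
import java.util.*;

// Sketch of redundant-column detection: a column in which every record
// holds the same value carries no information and can be dropped.
public class RedundantColumnFilter {
    // rows[0] holds the attribute names; later rows hold the data.
    // Returns the indices of columns with exactly one distinct value.
    public static List<Integer> redundantColumns(String[][] rows) {
        List<Set<String>> distinct = new ArrayList<>();
        for (int j = 0; j < rows[0].length; j++) {
            distinct.add(new HashSet<>());
        }
        for (int line = 1; line < rows.length; line++) {
            for (int j = 0; j < rows[line].length; j++) {
                distinct.get(j).add(rows[line][j]);
            }
        }
        List<Integer> redundant = new ArrayList<>();
        for (int j = 0; j < distinct.size(); j++) {
            if (distinct.get(j).size() == 1) {
                redundant.add(j);
            }
        }
        return redundant;
    }

    public static void main(String[] args) {
        String[][] data = {
            {"id", "city", "country"},
            {"1", "Oslo", "NO"},
            {"2", "Bergen", "NO"},
        };
        // column 2 ("country") is constant across all records
        System.out.println(redundantColumns(data));
    }
}
```

The same effect is achieved in the listing by keeping, per column, a list of values not seen before and flagging columns whose list ends up with size one.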
CHAPTER 8
SCREEN SHOTS
8.2 BROWSING MAPREDUCE ACROSS DATACENTER
8.4 MAP REDUCE PROCESSING ACROSS
8.5 REDUNDANT DATA REMOVAL
8.6 BEFORE AND AFTER PREPROCESSING
8.7 BEFORE AND AFTER PREPROCESSING WITH VALUES
8.8 SLOT ALLOCATIONS PROCESS
8.9 SLOT ALLOCATIONS PROCESS REPORT
8.10 MAPPING PROCESS
8.11 MAPPING PROCESS WITH VALUES
8.12 EXECUTION TIME FOR VM RESOURCE IN MOBILE CLOUD
8.13 DATA CENTERS OPTIMIZATION TIME
8.14 OPTIMIZATION EVALUATION
8.15 OPTIMIZED EXECUTION TIME
8.16 HONEYPOT-OUT-OF-THE-BOX IN HADOOP SMART CITY VM
8.17 ACCESS CONTROL
CHAPTER 9
CONCLUSION
This dissertation centers on performance modeling and resource management for
MapReduce applications. It introduces a performance modeling framework for estimating the
completion time of a complex MapReduce application, defined as a DAG of MapReduce jobs,
when it is executed on a given platform with different resource allocations and different input
data sets. Building on this framework, we further introduce resource allocation strategies and
a customized deadline-driven scheduler that estimate and control the amount of resources
allocated to each application so that it meets its (soft) deadline.
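The completion-time estimation summarized above can be illustrated with the classic lower/upper makespan bounds for a batch of independent tasks running on a fixed number of slots. The sketch below is illustrative only; the numbers and class name are assumptions, not the dissertation's measured results.

```java
// Classic makespan bounds for one MapReduce stage: n tasks of average
// duration avg (maximum max) processed greedily on s slots.
public class CompletionTimeEstimate {
    // Lower bound: with perfect packing and no stragglers, n tasks on
    // s slots cannot finish earlier than (n / s) * avg.
    public static double lowerBound(int tasks, int slots, double avgDuration) {
        return (double) tasks / slots * avgDuration;
    }

    // Upper bound: (n - 1) / s * avg + max covers the worst case where
    // the longest task is scheduled last (greedy list-scheduling bound).
    public static double upperBound(int tasks, int slots, double avgDuration,
                                    double maxDuration) {
        return (double) (tasks - 1) / slots * avgDuration + maxDuration;
    }

    public static void main(String[] args) {
        // Hypothetical job: 64 map tasks on 16 map slots (avg 20 s, max 35 s),
        // then 8 reduce tasks on 8 reduce slots (avg 40 s, max 50 s).
        double low = lowerBound(64, 16, 20) + lowerBound(8, 8, 40);
        double up = upperBound(64, 16, 20, 35) + upperBound(8, 8, 40, 50);
        System.out.println("completion time in [" + low + ", " + up + "] s");
    }
}
```

Summing such per-stage bounds along the DAG's critical path gives a completion-time interval for the whole application, which a deadline-driven scheduler can invert to find how many slots each job needs to meet its deadline.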
CHAPTER 10
BIBLIOGRAPHY
[13] L. Gkatzikis and I. Koutsopoulos, ‘‘Migrate or not? Exploiting dynamic task migration
in mobile cloud computing systems,’’ IEEE Wireless Commun., vol. 20, no. 3, pp. 24–32,
Jun. 2013.
[14] L. Gkatzikis and I. Koutsopoulos, ‘‘Mobiles on cloud nine: Efficient task migration
policies for cloud computing systems,’’ in Proc. IEEE 3rd Int. Conf. Cloud Netw. (CloudNet),
Oct. 2014, pp. 204–210.
[15] M. Guntsch, M. Middendorf, and H. Schmeck, ‘‘An ant colony optimization approach to
dynamic TSP,’’ in Proc. 3rd Annu. Conf. Genetic Evol.Comput., 2001, pp. 860–867.
[16] M. M. Hassan, ‘‘Cost-effective resource provisioning for multimedia cloud-based e-
health systems,’’ Multimedia Tools Appl., vol. 74, no. 14, pp. 5225–5241, 2015.
[17] Mumak: Map-Reduce simulator. [Online]. Available: https://issues.apache.org/jira/browse/MAPREDUCE-728
[18] Y. Wang and W. Shi, "On optimal budget-driven scheduling algorithms for MapReduce
jobs in the heterogeneous cloud," Tech. Rep. TR-13-02, Carleton University, 2013.
[19] Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst, "HaLoop: Efficient iterative data
processing on large clusters," Proc. VLDB Endow., vol. 3, no. 1, pp. 285–296, 2010.
[20] S. Chen, "Cheetah: A high performance, custom data warehouse on top of MapReduce,"
Proc. VLDB Endow., vol. 3, no. 2, pp. 1459–1468, 2010.
[21] C. Engle, A. Lupher, R. Xin, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica,
"Shark: Fast data analysis using coarse-grained distributed memory," in Proc. ACM SIGMOD
Int. Conf. Manage. Data (SIGMOD '12), 2012, pp. 689–692.
[22] M. S. Hossain, G. Muhammad, "Healthcare big data voice pathology assessment
framework", IEEE Access, vol. 4, pp. 7806-7815, 2016.
[23] M. Islam, A. Razzaque, J. Islam, "A genetic algorithm for virtual machine migration in
heterogeneous mobile cloud computing", Proc. Int. Conf. Netw. Syst. Security (NSysS), pp.
1-6, Jan. 2016.
[24] A. R. Khan, M. Othman, S. A. Madani, S. U. Khan, "A survey of mobile cloud
computing application models", IEEE Commun. Surveys Tuts., vol. 16, no. 1, pp. 393-413,
Feb. 2014.
[25] S. Kosta, A. Aucinas, P. Hui, R. Mortier, X. Zhang, "ThinkAir: Dynamic resource
allocation and parallel execution in the cloud for mobile code offloading", Proc. IEEE
INFOCOM, pp. 945-953, Mar. 2012.
[26] P. Kulkarni, T. Farnham, "Smart city wireless connectivity considerations and cost
analysis: Lessons learnt from smart water case studies", IEEE Access, vol. 4, pp. 660-672,
2016.
[27] R. Kumari et al., "Application offloading using data aggregation in mobile cloud
computing environment" in Leadership Innovation Entrepreneurship as Driving Forces
Global Economy, Switzerland:Springer, pp. 17-29, 2017.
[28] P. G. J. Leelipushpam, J. Sharmila, "Live VM migration techniques in cloud
environment : A survey", Proc. IEEE Conf. Inf. Commun. Technol. (ICT), pp. 408-413, Apr.
2013.
[29] J. Li, K. Bu, X. Liu, B. Xiao, "Enda: Embracing network inconsistency for dynamic
application offloading in mobile cloud computing", Proc. 2nd ACM SIGCOMM Workshop
Mobile Cloud Comput., pp. 39-44, 2013.
[30] J. Liu, Y. Li, D. Jin, L. Su, L. Zeng, "Traffic aware cross-site virtual machine migration
in future mobile cloud computing", Mobile Netw. Appl., vol. 20, no. 1, pp. 62-71, Feb. 2015.
[31] J. Montgomery, M. Randall, T. Hendtlass, "Structural advantages for ant colony
optimisation inherent in permutation scheduling problems", Proc. 18th Int. Conf. Innov. Appl.
Artif. Intell., pp. 218-228, 2005.
[32] Z. L. Phyo, T. Thein, "Correlation based VMs placement resource provision", Int. J.
Comput. Sci. Inf. Technol., vol. 5, no. 1, p. 95, 2013.
[33] M. Rahimi, J. Ren, C. Liu, A. Vasilakos, N. Venkatasubramanian, "Mobile cloud
computing: A survey state of art and future directions", Mobile Netw. Appl., vol. 19, no. 2,
pp. 133-143, 2014.
[34] C. C. Sasan Adibi, N. Wickramasinghe, "CCmH: The cloud computing paradigm for
mobile health (mHealth)", Int. J. Soft Comput. Softw. Eng., vol. 3, no. 3, pp. 403-410, 2013.
[35] M. Satyanarayanan, P. Bahl, R. Caceres, N. Davies, "The case for VM-based cloudlets in
mobile computing", IEEE Pervas. Comput., vol. 8, no. 4, pp. 14-23, Oct. 2009.
[36] M. Sneps-Sneppe, D. Namiot, "On mobile cloud for smart city applications", 2016.
[Online]. Available: https://arxiv.org/abs/1605.02886
[37] T. Taleb, A. Ksentini, "An analytical model for follow me cloud", Proc. IEEE Global
Commun. Conf. (GLOBECOM), pp. 1291-1296, Dec. 2013.
[38] H. N. Van, F. Tran, J.-M. Menaud, "Sla-aware virtual resource management for cloud
infrastructures", Proc. 9th IEEE Int. Conf. Comput. Inf. Technol. (CIT), vol. 1, pp. 357-362,
Oct. 2009.
[39] U. Varshney, Pervasive Computing and Healthcare, Boston, MA, USA:Springer, pp. 39-
62, 2009.
[40] L. Wang, F. Zhang, A. V. Vasilakos, C. Hou, Z. Liu, "Joint virtual machine assignment
and traffic engineering for green data center networks", SIGMETRICS Perform. Eval. Rev.,
vol. 41, no. 3, pp. 107-112, Jan. 2014.
[41] S. Wang, R. Urgaonkar, T. He, M. Zafer, K. Chan, K. Leung, "Mobility-induced service
migration in mobile micro-clouds", Proc. IEEE Military Commun. Conf. (MILCOM), pp.
835-840, Oct. 2014.
[42] I. Yaqoob, I. A. T. Hashem, Y. Mehmood, A. Gani, S. Mokhtar, S. Guizani, "Enabling
communication technologies for smart cities", IEEE Commun. Mag., vol. 55, no. 1, pp. 112-
120, Jan. 2017.
[43] Q. Zhang, L. Cheng, R. Boutaba, "Cloud computing: State-of-the-art and research
challenges", J. Internet Services Appl., vol. 1, no. 1, pp. 7-18, 2010.
[44] L. Spitzner, Honeypots: Tracking Hackers. Addison-Wesley, 2002, pp. 68–70,
ISBN 0-321-10895-7.