
SPARK

Prepared By Dulari Bhatt


Syllabus

Introduction to Data Analysis with Spark,


Downloading Spark and Getting Started,
Programming with RDDs,
Machine Learning with MLlib.
4.1 Introduction to Data Analysis with Spark
Today most enterprises produce big data, but roughly 65% of that data is of little use; the genuinely relevant portion is far smaller. Analyzing data according to the industry's requirements is therefore the most difficult task.
Most industries use Hadoop for analyzing their data sets, because the Hadoop framework is based on a simple programming model known as MapReduce, which enables a computing solution that is scalable, flexible, fault-tolerant, and cost-effective. Here, the main concern is maintaining speed when processing big datasets, both in terms of the waiting time between queries and the waiting time to run a program.
Spark is designed to be highly accessible, offering simple APIs in Python, Java, Scala, and SQL, and rich built-in libraries. It also integrates closely with other big data tools. In particular, Spark can run in Hadoop clusters and access any Hadoop data source, including Cassandra.
4.1.1 History of Apache Spark
The story of Apache Spark started back in 2009 with Mesos, a project in UC Berkeley's AMPLab. The idea was to build a cluster management framework that could support different kinds of cluster computing systems. Once Mesos was built, the team decided to build a computing framework on top of it, and that is how Spark was born.
Spark was open-sourced in 2010 under a BSD license, donated to the Apache Software Foundation in 2013, and became a top-level Apache project in February 2014.
What is Spark?
Apache Spark is a cluster computing platform designed to
be fast and general-purpose.
To increase the speed of big data processing, Spark extends the popular MapReduce model to efficiently support more types of computations, including interactive queries and stream processing. When processing large datasets, speed is very important: it means the difference between exploring data interactively and waiting minutes or hours. One of the main features Spark offers for speed is the ability to run computations in memory, but the system is also more efficient than MapReduce for complex applications running on disk.
Spark has several advantages compared to other big data
and MapReduce technologies like Hadoop and Storm.
The resilient distributed dataset (RDD) is an application programming interface centered on a data structure provided by Apache Spark. An RDD is a read-only multiset of data items distributed over a cluster of machines and maintained in a fault-tolerant way. It was developed to overcome the limitations of the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store the reduction results back on disk. Spark's RDDs instead function as a working set for distributed programs, offering a (deliberately) restricted form of distributed shared memory.
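As a minimal sketch of this idea (assuming it is run in spark-shell, where a SparkContext named sc already exists), an RDD can be built from an in-memory collection and transformed through several steps without any intermediate writes to disk:

```scala
// Minimal RDD sketch, assumed to run in spark-shell where `sc` (SparkContext) is predefined.
// Transformations (map, filter) are lazy; only the final action (reduce) triggers computation,
// so no intermediate results are written to disk, unlike a chain of MapReduce jobs.
val numbers = sc.parallelize(1 to 1000)       // distribute a local collection as an RDD
val squares = numbers.map(n => n * n)         // lazy transformation
val evenSquares = squares.filter(_ % 2 == 0)  // another lazy transformation
val total = evenSquares.reduce(_ + _)         // action: runs the whole chain on the cluster
println(total)
```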
List out the features of Apache Spark
Apache Spark has the following main features.
Speed - Spark provides faster processing: it can run an application up to 100 times faster in memory on a Hadoop cluster, and up to 10 times faster when running on disk. This is possible by reducing the number of read and write operations to disk, since intermediate processing data is stored in memory (a short caching sketch follows this list).
Supports multiple languages - Spark provides built-in APIs in multiple languages such as Java, Scala, and Python, so you can write applications in different languages. Spark also comes with 80 high-level operators for interactive querying.
Advanced Analytics - Spark supports not only the map and reduce functions but also SQL queries, streaming data, machine learning (ML), and graph algorithms.
Runs Everywhere - Spark runs in standalone mode or on top of Apache Hadoop (YARN) and Mesos. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
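To illustrate the in-memory point made under Speed, here is a hedged sketch of caching an intermediate RDD so it is computed once and then reused from memory (the input path is hypothetical):

```scala
// Sketch of in-memory reuse, assumed to run in spark-shell (`sc` available); the path is a placeholder.
val logs = sc.textFile("hdfs:///data/app.log")
val errors = logs.filter(line => line.contains("ERROR")).cache()  // keep this RDD in memory

// Both actions below reuse the cached `errors` RDD instead of re-reading the file from disk.
val errorCount = errors.count()
val timeoutCount = errors.filter(_.contains("timeout")).count()
```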
How Spark can be built with Hadoop
Components?
There are three ways to deploy Spark with Hadoop components, as described below (a small configuration sketch follows the three modes).
Standalone - Spark standalone deployment means Spark occupies the place on top of HDFS (Hadoop Distributed File System), and space is allocated for HDFS explicitly. Here, Spark and MapReduce run side by side to cover all Spark jobs on the cluster.
Hadoop YARN - Hadoop YARN deployment simply means that Spark runs on YARN without any pre-installation or root access required. It helps integrate Spark into the Hadoop ecosystem or Hadoop stack and allows other components to run on top of the stack.
Spark in MapReduce (SIMR) - In addition to standalone deployment, Spark in MapReduce can be used to launch Spark jobs. With SIMR, a user can start Spark and use its shell without any kind of administrative access.
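As an illustrative sketch (the cluster URLs below are placeholders, and SIMR is omitted), the cluster manager a Spark application runs on is typically selected through the master setting when the session is created or the job is submitted:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: the master URL chooses which cluster manager runs the application.
// The YARN and standalone URLs below are placeholders for a real cluster.
val spark = SparkSession.builder()
  .appName("DeploymentSketch")
  .master("local[*]")               // local mode, handy for development
  // .master("yarn")                // Hadoop YARN deployment
  // .master("spark://host:7077")   // Spark standalone cluster
  .getOrCreate()
```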
Briefly explain the components of
Spark
Apache Spark Core
Apache Spark consists of Spark Core and a set of libraries. Spark Core is the distributed execution engine, and the Java, Scala, and Python APIs offer a platform for distributed ETL (extract, transform, and load) application development. Additional libraries, built on top of the core, allow diverse workloads such as streaming, SQL, and machine learning. Spark Core provides in-memory computing and can reference datasets in external storage systems.
Spark Core is the underlying general execution engine for the Spark platform upon which all other functionality is built.
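A hedged sketch of the kind of distributed ETL job the core API supports (the HDFS paths are hypothetical):

```scala
// Extract-transform-load sketch on top of Spark Core, assumed to run in spark-shell (`sc` available).
val raw = sc.textFile("hdfs:///input/events.csv")           // extract: read from external storage
val cleaned = raw
  .filter(line => line.nonEmpty && !line.startsWith("#"))   // transform: drop blanks and comments
  .map(_.toLowerCase)
cleaned.saveAsTextFile("hdfs:///output/events_cleaned")     // load: write the results back out
```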
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD (which later evolved into the DataFrame API) and provides support for structured and semi-structured data.
Spark SQL is a Spark module for structured data processing.
Unlike the basic Spark RDD API, the interfaces provided
by Spark SQL provide Spark with more information about
the structure of both the data and the computation being
performed. Internally, Spark SQL uses this extra information
to perform extra optimizations.
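A minimal sketch of this, assuming a local session; the table, column names, and values are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

// Spark SQL sketch on structured data; the tiny dataset is invented for illustration.
val spark = SparkSession.builder().appName("SqlSketch").master("local[*]").getOrCreate()
import spark.implicits._

val people = Seq(("alice", 34), ("bob", 29), ("carol", 41)).toDF("name", "age")
people.createOrReplaceTempView("people")

// Because the schema is known, Spark SQL can optimize this query before executing it.
spark.sql("SELECT name FROM people WHERE age > 30").show()
```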
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Dataset) transformations on those mini-batches of data.
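A hedged sketch of this mini-batch model using the DStream API (the host and port are placeholders; locally, nc -lk 9999 can feed it test lines):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Mini-batch sketch: every 5 seconds the lines received on a socket form one RDD,
// and ordinary RDD-style transformations are applied to each batch.
val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))

val lines = ssc.socketTextStream("localhost", 9999)
val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wordCounts.print()

ssc.start()             // begin receiving and processing mini-batches
ssc.awaitTermination()
```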
With so many distributed stream processing engines available, the unique benefits of Apache Spark Streaming make it stand out. From early on, Apache Spark has provided a unified engine that natively supports both batch and streaming workloads. This is different from other systems that either have a processing engine designed only for streaming, or have similar batch and streaming APIs but compile internally to different engines.
Spark's single execution engine and unified programming model for batch and streaming lead to some unique benefits over other traditional streaming systems. In particular, four major advantages are:
Fast recovery from failures and stragglers
Better load balancing and resource usage
Combining of streaming data with static datasets and
interactive queries
Native integration with advanced processing libraries (SQL,
machine learning, graph processing)
MLlib
MLlib is a distributed machine learning framework on top of Spark that takes advantage of the distributed memory-based Spark architecture, so machine learning algorithms can analyze big data in a faster and more scalable manner. According to benchmarks done by the MLlib developers against the Alternating Least Squares (ALS) implementations, Spark MLlib is about nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).
MLlib is Spark's machine learning (ML) library. Its goal is to
make practical machine learning scalable and easy. At a high
level, it provides tools such as:
ML Algorithms - common learning algorithms such as
classification, regression, clustering, and collaborative
filtering
Featurization - feature extraction, transformation,
dimensionality reduction, and selection
Pipelines - tools for constructing, evaluating, and tuning
ML Pipelines
Persistence - saving and loading algorithms, models, and
Pipelines
Utilities - linear algebra, statistics, data handling, etc.
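A hedged sketch of how these pieces combine in an ML Pipeline (the tiny training set, column names, and parameter values are invented for illustration):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

// ML Pipeline sketch: featurization stages followed by a learning algorithm.
val spark = SparkSession.builder().appName("MLlibSketch").master("local[*]").getOrCreate()
import spark.implicits._

val training = Seq(
  ("spark is fast", 1.0),
  ("hadoop map reduce", 0.0),
  ("spark mllib pipelines", 1.0),
  ("disk based batch job", 0.0)
).toDF("text", "label")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

// Stages run in order: tokenize -> featurize -> fit the classifier.
val model = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)
model.transform(training).select("text", "prediction").show()
```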
GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computations that can model user-defined graphs using the Pregel abstraction API, and it also provides an optimized runtime for this abstraction.
GraphX is a new component in Spark for graphs and
graph-parallel computation. At a high level, GraphX
extends the Spark RDD by introducing a
new Graph abstraction: a directed multigraph with
properties attached to each vertex and edge. To support
graph computation, GraphX exposes a set of fundamental
operators as well as an optimized variant of the Pregel API. In
addition, GraphX includes a growing collection of
graph algorithms and builders to simplify graph analytics tasks.
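A hedged sketch of the property-graph abstraction and one built-in algorithm (the small social graph is invented for illustration):

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

// GraphX sketch: vertices and edges both carry user-defined properties.
val spark = SparkSession.builder().appName("GraphXSketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val users = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val follows = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))
val graph = Graph(users, follows)

// One of the built-in graph algorithms: rank vertices with PageRank and join scores back to names.
val ranks = graph.pageRank(0.001).vertices
users.join(ranks).collect().foreach { case (_, (name, rank)) => println(f"$name%-6s $rank%.3f") }
```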
