Downloading Spark and Getting Started, Programming with RDDs, Machine Learning with MLlib.

4.1 Introduction to Data Analysis with Spark

Today most enterprises produce big data. Out of this produced data, about 65% is not of much use; the data actually targeted for analysis is far smaller. Analyzing data according to an industry's requirements is therefore a difficult task. Most industries use Hadoop to analyze their data sets, because the Hadoop framework is based on a simple programming model known as MapReduce, which enables computing solutions that are scalable, flexible, fault-tolerant, and cost-effective. Here, the main concern is maintaining speed when processing big datasets, in terms of both the waiting time between queries and the waiting time to run a program. Spark is designed to be highly accessible, with simple APIs in Python, Java, Scala, and SQL, and rich built-in libraries. It also integrates closely with other big data tools. In particular, Spark can run in Hadoop clusters and access any Hadoop data source, including Cassandra.

4.1.1 History of Apache Spark

The story of Apache Spark began in 2009 with Mesos, a project in UC Berkeley's AMPLab. The idea was to build a cluster management framework that could support different kinds of cluster computing systems. Once Mesos was built, the team decided to build a computing framework on top of it, and that is how Spark was born. Spark was open-sourced in 2010 under a BSD license, donated to the Apache Software Foundation in 2013, and became a top-level Apache project in February 2014.

What is Spark?

Apache Spark is a cluster computing platform designed to be fast and general-purpose. To increase the processing speed on big data, Spark extends the popular MapReduce model to efficiently support more types of computation, including interactive queries and stream processing. When processing large datasets, speed is very important: it means the difference between exploring data interactively and waiting minutes or hours. One of the main features Spark offers for speed is the ability to run computations in memory, but the system is also more efficient than MapReduce for complex applications running on disk. As a result, Spark has several advantages over other big data and MapReduce technologies such as Hadoop and Storm.

The resilient distributed dataset (RDD) is an application programming interface centered on a data structure provided by Apache Spark. An RDD is a read-only multiset of data items distributed over a cluster of machines and maintained in a fault-tolerant way. It was developed to overcome the limitations of the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: a MapReduce program reads input data from disk, maps a function across the data, reduces the results of the map, and stores the reduction results back on disk. Spark's RDDs instead function as a working set for distributed programs, offering a (deliberately) restricted form of distributed shared memory.
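As a minimal sketch of the RDD model (written for the interactive spark-shell, which predefines a SparkContext named sc; the numbers are illustrative), the following distributes a small collection as an RDD, applies a lazy transformation, and runs two actions, caching the intermediate RDD in memory:

    // Distribute a small in-memory collection as an RDD.
    val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

    // Transformations are lazy; nothing runs until an action is called.
    val squares = numbers.map(x => x * x)

    // cache() keeps the RDD in memory across actions -- the in-memory
    // reuse that gives Spark its speed advantage.
    squares.cache()

    // Actions trigger the actual computation.
    println(squares.collect().mkString(", "))  // 1, 4, 9, 16, 25
    println(squares.reduce(_ + _))             // 55

Because squares is cached, the second action reuses the in-memory data instead of recomputing the map.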
List out the features of Apache Spark

Apache Spark has the following main features.

Speed - Spark provides faster processing: it can run an application up to 100 times faster in memory on a Hadoop cluster, and up to 10 times faster when running on disk. This is possible because Spark reduces the number of read and write operations to disk, storing intermediate processing data in memory.

Supports multiple languages - Spark provides built-in APIs in multiple languages such as Java, Scala, and Python, so you can write applications in different languages. Spark also offers around 80 high-level operators for interactive querying.

Advanced analytics - Spark supports not only the map and reduce functions but also SQL queries, streaming data, machine learning (ML), and graph algorithms.

Runs everywhere - Spark runs in standalone mode or on Apache Hadoop or Mesos, and it can access diverse data sources including HDFS, Cassandra, HBase, and S3.

How Spark can be built with Hadoop components?

There are three ways to deploy Spark alongside Hadoop components.

Standalone - A standalone deployment means Spark occupies the place on top of HDFS (Hadoop Distributed File System), with space allocated for HDFS explicitly. Here, Spark and MapReduce run side by side to cover all Spark jobs on the cluster.

Hadoop YARN - A YARN deployment simply means that Spark runs on YARN without any pre-installation or root access required. It provides support for integrating Spark into the Hadoop ecosystem or Hadoop stack, and it allows other components to run on top of the stack.

Spark in MapReduce (SIMR) - In addition to standalone deployment, Spark in MapReduce can be used to launch Spark jobs. With SIMR, a user can start Spark and use its shell without any kind of administrative access.

Briefly explain the components of Spark

Apache Spark Core - Apache Spark consists of Spark Core and a set of libraries. Spark Core is the distributed execution engine: the underlying general execution engine for the Spark platform, upon which all other functionality is built. Its Java, Scala, and Python APIs offer a platform for distributed ETL (extract, transform, and load) application development. Additional libraries built on top of the core allow diverse workloads for streaming, SQL, and machine learning. Spark Core also provides in-memory computing and the ability to reference datasets in external storage systems.

Spark SQL - Spark SQL is a component on top of Spark Core that introduces a data abstraction called SchemaRDD (later renamed DataFrame), which provides support for structured and semi-structured data. Spark SQL is the Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed; internally, Spark SQL uses this extra information to perform extra optimizations.
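As a minimal sketch of this structured API (again for the spark-shell, which predefines a SparkSession named spark; the view name, columns, and rows are illustrative), the following registers a small DataFrame as a temporary view and queries it with SQL:

    // toDF and other conversions come from the session's implicits
    // (already in scope in the spark-shell).
    import spark.implicits._

    // A small DataFrame; its schema is the extra structural information
    // that Spark SQL exploits for optimization.
    val people = Seq(("alice", 34), ("bob", 28)).toDF("name", "age")

    // Register the DataFrame as a temporary view and query it with SQL.
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()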
Spark Streaming - Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD (resilient distributed dataset) transformations on those mini-batches of data. With so many distributed stream processing engines available, Apache Spark Streaming stands out for its unique benefits. From early on, Apache Spark has provided a unified engine that natively supports both batch and streaming workloads. This is different from other systems that either have a processing engine designed only for streaming, or have similar batch and streaming APIs but compile internally to different engines. Spark's single execution engine and unified programming model for batch and streaming lead to some unique benefits over other traditional streaming systems. In particular, four major advantages are (a minimal streaming sketch appears at the end of this section):

Fast recovery from failures and stragglers
Better load balancing and resource usage
Combining of streaming data with static datasets and interactive queries
Native integration with advanced processing libraries (SQL, machine learning, graph processing)

MLlib - MLlib is a distributed machine learning framework on top of Spark, built around the distributed memory-based Spark architecture. Machine learning algorithms are useful for analyzing big data in a faster and more scalable manner. According to benchmarks done by the MLlib developers against Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface). MLlib is Spark's machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. At a high level, it provides tools such as the following (a short MLlib sketch also appears at the end of this section):

ML Algorithms - common learning algorithms such as classification, regression, clustering, and collaborative filtering
Featurization - feature extraction, transformation, dimensionality reduction, and selection
Pipelines - tools for constructing, evaluating, and tuning ML Pipelines
Persistence - saving and loading algorithms, models, and Pipelines
Utilities - linear algebra, statistics, data handling, etc.

GraphX - GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computation that can model user-defined graphs using the Pregel abstraction API, and it provides an optimized runtime for this abstraction. GraphX extends the Spark RDD with a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge. To support graph computation, GraphX exposes a set of fundamental operators as well as an optimized variant of the Pregel API. In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.
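To make the streaming model described above concrete, here is a minimal word-count sketch over one-second mini-batches (spark-shell again; the socket source on localhost:9999 is an illustrative assumption, fed for example by nc -lk 9999):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // A StreamingContext with a 1-second batch interval: incoming data
    // is grouped into mini-batches and processed with RDD transformations.
    // Note: a socket receiver needs at least two local cores.
    val ssc = new StreamingContext(sc, Seconds(1))

    // Illustrative source: a text stream on localhost:9999.
    val lines = ssc.socketTextStream("localhost", 9999)

    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    counts.print()          // print each mini-batch's counts

    ssc.start()             // start receiving and processing data
    ssc.awaitTermination()

The same flatMap/map/reduceByKey code would work unchanged on a static RDD, which is exactly the unified batch/streaming programming model described above.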
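A short MLlib sketch in the same vein (the toy data and the maxIter setting are illustrative) shows the ML Algorithms and Persistence tools listed above by fitting a logistic regression to a small DataFrame:

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.linalg.Vectors
    import spark.implicits._

    // Toy training data: a label column and a feature-vector column.
    val training = Seq(
      (0.0, Vectors.dense(0.0, 1.1)),
      (1.0, Vectors.dense(2.0, 1.0)),
      (0.0, Vectors.dense(0.1, 1.3)),
      (1.0, Vectors.dense(1.9, 0.8))
    ).toDF("label", "features")

    // A classification algorithm; maxIter is an illustrative setting.
    val lr = new LogisticRegression().setMaxIter(10)
    val model = lr.fit(training)

    // Persistence: fitted models can be saved and reloaded.
    // model.save("/tmp/lr-model")   // path is illustrative

    model.transform(training).select("label", "prediction").show()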
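Finally, a minimal GraphX sketch (spark-shell; the three-vertex graph is illustrative) builds a property graph, that is, a directed multigraph with properties attached to vertices and edges, and runs a built-in algorithm:

    import org.apache.spark.graphx.{Edge, Graph}

    // Vertices are (id, property) pairs; here the property is a name.
    val vertices = sc.parallelize(Seq(
      (1L, "alice"), (2L, "bob"), (3L, "carol")))

    // Directed edges, each carrying its own property.
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"),
      Edge(2L, 3L, "follows"),
      Edge(3L, 1L, "follows")))

    // The Graph abstraction described above.
    val graph = Graph(vertices, edges)

    // Fundamental operators...
    println(graph.numVertices + " vertices, " + graph.numEdges + " edges")

    // ...and a built-in algorithm (PageRank, implemented on top of
    // GraphX's Pregel variant), run to a tolerance of 0.001.
    val ranks = graph.pageRank(0.001).vertices
    ranks.collect().foreach(println)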