
Hadoop

A Big Picture for Developers

How much Data?


Over 1 terabyte of data is generated by the NYSE every day.

We send more than 144.8 billion email messages a day.
On Twitter, we send more than 340 million tweets a day.
On Facebook, we share more than 684,000 bits of content a day.
We upload 72 hours of new video to YouTube a minute.
We spend $272,000 on Web shopping a day.
Google receives over 2 million search queries a minute.
Apple receives around 47,000 app downloads a minute.
Brands receive more than 34,000 Facebook likes a minute.
Tumblr blog owners publish 27,000 new posts a minute.
Instagram photographers share 3,600 new photos a minute.
Flickr photographers upload 3,125 new photos a minute.
We perform over 2,000 Foursquare check-ins a minute.
Individuals and organizations launch 571 new websites a minute.
WordPress bloggers publish close to 350 new blog posts a minute.

Data Types

Structured
  SQL databases
  Data warehouses
  Enterprise systems (CRM, ERP, etc.)

Unstructured
  Analog data
  GPS tracking information
  Audio and video streams

Semi-structured
  XML
  E-mails
  EDI

Worldwide Corporate Data Growth

[Chart: worldwide corporate data growth, 2008-2020, measured in petabytes
(y-axis up to 700,000 PB), broken down into structured, semi-structured,
and unstructured data.]

Hadoop History

Dec 2004: Dean/Ghemawat (Google) publish the MapReduce paper
2005: Doug Cutting and Mike Cafarella (Yahoo) create Hadoop, at first
only to extend Nutch (the name is derived from Doug's son's toy elephant)
2006: Yahoo runs Hadoop on 5-20 nodes
March 2008: Cloudera founded
July 2008: Hadoop wins the terabyte sort benchmark (the first time a
Java program won this competition)
April 2009: Amazon introduces Elastic MapReduce as a service on S3/EC2
June 2011: Hortonworks founded
27 Dec 2011: Apache Hadoop release 1.0.0
June 2012: Facebook claims the biggest Hadoop cluster, totalling more
than 100 petabytes in HDFS
2013: Yahoo runs Hadoop on 42,000 nodes, computing about 500,000
MapReduce jobs per day
15 Oct 2013: Apache Hadoop release 2.2.0 (YARN)

Terms

Google calls it:        Hadoop equivalent:
MapReduce               Hadoop MapReduce
GFS                     HDFS
Bigtable                HBase
Chubby                  ZooKeeper

HDFS

Single namespace for the entire cluster
Data coherency
  Write-once-read-many access model
  Clients can only append to existing files
Files are broken up into blocks
  Typically 128 MB block size
  Each block replicated on multiple DataNodes
Intelligent client
  Clients can find the location of blocks
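As an illustration (not from the slides), here is a minimal sketch of an HDFS
client using the Hadoop Java FileSystem API: it creates a file, appends to it,
and asks the NameNode where the file's blocks live. The NameNode URI and file
path are placeholder assumptions, and append must be enabled on the cluster.

  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsClientSketch {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          // Connect to the NameNode (hostname and port are illustrative assumptions).
          FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

          Path file = new Path("/data/events.log");

          // Write-once: create the file and write some content.
          try (FSDataOutputStream out = fs.create(file)) {
              out.writeBytes("first record\n");
          }

          // Append-only updates to an existing file (requires append support).
          try (FSDataOutputStream out = fs.append(file)) {
              out.writeBytes("appended record\n");
          }

          // "Intelligent client": ask the NameNode where each block is stored.
          FileStatus status = fs.getFileStatus(file);
          BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
          for (BlockLocation b : blocks) {
              System.out.println("offset " + b.getOffset() + " on hosts "
                      + String.join(",", b.getHosts()));
          }
      }
  }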

HDFS Architecture

NameNode Metadata

Metadata in memory
  The entire metadata is kept in main memory
  No demand paging of metadata
Types of metadata
  List of files
  List of blocks for each file
  List of DataNodes for each block
  File attributes, e.g. creation time, replication factor
A transaction log
  Records file creations, file deletions, etc.

DataNode

A block server
  Stores data in the local file system (e.g. ext3)
  Stores metadata of a block (e.g. CRC)
  Serves data and metadata to clients
Block report
  Periodically sends a report of all existing blocks to the NameNode
Facilitates pipelining of data
  Forwards data to other specified DataNodes

Block Placement

Current strategy
  -- One replica on the local node
  -- Second replica on a remote rack
  -- Third replica on the same remote rack
  -- Additional replicas are randomly placed
Clients read from the nearest replica
Would like to make this policy pluggable

MapReduce

Move the compute to the data, instead of moving the data to the compute.

Hadoop Cluster

MapReduce

MapReduce Logical Data Flow

MapReduce flow

Input: the input data / file to be processed.
Split: Hadoop splits the incoming data into smaller pieces called "splits".
Map: in this step, MapReduce processes each split according to the logic
defined in the map() function. Each mapper works on one split at a time.
Each mapper is treated as a task, and multiple tasks are executed across
different TaskTrackers and coordinated by the JobTracker.
Combine: an optional step, used to improve performance by reducing the
amount of data transferred across the network. The combiner is essentially
the same as the reduce step and is used to aggregate the output of the
map() function before it is passed to the subsequent steps.
Shuffle & Sort: in this step, the outputs of all the mappers are shuffled,
sorted to put them in order, and grouped before being sent to the next step.
Reduce: this step aggregates the outputs of the mappers using the
reduce() function. The output of each reducer is sent to the next and
final step. Each reducer is treated as a task, and multiple tasks are
executed across different TaskTrackers and coordinated by the JobTracker.
Output: finally, the output of the reduce step is written to a file in HDFS.
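To make the map() and reduce() steps concrete, here is a minimal word-count
job written against the Hadoop Java MapReduce API. It is an illustrative
sketch rather than code from the slides; input and output paths are passed
on the command line.

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {

      // Map: emit (word, 1) for every word in each input split.
      public static class TokenMapper
              extends Mapper<LongWritable, Text, Text, IntWritable> {
          private static final IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();

          @Override
          protected void map(LongWritable key, Text value, Context context)
                  throws IOException, InterruptedException {
              for (String token : value.toString().split("\\s+")) {
                  if (!token.isEmpty()) {
                      word.set(token);
                      context.write(word, ONE);
                  }
              }
          }
      }

      // Reduce: sum the counts for each word (also usable as a combiner).
      public static class SumReducer
              extends Reducer<Text, IntWritable, Text, IntWritable> {
          @Override
          protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable v : values) {
                  sum += v.get();
              }
              context.write(key, new IntWritable(sum));
          }
      }

      public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "word count");
          job.setJarByClass(WordCount.class);
          job.setMapperClass(TokenMapper.class);
          job.setCombinerClass(SumReducer.class);   // the optional combine step
          job.setReducerClass(SumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
  }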

Data Integration (Sqoop)

Sqoop is a tool designed for efficiently transferring bulk data between
Apache Hadoop and structured datastores such as relational databases.

[Diagram: data flow between an RDBMS and HDFS]

Data Integration (Flume)

Flume is a distributed, reliable, and available service for efficiently
collecting, aggregating, and moving large amounts of streaming data into
the Hadoop Distributed File System (HDFS). It has a simple and flexible
architecture based on streaming data flows, and is robust and fault
tolerant with tunable reliability mechanisms for failover and recovery.

Data Integration (Chukwa)

Chukwa is a data collection system for monitoring large distributed
systems. It is built on top of the Hadoop Distributed File System (HDFS)
and the MapReduce framework, and inherits Hadoop's scalability and
robustness. Chukwa also includes a flexible and powerful toolkit for
displaying, monitoring, and analyzing results to make the best use of the
collected data.

Data Access (Pig)

Pig is a platform for processing and analyzing large data sets. It
consists of a high-level language (Pig Latin) for expressing data
analysis programs, paired with the MapReduce framework for processing
these programs.

Data Access (Hive)

Built on the MapReduce framework, Hive is a data warehouse that enables
easy data summarization and ad hoc queries via an SQL-like interface for
large datasets stored in HDFS.
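A minimal sketch (not from the slides) of querying Hive from Java over JDBC
against HiveServer2; the host, port, database, table name, and credentials
are illustrative assumptions.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class HiveJdbcSketch {
      public static void main(String[] args) throws Exception {
          // Register the HiveServer2 JDBC driver (hive-jdbc must be on the classpath).
          Class.forName("org.apache.hive.jdbc.HiveDriver");
          String url = "jdbc:hive2://hiveserver:10000/default";
          try (Connection conn = DriverManager.getConnection(url, "hadoop", "");
               Statement stmt = conn.createStatement()) {
              // An SQL-like (HiveQL) aggregation over a table stored in HDFS.
              ResultSet rs = stmt.executeQuery(
                      "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page");
              while (rs.next()) {
                  System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
              }
          }
      }
  }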

Data Storage (HBase)

HBase is a column-oriented NoSQL data storage system that provides
random, real-time read/write access to big data for user applications.
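A minimal sketch (not from the slides) of random reads and writes with the
HBase Java client API; the table name and column family are illustrative
assumptions, and the table is assumed to already exist.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  public class HBaseClientSketch {
      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          try (Connection conn = ConnectionFactory.createConnection(conf);
               Table table = conn.getTable(TableName.valueOf("users"))) {

              // Random real-time write: one row, one column family, one qualifier.
              Put put = new Put(Bytes.toBytes("user-42"));
              put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("name"),
                      Bytes.toBytes("Alice"));
              table.put(put);

              // Random real-time read of the same row.
              Result result = table.get(new Get(Bytes.toBytes("user-42")));
              byte[] name = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("name"));
              System.out.println(Bytes.toString(name));
          }
      }
  }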

Data Storage (Cassandra)

Built on ideas from Amazon's Dynamo and Google's Bigtable, Cassandra is a
distributed database for managing large amounts of structured data across
many commodity servers, while providing highly available service and no
single point of failure. Cassandra offers capabilities that relational
databases and other NoSQL databases cannot match.

INTERACTION, VISUALIZATION, EXECUTION, DEVELOPMENT

HCatalog

HCatalog is built on top of the Hive metastore and incorporates
components from the Hive DDL. HCatalog provides read and write interfaces
for Pig and MapReduce, and uses Hive's command line interface for issuing
data definition and metadata exploration commands. It also presents a
REST interface to allow external tools access to Hive DDL (Data
Definition Language) operations, such as "create table" and "describe
table".

INTERACTION, VISUALIZATION, EXECUTION, DEVELOPMENT

Lucene

Lucene is a full-text search library in Java which makes it easy to add
search functionality to an application or website. It does so by adding
content to a full-text index.
Also available as Lucene.Net.
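A minimal sketch (not from the slides) of adding content to a full-text index
and querying it with the Lucene Java API (Lucene 5+ style); the index path,
field name, and query string are illustrative assumptions.

  import java.nio.file.Paths;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.document.TextField;
  import org.apache.lucene.index.DirectoryReader;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.queryparser.classic.QueryParser;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.ScoreDoc;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  public class LuceneSketch {
      public static void main(String[] args) throws Exception {
          StandardAnalyzer analyzer = new StandardAnalyzer();
          Directory index = FSDirectory.open(Paths.get("/tmp/lucene-index"));

          // Add content to the full-text index.
          try (IndexWriter writer = new IndexWriter(index, new IndexWriterConfig(analyzer))) {
              Document doc = new Document();
              doc.add(new TextField("body", "Hadoop stores data in HDFS blocks",
                      Field.Store.YES));
              writer.addDocument(doc);
          }

          // Search the index.
          Query query = new QueryParser("body", analyzer).parse("hdfs");
          try (DirectoryReader reader = DirectoryReader.open(index)) {
              IndexSearcher searcher = new IndexSearcher(reader);
              for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                  System.out.println(searcher.doc(hit.doc).get("body"));
              }
          }
      }
  }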

INTERACTION, VISUALIZATION, EXECUTION, DEVELOPMENT

Apache Hama

Hama is a pure BSP (Bulk Synchronous Parallel) computing framework on top
of HDFS for massive scientific computations such as matrix, graph, and
network algorithms.

INTERACTION, VISUALIZATION, EXECUTION, DEVELOPMENT

Crunch

Simple and efficient MapReduce pipelines: the Apache Crunch Java library
provides a framework for writing, testing, and running MapReduce
pipelines. Its goal is to make pipelines that are composed of many
user-defined functions simple to write, easy to test, and efficient to
run.

Data Serialization (Avro)

Avro is a very popular data serialization format in the Hadoop technology
stack. MapReduce jobs in Java, Hadoop Streaming, Pig, and Hive can all
read and/or write data in Avro format.
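A minimal sketch (not from the slides) of writing and reading Avro records
with the generic Avro Java API; the record schema and file name are
illustrative assumptions.

  import java.io.File;
  import org.apache.avro.Schema;
  import org.apache.avro.file.DataFileReader;
  import org.apache.avro.file.DataFileWriter;
  import org.apache.avro.generic.GenericData;
  import org.apache.avro.generic.GenericDatumReader;
  import org.apache.avro.generic.GenericDatumWriter;
  import org.apache.avro.generic.GenericRecord;

  public class AvroSketch {
      public static void main(String[] args) throws Exception {
          // An illustrative record schema, defined inline as JSON.
          Schema schema = new Schema.Parser().parse(
                  "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}");

          File file = new File("users.avro");

          // Serialize: write one record to an Avro container file.
          GenericRecord user = new GenericData.Record(schema);
          user.put("name", "Alice");
          user.put("age", 30);
          try (DataFileWriter<GenericRecord> writer =
                       new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
              writer.create(schema, file);
              writer.append(user);
          }

          // Deserialize: read the records back.
          try (DataFileReader<GenericRecord> reader =
                       new DataFileReader<>(file, new GenericDatumReader<GenericRecord>(schema))) {
              for (GenericRecord rec : reader) {
                  System.out.println(rec.get("name") + " " + rec.get("age"));
              }
          }
      }
  }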

Data Serialization (Thrift)

Thrift is a framework for scalable cross-language services development.
It combines a software stack with a code generation engine to build
services that work efficiently and seamlessly between C++, Java, Python,
PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js,
Smalltalk, OCaml, and Delphi.

Data Intelligence (Drill)

Drill is a framework that supports data-intensive distributed
applications for interactive analysis of large-scale datasets. Drill is
the open source version of Google's Dremel system, which is available as
an infrastructure service called Google BigQuery.

Data Intelligence (Mahout)

Mahout is a library of machine learning algorithms focused primarily on
the areas of collaborative filtering, clustering, and classification.
Many of the implementations use the Apache Hadoop platform.

Management & Monitoring (Ambari)

Ambari is a web-based tool for provisioning, managing, and monitoring
Apache Hadoop clusters, with support for Hadoop HDFS, Hadoop MapReduce,
Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop.
Ambari also provides a dashboard for viewing cluster health (such as
heatmaps) and the ability to view MapReduce, Pig, and Hive applications
visually, along with features to diagnose their performance
characteristics in a user-friendly manner.

Management & Monitoring (ZooKeeper)

ZooKeeper is a centralized service for maintaining configuration
information, naming, providing distributed synchronization, and providing
group services. All of these kinds of services are used in some form or
another by distributed applications. Each time they are implemented,
there is a lot of work that goes into fixing the bugs and race conditions
that are inevitable.
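A minimal sketch (not from the slides) of storing and reading a small piece of
shared configuration with the ZooKeeper Java client; the ensemble address and
znode path are illustrative assumptions.

  import java.nio.charset.StandardCharsets;
  import java.util.concurrent.CountDownLatch;
  import org.apache.zookeeper.CreateMode;
  import org.apache.zookeeper.Watcher;
  import org.apache.zookeeper.ZooDefs;
  import org.apache.zookeeper.ZooKeeper;

  public class ZooKeeperSketch {
      public static void main(String[] args) throws Exception {
          CountDownLatch connected = new CountDownLatch(1);
          // Connect to the ensemble (host and port are assumptions for illustration).
          ZooKeeper zk = new ZooKeeper("zk1:2181", 30000, event -> {
              if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                  connected.countDown();
              }
          });
          connected.await();

          // Store a small piece of shared configuration in a znode.
          String path = "/app-config";
          if (zk.exists(path, false) == null) {
              zk.create(path, "feature=enabled".getBytes(StandardCharsets.UTF_8),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
          }

          // Read it back, as any client in the cluster could.
          byte[] data = zk.getData(path, false, null);
          System.out.println(new String(data, StandardCharsets.UTF_8));

          zk.close();
      }
  }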

Management & Monitoring (Oozie)

Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions.
Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by
time (frequency) and data availability.
Oozie is integrated with the rest of the Hadoop stack, supporting several
types of Hadoop jobs out of the box (such as Java map-reduce, Streaming
map-reduce, Pig, Hive, Sqoop, and DistCp) as well as system-specific jobs
(such as Java programs and shell scripts).

Other Tools

SPARK: ideal for in-memory data processing. It allows data scientists to
implement fast, iterative algorithms for advanced analytics such as
clustering and classification of datasets (see the sketch after this list).
STORM: a distributed real-time computation system for processing fast,
large streams of data, adding reliable real-time data processing
capabilities to Apache Hadoop 2.x.
SOLR: a platform for searches of data stored in Hadoop. Solr enables
powerful full-text search and near real-time indexing on many of the
world's largest Internet sites.
TEZ: a generalized data-flow programming framework, built on Hadoop YARN,
which provides a powerful and flexible engine to execute an arbitrary DAG
of tasks to process data for both batch and interactive use cases.
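A minimal sketch (not from the slides) of in-memory processing with the Spark
2.x Java API: a word count over a file in HDFS. The application name and
input/output paths are illustrative assumptions.

  import java.util.Arrays;
  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaPairRDD;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;
  import scala.Tuple2;

  public class SparkWordCountSketch {
      public static void main(String[] args) {
          SparkConf conf = new SparkConf().setAppName("word-count-sketch");
          JavaSparkContext sc = new JavaSparkContext(conf);

          // Load the data; Spark keeps intermediate RDDs in memory where possible.
          JavaRDD<String> lines = sc.textFile("hdfs://namenode:8020/data/events.log");
          JavaPairRDD<String, Integer> counts = lines
                  .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                  .mapToPair(word -> new Tuple2<>(word, 1))
                  .reduceByKey(Integer::sum);

          counts.saveAsTextFile("hdfs://namenode:8020/output/word-counts");
          sc.stop();
      }
  }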

Technology Stack / Ecosystem

[Diagram: the Hadoop technology stack / ecosystem, with the application
layer labelled "Windows Application".]

Hadoop, We Know Why?

Need to process multi-petabyte datasets
It is expensive to build reliability into each application
Nodes fail every day
  Failure is expected, rather than exceptional
  The number of nodes in a cluster is not constant
Need for common infrastructure
  Efficient, reliable, open source (Apache License)
The above goals are the same as Condor's, but
  workloads are I/O bound, not CPU bound

Fact #1.

Hadoop consists of multiple products. We talk about Hadoop as if it's one
monolithic thing, but it's actually a family of open source products and
technologies overseen by the Apache Software Foundation (ASF). (Some
Hadoop products are also available via vendor distributions; more on that
later.) The Apache Hadoop library includes (in BI priority order): the
Hadoop Distributed File System (HDFS), MapReduce, Pig, Hive, HBase,
HCatalog, Ambari, Mahout, Flume, and so on. You can combine these in
various ways, but HDFS and MapReduce (perhaps with Pig, Hive, and HBase)
constitute a useful technology stack for applications in BI, DW, DI, and
analytics. More Hadoop projects are coming that will apply to BI/DW,
including Impala, a much-needed SQL engine for low-latency access to data
in HDFS and Hive.

Fact #2.

Hadoop is open source but available from vendors, too. Apache Hadoop's
open source software library is available from the ASF at www.apache.org.
For users desiring a more enterprise-ready package, a few vendors now
offer Hadoop distributions that include additional administrative tools,
maintenance, and technical support. A handful of vendors offer their own
non-Hadoop-based implementations of MapReduce.

Fact #3.

Hadoop is an ecosystem, not a single product. In addition to products
from Apache, the extended Hadoop ecosystem includes a growing list of
vendor products (e.g., database management systems and tools for
analytics, reporting, and DI) that integrate with or expand Hadoop
technologies. One minute on your favorite search engine will reveal
these. Ignorance of Hadoop is still common in the BI and IT communities:
Hadoop comprises multiple products, available from multiple sources.
(These facts were originally published as the expert column "Busting 10
Myths about Hadoop" in TDWI's BI This Week newsletter, March 20, 2012,
available at tdwi.org, and updated slightly in the TDWI report
"Integrating Hadoop into BI/DW".)

Fact #4.

HDFS is a file system, not a database management system (DBMS). Hadoop is
primarily a distributed file system and therefore lacks capabilities we
associate with a DBMS, such as indexing, random access to data, support
for standard SQL, and query optimization. That's okay, because HDFS does
things DBMSs do not do as well, such as managing and processing massive
volumes of file-based, unstructured data. For minimal DBMS functionality,
users can layer HBase over HDFS and layer a query framework such as Hive
or SQL-based Impala over HDFS or HBase.

Fact #5.

Hive resembles SQL but is not standard SQL. Many of us are handcuffed to
SQL because we know it well and our tools demand it. People who know SQL
can quickly learn to hand code Hive, but that doesn't solve compatibility
issues with SQL-based tools. TDWI believes that over time, Hadoop
products will support standard SQL and SQL-based vendor tools will
support Hadoop, so this issue will eventually be moot.

Fact #6.

Hadoop and MapReduce are related but don't require each other. Some
variations of MapReduce work with a variety of storage technologies,
including HDFS, other file systems, and some relational DBMSs. Some users
deploy HDFS with Hive or HBase, but not MapReduce.

Fact #7.

MapReduce provides control for analytics, not analytics per se. MapReduce
is a general-purpose execution engine that handles the complexities of
network communication, parallel programming, and fault tolerance for a
wide variety of hand-coded logic and other applications, not just
analytics.

Fact #8.

Hadoop is about data diversity, not just data volume. Theoretically, HDFS
can manage the storage and access of any data type as long as you can put
the data in a file and copy that file into HDFS. As outrageously
simplistic as that sounds, it's largely true, and it's exactly what
brings many users to Apache HDFS and related Hadoop products. After all,
many types of big data that require analysis are inherently file based,
such as Web logs, XML files, and personal productivity documents.

Fact #9.

Hadoop complements a DW; it's rarely a replacement. Most organizations
have designed their DWs for structured, relational data, which makes it
difficult to wring BI value from unstructured and semistructured data.
Hadoop promises to complement DWs by handling the multi-structured data
types most DWs simply weren't designed for. Furthermore, Hadoop can
enable certain pieces of a modern DW architecture, such as massive data
staging areas, archives for detailed source data, and analytic sandboxes.
Some early adopters offload as many workloads as they can to HDFS and
other Hadoop technologies because they are less expensive than the
average DW platform. The result is that DW resources are freed for the
workloads with which they excel. HDFS is not a DBMS; oddly enough, that's
an advantage for BI/DW. Hadoop promises to extend DW architecture to
better handle staging, archiving, sandboxes, and unstructured data.

Fact #10.

Hadoop enables many types of analytics, not just Web analytics. Hadoop
gets a lot of press about how Internet companies use it for analyzing Web
logs and other Web data, but other use cases exist. For example, consider
the big data coming from sensory devices, such as robotics in
manufacturing, RFID in retail, or grid monitoring in utilities. Older
analytic applications that need large data samples, such as customer base
segmentation, fraud detection, and risk analysis, can benefit from the
additional big data managed by Hadoop. Likewise, Hadoop's additional data
can expand 360-degree views to create a more complete and granular view
of customers, financials, partners, and other business entities.

THANKS
