Google hit a bottleneck: they simply could not write a large enough check, i.e.
buy a single machine big enough, to process the data:
• To address this, the Google Labs team developed an algorithm that allowed
large data calculations to be chopped into smaller chunks and mapped to
many computers; when the calculations were done, the results were brought
back together to produce the resulting data set.
• They called this algorithm MapReduce. It was later used to develop an
open-source framework called Hadoop, which allows applications to run
using the MapReduce algorithm.
• Data is processed in parallel, not serially. Big data workloads have high
computational complexity, e.g. statistical simulations.
• We will increasingly see Hadoop playing a central role in statistical
analysis, ETL processing, and business intelligence.
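The chop-map-bring-back-together flow described above can be sketched in plain Python. This is a toy word count, not Hadoop's actual implementation; on a real cluster each mapper and reducer would run on a different machine:

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in this chunk."""
    return [(word, 1) for word in chunk.split()]

def shuffle(mapped_pairs):
    """Shuffle: group all emitted values by key across every mapper's output."""
    groups = defaultdict(list)
    for pairs in mapped_pairs:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the grouped values for one key."""
    return key, sum(values)

# The data set is chopped into chunks, each chunk is mapped independently,
# and the reduced results are brought back together.
chunks = ["big data big calculations", "big data processed in parallel"]
mapped = [map_phase(c) for c in chunks]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts["big"])  # → 3
```

The key property is that every `map_phase` call is independent of the others, which is exactly what lets the work be spread over many computers.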
• Volume: the amount of data matters.
Organizations collect data from a variety of sources,
including business transactions, social media, and
information from sensors or machine-to-machine data.
In the past, storing it would have been a problem, but
new technologies (such as Hadoop) have eased the
burden.
• Variety: different kinds of data are being
generated from various sources. Data comes in
all types of formats, from structured, numeric data in
traditional databases to unstructured text documents,
email, video, audio, stock ticker data, and financial
transactions.
The five V's of big data:
• Volume
• Variety
• Velocity
• Value
• Veracity
Big data as an opportunity
Problems of Big Data
• Storing exponentially growing, huge datasets
• Spending ever more money on servers as data grows
• Processing data with complex structure
• Processing data faster
[Diagram illustrating the problems of big data]
DATA NODES:
• Slave daemons
• Store the actual data
• Serve read and write requests
• Data is replicated and stored across different nodes
• Perform low-level reads and writes
• Send heartbeats to the NameNode
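The heartbeat relationship between DataNodes and the NameNode can be modelled with a minimal sketch. This is a simplified, in-process model, not HDFS's real RPC protocol; the class names mirror the HDFS roles and the timeout value is an assumption for illustration:

```python
import time

class NameNode:
    """Master: tracks which DataNodes have reported in recently."""
    def __init__(self, timeout=3.0):
        self.timeout = timeout          # seconds before a node is presumed dead
        self.last_heartbeat = {}        # node_id -> time of last heartbeat

    def receive_heartbeat(self, node_id):
        self.last_heartbeat[node_id] = time.monotonic()

    def live_nodes(self):
        """Nodes whose last heartbeat is within the timeout window."""
        now = time.monotonic()
        return [n for n, t in self.last_heartbeat.items()
                if now - t < self.timeout]

class DataNode:
    """Slave: periodically tells the NameNode it is still alive."""
    def __init__(self, node_id, name_node):
        self.node_id = node_id
        self.name_node = name_node

    def send_heartbeat(self):
        self.name_node.receive_heartbeat(self.node_id)

nn = NameNode()
for node_id in ("dn1", "dn2"):
    DataNode(node_id, nn).send_heartbeat()
print(sorted(nn.live_nodes()))  # → ['dn1', 'dn2']
```

A node that stops sending heartbeats drops out of `live_nodes()` after the timeout, which is how the NameNode knows to re-replicate that node's blocks elsewhere.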
HDFS BLOCKS
SECONDARY NAME NODE
APPLICATION MASTER
• The application manager accepts the job submission
• Negotiates containers for executing application-specific tasks
• The Application Master manages the application and monitors its progress
• Application Masters are daemons which reside on the data nodes
• They communicate with containers for the execution of tasks on
each data node
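The submit, negotiate, and monitor flow above can be sketched as a toy simulation. These are illustrative Python classes named after the YARN roles, not the real YARN API:

```python
class Container:
    """A slice of CPU/memory on one data node where a task runs."""
    def __init__(self, node):
        self.node = node

    def run(self, task):
        return f"{task} ran on {self.node}"

class ResourceManager:
    """Accepts job submissions and grants containers on data nodes."""
    def __init__(self, data_nodes):
        self.data_nodes = data_nodes

    def negotiate_containers(self, num):
        # Grant one container per request, spread round-robin over nodes.
        return [Container(self.data_nodes[i % len(self.data_nodes)])
                for i in range(num)]

class ApplicationMaster:
    """Per-application daemon: negotiates containers, dispatches tasks."""
    def __init__(self, resource_manager):
        self.rm = resource_manager

    def execute(self, tasks):
        containers = self.rm.negotiate_containers(len(tasks))
        return [c.run(t) for c, t in zip(containers, tasks)]

am = ApplicationMaster(ResourceManager(["node-1", "node-2"]))
results = am.execute(["map-0", "map-1"])
print(results[0])  # → map-0 ran on node-1
```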
COMPONENTS EXPLAINED
YARN AND HDFS STRUCTURE COMBINED
Processing data faster
• Provides parallel processing of the data present in HDFS
• Processes data locally, i.e. each node works on the part of the data
that is stored on it.
MAP REDUCE
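Data locality, each worker processing only the block it holds, can be sketched with Python's standard thread pool. Here threads stand in for cluster nodes and small lists stand in for HDFS blocks; this is an analogy, not how Hadoop schedules tasks:

```python
from concurrent.futures import ThreadPoolExecutor

def process_local_block(block):
    """Each worker processes only the block co-located with it,
    mirroring Hadoop's data locality: move compute to the data."""
    return sum(block)

# Four HDFS-style blocks of one data set, one per worker.
blocks = [[1, 2], [3, 4], [5, 6], [7, 8]]
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process_local_block, blocks))

# Only the small partial results travel back to be combined.
total = sum(partials)
print(total)  # → 36
```

Because each block is processed where it sits, the expensive step (reading the data) never crosses the network; only the tiny partial sums do.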