Data Types
Structured: SQL databases, data warehouses, enterprise systems (CRM, ERP etc.)
Unstructured: analog data, GPS tracking information, audio and video streams
Semi-Structured: XML, e-mails, EDI
[Chart: data in petabytes (axis 100000 to 600000) by year, 2008 to 2020; series: Structured, Semi Structured, Unstructured]
Hadoop History
Dec
Terms
Google system    Hadoop equivalent
MapReduce        Hadoop
GFS              HDFS
Bigtable         HBase
Chubby           ZooKeeper
HDFS
Single
HDFS Architecture
NameNode Metadata
Metadata in memory
-- The entire metadata is held in main memory
-- No demand paging of metadata
Types of metadata
-- List of files
-- List of blocks for each file
-- List of DataNodes for each block
-- File attributes, e.g. creation time, replication factor
A transaction log
-- Records file creations, file deletions, etc.
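The metadata types listed above can be pictured as a few in-memory maps plus an append-only log. This is a minimal sketch for illustration, not actual HDFS code; the class and method names (`NameNodeMetadata`, `create_file`, `delete_file`) are hypothetical.

```python
import time
from dataclasses import dataclass, field


@dataclass
class FileAttributes:
    creation_time: float
    replication_factor: int


@dataclass
class NameNodeMetadata:
    """Toy in-memory model of NameNode metadata (hypothetical, not real HDFS code)."""
    file_blocks: dict = field(default_factory=dict)      # file path -> list of block ids
    block_locations: dict = field(default_factory=dict)  # block id -> set of DataNode names
    file_attrs: dict = field(default_factory=dict)       # file path -> FileAttributes
    transaction_log: list = field(default_factory=list)  # append-only record of changes

    def create_file(self, path, block_ids, replication_factor=3):
        # Register the file, its blocks, and its attributes, then log the change.
        self.file_blocks[path] = list(block_ids)
        self.file_attrs[path] = FileAttributes(time.time(), replication_factor)
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set())
        self.transaction_log.append(("create", path))

    def delete_file(self, path):
        # Drop the file and its blocks from every map, then log the change.
        for block_id in self.file_blocks.pop(path):
            self.block_locations.pop(block_id, None)
        del self.file_attrs[path]
        self.transaction_log.append(("delete", path))

    def list_files(self):
        return sorted(self.file_blocks)
```

Because every map lives in ordinary process memory, lookups never touch disk; only the transaction log would need to be persisted to survive a restart.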
DataNode
A block server
-- Stores data in the local file system (e.g. ext3)
-- Stores metadata of a block (e.g. CRC)
-- Serves data and metadata to clients
Block report
-- Periodically sends a report of all existing blocks to the NameNode
Facilitates pipelining of data
-- Forwards data to other specified DataNodes
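The DataNode responsibilities above (block storage with a checksum, block reports, and pipelined forwarding) can be sketched in a few lines. This is a simplified illustration under stated assumptions: blocks are kept in a dict rather than a local file system, and the `DataNode` class and its methods are hypothetical names, not the HDFS API.

```python
import zlib


class DataNode:
    """Toy block server: stores blocks with a CRC and forwards them down a pipeline."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> (data bytes, CRC32 checksum)

    def write_block(self, block_id, data, pipeline=()):
        # Store the block locally, then forward it to the next DataNode, if any.
        self.blocks[block_id] = (data, zlib.crc32(data))
        if pipeline:
            pipeline[0].write_block(block_id, data, pipeline[1:])

    def read_block(self, block_id):
        # Serve the data only if it still matches the stored checksum.
        data, crc = self.blocks[block_id]
        if zlib.crc32(data) != crc:
            raise IOError(f"checksum mismatch for {block_id} on {self.name}")
        return data

    def block_report(self):
        # The list of all blocks on this node, as sent periodically to the NameNode.
        return sorted(self.blocks)
```

A client would call `write_block` on the first node only, passing the remaining nodes as `pipeline`; each node stores its copy and hands the data to the next one.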
Block Placement
Current strategy
-- One replica on the local node
-- Second replica on a remote rack
-- Third replica on the same remote rack
-- Additional replicas are randomly placed
Clients read from the nearest replica
Would like to make this policy pluggable
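The placement strategy above can be expressed as a small function. This is a sketch, not the HDFS implementation: it assumes the writer is itself a DataNode, that at least two racks exist, and that the chosen remote rack has at least two nodes. The names `place_replicas` and `nearest_replica` are hypothetical.

```python
import random


def place_replicas(writer, nodes_by_rack, replication=3, rng=random):
    """Pick nodes for one block: local node, remote rack, same remote rack, then random.

    writer: name of the node writing the block (assumed to be a DataNode).
    nodes_by_rack: dict mapping rack name -> list of node names.
    """
    rack_of = {n: r for r, nodes in nodes_by_rack.items() for n in nodes}
    chosen = [writer]  # 1st replica: the local node
    if replication >= 2:
        remote_racks = [r for r in nodes_by_rack if r != rack_of[writer]]
        remote_rack = rng.choice(remote_racks)
        chosen.append(rng.choice(nodes_by_rack[remote_rack]))  # 2nd: a remote rack
    if replication >= 3:
        same_rack = [n for n in nodes_by_rack[remote_rack] if n not in chosen]
        chosen.append(rng.choice(same_rack))  # 3rd: another node on that remote rack
    # Additional replicas: random nodes not used yet.
    remaining = [n for n in rack_of if n not in chosen]
    rng.shuffle(remaining)
    chosen.extend(remaining[: max(0, replication - len(chosen))])
    return chosen[:replication]


def nearest_replica(reader, replicas, rack_of):
    """Prefer a replica on the reader's node, then on its rack, then anywhere."""
    def distance(node):
        if node == reader:
            return 0
        return 1 if rack_of[node] == rack_of[reader] else 2
    return min(replicas, key=distance)
```

Passing the policy in as a function is also one simple way to picture what "pluggable" could mean here: callers would accept any function with the same signature as `place_replicas`.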
MapReduce
Compute to data, instead of data to compute
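One way to picture moving compute to data is a scheduler that prefers a node already holding the block a task needs. The sketch below is an illustration only, not Hadoop's scheduler; `schedule_tasks` and its parameters are hypothetical, and it assumes the cluster has enough free slots for all tasks.

```python
def schedule_tasks(block_locations, free_slots):
    """Assign one task per block, preferring a node that already stores the block.

    block_locations: dict block id -> list of node names holding a replica.
    free_slots: dict node name -> number of tasks the node can still accept.
    Returns dict block id -> (node name, is_data_local).
    """
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignment = {}
    for block_id, holders in block_locations.items():
        local = [n for n in holders if slots.get(n, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # No holder has capacity: use any free node; the data must then travel.
            node = next(n for n, free in slots.items() if free > 0)
            is_local = False
        slots[node] -= 1
        assignment[block_id] = (node, is_local)
    return assignment
```

Tasks marked data-local read their block from local disk, so only the (usually much smaller) results cross the network.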
Hadoop Cluster
MapReduce
MapReduce flow
Input:This
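The MapReduce flow (map, shuffle, reduce) can be illustrated with the classic word count, run here in a single process. This is a sketch of the programming model only, not of Hadoop's API; the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    # Map: emit a (word, 1) pair for every word in one input record.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Shuffle: group all values by key, as happens between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reduce: combine all values for one key into a single result.
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

In a real cluster the map calls would run in parallel on the nodes holding the input blocks, and the shuffle would move each key's values to the node running its reducer.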
HDFS
INTERACTION, VISUALIZATION,
EXECUTION, DEVELOPMENT
Built
HCatalog
Lucene
Apache Hama
A Simple
Crunch
Other Tools
SPARK:
Windows Application
Fact #1.
Hadoop
Fact #2.
Hadoop
Fact #3.
Hadoop
Fact #4.
HDFS
Fact #5.
Hive
Fact #6.
Hadoop
Fact #7.
MapReduce
Fact #8.
Hadoop
Fact #9.
Hadoop
Fact #10.
Hadoop
THANKS