
Hadoop

A Hands-on Introduction
Claudio Martella Elia Bruni
9 November 2011

Tuesday, November 8, 11

Outline
- What is Hadoop
- Why is Hadoop
- How is Hadoop
- Hadoop & Python
- Some NLP code
- A more complicated problem: Eva

A bit of Context
- 2003: first MapReduce library @ Google
- 2003: GFS paper
- 2004: MapReduce paper
- 2005: Apache Nutch uses MapReduce
- 2006: Hadoop was born
- 2007: first 1000-node cluster at Y!

An Ecosystem
- HDFS & MapReduce
- Zookeeper
- HBase
- Pig & Hive
- Mahout
- Giraph
- Nutch

Traditional way
- Design a high-level schema
- You store data in an RDBMS
- Which has very poor write throughput
- And doesn't scale very much
- When you talk about terabytes of data
- Expensive data warehouse

BigData & NoSQL


- Store first, think later
- Schema-less storage
- Analytics
- Petabyte scale
- Offline processing

Vertical Scalability
- Extremely expensive
- Requires expertise in distributed systems and concurrent programming
- Lacks real fault-tolerance

Horizontal Scalability
- Built on top of commodity hardware
- Easy-to-use programming paradigms
- Fault-tolerance through replication

1st Assumptions
- Data to process does not fit on one node.
- Each node is commodity hardware.
- Failure happens.

Therefore: spread your data among your nodes and replicate it.

2nd Assumptions
- Moving computation is cheap.
- Moving data is expensive.
- Distributed computing is hard.

Therefore: move computation to the data, with a simple paradigm.

3rd Assumptions
- Systems run on spinning hard disks.
- Disk seek >> disk scan.
- Many small files are expensive.

Therefore: base the paradigm on scanning large files.

Typical Problem
- Collect and iterate over many records
- Filter and extract something from each
- Shuffle & sort these intermediate results
- Group by and aggregate them
- Produce final output set

Typical Problem

MAP:
- Collect and iterate over many records
- Filter and extract something from each

Shuffle & sort these intermediate results

REDUCE:
- Group by and aggregate them
- Produce final output set

Quick example

Input (an Apache access-log line):

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

Map output:

(frank, index.html)
(index.html, 10/Oct/2000)
(index.html, http://www.example.com/start.html)
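The extraction above can be sketched as a Hadoop Streaming-style mapper in Python. This is a minimal illustration, not code from the talk: the regex is a simplification tailored to the Common Log Format line shown (real access logs vary), and `map_line` is a hypothetical helper name.

```python
import re

# Simplified pattern for the Common Log Format line on the slide.
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<date>[^:]+):[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d+) \S+ '
    r'"(?P<referrer>[^"]*)"'
)

def map_line(line):
    """Emit the (key, value) pairs from the slide for one log line."""
    m = LOG_RE.match(line)
    if not m:
        return []                       # skip malformed lines
    path = m.group('path').lstrip('/')
    return [
        (m.group('user'), path),
        (path, m.group('date')),
        (path, m.group('referrer')),
    ]

if __name__ == '__main__':
    # In Hadoop Streaming this loop would read sys.stdin and print
    # tab-separated key/value pairs; here we just demo one line.
    demo = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
            '"GET /index.html HTTP/1.0" 200 2326 '
            '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"')
    for k, v in map_line(demo):
        print(f'{k}\t{v}')
```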

MapReduce
Programmers define two functions:

map (key, value) -> (key, value)*
reduce (key, [value+]) -> (key, value)*

Can also define:

combine (key, value) -> (key, value)*
partitioner: key -> partition
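The two contracts can be exercised with the classic word-count example. The sketch below is not from the slides; it simulates the framework's shuffle-and-sort in plain Python so that map and reduce can be run locally.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(key, value):
    """map(key, value) -> (key, value)*  -- here: one (word, 1) per word."""
    for word in value.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):
    """reduce(key, [value+]) -> (key, value)*  -- here: sum the counts."""
    yield (key, sum(values))

def run_mapreduce(records):
    """Simulate the framework: map, shuffle & sort, then reduce."""
    intermediate = [kv for k, v in records for kv in map_fn(k, v)]
    intermediate.sort(key=itemgetter(0))           # shuffle & sort by key
    out = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        out.extend(reduce_fn(key, (v for _, v in group)))
    return out

print(run_mapreduce([(0, 'the cat'), (1, 'the dog')]))
# [('cat', 1), ('dog', 1), ('the', 2)]
```

A combiner would have the same shape as `reduce_fn`, applied to each mapper's local output before the shuffle to cut network traffic.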

[Figure: MapReduce data flow. Input pairs (k1 v1) ... (k6 v6) are read by four map tasks, which emit intermediate pairs such as (a 1), (b 2), (c 3), (c 6), (a 5), (b 7). "Shuffle and Sort: aggregate values by keys" groups the values per key (a, b, c). Three reduce tasks then produce the output pairs (r1 s1), (r2 s2), (r3 s3).]

MapReduce daemons
JobTracker: it's the Master; it schedules the jobs, assigns tasks to nodes, collects heart-beats from the workers, and reschedules tasks for fault-tolerance.

TaskTracker: it's the Worker; it runs on each slave and runs (multiple) Mappers and Reducers, each in their own JVM.

[Figure: MapReduce execution overview, redrawn from (Dean and Ghemawat, OSDI 2004). (1) The user program forks the Master and the workers; (2) the Master assigns map and reduce tasks; (3) map workers read the input splits; (4) and write intermediate files to local disk; (5) reduce workers remotely read those intermediate files; (6) and write the output files. Overall flow: input files -> map phase -> intermediate files (on local disk) -> reduce phase -> output files.]

HDFS daemons
NameNode: it's the Master; it keeps the filesystem metadata (in-memory) and the file-block-node mapping, decides replication and block placement, and collects heart-beats from the nodes.

DataNode: it's the Slave; it stores the blocks (64 MB) of the files and serves reads and writes directly.
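The 64 MB block size allows some quick capacity math. A small sketch (assuming the HDFS default replication factor of 3, which the slide does not state):

```python
# Rough capacity math for HDFS: 64 MB blocks (from the slide) and an
# assumed replication factor of 3 (the HDFS default).
BLOCK_MB = 64
REPLICATION = 3

file_mb = 1024 * 1024                     # a 1 TB file
blocks = -(-file_mb // BLOCK_MB)          # ceiling division
raw_storage_gb = blocks * BLOCK_MB * REPLICATION / 1024

print(blocks)                             # blocks the NameNode must track
print(raw_storage_gb)                     # raw disk used across the cluster, in GB
```

Since the NameNode keeps per-block metadata in memory, this count is also why "many small files are expensive", as the third assumption noted.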

[Figure: GFS architecture, redrawn from (Ghemawat et al., SOSP 2003). The application's GFS client sends (file name, chunk index) to the GFS master, which keeps the file namespace (e.g. /foo/bar -> chunk 2ef0) and replies with (chunk handle, chunk locations). The client then sends (chunk handle, byte range) to a GFS chunkserver, which stores chunks in its local Linux file system and returns the chunk data. The master also sends instructions to the chunkservers and collects chunkserver state.]

Transparent to
- Workers-to-data assignment
- Map / Reduce assignment to nodes
- Management of synchronization
- Management of communication
- Fault-tolerance and restarts

Take home recipe


- Scan-based computation (no random I/O)
- Big datasets
- Divide-and-conquer class algorithms
- No communication between tasks

Not good for


- Real-time / stream processing
- Graph processing
- Computation without locality
- Small datasets

Questions?


Baseline solution


What we attacked
- You don't want to parse the file many times
- You don't want to re-calculate the norm
- You don't want to calculate 0*n

Our solution
[Figure: a mostly-zero dense matrix next to its sparse encoding, one row per line.]

line format: <string> <norm> [<col> <value>]*
for example: cat 12.1313 0 5.1 3 4.6 5 10
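The slides don't show how the sparse rows are consumed, but the pre-computed norm suggests a similarity measure such as cosine. A sketch under that assumption, with hypothetical helper names `parse_line` and `cosine`; note how iterating only over stored entries is what avoids the 0*n products:

```python
def parse_line(line):
    """Parse '<string> <norm> [<col> <value>]*' into (word, norm, {col: value})."""
    parts = line.split()
    word, norm = parts[0], float(parts[1])
    return word, norm, {int(c): float(v)
                        for c, v in zip(parts[2::2], parts[3::2])}

def cosine(row_a, row_b):
    """Cosine similarity of two parsed rows; zero entries never contribute."""
    _, norm_a, vec_a = row_a
    _, norm_b, vec_b = row_b
    if len(vec_a) > len(vec_b):          # iterate over the smaller dict,
        vec_a, vec_b = vec_b, vec_a      # so 0 * n products are skipped
    dot = sum(v * vec_b.get(c, 0.0) for c, v in vec_a.items())
    return dot / (norm_a * norm_b)

cat = parse_line('cat 12.1313 0 5.1 3 4.6 5 10')
print(cosine(cat, cat))                  # ~1.0: a vector against itself
```

As a sanity check, 12.1313 is indeed the Euclidean norm of (5.1, 4.6, 10), matching the slide's example line.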

Benchmarking
serial python (single-core): 7 minutes java+hadoop (single-core): 2 minutes serial python (big le): 18 days java+hadoop (parallel, big le): 8 hours it makes sense: 18d / 3.5 = 5.14d / 14 = 8h
28
Tuesday, November 8, 11
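The back-of-the-envelope arithmetic checks out (assuming the 3.5x single-core speedup comes from the 7 min vs. 2 min measurement, and that the cluster provides the 14-way parallelism implied by the "/ 14"):

```python
# Sanity-check of the speedup arithmetic on the slide.
serial_python_days = 18
single_core_speedup = 7 / 2              # Java+Hadoop vs. serial Python, 3.5x
parallel_slots = 14                      # assumed degree of parallelism

hadoop_hours = serial_python_days / single_core_speedup / parallel_slots * 24
print(round(hadoop_hours, 1))            # close to the measured 8 hours
```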
