
Hadoop

A Hands-on Introduction
Claudio Martella Elia Bruni
9 November 2011

Tuesday, November 8, 11

Outline
- What is Hadoop
- Why is Hadoop
- How is Hadoop
- Hadoop & Python
- Some NLP code
- A more complicated problem: Eva

A bit of Context
- 2003: first MapReduce library @ Google
- 2003: GFS paper
- 2004: MapReduce paper
- 2005: Apache Nutch uses MapReduce
- 2006: Hadoop was born
- 2007: first 1000-node cluster at Y!

An Ecosystem
- HDFS & MapReduce
- Zookeeper
- HBase
- Pig & Hive
- Mahout
- Giraph
- Nutch

Traditional way
- Design a high-level schema
- You store data in an RDBMS
- Which has very poor write throughput
- And doesn't scale very much
- When you talk about terabytes of data
- Expensive data warehouse

BigData & NoSQL


- Store first, think later
- Schema-less storage
- Analytics
- Petabyte scale
- Offline processing

Vertical Scalability
- Extremely expensive
- Requires expertise in distributed systems and concurrent programming
- Lacks real fault-tolerance

Horizontal Scalability
- Built on top of commodity hardware
- Easy-to-use programming paradigms
- Fault-tolerance through replication

1st Assumptions
- Data to process does not fit on one node.
- Each node is commodity hardware.
- Failure happens.

Therefore: spread your data among your nodes and replicate it.

2nd Assumptions
- Moving computation is cheap.
- Moving data is expensive.
- Distributed computing is hard.

Therefore: move computation to the data, with a simple paradigm.

3rd Assumptions
- Systems run on spinning hard disks.
- Disk seek >> disk scan.
- Many small files are expensive.

Therefore: base the paradigm on scanning large files.

Typical Problem
- Collect and iterate over many records
- Filter and extract something from each
- Shuffle & sort these intermediate results
- Group by and aggregate them
- Produce final output set

Typical Problem

MAP:
- Collect and iterate over many records
- Filter and extract something from each

Shuffle & sort these intermediate results

REDUCE:
- Group by and aggregate them
- Produce final output set

Quick example

Input (an Apache access-log line):

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

Map output:

(frank, index.html)
(index.html, 10/Oct/2000)
(index.html, http://www.example.com/start.html)
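The extraction above can be sketched as a Hadoop Streaming-style mapper in Python. This is a minimal illustration, not code from the talk: the regex is a simplification tailored to the Common Log Format line shown (real access logs vary), and `map_line` is a hypothetical helper name.

```python
import re

# Simplified pattern for the Common Log Format line on the slide.
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<date>[^:]+):[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d+) \S+ '
    r'"(?P<referrer>[^"]*)"'
)

def map_line(line):
    """Emit the (key, value) pairs from the slide for one log line."""
    m = LOG_RE.match(line)
    if not m:
        return []                       # skip malformed lines
    path = m.group('path').lstrip('/')
    return [
        (m.group('user'), path),
        (path, m.group('date')),
        (path, m.group('referrer')),
    ]

if __name__ == '__main__':
    # In Hadoop Streaming this loop would read sys.stdin and print
    # tab-separated key/value pairs; here we just demo one line.
    demo = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
            '"GET /index.html HTTP/1.0" 200 2326 '
            '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"')
    for k, v in map_line(demo):
        print(f'{k}\t{v}')
```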

MapReduce
Programmers define two functions:

map (key, value) -> (key, value)*
reduce (key, [value+]) -> (key, value)*

Can also define:

combine (key, value) -> (key, value)*
partitioner: key -> partition
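The two contracts can be exercised with the classic word-count example. The sketch below is not from the slides; it simulates the framework's shuffle-and-sort in plain Python so that map and reduce can be run locally.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(key, value):
    """map(key, value) -> (key, value)*  -- here: one (word, 1) per word."""
    for word in value.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):
    """reduce(key, [value+]) -> (key, value)*  -- here: sum the counts."""
    yield (key, sum(values))

def run_mapreduce(records):
    """Simulate the framework: map, shuffle & sort, then reduce."""
    intermediate = [kv for k, v in records for kv in map_fn(k, v)]
    intermediate.sort(key=itemgetter(0))           # shuffle & sort by key
    out = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        out.extend(reduce_fn(key, (v for _, v in group)))
    return out

print(run_mapreduce([(0, 'the cat'), (1, 'the dog')]))
# [('cat', 1), ('dog', 1), ('the', 2)]
```

A combiner would have the same shape as `reduce_fn`, applied to each mapper's local output before the shuffle to cut network traffic.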

[Figure: MapReduce data flow. Input pairs (k1 v1) ... (k6 v6) are read by four map tasks, which emit intermediate pairs such as (a 1), (b 2), (c 3), (c 6), (a 5), (b 7). "Shuffle and Sort: aggregate values by keys" groups the values per key (a, b, c). Three reduce tasks then produce the output pairs (r1 s1), (r2 s2), (r3 s3).]

MapReduce daemons
JobTracker: it's the Master; it schedules the jobs, assigns tasks to nodes, collects heart-beats from the workers, and reschedules tasks for fault-tolerance.

TaskTracker: it's the Worker; it runs on each slave and runs (multiple) Mappers and Reducers, each in their own JVM.

[Figure: MapReduce execution overview, redrawn from (Dean and Ghemawat, OSDI 2004). (1) The user program forks the Master and the workers; (2) the Master assigns map and reduce tasks; (3) map workers read the input splits; (4) and write intermediate files to local disk; (5) reduce workers remotely read those intermediate files; (6) and write the output files. Overall flow: input files -> map phase -> intermediate files (on local disk) -> reduce phase -> output files.]

HDFS daemons
NameNode: it's the Master; it keeps the filesystem metadata (in-memory) and the file-block-node mapping, decides replication and block placement, and collects heart-beats from the nodes.

DataNode: it's the Slave; it stores the blocks (64 MB) of the files and serves reads and writes directly.
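The 64 MB block size allows some quick capacity math. A small sketch (assuming the HDFS default replication factor of 3, which the slide does not state):

```python
# Rough capacity math for HDFS: 64 MB blocks (from the slide) and an
# assumed replication factor of 3 (the HDFS default).
BLOCK_MB = 64
REPLICATION = 3

file_mb = 1024 * 1024                     # a 1 TB file
blocks = -(-file_mb // BLOCK_MB)          # ceiling division
raw_storage_gb = blocks * BLOCK_MB * REPLICATION / 1024

print(blocks)                             # blocks the NameNode must track
print(raw_storage_gb)                     # raw disk used across the cluster, in GB
```

Since the NameNode keeps per-block metadata in memory, this count is also why "many small files are expensive", as the third assumption noted.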

[Figure: GFS architecture, redrawn from (Ghemawat et al., SOSP 2003). The application's GFS client sends (file name, chunk index) to the GFS master, which keeps the file namespace (e.g. /foo/bar -> chunk 2ef0) and replies with (chunk handle, chunk locations). The client then sends (chunk handle, byte range) to a GFS chunkserver, which stores chunks in its local Linux file system and returns the chunk data. The master also sends instructions to the chunkservers and collects chunkserver state.]

Transparent to
- Workers-to-data assignment
- Map / Reduce assignment to nodes
- Management of synchronization
- Management of communication
- Fault-tolerance and restarts

Take home recipe


- Scan-based computation (no random I/O)
- Big datasets
- Divide-and-conquer class algorithms
- No communication between tasks

Not good for


- Real-time / stream processing
- Graph processing
- Computation without locality
- Small datasets

Questions?


Baseline solution


What we attacked
- You don't want to parse the file many times
- You don't want to re-calculate the norm
- You don't want to calculate 0*n

Our solution
[Figure: a mostly-zero dense matrix next to its sparse encoding, one row per line.]

line format: <string> <norm> [<col> <value>]*
for example: cat 12.1313 0 5.1 3 4.6 5 10
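The slides don't show how the sparse rows are consumed, but the pre-computed norm suggests a similarity measure such as cosine. A sketch under that assumption, with hypothetical helper names `parse_line` and `cosine`; note how iterating only over stored entries is what avoids the 0*n products:

```python
def parse_line(line):
    """Parse '<string> <norm> [<col> <value>]*' into (word, norm, {col: value})."""
    parts = line.split()
    word, norm = parts[0], float(parts[1])
    return word, norm, {int(c): float(v)
                        for c, v in zip(parts[2::2], parts[3::2])}

def cosine(row_a, row_b):
    """Cosine similarity of two parsed rows; zero entries never contribute."""
    _, norm_a, vec_a = row_a
    _, norm_b, vec_b = row_b
    if len(vec_a) > len(vec_b):          # iterate over the smaller dict,
        vec_a, vec_b = vec_b, vec_a      # so 0 * n products are skipped
    dot = sum(v * vec_b.get(c, 0.0) for c, v in vec_a.items())
    return dot / (norm_a * norm_b)

cat = parse_line('cat 12.1313 0 5.1 3 4.6 5 10')
print(cosine(cat, cat))                  # ~1.0: a vector against itself
```

As a sanity check, 12.1313 is indeed the Euclidean norm of (5.1, 4.6, 10), matching the slide's example line.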

Benchmarking
serial python (single-core): 7 minutes java+hadoop (single-core): 2 minutes serial python (big le): 18 days java+hadoop (parallel, big le): 8 hours it makes sense: 18d / 3.5 = 5.14d / 14 = 8h
28
Tuesday, November 8, 11
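The back-of-the-envelope arithmetic checks out (assuming the 3.5x single-core speedup comes from the 7 min vs. 2 min measurement, and that the cluster provides the 14-way parallelism implied by the "/ 14"):

```python
# Sanity-check of the speedup arithmetic on the slide.
serial_python_days = 18
single_core_speedup = 7 / 2              # Java+Hadoop vs. serial Python, 3.5x
parallel_slots = 14                      # assumed degree of parallelism

hadoop_hours = serial_python_days / single_core_speedup / parallel_slots * 24
print(round(hadoop_hours, 1))            # close to the measured 8 hours
```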
