A Hands-on Introduction
Claudio Martella, Elia Bruni
9 November 2011
Tuesday, November 8, 11
Outline
What is Hadoop
Why is Hadoop
How is Hadoop
Hadoop & Python
Some NLP code
A more complicated problem: Eva
A bit of Context
2003: first MapReduce library @ Google
2003: GFS paper
2004: MapReduce paper
2005: Apache Nutch uses MapReduce
2006: Hadoop was born
2007: first 1000-node cluster at Y!
An Ecosystem
HDFS & MapReduce
ZooKeeper
HBase
Pig & Hive
Mahout
Giraph
Nutch
Traditional way
Design a high-level schema
You store your data in an RDBMS
Which has very poor write throughput
And doesn't scale very well
When you talk about terabytes of data
Expensive Data Warehouse
Vertical Scalability
Extremely expensive
Requires expertise in distributed systems and concurrent programming
Horizontal Scalability
Built on top of commodity hardware
Easy-to-use programming paradigms
Fault-tolerance through replication
1st Assumptions
Data to process does not fit on one node.
Each node is commodity hardware.
Failure happens.
Spread your data among your nodes and replicate it.
2nd Assumptions
Moving computation is cheap.
Moving data is expensive.
Distributed computing is hard.
Move computation to the data, with a simple paradigm.
3rd Assumptions
Systems run on spinning hard disks.
Disk seek >> disk scan.
Many small files are expensive.

Base the paradigm on scanning large files.
Typical Problem
Collect and iterate over many records
Filter and extract something from each
Shuffle & sort these intermediate results
Group-by and aggregate them
Produce final output set
Typical Problem

MAP:
Collect and iterate over many records
Filter and extract something from each

Shuffle & sort these intermediate results

REDUCE:
Group-by and aggregate them
Produce final output set
Quick example
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"
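The line above is in the Apache Combined Log Format. A minimal sketch of how such a line could be parsed in Python before feeding it to a map function (the regex and field names are our own, not from the slides):

```python
import re

# Apache Combined Log Format: ip, ident, user, timestamp,
# request line, status code, response size (plus referrer and agent).
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /index.html HTTP/1.0" 200 2326 '
        '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"')

fields = parse_line(line)
print(fields['ip'], fields['status'], fields['size'])  # → 127.0.0.1 200 2326
```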
MapReduce
Programmers define two functions:

map: (k1, v1) → list(k2, v2)
reduce: (k2, list(v2)) → list(v3)

[Figure: data flow — input pairs (k1,v1) … (k6,v6) go through four map tasks, producing the intermediate pairs (a,1), (b,2), (c,3), (c,6), (a,5), (b,7); after the shuffle & sort groups them by key, three reduce tasks emit the output pairs (r1,s1), (r2,s2), (r3,s3).]
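The two functions can be sketched in plain Python. The word count below simulates the map, shuffle & sort, and reduce phases in-process (the word-count task and all names here are illustrative, not from the slides):

```python
from itertools import groupby

# map: (k1, v1) -> list(k2, v2) -- here: (offset, line) -> (word, 1) pairs
def map_fn(_, line):
    for word in line.split():
        yield word.lower(), 1

# reduce: (k2, list(v2)) -> list(v3) -- here: (word, counts) -> (word, total)
def reduce_fn(word, counts):
    yield word, sum(counts)

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: apply map_fn to every input record.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    # Shuffle & sort: bring all values for the same key together.
    intermediate.sort(key=lambda kv: kv[0])
    # Reduce phase: one reduce_fn call per distinct key.
    output = []
    for key, group in groupby(intermediate, key=lambda kv: kv[0]):
        output.extend(reduce_fn(key, (v for _, v in group)))
    return output

lines = [(0, "the quick brown fox"), (1, "the lazy dog")]
print(run_mapreduce(lines, map_fn, reduce_fn))
```

In real Hadoop the map and reduce calls run on different nodes and the shuffle happens over the network; the structure of the two user-supplied functions is the same.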
MapReduce daemons
JobTracker: it's the Master; it schedules the jobs, assigns tasks to nodes, collects heart-beats from the workers, and reschedules tasks for fault-tolerance.

TaskTracker: runs on each slave; it runs (multiple) Mappers and Reducers, each in their own JVM.
[Figure: MapReduce execution overview — the user program forks (1) a Master and the workers; the Master assigns (2) map and reduce tasks to workers; map workers read (3) the input files, and reduce workers write (6) output file 0 and output file 1: input files → map phase → reduce phase → output files.]
HDFS daemons
NameNode: it's the Master; it keeps the filesystem metadata (in-memory) and the file–block–node mapping, decides replication and block placement, and collects heart-beats from the nodes.

DataNode: runs on each slave; it stores the blocks (64MB) of the files and serves reads and writes directly.
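A toy sketch of the block-splitting and replica-placement idea (the round-robin placement is a stand-in of our own; the real NameNode is rack-aware):

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, the default block size mentioned above
REPLICATION = 3                 # Hadoop's default replication factor

def split_into_blocks(file_size):
    """Number of HDFS blocks a file of file_size bytes occupies."""
    return max(1, math.ceil(file_size / BLOCK_SIZE))

def place_replicas(block_id, nodes, replication=REPLICATION):
    """Toy placement: spread each block's replicas over distinct nodes."""
    return [nodes[(block_id + i) % len(nodes)] for i in range(replication)]

nodes = ["node1", "node2", "node3", "node4"]
n_blocks = split_into_blocks(200 * 1024 * 1024)   # a 200 MB file
print(n_blocks)                                    # → 4 blocks
for b in range(n_blocks):
    print(b, place_replicas(b, nodes))
```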
[Figure: GFS architecture — the GFS master keeps the file namespace (e.g. /foo/bar → chunk 2ef0) and sends instructions to the chunkservers, which report their state back; clients request a (chunk handle, byte range) and receive the chunk data directly from the chunkservers.]
Transparent to the programmer

Worker-to-data assignment
Map / Reduce task assignment to nodes
Management of synchronization
Management of communication
Fault-tolerance and restarts
Questions?
Baseline solution
What we attacked
You don't want to parse the file many times
You don't want to re-calculate the norm
You don't want to calculate 0*n
Our solution
[Figure: a matrix of co-occurrence vectors where most entries are 0 (e.g. a row like 0 1.3 0 1.2 0 0 0 7.1 1.1 0 0 3.4); only the non-zero values are stored and multiplied.]
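A sketch of the optimizations named above, assuming a cosine-similarity computation: store vectors sparsely so the 0*n products are never computed, and compute each norm once up front (the representation and function names are our own, not the talk's actual code):

```python
import math

def to_sparse(dense):
    """Store a vector as {index: value}, omitting zeros."""
    return {i: v for i, v in enumerate(dense) if v != 0}

def norm(sparse_vec):
    return math.sqrt(sum(v * v for v in sparse_vec.values()))

def cosine(a, b, norm_a, norm_b):
    """Cosine similarity of two sparse vectors with precomputed norms."""
    # Iterate only over the smaller vector's non-zero entries.
    if len(b) < len(a):
        a, b = b, a
    dot = sum(v * b.get(i, 0) for i, v in a.items())
    return dot / (norm_a * norm_b)

u = to_sparse([0, 1.3, 0, 1.2, 0, 0, 0, 7.1, 1.1, 0, 0, 3.4])
v = to_sparse([1.3, 1.2, 5.7, 5.1, 1.6, 7.1, 3.4, 1.1, 4.6, 2, 10, 1.1])
nu, nv = norm(u), norm(v)          # computed once, reused for every pair
print(round(cosine(u, v, nu, nv), 3))
```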
Benchmarking
serial python (single-core): 7 minutes
java+hadoop (single-core): 2 minutes
serial python (big file): 18 days
java+hadoop (parallel, big file): 8 hours
it makes sense: 18d / 3.5 = 5.14d, and 5.14d / 14 ≈ 8h
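The arithmetic on the last line can be sanity-checked: 3.5 is the java/python ratio measured on the small input, and 14 is the slide's degree of parallelism.

```python
# Speedup arithmetic from the benchmark slide.
serial_python_min = 7          # small input, single core
hadoop_single_min = 2          # same input, java + hadoop, single core
language_speedup = serial_python_min / hadoop_single_min   # = 3.5

big_file_serial_days = 18
cores = 14                     # degree of parallelism from the slide

predicted_hours = big_file_serial_days * 24 / language_speedup / cores
print(round(predicted_hours, 1))   # ≈ 8.8 hours, close to the measured 8h
```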