Carnegie Mellon
Spring 2010
15-319 Introduction to Cloud Computing

Introduction to Cloud Computing
Majd F. Sakr

MapReduce and Hadoop
17th Lecture, Mar 16th
Lecture Goals

- Transition to MapReduce from functional programming
- Understand the origins of MapReduce
- Explore the Hadoop API and interface
- One hands-on program to dive into Hadoop!
Lecture Outline

- Functional Programming Review and MapReduce
- Hadoop
  - Components
- MapReduce
  - Programming Model
  - Map and Reduce Functions
  - Terminology: Jobs and Tasks
- Java Implementations and Classes
  - Mapper and Reducer Classes
  - Writing a MapReduce Driver
  - Input, Output and Other Handler Classes
- Simple Program (Hands-On)
- Details of Hadoop
  - Handling Input and Output
  - Job Execution and Data Flow
  - Hadoop Optimizations
Functional Programming Review

- Functional operations do not modify data structures: they always create new ones
- The original data still exists in unmodified form
- Data flows are implicit in program design
- Order of operations does not matter
Functional Programming

- Lists of data are processed very frequently
- Iterators: operators over lists
  - Map
  - Fold
- A map operation can be applied to each element of a list independently
- If the operation performed in a fold is associative, the fold can be parallelized
Implicit Parallelism in Functional Programming

- In a purely functional setting, elements of a list being computed by map cannot see the effects of the computations on other elements
- If the order of application of f to elements in a list is commutative (and associative), we can reorder or parallelize execution of f
- This is the secret that MapReduce exploits
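The map/fold split above is exactly what Java streams expose: map applies f to each element independently, while a fold (reduce) with an associative operator may be regrouped and run in parallel. A small stdlib-only illustration (mine, not from the slides):

```java
import java.util.List;

public class MapFoldDemo {
    public static void main(String[] args) {
        List<Integer> xs = List.of(1, 2, 3, 4);

        // map: apply f to each element independently -> trivially parallelizable
        List<Integer> squared = xs.stream().map(x -> x * x).toList();

        // fold with an associative (and commutative) operator: the runtime
        // is free to regroup or reorder additions, so parallelStream is legal
        int sum = xs.parallelStream().reduce(0, Integer::sum);

        System.out.println(squared);  // [1, 4, 9, 16]
        System.out.println(sum);      // 10
    }
}
```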
Enter MapReduce

- Functional programming on top of distributed processing
- Motivated by the need to process large amounts of data using hundreds (or thousands) of processor nodes
- Data is central to the computation: place the program closest to the data it will be working on
- Provides a clean abstraction to programmers, similar to functional programming
- The interface deals with all the messy details
MapReduce History

- Developed by Google to simplify their data processing jobs on large data
- Details emerged from two published papers:
  - Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Proceedings of OSDI '04, 2004
  - Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File System," Proceedings of the Symposium on Operating Systems Principles, ACM SIGOPS, 2003
- Since Google's implementation is proprietary and not available to the public, an Apache project called Hadoop emerged as an open-source implementation
- Primary contributors: Yahoo!, Facebook
Motivating Example

- Processing web data on a single machine
  - 20+ billion web pages x 20 KB = 400+ terabytes
  - One computer can read 30-35 MB/sec from disk
    - ~four months just to read the web
  - ~1,000 hard drives just to store the web
  - Even more to do something with the data
- Takes too long on a single machine, but with 1,000 machines?
  - < 3 hours to perform on 1,000 machines
  - But how long to program? What about the overheads?
    - Communication, coordination, recovery from machine failure
    - Status reporting, debugging, optimization, locality
    - Reinventing the wheel: this has to be done for every program!
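The estimates above are easy to sanity-check. Using the slide's round figures (400 TB of pages, ~32.5 MB/sec average disk read rate, and assuming perfectly parallel reads across 1,000 machines), a few lines of Java reproduce the arithmetic:

```java
public class BackOfEnvelope {
    public static void main(String[] args) {
        double pages = 20e9;         // 20+ billion web pages
        double pageBytes = 20e3;     // 20 KB per page
        double diskRate = 32.5e6;    // midpoint of 30-35 MB/sec

        double totalBytes = pages * pageBytes;   // ~4e14 bytes = 400 TB
        double seconds = totalBytes / diskRate;  // single-machine read time

        System.out.printf("Total data: %.0f TB%n", totalBytes / 1e12);
        System.out.printf("One machine: %.0f days (~4.7 months)%n", seconds / 86400);
        System.out.printf("1000 machines: %.1f hours%n", seconds / 1000 / 3600);
    }
}
```

The single-machine figure comes out at roughly 140 days, and dividing across 1,000 machines brings it down to a few hours, consistent with the slide.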
MapReduce Features

- Automatic parallelization and distribution of work
- Fault tolerant
- Status and monitoring tools
- Clean abstraction for programmers
Typical Problem Solved by MapReduce

1. Read a lot of data
2. MAP: extract something you need from each record
3. Shuffle and sort
4. REDUCE: aggregate, summarize, filter or transform
5. Write the results

The outline stays the same; Map and Reduce change to fit the problem.
The model seems restrictive, but it is Turing complete. Multiple maps and reduces may be needed to solve a complex problem.
Programming Model

- Fundamentally similar to functional programming
- Users implement interfaces to the following two functions:

  map (in_key, in_value) ->
      (out_key, intermediate_value) list

  reduce (out_key, intermediate_value list) ->
      out_value list
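These two signatures can be sketched as plain Java interfaces. This is a Hadoop-free simplification of mine to make the types concrete (the actual Hadoop classes appear later in the lecture); the word-count-shaped lambdas in main are illustrative instances:

```java
import java.util.List;
import java.util.Map;

// map(in_key, in_value) -> (out_key, intermediate_value) list
interface MapFunction<IK, IV, OK, V> {
    List<Map.Entry<OK, V>> map(IK inKey, IV inValue);
}

// reduce(out_key, intermediate_value list) -> out_value list
interface ReduceFunction<OK, V, OV> {
    List<OV> reduce(OK outKey, List<V> intermediateValues);
}

public class SignatureDemo {
    public static void main(String[] args) {
        // Word-count-shaped instances: emit (word, 1), then sum the 1s.
        MapFunction<String, String, String, Integer> m =
            (file, word) -> List.of(Map.entry(word, 1));
        ReduceFunction<String, Integer, Integer> r =
            (word, ones) -> List.of(ones.stream().mapToInt(Integer::intValue).sum());

        System.out.println(m.map("doc1", "hadoop"));              // [hadoop=1]
        System.out.println(r.reduce("hadoop", List.of(1, 1, 1))); // [3]
    }
}
```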
A Typical MapReduce Program - I

[Figure: input data is fed to parallel map tasks, which emit K1/K2-keyed intermediate values; a shuffle-and-sort barrier groups them by key before the reduce tasks run]
A Typical MapReduce Program - II

[Figure: input key*value pairs from data stores 1..n feed map tasks, each emitting (key 1, values...), (key 2, values...), (key 3, values...); a barrier aggregates intermediate values by output key; reduce tasks then produce the final values for each key]
Parallelism

- Map functions run in parallel, creating intermediate values from each input data set
  - The programmer must specify a proper input split (chunk) between mappers to enable parallelism
- Reduce functions also run in parallel; each works on different output keys
  - The number of reducers is a key parameter that determines MapReduce performance
- All values are processed independently
- The reduce phase cannot start until the map phase is completely finished
Data Locality

- The master program creates tasks based on the location of the data; it tries to send map() tasks to the same machine, or at least the same rack
- map() task inputs are divided into 64 MB blocks (same as HDFS/GFS) or at file boundaries
- Minimizes communication at the network level
Fault Tolerance

- The master keeps track of the progress of each task and worker node
  - If a node fails, the master re-executes its completed as well as in-progress map tasks on other nodes that are alive
  - It also re-executes in-progress reduce tasks
- If particular input key/value pairs keep crashing
  - The master blacklists them and skips them on re-execution
- Tolerate small failures; allow the job to run on a best-effort basis
  - For large data sets containing potentially millions of records, we don't want to stop the computation because a few records fail to process correctly
  - The user can set the failure tolerance level
Optimizations

- Reduce tasks start only after all the map operations are complete, which creates a bottleneck
  - Use a combiner (a mini-reducer on the output of a map) to reduce the output and the amount of communication during sort and shuffle
- Speculative execution: a newer Hadoop feature that runs multiple attempts of the same task in parallel and uses the result of the first one to finish
- MapReduce programs are bound by the slowest element: the last mapper to finish, or the time taken for the reduce operations to finish
Time to Dive into Hadoop!
What is Hadoop?

- An open-source implementation of Google's distributed computing projects
- Includes open-source versions of MapReduce, BigTable, GFS, etc.
- Supported by the Apache Foundation
- Primary contributors: Yahoo!, Facebook
- A work in progress: lots of features are unstable
- Used by Yahoo! to run a 2000+ node distributed cluster, and by many other web companies as well
Quick Look at Hadoop Components

Google Calls It | Hadoop Equivalent | Description
MapReduce       | MapReduce         | Java implementation of the MapReduce programming model
GFS             | HDFS              | Distributed file system
Sawzall         | Pig, Hive         | Data flow language; distributed data warehouse and query engine
BigTable        | HBase             | Distributed database
Chubby          | ZooKeeper         | Distributed coordination service
Hadoop MapReduce

- The "meat" of Hadoop: an implementation of Google's MapReduce programming model
- Express complex tasks in the form of maps and reduces to enable large-scale processing of data
- Functional programming meets distributed computing using batch processing of data
- Benefits:
  - Automatic parallelization and distribution of data
  - Fault tolerance
  - Status and monitoring tools
  - A clean programming abstraction to help focus on solving the problem
MapReduce Programming Model

- Functional programming concepts of map and fold
- Users implement Map and Reduce functions:

  map (in_key, in_value) -> (out_key, intermediate_value) list

  reduce (out_key, intermediate_value list) -> out_value list
The Map Function

- Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key*value pairs: e.g., (filename, line)
- map() produces (emits) one or more intermediate values along with an output key from the input
Example: UpperCase Mapper

  let map(k, v) =
      emit(k.toUpper(), v.toUpper())

  (abcd, efgh)   -> (ABCD, EFGH)
  (CMUq, hadoop) -> (CMUQ, HADOOP)
  (foo, bar)     -> (FOO, BAR)
Example: Filter Mapper

  let map(k, v) =
      if (v % 2 == 0) then emit(k, v)

  (foo, 7)   -> (nothing)
  (test, 10) -> (test, 10)
The Reduce Function

- After the map phase is over, all the intermediate values for a given output key are combined together into a list
- reduce() combines those intermediate values into one or more final values for that same output key
Example: Sum Reducer

  let reduce(k, vals) = {
      sum = 0
      foreach int v in vals:
          sum += v
      emit(k, sum)
  }

  (A, [42, 100, 312]) -> (A, 454)
  (B, [12, 6, -2])    -> (B, 16)
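The pseudocode examples above translate directly into plain Java. The sketch below (mine, with no Hadoop dependency) implements the UpperCase mapper and Sum reducer as ordinary static methods, just to confirm the expected outputs:

```java
import java.util.List;
import java.util.Map;

public class MiniExamples {
    // UpperCase mapper: emit (k.toUpper, v.toUpper)
    static Map.Entry<String, String> upperMap(String k, String v) {
        return Map.entry(k.toUpperCase(), v.toUpperCase());
    }

    // Sum reducer: emit (k, sum of vals)
    static Map.Entry<String, Integer> sumReduce(String k, List<Integer> vals) {
        int sum = 0;
        for (int v : vals) sum += v;
        return Map.entry(k, sum);
    }

    public static void main(String[] args) {
        System.out.println(upperMap("abcd", "efgh"));              // ABCD=EFGH
        System.out.println(sumReduce("A", List.of(42, 100, 312))); // A=454
        System.out.println(sumReduce("B", List.of(12, 6, -2)));    // B=16
    }
}
```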
Hadoop MapReduce Terminology

- Job
  - Execution of a complete program (mapper and reducer) across an input data set
- Task
  - Execution of a single mapper or reducer on a slice of data (task-in-progress)
- Task Attempt
  - A particular instance of an attempt to execute a task on a machine or node
Example of Terminology

- Running word count on 20 files: one job
- The job implies 20 map tasks, one for each file, plus a few reducer tasks
- There will be at least 20+ task attempts total; more task attempts if a node crashes, etc.
MapReduce Java Package Structure

- org.apache.hadoop.*
  - mapred: legacy MapReduce API (up to 0.19)
  - mapreduce: new MapReduce API (0.20+)
  - conf: job configuration utilities
  - io: input/output formats
  - fs: file system handlers
  - util: utilities for Hadoop
The Mapper Java Class

- To write your own mapper, you extend the MapReduceBase class and override the map() function of the Mapper class, inside org.apache.hadoop.mapred

  public class NewMapper extends MapReduceBase implements Mapper
  {
      // User-defined class variables and functions

      public void map(WritableComparable key, Writable value,
                      OutputCollector output, Reporter reporter)
          throws IOException
      {
          // Map function goes here
          output.collect(key, value);
      }
  }
Input/Output Data Types

- The Writable class
  - Hadoop's type for writing data as a byte stream (serialization)
  - IntWritable, Text, etc.
  - All values must be Writable
  - You can create your own custom Writable for your own input format
- The WritableComparable class
  - Subclass of Writable
  - Required for sorting and shuffling for reducers
  - All keys must be of type WritableComparable

Java Primitive | Writable Implementation
boolean        | BooleanWritable
byte           | ByteWritable
int            | IntWritable
float          | FloatWritable
long           | LongWritable

Java Type | Writable Implementation
String    | Text
Bytes     | BytesWritable
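Writable serialization boils down to writing fields to a raw byte stream. The sketch below shows what an IntWritable-like type does; it is my stdlib-only simplification (the class name MyIntWritable is invented), though write(DataOutput) and readFields(DataInput) are the two methods the real Writable contract defines:

```java
import java.io.*;

public class MyIntWritable {
    private int value;

    public MyIntWritable(int value) { this.value = value; }
    public int get() { return value; }

    // Serialize the wrapped int as 4 big-endian bytes.
    public void write(DataOutput out) throws IOException {
        out.writeInt(value);
    }

    // Deserialize: read the int back from the stream.
    public void readFields(DataInput in) throws IOException {
        value = in.readInt();
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new MyIntWritable(42).write(new DataOutputStream(buf));

        MyIntWritable back = new MyIntWritable(0);
        back.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(back.get());   // 42
    }
}
```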
The Reducer Java Class

- To write your own reducer, you extend the MapReduceBase class and override the reduce() function of the Reducer class

  public class ExampleReducer extends MapReduceBase implements Reducer
  {
      public void reduce(WritableComparable _key, Iterator values,
                         OutputCollector output, Reporter reporter)
          throws IOException
      {
          while (values.hasNext())
          {
              // process value
          }
      }
  }
The Reducer Java Class, Continued

- The reducer is executed for every key emitted from the mappers; we use the Iterator class to iterate through the values of each key
- You can perform the required reduction operation on the values for each key, and emit (key, value) pairs as necessary
The OutputCollector and Reporter Classes

- The OutputCollector class is used to handle the output key-value pairs emitted by mappers and reducers
  - output.collect(k, v)
- The Reporter class is used to update the counters and status messages you see on screen when Hadoop executes
- The functionality of both classes has now been merged in Hadoop 0.20 using context objects
Putting It All Together in the Driver Class

- Once you have your mappers and reducers set, you can write a driver class
- You configure a JobConf object with the following information:
  - Input/output file formats
  - Input/output key-value formats
  - Mapper and Reducer classes
  public class Example {
      public static void main(String[] args)
      {
          JobClient client = new JobClient();
          JobConf conf = new JobConf(Example.class);

          conf.setOutputKeyClass(Text.class);
          conf.setOutputValueClass(IntWritable.class);
          conf.setInputPath(new Path("src"));
          conf.setOutputPath(new Path("out"));
          conf.setMapperClass(ExampleMapper.class);
          conf.setReducerClass(ExampleReducer.class);

          client.setConf(conf);
          try {
              JobClient.runJob(conf);
          } catch (Exception e) {
              e.printStackTrace();
          }
      }
  }
A Simple Program (Hands-On)

- Let's try to write a Hadoop program that counts the number of even and odd numbers in a list
- Input: a text file with one integer value per line
- Output: a text file containing the number of even and the number of odd numbers in the input file
- Mapper
  - Read each line, parse for the integer and emit (even/odd, 1)
- Reducer
  - Sum up all the values of each key (even/odd)
- That's it!
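Before writing the Hadoop version, the mapper and reducer logic can be prototyped in plain Java. This is a single-process simulation of mine (map every line, group intermediate values by key, reduce each group); the real job would plug the same two functions into the Mapper/Reducer classes shown earlier:

```java
import java.util.*;

public class EvenOddCount {
    // Mapper: parse the line and emit ("even"/"odd", 1)
    static Map.Entry<String, Integer> map(String line) {
        int n = Integer.parseInt(line.trim());
        return Map.entry(n % 2 == 0 ? "even" : "odd", 1);
    }

    // Reducer: sum all the 1s emitted for a key
    static int reduce(String key, List<Integer> vals) {
        int sum = 0;
        for (int v : vals) sum += v;
        return sum;
    }

    // Simulate the job: map, group by key (the "shuffle"), then reduce.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            Map.Entry<String, Integer> kv = map(line);
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        Map<String, Integer> out = new TreeMap<>();
        grouped.forEach((k, vs) -> out.put(k, reduce(k, vs)));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("1", "2", "3", "4", "7")));  // {even=2, odd=3}
    }
}
```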
How Does a Job Execute in Hadoop?

- The MapReduce job is sent to the master node's JobTracker
- The JobTracker creates the tasks and sends them to the individual slave nodes' TaskTrackers
- Each TaskTracker creates task instances for its tasks, runs them, and reports status and results back to the JobTracker

[Figure: a MapReduce job submitted by a client computer goes to the JobTracker on the master node, which distributes task instances to TaskTrackers on the slave nodes]
Job Configuration in Hadoop

- Via the JobConf object
  - Creates a job configuration XML file that is packaged along with the Hadoop JAR file containing your program
  - You specify the Mapper, Reducer, input and output files/formats, etc.
  - There are additional attributes that can be specified, each as an XML attribute in the conf file
- The Hadoop system uses the job XML file along with other configuration parameters in the framework (core-site.xml, etc.)
- Some parameters can be defined as final to make sure they are not overridden
Data Flow from Input to Mappers

[Figure: an InputFormat splits the input files and feeds records to the mappers]
Handling Input

- First we must specify an input file type
  - Text files: the basic file type in Hadoop; reads text files and provides input (key, value) pairs to the map tasks
  - Sequence files: a binary format to store sets of key/value pairs in binary form; supports compression as well
  - Map files: pairs of sorted input sequence files; one file (data) contains key/value pairs as records, and the other (index) contains key/location pairs, where location is the position of the key in the data file
Handling Input, Continued

- The input file is split into chunks using an InputSplit
  - The default split occurs at 64 MB and at file boundaries
  - You can specify your own InputSplit for your input formats
  - One map task runs for each InputSplit
- A record is a (key, value) pair
  - Read using the RecordReader class
  - You can specify your own RecordReader for your own input formats
  - Each InputFormat provides its own RecordReader implementation
    - LineRecordReader: reads a line from a text file
    - KeyValueRecordReader: used by KeyValueTextInputFormat
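A LineRecordReader essentially turns a byte stream into (byte offset, line text) records, which is where the (key, value) pairs handed to a text-input mapper come from. A stdlib-only sketch of that behavior (my simulation, not the Hadoop class; it assumes Unix '\n' line endings):

```java
import java.io.*;
import java.util.*;

public class LineRecords {
    // Produce (byte-offset, line) records, the shape a LineRecordReader
    // hands to the mapper as (key, value) pairs.
    static List<Map.Entry<Long, String>> read(String text) throws IOException {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        long offset = 0;
        BufferedReader r = new BufferedReader(new StringReader(text));
        String line;
        while ((line = r.readLine()) != null) {
            records.add(Map.entry(offset, line));
            offset += line.length() + 1;   // +1 for the '\n' terminator
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        for (Map.Entry<Long, String> rec : read("hello\nworld\n"))
            System.out.println(rec.getKey() + " -> " + rec.getValue());
        // 0 -> hello
        // 6 -> world
    }
}
```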
Data Flow from Mappers to Reducers

[Figure: each mapper's intermediates pass through a partitioner; shuffling then routes each partition to its reducer]
Reducer in Detail

- Reduces a set of intermediate values which share a key to a smaller set of values
- Shuffle
  - Copies the sorted output from each mapper across the network using HTTP
- Sort
  - Sorts reduce inputs by key
  - The shuffle and sort phases occur simultaneously
  - Secondary sort is possible using custom functions
- Reduce
  - The framework calls reduce() for each key and its collection of values
The Partitioner Class

- Responsible for creating the partitions for the sort and shuffle phase
- Dictates what each reducer receives from each mapper
- Can be used to create a global order of keys
- The total number of partitions is the same as the number of reduce tasks
- HashPartitioner is used by default
  - Uses key.hashCode() to return the partition number
- JobConf sets the Partitioner implementation
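The default hash-partitioning logic fits in one line. The sketch below is mine (mirroring how HashPartitioner masks the sign bit before taking the modulo; verify the exact formula against your Hadoop version), and it shows the property that matters: equal keys always land on the same reducer, so all values for a key meet in one reduce() call.

```java
public class HashPartitionDemo {
    // Mask the sign bit so hashCode() % numReduceTasks is never negative,
    // then take the partition number modulo the number of reduce tasks.
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;
        for (String key : new String[] {"even", "odd", "foo", "bar"})
            System.out.println(key + " -> reducer " + getPartition(key, reducers));
    }
}
```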
Handling Output

- Each reducer writes its own output file
- Uses a RecordWriter that is part of an OutputFormat object
  - TextOutputFormat: writes "key \t val \n" strings to the output file
  - SequenceFileOutputFormat: uses a binary format to pack (k, v) pairs
  - NullOutputFormat: discards output

[Figure: reducer output flows through an OutputFormat to the output files]
Optimizations for a MapReduce Job

- Use a combiner function
  - Essentially a mini-reducer that runs on a mapper's output
  - Executed locally on each node after a map task
  - Helps reduce the data to be transferred during the shuffle phase

  conf.setCombinerClass(ExampleReducer.class);
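The effect is easy to see in a stand-alone simulation (mine, not Hadoop code): locally summing word-count-style (key, 1) pairs on each mapper node before the shuffle shrinks the number of records crossing the network without changing the final totals, which is also why a sum reducer can double as its own combiner, as in the setCombinerClass line above.

```java
import java.util.*;

public class CombinerDemo {
    // Combine (sum) values per key locally: the same logic as a sum reducer.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            out.merge(p.getKey(), p.getValue(), Integer::sum);
        return out;
    }

    public static void main(String[] args) {
        // One mapper's raw output: five records for two keys.
        List<Map.Entry<String, Integer>> mapOutput = List.of(
            Map.entry("even", 1), Map.entry("odd", 1), Map.entry("even", 1),
            Map.entry("even", 1), Map.entry("odd", 1));

        Map<String, Integer> combined = combine(mapOutput);
        System.out.println(mapOutput.size() + " records shrink to " + combined.size());
        System.out.println(combined);   // {even=3, odd=2}
    }
}
```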
Conclusions

- A basic overview of Hadoop and MapReduce
- An introduction to the Hadoop framework to get you started on writing programs and code
- For more information:
  - Tom White, Hadoop: The Definitive Guide, O'Reilly Press
  - Jason Venner, Pro Hadoop
References

- http://code.google.com/edu/submissions/mapreduce-minilecture/listing.html
- http://www.cloudera.com/wp-content/uploads/2010/01/2-MapReduceAndHDFS.pdf
- http://www.cloudera.com/wp-content/uploads/2010/01/4-ProgrammingWithHadoop.pdf
- http://search.iiit.ac.in/cloud/lectures/MapReduce.ppt
- http://search.iiit.ac.in/cloud/lectures/HadoopDetailedView.ppt