15-319 Introduction to Cloud Computing
Spring 2010, Carnegie Mellon
Majd F. Sakr

MapReduce and Hadoop
17th Lecture, Mar 16th
Lecture Goals
- Transition to MapReduce from functional programming
- Understand the origins of MapReduce
- Explore the Hadoop API and interface
- One hands-on program to dive into Hadoop!
Lecture Outline
- Functional Programming Review and MapReduce
- Hadoop Components
- MapReduce
  - Programming Model
  - Map and Reduce Functions
  - Terminology: Jobs and Tasks
  - Java Implementations and Classes
    - Mapper and Reducer Classes
    - Writing a MapReduce Driver
    - Input, Output and other handler classes
  - Simple Program (Hands-On)
- Details of Hadoop
  - Handling Input and Output
  - Job Execution and Data Flow
  - Hadoop Optimizations
Functional Programming Review
- Functional operations do not modify data structures:
  - They always create new ones
  - The original data still exists in unmodified form
- Data flows are implicit in the program design
- Order of operations does not matter
Functional Programming
- Functional programs process lists of data very frequently
- Iterators: operators over lists
  - Map
  - Fold
- A map operation can be applied to each element of a list independently
- If the operation performed in a fold is commutative (and associative), the fold can be parallelized
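The two iterators above can be sketched with Java streams (a minimal, single-machine illustration; the method names `square` and `sum` are ours, not part of any API):

```java
import java.util.List;
import java.util.stream.Collectors;

public class MapFoldSketch {
    // map: apply a function to each element independently, producing a new list;
    // the input list is never modified
    public static List<Integer> square(List<Integer> xs) {
        return xs.stream().map(x -> x * x).collect(Collectors.toList());
    }

    // fold: combine all elements with an associative, commutative operation (+),
    // which is what makes the fold safe to reorder or parallelize
    public static int sum(List<Integer> xs) {
        return xs.stream().reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        List<Integer> xs = List.of(1, 2, 3, 4);
        System.out.println(square(xs)); // a new list; xs is unchanged
        System.out.println(sum(square(xs)));
    }
}
```

Because `+` is associative and commutative, replacing `stream()` with `parallelStream()` gives the same result, which is exactly the reordering argument on the next slide.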
Implicit Parallelism in Functional Programming
- In a purely functional setting, elements of a list being computed by map cannot see the effects of the computations on other elements
- If the order of application of f to elements in a list is commutative (and associative), we can reorder or parallelize the execution of f
- This is the secret that MapReduce exploits
Enter MapReduce
- Functional programming on top of distributed processing
- Motivated by the need to process large amounts of data using hundreds (or thousands) of processor nodes
- Data is central to the computation: place the program closest to the data it will be working on
- Provides a clean abstraction to programmers, similar to functional programming
- The interface deals with all the messy details
MapReduce History
- Developed by Google to simplify their data processing jobs on large data
- Details emerged from two published papers:
  - Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", Proceedings of OSDI '04, 2004
  - Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File System", Proceedings of the Symposium on Operating Systems Principles (SOSP), ACM SIGOPS, 2003
- Since Google's implementation is proprietary and not available to the public, an Apache project called Hadoop emerged as an open-source implementation
  - Primary contributors: Yahoo!, Facebook
Motivating Example: Processing Web Data on a Single Machine
- 20+ billion web pages x 20 KB = 400+ terabytes
- One computer can read 30-35 MB/sec from disk
  - ~ four months to read the web
- ~1,000 hard drives just to store the web
- Even more to do something with the data
- Takes too long on a single machine, but with 1,000 machines?
  - < 3 hours to perform on 1,000 machines
  - But how long to program? What about the overheads?
    - Communication, coordination, recovery from machine failure
    - Status reporting, debugging, optimization, locality
    - Reinventing the wheel: this has to be done for every program!
MapReduce Features
- Automatic parallelization and distribution of work
- Fault tolerant
- Status and monitoring tools
- Clean abstraction for programmers
Typical Problem Solved by MapReduce
1. Read a lot of data
2. MAP: extract something you need from each record
3. Shuffle and Sort
4. REDUCE: aggregate, summarize, filter or transform
5. Write the results

- The outline stays the same; Map and Reduce change to fit the problem
- The model seems restrictive, but it is Turing complete
- Multiple maps and reduces may be needed to solve a complex problem
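The five-step outline above can be sketched for the canonical word-count problem in plain Java. This is a single-machine simulation of the data flow, not the Hadoop API; class and method names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {
    public static Map<String, Integer> wordCount(List<String> lines) {
        // 1. Read a lot of data (here: the lines passed in)
        // 2. MAP: emit an intermediate (word, 1) pair for every word in every record
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                if (!word.isEmpty())
                    intermediate.add(Map.entry(word, 1));

        // 3. Shuffle and Sort: group intermediate values by key (TreeMap keeps keys sorted)
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : intermediate)
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());

        // 4. REDUCE: sum the list of values for each key
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((word, ones) ->
            counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts; // 5. Write the results
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("the quick fox", "the lazy dog")));
    }
}
```

Only steps 2 and 4 change from problem to problem; steps 1, 3 and 5 are what the framework provides.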
Programming Model
- Fundamentally similar to functional programming
- Users implement interfaces to the following two functions:

  map (in_key, in_value) ->
      (out_key, intermediate_value) list

  reduce (out_key, intermediate_value list) ->
      out_value list
A Typical MapReduce Program (I)
[Figure: input data flows into map tasks; map emits key/value pairs with keys K1 and K2; a shuffle-and-sort barrier groups all pairs sharing a key; one reduce task then processes each key group.]
A Typical MapReduce Program (II)
[Figure: data stores 1..n feed input key/value pairs to map tasks; each map emits (key 1, values...), (key 2, values...), (key 3, values...); a barrier aggregates intermediate values by output key; reduce tasks turn each (key, intermediate values) group into final values for that key.]
Parallelism
- Map functions run in parallel, creating intermediate values from each input data set
  - The programmer must specify a proper input split (chunk) between mappers to enable parallelism
- Reduce functions also run in parallel; each works on different output keys
  - The number of reducers is a key parameter that determines MapReduce performance
- All values are processed independently
- The reduce phase cannot start until the map phase is completely finished
Data Locality
- The master program creates tasks based on the location of the data; it tries to send map() tasks to the same machine, or at least the same rack
- map() task inputs are divided into 64 MB blocks (same as HDFS/GFS) or at file boundaries
- Minimizes communication at the network level
Fault Tolerance
- The master keeps track of the progress of each task and worker node
  - If a node fails, the master re-executes its completed as well as in-progress map tasks on other nodes that are alive
  - It also re-executes in-progress reduce tasks
- If particular input key/value pairs keep crashing
  - The master blacklists them and skips them on re-execution
- Tolerate small failures; allow the job to run on a best-effort basis
  - For large data sets containing potentially millions of records, we don't want to stop the computation for a few records that fail to process correctly
  - The user can set the failure tolerance level
Optimizations
- Reduce tasks start only after all the map operations are complete: a potential bottleneck
  - Use a combiner (a mini-reducer on the output of a map) to reduce the output and the amount of communication during sort and shuffle
- Speculative execution: a Hadoop feature that runs multiple attempts of the same task in parallel and uses the result of the first one to finish
  - MapReduce programs are bound by the slowest element: the last mapper to finish, or the time taken for the reduce operations to finish
Time to Dive into Hadoop!
What is Hadoop?
- Open-source implementation of Google's distributed computing projects
- Includes open-source versions of MapReduce, BigTable, GFS, etc.
- Supported by the Apache Foundation
- Primary contributors: Yahoo!, Facebook
- Work in progress: lots of features are unstable
- Used by Yahoo! to run a 2,000+ node distributed cluster, and by many other web companies as well
Quick Look at Hadoop Components

Google Calls It | Hadoop Equivalent | Description
MapReduce       | MapReduce         | Java implementation of the MapReduce programming model
GFS             | HDFS              | Distributed file system
Sawzall         | Pig               | Data flow language
                | Hive              | Distributed data warehouse and query engine
BigTable        | HBase             | Distributed database
Chubby          | ZooKeeper         | Distributed coordination service
Hadoop MapReduce
- The "meat" of Hadoop: an implementation of Google's MapReduce programming model
- Express complex tasks in the form of maps and reduces to enable large-scale processing of data
- Functional programming meets distributed computing, using batch processing of data
- Benefits:
  - Automatic parallelization and distribution of data
  - Fault tolerance
  - Status and monitoring tools
  - A clean programming abstraction that helps you focus on solving the problem
MapReduce Programming Model
- Based on the functional programming concepts of map and fold
- Users implement Map and Reduce functions:

  map (in_key, in_value) ->
      (out_key, intermediate_value) list

  reduce (out_key, intermediate_value list) ->
      out_value list
The Map Function
- Records from the data source (lines of files, rows of a database, etc.) are fed into the map function as key/value pairs: e.g., (filename, line)
- map() produces (emits) one or more intermediate values along with an output key from the input
Example: UpperCase Mapper

  let map(k, v) = emit(k.toUpper(), v.toUpper())

  ("abcd", "efgh")   -> ("ABCD", "EFGH")
  ("CMUq", "hadoop") -> ("CMUQ", "HADOOP")
  ("foo", "bar")     -> ("FOO", "BAR")
Example: Filter Mapper

  let map(k, v) = if (v % 2 == 0) then emit(k, v)

  ("foo", 7)   -> (nothing)
  ("test", 10) -> ("test", 10)
The Reduce Function
- After the map phase is over, all the intermediate values for a given output key are combined together into a list
- reduce() combines those intermediate values into one or more final values for that same output key
Example: Sum Reducer

  let reduce(k, vals) = {
      sum = 0
      foreach int v in vals:
          sum += v
      emit(k, sum)
  }

  ("A", [42, 100, 312]) -> ("A", 454)
  ("B", [12, 6, -2])    -> ("B", 16)
Hadoop MapReduce Terminology
- Job: execution of a complete program (mapper and reducer) across an input data set
- Task: execution of a single mapper or reducer on a slice of data (task-in-progress)
- Task Attempt: a particular instance of an attempt to execute a task on a machine or node
Example of Terminology
- Word count on 20 files: one job
- The job implies 20 map tasks, one for each file, plus a few reduce tasks
- There will be at least 20+ task attempts in total; more task attempts if a node crashes, etc.
MapReduce Java Package Structure
- org.apache.hadoop.*
  - mapred: legacy MapReduce API (up to 0.19)
  - mapreduce: new MapReduce API (0.20+)
  - conf: job configuration utilities
  - io: input/output formats
  - fs: file system handlers
  - util: utilities for Hadoop
The Mapper Java Class
To write your own mapper, you extend the MapReduceBase class and override the map() function of the Mapper interface, in org.apache.hadoop.mapred:

  public class NewMapper extends MapReduceBase implements Mapper {
      // user-defined class variables and functions

      public void map(WritableComparable key, Writable value,
                      OutputCollector output, Reporter reporter)
              throws IOException {
          // map function goes here
          output.collect(key, value);
      }
  }
Input/Output Datatypes
- Writable class
  - Hadoop type for writing data as a byte stream (serialization)
  - IntWritable, Text, etc.
  - All values must be Writable
  - You can create your own custom Writable for your own input format
- WritableComparable class
  - Subclass of Writable
  - Required for sorting and shuffling for reducers
  - All keys must be of type WritableComparable

Java Primitive | Writable Implementation
boolean        | BooleanWritable
byte           | ByteWritable
int            | IntWritable
float          | FloatWritable
long           | LongWritable

Java Type | Writable Implementation
String    | Text
bytes     | BytesWritable
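The Writable contract amounts to two methods, write(DataOutput) and readFields(DataInput). A plain-Java sketch of that pattern, without the actual Hadoop interface (the class `PairRecord` and its fields are illustrative), for a record holding an int and a String:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustrative stand-in for a custom Writable: serializes itself to a byte
// stream via write() and re-populates itself via readFields(), mirroring
// Hadoop's Writable contract.
public class PairRecord {
    int count;
    String word;

    void write(DataOutputStream out) throws IOException {
        out.writeInt(count);
        out.writeUTF(word);
    }

    void readFields(DataInputStream in) throws IOException {
        count = in.readInt();
        word = in.readUTF();
    }

    // Serialize a record to bytes and read it back, as the framework would
    // when shipping it between mappers and reducers.
    public static PairRecord roundTrip(int count, String word) {
        try {
            PairRecord original = new PairRecord();
            original.count = count;
            original.word = word;
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            PairRecord copy = new PairRecord();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        PairRecord r = roundTrip(42, "hadoop");
        System.out.println(r.count + " " + r.word);
    }
}
```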
The Reducer Java Class
To write your own reducer, you extend the MapReduceBase class and override the reduce() function of the Reducer interface:

  public class ExampleReducer extends MapReduceBase implements Reducer {

      public void reduce(WritableComparable _key, Iterator values,
                         OutputCollector output, Reporter reporter)
              throws IOException {
          while (values.hasNext()) {
              // process value
          }
      }
  }
The Reducer Java Class (Continued)
- The reducer is executed for every key emitted by the mappers; we use an Iterator to step through the values for each key
- You can perform the required reduction operation on the values for each key and emit (key, value) pairs as necessary
The OutputCollector and Reporter Classes
- The OutputCollector class is used to handle the output key/value pairs emitted by mappers and reducers:
  output.collect(k, v)
- The Reporter class is used to update counters and the status messages you see on screen when Hadoop executes
- The functionality of both classes has now been merged in Hadoop 0.20 using context objects
Putting It All Together in the Driver Class
- Once you have your mappers and reducers set, you can write a driver class
- You configure a JobConf object with the following information:
  - Input/output file formats
  - Input/output key/value formats
  - Mapper and Reducer classes
  public class Example {
      public static void main(String[] args) {
          JobClient client = new JobClient();
          JobConf conf = new JobConf(Example.class);

          conf.setOutputKeyClass(Text.class);
          conf.setOutputValueClass(IntWritable.class);
          conf.setInputPath(new Path("src"));
          conf.setOutputPath(new Path("out"));
          conf.setMapperClass(ExampleMapper.class);
          conf.setReducerClass(ExampleReducer.class);

          client.setConf(conf);
          try {
              JobClient.runJob(conf);
          } catch (Exception e) {
              e.printStackTrace();
          }
      }
  }
A Simple Program (Hands-On)
- Let's try to write a Hadoop program that counts the number of even and odd numbers in a list
- Input: a text file with one integer value per line
- Output: a text file containing the number of even and the number of odd numbers in the input file
- Mapper: read each line, parse the integer, and emit (even/odd, 1)
- Reducer: sum up all the values for each key (even/odd)
- That's it!
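As a single-machine sketch of this job's logic (plain Java, not the Hadoop API; class and method names are illustrative), the mapper and reducer could look like:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EvenOddSketch {
    // Mapper logic: one input line -> the output key "even" or "odd"
    public static String mapLine(String line) {
        int n = Integer.parseInt(line.trim());
        return (n % 2 == 0) ? "even" : "odd";
    }

    // Reducer logic: sum a 1 into the running count for each emitted key
    public static Map<String, Integer> countEvenOdd(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines)
            counts.merge(mapLine(line), 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countEvenOdd(List.of("4", "7", "10", "3", "8")));
    }
}
```

In the real Hadoop version, mapLine becomes the map() method emitting (Text, IntWritable) pairs, and the summation becomes the reduce() method; the framework handles the grouping in between.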
How Does a Job Execute in Hadoop?
- The MapReduce job is sent to the master node's JobTracker
- The JobTracker creates the tasks and sends them to the individual slave nodes' TaskTrackers
- Each TaskTracker creates task instances for its tasks, runs them, and reports status and results back to the JobTracker

[Figure: a MapReduce job submitted by a client computer reaches the JobTracker on the master node, which dispatches tasks to TaskTrackers on the slave nodes; each TaskTracker runs task instances.]
Job Configuration in Hadoop
- Via the JobConf object
  - Creates a job configuration XML file that will be packaged along with the Hadoop JAR file containing your program
  - You specify the Mapper, Reducer, input and output files/formats, etc.
  - There are additional attributes that can be specified, each as XML attributes in the conf file
- The Hadoop system uses the job XML file along with other configuration parameters in the framework (core-site.xml, etc.)
  - Some parameters can be defined as final to make sure they are not overridden
Data Flow from Input to Mappers
[Figure: an InputFormat reads the input files and feeds records to the map tasks.]
Handling Input
- First we must specify an input file type:
  - Text files: the basic file type in Hadoop; reads text files and provides input (key, value) pairs to the map tasks
  - SequenceFiles: a binary format to store sets of key/value pairs; supports compression as well
  - MapFiles: pairs of sorted input SequenceFiles; one file (data) contains key/value pairs as records, and the other (index) contains key/location pairs, where location is the position of the key in the data file
Handling Input
- The input file is split into chunks using an InputSplit
  - The default split occurs at 64 MB and at file boundaries
  - You can specify your own InputSplit for your InputFormats
- One map task for each InputSplit
- A record is a (key, value) pair
  - Read using the RecordReader class
  - You can specify your own RecordReader for your own input formats
  - Each InputFormat provides its own RecordReader implementation
    - LineRecordReader: reads a line from a text file
    - KeyValueRecordReader: used by KeyValueTextInputFormat
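To illustrate what a line-oriented record reader produces, here is a plain-Java sketch (not Hadoop's LineRecordReader itself; it assumes '\n' line endings and one byte per character, purely for illustration) that turns text into (byte offset, line) records:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class LineRecordsSketch {
    // Produce (byte offset, line) pairs: the key/value records a text input
    // format hands to each map task.
    public static List<Map.Entry<Long, String>> toRecords(String text) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        long offset = 0;
        for (String line : text.split("\n")) {
            records.add(Map.entry(offset, line));
            offset += line.length() + 1; // +1 for the consumed newline
        }
        return records;
    }

    public static void main(String[] args) {
        toRecords("first line\nsecond line\nthird").forEach(r ->
            System.out.println(r.getKey() + " -> " + r.getValue()));
    }
}
```

The byte offset becomes the map input key and the line text the value, which is why a typical text-processing mapper ignores its key.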
Data Flow from Mappers to Reducers
[Figure: each mapper's intermediate output passes through a Partitioner; the shuffle phase routes each partition to one of the reducers.]
Reducer in Detail
- Reduces a set of intermediate values which share a key to a smaller set of values
- Shuffle
  - Copies the sorted output from each mapper across the network using HTTP
- Sort
  - Sorts reduce inputs by key
  - The shuffle and sort phases occur simultaneously
  - Secondary sort using custom functions
- Reduce
  - The framework calls reduce() for each key and its collection of values
The Partitioner Class
- Responsible for creating the partitions for the sort and shuffle phase
  - Dictates what each reducer receives from each mapper
  - Can be used to create a global order of keys
- The total number of partitions is the same as the number of reduce tasks
- HashPartitioner is used by default
  - Uses key.hashCode() to return the partition number
- JobConf sets the Partitioner implementation
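The default hash-partitioning rule can be sketched in plain Java (the class and method names are illustrative; Hadoop's HashPartitioner is commonly implemented with this same formula):

```java
public class HashPartitionSketch {
    // Default partitioning rule: mask off the sign bit of the key's hash code,
    // then take it modulo the number of reduce tasks. Every mapper applies the
    // same rule, so all values for a given key land on the same reducer.
    public static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("hadoop", 4));
        System.out.println(partitionFor("hadoop", 4)); // same key -> same partition
    }
}
```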
Handling Output
- Each reducer writes its own output file
- Uses a RecordWriter that is part of an OutputFormat object
  - TextOutputFormat: writes "key \t value \n" strings to the output file
  - SequenceFileOutputFormat: uses a binary format to pack (k, v) pairs
  - NullOutputFormat: discards output

[Figure: reducer outputs pass through an OutputFormat, which writes one output file per reducer.]
Optimizations for a MapReduce Job
- Use a combiner function
  - Essentially a mini-reducer that runs on a mapper's output
  - Executed locally on each node after a map task
  - Helps reduce the data to be transferred during the shuffle phase

  conf.setCombinerClass(ExampleReducer.class);
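The effect of a combiner can be illustrated in plain Java (a single-machine simulation with illustrative names): pre-aggregating one mapper's (word, 1) pairs locally shrinks the number of records that would otherwise cross the network during the shuffle.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CombinerSketch {
    // Without a combiner, a mapper would ship one (word, 1) record per word
    // occurrence. A combiner sums them locally first, so at most one record
    // per distinct word leaves the node.
    public static Map<String, Integer> combineLocally(List<String> mapperWords) {
        Map<String, Integer> partialCounts = new HashMap<>();
        for (String word : mapperWords)
            partialCounts.merge(word, 1, Integer::sum);
        return partialCounts;
    }

    public static void main(String[] args) {
        List<String> words = List.of("the", "the", "the", "fox", "the");
        System.out.println("records without combiner: " + words.size());
        System.out.println("records with combiner:    " + combineLocally(words).size());
    }
}
```

This only works because summation is associative: the reducer can sum the partial counts exactly as it would have summed the raw ones, which is why a reducer like ExampleReducer above can double as its own combiner.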
Conclusions
- A basic overview of Hadoop and MapReduce
- An introduction to the Hadoop framework to get you started on writing programs and code
- For more information:
  - Tom White, "Hadoop: The Definitive Guide", O'Reilly Press
  - Jason Venner, "Pro Hadoop"