15-319 Introduction to Cloud Computing
Spring 2010, Carnegie Mellon
Majd F. Sakr

MapReduce and Hadoop
17th Lecture, Mar 16th
Lecture Goals
- Transition to MapReduce from functional programming
- Understand the origins of MapReduce
- Explore the Hadoop API and interface
- One hands-on program to dive into Hadoop!
Lecture Outline
- Functional Programming Review and MapReduce
- Hadoop Components
- MapReduce
  - Programming Model
  - Map and Reduce Functions
  - Terminology: Jobs and Tasks
  - Java Implementations and Classes
    - Mapper and Reducer Classes
    - Writing a MapReduce Driver
    - Input, Output and other handler classes
  - Simple Program (Hands-On)
- Details of Hadoop
  - Handling Input and Output
  - Job Execution and Data Flow
  - Hadoop Optimizations
Functional Programming Review
- Functional operations do not modify data structures:
  - They always create new ones
  - The original data still exists in unmodified form
- Data flows are implicit in the program design
- Order of operations does not matter
Functional Programming
- Functional programs process lists of data very frequently
- Iterators: operators over lists
  - Map
  - Fold
- A map operation can be applied to each element of a list independently
- If the operation performed in a fold is commutative (and associative), the fold can be parallelized
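The two iterators above can be sketched with Java streams (a minimal, single-machine illustration; the method names `square` and `sum` are ours, not part of any API):

```java
import java.util.List;
import java.util.stream.Collectors;

public class MapFoldSketch {
    // map: apply a function to each element independently, producing a new list;
    // the input list is never modified
    public static List<Integer> square(List<Integer> xs) {
        return xs.stream().map(x -> x * x).collect(Collectors.toList());
    }

    // fold: combine all elements with an associative, commutative operation (+),
    // which is what makes the fold safe to reorder or parallelize
    public static int sum(List<Integer> xs) {
        return xs.stream().reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        List<Integer> xs = List.of(1, 2, 3, 4);
        System.out.println(square(xs)); // a new list; xs is unchanged
        System.out.println(sum(square(xs)));
    }
}
```

Because `+` is associative and commutative, replacing `stream()` with `parallelStream()` gives the same result, which is exactly the reordering argument on the next slide.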
Implicit Parallelism in Functional Programming
- In a purely functional setting, elements of a list being computed by map cannot see the effects of the computations on other elements
- If the order of application of f to elements in a list is commutative (and associative), we can reorder or parallelize the execution of f
- This is the secret that MapReduce exploits
Enter MapReduce
- Functional programming on top of distributed processing
- Motivated by the need to process large amounts of data using hundreds (or thousands) of processor nodes
- Data is central to the computation: place the program closest to the data it will be working on
- Provides a clean abstraction to programmers, similar to functional programming
- The interface deals with all the messy details
MapReduce History
- Developed by Google to simplify their data processing jobs on large data
- Details emerged from two published papers:
  - Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", Proceedings of OSDI '04, 2004
  - Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File System", Proceedings of the Symposium on Operating Systems Principles (SOSP), ACM SIGOPS, 2003
- Since Google's implementation is proprietary and not available to the public, an Apache project called Hadoop emerged as an open-source implementation
  - Primary contributors: Yahoo!, Facebook
Motivating Example: Processing Web Data on a Single Machine
- 20+ billion web pages x 20 KB = 400+ terabytes
- One computer can read 30-35 MB/sec from disk
  - ~ four months to read the web
- ~1,000 hard drives just to store the web
- Even more to do something with the data
- Takes too long on a single machine, but with 1,000 machines?
  - < 3 hours to perform on 1,000 machines
  - But how long to program? What about the overheads?
    - Communication, coordination, recovery from machine failure
    - Status reporting, debugging, optimization, locality
    - Reinventing the wheel: this has to be done for every program!
MapReduce Features
- Automatic parallelization and distribution of work
- Fault tolerant
- Status and monitoring tools
- Clean abstraction for programmers
Typical Problem Solved by MapReduce
1. Read a lot of data
2. MAP: extract something you need from each record
3. Shuffle and Sort
4. REDUCE: aggregate, summarize, filter or transform
5. Write the results

- The outline stays the same; Map and Reduce change to fit the problem
- The model seems restrictive, but it is Turing complete
- Multiple maps and reduces may be needed to solve a complex problem
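The five-step outline above can be sketched for the canonical word-count problem in plain Java. This is a single-machine simulation of the data flow, not the Hadoop API; class and method names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {
    public static Map<String, Integer> wordCount(List<String> lines) {
        // 1. Read a lot of data (here: the lines passed in)
        // 2. MAP: emit an intermediate (word, 1) pair for every word in every record
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                if (!word.isEmpty())
                    intermediate.add(Map.entry(word, 1));

        // 3. Shuffle and Sort: group intermediate values by key (TreeMap keeps keys sorted)
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : intermediate)
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());

        // 4. REDUCE: sum the list of values for each key
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((word, ones) ->
            counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts; // 5. Write the results
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("the quick fox", "the lazy dog")));
    }
}
```

Only steps 2 and 4 change from problem to problem; steps 1, 3 and 5 are what the framework provides.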
Programming Model
- Fundamentally similar to functional programming
- Users implement interfaces to the following two functions:

  map (in_key, in_value) ->
      (out_key, intermediate_value) list

  reduce (out_key, intermediate_value list) ->
      out_value list
A Typical MapReduce Program (I)
[Figure: input data flows into map tasks; map emits key/value pairs with keys K1 and K2; a shuffle-and-sort barrier groups all pairs sharing a key; one reduce task then processes each key group.]
A Typical MapReduce Program (II)
[Figure: data stores 1..n feed input key/value pairs to map tasks; each map emits (key 1, values...), (key 2, values...), (key 3, values...); a barrier aggregates intermediate values by output key; reduce tasks turn each (key, intermediate values) group into final values for that key.]
Parallelism
- Map functions run in parallel, creating intermediate values from each input data set
  - The programmer must specify a proper input split (chunk) between mappers to enable parallelism
- Reduce functions also run in parallel; each works on different output keys
  - The number of reducers is a key parameter that determines MapReduce performance
- All values are processed independently
- The reduce phase cannot start until the map phase is completely finished
Data Locality
- The master program creates tasks based on the location of the data; it tries to send map() tasks to the same machine, or at least the same rack
- map() task inputs are divided into 64 MB blocks (same as HDFS/GFS) or at file boundaries
- Minimizes communication at the network level
Fault Tolerance
- The master keeps track of the progress of each task and worker node
  - If a node fails, the master re-executes its completed as well as in-progress map tasks on other nodes that are alive
  - It also re-executes in-progress reduce tasks
- If particular input key/value pairs keep crashing
  - The master blacklists them and skips them on re-execution
- Tolerate small failures; allow the job to run on a best-effort basis
  - For large data sets containing potentially millions of records, we don't want to stop the computation for a few records that fail to process correctly
  - The user can set the failure tolerance level
Optimizations
- Reduce tasks start only after all the map operations are complete: a potential bottleneck
  - Use a combiner (a mini-reducer on the output of a map) to reduce the output and the amount of communication during sort and shuffle
- Speculative execution: a Hadoop feature that runs multiple attempts of the same task in parallel and uses the result of the first one to finish
  - MapReduce programs are bound by the slowest element: the last mapper to finish, or the time taken for the reduce operations to finish
Time to Dive into Hadoop!
What is Hadoop?
- Open-source implementation of Google's distributed computing projects
- Includes open-source versions of MapReduce, BigTable, GFS, etc.
- Supported by the Apache Foundation
- Primary contributors: Yahoo!, Facebook
- Work in progress: lots of features are unstable
- Used by Yahoo! to run a 2,000+ node distributed cluster, and by many other web companies as well
Quick Look at Hadoop Components

Google Calls It | Hadoop Equivalent | Description
MapReduce       | MapReduce         | Java implementation of the MapReduce programming model
GFS             | HDFS              | Distributed file system
Sawzall         | Pig               | Data flow language
                | Hive              | Distributed data warehouse and query engine
BigTable        | HBase             | Distributed database
Chubby          | ZooKeeper         | Distributed coordination service
Hadoop MapReduce
- The "meat" of Hadoop: an implementation of Google's MapReduce programming model
- Express complex tasks in the form of maps and reduces to enable large-scale processing of data
- Functional programming meets distributed computing, using batch processing of data
- Benefits:
  - Automatic parallelization and distribution of data
  - Fault tolerance
  - Status and monitoring tools
  - A clean programming abstraction that helps you focus on solving the problem
MapReduce Programming Model
- Based on the functional programming concepts of map and fold
- Users implement Map and Reduce functions:

  map (in_key, in_value) ->
      (out_key, intermediate_value) list

  reduce (out_key, intermediate_value list) ->
      out_value list
The Map Function
- Records from the data source (lines of files, rows of a database, etc.) are fed into the map function as key/value pairs: e.g., (filename, line)
- map() produces (emits) one or more intermediate values along with an output key from the input
Example: UpperCase Mapper

  let map(k, v) = emit(k.toUpper(), v.toUpper())

  ("abcd", "efgh")   -> ("ABCD", "EFGH")
  ("CMUq", "hadoop") -> ("CMUQ", "HADOOP")
  ("foo", "bar")     -> ("FOO", "BAR")
Example: Filter Mapper

  let map(k, v) = if (v % 2 == 0) then emit(k, v)

  ("foo", 7)   -> (nothing)
  ("test", 10) -> ("test", 10)
The Reduce Function
- After the map phase is over, all the intermediate values for a given output key are combined together into a list
- reduce() combines those intermediate values into one or more final values for that same output key
Example: Sum Reducer

  let reduce(k, vals) = {
      sum = 0
      foreach int v in vals:
          sum += v
      emit(k, sum)
  }

  ("A", [42, 100, 312]) -> ("A", 454)
  ("B", [12, 6, -2])    -> ("B", 16)
Hadoop MapReduce Terminology
- Job: execution of a complete program (mapper and reducer) across an input data set
- Task: execution of a single mapper or reducer on a slice of data (task-in-progress)
- Task Attempt: a particular instance of an attempt to execute a task on a machine or node
Example of Terminology
- Word count on 20 files: one job
- The job implies 20 map tasks, one for each file, plus a few reduce tasks
- There will be at least 20+ task attempts in total; more task attempts if a node crashes, etc.
MapReduce Java Package Structure
- org.apache.hadoop.*
  - mapred: legacy MapReduce API (up to 0.19)
  - mapreduce: new MapReduce API (0.20+)
  - conf: job configuration utilities
  - io: input/output formats
  - fs: file system handlers
  - util: utilities for Hadoop
The Mapper Java Class
To write your own mapper, you extend the MapReduceBase class and override the map() function of the Mapper interface, in org.apache.hadoop.mapred:

  public class NewMapper extends MapReduceBase implements Mapper {
      // user-defined class variables and functions

      public void map(WritableComparable key, Writable value,
                      OutputCollector output, Reporter reporter)
              throws IOException {
          // map function goes here
          output.collect(key, value);
      }
  }
Input/Output Datatypes
- Writable class
  - Hadoop type for writing data as a byte stream (serialization)
  - IntWritable, Text, etc.
  - All values must be Writable
  - You can create your own custom Writable for your own input format
- WritableComparable class
  - Subclass of Writable
  - Required for sorting and shuffling for reducers
  - All keys must be of type WritableComparable

Java Primitive | Writable Implementation
boolean        | BooleanWritable
byte           | ByteWritable
int            | IntWritable
float          | FloatWritable
long           | LongWritable

Java Type | Writable Implementation
String    | Text
bytes     | BytesWritable
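The Writable contract amounts to two methods, write(DataOutput) and readFields(DataInput). A plain-Java sketch of that pattern, without the actual Hadoop interface (the class `PairRecord` and its fields are illustrative), for a record holding an int and a String:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustrative stand-in for a custom Writable: serializes itself to a byte
// stream via write() and re-populates itself via readFields(), mirroring
// Hadoop's Writable contract.
public class PairRecord {
    int count;
    String word;

    void write(DataOutputStream out) throws IOException {
        out.writeInt(count);
        out.writeUTF(word);
    }

    void readFields(DataInputStream in) throws IOException {
        count = in.readInt();
        word = in.readUTF();
    }

    // Serialize a record to bytes and read it back, as the framework would
    // when shipping it between mappers and reducers.
    public static PairRecord roundTrip(int count, String word) {
        try {
            PairRecord original = new PairRecord();
            original.count = count;
            original.word = word;
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            PairRecord copy = new PairRecord();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        PairRecord r = roundTrip(42, "hadoop");
        System.out.println(r.count + " " + r.word);
    }
}
```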
The Reducer Java Class
To write your own reducer, you extend the MapReduceBase class and override the reduce() function of the Reducer interface:

  public class ExampleReducer extends MapReduceBase implements Reducer {

      public void reduce(WritableComparable _key, Iterator values,
                         OutputCollector output, Reporter reporter)
              throws IOException {
          while (values.hasNext()) {
              // process value
          }
      }
  }
The Reducer Java Class (Continued)
- The reducer is executed for every key emitted by the mappers; we use an Iterator to step through the values for each key
- You can perform the required reduction operation on the values for each key and emit (key, value) pairs as necessary
The OutputCollector and Reporter Classes
- The OutputCollector class is used to handle the output key/value pairs emitted by mappers and reducers:
  output.collect(k, v)
- The Reporter class is used to update counters and the status messages you see on screen when Hadoop executes
- The functionality of both classes has now been merged in Hadoop 0.20 using context objects
Putting It All Together in the Driver Class
- Once you have your mappers and reducers set, you can write a driver class
- You configure a JobConf object with the following information:
  - Input/output file formats
  - Input/output key/value formats
  - Mapper and Reducer classes
  public class Example {
      public static void main(String[] args) {
          JobClient client = new JobClient();
          JobConf conf = new JobConf(Example.class);

          conf.setOutputKeyClass(Text.class);
          conf.setOutputValueClass(IntWritable.class);
          conf.setInputPath(new Path("src"));
          conf.setOutputPath(new Path("out"));
          conf.setMapperClass(ExampleMapper.class);
          conf.setReducerClass(ExampleReducer.class);

          client.setConf(conf);
          try {
              JobClient.runJob(conf);
          } catch (Exception e) {
              e.printStackTrace();
          }
      }
  }
A Simple Program (Hands-On)
- Let's try to write a Hadoop program that counts the number of even and odd numbers in a list
- Input: a text file with one integer value per line
- Output: a text file containing the number of even and the number of odd numbers in the input file
- Mapper: read each line, parse the integer, and emit (even/odd, 1)
- Reducer: sum up all the values for each key (even/odd)
- That's it!
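As a single-machine sketch of this job's logic (plain Java, not the Hadoop API; class and method names are illustrative), the mapper and reducer could look like:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EvenOddSketch {
    // Mapper logic: one input line -> the output key "even" or "odd"
    public static String mapLine(String line) {
        int n = Integer.parseInt(line.trim());
        return (n % 2 == 0) ? "even" : "odd";
    }

    // Reducer logic: sum a 1 into the running count for each emitted key
    public static Map<String, Integer> countEvenOdd(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines)
            counts.merge(mapLine(line), 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countEvenOdd(List.of("4", "7", "10", "3", "8")));
    }
}
```

In the real Hadoop version, mapLine becomes the map() method emitting (Text, IntWritable) pairs, and the summation becomes the reduce() method; the framework handles the grouping in between.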
How Does a Job Execute in Hadoop?
- The MapReduce job is sent to the master node's JobTracker
- The JobTracker creates the tasks and sends them to the individual slave nodes' TaskTrackers
- Each TaskTracker creates task instances for its tasks, runs them, and reports status and results back to the JobTracker

[Figure: a MapReduce job submitted by a client computer reaches the JobTracker on the master node, which dispatches tasks to TaskTrackers on the slave nodes; each TaskTracker runs task instances.]
Job Configuration in Hadoop
- Via the JobConf object
  - Creates a job configuration XML file that will be packaged along with the Hadoop JAR file containing your program
  - You specify the Mapper, Reducer, input and output files/formats, etc.
  - There are additional attributes that can be specified, each as XML attributes in the conf file
- The Hadoop system uses the job XML file along with other configuration parameters in the framework (core-site.xml, etc.)
  - Some parameters can be defined as final to make sure they are not overridden
Data Flow from Input to Mappers
[Figure: an InputFormat reads the input files and feeds records to the map tasks.]
Handling Input
- First we must specify an input file type:
  - Text files: the basic file type in Hadoop; reads text files and provides input (key, value) pairs to the map tasks
  - SequenceFiles: a binary format to store sets of key/value pairs; supports compression as well
  - MapFiles: pairs of sorted input SequenceFiles; one file (data) contains key/value pairs as records, and the other (index) contains key/location pairs, where location is the position of the key in the data file
Handling Input
- The input file is split into chunks using an InputSplit
  - The default split occurs at 64 MB and at file boundaries
  - You can specify your own InputSplit for your InputFormats
- One map task for each InputSplit
- A record is a (key, value) pair
  - Read using the RecordReader class
  - You can specify your own RecordReader for your own input formats
  - Each InputFormat provides its own RecordReader implementation
    - LineRecordReader: reads a line from a text file
    - KeyValueRecordReader: used by KeyValueTextInputFormat
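To illustrate what a line-oriented record reader produces, here is a plain-Java sketch (not Hadoop's LineRecordReader itself; it assumes '\n' line endings and one byte per character, purely for illustration) that turns text into (byte offset, line) records:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class LineRecordsSketch {
    // Produce (byte offset, line) pairs: the key/value records a text input
    // format hands to each map task.
    public static List<Map.Entry<Long, String>> toRecords(String text) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        long offset = 0;
        for (String line : text.split("\n")) {
            records.add(Map.entry(offset, line));
            offset += line.length() + 1; // +1 for the consumed newline
        }
        return records;
    }

    public static void main(String[] args) {
        toRecords("first line\nsecond line\nthird").forEach(r ->
            System.out.println(r.getKey() + " -> " + r.getValue()));
    }
}
```

The byte offset becomes the map input key and the line text the value, which is why a typical text-processing mapper ignores its key.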
Data Flow from Mappers to Reducers
[Figure: each mapper's intermediate output passes through a Partitioner; the shuffle phase routes each partition to one of the reducers.]
Reducer in Detail
- Reduces a set of intermediate values which share a key to a smaller set of values
- Shuffle
  - Copies the sorted output from each mapper across the network using HTTP
- Sort
  - Sorts reduce inputs by key
  - The shuffle and sort phases occur simultaneously
  - Secondary sort using custom functions
- Reduce
  - The framework calls reduce() for each key and its collection of values
The Partitioner Class
- Responsible for creating the partitions for the sort and shuffle phase
  - Dictates what each reducer receives from each mapper
  - Can be used to create a global order of keys
- The total number of partitions is the same as the number of reduce tasks
- HashPartitioner is used by default
  - Uses key.hashCode() to return the partition number
- JobConf sets the Partitioner implementation
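The default hash-partitioning rule can be sketched in plain Java (the class and method names are illustrative; Hadoop's HashPartitioner is commonly implemented with this same formula):

```java
public class HashPartitionSketch {
    // Default partitioning rule: mask off the sign bit of the key's hash code,
    // then take it modulo the number of reduce tasks. Every mapper applies the
    // same rule, so all values for a given key land on the same reducer.
    public static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("hadoop", 4));
        System.out.println(partitionFor("hadoop", 4)); // same key -> same partition
    }
}
```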
Handling Output
- Each reducer writes its own output file
- Uses a RecordWriter that is part of an OutputFormat object
  - TextOutputFormat: writes "key \t value \n" strings to the output file
  - SequenceFileOutputFormat: uses a binary format to pack (k, v) pairs
  - NullOutputFormat: discards output

[Figure: reducer outputs pass through an OutputFormat, which writes one output file per reducer.]
Optimizations for a MapReduce Job
- Use a combiner function
  - Essentially a mini-reducer that runs on a mapper's output
  - Executed locally on each node after a map task
  - Helps reduce the data to be transferred during the shuffle phase

  conf.setCombinerClass(ExampleReducer.class);
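The effect of a combiner can be illustrated in plain Java (a single-machine simulation with illustrative names): pre-aggregating one mapper's (word, 1) pairs locally shrinks the number of records that would otherwise cross the network during the shuffle.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CombinerSketch {
    // Without a combiner, a mapper would ship one (word, 1) record per word
    // occurrence. A combiner sums them locally first, so at most one record
    // per distinct word leaves the node.
    public static Map<String, Integer> combineLocally(List<String> mapperWords) {
        Map<String, Integer> partialCounts = new HashMap<>();
        for (String word : mapperWords)
            partialCounts.merge(word, 1, Integer::sum);
        return partialCounts;
    }

    public static void main(String[] args) {
        List<String> words = List.of("the", "the", "the", "fox", "the");
        System.out.println("records without combiner: " + words.size());
        System.out.println("records with combiner:    " + combineLocally(words).size());
    }
}
```

This only works because summation is associative: the reducer can sum the partial counts exactly as it would have summed the raw ones, which is why a reducer like ExampleReducer above can double as its own combiner.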
Conclusions
- A basic overview of Hadoop and MapReduce
- An introduction to the Hadoop framework to get you started on writing programs and code
- For more information:
  - Tom White, "Hadoop: The Definitive Guide", O'Reilly Press
  - Jason Venner, "Pro Hadoop"