You are on page 1of 8

Home Tutorials JoinUs About Resources JobBoard Examples Whitepapers Academy Library

Java Android JVMLanguages SoftwareDevelopment Agile DevOps Communications Career Misc MetaJCG Searchthissite...

Youarehere:Home SoftwareDevelopment MapReduceQuestionsandAnswersPart1

AboutMariaJurcovicova

Newsletter

MapReduceQuestionsandAnswersPart1 80381insidersarealreadyenjoyin
updatesandcomplimentarywhitepa
byMariaJurcovicovaonMay17th,2012 | Filedin:SoftwareDevelopment Tags:MapReduce

WithallthehypeandbuzzsurroundingNoSQL,Idecidedtohavealookatit.IquicklyfoundthatthereisnotoneNoSQLIcouldlearn.
Jointhemnowtogain
Rather,therearevariousdifferentsolutionswithdifferentpurposesandtradeoffs.Thosevarioussolutionstendtohaveonethingin tothelatestnewsintheJavaworld,
common:processingofdatainNoSQLstorageisusuallydoneusingMapReduceframework. insightsaboutAndroid,Scala,Groov
relatedtechnologies.
SearchonMapReducefoundvariousscatteredblogposts,someuniversitiescoursespagesandonebookthatseemstocontainalmost
everythingothersourcesdid.
Emailaddress:

Youremailaddress
ThispostcontainsMapReducequestionsandanswersbasedonthebook.Basically,ifIwouldbeastudent,thisiswhatIwouldhave
madeasatestpreparationnotes.IfIwouldbeateacher,thisiswhatIwouldaskontheexam.
Signup

Firstchaptergivescreditwherethecreditisdue,therestcontainsquestions.Lastchaptercontainshandsoncodingexercises.
JoinUs
TheBook
With
ThebookisnamedDataIntensiveTextProcessingwithMapReduce.Ifyouareunsurewhetheryouwanttobuyitornot,pre uniquevisitors
productionmanuscriptisavailableforfree. authorswear
thetopJavar
Donotbefooledbythetitle.ThebookismoreaboutMapReduceitselfandlessabouttextprocessing.Firsthalfofthebookdescribes around.Cons
generalpurposetricks(designpatterns)usefulforanytask.Secondhalfcontainsachaptersontextprocessing,graphalgorithmsand thelookoutfo
expectationmaximization. encourageyo
Ifyouhaveablogwithuniqueandin
ThebookcontainsalmosteverythingIfoundonvariousblogs,universitycoursespagesandmuchmore. contentthenyoushouldcheckouto
partnersprogram.Youcanalsobe
Questions forJavaCodeGeeksandhoneyou

Questionsaresplitbybookchapters.Withoneminorexception(counters),questionsaremostlybasedonthefirsthalfofthebook.
SecondhalfcontainsconcretealgorithmsandIwantedtofocusongeneralpurposeknowledge.

Itdoesnotmeanthatlearningthemisnotuseful.EspeciallyGraphAlgorithmschaptercontainseasilygeneralizableideas.

2MapReduceBasics

2.2MappersandReducers

DescribegeneralMapReducealgorithm.Splititintophases.Foreachphaseinclude:

whoisresponsible(framework/programmer/customizable),
whatitdoes,
phaseinput,
phaseoutput.

Theansweris:

MapReducehasfourphases:

map,
combine,
shuttleandsort,
reduce.

Mapphaseisdonebymappers.Mappersrunonunsortedinputkey/valuespairs.Thesamephysicalnodesthatkeepsinputdatarun
alsomappers.Eachmapperemitszero,oneormultipleoutputkey/valuepairsforeachinputkey/valuepair.Outputkey/valuepairsare
calledintermediatekey/valuepairs.Theirtypeisusuallydifferentfrominputkey/valuepairtype.Mappermustbesuppliedby
programmers.
Combinephaseisdonebycombiners.Combinershouldcombinekey/valuepairswiththesamekeytogether.Eachcombinermayrun
zero,onceormultipletimes.Frameworkdecideswhetherandhowmanytimestorunthecombiner,programmerhasnocontroloverit.
Combinersoutputkey/valuepairtypemustbethesameasitsinputkey/valuepairtypes.

Shuttleandsortphaseisdonebyframework.Datafromallmappersaregroupedbythekey,splitamongreducersandsortedbythe
key.Eachreducerobtainsallvaluesassociatedwiththesamekey.Programmermaysupplycustomcomparefunctionforsortingand
partitionerfordatasplit.Allkey/valuepairsgoingtothesamereduceraresortedbythekey,butthereisnoglobalsorting.

Reducerobtainssortedkey/[valueslist]pairssortedbythekey.Valueslistcontainsallvalueswiththesamekeyproducedbymappers.
Eachreduceremitszero,oneormultipleoutputkey/valuepairsforeachinputkey/valuepair.Outputkey/valuepairtypeisusually
differentfrominputkey/valuepairtype.Reducermustbesuppliedbyprogrammers.

IfthealgorithmrequiresmultipleMapReduceiterations,eachcombinermayincrementglobalcounter.Driverprogramwouldreadthe
counterafterthereducephase.Itthendecideswhethernextiterationisneededornot.

Note:chapter2doesnotmentioncounters.Theyareexplainedlater,inthechapter5.Decideifthestatementistrueorfalse:All
MapReduceimplementationsimplementexactlysamealgorithm.

Decideifthestatementistrueorfalse:AllMapReduseimplementationsimplementexactlysamealgorithm.
Theansweris:

False.Forexample,Googlesimplementationdoesnotallowchangeofkeyinthereducer,butprovidessortingforvalues.Hadoopdoes
notprovidevaluessorting,butreducercanchangethekey.

Trueorfalse:Eachmappermustgeneratethesamenumberofkey/valuepairsasitsinputhad.

Theansweris:

False.Mappermaygenerateanynumberofkey/valuepairs(includingzero).

Trueorfalse:Mappersinputkey/valuepairsaresortedbythekey.

Theansweris:

False.Mappersinputisnotsortedinanyway.

Trueorfalse:Mappersoutputkey/valuemustbeofthesametypeasitsinput.

Theansweris:

False.Mappermayproducekey/valuepairsofanytype.

Trueorfalse:Reducerisappliedtoallvaluesassociatedwiththesamekey.

Theansweris:

True.Reducerisappliedtoallvaluesassociatedwiththesamekey.

Trueorfalse:Reducersinputkey/valuepairsaresortedbythekey.

Theansweris:

True.Reducersinputkey/valuepairsaresortedbythekey.
implementation.

Trueorfalse:Eachreducermustgeneratethesamenumberofkey/valuepairsasitsinputhad.

Theansweris:

False.Reducermaygenerateanynumberofkey/valuepairs(includingzero).

Trueorfalse:Reducersoutputkey/valuepairmustbeofthesametypeasitsinput.

Theansweris:

False.ThestatementisfalseinHadoopandtrueinGooglesimplementation.

2.3TheExecutionFramework

Whathappensincaseofhardware/softwarefailure?

Theansweris:

MapReduceframeworkmustbeabletorecoverfrombothhardware(diskfailures,RAMerrors)andsoftware(bugs,unexpected
exceptions)errors.Botharecommonandexpected.

Isitpossibletostartreducerswhilesomemappersstillrun?Why?
Theansweris:

No.Reducersinputisgroupedbythekey.Thelastmappercouldtheoreticallyproducekeyalreadyconsumedbyrunningreducer.

Defineastraggler.

Theansweris:

Straggleriseithermaporreducetaskthattakesunusuallylongtimetocomplete.

Whatisspeculativeexecution(alsocalledbackuptasks)?Whatproblemdoesitsolve?

Theansweris:

Identicalcopyofthesametaskisexecutedonmultiplenodes.Outputofthefastesttaskused.
Speculativeexecutionhelpsifthetaskisslowbecauseofhardwareproblem.Itdoesnothelpifthedistributionofvaluesoverkeysis
skewed.

2.4PartitionersandCombiners

Whatdoespartitionerdo?

Theansweris:

Partitionerdivideskey/valuespairsproducedbymaptasksbetweenreducers.

Whatdoescombinerdo?

Theansweris:

Combinerdoeslocalaggregationofkey/valuespairsproducedbymapperbeforeorduringshuttleandsortphase.Ingeneral,it
reducesamountofdatatobetransferredbetweennodes.

Theframeworkdecideshowmanytimestorunit.Combinermayrunzero,oneormultipletimesonthesameinput.

Decideifthestatementistrueorfalse:Eachcombinerrunsexactlyonce.

Theansweris:

False.Theframeworkdecideswhethercombinerrunszero,onceormultipletimes.

2.5TheDistributedFileSystem

BrieflydescribeHDFSarchitecture.

Theansweris:

HDFShasonenamenodeandalotofdatanodes.Namenodeismasterandcoordinatesfileoperations,ensuresintegrityofthesystem
andkeepsnamespace(metadata,directorystructure,filetoblockmappingetc.).

Dataarestoredinbigblocksondatanodes.Eachblockisstoredonmultiple,bydefaultthree,datanodes.Namenodecheckswhether
datanodesworkcorrectlyandmanagesdatareplication.

Clientcontactsnamenodewhichanswerswithdatablockidanddatanodeid.Datanodethensendsdatadirectlytotheclient.

Decideifthestatementistrueorfalse:HDFSisgoodatmanaginglargenumberofsmallfiles.

Theansweris:

False.HDFSisgoodatmanaginglargefiles.

2.6HadoopClusterArchitecture

Explaindifferencebetweenjobtrackerandtasktracker?

Theansweris:

Clientexecutesjobsonjobtracker.Jobtrackerrunsonthemaster.JobtrackermonitorsMapReducejobs.Italsocoordinatesmappers
andreducers.

Tasktrackerrunsbothusercodeanddatanodedaemononslavenodes.Itisnevercontactedbytheclient.

Explainmapperlifecycle.

Theansweris:

Initializationmethodiscalledbeforeanyothermethodiscalled.Ithasnoparametersandnooutput.

Mapmethodiscalledseparatelyforeachkey/valuepair.Itprocessinputkey/valuepairsandemitsintermediatekey/valuepairs.
Closemethodrunsafterallinputkey/valuehavebeenprocessed.Themethodshouldcloseallopenresources.Itmayalsoemit
key/valuepairs.

Explainreducerlifecycle.

Theansweris:

Initializationmethodiscalledbeforeanyothermethodiscalled.Ithasnoparametersandnooutput.

Reducemethodiscalledseparatelyforeachkey/[valueslist]pair.Itprocessintermediatekey/valuepairsandemitsfinalkey/value
pairs.Itsinputisakeyanditeratoroverallintermediatevaluesassociatedwiththesamekey.

Closemethodrunsafterallinputkey/valuehavebeenprocessed.Themethodshouldcloseallopenresources.Itmayalsoemit
key/valuepairs.

3MapReduceAlgorithmDesign

3.1LocalAggregation

Whatislocalaggregationandwhyisitused?

Theansweris:

Eithercombineroramappercombineskey/valuepairswiththesamekeytogether.Theymaydoalsosomeadditionalpreprocessingof
combinedvalues.Onlykey/valuepairsproducedbythesamemapperarecombined.

Key/Valuepairscreatedbymaptasksaretransferredbetweennodesduringshuffleandsortphase.Localaggregationreducesamount
ofdatatobetransferred.

Ifthedistributionofvaluesoverkeysisskewed,datapreprocessingincombinerhelpstoeliminatereducestragglers.

Whatisinmappercombining?Stateadvantagesanddisadvantagesoverwritingcustomcombiner.

Theansweris:

Localaggregation(combiningofkey/valuepairs)doneinsidethemapper.

Mapmethoddoesnotemitkey/valuepairs,itonlyupdatesinternaldatastructure.Closemethodcombinesandpreprocessallstored
dataandemitsfinalkey/valuepairs.Internaldatastructureisinitializedininitmethod.

Advantages:

Itwillrunexactlyonce.Combinermayrunmultipletimesornotatall.
Wearesureitwillrunduringmapphase.Combinermayruneitheraftermapphaseorbeforereducephase.Thelattercase
providesnoreductionintransferreddata.
Inmappercombiningistypicallymoreeffective.Combinerdoesnotreduceamountofdataproducedbymappers,itonlygroups
generateddatatogether.Thatcausesunnecessaryobjectcreation,destruction,serializationanddeserialization.

Disadvantages:

Scalabilitybottleneck:thetechniquedependsonhavingenoughmemorytostoreallpartialresults.Wehavetoflushpartialresults
regularlytoavoidit.Combineruseproducenoscalabilitybottleneck.

3.2PairsandStripes

ExplainPairdesignpatteronacooccurenceexample.Includeadvantages/disadvantagesagainstStripesapproach,possible
optimizationsandtheirefficacy.

Theansweris:

Mappergenerateskeyscomposedfrompairsofwordsthatoccurredtogether.Thevaluecontainsthenumber1.Frameworkgroups
key/valuepairswiththesameworkpairstogetherandreducersimplycountsthenumbervaluesforeachincomingkey/valuepairs.

Eachfinalpairencodesacellincooccurrencematrix.Localaggregation,e.g.combinerorinmappercombining,canbeused.

Advantages:

Simplevalues,lessserialization/deserializationoverhead.
Simplermemorymanagement.Noscalabilitybottleneck(onlyifinmapperoptimizationwouldbeused).

Disadvantages:

Hugeamountofintermediatekey/valuepairs.Shuffleandsortphaseisslower.
Localaggregationislesseffectivetoomanydistinctkeys.
ExplainStripesdesignpatteronacooccurenceexample.Includeadvantages/disadvantagesagainstPairsapproach,
possibleoptimizationsandtheirefficacy.

Theansweris:

Mappergeneratesadistinctkeyfromeachencounteredword.Associatedvaluecontainsamapofallcooccurredwordsasmapkeys
andnumberofcooccurrencesasmapvalues.Frameworkgroupssamewordstogetherandreducermergesvaluemaps.

Eachfinalpairencodesarowincooccurrencematrix.Combinerorinmappercombiningcanbeused.

Advantages:

Smallamountofintermediatekey/valuepairs.Shuffleandsortphaseisfaster.
Intermediatekeysaresmaller.
Effectivelocalaggregationsmallernumberofdistinctkeys.

Disadvantages:

Complexvalues,moreserialization/deserializationoverhead.
Morecomplexmemorymanagement.Asvaluemapsmaygrowtoobig,theapproachhaspotentialforscalabilitybottleneck.

Explainscalabilitybottleneckcausedbystripesapproach.

Theansweris:

Stripessolutionkeepsamapofcooccurredwordsinmemory.Astheamountofcooccurredwordsisunlimited,themapsizeis
unlimitedtoo.Hugemapdoesnotfitintothememoryandcausespagingoroutofmemoryerrors.

3.3ComputingRelativeFrequencies

Relativefrequenciesofcooccurrencesproblem:

Input:textdocuments
key:documentid
value:textdocument

Output:key/valuepairswhere
key:pair(word1,word2)
value:#cooccurrences(word1,word2)/#cooccurrences(word1,anyword)

Fixfollowingsolutiontorelativefrequenciesofcooccurrencesproblem:

01 classMAPPER
02 methodINITIALIZE
03 H=newhashmap
04
05 methodMAP(docida,docd)
06 foralltermwindocddo
07 foralltermupatrineighbors(w)do
08 H(w)=H(w)+1
09 emit(pair(u,w),count1)
10
11 methodCLOSE
12 foralltermwinH
13 emit(pair(w,*),H(w))
14
15 classREDUCER
16 variabletotal_occurrences=0
17
18 methodREDUCE(pair(p,u),counts[c1,c2,...,cn])
19 s=0
20 forallcincounts[c1,c2,...,cn]do
21 s=s+c
22
23 ifu=*
24 total_occurrences=s
25 else
26 emit(pairp,s/total_occurrences)
27
28 classSORTING_COMPARATOR
29 methodcompare(key(p1,u1),key(p2,u2))
30 ifp1=p2ANDu1=*
31 returnkey1islower
32
33 ifp1=p2ANDu2=*
34 returnkey2islower
35
36 returncompare(p1,p2)

Theansweris:

Partitionerismissing,frameworkcouldsendkey/valuepairswithtotalstodifferentreducerthankey/pairswithwordpairs.

1 classPARTITIONING_COMPARATOR
2 methodcompare(key(p1,u1),key(p2,u2))
3 ifp1=p2
4 returnkeysareequal
5
6 returnkeysaredifferent
Describeorderinversiondesignpattern.

Theansweris:

Orderinversionisusedifthealgorithmrequirestwopassesthroughmappergeneratedkey/valuepairswiththesamekey.Thefirst
passgeneratessomeoverallstatisticwhichisthenappliedtodataduringthesecondpass.Thereducerwouldneedtobufferdatain
thememoryjusttobeabletopasstwicethroughthem.

Firstpassresultiscalculatedbymappersandstoredinsomeinternaldatastructure.Themapperemitstheresultinclosingmethod,
afterallusualintermediatekey/valuepairs.

Thepatternrequirescustompartitioningandsort.Firstpassresultmustcometothereducerbeforeusualkey/valuepairs.Ofcourse,it
mustcometothesamereducer.

3.4SecondarySorting

Describevaluetokeydesignpattern.

Theansweris:

Hadoopimplementationdoesnotprovidesortingforgroupedvaluesinreducersinput.Valuetokeyisusedasaworkaround.

Partofthevalueisaddedtothekey.Customsortthensortsprimarybythekeyandsecondarybytheaddedvalue.Custompartitioner
mustmovealldatawiththesameoriginalkeytothesamereducer.

3.5RelationalJoins

Describereducesidejoinbetweentableswithoneononerelationship.

Theansweris:

Mapperproduceskey/valuepairswithjoinidsaskeysandrowvaluesasvalue.Correspondingrowsfrombothtablesaregrouped
togetherbytheframeworkduringshuffleandsortphase.

Reducemethodinreducerobtainsjoinidandtwovalues,eachrepresentsrowfromonetable.Reducerjoinsthedata.

Describereducesidejoinbetweentableswithonetomanyrelationship.

Theansweris:

WeassumethatthejoinkeyisprimarykeyintablecalledS.SecondtableiscalledT.Inotherwords,thetableSinontheonesideof
therelationshipandthetableTisonthemanysideoftherelationship.

Wehavetoimplementmapper,customsorter,partitionerandreducer.

Mapperproduceskeycomposedfromjoinidandtableflag.Partitionersplitsthedatainsuchaway,thatallkey/valuepairswiththe
samejoinidgoestothesamereducer.Customsortputskey/valuepairgeneratedfromthetableSrightbeforekey/valuepairwiththe
samejoinidfromthetableT.

Reducersinputlookslikethis:
((JoinId1,s)>row)
((JoinId1,t)>[rows])
((JoinId2,s)>row)
((JoinId2,t)>[rows])
...
((JoinIdn,s),row)
((JoinIdn,t),[rows])

Thereducerjoinsallrowsfrom s pairwithallrowsfromfollowing t pair.

Describereducesidejoinbetweentableswithmanytomanyrelationship.

Theansweris:

WeassumethatdataarestoredintablescalledSandT.ThetableSissmaller.Wehavetoimplementmapper,customsorter,
partitionerandreducer.

Mapperproduceskeycomposedfromjoinidandtableflag.Partitionersplitsthedatainsuchaway,thatallkey/valuepairswiththe
samejoinidgoestothesamereducer.Customsortputsthekey/valuepairsgeneratedfromthetableSisrightbeforeallkey/valuepair
withthedatafromthetableT.

Reducersinputlookslikethis:
((JoinId1,s)>[rows])
((JoinId1,t)>[rows])
((JoinId2,s)>[rows])
((JoinId2,t)>[rows])
...
((JoinIdn,s),[rows])
((JoinIdn,t),[rows])

ThereducerbuffersallrowswiththesameJoinIdfromthetableSintothememoryandjoinsthemwithfollowingTtablerows.

Alldatafromthesmallertablemustfitintothememorythealgorithmhasscalabilitybottleneckproblem.

Describemapsidejoinbetweentwodatabasetables.

Theansweris:

Mapsidejoinworksonlyiffollowingassumptionshold:

bothdatasetsaresortedbythejoinkey,
bothdatasetsarepartitionedthesameway.

Mappermapsoverlargerdatasetandreadscorrespondingpartofsmallerdatasetinsidethemapper.Asthesmallersetispartitioned
thesamewayasbiggerone,onlyonemaptaskaccessthesamedata.Asthedataaresortedbythejoinkey,wecanperformmerge
join O(n) .

Describememorybackedjoin.

Theansweris:

Smallersetofdataisloadedintothememoryineverymapper.Mappersloopoverlargerdatasetandjoinsitwithdatainthememory.If
thesmallersetistoobigtofitintothememory,datasetisloadedintomemcachedorsomeothercachingsolution.

Whichoneisfaster?Mapsidejoinorreducesidejoin?

Theansweris:

Mapsidejoinisfaster.

GotoPart2

Reference:MapReduceQuestionsandAnswersfromourJCGpartnerMariaJurcovicovaattheThisisStuffblog.

DoyouwanttoknowhowtodevelopyourskillsettobecomeaJavaRockstar?

SubscribetoournewslettertostartRockingrightnow!
TogetyoustartedwegiveyouourbestsellingeBooksforFREE!

1.JPAMiniBook
2.JVMTroubleshootingGuide
3.JUnitTutorialforUnitTesting
4.JavaAnnotationsTutorial
5.JavaInterviewQuestions
6.SpringInterviewQuestions
7.AndroidUIDesign

andmanymore....

Emailaddress: Youremailaddress Signup!

LeaveaReply

Name(Required)

Mail(willnotbepublished)(Required)

Website
seven =5
Notifymeoffollowupcommentsviaemail.Youcanalsosubscribewithoutcommenting.
Signmeupforthenewsletter!

SubmitComment

KnowledgeBase Partners HallOfFame AboutJavaCodeGeeks

Academy Mkyong AndroidFullApplicationTutorialseries JCGs(JavaCodeGeeks)isanindependentonlinecommunityfo

Examples GWT2Spring3JPA2Hibernate3.5Tutorial creatingtheultimateJavatoJavadevelopersresourcecenterta


technicalarchitect,technicalteamlead(seniordeveloper),proje
TheCodeGeeksNetwork
Library AdvantagesandDisadvantagesofCloud juniordevelopersalike.JCGsservetheJava,SOA,AgileandTe
JavaCodeGeeks ComputingCloudcomputingprosandcons communitieswithdailynewswrittenbydomainexperts,articles,
Resources
AndroidGoogleMapsTutorial reviews,announcements,codesnippetsandopensourceprojec
.NETCodeGeeks
Tutorials
WebCodeGeeks AndroidLocationBasedServicesApplication
Whitepapers
GPSlocation

11OnlineLearningwebsitesthatyoushould
checkout

JavaBestPracticesVectorvsArrayListvs
HashSet

AndroidJSONParsingwithGsonTutorial

AndroidQuickPreferencesTutorial

DifferencebetweenComparatorand
ComparableinJava

JavaCodeGeeksandallcontentcopyright20102015,ExelixisMediaLtd|TermsofUse|PrivacyPolicy|Contact
AlltrademarksandregisteredtrademarksappearingonJavaCodeGeeksarethepropertyoftheirrespectiveowners.
JavaisatrademarkorregisteredtrademarkofOracleCorporationintheUnitedStatesandothercountries.
JavaCodeGeeksisnotconnectedtoOracleCorporationandisnotsponsoredbyOracleCorporation.

You might also like