
Hadoop MapReduce

Fundamentals
@LynnLangit

a five part series – Part 1 of 5


Course Outline

• Why developers may choose to use the MapReduce API
• Using MapReduce
  • Setting up development
  • Understanding MapReduce jobs
  • WordCount via different languages, tools and editors
  • Other MapReduce jobs via different languages, tools and editors
• Optimizing MapReduce
  • Job optimization
  • Optimizing Mappers
  • Optimizing Reducers
• Comparing MapReduce to other technologies
  • Future directions

What is Hadoop MapReduce?
What is Hadoop?
 Open-source data storage and processing API
 Massively scalable, automatically parallelizable
 Based on work from Google
 GFS + MapReduce + BigTable
 Current Distributions based on Open Source and Vendor Work
 Apache Hadoop
 Cloudera – CDH4 w/ Impala
 Hortonworks
 MapR
 AWS
 Windows Azure HDInsight
Why Use Hadoop?

 Cheaper – scales to petabytes or more
 Faster – parallel data processing
 Better – suited for particular types of BigData problems
What types of business problems fit Hadoop?

• Risk Modeling
• Customer Churn Analysis
• Recommendation Engine
• Ad Targeting
• Point of Sale Transactional Analysis
• Threat Analysis
• Trade Surveillance
• Search Quality
• Data Sandbox

Source: Cloudera “Ten Common Hadoopable Problems”


Companies Using Hadoop

 Facebook
 Yahoo
 Amazon
 eBay
 American Airlines
 The New York Times
 Federal Reserve Board
 IBM
 Orbitz
Forecast growth of Hadoop Job Market

Source: Indeed -- http://www.indeed.com/jobtrends/Hadoop.html


Hadoop is a set of Apache Frameworks and more…

 Data storage (HDFS)
   Runs on commodity hardware (usually Linux)
   Horizontally scalable
 Processing (MapReduce)
   Parallelized (scalable) processing
   Fault tolerant
 Other tools / frameworks
   Data Access – HBase, Hive, Pig, Mahout
   Tools – Hue, Sqoop
   Monitoring – Greenplum, Cloudera

(Stack diagram, bottom to top: Hadoop Core – HDFS → MapReduce API → Data Access → Tools & Libraries → Monitoring & Alerting)
What are the core parts of a Hadoop distribution?

HDFS Storage
• Redundant (3 copies)
• For large files – large blocks: 64 or 128 MB per block
• Can scale to 1000s of nodes

MapReduce API
• Batch (job) processing
• Distributed and localized to the data (Map)
• Auto-parallelizable for huge amounts of data
• Fault-tolerant (auto retries)

Other Libraries
• Pig, Hive, HBase, others
• Add high availability and more
Hadoop Cluster: HDFS (Physical) Storage

One Name Node
• Contains a web site to view cluster information
• V2 Hadoop uses multiple Name Nodes for HA
• Assisted by a Secondary Name Node

Many Data Nodes
• 3 copies of each block by default

Work with data in HDFS
• Using common Linux shell commands
• Block size is 64 or 128 MB

(Diagram: Name Node and Secondary Name Node above Data Node 1, Data Node 2, Data Node 3)
MapReduce Job – Logical View

Image from - http://mm-tom.s3.amazonaws.com/blog/MapReduce.png


Hadoop Ecosystem
Common Hadoop Distributions

 Open Source
 Apache

 Commercial
 Cloudera
 Hortonworks
 MapR
 AWS MapReduce
 Microsoft HDInsight (Beta)
A View of Hadoop (from Hortonworks)

Source: “Intro to Map Reduce” -- http://www.youtube.com/watch?v=ht3dNvdNDzI


Setting up Hadoop Development

Hadoop Binaries
• Local install (Linux or Windows)
• Cloudera’s Demo VM (needs virtualization software, i.e. VMware, etc…)
• Cloud (AWS, Microsoft (Beta), others)

Data Storage
• Local (file system, or HDFS pseudo-distributed (single-node))
• Cloud (AWS, Azure, others)

MapReduce
• Local
• Cloud

Other Libraries & Tools
• Local vendor tools
• Cloud libraries
Demo – Setting up Cloudera Hadoop

Note: Demo VMs can be downloaded from - https://ccp.cloudera.com/display/SUPPORT/Demo+VMs


Hadoop MapReduce
Fundamentals
@LynnLangit

a five part series – Part 2 of 5


So, what’s the problem?
 “I can just use some ‘SQL-like’ language to query Hadoop, right?”
 “Yeah, SQL-on-Hadoop… that’s what I want.”
 “I don’t want to learn a new query language and…”
 “I want massive scale for my shiny, new BigData.”
Ways to MapReduce

Libraries: HBase, Hive, Pig, Sqoop, Oozie, Mahout, others…
Languages: Java*, HiveQL (HQL), Pig Latin, Python, C#, JavaScript, R, more…

Note: Java is most common, but other languages can be used

Demo – Using HiveQL on CDH4
What is Hive?

 a data warehouse system for Hadoop that
   facilitates easy data summarization
   supports ad-hoc queries (still batch though…)
   was created by Facebook
 a mechanism to project structure onto data and query it using a SQL-like language – HiveQL
   interactive console –or– executed scripts
   kicks off one or more MapReduce jobs in the background
 an ability to use indexes and built-in user-defined functions
Is HQL == ANSI SQL? – NO!

-- non-equality joins ARE allowed in ANSI SQL
-- but are NOT allowed in Hive (HQL)

SELECT a.*
FROM a
JOIN b ON (a.id <> b.id)

Note: Joins are quite different in MapReduce, more on that coming up…
Preparing for MapReduce

Loading Files into the File System
• Native file system, HDFS, or cloud storage
• Files are immutable once loaded
• Stored in large blocks (64 or 128 MB)

You Define Input, Map, Reduce, and Output
• Use Java or another programming language
• Work with key-value pairs
Common Hadoop Shell Commands

hadoop fs -cat file:///file2
hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
hadoop fs -copyFromLocal <fromDir> <toDir>
hadoop fs -put <localfile> hdfs://nn.example.com/hadoop/hadoopfile
sudo hadoop jar <jarFileName> <method> <fromDir> <toDir>
hadoop fs -ls /user/hadoop/dir1
hadoop fs -cat hdfs://nn1.example.com/file1
hadoop fs -get /user/hadoop/file <localfile>

Tips
-- ‘sudo’ means ‘run as administrator’ (super user)
-- some hadoop configurations use ‘hadoop dfs’ rather than ‘hadoop fs’ – file paths to hadoop differ for the former, see the link included for more detail
Demo – Working with Files and HDFS
Thinking in MapReduce

 Hint: “It’s Functional”


Understanding MapReduce – P1/3

 Map>>
   (K1, V1) →
     Info in
     Input Split
   list (K2, V2)
     Key / Value out (intermediate values)
   One list per local node
   Can implement a local Reducer (or Combiner)
Understanding MapReduce – P2/3

 Map>>
   (K1, V1) →
     Info in
     Input Split
   list (K2, V2)
     Key / Value out (intermediate values)
   One list per local node
   Can implement a local Reducer (or Combiner)
 Shuffle/Sort>>
Understanding MapReduce – P3/3

 Map>>
   (K1, V1) → list (K2, V2)
   Info in from the Input Split
   Key / Value out (intermediate values)
   One list per local node
   Can implement a local Reducer (or Combiner)
 Shuffle/Sort>>
   Precedes the Reduce phase
   Combines Map output into a list
 Reduce
   (K2, list(V2)) → list (K3, V3)
   Usually aggregates intermediate values

(input) <k1, v1> → map → <k2, v2> → combine → <k2, v2> → reduce → <k3, v3> (output)
MapReduce Example -
WordCount

Image from: http://blog.jteam.nl/wp-content/uploads/2009/08/MapReduceWordCountOverview1.png
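In code, the Map and Reduce steps pictured above look roughly like the following minimal Java sketch (MapReduce 2.0 API; class names are illustrative, not the exact demo code):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: (K1 = line offset, V1 = line of text) -> list(K2 = word, V2 = 1)
public class TokenizerMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE);  // emit intermediate (word, 1) pairs
    }
  }
}

// Reduce: (K2 = word, list(V2)) -> (K3 = word, V3 = total count)
class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();  // aggregate the intermediate values for this key
    }
    context.write(key, new IntWritable(sum));
  }
}

Because addition is associative, the same Reducer class can also serve as the local Combiner.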


MapReduce Objects

Master Node
• Name Node
• Job Tracker

Slave Nodes (1, 2, 3 …) – each runs
• Task Tracker
• Data Node

Each daemon spawns a new JVM


Ways to MapReduce

Libraries: HBase, Hive, Pig, Sqoop, Oozie, Mahout, others…
Languages: Java*, HiveQL (HQL), Pig Latin, Python, C#, JavaScript, R, more…

Note: Java is most common, but other languages can be used


Demo – Running MapReduce WordCount
Hadoop MapReduce
Fundamentals
@LynnLangit

a five part series – Part 3 of 5


Ways to run MapReduce Jobs

 Configure JobConf options (see the driver sketch below)
 From a development environment (IDE)
 From a GUI utility
   Cloudera – Hue
   Microsoft Azure – HDInsight console
 From the command line
   hadoop jar <filename.jar> input output
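For example, job options can be set in a driver class; a minimal Hadoop 2.x sketch, assuming the TokenizerMapper and IntSumReducer classes from the Part 2 WordCount sample:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);   // locates the jar to ship to the cluster
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir (must not exist yet)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, this is the class the hadoop jar command above invokes.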
Ways to MapReduce

Libraries: HBase, Hive, Pig, Sqoop, Oozie, Mahout, others…
Languages: Java*, HiveQL (HQL), Pig Latin, Python, C#, JavaScript, R, more…

Note: Java is most common, but other languages can be used


Setting up Hadoop On Windows Azure

 About HDInsight
Demo – MapReduce in the Cloud
 WordCount MapReduce using HDInsight
MapReduce (WordCount) with JavaScript

Note: JavaScript is part of the Azure Hadoop distribution
Common Data Sources for MapReduce Jobs

• Text files (semi-structured or unstructured), e.g. log files
• Statistical information – piles of numbers, often from scientific sources
• Clickstream – advertising, website traversals
• Geospatial information, e.g. cell phone activity
Where is your Data coming from?

 On premises
 Local file system
 Local HDFS instance
 Private Cloud
 Cloud storage
 Public Cloud
 Input Storage buckets
 Script / Code buckets
 Output buckets
Common Data Jobs for MapReduce

• Text Mining
• Index Building
• Graphs
• Patterns
• Filtering
• Prediction
• Risk Analysis
Demo – Other Types of MapReduce

Tip: Review the Java MapReduce code in these samples as well.


Methods to write MapReduce Jobs

 Typical – usually written in Java
   MapReduce 2.0 API
   MapReduce 1.0 API
   (see the API comparison sketch below)
 Streaming
   Uses stdin and stdout
   Can use any language to write Map and Reduce functions
   C#, Python, JavaScript, etc…
 Pipes
   Often used with C++
 Abstraction libraries
   Hive, Pig, etc… – write in a higher-level language that generates one or more MapReduce jobs
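The two Java APIs differ mainly in package and in how map receives its output channel; a side-by-side sketch (each class would live in its own file, bodies elided):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// MapReduce 1.0 API (org.apache.hadoop.mapred) - interface-based
public class OldApiMapper extends MapReduceBase
    implements org.apache.hadoop.mapred.Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // emit with output.collect(k, v)
  }
}

// MapReduce 2.0 API (org.apache.hadoop.mapreduce) - abstract-class-based
class NewApiMapper
    extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // emit with context.write(k, v)
  }
}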
Ways to MapReduce

Libraries: HBase, Hive, Pig, Sqoop, Oozie, Mahout, others…
Languages: Java*, HiveQL (HQL), Pig Latin, Python, C#, JavaScript, R, more…

Note: Java is most common, but other languages can be used


Demo – MapReduce via C# & PowerShell
Ways to MapReduce

Libraries: HBase, Hive, Pig, Sqoop, Oozie, Mahout, others…
Languages: Java*, HiveQL (HQL), Pig Latin, Python, C#, JavaScript, R, more…

Note: Java is most common, but other languages can be used


Using AWS MapReduce

Note: You can select Apache or MapR Hadoop Distributions to run your
MapReduce job on the AWS Cloud
What is Pig?

 ETL library for HDFS, developed at Yahoo
   Pig Runtime
   Pig Language (Pig Latin)
 Generates MapReduce jobs
 ETL steps
   LOAD <file>
   FILTER, JOIN, GROUP BY, FOREACH, GENERATE, COUNT…
   DUMP {to screen for testing}
   STORE <newFile>
MapReduce Python Sample

Remember that white space matters in Python!


Demo – Using AWS MapReduce with Pig

Note: You can select Apache or MapR Hadoop Distributions to run your
MapReduce job on the AWS Cloud
AWS Data Pipeline with HIVE
Hadoop MapReduce
Fundamentals
@LynnLangit

a five part series – Part 4 of 5


Better MapReduce – Optimizations

• Optimize BEFORE the job runs
• Optimize ONLOAD of the data
• Optimize the MAP phase of the job
• Optimize the SHUFFLE phase of the job
• Optimize the REDUCE phase of the job
• Optimize AFTER the job completes
Optimization BEFORE running a MapReduce Job

• File sizes
• Compression
• Encryption
• Pre-processing
• Chaining jobs
More about Input File Compression

 From Cloudera…
 Cloudera’s version of LZO is ‘splittable’

Type | File    | Size (GB) | Compress | Decompress
None | Log     | 8.0       | –        | –
Gzip | Log.gz  | 1.3       | 241      | 72
LZO  | Log.lzo | 2.0       | 55       | 35
Optimization WITHIN a MapReduce Job

Optimize the Mapper Task
• Define a custom Input Format
• Work with custom input data types
• Define a custom Partitioner
• Split the process out into other MapReduce jobs
• Add a local Reducer (Combiner)
• Can use Map-only jobs

Optimize the Reducer Task
• Split the process into multiple Reducer tasks
• Can configure reducer thresholds

(A driver-level sketch of some of these switches follows.)
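A hedged sketch of a few of these switches as driver code (assumes the WordCountDriver and IntSumReducer classes from earlier parts; the property names are the Hadoop 2.x ones – older clusters use mapred.compress.map.output):

// inside the driver's main(), before job.waitForCompletion(...)

// Add a local Reducer (Combiner) - valid here because summing is associative
job.setCombinerClass(IntSumReducer.class);

// For a Map-only job, set the number of Reducer tasks to zero
// job.setNumReduceTasks(0);

// Compress intermediate (map) output, e.g. with Snappy
Configuration jobConf = job.getConfiguration();
jobConf.setBoolean("mapreduce.map.output.compress", true);
jobConf.setClass("mapreduce.map.output.compress.codec",
    org.apache.hadoop.io.compress.SnappyCodec.class,
    org.apache.hadoop.io.compress.CompressionCodec.class);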
Mapper Task Optimization

Sub-dividing tasks
• Chaining jobs
• Aim for 1-3 minutes per map task run

Logging / Counters
• Can set to skip error(s)
• Debugging / unit testing

Custom Partitioner (see the sketch after this list)
• Default is the Hash Partitioner

Monitoring / tuning for the optimal spill ratio
• The optimal ratio depends on the type of Map task(s) being performed
• Goal: the number of spilled records EQUALS the number of map output records
• Can tune io.sort.spill.percent

Local (intermediate) compression
• Snappy
• LZO

Local Reducers (Combiner)
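A minimal custom Partitioner sketch (the class name and bucketing rule are illustrative); the default HashPartitioner simply uses key.hashCode() modulo the reducer count:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Route keys to reducers explicitly, e.g. by first letter,
// so related keys land on the same reducer
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String s = key.toString();
    char first = s.isEmpty() ? '_' : Character.toLowerCase(s.charAt(0));
    return (first & Integer.MAX_VALUE) % numPartitions;  // always in [0, numPartitions)
  }
}

Register it in the driver with job.setPartitionerClass(FirstLetterPartitioner.class). Beware of skew: a poor partition function can overload a single reducer.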
Data Types
 Writable
 Text (String)
 IntWritable
 LongWritable
 FloatWritable
 BooleanWritable
 WritableComparable for keys
 Custom Types supported – write RawComparator
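A sketch of a custom type (the record and its fields are illustrative): implement Writable for values, or WritableComparable so the type can also be sorted as a key:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Illustrative composite record: (userId, clicks)
public class ClickRecord implements WritableComparable<ClickRecord> {
  private long userId;
  private int clicks;

  public void write(DataOutput out) throws IOException {    // serialize
    out.writeLong(userId);
    out.writeInt(clicks);
  }

  public void readFields(DataInput in) throws IOException { // deserialize
    userId = in.readLong();
    clicks = in.readInt();
  }

  public int compareTo(ClickRecord other) {  // sort order when used as a key
    return Long.compare(userId, other.userId);
  }
}

For speed, a custom RawComparator can compare the serialized bytes directly, skipping deserialization during the sort.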
Reducer Task Optimization

• Secondary sort (see the wiring sketch below)
• Logging / debugging
• Sub-dividing tasks (chaining jobs)
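Secondary sort is wired up in the driver: partition on the natural key, sort on the full composite key, then group reducer input on the natural key again. A sketch with hypothetical classes (not from the course samples):

// all three classes below are hypothetical placeholders
job.setPartitionerClass(NaturalKeyPartitioner.class);
job.setSortComparatorClass(CompositeKeyComparator.class);
job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);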
MapReduce Job Optimization

• String manipulation can bottleneck – use StringBuffer.append rather than string concatenation (see the sketch below)
• Optimized Java processing – LongWritable or BytesWritable is faster than text parsing
• Distributed Cache (public or private)
• Add more nodes!
• Skipping bad records
• Debugging / unit testing
• Logging / profiling
• Sub-dividing tasks (chaining jobs)
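For example, inside a reduce method that assembles a long output value (a fragment-level sketch; assumes Iterable<Text> values and a Context, as in the earlier samples):

// Slow: each concatenation copies the whole string built so far
// String out = "";
// for (Text v : values) { out = out + "," + v.toString(); }

// Better: append into one buffer
StringBuilder out = new StringBuilder();
for (Text v : values) {
  out.append(v.toString()).append(',');
}
context.write(key, new Text(out.toString()));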
Demo – Unit Testing MapReduce
 Using MRUnit + Asserts
 Optionally using ApprovalTests

Image from http://c0de-x.com/wp-content/uploads/2012/10/staredad_english.png
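A minimal MRUnit sketch for the WordCount mapper (MRUnit 1.x, new-API MapDriver; class names assumed from the Part 2 sample):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class TokenizerMapperTest {
  private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

  @Before
  public void setUp() {
    mapDriver = MapDriver.newMapDriver(new TokenizerMapper());
  }

  @Test
  public void emitsOneCountPerWord() throws Exception {
    mapDriver.withInput(new LongWritable(0), new Text("cat cat"))
             .withOutput(new Text("cat"), new IntWritable(1))
             .withOutput(new Text("cat"), new IntWritable(1))
             .runTest();  // fails the test if actual output differs from expected
  }
}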


A note about MapReduce 2.0
 Splits the existing JobTracker’s roles
 resource management
 job lifecycle management
 MapReduce 2.0 provides many benefits over the existing
MapReduce framework, such as better scalability
 through distributed job lifecycle management
 support for multiple Hadoop MapReduce API versions in a single cluster
What is Mahout?
 Library with common machine learning algorithms
 Over 20 algorithms
 Recommendation (likelihood – Pandora)
 Classification (known data and new data – spam id)
 Clustering (new groups of similar data – Google news)
 Can non-statisticians find value using this library?
Mahout Algorithms
Setting up Hadoop on Windows

 For local development


 Install binaries via the Web Platform Installer
 Install .NET Azure SDK (for Azure BLOB storage)
 Install other tools
 Neudesic Azure Storage Viewer
Demo – Mahout
 Using HDInsight
What about the output?
Clients (Visualizations) for HDFS

 Many clients use Hive


 Often included in GUI console tools for Hadoop distributions as well
 Microsoft includes clients in Office (Excel 2013)
 Direct Hive client
 Connect using ODBC
 PowerPivot – data mashups and presentation
 Data Explorer – connect, transform, mashup and filter
 Hadoop SDK on Codeplex
 Other popular clients
 Qlikview
 Tableau
 Karmasphere
Demo – Executing Hive Queries
Demo – Using HDFS output in Excel 2013

To download Data Explorer:
http://www.microsoft.com/en-us/download/details.aspx?id=36803
About Visualization
Demo – New Visualizations – D3
Hadoop MapReduce
Fundamentals
@LynnLangit

a five part series – Part 5 of 5


Limitations of MapReduce

• Batch processing, not interactive
• Designed for a specific problem domain
• The MapReduce programming paradigm (functional) is not commonly understood
• Lack of trained support professionals
• API / security model are moving targets
Comparing: RDBMS vs. Hadoop

                      Traditional RDBMS            Hadoop / MapReduce
Data Size             Gigabytes (Terabytes)        Petabytes (Exabytes)
Access                Interactive and Batch        Batch – NOT Interactive
Updates               Read / Write many times      Write once, Read many times
Structure             Static Schema                Dynamic Schema
Integrity             High (ACID)                  Low
Scaling               Nonlinear                    Linear
Query Response Time   Can be near immediate        Has latency (due to batch processing)
Microsoft alternatives to MapReduce
 Use existing relational system
 Scale via cloud or edition (e.g. Enterprise or PDW)
 Use in memory OLAP
 SQL Server Analysis Services Tabular Models
 Use “productized” Dremel
 Microsoft Polybase – status = beta?
Looking Forward - Dremel or Apache Drill
 Based on original research from Google
Apache Drill Architecture
In-market MapReduce Alternatives

• Cloudera – Impala
• Google – BigQuery
Demo – Google’s BigQuery
 Dremel for the rest of us
Hadoop MapReduce Call to Action
More MapReduce Developer Resources

 Based on the distribution – on premises


 Apache
 MapReduce tutorial - http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html
 Cloudera
 Cloudera University - http://university.cloudera.com/
 Cloudera Developer Course (4 day) - *RECOMMENDED* -
http://university.cloudera.com/training/apache_hadoop/developer.html
 Hortonworks
 MapR
 Based on the distribution – cloud
 AWS MapReduce
 Tutorial - http://aws.amazon.com/elasticmapreduce/training/#gs
 Windows Azure HDInsight
 Tutorial - http://www.windowsazure.com/en-us/manage/services/hdinsight/using-mapreduce-with-hdinsight/
 More resources - http://www.windowsazure.com/en-us/develop/net/tutorials/intro-to-hadoop/
The Changing Data Landscape

• RDBMS
• Hadoop
• Other Services
