
Hadoop MapReduce

Fundamentals
@LynnLangit

a five part series – Part 1 of 5


Course Outline

• Why developers may choose to use the MapReduce API
• Using MapReduce
  • Setting up development
  • Understanding MapReduce jobs
  • WordCount via different languages, tools and editors
  • Other MapReduce jobs via different languages, tools and editors
• Optimizing MapReduce
  • Job optimization
  • Optimizing Mappers
  • Optimizing Reducers
• Comparing MapReduce to other technologies
  • Future directions

What is Hadoop MapReduce?
What is Hadoop?
 Open-source data storage and processing API
 Massively scalable, automatically parallelizable
 Based on work from Google
 GFS + MapReduce + BigTable
 Current Distributions based on Open Source and Vendor Work
 Apache Hadoop
 Cloudera – CDH4 w/ Impala
 Hortonworks
 MapR
 AWS
 Windows Azure HDInsight
Why Use Hadoop?

 Cheaper – scales to petabytes or more
 Faster – parallel data processing
 Better – suited for particular types of BigData problems
What types of business problems fit Hadoop?

• Risk Modeling
• Customer Churn Analysis
• Recommendation Engine
• Ad Targeting
• Point of Sale Transactional Analysis
• Threat Analysis
• Trade Surveillance
• Search Quality
• Data Sandbox

Source: Cloudera “Ten Common Hadoopable Problems”


Companies Using Hadoop

 Facebook
 Yahoo
 Amazon
 eBay
 American Airlines
 The New York Times
 Federal Reserve Board
 IBM
 Orbitz
Forecast growth of Hadoop Job Market

Source: Indeed -- http://www.indeed.com/jobtrends/Hadoop.html


Hadoop is a set of Apache Frameworks and more…

 Data storage (HDFS)
   Runs on commodity hardware (usually Linux)
   Horizontally scalable
 Processing (MapReduce)
   Parallelized (scalable) processing
   Fault tolerant
 Other tools / frameworks
   Data Access – HBase, Hive, Pig, Mahout
   Tools – Hue, Sqoop
   Monitoring – Greenplum, Cloudera

(Stack diagram, bottom to top: Hadoop Core – HDFS → MapReduce API → Data Access → Tools & Libraries → Monitoring & Alerting)
What are the core parts of a Hadoop distribution?

HDFS Storage
• Redundant (3 copies)
• For large files – large blocks: 64 or 128 MB per block
• Can scale to 1000s of nodes

MapReduce API
• Batch (job) processing
• Distributed and localized to the data (Map)
• Auto-parallelizable for huge amounts of data
• Fault-tolerant (auto retries)

Other Libraries
• Pig, Hive, HBase, others
• Add high availability and more
Hadoop Cluster: HDFS (Physical) Storage

One Name Node
• Contains a web site to view cluster information
• V2 Hadoop uses multiple Name Nodes for HA
• Assisted by a Secondary Name Node

Many Data Nodes
• 3 copies of each block by default

Work with data in HDFS
• Using common Linux shell commands
• Block size is 64 or 128 MB

(Diagram: Name Node and Secondary Name Node above Data Node 1, Data Node 2, Data Node 3)
MapReduce Job – Logical View

Image from - http://mm-tom.s3.amazonaws.com/blog/MapReduce.png


Hadoop Ecosystem
Common Hadoop Distributions

 Open Source
 Apache

 Commercial
 Cloudera
 Hortonworks
 MapR
 AWS MapReduce
 Microsoft HDInsight (Beta)
A View of Hadoop (from Hortonworks)

Source: “Intro to Map Reduce” -- http://www.youtube.com/watch?v=ht3dNvdNDzI


Setting up Hadoop Development

Hadoop Binaries
• Local install (Linux or Windows)
• Cloudera’s Demo VM (needs virtualization software, i.e. VMware, etc…)
• Cloud (AWS, Microsoft (Beta), others)

Data Storage
• Local (file system, or HDFS pseudo-distributed (single-node))
• Cloud (AWS, Azure, others)

MapReduce
• Local
• Cloud

Other Libraries & Tools
• Local vendor tools
• Cloud libraries
Demo – Setting up Cloudera Hadoop

Note: Demo VMs can be downloaded from - https://ccp.cloudera.com/display/SUPPORT/Demo+VMs


Hadoop MapReduce
Fundamentals
@LynnLangit

a five part series – Part 2 of 5


So, what’s the problem?
 “I can just use some ‘SQL-like’ language to query Hadoop, right?”
 “Yeah, SQL-on-Hadoop… that’s what I want.”
 “I don’t want to learn a new query language and…”
 “I want massive scale for my shiny, new BigData.”
Ways to MapReduce

Libraries: HBase, Hive, Pig, Sqoop, Oozie, Mahout, others…
Languages: Java*, HiveQL (HQL), Pig Latin, Python, C#, JavaScript, R, more…

Note: Java is most common, but other languages can be used

Demo – Using HiveQL on CDH4
What is Hive?

 a data warehouse system for Hadoop that
   facilitates easy data summarization
   supports ad-hoc queries (still batch though…)
   was created by Facebook
 a mechanism to project structure onto data and query it using a SQL-like language – HiveQL
   interactive console –or– executed scripts
   kicks off one or more MapReduce jobs in the background
 an ability to use indexes and built-in user-defined functions
Is HQL == ANSI SQL? – NO!

-- non-equality joins ARE allowed in ANSI SQL
-- but are NOT allowed in Hive (HQL)

SELECT a.*
FROM a
JOIN b ON (a.id <> b.id)

Note: Joins are quite different in MapReduce, more on that coming up…
Preparing for MapReduce

Loading Files into the File System
• Native file system, HDFS, or cloud storage
• Files are immutable once loaded
• Stored in large blocks (64 or 128 MB)

You Define Input, Map, Reduce, and Output
• Use Java or another programming language
• Work with key-value pairs
Common Hadoop Shell Commands

hadoop fs -cat file:///file2
hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
hadoop fs -copyFromLocal <fromDir> <toDir>
hadoop fs -put <localfile> hdfs://nn.example.com/hadoop/hadoopfile
sudo hadoop jar <jarFileName> <method> <fromDir> <toDir>
hadoop fs -ls /user/hadoop/dir1
hadoop fs -cat hdfs://nn1.example.com/file1
hadoop fs -get /user/hadoop/file <localfile>

Tips
-- ‘sudo’ means ‘run as administrator’ (super user)
-- some hadoop configurations use ‘hadoop dfs’ rather than ‘hadoop fs’ – file paths to hadoop differ for the former, see the link included for more detail
Demo – Working with Files and HDFS
Thinking in MapReduce

 Hint: “It’s Functional”


Understanding MapReduce – P1/3

 Map>>
   (K1, V1) →
     Info in
     Input Split
   list (K2, V2)
     Key / Value out (intermediate values)
   One list per local node
   Can implement a local Reducer (or Combiner)
Understanding MapReduce – P2/3

 Map>>
   (K1, V1) →
     Info in
     Input Split
   list (K2, V2)
     Key / Value out (intermediate values)
   One list per local node
   Can implement a local Reducer (or Combiner)
 Shuffle/Sort>>
Understanding MapReduce – P3/3

 Map>>
   (K1, V1) → list (K2, V2)
   Info in from the Input Split
   Key / Value out (intermediate values)
   One list per local node
   Can implement a local Reducer (or Combiner)
 Shuffle/Sort>>
   Precedes the Reduce phase
   Combines Map output into a list
 Reduce
   (K2, list(V2)) → list (K3, V3)
   Usually aggregates intermediate values

(input) <k1, v1> → map → <k2, v2> → combine → <k2, v2> → reduce → <k3, v3> (output)
MapReduce Example -
WordCount

Image from: http://blog.jteam.nl/wp-content/uploads/2009/08/MapReduceWordCountOverview1.png
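In code, the Map and Reduce steps pictured above look roughly like the following minimal Java sketch (MapReduce 2.0 API; class names are illustrative, not the exact demo code):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: (K1 = line offset, V1 = line of text) -> list(K2 = word, V2 = 1)
public class TokenizerMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE);  // emit intermediate (word, 1) pairs
    }
  }
}

// Reduce: (K2 = word, list(V2)) -> (K3 = word, V3 = total count)
class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();  // aggregate the intermediate values for this key
    }
    context.write(key, new IntWritable(sum));
  }
}

Because addition is associative, the same Reducer class can also serve as the local Combiner.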


MapReduce Objects

Master Node
• Name Node
• Job Tracker

Slave Nodes (1, 2, 3 …) – each runs
• Task Tracker
• Data Node

Each daemon spawns a new JVM


Ways to MapReduce

Libraries: HBase, Hive, Pig, Sqoop, Oozie, Mahout, others…
Languages: Java*, HiveQL (HQL), Pig Latin, Python, C#, JavaScript, R, more…

Note: Java is most common, but other languages can be used


Demo – Running MapReduce WordCount
Hadoop MapReduce
Fundamentals
@LynnLangit

a five part series – Part 3 of 5


Ways to run MapReduce Jobs

 Configure JobConf options (see the driver sketch below)
 From a development environment (IDE)
 From a GUI utility
   Cloudera – Hue
   Microsoft Azure – HDInsight console
 From the command line
   hadoop jar <filename.jar> input output
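For example, job options can be set in a driver class; a minimal Hadoop 2.x sketch, assuming the TokenizerMapper and IntSumReducer classes from the Part 2 WordCount sample:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);   // locates the jar to ship to the cluster
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir (must not exist yet)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, this is the class the hadoop jar command above invokes.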
Ways to MapReduce

Libraries: HBase, Hive, Pig, Sqoop, Oozie, Mahout, others…
Languages: Java*, HiveQL (HQL), Pig Latin, Python, C#, JavaScript, R, more…

Note: Java is most common, but other languages can be used


Setting up Hadoop On Windows Azure

 About HDInsight
Demo – MapReduce in the Cloud
 WordCount MapReduce using HDInsight
MapReduce (WordCount) with JavaScript

Note: JavaScript is part of the Azure Hadoop distribution
Common Data Sources for MapReduce Jobs

• Text files (semi-structured or unstructured), e.g. log files
• Statistical information – piles of numbers, often from scientific sources
• Clickstream – advertising, website traversals
• Geospatial information, e.g. cell phone activity
Where is your Data coming from?

 On premises
 Local file system
 Local HDFS instance
 Private Cloud
 Cloud storage
 Public Cloud
 Input Storage buckets
 Script / Code buckets
 Output buckets
Common Data Jobs for MapReduce

• Text Mining
• Index Building
• Graphs
• Patterns
• Filtering
• Prediction
• Risk Analysis
Demo – Other Types of MapReduce

Tip: Review the Java MapReduce code in these samples as well.


Methods to write MapReduce Jobs

 Typical – usually written in Java
   MapReduce 2.0 API
   MapReduce 1.0 API
   (see the API comparison sketch below)
 Streaming
   Uses stdin and stdout
   Can use any language to write Map and Reduce functions
   C#, Python, JavaScript, etc…
 Pipes
   Often used with C++
 Abstraction libraries
   Hive, Pig, etc… – write in a higher-level language that generates one or more MapReduce jobs
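The two Java APIs differ mainly in package and in how map receives its output channel; a side-by-side sketch (each class would live in its own file, bodies elided):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// MapReduce 1.0 API (org.apache.hadoop.mapred) - interface-based
public class OldApiMapper extends MapReduceBase
    implements org.apache.hadoop.mapred.Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // emit with output.collect(k, v)
  }
}

// MapReduce 2.0 API (org.apache.hadoop.mapreduce) - abstract-class-based
class NewApiMapper
    extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // emit with context.write(k, v)
  }
}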
Ways to MapReduce

Libraries: HBase, Hive, Pig, Sqoop, Oozie, Mahout, others…
Languages: Java*, HiveQL (HQL), Pig Latin, Python, C#, JavaScript, R, more…

Note: Java is most common, but other languages can be used


Demo – MapReduce via C# & PowerShell
Ways to MapReduce

Libraries: HBase, Hive, Pig, Sqoop, Oozie, Mahout, others…
Languages: Java*, HiveQL (HQL), Pig Latin, Python, C#, JavaScript, R, more…

Note: Java is most common, but other languages can be used


Using AWS MapReduce

Note: You can select Apache or MapR Hadoop Distributions to run your
MapReduce job on the AWS Cloud
What is Pig?

 ETL library for HDFS, developed at Yahoo
   Pig Runtime
   Pig Language (Pig Latin)
 Generates MapReduce jobs
 ETL steps
   LOAD <file>
   FILTER, JOIN, GROUP BY, FOREACH, GENERATE, COUNT…
   DUMP {to screen for testing}
   STORE <newFile>
MapReduce Python Sample

Remember that white space matters in Python!


Demo – Using AWS MapReduce with Pig

Note: You can select Apache or MapR Hadoop Distributions to run your
MapReduce job on the AWS Cloud
AWS Data Pipeline with HIVE
Hadoop MapReduce
Fundamentals
@LynnLangit

a five part series – Part 4 of 5


Better MapReduce – Optimizations

• Optimize BEFORE the job runs
• Optimize ONLOAD of the data
• Optimize the MAP phase of the job
• Optimize the SHUFFLE phase of the job
• Optimize the REDUCE phase of the job
• Optimize AFTER the job completes
Optimization BEFORE running a MapReduce Job

• File sizes
• Compression
• Encryption
• Pre-processing
• Chaining jobs
More about Input File Compression

 From Cloudera…
 Cloudera’s version of LZO is ‘splittable’

Type | File    | Size (GB) | Compress | Decompress
None | Log     | 8.0       | –        | –
Gzip | Log.gz  | 1.3       | 241      | 72
LZO  | Log.lzo | 2.0       | 55       | 35
Optimization WITHIN a MapReduce Job

Optimize the Mapper Task
• Define a custom Input Format
• Work with custom input data types
• Define a custom Partitioner
• Split the process out into other MapReduce jobs
• Add a local Reducer (Combiner)
• Can use Map-only jobs

Optimize the Reducer Task
• Split the process into multiple Reducer tasks
• Can configure reducer thresholds

(A driver-level sketch of some of these switches follows.)
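A hedged sketch of a few of these switches as driver code (assumes the WordCountDriver and IntSumReducer classes from earlier parts; the property names are the Hadoop 2.x ones – older clusters use mapred.compress.map.output):

// inside the driver's main(), before job.waitForCompletion(...)

// Add a local Reducer (Combiner) - valid here because summing is associative
job.setCombinerClass(IntSumReducer.class);

// For a Map-only job, set the number of Reducer tasks to zero
// job.setNumReduceTasks(0);

// Compress intermediate (map) output, e.g. with Snappy
Configuration jobConf = job.getConfiguration();
jobConf.setBoolean("mapreduce.map.output.compress", true);
jobConf.setClass("mapreduce.map.output.compress.codec",
    org.apache.hadoop.io.compress.SnappyCodec.class,
    org.apache.hadoop.io.compress.CompressionCodec.class);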
Mapper Task Optimization

Sub-dividing tasks
• Chaining jobs
• Aim for 1-3 minutes per map task run

Logging / Counters
• Can set to skip error(s)
• Debugging / unit testing

Custom Partitioner (see the sketch after this list)
• Default is the Hash Partitioner

Monitoring / tuning for the optimal spill ratio
• The optimal ratio depends on the type of Map task(s) being performed
• Goal: the number of spilled records EQUALS the number of map output records
• Can tune io.sort.spill.percent

Local (intermediate) compression
• Snappy
• LZO

Local Reducers (Combiner)
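A minimal custom Partitioner sketch (the class name and bucketing rule are illustrative); the default HashPartitioner simply uses key.hashCode() modulo the reducer count:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Route keys to reducers explicitly, e.g. by first letter,
// so related keys land on the same reducer
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String s = key.toString();
    char first = s.isEmpty() ? '_' : Character.toLowerCase(s.charAt(0));
    return (first & Integer.MAX_VALUE) % numPartitions;  // always in [0, numPartitions)
  }
}

Register it in the driver with job.setPartitionerClass(FirstLetterPartitioner.class). Beware of skew: a poor partition function can overload a single reducer.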
Data Types
 Writable
 Text (String)
 IntWritable
 LongWritable
 FloatWritable
 BooleanWritable
 WritableComparable for keys
 Custom Types supported – write RawComparator
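A sketch of a custom type (the record and its fields are illustrative): implement Writable for values, or WritableComparable so the type can also be sorted as a key:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Illustrative composite record: (userId, clicks)
public class ClickRecord implements WritableComparable<ClickRecord> {
  private long userId;
  private int clicks;

  public void write(DataOutput out) throws IOException {    // serialize
    out.writeLong(userId);
    out.writeInt(clicks);
  }

  public void readFields(DataInput in) throws IOException { // deserialize
    userId = in.readLong();
    clicks = in.readInt();
  }

  public int compareTo(ClickRecord other) {  // sort order when used as a key
    return Long.compare(userId, other.userId);
  }
}

For speed, a custom RawComparator can compare the serialized bytes directly, skipping deserialization during the sort.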
Reducer Task Optimization

• Secondary sort (see the wiring sketch below)
• Logging / debugging
• Sub-dividing tasks (chaining jobs)
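Secondary sort is wired up in the driver: partition on the natural key, sort on the full composite key, then group reducer input on the natural key again. A sketch with hypothetical classes (not from the course samples):

// all three classes below are hypothetical placeholders
job.setPartitionerClass(NaturalKeyPartitioner.class);
job.setSortComparatorClass(CompositeKeyComparator.class);
job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);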
MapReduce Job Optimization

• String manipulation can bottleneck – use StringBuffer.append rather than string concatenation (see the sketch below)
• Optimized Java processing – LongWritable or BytesWritable is faster than text parsing
• Distributed Cache (public or private)
• Add more nodes!
• Skipping bad records
• Debugging / unit testing
• Logging / profiling
• Sub-dividing tasks (chaining jobs)
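For example, inside a reduce method that assembles a long output value (a fragment-level sketch; assumes Iterable<Text> values and a Context, as in the earlier samples):

// Slow: each concatenation copies the whole string built so far
// String out = "";
// for (Text v : values) { out = out + "," + v.toString(); }

// Better: append into one buffer
StringBuilder out = new StringBuilder();
for (Text v : values) {
  out.append(v.toString()).append(',');
}
context.write(key, new Text(out.toString()));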
Demo – Unit Testing MapReduce
 Using MRUnit + Asserts
 Optionally using ApprovalTests

Image from http://c0de-x.com/wp-content/uploads/2012/10/staredad_english.png
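A minimal MRUnit sketch for the WordCount mapper (MRUnit 1.x, new-API MapDriver; class names assumed from the Part 2 sample):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class TokenizerMapperTest {
  private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

  @Before
  public void setUp() {
    mapDriver = MapDriver.newMapDriver(new TokenizerMapper());
  }

  @Test
  public void emitsOneCountPerWord() throws Exception {
    mapDriver.withInput(new LongWritable(0), new Text("cat cat"))
             .withOutput(new Text("cat"), new IntWritable(1))
             .withOutput(new Text("cat"), new IntWritable(1))
             .runTest();  // fails the test if actual output differs from expected
  }
}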


A note about MapReduce 2.0
 Splits the existing JobTracker’s roles
 resource management
 job lifecycle management
 MapReduce 2.0 provides many benefits over the existing
MapReduce framework, such as better scalability
 through distributed job lifecycle management
 support for multiple Hadoop MapReduce API versions in a single cluster
What is Mahout?
 Library with common machine learning algorithms
 Over 20 algorithms
 Recommendation (likelihood – Pandora)
 Classification (known data and new data – spam id)
 Clustering (new groups of similar data – Google news)
 Can non-statisticians find value using this library?
Mahout Algorithms
Setting up Hadoop on Windows

 For local development


 Install binaries via the Web Platform Installer
 Install .NET Azure SDK (for Azure BLOB storage)
 Install other tools
 Neudesic Azure Storage Viewer
Demo – Mahout
 Using HDInsight
What about the output?
Clients (Visualizations) for HDFS

 Many clients use Hive


 Often included in GUI console tools for Hadoop distributions as well
 Microsoft includes clients in Office (Excel 2013)
 Direct Hive client
 Connect using ODBC
 PowerPivot – data mashups and presentation
 Data Explorer – connect, transform, mashup and filter
 Hadoop SDK on Codeplex
 Other popular clients
 Qlikview
 Tableau
 Karmasphere
Demo – Executing Hive Queries
Demo – Using HDFS output in Excel 2013

To download Data Explorer:
http://www.microsoft.com/en-us/download/details.aspx?id=36803
About Visualization
Demo – New Visualizations – D3
Hadoop MapReduce
Fundamentals
@LynnLangit

a five part series – Part 5 of 5


Limitations of MapReduce

• Batch processing, not interactive
• Designed for a specific problem domain
• The MapReduce programming paradigm (functional) is not commonly understood
• Lack of trained support professionals
• API / security model are moving targets
Comparing: RDBMS vs. Hadoop

                      Traditional RDBMS            Hadoop / MapReduce
Data Size             Gigabytes (Terabytes)        Petabytes (Exabytes)
Access                Interactive and Batch        Batch – NOT Interactive
Updates               Read / Write many times      Write once, Read many times
Structure             Static Schema                Dynamic Schema
Integrity             High (ACID)                  Low
Scaling               Nonlinear                    Linear
Query Response Time   Can be near immediate        Has latency (due to batch processing)
Microsoft alternatives to MapReduce
 Use existing relational system
 Scale via cloud or edition (e.g. Enterprise or PDW)
 Use in memory OLAP
 SQL Server Analysis Services Tabular Models
 Use “productized” Dremel
 Microsoft Polybase – status = beta?
Looking Forward - Dremel or Apache Drill
 Based on original research from Google
Apache Drill Architecture
In-market MapReduce Alternatives

• Cloudera – Impala
• Google – BigQuery
Demo – Google’s BigQuery
 Dremel for the rest of us
Hadoop MapReduce Call to Action
More MapReduce Developer Resources

 Based on the distribution – on premises


 Apache
 MapReduce tutorial - http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html
 Cloudera
 Cloudera University - http://university.cloudera.com/
 Cloudera Developer Course (4 day) - *RECOMMENDED* -
http://university.cloudera.com/training/apache_hadoop/developer.html
 Hortonworks
 MapR
 Based on the distribution – cloud
 AWS MapReduce
 Tutorial - http://aws.amazon.com/elasticmapreduce/training/#gs
 Windows Azure HDInsight
 Tutorial - http://www.windowsazure.com/en-us/manage/services/hdinsight/using-mapreduce-with-hdinsight/
 More resources - http://www.windowsazure.com/en-us/develop/net/tutorials/intro-to-hadoop/
The Changing Data Landscape

• RDBMS
• Hadoop
• Other Services
