Hadoop MapReduce Fundamentals
@LynnLangit
What is Hadoop MapReduce?
Why developers may choose to use the MapReduce API
Using MapReduce
• Setting up development
• Understanding MapReduce Jobs
• WordCount via different Language, Tools and Editors
• Other MapReduce jobs via different Language, Tools and Editors
Optimizing MapReduce
• Job Optimization
• Optimizing Mappers
• Optimizing Reducers
Comparing MapReduce to other technologies
• Future directions
What is Hadoop?
Open-source data storage and processing API
Massively scalable, automatically parallelizable
Based on work from Google
GFS + MapReduce + BigTable
Current Distributions based on Open Source and Vendor Work
Apache Hadoop
Cloudera – CDH4 w/ Impala
Hortonworks
MapR
AWS
Windows Azure HDInsight
Why Use Hadoop?
Cheaper: Scales to Petabytes or more
Faster: Parallel data processing
Better: Suited for particular types of BigData problems
What types of business problems for Hadoop?
Ad Targeting
Point of Sale Transactional Analysis
Threat Analysis
Search Quality
Trade Surveillance
Data Sandbox
eBay
American Airlines
IBM
Orbitz
Forecast growth of Hadoop Job Market
Monitoring
Greenplum, Cloudera
What are the core parts of a Hadoop distribution?
HDFS Storage
• Redundant (3 copies)
• For large files – large blocks
• 64 or 128 MB / block
• Can scale to 1000s of nodes
MapReduce API
• Batch (Job) processing
• Distributed and Localized to clusters (Map)
• Auto-Parallelizable for huge amounts of data
• Fault-tolerant (auto retries)
• Adds high availability and more
Other Libraries
• Pig
• Hive
• HBase
• Others
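The block size and replication figures above translate into simple arithmetic. The following is a minimal sketch, assuming a 128 MB block size and 3 copies as defaults; the function name `hdfs_footprint` is hypothetical and is not part of any Hadoop API.

```python
import math

MB = 1024 * 1024

def hdfs_footprint(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Return (number of blocks, raw bytes stored across the cluster)."""
    # A file is cut into fixed-size blocks; the last block may be partly filled.
    blocks = math.ceil(file_size_bytes / block_size_bytes)
    # Every block is stored `replication` times on different nodes.
    raw_bytes = file_size_bytes * replication
    return blocks, raw_bytes

blocks, raw = hdfs_footprint(1024 * MB)
print(blocks, raw // MB)  # 8 blocks, 3072 MB of raw storage
```

With a 64 MB block size the same 1 GB file would occupy 16 blocks, which is why large files and large blocks go together.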
Hadoop Cluster HDFS (Physical) Storage
One Name Node
Open Source
Apache
Commercial
Cloudera
Hortonworks
MapR
AWS MapReduce
Microsoft HDInsight (Beta)
A View of Hadoop (from Hortonworks)
Hadoop Binaries
• Data Storage
• MapReduce
• Other Libraries & Tools
Cloud
• AWS
• Microsoft (Beta)
• Others
Demo – Setting up Cloudera Hadoop
Libraries: HBase, Hive, Pig, Sqoop, Oozie, Mahout, Others…
Languages: Java*, HiveQL (HQL), Pig Latin, Python, C#, JavaScript, R, More…
SELECT a.*
FROM a
JOIN b ON (a.id = b.id)
Note: Joins are quite different in MapReduce, more on that coming up…
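To make the note concrete, here is a minimal sketch of a reduce-side equality join on `id`, simulated in plain Python rather than with the Hadoop API. The tables `a` and `b`, the function names, and the in-memory "shuffle" are all assumptions made for illustration.

```python
from collections import defaultdict

def map_side(table_name, rows):
    """Map: emit (join key, (source table, row)) for every row."""
    for row in rows:
        yield row["id"], (table_name, row)

def reduce_side(key, tagged_rows):
    """Reduce: pair every 'a' row with every 'b' row that shares this key."""
    a_rows = [row for table, row in tagged_rows if table == "a"]
    b_rows = [row for table, row in tagged_rows if table == "b"]
    for a_row in a_rows:
        for _ in b_rows:
            yield a_row  # only the columns of a are kept (SELECT a.*)

def join(a, b):
    groups = defaultdict(list)  # stands in for the shuffle/sort phase
    for key, value in list(map_side("a", a)) + list(map_side("b", b)):
        groups[key].append(value)
    joined = []
    for key in sorted(groups):
        joined.extend(reduce_side(key, groups[key]))
    return joined

a = [{"id": 1, "name": "x"}, {"id": 2, "name": "y"}]
b = [{"id": 2}, {"id": 2}, {"id": 3}]
print(join(a, b))
```

The join happens only because rows with the same key meet in the same reduce call, which is why the work looks so different from a relational join.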
Preparing for MapReduce
128 MB
Cloud
Work with key-value pairs
Common Hadoop Shell Commands
Tips
-- ‘sudo’ means ‘run as administrator’ (superuser)
-- Some Hadoop configurations use ‘hadoop dfs’ rather than ‘hadoop fs’; file paths to Hadoop differ for the former (see the link included for more detail)
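As a rough illustration of the tips above, the sketch below only builds and prints a few common HDFS shell command lines; it does not execute them. The helper `hadoop_fs`, the paths, and the file name `words.txt` are hypothetical.

```python
def hadoop_fs(*args, use_dfs=False, sudo_user=None):
    """Build the argument list for one HDFS shell command."""
    # Some configurations use 'hadoop dfs' rather than 'hadoop fs'.
    cmd = ["hadoop", "dfs" if use_dfs else "fs", *args]
    if sudo_user:
        # 'sudo -u <user>' runs the command as another (often privileged) user.
        cmd = ["sudo", "-u", sudo_user, *cmd]
    return cmd

commands = [
    hadoop_fs("-mkdir", "/user/demo/input"),             # create a directory
    hadoop_fs("-put", "words.txt", "/user/demo/input"),  # copy a local file in
    hadoop_fs("-ls", "/user/demo/input"),                # list the directory
    hadoop_fs("-cat", "/user/demo/input/words.txt"),     # print file contents
]

for cmd in commands:
    print(" ".join(cmd))
```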
Demo – Working with Files and HDFS
Thinking in MapReduce
Map>>
• (K1, V1): Info in Input Split
• list (K2, V2): Key / Value out (intermediate values)
• One list per local node
• Can implement local Reducer (or Combiner)
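The effect of a local Reducer (Combiner) can be sketched as follows. This is a plain Python simulation of one node, not Hadoop code; the function names and the sample lines are assumptions.

```python
from collections import Counter

def map_words(line):
    """Map: one input line in, list of (word, 1) pairs out."""
    return [(word, 1) for word in line.lower().split()]

def combine(pairs):
    """Combiner: a local reduce that sums the counts per key on one node."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())

node_input = ["the cat and the hat", "the end"]
mapped = [pair for line in node_input for pair in map_words(line)]
combined = combine(mapped)

# Fewer pairs leave the node after combining than the mapper produced.
print(len(mapped), len(combined))
```

Because the combiner collapses repeated keys before the shuffle, less intermediate data has to cross the network.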
Understanding MapReduce – P2/3
Map>> Shuffle/Sort>>
Understanding MapReduce – P3/3
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
MapReduce Example -
WordCount
[Diagram: a Master Node and Slave Nodes 1, 2 and 3; Slave Nodes 2 and 3 are each labeled with a Task Tracker and a Data Node]
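WordCount can be walked through end to end with a small in-memory simulation of map, combine, shuffle/sort and reduce. This is a sketch in plain Python under the assumption that each inner list is one input split; it does not use the Hadoop API, and the name `run_wordcount` is hypothetical.

```python
from collections import defaultdict
from itertools import groupby

def mapper(_offset, line):
    """Map: <k1, v1> = (line offset, line) -> <k2, v2> = (word, 1)."""
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    """Reduce: <k2, list(v2)> -> <k3, v3> = (word, total)."""
    yield word, sum(counts)

def run_wordcount(splits):
    shuffled = []
    for split in splits:  # one map task per input split
        local = defaultdict(list)
        for offset, line in enumerate(split):
            for k2, v2 in mapper(offset, line):
                local[k2].append(v2)
        for k2, values in local.items():  # combine: the reducer reused locally
            shuffled.extend(reducer(k2, values))
    shuffled.sort(key=lambda kv: kv[0])  # shuffle/sort: bring equal keys together
    result = {}
    for k2, group in groupby(shuffled, key=lambda kv: kv[0]):
        for k3, v3 in reducer(k2, (value for _, value in group)):
            result[k3] = v3
    return result

splits = [["Hello World Bye World"], ["Hello Hadoop Goodbye Hadoop"]]
print(run_wordcount(splits))
```

Reusing the reducer as the combiner works here because addition is associative and commutative; that is not true of every reduce function.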
About HDInsight
Demo – MapReduce in the Cloud
WordCount MapReduce using HDInsight
MapReduce (WordCount) with JavaScript
Note: JavaScript is part of the Azure Hadoop distribution
Common Data Sources for MapReduce Jobs
Clickstream – advertising, website traversals
Geospatial information – e.g. cell phone activity
Where is your Data coming from?
On premises
• Local file system
• Local HDFS instance
Private Cloud
• Cloud storage
Public Cloud
• Input Storage buckets
• Script / Code buckets
• Output buckets
Common Data Jobs for MapReduce
Text Mining
Index Building
Graphs
Risk Analysis
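Index building is a job type that maps naturally onto key-value pairs. The sketch below builds a small inverted index in plain Python; the document ids, the sample text and the function names are assumptions, and the dictionary grouping stands in for Hadoop's shuffle/sort.

```python
from collections import defaultdict

def map_doc(doc_id, text):
    """Map: emit (word, doc_id) once per distinct word in a document."""
    for word in set(text.lower().split()):
        yield word, doc_id

def reduce_postings(word, doc_ids):
    """Reduce: collect the sorted list of documents that contain the word."""
    return word, sorted(doc_ids)

def build_index(docs):
    groups = defaultdict(list)  # stands in for the shuffle/sort phase
    for doc_id, text in docs.items():
        for word, found_in in map_doc(doc_id, text):
            groups[word].append(found_in)
    return dict(reduce_postings(word, ids) for word, ids in sorted(groups.items()))

docs = {"d1": "hadoop stores data", "d2": "hadoop processes data fast"}
index = build_index(docs)
print(index["hadoop"], index["fast"])
```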
Demo – Other Types of MapReduce
Note: You can select Apache or MapR Hadoop Distributions to run your
MapReduce job on the AWS Cloud
What is Pig?
AWS Data Pipeline with Hive
Hadoop MapReduce
Fundamentals
@LynnLangit
Optimize BEFORE the Job runs
Optimize ONLOAD of the Data
Optimize the MAP phase of the Job
Optimize the SHUFFLE phase of the Job
Optimize the REDUCE phase of the Job
Optimize AFTER the Job completes
Optimization BEFORE running a MapReduce Job
Chaining jobs
Pre-processing
Encryption
Compression
File sizes
More about Input File Compression
From Cloudera…
Their version of LZO is ‘splittable’
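Why a ‘splittable’ format matters can be shown with a simplified model: a file that cannot be split has to be read by a single mapper, while a splittable one can be divided among many. The function `map_tasks` is hypothetical, and the assumption that one split equals one 128 MB block is a simplification.

```python
import math

MB = 1024 * 1024

def map_tasks(file_size_bytes, splittable, split_size_bytes=128 * MB):
    """Estimate how many map tasks can work on one input file in parallel."""
    if not splittable:
        return 1  # the whole compressed file goes to a single mapper
    return max(1, math.ceil(file_size_bytes / split_size_bytes))

one_gb = 1024 * MB
print(map_tasks(one_gb, splittable=True))   # 8
print(map_tasks(one_gb, splittable=False))  # 1
```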
Secondary Sort
Logging / Debugging
Sub-dividing tasks (chaining jobs)
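The idea behind a secondary sort can be sketched in plain Python: sort on a composite key, then group on the natural key only, so each reduce call sees its values already in order. The function name, the sample records and the in-memory sort are assumptions; no Hadoop classes are used.

```python
from itertools import groupby

def secondary_sort(records):
    """records: iterable of (key, value). Returns {key: values in ascending order}."""
    # Map: move the value into a composite key (natural key, value).
    composite = [((key, value), value) for key, value in records]
    # Sort step: order by the full composite key.
    composite.sort(key=lambda kv: kv[0])
    # Grouping step: group by the natural key only, keeping the sorted order.
    grouped = groupby(composite, key=lambda kv: kv[0][0])
    return {key: [value for _, value in group] for key, group in grouped}

readings = [("nyc", 30), ("sf", 12), ("nyc", 25), ("sf", 18), ("nyc", 28)]
print(secondary_sort(readings))
```

The reducer never has to buffer and sort its own values, which is the point of pushing the ordering into the sort phase.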
MapReduce Job Optimization
• String manipulation can bottleneck: use StringBuffer.append vs. string concatenation
• Distributed Cache (public or private)
• Optimized Java processing: LongWritable or BytesWritable faster than text parsing
• Add more nodes!
• Skipping bad records
• Debugging / Unit Testing
• Logging / Profiling
• Sub-dividing tasks (chaining jobs)
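One simple way to skip bad records is to catch parse errors in the map step and count them instead of failing the task. The sketch below assumes a hypothetical `store,amount` line format and an in-memory counter named `bad_records`; it is plain Python and does not show Hadoop's own facilities for this.

```python
from collections import Counter

counters = Counter()

def parse_sale(line):
    """Map: 'store,amount' -> (store, amount); malformed lines are counted and skipped."""
    try:
        store, amount = line.split(",")
        return store.strip(), float(amount)
    except ValueError:
        counters["bad_records"] += 1  # record the problem, keep the job running
        return None

lines = ["north,10.5", "garbage", "south,abc", "north,4.5"]
totals = Counter()
for line in lines:
    parsed = parse_sale(line)
    if parsed is not None:
        store, amount = parsed
        totals[store] += amount

print(dict(totals), counters["bad_records"])
```

Keeping a count of skipped records makes it possible to notice when the "bad" share grows large enough to indicate a real data problem.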
Demo – Unit Testing MapReduce
Using MRUnit + Asserts
Optionally using ApprovalTests
MapReduce
• Batch processing, not interactive
• Designed for a specific problem domain
• Programming paradigm not commonly understood (functional)
• Lack of trained support professionals
• API / security model are moving targets
Comparing: RDBMS vs. Hadoop
Updates: RDBMS = Read / Write many times; Hadoop = Write once, Read many times
Query Response Time: RDBMS = Can be near immediate; Hadoop = Has latency (due to batch processing)
Microsoft alternatives to MapReduce
Use existing relational system
Scale via cloud or edition (i.e. Enterprise or PDW)
Use in memory OLAP
SQL Server Analysis Services Tabular Models
Use “productized” Dremel
Microsoft Polybase – status = beta?
Looking Forward - Dremel or Apache Drill
Based on original research from Google
Apache Drill Architecture
In-market MapReduce Alternatives
Cloudera Impala
Google BigQuery
Demo – Google’s BigQuery
Dremel for the rest of us
Hadoop MapReduce Call to Action
More MapReduce Developer Resources
Other Services
RDBMS
Hadoop