
Another Intro to Hadoop

Fridays@5
Context Optional
April 2, 2010
By Adeel Ahmad
About Me
Follow me on Twitter @_adeel

The AI Show podcast: www.aishow.org


Artificial intelligence news every week.

Senior App Genius at Context Optional


We're hiring Ruby developers. Contact me!
Too much data

User-generated content, social networks, logging and
tracking

Google, Yahoo and others need to index the entire
internet and return search results in milliseconds

NYSE generates 1 TB of data/day

Facebook has 400 terabytes of stored data and
ingests 20 terabytes of new data per day; it hosts
approx. 10 billion photos, about 1 petabyte (2009)
Can't scale

Challenge to both store and analyze datasets

Slow to process

Unreliable machines (CPUs and disks can go down)

Not affordable (faster, more reliable machines are
expensive)
Solve it through software

Split up the data

Run jobs in parallel

Sort and combine to get the answer

Schedule across arbitrarily-sized cluster

Handle fault-tolerance

Since even the best systems break down, use cheap
commodity computers
Enter Hadoop

Open-source Apache project written in Java

MapReduce implementation for parallelizing
applications

Distributed filesystem for redundant data

Many other sub-projects

Meant for cheap, heterogeneous hardware

Scale up by simply adding more cheap hardware
History

Open-source Apache project

Grew out of the Apache Nutch project, an open-source
search engine

Two Google papers

MapReduce (2004): programming model for parallel
processing

Google File System (2003): fault-tolerant storage of
large amounts of data
MapReduce
Operates exclusively on <key, value> pairs

Split the input data into independent chunks

Processed by the map tasks in parallel

Sort the outputs of the maps

Send to the reduce tasks

Write to output files
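
To make the data flow concrete, here is a toy single-process sketch in Ruby. It is purely illustrative (Hadoop runs these phases as distributed tasks across a cluster, and all names here are made up):

# Toy sketch of the MapReduce flow in one Ruby process
input = ["the quick brown fox", "the lazy dog"]

# Map: turn each record into <key, value> pairs (word, 1)
pairs = input.flat_map { |line| line.split.map { |word| [word, 1] } }

# Shuffle/sort: group pairs by key (the framework does this between phases)
grouped = pairs.group_by { |word, _| word }

# Reduce: combine all values for each key
counts = grouped.map { |word, ones| [word, ones.map { |_, n| n }.reduce(:+)] }

counts.sort.each { |word, count| puts "#{word}\t#{count}" }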
HDFS

Hadoop Distributed File System

Files split into large blocks (64 MB by default)

Designed for streaming reads and appending writes,
not random access

3 replicas for each piece of data by default

Data can be stored in encoded or archived formats
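
As a minimal sketch, day-to-day interaction goes through the hadoop fs shell, and the replication factor is a cluster setting in hdfs-site.xml (paths and usernames below are made up):

hadoop fs -put weblogs.txt /user/adeel/weblogs.txt  # upload; split into blocks
hadoop fs -ls /user/adeel                           # list a directory
hadoop fs -cat /user/adeel/weblogs.txt              # stream a file back out

<!-- hdfs-site.xml: 3 replicas is the default -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>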
Self-managing and self-healing

Bring the computation as physically close to the data
as possible for best bandwidth, instead of copying
data

Tries to use same node, then same rack, then same
data center

Auto-replication if data is lost

Auto-kill and restart of tasks on another node if
they take too long or act flaky
Hadoop Streaming

Don't need to write mappers and reducers in Java

Text-based API that exposes stdin and stdout

Use any language

Ruby gems: Wukong, Mandy
Example: Word count
# mapper.rb -- emits "word<TAB>count" pairs for each input line
STDIN.each_line do |line|
  word_count = {}
  line.split.each do |word|
    word_count[word] ||= 0
    word_count[word] += 1
  end
  word_count.each do |k, v|
    puts "#{k}\t#{v}"
  end
end

# reducer.rb -- input arrives sorted by word; sums the counts per word
word = nil
count = 0
STDIN.each_line do |line|
  wordx, countx = line.strip.split
  if wordx != word
    puts "#{word}\t#{count}" unless word.nil?
    word = wordx
    count = 0
  end
  count += countx.to_i
end
puts "#{word}\t#{count}" unless word.nil?
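
Because Streaming is just stdin/stdout, the pair can be sanity-checked locally with a shell pipeline (sort stands in for Hadoop's shuffle) before submitting; the streaming jar path varies by Hadoop version and is an assumption here:

cat input.txt | ruby mapper.rb | sort | ruby reducer.rb

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
  -input /user/adeel/input -output /user/adeel/output \
  -mapper mapper.rb -reducer reducer.rb \
  -file mapper.rb -file reducer.rb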
Who Uses Hadoop?

Yahoo
Facebook
Netflix
eHarmony
LinkedIn
NY Times
Digg
Flightcaster
RapLeaf
Trulia
Last.fm
Ning
CNET
Lots more...
Developing With Hadoop

Don't need a whole cluster to start

Standalone
– Non-distributed
– Single Java process

Pseudo-distributed
– Just like fully distributed
– Components run in separate processes

Fully distributed
– Now you need a real cluster
How to Run Hadoop

Linux, OSX, Windows, Solaris

Just need Java, SSH access to nodes

XML config files


Download core Hadoop

Can do everything we mentioned

Still needs the user to edit config files and
write scripts
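
For instance, the usual quickstart setup for pseudo-distributed mode comes down to two properties (0.20-era property names; the localhost ports are the conventional defaults, shown here as an assumption):

<!-- conf/core-site.xml -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>

<!-- conf/mapred-site.xml -->
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
</property>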
How to Run Hadoop

Cloudera Inc. provides its own distribution of Hadoop,
plus enterprise support and training

Core Hadoop plus patches

Bundled with command-line scripts, Hive, Pig

Publishes AMIs and scripts for EC2

Best option for your own cluster
How to Run Hadoop

Amazon Elastic MapReduce (EMR)

GUI or command-line cluster management

Supports Streaming, Hive, Pig

Grabs data and MapReduce code from S3 buckets and
puts it into HDFS

Auto-shutdown of EC2 instances

Cloudera now has scripts for EMR

Easiest option
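
With the elastic-mapreduce command-line client, launching a Streaming job looks roughly like this (bucket and script names are placeholders):

elastic-mapreduce --create --stream \
  --input s3n://my-bucket/input \
  --output s3n://my-bucket/output \
  --mapper s3n://my-bucket/mapper.rb \
  --reducer s3n://my-bucket/reducer.rb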
Pig

High-level scripting language developed by Yahoo

Describes multi-step jobs

Translated into MapReduce tasks

Grunt command-line interface
Ex: Find top 5 most visited pages by users aged 18 to 25
Users = LOAD 'users' AS (name, age);
Filtered = FILTER Users BY age >= 18 AND age <= 25;
Pages = LOAD 'pages' AS (user, url);
Joined = JOIN Filtered BY name, Pages BY user;
Grouped = GROUP Joined BY url;
Summed = FOREACH Grouped GENERATE group, COUNT(Joined) AS clicks;
Sorted = ORDER Summed BY clicks DESC;
Top5 = LIMIT Sorted 5;
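
Saved to a file, the script runs through the same pig command that starts Grunt (the filename is a placeholder):

pig -x local top5.pig   # -x local runs without a cluster; omit it to use Hadoop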
Hive

High-level interface created by Facebook

Gives db-like structure to data

HiveQL declarative language for querying

Queries get turned into MapReduce jobs

Command-line interface
Ex:
CREATE TABLE raw_daily_stats_table (dates STRING, ..., pageviews STRING);
LOAD DATA INPATH 'finaloutput' INTO TABLE raw_daily_stats_table;
SELECT … FROM … JOIN ...
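
Statements can be typed at the interactive hive prompt or run from a script file (the filename is a placeholder):

hive -f daily_stats.hql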
Mahout

Machine-learning libraries for Hadoop
– Collaborative filtering
– Clustering
– Frequent pattern mining
– Genetic algorithms
Applications
– Product/friend recommendation
– Classify content into defined groups
– Find associations, patterns, behaviors
– Identify important topics in conversations
More stuff

HBase – database based on Google's Bigtable

Sqoop – database import tool

ZooKeeper – coordination service for distributed
apps; keeps track of servers through a filesystem-like interface

Avro – data serialization system

Scribe – logging system developed by Facebook
