You are on page 1of 13

Big Data

05.03.2013
Paquete rmr2
R y Hadoop
2
rea
Razn Social
!"#$%&'($) +$),-")'(.%
What is Hadoop?
An open source project designed to support large scale data
processing
Inspired by Googles MapReduce-based computational
infrastructure
Comprised of several components
Hadoop Distributed File System (HDFS)
MapReduce processing framework, job scheduler, etc.
Ingest/outgest services (Sqoop, Flume, etc.)
Higher level languages and libraries (Hive, Pig, Cascading, Mahout)
Written in Java, first opened up to alternatives through its
Streaming API
! If your language of choice can handle stdin and stdout, you can use it
to write MapReduce jobs
9
3
rea
Razn Social
Introduction

Hadoop streaming enables the creation of mappers,


reducers, combiners, etc. in languages other than Java

Any language which can handle standard, text-based


input & output will do

Increasingly viewed as a lingua franca of statistics and


analytics, R is a natural match for Big Data-driven
analytics

As a result, a number of R packages to work with


Hadoop

Well take a quick look at some of them and then dive


into the details of the RHadoop package
Saturday, March 10, 2012
4
rea
Razn Social
!"#$%& () * +,-.,/%0 12,34,$4%
Pow many 8 ackages
are Lhere now?
AL Lhe command llne
enLer:
> dlm(avallable.packages())
Slide courtesy of John Versotek, organizer of the Boston Predictive Analytics Meetup
5
rea
Razn Social
!"#$ & '()* )+,- *"+* . ,) /0 123,)) 4(/0
5$,6#7 68( -+*+9 08: /,;"* "+<# =,>*:(#- *",)?
6
rea
Razn Social
!"# %&' (&" )%&' * '+, -.+//( #01%)1%2 #01,3
7
rea
Razn Social
!"#$%&'($) +$),-")'(.%
Enter RHadoop
RHadoop is an open source project sponsored by
Revolution Analytics
Package Overview
rmr2 - all MapReduce-related functions
rhdfs - interaction with Hadoops HDFS file system
rhbase - access to the NoSQL HBase database
rmr2 uses Hadoops Streaming API to allow R
users to write MapReduce jobs in R
handles all of the I/O and job submission for you (no
while(<stdin>)-like loops!)
11
8
rea
Razn Social
!"#$%&'($) +$),-")'(.%
RHadoop Advantages
Modular
Packages group similar functions
Only load (and learn!) what you need
Minimizes prerequisites and dependencies
Open Source
Cost: Low (no) barrier to start using
Transparency: Development, issue tracker, Wiki, etc. hosted on
github: https://github.com/RevolutionAnalytics/RHadoop/
Supported
Sponsored by Revolution Analytics
Training & professional services available
Support included with Revolution R Enterprise
12
9
rea
Razn Social
lapply(data, function)
mapreduce(big.data, map = function)
10
rea
Razn Social
mapreduce(input, output, map, reduce)
11
rea
Razn Social
x = from.dfs(hdfs.object)
hdfs.object = to.dfs(x)
12
rea
Razn Social
small.ints = 1:1000
lapply(small.ints, function(x) x^2)
small.ints = to.dfs(1:1000)
mapreduce(input = small.ints,
map = function(k,v) keyval(v, v^2))
groups = rbinom(32, n = 50, prob = 0.4)
tapply(groups, groups, length)
groups = to.dfs(groups)
mapreduce(input = groups,
map = function(k, v) keyval(v,1),
reduce = function(k,vv)
keyval(k, length(vv)))

You might also like