Hadoop MapReduce
Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system.
The framework takes care of scheduling tasks, monitoring them, and re-executing the failed tasks.
Job Specification
Minimally, applications specify the input/output locations and
supply map and reduce functions via implementations of appropriate
interfaces and/or abstract-classes.
These, and other job parameters, comprise the job configuration.
The Hadoop job client then submits the job (jar/executable, etc.) and configuration to the JobTracker, which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks, monitoring them, and providing status and diagnostic information to the job client.
Although the Hadoop framework is implemented in Java™, MapReduce applications need not be written in Java.
Hadoop Streaming is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer.
Hadoop Pipes is a SWIG-compatible C++ API to implement MapReduce applications (non-JNI™ based).
Input Output
The MapReduce framework operates exclusively on <key, value> pairs. That is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
Input and Output types of a MapReduce job:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
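The classic word count illustrates this type chain. The following is a minimal in-memory sketch of what the framework does, not Hadoop API code; all function and variable names are illustrative:

```python
from collections import defaultdict

def map_fn(docid, text):
    # (k1, v1) = (docid, text) -> list of (k2, v2) = (word, 1)
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    # (k2, [v2, ...]) -> (k3, v3) = (word, total count)
    return (word, sum(counts))

def run_job(inputs):
    # Simulate the framework: map, shuffle/sort by key, then reduce.
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

result = run_job([(1, "a b a"), (2, "b c")])
```

Here the shuffle is simulated with a dictionary; in Hadoop the framework performs this grouping and sorting between the map and reduce phases.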
[Figure: MapReduce architecture — Clients submit jobs through a UI to the JobTracker, which accepts MR jobs, assigns tasks to the TaskTracker slaves, monitors tasks, polls status information, and handles failures; each TaskTracker runs tasks; the Namenode and an HTTP monitoring UI are also shown.]
MapReduce: http://hadoop.apache.org/mapreduce/
Streaming: http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
Pipes: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/pipes/packagesummary.html
Pig: http://hadoop.apache.org/pig/
Hive: http://hive.apache.org
TF-IDF
Intuitively, the relevancy of a document to a term can be calculated from the term frequency: the fraction of the words in the document that are occurrences of that term.
On the other hand, if this is a very common term which appears in many other documents, then its relevancy should be reduced. This is captured by the document frequency: the count of documents having this term divided by the total number of documents.
The overall relevancy of a document with respect to a term can be computed using both the term frequency and the document frequency:
relevancy = tf-idf = term frequency * log(1 / document frequency)
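A small worked sketch of this formula; the corpus and helper name are made up for illustration:

```python
import math

def tf_idf(term, doc, corpus):
    # term frequency: fraction of the words in doc that are the term
    words = doc.split()
    tf = words.count(term) / len(words)
    # document frequency: fraction of docs in the corpus containing the term
    df = sum(term in d.split() for d in corpus) / len(corpus)
    # relevancy = tf * log(1 / df)
    return tf * math.log(1 / df)

corpus = ["hadoop map reduce", "map shuffle reduce", "hadoop map hdfs"]
score = tf_idf("hadoop", corpus[0], corpus)
```

Note that a term appearing in every document gets log(1/1) = 0, i.e. zero relevancy, exactly as the intuition above demands.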
Collating
Problem Statement:
There is a set of items and some function f of one item. It is required to save all items that have the same value of f into one file, or to perform some other computation that requires all such items to be processed as a group. The most typical example is the building of inverted indexes.
Collating
Solution:
The Mapper computes the given function for each item and emits the value of the function as a key and the item itself as a value.
class Mapper
   method Map(docid id, doc d)
      for all term t in doc d do
         Emit(term f(t), t)
(A related example: each Mapper runs a simulation for a specified amount of data, 1/Nth of the required sampling, and emits its error rate; the Reducer computes the average error rate.)
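The inverted-index case can be sketched in a few lines, with a dictionary simulating the shuffle. Here f maps a term occurrence to the term itself, and the docid serves as the payload; all names are illustrative:

```python
from collections import defaultdict

def invert(docs):
    # Mapper: for each term t in doc d, emit (f(t), payload) --
    # here f(t) = t and the payload is the docid.
    pairs = []
    for docid, text in docs.items():
        for term in text.split():
            pairs.append((term, docid))
    # Shuffle + Reducer: group all docids that share the same term.
    index = defaultdict(set)
    for term, docid in pairs:
        index[term].add(docid)
    return {t: sorted(ids) for t, ids in index.items()}

index = invert({1: "big data", 2: "big cluster"})
```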
Sorting
Problem Statement:
There is a set of records and it is required to sort these
records by some rule or process these records in a certain
order.
Sorting
Solution:
Mappers just emit all items as values associated with sorting keys that are assembled as a function of the items.
Nevertheless, in practice sorting is often used in quite tricky ways; that's why it is said to be the heart of MapReduce (and Hadoop).
In particular, it is very common to use composite keys to achieve secondary sorting and grouping.
Sorting in MapReduce is originally intended for sorting the emitted key-value pairs by key, but there exist techniques that leverage Hadoop implementation specifics to achieve sorting by values.
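Secondary sorting with a composite key can be simulated in memory as follows; the record layout and field names are hypothetical:

```python
from itertools import groupby

# Each record is (user, timestamp, event).
records = [("bob", 3, "click"), ("ann", 2, "view"),
           ("bob", 1, "view"), ("ann", 5, "click")]

# Composite key = (natural key, secondary field). Sorting by it simulates
# the shuffle; grouping by the natural key alone simulates the reducer's
# grouping, so each user's events arrive already ordered by timestamp.
records.sort(key=lambda r: (r[0], r[1]))
grouped = {user: [(ts, ev) for _, ts, ev in rows]
           for user, rows in groupby(records, key=lambda r: r[0])}
```

In Hadoop this split is achieved with a custom partitioner (partition by the natural key) and a grouping comparator, while the framework's sort handles the full composite key.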
Example
Cross-Correlation
Problem Statement:
There is a set of tuples of items. For each possible pair of items, calculate the number of tuples where these items co-occur. If the total number of items is N, then N*N values should be reported.
This problem appears in:
text analysis (say, items are words and tuples are sentences),
market analysis (customers who buy this tend to also buy that).
Suppose a store sells N different products and has data on B customer purchases (each purchase is called a market basket).
Cross-Correlation
Solution (pairs approach):
class Mapper
   method Map(null, basket = [i1, i2,...])
      for all item i in basket
         for all item j in basket
            Emit(pair [i j], count 1)
class Reducer
   method Reduce(pair [i j], counts [c1, c2,...])
      s = sum([c1, c2,...])
      Emit(pair [i j], count s)
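The pairs pseudocode corresponds to roughly the following in-memory sketch, with a Counter standing in for the shuffle and reduce phases. Unlike the pseudocode above, this sketch skips self-pairs, a common refinement:

```python
from collections import Counter

def pairs_cooccurrence(baskets):
    # Mapper: emit ([i, j], 1) for every ordered pair in each basket;
    # the Counter plays the role of shuffle + reduce (summing the 1s).
    counts = Counter()
    for basket in baskets:
        for i in basket:
            for j in basket:
                if i != j:  # skip self-pairs (refinement, see lead-in)
                    counts[(i, j)] += 1
    return counts

co = pairs_cooccurrence([["milk", "bread"], ["milk", "bread", "eggs"]])
```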
Cross-Correlation
Solution (stripes approach):
class Mapper
   method Map(null, basket = [i1, i2,...])
      for all item i in basket
         H = new AssociativeArray : item -> counter
         for all item j in basket
            H{j} = H{j} + 1
         Emit(item i, stripe H)
class Reducer
   method Reduce(item i, stripes [H1, H2,...])
      H = new AssociativeArray : item -> counter
      H = merge-sum( [H1, H2,...] )
      for all item j in H.keys()
         Emit(pair [i j], H{j})
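The stripes variant can be sketched the same way; merge-sum becomes Counter.update, names are illustrative, and self-pairs are again skipped:

```python
from collections import Counter, defaultdict

def stripes_cooccurrence(baskets):
    # Mapper: per item i, build a stripe H mapping co-occurring items
    # in the same basket to their counts.
    stripes = []
    for basket in baskets:
        for i in basket:
            H = Counter(j for j in basket if j != i)
            stripes.append((i, H))
    # Reducer: merge-sum all stripes for the same item i,
    # then unfold each merged stripe back into (pair, count) records.
    merged = defaultdict(Counter)
    for i, H in stripes:
        merged[i].update(H)  # element-wise sum = merge-sum
    return {(i, j): c for i, H in merged.items() for j, c in H.items()}

co = stripes_cooccurrence([["milk", "bread"], ["milk", "bread", "eggs"]])
```

Stripes emit far fewer, larger records than pairs, which typically reduces shuffle traffic at the cost of holding a per-item associative array in memory.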
Selection
Solution:
class Mapper
   method Map(rowkey key, tuple t)
      if t satisfies the predicate
         Emit(tuple t, null)
Projection
Solution:
class Mapper
   method Map(rowkey key, tuple t)
      tuple g = project(t) // extract required fields to tuple g
      Emit(tuple g, null)
class Reducer
   method Reduce(tuple t, array n) // n is an array of nulls
      Emit(tuple t, null)
Union
Solution:
class Mapper
   method Map(rowkey key, tuple t)
      Emit(tuple t, null)
class Reducer
   method Reduce(tuple t, array n) // n is an array of one or two nulls
      Emit(tuple t, null)
Intersection
Solution:
class Mapper
   method Map(rowkey key, tuple t)
      Emit(tuple t, null)
class Reducer
   method Reduce(tuple t, array n) // n is an array of one or more nulls
      if n.size() >= 2
         Emit(tuple t, null)
Difference
Solution:
// We want to compute the difference R - S.
// Mapper emits all tuples with a tag, which is the name of the set this record came from.
// Reducer emits only records that came from R but not from S.
class Mapper
   method Map(rowkey key, tuple t)
      Emit(tuple t, string t.SetName) // t.SetName is either 'R' or 'S'
class Reducer
   method Reduce(tuple t, array n) // array n can be ['R'], ['S'], ['R', 'S'], or ['S', 'R']
      if n.size() = 1 and n[1] = 'R'
         Emit(tuple t, null)
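The difference pattern can be simulated in memory as follows; a dict of tag-sets stands in for the shuffle, and names follow the pseudocode:

```python
def difference(R, S):
    # Mapper: tag each tuple with the name of the set it came from.
    tagged = [(t, "R") for t in R] + [(t, "S") for t in S]
    # Shuffle: group the tags by tuple.
    tags = {}
    for t, name in tagged:
        tags.setdefault(t, set()).add(name)
    # Reducer: keep only tuples tagged with 'R' alone.
    return {t for t, names in tags.items() if names == {"R"}}

diff = difference({("a", 1), ("b", 2)}, {("b", 2), ("c", 3)})
```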
Join
A JOIN is a means for combining fields from two tables by using values common to each.
ANSI-standard SQL specifies five types of JOIN: INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER, and CROSS.
As a special case, a table (base table, view, or joined table) can JOIN to itself in a self-join.
Join algorithms
Nested loop
Sort-Merge
Hash
SELECT *
FROM employee
INNER JOIN department ON employee.DepartmentID = department.DepartmentID;
SELECT *
FROM employee
JOIN department ON employee.DepartmentID = department.DepartmentID;
SELECT *
FROM employee
LEFT OUTER JOIN department ON employee.DepartmentID = department.DepartmentID;
Nested loop
For each tuple p in P, scan all of Q and add <p,q> to the result when the join condition holds.
Sort-Merge
Cursors: p over P; q, g over Q (both inputs sorted on the join attributes a and b)
while more tuples in inputs do
   while p.a < g.b do
      advance p
   end while
   while p.a > g.b do
      advance g // a group might begin here
   end while
   while p.a == g.b do
      q = g // mark group beginning
      while p.a == q.b do
         add <p,q> to the result
         advance q
      end while
      advance p // move forward
   end while
   g = q // candidate to begin next group
end while
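The merge logic above maps to roughly the following single-machine sketch; list indices replace the cursors, and the key-extraction parameter is an assumption for illustration:

```python
def sort_merge_join(P, Q, key=lambda t: t[0]):
    # Both inputs must be sorted on the join key.
    P, Q = sorted(P, key=key), sorted(Q, key=key)
    out, p, g = [], 0, 0
    while p < len(P) and g < len(Q):
        if key(P[p]) < key(Q[g]):
            p += 1                       # advance p
        elif key(P[p]) > key(Q[g]):
            g += 1                       # advance g: a group might begin here
        else:
            q = g                        # mark group beginning in Q
            while q < len(Q) and key(P[p]) == key(Q[q]):
                out.append(P[p] + Q[q])  # add <p, q> to the result
                q += 1                   # advance q
            p += 1                       # advance p; rescan the Q-group for it
    return out

joined = sort_merge_join([(1, "a"), (2, "b"), (2, "c")],
                         [(2, "x"), (2, "y"), (3, "z")])
```

Keeping g at the group start lets each subsequent P-tuple with the same key rescan the whole Q-group, which is exactly the role of the g cursor in the pseudocode.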
Sort-Merge : MapReduce
This algorithm joins two sets R and L on some key k.
The Mapper goes through all tuples from R and L, extracts key k from each tuple, marks the tuple with a tag that indicates the set this tuple came from (R or L), and emits the tagged tuple using k as a key.
The Reducer receives all tuples for a particular key k and puts them into two buckets: one for R and one for L. When the two buckets are filled, the Reducer runs a nested loop over them and emits a cross join of the buckets. Each emitted tuple is a concatenation of an R-tuple, an L-tuple, and the key k.
This approach has the following disadvantages:
The Mapper emits absolutely all data, even for keys that occur only in one set and have no pair in the other.
The Reducer should hold all data for one key in memory. If the data doesn't fit in memory, it is the Reducer's responsibility to handle this by some kind of swap.
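Despite these disadvantages, this reduce-side (repartition) join is easy to sketch; an in-memory dict stands in for the shuffle, and all names are illustrative:

```python
from collections import defaultdict

def repartition_join(R, L, key=lambda t: t[0]):
    # Map phase: tag each tuple with its source set and emit under key k.
    shuffled = defaultdict(list)
    for t in R:
        shuffled[key(t)].append(("R", t))
    for t in L:
        shuffled[key(t)].append(("L", t))
    # Reduce phase: split each key's tuples into two buckets,
    # then emit the cross join of the buckets.
    out = []
    for k, tagged in shuffled.items():
        r_bucket = [t for tag, t in tagged if tag == "R"]
        l_bucket = [t for tag, t in tagged if tag == "L"]
        for r in r_bucket:
            for l in l_bucket:
                out.append((k, r, l))  # concatenation of key, R-tuple, L-tuple
    return out

joined = repartition_join([(1, "r1"), (2, "r2")], [(2, "l1"), (2, "l2")])
```

Note how key 1, which has no partner in L, still travels through the shuffle only to produce nothing: the first disadvantage above.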
Hash Join
The Hash Join algorithm consists of a build phase and a probe phase.
In its simplest variant, the smaller dataset is loaded into an in-memory hash
table in the build phase.
In the probe phase, the larger dataset is scanned and joined with the
relevant tuple(s) by looking into the hash table.
for all p in P do
   load p into in-memory hash table H
end for
for all q in Q do
   if H contains p matching with q then
      add <p,q> to the result
   end if
end for
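A single-machine sketch of the build and probe phases, assuming P is the smaller input; names are illustrative:

```python
from collections import defaultdict

def hash_join(P, Q, key=lambda t: t[0]):
    # Build phase: load the smaller input P into an in-memory hash table,
    # keeping a list per key to allow duplicate join keys.
    H = defaultdict(list)
    for p in P:
        H[key(p)].append(p)
    # Probe phase: scan the larger input Q and look up matching P-tuples.
    return [p + q for q in Q for p in H.get(key(q), [])]

joined = hash_join([(1, "a"), (2, "b")], [(2, "x"), (2, "y"), (3, "z")])
```

Hash join needs no sorting, but the build side must fit in memory; when it does not, variants such as partitioned (grace) hash join split both inputs by hash first.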
End of session
Day 2: MapReduce Concepts