
Cascading

www.cascading.org
info@cascading.org
Wednesday, May 14, 2008
Design Goals
Make large processing jobs more transparent
Reusable processing components independent of resources
Incremental “data” builds
Simplify testing of processes
Scriptable from higher-level languages (Groovy, JRuby, Jython, etc.)



Cascading Introduction



Tuple Streams
Tuple
A set of ordered data, e.g. [“John”, “Doe”, 39]

Value Stream
Just tuples: [V1,V2,...,Vn], [V1,V2,...,Vn], ...

Group Stream
Tuples grouped by a key: [K1,K2,...,Kn] -> [V1,V2,...,Vn], [V1,V2,...,Vn], ...



Tuple Streams
Scalar functions and filters
Apply to value and group streams

Aggregate functions
Apply to group stream

Functions can be chained:
Source -> func -> Group -> aggr -> Sink



Stream Processing
Pipe Assemblies
A chain of scalar functions, groupings, and aggregate functions
Reusable, independent of data source/sink

Flows
Assemblies plus sources and sinks
[ S -> F -> F -> G -> A -> A -> S ]

Cascades
A collection of Flows



Processing Patterns
Chain
Source -> Group -> Sink, chained end to end

Splits
One Source fanned out to multiple Sinks

Joins
Multiple Sources joined through a Group into one Sink

Cross
Splits and joins combined in a single assembly



MapReduce Planner
Flows are logical ‘units of work’
Flows are ‘compiled’ into MapReduce Jobs
Intermediate files are created (and destroyed) to join Jobs
[diagram: a Flow planned into a chain of MapReduce Jobs, each with a Map and an optional Reduce]



Topological Scheduler

Flows walk MapReduce Jobs in dependency order


Cascades walk Flows in dependency order
Independent Jobs and Flows are scheduled to run concurrently
Listeners can react to element events, notifying completion or failures (see the sketch below)
Only stale data-sets are rebuilt (configurable)
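
A minimal Java sketch of reacting to those events, assuming the FlowListener interface and Flow.addListener found in later Cascading releases (the 2008-era API may differ):

import cascading.flow.Flow;
import cascading.flow.FlowListener;

public class LoggingFlowListener implements FlowListener {
  public void onStarting( Flow flow )  { System.out.println( "starting: "  + flow.getName() ); }
  public void onStopping( Flow flow )  { System.out.println( "stopping: "  + flow.getName() ); }
  public void onCompleted( Flow flow ) { System.out.println( "completed: " + flow.getName() ); }

  public boolean onThrowable( Flow flow, Throwable throwable ) {
    System.out.println( "failed: " + flow.getName() + " " + throwable );
    return false; // false: the exception was not handled here, so the Flow still fails
  }
}

// attach before running: flow.addListener( new LoggingFlowListener() );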



Scripting - Groovy
Flow flow = builder.flow("wordcount") {
  source(input, scheme: text()) // input is filename of raw text document

  tokenize(/[.,]*\s+/) // output new tuple for each split, result replaces stream by default
  group() // group on stream
  count() // count values in group, creates 'count' field by default
  group(["count"], reverse: true) // group/sort on 'count', reverse the sort order

  sink(output)
}

flow.complete() // execute, block till completed


System Integration
FileSystems (unique to Cascading)
Raw file S3 reading/writing (MD5)
Raw file HTTP reading (MD5)
Zip files
Can bypass native Hadoop ‘collectors’
Event notification via listeners (XMPP/SQS/Zookeeper notifications)
Groovy scripting for easier local shell/file operations (wget, scp, etc.)



Cascading API & Internals



Core Concepts
Taps and Schemes

Tuples and Fields

Pipes and PipeAssemblies

Each and Every Operators

Groups

Flows, FlowSteps, and FlowConnectors

Cascades and CascadeConnectors (optional)



Taps and Schemes

Taps, abstract out where and how a data resource is accessed


hdfs, http, local, S3, etc
Taps, used as Tuple (data) stream sinks, sources, or both
Schemes, define what a resource is made of
text lines, SequenceFile, CSV, etc
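
A minimal Java sketch of a Tap bound to a Scheme, in the style of later Cascading 1.x releases (constructors in the 2008-era API may differ); the paths and field names are placeholders:

import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;

// Scheme: what the resource is made of (here, lines of text)
Scheme sourceScheme = new TextLine( new Fields( "offset", "line" ) );
Scheme sinkScheme = new TextLine( new Fields( "word", "count" ) );

// Tap: where and how the resource is accessed (here, HDFS or local paths)
Tap source = new Hfs( sourceScheme, "some/input/path" );
Tap sink = new Hfs( sinkScheme, "some/output/path", SinkMode.REPLACE );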



Tuples and Fields

Tuples are the ‘records’, read from Tap sources, written to Tap sinks
Fields are the ‘column names’, sourced from Schemes
Tuple class, an ordered collection of Comparable values
(“a string”, 1.0, new SomeComparableWritable())
Fields class, a list of field names, absolute or relative positions
(“total”, 3, -1) // fields ‘total’, 4th position, last position
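
A minimal Java sketch of the Tuple and Fields classes above, assuming the Cascading 1.x API; the values are illustrative only:

import cascading.tuple.Fields;
import cascading.tuple.Tuple;
import cascading.tuple.TupleEntry;

Fields fields = new Fields( "first", "last", "age" ); // the ‘column names’
Tuple tuple = new Tuple( "John", "Doe", 39 );         // the ordered, Comparable values

TupleEntry entry = new TupleEntry( fields, tuple );   // binds names to positions
String last = entry.getString( "last" );              // access by field name
int age = tuple.getInteger( 2 );                      // access by absolute position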



Pipes and PipeAssemblies
Tuple streams pass through Pipes to be processed
Pipes, apply functions, filters, and aggregators to the Tuple stream
Pipe instances are chained together into assemblies
Reusable assemblies are subclasses of class PipeAssembly
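
A minimal Java sketch of a reusable assembly, assuming the PipeAssembly base class named above (renamed SubAssembly in later releases) and the stock RegexFilter operation; the class and field names are illustrative:

import cascading.operation.regex.RegexFilter;
import cascading.pipe.Each;
import cascading.pipe.Pipe;
import cascading.pipe.PipeAssembly;
import cascading.tuple.Fields;

public class DropComments extends PipeAssembly {
  public DropComments( Pipe previous ) {
    // chain an Each onto the incoming pipe, removing lines that start with '#'
    Pipe pipe = new Each( previous, new Fields( "line" ), new RegexFilter( "^#.*", true ) );

    setTails( pipe ); // register the tail so this assembly can be reused like any Pipe
  }
}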



Group Class and Subclasses
Group, subclass of Pipe, groups the Tuple stream on given fields
GroupBy and CoGroup subclass Group
GroupBy groups and sorts
CoGroup performs joins
[diagram: GroupBy grouping a single Tuple stream; CoGroup joining two Tuple streams]
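
A minimal Java sketch of both subclasses, assuming the Cascading 1.x constructors; somePipe, lhsPipe, rhsPipe, and the field names stand for illustrative upstream pipes and fields:

import cascading.pipe.CoGroup;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.pipe.cogroup.InnerJoin;
import cascading.tuple.Fields;

// GroupBy: group on "url", secondary-sort each group on "time"
Pipe grouped = new GroupBy( somePipe, new Fields( "url" ), new Fields( "time" ) );

// CoGroup: inner join two streams on "url"; declaredFields must rename the
// incoming fields so all output field names are unique
Pipe joined = new CoGroup( lhsPipe, new Fields( "url" ),
                           rhsPipe, new Fields( "url" ),
                           new Fields( "url", "count", "url2", "score" ),
                           new InnerJoin() );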



Each and Every Classes
Each, subclass of Pipe, applies Functions and Filters to each Tuple instance
(a,b,c) -> Each( func() ) -> (a,b,c,d)
Every, subclass of Pipe, applies Aggregators to every Tuple group
(a: b,c) -> Every( agg()) -> (a,d: b,c)
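
A minimal Java sketch pairing an Each with a Function and an Every with an Aggregator, assuming the stock Cascading 1.x operations; the regex and field names are illustrative:

import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.tuple.Fields;

Pipe pipe = new Pipe( "wordcount" );

// Each: apply a Function to each Tuple, emitting one "word" Tuple per split
pipe = new Each( pipe, new Fields( "line" ), new RegexSplitGenerator( new Fields( "word" ), "\\s+" ) );

pipe = new GroupBy( pipe, new Fields( "word" ) );

// Every: apply an Aggregator to every group, adding a "count" field
pipe = new Every( pipe, new Count( new Fields( "count" ) ) );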



Flows and FlowConnectors

Flows encapsulate assemblies and sink and source Taps


FlowConnectors connect assemblies and Taps into Flows

[diagram: a Flow composed of FlowSteps, each reading from Taps, applying Each/Group/Every pipes, and writing to Taps]
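
A minimal Java sketch of connecting an assembly and Taps into a Flow, assuming the Cascading 1.x FlowConnector; source, sink, and pipe stand for the Taps and assembly sketched on the earlier slides:

import cascading.flow.Flow;
import cascading.flow.FlowConnector;

FlowConnector connector = new FlowConnector();

// bind the source and sink Taps to the head and tail of the assembly
Flow flow = connector.connect( "wordcount", source, sink, pipe );

flow.complete(); // plan into FlowSteps, submit the jobs, block until done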



FlowSteps and FlowConnectors
Internally, FlowConnectors ‘compile’ assemblies into FlowSteps
FlowSteps are MapReduce jobs, which are executed in Topo order
Temporary files are created to link FlowSteps
[diagram: each FlowStep is a MapReduce job, with pipe operations stacked into its Map and Reduce phases]



Cascades and CascadeConnectors
Are optional
Cascades bind Flows together via shared Taps
CascadeConnectors connect Flows
Flows are executed in Topo order
[diagram: Flows linked into a Cascade through shared Taps]
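
A minimal Java sketch, assuming the Cascading 1.x CascadeConnector; firstFlow and secondFlow stand for Flows that share a Tap:

import cascading.cascade.Cascade;
import cascading.cascade.CascadeConnector;

// the connector discovers dependencies between the Flows via their shared Taps
Cascade cascade = new CascadeConnector().connect( firstFlow, secondFlow );

cascade.complete(); // runs the Flows in dependency (topological) order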



Syntax
Each( previous, argSelector, function/filter, resultSelector )
Every( previous, argSelector, aggregator, resultSelector )
GroupBy( previous, groupSelector, sortSelector )
CoGroup( joinN, joiner, declaredFields )
Function( numArgs, declaredFields, .... )
Filter (numArgs, ... )
Aggregator( numArgs, declaredFields, ... )
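
A minimal Java sketch of the argument and result selectors in the signatures above, assuming the Cascading 1.x Fields constants; previous, the regex, and the field names are illustrative:

import cascading.operation.regex.RegexParser;
import cascading.pipe.Each;
import cascading.pipe.Pipe;
import cascading.tuple.Fields;

// argSelector picks "line" from the incoming Tuple as the Function's argument;
// resultSelector Fields.RESULTS keeps only the fields declared by the Function
Pipe pipe = new Each( previous,
                      new Fields( "line" ),                                  // argSelector
                      new RegexParser( new Fields( "ip" ), "^([^ ]*) .*$" ), // function
                      Fields.RESULTS );                                      // resultSelector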

