
Cascading

www.cascading.org
info@cascading.org
Wednesday, May 14, 2008
Design Goals
Make large processing jobs more transparent
Reusable processing components independent of resources
Incremental “data” builds
Simplify testing of processes
Scriptable from higher-level languages (Groovy, JRuby, Jython, etc.)



Cascading Introduction



Tuple Streams
Tuple
A set of ordered data, e.g. [“John”, “Doe”, 39]

Value Stream
Just tuples: [V1,V2,...,Vn], [V1,V2,...,Vn], ...

Group Stream
Tuples grouped by a key: [K1,K2,...,Kn] -> [V1,V2,...,Vn], [V1,V2,...,Vn], ...



Tuple Streams
Scalar functions and filters
Apply to value and group streams

Aggregate functions
Apply to group stream

Functions can be chained:
Source -> func -> Group -> aggr -> Sink



Stream Processing
Pipe Assemblies
A chain of scalar functions, groupings, and aggregate functions
Reusable, independent of data source/sink

Flows
Assemblies plus sources and sinks
[ S -> F -> F -> G -> A -> A -> S ]

Cascades
A collection of Flows



Processing Patterns
Chain
Source -> Group -> Sink, chained end to end

Splits
One Source fanned out to multiple Sinks

Joins
Multiple Sources joined through a Group into one Sink

Cross
Splits and joins combined in a single assembly



MapReduce Planner
Flows are logical ‘units of work’
Flows are ‘compiled’ into MapReduce Jobs
Intermediate files are created (and destroyed) to join Jobs
[diagram: a Flow planned into a chain of MapReduce Jobs, each with a Map and an optional Reduce]



Topological Scheduler

Flows walk MapReduce Jobs in dependency order


Cascades walk Flows in dependency order
Independent Jobs and Flows are scheduled to run concurrently
Listeners can react to element events, notifying completion or failures (see the sketch below)
Only stale data-sets are rebuilt (configurable)
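
A minimal Java sketch of reacting to those events, assuming the FlowListener interface and Flow.addListener found in later Cascading releases (the 2008-era API may differ):

import cascading.flow.Flow;
import cascading.flow.FlowListener;

public class LoggingFlowListener implements FlowListener {
  public void onStarting( Flow flow )  { System.out.println( "starting: "  + flow.getName() ); }
  public void onStopping( Flow flow )  { System.out.println( "stopping: "  + flow.getName() ); }
  public void onCompleted( Flow flow ) { System.out.println( "completed: " + flow.getName() ); }

  public boolean onThrowable( Flow flow, Throwable throwable ) {
    System.out.println( "failed: " + flow.getName() + " " + throwable );
    return false; // false: the exception was not handled here, so the Flow still fails
  }
}

// attach before running: flow.addListener( new LoggingFlowListener() );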



Scripting - Groovy
Flow flow = builder.flow("wordcount") {
  source(input, scheme: text()) // input is filename of raw text document

  tokenize(/[.,]*\s+/) // output new tuple for each split, result replaces stream by default
  group() // group on stream
  count() // count values in group, creates 'count' field by default
  group(["count"], reverse: true) // group/sort on 'count', reverse the sort order

  sink(output)
}

flow.complete() // execute, block till completed


System Integration
FileSystems (unique to Cascading)
Raw file S3 reading/writing (MD5)
Raw file HTTP reading (MD5)
Zip files
Can bypass native Hadoop ‘collectors’
Event notification via listeners (XMPP/SQS/Zookeeper notifications)
Groovy scripting for easier local shell/file operations (wget, scp, etc.)



Cascading API & Internals



Core Concepts
Taps and Schemes

Tuples and Fields

Pipes and PipeAssemblies

Each and Every Operators

Groups

Flows, FlowSteps, and FlowConnectors

Cascades and CascadeConnectors (optional)



Taps and Schemes

Taps, abstract out where and how a data resource is accessed


hdfs, http, local, S3, etc
Taps, used as Tuple (data) stream sinks, sources, or both
Schemes, define what a resource is made of
text lines, SequenceFile, CSV, etc
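
A minimal Java sketch of a Tap bound to a Scheme, in the style of later Cascading 1.x releases (constructors in the 2008-era API may differ); the paths and field names are placeholders:

import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;

// Scheme: what the resource is made of (here, lines of text)
Scheme sourceScheme = new TextLine( new Fields( "offset", "line" ) );
Scheme sinkScheme = new TextLine( new Fields( "word", "count" ) );

// Tap: where and how the resource is accessed (here, HDFS or local paths)
Tap source = new Hfs( sourceScheme, "some/input/path" );
Tap sink = new Hfs( sinkScheme, "some/output/path", SinkMode.REPLACE );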



Tuples and Fields

Tuples are the ‘records’, read from Tap sources, written to Tap sinks
Fields are the ‘column names’, sourced from Schemes
Tuple class, an ordered collection of Comparable values
(“a string”, 1.0, new SomeComparableWritable())
Fields class, a list of field names, absolute or relative positions
(“total”, 3, -1) // fields ‘total’, 4th position, last position
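
A minimal Java sketch of the Tuple and Fields classes above, assuming the Cascading 1.x API; the values are illustrative only:

import cascading.tuple.Fields;
import cascading.tuple.Tuple;
import cascading.tuple.TupleEntry;

Fields fields = new Fields( "first", "last", "age" ); // the ‘column names’
Tuple tuple = new Tuple( "John", "Doe", 39 );         // the ordered, Comparable values

TupleEntry entry = new TupleEntry( fields, tuple );   // binds names to positions
String last = entry.getString( "last" );              // access by field name
int age = tuple.getInteger( 2 );                      // access by absolute position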



Pipes and PipeAssemblies
Tuple streams pass through Pipes to be processed
Pipes, apply functions, filters, and aggregators to the Tuple stream
Pipe instances are chained together into assemblies
Reusable assemblies are subclasses of class PipeAssembly
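
A minimal Java sketch of a reusable assembly, assuming the PipeAssembly base class named above (renamed SubAssembly in later releases) and the stock RegexFilter operation; the class and field names are illustrative:

import cascading.operation.regex.RegexFilter;
import cascading.pipe.Each;
import cascading.pipe.Pipe;
import cascading.pipe.PipeAssembly;
import cascading.tuple.Fields;

public class DropComments extends PipeAssembly {
  public DropComments( Pipe previous ) {
    // chain an Each onto the incoming pipe, removing lines that start with '#'
    Pipe pipe = new Each( previous, new Fields( "line" ), new RegexFilter( "^#.*", true ) );

    setTails( pipe ); // register the tail so this assembly can be reused like any Pipe
  }
}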



Group Class and Subclasses
Group, subclass of Pipe, groups the Tuple stream on given fields
GroupBy and CoGroup subclass Group
GroupBy groups and sorts
CoGroup performs joins
[diagram: GroupBy grouping a single Tuple stream; CoGroup joining two Tuple streams]
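
A minimal Java sketch of both subclasses, assuming the Cascading 1.x constructors; somePipe, lhsPipe, rhsPipe, and the field names stand for illustrative upstream pipes and fields:

import cascading.pipe.CoGroup;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.pipe.cogroup.InnerJoin;
import cascading.tuple.Fields;

// GroupBy: group on "url", secondary-sort each group on "time"
Pipe grouped = new GroupBy( somePipe, new Fields( "url" ), new Fields( "time" ) );

// CoGroup: inner join two streams on "url"; declaredFields must rename the
// incoming fields so all output field names are unique
Pipe joined = new CoGroup( lhsPipe, new Fields( "url" ),
                           rhsPipe, new Fields( "url" ),
                           new Fields( "url", "count", "url2", "score" ),
                           new InnerJoin() );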



Each and Every Classes
Each, subclass of Pipe, applies Functions and Filters to each Tuple instance
(a,b,c) -> Each( func() ) -> (a,b,c,d)
Every, subclass of Pipe, applies Aggregators to every Tuple group
(a: b,c) -> Every( agg()) -> (a,d: b,c)
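
A minimal Java sketch pairing an Each with a Function and an Every with an Aggregator, assuming the stock Cascading 1.x operations; the regex and field names are illustrative:

import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.tuple.Fields;

Pipe pipe = new Pipe( "wordcount" );

// Each: apply a Function to each Tuple, emitting one "word" Tuple per split
pipe = new Each( pipe, new Fields( "line" ), new RegexSplitGenerator( new Fields( "word" ), "\\s+" ) );

pipe = new GroupBy( pipe, new Fields( "word" ) );

// Every: apply an Aggregator to every group, adding a "count" field
pipe = new Every( pipe, new Count( new Fields( "count" ) ) );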



Flows and FlowConnectors

Flows encapsulate assemblies and sink and source Taps


FlowConnectors connect assemblies and Taps into Flows

[diagram: a Flow composed of FlowSteps, each reading from Taps, applying Each/Group/Every pipes, and writing to Taps]
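
A minimal Java sketch of connecting an assembly and Taps into a Flow, assuming the Cascading 1.x FlowConnector; source, sink, and pipe stand for the Taps and assembly sketched on the earlier slides:

import cascading.flow.Flow;
import cascading.flow.FlowConnector;

FlowConnector connector = new FlowConnector();

// bind the source and sink Taps to the head and tail of the assembly
Flow flow = connector.connect( "wordcount", source, sink, pipe );

flow.complete(); // plan into FlowSteps, submit the jobs, block until done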



FlowSteps and FlowConnectors
Internally, FlowConnectors ‘compile’ assemblies into FlowSteps
FlowSteps are MapReduce jobs, which are executed in Topo order
Temporary files are created to link FlowSteps
[diagram: each FlowStep is a MapReduce job, with pipe operations stacked into its Map and Reduce phases]



Cascades and CascadeConnectors
Are optional
Cascades bind Flows together via shared Taps
CascadeConnectors connect Flows
Flows are executed in Topo order
[diagram: Flows linked into a Cascade through shared Taps]
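
A minimal Java sketch, assuming the Cascading 1.x CascadeConnector; firstFlow and secondFlow stand for Flows that share a Tap:

import cascading.cascade.Cascade;
import cascading.cascade.CascadeConnector;

// the connector discovers dependencies between the Flows via their shared Taps
Cascade cascade = new CascadeConnector().connect( firstFlow, secondFlow );

cascade.complete(); // runs the Flows in dependency (topological) order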



Syntax
Each( previous, argSelector, function/filter, resultSelector )
Every( previous, argSelector, aggregator, resultSelector )
GroupBy( previous, groupSelector, sortSelector )
CoGroup( joinN, joiner, declaredFields )
Function( numArgs, declaredFields, .... )
Filter (numArgs, ... )
Aggregator( numArgs, declaredFields, ... )
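
A minimal Java sketch of the argument and result selectors in the signatures above, assuming the Cascading 1.x Fields constants; previous, the regex, and the field names are illustrative:

import cascading.operation.regex.RegexParser;
import cascading.pipe.Each;
import cascading.pipe.Pipe;
import cascading.tuple.Fields;

// argSelector picks "line" from the incoming Tuple as the Function's argument;
// resultSelector Fields.RESULTS keeps only the fields declared by the Function
Pipe pipe = new Each( previous,
                      new Fields( "line" ),                                  // argSelector
                      new RegexParser( new Fields( "ip" ), "^([^ ]*) .*$" ), // function
                      Fields.RESULTS );                                      // resultSelector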

