Agenda
- The problem
- Storm - Basic concepts
- Trident
- Deploying
- Some other features
- Contributions
Heavy workload
Do it Realtime!
(Diagram: the usual architecture, queues feeding workers, those workers feeding more queues and more workers, and finally a DB. Complex to build and maintain.)
What is Storm?
Storm is a highly distributed realtime computation system. It provides general primitives for doing realtime computation, can be used with any programming language, and is scalable and fault-tolerant.
(Diagram: a Spout reads tweets and emits them into the topology as tuples.)
(Diagram: the tweet stream enters through a Spout; each tweet goes to a HashtagsSplitter bolt, which emits hashtags to several HashtagsCounterBolt instances.)
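A minimal sketch of what HashtagsSplitter could look like, assuming the pre-Apache backtype.storm package names of this era and a Tweet domain class with a getHashtags() method (illustrative, not the talk's actual code):

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Reads a tweet from each incoming tuple and emits one tuple per hashtag.
public class HashtagsSplitter extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        // "Tweet" is an assumed domain class, not part of Storm.
        Tweet tweet = (Tweet) tuple.getValueByField("tweet");
        for (String hashtag : tweet.getHashtags())
            collector.emit(new Values(hashtag));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("hashtag"));
    }
}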
Walkthrough: a tweet with #hello and #world arrives; the splitter emits both hashtags, and the counter bolts apply incr(#hello) and incr(#world), leaving #hello = 1 and #world = 1. A second tweet, "This tweet has #bye #world hashtags", yields incr(#bye) and incr(#world), leaving #hello = 1, #world = 2, #bye = 1.
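The counter from this walkthrough could be sketched as a bolt keeping an in-memory map (same assumptions as above):

import java.util.HashMap;
import java.util.Map;

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;

// Increments an in-memory counter per hashtag.
public class HashtagsCounterBolt extends BaseBasicBolt {
    private final Map<String, Long> counts = new HashMap<String, Long>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String hashtag = tuple.getStringByField("hashtag");
        Long count = counts.get(hashtag);
        counts.put(hashtag, count == null ? 1L : count + 1); // incr(#hashtag)
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Terminal bolt: no output stream.
    }
}

Wiring the pieces with a fieldsGrouping on "hashtag" guarantees every occurrence of a given hashtag reaches the same HashtagsCounterBolt instance, so the per-instance maps hold consistent counts (component ids here are hypothetical):

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("tweets", new ApiStreamingSpout());
builder.setBolt("splitter", new HashtagsSplitter()).shuffleGrouping("tweets");
builder.setBolt("counter", new HashtagsCounterBolt())
       .fieldsGrouping("splitter", new Fields("hashtag"));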
Guaranteed message processing: when a bolt finishes processing a tuple, it acknowledges it:

_collector.ack(tuple);

(Diagram: the ack flows back from HashtagsSplitter and the HashtagsCounterBolts to the Spout.)

If a tuple fails or times out before being fully acked, the spout is notified through:

void fail(Object msgId);

(Diagram: the failed tweet is replayed from the Spout through HashtagsSplitter to the HashtagsCounterBolts.)
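A hedged sketch of an explicitly acking bolt, anchoring each emit to the input tuple so Storm can track the tuple tree (same package and Tweet-class assumptions as before):

import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Splitter variant that acks or fails each tuple explicitly.
public class ReliableHashtagsSplitter extends BaseRichBolt {
    private OutputCollector _collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        _collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        try {
            Tweet tweet = (Tweet) tuple.getValueByField("tweet"); // assumed domain class
            for (String hashtag : tweet.getHashtags())
                _collector.emit(tuple, new Values(hashtag)); // anchored emit
            _collector.ack(tuple);  // tuple fully processed
        } catch (Exception e) {
            _collector.fail(tuple); // triggers the spout's fail(msgId) and a replay
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("hashtag"));
    }
}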
Trident
Trident is a high-level abstraction on top of Storm that makes it easier to build topologies, similar to what Pig is for Hadoop.
Trident - Topology
(Diagram: tweet stream → Spout → HashtagsSplitter → Memory.)
Trident - Stream
(Diagram: the tweet stream enters the Trident topology through a Spout.)
Stream hashtagsStream = tweetsStream.each(new Fields("tweet"),
    new BaseFunction() {
        public void execute(TridentTuple tuple, TridentCollector collector) {
            Tweet tweet = (Tweet) tuple.getValueByField("tweet");
            for (String hashtag : tweet.getHashtags())
                collector.emit(new Values(hashtag));
        }
    }, new Fields("hashtag"));
(Diagram: tweet stream → Spout → HashtagsSplitter.)
Trident - States

TridentState hashtagCounts = hashtagsStream
    .groupBy(new Fields("hashtag"))
    .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));

(Diagram: tweet stream → Spout → HashtagsSplitter → Memory (State).)
Putting it all together:

TridentState hashtags = topology
    .newStream("tweets", new ApiStreamingSpout())
    .each(new Fields("tweet"), new HashTagSplitter(), new Fields("hashtag"))
    .groupBy(new Fields("hashtag"))
    .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));
Transactions
- Exactly-once message processing semantics.
- Intermediate processing may be executed in parallel.
- But persistence must be strictly sequential.
- Spouts and persistence must be designed to be transactional (see the sketch below).
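One common way to make persistence transactional is to store the transaction id alongside each value: a replayed batch arrives with the same txid, the update is detected as already applied, and the count is incremented exactly once. This also shows why persistence must be sequential: the scheme only works if batches reach the store in txid order. A minimal sketch of the idea, not Trident's actual State API (all names here are hypothetical):

import java.util.HashMap;
import java.util.Map;

// Hypothetical store illustrating txid-based idempotent updates.
public class TransactionalCountStore {
    private static class Entry {
        long txid = -1; // last transaction applied to this key
        long count = 0;
    }

    private final Map<String, Entry> store = new HashMap<String, Entry>();

    // Applies the increment only if this txid has not been applied yet,
    // so a replayed batch does not double count.
    public void incr(String hashtag, long txid, long delta) {
        Entry e = store.get(hashtag);
        if (e == null) {
            e = new Entry();
            store.put(hashtag, e);
        }
        if (e.txid != txid) {
            e.count += delta;
            e.txid = txid;
        }
    }
}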
Trident - Queries
We want to know, online, the count of tweets with the #hadoop hashtag and the count of tweets with the #strataconf hashtag.
Trident - DRPC
Trident - Query
(Diagram: a DRPC Server feeds the query arguments into a DRPC stream, which flows through a Hashtag Splitter.)
Stream drpcCountAggregate = hashtagsCountQuery
    .groupBy(new Fields("hashtag"))
    .aggregate(new Fields("count"), new Sum(), new Fields("sum"));

(Diagram: DRPC stream → Hashtag Splitter → Aggregate Count.)
topology.newDRPCStream("counts", drpc)
    .each(new Fields("args"), new Split(), new Fields("hashtag"))
    .stateQuery(hashtags, new Fields("hashtag"), new MapGet(), new Fields("count"))
    .each(new Fields("count"), new FilterNull())
    .groupBy(new Fields("hashtag"))
    .aggregate(new Fields("count"), new Sum(), new Fields("sum"));
(Diagram: the DRPC Server sends the arguments through the DRPC stream; the Hashtag Splitter and state query feed Aggregate Count, whose result is returned to the DRPC Server, which hands it back to the caller of execute.)
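In development the query can be exercised in-process with LocalDRPC (a minimal sketch; "counts" is the function name registered with newDRPCStream above, and the topology is assumed to have been built with this drpc instance and submitted):

import backtype.storm.LocalDRPC;

// In-process DRPC server for testing; no separate DRPC daemon needed.
LocalDRPC drpc = new LocalDRPC();
// ... build the topology above with topology.newDRPCStream("counts", drpc)
// and submit it to a LocalCluster ...

// Blocks until the topology computes and returns the aggregated counts.
String result = drpc.execute("counts", "hadoop strataconf");
System.out.println(result);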
Running Topologies
Local mode:
- The whole topology runs on a single machine.
- Behaves much like running on a production cluster.
- Very useful for testing and development.
- Configurable parallelism.
- Exposes methods similar to StormSubmitter's (see the sketch below).
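A minimal sketch of local mode, assuming the pre-Apache backtype.storm package names and a hypothetical buildTopology() helper that wires the spout and bolts shown earlier:

import backtype.storm.Config;
import backtype.storm.LocalCluster;

// Runs the whole topology inside the current JVM.
Config conf = new Config();
conf.setDebug(true); // log every tuple that is emitted

LocalCluster cluster = new LocalCluster();
cluster.submitTopology("hashtags", conf, buildTopology()); // hypothetical helper

Thread.sleep(60000); // let the topology run for a minute
cluster.shutdown();

Submitting to a real cluster swaps LocalCluster for StormSubmitter.submitTopology with the same arguments, which is what the following commands ultimately do: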
mvn package
storm jar Strata2012-0.0.1.jar twitter.streaming.Topology \
    "$REDIS_HOST" $REDIS_PORT "strata,strataconf" $TWITTER_USER $TWITTER_PASS
(Diagram: a Storm cluster. Nimbus, the master process, coordinates several Supervisors, where the processing is done, through a Zookeeper cluster used only for coordination and cluster state.)
Other features
Use non-JVM languages:
- Simple protocol.
- Uses stdin & stdout for communication.
- Many implementations and abstraction layers done by the community (a JVM-side sketch follows).
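On the JVM side, a non-JVM component is declared by extending ShellBolt, which spawns the external process and speaks the JSON protocol below over its stdin and stdout. A sketch modeled on the storm-starter example (splitter.py is a hypothetical script that would implement the bolt in Python):

import java.util.Map;

import backtype.storm.task.ShellBolt;
import backtype.storm.topology.IRichBolt;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.tuple.Fields;

// Delegates processing to an external Python process over stdin/stdout.
public class PythonHashtagsSplitter extends ShellBolt implements IRichBolt {
    public PythonHashtagsSplitter() {
        super("python", "splitter.py"); // hypothetical script
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("hashtag"));
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}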
"conf": { "topology.message.timeout.secs": 3, // etc }, "context": { "task->component": { "1": "example-spout", "2": "__acker", "3": "example-bolt" }, "taskid": 3 }, "pidDir": "..." }
Contributions
storm-contrib is a collection of community-developed spouts, bolts, serializers, and other goodies. You can find it at https://github.com/nathanmarz/storm-contrib. Some of those goodies are:
- cassandra: integrates Storm and Cassandra by writing tuples to a Cassandra column family.
- mongodb: provides a spout to read documents from MongoDB, and a bolt to write documents to MongoDB.
- SQS: provides a spout to consume messages from Amazon SQS.
- hbase: provides a spout and a bolt to read from and write to HBase.
- kafka: provides a spout to read messages from Kafka.
- rdbms: provides a bolt to write tuples to a table.
- redis-pubsub: provides a spout to read messages from Redis pub/sub.
- And more...
Questions?