Agenda
- The problem
- Storm - Basic concepts
- Trident
- Deploying
- Some other features
- Contributions
Heavy workload
Do it Realtime!
(Diagram: the usual architecture, queues feeding workers, those workers feeding more queues and more workers, and finally a DB. Complex to build and maintain.)
What is Storm?
Storm is a highly distributed realtime computation system. It provides general primitives for doing realtime computation, can be used with any programming language, and is scalable and fault-tolerant.
(Diagram: a Spout reads tweets and emits them into the topology as tuples.)
(Diagram: the tweet stream enters through a Spout; each tweet goes to a HashtagsSplitter bolt, which emits hashtags to several HashtagsCounterBolt instances.)
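A minimal sketch of what HashtagsSplitter could look like, assuming the pre-Apache backtype.storm package names of this era and a Tweet domain class with a getHashtags() method (illustrative, not the talk's actual code):

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Reads a tweet from each incoming tuple and emits one tuple per hashtag.
public class HashtagsSplitter extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        // "Tweet" is an assumed domain class, not part of Storm.
        Tweet tweet = (Tweet) tuple.getValueByField("tweet");
        for (String hashtag : tweet.getHashtags())
            collector.emit(new Values(hashtag));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("hashtag"));
    }
}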
Walkthrough: a tweet with #hello and #world arrives; the splitter emits both hashtags, and the counter bolts apply incr(#hello) and incr(#world), leaving #hello = 1 and #world = 1. A second tweet, "This tweet has #bye #world hashtags", yields incr(#bye) and incr(#world), leaving #hello = 1, #world = 2, #bye = 1.
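The counter from this walkthrough could be sketched as a bolt keeping an in-memory map (same assumptions as above):

import java.util.HashMap;
import java.util.Map;

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;

// Increments an in-memory counter per hashtag.
public class HashtagsCounterBolt extends BaseBasicBolt {
    private final Map<String, Long> counts = new HashMap<String, Long>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String hashtag = tuple.getStringByField("hashtag");
        Long count = counts.get(hashtag);
        counts.put(hashtag, count == null ? 1L : count + 1); // incr(#hashtag)
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Terminal bolt: no output stream.
    }
}

Wiring the pieces with a fieldsGrouping on "hashtag" guarantees every occurrence of a given hashtag reaches the same HashtagsCounterBolt instance, so the per-instance maps hold consistent counts (component ids here are hypothetical):

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("tweets", new ApiStreamingSpout());
builder.setBolt("splitter", new HashtagsSplitter()).shuffleGrouping("tweets");
builder.setBolt("counter", new HashtagsCounterBolt())
       .fieldsGrouping("splitter", new Fields("hashtag"));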
Guaranteed message processing: when a bolt finishes processing a tuple, it acknowledges it:

_collector.ack(tuple);

(Diagram: the ack flows back from HashtagsSplitter and the HashtagsCounterBolts to the Spout.)

If a tuple fails or times out before being fully acked, the spout is notified through:

void fail(Object msgId);

(Diagram: the failed tweet is replayed from the Spout through HashtagsSplitter to the HashtagsCounterBolts.)
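A hedged sketch of an explicitly acking bolt, anchoring each emit to the input tuple so Storm can track the tuple tree (same package and Tweet-class assumptions as before):

import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Splitter variant that acks or fails each tuple explicitly.
public class ReliableHashtagsSplitter extends BaseRichBolt {
    private OutputCollector _collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        _collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        try {
            Tweet tweet = (Tweet) tuple.getValueByField("tweet"); // assumed domain class
            for (String hashtag : tweet.getHashtags())
                _collector.emit(tuple, new Values(hashtag)); // anchored emit
            _collector.ack(tuple);  // tuple fully processed
        } catch (Exception e) {
            _collector.fail(tuple); // triggers the spout's fail(msgId) and a replay
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("hashtag"));
    }
}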
Trident
Trident is a high-level abstraction on top of Storm that makes it easier to build topologies, similar to what Pig is for Hadoop.
Trident - Topology
(Diagram: tweet stream → Spout → HashtagsSplitter → Memory.)
Trident - Stream
(Diagram: the tweet stream enters the Trident topology through a Spout.)
Stream hashtagsStream = tweetsStream.each(new Fields("tweet"),
    new BaseFunction() {
        public void execute(TridentTuple tuple, TridentCollector collector) {
            Tweet tweet = (Tweet) tuple.getValueByField("tweet");
            for (String hashtag : tweet.getHashtags())
                collector.emit(new Values(hashtag));
        }
    }, new Fields("hashtag"));
(Diagram: tweet stream → Spout → HashtagsSplitter.)
Trident - States

TridentState hashtagCounts = hashtagsStream
    .groupBy(new Fields("hashtag"))
    .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));

(Diagram: tweet stream → Spout → HashtagsSplitter → Memory (State).)
Putting it all together:

TridentState hashtags = topology
    .newStream("tweets", new ApiStreamingSpout())
    .each(new Fields("tweet"), new HashTagSplitter(), new Fields("hashtag"))
    .groupBy(new Fields("hashtag"))
    .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));
Transactions
- Exactly-once message processing semantics.
- Intermediate processing may be executed in parallel.
- But persistence must be strictly sequential.
- Spouts and persistence must be designed to be transactional (see the sketch below).
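One common way to make persistence transactional is to store the transaction id alongside each value: a replayed batch arrives with the same txid, the update is detected as already applied, and the count is incremented exactly once. This also shows why persistence must be sequential: the scheme only works if batches reach the store in txid order. A minimal sketch of the idea, not Trident's actual State API (all names here are hypothetical):

import java.util.HashMap;
import java.util.Map;

// Hypothetical store illustrating txid-based idempotent updates.
public class TransactionalCountStore {
    private static class Entry {
        long txid = -1; // last transaction applied to this key
        long count = 0;
    }

    private final Map<String, Entry> store = new HashMap<String, Entry>();

    // Applies the increment only if this txid has not been applied yet,
    // so a replayed batch does not double count.
    public void incr(String hashtag, long txid, long delta) {
        Entry e = store.get(hashtag);
        if (e == null) {
            e = new Entry();
            store.put(hashtag, e);
        }
        if (e.txid != txid) {
            e.count += delta;
            e.txid = txid;
        }
    }
}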
Trident - Queries
We want to know, online, the count of tweets with the #hadoop hashtag and the count of tweets with the #strataconf hashtag.
Trident - DRPC
Trident - Query
(Diagram: a DRPC Server feeds the query arguments into a DRPC stream, which flows through a Hashtag Splitter.)
Stream drpcCountAggregate = hashtagsCountQuery
    .groupBy(new Fields("hashtag"))
    .aggregate(new Fields("count"), new Sum(), new Fields("sum"));

(Diagram: DRPC stream → Hashtag Splitter → Aggregate Count.)
topology.newDRPCStream("counts", drpc)
    .each(new Fields("args"), new Split(), new Fields("hashtag"))
    .stateQuery(hashtags, new Fields("hashtag"), new MapGet(), new Fields("count"))
    .each(new Fields("count"), new FilterNull())
    .groupBy(new Fields("hashtag"))
    .aggregate(new Fields("count"), new Sum(), new Fields("sum"));
(Diagram: the DRPC Server sends the arguments through the DRPC stream; the Hashtag Splitter and state query feed Aggregate Count, whose result is returned to the DRPC Server, which hands it back to the caller of execute.)
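In development the query can be exercised in-process with LocalDRPC (a minimal sketch; "counts" is the function name registered with newDRPCStream above, and the topology is assumed to have been built with this drpc instance and submitted):

import backtype.storm.LocalDRPC;

// In-process DRPC server for testing; no separate DRPC daemon needed.
LocalDRPC drpc = new LocalDRPC();
// ... build the topology above with topology.newDRPCStream("counts", drpc)
// and submit it to a LocalCluster ...

// Blocks until the topology computes and returns the aggregated counts.
String result = drpc.execute("counts", "hadoop strataconf");
System.out.println(result);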
Running Topologies
Local mode:
- The whole topology runs on a single machine.
- Behaves much like running on a production cluster.
- Very useful for testing and development.
- Configurable parallelism.
- Exposes methods similar to StormSubmitter's (see the sketch below).
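A minimal sketch of local mode, assuming the pre-Apache backtype.storm package names and a hypothetical buildTopology() helper that wires the spout and bolts shown earlier:

import backtype.storm.Config;
import backtype.storm.LocalCluster;

// Runs the whole topology inside the current JVM.
Config conf = new Config();
conf.setDebug(true); // log every tuple that is emitted

LocalCluster cluster = new LocalCluster();
cluster.submitTopology("hashtags", conf, buildTopology()); // hypothetical helper

Thread.sleep(60000); // let the topology run for a minute
cluster.shutdown();

Submitting to a real cluster swaps LocalCluster for StormSubmitter.submitTopology with the same arguments, which is what the following commands ultimately do: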
mvn package
storm jar Strata2012-0.0.1.jar twitter.streaming.Topology \
    "$REDIS_HOST" $REDIS_PORT "strata,strataconf" $TWITTER_USER $TWITTER_PASS
(Diagram: a Storm cluster. Nimbus, the master process, coordinates several Supervisors, where the processing is done, through a Zookeeper cluster used only for coordination and cluster state.)
Other features
Use non-JVM languages:
- Simple protocol.
- Uses stdin & stdout for communication.
- Many implementations and abstraction layers done by the community (a JVM-side sketch follows).
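On the JVM side, a non-JVM component is declared by extending ShellBolt, which spawns the external process and speaks the JSON protocol below over its stdin and stdout. A sketch modeled on the storm-starter example (splitter.py is a hypothetical script that would implement the bolt in Python):

import java.util.Map;

import backtype.storm.task.ShellBolt;
import backtype.storm.topology.IRichBolt;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.tuple.Fields;

// Delegates processing to an external Python process over stdin/stdout.
public class PythonHashtagsSplitter extends ShellBolt implements IRichBolt {
    public PythonHashtagsSplitter() {
        super("python", "splitter.py"); // hypothetical script
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("hashtag"));
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}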
"conf": { "topology.message.timeout.secs": 3, // etc }, "context": { "task->component": { "1": "example-spout", "2": "__acker", "3": "example-bolt" }, "taskid": 3 }, "pidDir": "..." }
Contributions
storm-contrib is a collection of community-developed spouts, bolts, serializers, and other goodies. You can find it at https://github.com/nathanmarz/storm-contrib. Some of those goodies are:
- cassandra: integrates Storm and Cassandra by writing tuples to a Cassandra column family.
- mongodb: provides a spout to read documents from MongoDB, and a bolt to write documents to MongoDB.
- SQS: provides a spout to consume messages from Amazon SQS.
- hbase: provides a spout and a bolt to read from and write to HBase.
- kafka: provides a spout to read messages from Kafka.
- rdbms: provides a bolt to write tuples to a table.
- redis-pubsub: provides a spout to read messages from Redis pub/sub.
- And more...
Questions?