Manning Publications Co. We welcome reader comments about anything in the manuscript - other than typos and
other simple mistakes. These will be cleaned up during production of the book by copyeditors and proofreaders.
http://www.manning-sandbox.com/forum.jspa?forumID=926
Licensed to Subu Balakrishnan <Subu2000@gmail.com>
Welcome
Thank you for purchasing the MEAP for Streaming Data. I'm happy to see the book has
reached this stage, and look forward to its further development and eventual release.
This book introduces the concepts and requirements of streaming and real-time data
systems. Through this book you will develop a foundation for understanding the challenges and
solutions of building in-the-moment data systems before committing to specific
technologies. You do not need any experience with streaming or real-time data
systems to fully take advantage of the material in this book. It is perfect for developers or
architects, but also written to be accessible to technical managers and business decision
makers. A lot of care has been put into making the content of the book as approachable and
up-to-date as possible.
This initial release contains the first three chapters of the book. In the first chapter we will
start by discussing what a real-time system is and some of the differences between them and
streaming data systems. We will develop an understanding for why streaming data is
important and also develop our architectural blueprint that will serve as our navigational aid
throughout the book. In the second chapter we will discuss the collection tier, the first tier of
our architecture. In this chapter we will learn about the different data collection patterns,
survey the technology landscape, and develop an understanding of how to choose the right
technology for your business problem. In the third chapter, we will continue to follow our
guide and discuss how we transport data from the collection tier using the message queuing
tier. In this chapter we will spend time learning about message durability, delivery semantics,
and how to choose the right technology. The upcoming chapter 4 will talk about the analysis
tier. Beyond that, the rest of the book will continue to follow our architectural blueprint and
discuss each of the tiers in depth. That will conclude the first part of the book that discusses
this new holistic approach to building streaming data systems.
In the second part of the book we will focus our time on making our tiers come to life by
applying what we have learned and building out a complete streaming data system.
As you're reading, I hope you'll take advantage of the Author Online forum. I'll be
reading your comments and responding, and your feedback is helpful in the development
process.
Andrew Psaltis
brief contents
PART 1: A NEW HOLISTIC APPROACH
3 Transporting the data from collection tier: decoupling the data pipeline
1
Introducing Streaming Data
To set the stage, this chapter will introduce the concepts of streaming data systems,
present the architectural blueprint, and set us up to explore each of the tiers in depth
as we progress.
You can identify hard real-time systems fairly easily: they are almost always found in
embedded systems and have very strict time requirements that, if missed, may result in total
system failure. The design and implementation of these systems is well studied in the
literature and outside the scope of this book. We will leave them behind and turn our focus to
the real-time systems often categorized as soft or near real-time. Determining whether a system is
soft or near real-time is an interesting exercise, as the overlap in their definitions often results
in confusion. Take a moment and think about the following examples:
1. Someone you are following on Twitter posts a tweet and moments later you see the
tweet in your Twitter client.
2. You are tracking flights around New York using the real-time Live Flight Tracking service
from FlightAware (http://flightaware.com/live/airport/KJFK)
Although these systems are all quite different, figure 1.1 below shows a simplified view of
their commonality.
In each of the above examples it is reasonable to conclude that the time delay may only be
seconds, no life is at risk, and an occasional delay of minutes will not cause total system
failure. What do you think? If someone posts a tweet and you see it almost immediately, is
that soft or near real-time? What about watching live flight status or real-time stock quotes?
Some of these can go either way; what if there is a delay in your seeing the data due to
slow Wi-Fi at the coffee shop or on the plane? As you think about these examples, I think you
will agree that the line between soft and near real-time becomes blurry,
at times disappears, is very subjective, and may often depend on the consumer of
the data.
Now let's change our examples just a little bit by taking the consumer of the data out of
the picture. Restating our examples to focus just on the services at hand, we end up with
these:
3. The NASDAQ Real Time Quotes application (http://www.nasdaq.com/quotes/real-time.aspx) is tracking stock quotes
Think about these for a moment. Granted, we do not know how these systems work
internally, but the essence of what we are asking is common to all of them and can be stated
as:
Is the process of receiving data all the way to the point it is ready for consumption a soft or near
real-time process?
Does taking the consumers of the data out of the picture change your answer? If, with a
consumer, you classified one of the examples as near real-time, was that due to the lag or
perceived lag in your seeing the data?
After a while it becomes confusing whether to call something soft or near real-time, or just
real-time as some of the services in our examples do. Clearly there has to be a better way.
Figure 1.3 Real-time computation and consumption split apart
Looking at figure 1.3, on the left-hand side we have the non-hard real-time service, or the
computation part of the system, and on the right-hand side we have the clients, the
consumption side of the system. In many scenarios the computation part of the system
operates in a non-hard real-time fashion; the clients, however, may not be consuming the
data in real time, due to network delays, application design, or perhaps a client application
that is not even running. Put another way, what we really have is a non-hard real-time service with
clients that consume data when they need it. This is a streaming data system: a non-hard
real-time system that makes its data available at the moment a client application needs it. It is
not soft or near; it is streaming. Figure 1.4 shows the result of applying this definition to our
example architecture from figure 1.3.
Figure 1.4 A first view of a streaming data system
Using this definition we have eliminated the confusion of soft vs. near, real-time vs. not real-time,
allowing us to concentrate on designing systems that deliver the information a
client requests at the moment it is needed. Let's use our examples from before, but this time
think about them from the standpoint of streaming; see if you can split each one up and
recognize the streaming data service and the streaming client.
4. Someone you are following on Twitter posts a tweet and moments later you see the
tweet in your Twitter client.
5. You are tracking flights around New York using the real-time Live Flight Tracking service
from FlightAware (http://flightaware.com/live/airport/KJFK)
7. Twitter: A streaming system that processes tweets and allows clients to request the
latest tweets at the moment they are needed; some may be seconds old, while others
may be hours old.
8. FlightAware: A streaming system that processes the most recent flight status data and
allows a client to request the latest data for particular airports or flights.
9. NASDAQ Real Time Quotes: A streaming system that processes the price quotes of all
stocks and allows clients to request the latest quote for particular stocks.
Did you notice that doing this exercise allowed us to stop worrying about soft or near real-time?
We actually got to think about and focus on what and how a service makes its data available to
clients at the moment they need it. Granted, we do not know how these systems work behind
the scenes, which is just fine. Together we are going to embark on a journey to help us
understand how to assemble these types of systems and many more as we progress through
the book.
Figure 1.5 The streaming data architectural blueprint
As we progress we will zoom in and focus on each of the tiers, while also keeping the big
picture in mind. Although our architecture calls out the different tiers, keep in mind these are
not the hard, rigid tiers you may have seen in other architectures. We will call
them tiers, but we will use them as LEGO pieces, allowing us to design the correct
solution for the problem at hand. Our tiers do not prescribe a deployment scenario; in fact, in
many cases they will be distributed across many different physical locations. Now let's take
our examples from before and walk through together how Twitter's service maps to our
architecture.
1. Twitter
2. Collection: When a user posts a tweet, it is collected by the Twitter services.
3. Message queuing: Undoubtedly Twitter runs data centers in various locations across
the globe, and conceivably the collection of tweets does not happen in the same
location as the analysis of the tweet.
4. Analysis: Although I am sure there is a lot of processing done to those 140 characters,
suffice it to say that at a minimum, for our examples, Twitter needs to identify the followers
of a tweet.
5. Long-term storage: Even though we are not going to discuss this optional tier in depth
in this book, the fact that you can see tweets going back in time implies that
they are stored in a persistent data store.
6. In-memory data store: The tweets that are a mere couple of seconds old are most
likely held in an in-memory data store.
7. Data access: All the different Twitter clients need to be connected to Twitter to access
the service.
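The tier-by-tier walk of a tweet above can be sketched as a toy pipeline. This is an illustrative Python sketch, not Twitter's actual design; the tier names follow the blueprint, but every function, data structure, and the stubbed follower lookup are assumptions made purely for the example:

```python
# A toy walk-through of the blueprint's tiers for a single tweet.
# Tier names follow the text; everything else is illustrative.
from collections import deque

message_queue = deque()     # message queuing tier (decouples tiers)
long_term_store = []        # long-term storage tier (durable history)
in_memory_store = {}        # in-memory data store tier (recent data)

def collect(tweet):
    """Collection tier: accept the posted tweet and hand it off."""
    message_queue.append(tweet)

def analyze():
    """Analysis tier: identify followers (stubbed), then store results."""
    tweet = message_queue.popleft()
    tweet["followers"] = ["bob", "carol"]   # a real system would look these up
    long_term_store.append(tweet)           # history, queryable later
    in_memory_store[tweet["id"]] = tweet    # seconds-old data, fast access
    return tweet

def data_access(tweet_id):
    """Data access tier: clients fetch the latest data when they need it."""
    return in_memory_store.get(tweet_id)

collect({"id": 1, "text": "hello streaming"})
analyze()
latest = data_access(1)
```

The point of the sketch is the hand-offs: each tier only talks to its neighbor, which is what lets us later swap the LEGO pieces without redesigning the whole system.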
Take some time and walk yourself through the exercise of decomposing the other two
examples and see how they fit our streaming architecture. Remember, those examples are:
10. FlightAware: A streaming system that processes the most recent flight status data and
allows a client to request the latest data for particular airports or flights.
11. NASDAQ Real Time Quotes: A streaming system that processes the price quotes of all
stocks and allows clients to request the latest quote for particular stocks.
How did you do? Don't worry if this seemed foreign or hard to break down; we will see
plenty more examples in the coming chapters. As we work through them together we will
delve deeper into each of the tiers and discover ways that these LEGO pieces can be
assembled to solve different business problems.
Figure 1.6 The architectural blueprint with security identified
We will not be spending time discussing security in depth in this book, but along the way it will
be called out so that you can see how it fits and think about what it may mean for the
problems you are solving.
1.6 Summary
With the introduction of the architectural blueprint under our belt, let's step back and see
where we have been.
Where we have been
Don't worry if some of this is slightly fuzzy at this point or if teasing apart the different
business problems and applying the blueprint seems overwhelming; we will walk through this
slowly over many different examples in the coming chapters. By the time we are done with
this book it will seem much more natural. We are now ready to dive into each of the tiers and
really understand what they are composed of and how to apply them in building an in-the-moment
system. To help us decide which tier to tackle first, let's take a look at a slightly
modified version of our architectural blueprint, found below in figure 1.7.
We are going to take on the tiers one at a time, starting from the left with the Collection Tier.
Don't let the lack of emphasis on the Message Queuing Tier in figure 1.7 above give you any
cause for concern; in certain cases where it serves a collection role we'll talk about it and clear
up any confusion. Now on to our first tier, the Collection Tier: our entry point for bringing
data into our in-the-moment system.
2
Getting data from clients: Data
ingestion
Figure 2.1 The streaming data architectural blueprint
We are going to take on the tiers one at a time, starting from the left with the Collection Tier
in this chapter and working our way through each of them. Now on to our first tier, the
Collection Tier: our entry point for bringing data into our streaming system. Figure 2.2 below
shows a slightly modified version of our blueprint, with the focus put on the collection tier.
Figure 2.2 Architectural blueprint with emphasis on Collection tier
This tier is where data comes into the system and starts its journey; from here it will progress
through the rest of the system. In the coming chapters we will follow the flow of data through
each of the tiers. Our goal in this chapter is to learn about the collection tier. When you finish
this chapter you will have learned the two different collection modes, the different
technologies at play, and how to choose the right one to solve your business problem.
foreseeable future. Even though we are swimming in data, there are only two general ways
it can be collected: active or passive. Active collection is just like browsing the
Internet: the collection tier (your browser) initiates and directs the collection of data. On the
other side of the coin is passive collection; in this mode, the collection tier waits for data to be
sent to it. In essence these two modes of collection boil down to pull vs. push: pull
is the same as active, where the collection tier pulls in data, and push is the same as
passive, where the collection tier has data pushed to it. In a streaming system the mode of
collection is always passive.
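The pull/push distinction can be sketched in code. In this hypothetical Python sketch (all names are illustrative, not from the book), an active collector drives each fetch on its own schedule, while a passive collector simply accepts whatever producers choose to send:

```python
import queue

def active_collect(source, n):
    """Active (pull): the collection tier initiates every request,
    like a browser fetching pages on its own schedule."""
    return [source() for _ in range(n)]   # the collector drives the loop

class PassiveCollector:
    """Passive (push): producers deliver data whenever they choose;
    the collection tier just accepts whatever arrives."""
    def __init__(self):
        self._inbox = queue.Queue()

    def receive(self, event):
        self._inbox.put(event)            # called BY the producer

    def drain(self):
        items = []
        while not self._inbox.empty():
            items.append(self._inbox.get())
        return items

# With pull, the collector calls the source; with push, the
# producer calls the collector.
pulled = active_collect(lambda: "page", 3)

collector = PassiveCollector()
for reading in ("72F", "73F"):            # a thermostat pushing readings
    collector.receive(reading)
pushed = collector.drain()
```

Note how the direction of the call flips: in the pull case the collection tier holds the loop, while in the push case the producer does, which is exactly why a streaming collection tier must be ready to accept data at any time.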
If our mode of collection is always passive, then why does it matter what the source of the
data is? That is a great question, and on the surface it may not seem like the source matters.
However, let's think about this statement:
1 connection with 1 billion events per day vs. 1 billion connections with 1 event per day
If you give that some thought, I am sure you will think of a variety of differences in the
systems you would design for these two scenarios. For example, consider just the basic connection
aspects: designing a system that has 1 connection with a lot of data (one really big straw,
if you will) is quite a bit different than designing a system that has to handle 1 billion
connections, each with a tiny bit of data. Today it is common and easy to find data sources that
fit the pattern of 1 connection and a lot of data. It is also not very hard to find data sources
that match the other pattern of many connections, each with very little data. In the future this
second scenario will become much more pervasive in our society. A lot of this will be driven by
what has been termed the Internet of Things. As the Internet and the Internet of Things grow
in the coming years, understanding how to design systems for each of these patterns will be
essential.
Although these sources of data and their characteristics are vastly different, from a
user browsing a website to a sensor in a street sending traffic updates, they are both still
characterized as passive data collection. We will see in the next section that this logical
grouping of the style of collection has nothing to do with the potentially big-data-like features
(the 3Vs of volume, velocity, and variety that we discussed in Chapter 1) but more to do with
the direction of the data flow and who the initiator of it is. Keep in mind that the "who" may be
a person in some situations and in others it may be a physical device such as a garage door or
a thermostat.
search for, interact with, and update content online. A Thing on the Internet is what I would classify as a
device that accesses the Internet in a similar fashion to how you or I would. For example, if your
toaster checked the weather as it was toasting your morning bagel and provided the forecast on
your bagel, I would call your toaster a Thing on the Internet. Or perhaps your refrigerator may
access your online calendar and suggest items for lunch based on your schedule; this would also be
a Thing on the Internet. In these examples our toaster and refrigerator are accessing the Internet in
just the same fashion as any application would. The toaster's access pattern is no different than a
weather application's, and the refrigerator's access pattern is similar to a lifestyle application that helps
you prepare for your day.
The Internet of Things is slightly different; in this case we are talking about internetworked things
that are active participants in business, information, and social processes, that gather and distribute
information from and about the physical world in order to draw conclusions, and that often act on those
conclusions in the physical world. Today it is not hard to find consumer examples such as garage
doors and thermostats that report changes and can also be controlled by homeowners using an
application on a mobile device. This is really the tip of the iceberg. Imagine if, in the future, when you
wanted to take a bus, the bus itself could notify you when it was time to leave for the station.
You may think that is easy: today I can download an application on my mobile device that sends me a
push notification based on my proximity to a bus stop and the static bus schedule. But what if instead
the projected bus arrival time and optimal route were computed with the aid of city-supplied
traffic information collected by street cameras, traffic lights, current street conditions possibly
reported by the bus, the driver's shift information, the bus's fuel or battery level, and other information
that may require the bus to make an unplanned stop? Imagine all of this data being collected and
analyzed to provide the bus with the optimal route to take, after which the bus sends you a push
notification of its predicted arrival time and the time you should leave the coffee shop to be able to
make it to the closest station with time to spare.
Figure 2.3 Passively collecting the SuperSearch search stream
Right now, if you are like me, you have a picture in your mind of a big system that we would
need to build to handle the fire hose of searches that our system will be ingesting.
We will dig into the details of what our collection tier may look like in the next section. Before
that, let's move on to another example that helps to introduce and orient us.
Imagine this time we are building a streaming system that is going to reside inside a
vehicle, and its mission in life is to optimally route the vehicle to its destination
based on current traffic conditions during its journey. Figure 2.4 below shows what this may
look like.
Figure 2.4 Passively collecting traffic conditions with on-board streaming system.
Not surprisingly, we can take the same architecture and embed it in a vehicle. The software
stack composing it may be slightly different; however, all of the principles of the blueprint still
apply.
NOTE It may be interesting to think about the traffic conditions service that the vehicle is making a
request to; you could imagine that it is also a streaming system, with much different scaling
factors: perhaps all vehicles from a given manufacturer would make requests to it every time
they were on the road. I will leave it to you to think through this scenario. As you continue
through this book, different aspects of a solution and perhaps alternative designs may become
apparent.
In both of these examples the data flow follows a request/response pattern. The
consumer of the data initiates a request and the service responds; in our two examples it just
so happens that the response is a stream of data and the client is our collection tier. This is
the same familiar pattern you are used to seeing on the web; the twist in this case is that it
may or may not be over HTTP, and it is our collection tier making the request, not a browser.
The second passive collection pattern we will discuss is the event pattern. The event
pattern is a style of interaction where a producer of data (a thermostat, web server, phone,
etc.) sends current or temporally aggregated state (current temperature, number of requests
served in the last minute) about itself and/or its environment to another system or systems (a
temperature monitoring app, an operations monitoring system). The data flow is always one-way,
from the producer to the consumer. This will become clearer as we walk through the
following examples.
Imagine for a moment that you own a skyscraper in New York City. Business is going
great, you have good tenants, and all of your units are occupied. But your cost for utilities just
keeps increasing at what seems like an uncontrollable pace. Walking past your building one
night, you notice that a lot of your tenants leave the lights on all night long, and you start to think
there has to be a way to monitor this usage and control the lights so that you can save
electricity and have a more environmentally friendly building. After doing some research you
discover that indeed you can turn your building into a smart building. Fast-forward 6 months
and the conversion is done: your building has been outfitted with the latest and greatest
technology and is now considered a smart building. The lights in the building not only turn on
and off as people move about different areas of their offices, they also regularly report the
following information:
Hours of operation
Wattage used
Frequency of turning on and off
Hours of use remaining
Status
We will call this information an event message. With this information in hand you'll be
able to know when you need to replace lights, you'll understand the energy efficiency of the
lights you use, and you'll be able to perform an energy usage analysis of your building and
each of your tenants. To be able to gather this data and do this analysis, we will need to make
sure our Collection tier can handle receiving event data from various devices. You may
already have a picture of the architecture in your mind; it is just like the one we saw for the
SuperSearch.com example, except we would replace the SuperSearch Stream as the producer
with motion sensors.
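The light's report can be sketched as an event message. This is an illustrative Python sketch, not a real building protocol: the field names mirror the bulleted list above, the field values are invented, and the transport is faked with a plain function call. It shows the defining trait of the event pattern, that the producer initiates and data flows one way to the collector:

```python
import json
import time

def make_light_event(light_id, status):
    """Build one event message with the fields the building's lights
    report (values here are made-up examples)."""
    return {
        "light_id": light_id,
        "timestamp": time.time(),
        "hours_of_operation": 1240.5,
        "wattage_used": 12.0,
        "on_off_frequency": 6,          # times switched on/off today
        "hours_of_use_remaining": 8759.5,
        "status": status,
    }

received = []

def collection_tier(raw):
    """Passive collection: the producer pushes; we only receive."""
    received.append(json.loads(raw))

# The light (producer) initiates; the collection tier never asks for data.
collection_tier(json.dumps(make_light_event("floor3-unit12", "ok")))
```

Serializing to JSON here stands in for whatever wire format a real device would use; the essential point is that `collection_tier` has no way to request data, it can only accept it.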
For our next example, imagine that we are going to build a social network, let's call it
TwitterOfThings, that will allow all home appliances to send out event messages containing
their status and state. Perhaps your washing machine or dryer would report how often it is
run, how long it runs for, and the health of its various components. If you think of these and
the other appliances in your home, I am sure you can come up with a long list of the data that
could be sent to the TwitterOfThings. If you are like me, you start to think: wow, with all this
data we can do some amazing things, such as determine the quality of different appliances,
allow a service representative to proactively schedule service, and determine energy usage.
For now, though, we will hold back on exploring those avenues, as we will address the Analysis
tier in Chapter 4. Getting back to the Collection tier: do you think we will need to change our
architecture? If you answered that we may need to scale it out but the architecture will remain
the same, you are correct. Once again our architecture remains the same as before; we are just
adding many more producers.
In this section we have looked at the two passive message patterns: the request/response
and the event. The key takeaway from both is that although there may be a lot of
data, and potentially a very high volume of data, from the point of view of the Collection tier it
is passive, as the data always flows from the producer, which may be a phone, a washing
machine, a street meter, a web browser, or any other thing capable of sending data to the
Collection tier. In each of these examples we also decided that our architecture did not
change; well, there is a slight wrinkle to that. Our architectural blueprint may not have
changed, but as we will see in the following sections, where we dig deeper into each of
these patterns, the underlying architecture of the collection tier may vary widely across
scenarios.
Figure 2.5 SuperSearch Stream with just the collection tier in focus.
Now let's start to dig in a little deeper, peeling the onion back one layer so we can see the
request/response pattern exposed below in figure 2.6.
There are several steps identified in figure 2.5 that are executed during our use of
the SuperSearch stream.
Step 1 is the request part of the request/response pattern. The role of the collection tier here is to make a request to consume; in this case the request is to start a stream of data flowing. Keep in mind that it is very common for a consumer to be authenticated as part of the request processing. We will not go into the steps of performing this, but you should take note that it will often be required and needs to be factored into your architecture.
Step 2 is the response part of the request/response pattern. In this scenario the response will be a continuous stream of data.
Step 3 is when we are done consuming the stream and close the connection.
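The three steps can be sketched in Python. Everything here is a stand-in: `open_stream`, the query, and the API key are hypothetical, and a generator simulates the never-ending response so the sketch runs without a network connection.

```python
import itertools

def open_stream(query, api_key):
    # Step 1: the request. In a real system this would authenticate and
    # open a long-lived HTTP connection to the provider; here a generator
    # stands in for the never-ending response stream (step 2).
    return ({"id": n, "query": query} for n in itertools.count())

def consume(stream, limit):
    # Drain messages from the response stream, then stop, simulating
    # closing the connection (step 3).
    collected = []
    for msg in itertools.islice(stream, limit):
        collected.append(msg)  # hand off to the rest of the collection tier
    return collected

messages = consume(open_stream("news", api_key="hypothetical-key"), limit=5)
print(len(messages))  # -> 5
```

In a real collection tier the `limit` would not exist; the loop would run until the connection is closed or dropped.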
Step 2 is where the real work for the collection tier takes place. Continuing with our example, SuperSearch is going to stream 70,000 searches per second to our collection tier; by any measure this is a fire hose of data. To help us think through the different design choices, table 2.1 below lists some of the areas of concern along with some questions to help steer our design. When you are designing a real-world streaming system, these questions, coupled with others we develop in later chapters, will help elicit the thinking and conversations that you will need to have to be successful.
Table 2.1 Areas of concentration and questions to address

Area: Downstream consumers
Questions: What if the next tier cannot keep up with the speed and/or volume of data? How can we make sure our collection tier is not affected by this backpressure?
With our questions in hand and a fire hose of data ready to be consumed, let's walk through some designs based on answers to the above questions. To establish our frame of reference, figure 2.7 below illustrates our blueprint streaming architecture with the SuperSearch stream as the source of data.
Figure 2.7 Simplified architecture consuming 70K searches a second from SuperSearch
You may be thinking that is a very large arrow going from the SuperSearch stream to the collection tier; this was done intentionally to emphasize the amount of data we are consuming. Before we try to answer some of the questions from table 2.1, let's redraw figure 2.7 so we can see a better representation of what the collection tier may really look like. Figure 2.8 below illustrates the collection tier in its expanded form to aid in our discussion.
Figure 2.8 Example collection tier expanded to show the nodes that make it up.
Looking at figure 2.8, it is important to realize that the number of nodes in the collection tier is just for illustration purposes; your mileage may vary depending on many factors, some of which we will cover, while others, such as the type and cost of hardware, we will not. Now let's see if we can answer some of the questions we posed in table 2.1. First, let's start by trying to answer the velocity questions, which are repeated below in table 2.2.
Velocity of Data
If we look closely, these questions are trying to make us think about how we scale the collection tier in the face of changes to the velocity of the data we are dealing with. Our scaling efforts can result in our collection tier being labeled as Superlinear, Linear, or Sublinear. Each of these is illustrated below in figure 2.9.
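To make linear and sublinear scaling concrete, here is a small Python sketch; the per-node throughput of 15,000 messages per second and the 5% efficiency loss per added node are purely illustrative numbers, not measurements.

```python
def throughput(nodes, per_node=15_000, efficiency=1.0):
    # Total messages/second for a tier of `nodes` machines. efficiency=1.0
    # means perfectly linear scaling; below 1.0 each added node contributes
    # slightly less than the last (sublinear), e.g. due to coordination cost.
    total = 0.0
    for n in range(nodes):
        total += per_node * (efficiency ** n)
    return total

# Linear: doubling the nodes doubles the capacity.
print(throughput(12) == 2 * throughput(6))   # -> True
# Sublinear: with 5% loss per added node, doubling gains less than 2x.
print(throughput(12, efficiency=0.95) < 2 * throughput(6, efficiency=0.95))  # -> True
```

Superlinear scaling, where each added node contributes more than the last, is rare in practice; linear is usually the goal.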
Figure 2.9 Data velocity vs. Speed of Processing scalability
Ideally we would like to at least be able to achieve linear scalability. To get there you have two options: you can scale vertically or you can scale horizontally. There is no one-size-fits-all direction to go here, and the choice to go vertical or horizontal depends on a lot of factors, some of which are organizational and beyond the scope of this book, so we will put those to the side. Given this, let's see how we would address the questions posed. In both sets of questions we are really after answers to the general question of how we handle growth, be it from an overall increase or a spike. To handle this by scaling horizontally, we would like to simply add more collection nodes to our collection tier. If we were scaling vertically, we would like to simply add more CPUs and/or RAM to each of the collection tier nodes. Restating this generically, we can say that to scale the collection tier we would like to be able to add X more cores and RAM to handle an increase in traffic of Y. Your job when you solve this problem for your organization is to determine what X and Y are. Let's solve this using our example architecture from figure 2.8, where there are six nodes in the collection tier. Here are our operational assumptions:
With those operational assumptions we can comfortably handle the 70K messages per second load and have the capacity to handle a spike of 35%. Of course, if this spike turned into a new
norm, this would not be sustainable, as you would most likely not want your collection tier to run at full capacity for long. All right, our extra capacity takes care of the question about how we are going to handle a spike in traffic. Since we stated before that we can solve the scalability problem with the general formula of adding X more cores and RAM to handle an increase in traffic of Y, answering the questions of handling a doubling or a 10x increase in data velocity just becomes a matter of plugging in the values for X and Y. How would you handle a drop in data? Would you reduce the number of nodes in the tier? Or would you leave them so that you are ready when the data velocity picks back up?
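The general formula can be captured in a few lines of Python; the per-node capacity of 16,000 messages per second is a hypothetical figure, chosen only so that six nodes cover 70K messages per second with roughly 35% headroom.

```python
import math

def nodes_needed(msgs_per_sec, per_node_capacity, headroom=0.35):
    # Number of collection nodes needed to handle `msgs_per_sec` while
    # keeping `headroom` spare capacity for spikes.
    return math.ceil(msgs_per_sec * (1 + headroom) / per_node_capacity)

# The chapter's load of 70K msgs/sec, with a hypothetical per-node
# capacity of 16K msgs/sec, lands on the six nodes of figure 2.8.
print(nodes_needed(70_000, 16_000))    # -> 6
# Doubling the traffic (Y = 2x) is just a matter of plugging in new values.
print(nodes_needed(140_000, 16_000))   # -> 12
```

Scaling vertically changes `per_node_capacity` instead of the node count, but the arithmetic is the same.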
Let's now think about the streaming protocol that is being used. The question we asked above in table 2.1 was:
Before we can answer this question we need to know the protocol in use. In the wild, some of the popular protocols you will see in use today are HTTP and WebSockets; of course, there are also custom protocols built on top of these or directly on TCP. Since there are many dedicated resources available for each of these protocols that discuss scalability and the protocols in depth, we will not elaborate on them further; please consult one of those resources to make sure you take into account any nuances of the protocol you are using.
We are halfway there: we have answered the questions about data velocity and addressed the protocol in use. Now let's turn our attention to the next set of concerns, namely how we keep up with the producer. The questions we posed back in table 2.2 are repeated below in table 2.4.
To answer the first question, what happens if we fall behind, we need to consider various aspects of the stream we are consuming and of our collection tier. What if the SuperSearch stream we are consuming has business rules that state: if your consumer cannot keep up with the stream, then the consumer will be disconnected? We stated earlier that our goal was to perform an analysis of this stream as it happens, so being disconnected from the stream could result in our application missing a lot of data. On the surface this may not seem like a big deal. But what if we make our money by selling the analysis for streaming ad buying and our customers miss opportunities? Clearly this is not something that we can allow to happen often, if at all. Now let's twist this a little: perhaps SuperSearch decided that they were not
going to explicitly disconnect consumers that could not keep up, and instead would just discard messages if the consumer could not consume fast enough. In this case we are in a very similar predicament: we will begin to miss data and potentially cause our customers to miss opportunities. That brings us to the next question that we need to consider:
Answering this question is going to depend on what you are doing with the data. In our example, we are selling streaming ad buying, and missing data can have financial ramifications for our customers. Your situation may be different; perhaps you are doing something with the data where missing a few messages here and there will not change the outcome of your analysis. This is something you need to consider and take into account. This brings us to our last question:
There are several ways that we can tackle this question. The first obvious choice is to keep adding nodes to our collection tier until we no longer fall behind. In some situations this may not be a bad idea, and for very little effort you may have solved your problem of being able to keep up. But in many cases this is just not feasible, so what do we do? In this situation one possibility is to split our collection tier into two parts and add a buffering tier between them. This architectural change is illustrated below in figure 2.10.
Figure 2.10 Collection tier split in half with Buffering tier in the middle.
The key to splitting the collection tier is to isolate the part of the tier that just serves to receive messages from the producer, split that part out, and change it so that it will now consume messages as fast as possible and push them to a buffering tier. The other half of the collection tier will then be modified to consume messages from the buffering tier. If we can split our collection tier and use a buffering tier in this fashion, then we stand to gain at least the following two nice features:

1. Short-term storage of our messages.
2. Decoupling of our collection tier from the producer of the data.
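The split can be sketched in Python with an in-process `queue.Queue` standing in for a real buffering tier (which in practice would be a separate system such as Kafka): the front half pushes as fast as it can, while the back half drains at its own pace.

```python
import queue
import threading

buffer_tier = queue.Queue()  # stand-in for a real buffering tier (e.g. Kafka)

def front_half(stream):
    # Receives messages from the producer as fast as possible and pushes
    # them straight into the buffering tier.
    for msg in stream:
        buffer_tier.put(msg)

def back_half(out):
    # Consumes from the buffering tier at its own pace.
    while True:
        msg = buffer_tier.get()
        if msg is None:      # sentinel: the stream has been closed
            break
        out.append(msg)

results = []
consumer = threading.Thread(target=back_half, args=(results,))
consumer.start()
front_half(range(1000))      # simulated producer stream
buffer_tier.put(None)
consumer.join()
print(len(results))          # -> 1000
```

Because the two halves only meet at the queue, either side can slow down or restart without the other noticing, which is exactly the decoupling we are after.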
First, you may be wondering why we would want to have short-term storage for our messages. That is a good question, but remember that in some cases, if we cannot keep up with the provider of the stream (in this case SuperSearch), it may choose to disconnect us or, worse, start dropping messages. In both of those cases we will lose data, and if your business will be impacted by the loss of data, then this is something you will need to consider. The second benefit, decoupling the collection tier from the producer of the data, can pay dividends even if we do not need the short-term storage. There are at least two upsides to this. First, now that our collection tier is receiving messages from a tier that provides short-term storage, our failure and recovery scenario just became much simpler: a collection node or the whole tier can fail, and when it recovers it can resume consuming data from where it left off. A secondary benefit is that now that our collection tier is decoupled from the actual producer of the data, we can add in new data sources that we are collecting from, and thus new data into the buffering tier. The details of how to do this will be explored in Chapter 3; for now let's just consider the general idea, which is illustrated below in figure 2.11.
Figure 2.11 Collection with buffering tier and two data sources.
I hope that as you look at and start to internalize figure 2.11, you will see that not only can you start to add other data sources into the buffering tier, but this general concept may be applicable in other cases when building out a streaming system.
At this point we have covered the questions we raised in the areas of data velocity, protocol, and keeping up with the producer. The last area we identified is the downstream consumers. Our concern here is how we protect the collection tier that we have spent a lot of time on from being affected by consumers that cannot consume from us fast enough; think of it as preventing backpressure in a hose. This is an interesting area, and one that we are going to put on hold for just a little bit longer, as it is really the heart of Chapter 3. However, if you want a head start in thinking about it, keep the buffering tier in mind and think about taking it a little further. For now, we are going to move on to our next example and revisit the downstream consumers in the next chapter.
Figure 2.12 The high-level architecture of household appliances sending event messages to a streaming system.
On the surface, figure 2.12 may look like we just switched out the SuperSearch stream for billions of devices and called it good. However, there is a subtle difference between the request/response and event message patterns that figure 2.12 tries to capture. Recall that in the request/response pattern our collection tier reaches out to a single data source and requests data; the resulting response is a never-ending stream that may be massive. The event message pattern is slightly different: in this case we may have millions or billions of things that send us messages. It may be one message every hour, day, or month, or many messages a minute. When looked at from the view of a single appliance or household it may seem small; however, when you think about an entire city, state, or country, the amount of data quickly becomes quite large. Although this difference between the event message pattern and the request/response pattern may seem minor, as we explore a couple of the questions we need to keep in mind and the resulting architecture, we will see that it is indeed quite different. The difference in message pattern becomes more apparent when we look at figure 2.13 below, which compares the data flow of the request/response pattern with the data flow of the event message pattern.
Figure 2.13 The differences between the request/response and event message patterns.
With these differences laid out, let's turn our attention to table 2.5, which lists some of the key areas and questions for us to think through as we consider different design choices for the collection tier in the building of TwitterOfThings.
Area Question
You may have noticed that we left off the velocity of data category; it can be argued that having all of the home appliances for an entire country sending us status messages every time they are used could result in a high velocity of data. Don't worry about this at all; you already know how to handle data velocity based on what you learned in the previous section about the request/response pattern. Let's move on to the areas that are new and different with the event message pattern.
The one question in the volume of data section, how do we handle going from a city to a state to a country, is really getting at how we scale our architecture as the number of appliances grows. If you have not thought about connected devices other than your phone or computer, consider this: as of this writing, Cisco's connections counter (http://newsroom.cisco.com/feature-content?type=webcontent&articleId=1208342) shows there are over 12 billion things connected to the Internet. To put that in perspective, there are a little over seven billion people on Earth, which equates to approximately 1.7 things per person. By many accounts we are just getting started; many predict that by 2020 there will be over 50 billion connected things. With that perspective, let's get back to our question and rephrase it as: how do we handle going from New York City, to New York State, to the United States? Table 2.6 shows the estimated populations and things for these geographies.
Table 2.6 New York City, New York State, and the United States estimated population and
number of connected things
Geography Estimated 2013 Population Estimated connected things (1.7 per person)
In this scenario there will be between 14 and 540 million devices sending us status messages throughout the day, basically like little birds chirping all day long. Without a doubt there will be peaks and valleys in the number of devices connected at any one time, but for simplicity let's assume that the number of connections throughout the day is constant. Taking this into consideration, a plausible architecture for our TwitterOfThings is illustrated below in figure 2.14.
Figure 2.14 First pass at the collection tier with connected devices sending a status message.
If you look closely, figure 2.14 is not that much different from figure 2.8. In fact, when considering the number of devices chirping compared to the number of searches being streamed to us by SuperSearch, the overall architecture may be very similar for the amount of data we are considering. The protocol differences, and streaming versus event messages, as we will see in the next section, will have more of an impact on the technology choices for this tier than the sheer number of devices.
Before moving on to talk about the technology choices that lie ahead, let's not forget about the second question we need to consider: how do we handle multiple protocols? This is interesting; in the SuperSearch example we considered only HTTP, as that is the most common way for data to be streamed across the Internet. However, when we start to talk about devices, in many cases having a full HTTP stack may be way too much of a burden on something that runs on very limited battery power, has a very tight cost structure, and may often have a very spotty and limited Internet connection; think of a moisture meter in a remote agricultural field. How does this affect our collection tier architecture? That is a great question. It really does not change our picture from that in figure 2.14, but it does prompt us to think about the fact that we will need to handle multiple protocols and messages in various formats and sizes.
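One common way to cope with multiple protocols and formats is to normalize everything at the edge of the collection tier, so the rest of the system sees one message shape. The sketch below is a simplified illustration; the "compact" wire format and the field names are hypothetical.

```python
import json

def parse_http_json(payload):
    # Messages from HTTP-capable devices, assumed here to be JSON.
    return json.loads(payload)

def parse_compact(payload):
    # A hypothetical compact wire format for constrained devices:
    # plain text of the form "device_id|reading".
    device_id, reading = payload.split("|")
    return {"device": device_id, "reading": float(reading)}

# Dispatch on the source protocol so downstream tiers see one shape.
PARSERS = {"http": parse_http_json, "compact": parse_compact}

def collect(protocol, payload):
    return PARSERS[protocol](payload)

print(collect("http", '{"device": "washer-7", "reading": 0.4}'))
print(collect("compact", "meter-12|17.5"))
```

A real collection tier would register one parser per protocol (HTTP, MQTT, AMQP, and so on), but the normalize-at-the-edge idea is the same.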
Phew, that is enough theory; let's move on to talk about the different technologies that exist today that we can leverage to support both the SuperSearch stream and our TwitterOfThings.
Velocity of data: High for the stream (70,000/second); low for the events (< 6,000/second, on the order of every device, all 540M of them, sending one message per day).
Variety of data: Low for the stream (a single source, all the same); high for the events (many different devices).
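A quick back-of-the-envelope check of these rates, using the device counts from table 2.6 and the constant-rate simplification from earlier:

```python
SECONDS_PER_DAY = 24 * 60 * 60

def steady_rate(devices, msgs_per_device_per_day=1):
    # Average messages/second if every device sends the given number of
    # messages per day at a constant rate (the simplifying assumption above).
    return devices * msgs_per_device_per_day / SECONDS_PER_DAY

print(round(steady_rate(14_000_000)))    # New York City scale -> 162
print(round(steady_rate(540_000_000)))   # United States scale -> 6250
```

Even at the scale of the entire United States, one event per device per day works out to only a few thousand messages per second, which is why the event side of the table is labeled low velocity despite the enormous device count.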
When looking at table 2.7, it appears that there are no similarities between the two feature sets. If we look a little closer and give them more consideration, the biggest difference that may have an impact on the technology we select is the protocols being used. Sure, with the stream there are few connections compared to the event pattern, and there is a higher velocity versus volume. In the end these are moot compared to the protocol support. With that in mind, let's discuss what technologies are currently available that we can use to build each of these.
Let's start with the SuperSearch streaming system; for this we need to support 70,000 messages per second over HTTP. With the prevalence of HTTP, we can find many choices in every popular programming language on every popular platform. For example, if our language of choice is Java, we may choose to use the Apache HTTP Client or Netty. If we prefer writing our services in JavaScript, we can use Node.js. The key in both of these cases is going to be making sure the technology you choose can scale to meet the velocity demands. Remember, we talked earlier about adding a buffering tier between the two halves of the collection tier; in this case we can choose from a variety of messaging systems, and one that is particularly well suited for this task is Apache Kafka. More information on Apache Kafka can be found at http://kafka.apache.org/.
Let's now turn our attention to the TwitterOfThings system. For this we are going to need, among other things, to support quite a myriad of protocols, the most popular ones today being the Advanced Message Queuing Protocol (AMQP) and MQ Telemetry Transport (MQTT). We need to keep in mind that we may still see HTTP traffic, and for certain there are protocols we will need to support in the future, such as Chirp, a protocol that is lighter weight and more efficient than IP for moving data for things. At the present time, implementations of both AMQP and MQTT are found in brokers such as RabbitMQ and ActiveMQ. If either of these is too heavy for your collection tier or does not meet your needs, there are client libraries available for both protocols in most popular programming languages so that you can build the exact collection tier you require.
At this point you are on your way to collecting data from a blazing-fast stream or from millions of household devices across the United States. Fantastic, but the next obvious question is: great, but what do I do with it? That, my friend, is exactly what we are going to dive into in the next chapter.
2.5 Summary
In this chapter we have explored the various aspects of collecting data for a streaming system, from the blazing-fast SuperSearch to a futuristic TwitterOfThings that allows household devices to send status messages.
Along the way we have:
At times our focus was quite wide and covered a lot of ground. As we progress to the message queuing tier, you will see that although our net may have been cast very wide in the collection tier, once the data is collected it will all start to look and feel the same. For now, let's put the collecting of data behind us and start our journey of following the data stream now that it has entered our streaming system.
3
Transporting the data from the collection tier: decoupling the data pipeline
Figure 3.1 Collection tier with its unstated role (moving data from input to the rest of the platform) exposed
Previously we talked only about the role of handling the incoming data, not the output of data from the collection tier. In this chapter we are going to focus on transporting data from the collection tier to the rest of the streaming pipeline. Although we may mention the collection and analysis tiers in our discussion, we will only be concerned with getting messages from or to those tiers via the message queuing tier. Figure 3.2 below shows our streaming architecture with this focus in mind.
Figure 3.2 The message queuing tier with its input and output as the focus
After completing this chapter you will have a solid understanding of why we need a message queuing tier and the features of a messaging product that are important for a streaming data system.
Figure 3.3 From the collection tier straight to the analysis tier
Looking at this redrawn architecture, we may be tempted to say this looks simpler and things should work just fine. Before we convince ourselves of this, we need to answer a very important question:
What if our consumers cannot consume data fast enough from the collection tier?
In my mind this conjures up an old cartoon image of a hose with the end plugged and the water spigot completely opened: it starts to swell and eventually just explodes from the backpressure. Taking that example to our realm of data, let's look at a time-lapse of the data flowing from the collection tier to the analysis tier, below in figure 3.4.
Figure 3.4 The three stages of data flowing without a message queue. We do not want step C.
Step A: This looks pretty normal and is what we would like to see.
Step B: We can tell something is not quite right; backpressure is building.
Step C: Our data pipe broke under pressure, and data is now virtually dropping on the floor, gone forever.
Ouch, this is not a good situation, as we are now losing data; for some businesses this can be catastrophic. At first blush you may think that this is a consumer problem, and all we have to do is add more consumers or make them faster so they can keep up and life will be good. The reality is that this is not a consumer problem at all, as it is perfectly acceptable in many use cases for consumers to read slowly or be offline from time to time. In this chapter we are going to explore how using a message queuing tier helps protect the collection tier from ever being subjected to message backpressure and ending up like figure 3.4 C.
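The three stages can be simulated with a bounded in-process queue; the capacity of 100 and the burst of 1,000 messages are arbitrary numbers, but they show how a full pipe starts dropping data once nothing drains it.

```python
import queue

pipe = queue.Queue(maxsize=100)  # the "hose": finite capacity, no buffer tier
dropped = 0

# The collection tier emits 1,000 messages while the analysis tier is
# offline, so nothing drains the pipe; once full, data hits the floor.
for msg in range(1000):
    try:
        pipe.put_nowait(msg)
    except queue.Full:
        dropped += 1

print(pipe.qsize(), dropped)  # -> 100 900
```

A message queuing tier with durable storage turns those 900 dropped messages into messages waiting to be read whenever the consumer comes back.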
to the newer takes on messaging found in Apache Kafka, which is described as a publish-subscribe messaging service.
If you search the web for message queuing products, you will find that there are a lot of technologies to choose from and the choices are always evolving. If you were to stop now and investigate some of them further, you would find a bewildering array of features, many of them the same or sounding the same, which just makes our job of choosing the right technology a little harder. So let's resist the temptation to jump right in and start downloading for a little bit longer, and avoid getting ourselves all wound up over something I like to call feature overload. Instead, let's step back, tease out the features that are critical to the success of our streaming system, and discuss those. I know this is hard, but it will be worth it. Armed with this information, choosing the technology tool or tools will not only be much less stressful, but will enable us to objectively think about the problem we are trying to solve and what is important to your business.
Now that we have avoided feature overload and have set the stage for where this tier fits into the larger streaming architecture, we are ready to talk about the core features we need to consider when selecting a message queuing product. This is by no means an exhaustive list of features, but these are the ones you really want to pay attention to when designing a streaming system. Before we dive into the core features, let's take a moment to make sure we understand the components of a message queuing product and how they map to our streaming architecture.
In figure 3.5 we can see that the producer and the consumer have jobs that closely match their names: the producer produces messages, and the consumer consumes them. You may notice in figure 3.5 that the term broker is used and not the term message queue; a logical question is, why the change? Well, it is not so much a change as it is an abstraction. If we look at figure 3.6 below, we will see that the message queue is alive and well, but it is abstracted away by the broker.
Figure 3.6 The broker with the message queue shown.
When you take figure 3.6 into account, the data flow starts to make more sense. If
you follow the flow from left to right in figure 3.6, you will see the following steps taking place:
To put this in perspective, let's see what it looks like if we overlay these terms and pieces onto
our streaming architecture. In figure 3.7 below you will see the components of the
message queuing tier we have been talking about overlaid onto our streaming architecture.
Figure 3.7 Streaming architecture with message queuing tier components shown in context
Looking at figure 3.7, I think you will agree that this seems pretty simple and
straightforward, but as the saying goes, the devil is in the details. It is these details, the
subtle interactions between the producer, broker, and consumer as well as various behaviors
of the broker, that we will now turn our attention to. Phew, finally we are ready to dig into the
core concepts. Are you ready? Let's go.
Figure 3.8 Two data centers with data flowing between them
In the San Diego data center you have the collection tier running, and in the Amsterdam
data center you are running the Analysis tier. I know we have not talked about the Analysis
tier yet; for now let's just say it needs the data from the collection tier. All right, you are
collecting data in San Diego and analyzing it in Amsterdam, things are running smoothly, and
business is good. But as luck would have it, right as you were about to leave for the weekend
on a beautiful Friday afternoon, a construction worker accidentally put a backhoe through a fiber
optic line, cutting off communication between your two data centers as shown in figure 3.9.
Figure 3.9 Two data centers with data flowing into the ocean
After talking with the telecom company that owns the fiber line, you learn their best guess
is that it may take 2-3 days to repair. What would the impact be to your business if this were
to happen? How much data can your business tolerate losing from your collection tier? If this
situation would have a negative impact on your business and you cannot tolerate losing
potentially days of data, then you need to make sure the message queuing technology you
choose has the ability to persist messages for the long term. Figure 3.10 below shows how
durable messaging fits in with this tier and some of the types you may find.
Figure 3.10 Durable messages: where they fit and how they may be stored
the traffic data for a given day, week, or month. If your architecture is similar to Figure 3.11,
then as the Analysis tier consumes messages they are discarded from the message queue; in
essence they are gone, and you cannot provide your historical traffic replay.
Figure 3.11 Transient messages get discarded after the Analysis tier consumes them.
To solve this problem we need an architecture that more closely resembles that depicted
below in Figure 3.12.
Figure 3.12 Offline consumers persisting data for historical reporting / analysis
Using a product that supports storing messages to allow for offline consumers will allow us
to handle the current desire to provide a historical traffic replay as well as any other
historical reporting or analysis we may want to do in the future. To be sure you can
support these types of requirements, you need to make sure that the message queuing
technology you choose supports both online and offline consumers.
At most once - A message may get lost, but it will never be re-read by a consumer.
At least once - A message will never be lost; however, it may be re-read by a consumer.
Exactly once - A message is never lost and is read by a consumer once and only once.
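To make these guarantees concrete, here is a minimal sketch showing how the point at which a consumer acknowledges a message, relative to processing it, determines which guarantee the consumer gets. The `Broker` class and its `receive`/`ack` methods are toy stand-ins invented for illustration, not any real client library.

```python
class Broker:
    """Toy in-memory broker that keeps a message pending (and so eligible
    for redelivery) until it is acknowledged."""
    def __init__(self, messages):
        self.pending = list(messages)

    def receive(self):
        return self.pending[0] if self.pending else None

    def ack(self, message):
        # Once acknowledged, the broker will never redeliver the message.
        self.pending.remove(message)

def consume_at_most_once(broker, process):
    msg = broker.receive()
    if msg is not None:
        broker.ack(msg)   # ack BEFORE processing: a crash here loses the message
        process(msg)

def consume_at_least_once(broker, process):
    msg = broker.receive()
    if msg is not None:
        process(msg)      # process BEFORE ack: a crash here causes redelivery
        broker.ack(msg)
```

A crash between the two steps loses the message in the at-most-once case and causes a duplicate in the at-least-once case; exactly once requires extra bookkeeping on top of this.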
If you had to pick one, which would it be? If you said exactly once, you are not alone; in
fact, most people want a system where messages are never lost and each message is
delivered to a consumer once and only once. Who would not want that? If only it were that
simple; this of course comes with caveats and risks. Figure 3.13 below shows the possible
points of failure that we need to talk about.
Wow, it seems like we have identified almost every spot in the diagram as a possible point
of failure. Don't worry, it is not all doom and gloom; let's walk through them and understand
what the risks are and what each numbered item means.
1. Producer - If the producer fails after it has generated a message but before it has had a
chance to actually send it over the network to the broker, then we will lose a message.
There is also the chance that the producer may fail while waiting to hear back from the
broker that it received the message, and in turn the producer, after it recovers, may
send the same message a second time.
2. The network between the producer and broker - If the network between the producer
and the broker fails, the producer may send the message but the broker never receives
it, or the broker does receive it but the producer never gets the response acknowledging
it. In both of these cases the producer may send the same message a second time.
3. Broker - If the broker fails with messages that are still held in memory and not
committed to a persistent store, then we may lose messages. If the broker were to fail
before sending an acknowledgement to the producer, the producer may send the
message a second time. Likewise, if the broker tracks the messages consumers have
read and it fails before committing that information, a consumer may read the same
message more than once.
4. Message queue - If the message queue is an abstraction over a persistent store and it
fails while trying to write data to disk, we may end up losing messages.
5. The network between the consumer and broker - If the network between the consumer
and the broker fails, the broker may send a message and record that it was sent, but
the consumer may never get it. From the consumer side, if the broker waits for the
consumer to acknowledge that it received a message but the acknowledgement never
reaches the broker, the broker may send the consumer the same message a second time.
6. Consumer - If the consumer fails before it can record that it processed a message,
either by sending an acknowledgement to the broker or by writing to a persistent store,
it may request the same message from the broker. Another twist here is the case where
there are multiple consumers and more than one of them reads the same message.
I know that is a lot to consider, and it may seem a little overwhelming; don't worry, this will
not be the last time we see these types of semantics discussed. In the context of a message
queuing system we need to keep these failure scenarios in our back pocket, so that when a
messaging system claims to provide exactly once delivery semantics we can understand
whether it truly does. As is the case with so many things, the choice of which technology to
use will involve various tradeoffs, such as those in table 3.1 below.
Table 3.1 The tradeoffs we are often faced with when considering a message queuing system
Less complexity, faster performance, and weaker guarantees vs. More complexity, a
performance hit, and a strong guarantee
The choice of where to compromise is going to be based on the business problem you are
trying to solve with the streaming system. For example, if you are building a streaming web
analytics product, missing a message here or there is not going to have much if any impact on
your product. If, on the other hand, you are building a streaming fraud detection system, then
missing a message can have a very undesirable effect.
As you look at different messaging systems, you may find that the one you want to use
does not provide exactly once guarantees. Don't despair; often you can solve this using two
techniques. Let's take a look at figure 3.14 below to see them graphically, and then we will
discuss them.
Figure 3.14 The two ways to have exactly once semantics if the messaging system does not provide it
If your business problem requires exactly once semantics but your chosen messaging
system does not provide them, then you will need to use two techniques to bridge the gap.
Figure 3.14 above identifies the producer and the consumer techniques. Let's now talk about
those in more detail.
1. Do not retry sending messages - This is the first technique we must use. To do this you
will need to have in place a way to track the messages your producer(s) send to a
broker(s). If and when there is no response, or a network connection is interrupted
between your producer(s) and the broker(s), you can read data from the broker to
verify that the message you did not receive an acknowledgment for was in fact received.
By having this type of message tracking in place, you can be sure your producer only
sends messages exactly once.
2. Store metadata for the last message - This is the second technique we must use, and it
involves storing some data about the last message we read. The metadata you store is
going to vary by messaging system. In the end, what you need is data about the
message so that you can be sure your consumer does not process a message a second
time. Figure 3.14 shows the metadata being stored in a persistent store. Something you
will need to take into consideration is: what do you do if there is a failure storing the
metadata?
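A minimal sketch of these two techniques, with every class and method name (`TrackingProducer`, `DedupingConsumer`, the broker stub) invented for illustration; no real product exposes exactly this API:

```python
class SimpleBroker:
    """Toy broker stand-in that lets a producer read back what was stored."""
    def __init__(self):
        self.messages = {}

    def publish(self, msg_id, payload):
        self.messages[msg_id] = payload

    def contains(self, msg_id):
        return msg_id in self.messages

class TrackingProducer:
    """Technique 1: never blindly retry; verify with the broker instead."""
    def __init__(self, broker):
        self.broker = broker
        self.sent_ids = set()          # IDs we have already attempted to send

    def send(self, msg_id, payload):
        if msg_id in self.sent_ids and self.broker.contains(msg_id):
            # We saw no ack earlier, but the broker has the message:
            # resending would create a duplicate, so do nothing.
            return
        self.sent_ids.add(msg_id)
        self.broker.publish(msg_id, payload)

class DedupingConsumer:
    """Technique 2: persist metadata about messages already processed."""
    def __init__(self, metadata_store):
        self.metadata_store = metadata_store  # maps msg_id -> processed flag

    def handle(self, msg_id, payload, process):
        if self.metadata_store.get(msg_id):
            return                      # already processed; skip the duplicate
        process(payload)
        self.metadata_store[msg_id] = True
```

Note the open question from the text remains: if storing the metadata fails after `process` runs, the consumer may still reprocess the message on recovery.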
If you do implement these two techniques, you will be able to guarantee exactly once
messaging. You may not have noticed during this discussion, but by doing this you also get
two nice little bonuses (sorry, not that type of bonus; I was thinking more about the data
quality and robustness of your system). Take a look again at figure 3.14 and our discussion of
it above; what do you think the bonuses are? There may be more, but the ones I was thinking
of are message auditing and duplicate detection. From a message auditing standpoint, since
you are already going to keep track of the messages your producer sends via metadata, on
the consumer side you can use this same metadata to keep track of not just messages
arriving, but also perhaps the max, min, and average time it takes to process a message.
Perhaps you can identify a slow producer or slow consumer. Regarding duplicate detection, we
already decided that our producer was going to do the right thing to make sure a message was
only sent to a broker one time, and on the consumer side we said it was going to check to see
if a message had already been processed. One extra thing to keep in mind: in your consumer,
don't just keep track of metadata related to the messaging system (some will expose a
message ID of some sort so you know whether you have processed a message with the same
ID), but also be sure to keep track of metadata that you can use to distinctly identify the
payload of a message. Now you know not just how to ensure exactly once semantics; you are
also on your way to providing message auditing and detecting message duplication. As you go
through this book you will run into these concepts again, and you may see other ways to apply
message auditing through the entire streaming architecture.
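One way to distinctly identify a payload, as a hedged sketch: hash a canonical serialization of the message body and keep simple processing-time statistics alongside it. The class, field names, and the idea of passing the elapsed time in from the caller are all assumptions made for illustration.

```python
import hashlib
import json

def payload_fingerprint(payload: dict) -> str:
    """Hash a canonical serialization of the payload so the same payload is
    detected even if it arrives under a different message ID or with its
    fields in a different order."""
    canonical = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

class AuditingConsumer:
    def __init__(self):
        self.seen_fingerprints = set()
        self.processing_times = []     # feeds min / max / average auditing

    def handle(self, payload, process, elapsed_seconds):
        """elapsed_seconds is assumed to be measured by the caller."""
        fp = payload_fingerprint(payload)
        if fp in self.seen_fingerprints:
            return False               # duplicate payload: skip it
        process(payload)
        self.seen_fingerprints.add(fp)
        self.processing_times.append(elapsed_seconds)
        return True

    def audit_stats(self):
        times = self.processing_times
        return {"min": min(times), "max": max(times),
                "avg": sum(times) / len(times)}
```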
Question: What would the impact be to Paul's business if the communication between the
collection tier and analysis tier were interrupted for an extended period of time?
Discussion: This would have a catastrophic impact on Paul's business. His business would not
be able to offer its service, and that may have a detrimental impact on his customers'
businesses.
Question: How many days' worth of data can Paul's business tolerate losing?
Discussion: Zero; in fact, I would argue that given the nature of both Paul's business and the
type of data he is dealing with, losing data is not an option.
Question: Would you anticipate that Paul would need to store historical data?
Discussion: I would expect that at least one customer or an executive in Paul's business has
asked to see a report detailing how their service has performed over time.
Question: What type of message delivery semantics does Paul's streaming system need?
Discussion: I would expect that his business needs exactly once semantics. Without that,
there is the chance that he may miss a message and thus miss a fraudulent transaction.
Could he get by with at least once? Perhaps; it may make the consumer more complex, but
it is possible.
That was fun. How about we take two more totally different businesses and see if you can
answer the questions for them?
Questions Discussion
adding to their cart and buying right now along with the jeans. With our mission in mind, go
ahead and answer the questions so we can design the correct system for Rex.
Questions Discussion
3.4 Summary
In this chapter we have explored how we decouple the data being collected from the data
being analyzed by using a message queuing tier in the middle.
During this exploration we:
At this point we have developed a good understanding of how to decouple the data being
produced by our collection tier from the analysis tier. As we move through the other chapters
of this book, we will see some of these terms and concepts popping up again, so do not worry
if this is the least bit overwhelming; it will not be the last time we talk about message delivery
semantics. Now let's get ready to have some fun with the data we have collected. The next
chapter will take us through the analysis tier, which, if you remember from above, is our
message consumer. Are you ready? Let's go.
4
Analyzing Streaming Data
Figure 4.1 The streaming data architecture with the Analysis tier in focus.
One thing you may notice in figure 4.1 above is that unlike the previous chapter, where we
discussed both the input and output of the data, in this chapter we are only going to concern
ourselves with the input. The reason for this is simple: our goal is to understand the core
underpinnings of this tier, and in the next chapter we will discuss the ways we can work with
the data in this tier. Therefore, we will hold off on talking about where the data goes from
this tier until the next chapter. After this chapter you will have an understanding of the core
concepts found in all the modern tools used for this tier and be ready to learn how to perform
various operations on the data. All right, grab a quick coffee refill and let's get going.
it comes to data, in-flight refers to the idea of it always being in motion and never at rest. If
you have not heard the term data at rest, don't worry; it is just a fancy way of saying that
the data is stored on disk or another storage medium. Let's take a look at figure 4.2 below,
which shows how this plays out in our streaming architecture.
Figure 4.2 Data being pushed from the collection tier and pulled from the analysis tier
Looking at figure 4.2 above, it should be clear that our goal in this tier is to pull the data
from the message queuing tier as fast as possible; in essence, we need to be sure we can keep
up with the rate at which the collection tier is pushing data into the message queuing tier. So
how is this different from a non-streaming system, say one built with an RDBMS or using
Hadoop Map Reduce jobs? In those non-streaming systems the data is at rest and you query it
for answers, while in a streaming system we turn that on its head and the data is moved
through the query. Figure 4.3 shows what I mean by this.
Figure 4.3 Turning things on their head, non-streaming vs. streaming
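The contrast in figure 4.3 can be sketched in a few lines of code; the log-record shape and function names below are invented purely for illustration:

```python
def batch_count_errors(stored_records):
    """Data at rest: the data sits in storage and the query runs over it."""
    return sum(1 for rec in stored_records if rec["level"] == "ERROR")

def streaming_error_counter():
    """Data in motion: a standing query that each arriving record flows
    through, updating the answer immediately. A Python generator stands in
    for a stream task processor here."""
    count = 0
    while True:
        rec = yield count          # wait for the next in-flight record
        if rec is not None and rec["level"] == "ERROR":
            count += 1
```

In the batch version you must re-run the query to get a fresh answer; in the streaming version the answer is continuously maintained as the data moves through the query.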
Flipping this model on its head so that data is being pulled through our system in a never-
ending stream has implications for both the design and the way we query these systems. If
you sit back and think of all the data zipping around you all day long, from the myriad of
connected devices and appliances to online activity, the questions you could ask and the
problems you could solve if it all passed through a streaming analysis tier are amazing.
Understanding how to build these systems to harness all of this, and the data streams of the
future, is becoming an essential skill. All right, let's not get ahead of ourselves just yet; we
have our work cut out for us learning about the core features of an analysis tier. Let's begin
our journey by discussing the general architecture of a stream-processing system and then
move on to the key features and see how each of them plays a role in our decision to use a
particular framework.
Figure 4.4 Generic streaming analysis architecture you will find with many products on the market.
When you look at figure 4.4 above, you will notice that some systems may require you to
have an Application Driver; in those systems it is also common that the driver controls the
lifecycle of the whole stream-processing application. In all systems you will find a component
that plays the role of the streaming task manager. This component has several jobs: first, it
manages the lifecycle of the streaming task processors, and secondly, it sets up the data flow
of the incoming stream and how each of the streaming task processors interact. Figure 4.4
just shows the state of the system after each of the streaming task processors was created.
Figure 4.5 below shows what the system looks like after everything is connected and the
stream is running.
Figure 4.5 A view of the streaming analysis after it is up and running.
There are several things to point out in figure 4.5. First, the streaming task manager is still
present; it's responsible for watching and monitoring the various stream task processors.
Secondly, there are multiple input streams, and several of the stream task processors send
data to other stream task processors. Don't be alarmed by this; it is the natural programming
model of stream-processing, and it is very similar to the Map Reduce programming model you
may already be familiar with. If you are not familiar with the Map Reduce model either, don't
worry; by the time you finish the next chapter you will feel very comfortable with this
programming model. Looking again at figure 4.5, the following three questions come to mind:
Q: How did the streaming task manager know how to set up the stream task
processors?
A: The short answer is you told it. OK, let's elaborate on that a little. When you look
at the various technologies available for this layer, one thing you will find is that they all
provide a way to express a data flow graph. It is this data flow graph that the
streaming task manager uses to determine how to organize and orchestrate the stream
task processors.
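As an illustrative sketch only (every real framework has its own API for this), a data flow graph boils down to named processors plus the edges connecting them; the `DataFlowGraph` class and its naive single-message `run` method are invented for this example:

```python
class DataFlowGraph:
    """Toy data flow graph of the kind you describe to a streaming task
    manager. All names here are invented for illustration."""
    def __init__(self):
        self.processors = {}           # name -> processing function
        self.edges = []                # (upstream, downstream) pairs

    def add_processor(self, name, fn):
        self.processors[name] = fn
        return self

    def connect(self, upstream, downstream):
        self.edges.append((upstream, downstream))
        return self

    def run(self, source_name, message):
        """Naively push one message through the graph, breadth-first,
        feeding each downstream processor its upstream's output."""
        results = {source_name: self.processors[source_name](message)}
        frontier = [source_name]
        while frontier:
            node = frontier.pop(0)
            for up, down in self.edges:
                if up == node:
                    results[down] = self.processors[down](results[up])
                    frontier.append(down)
        return results
```

For example, a two-step graph that splits a line of text and counts the words:

```python
graph = (DataFlowGraph()
         .add_processor("parse", str.split)
         .add_processor("count", len)
         .connect("parse", "count"))
```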
A: Depending on the framework you use, some of your code will execute in the
streaming task manager; in all frameworks, the business logic that is executing in the
stream task processors is your code. As an example, imagine you wanted to determine
the number of times each visitor to your website visited a page. A plausible solution, if
you were to write it down on a napkin, would be to group all the page views by visitor
and then count each group. Applying this solution to figure 4.5, we end up with figure
4.6 below.
If you are familiar with the concepts of Map Reduce, you will notice that figure 4.6 above
looks just like how the data would be processed with Map Reduce. That is a powerful
attribute of a lot of stream-processing tools; the higher-level data flow is familiar.
A: Looking at figure 4.5, you will notice that there is an input stream of data and then
an output from our stream task processors. Taking this simple example, let's update that
figure to show the tiers; the result is figure 4.7 below.
Figure 4.7 The data flow of an analysis tier with the previous and next tier
With this understanding of the general architecture and the data flow, we can begin to dig
into the key features of the various stream-processing frameworks that are a key piece of
assembling the analysis tier.
At-most-once - A message may get lost, but it will never be processed a second time.
At-least-once - A message will never be lost; however, it may be processed more than once.
Exactly-once - A message is never lost and will only be processed once.
Those definitions are a slightly more generic version of what we saw before with the
message queuing tier. So how do these manifest themselves in the stream-processing tools
you may use in this tier? Let's overlay them on our data flow diagram and then walk through
them to understand what they mean.
Figure 4.8 below shows at-most-once semantics with the two failure scenarios: a message
dropping and a streaming task processor failing. The second scenario, a streaming task
processor failing, will also result in message loss until a replacement processor comes online.
Figure 4.8 At-most-once message delivery shown with the streaming data flow
At-most-once is the simplest delivery guarantee a system can offer; there is no special logic
required anywhere. In essence, if a message gets dropped, a stream task processor crashes, or
the machine that a stream task processor is running on fails, the message is lost.
At-least-once increases the complexity, as the streaming system must keep track of every
message that was sent to the stream task processor and the result. If it determines that the
message was not processed (perhaps it was lost, or the stream task processor did not respond
within a given time boundary), then it will be resent. Your code (the streaming task processor)
must keep track of all the messages processed, because if the stream-processing framework
does not believe a message was handled it will send it again; thus your code will need to
handle duplicate messages, which may also arrive out of order.
Exactly-once semantics ratchets up the complexity a little more for the stream-processing
framework. Besides the bookkeeping it must keep for all messages that have been sent, it
now must also detect and ignore duplicates. Arguably, the complexity of your code actually
goes down, since it now just has to make sure it responds with a success or failure after it has
processed a message; it does not need to deal with duplicate detection.
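A hedged sketch of that division of labor, with all names invented for illustration: the framework redelivers unconfirmed messages (at-least-once) and filters duplicates before they reach user code (exactly-once), while the user code only reports success or failure:

```python
class StreamTaskRunner:
    """Toy stand-in for framework-side bookkeeping: it filters duplicate
    deliveries so user code never has to do its own duplicate detection."""
    def __init__(self, process_fn):
        self.process_fn = process_fn   # user code: returns True on success
        self.processed_ids = set()     # framework-side duplicate detection

    def deliver(self, msg_id, payload):
        if msg_id in self.processed_ids:
            return "duplicate-ignored"
        ok = self.process_fn(payload)  # user code only reports success/failure
        if ok:
            self.processed_ids.add(msg_id)
            return "processed"
        return "retry"                  # the framework would redeliver later
```

Redelivering the same `msg_id` (as an at-least-once transport would after a timeout) is absorbed by the framework layer, so the user's `process_fn` runs once per message.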
STATE MANAGEMENT
Once your streaming analysis algorithm becomes more complicated than just using the
current message, without dependencies on any previous messages and/or external data, you
will need to maintain state, and you will likely need the state management services provided
by your framework of choice. Let's take a simple example that we can work with to help
understand where, and perhaps how, state needs to be managed.
Pretend you are the marketing manager for a large e-commerce site and you want to
know the number of page views per hour for each visitor.
I know you're thinking: an hour? That can be done in a batch process. We are not going to
worry about that right now; instead, let's focus on the implied state we must keep to satisfy
this business question. Figure 4.9 below shows how our streaming task processors would be
organized to answer this question.
Figure 4.9 Simple counting page views per user over an hour
It becomes obvious when looking at figure 4.9 where we need to keep state: right there in
the stream processor tasks that perform the counting by ID. If our streaming analysis tool of
choice does not provide state management capabilities, then one viable option is for us to
keep the data in memory and flush it every hour. This would work as long as we are OK with
the potential of losing all the data if our streaming processor task failed at any time. Of course,
as luck would have it, our tasks would be running smoothly and then one day start to fail at
59 minutes into the hour. Depending on your business case, the risk of possibly losing data
by keeping all state in memory may be acceptable. However, in many business cases life is
not so simple, and we do need to worry about managing state. To help in these scenarios,
many stream-processing frameworks provide state management features we can leverage.
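The naive in-memory approach to our page-views-per-visitor count, extended with a simple checkpoint to a stand-in remote store, might look like this sketch (the store, class, and method names are assumptions made for illustration):

```python
from collections import Counter

class PageViewCounter:
    """Toy stream task: counts page views per visitor in memory and
    periodically checkpoints so a replacement task can recover the count."""
    def __init__(self, checkpoint_store):
        self.checkpoint_store = checkpoint_store   # stand-in remote datastore
        # Recover state from the last checkpoint, if any.
        self.counts = Counter(checkpoint_store.get("pageviews", {}))

    def on_message(self, visitor_id):
        self.counts[visitor_id] += 1

    def checkpoint(self):
        # Persist a snapshot; anything counted after this is lost on failure.
        self.checkpoint_store["pageviews"] = dict(self.counts)
```

Counts recorded after the last checkpoint are still lost on failure, which is exactly the "basic functionality" end of the complexity continuum discussed here.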
The state management facilities provided by the various systems naturally fall along a
complexity continuum, shown below in figure 4.10.
Figure 4.10 State management complexity continuum for stream processing tools
On the far left side of the continuum is our naive in-memory-only choice from above, closely
followed by an in-memory-with-checkpointing feature. The systems that offer in-memory
with checkpointing almost always use a remote datastore for checkpointing, and thus provide
just the basic functionality of ensuring a computation is not lost in the face of failure.
On the other end of the spectrum are the systems that provide a queryable persistent state
that is replicated. If you find yourself saying these seem like two totally different slants on
state management, you are not alone. The solutions on the low-complexity side only solve the
problem of maintaining the state of a computation in the face of failures. Granted, some of
those systems will claim that you can query the state and manipulate it; however, in reality
the checkpoint is often written to a remote data store, and the cost of querying may have a
dramatic impact on your ability to keep up with the speed of the stream. However, for the
simple operations of keeping a running count current and not losing track of the current value
in the face of failure, these systems are a great fit. On the other end of the spectrum, the
frameworks that offer state management by way of a replicated queryable persistent store
really help you answer much different and more complicated questions. With these
frameworks you can actually join different streams of data together. For example, imagine
you were running an ad serving business and you wanted to track two things: the ad
impression and the ad click. It is reasonable that the collection of this data would result in
two streams of
data: one for ad impressions and one for ad clicks. Figure 4.11 below shows how these
streams and our stream task processors would be set up to handle this.
Figure 4.11 Handling ad impression and ad click streams that use stream state.
In this example the ad impressions and ad clicks arrive in two separate streams. Since the
ad clicks will lag the ad impressions, we will join the two streams and then count by the ad id.
Because of the lag in the ad click stream, using a stream-processing framework that persists
the state in a replicated queryable datastore enables us to join the two streams and produce
a single result. I think you will agree that being able to join streams by leveraging the state
management facilities of a stream-processing framework is quite a bit different than just
making sure the current value of an aggregation is persisted. If you give some more thought
to this example, I am sure you will come up with other ideas for joining more than one
stream of data. It is a fascinating topic and something we will look at in more depth in the
next chapter. For now, let's continue on to the next feature we need to understand when
choosing a stream-processing framework.
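As a sketch of the join described above, assume the framework exposes its replicated state store as a key-value map; a plain dict stands in here, and the function names are illustrative, not any real API.

```python
# Joining two streams on ad id using a persistent state store.
# The dict stands in for a framework's replicated, queryable store.

state = {}   # ad_id -> {"impressions": n, "clicks": n}

def on_impression(ad_id):
    # Handler for events from the ad impression stream.
    entry = state.setdefault(ad_id, {"impressions": 0, "clicks": 0})
    entry["impressions"] += 1

def on_click(ad_id):
    # Clicks lag impressions; because the state is persisted, the
    # earlier impression is still there to join against.
    entry = state.setdefault(ad_id, {"impressions": 0, "clicks": 0})
    entry["clicks"] += 1

# Two separate streams arriving interleaved:
for ad_id in ["ad-1", "ad-2", "ad-1"]:
    on_impression(ad_id)
on_click("ad-1")

print(state["ad-1"])   # {'impressions': 2, 'clicks': 1}
```

The key point is that the state store, not the event itself, carries the earlier impression forward so the lagging click has something to join with; if the store is replicated, that joined result also survives a task failure.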
WINDOWS OF TIME
At some point you will want to move beyond the simple counting of events and perform
computations that involve looking at events over a period of time, just like our previous
example of counting page views over a time period. To accomplish this you'll want to take
advantage of the windowing features a stream-processing framework may offer. As you
survey the stream-processing landscape you will find that the windowing features provided fall
into one of three general categories:
No built-in support
The ability to execute a task at a given interval
The ability to perform a computation over a sliding window of time at a given interval
We will go into detailed examples of how we can answer certain questions for each of these
scenarios in chapter 5; for now, let's make sure we understand what we mean by each of
these categories and what is involved in solving the following example.
Every 10 seconds you want to output the number of times your business was
mentioned online during the last 30 seconds.
No built-in support - means that the framework does not provide a mechanism that
allows you to process data in anything other than an event-by-event fashion with respect
to a window of time. With this model, performing a task at a given interval often
means you are left with two options: adding data to an input stream at a given interval
that your stream task processor reacts to, or keeping track of time in a stream task
processor and performing a computation or producing output at a given interval. To
provide a solution for the above example, we would need to keep track of the last
30 seconds' worth of data, or at least the count, and then on a 10-second interval
emit the 30-second count and reset our counter. This is doable, but it would require
us to do all of the work. On the surface that may not sound bad, but as the saying
goes, "the devil is in the details." For example, what happens if the stream task
processor that has the 30 seconds of data fails? What happens if the event that is
supposed to fire every 10 seconds actually executes every 15 seconds? I am sure you
can think of other failure scenarios as well. Still, if your business problem can
mostly be solved with this event-by-event model, and you are comfortable with this
windowing-lite approach and the inherent risk of data loss, then a stream-processing
framework with this level of windowing may be just fine.
Execute a task at a given interval - at this level you will find products that handle
the interval work and ensure that your stream task processor is called at a specified
interval. You may still have to handle holding onto the last 30 seconds of data, but
at least the interval work is something you do not need to handle. The risk of data loss
does not go away, since you still maintain the 30 seconds of data yourself. Taking that
into consideration, this level is an improvement, but not a great one: it reduces the
complexity of what you need to write, but it still leaves you at risk of losing data.
However, as with the previous level, the risk may be well within the limits of your
business.
Sliding window computations - this is the richest of the feature sets, and it often
provides many different ways for you to express that you want to perform a
computation over a sliding window of data at a particular interval. At this level,
keeping track of the 30 seconds of data and executing at a certain interval is
completely taken care of by the framework. You are left worrying about your algorithm
and nothing more.
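To see what the do-it-yourself end of this spectrum involves, here is a minimal sketch of the example above using simulated timestamps; in a real task a timer or the framework would drive the 10-second emits, and all names here are illustrative.

```python
from collections import deque

WINDOW = 30  # seconds of data each report covers

events = deque()  # timestamps of mentions still inside the window

def on_mention(ts):
    # Called for every mention event in the stream.
    events.append(ts)

def emit(now):
    # Drop anything older than the window, then report what is left.
    while events and events[0] <= now - WINDOW:
        events.popleft()
    return len(events)

# Mentions arrive at t=1, 5, 12, 28; reports are emitted at t=30 and t=40.
out = []
for ts in [1, 5, 12, 28]:
    on_mention(ts)
out.append(emit(30))   # counts t=1, 5, 12, 28
on_mention(35)
out.append(emit(40))   # t=1 and t=5 have aged out; counts t=12, 28, 35
print(out)   # [4, 3]
```

Notice how much of the failure surface discussed above lives in this little sketch: the deque is unmanaged in-memory state, and nothing guarantees `emit` fires exactly every 10 seconds, which is precisely the work the richer windowing levels take off your hands.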
I think you would agree that there is quite a range in the features that are provided by the
various stream-processing frameworks. In the end you should be able to find one that helps
you solve your business problems.
FAULT TOLERANCE
It is nice to think of a world where things do not fail; in reality, however, it is not a matter of
if things will fail but only a matter of when. A stream-processing framework's ability to keep
going in the face of failures is a direct result of its fault-tolerance capabilities. When we
consider all of the pieces involved in stream processing, there are quite a few places where
things can fail. Let's look again at the pieces involved and use figure 4.12 below to identify
the failure points.
Figure 4.12 The points of failure with stream processing in the context of the streaming architecture
1. Incoming Stream of Data - In all fairness, the message queuing tier will not be under
the control of the stream-processing framework; however, there is the potential for the
message queuing system to fail, in which case the stream-processing framework must
respond gracefully and not also fail.
3. Stream Task Processor - This is where our code runs, and it should be under the
supervision of the stream-processing framework. If something goes wrong here (perhaps
our software fails, or the machine it is running on fails), then the stream task
manager should take steps to restart the processor or move the processing to a
different machine.
5. Stream Task Processor - This is the same as #3 above, and it should also be under
the direct supervision of the stream task manager.
6. Connection to output destination - The stream task manager may not be able to
control the network path to the output; however, it should be able to control the flow of
data from the last stream task processor so that it does not become overwhelmed by
network backpressure.
7. Output destination - This would not be under the direct supervision of the stream
task manager; however, its failing can impact the processing of the stream and
therefore needs to be taken into consideration.
8. Stream Task Manager - If this fails, then we end up in a situation often referred to
as running headless. This refers to the notion that it is the responsibility of this
component to supervise and control the flow of data and the stream task processors.
Thus if this component fails, there is no supervisor for the data flow or the stream
task processors: no new processors can be started and failed ones cannot be recovered.
Many stream-processing frameworks use a variety of checkpointing strategies and take
advantage of various characteristics of the input stream to provide a robust environment
that can continue to process the stream without interruption in the face of failure. As you
investigate which stream-processing framework to use to solve your business problem,
understanding its failure semantics is critically important to your success.
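As a rough sketch of what such supervision and checkpoint recovery look like, the following toy supervisor restarts a failed task from its last checkpointed offset; the class names, the dict standing in for a remote checkpoint store, and the simulated failure are all entirely illustrative.

```python
class Task:
    """A stream task processor that checkpoints its input offset."""

    def __init__(self, checkpoint_store):
        self.store = checkpoint_store
        self.processed = checkpoint_store.get("offset", 0)

    def process(self, stream):
        # Resume from the last checkpointed offset, not from zero.
        for offset in range(self.processed, len(stream)):
            if stream[offset] == "poison":       # simulated failure
                raise RuntimeError("task crashed")
            self.processed = offset + 1
            self.store["offset"] = self.processed   # checkpoint write

def supervise(stream, store, max_restarts=3):
    # The stream task manager's job: restart the processor on failure.
    for _ in range(max_restarts):
        task = Task(store)
        try:
            task.process(stream)
            return task.processed
        except RuntimeError:
            stream[task.processed] = "ok"   # pretend the fault cleared
    raise RuntimeError("gave up after repeated failures")

store = {}
done = supervise(["a", "b", "poison", "c"], store)
print(done)   # 4 -- all events processed despite the mid-stream failure
```

The restarted task picks up at the checkpointed offset rather than reprocessing the whole stream, which is the essence of continuing without interruption; real frameworks layer delivery-semantics guarantees on top of this basic restart-from-checkpoint loop.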
4.4 Summary
In this chapter we took a dive into the common architecture of stream-processing frameworks
you will find when surveying the landscape, and we went over the core features you need to
consider when choosing one.
We have learned:
The common architecture shared by stream-processing frameworks
How frameworks differ in their state management facilities, from in-memory state with checkpointing to replicated queryable persistent stores
The three general levels of windowing support you will encounter
Why fault tolerance and failure semantics are critical when choosing a framework
I understand that some of this may seem fuzzy or fairly abstract; don't worry about that at
all. In the next chapter we will focus on how you perform analysis and/or query the data
flowing through the stream-processing framework. Some may say that is where the fun
really begins, but to effectively ask questions of the data we need the understanding you
developed in this chapter. Are you ready to start asking questions of the data? Great, let's
turn the page and get started.