Manning Publications Co. We welcome reader comments about anything in the manuscript - other than typos and
other simple mistakes. These will be cleaned up during production of the book by copyeditors and proofreaders.
http://www.manning-sandbox.com/forum.jspa?forumID=926
Licensed to Subu Balakrishnan <Subu2000@gmail.com>
Welcome
Thank you for purchasing the MEAP for Streaming Data. I'm happy to see the book has
reached this stage, and look forward to its further development and eventual release.
This book introduces the concepts and requirements of streaming and real-time data
systems. Through this book you will develop a foundation for understanding the challenges and
solutions of building in-the-moment data systems before committing to specific
technologies. You do not need any experience with streaming or real-time data
systems to fully take advantage of the material in this book. It is perfect for developers or
architects, but also written to be accessible to technical managers and business decision
makers. A lot of care has been put into making the content of the book as approachable and
up-to-date as possible.
This initial release contains the first three chapters of the book. In the first chapter we will
start by discussing what a real-time system is and some of the differences between them and
streaming data systems. We will develop an understanding for why streaming data is
important and also develop our architectural blueprint that will serve as our navigational aid
throughout the book. In the second chapter we will discuss the collection tier, the first tier of
our architecture. In this chapter we will learn about the different data collection patterns,
survey the technology landscape, and develop an understanding of how to choose the right
technology for your business problem. In the third chapter, we will continue to follow our
guide and discuss how we transport data from the collection tier using the message queuing
tier. In this chapter we will spend time learning about message durability, delivery semantics,
and how to choose the right technology. The upcoming chapter 4 will talk about the analysis
tier. Beyond that, the rest of the book will continue to follow our architectural blueprint and
discuss each of the tiers in depth. That will conclude the first part of the book that discusses
this new holistic approach to building streaming data systems.
In the second part of the book we will focus our time on making our tiers come to life by
applying what we have learned and building out a complete streaming data system.
As you're reading, I hope you'll take advantage of the Author Online forum. I'll be
reading your comments and responding, and your feedback is helpful in the development
process.
Andrew Psaltis
brief contents
PART 1: A NEW HOLISTIC APPROACH
3 Transporting the data from collection tier: decoupling the data pipeline
1
Introducing Streaming Data
To set the stage, this chapter will introduce the concepts of streaming data systems,
present the architectural blueprint, and set us up to explore each of the tiers in depth
as we progress.
You can identify hard real-time systems fairly easily: they are almost always found in
embedded systems and have very strict time requirements that, if missed, may result in total
system failure. The design and implementation of these systems is well studied in the
literature and outside the scope of this book. We will leave them behind and turn our focus to
the real-time systems often categorized as soft or near real-time. Determining whether a system is
soft or near real-time is an interesting exercise, as the overlap in their definitions often results
in confusion. Take a moment and think about the following examples:
1. Someone you are following on Twitter posts a tweet and moments later you see the
tweet in your Twitter client.
2. You are tracking flights around New York using the real-time Live Flight Tracking service
from FlightAware (http://flightaware.com/live/airport/KJFK)
Although these systems are all quite different, figure 1.1 below shows a simplified view of
their commonality.
In each of the above examples it is reasonable to conclude that the time delay may only be
seconds, no life is at risk, and an occasional delay of minutes will not cause total system
failure. What do you think? If someone posts a tweet and you see it almost immediately, is
that soft or near real-time? What about watching live flight status or real-time stock quotes?
Some of these can go either way; what if there is a delay in your seeing the data due to
slow Wi-Fi at the coffee shop or on the plane? As you think about these examples, I think you
will agree that the line between soft and near real-time becomes blurry,
at times disappears, is very subjective, and may often depend on the consumer of
the data.
Now let's change our examples just a little bit by taking the consumer of the data out of
the picture. Restating our examples to focus just on the services at hand, we end up with
these:
3. The NASDAQ Real Time Quotes application (http://www.nasdaq.com/quotes/real-time.aspx) is tracking stock quotes
Think about these for a moment. Granted, we do not know how these systems work
internally, but the essence of what we are asking is common to all of them and can be stated
as:
Is the process of receiving data all the way to the point it is ready for consumption a soft or near
real-time process?
Does taking the consumers of the data out of the picture change your answer? If, with a
consumer, you classified one of the examples as near real-time, was that due to the lag or
perceived lag in your seeing the data?
After a while it becomes confusing whether to call something soft or near real-time, or just
real-time as some of the services in our examples do. Clearly there has to be a better way.
Figure 1.3 Real-time computation and consumption split apart
Looking at figure 1.3, on the left-hand side we have the non-hard real-time service, or the
computation part of the system, and on the right-hand side we have the clients, the
consumption side of the system. In many scenarios the computation part of the system
operates in a non-hard real-time fashion; the clients, however, may not be consuming the
data in real time, due to network delays, application design, or perhaps a client application
that is not even running. Put another way, what we really have is a non-hard real-time service with
clients that consume data when they need it. This is a streaming data system: a non-hard
real-time system that makes its data available at the moment a client application needs it. It is
not soft or near; it is streaming. Figure 1.4 shows the result of applying this definition to our
example architecture from figure 1.3.
Figure 1.4 A first view of a streaming data system
Using this definition we have eliminated the confusion of soft vs. near, real-time vs. not real-time,
allowing us to concentrate on designing systems that deliver the information a
client requests at the moment it is needed. Let's use our examples from before, but this time
think about them from the standpoint of streaming; see if you can split each one up and
recognize the streaming data service and the streaming client.
4. Someone you are following on Twitter posts a tweet and moments later you see the
tweet in your Twitter client.
5. You are tracking flights around New York using the real-time Live Flight Tracking service
from FlightAware (http://flightaware.com/live/airport/KJFK)
7. Twitter: A streaming system that processes tweets and allows clients to request the
latest tweets at the moment they are needed; some may be seconds old, while others
may be hours old.
8. FlightAware: A streaming system that processes the most recent flight status data and
allows a client to request the latest data for particular airports or flights.
9. NASDAQ Real Time Quotes: A streaming system that processes the price quotes of all
stocks and allows clients to request the latest quote for particular stocks.
Did you notice that doing this exercise allowed us to stop worrying about soft or near real-time?
We actually got to think about and focus on what and how a service makes its data available to
clients at the moment they need it. Granted, we do not know how these systems work behind
the scenes, which is just fine. Together we are going to embark on a journey to help us
understand how to assemble these types of systems and many more as we progress through
the book.
Figure 1.5 The streaming data architectural blueprint
As we progress we will zoom in and focus on each of the tiers, while also keeping the big
picture in mind. Although our architecture calls out the different tiers, keep in mind these are
not the hard, rigid tiers you may have seen in other architectures. We will call
them tiers, but we will use them as LEGO pieces, allowing us to design the correct
solution for the problem at hand. Our tiers do not prescribe a deployment scenario; in fact, in
many cases they will be distributed across many different physical locations. Now let's take
our examples from before and walk through together how Twitter's service maps to our
architecture.
1. Twitter
2. Collection: When a user posts a tweet, it is collected by the Twitter services.
3. Message queuing: Undoubtedly Twitter runs data centers in various locations across
the globe, and conceivably the collection of tweets does not happen in the same
location as the analysis of the tweet.
4. Analysis: Although I am sure there is a lot of processing done to those 140 characters,
suffice it to say that at a minimum, for our examples, Twitter needs to identify the followers
of a tweet.
5. Long-term storage: Even though we are not going to discuss this optional tier in depth
in this book, the fact that you can see tweets going back in time implies that
they are stored in a persistent data store.
6. In-memory data store: The tweets that are a mere couple of seconds old are most
likely held in an in-memory data store.
7. Data access: All the different Twitter clients need to be connected to Twitter to access
the service.
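The tier-by-tier walk of a tweet above can be sketched as a toy pipeline. This is an illustrative Python sketch, not Twitter's actual design; the tier names follow the blueprint, but every function, data structure, and the stubbed follower lookup are assumptions made purely for the example:

```python
# A toy walk-through of the blueprint's tiers for a single tweet.
# Tier names follow the text; everything else is illustrative.
from collections import deque

message_queue = deque()     # message queuing tier (decouples tiers)
long_term_store = []        # long-term storage tier (durable history)
in_memory_store = {}        # in-memory data store tier (recent data)

def collect(tweet):
    """Collection tier: accept the posted tweet and hand it off."""
    message_queue.append(tweet)

def analyze():
    """Analysis tier: identify followers (stubbed), then store results."""
    tweet = message_queue.popleft()
    tweet["followers"] = ["bob", "carol"]   # a real system would look these up
    long_term_store.append(tweet)           # history, queryable later
    in_memory_store[tweet["id"]] = tweet    # seconds-old data, fast access
    return tweet

def data_access(tweet_id):
    """Data access tier: clients fetch the latest data when they need it."""
    return in_memory_store.get(tweet_id)

collect({"id": 1, "text": "hello streaming"})
analyze()
latest = data_access(1)
```

The point of the sketch is the hand-offs: each tier only talks to its neighbor, which is what lets us later swap the LEGO pieces without redesigning the whole system.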
Take some time and walk yourself through the exercise of decomposing the other two
examples and see how they fit our streaming architecture. Remember, those examples are:
10. FlightAware: A streaming system that processes the most recent flight status data and
allows a client to request the latest data for particular airports or flights.
11. NASDAQ Real Time Quotes: A streaming system that processes the price quotes of all
stocks and allows clients to request the latest quote for particular stocks.
How did you do? Don't worry if this seemed foreign or hard to break down; we will see
plenty more examples in the coming chapters. As we work through them together we will
delve deeper into each of the tiers and discover ways that these LEGO pieces can be
assembled to solve different business problems.
Figure 1.6 The architectural blueprint with security identified
We will not be spending time discussing security in depth in this book, but along the way it will
be called out so that you can see how it fits and think about what it may mean for the
problems you are solving.
1.6 Summary
With the introduction of the architectural blueprint under our belt, let's step back and see
where we have been.
Where we have been
Don't worry if some of this is slightly fuzzy at this point or if teasing apart the different
business problems and applying the blueprint seems overwhelming; we will walk through this
slowly over many different examples in the coming chapters. By the time we are done with
this book it will seem much more natural. We are now ready to dive into each of the tiers and
really understand what they are composed of and how to apply them in building an in-the-moment
system. To help us decide which tier to tackle first, let's take a look at a slightly
modified version of our architectural blueprint, found below in figure 1.7.
We are going to take on the tiers one at a time, starting from the left with the Collection Tier.
Don't let the lack of emphasis on the Message Queuing Tier in figure 1.7 above give you any
cause for concern; in certain cases where it serves a collection role we'll talk about it and clear
up any confusion. Now on to our first tier, the Collection Tier: our entry point for bringing
data into our in-the-moment system.
2
Getting data from clients: Data
ingestion
Figure 2.1 The streaming data architectural blueprint
We are going to take on the tiers one at a time, starting from the left with the Collection Tier
in this chapter and working our way through each of them. Now on to our first tier, the
Collection Tier: our entry point for bringing data into our streaming system. Figure 2.2 below
shows a slightly modified version of our blueprint, with the focus put on the collection tier.
Figure 2.2 Architectural blueprint with emphasis on Collection tier
This tier is where data comes into the system and starts its journey; from here it will progress
through the rest of the system. In the coming chapters we will follow the flow of data through
each of the tiers. Our goal in this chapter is to learn about the collection tier. When you finish
this chapter you will have learned the two different collection modes, the different
technologies at play, and how to choose the right one to solve your business problem.
foreseeable future. Even though we are swimming in data, there are only two general ways
it can be collected: active or passive. Active collection is just like browsing the
Internet: the collection tier (your browser) initiates and directs the collection of data. On the
other side of the coin is passive collection; in this mode, the collection tier waits for data to be
sent to it. In essence these two modes of collection boil down to pull vs. push: pull
is the same as active, where the collection tier pulls in data, and push is the same as
passive, where the collection tier has data pushed to it. In a streaming system the mode of
collection is always passive.
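The pull/push distinction can be sketched in code. In this hypothetical Python sketch (all names are illustrative, not from the book), an active collector drives each fetch on its own schedule, while a passive collector simply accepts whatever producers choose to send:

```python
import queue

def active_collect(source, n):
    """Active (pull): the collection tier initiates every request,
    like a browser fetching pages on its own schedule."""
    return [source() for _ in range(n)]   # the collector drives the loop

class PassiveCollector:
    """Passive (push): producers deliver data whenever they choose;
    the collection tier just accepts whatever arrives."""
    def __init__(self):
        self._inbox = queue.Queue()

    def receive(self, event):
        self._inbox.put(event)            # called BY the producer

    def drain(self):
        items = []
        while not self._inbox.empty():
            items.append(self._inbox.get())
        return items

# With pull, the collector calls the source; with push, the
# producer calls the collector.
pulled = active_collect(lambda: "page", 3)

collector = PassiveCollector()
for reading in ("72F", "73F"):            # a thermostat pushing readings
    collector.receive(reading)
pushed = collector.drain()
```

Note how the direction of the call flips: in the pull case the collection tier holds the loop, while in the push case the producer does, which is exactly why a streaming collection tier must be ready to accept data at any time.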
If our mode of collection is always passive, then why does it matter what the source of the
data is? That is a great question, and on the surface it may not seem like the source matters.
However, let's think about this statement:
1 connection with 1 billion events per day vs. 1 billion connections with 1 event per day
If you give that some thought, I am sure you will think of a variety of differences in the
systems you would design for these two scenarios. For example, consider just the basic connection
aspects: designing a system that has 1 connection with a lot of data (one really big straw,
if you will) is quite a bit different than designing a system that has to handle 1 billion
connections, each with a tiny bit of data. Today it is common and easy to find data sources that
fit the pattern of 1 connection and a lot of data. It is also not very hard to find data sources
that match the other pattern of many connections, each with very little data. In the future this
second scenario will become much more pervasive in our society. A lot of this will be driven by
what has been termed the Internet of Things. As the Internet and the Internet of Things grow
in the coming years, understanding how to design systems for each of these patterns will be
essential.
Although these sources of data and their characteristics are vastly different, from a
user browsing a website to a sensor in a street sending traffic updates, they are both still
characterized as passive data collection. We will see in the next section that this logical
grouping of the style of collection has nothing to do with the potentially big-data-like features
(the 3Vs of volume, velocity, and variety that we discussed in Chapter 1) but more to do with
the direction of the data flow and who the initiator of it is. Keep in mind that the "who" may be
a person in some situations and in others it may be a physical device such as a garage door or
a thermostat.
search for, interact with, and update content online. A Thing on the Internet is what I would classify as a
device that accesses the Internet in a similar fashion to how you or I would. For example, if your
toaster checked the weather as it was toasting your morning bagel and provided the forecast on
your bagel, I would call your toaster a Thing on the Internet. Or perhaps your refrigerator may
access your online calendar and suggest items for lunch based on your schedule; this would also be
a Thing on the Internet. In these examples our toaster and refrigerator are accessing the Internet in
just the same fashion as any application would. The toaster's access pattern is no different than a
weather application's, and the refrigerator's access pattern is similar to a lifestyle application that helps
you prepare for your day.
The Internet of Things is slightly different; in this case we are talking about internetworked things
that are active participants in business, information, and social processes, that gather and distribute
information from and about the physical world in order to draw conclusions, and that often act on those
conclusions in the physical world. Today it is not hard to find consumer examples such as garage
doors and thermostats that report changes and can also be controlled by homeowners using an
application on a mobile device. This is really the tip of the iceberg. Imagine if, in the future, when you
wanted to take a bus, the bus itself could notify you when it was time to leave for the station.
You may think that is easy: today I can download an application on my mobile device that sends me a
push notification based on my proximity to a bus stop and the static bus schedule. But what if instead
the projected bus arrival time and optimal route were computed with the aid of city-supplied
traffic information collected by street cameras, traffic lights, current street conditions possibly
reported by the bus, the driver's shift information, the bus's fuel or battery level, and other information
that may require the bus to make an unplanned stop? Imagine all of this data being collected and
analyzed to provide the bus with the optimal route to take, after which the bus sends you a push
notification of its predicted arrival time and the time you should leave the coffee shop to be able to
make it to the closest station with time to spare.
Figure 2.3 Passively collecting the SuperSearch search stream
Right now, if you are like me, you have a picture in your mind of a big system that we would
need to build to handle the fire hose of searches that our system will be ingesting.
We will dig into the details of what our collection tier may look like in the next section. Before
that, let's move on to another example that helps to introduce and orient us.
Imagine this time we are building a streaming system that is going to reside inside a
vehicle, and its mission in life is to optimally route the vehicle to its destination
based on current traffic conditions during its journey. Figure 2.4 below shows what this may
look like.
Figure 2.4 Passively collecting traffic conditions with on-board streaming system.
Not surprisingly, we can take the same architecture and embed it in a vehicle. The software
stack composing it may be slightly different; however, all of the principles of the blueprint still
apply.
NOTE It may be interesting to think about the traffic conditions service that the vehicle is making a
request to; you could imagine that it is also a streaming system, with much different scaling
factors: perhaps all vehicles from a given manufacturer would make requests to it every time
they were on the road. I will leave it to you to think through this scenario. As you continue
through this book, different aspects of a solution and perhaps alternative designs may become
apparent.
In both of these examples the data flow follows a request/response pattern. The
consumer of the data initiates a request and the service responds; in our two examples it just
so happens that the response is a stream of data and the client is our collection tier. This is
the same familiar pattern you are used to seeing on the web; the twist in this case is that it
may or may not be over HTTP, and it is our collection tier making the request, not a browser.
The second passive collection pattern we will discuss is the event pattern. The event
pattern is a style of interaction where a producer of data (a thermostat, web server, phone,
etc.) sends current or temporally aggregated state (current temperature, number of requests
served in the last minute) about itself and/or its environment to another system or systems (a
temperature monitoring app, an operations monitoring system). The data flow is always one-way,
from the producer to the consumer. This will become clearer as we walk through the
following examples.
Imagine for a moment that you own a skyscraper in New York City. Business is going
great, you have good tenants, and all of your units are occupied. But your cost for utilities just
keeps increasing at what seems like an uncontrollable pace. Walking past your building one
night, you notice that a lot of your tenants leave the lights on all night long, and you start to think
there has to be a way to monitor this usage and control the lights so that you can save
electricity and have a more environmentally friendly building. After doing some research you
discover that indeed you can turn your building into a smart building. Fast-forward 6 months
and the conversion is done: your building has been outfitted with the latest and greatest
technology and is now considered a smart building. The lights in the building not only turn on
and off as people move about different areas of their offices, they also regularly report the
following information:
Hours of operation
Wattage used
Frequency of turning on and off
Hours of use remaining
Status
We will call this information an event message. With this information in hand you'll be
able to know when you need to replace lights, you'll understand the energy efficiency of the
lights you use, and you'll be able to perform an energy usage analysis of your building and
each of your tenants. To be able to gather this data and do this analysis, we will need to make
sure our Collection tier can handle receiving event data from various devices. You may
already have a picture of the architecture in your mind; it is just like the one we saw for the
SuperSearch.com example, except we would replace the SuperSearch Stream as the producer
with motion sensors.
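The light's report can be sketched as an event message. This is an illustrative Python sketch, not a real building protocol: the field names mirror the bulleted list above, the field values are invented, and the transport is faked with a plain function call. It shows the defining trait of the event pattern, that the producer initiates and data flows one way to the collector:

```python
import json
import time

def make_light_event(light_id, status):
    """Build one event message with the fields the building's lights
    report (values here are made-up examples)."""
    return {
        "light_id": light_id,
        "timestamp": time.time(),
        "hours_of_operation": 1240.5,
        "wattage_used": 12.0,
        "on_off_frequency": 6,          # times switched on/off today
        "hours_of_use_remaining": 8759.5,
        "status": status,
    }

received = []

def collection_tier(raw):
    """Passive collection: the producer pushes; we only receive."""
    received.append(json.loads(raw))

# The light (producer) initiates; the collection tier never asks for data.
collection_tier(json.dumps(make_light_event("floor3-unit12", "ok")))
```

Serializing to JSON here stands in for whatever wire format a real device would use; the essential point is that `collection_tier` has no way to request data, it can only accept it.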
For our next example, imagine that we are going to build a social network, let's call it
TwitterOfThings, that will allow all home appliances to send out event messages containing
their status and state. Perhaps your washing machine or dryer would report how often it is
run, how long it runs for, and the health of its various components. If you think of these and
the other appliances in your home, I am sure you can come up with a long list of the data that
could be sent to the TwitterOfThings. If you are like me, you start to think: wow, with all this
data we can do some amazing things, such as determine the quality of different appliances,
allow a service representative to proactively schedule service, and determine energy usage.
For now, though, we will hold back on exploring those avenues, as we will address the Analysis
tier in Chapter 4. Getting back to the Collection tier: do you think we will need to change our
architecture? If you answered that we may need to scale it out but the architecture will remain
the same, you are correct. Once again our architecture remains the same as before; we are just
adding many more producers.
In this section we have looked at the two passive message patterns: the request/response
and the event. The key takeaway from both is that although there may be a lot of
data, and potentially a very high volume of data, from the point of view of the Collection tier it
is passive, as the data always flows from the producer, which may be a phone, a washing
machine, a street meter, a web browser, or any other thing capable of sending data to the
Collection tier. In each of these examples we also decided that our architecture did not
change; well, there is a slight wrinkle to that. Our architectural blueprint may not have
changed, but as we will see in the following sections, where we dig deeper into each of
these patterns, the underlying architecture of the collection tier may vary widely across
scenarios.
Figure 2.5 SuperSearch Stream with just the collection tier in focus.
Now let's start to dig in a little deeper, peeling the onion back one layer so we can see the
request/response pattern exposed below in figure 2.6.
There are several steps identified in figure 2.5 that are executed during our use of
the SuperSearch stream.
Step 1 is the request part of the request/response pattern. The role of the collection tier here is to make a request to consume; in this case the request is to start a stream of data flowing. Keep in mind that it is very common for a consumer to be authenticated as part of the request processing. We will not go into the steps of performing this, but you should take note that it will often be required and needs to be factored into your architecture.
Step 2 is the response part of the request/response pattern. In this scenario the response will be a continuous stream of data.
Step 3 is when we are done consuming the stream and close the connection.
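The three steps can be sketched in Python. Everything here is a stand-in: `open_stream`, the query, and the API key are hypothetical, and a generator simulates the never-ending response so the sketch runs without a network connection.

```python
import itertools

def open_stream(query, api_key):
    # Step 1: the request. In a real system this would authenticate and
    # open a long-lived HTTP connection to the provider; here a generator
    # stands in for the never-ending response stream (step 2).
    return ({"id": n, "query": query} for n in itertools.count())

def consume(stream, limit):
    # Drain messages from the response stream, then stop, simulating
    # closing the connection (step 3).
    collected = []
    for msg in itertools.islice(stream, limit):
        collected.append(msg)  # hand off to the rest of the collection tier
    return collected

messages = consume(open_stream("news", api_key="hypothetical-key"), limit=5)
print(len(messages))  # -> 5
```

In a real collection tier the `limit` would not exist; the loop would run until the connection is closed or dropped.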
Step 2 is where the real work for the collection tier takes place. Continuing with our example, SuperSearch is going to stream 70,000 searches per second to our collection tier; by any measure this is a fire hose of data. To help us think through the different design choices, table 2.1 below lists some of the areas of concern along with some questions to help steer our design. When you are designing a real-world streaming system, these questions, coupled with others we develop in later chapters, will help elicit the thinking and conversations that you will need to have to be successful.
Table 2.1 Areas of concentration and questions to address

Area: Downstream consumers
Questions: What if the next tier cannot keep up with the speed and/or volume of data? How can we make sure our collection tier is not affected by this backpressure?
With our questions in hand and a fire hose of data ready to be consumed, let's walk through some designs based on answers to the above questions. To establish our frame of reference, figure 2.7 below illustrates our blueprint streaming architecture with the SuperSearch stream as the source of data.
Figure 2.7 Simplified architecture consuming 70K searches a second from SuperSearch
You may be thinking that is a very large arrow going from the SuperSearch stream to the collection tier; this was done intentionally to emphasize the amount of data we are consuming. Before we try to answer some of the questions from table 2.1, let's redraw figure 2.7 so we can see a better representation of what the collection tier may really look like. Figure 2.8 below illustrates the collection tier in its expanded form to aid in our discussion.
Figure 2.8 Example collection tier expanded to show the nodes that make it up.
Looking at figure 2.8, it is important to realize that the number of nodes in the collection tier is just for illustration purposes; your mileage may vary depending on many factors, some of which we will cover, while others, such as the type and cost of hardware, we will not. Now let's see if we can answer some of the questions we posed in table 2.1. First, let's start by trying to answer the velocity questions, which are repeated below in table 2.2.
Velocity of Data
If we look closely, these questions are trying to make us think about how we scale the collection tier in the face of changes to the velocity of the data we are dealing with. Our scaling efforts can result in our collection tier being labeled as Superlinear, Linear, or Sublinear. Each of these is illustrated below in figure 2.9.
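To make linear and sublinear scaling concrete, here is a small Python sketch; the per-node throughput of 15,000 messages per second and the 5% efficiency loss per added node are purely illustrative numbers, not measurements.

```python
def throughput(nodes, per_node=15_000, efficiency=1.0):
    # Total messages/second for a tier of `nodes` machines. efficiency=1.0
    # means perfectly linear scaling; below 1.0 each added node contributes
    # slightly less than the last (sublinear), e.g. due to coordination cost.
    total = 0.0
    for n in range(nodes):
        total += per_node * (efficiency ** n)
    return total

# Linear: doubling the nodes doubles the capacity.
print(throughput(12) == 2 * throughput(6))   # -> True
# Sublinear: with 5% loss per added node, doubling gains less than 2x.
print(throughput(12, efficiency=0.95) < 2 * throughput(6, efficiency=0.95))  # -> True
```

Superlinear scaling, where each added node contributes more than the last, is rare in practice; linear is usually the goal.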
Figure 2.9 Data velocity vs. Speed of Processing scalability
Ideally we would like to at least be able to achieve linear scalability. To get there you have two options: you can scale vertically or you can scale horizontally. There is no one-size-fits-all direction to go here, and the choice to go vertical or horizontal depends on a lot of factors, some of which are organizational and beyond the scope of this book, so we will put those to the side. Given this, let's see how we would address the questions posed. In both sets of questions we are really after answers to the general question of how we handle growth, be it from an overall increase or a spike. To handle this by scaling horizontally, we would like to simply add more collection nodes to our collection tier. If we were scaling vertically, we would like to simply add more CPUs and/or RAM to each of the collection tier nodes. Restating this generically, we can say that to scale the collection tier we would like to be able to add X more cores and RAM to handle an increase in traffic of Y. Your job when you solve this problem for your organization is to determine what X and Y are. Let's solve this using our example architecture from figure 2.8, where there are six nodes in the collection tier. Here are our operational assumptions:
With those operational assumptions we can comfortably handle the 70K messages per second load and have the capacity to handle a spike of 35%. Of course, if this spike turned into a new
norm, this would not be sustainable, as you would most likely not want your collection tier to run at full capacity for long. All right, our extra capacity takes care of the question about how we are going to handle a spike in traffic. Since we stated before that we can solve the scalability problem with the general formula of adding X more cores and RAM to handle an increase in traffic of Y, answering the questions of handling a doubling or a 10x increase in data velocity just becomes a matter of plugging in the values for X and Y. How would you handle a drop in data? Would you reduce the number of nodes in the tier? Or would you leave them so that you are ready when the data velocity picks back up?
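The general formula can be captured in a few lines of Python; the per-node capacity of 16,000 messages per second is a hypothetical figure, chosen only so that six nodes cover 70K messages per second with roughly 35% headroom.

```python
import math

def nodes_needed(msgs_per_sec, per_node_capacity, headroom=0.35):
    # Number of collection nodes needed to handle `msgs_per_sec` while
    # keeping `headroom` spare capacity for spikes.
    return math.ceil(msgs_per_sec * (1 + headroom) / per_node_capacity)

# The chapter's load of 70K msgs/sec, with a hypothetical per-node
# capacity of 16K msgs/sec, lands on the six nodes of figure 2.8.
print(nodes_needed(70_000, 16_000))    # -> 6
# Doubling the traffic (Y = 2x) is just a matter of plugging in new values.
print(nodes_needed(140_000, 16_000))   # -> 12
```

Scaling vertically changes `per_node_capacity` instead of the node count, but the arithmetic is the same.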
Let's now think about the streaming protocol that is being used. The question we asked above in table 2.1 was:
Before we can answer this question we need to know the protocol in use. In the wild, some of the popular protocols you will see in use today are HTTP and WebSockets; of course, there are also custom protocols built on top of these or directly on TCP. Since there are many dedicated resources available for each of these protocols that discuss scalability and the protocols in depth, we will not elaborate on them further; please consult one of those resources to make sure you take into account any nuances of the protocol you are using.
We are halfway there: we have answered the questions about data velocity and addressed the protocol in use. Now let's turn our attention to the next set of concerns, namely how we keep up with the producer. The questions we posed back in table 2.2 are repeated below in table 2.4.
To answer the first question, what happens if we fall behind, we need to consider various aspects of the stream we are consuming and of our collection tier. What if the SuperSearch stream we are consuming has business rules that state: if your consumer cannot keep up with the stream, then the consumer will be disconnected? We stated earlier that our goal was to perform an analysis of this stream as it happens, so being disconnected from the stream could result in our application missing a lot of data. On the surface this may not seem like a big deal. But what if we make our money by selling the analysis for streaming ad buying and our customers miss opportunities? Clearly this is not something that we can allow to happen often, if at all. Now let's twist this a little: perhaps SuperSearch decided that they were not
going to explicitly disconnect consumers that could not keep up, and instead would just discard messages if the consumer could not consume fast enough. In this case we are in a very similar predicament: we will begin to miss data and potentially cause our customers to miss opportunities. That brings us to the next question that we need to consider:
Answering this question is going to depend on what you are doing with the data. In our example, we are selling streaming ad buying, and missing data can have financial ramifications for our customers. Your situation may be different; perhaps you are doing something with the data where missing a few messages here and there will not change the outcome of your analysis. This is something you need to consider and take into account. This brings us to our last question:
There are several ways that we can tackle this question. The first obvious choice is to keep adding nodes to our collection tier until we no longer fall behind. In some situations this may not be a bad idea, and for very little effort you may have solved your problem of being able to keep up. But in many cases this is just not feasible, so what do we do? In this situation one possibility is to split our collection tier into two parts and add a buffering tier between them. This architectural change is illustrated below in figure 2.10.
Figure 2.10 Collection tier split in half with Buffering tier in the middle.
The key to splitting the collection tier is to isolate the part of the tier that just serves to receive messages from the producer, split that part out, and change it so that it will now consume messages as fast as possible and push them to a buffering tier. The other half of the collection tier will then be modified to consume messages from the buffering tier. If we can split our collection tier and use a buffering tier in this fashion, then we stand to gain at least the following two nice features:

1. Short-term storage of our messages.
2. Decoupling of our collection tier from the producer of the data.
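The split can be sketched in Python with an in-process `queue.Queue` standing in for a real buffering tier (which in practice would be a separate system such as Kafka): the front half pushes as fast as it can, while the back half drains at its own pace.

```python
import queue
import threading

buffer_tier = queue.Queue()  # stand-in for a real buffering tier (e.g. Kafka)

def front_half(stream):
    # Receives messages from the producer as fast as possible and pushes
    # them straight into the buffering tier.
    for msg in stream:
        buffer_tier.put(msg)

def back_half(out):
    # Consumes from the buffering tier at its own pace.
    while True:
        msg = buffer_tier.get()
        if msg is None:      # sentinel: the stream has been closed
            break
        out.append(msg)

results = []
consumer = threading.Thread(target=back_half, args=(results,))
consumer.start()
front_half(range(1000))      # simulated producer stream
buffer_tier.put(None)
consumer.join()
print(len(results))          # -> 1000
```

Because the two halves only meet at the queue, either side can slow down or restart without the other noticing, which is exactly the decoupling we are after.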
First, you may be wondering why we would want to have short-term storage for our messages. That is a good question, but remember that in some cases, if we cannot keep up with the provider of the stream (in this case SuperSearch), it may choose to disconnect us or, worse, start dropping messages. In both of those cases we will lose data, and if your business will be impacted by the loss of data, then this is something you will need to consider. The second benefit, decoupling the collection tier from the producer of the data, can pay dividends even if we do not need the short-term storage. There are at least two upsides to this. First, now that our collection tier is receiving messages from a tier that provides short-term storage, our failure and recovery scenario just became much simpler: a collection node or the whole tier can fail, and when it recovers it can resume consuming data from where it left off. A secondary benefit is that now that our collection tier is decoupled from the actual producer of the data, we can add in new data sources that we are collecting from, and thus new data into the buffering tier. The details of how to do this will be explored in Chapter 3; for now let's just consider the general idea, which is illustrated below in figure 2.11.
Figure 2.11 Collection with buffering tier and two data sources.
I hope that as you look at and start to internalize figure 2.11, you will see that not only can you start to add other data sources into the buffering tier, but this general concept may be applicable in other cases when building out a streaming system.
At this point we have covered the questions we raised in the areas of data velocity, protocol, and keeping up with the producer. The last area we identified is the downstream consumers. Our concern here is how we protect the collection tier that we have spent a lot of time on from being affected by consumers that cannot consume from us fast enough; think of it as preventing backpressure in a hose. This is an interesting area, and one that we are going to put on hold for just a little bit longer, as it is really the heart of Chapter 3. However, if you want a head start in thinking about it, keep the buffering tier in mind and think about taking it a little further. For now, we are going to move on to our next example and revisit the downstream consumers in the next chapter.
Figure 2.12 The high-level architecture of household appliances sending event messages to a streaming system.
On the surface, figure 2.12 may look like we just switched out the SuperSearch stream for billions of devices and called it good. However, there is a subtle difference between the request/response and event message patterns that figure 2.12 tries to capture. Recall that in the request/response pattern our collection tier reaches out to a single data source and requests data; the resulting response is a never-ending stream that may be massive. The event message pattern is slightly different: in this case we may have millions or billions of things that send us messages. It may be one message every hour, day, or month, or many messages a minute. When looked at from the view of a single appliance or household it may seem small; however, when you think about an entire city, state, or country, the amount of data quickly becomes quite large. Although this difference between the event message pattern and the request/response pattern may seem minor, as we explore a couple of the questions we need to keep in mind and the resulting architecture, we will see that it is indeed quite different. The difference in message pattern becomes more apparent when we look at figure 2.13 below, which compares the data flow of the request/response pattern with the data flow of the event message pattern.
Figure 2.13 The differences between the request/response and event message patterns.
With these differences laid out, let's turn our attention to table 2.5, which lists some of the key areas and questions for us to think through as we consider different design choices for the collection tier in the building of TwitterOfThings.
Area Question
You may have noticed that we left off the velocity of data category; it can be argued that having all of the home appliances for an entire country sending us status messages every time they are used could result in a high velocity of data. Don't worry about this at all; you already know how to handle data velocity based on what you learned in the previous section about the request/response pattern. Let's move on to the areas that are new and different with the event message pattern.
The one question in the volume of data section, how do we handle going from a city to a state to a country, is really getting at how we scale our architecture as the number of appliances grows. If you have not thought about connected devices other than your phone or computer, consider this: as of this writing, Cisco's connections counter (http://newsroom.cisco.com/feature-content?type=webcontent&articleId=1208342) shows there are over 12 billion things connected to the Internet. To put that in perspective, there are a little over seven billion people on Earth, which equates to approximately 1.7 things per person. By many accounts we are just getting started; many predict that by 2020 there will be over 50 billion connected things. With that perspective, let's get back to our question and rephrase it as: how do we handle going from New York City, to New York State, to the United States? Table 2.6 shows the estimated populations and things for these geographies.
Table 2.6 New York City, New York State, and the United States estimated population and
number of connected things
Geography Estimated 2013 Population Estimated connected things (1.7 per person)
In this scenario there will be between 14 and 540 million devices sending us status messages throughout the day, basically like little birds chirping all day long. Without a doubt there will be peaks and valleys in the number of devices connected at any one time, but for simplicity let's assume that the number of connections throughout the day is constant. Taking this into consideration, a plausible architecture for our TwitterOfThings is illustrated below in figure 2.14.
Figure 2.14 First pass at the collection tier with connected devices sending a status message.
If you look closely, figure 2.14 is not that much different from figure 2.8. In fact, when considering the number of devices chirping compared to the number of searches being streamed to us by SuperSearch, the overall architecture may be very similar for the amount of data we are considering. The protocol differences, and streaming versus event messages, as we will see in the next section, will have more of an impact on the technology choices for this tier than the sheer number of devices.
Before moving on to talk about the technology choices that lie ahead, let's not forget about the second question we need to consider: how do we handle multiple protocols? This is interesting; in the SuperSearch example we considered only HTTP, as that is the most common way for data to be streamed across the Internet. However, when we start to talk about devices, in many cases having a full HTTP stack may be way too much of a burden on something that runs on very limited battery power, has a very tight cost structure, and may often have a very spotty and limited Internet connection; think of a moisture meter in a remote agricultural field. How does this affect our collection tier architecture? That is a great question. It really does not change our picture from that in figure 2.14, but it does prompt us to think about the fact that we will need to handle multiple protocols and messages in various formats and sizes.
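One common way to cope with multiple protocols and formats is to normalize everything at the edge of the collection tier, so the rest of the system sees one message shape. The sketch below is a simplified illustration; the "compact" wire format and the field names are hypothetical.

```python
import json

def parse_http_json(payload):
    # Messages from HTTP-capable devices, assumed here to be JSON.
    return json.loads(payload)

def parse_compact(payload):
    # A hypothetical compact wire format for constrained devices:
    # plain text of the form "device_id|reading".
    device_id, reading = payload.split("|")
    return {"device": device_id, "reading": float(reading)}

# Dispatch on the source protocol so downstream tiers see one shape.
PARSERS = {"http": parse_http_json, "compact": parse_compact}

def collect(protocol, payload):
    return PARSERS[protocol](payload)

print(collect("http", '{"device": "washer-7", "reading": 0.4}'))
print(collect("compact", "meter-12|17.5"))
```

A real collection tier would register one parser per protocol (HTTP, MQTT, AMQP, and so on), but the normalize-at-the-edge idea is the same.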
Phew, that is enough theory; let's move on to talk about the different technologies that exist today that we can leverage to support both the SuperSearch stream and our TwitterOfThings.
Velocity of data: High for the stream (70,000/second); low for the events (< 6,000/second, on the order of every device, all 540M of them, sending one message per day).
Variety of data: Low for the stream (a single source, all the same); high for the events (many different devices).
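A quick back-of-the-envelope check of these rates, using the device counts from table 2.6 and the constant-rate simplification from earlier:

```python
SECONDS_PER_DAY = 24 * 60 * 60

def steady_rate(devices, msgs_per_device_per_day=1):
    # Average messages/second if every device sends the given number of
    # messages per day at a constant rate (the simplifying assumption above).
    return devices * msgs_per_device_per_day / SECONDS_PER_DAY

print(round(steady_rate(14_000_000)))    # New York City scale -> 162
print(round(steady_rate(540_000_000)))   # United States scale -> 6250
```

Even at the scale of the entire United States, one event per device per day works out to only a few thousand messages per second, which is why the event side of the table is labeled low velocity despite the enormous device count.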
When looking at table 2.7, it appears that there are no similarities between the two feature sets. If we look a little closer and give them more consideration, the biggest difference that may have an impact on the technology we select is the protocols being used. Sure, with the stream there are few connections compared to the event pattern, and there is a higher velocity versus volume. In the end these are moot compared to the protocol support. With that in mind, let's discuss what technologies are currently available that we can use to build each of these.
Let's start with the SuperSearch streaming system; for this we need to support 70,000 messages per second over HTTP. With the prevalence of HTTP, we can find many choices in every popular programming language on every popular platform. For example, if our language of choice is Java, we may choose to use the Apache HTTP Client or Netty. If we prefer writing our services in JavaScript, we can use Node.js. The key in both of these cases is going to be making sure the technology you choose can scale to meet the velocity demands. Remember, we talked earlier about adding a buffering tier between the two halves of the collection tier; in this case we can choose from a variety of messaging systems, and one that is particularly well suited for this task is Apache Kafka. More information on Apache Kafka can be found at http://kafka.apache.org/.
Let's now turn our attention to the TwitterOfThings system. For this we are going to need, among other things, to support quite a myriad of protocols, the most popular ones today being the Advanced Message Queuing Protocol (AMQP) and MQ Telemetry Transport (MQTT). We need to keep in mind that we may still see HTTP traffic, and for certain there are protocols we will need to support in the future, such as Chirp, a protocol that is lighter weight and more efficient than IP for moving data for things. At the present time, implementations of both AMQP and MQTT are found in brokers such as RabbitMQ and ActiveMQ. If either of these is too heavy for your collection tier or does not meet your needs, there are client libraries available for both protocols in most popular programming languages so that you can build the exact collection tier you require.
At this point you are on your way to collecting data from a blazing-fast stream or from millions of household devices across the United States. Fantastic, but the next obvious question is: great, but what do I do with it? That, my friend, is exactly what we are going to dive into in the next chapter.
2.5 Summary
In this chapter we have explored the various aspects of collecting data for a streaming system, from the blazing-fast SuperSearch to a futuristic TwitterOfThings that allows household devices to send status messages.
Along the way we have:
At times our focus was quite wide and covered a lot of ground. As we progress to the message queuing tier, you will see that although our net may have been cast very wide in the collection tier, once the data is collected it will all start to look and feel the same. For now, let's put the collecting of data behind us and start our journey of following the data stream now that it has entered our streaming system.
3
Transporting the data from the collection tier: decoupling the data pipeline
Figure 3.1 Collection tier with its unstated role (moving data from input to the rest of the platform) exposed
Previously we talked only about the role of handling the incoming data, not the output of data from the collection tier. In this chapter we are going to focus on transporting data from the collection tier to the rest of the streaming pipeline. Although we may mention the collection and analysis tiers in our discussion, we will only be concerned with getting messages from or to those tiers via the message queuing tier. Figure 3.2 below shows our streaming architecture with this focus in mind.
Figure 3.2 The message queuing tier with its input and output as the focus
After completing this chapter you will have a solid understanding of why we need a message queuing tier and the features of a messaging product that are important for a streaming data system.
Figure 3.3 From the collection tier straight to the analysis tier
Looking at this redrawn architecture, we may be tempted to say this looks simpler and things should work just fine. Before we convince ourselves of this, we need to answer a very important question:
What if our consumers cannot consume data fast enough from the collection tier?
In my mind this conjures up an old cartoon image of a hose with the end plugged and the water spigot completely opened: it starts to swell and eventually just explodes from the backpressure. Taking that example to our realm of data, let's look at a time-lapse of the data flowing from the collection tier to the analysis tier, below in figure 3.4.
Figure 3.4 The three stages of data flowing without a message queue. We do not want step C.
Step A: This looks pretty normal and is what we would like to see.
Step B: We can tell something is not quite right; backpressure is building.
Step C: Our data pipe broke under pressure, and data is now virtually dropping on the floor, gone forever.
Ouch, this is not a good situation, as we are now losing data; for some businesses this can be catastrophic. At first blush you may think that this is a consumer problem, and all we have to do is add more consumers or make them faster so they can keep up and life will be good. The reality is that this is not a consumer problem at all, as it is perfectly acceptable in many use cases for consumers to read slowly or be offline from time to time. In this chapter we are going to explore how using a message queuing tier helps protect the collection tier from ever being subjected to message backpressure and ending up like figure 3.4 C.
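The three stages can be simulated with a bounded in-process queue; the capacity of 100 and the burst of 1,000 messages are arbitrary numbers, but they show how a full pipe starts dropping data once nothing drains it.

```python
import queue

pipe = queue.Queue(maxsize=100)  # the "hose": finite capacity, no buffer tier
dropped = 0

# The collection tier emits 1,000 messages while the analysis tier is
# offline, so nothing drains the pipe; once full, data hits the floor.
for msg in range(1000):
    try:
        pipe.put_nowait(msg)
    except queue.Full:
        dropped += 1

print(pipe.qsize(), dropped)  # -> 100 900
```

A message queuing tier with durable storage turns those 900 dropped messages into messages waiting to be read whenever the consumer comes back.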
to the newer takes on messaging found in Apache Kafka, which is described as a publish-subscribe messaging service.
If you search the web for message queuing products, you will find that there are a lot of technologies to choose from and the choices are always evolving. If you were to stop now and investigate some of them further, you would find a bewildering array of features, many of them the same or sounding the same, which just makes our job of choosing the right technology a little harder. So let's resist the temptation to jump right in and start downloading for a little bit longer, and avoid getting ourselves all wound up over something I like to call feature overload. Instead, let's step back, tease out the features that are critical to the success of our streaming system, and discuss those. I know this is hard, but it will be worth it. Armed with this information, choosing the technology tool or tools will not only be much less stressful, but will enable us to objectively think about the problem we are trying to solve and what is important to your business.
Now that we have avoided feature overload and have set the stage for where this tier fits into the larger streaming architecture, we are ready to talk about the core features we need to consider when selecting a message queuing product. This is by no means an exhaustive list of features, but these are the ones you really want to pay attention to when designing a streaming system. Before we dive into the core features, let's take a moment to make sure we understand the components of a message queuing product and how they map to our streaming architecture.
In figure 3.5 we can see that the producer and the consumer have jobs that closely match their names: the producer produces messages, and the consumer consumes them. You may notice in figure 3.5 that the term broker is used and not the term message queue; a logical question is, why the change? Well, it is not so much a change as it is an abstraction. If we look at figure 3.6 below, we will see that the message queue is alive and well, but it is abstracted away by the broker.
Figure 3.6 The broker with the message queue shown.
When you take figure 3.6 into account, the data flow starts to make more sense. If
you follow the flow from left to right in figure 3.6, you will see the following steps taking place:
To put this in perspective, let's see what it looks like if we overlay these terms and pieces onto
our streaming architecture. In figure 3.7 below you will see the components of the
message queuing tier we have been talking about overlaid onto our streaming architecture.
Figure 3.7 Streaming architecture with message queuing tier components shown in context
Looking at figure 3.7, I think you will agree that this seems pretty simple and
straightforward, but as the saying goes, the devil is in the details. It is these details, the
subtle interactions between the producer, broker, and consumer as well as various behaviors
of the broker, that we will now turn our attention to. Phew, finally we are ready to dig into the
core concepts. Are you ready? Let's go.
Figure 3.8 Two data centers with data flowing between them
In the San Diego data center you have the collection tier running, and in the Amsterdam
data center you are running the Analysis tier. I know we have not talked about the Analysis
tier yet; for now let's just say it needs the data from the collection tier. All right, you are
collecting data in San Diego and analyzing it in Amsterdam, things are running smoothly, and
business is good. But as luck would have it, right as you were about to leave for the weekend
on a beautiful Friday afternoon, a construction worker accidentally put a backhoe through a fiber
optic line, cutting off communication between your two data centers as shown in figure 3.9.
Figure 3.9 Two data centers with data flowing into the ocean
After talking with the telecom company that owns the fiber line, you learn their best guess
is that it may take 2-3 days to repair. What would the impact be to your business if this were
to happen? How much data can your business tolerate losing from your collection tier? If this
situation would have a negative impact on your business and you cannot tolerate losing
potentially days of data, then you need to make sure the message queuing technology you
choose has the ability to persist messages for the long term. Figure 3.10 below shows how
durable messaging fits in with this tier and some of the types you may find.
Figure 3.10 Durable messages: where they fit and how they may be stored
the traffic data for a given day, week, or month. If your architecture is similar to Figure 3.11,
then as the Analysis tier consumes messages they are discarded from the message queue; in
essence they are gone, and you cannot provide your historical traffic replay.
Figure 3.11 Transient messages get discarded after the Analysis tier consumes them.
To solve this problem we need an architecture that more closely resembles that depicted
below in Figure 3.12.
Figure 3.12 Offline consumers persisting data for historical reporting / analysis
Using a product that supports storing messages to allow for offline consumers will allow us
to handle the current desire to provide a historical traffic replay as well as any other
historical reporting or analysis we may want to do in the future. To be sure you can
support these types of requirements, you need to make sure that the message queuing
technology you choose supports both online and offline consumers.
At most once - A message may get lost, but it will never be re-read by a consumer.
At least once - A message will never be lost; however, it may be re-read by a consumer.
Exactly once - A message is never lost and is read by a consumer once and only once.
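To make these guarantees concrete, here is a minimal sketch showing how the point at which a consumer acknowledges a message, relative to processing it, determines which guarantee the consumer gets. The `Broker` class and its `receive`/`ack` methods are toy stand-ins invented for illustration, not any real client library.

```python
class Broker:
    """Toy in-memory broker that keeps a message pending (and so eligible
    for redelivery) until it is acknowledged."""
    def __init__(self, messages):
        self.pending = list(messages)

    def receive(self):
        return self.pending[0] if self.pending else None

    def ack(self, message):
        # Once acknowledged, the broker will never redeliver the message.
        self.pending.remove(message)

def consume_at_most_once(broker, process):
    msg = broker.receive()
    if msg is not None:
        broker.ack(msg)   # ack BEFORE processing: a crash here loses the message
        process(msg)

def consume_at_least_once(broker, process):
    msg = broker.receive()
    if msg is not None:
        process(msg)      # process BEFORE ack: a crash here causes redelivery
        broker.ack(msg)
```

A crash between the two steps loses the message in the at-most-once case and causes a duplicate in the at-least-once case; exactly once requires extra bookkeeping on top of this.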
If you had to pick one, which would it be? If you said exactly once, you are not alone; in
fact, most people want a system where messages are never lost and each message is
delivered to a consumer once and only once. Who would not want that? If only it were that
simple; this of course comes with caveats and risks. Figure 3.13 below shows the possible
points of failure that we need to talk about.
Wow, it seems like we have identified almost every spot in the diagram as a possible point
of failure. Don't worry, it is not all doom and gloom; let's walk through them and understand
what the risks are and what each numbered item means.
1. Producer - If the producer fails after it has generated a message but before it has had a
chance to actually send it over the network to the broker, then we will lose a message.
There is also the chance that the producer may fail while waiting to hear back from the
broker that it received the message, and in turn the producer, after it recovers, may
send the same message a second time.
2. The network between the producer and broker - If the network between the producer
and the broker fails, the producer may send the message but the broker never receives
it, or the broker does receive it but the producer never gets the response acknowledging
it. In both of these cases the producer may send the same message a second time.
3. Broker - If the broker fails with messages that are still held in memory and not
committed to a persistent store, then we may lose messages. If the broker were to fail
before sending an acknowledgement to the producer, the producer may send the
message a second time. Likewise, if the broker tracks the messages consumers have
read and it fails before committing that information, a consumer may read the same
message more than once.
4. Message queue - If the message queue is an abstraction over a persistent store and it
fails while trying to write data to disk, we may end up losing messages.
5. The network between the consumer and broker - If the network between the consumer
and the broker fails, the broker may send a message and record that it was sent, but
the consumer may never get it. From the consumer side, if the broker waits for the
consumer to acknowledge that it received a message but the acknowledgement never
reaches the broker, the broker may send the consumer the same message a second time.
6. Consumer - If the consumer fails before it can record that it processed a message,
either by sending an acknowledgement to the broker or by writing to a persistent store,
it may request the same message from the broker. Another twist here is the case where
there are multiple consumers and more than one of them reads the same message.
I know that is a lot to consider, and it may seem a little overwhelming; don't worry, this will
not be the last time we see these types of semantics discussed. In the context of a message
queuing system we need to keep these failure scenarios in our back pocket, so that when a
messaging system claims to provide exactly once delivery semantics we can understand
whether it truly does. As is the case with so many things, the choice of which technology to
use will involve various tradeoffs, such as those in table 3.1 below.
Table 3.1 The tradeoffs we are often faced with when considering a message queuing system
Less complexity, faster performance, and weaker guarantees vs. More complexity, a
performance hit, and a strong guarantee
The choice of where to compromise is going to be based on the business problem you are
trying to solve with the streaming system. For example, if you are building a streaming web
analytics product, missing a message here or there is not going to have much if any impact on
your product. If, on the other hand, you are building a streaming fraud detection system, then
missing a message can have a very undesirable effect.
As you look at different messaging systems, you may find that the one you want to use
does not provide exactly once guarantees. Don't despair; often you can solve this using two
techniques. Let's take a look at figure 3.14 below to see them graphically, and then we will
discuss them.
Figure 3.14 The two ways to have exactly once semantics if the messaging system does not provide it
If your business problem requires exactly once semantics but your chosen messaging
system does not provide them, then you will need to use two techniques to bridge the gap.
Figure 3.14 above identifies the producer and the consumer techniques. Let's now talk about
those in more detail.
1. Do not retry sending messages - This is the first technique we must use. To do this you
will need to have in place a way to track the messages your producer(s) send to a
broker(s). If and when there is no response, or a network connection is interrupted
between your producer(s) and the broker(s), you can read data from the broker to
verify that the message you did not receive an acknowledgment for was in fact received.
By having this type of message tracking in place, you can be sure your producer only
sends messages exactly once.
2. Store metadata for the last message - This is the second technique we must use, and it
involves storing some data about the last message we read. The metadata you store is
going to vary by messaging system. In the end, what you need is data about the
message so that you can be sure your consumer does not process a message a second
time. Figure 3.14 shows the metadata being stored in a persistent store. Something you
will need to take into consideration is: what do you do if there is a failure storing the
metadata?
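A minimal sketch of these two techniques, with every class and method name (`TrackingProducer`, `DedupingConsumer`, the broker stub) invented for illustration; no real product exposes exactly this API:

```python
class SimpleBroker:
    """Toy broker stand-in that lets a producer read back what was stored."""
    def __init__(self):
        self.messages = {}

    def publish(self, msg_id, payload):
        self.messages[msg_id] = payload

    def contains(self, msg_id):
        return msg_id in self.messages

class TrackingProducer:
    """Technique 1: never blindly retry; verify with the broker instead."""
    def __init__(self, broker):
        self.broker = broker
        self.sent_ids = set()          # IDs we have already attempted to send

    def send(self, msg_id, payload):
        if msg_id in self.sent_ids and self.broker.contains(msg_id):
            # We saw no ack earlier, but the broker has the message:
            # resending would create a duplicate, so do nothing.
            return
        self.sent_ids.add(msg_id)
        self.broker.publish(msg_id, payload)

class DedupingConsumer:
    """Technique 2: persist metadata about messages already processed."""
    def __init__(self, metadata_store):
        self.metadata_store = metadata_store  # maps msg_id -> processed flag

    def handle(self, msg_id, payload, process):
        if self.metadata_store.get(msg_id):
            return                      # already processed; skip the duplicate
        process(payload)
        self.metadata_store[msg_id] = True
```

Note the open question from the text remains: if storing the metadata fails after `process` runs, the consumer may still reprocess the message on recovery.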
If you do implement these two techniques, you will be able to guarantee exactly once
messaging. You may not have noticed during this discussion, but by doing this you also get
two nice little bonuses (sorry, not that type of bonus; I was thinking more about the data
quality and robustness of your system). Take a look again at figure 3.14 and our discussion of
it above; what do you think the bonuses are? There may be more, but the ones I was thinking
of are message auditing and duplicate detection. From a message auditing standpoint, since
you are already going to keep track of the messages your producer sends via metadata, on
the consumer side you can use this same metadata to keep track of not just messages
arriving, but also perhaps the max, min, and average time it takes to process a message.
Perhaps you can identify a slow producer or slow consumer. Regarding duplicate detection, we
already decided that our producer was going to do the right thing to make sure a message was
only sent to a broker one time, and on the consumer side we said it was going to check to see
if a message had already been processed. One extra thing to keep in mind: in your consumer,
don't just keep track of metadata related to the messaging system (some will expose a
message ID of some sort so you know whether you have processed a message with the same
ID), but also be sure to keep track of metadata that you can use to distinctly identify the
payload of a message. Now you know not just how to ensure exactly once semantics; you are
also on your way to providing message auditing and detecting message duplication. As you go
through this book you will run into these concepts again, and you may see other ways to apply
message auditing through the entire streaming architecture.
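One way to distinctly identify a payload, as a hedged sketch: hash a canonical serialization of the message body and keep simple processing-time statistics alongside it. The class, field names, and the idea of passing the elapsed time in from the caller are all assumptions made for illustration.

```python
import hashlib
import json

def payload_fingerprint(payload: dict) -> str:
    """Hash a canonical serialization of the payload so the same payload is
    detected even if it arrives under a different message ID or with its
    fields in a different order."""
    canonical = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

class AuditingConsumer:
    def __init__(self):
        self.seen_fingerprints = set()
        self.processing_times = []     # feeds min / max / average auditing

    def handle(self, payload, process, elapsed_seconds):
        """elapsed_seconds is assumed to be measured by the caller."""
        fp = payload_fingerprint(payload)
        if fp in self.seen_fingerprints:
            return False               # duplicate payload: skip it
        process(payload)
        self.seen_fingerprints.add(fp)
        self.processing_times.append(elapsed_seconds)
        return True

    def audit_stats(self):
        times = self.processing_times
        return {"min": min(times), "max": max(times),
                "avg": sum(times) / len(times)}
```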
Question: What would the impact be to Paul's business if the communication between the
collection tier and analysis tier were interrupted for an extended period of time?
Discussion: This would have a catastrophic impact on Paul's business. His business would not
be able to offer its service, and that may have a detrimental impact on his customers'
businesses.
Question: How many days' worth of data can Paul's business tolerate losing?
Discussion: Zero; in fact, I would argue that given the nature of both Paul's business and the
type of data he is dealing with, losing data is not an option.
Question: Would you anticipate that Paul would need to store historical data?
Discussion: I would expect that at least one customer or an executive in Paul's business has
asked to see a report detailing how their service has performed over time.
Question: What type of message delivery semantics does Paul's streaming system need?
Discussion: I would expect that his business needs exactly once semantics. Without that,
there is the chance that he may miss a message and thus miss a fraudulent transaction.
Could he get by with at least once? Perhaps; it may make the consumer more complex, but
it is possible.
That was fun. How about we take two more totally different businesses and see if you can
answer the questions for them?
Questions Discussion
adding to their cart and buying right now along with the jeans. With our mission in mind, go
ahead and answer the questions so we can design the correct system for Rex.
Questions Discussion
3.4 Summary
In this chapter we have explored how we decouple the data being collected from the data
being analyzed by using a message queuing tier in the middle.
During this exploration we:
At this point we have developed a good understanding of how to decouple the data being
produced by our collection tier from the analysis tier. As we move through the other chapters
of this book, we will see some of these terms and concepts popping up again, so do not worry
if this is the least bit overwhelming; it will not be the last time we talk about message delivery
semantics. Now let's get ready to have some fun with the data we have collected. The next
chapter will take us through the analysis tier, which, if you remember from above, is our
message consumer. Are you ready? Let's go.
4
Analyzing Streaming Data
Figure 4.1 The streaming data architecture with the Analysis tier in focus.
One thing you may notice in figure 4.1 above is that unlike the previous chapter, where we
discussed both the input and output of the data, in this chapter we are only going to concern
ourselves with the input. The reason for this is simple: our goal is to understand the core
underpinnings of this tier, and in the next chapter we will discuss the ways we can work with
the data in this tier. Therefore, we will hold off on talking about where the data goes from
this tier until the next chapter. After this chapter you will have an understanding of the core
concepts found in all the modern tools used for this tier and be ready to learn how to perform
various operations on the data. All right, grab a quick coffee refill and let's get going.
it comes to data, in-flight refers to the idea of it always being in motion and never at rest. If
you have not heard the term data at rest, don't worry; it is just a fancy way of saying that
the data is stored on disk or another storage medium. Let's take a look at figure 4.2 below,
which shows how this plays out in our streaming architecture.
Figure 4.2 Data being pushed from the collection tier and pulled from the analysis tier
Looking at figure 4.2 above, it should be clear that our goal in this tier is to pull the data
from the message queuing tier as fast as possible; in essence, we need to be sure we can keep
up with the rate at which the collection tier is pushing data into the message queuing tier. So
how is this different from a non-streaming system, say one built with an RDBMS or using
Hadoop Map Reduce jobs? In those non-streaming systems the data is at rest and you query it
for answers, while in a streaming system we turn that on its head and the data is moved
through the query. Figure 4.3 shows what I mean by this.
Figure 4.3 Turning things on their head, non-streaming vs. streaming
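The contrast in figure 4.3 can be sketched in a few lines of code; the log-record shape and function names below are invented purely for illustration:

```python
def batch_count_errors(stored_records):
    """Data at rest: the data sits in storage and the query runs over it."""
    return sum(1 for rec in stored_records if rec["level"] == "ERROR")

def streaming_error_counter():
    """Data in motion: a standing query that each arriving record flows
    through, updating the answer immediately. A Python generator stands in
    for a stream task processor here."""
    count = 0
    while True:
        rec = yield count          # wait for the next in-flight record
        if rec is not None and rec["level"] == "ERROR":
            count += 1
```

In the batch version you must re-run the query to get a fresh answer; in the streaming version the answer is continuously maintained as the data moves through the query.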
Flipping this model on its head so that data is being pulled through our system in a never-
ending stream has implications for both the design and the way we query these systems. If
you sit back and think of all the data zipping around you all day long, from the myriad of
connected devices and appliances to online activity, the questions you could ask and the
problems you could solve if it all passed through a streaming analysis tier are amazing.
Understanding how to build these systems to harness all of this, and the data streams of the
future, is becoming an essential skill. All right, let's not get ahead of ourselves just yet; we
have our work cut out for us learning about the core features of an analysis tier. Let's begin
our journey by discussing the general architecture of a stream-processing system and then
move on to the key features and see how each of them plays a role in our decision to use a
particular framework.
Figure 4.4 Generic streaming analysis architecture you will find with many products on the market.
When you look at figure 4.4 above, you will notice that some systems may require you to
have an Application Driver; in those systems it is also common that the driver controls the
lifecycle of the whole stream-processing application. In all systems you will find a component
that plays the role of the streaming task manager. This component has several jobs: first, it
manages the lifecycle of the streaming task processors, and secondly, it sets up the data flow
of the incoming stream and how each of the streaming task processors interact. Figure 4.4
just shows the state of the system after each of the streaming task processors was created.
Figure 4.5 below shows what the system looks like after everything is connected and the
stream is running.
Figure 4.5 A view of the streaming analysis after it is up and running.
There are several things to point out in figure 4.5. First, the streaming task manager is still
present; it's responsible for watching and monitoring the various stream task processors.
Secondly, there are multiple input streams, and several of the stream task processors send
data to other stream task processors. Don't be alarmed by this; it is the natural programming
model of stream-processing, and it is very similar to the Map Reduce programming model you
may already be familiar with. If you are not familiar with the Map Reduce model either, don't
worry; by the time you finish the next chapter you will feel very comfortable with this
programming model. Looking again at figure 4.5, the following three questions come to mind:
Q: How did the streaming task manager know how to set up the stream task
processors?
A: The short answer is you told it. OK, let's elaborate on that a little. When you look
at the various technologies available for this layer, one thing you will find is that they all
provide a way to express a data flow graph. It is this data flow graph that the
streaming task manager uses to determine how to organize and orchestrate the stream
task processors.
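As an illustrative sketch only (every real framework has its own API for this), a data flow graph boils down to named processors plus the edges connecting them; the `DataFlowGraph` class and its naive single-message `run` method are invented for this example:

```python
class DataFlowGraph:
    """Toy data flow graph of the kind you describe to a streaming task
    manager. All names here are invented for illustration."""
    def __init__(self):
        self.processors = {}           # name -> processing function
        self.edges = []                # (upstream, downstream) pairs

    def add_processor(self, name, fn):
        self.processors[name] = fn
        return self

    def connect(self, upstream, downstream):
        self.edges.append((upstream, downstream))
        return self

    def run(self, source_name, message):
        """Naively push one message through the graph, breadth-first,
        feeding each downstream processor its upstream's output."""
        results = {source_name: self.processors[source_name](message)}
        frontier = [source_name]
        while frontier:
            node = frontier.pop(0)
            for up, down in self.edges:
                if up == node:
                    results[down] = self.processors[down](results[up])
                    frontier.append(down)
        return results
```

For example, a two-step graph that splits a line of text and counts the words:

```python
graph = (DataFlowGraph()
         .add_processor("parse", str.split)
         .add_processor("count", len)
         .connect("parse", "count"))
```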
A: Depending on the framework you use, some of your code will execute in the
streaming task manager; in all frameworks, the business logic that is executing in the
stream task processors is your code. As an example, imagine you wanted to determine
the number of times each visitor to your website visited a page. A plausible solution, if
you were to write it down on a napkin, would be to group all the page views by visitor
and then count each group. Applying this solution to figure 4.5, we end up with figure
4.6 below.
If you are familiar with the concepts of Map Reduce, you will notice that figure 4.6 above
looks just like how the data would be processed with Map Reduce. That is a powerful
attribute of a lot of stream-processing tools; the higher-level data flow is familiar.
A: Looking at figure 4.5, you will notice that there is an input stream of data and then
an output from our stream task processors. Taking this simple example, let's update that
figure to show the tiers; the result is figure 4.7 below.
Figure 4.7 The data flow of an analysis tier with the previous and next tier
With this understanding of the general architecture and the data flow, we can begin to dig
into the key features of the various stream-processing frameworks that are a key piece of
assembling the analysis tier.
At-most-once - A message may get lost, but it will never be processed a second time.
At-least-once - A message will never be lost; however, it may be processed more than once.
Exactly-once - A message is never lost and will only be processed once.
Those definitions are a slightly more generic version of what we saw before with the
message queuing tier. So how do these manifest themselves in the stream-processing tools
you may use in this tier? Let's overlay them on our data flow diagram and then walk through
them to understand what they mean.
Figure 4.8 below shows at-most-once semantics with the two failure scenarios: a message
dropping and a streaming task processor failing. The second scenario, a streaming task
processor failing, will also result in message loss until a replacement processor comes online.
Figure 4.8 At-most-once message delivery shown with the streaming data flow
At-most-once is the simplest delivery guarantee a system can offer; there is no special logic
required anywhere. In essence, if a message gets dropped, a stream task processor crashes, or
the machine that a stream task processor is running on fails, the message is lost.
At-least-once increases the complexity, as the streaming system must keep track of every
message that was sent to the stream task processor and the result. If it determines that the
message was not processed (perhaps it was lost, or the stream task processor did not respond
within a given time boundary), then it will be resent. Your code (the streaming task processor)
must keep track of all the messages processed, because if the stream-processing framework
does not believe a message was handled it will send it again; thus your code will need to
handle duplicate messages, which may also arrive out of order.
Exactly-once semantics ratchets up the complexity a little more for the stream-processing
framework. Besides the bookkeeping it must keep for all messages that have been sent, it
now must also detect and ignore duplicates. Arguably, the complexity of your code actually
goes down, since it now just has to make sure it responds with a success or failure after it has
processed a message; it does not need to deal with duplicate detection.
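A hedged sketch of that division of labor, with all names invented for illustration: the framework redelivers unconfirmed messages (at-least-once) and filters duplicates before they reach user code (exactly-once), while the user code only reports success or failure:

```python
class StreamTaskRunner:
    """Toy stand-in for framework-side bookkeeping: it filters duplicate
    deliveries so user code never has to do its own duplicate detection."""
    def __init__(self, process_fn):
        self.process_fn = process_fn   # user code: returns True on success
        self.processed_ids = set()     # framework-side duplicate detection

    def deliver(self, msg_id, payload):
        if msg_id in self.processed_ids:
            return "duplicate-ignored"
        ok = self.process_fn(payload)  # user code only reports success/failure
        if ok:
            self.processed_ids.add(msg_id)
            return "processed"
        return "retry"                  # the framework would redeliver later
```

Redelivering the same `msg_id` (as an at-least-once transport would after a timeout) is absorbed by the framework layer, so the user's `process_fn` runs once per message.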
STATE MANAGEMENT
Once your streaming analysis algorithm becomes more complicated than just using the
current message, without dependencies on any previous messages and/or external data, you
will need to maintain state, and you will likely need the state management services provided
by your framework of choice. Let's take a simple example that we can work with to help
understand where, and perhaps how, state needs to be managed.
Pretend you are the marketing manager for a large e-commerce site and you want to
know the number of page views per hour for each visitor.
I know you're thinking: an hour? That can be done in a batch process. We are not going to
worry about that right now; instead, let's focus on the implied state we must keep to satisfy
this business question. Figure 4.9 below shows how our streaming task processors would be
organized to answer this question.
Figure 4.9 Simple counting page views per user over an hour
It becomes obvious when looking at figure 4.9 where we need to keep state: right there in
the stream processor tasks that perform the counting by ID. If our streaming analysis tool of
choice does not provide state management capabilities, then one viable option is for us to
keep the data in memory and flush it every hour. This would work as long as we are OK with
the potential of losing all the data if our streaming processor task failed at any time. Of course,
as luck would have it, our tasks would be running smoothly and then one day start to fail at
59 minutes into the hour. Depending on your business case, the risk of possibly losing data
by keeping all state in memory may be acceptable. However, in many business cases life is
not so simple, and we do need to worry about managing state. To help in these scenarios,
many stream-processing frameworks provide state management features we can leverage.
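The naive in-memory approach to our page-views-per-visitor count, extended with a simple checkpoint to a stand-in remote store, might look like this sketch (the store, class, and method names are assumptions made for illustration):

```python
from collections import Counter

class PageViewCounter:
    """Toy stream task: counts page views per visitor in memory and
    periodically checkpoints so a replacement task can recover the count."""
    def __init__(self, checkpoint_store):
        self.checkpoint_store = checkpoint_store   # stand-in remote datastore
        # Recover state from the last checkpoint, if any.
        self.counts = Counter(checkpoint_store.get("pageviews", {}))

    def on_message(self, visitor_id):
        self.counts[visitor_id] += 1

    def checkpoint(self):
        # Persist a snapshot; anything counted after this is lost on failure.
        self.checkpoint_store["pageviews"] = dict(self.counts)
```

Counts recorded after the last checkpoint are still lost on failure, which is exactly the "basic functionality" end of the complexity continuum discussed here.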
The state management facilities provided by the various systems naturally fall along a
complexity continuum, shown below in figure 4.10.
Figure 4.10 State management complexity continuum for stream processing tools
On the far left side of the continuum is our naive in-memory-only choice from above, closely
followed by an in-memory-with-checkpointing feature. The systems that offer in-memory
with checkpointing almost always use a remote datastore for checkpointing, and thus provide
just the basic functionality of ensuring a computation is not lost in the face of failure.
On the other end of the spectrum are the systems that provide a queryable persistent state
that is replicated. If you find yourself saying these seem like two totally different slants on
state management, you are not alone. The solutions on the low-complexity side only solve the
problem of maintaining the state of a computation in the face of failures. Granted, some of
those systems will claim that you can query the state and manipulate it; however, in reality
the checkpoint is often written to a remote data store, and the cost of querying may have a
dramatic impact on your ability to keep up with the speed of the stream. However, for the
simple operations of keeping a running count current and not losing track of the current value
in the face of failure, these systems are a great fit. On the other end of the spectrum, the
frameworks that offer state management by way of a replicated queryable persistent store
really help you answer much different and more complicated questions. With these
frameworks you can actually join different streams of data together. For example, imagine
you were running an ad serving business and you wanted to track two things: the ad
impression and the ad click. It is reasonable that the collection of this data would result in
two streams of
data: one for ad impressions and one for ad clicks. Figure 4.11 below shows how these
streams and our stream task processors would be set up to handle this.
Figure 4.11 Handling ad impression and ad click streams that use stream state.
In this example the ad impressions and ad clicks arrive in two separate streams. Since the
ad clicks will lag the ad impressions, we will join the two streams and then count by the ad id.
Because of the lag in the ad click stream, using a stream-processing framework that persists
the state in a replicated queryable datastore enables us to join the two streams and produce
a single result. I think you will agree that being able to join streams by leveraging the state
management facilities of a stream-processing framework is quite a bit different than just
making sure the current value of an aggregation is persisted. If you give some more thought
to this example, I am sure you will come up with other ideas for joining more than one
stream of data. It is a fascinating topic and something we will look at in more depth in the
next chapter. For now, let's continue on to the next feature we need to understand when
choosing a stream-processing framework.
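As a sketch of the join described above, assume the framework exposes its replicated state store as a key-value map; a plain dict stands in here, and the function names are illustrative, not any real API.

```python
# Joining two streams on ad id using a persistent state store.
# The dict stands in for a framework's replicated, queryable store.

state = {}   # ad_id -> {"impressions": n, "clicks": n}

def on_impression(ad_id):
    # Handler for events from the ad impression stream.
    entry = state.setdefault(ad_id, {"impressions": 0, "clicks": 0})
    entry["impressions"] += 1

def on_click(ad_id):
    # Clicks lag impressions; because the state is persisted, the
    # earlier impression is still there to join against.
    entry = state.setdefault(ad_id, {"impressions": 0, "clicks": 0})
    entry["clicks"] += 1

# Two separate streams arriving interleaved:
for ad_id in ["ad-1", "ad-2", "ad-1"]:
    on_impression(ad_id)
on_click("ad-1")

print(state["ad-1"])   # {'impressions': 2, 'clicks': 1}
```

The key point is that the state store, not the event itself, carries the earlier impression forward so the lagging click has something to join with; if the store is replicated, that joined result also survives a task failure.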
WINDOWS OF TIME
At some point you will want to move beyond the simple counting of events and perform
computations that involve looking at events over a period of time, just like our previous
example of counting page views over a time period. To accomplish this you'll want to take
advantage of the windowing features a stream-processing framework may offer. As you
survey the stream-processing landscape you will find that the windowing features provided fall
into one of three general categories:
No built-in support
The ability to execute a task at a given interval
The ability to perform a computation over a sliding window of time at a given interval
We will go into detailed examples of how we can answer certain questions for each of these
scenarios in chapter 5; for now, let's make sure we understand what we mean by each of
these categories and what is involved in solving the following example.
Every 10 seconds you want to output the number of times your business was
mentioned online during the last 30 seconds.
No built-in support - means that the framework does not provide a mechanism that
allows you to process data in anything other than an event-by-event fashion with respect
to a window of time. With this model, performing a task at a given interval often
means you are left with two options: adding data to an input stream at a given interval
that your stream task processor reacts to, or keeping track of time in a stream task
processor and performing a computation or producing output at a given interval. To
provide a solution for the above example, we would need to keep track of the last
30 seconds' worth of data, or at least the count, and then on a 10-second interval
emit the 30-second count and reset our counter. This is doable, but it would require
us to do all of the work. On the surface that may not sound bad, but as the saying
goes, "the devil is in the details." For example, what happens if the stream task
processor that has the 30 seconds of data fails? What happens if the event that is
supposed to fire every 10 seconds actually executes every 15 seconds? I am sure you
can think of other failure scenarios as well. Still, if your business problem can
mostly be solved with this event-by-event model, and you are comfortable with this
windowing-lite approach and the inherent risk of data loss, then a stream-processing
framework with this level of windowing may be just fine.
Execute a task at a given interval - at this level you will find products that handle
the interval work and ensure that your stream task processor is called at a specified
interval. You may still have to handle holding onto the last 30 seconds of data, but
at least the interval work is something you do not need to handle. The risk of data loss
does not go away, since you still maintain the 30 seconds of data yourself. Taking that
into consideration, this level is an improvement, but not a great one: it reduces the
complexity of what you need to write, but it still leaves you at risk of losing data.
However, as with the previous level, the risk may be well within the limits of your
business.
Sliding window computations - this is the richest of the feature sets, and it often
provides many different ways for you to express that you want to perform a
computation over a sliding window of data at a particular interval. At this level,
keeping track of the 30 seconds of data and executing at a certain interval is
completely taken care of by the framework. You are left worrying about your algorithm
and nothing more.
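To see what the do-it-yourself end of this spectrum involves, here is a minimal sketch of the example above using simulated timestamps; in a real task a timer or the framework would drive the 10-second emits, and all names here are illustrative.

```python
from collections import deque

WINDOW = 30  # seconds of data each report covers

events = deque()  # timestamps of mentions still inside the window

def on_mention(ts):
    # Called for every mention event in the stream.
    events.append(ts)

def emit(now):
    # Drop anything older than the window, then report what is left.
    while events and events[0] <= now - WINDOW:
        events.popleft()
    return len(events)

# Mentions arrive at t=1, 5, 12, 28; reports are emitted at t=30 and t=40.
out = []
for ts in [1, 5, 12, 28]:
    on_mention(ts)
out.append(emit(30))   # counts t=1, 5, 12, 28
on_mention(35)
out.append(emit(40))   # t=1 and t=5 have aged out; counts t=12, 28, 35
print(out)   # [4, 3]
```

Notice how much of the failure surface discussed above lives in this little sketch: the deque is unmanaged in-memory state, and nothing guarantees `emit` fires exactly every 10 seconds, which is precisely the work the richer windowing levels take off your hands.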
I think you would agree that there is quite a range in the features that are provided by the
various stream-processing frameworks. In the end you should be able to find one that helps
you solve your business problems.
FAULT TOLERANCE
It is nice to think of a world where things do not fail; in reality, however, it is not a matter of
if things will fail but only a matter of when. A stream-processing framework's ability to keep
going in the face of failures is a direct result of its fault-tolerance capabilities. When we
consider all of the pieces involved in stream processing, there are quite a few places where
things can fail. Let's look again at the pieces involved and use figure 4.12 below to identify
the failure points.
Figure 4.12 The points of failure with stream processing in the context of the streaming architecture
1. Incoming Stream of Data - In all fairness, the message queuing tier will not be under
the control of the stream-processing framework; however, there is the potential for the
message queuing system to fail, in which case the stream-processing framework must
respond gracefully and not also fail.
3. Stream Task Processor - This is where our code runs, and it should be under the
supervision of the stream-processing framework. If something goes wrong here (perhaps
our software fails, or the machine it is running on fails), then the stream task
manager should take steps to restart the processor or move the processing to a
different machine.
5. Stream Task Processor - This is the same as #3 above, and it should also be under
the direct supervision of the stream task manager.
6. Connection to output destination - The stream task manager may not be able to
control the network path to the output; however, it should be able to control the flow of
data from the last stream task processor so that it does not become overwhelmed by
network backpressure.
7. Output destination - This would not be under the direct supervision of the stream
task manager; however, its failing can impact the processing of the stream and
therefore needs to be taken into consideration.
8. Stream Task Manager - If this fails, then we end up in a situation often referred to
as running headless. This refers to the notion that it is the responsibility of this
component to supervise and control the flow of data and the stream task processors.
Thus if this component fails, there is no supervisor for the data flow or the stream
task processors: no new processors can be started and failed ones cannot be recovered.
Many stream-processing frameworks use a variety of checkpointing strategies and take
advantage of various characteristics of the input stream to provide a robust environment
that can continue to process the stream without interruption in the face of failure. As you
investigate which stream-processing framework to use to solve your business problem,
understanding its failure semantics is critically important to your success.
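As a rough sketch of what such supervision and checkpoint recovery look like, the following toy supervisor restarts a failed task from its last checkpointed offset; the class names, the dict standing in for a remote checkpoint store, and the simulated failure are all entirely illustrative.

```python
class Task:
    """A stream task processor that checkpoints its input offset."""

    def __init__(self, checkpoint_store):
        self.store = checkpoint_store
        self.processed = checkpoint_store.get("offset", 0)

    def process(self, stream):
        # Resume from the last checkpointed offset, not from zero.
        for offset in range(self.processed, len(stream)):
            if stream[offset] == "poison":       # simulated failure
                raise RuntimeError("task crashed")
            self.processed = offset + 1
            self.store["offset"] = self.processed   # checkpoint write

def supervise(stream, store, max_restarts=3):
    # The stream task manager's job: restart the processor on failure.
    for _ in range(max_restarts):
        task = Task(store)
        try:
            task.process(stream)
            return task.processed
        except RuntimeError:
            stream[task.processed] = "ok"   # pretend the fault cleared
    raise RuntimeError("gave up after repeated failures")

store = {}
done = supervise(["a", "b", "poison", "c"], store)
print(done)   # 4 -- all events processed despite the mid-stream failure
```

The restarted task picks up at the checkpointed offset rather than reprocessing the whole stream, which is the essence of continuing without interruption; real frameworks layer delivery-semantics guarantees on top of this basic restart-from-checkpoint loop.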
4.4 Summary
In this chapter we took a dive into the common architecture of stream-processing frameworks
you will find when surveying the landscape, and we went over the core features you need to
consider when choosing one.
We have learned:
The common architecture shared by stream-processing frameworks
How frameworks differ in their state management facilities, from in-memory state with checkpointing to replicated queryable persistent stores
The three general levels of windowing support you will encounter
Why fault tolerance and failure semantics are critical when choosing a framework
I understand that some of this may seem fuzzy or fairly abstract; don't worry about that at
all. In the next chapter we will focus on how you perform analysis and/or query the data
flowing through the stream-processing framework. Some may say that is where the fun
really begins, but to effectively ask questions of the data we need the understanding you
developed in this chapter. Are you ready to start asking questions of the data? Great, let's
turn the page and get started.