Practical Service Level Management:


Delivering High-Quality Web-Based Services
John McConnell with Eric Siegel

Cisco Press
800 East 96th Street, 3rd Floor
Indianapolis, IN 46240 USA


Practical Service Level Management:


Delivering High-Quality Web-Based Services
John McConnell with Eric Siegel
Copyright 2004 Cisco Systems, Inc.
Published by:
Cisco Press
800 East 96th Street
Indianapolis, IN 46240 USA
All rights reserved. No part of this book may be reproduced or transmitted in any form or by any means, electronic
or mechanical, including photocopying, recording, or by any information storage and retrieval system, without
written permission from the publisher, except for the inclusion of brief quotations in a review.
Printed in the United States of America 1 2 3 4 5 6 7 8 9 0
First Printing January 2004
Library of Congress Cataloging-in-Publication Number: 2001097399
ISBN: 1-58705-079-x

Warning and Disclaimer


This book is designed to provide information about service level management. Every effort has been made to make
this book as complete and as accurate as possible, but no warranty or fitness is implied.
The information is provided on an "as is" basis. The author, Cisco Press, and Cisco Systems, Inc. shall have neither
liability nor responsibility to any person or entity with respect to any loss or damages arising from the information
contained in this book or from the use of the discs or programs that may accompany it.
The opinions expressed in this book belong to the author and are not necessarily those of Cisco Systems, Inc.

Feedback Information
At Cisco Press, our goal is to create in-depth technical books of the highest quality and value. Each book is crafted
with care and precision, undergoing rigorous development that involves the unique expertise of members from the
professional technical community.
Readers' feedback is a natural continuation of this process. If you have any comments regarding how we could
improve the quality of this book, or otherwise alter it to better suit your needs, you can contact us through e-mail
at feedback@ciscopress.com. Please make sure to include the book title and ISBN in your message.
We greatly appreciate your assistance.

Trademark Acknowledgments
All terms mentioned in this book that are known to be trademarks or service marks have been appropriately capitalized.
Cisco Press or Cisco Systems, Inc. cannot attest to the accuracy of this information. Use of a term in this book
should not be regarded as affecting the validity of any trademark or service mark.

Corporate and Government Sales


Cisco Press offers excellent discounts on this book when ordered in quantity for bulk purchases or special sales.
For more information, please contact:
U.S. Corporate and Government Sales 1-800-382-3419 corpsales@pearsontechgroup.com
For sales outside of the U.S., please contact:
International Sales 1-317-581-3793 international@pearsontechgroup.com


Publisher: John Wait
Editor-in-Chief: John Kane
Executive Editor: Brett Bartow
Cisco Representative: Anthony Wolfenden
Cisco Press Program Manager: Sonia Torres Chavez
Manager, Marketing Communications, Cisco Systems: Scott Miller
Cisco Marketing Program Manager: Edie Quiroz
Production Manager: Patrick Kanouse
Acquisitions Editor: Michelle Grandin
Development Editor: Jill Batistick
Project Editor: Marc Fowler
Copy Editor: Jill Batistick
Technical Editors: David M. Fishman, John P. Morency, Richard L. Ptak
Team Coordinator: Tammi Barnett
Book Designer: Gina Rexrode
Cover Designer: Louisa Adair
Composition: Mark Shirar
Indexer: Larry Sweazy


In Loving Memory
This book was finished as a final tribute to my late husband, John McConnell.
I hope these words keep his ideas alive in the industry a little longer.
Grace Morlock McConnell

My perception of my very special son, John W. McConnell


All that we can be, we must be.
Find a star and never settle for less.
John was born to be one of a kind.
Making his way with a mind of his own,
and making a difference and making it known.
He had his dreams and hopes to pursue.
by his mother, Jeanette McConnell


Dedication
John W. McConnell
December 9, 1943 - November 3, 2002
This book is dedicated to my wife, Grace, whose support has been so helpful in carving out the time and quiet
needed for this project. My friends and Grace have also provided a supportive environment and tolerated my
frequent absences to work with clients. Returning home to a warm community has been really important to me.


Acknowledgments
Many people have been part of this process of turning some ideas and experience into a book. First, my thanks to
the Cisco Press team, especially Michelle Grandin. The steady enthusiasm and willingness of all to help are deeply
appreciated.
In the same vein, the technical reviewers have been so helpful. I've had the pleasure of spending good time exchanging
views with John Morency and Rich Ptak at many analyst conferences and other events; their suggestions for this
manuscript were specific and helpful, and in some cases spurred some spirited discussions. Although I've never met
David Fishman face to face, I'd be pleased to buy him a good meal someday as thanks for so many good suggestions
and his attention to detail and integrity on getting it right.
Another group I want to acknowledge are the clients I've worked with around the world. I've gotten to learn a lot
about how technology is actually used and to work with people who want to push the envelope.
Finally, my thanks to my friends and colleagues in the industry who constantly stimulate and challenge me. It's
been a tremendous blessing to be among so many creative and independent thinkers and doers that have shaped the
networking industry.
John McConnell
It's impossible to begin these acknowledgements without wishing that John were still alive. This is his book, not
mine. He conceived it; he drafted it; he should have been writing this page. We all used to joke about how John
towered over the industry, and it wasn't just because of his height. In working from John's drafts to complete the
book, in talking to colleagues about his work, and in remembering the easy, jovial way he talked about examples of
industry practices, I was constantly reminded of his stature and of the friendly way he had. I think I can say, with
confidence, that everyone in the industry truly misses him; I certainly do.
John's wife wanted to see this book come to publication, and Cisco Press went far out of their way to make that happen.
Jill Batistick and Michelle Grandin, the editors, were wonderfully friendly and helpful; they made the process of
working through the chapters almost enjoyable. The technical reviewers, Rich Ptak, John Morency, and David Fishman,
put a tremendous amount of work into the book. They didn't just point out my errors; they suggested corrections
and entire new paragraphs that could improve the text. They were truly partners in bringing the book to publication.
I'd also like to thank Astrid Wasserman, of MediaLive International, Inc. (the organizers of Networld+Interop),
who gave me a copy of John's proposed two-day seminar on Service Level Management. Although it was never
presented, the seminar slides gave me a lot of insight into his ideas.
I have tried to stay close to John's original thoughts and text, although I have occasionally succumbed to temptation
and added additional information. Minor additions occur in all chapters; major additions are in Chapter 2 (measurement statistics), Chapter 6 (triage for quick assignment of problems to appropriate diagnostic teams), Chapter 8
(transaction response time), and Chapter 11 (flash loads and abandonment). Most of the additions are topics that I
had discussed with John at various conferences we attended together; I hope, and believe, that he would agree with
them. In all cases when the author speaks directly to the reader, that author is John.
Eric Siegel
October 14, 2003


About the Authors


John McConnell was involved in networking for over 30 years. A member of the ARPANET working group, John
contributed to early Internet architecture and protocol development. John has consulted with clients in the U.S.,
Europe, Asia, and the Middle East, and he has designed some of the first TCP/IP networks deployed in Europe and
the Middle East.
John served as a consultant in the areas of systems and network management with a focus on Service Level Management
(SLM), policy-based management solutions, and the emerging issues of management solutions for e-business.
John received a master's in electrical engineering and computer science from the University of California, Berkeley.
Eric Siegel, Principal Internet Consultant with Keynote Systems, Inc., the Internet performance authority, first
worked on the Internet in 1978. He wrote Designing Quality of Service Solutions for the Enterprise (John Wiley &
Sons) and has taught Internet performance tuning, SLM, and quality of service (QoS) at major industry conferences,
such as Networld+Interop.
Before joining Keynote Systems, Eric was a Senior Network Analyst at NetReference, Inc., where he specialized in
network architectural design for Fortune 100 companies, and he was a Senior Network Architect with Tandem
Computers, where he was the technical leader and coordinator for all of Tandem's data communications specialists
worldwide. Eric also worked for Network Strategies, Inc. and for the MITRE Corporation, where he specialized in
computer network design and performance evaluation. Eric received his B.S. and M.Eng. degrees in electrical engineering from Cornell University, where he was elected to the Electrical Engineering honor society.


About the Technical Reviewers


David M. Fishman is at Sun Microsystems, where he is responsible for availability measurement strategies in the
office of Sun's Chief Customer Advocate. Prior to that, he managed Sun's strategic technology relationship with
Oracle, driving technology alignment on High Availability (HA), Java technology, and performance. Before joining Sun
in 1996, Fishman held a variety of technical, product management, and business development positions at Mercury
Interactive Corporation. Previous work experience includes high-tech marketing and management in defense
electronics, embedded systems, and office automation. David holds an MBA from the School of Management at
Yale University. He lives in Sunnyvale, California, with his wife and two children.
John P. Morency is a 29-year veteran of the networking and telecommunications industries and president of
Momenta Research, Inc., a company that he founded in 2002. His industry experience includes network software
development, technical support, IT operations, industry consulting, product marketing, and business development.
Because of his wide range of experience, John has a unique ability to effectively assess the business, technological,
and operational impacts of new products and technologies. This is evidenced by the significant business case and
Total Cost of Ownership (TCO) work that John has done on behalf of hundreds of Fortune 1000 clients over the
past ten years, resulting in hundreds of millions of dollars in both top- and bottom-line benefits.
John's current research is focused on the business benefits attributable to the implementation of wireless LANs
(Wi-Fi), network telephony, content networking, system and network security, Web services, disaster recovery,
and IT process automation.
He is the author of over 400 publications on the operations and business impact of new IT technology. His speaking
and publication credentials include Networld+Interop, Network World, Billing World, Broadband Year, LightWave,
Telecommunications, and Telecom-Plus International, among many others.
Richard L. Ptak, founder of Ptak & Associates, Inc., has more than 25 years' experience providing consulting services
on the use of IT resources to achieve competitive advantage. Ptak earned his B.S. and M.S. at Kansas State University.
He earned his MBA at the University of Chicago.


Contents at a Glance

Preface

Part I     Service Level Agreements and Introduction to Service Level Management
Chapter 1  Introduction
Chapter 2  Service Level Management
Chapter 3  Service Management Architecture

Part II    Components of the Service Level Management Infrastructure
Chapter 4  Instrumentation
Chapter 5  Event Management
Chapter 6  Real-Time Operations
Chapter 7  Policy-Based Management
Chapter 8  Managing the Application Infrastructure
Chapter 9  Managing the Server Infrastructure
Chapter 10 Managing the Transport Infrastructure

Part III   Long-term Service Level Management Functions
Chapter 11 Load Testing
Chapter 12 Modeling and Capacity Planning

Part IV    Planning and Implementation of Service Level Management
Chapter 13 ROI: Making the Business Case
Chapter 14 Implementing Service Level Management
Chapter 15 Future Developments

Index


Contents

Preface

Part I  Service Level Agreements and Introduction to Service Level Management

Chapter 1  Introduction
  E-business Services
    B2B
    B2C
    B2E
    Webbed Services and the Webbed Ecosystem
  Service Level Management
  Structure of the Book
  Summary

Chapter 2  Service Level Management
  Overview of Service Level Management
    The Internal Role of the IT Group
    The External Role of the IT Group
    The Components of Service Level Management
    The Participants in a Service Level Agreement
  Metrics Within a Service Level Agreement
  Introduction to Technical Metrics
    High-Level Technical Metrics
      Workload
      Availability
      Transaction Failure Rate
      Transaction Response Time
      File Transfer Time
      Stream Quality
    Low-Level Technical Metrics
      Workload and Availability
      Packet Loss
      Latency
      Jitter
      Server Response Time
  Measurement Granularity
    Measurement Scope
    Measurement Sampling Frequency
    Measurement Aggregation Interval
  Measurement Validation and Statistical Analysis
    Measurement Validation
    Statistical Analysis
  Business Process Metrics
    Problem Management Metrics
    Real-Time Service Management Metrics
  Service Level Agreements
  Summary

Chapter 3  Service Management Architecture
  Web Service Delivery Architecture
  Service Management Architecture: History and Design Factors
    The Evolution of the Service Management Environment
    Service Management Architectures for Heterogeneous Systems
  Architectural Design Drivers
    Demands for Changing, Expanding Services
    Multiple Service Providers and Partners
    Elastic Boundaries Among Teams and Providers
    Demands for Fast System Management
    Data Item Definition and Event Signaling
  Service Management Architecture: A General Example
    Instrumentation
    Instrumentation Management
    SLA Statistics and Reporting
    Real-Time Event Handling, Operations, and Policy
    Long-Term Operations
    Back-Office Operations
  Summary

Part II  Components of the Service Level Management Infrastructure

Chapter 4  Instrumentation
  Differences Between Element and Service Instrumentation
  Information for Service Management Decisions
    Operational Technical Decisions
    Operational Business Decisions
    Decisions That Have Long-Term Effect
  Instrumentation Modes: Trip Wires and Time Slices
    Trip Wires
    Time Slices
  The Instrumentation System
    Starting with the Instrumentation Managers
    Collectors
    Aggregators
    Processing
    Ending with the Instrumentation Manager
  Instrumentation Design for Service Monitoring
    Demarcation Points
    Passive and Active Monitoring Techniques
      Passive Collection
      Active Collection
      Trade-Offs Between Passive and Active Collection
      Hybrid Systems
  Instrumentation Trends
    Adaptability
    Collaboration
    Tighter Linkage for Passive and Active Collection
  Summary

Chapter 5  Event Management
  Event Management Overview
    Alert Triggers
    Reliable Alert Transport
    Alert Management
  Basic Event Management Functions: Reducing the Noise and Boosting the Signal
    Volume Reduction
      Roll-Up Method
      De-duplication
      Intelligent Monitoring
    Artifact Reduction
      Verification
      Filtering
      Correlation
  Business Impact: Integrating Technology and Services
    Top-Down and Bottom-Up Approaches
    Modeling a Service
    Care and Feeding Considerations
    Prioritization
    Activation
    Coordination
  A Market-Leading Event Manager: Micromuse
    Netcool Product Suite
    Event Management
  Summary

Chapter 6  Real-Time Operations
  Reactive Management
    Triage
    Root-Cause Analysis
      Speed Versus Accuracy
    Case Study of Root-Cause Analysis
    Complicating Factors
      Brownouts
      Virtualized Resources
      The Value of Good Enough
  Proactive Management
    The Benefits of Lead Time
    Baseline Monitoring
    The Value of Predicting Behavior
  Automated Responses
    Languages Used with Automated Responses
    A Case Study
      Step 1: Assessing Local Impact
      Step 2: Adjusting Thresholds
      Step 3: Assessing Headroom
      Step 4: Taking Action
      Step 5: Reporting
    Building Automated Responses
    Picking Candidates for Automation
  Examples of Commercial Operations Managers
    Tavve Software's EventWatch
    ProactiveNet
    Netuitive
  Handling DDoS Attacks
    Traditional Defense Against DDoS Situations
    Defense Through Redundancy and Buffering
    Automated Defenses
    Organizational Policy for DDoS Defense
  Summary

Chapter 7  Policy-Based Management
  Policy-Based Management
  The Need for Policies
    Management Policies for Elements
    Service-Centric Policies
  A Policy Architecture
    Policy Management Tools
    Repository
    Policy Distribution
      The Pull (Component-Centric) Model
      The Push (Repository-Centric) Model
      Hybrid Distribution
    Enforcers
  Policy Design
    Policy Hierarchy
    Policy Attributes
    Policy Auditing
    Policy Closure Criteria
    Policy Testing
  Policy Product Examples
    Cisco QoS Policy Manager
    Orchestream Service Activator
  Summary

Chapter 8  Managing the Application Infrastructure
  Interaction of Operations and Application Development Teams
    The Effect of Organizational Structures
    The Need to Understand the Operational Environment
    Time Lines Are Shorter
  Application-Level Metrics
    Workload
    Customer Behavior Measurement
    Business Measurements
    Service Quality Measurement
  Transaction Response Time: An Example of Dependence on Lower-Level Services
    Serialization Delay
    Queuing Delay
    Propagation Delay
    Processing Delay
  The Need for Communications Among Design and Operations Groups
  Instrumenting Applications
    Instrumenting Web Servers
    Instrumenting Other Server Components
    End-User Measurements
  Summary

Chapter 9  Managing the Server Infrastructure
  Architecture of the Server Infrastructure
    Load Distribution and Front-End Processing
      Local Load Distribution
      Geographic Load Distribution
    Caching
    Content Distribution
  Instrumentation of the Server Infrastructure
    Load Distribution Instrumentation
    Cache Instrumentation
    Content Distribution Instrumentation
  Summary

Chapter 10  Managing the Transport Infrastructure
  Technical Quality Metrics for Transport Services
    Workload and Bandwidth
    Availability and Packet Loss
    One-Way Latency
    Round-Trip Latency
    Jitter
  QoS Technologies
    Tag-Based QoS
      IEEE 802 LAN QoS
      IP TOS
      IP DiffServ
      MPLS
      RSVP
    Traffic-Shaping QoS
      Rate Control
      Queuing
    Over-provisioning and Isolated Networks
  Managing Data Flows Among Organizations
    Levels of Control
    Demarcation Points
    Diagnosis and Recovery
  Summary

Part III  Long-term Service Level Management Functions

Chapter 11  Load Testing
  The Performance Envelope
  Load Testing Benchmarks
  Load Test Beds and Load Generators
  Building Transaction Load-Test Scripts and Profiles
  Using the Test Results
  Summary

Chapter 12  Modeling and Capacity Planning
  Advantages of Simulation Modeling
  Complexity of Simulation Modeling
  Simulation Model Examples
    Model Construction
    Model Validation
    Reporting
  Capacity Planning
  Summary

Part IV  Planning and Implementation of Service Level Management

Chapter 13  ROI: Making the Business Case
  Impact of ROI on the Organization
  A Basic ROI Model
    The ROI Mission Statement
    Project Costs
    Project Benefits
      Availability Benefits
      Performance Benefits
      Staffing Benefits
      Infrastructure Benefits
      Deployment Benefits
      Soft Benefits
  ROI Case Study
  Summary

Chapter 14  Implementing Service Level Management
  Phased Implementation of SLM
    Choosing the Initial Project
    Incremental Aggregation
  An SLM Project Implementation Plan
    Census and Documentation of the Existing System
    Specification of Performance Metrics
    Instrumentation Choices and Locations
      Passive Measurements
      Active Measurements
    Baseline of Existing System Performance
    Investigation of System Performance Sensitivities and System Tuning
    Construction of SLAs
      Roles and Responsibilities
      Reporting Mechanisms and Scheduled Reviews
      Dispute Resolution
  Summary

Chapter 15  Future Developments
  The Demands of Speed and Dynamism
  Evolution of Management Systems Integration
    Superficial Integration
    Data Integration
    Event Integration
    Process Integration
  Architectural Trends for Web Management Systems
    Loosely Coupled Service-Management Systems Architecture
    Process Managers
    Clustering and the Webbed Architecture
    Integrating the Components with Signaling and Messaging
    Loosely Coupled Service-Management Processes
  Business Goals for Service Performance
  Finding the Best Tools
  Summary

Index


Preface
Some years ago I received a true pearl of wisdom from an industry colleague. "In order to truly understand your
profession," he advised, "you must make the effort to learn other disciplines that are completely different from the
one that you espouse."
That colleague was John McConnell, a man who truly understood this advice by walking the talk over the course of
his life. Born into a military family, John developed a keen understanding of the importance of the global ecosystem at
a very young age through his childhood experiences in both Europe and the Far East. Despite being a shy, scholarly
individual throughout primary and secondary school, John also demonstrated the value of hard work and dedication
by making the varsity rowing team at U.C. Berkeley.
The strong work ethic that John nurtured at Berkeley served him well after he received his master's in computer science
in 1968. What differentiated John from many of his fellow graduates, however, was the application of his craft to non-IT
disciplines after graduation. Some of his first initiatives included the application of computer technology to measure
the rate of solar intensity upon the earth and the development of a programming language that was designed to test the
content and substance of moon samples brought back to earth by the Apollo astronauts. In addition, John developed
a number of network control programs for the ARPANET (the predecessor to today's Internet) in the mid-1970s
when the state of the commercial data networking industry was in its true infancy.
John also spent a number of years in professional capacities that had very little to do with information technology.
After graduate school, John became an accomplished massage therapist, hypnotist, and practitioner in the art of
Rolfing, a technique for the detection, treatment, and removal of bodily stress and pain. In 1983, using his Rolfing
technique, John was selected to work with the members of the U.S. Olympic bicycling team, and he applied this
technique to aid the team in preparing for the 1984 Olympic Games. Recently, when not consulting, John was training
to become an instructor in the Ridhwan Foundation, an institution whose focus is the rediscovery and integration of
the true self into one's own professional and personal life. Over the years, he had a myriad of personal interests
including soaring, mountain climbing, bird watching, backpacking, rowing, and blues festivals. One of his most
recent and satisfying accomplishments was the design, building, and completion of a second home in southern
Costa Rica that effectively enabled both him and his wife Grace to really get away from it all.
First and foremost, John's professional focus in the IT industry was the advancement of technologies and products
that improved the efficiency and the effectiveness of IT management.
Given his whole-life background, John was especially dedicated to reducing the operational and business pain
points associated with IT implementation and management. This focus is reflected in John's prior work,
Internetworking Computer Systems and Managing Client/Server Environments, as well as in Practical Service
Level Management: Delivering High-Quality Web-Based Services. John's numerous publications, conferences, and
televised briefings reflect a focused dedication to the removal of technological barriers to the optimal effectiveness
of IT organizations worldwide. His life experiences as a true Renaissance man uniquely enabled him both to understand
and to drive the level of change needed not only to improve the state of the art, but also the quality of life. John was
indeed the gold standard of knowledge, professionalism, and personal integrity that made the pursuit of these
goals not only a logical possibility, but, for many of us, a practical reality. The loss of John will be keenly felt for
some time, but the goals and values that he aspired to and embraced will inspire and guide many of us for years to
come.
John Morency, President, Momenta Research
May 2003


PART I

Service Level Agreements and Introduction to Service Level Management

Chapter 1  Introduction
Chapter 2  Service Level Management
Chapter 3  Service Management Architecture


CHAPTER 1

Introduction
The World Wide Web (the Web) is the catalyst for the changes in our communications,
work styles, business processes, and ways of seeking entertainment and information. The
Internet is just the transport infrastructure for the web-based services that drive so much
innovation. Note, however, that the Internet generally gets all the credit. As Thomas
Friedman writes in The Lexus and the Olive Tree:
The Internet is going to be like a huge vise that takes the globalization system that I have described, and
keeps tightening and tightening that system around everyone, in ways that will only make the world smaller
and smaller and faster and faster with each passing day.

This is an accurate description of the environment that most of us deal with directly on a
daily basis. The Internet is a tremendous business engine, and, as it transforms the ways we
do business, it is being transformed in turn by the ways we use it. We must learn how to
manage the growing array of online business services or risk being marginalized by a faster
moving and more dynamic business environment.
In this introductory chapter, I discuss the following:

The types of e-business services
A definition of webbed services and the webbed ecosystem
Service Level Management (SLM)
The structure of this book

E-business Services
E-business is a generic term defining business activities that are carried out totally, or in
part, through electronic communications between distributed organizations and people.
These activities are characterized by speed, flexibility, and constant change.
The Internet has become the vehicle for transforming business processes. The reasons for
its ascendancy include the following:

The Internet protocols are the only workable set of technologies that really provide a
high degree of interoperability among different systems.

The wide geographic reach of the Internet increases the size of any potential market.


Internet economies make it feasible to distribute information and transact business globally.

The introduction of the browser and its supporting technologies makes the Internet much easier to use, thereby increasing the potential market.

There are many ways of segmenting and describing the large variety of services available
through the Internet and the Web. A simple classification that covers most services is based
on the relationship of the business to customers, business partners, and employees. For
example, the process shown in Figure 1-1 describes a simple situation involving all three
types of relationships: business to business (B2B), business to consumer (B2C), and
business to employee (B2E). These segments are an easy way of organizing our thinking
about services, although it's important to remember that business processes in the real
world will have many variations and overlaps.
Figure 1-1  Business Relationships (internal enterprise systems connect to a supplier through B2B, to a customer through B2C, and to sales staff through B2E)

The following sections discuss each relationship type in turn.

B2B
B2B services are a broad category that incorporates transactions among different
businesses and government agencies. Many current B2B services, such as supply chain
management and credit authorization, use the Internet to drive down the costs and delays
associated with current processes and to boost their productivity.
B2B is rapidly broadening to include more than supply chain management and credit
authorization. Functions such as shipping, billing, and Customer Relationship Management


(CRM) are now often external to the business; other businesses provide and host these
specialized services as a utility. For example, entry of a customer's order can result in more
than the functions of pricing, authorizing, assembling, and shipping; a modern system
might use B2B links to provide the customer with a shipment tracking number from the
shipping company, and it might interact with an external CRM service to reflect the current
purchases and other factors of the customer's profile. Meanwhile, the salesperson might be
indirectly using B2B links to handle her commissions and personnel data through
outsourced employee management services, and engineering staff might use B2B links for
collaborative design.
Thanks to the Web, B2B is rapidly transforming into an even more dynamic set of services
from which an enterprise can select in real time. No one wants to be dependent on a single
supplier or customer; everyone must deal with competitive pressures exerted from both
sides. Services such as credit authorization and shipping are examples of those that can be
selected in real time based on their performance or costs. Other services and supplies may
be selected from web-based exchanges or e-markets.
B2B processes can be complex. They must follow the business requirements for tracking
orders, negotiating contracts, arranging payments, and reporting outcomes that govern
these processes when they take place without the automation of electronic communications.
Note that new benefits become available, although at the cost of additional complexity,
when B2B replaces older systems. For example, organizations can change their business
processes to increase their business effectiveness by obtaining real-time information on
order volumes, revenue rates, cancelled orders, and other factors. This additional
information, while adding to complexity, provides value in addition to the acceleration of
the processes themselves by identifying further efficiencies.
Continuous monitoring of B2B suppliers, partners, and web infrastructure
(communications, hosting, and exchanges) is necessary to determine whether they are
meeting their service quality commitments.
As in conventional commerce, managing across organizations adds complexity. All the
links in the B2B services chains are known, but these links are controlled by many different
organizations, are complex, and may change rapidly as services are selected in real time.
Managing B2B services therefore requires cooperation with the management teams of the
other participants and, possibly, with third-party measurement organizations to assure true
end-to-end service quality.

B2C
B2C garnered most of the early attention from the trade press and analysts as traditional
businesses took advantage of the Internet's wide geographic reach and low costs for
reaching customers. Some businesses (eBay and Amazon.com, for example) were founded
to exploit this new market opportunity.


B2C sites continually add new services of their own while offering links to related
businesses and services in an attempt to offer one-stop shopping (and selling) to their
customers. This is a highly competitive segment with little customer loyalty. The wide
selection of competing sites draws customers away whenever any one site has a service
disruption.
B2C environments are characterized by a lack of visibility and management control of the
customer-access infrastructure, which is the set of networks, caches, and other systems that
consumers use to connect to the B2C site. Customers usually don't want measurement tools
embedded in their systems, and the access infrastructure providers also resist making their
internal performance readily visible. There is also limited visibility into the performance of
partner sites (advertisers and other third parties), which are important parts of the
customer's perception of total site performance. The span of control and management
available to B2C sites is therefore usually limited to monitoring and managing their internal
operations (inside the firewall) as well as measurement of Internet delays and performance
as seen from various points on the edge of the Internet.

B2E
B2E services are also known as the intranet. These services help improve the internal
effectiveness of an organization and help it keep pace with its customers and business
partners. Many B2E services enable employees to query their benefits, schedule vacations,
fill out expense reports, and conduct a set of activities that formerly required a large staff to
coordinate.
B2C and B2E services use the web browser as the access device. Transactions are initiated
from the browser to deliver information and activate a range of business processes.
However, B2E environments are the only ones that enable administrators to have control of
both ends: the servers as well as the desktops, cell phones, and personal communicators
used to access them.

Webbed Services and the Webbed Ecosystem


In this book, I use the term webbed services to describe the set of business services that are
based on a component approach to systems design. This design is driven from the Web and
its associated technologies, regardless of the specific technologies used. Because webbed
services are constructed from a set of interconnected software components and services that
can be reused in multiple places, they can usually avoid some of the expense, time, and
effort associated with building and modifying monolithic applications.
Webbed services is a very inclusive term; it's increasingly difficult to find services that are
not somehow tied into the Web. As a case in point, I was recently speaking about webbed
services at a large retail organization, and someone in the group stated that their main
application did not fit into the webbed category because it was a stand-alone Oracle


Financials application. However, further discussion soon revealed that their international
operations used real-time currency conversion decisions. The real-time exchange rates in
the Oracle Financials application were, in fact, accessed through the Web.
Indeed, webbed services are now taking on many of the characteristics of an ecosystem,
which is a group of independent but interrelated elements comprising a unified whole. A
smooth business process depends on each element carrying out its tasks accurately and
quickly, with consideration for maintaining balances among all the elements. In a well-balanced webbed ecosystem, all elements bear appropriate shares of the load. None is
overwhelmed, none is underutilized. Balance is concurrently maintained between service
quality and service cost. The ecosystem metaphor is gaining momentum as online
processes evolve to dynamically select their elements (underlying services) based on their
current behavior and performance.
The webbed ecosystem perspective also holds within any subgroup of systems. For
instance, hosting facilities use a range of technologies, such as prioritizing devices,
bandwidth managers, global load balancers, and caches, to deliver online business services.
These systems also need balanced management; adding bandwidth when servers are
congested is a wasteful investment.

Service Level Management


Service quality is extremely important, given the accelerating number of critical business
processes going online. Customers and business partners go elsewhere if the services they
want are not available or are performing sluggishly. Unfortunately, good service quality is
a dynamic target and the demands continue to tighten. Competitors will match or exceed
your service quality levels, creating pressure to match or better theirs in turn.
Service Level Management (SLM) is the process of managing network and computing
resources to ensure the delivery of acceptable service quality at an acceptable price in an
acceptable time frame. It focuses on the behavior of the services rather than on tracking the
status of every router, switch, and server in the environment. Through SLM, service quality
is guaranteed and priced for different levels of service.
SLM is a competitive weapon in the marketplace, offering the guarantees needed to
transition critical business activities online. Poorly managed services have harmed many
businesses when their web sites crashed, their applications slowed to a crawl, or their Web
content was not attractively presented or was too difficult to navigate. Good service quality
helps retain customers and differentiate your organization from those that have not yet
mastered the art of managing service quality.
Effective SLM is also an economic weapon. Managing resources more effectively reduces
costs, creates more revenue opportunities, and leverages technology investments.
Finally, SLM is a means to build the solid business relationships that make online business
initiatives successful.


The basic terminology of SLM is as follows:

Service Level Agreement (SLA): A formal, negotiated contract between a service provider and a service user that defines the services to be provided, the service quality goals (often called service level indicators and service level objectives), and the actions to be taken if the service provider does not comply with the SLA terms.

Quality of Service (QoS): A technology-centered concept that focuses on the performance of the transport and service technologies underlying a webbed service. Examples are service availability, response time, and the technologies that measure and assure specific levels of transport infrastructure performance (packet loss, network transit time, and transit time variations).

Quality of Experience (QoE): A customer-centered concept that focuses on monitoring and assessing service quality from the end-customer perspective. This includes someone using a browser to access information and order merchandise, or a business conducting a series of exchanges to order products, negotiate terms, and arrange payment.

QoE is the most important to customers, yet it is also the most difficult to evaluate. For
example, I recently visited a large company that derives over half its revenues online. They
were justly proud of a new initiative that reduced web page download times by two seconds.
However, the content was so dense and difficult to navigate that users still needed a long
time to understand the directions and identify the buttons or links they wanted to use next.
Improved technical performance did not appreciably raise the QoE in this case; users
wasted at least two seconds looking for what they wanted.

Structure of the Book


The book is divided into four parts:

Part I: Service Level Agreements and Introduction to Service Level Management (Chapters 1-3): Chapter 1 defines the webbed ecosystem, and Chapter 2 defines and discusses SLAs along with typical technical and business process metrics, their statistical treatment, and recommendations for writing SLAs. Chapter 3 outlines both the overall architecture for service delivery on the Web and the overall architecture for managing that delivery. (Additional, detailed examples of technical metrics and management architectures are given in Part II.)

Part II: Components of the Service Level Management Infrastructure (Chapters 4-10): The first group of chapters (4-7) in this part discusses the details of the service management infrastructure. The group starts with measurement collection and aggregation technologies (Chapter 4) and then continues through the filtering and integration of real-time, measurement-detected events (Chapter 5). It concludes with the use of those filtered events by the operations staff (Chapter 6) and by automated, policy-based management systems (Chapter 7).


The second group of chapters (8-10) in this part steps through the major systems used for web service delivery. It looks at the ways they can be used to improve service delivery and also discusses their specific instrumentation needs, using the system management infrastructures described in the first part of this section. Chapter 8 investigates the instrumentation and management of applications and of end-user access devices, such as browsers. Chapter 9 looks at web server systems, including servers, load balancers, and content distribution networks. Finally, Chapter 10 discusses instrumentation and management of the transport infrastructure, including QoS technology and traffic shaping to achieve policy objectives.

Part III: Long-term Service Level Management Functions (Chapters 11-12): This part covers load testing, modeling, and capacity planning. No management system can provide necessary quality if the web serving system, as a whole, has insufficient capacity.

Part IV: Planning and Implementation of Service Level Management (Chapters 13-15): Calculation of Return on Investment (ROI) for SLM is critical to the justification and design of an implementation; it's covered in Chapter 13. Chapter 14 provides guidance for using the information in this book to design an SLM system for your particular situation, and the part ends with discussion in Chapter 15 of possible future developments in SLM.

Summary
The Internet, and the Web, are transforming business processes for interaction among
businesses, government, suppliers, customers, and employees. As more and more critical
business processes go online, the service quality of those processes becomes more
important to the success of business as a whole.
SLAs are the formal, negotiated contracts between service providers and service users that
define the services to be provided, their quality goals, and the actions to be taken if the SLA
terms are violated.
SLM is the process of managing network and computing resources to ensure the delivery
of acceptable service quality, usually as defined in an SLA, at an acceptable price in an
acceptable time frame. It is a competitive weapon in the marketplace because it can improve
customer relationships, create more revenue opportunities, and reduce costs.


CHAPTER 2

Service Level Management


Service Level Management (SLM) is a key for delivering the services that are necessary to
remain competitive in the Internet environment. Service quality must remain stable and
acceptable even when there are substantial changes in service volumes, customer activities,
and the supporting infrastructures.
Superior service quality also becomes a competitive differentiator because it reduces
customer churn and brings in new customers who are willing to pay the premiums for
guaranteed service quality. Customer churn is an insidious problem for almost every service
provider.
The competitive market increases customer acquisition costs because continuous
marketing and promotions are necessary just to replace the eroding customer base. Higher
customer acquisition costs must be dealt with by either raising prices (a difficult move in a
highly competitive market) or by taking longer to amortize the acquisition costs before
profitability for each customer is achieved. Improving customer retention therefore
dramatically increases profits.
This chapter covers the basics of SLM and lays part of the groundwork for the rest of the
book:

An overview of SLM
An introduction to technical metrics
Detailed discussions of measurement granularity and measurement validation
Business process metrics
Service Level Agreements (SLAs)

Note that the chapter ends with a summary discussion in the context of building an SLA.
Use of metrics in combination with the SLA's service level objectives to control
performance is discussed in Chapter 6, "Real-Time Operations," and Chapter 7, "Policy-Based Management."


Overview of Service Level Management


Often, one group's service provider is another group's customer. It is critical to understand
that service delivery is often, in fact, a chain of such relationships. As Figure 2-1 shows,
some entities, such as an IT group, can play different roles in the service delivery process.
As shown in the figure, a hosting company can be a customer of multiple service providers
while in turn acting as a service provider. An IT group may be a customer of several service
providers offering basic Internet connectivity, application hosting, content delivery, or other
services. Customers may use multiple providers of the same service to increase their
availability and to protect against dependence on a single provider. Customers will also use
specialized service providers to fulfill particular needs.
Figure 2-1  Roles of Customers and Service Providers (an IT group acts as a service provider while being a customer of a hosting company and a content delivery provider; those in turn are customers of ISPs, which are themselves customers of telephone companies; each entity plays both a service provider role and a customer role)

The Internal Role of the IT Group


An IT group serves the entire organization by aggregating demands of individual business
units and using them as leverage to reduce overall costs from service providers.
Today, such IT groups are making the necessary adjustments as managed services become
a mandatory requirement. IT managers are constantly reassessing the business and strategic


trade-offs of developing internal competence and expertise as opposed to outsourcing more
of the traditional IT work to external providers. The goal is to save money, protect strategic
assets, and maintain the necessary flexibility to meet new challenges.

The External Role of the IT Group


IT groups are increasingly being required to provide specific levels of service, and they are
also more frequently involved in helping business units negotiate agreements with external
service providers. Business units often choose to deal directly with service providers when
they have specialized needs or when they determine that the IT group cannot offer services
with competitive costs and benefits.
IT groups must therefore manage their own service levels as well as those of service
providers, and they must track compliance with negotiated SLAs.

The Components of Service Level Management


The process of monitoring service quality, detecting potential or actual problems, taking
actions necessary to maintain or restore the necessary service quality, and reporting on
achieved service levels is the core of SLM. Effective SLM solutions must deliver acceptable
service quality at an acceptable price.
Acceptable quality from a customer perspective means an ability to use the managed
services effectively. For example, acceptable quality may mean that an external customer
or business partner can transact the necessary business that will generate revenues,
strengthen business partnerships, increase the Internet brand, or improve internal
productivity. Specific ongoing measurements are carried out to determine acceptable
service quality levels, and noncompliance is noted and reported.
Acceptable costs must also be considered, because over-provisioning and throwing money
at service quality problems is not an acceptable strategy for either service providers or their
customers (and in spite of the cost, it often doesn't solve the problem). Service management
policies are applied to critical resources so that they are allocated to the appropriate
services; inappropriate activities are curtailed. Service providers that manage resources
effectively deliver superior service quality at competitive prices. Their customers, in turn,
must also increase their online business effectiveness and strengthen their bottom-line
results.

The Participants in a Service Level Agreement


The SLA is the basic tool used to define acceptable quality and any relationships between
quality and price. Because the SLA has value for both providers and customers, it's a
wonder why it has taken so long for it to become important. In practice, many organizations


and providers find the process of negotiating an acceptable SLA to be a difficult task. As
with many technical offerings, customers often experience difficulty in expressing what
they need in technical terms that are both measurable and manageable; therefore, they have
difficulty specifying their needs precisely and verifying that they are getting what they pay for.
Service providers, on the other hand, appreciate clearly-specified requirements and want to
take advantage of the opportunity to offer profitable premium services, but they also want
to minimize the risks of public failure and avoid increasingly stringent financial penalties
for noncompliance with the terms of the SLA.

Metrics Within a Service Level Agreement


Measurement is a key part of an SLA, and most SLAs have two different classes of metrics,
as shown in Figure 2-2, which may be divided into technical metrics and business process
metrics. Technical metrics include both high-level technical metrics, such as the success
rate of an entire transaction as seen by an end user, and low-level technical metrics, such as
the error rate of an underlying communications network. Business process metrics include
measures of provider business practices, such as the speed with which they respond to
problem reports.
Figure 2-2  Contents of a Service Level Agreement (technical metrics, both high-level, such as workload, availability, transaction failure, and transaction response time, and low-level, such as workload, availability, packet loss, one-way packet delay, jitter, and server response time; business process metrics, such as trouble response time, trouble relief time, and provisioning time; and metric specification and handling, covering granularity, validation, statistical analysis, and penalties and rewards)


Service providers may package the metrics into specific profiles that suit common customer
requirements while simplifying the process of selecting and specifying the parameters.
Service profiles help the service provider by simplifying their planning and resource
allocation operations.
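To make the notion of packaged metric profiles a bit more concrete, the following is a minimal sketch (in Python) of how a provider might bundle a few SLA metrics and target values into named profiles and check measurements against them. The profile names, metric choices, and thresholds are hypothetical illustrations, not values drawn from any particular provider's SLA.

# Hypothetical service profiles bundling SLA metrics with target values.
# Profile names, metrics, and thresholds are illustrative only.
SERVICE_PROFILES = {
    "standard": {
        "availability_pct": 99.5,          # high-level metric: perceived uptime
        "transaction_response_sec": 8.0,   # high-level metric: response-time target
        "trouble_response_min": 60,        # business process metric
    },
    "premium": {
        "availability_pct": 99.95,
        "transaction_response_sec": 4.0,
        "trouble_response_min": 15,
    },
}

def meets_profile(profile_name: str, measured: dict) -> bool:
    """Check measured values against the targets of one profile."""
    targets = SERVICE_PROFILES[profile_name]
    return (
        measured["availability_pct"] >= targets["availability_pct"]
        and measured["transaction_response_sec"] <= targets["transaction_response_sec"]
        and measured["trouble_response_min"] <= targets["trouble_response_min"]
    )

# Example: availability and response time meet the premium targets,
# but the 20-minute trouble response misses the 15-minute target.
sample = {"availability_pct": 99.97, "transaction_response_sec": 3.2,
          "trouble_response_min": 20}
print(meets_profile("premium", sample))   # False

A provider offering such profiles can publish only the targets; how low-level metrics are derived from them is worked out with the customer's technical staff, as discussed later in this chapter.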

Introduction to Technical Metrics


Technical metrics are a core component of SLAs. They are used to quantify and to assess
the key technical attributes of delivered services.
Examples of technical metrics are shown in Table 2-1. They are separated into the two basic
groups: high-level metrics, which deal with attributes that are highly relevant to end users
and are easily understood by them, and low-level metrics, which deal with attributes of the
underlying technologies. Note that you should be very specific when defining these terms
in an agreement. Although many of these terms are in common use, their definitions vary.
Table 2-1  Examples of Technical Metrics

High-Level Technical Metrics
Workload: Applied workload in terms understandable by the end user (such as end-user transactions/second)
Availability: Percentage of scheduled uptime that the system is perceived as available and functioning by the end user
Transaction Failure Rate: Percentage of initiated end-user transactions that fail to complete
Transaction Response Time: Measure of response-time characteristics of a user transaction
File Transfer Time: Measure of total transfer-time characteristics of a file transfer
Stream Quality: Measure of the user-perceived quality of a multimedia stream

Low-Level Technical Metrics
Workload: Applied workload in terms relevant to underlying technologies (such as database transactions/second)
Availability: Percentage of scheduled uptime that the subsystem is available and functioning
Packet Loss: Measure of one-way packet loss characteristics between specified points
Latency: Measure of transit time characteristics between specified points
Jitter: Measure of the transit time variability characteristics between specified points
Server Response Time: Measure of response-time characteristics of particular server subsystems
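Because several entries in Table 2-1 are expressed as a percentage of scheduled uptime, a short worked sketch (in Python) may help; the maintenance window and outage figures below are hypothetical.

# Hypothetical month: 30 days, a 4-hour scheduled maintenance window,
# and 45 minutes of unplanned outage observed by end users.
total_minutes = 30 * 24 * 60            # 43,200 minutes in the month
scheduled_maintenance = 4 * 60          # excluded from scheduled uptime
unplanned_outage = 45

scheduled_uptime = total_minutes - scheduled_maintenance
availability_pct = 100.0 * (scheduled_uptime - unplanned_outage) / scheduled_uptime
print(f"Availability: {availability_pct:.3f}%")   # roughly 99.895%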


Workload is an important characteristic of both high- and low-level metrics. It's not a
measure of delivered quality; instead, it's a critical measure of the load applied to the
system. For example, consider the workload of serving web pages. A text-only page might
comprise only 10 K bytes, whereas a graphics page could comprise a few megabytes. If the
requirement is to deliver a page in six seconds to the end user, massively different
bandwidth and capacity will be necessary. Indeed, content may need to be altered for low-speed connections to meet the six-second download time.

NOTE

In many situations, certain technical metrics aren't specified in the SLA. Instead, the supplier is asked to use best effort, which represents the classic Internet delivery strategy of "get it there somehow" without concern for service quality. Today, best effort represents the commodity level for services. There are no special treatments for best-effort services. The only need is that there are sufficient resources to prevent best-effort services from starving out, which means having the connection time out because of long periods of inactivity.

Discussions of all of the examples in Table 2-1 follow, to illustrate the basic concepts of technical metrics. Additional descriptions of these metrics, and other technical metrics, appear in Chapters 4 and 8 through 10.

High-Level Technical Metrics


These metrics deal with workload and performance as seen and understood by the end user.

Workload
The workload high-level technical metric is the measure of applied load in end-user terms. It's unreasonable to expect a service provider to agree to service levels for an unspecified amount of workload; it's also unreasonable to expect that an end user will willingly substitute obscurely related low-level workload metrics for understandable high-level metrics. SLAs should therefore begin by specifying the high-level workload metrics, and service providers can then work with the customer's technical staff to derive low-level workload metrics from them.
For transaction systems, the workload metric is usually specified in terms of the end-user transaction mix and volumes, which typically vary according to time of day and other business cycles. For existing systems, these statistics can be obtained from logs; for new systems or situations (such as a proposed major advertising campaign designed to drive prospective customers to a web site), the organization's marketing group or their consultants should work to produce the most accurate, specific estimates possible. These workload estimates for new systems should be used for load testing as well as for SLAs.

Transaction workload metrics must include end-user tolerance for transaction response time delays. If response time delays are too long, external customers will abandon the transaction. In legacy systems where external customers did not interact directly with the server systems, abandonment was not a factor in workload testing. Call-center operators handled any delays by talking to the customers, shielding them from the problem, if necessary. On the Web, customers see the delays without any shielding, and they may decide at any point to abandon the transaction, with immediate impact on the server system's workload.
Another effect of the direct connection between customers and web-serving systems is that there's no buffer between those customers and the servers. In a call center, the workload is buffered by external queues. Incoming calls go through an automatic call distribution system; callers are placed on hold until an operator is available. In an order-entry center, the workload is buffered by the stack of documents on the entry clerk's desk. In contrast, the web workload has no external buffer; massive spikes in workload hit the servers instantly. These spikes in workload are called flash load, and they must be specified in the workload metric and considered during load testing. Load specification for the Web should therefore be in terms of arrival rate, not concurrent users, as was the case for call centers and order-entry centers.
File-serving, web-page, and streaming-media workload metrics are similar to transaction metrics, but simpler. They're usually specified in terms of the size and number of files that must be transferred in a given time interval. (For web pages, the types of the files are usually specified. Dynamically generated files are clearly more resource-intensive than stored static files.) The serving system must have the bandwidth to serve the files, and it must also be able to handle the anticipated number of concurrent connections. There's a relationship between these two variables; given a certain arrival rate, higher end-to-end bandwidth results in fewer concurrent users.

Availability
Availability is the percentage of time that the system is perceived as available and functioning by the end user. It is a function of both the Mean Time Between Failures (MTBF) and the Mean Time To Repair (MTTR). Scheduled downtime might, in some organizations, be excluded from these calculations. In those organizations, a system can be declared 100 percent available even though it's down for an hour every night for system maintenance.
Availability is a binary measurement: the service is either available or it isn't. For the end user, and therefore for the high-level availability metric, the fact that particular underlying components of a service are unavailable is not a concern if that unavailability is concealed through redundant systems design.
Availability can be improved by increasing the MTBF or by decreasing the time spent on each failure, which is measured by the MTTR. Chapter 3, "Service Management Architecture," introduces the concept of triage, which decreases MTTR through quick assignment of problems to the appropriate specialist organization.
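As a rough illustration of the relationship (a sketch only; organizations differ on whether scheduled downtime is excluded), steady-state availability is often approximated as MTBF divided by the sum of MTBF and MTTR:

    # Sketch: estimating steady-state availability from MTBF and MTTR.
    # Assumption: availability = MTBF / (MTBF + MTTR); whether scheduled
    # downtime counts against MTTR is an SLA decision, not fixed here.
    def availability(mtbf_hours, mttr_hours):
        return mtbf_hours / (mtbf_hours + mttr_hours)

    # Example: failures every 500 hours on average, 2 hours to restore service.
    print(round(availability(500.0, 2.0), 5))   # 0.99602, or about 99.6 percent

Shrinking the MTTR in this sketch from 2 hours to 30 minutes raises availability to roughly 99.9 percent, which is why triage and fast problem assignment matter.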

Transaction Failure Rate


A transaction fails if, having successfully started, it does not successfully complete.
(Failure to start is the result of an availability problem.) As is true for availability, systems
design and redundancy may conceal some low-level failures from the end user and
therefore exclude the failures from the high-level transaction failure rate metric.

Transaction Response Time


This metric represents the acceptable delay for completing a transaction, measured at the
level of a business process.
It's important to measure both the total time to complete a transaction and the elapsed time per page of the transaction. That's because the end user's perception of transaction time, which will be used to compare your system with your competitors', is based on total transaction time, regardless of the number of pages involved, while the slowest page will influence end-user abandonment of a web transaction.

File Transfer Time


The file transfer time metric is closely associated with specified workload and is a measure of success. The file transfer workload metric describes the work that must be accomplished in a certain period; the file transfer time metric shows whether that workload was successfully handled. Lack of end-to-end bandwidth, an insufficient number of concurrent connections, or persistent transmission errors (requiring retransmission) will influence this measure.

Stream Quality
The quality of multimedia streams is difficult to measure. Although underlying low-level technical metrics, such as frame loss, can be obtained, their relationship to the quality as perceived by an end user is very complex.
Streaming is a real-time service in which the content continues flowing even with variations in the underlying data transmission rates and despite some underlying errors. A content consumer may see a small blemish on a graphic because a packet is lost in transit, equivalent to static on your car radio. There is no rewinding and playing it again, as there might be with interactive services. Thus, packet loss is handled by just continuing with the streaming rather than retransmitting lost packets.


Occasional packet loss can still be tolerated and sometimes may not even be noticed. If packet loss increases, quality will begin to degrade until it falls below a threshold and becomes unacceptable. Years of development have been focused on concealing these low-level errors from the multimedia consumer, and the major existing technologies from Microsoft, Real Networks, Apple, and others have different sensitivities to these errors.
Nevertheless, quality must be measured. The telephone companies years ago established the Mean Opinion Score (MOS), a measure of the quality of telephone voice transmission. There are also international standards for evaluation of audio and video quality as perceived by human end users; examples are the International Telecommunication Union's ITU-T P.800-series and P.900-series standards and the American National Standards Institute's T1.518 and T1.801 standards. Simpler methods are also in use, such as measuring the percentage of successful connection attempts to the streaming server, the effective bandwidth delivered over that connection, and the number of rebuffers during transmission.

Low-Level Technical Metrics


These metrics deal with workload and performance of the underlying technical subsystems, such as the transport infrastructure. Low-level technical metrics can be selected and defined by first understanding the high-level technical metrics and their implications for the performance requirements placed on underlying subsystems. For example, a clear understanding of required transaction response time and the associated transaction characteristics (the number of transits across the transport network, the size of each transit, and so on) can help set the objective for the low-level technical metric that measures network transit time (latency).

Workload and Availability


These low-level technical metrics are similar to those for the high-level discussion, but they're focused on performance characteristics of the underlying systems rather than on performance characteristics that are directly visible to end users. Their correlation with the high-level metrics depends on the particular system design and the degree of redundancy and substitution within that design.
Throughput, for example, is a low-level technical metric that measures the capacity of a particular service flow. Services with rich content or critical real-time requirements might need sufficient bandwidth to maintain acceptable service quality. Certain transactions, such as downloading a file or accessing a new web page, might also require a certain bandwidth for transferring rich content, such as complex graphics, within the specified transaction delay time.


Packet Loss
Packet loss has different effects on the end-user experience, depending on the service using the transport. The choice of a packet loss metric for a particular application must be carefully considered. For example, packet loss in file transfer forces retransmission unless the high-level transport contains embedded error correction codes. In contrast, moderate packet loss in streaming media may have no user-perceptible effect at all, unless bad luck results in the loss of a key frame.
The burst length must be included in packet loss metrics. Usually a uniform distribution of dropped packets over longer time intervals is implicitly assumed. For example, out of every 100 packets, two could be lost without violating an SLA calling for two percent packet loss. There is a different perspective if you examine behavior over longer intervals, such as 1,000 packets: up to 20 packets in a row could be lost without violating the SLA. However, losing 20 consecutive packets, creating a significant gap in received data, might drive quality levels to unacceptable values.
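The arithmetic behind that observation is simple; this small sketch (illustrative numbers only, not from any particular SLA) shows how the same two percent loss budget allows very different burst lengths depending on the interval over which loss is averaged:

    # Sketch: the same average-loss budget permits very different bursts of
    # consecutive lost packets depending on the averaging window.
    def allowed_burst(loss_budget, window_packets):
        # Largest run of consecutive losses that still meets the budget.
        return int(loss_budget * window_packets)

    for window in (100, 1000, 10000):
        print(window, "packets:", allowed_burst(0.02, window), "consecutive losses allowed")
    # 100 packets: 2, 1000 packets: 20, 10000 packets: 200

This is why a packet loss metric should bound burst length (or use a short averaging window) in addition to the long-term average.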

Latency
Latency is the time needed for transit across the network; it's critical for real-time services. Excessive latency quickly degrades the quality of web sites and of interactive sound and video.
Routes in the Internet are usually asymmetric, with flows often taking different paths coming and going between any pair of locations. Thus, the delays in each direction are usually different. Fortunately, most Internet applications are primarily sensitive to round-trip delays, which are much simpler to measure than one-way delays. File transfer, web sites, and transactions all require a flow of acknowledgments in the opposite direction to the data flow. If acknowledgments are delayed, transmission temporarily ceases. The round-trip latency therefore controls the effective bandwidth of the transmission.
Round-trip latency is much simpler to measure than one-way latency, because clock synchronization of separated locations is not necessary. That synchronization can be quite tricky if it is accomplished across the same network that's having its one-way delay measured. In that case, fluctuations in the metric that's being measured (one-way latency) can easily affect the stability of the measurement apparatus for one-way latency. An external reference, such as the satellite Global Positioning System (GPS) timers, is often used in such situations.

Jitter
Jitter is the deviation in the arrival of data from ideal, evenly spaced arrival; see Figure 2-3. Some packets may be bunched more closely together (in terms of inter-packet delays) or spread farther apart after crossing the network infrastructure. Jitter is caused by the internal operation of network equipment, and it's unavoidable; it is created whenever there are queues and buffering in a system. Extreme jitter is also created when there's rerouting of packets because of network congestion or failure.

Figure 2-3  Jitter (ideal packet spacing compared with actual packet spacing after crossing the network)

Interactive teleconferencing is an example of a service that is extremely sensitive to jitter; too much jitter can make the service completely useless. Therefore, a reduction in jitter, approaching zero, represents an increase in quality.
Buffering in the receiving device can be used to smooth out jitter; the jitter buffer is familiar to those of us who have a CD player in the car. Small bumps are smoothed out and the sound quality remains acceptable, but hitting a pothole usually causes more disturbance than the buffer can overcome. The dejitter buffer allows for latency that is typically one or two times that of the expected jitter; it's not a cure for all situations. The time spent in the dejitter buffers is an important contributor to total system latency.
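One common way to quantify jitter, shown as a rough sketch below, is to compare actual inter-packet gaps with the ideal, even spacing; production tools (RTCP-based monitors, for example) use their own smoothing formulas, so treat this only as an illustration:

    # Sketch: quantifying jitter as deviation of inter-packet gaps from the
    # ideal, evenly spaced arrival. Arrival times are in milliseconds.
    def jitter_stats(arrival_times_ms, ideal_gap_ms):
        gaps = [b - a for a, b in zip(arrival_times_ms, arrival_times_ms[1:])]
        deviations = [abs(g - ideal_gap_ms) for g in gaps]
        return max(deviations), sum(deviations) / len(deviations)

    # Packets sent every 20 ms; queuing bunches some and spreads others.
    arrivals = [0.0, 21.5, 39.0, 62.0, 80.5, 100.0]
    peak, mean = jitter_stats(arrivals, 20.0)
    print(round(peak, 1), "ms peak jitter,", round(mean, 1), "ms mean jitter")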

Server Response Time


Similar to the high-level technical metric transaction response time, this metric measures the individual response-time characteristics of underlying server systems. A common example is the response time of the database back-end systems to specific query types. Although not directly seen by end users, this is an important part of overall system performance.

Measurement Granularity
The SLA must describe the granularity of the measurements. There are three related parts
to that granularity: the scope, the sampling frequency, and the aggregation interval.

Measurement Scope
The first consideration is the scope of the measurement, and availability metrics make an excellent example. Many providers define the availability of their services based on an overall average of availability across all access points. This is an approach that gives the service providers the most flexibility and cushion for meeting negotiated levels.


Consider if your company had 100 sites and a target of 99 percent availability based on an overall average. Ninety-nine of your sites could have complete availability (100 percent) while one could have zero. Having a site with an extended period of complete unavailability isn't usually acceptable, but the service provider has complied with the negotiated terms of the SLA.
If the availability level is specified on a per-site basis instead, the provider would be found noncompliant, and appropriate actions would follow in the form of penalties or lost customers. The same principle applies when measuring the availability of multiple sites, servers, or other units.
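A quick sketch of the hypothetical 100-site example makes the difference in scope concrete:

    # Sketch: overall-average versus per-site availability for the hypothetical
    # 100-site example with a 99 percent target.
    sites = [1.0] * 99 + [0.0]               # 99 perfect sites, one site down all month
    target = 0.99

    overall = sum(sites) / len(sites)
    print(overall >= target)                  # True: the aggregate SLA is met
    print(all(s >= target for s in sites))    # False: a per-site SLA is breached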
Availability has an additional scope dimension, in addition to breadth: the depth to which the end user can penetrate to the desired service. To use a telephone analogy, is dial tone sufficient, or must the end user be able to reach specific numbers? In other words, which transactions must be accessible for the system to be regarded as available?
Scope issues for performance metrics are similar to those for the availability metric. There
may be different sets of metrics for different groups of transactions, different times of day,
and different groups of end users. Some transactions may be unusually important to
particular groups of end users at particular times and completely unimportant at other
times.
Regardless of the scope selected for a given individual metric, it's important to realize that executive management will want these various metrics aggregated into a single measure of overall performance. Derivation of that aggregated metric must be addressed during measurement definition.

Measurement Sampling Frequency


A shorter sampling interval catches problems sooner at the expense of consuming additional network, server, and application resources. Longer intervals between measurements reduce those impacts while possibly missing important changes, or at least not detecting them as quickly as a shorter interval would. Customers and service providers will need to negotiate the measurement interval because it affects the cost of the service to some extent.
Statisticians recommend that sampling be random because it avoids accidental synchronization with underlying processes and the resulting distortion of the metric. Random sampling also helps discover brief patterns of poor performance; consecutive bad results are more meaningful than individual, spaced-out difficulties.
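A minimal sketch of one way to randomize the sampling schedule (the five-minute mean interval is an arbitrary assumption) is to draw each delay from an exponential distribution, which yields Poisson sampling and avoids locking in step with periodic behavior in the system under test:

    # Sketch: Poisson (random) sampling intervals instead of a fixed period.
    import random

    def next_sample_delay(mean_interval_s=300.0):
        # Exponentially distributed delay with the requested mean.
        return random.expovariate(1.0 / mean_interval_s)

    print([round(next_sample_delay(), 1) for _ in range(5)])   # e.g. [512.3, 88.0, ...]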
Confidence interval calculations can be used to help determine the sampling frequency. Although it is impossible to perform an infinite number of measurements, it is possible to calculate a range of values that we're reasonably sure would contain the true summary values (median, average, and so on) if you could have performed an infinite number of measurements. For example, you might want to be able to say the following: "There's a 95 percent chance that the true median, if we could perform an infinite number of measurements, would be between five seconds and seven seconds." That is what the 95 Percent Confidence Interval seeks to estimate, as shown in Figure 2-4. When you take more measurements, the confidence interval (two seconds in this example) usually becomes narrower. Therefore, confidence intervals can be used to help estimate how many measurements you'll need to obtain a given level of precision with statistical confidence.
Figure 2-4  Confidence Interval for Internet Data (percentage of measurements plotted against response time in seconds, showing the actual median and the confidence interval around it)

There are simple techniques for calculating confidence intervals for normal distributions of data (the familiar bell-shaped curve). Unfortunately, as discussed in the subsequent section on statistical analysis, Internet distributions are so different from the normal distribution that these techniques cannot be used. Instead, the statistical simulation technique known as bootstrapping can be used for these calculations on Internet distributions.
In some cases, depending on the pattern of measurements, simple approximations for calculating confidence intervals may be used. Keynote Systems recommends the following approximation for calculating the confidence interval for availability metrics. (This information is drawn from "Keynote Data Accuracy and Statistical Analysis for Performance Trending and Service Level Management," Keynote Systems Inc., San Mateo, California, 2002.) The procedure is as follows:

- Omit data points that indicate measurement problems instead of availability problems.

- Calculate a preliminary estimate of the 95 percent confidence interval for average availability (avg) of a measurement sample with n valid data points:

  Preliminary 95 Percent Confidence Interval = avg +/- (1.96 * square root of [(avg * (1 - avg)) / (n - 1)])

  For example, with a sample size n of 100, if 12 percent of the valid measurements are errors, the average availability is 88 percent. The confidence interval is calculated by the formula as (0.82, 0.94). This suggests that there's a 95 percent probability that the true average availability, if we'd miraculously taken an infinite number of measurements, is between 82 and 94 percent. Notice that even with 100 measurements, this confidence interval leaves much room for uncertainty! To narrow that band, you need more valid measurements (a larger n, such as 1000 data points).

- Now you must decide if the preliminary calculations are reasonable. We suggest that the preliminary calculation should be accepted only if the upper limit is below 100 percent and the lower limit is above 0 percent. (The example just used gives an upper limit > 100% for n = 29 or fewer, so this rule suggests that the calculation is reasonable if n = 30 or greater.)

  Note that we're not saying that the confidence interval is too wide if the upper limit is above 100 percent (or if the average availability itself is 100 percent because no errors were detected); we're saying that you don't know what the confidence interval is. The reason is that the simplifying assumptions used to construct the calculation break down if there are not enough data points.
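A minimal sketch of this preliminary approximation, restating the formula and the rule-of-thumb check above (the function and variable names are mine, not Keynote's):

    # Sketch: preliminary 95 percent confidence interval for average availability,
    # with the suggested sanity check on the limits.
    import math

    def availability_ci(avg, n):
        half_width = 1.96 * math.sqrt((avg * (1.0 - avg)) / (n - 1))
        return avg - half_width, avg + half_width

    lower, upper = availability_ci(avg=0.88, n=100)
    print(round(lower, 2), round(upper, 2))        # 0.82 0.94, as in the example
    print(0.0 < lower and upper < 1.0)             # True: accept the estimate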

For performance metrics, a simple solution to the problem of confidence intervals is to use geometric means and geometric deviations as measures of performance, which are described in the subsequent section in this chapter on statistical analysis.
Keynote Systems suggests, in the paper previously cited, that you can approximate the 95 Percent Confidence Interval for the geometric mean as follows, for a measurement sample with n valid (non-error) data points:

  Upper Limit = [geometric mean] * [(geometric deviation) ^ (1.96 / square root of [n - 1])]
  Lower Limit = [geometric mean] / [(geometric deviation) ^ (1.96 / square root of [n - 1])]

This is similar to the use of the standard deviation with normally distributed data and can be used as a rough approximation of confidence intervals for performance measurements. Note that this ignores cyclic variations, such as by time of day or day of week; it is also somewhat distorted because even the logarithms of the original data are asymmetrically distributed, sometimes with a skew greater than 3. Nevertheless, the errors encountered using this recipe are much less than those that result from the usual use of mean and standard deviation.
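The same approximation in code form (the sample values for the geometric mean and deviation are made-up illustrations):

    # Sketch: approximate 95 percent confidence interval around a geometric mean,
    # given the geometric deviation (a multiplicative factor) and n valid points.
    import math

    def geometric_ci(geo_mean, geo_dev, n):
        factor = geo_dev ** (1.96 / math.sqrt(n - 1))
        return geo_mean / factor, geo_mean * factor

    lower, upper = geometric_ci(geo_mean=5.0, geo_dev=1.8, n=101)
    print(round(lower, 2), round(upper, 2))   # roughly 4.46 to 5.61 seconds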

Measurement Aggregation Interval


Selecting the time interval over which availability and performance are aggregated should also be considered. Generally, providers and customers agree upon time spans ranging from a week to a month. These are practical time intervals because they will tend to hide small fluctuations and irrelevant outlying measurements, but still enable reasonably prompt analysis and response. Longer intervals enable longer problem periods before the SLA is violated.


Table 2-2 shows this idea. If availability is measured on a small scale (hourly), a high-availability requirement such as "five nines" (99.999 percent) permits only 0.036 seconds of outage before there's a breach of the SLA. Providers must provision with adequate redundancy to meet this type of stringent requirement, and clearly they will pass on these costs to the customers that demand such high availability.
Table 2-2  Measurement Aggregation Intervals for Availability

Availability      Allowable Outage for Specified Aggregation Intervals
Percentage        Hour         Day          Week         4 Weeks
98%               1.2 min      28.8 min     3.36 hr      13.4 hr
98.5%             0.9 min      21.6 min     2.52 hr      10 hr
99%               0.6 min      14.4 min     1.68 hr      6.7 hr
99.5%             0.3 min      7.2 min      50.4 min     3.36 hr
99.9%             3.6 sec      1.44 min     10 min       40 min
99.99%            0.36 sec     8.64 sec     1 min        4 min
99.999%           0.036 sec    0.864 sec    6 sec        24 sec

If a monthly (four-week) measurement interval is chosen, the 99.999 percent level indicates that a cumulative outage of 24 seconds per month is permitted while remaining in compliance. A 99.9 percent availability level permits up to 40 minutes of accumulated downtime for a service each month. Many providers are still trying to negotiate SLAs with availability levels ranging from 98 to 99.5 percent, or cumulative downtimes of 13.4 to 3.4 hours each month.
Note that these values assume 24 x 7 x 365 operation. For operations that do not require round-the-clock availability, or are not up during weekends, or have scheduled maintenance periods, the values will change. That said, they're pretty easy to compute.
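The computation is indeed simple: the allowable outage is just the unavailability fraction multiplied by the aggregation interval. A sketch that reproduces the Table 2-2 values for round-the-clock operation:

    # Sketch: allowable cumulative outage for an availability target over a
    # given aggregation interval, assuming 24 x 7 operation as in Table 2-2.
    HOURS = {"hour": 1, "day": 24, "week": 168, "4 weeks": 672}

    def allowable_outage_seconds(availability_pct, interval):
        return (1.0 - availability_pct / 100.0) * HOURS[interval] * 3600

    print(round(allowable_outage_seconds(99.999, "hour"), 3))         # 0.036 seconds
    print(round(allowable_outage_seconds(99.9, "4 weeks") / 60))      # about 40 minutes
    print(round(allowable_outage_seconds(98.0, "4 weeks") / 3600, 1)) # about 13.4 hours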
The key is for the service provider and the service customer to set a common definition of the critical time interval. Because longer aggregation intervals permit longer periods during which metrics may be outside tolerance, many organizations must look more deeply at their aggregation definitions and at their tolerance for service interruption. A 98 percent availability level may be adequate and also economically acceptable, but how would the business function if the 13.4 allotted hours of downtime per month occurred in a single outage? Could the business tolerate an interruption of that length without serious damage? If not, then another metric that limits the interruption must be incorporated. This could be expressed in a statement such as the following: "Monthly availability at all sites shall be 98 percent or higher, and no service outage shall exceed three minutes." In other words, a little arithmetic to evaluate scenarios for compliance goes a long way.


Measurement Validation and Statistical Analysis


The Internet and Web are extremely complex statistically. Invalid measurements and
incorrect statistical analysis can easily lead to SLA violations and penalties, which may
then fall apart when challenged by the service provider using a more appropriate analysis.
Therefore, special care must be taken to discard invalid measurements and to use the
appropriate statistical analysis methods.

Measurement Validation
Measurement problems, which are artifacts of the measurement process, are inevitable in any large-scale measurement system. The important issues are how quickly these errors are detected and tagged in the database, and the degree of engineering and business integrity that's applied to the process of error detection and tagging.
Measurement problems can be caused by instrument malfunction, such as a response timer that fails, and by synthetic transaction script failure, which leads to false transaction error reports. They can also be caused by abnormal congestion on a measurement tool's access link to the backbone network and by many other factors. These failures are failures of the measurement system, not of the system being measured. They therefore are best excluded from any SLA compliance metrics.
Detection and tagging of erroneous measurements may take time, sometimes up to a day or more, as the measurement team investigates the situation. Fortunately, SLA reports are not generally done in real time, and there's therefore an opportunity to detect and remove such measurements.
The same measurements will probably also be used for quick diagnosis, or triage, and that usage requires real-time reporting. There's therefore no chance to remove erroneous measurements before use, and the quick diagnosis techniques must themselves handle possible problems in the measurement system. Good, fast-acting artifact reduction techniques (discussed in Chapter 5, "Event Management") can eliminate a large number of misleading error messages and reduce the burden on the provider management system.
An emerging alternative is using a trusted, independent third party to provide the monitoring and SLA compliance verification. The advantage of having an independent party provide the information is that both service providers and their customers can view this party as objective when they have disputes about delivered service quality.
Keynote Systems and Brix Networks are early movers into this market space. Keynote
Systems provides a service, whereas Brix Networks provides an integrated set of software
and hardware measurement devices to be installed and managed by the owner of the SLA.
They both provide active, managed measurement devices placed at the service demarcation
points between customers and providers or between different providers. (Other companies,
such as Mercury Interactive and BMC, now offer similar services and software.)


The measurement devices, known as "agents" in the Keynote service and "verifier platforms" in the Brix service, carry out periodic service quality measurements. They collect information and reduce it to trends and baselines. There is also a real-time alerting component when the measurement device detects a noncompliant situation. Alerts are forwarded to the Keynote or BrixWorx operations center, where they are logged and included in service level quality reports. Because the Keynote system is a service, Keynote provides measurement device management and measurement validation.
Keynote and BrixWorx also offer integration with other management systems and support systems for reporting to customers, provisioning staff, and other back-office departments. Test suites for more detailed testing are also stored at the center and deployed to the measurement platforms as necessary.
Trusted third parties may be the solution needed to reduce the problems when customer
experience and provider data are not in close agreement.

Statistical Analysis
Most statistical behavior that you see in life is described by a normal distribution, the typical bell-shaped curve. This is an extremely convenient and well-understood data distribution, and much of our intuitive understanding of data is built on the assumption that the data we're examining fit the normal distribution. For a normal distribution, the arithmetic average is, indeed, the typical value of the data points, and a standard deviation calculated by the usual formula gives a good sense of the breadth of the distribution. (A small standard deviation implies a very tight grouping of data points around the average; a large standard deviation implies a loose grouping.) For a normal distribution, 67 percent of the measurements are within one standard deviation of the average, and 95 percent are within two standard deviations of the average.
Unfortunately, Web and Internet behavior do not conform to the normal distribution. As a result of intermixing long and short files, compressed video and acknowledgments, and retransmission timeouts, Internet performance has been shown to be heavy-tailed, with a long right tail. (See Figure 2-5.) This means that a small but significant portion of the measurement data points will be much, much larger than the median.
Figure 2-5  Heavy-Tailed Internet Data (a distribution with a long, heavy right tail)


If you use just a few measurements to estimate an arithmetic average with a heavy-tailed distribution, the average will be very noisy. It's unpredictable whether one of the very large measurements will creep in and massively alter the whole average. Alternatively, you may be lulled into a false sense of security by not encountering such an outlying measurement (an outlier).
The situation for standard deviations is even worse because these are computed by squaring the distance from the arithmetic average. A single large measurement can therefore outweigh tens of thousands of typical measurements, creating a highly misleading standard deviation. It's mathematically computable, but worse than useless for business decisions.
Use of arithmetic averages, standard deviations, and other statistical techniques that depend on an underlying normal distribution can therefore be quite misleading. They should certainly not be used for SLA contracts.
The geometric mean and the geometric standard deviation should be used for Internet measurements. Those measures are not only computationally manageable, they're also a good psychological fit for an end user's intuitive feeling for the typical measurement. As an alternative, the median and eighty-fifth percentile may be used, but they take more computing power.
The geometric mean is the nth root of the product of the n data points. The geometric deviation is the standard deviation of the data points in log space. The following algorithm should be used to avoid computational instabilities:

- Round up all zero values to a larger threshold value.

- Take the logarithm of the original measurements (any base).

- Perform any weighting you may want by replicating measurements.

- Take the arithmetic mean and the standard deviation of the logarithms of the original measurements.

- Undo the logarithms by exponentiating the results to the same base originally used.
Note that the geometric deviation is a factor; the geometric mean must be multiplied and divided by it to create the upper and lower deviations. Because of the use of logarithms, the upper and lower deviations are not symmetrical, as they are with a standard deviation in normal space. This is one of the prices you pay for the use of the geometric measures.
Another disadvantage is that, as is also true for percentiles, you cannot simply add the geometric statistics for different measurements to get the geometric statistics for the sum of the measurements. For example, the geometric mean of (connection establishment time + file download time) is not the sum of the geometric means of the two components. Instead, each individual pair of data points must be individually combined before the computations are made.
These calculations of both the geometric mean and the geometric deviation, or the median and the eighty-fifth percentile, should be used for end-user response time specification. Using these statistics instead of conventional arithmetic averages or absolute maximums helps manage SLA violations effectively and avoids the expense of fixing violations that were caused by transient, unimportant problems.


Business Process Metrics


There have been numerous stories in industry publications that describe service provider difficulties in managing new technologies, digital subscriber line (DSL) services being a prime example. Customers were annoyed by the delays and operational interruptions. Many customers investigated alternative technologies with different service providers and subsequently left their original provider.
When customers defect, service providers suffer lost business and revenues. Many startups in the DSL space, for example, could not deploy their services and generate revenue quickly enough and went out of business after exhausting their initial funding.
Many customers still view most of their providers as being behind the curve, sluggish, and unable to help them execute their business strategies fully. Typical complaints about interaction with providers often include the following:

- Difficulty in finding experts at the provider who actually understand the provider's own services

- Mistake-prone business processes for interacting with the provider

- Revenue impacts when scheduled services slip their delivery dates

- Voluminous, and often incomprehensible, bills and reports

- Bombardment from competitors offering equally incomprehensible services

Although such issues have made the service-provider marketplace somewhat turbulent, the good news is that the situation is improving because of two developments.
The first is the continuing build-out of the Internet core with optical transmission systems of tremendous capacity, coupled to the widening deployment of broadband services for the last-mile access links to the customer. When this capacity is fully in place, bandwidth services can be activated and deactivated without the delays associated with running new wiring and cable. As these high-capacity transmission systems become more widespread, it becomes a question of coordinating the activities of both customer and provider management systems for more effective and economical service delivery.
That introduces the second enabling factor: the development of standards, such as the Extensible Markup Language (XML) and the Common Information Model (CIM), along with other factors, is making the sharing of management information easier and simpler than it used to be. Customers and service providers can use mechanisms such as XML to loosely couple their management systems. Neither party needs to expose internal information processes to the other, but they can exchange requests and information in real time to speed up and simplify their interactions.
Customers can allocate their spending more precisely by activating and deactivating services with finer control and thereby reducing their usage charges. They can also temporarily add capacity or services to accommodate sudden shifts in online business activities.


Providers have a competitive edge when they have the appropriate service management
systems. They can meet customer needs quickly and use their own dynamic pricing
strategies to generate additional revenues.
Business process metrics measure the quality of the interactions between customers and service providers; including them in an SLA is a way of improving those interactions. Some of these metrics may be incorporated in standard provider service profiles, while others may need to be negotiated explicitly.
Many customer organizations maintain relationships with multiple service providers to
avoid depending on a single provider and to use the competition to extract the best prices
and service quality they can negotiate.
Business process speed and accuracy will be even more important in the future as customer
and provider management systems are integrated, and as services are activated and
deactivated in real time. Service providers must be able to provision quickly, bill
appropriately, and adjust services in a matter of a few seconds to a few minutes. Customers
must also be able to understand their service mix and adjust their requests to the service
provider to match changes in their business requirements. It is this environment that will
begin to accelerate the use of business process metrics as part of the selection and continued
evaluation of a set of service providers.
Table 2-3 lists two emerging categories of business process metrics. Problem management metrics measure the provider's responses to customer problems, whereas real-time service management metrics track the responses to customer requests for service modifications.
Table 2-3  Business Process Metrics

Problem Management Metrics
    Trouble Response Time            Elapsed time between trouble notification by customer and first response by provider
    Notification Time                Elapsed time between trouble detection by provider and first notification to the customer
    Escalation Time                  Elapsed time between first response by provider or notification to the customer and the first escalation to provider specialists
    Trouble Relief Time              Elapsed time between first response by provider or notification to the customer and the furnishing of a workaround or fix for the problem that permits normal operation to resume
    Trouble Resolution Time          Elapsed time between first response by provider or notification to the customer and the furnishing of a permanent fix for the problem

Real-Time Service Management Metrics
    Provisioning Time                The elapsed time to provision a new service
    Activation/Deactivation Time     The elapsed time to activate or deactivate a provisioned service
    Change Latency                   The elapsed time to effect a parameter change across the entire system
The following sections describe the nuances of each metric in turn.


Problem Management Metrics


Service quality problems are inevitable, although, ideally, they are becoming rarer with time. A metric of primary importance is the trouble response time to a customer problem report or query. This metric can be used to measure both the first response to a customer call and the first response to automated notification from a customer's management system.
Notification time measures the interval between the provider detecting a service problem and reporting it to the customer. Agile customers will activate their own procedures to deal with the interruption and will want a quick notification time to minimize any disruptions.
Escalation time measures how quickly a problem is moved from the intake at the help desk to more highly qualified experts. Faster escalation times will usually carry a premium the customers will be willing to pay when critical services are involved. As is true for other problem management metrics, escalation time may depend on the severity of the problem and the priority assigned to the user's request.
Trouble relief time is that point at which the customer reporting the issue has a workaround in hand, or has overcome the service interruption. Relief is distinct from resolution: even if it's not known what caused the outage, the customer is back in business. However, the customer will want the provider to promptly identify the root cause of the outage and take corrective action to prevent it from happening again. That final stage is known as resolution time.

Real-Time Service Management Metrics


Customers and providers will both exploit real-time service management capabilities as their management systems begin to interact with each other. Customers will be able to fine-tune their resource usage and control their costs while coping with the dynamic shifts that are so characteristic of online activities. Service providers will also have the advantage of maintaining control while allowing their customers to take over many of their management tasks, thereby reducing their staffing costs substantially.
Customers want to be able to change their service environment on their own schedule rather than waiting for the provider to do the job in the traditional way. This may involve, for example, activating services, such as videoconferencing, on a demand basis. At other times, customers may want to add capacity to handle temporary traffic surges, or they may want to change the priorities (and costs) of some of the services they use.
Provisioning time is the time needed to configure and prepare a new service for activation, including the allocation of resources and the explicit association of consumers with those resources for billing purposes. Activation/deactivation time is the time needed to activate or deactivate a provisioned service.


Change latency is an idea for a metric that arose from the experience of one of my
colleagues. She works for a large multinational organization with approximately 1,200
global access devices. Some access points support a small number of dial-in users, while
others accommodate larger buildings and campuses. Her organization wanted to change
some access control policies and asked the service provider to update all the access devices.
The problem occurred because the provider changed only portions of the devices in phases
over two days rather than all at once, leading to a situation in which devices had
inconsistent access control information. The result was disruptions to the business.

Service Level Agreements


The SLA has become an important concern for both providers and their customers as
dependence on high-quality Internet services increases. The SLA is a negotiated agreement
between service providers and their customers, and in the best of worlds, the SLA is
explicit, complete, and easily understood. When done properly, the SLA serves the needs
of both customers and service providers.
Organizations are constantly struggling to maintain or extend their competitive advantage
with stable, highly available services for their customers and end users. As a result, they are
increasingly dependent upon their service providers to deliver the consistent and
predictable service levels on which their businesses depend. A well-crafted SLA provides
substantial value for customers because of the following:

They have an explicit agreement that defines the services that will be provided, the metrics to assess service provider performance, the measurements that are required, and the penalties for noncompliance.

The clarity of the SLA removes much of the ambiguity in customer-service provider
communication. The metrics, rather than arguments based on subjective opinions of
whether the response time is acceptable, are the determinant for compliance.

The SLA also helps customers manage their costs because they can allocate their spending on a differentiated scale, with premiums for critical services and commodity pricing where best effort is sufficient.

Customers have the confidence that they can successfully deploy the critical services that improve their internal operations (remote training and web-based internal services) or strengthen their ability to compete (web services and supply chains). Too many efforts have floundered due to unacceptable service quality after deployment.

The SLA becomes more important as you move toward customer-managed service
activation and resource management. The SLA will determine what the customer is
allowed to do in real time in terms of changing priorities and service selections.

Service providers have been reluctant to negotiate SLAs because of their increased exposure to financial penalties and potentially adverse publicity if they fail to meet customer needs. In spite of their reluctance, they have been forced into adopting SLAs to keep their major customers. The evolution of SLAs has therefore been driven mainly by customer demands and fear of losing business.


Early SLAs focused primarily on availability because it was easier to measure and show compliance. Availability is also easier for a provider to supply by investing in the appropriate degree of redundancy so that failures do not have a significant impact on availability levels.
Performance metrics are beginning to be included in more SLAs because customers demand them. Providers have a more difficult time guaranteeing performance levels because of the dynamism of their shared infrastructures. Simply adding more bandwidth will not guarantee acceptable response time without significant traffic engineering, measurements, and continued analysis and adjustment. The difficulty of managing highly dynamic flows has made many providers reluctant to accept the financial penalties that are part of most SLAs.
Nonetheless, the value of the SLA to providers is also recognized, and some of the significant factors are as follows:

The clarity of the SLA serves the provider as it does the customer. Clearly defined metrics simplify the assignment of responsibility when service levels are questioned.

The SLA offers service providers the capacity to differentiate their services and escape (somewhat) the struggles of competing in a commodity-based market. As providers create and deploy new services, they can charge on a value-pricing basis to increase their profit margins.

High performance and availability are increasingly becoming competitive


differentiators for service providers. Increasing customer dependence on Internet,
content delivery, and hosting service providers gives an advantage to those providers
that demonstrate their ability to deliver guaranteed service quality levels.

When constructing an SLA, customers must assess their desired mix of services and weigh their relative priorities. A useful first attempt is to match those needs against the provider's preconfigured service profiles. This will group services with common characteristics and requirements, and it will also help identify any special services that are not easily accommodated by the predefined categories. Requirements that do not fit a predefined class will require special considerations when negotiating an SLA.
After services have been grouped, their relative priorities within each category must be established. Customers can do this by selecting the appropriate service profile; for example, many service providers offer a variation on the platinum, gold, and silver profiles. Typically, platinum services are the most expensive and provide the highest quality; gold and silver are increasingly less expensive and provide relatively lower quality.
Even if prebuilt service profiles are used, the SLA negotiations must include discussions of how the SLA metrics are to be measured and how any penalties or rewards are to be calculated. Customers will continue to push for stronger financial penalties for noncompliance, and providers will give in to the pressure as slowly as they can in a highly competitive market.


Unfortunately, it's not uncommon for providers and customers to have ongoing disputes about the delivered services and their quality. Some of the roots of the problem are technical: customers and providers may have different measurement and monitoring capabilities and are therefore comparing apples to oranges. Other problems are rooted in the terms of the SLA, where ambiguities lead to different interpretations. SLAs must therefore incorporate relevant measurement, artifact reduction, and verification mechanisms, and appropriate statistical treatments, to protect both parties as much as possible. Customers must play a role in the verification process because they still have the most to lose when serious service disruptions occur.
SLA penalties and rewards are a form of risk management on the part of the customer. However, they continue to be among the least well-developed elements of service offerings. More mature industries offer guarantees and incentives; the ability of the service provider to reduce and absorb some risk for its customers is a key competitive differentiator.
Still, customers bear the brunt of any disruptions caused by a provider. As one customer once said, "The problem is the punishment doesn't fit the crime; an hour-long outage costs us over $100,000, and my provider just gives me a 10 percent rebate on my next bill." Nevertheless, the correct role for penalties and rewards is to encourage good performance, not to compensate the customer for all losses. If loss compensation is needed, it's a job for risk insurance.
Rather, SLA penalties and rewards must focus on motivation. The penalties and rewards should be sufficient to inspire the performance the customer wants, and the goals should be set to ensure that the motivating quality of the SLA remains throughout the time period. Impossible or trivial goals don't motivate, and capped penalties or goals stop motivating when the cap is reached. For example, if a provider must pay a penalty based on monthly performance, and the SLA is violated in the first three days of the month, so that the maximum penalty must be paid, the provider won't be motivated to handle problems that appear during the remainder of the month. After all, that particular customer's ship has already sunk; maybe another customer's ship is still sinking and can be rescued without paying a maximum penalty!
Web performance goals that are set unrealistically high, with no reference to the Internet's background behavior, will cause the supplier to refuse the SLA or insist on minor penalties. A solution to this problem is to include in the SLA metrics a background measure of Internet performance or of competitors' performance, possibly from a public performance index or from specific measurements undertaken as a part of the SLA.
Sometimes, performance is so poor that a contract must be terminated. The SLA should discuss the conditions under which termination is an option, and it should also discuss who bears the costs for that termination. Again, the costs should be primarily designed to motivate the supplier to avoid terminations; it may not be possible to agree on an SLA in which all of the customer's termination costs are repaid.


Finally, customers may want to include security concerns in their SLA as part of a service profile through additional negotiation and specification. Security is notoriously difficult to measure, except in very large aggregates. Security metrics are more likely to take the form of response-time commitments in the event of a breach, either to roll out patches, shut down access, or detect an intrusion. The bulk of security discussions around service levels will be about policies, not measurement.

Summary
This chapter covers a lot of territory and sets the stage for the following chapter discussions
that cover different aspects of actually managing services. Successful service management
is predicated on delivering acceptable service quality at acceptable price points and within
acceptable time frames. Correctly handled, it improves service quality, improves
relationships with suppliers, and may even lower total costs.
The SLA is the basic tool used to define acceptable quality and any relationships between quality and price. It is a formal, negotiated contract between a service provider and a service user that defines the services to be provided, the service quality goals (often called service level indicators and service level objectives), and the actions to be taken if the service provider does not comply with the SLA terms.
Measurement is a key part of an SLA, and most SLAs have two different classes of metrics: technical and business process metrics. Technical metrics include both high-level technical metrics, such as the success rate of an entire transaction as seen by an end user, and low-level technical metrics, such as the error rate of an underlying communications network. Business process metrics include measures of provider business practices, such as the speed with which they respond to problem reports. Metrics should also include measures of the workload expected. Service providers may package the metrics into specific profiles that suit common customer requirements while simplifying the process of selecting and specifying the parameters.
In any case, a properly constructed SLA is based on metrics that are relevant to the end-user experience. Many of the low-level technical metrics, such as communications packet loss, have complex relationships to end-user experience; it's usually much better to use high-level technical metrics that directly measure end-user experience, such as web page download time and transaction time. The low-level technical metrics can then be derived from the high-level technical metrics and used to manage subordinate systems.
SLA metrics must be carefully defined in terms of scope, sampling frequency, and aggregation interval:

Scope represents the breadth of measurement (for example, the number of test points
from which availability is measured and the percentage of them that must be
unavailable for the entire system to be marked as unavailable).


Measurement sampling should be random, and the sampling frequency should be chosen to provide timely alerts when problems occur and to provide the appropriate confidence intervals for availability and performance measurement. Calculation of confidence intervals is unfortunately complex for Internet statistics, as the usual formulas, suited for normal distributions, cannot be used. Instead, statistical simulation through bootstrapping or the approximations discussed in the body of this chapter can provide estimates of the number of measurements needed to provide reasonable statistics.

The aggregation interval is also important, as longer intervals, often chosen in SLAs,
may allow long periods of sub-par performance. The tolerance for service interruption
then becomes important and may need to be separately specied.

Measurements must also be validated and subjected to statistical treatment when used in SLAs, and the methods for that validation and treatment must be documented in the SLA to avoid dispute. Validation ensures that erroneous measurements are removed, insofar as is possible, before computation of the metrics used in the SLA. Statistical treatment ensures that outlying measurements do not create a misleading picture of the performance as perceived by end users, with the resulting waste of resources spent fixing what may be a minor issue. Arithmetic averages and standard deviations should not be used to handle Internet statistics.
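To make the statistical treatment concrete, here is a minimal sketch, in Python, of a percentile bootstrap applied to a robust statistic; the sample values, the 95th-percentile choice, and the resample count are hypothetical illustrations rather than recommendations from this chapter.

import random

def bootstrap_ci(samples, stat, n_boot=2000, alpha=0.05):
    """Percentile bootstrap confidence interval for an arbitrary statistic."""
    estimates = []
    for _ in range(n_boot):
        resample = random.choices(samples, k=len(samples))  # resample with replacement
        estimates.append(stat(resample))
    estimates.sort()
    lo = estimates[int((alpha / 2) * n_boot)]
    hi = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def p95(values):
    """95th-percentile response time; more robust than a mean for skewed Internet data."""
    ordered = sorted(values)
    return ordered[int(0.95 * (len(ordered) - 1))]

# Hypothetical page-download times in seconds, for illustration only
times = [0.8, 1.1, 0.9, 4.2, 1.0, 1.3, 0.7, 2.9, 1.2, 1.1]
print(p95(times), bootstrap_ci(times, p95))

The same resampling loop can be repeated for different sample sizes to estimate how many measurements are needed before the interval narrows to an acceptable width.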
Finally, the SLA should be written with penalty and reward clauses that are sufficient to inspire the performance the customer wants, and the goals should be set to ensure that the motivating quality of the SLA remains throughout the time period. Capped penalties or goals are examples of techniques that may motivate a supplier to abandon work on an account just because the cap has been reached, which is probably not the desired behavior.
The service level indicators and objectives described in the SLA are then used by the operations staff and by automated systems to manage the service levels, as described in Chapters 6 and 7.


CHAPTER 3

Service Management Architecture
This chapter describes the overall architecture for the management of service delivery on the Web; it forms the framework that later chapters will build on to create a complete design. It also gives some of the relevant history of service management architectures to enhance the reader's understanding of the issues facing management architectures. The history will let the reader see the origins of some of the major management products in the marketplace.
Before the discussions begin, however, it's important to understand that a system architecture defines the components of a system and their relationships, showing how they provide the required system functions and meet the system objectives. The interconnections among the architecture's subsystems are clearly defined, and each subsystem can be expanded to reveal internal details. A good architecture therefore provides a high-level overview for those who need it, while also providing detailed technical information when necessary.
Any number of architectures can be created to handle the same challenge; the number of potential architectures is limited only by the imagination of the architect. Each can stress different capabilities or organizing principles. Ideally, the selected architecture provides the maximum business value (including functionality, flexibility, match to the organization's culture, and more) while costing the least in funding and effort to build and manage.
Design decisions made within each architecture's subsystems may affect the function and performance of the other subsystems, although the subsystems in some architectures are more tightly interrelated than in others. For example, in a Web service delivery architecture, the decision to use a private network of geographically distributed caching systems has implications for the design of the central Web-serving system. A well-defined architecture helps those managing it to see the implications of a subsystem's design and management decisions on the system as a whole.
This chapter is organized into three sections. The first section provides a brief description of a large-scale Web services delivery architecture along with its business environment. That's because it's impractical to discuss management systems without having a common understanding of the architecture of the systems being managed and the business environment (the webbed ecosystem) within which those systems must function. The middle section discusses the history of service management platforms for heterogeneous systems and the design factors and standards that go into them. The last section gives a summary of the service management architecture used in this book and provides references to the relevant chapters.


Web Service Delivery Architecture


Figure 3-1 shows a typical architecture for delivery of services on the Web.
Figure 3-1  Web Service Delivery Architecture

[Figure 3-1 shows end users reaching the primary server farm through an access provider and the Internet's mesh of provider backbones (MCI, C&W, Sprint, AT&T, Level 3, UUNET, Verio, Qwest). Along the path are a CDN server, caches, a DNS server, and routers; other server farms connect to the same mesh. The primary server farm sits behind an access router, firewall, and load distributor, with tiers of web servers, application servers, and database servers.]

This server farm uses a three-tier application model, which is normally used for large-scale
systems. The three tiers are as follows:

Web servers, which maintain the connections with client browsers and other client
devices, parsing and handling input from them, formatting data to be sent to them,
serving unchanging (static) web pages, and often being responsible for maintaining
transaction context.


Application servers, which run the major transaction and dynamic web page
generation systems, as well as any specialized applications for the end users. They
often run specialized transaction-processing operating systems that simplify
programming for scalability and availability.

Database servers, which handle the large back-end databases needed by larger Web
systems.

Because the three tiers are loosely coupled, each tier can grow independently of the others,
and interconnections can be used to increase availability.
Above the three server tiers in Figure 3-1 are the load distributor, which distributes incoming requests among the web servers, and the firewall and Internet access router.
In Figure 3-1, the primary server farm is multi-homed; it's connected to two different Internet Service Providers (ISPs) to increase availability. The primary server farm usually also includes ancillary devices, such as the authoritative Domain Name System (DNS) server, which provides the key records for mapping the site's Internet host names to Internet numeric addresses, and server-side caches, which can be used to relieve the serving systems of highly repetitive work by storing the results of commonly repeated requests. The end user's Quality of Experience (QoE) depends on much more than the primary server farm's performance, however. Multiple server farms, caching devices, content distribution networks, third-party content providers, and the DNS may also be involved.
Most large systems rely, often indirectly, on multiple, distributed server farms. Some enterprises have multiple locations from which they provide their basic content, and they use geographic distribution technologies to try to direct end users to the server farm that will respond the fastest. For example, it's impossible to deliver rapid web page downloads in Asia from a server system in New York City; enterprises that have a large user base in Asia must, therefore, have some server systems on that side of the Pacific. Geographic distribution is critical to providing good QoE, though it's difficult to locate an end user with great precision by using that end user's Internet address. Obtaining detailed knowledge of location, while very important for some applications and for some performance situations, can be quite tricky.
Caching devices are used to store frequently requested data inside the network, at the server location, or within the end user's local network to decrease both network traffic and the time needed to locate and display data. These devices are often provided free of charge for web sites to use, but configuring web pages for use with remote caching can be complex. Precise evaluation of the QoE at an end user's location as the result of caching requires remote measurement facilities.
A Content Distribution Network (CDN) is a service that uses a large network of remote caches to provide much more control of caching than is available using free caching. A CDN can provide prepositioning of content, such as a major advertising campaign; it also provides the ability to cache downloadable files and streaming media, which are usually not stored by public caches. A CDN gives the content owner direct, immediate control over remotely cached content. A CDN can also supply differentiated content to end users, based on their location.


Most web sites use third-party content providers for some advertising or even basic site content. Many stock-trading sites, for example, use a third-party provider for stock price graphs that are visually embedded in their web pages. Although the content comes from third-party content providers, the end user usually does not realize that it originates from different sites. If there are performance problems, the site owner is blamed, not the third-party provider.
Finally, the web site can't even be found by the end user if there are problems with the performance of the DNS. DNS is a worldwide hierarchy of server systems configured as a distributed directory, and it must be able to reach the web site's authoritative record and interpret that (often complex) record correctly. DNS information can then be cached in the DNS's own dedicated system of distributed caching servers, with some control from the web site's owner. Without measurement from end-user locations, problems with the DNS are often not detected until irate end users call up the site to complain about the site's being offline. The site may be completely accessible from the site owner's intranet, but completely inaccessible from large areas of the Internet.
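A crude way to see DNS behavior from an end-user location is simply to time name resolution through the local resolver. The following sketch uses a placeholder hostname; it measures only what the local stub resolver reports, so it reflects caching effects as well as authoritative-server performance.

import socket, time

def resolve_time(hostname):
    """Wall-clock time for the local resolver to return an address, in milliseconds."""
    start = time.perf_counter()
    socket.getaddrinfo(hostname, 80)          # raises socket.gaierror on lookup failure
    return (time.perf_counter() - start) * 1000.0

print(f"{resolve_time('www.example.com'):.1f} ms")   # hypothetical target host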
All the traffic between the end user and the various server farms, caching devices, content distribution networks, and DNS servers travels over the Internet's complex mesh of backbones and peering points, which are the locations at which different organizations interconnect their backbones. The routing tables, used to direct Internet traffic, are so complex that the routing software cannot consider fluctuations in transit time when making routing decisions. If it did, the Internet would be saturated by routing table update messages, and the router CPUs would be saturated by the calculations required. The result is that routing through the Internet is often suboptimal, and traffic often heads for congested areas and peering points instead of traveling around them.
Routers do attempt to reroute around failed pieces of the Internet; they just don't usually reroute around congested pieces. Delays can build quickly at congestion points, and packets can be lost or duplicated as routers try to recover from their problems. To add to the complexity, the route between any pair of endpoints is almost always different in the two directions.
For all Internet and Web situations, you can see that measurement of the performance as seen by the end user must be available to detect QoE problems occurring in the complex Web-serving systems. Those measurements must also be quickly available and must be credible; otherwise, the web site's owner won't be able to use them to get an Internet service provider to fix a problem. Of course, some problems, such as a very localized difficulty with an ISP's bank of dial-in modems, are beyond the scope of responsibility of a web site's owner, even though that web site's availability appears to be affected. In such cases, which occur constantly, measurements from standard locations and the use of public performance index measurements can be used to reassure management (and the end user, if necessary) that the problem is a local one and is beyond the direct responsibility of the owners or operators of the web site.


Service Management Architecture: History and Design Factors
A service management architecture must be designed to handle the heterogeneous, geographically distributed subsystems used for Web service delivery, and it must also cope with the fact that many of these subsystems are owned and managed by different organizations with only tenuous links for instrumentation and control. To make matters more complex, suppliers are being challenged, by customer demand, to provide individually tailored Service Level Agreements (SLAs) with fine distinctions and pricing for specific services and for specific customers. Customers are also beginning to ask for the ability to alter their service mix and service quality levels on demand.
The management system must, therefore, be able to handle the complex interactions involved as services are provisioned, activated, used, and deactivated. Despite these difficulties, it must help speed deployment of new services while also quickly adapting to new service demands and changed relationships among service suppliers.
Further discussion of this topic is divided into the following subsections:

The evolution of the service management environment


History of service management architectures for heterogeneous systems
Architectural design drivers for management of heterogeneous systems

The Evolution of the Service Management Environment


Today's service management environment bears scant resemblance to the one we had even a few years back. It's important to note that this shift is not just about new tools and technologies; it's equally about changes to the organizations and job definitions within IT shops.
An example should make the critical difference between traditional and Internet-based system structures clear. Consider that in a traditional system, such as a corporation's transaction-processing infrastructure based on IBM's Systems Network Architecture (SNA), the entire system was owned and operated by the corporation. Any externally owned telecommunications facilities were very simple; they were direct telephone lines between physical locations, with little variation or value-added services. Data switching was performed inside the corporation's private IBM communications controllers, which were centrally configured and operated, with all cross-network routes preplanned and tightly controlled. Major portions of the network were regularly taken offline for reconfigurations; it wasn't unusual for an entire worldwide network to be completely unavailable for one or more days per quarter. The end users all had IBM terminals or terminal emulators running on early PCs; those terminals connected to one server at a time and usually presented only text.


If a problem occurred in the traditional IBM SNA-based system, the system operator had central, integrated control of the application, the network, and the end user's terminal. The network's System Services Control Points (SSCPs) could instantly locate and diagnose the complete end-to-end connection between a particular application and a particular terminal. Given clear visibility into underlying connections, other tools could diagnose the application problems quickly. The entire system was tightly coupled. Configuration was extremely complex and could be error-prone, but a running system was under strong central control.
In contrast, Internet-based systems are loosely coupled and do not rely on massive, centralized configurations of servers, storage, and network hardware. However, these more flexible configurations are more difficult to operate at a given level of service and do not have any central management system. Instead of having that central authority, which is the keystone of a traditional system, Internet-based systems have a loose confederation of interacting, separately owned and controlled subsystems.
Of course, the flexibility of networked architectures is a mixed blessing. It does facilitate changes to keep pace with the changing demands of the business. However, such change can also introduce new complexities and vulnerabilities. When a problem occurs in an Internet-based system, finding the precise end-to-end path that the data flow is taking may be extremely difficult; there's no central switching or routing authority. Even if that path is found, it's unclear that the knowledge could be effectively used to fix any problems quickly; the responsible ISP might be one that isn't directly accountable to either end of the connection.
Further exacerbating the situation is that a problem as seen by the end user could have been caused by any of dozens of interacting subsystems and servers. The image on the end user's browser probably comes from multiple servers simultaneously (third-party suppliers provide stock charts, ads, and so on); each data flow may have been invisibly intercepted and possibly cached by devices unknown to server or end user; and the server assigned to a particular end user may have been assigned only temporarily and cannot easily be traced at a later time or even while the error is occurring. Running a help desk in the complex Internet environment is much more technically difficult, and it takes more ongoing negotiating and interacting with external suppliers than running one in the traditional environment!

Service Management Architectures for Heterogeneous Systems


New architectures and platforms were created to manage Internet-based, heterogeneous systems. Managing services that span many infrastructures and organizations not only demands a set of management tools from a variety of vendors; it also means that tools must be applied in the appropriate sequence to solve the problem. How the tools are organized has a significant impact on the effectiveness of your management efforts.


The traditional approach, still common today, is to use each tool in isolation. When several tools are needed for a task, a staff person takes the output from one tool and uses that information to drive the next tool. This approach needs additional staff attention, can consume large amounts of time, and adds the risk of introducing errors with manual steps. It also requires an investment in additional equipment and requires additional physical space, adding significantly to the cost of monitoring and management. Integrating the tools appears to be a better solution.
If integration is good, you might wonder why there is so little of it. Consider the following reasons:

First, deep integration of the management structures of heterogeneous systems has been a significant, expensive technical challenge, especially when there were no accepted standards for guidance and the managed systems changed constantly.
Second, the market has been willing to settle for integration as defined by marketing departments: integration that seemed to correspond to needs, but that has failed to meet the test of practice.

The early management platforms touted themselves as integration points for a set of best-of-breed management tools. Unfortunately, their marketing hype exceeded their capacity for delivering any meaningful integration. Competition was on the basis of who had the longest list of third-party management tools sharing interfaces to their platform, regardless of any real integration efforts. The market positioning suggested that commonality of interfaces was the key to making tools useful; that turns out not to be the case in practice.
The integration many early management platform vendors actually offered might be better characterized as consolidation and tool launching. Consolidation allows customers to use a single server for a set of management tools rather than use a server for each one. Tools can be launched after an alert triggers a response. This is useful, but there is no integration: each tool still operates as a separate entity with its own commands, functions, data schema, and display formats.
Some management platforms added integration on the glass: a consistent look and feel for a set of tools. This feature is useful because it simplifies usage and reduces staff training requirements. The platforms offered this common look and feel for their products and the overall console. However, each tool could, and often did, have its own conventions after being launched.
All the early platform vendors got away with these low-level integration features because
the market was relatively unsophisticated, and systems management did not demand as
much integration. However, today, this lighter level of integration is no longer adequate;
management tools must now work in a webbed services environment.
A cynical view is that the early lack of deep integration also served vendors as they built
substantial professional services organizations to finish the job. I had one vendor in a


moment of candor admit that his company made $10 in professional services for each $1 a
customer spent on the actual software. Market studies in general showed that the consulting
spent to take such tools off the shelf and put them to use exceeded the licensing fees by a
factor of 2:1 or more.
The relatively shallow integration left organizations with several other choices: they could find another integrator, undertake the effort themselves, or live with a set of disjointed management tools. Of course, using a systems integrator was expensive and time-consuming; it often meant that a company was dependent upon the integrator every time new management tools were acquired. The alternative of internal integration efforts was also expensive and time-consuming, as it diverted development resources from the core business initiatives.
As with much of technology, invention is the mother of necessity in management. The
management industry has been responding to the need for better integration through
consolidation. The big players buy up niche products and offer the suite as an integrated
solution. Others are forging strategic partnerships and integrating their products. Both
trends offer some additional value for management solution buyers. However, it is still
unusual for these efforts to produce a product suite that offers more than integration on the
glass. Often surface integration relies too heavily on limited new software to try to glue the
disparate pieces together.

NOTE

It's important for prospective purchasers of integrated management systems to remember this history of superficial integration when evaluating systems. Deep integration of management systems is difficult, even though new standards, discussed later in this section, promise some help.

Architectural Design Drivers


The key factors that drive the development of architectural designs for service management are as follows:
Demands for changing, expanding services
Multiple service providers and partners
Elastic boundaries among teams and providers
Demands for fast system management
Need for mutually understandable data item definitions and event signaling mechanisms

These are described in the following sections.


Demands for Changing, Expanding Services


The range of Web-based services continues to expand, with streaming, multimedia, and
remote collaboration gaining interest. The range of network access devices in common use
is also expanding, requiring services that can adapt to the inherent bandwidth, resolution,
and screen size limitations of the access alternatives.
For example, feedback from the lowest network transport layers could be used to adjust the mixture of frame types in a streaming media presentation to improve end-user QoE. Many current systems conceal transport error rates from the application layer; the application layer therefore doesn't know that errors are occurring and that a change in the frame mixture might be helpful. As streaming increases in importance, new streaming-tuned services may appear. In such systems, the application layer that creates the video or audio stream can be told to increase the percentage of key (synchronization point) frames in the stream as the error rate in the transport layer increases. That increase assists the receiver in regaining lost synchronization quickly, at the expense of some instantaneous bandwidth use. Service levels delivered to the streaming media service in this situation could be adjusted quickly to provide the optimum mix of transport error rate and bandwidth, enabling an improved QoE for the end users.
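The feedback rule itself can be very simple. The following sketch is purely illustrative, with made-up base and maximum ratios and a linear response; it does not describe any particular streaming product.

def key_frame_ratio(loss_rate, base_ratio=0.05, max_ratio=0.25):
    """Raise the share of key (synchronization) frames as transport loss grows,
    trading a little extra bandwidth for faster resynchronization at the receiver."""
    ratio = base_ratio * (1 + 10 * loss_rate)     # simple linear feedback rule
    return min(ratio, max_ratio)

for loss in (0.0, 0.01, 0.05):                    # hypothetical loss rates
    print(f"{loss:.0%} loss -> {key_frame_ratio(loss):.0%} key frames")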

Multiple Service Providers and Partners


An array of service providers and partners play their parts in the end-user experience, offering connectivity, value-added network services, hosting, content delivery, and back-office functions. Some partners use direct interaction for building their own supply chains and other online business processes, while other organizations use exchanges as a way of transacting business with a larger number of potential suppliers and partners.
The customer must consider how these various providers are held accountable for meeting their various compliance criteria. Holding providers accountable requires an ability to monitor their service delivery with the appropriate instrumentation, and the measurements from that instrumentation must be correlated with the service quality as seen by the end user.

Elastic Boundaries Among Teams and Providers


Boundaries in the service management environment are more elastic and fluid than they were in the older mainframe and client-server worlds. Service flows move among the infrastructures in many different ways, and managers must understand how the behavior of each infrastructure is affecting overall service quality. Service managers therefore need to understand issues that span multiple supporting infrastructures (networks, systems, and applications) and multiple organizations.


The contrast with traditional management organization strategies is stark. In the past, teams had isolated, well-bounded responsibilities; for instance, the network and application infrastructure managers had little reason to interact. Today, such specializations must be integrated with a structure for mutual responsibility and collaboration by specialists across these different layers. Infrastructure managers can be specialists, but service managers must also be generalists.
Boundaries between customers, their providers, and their business partners are also becoming more fluid. At any point, the constellation of providers and partners can change as the mix of services responds to business shifts. To keep pace with the changing mix, management systems must interact more frequently, and customers need to assume some of the management functions that have been the providers' domain.

Demands for Fast System Management


Despite the difficulties in managing an Internet-based system, competitive pressures drive fast service provisioning and configuration, along with fast problem detection and resolution. The fact that many of the critical underlying services are much more complex than in the days of traditional architectures, and that they are under only loose control, does little to soften the expectations of end users. They still want fast, effective support from the help desk.

Data Item Definition and Event Signaling


Products from different system management tool vendors generally use different names for management data, different ways of representing their values, and different ways of describing relationships among data elements. It comes as no surprise to anyone that each vendor's choices are not compatible with the others. Especially in older designs, a tool from one vendor usually cannot access needed information from another vendor without knowing the details of the latter's data definitions and creating the translation software that transforms the data into a usable form.
The Simple Network Management Protocol (SNMP) was the first effort to develop standards for exchanging management information, in this case between an agent on a network device and a management application running on a system management platform. As its name implies, it contains a very simple way of asking a remote device to send a formatted array of system management information (the Management Information Base [MIB]), which contains data such as packet counts and error rates. It also has other features that enable the sending of asynchronous alert messages and the setting of some remote parameters. SNMP was very successful and has been extended to a number of other elements, such as servers and applications. However, SNMP has two significant drawbacks: it focuses on syntax, and it pays less attention to semantics.


The syntax (command and data structure) of SNMP can be used by a management
application to determine that a variable included in an array of system management
information is a 32-bit integer used as a counter. However, without the semantics (meaning)
of the counter, the application cannot use the data. Missing information may include the
following:

What does the counter count?
When is it incremented?
What are the maximum and minimum values?
When was it initialized?
What are the thresholds for generating an alert?
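The gap between syntax and semantics is easy to see in practice. The sketch below, with a hypothetical host address and community string, retrieves the standard IF-MIB ifInOctets counter with the net-snmp snmpget command-line tool and turns two readings into a rate; nothing in the retrieved value itself says what is being counted, when the counter wraps, or whether the device was recently reinitialized, so those assumptions live only in the comments.

import subprocess, time

OID = "1.3.6.1.2.1.2.2.1.10.1"   # IF-MIB::ifInOctets for interface index 1

def counter(host, community="public"):
    # -Oqv asks snmpget to print only the value; the final token should be the Counter32
    out = subprocess.run(
        ["snmpget", "-v2c", "-c", community, "-Oqv", host, OID],
        capture_output=True, text=True, check=True)
    return int(out.stdout.split()[-1])

first = counter("192.0.2.10")                     # hypothetical managed device
time.sleep(60)
rate = (counter("192.0.2.10") - first) / 60       # naive: ignores counter wrap and agent restarts
print(f"inbound ~{rate:.0f} octets/second")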

Many standard sets of SNMP syntax and semantics for network and applications systems were defined, and they attempt to answer these questions. However, manufacturers quickly introduced proprietary extensions to the standard MIB data definitions, and often the semantics of those extensions were poorly specified, which stymied interoperability.
Proprietary extensions are inevitable, and they are a mixed blessing for customers. They are desirable because they enable vendors to innovate and offer unique value-added features beyond the standard SNMP management capabilities. They are a problem because management applications from other vendors often do not use the foreign extensions to advantage. Without complete specifications, and without a financial incentive to do the integration work, vendors can't and won't incorporate other vendors' extensions into their management tools. This leads to situations where customers having similar network devices from several vendors must use different management tools for each product set, even though all the devices perform the same functions in almost identical ways.
This problem of data definition continues in current standards efforts, although there has been some improvement. The Extensible Markup Language (XML) standard is already being used extensively for exchanging structured information, and many vendors have adopted XML as a means for exchanging information between their own management products. XML takes a step forward by including methods for converting the format of a message's data into a format understood by the receiver, but the semantics of that message must still be defined elsewhere.
The Distributed Management Task Force, a standards body composed of industry players, has recently defined the Common Information Model (CIM), which is intended to complement XML by offering more complete definitions for all management tools. For example, CIM can be used to describe each managed object by the following:
Characteristics, or attributes: Describe the specific parameters associated with each object. A server, for example, would have characteristics describing its manufacturer, model, memory capacity, disc storage, number of processes, and other attributes. An application would have characteristics describing its requirements for processing, storage, network resources, and service quality.


Methods: Describe the operations that can be performed on the object. For servers, there would be methods for rebooting, killing a process, creating a process, changing the number of active threads, and other operations.
Indications (alerts): Used by the object to communicate with external entities. A server would send indications when a process failed, memory was running low, or the disc system was clogged, for instance.
Associations: Used to describe the relationships among various managed objects, allowing a management system to construct logical groupings.

CIM is still very young, and not yet widely used, but it points a way to the future.
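As a loose analogy only, and not CIM's actual MOF syntax or class hierarchy, a managed object with characteristics, methods, indications, and associations might be modeled like this; all names and values here are invented.

from dataclasses import dataclass, field

@dataclass
class ManagedServer:
    """Illustrative analogue of a CIM-style class for one managed object."""
    # characteristics (attributes)
    manufacturer: str
    model: str
    memory_gb: int
    associations: list = field(default_factory=list)   # related managed objects

    # methods (operations the management system may invoke)
    def reboot(self):
        print(f"rebooting {self.model}")

    # indications (alerts emitted toward external entities)
    def indicate_low_memory(self, free_gb):
        return {"class": "MemoryLow", "source": self.model, "free_gb": free_gb}

srv = ManagedServer("ExampleCo", "web-01", 16, associations=["load-balancer-1"])
srv.reboot()
print(srv.indicate_low_memory(0.5))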

Service Management Architecture: A General Example


The service management architecture shown in Figure 3-2, and described in the following subsections, is intended to be an example of a typical architecture used by large organizations. It can accommodate the changes that take place during the service delivery lifecycle faced by any organization that relies extensively on networked delivery of information. It encompasses both the components on which service delivery relies and the service that is a product of those components.

Instrumentation
Instrumentation, described in detail in Chapter 4, "Instrumentation," and Chapters 8 through 10 and shown at the top of Figure 3-2, monitors and measures the performance and availability of system components, as well as that of services. Instrumentation of components, or element instrumentation, tracks the status and behavior of individual components, such as network devices, servers, and applications. Examples of element measurements are CPU busy percentage and the percentage of received packets that contain transmission errors. Services instrumentation tracks the behavior of services using active and passive collectors. Examples of measured services are round-trip time through a network and transaction response time.
Instrumentation takes two forms:
Active instrumentation: Adds traffic to a system, essentially performing a small experiment to validate compliance with key parameters. An example is the ping tool that sends a single packet to a remote system component, which then immediately returns a copy. The tool measures the time delay between when the packet went out and when the copy returned, and, if multiple packets are sent, the tool also reports the percentage that returned. (A minimal sketch of this kind of probe follows this list.)
Passive instrumentation: Relies on system traffic and facilities that are already in place to provide performance data. Examples are the use of existing log files to measure workload and server response time. Other passive devices can sit on a network segment and watch the packets passing by, deriving a lot of data about workload, error rates, response time, and more.
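As a minimal sketch of an active probe in the spirit of ping, the following times a TCP connection setup to a hypothetical host; a real collector would probe repeatedly, randomize its schedule, and record failures for availability statistics.

import socket, time

def tcp_probe(host, port=80, timeout=5.0):
    """Active probe: time a TCP connection setup to a service, ping-style."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0   # setup time in milliseconds
    except OSError:
        return None                                         # unreachable counts against availability

print(tcp_probe("www.example.com"))   # placeholder target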

Figure 3-2  Web Service Management Architecture

[Figure 3-2 shows instrumentation feeds from applications, server systems, content distribution networks, transport networks, third-party services, and end-user measurement flowing into instrumentation management, aggregation, and filtering. That layer sends SLA data to SLA statistics and reporting, and alerts to real-time event management. The event manager feeds real-time operations, policy-based management, back-office operations, and long-term operations, all of which drive system control and configuration.]

Instrumentation Management
Instrumentation managers, described in Chapter 4 and shown in the middle of Figure 3-2, configure the instrumentation systems and receive the measurement data from them. They examine each incoming data item, filtering out obvious measurement errors and comparing measurements to specified thresholds to see if an alert should be issued. If measurements indicate a possible problem, the instrumentation manager may demand additional measurements to help make sense of the problem and to see if the original measurement was an outlier or a true indicator of a difficulty. There are two primary outputs from the instrumentation manager: alerts and service level indicator data. The former consists of alerts that are important enough to be escalated to the real-time event handler, where they will be combined with other data for evaluation; the latter consists of data sets and aggregated measurements that are forwarded to the SLA statistics system for statistical treatment and reporting on system performance.
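A drastically simplified version of that screening logic might look like the following; the floor, ceiling, and alert threshold are invented values standing in for whatever the SLA and the measurement technology dictate.

def screen_measurement(value, history, floor=0.0, ceiling=60.0, alert_at=5.0):
    """Toy instrumentation-manager step: discard impossible readings, keep plausible
    ones for SLA statistics, and escalate threshold crossings to the event handler."""
    if value < floor or value > ceiling:           # obvious measurement error
        return "discard"
    history.append(value)                          # forward to the SLA statistics store
    if value > alert_at:
        return "alert"                             # escalate to real-time event management
    return "ok"

kept = []
print([screen_measurement(v, kept) for v in (1.2, 75.0, 6.4)])   # ['ok', 'discard', 'alert']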


SLA Statistics and Reporting


Data sets received from the instrumentation manager are processed to generate statistics appropriate to their use as service level indicators, as described in Chapter 2, "Service Level Management." The summary information is placed in a database for later reference and for use in generating periodic reports about system performance for the team that manages compliance with SLAs and for other concerned groups. Summary information is also made available to the operations groups to help them determine if changes have to be made to the system to maintain compliance with service level commitments. (The goal, after all, is to find and fix problems before an SLA violation occurs.)

Real-Time Event Handling, Operations, and Policy


The real-time event manager, discussed in Chapter 5, "Event Management," and shown at the center of Figure 3-2, acts as a central switchboard, connecting other parts of the management system to the instrumentation driving them. It's the core component of most commercial management systems, such as HP OpenView, Tivoli Enterprise Console, and Unicenter TNG. It can communicate with many different instrumentation systems, using multiple standards and techniques. In some cases, it passively waits for alerts to be received; in other cases, it actively polls remote instrumentation to obtain data on a regular basis.
Because it has a far broader view of the system than any individual instrumentation manager, the central real-time event manager can identify performance patterns that the individual instrumentation managers cannot see. It's also aware of the topology and interdependencies of the system being measured. It can, therefore, do a better job of data filtering, aggregation, and problem detection than the instrumentation manager. As just one example, it can have the knowledge necessary to realize that the hundreds of "component unavailable" messages flooding into its receivers are the result of a single router failure because all the messages are about components that depend on that failed router for access to the rest of the network.
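A toy version of that topology-aware reasoning is sketched below; the component-to-router dependency map and the device names are hypothetical.

# depends_on maps a component to the router it relies on for reachability (hypothetical topology)
depends_on = {"web-01": "rtr-3", "web-02": "rtr-3", "db-01": "rtr-3", "mail-01": "rtr-7"}

def root_cause(unavailable_components, failed_routers):
    """Collapse a flood of 'component unavailable' events into their likely root cause."""
    secondary = [c for c in unavailable_components if depends_on.get(c) in failed_routers]
    primary = [c for c in unavailable_components if c not in secondary]
    return {"suppressed": secondary, "investigate": primary}

print(root_cause(["web-01", "web-02", "db-01", "mail-01"], failed_routers={"rtr-3"}))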
The event manager also can determine problem priorities by following preprogrammed rules, and it can automatically activate other management tools. This enables an administrator to design a sequence of responses, using each tool at the appropriate step. The configuration of the event manager is still a manual process, but no further attention is required after the rules and process are set.
The real-time operations management function (described in detail in Chapter 6, "Real-Time Operations") provides much more sophisticated event analysis and handling than the basic event manager. Using the output from the event manager, real-time operations management applies complex algorithms to find more subtle patterns in the data. It can try to predict future failures by noticing patterns that have resulted in failures in the past, and it can also automatically take actions to fix existing or predicted problems without operator intervention, thereby avoiding violation of the SLA and its associated policies.


The policy manager applies business rules to the operation of the system. It is an automated tool that identifies the service levels allocated to each end user and application, based on rules programmed by the system operators. It then tunes the system and denies system access as needed to enforce those service levels.
Some examples of the functions performed by the trio of event manager, operations manager, and policy manager are listed here and are discussed in more detail in Chapters 5 through 7:

Compliance testing: Performs real-time monitoring of service behavior, comparing the actual behavior against the objectives in the SLA. Alerts are forwarded to other real-time functions if service quality becomes questionable.
Root-cause analysis: Identifies the likely cause of potential or actual performance degradation or availability problems. It begins with steps to determine which infrastructure is involved in service quality degradation, later proceeding to identify the elements within the infrastructure that are probably the source of the problem.
Predictive analysis: Predicts future behavior and thereby avoids service quality disruptions. These approaches vary from statistical strategies to actual testing for nonlinearity (inflection) points.
Automated operation: Provides automatic handling of problems. Analyses of various types are useful, but they must lead to actions that rectify the problem (or the threat) of a service-quality disruption. The increasingly stringent downtime requirements in SLAs encourage automated responses if they can be made quickly and accurately.
Policy oversight: Provides completely automated tuning of service-delivery systems to enforce policy rules that determine who obtains particular service levels and the amount of service they're allowed to use. Policy oversight is used to apply business rules to the services provided by the system.

Long-Term Operations
Some operations are considered to be longer term because their activation or completion within a short time interval is not critical. Such longer-term operations, shown at the bottom of Figure 3-2, can be associated with strategic changes to the service-delivery environment, or they can offer more fundamental remediation of problems identified by alarms. Some examples of longer-term operations include the following:
Load testing: Through real tests of the system, or of a representative environment in a test bed, a load test helps determine the actual nonlinearity points and bottlenecks within a system. This information also validates capacity planning and helps set the appropriate thresholds for detecting problems.
System modeling and capacity planning: Using information collected over a period of time, these predict infrastructure usage trends and the resulting resource needs. This provides enough time to get resources in place before there is any service-quality impact.


Load testing is discussed in Chapter 11, "Load Testing," and system modeling and capacity planning are discussed in Chapter 12, "Modeling and Capacity Planning."
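Much capacity planning starts from nothing more elaborate than a trend line over historical utilization. The sketch below fits an ordinary least-squares line to invented weekly CPU readings and estimates the days remaining before a hypothetical 80 percent planning threshold is reached; real capacity models are considerably richer.

def linear_trend(observations):
    """Ordinary least-squares slope and intercept over (day, utilization) pairs."""
    n = len(observations)
    sx = sum(d for d, _ in observations)
    sy = sum(u for _, u in observations)
    sxx = sum(d * d for d, _ in observations)
    sxy = sum(d * u for d, u in observations)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# weekly average CPU utilization samples (day index, percent), illustrative only
history = [(0, 41), (7, 44), (14, 47), (21, 52), (28, 55)]
slope, intercept = linear_trend(history)
days_to_limit = (80 - intercept) / slope          # days until the 80% planning threshold
print(round(days_to_limit))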

Back-Office Operations
Back-office operations, shown at the bottom of Figure 3-2, are related to the business side of service delivery. These processes have usually been described as Operations Support Systems (OSS) in the world of traditional telephone providers. They constitute a bridge between operations of the service-delivery environment and the management of the business that pays for them. Typical back-office functions for service providers include the following:

Billing: Tracks resource usage and charges accordingly. Billing will be tied to the negotiated terms of the SLA, and it must be flexible and easily extended to incorporate new services.
Provisioning: Allocates resources to support one or more usage instances of specific services, and it associates one or more consumers with that resource allocation for service billing or charge-back purposes.
Customer service: Provides the help desk, web pages, and other means of interacting with customers and supporting a wide range of needs, such as ordering services, getting information, and resolving disputes.
Order tracking: Follows customer orders through the steps from initial contact through ordering, activation, and revenue capture.
Financials: Looks at the business side by tracking metrics such as Return on Investment (ROI), the profitability of various services, and capital spending projections.

Service consumers must manage their online business with similar types of information. For example, they should track the performance of their providers, the cost of the services that they use, and the benefit or income from their use of those services.
The business-process metrics described in Chapter 2 can be used to suggest metrics that will help manage the overall performance of the back-office operations.


Summary
This chapter outlines the important parts of a complete service level management system. Starting with a description of a large-scale Web services delivery architecture, it then shows the influences of that architecture on the design of service level management systems. Critical influences are the constant demands for changing services, the use of multiple service providers and partners, elastic boundaries among teams and providers, demands for fast system management, and the need for mutually understandable data item definitions and event signaling mechanisms among the various pieces of the management system.
The generic management system outlined in Figure 3-2 is used as a reference model in the rest of the book. The parts of the generic management system consist of system instrumentation (fully described in Chapter 4 and Chapters 8 through 10), instrumentation management systems (described in Chapter 4), and the SLA statistics and reporting systems (described in Chapter 2) that use the data from instrumentation. The parts also consist of the real-time operations systems (event handling, Chapter 5; operations, Chapter 6; and policy, Chapter 7), along with long-term operations (load testing, Chapter 11; and system modeling and capacity planning, Chapter 12), and, finally, back-office operations, which are not further described in this book.


PART II

Components of the Service Level Management Infrastructure
Chapter 4  Instrumentation
Chapter 5  Event Management
Chapter 6  Real-Time Operations
Chapter 7  Policy-Based Management
Chapter 8  Managing the Application Infrastructure
Chapter 9  Managing the Server Infrastructure
Chapter 10  Managing the Transport Infrastructure


CHAPTER 4

Instrumentation
The term instrumentation is used to describe the technologies and processes for monitoring and measuring the behaviors of services, infrastructures, and elements. You use instrumentation to monitor behavior and assess the impact of changing operational conditions on your ability to meet the compliance requirements for Service Level Agreements (SLAs). Service managers need appropriate instrumentation to inform them of actual or potential problems and provide feedback after they make adjustments. This chapter introduces an instrumentation methodology for managing the infrastructure described in the overview in Chapter 3, "Service Management Architecture."
This chapter covers the following topics:

The differences between element and service instrumentation


The information needed to make effective service management decisions
Instrumentation modes: trip wires and time slices
The instrumentation system
Instrumentation design for service monitoring
Instrumentation trends

Differences Between Element and Service Instrumentation
A large installed base of element instrumentation already monitors infrastructure elements.
These elements include servers, applications, switches, and databases. Monitoring service
behavior requires different instrumentation approaches.
Technical managers have always relied on the instrumentation in various infrastructure
elements for guidance. The instrumentation supplies management applications with
information on element status, resource usage (CPU or bandwidth, for example), and
errors. Instrumentation also generates real-time alerts when an element needs immediate
attention from the management system. The management application processes the
information and may initiate a set of responses depending on the results. The application


displays the information for an administrator through a console or management portal, activates other management tools, and saves the information for further analysis.
Most organizations have already made substantial investments in element instrumentation and in the tools for collecting and analyzing the information that the instrumentation provides. Administrators can monitor status, collect operational statistics, and receive real-time alarms when an element needs immediate attention.
Each element collects information about itself. The behavior of an individual element, however, may not have a direct impact on service behavior and quality. For example, the failure of an individual network element may have no impact on service quality when there is adequate redundancy and responsive dynamic routing. An element problem must be addressed and resolved by those responsible for managing it, but no problem has occurred from a service manager's perspective as long as the SLA metrics are in an acceptable range.
Element instrumentation is absolutely essential; unfortunately, however, it is insufficient for measuring service behaviors. A service behavior is determined by the aggregate behavior of the supporting infrastructures and their elements. Different instrumentation is needed to measure service behavior and to identify actual or potential service disruptions, which are inabilities to comply with an SLA. A disruption initiates a top-down process that first identifies the infrastructure most likely to be causing the disruption. Further analysis and measurement is needed to determine which elements are contributing to the problem.
Infrastructure administrators still must deal with an element failure because it is their responsibility to maintain high availability and redundancy within the infrastructure. They may prioritize their tasks based on service quality impacts, with those elements affecting critical services receiving attention first. Element management tools then are used to pinpoint the problem, take corrective actions, and verify that the element is operating properly. Finally, services instrumentation verifies that the service disruption has been resolved and eliminated.
Administrators also need information on infrastructure behavior and the correlation
between services and infrastructure usage. They need to understand when element
problems might affect their service quality. They also need to identify the elements
associated with any given service when there is a disruption.
As shown in Figure 4-1, service instrumentation extends the discipline beyond monitoring
individual elements. Elements are organized into infrastructures that are monitored as
cohesive systems. Actual service behavior is measured across the appropriate
infrastructures that support the associated activities. Examples of these service classes
include interactive, streaming, and transaction.

Figure 4-1  Element and Service Instrumentation

[Figure 4-1 shows element instrumentation collecting low-level measurements such as CPU and queue statistics, frame and packet errors, and hit counts from individual elements, while service instrumentation measures behavior, such as response time, across the aggregate of those elements.]

Figure 4-1 shows the relationship between element and service instrumentation. Service instrumentation must monitor and measure the overall behavior of the aggregate elements supporting any service flow.
Business managers also rely on real-time information to track business processes and goals.
Technical information must be translated into business-centric metrics. A large transaction
volume indicates high server performance, but this high volume may have no business
value if the transactions are completing quickly because the desired content is missing.

Information for Service Management Decisions


There are many technical and business management decisions that affect only a single day's operations. Others affect the long-term ability of an organization to deliver a high level of service quality. This section examines the various types of management decisions and the roles instrumentation plays in those decisions.
Most environments usually have a mix of services whose dynamic behaviors are influenced by the constantly changing interactions of the technology infrastructures, service providers, and customers. Service managers must make ongoing adjustments to maintain an overall equilibrium that delivers consistent service quality despite one or more failures, load shifts, and resource conflicts.


Lack of the appropriate service instrumentation leaves service managers to manage by hope and by customer feedback. This strategy involves reacting to problems reported by customers and trying alternatives until something works.
Decisions based on poor management information result in the following:
Extending service disruptions by lengthening the time to resolve problems: Staff may consume additional time narrowing the cause.
Disrupting staff unnecessarily: Calling a network expert to deal with a server problem wastes his or her time and extends the disruption.
Addressing a technical problem to the detriment of the business: All failures do not have equal business impacts. For example, fixing an internal e-mail problem while customers cannot transact business could prove very costly.
Exacerbating the issue: Poor information can lead to poor decisions that introduce further disruptions.

Technical and business managers need information about, and insight into, service behavior
so that they can make effective service management decisions. Operational decisions must
be made within short time intervals, while other decisions, having major long-term effects,
can be made more slowly. The text discusses these in turn.

Operational Technical Decisions


Technical managers must administer and control highly dynamic environments that have changing network and computing loads. Many environments also have a growing mixture of services, often with conflicting demands. An environment solely dedicated to exchanging files has very different operational characteristics than one for interactive Web-driven processes. Supporting a range of different service characteristics can introduce conflicts and interactions that degrade the performance of all services. Technical managers need to allocate resources quickly to minimize instabilities in their delivery of services.
Operational decisions are tactical in nature; that is, adjustments are made in response to current conditions to sustain compliance with an SLA. Managing for compliance requires fast and accurate responses to constantly changing conditions, and accurate information provided by good instrumentation is essential.

Operational Business Decisions


Business administrators must also make an increasing number of operational decisions as
more online business processes are introduced. Business perspectives rather than the
underlying technology behaviors drive real-time business adjustments. Business-centric
metrics are often derived from a combination of processed technical measurements and
direct instrumentation of services.


As an example of the business perspective, an online web site selling merchandise can be monitored with network-based probes tracking the actual URLs being used. The web applications can also provide direct access to information. The instrumentation indicates the number of abandoned shopping carts by analyzing the URLs flowing on the network or reaching certain points within the application itself.
Technical problems, such as slow credit authorization or billing services, can increase the number of abandoned shopping carts. These problems are correctable with standard technical means. Carts are also abandoned when there is a problem with the actual web content or navigation. The business administrator needs to understand when an unwelcome change has occurred and take steps to keep the business running smoothly.

Decisions That Have Long-Term Effect


Other causes of service degradation are rooted in poor long-term management processes rather than in any dynamic operational fluctuations. Long-term management is strategic in nature, with administrators taking steps to eliminate or minimize future problems. High-quality information is essential for these tasks as well. Managers must have confidence in their information because they are investing in the resources to increase the competitive capabilities of their organizations.
Examples of long-term decisions include the following:

- Provisioning by using current operational data and trends to predict future resource requirements: Many operational decisions involve reallocating resources as conditions change; provisioning, in contrast, tries to ensure that there are sufficient resources to manage operationally.
- Stress testing services to determine their actual capacities and breaking points: Managers stress test those operational areas and loads where service disruptions are more likely.
- Evaluating the services mix to determine if new services will destabilize the current mix and introduce more service degradation: Managers can avoid unpleasant surprises and outages by planning ahead.

Instrumentation is the bedrock for managing services and service quality. An instrumentation system provides accurate and timely information for a range of management
decisions and other functions. In addition, instrumentation provides essential feedback for
technical and business administrators. Measuring the results of any decision validates good
choices or indicates whether further attention is still needed.

Instrumentation Modes: Trip Wires and Time Slices


Technical and business administrators can use instrumentation in different ways to make
operational and strategic management decisions.


There are two primary instrumentation modes: activating trip wires and taking time slices.
A combination of trip wires and time-sliced measurements is used for supporting
operational and strategic service management tasks. Figure 4-2 shows the use of trip wires,
which can generate real-time alerts by comparing a behavior to a static threshold value or
by tracking deviations from a normal behavioral envelope. Time slices are repetitive
measurements of the same variables over time.

NOTE

Trip wires and time slices are used for real-time alerts; time slices also help with longer-term functions, such as planning.

Figure 4-2  Trip Wires

Trip Wires
Trip wires provide simple real-time alerting for operational decision making. Management
tools compare the collected information to established thresholds. An alert is sent to a
management application when the value is higher or lower than the established threshold.
Further processing of the alert determines whether it is a valid problem, whom to notify,
and which tools to activate.
A series of thresholds can be established. Consider an SLA requiring response times of five
seconds or less. A warning level (2.5 seconds, for instance) gives administrators ample time
to investigate a performance shift and take appropriate action. A three-second threshold
denotes a performance level that is getting closer to unacceptable values, and a four-second
threshold is used to bring an urgent response from the management system.
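To make the tiered-threshold idea concrete, the following minimal Python sketch classifies a single response-time sample against the 2.5-, 3-, and 4-second levels described above; the severity names and the way alerts are routed are illustrative assumptions, not any particular product's behavior.

from typing import Optional

SLA_LIMIT_SECONDS = 5.0
THRESHOLDS = [           # (threshold in seconds, severity), checked from most severe down
    (4.0, "urgent"),     # bring an urgent response from the management system
    (3.0, "serious"),    # performance is getting closer to unacceptable values
    (2.5, "warning"),    # ample time to investigate the performance shift
]

def classify_response_time(seconds: float) -> Optional[str]:
    """Return the severity of the trip wire that fires, or None if no trip wire fires."""
    for threshold, severity in THRESHOLDS:
        if seconds >= threshold:
            return severity
    return None

print(classify_response_time(3.2))   # "serious": investigate well before the 5-second SLA limit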


Determining when a trip wire should be triggered is fairly simple. However, the simplicity introduces difficulties because a threshold is usually a static value and the environment is dynamic. There may be peaks and valleys of activity, and a threshold set too low will trigger a rash of alerts that do not really indicate a problem. Raising the threshold reduces the alert volumes when the normal load is high, but introduces the risk of missing situations when normal volumes are lighter.
One key to effective instrumentation is selecting realistic thresholds to ensure accurate warnings. Many product specifications are based on a set of optimum conditions, and actual performance can be quite different. Realistic load testing is a practical means for determining accurate threshold values. Load testing is discussed in Chapter 11, "Load Testing."

Time Slices
Time slices are repetitive measurements of the same variables over longer time intervals.
They track changes in normal behavior over an extended period of time.
Baselines are an example of a time-sliced measurement. Baselines are also used as trip
wires because they provide a more accurate assessment of dynamic behavior. Repetitive
measurements are used to set the initial baseline for normal behavior as an envelope with
high, low, and average values. Statistical techniques such as those mentioned in Chapter 2,
"Service Level Management," can be used to set the baseline values. A baseline approach
sends an alert whenever measurements fall outside the normal envelope. Current measures
are compared to the baseline and deviations can reveal conditions such as the following:

- A shift in normal behavior that naturally occurs over time with growth or changes: This situation merely defines a new normal baseline.
- A trend showing that performance is shifting toward the edges of the envelope and thus may indicate an underlying problem: Administrators spend time only on situations that are actually abnormal.
- A measurement that should not have any influence can be detected and discarded: A single measurement, for instance, could have a very abnormal value, but a single occurrence requires no further attention. Many artifacts, which are false indications of the actual situation, can be automatically screened, saving valuable staff time and minimizing unnecessary interruptions.

Baselines are most effective when the environment is stable long enough to take the
measurements and make the calculations. Baselines must be recalculated as normal loads
grow or newly added services alter the environment.
Time slices require consistent measurements over time and some processing to determine
the trends. Trends revealed with time-sliced measurements are used for longer-term
planning and optimization functions.
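As a rough sketch of how such an envelope might be computed and applied (the two-standard-deviation band is an assumption; the chapter leaves the exact statistical technique open):

# Illustrative baseline envelope built from repetitive time-slice measurements.
from statistics import mean, stdev

def build_baseline(samples):
    """Summarize a stable measurement period as an envelope of normal behavior."""
    avg, sd = mean(samples), stdev(samples)
    return {"low": avg - 2 * sd, "average": avg, "high": avg + 2 * sd}

def deviates(baseline, measurement):
    """True when a current measurement falls outside the normal envelope."""
    return measurement < baseline["low"] or measurement > baseline["high"]

# Example: response times (seconds) gathered while the environment was stable.
history = [1.1, 1.3, 1.2, 1.4, 1.2, 1.3, 1.1, 1.5, 1.2, 1.3]
baseline = build_baseline(history)
print(deviates(baseline, 1.25))  # False: within the envelope
print(deviates(baseline, 3.0))   # True: candidate alert, pending artifact screening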


The Instrumentation System


Administrators usually find themselves with a collection of unrelated instrumentation components. These components can be organized into an instrumentation system, which is an adaptable system for the following:

- Collecting service management information at several granularities or levels of detail
- Storing the collected information for a variety of other service management functions
- Organizing a high-volume stream from multiple sources into a manageable set of alerts and alarms

The instrumentation system provides the framework for monitoring service behaviors and
reporting them to other parts of the service management system. The instrumentation
system manages collectors and aggregators, ensuring that they are operating properly and
collecting the appropriate information. The processing functions organize the data and save
some for long-term storage. The collectors and aggregators collect and reduce data and pass
alerts to the processing or event management functions. An instrumentation system
produces the information necessary for making sound management decisions at the tactical
or strategic levels.
A service instrumentation system provides an organizing framework for leveraging the
installed instrumentation base while guiding the incorporation of new components.
Instrumentation is dynamic; new instrumentation emerges with new technologies and
services. New information sources must be incorporated with minimal staff intervention
and then leveraged by other service management tools.
The major components of a service instrumentation system are shown in Figure 4-3. Event
handling, in the real-time event manager, and SLA management tools are also included
because they are tightly coupled with the instrumentation system. The basic cyclic behavior
of instrumentation management, collection, and processing drives many other management
functions.
These components represent an abstract way of discussing what an instrumentation system
does. The reality of how it is actually implemented is usually messier; some of these
functions take place in several stages and in different parts of the system. Different vendors
offer different sets of features and functions; the completeness of the system functions is
the goal. The behavior can be viewed as cyclic. The collected information causes
adjustments in the information collection process, which creates new information, which
results in a change, and so forth.

Figure 4-3  The Instrumentation System and Related Management Functions

Starting with the Instrumentation Managers


Instrumentation managers do the following:

- Monitor and control a distributed group of collectors and aggregators
- Establish instrumentation policies by transferring policy information to each collector and aggregator
- Control the local data collection activities in a distributed set of collectors and aggregators

These policies can specify the types of measurements to be taken, their frequency, and the
acceptable ranges of values. For example, simple policies can dictate that more than three
consecutive abnormal measures should generate an alert. The measurement frequency
policy should be based on the failover latency (how long it takes redundant components to respond to a service disruption and resume service delivery at the specified quality levels). Thus, for example, if a service fails over within five minutes, your system should test every 1 to 2 minutes.
Instrumentation managers simplify operations because a single command affects the
operation of many collectors and aggregators. Thus, staff time and mistakes are reduced and
the instrumentation is managed effectively.
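A hypothetical policy record pushed from an instrumentation manager to its collectors might look like the following sketch; the field names, values, and the apply_policy method are illustrative assumptions, not a vendor interface.

# Hypothetical instrumentation policy distributed to collectors and aggregators.
policy = {
    "measurement": "http_response_time",    # type of measurement to take
    "interval_seconds": 90,                 # frequency, tied to the failover latency
    "acceptable_range": (0.0, 5.0),         # seconds; the SLA limit is the upper bound
    "consecutive_abnormal_for_alert": 4,    # more than three consecutive abnormal measures
}

def distribute(policy, collectors):
    """A single management directive configures many collectors at once."""
    for collector in collectors:
        collector.apply_policy(policy)      # assumed collector-side method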
Instrumentation managers periodically use a heartbeat to verify continued collector and
aggregator operations. A heartbeat is a periodic exchange of messages to verify that both
parties are operating properly. Consider that an independent collector (discussed later in
this chapter) might not communicate for long periods of time when no problems are
detected. The instrumentation manager, in this case, uses a heartbeat to determine whether
the collector is still operating; if heartbeats are not returned, the instrumentation manager
must take steps to reestablish communication or shift to other monitors.
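A minimal heartbeat loop, sketched under the assumption that each collector answers a lightweight request (the method names and intervals are hypothetical):

import time

HEARTBEAT_INTERVAL_SECONDS = 300   # assumed spacing between heartbeat exchanges
MISSED_BEFORE_ACTION = 2           # missed heartbeats tolerated before intervening

def watch_collector(collector):
    """Periodically verify that a quiet collector is still operating."""
    missed = 0
    while True:
        if collector.heartbeat():                  # hypothetical request/response exchange
            missed = 0
        else:
            missed += 1
            if missed >= MISSED_BEFORE_ACTION:
                # Reestablish communication or shift monitoring to other collectors.
                collector.reestablish_or_failover()
                missed = 0
        time.sleep(HEARTBEAT_INTERVAL_SECONDS)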

Collectors
Collectors measure service behavior instead of element behavior. They collect information
that is suited for each class of services. The information includes response times for
transactions and packet loss for interactive or streaming classes. Collectors measure
specific service instances, verifying that individuals, groups, or regions receive acceptable
service quality.
Collectors can be programmed to provide more granular service information. They can
measure subtransactions to make distinctions among functions, such as downloading a
page, executing a stock trade, or ordering merchandise. Collectors use a combination of
active and passive techniques. These techniques are discussed later in this chapter.
Collectors and aggregators are shown in Figure 4-3 in relation to other instrumentation
system components. They are the source of management information and alerts for the
processing and event management components. Alerts are the trip wire; the collectors send
alerts when a certain condition, such as an unacceptable delay, has been detected. The
system provides time-sliced data for SLA tracking and a variety of purposes.
Collectors can be embedded in network elements (as described in the sidebar), incorporated
as software modules in desktops or servers, or packaged as standalone components.
Continued processor price/performance improvements reduce the impact when more
instrumentation processing is embedded. In addition, the additional processing power
enables more complex measurements. In the future, collectors will interact with other
collectors and instrumentation managers.


Embedding Collectors in the Network Infrastructure


Cisco Systems has been developing the Service Assurance Agent (SAA) as an advanced
collector. The SAA is embedded within network infrastructure elements where it conducts
behavioral measurements across the network infrastructure. Embedding the SAA within
elements gives wide instrumentation coverage, granularity, and speed. There is no need to
physically move agents to a new location before measurements are collected; the embedded
agent is activated and it quickly captures the operational information.
The SAAs are designed as a coherent, collaborative monitoring system. They exchange
traffic with each other as they carry out measurements, and various relationships between
SAAs are created as needed by external management applications. An extensible markup
language (XML)-based interface opens the architecture to third-party tools.
SAAs collect information in both active and passive modes. The active measurements use
a variety of virtual transactions to probe a range of behaviors. Their monitoring
functionality can also be extended through software enhancements. SAAs can carry out
periodic measurements to track behavior on an ongoing basis, and they can carry out
specialized tasks as needed.
As shown in Figure 4-4, SAAs at the edge of the network can measure the delay across the
entire network infrastructure. An alert is forwarded to an external management application
when the delay begins to approach a threshold level. More granular measurements are
obtained by using other SAAs in the path between the edges. Hop-to-hop delays are
monitored between each pair of SAAs to quickly identify the part of the network
infrastructure that is causing the slowdown. Periodic measurements can also track jitter,
which is a key metric for interactive and streaming service classes.
Figure 4-4  Cisco Systems Embeds SAAs to Measure Performance Within or Across a Network Infrastructure


Aggregators
Aggregators are used for scaling and for providing efficiency through monitoring and managing a set of local collectors. Aggregators consolidate the information and usually carry out simple filtering to reduce the volume of information they forward to the processing functions. Aggregators also conserve bandwidth by filtering alerts and forwarding only those needing further attention. Figure 4-3 shows how aggregators can be
cascaded to scale even further.
Aggregators also scale the instrumentation management tasks because they can accept a
single management directive and distribute it to the collectors they control; in that case, they
are instrumentation managers as well as aggregators. They use heartbeats to check collector
health and to set new monitoring policies as directed.
Aggregators can also provide local correlation and integration of the information from
multiple collectors. This creates higher-quality information for components higher in the
chain.

Processing
Processing involves a range of functions that are packaged in vendor-dependent ways.
Further, these processing functions are widely distributed within the instrumentation
system. For example, the collectors themselves usually test for trip-wire situations. In
addition, they often build baselines and carry out more sophisticated measurements.
Remember this rule of thumb: Functions tend to move toward the information source.
Some functions overlap with event management or with features of some management
tools. Such situations are acceptable because completeness of monitoring coverage is the
goal.
When new information arrives, it may need grooming. Grooming is the process that simplifies the information-handling tasks of the other components. For example, some data values might need normalization because different collectors use different value ranges. Collectors from one vendor might have a range from 1 to 10, and another collector might provide values from 1 to 50 for the same type of information. The data cannot be accurately compared until the ranges are normalized to the same scale; in this case, multiplying the first set of data by 5 provides consistency.
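A tiny sketch of that grooming step, using the 1-to-10 and 1-to-50 ranges from the example (the collector names are made up):

# Normalize measurements from collectors that report on different scales
# so they can be compared directly (here, onto a common 1-to-50 scale).
SCALE_FACTORS = {
    "collector_a": 5.0,   # reports 1 to 10, so multiply by 5
    "collector_b": 1.0,   # already reports 1 to 50
}

def groom(raw_value, source):
    """Return the measurement normalized to the common scale."""
    return raw_value * SCALE_FACTORS[source]

print(groom(8, "collector_a"))    # 40.0, now comparable with collector_b readings
print(groom(40, "collector_b"))   # 40.0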
Grooming can also include artifact reduction, as discussed in Chapter 5, "Event Management." Some of these functions are packaged differently, depending on particular
vendor packaging choices.
Trip wires require real-time processing of the collected data to test when an alert is sent.
Developers of collectors are applying more sophisticated testing to reduce the alert volume.
For example, the collector might not forward an alert unless a threshold has been exceeded
for three measurements in succession.


A single transaction that is close to a performance threshold is another example of trip-wire processing. A single increase in delay times might not be of concern; however, a sustained
increase in delay times is a cause for further investigation. Single sporadic incidents might
be logged and need no further staff attention at the moment.
Baseline measurements are regularly scheduled to stay abreast of the normal operating
envelope. Real-time measurements are compared to current baselines to detect deviations
that could be leading to service disruptions.
Most real-time information is discarded after it is checked for conditions that exceed
threshold values or that deviate from the activity baselines. Real-time tracking over long
time intervals generates a large amount of data with no long-term value. Some real-time
information is reduced and saved for use in SLAs and in longer-term trend analyses that are
expressed with a few points and a formula.

Ending with the Instrumentation Manager


Completing the cycle in Figure 4-3 brings us back to the instrumentation manager. The
event manager influences the instrumentation manager and will adjust the measurement
activities to suit its immediate needs.
Consider the options when a response time trend indicates a future disruption. If the
potential cause is undetermined, the event manager can ask for measurements with finer
granularity and frequency at key collectors to pinpoint the root of the problem before it
escalates. The collected information can be saved so that staff can examine the operating
conditions before a disruption occurs. Altering the measurement activities keeps the
instrumentation system focused on collecting the most useful information.

Instrumentation Design for Service Monitoring


In this section, you learn about the actual process of collecting the required information. To
monitor service behaviors, you need to select the appropriate demarcation points and
monitoring techniques.

Demarcation Points
Collectors are deployed most effectively by selecting the appropriate demarcation points, usually a boundary between organizations or infrastructures. The enterprise-service
provider interface is an example of an organization demarcation point. Collectors
positioned at each demarcation point measure the delay across the provider network as well
as within parts of the enterprise network structure. They can then provide end-to-end
service quality measurements; additional placements break the measurements into specific
domains.


The collectors in Figure 4-5 are placed at demarcation points. They mark the boundaries, moving from right to left, between the following:

- The local delays found in the desktop and local infrastructure
- The service provider delays
- The delays from the provider edge to the remote server

Figure 4-5  Measurement Demarcation Points (figure: remote server delays, service provider delays, and local delays combine into the total delay)

The desktop (or wireless phone or PDA) collector measures the entire round-trip delay for
any transaction initiated from that location. No further measurements are needed unless the
delay exceeds specifications in the SLA.
For situations in which the desktop is beyond the control of the enterprise (for example, a
web site serving the general public), or for situations where a disinterested third party is
needed, measurement services, such as those offered by Keynote Systems, can be used.
The other demarcation points are used to identify the likely cause of the delay so that staff
members are properly assigned without wasting additional time and interrupting other
activities.
As an example of the use of demarcation points, consider that measuring the round-trip
delay between the desktop and the edge of the service provider network isolates the delay
associated with the local infrastructure. Tracking the round trip between the edges of the
service provider network measures the delay introduced by the provider. Finally, measuring
a transaction from the collector closest to the server tracks the server delays.
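A small worked sketch of that decomposition; the millisecond values are invented for illustration, and each variable stands for the round-trip delay reported by a collector at one demarcation point.

desktop_to_server   = 410   # total round-trip delay measured at the desktop (ms)
desktop_to_provider = 60    # desktop to the near edge of the provider network (ms)
across_provider     = 140   # between the two provider-edge collectors (ms)

local_delay    = desktop_to_provider                  # local infrastructure
provider_delay = across_provider                      # service provider network
server_delay   = desktop_to_server - (local_delay + provider_delay)

print(local_delay, provider_delay, server_delay)      # 60 140 210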


Passive and Active Monitoring Techniques


Collectors track service quality with passive and active measurements. Passive
measurements are usually made from client desktops or customer access devices, using
instrumentation or activity logging on a client that consumes the service. The results are
widely variable and can be difcult to normalize. However, passive measurements will
accurately reect the user experience and can be important as the last resort for detecting
compliance problems.
Active measurements, in contrast, are consistent and thus easier to use for tracking
performance trends. The active measurements are also proactive, detecting potential noncompliance before passive approaches can. A combination of both types is most effective
in exploiting the strengths of each approach while minimizing the shortcomings.

Passive Collection
The most common collectors use passive collection. In other words, they gather only the
information that flows by. For example, a desktop collector tracks user activity as it occurs
and keeps a record of specific transactions and their completion times. Passive collectors
can be relatively simple and can consume minimal resources. They use no additional
bandwidth, but they can generate large volumes of data. They are good for detailed data
collection and for reactive management, such as forwarding an alarm when a problem is
detected.
Placing a collector in a desktop is a common form of passive collection and measurement.
One of the first to offer desktop instrumentation was VitalSigns, which became the Lucent
Technologies VitalSuite product line after acquisition. The collector usually intercepts
traffic flowing between the desktop and the network and measures round-trip delay while
tracking the applications and subtransactions actually being used. The information usually
is stored at the desktop until it is passed to the management system for further analysis and
processing. A real-time alert is forwarded whenever the response time exceeds a predefined
threshold value.

Active Collection
Active collection, in contrast, uses active agents to generate network and application
activity for management measurement purposes. An active approach is proactive because it
is exercising networks and services and evaluating their behavior rather than waiting for a
passive collector to detect a problem. Periodic active measurements detect problems earlier
than the passive approach. Active measurements are probing behavior even in the middle
of the night; they do not depend on user actions to highlight a problem.
Virtual transaction (or synthetic transaction) is the commonly used term for describing
active measurements. There can be a range of virtual transactions for measuring performance and for detecting service-related problems. Some examples, in order of complexity, include the following:

- Pinging to verify network connectivity and basic system response
- Activating a service by checking for service availability and access
- Initiating specific transactions by testing specific operations such as sending a message, retrieving a web page, or buying a product

Virtual transactions match the actual business processes being measured; thus, the measurements are viewed with confidence by administrators. Virtual transactions are of limited value if they don't match the actual business processes. Using a simple database query in a virtual transaction doesn't illuminate potential problems when the actual business processes are making multiple queries and activating other processes.
Checking for correct operation is essential after a virtual transaction extends beyond the simple ping. For example, a web server might return a "page not found" message quickly. Using that measurement to route more traffic to that (apparently) lightly loaded server only compounds the problem. As another example, a virtual transaction for ordering a product must verify that appropriate information is placed correctly in forms, that the credit card authorization worked, and that a confirming message was sent.
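As a minimal sketch of such a correctness check (the URL, expected marker text, and threshold are assumptions; a real virtual transaction for an order would also fill in forms and verify the confirmation step):

import time
import urllib.request
import urllib.error

def virtual_transaction(url="https://shop.example.com/catalog",
                        expected_text="Add to cart",
                        threshold_seconds=5.0):
    """Run one synthetic request and judge both speed and correctness."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            body = response.read().decode("utf-8", errors="replace")
            correct = expected_text in body
    except urllib.error.URLError:
        # Covers HTTP errors such as "page not found" as well as connection failures;
        # a fast error response must not be mistaken for a healthy, lightly loaded server.
        correct = False
    elapsed = time.monotonic() - start
    return {"correct": correct, "slow": elapsed > threshold_seconds, "elapsed": elapsed}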
Active agents are usually used as a proxy for a set of local desktops. Therefore, they must be carefully placed and configured so that they accurately reflect the user experience. The virtual transactions they use must match the actual transactions of the local desktops and they must access the same services so that the traffic flows over the same areas of the
network. When the Internet is involved and there are thousands of external customers, a
measurement service, such as that offered by Keynote Systems, can perform virtual
transactions from the same backbones and geographic locations as the customers.
Active agents consume network and application resources. Therefore, they must be
constrained through measurement policies defining the virtual transactions to use and the frequency. Other policy parameters define acceptable values so that trip wires are activated.
Highly dynamic environments that frequently create new transactions or modify current
ones add to the administrative burden. Administrators must develop new virtual
transactions or modify their current set. This entails taking the time to understand
the transactions, modeling the steps, determining successful outcomes, and measuring
parameters.

Trade-Offs Between Passive and Active Collection


Active approaches offer advantages over passive collection in that they allow proactive
responses. For instance, a virtual transaction can detect a failed server or an application that
is not available, but passive techniques can indicate only that no server traffic has been
detected.


Consistency is also a significant difference between the two approaches; a virtual transaction always exercises the same functions in the same way each time. As such, baselining is simpler because the only changes between successive transactions will be due to network, server, or content-delivery delays. Documentation about normal responses and trends is easier to build and monitor.
A passive approach can also track Web downloads by timing the duration of HTTP GET operations. However, the difficulty arises when, for example, one link brings back three lines of text and the next one brings in complex graphics. As such, getting an accurate understanding of actual performance with such variation is more difficult and requires much more processing.
Active agents must be used carefully because they consume network and application
resources with each usage. Large numbers of active agents initiating complex transactions
on a frequent basis can degrade performance and interfere with legitimate activities. In
contrast, passive approaches don't add any traffic to the system.

Hybrid Systems
A combination of active and passive agents offers optimum instrumentation coverage. The
passive agents collect information on the actual transactions and their performance, and the
active agents proactively find problems and build accurate baselines. This maximizes the
information quality while minimizing the resource impacts of virtual transactions.
An instrumentation system for tracking service behaviors can be conceptualized as a new
layer that sits above the element instrumentation. The services layer uses its own
monitoring tools and techniques for measuring and tracking service-level metrics.
Integrating the information from both layers is discussed in Chapter 5.

Instrumentation Trends
Instrumentation for tracking service behaviors is continuing its evolution. There are several
ways of leveraging the instrumentation after the basic system is installed.

Adaptability
Adding adaptable measuring strategies reduces loads and adjusts the granularity as needed.
There are minimal changes while service delivery is operating without problems. However,
as problems arise, the instrumentation can shift into other modes. If an active collector
detects a service outage, it can shorten the interval it uses for virtual transactions to measure
the duration of the outage. After service availability is restored, it uses a longer time
interval.


Collaboration
Collectors are beginning to collaborate with each other as well as with external
management applications. Measurements between pairs of collectors help determine
overall compliance and simplify problem isolation (see the sidebar in this chapter for more
information).

Tighter Linkage for Passive and Active Collection


Within virtual transactions, the act of mapping to actual transactions is known as fidelity,
and it is a limitation of active measurement techniques. High-fidelity measurements increase in value because the virtual transactions reflect actual behavior. For example, a
virtual transaction using a single database query will not shed much light on the actual
behavior of a transaction using multiple databases.
The value of passive collectors in consumer-based network access systems (such as those
used for desktops and wireless devices) is in the depth and breadth of the coverage. Each
application and subtransaction can be captured. Aggregating this information gives detailed
activity breakdowns that can be used in selecting the appropriate virtual transactions to use.
Segmenting the information by location, customers, or other criteria offers higher fidelity
to customer activities.

Summary
Measuring service quality and determining compliance with SLAs are fundamental goals
of instrumentation. Careful selection of demarcation points places intelligent collectors at
the proper points to gather information on service behavior and quality.
Active techniques offer proactive problem detection and consistent baseline measurements.
They are coupled with widely distributed passive collectors for thorough coverage.
Trip wires and time slices provide real-time notifications and solid data for planning and
provisioning.
Each infrastructure involved in service management has its own specific instrumentation needs. These are discussed in the chapters covering each infrastructure: Chapter 8, "Managing the Application Infrastructure"; Chapter 9, "Managing the Server Infrastructure"; and Chapter 10, "Managing the Transport Infrastructure."


CHAPTER 5

Event Management
Chapter 4, "Instrumentation," describes how service behaviors are monitored to track compliance with service level metrics and to identify potential or actual service disruptions. Event management, which is the topic of this chapter, describes the different steps that transform a flood of raw alerts into a reduced set of events that require further action from the management system.
The instrumentation system collects raw data, such as response times or packet loss. Unfortunately, administrators often find raw data of low value for making sound service management decisions. A response-time measurement might have little meaning without further analysis; for example, is it an isolated incident or part of a growing number of slow transactions? Although the measurement is compared to thresholds and baselines to generate an alert, more context is needed to determine how important any measurement actually is. The event management functions refine raw instrumentation measurements into those that require further attention from the management system. Administrators spend their time on important problems and make better decisions with refined information.
This chapter is a complement to Chapter 4, and both should be considered part of the same
process: helping you turn raw data into usable information and indicating the next steps
when a response to a service disruption is detected and reported.
Note that one of the difficulties in organizing material for this chapter is that vendors offer a range of packaging options. For instance, some of the functions discussed in this chapter, such as artifact reduction, can also be embedded in the intelligent collectors discussed in Chapter 4. In addition, other products are designed specifically for handling events. Thus, be aware that just because event management is discussed separately doesn't mean that it must be packaged as a separate product.
Regardless of the sometimes-vague dividing lines between instrumentation and event
management, the goal of this chapter is to examine the range of event management
functions and their contribution to the process of creating usable, action-oriented
information. Specifically, the focus is on the following:

- An overview of event management
- The basic event management functions
- The examination of a market-leading event manager


Event Management Overview


Event management defines a set of functions that are applied to the alert stream to identify
those alerts associated with actual or potential service disruptions. Further actions are
initiated after the event manager identifies the important alerts.
Alerts arrive in different forms, as determined by the collectors and specific product implementations. They include the following:

- Simple Network Management Protocol (SNMP) traps, which are sent mainly by network infrastructure elements, although elements in other infrastructures (such as the server infrastructure) also use SNMP
- Alerts from passive and active collectors using vendor-specific protocols
- Alerts generated by the management system itself
- Alerts from other elements using vendor-specific protocols
- Alerts triggered by the arrival and transformation of an extensible markup language (XML) document

The following subsections discuss how alerts are triggered, the need to transport alerts
reliably from their origin to the central event manager, and the need for the event manager
to handle the fact that some alerts are more important than others.

Alert Triggers
Baselines, thresholds, and internal failures are the usual triggers of alerts from element
instrumentation. Threshold alerts can be triggered when a threshold is crossed in either
direction (see Figure 4-2 in Chapter 4). Baselines represent a normal operating range of
measurements. Alerts are triggered when the monitored variable is moving toward the edge
of the envelope or has moved outside that envelope.
Alerts are also generated when there are internal failures, such as with a disc system, an
application, or an interface. An element might be able to report certain failures itself. For
example, a server that is still operating after an application fails can easily report that
failure.
Other failure alerts are indirect, often because of a failure that prevented the element from
reporting its own problems. For example, a central instrumentation management portal
monitors a set of collectors with a heartbeat exchange. If a collector doesn't respond, an
alert is generated, noting that it might have failed.
Internally generated alerts are used to integrate and coordinate event management
operations and to activate other management system components. As seen in Figure 5-1,
the event manager activates other functional areas, such as fault- or performance-management tools. At this point, the event management system has organized an alert
stream into a set of actions based on the alerts that are generated.

Figure 5-1  Event Management (figure: raw alerts from instrumentation management, aggregation, and filtering pass through artifact reduction, volume reduction, and filtering; compound metrics and correlation; business impact and prioritization; and process activation and generation of internal alerts. The resulting actions feed real-time operations, reporting and billing, and policy-based management; SLA data and internal alerts are also produced.)

The alert volume can be substantial; several large organizations that I have spoken with
recently have tens of thousands of element alerts daily, while the services alarms are in the
mid-hundreds. There are usually more element alerts than service alerts because many
element problems do not affect service quality when there is sufficient redundancy.
Automated processing of alerts is needed to identify those requiring immediate action from
the management system. High alert volumes and more complex sorting criteria can
overwhelm human staff.

Reliable Alert Transport


There are important reliability issues that must be considered when handling the reception
of alerts. Many systems, such as SNMP, use unreliable transport methods; there's no
assurance that alerts will arrive at the central event manager. If the event manager is
regularly polling for information or for the presence of a heartbeat, missing poll responses
will be noted and the next poll may get the information. If the event manager is simply
listening for alerts, without checking for problems with the instrumentation, it won't know
that an alert has been lost.


One common solution to the problem of missing alerts is to have remotely located
aggregators that are in proximity to the source of the alerts. If they use a reasonably error-free communications channel to connect to the alert sources, the aggregators will receive
almost all alerts correctly.
To push measurement alert information reliably into the enterprise's event manager from the remote aggregators, it is necessary to avoid using unreliable transport. This can be difficult if the alerting system uses industry-standard SNMP traps; normal SNMP uses
unreliable transport.
One way of reliably transporting SNMP is illustrated by the web-performance measurement service, Keynote Systems, which uses industry-standard SNMP traps to push its measurement alert information into event managers from Tivoli, HP OpenView, Micromuse Netcool, and other major management systems. Keynote places a small appliance next to the enterprise's event manager, inside the enterprise's firewall. That appliance connects across the Internet to the Keynote system using once-a-minute, outgoing, secure, reliable connections. Retrieved alerts are then signaled with SNMP traps from the Keynote appliance to the management system that's only a few feet away; there's little chance of losing the alert.
Keynote also offers direct plug-in into some event managers, such as Unicenter/TNG; in those cases, software is installed into the event manager to communicate directly with the Keynote systems using reliable, secure transport and XML. Either method (local appliance or direct plug-in) can be used to improve the reliability of alert transport.

Alert Management
The alert stream contains information of differing value for managing services. Not every
alert requires further attention. As an example, an alert reporting a slow response time for
a single customer might not indicate a problem by itself. A single slow transaction can be
caused by temporary server congestion, lost packets, or a routing change. No further
attention is needed as long as the percentage of completed transactions with acceptable
response times is very high.
The managed environment is highly dynamic and the instrumentation can create artifacts,
which are false indications of the actual situation (false positives); they need to be
eliminated before a false diagnosis causes further disruptions to staff and operations. In
fact, responding to artifacts wastes staff time because subsequent measurements usually
reveal no problem at all.
The event manager organizes the remainder of the alerts after the artifacts have been
removed from consideration. There are ranges of actions depending on the overall
operational context. For example, a measurement that exceeds a warning threshold requires
different attention than a measurement indicating noncompliance with a Service Level
Agreement (SLA).


Alerts also have different business impacts that affect subsequent management decisions.
A disruption that affects revenues and business relationships should draw more attention
than a slight slowing of internal e-mail.
Refer again to Figure 5-1, which illustrates event management functions. Starting at the top
are the event management inputs, which are either internally generated alerts or those from
the service or element instrumentation.
The event management functions are shown within the rectangle in the middle. Functions
such as artifact reduction, filtering, and correlation are applied to any alert.
The event management system identifies the events that require further action. The next
step depends on the event. Some events activate a fault management tool while others
launch a performance management tool. Events can also trigger a billing subsystem, page
an administrator, generate a report, or initiate other functions.

Basic Event Management Functions: Reducing the Noise and Boosting the Signal
Event management must deal with a flood of alerts and select those that actually matter. A
dynamic environment with multiple instrumentation points generates different views of the
same behaviors. As an example, consider a case where a key database server fails. The
database server may have sent an alarm before it crashed, but there is no guarantee that it
did so. Active collectors (which were discussed in Chapter 4), or probes, also report a
failure after they execute the next virtual transaction against that server.
The active measurements offer an administrator the assurance of independently detecting
the problem; however, this approach also generates multiple reports of the same failure.
Subsequent measurements will generate another flurry of alarms if the server is still down.
Customers wanting that service will trigger more alerts when they cannot connect to the
server. Hundreds of alerts can arrive within a small time interval. Effective event
management picks the server problem out as a single occurrence for further treatment.
The following sections discuss the various techniques that remove extraneous information,
add value to the remainder, and determine the subsequent actions.
Table 5-1 summarizes the event management functions that are discussed in subsequent
subsections. It is important to remember that some of these functions might be embedded
in instrumentation that you will encounter.


Table 5-1  The Basic Event Management Functions

Function            Value
Volume reduction    Prevents data overload by using roll-up, de-duplication, and intelligent monitoring
Artifact reduction  Eliminates wasted effort and time by using verification, filtering of single alerts, and correlation of multiple alerts
Business impacts    Improves decision making by protecting critical services
Prioritization      Focuses on the most important situations
Activation          Automates responses, speeds resolution, and improves accuracy
Coordination        Integrates alerts and builds automated processes

Although vendor-packaging choices blur lines between instrumentation and event management, the important concept to understand is that the functionality is needed regardless of the specific packaging. The early enterprise management platforms, such as those offered by Hewlett-Packard, Tivoli, and Computer Associates, positioned themselves
those offered by Hewlett-Packard, Tivoli, and Computer Associates, positioned themselves
as a single point for event management; they consequently have a range of functions for
processing alerts. Other companies, such as BMC Software and Micromuse, have added
similar capabilities, while smaller vendors may offer a limited set of event management
functions.

Volume Reduction
Simply reducing the alert volume can be very helpful. Hundreds of alerts reporting the same
situation can be generated. However, only a single alert is necessary to note the database
server failure and to start recovery procedures.
There are different methods of reducing the alert volume: roll-up, de-duplication, and
intelligent monitoring.

Roll-Up Method
Hierarchical collector structures reduce alert volumes by rolling them up from one level to
the next. The aggregators described in Chapter 4 are a natural place for implementing this
alert compression. In Figure 5-2, three collectors are using virtual transactions against the
same server. If the server is congested, all collectors forward a server slow alert to the
aggregator. The aggregator simply passes a single server slow alert downward to the event
manager or another level in the instrumentation hierarchy.

Figure 5-2  Using Roll-Up to Reduce Alert Volume (figure: multiple slow-server alerts reach an aggregator, which passes a single slow-server alert to the event manager)

De-duplication
A failure may generate a multitude of virtually identical alarms and events that can be
consolidated into one alarm by de-duplication. For example, a router failure may spawn a
large number of alarms about dropped connections. De-duplication adds information to a
single event, indicating a larger number of similar alarms.
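A rough sketch of de-duplication (the alarm fields and the grouping key are assumptions): identical alarms are collapsed into one event that carries a count of how many were seen.

# De-duplication sketch: collapse virtually identical alarms into one event.
from collections import defaultdict

def deduplicate(alarms):
    """Group alarms by (source, type) and emit one event per group with a count."""
    groups = defaultdict(list)
    for alarm in alarms:
        groups[(alarm["source"], alarm["type"])].append(alarm)
    return [
        {"source": src, "type": typ, "count": len(items), "first_seen": items[0]["time"]}
        for (src, typ), items in groups.items()
    ]

alarms = [
    {"source": "router-7", "type": "connection_dropped", "time": "10:02:01"},
    {"source": "router-7", "type": "connection_dropped", "time": "10:02:02"},
    {"source": "router-7", "type": "connection_dropped", "time": "10:02:02"},
]
print(deduplicate(alarms))   # one event with count=3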

Intelligent Monitoring
Adaptive instrumentation (Chapter 4) provides the flexibility for intelligently monitoring
situations, and it thereby reduces alert volumes. Consider the example at the beginning of
this section. Active collectors have reported a database server failure. It may take some time
for the failover procedures to complete and even longer to resolve the problem. If the active
collectors continue monitoring, they only add to network and alert loads without adding any
new information.
Adaptive instrumentation helps the situation by continuing measurements and not
generating any further alerts until the virtual transaction indicates that service is restored.
A different alert then informs the management system that the service is again healthy. The
elapsed time between the failure and restoration alerts measures the outage.
Additional reduction is possible with deeper knowledge of the service topology.
Dependencies can indicate that for some failures, downstream monitoring will not be
productive. For example, service behavior can be monitored in steps or in smaller parts of

079x_01i.book Page 88 Monday, December 15, 2003 3:04 PM

88

Chapter 5: Event Management

the entire transaction. If a step fails, monitoring those that follow do not yield any useful
information until the failed step is repaired. The same approach of monitoring, but not
generating, new alerts is used to detect the restoration of the service step.

Artifact Reduction
There are techniques to reduce the raw alert volume by eliminating artifacts, which are
measurements that falsely imply an important problem where none actually exists.
Response time, for example, could be slowed while network routers are recalculating
routing options after a failure or topology change. The next transaction has satisfactory
response time after the routing system has stabilized. There is no value in notifying the
transaction manager of this artifact. There is nothing to chase and correct because the
transient behavior of the routing system has ceased.
A transaction could also be lost or timed out while a server in a tier fails and is replaced.
Specic transactions might be lost in the failed server, but after the replacement is
operating, operations resume at satisfactory levels.
A similar situation arises when an occasional lost packet triggers an alert because a
transaction failed or timed out. Further checking, however, usually nds operations
proceeding within the normal range of behaviors.
Large numbers of artifacts can consume large amounts of staff time and divert effort from
other tasks. It is difficult for humans to identify all the artifacts and ignore them when
appropriate to do so.
There are approaches to help reduce the number of artifacts that slip through:

- Verification
- Filtering
- Correlation

The text discusses each in turn.

Verification
Quickly verifying that an incoming alert is reporting an actual problem is an effective first
step in eliminating artifacts. For example, an active collector can be used to run a
transaction that has been reported as noncompliant. The active measurement establishes
whether the problem persists and is repeatable; if it is, further attention might be warranted.
The initial measurement is treated as an artifact if the test doesn't reveal a problem.
Using a repeat-failures filter for simple thresholds can help discriminate noise from real failure conditions by requiring that several successive measurements exceed the threshold before an alert is issued. For instance, you can stipulate that it will take 10 minutes to forward an alert if the interval between virtual transactions is 5 minutes and the rule is that
two repeated failures are needed. Using criteria for successive measurements frees the
system from responding to a single blip that later cannot be found.
A more proactive form of verification uses active measurements after the initial alert is received. Verification with an immediate series of virtual transactions clarifies the situation
quickly. Successive failures are detected in less than a minute rather than waiting for 10
minutes to attack the problem.
For example, consider a customer who is verifying the response time of a remotely hosted
service. Suppose an active measurement device periodically initiates a virtual transaction,
perhaps every 10 minutes. If one of these virtual transactions exceeds the specified response
time, the measurement device immediately sends a series of closely spaced virtual
transactions. If those complete successfully, no further action is necessary.
If the problem persists, the customer management system notifies the provider and begins tracking the provider response until the problem is resolved and the customer verifies that acceptable service levels are restored.
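A compact sketch of that verification step (the probe callable, retry count, spacing, and threshold are assumptions): on a suspect alert, run a short burst of closely spaced virtual transactions and treat the alert as persistent only if every retry also fails.

# Verification sketch: confirm an alert before escalating. Parameters are illustrative.
import time

def verify_alert(run_virtual_transaction, retries=3, spacing_seconds=10,
                 threshold_seconds=5.0):
    """Return True if the reported problem persists, False if it was an artifact."""
    failures = 0
    for _ in range(retries):
        elapsed = run_virtual_transaction()      # hypothetical measurement callable
        if elapsed > threshold_seconds:
            failures += 1
        time.sleep(spacing_seconds)
    return failures == retries                   # persistent only if every retry failed

# Usage: verify_alert(my_probe) settles the question within about half a minute
# instead of waiting for the next scheduled 10-minute measurement cycle.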
The collector sending the alert should be used for verification whenever possible because
the environment will be more consistent. Using a collector in a different location might
change the results, depending on the location of the problem. On the other hand, using
multiple monitors from multiple locations can provide some diagnostic triangulation;
noting that a problem is detected from one side of the network but not the other can aid in
problem isolation.

Filtering
Filtering is the application of rules to a single alert source over some time interval. Figure 5-3 illustrates the application of rules concerning measurements exceeding a specified response-time threshold. This is more sophisticated than a check for successive over-threshold measurements. This is an X out of Y process instead. That is, within a set of Y transactions, any X that is slow constitutes an alert. The figure illustrates a three-out-of-eight condition: any three over-threshold measurements out of eight will trigger an alert.
Note that these filtering rules require state to be maintained between measurements. They should be selectively applied to a small number of sources to avoid loading the event manager. Using simple counters places fewer processing demands on the event manager, but this is done at the expense of being less able to exploit more effective filtering rules.
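A minimal sketch of such an X-out-of-Y filter over a sliding window, using the figure's three-of-eight condition (the threshold value and the class shape are assumptions):

# X-out-of-Y filtering sketch: alert when any 3 of the last 8 measurements
# exceed the response-time threshold. Keeps per-source state between samples.
from collections import deque

class XofYFilter:
    def __init__(self, x=3, y=8, threshold_seconds=5.0):
        self.x, self.threshold = x, threshold_seconds
        self.window = deque(maxlen=y)            # sliding window of recent samples

    def add(self, response_time):
        """Record one measurement; return True when the trip wire should fire."""
        self.window.append(response_time > self.threshold)
        return sum(self.window) >= self.x

f = XofYFilter()
samples = [1.2, 6.1, 1.4, 1.3, 5.9, 1.1, 7.0, 1.2]     # three slow out of eight
print([f.add(s) for s in samples][-1])                 # True -> generate an alert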


Figure 5-3  Simple Filtering to Remove Transient Behaviors (figure: two response-time plots over a filtering window of transactions; three over-threshold measurements out of eight generate an alert, while one out of eight is suppressed as an artifact)

Correlation
Filtering tracks a single source over a period of time and eliminates artifacts as a result.
Correlation, in contrast, works with a number of alert sources simultaneously (or within
short intervals). As mentioned, some types of failures or disruptions trigger many additional
alerts. Correlation works with this flood of alerts and removes the secondary artifacts,
which are those alerts caused by another problem. For instance, the database server failure
results in many reports of failed transactions. Administrators will waste valuable time
looking at each service with a problem rather than addressing the cause of all the secondary
artifacts.
Correlation is more powerful than filtering because it identifies the most likely cause of a flurry of alerts. The accuracy speeds problem resolution and reduces staff disruption.
Correlation is also more complicated than filtering because it deals with multiple,
independent alert sources. Correlation depends on understanding the relationships among
various service elements. Essentially, it is the rule of cause and effect. (If you cannot reach
a router, you cannot reach the networks that are connected to it, for example.)
Building the appropriate information for a correlation engine is a challenge. Early
correlation engines, such as the Tivoli Enterprise Console and the Veritas NerveCenter,
were powerful, but they often became shelfware (software that wasn't used in production) because of their complexity. Increasing dynamism of the managed environment increased the staff burden because the rules required more frequent updating by experts. In the end,
organizations simply could not afford to use these powerful tools.
Correlation approaches use techniques such as time correlation, which is the examining of
(near) simultaneous alarms and determining if they are related. This often reveals a large
number of problems. For example, in one situation that I've seen, server performance would
suddenly degrade without any signs that the server itself had been changed. A simple time
correlation revealed that packet losses increased just before the server had performance
problems. It turned out that higher levels of lost packets were leading to timeouts, and
dropped connections were causing extra processing and resource conflicts in the server.
Another correlation approach is the matching of a problem signature. Experience indicates
that a certain set of alerts appearing at (nearly) the same time points to a specific type of
problem. The appropriate responses are activated after the problem signature is matched.
Matching signatures is effective, but it creates some drawbacks as well. One drawback is
creating the signatures. Management vendors usually do this by hiring experts to define the
signatures and the associated symptoms and cures. The labor and need for detailed
expertise means that most customers will not be able to build their own signatures for other
service management situations.
Signature matching is also more complex for the correlation engine. Parts of the signature
can arrive in any order and they can change the tentative diagnosis as more parts arrive. The
correlation engine needs to remember more states and use a time interval to build the
signature. (Signatures are discussed in Chapter 6, Real-Time Operations, in the context
of sophisticated real-time operations tools.)
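The core of signature matching can be sketched in a few lines of Python; the signatures and alert names below are invented, and the time-interval bookkeeping that accumulates the "recent" alerts is left out for brevity:

SIGNATURES = {
    "database server down": {"db-connect-timeout", "transaction-failed", "app-pool-exhausted"},
    "overloaded web tier":   {"http-5xx-spike", "cpu-high", "queue-depth-high"},
}

def match_signature(recent_alert_types):
    """Return the first known signature fully contained in the recent alerts, if any."""
    observed = set(recent_alert_types)
    for problem, required_alerts in SIGNATURES.items():
        if required_alerts <= observed:        # all parts have arrived, in any order
            return problem
    return None

print(match_signature(["transaction-failed", "db-connect-timeout",
                       "app-pool-exhausted", "cpu-high"]))
# -> database server down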

Business Impact: Integrating Technology and Services


Service managers are being called upon to make decisions that affect more than the technology they manage; they now directly affect their organization's capacity to generate revenues and communicate with partners and suppliers. Better service management decisions must incorporate more information about the business processes and the services that are using the infrastructures.

Detecting a potential or actual service disruption is only the first step. Determining the likely cause rapidly and accurately speeds restoration of service quality. Rapid problem isolation is simpler and faster when the elements associated with a specific service flow are known. They can be quickly probed for any abnormalities that warrant further attention from element management experts.

Service quality can also degrade even when there are no element failures. Rapid changes in loads and activities can introduce problems with resource allocation, temporary congestion, and other instabilities. The same mapping of services to elements enables rapid testing of the associated elements and identifies candidates for detailed analysis.


Conversely, if an element fails, an administrator needs to know which services are affected and what the business ramifications of those services are. A failure affecting a critical business service receives more attention than one that interrupts internal data backup.

Service managers can also use element instrumentation in a different way. Elements associated with key services can send alerts to the service manager. These alerts are informational because the service manager is not usually responsible for responding to element problems. The service manager is aware that changes are occurring even if no disruptions to key services are threatened yet. Several element failures affecting the same service would be another early warning mechanism.

Understanding the business impact of any alert shows administrators what is truly important to the business and enables them to make better decisions.
The subsequent discussion is further divided into the following subsections:

Using top-down and bottom-up approaches
Modeling a service
Care and feeding considerations

Top-Down and Bottom-Up Approaches


A top-down approach begins with the service and works toward the specific elements supporting it. Taking a top-down approach is usually easier when associating elements and services. Starting with the service perspective simplifies associating the supporting infrastructures and the individual elements within them. Some infrastructure elements, such as servers or databases, might be dedicated to specific services, making the associations straightforward. Other elements are shared and thus require further effort to specify.

The complementary bottom-up approach starts with elements and builds associations with the services using them. This is a much more difficult task because elements, such as network devices, do not usually capture or provide any information about the services flowing through them. Servers track the active processes, but building an integrated end-to-end representation is very hard to do.

Modeling a Service
Modeling is the most effective way of associating a service with the elements supporting
it. Models use the power inherent in object-based descriptions and tools (see the
accompanying sidebar for more information).


A Brief Object Overview


Objects are abstractions that represent real entities. Each instance of an object has attributes, methods, and notifications.

Each instance of an object has a set of attributes that describe the entity it models. A server, for example, may have attributes describing the type of processor(s), the operating system, the memory, and other technical parameters.

Attributes can also describe the relationships among objects. Attributes define logical connectivity among objects and thus define dependencies and membership in groups. Some relationships, such as a set of servers running a specific service, are fixed for relatively long periods of time. Others, such as the actual Internet route, are more dynamic and are calculated when they are actually needed.

Methods are the operations that objects can carry out. These operations include rebooting a server, deactivating a specific process, and accessing specific policy information. Methods are not as important as the attributes for building service models; they are used mainly for element management functions at this time.

Notifications define what an object can communicate to another object. They fill the role of alerts, for either an element or a service.

Building a service model is fairly straightforward if the proper ingredients are available. One of the most important is an object library that has the templates for all the common components, whether physical or logical. Objects representing physical components, such as servers, must be readily available and easily customized to represent specific instances of any physical entity. Objects also represent logical entities, such as an application or an external service. Some of these will be common, and others will be specific to each organization.

After the objects are defined, they must be related by setting the appropriate attributes in each object that define its dependencies on other objects. Thus, an application object is related to the server where it executes. The application is also related to other functional objects, such as a database system, a content delivery network, or an external search engine.
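As a hedged illustration of the idea (the classes and instance names here are invented, not drawn from any product), a service model can be reduced to objects whose relationship attributes record dependencies, which then lets a tool walk from a failed element to the affected services:

class ManagedObject:
    """An object with descriptive attributes and dependency relationships."""
    def __init__(self, name, **attributes):
        self.name = name
        self.attributes = attributes
        self.depends_on = []                   # relationship attributes

    def add_dependency(self, other):
        self.depends_on.append(other)

def impacted_by(failed, objects):
    """Return every object that depends, directly or indirectly, on the failed element."""
    impacted, changed = set(), True
    while changed:
        changed = False
        for obj in objects:
            if obj not in impacted and any(d is failed or d in impacted for d in obj.depends_on):
                impacted.add(obj)
                changed = True
    return impacted

server = ManagedObject("web-server-3", os="Linux", cpus=2)
database = ManagedObject("orders-db")
app = ManagedObject("order-entry")
app.add_dependency(server)
app.add_dependency(database)
storefront = ManagedObject("storefront-service")
storefront.add_dependency(app)

print(sorted(o.name for o in impacted_by(server, [server, database, app, storefront])))
# -> ['order-entry', 'storefront-service']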

Care and Feeding Considerations


Unfortunately, much of the association between services and elements is created only through manual methods. Administrators must specify the associations themselves, with some accompanying burdens.


Building these associations consumes staff time and is an ongoing challenge as environments continue to change and new services are introduced. A large online auction business, for example, introduces several new services each week, and additional time is needed to prepare the proper management views for each.

Maintaining the accuracy of these associations is an ongoing drain on staff in dynamic environments with rapid shifts in resources to match demands. Additional servers might be brought online as demand for a specific service grows. Updating the information stretches staff resources even further.

DIRIG Software PathFinder


DIRIG Software has taken the process of building and maintaining associations between services and elements further with its introduction of PathFinder. PathFinder is focused only on Java-based services, and it exploits that specialization. The first step is discovering the service components, which are the Enterprise JavaBeans, applets, or servlets. PathFinder then determines their relationships within services by scanning directories and other application-building information.

Agents on servers track the activation of the Java components, thus providing the binding of the logical components to the underlying server infrastructures. The logical dependencies between components, coupled with the physical association with the servers executing them, provide rich information for troubleshooting and isolating service problems.

A component failure is associated with the services it supports and with servers as well. Troubleshooters also know the calling sequence and other high-value information to resolve problems quickly while understanding the business ramifications.

The fluidity of PathFinder is welcome because agents detect the execution of the service components, sparing staff the odious task of trying to track changes and keep information current.

Prioritization
At this stage, an alert has been identified as an event requiring some response from the management system: notifying staff, sending reports to the appropriate staff, assigning staff to the event, or immediately activating automated procedures.

Not all events are of equal importance, however, and management teams must keep the business-critical services running smoothly. A report that online customers are abandoning their shopping carts in droves will be of immediate concern to a business manager; a report that an occasional catalog lookup is slow need not receive the same attention.


Prioritizing correctly requires collaboration between the management staff and its service customers (directly or indirectly). It is the customers who must determine and communicate the relative priority of their services to their providers (internal or external). Only when they have this clear indication of business priorities can providers assign the appropriate priorities to the associated alerts and events.
Providers must also assign other priorities to help them with their operations. The text has
already mentioned that customers might pay higher premiums or have stricter
noncompliance penalties. These concerns must also be incorporated into priority
assessments.
The event manager uses these assigned priorities to organize the event stream and guide responses more effectively. Staff members are directed to focus their attention on the most critical events.

The event manager also needs an aging mechanism so that low-priority events receive attention within a specified time frame rather than being completely starved out by higher-priority events. The aging mechanism automatically increases an event's priority if the event hasn't received attention within that time frame.
Prioritized events are usually placed in their appropriate queue, minimally offering a severe, moderate, or warning priority level. Some products offer much more granularity, with more priority levels to assign. Multiple thresholds can be used to trigger different responses depending on the severity of the alert.
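A minimal sketch of a priority queue with an aging mechanism follows; the severity names match the three levels just mentioned, while the aging interval and event text are invented:

import heapq, itertools, time

SEVERITY = {"severe": 0, "moderate": 1, "warning": 2}     # lower number is handled first

class EventQueue:
    """Priority queue in which unattended events age into higher priorities."""
    def __init__(self, age_after_seconds=900):
        self.age_after = age_after_seconds
        self._heap, self._seq = [], itertools.count()

    def add(self, event, severity):
        heapq.heappush(self._heap, [SEVERITY[severity], time.time(), next(self._seq), event])

    def age(self):
        now = time.time()
        for entry in self._heap:                           # promote events that have waited too long
            if entry[0] > 0 and now - entry[1] > self.age_after:
                entry[0] -= 1
                entry[1] = now                             # restart the aging clock
        heapq.heapify(self._heap)

    def pop(self):
        return heapq.heappop(self._heap)[3] if self._heap else None

queue = EventQueue()
queue.add("occasional catalog lookup is slow", "warning")
queue.add("shopping carts abandoned in droves", "severe")
queue.age()                                                # called periodically by a scheduler
print(queue.pop())                                         # -> the severe event first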
The event monitor interface is also a means of tracking workflow. The status of each event, such as when the event was received and its current state, is available. As events are cleared, they are appropriately marked, logged, and removed from the active queues.
Events must be organized in a variety of formats to meet different management needs. An
overall display might show all outstanding events by priority class. Other displays are
needed to show the affected customers, the specific SLAs, and the penalties that apply. Staff
can also modify the events, changing priorities or clearing them from the console.

Activation
Any event activates one or more management tools. The time constraints imposed by SLAs mandate automatic and rapid responses to problems while the management staff is being notified.

Registration is the process of linking events and management tools. Management tools are activated by the event manager when any events for which they have registered are detected. Specific tools register with the event manager for types and classes of events. A Cisco Systems device manager would register to receive any events generated by specific Cisco elements, for example.


Registration is usually accomplished with an application program interface (API) for the
event manager. Most products use a publish/subscribe approach, where a management tool
subscribes to certain events. The event manager publishes events, which activate the
subscribers. Multiple tools can also be activated by a single event.
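In sketch form (a generic illustration, not the API of any particular product), a publish/subscribe registry is little more than a mapping from event classes to the tools that registered for them:

from collections import defaultdict

class EventManager:
    """Minimal publish/subscribe registry linking event classes to management tools."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_class, tool):
        self._subscribers[event_class].append(tool)        # registration

    def publish(self, event_class, event):
        for tool in self._subscribers[event_class]:        # several tools can react to one event
            tool(event)

manager = EventManager()
manager.subscribe("cisco.device", lambda e: print("device manager handling:", e))
manager.subscribe("cisco.device", lambda e: print("impact tool assessing:", e))
manager.publish("cisco.device", {"element": "edge-router-1", "alert": "interface down"})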
The event manager currently uses local server functions to activate the specified management tool. In the future, XML documents will activate remote management tools.

Coordination
Event management can help integrate the services and technology management areas as well as integrate management tools into processes. It is a natural place for integration because element and service instrumentation are already converging there.

One key factor in a bigger role for event management is the use of internally generated alerts. Figure 5-4 offers an example of event management as an integrating factor. A server failure alert is generated (1 in the figure) and leads to the activation (2) of a server management tool. The server manager performs the detailed problem analysis and determines that a hardware failure has occurred and the server is not operational. The server manager then creates an internally generated alert (3), one that comes from the management tool rather than the managed environment, and the event manager then sends another alert that activates a tool that determines the impact on services (4).
Figure 5-4 Event Manager Using Alerts to Integrate and Sequence a Management Process (figure: (1) a server failure alert reaches the event manager; (2) the event manager activates the server element manager; (3) the server element manager returns an internal alert; (4) the event manager activates the service manager; (5) the service manager issues a further internal alert)


The impact assessment tool determines whether the server failure is having an impact on service quality, such as congesting the remaining servers and creating unacceptable response times. If that is the case, it sends another internally generated alert (5 in Figure 5-4) that activates provisioning tools and traffic redirection tools and that notifies the staff of a serious threat to service-level compliance.

Incorporating internal alerts from the management system adds more value because a single point receives all alerts and can place them in the proper context. One example of this function would be a performance manager sending an alert if the pool of standby servers falls below a defined threshold. The management staff then has this information and can prioritize it against other events to allocate efforts as effectively as possible.

Event management has a range of functions for sifting through an alert stream and picking out the alerts that need immediate attention from the management system. Some products have a full set of these functions, while other products use a more limited set. Other products distribute these functions and couple them more tightly to instrumentation.

A Market-Leading Event Manager: Micromuse


Micromuse was an early player in the event management market. It has expanded its
capabilities by adding a family of active probes, root-cause analysis tools, and business
impact analysis tools. It competes with BMC Software, Hewlett-Packard, Tivoli, and
Computer Associates, among others. It has established itself strongly with service
providers.
The discussion about Micromuse is divided into the following subsections:

Netcool product suite
Event management

Netcool Product Suite


The Netcool product suite depends on the strong event management of the Netcool
OMNIbus application, which is the basic driver of the Netcool suite of management tools.
An early OMNIbus design goal was abstracting basic element alert information, adding
value, and transforming it into operations information. The tools are associated with specific events and activated when they occur.
Central to Netcool OMNIbus is the Netcool ObjectServer, a memory-resident database that
holds operational information as well as the information that associates elements with the
services they support. Customers must still build the associations between services and
elements, which adds extra administrative burdens. However, the ObjectServer is designed
for fast access so that it does not become a choke point under high-request volumes.


A set of Netcool Probes and Monitors uses both passive and active techniques for collecting operational information and feeds the data to the ObjectServer. The Probes are passive collectors, implemented as software modules placed in monitored elements; the Monitors are separate systems carrying out active measurements. Together, the Probes and Monitors cover a wide variety of equipment, services, and transactions.

The Probes and Monitors also provide local processing to reduce the loads on the network and the ObjectServer. Local processing enables sophisticated filtering of the alarm streams, and it helps the solution scale for large managed environments.
The active monitors maintain complex rules that can calculate expressions with multiple
alarm sources. The value of the expression determines if an event is triggered. Both types
of collectors track cumulative behavior, such as the number of slow transactions over the
preceding two hours.
Instrumenting across the service infrastructures gives Netcool a definite advantage over many event management systems that focus only on network elements and alarms. Netcool, in contrast, uses complex expressions in the Monitors that are based on sources in different infrastructures. Administrators can zero in on service behaviors across the infrastructures.
A publish/subscribe interface is used to associate management tools and events. Any tool
registers for one or more events; when these events occur, the appropriate tool is launched
and responds as needed.
Administrators access the event management system from anywhere on the Internet. They
can navigate quickly through different views. Administrators can view event status, clear
events, change their priority, and generate reports on event-management activity.

Event Management
Micromuse offers a set of capabilities for reducing alarm volumes and delivering action-oriented information to the management tools. Some of these functions include de-duplication, normalization, automatic suppression of transient conditions, and presentation from a service perspective.
De-duplication consolidates related alarms and thereby provides a clearer picture for the
staff. Instead of dozens of individual alert messages, the operators see a single message
with associated information about the number of underlying alarms.
Normalization is a means of aligning the values from many different sources so that data can be compared accurately. As a case in point, consider the situation in which one collector measures utilization as an integer value representing a percentage, and another expresses utilization as a fraction. A management tool would not necessarily know that a reading of 0.9 actually represents higher utilization than a reading of 80. Normalization converts incoming information into consistent formats and data ranges. Any tool that subscribes to the ObjectServer can use the normalized data.
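A tiny sketch of the conversion (an invented example; the units and readings have nothing to do with the ObjectServer's actual schema):

def normalize_utilization(value, unit):
    """Convert utilization readings from different collectors to a common 0-100 scale."""
    if unit == "percent":              # e.g., 80 meaning 80 percent
        return float(value)
    if unit == "fraction":             # e.g., 0.9 meaning 90 percent
        return float(value) * 100.0
    raise ValueError("unknown unit: " + unit)

readings = [("server-a", 80, "percent"), ("server-b", 0.9, "fraction")]
normalized = {name: normalize_utilization(v, u) for name, v, u in readings}
print(max(normalized, key=normalized.get))   # server-b is correctly seen as the busier one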


Transient conditions are a burden for the management team. An arriving alert can consume staff time and effort: setting up new measurements, selecting troubleshooting tools, and launching a trouble ticket. By then the transient may have already died down, leaving the staff with nothing out of the ordinary to measure and analyze. The Netcool auto-clear function tracks these transients and removes them from the active list when they disappear by themselves.

The Netcool/Impact management tool adds the intelligence to transform element information into service-centric perspectives. Impact enables the staff to build service models and associate them with actual elements. Whenever an element fails, Impact determines which customers and services are affected. It uses other information to assign the event the proper priority so that the staff can direct its workload accordingly. Finally, Impact provides the problem resolution policy associated with the failed element.

Micromuse is adding more tools and integrating partner products into the Netcool suite as well. Its basic architecture has been copied extensively.

Summary
Event management takes in a large volume of alerts of varying value and produces a smaller volume of events that require further attention. It reduces the volume by using de-duplication, roll-up, filtering, and correlation to eliminate artifacts and identify the alarms that matter for maintaining service quality.
After events are identified, they are prioritized to add additional guidance to keep the staff focused on the most pressing technical or business problems. The last step is using the event manager to activate the management tools that have registered for specific events, thus completing the transformation of raw instrumentation into actions to be taken.
Event management will become a key integration point as internally generated alerts are
used to activate other tools and to manage process steps. The creation of sequenced tool
operations enables organizations to build more sophisticated automated management
processes.


CHAPTER 6

Real-Time Operations
The effectiveness and accuracy of real-time operations directly affect compliance with Service Level Agreements (SLAs). The time taken to detect a problem, determine its cause, and take corrective action is the time during which service quality is at risk and SLA violations may occur.

The demand for higher service quality shrinks the time allowance for responding to actual variations in service behavior. Every SLA has time-based metrics. For example, availability metrics are all about time: the available uptime each month, the total outage time (downtime), the time between outages, and the duration of each outage. Transaction completion times are another example of a time metric. The emerging metrics for measuring provider Quality of Service (QoS), such as service activation time or time to respond to trouble tickets, are also time-based.
Real-time operations management involves operations that deal with time-sensitive tasks
such as monitoring, analyzing, and responding to potential service disruptions. Real-time
operations management tools are a core part of most commercial system management
consoles.
To illustrate real-time operations management, consider Figure 6-1. As shown, the real-time operations manager receives alert input from the real-time event management module and time-sliced measurement input from the SLA statistics module. The real-time operations manager typically contains the sophisticated analysis tools that evaluate the incoming alerts and SLA measurements, identifying problems and proposing solutions.

NOTE

Alerts indicate failures, conditions in which SLA compliance is compromised, or situations where compliance may be compromised in the future. Alerts are often unpredictable; they respond to dynamic behavior that is influenced by many factors.

Time-sliced, periodic measurements are used for managing and reporting on service
quality. The real-time operations manager must continuously update its assessment of
system behavior and take further actions as needed. It may generate internal alerts if it
detects that the time-sliced measurements are straying over predetermined thresholds.


Figure 6-1 Real-Time Operations Architecture (figure: the real-time operations manager receives alert input from the real-time event management module and time-sliced measurement input from the SLA statistics module, and drives analysis and automated responses)
Automated responses can be activated directly by alerts or after other functions, such as root-cause analysis, have performed their tasks. The use of automated responses helps decrease both the time needed to handle a situation and the potential for errors and misjudgments. Many routine issues can be handled by automation; where issues cannot be mitigated automatically, automated analysis, even if only partial, can assist the human troubleshooters.

NOTE

Another way of compensating for relatively slow human troubleshooting speeds is to increase redundancy and capacity to handle failures and congestion while the analysis proceeds.

The function of real-time operations management is to help staff reduce Mean Time To Repair (MTTR) when incidents occur and to increase the Mean Time Between Failures (MTBF) whenever possible through proactive prediction of difficulties. There are three basic methods discussed in this chapter to achieve those goals:

Reactive management
Proactive management
Automated responses

These methods are discussed in order in this chapter, followed by illustrative descriptions of some major commercial real-time operations managers, including response managers for denial-of-service attacks.

Reactive Management
Reactive management will always be needed because, simply, failure happens. Devices unexpectedly fail, changes turn out to have unintended consequences, backhoes cut fiber, or entire electrical grids go down. Reactive management is the most demanding from a time perspective because administrators have no prior warning and still must assemble their resources and attack the problem as best they can.
Components of reaction time include the following:

Problem detection and verification, initiated by the instrumentation and refined by the management tools

Problem isolation, consisting of further analysis to identify and isolate the cause of any (potential) service disruption

Problem resolution, in which steps are taken to resolve the problem and restore service levels, if necessary

Most of the time involved in resolving a problem is usually spent in the problem isolation
phase, attempting to determine what is actually causing the problem. Increasingly complex
service environments add to the challenge because even the simplest delivery chains span
multiple elements and organizations. The instrumentation quickly detects threshold
violations, baseline drifts, and other warning conditions.
For an organization that's well prepared for reactive real-time management, many actions to resolve a problem (such as bringing an additional server into the mix, selecting an alternate network route, switching to another service provider, or redirecting traffic to a lightly loaded data center) can be completed quickly. However, even for the most agile organizations, the maximum leverage in reducing resolution time comes from reducing problem isolation time through speed and accuracy improvements.
Accelerating problem detection and verification, thereby increasing the speed with which validated alarms can be generated, buys time for the problem isolation process. Moreover, faster analysis means there is less lead time needed between the arrival of a warning and the assembly of enough information to take corrective action. Automation makes a significant contribution here: with less time needed to perform the analysis, the analytic engine can be used to predict events that will occur very soon, which makes its job simpler. Looking at trends to identify a problem that might occur within the next 15 minutes is much simpler than predicting behavior 6 hours into the future.
The following subsections describe two primary methods for decreasing problem isolation
time: triage and root-cause analysis. The descriptions are followed by discussions of how
to handle some common factors that complicate both methods.

Triage
Triage is the process of determining which part of the service delivery chain is the most likely source of a potential disruption. First, it's important to understand what triage does not do. It isn't diagnostic; it isn't focused on determining the precise technical explanation for a problem. Instead, it's a technique for very quickly identifying the organizational group or set of subsystems that's probably responsible for the problem.

Triage thereby saves problem isolation time in two ways. First, it ensures that the best-qualified group is identified and set to work on the problem as quickly as possible; second, it decreases finger-pointing time.

Identifying the best-qualified group to deal with the problem means finding those who are most likely to have the specialized tools and knowledge that can be used to solve the problem more quickly than if it were left with a generalist group.

Equally important, triage techniques are focused on drastically decreasing finger-pointing time, during which various groups try to avoid taking responsibility for a problem. Triage does that by presenting the responsible group with data that's sufficiently detailed and credible to convince them that it's truly their problem.
An example of a triage technique should clarify the difference between triage and detailed diagnosis. In this approach, called the white box technique, a simple web server (the white box) is installed, as shown in Figure 6-2, at the point where the enterprise web server systems connect to the Internet infrastructure. (The web server could be extremely inexpensive; it could be just an old PC running a flavor of Unix and the Apache web server system without any configuration, serving the default Apache home page or some other simple content.)


Figure 6-2 Triage Example System (figure: active virtual-transaction response-time measurements probe both the production web servers and an unloaded white box web server installed at the demarcation point; the router side is the network group's responsibility, and the load distributor, server farm, and production web servers are the server group's responsibility)

The white box web server in Figure 6-2 is located at the demarcation point between two
different organizational groups: the group responsible for the web server systems and the
group responsible for Internet connectivity. Active measurement instrumentation is located
outside the enterprise server room at the opposite end of the network. It is at end-user locations, and it measures both the enterprise's web pages and the web page on the white box. Because no end user knows about the existence of the white box, the white box has
almost no workload; it is used only by the measurement agents.
Figure 6-3 shows an example of some response time measurements from the system diagrammed in Figure 6-2. It's easy to see that when the event occurred, the unloaded white box server was unaffected. The chart can be created in a few seconds, and it is sufficient to convince the server group that it's almost certainly their responsibility. The root-cause reason for the problem is unknown; the chart is not diagnostic. However, the responsible group has almost certainly been correctly identified within a few seconds, and finger-pointing time has been cut to zero. The server group can then use its root-cause analysis tools or other specialized tools and knowledge to study the problem further.
Figure 6-3 Triage Example Measurements (figure: response-time chart over time comparing the production web servers with the unloaded white box; the production servers' response time spikes during the event while the white box remains flat)
Triage points can be established at many boundaries within a system, and different techniques can be used to establish those boundaries. Triage points can be placed at the demarcations between network and server groups, as shown in Figure 6-2, and they can also be placed just outside a firewall, at a load-distribution device, and at a specialized subgroup of web servers.

White boxes can be measured to create easy-to-understand differential measurements, but they're not always necessary. For example, consider an organization that measures the response time of a configuration screen on a load distribution device to see if there are any problems up to that point. Triage can also be performed by placing active measurement instrumentation at demarcation points, such as just outside a major customer's firewall, to see if response time from that point is acceptable.
Finally, detailed measurements can themselves be used for triage, although more technical knowledge is usually necessary. For example, an external agent can measure the time needed to establish a connection between itself and a file server, followed immediately by a measurement of the time needed to download a file from that server. It can probably be assumed that if file download time increases greatly without any corresponding increase in connection time, then there's a problem with the server, not with the network. (This is further discussed in Chapter 10, "Managing the Transport Infrastructure.")
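A rough sketch of that differential measurement, using only standard library calls, follows; the host, URL, baselines, and thresholds are placeholders, and a production agent would obviously repeat the probes and smooth the results:

import socket, time, urllib.request

def probe(host, url, port=80):
    """Return (connection_time, download_time) in seconds for a single probe."""
    start = time.time()
    socket.create_connection((host, port), timeout=10).close()
    connection_time = time.time() - start

    start = time.time()
    urllib.request.urlopen(url, timeout=30).read()
    download_time = time.time() - start
    return connection_time, download_time

def triage(connection_time, download_time, connection_baseline, download_baseline):
    # Download time up sharply while connection setup stays normal points at the server.
    if download_time > 3 * download_baseline and connection_time < 1.5 * connection_baseline:
        return "suspect the server"
    if connection_time > 3 * connection_baseline:
        return "suspect the network"
    return "inconclusive; probe further"

c, d = probe("www.example.com", "http://www.example.com/")
print(triage(c, d, connection_baseline=0.05, download_baseline=0.5))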
Such triage techniques are very useful in the heterogeneous, fluid world of web systems. Triage requires much less detailed knowledge of the internals of the various subsystems than does root-cause analysis. This is a great virtue when things change frequently and the internals of some systems are hidden. It also cuts time from the most time-intensive part of system management. However, it can be difficult to use triage for complete diagnostic analysis within a complex system; too many triage points, or demarcation points, are needed. For complex systems, true root-cause analysis tools are a necessary complement.

Root-Cause Analysis
Root-cause analysis tools can require considerable investment and configuration, but they can be surprisingly powerful and beneficial. They use a variety of approaches to organize and sift through inputs from many sources. These sources include raw and processed real-time instrumentation (trip-wires), historical (time-sliced) data, topologies, and policy information. They produce a likely cause more quickly and more accurately than staff-intensive analysis. Analysis tools are activated within a fraction of a second after an alert is generated and are already collecting data long before staff could respond to a pager or to e-mail. Because conditions can change quickly, and critical diagnostic evidence may not be preserved, compression of activation time is paid back with more effective analysis.

Root-cause analysis tools can be targeted at elements or services, or both. The earliest root-cause tools focused on a single infrastructure, usually the network; newer products are focusing on service performance spanning many infrastructures.

Speed Versus Accuracy


Many vendors in this part of the industry have emphasized the speed of their solutions: finding the root cause in a fraction of a second. Although speed is important, at the element level it takes second place to accuracy. Most infrastructures have sufficient redundancy and resilience so that an element failure rarely disrupts a service completely. That is why accuracy is more important. For example, identifying a specific interface on a network device, a specific server with a memory leak, an alternate route with added delays, or a database engine that is congested speeds resolution by addressing the right problem with the right tools and staff. The failure must be noted and marked for attention based upon policies and priorities.

Finding the root cause of a service disruption is more challenging than finding the root cause of an element problem because the cause can lie in any of several infrastructures. At the service level, root cause is more a matter of determining which part of the service delivery chain is the most likely source of a potential disruption. The triage technique discussed previously can be used here, and speed is very important because a service disruption is serious and compliance is threatened, along with the business that depends on the service. The root-cause tool must quickly pinpoint areas for further analysis. The easiest case is one in which measurements clearly implicate one infrastructure as the likely culprit. After a candidate infrastructure is identified, more specific tools are deployed to isolate the element(s) involved.


A difficult case arises when all the infrastructures are behaving within their normal operating envelopes. This is an opportunity for automated tools to collect as much information as possible for a staff member to use. The information might not be conclusive, but it can guide the staff member's next steps in an effective way.

Assembled information can include the following:

Performance trends for each infrastructure: Are any infrastructures trending toward the edges of their envelopes?

Historical comparisons: Are any infrastructures showing a significant change in historical patterns, even if they remain within the envelope? Is there a similar problem signature, which is a known pattern of measurements that has previously been linked to service failures or other difficulties?

Investigating element failure information: Is there a time correlation between the failure and the service disruption?

For instance, an end-to-end response problem could automatically result in the comparison of other infrastructure measurements to their historical precedents and could also result in the automatic initiation of new infrastructure measurements. Those automated investigations could fail to find any performance that exceeds thresholds. However, learning that the transport infrastructure delay has suddenly increased fivefold while all the other infrastructures are operating within their normal envelopes would indicate the most likely area for further investigation.
Linking service root-cause analysis to element root-cause analysis adds leverage to
accelerate the resolution process at both levels. Passing information and control between
the two domains speeds operations and keeps both teams informed and effective.

Case Study of Root-Cause Analysis


The following is an example from a company with which I recently spoke. They have stringent SLAs with their internal business units and external providers. They funded their own integration effort because minimizing service disruptions was so essential. The goal was to have two-way interaction between service-level and element-level root-cause analysis and to leverage each for the other. Some of this work is being implemented in phases as they learn from experience.

Consider first the case of an end-to-end virtual transaction whose response time is slowing, drifting toward greater degradation and eventual service disruption. This is a situation for the service root-cause tools because there is nothing specific for element-oriented tools to pursue yet. More detailed measurements at demarcation points suggest problems with the transport infrastructure, where delays are increasing while other components remain stable.


An alert is forwarded to the alarm manager, which in turn activates element tools and notifies staff. Information is also passed along to help the troubleshooting process at this time. It includes the following:

The virtual transaction(s) used

Indications of changes in the actual site used (Domain Name System [DNS]
redirection)

The end-to-end transport delay measurements


The actual route from the testing point to the server and from the server to the testing
point (these are usually different over the Internet)

The troubleshooters already have this information as they start to narrow down the cause,
identifying which parts of the services are behaving well and which parts require further
investigation.
Today's technology still leaves some manual steps in the hand-off, such as transferring information to the element root-cause tools. This is the step where time is lost and errors can be introduced. In the future, automation of isolation functions (discussed later in this chapter) might be used to simplify and accelerate the process. For example, an automatic script could exercise the route used by the agents running the measurement transactions, probing all the devices on the path and looking for any exceptional status or operating loads. It can have this information ready for a staff member or for another tool that can then investigate further.
The flow must be bidirectional. Element instrumentation may detect an element failure first. Redundancy keeps operations flowing while actions are taken to address the failure. The primary consideration is the set of services impacted by an element failure. When a service has been impacted, the management system might respond by monitoring more closely and setting thresholds for more sensitivity. Being able to understand the relationship between elements and the services depending on them allows administrators to prioritize tasks and ensure that critical services have the highest degree of redundancy.
Tools in the services domain may notice unwanted trends and correlate them with the element failure; sometimes a simple time correlation between the failure and the detection of the shift is all that is necessary. In the best case, both domains have information and can communicate effectively as they watch for and resolve developing problems. (This is where a lot of the new investment in management products is going: building management systems that can correlate symptoms from disparate elements and understand the impact on the multiple services and customers while helping operations staff prioritize and fix the problems. The InCharge system from System Management Arts, Inc. is an example of this trend.)


Complicating Factors
Brownouts and virtualized resources make the tasks of triage and root-cause analysis more difficult. These are discussed in the following subsections.

Brownouts
A brownout can be a difficult challenge to diagnose because all the elements are still operable, but performance suffers nonetheless. In contrast, hard (complete) failures are easier to resolve because a hard failure is a binary value: something works or it doesn't. There are tests that verify a failure and help identify the source of a problem.
It is harder to identify the likely cause of a brownout because there is no definite service failure that lends certainty to the search. Degrading performance can be caused by any of the following: a configuration error, high loads, or an underlying element failure that increases congestion in another part of the environment. Redundancy further complicates isolation of the cause of the brownout because underlying element failures may be hidden from service measurements by element redundancy.
The steps described for basic root-cause analysis still apply to brownout failures. Troubleshooters need all the information and context that they can assemble. Historical comparisons, indications of recent changes, and other data can help them understand the situation more clearly. Some patterns, such as a fixed percentage of all web requests taking an abnormally long time, strongly suggest the probable causes, especially if the same percentage of web servers has recently been upgraded to new software. Sophisticated root-cause analysis tools can learn to look for these patterns and thereby help diagnose brownout failures.

Virtualized Resources
Another complicating factor is introduced by the common system architecture of virtualizing resources, in which an entire set of similar resources appears to the end user as a single, virtual resource. Virtualization simplifies many tasks for the end user and the application developer; it's most common in storage systems, where rather than identify physical sectors on individual disks, storage software virtualizes the storage resources as volumes and file systems. In the webbed services customer's case, for example, geographic load distribution makes a set of distributed sites available under the same name. The geographic load balancer selects the site, and the end user automatically connects to the closest site without having to know the details.

NOTE

Application developers use object brokers to hide the details of locating and transforming
the objects that an application accesses.


Load balancing switches are another means of virtualizing; they hide a tier of servers
behind the switch. Requests are directed to the switch, which in turn allocates them to any
member of the set. Firewalls and hidden networks using Network Address Translation
(NAT) technology also create virtualization.
Unfortunately, from a root-cause perspective, virtualization obscures important details. A
synthetic measurement transaction might detect a performance shift because the
geographic load distributor has selected a different site with different transport and
transaction delays. Understanding that distinction helps the troubleshooting team save time.
They may determine that no further action is needed until the usual site is restored to service.
In fact, the redirection may have behaved entirely as expected, with service measurements
verifying the resilience of the environment. It might also lead to further investigation
because service levels must still be maintained even when these actions are taken.
To handle the complications of virtual resources, management tools must be able to
distinguish among the various hidden resources or, at least, must be able to suggest that the
problem lies somewhere within the virtual group. Instrumentation within the virtual group
can take measurements without having the individual group members' identities obscured by the virtualization process.
As suggested before, failure patterns can suggest the cause of the problem, even if the
virtualization layer cannot be penetrated. In addition, some IT organizations create special,
secret addresses for servers within a virtual group so that they can be measured externally
without revealing those addresses to the general end-user base, as in the white box triage
technique previously discussed.

The Value of Good Enough


Root-cause analysis is not yet a panacea; it is not foolproof and probably never will be. Nonetheless, a valuable tool can be good enough without being perfect; accurate enough, for example, that it reduces analysis time significantly. In other words, even when the analysis cannot pinpoint the exact cause, it might be close enough that staff efforts can focus and finish the process more efficiently. However, there is a downside to imperfection: a root-cause exercise that produces incorrect results may waste further time and disrupt staff by sending them in the wrong direction.

The real opportunity for leverage is automating the actual analysis: sorting through large amounts of information, comparing symptoms and test results, and eliminating potential causes. This is the area where humans are easily overwhelmed. Our role, at least for a little while yet, is to determine the rules and policies we want these tools to carry out.


Proactive Management
Because reactive management is challenging at best, administrators are attempting to be more proactive: identifying indications of potential service disruptions early enough to avert them entirely or to minimize their impact. Proactive management is highly desirable, so many products claim to deliver it. Reduced downtime is one yardstick for measuring these claims.

The Benefits of Lead Time


For proactive approaches to yield benefits, they have to provide enough lead time to allow
proactive intervention. The lead time must be longer than the time needed to identify the
cause and take corrective action; otherwise, the warning appears after the problem has
already resulted in service loss. Such a tardy warning may still be useful, as it may speed
correct diagnosis, but it would be better if the problem were recognized far enough in
advance that it could be avoided entirely. A key here is having tools that shorten the time to
produce useful information after a warning is generated.
Proactive analysis tools are therefore driven by continuously monitoring service quality and
element behavior. The raw data from the instrumentation system is used to drive a set of
real-time functions, such as baselining and predicting future behavior.

Baseline Monitoring
Baselines are continuously calculated from regular samples of behavioral variables. The baseline is usually calculated as an average or median of the load, a response time, and other monitored attributes. A range around the baseline defines the normal activity envelope, which is the normal range of the monitored variable through time. The actual behavior is compared to the envelope; if its value lies within the envelope, behavior is in its normal range.

The trend of the measured value is also important because you want to know whether behavior is likely to stay within the expected envelope. If the behavior is trending toward the edge of the envelope, that's more of a concern than if the trend is moving deeper within the envelope.

The baseline is an effective early warning mechanism; however, the warning usually lacks the specific information needed to take specific steps.
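A small sketch of the baseline-and-envelope idea follows (the window, width, and sample values are arbitrary); it computes a median baseline, builds an envelope around it, and gives a crude hint about whether the trend is drifting toward the edge:

from statistics import median, pstdev

def envelope_check(samples, latest, width=3.0):
    """Compare the latest measurement with an envelope built from recent samples."""
    baseline = median(samples)
    spread = pstdev(samples) or 1e-9
    upper, lower = baseline + width * spread, baseline - width * spread
    inside = lower <= latest <= upper
    # Crude trend hint: is the recent half of the window drifting above the baseline?
    drifting_up = median(samples[len(samples) // 2:]) > baseline
    return inside, drifting_up

response_times_ms = [210, 205, 220, 215, 230, 240, 255, 260]
print(envelope_check(response_times_ms, latest=310))   # -> (False, True): outside and trending up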


The Value of Predicting Behavior


Predicting behavior is essential for making the transition to proactive management. Good
predictive tools give administrators the lead time to take steps before a potential problem
becomes a service interruption. Predictive tools must be able to do the following:

Identify potential problems with sufficient accuracy that administrators will heed the
warnings and take action

Provide enough lead time for meaningful responses to be mounted

Automated Responses
Automated responses are another key real-time function; they are activated after an alarm
is detected or after another tool, such as root-cause analysis, has done its task.
Automation was initially introduced to reduce staff effort and errors by automatically
initiating corrective actions or collecting information for further staff attention. Taking over
repetitive tasks, such as making regularly scheduled measurements or setting
configurations for groups of elements, saves substantial labor and reduces the errors and
inconsistencies that occur with manual input.
Automation also speeds up processes because automated actions do not demand staff attention; they are activated as needed without waiting for permission. Speeding up processes is always valuable, but you reach a point where more speed may not give the leverage you seek. The new challenge is not just speeding up a simple task and continuously shrinking the window; it is using the same time window to make more complex, intelligent decisions.

Languages Used with Automated Responses


Automated responses were originally constructed using scripting languages such as PERL,
which allowed the quick creation of simple scripts. Other responses were created with
programs that could handle more complex situations, but they also took longer to create and
modify.
Many organizations are using Java for its easy implementation, widespread usage on many
types of computers, and the advantages of using a component-based approach. New
automated response mechanisms can be created quickly by reusing previously developed
software components.


A Case Study
To better understand how automated responses work, consider the set of actions needed
after a root-cause analysis has identified a failed element. The management team is more
effective when it is addressing the most critical problems and keeping business processes
functioning. Any task, such as addressing a failed element, must be prioritized against other
tasks demanding staff time and attention.
The impact of element failure must be assessed in real time to make the best decisions (by
management tools or by staff). In the example in Figure 6-4, you can see that distinct steps
are involved. Each step is discussed in the following subsections.
Figure 6-4 Example of an Automated Response (figure: a tiered topology of edge routers, firewalls, premises routers, LAN switches, and web servers; one firewall has failed, and its neighboring routers are affected)

Step 1: Assessing Local Impact


In this case study, a simple application traces the topology information from the failed element, which is the firewall marked in Figure 6-4 by a large X. It determines the neighboring elements from the connectivity information, checks each neighbor for remaining redundancy, and then discovers the impacted elements (marked with small Xs). Both edge routers and both premises routers are affected by the firewall failure.


Redundancy is temporarily lost in this case because each router has only one connection left to other parts of the physical infrastructure.

The topology information is supplied by the enterprise management platform. The application uses the published schema and application program interface (API) to collect the topology information it needs. Note that future plans could include conversion to the Common Information Model (CIM) specified by the Distributed Management Task Force (DMTF).
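A compact sketch of that traversal follows, using a hand-built adjacency table in place of the management platform's topology API; the element names loosely follow Figure 6-4 and are otherwise invented:

TOPOLOGY = {
    "firewall-1": ["edge-router-1", "edge-router-2", "premises-router-1", "premises-router-2"],
    "firewall-2": ["edge-router-1", "edge-router-2", "premises-router-1", "premises-router-2"],
    "edge-router-1": ["firewall-1", "firewall-2"],
    "edge-router-2": ["firewall-1", "firewall-2"],
    "premises-router-1": ["firewall-1", "firewall-2"],
    "premises-router-2": ["firewall-1", "firewall-2"],
}

def assess_local_impact(failed_element):
    """Return each neighbor of the failed element with the connections it has left."""
    at_risk = {}
    for neighbor in TOPOLOGY[failed_element]:
        remaining = [n for n in TOPOLOGY[neighbor] if n != failed_element]
        if len(remaining) <= 1:                  # one path left means redundancy is gone
            at_risk[neighbor] = remaining
    return at_risk

print(assess_local_impact("firewall-1"))
# every router is left with a single path, through firewall-2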

Step 2: Adjusting Thresholds


The neighboring elements now present different risks as failure and choke points, and decisions on adjusting their thresholds are needed. One issue is that the load on neighboring elements will suddenly jump because all the traffic has been funneled to the remaining routers as a result of the failure. Although failed elements are offline, thresholds for the good elements need to be adjusted to reflect the new system configuration and the resulting changes in loading levels; otherwise, a steady stream of alerts will be generated based on the old values.
Warning and severity levels might also need to be adjusted for more sensitivity because the neighboring elements are now the most sensitive points in the delivery chain. Measurement intervals might also be adjusted to track behavior more finely until the risk is eliminated.

Step 3: Assessing Headroom


The remaining routers will naturally take on a heavier load. The headroom is the difference between the current offered load and the maximum usable capacity of the routers. A larger amount of headroom reduces sensitivity to traffic variations, although adding headroom can be costly. The risk of nonlinear degradation grows when the capacity margins diminish.

Step 4: Taking Action


The next step is taking a set of actions to reduce and eliminate the risks of further disruptions. The actions involved in this step are as follows:

1 Adjust the thresholds to reflect the new system configurations.

2 Use the capacity assessment as a decision point; inform the global load balancing system to direct traffic away from the site if there is less than a predefined amount of headroom.

3 Check inventory for a replacement firewall or a computer system that can be loaded with the software, in the event that the firewall cannot be repaired in place in a timely fashion.

4 Assess the relative priority of the task and place it in the workflow system.


Step 5: Reporting
Real-time reports are generated for browser access by members of the operations team and the group responsible for configuring the automated systems. They include the following:

The failure report


Assessment of impact: redundancy and headroom
Report of threshold adjustments
Report of impacted services (which is sent to users, if it violates an SLA objective)

Building Automated Responses


Although automation has many benefits, it's easy to get carried away and create software so complex that it doesn't hold up under fire. One of the keys is keeping it simple and modular. That approach fits well with a component-based design that allows easy reuse and combination with other modules to build new functionality.

Picking Candidates for Automation


There are a large number of potential candidates for automation. Organizations with an abundance of staff and resources can attack all of them; most real-world shops must select their targets more carefully. Information from the help desk or trouble ticketing system can be used to help pick the processes and actions that deliver the greatest benefit for a specified investment of time and energy. Some metrics that point toward likely automation candidates are as follows:

The most frequently performed tasks (saves the most time)


The most challenging tasks (improves outcomes and decreases the need for training)
Tasks that take the longest to complete (reduces time to fix a problem and staff errors)
Tasks that are the most critical (improves service quality)
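The sketch below shows one simple way to turn trouble-ticket history into such a ranking. The record fields and the scoring formula are assumptions for illustration only, not the schema of any particular help desk product:

# Illustrative only: rank automation candidates from trouble-ticket history.
tickets = [
    {"task": "reset user password",   "count": 420, "avg_minutes": 8,  "error_rate": 0.01},
    {"task": "restart web process",   "count": 150, "avg_minutes": 15, "error_rate": 0.05},
    {"task": "reconfigure edge ACLs", "count": 12,  "avg_minutes": 90, "error_rate": 0.20},
]

def automation_score(t):
    # Total staff time consumed, weighted up when the task is error-prone.
    return t["count"] * t["avg_minutes"] * (1 + t["error_rate"])

for t in sorted(tickets, key=automation_score, reverse=True):
    print(f'{t["task"]:25s} score={automation_score(t):8.0f}')

Tasks that are both frequent and time-consuming float to the top, which matches the first and third metrics in the list above.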

Examples of Commercial Operations Managers


There's a growing set of new technologies with a range of approaches to finding the root cause of a service problem. Examples discussed here are tools from Tavve, ProactiveNet,
and Netuitive, along with two specialized tools from Arbor Networks and NetScaler that
handle Distributed Denial of Service (DDoS) attacks.
There are, of course, many other tools available, including offerings from Identify
Software, OC Systems, Micromuse, Computer Associates, Tivoli, and Mercury Interactive.


Tavve Software's EventWatch


Tavve Software has a root-cause tool that is focused on Layer 2 connectivity; it builds a
topology model with the connectivity relationships included.
When an alert is received, Tavve EventWatch first verifies that a performance or availability
problem exists. (In many cases, an element will have been temporarily marked down by
instrumentation just because a single probe of that element failed, possibly because a brief
spike in network congestion temporarily interfered with access.) EventWatch uses active
tests, or synthetic transactions, to measure the same response several times in a row. If it
determines that a problem actually exists, persisting beyond a brief period, the topology
model is then used to determine a root-cause.
The EventWatch software uses the topology model to traverse the paths from the active
collector to the target server. It has information about each element in the path, and it
conducts more detailed measurements to assess each element's health. A failure or overload
is detected quickly. If it discovers that all elements subordinate to a single key element have
failed and that the key element is also unavailable, it quickly reports failure of the key
element instead of producing a large number of individual, uncorrelated reports of
subordinate element failures.
EventWatch has the capability to detect topology changes and incorporate them into the
topology model. The logic for traversing the possible paths and checking each element is
straightforward and leads to quick root-cause detection.

ProactiveNet
ProactiveNet was an early player in the active monitoring and management of complex
e-business infrastructures. ProactiveNet bases its approach on statistical quality control
principles. It uses sampling and analysis to track behavioral shifts and identify root causes.
Sampling is more efficient than measuring everything all of the time. The key is selecting
the variables to sample; they should be ones whose changes are the most influential.
(Netuitive, discussed later, uses a similar approach with its strongly correlated variables.)
The sampling interval is a basic parameter that determines the granularity (every five
minutes, for example). The trade-off between granularity and volume is a major issue to
decide. Frequent samples will detect smaller shifts in behavior, but at the expense of
generating huge amounts of data to store, manage, and protect, on top of the additional
sampling traffic.
Sampling of service behavior establishes the operational envelope: the average, maximum,
and minimum values of a behavior (response time, utilization, or help desk calls, for
example) over a period of time. These baselines represent the ranges of normal behavior.
ProactiveNet builds intelligent thresholds from its baselines. It determines a practical
threshold value after the baseline is created and the maximum and minimum ranges are

determined. Thresholds are adjusted as the baseline changes, always providing an accurate
warning at any time without manual staff adjustments. Adjusting thresholds on the fly is a
critical feature because it accounts for the full range of motion in the environment and
reduces the likelihood of both false positives (reporting problems that don't exist) and false
negatives (failing to report real problems).
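To make the idea of an adaptive threshold concrete, the following sketch maintains a rolling window of samples and derives warning and critical values from the observed envelope. It is only an illustration of the general principle, not ProactiveNet's algorithm:

from collections import deque

class AdaptiveThreshold:
    """Illustrative rolling baseline; not any vendor's actual algorithm."""

    def __init__(self, window=288):          # e.g., one day of 5-minute samples
        self.samples = deque(maxlen=window)

    def add(self, value):
        self.samples.append(value)

    def thresholds(self, margin=0.20):
        """Warning and critical bounds derived from the current baseline."""
        baseline_max = max(self.samples)
        baseline_min = min(self.samples)
        spread = baseline_max - baseline_min
        return (baseline_max + margin * spread,        # warning
                baseline_max + 2 * margin * spread)    # critical

monitor = AdaptiveThreshold()
for sample in [110, 95, 130, 120, 105]:                # response times in ms
    monitor.add(sample)
print(monitor.thresholds())   # the thresholds track the observed envelope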
A large variety of other management tools leverage the collected data. They determine
thresholds, test for SLA compliance, isolate failures, or produce business metrics, for
example.
ProactiveNet monitors a variety of devices, applications, servers, and other infrastructure
components. The baselines and intelligent thresholds provide the warning and identify the
cause if resource usage shifts toward a potential service disruption.
These techniques are very powerful for dealing with single elements. However, web-based
services are composed of a highly interrelated set of elements distributed across multiple
organizations. More information is needed for tracking down service disruptions in a
complex infrastructure. ProactiveNet therefore provides a pre-built set of dependency
relationships for most common applications so that customers are spared the effort of
building them. This feature alone makes a significant contribution to reducing deployment
cycle time with ProactiveNet.
When searching for the root cause of a problem, ProactiveNet uses a sequential filtering
approach, progressively eliminating elements as the root cause until only those likely to be
a cause of the disruption remain. As shown in Figure 6-5, each filtering step removes a
portion of the remaining candidates. The steps are initiated by an alarm
reporting degraded performance.
The first filter discriminates between normal and abnormal behaviors. Processing is very
efficient because ProactiveNet has already established the adaptive resource baselines.
Only abnormal behaviors are selected, resulting in a significant reduction in candidates, on
the order of 200:1.
The second filter applies time-based correlation to the remaining candidates. The premise
at this stage is that simultaneous, unrelated baseline deviations are unlikely.
Time correlation associates a set of abnormal baselines with a single cause, as yet
undetermined.

Figure 6-5  ProactiveNet Root-Cause Filtering

Raw measurements pass through a sequence of filters:
Abnormality detection filter: passes outside-baseline measurements (approximately 1 in 200 measurements)
Statistical correlation filter: passes measurements that are time-correlated with the problem (approximately 1 in 20)
Isolation filter: uses known system dependencies to pass only relevant measurements (approximately 1 in 50)
Scoring and categorization: uses the remaining measurements to rank the possible causes of the problem

System dependencies are the focus of the third filtering stage. ProactiveNet uses the
relationship information stored for common transactions to further isolate the root cause.
The dependencies point back to the root cause because the transaction depends on these
resources. For example, given a sluggish transaction, anomalies in e-mail performance
metrics are set aside if the transaction doesn't depend on the e-mail service.
Finally, the filtered element data is examined in detail and ranked, if possible, by the
probability that it is the cause of the problem. It is then presented to the operators for
evaluation. With a ranked set of potential causes computed automatically, administrators
and troubleshooters can get to work applying professional judgment much more quickly
than if they had to work through the root-cause triage manually.
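The overall flow of the four stages can be captured in a few lines. The sketch below is an abstraction of the filtering sequence just described, with invented data structures; it is not ProactiveNet's implementation:

def filter_candidates(measurements, baselines, correlated_with_alarm, dependencies):
    """Progressively narrow raw measurements to likely root causes.

    measurements          -- dict of metric name -> current value
    baselines             -- dict of metric name -> (low, high) normal range
    correlated_with_alarm -- set of metrics whose deviation coincided in time
                             with the reported degradation
    dependencies          -- set of metrics the affected transaction depends on
    """
    # Filter 1: keep only abnormal behavior (outside the adaptive baseline).
    abnormal = {m for m, v in measurements.items()
                if not (baselines[m][0] <= v <= baselines[m][1])}
    # Filter 2: keep deviations that are time-correlated with the problem.
    correlated = abnormal & correlated_with_alarm
    # Filter 3: keep only metrics the degraded transaction actually depends on.
    relevant = correlated & dependencies
    # Final stage: rank by how far each metric strays from its normal range.
    def deviation(m):
        low, high = baselines[m]
        return abs(measurements[m] - high) / max(high - low, 1e-9)
    return sorted(relevant, key=deviation, reverse=True)

Each stage only narrows the candidate set, so the cost of the detailed ranking at the end is paid for a handful of metrics rather than for every measurement collected.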


Netuitive
Netuitive specializes in predicting future behavior. The company's offerings build
predictive models that are used to identify behaviors that may lie outside the range of
expected activity thresholds. The models are derived from correlation inputs in
combination with configurations set by subject matter experts.
Most services have a large number of parameters that characterize their behavior. Any root
cause or triage strategy needs to determine which variables are the most useful in
understanding and predicting behavior. Typically, there is an overabundance of variables from
which to choose, confounded by a lack of understanding of their relationships to each other.
Netuitive proceeds through a set of steps as it builds a predictive model. The tool collects
operational information, refines the variables, incorporates expert knowledge, and refines
the model.
The process begins by baselining the range of operational behavior, collecting operational
data for a 14-day period as a first step in modeling a new application. Netuitive captures
all the variables the application provides as it builds a representation of average, maximum,
and minimum values for the operational envelope.
Netuitive then identifies strongly correlated variables: those whose behavior is tightly
coupled to other variables that define the operational envelope. A change in one variable
will be reflected in other strongly correlated variables and is therefore a good predictor of
change. Conversely, tracking a variable with low correlation does not provide any
indication of its impact on overall application behavior.
The goal is to determine a small set of strongly correlated variables that are accurate
indicators of behavioral change. This enables the model to be as simple as possible, but
not so simple that it provides inaccurate predictions.
Netuitive also facilitates incorporation of input from experts that understand the modeled
application. These are usually members of the original development team or those who
have extensive practical experience with using the application. Such subject matter experts
provide the root-cause information, using their knowledge to link specific variable changes
with their likely causes.
The final stage in predictive model development is verifying the capabilities and usefulness
of the model. Anomalous events are introduced to validate that the model detects them and
provides the correct root-cause analysis for them. The assessment also tracks the number
of false alarms that are generated as a measure of the model's accuracy.
The application model is now ready for production use; it shows how the various
measurement inputs are correlated and how they can be used to predict performance
problems.


In production, the Netuitive Analytics Core System Engine calculates dynamic thresholds
for the model's variables using their workload and the time as the basis. The threshold values
defining the operational envelope are updated continuously.
The Netuitive system also calculates imputed values for the model's variables based on the
actual variable values and history. In other words, it evaluates the expected value if the
variables follow their normal relationships and the correlation between them holds.
Real-time alerts are generated when actual measurements differ from the imputed values
by an amount that indicates a possible problem. Predictive alerts indicate that a forecasted
value will exceed the forecasted baseline range. The alerting module also has a parameter
that defines the number of alerts that trigger an alarm to other management elements, such
as a management platform.
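A toy example makes the notion of an imputed value concrete. The sketch assumes a simple linear relationship between a metric and one strongly correlated driver variable, which is a deliberate simplification of whatever model Netuitive actually builds:

def imputed_value(driver_value, slope, intercept):
    """Expected value of a metric, given a strongly correlated driver metric."""
    return slope * driver_value + intercept

def check(actual, driver_value, slope, intercept, tolerance=0.25):
    """Raise a real-time alert if the actual value strays from the imputed one."""
    expected = imputed_value(driver_value, slope, intercept)
    deviation = abs(actual - expected) / max(abs(expected), 1e-9)
    return {"expected": expected, "deviation": deviation,
            "alert": deviation > tolerance}

# Example: response time has historically tracked concurrent sessions.
print(check(actual=950, driver_value=400, slope=1.5, intercept=120))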
Netuitive's approach offers a peek into the future. A possible drawback is keeping the
models current with application enhancements. Changes in the application may introduce
new correlations between variables. New behavior must also be incorporated after finding
the needed experts. This may represent an ongoing effort that should be balanced against the
gains of predictive tools.

Handling DDoS Attacks


DDoS attacks are another real-time phenomenon that must be dealt with because security
concerns continue to draw a high profile. Such attacks are designed to overwhelm a site
with a load that causes failures and brownouts. As a result, legitimate customers are
prevented from carrying out their normal activities, and business processes are interrupted
or halted.
DDoS attacks are characterized by large volumes of sham transactions solely intended to
overtax infrastructure elements. A SYN Flood attack is a classic example. The SYN bit in
an incoming packet header indicates an offer to establish a connection. Receipt of a header
with the SYN bit set initiates the establishment of a connection, but the attacker never
completes it. The server will eventually discard the pending connection, but a flood of such
connection attempts will cause overflows inside the server system before the pending connections can be discarded. The server may therefore be unable to handle legitimate
connection requests.
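One crude way to picture the detection problem: count half-open connections over a short interval and flag a surge. The sketch below is a simplified heuristic for illustration, not any vendor's detection algorithm:

from collections import Counter

def syn_flood_suspects(events, per_source_limit=200, total_limit=5000):
    """Flag a possible SYN flood from a window of connection events.

    events -- list of (source_ip, completed) tuples for the last interval,
              where completed is False for half-open connections
    """
    half_open = Counter(src for src, completed in events if not completed)
    total = sum(half_open.values())
    suspects = [src for src, n in half_open.items() if n > per_source_limit]
    return {"flood_suspected": total > total_limit or bool(suspects),
            "half_open_total": total,
            "suspect_sources": suspects}

Address spoofing, discussed shortly, is exactly what makes the per-source counts unreliable, which is why the aggregate volume carries more weight than any individual source.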
Other attacks establish hundreds or thousands of connections simultaneously by accessing
the same URL or application. These attacks are attempting to oversubscribe resources and
disrupt legitimate activity.
Attackers maintain their anonymity by using other systems to launch their DDoS attacks
against a targeted site. They scan the Internet for computer systems that have security flaws
and exploit them to insert their own software. The first thing the software does is hide traces
of its insertion. It then goes dormant, awaiting a directive to begin an attack. Attackers can
amass thousands of such sleeper agents, called zombies, and scale their attacks to take down


even the most robust Internet sites. The attacker unleashes the attack by sending the
zombies a directive specifying the target system and the attack parameters (some actually
select from a repertoire of attacks).
The sudden onslaught of a DDoS attack can quickly disrupt operations. There is often no
warning, such as a more gradual increase in loading might provide. The performance and
availability collapse can be very sudden if elements are operating with little headroom.

Traditional Defense Against DDoS Situations


The traditional defense in DDoS situations is to determine the address of the attackers and
then to create filters that network devices and firewalls use to screen traffic. This process is
time consuming, and service disruptions can last for extended periods of time (from hours
to several days to full recovery) while the culprits are located and screened out.
Unfortunately, attackers are aware of this defense, and they add another level of anonymity
by using IP spoofing: the attack packets do not use the zombie's address as the source, but
use a randomly created address instead. This makes tracing the packets to their source
difficult because the address is counterfeit.

Egress Filtering to Suppress IP Source Address Spoofing


Egress filtering could be an effective solution to the problem of inaccurate source addresses,
if it were more widely used. In egress filtering, routers are configured to discard any
outbound traffic that doesn't contain a legitimate source address from inside the router's
domain. This stops any attacking packets with spoofed addresses from leaving the origin
network.
Routers typically have the filtering capabilities to perform egress filtering. The challenge is
getting a critical mass of the organizations and service providers who own the routers to use it.
There is additional staff overhead for setting up and maintaining addressing ranges and
keeping the router filters current. This is a good example of a task that can, and should be,
completely automated.
Egress filtering protects each organization and provides a collaborative defense as well.
Providers and organizations deny their networks as origins for zombie attacks, and
protection for everyone increases as the adoption of egress filtering grows. Detecting
spoofed addresses in outbound packets is also a warning that the security of your
local systems has been compromised.
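The rule a border router applies is conceptually trivial. The sketch below uses Python's standard ipaddress module only to illustrate the logic; the prefixes shown are documentation examples, not real assignments:

from ipaddress import ip_address, ip_network

# The prefixes legitimately assigned to this site (example addresses only).
LOCAL_PREFIXES = [ip_network("192.0.2.0/24"), ip_network("198.51.100.0/24")]

def permit_outbound(source_ip):
    """Egress rule: forward only packets sourced from our own address space."""
    src = ip_address(source_ip)
    return any(src in prefix for prefix in LOCAL_PREFIXES)

print(permit_outbound("192.0.2.17"))    # True  - legitimate local source
print(permit_outbound("203.0.113.9"))   # False - spoofed source, discard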


The traditional defense against DDoS attacks entails these steps:

Step 1  Determine that an attack is underway, usually as service levels begin to erode.

Step 2  Capture as much traffic as possible for analysis; ideally, this means capturing traffic on every router port.

Step 3  Identify an attacker and the outer router interface those packets use.

Step 4  Gather more detailed information from your router and identify the router that forwarded the packet.

Step 5  Repeat Steps 1 through 4 until you reach the edge of your network.

Step 6  Determine the Internet service provider (ISP) that delivered the attack packets to you.

Step 7  Get the ISP's help in tracing the packet to the next ISP in the chain.

Step 8  Continue working with ISPs until you reach the origin network for the attacking packets.

Step 9  Get the origin ISP to set up filters to stop the packets from entering the rest of the Internet.


This is an incredibly tedious, labor-intensive, and frustrating process, especially when
multiple organizations must coordinate and cooperate quickly for the defense to succeed.
It is made more difficult because these steps must be applied for each of thousands of
attackers.
By constantly changing their addresses, attackers elude screening and force the
management team to constantly adjust their filters. Overwhelming the management team
creates a secondary denial of service: the management team must neglect other tasks while
defending against DDoS attacks.
Most management teams under attack are too overwhelmed to take the time and
effort to trace every individual attack. They must resort instead to faster and less precise
responses, such as blocking an entire subnet if an attacker's origin is traced to it. Although
this stops the attack, it also blocks any legitimate traffic originating from that subnet.
Organizations are clearly at the mercy of these attacks. The fatalists believe the only
defense is to avoid as much notice as possible, staying off the attackers' radar, in effect.
However, building a strong Internet presence and avoiding notice is an impossible
balancing act. Attackers will continue extending their advantage as long as they use
automated attacks against manual defenses.


Defense Through Redundancy and Buffering


Organizations with large online business activities use many types of redundancies to
protect themselves from disruptions of all types. Using several Internet providers is a hedge
against wide-scale disruptions that might impact the operations of a single provider.
Multiple data centers, physical redundancy, and global load balancing offer additional
buffers.
Unfortunately, attackers are limited only by the number of unsecured systems they can
penetrate and use as zombies. As long as that remains the case, a DDoS attack can be scaled
to overwhelm any finite set of resources. Organizations and providers cannot over-provision and hope they have enough spare capacity to cushion the effects of an attack
while they track its origins.
Redundancy is used primarily to support high availability and high performance.
Still, redundancy can help buffer some of the initial impact of an attack, buying additional
time for defensive measures. For example, the global load balancing system is often the first
place the loads are concentrated as new connection attempts begin to accelerate. It may be
able to spread the attacking traffic to different sites, cushioning its impact and buying time
with an early warning. Some load balancing vendors now include DDoS attack detection
and some attack defense capabilities in their products.

Automated Defenses
One of the continuing threats that DDoS attacks pose is the shock of a sudden traffic surge
that disrupts services for lengthy periods. All too often, rapidly deteriorating service quality
is the earliest indication that an attack is underway; in fact, such a degradation shows it is
already succeeding.
Earlier detection clearly helps the defense: detecting the early signs, or signatures, of an
impending attack and activating the appropriate defensive measures in time. A warning
from any source is helpful, but those that provide the longest lead time are the most
valuable. Longer lead times are attained with some trade-off in accuracy. A sudden surge
of 100,000 connection attempts within the last minute has a very high likelihood of being
a DDoS attack, but your lead time is very short.
Conversely, detecting a smaller perturbation that predicts an incipient attack increases your
lead time and options for countering the attack. However, the longer lead time might also
come with an occasional false alarm when some perturbation was not actually indicative of
an attack.
Two examples of defense solutions are described in the following sections to show the
range of automated approaches available.


Arbor Networks PeakFlow


Arbor Networks is representative of companies (Mazu Networks and others) that are using
sophisticated detection tools to drive attack analysis and defense automatically.
Figure 6-6  Arbor Networks PeakFlow Architecture (components shown: router, firewall, load distributor, web servers, PeakFlow collector, and Peak controller)

The Arbor PeakFlow system is a set of distributed components for defending against DDoS
attacks. Collectors gather statistics from Cisco and Juniper routers and from other network
components, such as switches. They also monitor routing update messages to follow
changes in the routing fabric. Periodic sampling is used to build a normal activity baseline
for each collector.
Collectors detect anomalous changes in the trafc patterns and characterize them for the
PeakFlow controllers. (This is where each vendor has their secret sauce: proprietary
algorithms for mining the changing operational patterns and extracting better predictions
of future problems.) The distributed PeakFlow collectors provide the detailed analysis that
identifies particular attack signatures. Knowing the type of attack helps direct the defensive
response more accurately. Remote collectors also capture new anomalies that may prove to
be new attack signatures. New signatures are added to provide faster diagnosis if the same
attack is attempted in the future.


A PeakFlow controller integrates the reports from a set of collectors and determines if a
DDoS attack is indicated in the anomalies. The controller traces the attack to its source and
constructs a set of defensive filters. The controller can then automatically load and activate
the defensive filters, or they can be initiated after staff inspection.
The Arbor Networks approach uses a centralized correlation engine to pick out attack
indicators from the collector anomaly reports. Centralized correlation aids accuracy
because the distributed nature of attacks may be obscured when looking at each point of
attack; the aggregate pattern is more revealing.

NetScaler Request Switches


NetScaler is an example of a DDoS defense system that is based on a load balancing
product. The NetScaler RS 6000 and RS 9000 products are front-end processors for web
servers. They create persistent connections with the end user and with the web servers. The
connections are primarily used for load management; thousands of end-user connections
are managed by NetScaler and condensed into far fewer connections to the web servers.
The connection-management technology can also be used as part of a DDoS defense. The
established connections to the end users are prioritized and maintained despite incoming
bursts of new traffic, some of which may be part of a DDoS attack. Distinguishing the traffic
streams ensures legitimate users will maintain access and will be able to carry out their
activities.
In contrast to the automated anomaly detectors used by Arbor Networks, the NetScaler
switches do not actively trace attackers to their origin. Instead, they maintain stable services
while reporting the attack to the management staff for further attention.

Organizational Policy for DDoS Defense


Responding to DDoS attacks by tracking the source and denying further access is a critical
part of the process, but other concerns must also be included. To handle such a threat
effectively, management teams must establish a comprehensive policy for dealing with
these potentially debilitating attacks, policies that deal with organizational function and
communications for contingencies such as DDoS. In this fashion, if and when an attack
happens, all participants have a clear picture of their respective roles and obligations.
Communications plans are the foundation. Notification policies must be defined and
implemented. Management planners must determine who is notified at the first indication
of an attack, what notification escalation procedures are invoked if the attack is verified, and
the specific staff skills that are needed.
Response policies must also be in place; management teams need clearly defined policy
steps. These procedures also need to be practiced so that they work as expected when a real


attack occurs. Some policy questions that need discussion and agreement prior to the attack
itself include the following:

What steps are taken to verify a suspected attack?

What is the set of graduated steps to take after an attacker is traced (usually selecting
the least disruptive first)?

What steps are taken while an attack is being verified?

What steps are taken when an attack is verified?

What escalation procedures are followed if the attack is not stopped within a specified
time interval?

Increasingly sophisticated DDoS attacks demand more sophisticated defenses. This area
merits continuous attention, as attackers will not rest once their current methods have been
defeated. Detectors and the policies they activate must be reviewed and refined to meet
evolving threats.

Summary
Real-time operations comprise a set of key functions that must operate within tight time
constraints. Information flows into the real-time operations system from the
instrumentation manager (the source of alert data) and from the SLA statistics modules,
which provide time-sliced measurements of performance. The real-time operations system
then processes the inputs in an attempt to improve MTBF, possibly by using proactive
techniques to predict possible failures. At the same time, it tries to assist the operations staff
in decreasing the MTTR when a failure actually occurs.
Reactive management, used to decrease MTTR, is based on the use of triage and root-cause
analysis. Triage tries to identify the responsible organization very quickly, in the hope that
it will be able to use its specialized tools and knowledge to fix the situation. Root-cause
analysis is a more detailed, technically intense process that tries to assist in the
detailed diagnosis of the situation.
Root-cause analysis uses sophisticated methods of filtering and correlating input data,
possibly combined with a model of the system being managed, to make reasonable
suggestions about the cause of a performance problem.
Active responses can then be used to handle routine problems or even predicted problems
so that system operators can concentrate on more complex issues.


CHAPTER 7

Policy-Based Management
Managing services in compliance with a Service Level Agreement (SLA) places more
demands on the management system and the staff. Stiffer penalties for noncompliance
increase the pressures to respond quickly and accurately even while the environment grows
more dynamic and complex. Often, more sophisticated automation than that described in
Chapter 6, Real-Time Operations, is needed to relieve and supplement overworked staff
members. Toward that end, this chapter covers the following:

Policy-based management
The need for policies
The policy architecture
Policy design
Examples of products

Policy-Based Management
Automation is a key attribute of an effective Service Level Management (SLM) system.
Stringent SLA compliance criteria reduce the time cushion that administrators might have
had. One of the compliance criteria mentioned in Chapter 2, Service Level Management,
is a demand for higher availability. If management staff members are left to deal with high
rates of change and growing complexity, the resolution times are unacceptable. Automated
management tasks are the only way to add speed and to deal with complexity.
Note, however, that automated tasks are also of concern to administrators because they are
taking actions and making changes at a faster rate than humans can track. A policy-based management system is an attempt to leverage automation while constraining actions.
Policies are sets of rules that define and constrain the actions the management system takes
in different situations. Table 7-1 shows the various levels of rules that might be involved in
a policy-based system. The rules are defined from the business level downward. Each rule
level supports the goals of the levels above and depends on lower levels to achieve those
goals.


Table 7-1  Multi-Level Rules

Level                        Focus
Business rules               Business goals, such as protecting revenue
Service rules                Defining service quality metrics for end-user services
Infrastructure rules         Defining service quality metrics for infrastructure services
Element rules                Defining quality metrics for elements
Management system rules      For internal tasks, such as monitoring

As an example, consider that infrastructure rules might involve establishing special routes
for low-latency network traffic or allocating more servers behind a load-balancing switch.
Those infrastructure rules depend in turn on the proper element configurations.
Management system rules govern internal management processes, such as monitoring.
Monitoring processes have targets, polling and heartbeat frequency, threshold values for
alerts, and steps to take when there are failures in the instrumentation system.
Many policies are activated when a potential or actual service disruption is passed along by
an alert from the real-time event manager. Other policies are activated in response to
changes in the SLA statistics.
Policy-based management has a learning curve. Simple policies save staff time and effort
and are usually implemented rst. More sophisticated policies are implemented as the
management team gains experience and learns how to extend policies more deeply into
business processes and to more areas in the managed environment.
Policy-based management is the systematic creation of policies that drive the management
system to maintain the highest service quality.

The Need for Policies


Different factors are driving service providers and enterprises to consider implementing
policy-based management systems. These factors include the following:

Staffing costs: Economic pressures to show strong bottom-line results work against
the need to hire expensive expert staff.

Growing complexity: The managed environment itself is becoming more complex
as more technologies and organizations are blended into online business processes.
Complexity soon outstrips the ability of staff to understand the situation and take the
appropriate action within the time limits imposed by an SLA.


Growing awareness: Service failures tend to draw media attention, risk customer
loss, and make corporate management unhappy because service delivery is more and
more tightly coupled to the bottom line.

The need for a knowledge repository: Policies capture experience and expertise.
This knowledge persists in the face of staff turnover and lack of immediate
availability.

Table 7-2 illustrates the major differences between the service and element policy domains.
For example, high levels of redundancy can absorb element failures without substantially
degrading service quality. More sophisticated policies are needed for managing a complex
and dynamic service environment.
Table 7-2  Summary of the Two Policy Domains

Element-Centric                       Service-Centric
Applied to single elements            Applied to services
Applied within one infrastructure     Applied across infrastructures
Relatively simple                     Relatively complex

The next two sections discuss management policies for elements and for entire services.

Management Policies for Elements


Early management policies were associated with managing elements within the various
infrastructures. They were vendor-specific for the most part and dealt with relatively simple
situations. For example, a network switch could have a policy that says, "If any port has a
utilization level greater than this threshold, send an alert to the element manager." A more
complex policy might add local actions, such as, "If the broadcast traffic on any port
exceeds the threshold, disable the port and send an alert."
Simple policies are not exclusively applied to network elements. A policy applied to
servers, for example, could specify that if a process dies, the management system should
send an alert and restart the process. If that fails, the management system should try three
more times and then reboot the server while sending another alert.
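Expressed as a sketch, with stand-in helpers for whatever actions the element manager really provides, that server policy looks something like this:

class Process:
    """Minimal stand-in for whatever object the element manager exposes."""
    def __init__(self, name, host):
        self.name, self.host, self.up = name, host, False
    def is_running(self):
        return self.up
    def restart(self):
        print(f"restarting {self.name}")        # a real agent would exec here

def send_alert(message):
    print("ALERT:", message)                     # stand-in for the event channel

def reboot_host(host):
    print(f"rebooting {host}")                   # stand-in for a real reboot

def process_policy(process, extra_attempts=3):
    """If a process dies: alert, restart, retry, and finally reboot the host."""
    if process.is_running():
        return
    send_alert(f"{process.name} died; attempting restart")
    for _ in range(1 + extra_attempts):          # first try plus three more
        process.restart()
        if process.is_running():
            return
    send_alert(f"{process.name} could not be restarted; rebooting {process.host}")
    reboot_host(process.host)

process_policy(Process("httpd", "web01"))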
Management policies for elements are good for speeding up many management responses
and for preventing staff mistakes. Management staff involvement is needed only if the
policy actions fail to restore service levels.
Most element management policies are configuration-centric because they define specific
configuration information for each element to satisfy higher-level rules. Different vendors
have their own unique ways of setting operational parameters, making this job even harder
if staff members are forced to remember all the vendor-specific details. Some companies,
such as MetaSolv Software, have created products that deal with products from a range of
vendors.


Policies offer large environments that have many devices, sites, and users a consistent way
of handling element configuration. This approach scales gracefully as the environment
grows. In addition, staff are freed from element-specific details and are involved only if a
policy fails.
While freeing administrators from a plethora of low-level decisions and reducing the
likelihood of error is attractive, it is important to remember that the best results are obtained
when policy management has unambiguous input. Elements that have very clear
management instrumentation and a limited set of configuration options are the best
candidates for applying automated policies.
Conversely, elements such as high-end operating systems, application servers, and other
parts of the service delivery architecture don't always expose their management
information clearly. This situation makes automated decisions less clear-cut. The multiple
layers of complexity inside some elements, such as servers, also make tuning them a
challenge. The pressure is on policy designers to incorporate those subtleties to get the most
from a policy-based approach.

Service-Centric Policies
This policy category deals with service-quality issues rather than element behavior. Such
policies are inherently more complex, and they can span several infrastructures. Most
importantly, service-centric policies are targeted as much toward achieving business aims
as maintaining technical performance. For example, policies are focused on minimizing
penalties or treating the affected customers in various ways.
Let's look at an example to clarify the differences between element- and service-centric
policies. Consider a provider using a tiered server farm to speed transaction flows. The
redundancy of the farm means that a single server failure does not immediately impact
service availability, but it begins to expose the site to performance problems if the
remaining servers are approaching their loading limits. This is an example of a service-centric policy, which focuses on maintaining adequate server capacity rather than
responding in detail to the failure of any server in the farm.
The policy actions taken when a server fails can include the following (a small sketch follows the list):

Check the other servers on the tier: Is their load after the failure still under the
defined threshold? As an example, consider that a set of four servers each running at
a 25 percent load transforms into three servers each with a 33 percent load.

Check the load: If the load is acceptable for now, send an alert and wait for the staff
to take further action. If the load is too high, increase the severity of the alert and page
the server manager.

Check the pool of standby servers: Determine the appropriate candidate to
replace the failed server, based on matching resources, required software, and
proximity to the failed server.

Check the number of servers in the standby pool: If they are depleted below a
threshold value, send a high-priority alert to the event management system.

Provide detailed reports: The reports should cover the steps taken and warn of
imminent problems. They should also generate a problem ticket for repair of the failed
server.
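A compressed illustration of the first few checks appears below. The function names, thresholds, and return format are invented for the sketch; a real policy engine would act on the decisions rather than return them:

def server_failure_policy(tier, standby_pool, load_threshold=0.70, pool_floor=2):
    """Illustrative service-centric policy for a server failure in a tier.

    tier         -- list of remaining server loads (fractions of capacity)
    standby_pool -- list of candidate replacement servers
    Returns a list of (action, detail) tuples rather than acting directly.
    """
    actions = []
    # Check the surviving tier: is the redistributed load still acceptable?
    peak = max(tier) if tier else 1.0
    if peak <= load_threshold:
        actions.append(("alert", f"tier load {peak:.0%}; awaiting staff action"))
    else:
        actions.append(("page", f"tier load {peak:.0%} exceeds {load_threshold:.0%}"))
    # Pick a replacement and warn if the standby pool is running low.
    if standby_pool:
        actions.append(("provision", standby_pool[0]))
    if len(standby_pool) <= pool_floor:
        actions.append(("alert", "standby pool depleted below threshold"))
    actions.append(("report", "open repair ticket for failed server"))
    return actions

# Four servers at 25 percent become three at roughly 33 percent.
print(server_failure_policy([0.33, 0.34, 0.33], ["standby-07", "standby-12"]))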

Other information can be used to increase the intelligence of the response. For instance,
there can be a check to see if there are imminent load changes. The alert could then provide
more information, such as whether the remaining servers in the tier are operating under
threshold now and whether the afternoon trafc surge is 30 minutes away. This gives the
staff better information and indicates that attention is needed to avoid compounding the
problems.

A Policy Architecture
This section covers the basic components comprising a policy management system. The
components include policy enforcers, repositories, and policy managers (after all,
something has to manage the management policies). Specific products might have different
combinations of components.

Policy Management Tools


Administrators use policy management tools to construct and modify their management
policies. Policies today are often generic in that they are applied across the entire base of
customers being served; however, policy systems must scale in sophistication to
accommodate a tailored set of policies for each customer. For example, each customer
organization could have policies that apply to each service used by the organization and
could also have policies that apply to specific individuals within the customer organization.
Increased granularity stresses the policy system because there is more information to
manage and access when policies are applied. As such, appropriate storage, information
management, and processing resources are needed for scaling the policy system.
Clear presentation and ease of use are critical here because a mistake introduced into a
policy can be burdensome to find and rectify. Administrators need to be able to enter as much information as possible with simple forms. Importing customer profiles and other policy
information saves a significant amount of staff time. Automated entry also speeds and
simplifies the process while reducing the chance of introducing errors.
Archiving and version control are also essential features to consider. The original policies
are modified as administrators use feedback to improve them or add more precise actions.
Administrators must be able to track each generation of a policy so that they can roll back
to an earlier version quickly.


Security features control access to the policy management tools. Only selected
administrators can create or modify the policies. The same restrictions are applied to the
policies for each customer; only those administrators responsible for a customer can set or
modify the appropriate policies.

Repository
The policy repository contains the policy information used by the other elements in the
policy system. The repository can be implemented in many ways: as a set of flat files, as a
database, or, more commonly, as a directory. Directories are winning favor because they
offer advantages, including the following:

Directories are already widely used for other functions, including configuration,
access control, and resource allocation. This offers the advantage of using existing
mechanisms rather than inventing something equivalent.

Directory technology is maturing with high reliability and scalability.


Distributed directories offer higher performance and resilience to failures.

Repositories are structured to provide independent policy domains for each customer
organization, their business units, and specific individuals within the organization.
The policy management tools aid administrators in creating and modifying the information
held in the policy repository. After the information is safely in the repository, it must be
available to the elements that actually act on it.

Policy Distribution
After they are in the repository, policies can be distributed to the appropriate elements.
There are three types of distribution models: pull, or component-centric; push, or
repository-centric; and a hybrid that combines elements of each. These approaches are
discussed in the following subsections.

The Pull (Component-Centric) Model


The pull model is component-centric because it allows a policy-management component to
gather the information it needs through a direct request to the repository. This saves large
amounts of staff time that otherwise would be required to load every component with all
the information it might ever need. Instead, a component queries for missing information
and proceeds when the policy repository makes it available.


Adding more intelligence to the repository enables finer tuning and control. For example,
distinctions between a customer accessing services from high-speed or low-speed
connections can be made and the appropriate policies can be applied to each case.
Consider further that time of day might be considered as a criterion for blocking or
permitting access to specic services in different time periods. Such a policy would be one
method of preventing undesired activities, such as bulk transfers or database mirroring,
from taking place when they would interfere with other activities.
A pull model evolves through on-demand delivery to the components. Over a period of
time, each component pulls together the unique information it needs for its specific
functions.
The drawback of this approach is that policies must remain fresh; policy information in a
component can become stale and of no value. In fact, it has negative value because an old
policy is in effect rather than its successor. As usual, there's a compromise to be made
between frequent update requests and using local caching to reduce traffic and delays.

The Push (Repository-Centric) Model


The push model is called repository-centric because the repository drives new policy
information to the components without waiting for their requests. The significant advantage
is that policies can be quickly changed across large numbers of components. This ensures
all components have current information and there are no errors introduced by having stale
information in some parts of the environment. The pull model cannot make the same
guarantee until aging times have elapsed.
The push model also entails the overhead of establishing and maintaining connections to
the components for the fastest push. This can be a substantial demand on resources, and it
must be balanced against the start-up times of establishing connections each time a new
piece of information is pushed.

Hybrid Distribution
Using both models takes advantage of the strengths of each. The pull model obtains the
latest information and can reduce traffic with local caching and aging. The push model is
used when large-scale rapid policy changes are necessary (for example, when a security
breach is detected). The policy system would use a push operation to change operations
very quickly and minimize the damage from an intrusion.
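A minimal sketch of the hybrid model, assuming a repository object with a fetch() method (an invented interface, not COPS or any product API): the client pulls and caches on demand, and the repository can push an update that overrides the cache immediately:

import time

class PolicyClient:
    """Illustrative hybrid distribution: pull with caching, plus pushed updates."""

    def __init__(self, repository, max_age=300):
        self.repository = repository        # object providing a fetch(name) method
        self.max_age = max_age              # seconds before a cached policy is stale
        self.cache = {}                     # name -> (policy, time fetched)

    def get(self, name):
        entry = self.cache.get(name)
        if entry and time.time() - entry[1] < self.max_age:
            return entry[0]                 # fresh enough: serve from cache
        policy = self.repository.fetch(name)   # pull on demand
        self.cache[name] = (policy, time.time())
        return policy

    def push_update(self, name, policy):
        """Called by the repository to force an immediate, large-scale change."""
        self.cache[name] = (policy, time.time())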


Enforcers
This is where the rubber meets the road: enforcers ensure that policies are properly carried
out. Enforcers are distributed throughout each infrastructure and carry out specific
functions. Some enforcers are part of management agents embedded in another element
while others, such as those in load-balancing switches, are major parts of the element's core
function. Examples of enforcers include the following:

Access devices at the network edge apply access policies that determine access to
services. These policies must be very granular, with policies that can be applied to
individuals and services.

Routers in the network core switch and forward traffic according to policies for
reserving bandwidth and forwarding traffic. This must also be granular for each
customer and each service.

Servers apply priority policies for scheduling their tasks. These policies can also vary
with activities. For example, a customer browsing a catalog might receive a lower
priority than one who is completing a purchase.

Load-balancing solutions apply policies to distribute flows among a set of attached


servers.

The geographic distribution system applies policies to select a site based on distance,
relative loads, or customer profiles.

Policy Design
Designing policies becomes more complicated as you move from the elements to the
services they support. Several infrastructures might be involved and the decisions made by
the policy system must reflect more conditions, each of which must be tested and analyzed.
One important tool for assessing policy robustness as policies are designed is called Failure
Modes and Effects Analysis (FMEA). It is commonly found in structured process and
design methods, such as Six Sigma.
Using a spreadsheet or a table of yellow stickies on a whiteboard, FMEA accounts for the
following (a small scoring example follows the list):

A description of where a policy must be applied (the failure mode)

A description of the impact (effect) of the failure mode
The severity of the failure mode, rated on a scale of 1 to 10
The causes of the failure mode
The frequency of the failure mode, rated on a scale of 1 to 10

The likelihood that the failure mode will be detected, rated on a scale of 1 to 10
A weighted Risk Priority Number (RPN), which is obtained by multiplying the three
ratings: severity, frequency, and likelihood of detection
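The arithmetic is straightforward. The sketch below computes the RPN from the three ratings; note that in common FMEA practice the detection rating is scored so that a high value means the failure mode is hard to detect:

def risk_priority_number(severity, frequency, detection):
    """RPN as used in FMEA: each factor is rated on a 1-10 scale."""
    for rating in (severity, frequency, detection):
        if not 1 <= rating <= 10:
            raise ValueError("FMEA ratings must fall between 1 and 10")
    return severity * frequency * detection

# A failure mode that is severe (8), fairly common (6), and hard to detect (7).
print(risk_priority_number(8, 6, 7))   # RPN = 336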

The power of the FMEA method rests on two foundations. First, it makes explicit the
catalog of policy inputs required to make the policy successful, and it helps make explicit
how a policy has to deal with them. Second, and more importantly, FMEA provides a
discussion framework for collaboration by experts from multiple domains. These domains
include applications, networks, servers, electricity, and so forth. If a policy can be
thoroughly accounted for in an FMEA, it's a good candidate for automation; if not, the
policy will likely not succeed.
Further discussion in this section is divided into the following subsections:

Policy hierarchy
Policy attributes
Policy auditing
Policy closure criteria
Policy testing

Policy Hierarchy
Policies can be organized into hierarchical structures that give advantages to providers and
customers. Customers can have an overall policy that applies to all their users. They can
assign additional constraints to different business units, for example, by allowing them
finer-grained control.
One example would have a customer policy forbidding certain applications from running
during normal business hours. Each business unit must conform to that organizational
policy, but they can add more constraints that do not violate it. One business unit may add
additional times when those applications are forbidden, for instance.
Hierarchy enables constraints to be organized and applied while still preserving the
flexibility of lower-level policies that do not violate the basic constraints.

Policy Attributes
Policy attributes are important to consider as well. The attributes differ according to the
needs of each policy.
Policies can be based on the initiating user. As an example, user A can use a particular
service only at night and only at the bronze level. On the other hand, User B can use the

same service anytime at a platinum level. This provides a great deal of granularity and
customization.
The time of day will have a strong impact on many policies because it controls when certain
services are available, or it changes the constraints on service quality. Thus, time can be an
important attribute because certain activities can be scheduled for, and restricted to, times
when they do not interfere with more critical service flows. Policy violations also identify
those users who are trying to violate the policy or who do not understand it.
Policies can come into conflict with each other, just as they do in other work areas.
Assigning each policy a precedence value helps resolve conflicts, with the highest
precedence being the operative policy at that moment. Using precedence is a good practice
because it forces administrators to evaluate the relative priorities of the policies they create
and manage.
Policies might also need a lifetime to help with their administration and management. There
will be situations where an administrator creates a policy for special situations. It is easy to
define its lifetime and automatically deactivate it when the time expires. This saves
administrative effort to track policies and prevents old policies from being used without
oversight. Some policies can have a lifetime value of forever. They will exist until an
administrator takes specific action to delete them.
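Pulling several of these attributes together, a policy record and a selection function might look like the following sketch (the attribute set and matching rules are illustrative only):

from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Policy:
    name: str
    applies_to: set              # user IDs or groups the policy covers
    hours: range                 # hours of the day when the policy is active
    precedence: int              # higher value wins when policies conflict
    expires: Optional[datetime]  # None means the policy lives until deleted

def operative_policy(policies, user, now):
    """Pick the single policy in force for this user at this moment."""
    candidates = [p for p in policies
                  if user in p.applies_to
                  and now.hour in p.hours
                  and (p.expires is None or now < p.expires)]
    return max(candidates, key=lambda p: p.precedence, default=None)

# Example: user "a" is held to bronze service at night; a higher-precedence
# platinum policy covers user "b" at any hour.
night_bronze = Policy("bronze-nights", {"a"}, range(20, 24), 10, None)
platinum = Policy("platinum-anytime", {"b"}, range(0, 24), 20, None)
print(operative_policy([night_bronze, platinum], "a", datetime(2004, 1, 5, 22)))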

Policy Auditing
Policy systems need auditing functions. I have worked with several organizations, for
example, that had created large numbers of policies and then found that only a small
percentage of them were actually used. Knowing which policies are heavily used also
focuses time and energy on optimizing those that give the highest pay-off.

Policy Closure Criteria


Every policy must have clearly defined closure criteria. With the criteria, the enforcement
mechanism understands when the policy has completed successfully or when it cannot
proceed further. Closure might involve, for example, contacting a staff member as the last
step.

Policy Testing
Many policy systems lack strong testing capabilities. Administrators must have confidence
that the policies they are using will work properly in the operating range the policy was
designed to handle. Policies can be tested in development laboratories or by hand by
running through a set of possible scenarios.


Policy Product Examples


Two examples illustrate established policy products. One is from Cisco Systems; the other
is from MetaSolv Software.

Cisco QoS Policy Manager


The QoS Policy Manager (QPM) from Cisco Systems is an example of a maturing policy
system; it has been in the market long enough to be tested in real production environments.
QPM has a scalable architecture with a set of distributed policy servers supporting its
classifiers and enforcers. Cisco might package classifiers and enforcers in the same network
device, or it can use separate elements. QPM is designed to manage the configuration of
devices to deliver specified service quality.
The QPM console functions as the interface to the rest of the system. The console is also
used for policy management functions. Policies are stored on a policy server that replicates
the policy information to a distributed set of policy servers. This provides high scalability
and fault tolerance.
Other policy management functions include the following:

Periodic comparison of the device configuration against the policy definitions stored
in the repository

The provision of a series of web-based reports for administrators


The tracking of distribution status
The maintenance of a detailed log of device configuration and policy changes
The use of an alarm to distribute new policy information
The use of incremental configuration updates, saving time and bandwidth

The repository is a directory accessed through the Lightweight Directory Access Protocol
(LDAP). LDAP establishes a client/server connection before transporting requests from the
client. LDAP is the common-access mechanism, leaving the choice of the actual repository
schema open. The repository could be an object-oriented database, an SQL database, or
even a flat file.
QPM uses several policy-distribution mechanisms to support its push-delivery model.
Cisco will continue to support Simple Network Management Protocol (SNMP) devices, but
it sees Common Open Policy Services (COPS) as its strategic future direction. Distributed
servers can be placed close to concentrations of policy elements to avoid backbone
congestion and to speed policy updates to the components.


COPS
COPS is an emerging Internet standard for interactions between a policy client (the Policy
Enforcement Point [PEP]) and a policy server (the Policy Decision Point [PDP]). It
supports both models for distributing policy information. The pull model is used when the
PEP initiates requests, updates, and deletes to the PDP, which returns a decision for each
request. The push model is also available, enabling the PDP to push new information to a
PEP, or delete old information.
COPS uses reliable communication between the PEP and the PDP. Because secured policy
exchanges are essential, COPS uses message-level security for message integrity,
authentication, and replay protection. Other security mechanisms, such as IPSec, can also
be used.
Messages between the PDP and PEP contain self-identifying objects relating to each
request or response. Examples include client type, interfaces, errors, decisions, and timers.
COPS offers higher reliability than earlier connectionless protocols, such as SNMP. It also
imposes a burden on the PEP and PDP to keep the connection active with heartbeat traffic,
which adds more network loading.

Cisco provides special engines that use Network Based Application Recognition (NBAR)
to inspect each incoming packet and classify it according to the specified policy. Enforcers
are also built into most Cisco devices.
QPM has all the components for a policy-based management system oriented toward
element-management policies. It has Cisco devices with the capabilities to classify traffic
and apply a range of enforcement policies. Figure 7-1 shows an example of the use of QPM
to handle an SLA. Among the SLA specifications are descriptions of service classes,
services, and metrics.
Figure 7-1  Using QPM and an SLA to Configure a Policy System


Under the SLA in Figure 7-1, there are two branches: the left branch sets up the
classication and enforcement functions while the right branch handles instrumentation.
With the left branch, the SLA denes the service classes, such as streaming, interactive, or
transactional, that are covered by the agreement. The next stage is dening the membership
of each service or application and dening which service class applies.
This information is then enhanced with the definition of the relative priorities for each service class. Applications are identified by criteria such as the information in the communications packet. This information is used to configure the NBAR functional module so that it recognizes each application and appends the appropriate information to each packet it processes.
Information can also be loaded into other devices that act as enforcers for the policy system. For example, edge devices can have rate and admission control functions that are activated for each type of application/service flow.
The metrics are handled on the right branch of Figure 7-1. They are specified for each class and service in the SLA. A solution such as QPM takes these metrics and configures the instrumentation system accordingly. Instrumentation can be found in devices, desktops, servers, and stand-alone collectors and aggregators. Both passive and active instrumentation are configured to capture the metrics and report them to a management server.
Some of these steps still require manual translation today. Future SLAs can be constructed as XML documents, providing an electronic input for QPM. More of the process can then be automated, adding value by reducing staff labor and errors.
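As an illustration of that direction, the sketch below parses a hypothetical XML SLA into the two branches shown in Figure 7-1: service-class definitions for the classification/enforcement side and metric definitions for the instrumentation side. The element and attribute names are invented for this example; QPM does not define this schema.

import xml.etree.ElementTree as ET

# Hypothetical SLA document; the schema is invented for illustration only.
SLA_XML = """
<sla name="web-storefront">
  <serviceClass name="interactive" priority="high">
    <application name="order-entry" match="tcp/443"/>
  </serviceClass>
  <serviceClass name="transactional" priority="medium">
    <application name="catalog" match="tcp/80"/>
  </serviceClass>
  <metric class="interactive" name="response_time" threshold_ms="2000"/>
  <metric class="interactive" name="availability" threshold_pct="99.9"/>
</sla>
"""

def split_sla(xml_text):
    root = ET.fromstring(xml_text)
    # Left branch: classification and enforcement configuration.
    classification = [
        {"class": sc.get("name"), "priority": sc.get("priority"),
         "applications": [(a.get("name"), a.get("match")) for a in sc.findall("application")]}
        for sc in root.findall("serviceClass")
    ]
    # Right branch: instrumentation configuration for the SLA metrics.
    instrumentation = [dict(m.attrib) for m in root.findall("metric")]
    return classification, instrumentation

classes, metrics = split_sla(SLA_XML)
print(classes)
print(metrics)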

Orchestream Service Activator


In contrast to QPM, MetaSolv Software's Orchestream Service Activator is positioned as a multi-vendor policy solution for configuring a larger variety of elements. A brief overview shows similarities to QPM in many aspects.
The Service Activator system consists of a central server and a set of distributed agents that use vendor-specific device drivers to control a number of network elements. Policy information is distributed with COPS or with SNMP.
Device drivers convert requests for services and policies into device- and vendor-specific configurations without needing scripts or templates. For example, Service Activator automatically determines the following:

•  Which devices are affected by the policies
•  Which protocol is used when updating device configurations
•  The exact commands that are issued to the device


This frees administrators to concentrate on more important management tasks.


Orchestream Service Activator further simplifies the process with a discovery function that enables it to create a topology model with information about the capabilities of each device. Constant monitoring tracks any unauthorized configuration changes, and Service Activator restores the appropriate policy information.

Summary
This chapter introduced the idea of policy-based management as a means of dealing with
demands for service management in a complex environment with tight time constraints.
Automating many of the responses and procedures minimizes staff labor, reduces staff
mistakes, and provides the speed needed to meet stringent SLA compliance criteria.
Policies serve two main purposes: they define what actions the management system takes in certain situations, and they prohibit other management activities that are irrelevant to a specific problem.
Policy systems have enforcers to determine the appropriate actions to take on service flows.
Policies are distributed using push, pull, or hybrid approaches. The push model is very
effective for abruptly changing the policy system behavior. The pull model enables each
component to ask for information as needed.
Policies for services management evolve by automatically integrating more functionality.
Consider the policies that could be activated when a desktop initiates a streaming
connection. Collectors inside the desktop are activated to measure the latency and packet
loss on the connection. Alerts are forwarded if the measurements indicate an actual or
potential service disruption. Monitoring is discontinued when the connection is terminated.
A security breach might activate a set of policies that adjust firewalls, isolate key resources,
inform corporate management, and track the intruder while alerting the management team.
Policy-based products for service management are still maturing, and administrators need
to assess their actual capabilities carefully.
For any policy, the output of a decision is only as good as the quality of the input. In
selecting where to apply policy-based management, as much consideration must be given
to the information used to make the decision as to the automation of possible outcomes.
This reinforces the importance of good instrumentation and event management for good
policy-based management.


CHAPTER 8

Managing the Application Infrastructure
In the not-so-distant past, a service and an application were often the same thing.
Applications were monolithic and performed a certain set of functions (services) for their
users. As web-based services appeared, along with the need for extremely fast creation and modification of services for users, organizations found that assembling services from sets of interacting applications was faster than trying to build a monolithic application for each service. Modular clusters of applications can be easily assembled to support new service offerings. This reduces development costs, speeds time to market, and provides significant development leverage. A key application, such as an order entry system, can be used by many services because it performs a common function.
Delivering superior service quality in a dynamic system composed of multiple applications
and services is based on the ability to coordinate that set of supporting applications and
services to meet the overall service quality goals. Understanding the relationships of all the
service components and measuring the behavior of each is a major hurdle for efficient and
economical service management.
This chapter discusses service quality at the highest architectural level: that of the application infrastructure and the services it provides directly to end users. (Chapter 9, "Managing the Server Infrastructure," and Chapter 10, "Managing the Transport Infrastructure," discuss service quality at the web-server and the transport levels, respectively.)
This chapter's discussion of application-level service quality is in four subsections:

•  The critical need to have applications designers and the network and services managers share the same perspectives about service delivery
•  Application-level service metrics, which are high-level technical metrics and other end-user experience metrics
•  Transaction response time, which is a primary example of the dependence of application-level service quality on lower-layer service quality
•  Instrumenting web servers and other server components


Interaction of Operations and Application Development Teams
One of the obstacles to effective applications management is not a lack of technology;
rather, it is often a structural problem within IT organizations. The network operations staff,
application managers, and developers are often unaware of the impacts of their decisions
on the overall service quality. Although most groups have experts with deep technical
knowledge of their own area, barriers exist that inhibit communications with other experts
who can make service delivery more effective. This isolation also results in situations where
a decision made by one group negatively impacts overall service because of an
unanticipated interaction with technologies that are managed by a different group.

The Effect of Organizational Structures


One of the most difficult issues with application management is the gap in knowledge and perspectives between application developers and the operations teams. Of course, both have their specialized functions, but often application performance problems hinge on poor application design or operational choices that hobble performance.
Organizational structures often accentuate this gap. Many IT groups are divided into teams by technical specialty, with independent network, systems, and applications teams; at other times, there is a division between the operations and development groups. I'd characterize many of these relationships as ranging from aloof to adversarial. The finger-pointing that typically occurs with a service quality failure is an unpleasant, but, unfortunately, all too common occurrence within many IT groups. This frequently leads to poor decision-making on either side, with impacts that affect overall service quality and economics.
Sometimes, lack of a common vocabulary keeps communications from being clear. At other times, the network and applications people don't understand basic assumptions about the other's work and perspectives. Often what's required is collecting and organizing objective information and design guidelines that both sides can use. It is easier to resolve and prevent performance problems when both sides have information they share and trust, and when both sides have learned to collaborate on design issues.

The Need to Understand the Operational Environment


I was recently speaking with some developers visiting a large e-business software firm. The
developers were quite excited about incorporating more object request broker technology
into their new applications. They cited faster implementation times and simpler
programming because the object request broker handles all kinds of messy details. An
application simply requests an object and the object request broker takes over the task of
locating it, accessing the object, and performing any necessary data and formatting
transformations.


Granted, this does simplify the application development process, but there is the impact of not knowing where your objects (content) are actually located. Content location can have a significant performance impact due to long distance (propagation delay), restricted bandwidth, or overloaded servers at a given location. In response, the teams felt that any unacceptable access delays could be fixed by adding more bandwidth. However, this answer masks confusion between bandwidth and propagation delay. In this instance, the stated solution would undoubtedly contribute to poor application design decisions that cannot always be fixed later by throwing resources (money) at the problem. This is especially true when wide-area networks (WANs) spanning long distances are involved.
To enhance the understanding of network impact on application behavior, tools such as
Compuware's Application Expert can be useful. Application Expert is used during the
development phase to quickly test different application scenarios. It enables developers to
see the effect of network delays and bandwidth on application performance. This
information is fresh enough that it has an impact during the development process rather
than after the fact; this leads to better implementation. In the deployment phase,
Application Expert can be used to monitor the actual performance and identify further
opportunities for improvements.

Time Lines Are Shorter


Another consideration is that the market is unforgiving. In the past, applications could be introduced and then tuned and shaken down over a period of time. Today, applications must deliver peak performance and functionality immediately. Many customers won't give you a second chance. If the application is sluggish, if links are broken, if there are bugs, you've lost your chance to attract or keep those customers. It's extremely easy for your customer to shop a competitor's web site, possibly while waiting for your slow site to respond. If your customer, having been inspired to shop around by your site's poor performance, then discovers that your competitor has better prices, or better products, or even the same prices and products in an easier-to-use, more reliable web site, you're in trouble!

Application-Level Metrics
Applications require instrumentation to make their behavior observable, and, as
appropriate, controllable. Client-side collectors operating in passive or active modes
provide some of the instrumentation because they measure the user experience from that
location. Note that client-side collectors are usually not a part of the application
instrumentation itself; they measure the application behavior for a specific virtual
(synthetic) transaction.


Table 8-1 shows examples of application instrumentation for a web sales application. The instrumentation can be further divided into internal and external measurements:

•  The internal measurements give insight into the behavior of the systems within the direct control of the IT group. There are categories of internal measurements, including those for workload, customer behavior, and business behavior.
•  External measurements show the behavior as seen by an end user outside the scope of the IT organization, such as by an end user using the Internet to access the system and perform a transaction.

Table 8-1   Examples of Application Metrics for a Web Sales Application

Measurement                                      Insight

Workload
Number of transactions/second                    Overall functioning; relationship of load versus performance
Peak transaction loads                           Stress points; average:peak ratios
Number of concurrent connections                 Overall functioning; relationship of load versus performance
Peak connection volume                           Stress points; average:peak ratios
Web page volume or traffic volumes               Gauges of activity

Customer Behavior Measurement
Favorite content                                 Content optimization, choice of replication sites, and cache preloading
Ratio of visitors to sales                       Effectiveness of content
Forward and reverse path analysis                Path analysis and navigation
Start and stop pages                             Path analysis and navigation
Average pages per visit                          Navigation
Average visit time                               Stickiness

Business Measurement
Number of completed orders                       Tracking site efficiency
Revenues generated                               Tracking actual business generated
Abandoned carts                                  Tracking customer behavior
Promotion feedback                               Tracking marketing effectiveness

Service Quality Measurement
Transaction response time and availability       Measure supporting services


Workload
Workload metrics track the capacity of the application. Capacity is determined by the
quality of the implementation and the assigned computing, storage, and network resources.
Some measurements of overall activity and capacity might include the number of
transactions per second, the number of concurrent connections, or the actual server loading
measurements.
Periodic measurements of workload can build activity baselines that profile the normal
ranges over longer time intervals. Alerts can then be generated when the comparison of the
actual workload against the baselines indicates a trend away from the normal ranges.
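A minimal sketch of that idea follows: it builds a per-hour baseline of transactions per second from historical samples and raises an alert when current workload drifts outside the normal range. The three-standard-deviation threshold and the sample data are assumptions chosen purely for illustration.

from statistics import mean, stdev

def build_baseline(samples_by_hour):
    """samples_by_hour: {hour: [tps, tps, ...]} collected over several weeks."""
    return {hour: (mean(vals), stdev(vals)) for hour, vals in samples_by_hour.items()}

def check_workload(baseline, hour, current_tps, n_sigma=3.0):
    avg, sd = baseline[hour]
    if abs(current_tps - avg) > n_sigma * sd:
        return f"ALERT: {current_tps} tps at hour {hour} outside normal range ({avg:.0f} +/- {n_sigma * sd:.0f})"
    return "OK"

# Usage with made-up history: 2 p.m. normally runs around 120 tps.
history = {14: [118, 122, 119, 125, 121, 117, 123]}
baseline = build_baseline(history)
print(check_workload(baseline, 14, 121))   # OK
print(check_workload(baseline, 14, 180))   # ALERT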
I recently visited a site that had melted down when a web page designer added two simple objects, and another 45 KB of payload, to the home page. Testing in the lab showed no apparent bugs, so the new page was placed in production. The problems began appearing during times of heavy customer access. When there were over 10,000 active connections, which occurred during the hours of peak demand, the additional load was 450 MB being sent across the network for users downloading the home pages. As it turned out, this bump on the backbone was actually the final straw that convinced this organization to outsource their content delivery to a managed infrastructure provider.

Customer Behavior Measurement


Measuring customer behavior is essential for several reasons. Customer behavior offers
great feedback and insight into application effectiveness. This is particularly important in
web applications because they are the major customer-facing applications. Tuning a
customer-facing application is more challenging than tuning individual elements or even an
infrastructure.
The same technology tuning needs to occur for rapid page access and high transaction volumes. However, customer-facing applications must meet business goals as well. These goals depend upon the characteristics of the web site itself: organization, structure, navigation, and layout. Content must be compelling and easy to navigate, and processes must be as simple as possible.
For example, sites that provide content may want a sticky environment that keeps customers
at the site for extended periods of time. On the other hand, sites dependent on serving ads
and promotions will use each new page as another selling opportunity.
In contrast, sites that provide information or products want to get their customers to the
right content as quickly as possible, transact their business, and move on to the next
customer. Clumsy navigation and excessive links discourage this type of consumer.
Measurements of the average number of pages a customer uses during a visit or the average
length of a web site visit can track the overall effectiveness of the site. These metrics assess
either type of site.


Other aspects of customer behavior are used to optimize application performance. For example, tracking the most heavily used content or the most frequent transactions gives valuable information for improving the effectiveness of the customer-facing application. One site found that one of its most frequently accessed pages took five clicks to reach. This was a business site wanting short visits and quick navigation. The desired content was moved to the home page, resulting in improved customer satisfaction and increased revenue because, with more direct navigation to their desired destination, fewer customers lost interest.
The popularity of content also assists managers in making intelligent decisions about content placement, cache preloading, and the number of replication sites. The same value is provided by instrumentation that identifies the most frequently used transactions. Developers can focus their attention and optimize those transactions that will offer the highest payoff in improved performance.

Business Measurements
Business measurements are becoming increasingly important. They are directly important
to business managers who want to understand how their online business is actually
functioning in real time. These metrics are important to technology managers as well; being
the source of critical business information establishes the value of better management
investments.
Some examples of business metrics are completed orders, generated revenue, promotion
feedback, and abandoned shopping carts.
A tally of completed orders indirectly measures the effectiveness of the web site: whether it is keeping customer interest long enough to close sales, for example. This measures only bottom-line effectiveness, not efficiency; however, it is a basic metric for many organizations at this time.
The completed orders metric can be broken into more details, such as the following:

•  The ratio of customer orders to total customer visitors: This is a measure of the percentage of visitors that actually buy. This is helpful when evaluating alternate page design strategies or navigation options.
•  Active customers: This is the identification of the best customers based on total sales, for example. Special promotions can be targeted to the best customers.

Generated revenue is calculated by measuring the cumulative revenue from completed orders. It gives business managers deeper insight into their current operation. They can compare the revenue generated against goals, or they can compare historical trends to gauge overall revenue growth. Other derived measurements could include building revenue baselines: plots of average run rates over the business day, for instance.


Promotion feedback can be invaluable because business managers and their marketing teams are constantly focused on guiding users down a certain path to meet objectives such as strengthening the Internet brand, creating stronger differentiation from competitors, responding to market and competitor moves, and maintaining customer loyalty. They are under continuous pressure to capture a greater market share while simultaneously reducing customer acquisition costs.
Instrumentation can use special web pages, special buttons or links, or other ways of
tracking responses to a variety of promotions. This information can be analyzed and
organized to assess the effectiveness of different promotions and to understand acquisition
costs.
For some reason, abandoned shopping carts always seem to get a business manager's attention. There have been some anecdotal reports that abandonment rates are often over 50 percent for some consumer sites. This should be distressing because these are potential buyers who have taken time to navigate the site and select products before they go to another site.
Business behavior metrics may be derived from other more basic measurements. For
example, revenues are calculated after each order is completed. The basic revenues may be
further segmented by the customer, the product, the time of day, a promotion, or other
criteria. These metrics must be baselined, and thresholds should be established. Because
business managers want to understand and respond to situations more quickly and because
technology managers want to make adjustments to maintain compliance with Service Level
Agreements (SLAs), an alarm can be sent to the appropriate business and technology
managers when an application has a sudden drop in revenues or visitors.

Service Quality Measurement


For a web transaction, the principal service quality measurements are the external metrics of transaction response time and availability, both of which were first discussed in Chapter 2, "Service Level Management," in the section titled "High-Level Technical Metrics." They are further discussed in this chapter because of their importance.
Note that transaction time is a measure of how quickly a user can complete an end-user transaction on your system. If the user is a member of your own organization making an intranet transaction, poor performance will decrease productivity but might have no other bad effects. However, if the user is a web customer, and if the same function (for example, purchasing a book) can be accomplished more quickly, and with a greater chance of success, on a competitor's site, poor performance may cause transaction abandonment and loss of business. The faster speed of the competitor's site might be due to fewer pages being involved, or quicker downloads per page, or both; improved availability might be simply a side effect of shorter total times. To the end user on the Web, the key fact is that your competitor offers faster, more reliable service.


Each web page within the transaction should also be measured for download time because that can be a good indicator of user abandonment behavior. On legacy systems, users didn't see the computer screen directly; they spoke to call center operators. If the computer was slow, the operator would talk to the customer and save the sale. On the Web, the customers are directly exposed to slow web service, and they'll abandon a slow transaction. A two-minute transaction that consists of ten 12-second page downloads is considerably different from a two-minute transaction that consists of five 6-second page downloads and one 90-second download. Many users will abandon during that 90-second download. That's crucial information for the business groups and should be included in the SLA.

Transaction Response Time: An Example of Dependence on Lower-Level Services
Transaction response time is the primary metric of application service quality as delivered to the end user. It's also an excellent example of the dependence of an application-level quality metric on lower-level services with their associated metrics. In addition, it shows the need for communications between applications design groups and the network services and operations groups. Because of its importance, and because it's such a good example, this section traces the dependencies of transaction response time through all the underlying services and their relevant metrics. (Chapters 9 and 10 present additional details about metrics for the web-server systems and transport infrastructures, respectively.)
You can view a transaction as having several time components. These are the serialization delay, queuing delays in transmission, propagation (transmission) delay, and the processing delays in both the network (modem, switching equipment, and so on) and the web-server system. There are also delays, commonly called think time, that are associated with external user activities. They include reading the delivered content, thinking, and talking on the telephone, all of which are not considered here but which will be important in load testing. (Note that there is a strategy in user interface design that can accommodate habits of user perception and influence think time by loading different components at different speeds, organizing the presentation, and using other tactics that create perceptions of positive or negative performance variations.)
In addition, each data block has additional overhead added to it before transmission. That
overhead, in the form of headers and trailers, is needed by the lower infrastructure layers as
they process the data blocks.
Specifics about different types of delays are discussed in the following subsections; serialization delay and propagation delay are shown in Figure 8-1.


Figure 8-1   Serialization and Propagation Delays

[Figure: a data block with header and trailer overhead leaves an output queue, incurring serialization delay as it is clocked onto the line and propagation delay as it crosses the link]

Serialization Delay
Serialization delay is caused by the process of converting a byte or word in the computer's memory to or from a serial string of bits on the communications line. Serialization causes delays in most routers and, of course, at the source and destination. The time needed for serialization is the time needed to write bits on to or off of the communications line; it's controlled by the line speed. For example, 1500 bytes requires 8 milliseconds (ms) to serialize at 1.5 Mbps and 300 ms to serialize at 40 kbps. The added header and trailer overhead increases serialization delay because of the time needed to write and read those bytes.
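The arithmetic is simple enough to script, as in this small sketch that reproduces the figures above and shows how header and trailer overhead adds to the delay; the 40-byte overhead value is just an assumption for illustration.

def serialization_delay_ms(payload_bytes, line_bps, overhead_bytes=0):
    """Time to clock a block (payload plus overhead) onto the line, in milliseconds."""
    total_bits = (payload_bytes + overhead_bytes) * 8
    return total_bits / line_bps * 1000

print(serialization_delay_ms(1500, 1_500_000))                   # ~8 ms at 1.5 Mbps
print(serialization_delay_ms(1500, 40_000))                      # ~300 ms at 40 kbps
print(serialization_delay_ms(1460, 40_000, overhead_bytes=40))   # overhead adds to the delay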
Decreasing overhead by fitting more data into each packet decreases download time by decreasing serialization delay. However, most systems that run over the public Internet use either 1460 bytes per packet (for high-speed connections) or 576 bytes per packet (for dial-up connections); it's not easy to change those values. Changes are more easily made on private systems. (Longer packets increase jitter and the penalty for a packet error; but in a private, dedicated network where the number of router hops is constrained and transmission quality is more controllable, this might not be a major issue.)
It's important to note that serialization delay is greatly influenced by compression and encryption of content. For example, the standard home-user, dial-up modems perform hardware compression within the modem itself. For some data patterns, the modem compression ratio is 4:1 or better. If a data block has been compressed, it is shorter and therefore takes much less time to serialize. On the other hand, encrypted data cannot be compressed. (An encrypted string appears to be purely random and therefore uncompressible.) The result is that secure web pages are transmitted much more slowly on transmission links that have a large serialization delay. Such web pages should be compressed before encryption. (This is also a strong argument in favor of using true end-user measurements instead of computed or simulated end-user measurements. A true end-user measurement would include the effects of modem hardware compression; no commercial emulated measurements do that.)


Queuing Delay
Queuing delay is caused by waits in queues at origin, destination, and intermediate
switching or routing nodes. Variations in this delay cause jitter. For streaming media
applications, a dejitter buffer is required at the receiving end. (The delay in the dejitter
buffer is typically one or two times the typical jitter.)

Propagation Delay
Propagation delay is governed by the laws of physics; propagation delay cannot be
decreased by increasing the line speed. It is a distance-sensitive parameter. The ITU-T
standard G.114 species 4 s/km for radio, 5 s/km for optical ber, and 6 s/km for
submarine coaxial cables, including repeaters. Therefore, it will require 20 ms to travel the
4000 kilometers (km) from New York City to Los Angeles, or 100 ms to travel the 17,000
km from New York City to Melbourne, Australia. A signal beamed up to a geosynchronous
satellite and down again, a distance of 72,000 km, takes approximately 280 ms.
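These distance-based delays can be checked with a few lines of arithmetic, as in the sketch below, which uses the per-kilometer figures just cited.

# ITU-T G.114 per-kilometer propagation figures (microseconds per km).
US_PER_KM = {"radio": 4, "fiber": 5, "submarine_coax": 6}

def propagation_delay_ms(distance_km, medium="fiber"):
    return distance_km * US_PER_KM[medium] / 1000.0

print(propagation_delay_ms(4_000))                      # New York to Los Angeles: ~20 ms
print(propagation_delay_ms(17_000, "submarine_coax"))   # New York to Melbourne: ~100 ms
print(propagation_delay_ms(72_000, "radio"))            # geosynchronous satellite hop: ~288 ms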
An example may help illustrate the massive importance of propagation delay. Imagine a 1-MB file to be transmitted over three different connections:

•  A local high-speed Ethernet connection at approximately 100 Mbps
•  An Internet connection from New York to Los Angeles with an effective bandwidth of 15 Mbps and a one-way propagation delay of 75 ms
•  An Internet connection from New York to Los Angeles with an effective bandwidth of 1.5 Mbps and a one-way propagation delay of 75 ms (a typical coast-to-coast latency on the Internet)

There's some additional complexity that must be mentioned here: the Transmission Control Protocol (TCP) used by web browsers and for reliable file transmission over the Internet has a typical data block size of 1460 bytes and a window size of 17,520 bytes.

NOTE

The window is the maximum amount of unacknowledged data that can be outstanding at any given time; the value given here is for the Windows 2000 operating system (OS). Thus, for a window size of 17,520 bytes, twelve 1460-byte data packets can be transmitted before an acknowledgment must be received. An acknowledgment is sent after each even-numbered packet is received.


Note also that TCP's slow start algorithm, which slowly increases transmission rate at the start of a file to avoid congestion, is being ignored for this example. (The large file size makes slow start less important here, but it can be important for short files.)
Now you can see the effects of propagation delay on performance:


•  For local, high-speed Ethernet, the propagation delay is so low that there's never a problem receiving the acknowledgments before 17,520 bytes have been serialized. The transmission of 1 MB proceeds at full line speed and is complete in approximately 0.1 seconds.
•  For the 1.5-Mbps Internet connection in our example, serialization of 17,520 bytes takes approximately 100 ms, and the propagation delay across the U.S. takes approximately 75 ms. The round trip is therefore approximately 150 ms, and the first acknowledgment is generated when the second packet has finished arriving, approximately 15 ms after the first packet begins to arrive. Therefore, as shown in Figure 8-2, there's a 65 ms pause to wait for an acknowledgment after each block of 17,520 bytes is transmitted. Transmission of the 1 MB in 58 separate blocks of 17,520 bytes each takes approximately 9.5 seconds.
•  For the 15-Mbps Internet connection, serialization of 17,520 bytes takes approximately 10 ms, and the propagation delay across the U.S. takes approximately 75 ms. The round trip is therefore approximately 150 ms, and the first acknowledgment is generated approximately 1.5 ms after the first packet begins to arrive at the receiver. Therefore, as shown in Figure 8-3, there's a 142 ms pause to wait for an acknowledgment after each block of 17,520 bytes is transmitted. Transmission of the 1 MB in 58 blocks of 17,520 bytes takes approximately 9 seconds, almost the same as for the 1.5-Mbps connection!

Figure 8-2   Packet Transmission at 1.5 Mbps

[Figure: timeline of one 17,520-byte window at 1.5 Mbps; transmission starts at 0 ms and finishes at about 100 ms, the receiver begins receiving at 75 ms and sends the first ACK at about 90 ms, and the sender cannot begin the next window until about 165 ms]

Figure 8-3   Packet Transmission at 15 Mbps

[Figure: timeline of one 17,520-byte window at 15 Mbps; transmission finishes at about 10 ms, the receiver begins receiving at 75 ms and sends the first ACK at about 85 ms, and the sender cannot begin the next window until about 152 ms]
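The transfer times just quoted can be approximated with the same simplified stop-per-window behavior shown in Figures 8-2 and 8-3: the sender transmits a full window and then idles until the first acknowledgment returns. The sketch below is a rough model under those assumptions (fixed window, no slow start, no loss, 1 MB taken as 1,000,000 bytes), not a full TCP implementation; it lands close to the chapter's 0.1-, 9.5-, and 9-second estimates.

def transfer_time_s(file_bytes, line_bps, one_way_delay_s,
                    packet_bytes=1460, window_bytes=17_520):
    """Approximate transfer time with a fixed window and stop-per-window behavior."""
    window_ser = window_bytes * 8 / line_bps        # time to clock one full window onto the line
    ack_trigger = 2 * packet_bytes * 8 / line_bps   # the first ACK is sent after the second packet arrives
    # Idle time per window: round trip plus the ACK trigger, less the transmit time already spent.
    pause = max(0.0, 2 * one_way_delay_s + ack_trigger - window_ser)
    windows = -(-file_bytes // window_bytes)        # ceiling division
    return windows * (window_ser + pause)

print(transfer_time_s(1_000_000, 100_000_000, 0.0001))   # local Ethernet: about 0.08 s
print(transfer_time_s(1_000_000, 1_500_000, 0.075))      # 1.5 Mbps coast to coast: about 9.6 s
print(transfer_time_s(1_000_000, 15_000_000, 0.075))     # 15 Mbps coast to coast: about 8.8 s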

This situation is even more important for web transactions, where each web page may require many files, each with this type of sensitivity to transmission delays. The number of round trips required by a web page or a transaction is sometimes referred to as turns, and decreasing that number clearly decreases the sensitivity to transmission delay. Another way of decreasing download time is to decrease transmission delay itself, and Chapter 9 discusses how content distribution networks can be used to place some of the page's content closer (in terms of transmission delay) to the end user.

Processing Delay
Processing delay in the network includes modem delays (typically 40 ms or more for a pair
of V.34 modems without compression and error correction functions, for example), router
delays, and telephone network switching equipment delays.
Processing delay at the web server encompasses such functions as authentication, database
access, use of supporting services, and calculation. Increasing the server performance,
improving caching and load distribution, accelerating encryption speeds, adding servers, or
adding disc capacity are all ways to reduce the processing time or time spent on a server.

The Need for Communications Among Design and Operations Groups


Figures 8-2 and 8-3 demonstrate that different combinations of speed, location, and size can
accentuate different sensitivities. Careful analysis and discussions between the operations
and applications teams can avoid some problems and build the foundation for applications
that are actually network aware. Designers must understand the burdens they impose on the
network and other resources when they add turns to a transaction, add objects to a page, or
enrich the current content.


The network administrators must also increase their application awareness. They need to select window and packet sizes that reduce latency and improve efficiency wherever they can. They, too, must understand that bandwidth does not solve every application performance problem.
The placement of content is becoming a concern as pressures to deliver and use richer
content at higher quality continue. The content delivery infrastructure discussed in Chapter
9 and the sensitivities just covered in this chapter indicate the trade-offs that must be
considered in application design and operations.

Instrumenting Applications
Applications must provide the internal loading, customer, and business behavior metrics
that are necessary to understand their functioning. These metrics can be collected by
instrumentation from the web-server systems and applications, from other server
components, and from the end user. These are discussed in the following sections.

Instrumenting Web Servers


Web analytics products are used to obtain customer behavior metrics. The earliest focused on analysis of web logs produced by the standard web server applications, such as the Apache Software Foundation's HTTP Web Server and Microsoft's Internet Information Server (IIS).
Server logs provide the most basic information on the accessed pages and the time of the request. The information they provide is simple: they maintain no context, there is no information about the interactions of customers with the content, and there is no visibility into cached activity (such as when the user navigates back to the previous page, which is already stored in the browser cache). Because they can provide the identity of the "referring" page (the previous web page), it's possible for software to laboriously chain the referrals together and thereby obtain a picture of the user's progress through the web site.
A typical example of a log-analysis product is the WebTrends Log Analyzer Series from NetIQ. It accepts server logs from all the major web servers and produces extensive reports on customer behavior. (Log files from all the major servers are very similar; there are standard formats and log entry definitions.) Because log files can be massive (each item on a web page creates a log entry each time it's downloaded) and because multiple web servers are usually involved in a single system, the amount of data that must be processed by the analysis engine is also massive.
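As a rough illustration of what log analysis does, the sketch below parses a few lines in the common combined log format and chains referrers to approximate one visitor's path. Real products handle massive multi-server logs, sessionization, and many edge cases that this toy example ignores; the sample log lines and host names are invented.

import re

# Combined log format: host, identity, user, time, request, status, bytes, referrer, agent.
LOG_PATTERN = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "(?:GET|POST) (\S+)[^"]*" (\d+) \S+ "([^"]*)"')

sample_log = [
    '10.0.0.5 - - [15/Dec/2003:10:01:02 -0500] "GET /home.html HTTP/1.1" 200 5120 "-" "Mozilla"',
    '10.0.0.5 - - [15/Dec/2003:10:01:30 -0500] "GET /catalog.html HTTP/1.1" 200 8300 "http://example.com/home.html" "Mozilla"',
    '10.0.0.5 - - [15/Dec/2003:10:02:11 -0500] "GET /order.html HTTP/1.1" 200 4100 "http://example.com/catalog.html" "Mozilla"',
]

def visitor_path(lines, host):
    """Chain page requests for one host, using the referrer to confirm the sequence."""
    path = []
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m and m.group(1) == host:
            path.append((m.group(3), m.group(5)))   # (page requested, referring page)
    return path

for page, referrer in visitor_path(sample_log, "10.0.0.5"):
    print(f"{page:15s} referred by {referrer}")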
Later web analytics products depended on special tags to be inserted into the pages. These
tags activate the instrumentation to capture more information than is available with pure
log-analysis products.


One way of tagging a page is to insert an almost-invisible phantom object on each page. Usually this is a transparent, extremely small image; the technique is often called pixel-based tracking or page-bug tracking. When the page is loaded into a browser, the browser automatically makes a request for this invisible object, exactly as it requests all the other images on the page. It's just another image as far as the browser is concerned. The phantom object's tag is no different than a standard image tag, except that the tag references the data collection server. Because of that reference, the phantom object request is directed to a third-party recording site or tool that captures the activity. Using a single object in a page to represent the page as a whole reduces the number of entries for each page retrieved. Instead of having a log entry for each item on the page (and there may be 50 or more), there's only one entry per page.
Unfortunately, the simplest version of this type of tagging can't see the interactions that the user makes with the web page. However, more complex versions of phantom object tagging enable the tag to contain a parameter string in the form of a query string in the image request. That parameter string can be constructed by JavaScript running in the browser, and it can therefore record user actions on the page along with any other information available to JavaScript, such as browser size and available plug-ins.
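On the collection side, the recording server simply decodes the query string attached to the phantom image request. A minimal sketch of that step follows; the parameter names (page, action, res, plugins) and the collector URL are invented for this illustration.

from urllib.parse import urlparse, parse_qs

def parse_pixel_request(request_url):
    """Extract page and user-action data from a phantom-image ("pixel") request URL."""
    params = parse_qs(urlparse(request_url).query)
    return {
        "page": params.get("page", ["unknown"])[0],
        "action": params.get("action", ["view"])[0],
        "browser_size": params.get("res", [""])[0],
        "plugins": params.get("plugins", [""])[0].split(","),
    }

hit = parse_pixel_request(
    "http://collector.example.com/pixel.gif?page=/catalog.html&action=add_to_cart&res=1024x768&plugins=flash,java")
print(hit)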
Of course, use of phantom objects requires that each page to be measured include the phantom object and, probably, a piece of special JavaScript code. In contrast, log file analysis does not necessitate changes to the web pages.
Cookies can also be used by themselves or in conjunction with phantom-object tagging and
JavaScript. A cookie is an object exchanged between the browser and a web application. It
contains application and user information that applications can use for authentication,
personalization of content, and identification of customers for differentiated treatment. It is
stored in the browser at the request of the server, and a copy of it is returned to the server
with any subsequent requests made to that server. (Many users have set their browsers to
reject cookies automatically, unfortunately.)
WebSideStory's HitBox is one of the leading tools that uses phantom objects, usually in
combination with JavaScript and sometimes with cookies.
Clickstream is a tool from Clickstream Technologies that uses cookies in combination with a tracking module installed on the web server. Each request for a page serves the page to the browser, accompanied by a page-side measurement algorithm that records page display times as well as any offline and cached browsing activities that occur. The information is recorded in a cookie and later sent to a server that records and analyzes all the request information, including the browser-side, cache-based activities that would not otherwise be seen by instrumentation because they did not result in any traffic on the communications link.
Keynote's WebEffective is a measurement service that's different from the phantom object services. To use WebEffective, one line of JavaScript is embedded in the web site's entry pages. That JavaScript redirects selected users (a sample of all users, specific users, and so on) to the WebEffective server. The WebEffective server then inserts itself between the end user's browser and the original web server systems. In that position, it records everything the end user does on the web page and everything that the original web server systems do in response. (For example, it can discover that an end user is not clicking on a particular button because that button is not displayed on the end user's small browser window.) If requested, WebEffective presents a pop-up window to the end user to ask permission to track activity. After permission is given, it can ask questions of end users at any time. It can even intercept end users who are abandoning the site to ask them why they're leaving, and it can track them to their next site.
The integrity of web pages can also be evaluated by web analytics tools. I have spoken with
several organizations that have written a simple application for periodically validating the
integrity of the web application content. These applications improved service availability
by ensuring that the correct content was correctly linked. The rapid, frequent changes on
many sites might introduce a broken link, a pointer to a non-existent page, or other
problems that result in poor customer experience, lost business, and reduced chances of
future visits. Any problems with links or content are passed in an alert to the alarm manager.
(Such tools are also available from commercial vendors and include the Keynote
WebIntegrity tool and the Mercury Interactive Astra SiteManager tool.)
The integrity testing tool can be used to exercise the embedded links in the web pages. The virtual transactions load a page, check for the correct content using a simple technique like a checksum, and then initiate further virtual transactions based on links in the new page. Unfortunately, some manual intervention might be needed because of potential loops in the sequence of links; without intervention, the tests never terminate. The virtual transactions would exercise selected trails, such as those leading to visitor purchases. Web analytics tools help identify the paths that have the heaviest visitor volume.
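A bare-bones version of such an integrity tester is sketched below: it fetches a page, compares a checksum against an expected value, extracts the embedded links, and follows them while remembering visited URLs so that loops do not keep the test from running forever. The pacing delay, page limit, and starting URL are assumptions for illustration; a production tool would do far more.

import hashlib
import re
import time
import urllib.request
from collections import deque

LINK_RE = re.compile(r'href="(http[^"]+)"', re.IGNORECASE)

def check_site(start_url, expected_checksums, max_pages=20, delay_s=1.0):
    """Crawl from start_url, verify page checksums, and report problems."""
    problems, seen, queue = [], {start_url}, deque([start_url])
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            body = urllib.request.urlopen(url, timeout=10).read()
        except Exception as exc:                       # broken link or unreachable page
            problems.append((url, f"fetch failed: {exc}"))
            continue
        digest = hashlib.md5(body).hexdigest()
        if url in expected_checksums and digest != expected_checksums[url]:
            problems.append((url, "content changed or corrupted"))
        for link in LINK_RE.findall(body.decode("utf-8", errors="ignore")):
            if link not in seen:                       # remembering visited links breaks loops
                seen.add(link)
                queue.append(link)
        time.sleep(delay_s)                            # pacing so the test does not flood server caches
    return problems

# Usage (hypothetical URL and checksum):
# print(check_site("http://www.example.com/", {"http://www.example.com/": "d41d8cd98f00b204e9800998ecf8427e"}))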
In an example I saw, the first deployment of an integrity-testing tool was at a site with a high number of objects within each page. Before the application was tuned by the inclusion of delays inside each testing transaction, the rapid sequence of links delivered large numbers of new objects to the server's cache. This would cause the cache to replace other content with these transitory objects and add some delays while the cache refreshed its normal content after the test. The active collector was attached in a new position so that the server's cache was not directly in line and was therefore not disturbed by the integrity testing.

Instrumenting Other Server Components


Many applications are organized into tiers of servers, as shown in Figure 3-1 of this book,
for higher availability and distribution of transaction volumes. Chapter 9 discusses the use
of collectors to obtain measurement data from these servers within the data center. For
example, using active collectors at the edge of the data center provides direct application
response-time measurements without the effects of the external network. Consider a
scenario in which the WAN is slow, but the web front-end and the back-end database have
no performance problems. Taken alone, the end-to-end measurement from the end users' locations will show degradation. However, rather than faulting the application as a whole,
the data from active collectors at the edge of the data center shows that the real problem lies
with the network.
Legacy applications are still very much in the mix for most organizations, although they are
hidden behind better web interfaces or safely tucked away in the back-end areas. The
challenge with legacy applications is that many were never initially instrumented for
remote monitoring and management.
The relative opacity of legacy applications dictates less direct approaches to understanding
application behavior. One approach pioneered by BMC Software was to treat an application
as a black box and observe behavior indirectly. BMC Software started to instrument
mainframe applications by observing their effects on system logs, disc system activity, and
memory usage, among other factors. BMC uses experience to make an educated guess
about the application's behavior, based on inferences derived from analysis of those factors
that could be monitored.
Geodesic Systems offers a more direct approach to instrumenting applications with
their Geodesic TraceBack tool. They actually embed instrumentation during the application
build process. Instrumentation is incorporated into the application code at compile time,
and it records application behavior at such a fine level of detail that application errors can be pinpointed to a specific line of code.

End-User Measurements
For end-user measurements, passive and active collectors are placed near concentrations of
customers or at key infrastructure locations. They interact with the web applications and
carry out normal transactions. They measure the end-to-end performance of the application
from various sites. Almost all measurement system vendors, such as Computer Associates,
Tivoli, and HP, offer tools for running synthetic transactions or for passively observing an
end user.
Measurement services are also available from companies such as Keynote Systems and
Mercury Interactive. Keynote Systems is the largest supplier, with over 1500 active
measurement collectors at over 100 locations on all the major Internet backbones
worldwide. They run synthetic web transactions over high bandwidth, dial-up, and wireless
links, and they can also pull streaming media and evaluate the end-user experience. Use of
measurement service suppliers makes the most sense when your customers are dispersed
over the Internet or when you need a disinterested third party to provide your SLA metrics.
It's important to measure accurately when you want to evaluate the end-user experience. As mentioned, emulation of dial-up user experiences by using restricted-bandwidth devices fails miserably because of the impact of a real modem's hardware compression feature. A study in the Proceedings of the 27th Annual Conference of the Computer Measurement Group showed inaccuracies as high as 45 percent when using restricted-bandwidth emulation instead of real dial-up modem measurements.
Similarly, use of emulated browsers instead of the actual Microsoft Internet Explorer (IE), for example, can result in misleading page download times due to differences in the browsers' handling of parallel connections and other functions. If full-page download times or transaction times aren't important, browser emulations are acceptable for fetching the first HTML file in a web page. Crucial measurements, such as DNS lookup time, TCP connect time (an excellent measure of round-trip network delay), and the time to obtain the first packet of a file (which may indicate growth in the server's backlog queue) are all obtainable from an emulated browser. However, more sophisticated measures, such as the total time to obtain the page, should use a real browser, not a simplistic emulation. That's because a real browser pulls images in parallel, uses plug-ins, and has other behaviors that simplistic emulators do not match. Especially when part of the page is being delivered by caches and third-party servers (ad servers, stock price servers, and content distribution networks), end-user measurement by a simple emulator is not satisfactory.
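For the crucial low-level measurements just mentioned, even a simple script can time DNS lookup, TCP connect, and time to first byte, as in the sketch below; full-page and transaction times, as noted, still call for a real browser. The host name in the usage comment is a placeholder.

import socket
import time

def measure(host, port=80, path="/"):
    """Time DNS lookup, TCP connect, and time to the first byte of the response."""
    t0 = time.time()
    addr = socket.gethostbyname(host)                 # DNS lookup time
    t1 = time.time()
    sock = socket.create_connection((addr, port), timeout=10)
    t2 = time.time()                                  # TCP connect time approximates round-trip delay
    sock.sendall(f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode())
    sock.recv(1)                                      # first byte back from the server
    t3 = time.time()
    sock.close()
    return {"dns_ms": (t1 - t0) * 1000,
            "connect_ms": (t2 - t1) * 1000,
            "first_byte_ms": (t3 - t2) * 1000}

# Usage (any reachable web server):
# print(measure("www.example.com"))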
Caches and other memories of previous measurements must also be discarded before the start of each new transaction cycle. This prevents misleading reuse of previously retrieved files.

Summary
The application infrastructure is aptly named because most applications are composed of
related elements and supporting services. Applications include customer-facing elements,
which are activated most often through a browser, as well as backend functions, such as
credit authorization and order tracking. A single interaction with an end user commonly
involves multiple applications and services, and those applications and services are
themselves usually constructed of many smaller modules. Delivering superior service
quality in such a system requires good coordination and management of the supporting
applications and services.
One essential need is for closer communication among application designers and the operations teams before, during, and after deployment. Application performance is sensitive to many factors that designers usually ignore, such as transmission delay and the number of turns, or back-and-forth data exchanges, needed for a transaction.
Legacy applications usually lack adequate instrumentation, whereas newer applications are
providing some embedded monitoring and tracing functions. Instrumentation allows
administrators to track application workload as well as business and customer behavior.


CHAPTER 9

Managing the Server Infrastructure


Servers provide information, store and protect critical content and business information,
and process transactions. Server performance and availability are key factors in delivering
overall service quality; consequently, a major server infrastructure management challenge
is to manage for both high availability and high transaction volumes.
In this chapter, I cover server infrastructures, including load distribution, caching, and content distribution. A discussion of instrumenting that architecture follows.

Architecture of the Server Infrastructure


Most systems use a tiered architecture, with multiple servers within each tier, as shown in
Figure 9-1. Each tier has a set of servers that carry out the same function. Multiple servers
in a tier can increase performance and availability because distribution of workload across
redundant instances can absorb a failure. Multiple servers can also increase the flow volume because multiple flows are spread across the tier, increasing parallelism and reducing the
time for application processes to execute.
Web applications are in the first tier. They are customer facing, providing the first layer of
access to services and information. To build the web page, the appropriate application logic
is activated in the next tier. That application may in turn use databases or other back-end
services as needed.
The challenge has become to optimize each server layer for a range of applications, often in the face of conflicting demands. For example, one application may need intermittent, but quick, access to small objects, while another application needs to move bulk data to and from disk storage for sustained intervals. A dedicated server running a specific application can be tuned to meet those requirements without compromising performance by attempting to accommodate a wide range of conflicting resource demands.
Load distribution, other front-end processing, caching, and content distribution also play key roles in moving content efficiently; they are discussed in the next subsections.


Figure 9-1   Web Service Delivery Architecture

[Figure: end users reach the site through an access provider, CDN server, cache, and DNS server; traffic then passes through routers, a firewall, and a load distributor into a server farm of web servers, application servers, and a database]

Load Distribution and Front-End Processing


The networking industry regularly reinvents old approaches and applies them to new
situations. For example, the front-end processor was classically used to improve mainframe
performance after mainframes were networked with remote terminals. The mainframe was
a highly optimized computing platform and was not suited to handle high interrupt volumes
from communications activity.


A similar approach is emerging to wring more performance from server farms. Servers are
designed to be high-performance computing and data-access platforms. They can suffer
from dealing with high-speed network communications tasks, such as the following:

•  Processing connection offers
•  Performing the processing needed for key establishment and for encryption and decryption of Secure Sockets Layer (SSL) connections
•  Handling data compression
•  Detecting and suppressing attacks
•  Handling high interrupt levels associated with hundreds or thousands of active Transmission Control Protocol (TCP) connections
•  Handling error recovery, flow control, and timers for each connection

Hypertext Transfer Protocol (HTTP), the protocol used for web page transfers, adds
additional strains because some versions of HTTP use a separate connection for each object
that is accessed. This means that a browser will create a connection, access the object, and
break the connection for each object, even if the same server is involved for all the objects
on a page, and many web sites have 50 or more objects per page. This adds additional
server overhead and slows response.
New products address these limitations with a computer system placed between the server
farm and the customers using the site. This new front end is purpose-built for handling
communications tasks, in contrast to a general-purpose server where these functions
compete with application services for resources. Such front-end devices handle the
communications tasks and also perform load-balancing functions.

SSL Accelerators
Businesses and customers are increasingly concerned about the privacy of their
transactions. The SSL protocol is an application-layer protocol using TCP for reliable
delivery. SSL uses special software at the client and server ends of the connection to ensure
that communications are private.
After a TCP connection is established, the client and server authenticate each other to
establish that they are who they represent themselves to be. Encrypted digital certificates are exchanged and validated. Then the parties exchange encrypted messages and create a unique key that they use for only this session. The key enables secure communications and the detection of any alterations to the traffic in transit.
SSL adds some additional network overhead for authenticating the partners and negotiating
the security profile, but the biggest SSL impact is the computing load associated
with key creation. Large numbers of secure connections can degrade server performance
because servers must dedicate cycles to the processing associated with SSL establishment.


This places administrators in a bind: providing secure communications degrades performance, and buying more general-purpose servers is a costly tax.
A class of products called SSL accelerators is used to off-load the server. Using special-purpose hardware, they handle all the brute-force computation that encryption needs. When combined with load-distribution devices (enabling all end-user traffic to flow through the SSL accelerator regardless of the ultimate server destination), they can also reuse a session key across all of the end users' sessions. This greatly decreases the computation load and simplifies digital certificate management.

There are two related types of load distribution: local load distribution, which shares load across servers in a single server farm, and geographic load distribution, which uses the end user's location to optimize server farm selection. Both types often contain extra functions, such as SSL acceleration, attack handling, and aggregation of many hundreds of incoming connections into far fewer server connections to decrease the servers' connection-handling workload.

Local Load Distribution


In a tiered server farm, the first tier has several candidate servers available for the incoming connection. The goal of local load distribution is to balance the load across all members of the tier so that bottlenecks can be avoided, and transaction throughputs maximized, by utilizing available capacity efficiently across servers.
At each tier there must be a means of making a sound selection of the best server to use at that moment. Load-balancing or content switches are designed to forward transactions to the server best suited to take on the next incremental request for services.
There are a variety of local load-balancing techniques, implemented either in dedicated network hardware or in software. The simplest techniques allocate incoming connections to the next available server using a scheduling algorithm (round robin, for example). More sophisticated load-balancing techniques can depend on both demand-side (end-user request) criteria and supply-side (server status) criteria.
To handle demand-side criteria, the load distributor inspects the entire HTTP request and
uses that information in its selection decision. By extracting the URL and the cookie, the
load balancer has information about the content and the user requesting it. There are many
business situations where it is advantageous to treat your customers differently. Those who
do large amounts of business might get preferential treatment, or content may be
customized for each key customer. Some content, such as URLs associated with purchasing
products or services, can be treated as higher priority than those for customers browsing a
catalog, for example.

The state of the server infrastructure is an example of a supply-side criterion that can influence the selection process. The load distributor uses information about server loads, access controls, application or content availability, and priority to find the best server at that moment.
Dynamic server supply-side selection strategies are based upon criteria such as the following:

• Determining the server with the lowest number of active connections
• Selecting a server with the fastest response at that moment
• Using an algorithm to predict the best server
• Using a ratio of incoming requests for each server

Some load distributors periodically execute a set of scripts that check the health of the servers. For example, they can request a web page and test to see if it was correctly presented. This can be done at the same time that they're timing the server's response speed.
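The following sketch combines the supply-side criteria above with such a health check. It is illustrative only, not a vendor implementation: the server addresses, connection counts, and health-check path are hypothetical, and a real load distributor would track connection counts itself rather than being handed them.

import time
import urllib.request

# Hypothetical server pool; the addresses, connection counts, and health page are examples.
SERVERS = {
    "10.0.0.11": {"active_connections": 42},
    "10.0.0.12": {"active_connections": 17},
    "10.0.0.13": {"active_connections": 58},
}

def probe(server_ip, path="/health.html", timeout=2.0):
    """Fetch a test page and time the response, as a health-check script might."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(f"http://{server_ip}{path}", timeout=timeout) as resp:
            healthy = resp.status == 200 and b"</html>" in resp.read().lower()
    except OSError:
        return {"healthy": False, "response_time": None}
    return {"healthy": healthy, "response_time": time.monotonic() - start}

def choose_server(servers):
    """Pick the healthy server with the fewest active connections (a supply-side criterion)."""
    candidates = []
    for ip, stats in servers.items():
        result = probe(ip)
        if result["healthy"]:
            candidates.append((stats["active_connections"], result["response_time"], ip))
    if not candidates:
        raise RuntimeError("no healthy servers available")
    return min(candidates)[2]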
Session persistence is important to transactions that require multiple requests. A customer making a purchase needs to enter information, such as an address, a credit card number, and shipping instructions. These types of services are stateful: some context is needed between requests to maintain coherence and associate the requests.
When servers share a common repository for state information, switching a request to any server is acceptable. However, most applications maintain their state independently, each in a specific server. In those cases, sending requests associated with a single transaction to different servers can cause the transaction to fail.
Session persistence is the capacity to associate a set of requests so that they are directed to the same server. A few examples of the persistence options used by load distributors illustrate the capabilities (a small sketch of the cookie technique follows the list):

• Source persistence: The load distributor remembers the address of the end user and the identity of the server that was assigned on the first request. Further requests from that user are directed to the original server. However, many end users are on the other side of a firewall or a Network Address Translation (NAT) system, which can reassign addresses frequently, limiting the utility of this technique.

• Cookie manipulation: The load distributor can create or manipulate a cookie passed between the server and the end user. The cookie is used to store the identity of the server used on the first request. Note this limitation: many end users set their browsers to refuse cookies.

• SSL session ID persistence: For secure sessions using SSL, the load distributor can use the SSL protocol's session ID, which is unique to a particular end user, to identify that end user and the assigned server.
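As a rough sketch of the cookie-manipulation option, the fragment below assigns a server on the first request and then honors the cookie on later requests. The pool names and cookie name are hypothetical, and a production load distributor would also handle cookie expiry and servers leaving the pool.

import http.cookies
import itertools

# Hypothetical pool of first-tier servers; the names are illustrative only.
POOL = ["app-1.example.internal", "app-2.example.internal"]
PERSISTENCE_COOKIE = "lb_server"          # cookie name chosen for this sketch
_round_robin = itertools.cycle(POOL)      # fallback assignment for first-time users

def pick_server(request_headers):
    """Return (server, set_cookie_header); reuse the server named in the cookie if present."""
    jar = http.cookies.SimpleCookie(request_headers.get("Cookie", ""))
    if PERSISTENCE_COOKIE in jar and jar[PERSISTENCE_COOKIE].value in POOL:
        return jar[PERSISTENCE_COOKIE].value, None       # sticky: same server as last time
    server = next(_round_robin)                          # first request: assign a server
    out = http.cookies.SimpleCookie()
    out[PERSISTENCE_COOKIE] = server
    return server, out.output(header="Set-Cookie:")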

Geographic Load Distribution


Organizations with distributed centers of operations need to consider geographic load distribution as a strategy for optimizing performance based on location. Geographic load distribution determines which users in which locations are connected to which of the geographically dispersed server farms.
Multiple server farms offer higher availability by removing the threat of a single point of
failure. There are other advantages to distribution, such as the following:

• Performance is improved by getting users close to the desired content on the network.
• Users can be identified by country and receive content in the specified language.

The distributed sites will show higher performance in the aggregate if traffic is distributed intelligently, so this strategy works best if all sites are (approximately) equally loaded. Having one center under-utilized while another is congested wastes resources at each location. Intelligent geographic distribution decisions must also incorporate persistence: the user must be directed to a single site, at least for the duration of a transaction.
Geographic distribution decisions are made using the same criteria as for local load distribution, with additional input about the location of the end user. In some cases, the Internet address of the end user or of the end user's DNS server can be matched against a table of Internet addresses to determine probable locations. In other cases, all the server farms attempt return contact with the end user, and the server farm with the fastest access is assigned to that end user.
Content distribution network (CDN) switching is an interesting feature that enables a web site to use public content distribution networks for extended geographic reach and to handle traffic surges. As the web site's data centers reach capacity, a public CDN is used to handle the overflow until traffic levels fall. Public CDNs are available with usage-based pricing, enabling the web site owners to control costs as well.

Caching
Caches are special-purpose appliances that hide network and server latency by quickly delivering frequently used content. A cache delivers objects faster than a server can, once the objects are in the cache.
The cache is used in conjunction with an interception switch, which intercepts all traffic designated for particular services, such as the web service on TCP port 80, usually without regard for the ultimate destination. The cache looks to see if it already has that object in its storage; if so, it provides that object to the requester much faster than if the object had to be fetched from the server. Objects can be stored in cache explicitly, by being preloaded, or they are stored when the cache sees an object requested by an end user that it hasn't seen before. In that case, the cache performs the fetch from the web server on behalf of the end user, and it then stores the object in the cache memory for the next retrieval (see Figure 9-2).

Figure 9-2  Caching (the interception switch either passes a request end-to-end to the web server holding the desired object or answers it from the cache's copy of that object)

The cache must be sensitive to the aging characteristics of each object. Some objects, such as company logos, may never change. Other content, such as current stock prices, will change constantly. Caching loses its value when it delivers expired (stale) content. Objects often are delivered with caching headers from their origin servers; those headers tell the cache how long it can store the object before it expires. If there isn't a caching header, or if the item is probably unique and will never be requested again (for example, a URL with an embedded query string, or a dynamically generated file with a URL ending in .jsp, or an encrypted object), the cache will simply ignore the object and not cache it.
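A minimal sketch of that decision logic follows. It checks only the cases named above (query strings, .jsp URLs, and the Cache-Control header); a real cache follows the full HTTP caching rules, and the header parsing here is deliberately simplified.

import re
from urllib.parse import urlparse

def cache_lifetime(url, response_headers):
    """Return the number of seconds an object may be cached, or 0 if it should be ignored."""
    parsed = urlparse(url)
    if parsed.query:                          # embedded query string: probably unique
        return 0
    if parsed.path.endswith(".jsp"):          # dynamically generated page
        return 0
    cache_control = response_headers.get("Cache-Control", "")
    if "no-store" in cache_control or "no-cache" in cache_control:
        return 0
    match = re.search(r"max-age=(\d+)", cache_control)
    if match:
        return int(match.group(1))            # the caching header says how long to keep it
    return 0                                  # no caching header: do not cache the object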
Placing a cache in front of a server hides much of the server delay after objects are in the
cache. Preloading those objects in the cache can be easily controlled by the server
administration, if it owns both the servers and the server-side cache.
The cache can also be placed close to the client so that network latencies are eliminated as well. In those cases, the client-side caches are probably owned by the end user's ISP and will depend on caching headers for information about expiration.
All browsers also contain caches; this is readily apparent when an end user navigates by
using the Back button on the browser. Note that a lot of the end-user activity may be
concealed from server management tools if it comes out of cache instead of from the
original server.

Content Distribution
On the face of it, there seem to be significant business opportunities for those who can deliver high-quality, content-rich services, including detailed graphics, animation, and sound. Service providers are also attracted to the potential of these high-value and high-margin services, as they represent significant business opportunities to attract new customers while growing revenues from the current customer base.
However, the impact of content-rich traffic on the current mix of services generally results in degraded quality and access delays. Despite all the new capacity, the initial model of a centralized server distributing content across the Internet backbone simply does not stand up to the demands for content-rich service quality at the scale needed to support large numbers of customers. Raw network capacity by itself will not solve the problems of time delay and packet loss at internetwork connections. These obstacles must be dealt with through structural changes in the content-delivery system itself.
As was discussed in Chapter 8, "Managing the Application Infrastructure," time delay across a network (propagation delay) is not decreased by increasing bandwidth. It is a result of the laws of physics and the distance traveled. Shortening that distance will therefore improve end-user performance. If that shorter distance results in the data packets crossing fewer network boundaries, packet losses will also probably decrease.
Getting the content to the network edge is a good way of decreasing time delay and the number of network boundaries that are crossed. Placing multiple copies of the content at the edges, such as cable system head-ends, brings the content closer to its consumers and thereby improves service quality. It also avoids the congestion and variability on the backbone, improving delivery of high-quality services. Investment in multiple servers at the edge is also cheaper than upgrading the entire backbone. By shifting compute cycles to the edge, where network latency is low, content-delivery architectures strive to ensure that end users get the full impact of rich content.
Content servers are caches that replicate the contents of the origin server around the
network edges. The content servers can deliver high-quality video and audio streams as
well as web page objects with high service quality.
Customers must be connected to the closest content server to take advantage of their relative location. This must be done transparently because customers should not need to know which content server is closest. The content-server assignment is made by the equivalent of a geographic distribution service, as described in a preceding subsection.
The content manager supervises the flow of content from the origin server to the content servers at the edge. The content is distributed over high-speed connections to maintain fresh information at the content servers. Content managers are also usually able to force the early expiration, if necessary, of pieces of content that have been cached in the content servers.
Most content distribution involves only static, unchanging, readily cached content. However, there are techniques that enable dynamically generated web pages to be cached. These generally reduce the transmission volume of dynamic pages by identifying the specific changes and sending only those changes to the requesting end user. There is an industry standard that facilitates this process: Edge-Side Includes (ESI). ESI is a way of marking the contents of a web page to tell the content distribution system which pieces of the page are unchanging, what their cache expiration times are, and how to do some simple processing to select one of a number of web page fragments for inclusion in a web page to be delivered to an end user.
Content distribution is available by assembling content servers and content managers from
components sold by cache vendors or by subscribing to a content distribution network
service, such as those furnished by Akamai, Speedera, and Mirror Image.

Instrumentation of the Server Infrastructure


Individual servers are instrumented by their manufacturers or by other companies that build an agent for a specific type of server. As with other managed elements, the element-centric instrumentation provides insight on the following:

• Current behavior based upon CPU load, memory usage, network, and disc activity, for instance
• Usage details, such as the number of users, threads, or processes
• Environmental monitoring of temperature, power, and enclosure integrity

However, in a service-delivery infrastructure, such instrumentation of individual components is usually useful only after troubleshooting has narrowed the origin of a problem to a specific component. Instrumenting server tiers requires a different perspective. Each tier is handling a subset of the total transaction, and behavioral measurements must take that into account. For example, synthetic (virtual) transactions, or parts of them, can be run against every tier to understand its response.
Because the transaction is composed of a set of determinate end-user steps (which must each succeed, fail, or time out in a given sequence), the measurements must also be decomposed accordingly. Administrators must determine the acceptable delay thresholds for each measured step in the transaction. Synthetic transactions for each transaction step can be created and modified as applications and infrastructures change.
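A bare-bones sketch of stepwise measurement against thresholds is shown below. The step names, the thresholds, and the idea that each step is a callable returning success or failure are all assumptions made for illustration.

import time

# Hypothetical per-step delay thresholds, in seconds, set by administrators.
STEP_THRESHOLDS = {"home_page": 1.0, "search": 2.0, "checkout": 3.0}

def run_step(name, step_fn):
    """Run one synthetic transaction step and compare its delay against its threshold."""
    start = time.monotonic()
    ok = step_fn()                       # each step returns True on success, False otherwise
    elapsed = time.monotonic() - start
    return {"step": name, "succeeded": ok, "elapsed": elapsed,
            "within_threshold": ok and elapsed <= STEP_THRESHOLDS[name]}

def run_transaction(steps):
    """Run the steps in sequence, stopping at the first failure as an end user would."""
    results = []
    for name, fn in steps:
        result = run_step(name, fn)
        results.append(result)
        if not result["succeeded"]:
            break
    return results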
Partitioning of synthetic transactions into their component parts can help decrease the load on the server systems. There's little incremental benefit in all transactions testing the same component in the same way. When the common component fails, all the synthetic transactions fail simultaneously and create redundant artifacts that the management system must screen. At the same time, it's not always practical to segment every section of every synthetic transaction; some segments, such as a database connection that needs a session assigned in the application server before it can be exercised, cannot be operated independently.
Collectors run the synthetic transactions and measure the tier response against established thresholds. Usually the overall performance of the tier is measured, and no further steps are needed as long as it is acceptable. Other measurements become important when the transaction completion time for a tier begins to approach the threshold for unacceptable performance. The measurement intervals must be selected to balance the needed measurement granularity against the additional resources needed to process the synthetic transaction. Web architectures typically can absorb a relatively frequent sampling by synthetic transactions (say, one HTTP request per minute) without much difficulty.
By partitioning the measures of transaction performance across each successive tier, administrators can more quickly isolate a problem when performance is degrading toward a service disruption. Using synthetic transactions for every tier quickly identifies the tier likely causing the problem. After a tier is identified as a likely source of the performance problem, element managers are used to pinpoint the specific elements and conditions causing it.
For example, end-user synthetic transactions might all be indicating a response-time problem with a certain step in a transaction. If all the end-user synthetic transactions are seeing the same problem, it's probably not related to the location of the end users, but it could be in the server farm or in the server farm's access to the Internet. If a synthetic transaction collector located within the server farm also sees the problem, the problem is inside the server farm, not with the Internet access link. A collector performing or monitoring database retrievals of the type used by the transaction step would then help operators see if slow database retrieval was a cause of the problem.
Measuring element performance and building baselines helps prevent problems when the inevitable component changes occur. The historical baselines become trip wires: thresholds that indicate whether the changes have actually improved the component performance or introduced additional delays. I recently visited a large online retailer that uses these strategies. They described a recent incident where a software change caused the application to issue three identical database queries each time data was needed. For whatever reason, the change slid through quality assurance and was placed into production. The database activity baseline immediately indicated a sudden abnormal jump in query volume. Administrators quickly determined the changed application was the culprit and rolled back to an earlier version, before customers noticed degradation in service.
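A baseline trip wire of that kind can be as simple as flagging any sample that sits far outside the historical distribution. The sketch below is one possible formulation; the tolerance of three standard deviations and the query-volume numbers are illustrative, not taken from the retailer's system.

import statistics

def exceeds_baseline(history, current, tolerance=3.0):
    """Flag a metric (for example, database queries per minute) that jumps well above
    its historical baseline; 'tolerance' is expressed in standard deviations."""
    mean = statistics.fmean(history)
    spread = statistics.pstdev(history) or 1e-9    # avoid dividing by zero on a flat history
    return (current - mean) / spread > tolerance

# Example: a steady baseline of roughly 1,000 queries per minute, then the tripled-query defect.
baseline = [990, 1010, 1005, 998, 1002, 995, 1008, 1001]
print(exceeds_baseline(baseline, 3000))            # True: time to roll back the change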
Each of the three server architecture components discussed in the first part of this chapter (load distribution, caching, and content distribution) has its own instrumentation characteristics that must be considered when building an integrated instrumentation system. Therefore, these are discussed in the next three subsections.

Load Distribution Instrumentation


The sellers of load distribution devices have included sophisticated management and
instrumentation capabilities in their systems. F5 Networks, for example, has developed a
network manager to monitor the status and performance of F5 load distribution devices.
The F5 devices report to the manager using either Simple Network Management Protocol
(SNMP) or XML. Available information includes workload volumes, the number of discarded connections, the number of times particular servers have been chosen, and more.
The load-distribution system can then be tuned to handle performance situations as they
occur. Detailed usage information can also be extracted for accurate billing and resource
forecasting. Through a Web services XML interface, F5 network devices can be integrated
directly with any third-party application.

Cache Instrumentation
It's important for instrumentation design to consider the fact that caches absorb incoming requests from end users. Phantom objects (page bugs), discussed in Chapter 8, can be used to count web page downloads even when the entire page is cached. The phantom object's file is simply marked with a cache header as uncacheable, or it is given an attribute, such as a query string, that cache agents will avoid. That phantom object will always be fetched from the origin server, even when the entire rest of the page is fetched from cache. The relatively slow delivery of the phantom object won't interfere with the end user's perception of page performance, as it's usually an invisible, one-pixel object.
If server-side caches are used, the cache's performance data is available for analysis. Information on cache hits and misses can be used to compute the bandwidth savings resulting from the cache. (For browser-side caches, such computations can also be made, but they're slightly more complex. The number of page views must be combined with knowledge of how many page elements were actually fetched from the server and compared to the number of elements designed into the page.) In any case, end-user measurements are needed to see the impact of caching on end-user performance.

Content Distribution Instrumentation


A combination of active and passive measures is needed to monitor and manage the
behavior of the content-delivery infrastructure. Active collectors must be distributed to
match the content-server distribution so that they can provide representative measurements
for the cluster of customers using each server.
Active measurements use a series of synthetic transactions to measure the performance of the content delivery infrastructure. Measuring the performance of object fetches evaluates the speed of the network and the efficiency of the caches. The measurements also depend upon the objects; for example, only the delay is usually important when fetching a web page. If the object is a content stream, jitter and packet loss are also important measurements.
The content manager takes advantage of passive information from the content servers and caches. The content servers provide information on the most popular content. This is used to balance content across multiple servers, to identify content that should be preloaded, and to increase the replication of popular content. The content manager can schedule updates as needed from the origin server.

Other metrics for the content-delivery infrastructure can be derived as well. One useful measurement would assess the real impact of the content-delivery infrastructure on the backbone. Bandwidth gain is a metric that compares the total content delivered to consumers to the backbone bandwidth needed for preloading objects, refreshing content, and accessing the server when an object is not resident in the cache. For example, the bandwidth gain is 5 if a cache is delivering 100 Mbps of content while using 20 Mbps on the backbone for cache overhead. This shows the benefit of content servers on the edge versus upgrading the backbone for a centralized approach. The bandwidth gain becomes even more significant in the aggregate view. If there are 25 content servers at the edge, each with a bandwidth gain of 5, the backbone impact is very clear.
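The arithmetic behind the metric is simple enough to show directly; the figures below just restate the example from the text.

def bandwidth_gain(delivered_mbps, backbone_overhead_mbps):
    """Bandwidth gain: content delivered to consumers divided by the backbone bandwidth
    used for preloading, refreshing, and fetching objects that are not in the cache."""
    return delivered_mbps / backbone_overhead_mbps

print(bandwidth_gain(100, 20))     # 5.0, the single-cache example from the text
# Aggregate view: 25 edge servers, each delivering 100 Mbps against 20 Mbps of overhead,
# deliver 2500 Mbps to end users while drawing only 500 Mbps from the backbone.
print(25 * 100, 25 * 20)           # 2500 500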
Content delivery networks can supply detailed information, including not just workload
volumes but also information about the geographic location of your end users and the
particular web pages or other content that they request.

Summary
There are a growing number of options for boosting server performance and availability.
Global load distribution and a tiered architecture provide high levels of availability with
two levels of redundancy: multiple sites and multiple servers within each site.
Individual servers must be optimized as they are assigned to tiers. As transactions flow through the tiers, they can be load-balanced by content switches or accelerated with new front-end processors.
Content delivery infrastructures speed the delivery of content, opening new opportunities
for service providers and their customers. Providers have new, high-margin services to
offer, while customers have new applications that save money, increase competitive
advantage, and strengthen their Internet presence. The traditional use of a centralized origin
server is being replaced by a set of content servers at the edge. That enables customers to
get high-quality, content-rich services, while providers avoid moving large volumes of
time-sensitive content across their backbones.
The distribution of servers across the network means that managing server instrumentation is becoming a critical skill. Load distribution, caching, and content distribution networks greatly affect both performance and the management information that is available to system administrators. (A lot of the end-user activity may be concealed from the original server's management tools if it comes out of cache or content-distribution networks instead of from the original server.)
These services also need to be managed to ensure that they provide a return on investment.
Each type of service provides element-level management data, but synthetic transactions
from many locations in the system are important for Service Level Agreements (SLAs) and
for handling performance issues.


CHAPTER 10

Managing the Transport Infrastructure
The transport infrastructure is the product of many interconnected network services. The customer's enterprise network, along with multiple Internet Service Providers (ISPs), hosting providers, and multiple business partners and supplier networks, comprises the transport infrastructure. IT and business managers must find ways of reducing costs by leveraging the public infrastructure while preserving, or even improving, customer service levels. Adequate bandwidth and the appropriate traffic priorities should be applied to the service mix, and service quality must be consistent even while the set of interconnected networking services changes.
End-to-end management is necessary so that all of the supporting individual networking services are at least monitored and held accountable for their contributions to overall service quality. In some cases, particularly situations in which traffic travels over the public Internet, direct service quality control and prioritization of all segments of the transport service may not be possible. Those situations call for a combination of approaches. In those parts of the network where it's possible, transport service quality can be controlled by using the approaches outlined in this chapter. In other parts of the network, strict Service Level Agreements (SLAs) and measurement, combined with appropriate selection of underlying transport services, can be used to assure that end-to-end transport quality is acceptable.
Monitoring of individual networking services provides measures that help select transport services and enforce SLAs; the revenues of network service providers are therefore partially determined by the service quality they provide. These measures can also help isolate a particular network environment when performance problems are indicated. Isolating the specific networking service speeds problem resolution and ensures that only the appropriate parties need to be involved in resolving the situation.
The ultimate goal is to design a network management approach that maintains high-quality service flows while being oriented toward solutions that use resources effectively and economically.
In this chapter, the following topics are covered:

• The low-level technical quality metrics that apply to transport services
• The control and measurement of transport service quality when the traffic flows among separate organizations
• An introduction to Quality of Service (QoS) technology, which is used to control transport service quality


Technical Quality Metrics for Transport Services


There are common low-level technical quality metrics that can be applied to network
infrastructures to paint a picture of overall service quality. They are as follows:

• Workload and required bandwidth
• Availability and packet loss
• One-way latency
• Round-trip latency
• Jitter

These were first introduced in Chapter 2, "Service Level Management," and they are expanded upon here.

Workload and Bandwidth


Some services require a guaranteed, unchanging amount of bandwidth to function properly; for those situations, bandwidth guarantees are necessary. Others simply require a certain amount of bandwidth averaged over a long interval. In either case, the bandwidth required is a function of the workload being applied to the transport system, and it is usually measured in terms of bytes per second, typically over specified intervals with median and 95th percentile values. Even the simplest transport devices provide basic counts of the number of bytes into and out of each device interface. In most cases, packet or frame counts are also provided.
Data transport providers sell bandwidth in terms of similar measures. Bandwidth is often provided in terms of a guaranteed rate along with a maximum burst rate above that guaranteed rate. For example, a Frame Relay circuit has a Committed Information Rate (CIR); data in excess of that rate is tagged as being discard eligible and may be discarded without notice by the network. Asynchronous Transfer Mode (ATM) has a sustainable cell rate, which is the average bandwidth over a long period, and a peak cell rate, which is the maximum bandwidth allowed over the period defined by the maximum burst size. ATM can provide both steady bandwidth guarantees (constant bit rate) and average bandwidth guarantees (for example, variable bit rate); other technologies have similar services.
ISPs usually bill according to either the total number of bytes transmitted in a month or a more complex formula that looks at peak usage.
When billing by the total number of bytes, the ISP uses the monthly byte count produced by the router connecting to the subscriber. If the count goes above the agreed-on number of gigabytes, there's an additional charge.
Billing by peak usage works as follows: At each five-minute interval in the entire month, the ISP measures both input and output bandwidth in bytes/second. The higher value is recorded as the peak usage for each five-minute interval. At the end of the month, the 95th percentile value of all of those measurements is used as the basis for billing. An effect of this is that the top five percent of the five-minute samples are ignored; you can therefore burst up to the maximum bandwidth of your access line for up to five percent of the month without any additional cost.
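A sketch of that computation follows. Providers differ slightly in how they index the 95th percentile; the convention used here (discarding the top five percent of samples and billing the highest remaining one) is just one reasonable interpretation.

def billable_rate(five_minute_peaks_bps):
    """Return the 95th-percentile sample from a month of five-minute peak readings.
    Each reading is the larger of the input and output rates for that interval."""
    ordered = sorted(five_minute_peaks_bps)
    cutoff = int(0.95 * len(ordered)) - 1          # index of the highest non-ignored sample
    return ordered[max(cutoff, 0)]

# A 30-day month has 8640 five-minute intervals; the top 432 samples are ignored.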
In both cases (total number of bytes transmitted or peak usage), the workload measured is
not precisely the same as the workload or bandwidth as seen by the application. If errors
interfere with data packets, and those packets are therefore retransmitted, the low-level
workload metrics usually count those packets again. The paradoxical result is that as link
quality deteriorates, the byte count carried by that link rises!

Availability and Packet Loss


Availability on a communications link is usually represented as the percentage of time that the link is electrically operating and can carry traffic at better than a specified error rate. For example, a digital link that is powered and that provides a clocking signal is available to carry traffic, and the error rate over that link is described in terms such as error-free seconds. Because errors on digital links normally occur in severe bursts, the use of error-free seconds as a measure of link availability is understandable. During the period when errors are occurring, the link is probably unusable; at other times, there are probably no errors at all. In some cases, a supplier will specify that any period of unavailability less than, for example, one hour does not count in the availability metrics.
Note that a link with a moderate, steady error rate may interfere with so many data packets that the link's effective throughput is quite small. (Effective throughput is the throughput of a link after retransmissions are taken into account. It varies according to the link's error rate and the particular error-recovery protocols in use on that link.) A simple measure of availability or packet loss on the link would show only the moderate error rate, while a more sophisticated measure of the impact of that error rate on a particular application might show that the link was, for all practical purposes, unavailable.
Multimedia streaming is designed to tolerate limited noise resulting from packet loss because an interruption (for rebuffering) while you experience multimedia is worse than a small anomaly introduced by dropping a packet in the audio/video stream. Because interruptions of audio are more objectionable to end users than short freezes in video, most multimedia streaming servers attempt to give the audio portion of the signal preference over the video portion if effective throughput is restricted.
Packet loss for transactions does not cause transaction failure because transaction protocols
automatically retransmit as necessary to ensure error-free completion. Any packet losses
merely add delay to the total transaction completion time.
To avoid the necessity of specifying the precise error-recovery protocols used on links and their impact on perceived packet loss and effective throughput, thereby creating a very complex measure, organizations can specify error rates in terms of block error rate (or packet error rate, or ATM's cell error ratio, and so on). After all, it doesn't matter if there are one or more errors within a particular data block; any error at all will require retransmission of the entire block if perfect transmission is required. The usual exception is streaming media, which can accept low error rates without major impact on the application; in those cases, simple bit error rates may be sufficient.
Packet loss over an Internet connection may be defined very coarsely in terms of ping ratios. For that measure, short ping packets are transmitted in a burst and are immediately echoed back by the destination. The percentage of packets that return within a defined, short time window is taken as the success ratio. This is an extremely coarse approximation because Internet paths are variable, and error rates can fluctuate greatly because of intersecting traffic flows. The effect of errors on TCP communications is also quite complex, as TCP is very sensitive to the particular pattern of errors over time. Therefore, a brief burst of ping packets may not be representative of the effective performance of the connection as seen by TCP. Block-oriented error ratios for the underlying transport, such as ATM's cell error ratio or similar measures for Frame Relay, are therefore preferable. They're not perfect for TCP, but they're better than ping ratios.

One-Way Latency
Services in the interactive classes may require one-way delay measurements. Routing
protocols often select different paths in each direction between a pair of nodes, so the
latency in the two directions can be considerably different. (Each ISP typically tries to hand
a packet to another ISP as soon as possible, thereby decreasing the distance it must carry
the packet.) See Figure 10-1.
Figure 10-1  Internet Latencies and Asymmetric Routes

The challenge is getting the time measurement between two unsynchronized hosts. The
Network Time Protocol (NTP) can be used; there are commercial variants that offer similar
functionality. A more expensive, and more accurate, approach is using Global Positioning
System (GPS) receivers to measure time and synchronize the clocks at each site.


Round-Trip Latency
Round-trip latency is the common metric for transactional and (some) interactive services.
Synchronizing clocks is not an issue because the initiator gets a response and can easily
determine the elapsed time, which is the metric of interest.
Round-trip latency on the Web is easily measured by using the time elapsed between the first two steps in the establishment of a TCP connection. (A SYN packet is sent out, and a SYN ACK packet is returned.) The turnaround time at the destination is minimal, as it does not involve the destination application. Most active measurement collectors provide that measurement, labeling it initial connection time or TCP connection time. It can also be obtained from ping packets.
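Because a client's connect() call completes once the handshake finishes, a rough version of this measurement needs only a few lines. This is a sketch for illustration; the host name is a placeholder, and dedicated collectors time the SYN/SYN ACK exchange more precisely from packet captures.

import socket
import time

def tcp_connect_time(host, port=80, timeout=5.0):
    """Approximate round-trip latency as the time needed to complete the TCP handshake."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        elapsed = time.monotonic() - start
    return elapsed

print(f"initial connection time: {tcp_connect_time('www.example.com'):.3f} s")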

Jitter
Jitter is the variance in packet arrival times. It can have a serious impact on service quality, although buffering in the receiving host helps considerably. Anyone with a CD in his or her car appreciates the value of buffering and knows its limitations when there are too many bumps in the road. The real problem is the additional delay added by the dejitter buffer; it's usually one or two times the expected jitter. For interactive use of the network, such as voice communications, ITU-T's standard G.114 suggests a maximum one-way latency of 150 milliseconds (ms). Therefore, a dejitter buffer of, say, 50 ms would form a large part of the latency budget.
Jitter is measured by tracking the arrival time of each successive packet and calculating the variance between them, assuming a transmitter is introducing traffic into the network at a constant rate. Some commercial tools enhance this measure with additional analytics; for example, the NetIQ product calculates the distribution of the jitter measures to provide more insight into the range of underlying performance.
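One simple way to compute such a measure from arrival timestamps is sketched below, assuming the sender transmits at a known constant interval. The numbers are invented, and real tools (and RFC 3550's smoothed interarrival jitter) use more refined formulas.

import statistics

def jitter_stats(arrival_times, send_interval):
    """Return (mean deviation from the expected gap, standard deviation of the gaps)
    for packets that were sent at a constant rate of one per 'send_interval' seconds."""
    gaps = [later - earlier for earlier, later in zip(arrival_times, arrival_times[1:])]
    mean_deviation = statistics.fmean(abs(gap - send_interval) for gap in gaps)
    return mean_deviation, statistics.pstdev(gaps)

# Packets sent every 20 ms; arrival times drift because of queuing along the path.
arrivals = [0.000, 0.021, 0.039, 0.062, 0.080, 0.104]
print(jitter_stats(arrivals, 0.020))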

QoS Technologies
QoS technologies classify network traffic and then ensure that some of that traffic receives special handling. The special handling may include attempts to provide improved availability, error rates, latency, and jitter.
However, because of the perceived complexity of QoS, many organizations choose to implement service quality differentiation through the use of separate facilities (isolation) instead of through QoS technology. For example, a separate LAN can be built to handle Voice over IP (VoIP), thereby isolating it from delays caused by large-scale file transfers on the data LAN. The QoS alternative of using frame tagging on the LAN may appear to be too complicated. In some cases, the organization may simply over-provision the transport facilities massively and hope that bandwidth constriction will never occur. In addition, signaling between ISPs for QoS is generally not done, so use of QoS technologies across the public Internet is impractical.
Even when it's completely implemented, QoS does not necessarily guarantee particular performance. Performance guarantees can be quite difficult and expensive to provide in packet-switched networks, and most applications and users can be satisfied with less stringent promises, such as prioritization only, without delay guarantees.
For the stated reasons, QoS technology is primarily used in private networks and has not yet achieved the widespread use that was predicted some years ago. Nevertheless, with the increasing deployment of VoIP and other latency-sensitive applications, interest in QoS is growing.
This section of the chapter discusses the major QoS technologies. The QoS technologies are placed into two groups: tag-based QoS, which relies on identification tags placed into data frames and used by network switches; and traffic-shaping QoS, which tries to manage bandwidth allocations through queuing or rate-shaping at a single point instead of through the active cooperation of all network elements and explicit tagging.
This section of the chapter also discusses the alternative to the major QoS technologies. This
alternative is called over-provisioning, also known as design by hope.

Tag-Based QoS
Networks forward traffic through routers, switches, and access devices. The transport infrastructures must make forwarding decisions based on the required treatment for each traffic flow.
In tag-based QoS, traffic is initially classified by having the appropriate forwarding information added to each packet. Desktops, other customer input devices, and network devices can classify and mark the packets, possibly relying on a central database of authorizations. Traffic can be identified by end user, protocol, and application at the network entry point. Then the classifier can decide whether to admit the traffic to the network and, if admitted, which classification tag to place into the data packet headers. Switches, routers, and other network devices in the core of the network then examine the tags to determine how to handle the traffic.
There are different types of QoS technologies that use classification and tagging. This subsection describes IEEE 802 LAN QoS, IP Type of Service (TOS), IP Differentiated Services (DiffServ), Multiprotocol Label Switching (MPLS), and Resource Reservation Protocol (RSVP) as examples; other technologies are also available.

IEEE 802 LAN QoS


The Institute of Electrical and Electronics Engineers (IEEE) has developed the 802 family of LAN standards. Forwarding at the data link layer within a LAN infrastructure is controlled using the IEEE 802.1D specification, which provides a three-bit field (the 802.1p field) for priority. There are therefore eight possible non-overlapping priorities. There is also a field (the 802.1Q field) that enables the identification of up to 4095 virtual LANs (VLANs). VLAN traffic may also receive differentiated treatment.

IP TOS
The TOS byte has been part of the IP header from the earliest specification. It provides three bits that can be used to differentiate priority levels. Routers can examine these bits to set queuing priorities and to select among routing options. It's also possible to set router filters to examine other parts of the packet header (for example, the protocol type or the origin/destination address pair) when choosing a particular forwarding priority.
Some administrators use TOS fields and filtering to provide very coarse prioritization within the router queues. At most, one or two classes of traffic, such as certain transactions, are given priority over other traffic in the queues. No strict QoS guarantees are made, and there is no attempt to influence routing decisions at all.

IP DiffServ
DiffServ technology can provide both performance guarantees and performance prioritization. It does that by using the TOS byte (renamed the DS byte) in the IP header to indicate the QoS class. All the information that the router needs to handle the packet is contained in the packet header, so routers don't need to learn or store information about individual traffic flows. The disadvantage of DiffServ is that flows must be handled as part of larger groups; it's not possible to single out a particular flow for special handling, independent of other flows. Instead, it must be grouped with many other flows for its trip through all the routers, and it will receive the same handling as all of the other flows in its group.
Aggregation into a small set of classes simplifies the management of large numbers of flows, improving scalability for large backbones. Each router, for instance, interprets the DS byte and follows the associated forwarding behavior.
Devices at the network boundary may be used to set the DS byte according to current
resource allocation policies. They can map between the DS byte and IEEE 802 QoS tags, for
example.
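A sending host (or a boundary device terminating the connection) can set the DS byte on its own sockets. The fragment below is a sketch for platforms that expose the IP_TOS socket option; the DSCP value chosen is only an example, and routers along the path are free to ignore or rewrite the marking.

import socket

# DSCP value 46 ("Expedited Forwarding") occupies the upper six bits of the TOS/DS byte.
EF_TOS = 46 << 2

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, EF_TOS)   # mark this connection's packets
sock.connect(("www.example.com", 80))                       # placeholder destination
sock.close()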

MPLS
A classical router processes each packet by doing the following:

• Decomposing the IP header
• Looking up the destination address in a forwarding table
• Checking whether the packet should be discarded for being too old (the time to live has expired)
• Determining the next hop
• Adjusting header fields
• Repackaging the IP packet for transit
• Queuing the IP packet for forwarding

The router repeats these steps even if the next arriving packet belongs to the same flow. This approach becomes a bottleneck with higher flow volumes and faster trunk speeds. MPLS is intended to overcome the limitations of classical routers in backbones with tens of thousands of flows.
MPLS adds a tag to each packet; each tag is associated with a predefined routing and handling strategy. Each router simply reads the tag, using it to identify the next hop and forwarding policies to use. Processing time is shortened to a table lookup, and traffic is forwarded quickly through the MPLS domain.
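The contrast with the per-packet steps listed above can be shown with a toy label table. Everything in it (labels, next hops, queue classes) is invented for illustration; a real label-switching router also swaps labels within label stacks and manages label distribution.

# Hypothetical label-forwarding table: incoming label -> next hop, outgoing label, queue class.
LABEL_TABLE = {
    100: {"next_hop": "10.1.1.2", "out_label": 220, "queue": "low-latency"},
    101: {"next_hop": "10.1.2.2", "out_label": 330, "queue": "best-effort"},
}

def forward(packet):
    """One dictionary lookup replaces the full header processing of a classical router."""
    entry = LABEL_TABLE[packet["label"]]
    packet["label"] = entry["out_label"]          # swap the label for the next hop
    return entry["next_hop"], entry["queue"]

next_hop, queue = forward({"label": 100, "payload": b"..."})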
MPLS was originally designed for the dense Internet core where high volumes must be routed with no delays. Administrators define the sets of routes and treatments associated with each label and distribute the information to the core routers. Edge devices are also configured to identify incoming service flows and append the appropriate tag or label. Administrators can take advantage of this strategy to build static routes for traffic with time constraints, using routes with fewer hops and higher-speed trunks, for instance. They can also choose to allow dynamic routing where the routers exchange reachability information and make adjustments on their own.

RSVP
RSVP is a common mechanism for reserving bandwidth across a single network infrastructure. A receiver initiates a reservation request for a requested flow. The request is passed through the network devices, and those that are RSVP-enabled reserve the bandwidth as requested. The traffic flow, identified by its addresses and protocol type, is then given special handling when it passes through the RSVP-enabled devices that have accepted the reservation.
One of the drawbacks of RSVP is that devices that do not support it simply pass the request through. This leads to a situation where a device might not support RSVP but does not inform anyone of that fact; the data packets depart their point of origin assuming that they have a dedicated route, but arrive and find no resources reserved on their behalf. Rerouting of traffic flows, which is not uncommon in IP-based networks, may also result in the traffic flow going through routers that are temporarily unaware of the special handling that the packets should receive.
The network must also handle all service flows appropriately, allocating the resources needed to comply with the constraints of every active SLA.


Traffic-Shaping QoS
Bandwidth management is an essential function for guaranteeing service quality. Network bandwidth is shared among a competing set of service flows and must be allocated and managed effectively. The most critical services must receive sufficient resources to meet the objectives set forth in SLAs.
The tag-based approaches to QoS try to perform bandwidth management by tagging specific packets and then instructing all the network equipment to give those packets preferential treatment. Those approaches have difficulties if all the pieces of network equipment in the data flow path don't participate. Use of traffic-shaping QoS is an alternative.
In traffic-shaping QoS, a special appliance or process in a router is invoked to identify data flows (by their source and destination addresses) and sort them into different queues or otherwise manage their data rates. The appliances or processes try to change the characteristics of the traffic itself rather than trying to control the handling of packets between the connection end points. If that appliance or process is located at a key point through which all traffic flows, it can control the available bandwidths even though all the devices in the data flow's path don't participate. Traffic-shaping QoS is not always as precise as tag-based QoS approaches, but it's easier to implement.
There are two basic approaches to traffic-shaping QoS: rate control and queuing. Each is discussed in the following sections.

Taming the Selfish TCP Connection


A basic assumption when TCP was being fleshed out was that each connection was managed independently. The outgrowth of that independence was that each connection optimized its own delivery at the expense of the others sharing the network. As you no doubt know, TCP uses a credit mechanism: the receiving computer system specifies the amount of information it will accept from the sender. This, by itself, is a good thing because it prevents a receiving computer system with lower speed and limited resources from being swamped by a sender on a higher-speed network, as one example.
Imagine a sender is granted a credit of eight packets. (Credit is actually granted in bytes,
but the point is the same.) If you are the sending system, you can boost your performance
by sending those eight packets as fast as you can, banking on getting a new credit as soon
as possible and then sending more data quickly in bunches.
Because connections were all treated equally, they'd each act independently, much like the tragedy of the commons problem in classical economics. Each attempt by one connection to grab a bigger share of available bandwidth could result in congestion and packet loss for another connection, quickly degrading overall performance. Routers being hit with multiple bursts may be forced to discard some packets, leading to retransmission delays and connection timeouts. The problem compounds itself, triggering further retransmissions, until a new credit arrives and the bursts start again.
Van Jacobson of the Lawrence Berkeley Laboratories saw the problems this approach caused and proposed the slow-start approach that is the accepted behavior today. A sender doesn't expend the entire credit immediately; instead, it sends a smaller portion of it and keeps increasing the size until the feedback of a retransmission reveals the network's tolerance at that point. Then the sender stays at that level to minimize interference among all the active connections. RFC 2581 describes these algorithms, which are required for TCP implementations. This was a significant breakthrough in understanding early performance problems and mitigating them.
The approach has limitations in today's world because it doesn't recognize the relative priority of the connections; a rogue connection that doesn't adhere to the slow-start approach can still cause degradation in other connections. In addition, slow start does not offer any way of providing different bandwidth guarantees or priorities to different flows. The basic TCP mechanisms are not easily changed, given the size of the current installed base. Instead, there are approaches that attempt to neutralize the undesirable characteristics of TCP connection behavior.

Rate Control
Rate control is a QoS strategy that helps regulate a set of TCP flows with a range of forwarding needs. Rate control regulates the introduction of traffic into the transport infrastructures to minimize the interference among flows competing for the same network resources and to set relative priorities for access to scarce resources. Coordinating the behavior of a group of connections is a large departure from the basic free-for-all of the original TCP design concepts and implementations. (See the preceding sidebar for more details.)
Rate control is analogous to the air traffic control system. You have probably had the experience of having your flight delayed because of congestion at the destination airport. You don't take off until there is an opening for you at the other end, the weather clears, or other situations improve. In contrast, the typical TCP philosophy discussed in the sidebar could be characterized like this: launch all the planes and hope they land before they run out of gas or the skies get too crowded.
Packeteer started this niche in 1996 when they introduced their PacketShaper product line, based on Packeteer's patented rate control technology. A PacketShaper device is placed between the sender and receiver, typically in front of servers or at customer-provider network demarcation points. This enables it to intercept the receiver's feedback (TCP flow credits and acknowledgements) and adjust the actual TCP connection behavior by manipulating the timing of protocol acknowledgments and flow control allocations. The PacketShaper inspects the packet headers for address, protocol, and application information, classifies them, and then applies the appropriate policies as the traffic flows through it. This approach leaves the connection endpoints unchanged and unaware of actions taken by the queuing appliance.
The PacketShaper smoothes bursty traffic and thus minimizes its impact on other service flows. TCP connections can be assigned a guaranteed bit rate by the PacketShaper, a function not explicitly enabled by the TCP specifications. The assigned rate is a minimum guaranteed rate, allowing for higher rates when additional bandwidth is available. If a flow for a traffic class cannot get the required bandwidth guarantee, the connection request can be refused, or the connection can be established without a guarantee.
The PacketShaper can also block service flows from the network if desired. A discard policy blocks connection attempts and discards packets without notifying the user. The granular classification lets the PacketShaper redirect web users to an error URL that informs them of the blockage. This technique lets administrators keep unwanted flows off their networks or allow them only at specified times.

Queuing
Queuing can be used to reorganize the traffic streams passing through the queue. For example, a low-priority packet is queued and held if higher-priority traffic is waiting to be forwarded. An arriving high-priority packet is placed in the queue ahead of lower-priority packets.
In queuing-based QoS, the packets in an arriving flow are inspected and assigned to a class. All flows that are members of a class share a queue. Packets are transmitted from the queue based on relative queue priority and rules of fairness among queues to ensure that flows have enough (even if minimal) resources to continue operations.
Class-based queuing (CBQ), originally developed by Sally Floyd and others at the Lawrence Berkeley Laboratory, is an attempt to provide fair allocation of bandwidth without requiring massive amounts of processing power in the network devices. Each class of user is guaranteed a certain minimum bandwidth, and any excess bandwidth is allocated according to rules set up by the network administration. Specific implementations of CBQ have been designed with an entire hierarchy of classes, with, for example, excess capacity redistributed within each branch of the hierarchy as much as possible.
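The sketch below shows the flavor of class-based allocation: each class may send up to its guaranteed share of a service round, and leftover capacity is then offered to classes that are still backlogged. The class names and shares are invented, and real CBQ implementations maintain a class hierarchy and per-class rate estimators rather than this simple per-round budget.

from collections import deque

# Hypothetical classes and their guaranteed shares of the link (fractions of each round).
CLASSES = {"voice": 0.30, "transactions": 0.50, "bulk": 0.20}
queues = {name: deque() for name in CLASSES}

def enqueue(class_name, packet):
    queues[class_name].append(packet)

def dequeue_round(round_bytes):
    """Send one round of traffic: guaranteed shares first, then redistribute what is left."""
    sent = []
    budget = {name: int(share * round_bytes) for name, share in CLASSES.items()}
    leftover = 0
    for name, q in queues.items():
        while q and len(q[0]) <= budget[name]:     # spend this class's guaranteed share
            packet = q.popleft()
            budget[name] -= len(packet)
            sent.append((name, packet))
        leftover += budget[name]
    for name, q in queues.items():                 # offer unused capacity to backlogged classes
        while q and len(q[0]) <= leftover:
            packet = q.popleft()
            leftover -= len(packet)
            sent.append((name, packet))
    return sent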
A drawback to CBQ is that there is no fairness within a class. A burst of packets from one member of the class extends the waiting time for packets from the other flows sharing that class. This causes inconsistency in forwarding and possible quality fluctuations.
Weighted fair queuing (WFQ), which is more complex than CBQ and requires much more
processing power, can be used to provide absolute guarantees of maximum latency.


Over-provisioning and Isolated Networks


Decreasing LAN and WAN bandwidth costs have led many organizations to adopt a design
by hope approach; they use enough bandwidth and hope that the performance problems are
solved. Although it is true that over-provisioning capacity makes some management tasks
easier, a design by hope approach does not guarantee that the required service levels can
always be delivered, in no small part because it represents a gamble on capacity,
unburdened by analysis or optimization of the demands on that capacity.
There are many situations that such over-provisioning cannot solve by itself. For example, using a noncritical, bandwidth-intensive application at the wrong time of day could steal resources from other critical business services. Low-priority junk e-mail and the highest-priority real-time video traffic receive the same service. This might not be a problem in a network owned by a small group, but in most larger networks, the groups using the network for e-mail will start complaining about bearing the cost for an over-provisioned network that has to support another group's real-time speech and video traffic. Failures and disruptions will also occur from time to time and may temporarily overstress parts of the network.
Over-provisioned networks also have difculty scaling, especially if wide-area links are
involved. Any aggregation point in the network is a place where temporary congestion
might occur, resulting in unanticipated packet loss.
Nevertheless, over-provisioning is still a very popular option. Its extremely simple, and
problems are rare in most cases. If a critical application exists that needs priority over other
applications, its often easier to create a completely separate, isolated network for
that application instead of implementing true QoS technologies.

Managing Data Flows Among Organizations


Services usually run across a combination of enterprise and service provider networks. A
transport infrastructure that spans many networks and management domains must
nevertheless be tracked and managed.
To manage data flows, a network administrator needs flow-through QoS, which is true
end-to-end management that spans multiple management systems. Service quality must be
accurately controlled across all boundaries: customer to business partner, customer to
provider, and provider to provider. Flow-through QoS would ensure that the paths across a
set of ISPs would always deliver the specified service quality. The flow would change ISPs
in its path as needed to maintain the overall quality.


Levels of Control
There are macro- and micro-levels of control involved for flow-through QoS.
The micro-level is the management of internal flows within any organization or provider
network infrastructure. All the tools mentioned in the first part of this chapter can be applied
as needed. The management team for that infrastructure is responsible for managing it to
meet compliance criteria.
The macro-level entails monitoring end-to-end service quality, identifying the portion
(responsible organization) of the infrastructure that contributes to poor service quality, and
verifying that quality is restored.

Demarcation Points
Demarcation points are the boundaries between management organizations and the
resources they control. Periodic measurements can be made across the cloud between
different demarcation points. Measurement collectors at those points use active techniques
to exchange and measure traffic between themselves. The basic delay measurements are
augmented by jitter and packet loss measurements. A trend indicating degrading service
quality triggers an alert to management systems that verify the measurements and activate
the appropriate management tools to oversee the details of resolving the problem.

Diagnosis and Recovery


An external network environment can be measured only edge-to-edge, with further
examination being left to that particular network's management staff. Therefore, diagnosis
and recovery are primarily a triage process of narrowing the focus after a potential, or
actual, service degradation is detected, and then notifying the responsible organization.
End-user measurements can be used in the triage process, as discussed in Chapter 6,
"Real-Time Operations." In addition to the examples in that chapter, Figure 10-2 shows an
example from a Keynote Systems real-time chart of network round-trip delay measured
from Keynote collectors on the major backbones in the U.S. to the affected web site's home
page. Clearly, there are problems from the collectors located on UUNET, and those
problems don't appear when measurements are made from any other major backbone in the
U.S. Using the chart, the web site's administrator has clear evidence that points to a
probable problem in the peering between the web site's ISP and UUNET. That should be
sufficient to get a trouble ticket opened quickly with the web site's ISP.


Figure 10-2 Round-Trip Delay Used in Latency Diagnosis
End-user measurements can also be used in attempts to bypass transport provider problems.
A new approach named route control is emerging as a way of allocating traffic among a set
of service providers to maintain overall control of service quality.
Many organizations use several ISPs to gain the following advantages:

High availability with redundant providers
Closer proximity to customer browsers, offering shorter paths to reach business sites
Use of the lowest-cost provider for lower-priority services
Competition that keeps providers on their toes

A router on the customer premises has connections to each service provider and selects one
using the Border Gateway Protocol (BGP). BGP has some shortcomings. It is complex and
difficult to tune, and it can send traffic over paths that don't provide the lowest latency or
the lowest cost. In addition, it can certainly create situations in which low-priority traffic
flows across the ISP with the highest charges.
netVmg, an early mover in route-control solutions, introduced the Flow Control Platform
to address the problems in what they term the "middle mile," the set of Internet backbones
between the sender and the receiver.
The Flow Control Platform inspects outgoing flows, identifies the destination networks, and
uses proprietary approaches to measure performance to each destination across all the
attached ISPs. Performance baselines are monitored for conformance to latency, packet
loss, and cost policies. If the current ISP is unable to comply, the Flow Control Platform
selects another outgoing ISP based on cost and performance specifications. It sends a BGP
update to the web site's boundary router to initiate a change to a new provider. The Flow
Control Platform also allows for more sophisticated policies that consider security and
other factors, such as requiring or forbidding certain ISPs. Customers also have the
advantage of consistent provider measurements to facilitate SLA adjustments and contract
negotiations. They can optimize a set of provider services to their needs.
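The route-control idea can be reduced to a small decision rule: given periodic latency, loss, and price measurements toward a destination across each attached ISP, pick the cheapest provider that still meets the policy. The sketch below uses invented provider names, measurements, and thresholds; a product such as the Flow Control Platform would then push the resulting choice to the boundary router as a BGP update.

    # Hypothetical per-ISP measurements toward one destination network.
    measurements = {
        "isp_a": {"latency_ms": 45, "loss_pct": 0.1, "cost_per_gb": 0.09},
        "isp_b": {"latency_ms": 80, "loss_pct": 0.0, "cost_per_gb": 0.04},
        "isp_c": {"latency_ms": 42, "loss_pct": 2.5, "cost_per_gb": 0.06},
    }

    # Assumed policy for this traffic class.
    policy = {"max_latency_ms": 60, "max_loss_pct": 1.0}

    def select_provider(measurements, policy):
        # Keep only the providers that comply with the latency and loss policy.
        compliant = {
            isp: m for isp, m in measurements.items()
            if m["latency_ms"] <= policy["max_latency_ms"]
            and m["loss_pct"] <= policy["max_loss_pct"]
        }
        if not compliant:
            return None  # no provider meets the policy; escalate instead
        # Among the compliant providers, prefer the lowest cost.
        return min(compliant, key=lambda isp: compliant[isp]["cost_per_gb"])

    print(select_provider(measurements, policy))  # -> isp_a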
Even if route control isn't used, multiple providers or multiple transport services can
provide different levels of transport quality for an enterprise. The enterprise's border
switches or routers can sort outgoing traffic into different classes, depending on the
required transport performance (possibly signaled by the same tagging system used within
the enterprise). Those border devices can then route the outgoing traffic over the
appropriate provider service; for example, traffic requiring latency guarantees would be
sent over a service that can provide such guarantees, probably at a higher price than for
the service providing best-effort delivery. This is similar to the use of multiple, isolated
networks to provide different levels of service within the enterprise, as discussed in the
previous subsection on over-provisioning and isolated networks.
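Without dynamic route control, the same idea can be applied statically: the border device maps each traffic class to the provider service that matches its requirements. The sketch below shows one way to express such a mapping; the class markings and provider service names are assumptions for illustration, not taken from any particular product.

    # Hypothetical static mapping from traffic class (here, a DSCP-style
    # marking) to the provider service that should carry it.
    CLASS_TO_PROVIDER = {
        "ef":   "provider_a_low_latency",   # voice and video: latency guarantees
        "af31": "provider_a_business",      # transactional traffic
        "be":   "provider_b_best_effort",   # bulk transfers and e-mail
    }

    def egress_service(marking):
        # Unknown markings fall back to the best-effort service.
        return CLASS_TO_PROVIDER.get(marking, "provider_b_best_effort")

    print(egress_service("ef"))  # -> provider_a_low_latency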

Summary
The transport infrastructure spans a large number of networks, and its traffic must be
managed at two levels. The network managers must consider how they will measure and
manage the network characteristics of bandwidth, availability, packet loss, latency, and
jitter.
Flows within a single organization's domain are managed using specialized technologies
that can tag data packets for special handling and can make reservations for that special
handling.
Rate control and queuing can be used at key network points to manage bandwidth for all
the traffic streams passing through those points. Some organizations don't use flow
management technologies; they simply over-provision massively, hoping that there won't
be severe congestion problems.
Flows across management domain boundaries are an ongoing challenge. The goal is to
attain flow-through QoS and deliver consistent end-to-end quality independent of the set
of ISPs used. The technical means are becoming available with the introduction of
MPLS and route control, although the administrative burdens are still high.

PART III

Long-term Service Level Management Functions

Chapter 11  Load Testing
Chapter 12  Modeling and Capacity Planning


CHAPTER 11

Load Testing

Load balancers, caches, and Quality of Service (QoS)-enabled network devices
dynamically shift resources in real time to meet compliance requirements. In contrast, a
longer-term management focus is intended to identify future problems with enough lead
time to take the appropriate preventive actions. The long-term time scale is days or weeks
versus the minutes and seconds involved in real-time management processes. Real-time
operations play the hand that's been dealt; long-term operations focus on drawing better
cards to begin with.
If you are a services manager, one question that is usually in the back of your mind is this:
When will my services fail to meet their quality objectives? (That you will fail to meet
Service Level Agreement (SLA) objectives under some combination of circumstances is
almost a given. Perhaps a more accurate statement of the question is this: Who breaks my
services first? My customers, or my testing team?)
The long-term functions discussed in this chapter and Chapter 12, "Modeling and Capacity
Planning," ensure that there are adequate resources to allocate and that you better
understand how to structure operational options. The long-term functions are capacity
planning, modeling, and load testing.
Capacity planning using modeling tools, discussed in Chapter 12, has often been difficult
and chancy, greatly depending on the assumptions you make about loads, request rates, and
other key variables. Analysis time lengthens as the numbers of elements and their
interactions increase. Moreover, there are many elements and effects whose interactions are
very difficult to quantify (latency and queuing are two key examples), and the simulations
are no better than the assumptions that went into them.
In contrast, load testing is empirical rather than analytic; it addresses the question in a more
straightforward manner: see how the services actually behave and where they really break. With
load testing, you can learn about problematic performance and address it before deployment.
You can then baseline the behavior so that you can allocate resources properly and set realistic
performance goals. Load testing can also be used to improve the accuracy of analytic tools.
Load testing uses active validation techniques to define the performance envelope of a
service or an element. The service or element being tested is subjected to an increasing
offered load until service quality begins to plummet. Alternatively, the service is subjected
to a steady load over a longer period of time, to identify effects, if any, of caching or
buffering.


This chapter discusses load testing of the server and application infrastructures, which can
be handled together. I don't discuss in detail the use of load tests to evaluate transport
infrastructures, but specialized tools and techniques are available to test transport networks.
They generate high traffic volumes and can be set to introduce errors and shift timing
relationships to create jitter between packets. Device vendors use these expensive traffic
generators to test and validate their products and to determine the relative strengths and
weaknesses of competing products.
A load testing strategy needs these pieces: a test bed, controllable load generators, good
system profiles, good analysis, and clear reporting tools. Toward that end, this chapter
covers the following:

Characteristics of the performance envelope that represents a service's performance
as a function of load
Load testing benchmarks
Load test beds and load generators
Building transaction load-test scripts and profiles
Using the results of a load test

The Performance Envelope


Load testing is used to identify a performance envelope for a service or a mix of services
under different operating conditions. The performance envelope represents the service's
performance under normal and extreme operational ranges. For example, you probably first
want to test services at their expected normal operating loads (number of users or
transactions per second) over an extended time period to ensure reliable and stable
operation. Then you'll want to test under extreme loads to determine the operational
limitations, which should assist in pointing to actions that can reduce bottlenecks.
Determining an accurate performance envelope is notoriously difficult for several reasons.
Web environments are complex, and the causes of problem behavior are hard to identify.
Even experienced designers and administrators have difficulty developing a feel for the
likely factors affecting performance.
The classical performance envelope is represented by Figure 11-1, which shows response
time versus offered load. The offered load can be transaction requests, database queries, or
network traffic at an interface, as examples. This general form of the load curve applies to
a variety of managed resources, such as network devices or servers. Web application-level
performance, however, has a surprising difference from this classical behavior: the
Internet's Web users abandon sessions when the response time becomes too long, affecting
the response time of other sessions. The text discusses that difference at the end of this
subsection.

Figure 11-1 Classical Load Curve (response time versus offered load)
There are two areas of interest in the classical load curve: where the behavior is linear
and where it changes to being nonlinear. The inflection point is the boundary between
linear and nonlinear responses and is a function of the peak service rate (whether measured
in packets per second, frames per second, or transactions per second) that the specific
layer's infrastructure has been engineered to provide. It's based on elementary queuing
theory; above the inflection point, even small increases in applied load result in large
changes in performance and, possibly, in availability.
The linear portion of the response curve represents the most stable and predictable part of
the performance envelope. It represents conditions where the resources are sufficient for the
load applied. Queuing delays are minimal, and the response time is low for a range of loads.
As the offered load grows, the response time begins to lengthen as resources are more
heavily subscribed. Loading up to the inflection point is (approximately) linear: each
increment of offered load produces an equal corresponding incremental increase in
response time.
The slope of the linear portion gives important information: it indicates the sensitivity to
loading changes. A flat slope shows that response is less sensitive to a loading change than
a steeper slope. Flat slopes are desirable because they lend stability to the behavior so that
response time doesn't degrade as loads vary, something customers demand. Note that a flat
slope may also mean an underutilized system. However, underutilized today also means
headroom for further growth, tying directly to the challenges of capacity planning.
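The shape of this curve can be reproduced with elementary queuing theory. Assuming a single M/M/1 queue with service rate mu and offered load lambda, the average response time is R = 1/(mu - lambda); the Python sketch below tabulates R for increasing load and shows the nearly flat region followed by the rapid climb near the inflection point. The capacity and load values are illustrative only, not measurements of any real system.

    # M/M/1 response time versus offered load (illustrative numbers).
    mu = 100.0  # assumed service capacity: 100 transactions per second

    def response_time(offered_load):
        # Average response time in seconds for an M/M/1 queue.
        if offered_load >= mu:
            return float("inf")  # beyond capacity the queue grows without bound
        return 1.0 / (mu - offered_load)

    for load in (10, 30, 50, 70, 85, 95, 99):
        print(f"load={load:3d} tps  response={response_time(load) * 1000:7.1f} ms")

    # The response stays below roughly 70 ms through 85 tps, then jumps to
    # 200 ms at 95 tps and 1000 ms at 99 tps: the nonlinear region to the
    # right of the inflection point.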


An administrator or planner wants to work in the linear part of the performance envelope
because he or she can make fairly accurate estimates about expected response times with
increasing loading.
At the inflection point, the offered loads exceed the ability of the tested environment to
process them quickly enough, and queue lengths begin to increase exponentially. Delays
grow quickly, degrading time-sensitive activities, congesting servers and networks, and
causing customers to go elsewhere and online transactions to fail.
Note that it also takes some time to recover and shift back to linear operation, even if the
offered load is removed completely. Queue lengths must be reduced first. Administrators
want to avoid the nonlinear area because of its unpredictability. At one moment, operations
are still within the metrics of the SLA; then, when a small burst of requests arrives, the
entire operation can grind to a halt. This is risky ground to manage.
On the public Web, there are two additional phenomena that must be considered: flash load
and abandonment.
On most non-Web, transaction-oriented computer systems, the queue is external; that is, it
is outside the system. Customers wait in telephone queues, or there is a pile of incoming
documents to be processed in front of each data-entry clerk. The flow of transactions is
therefore reasonably steady, with a firm maximum number of sessions set by the number of
clerk terminals or dial-in lines. Under heavy load, the external queue builds, and the result
is that there's a steady, unchanging workload, classically measured by the concurrent
sessions statistic.
On the public Web, in sharp contrast, the incoming traffic hits the system directly. Massive
flash loads can appear in response to a television ad or a mention in a news article, with
hundreds of thousands of users trying to establish TCP sessions simultaneously. Such loads
can overwhelm the system at the precise time that user satisfaction is most important. (Why
run a television ad and then convince most of the public that they never want to go to your
web site again?)
Loads on public web sites can therefore be much higher than the loads generated by
classical load-generation tools; special Web load generation tools are necessary. In
addition, load statistics for Web-based transactions should be in terms of arrival rate over a
given interval, not concurrent users. For example, the Keynote LoadPro service can handle
hundreds of thousands of concurrent user session initiation attempts flooding in from the
Internet, and it measures load in terms of session initiation rate, not in terms of concurrent
users.
The other major difference between classical load transaction testing and Web load
transaction testing is abandonment. In classical systems and on corporate intranets, users
don't abandon a transaction. They remain in the transaction until completion, regardless of
the amount of time it takes. There simply isn't anywhere else to go. Call center operators
and data entry operators must wait if the transaction response time is very slow, and external
customers who dialed into a corporate mainframe don't usually disconnect and then dial
into a competitor on a whim, just to see if the competitor is faster.
On the public Web, however, it's extremely easy to abandon a transaction; people do it all
the time. Worse, web protocols usually don't inform the system when an end user has
abandoned a transaction; the web server system must use timeouts or other special methods
to guess when an end user has abandoned a transaction. The result is that many transactions
in a Web system may be inactive, waiting for timeout, especially under a heavy load with the
long response times that encourage abandonment. If the Web system's abandonment-detection
and resource-recovery mechanisms are inadequate, the system may clog as a result of
massive numbers of abandoned transactions, leading to even worse performance and even
more abandonment in a vicious cycle. The system might be able to handle a brief peak load,
but be unable to endure a longer-duration peak load because it cannot recover resources
efficiently from abandoned transactions.
Load testing of public web sites must therefore include a way to simulate transaction
abandonment and must be run long enough to determine the endurance of the system. In
the Keynote LoadPro system, dissatisfaction and abandonment scores are kept for each
simulated user. They vary according to the type of user (beginner, experienced, and so on)
and according to the type of web page (home page, search page, and so on). They vary
because different classes of users have different tolerances for server delay, and users are
willing to wait different lengths of time for different types of pages. The LoadPro system
then simulates abandonment at the appropriate points, and it also reports a dissatisfaction
score at the end of the load simulation, to indicate aggregate user satisfaction with the entire
experience.
Abandonment is another reason, in addition to flash loads, that concurrent sessions should
be avoided as a measure of load in the Web environment, where session termination is
difficult to detect. Concurrent sessions can, however, be used as a measure of performance.
For a given arrival rate of new transactions, the number of concurrent sessions decreases as
the system's performance increases; good performance allows end users to do their work
quickly and log off.
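To make the abandonment behavior concrete, here is a minimal sketch of how a load generator might decide, per simulated user, whether a page response was slow enough to abandon the session and how much dissatisfaction it caused. The tolerance values, user types, and page factors are invented; services such as LoadPro apply their own, more detailed behavioral profiles.

    # Assumed per-user-type tolerance for server delay, in seconds, and
    # per-page-type adjustment factors (users wait longer for checkout pages).
    TOLERANCE = {"beginner": 8.0, "experienced": 4.0}
    PAGE_FACTOR = {"home": 1.0, "search": 1.5, "checkout": 2.0}

    def evaluate(user_type, page_type, response_seconds):
        # Return (abandoned, dissatisfaction_score) for one page view.
        limit = TOLERANCE[user_type] * PAGE_FACTOR[page_type]
        # Dissatisfaction grows as the response approaches the user's limit.
        dissatisfaction = min(1.0, response_seconds / limit)
        abandoned = response_seconds > limit
        return abandoned, round(dissatisfaction, 2)

    # The same 5-second response loses an experienced user on the home page
    # but is tolerated during checkout.
    print(evaluate("experienced", "home", 5.0))      # (True, 1.0)
    print(evaluate("experienced", "checkout", 5.0))  # (False, 0.62)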

Load Testing Benchmarks


One way to avoid the expense and time involved in doing your own load testing is to use
reports generated by independent testing organizations. Most of these tests, also known as
benchmarks, usually focus on a single element, such as a server, a load balancer, a router,
and so forth. These independent tests can be very helpful, but care must also be taken. Bear
in mind that product suppliers are often the ones that commission such tests; no reports are
published that show the test sponsor losing.


There have been cases where the same testing organization has published different reports
showing that the sponsor of each won hands down over the same competitors that beat them
in a previous report. This happens for two reasons:

The worst case is a testing facility that caters to whoever pays the bills.
A more common reason is that each test uses slightly different conditions to favor the
sponsor. As a case in point, many years ago, a network device vendor trumpeted its
packet forwarding performance. It turned out that each interface card had storage for
recently used IP addresses, saving a routing database lookup and speeding up the
forwarding process. Needless to say, their testing process always used a small number
of IP addresses, so the local storage of recent IP addresses was leveraged to the hilt.
The test generated impressive numbers, but had little relevance to a real environment.
When a large set of IP addresses was used, the performance dropped significantly.
This isn't surprising, and there is still a lot that can be learned, but you have to work
a little harder. It's also an important lesson in attention to detail in designing internal
tests; such effects can be inadvertently introduced into any test environment.

It is important to understand as much as you can about the testing process behind the data
in the testing report. A good report should include the following:

A description of the test bed, the elements, and their interconnection.
A full description of the Device Under Test (DUT) features used. Are they realistic in
your world? Are features that lower performance used in the tests? Does the network
equipment handle a mixed background workload, along with any possible access
control lists and similar resource-using features, similar to your environment?
A complete explanation of the results, along with graphs of performance versus
applied load.
A description of the testing profiles. What are the loading characteristics and
workload properties?

Gathering a set of load testing benchmark reports can be useful even if the results of the
tests are not directly comparable to your situation. Studies sponsored by your vendor's
competitors may reveal problems that your vendor will not mention to you. Benchmark
reports may also point out common problem areas for which you should be looking. The
value of independent element testing is that you don't pay for it, and it helps define realistic
envelopes and their associated thresholds.

Load Test Beds and Load Generators


As much as possible, the test bed must be a faithful copy of the operating environment. It
includes servers, applications, Internet connectivity, databases, load balancers, caches, and
whatever else is needed. This can be a costly undertaking; most organizations must settle
for a test bed that is similar to, but not exactly the same as, the production environment,
reduced in scope or in scale.


Operational environments with thousands of servers are usually too expensive to replicate
completely. I have worked with several large organizations that have an exact copy of their
operating environment: hundreds of switches, servers, routers, and other equipment.
These few companies are the exception to the rule. A smaller, but still faithful, copy must
be used with adjustments to the load testing results to give realistic operational guidance.
In effect, the test bed is a manageable subset of the target production environment.
Costs of administering the test bed should not be underestimated. (In addition, as with most
testing, investments must be compared with the cost of failure if investments are not made.)
There must be close coordination between the operations and testing teams; new software
upgrades, patches, and changes in operating systems or connectivity must be included in
the test bed as well. Some organizations manage their test-bed environments as though they
were production-quality operations when new software is distributed, to ensure consistency
between the two environments.
An environment with a test bed and load generators can be useful in other ways. For
example, planners and administrators can carry out some real-world what-if analyses by
changing the test bed or the profiles to test for extremes in behavior or sensitivities.
Experimenting with different conditions may reveal more information, for instance.
Performance degradation may follow different scenarios when it results from a steadily
increasing load versus when a sudden jump in the load occurs. Carefully controlling the
offered loads applied to the test bed can reveal inflection points in a variety of subsystems.
What-if analysis also extends to other management strategies. Trying new content
distribution techniques, or assessing the real impact of a cache in the test bed, helps drive
better resource allocation decisions.
Many valuable tests can be conducted in a local environment, especially those identifying
the inflection points for LANs, switches, and servers. However, for many services, the
Internet and the other networks used by the enterprise must be part of the testing procedure.
Testing across the Internet and enterprise networks is necessary because of the variability
introduced by distance and routing changes, the use of different providers, and the
interactions with other traffic flows.
Testers can use desktop systems located in various locations to drive the transactions at the
test bed. For large-scale tests over the Internet or within corporate networks, service
providers such as Keynote Systems and Mercury Interactive can provide load testing on
demand. In many cases, it's much less expensive to use a service for highly realistic,
massive load tests than it is to acquire the software, hardware, network connectivity, and
expertise to run the test on your own.
Testing staff should perform highly repetitive preparatory testing using in-house systems,
but they should consider using an external service for final, large-scale acceptance tests
before production. (Those external services can normally reuse your in-house scripts,
although they often need to be supplemented by parameters for abandonment behavior and
flash load characteristics.) External testing organizations offer advantages beyond just
saving money for major test efforts. They have extensive testing experience, may be faster
than your own organization, and do not take your staff away from their normal functions.


Load testing of web applications requires load generators, which are specially programmed
computer systems that produce large numbers of synthetic (virtual) transactions. Load
generators run scripts of synthetic transactions that follow a prescribed set of steps. Some
of these steps can be parameterized to simulate a wider variety of users or transaction types
found in normal operations. These steps may include the following:

Establishing a connection to a web server
Authenticating the customer identity
Carrying out a task, such as browsing various web pages, tracking an order, or buying
products
Completing the transaction, such as providing credit or shipping information
Checking accuracy
Breaking the connection

It is important to emphasize checking the completed transactions to determine correct
functioning: that the expected information was actually delivered, the order was
completed, the purchase was tracked, or the form was filled in properly. Correct functioning
under normal loads should have been checked by regression testing, which is the
comprehensive, highly structured testing that's done to detect incorrect operation, long
before the load tests. However, checking of proper function must be continued during load
testing, as some failures appear only under severe load stresses.
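As a sketch of what a scripted synthetic transaction might look like, the following Python function walks one virtual user through the steps listed above and verifies the response content at each step. The URLs, form fields, and expected strings are placeholders for illustration; commercial load generators express the same flow in their own scripting formats.

    import requests

    def run_virtual_user(base_url, user, item):
        # One scripted transaction: connect, authenticate, shop, order, verify.
        session = requests.Session()                      # establish the connection
        page = session.get(f"{base_url}/login")
        assert page.status_code == 200

        # Authenticate the customer identity (placeholder credentials).
        resp = session.post(f"{base_url}/login",
                            data={"user": user, "password": "test-password"})
        assert "Welcome" in resp.text, "login page did not confirm sign-in"

        # Carry out a task: browse the catalog and add an item to the cart.
        resp = session.get(f"{base_url}/catalog/{item}")
        assert resp.status_code == 200
        session.post(f"{base_url}/cart", data={"item": item, "qty": 1})

        # Complete the transaction with shipping information, then check that
        # the order was actually accepted rather than silently dropped.
        resp = session.post(f"{base_url}/checkout",
                            data={"ship_to": "123 Test Street"})
        assert "Order confirmed" in resp.text, "checkout did not complete"

        session.close()                                   # break the connection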
The better the load generator, the more control it offers in creating controlled variation of
different parameters of the test. At the same time, it's easy to get carried away with too
much creativity. Like any test, tradeoffs must be made between faithful approximation of
the real thing and keeping the tests sufficiently straightforward so that their management
does not become more burdensome than operating the real environment.
Note that the efficiency of the load generators must be considered because each one
supports a finite number of virtual users executing scripts. (Web load generators can alter
the IP addresses in the packets they generate to simulate a large and diverse user
population.) Large-scale testing may require a large number of load generators, and it is
important to understand this limitation for any testing products. For example, if a load
generator could handle only 500 virtual users, you would need 200 such load generators to
simulate 100,000 concurrent users. This can be another reason for using load-testing
services for large-scale tests.
A variety of additional scaling issues must be considered when the test bed is smaller than
the actual operating environment. The results must be interpreted for a real environment.
Application modeling tools, discussed in Chapter 12, can be helpful in extrapolating from
a small test bed.


Other tips that can help in this effort include the following:

Maintaining the same ratios of resources seems to help with gauging results. This is a
practical rule of thumb: the same dependencies are more likely to be revealed if the
ratio of aggregated uplinks to backbone speed is maintained. As an example, if the
production environment aggregates eight 100-Mb Fast Ethernet links into a 1-Gb
Ethernet backbone, the test bed should use the same ratios. Using half that number
would not create the same bottlenecks at high loads. It would, therefore, be misleading
if used to determine a practical level of service over-subscription that both supports
the target usage rates and is also economical in terms of the needed amount of
supporting hardware and software.

Performing unit testing helps you see the maximum number of concurrent
transactions or connections a single server can actually handle, along with the effect
of adding multiple servers. (In some cases, you don't get full advantage from each
additional server because of inter-server synchronization overhead and other factors.)
Then the number of servers for a given load can be estimated. It is also important to
stress-test elements, such as load balancers, to determine their actual performance
envelope for the anticipated number of connections and workload.

The collected information from load testing is used to characterize the performance
envelope, including the system behaviors under flash load and when users are abandoning
transactions. A good data management capability is needed to save and organize data from
a series of tests. Planners, developers, and administrators can compare results from
different tests and refine their understanding of behavior and their testing procedures.
Statistical techniques can be applied profitably to tease out the contributions of
different factors; good software for this has come down in price over the past several years.
The most important statistical technique is graphing; visual representation of test results
can quickly identify inflection points.

Building Transaction Load-Test Scripts and Profiles


Testing transactions is the ultimate goal. You want to understand which loading levels and
transaction mixes degrade performance. Identifying possible compliance problems and
mitigating them before the service is introduced prevents unpleasant service launches that
melt down.
The first step is determining the transactions you will use for testing. They must be
representative, reflecting the way customers actually use these services. For example, not
every customer who connects to a web site will buy a product, and the transaction mix
should reflect that reality rather than have everyone order a product with each transaction.
You still might want to have a 100 percent order rate for ultimate stress testing, but having
realistic results that assist in managing performance is equally important.


If the service you are testing will actually use an external provider for credit authorization,
clearing payments, or Customer Relationship Management (CRM), those behaviors must
be included. One issue is that there are fees for these services, and testing large transaction
volumes is expensive. There is also the problem of transactions creating data that has to be
backed out later after the testing is completed; no one wants thousands of products ordered
in the test to be actually shipped and paid for.
What is needed instead is a stand-in at the point where the external service is initiated. A
simple application can simulate receiving the request, waiting a defined time, and
responding, replicating the external activities in a controlled environment. Such simulators
of external activities can also introduce errors or timeouts to see how such conditions affect
performance under load.
For a new service, you may need to make some initial guesses because there is no real
operational data. Good instrumentation makes better operational data available after
deployment. The feedback from actual operations is used to modify the loading assumptions
as needed. Future tests will be more accurate because they use more accurate loading
characteristics.
After the appropriate transaction set has been identified, the individual transactions must be
recorded and prepared for the testing phase.
Testers typically use a transaction recorder to capture real transactions and structure them
into scripts that can be replicated to different load generators and executed repeatedly. The
key is that the capture should not require any staff effort to effect the conversion to a test
script, as most of the early products of this type did.
The script of the recorded transaction is static: it captures one user accessing one set of
information, ordering one product, and using one credit card; therefore, it is of limited
value. Running the same transaction repeatedly will not offer much insight into the actual
operational behavior under a wider variety of loads and transactions. In fact, I know of a
team that tested against the same transaction for so long that they unconsciously adapted
their code to produce stunning performance for that transaction only.
Variability in the scripts is needed, and good load testing tools must provide this. The
variability should be structured in several ways. For example, a file with a list of user
names, products, catalog numbers, or other important information could be used as input.
File-driven variability is usually used first because repeating the tests accurately is helpful
in the early testing phases. Adding randomly generated transactions is also helpful for
testing a wider range of behaviors.
The actual testing process uses the generated scripts to simulate the activity of real
customers. The test procedure includes other parameters that govern its operations,
including the following (a sample profile sketch follows the list):

The length of the test
The transaction mix
The number of virtual users being simulated
The iterations of each transaction
The inter-step pause (think) times for each virtual user
The ramping characteristics, such as adding load smoothly or in large discontinuous
steps
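A load-test profile that captures these parameters might look like the following sketch. The field names and values are illustrative assumptions; each load-testing product has its own configuration format.

    # Illustrative load-test profile; values are assumptions, not recommendations.
    load_profile = {
        "duration_minutes": 60,
        "transaction_mix": {            # fraction of virtual users per script
            "browse_catalog": 0.60,
            "search": 0.25,
            "purchase": 0.15,
        },
        "virtual_users": 5000,
        "iterations_per_user": 20,
        "think_time_seconds": (5, 30),  # random pause range between steps
        "ramp": {"style": "smooth", "step_users": 250, "step_seconds": 60},
    }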

A complex environment has dozens to hundreds of possible configuration options that can
be changed from test to test, and those permutations can quickly get out of hand. However,
statistical methods exist, in a discipline known as Design of Experiments (DOE), that make
it possible to achieve robust results without resorting to the full range of all permutations.
Such methods have been used with great success in a wide variety of scientific, engineering,
and quality control disciplines. References and resources about DOE are available on the
Web.
Remember that an important goal of load testing is to break the service by forcing it into
operational areas in which performance rapidly degrades. The breakdown provides a rich
source of data if the test bed is instrumented properly. What you also want to know is where
the performance broke. Was it network congestion, server overload, poor application
design, the database, or a combination of factors that caused the problem? This information
offers one key to alleviating performance problems or at least to pushing the inflection point
toward the right, meaning higher loads are sustainable before nonlinear behavior occurs.
The system elements are already instrumented to provide information on loads and
exceptions; sometimes the elements can provide all the information needed, by monitoring
their internal states during web transaction load testing. At other times, you'll need to drive
specialized transactions into a portion of the test bed to determine server, application,
network, or database delays.

Using the Test Results


Load testing generates large volumes of data that must be sorted and organized before it
offers meaningful insight into service performance. A series of predefined reports and
graphs are essential for getting started; reports should be easily customized as testers gain
experience and understand what information is most useful to them. The iterative nature of
load testing also requires easy comparisons between tests to identify the differences
associated with any changes.
Testing produces information that is useful to different groups. One advantage of
distributing this data widely is that it provides a common context for discussions between
groups that do not usually cooperate well. Load testing should produce a set of results for
planners, developers, and administrators.
Planners use the load testing information to project future resource demands. They want to
remain on the linear side of the inflection point. Knowing the normal growth in loading
gives an estimate of the time to act before the inflection point is reached. Having accurate
information about the contributions of specific components to overall response time allows
accurate investments and optimum returns.
Developers use load testing as feedback on the soundness of their designs and their use of
best practices. They are able to determine if the application meets basic performance,
stability, and reliability needs. Tracking the number of failed transactions is also critical.
High transaction volume at the edge of the performance envelope counts as success only if
the transactions are actually completing as expected. Developers need to know if fast
transaction execution actually masks failures within each attempt.
Load testing data also assists administrators with setting realistic thresholds and with
adjusting operations throughout a business cycle. The relationship of the inflection point to
the negotiated SLA must be evaluated. For example, if the required response time lies well
within the linear domain, to the left of the inflection point, there is a high probability that
delivered services will be compliant. In contrast, if the service will be operating in the
nonlinear part of its performance envelope, to the right of the inflection point, compliance
is likely to suffer.
Alarm thresholds can be set as a result of load tests and their determination of the inflection
points. As loading or response times approach the inflection point, the system can be
configured to issue an alert with a high severity level. A warning alert can be generated at
lower loading and response levels to give the operations team some lead time before
instability occurs. These alerts can be incorporated into automated operations or policy
systems (as discussed in Chapter 7, "Policy-Based Management") so that actions such as
redirecting traffic to other sites or bringing more resources online can be initiated
automatically as performance moves toward the inflection point.
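One simple way to turn the measured inflection point into operational thresholds is sketched below: a warning is raised at a fraction of the load where nonlinear behavior began, and a high-severity alert closer to it. The inflection value and the fractions are placeholders to be replaced with the results of your own load tests and SLA analysis.

    # Inflection point measured during load testing (transactions per second);
    # the value here is a placeholder.
    INFLECTION_TPS = 850

    WARNING_AT = 0.70 * INFLECTION_TPS   # gives the operations team lead time
    CRITICAL_AT = 0.90 * INFLECTION_TPS  # instability is imminent

    def check_load(current_tps):
        # Return the alert severity for the current offered load.
        if current_tps >= CRITICAL_AT:
            return "critical"  # trigger automated actions: redirect, add capacity
        if current_tps >= WARNING_AT:
            return "warning"
        return "normal"

    print(check_load(500), check_load(700), check_load(800))
    # -> normal warning critical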
Load testing results can also be used to make modeling tools more accurate and effective,
as discussed in Chapter 12.

Summary
Load testing is an important long-term function that is used to assist management in
understanding the behavior of their systems. Of particular importance is the inflection
point, where the relationship between applied load and system response shifts from being
linear to nonlinear. It can be used in setting operations alerts and parameters as well as in
anticipating future problems caused by lack of resources. Load testing can also be used to
help build models, as discussed in Chapter 12.
Load testing on the Web has some characteristics that are quite different from load testing
in pre-Web, transaction-oriented systems.
The first key difference is the appearance of flash load. This occurs because external users
connecting from the public Internet can appear in unprecedented numbers, much greater
than seen in controlled, proprietary systems.


The second key difference is abandonment and the fact that Web systems usually cannot
detect abandonment directly. Unlike call center operators and employees using an intranet,
the Internet's web users quickly abandon a transaction if the response time is too long. That
abandonment affects system load and therefore the response time for other users. Also, as
the system usually cannot directly detect abandonment but must use timeouts instead,
abandoned transactions can congest the system if resources are not efficiently recovered.
Load testing on the Web must therefore include user abandonment behavior and endurance
tests to evaluate the system's ability to detect abandoned transactions and recover resources
efficiently.
Of course, Web load testing must also involve all Web services, networks, and other
equipment to ensure that it is a realistic view of actual performance. Web load test services
may be useful to provide large-scale, highly realistic load testing as a final step before
production.

CHAPTER 12

Modeling and Capacity Planning


We always try to anticipate the bad things that could happen to our online activities: What
if there is a network or server crash? What if our customer traffic really doubles this
quarter? What if we don't deploy new servers and routers for another six months?
The long-term functions discussed in this chapter and Chapter 11, "Load Testing," ensure
that there are adequate resources and that we better understand how to structure operational
options. (The long-term time scale is days or weeks versus the minutes and seconds
involved in real-time management processes.)
The long-term function of load testing is discussed in Chapter 11; this current chapter
discusses the long-term functions of simulation modeling and capacity planning. The text
starts by discussing the advantages and complexity of simulation modeling. It then
discusses some models and finishes with a section on capacity planning.

Advantages of Simulation Modeling


Our ability to what-if is limited by the constraints of time and information. It takes time to
understand a complex environment and all its subtle interactions. There is the time
expended in staying current with the constant changes in a dynamic services ecosystem. It
takes additional time to work through a what-if scenario. The iterative nature of a what-if
exploration really compounds the time constraint. The usual process is seeing a change that
indicates an improvement, problem cause, or a sensitivity factor and then pursuing it
further.
The next questions that naturally arise are as follows:

Is this change really causing the improvement?

Are there any instabilities we should know about, such as scaling under loads?

Where are we in the variable's range? Will a bigger change to it lead to a bigger
improvement?
Are there simpler or cheaper alternatives?

Because of limited time and resources, some organizations trying to answer these questions
often limit their approach to what has been done in the past. They are playing it safe, but
missing opportunities to make a bigger impact. The other risk is missing key trends by
focusing on the familiar. More than a few decisions have also been based upon just plain
guessing and hoping for the best.
Simulation modeling quickly explores many options, leading to better understanding and
decision making. The benefits of simulation modeling are as follows:

Easy exploration of a wide set of alternative approaches
Independence from a test bed
Flexibility

After a simulation model is constructed and validated, it can be used for exploring a
range of alternatives for planners, administrators, and designers. They can quickly
eliminate those alternatives that do not improve performance or service quality. Rapidly
iterating through alternatives leads to an optimum solution for a set of operating scenarios.
Being able to evaluate alternative designs and workloads without modifying hardware and
software can be a definite advantage.
Evaluating a range of alternatives is also helpful in identifying sensitivities. For example,
changing the loading characteristics, the transaction mix, the topology, or other factors can
identify specific sensitivities. A certain mixture of services may introduce instability and
mutual interference, while the same mixture with different proportions operates smoothly.
Simulation modeling tools allow more agility and faster results when compared to using
load testing on a test bed (which is discussed in Chapter 11). For example, with a model,
you can add a different kind of device, or one that is needed but not yet delivered to the test
bed. This enables testing and analysis to go forward without waiting for all the real pieces
of the environment to be assembled. The elements of a model can be updated, replaced, or
modified in a matter of minutes, and new results can be produced quickly thereafter. In
contrast, ordering all possible products that can be placed in the test bed is not economically
feasible. Even if it were, delays in obtaining the products and integrating them into the test
bed must be considered. Modeling is especially useful when parts of the physical or
software infrastructure are not available and testing can begin without them.
Results are usually obtained faster with modeling than with load testing on a test bed
because there is no physical infrastructure to deal with and no software to modify. For
example, making a change to the test bed requires staff time to create any physical changes
such as altering connectivity, reassigning servers, moving switches, or changing link
capacity. Additional time may be required to modify software, update directories, and
adjust management tools to reflect changes. Further effort is needed to verify that the
changes were made properly and introduced no new sources of errors or problems.
The expenditures for the equipment in the test bed, for its administration, and for its
operation must also be considered relative to the costs for acquiring and learning to use
good modeling tools. It's a matter of balancing your investment strategy and making sure
that the complete suite of tools provides the most cost-effective management capabilities.


Some organizations have their integrator maintain a model that they use for planning and
what-if scenarios. Others acquire load testing tools or work with testing organizations.

Complexity of Simulation Modeling


Although modeling has been used with great success, many organizations have been wary
of using modeling tools. Part of this reticence is because the early modeling tools were
complicated and labor-intensive, took a long time to produce a result, and had questionable
accuracy.
Every management team that uses modeling tools should therefore have a clear idea of what
constitutes a good enough solution. By its nature, modeling will always be an
approximation. Some variables are omitted to keep the model to a reasonable size. Other
variables that affect performance may not even be known or included in the model, and the
assumptions about operational conditions add another approximation.
It is always possible to put more work and effort into any model. You can refine the
assumptions about operational conditions, adding more details and continually trying to
close the gap between the model result and the actual reality. It's very easy to approach the
point of diminishing returns, with each improvement in the model's accuracy taking
increasingly larger increments of effort and time. At some point, the model has to be
declared good enough, unless significant changes dictate more work to bring the model up
to date.

Simulation Model Examples


Modeling solutions have evolved beyond early generations. The successful modeling
products on the market today have all discovered one fundamental truth about the
marketplace: customers are more interested in solutions than in modeling technology.
Nevertheless, it's important to examine the underlying technology of any simulation
modeling product; some modeling products are easier to use and more accurate than others.
The following sections look at two typical simulation modeling tools as examples of what
is available for modeling a complex Web system: OPNET Modeler and HyPerformix
Integrated Performance Suite.

Model Construction
Building a model has usually been a tedious and error-prone process. Describing the
elements and their relationships grows more difficult as the environment grows more
complex. The possibility of errors being introduced into the model also grows,
necessitating more laborious checking.


Modeling tools use automatic discovery as much as they can to simplify model building.
This approach works reasonably well at the topology level where the elements and their
connections can be determined by most discovery tools. The task gets more difficult when
the applications and dependent services are included.
As applications are distributed across many servers and data centers, understanding the
relationships among their components can be quite challenging. Most services are
dependent upon other applications and services, and the dependencies are usually
incorporated into the model manually.
Models are usually constructed by combining the interconnections of the system with the
characteristics of the individual system nodes. The interconnections, or topology, can be
discovered automatically from an existing system, or the topology can be constructed using
design tools. Even if the topology is discovered automatically or imported from other
systems that have discovered it automatically, some manual intervention may be needed to
ensure that it's accurate.
The individual system nodes are usually based on prepackaged object libraries, or
templates, that are ready for out-of-the-box model building. In these libraries, behavioral
descriptions are built for each type of object. For example, network object libraries
have descriptions for each device, detailing the maximum number of interfaces and
maximum link speeds, packet forwarding rates, Quality of Service (QoS) capabilities, and
other factors. A server object would describe CPU power, memory capacity, disk I/O rates,
and similar server behaviors.
Application objects are complex and usually require some manual characterization of the
application process flow. Sophisticated simulation modeling packages include
programming languages that can be used to construct those flows.
Planners use the library to build a model quickly and explore its behavior. The predefined
templates save time and reduce errors; they are complemented with tools for constructing
new objects and incorporating them into the library.
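The object-library idea can be pictured as a set of parameterized templates. The sketch below shows hypothetical server and link templates of the kind such a library might contain; the attribute names and numbers are invented and are not drawn from the OPNET or HyPerformix libraries.

    from dataclasses import dataclass

    @dataclass
    class ServerTemplate:
        # Behavioral description of a server class in the model library.
        name: str
        cpu_ops_per_sec: float
        memory_gb: int
        disk_io_per_sec: float

    @dataclass
    class LinkTemplate:
        # Behavioral description of a network link.
        name: str
        bandwidth_mbps: float
        latency_ms: float

    # Planners instantiate library objects and override parameters as needed.
    web_tier = ServerTemplate("web-server", cpu_ops_per_sec=2.0e9,
                              memory_gb=8, disk_io_per_sec=4000)
    uplink = LinkTemplate("gigabit-uplink", bandwidth_mbps=1000, latency_ms=0.5)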
Models are then driven by a variety of inputs for a thorough coverage of the system's
performance envelope. Using actual inputs is always the best alternative. Actual network
traffic can be captured with several kinds of collectors: remote monitoring (RMON) agents,
protocol analyzers, or a variety of point products. Transactions can be captured with
transaction recorders and from server logs. These sources give the most accurate input to
the model. Models can also be driven from scripts, files, or other sources. These inputs can
be tuned to stress different parts of the model and are also used as a repetitive, consistent
baseline to track changes in results.
The OPNET Network Editor is used to build and display topology information. Network
topology information can be imported or constructed graphically with the Network Editor.
Users have a palette of node and link objects to choose from while they build a topological
description of their environment. OPNET has an extensive object library, including objects
for an aggregated cloud node that can be configured with the latencies and packet-loss
ratios that have been measured from a real network. Customers can also create their own
objects for new devices. Simple dialog boxes for each object instance provide a means for
configuring them with the appropriate parameters, although reasonable defaults are
provided.
OPNET Flow Analysis can then be used to model the detailed characteristics of networks,
and OPNET Application Characterization Environment (ACE) can be used to model the
details of application transactions. ACE can use input from measurement collectors; it
discovers transactions and their detailed performance characteristics for input into the
model.
Similarly, the HyPerformix solutions include the HyPerformix Infrastructure Optimizer and
Performance Profiler; these jointly create or import topology information, model the
system, and use input from measurement collectors to discover transaction performance
characteristics for use in the models.

Model Validation
Determining the "good enough" point requires validation of the model's results; only then
do you know how accurate the model is and how accurate it needs to be. One approach uses the test
bed, if it exists, and compares actual results produced by the test bed to the model results.
When the discrepancy between them is acceptable, the model is good enough.
HyPerformix suggests driving the model to the point where the most heavily used server
is at 50, 70, and 90 percent of maximum capacity. They recommend as a validation
guideline that modeled server utilization should be within 10 percent of measured
utilization, modeled response time should be within 10 to 20 percent of measured response
time, and modeled throughput should be within 10 to 15 percent of measured throughput. Of
course, acceptable accuracy is also determined by your time and resource commitments,
tolerance for risk, and your staffing skill levels. At some point, the marginal value produced
by more refinements is not worth the time and expense to achieve them.
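To illustrate this kind of check, the short sketch below compares modeled values against measured values using tolerance thresholds in the spirit of those guidelines. The thresholds and the sample numbers are invented for the example; they are not taken from any vendor tool.

# Hypothetical validation check: compare modeled results against measured
# results and flag any metric whose relative error exceeds its tolerance.

TOLERANCES = {                    # relative error allowed per metric (illustrative)
    "server_utilization": 0.10,   # within 10 percent
    "response_time": 0.20,        # upper end of the 10-20 percent guideline
    "throughput": 0.15,           # upper end of the 10-15 percent guideline
}

def validate(modeled, measured):
    """Return the relative error and pass/fail status for each metric."""
    results = {}
    for metric, tolerance in TOLERANCES.items():
        error = abs(modeled[metric] - measured[metric]) / measured[metric]
        results[metric] = {"relative_error": round(error, 3),
                           "acceptable": error <= tolerance}
    return results

# Invented numbers at the 70-percent load point
modeled  = {"server_utilization": 0.66, "response_time": 2.3, "throughput": 410}
measured = {"server_utilization": 0.70, "response_time": 2.0, "throughput": 455}
print(validate(modeled, measured))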
Comparing the model results with the test bed results can also identify areas where the
model's results can be adjusted with real-world input from the test bed. Rather than making
extensive modifications to the model, a simple adjustment of the results can sometimes
suffice. Data from the actual production environment can also be used to calibrate the
model results. Good instrumentation captures the loading characteristics and the responses.
The actual loads are used to drive the model, and its results are compared with those from
the actual production environment.
The model becomes even more valuable after it has been calibrated because its results can
be adjusted to achieve more accuracy. Combining modeling with load testing and other
capabilities builds a stronger overall long-term management capability.

Reporting
Presenting the modeling results in an easy-to-understand form is another key requirement. Models,
like the environments they simulate, generate large amounts of data, and their value lies in
converting that data into usable information, particularly through visual representation. A variety
of formats, as well as the ability to interact with the data, is key to effective analysis.
Interactive use of the model clarifies sensitivities to certain operating conditions, showing
the changes in model outputs that result from changes in model inputs.

Capacity Planning
Capacity planning is another key long-term operation. Many of the real-time technologies
I have discussed implement policies that describe how the resources are divided among a
set of competing services. Capacity planning ensures that there are enough resources in the
future to make the resource allocation strategies work; no real-time strategy works
effectively when its resources are over-subscribed.
The goals of capacity planning are similar to those for proactive management: to give the
management team sufficient time to take the necessary actions to prevent a service
disruption. In the case of real-time operations, the lead time is measured in minutes,
whereas capacity planning works on the scale of weeks or months.
Capacity planning is considerably more complex than in the early client-server days. In
early client-server designs, the basic environment was a server attached to a router.
Performance problems were usually addressed through boosting server performance or
increasing the Internet connection speed. Most of the time the problem was solved, or at
least postponed for some time.
Today's environments are not amenable to the blanket upgrade approach; adding resources
to every element is simply too expensive. Even when the funding for large-scale
over-provisioning is available, there is no guarantee that it will actually solve the problem.
Having excess resources helps with loading fluctuations, but often the resource
enhancements that actually contribute to any improvement are hard to pinpoint.
Planners need to understand the sensitivities in their environments: the factors that have the
most influence on behavior. The dynamics of complex systems depend on the
relationships between resources and the changing distance between operating and
inflection points. Understanding sensitivities to user volume, transaction volume, service
mix, and other factors helps focus on the areas where the highest return will be realized.
Identifying the resources that are the first to be over-subscribed and congested enables
specific interventions that produce the largest improvement for the least investment. This
represents another chance to take a short breather before the process begins again. Some
other resource becomes a new problem area, and the planning and evaluation of alternatives
is repeated.
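As a simple illustration of this sensitivity hunting, the sketch below uses a crude open-queueing (M/M/1-style) approximation to show which resource saturates first as transaction volume grows. The resource names and capacities are hypothetical; real planning tools use far more detailed models.

# Hypothetical sketch: find the resource that saturates first as load grows.
# Each resource is treated as a simple M/M/1 queue; the capacity is the maximum
# transactions per second the resource can process. Names and rates are invented.

RESOURCES = {
    "web_server": 800.0,   # transactions per second
    "app_server": 500.0,
    "database": 350.0,
    "wan_link": 650.0,
}

def report(load_tps):
    """Print utilization of each resource's bottleneck and its estimated delay."""
    utilization = {name: load_tps / rate for name, rate in RESOURCES.items()}
    bottleneck = max(utilization, key=utilization.get)
    rate = RESOURCES[bottleneck]
    # M/M/1 mean residence time: 1 / (service rate - arrival rate), valid below saturation
    delay_ms = 1000.0 / (rate - load_tps) if load_tps < rate else float("inf")
    print(f"{load_tps} tps: {bottleneck} at {utilization[bottleneck]:.0%}, "
          f"~{delay_ms:.1f} ms queueing delay")

for load in (200, 300, 340):
    report(load)

# The resource with the smallest capacity is the first to be over-subscribed
print("First resource to upgrade:", min(RESOURCES, key=RESOURCES.get))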

It's important to note that the different constituent domains of the service delivery
environment (software, hardware, servers, networking elements, and people) all scale
somewhat differently. As a result, there's no single capacity planning methodology that
extends across multiple domains. There are also important dependencies among them;
server scaling is obviously a product of the demands of the application or applications that
the server will be running.
Note that more mature applications, such as databases and enterprise resource planning
(ERP), are fairly well characterized, and they have the instrumentation to support analysis
of capacity horizons. Newer applications, such as directories and application servers, are
less straightforward. In both cases, there's a fair amount of literature that addresses
planning for different subsystems. However, given the expertise required for each domain,
effective capacity planning will remain an art of collaboration among such experts for the
foreseeable future.

Summary
Simulation modeling is increasingly important because the current operational complexity
and dynamism overwhelm staff. Models are helpful for predicting future behavior, finding
optimum solutions, and exploring alternatives. Models offer speed and agility compared
with setting up physical test beds.
Test beds are also useful as a reality check for the model; the fit of actual behavior to the
model results improves the confidence in the model and points to areas where tuning or
adjustments are needed.
Capacity planning prevents future disruption caused by insufficient resources.

PART IV
Planning and Implementation of Service Level Management

Chapter 13  ROI: Making the Business Case
Chapter 14  Implementing Service Level Management
Chapter 15  Future Developments

CHAPTER 13
ROI: Making the Business Case


IT groups are under pressure from the various constituent organizations they serve, all of
whom demand better service quality, higher stability, reduced operations costs, and slashed
capital expenditures. As is true for any functional organization, IT today must balance the
external competitive pressures with the internal business pressures.
There is continual pressure to improve the end users' Quality of Experience (QoE). Current
performance and availability levels become the norm, and the pressure to match or exceed
competitors' service quality is unrelenting. There are plenty of ways to spend more money
to improve user experience. Infrastructure capacity must be upgraded to accommodate new
growth and a wider array of services. The appropriate management tools must also be
acquired and enhanced. Additional expenditures for external providers and outside
consulting may also be needed.
At the same time, businesses of every size and type are scrutinizing their IT investments
more closely; they also want to understand the business case for making more investments
in technologies that have often failed to deliver better customer experience or improved
Return on Investment (ROI).
Delivering a strong ROI is a response to internal business pressures to continue driving
costs and delays out of every business process. Technology investments have often failed
to deliver the expected benefits and are naturally eyed with appropriate skepticism. The
process of assessing ROI can be quite helpful in aligning IT with business goals and then
assessing contributions of IT to competitive advantage.
ROI is the tool used by business managers who are removed from the technology details,
especially as purchasing decisions are being centralized. Technology investment decisions
are made at higher levels in the organization, where upper managers attempt to find the least
disruptive ways of trimming budgets and head counts. The ROI analysis undertaken by IT
to justify its decisions must speak the language of business requirements as much as
possible, but without letting go of solid technical foundations. If IT doesn't perform the
ROI analysis, it's very possible that it will be done by someone who doesn't completely
understand the full scope of project benefits. The resulting ROI calculations may be
misleading and may fail to support a project that would actually help the enterprise.

This chapter covers the following topics in detail:
- Impact of ROI on the organization
- A basic ROI model
- Soft benefits
- An ROI case study

Impact of ROI on the Organization


ROI is a commonly used business measure that quantifies the business benefit of
expenditures ranging from an office copier to a completely outsourced network service.
The result of a good ROI analysis is a clear and realistic idea of the potential benefits and
risks associated with a business investment. The ROI for webbed services management
investments has a harder, quantitative side; another, softer, side may have
equally important qualitative contributions.
ROI is becoming a key ingredient in the technology assessment and evaluation phase. I've
spoken recently with several companies that are setting ROI guidelines for the earliest
phases of the process. The message might be, "Don't bring anything to me that doesn't
demonstrate a 25 percent cost reduction within six months." The good news is that
standards for business impact are being integrated into the technology investment process,
resulting in clearer guidelines for selecting products. At the same time, having a target
makes it easier to manipulate the numbers to reach the specified goal.
In this context, selling the ROI is as important as selling the technical approach and
capabilities of the IT group. It's important to address the concerns of key constituents early,
to understand the objectives and issues of boundary organizations, and to draw on their
expertise. Collaborating with your organization's finance professionals, who deal with ROI
every day and have experience interpreting the relevant economic issues, can add
substantial credibility to your proposal.
Note also that there are no hard and fixed rules for ROI; different companies ask different
questions and allow different assumptions. However, whatever the range of
differences, the basic model is the same.

A Basic ROI Model


The basic ROI model has two elements: building the case and quantifying the benefits.
Often, the process for projecting the costs and benefits involves educated guesswork based
on multiple sources, and sometimes just plain guesswork. The assessment process can be very
effective, provided the appropriate data have been collected over the evaluation period.

As shown in Figure 13-1, the goal is to determine when the project has broken even,
delivering additional value that matches the costs. The project begins in the red; there have
been expenses with no recouping of the investment yet. The starting point coordinates are
determined by the purchase and implementation costs (vertical axis) and the deployment
date. As the investment becomes operational, it begins generating business value, such as
increased revenue, reduced costs, or higher service quality.

Figure 13-1 Return on Investment Projections (a graph of cumulative benefit minus cost over time: the project starts in the red at the initial cost, crosses the break-even point, and the interval from deployment to break-even is the Time to Value)
At some point in time, the cumulative benefit value matches the costs and the break-even
point is reached. The term Time to Value is frequently used to describe the time needed to reach
the break-even point; shortening this interval recovers the costs more quickly and
increases the leverage gained from the investment. The slope of the operations curve
indicates the rate of recovery: a steeper slope means a shorter Time to Value and a higher
payback for each succeeding interval.
Each project alternative has its own ROI graph to find its Time to Value and value slope.
For example, the solid line in Figure 13-1 indicates a stronger ROI potential than the dashed
one because the solid line's Time to Value is shorter and its value grows more rapidly over
the same time interval. The investment continues to provide additional value after reaching the
break-even point. Cost savings and deferred spending are direct business benefits that can
be realized throughout a long operational life.
The two ROI lines in Figure 13-1 are linear for simplicity; actual projections may be curved
or have discontinuities because of variations introduced by seasonal behavior, sudden
market shifts, or other events.

ROI graphs for long-duration projects should include calculation of net present value
(NPV). Cash today is worth more than cash tomorrow because cash can be invested and
earn interest or another type of return. Therefore, if $1000 is invested in Web systems today
to obtain $1000 of benefits a year from now, the project is in the red. That $1000 could have
been invested in the financial markets and would have returned more than $1000 over the year.
NPV calculations or similar calculations (such as internal rate of return [IRR]) are used by
financial officers to handle this analysis. They use the time value of money, which is the
return that money can earn if invested, to make the benefit of an investment clear. After all,
investing the money carries much less risk than using that money to improve a Web system.
If the money can earn more through investment than it will bring in if used to make those
Web system improvements, the project may not be worth the effort.
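A minimal sketch of this kind of calculation, with an assumed discount rate and invented cash flows, follows; it reports the NPV along with both the simple and the discounted payback period.

# Hypothetical NPV and payback sketch; all figures are illustrative.

def npv(rate_per_period, cash_flows):
    """Net present value of cash flows; cash_flows[0] occurs immediately."""
    return sum(cf / (1 + rate_per_period) ** t for t, cf in enumerate(cash_flows))

def payback_period(cash_flows, rate_per_period=0.0):
    """First period in which cumulative (optionally discounted) value turns positive."""
    cumulative = 0.0
    for t, cf in enumerate(cash_flows):
        cumulative += cf / (1 + rate_per_period) ** t
        if cumulative >= 0:
            return t
    return None  # never breaks even within the projection

# Month 0: $50,000 implementation cost; months 1-12: $10,000 net benefit per month
flows = [-50_000.0] + [10_000.0] * 12
monthly_rate = 0.08 / 12   # assumed 8 percent annual discount rate

print("NPV:", round(npv(monthly_rate, flows)))
print("Simple payback month:", payback_period(flows))
print("Discounted payback month:", payback_period(flows, monthly_rate))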

The ROI Mission Statement


Any ROI project needs a goal as a starting point. In other words, what problem will this
investment solve? This mission statement sets the perspective as details are fleshed out for
the process. Specific metrics and measurement procedures are determined after the problem
is identified. The metrics and measurements will be used to assess the actual ROI generated.
In addition, the appropriate baseline measurements can be applied to establish the actual
service conditions prior to deployment.
A load-balancing investment, for example, could be assessed by measuring the impact of
the investment on some combination of operational metrics, such as the following:

- Transaction delay
- Reduced number of reported problems caused by overloaded or down servers
- Average server loading
- Number of concurrent connections and transactions supported with acceptable performance
- Reduced support staff workload

An investment in content delivery infrastructure is another example. The metrics that might
be used to assess the ROI potential for content delivery investments (as discussed in the
"Content Distribution" and "Instrumentation of the Server Infrastructure" sections of
Chapter 9, "Managing the Server Infrastructure") are as follows:
- The projected, or measured, bandwidth gain at the network edge. Bandwidth gain is
  used to estimate the bandwidth savings, relative to a centralized distribution system,
  when a content distribution network is used.
- Download time
- Availability


Project Costs
The costs and the time of deployment locate the starting point in Figure 13-1. There are a
number of factors to consider, and each project will need to select and weigh the factors that
apply to that specific evaluation.
Many factors can be incorporated into a cost calculation, including the following:

- Staff time for design, project evaluation, project implementation, and project management
- Consulting time, when needed
- Training
- Product costs (hardware, software, services, and management systems)
- Ongoing maintenance and licensing fees

Some of these elements may be harder to identify and quantify, depending on the specific
organization. For example, the requirements may come out of a planning group that
addresses a range of technology issues. Time spent on evaluations may include actual
testing scenarios, specification reviews, and customer research. The implementation costs
might include some vendor-provided professional services, services from other sources,
training for operations staff, modifications and updates to management tools, or additional
hardware for parallel operation and gradual cut-over.
The cost of maintaining the status quo must also be considered. For example, consider a
key service that generates $12 million in annual revenue. If the average availability is 98
percent, there is a potential $240,000 annual revenue loss (every month the company loses
another $20,000 in potential revenues). Investing $50,000 to raise availability to 99.5 percent is an
attractive option because the improvement generates $180,000 annually in new revenue
opportunities. The Time to Value is less than four months, and each month past the
break-even point adds another $15,000 (potentially) to the revenue stream.
At the same time, calculating the costs of the contributors and the alternatives is not always
black and white. For example, will all the headcount involved in development and
deployment be involved in the project 100 percent of their time? Does an uptime of 98
percent mean that revenue grinds to an absolute halt when the systems are down, or could
there be alternatives that keep the revenue going? For example, orders phoned in or faxed
in instead of submitted over the Internet are still orders. It's useful to apply a little
skepticism when estimating both costs and benefits, as you'll see in the following sections.

Project Benefits
Estimating the benefits from the investment may involve various levels of guessing and
using rules of thumb. One of the most frustrating aspects is trying to find some estimates
that at least bring you closer to understanding the costs and benefits of an investment. One
major question is this: How reasonable are the estimates? Is a 30 percent improvement in

bandwidth utilization within the realm of possibility, or is 15 percent more realistic? Each
estimate changes the slope of the ROI line and shifts the Time to Value.
Many vendors now have ROI calculators that they use as part of the sales process. They
generally embed a set of assumptions about costs and impacts within their calculator. Of
course, the estimates are helpful only to the degree that they match the environment. The
size of the company, the industry segment, and the relative technical maturity are among
the factors influencing the ROI outcomes. Obtaining more detail on the embedded
assumptions, how they were gathered, and the data used to build the model will help calibrate
the results.
I have worked with several vendors to begin building some rules of thumb from early
adopter experiences. The goal is to get some idea of what other customers report about their
experience with implementation, introduction, and day-to-day operations. These rules of
thumb will give potential buyers some additional ways to relate to the usual ROI
information. For example, a potential customer can relate to a rule of thumb that indicates
that similar customers have reduced their staffing levels by 20 percent while improving
service quality. (Some of the rules that can be applied are described in the following
subsections.)
Measurements should be analyzed to determine the actual benefits derived from the
investment. The key is having sound measurements and a starting baseline. Care should be
taken to document performance levels fully and quantitatively before implementation
begins. As the implementation proceeds, it is also important to remain flexible and
incorporate measurements for tracking unexpected outcomes. Following up with the actual
assessment is very helpful for the IT group. It helps them calibrate their own estimates and
refine their ability to project benefits with more accuracy in the future. Equally important
is establishing their credibility with the business managers through accurate projections.
Remember that a business-knowledgeable IT manager should be involved in the analysis to
improve the relevance for other business managers.

Availability Benefits
Changes in availability are usually evaluated in terms of capturing more revenues or
distributing more information to consumers. Identifying the actual revenue rates is often the
most difficult step. Good service instrumentation is necessary to determine the number of
orders and the revenue generated when the system is available. In some cases, however, a
simple calculation (dividing the revenues by the number of operational hours to get an
average revenue rate) may suffice.
After the revenue rate is determined, the rule of thumb is that a 1-percent change in
availability is an additional 7.2 hours per month of potential revenue gain. If your
organization has annual revenues of $12 million, for example, the rate is $1 million
monthly, or almost $1400 per hour. A 1-percent change can produce an additional $10,000
in monthly revenues, or $120,000 on an annual basis. Spending $50,000 to raise availability
by 1 percent gives an ROI Time to Value of five months.
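The arithmetic behind this rule of thumb is easy to capture in a few lines; the sketch below simply reproduces the example figures above and is purely illustrative.

# Illustrative sketch of the availability rule of thumb used above.
HOURS_PER_MONTH = 720            # 24 hours x 30 days

def availability_gain_value(annual_revenue, availability_change_pct):
    """Estimate monthly and annual revenue gained from an availability change."""
    hourly_rate = annual_revenue / 12 / HOURS_PER_MONTH
    extra_hours = HOURS_PER_MONTH * availability_change_pct / 100  # 7.2 hours for 1 percent
    monthly_gain = hourly_rate * extra_hours
    return {"hourly_rate": hourly_rate,
            "extra_hours_per_month": extra_hours,
            "monthly_gain": monthly_gain,
            "annual_gain": monthly_gain * 12}

gain = availability_gain_value(annual_revenue=12_000_000, availability_change_pct=1.0)
investment = 50_000
print(gain)
print("Time to Value (months):", round(investment / gain["monthly_gain"], 1))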

Performance Benefits
Performance is easy to measure for compliance: it is straightforward to determine the
percentage of transactions that complete within the specified response time. Assessing the
business impact is a bit more difficult. A subset of the total transactions actually generates
revenues; the remaining transactions are used for browsing product information, checking
promotions, facilitating customer-managed support, or tracking outstanding orders, among
other possibilities. Identifying the types of transactions and tracking each category may be
required.
Tracking the actual business that is transacted and closed provides deeper insight into the
impacts of any investment. Some metrics that demonstrate the value include the following:

- Change in the completion rate: How many potential revenue-generating transactions
  completed successfully?
- Change in the deal size: Are the orders larger as a result of better service quality?
- Change in unit transaction costs: The investment may allow larger transaction
  volumes without requiring expenditures for additional infrastructure, thereby
  reducing the unit costs of transactions.

Staffing Benefits
Staffing impacts may be important considerations in some situations. For example,
investments in automated tools may allow staff head count reductions or reassignments to
other tasks. The savings include salaries, benefits, and possibly training costs. At other
times, staffing may remain constant while the infrastructure grows over time, improving
staff productivity by managing more elements and services without adding team members.
In the end, the best metric for services management is the change in transaction volumes
relative to staffing levels.

Infrastructure Benefits
Many investments will impact one or more infrastructures. A basic metric is the change in
service flows relative to the infrastructure changes. If an infrastructure handles more service
flows as a result of the investment, its productivity has been improved accordingly. For
example, a company I interviewed was able to increase their transaction flows by 30 percent
without additional spending, translating into a substantial savings compared to scaling the
infrastructure by that amount.

Deployment Benefits
Deploying new services is an ongoing process rather than one of periodic releases. Rapid
deployment is facilitated with load testing and good design practices. Load testing also

helps determine if a new service will degrade the current services mix when it is placed into
production. The usual response is to refuse to deploy any new services that disturb the
normal operational baselines.
Deployment delays can have significant impacts on revenues; a rule of thumb that I use is
a $20,000 revenue loss for every $1 million in annual revenues that the service generates.
Indirect impacts include disappointing customers and losing competitive advantage to other
early movers. Using predeployment load testing and other service-level management
techniques can decrease the probability of deployment delays and, therefore, improve
revenues.

Soft Benefits
Business analyses depend on quantitative results that can be demonstrated and calculated
from data. There are often qualitative results that may be important in certain situations. For
example, many businesses use customer satisfaction surveys to gauge their relationships
with the market. Improvements in customer satisfaction are important, although they may
not be directly connected to specific business metrics. Surveys of internal IT customers may
also indicate improvements after a project is completed.
Improvement in management staff retention frequently results from projects that
implement better management tools and processes. Staff is freed from repetitive lower-level
tasks and can spend more time focusing on more challenging and valuable tasks.
Soft results, while not to be ignored, must be regarded skeptically, particularly because it is
easy to predict optimistic outcomes for productivity enhancements. Working closely with
customers to identify priorities for investment can help sharpen focus on tradeoffs.
Similarly, collaboration with human resources and finance organizations can help establish
consensus and buy-off where numbers alone don't tell the whole story.

ROI Case Study


A company with a simple web site for dispensing product information recently decided to
move more of their business online, in an attempt to expand their access to customers while
reducing the costs of their business transactions. Some of their competitors were already
making this transition, placing further competitive pressures on them.
The company started several parallel projects in an attempt to recoup some of the time lost
to their competitors. The IT group worked on sizing the infrastructures to handle a specified
user and transaction volume, while the application group began designing the new web
applications.
The new web applications were rushed into production quickly, as soon as the logic was
tested and obvious bugs were found and fixed. This "ready, fire, aim!" strategy bypassed
realistic load testing and other end-user QoE evaluation functions, leading to disappointing

results. Initial measurements taken after deployment indicated that the actual customer
volumes and transaction activity were substantially lower than the projections. The initial
response to the disappointing behavior was to consider spending more for additional
computing power and network bandwidth, which is the common response.
Cooler heads prevailed, fortunately, and argued for better measurements to clarify the
situation rather than possibly compounding the problem with misplaced efforts and poor
investments. A set of synthetic transactions established that the server response times and
the network delays were not the problem; in fact, they were very low. These measurements
were helpful; the technology was performing well, although the site wasn't. They
indicated that other alternatives needed investigation.
An engagement with a professional services firm that used web analytics was initiated, and
they collected information on the user behavior within the web applications. The data
clearly illuminated the problems. The major contributor to the disappointing outcome was
that the web application was difficult to navigate, and many customers would simply click
away after becoming frustrated.
The key web pages were those that guided potential customers to the offered products and
services and hopefully converted their interest into actual sales. However, the navigation
paths involved traversing a large number of pages with confusing content and links that
were not obvious. Further discussions indicated some of this was done deliberately so that
other cross-selling opportunities could be presented with each new page. This is the same
annoying dead-end strategy used by many sites that churn out new pop-up ads with each page.
The web applications could be changed fairly quickly because they were built with
JavaBeans and were, therefore, easy to modify. Shorter paths to the key content were
constructed, and the intervening pages were designed with simpler layouts. Upselling was
linked to the key pages rather than adding distractions on the path.
The synthetic transactions were used to verify that the changes did not add any significant
delays to the original baselines. The operational results verified that the web application
design, rather than the underlying technology, was the problem. The number of customers
remained steady for the first six weeks or so. The critical change was that the number of
customers reaching the key pages tripled and generated a 15-percent revenue gain.
The number of customers began to increase, partly by word of mouth and partly from repeat
visits. Over the next six months, the number of customer visits doubled, and revenues per
customer also grew considerably as the application was refined. The impact of a cleaner,
shorter navigation path was that the volume growth was accommodated without any new
infrastructure investments. Much of the computing and networking load was reduced
because customers were not linking through meaningless pages to reach the desired
content.
The ROI evaluation is summarized in Table 13-1. The costs included the professional
services of an analytics firm, the internal staff time to adapt and test the web applications,
and measurement tools. A case could be made that the tools will be used for many other

purposes and shouldn't be expensed solely to this effort. In this situation, however, a simple
analysis was deemed sufficient.
Table 13-1  Case Study ROI Payback Summary

Cost                                      Benefit
Web analytics services      $80,000       Net revenue gain of 15 percent for first six weeks    $54,000
Rework web applications     $55,000       Net revenue gain over next six months                 $2,153,000
Staff time for testing      $7,500
Measurement tools           $22,500
Total                       $165,000      Total                                                 $2,207,000

The benefits included the revenue increases over the initial deployment period and for the
following six months. The hard numbers indicate the project reached a break-even point in
one month (on an annual basis). The project solved the problems and had a definite business
impact.
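A quick back-of-the-envelope check of that payback claim can be scripted from the table figures. The sketch below assumes, purely for illustration, that the six-month benefit accrued evenly month by month.

# Illustrative payback check using the Table 13-1 figures.
total_cost = 80_000 + 55_000 + 7_500 + 22_500        # $165,000
first_period_benefit = 54_000                         # first six weeks
monthly_benefit = 2_153_000 / 6                       # assume even accrual over six months

cumulative = first_period_benefit
months = 0
while cumulative < total_cost:
    months += 1
    cumulative += monthly_benefit

print(f"Approximate break-even: {months} month(s) after the initial six weeks")
print(f"Cumulative benefit at that point: ${cumulative:,.0f}")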
There were other benefits that deserve mention as well. The infrastructure showed a 100-percent
leverage, doubling traffic with no additional investments after the applications were
modified, and the unit costs of transactions were halved as a result.

Summary
Demonstrating the business value of technology investments is becoming an integral part
of most purchasing processes. Being able to find the strongest ROI is as important
as finding the best technical approach. The basic ROI determination is fairly
straightforward: total the costs, determine the benefits, and calculate the Time to Value. As
with many other things that are simple in concept, the details are more complicated. A
major challenge is projecting the benefits before implementation because they will be based
on estimates rather than hard data. After implementation, quantitative data will be
available.
Assessing the potential benefits involves looking at as many different outcomes as
possible: changes in customer visits, transaction volumes, the size of orders, and the
percentage of customers who return for further business.
Scrutiny of significant technology initiatives is being moved higher in most organizations,
usually involving business managers with less technical expertise. ROI is the way they
evaluate and decide on key technology purchases.

CHAPTER 14
Implementing Service Level Management
I have covered many aspects of Service Level Management (SLM) in a webbed world.
Pulling them all together requires a coherent approach to implementation so that service
management can evolve as a system rather than merely comprise a disjoint set of management tools. This chapter presents some ideas for implementing an effective SLM system.
The focus is on the process and a strategy for moving through that process, rather than on
specific technology or product decisions.
The text discusses the following:
- Phased implementation of SLM
- An SLM project implementation plan

Phased Implementation of SLM


Effective SLM has many facets, and an effective process is needed to minimize deployment
problems and to ensure solid service management capability in a variety of environments.
As is true for any systems project, the project implementation plan involves selecting
technologies for each functional area and then selecting a vendor or vendors that meet
specific needs. In addition to technology and vendor selection, there must also be a process
that outlines the steps needed to complete the implementation project.
Implementing an effective SLM system usually requires a phased approach. Using phases
offers several advantages, but the primary advantage is breaking down a set of complex,
expensive, and critical processes into smaller, easier to manage steps. Another advantage of
phases is that there is an opportunity to pause after each phase and determine if changes to
plans are required before proceeding. It also spreads expenditures and gives the
management team a chance to show value early.

Choosing the Initial Project


Successful SLM is the product of deliberate analysis, planning, implementation, and
ongoing learning. It bears reiteration that the first project chosen for SLM will be a test bed
for new methodologies; therefore, the customer must be prepared to accept some new
concepts and some intermittent problems in return for the opportunity to obtain better

service. Service Level Agreement (SLA) wording, definition of metrics, and service level
objectives, along with their statistical treatment, will probably be new to the organization.
Accompanying these will be the need to handle integration and grooming of instrumentation measurements, changed problem management techniques, and service level reporting.
An application that depends on service levels may already exist; for example, many legacy
transaction systems are given priority on internal enterprise networks, and Voice over IP
(VoIP) systems usually are given priority on LANs. These prioritizations are commonly
based on rudimentary packet or frame tagging and on simple priority queues within routers
and switches, usually without a comprehensive system to report on and manage service
levels. Moving one of those applications to a new, more integrated SLM methodology is
probably the smoothest first project. The migration can provide an opportunity for staff who
are already involved with service level techniques to learn the new methods. The staff can
also bring their knowledge of the organization's needs into the initial development of the
new SLM systems.
It's also important to choose a pilot implementation in which end users can review plans at
critical junctures. Some of this is common sense. SLM helps address the needs of users, so
asking their advice can help avoid blind spots. Allowing users to participate, or at least to
observe, also lets them buy into the learning process. If users are part of the process, they
will probably be more forgiving of the inevitable mistakes and delays.
If there is no existing application already using some type of service level technology, it's
probably best to pick an application that uses a limited subset of the enterprise's systems,
instead of trying for a global initial project. The fewer the different subsystems
and providers involved, the less complexity will have to be addressed in this first trial.
Note that a simpler environment also exposes problems more clearly, and the implementation team learns more quickly. However, limitations in scope must be balanced with the
need to detect upcoming implementation problems. Subsystems and providers that are
widely used in the enterprise should be included in one of the earlier SLM projects even if
that increases complexity. One or more of those subsystems or providers might have an
incompatibility with the chosen SLM technologies, and it's better to detect that problem
early, before the momentum for a particular set of SLM technologies has grown.

Incremental Aggregation
It is best to introduce and activate new services in small increments. Each service can be
monitored for a trial period to ensure that baseline service quality and stability are
maintained under a variety of loading conditions. At that point, the service can be
incorporated into early SLM projects, and continuous monitoring for compliance can
become part of the regular management routine.

Further projects can be brought online as soon as the initial set has been proven successful.
This increased use of the service through aggregation of the needs of multiple projects has
advantages both in building on a now-proven SLM and service technology and in
negotiating more favorable terms with service providers. For example, aggregating the
projected bandwidth demands from each business unit into a single acquisition gives the
organization more leverage in obtaining bulk discounts or other benefits. Using the initial
pilot as a teaser for the supplier, with the promise of additional projects from aggregated
additional needs, can provide important negotiating leverage.
This iterative process may appear slower in the beginning, but the phased, gradual approach
is useful. There are usually gaps between the projections of resource requirements and the
actual conditions. Ongoing measurements can be used in the phased approach to determine
if resource adjustments are needed before more services are added.

An SLM Project Implementation Plan


This section of the chapter presents a plan for implementation of SLM on an existing
system; implementation for a new project is similar. These steps can proceed somewhat in
parallel; for example, SLAs can be drafted and refined while instrumentation is collecting
data to be used in setting service level objectives.
The steps of the plan are as follows:
- Census and documentation of the existing system
- Specification of performance metrics
- Instrumentation choices and locations
- Baseline of existing system performance
- Investigation of system performance sensitivities and system tuning
- Construction of SLAs

Census and Documentation of the Existing System


An overall census and documentation of the existing system configuration and capabilities
provides the basic data for filling in the details of any implementation plan. For example, a
typical network census identifies network devices and their connections to other devices.
The census can be partially performed by automated tools; most organizations have a
multitude of automated discovery capabilities. The traditional Simple Network Management
Protocol (SNMP) management platforms provide discovery, as do many reporting and
troubleshooting tools. Manual effort is also needed to ensure that the tools haven't
misinterpreted some configurations and to ensure that unusual network features are
correctly documented.

A services census checks whether devices have service management capabilities; the goal
is to catalog the service management capabilities already in place. In addition to Quality of
Service (QoS)-enabled network devices, elements such as caches, load balancers, and
traffic shapers should be identified.

Specification of Performance Metrics


Metrics are the currency by which the service relationship is conducted. Therefore, they
must be planned for, accounted for, and reviewed on a regular basis. Just as you wouldn't
want to employ a bookkeeper who had only a vague idea about how much money was
coming and going in your business, your metrics need to be precise and focus on the key
contributors that affect the desired outcomes of the services. Like accounting, the art and
science of metrics can range from ad hoc (more than pencil and paper, but not much) to
heavy-duty statistical analysis. When the value invested in the service is high, it's prudent
to make more than a cursory investment in the metrics. These performance metrics (service
level indicators) and service level objectives must be clearly defined in advance to minimize
disputes.
The particular metrics chosen depend on the application, of course. For example,
transactional applications do not usually specify a packet loss metric. Delays due to
retransmissions will affect the response time metric, which is more directly relevant to
transaction end users. In contrast, a certain level of packet loss makes VoIP and some other
interactive services unacceptable because their real-time nature doesn't allow time for
retransmissions of lost packets. Specific recommendations for metrics are detailed in
Chapter 2, "Service Level Management," and in Chapters 8 through 10.
Each service level indicator and its accompanying service level objective must be defined
clearly and unambiguously. For example, transaction response time can be measured as the
time between sending the last request packet and receiving a complete response. Alternatively,
it can be specified as the time between the last request packet and the first response
packet. These two specifications will yield different results, and while neither is necessarily
superior, both parties must agree on which of the two they will use. The degree to which
the metric clearly indicates the end user's experience should be considered in this choice.
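To make the difference concrete, the sketch below times both variants for a single HTTP request using only the Python standard library. The URL is a placeholder, and a real probe would add error handling, retries, and scheduled execution.

# Hypothetical probe contrasting two response-time definitions for one request:
#   A: time until the first byte of the response body has been read
#   B: time until the complete response body has been read
import time
import urllib.request

URL = "http://www.example.com/"   # placeholder target, not a real service endpoint

start = time.perf_counter()
with urllib.request.urlopen(URL, timeout=10) as response:
    response.read(1)                                    # first byte of the body
    first_byte_seconds = time.perf_counter() - start
    response.read()                                     # remainder of the body
    complete_seconds = time.perf_counter() - start

print(f"Time to first response data: {first_byte_seconds * 1000:.1f} ms")
print(f"Time to complete response:   {complete_seconds * 1000:.1f} ms")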
Service metrics often involve synthetic (virtual) transactions. Those synthetic transactions
must be specified and should be included in any agreement. Synthetic transactions must be
updated as the transaction mix changes; provisions for adding new ones to the agreements
should also be addressed. Similarly, the SLA must include them in regularly scheduled
reviews.
Measuring availability also demands unambiguous specifications. For example, a service
customer would characterize an outage as lasting from the time it was first detected until
a customer transaction verifies that the service is available and functioning within the SLA
compliance criteria. Providers who own a subcomponent will tend toward defining and

measuring the outage duration in terms of the service offered, such as the period of time
during which a piece of the network was not functioning. Where the commitments made by
providers do not integrate effectively, end users will perceive a different impact from the
outages than might be indicated by the availability statistics of the underlying
subcomponents.
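As a simple illustration of the customer-oriented definition, the sketch below derives an availability percentage from outage windows recorded from the user's perspective; the timestamps are invented for the example.

# Hypothetical sketch: availability over a reporting period, computed from
# outage windows defined the customer's way (first detection until a
# verifying transaction succeeds).
from datetime import datetime, timedelta

period_start = datetime(2003, 11, 1)
period_end = datetime(2003, 12, 1)

# (first detected, verified restored) pairs; invented sample data
outages = [
    (datetime(2003, 11, 4, 9, 12), datetime(2003, 11, 4, 10, 3)),
    (datetime(2003, 11, 19, 22, 40), datetime(2003, 11, 20, 0, 5)),
]

total = period_end - period_start
downtime = sum((end - start for start, end in outages), timedelta())
availability = 100 * (1 - downtime / total)

print(f"Downtime: {downtime}, availability: {availability:.3f}%")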
It is also necessary to specify measurement validation and any statistical treatments that
should be applied to the data. These should be combined with sampling frequency to ensure
that confidence intervals are acceptable. For example, a critical service might be probed
every five minutes, while those of lesser importance are checked every fifteen minutes. (See
Chapter 2 for detailed discussions.) The increased granularity of more frequent
measurements must be balanced against the additional demand on servers and networks.
More organizations are adding a dynamic specification that shortens the measurement
interval if the metric is trending toward unacceptable values.
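One way to express such a dynamic specification is to make the probe interval a function of how close recent measurements are to the objective. The sketch below shows an illustrative policy of that kind; it does not describe any particular monitoring product.

# Hypothetical adaptive-interval policy: probe more often as the measured
# response time approaches the service level objective.

OBJECTIVE_MS = 2000            # service level objective for response time
NORMAL_INTERVAL_S = 300        # five minutes when comfortably compliant
WATCH_INTERVAL_S = 60          # one minute when trending toward the limit
ALERT_INTERVAL_S = 15          # fifteen seconds when the objective is at risk

def next_interval(recent_samples_ms):
    """Choose the next probe interval from the average of recent samples."""
    if not recent_samples_ms:
        return NORMAL_INTERVAL_S
    average = sum(recent_samples_ms) / len(recent_samples_ms)
    if average >= 0.9 * OBJECTIVE_MS:
        return ALERT_INTERVAL_S
    if average >= 0.7 * OBJECTIVE_MS:
        return WATCH_INTERVAL_S
    return NORMAL_INTERVAL_S

print(next_interval([800, 950, 1020]))    # 300: well within the objective
print(next_interval([1500, 1550, 1600]))  # 60: trending toward the limit
print(next_interval([1900, 1950, 2100]))  # 15: objective at risk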
After the metrics and their measurement procedures are specified, the service level
objectives can be established based on the requirements of the application. In cases where
the performance characteristics of a service are well-established, such as those associated
with a service from a major external supplier, it may be necessary to choose from the
service classes that the supplier offers. For example, an interactive application might be
able to choose among three offered classes of service with three different sets of acceptable
response times and packet losses, as shown in Table 14-1. Major Internet Service Providers
(ISPs) offer service guarantees for transit that's completely on their networks, and some are
offering guarantees for transit to and from endpoints on other networks.
Table 14-1  Examples of Interactive Service Classes

Service Class    Maximum Response Time (in milliseconds [ms])    Maximum Packet Loss
Platinum         60                                               0.25%
Gold             100                                              0.5%
Silver           150                                              2%

Instrumentation Choices and Locations


Getting the proper instrumentation in place is an important step. Note that the
measurements needed for reporting are often somewhat different from the measurements
needed for problem management and performance optimization, and both are needed for
effective SLM. (Detailed discussions are in Chapter 2 and in Chapters 8 through 10.)
Both passive and active measurements are used to gather all the necessary information.
Each is discussed in the following subsections.

Passive Measurements
Passive measurements provide insight into the actual services being used and their volumes
throughout the day. Passive monitors should be placed on the access links from data centers
where they can capture the actual traffic flowing to and from the center. Placing agents near
the organizational boundary also tracks the outbound traffic originating within the
organization.
Remote Monitoring (RMON) agents are passive agents that can provide a rich view of
application flows across the networked infrastructure. Many LAN switches have embedded
RMON agents that can be used. A stand-alone agent can also be used to collect the
information, allowing measurements at sites where a switch does not have an RMON probe
or where the large volume of collected data impacts the switch performance.
Passive agents in servers can also provide information about the applications that are
executing, the numbers of concurrent users, and the time distribution of usage. The server
information can supplement or replace the RMON data. The measurements at the
organization's edge will still be needed to understand the outbound traffic.

Active Measurements
Active measurements are used to build a consistent view of service behavior as seen by the
end users. For example, a set of synthetic transactions can be constructed that are realistic
approximations of actual end-user activity. Performance is tracked by sending synthetic
transactions to the actual site. This offers the performance perspective as experienced by
customers, partners, or suppliers. The measurements are used to alert the local management
team that service disruptions may be threatening sales or business relationships.
Active probes should be placed in multiple locations for the best results. The internal
environment can be as transparent as desired because multiple probes can provide detailed
and granular measurements. There is more flexibility within an infrastructure that the
organization controls and manages; probe placements should be used to provide overall
end-to-end measurements and to break down the components along the path. Performance
of different areas (across the backbone, on the web server, or on the backend server, for
example) gives the detailed data needed for resource planning.
Placing probes at different points in the internal network also gives a broad picture of any
service quality variations that are related to differences between locations. This gives
resource planners a finer level of detail and identifies specific areas that require further
attention.
One caution in distributing active probes across the infrastructure is that overduplication of
measurements should be avoided. For example, it may seem reasonable to place a probe
that monitors an end-to-end service, plus individual probes that monitor each component
of that service. However, when multiple edge services use the same core service, you don't
need to add multiple monitors for the core service as well.

External parties will not expose their internal operations to outsiders; nonetheless, they still
must be measured. Therefore, the main measurement objective for external services is the
identification of the proper demarcation points so that the performance of external parties
(suppliers, partners, or hosted services) can be isolated and measured by appropriate
instrumentation deployed at those points.
Demarcation points close to edge routers connecting to external services are the most
desirable locations for such instrumentation. Active measurements can track the
performance between these demarcation points, and that performance can be used in SLAs
with the external service providers. These measurements can evaluate the delay in provider
networks, the delays at hosting sites, or the delays within a partner environment.
Measuring a provider is simple when both demarcation points are within the same
organization. Placement is not restricted, and the provider delay can be clearly determined.
Negotiating with key partners or suppliers for placing measurement probes is becoming
more common. For security reasons, a business partner may not want any external
equipment connected inside the firewall. If a probe cannot be placed inside a firewall, a
probe located at a demarcation point just outside the firewall can be used to measure the
external network delays.
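The arithmetic for isolating a provider's contribution from demarcation-point measurements is straightforward. The sketch below assumes, for illustration, that probes at each demarcation point report delays for the same synthetic transaction; the probe names and values are invented.

# Hypothetical sketch: attribute end-to-end delay to segments between
# demarcation points. Segment names and delay values are invented.

# Measured delays (ms) for one synthetic transaction
measurements = {
    "client_to_edge_router": 4.0,      # internal segment, own instrumentation
    "edge_to_provider_demarc": 1.5,    # access link to the provider
    "hosting_site_internal": 22.0,     # reported by the hosting partner's probe
    "end_to_end": 78.0,                # client to hosted service
}

attributed = (measurements["client_to_edge_router"]
              + measurements["edge_to_provider_demarc"]
              + measurements["hosting_site_internal"])

provider_network_delay = measurements["end_to_end"] - attributed
print(f"Delay attributable to the provider network: {provider_network_delay:.1f} ms")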

Baseline of Existing System Performance


It's important that SLA specifications be determined not only in the abstract, but also by
baselining real services under real-world conditions. Neither providers nor consumers are
well served by specifications and performance targets that have not undergone a
"shakedown cruise" covering the range of operational extremes.
A baseline of application and service activity defines the actual behavior of the services
before any SLM solution is deployed. This information is used for planning the detailed
steps of the implementation process and in evaluating the success of the implementation.
In other words, the baseline defines the gap between what is wanted and what can be
delivered.

Investigation of System Performance Sensitivities and System Tuning
The agreement to provide specified levels of service quality must be realistic; what is
promised must be matched with the capacity for delivery. The census of the existing system
components, the draft specification of metrics, and the baseline of existing system
performance are therefore used in this step, which tries to understand the costs of meeting
various performance objectives by understanding the sensitivity of the system to changes
in design. Final tuning of the system design and the service objectives may be necessary to
match the delivery capabilities of the system without incurring excessive costs.

Adjustments to the design are indicated when the actual performance isn't sufficient to meet
the objectives. The gap between the desired and the actual performance is a gauge
of the effort and expense needed to bridge it. A small gap may be bridged with a small
upgrade or some simple reorganization of the resources. A small gap may also be resolved
by upgrading a single resource, such as adding a faster server or distributing content to edge
servers to reduce congestion. The census process can help by identifying system
components that can be moved to places where they add more leverage and control while
reducing the need for new purchases.
Larger differences may indicate that an investment in multiple areas, such as network
bandwidth or a faster database server, may be necessary. A balance between larger
investments and the target levels may be considered. Would adding two seconds to the
proposed response-time metric result in lost business or productivity? Would the two
seconds be a good idea if it saved a substantial sum of money?
Granular internal instrumentation is very helpful at this stage. Measurements may directly
identify the main contributors to the delay. If the internal network delays are 10 ms and the
server delays are 6.5 seconds, most leverage is going to be found in improving server
performance (or whatever back-end services are activated).
Capacity of the systems under the actual workload is also a critical consideration.
Acceptable response when loads are light may be deceptive. If some of the services are not
yet deployed, there is more uncertainty about the actual infrastructure capacities.
Estimating expected growth in transactions or users is important to ensure that a new
service management system has the headroom to accommodate growth for some initial
period, all the while staying below the inflection point, which is where performance
becomes nonlinear. Baseline information or data from load testing is very helpful in
building a realistic assessment of the implementation effort, its costs, and its time frames. (See
Chapter 11, "Load Testing," for more information about inflection points and load testing.)
After the service-level and system capacity needs have been determined, the process of
tuning system performance can begin. Having a set of choices increases options and
leverages competition. It also makes the selection process more complex because there will
be several ways of satisfying any particular requirement. For example, money can be spent
directly on the servers to improve processing power or memory, or it can be spent on storage
systems to boost server performance. The expenditures may also be indirect and include
buying load balancers, web server front ends, or content management and delivery systems.
The instrumentation used for baselines can now be used in sensitivity testing as the new
elements are added to the test bed and then to the production environment. It is best to
measure one change at a time to get a better feel for the changes that are making the most
significant contributions.
Instrumentation may also reveal that some other applications carried by the system are
interfering with service performance. For example, some activities, such as playing games
over the network or downloading media files, are not related to business goals and waste

time and resources. At other times, a legitimate service, such as backing up a database, may
interfere with other critical services because it is scheduled incorrectly. Data from
instrumentation may therefore indicate a need for admission control policies and
enforcement. Undesired applications would be barred from using network resources, while
others could be scheduled to reduce their interference with other operations.
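One hedged illustration of such a policy: classify observed traffic and either bar it or reschedule it outside critical hours. The application names and the business-hours window are assumptions invented for the example.

```python
# Assumed policy data: applications barred outright, and those allowed only off-hours.
BARRED = {"network-games", "peer-to-peer-media"}
OFF_HOURS_ONLY = {"database-backup"}
BUSINESS_HOURS = range(8, 18)  # 08:00-17:59, an assumed critical window

def admission_decision(application, hour_of_day):
    """Return 'deny', 'defer', or 'allow' for a flow observed by instrumentation."""
    if application in BARRED:
        return "deny"
    if application in OFF_HOURS_ONLY and hour_of_day in BUSINESS_HOURS:
        return "defer"   # reschedule outside the critical window
    return "allow"

print(admission_decision("database-backup", hour_of_day=14))  # defer
print(admission_decision("network-games", hour_of_day=22))    # deny
```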

Construction of SLAs
The foundation of any service management system is instrumentation, reporting, and a
clearly defined SLA. As part of an SLM implementation, most organizations will
create one set of SLAs with a range of external providers and partners. They will create a
separate internal set of SLAs between the IT group and the business units. All parties gain
from having an SLA; service customers expect to have more control of service quality and
their costs, while providers have clearer investment guidance and can reap premiums for
higher service quality.
Because it specifies the consensus across all parties, the SLA becomes the foundation for
deciding if services are being delivered in a satisfactory fashion. I discussed SLAs in
Chapter 2, and they are reviewed here, along with some additional discussion of SLA
dispute resolution.
An SLA should clearly define the following (a brief sketch of such a definition follows the list):
- The metrics (service level indicators) used to measure service quality
- The service level objectives
- The roles and responsibilities of the service provider and the service customers
- Reporting mechanisms
- Dispute resolution methods
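A minimal sketch of how those elements might be captured as a machine-readable SLA record; the field names, metrics, and thresholds are illustrative assumptions rather than any standard format.

```python
# Illustrative SLA record covering the areas listed above.
sla = {
    "service": "order-entry web service",
    "indicators": ["transaction response time (s)", "availability (%)"],
    "objectives": {"transaction response time (s)": 4.0, "availability (%)": 99.5},
    "roles": {
        "provider": "report indicators monthly; meet objectives; escalate within 30 min",
        "customer": "stay within agreed workload; use defined contact points",
    },
    "reporting": {"interval": "monthly", "review_meeting": "quarterly"},
    "dispute_resolution": "joint simultaneous measurement, then calibration of results",
}

def compliant(measured):
    """Check measured indicator values against the objectives."""
    objectives = sla["objectives"]
    return all(measured[name] <= objectives[name] if "time" in name
               else measured[name] >= objectives[name]
               for name in objectives)

print(compliant({"transaction response time (s)": 3.2, "availability (%)": 99.7}))
```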

There may be other areas, such as determining financial penalties, that must also be
addressed and resolved. This is driven by business considerations: the providers want the
lowest possible exposure to penalties, while the customers want realistic compensation for
service disruptions.
Usually the penalties suit the provider rather than the customer. For instance, the common
remedy for a disruption is a rebate on future bills or a refund. The downside for customers
is that the rebate may be a small fraction of the customer loss, which may run to thousands
of dollars of lost revenue per minute. As discussed in Chapter 2, there are strategies that can
be applied to encourage desired supplier behavior and that can be coupled with risk
insurance, if necessary, to compensate for losses.
Even where legally binding SLAs with external providers are not involved, it's important
to reduce ambiguities to an absolute minimum.
The SLA areas of metrics and service level objectives were discussed in preceding sections;
the other SLA areas are described in the following subsections.

Roles and Responsibilities


Although individual roles within a team can be clarified in conversation, the set of activities
and responsibilities across separate entities is best agreed upon in advance.
The service provider delivers a set of services at a specified service quality and cost. The
provider is also responsible for reporting service level indicators and costs on a regular and
as-needed basis. This may include business process metrics such as help-desk response
times, escalation times, and service activation times.
Responsibilities of the service user, such as adherence to specified workload characteristics
and methods for interacting with the service provider, should also be described.

Reporting Mechanisms and Scheduled Reviews


In my experience, putting service quality reporting in place early is most helpful, especially
in building internal support for SLM expenditures and potential inconvenience during
deployment and cut-over.
There are many reporting tools available. Many management tools have reporting
capabilities as part of the package. These capabilities are usually product-specific,
providing a set of reports about a particular device or server, for example.
Active measurement systems almost always include methods for reporting their results in
a form that can be used for service level indicators. Your choice of tools will depend on the
particular service level indicators included in the SLA, the capabilities of the various
management tools being used, and the statistical manipulation required by the SLA.
All these products or features for reporting service levels share common characteristics. For
instance, they offer a variety of presentation formats, such as bar charts, pie charts, and
histograms, so that the information is in a form that is most useful to the person using the
report.
Reporting tools save administrators time with a set of predefined templates that are ready
to use out-of-the-box. This feature enables administrators to get immediate value without
needing to learn much about the reporting product itself. Templates can be supplemented
with easy customization so that the reports can be tailored as needs change.
The reporting tools must offer various levels of granularity and detail because there are
different sets of consumers. The technical teams usually need more extensive details, such
as a breakdown of each service disruption. They want information defining the initial
indications of a disruption and quantifying the disruption. (How slow was the transaction
relative to compliance guidelines? How long was the duration of the problem?) Upper
management usually wants summaries, such as the numbers of disruptions and the applicable
penalties, but underlying data must be easily accessible.
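The sketch below derives both views from the same hypothetical disruption records: a per-disruption breakdown for the technical team and a one-line summary for management. The record fields and the penalty rule are assumptions.

```python
# Hypothetical disruption records collected by the reporting system.
disruptions = [
    {"start": "2003-11-03 09:14", "minutes": 12, "worst_response_s": 9.8, "objective_s": 4.0},
    {"start": "2003-11-17 14:02", "minutes": 45, "worst_response_s": 21.0, "objective_s": 4.0},
]

# Detailed view for the technical team: one line per disruption.
for d in disruptions:
    over = d["worst_response_s"] / d["objective_s"]
    print(f'{d["start"]}  duration {d["minutes"]:3d} min  worst response {over:.1f}x objective')

# Summary view for upper management: counts and an assumed penalty of $500 per disruption.
penalty_per_disruption = 500
print(f"{len(disruptions)} disruptions, penalties ${len(disruptions) * penalty_per_disruption}")
```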
Reporting becomes a useful ally in building an SLM plan because it helps to set
expectations. Publishing a set of reports on the Web makes the information available to all
internal users, much like traffic reports on the radio. This enables them to see the actual
service quality they are experiencing, and they can track improvements over time.
Published reports reduce the load on the help desk because users can check performance
and other parameters for themselves rather than asking help desk staff for information. User
access to SLA compliance reporting also keeps the management service team accountable
because everyone sees the results.
It's important that there be consistency between the published reports used for determining
SLA compliance and the instant reports available on the Web. However, there may need to
be adjustment of the instant reports because of measurement errors and other problems;
customers must be made aware of that possibility to avoid losing credibility. Credibility
problems can also appear if instant measurements are made available before the reporting
system has been fully tested; spurious reports of service level problems will make the load
on the help desk worse instead of better.
Accountability is enhanced by scheduled, periodic reviews of the service level reports.
Defining who participates in these reviews from each side, how often they are scheduled,
and what material is to be reviewed should go hand in hand with the reporting requirements.

Dispute Resolution
There should be a mechanism defined for resolving disputes because they will inevitably
arise, despite having as many details as possible spelled out in the SLA. Given the
pressures to minimize penalties on the provider's side and the criticality of services on the
customer side, discrepancies in the measurements and their interpretation will be subjected
to substantial scrutiny.
Traditionally, service providers made the measurements themselves and simply
reported them to customers. Today, many customers want to conduct their own
measurements; some feel it keeps the provider honest. In this, the webbed
services industry is just catching up with more traditional industries, in which regular
monitoring of supplier inputs to manufacturing or other elements of the supply chain is a
critical, standard operating procedure.
When provider and consumer are measuring from different points or using different
intervals, the results will not be consistent and will lead to disputes when disruptions occur.
As the financial consequences mount, the probability of disputes rises accordingly.
There is, therefore, a move to use a trusted third party whose measurements are assumed to
be objective. Companies that measure Internet performance, such as Keynote Systems, are
used to verify the performance of the cloud (the integrated set of services). Other
companies, such as Brix Networks, have also been founded to address this specific concern.
Brix places measurement appliances at key demarcation points, collects the performance
information, and analyzes it at a central site. Mercury Interactive, through its Topaz
Managed Services, offers some of the same capabilities.
It benefits both parties when accountability is clearly determined from agreed-upon metrics
and measurements. Any dispute resolution process needs specific steps, such as both parties
simultaneously conducting measurements to determine the differences in the readings. Any
differences can be used to correct and calibrate the results before determining if services
are compliant. Such a process is critical to building a workable relationship between service
providers and service customers.
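A minimal sketch of that calibration step, assuming both parties sample the same transactions at the same times: the median difference between the two sets of readings is treated as a systematic offset and removed before compliance is judged. The sample values are invented.

```python
import statistics

# Simultaneous response-time readings (seconds) for the same transactions; values are invented.
provider_readings = [2.1, 2.4, 2.0, 2.6, 2.3]
customer_readings = [2.5, 2.8, 2.3, 3.0, 2.7]

# Treat the median difference as a systematic measurement offset between the two vantage points.
offset = statistics.median(c - p for c, p in zip(customer_readings, provider_readings))

objective = 3.0
calibrated = [c - offset for c in customer_readings]
print(f"offset {offset:.2f} s; compliant: {all(value <= objective for value in calibrated)}")
```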
In some cases, especially during initial implementations of SLM, it may be necessary to
adjust the service level objectives because it has become apparent that the costs are too
high, or the system is not yet capable of meeting the original targets. The possibility of that
readjustment must be understood by all parties involved in those initial efforts.

Summary
The implementation process depends on solid basics, including crafting a clear and specific
SLA. Getting instrumentation and reporting in place early for internal SLAs helps with
realistic assessments of service quality before implementation begins. Giving internal users
access to service quality reports is also helpful in building support.
A phased approach is usually most effective, starting with smaller efforts and gaining
experience and speed over time. An existing application that uses simple prioritization or
other rudimentary service level management systems is a prime candidate for the first full
SLM project, as are applications that use manageable, but representative, subsets of the
enterprise's total architecture.
Baselining the performance of the existing system and taking a census of all the existing
system's components and their capabilities will help during the system development and
tuning phases as well as during the evaluation of the success of the entire SLM project.
The implementation of full SLM relies on specification of the performance metrics, choice
of instrumentation types and locations, and correct construction of the SLA, including
methods for resolving the disputes that will inevitably occur in the first implementations.

Chapter 15: Future Developments
This final chapter offers some closing thoughts about future directions of Service Level
Management (SLM). SLM is a rich area, and it is evolving in several directions. Topics
covered in this chapter are as follows:

- The demands of speed and dynamism in the new management environment
- The evolution of management systems integration
- Architectural trends for Web management systems
- Business goals for service performance
- The search for the best tools

The Demands of Speed and Dynamism


The unrelenting pressures of speed and dynamism (constant change) will continue to
challenge any organization that depends on the Internet for part of its success and survival.
There will always be an advantage to being quick and responsive.
Speed and dynamism are related, but they are also antithetical to each other. Managing
service quality would be simpler if things just got faster over time. The traditional Systems
Network Architecture (SNA) networks offered by IBM are a good example. The SNA
environment remained relatively constant for years. There was growth in traffic volume and
the number of connected terminals, but the basic topology and traffic flows remained fixed.
Administrators were able to use this consistency to iterate their performance strategies.
They could then refine the strategies and apply them to deliver stable, guaranteed
performance.
Dynamism makes the challenge more difficult because constant changes present a moving
target for planners and administrators. They do not usually have the luxury of working with
situations that have long-term stability. Instead, they are forced into making frequent
tactical decisions, often working from a changed environment and a new set of trade-offs
in every situation.
The emergence of more dynamic service chains, where downstream partners are selected
in real time, will force management systems to follow suit. The management systems must
dynamically identify each other and establish the rules for communicating. These
management system relationships might last indefinitely or endure for time periods as brief
as the duration of a single flow. For example, a downstream shipping service might be
dynamically selected based on geographic coverage, delivery schedules, and cost. The
management system needs to communicate with the selected shipper's management system
to detect and help diagnose any problems with transactions involving the shipping service.
The emergence of routing optimization products and services offers multi-homed sites
another example of a dynamic service chain. Using these route control products, customers
can select Internet service providers (ISPs) in real time, measuring their performance and
comparing costs. The basic interactions addressing service disruptions will be
supplemented with additional requests for information or new measurements. Other
interchanges will focus on reporting trends that indicate a drift toward SLA noncompliance.
Efficient providers are rewarded with more traffic and more revenues.
Real-time, customer-to-provider interactions will become more important because
customers are always eager to trim their costs and deliver their services more effectively.
These factors will continue to push customers and their providers toward real-time
exchanges that give the customers increasing control of the resources they buy from the
provider.
The term customer network management has been used to describe systems that enable
customers to make real-time requests for a range of network services. With customer
network management, customers can submit trouble tickets, track their status, obtain
billing and usage information, and change service metrics. The real-time interaction can
speed problem resolution and enable customers to adjust their bandwidth consumption to
their current and projected demands. Service providers also benet because customers are
taking over tasks that were previously handled by provider staff. Service providers can
leverage more specialized offerings to strengthen their competitive differentiation.
One obstacle to greater customer participation in services delivery is a distinct lack of
integration among the various back-office systems. (These systems include provisioning,
billing, order tracking, and capacity planning.) Many new service rollouts, such as digital
subscriber line (DSL), overwhelmed the providers and their systems, resulting in long
delays and many installation and activation errors. Methods for customer participation in
service allocation, accounting, and management exist today in the switched networks used
by traditional telephony service providers; as the economics filter to routed networks, the
tools and infrastructure will have to catch up.
Customer management systems interact with management systems from business partners,
providers, customers, and suppliers. These interactions are between management systems
and are different from the business application interactions that are usually defined as
business-to-business flows. Different management systems need to exchange information
during regular operations, and especially when service disruptions occur.
Because of the complex, dynamic, and rapidly changing web of interactions, adaptable
management products are needed. These are tools that discover changes in the managed
environment, update their information, and continue operating without staff involvement.
The pressures of intense competition, constant change, and growing criticality have
outstripped the capacity for hands-on maintenance of service management tools.
If management tools depend on manual entry of information or rules in response to changes
in the current environment, they won't be able to keep pace. Organizations cannot afford
time delays while they update their tools. They also cannot afford incorrect analyses caused
by outdated tools or problems caused by tools failing to provide information in a timely
manner.
A simple change, such as assigning an application to a different server, might take no more
than a minute, but it can consume hours of staff time to update the management tools if they
must be updated manually. The combination of complexity, stringent time pressures, and
stiffer penalties for noncompliance is forcing management systems to become more
automated with more sophisticated analysis and responses.
RiverSoft introduced its Network Management Operating System (NMOS) to address this
problem. (RiverSoft is now part of Micromuse, and NMOS has been integrated into the
Netcool product line.) NMOS periodically checks the network infrastructure for changes.
When changes are detected, it updates its information accordingly. It then adjusts the
correlation information to reect new connectivity and new dependencies. Other products
are taking an intermediate step of detecting and reporting changes, leaving the adjustments
to the staff. This feature will be a key differentiator, especially for those organizations that
are struggling to stay on top of their own environments.
A similar problem for administrators is the need to understand the relationships between
service flows and underlying infrastructures, which is difficult if those relationships keep
changing. Without knowledge of the relationships, administrators must take corrective
actions without understanding their impacts on the business. They may restore a device,
adjust a route, or change access parameters, among other tasks, without knowing if their
actions made any difference in the organization's business outcomes. Service management
tools must be able to present business-related information to administrators before they
make such management decisions.
Mapping these relationships automatically is not an easy task, and the relationships
between service flows and underlying infrastructures must be shown in both directions:
from flows to elements and from elements to flows. Most management vendors offer a
partial solution supporting automatic discovery of elements and applications, leaving the
identification of relationships to the staff. This helps, but it falls short of what is actually
needed.
Dynamism makes the problem even more difficult because continuous changes increase the
risk of using outdated information. Some companies I have interviewed are introducing
new services weekly, constantly reallocating servers to applications as loads shift, and
altering bandwidth assignments in (near) real-time. Perhaps a new approach will be needed
that enables applications to register themselves when they are activated. A management
tool could receive the registration and update the information accordingly.
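A sketch of that registration idea, assuming a management tool that accepts simple registration records; the record fields and the in-memory registry are illustrative only.

```python
import datetime

# Stand-in for the management tool's registry of active applications and their resources.
registry = {}

def register_application(name, servers, service_flow):
    """Called by an application at activation so the management tool can update its model."""
    registry[name] = {
        "servers": servers,
        "service_flow": service_flow,
        "registered_at": datetime.datetime.now().isoformat(timespec="seconds"),
    }

register_application("order-entry", servers=["app03", "app07"], service_flow="web storefront")
print(registry["order-entry"])
```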

Evolution of Management Systems Integration


Integration across application and network levels, and among management systems at the
same level, is clearly a key to the evolution of management systems. Unfortunately,
integration has been one of the most abused words in the systems management
marketplace. Many management vendors have positioned their offerings as integrated
solutions, but too often their capabilities fall far short of what is actually needed by
administrators. Most early products included only the least valuable capabilities and were
then marketed strenuously.
This section looks at the types of integration, from the simplest to the most comprehensive,
as a background for discussions of future developments.

Superficial Integration
The two most superficial forms of integration are integration on the glass and integration
on a system. Integration on the glass means that there is a consistent look and feel to the
management tools on the platform. Consistency is helpful because it reduces training and
simplifies many tasks, but integration on the glass is of limited value after the training
savings are realized.
The first Simple Network Management Protocol (SNMP) management tools used a
dedicated computer system, an approach called integration on a system. This can be a
wasteful approach, especially when the demands on the server are low. Early SNMP platforms
made a virtue of this fact by marketing several tools sharing a single server as a form of
integration. These platforms used their event management functions to launch a specific
tool whenever criteria called out by a set of rules were satisfied. Such sharing does save
hardware costs, and keeping it all local to one server simplifies some of the tool-launching
logic. However, after those efficiencies are realized, the value is limited.

Data Integration
Data integration has long been touted as a breakthrough that will solve management
challenges arising when a set of tools is needed to restore service levels. When one tool
cannot share its information in a straightforward way with others, staff time must be
expended to close the gap. The usual process has entailed using a tool, getting its output,
and entering that as input for another management tool. Involving staff also raises the
possibility of errors being introduced as information is moved manually between tools.
Extensible Markup Language (XML) is emerging as the preferred way of attaining a level
of data sharing and integration. XML-parsing technology and document creation tools are
readily available, simplifying the transformation between local and standard
representations. In practice, selection of a common schema remains a challenge. All the
different parties sharing data must have a common way of interpreting the tagged and
structured information inside an XML document. Within an enterprise, such interfaces can
be handled with local standards because documents are shared in a single organization.
However, sharing information across organizational boundaries, or in the absence of strong
standards, can be more of a struggle because there's no guarantee that the schema used by
each party is compatible.
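As a small illustration of the tagging problem, the sketch below builds and re-parses a measurement document with Python's standard xml.etree.ElementTree; the element and attribute names form an assumed local schema that both parties would have to agree on.

```python
import xml.etree.ElementTree as ET

# Build a measurement document using an assumed local schema.
root = ET.Element("measurement", service="order-entry", source="probe-ny-01")
ET.SubElement(root, "indicator", name="responseTimeSeconds").text = "3.2"
ET.SubElement(root, "indicator", name="availabilityPercent").text = "99.7"
document = ET.tostring(root, encoding="unicode")

# A receiving tool can only interpret the tags if it shares the same schema conventions.
parsed = ET.fromstring(document)
for indicator in parsed.findall("indicator"):
    print(parsed.get("service"), indicator.get("name"), indicator.text)
```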
The work of the Distributed Management Task Force (DMTF) may be very helpful here.
This standards body has defined the Common Information Model (CIM) and an encoding
scheme for XML documents. However, this standard has not yet clearly demonstrated its
value. Economic realities tend to work against spending in support of standards that may
not demonstrate an immediate payback.
One outcome of the emerging focus on XML is a shift away from efforts to create a single
management information repository. Projects that attempt to define and implement such a
single repository for management information almost always fail, for reasons that are clear,
especially in hindsight.
The first barrier for unified management information repositories has been the schema.
Every management vendor has its own internal schema and prefers to impose it as the
industry standard. Competitors understand the advantages of having a proprietary schema,
which ensures lock-in for their products; the resulting deadlocks often lead to early failure.
Another barrier was the relative immaturity of distributed database technologies. Problems
of keeping information fresh, arbitrating concurrent updates, saving information, and
providing easy database backup and recovery often made any effort look impractical. It
appeared that the foundation technology was not ready for prime time.
Distributed database technologies have matured, but the reluctance to reengineer
fragmented databases is still strong. Adoption of a monolithic management repository
requires extensive changes in almost all organizations. The reality is that there are many
databases within an organization that hold critical management information. In some
cases, databases are separate for legal or regulatory reasons; in all cases, organizations are
reluctant to reorganize their databases.
Rather than focus on the data store, XML facilitates data exchange with a protocol for
documents, using a defined encoding scheme. Schema descriptions and presentation
information can also be appended to documents. That is what makes XML a strong
alternative to the repository concept for data integration.

Event Integration
XML document exchanges are sufficient for ongoing data communication between
management tools. Such exchanges are essentially synchronous: a tool receives a message
and responds to it. However, that is only part of the answer; event integration is also
required, because management tools need asynchronous communication with other parts of
the management system.
Event integration enables a management tool to signal asynchronously and activate another
part of the management system when a specific event occurs. Such integration must be
bidirectional; events can flow in either direction as determined by the specific needs of any
management task. Each party must be able to understand the event so that it can take the
appropriate action. As more events are encoded as XML documents, XML can simplify the
integration process and leverage data-integration methods for use in event integration.
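A toy illustration of bidirectional event integration: each tool registers a handler with a shared dispatcher, and an event raised by one tool triggers an action in another. The event names and handlers are assumptions, not any vendor's API.

```python
from collections import defaultdict

# Minimal event dispatcher shared by the management tools.
handlers = defaultdict(list)

def subscribe(event_type, handler):
    handlers[event_type].append(handler)

def publish(event_type, **details):
    for handler in handlers[event_type]:
        handler(details)

# A reporting tool reacts to threshold events raised by a measurement tool, and vice versa.
subscribe("threshold.exceeded", lambda e: print("open trouble ticket for", e["service"]))
subscribe("ticket.closed", lambda e: print("resume normal polling for", e["service"]))

publish("threshold.exceeded", service="order-entry", metric="response time")
publish("ticket.closed", service="order-entry")
```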

Process Integration
Solid data and event integration enables administrators to build management processes,
which are automated sequences of tool functions coordinated and controlled by a
process manager. Process integration offers high value to administrators because each
automated process saves staff effort and expense each time its triggering situation occurs.

Architectural Trends for Web Management Systems


Managing service flows in a webbed environment is a multi-faceted challenge, as you have
seen throughout the preceding chapters. The market is cluttered with offerings associated
with SLM and its variations, using phrases such as Application Performance Management,
Service Assurance, Quality of Service, and Quality of Experience. There are traditional
SNMP management platforms attempting to maintain their positions in a fluid management
world and vying with purpose-built Java services platforms. In addition, innovative startups
offer new solutions, while more mature management software vendors are acquiring new
companies to broaden their portfolios. The multitude of forces creates some uncertainty and
also opportunities to create new solutions.
Most environments have sets of management tools that address the specific needs of
element and infrastructure management. Other tools are oriented toward SLM, but they also
tend to be designed to operate in isolation. Integration remains the critical concern; various
emerging strategies were discussed in the preceding section of this chapter. In all cases, sets
of tools must be integrated to gain the process-level integration needed for effective
real-time management processes.
The traditional management environment was platform-based. An enterprise or a service
provider went through a lengthy evaluation, selected a platform, and spent a long time and
a lot of money customizing and configuring it (and frequently had disappointing results
after implementation). Many tools were selected for their ability to integrate with the
selected platform instead of for their functionality. Other management tools were acquired
for their functionality or to address a hole in platform coverage.
A different strategy is needed for today's SLM systems; the Web, with its supporting
services, is a good model of a strategy that can work. Such a strategy uses the same
architecture as many of the services it manages. It can scale using the network, and it
evolves with new breakthroughs in Web service design and deployment.
The webbed services environment can be the basis for building effective SLM solutions.
Many of the pieces are now available, and the benefits of a new approach are compelling.
The Web is based on exploiting the power of loosely coupled systems that interact in many
different ways to create a variety of services. That same loosely coupled approach can be
applied to the architecture of a services management system and to the processes performed
by that architecture; proposals for those designs are discussed in the following subsections.

Loosely Coupled Service-Management Systems Architecture


Management tools and tool clusters address fundamental functions, such as root-cause
analysis, correlation, traffic adjustments, and content management. Tools are single-function modules, whereas a tool cluster is an integrated set of related tools from a single
vendor or a set of closely collaborating vendors. A tool cluster can be multifunctional; for
example, it might have its own discovery, event management, instrumentation, root-cause
analysis, and reporting functions already integrated into a functional package. Tool clusters
may simplify some integration tasks because the vendor has (ideally) integrated its own
products beyond the superficial integration levels discussed earlier in this chapter.
A loosely coupled service-management system depends on sets of process managers,
which coordinate sets of tools and tool clusters. These process managers and their
underlying tools may be organized in a loosely coupled web of clustered processes that
communicate by using signaling and messaging, as discussed in the following sections.

Process Managers
Process managers oversee a management process by ensuring that all lower-level tools and
processes carry out their tasks successfully. They organize information and oversee
portions of the managed environment. The process managers are higher-level tools or
functions that coordinate tools and tool clusters; they also communicate with other process
managers. The process manager organizes the collected information and determines if its
task is complete; if it is, the process manager reports to a higher-level process manager.
When the task is not complete, the process manager initiates further activities or reports a
failure.
The process manager needs logic to analyze the incoming information and make the
appropriate decisions. It takes different steps depending upon the analysis; for example, the
process manager might request further detailed measurements, access other information
sources, or use different tools as its analysis dictates. Process managers may also have
correlation, policy, and presentation functions.
Correlation is important when determining a root cause or trying to understand the
interactions among different parts of the managed environment. The ability to correlate
across different infrastructures is a high-value capability.
The process manager might set or modify system policies while it collects information it
needs; in addition, it may adjust other parts of the managed environment. Note that some
process managers might focus primarily on overseeing policy-based operations.
Presentation is also a key function. A complex and dynamic environment is challenging to
manage, and it is also a challenge to organize and present information that is useful. Useful
information must be presented in a way that enables a human to gain an understanding of
the situation quickly. This function must be very flexible because different people will
respond to different types of presentation formats.

Clustering and the Webbed Architecture


A loosely coupled hierarchy enables administrators to add and change functional
components without causing unnecessary readjustment of the rest of the management
system. The flexibility of loosely coupled systems enables a range of functional
management structures that suits each organization's needs.
Complexity and large quantities of data often slow down the analysis tools in service
management systems. However, the webbed architecture offers easy scaling, performance
tuning, high availability, and flexibility. New processors can be added as demands grow,
and more speed can be gained through parallelism, which is activating all the tools
simultaneously rather than serially. The environment can also adapt in the other direction
as well; functions can be consolidated on the same hardware if demands shift. In addition,
redundancy can be used to meet high availability goals.
A webbed environment means that resources (such as tools, tool clusters, and process
managers) can be located anywhere, and they can be organized in a variety of ways.
However, using multiple instances of tools, tool clusters, and process managers increases
the burden of information management. Keeping information fresh at multiple locations
can incur delays inherent in moving information across long distances. Trade-offs among
simplicity, availability, and overhead must be evaluated as the service management system
is constructed.

Integrating the Components with Signaling and Messaging


Data and event integration are needed to enable tools, tool clusters, and process managers
to exchange information and to signal each other. Signaling controls the sequence of tool
and tool cluster usage within a management process. Each tool signals its successor, thus
ensuring that process steps are in the proper sequence.
Today's event managers can be extended into general signaling engines to integrate
management system components. Events are generated and used to trigger other actions
through the event manager itself. The event engine can be used by any component capable
of generating the appropriate events to trigger further actions. Additional flexibility enables
richer management processes by selecting tools depending on the outcome of current
activities.
Instead of extending the signaling function of today's event managers, it is possible to use
message queuing systems to integrate management system components. Message queuing
software is offered by companies such as IBM (WebSphere MQ) and TIBCO
(ActiveEnterprise). These messaging products provide the means for different applications
on different computer systems to exchange information in a controlled way. Such
messaging platforms have been implemented as backbones in complex inter-application
environments, such as brokerages and other financial services organizations.
The information can be exchanged in the form of XML documents, which the parties are
responsible for transforming into locally useful forms. The messaging software handles the
other aspects of application-to-application communications. It handles synchronization,
queuing, backpressure or flow control, status reports, and other matters that smooth the
exchange of the XML documents.
This combination of messaging and XML constitutes a strong foundation for integrating a
set of management tools. The XML documents provide the information, and the messaging
software handles efficient exchanges and signaling between applications. Management
tools can now be distributed across several servers, if desired. This flexibility enables
administrators to link their management tools into sequences that define management
processes. This level of integration offers substantial value to the management teams.
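The sketch below imitates that pattern with Python's standard queue module standing in for a commercial messaging backbone: one tool places an XML document on a queue and another consumes and parses it. The queue, document contents, and tool roles are all assumptions for illustration.

```python
import queue
import xml.etree.ElementTree as ET

# A local queue stands in for the messaging backbone between two management tools.
bus = queue.Queue()

# Producing tool: publish an event as an XML document.
event = ET.Element("event", type="sla.drift", service="order-entry")
ET.SubElement(event, "detail").text = "response time trending toward noncompliance"
bus.put(ET.tostring(event, encoding="unicode"))

# Consuming tool: receive the document, transform it into a locally useful form, and act.
received = ET.fromstring(bus.get())
print(received.get("type"), received.get("service"), received.find("detail").text)
```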
XML will be the major means of data sharing between management tools, especially those
from different vendors. XML has already achieved a strong foothold in many products, and
vendors are using it as an internal integration tool for their own products. This trend will
accelerate, especially because XML makes absorbing products from mergers and
acquisitions easier as well.

Loosely Coupled Service-Management Processes


Traditional management architectures revolved around a single platform that determined
the conventions and standards for all tools that could be integrated into that framework.
Newer management architectures will be organized around management processes and
tasks rather than around a single platform.
Triage provides an example of an alternative, process-centric structure. Triage is the
process of determining the responsible infrastructure and organizational group as a first step
in handling a service disruption.
In the case of poor performance at an end-user location, the triage process manager could
initiate a set of subprocesses to determine which infrastructure is the likely cause of the
problem. First, it would initiate other subprocesses to determine if the servers, content
delivery, applications, or other infrastructures have a role in the service disruption. Each
subprocess could, in turn, activate other tools to carry out its responsibilities. Instead of a
central management platform needing to know all the details of each server element, the
element managers themselves would concentrate on knowing the details of their particular
subsystem and would respond to status requests from the triage process manager.
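A compact sketch of that triage flow: the process manager polls each infrastructure's status subprocess in turn and signals the first group that reports a problem. The infrastructure names and status functions are stand-ins, not a real product interface.

```python
# Stand-in status subprocesses; each would normally query an element or domain manager.
def servers_ok():            return True
def content_delivery_ok():   return True
def applications_ok():       return False
def transport_ok():          return True

SUBPROCESSES = {
    "server infrastructure": servers_ok,
    "content delivery": content_delivery_ok,
    "applications": applications_ok,
    "transport infrastructure": transport_ok,
}

def triage():
    """Return the first infrastructure whose status subprocess reports a problem."""
    for infrastructure, status in SUBPROCESSES.items():
        if not status():
            return infrastructure
    return None

print("signal specialists for:", triage())  # applications, in this invented scenario
```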
If the server infrastructures do not report any problems, the triage process manager could
turn to the transport infrastructure. For instance, it could ask transport infrastructure service
managers to initiate end-to-end measurements to determine basic delays, packet loss, or
other relevant metrics. The end-to-end testing tool may need to access instrumentation
information to locate the appropriate probes to activate for the measurements.
If the end-to-end measurements indicate that further investigation is warranted, additional
tools could be brought into action. If the servers are multi-homed, one example of an
investigation into network problems might be a check of ISP performance using
measurement data accumulated by the routing optimization system that's managing the
selection of ISPs. If an external network is determined to be the problem, synthetic
transactions or another testing method could be initiated to probe the external network and
determine when it resumes operating within the range that does not further threaten service
quality.
The loosely coupled, process-oriented approach enables administrators to focus on the
steps they need to follow to achieve a management result, instead of trying to address
management strictly in the context of a platform or element.
This approach will help IT groups manage the responsibilities and cooperation of each key
process team. If, for example, a triage operation within the services group identifies the
transport infrastructure as the likely cause of the disruption, the appropriate transport
specialists and processes can be automatically signaled to resolve the problem and restore
service levels. The technical means to integrate the support groups can come from the
emerging messaging and signaling functions discussed previously in this chapter. They
enable either team to signal the other and activate the appropriate processes.

Business Goals for Service Performance


Measuring the download time for a web page shows only the basic technical performance
of the service. The measurement includes the time to process the request, take the
appropriate action, and move information to the requestor. However, a business manager
needs to consider other issues of business performance as well.
Services are designed to distribute information, enhance Internet brand awareness, sell
products, reduce costs, and sustain competitive advantage, among other objectives. Some
of these business metrics, such as revenues, are easily measured and quantified. Customer
satisfaction or brand awareness is harder to quantify, but these aspects might also be keys
to long-term success.
Service performance management must therefore include business factors, even though that
is a challenge for most organizations.
A company I have spoken with is instituting business performance metrics as part of the
performance improvement process. They are attempting to discover and incorporate
business goals into the measurement systems to provide explicit measurements of success
in meeting those goals.

Finding the Best Tools


The search for better tools has centered on an ongoing question: Do you buy the best-of-breed tools and suffer integration problems, or do you buy an integrated set of adequate
tools? Do you go for the best functionality, or do you settle for something with less
functionality that is easier to deploy and operate?
Current industry trends give you help in answering these questions. Considerable
consolidation is occurring within the management industry. Economic difficulties are
forcing smaller companies to look at being acquired as a means of ensuring survival. Larger
companies may gain the advantages of faster time to market and a broader portfolio of
products when they acquire a startup instead of developing new products themselves.
Larger companies also see that innovative startups are much less expensive to acquire than
they were in the late 1990s.
Customers often feel that they have to decide whether to bet on a small innovative startup
that may not survive or on a large company that will be there to support them, but that offers
less functionality (often at higher prices).
Large companies with aggressive acquisition strategies blend the best of both alternatives.
Startups give large companies innovation more quickly and cheaply than internal
development. In return, the startups get the financial resources, sales, support, brand
recognition, and customer base that can move large amounts of product to the marketplace.
Service management has seen such market consolidation as larger companies fill out their
portfolios by acquiring smaller niche players.
Faced with such turbulence among management tool suppliers, what is your best choice?
Well, I have been advising clients to look for innovation as one major factor. Innovation is
needed to reduce costs, redirect staff, and maintain high service quality. Fortunately, there
are often several companies offering new capabilities and similar innovations, usually
because the next problem to be solved becomes fairly obvious to the industry as a whole.
Partnering with a small, innovative company also gives you the chance to influence its
development in a direction that meets your needs.
The next step is to evaluate the viability of the candidates: who has the funding or board
contacts to survive? Another way of assessing viability is to look at alliances. Innovative
startups that have distribution, cross-selling, or other collaborations with large management
vendors are more viable because they can move large amounts of products, and the large
players tend to acquire companies with which they are already working.
Using this approach to vendor selection reduces the exposure to a startup that sells its
products and then closes its doors, leaving customers without support and a future growth
path. Balancing innovation with survivability is the new art of buying products.

Summary
SLM is not only necessary; it is vital in the webbed world that we inhabit. The rapid
change and dynamic service chains that are characteristic of newer Web-based services are
forcing development of new management systems that can cope with that change
automatically, instead of relying on manual configuration and manual system management.
Originally, management systems had simplistic integration; although all the management
tools might reside on the same platform and bear a superficial similarity, operators still had
to move data manually from one tool to the next. Errors were frequent, and automation was
problematic.
More effective and innovative tools are coming. Integration is becoming deeper,
at the level of data integration, event integration, and process integration. Events detected
by one tool will automatically trigger process initiation and automated sequences in a
completely different tool, using shared data. These tools can be loosely coupled, which
gives great flexibility in tool location and organization. Tools will no longer need to share
a single platform; instead, for example, they could communicate with each other using
message-based queuing and XML.
The service management industry is undergoing consolidation, and large companies are
aggressively acquiring smaller startups. When you need to choose a supplier, you should
focus on innovative companies that have a good path to viability, through their own assets
or partnering.
SLM is now feasible, although not as simple as we would like. Nonetheless, an exciting and
challenging world is emerging, and SLM will be a key enabler of the potential of the
Internet.

INDEX

A
abandoned shopping carts, 151
abandonment, 198
accelerators, SSL, 165
access links, passive monitoring, 236
accountability, 241
accuracy of root-cause analysis, 107
ACE (Application Characterization Environment), 213
actions, 113-116
activation/deactivation time of management tools, 95
active collection, 75-78
active customers, 150
active measurements, 236
active monitoring, 75
activity baselines, 149
adaptive instrumentation, 77, 87
addresses, IP, 122
administration
aggregators, 72
applications
effect of organizational structures,
146
infrastructure, 145
instrumentation, 157, 161
metrics, 147, 152
operational environments, 146
time lines, 147
transaction response time, 152, 157
complexity, 130
CRM, 204

demarcation points, 189


digital certificates, 166
events, 8182, 85
applying Micromuse, 9799
reducing noise, 85, 97
instrumentation, 53
components, 68, 73
modes, 65
time slices, 67
trip wires, 6667
new technologies, 31, 34
NMOS, 247
phased implementations, 231
incremental aggregation, 232-233
initial project selection, 231-232
planning, 233-242
policy-based, 129-130
architecture, 133, 136
design, 136-138
elements, 131-132
need for, 130-131
products, 139-142
service-centric, 132-133
problem management metrics, 33
real-time operations, 101
automated responses, 113, 116
brownouts, 110
commercial, 116, 127
proactive, 112-113
reactive, 103-104
root-cause analysis, 107-111
triage, 104, 107
virtualized resources, 110-111
real-time service metrics, 33


services, 6365
SLM
components, 15
overview of, 917
SNMP, 248
systems integration, 248250
technologies, 91
tools, 95, 255
transport infrastructure, 177
data flow control, 188-191
metrics, 178, 181
QoS, 181, 188
Web system architecture, 250, 254
agents, 29, 236
aggregation, 72
behavior, 62
incremental, 232233
measurement, 2627
monitoring, 69
alarms, de-duplication, 87. See also alerts
alerts
coordinating, 96
event management, 82, 85
prioritization of, 94
processing, 86
raw, 81
reducing noise, 85, 97
reliability, 83
trip wires, 6667
verifying, 88
algorithms, slow start, 155
alerts, filtering, 89

analysis
FMEA, 136
process managers, 251
root-cause, 107
statistical, 2930
API (application programming interface), 96
Application Characterization Environment
(ACE), 213
application program interface (API), 96
applications, 117
baselining, 237
development teams, 146147
existing, 232
infrastructure, 145
instrumentation, 6163
legacy, 160
management
effect of organizational structures,
146
instrumentation, 157, 161
metrics, 147, 152
operational environments, 146
time lines, 147
transaction response time, 152, 157
management systems integration, 248250
Netuitive, 120
network-aware, 156
ProactiveNet, 117
servers, 42
Arbor Networks, 125
architecture, 41
delivery of Web services, 4244
design, 45
drivers, 4852
environment evolution, 4546
heterogeneous systems, 4648
example of, 5256
instrumentation, 52
policies, 133, 136
servers, 163, 174
SNA, 245
Web management systems, 250, 254
Web services, 163
arrival rates, 198
artifacts
alerts, 84
correlation, 90
eliminating, 88
reducing, 28, 72, 88
assessment of
headroom, 115
local impact, 114
association, 51, 93
asymmetric routes, 180
ATM (Asynchronous Transfer Mode), 178
attacks
Arbor Networks, 125
DDoS, 121, 124
SYN Flood, 121
attributes, 51
objects, 93
policies, 137
audio, 170
auditing policies, 138
authentication, 165
authoritative DNS servers, 43

automation
defenses, 124
operations, 55
policy-based management, 129130
architecture, 133, 136
design, 136, 138
elements, 131132
need for, 130131
products, 139, 142
service-centric, 132133
responses, 113, 116
availability, 19, 21
ROI, 224
transport services, 179

B
B2B (business to business), 67
B2C (business to consumer), 78
B2E (business to employee), 8
back-office operations, 56
bandwidth
over-provisioning, 188
traffic-shaping QoS, 185
transport services, 178
baselines
activity, 149
monitoring, 112
performance, 237
ProactiveNet, 117
revenue, 150
time slices, 67
beds, 200203
behavior
customer behavior measurements, 149
Netuitive, 120
predicting, 112
services, 62
benchmarks, load testing, 199200
best effort services, 18
BGP (Border Gateway Protocol), 190
billing, 56
boosting signals, 85, 97
bootstrapping, 25
Border Gateway Protocol (BGP), 190
bottom-up integration, 92
boundaries, elastic, 49
Brix Networks, 28
brownouts, 110111
buffering, 124
dejitter, 154
jitter, 181
building
automated responses, 116
simulation modeling, 211, 213
business to business. See B2B
business to consumer. See B2C
business to employee. See B2E
businesses
e-business, 56
B2B, 67
B2C, 78
B2E, 8
goals for performance, 254
measurements, 150
process metrics, 31, 34
ROI, 219, 228

C
caches, 43, 157, 168169
instrumentation, 172
server-side, 43
calculations
confidence intervals, 25
NPV, 222
candidates for automated responses, 116
capacity
planning, 197, 214215
workload metrics, 149
CBQ (class-based queuing), 187
CDN (Content Distribution Network), 168, 43
cell error ratios, 179
census of existing systems, 233
change latency, 34
characteristics, 51
CIM (Common Information Model), 51, 249
CIR (Committed Information Rate), 178
circuits, Frame Relay, 178
Cisco QoS Policy Manager (QPM), 139
class-based queuing (CBQ), 187
classes, one-way latency, 180
Clickstream Technologies, 158
client-side caches, 169
clocks, synchronizing, 180
closure criteria policies, 138
clusters, tools, 251
code, XML, 248
collaboration of instrumentation, 78
collectors, 125
deploying, 73
embedding, 71
event management, 82, 85
linkage, 78
managing, 72
measurements, 70
monitoring, 69
roll-up method, 86
services, 75
commerce
e-business, 56
B2B, 67
B2C, 78
B2E, 8
goals for performance, 254
measurements, 150
process metrics, 31, 34
ROI, 219, 228
commercial operations, 116, 127
Committed Information Rate (CIR), 178
Common Information Model (CIM), 51, 249
communication
between design and operations, 156
effect of organizational structures, 146
completion rates, ROI, 225
complexity, managing, 130
compliance testing, 55
components
instrumentation, 159
SLM, 15
systems, 68, 73
computation load, 166
concurrent sessions, 198
concurrent statistics, 198
confidence intervals, 25
configuration
architecture, 45
drivers, 48-52
environment evolution, 45-46
example of, 52-56
heterogeneous systems, 46-48
businesses
e-business, 58
goals for performance, 254
measurements, 150
process metrics, 31, 34
ROI, 219, 228
collectors, 73
communication between operations, 156
incremental aggregation, 232233
initial project selection, 231232
instrumentation, 73, 77
planning, 233242
policies, 136138
ROI, 225
SLAs, 239
configuration-centric element management, 131
connections
Arbor Networks, 125
CDN, 168
collectors, 71
isolation, 188
management systems integration, 248250
modems, 156
SNMP, 248
SSL, 165
constant bit rates, 178
content, 170
distribution, 169
instrumentation, 173
managers, 170
servers, 170
switches, 166
Content Distribution Network (CDN), 43, 168


cookies, 158, 167
coordination of alerts, 96
correlation
artifacts, 90
associating, 93
costs, ROI, 219, 228
CRM (Customer Relationship Management),
204
curves, load, 196
Customer Relationship Management (CRM),
204
customers,
behavior measurements, 149
customer service, 56
facing, 163
customization
architecture, 45
drivers, 4852
environment evolution, 4546
example of, 5256
heterogeneous systems, 4648
collectors, 73
communication between operations, 156
incremental aggregation, 232233
initial project selection, 231232
instrumentation, 73, 77
planning, 233242
policies, 136138
ROI, 225
SLAs, 239

D
data flow control, transport services, 188, 191
data integration, 248. See also integration
data item definition, 50
databases
distributed, 249
instrumentation, 6163
servers, 43
DDoS (Distributed Denial of Service) attacks,
121, 124
DE (discard eligible), 178
de-duplication, 87
defenses, automated, 124
dejitter buffers, 154, 181
delay, 154
processing, 156
propagation, 154
queuing, 154
round-trip, 189
serialization, 152
think time, 152
demand-side (end-user request) criteria, 166
demarcation points, 73, 189, 237
deployment, 231
collectors, 73
incremental aggregation, 232233
initial project selection, 231232
planning, 233242
ROI, 225
design
architecture, 45
drivers, 4852
environment evolution, 4546
example of, 5256


heterogeneous systems, 4648
collectors, 73
communication between operations, 156
incremental aggregation, 232233
initial project selection, 231232
instrumentation, 73, 77
planning, 233242
policies, 136, 138
ROI, 225
SLAs, 239
Design of Experiments (DOE), 205
detection
alerts, 85, 97
coordinating, 96
event management, 82, 85
prioritization of, 94
processing, 86
raw, 81
reducing noise, 85, 97
reliability, 83
trip wires, 6667
verifying, 88
erroneous measurements, 28
development
aggregators, 72
applications
effect of organizational structures,
146
infrastructure, 145
instrumentation, 157, 161
metrics, 147, 152
operational environments, 146
time lines, 147
transaction response time, 152, 157


complexity, 130
CRM, 204
demarcation points, 189
digital certificates, 166
events, 8182, 85
applying Micromuse, 9799
reducing noise, 85, 97
instrumentation, 53
components, 68, 73
modes, 65
time slices, 67
trip wires, 6667
new technologies, 31, 34
NMOS, 247
phased implementations, 231
incremental aggregation, 232233
initial project selection, 231232
planning, 233242
policy-based, 129130
architecture, 133, 136
design, 136, 138
elements, 131132
need for, 130131
products, 139142
service-centric, 132133
problem management metrics, 33
real-time operations, 101
automated responses, 113, 116
brownouts, 110
commercial, 116, 127
proactive, 112113
reactive, 103104
root-cause analysis, 107111
triage, 104, 107


virtualized resources, 110111
real-time service metrics, 33
services, 6365
SLM
components, 15
overview of, 917
SNMP, 248
systems integration, 248250
teams, 146147
technologies, 91
time lines, 147
tools, 95, 255
transport infrastructure, 177
data flow control, 188-191
metrics, 178, 181
QoS, 181, 188
Web system architecture, 250, 254
Device Under Test (DUT), 200
diagnostics, 189. See also troubleshooting
brownouts, 110
translation, 89
virtualized resources, 110
Diffserv, 183
digital certificates, 165-166
digital links, troubleshooting, 179
DIRIG Software PathFinder, 94
discard eligible (DE), 178
dispute resolution, 241
disruption of services, 62
distributed databases, 249
Distributed Denial of Service (DDoS) attacks,
121, 124

Distributed Management Task Force (DMTF), 249
distribution, 164
content, 169, 173
hybrid, 135
load
geographic, 167
instrumentation, 172
local, 166
policies, 134
DMTF (Distributed Management Task Force),
249
DNS (Domain Name System), 43
documents
existing systems, 233
SLAs, 239. See also SLAs
XML, 248
DOE (Design of Experiments), 205
Domain Name System. See DNS
drivers, architecture, 4852
DUT (Device Under Test), 200
dynamism, demands of, 245–247

E
e-business services, 5–6
B2B, 6–7
B2C, 7–8
B2E, 8
edge routers, demarcation points, 237
Edge-Side Includes (ESI), 170
effective throughput, 179
efficiency of aggregators, 72

egress filtering, 122


elastic boundaries, 49
elements
correlation, 90
instrumentation, 6163
integrating, 92
policy-based management, 131–132
elimination of artifacts, 88
embedding
collectors, 71
SAAs, 71
end-to-end response problem, root-cause
analysis of, 108
end-user measurements, 160
enforcers, policies, 136
environments
evolution of, 4546
load test beds, 200–203
operational, 146
SNA, 245
sticky, 149
error-free seconds, 179
errors, 180
alerts, 82. See also alerts
DDoS attacks, 121
digital links, 179
Micromuse, 97–99
policies, 138
problem management metrics, 33
ratios, 180
real-time operations
proactive management, 112–113
reactive management, 103–104
root-cause analysis, 107
triage, 104, 107

services, 64
virtualized resources, 110
escalation time, 33
ESI (Edge-Side Includes), 170
evaluation of ROI, 227. See also ROI
events
integration. See also integration, 249
managing, 81–82, 85
applying Micromuse, 97–99
reducing noise, 85, 97
tools, 95
publishing, 96
real-time handling, 54
signaling, 50
evolution of environments, 45–46
existing applications, 232
existing systems
baselining, 237
documentation, 233
optimizing, 237
expanding services, 49
Extensible Markup Language (XML), 51, 248
external role of IT groups, 14

F
failover latency, 69
Failure Modes and Effects Analysis (FMEA),
136
failure rate of transactions, 20
false positives, alerts, 84
fast system management, 50
feedback, 65. See also management
fidelity of transactions, 78

file transfers, 20
filters
aggregators, 72
alerts, 89
correlation, 90
egress, 122
repeat failures, 88
financials, 56
flash load, 198
flooding, 198
Flow Control Platform, 190
flow-through QoS, 189. See also data flow control
FMEA (Failure Modes and Effects Analysis),
136
Frame Relay, CIR, 178
front-end processors, 164
functions, 72
instrumentation, 68
of event management, 86

G
gateways, BGP, 190
generated revenues, 150
generators, load testing, 200, 203
geographic distribution technologies, 43
geographic load distribution, 167
geometric deviation, 30
geometric mean, 30
geometric standard deviation, 30
GPS (Global Positioning System), 180
grooming, 72
groups, monitoring, 69

H
headers, caching, 169
headroom, assessing, 115
heartbeats, 70
heavy-tailed distribution, 30
heterogeneous systems, 46–48
hierarchical collector structures, 86. See also collectors
hierarchies, policies, 137
high-level technical metrics, 17. See also metrics
history of architecture, 45
drivers, 48–52
environment evolution, 45–46
heterogeneous systems, 46–48
HTTP (Hypertext Transfer Protocol), 165
hybrid distribution, 135
hybrid systems, active/passive agents, 77

I
IEEE (Institute of Electrical and Electronics Engineers), 182
IEEE 802 LAN QoS, 182. See also QoS
implementation, 231. See also development
incremental aggregation, 232–233
initial project selection, 231–232
planning, 233–242
policy-based management, 130–131
architecture, 133, 136
design, 136–138
elements, 131–132
products, 139–142
service-centric, 132–133

incremental aggregation, 232–233


indications, 51
inflection points, 197
infrastructure
applications, 145
architecture. See architecture
behavior, 62
collectors, 71
ROI, 225
transport
data flow control, 188, 191
managing, 177
metrics, 178, 181
QoS, 181, 188
initial project selection, 231–232
instances of objects, 93
Institute of Electrical and Electronics
Engineers (IEEE), 182
instrumentation, 52
adaptive, 87
applications, 157, 161
caches, 172
choices and location, 235
components, 159
content distribution, 173
design, 73, 77
elements, 61–63
load distribution, 172
management, 53
modes, 65
time slices, 67
trip wires, 66–67
server infrastructure, 171

services
managing, 63–65
tracking, 77
systems, 68, 73, 77
Web servers, 157
integration
alerts, 96
on the glass, 47
processes, 92
systems, 248–250
technologies, 91
tools, 255
integrity, web pages, 159
intelligent monitoring, 87
interactive classes
collectors, 70
one-way latency, 180
interactive services, round-trip latency, 181
interfaces, API, 96
internal failures, 82. See also alerts;
troubleshooting
internal role of IT groups, 14
internally-generated alerts, 96
Internet latencies, 180
Internet Service Providers (ISPs), 43, 49
intervals
aggregation, 26–27
time slices, 67
intranets, 8
invariant responses, 197
investments, ROI, 219, 228
IP (Internet Protocol)
addresses, 122
DiffServ, 183
TOS, 183

isolation, 181, 188


ISPs (Internet Service Providers), 43, 49
IT groups, roles of, 14

J
Jacobson, Van, 186
jitter, 22, 154, 181

K
Keynote Systems, 28
Keynote WebIntegrity tool, 159
keys, 165
knowledge repositories, need for, 131

L
LAN (local area network), 182
languages, 113
latency, 21
change latency, 34
diagnostics, 189
failover, 69
one-way, 180
round-trip, 181
Lawrence Berkeley Laboratories, 186
lead time, benefits of, 112
legacy applications, opacity of, 160
linkage, 78

links
passive monitoring, 236
troubleshooting, 179
load balancing, 126, 166
load distribution, 164
geographic, 167
instrumentation, 172
local, 166
load testing, 67, 195–196
beds, 200–203
benchmarks, 199–200
generators, 200–203
performance envelope, 196, 199
results, 205–206
transaction load-test scripts, 203–205
local impact, assessing, 114
local load distribution, 166
locations
active probes, 236
instrumentation, 235
long-term effect of management decisions, 65.
See also management
long-term operations, 55
loss, packets, 179
lower-level services, 152, 157
low-level technical metrics, 17. See also
technical metrics

M
macro/micro-level, QoS, 189
management
aggregators, 72
applications
baselining, 237
development teams, 146–147
existing, 232
infrastructure, 145
instrumentation, 61–63
legacy, 160
management, 146–157, 248–250
Netuitive, 120
network-aware, 156
ProactiveNet, 117
servers, 42
complexity, 130
CRM, 204
demarcation points, 189
digital certificates, 166
events, 81–85
applying Micromuse, 97–99
reducing noise, 85, 97
instrumentation, 53
components, 68, 73
modes, 65
time slices, 67
trip wires, 66–67
new technologies, 31, 34
NMOS, 247
phased implementations, 231
incremental aggregation, 232–233
initial project selection, 231–232
planning, 233–242
policy-based, 129–130
architecture, 133, 136
design, 136–138
elements, 131–132
need for, 130–131
products, 139–142
service-centric, 132–133
problem management metrics, 33
real-time operations, 101
automated responses, 113, 116
brownouts, 110
commercial, 116, 127
proactive, 112–113
reactive, 103–104
root-cause analysis, 107–111
triage, 104, 107
virtualized resources, 110–111
real-time service metrics, 33
services, 63–65
SLM
components, 15
overview of, 9–17
SNMP, 248
systems integration, 248–250
technologies, 91
tools, 95, 255
transport infrastructure, 177
data flow control, 188–191
metrics, 178–181
QoS, 181, 188
Web system architecture, 250, 254
Management Information Base (MIB), 50
manual association of services, 93
maximum burst size, 178

Mean Opinion Score (MOS), 21


Mean Time Between Failures (MTBF), 19
Mean Time To Repair (MTTR), 19
measurements
active, 236
artifacts, 88
baselining, 237
benchmarks, 199–200
business, 150
capacity, 149
collectors, 70
customer behavior, 149
demarcation points, 189
end-user, 160
granularity
aggregation intervals, 26–27
sampling frequency, 24–26
scope, 23–24
instrumentation, 69
passive, 236
policies, 76
quality service, 151
reporting tools, 240
SAAs, 71
service performance, 254
time slices, 67
validation, 28–29
Mercury Interactive Astra SiteManager tool,
159
methods, 51, 93
metrics
applications, 147, 152
business process, 31, 34
performance, 234–235

problem management, 33
real-time service management, 33
SLA, 16
technical, 17, 23
transport services, 178–181
workload, 149
MIB (Management Information Base), 50
Micromuse, 97–99, 247
middle mile, 190
mission statements, ROI, 222. See also ROI
modeling
ROI, 220
simulation, 209–211
building, 211–213
performance, 211
reporting, 214
validating, 213
services, 92
modems, processing delays, 156
moderate priority level, 95
modes, instrumentation, 65–67
modification
services, 49
thresholds, 115
monitoring, 28
active, 75
baselines, 112
groups, 69
instrumentation design, 73, 77
intelligent, 87
passive, 75, 236
services, 61–63
transactions, 63
variables, 82

MOS (Mean Opinion Score), 21


moving applications, 232
MTBF (Mean Time Between Failures), 19
MTTR (Mean Time To Repair), 19
multi-homed servers, 43
multimedia
quality of streams, 20
rebuffering, 179
multiple locations, active probes, 236
multiple server architecture, 163, 174
multiple service providers, 49

N
NAT (Network Address Translation), 167
net present value (NPV), 222
Netcool, 247. See also Micromuse
NetIQ, 157
NetScaler, 126
Netuitive, 120
Network Address Translation (NAT), 167
network edge, 170
Network Management Operating System (NMOS), 247
Network Time Protocol (NTP), 180
network-aware applications, 156
networks
Arbor Networks, 125
CDN, 168
collectors, 71
isolation, 188
management systems integration, 248–250
modems, 156
SNMP, 248
NMOS (Network Management Operating System), 247
noise, reducing, 85, 97
normalization, 72
notifications
objects, 93
time, 33
NPV (net present value), 222
NTP (Network Time Protocol), 180

O
objects, 93
concurrent user session initiation attempts, 198
one-way latency, 180
opacity of legacy applications, 160
operating systems, NMOS, 247
operational business decisions, 64. See also
management
operational environments, 146
operational technical decisions, 64. See also
management
operations, 54, 101–107, 109–113, 116, 127
aggregators, 72
applications
baselining, 237
development teams, 146–147
existing, 232
infrastructure, 145
instrumentation, 61–63
legacy, 160
management, 146–157, 248–250
Netuitive, 120
network-aware, 156

ProactiveNet, 117
servers, 42
back-office, 56
complexity, 130
CRM, 204
demarcation points, 189
design groups, 156
digital certificates, 166
events, 81–85
applying Micromuse, 97–99
reducing noise, 85, 97
instrumentation, 53
components, 68, 73
modes, 65
time slices, 67
trip wires, 66–67
interaction teams, 146–147
long-term, 55
new technologies, 31, 34
NMOS, 247
performance envelope, 196, 199
phased implementations, 231
incremental aggregation, 232–233
initial project selection, 231–232
planning, 233–242
policy-based, 129–130
architecture, 133, 136
design, 136, 138
elements, 131–132
need for, 130–131
products, 139–142
service-centric, 132–133
problem management metrics, 33

real-time operations, 101


automated responses, 113, 116
brownouts, 110
commercial, 116, 127
proactive, 112–113
reactive, 103–104
root-cause analysis, 107–111
triage, 104, 107
virtualized resources, 110–111
real-time service metrics, 33
services, 63–65
SLM
components, 15
overview of, 9–17
SNMP, 248
systems integration, 248–250
technologies, 91
tools, 95, 255
transport infrastructure, 177
data flow control, 188–191
metrics, 178–181
QoS, 181, 188
Web system architecture, 250, 254
Operations Support Systems (OSS), 56
OPNET Flow Analysis, 213
OPNET Network Editor, 212
optimization
architecture, 45
drivers, 48–52
environment evolution, 45–46
example of, 52–56
heterogeneous systems, 46–48
collectors, 73
communication between operations, 156

existing systems, 237


incremental aggregation, 232–233
initial project selection, 231–232
instrumentation, 73, 77
planning, 233–242
policies, 136, 138
ROI, 225
SLAs, 239
Orchestream Service Activator, 141
order tracking, 56
organizational structures, effects of, 146
OSS (Operations Support Systems), 56
outlying measurement, 30
over-provisioning, 188

P
Packeteer, 186
packets
collectors, 70
jitter, 181
loss, 21, 179
PacketShaper, 187
page-bug tracking, 158
parsing XML, 248
partners, 49
passive collection, 75–78
passive measurements, 236
passive monitoring, 75
PathFinder (DIRIG Software), 94
payback, ROI, 228
peak cell rates, 178
peak service rates, 197

performance. See also optimization


baselining, 237
metrics, 234–235
of simulation modeling, 211
performance envelope, 196, 199
ROI, 225
SAAs, 71
sensitivities, 237
services, 254
persistence, 167
phantom objects, 158
phased implementations, 231
incremental aggregation, 232–233
initial project selection, 231–232
planning, 233–242
pixel-based tracking, 158
placement of probes, 236
planning
capacity, 197, 214–215
implementations, 233–242
policies, 54, 134
attributes, 137
auditing, 138
closure criteria, 138
distribution, 134
enforcers, 136
hierarchies, 137
measurements, 76
Orchestream Service Activator, 141
oversight, 55
QPM, 139
testing, 138
policy-based management, 129–130
architecture, 133, 136

design, 136, 138


elements, 131–132
need for, 130–131
products, 139, 142
service-centric, 132–133
predictions of behavior, 112
predictive analysis, 55
preferences
architecture, 45
drivers, 48–52
environment evolution, 45–46
example of, 52–56
heterogeneous systems, 46–48
collectors, 73
communication between operations, 156
existing systems, 237
incremental aggregation, 232–233
initial project selection, 231–232
instrumentation, 73, 77
planning, 233–242
policies, 136–138
ROI, 225
SLAs, 239
prioritization
applications, 232
of alerts, 94
privacy, 165
proactive management, real-time operations,
112–113
ProactiveNet, 117
probes
active, 236
RMON, 236
problem management metrics, 33

problem signatures, 91
process integration, 250
process managers, 251
processing, 72
alerts, 86
functions, 72
processing delays, 156
products, policies, 139, 142
profiles, load testing, 203–205
programming XML, 248
projections, ROI, 221
promotion feedback, 151
propagation, delays, 152, 154
protocol analyzers, 212
protocols
analyzers, 212
BGP, 190
HTTP, 165
NTP, 180
SNMP, 50, 172, 248
TCP, 180
providers, measuring, 237
provisioning, 33, 56
publishing events, 96
pull (component-centric) model, 134
push (repository-centric) model, 135

Q
QoE (Quality of Experience), 10, 43
QoS (Quality of Service), 10, 212
services census, 233
transport services, 181, 188

QPM (Cisco QoS Policy Manager), 139


qualitative contributions, 220
quality
measurement of service, 21
streams, 20
telephone voice transmissions, 21
Quality of Experience. See QoE
Quality of Service. See QoS
quantitative information, 220
queuing, 154
CBQ, 187
QoS, 187
WFQ, 187

R
rate control, QoS, 186
ratios, 180
customer orders to customer visitors, 150
load testing, 203
raw alerts, 81
reactive management, 103–104
brownouts, 110
root-cause analysis, 107–111
triage, 104, 107
virtualized resources, 110–111
real-time, 101
automated responses, 113, 116
brownouts, 110
commercials, 116, 127
proactive management, 112–113
reactive management, 103–104
root-cause analysis, 107–111

triage, 104, 107


virtualized resources, 110–111
real-time event handling, 54
real-time operations, 101
automated responses, 113, 116
commercial, 116, 127
proactive management, 112–113
reactive management, 103–104
brownouts, 110
root-cause analysis, 107–111
triage, 104, 107
virtualized resources, 110–111
real-time service management metrics, 33
rebuffering, 179
recovery, 189
reduction
of artifacts, 88
of noise, 85, 97
of volume, 86
redundancy, 124
regression testing, 202
reliability of alerts, 83
Remote Monitoring (RMON), 212, 236
repeat failures filter, 88
repetitive measurements, time slices, 67
reports, 240
automated responses, 116
benchmarks, 199–200
simulation modeling, 214
SLA, 54
repositories, 134
requests, NetScaler, 126
resolution time, 33


resources
SLM
components, 15
overview of, 9–17
virtualized, 110
responses
automated, 113, 116
servers, 23
transactions, 20, 152, 157
responsibilities, 240
results, load testing, 205–206
retransmissions, effective throughput, 179
Return on Investment (ROI), 56, 219, 228
revenue baselines, 150
reviews, scheduling, 240
Risk Priority Number (RPN), 137
RiverSoft, 247
RMON (Remote Monitoring), 212, 236
ROI (Return on Investment), 56, 219, 228
roles, 240
roll-up methods, 86
root cause, 33
root-cause analysis, 55, 107
round-trip delay, 189
round-trip latency, 181
route control, 190, 246
routers, demarcation points, 237
routes, asymmetric, 180
RPN (Risk Priority Number), 137
rules of policy-based management, 130

S
SAA (Service Assurance Agent), 71
sampling frequency, 24–26
scaling aggregators, 72
scheduled reviews, 240
scope, measurement of, 23–24
scripts, transaction load-test, 203–205
Secure Sockets Layer (SSL), 165–167
security
authentication, 165
DDoS attacks, 121
selection
of candidates, 116
of instrumentation, 235
of thresholds, 67
semantics, 50
sensitivities, 197, 214, 237
serialization of delays, 152
servers, 170
application, 42
database, 43
infrastructure, 163, 174
instrumentation, 6163
priority level, 95
response time, 23
Web, 42, 157
server-side caches, 43, 169
Service Assurance Agent (SAA), 71
Service Level Agreements. See SLAs
Service Level Management. See SLM
service providers, elastic boundaries, 49
service-centric policies, 132–133

services
behavior, 62
census, 233
collectors, 70
correlation, 93
disruptions, 62
event management, 82, 85
troubleshooting, 64
e-business services, 5–6
B2B, 6–7
B2C, 7–8
B2E, 8
expanding, 49
incremental aggregation, 232
instrumentation, 61–63
monitoring design, 73, 77
tracking, 77
integrating, 91
management, 63–65
measurement, 160
modeling, 92
modifying, 49
performance envelope, 196, 199, 254
quality measurement, 151
technical metrics, 17, 23
tracking, 75
transport
data flow control, 188–191
metrics, 178, 181
QoS, 181, 188
Web
architecture, 163
delivery architecture, 42–44
webbed, 89


sessions, 167, 198


shaping, traffic, 185. See also QoS
shelfware, 90
signaling
boosting, 85, 97
events, 50
Simple Network Management Protocol
(SNMP), 50, 172, 248
simulation modeling, 209–211
building, 211–213
performance, 211
reporting, 214
validating, 213
single-function modules, 251
SLAs (Service Level Agreements), 10, 15
alerts, 84
configuring, 239
measurement granularity
aggregation intervals, 26–27
sampling frequency, 24–26
scope, 23–24
measurement validation, 28–30
metrics, 16–17, 23
reports, 54
statistics, 54
SLM (Service Level Management)
components, 15
overview of, 9–10, 14, 17
slow start algorithm, 155
slow-start approach, traffic-shaping QoS, 186
SNA (Systems Network Architecture), 45, 245
SNMP (Simple Network Management
Protocol), 50, 172, 248
soft benefits, ROI, 225

software, 117
baselining, 237
development teams, 146–147
existing, 232
infrastructure, 145
instrumentation, 61–63
legacy, 160
management
effect of organizational structures,
146
instrumentation, 157, 161
metrics, 147, 152
operational environments, 146
time lines, 147
transaction response time, 152, 157
management systems integration, 248–250
Netuitive, 120
network-aware, 156
ProactiveNet, 117
servers, 42
source persistence, 167
speed
demands of, 245, 247
root-cause analysis, 107
spoofing, 122
SSCPs (System Services Control Points), 46
SSL (Secure Sockets Layer), 165–167
staffing
costs, 130
ROI, 225
starving out best effort services, 18
statistics
analysis, 29–30
concurrent sessions, 198

SLA, 54
sticky environments, 149
streaming
collectors, 70
multimedia, 179
quality, 20
superficial integration, 248
supply-side (server status) criteria, 166
suppression, IP source address spoofing, 122
sustainable cell rates, 178
switches
content, 166
instrumentation, 61–63
NetScaler, 126
switching, CDN, 168
SYN Flood attacks, 121
synchronization of GPS, 180
syntax, SNMP, 50. See also programming
synthetic (virtual) transactions, 75, 202
system architecture, 41. See also architecture
design, 45
drivers, 48–52
environment evolution, 45–46
heterogeneous systems, 46–48
example of, 52–56
Web service delivery, 42–44
System Services Control Points (SSCPs), 46
systems
instrumentation, 68, 73
integration, 248–250
Systems Network Architecture (SNA), 45, 245

T
tag-based QoS, 182
tagging erroneous measurements, 28
Tavve EventWatch, 117
TCP (Transmission Control Protocol)
Packeteer, 186
round-trip latency, 181
slow start algorithm, 155
traffic-shaping QoS, 185
transport services, 180
teams
development, 147–149
elastic boundaries, 49
technical metrics, 17, 23
technical quality metrics, 178–181
technologies, integrating, 91
telephone voice transmissions, quality of, 21
testing
DUT, 200
integrity, 159
load, 195–196
beds, 200, 203
benchmarks, 199–200
generators, 200–203
performance envelope, 196, 199
results, 205–206
transaction load-test scripts, 203–205
phased implementation, 231–232
policies, 138
regression, 202
think time, 152
third-party content providers, 44
threats, automated defenses, 124


thresholds
alerts
triggers, 82
trip wires, 66–67
modifying, 115
throughput, 179
tiers of architecture, 163, 174
time
correlation, 91
lines, 147
NTP, 180
slices, 67
transactions, 151–152, 157
Time to Value (ROI), 221
time, 151. See also measurements, metrics
tolerance for service interruption, 27
tools, 255
baselining, 237
clusters, 251
development teams, 146–147
existing, 232
infrastructure, 145
instrumentation, 61–63
legacy, 160
management, 95
effect of organizational structures,
146
instrumentation, 157, 161
metrics, 147, 152
operational environments, 146
time lines, 147
transaction response time, 152, 157
management systems integration, 248–250
Micromuse, 97, 99
Netuitive, 120

network-aware, 156
policy-based management, 133
ProactiveNet, 117
reporting, 240
servers, 42
simulation modeling, 211
building, 211, 213
reporting, 214
validating, 213
top-down integration, 92
top-down process, 62
tracking
services, 75, 77
workflow, 95
traffic, tag-based QoS, 182, 185
transactions
collectors, 70
failure rates, 20
fidelity, 78
load-test scripts, 203–205
monitoring, 63
recorder, 204
response time, 20, 152, 157
ROI, 228. See also ROI
roll-up methods, 86
security of, 165
service quality measurement, 151
synthetic, 75
synthetic (virtual), 202
time, 151
virtual, 75
transfers, files, 20
transport infrastructure
managing, 177
metrics, 178–181
QoS, 181, 188–191


transport methods, reliable alerts, 83
triage, 104, 107, 253
trial periods, incremental aggregation, 232
triggers, 82. See also alerts
trip wires, 66–67
trouble relief time, 33
trouble response time, 33
troubleshooting
alerts, 82
brownouts, 110
DDoS attacks, 121
digital links, 179
Micromuse, 97–99
policies, 138
problem management metrics, 33
real-time operations
proactive management, 112–113
reactive management, 103–104
root-cause analysis, 107
triage, 104, 107
services, 64
virtualized resources, 110

U
users
experience, 75
measurements, 160
utilities
baselining, 237
clusters, 251
development teams, 146–147
existing, 232

infrastructure, 145
instrumentation, 61–63
legacy, 160
management, 95
effect of organizational structures,
146
instrumentation, 157, 161
metrics, 147, 152
operational environments, 146
time lines, 147
transaction response time, 152, 157
management systems integration, 248–250
Micromuse, 97, 99
Netuitive, 120
network-aware, 156
policy-based management, 133
ProactiveNet, 117
reporting, 240
servers, 42
simulation modeling, 211
building, 211, 213
reporting, 214
validating, 213

V
validation
measurement, 28–29
simulation modeling, 213
values, NPV, 222
variables
bit rates, 178
monitoring, 82
time slices, 67

verification of alerts, 88
video, 170
virtual (synthetic) transactions, 147
virtual transactions, 75, 202
virtualized resources, 110
volume
intelligent monitoring, 87
reducing, 86

W
warning priority level, 95
Web management systems, 250, 254
web pages
integrity, 159
load testing, 199. See also load testing
Web servers, 42, 157
Web services
architecture, 163
delivery architecture, 42–44
webbed ecosystem, 89
webbed services, 89
WebEffective, 158
WebTrends Log Analyzer Series, 157
WFQ (weighted fair queuing), 187. See also
queuing
windows, 154
workflow, tracking, 95
workload
metrics, 18, 21, 149
transport services, 178


X-Y
X out of Y process, 89
XML (Extensible Markup Language), 51, 248

Z
zombies, 121
