
EXAMPLE
We can draw a useful analogy to the traditional landline
telephone system. When you pick up the receiver of such a
telephone you generally expect a dial tone. In those rare
instances when a dial tone is not present, the entire
telephone system appears to be down, or unavailable, to
the user. In reality, it may be the central office, the
switching station, the line coming into your house, or any
one of a number of other components that have failed and
are causing the outage.
As a customer you typically are less concerned with the root cause of
the problem and more interested in when service will be restored.
But the telephone technician who is responsible for maximizing
the uptime of your service needs to focus on cause, analysis, and
prevention, in addition to restoring service as quickly as possible.
By the same token, infrastructure analysts focus not only on the
timely recovery from outages to service, but on methods to reduce
their frequency and duration to maximize availability.
There are several other terms and expressions closely associated
with the term availability. These include uptime, downtime,
slow response, and high availability. A clear understanding of how
their meanings differ from one another can help bring the topic of
availability into sharper focus. The next few sections will explain
these different meanings.
DIFFERENTIATING AVAILABILITY FROM UPTIME

The simplest way to distinguish the terms availability and
uptime is to think of availability as
oriented toward customers and uptime as oriented
toward suppliers. Customers, or end-users, are primarily
interested in their system being up and running, that is,
available to them. The suppliers, meaning the support
groups within the infrastructure, by nature of their
responsibilities, are interested in keeping their particular
components of the system up and running. For example,
systems administrators focus on keeping the server
hardware and software up and operational. Network
administrators have a similar focus on network hardware
and software, and database administrators do the same
with their database software.
The large number of diverse components leads to two common
dilemmas faced by infrastructure professionals in managing
availability.
The first is trading off the costs of outages against the costs of
total redundancy. Any component acting as a single point
of failure puts overall system availability at risk and can
undermine the excellent uptime of other components. The
end-user whose critical application is unavailable due to a
malfunctioning server will care very little that network
uptime is at an all-time high.
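The effect of one weak component on the whole chain can be sketched with a little arithmetic. The example below is only an illustration: the uptime figures are invented, and it assumes that every component must be up for the user to be served and that failures are independent, which real systems only approximate.

```python
from math import prod


def serial_availability(component_uptimes):
    """Availability of a chain in which every component must be up.

    Assumes independent failures (a simplifying assumption).
    """
    return prod(component_uptimes)


# Hypothetical uptime fractions, for illustration only.
uptimes = {"network": 0.9999, "server": 0.95, "database": 0.999}
overall = serial_availability(uptimes.values())

print(f"end-to-end availability: {overall:.4f}")
print(f"weakest component:       {min(uptimes.values()):.4f}")
```

Under these assumptions the end-to-end figure can never exceed that of the weakest component, however good the network uptime is.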
The second dilemma is that multiple components usually correspond
to multiple owners, which can be a formula for disaster when it
comes to managing overall availability. One of the first tenets of
management says that when several people are in charge there is
no one in charge. This is why it is important to distinguish
availability, which is what the end-user experiences when all
components of a system are operating, from uptime, which is what
most owners of a single component focus on, as well they should.
The solution
to this dilemma is an availability manager, or availability process
owner, who is responsible for all facets of availability, regardless
of the components involved. This may seem obvious, but many
shops elect not to do this for technical, managerial, or political
reasons. Most robust infrastructures follow this model.
DIFFERENTIATING SLOW RESPONSE FROM DOWNTIME

Slow response can infuriate users and frustrate infrastructure specialists. The
growth of a database, traffic on the network, contention for disk volumes, or the
disabling of processors or portions of main memory in servers can all contribute
to response time slowdowns. Each of these conditions requires analysis and
resolution by infrastructure professionals. Users understandably are normally
unaware of these root causes and sometimes interpret extremely slow response
as downtime to their system. The threshold of time at which this interpretation
occurs varies from user to user. It does not matter to users whether the problem
is due to slowly responding software (slow response) or malfunctioning hardware
(downtime). What does matter is that slow or non-responsive transactions can
infuriate users who expect quick, consistent response times.
But slow response is different from downtime, and the root cause of these problems
does matter a great deal to infrastructure analysts and administrators. They are
charged with identifying, correcting, and permanently resolving the root causes
of these service disruptions. Understanding the type of problem it is affects the
course of action taken to resolve it. Slow response is usually a performance and
tuning issue involving different personnel, different processes, and different
process owners than those involved with downtime, which is an availability issue.
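One way to picture the distinction is a small triage routine. This is a hypothetical sketch, not a prescribed method: the function name, the 3-second threshold, and the sample values are all assumptions, and in practice the threshold varies from user to user as noted above.

```python
def classify(response_seconds, slow_threshold):
    """Label one transaction as 'down', 'slow', or 'ok'.

    response_seconds is None when no response arrived at all.
    """
    if response_seconds is None:
        return "down"
    if response_seconds > slow_threshold:
        return "slow"
    return "ok"


# Which process each kind of problem is handed to, per the text above.
ROUTING = {"slow": "performance and tuning", "down": "availability"}

# Hypothetical response times in seconds; None means no response.
samples = [0.4, 6.2, None, 1.1]
labels = [classify(s, slow_threshold=3.0) for s in samples]

for sample, label in zip(samples, labels):
    print(sample, label, ROUTING.get(label, "no action"))
```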
DIFFERENTIATING AVAILABILITY FROM HIGH AVAILABILITY

The primary difference between availability and high availability
is that the latter is designed to tolerate virtually no
downtime. All online computer systems are intended to
maximize availability, or to minimize downtime, as much as
possible. In high-availability environments, a number of
design considerations are employed to make online systems
as fault tolerant as possible.
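To give "virtually no downtime" a rough scale, the sketch below converts an availability target into a yearly downtime budget. The targets shown are illustrative and do not come from the text, and the calculation assumes a service that is expected to run around the clock.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_target):
    """Hours of downtime per year that a given target permits."""
    return (1.0 - availability_target) * HOURS_PER_YEAR


# Illustrative targets only.
for target in (0.99, 0.999, 0.9999):
    hours = allowed_downtime_hours(target)
    print(f"target {target:.4f}: about {hours:.2f} hours of downtime per year")
```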
DESIRED TRAITS OF AN AVAILABILITY PROCESS OWNER

As we mentioned previously, the most robust infrastructures
select a single individual to be the process owner of
availability. Some shops refer to this person as the
availability manager. In some instances it is the operations
manager; in others it is a strong technical lead in technical
support. Regardless of who these individuals are, or to
whom they report, they should be knowledgeable in a
variety of areas, including systems, networks, databases,
and facilities, and they must be able to think and act
tactically. A slightly less critical, but desirable, trait of an
ideal candidate for availability process owner is a
knowledge of software and hardware configurations, backup
systems, and desktop hardware and software.
METHODS FOR MEASURING AVAILABILITY

Case study
Achieving high availability does not happen by accident.
Careful planning, clever design, flawless execution and
reliable support are just some of the characteristics
required to keep critical systems up and operating for
months on end.
THE SEVEN RS OF HIGH AVAILABILITY

The goal of all availability process owners is to maximize the
uptime of the various online systems for which they are
responsible; in essence, to make them completely fault
tolerant. Constraints inside and outside the IT environment
make this challenge close to impossible. Budget limitations,
component failures, faulty code, human error, flawed design,
natural disasters, and unforeseen business shifts such as
mergers, downturns, and political changes are just some of
the factors working against that elusive goal of 100%
availability, the ultimate expression of high availability.
There are several approaches that can be taken to maximize
availability without breaking the budget. Each of these
approaches starts with the same letter, so we refer to them as
the seven Rs of high availability: redundancy, reputation,
reliability, repairability, recoverability, responsiveness, and
robustness.
Let's begin with redundancy. Manufacturers have been designing this
into their products for years in the form of redundant power
supplies, multiple processors, segmented memory, and redundant
disks. This can also refer to entire server systems running in a hot
standby mode. Infrastructure analysts can take a similar approach
by configuring disk and tape controllers and servers with dual
paths, splitting network loads over dual lines, and providing
alternate control consoles; in short, by eliminating as far as
possible any single points of failure that could disrupt service
availability.
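The benefit of redundancy can be sketched numerically. The example assumes identical units that fail independently and a service that stays up as long as at least one unit is up. The 0.95 figure is invented for illustration, and real redundant pairs often share failure causes, so treat the result as an upper bound.

```python
def redundant_availability(unit_availability, copies):
    """Availability of `copies` redundant units when any one suffices.

    Assumes independent failures (a simplifying assumption).
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - unit_availability) ** copies


# Hypothetical single-unit availability of 0.95.
for copies in (1, 2, 3):
    combined = redundant_availability(0.95, copies)
    print(f"{copies} unit(s): {combined:.6f}")
```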
The next three approaches (reputation, reliability, and repairability)
are closely related. Reputation refers to the track record of key
suppliers. Reliability pertains to the dependability of the
components and the coding that go into their products.
Repairability is a measure of how quickly and easily suppliers can
fix or replace failing parts.
The reputation of key suppliers of servers, disk storage systems, database
management systems, and network hardware and software plays a
principal role in striving for high availability. It is always best to go with
the best. Reputations can be verified in several ways. Percent of market
share is one measure. Reports from industry analysts and Wall Street are
another. Track record in the field is a third. Customer references can be
especially useful when it comes to confirming such factors as cost,
service, quality of the product, training of service personnel, and
trustworthiness.
The reliability of the hardware and software can also be verified from
customer references and industry analysts. Beyond that, you should
consider performing what we call an empirical component reliability
analysis. Figure 2 lists the steps to perform to accomplish this. An
analysis of problem logs should reveal any unusual patterns of failure and
should be studied by supplier, product, using department, time and day
of failures, frequency of failures, and time to repair. Suppliers often keep
onsite repair logs that can be perused to conduct a similar analysis.
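A problem-log analysis of this kind might look like the sketch below. The log layout, the supplier and product names, and the repair times are all hypothetical. Only two of the dimensions mentioned above (supplier and product) are grouped here, reporting failure frequency and mean time to repair for each.

```python
from collections import defaultdict

# Hypothetical problem-log entries: (supplier, product, hours_to_repair).
problem_log = [
    ("SupplierA", "disk array", 2.0),
    ("SupplierA", "disk array", 4.0),
    ("SupplierB", "router", 1.0),
]


def summarize(log):
    """Return {(supplier, product): (failure_count, mean_hours_to_repair)}."""
    totals = defaultdict(lambda: [0, 0.0])
    for supplier, product, hours in log:
        entry = totals[(supplier, product)]
        entry[0] += 1       # frequency of failures
        entry[1] += hours   # accumulated repair time
    return {key: (count, total / count)
            for key, (count, total) in totals.items()}


summary = summarize(problem_log)
for (supplier, product), (count, mean_hours) in sorted(summary.items()):
    print(f"{supplier} / {product}: {count} failures, "
          f"mean repair {mean_hours:.1f} h")
```

The same grouping could be repeated by using department or by time and day of failure if the log records those fields.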
Figure 2: Empirical methods for component reliability analysis
Repairability is the relative ease with which service technicians can
resolve or replace failing components. Two common metrics used to
evaluate this trait are how long it takes to do the actual repair and
how often the repair work needs to be repeated. In more sophisticated
systems, this can be done from remote diagnostic centers where
failures are detected and circumvented, and arrangements are made
for permanent resolution with little or no involvement of operations
personnel.
The next characteristic of high availability is recoverability. This refers to
the ability to overcome a momentary failure in such a way that there
is no impact on end-user availability. It could be as small as a portion
of main memory recovering from a single-bit memory error, and as
large as having an entire server system switch over to its standby
system with no loss of data or transactions. Recoverability also
includes retries of attempted reads and writes out to disk or tape, as
well as the retrying of transmissions down network lines.
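The retry behaviour described above can be sketched as a bounded retry loop. This is a minimal illustration rather than how any particular disk, tape, or network subsystem implements it. The function names are hypothetical, and the simulated read fails twice before succeeding so that the caller never sees the transient errors.

```python
import time


def with_retries(operation, attempts=3, delay_seconds=0.0):
    """Call operation() until it succeeds or the attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError as error:
            last_error = error
            if attempt < attempts:
                time.sleep(delay_seconds)  # brief pause before retrying
    raise last_error


# Hypothetical flaky read: fails twice, then succeeds.
calls = {"count": 0}


def flaky_read():
    calls["count"] += 1
    if calls["count"] < 3:
        raise OSError("transient read error")
    return b"payload"


result = with_retries(flaky_read)
print(result, "after", calls["count"], "attempts")
```

If every attempt fails, the last error is re-raised, at which point the failure is no longer momentary and becomes an outage to be handled by the wider recovery process.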
Responsiveness is the sense of urgency all people involved with high
availability need to exhibit. This includes having well-trained suppliers
and in-house support personnel who can respond to problems quickly
and efficiently. It also pertains to how quickly the automated recovery
of resources such as disks or servers can be enacted.
The final characteristic of high availability is robustness, which describes
the overall design of the availability process. A robust process will be
able to withstand a variety of forces, both internal and external, that
could easily disrupt and undermine availability in a weaker
environment. Robustness puts a high premium on documentation and
training to withstand technical changes as they relate to platforms,
products, services, and customers; personnel changes as they relate
to turnover, expansion, and rotation; and business changes as they
relate to new direction, acquisitions, and mergers.
ASSESSING AN INFRASTRUCTURE'S AVAILABILITY PROCESS
ASSESSING AN AVAILABILITY PROCESS USING WEIGHTED CRITERIA