
Availability and Reachability in eHealth

A discussion of eHealth availability and reachability: concepts, definitions, calculations and approximations

Prepared by: Dan Seligman, Concord Engineering

Copyright 2004 Concord Communications, Inc. eHealth, the Concord Logo, Live Health, Live Status, SystemEDGE, AdvantEDGE and/or other Concord marks or products referenced herein are either registered trademarks or trademarks of Concord Communications, Inc. Other trademarks are the property of their respective owners.

Table of Contents
I. Introduction
II. Availability
III. Reachability
IV. References

I. Introduction
This paper addresses the related concepts of availability and reachability in eHealth. It describes the underlying ideas, relevant performance variables, and the details of associated calculations and approximations.

Conceptually, availability refers to the ability of a managed eHealth element to perform its assigned functions. Availability is a binary property. At any point in time an element is either available or unavailable: eHealth does not recognize partial availability. In cases where an element might be unable to perform all of its functions without being totally disabled, we adjust our models to resolve it into subcomponents, each of which can be assigned a binary availability state. Where this type of resolution is impossible, eHealth considers the partially disabled element unavailable. This latter case occurs infrequently. Formally, we define availability as the percentage of some time interval during which the element was available, i.e., capable of performing its assigned functions.

Reachability refers to the ability of eHealth to communicate with a polled element. Formally, reachability is the percentage of some time interval during which the eHealth system could communicate with the element.

Availability is an intrinsic property of the element. Reachability, on the other hand, reflects the ability of eHealth to poll the element. In the following sections we discuss, in detail, how availability, reachability and associated performance variables are calculated and used in eHealth.

II. Availability
As described above, the availability of an element is the percentage of time that the element was available over some period. Available time is the actual time, usually denominated in seconds, that the element was available over the same period. Availability is usually calculated over a time period referred to as total time. Mathematically,

Concord Communications Availability and Reachability in eHealth

availability = 100 * (available time)/(total time)

where total time is the number of seconds since the last good poll. A good poll is an SNMP poll of the associated element that was correctly formatted, finished successfully and detected no counter wraps in any polled variable. Total time includes time gaps in polling due to missed polls, bad polls or events such as system reboots. A missed poll is one that did not result in a response from the element. A bad poll is one that resulted in an erroneous response or detected a counter wrap.

A related concept is delta time, the number of seconds between two successive good polls. Delta time does not include time gaps in polling due to events such as system reboots or bad polls, but it does include gaps due to missed polls. Delta time is usually equal to total time; however, if a bad poll or a system reboot occurs, delta time is less than total time. A system reboot is defined as a reset of the (nearly universal) MIB variable sysUpTime.

Consistent with our definition of available time, unavailable time is the time that the element was unavailable over some period.

In principle an element can exist in one of only two availability states, available and unavailable. However, when we take the perspective of the eHealth server and consider that the information it obtains is, in general, imperfect, we wind up with three perceived availability states: available, unavailable and unknown. For the most part, we treat time when the availability state was unknown as unavailable, avoiding the unknown state and causing our availability calculations to err on the side of unavailability. However, when the time associated with a report is a superset of the time over which we have calculated availability, e.g., where eHealth discovered the device in the middle of the report period, we consider the time over which availability was not calculated as unknown.

There are two additional qualifications to our definitions of availability. When a device reboots, we assume the element state was unavailable for the time between the last good poll and the time the device came back up. In the case of a hiccup in the eHealth server, we assume the element state was available for the time between the last good poll and the restart time, although in the case of a server restart coincident with a device reboot this rule is overridden by the reboot rule.

There are four categories of elements for which we calculate availability:
- LAN/WAN Interfaces
- Routers, Remote Access Servers and Applications
- Servers
- Response Paths

We treat each case a little differently. In the following sections we address each in turn.


LAN/WAN Interfaces
We define LAN/WAN interface availability as the percentage of time an interface has an operational status that renders it available, i.e., capable of sending and receiving network traffic. We base our definition of availability on two relevant MIB variables in ifTable (or their equivalents): ifOperStatus, the operational status of the interface, and ifLastChange, the value of sysUpTime at the time of the last status change.

We currently use the statuses described in RFC 1573 [3]. According to that standard, the variable ifOperStatus can take on five possible values. A sixth value (not in the standard) indicates that no status was returned. In the discussion below we distinguish between availability states, which we are trying to calculate, and operational statuses, which we obtain by polling and from which we are attempting to determine availability states. We identify those statuses which map to the available state, and those which map to the unavailable state. We define as available the following statuses:

    noSuchName(0)   no status returned
    up(1)           ready to pass packets
    dormant(5)      waiting for some external event in order to pass packets

and as unavailable the following:

    down(2)
    testing(3)      in some test mode
    unknown(4)      status cannot be determined for some reason

We know there was a status change during a polling interval if ifLastChange is greater than the last poll time (and, trivially, less than the current poll time). This status change may or may not be "operationally significant," i.e., represent a change from the available state to the unavailable state or the reverse. There are three cases:

1. The status at the time of the current poll is the same as the status at the time of the last poll and ifLastChange is less than the time of the last poll. Here the status has not changed from one poll to the next and the value of the availability state for the current poll is constant (available or unavailable) for the entire poll period.


[Figure: timeline with ifLastChange, last poll and current poll, in that order; status A at the last poll and status A at the current poll.]

2. The status at the time of the current poll is different from the status at the time of the last poll. In this case ifLastChange is greater than the time of the last poll.

[Figure: timeline with last poll, ifLastChange and current poll, in that order; status A at the last poll and status B at the current poll.]

There are two possibilities here:

(a) If the status at the last poll and the status at the current poll represent a change from available to unavailable or the reverse, we assign the time between the last poll and ifLastChange to the state indicated by the status at the last poll, and the time between ifLastChange and the current poll to the state indicated by the status at the current poll. This effectively assumes a single change in status, although in principle there could have been more than one in the time period between the last poll and ifLastChange.

(b) If, on the other hand, the statuses before and after ifLastChange are both among the available statuses above or, alternatively, both statuses are unavailable, there might have been one, two or more status changes between the last poll and ifLastChange. For this period we invoke the environment variable NH_UNKNOWN_AVAIL_PCT, which represents the percentage of an unknown time interval that we assign to the other state, i.e., unavailable if the statuses both map to available, or available if they both map to unavailable. The default value of this variable is zero. For the period between ifLastChange and the current poll, we identify the availability state as that indicated by the status of the current poll.

3. The status at the time of the current poll is the same as the status at the time of the last poll but ifLastChange is greater than the time of the last poll. Here there were at least two status changes between the last poll time and ifLastChange (one to leave the current status and one to get back to it), and possibly more. Again, we have no way of knowing the availability of the interface between the time of the last poll and ifLastChange, so we invoke NH_UNKNOWN_AVAIL_PCT in the same manner as described above. Once again, for the period between ifLastChange and the current poll, we identify the availability state as that indicated by the status of the current poll.

[Figure: timeline with last poll, ifLastChange and current poll, in that order; status A at the last poll and status A at the current poll.]

In some cases we have operational status values for each poll, but no measurement of ifLastChange. In this case, we assume the availability state for the entire poll period to be the same as the state at the end of the poll period, although we have no way of knowing that this was really the case.

One final consideration is how we assign status in the event of a device reboot during the poll period, where ifLastChange may not, in general, be coincident with the reboot time. In this case, we assume that the state of the interface was unavailable from the time of the last poll to the time of ifLastChange.
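The case analysis above can be sketched in code. The following Python fragment is a minimal illustration, not eHealth's implementation: the function and argument names are hypothetical, and it assumes that all times are expressed in seconds on a single clock (a real ifLastChange is a sysUpTime value in TimeTicks). The argument unknown_avail_pct plays the role of NH_UNKNOWN_AVAIL_PCT. The reboot rule is not modeled.

```python
from typing import Optional

# Status-to-state mapping taken from the lists in this section.
AVAILABLE_STATUSES = {0, 1, 5}    # noSuchName, up, dormant
UNAVAILABLE_STATUSES = {2, 3, 4}  # down, testing, unknown


def is_available(status: int) -> bool:
    """Map an ifOperStatus value to the binary availability state."""
    return status in AVAILABLE_STATUSES


def interface_available_seconds(last_poll: float,
                                current_poll: float,
                                last_status: int,
                                current_status: int,
                                if_last_change: Optional[float] = None,
                                unknown_avail_pct: float = 0.0) -> float:
    """Seconds counted as available between two successive polls.

    Hypothetical helper: times are seconds on one common clock.
    """
    period = current_poll - last_poll
    now_up = is_available(current_status)

    # Case 1, or no ifLastChange at all: the state seen at the end of the
    # poll period is assumed to hold for the whole period.
    if if_last_change is None or if_last_change <= last_poll:
        return period if now_up else 0.0

    before = if_last_change - last_poll      # state partly unknown
    after = current_poll - if_last_change    # state given by current poll
    after_available = after if now_up else 0.0

    was_up = is_available(last_status)
    if was_up != now_up:
        # Case 2(a): assume a single transition at ifLastChange.
        before_available = before if was_up else 0.0
    else:
        # Cases 2(b) and 3: assign unknown_avail_pct percent of the
        # unknown interval to the *other* state.
        other = unknown_avail_pct / 100.0
        before_available = before * ((1.0 - other) if now_up else other)

    return before_available + after_available
```

For example, with polls at 0 s and 300 s and a change from up(1) to down(2) at 100 s, the sketch counts 100 s as available; with the status up(1) at both polls, a change at 100 s and unknown_avail_pct set to 50, it counts 250 s.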

Routers, Remote Access Servers and Applications


Here we obtain a metric for available time directly from the element, divide by total time, and convert to a percent:

availability = 100 * (available time)/(total time)

This formula works fine when the available time counter survives a period of unavailability, but not as well when it resets. In the latter case, the available time counter is reset when the element comes back up, so the resultant availability is potentially an underestimate. We have no way of knowing how many restarts there were, so we assume there was a single restart and use the same calculation. A special case is device element availability that uses the MIB variable sysUpTime for available time.
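As a rough illustration of the formula and the single-restart assumption, consider the Python sketch below. It is not the eHealth implementation. The function name is hypothetical, and it assumes that available time for a poll period is the increase in the counter when the counter survives, and the counter's current value when the counter has reset.

```python
def counter_availability(prev_counter_s: float,
                         curr_counter_s: float,
                         total_time_s: float) -> float:
    """Percent availability for one poll period from an available-time counter.

    Hypothetical helper: both counter readings are in seconds.
    """
    if total_time_s <= 0:
        raise ValueError("total time must be positive")

    if curr_counter_s >= prev_counter_s:
        # Counter survived: assume available time is its increase.
        available_s = curr_counter_s - prev_counter_s
    else:
        # Counter reset: assume exactly one restart, so the element has
        # been available only since the counter restarted from zero.
        available_s = curr_counter_s

    # Available time cannot exceed the period it is measured over.
    available_s = min(available_s, total_time_s)
    return 100.0 * available_s / total_time_s
```

With a total time of 300 s, a counter that goes from 5000 s to 120 s is treated as one restart followed by 120 s of available time, giving 40%. If the element had in fact restarted several times, any available time before the final restart is lost, which is why the result can be an underestimate.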

In some cases available time is not provided directly but there is a status variable that identifies the status at any moment and which is obtained by eHealth at the end of each poll period. Each status value can be mapped to an availability state, and can be used in the same way ifOperStatus is used to obtain the availability of interfaces in the case where there is no value for ifLastChange. We make the same approximation here, assuming the availability state for the entire poll period is the same as the availability state at the end of the poll period. For example, consider the following polled variable in Reference [1]:
    ccmStatus OBJECT-TYPE
        SYNTAX INTEGER {
            unknown(1),
            up(2),
            down(3)
        }
        MAX-ACCESS read-only
        STATUS current
        DESCRIPTION
            "The current status of the CallManager. A CallManager is up
            if the SNMP Agent received a system up event from the local
            CCM system
            unknown: Current status of the CallManager is Unknown
            up: CallManager is running & is able to communicate with
                other CallManagers
            down: CallManager is down or the Agent is unable to
                communicate with the local CallManager."
        ::= { ccmEntry 5 }

This status variable tells us the availability state of the Cisco CallManager at a point in time at the end of each poll period but says nothing about the time between polls. Here we assume that available time is equal to delta time if the Cisco CallManager element is up (available) and zero otherwise. Thus for each poll period availability is either 100% or 0%.
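The per-poll rule can be sketched as follows. This Python fragment is illustrative only: the function name is hypothetical, and combining several poll periods by summing their available time and delta time is an assumption made here for the example. The status values come from the ccmStatus definition above.

```python
from typing import List, Optional, Tuple

CCM_UP = 2  # ccmStatus values: unknown(1), up(2), down(3)


def status_based_availability(polls: List[Tuple[float, int]]) -> Optional[float]:
    """Percent availability over a list of (delta_time_seconds, ccm_status).

    Each poll period counts as fully available if the status read at the
    end of the period is up, and fully unavailable otherwise (including
    the unknown status).
    """
    total_s = sum(delta for delta, _ in polls)
    if total_s == 0:
        return None  # nothing polled, nothing to report
    available_s = sum(delta for delta, status in polls if status == CCM_UP)
    return 100.0 * available_s / total_s
```

Four 300-second poll periods ending in up, down, up and unknown would yield 50% under these assumptions, even though each individual period is either 100% or 0%.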


Servers
We use the same approach and the same formula for Servers as for Routers, Remote Access Servers and Applications. However, there is an additional wrinkle with servers concerning precisely which variable to use for available time. Virtually all devices that support an SNMP agent, including servers, support sysUpTime [4]:

    sysUpTime OBJECT-TYPE
        SYNTAX TimeTicks
        ACCESS read-only
        STATUS mandatory
        DESCRIPTION
            "The time (in hundredths of a second) since the network
            management portion of the system was last re-initialized."
        ::= { system 3 }

For this reason, a reset of this variable effectively defines a device reboot and this variable is used as a default for available time. Since the variable is denominated in centiseconds, we need to divide by 100 to denominate available time in seconds.

A close reading of the DESCRIPTION reveals that this variable actually represents the time since the network management portion of the system or server, i.e., the agent, was last initialized. So use of sysUpTime is, in fact, misleading, as the server or system might be hale and whole while only the agent is unavailable. There is an alternative variable, present in any agent that supports the Host Resources MIB [2], that represents true server uptime:

    hrSystemUptime OBJECT-TYPE
        SYNTAX TimeTicks
        ACCESS read-only
        STATUS mandatory
        DESCRIPTION
            "The amount of time since this host was last initialized.
            Note that this is different from sysUpTime in MIB-II [3]
            because sysUpTime is the uptime of the network management
            portion of the system."
        ::= { hrSystem 1 }

When this variable is available, we improve our availability calculations by equating available time to hrSystemUptime, so that server availability reflects host uptime rather than agent uptime.

While this approach offers more accurate server availability figures, it has an unfortunate side effect. Interface availability is calculated from the values of ifOperStatus and ifLastChange for each individual interface, as discussed above. ifLastChange is likely to be coincident with the reboot time (reset of sysUpTime) and hence aligned with server availability when the latter is calculated using sysUpTime. When we use hrSystemUptime, this is no longer true, and eHealth might report interfaces going down owing merely to an agent reset while the server remains available.
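The choice of uptime variable and the unit conversion can be illustrated with a short Python sketch. The function name and arguments are hypothetical and the fragment is not taken from eHealth; it only shows the preference for hrSystemUptime over sysUpTime and the division by 100 that converts TimeTicks to seconds.

```python
from typing import Optional

TICKS_PER_SECOND = 100  # TimeTicks are hundredths of a second


def server_uptime_seconds(sys_uptime_ticks: int,
                          hr_system_uptime_ticks: Optional[int] = None) -> float:
    """Uptime in seconds, preferring hrSystemUptime when the agent has it.

    Hypothetical helper: pass None when the Host Resources MIB is not
    supported, in which case sysUpTime (agent uptime) is the fallback.
    """
    if hr_system_uptime_ticks is not None:
        ticks = hr_system_uptime_ticks
    else:
        ticks = sys_uptime_ticks
    return ticks / TICKS_PER_SECOND
```

An agent restarted five minutes ago on a host that has been up for a day would report a sysUpTime of 30000 ticks and an hrSystemUptime of 8640000 ticks; the sketch returns 86400 seconds when both are supplied and 300 seconds when only sysUpTime is known.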

Response Paths
A Response Path element represents transactions between a response source and a response destination and applies to client-server communication. There are two types of availability associated with a Response Path element: path availability and service availability.

Path availability, or availability without a qualifier, refers to the availability of the Response Path itself. We define path availability as the percentage of attempted transactions during a poll period that were successful. Since, in general, there will be a mix of successful and unsuccessful transactions during a poll period, we divide the poll period between available time and unavailable time in proportion to the number of transactions that succeeded and failed respectively. If no transactions are attempted during a poll period, path availability is unknown for that poll period and no value is reported.

Service availability attempts to measure or approximate not the availability of the Response Path itself but rather the availability of the associated service to the client. We make the approximation that if a single transaction is completed successfully during the poll period, the service is available for the entire poll period. If, on the other hand, there are attempted transactions but no completed transactions during the poll period, the service is unavailable for that poll period. For service availability, we make no distinction between the case where one or more transactions succeed and none fail and the case where there is a mix of transaction successes and failures. If there are no attempted transactions during the poll period, eHealth does not report any value for service availability.
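The difference between the two measures is easiest to see side by side. The Python sketch below is a minimal illustration with hypothetical function names; it takes the number of attempted and successful transactions in one poll period and returns None where the text says no value is reported.

```python
from typing import Optional


def path_availability(attempted: int, succeeded: int) -> Optional[float]:
    """Percent of attempted transactions in the poll period that succeeded."""
    if attempted == 0:
        return None  # unknown: no value is reported
    return 100.0 * succeeded / attempted


def service_availability(attempted: int, succeeded: int) -> Optional[float]:
    """100% if any transaction succeeded in the poll period, otherwise 0%."""
    if attempted == 0:
        return None  # no value is reported
    return 100.0 if succeeded > 0 else 0.0
```

For a poll period with 10 attempted transactions of which 7 succeeded, path availability is 70% while service availability is 100%.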


Non-Polling Environments
In some cases eHealth is called upon to derive availability from non-polled sources, such as other management stations, where the data is imported into eHealth via protocols other than SNMP. Here the concept of polling, central to many eHealth definitions, does not exist. Under these circumstances, eHealth only reports availability if available time can be obtained directly or calculated from parameters explicitly provided by the data source.

If there is no way of calculating availability, eHealth offers a somewhat inconsistent default. In some reports it causes availability to display as 100%; in others as "No Data Available" or "Variable not Supported". It is important to note that even in reports that actually display availability it is not a bona fide measurement: its value will always be 100% and never change.

III. Reachability
eHealth precedes each SNMP poll with a series of pings that determine whether the element is accessible. If the series is unsuccessful, the poll is flagged as a missed poll and the element is considered to be unreachable for the duration of the poll period ending in that poll. If the ping series is successful, the poll is considered a good poll, and the element is considered reachable for the duration of the poll period.

Reachable time is the time, generally denominated in seconds, that the eHealth system could communicate with, i.e., ping, the element. An element is considered reachable for the duration of any poll ending in a successful ping test and unreachable for the duration of any poll ending in an unsuccessful ping test. Thus for any poll period (delta time), reachability is either 100% or 0%. There are two exceptions to this rule:
- If a reboot (reset of sysUpTime) occurred during the poll period, the element is assumed unreachable from the start of the poll period to the reboot time and reachable from the reboot time to the end of the poll period, assuming the reachability test was successful. This is another way of saying that reachability is set equal to availability in the case of a system reboot.
- If eHealth is not polling, reachability is also set equal to availability.

Unreachable time is the number of seconds that the eHealth system could not communicate with the element. It is possible for the user to disable the ping measurements. If this is done, eHealth uses the success or failure of the SNMP request to determine reachability. Reachability is undefined in non-polling environments.
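The per-poll reachability rule and the reboot exception can be sketched as follows. This Python fragment is illustrative only: the function and argument names are hypothetical, times are assumed to be seconds on a single clock, and the case where eHealth is not polling is not modeled.

```python
from typing import Optional


def reachable_seconds(poll_start: float,
                      poll_end: float,
                      ping_ok: bool,
                      reboot_time: Optional[float] = None) -> float:
    """Seconds counted as reachable in the poll period ending at poll_end.

    ping_ok is the outcome of the ping series (or of the SNMP request
    when pings are disabled) that precedes the poll at poll_end.
    """
    if not ping_ok:
        return 0.0  # missed poll: unreachable for the whole period
    if reboot_time is not None and poll_start < reboot_time <= poll_end:
        # Reboot exception: unreachable until the reboot, reachable after.
        return poll_end - reboot_time
    return poll_end - poll_start
```

For a 300-second poll period ending in a successful ping test, the sketch returns 300 s, or 200 s if a reboot was detected 100 s into the period.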


IV. References
[1] Cisco Systems, Inc., CISCO-CCM-MIB.my, LastUpdated 200012010000Z
[2] P. Grillo, S. Waldbusser, Host Resources MIB, RFC 1514, September 1993
[3] K. McCloghrie, F. Kastenholz, Evolution of the Interfaces Group of MIB-II, RFC 1573, January 1994
[4] K. McCloghrie, M. Rose, Management Information Base for Network Management of TCP/IP-based internets: MIB-II, RFC 1213, March 1991
