Availability Management

- Premanand Lotlikar
31st July, 2007

Agenda
• Introduction
• Objective of Availability Mgmt
• Basic Concepts
• Benefits
• Relationship with other processes
• Activities in Availability Mgmt
• Process Control
• Key Performance Indicators
• Cost
• Possible Problems
Objectives
• Determining availability requirements in close
collaboration with customers
• Guaranteeing the level of availability established
for the IT services
• Monitoring the availability of the IT services
• Proposing improvements in the IT infrastructure
and services with a view to increasing levels of
availability
• Supervising compliance with the OLAs and UCs
agreed with internal and external service
providers
Basic Concepts
• High Availability means
– IT service is continuously available to the customer
– Little downtime
– Rapid service recovery
• Availability of service depends on
– Complexity of the IT infrastructure architecture
– Reliability of the components
– Ability to respond quickly and effectively to faults
– Quality of maintenance by support and suppliers
Basic Concepts
• Reliability means
– Service is available for an agreed period without
interruptions
• Includes resilience
• Calculated using statistics
• Determined by
– Reliability of the components
– Ability of service/component to operate despite failure
(resilience)
– Preventive maintenance
Basic Concepts
• Maintainability needed to
– Keep the services in operations
– Restore services when they fail
• Includes
– Taking measures to prevent faults
– Detecting faults
– Making a diagnosis, including self-diagnosis by components
– Resolving the fault
– Restoring the service
Basic Concepts
• Mean Time to Repair (MTTR)
– Avg time b/w the occurrence of a fault and service
recovery
• Mean Time Between Failures (MTBF)
– Avg time b/w recovery from one incident and the
occurrence of next
• Mean Time Between System Incidents (MTBSI)
– Avg time b/w the occurrence of two consecutive
incidents
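The three averages above can be computed from an incident log. The following is a minimal Python sketch, assuming a hypothetical log of (fault occurred, service recovered) timestamps; the data and function names are illustrative and not part of the original slides. Over a long log, MTBSI is approximately MTBF + MTTR.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (fault occurred, service recovered), in time order.
incidents = [
    (datetime(2007, 7, 1, 8, 0), datetime(2007, 7, 1, 10, 0)),
    (datetime(2007, 7, 11, 10, 0), datetime(2007, 7, 11, 11, 0)),
    (datetime(2007, 7, 21, 11, 0), datetime(2007, 7, 21, 14, 0)),
]

def mean(deltas):
    """Average of a non-empty list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)

def mttr(log):
    """Mean Time to Repair: fault occurrence to service recovery."""
    return mean([recovered - fault for fault, recovered in log])

def mtbf(log):
    """Mean Time Between Failures: recovery to the next fault."""
    return mean([log[i + 1][0] - log[i][1] for i in range(len(log) - 1)])

def mtbsi(log):
    """Mean Time Between System Incidents: fault to the next fault."""
    return mean([log[i + 1][0] - log[i][0] for i in range(len(log) - 1)])

print("MTTR :", mttr(incidents))
print("MTBF :", mtbf(incidents))
print("MTBSI:", mtbsi(incidents))
```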
Benefits
• Fulfillment of the agreed service levels.
• Reduction in the costs associated with a
given level of availability.
• The customer perceives a better quality of
service.
• The levels of availability progressively
increase.
• The number of incidents is reduced.
Inputs - Outputs
Relationship with other processes
• Service Level Mgmt is responsible for
negotiating & managing availability
• Availability is one of the most important
elements in an SLA
• Configuration Mgmt has information about
the infrastructure and can provide valuable
information to Availability Mgmt
• Changes in capacity can often affect the
availability of a service
• Changes to availability will affect capacity
• These 2 processes exchange info about
– Scenarios for upgrading
– Phasing out IT components
– Availability trends that may need changes to
capacity
• Problem Mgmt is directly involved in
identifying and resolving the causes of
actual or potential availability problems
• Incident Mgmt provides reports with
information about recovery times, repair
times etc. This information is used to
determine the achieved availability.
• Change Mgmt informs Availability Mgmt
about the Forward Schedule of Change (FSC)
• Availability Mgmt informs Change Mgmt
about maintenance related to new services
and elements.
Activities
• Planning
• Monitoring
Planning
• Determining the availability requirements
• Designing for availability
• Designing for recoverability
• Security issues
• Maintenance management
• Developing the Availability Plan
Determining the availability
requirements
• Must be undertaken before SLA is
concluded
• Should address both new IT services and
changes to existing services
• Clearly defining availability requirements
early is essential to prevent confusion and
differences
Determining the availability
requirements
• Should identify:
– Key business functions
– Agreed definition of IT service downtime
– Quantifiable availability requirements
– Quantifiable impact on the business functions
of unscheduled IT service downtime
– Business hours of customer
– Agreements about maintenance windows
Designing for availability
• Vulnerabilities affecting availability
standards should be identified early
• This will prevent
– Excessive development costs
– Unplanned expenditure at later stages
– Additional cost by suppliers
– Overall delays
Designing for recoverability
• Uninterrupted availability is rarely feasible
• Design for recoverability involves
– Effective Incident Mgmt
– Appropriate escalation
– Communication
– Backup and recovery procedures
– Tasks, responsibilities and authority clearly
defined
Key Security issues
• Security and reliability are closely linked
• High availability can be supported by
effective information security
• This includes:
– Determining who is authorized to access
secure areas
– Determining which critical authorizations may
be issued
Maintenance management
• There will always be scheduled windows of
unavailability
• These periods can be used for preventive
actions
• Maintenance must be carried out when
impact on services can be minimized
Developing the Availability Plan
• Long term plan concerning availability
over the next few years
• It is not the implementation plan for
Availability Mgmt
• The plan requires liaison with areas such as
– Service Level Mgmt
– IT Service Continuity Mgmt
– Capacity Mgmt
– Change Mgmt
Methods and Techniques
• Component Failure Impact Analysis (CFIA)
– Uses an Availability matrix with strategic
components and their roles in each service
– Horizontal Analysis
– Vertical Analysis
CFIA
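A CFIA availability matrix can be sketched as a small table of components against services. The Python example below is illustrative only: the component names, service names and the "X"/"A" cell markings are assumptions, not taken from the original slides. Horizontal analysis reads along a row (how many services one component can bring down); vertical analysis reads down a column (which components one service cannot do without).

```python
# Hypothetical availability matrix: rows are components, columns are services.
# "X" = failure of the component interrupts the service,
# "A" = an alternative component can take over,
# ""  = the service does not use the component.
SERVICES = ["email", "payroll", "intranet"]
MATRIX = {
    "router":      ["X", "X", "X"],
    "mail-server": ["X", "",  ""],
    "db-server":   ["",  "X", "A"],
    "web-server":  ["",  "",  "X"],
}

def horizontal_analysis(matrix):
    """Per component: number of services interrupted if it fails."""
    return {component: row.count("X") for component, row in matrix.items()}

def vertical_analysis(matrix, services):
    """Per service: components whose failure interrupts it."""
    return {
        service: [c for c, row in matrix.items() if row[i] == "X"]
        for i, service in enumerate(services)
    }

print(horizontal_analysis(MATRIX))
print(vertical_analysis(MATRIX, SERVICES))
```

In this made-up matrix the router stands out in the horizontal analysis, since its failure interrupts every service.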
Fault Tree Analysis
• Used to identify the chain of events leading to
failure of an IT service
• Distinguishes the following events:
– Basic Event: e.g. power outages or operator error
– Resulting Event: resulting from combination of
earlier events
– Conditional Event: events that occur only in
certain conditions
– Trigger Event: events that cause other events
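A fault tree can be modelled as basic events combined into resulting events. The Python sketch below assumes AND/OR gates, a common fault-tree convention that the slides do not spell out; the event names are hypothetical. Conditional and trigger events are left out to keep the example short.

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Event:
    name: str
    gate: str = "BASIC"  # "BASIC", "AND" or "OR" (assumed gate types)
    inputs: List["Event"] = field(default_factory=list)

    def occurs(self, active: Set[str]) -> bool:
        """True if this event occurs given the set of active basic events."""
        if self.gate == "BASIC":
            return self.name in active
        results = [event.occurs(active) for event in self.inputs]
        return all(results) if self.gate == "AND" else any(results)

# Basic events (hypothetical)
power_outage = Event("power outage")
ups_failure = Event("UPS failure")
operator_error = Event("operator error")

# Resulting events: combinations of earlier events
no_power = Event("server loses power", "AND", [power_outage, ups_failure])
service_down = Event("service down", "OR", [no_power, operator_error])

print(service_down.occurs({"power outage"}))
print(service_down.occurs({"power outage", "UPS failure"}))
```

With only a power outage active the service stays up, because the UPS covers it; the service fails when both occur or when an operator error occurs.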
Availability Calculations
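• Availability is commonly defined as a
percentage as follows:
– Availability (%) = (AST - DT) / AST x 100
– AST = agreed service time, DT = downtime
during the agreed service time
• For example, if the service is 24/7 and
over the last month the system has been
down for four hours to carry out
maintenance, the real availability of the
system was:
– (720 - 4) / 720 x 100 ≈ 99.4%, taking a
30-day month (AST = 720 hours)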
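The calculation can be sketched in Python. This assumes the common formulation availability = (agreed service time - downtime) / agreed service time x 100, and a 30-day month, since the slides do not state the month length; the function and variable names are illustrative.

```python
def availability_pct(agreed_service_hours, downtime_hours):
    """Availability as a percentage of the agreed service time."""
    if agreed_service_hours <= 0:
        raise ValueError("agreed service time must be positive")
    return (agreed_service_hours - downtime_hours) / agreed_service_hours * 100

# 24/7 service over an assumed 30-day month, with 4 hours of downtime.
agreed_hours = 24 * 30
result = availability_pct(agreed_hours, 4)
print(f"{result:.2f}%")  # prints 99.44%
```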
Process Control
• Critical Success Factors
– Business must have clearly defined
availability objectives
– SLM must have been set up to formalize
agreements
– Both parties must use the same definitions of
availability and downtime
Process Control
• Key Performance Indicators
– Percentage availability per service
– Downtime duration
– Downtime frequency
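These three indicators can be derived per service from a downtime log. The Python sketch below is a rough illustration: the log entries, service names and the 30-day agreed service time are all assumptions.

```python
from collections import defaultdict

AGREED_SERVICE_HOURS = 24 * 30  # assumption: 24/7 service, 30-day month

# Hypothetical downtime log: (service, downtime in hours)
downtime_log = [("email", 1.5), ("email", 0.5), ("payroll", 4.0)]

def kpis(log, agreed_hours):
    """Per service: downtime duration, downtime frequency, % availability."""
    totals = defaultdict(lambda: {"downtime_hours": 0.0, "downtime_count": 0})
    for service, hours in log:
        totals[service]["downtime_hours"] += hours
        totals[service]["downtime_count"] += 1
    for values in totals.values():
        uptime = agreed_hours - values["downtime_hours"]
        values["availability_pct"] = uptime / agreed_hours * 100
    return dict(totals)

for service, values in kpis(downtime_log, AGREED_SERVICE_HOURS).items():
    print(service, values)
```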
Cost
Possible Problems
• The real availability of the service is not monitored
correctly.
• There is no commitment to the process in the IT
organization.
• The appropriate software tools and personnel are not
available.
• The availability objectives do not match the customer's
needs.
• There is a lack of coordination with other processes.
• Internal and external service providers do not recognize
the authority of the Availability Manager as a result of a
lack of support from management
Thank you!
