Incident Process

MAJOR INCIDENT PROCESS
Overview
Version 2.2
June 28, 2018
Matthew Wollman
Page 2 HUIT Major Incident Process

Document Change Control
Version # | Date of Issue | Author(s) | Brief Description
0.1 | 8/3/2012 | Matthew Wollman | Start of Document
0.2 | 8/21/2012 | Matthew Wollman | Incorporated feedback from Courtney Harwood, Richard Ohlsten and Steve Martino
0.3 | 8/28/2012 | Matthew Wollman | Made major modifications to Responsibilities and Workflow. Added definitions for critical, core, and non-core service
0.4 | 9/10/2012 | Matthew Wollman | Incorporated feedback from Dennis Ravenelle; drafted water mark; reordered Objectives and Policy by importance; further clarified definition; added additional Role responsibilities; expanded Process activities; made grammatical changes
1.0 | 9/24/2012 | Matthew Wollman | First release of document after core team approval; removed P1 and P2 differences
1.1 | 11/27/2012 | Matthew Wollman | Separated Incident Commander and Incident Communications roles; added text about criteria for hierarchical escalations
2.0 | 2/13/2013 | Matthew Wollman & Janet Crystal | Combined Purpose and Scope, and objectives and policies. Reorganized roles and responsibilities by order of role involvement in process. Reorganized and reduced Process activities section to a high-level overview. Process activities will be detailed in separate documentation
2.1 | 8/15/2014 | Matthew Wollman | Change to RACI, Service Owner is Accountable for External Communications, Removed C-Cure to Critical Services
2.2 | 11/2/2014 | Matthew Wollman |


Table of Contents
Document Change Control
Purpose and Scope
Policies
Process Roles and Responsibilities
Incident Commander
Incident Commander Escalation
Incident Communicator
Service Desk
SOC Operations
Technical Resources (Infrastructure, Development, DevOps, etc.)
Technical Line Manager
Service Owner / Practice (or Product) Manager
Process Activities
Major Incident Identification
Initial Communication and Escalation
Incident Coordination
Conference Bridge
External Communication
Internal Communication
Investigation
Resolution
Incident Documentation
Appendix A: Process Flowchart for a Major Incident
Appendix B: RACI Matrix
Appendix C: Critical Services
Appendix D: Major Incident Process Timeframes (Estimated)
Glossary


Purpose and Scope
The Harvard University Information Technology (HUIT) Major Incident process provides a unified system
for resolving Major Incidents as quickly as possible through proper identification, predefined escalation
paths, and prompt communication procedures across all HUIT services.
A Major Incident is the interruption or degradation of a core production service (any centralized
HUIT-provided service that serves multiple customers and users) that disrupts its customers’ ability
to carry out University teaching, learning, research, and/or administration.
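The definition above can be read as a two-part test: the affected service must be a core production service, and the event must disrupt University work. The Python sketch below is illustrative only. The `Service` fields and the `is_major_incident` helper are assumed names and do not describe any HUIT tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """Hypothetical description of a service, used only for this sketch."""
    name: str
    centralized: bool             # provided centrally by HUIT
    serves_multiple_groups: bool  # serves multiple customers and users
    production: bool = True       # a production (not test) service


def is_core_production_service(service: Service) -> bool:
    """A core production service is centralized and serves multiple customers."""
    return service.production and service.centralized and service.serves_multiple_groups


def is_major_incident(service: Service, disrupts_university_work: bool) -> bool:
    """Both parts of the definition must hold for the event to qualify."""
    return is_core_production_service(service) and disrupts_university_work


# Illustrative inputs only.
email = Service("E-mail", centralized=True, serves_multiple_groups=True)
lab_app = Service("Single-group application", centralized=False, serves_multiple_groups=False)

print(is_major_incident(email, disrupts_university_work=True))    # True
print(is_major_incident(lab_app, disrupts_university_work=True))  # False
```

In practice the judgment about disruption is made by staff; the sketch only shows how the two conditions combine.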
The scope of this document is to provide an overview of the processes that apply to every Major
Incident for all HUIT services and that all HUIT employees must follow. Once trained, all HUIT employees
will be able to identify a Major Incident and to escalate it to the appropriate technical group for
resolution.
Policies
1. HUIT’s focus is to alert the community to the occurrence of a Major Incident as quickly as possible.
Early notification of a potential issue is more important than an accurate description of the problem.
2. HUIT will use standardized methods and procedures to enable an efficient and prompt response,
analysis, documentation, ongoing coordination and ownership, communication, and reporting.
3. Escalation in a Major Incident will start with the Incident Commander and move to the HUIT
employees most responsible for each service.
4. HUIT will communicate with affected end-users regularly throughout the lifecycle of a Major
Incident.
5. HUIT will maintain a consistent and regular presence through open communications among HUIT
staff and will provide consistent updates to the Service Desk, Service Owner, Incident Manager, and
HUIT leadership.
6. HUIT will log and document all details of Major Incidents throughout the lifetime of each event.

Process Roles and Responsibilities
Incident Commander
The Incident Commander has the highest level of responsibility during a Major Incident and is
accountable for its lifecycle through coordination, documentation, and communication. The roles of
HUIT Incident Commander and HUIT Incident Communicator may be combined in one person for
incidents that are of short duration or that are deemed less critical. For incidents of longer duration or
those with greater impact, the responsibility of the Incident Commander can be escalated to a Manager
or Director in HUIT.
The Incident Commander is responsible for the following activities:
• Facilitating and participating in a conference bridge
• Maintaining communication with Technical Resources and Service Owners for status updates and
additional information
• Coordinating resources needed to troubleshoot, communicate, and/or make decisions to resolve a
Major Incident
• Ensuring that internal and external communications about a Major Incident are completed in a
timely manner
• Creating and completing a Major Incident Report
Incident Commander Escalation

If the scale of the event requires escalation to a HUIT Manager or Director, the responsibilities for the
Incident Communicator role will remain with the original Incident Commander. The following conditions,
whether individual or in combination, will guide the need for escalation of Incident Commander
responsibilities to a higher level of HUIT management:
1. A Major Incident affects one of the Critical Services listed in Appendix C of this document.
2. A Major Incident affects over 1,000 users of one or more services.
3. A Major Incident is not or cannot be resolved within four hours.
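The three conditions above can be sketched as a simple check. This is a minimal illustration, not a HUIT tool: the function name, its parameters, and the choice to return a list of reasons are assumptions. Because the conditions guide the decision and do not automate it, the sketch reports which conditions are met and leaves the decision to staff.

```python
from datetime import timedelta
from typing import List, Optional

# Thresholds taken from the conditions listed above.
USER_THRESHOLD = 1000
RESOLUTION_LIMIT = timedelta(hours=4)


def escalation_reasons(
    affects_critical_service: bool,
    affected_users: int,
    elapsed: timedelta,
    estimated_time_to_resolve: Optional[timedelta] = None,
) -> List[str]:
    """Return the conditions that currently point toward escalation."""
    reasons = []
    if affects_critical_service:
        reasons.append("affects a Critical Service (Appendix C)")
    if affected_users > USER_THRESHOLD:
        reasons.append("affects over 1,000 users")
    # "Is not resolved": four hours have already passed (assumed inclusive).
    # "Cannot be resolved": the current estimate runs past the four-hour mark.
    projected = elapsed + (estimated_time_to_resolve or timedelta(0))
    if elapsed >= RESOLUTION_LIMIT or projected > RESOLUTION_LIMIT:
        reasons.append("not resolved, or not resolvable, within four hours")
    return reasons


print(escalation_reasons(False, 200, timedelta(hours=1)))   # []
print(escalation_reasons(True, 1500, timedelta(hours=5)))
```

Any single reason, or several together, would prompt a review of whether Incident Commander responsibilities should move to a HUIT Manager or Director.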

Incident Communicator
The Incident Communicator is responsible for the documentation and communication during a Major
Incident, both internally to HUIT and externally to customers and end-users. The roles of HUIT Incident
Commander and HUIT Incident Communicator may be combined in one person for incidents that are of
short duration or that are deemed less critical. For incidents of longer duration or those with greater
impact, the responsibility of the Incident Commander can be escalated to a Manager or Director in HUIT.
The Incident Communicator is responsible for the following activities:
• Participating in a conference bridge
• Communicating internally to HUIT staff and externally to the customers of the service, end-users,
and other non-HUIT parties
• Maintaining a record of events throughout a Major Incident
• Notifying HUIT staff and any external parties of the resolution
• Updating the HUIT website, Twitter, Facebook, and email distribution lists with notifications of
incidents, updates, and resolution
Service Desk
The Service Desk is responsible for the following activities:
• Identifying a Major Incident
• Escalating a Major Incident to the HUIT Incident Commander
• Logging Major Incident tickets for end-users
• Participating in a “servicedesk” Jabber chat room or the conference bridge
• Placing a generic Major Incident message on the ACD system
SOC Operations
The SOC Operations group is responsible for the following activities:
• Identifying a Major Incident
• Escalating a Major Incident to the HUIT Incident Commander
• Logging Major Incident tickets for end-users
• Notifying the Service Desk during business hours of any Major Incident

Technical Resources (Infrastructure, Development, DevOps, etc.)
Any HUIT Technical Resource who receives alerts or escalations, or who has a role in restoring HUIT
services to normal operation, is responsible for the following activities:
• Identifying a Major Incident
• Escalating a Major Incident to the Technical Line Manager
• Troubleshooting and working to resolve the incident in accordance with internal procedures for
handling Major Incidents
• Documenting incident details and steps taken to resolve the underlying problem
• Providing regular updates to the Technical Line Manager and/or the Incident Commander on the
status of an investigation and the resolution of the incident
Technical Line Manager

Any HUIT manager who manages technical resources and their performance is responsible for the
following activities:
• Identifying and escalating a Major Incident to the HUIT Incident Commander
• Identifying the scope of the problem and identifying additional services that may be affected by a
Major Incident
• Notifying respective service areas and providing updates throughout the lifecycle of a Major Incident
• Facilitating communication among technical resources, HUIT Incident Commander and Service
Owner
• Recording and tracking progress throughout the lifecycle of a Major Incident and providing updates
to the Incident Commander and Service Owner
• Estimating the service recovery time
• Managing the activities of the Technical Resources
Service Owner / Practice (or Product) Manager

Any HUIT employee or their proxy who is responsible for the overall quality of a service and has the
most comprehensive knowledge of its components is responsible for the following activities:
• Identifying a Major Incident
• Notifying the Service Desk during business hours of a Major Incident
• Identifying the business impact of a Major Incident
• Communicating externally to the customers of the service, end-users and other non-HUIT parties
• Maintaining a record of events throughout a Major Incident
• Confirming that resolution of a Major Incident is in place
• Notifying HUIT staff and any external parties of the resolution after confirmation

Process Activities
HUIT maintains detailed descriptions of the following activities in separate documents. They are listed
below in this document for high-level reference and overview.
Major Incident Identification

• Major Incidents can be initiated by customers, reports from users, observations, monitoring, Event
Management, and/or Change Management.
Initial Communication and Escalation

• As soon as HUIT staff members have identified a suspected Major Incident, they must escalate it
immediately to the HUIT Incident Commander.
• The Incident Commander will declare the event a Major Incident and set its priority.
• The Incident Commander will escalate it to the appropriate technical groups and service owners.
• After declaration of a Major Incident, the Incident Communicator will email the Service Desk with
appropriate information and place a service alert on the HUIT website.
Incident Coordination
• The Incident Commander will involve and consult with all necessary parties to resolve the incident
as quickly as possible.
• The Incident Commander will facilitate conference bridges to ensure that information is
disseminated in a timely manner, that time spent on the bridge is focused and that troubleshooting
can continue.
• The Incident Commander will escalate the incident to additional resources, including hierarchical
escalations as necessary.
Conference Bridge
• Once notified of a Major Incident, the Incident Commander will use a conference bridge that
includes all affected groups to maintain communication between the technical resources and the
service owner(s).
• The Incident Commander will determine the appropriate schedule for calling a conference bridge
and its duration after the initial assessment.
External Communication
• Throughout the Incident, HUIT will use its website as the primary location for information updates.
• HUIT will distribute Incident notification(s) to external customers, add an outgoing message to the
Service Desk ACD system (as necessary), and send a tweet whose content will also appear on HUIT's
Facebook page and in Harvard’s Yammer community.

Internal Communication
• The Incident Communicator will create a Major Incident ticket to be available in Remedy.
• The Incident Communicator will send a notification containing internal details of the Major Incident.
• The Incident Communicator will notify the Operational Managing Directors, as necessary.
Investigation
• HUIT will investigate continuously throughout a Major Incident and coordinate updates with
vendors, developers, and end-users.
Resolution
• Service Owners have final sign-off authority on the resolution of a Major Incident and ensure
end-user notification.
Incident Documentation
• The Incident Communicator will document the initial assessment of the incident's root cause (if
known), create a timeline, and establish the steps taken for investigation and resolution.
• The Service Owner(s) and Technical Line Manager(s) will forward any notes or timelines that they
have maintained throughout the incident to the Incident Commander.
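The documentation steps above amount to keeping one record per incident: an initial root-cause assessment (if known) plus a timeline that absorbs notes forwarded by other roles. The sketch below shows one possible shape for such a record. The class and field names are assumptions made for illustration, and it does not represent the Remedy ticket format or any HUIT system.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class TimelineEntry:
    """One timestamped note; `source` is the role that recorded it."""
    at: datetime
    source: str
    note: str


@dataclass
class IncidentRecord:
    """Hypothetical container for Major Incident documentation."""
    summary: str
    root_cause: Optional[str] = None  # initial assessment, if known
    entries: List[TimelineEntry] = field(default_factory=list)

    def log(self, at: datetime, source: str, note: str) -> None:
        """Record a single event as it happens."""
        self.entries.append(TimelineEntry(at, source, note))

    def merge(self, forwarded: Iterable[TimelineEntry]) -> None:
        """Fold in notes forwarded by Service Owners or Technical Line Managers."""
        self.entries.extend(forwarded)

    def timeline(self) -> List[TimelineEntry]:
        """Return all entries in chronological order."""
        return sorted(self.entries, key=lambda entry: entry.at)


# Illustrative data only.
record = IncidentRecord(summary="Example outage")
record.log(datetime(2018, 6, 28, 9, 30), "Incident Communicator", "Service alert posted")
record.merge([TimelineEntry(datetime(2018, 6, 28, 9, 5), "Technical Line Manager", "Alert received")])
for entry in record.timeline():
    print(entry.at.isoformat(), entry.source, entry.note)
```

Sorting at read time lets forwarded notes arrive in any order and still produce a single chronological timeline for the Major Incident Report.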

Appendix A: Process Flowchart for a Major Incident
Major Incident Process (flowchart dated 8/29/2012)
The flowchart itself is not reproduced here. The lanes and step labels that can be recovered from it are:
• Service Desk: Phone or Emails; Is this a Major Incident? (No: Assume Normal Process); Update ACD;
Escalate to ITSM (6-2831); Join Conference Bridge
• Operations: Users or Monitoring; Is this a Major Incident? (No: Assume Normal Process)
• Service Owner / Product Manager: Customer; Is this a Major Incident? (No: Assume Normal Process);
Communicate to External Customers / Users; Provide Business Impact Details / Log for Incident
Report; Confirm Incident Resolution; Communicate Externally
• ITSM: Assume Incident Commander Role; Notify SD; Notify Service Owner / Product Manager;
Escalate to Appropriate Technical Resources; Resolve Incident; Communicate Internally; Open
Incident Report; End
• Technical Line Manager: Escalate to ITSM (6-2831); Notify Huit-inf-alerts; Provide Update to ITSM;
Coordinate add’l tech resources; Provide Technical Details / Log for Incident Report
• Technical Resources: Monitoring; Is this a Major Incident? (No: Assume Normal Process); Escalate to
Line Manager; Investigate and Restore
Legend: Incident Ownership Path; Coordination and Communication Tasks; Technical Resolution Tasks;
Ownership Tasks

Appendix B: RACI Matrix
A = Accountable, R = Responsible, C = Consulted, I = Informed
Incident Communicator
Technical Line Manager

Incident Commander
Technical Resource
SOC Operations
Service Owner
Service Desk
Activity
Incident Identification A R R R R R
Initial Communications A,R R R C C
Escalation A,R R R R R R
Incident Coordination A,R C C
Conference Bridge A,R R I C R
External Communication R R I I C A,R
Internal Communication C,I R A C,I
Investigation I I I R C,I A
Resolution A,R R R R
Incident Documentation A,R R C C C C C

Appendix C: Critical Services
1. Central Networking Services
a. DNS, DHCP, Infoblox
b. Core / Data Center Routers
c. Core / Data Center Firewalls
d. Load Balancer
2. Data Center
a. Facilities / Power
b. Shared Storage
c. Virtualization
3. E-mail
4. PIN / LDAP
5. University website
6. College website
7. Phone System / Voicemail / i3
8. PeopleSoft
9. Oracle Financials
10. HarvIE
11. CAADS
12. iSites / Canvas

Appendix D: Major Incident Process Timeframes (Estimated)
T0
• Major Incident Identified
• Major Incident Escalated to Incident Commander
• Service Desk Informed
• Service Owner Informed
• Technical Resources Informed
• Call Bridge Opened
T+30
• Initial Communications
• Service Desk Updates ACD System
• Service Desk Logs Remedy Ticket
• Incident Commander Sends HUIT Alert
• Incident Commander Updates Website
• Service Owner or Incident Commander Sends External Notification
T+45
• First Update
• Initial Diagnosis?
• Estimated Time to Resolution?
• Additional Communications Need to be Sent?
• Agree upon Update Times and Intervals (e.g., every 30 minutes)
T+Interval
• Regular Updates
• Update on Progress?
• Additional Resources?
• Updated Communications?
Resolution
• Service Owner Confirms Service is Restored to Acceptable Levels
• Incident Commander Notifies HUIT Alert
• Incident Commander Updates Website
• Service Owner Sends External Communication
• Incident Commander Resolves Major Incident
• Incident Commander Begins Incident Report
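The estimated timeframes can be turned into clock times once T0 is known. The sketch below assumes that T+30 and T+45 are minutes after T0 and that regular updates begin one agreed interval after the first update. Both are readings of this appendix, not stated rules, and the function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def checkpoints(
    t0: datetime,
    update_interval_minutes: int = 30,  # "e.g., every 30 minutes"
    regular_updates: int = 3,           # how many regular updates to list
) -> List[Tuple[str, datetime]]:
    """Return (label, due time) pairs for the estimated checkpoints."""
    first_update = t0 + timedelta(minutes=45)
    plan = [
        ("T0: identification, escalation, call bridge opened", t0),
        ("T+30: initial communications", t0 + timedelta(minutes=30)),
        ("T+45: first update", first_update),
    ]
    step = timedelta(minutes=update_interval_minutes)
    for n in range(1, regular_updates + 1):
        plan.append((f"T+Interval: regular update {n}", first_update + n * step))
    return plan


# Illustrative start time only.
for label, due in checkpoints(datetime(2018, 6, 28, 9, 0)):
    print(due.strftime("%H:%M"), label)
```

Resolution has no fixed offset, so it is left out of the computed plan.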

Glossary
Core Service—Any HUIT-provided service that serves multiple customer groups and end-users, and is a
centralized service. See non-core service.
Critical Service—Any service whose failure or degradation creates an immediate and large-scale impact.
See Appendix C.
Incident Commander—The Incident Commander is responsible for the lifecycle of the Major Incident,
including coordination, documentation and communication and is its owner.
Major Incident—A Major Incident occurs when a core production service is interrupted or degraded,
resulting in a noticeable disruption of the customers’ ability to carry out University teaching,
learning, research and administration.
Non-Core Service—Any HUIT service that is hosted or provided to one specific customer or group of
users for a non-centralized purpose.
Service Owner—In the context of the Major Incident process, the service owner is a HUIT staff member
who has a comprehensive view of the service including but not limited to customer and user
relationships, a broad understanding of the components required to deliver that service, and the
expectations for the quality set for that service.
Utility—The functionality offered by a service to meet a particular need. Utility can be summarized as
‘what a service does’, and can be used to determine whether a service is able to meet its
required outcomes or is ‘fit for purpose’. The business value of an IT service is created by a
combination of utility and warranty.
Warranty – Assurance that a product or service will meet agreed requirements. This may be a formal
agreement such as a service level agreement or contract, or it may be implied through ad-hoc
messages or agreements. Warranty refers to the ability of a service to be available when
needed, to provide the required capacity, and to provide the required reliability in terms of
continuity and security. Warranty can be summarized as 'how the service is delivered', and can
be used to determine whether a service is 'fit for use'. The business value of an IT service is
created by the combination of utility and warranty. See also service validation and testing.
