
Build your knowledge base

Rapid Problem Resolution (RPR) explained

Introduction
87% of IT problems reported to your Service Desk get fixed within hours, if not minutes. A further 11% are resolved within days by 2nd and 3rd line support, perhaps with help from supplier technical staff. The final 2% are the toughest and represent chronic problems that are fixed by trial and error, or simply remain unresolved. Unfortunately, trial and error is very slow, sometimes expensive and often disruptive.

The good news is that we are at the dawn of a new era in problem resolution. The wide adoption of ITIL has provided a framework for the management of incidents and problems, and this in turn is driving an interest in problem resolution methods, particularly those based on Root Cause Analysis (RCA). One such method is Advance7's Rapid Problem Resolution (RPR) Technique.

In this paper we look at the need for RPR, when we should use it, how it works, the skills we need to practise it, and its challenges and limitations. We finish the paper with a very recent case study that helps demonstrate the benefits.

The Need for RPR


This question actually comes in two parts: why do we need a problem resolution method, and why RPR? A problem reported to the Service Desk will go through four phases:

Phase 1 - Fixed by Help Desk / 1st Line staff
- Simple problems and user errors
- Fixed through knowledge and procedures
- Fixed within 16 hours worked

Phase 2 - Dealt with by 2nd / 3rd Line Support
- Issues caused by faults, overload or misconfiguration
- Fixed through advanced knowledge, tools and knowledge-base access
- Fixed within a further 24 hours worked

Phase 3 - Dealt with by 3rd line support with supplier product specialist
- Complex problems, often performance related and/or intermittent
- Fixed through pattern method [1], detailed product knowledge and advanced tools
- May use process of elimination
- Fixed within a further 24 working hours

Phase 4 - More and more people get involved
- Complex problems that have dropped through Phase 3 process with root cause unknown
- Attempts are made to fix through swap-out, holistic method, gut feel, random upgrades, etc.
- May be fixed within 10 days to 2+ years
- May never be fixed

We can put this in an ITIL context. Incident Management covers all of Phase 1 and much of Phase 2, in that most of these issues are dealt with through any action that achieves rapid service recovery. Where the IT Team identifies repeated occurrences of the same problem, the issue moves into Problem Management. Such problems are typical of those that drop through Phase 2, are often fixed in Phase 3 and sometimes end up in Phase 4.

[1] Pattern Method is at the core of a number of problem resolution methods and involves linking the first indication of a problem, a change in frequency or the pattern of problems to a common cause, e.g. only Windows XP users experience the problem, therefore the problem must be related to Windows XP.

Executive Whitepaper 2008 Advance Seven Limited Advance7 Defining IT stability, control & performance www.Advance7.com +44 (0) 1371 876805

In Phase 4, the combination of procedures, skills and tools available in a typical company has been unable to determine the cause of the problem. At this point the IT Team is faced with two choices:
1. Do more of the same, even though this has not produced the solution to date
2. Resort to trial and error, which could involve testing a range of changes, upgrading infrastructure, or replacing software

There is a third way: method-based problem resolution, and there are many methods available. Some have sprung from the IT industry, but many of the front-runners are actually adaptations of business problem resolution methods. There are some common shortcomings:
- Due to the soft nature of many business problems, methods with this lineage are not designed to take advantage of the logic and tools available to us in the IT industry
- Many of the methods are actually just processes with no supporting IT techniques, making it difficult for IT people to run the process
- Many methods require that we already know the root cause as one of a list of possible causes that are then tested
- To avoid disruption, some methods force the IT Team to attempt to recreate the problem in a lab environment, which is time consuming, can be expensive and rarely works
- Many methods rely on statistical analysis, which often fails with Phase 4 problems due to their intermittent nature and transient causes
- Some methods rely on trial and error
- Although many methods claim to be based on root cause analysis, most only achieve this with hindsight

RPR was designed from the outset to solve IT problems and is heavily influenced by software engineering techniques, primarily IBM's PSI/PD [2]. From this starting point RPR avoids the shortcomings suffered by other methods because:
- It makes full use of the IT tools that are available in every business
- It is a fully mature method with a core 5-step process and supporting IT techniques
- RPR requires no pre-conceived idea of the cause of the problem; in fact such thoughts are positively discouraged
- The method uses non-disruptive techniques and so there is minimal business impact
- RPR doesn't require recreation in a lab environment, or even testing outside of normal working hours
- The method is based around the collection of definitive diagnostic data at the exact point of a problem and so precisely identifies the cause, transient or not
- RPR's primary objective is to identify the root cause
- RPR enhances the skills of the IT team and support companies

[2] In the late 1970s and early 1980s IBM taught its software engineers a two-stage process of problem diagnosis called Problem Source Identification / Problem Determination (PSI/PD).

RPR Process
RPR starts with the premise that first we must identify the root cause, and only then can we define a fix. The method differs from many Root Cause Analysis methods that start from the fix and work backwards to the root cause. RPR begins by determining the root cause and then works forward to a solution. The root cause is determined through a five-step process:
1. Gain an accurate understanding of the problem at some level
2. Choose one specific symptom
3. Create an Action Plan to capture definitive diagnostic data for one or more identifiable instances of the chosen symptom
4. Execute the plan whilst controlling the environment
5. Analyse the results and either:
   - identify the root cause and determine a fix, or
   - define a new Action Plan and execute it

It is likely that we will need to iterate around the last three steps, revising our Action Plan and re-analysing the results. At first sight the process might look ridiculously simplistic, but the devil is in the detail.

Understanding the Problem


How many times have you heard the complaint, "The network is slow"? What does that term actually mean? What is slow, and how slow is slow? An accurate understanding of the problem might be: "When I click on the Appointments button in our calendaring system I usually see the diary for the day within 5 seconds, but intermittently it takes more than 15 seconds." Our understanding must be detailed enough that we could attempt to replicate the problem in a test environment.

A Single Symptom
RPR dictates that we can only diagnose one symptom at a time, even if we think many symptoms have one common cause. This can be a tougher proposition than you might think. The RPR Practitioner can come under considerable pressure to deal with all of the issues, particularly "as they are all linked". RPR warns against trying to establish patterns, i.e. links between differing symptoms.

There are many reasons for avoiding using a pattern method to diagnose problems, and here's an illustration of one. The Service Desk reported that users suffered a "3 o'clock slowdown". At around three o'clock every day users said that the network was slow. Three specific symptoms were identified: Outlook Inbox items were slow to open, Word documents sometimes took 30 seconds to save, and Citrix users suffered intermittent type-ahead delays. It turned out that none of these problems had a common cause. Starting from a presumption that all were linked would have led to failure to find the root cause of any of them.

If multiple symptoms have the same cause, then by fixing one we will fix them all.


Definitive Diagnostics
Shortcomings with Statistics

Generating definitive diagnostics is a very big subject that alone could fill several whitepapers. The most important point of this step is that we must be able to gather diagnostics that can be directly correlated with the user's experience of the problem. RPR rejects the use of statistical data that cannot be directly matched to the moment that the problem occurred. This usually comes down to wiggly graphs like this:
[Figure: CPU Utilisation. Utilisation (0 to 100) plotted against Time of Day.]

If the CPU utilisation of a server was constantly above 90% it would make sense to solve this, although even then such load does not guarantee that we have discovered the cause of the performance problem in a complex end-to-end system. More often than not we are actually faced with a graph like that above, the interpretation of which is very subjective. To remove the subjectivity we might work on a designated overload threshold of 50% for the 95th percentile figure, but even that ignores the issue of transient problems, which can get hidden by the averaging that occurs over the sample period. A possible solution is to use a more granular measurement based on one-second samples, but even then we must be able to match the start and end time of the problem to the correct points on the graph. Without correlation there is a danger that we might spend money and, more importantly, time on upgrading the server only to find that we still have the problem.

Correlation

RPR proposes that we gather diagnostics that can be directly correlated with one or more user-experienced problems. So, returning to our earlier scenario of the slow response time to the Appointments button, we might decide to set up:
- A network trace for the user's PC
- A network trace showing everything going in and out of the application server
- A SQL trace for all database calls
- A perfmon study of CPU, memory and disk I/O on the application and SQL server

We then wait for the problem and when it occurs we immediately mark the diagnostic data. Many of the Supporting Techniques of RPR are designed to address the issue of correlation of the user experience of a problem to the corresponding diagnostic events.
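Returning to the shortcomings of statistics described above, the following minimal Python sketch illustrates how a short burst of load can disappear from both an averaged figure and a 95th percentile figure. The sample values, the 60-second averaging window and the function names are all invented for illustration and are not part of RPR.

```python
import math

def average_over_window(samples, window):
    """Average consecutive groups of `window` samples, as a coarse monitor might."""
    averages = []
    for start in range(0, len(samples), window):
        group = samples[start:start + window]
        averages.append(sum(group) / len(group))
    return averages

def percentile(samples, pct):
    """Nearest-rank percentile: the value at rank ceil(pct% of N) in sorted order."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical data: 120 one-second CPU readings at 20%, with a
# 5-second burst at 100% starting at second 70.
one_second = [20.0] * 120
for second in range(70, 75):
    one_second[second] = 100.0

minute_averages = average_over_window(one_second, 60)
print("60-second averages:", minute_averages)
print("Peak one-second sample:", max(one_second))
print("95th percentile of one-second samples:", percentile(one_second, 95))
```

In this made-up data the burst is clearly visible in the one-second samples, yet the minute averages stay low and the 95th percentile does not move, because the burst covers fewer than 5% of the samples. Even the one-second view only helps if the burst can be matched to the moment the user saw the problem.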

Markers

Markers are a key RPR technique to achieve correlation. A marker is simply an entry in the diagnostic data that is generated under our control, and is unique and easily identifiable.


Here are some examples of markers:
- A ping with a payload length of 101 bytes: ping -n 1 -l 101 LONSERVER01
  - Generates an identifiable entry in network trace data
  - Correlate with perfmon by adding the ICMP / Received Echo / sec counter
- A GET for a non-existent URL: http://intranet/marker101.asp
  - Generates a trace entry and web log entry so that we can match server and analyser time
- A dir command for a file that doesn't exist
  - Generates a marker in a filemon and / or procmon trace
- Remote execution of Performance Monitor to show the Thread Count for a process every second
  - When the process dies we see the event in the network trace
- Use a SQL client to generate an identifiable query in a database: select 'marker101';
  - Generates a marker in a network trace and SQL Profiler trace
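As an illustration of how a ping marker might be used during analysis, here is a minimal Python sketch. It assumes the network trace has already been exported into simple records; the `TraceRecord` layout, its field names and the sample data are hypothetical and do not correspond to any particular analyser's export format.

```python
from dataclasses import dataclass

@dataclass
class TraceRecord:
    """One decoded packet from an exported trace (hypothetical, simplified layout)."""
    timestamp: float      # seconds since the start of the capture
    protocol: str         # e.g. "ICMP" or "TCP"
    payload_length: int   # payload size in bytes
    summary: str

def find_markers(records, payload_length=101):
    """Return the ICMP records whose payload length matches the marker size."""
    return [r for r in records
            if r.protocol == "ICMP" and r.payload_length == payload_length]

def events_before(records, marker, lookback_seconds):
    """Return the records in the window leading up to the marker."""
    start = marker.timestamp - lookback_seconds
    return [r for r in records if start <= r.timestamp < marker.timestamp]

# Invented sample trace: the 101-byte ping was sent just after a slow response.
trace = [
    TraceRecord(10.0, "TCP", 512, "request to application server"),
    TraceRecord(12.5, "ICMP", 32, "ordinary ping, not a marker"),
    TraceRecord(24.8, "TCP", 1460, "response from application server"),
    TraceRecord(25.1, "ICMP", 101, "marker ping"),
    TraceRecord(30.0, "TCP", 64, "later traffic"),
]

markers = find_markers(trace)
window = events_before(trace, markers[0], lookback_seconds=20)
for record in window:
    print(record.timestamp, record.protocol, record.summary)
```

The unusual payload length is what makes the marker easy to pick out from ordinary pings, and the lookback window then narrows the trace to the events surrounding the user's experience of the problem.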

Execute the Plan


A major consideration in the execution of the Action Plan is CONTROL. It is important that we carefully follow the steps in our plan in a very controlled manner. This too can be challenging, particularly if the problem is highly intermittent. Here are typical things that can go wrong:
- Delayed Notification - The user agrees to call us the moment the problem has occurred. When he does call we find that he has done a few more things since the problem
- Procedure Failure - The user agreed to send a marker immediately after the application gave a slow response, but now can't remember if he sent it during the problem, immediately after, or some time later
- Freelancing - The problem is being tackled by a large team and, despite having agreed to the plan, some members of the team think they know what the problem is and so make some changes. The changes invalidate our diagnostic data
- Pragmatism - IT management decide that rather than wait for the problem to occur again they will get the server upgraded

In all these cases, and many more like them we have to set the clock back to zero. We cannot safely assume that anything we have already discovered remains true. Of course, every IT department comes under great pressure to fix a problem as quickly as possible. Sometimes a pragmatic approach is needed and we are not able to follow the RPR method. We just need to make sure that everyone involved accepts that RPR will not work if ad-hoc changes are made.

Analyse the Results


The RPR Supporting Techniques provide many techniques for detailed analysis of diagnostic data. Here are just a few tips:
- Identify the markers in the diagnostics, and hence the boundaries of the problem
- Filter the diagnostic data to get a high level view of the events leading up to the problem
- When studying a slow response time from, for example, an application server, try to account for all of the time spent on network interactions; anything left over must have occurred inside the server
- When studying a failure or a hang, compare diagnostics for a working scenario with those for a failure and focus on the point where things first differ
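The last two tips can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the network interactions are given as non-overlapping (start, end) times in seconds, and the working and failing diagnostics have been reduced to ordered lists of event descriptions. The function names and sample values are invented.

```python
def server_time(total_response, network_intervals):
    """Time left over once all network interactions are accounted for.

    network_intervals is a list of (start, end) times in seconds and is
    assumed not to contain overlapping intervals.
    """
    network_time = sum(end - start for start, end in network_intervals)
    return total_response - network_time

def first_difference(working, failing):
    """Index of the first event where the two sequences differ, or None if identical."""
    for index, (good, bad) in enumerate(zip(working, failing)):
        if good != bad:
            return index
    if len(working) != len(failing):
        # One sequence is a prefix of the other; they differ where the shorter ends.
        return min(len(working), len(failing))
    return None

# Invented example: a 15-second response with 1.5 seconds of network activity.
print(server_time(15.0, [(0.0, 0.5), (14.0, 15.0)]))

working_run = ["open file", "read config", "connect db", "query"]
failing_run = ["open file", "read config", "connect db", "timeout"]
print(first_difference(working_run, failing_run))
```

With real traces the hard part is producing comparable event lists in the first place, which is where filtering and markers come in.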

It is useful to consider the type of output that we would like from the analysis. Continuing with the Appointments scenario, once we have successfully executed the plan, careful analysis of the diagnostics might lead us to the conclusions that:
- At the time of the user-reported problem we can see from network trace data that the database takes 13 seconds to execute a request
- The request is a stored procedure call sp_GetCustomerDetail
- At this time SQL server CPU load is 65%, max disk I/O queue length is 2 and there is 1.2 GB of memory in use
- Analysis of a matching SQL Profiler trace taken at the time shows that an additional index is required on table custinfo

RPR Supporting Techniques


Overview
RPR presents a very wide range of techniques to help us follow the RPR Process. In summary these are:
- Initial Data Gathering: getting up to speed with the problem
- Initiation Workshop: a structured brainstorming session
- Accurate Data Collection: covering the practicalities of generating definitive diagnostics
- Diagnostic Data Analysis: practical tips to help the analyst convert all that data into information
- Analysing Intermittent Failures: particular steps to aid the detailed analysis
- Problem Management: how to best reflect status and progress back to the business
- Soft Skills: how to deal with the politics and the pressure, and how to enhance the skills of others

The techniques presented are generic, i.e. they are applicable to any technology. Although some tools and technologies are cited as an illustration of a technique, RPR teaches nothing about specific tools or technologies. One of the Supporting Techniques is Whiteboard Analysis, and we'll take a brief look at this by way of example.

Whiteboard Analysis
We use whiteboard analysis to pull together the strands of a brainstorming session, the first of which is the Initiation Workshop. The technique is quite simple. Write five headings on a whiteboard or flipchart:
- Symptoms
- Boundaries
- Other Observations
- Possible Causes
- Action Plan

Under Symptoms write an accurate description of one or more symptoms to a level that would enable us to attempt to recreate the problem. Prioritise the symptoms and choose the one we intend to tackle; remember, we can only tackle one.

Under Boundaries note when, where and under what circumstances the problem occurs. These boundaries are used to determine when, where and how to conduct the investigation. They are never used to reach a conclusion regarding the cause of a problem. The boundaries simply guide you to the best time and place to do the diagnosis. For example, if the problem has only ever been noted in the Munich office at 3 pm on a Friday, there is little point trying to collect diagnostics in London on Monday morning. That doesn't mean that the problem is linked to the Munich office or the time of day. For all we know the problem might occur in London, but the users just don't report it, or their pattern of application use is different.

Other Observations are things that may or may not be related to the problem, and are worth noting for later consideration but must not figure in our diagnostic efforts. Put simply, they must be ignored at this stage.

From the Symptoms and Boundaries you can identify the components of the end-to-end system. Each of these is a Possible Cause. Don't be too granular in determining the possible causes; choose large chunks of infrastructure. The list might be:
- User PC
- Munich LAN
- WAN
- Data centre LAN
- Application Server
- Database Server

Now devise an Action Plan to prove that a particular Possible Cause is, or is not, the root cause of the problem. If it's possible to maintain good control of the diagnostic data capture, diagnosis can be made quicker by gathering data from many places along the end-to-end path. For the first pass make sure that the plan includes capture of data adjacent to the user's PC, since it's here that the user experience can be most easily correlated with diagnostic events.

Appendix B - Whiteboard Exercise Briefing Sheet gives a sample scenario for practice.

RPR Challenges & Limitations


It is worth noting the limitations and demands of RPR, some of which can be difficult to overcome if all involved do not understand or believe in the method:
- Changes must not be made immediately prior to and whilst investigating the problem. Any change that affects the system being studied will invalidate the diagnostic data collection
- RPR requires the users to suffer the problem at least one more time. The method is totally evidence-based and that means collecting the diagnostic data when the problem occurs
- RPR can only deal with one symptom at a time. This is often controversial as there is an incorrect belief that this is a slower method than taking an all-encompassing approach to determine a common cause
- IT Management must allow the Problem Analyst dedicated time for most of the duration of the problem analysis, certainly during the data collection phase
- The Problem Analyst must have a good knowledge of diagnostic tools, when to use them and their limitations
- The Problem Analyst must have strong analytical skills (I guess this goes with the territory)

Initially selling the approach as the most effective method of resolving Phase 4 problems can be tough, but, on the positive side, it becomes easier to promote the method after the first success.

When to Use RPR


RPR is not appropriate for the majority of problems that arise, certainly not for Phase 1 and 2 problems (or events that ITIL classifies as Incidents). However, it is a very effective and efficient method for the resolution of Phase 4 problems. A Phase 4 problem will have some or all of the following characteristics:
- Senior IT management involvement
- Business demands regular status updates
- Regular crisis meetings
- Help Desk calls about the problem drop off
- Business adjusts its work processes to accommodate the problem
- An ever growing group of people are getting involved
- There is no sense of direction in the problem resolution effort
- The root cause is unknown
- Repeated statements are made that "We just want to try one more thing"

The final point is an ideal trigger for a switch to RPR, as it indicates that resolution has moved into a trial and error phase, which is very slow.

Skills Needed
A broad knowledge of IT is needed to make best use of RPR. The practitioner must have a good basic knowledge of the technologies and concepts of an end-to-end system. We live in a networked world and modern systems have many components networked together. The networking slant is further influenced by the choice of tools. The natural tool of choice for analysis of end-to-end problems is the network analyser since:
- Its use is non-disruptive
- There is no requirement to install additional software on any components of the system
- It can be used to narrow a problem to a component, and hence an owner

This means that a good knowledge of networking and protocols is a significant advantage. Once we have narrowed the problem to, say, a server, we may then need to use further tools to drill into that component, and this demands further skills, largely based around knowledge of software and operating systems.

For career enhancement, becoming an RPR practitioner is ideal for senior IT support staff who don't want to move into management, but need a new challenge. A very bright junior would also be a valuable addition to a problem management team, but it's important to note that there is often a lot of politics, many business considerations and significant people management in the resolution of a Phase 4 problem, and so senior support is likely to be needed. The ideal RPR person would be a senior application developer who has a good knowledge of networking, or vice versa.

Those companies that have embraced ITIL may have dedicated Problem Managers and Problem Analysts. RPR is ideal training for both, since the RPR Process provides a framework for the Problem Manager, and the Supporting Techniques help develop effective skills in the Problem Analyst.

Problem analysis skills can be taught (as per RPR) and require development through experience. It's important to recognise that effective problem analysis requires its own set of skills. These skills are often built upon operational (BAU) experience, but the skills are different. For a Problem Analyst to fix problems quickly with RPR, he or she needs to be exposed to Phase 4 problems almost continuously. It may be difficult to keep a Problem Analyst sufficiently occupied in an organisation of fewer than, say, 8,000 users.

Closing Comments
I hope this short introduction has given some idea of the power of RPR. There is only so much that we can cover in a whitepaper. However, you can benefit today from this whitepaper by:
- Using it as a guide to set an escalation point that recognises when a problem has entered Phase 4 and requires a different approach; hearing or using the phrase "We're just going to try one more thing" is a good starting point
- Focusing on one symptom of a Phase 4 problem, which will simplify diagnosis and speed up resolution
- Avoiding pattern-based methods, which will increase the likelihood of success
- Collecting definitive diagnostics to identify the root cause, which will save time and money
- Using the Whiteboard Analysis to help you plan the diagnosis

Build your knowledge base Rapid Problem Resolution (RPR) Paul Offord 4th September 2007


Appendix A - Outlook Freeze Case Study


This was a very recent REACT [3] project and perfectly demonstrates the collaborative nature of RPR. Users complained that intermittently Outlook would freeze. The first issue was a plethora of symptoms that were believed to be due to the same cause. We asked the customer to prioritise one symptom, and settled on this problem scenario:
- The user opens an Inbox message and reads it
- He opens a Word document that resides in a Document Management System (DMS) folder visible in Outlook, the DMS being integrated with Outlook
- He makes a change to the document and closes it
- He tries to switch back to the Inbox message he was reading but just gets an empty window (all white)
- Around 75 seconds later the message window fills with the missing text

Using RPR we started by determining if the problem was inside the user's PC, or whether the network or servers caused it. To do this we traced all network data going in and out of the user PC. We then set up a small command line program to send pings of different lengths when the user hit Return. When the problem started we sent a single ping of 101 bytes. When the problem passed, and as soon as the Inbox message became visible again, we sent a second ping of 102 bytes.

Analysing the trace, we found the 102-byte ping and wound back through the trace, starting at this point. It was evident that the problem passed after the completion of the transfer of 15 MB of data from the DMS server to the user's PC. A study of the data showed a repeating pattern that looked like a list of documents. We took the data to the DMS support team, who immediately recognised it as folder summary information. Working with the DMS team, we identified it as cache refresh information. A call to the DMS vendor quickly identified the issue as an incorrect registry setting.

This scenario is very typical. Although we (the RPR practitioners) had some knowledge of the DMS, we didn't know as much as the support team, let alone the vendor.

Following the RPR method caused us to:
- Focus on a single symptom
- Collect high quality diagnostic data that could be directly correlated with a user's experience of the problem
- Convert the diagnostics into information that is meaningful to the DMS support team

This was a textbook RPR scenario and demonstrates the collaborative nature of the method. I hope it also shows that you can use the method even if you are not an expert in the particular technical subject.
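The small command line program mentioned in the case study is not reproduced in this paper. As a rough idea of what such a tool could look like, here is a minimal Python sketch; it is not the program that was actually used. It assumes a Windows PC, where ping accepts -n for the number of echo requests and -l for the payload size, and the function names are invented.

```python
import subprocess

def build_ping_command(host, payload_length):
    """Build a Windows ping command that sends one echo request of the given size."""
    return ["ping", "-n", "1", "-l", str(payload_length), host]

def run_marker_session(host, payload_lengths=(101, 102),
                       wait_for_key=input, run=subprocess.run):
    """Send one marker ping per Return key press, using each payload length in turn.

    wait_for_key and run are parameters so the logic can be exercised
    without a keyboard or a network.
    """
    sent = []
    for length in payload_lengths:
        wait_for_key(f"Press Return to send the {length}-byte marker: ")
        command = build_ping_command(host, length)
        run(command, check=False)   # a failed ping still leaves a marker in the trace
        sent.append(command)
    return sent
```

The user would start the session with a call such as run_marker_session("LONSERVER01"), press Return when the problem starts to send the 101-byte ping, and press Return again when the problem passes to send the 102-byte ping. On other operating systems the ping options differ, so the command would need adjusting.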

[3] REACT is Advance7's unique method-based problem resolution service.

Appendix B - Whiteboard Exercise Briefing Sheet


[Figure: network diagram showing the Cambridge, Southampton and Derby sites, with User PCs, the Giant Finance WAN, the Internet and the Citrix Server.]

Staff at Giant Finance use a third-party application called TopFund hosted by Acme Corp. Access is via a Virtual Private Network (VPN) established across the Internet from Giant's Southampton site to Acme's Derby Data Centre. Giant Finance staff at Cambridge say that the system is so slow that it is unusable. Here's the information as it is told to you:
- Only Giant Finance users in Cambridge suffer the problem, so it must be a Giant Finance problem
- Cambridge users don't have problems with any other systems and so it must be an Acme problem
- No other Acme users get the problem and so it must be a problem with Giant's VPN firewalls
- Giant users access other 3rd party systems via the same VPN firewalls and they don't have problems
- Users of the Citrix-based TopFund system suffer type-ahead delays [4]
- The problem only happens when more than one Cambridge user accesses TopFund so it must be a Citrix server issue
- Lots of users access the same TopFund servers and don't experience problems
- All Giant users experience slow web access but the Citrix traffic has been prioritised above web traffic

Write five headings on a whiteboard or flipchart:
- Symptom: comes from the rule "You can go no further until you understand the problem at some level". It must be accurate. It must be specific
- Boundaries: used to help you decide where and when to investigate the problem
- Other Observations: things that may or may not be related to the problem
- Possible Causes: at a high level
- Action Plan: a plan to prove or disprove the first of the possible causes

[4] Users of terminal-type services that use character echo protocols such as Citrix ICA, Windows Terminal Services and TELNET sometimes complain that they type characters but nothing appears for a few seconds, then all the characters appear at once. We call this effect type-ahead delay.

Appendix C - Whiteboard Exercise Answers


Symptoms
- Users of the Citrix-based TopFund system suffer type-ahead delays

Boundaries
- Only users in Cambridge experience the problem
- Problem only occurs when more than one user accesses the system
- No other TopFund users get the problem

Other Observations
- Cambridge users don't have problems with other systems except for slow web browsing
- VPN connections are made to other systems and they work fine

Possible Causes
- Client PC
- Cambridge LAN
- Giant private WAN
- Southampton network infrastructure
- Internet
- Derby network infrastructure
- Citrix servers

Action Plan
- Install an analyser in Cambridge to capture traffic in and out of the user PC
- Install a second analyser in Cambridge to capture traffic in and out of the WAN router
- Install an analyser in Southampton to capture traffic in and out of the WAN router
- Recreate the problem, marking the traces with a ping immediately after the problem occurs
- Analyse and match the traces to determine if the problem is in the user PC, in the Cambridge LAN, in the WAN or between the Southampton LAN and the Derby server
- Produce a revised plan if necessary

If this were a general complaint of slow performance, we might make "Slow web browsing" a symptom and then determine the priority of the two symptoms. However, in this case the issue at hand was Citrix type-ahead problems, and so the slow web browsing is relegated to an observation.
