4 Ways to Cut Problem Resolution Time
November 14, 2010
In this section we present four best practices that address the inefficiencies in the problem resolution/critical maintenance process, allowing you to substantially reduce Mean Time to Recovery (MTTR) in your IT organization. Some of these guidelines are easy to implement; others are more complex and require planning and supporting technology.
In the next section, we discuss Ayehu eyeShare, an IT solution that allows you to fully implement all four of these best practices in one simple, off-the-shelf product.
#1: Turn Expert Knowledge into Diagnosis Rules
Instead of relying on experts to diagnose problems in real time, which turns these experts into a bottleneck, you can make the expert knowledge available in real time. Specify rules that clearly answer these questions:
- Which combination of symptoms indicates that the problem occurred?
- How can you validate that the problem really occurred?
- Which solution is appropriate for this type of problem?
These rules must be readily available, so that as soon as a problem occurs, it is immediately clear what the problem is and what type of solution is appropriate.

For example: The problem is a high risk of server downtime due to a RAID disk malfunction. The diagnosis rule: if event logs for the past 6 hours show symptoms of disk failure in 2 out of 3 disks in the array, the problem is about to occur. The solution: repair or replace the malfunctioning disks.
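The RAID rule above can be written down as a small executable check. The following is a minimal sketch, not a definitive implementation: the `DiskEvent` record, the `raid_failure_imminent` function, and the sample events are all hypothetical, and it assumes the event logs have already been parsed into one record per disk event.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DiskEvent:
    """One parsed event-log entry (hypothetical structure)."""
    disk_id: str
    timestamp: datetime
    is_failure_symptom: bool


def raid_failure_imminent(events, now, window_hours=6, min_failing_disks=2):
    """Return True if enough distinct disks show failure symptoms in the window."""
    cutoff = now - timedelta(hours=window_hours)
    failing_disks = {
        e.disk_id
        for e in events
        if e.is_failure_symptom and e.timestamp >= cutoff
    }
    return len(failing_disks) >= min_failing_disks


# Sample data: two of the three disks report symptoms within the last 6 hours.
now = datetime(2010, 11, 14, 12, 0)
events = [
    DiskEvent("disk0", now - timedelta(hours=1), True),
    DiskEvent("disk1", now - timedelta(hours=5), True),
    DiskEvent("disk2", now - timedelta(hours=2), False),
]

if raid_failure_imminent(events, now):
    print("Diagnosis: disk failure is imminent; repair or replace the malfunctioning disks")
```

Counting distinct disks rather than raw events keeps one noisy disk from triggering the rule on its own.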
#2: Define Problem Ownership and Escalation in Advance
When a problem occurs, operations staff spend time seeking an owner for the problem. To save this time and speed problem resolution, you should define the following in advance:
- A shift schedule, specifying which staff members are on call at any given time of day, and what types of problems each of them can handle.
- Escalation paths, for cases in which staff are unavailable or unable to take responsibility for a problem.
The next step is to find a way to notify staff according to the schedule as soon as a problem occurs, so that the problem gets an owner immediately.
For example: The website is down at 7pm. The person who receives the alert checks the shift schedule and sees that the engineers on call are John and Andrew. The schedule specifies that only John handles website downtime problems. An SMS is sent to John, requesting that he take ownership of the problem. If John does not respond, the problem is escalated: an SMS is sent to Sarah, John's boss, requesting that she take ownership.
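The schedule lookup and escalation in this example can be sketched in a few lines. This is an illustration under stated assumptions: the `SHIFTS` and `ESCALATION` tables, the problem-type labels, and the `send_sms` and `acknowledged` callbacks are hypothetical stand-ins for a real shift schedule and messaging gateway.

```python
# Hypothetical shift schedule: (start hour, end hour) -> engineer -> problem types handled.
SHIFTS = {
    (8, 18): {"Andrew": {"website_down", "database"}},
    (18, 24): {"John": {"website_down"}, "Andrew": {"database"}},
}

# Hypothetical escalation path: who is contacted when a person does not respond.
ESCALATION = {"John": "Sarah"}


def find_owner(problem_type, hour):
    """Return the on-call engineer who handles this problem type, or None."""
    for (start, end), engineers in SHIFTS.items():
        if start <= hour < end:
            for name, handles in engineers.items():
                if problem_type in handles:
                    return name
    return None


def assign_owner(problem_type, hour, send_sms, acknowledged):
    """Notify the owner, then escalate until someone takes ownership."""
    person = find_owner(problem_type, hour)
    while person is not None:
        send_sms(person, f"Please take ownership of: {problem_type}")
        if acknowledged(person):
            return person
        person = ESCALATION.get(person)  # None ends the chain
    return None


# Usage: the website goes down at 7pm and John does not respond.
sent = []
owner = assign_owner(
    "website_down",
    hour=19,
    send_sms=lambda person, message: sent.append(person),
    acknowledged=lambda person: person == "Sarah",
)
print(owner, sent)
```

Passing the SMS sender and the acknowledgement check as callbacks keeps the escalation logic independent of any particular messaging system.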
#3: Document the Full Resolution Process
To streamline the problem solving work itself, and make sure IT staff are able to capitalize on previous knowledge and experience, document and integrate the full resolution process for each problem:
- Clearly spell out all the operations needed to resolve the problem, from start to finish.
- Document decision junctions during the process, what the decision should be based on, and resolution steps for each possible decision.
- Test the documented process by watching an inexperienced operator use it to solve the problem.
- Make sure the documentation is available at the time and place the problem occurs.
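One way to keep such documentation testable is to store it as structured data, so that decision junctions and their branches are explicit. The sketch below is only one possible layout: the `RUNBOOK` contents, the step names, and the `broken_links` and `walk` helpers are hypothetical.

```python
# Hypothetical runbook: each step has an instruction and, at decision
# junctions, a mapping from each possible answer to the next step.
RUNBOOK = {
    "check_server": {
        "do": "Confirm that the server responds",
        "decide": {"up": "check_service", "down": "report_outage"},
    },
    "check_service": {"do": "Check the service status and report it"},
    "report_outage": {"do": "Report that the server is down"},
}


def broken_links(runbook):
    """List (step, answer, target) where a decision points to an undocumented step."""
    return [
        (name, answer, target)
        for name, step in runbook.items()
        for answer, target in step.get("decide", {}).items()
        if target not in runbook
    ]


def walk(runbook, start, answers):
    """Return the instructions an operator follows, given the answer at each junction."""
    path, name = [], start
    while name is not None:
        step = runbook[name]
        path.append(step["do"])
        branches = step.get("decide")
        name = branches[answers[name]] if branches else None
    return path


print(broken_links(RUNBOOK))
for instruction in walk(RUNBOOK, "check_server", {"check_server": "down"}):
    print("-", instruction)
```

A check like `broken_links` catches gaps in the documentation automatically, but it does not replace the operator test described in the list above.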

#4: Automate Problem Resolution Steps
Automation addresses two problems in the problem resolution process—slow manual execution of problem resolution tasks, and human errors. Any process you automate will run faster and will be less error-prone. You should strive to:
- Automate any step in the resolution process that is predetermined and does not require manual intervention or human judgment. In Ayehu’s experience, over 80% of problem resolution steps can be automated.
- Automate simple forks in the process, such as “if the server is up, do X, if not, do Y”, unless there is a complex decision that a human being really needs to make.
- Use scripts to automate simple tasks that do not require complex interactions between machines.
- Investigate automation solutions to automate complex tasks—today’s process automation technology can integrate with numerous systems and perform broad, cross-cutting operations.
- Give human operators full control—automatic processes should stop and wait for human input when they reach an important decision junction. Human operators should be able to easily oversee and abort any automatic process.

For example: Automate the Microsoft IIS service recovery process. An automatic process can be designed that starts by pinging the web server to confirm that it is up. If the server is up and a Telnet connection to port 80 succeeds, the process checks the status of the IIS services and returns the status to the relevant IT staff. If the server is down or port 80 is not responding, the process reports this. The automatic process could then ask whether to restart the server, and do so automatically upon receiving a response.
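The flow of this example might look like the following sketch. It is not how eyeShare implements it. The port check uses Python's standard `socket` module in place of Telnet; the `ping`, `iis_status`, `confirm_restart`, and `restart` callables are hypothetical hooks that you would supply for your own environment.

```python
import socket


def port_open(host, port=80, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def recover_iis(host, ping, iis_status, confirm_restart, restart, check_port=port_open):
    """Run the recovery flow and return the report sent to IT staff.

    ping, iis_status, confirm_restart and restart are supplied by the caller
    (hypothetical hooks); check_port defaults to a plain TCP connection test.
    """
    if ping(host) and check_port(host):
        # Server is reachable: report the state of the IIS services.
        return f"{host}: IIS status is {iis_status(host)}"

    report = f"{host}: server is down or port 80 is not responding"
    # Decision junction: wait for a human answer before restarting.
    if confirm_restart(report):
        restart(host)
        report += "; restart requested"
    return report


# Usage with stand-in hooks: the server does not answer and the operator approves a restart.
restarted = []
message = recover_iis(
    "web01",
    ping=lambda host: False,
    iis_status=lambda host: "unknown",
    confirm_restart=lambda report: True,
    restart=restarted.append,
    check_port=lambda host: True,
)
print(message)
```

The restart runs only after `confirm_restart` returns a positive answer, which keeps the human operator in control at the one decision junction in the process.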
Ayehu eyeShare is the first off-the-shelf product that manages and automates IT problem resolution and critical maintenance tasks. It incorporates all four of the best practices described above, allowing you to speed up problem diagnosis, immediately locate an owner for a problem, guide problem resolution using a structured workflow, and automate routine tasks…
