Why Do Problems Take So Long to Resolve?
October 16, 2010

There are five inherent inefficiencies in the way most IT organizations resolve problems and perform critical maintenance tasks. Each one of these inefficiencies causes IT staff to spend more time on problem resolution, and increases Mean Time to Recovery (MTTR). Thus, each of them also represents an opportunity for increased productivity and faster recovery.

  • Non-expert diagnosis: there is usually someone in the organization who knows how to diagnose and solve a given problem. But chances are that when the problem occurs, that expert is not around, and there is usually no comprehensive process documentation explaining how to solve it. Often a non-expert has to decide which problem occurred and how to solve it. They might contact the wrong people or perform the wrong resolution steps, causing needless delay.
  • Can’t find an owner for the problem: when a problem occurs, someone needs to take responsibility and make the tough decisions. It takes precious minutes to get someone on the line, and then it often turns out they’re unable or unwilling to take ownership of the problem. The search for an owner can sometimes take hours.
  • No structured process: when somebody starts working on resolving the problem, they usually don’t have a step-by-step process to guide them. Even the best troubleshooter on your team might miss an important step or go off in the wrong direction, particularly under pressure and at unusual hours, further stretching time to recovery.
  • Slow manual resolution: even the right expert, performing the right steps to resolve the problem, might take a very long time to do it manually. If the problem requires checking memory and restarting a service on 9 remote servers, this will take plenty of time for any human operator.
  • Prone to human error: it only takes a typo in a command-line operation to bring critical systems to their knees. Human errors are always possible, but much more likely when staff are responding to immediate problems under pressure. What’s more, staff might take actions that solve the immediate problem without realizing the broader implications, such as risk to peripheral or dependent systems.
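
To make the cost of manual resolution concrete, here is a minimal Python sketch of the "check memory and restart a service on 9 remote servers" task mentioned above. The host names, the service name, the memory threshold, and the `run_remote` helper are all hypothetical, and the sketch assumes Linux hosts with key-based SSH access and a systemd-managed service, none of which this article specifies.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical inventory: nine remote servers, as in the example above.
HOSTS = [f"app{n:02d}.example.com" for n in range(1, 10)]
SERVICE = "myservice"   # hypothetical service name
MIN_FREE_MB = 512       # hypothetical threshold for "low memory"


def run_remote(host, command):
    """Run a shell command on a host over SSH and return its stdout.

    Assumes key-based SSH login is already configured.
    """
    result = subprocess.run(
        ["ssh", host, command],
        capture_output=True, text=True, check=True, timeout=60,
    )
    return result.stdout


def check_and_restart(host, run=run_remote):
    """Check available memory on one host; restart the service if it is low."""
    # On Linux, /proc/meminfo reports MemAvailable in kB.
    out = run(host, "grep MemAvailable /proc/meminfo")
    free_mb = int(out.split()[1]) // 1024
    restarted = free_mb < MIN_FREE_MB
    if restarted:
        run(host, f"sudo systemctl restart {SERVICE}")
    return host, free_mb, restarted


def resolve_all(hosts=HOSTS, run=run_remote):
    """Apply the same check-and-restart step to every host in parallel."""
    with ThreadPoolExecutor(max_workers=max(1, len(hosts))) as pool:
        return list(pool.map(lambda h: check_and_restart(h, run), hosts))
```

Calling `resolve_all()` returns one `(host, free_mb, restarted)` tuple per server, in the same order as the host list. The sketch does nothing about hosts that are unreachable or time out, and a single failure aborts the whole run, so it should be read as an illustration of the task rather than a finished tool.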

Existing Solutions are Not Enough
Many IT organizations use monitoring systems, script automation and ITIL-style documented workflows to improve the problem resolution process. But these solutions do not eliminate the inefficiencies listed above, as explained below.

Monitoring
(such as CA Unicenter, HP, IBM Tivoli, BMC Patrol, Nagios)

  • Reports symptoms
  • Reports root causes
  • Performs simple tasks automatically (e.g. restarting a service)

Why It’s Not Enough
Cannot solve severe problems, which require troubleshooting and tricky, multi-step resolution. So human intervention and manual problem resolution are still needed.

Scripts and batches

  • Automating tasks on a single machine
  • Automating tasks in simple P2P scenarios
  • On-demand or scheduled execution

Why It’s Not Enough
Cannot deal with complex environments with multiple nodes, virtualization, remote servers, etc. It’s very difficult to write a script that will run on numerous machines with interdependencies. So many tasks are too complex to automate with scripts.
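
As a rough illustration of what "interdependencies" means for a script, the sketch below orders restarts so that each machine is handled only after the machines it depends on. The host names and the dependency map are hypothetical, and the ordering relies on Python's standard `graphlib` module (Python 3.9 or later); it is not a tool this article describes.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical dependency map: each host lists the hosts it depends on.
DEPENDS_ON = {
    "db01": [],
    "cache01": [],
    "app01": ["db01", "cache01"],
    "web01": ["app01"],
}


def restart_order(depends_on):
    """Return the hosts in an order where dependencies always come first."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as exc:
        # The second element of exc.args holds the detected cycle.
        raise ValueError(f"circular dependency: {exc.args[1]}") from exc
```

Ordering is only the easy part. A real script would also have to cope with partial failures, rollback, and differences between machines, which is where the difficulty described above comes from.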

ITIL and documented workflows

  • Creates a central, standard knowledge base for problem resolution
  • Clarifies risks and important considerations
  • Reduces human error

Why It’s Not Enough
Not there at 3:00am—when a critical problem occurs, the procedure is not on hand and there’s no time to read complex flowcharts to find the solution. IT staff will simply do something immediately to solve the problem.
