new blog 2.0

2009/05/27

Thoughts on troubleshooting methodology...

I never thought I could have any problems troubleshooting IT stuff, especially since it's been my day-to-day job for the last 3 years. However, not so long ago I was asked a couple of troubleshooting scenario questions in an interview. While I had the right feeling about the solution, I failed to break the problem down theoretically, and when the time was up the interviewer didn't seem impressed with me (which is probably a nice way of putting it). I told myself 'no big deal': with my hands on the actual problem in the real world, I would probably have solved it in no time.
Or would I? Maybe, but certainly not in 'no time'.
It got me thinking about how intuitively I have been doing my job so far, and that intuition alone is not the right approach for a troubleshooting professional. I decided I needed some structure, and here it is. I'd like to share my thoughts with you:
  1. Problem identification and data collection.
    • If someone reported the problem, ask the reporter precise questions:
      • What's wrong?
      • When did it happen for the 1st time? (If relevant, where?)
      • Did you change something just before the occurrence? If so, what?
      • Is the problem persistent or intermittent?
      • Are you aware of other people experiencing the same issue?
      • Have you tried any workarounds?
    • What does an Internet search return? Maybe it's a fairly common issue globally.
    • What does your internal knowledge base return? Maybe it's a fairly common local issue.
    • What is the scope of the problem? Local? Global?
  2. Reproducing a problem.
    • Where possible, create a test environment.
    • Try to reproduce the bug. Once it is successfully reproduced, system-call tracers such as truss and strace are invaluable on Unix-like systems for analyzing the executable at runtime. Similar tools are available for Windows too.
  3. Analysis.
    • Visualize. Draw all the components and try to figure out visually where the problem might be hidden.
    • Try to understand the potential dangers.
    • Research thoroughly and refer to the technical documentation.
    • It is crucial that you understand the terminology.
  4. Isolating the issue.
    • Rule out factors that have nothing to do with the problem, but don't just assume they are unrelated - verify it.
    • Test the problem under many different conditions. Change one or two conditions at a time.
  5. Suggest and apply fixes.
    • Always make a backup of any important data that can be lost.
  6. Escalate where appropriate.
  7. Understanding and verifying the solution.
    • Make sure you understand what's going on before moving on to conclusions.
  8. Creating Documentation!
    • Document your findings to avoid duplicated effort. Describe the symptoms, error messages and the solution, and publish them somewhere. Try to organize your knowledge base so that the information is easy to find when you run into the same problem 6 months later.
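
To make step 4 a bit more concrete, here is a minimal Python sketch of the "change one condition at a time" idea. Everything in it is made up for illustration: the `isolate_factors` helper, the configuration keys and the `works` check are assumptions, not part of any real tool. It starts from a known-good setup, applies each difference from the failing setup on its own, and records which single change breaks the check.

```python
from typing import Callable, Dict, List

Config = Dict[str, object]


def isolate_factors(good: Config, bad: Config,
                    works: Callable[[Config], bool]) -> List[str]:
    """Return the conditions that break the check when changed on their own."""
    suspects = []
    for key, bad_value in bad.items():
        if good.get(key) == bad_value:
            continue  # this condition is identical in both setups, rule it out
        trial = dict(good)       # start from the known-good setup
        trial[key] = bad_value   # change exactly one condition
        if not works(trial):
            suspects.append(key)
    return suspects


# Hypothetical setups: one that works and one that shows the problem.
good = {"dns": "10.0.0.1", "mtu": 1500, "proxy": None}
bad = {"dns": "10.0.0.1", "mtu": 9000, "proxy": "proxy.example:3128"}


def works(cfg: Config) -> bool:
    # Stand-in for a real test (ping, login, page load...). In this made-up
    # scenario anything above the standard MTU fails.
    return cfg["mtu"] <= 1500


print(isolate_factors(good, bad, works))  # ['mtu']
```

Note that this only catches problems caused by a single condition. If the fault needs two changes together, every single-change trial passes and you have to test combinations, which is probably where "one or two conditions at a time" comes in.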
