The Lost Art of Information Technology Troubleshooting
Whatever happened to the expertise of troubleshooting within the Information Technology Profession? The engineering, scientific and electrical worlds have many formalized methods and concepts for troubleshooting. I feel that somewhere along the way, this valuable skill, born from the fundamentals of the information technology profession, failed to make the transition to the information age.
Have you tried what passes for troubleshooting when you call a company for support with a product these days? It is usually something like: "Try This", Yes/No, "Try That", if not then "Buy a new one". Try going to a consumer website like Apple.com or Dell.com and play with what passes for troubleshooting tools to see what I mean.
At work, is it really any better? I am frustrated daily by the slipshod, unprofessional, undisciplined activities that I see take place in the name of "troubleshooting". My recent questioning of people at work tells me that schools don’t teach it anymore, there is no certification program that validates it, and the rapid turnover of personnel in the IT profession prevents any real on-the-job training that would transfer the skill from the experienced to the new.
Our profession significantly undervalues troubleshooting and pays the price for it every day. The price is slow trouble resolution, secondary damage from uncontrolled troubleshooting actions, extended loss of system availability, increased chance of transient vulnerability exploitation, loss of configuration management, increased management frustration, and a failure to increase the level of proficiency and knowledge of inexperienced workers.
We all know that all engineered systems will have problems. IT Pros are really all computer and network engineers. Troubleshooting and correcting these problems needs to be one of their core competencies. So what are the fundamentals of troubleshooting that should be adopted by Information Technology Professionals everywhere? Here are a few.
First we need to agree that Troubleshooting is an IT discipline. Wikipedia’s generalized description of it is accurate for our uses.
"Troubleshooting is a form of problem solving most often applied to repair of failed products or processes. It is a logical, systematic search for the source of a problem so that it can be solved, and so the product or process can be made operational again. Troubleshooting is needed to develop and maintain complex systems where the symptoms of a problem can have many possible causes."
Second, there are four basic elements of all effective troubleshooting efforts:
- They are based upon half-splitting. That means that you are always trying to divide the world into two states: "known good" and "known bad". When this effort is truly complete, you have at least isolated the things that need to change to get you back to good.
- You need to eliminate possibilities through testing. You always start with a complete list of the possible causes and eliminate them one by one based on hard evidence. Many a troubleshooter has found that the real problem was one that they discounted from experience rather than through checking. This is why the Help Desk asks, "Is the computer plugged into the wall?", when you call.
- The KISS — Keep It Simple Stupid — principle is always best. Your day is already complicated enough if you are really troubleshooting. Never let your tests become more complicated than the problem you are trying to solve.
- Always document everything that you thought, did, and will do. The tendency to just stop when you have found the fault and not document the challenge or solution nearly always leads to pain down the road. The job is not done until the paperwork is done, including trouble tickets, configuration management documents, operating manuals, etc.
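To make the half-splitting element concrete, here is a minimal sketch in Python. It assumes a simple linear chain with a single fault, where every check before the fault passes and every check at or after it fails. The stage names and the `good_through` check are hypothetical placeholders for whatever test you can actually run at each point.

```python
from typing import Callable, Sequence


def half_split(stages: Sequence[str], good_through: Callable[[int], bool]) -> str:
    """Return the first faulty stage in a linear chain of stages.

    Assumption: exactly one fault exists, so good_through(i) is True for
    every stage before the fault and False from the fault onward.
    """
    lo, hi = 0, len(stages) - 1  # the fault lies somewhere in stages[lo..hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if good_through(mid):
            lo = mid + 1  # everything up to mid is "known good"
        else:
            hi = mid      # the fault is at mid or earlier: "known bad"
    return stages[lo]


# Hypothetical path from a desktop to the Internet.
STAGES = ["workstation NIC", "patch cable", "access switch",
          "core router", "firewall", "ISP link"]

# Stand-in for a real test; here the fault is pretended to sit at index 3.
faulty = half_split(STAGES, lambda i: i < 3)
print(faulty)
```

With six stages, the search settles on the faulty one after three checks instead of walking the chain end to end. Real systems are rarely this linear, so treat it as an illustration of the idea rather than a tool.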
There are many Formal Methodologies for Troubleshooting that we could borrow from other disciplines. The advantage of a formal method is that all participants, local and remote, familiar and unknown, senior and junior, are synchronized on what you are doing and where you are going next. Best of all, it provides the framework for management decision-making in support of the troubleshooting and allows for formalized training programs to be built. From my time working in the nuclear energy field, where formality and standardization are prerequisites, I remain a huge fan of formal methods of technical investigation.
Here are some leads for a formal method to employ:
- Steve Litt has written guides on what he calls The Universal Troubleshooting Process (UTP). Its 10 steps are:
  - Prepare
  - Make damage control plan
  - Get a complete and accurate symptom description
  - Reproduce the symptom
  - Do the appropriate corrective maintenance
  - Narrow it down to the root cause
  - Repair or replace the defective component
  - Test
  - Take pride in your solution
  - Prevent future occurrence of this problem
- Another favorite list of troubleshooting steps that I found on the web:
  - Establish symptoms.
  - Identify the affected area.
  - Establish what has changed.
  - Select the most probable cause.
  - Implement a Solution.
  - Test the result.
  - Recognize potential effects of the solution.
  - Document the solution.
- MaintenanceWorld.com has an excellent article on Electrical Troubleshooting in Seven Steps. This is the list that I learned many years ago and use as the basis for my own approach.
  - Gather information
  - Understand the malfunction
  - Identify which parameters need to be evaluated
  - Identify the source of the problem
  - Correct/repair the component
  - Verify the repair
  - Perform root cause analysis
- Unlike most information processing companies, Cisco does produce a Cisco Internetwork Troubleshooting (CIT) course. It teaches an eight-step troubleshooting process for Cisco equipment that is probably the most relevant for information technology pros. The steps are:
  - Define the problem.
  - Gather detailed information.
  - Consider probable cause for the failure.
  - Devise a plan to solve the problem.
  - Implement the plan.
  - Observe the results of the implementation.
  - Repeat the process if the plan does not resolve the problem.
  - Document the changes made to solve the problem.
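What these lists share is a loop: pick a probable cause, act on it, observe, repeat if the symptom remains, and write it all down. The Python sketch below shows one way that loop could look. It is not taken from any of the sources above; the `Hypothesis` and `Ticket` types, the fix functions, and the example fault are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Hypothesis:
    cause: str                 # the probable cause being considered
    fix: Callable[[], None]    # the planned corrective action


@dataclass
class Ticket:
    problem: str
    log: List[str] = field(default_factory=list)  # running documentation


def troubleshoot(ticket: Ticket,
                 hypotheses: List[Hypothesis],
                 symptom_present: Callable[[], bool]) -> Optional[str]:
    """Work through hypotheses (most probable first), logging every step."""
    ticket.log.append(f"Problem defined: {ticket.problem}")
    for h in hypotheses:
        ticket.log.append(f"Plan: address '{h.cause}'")
        h.fix()                                   # implement the plan
        if not symptom_present():                 # observe the results
            ticket.log.append(f"Resolved by addressing '{h.cause}'")
            return h.cause
        ticket.log.append(f"No change after '{h.cause}'; repeating")
    ticket.log.append("Unresolved: gather more information")
    return None


# Hypothetical example: the real fault is a stale DNS record.
state = {"cable_seated": True, "dns_ok": False}

def reseat_cable() -> None:
    state["cable_seated"] = True

def refresh_dns() -> None:
    state["dns_ok"] = True

ticket = Ticket(problem="Users cannot reach the intranet site")
resolved_by = troubleshoot(
    ticket,
    [Hypothesis("loose patch cable", reseat_cable),
     Hypothesis("stale DNS record", refresh_dns)],
    symptom_present=lambda: not state["dns_ok"],
)
print(resolved_by)
print("\n".join(ticket.log))
```

The log is written as the work happens rather than reconstructed afterward, so the ticket still records what was tried even when nothing worked.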
I propose that it is high time that we all rediscover the Lost Art of Information Technology Troubleshooting for ourselves, our organizations, and our users. This will lead to a much happier Information Age for one and all.
What do you think? Are we any good at troubleshooting? Are there standardized methods that are accepted by the IT Pro community? Tell me what you think, please.
That is my Information Technology Thought of the Day (ITTOD) for July 31, 2009 © Scott Coughlin.
Image Credit: http://www.veryslowcomputer.com

Thanks for posting this, Scott. There is a real lack of troubleshooting skills out there. I wonder if it is related to how we are raised. I grew up in a rural area without a lot of money, so we had to fix our own bikes or make bikes out of spare parts. That taught us to learn how things work.
I see that troubleshooting today consists of reboot first, troubleshoot later. Another common thing is “oh it was x last time so let me do that.”
Sorry for the rant, but this is an area we really need to improve in our industry.