World 2011
Help!
Problem Solving and Troubleshooting
Daniel Rodwell
Australian National University
XW11
Intro
Outline
Today’s Session
Two Parts
• Problem Solving
– Concepts and Theory
– Methods
– Group Solve
• Troubleshooting
– Concepts
– Methods
Today
What’s in it
• Professional Development workshop
• Toolset for you to use
• Lighthearted, not too serious
• Mixture of Skills and Backgrounds
– hopefully there’s something here for everyone
Part 1: Problem Solving
Problem Solving Concepts
Problem Solving
The dictionary says...
problem ¦ präbləm¦
noun
1 a matter or situation regarded as unwelcome or harmful and needing to be dealt with and overcome : mental health
problems | [as adj. ] city planners consider it a problem district.
• a thing that is difficult to achieve or accomplish : motivation of staff can also be a problem.
ORIGIN late Middle English (originally denoting a riddle or a question for academic
discussion): from Old French probleme, via Latin from Greek problēma, from proballein
‘put forth,’ from pro ‘before’ + ballein ‘to throw.’
Problem Solving
The thesaurus says...
problem
noun
1 they ran into a problem: difficulty, trouble, worry, complication, difficult situation; snag, hitch, drawback,
stumbling block, obstacle, hurdle, hiccup, setback, catch; predicament, plight; misfortune, mishap,
misadventure; dilemma, quandary; informal headache, nightmare.
2 I don't want to be a problem: nuisance, bother, pest, irritant, thorn in one's side/flesh, vexation; informal drag,
pain, pain in the neck.
3 mathematical problems: puzzle, question, poser, enigma, riddle, conundrum; informal teaser, brainteaser.
adjective
a problem child: troublesome, difficult, unmanageable, unruly, disobedient, uncontrollable, recalcitrant,
delinquent.
ANTONYMS well-behaved, manageable.
Problem Solving
The dictionary says...
solve ¦sälv; sôlv¦
verb [ trans. ]
find an answer to, explanation for, or means of effectively dealing with (a problem or mystery) : the policy could
solve the town's housing crisis | a murder investigation that has never been solved.
Problem Solving
In Context
For System Administrators or System Engineers
• design a new system
• grow an existing system
• transition to another system
• codify a process or activity
• solve an IT need
But...
Problem Solving Skills are reusable!
• Core Skills can be applied generally to solve non-IT problems, anywhere.
– design a building
– organise a world-wide roadshow
– fix something
How do we know?
How do we know we have a problem?
Two ways we typically discover a problem
SENSE: we sense something is different from ‘normal’
TOLD: someone tells us we have a problem
At this point
You should be thinking...
ALERT!
SUBJECTIVE INFORMATION SOURCES
Subjective
cf. objective
• Perception based
• typically not driven by fact or data
• opinion rather than scientific observation
• May contain traces of Emotion
How do we react?
How do we react to a problem?
PANIC !
AARGH!
SCREAM!
ALARM!
EEEEK! TERROR! FLUSTER! HYSTERIA!
GRUMBLE BLAMETHROWER SPAGHETTI-CHUCKER GRIPE
COMPLAINT :(
DISMISSAL...
SIGH
MOAN
WHATEVER...
AGAIN...
SHE’LL BE RIGHT... THERE IS NO PROBLEM...
MMMM K...
How do we react?
How do we react to a problem?
Sometimes, but rarely
• Analytically
• Pragmatically
movie clip
Understanding the Problem
Don’t be misled or confused
Before you do anything:
1. Determine if there is an actual problem
2. Clearly define what the problem is
3. Clearly define what you are trying to solve
(the act of solving is sometimes the easy part).
Why?
We want to make the situation better, not worse.
(how many times have you seen the opposite happen?? DIY anyone?)
What am I trying to solve?
Constant Re-evaluation
OBVIOUSNESS ALERT!
This all seems like common sense.
But... it’s easy to get lured into a big mess.
Often you don’t know you have a big problem, until you have a really big problem.
How do we get in this mess?
Understanding the precursors
1. Pressure (Management, time, resourcing)
• Rationality and the ability to reason often disappear under pressure.
• Your focus is set on “fix” rather than “solution”.
• There may be few incentives to step back, and think before doing.
How do we get in this mess?
Understanding the precursors
2. Limited Familiarity
• The technology is unknown to you or you have only basic knowledge
• You’ve inherited a system and it’s broken
• You’re new to a role or organisation
How do we get in this mess?
Understanding the precursors
3. Overconfidence
• Massive underestimation of the problem
• “how hard can it be?”
How do we get in this mess?
Understanding the precursors
4. Quick Fix Temptation
• It’s tempting
• It’s delicious
• You’ll regret it later.
Quick Fix Now = probably a really big problem later.
Problem Solving Methods
Stage 1 - Problem Definition
1. Determine if there is actually a problem
– Gather information
– Understand the situation
– Establish a baseline where the problem is a ‘variation on normal’, e.g. a capacity or performance problem.
– Verify the problem exists
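The baseline step above can be sketched in a few lines of Python. This is only an illustration: the function name `outside_baseline`, the sample response times and the three-standard-deviation threshold are all assumptions made up for the example, not part of the workshop material.

```python
from statistics import mean, stdev


def outside_baseline(baseline, current, tolerance=3.0):
    """Return True if `current` is more than `tolerance` standard
    deviations away from the mean of the `baseline` samples."""
    centre = mean(baseline)
    spread = stdev(baseline)  # needs at least two baseline samples
    if spread == 0:
        # A perfectly flat baseline: any difference counts as a variation.
        return current != centre
    return abs(current - centre) > tolerance * spread


# Hypothetical response times (ms) recorded while the system was 'normal'.
baseline_ms = [102, 98, 105, 99, 101, 97, 103]

print(outside_baseline(baseline_ms, 104))  # False: within normal variation
print(outside_baseline(baseline_ms, 180))  # True: worth verifying as a problem
```

Without recorded 'normal' figures there is nothing to compare against, which is why the baseline comes before verification.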
Stage 1 - Problem Definition
2. Clearly define what the problem is
– Scope
– Impact
– Nature
Stage 1 - Problem Definition
3. What are you trying to solve
– Outcomes
– Deliverables
– Solution
– i.e. what you want to see at the end of it.
Simple Example
We have No Milk!
1. Determine if there is actually a problem
– Look in the fridge. Yes, there’s no milk.
2. Clearly define what the problem is.
– We need milk for breakfast in the morning, and we don’t have any... and I need a coffee before leaving the house.
3. What are you trying to solve?
– Get enough milk for breakfast, nothing more, nothing less.
What am I trying to solve?
How many systems or projects have you seen that don’t solve the original problem?
Remember this?
Stage 1: Problem Definition
– Stage 1 is your foundation - weak problem definition will lead to weak solutions.
– Your problem definition doesn’t need to be pages and pages of blurb. A concise, accurate problem description is better.
– Stage 1 is knowledge and familiarity building.
Knowledge + Familiarity = less stress
Stage 2: Research
Understanding:
- What has been done so far
- The factors that have led to this situation
Research:
- You might not be the first to encounter this problem.
- Your research may lead you back to Stage 1 again
Stage 3: Peer Check
Possibly the most powerful resource
Describe the problem to a peer or colleague
- Clearly articulate what the problem is
- What you’re trying to solve
- Any difficulties you see
Why?
- Gaps or gotchas will be exposed
- It might sound good in your head, but verbalising it exposes the weaknesses
Stage 3: Peer Check
Possibly the most powerful resource
What if I’m working alone?
– Write it down.
– Blog it.
– Tweet it.
– Even if no one reads it, you have a record of your thoughts.
– Gives you a point of return if you get lost
– Talk to your manager (!)
Stage 4: Nature of Problem
The nature of the problem will guide you toward a methodology.
Loosely Defined Problem
• Broad, non-specific goals
• Ideal-based
• Experimental / Trial / Future Projects
Stage 4: Nature of Problem
The nature of the problem will guide you toward a methodology.
Tightly Defined Problem
• Specific goals
• Target-based
• Production ready, workflow style systems
Problem understood
Now how to solve it
We have a big lump of a problem
[Diagram: one large block labelled PROBLEM]
We could chip away at it, and may get somewhere if we’re lucky.
To effectively solve any problem:
Break it up
[Diagram: the Problem broken into pieces labelled A to G, with sub-pieces AA, AB, BA and BB]
Stage 5: Break it up
A big problem is hard to solve
Smaller chunks are easier to solve
– a piece or chunk is far more workable
– each piece may have specific but different requirements
– completeness (individually solved = collectively solved)
– can be delegated or allocated
A Piece or Chunk is likely to be
– an activity or task
– attribute or category
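The idea that pieces solved individually add up to a solved whole can be sketched as a small tree. This is a rough illustration in Python; the `Piece` class and its method names are invented for the example, and the labels simply reuse those from the Break it up diagram.

```python
from dataclasses import dataclass, field


@dataclass
class Piece:
    """One chunk of a problem; a chunk may be broken into smaller parts."""
    name: str
    solved: bool = False
    parts: list = field(default_factory=list)

    def is_solved(self):
        # A leaf is solved when marked so; a parent only when every part is.
        if not self.parts:
            return self.solved
        return all(part.is_solved() for part in self.parts)

    def open_pieces(self):
        """Names of the smallest pieces that still need work."""
        if not self.parts:
            return [] if self.solved else [self.name]
        return [name for part in self.parts for name in part.open_pieces()]


problem = Piece("Problem", parts=[
    Piece("A", parts=[Piece("AA", solved=True), Piece("AB", solved=True)]),
    Piece("B", parts=[Piece("BA", solved=True), Piece("BB")]),
])

print(problem.is_solved())    # False
print(problem.open_pieces())  # ['BB']
```

The list of open pieces is also a natural unit for delegating or allocating work.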
Top - Down Method
Tightly Defined Problem
Top-Down Analysis:
– Start at highest level of system
– partial understanding of sub-technologies
– You know what you want from a solution
– maybe not at module or piece level
Top - Down
[Diagram: Top-Down Analysis. Start at System and work down through Main Logic, Peripheral, Mass Storage, Direct Attach, Thermal and Network]
Bottom - Up Method
Tightly Defined Problem
Bottom - Up Synthesis:
– Start at lowest level of system
– Individual modules collectively build the system or solution
– You understand what is happening at module level, but are unsure of each module’s relationship to the whole
Bottom - Up
[Diagram: Bottom-Up Synthesis. Start at the modules (Main Logic, Peripheral, Mass Storage, Direct Attach, Network, Thermal) and build up to System]
Finding the Pieces
Order in chaos
Ways ‘pieces’ of the problem become obvious (things to look for):
• Natural Grouping
• Functional or Procedural Grouping
• Modular
• Derived from First Principles or Architecture
Funnel Method
Loosely Defined Problem
Recall:
• Broad, non-specific goals
• Ideal-based
• Experimental / Trial / Future Projects
• The problem may not be fully understood, and solution options are completely unknown.
Funnel Method
Loosely Defined Problem
Inputs:
• new or unproven ideas
• parallel prototyping (project bake-off)
• experimentation and discovery
Output:
– Evolutionary goal
– The best solution (progressive)
Funnel Method
[Diagram: a funnel from Lots of Ideas through Concept generation, Gate, Modular Grouping (candidates A, B, C, D) and Bake off (A, B) to Solution]
Group Solve
Group Solve
Solve for X
- Likely to encounter this scenario in your organisation
- Problems progressively revealed as you traverse the scenario
- individually / pair up & think of the problem and how you might start to solve it
– modules / categories / attributes
Scenario
< scenario removed >
Why Problem Solving Hurts
Ouch
• If it was easy, you’d have solved it already
• It typically involves learning new stuff, while simultaneously developing a solution
• Chances are you will not immediately know the answer.
• You’re under pressure.
Constraints
Fixed vs. imposed Constraints
• Some constraints will be fixed and are physically determined.
– e.g. a cable breaking strain of 1200 kg
• Other constraints are imposed, or we unintentionally limit ourselves with prior convention.
Think outside of the problem as well.
• Is the problem part of a bigger picture?
Consider this
Imposed Constraint
You are here
Consider this
Down under (& NZ too) is on top
No! It’s all wrong.
Why?
N
Someone decided North goes at the top.
No Problems
I’m awesome, No problems here.
... yet
Discover weaknesses in your systems
• use same approaches
• module by module analysis
• understand what ‘normal’ is for your system
• understand utilisation and capacity
• If you do have a problem, you’ll know how each module normally behaves
Part 2. Troubleshooting
Troubleshooting Concepts
What is Troubleshooting?
Dictionary says...
troubleshoot ¦ trəbəl sh oōt¦
verb [ intrans. ] [usu. as n. ] ( troubleshooting)
solve serious problems for a company or other organization.
– trace and correct faults in a mechanical or electronic system.
What is troubleshooting?
Applied Problem Solving
Inherit: Problem Solving methods
It’s reusable
Core points retained
• Define what the issue is
• Understand what you are trying to fix
• Break the issue down into smaller parts
Types of Failure
3 Common Types
Technical Failures usually fall into three top level categories
– Bogus (there is no failure)
– Outright (it’s dead)
– Intermittent (the most problematic)
Influences
Influences on Troubleshooting accuracy
• Quality of Symptom description
• Symptoms often do not have a 1:1 correlation with failure mode
• Data may be incorrect
How not to fail
The most important part
Symptom Description
• An accurate and concise Symptom Description is critical to your troubleshooting success
• Without an accurate Symptom Description
– You’ll be chasing the wrong thing
– It’ll be unclear where to start
Symptom Description
It’s easy to spot a bad one
It’s dead.
It doesn’t work.
There’s something wrong with my computer.
I can’t download the internet.
A System
and its parts
• Any ‘System’ is a collection of modules
• It’s normally a module that breaks, not the entire system
• A web server is a system - I/O, network, authentication, db, content, config
• A washing machine is a system - pump, motor, controller, valves, sensor
Accurate Troubleshooting
1. Report of System Failure
2. Verification or Replication of fault
3. Where there is an actual, verifiable fault, locate the faulty module within the system
4. Fix only the faulty module or part
5. Return the correctly functioning system to operational status
What is Troubleshooting
Sequential Fact Building
Progress through the troubleshooting process should
– reduce the uncertainty
– progressively isolate the modules
– increase the number of known states
[Diagram: Loosely Defined Symptoms, then Fault Verified, then Module isolation, then Cause]
Fact Building
[Diagram: moving from Loosely Defined Symptoms through Fault Verified and Module isolation to Cause, uncertainty decreases and facts increase]
Symptom Gathering
– Administrator asks probing questions
– User reports of problems and description
Priming Data
– Normal Statistics
– Log Files
– Error Reports
Symptom Verification
Bogus Isolation
Module identification
Solution
Feedback Concept
We like to know what’s going on
Humans like feedback in the form of progress.
We like to know that our interactions are changing the environment we are attempting to influence.
It gives us the sense of “getting somewhere”.
Feedback Concept
Managers are human too (!)
Uninformed managers can become a larger problem than the technical issue you are trying to resolve.
Feedback Concept
Keep it in mind
When determining the steps you are going to take in your troubleshooting task:
• keep in mind the result you are looking for at each step
• and what result a normal, correctly operating module would return.
• If you have progressive results, you can keep others informed.
– e.g. we’ve ruled X out, established Y is working, need to test Z.
Why Feedback Matters
Consider this
A theoretical moving car
Input: Steering Angle
Process: Wheels turn
Output: Change in Direction
Feedback: Visual Recognition, Sensory Feedback (g-force)
Feedback Delayed
Feedback altered
A theoretical moving car
Input: Steering Angle
Process: Wheels turn
Output: Change in Direction
Feedback (delayed 30 sec): Visual Recognition, Sensory Feedback (g-force)
Feedback Removed
Feedback altered
A theoretical moving car
Input: Steering Angle
Process: Wheels turn
Output: Change in Direction
Feedback (removed): Visual Recognition, Sensory Feedback (g-force)
Oh no!
You crashed and burned.
Why?
• Multiple wrong inputs
• Situation becomes progressively worse
• progress is unknown
Each Troubleshooting stage should result in usable information.
• Even if that is “this part works as expected”.
• You now have one less module to isolate.
Troubleshooting Methodologies
Gather info and verify
First Steps
• Gather info
• Verify situation against information
• Establish a baseline of a correctly operating system
• Rule out really obvious factors
– Storage full, No IP address, No AC input, etc.
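Two of the obvious factors above (storage full, no IP address) can be checked from a script before any deeper troubleshooting. The sketch below uses only the Python standard library; the function name `obvious_checks`, the 5% free-space threshold and the probe address are assumptions chosen for illustration.

```python
import shutil
import socket


def obvious_checks(path="/", min_free_fraction=0.05):
    """Run two quick sanity checks and return the results as booleans."""
    results = {}

    # Storage full? Compare free space with a (hypothetical) 5% threshold.
    usage = shutil.disk_usage(path)
    results["storage_ok"] = (
        usage.total > 0 and usage.free / usage.total >= min_free_fraction
    )

    # No IP address? Connecting a UDP socket sends no packets, but it fails
    # when the host has no usable address or route.
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as probe:
            probe.connect(("192.0.2.1", 9))  # documentation-only address
            address = probe.getsockname()[0]
            results["has_ip_address"] = not address.startswith("127.")
    except OSError:
        results["has_ip_address"] = False

    return results


print(obvious_checks())
```

A failed check here does not prove the cause, but it is cheap to run and rules a module in or out early.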
Brute-Force Guesswork
Troubleshooting Methodologies
Brute-force Guesswork
– Belief based
– Evidence poor
– Procedurally inadequate
– highly uncertain if correct cause identified
– occasionally works for some experienced techs. Common cause of “it must be this part”.
[Diagram: MLB, Housing, Battery, Display, HDD and Unfixable, each in a variable certain / uncertain state]
Split-Half
Troubleshooting Methodologies
Split-Half
[Diagram: function isolation across System, Software / Hardware, Graphics / Memory and GPU / Display, with an X marking eliminated branches]
– Eliminate half of the probable cause at each level
– Requires understanding of common issues
– Requires understanding of core functions of each function area or differentiating behaviour
– highly structured, complete but can be time consuming and indirect if starting point is vague.
– Works best for isolate/verify function areas where there is no obvious likely cause
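The halving step can be sketched as a simple search. This is a minimal illustration, assuming exactly one faulty area and a test that can tell whether the fault lies inside a given group; the function name `split_half` and the list of areas are hypothetical.

```python
def split_half(candidates, fault_in):
    """Return (suspect, tests_run), assuming exactly one candidate is faulty.

    `fault_in(group)` must return True when the fault lies within `group`.
    """
    suspects = list(candidates)
    if not suspects:
        raise ValueError("need at least one candidate")
    tests_run = 0
    while len(suspects) > 1:
        middle = len(suspects) // 2
        first_half, second_half = suspects[:middle], suspects[middle:]
        # One test rules out half of the remaining candidates.
        suspects = first_half if fault_in(first_half) else second_half
        tests_run += 1
    return suspects[0], tests_run


# Hypothetical function areas and a stand-in for a real isolation test.
areas = ["config", "driver", "GPU", "display",
         "memory", "PSU", "cabling", "storage"]
faulty = "display"
print(split_half(areas, lambda group: faulty in group))  # ('display', 3)
```

Eight candidates need three tests rather than up to seven, which is where the method earns its structure. In practice each test is a real isolation step, so the cost of one test matters as much as the count.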
Power / Signal Flow
Troubleshooting Methodologies
Power / Signal Flow
[Diagram: signal flow chain with stages AC - IN, PSU, loom, PWR BTN, MLB / SMC, RAM, Speaker, Audio, PROC, Controller, PCI and SATA]
– Follow Signal sequence through system
– Highly sequential, must be performed in order
– effective for “no X” or “dead” symptoms
– often places core modules early in the troubleshooting, even if they may be a less likely cause.
– Requires understanding of signal flow in system architecture.
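Following the signal in order amounts to finding the first stage that fails its check. The sketch below is illustrative only: the function name `first_break`, the ordering of the chain and the readings are assumptions, not measurements from a real system.

```python
def first_break(chain, stage_ok):
    """Check each stage in order; return the first one that fails, or None."""
    for stage in chain:
        if not stage_ok(stage):
            return stage
    return None


# Hypothetical ordering and readings, for illustration only.
chain = ["AC in", "PSU", "loom", "power button", "MLB / SMC", "RAM"]
readings = {"AC in": True, "PSU": True, "loom": False,
            "power button": False, "MLB / SMC": False, "RAM": False}

print(first_break(chain, lambda stage: readings[stage]))  # loom
```

Everything downstream of the break also reads as failed, which is why the order of checks matters for this method.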
Likely Cause
Troubleshooting Methodologies
Likely Cause Identification
[Diagram: candidate causes in order of decreasing likelihood: Bogus, Config, Software, Fan, Sensor, MLB]
• Use known likely causes as starting point
• can often be reordered to promote more likely causes, demote less likely causes
• works best where
– it is possible to identify all sources of possible cause
– there are few causes
– or the causes are well known
• less suitable for cases where there is no obvious cause
Likely Cause + Weighted Matrix
Troubleshooting Methodologies
Weighted Matrix
Order | Possible Cause   | Likelihood | Possibly Bogus | Isolation Priority | Rank
1     | Possible Cause A | High       | Yes            | High, Dependencies | HIGH
2     | Possible Cause B | Low        | Yes            | High, Dependencies | MID
3     | Possible Cause C | Low        | No             | Low                | LOW
– Use to assist prioritising the Likely Cause isolation order
– Promotes more likely / relevant isolation tests for the scenario
– Demotes less likely causes
– Use to correctly “weight” troubleshooting priority.
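One way to turn the matrix into an isolation order is to give each column a numeric weight and sort. The weights below are an assumption for illustration (the slides do not define a scoring scheme); they were chosen so that the example reproduces the A, B, C ordering shown in the table.

```python
LEVEL = {"Low": 0, "Mid": 1, "High": 2}


def weight(cause):
    """Illustrative score: likelihood + isolation priority + bogus bonus."""
    # Assumption: a possibly bogus cause gets a small bonus so that it is
    # ruled out early.
    bogus_bonus = 1 if cause["possibly_bogus"] else 0
    return LEVEL[cause["likelihood"]] + LEVEL[cause["priority"]] + bogus_bonus


def isolation_order(causes):
    """Highest weight first; ties keep their original order."""
    return sorted(causes, key=weight, reverse=True)


causes = [
    {"name": "Possible Cause C", "likelihood": "Low",
     "possibly_bogus": False, "priority": "Low"},
    {"name": "Possible Cause A", "likelihood": "High",
     "possibly_bogus": True, "priority": "High"},
    {"name": "Possible Cause B", "likelihood": "Low",
     "possibly_bogus": True, "priority": "High"},
]

for position, cause in enumerate(isolation_order(causes), start=1):
    print(position, cause["name"], weight(cause))
```

The point is the explicit ranking rather than the particular numbers; adjust the weights to suit the scenario.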
Minimal Config
Troubleshooting Methodologies
Minimal Config
[Diagram: System Build Up. Start with Core Components, add the Next Component (Module A, B, C, D, E in turn), and ask “Test ok?” after each addition, re-testing as needed]
– The Final Frontier
– Saviour when all else fails
– Highly time consuming, but high accuracy
– Must know what components are the absolute minimum for the system to start
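The build-up loop can be sketched as follows. This is a simplified illustration, assuming a single faulty module and a repeatable test; the function name `build_up` and the module names are hypothetical stand-ins.

```python
def build_up(core, extras, system_ok):
    """Add one module at a time; return the module whose addition breaks
    the system, or None if the full build passes."""
    config = list(core)
    if not system_ok(config):
        raise ValueError("the minimal configuration itself fails")
    for module in extras:
        config.append(module)
        if not system_ok(config):
            return module
    return None


# Hypothetical modules and a stand-in for a real "does it start?" test.
core = ["Core Components"]
extras = ["Module A", "Module B", "Module C", "Module D", "Module E"]
print(build_up(core, extras, lambda config: "Module D" not in config))  # Module D
```

Each pass through the loop is a full test of the system, which is why the method is accurate but slow.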
No Single Answer
Select-a-method
• No single method works for all types of symptoms or fault
– complexity: simple, tightly correlated symptoms; complex, loosely correlated symptoms
– nature of failure: electrical, mechanical; runtime, configuration, design, capacity; intermittent
Known Good
Troubleshooting Methodologies
A Known Good module is a module, piece of code or other component that is known to be operating correctly.
It’s often called “KG” or “golden”.
For core components, you may need to use a KG module OR have a good understanding of the expected behaviour of the core modules.
... but they really need to be “good” or “golden” or you’ll prime your troubleshooting for failure.
Tools To Help You
They’re often right there.
• Console (logs, would you believe, have heaps of info!)
• Activity Monitor
• top & ps
• fs_usage & lsof
• iostat
• sc_usage & dtrace
• netstat
• wireshark
• rubbish webmin interface on your switch / fabric / CSS / FC array
Group Troubleshoot
Group Troubleshoot
Scenario
- Less likely to encounter this situation in your organisation
- You might not know all of the technology involved. Use first-principles knowledge of IT systems to identify modules
- individually / pair up & think of the problem and how you might start to solve it
– modules / categories / attributes
Group Troubleshoot
Scenario
< scenario removed >
Workarounds
Where it’s not something you can fix
Occasionally, there will be some issues you have isolated to a cause that you cannot directly fix.
For example, a software bug.
• Using your troubleshooting results, you’ll know where it’s failing
• Use this information to develop a workaround until a permanent fix is available
• Report the bug to the product vendor or manufacturer
• When the fix is available, you’ll know how to correctly verify its operation