Thursday, April 12, 2012

Addressing the 90% Problem

If I were to try to measure the time I spend thinking, analyzing, and communicating versus actually typing in code, how much of the time would it be? If I were to guess, I'd say at least 90%.  I wonder what other people would say, especially without being biased by the other opinions in the room?

We spend so much of our time trying to understand... even putting a dent in that would mean -huge- gains in productivity. So...

How can we improve our efficiency at understanding?
How can we avoid misunderstanding, forgetting, or lack of understanding?
How can we improve our ability and efficiency at communicating understanding?
How might we reduce the amount of stuff that we need to understand?

These are the questions that I want to focus on... it's where the answers and solutions will make all the difference.

Mistakes in a World of Gradients

I've been working on material for avoiding software mistakes, and have been searching for clarity on how to actually define "mistake."

I've been struggling with common definitions that are very black and white about what is wrong and considered "an error" or "incorrect."  Reality often seems more of a gradient than that, and likewise, avoiding mistakes should perhaps be more a matter of avoiding poor decisions in favor of better ones?

I like this definition better because it accounts for the gradient in outcomes without missing the point.

"A human action that produces an incorrect or inadequate result. Note: The fault tolerance discipline distinguishes between the human action (a mistake), its manifestation (a hardware or software fault), the result of the fault (a failure), and the amount by which the result is incorrect (the error)."

 GE Russell & Associates Glossary - http://www.ge-russell.com/ref_swe_glossary_m.htm




Tuesday, April 10, 2012

What is a Mistake?

I've been trying to come up with a good definition for "mistake" in the context of developing software.

It's easy to see defects as caused by mistakes, but what about other kinds of poor choices? Choices that led to massive work inefficiencies?  And what if you did everything you were "supposed to" do, but still missed something? Is that caused by a mistake?   What if the problem is caused by the system and no one person is responsible? Is that a mistake?

All of these, I think, should be considered mistakes.  If we look at the system and the cause, we can work to prevent them.   The problem with the word "mistake" is that it's quickly associated with blaming whoever is responsible for whatever went wrong.  "Mistake" triggers fear, avoidance, and guilt, which is the exact opposite of the kind of response that can lead somewhere positive.

Here's the best definition I found from dictionary.com:

"an error in action, calculation, opinion, or judgment caused by poor reasoning, carelessness, insufficient knowledge,etc. "

From this definition, even if you failed to do something that you didn't know you were supposed to do (having insufficient knowledge), it's still a mistake.  Even if it was an action triggered by interacting parts of the system, rather than any one thing, it's still a mistake.

But choices that cause inefficiencies? Those seem to fall under the gradient of an "error in action or judgment." If we could have made a better choice, was the choice we made an error? Hmm.

Sunday, April 8, 2012

A Humbling Experience

About 7 years ago, I was working on a custom SPC (statistical process control) system project.  Our software ran in a semiconductor fab and was basically responsible for reading in all the measurement data off the tools and detecting processing errors.  Our users would write thousands of little mini programs that would gather data across the process, do some analysis, and then, if they found a problem, could shut down the tool responsible or stop the lot from further processing.

It was my first release on the project. We had just finished up a 3-month development cycle and worked through all of our regression and performance tests.  Everything looked good to go, so we tied a bow on it and shipped it to production.

That night, at about three in the morning, I got a phone call from my team lead, and I could hear a guy just screaming in the background.  Apparently, we had shut down every tool in the fab.  Our system ground to a screeching halt, and everyone was in a panic.

Fortunately, we were able to roll back to the prior release and get things running again.  But we still had to figure out what happened.  We spent weeks verifying configuration, profiling performance, and testing with different data.  Then finally, we found a bad slowdown that we hadn't seen before.  Relieved to find the issue, we fixed it quickly and assured our customers that everything would be OK this time.

Fifteen minutes after installing the new release... the same thing happened.

At this point, our customers were just pissed at us.   They didn't trust us.   And what can you say to that? Oops?

We went back to our performance tests but couldn't reproduce the problem.  After spending weeks trying to figure it out, with about 15 people on the team sitting pretty much idle, management decided to move ahead with the next release.  But we couldn't ship...

There's an overwhelming feeling that hits you when something like this happens.  A feeling that most of us will instinctively do anything to avoid.  The feeling of failure.

We cope with it and avoid the feeling with blame and anger.   I didn't go and point fingers or yell at anyone, but on the inside, I told myself that I wasn't the one who introduced the defect, that it was someone else who had messed it up for our team.

We did eventually figure it out and get it fixed, but by then it was already time for the next release, so we just rolled in the patch.  We were extra careful and disciplined about our testing and performance testing; we didn't want the same thing to happen again.

At first everything looked OK, but then we had a different kind of problem.  It was a latent failure that didn't manifest until the DBA ran a stats job on a table, which crashed our system... again.   But this time, it was my code, my changes, and my fault.

There was nobody else I could blame but myself...  I felt completely crushed.

I remember sitting in a dark meeting room with my boss, trying to hold it in.  I didn't want to cry at work, but that only lasted so long.  I sat there sniffling, while he gave me some of the best advice of my life.

"I know it sucks... but it's what you do now that matters.  You can put it behind you, and try to let it go... or face the failure with courage, and learn everything that it has to teach you."

Our tests didn't catch our bugs.  Our code took forever to change.  When we'd try to fix something, sometimes we'd break five other things in the process.  Our customers were scared to install our software.  And nothing says failure like bringing down production the last 3 times we tried to ship!

That's where we started...

After 3 years, we went from chaos, brittleness, and fear to predictable, quality releases.  We did it.  The key to making it all happen wasn't the process, the tools, or the automation.  It was facing our failures.  Understanding our mistakes.  Understanding ourselves.  We spent those 3 years learning and working to prevent the causes of our mistakes.