Monday, December 3, 2012

My book is live!

I finally got my book effort officially kicked off.  I'm planning on iterating through chapter development and getting the first early release done in January.  I'm also kicking off my new community group for Austin DIG and the Idea Flow project in January.  A lot to do!

My awesome friend Wiley also helped me design the book cover. :)

http://leanpub.com/ideaflow


Saturday, July 21, 2012

Breaking Things That Work

I went to NFJS today and attended a talk on "Complexity Theory and Software Development" by Tim Berglund.  It was a great presentation, but one idea in particular stuck out to me.  Tim described the concept of "thinking about a problem as a surface."  Imagine yourself surrounded by mountains everywhere you look - you want to reach as high as you can.  But from where you are, you can only see the potential peaks nearby.  Beyond the clouds and the mountain range obstructing your view might be the highest peak of all.


With agile and continuous improvement comes the concept of tweaking the system toward perfection.  But a tweak always takes me somewhere nearby - take a step, and if it wasn't up, step back to where I was.  But what if my problem is a surface, and the solution is some peak out there I can't see?  Or even if it is a peak I can see... if I only ever take one small step at a time, I'll never discover that other mountain...

Maybe sometimes we need to leap.  Maybe sometimes we need to break the things that are working just fine.  Maybe we should do exactly what we're not "supposed to do", and see what happens.  

Now imagine a still pool of water... I drop in a stone, and watch the rings of ripples growing outward in response.  I can see the reaction of the system and gain insight into the interaction.  But what if I cast my stone into a raging river? I can certainly see changes in waves, but which waves are in response to my stone?  It seems like I'll likely guess wrong.  Or come to completely wrong conclusions about how the system works.

With all the variance in the movement of the system - maybe it takes a big splash in order to improve our understanding of how it works?  Step away from everything we know and make a leap for a far away peak?

Here's one experiment.  We've noticed that the tests we write during development provide a different benefit when we write them vs. when they fail and we need to fix them later.  How we interact with them and think about them totally changes.  So maybe the tests you write first and the ones you keep shouldn't be the same tests?  We started deleting tests.  If we didn't think a test would keep us from making a mistake later, it was gone.  We worked at optimizing the tests we kept for catching mistakes.  But this made me wonder about a pretty drastic step - what if you designed all of your code with TDD, then deleted all your tests?  What if you only added back tests that were specifically optimized for the purpose of coming back to later?  If you had a clean slate, and you were positive your tests worked already, what would you slip in the time capsule to communicate to your future self?
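To make the idea concrete, here's a hypothetical sketch of the difference (the pricing rule and test names are invented for illustration) - a scaffolding test I'd delete versus the kind of "time capsule" test I'd keep:

```python
def apply_bulk_discount(quantity: int, unit_price: float) -> float:
    """Orders of 100+ units get a 10% discount on the whole order."""
    total = quantity * unit_price
    return total * 0.9 if quantity >= 100 else total


# Scaffolding test: restates the implementation, tells my future self nothing.
def test_discount_math():
    assert apply_bulk_discount(100, 1.0) == 90.0


# Time-capsule test: guards the boundary where a mistake is likely, and the
# failure message explains the intent to whoever breaks it two years from now.
def test_order_of_99_units_gets_no_discount_boundary():
    assert apply_bulk_discount(99, 1.0) == 99.0, (
        "Discounts start at 100 units, not 99. If this fails, the boundary "
        "condition in apply_bulk_discount probably changed by accident."
    )
```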



Friday, June 8, 2012

What Makes a "Better" Design?

An observation... there are order-of-magnitude differences in developer productivity, with a big gap in between - a place where some people get stuck and others make a huge leap.

Of the people I've observed, it seems like there's a substantial difference in the idea of what makes a better design, as well as in the ability to create more options. Those that don't make the leap tend to be bound to a set of operating rules and practices that heavily constrain their thinking. Think about how software practices are taught... I see the focus on behavior-focused "best practices," without thinking tools, as something that has stunted the learning and development of our industry.

Is it possible to learn a mental model such that we can evaluate "better" that doesn't rely on heuristics and best practice tricks? If we have such a model, does it allow us to see more options, connect more ideas? 

This has been my focus with mentoring - to see if I could teach this "model."  More specifically, a definition of "better" that means optimizing for cognitive flow.  But since it's not anything static, I've focused on tools of observation.  By building awareness of how the design affects that flow, we can learn to optimize it for the humans.

A "better" software design is one that allows ideas to flow out of the software, and into the software more easily.

Monday, June 4, 2012

Effects of Measuring

As long as measurements are used responsibly, not for performance reviews or the like, it doesn't affect anything, right?


It's not just about the measurements being used irresponsibly - the act of measuring affects the system, our understanding, and our actions.  Like a metaphor, metrics highlight certain aspects of the system but likewise hide others.  We are less likely to see and understand the influences on the system that we don't measure... and in software, the most important things, the stuff we need to understand better, are the things we can't really put a number on.

Rather than trying to come up with a measurement, I think we should try to come up with a mental model for understanding software productivity.  Once we have an understanding of the system, maybe there is hope for a measurement.  Until then, sustaining productivity is left to an invisible mystic art - and with the effects of productivity problems being so latent, by the time we make the discovery, it's usually way too late and expensive to do much about it.

Understanding productivity, I believe, is WAY more worth the investment than measuring it.  A good starting point is looking at idea flow.

Thursday, May 24, 2012

Humans as Part of the System

I think about every software process diagram that I've ever seen, and every one seems to focus on the work items and how they flow - through requirements, design, implementation, testing and deployment.  Whether the cycles are short or long, with discrete handoffs or a collapsed 'do the work' stage, the work item is the centerpiece of the flow.

But then over time, something happens.  The work items take longer, defects become more common and the system deteriorates.   We have a nebulous term to bucket these deterioration effects - technical debt.  The design is 'ugly', and making it 'pretty' is sort of a mystic art.   And likewise keeping a software system on the rails is dependent on this mystic art - that seems quite unfortunate.  So why aren't the humans part of our process diagram - if we recognized the underlying system at work, could we learn how to better keep it in check?

What effect does this 'ugly' code really have on us?  How does it change the interactions with the human? What is really happening?

If we start focusing our attention on thinking processes instead of work item processes, how ideas flow instead of how work items flow... the real impact of these problems may actually be visible.  Ideas flow between humans.  Ideas flow from humans to software.  Ideas flow from software to humans.  What are these ideas?  What does this interaction look like?

Mapping this out even for one work item is enlightening.  It highlights our thinking process.  It highlights our cognitive missteps that lead us to make mistakes.  It highlights the effects of technical debt.  And it opens a whole new world of learning.

Thursday, April 12, 2012

Addressing the 90% Problem

If I were to try to measure the time that I spend thinking, analyzing and communicating versus actually typing in code, how much would it be?  If I were to guess, I'd say at least 90%.  I wonder what other people would say?  Especially without being biased by the other opinions in the room?

We spend so much of our time trying to understand... even putting a dent in improvement would mean -huge- gains in productivity. So...

How can we improve our efficiency at understanding?
How can we avoid misunderstanding, forgetting, or lack of understanding?
How can we improve our ability and efficiency at communicating understanding?
How might we reduce the amount of stuff that we need to understand?

These are the questions that I want to focus on... it's where the answers and solutions will make all the difference.

Mistakes in a World of Gradients

I've been working on material for avoiding software mistakes, and have been searching for clarity on how to actually define "mistake."

I've been struggling with common definitions that are very black and white about what is wrong and considered "an error" or "incorrect".  Reality often seems more of a gradient than that, and likewise avoiding mistakes should maybe be more a matter of avoiding poor decisions in favor of better ones?

I like this definition better, because it accounts for the gradient in outcome, without missing the point.

"A human action that produces an incorrect or inadequate result. Note: The fault tolerance discipline distinguishes between the human action (a mistake), its manifestation (a hardware or software fault), the result of the fault (a failure), and the amount by which the result is incorrect (the error)."

 GE Russell & Associates Glossary - http://www.ge-russell.com/ref_swe_glossary_m.htm




Tuesday, April 10, 2012

What is a Mistake?

I've been trying to come up with a good definition for "mistake" in the context of developing software.

It's easy to see defects as caused by mistakes, but what about other kinds of poor choices?  Choices that led to massive work inefficiencies?  And what if you did everything you were "supposed to" do, but still missed something - is that caused by a mistake?  What if the problem is caused by the system and no one person is responsible, is that a mistake?

All of these, I think, should be considered mistakes.  If we look at the system and the cause, we can work to prevent them.  The problem with the word "mistake" is that it's quickly associated with blaming whoever is responsible for whatever went wrong.  "Mistake" triggers fear, avoidance, and guilt - which is the exact opposite of the kind of response that can lead somewhere positive.

Here's the best definition I found from dictionary.com:

"an error in action, calculation, opinion, or judgment caused by poor reasoning, carelessness, insufficient knowledge,etc. "

From this definition, even if you failed to do something that you didn't know you were supposed to do (having insufficient knowledge), it's still a mistake.  Even if it was an action triggered by interacting parts of the system, and no one thing, it's still a mistake.

But choices that cause inefficiencies?  That seems to fall under the gradient of an "error in action or judgment".  If we could have made a better choice, was the choice we made an error?  Hmm.

Sunday, April 8, 2012

A Humbling Experience

About 7 years ago, I was working on a custom SPC system project.  Our software ran in a semiconductor fab, and was basically responsible for reading in all the measurement data off of the tools and detecting processing errors.  Our users would write thousands of little mini programs that would gather data across the process, do some analysis, and then if they found a problem, could shut down the tool responsible or stop the lot from further processing.

It was my first release on the project. We had just finished up a 3 month development cycle, and worked through all of our regression and performance tests.  Everything looked good to go, so we tied a bow on it and shipped it to production.

That night at about three in the morning, I got a phone call from my team lead. And I could hear a guy just screaming in the background.  Apparently, we had shut down every tool in the fab.  Our system ground to a screeching halt, and everyone was in a panic.  

Fortunately, we were able to roll back to the prior release and get things running again.  But we still had to figure out what happened.  We spent weeks verifying configuration, profiling performance, and testing with different data.  Then finally, we found a bad slowdown that we hadn't seen before.  Relieved to find the issue, we fixed it quickly, and assured our customers that everything would be ok this time.

Fifteen minutes after installing the new release... the same thing happened.

At this point, our customers were just pissed at us.   They didn't trust us.   And what can you say to that? Oops?

We went back to our performance test, but couldn't reproduce the problem.  And after spending weeks trying to figure it out, and about 15 people on the team sitting pretty much idle, management decided to move ahead with the next release.  But we couldn't ship...

There's an overwhelming feeling that hits you when something like this happens.  A feeling that most of us will instinctively do anything to avoid.  The feeling of failure.

We cope with it and avoid the feeling with blame and anger.   I didn't go and point fingers or yell at anyone, but on the inside, I told myself that I wasn't the one that introduced the defect, that it was someone else that had messed it up for our team.

We did eventually figure it out and get it fixed, but by that time it was already time for the next release, so we just rolled in the patch.  We were extra careful and disciplined about our testing and performance testing - we didn't want the same thing to happen again.

At first everything looked ok, but then we had a different kind of problem.  It was a latent failure that didn't manifest until the DBA ran a stats job on a table, which crashed our system... again.  But this time, it was my code, my changes, and my fault.

There was nobody else I could blame but myself...  I felt completely crushed.

I remember sitting in a dark meeting room with my boss, trying to hold it in.  I didn't want to cry at work, but that only lasted so long.  I sat there sniffling, while he gave me some of the best advice of my life.

"I know it sucks... but it's what you do now that matters.  You can put it behind you, and try to let it go... or face the failure with courage, and learn everything that it has to teach you."

Our tests didn't catch our bugs.  Our code took forever to change.  When we'd try to fix something, sometimes we'd break five other things in the process.  Our customers were scared to install our software.  And nothing says failure more than bringing down production the last 3 times that we tried to ship!

That's where we started...

After 3 years we went from chaos, brittleness and fear to predictable, quality releases.  We did it.  The key to making it all happen, wasn't the process, the tools, or the automation.  It was about facing our failures.  Understanding our mistakes.  Understanding ourselves.  We spent those 3 years learning and working to prevent the causes of our mistakes.

Wednesday, March 28, 2012

What we REALLY Value is the Cost...

Today, someone in the community mentioned the idea of measuring "value points".  And the light went on... could this finally highlight our productivity problems?  It could be a totally dead-end idea, but it's a hypothesis that needs testing.


When I thought "value points", I imagined a bunch of product folks sitting around playing planning poker judging the relative value of features. Using stable reference stories and choosing whether one story was more or less valuable than others. Might seem goofy, but its an interesting idea. My initial thought was that this would be way more stable over time than cost since cost varies dramatically over the lifetime of a project. And if it is truly stable, it might provide the missing link when trying to understand changes in cost over time.

For this to make sense, you've got to think about long-term trends.  Suppose our team can deliver 20 points of cost per sprint.  But our codebase gets more complex, bigger, uglier and more costly to change.  Early on, we can do 10 stories at 2 points each.  But 2 years later, very similar features on the more complex code base require more effort to implement, so maybe these similar stories now take 5 points each and we can do 4 of them.  Our capacity is still 20 story points, but our ability to deliver value has REALLY decreased.

We often use story points as a proxy for value delivered per sprint, but think about that... We get "credit" for the -cost- of the story as opposed to the -value- of the story.   If our costs go up, we get MORE credit for the same work!

How can we ever hope to improve productivity if we measure our value in terms of our costs? How can we tell if a story that had a cost of 5 could have been a cost of 1? Looking at story points as value delivered makes the productivity problems INVISIBLE. It's no wonder that it's so hard to get buy-in for tackling technical debt...

What if we aimed for value point delivery? If you improved productivity, or your productivity tanked, would it actually be visible then? On that same project, with 20 cost points per sprint, suppose that equates to 10 value points early on, and 4 value points later.  Clearly something is different. Maybe we should talk about how to improve? Productivity, anyone? Innovation?
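To make the arithmetic concrete, here's a tiny sketch using the hypothetical numbers from above (nothing here is a real measurement):

```python
# Toy illustration: "velocity" in cost points stays flat while value delivered drops.
sprints = {
    "year 1": {"stories": 10, "cost_points_each": 2, "value_points_each": 1},
    "year 3": {"stories": 4,  "cost_points_each": 5, "value_points_each": 1},
}

for label, s in sprints.items():
    cost_velocity = s["stories"] * s["cost_points_each"]     # what the burndown shows
    value_delivered = s["stories"] * s["value_points_each"]  # what the customer gets
    print(f"{label}: {cost_velocity} cost points, {value_delivered} value points")

# year 1: 20 cost points, 10 value points
# year 3: 20 cost points, 4 value points   <- productivity problem now visible
```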

At least it would seem to encourage the right conversations...

Thursday, March 8, 2012

Does Agile process actually discourage collaboration and innovation?

Before everyone freaks out at that assertion, give me a sec to explain. :)

In the Dev SIG today, we were discussing our challenges with integrating UX into
development, and had an awesome discussion. I think Kerry will be posting some
notes. Most of the discussion, though, went to ideas and challenges with creating and understanding requirements, and how the processes that we use to scale destroy a lot of our effectiveness. The question we all left with, via Greg
Symons, was how do we scale our efforts while preserving this close connection
in understanding between the actual customer and those that aim to serve them?

In thinking about this, our recent discussions about backlog, and recalling past projects, I realized there are some crucial skills that we seem to have largely lost. In the days of waterfall, we were actually much more effective at them.

My first agile project was with XP, living in Oregon, and fortunate enough to
have Kent Beck provide a little guidance on our implementation. Sitting face to
face with me, on the other side of a half wall cube, was an actual customer of
our system, who had used it and things like it for more than 20 years. I could
sit and watch how they used it, ask them questions, find out exactly what they
were trying to accomplish and exchange ideas. From this experience I came away
with a great appreciation for the power of a direct collaborative exchange
between developers and real customers.

My next project was waterfall. One of the guys on my team was wickedly smart; his background was mainly RUP, and he just -loved- requirements process. What he
taught me were techniques for understanding, figuring out the core purpose,
figuring out the context of that purpose, and exploring alternatives to build a
deeper understanding of what a user really needs. Some of these were
documentation techniques, and others were just how you ask questions and
respond. I learned a ton. On our team, the customers would make a request, and
the developers were responsible for working with the customers to discover the
requirements.

With Scrum-esque Agile process, this understanding process is outsourced to the
product owner. As we try to scale, we use a product owner to act as a
communication proxy, and with it create a barrier of understanding between
developers and actual customers. Developers seldom really understand their
customers, and when given the opportunity to connect with them, the number of discoveries of all the things we've been doing that could have been so much better is astounding.

I've done agile before on a new project, sitting in the same room with our real
users, understanding their problems, taking what they asked for and figuring out
what they needed, and also having control of the architecture, design, interface
and running the team to build it - the innovation of the project and what we built was incredible. Industry cutting-edge stuff was just spilling out of
everything we did. And it all came out of sitting in a room together and
building deep understanding of both the goals, and the possibilities. This was
agile with no PO proxy. The developers managed the backlog, but really wrote
very little down... we did 1 week releases.

Developers seldom have much skill in requirements these days, and are often handed a specification or a problem statement that is usually still quite far from the root problem.

In building in this understanding disconnect, and losing these skills, are we
really just building walls that prevent collaboration and tearing down our
opportunities to innovate?

Manufacturing of a Complex Thought

Imagine that the software system is a physical thing. Its shape isn't really concretely describable; it's like a physical version of a complex thought. All of the developers sit in a circle, poking and prodding at the physical thought - adding new concepts, and changing existing ones.

Just like we have user interfaces for our application users, the code is the
developer's interface to this complex thought. Knowledge processes and tools
help us to manipulate and change the thought.

If I want to make a change, I need to first understand enough of the thought to
know how it would need to change. If I can easily control and manipulate parts
of the thought, and easily observe the consequences, it's easier and faster to
build the understanding I need. Once I understand, I can start changing the
ideas, and again if I misunderstood something, it would be nice to know as early
as possible what exactly my mistake was, so that I can correct my thinking. If
there were no misunderstandings, the newly modified complex thought is then
complete.

In order to collaboratively work on evolving this complex thought, we must also
maintain a shared understanding - more brains involved increases the likelihood
of misunderstandings and likewise mistakes.

With that model of development work, think about all of the ideas and thinking that have to flow through the process in order to support the creation of this physical idea. Inventory in this context is a half-baked idea, either sitting on the shelf or currently in our minds being processed. These ideas are what we manufacture, but each idea has to be woven into a single complex thought - so the tools we use to control and observe the thought, the clarity and organization of the thought, and the size of the thought all have a massive impact on our productivity.

The tools are not the value, the ideas that get baked into this complex thought
are. All of the tooling is just a means to an end. We should strive to do just
enough to support the manufacturing of the idea.

If you think about creating tests from the mindset of supporting the humans and
these knowledge processes, a lot of what we do with both automated and manual
testing can clearly be seen as waste. An idea that is clear to everyone, for
example, is not one likely to cause misunderstandings and mistakes. We should
first aim to clarify the idea to prevent these misunderstandings. We should
then aim for controllable and observable, as these characteristics allow us to
come to understand more quickly. And when misunderstandings and mistakes are
still likely, we should then use alarms to alert us when we've made a mistake. 
False alarms quickly dilute the effectiveness of the useful alarms... so taking care to point the human clearly to the mistake, without raising unnecessary false alarms, is what makes tests effective.
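Here's a rough, hypothetical illustration of that difference - a false-alarm-prone test versus an alarm worth keeping (the function and names are invented for the example):

```python
def summarize(measurements: list[float]) -> dict:
    """Reduce a set of measurements to the state we care about."""
    return {"count": len(measurements), "mean": sum(measurements) / len(measurements)}


# False-alarm prone: pinned to incidental structure (key order, exact repr),
# so harmless refactoring breaks it and the team learns to ignore red builds.
def test_summary_exact_repr():
    assert str(summarize([1.0, 3.0])) == "{'count': 2, 'mean': 2.0}"


# Alarm worth keeping: asserts the behavior that matters and points the human
# at the likely mistake when it fires.
def test_mean_reflects_all_measurements():
    result = summarize([1.0, 3.0])
    assert result["mean"] == 2.0, (
        "The mean should include every measurement; if this fails, summarize() "
        "is probably dropping or double-counting data points."
    )
```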

Now think about things like code coverage metrics in this light. This metric
tends to encourage extreme amounts of waste. We forget all about the humans and
fill our systems with ever-ringing false alarms. We tend to only think about
tests breaking in our CI loop, but their real effect is their constancy in
breaking while we try to understand and modify this shared complex thought. 
With our test-infected mindsets, we quickly bury the changeability of our ideas
in rigidity, and lose the very agility that we are supposedly aiming for.

Friday, February 17, 2012

Fighting my way to agility - Part 5

Shrinking Release Size

Now that our batches were shrinkable, we could feasibly shrink our iteration and release sizes.  The ultimate test of whether you are -really- shippable is to actually ship.  If we could actually get our software into production, we could put the risk behind us for the changes we'd made so far.  If we didn't actually ship, it was hard to -really- know whether we were shippable.  We were also introducing more risk to production at a time - and production defects would usually take longer to diagnose and fix.

But our customers were just starting to trust us, and were still doing about a month of testing after our testing before they would feel safe enough to install something.   We asked if we could do releases more often, and the answer was pretty much... 'hell no.'  Rather than give up on trying, we figured out why there was so much pushback.  And solving those problems, whatever they might be, became our top priority.

We learned about all kinds of problems that we didn't even know we had.  Some were ticket requests that had been sitting in the backlog for years.  Others were just things that had to be done for every release - not really a big deal until we asked for them to be done a whole lot more often.  There was a whole lot of pain downstream that we weren't even aware of.  And most of it was really just our problems rolling downhill - and completely within our power to fix.

Reducing the Transaction Cost of a Release

There are lots of different kinds of barriers to releasing more often.  Regardless of what yours are, getting the right people in the room and working together to understand the whole system goes a long way.  A lot of the things that seem like hard timeline constraints actually aren't.  Challenge your assumptions.

So how could we relieve our customer's pains?

"Everytime we do a release, we lose some data" - we had no idea.  The system was designed to do a rolling restart, but there was a major problem.  During the roll over we had old component processes communicating with new ones.  In general the interfaces were stable, but there was still subtle coupling in the data semantics that caused errors.  Rather than trying to test for these conditions, we instead changed the failover logic so all of the data processing would be sticky to either the old version or new version and could never cross over.   This prevented us from having to even think about solving cross-talk scenarios.  We also created a new kind of test that mimicked a production upgrade while the system was highly live.  This turned out to be a great test for backward compatibility bugs as well.

"Its not safe to install to production without adequate testing, and we can't afford to do this testing more often" - Whether they found bugs or not while testing, was almost irrelevant.  Unless they knew what was tested and felt safe about it, they wouldn't budge.  They were doing different testing, with different frameworks, tools and people than we were, and unless it was done that way, it was a no go.  So we went to our customer site.  We learned about their fears and what was important to them.  We learned how they were testing and wanted testing to be done.   

We shared our scenario framework with them, code and all, and then worked with them to automate their manual tests in our framework.  We made sure their tests passed (and could prove it), before we gave them the release.  And likewise, we adopted some of their testing tools and techniques and at release time gave them a list of what we had covered.  We also started giving them a heads up about what areas of the system were more at risk based on what we had changed so they didn't feel so much need to test everything.  After we helped them to reduce their effort and just built a lot more trust and collaborative spirit with our customers, this was no longer an issue.

Editing the Scrum Rule Book - What process tools DID we actually use?

Since we weren't predictable and weren't using time boxes, we also threw out story point estimation, velocity and any estimation-based planning activities.  The theory goes that you improve your ability to estimate and therefore your predictability by practicing estimation.  I think this is largely a myth.

"Predictions don't create Predictability." This is one of my favorite quotes from the Poppendieck's book, Implementing Lean Software Development.   You create predictability by BEING predictable.  The more complex and far in the future your predictions, the more likely you are to be wrong - and way wrong.  So wrong that you are likely to make really bad decisions under the illusion that your predictions are accurate.   Its an illusion of control when control doesn't actually exist.  You can't be in control until you ARE controllable.   Predictability doesn't come from any process, its an attribute that exists (or doesn't) in the system.  Uncertainty is very uncomfortable.  But unless you face reality and focus on solving the root problem, nothing is ever really likely to change.

Burn downs we did use, but not until we were closer to wrapping up a release.  This was helpful in answering the 'are we done yet?' questions, the timing of which we used to synchronize other release activities.  There were tickets to submit, customer training to do, customer testing to coordinate, etc.  We tried doing burn downs for the whole sprint, but since our attempted estimates were so wildly inaccurate, it wasn't helpful - it was more harmful as input into any decisions.  The better decision input was that we really had no idea, but were trying to do as little as possible so that done would come as soon as possible.  If a decision had to be made, we would try to provide as much insight as we could to improve the quality of the decision, without hiding the truth of the uncertainty.  Although management never liked the answers, our customers were unbelievably supportive and thankful for the truth.

Saturday, February 11, 2012

Fighting my way to agility - Part 4

Shrinking your Iterations when you have a Painful Test Burden

The obvious thing, which we quickly jumped on, was automating our existing manual test cases.  They were mostly end-to-end, click-through-the-UI-and-verify tests.  We automated most of them with QTP, and tried to run them in our CI loop.  This turned out to be a terrible idea… or rather a horrible nightmare.  Not only were they terribly painful to maintain and always breaking - we had 2 other problems: our developers started totally ignoring the CI build since it was always broken, and the tests weren't even catching most of our bugs.  It was a massive cost, with very little benefit.  We ended up throwing them all away and going back to manual.  Don't get me wrong, this sucked too.  We instead focused our efforts on creating better tests.


We started looking more deeply at where our bugs were coming from, and why we were making mistakes.  Our old manual test suite was built by writing manual tests for each new feature that was implemented.  The switch in attention to finding ways to stop our actual bugs, instead of just trying to cover our features, was key to really turning quality around.   We ultimately created 3 new test frameworks.  

Performance simulation framework -  The SPC engine had some pretty tight performance constraints that seemed to unexpectedly go haywire.  Trying to track down the cause of the haywire performance was a hell of a challenge.  The more code that had changed, the harder it was to track down.  We also discovered that our existing performance test that used simulated data didn't find our problems.  The system had very different performance characteristics with highly-variable data.  So we got a copy of each production site and made a framework that would reverse engineer outputs as inputs and replay them through the system.  We would run a test nightly that did this and send us an email if it detected 'haywire'.  We actually used our own SPC product and created an SPC chart against our performance data to do it ;)  Catching performance issues immediately was a complete game-changer.  It also gave us a way to tune performance, and as a bonus, some handy libraries for testing our app.
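As a rough illustration of the idea (the program names, numbers, and threshold are made up - our real implementation used our own SPC charting), the nightly check amounted to something like this:

```python
from statistics import mean, stdev

def haywire_programs(history: dict[str, list[float]], latest: dict[str, float],
                     sigmas: float = 3.0) -> list[str]:
    """Flag analysis programs whose latest nightly run time is outside control limits."""
    flagged = []
    for program, times in history.items():
        if len(times) < 2 or program not in latest:
            continue
        center, spread = mean(times), stdev(times)
        upper = center + sigmas * spread
        if latest[program] > upper:  # only slowdowns matter for tool timeouts
            flagged.append(program)
    return flagged

# Example: a program that has crept well past its usual ~120ms gets flagged
# for the "haywire" email.
history = {"etch_cd_check": [118, 121, 119, 122, 120, 117]}
latest = {"etch_cd_check": 410}
print(haywire_programs(history, latest))  # ['etch_cd_check']
```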

SPC 'fingerprinting' Tests -  Remember all the bandaids and fragility I talked about?  It was insanely hard not to break things.  We used the performance test tooling to replay the outputs as inputs to the system, but then we recorded the new output as a fingerprint file.  For every single chart in every production environment, we generated one of these fingerprinted scenarios.  Then as the system changed, we would compare the old fingerprints with the new ones and fail if there were differences.  Even with 8000 generated tests, it was easy to flip the bar green for expected changes by copying all the files in one folder to another.  The challenge was in telling expected changes apart from accidental ones.  Again, this greatly increased in complexity with an increased amount of change.  If the change was relatively small, you could look at a sample of files and see whether the behavior was as expected or not.  We not only caught a TON of bugs this way, but also found cases where users asked us to implement changes that would break other use cases, and we were able to alert them.
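Conceptually, the comparison step was a golden-master diff - something like this sketch (the paths and file extension are invented for illustration):

```python
from pathlib import Path

def changed_fingerprints(expected_dir: str, actual_dir: str) -> list[str]:
    """Return the charts whose recorded behavior differs from the saved fingerprints."""
    changed = []
    for expected in Path(expected_dir).glob("*.fingerprint"):
        actual = Path(actual_dir) / expected.name
        if not actual.exists() or actual.read_text() != expected.read_text():
            changed.append(expected.name)
    return sorted(changed)

# e.g. changed_fingerprints("fingerprints/expected", "fingerprints/actual")
# An empty list means no chart's behavior changed; flipping the bar green for an
# intended change is just copying the actual files over the expected ones.
```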

Integration scenario tests - The other area where we had a lot of bugs was with scenarios that crossed system boundaries.  For example, SPC would put a lot (container of material) on hold, a user would investigate and then release the hold.  These scenarios got quite complicated when the lot was split up, combined with other material, and processed in different places.  We had to track down all the material that might be affected and put all of it on hold.  With remote calls failing on occasion and other activities happening concurrently, there were a lot of ways for things to go wrong.  Anyway, we worked with the testers for the other system to create a framework with which we could orchestrate and verify state across systems.  For our apps, rather than going through the UI, we had an internal test controller that we would use to drive the app from the inside.  To verify state, we would internally collect and dump all the critical state information to an XML file that we would diff to detect failure.  We had about 150 or so end-to-end integration tests like this, and they had a much lower maintenance cost.
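A bare-bones sketch of what one of these scenarios looked like conceptually (the controller API here is hypothetical, not our real framework):

```python
import difflib

def run_hold_release_scenario(controller) -> str:
    """Drive one cross-system scenario through the internal test controller
    and return the critical state dump as XML text."""
    controller.create_lot("LOT-42")
    controller.trigger_spc_violation("LOT-42")   # should place the lot on hold
    controller.release_hold("LOT-42")
    return controller.dump_state_xml()

def verify_against_expected(actual_xml: str, expected_xml: str) -> list[str]:
    """Return a human-readable diff; an empty list means the scenario passed."""
    return list(difflib.unified_diff(
        expected_xml.splitlines(), actual_xml.splitlines(),
        fromfile="expected", tofile="actual", lineterm=""
    ))
```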

The Fate of Our Manual Tests

Once we had much better coverage of the error-prone parts of the system, running all of these tests became a lot less important.  A few we converted to scenario or unit tests, but for the most part these all stayed manual.  It was still expensive to run them all, but we used another strategy to reduce that burden.  We categorized them all in a Sharepoint list, then at release time we would run only the tests we thought had a chance of finding a bug based on what we had changed that release.  With all the other testing, this was good enough.  We found a couple of bugs with them, but at this point, they were almost always green.
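The selection itself was simple - conceptually something like this (the categories and test names are invented for illustration):

```python
# Every manual test is tagged with the areas it exercises; at release time we
# only schedule the ones that touch something we actually changed.
manual_tests = {
    "chart_limits_editor": {"ui", "spc_config"},
    "hold_release_flow":   {"holds", "integration"},
    "nightly_summary":     {"reporting"},
}

def tests_to_run(changed_areas: set[str]) -> list[str]:
    return sorted(name for name, areas in manual_tests.items()
                  if areas & changed_areas)

print(tests_to_run({"holds", "spc_config"}))
# ['chart_limits_editor', 'hold_release_flow']
```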

But the Work was Still TOO Big!

We still suffered some pretty massive productivity problems, but at this point we were back in control.  When we changed stuff it wasn't so risky.  But it was still too time-consuming.  So we focused our team on how we could get the most productivity gain for our effort.


We analyzed both our past work for where we were spending time, and where future code changes needed to be.  The UI layer was frightful and always time-consuming to change, but we rarely had to change it.  Effort here wouldn't actually buy us more productivity.  Most of our changes were in the core SPC pipeline, and it was a major bear to understand and change.  Although the effort level was high, we all knew we would see the payoff.  The impact we had on productivity from rewriting the SPC engine was HUGE.

We had all kinds of awesome new features we were able to do, including supporting SPC for a completely different type of facility.  Our productivity exploded.  We could never have done these features on the old system - the effort was probably beyond the cost of the rewrite.  Our users were also thrilled because the behavior was so predictable - even for complex scenarios.  Their productivity improved in designing and maintaining SPC charts!

As we got better and better control of our productivity, we kept shrinking our iteration sizes, and were finally able to have consistent quality.  It wasn't perfect, but it was good enough to earn back the trust of our customers.   Though we never did end up using timeboxes, at this point, our sprints were 'boxable'. :)

Fighting my way to agility - Part 3

Breaking down work

On a large complex hairy project, work just gets big.  Our work was big.  And breaking down work is sort of an art.  The think time was usually 10x the coding time, and the test time would explode by just changing 1 line in the SPC engine.   When we tried to force big work into a small box, the team kept trying to slice it in ways that didn't independently deliver anything useful - which you're not supposed to do.

A lot of times in a breakdown we'd get a requirements story, a design story, etc… and not because the team was just hung up on waterfall - they were trying to figure out how to break down 'think'.  After all, how do you break down a 1-line code change story when the scope of the work is really thinking through and testing the implications of changing that line of code?  Clarifying the exact expectations for a specific SPC behavior might take the developer and customer working together for a couple of weeks.  Our stories might have looked 'wrong' to some, but when these cases came up, breaking down 'think' work this way was helpful.  A more problematic breakdown attempt had stories where checking in the code for one story part would leave the system unshippable.  When this would come up, we would work really hard to find another way - which usually meant creating a story to find another way.  In either case, all of the various story parts that didn't independently deliver anything useful were tracked under an epic that did.

In the process of trial and error with breaking down work, we learned a lot about what mattered and what really didn't.  

Lessons Learned on Breaking Down Work

Teach over Force - Given all the pain we had, rather than making a box and forcing work to fit, I would focus on teaching work-breakdown skills.  Whether your box is fixed or not, you need to solve this problem.  You can aim for a box, but don't mutate the work to a degree that it violates the constraints below.  If you can't come up with a way to do that, just leave it as a big item.

No Invisible WIP - If you have stories that don't deliver anything usable to the customer (e.g. my requirements story), track them under an epic.  Not because of some Agile holy gospel, but because if you don't, it HIDES work in progress.  You have a partial investment in work, and over time you have thoughts rotting away, and code that may not even work sitting in your code base.  If it's not ALL done and you have to even think about the set of changes again, it's WIP.  Keep all your WIP in front of you, all of the time.  It should remind you of everything that is partway done.  Push to finish what you started before starting more stuff.  Push to get back to shippable as soon as you can.  Don't let yourself feel like the work is all done when you still have WIP.  It's not done.  That looming manual testing that tells you what's broken in your code?  The integration testing with that other system?  Your stories are still in progress, and if you hide that fact, it's easy to feel ok about starting more work.

If you are not shippable, you still have WIP.  This is probably the most critical thing I see missed by people in any process - invisible, hidden WIP.  I think it's one of the main reasons people don't even see what would otherwise be a glaringly obvious problem.

Vertical Slicing - The typical strategies of trying to break work into vertical slices (a sliver of usable functionality traversing all layers of the system) and finding the smallest possible thing that can work are generally good advice.  Try to break down the implementation of the work by iterating over even smaller vertical slices, even if each one isn't useful on its own.  You learn a lot in the process, which tends to reduce rework.

Don't abandon your requirements format - "User story" format can be helpful, but it's not that important.  The key is communication.  Whatever you need to do to communicate the purpose and expectations effectively - use what works.  If you've got something that is working well for you, don't change it.  Don't assume stories are going to work better for you.  After some chaos and thrashing, we basically settled on a 'story statement' added to our existing, really formal requirements document.  Distributed communication, cultural barriers, and complex requirements all make communication REALLY hard.

Fighting my way to agility - Part 2

Trying to Fix the Timebox... and Failing

We tried to fix the time, vary the scope, and be shippable at the end of the sprint.  Well, that didn't really work.  How do you run 1200 manual tests in the sprint?  What if I just have a month of code integrated that other people built stuff on top of, and I'm almost done but not quite?  What if the app doesn't work at the end?  What if the smallest possible work size is just big?  If we have to be shippable at the end of the sprint, and the scope of getting to quality is highly variable, how could we possibly fix the timebox?

I've seen people try to do testing sprints or hardening sprints or integration sprints, but that really fundamentally destroys one of the core principles of Scrum - being shippable at the end of every sprint.  Shippable as in - you really could put what you just coded into production.  As soon as you make it ok to throw your testing into another sprint, you ignore that fundamental constraint.  And even worse, it masks a HUGE looming problem that really should be your priority to solve.

If you do another sprint with your testing all delayed, you then violate another core principle - you shouldn't build on top of BROKEN code.  It was massively harder to diagnose bugs when there was more code that hadn't been proven yet.  The relationship was roughly 4 to 1: doubling the amount of code roughly quadrupled the time of our test/fix cycle.  And it didn't seem to take much more than that to go off the test/fix cliff of no return.
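As a back-of-the-envelope model of what that felt like (the quadratic growth is an empirical guess from our own releases, not a general law):

```python
def stabilization_weeks(batch_size: float, k: float = 0.5) -> float:
    """Toy model: test/fix time grows with the square of the unproven code in a batch."""
    return k * batch_size ** 2

one_big_batch = stabilization_weeks(4)          # 8.0 weeks of test/fix chaos
two_small_batches = 2 * stabilization_weeks(2)  # 4.0 weeks total for the same code
print(one_big_batch, two_small_batches)
```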

Both of these are way more important not to violate than time boxes.  Yet I see soooo many teams keep the time boxes and throw out the other 2.  Until you can actually be predictable, I don't think the timeboxes buy you much anyway.  Something had to give, and we voted for the box to be it... with a goal of working on becoming more 'boxable'.

So we scheduled our 'sprint' for what we thought was about a month-ish of work.  Then, however long it took to test the app, get it all working, and finish anything that we couldn't pull out, we just took that time and didn't work on anything new.  It was done whenever we were back to shippable.  Nobody was allowed to work on forward development.  Nobody did any refactoring that might add more risk.  Get the app back to being green.

It was easy to fall into a trap of worrying about inefficiencies, especially as test/fix cycles dragged out and the business wanted their features.  We tried creating a release branch and having a few people continue on forward development while the rest of us focused on stabilizing.  This idea blew up in our faces.  We were only really thinking about the penalty of managing the branch and merging, and didn't foresee the rest of it at all.

Eventually we got the release out the door and planned out another one-month-ish sprint.  Then we got to our test/fix phase, but there was an even bigger batch of work in it!  The last stabilization had taken ~2 months, and so we had actually done about twice as much work.  The difference in the time that it took to troubleshoot defects in the larger set of changes was HUGE.  The troubleshooting complexity shot through the roof.  It was another 6 months before we were able to get the changes out the door.


Forget getting a head start, and have your spare capacity work on the problem right in front of you - it's too damn hard to get code out the door.  What good are a bunch of coded features that you can't ship?


Fighting my way to agility - Part 1

Real Projects Have Hard Problems


The challenge with principles is that it's really hard to see how they apply in a specific context.  The idea of the principle may make perfect sense.  How the software process should work in theory may make perfect sense.  But the messy world of software reality isn't so kind.

Concrete, messy, real-world experiences are awesome to learn from because they give us insights on how to take abstract truths (principles) and map them to reality.  From all of these examples, we can distill patterns that give us ideas on how to map solutions between different problems.  It's far too easy to make a really bad decision if you over-simplify a very complex world.  Real software is just hard.


For anyone who reads this, please share your stories.  Whether they end in victory or not, they're the guts of our learning.  By sharing them we broaden all of our experiences, and improve all of our abilities to tackle the problems we face.

So here's one of my stories.  This is the story of a messy real world project in which I personally fought my way to agility.

My Project

A semiconductor factory SPC system responsible for reading and aggregating data coming off of the tools to detect problems (and shut down whatever was causing the problem).  High volume, a highly variable incoming data stream, user-defined analysis programs and near real-time charts.  A program would usually gather a mix of historical data and current data, and do a bunch of math on the results to make a decision.  If we took too long to make that decision, the tool would time out and shut down.  Deployed in a 24/7 environment with 1 downtime per year.

Starting point (for me, 2005)

Scary reality. :) 500k lines of code, ~10 years old.  The web server was home-grown, the UI flow-control was done with exceptions (seriously!), and the core SPC engine had bandaid upon bandaid of hack fixes, so that it was next to impossible to make a change without having some unintended side-effect on another use case.  1200 formal manual test cases all had to be run and made green for every release.

Half of the team in Austin, half in India, one in Germany, and customers in both Austin and Germany each with different data formats, problems and strategies.  Interestingly, the team started with 2 week iterations, but as the transaction cost of delivery went up with the growing test burden, the batches kept getting larger to compensate.   When I got there, they were doing ~4 months of development and a couple months of test/fix chaos after that. 

They had recently been having major performance problems and didn't know how to solve them.  So the usual 2 months of chaos was at 2 months and counting, with no end in sight.  I moved from an XP shop building financial transaction/banking software, but they had really hired me for my Oracle performance wizardry skills.  When I got there it was even more tragic: they had also been having major quality problems.  They had just rolled back the production release - again.  Our customers were literally scared to install our software.  We were sitting on the tipping point of being completely unable to release, with critical defects that we couldn't diagnose.

End point (3 years later)


Consistently delivering releases every ~2 months with high quality, stable performance, and predictable behavior.  Deployed 2 more installations, one of them a totally different type of processing facility.  Happy customers.  They even threw a party for me when I went to Germany!

Going from A to B - Was it Magic?

It was pain.  A lot of pain.  A lot of mistakes.  A lot of learning and hard work.

We accidentally shut down every tool in the fab (twice).  We accidentally had a sprint whose test-and-fix churn dragged out for a year before we saw light at the end of the tunnel (way scary).  We threw away massive investments in test automation with QTP and started over.

We were aiming to do Scrum.  We all read a Scrum book, and Bryan (still my awesome boss) was then our manager and Scrum Master. 

It took a long and hard journey to accomplish real change.