Friday, February 17, 2012

Fighting my way to agility - Part 5

Shrinking Release Size

Now that our batches were shrinkable, we could feasibly shrink our iteration and release sizes.  The ultimate test of whether you are -really- shippable is to actually ship.  If we could actually get our software into production, we could put the risk behind us for the changes we'd done so far.  If you don't actually ship, it's hard to -really- know whether you're shippable.  By not shipping, we were also introducing more risk at a time into production - and production defects usually took longer to diagnose and fix.

But our customers were just starting to trust us, and were still doing about a month of testing after our testing before they would feel safe enough to install something.  We asked if we could do releases more often, and the answer was pretty much... 'hell no.'  Rather than give up, we set out to figure out why there was so much pushback.  And solving those problems, whatever they turned out to be, became our top priority.

We learned about all kinds of problems that we didn't even know we had.  Some were ticket requests that had been sitting in the backlog for years.  Others were just things that had to be done for every release - not really a big deal until we asked for them to be done a whole lot more often.  There was a whole lot of pain downstream that we weren't even aware of.  And most of it was really just our problems rolling downhill - and completely within our power to fix.

Reducing the Transaction Cost of a Release

There are lots of different kinds of barriers to releasing more often.  Regardless of what yours are, getting the right people in the room and working together to understand the whole system goes a long way.  A lot of the things that seem like hard timeline constraints actually aren't.  Challenge your assumptions.

So how could we relieve our customers' pains?

"Everytime we do a release, we lose some data" - we had no idea.  The system was designed to do a rolling restart, but there was a major problem.  During the roll over we had old component processes communicating with new ones.  In general the interfaces were stable, but there was still subtle coupling in the data semantics that caused errors.  Rather than trying to test for these conditions, we instead changed the failover logic so all of the data processing would be sticky to either the old version or new version and could never cross over.   This prevented us from having to even think about solving cross-talk scenarios.  We also created a new kind of test that mimicked a production upgrade while the system was highly live.  This turned out to be a great test for backward compatibility bugs as well.

"Its not safe to install to production without adequate testing, and we can't afford to do this testing more often" - Whether they found bugs or not while testing, was almost irrelevant.  Unless they knew what was tested and felt safe about it, they wouldn't budge.  They were doing different testing, with different frameworks, tools and people than we were, and unless it was done that way, it was a no go.  So we went to our customer site.  We learned about their fears and what was important to them.  We learned how they were testing and wanted testing to be done.   

We shared our scenario framework with them, code and all, and then worked with them to automate their manual tests in our framework.  We made sure their tests passed (and could prove it) before we gave them the release.  And likewise, we adopted some of their testing tools and techniques, and at release time gave them a list of what we had covered.  We also started giving them a heads up about which areas of the system were more at risk based on what we had changed, so they didn't feel so much need to test everything.  Once we had helped them reduce their effort and built a lot more trust and collaborative spirit with our customers, this was no longer an issue.

Editing the Scrum Rule Book - What process tools DID we actually use?

Since we weren't predictable and weren't using time boxes, we also threw out story point estimation, velocity and any estimation-based planning activities.  The theory goes that you improve your ability to estimate and therefore your predictability by practicing estimation.  I think this is largely a myth.

"Predictions don't create Predictability." This is one of my favorite quotes from the Poppendieck's book, Implementing Lean Software Development.   You create predictability by BEING predictable.  The more complex and far in the future your predictions, the more likely you are to be wrong - and way wrong.  So wrong that you are likely to make really bad decisions under the illusion that your predictions are accurate.   Its an illusion of control when control doesn't actually exist.  You can't be in control until you ARE controllable.   Predictability doesn't come from any process, its an attribute that exists (or doesn't) in the system.  Uncertainty is very uncomfortable.  But unless you face reality and focus on solving the root problem, nothing is ever really likely to change.

Burn downs we did use, but not until we were closer to wrapping up a release.  This was helpful in answering the 'are we done yet?' questions, the timing of which we used to synchronize other release activities.  There were tickets to submit, customer training to do, customer testing to coordinate, etc.  We tried doing burn downs for the whole sprint, but since our attempted estimates were so wildly inaccurate, it wasn't helpful - it was more harmful than anything as input into decisions.  The better decision input was that we really had no idea, but were trying to do as little as possible so that done would come as soon as possible.  If a decision had to be made, we would try to provide as much insight as we could to improve the quality of the decision, without hiding the truth of the uncertainty.  Although management never liked the answers, our customers were unbelievably supportive and thankful for the truth.

Saturday, February 11, 2012

Fighting my way to agility - Part 4

Shrinking your Iterations when you have a Painful Test Burden

The obvious thing, which we quickly jumped on, was automating our existing manual test cases.  They were mostly end-to-end, click-through-the-UI-and-verify tests.  We automated most of them with QTP and tried to run them in our CI loop.  This turned out to be a terrible idea… or rather a horrible nightmare.  Not only were the tests terribly painful to maintain and always breaking - we had two other problems: our developers started totally ignoring the CI build since it was always broken, and the tests weren't even catching most of our bugs.  It was a massive cost, with very little benefit.  We ended up throwing them all away and going back to manual.  Don't get me wrong, this sucked too.  We instead focused our efforts on creating better tests.


We started looking more deeply at where our bugs were coming from, and why we were making mistakes.  Our old manual test suite was built by writing manual tests for each new feature that was implemented.  The switch in attention to finding ways to stop our actual bugs, instead of just trying to cover our features, was key to really turning quality around.   We ultimately created 3 new test frameworks.  

Performance simulation framework - The SPC engine had some pretty tight performance constraints that seemed to unexpectedly go haywire.  Trying to track down the cause of the haywire performance was a hell of a challenge.  The more code that had changed, the harder it was to track down.  We also discovered that our existing performance test that used simulated data didn't find our problems.  The system had very different performance characteristics with highly variable data.  So we got a copy of each production site and made a framework that would reverse-engineer outputs back into inputs and replay them through the system.  We would run a test nightly that did this and sent us an email if it detected 'haywire' behavior.  We actually used our own SPC product and created an SPC chart against our performance data to do it ;)  Catching performance issues immediately was a complete game-changer.  It also gave us a way to tune performance, and as a bonus, some handy libraries for testing our app.
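In case it helps, here's a rough sketch of the 'SPC chart on our own performance data' idea.  The 3-sigma control limit is an assumption on my part; the point is just the shape of the nightly check - compute limits from a baseline of replay timings and alert when a run falls outside them.

import statistics

def control_limits(baseline_ms):
    # Classic Shewhart-style limits: mean +/- 3 standard deviations.
    mean = statistics.mean(baseline_ms)
    sigma = statistics.stdev(baseline_ms)
    return mean - 3 * sigma, mean + 3 * sigma

def is_haywire(latest_ms, baseline_ms):
    lower, upper = control_limits(baseline_ms)
    return not (lower <= latest_ms <= upper)

baseline = [212, 198, 205, 220, 201, 215, 208]  # prior nightly replay times (ms)
if is_haywire(480, baseline):
    print("performance out of control - send the alert email")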

SPC 'fingerprinting' Tests - Remember all the bandaids and fragility I talked about?  The engine was insanely hard not to break.  We used the performance test tooling to replay the outputs as inputs to the system, but then we recorded the new output as a fingerprint file.  For every single chart in every production environment, we generated one of these fingerprinted scenarios.  Then as the system changed, we would compare the old fingerprints with the new ones and fail if there were differences.  Even with 8000 generated tests, it was easy to flip the bar green for expected changes by copying all the files in one folder to another.  The challenge was in telling expected changes apart from accidental ones.  Again, this got much harder as the amount of change grew.  If the change was relatively small, you could look at a sample of files and see whether the behavior was as expected or not.  We not only caught a TON of bugs this way, but also found cases where users asked us to implement changes that would break other use cases, and we were able to alert them.
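The mechanics were basically a golden-master comparison.  Here's a small sketch of the comparison step - the folder layout and file names below are made up, not the actual framework:

from pathlib import Path

def compare_fingerprints(expected_dir, actual_dir):
    # Diff every blessed fingerprint against the one produced by the new build.
    failures = []
    for expected in Path(expected_dir).glob("*.fingerprint"):
        actual = Path(actual_dir) / expected.name
        if not actual.exists() or actual.read_text() != expected.read_text():
            failures.append(expected.name)
    return failures

diffs = compare_fingerprints("fingerprints/blessed", "fingerprints/candidate")
if diffs:
    print(f"{len(diffs)} charts changed behavior, e.g. {diffs[:10]}")

# 'Flipping the bar green' for an expected change was just re-blessing:
# copy the candidate folder over the blessed one.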

Integration scenario tests - The other area where we had a lot of bugs was scenarios that crossed system boundaries.  For example, SPC would put a lot (a container of material) on hold, a user would investigate and then release the hold.  These scenarios got quite complicated when the lot was split up, combined with other material, and processed in different places.  We had to track down all the material that might be affected and put all of it on hold.  With remote calls failing on occasion and other activities happening concurrently, there were a lot of ways for things to go wrong.  Anyway, we worked with the testers for the other system to create a framework that could orchestrate scenarios and verify state across systems.  For our apps, rather than going through the UI, we had an internal test controller that we would use to drive the app from the inside.  To verify state, we would internally collect and dump all the critical state information to an XML file that we would diff to detect failure.  We had about 150 or so end-to-end integration tests like this, and they had a much lower maintenance cost.
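The dump-and-diff part is simple enough to sketch.  The XML structure and the lot data below are purely illustrative - the real dumps covered far more state - but the verification idea is just this:

import difflib
import xml.etree.ElementTree as ET
from pathlib import Path

def dump_state(path, lots):
    # Dump the critical state (here, just lot hold flags) in a stable order
    # so the files diff cleanly.
    root = ET.Element("state")
    for lot_id, on_hold in sorted(lots.items()):
        ET.SubElement(root, "lot", id=lot_id, hold=str(on_hold))
    Path(path).write_bytes(ET.tostring(root))

def diff_state(baseline_path, actual_path):
    baseline = Path(baseline_path).read_text().splitlines()
    actual = Path(actual_path).read_text().splitlines()
    return list(difflib.unified_diff(baseline, actual, lineterm=""))

# Recorded when the scenario last ran correctly:
dump_state("baseline_state.xml", {"lot-42": True, "lot-42-child": True})
# Dumped after driving the 'split lot goes on hold' scenario on the new build:
dump_state("actual_state.xml", {"lot-42": True, "lot-42-child": True})
assert not diff_state("baseline_state.xml", "actual_state.xml")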

The Fate of Our Manual Tests

Once we had much better coverage of the error-prone parts of the system, running all of these tests became a lot less important.  A few we converted to scenario or unit tests, but for the most part they all stayed manual.  It was still expensive to run them all, but we used another strategy to reduce that burden.  We categorized them all in a SharePoint list, and then at release time we would run only the tests we thought had a chance of finding a bug based on what we had changed that release.  With all the other testing, this was good enough.  We found a couple of bugs with them, but at this point, they were almost always green.
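In spirit, the selection worked something like this (the area tags and test names are hypothetical - we did this by hand against the SharePoint list, not in code):

manual_tests = {
    "TC-101 chart limits recalculated after recipe change": {"spc-engine"},
    "TC-214 released hold propagates to child lots":        {"holds", "integration"},
    "TC-305 chart rendering for long history windows":      {"ui"},
}

def tests_to_run(changed_areas):
    # Run only the manual tests that touch an area we changed this release.
    return [name for name, areas in manual_tests.items() if areas & changed_areas]

print(tests_to_run({"spc-engine", "holds"}))  # the UI-only test gets skipped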

But the Work was Still TOO Big!

We still suffered some pretty massive productivity problems, but at this point we were back in control.  When we changed stuff it wasn't so risky.  But it was still too time-consuming.  So we focused our team on how we could get the most productivity gain for our effort.


We analyzed our past work for where we had been spending time, and where future code changes needed to be.  The UI layer was frightful and always time-consuming to change, but we rarely had to change it.  Effort here wouldn't actually buy us more productivity.  Most of our changes were in the core SPC pipeline, and it was a major bear to understand and change.  Although the effort level was high, we all knew we would see the payoff.  The impact we had on productivity from rewriting the SPC engine was HUGE.

We had all kinds of awesome new features we were able to do, including supporting SPC for a completely different type of facility.  Our productivity exploded.  We could never have done these features on the old system - the effort would probably have exceeded the cost of the rewrite.  Our users were also thrilled because the behavior was so predictable - even for complex scenarios.  Their productivity in designing and maintaining SPC charts improved!

As we got better and better control of our productivity, we kept shrinking our iteration sizes, and were finally able to have consistent quality.  It wasn't perfect, but it was good enough to earn back the trust of our customers.   Though we never did end up using timeboxes, at this point, our sprints were 'boxable'. :)

Fighting my way to agility - Part 3

Breaking down work

On a large, complex, hairy project, work just gets big.  Our work was big.  And breaking down work is sort of an art.  The think time was usually 10x the coding time, and the test time would explode from changing just one line in the SPC engine.  When we tried to force big work into a small box, the team kept trying to slice it in ways that didn't independently deliver anything useful - which you're not supposed to do.

A lot of times in a breakdown we'd get a requirements story, a design story, etc. - not because the team was just hung up on waterfall, but because they were trying to figure out how to break down 'think'.  After all, how do you break down a one-line code change story when the scope of the work is really thinking through and testing the implications of changing that line of code?  Clarifying the exact expectations for a specific SPC behavior might take the developer and customer working together for a couple of weeks.  Our stories might have looked 'wrong' to some, but when these cases came up, breaking down 'think' work this way was helpful.  A more problematic breakdown attempt produced stories where checking in the code for one part would leave the system unshippable.  When this came up we would work really hard to find another way - which usually meant creating a story to find another way.  In either case, all of the various story parts that didn't independently deliver anything useful were tracked under an epic that did.

In the process of trial and error with breaking down work, we learned a lot about what mattered and what really didn't.  

Lessons Learned on Breaking Down Work

Teach over Force - Given all the pain we had, rather than making a box and forcing work to fit, I would focus on teaching work-breakdown skills.  Whether your box is fixed or not, you need to solve this problem.  You can aim for a box, but don't mutate the work to the degree that it violates the constraints below.  If you can't come up with a way to do that, just leave it as a big item.

No Invisible WIP - If you have stories that don't deliver anything usable to the customer (e.g. my requirements story), track them under an epic.  Not because of some Agile holy gospel, but because if you don't, it HIDES work in progress.  You have a partial investment in work, and over time you have thoughts rotting away, and code that may not even work sitting in your code base.  If it's not ALL done and you have to even think about the set of changes again, it's WIP.  Keep all your WIP in front of you, all of the time.  It should remind you of everything that is partway done.  Push to finish what you started before starting more stuff.  Push to get back to shippable as soon as you can.  Don't let yourself feel like the work is all done when you still have WIP.  It's not done.  That looming manual testing that tells you what's broken in your code?  The integration testing with that other system?  Your stories are still in progress, and if you hide that fact it's easy to feel OK about starting more work.

If you are not shippable, you still have WIP.  This is probably the most critical thing I see missed by people in any process - invisible, hidden WIP.  I think it's one of the main reasons people don't even see what would otherwise be a glaringly obvious problem.

Vertical Slicing - The typical strategies of trying to break work into vertical slices (a sliver of usable functionality traversing all layers of the system) and finding the smallest possible thing that can work are generally good advice.  Try to break down the implementation of the work by iterating over even smaller vertical slices, even if a slice isn't useful on its own.  You learn a lot in the process, which tends to reduce rework.

Don't abandon your requirements format - The 'user story' format can be helpful, but it's not that important.  The key is communication.  Whatever you need to do to communicate the purpose and expectations effectively - use what works.  If you've got something that is working well for you, don't change it.  Don't assume stories are going to work better for you.  After some chaos and thrashing, we basically settled on a 'story statement' added to our existing, really formal requirements document.  Distributed communication, cultural barriers, and complex requirements all make communication REALLY hard.

Fighting my way to agility - Part 2

Trying to Fix the Timebox... and Failing

We tried to fix the time and vary the scope and be shippable at the end of the sprint.   Well, that didn't really work.  How do you run 1200 manual tests in the sprint? What if I just have a month of code integrated that other people built stuff on top of, and I'm almost done but not quite? What if the app doesn't work at the end?  What if the smallest possible work size is just big?  If we have to be shippable at the end of the sprint, and the scope of getting to quality is highly variable, how could we possibly fix the timebox?  

I've seen people try to do testing sprints or hardening sprints or integration sprints, but that fundamentally destroys one of the core principles of Scrum - being shippable at the end of every sprint.  Shippable as in: you really could put what you just coded into production.  As soon as you make it OK to throw your testing into another sprint, you ignore that fundamental constraint.  And even worse, it masks a HUGE glaring problem that really should be your priority to solve.

If you do another sprint with your testing all delayed, you then violate another core principle - you shouldn't build on top of BROKEN code.  It was massively harder to diagnose bugs when there was more code that hadn't been proven yet.  The ratio was roughly 4 to 1: doubling the amount of code roughly quadrupled the time of our test/fix cycle.  And it didn't seem to take much more than that to go off the test/fix cliff of no return.

Both of these are way more important not to violate than the time boxes.  Yet I see soooo many teams keep the time boxes and throw out the other two.  Until you can actually be predictable, I don't think the timeboxes buy you much anyway.  Something had to give, and we voted for it to be the box... with a goal of working on becoming more 'boxable'.

So we scheduled our 'sprint' for what we thought was about a month-ish of work, and then we took however long it took to test the app, get it all working, and finish anything that we couldn't pull out - without working on anything new.  The sprint was done whenever we were back to shippable.  Nobody was allowed to work on forward development.  Nobody did any refactoring that might add more risk.  Get the app back to being green.

It was easy to fall into the trap of worrying about inefficiencies, especially as test/fix cycles dragged out and the business wanted their features.  We tried creating a release branch and having a few people continue forward development while the rest of us focused on stabilizing.  This idea blew up in our face.  We were only really thinking about the penalty of managing the branch and merging, and didn't foresee what actually happened.

Eventually we got the release out the door and planned out another one month-ish sprint.  Then we got to our test/fix phase, but there was an even bigger batch of work in it!  The last stabilization had taken ~2 months, and so we actually did about twice as much work.  The difference in time that it took to troubleshoot defects in the larger set of changes was HUGE.  The troubleshooting complexity shot through the roof.   It was another 6 months before we were able to get the changes out the door.  


Forget getting a head start, and have your spare capacity work on the problem right in front of you - it's too damn hard to get code out the door.  What good are a bunch of coded features that you can't ship?


Fighting my way to agility - Part 1

Real Projects Have Hard Problems


The challenge with principles is that it's really hard to see how they apply in a specific context.  The idea of the principle may make perfect sense.  How the software process should work in theory may make perfect sense.  But the messy world of software reality isn't so kind.

Concrete, messy, real-world experiences are awesome to learn from because they give us insights on how to take abstract truths (principles) and map them to reality.  From all of these examples, we can distill patterns that give us ideas on how to map solutions between different problems.  It's far too easy to make a really bad decision if you over-simplify a very complex world.  Real software is just hard.


For anyone who reads this, please share your stories.  Whether they end in victory or not, they're the guts of our learning.  By sharing them we broaden all of our experiences, and improve all of our abilities to tackle the problems we face.

So here's one of my stories.  This is the story of a messy real world project in which I personally fought my way to agility.

My Project

A semiconductor factory SPC (statistical process control) system responsible for reading and aggregating data coming off of the tools to detect problems (and shut down whatever was causing the problem).  High-volume, highly variable incoming data stream, user-defined analysis programs, and near real-time charts.  A program would usually gather a mix of historical data and current data, and do a bunch of math on the results to make a decision.  If we took too long to make that decision, the tool would time out and shut down.  Deployed in a 24/7 environment with one downtime per year.
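For readers who haven't lived with SPC, here's a rough illustration of the kind of decision one of these programs makes.  The run rule shown (eight consecutive points on one side of the center line, one of the classic Western Electric rules) is just an example I picked - the real analysis programs were user-defined and far richer, and all of this had to finish before the tool timed out.

def run_rule_violated(history, center_line, run_length=8):
    # Flag a sustained shift: the last N points all on the same side of center.
    recent = history[-run_length:]
    if len(recent) < run_length:
        return False
    return all(x > center_line for x in recent) or all(x < center_line for x in recent)

history = [10.2, 10.3, 10.1, 10.4, 10.2, 10.5, 10.3, 10.4]  # latest readings
if run_rule_violated(history, center_line=10.0):
    print("sustained shift detected - hold the lot / shut down the tool")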

Starting point (for me, 2005)

Scary reality. :) 500k lines of code, ~10 years old.  The web server was home-grown, the UI flow control was done with exceptions (seriously!), and the core SPC engine had bandaid over bandaid of hack fixes, so it was next to impossible to make a change without having some unintended side effect on another use case.  1200 formal manual test cases all had to be run and made green for every release.

Half of the team was in Austin, half in India, and one person in Germany, with customers in both Austin and Germany, each with different data formats, problems, and strategies.  Interestingly, the team started with 2-week iterations, but as the transaction cost of delivery went up with the growing test burden, the batches kept getting larger to compensate.  When I got there, they were doing ~4 months of development and a couple months of test/fix chaos after that.

They had recently been having major performance problems and didn't know how to solve them.  So the usual couple months of chaos was at 2 months and counting, with no end in sight.  I moved from an XP shop building financial transaction/banking software, but they had really hired me for my Oracle performance wizardry skills.  When I got there it was even more tragic: they had also been having major quality problems.  They had just rolled back the production release - again.  Our customers were literally scared to install our software.  We were sitting on the tipping point of being completely unable to release, with critical defects that we couldn't diagnose.

End point (3 years later)


Consistently delivering releases every ~2 months with high quality, stable performance, and predictable behavior.  Deployed 2 more installations, one of them a totally different type of processing facility.  Happy customers.  They even threw a party for me when I went to Germany!

Going from A to B - Was it Magic?

It was pain.  A lot of pain.  A lot of mistakes.  A lot of learning and hard work.

We accidentally shut down every tool in the fab (twice).  We had a sprint whose test-and-fix churn dragged out for a year before we saw light at the end of the tunnel (way scary).  We threw away massive investments in test automation with QTP and started over.

We were aiming to do Scrum.  We all read a Scrum book, and Bryan (still my awesome boss) was then our manager and Scrum Master. 

It took a long and hard journey to accomplish real change.