Saturday, June 28, 2014

Constant Urgency is a Symptom of a Hazardous Environment

When we use analogies like land-mines, fires, and various types of hell, we're referring to the hazards in our everyday work.  We may not be working with sharp objects or molten metal, but the anxiety, stress and exhaustion from working in a hazardous environment are very real.

From the outside, these problems are largely invisible.  So despite the hazardous work, we are usually expected to work without making mistakes.  We never have time to fix the hazards because there's always something more important to do.  The tools we need for safe development, like failure recovery, diagnostic support, adequate logging and reliable deployment, are often deferred in favor of more features.

Then when something explodes, as it inevitably will, high-risk heroics are required to save the day.  We work late nights and weekends repairing complex problems by hand and hope that nothing else goes wrong.

Instead of recognizing the symptoms of a serious problem, the long hours and heroics are often rewarded.  Fire-fighting, overtime, and last-minute hacks start to be expected.  Constant stress and exhaustion become the norm.

More people just add fuel to the fire.


Those who don't want to put in the long hours anymore are seen as not pulling their weight.  Frustration builds, the team burns out, and the best developers start to leave.

The new guys just make things worse.  They don't know the software or the hazards to watch out for, and they keep messing things up in the code.  We try to hold things together, but it's hard to get anything else done.  It becomes a full-time job just to keep the system from falling apart.

Management doesn't understand why productivity is so poor and tries to add more people to the work.  This just adds fuel to the fire.

Once this cycle gets started, it's hard to turn things around.  We get sucked into the problems, operating in a mode of constant urgency, and we don't want to see our project fail.  So we push ourselves to the limit of stress and exhaustion doing the best we can.  However, we're so busy reacting to all the things going wrong that there's no time to stop and fix the problems.  One more late night and a few hacks to get things working, but the cycle just doesn't end.

We knew better, but we did it anyway


The worst part about this is that even when we know better, we do it anyway.

I remember one night in particular after working 60+ hour weeks for several months.  I checked in some code without running it at all and deployed my changes so I could test them in production.  I was so used to working under constant urgency that I had eventually thrown all my sense of principle out the window.

We had built out the delivery infrastructure and automated our release process from the beginning.  For a while, we were releasing every week; there were challenges, but for the most part things were going fairly well.  We had a major deadline coming up to support a new customer on our platform and investors had been promised it would happen by the end of the year.

The requirements meant drastically changing parts of the architecture and conquering some extremely difficult problems.  How long was it going to take?  We had no idea, but we did know we had better get to work!

We broke down the work and started chipping away at it, trying to do just enough unit testing to get by.  We paired on the more challenging parts and tried to parallelize the work to get it done as fast as we could.  We tried to integrate early, but there were so many problems.  The software produced weird results.  We just had to work through it.


We were caught up in the cycle


Some of us worked on testing and fixing, while others kept pushing along with the remaining features.  We knew we were headed down the path of a monstrous release, but we didn't seem to have any choice.  We worked an insane number of hours troubleshooting problems just trying to get it stable.

The end of the year was rolling around and we finally got the software in production.  We thought the pain was finally over, but that was just the beginning.  We had no time to build out the infrastructure we needed to make changes safely, and our new users had a long list of complaints.  The pressure just never let up.

With every release, it seemed like things would go wrong.  We'd work all weekend and be up late Sunday night trying to fix deployments that had failed.  The data would be messed up.  Reports wouldn't be right.  We didn't really have a viable plan B.  The system was down and it took too long to restore from backup, so we just had to fix it in production.


Something had to give...


We were so exhausted, but the urgency didn't end.  We were yelled at and threatened whenever things went wrong, but expected to continue the high-risk work.  How could they possibly give us bandwidth for work that wasn't part of the deliverables, when the project was already several months behind schedule?

We had poured so much of our time into the software and the people on the team were my friends.  We had great developers who had always been disciplined engineers, and we all got sucked into the same trap.

Sometimes you just have to leave.  Working under threat and constant urgency makes great people do really stupid things.

Saturday, June 14, 2014

Designing Effective Teams

The same things that make for good software make for good team structure.  We need high cohesion within a team and low coupling between teams.  If people need high-bandwidth communication across team boundaries to do their jobs effectively, the team structures are usually pretty dysfunctional.  Likewise, if the members of a team don't have a need to talk to each other, they don't really operate as a team either.
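To make the cohesion and coupling idea a little more concrete, here is a rough sketch that counts conversations inside a team versus across team boundaries.  The team names, the people, and the `team_metrics` function are all made up for illustration, and counting conversations is only one crude way to look at it, not a standard measure.

```python
def team_metrics(teams, conversations):
    """Count internal vs. cross-team conversations for each team.

    teams: dict mapping a (hypothetical) team name to a set of people.
    conversations: list of (person_a, person_b) pairs, one per conversation.
    """
    # Look up which team each person belongs to.
    member_team = {
        person: team for team, members in teams.items() for person in members
    }
    internal = {team: 0 for team in teams}
    cross = {team: 0 for team in teams}

    for a, b in conversations:
        team_a, team_b = member_team[a], member_team[b]
        if team_a == team_b:
            # Conversation stays inside one team: a sign of cohesion.
            internal[team_a] += 1
        else:
            # Conversation crosses a boundary: coupling for both teams.
            cross[team_a] += 1
            cross[team_b] += 1

    return {
        team: {"internal": internal[team], "cross": cross[team]}
        for team in teams
    }


# Hypothetical example data, invented for this sketch.
teams = {"billing": {"ann", "bob"}, "search": {"cy", "dee"}}
conversations = [("ann", "bob"), ("ann", "bob"), ("cy", "dee"), ("bob", "cy")]
metrics = team_metrics(teams, conversations)
```

A team whose cross count dwarfs its internal count is probably carved along the wrong lines, and a team with almost no internal conversations probably isn't operating as a team at all.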

Team structure is a design problem.  Developers can be quite good at it, once they start to look at it that way.  Designing the team structure around the architecture has a lot of benefits.  However, if you have a hairball of an interdependent architecture, you can't build a good team structure around it.  Trying to throw more people at the problem and artificially carve it apart is often where software organizations fail.

Trying to go faster and throw more people at it often results in going *slower*.  Teams get stuck in a trap of trying to police the code with reviews and there's no way to keep up.  The best people can no longer be productive because they spend all their time reacting to a system that is bursting at the seams.  Until the team learns to design the system so that *communication* can scale, leaders need to keep their foot off the accelerator pedal.  We need the time to invest in that critical learning.

I don't think it's a hands-off, let-the-team-figure-it-out kind of problem.  Organizational design is a challenging problem and we need leadership to help figure it out.  But we need leaders who listen to their engineers, who know what to look for, and who have an appreciation for the challenges of our craft.