
The Arsonist Firefighter

July 11, 2024 · 8 min read

Background

An arsonist firefighter is someone who ignores, starts, or fans the flames of a fire so they can go in, put the fire out, and be the hero.

A former colleague who had worked for me on the COMET program at Hill AFB gave me a call and said, "We need you here. It's a mess." So off I went to work for a financial services company I had never heard of before.

In 1995 I was hired as a development manager for the server team of around 10 people; the total team size was around 50. We were developing a 3-tier client/server system that automated business process workflows.

Actions Taken

Observations

Things were indeed a mess. I learned that there had been 100% turnover among first-line managers; none had been there for more than 2 years. We also had a reputation as the second-worst "sweat shop" in the state, which made it hard to hire or recruit people.

The teams had spent 4 months developing the then-current release of the product and were in the throes of fixing bugs and trying to get a major release out to one of the business units. The teams were working 7 days a week, 16 hours a day, and managed to get the release out. I was hoping things would settle down some, but I would soon be disappointed.

Friday afternoon rolled around, and my boss (a VP) came into my office stating that there was a Severity 2 bug we had to fix as an emergency and telling me to call all the developers in to fix it. We got it fixed over the weekend and installed it in production. The following Friday it happened again: another bug to fix, and we called our developers in again. The Friday after that, it happened again. I noticed a pattern.

Rebellion

Having developers work more than 2 weeks of overtime in a row results in less productivity and more bugs. So after the third Friday I met with the other two development managers, and we agreed not to call our developers in to work the weekend. The following Monday the VP wasn't happy at all, and we all thought we'd get fired. But he didn't say a thing, and his Friday afternoon emergencies went away for a while.

Service Level Agreement

One of the customers escalated a bug that had to get fixed. We worked it 24x7 and got it ready for release, but the customer opted not to take it into production. Was it really an emergency? No, it wasn't. The next thing I did was develop a service level agreement (SLA). The SLA included:
- Severity definitions
- Product release definitions (major, minor, patch)
- Release frequencies
- Expected response times for the different bug criticalities
- For an emergency bug fix or patch release, a person from the business unit (BU) had to be available to the team throughout to answer questions and validate the fix

Customer escalations stopped. They didn't want to work weekends either. The SLA worked!

Release Control Board

Once the SLA was in place, I wanted to take it a step further. I proposed forming a Release Control Board (RCB) made up of all the VPs (Product Management, Business Implementation, QA, and my Engineering VP). If there was an emergency patch, the RCB would have to meet ASAP and approve it before we did any work. This finally reined in the Engineering VP's ability to call emergency patches.

A couple of months later I came into the office on a Monday morning and noticed a bunch of developers already working. I asked a couple of them what they were working on, and they said the VP had called them into the office himself.

I had built a really good relationship with the VP of QA. So I gave him a call to let him know what was happening. He convened the RCB and they determined the patch release wasn't really an emergency and called a halt to it.

The developers weren't very happy. They'd worked all weekend to no avail. The bug fixes would get rolled into the weekly minor release.

Infrastructure

Another crisis hit: a bunch of bugs the developers had fixed started appearing again. The Engineering VP told me the developers were lying. I said no, they did fix them. So I asked one of my developers to go into the source code control system and see if there were any days where no changes were made to the code base. Within a day he came back and said there were 7-day, 4-day, and 3-day periods where no changes were made. I looked at the time sheets, and sure enough, those were weeks when the team was working 60-hour weeks. The work was being done; the changes just weren't making it into the code base.
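As an illustration of that check, here is a minimal sketch of the kind of script that could find those gaps, assuming the change dates have been exported from the source control history into a plain text file with one ISO date per line. The file name, format, and gap threshold here are hypothetical, not what we actually used.

```python
# Hypothetical sketch: given a plain-text dump of change dates pulled from the
# source control history (one ISO date per line, e.g. "1995-08-14"), report any
# stretches of consecutive days with no changes at all.
from datetime import date, timedelta

def change_free_gaps(dates_file: str, min_gap_days: int = 3):
    """Return (start, end, length) for every run of change-free days."""
    with open(dates_file) as f:
        changed = sorted({date.fromisoformat(line.strip()) for line in f if line.strip()})

    gaps = []
    for prev, curr in zip(changed, changed[1:]):
        quiet_days = (curr - prev).days - 1   # full days between two change dates
        if quiet_days >= min_gap_days:
            gaps.append((prev + timedelta(days=1), curr - timedelta(days=1), quiet_days))
    return gaps

if __name__ == "__main__":
    for start, end, length in change_free_gaps("change_dates.txt"):
        print(f"{length}-day gap with no changes: {start} through {end}")
```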

After a few phone calls, I found out that:
- The development environment was on the failover server in production. If production went down, we'd be down.
- Source code was managed via bare-bones SCCS.
- The failover server was not being backed up.
- The build process was entirely manual and not documented.

I went into the Engineering VP's office and told him he had no business doing software development. His eyes opened wide in shock. I told him we needed to hire a full-time software configuration management (SCM) engineer ASAP, and that it was only a matter of time before the current setup would bite us.

Within two weeks, we had a build that wouldn't install. It got escalated quickly. The VP came into my office and asked if I knew of anyone we could hire. I said yes and we hired an SCM engineer who had worked for me previously.

While waiting for the hiring process to go through:
- I took my best developer and made him the interim SCM engineer. He did all the builds and started documenting the process so the full-time hire would have something to start with.
- I asked the System Administrator to start copying all the source code nightly to a local server so it could be backed up (a minimal sketch of that kind of job follows below).

Once the SCM engineer was hired, he immediately automated the build and then moved us from SCCS to a more robust source control system.
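For illustration, here is a minimal sketch of what that nightly copy could look like, assuming the source tree path, backup mount, and retention window shown below (all hypothetical). The real job just needs to land a dated snapshot of the code somewhere that actually gets backed up.

```python
# Hypothetical sketch of the nightly copy: the paths and retention count are
# assumptions for illustration, not the setup we actually ran. Scheduled once a
# night, it snapshots the source tree onto a server in the backup rotation.
import shutil
import time
from pathlib import Path

SOURCE_TREE = Path("/export/dev/src")             # assumed location of the code base
BACKUP_ROOT = Path("/backup-server/src-nightly")  # assumed mount on the backed-up server
KEEP_SNAPSHOTS = 14                               # assumed retention window

def nightly_snapshot() -> Path:
    """Archive the source tree into a dated tarball on the backup server."""
    BACKUP_ROOT.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d")
    archive = shutil.make_archive(str(BACKUP_ROOT / f"src-{stamp}"), "gztar", root_dir=SOURCE_TREE)

    # Prune the oldest snapshots so the backup volume doesn't fill up.
    snapshots = sorted(BACKUP_ROOT.glob("src-*.tar.gz"))
    for old in snapshots[:-KEEP_SNAPSHOTS]:
        old.unlink()
    return Path(archive)

if __name__ == "__main__":
    print(f"Wrote {nightly_snapshot()}")
```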

Development and Test Environments

I identified a need for dedicated development and test environments so we could get all development off the failover server in production. Having planned out the next release of the product, we needed the environments in place and working within 6 weeks, or there would be a day-for-day delay of the release. The VP told me he'd take care of it. I had my doubts.

Each week I'd put the risk in my status report and each week the VP would take that risk back out. I asked why and he told me that it was within his span of control to get the environments in place and that the risk didn't belong in the status report. After six weeks we still didn't have any environments and our schedule was now slipping. It had become an issue.

Since we were trying to develop new functionality on an unstable, buggy base, I decided to use this as an opportunity to reallocate 10 developers to maintenance and bug fixes so the remaining 30 developers could focus on the new features without being interrupted.

The SVP paid a visit, wanting to understand everything that had happened. I explained the rationale for my decisions, and he said I did the right thing. We reset the date and kept the 10 developers on maintenance.

Eight months after being hired, I was involuntarily transferred to another team by the VP. I don't think he liked me very much, although 6 months later he asked me back to be the product release manager for the next major update.

Resulting Context

Two years later the Engineering VP asked me to meet with him in his office and asked if I noticed anything different. Everything was calm and peaceful. We were hitting all our dates with quality releases. The customers were happy. The developers were happy. Retention improved. And we no longer had a reputation of being a sweat shop. He said, "You had something to do with it." I replied that the development leads and the team deserved all the credit. He disagreed. I later found out that rumors were going around that I had saved the product from extinction. And we were able to get the average hours worked across the whole team down to 45 hours per week.

The VP eventually got a new job in another part of the company. He was replaced with a person who had extensive commercial software development experience.

Takeaways

  • Bad managers influence software development more than any other factor. Hire slow, fire fast.

  • Come to work each day willing to be fired. This enables you to take whatever risks you need to improve. In this case I felt I had nothing to lose.

  • If a team has worked overtime for two weeks in a row, give them a break. Renegotiate scope if need be.

  • Depending on the size of the teams, getting demonstrable results can easily take 2-3 years.

  • Give all the credit to the team and give yourself a small private pat on the back.
