Why should you underestimate your users?

There’s a great story in the preface of Site Reliability Engineering. Margaret Hamilton was the director of the Software Engineering Division of the MIT Instrumentation Laboratory. She was on loan to NASA to help them with their on-board flight software, in particular making sure it was reliable.

Margaret Hamilton next to Apollo Guidance Computer (AGC) source code.

One day, she brought her young daughter to work. Well, her daughter found some of the buttons and switches on the control panel irresistible (as I myself would) and played with them. Doing so exposed a potentially catastrophic flaw: running “P01” (a pre-launch operation) during flight wiped out the navigation data from the computer, preventing it from piloting the ship.

Obviously, that’s a big deal. Margaret thought they should add code to prevent this from happening.

Management said there was no reason to. The computer is run by astronauts, the best of the best and highly trained. They’re not going to do that.

OK, said Margaret, we’ll add a warning in the manual about this. Some were quite amused by that. It was as if your car manual had a warning telling you not to put it in park while driving 55 mph.

Guess what happened during Apollo 8? Someone ran P01 during flight and wiped out the navigation data. Who would do that?! Not our astronauts! That must’ve been one of our monkey flights, right?

Nope. The person who did that was Jim Lovell. And right about now you’re asking yourself why that name sounds familiar. It’s because Jim Lovell is an absolute freaking legend of an astronaut whom you saw played by Tom Hanks in Apollo 13.

That Jim Lovell. Blew away the nav data. On Christmas Day, no less!

Lucky for Jim, someone thought to put that possibility – and how to fix it – in the manual. And if you’re Jim Lovell, you don’t panic, you just fix it.

How many of your users are as well trained on your system as American astronauts? How many are Jim Lovell?
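
If the answer is "not many," the kind of guard Margaret asked for is cheap to write. Here's a minimal sketch in Python of what "add code to prevent this" can look like in a modern system. To be clear, this is not how the AGC worked; `Phase`, `Computer`, `run_prelaunch_init`, and `UnsafeOperationError` are all hypothetical names made up for illustration.

```python
from enum import Enum, auto


class Phase(Enum):
    PRELAUNCH = auto()
    IN_FLIGHT = auto()


class UnsafeOperationError(Exception):
    """Raised when a destructive operation is requested in the wrong phase."""


class Computer:
    """Hypothetical system with one destructive setup routine."""

    def __init__(self):
        self.phase = Phase.PRELAUNCH
        self.nav_data = {"position": (0.0, 0.0, 0.0)}

    def run_prelaunch_init(self, confirm=False):
        # Guard: outside the pre-launch phase, refuse unless the caller
        # explicitly confirms they really mean to wipe the nav data.
        if self.phase is not Phase.PRELAUNCH and not confirm:
            raise UnsafeOperationError(
                "pre-launch init would erase navigation data in phase "
                f"{self.phase.name}; pass confirm=True to override"
            )
        self.nav_data = {}
```

The design choice here is to refuse by default and make the dangerous path an explicit opt-in, rather than trusting that nobody will ever press the wrong button.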

Jim Lovell
