Every time there’s a major outage at Meta, the first question I get from friends and family is usually “did they fire the person who caused it?” That is when I have to explain the concept of No Blame SEV Culture. Especially for an outage big enough to affect a significant number of users, the person who triggered it most likely had no ill intent, and multiple processes and systems most likely had to fail along the way for us to get there in the first place.
This is part of a series (The Opinionated Engineer) where I share my strong opinions on engineering practices.
Process over People
When something goes wrong (especially something really catastrophic), it’s usually a combination of both process and people problems. The difference is that process is more deterministic than people. People have off-days, get tired, and make mistakes, so it’s important to have a process (or automated systems) in place to catch those mistakes. This can mean anything from more test coverage, to lint rules against bad code patterns, to more alerts. It is however important to note that these safeguards have to meet a quality bar of their own. As mentioned in the previous article, flaky or broken tests are tech debt, and the same goes for noisy lint rules and alerts. Too many of them lead engineers to disregard the warnings or adopt a “wait-and-see” mentality, which defeats the purpose of preventing future outages.
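One way to hold alerts to a quality bar is to track how often each one actually led to action. The snippet below is only a rough sketch of that idea, not a description of any tooling at Meta: the `noisy_alerts` function, the thresholds, the alert names, and the sample history are all made up for illustration.

```python
from collections import Counter


def noisy_alerts(firings, min_firings=5, min_actionable_ratio=0.5):
    """Return the names of alerts that fire often but are rarely actionable.

    firings: iterable of (alert_name, was_actionable) pairs.
    Both thresholds are hypothetical defaults, not recommended values.
    """
    total = Counter()
    actionable = Counter()
    for name, was_actionable in firings:
        total[name] += 1
        if was_actionable:
            actionable[name] += 1

    # Ignore alerts with too few firings to judge, then flag the ones
    # whose actionable ratio falls below the bar.
    return sorted(
        name
        for name, count in total.items()
        if count >= min_firings
        and actionable[name] / count < min_actionable_ratio
    )


# Hypothetical history: one alert that is mostly noise, one that is mostly signal.
history = (
    [("disk_usage_high", False)] * 9
    + [("disk_usage_high", True)]
    + [("checkout_error_rate", True)] * 4
    + [("checkout_error_rate", False)]
)

print(noisy_alerts(history))  # ['disk_usage_high']
```

Anything this flags is a candidate for tuning or deletion, the same way a flaky test would be fixed or removed instead of being left to train people to ignore it.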