There comes a point where we all fail. It doesn’t matter when; if you don’t think it’s happened yet, give it time, because either way it’s coming. The question you need to ask yourself is “What am I going to do about it?” I’ve worked in places where failure was a point-the-finger affair and places where it wasn’t. It is clear to me that failure is the only way to move forward and succeed; you just need the right strategy for dealing with it, one that allows you to move on with life and make the changes you need to make things better. Remember, you are not going to fix every problem on your first attempt, but stick to the process, follow it religiously, and eventually you will be in a better place.
Thomas Edison famously got it right:
I have not failed. I’ve just found 10,000 ways that won’t work…
The whole point of failure is to learn from it, and as long as you remember that, you will succeed. With that in mind, the most common mistake I see is the failure to learn. It’s fine to fail; fail all day long if you want. The important thing is to have the right mechanism for coping with failure so that you learn from it. This doesn’t need to be process heavy, but it does need to be done religiously after every failure.
There are a few things I ask after every failure, regardless of whether it was customer facing or internal; failure is failure is failure.
- What can we do to stop this happening again?
- How can we get more notification next time?
- Did we have the right people looking at this at the right time?
I feel the need to be abundantly clear here: “What can we do to stop this happening again?” literally means what crazy ideas do people have to stop this. Do we add a new layer? Do we double up somehow? Throw things behind a load balancer? It’s no good having a room full of bright people if you can’t answer this question. There’s always something that can be done: a change in process, some crazy technical solution, or just adding more capacity.
Getting more notification is important, and not just after the event: can you predict the event? The obvious example is disk space; when it comes to other issues, your mileage may vary. Either way, you should be able to do something that gives you a little more time to start dealing with it, even if it’s something as simple as upping the rate of the checks and the failure notifications so you get the alert one minute sooner than before.
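To make the disk space example concrete, here is a minimal sketch of predicting the event rather than waiting for it. It assumes you already collect `(timestamp, used_bytes)` samples from somewhere; the function names, the sample format and the 24 hour warning window are all made up for illustration, and a straight-line fit is only one of many ways you could do this.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, bytes used)


def seconds_until_full(samples: Sequence[Sample], capacity_bytes: float) -> Optional[float]:
    """Estimate how long until the disk fills by fitting a straight line.

    Returns None when there is not enough data or usage is not growing.
    """
    if len(samples) < 2:
        return None
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp, no trend to fit
    # Least-squares slope: growth in bytes per second.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None  # flat or shrinking usage, nothing to predict
    _, latest_used = samples[-1]
    return max(0.0, (capacity_bytes - latest_used) / slope)


def should_alert(samples: Sequence[Sample], capacity_bytes: float,
                 warn_seconds: float = 24 * 3600) -> bool:
    """Alert when the predicted time to full is inside the warning window."""
    eta = seconds_until_full(samples, capacity_bytes)
    return eta is not None and eta <= warn_seconds
```

In practice the samples could come from something like Python's `shutil.disk_usage` polled on a schedule; the point is only that the alert fires on the trend, which buys you time before the disk is actually full.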
Having the right people is also important, and I’m not talking about having Bob on call rather than Chris; I’m talking about getting developers awake at the right time. Let’s say there’s a memory leak: the alert should wake up both a sysadmin/DevOps guy and a developer. The only things the sysadmin can do are make sure the memory buffers are cleaned so the application can start again (ready to fail at an undetermined later point) or automate restarts. These all work around the problem rather than fixing it, and they are things to consider when it comes to “What can we do to stop this happening again?” But you wrote the app and you have the developers, so why would you not do both? Have the DevOps/sysadmin stabilise the system and minimise the impact while the developers investigate the cause and write a fix.
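As a rough illustration of waking the right people, the routing can be as simple as a lookup from the kind of alert to the roles that get paged. Everything here is hypothetical: the alert kinds, the role names and the pager identifiers are invented for the sketch and don't correspond to any particular paging tool.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Alert:
    service: str
    kind: str  # e.g. "memory_leak" or "disk_space" (illustrative names)


# Hypothetical routing table: which on-call roles each kind of alert wakes up.
ROUTES: Dict[str, List[str]] = {
    "memory_leak": ["ops", "developer"],  # stabilise AND investigate the cause
    "disk_space": ["ops"],
}

# Hypothetical pager targets for each role.
ON_CALL: Dict[str, str] = {
    "ops": "ops-pager",
    "developer": "dev-pager",
}


def recipients(alert: Alert) -> List[str]:
    """Return the pager targets for an alert; unknown kinds go to ops only."""
    roles = ROUTES.get(alert.kind, ["ops"])
    return [ON_CALL[role] for role in roles]
```

With this table a memory leak pages both `ops-pager` and `dev-pager`, so one person can stabilise the system while the other starts on the fix.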
With these simple tasks in place, the only sensible thing to do for your service is to fail, lots and regularly, and then to put in place the solutions that stop it happening again. Failure is an option, and it’s one I’d recommend, with the appropriate framework in place!