The short version
Stuff happens, move on.
The long version
Risk management is a really interesting topic. I know there will be lots of people out there falling asleep at just the thought of risk management; well, to you I say Hah! If you find risk management dull, you’ve probably never had the fun of thinking through 101 different ways in which something could fail, and that takes a lot of imagination!
When considering risk, there is a tendency from a sysadmin point of view to get stuck in the technical detail, i.e. if Node X dies we lose service Y. That is fine, it is a valid risk, but moving past it is vital, predominantly because most technical risks can be avoided with change processes, redundancy and high availability. After the technical risks you end up in environmental risks, the “what if…” risks. For example, “What if a power failure occurs?” Great, that one is environmental and you’ve chosen a provider that has UPSs. Wonderful. Do they have generators? Diesel stored on site? In multiple containers? With a delivery schedule from multiple suppliers in the event of an emergency? Divergent power sources?
Okay, nothing to panic about here, these are just common sense issues. Regardless of all the mitigations that are in place, you could just run 2 sites, 30 miles apart. But what if you are using the same provider for both sites? What about the financial collapse of your hosting provider?
Okay, so being totally paranoid: you have 2 providers, 30 miles apart, each with UPS, redundant generators, divergent power sources, SLAs with fuel providers, and a free-air-cooled data centre with backup air conditioners. Great, good job… Wrong! Where are the backups? Are they both in the same country? The same planet?
I guess the laboured point is that you can’t mitigate everything. Even if you think you can, you can’t.
So what do you do?
Kick back and relax, the problems will solve themselves! Not quite, but not far from the truth either. You have to be pragmatic; you have to consider what level of risk is affordable and justifiable. Remember that mitigating risk often costs money, and that it is very easy for senior management bods to haul you over hot coals when something fails. They will ask “How did this happen?”, and it’s probably worth noting at this point that you do not want to reply with “We didn’t have a suitable DR plan”. That’s not going to wash.
Luckily for you, you just have to come up with all the risks you can and a number of solutions that each mitigate a different set of those risks, then let someone else make the call about what is an acceptable amount of risk and what you can live with.
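One way to hand that decision over is a small table of options against risks. Here is a minimal Python sketch of the idea; the risk names, option names and yearly costs are all made-up placeholders for illustration, not figures from any real deployment, and "mitigates" is deliberately crude (a risk is either covered or it isn't).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    yearly_cost: int        # hypothetical figure, purely illustrative
    mitigates: frozenset    # names of the risks this option covers


# Hypothetical risk list; replace with whatever your own brainstorm produced.
RISKS = {"node failure", "power failure", "site loss",
         "provider collapse", "data loss"}

# Hypothetical options, each covering a different subset of the risks.
OPTIONS = [
    Option("two sites, two providers", 35000, frozenset(RISKS)),
    Option("local backups only", 500, frozenset({"data loss"})),
    Option("two sites, one provider", 20000,
           frozenset({"node failure", "power failure", "site loss", "data loss"})),
]


def summarise(option, risks):
    """Return what an option costs, how much it covers and what is left over."""
    return {
        "option": option.name,
        "yearly_cost": option.yearly_cost,
        "covered": len(risks & option.mitigates),
        "residual": sorted(risks - option.mitigates),
    }


def decision_table(options, risks):
    """List the options cheapest first, so the trade-off is easy to read."""
    cheapest_first = sorted(options, key=lambda o: o.yearly_cost)
    return [summarise(o, risks) for o in cheapest_first]


if __name__ == "__main__":
    for row in decision_table(OPTIONS, RISKS):
        left_over = ", ".join(row["residual"]) or "nothing on this list"
        print(f'{row["option"]}: cost {row["yearly_cost"]}/year, '
              f'covers {row["covered"]} of {len(RISKS)}, still exposed to: {left_over}')
```

The residual column is the useful part: it tells whoever signs off exactly which risks they are agreeing to live with for that price.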
It may also help to plot your risk management strategy against your one-year or three-year strategy, or against the growth of the solution, so there are known points at which a certain amount of resilience is needed.
For example, you launch a new website. You don’t know if it will be popular or not; you don’t know if it will be profitable. So for this solution, what is wrong with just ensuring you have a decent backup? Even if it is to local disk and not “offsite”, that’s better than nothing. NB: I would highly recommend you at least make a regular local copy, or better yet store the website in SVN as well and back that up…
This solution has a cheap and reasonable risk management policy. It may occasionally go down for an unknown period of time. Worst case scenario, you have to apologise to all the users, promise to make it better and actually make it better (always do what you say you are going to do…).
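For the regular local copy, something as small as the following would do. It is only a sketch using the Python standard library: the function name `backup_site`, the `site-<timestamp>.tar.gz` naming and the default of keeping 7 archives are my assumptions, not a recommendation from any particular tool, and you would run it from cron or similar.

```python
import shutil
import tempfile
import time
from pathlib import Path


def backup_site(site_dir, backup_dir, keep=7):
    """Archive site_dir into backup_dir and keep only the newest `keep` archives.

    backup_dir should live outside site_dir, otherwise each archive
    would end up containing the previous archives.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    site_dir = Path(site_dir).resolve()
    backup_dir = Path(backup_dir).resolve()
    backup_dir.mkdir(parents=True, exist_ok=True)

    # Timestamped name, so archives sort oldest-to-newest alphabetically.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    base = backup_dir / f"site-{stamp}"
    archive = Path(shutil.make_archive(str(base), "gztar", root_dir=str(site_dir)))

    # Prune everything except the newest `keep` archives.
    archives = sorted(backup_dir.glob("site-*.tar.gz"))
    for old in archives[:-keep]:
        old.unlink()
    return archive


if __name__ == "__main__":
    # Self-contained demo in a throwaway directory (hypothetical content).
    with tempfile.TemporaryDirectory() as tmp:
        site = Path(tmp) / "site"
        site.mkdir()
        (site / "index.html").write_text("<h1>hello</h1>")
        print("wrote", backup_site(site, Path(tmp) / "backups"))
```

It is still a local copy, so it protects you from fat fingers and bad deploys rather than from losing the disk, which is exactly the trade-off described above.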
As time goes on you can always add in additional sites and better backups. Always go for the solution that gives you the best bang for your buck, e.g. if you need off-site backups, why not run two sites in high availability and do local backups in each? Each site’s local backup is then off-site as far as the other site is concerned, and you get more throughput and better resilience.
You cannot mitigate everything, so don’t try. Look at what really is important and make sure you can recover. Have a plan so that if customer numbers hit X or the solution’s profitability reaches Y% you’ll add in the additional risk mitigation.
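That plan can be written down as plainly as a list of triggers. A rough Python sketch follows; the thresholds and mitigation names are invented for illustration, so substitute your own X and Y%.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trigger:
    min_customers: int      # the "X" in the plan (hypothetical value)
    min_margin_pct: float   # the "Y%" in the plan (hypothetical value)
    mitigation: str


# Hypothetical growth plan: each step is agreed up front, not after an outage.
PLAN = [
    Trigger(0, 0.0, "regular local backups"),
    Trigger(1000, 5.0, "off-site backups"),
    Trigger(10000, 15.0, "second site in high availability"),
]


def mitigations_due(customers, margin_pct, plan=PLAN):
    """Return every mitigation whose customer OR profitability threshold is met."""
    return [
        t.mitigation
        for t in plan
        if customers >= t.min_customers or margin_pct >= t.min_margin_pct
    ]


if __name__ == "__main__":
    print(mitigations_due(customers=1500, margin_pct=2.0))
```

Either threshold being met is enough here, matching the "X or Y%" wording; if you would rather require both, swap the `or` for `and`.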