So on the 29th we saw a minor issue with a couple of servers' EBS volumes suffering; luckily we identified and fixed the issue quickly by removing the nodes from the clusters. Well, with that problem dealt with, on with a restful weekend… not so much.
During Saturday we had a single minor incident, but on the whole we seemed to survive. Then, at some point in the early hours of Sunday, the 1st of July in the UK (I guess the 30th of June for the Americans), another issue. Here we go *sigh*. To be honest I typically wouldn't mind: it's a data centre, and they provide multiple regions for a reason, so you can mitigate this sort of thing. And rest assured, the availability zones are all separate. Well, separate-ish. Either way, they provide availability zones which are meant to be fully isolated from each other.
Wakey wakey, rise and shine
So luckily I was not on call, and some other team members dealt with the initial fallout. Five hours later my phone starts ringing, which is fine: it will do that as an escalation. On a side note, about 11pm the night before my PC just stopped working and suffered a kernel panic, so I lost DNS / DHCP, meaning no easy internet access. I rolled downstairs more or less right away and started playing with my mobile phone to set up a wireless hotspot: a wonderful 2 bars of GPRS. Thankfully it was enough for a terminal (or five).
It turns out that almost all of our servers were completely inaccessible. Now, we very much divide and conquer with our server distribution: each node is in a different AZ (Availability Zone), on the assumption that they should be fine. On a side note, I will write down some information I've learnt about Amazon for those starting a hosting experience with them, so they can avoid some of the pitfalls I've seen.
Anyway, back to the point I keep procrastinating from. We managed to bring the service back up, which wasn't difficult, but it did involve quite a bit of effort on our part. What I was able to spot was a high amount of IO wait on most instances, or at least on the ones I could get onto. In some cases a reboot was enough to kick it on its way, but on others a more drastic stop / start was needed.
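For anyone hitting similar symptoms, a quick way to confirm you're stalled on IO is to watch the iowait counters the kernel exposes. This is just a generic triage sketch, not the exact commands we ran:

```shell
# Rough triage for stuck-IO symptoms (a generic sketch, not our exact runbook).
# In /proc/stat, the 5th value after "cpu" is cumulative iowait jiffies;
# if it climbs rapidly between samples, the box is stalled waiting on IO.
awk '/^cpu /{print "iowait jiffies:", $6}' /proc/stat

# vmstat's "wa" column gives the same picture over time (3 samples, 5s apart):
# vmstat 5 3
```

If iowait is pegged and the disks back onto EBS, there's often little you can do from inside the instance, which matches what we saw: sometimes a reboot sufficed, sometimes only a stop / start (which moves the instance to fresh hardware) did.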
The annoying thing for me
Is that they have had outages of this kind in the US-East region in the past, time and time again. They obviously have issues with that region, and there are some underlying dependencies between AZs that aren't clear, like power, EBS, RDS and S3. Obviously these services need to be shared at some point to make them useful, but if they are integral to what you are doing, then simply putting your servers in another availability zone won't be good enough. For example, if you are using any instance… you are probably EBS-backed, and as we know, EBS is not infallible.
I do hope that they are looking to make some serious improvements to that region. We are certainly considering other options now and trying to work out the best way to mitigate these types of issues. If you are not heavily tied into US-East, I would suggest abandoning ship for everything apart from your most throw-away-able servers. I'm sure the other regions have their own issues as well, but at the same frequency?
The other thing that is puzzling is that while we were seeing these issues, Amazon claimed all was well. There is certainly a misunderstanding somewhere. I guess it could be that we were hit by the leap second bug, but all of our machines are on new enough kernels that they shouldn't have been affected; alas, they were. Either way, something happened, and it may remain a mystery forever.
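If it was the leap second bug, the workaround being passed around at the time was to stop ntpd and re-set the system clock, which reportedly cleared the stuck kernel timer state without a reboot. I'm noting it as an assumption about what would have helped, not something we confirmed ourselves (init script paths assume a Debian/Ubuntu-style box; run as root):

```shell
# Widely-reported leap-second workaround (assumption: Debian/Ubuntu-style
# init scripts; needs root). Re-setting the clock to its own current value
# was said to clear the stuck timer state without needing a reboot.
/etc/init.d/ntp stop
date -s "$(date)"       # set the clock to its current value
/etc/init.d/ntp start
```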