AWS Outage

Not again

Yes, again, another Amazon outage. In fact, their reports are a little misleading and much more forgiving than the truth. For some background on the rant, look Here, and for the official words: Here.

So on the 29th we saw a minor issue with a couple of servers' EBS volumes suffering. Luckily we identified and fixed the issue quickly by removing the affected nodes from the clusters. Well, with that problem dealt with, on with a restful weekend… not so much.

During Saturday we had a single minor incident, but on the whole we seemed to survive. At some point in the early hours of Sunday, the 1st of July in the UK (I guess the 30th of June for the Americans), another issue hit, here. *sigh* To be honest I typically wouldn't mind: it's a data centre, and they provide multiple regions for a reason so you can mitigate this sort of thing. Rest assured, the availability zones are all separate… well, separate-ish. Either way, they provide availability zones which are meant to be fully isolated from each other.

Wakey wakey, rise and shine

So luckily I was not on call, and some other team members dealt with the initial fallout. Five hours later my phone starts ringing, which is fine, it will do that as an escalation. On a side note, at about 11pm the night before my PC had stopped working and suffered a kernel panic, so I lost DNS / DHCP and with it any easy internet access. I rolled downstairs more or less right away and started playing with my mobile phone to set up a wireless hotspot: wonderful, 2 bars of GPRS, but thankfully it was enough for a terminal (or five).

It turns out that almost all of our servers were completely inaccessible. Now, we very much divide and conquer with our server distribution, so each node is in a different AZ (Availability Zone) on the assumption that they should then be fine. On a side note, I will write down some information I've learnt about Amazon for those starting a hosting experience with them, so they can avoid some of the pitfalls I've seen.
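As an aside, a quick way to sanity-check that your nodes really are spread across zones is simply to ask the API. Here is a minimal sketch using boto3 (the current AWS SDK for Python; back then it would have been boto or the EC2 command-line tools), with the region purely as an assumption:

```python
# Sketch: list running instances grouped by availability zone,
# to confirm the nodes really are spread out. Region is an assumption.
from collections import defaultdict
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

by_az = defaultdict(list)
paginator = ec2.get_paginator("describe_instances")
for page in paginator.paginate(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
):
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            az = instance["Placement"]["AvailabilityZone"]
            by_az[az].append(instance["InstanceId"])

for az, ids in sorted(by_az.items()):
    print(f"{az}: {len(ids)} instances -> {', '.join(ids)}")
```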

Anyway, back to the point I keep procrastinating from. We managed to bring the service back up, which wasn't difficult, but it did involve quite a bit of effort on our part. What I was able to spot was a high amount of IO wait on most instances, or at least on the ones I could get onto. In some cases a reboot was enough to kick it on its way, but on others a more drastic stop / start was needed.
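For what it's worth, the reason a stop / start is "more drastic" is that, for an EBS-backed instance, it usually lands you on different underlying hardware, whereas a reboot keeps you on the same host. A rough sketch of that dance via boto3, with the instance ID as a placeholder:

```python
# Sketch: stop / start an EBS-backed instance via the API when a reboot
# isn't enough. The instance ID is a placeholder.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
instance_id = "i-0123456789abcdef0"  # placeholder

ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

ec2.start_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

print(f"{instance_id} stopped and started again")
```

Bear in mind a stop / start changes the instance's public IP unless you're using an Elastic IP, and any ephemeral storage is lost, so it isn't something to script blindly.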

The annoying thing for me

Is that they have had outages of this kind in the US-East region in the past, Time and Time Again. They obviously have issues with that region, and there are some underlying dependencies between AZs that aren't clear, like power, EBS, RDS and S3. Obviously these services need to be shared at some point to make them useful, but if they are integral to what you are doing then simply putting your servers in another availability zone won't be good enough. For example, if you are using any instance… you are probably EBS-backed, and as we know EBS is not infallible.
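If you're not sure whether your instances fall into that camp, the root device type is right there in the instance description. A quick sketch, again with boto3 and the region as an assumption:

```python
# Sketch: flag which instances are EBS-backed (root device on EBS),
# and therefore exposed to EBS trouble even if you never attached a volume.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

for page in ec2.get_paginator("describe_instances").paginate():
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            print(
                instance["InstanceId"],
                instance["RootDeviceType"],           # 'ebs' or 'instance-store'
                instance.get("RootDeviceName", "n/a"),
            )
```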

I do hope that they are looking to make some serious improvements to that region. We are certainly considering other options now and trying to work out the best way to mitigate these types of issues. If you are not heavily tied into US-East I would suggest abandoning ship, apart from your most throw-away-able servers. I'm sure the other regions have their own issues as well, but at the same frequency?

The other thing that is puzzling is that when we saw the issues, Amazon claimed all was well. There is certainly a misunderstanding somewhere. I guess it could be that we were hit by the leap second bug, but all of our devices are on new enough kernels that they shouldn't have been affected, yet alas they were. Either way, something happened and it may remain a mystery forever.

Amazon: one simple outage, one day of hell

ARRRGHHH!!!

What a horrible day for me. For those of you that didn't notice… Amazon had another outage on the East coast; see Here for details. Well, I can honestly say it was not a pleasurable day for myself. I had just got into the office, about to get my first coffee of the day at 8:15 BST, when we had a system down alert from our cloud product. Arse. For those that read my other Posts, you'll know I recommend coffee for any problem solving. However, first things first: stabilise the situation. Annoyingly, I was so concerned about the box which I couldn't become root on that I didn't do the simplest step, and it took my boss to remind me I could just remove the affected server from the Apache load balancer. As a result the service was affected for 14 mins; it probably would have been 10 if I'd have had my coffee!
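For the curious, "removing the server from the load balancer" was nothing clever: take the relevant BalancerMember out of rotation and reload Apache gracefully. A rough sketch of how you might script it, with the config path and member URL purely illustrative rather than our real setup:

```python
# Sketch: comment out a BalancerMember in an Apache proxy-balancer config
# and reload gracefully. The config path and member URL are illustrative.
import subprocess

CONF = "/etc/apache2/conf.d/balancer.conf"      # illustrative path
MEMBER = "http://app-node-3.internal:8080"      # illustrative member

with open(CONF) as f:
    lines = f.readlines()

with open(CONF, "w") as f:
    for line in lines:
        if line.lstrip().startswith("BalancerMember") and MEMBER in line:
            f.write("# " + line)                # take the node out of rotation
        else:
            f.write(line)

# Graceful reload: finish in-flight requests, pick up the new config.
subprocess.check_call(["apachectl", "-k", "graceful"])
```

If you have Apache's balancer-manager enabled, you can disable a member on the fly without touching the config at all, which is the less ham-fisted option.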

Just as I stabilised the service, a colleague turned up for the day and mentioned that Amazon had some issues; it turns out he found out by following other cloud-based companies such as Heroku, Rightscale and Netflix, a good strategy to employ later. I, of course, started to prioritise my workload as any good sysadmin does: grab coffee, fix problem, rinse & repeat until the desired results are obtained.

EBS Volumes missing

We have a number of EBS volumes; luckily only 1 of 130 was affected, but oh boy, didn't strange things happen. For starters, to gain root access to our box we had to restart it. When it came back up, one of the disks wasn't attached, which made it rather confusing to try to start Tomcat only to find the volume was no longer there (we mount /var on an EBS volume).

We identified that we had to re-enable IO on the volume and, as suggested, fsck the disks, so this is probably where things went horribly wrong.
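The "re-enable IO" step corresponds to an actual API action: when Amazon suspects a volume's data may be inconsistent it disables IO on it and leaves it to you to check the disk and turn IO back on. A sketch of what that looks like in boto3, with the volume ID as a placeholder:

```python
# Sketch: check a volume's status and re-enable IO if Amazon has disabled it.
# The volume ID is a placeholder.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
volume_id = "vol-0123456789abcdef0"  # placeholder

status = ec2.describe_volume_status(VolumeIds=[volume_id])
for vol in status["VolumeStatuses"]:
    print(vol["VolumeId"], vol["VolumeStatus"]["Status"])
    for detail in vol["VolumeStatus"]["Details"]:
        print("  ", detail["Name"], "=", detail["Status"])

# If the io-enabled check shows IO has been disabled, turn it back on
# (after deciding whether you want to fsck first).
ec2.enable_volume_io(VolumeId=volume_id)
```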

I tried using tune2fs to bump the disk's mount count above the threshold at which fsck kicks in, to force a fsck on reboot. It turns out this wasn't going to work; I'm not sure why, but it didn't seem to kick in. In the end we took the 3 volumes on this server and attached them to another server, from where we were able to check the disks, and only 1 had some minor issues. We re-attached them to the original node and hoped for the best…
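The detach-and-check-elsewhere dance looks roughly like this; IDs and device names are placeholders, and the fsck itself obviously runs on the helper instance, not via the API:

```python
# Sketch: move a suspect EBS volume to a healthy helper instance so it can
# be fsck'd there. Volume / instance IDs and device names are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
volume_id = "vol-0123456789abcdef0"   # placeholder suspect volume
helper_id = "i-0fedcba9876543210"     # placeholder healthy instance

# Detach from the broken node (use Force=True only as a last resort).
ec2.detach_volume(VolumeId=volume_id)
ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

# Attach to the helper; the requested device name may surface in the
# guest as /dev/xvdf or similar rather than /dev/sdf.
ec2.attach_volume(VolumeId=volume_id, InstanceId=helper_id, Device="/dev/sdf")
ec2.get_waiter("volume_in_use").wait(VolumeIds=[volume_id])

# Then, on the helper instance itself (not via the API), something like:
#   fsck -f /dev/xvdf
# and detach / re-attach to the original node once it comes back clean.
```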

So obviously it got worse

Now, those that have used AWS know that attaching disks can be challenging: in the API / console you tell a device to attach as /dev/sde, only to find it appears as /dev/xvdi. As a result we are meticulous with our disk mount points, and I ensured I attached them in the same order on the same box. So surely turning it on would be okay? Not so much. Our box came up, it passed all of the Amazon checks, and it was network available.
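One trick that helps with the sde-versus-xvdi confusion is to compare what the API thinks is attached with what the kernel actually sees; the instance metadata service exposes the former. A small sketch run on the instance itself (note that newer instances may insist on an IMDSv2 token first, which this sketch ignores):

```python
# Sketch: compare the block-device mapping the API reports (via instance
# metadata) with the devices the kernel actually sees, to catch the
# /dev/sde-vs-/dev/xvdi renaming.
import urllib.request

META = "http://169.254.169.254/latest/meta-data/block-device-mapping/"

def fetch(url):
    with urllib.request.urlopen(url, timeout=2) as resp:
        return resp.read().decode().strip()

print("API view (instance metadata):")
for key in fetch(META).splitlines():
    print(f"  {key} -> {fetch(META + key)}")

print("Kernel view (/proc/partitions):")
with open("/proc/partitions") as f:
    for line in f.readlines()[2:]:
        print("  " + line.split()[-1])
```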

For some reason we were getting connection refused from ssh, which is odd. I logged tickets with Amazon's support; this was pretty much a waste of time and effort. We spent a couple of hours discussing how it wouldn't have been possible for the OS to change its own firewall rules, until I resorted to mounting the server's root EBS volume on another instance, fsck'ing it and re-configuring iptables to allow any TCP connection. Guess what, it still didn't respond.

Somewhere along the line this box went from unresponsive, to working but missing a drive, to totally buggered. Not a good day. Luckily we are able to re-build a node and hook it back into the clusters / load balancers, but why should we need to do that? In a traditional data centre I would have logged onto the console and simply fixed the issue.

Summary

I have never liked the idea of cloud providers for running core services and have always felt they'd be better in a traditional hosted environment; the cloud is definitely a good place to float an idea or to try something out, but longer term, DIY. I'm not saying AWS isn't good, it has a purpose; bursting web traffic, or bursting for stateless servers, is one. Amazon do now have a console feature where, if you had attached a key to the server, you could use it to give you that local data-centre-type console. Unfortunately for us we do not deploy our Amazon boxes with any keys apart from user ones. Maybe this is something we will review; more likely we will increase the automation of our environment so a server is just a thing and not a name.
We will be doing this, and over the next few months I will start posting more details as we get it going; needless to say it is slightly bleeding edge and as a result is constantly in development.

Anyway, rant over, horrible day.