What a horrible day. For those of you who didn’t notice, Amazon had another outage on the East coast (see here for details). I can honestly say it was not a pleasurable day for me. I had just got into the office and was about to get my first coffee of the day at 8:15 BST when we had a system-down alert from our cloud product, Arse. Those who read my other posts will know I recommend coffee for any problem solving. However, first things first: stabilise the situation. Annoyingly, I was so concerned about the box I couldn’t become root on that I didn’t do the simplest step, and it took my boss to remind me I could just remove the affected server from the Apache load balancer. As a result the service was affected for 14 minutes; it probably would have been 10 if I’d had my coffee!
Just as I stabilised the service, a colleague turned up for the day and mentioned that Amazon had some issues. It turns out he found out by following other cloud-based companies such as Heroku, Rightscale and Netflix, a good strategy to employ in future. I of course started to prioritise my workload as any good sysadmin does: grab coffee, fix problem, rinse & repeat until the desired results are obtained.
EBS Volumes missing
We have a number of EBS volumes and luckily only 1 of 130 was affected, but oh boy did strange things happen. For starters, to gain root access to our box we had to restart it. When it came back up one of the disks wasn’t attached, which made for a rather confusing attempt to start Tomcat, only to find that it was no longer there (we mount /var on an EBS volume).
We identified that we had to re-enable IO and, as suggested, fsck the disks. This is probably where things went horribly wrong.
I tried using tune2fs to push the disk’s mount count above the threshold at which fsck kicks in, to force a fsck on reboot. It turns out this wasn’t going to work; I’m not sure why, but the check never seemed to kick in. In the end we took the 3 volumes on this server and attached them to another server. From there we were able to check the disks, and only 1 had some minor issues. We re-attached them to the original node and hoped for the best…
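One thing worth ruling out when a forced check never runs: the boot-time fsck normally skips any filesystem whose sixth /etc/fstab field (the pass number) is zero or missing, whatever the mount count says. This is only a guess at a possible cause, not something confirmed on the day. A minimal Python sketch of that check follows; the sample fstab contents and the function names are made up for illustration.

```python
def fsck_pass_numbers(fstab_text):
    """Return {mount_point: pass_number} for each entry in fstab-style text."""
    result = {}
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 2:
            continue  # not a usable entry
        # The sixth field (fs_passno) counts as 0 when it is absent.
        passno = int(fields[5]) if len(fields) >= 6 else 0
        result[fields[1]] = passno
    return result


def skipped_at_boot(fstab_text):
    """Mount points that the boot-time fsck would not check (pass number 0)."""
    return sorted(m for m, p in fsck_pass_numbers(fstab_text).items() if p == 0)


# Hypothetical fstab: device names and options are illustrative only.
SAMPLE_FSTAB = """\
# device      mount  type  options   dump  pass
/dev/xvda1    /      ext3  defaults  1     1
/dev/xvdi     /var   ext3  defaults  0     0
"""

if __name__ == "__main__":
    print(skipped_at_boot(SAMPLE_FSTAB))
```

If a volume shows up in that list, raising its mount count with tune2fs would not be enough on its own to get it checked at boot.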
So obviously it got worse
Now, those who have used AWS know that attaching disks can be challenging: in the API / console you tell a device to attach as /dev/sde, only to find it shows up as /dev/xvdi. As a result we are meticulous with our disk mount points, and I ensured I attached them in the same order on the same box. So surely turning it on would be okay. Not so much. Our box came up, it passed all of the Amazon checks and it was available on the network.
For some reason, though, we were getting connection refused from ssh, which was odd. I logged tickets with Amazon’s support, which was pretty much a waste of time and effort. We spent a couple of hours discussing how it wouldn’t have been possible for an OS to change its own firewall rules, until I resorted to mounting the root EBS volume for the server, fsck’ing it and re-configuring iptables to allow any TCP connection. Guess what, it still didn’t respond.
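To make the renaming concrete, here is a small Python sketch of the mapping implied by the example above, where /dev/sde turns up as /dev/xvdi (a shift of four letters). The offset is an assumption taken from that single example; other kernels may keep the letter unchanged, so it is a parameter rather than a rule. The function name is hypothetical.

```python
import string


def guest_device_name(requested, letter_offset=4):
    """Guess the device name the guest OS will show for a requested /dev/sdX.

    letter_offset=4 matches the /dev/sde -> /dev/xvdi example; pass 0 for
    systems that only swap the 'sd' prefix for 'xvd'.
    """
    prefix = "/dev/sd"
    if not requested.startswith(prefix) or len(requested) != len(prefix) + 1:
        raise ValueError("expected a name like /dev/sde, got %r" % requested)
    letters = string.ascii_lowercase
    # str.index raises ValueError if the suffix is not a lowercase letter.
    shifted = letters.index(requested[-1]) + letter_offset
    if not 0 <= shifted < len(letters):
        raise ValueError("offset moves %r outside a-z" % requested)
    return "/dev/xvd" + letters[shifted]


if __name__ == "__main__":
    for name in ("/dev/sde", "/dev/sdf", "/dev/sdg"):
        print(name, "->", guest_device_name(name))
```

Writing the expected guest-side names down next to the requested ones makes it easier to confirm that each volume went back where it came from.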
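A quick way to tell what kind of failure this is: "connection refused" generally means something answered the connection attempt with a reset, so the host is reachable but nothing is listening on the port (or a rule is actively rejecting it), whereas a firewall that silently drops packets usually shows up as a timeout. The Python sketch below tells those cases apart; the function name and the placeholder address in the example are illustrative only.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Classify a TCP connection attempt as open, refused, timeout or error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # A reset came back: reachable host, but nothing accepting on the port.
        return "refused"
    except socket.timeout:
        # No answer at all: typical of packets being dropped on the way.
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. no route to host or a name lookup failure.
        return "error: %s" % exc


if __name__ == "__main__":
    # 192.0.2.10 is a documentation address; substitute the real instance.
    print(probe_tcp("192.0.2.10", 22))
```

A "refused" result on port 22 would point more towards sshd not running than towards the firewall, which would fit with opening up iptables making no difference.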
Somewhere along the line this box went from unresponsive, to working but missing a drive, to totally buggered. Not a good day. Luckily we are able to rebuild a node and hook it back into the clusters / load balancers, but why should we need to do that? In a traditional data centre I would have logged onto the console and simply fixed the issue.
I have never liked the idea of cloud providers for running core services and have always felt they’d be better off in a traditional hosted environment. The cloud is definitely a good place to float an idea or to try something out, but longer term, DIY. I’m not saying AWS isn’t good; it has a purpose, and bursting web traffic or bursting for stateless servers is one. Amazon do now have a console feature which, if you had attached a key to the server, you could use to get that local data-centre-type console. Unfortunately for us, we do not deploy our Amazon boxes with any keys apart from user ones. Maybe this is something we will review; more likely we will increase the automation of our environment so a server is just a thing and not a name.
We will be doing this, and over the next few months I will start posting more details as we get it going. Needless to say, it is slightly bleeding edge and as a result is constantly in development.
Anyway, rant over, horrible day.