Architecting the Cloud
In this post I will go over some best practice to help you architect a solution that will hopefully survive most amazon incidents. To start with, let’s look at a single region and how to make the best use of a region.
Starting with the most basic steps first, you want to have each instance created be as stateless as possible and as light weight as possible. Ideally you would use instance-store backed instances as these do not rely on EBS to be working, so you are reducing your dependancy on the Amazon infrastructure and one less dependancy is one less thing to go wrong. If you can not avoid the use of an EBS backed instance then you will want to be ensuring that you have multiple instances providing the same service.
Also consider the use of your service, S3 is slow for you to download data and then share out again, but you could push the handling of the access off to Amazon helping make your environment a bit more stateless. it is also worth noting that there have been far fewer issues with S3 than EBS. Obviously if you need the capacity of EBS (S3 has a single file size limit of 5TB) then RAID the drives together for data storage. You can not do this for your instance storage but at least your data will be okay.
On a side note out of 200+ volumes during a recent outage we only had one with issues so they are quite reliable, although some times slow, however if your aim is ultimate uptime you should not rely on it.
As I pointed out before, your main storage types are EBS and S3, EBS is block device storage and as a result is just another hard drive for you to manage, you can set RAID on them or leave them as single disks. Then there is S3 which is a key value store which is accessed via a REST API to get the data.
With EBS and S3 it is never stated anywhere that your data is backed up. Your data is your responsibility, if you need a backup you ned to take snapshots of the data and if you want an “off site” equivalent you would need to make sure you have the EBS snapshot replicated to another region, the same applies for S3.
A big advantage of EBS is the speed to write and read from it, if you have an application that requires large amounts of disk space then this si your only real option without re-architecting.
S3 is Simple, hence the name, as a result it very rarely goes wrong but it does have a lot more limitations around it compared to EBS. One of them is down to the reliability, it won’t send a confirmation that the data has written until it has been written to two AZs, for light usage, and non time dependant work this is probably your best choice. S3 is ideal however for data that is going to be read a lot, one reason is it can easily be pushed into cloud front (A CDN) and as a result you can start offloading the work from your node.
In short where possible don’t store anything, if you do have to store it try S3 so you can offload the access if that is not adequate then fall back to EBS and make sure you have decent snapshots and be prepared for it to fail miserably.
RDS is a nice database service that will take care of a lot of hassel for you and I’d recommend that is used or DynamoDB. RDS is can be split across multiple AZs and the patch management is taken care of for you which leaves you to just configure any parameters you want and point your data to it. There are limitations with RDS of 1TB of database storage but in most cases I’d hope the application could deal with this some how else you are left trying to run a highly performant database in Amazon at which point you are on your own.
Unless of course you can make use of a non-rational database such as DynamoDB which is infinitely scalable and performant and totally managed for you. Sounds too good to be true, well of course, it is a non rational database and the scalability speed is limited, at the present moment in time you can only double the size and speed of your dynamoDB once per day, so if you are doing a marketing campaign you have to take this int account days in advance possibly.
Hopefully by this point you have chosen the write type of instance and storage locations for any data, leaving you the joys of thinking about resilience. At a bear minimum you will want to have servers that provide the same functionality spread out across multiple AZs and some sort of balancing mechanism, be it an ELB, round robin DNS or latency based DNS.
The more availability zones your systems are in the more likely you are to be able to cope with any incidents, ideally you would take care of this through the use of auto scaling, that way if there is a major incident it will bring nodes back for you.
Having instances in multiple AZs will protect you in about 80% of cases, however, EBS and S3, although spread across multiple AZ’s are a single point of failure and I have seen issues where access to EBS backed instances is incredibly slow across a number of servers, in my case 50% of servers across multiple availability zones were all affected by accessibility of the data. So your service can not rely on a single region for reasons like this. One of the reasons I believe for this is when EBS fails there is some sort of auto recovery which can flood the network and cause some disruption to other instances.
A little known fact about AZs is that every client’s AZ is different. If you have 2 accounts with Amazon you may well get presented different AZs but even those with the same name may in fact be in different AZs and visa-versa.
With all of the above you can run a quite successful service all in one region with a reasonable SLA, if you are lucky to not have any incidents. At the very least you should consider making your backups into another region. multiple regions much like multiple data centres are difficult, especially when you have no control over the networking, this leaves you in a bit of a predicament. You can do latency based routing within Route53 or weighted Round Robin, in this case, assume a region is off line your traffic should be re-routed to the alternative address.
Things to watch out for
Over the months we’ve been hosting on AWS there’s been a number of occasions where things don’t work the way you expect them too and the aim of this section is to give you some pointers to save you the sorrow.
There has been a number of occasions where an instance has stopped working with no good reason, all of a sudden the network may drop a few packets, the IO wait may go high or just in general it is not behaving the way it should. In these situations, the only solution is to stop and start the instance, a little known fact is that the stop and start process will ensure that your instance is on hardware with the latest software updates. However, I have been told by AWS support that new instances may end up on hardware that is not optimal so as a result you should always stop and start new instances.
In severe cases Amazon will mark a node in a degraded state, but I believe they will only do this after a certain percentage of instances have migrated over or it has been degraded for a while.
Scaling up instance size
This is an odd one, predominantly because of a contradiction. You can easily scale up any instance by stopping it in the web gui and changing it’s size on the right click menu. This is good, you can have a short period of downtime and have a much larger instance, the downside being your IP and DNS will change as it is a stop and start. However, if you had deployed your instance via Cloud formations it would be able to scale up and down on the fly with a cloud formations script change.
Security with ELBs
With security groups you can add TCP, ICMP or UDP access rules to a group from another security group or from a network range thus securing instances in the same way a perimeter firewall would. However, this doesn’t guarantee security specifically if you then add an ELB to the front end. With ELB’s you do not know what the network would or could be for them so you ultimately would need to open up full access just to get the ELB to talk to your host. Now, amazon will allow you to add a special security group that will basically grant the ELB’s full access to your security group and as a result you have guaranteed that access is now secured, in the most part.
However, ELB’s are by their nature publicly accessibly, so what do you do if you’re in EC2 and want to secure your ELB which you may need to load balance some traffic. Well Nothing. The only option available for you in this situation is to use a ELB within a VPC which gives you that ability to apply security groups to the ELB.
There are ways to architect around this using apache but this does depend on your architecture and how you intend to use the balancer.
Everything will fail
Don’t rely on anything to be available, if you make use of the API to return results expect it to fail or to not be available. One thing we do is cache some of the details locally and add some logic around the data so that if it’s not available it continues to work.. The same principle aplies to each and every server / service you are using, where possible just expect it to not be there, if it has to be there at least make sure it fails gracefully.