AWS best practice – Architecting the cloud

Architecting the Cloud

In this post I will go over some best practices to help you architect a solution that will hopefully survive most Amazon incidents. To start with, let’s look at a single region and how to make the best use of it.

Instances

Starting with the most basic steps, you want each instance you create to be as stateless and as lightweight as possible. Ideally you would use instance-store backed instances, as these do not rely on EBS to be working; that reduces your dependency on the Amazon infrastructure, and one less dependency is one less thing to go wrong. If you cannot avoid using an EBS backed instance then you will want to ensure that you have multiple instances providing the same service.

Also consider how your service uses storage. S3 is slow if you have to download data from it and then share it out again yourself, but you could push the handling of access off to Amazon, which helps make your environment a bit more stateless. It is also worth noting that there have been far fewer issues with S3 than EBS. Obviously if you need the capacity of EBS (S3 has a single file size limit of 5TB) then RAID the volumes together for data storage. You cannot do this for your instance storage but at least your data will be okay.

On a side note, out of 200+ volumes during a recent outage we only had one with issues, so they are quite reliable, although sometimes slow. However, if your aim is ultimate uptime you should not rely on EBS.

Storage

As I pointed out before, your main storage types are EBS and S3. EBS is block device storage and as a result is just another hard drive for you to manage; you can set up RAID across volumes or leave them as single disks. S3 is a key value store which is accessed via a REST API.

With EBS and S3 it is never stated anywhere that your data is backed up. Your data is your responsibility: if you need a backup you need to take snapshots of the data, and if you want an “off site” equivalent you need to make sure the EBS snapshot is replicated to another region. The same applies to S3.
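As a rough sketch of the “off site” snapshot copy, the function below assumes the boto3 SDK (not something this post uses) and its EC2 `copy_snapshot` call, which is issued against a client bound to the destination region. The region names and snapshot ID in the usage note are placeholders.

```python
def copy_snapshot_offsite(dest_ec2_client, source_region, snapshot_id):
    """Copy an EBS snapshot into the region the destination client is bound to.

    dest_ec2_client is assumed to be a boto3 EC2 client created for the
    backup region, e.g. boto3.client("ec2", region_name="eu-west-1").
    """
    response = dest_ec2_client.copy_snapshot(
        SourceRegion=source_region,
        SourceSnapshotId=snapshot_id,
        Description="Off-site copy of %s from %s" % (snapshot_id, source_region),
    )
    # The copy gets its own snapshot ID in the destination region.
    return response["SnapshotId"]
```

With a real client this might be called as `copy_snapshot_offsite(boto3.client("ec2", region_name="eu-west-1"), "us-east-1", "snap-0123456789abcdef0")`. Passing the client in keeps the function easy to exercise with a stand-in object.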

A big advantage of EBS is the speed of reading from and writing to it. If you have an application that requires large amounts of disk space then this is your only real option without re-architecting.

S3 is Simple, hence the name, and as a result it very rarely goes wrong, but it does have a lot more limitations than EBS. One of them comes from its reliability: it won’t send a confirmation that the data has been written until it has been written to two AZs. For light usage and non time-dependent work it is probably your best choice. S3 is also ideal for data that is going to be read a lot; one reason is that it can easily be pushed into CloudFront (a CDN), so you can start offloading the work from your node.

In short, where possible don’t store anything. If you do have to store it, try S3 so you can offload the access. If that is not adequate then fall back to EBS, make sure you have decent snapshots and be prepared for it to fail miserably.
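One way to offload access to S3 is to hand clients a time-limited link to the object instead of proxying the download through your own node. This is only a sketch: it assumes the boto3 SDK (not mentioned in this post) and its `generate_presigned_url` client method, and the bucket and key names are up to you.

```python
def link_for_download(s3_client, bucket, key, valid_seconds=3600):
    """Return a time-limited URL so the client fetches the object straight from S3.

    s3_client is assumed to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    return s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=valid_seconds,
    )
```

Your node then only has to decide who gets a link; the transfer itself is Amazon’s problem.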

Database Storage

RDS is a nice database service that will take care of a lot of hassle for you, and I’d recommend using it or DynamoDB. RDS can be split across multiple AZs and the patch management is taken care of for you, which leaves you to just configure any parameters you want and point your data at it. RDS has a limit of 1TB of database storage, but in most cases I’d hope the application could deal with this somehow; otherwise you are left trying to run a highly performant database in Amazon yourself, at which point you are on your own.

Unless, of course, you can make use of a non-relational database such as DynamoDB, which is infinitely scalable, performant and totally managed for you. Sounds too good to be true? Well of course: it is a non-relational database and the speed at which it scales is limited. At the present moment you can only double the size and speed of your DynamoDB once per day, so if you are doing a marketing campaign you may have to take this into account days in advance.
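To see what the once-per-day doubling means for planning, here is a small worked calculation. The capacity numbers are made up for illustration; only the doubling rule comes from the limit described above.

```python
def days_to_reach(current_capacity, target_capacity):
    """Count the daily doublings needed to grow current_capacity to at least target_capacity."""
    if current_capacity <= 0:
        raise ValueError("current_capacity must be positive")
    days = 0
    capacity = current_capacity
    while capacity < target_capacity:
        capacity *= 2  # one doubling per day
        days += 1
    return days


# Hypothetical campaign: 100 units today, 1000 needed on launch day.
# 100 -> 200 -> 400 -> 800 -> 1600, so scaling has to start 4 days ahead.
print(days_to_reach(100, 1000))
```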

Availability Zones

Hopefully by this point you have chosen the right type of instance and storage locations for any data, leaving you the joys of thinking about resilience. At a bare minimum you will want servers that provide the same functionality spread across multiple AZs with some sort of balancing mechanism, be it an ELB, round robin DNS or latency based DNS.

The more availability zones your systems are in, the more likely you are to be able to cope with any incidents. Ideally you would take care of this through the use of auto scaling; that way if there is a major incident it will bring nodes back for you.
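Auto scaling does the placement for you, but the idea of spreading identical servers evenly over zones can be shown in a few lines. The instance and zone names below are placeholders, and this is an illustration of the layout rather than anything AWS runs.

```python
from itertools import cycle


def spread_across_zones(instance_names, zones):
    """Assign instances to zones round robin so zone sizes differ by at most one."""
    if not zones:
        raise ValueError("at least one availability zone is required")
    placement = {zone: [] for zone in zones}
    for name, zone in zip(instance_names, cycle(zones)):
        placement[zone].append(name)
    return placement


print(spread_across_zones(
    ["web1", "web2", "web3", "web4", "web5"],
    ["eu-west-1a", "eu-west-1b", "eu-west-1c"],
))
```

Losing any one zone in this layout takes out at most two of the five web servers.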

Having instances in multiple AZs will protect you in about 80% of cases. However, EBS and S3, although spread across multiple AZs, are a single point of failure, and I have seen issues where access to EBS backed instances was incredibly slow across a number of servers; in my case 50% of servers across multiple availability zones were affected by accessibility of the data. So your service cannot rely on a single region, for reasons like this. One of the causes, I believe, is that when EBS fails there is some sort of auto recovery which can flood the network and cause disruption to other instances.

A little known fact about AZs is that every client’s AZ mapping is different. If you have 2 accounts with Amazon you may well be presented with different AZs, and even those with the same name may in fact be different AZs, and vice versa.

Regions

With all of the above you can run quite a successful service all in one region with a reasonable SLA, if you are lucky enough not to have any incidents. At the very least you should consider making your backups into another region. Multiple regions, much like multiple data centres, are difficult, especially when you have no control over the networking, and this leaves you in a bit of a predicament. You can do latency based routing within Route53 or weighted round robin; in this case, if a region goes offline your traffic should be re-routed to the alternative address.
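The weighted round robin behaviour can be sketched locally. This is not how Route53 is implemented or configured, just the selection logic: pick a region in proportion to its weight and skip any region that is marked unhealthy. The region names, weights and health flags are hypothetical inputs.

```python
import random


def pick_region(weights, healthy, rng=random):
    """Pick a region by weight, ignoring regions that are not marked healthy.

    weights: dict of region name -> positive weight
    healthy: dict of region name -> bool
    """
    candidates = {r: w for r, w in weights.items() if healthy.get(r, False) and w > 0}
    if not candidates:
        raise RuntimeError("no healthy region available")
    regions = list(candidates)
    return rng.choices(regions, weights=[candidates[r] for r in regions], k=1)[0]


weights = {"us-east-1": 3, "eu-west-1": 1}
# With us-east-1 offline, every request goes to the alternative region.
print(pick_region(weights, {"us-east-1": False, "eu-west-1": True}))
```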

Things to watch out for

Over the months we’ve been hosting on AWS there have been a number of occasions where things didn’t work the way you would expect them to, and the aim of this section is to give you some pointers to save you the sorrow.

Instance updates
There have been a number of occasions where an instance has stopped working for no good reason: all of a sudden the network may drop a few packets, the IO wait may go high, or in general it is just not behaving the way it should. In these situations the only solution is to stop and start the instance. A little known fact is that the stop and start process will ensure that your instance is on hardware with the latest software updates. However, I have been told by AWS support that new instances may end up on hardware that is not optimal, so you should always stop and start new instances.

In severe cases Amazon will mark a node in a degraded state, but I believe they will only do this after a certain percentage of instances have migrated over or it has been degraded for a while.

Scaling up instance size

This is an odd one, predominantly because of a contradiction. You can easily scale up any instance by stopping it in the web GUI and changing its size from the right click menu. This is good: you have a short period of downtime and end up with a much larger instance, the downside being that your IP and DNS will change as it is a stop and start. However, if you had deployed your instance via CloudFormation it would be able to scale up and down on the fly with a CloudFormation script change.
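The same stop, resize, start cycle can be scripted instead of clicked through. This is a minimal sketch assuming the boto3 SDK (not something this post uses) and its EC2 client calls `stop_instances`, `modify_instance_attribute`, `start_instances` and the `instance_stopped` / `instance_running` waiters. The instance ID and type are whatever you pass in.

```python
def resize_instance(ec2_client, instance_id, new_type):
    """Stop an instance, change its type, and start it again.

    ec2_client is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    """
    ec2_client.stop_instances(InstanceIds=[instance_id])
    ec2_client.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    # The type can only be changed while the instance is stopped.
    ec2_client.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={"Value": new_type},
    )

    ec2_client.start_instances(InstanceIds=[instance_id])
    ec2_client.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```

Only EBS backed instances can be stopped, and as noted above the IP and DNS will change, so anything pointing at the old address needs updating afterwards.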

Security with ELBs
With security groups you can add TCP, ICMP or UDP access rules to a group from another security group or from a network range, thus securing instances in the same way a perimeter firewall would. However, this doesn’t guarantee security, specifically if you then add an ELB to the front end. With ELBs you do not know what their network would or could be, so you would ultimately need to open up full access just to get the ELB to talk to your host. Now, Amazon will allow you to add a special security group that basically grants the ELBs full access to your security group, and as a result access is now secured, for the most part.

However, ELBs are by their nature publicly accessible, so what do you do if you’re in EC2 and want to secure an ELB which you need to load balance some traffic? Well, nothing. The only option available to you in this situation is to use an ELB within a VPC, which gives you the ability to apply security groups to the ELB.

There are ways to architect around this using Apache but this does depend on your architecture and how you intend to use the balancer.
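Allowing traffic from a security group rather than a network range is expressed as an ingress rule whose source is a group. The sketch below assumes the boto3 SDK (not mentioned in this post), its `authorize_security_group_ingress` call, and a VPC where the ELB has its own security group; the group IDs are placeholders you would supply.

```python
def ingress_from_group(port, source_group_id):
    """Build an ingress rule allowing TCP on one port from another security group."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "UserIdGroupPairs": [{"GroupId": source_group_id}],
    }


def allow_elb(ec2_client, instance_group_id, elb_group_id, port=80):
    """Let members of the ELB's security group reach the instances on one port.

    ec2_client is assumed to be a boto3 EC2 client.
    """
    ec2_client.authorize_security_group_ingress(
        GroupId=instance_group_id,
        IpPermissions=[ingress_from_group(port, elb_group_id)],
    )
```

The instances then accept traffic on that port only from the balancer, not from the whole network it might live in.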

Everything will fail
Don’t rely on anything to be available. If you make use of the API to return results, expect it to fail or to not be available. One thing we do is cache some of the details locally and add some logic around the data so that if the API is not available things continue to work. The same principle applies to each and every server / service you are using: where possible just expect it to not be there, and if it has to be there at least make sure it fails gracefully.
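The local caching idea can be sketched as a small wrapper that remembers the last good answer and serves it when the call fails. This is an illustration rather than the code we run; `CachedLookup` and its parameters are names made up for the example.

```python
import time


class CachedLookup:
    """Wrap an unreliable call and serve the last good answer when it fails."""

    def __init__(self, fetch, max_age_seconds=300, clock=time.time):
        self._fetch = fetch              # the call that may fail, e.g. an API query
        self._max_age = max_age_seconds
        self._clock = clock
        self._value = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        fresh = self._fetched_at is not None and now - self._fetched_at < self._max_age
        if fresh:
            return self._value
        try:
            self._value = self._fetch()
            self._fetched_at = now
        except Exception:
            # The API is unavailable: keep working from the stale copy if we have one.
            if self._fetched_at is None:
                raise
        return self._value
```

If there has never been a successful call there is nothing to fall back on, so the error is passed up and the caller decides how to fail gracefully.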

AWS best practice – Introducing Amazon

Introducing Amazon

Last week I introduced the Cloud; if you missed it and feel the need to have a read you can find it here. Now on with Introducing Amazon…

I’m not really going to introduce all of Amazon, as Amazon release a lot of new features each month, but I will take you through some of the basics that Amazon offer so that when you’re next confronted with them it is not a confusing list of terms. I won’t go into any of the issues you may face as that is a later topic.

EC2 Elastic Compute Cloud, this is more than likely your entry point. It is in short a virtual platform to provide you with an OS; instances come in various shapes, sizes and flavours. For more information on EC2 click here

ELB Elastic Load Balancer, this is used to balance web traffic or TCP traffic depending on which type you get (layer 7 or layer 4). An ELB is typically used to front your web servers that are in different Availability Zones (AZ), and it can do SSL termination.

Security Groups These are quite simply containers that your EC2 instances live in and you can apply security rules to them. However, two instances in the same security group will not be able to talk to each other unless you have specifically allowed them to do so in the security group. It is this functionality that separates a security group from being considered a network, that and the fact each instance is in a different subnet.

EIP Elastic IP, These are public IP addresses that are static and can be assigned to an individual EC2 Instance, they are ideal for public DNS to point to.

EBS Elastic Block Store, in short a disk array attached to your EC2 instance. EBS volumes are persistent disk stores; most EC2 instances are EBS backed and are therefore persistent. However, you can mount ephemeral disk drives that are local storage on the virtual host; these disk stores are non-persistent, so if you stop / start an instance the data will be lost (it will survive a reboot).

S3 Simple Storage Service, S3 is a simple key value store, but one that can contain keys that are folders, and the value can be anything: text files, Word docs, ISOs, HTML pages etc. You can use S3 as a simple web hosting service if you just upload all of your HTML to it and make it public. You can also push S3 data into a CDN (CloudFront). There are some nice security options around accessibility permissions and at rest encryption for your S3 buckets. An S3 bucket is just the term to describe where your data ends up and is the name of the S3 area you create.

IAM Identity and Access Management, this is a very useful service that will allow you to take the original account you used to sign up to Amazon with and lock it away for eternity. You can use IAM to create individual accounts for users or services and create groups to contain the users in. With users and groups you can use JSON to create security policies that grant the user or group specific access to specific services in specific ways.

VPC Virtual Private Cloud, this is more or less the same service you get via EC2 but private. There are some interesting elements of it that are quirky to say the least, but you can create your own networks, making your services private from the greater Amazon network, and you can still assign EIPs if you so wish. Most services, but not all, are available within a VPC and some features are only available in VPCs, such as security groups on ELBs.

AZ Availability Zones, these are essentially data halls, or areas of racks that have independent cooling and power but are not geographically dispersed, i.e. an AZ can be in the same building as another. Amazon’s description is as follows: “Availability Zones are distinct locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same Region”. This will be touched on later.

Region A region is a geographically dispersed Amazon location. It could be in another country, it could be in the same country; I’d imagine that all are at least 30 miles apart, but Amazon are so secretive about everything it could be that building behind you.

If you want to know more about the products I would read the product page here. In next week’s post I’m going to start going into a bit of detail about architecting for the cloud and some design considerations that you should be aware of.
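To make the IAM entry above a little more concrete, this is roughly what one of those JSON security policies looks like. It is a minimal example that grants read-only access to a single S3 bucket; the bucket name is a placeholder and the helper function is made up for illustration.

```python
import json


def read_only_bucket_policy(bucket_name):
    """Build an IAM policy document allowing listing and reading of one S3 bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    "arn:aws:s3:::%s" % bucket_name,      # the bucket itself (for listing)
                    "arn:aws:s3:::%s/*" % bucket_name,    # the objects inside it
                ],
            }
        ],
    }


print(json.dumps(read_only_bucket_policy("example-bucket"), indent=2))
```

The printed JSON is what you would attach to a user or group.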

AWS best practices – Introducing cloud

Overview

With this series of posts over the next few weeks I am attempting to help those new to Amazon Web Services (AWS) get a step up and avoid some of the pitfalls that I have encountered. It is the sort of guide I would have been biting people’s hands off for when I was in the same boat. But before we go any further, a picture; people like pictures, you are people, so here’s a picture.

Money Tree 2

It’s a picture of a tree trunk which over the years has been embedded with coins. People walk by, they see that someone has pushed a coin in, so they do the same; rinse and repeat for 10 years or so and you end up with several trees like this one. This is essentially what the Cloud is to people. 5 years ago, no one knew what the cloud was and no one cared; 3 years ago, people said “Hey, look at this”; 2 years ago people said the cloud was going to change the world; a year ago people said big business was adopting the cloud; and today I tell you not to without reading through this.

Although I am going to focus on AWS, the topics covered are more than likely relevant to other cloud providers, and I would encourage you to read through this to learn the foibles of the largest cloud provider, Amazon, so you can better understand the constraints they, and other cloud providers, may place on you. Now on with the post.

What is the Cloud

Apologies for the history lesson, feel free to breeze over…

In the beginning there was only one way: build a data centre, source your own power, cooling and network, and start filling it with disk arrays, high performance servers and networking equipment. I would label this the “Golden age”, but truth be told running your own data centre from the ground up cannot be easy.

As with everything, there was progression: some smart people noticed an opportunity and started to take over the management, so all you had to do was turn up with your servers, disk arrays and networking. This is co-location and is a good way of doing things; it is not as cheap as doing your own but takes a lot of hassle out of it.

Leading on from this, companies began to form that went one step further: they would provide your equipment for you so all you had to do was log in. All the disk and network worries were taken care of and they would help you on your way, of course charging a premium for the service. Moving on from this, but in the same area of hosted services, are the almost fully managed solutions where they do everything: you give them an application and they make it happen, which is great if you don’t have an IT team.

So, getting onto more recent times, virtualisation has really taken off in the last 10 years despite being around for longer. I believe the big drive for this came after the “.com bubble” burst back in the early 2000s, when companies were looking for ways to save costs on their hosted or co-location services. One of the ways this was achieved was through virtualisation such as Xen and VMware. In most cases the equipment was run and managed by the company and people over allocated or under allocated memory and CPU depending on their use profiles, with all sorts of redundancy.

As you can see from all of this, the constant push is to reduce costs. Granted, running your own data centre is the cheapest way, but you will need a few thousand servers to make that so. Faced with this problem, a company called Amazon, who you have probably heard of (they run a web shop, by the way), noticed that even with all of their virtualisation technology they still needed a large percentage of their servers just for 2-3 weeks of their business each year; the rest of the time the boxes sat idle. But what are you to do? You have to have the capacity for your peaks. Well, they worked it out and we have the cloud. I’m not 100% sure if they had the big idea but they certainly took the idea and ran with it.

The idea behind Cloud computing is a utility based cost of $X per hour. This comes across as a very cheap model, but as we’ll find out in the following posts it’s not that cheap and it depends heavily on your usage model. With the Cloud you now have the ability to choose how much disk you want and for how long, and how much CPU time you need. These are the joys of Cloud computing.
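A quick bit of arithmetic shows why the usage model matters so much. The hourly rate and server counts below are entirely hypothetical; the only thing taken from the story above is the shape of the load, a few weeks of peak each year.

```python
HOURS_PER_YEAR = 24 * 365
PEAK_HOURS = 24 * 21  # roughly three weeks of peak trading


def yearly_cost_cents(rate_cents_per_hour, servers, hours):
    """Utility pricing: pay per server, per hour it is running."""
    return rate_cents_per_hour * servers * hours


rate = 10  # hypothetical price in cents per server-hour

# Sized for the peak all year round: 10 servers always on.
always_on = yearly_cost_cents(rate, 10, HOURS_PER_YEAR)

# Sized for the baseline, with 6 extra servers only during the peak.
elastic = (yearly_cost_cents(rate, 4, HOURS_PER_YEAR)
           + yearly_cost_cents(rate, 6, PEAK_HOURS))

print(always_on / 100, elastic / 100)
```

With these made-up numbers the always-on setup costs more than twice as much as the elastic one. If your load is flat all year there is nothing to switch off, and the per-hour model stops looking cheap.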

Summary

That was a rather long introduction to the cloud, but with this understanding of the history behind it and how it was born you will now hopefully appreciate where it is going. I wouldn’t be surprised if most of the features Amazon release are just new ways of them making better use of their own applications and architecture, and then working out how to do that more times to cover the costs and offer it as a service.