In last week’s episode…
Last week I covered what the cloud is and what it isn’t: some very basic concepts that people seem to forget about when choosing to move to the cloud. This week my focus is on how to make the best use of the cloud, to save the most money and utilise the flexibility that it can provide.
How to make the best use of your cloud
To start with, here are some bullet points (quite a few), followed by some explanation around them…
- Understand the limitations of the environment and mitigate against them all
- Ensure your application can scale based on performance automatically
- Build in spare capacity (no more than 40% utilisation)
- Make it stateless
- Utilise the flexibility
- Carry out regular deployments of your environment and test DR plans
- Implement systems across multiple regions and availability zones
- Make use of the infrastructure tools for balancing traffic, caching data, storing data
- Know when to compromise
  - On security
  - On functionality
  - On performance
- Automation, Autonomy and Automatically
  - Automate the deployment and configuration of systems and applications
    - Puppet, Chef, Red Hat Satellite
    - Capistrano, MCollective, Fabric
  - Through automation of the system you can automatically scale the environment as performance and DR require, or deploy new environments and backups in response to monitoring actions
  - Autonomy of day-to-day tasks is important: the system needs to look after itself, and monitoring tools can help with this
    - Nagios, Swatch to react to log events or monitoring statuses
  - Make use of the different regions and automated snapshots
- Keep it simple
  - Each specific component of a service should be de-coupled
  - Where possible even functions within an application should be de-coupled
- Offload tasks
  - Just because you can do something doesn’t mean you should…
  - Utilise the cloud provider’s services where possible
You need to build in spare capacity at each stage within the solution, so that when the host running one of your critical systems is under load, the others are able to cope. As a result, the application needs to be stateless, and where possible transactions need to flow through all systems so that a weak performing node does not affect overall performance; ideally this means the scatter-gun approach to load balancing. As your tools and understanding improve, you may even start taking poorly performing nodes out of service and re-provisioning them.
Utilise the flexibility and scalability of the cloud infrastructure. Why waste your time trying to work out how to load balance the data when the provider can do it for you? Utilise the scalable storage and make use of all the redundancy they offer. This simplifies the task at hand to, hopefully, running a few OSes and your desired application, without the added hassle of clustered DBs.
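As a rough illustration of the spare-capacity rule, here is a minimal sketch that works out how many nodes keep each one at or below 40% utilisation, and what happens to utilisation when one node is lost. The function names and the traffic figures are made up for the example, and it assumes load is spread evenly across nodes, which real load balancers only approximate.

```python
import math


def nodes_needed(peak_load: float, node_capacity: float,
                 max_utilisation: float = 0.4) -> int:
    """Return the smallest node count that keeps each node at or below
    max_utilisation when peak_load is spread evenly across them."""
    if peak_load < 0 or node_capacity <= 0 or not 0 < max_utilisation <= 1:
        raise ValueError("load must be >= 0, capacity > 0, utilisation in (0, 1]")
    usable_per_node = node_capacity * max_utilisation
    return max(1, math.ceil(peak_load / usable_per_node))


def utilisation(peak_load: float, node_capacity: float, nodes: int) -> float:
    """Fraction of total capacity in use when load is spread evenly."""
    return peak_load / (node_capacity * nodes)


# Hypothetical numbers: 1000 requests/s at peak, 500 requests/s per node.
nodes = nodes_needed(1000, 500)
print(nodes)                              # 5
print(utilisation(1000, 500, nodes - 1))  # 0.5 with one node lost
```

With these example figures, losing one of the five nodes pushes the rest to 50%, which is above the target but still leaves plenty of headroom while a replacement is provisioned.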
You have to know when to compromise and when not to. Within a cloud environment you just don’t have the same control as you do on your own hardware. If you do not learn to compromise you will hit a wall that means your whole environment will need to be redeployed in a different way.
Automation is key. Everything should be automated: the deployment, the upgrades, the maintenance tasks. Automation leads to stability and reproducible results, which are necessary for rapid deployments and for offering a stable service.
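Reproducible results depend on automation steps being safe to run repeatedly. The sketch below shows that idea with a tiny idempotent step: it makes sure a line exists in a config file and reports whether it changed anything. It is only an illustration of the principle, not how Puppet or Chef are implemented; the helper name `ensure_line`, the file name and the setting are all hypothetical.

```python
import tempfile
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file at `path`.

    Returns True if the file was changed, False if it was already correct,
    so running the step twice leaves the same end state as running it once.
    """
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True


# Hypothetical config file and setting, created in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    conf = Path(tmp) / "app.conf"
    print(ensure_line(conf, "max_connections=100"))  # True: line was added
    print(ensure_line(conf, "max_connections=100"))  # False: nothing to do
```

Reporting "changed" versus "already correct" is also what lets a tool decide whether a follow-up action, such as a service restart, is needed.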
The important aspects of a cloud solution should always be simple; if the solution is overly complicated, it becomes hard for everyone to support. It is vital that the solution be only as technically challenging as it needs to be.
As many components as possible within a solution need to be isolated. We need to do this for performance, scalability and stability reasons.
We are not experts at everything. Luckily, the cloud providers often hire those experts, so for the sake of paying a few dollars we should always strive to use the built-in solutions where possible. If they are not suitable, let the cloud provider know what needs to be made better; they are often happy to help make improvements, although on their own timeline.
Automatic recovery and scalability
This is an area I’m really interested in: automation to a level where you don’t look at every problem yourself, but instead have a mechanism that flags recurring issues for further investigation and lets you fix known issues by executing certain scripts. It’s with this level of tooling that a handful of sysadmins can effectively look after thousands of servers.
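To make the idea concrete, here is a minimal sketch of such a mechanism: log lines are matched against known issues, the matching fix is run, and an issue that keeps coming back on the same host is flagged for a human to investigate. The class name `Remediator`, the threshold and the log pattern are assumptions for illustration; in practice the fixes would be scripts triggered by a tool such as Nagios or Swatch rather than Python callables.

```python
import re
from collections import Counter


class Remediator:
    """Run known fixes for recognised log lines and flag repeat offenders."""

    def __init__(self, known_fixes, flag_after=3):
        # known_fixes maps a regex string to a callable that takes the host name.
        self.known_fixes = [(re.compile(p), fix) for p, fix in known_fixes.items()]
        self.flag_after = flag_after
        self.occurrences = Counter()
        self.flagged = set()

    def handle(self, host, log_line):
        for pattern, fix in self.known_fixes:
            if pattern.search(log_line):
                key = (host, pattern.pattern)
                self.occurrences[key] += 1
                fix(host)  # apply the known fix every time
                if self.occurrences[key] >= self.flag_after:
                    self.flagged.add(key)  # recurring: needs investigation
                    return "fixed, flagged"
                return "fixed"
        return "unknown"  # no known fix: leave for a human


# Hypothetical fix: record which hosts had their web server restarted.
restarted = []
remediator = Remediator({r"httpd .* not responding": restarted.append}, flag_after=2)

print(remediator.handle("web01", "httpd on port 80 not responding"))  # fixed
print(remediator.handle("web01", "httpd on port 80 not responding"))  # fixed, flagged
print(remediator.handle("web01", "disk latency high"))                # unknown
```

Counting per host and per issue keeps a single noisy box from hiding behind an otherwise healthy fleet.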
- Utilise auto scaling for
- Active – Active or Active – Passive?
- Active – Active configuration provides the best reliability and most tested DR scenario
- Active – Passive configuration is easier to maintain and implement, but does not offer the same rewards in performance
- None. Why bother when you can rebuild the whole thing in an hour?
A lot of people will preach auto-scaling as the be-all and end-all for Amazon uptime. Auto scaling will deploy an image and fire it up, which has a few downsides. It means that every time a change is made to the OS you have to create a new image, update the auto-scaling configuration and (ideally) run some tests to make sure it all works. From an operational point of view that is not ideal: you spend so long going through a change process that releases start taking weeks, if not longer, even for the simplest of changes. Of course you could bypass the auto-scaling and make the changes to the boxes as and when, but do this at your peril; inconsistent service lies down this route, and that’s hard to troubleshoot and to explain to clients: “What do you mean each system isn’t identical…” This type of scaling is known as the full bake (everything exists in one AMI).
So how about the best of both? As you may recall, we mentioned Puppet in an earlier post. Using tools like this you can do what is known as a “from scratch” or “half baked” solution.
The from scratch solution means taking a stock AMI with nothing on it and using the configuration management tool to configure the OS and the application in one go. The downside is that it can take longer to build out a solution or to get a box up and working as part of the auto-scaling, which could mean that by the time the box has been provisioned the need for it has disappeared.
The half bake is about compromise. The OS and the application are both on the AMI to a reasonable level; from that point onwards the configuration tool just has to make sure the latest configuration is in place. This still requires the AMI to be kept up to date, but only when a new application version is released, not necessarily with every configuration change.
Don’t overreach in what you are trying to do; the most important aspect is simplicity. By keeping the solution simple to start with, you can do a lot of the funky scalability with a full baked solution. As time progresses and the needs develop, you can start implementing all of the other elements that will improve the solution.
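The trade-off between the three approaches comes down to how long a new box takes to become useful compared with how long the extra load lasts. The sketch below models that with entirely made-up timings; the `BuildStrategy` type and the numbers are assumptions for illustration, not measurements of any real AMI or configuration run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildStrategy:
    name: str
    boot_seconds: int       # time for the instance to boot from its image
    configure_seconds: int  # time for configuration management to finish

    @property
    def time_to_ready(self) -> int:
        return self.boot_seconds + self.configure_seconds


# Hypothetical timings, chosen only to show the shape of the trade-off.
STRATEGIES = [
    BuildStrategy("full bake", boot_seconds=60, configure_seconds=0),
    BuildStrategy("half bake", boot_seconds=60, configure_seconds=120),
    BuildStrategy("from scratch", boot_seconds=60, configure_seconds=900),
]


def ready_in_time(strategies, spike_seconds):
    """Names of strategies whose new box is serving before the spike ends."""
    return [s.name for s in strategies if s.time_to_ready < spike_seconds]


# For a ten-minute spike, the from scratch box arrives too late to help.
print(ready_in_time(STRATEGIES, 600))  # ['full bake', 'half bake']
```

Measuring your own boot and configuration times and plugging them in gives a quick sanity check on whether a given approach can react fast enough for your traffic patterns.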
There is but one more part to this cloud deployment 101 post spree but that will be at least another week away.