What challenges you?

Over the last few weeks

I have been wondering what most people find challenging in the “modern” IT world. There has been a recent upsurge in tools and technology that address most problems, which leaves me wondering: what gap is left to fill? What is the current big annoying problem? Maybe it’s not being able to push your architecture into multiple clouds, or having to live with the constraints of small root disk volumes; who knows? Hence the poll :)

Distributed Puppet

Some might say…

Some might say that running Puppet with a central server is the right way to go; it certainly provides some advantages, like PuppetDB, stored configs, exported resources and so on, but is that really what you want?

If you have a centralised Puppet server with 200 or so clients, there are some fancy things that can be done to ensure that not all nodes hit the server at the same time, but that requires setting up and configuring yet more tools.

What if you just didn’t need any of that? What if you just needed a git repo containing your manifests and modules, and Puppet installed locally?
Have a script download and install Puppet, pull down the git repo and then run Puppet locally (there is an example boot script in Step 6 below). This method puts a little more overhead on each node, but not much, since it had to run Puppet anyway, and it can still provide the same level of configuration as the server/client method; you just need to think outside of the server.

Don’t do it, it’s stupid!

That was my response to my boss some 10 months ago when he said we should ditch Puppet servers and per-server manifests and outlaw all variables. Our mission was to be able to scale our service to over 1 million users, and we realised that manually adding extra node manifests to Puppet was not sustainable, so we started on a journey to get rid of the Puppet server and redo our entire Puppet infrastructure.

Step 1 Set up a git repo – You should already be using one; if you aren’t, shame on you! We chose GitHub: why do something yourself when there are people out there dedicated to doing just that one thing, and doing it better? Spend your time looking after your service, not your infrastructure!

Step 2 Remove all node-based manifests – Replace them with a manifest per tier/role. For us this meant consolidating our prod web role with our QA, test and dev roles, so there was just one role file regardless of environment. This forces the environment-specific bits into variables.

Step 3 Implement hiera – Hiera gives Puppet the ability to externalise variables into configuration files, so we now end up with a configuration file per environment and only one role manifest. This, as my boss would say, “removes the madness”. Now if someone asks “what are the differences between prod and test?” you diff two files, regardless of how complicated you want to make your manifests, inherited or not. It’s probably worth noting that you can set a default value for a Hiera lookup: hiera("my_var", "default value")
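
To make the “diff two files” point concrete, here is a tiny sketch assuming one Hiera YAML file per environment under /etc/puppet/hieradata; the paths and the variable name are illustrative, not a real layout.

# With one YAML file per environment, the differences between prod and test
# are just a diff of two files
diff /etc/puppet/hieradata/prod.yaml /etc/puppet/hieradata/test.yaml

# The hiera CLI can also show what value a node in a given environment would get
hiera -c /etc/puppet/hiera.yaml my_var environment=test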

Step 4 Parameterise everything – We had lengthy talks about parameterising modules vs just using hiera, but to keep the modules transparent to whatever is calling them (and because I was writing them) we kept parameters. I did, however, move all parameters for all manifests in a module into a single params.pp file and inherit that everywhere to re-use the variables; within each manifest a parameter either defaults to the params.pp value or is left blank (to make it mandatory). This means that if you have sensible defaults you can set them there and reduce the size of your hiera files, which in turn makes it easier to see what is happening. Remember, most people don’t care about the underlying technology, just the top-level settings, and trust that the rest is magic. For the lower-level bits see these: Puppet with out barriers part one for a generic overview, Puppet with out barriers part two for manifest consolidation, and Puppet with out barriers part three for params & hiera.

This is all good, but what if you are in Amazon and you don’t know what your box is? Well, it’s in a security group, but that is not enough information, especially if your security groups are dynamic. You can also tag your boxes, and you should make use, where possible, of the AWS CLI tools to do this. We decided a long time ago to set a few details on a per-node basis: Env, Role & Name. From these we know what to set the hostname to, which Puppet manifests to apply and which set of hiera variables to apply, as follows…

Step 5 Facts are cool – Write your own custom facts for Facter. We did this in two ways: the first was to just pull the tags from Amazon (where we host) on every run and return them as ec2_<tag>. This works, but AWS has issues so it fails occasionally. Version 2 was to get the tags, cache them locally in files and have Facter read them from those files… something like this…

#!/bin/bash
# Load the AWS credentials / environment for the EC2 API tools
source /tmp/awsconfig.inc

# Cache every tag on this instance as a file under /opt/facts/tags/
mkdir -p /opt/facts/tags

IFS=$'\n'
for i in $("$EC2_HOME"/bin/ec2-describe-tags --filter "resource-type=instance" --filter "resource-id=$(facter ec2_instance_id)" | grep -v cloudformation | cut -f 4-)
do
        key=$(echo "$i" | cut -f1)
        value=$(echo "$i" | cut -f2-)

        # Only write the fact if the tag actually has a value
        if [ -n "$value" ]
        then
                echo "$value" > "/opt/facts/tags/$key"
                /usr/bin/logger "set fact $key to $value"
        fi
done

The AWS config file just contains the same info you would use to set up any of the EC2 CLI tools on Linux, and you can turn the cached tag files into facts with this:

# Turn each cached tag file into an ec2_<tag> fact
Dir.glob("/opt/facts/tags/*").each do |file|
        value = File.read(file).chomp
        Facter.add("ec2_#{File.basename(file)}") { setcode { value } }
end
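
For completeness, the /tmp/awsconfig.inc sourced at the top of the shell script is nothing special; a sketch of what it might contain (every path and credential here is a placeholder):

# Illustrative only: paths and credentials are placeholders
export JAVA_HOME=/usr/lib/jvm/jre
export EC2_HOME=/opt/ec2-api-tools
export AWS_ACCESS_KEY=AKIAXXXXXXXXXXXXXXXX
export AWS_SECRET_KEY=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
export PATH=$PATH:$EC2_HOME/bin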

Also see: Simple facts with puppet

Step 6 Write your own boot scripts – This is a good one: scripts make the world go round. Make a script that installs Puppet, make a script that pulls down your git repo, then run Puppet at the end.
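
A minimal sketch of that kind of boot script, assuming a yum-based distro, an HTTPS-accessible git repo and a custom fact used to pick the node definition; the repo URL, paths and fact name are all placeholders:

#!/bin/bash
# Illustrative bootstrap only: repo URL, paths and fact names are placeholders
set -e

# 1. Install puppet and git (idempotent; swap for apt-get on Debian/Ubuntu)
yum install -y puppet git

# 2. Clone the repo of manifests and modules, or update it if it is already there
if [ -d /opt/puppet-repo/.git ]; then
        git -C /opt/puppet-repo pull
else
        git clone https://github.com/example/puppet-repo.git /opt/puppet-repo
fi

# 3. Apply the manifests locally, using a custom fact as the node name so the
#    right role manifest is matched
puppet apply \
        --modulepath=/opt/puppet-repo/modules \
        --node_name_fact=ec2_Role \
        /opt/puppet-repo/manifests/site.pp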

The node_name_fact setting is awesome, as it kicks everything into gear: a box deployed into a security group with the appropriate tags hooks itself in and becomes a fully built server.

Summary

So now Puppet is on each box; every box, from the time it is built, knows what it is (thanks to tags) and bootstraps itself to a fully working server thanks to your own boot script and Puppet. With some well written scripts you can cron the pulling of git and a re-run of Puppet if so desired. The main advantage of this method is the distribution: as long as a node manages to pull that git repo it will build itself, and if something changes on the box, Puppet will put it back, because everything it needs is local, so there are no network issues to worry about.
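
If you do cron the git pull and the Puppet run, an entry along these lines is all it takes; the script name, path and schedule are arbitrary examples, not a recommendation:

# Illustrative /etc/cron.d entry: re-pull the repo and re-apply puppet every 30 minutes
*/30 * * * * root /usr/local/bin/puppet-bootstrap.sh >> /var/log/puppet-bootstrap.log 2>&1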

OpenStack

I played with hardware!

It has been over a year since I last got to play with hardware properly to achieve something practical, but that is part of the joy of being in the world of cloud computing: the world where you don’t own anything, you pay by the hour and occasionally things go horribly wrong, but you delete it and start again; the throwaway society of cloud computing.

Every now and then I get frustrated with AWS, normally because there is something wrong; let’s say a box that is meant to have unrivalled resources starts going slow. You end up doing some investigation, but the answer is simply that the underlying hypervisor is busy, probably because other people are hammering the server for some reason… Either way, in a cloudy world your choices are thus:

  1. Wait it out
  2. Throw it away

You could hope the problem gets better or you could delete the server and build it somewhere else and hope that one is better, rinse and repeat the above two until a stable service is resumed.

Being throwaway is really useful; it enables you to rebuild quickly and not suffer too much if something major happens, so I think people (you…) should make sure that no matter where your server is you can rebuild it from scratch in less than 10 minutes. Now imagine you could still be throwaway and request servers as and when you wanted via a web UI, a CLI or some API calls, but in addition you had control of the physical hardware: you could optimise what was running on each hypervisor to offer the best performance. This is all very good, but it is not without its drawbacks; someone has to physically rack and cable in all of the servers that run the infrastructure, someone has to patch their firmware, replace dead hard drives and do all of that boring stuff that cloud folk have forgotten about.

So what about OpenStack?

So, for those that don’t know, OpenStack is a private cloud: it lets you run services in your own data centre that mimic AWS. You get block storage (EBS) in the form of Cinder, object storage (S3) in the form of Swift, instance storage (EC2) in the form of Nova, and a host of other things that I won’t go into. The API may not be 100% the same as AWS, and the features that you have in AWS may not be available in OpenStack yet, but it is catching up, and doing so rapidly. I would predict that over the next 2-3 years we will see OpenStack compete with AWS on features, and even start to see AWS taking features that OpenStack has and porting them across. So definitely one to watch.

Over the last 3-6 months OpenStack had come up a few times and I put it on my todo list to have a play with, but quite frankly I had other things to be doing. Well, last week I was asked to help set up the SAN and network for an OpenStack PoC for internal IT. Falling back on my not-as-legacy-as-I’d-like Cisco skills, and having used the same SAN tech before, it didn’t take long to get that set up, and I thought it would take ages to get the various components of OpenStack up and working. It could have done, if it wasn’t for one saving grace: the PoC-on-a-disk that Rackspace provide Here. It may not be the latest or the most perfect, but it saved a lot of time in getting something up and working. If you aren’t sure what OpenStack is, I would suggest getting a few bits of legacy kit and having a play like we did; just set aside two or three days to play with the technology and set up the various elements of it. It’s worth a play.

There are already a few advantages of OpenStack vs AWS; a silly one for me is a console. OpenStack gives you VNC access to your servers, so you can now survive a minor iptables glitch or networking mishap by yourself. Yes, I know it should all be throwaway, but sometimes the box has some important data on it, or you want to know what went wrong, and having a console is good. Let’s not overlook the fact that you’re calling the shots: if it doesn’t do what you want it to, you could, if you wanted, commit code back to make it better, change the hardware spec, the distribution of VMs, or any of a thousand other elements you may need to control. With this you can.

But it’s not all good; it still comes back to managing your own data centre, and there are very few companies or services that get to a size where they have to move off AWS for performance reasons. Typically you’d move off AWS to save a few dollars, but by the time you factor in the additional head count for maintaining the physical boxes, power, cooling, rack locations, geographically diverse locations, the infrastructure services, the platform and its skill set, you may not be saving as much money as you want. You’ll probably break even, though, with the advantage of controlling the whole underlying infrastructure on top of still having the throwaway nature of a cloud service.

I’m not saying you should, or could, size a private cloud to support everything all the time, even high bursts of traffic, but at least you could use public cloud for what it’s good for: bursting onto when times get hard and more processing power is needed. Granted, to be able to do that all systems would need to be automated and able to migrate at the push of a button. But by the time you’ve gone through that whole process with all of your applications, whether in a private cloud or in a public cloud, it wouldn’t matter if you had to u-turn tomorrow; you could do that, as long as you’re smart enough to only use services that are available in multiple places, i.e. in both OpenStack and AWS.

Interesting times ahead I think.

This time, we survived the AWS outage

Another minor bump

Anyone based in the US East region in AWS knows that yet again there were issues with EBS volumes, although you wouldn’t know it if you looked at their website. It’s a bit of a joke when you see headlines like “Amazon outage takes down Reddit, Foursquare, others”, yet on their status page a tiny little note icon appears stating there’s a slight issue, extremely minor, don’t worry about it. Yeah, right.

The main culprits were EC2 and the API, both of which were EBS related.

“Degraded EBS performance in a single Availability Zone
10:38 AM PDT We are currently investigating degraded performance for a small number of EBS volumes in a single Availability Zone in the US-EAST-1 Region.
11:11 AM PDT We can confirm degraded performance for a small number of EBS volumes in a single Availability Zone in the US-EAST-1 Region. Instances using affected EBS volumes will also experience degraded performance.
11:26 AM PDT We are currently experiencing degraded performance for EBS volumes in a single Availability Zone in the US-EAST-1 Region. New launches for EBS backed instances are failing and instances using affected EBS volumes will experience degraded performance.
12:32 PM PDT We are working on recovering the impacted EBS volumes in a single Availability Zone in the US-EAST-1 Region.
1:02 PM PDT We continue to work to resolve the issue affecting EBS volumes in a single availability zone in the US-EAST-1 region. The AWS Management Console for EC2 indicates which availability zone is impaired.”

The actual message is much, much longer, but you get the gist: a small number of people were affected. Yet most of the major websites that use Amazon were affected; how can that be considered small?

Either way, this time we survived, and we survived because we learnt. Back in June and July we experienced these issues with EBS, so we did something about it. Now, why didn’t everyone else?

How Alfresco Cloud Survived

So, back in June and July we were heavily reliant on EBS just like everyone else; we had an EBS-backed AMI on top of which we used Puppet to build out the OS. This is pretty much what everyone does, and it is why everyone was affected. Back then we probably had 100-150 EBS volumes, so the likelihood of one of them going funny was quite high; now we have about 18, and as soon as we can we will ditch those as well.

After being hit twice in relatively quick succession we realised we had a choice: be lazy or be crazy. We went for crazy, and now it has paid off. We could have been lazy and just said that Amazon had issues, that it wasn’t that frequent and it was not likely to happen again; or we could be crazy and reduce our EBS usage as much as possible. We did that.

Over the last few months I’ve added a number of posts about The Cloud, Amazon and Architecting for the cloud, along with a few funky Abnormal puppet set ups and oddities in the middle. All of this was spawned from the EBS outages; we had to be crazy. Amazon tell us all the time: don’t have state, rely on nothing except failure, use multiple AZs, and so on. All of those big players that were affected would have been told that they should use multiple availability zones, but as I pointed out Here, their AZs can’t be fully independent, and yet again this outage proves it.

Now, up until those outages we had done all of that, but we still trusted Amazon to remain operational. Since July we have made a concerted effort to move our infrastructure onto elements within Amazon that are more stable, hence the removal of EBS. We now only deploy instance-store backed EC2 nodes, which means we have no ability to stop and restart a server, but it also means that we can build them quickly and consistently.

We possibly took it to the extreme: our base AMI, now instance-store backed, contains a single file that does a git checkout; once it has done that it simply builds itself to the point that Chef and Puppet can take over and run. The tools used to do this are many, but needless to say there are many hundreds of lines of bash, supported by Ruby, Java, Go and any number of other languages or tools.

We combined this with fully distributing Puppet so it runs locally; in theory, once a box is built it is there for the long run. We externalised all of the configuration so Puppet was simpler and easier to maintain. Puppet, its config, the base OS and the tools to manage and maintain the systems are all pulled from remote services, including our DNS, which automatically updates itself based on a set of tags.
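
To illustrate the tag-driven DNS idea (this is not our actual tooling, just a sketch using the modern AWS CLI; the hosted zone ID, domain and file locations are placeholders), a node could upsert its own record from the cached Name tag:

#!/bin/bash
# Illustrative only: hosted zone ID, domain and file locations are placeholders
name=$(cat /opt/facts/tags/Name)
ip=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)

aws route53 change-resource-record-sets --hosted-zone-id Z123EXAMPLE \
  --change-batch "{\"Changes\":[{\"Action\":\"UPSERT\",\"ResourceRecordSet\":{
    \"Name\":\"${name}.example.internal.\",\"Type\":\"A\",\"TTL\":300,
    \"ResourceRecords\":[{\"Value\":\"${ip}\"}]}}]}"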

Summary

So, how did we survive? We decided that no individual box was important: if a crazy person could not randomly delete a box or service and have the system keep working, then we had failed. I can only imagine that the bigger companies, with a lot more money and people and time to look at this, are still treating Amazon as a data centre rather than as a collection of web services that may or may not work. With distributed Puppet and config, once our servers are built they run happily on a local copy of the data, with no network needed, and that is important because AWS’s network is not always reliable and nor is their data access. If a box no longer works, delete it; if an environment stops working, rebuild it; if Amazon has a glitch, keep working. Simple.

AWS best practices – Introducing cloud

Overview

With this series of posts over the next few weeks I am attempting to help those new to Amazon Web Services (AWS) get a step up and avoid some of the pitfalls that I have encountered; the sort of guide I would have been biting people’s hands off for when I was in the same boat. But before we go any further, a picture. People like pictures; you are people, so here’s a picture.

Money Tree 2

It’s a picture of a tree trunk which over the years has been embedded with coins: people walk by, they see that someone has pushed a coin in, so they do the same; rinse and repeat for 10 years or so and you end up with several trees like this one. This is essentially what the cloud is to people. Five years ago no one knew what the cloud was and no one cared; three years ago people said “Hey, look at this”; two years ago people said the cloud was going to change the world; a year ago people said big business was adopting the cloud; and today I tell you not to without reading through this first.

Although I am going to focus on AWS, the topics covered are more than likely relevant to other cloud providers, and I would encourage you to read through this to understand the foibles of the largest cloud provider, Amazon, so you can better understand the constraints it places on you and those other cloud providers may place on you. Now on with the post.

What is the Cloud

Apologies for the history lesson, feel free to breeze over…

In the beginning there was only one way: build a data centre. Source your own power, cooling and network, and start building out a data centre full of disk arrays, high-performance servers and networking equipment. I would label this the “golden age”, but truth be told, running your own data centre from the ground up cannot be easy.

As with everything, there was progression: some smart people noticed an opportunity and started to take over the management of the facility, so all you had to do was turn up with your servers, disk arrays and networking. This is co-location, and it is a good way of doing things; it is not as cheap as running your own, but it takes a lot of the hassle out of it.

Leading on from this, companies began to form that went one step further: they would provide the equipment for you, so all you have to do now is log in. All the disk and network worries are taken care of and they help you on your way, of course charging a premium for the service. Moving on from this, but in the same area of hosted services, are the almost fully managed solutions where they do everything; you give them an application and they make it happen. Great if you don’t have an IT team.

Getting onto more recent times, virtualisation has really taken off in the last 10 years despite being around for longer. I believe the big drive for this came after the “.com bubble” burst back in the early 2000s, when companies were looking for ways to save costs on their hosted or co-location services. One of the ways this was achieved was through virtualisation technologies such as Xen and VMware. In most cases the equipment was still run and managed by the company, and people over-allocated or under-allocated memory and CPU depending on their usage profiles, with all sorts of redundancy.

As you can see from all of this, the constant push is to reduce costs. Granted, running your own data centre is the cheapest way, but you need a few thousand servers to make that so. Faced with a problem, a company called Amazon (who you have probably heard of; they run a web shop, by the way) noticed that even with all of their virtualisation technology they still needed a large percentage of their servers just for 2-3 weeks of their business each year; the rest of the time the boxes sat idle. But what are you to do? You have to have the capacity for your peaks. Well, they worked it out, and we have the cloud. I’m not 100% sure if they had the big idea first, but they certainly took the idea and ran with it.

The idea behind cloud computing is a utility-based cost: $X per hour. This comes across as a very cheap model, but as we’ll find out in the following posts it’s not that cheap, and it depends heavily on your usage model. With the cloud you now have the ability to choose how much disk you want, for how long, and how much CPU time you need. These are the joys of cloud computing.

Summary

That was a rather long introduction to the cloud, but with this understanding of the history behind it and how it was born you will hopefully now appreciate where it is going. I wouldn’t be surprised if most of the features Amazon releases are just new ways of making better use of their own applications and architecture, then working out how to do it at scale to cover the costs and offer it as a service.

Alfresco Cloud – Out of beta

Finally!

For those of you that don’t know, I work at Alfresco in the Operations department, specifically looking after and evolving our cloud-based product. It feels like an absolute age that I’ve been working on the cloud product and its release, but finally today (well, 31st May 2012) we took the “Beta” tag off the product.

Being on the support side of the service I know the system very well, and overall I’m really pleased with how it is now and how it will be. The most fantastic thing about the product is knowing what is coming up: even though we have taken the beta label off, we are still going to be innovating new ways of doing things, utilising the best technology and writing some bespoke management tools to help support the environment. Granted, now the beta tag is off we have reduced the amount of disruptive impact we will have on the system, but, unlike all those months ago, we now have the right framework around managing and testing the changes we are making.

I’m looking forward to the next few months as I know we’ve got more good stuff coming, and I can’t wait to see how the general public take to the product; it’s been an interesting journey and it looks to be getting better!

What can you expect from Alfresco in the cloud?

I’m going to start this with a warning: I’m not in marketing or product design, so this is just the way I see the product and what I like about it. For those of you that have used Alfresco before, you’ll be familiar with the Share interface; it is somewhat cut down for the cloud but nonetheless just as powerful. You can still upload your documents and like & comment on them just as always, and you can use the Quick Share feature to share a document via email, Facebook or Twitter, so there’s no need to invite everyone just to see a single document or picture. For those privileged enough to sign up, you can use WebDAV to mount the cloud as a drive on your local PC, very handy…

And the best bit… well you have to sign up to find that….

Only a short update today; it has been a very busy week getting this all sorted, and now it’s time to rejoice… and rest.

Cloud deployment 101 – Part 3

The final instalment

Over the last couple of weeks I have posted a Foundation to what the cloud really is and How to make the best use of your cloud. This week is about tying off loose ends, better ways of working, dispelling a few myths and setting some things straight.

Infrastructure as code

DevOps is not a silver bullet, but it is a framework that encourages teamwork across departments and brings rapid agility to deploying code to a production environment.

  • Agile development
    • Frequent releases of minor changes
    • Often the changes are simpler as they are broken down into smaller pieces
  • Configuration management
    • This allows a server (or hundreds) to be managed by a single sysadmin and produce reliable results
    • No need to debug 1 faulty server out of 100; rebuild it and move on
  • Close, co-ordinated partnership with engineering
    • Mitigates “over the wall” mentality
    • Encourages a team mentality to solving issues
    • Better utilises the skills of everyone to solve complex issues

Infrastructure as code is the foundation of rapid deployment. Why hand-build 20 systems when you can create an automated way of doing it? Utilising the API tools provided by cloud providers, it is possible to build entire infrastructures automatically and rapidly.
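
As a trivial illustration using Amazon’s EC2 API tools (the AMI ID, instance type, key pair, security group and tag values are all placeholders), launching and labelling a node ready for configuration management takes a couple of commands:

#!/bin/bash
# Illustrative only: AMI, instance type, key pair, security group and tags are placeholders
instance_id=$(ec2-run-instances ami-12345678 -t m1.small -k my-keypair -g web-sg \
        | awk '/^INSTANCE/ {print $2}')

# Tag it so the node knows what to build itself into
ec2-create-tags "$instance_id" --tag Env=prod --tag Role=web --tag Name=web01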

Automation through code is not a new concept; sysadmins have been doing this for a long time through the use of Bash, Perl, Ruby and other such languages. As a result, the ability to program and to understand complicated object-oriented code is becoming more and more important within a sysadmin role; typically this was the domain of the developer, and a sysadmin just needed to “hack” together a few commands. Likewise, in this new world, development teams are being called on by the sysadmins to fine-tune the configuration of application platforms such as Tomcat, or to make specific code changes that benefit the operation of the service.

Through using an agile delivery method, frequent changes are possible. At first this can seem crazy: why would you make frequent changes to a stable system? Well, for one, when the changes are made they are smaller, so each iteration is less likely to cause a total outage. It also means that if an update does have a negative impact it can be identified and fixed very quickly, again minimising the total outage of a system.

In an ideal world you’d be rolling out every individual feature on its own rather than a bunch of features together; this is a difficult concept for development teams and sysadmins to get used to, especially as they are more used to the on-premise way of doing things.

Automation is not everything

I know I said automation is key: the more we automate, the more stable things become. However, automating everything is not practical, can be very time-consuming and can also lead to large-scale disaster.

  • Automation, although handy, can make life difficult
    • Solutions become more complex
    • When something fails, it fails in style
  • Understand what should be automated
    • Yes, you can automate everything, but ask yourself: should you?
    • Automate boring, repetitive tasks
    • Don’t automate overly complex tasks; simplify them first, then automate

We need to make sure we automate the things that need to be automated: deployments, updates, DR.
We do not want to spend time automating a solution that is complex; it needs to be simplified first and then automated. The whole point of automation is to free up more time; if you are spending all of your time automating, you are no longer saving any.

Failure is not an option!

Anyone who thinks things won’t fail is being rather naïve. The most important thing to understand about failures is what you will do when there is one.

  • Things will fail
    • Data will be lost
    • A server will crash
    • An update that reduces functionality will make it through QA and into production
    • A sysadmin will remove data by accident
    • The users will crash the system
  • Plan for failures
    • If we know things will fail we can think about how we should deal with them when they happen.
    • Create alerts for the failure situations you know could happen
    • Ensure it is well understood how to fix the common ones
  • You cannot plan for everything
    • Accept this, have good processes in place for DR, Backup and partial failures

Following a process makes it quick to resolve an issue, so creating run books and DR plans is a good thing. Holding a wash-up after a failure, to ensure you understand what happened, why, and how you can prevent it in the future, will ensure mitigations are put in place to stop it happening again.
Regularly review operational issues to ensure that the important ones are being dealt with; there’s little point in logging all of the issues if they are not being prioritised appropriately.

DR, backup and restoration of service are the most important elements of an operational service, although no one cares about them until there is a failure; get these sorted first.
Deploying new code and making updates are a nice-to-have. People want new features, but they pay for uptime and availability of the service. This is somewhat counter-intuitive for DevOps, as you want to allow the most rapid changes possible, but it still needs control, testing and gatekeeping.

Summary

Concentrate on the things that no one cares about until there’s a failure. Make sure that your DR and backup plan is good and test that it works regularly; ensure your monitoring is relevant and timely. If you have any issues with any of these, fix them quickly and put the controls in place to ensure they stay up to date.

As regards automation, just be sensible about what you are trying to do; if it needs automating but is complicated, simplify it first and then find a better way.