EBS Snapshot script

Like it says

Over the last 18 months, the one key thing I have learnt about Amazon is: don’t use EBS, in any way, shape or form. In most cases it will be okay, but if you start relying on it, it can ruin even the best-architected service and reduce it to rubble. So you can imagine how pleased I was to find I’d need to write something to make EBS snapshots.

For those of you that don’t know, Alfresco Enterprise has an AMP to connect to S3, which is fantastic and makes use of a local cache while it’s waiting for the S3 buckets to actually write the data. If you’re hosting in Amazon, this is the way to go. It means you can separate the application from the OS & data, which is important for the following reasons:

1, EBS volumes suck, so where possible don’t use them for storing data, or for your OS
2, Having data elsewhere means you can, without prejudice, delete nodes and your data is safe
3, It forces you to build an environment that can be rapidly re-built

So in short, data off of the server means you can scale up and down easily, and you can rebuild easily. The key is always to keep the distinctly different areas separate and not merge them together.

So, facing this need to back up EBS volumes, I thought I’d start with snapshots. I did a bit of googling and came across a few EBS snapshot programs that seem to do the job, but I wanted one in Ruby, and I’ve used Amazon’s SDKs before, so why not write my own.

The script

#!/usr/bin/ruby

require 'rubygems'
require 'aws-sdk'

#Get options
access_key_id=ARGV[0]
secret_access_key=ARGV[1]



if File.exist?("/usr/local/bin/backup_volumes.txt")
  puts "File found, loading content"
  ec2 = AWS::EC2.new(:access_key_id => access_key_id, :secret_access_key=> secret_access_key)
  File.open("/usr/local/bin/backup_volumes.txt", "r") do |fh|
    fh.each do |line|
      volume_id=line.split(',')[0].chomp
      volume_desc=line.split(',')[1].chomp
      puts "Volume ID = #{volume_id} Volume Description = #{volume_desc}}"
      v = ec2.volumes[volume_id]
      if v.exists? 
        puts "creating snapshot"
        date = Time.now
        backup_string="Backup of #{volume_id} - #{date.day}-#{date.month}-#{date.year}"
        puts "#{backup_string}" 
        snapshot = v.create_snapshot(backup_string)
        sleep 1 until [:completed, :error].include?(snapshot.status)
        snapshot.tag("Name", :value =>"#{volume_desc} #{volume_id}")
      else
        puts "Volume #{volume_id} no longer exists"
      end
    end
  end
else
  puts "no file backup_volumes.txt"
end

I started writing it with the idea of having it just back up every EBS volume that ever existed, but I thought better of it. So I added a file, “backup_volumes.txt”; instead, the script will read this and look for a volume ID and a name for it, i.e.

vol-1264asde,Data Volume

If you wanted to back up everything, it wouldn’t take much to extend this, i.e. change the following:

v = ec2.volumes["#{volume_id}"]

To

ec2.volumes.each do |v|

or at least something like that…
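
For what it’s worth, a rough sketch of that “backup everything” variant might look like the following. It’s untested, and the status filter is my own assumption about which volumes are worth snapshotting:

# Hypothetical "backup everything" variant (untested sketch)
ec2.volumes.each do |v|
  # Only snapshot volumes that are in use or available (my assumption)
  next unless [:in_use, :available].include?(v.status)
  date = Time.now
  backup_string = "Backup of #{v.id} - #{date.day}-#{date.month}-#{date.year}"
  snapshot = v.create_snapshot(backup_string)
  sleep 1 until [:completed, :error].include?(snapshot.status)
  snapshot.tag("Name", :value => "#{v.id}")
end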

Anyway, the script takes the keys via the CLI as params, so it makes it quite easy to run the script on one server in several cron jobs with different keys if needed.
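
To make that concrete, the cron entries might look something like this (the script path and keys are made up for illustration):

# Hypothetical crontab entries, one per set of AWS keys
0 1 * * * /usr/local/bin/ebs_snapshot.rb AKIAEXAMPLEAAA secretkeyAAA
30 1 * * * /usr/local/bin/ebs_snapshot.rb AKIAEXAMPLEBBB secretkeyBBB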

It’s worth mentioning at this point that within AWS you should be using IAM to restrict the EBS policy down to the bare minimum; something like this is a good start:

{
  "Statement": [
    {
      "Sid": "Stmt1353967821525",
      "Action": [
        "ec2:CreateSnapshot",
        "ec2:CreateTags",
        "ec2:DescribeSnapshots",
        "ec2:DescribeTags",
        "ec2:DescribeVolumeAttribute",
        "ec2:DescribeVolumeStatus",
        "ec2:DescribeVolumes"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}

For what it’s worth, you can spend a while reading loads of stuff to work out how to set up the policy, or just use the policy generator.

Right, slightly back on topic: it also tags the name of the volume for you, because we all know the description field isn’t good enough.

Well, that is it. Short and something, something… Now the disclaimer: I ran the script a handful of times and it seemed good, so please test it :)

Monitoring Alfresco Solr

It’s not just about numbers

Up until recently, if you wanted to monitor Alfresco’s Solr usage you would have had to either use a costly call to the stats page or use the summary report, which only really gave you a lag number. Luckily, because Alfresco have extended Solr, they have changed the summary report to provide some really useful information which can then be tracked via Nagios or whatever your favourite monitoring tool is.

Firstly, it’s worth reading the Wiki as it explains the variables better than I would. It’s also worth mentioning that my preferred way of programmatically accessing this page is via JSON, like so:

 
http://localhost:8080/solr/admin/cores?action=SUMMARY&wt=json

It’s worth mentioning that, depending on the JSON parsing library you are using, you can get some fatal parsing errors caused by the hit ratio. For what it’s worth, I found Crack to be good; it doesn’t validate the JSON as heavily as the raw JSON library does, which means you can pull back all the data even if there is a problem with the hit ratios.

On that subject: before the relevant cache is hit, the hit ratio will display “NaN” (Not a Number); once it has been hit, it will display the appropriate number, which I’ll dive into a bit more later.
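
To illustrate that failure mode, here is a minimal sketch; the JSON fragment is invented, and it assumes (as above) that Crack tolerates the bare NaN token where a stricter parser would raise a parse error:

require 'rubygems'
require 'crack'

# An invented fragment of what the summary can look like before a cache is hit
raw = '{"Summary":{"alfresco":{"/filterCache":{"lookups":0,"hitratio":NaN}}}}'
hash = Crack::JSON.parse(raw)
# The rest of the data is still reachable despite the NaN hit ratio
puts hash["Summary"]["alfresco"]["/filterCache"]["lookups"]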

So, before getting into the nitty-gritty service checks, it’s important to have a good understanding of the numbers. Most of them are straightforward; the only one that confused me was the hit ratios.

The hit ratio is a number between 0 and 1; when the number is greater than, say, 0.3, all is well, and below 0.3 things could be bad. However, when the hit count is less than, say, 100, it would be expected that the hit ratio is low, as the cache is not being hit enough to provide a reasonable ratio. Other than the hit ratio, the others are pretty straightforward.

Some code

It’s probably worth me sharing with you the class I’m using to access/return Solr information; that way, if you want to write your own Nagios checks, you can just copy/paste.

Firstly, the class that gets all the Solr information:

#
# Solr Metric gatherer

require 'rubygems'
require "crack"
require 'open-uri'

class SolrDAO

  def initialize (url)
    @solr_hash = get_metrics(url)
  end

  def get_lag(index)
    lag = @solr_hash["Summary"][index]["TX Lag"]
    regex = Regexp.new(/\d+/)
    lag_number = regex.match(lag)
    return lag_number.to_s
  end

  def get_alfresco_node_in_index(index)
    return @solr_hash["Summary"][index]["Alfresco Nodes in Index"]
  end
  
  def get_num_docs(index)
    return @solr_hash["Summary"][index]["Searcher"]["numDocs"]
  end
  
  def get_alfresco_avgTimePerRequest(index)
    return @solr_hash["Summary"][index]["/alfresco"]["avgTimePerRequest"]
  end

  def get_afts_avgTimePerRequest(index)
    return @solr_hash["Summary"][index]["/afts"]["avgTimePerRequest"]
  end

  def get_cmis_avgTimePerRequest(index)
    return @solr_hash["Summary"][index]["/cmis"]["avgTimePerRequest"]
  end

  def get_mean_doc_transformation_time(index)
    return @solr_hash["Summary"][index]["Doc Transformation time (ms)"]["Mean"]
  end

  def get_queryResultCache_lookups(index)
    return @solr_hash["Summary"][index]["/queryResultCache"]["lookups"]
  end
  
  def get_queryResultCache_hitratio(index)
    return @solr_hash["Summary"][index]["/queryResultCache"]["hitratio"]
  end
  
  def get_filterCache_lookups(index)
    return @solr_hash["Summary"][index]["/filterCache"]["lookups"]
  end
  
  def get_filterCache_hitratio(index)
    return @solr_hash["Summary"][index]["/filterCache"]["hitratio"]
  end
  
  def get_alfrescoPathCache_lookups(index)
    return @solr_hash["Summary"][index]["/alfrescoPathCache"]["lookups"]
  end
  
  def get_alfrescoPathCache_hitratio(index)
    return @solr_hash["Summary"][index]["/alfrescoPathCache"]["hitratio"]
  end
  
  def get_alfrescoAuthorityCache_lookups(index)
    return @solr_hash["Summary"][index]["/alfrescoAuthorityCache"]["lookups"]
  end
  
  def get_alfrescoAuthorityCache_hitratio(index)
    return @solr_hash["Summary"][index]["/alfrescoAuthorityCache"]["hitratio"]
  end
  
  def get_queryResultCache_warmupTime(index)
    return @solr_hash["Summary"][index]["/queryResultCache"]["warmupTime"]
  end
  
  def get_filterCache_warmupTime(index)
    return @solr_hash["Summary"][index]["/filterCache"]["warmupTime"]
  end
  
  def get_alfrescoPathCache_warmupTime(index)
    return @solr_hash["Summary"][index]["/alfrescoPathCache"]["warmupTime"]
  end
  
  def get_alfrescoAuthorityCache_warmupTime(index)
    return @solr_hash["Summary"][index]["/alfrescoAuthorityCache"]["warmupTime"]
  end
  
  private
  def get_metrics(url)
    url += "&wt=json"
    response = open(url).read
    # Convert the JSON response to a hash
    result_hash = Crack::JSON.parse(response)
    # if the hash has 'Error' as a key, we raise an error
    if result_hash.has_key? 'Error'
      raise "web service error"
    end
    return result_hash
  end

end # End of class

As you can see, it is quite straightforward to extend this if you want to pull back different metrics. At some point I will hook this into a GitHub repo for people, or use it in another metrics-based project, but for now just use this.
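
For example, if you wanted another metric out of the summary, reopening the class with one more getter is all it takes (the field here is just illustrative):

class SolrDAO
  # Hypothetical extra getter -- swap in whichever summary field you need
  def get_searcher_warmupTime(index)
    return @solr_hash["Summary"][index]["Searcher"]["warmupTime"]
  end
end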

Now, some of you may not be used to using Ruby, so here is a check that checks the filterCache hit ratio:

#!/usr/bin/ruby
$:.unshift File.expand_path("../", __FILE__)
require 'lib/solr_dao.rb'
solr_results=SolrDAO.new("http://localhost:8080/solr/admin/cores?action=SUMMARY")
hitratio=solr_results.get_filterCache_hitratio("alfresco").to_f
lookups=solr_results.get_filterCache_lookups("alfresco").to_i
#Hit ratio is inverted for the thresholds below: 1.0 is perfect, 0.1 is bad, and it can be ignored if there are fewer than 100 lookups
inverse=(1.0-hitratio)
critical=0.8
warning=0.7
if (inverse.is_a? Float)
  if ( lookups >= 100 )
    if ( inverse >= warning )
      if (inverse >= critical )
        puts "CRITICAL :: FilterCache hitratio is #{hitratio}|'hitratio'=#{hitratio};#{warning};#{critical};;"
        exit 2
      else
        puts "WARNING :: FilterCache hitratio is #{hitratio}|'hitratio'=#{hitratio};#{warning};#{critical};;"
        exit 1
      end
    else
      puts "OK :: FilterCache hitratio is #{hitratio}|'hitratio'=#{hitratio};#{warning};#{critical};;"
      exit 0
    end
  else
    puts "OK :: FilterCache hitratio is #{hitratio}|'hitratio'=#{hitratio};#{warning};#{critical};;"
    exit 0
  end
else
  puts "UNKNOWN :: FilterCache hitratio is #{hitratio}"
  exit 3
end

To get this to work, you’ll just need to put it with your other Nagios checks, and in the same directory as the above put a lib directory with the SolrDAO class from further up in it. If you need to change its location, you will only need to adjust the following:


$:.unshift File.expand_path("../", __FILE__)
require 'lib/solr_dao.rb'

Also, if you wanted to, you could modify the script to take the critical and warning thresholds as params so you can easily change them within Nagios.
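
A minimal sketch of that change, assuming the thresholds arrive as the first two arguments with the current values as defaults:

#!/usr/bin/ruby
$:.unshift File.expand_path("../", __FILE__)
require 'lib/solr_dao.rb'
# e.g. ./check_filtercache.rb 0.7 0.8
warning  = (ARGV[0] || 0.7).to_f
critical = (ARGV[1] || 0.8).to_f
solr_results=SolrDAO.new("http://localhost:8080/solr/admin/cores?action=SUMMARY")
# ...rest of the check as above, minus the hard-coded thresholds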

DevOps team DNA

Hi, this is my first post on Matt’s blog. I’ve been an avid supporter of his blogging for a while and today got an invite to contribute. So here’s my post (created very quickly before he changes his mind).

My job has always been within an operations department of software product companies. I started at a small company as ‘everything’ support and slowly drifted towards a specialisation in the more recently branded DevOps-y areas as I made my way through various acquisitions and mergers. Over the past couple of years I’ve found myself building DevOps teams. During that time I’ve discovered some of the things that work and almost everything that doesn’t work (or it feels like that :) ).

Some of the things that have worked… (for me anyway)

Obviously these are going to be quite subjective and I doubt they will work for everyone. I’ll focus mostly on what I think are the key ingredients of a successful team. Maybe some people will find it interesting. Bear in mind that this only really applies to an operations team that supports a Cloud service.

I’m not a big football fan but I can draw some parallels between football managers and DevOps teams. You don’t see Arsenal winning and losing games based on their process redesigns. I may be simplifying, and I’m sure tactics play a large part, but I believe you get quite a bit more out of a team when you have excellent players. Players who excel in different areas. My teams tend to be 5–7 players nowadays and between all of us we need to cover a few areas.

The first is product knowledge. If you have a product guru in your team then you’ve got a productivity catalyst. So many aspects of our work involve investigating whether issues are product vs config, and whether we can improve things from an operational perspective that will result in the product running better. The most recent team has a Product Architect and he’s awesome. He’s on the cutting edge of ideas for the product, for Amazon AWS and for all of the supporting technologies. Having a dedicated resource to do all of this in the background is great – it means that when we automate his prototypes and release them we get the maximum benefit. Recent examples include our Public API work and the work being done on our Amazon architecture to improve speed (CDNs etc).

The second role I’ve always tried to fill is an engineer (at least one person, preferably two). Get the most senior developer(s) that you can, who know the language and build system of the product that you are supporting. You can now write the high-level instrumentation that every DevOps team needs – as is true with any automation project. There is only ever so far you can go with Bash (I tend to take things beyond where they are supposed to be with Bash as it is). Ultimately, having a senior developer or two buys you a massive amount of flexibility. Need a web service for something like externalised Puppet variables? You can write your own. Backup scripts not fast enough? A senior developer will make those scripts look very feeble in comparison when rewritten in their preferred language and multithreaded. I’m careful about not reinventing the wheel and will usually go off and clone something from GitHub before starting from scratch myself. But having some people who can write stuff from scratch is a major advantage. One caveat I would say for this role – hire from outside. Developers usually end up getting pulled back to work on stuff they did at the company at some stage. If you can, hire a new person and liven things up. Obviously tell the engineering teams that the hire(s) are for instrumentation, in case they get worried that you want to start adding buttons to the product :)

Lastly, the sysadmins. I’d actually consider myself one of these at heart. Getting a good sysadmin can be tricky; it’s not uncommon to read 100 CVs before finding someone even remotely eligible. For a DevOps team you need a reasonably rare mix of skills: people who know Linux inside out, who can script and get excited by the latest batch of tools, and nowadays you need to throw Puppet/Chef into the mix. I have a couple of these currently and consider myself extremely blessed. Everything that we do is checked into source control (we use AWS as our data centre) and this buys us a lot of things, like the ability to automate everything, reduce costs by deleting and recreating at whim, and disaster recovery. However, you pay for those things by hiring really good people, which is a cost saving in the long run once the benefits of the team start to show.

Now, if you add in all of those types of role, what I’ve found works quite nicely is running the team without being too focussed on the separation of responsibility. Everyone is on call 24/7. Everyone is expected to know the product inside out (although nobody will get near the level of expertise of the Product Architect), everyone scripts (even me) and ultimately everyone will end up doing some programming tasks. You can probably see from Matt’s previous blog posts about the Metrics project that he got the chance to learn some Ruby. I think it’s important that everyone knows a bit of everyone else’s job, although when under pressure everyone naturally drops back into doing what they are good at to speed things along.

This probably looks a little odd from the outside. But it makes things fun, everyone stays engaged and ultimately we all share the same goal: scale to 1 million users :D

This time, we survived the AWS outage

Another minor bump

Anyone based in the US East region in AWS knows that yet again there were issues with EBS volumes, although you wouldn’t know it if you looked at their website. It’s a bit of a joke when you see headlines like “Amazon outage takes down Reddit, Foursquare, others”, yet on their status page a tiny little note icon appears stating there’s a slight issue, extremely minor, don’t worry about it. Yeah, right.

The main culprits were EC2 and the API, both of which were EBS-related.

“Degraded EBS performance in a single Availability Zone
10:38 AM PDT We are currently investigating degraded performance for a small number of EBS volumes in a single Availability Zone in the US-EAST-1 Region.
11:11 AM PDT We can confirm degraded performance for a small number of EBS volumes in a single Availability Zone in the US-EAST-1 Region. Instances using affected EBS volumes will also experience degraded performance.
11:26 AM PDT We are currently experiencing degraded performance for EBS volumes in a single Availability Zone in the US-EAST-1 Region. New launches for EBS backed instances are failing and instances using affected EBS volumes will experience degraded performance.
12:32 PM PDT We are working on recovering the impacted EBS volumes in a single Availability Zone in the US-EAST-1 Region.
1:02 PM PDT We continue to work to resolve the issue affecting EBS volumes in a single availability zone in the US-EAST-1 region. The AWS Management Console for EC2 indicates which availability zone is impaired.”

The actual message is much, much longer, but you get the gist: a small number of people were affected. Yet most of the major websites that use Amazon were affected; how can that be considered small?

Either way, this time we survived, and we survived because we learnt. Back in June and July we experienced these issues with EBS, so we did something about it. Now, why didn’t everyone else?

How Alfresco Cloud Survived

So, back in June and July we were heavily reliant on EBS just like everyone else; we had an EBS-backed AMI that we then used Puppet to build the OS out on. This is pretty much what everyone does, and this is why everyone was affected. Back then we probably had 100–150 EBS volumes, so the likelihood of one of them going funny was quite high; now we have about 18, and as soon as we can we will ditch those as well.

After being hit twice in relatively quick succession, we realised we had a choice: be lazy or be crazy. We went for crazy, and now it has paid off. We could have been lazy and just said that Amazon had issues, that it wasn’t that frequent and it was not likely to happen again; or we could be crazy and reduce all of our EBS usage as much as possible. We did that.

Over the last few months I’ve added a number of posts about the Cloud, Amazon and architecting for the cloud, along with a few funky abnormal Puppet set-ups and oddities in the middle. All of this was spawned from the EBS outages; we had to be crazy. Amazon tell us all the time: don’t have state, don’t rely on anything other than failure, use multiple AZs, etc. All of those big players that were affected would have been told that they should use multiple availability zones, but as I pointed out Here, their AZs can’t be fully independent, and yet again this outage proves it.

Now, up until those outages we had done all of that, but we still trusted Amazon to remain operational. Since July we have made a concerted effort to move our infrastructure to elements within Amazon that are more stable, hence the removal of EBS. We now only deploy instance-backed EC2 nodes, which means we have no ability to restart a server, but it also means that we can build them quickly and consistently.

We possibly took it to the extreme: our base AMI, now instance-backed, consists of a single file that does a git checkout; once it has done that, it simply builds itself to the point that Chef and Puppet can take over and run. The tools used to do this are many, but needless to say there are many hundreds of lines of bash, supported by Ruby, Java, Go and any number of other languages or tools.

We combined this with fully distributing Puppet so it runs locally; in theory, once a box is built, it is there for the long run. We externalised all of the configuration so Puppet was simpler and easier to maintain. Puppet, its config, the base OS, and the tools to manage and maintain the systems are all pulled from remote services, including our DNS, which automatically updates itself based on a set of tags.

Summary

So, how did we survive? We decided every box was not important: if some crazy person cannot randomly delete a box or service and have the system keep working, then we had failed. I can only imagine that the bigger companies, with a lot more money and people and time looking at this, are still treating Amazon more as a datacentre rather than as a collection of web services that may or may not work. With the distributed Puppet and config, once our servers are built they run happily on a local copy of the data, no network; and that is important, because AWS’s network is not always reliable and nor is their data access. If a box no longer works, delete it; if an environment stops working, rebuild it; if Amazon has a glitch, keep working. Simple.

Metrics from Amazon

Amazon have an API for that

It was words to that effect I heard; I wish it was what I wanted, though. After watching a colleague get metrics for the billing, it was clear after a short look that they were very good for estimating your total expense, but they were never going to be an accurate figure.

So, in short, if you want to be disappointed by Amazon guessing your monthly cost, you can find out some information Here.

If you wish to learn more, read on…

What do you want?

We were looking at the metrics because we wanted to report on our running costs on a granular basis, so we could see the hourly cost of our account as and when new services were turned on or off, or when we added/subtracted a node from an existing system. In addition to that snapshot, we wanted to be able to compare one week to the next, and against some other operational metrics such as users online.

So, after discussing for a little while, it was clear the Amazon metrics are good for predicting, not good for historical data, and ultimately not very accurate, as they were always a potential and never an actual. I made the decision that we were better off getting the information ourselves, which at first sounded crazy, and the more I think about it the more I agree; really, Amazon should offer this in a better way through their current billing.

Knowing what we wanted meant I didn’t need to bother tracking the things we don’t care about. What is really important to us? Is it the disk space being used? Is it the bandwidth? The cost of ELBs? Nope, we just really care about how much the instances we are running cost.

In the end, that’s all that mattered: are we costing more money or less money, and with this money are we providing more or less value? Ideally we will reduce cost and increase performance, but until we start tracking the figures we have no idea what is actually going on, unless we spend hours looking at a summarised cost and guessing backwards… well, until now anyway.

Over the last 5 days I’ve spent some time knocking together some Ruby scripts that will poll Amazon and get back the cost of an account based on the current running instances across all regions. For us that is good enough, but I decided to add extra functionality by getting all of the nodes as well; it sort of acts like an audit trail and will allow us to do more in-depth monitoring if we so wish. For example, if we switch instance type, does that save us more money and give more performance? Well, we wouldn’t know either way, especially if we didn’t track what was running.

We also wanted to track the number of objects in a bucket within S3. Now, our S3 buckets have millions of objects in each of them; if you use the aws-sdk to get this, it will take forever +1 day, and if you use right_aws it will still take a long time, but not as much (over 30 mins for us). This isn’t acceptable, so we’re looking at alternative ways to generate this number quicker, but the short answer is it’ll be slow. If I come up with a fancy S3 object count alternative I’ll let you know, but for now I have had to abandon that. Unless Amazon want to add two simple options like s3.totalObjects and s3.totalSize…
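
For the curious, this is roughly what the slow aws-sdk version looks like (the bucket name is a placeholder). The listing API only returns keys in pages of 1000 at a time, so a bucket with millions of objects means thousands of sequential requests:

require 'rubygems'
require 'aws-sdk'

s3 = AWS::S3.new(:access_key_id => ARGV[0], :secret_access_key => ARGV[1])
count = 0
# Each page of 1000 keys is a separate HTTP round trip, hence the crawl
s3.buckets['my-bucket'].objects.each { count += 1 }
puts "Total objects: #{count}"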

It’s just data

So, we gathered our Amazon data, and we gathered some data from our product; this was about a day into the project. All of this was done on the fly, and it used to take 20 seconds or so to get the information. We had a quick review, and it was decided we should track now vs the last week; this made a slight difference, as it meant we now needed to store the data.

This is a good thing: by storing the data we care about, we no longer have to make lots of long-winded calls that hang for 20 seconds. It’s all local; speed++.

Needless to say, the joys of storing the data and searching back through it are all interesting, but I’m not going to go into them.

To take the data and turn it into something useful requires reports to be written. All the time it’s raw data, no one will particularly care; once a figure is associated with a cost per user, or “this account costs X per hour”, people care more. One of the decisions made was to take the data and do some abstraction of it separately from the formatting of the output, mainly as there are multiple places to send the data: we might want to graph some of it in Graphite, email some of the other data, and squirrel away the rest in a CSV output somewhere.

An advantage of this is that now the data has been generated, there’s one file to modify to choose how and what data should get returned, which gives us the ability to essentially write bespoke reports for whatever is floating our boat this week.
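
As a sketch of the idea (the class and method names are invented for illustration), the abstraction holds the report rows, and each output is just a different rendering of them:

class CostReport
  def initialize(rows)
    # rows such as [{:name => "prod", :cost => 1.23}, ...]
    @rows = rows
  end

  # One rendering per destination; adding a new output touches only this file
  def to_csv
    @rows.map { |r| "#{r[:name]},#{r[:cost]}" }.join("\n")
  end

  def to_graphite(prefix = "aws.cost")
    now = Time.now.to_i
    @rows.map { |r| "#{prefix}.#{r[:name]} #{r[:cost]} #{now}" }.join("\n")
  end
end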

A Freebie

I thought this might be useful. It’s the class we are using to get our instance data from Amazon; it’s missing a lot of error checking, but so far it is functional, and as everyone knows, before it can be useful it must first work.

#
# Get instance size cost
#

require 'net/http'
require 'rubygems'
require 'json'
require 'aws-sdk'

class AWSInstance
 
  def initialize (access_key, secret_key)
      @access_key_id=access_key
      @secret_access_key=secret_key
  end

  def cost 
    instances = get_running_instances
    cost = 0.00
    #Calc Cost
    instances.each_pair do |region, types|
      types.each_pair do |instance_type, nodes|
        cost += (nodes.size.to_f * price(instance_type, region).to_f)
      end
    end
    return cost
  end

  def get_instances
    #Return all running instances as that has a cost in its hash
    return get_running_instances
  end

  private
  def price (api_size, region)
    # On-demand pricing JSON published on aws.amazon.com (RHEL price list)
    rhel_url = "/rhel/pricing-on-demand-instances.json"
    price = 0
    size=""
    instance_type=""
    response = Net::HTTP.get_response("aws.amazon.com",rhel_url)
    #puts response.body 
    # Convert to hash
    json_hash = JSON.parse(response.body)

    # api_size i.e. m1.xlarge
    # Get some info to help looking up the json
    case api_size
      when "m1.small"
        size = "sm"
        instance_type = "stdODI"
      when "m1.medium"
        size = "med"
        instance_type = "stdODI"
      when "m1.large"
        size = "lg"
        instance_type = "stdODI"
      when "m1.xlarge"
        size = "xl"
        instance_type = "stdODI"
      when "t1.micro"
        size = "u"
        instance_type = "uODI"
      when "m2.xlarge"
        size = "xl"
        instance_type = "hiMemODI"
      when "m2.2xlarge"
        size = "xxl"
        instance_type = "hiMemODI"
      when "m2.4xlarge"
        size = "xxxxl"
        instance_type = "hiMemODI"
      when "c1.medium"
        size = "med"
        instance_type = "hiCPUODI"
      when "c1.xlarge"
        size = "xl"
        instance_type = "hiCPUODI"
      when "cc1.4xlarge"
        size = "xxxxl"
        instance_type = "clusterComputeI"
      when "cc2.8xlarge"
        size = "xxxxxxxxl"
        instance_type = "clusterComputeI"
      when "cg1.4xlarge"
        size = "xxxxl"
        instance_type = "clusterGPUI"
      when "hi1.4xlarge"
        size = "xxxxl"
        instance_type = "hiIoODI"
    end
  # json_hash["config"]["regions"][0]["instanceTypes"][0]["sizes"][3]["valueColumns"][0]["prices"]["USD"]
    json_hash["config"]["regions"].each do |r|    
      if (r["region"] == region.sub(/-1$/,''))
        r["instanceTypes"].each do |it|
          if (it["type"] == instance_type)
            it["sizes"].each do |sz|
              if (sz["size"] == size)
                price=sz["valueColumns"][0]["prices"]["USD"]
              end
            end
          end 
        end
      end
    end
  
    return price
  end

  def get_running_instances
    #Set up EC2 connection
    ec2 = AWS::EC2.new(:access_key_id => @access_key_id, :secret_access_key=> @secret_access_key)
    instance_hash = Hash.new
    
    #Get a list of instances
    #Memorize cuts down on chatter
    AWS.memoize do
      ec2.regions.each do |region|
        instance_hash.merge!({region.name => {}})
        region.instances.each do |instance|
          if (instance.status == :running)
            #Need to create a blank hash of instance_type else merge fails
            if (!instance_hash[region.name].has_key?(instance.instance_type) )
              instance_hash[region.name].merge!({instance.instance_type => {}})
            end
            #For all running instances 
            tmp_hash={instance.id => {:env => instance.tags.Env, :role => instance.tags.Role, :name => instance.tags.Name, :cost => price(instance.instance_type,region.name) }}
            instance_hash[region.name][instance.instance_type].merge!(tmp_hash)
          end
        end
      end
    end
    return instance_hash
  end
end #End class
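
To use it, something like this should do, with the keys passed on the CLI as with the earlier scripts (the require path assumes you saved the class as aws_instance.rb):

require './aws_instance'  # assumption: the class above lives in aws_instance.rb

aws = AWSInstance.new(ARGV[0], ARGV[1])
printf("Current hourly instance cost: $%.2f\n", aws.cost)
aws.get_instances.each_pair do |region, types|
  types.each_pair do |type, nodes|
    puts "#{region} #{type}: #{nodes.size} running"
  end
end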

AWS best practice – Architecting the cloud

Architecting the Cloud

In this post I will go over some best practices to help you architect a solution that will hopefully survive most Amazon incidents. To start with, let’s look at a single region and how to make the best use of it.

Instances

Starting with the most basic steps first: you want each instance you create to be as stateless and as lightweight as possible. Ideally you would use instance-store backed instances, as these do not rely on EBS to be working, so you are reducing your dependency on the Amazon infrastructure, and one less dependency is one less thing to go wrong. If you cannot avoid the use of an EBS-backed instance, then you will want to ensure that you have multiple instances providing the same service.

Also consider the use of your service: S3 is slow for you to download data from and then share out again, but you could push the handling of the access off to Amazon, helping make your environment a bit more stateless. It is also worth noting that there have been far fewer issues with S3 than EBS. Obviously if you need the capacity of EBS (S3 has a single file size limit of 5TB), then RAID the drives together for data storage. You cannot do this for your instance storage, but at least your data will be okay.

On a side note, out of 200+ volumes during a recent outage, we only had one with issues, so they are quite reliable, although sometimes slow; however, if your aim is ultimate uptime, you should not rely on them.

Storage

As I pointed out before, your main storage types are EBS and S3. EBS is block device storage, and as a result is just another hard drive for you to manage; you can set up RAID on them or leave them as single disks. Then there is S3, which is a key-value store accessed via a REST API.

With EBS and S3, it is never stated anywhere that your data is backed up. Your data is your responsibility: if you need a backup, you need to take snapshots of the data, and if you want an “off-site” equivalent you would need to make sure you have the EBS snapshot replicated to another region; the same applies for S3.

A big advantage of EBS is the speed to write and read from it; if you have an application that requires large amounts of disk space, then this is your only real option without re-architecting.

S3 is simple, hence the name, and as a result it very rarely goes wrong, but it does have a lot more limitations compared to EBS. One of them is down to the reliability: it won’t send a confirmation that the data has been written until it has been written to two AZs, so for light usage and non-time-dependent work it is probably your best choice. S3 is ideal, however, for data that is going to be read a lot; one reason is that it can easily be pushed into CloudFront (a CDN), and as a result you can start offloading the work from your node.

In short, where possible don’t store anything. If you do have to store it, try S3 so you can offload the access; if that is not adequate, then fall back to EBS, make sure you have decent snapshots, and be prepared for it to fail miserably.

Database Storage

RDS is a nice database service that will take care of a lot of hassle for you, and I’d recommend that it is used, or DynamoDB. RDS can be split across multiple AZs, and the patch management is taken care of for you, which leaves you to just configure any parameters you want and point your data at it. There is a limitation with RDS of 1TB of database storage, but in most cases I’d hope the application could deal with this somehow; otherwise, you are left trying to run a highly performant database in Amazon, at which point you are on your own.

Unless, of course, you can make use of a non-relational database such as DynamoDB, which is infinitely scalable, performant and totally managed for you. Sounds too good to be true, and of course it is: it is a non-relational database, and the scaling speed is limited. At the present moment in time you can only double the size and speed of your DynamoDB once per day, so if you are doing a marketing campaign you may have to take this into account days in advance.

Availability Zones

Hopefully by this point you have chosen the right type of instance and storage locations for any data, leaving you the joys of thinking about resilience. At a bare minimum you will want to have servers that provide the same functionality spread out across multiple AZs, and some sort of balancing mechanism, be it an ELB, round-robin DNS or latency-based DNS.

The more availability zones your systems are in, the more likely you are to be able to cope with any incidents. Ideally you would take care of this through the use of auto scaling; that way, if there is a major incident, it will bring nodes back for you.

Having instances in multiple AZs will protect you in about 80% of cases. However, EBS and S3, although spread across multiple AZs, are a single point of failure, and I have seen issues where access to EBS-backed instances is incredibly slow across a number of servers; in my case, 50% of servers across multiple availability zones were all affected by accessibility of the data. So your service cannot rely on a single region, for reasons like this. One of the causes, I believe, is that when EBS fails there is some sort of auto-recovery which can flood the network and cause some disruption to other instances.

A little-known fact about AZs is that every client’s AZ mapping is different. If you have 2 accounts with Amazon, you may well get presented different AZs, and even those with the same name may in fact be different AZs, and vice versa.

Regions

With all of the above you can run a quite successful service all in one region with a reasonable SLA, if you are lucky enough not to have any incidents. At the very least you should consider making your backups into another region. Multiple regions, much like multiple data centres, are difficult, especially when you have no control over the networking, and this leaves you in a bit of a predicament. You can do latency-based routing or weighted round robin within Route53; in this case, if a region goes offline, your traffic should be re-routed to the alternative address.

Things to watch out for

Over the months we’ve been hosting on AWS, there have been a number of occasions where things don’t work the way you expect them to, and the aim of this section is to give you some pointers to save you the sorrow.

Instance updates
There have been a number of occasions where an instance has stopped working for no good reason: all of a sudden the network may drop a few packets, the IO wait may go high, or in general it is just not behaving the way it should. In these situations, the only solution is to stop and start the instance. A little-known fact is that the stop and start process will ensure that your instance is on hardware with the latest software updates. However, I have been told by AWS support that new instances may end up on hardware that is not optimal, so as a result you should always stop and start new instances.
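
If you end up doing that a lot, it can be scripted with the same SDK used earlier; a rough sketch (the instance ID and keys are placeholders):

require 'rubygems'
require 'aws-sdk'

ec2 = AWS::EC2.new(:access_key_id => ARGV[0], :secret_access_key => ARGV[1])
instance = ec2.instances["i-12345678"]
# A stop/start (unlike a reboot) can land the instance on different hardware
instance.stop
sleep 5 until instance.status == :stopped
instance.start
sleep 5 until instance.status == :running
puts "#{instance.id} is back, potentially on fresher hardware"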

In severe cases Amazon will mark a node as being in a degraded state, but I believe they will only do this after a certain percentage of instances have migrated off or it has been degraded for a while.

Scaling up instance size

This is an odd one, predominantly because of a contradiction. You can easily scale up any instance by stopping it in the web GUI and changing its size in the right-click menu. This is good: you can have a short period of downtime and get a much larger instance, the downside being that your IP and DNS will change, as it is a stop and start. However, if you had deployed your instance via CloudFormation, it would be able to scale up and down on the fly with a CloudFormation script change.

Security with ELBs
With security groups you can add TCP, ICMP or UDP access rules to a group from another security group or from a network range, thus securing instances in the same way a perimeter firewall would. However, this doesn’t guarantee security, specifically if you then add an ELB to the front end. With ELBs you do not know what the network would or could be for them, so you would ultimately need to open up full access just to get the ELB to talk to your host. Now, Amazon will allow you to add a special security group that will basically grant the ELBs full access to your security group, and as a result you have guaranteed that access is now secured, for the most part.

However, ELBs are by their nature publicly accessible, so what do you do if you’re in EC2 and want to secure an ELB which you may need to load balance some internal traffic? Well, nothing. The only option available to you in this situation is to use an ELB within a VPC, which gives you the ability to apply security groups to the ELB.

There are ways to architect around this using Apache, but this does depend on your architecture and how you intend to use the balancer.

Everything will fail
Don’t rely on anything to be available. If you make use of the API to return results, expect it to fail or to not be available. One thing we do is cache some of the details locally and add some logic around the data, so that if it’s not available, things continue to work. The same principle applies to each and every server/service you are using: where possible, just expect it to not be there, and if it has to be there, at least make sure it fails gracefully.

AWS best practice – Introducing Amazon

Introducing Amazon

Last week I introduced the Cloud; if you missed it and feel the need to have a read, you can find it Here. Now, on with Introducing Amazon…

I’m not really going to introduce all of Amazon; Amazon release a lot of new features each month, but I will take you through some of the basics that Amazon offer, so that when you’re next confronted with them it is not a confusing list of terms. I won’t go into any of the issues you may face, as that is a later topic.

EC2 Elastic Compute Cloud: this is more than likely your entry point. It is, in short, a virtual platform to provide you an OS on; instances come in various shapes, sizes and flavours. For more information on EC2, click here.

ELB Elastic Load Balancer: this is used to balance web traffic or TCP traffic depending on which type you get (layer 7 or layer 4). An ELB is typically used to front your web servers that are in different Availability Zones (AZs), and they can do SSL termination.

Security Groups These are quite simply containers that your EC2 instances live in, and you can apply security rules to them. However, two instances in the same security group will not be able to talk to each other unless you have specifically allowed them to do so in the security group. It is this functionality that separates a security group from being considered a network, that and the fact each instance is in a different subnet.

EIP Elastic IP: these are public IP addresses that are static and can be assigned to an individual EC2 instance; they are ideal for public DNS to point to.

EBS Elastic Block Storage: in short, a disk array attached to your EC2 instance. EBS volumes are persistent disk stores, and most EC2 instances are EBS-backed and are therefore persistent. However, you can mount ephemeral disk drives that are local storage on the virtual host; these disk stores are non-persistent, so if you stop/start an instance the data will be lost (they will survive a reboot).

S3 Simple Storage Service: S3 is a simple key-value store, but one that can contain keys that are folders, and the value can be anything: text files, Word docs, ISOs, HTML pages etc. You can use S3 as a simple web hosting service if you just upload all of your HTML to it and make it public. You can also push S3 data into a CDN (CloudFront). There are some nice security options around accessibility permissions and at-rest encryption for your S3 buckets. An S3 bucket is just the term to describe where your data ends up, and is the name of the S3 area you create.

IAM Identity and Access Management: this is a very useful service that will allow you to take the original account you used to sign up to Amazon with and lock it away for eternity. You can use IAM to create individual accounts for users or services and create groups to contain the users in; with users and groups, you can use JSON to create security policies that grant the user or group specific access to specific services in specific ways.

VPC Virtual Private Cloud: this is more or less the same service you get via EC2, but private. There are some interesting elements of it that are quirky to say the least, but you can create your own networks, making your services private from the greater Amazon network, while still assigning EIPs if you so wish. Most services, but not all, are available within a VPC, and some features are only available in VPCs, such as security groups on ELBs.

AZ Availability Zones are essentially data halls, or areas of racks, that have independent cooling and power but are not geographically disperse, i.e. an AZ can be in the same building as another. Amazon’s description is as follows: “Availability Zones are distinct locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same Region”. This will be touched on later.

Region A region is a geographically disperse Amazon location; it could be in another country, or it could be in the same country. I’d imagine that all are at least 30 miles apart, but Amazon are so secretive about everything it could be that building behind you.

If you want to know more about the products, I would read the product page here. In next week’s post I’m going to start going into a bit of detail about architecting for the cloud and some design considerations that you should be aware of.