Automate to survive

Everyone has a choice

Automate or die. That is pretty much it: you can automate everything, or you can keep working with manual processes that slow you down. If you don’t think you have the time to automate, you’re wrong; you need to automate, and do it quickly, before you get even busier and even further behind. Maybe you think that you can’t automate because you don’t have the time to do it justice? Maybe you can’t automate because the task is too big? Too complicated? Well, it’s all rubbish.

Start small

This is a bit like eating an elephant: you have to start somewhere, and you have to start small. By all means try to start big if you want, but smaller is better. Maybe you have a task to check for new packages from a vendor’s site once a month; that is a good place to start, pulling third-party packages from vendor sites into your yum repo. Or maybe every time you build a server you need to do x, y & z. These sorts of tasks are achievable for everyone, even those without a strong programming background, which leads on to language choice. Not all languages are equal, but knowing two or three is better than just one: at a minimum some sort of terminal language, so Bash, ash, sh or ksh, plus a ‘proper’ object-oriented language like Ruby, Python or Perl. The terminal languages are good for turning what you do on a terminal into a reproducible and consistent format, but they are terrible for manipulating multiple data sources and mangling data, although with that said you can still do some complicated things with them.
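
Coming back to the once-a-month package check: as a rough idea of how small that first bit of automation can be, here is a minimal sketch (the URL, package name and paths are made up for illustration) of a cron-able script that pulls a vendor package into a local yum repo:

#!/bin/bash
# Minimal sketch: mirror one vendor package into a local yum repo
# (URL, package name and repo path are placeholders)
set -e
REPO_DIR=/var/www/html/yum/thirdparty
VENDOR_URL=https://vendor.example.com/downloads

mkdir -p "$REPO_DIR"
# -N only downloads if the remote file is newer than the local copy
wget -N -P "$REPO_DIR" "$VENDOR_URL/great-tool-latest.x86_64.rpm"
# Rebuild the repo metadata so clients pick up the new package
createrepo --update "$REPO_DIR"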

Once you start building up many smaller components of automation, start looking at ways of linking them all together so that a series of tasks becomes one. It is this constant cycle of simplifying the process, automating the small chunks and then linking the small chunks together that makes an automated system.

Grow large

Over a year ago we used to deploy our environments with Puppet and CloudFormation, and it used to take about 2.5 days to complete and get working; that whole process is now down to 10 minutes thanks to automation. It required many leaps of faith, many poor decisions and a lot of bug fixing, but it got there through simplifying everything down and then automating each component. Other than building the servers and tagging them with appropriate keys in Amazon, the whole process is controlled by bash to the point of a working system, and it is typically very robust. That is a massive time saving, but to get there we had to fail, we had to try and we had to persist.

As a result we now automate large portions of the architecture, to the point where all of our time is split between incidents and project work to implement new features, with hardly any daily grind. Recently I have been working on our DR strategy, taking it to the point of clicking a button to deploy a clean environment built from the ground up that automatically pulls the latest backups to restore into the environment. It is now done, and it saves hours of time building out a DR environment, which makes the recovery time shorter and the process easier. So larger projects are perfectly achievable with the right attitude!

Summary

Give it a go: start small and work up to it, but be unrelenting and do whatever it takes, no matter how much you disagree with it; just do it to get it automated. Once more is automated you’ll have some time to fix it up properly, or you’ll need to extend it and you can make a small part better then.

Oh no, not java

How strange

Over the past few months I’ve been writing more and more applications to help maintain and deliver the services we run, from metric gathering to regional DR and anything in the middle. For a while now one of the developers at Alfresco has been writing a framework that makes it easier to write Selenium tests for Alfresco Share, which takes a lot of the hassle out of looking for certain elements or class IDs, or updating everything if the UI changes. We have been talking about it for a couple of months, and today I decided to get some time to look at it and ask loads of silly questions about Eclipse and Maven and so on and so forth.

It took about 3 hours to get everything set up and working; most of the time was just spent learning to use Eclipse and Maven, with a walkthrough of what the framework can do, how to extend it and how to do stuff with it. Considering I hadn’t done any Java for 6 years it wasn’t that bad, and within 15 minutes of being left to it I had made a class that logged in and searched for content inside the repo.

One of the reasons we’re so interested in it is because, as DevOps, we like simple things and it takes a lot of the hassle out; it means we also get to do some complicated things with Share while only having to worry about what we want to test or measure. All of this got me thinking about the languages we use and the problems they solve.

Right tool, Right job

Currently in our team we are using bash, Ruby, Python and Java. Bash is simple and can achieve some good results, although it is typically quite slow; if it is a short script it will usually end up in bash. We also do our orchestration in bash, and it manages everything from bare metal to a working OS by triggering whatever apps we need or setting config.
Ruby is the language of choice for me when I need to manipulate data, retry actions or do anything that is more than procedural, and you can rely on it to do a good job in a reasonable time.
Python is new to the team; it fills a gap in that it’s as easy to write as Ruby but scales better with size. I haven’t done any Python yet so I can’t really comment, but the web app that was built with it in a couple of weeks is quite impressive. Java is more complicated and harder to write, but it can support more complex apps; then again, I’m not sure you typically need to make apps that complex.

So I’m not a fan of Java, mainly because I think it takes a long time to get anything of value out of it, especially on a small task. If I had to write an application to manage backups I would not go to Java, as that is like using a bazooka to hit a fly; likewise using Bash is like using a feather duster, whereas Ruby and Python fit nicely in the middle. Well, after today’s experience I’m glad I’m doing it in Java; I would have spent weeks making something half as good in Ruby just to avoid using Java, and I guess it’s not really that bad.

I could have wasted time doing it all from scratch or just taken what’s already written, so I stole, like any good DevOps guy would.

Summary

I’m probably going to spend some more time in Java over the next week writing something a bit more useful than today’s experiment, so hopefully I will still be optimistic about it all. Maybe I’ll remember why I don’t like Java, or maybe I’ll change my opinion, who knows!

You have to stick with these things

Hang in there

Just over a year ago I decided that I had, at best, a mediocre online presence. Sure, if you search for my handle all sorts of stuff turns up and most of it is me, but in this age where the internet is everlasting I didn’t want my previous 10 years of internet to be the defining pattern I left on the world. So with that I decided to start annoying people. Originally I planned to do two blogs a week, one techy and one non-techy; well, quite frankly that’s hard work, but it lasted for several months. Now I have a much more leisurely one-a-week post and that seems to be working out better.

When I first started I remember being relatively happy with a small number of users visiting; well, over the last year I have grown my blogging empire quite significantly, and the best part of all of this is the statistics, it is all nicely measurable. Having that feedback on which articles work and which ones don’t is handy; it doesn’t stop me writing the ones no one likes, but at least I’m aware they will be less liked before I write them.

Statistics

Here are some numbers to make things more interesting.

Visits per month

Month     Visitors
Feb 2012 132
Mar 2012 167
Apr 2012 167
May 2012 284
Jun 2012 387
Jul 2012 407
Aug 2012 460
Sep 2012 491
Oct 2012 728
Nov 2012 1323
Dec 2012 1115
Jan 2013 1572

Up until September I was thinking the progress was slow but steady, and I was a little disappointed, and then bang, much better! I was talking to my boss a couple of months back and he mentioned how Google seems to sit you in a sandbox for a few months before they really let things go, and that’s sort of what it seems like here. I didn’t write any killer articles that suddenly caused a spurt; I may have got a link back from Alfresco.com, but I didn’t know that until November, when my boss decided to tell everyone. Oh well, better than employing a marketing person :)

I now hover around 300 visitors a week to the site, which is still not a large number, but if only 1 percent of those actually reads an article it makes it worthwhile!

One of my articles, rhn satellite vs puppet, a clear victory?, has almost managed 200 views by itself, but there are a few others in there doing okay, and some I thought would do better, so here’s a chance to show some of those that I thought would do better.

The 5 posts that should have done better!

  1. Self healing systems
  2. What university forgot to mention about programming
  3. Cloud deployment 101 – Part3
  4. Cloud deployment 101 – Part1
  5. Who burnt down the building?

There are quite a few more stats that could be shared, but they are not that interesting really… So back to the point of why I started the blog: to leave a bit more of an impression. Well, I think I have done that; I have some good referrals that make it back to me, and here’s a couple that always surprised me.

And one that came up today is me being quoted! Here. So when I look back over the year and I see what has been achieved, despite a slow start I’m glad I stuck it out. Hopefully when I’m doing this next year I’ll be talking more about the news article I had to write or the TV show I appeared on… Well, we still need to have dreams, else what’s the point!

So I wonder what next year will bring if I keep on blogging and you the people keep on reading, as let’s face it, if it wasn’t for you people I would just be watching TV or something, so thanks!

Distributed Puppet

Some might say…

Some might say that running puppet with a central server is the right way to go; it certainly provides some advantages like PuppetDB, stored configs, external resources etc. etc., but is that really what you want?

If you have a centralised puppet server with 200 or so clients, there are some fancy things that can be done to ensure that not all nodes hit the server at the same time, but that requires setting up and configuring additional tools etc. etc…

What if you just didn’t need any of that? What if you just needed a git repo with your manifests and modules in, and puppet installed?
Have a script download and install puppet, pull down the git repo and then run it locally. This method puts more overhead on a per-node basis, but not much, as the node had to run puppet anyway, and in all cases this can still provide the same level of configuration as the server/client method; you just need to think outside of the server.

Don’t do it, it’s stupid!

My response to my boss some 10 months ago when he said we should ditch puppet servers, ditch per-server manifests and outlaw all variables. Our mission was to be able to scale our service to over 1 million users, and we realised that manually adding extra node manifests to puppet was not sustainable, so we started on a journey to get rid of the puppet server and redo our entire puppet infrastructure.

Step 1 Set up a git repo – You should already be using one; if you aren’t, shame on you! We chose GitHub: why do something yourself when there are better people out there doing a better job and dedicated to doing just that one thing? Spend your time looking after your service, not your infrastructure!

Step 2 Remove all manifests based on nodes – Replace them with a manifest per tier / role. For us this meant consolidating our prod web role with our qa, test and dev roles so it was just one role file regardless of environment. This forces the management of the environment-specific bits into variables.

Step 3 Implement hiera – Hiera gives puppet the ability to externalise variables into configuration files, so we now end up with a configuration file per environment and only one role manifest. This, as my boss would say, “removes the madness”. Now if someone asks “what’s the difference between prod and test?” you diff two files, regardless of how complicated you want to make your manifests, inherited or not. It’s probably worth noting you can set default values for Hiera lookups… hiera(“my_var”,”default value”)

Step 4 Parameterise everything – We had lengthy talks about parameterising modules vs just using hiera, but to keep the modules transparent about whatever is coming into them (and because I was the one writing them), we kept parameters. I did however move all parameters for all manifests in a module into a “params.pp” file and inherit that everywhere to re-use the variables; within each manifest everything either defaults to the params.pp value or is left blank (to make it mandatory). This means that if you have sensible defaults you can set them here and reduce the size of your hiera files, which in turn makes it easier to see what is happening. Remember most people don’t care about the underlying technology, just the top-level settings, and trust that the rest is magic… For the lower-level bits see these: Puppet with out barriers part one for a generic overview, Puppet with out barriers part two for manifest consolidation and Puppet with out barriers part three for params & hiera.

This is all good, but what if you were in Amazon and you don’t know what your box is? Well, it’s in a security group, but that is not enough information, especially if your security groups are dynamic. You can also tag your boxes, and you should make use, where possible, of the AWS CLI tools to do this. We decided a long time ago to set a few details on a per-node basis: Env, Role & Name. From these we know what to set the hostname to, which puppet manifests to apply and which set of hiera variables to apply.
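
Setting those tags is one call per box with the same EC2 API tools used further down to read them back; a rough example (the instance ID and values are made up):

# Tag an instance with the three keys the rest of the tooling relies on
$EC2_HOME/bin/ec2-create-tags i-0123abcd --tag "Env=prod" --tag "Role=web" --tag "Name=prod-web-01"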

Step 5 Facts are cool – Write your own custom facts for facter. We did this in two ways: the first was to just pull down the tags from Amazon (where we host) and return them as ec2_<tag>. This works, but AWS has issues so it fails occasionally. Version 2 was to get the tags, cache them locally in files, and then have facter pull them from the files locally… something like this…

#!/bin/bash
# Load the AWS config
source /tmp/awsconfig.inc

# Grab all tags locally
IFS=$'\n'
for i in $($EC2_HOME/bin/ec2-describe-tags --filter "resource-type=instance" --filter "resource-id=`facter ec2_instance_id`" | grep -v cloudformation | cut -f 4-)
do
        key=$(echo $i | cut -f1)
        value=$(echo $i | cut -f2-)

        if [ ! -d "/opt/facts/tags/" ]
        then
                mkdir -p /opt/facts/tags
        fi
        # Quote the value so the test behaves when it is empty
        if [ -n "$value" ]
        then
                echo "$value" > "/opt/facts/tags/$key"
                /usr/bin/logger set fact "$key" to "$value"
        fi
done
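
For reference, /tmp/awsconfig.inc sourced at the top is nothing clever; assuming the classic EC2 API tools, it would hold something along these lines (all values are placeholders):

# /tmp/awsconfig.inc - environment for the EC2 API tools (placeholder values)
export JAVA_HOME=/usr/lib/jvm/jre
export EC2_HOME=/opt/ec2-api-tools
export AWS_ACCESS_KEY=AKIAXXXXXXXXXXXXXXXX
export AWS_SECRET_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
export PATH=$PATH:$EC2_HOME/bin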

The AWS config file just contains the same info you would use to set up any of the CLI tools on Linux, and you can then turn the cached tags into facts with this:

# Each cached tag file becomes an ec2_<tag> fact
tags=`ls /opt/facts/tags/`.split("\n")

tags.each do |keys|
        value = `cat /opt/facts/tags/#{keys}`
        fact = "ec2_#{keys.chomp}"
        Facter.add(fact) { setcode { value.chomp } }
end

Also see: Simple facts with puppet

Step 6 Write your own boot scripts – This is a good one, scripts make the world run. Make a script that installs puppet, make a script that pulls down your git repo, then run puppet at the end, like the sketch below.
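
A minimal sketch of what that boot script can look like, assuming a masterless layout with manifests/ and modules/ at the top of the repo and a made-up GitHub URL:

#!/bin/bash
# Hypothetical bootstrap: install puppet, pull the repo, apply locally
set -e
REPO=https://github.com/example/puppet-repo.git   # placeholder URL
WORKDIR=/opt/puppet-repo

# 1. Install puppet and git
yum install -y puppet git

# 2. Pull down the manifests and modules
if [ -d "$WORKDIR/.git" ]
then
        cd "$WORKDIR" && git pull -q
else
        git clone -q "$REPO" "$WORKDIR"
fi

# 3. Run puppet locally; the role tag (via the ec2_role fact) picks the role manifest
puppet apply --modulepath="$WORKDIR/modules" --node_name_fact ec2_role "$WORKDIR/manifests/site.pp"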

The node_name_fact is awesome, as it kicks everything into gear: boxes deployed into a security group with the appropriate tags hook themselves in and become fully built servers.

Summary

So now puppet is on each box; every box, from the time it’s built, knows what it is (thanks to tags) and bootstraps itself into a fully working box thanks to your own boot script and puppet. With some well-written scripts you can cron the pulling of git and a re-run of puppet if so desired (see below). The main advantage of this method is the distribution: as long as a box manages to pull that git repo it will build itself, and if something changes on the box, it’ll put it back, because it has everything locally, so there are no network issues to worry about.
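
If you do go the cron route, a single entry is enough, something along these lines (path and schedule are just an example, pointing at the boot script from Step 6):

# /etc/cron.d/puppet-local - re-pull the repo and re-apply every 30 minutes
*/30 * * * * root /usr/local/sbin/puppet-bootstrap >> /var/log/puppet-bootstrap.log 2>&1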

Openstack

I played with hardware!

It has been over a year since I had to play with hardware properly to achieve something practical, but that is part of the joy of being in the world of cloud computing: that world where you don’t own anything, you pay by the hour and occasionally things go horribly wrong, but you delete it and start again; the throwaway society of cloud computing.

Every now and then I get frustrated with AWS, normally because there is something wrong. Let’s say a box that is meant to have unrivalled resources starts going slow; you end up doing some investigation, but the answer is simply that the underlying hypervisor is busy, probably due to other people hammering the server for some reason… Either way, in a cloudy world your choices are thus:

  1. Wait it out
  2. Throw it away

You could hope the problem gets better or you could delete the server and build it somewhere else and hope that one is better, rinse and repeat the above two until a stable service is resumed.

Being throwaway is really useful; it enables you to re-build quickly and not suffer too much if something major happens, so I think people (you…) should make sure that no matter where your server is you can rebuild it from scratch in less than 10 minutes. Now imagine you still had that throwaway ability, requesting servers as and when you wanted via a WebUI, a CLI or some API calls, but in addition you had control of the physical hardware: you could optimise what was running on the hypervisor to offer the best performance. This is all very good, but it is not without its drawbacks; someone has to physically rack and cable in all of the servers that are running the infrastructure, someone has to firmware-patch them and replace dead hard drives and do all of that boring stuff that cloud folk have forgotten about.

So what about Openstack

So for those that don’t know, OpenStack is a private cloud: it means you can run services in your data centre that mimic AWS. You get block storage (EBS) in the form of Cinder, object storage (S3), instance storage (EC2) and a host of other things that I won’t go into. The API may not be 100% the same as AWS, and the features that you have in AWS may not be available in OpenStack yet, but it’s catching up and it’s doing so rapidly. I would predict that over the next 2-3 years we see OpenStack compete with AWS on features and even start seeing AWS taking features that OpenStack has and porting them across. So definitely one to watch.

Over the last 3-6 months OpenStack had come up a few times, and I put it on my todo list to have a play, but quite frankly I had other things to be doing. Well, last week I was asked to help set up the SAN and network for an OpenStack PoC for internal IT. Falling back on my not-as-legacy-as-I’d-like Cisco skills, and having used the same SAN tech before, it didn’t take long to get that set up, and I thought it would take ages to get the various components of OpenStack up and working. It could have done, if it wasn’t for one saving grace: the PoC on a disk that Rackspace provide Here. It may not be the latest or the most perfect, but it saved a lot of time in getting something up and working. If you aren’t sure what OpenStack is, I would suggest getting a few bits of legacy kit and having a play like we did; just set aside two or three days to play with the technology and to set up the various elements of it, it’s worth a play.

There are already a few advantages of OpenStack vs AWS; a silly one for me is a console. OpenStack gives you VNC access to your servers, so you can now survive any minor iptables glitch or networking mishap by yourself. Yes, I know it should all be throwaway, but sometimes the box has some data on it that is important, or you want to know what went wrong, and having a console is good. Let’s not overlook the fact that you’re calling the shots, so if it doesn’t do what you want it to, you could, if you wanted, commit code back to make it better, change the hardware spec, the distribution of VMs or any other element in a thousand that you may need to control; with this you can.

But it’s not all good. It still comes back to managing your own data centre, and there are very few companies or services that get to a size where they have to move off of AWS for performance reasons; typically you’d move off of AWS to save a few dollars. But by the time you factor in additional head count for maintaining the physical boxes, power, cooling, rack locations, geographically diverse locations and the infrastructure services, the platform and its skill set, you may not be saving as much money as you want. You’ll probably break even, with the advantage of controlling the whole underlying infrastructure on top of still having the throwaway nature of a cloud service.

I’m not saying you should, or could, make it so you support everything yourself all the time, even high bursts of traffic, but at least you could use public cloud for what it’s good for: bursting onto when times get hard and more processing power is needed. Granted, to be able to do that all systems would need to be automated and able to migrate at the push of a button. By the time you’ve gone through that whole process with all of your applications, either in a private cloud or in public cloud, it wouldn’t matter if you had to u-turn tomorrow, you could do that. As long as you’re smart enough to only use services that are available in multiple places, i.e. in OpenStack and AWS.

Interesting times ahead I think.

Applying AMPs to Alfresco

A bit of background

Alfresco comes with the ability to be extended in a nice easy way, through the use of Alfresco Module Packages (AMPs). In essence an AMP is the delta of changes you would have made to the raw source code if you wanted to make some sort of customisation, or to apply one of the ones Alfresco supplies like the S3 connector.

Over the last 3 years I’ve seen it done a number of ways: using the mmt tool to apply them manually, a shell script to do it, and now I’ve decided that wasn’t good enough.

Using the mmt tool manually is obviously not brilliant; some poor person has to sit there and run it to apply the amps, so you may have guessed this is not a good idea.
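
For context, applying a single amp by hand with the MMT looks roughly like this (the amp file name is made up; the paths match the constants used in the script further down):

# Install one amp into the alfresco war, then clear the exploded webapp so Tomcat redeploys it
java -jar /var/lib/tomcat6/bin/alfresco-mmt.jar install /var/lib/alfresco/alf_data/amps/my-module-1.0.amp /var/lib/tomcat6/webapps/alfresco.war -nobackup -force
rm -rf /var/lib/tomcat6/webapps/alfresco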

What about wrapping the mmt tool in a shell script that can be triggered by, say, a sysadmin to apply all the amps, or have it executed once per amp by some configuration management tool like puppet? This is good. You put the amp into the configuration management tool, push the right buttons and it magically gets applied to the war files and all is well. Well, sort of: what happens if someone just throws an amp on the server? Who puts it in configuration management? Who’s made a backup? So I decided that I’d write a new script for applying amps that can be used both with a CM tool and as an ad-hoc script.

What does it do?

I’ve written it so it will trawl through a directory, pull up every amp in it and apply the amps to alfresco or share as needed. What’s quite handy is that it will take several versions of an amp and work out which is the latest, check that against what is already installed in the war, and then, if the amp is a newer version, apply it after making a backup.

For some odd reason I also made it cope with a variety of amp naming schemes, so you could upload alfresco-bob-123.amp or you could upload frank-share-super-1.2.3.4-5.amp; it’s your amp, call it what you want. All the script cares about is the correlation of terms between the file name and the amp info when it’s installed. So as long as you use 2 words from the file name that also appear in the amp description it will work it out for you. The higher the correlation the more accurate it will be; it is configurable too, but I set it to 2 occurrences of at least 2 words to match, and so far… it’s working.

I also forgot to mention that the script will stop alfresco, clear the caches and restart it for you in a pretty safe way.

A Script

Firstly, I realise this is a bad format to share the script in; in the past I’ve put them in a git repo and shared them that way. I have put this one in a git repo, and I hope to share that repo with some of the things we have done at Alfresco that are useful, either for running servers in general or for running Alfresco. Either way I hope to shortly get that out on a public repo, but for now here it is:

#!/usr/bin/ruby
# Require libs
$:.unshift File.expand_path("../", __FILE__)
require 'lib/logging'
require 'fileutils'
require 'timeout'

# Set up logging provider
Logging::new('alfresco_apply_amps')
Logging.log_level("INFO",false)
$log.info "Starting"

#CONSTANTS
BKUP_LOCATION="/var/lib/alfresco/alf_data/backups"
ALF_MMT="/var/lib/tomcat6/bin/alfresco-mmt.jar"
WEBAPPS_DIR="/var/lib/tomcat6/webapps"
AMP_LOCATIONS=["/var/lib/alfresco/alf_data/amps/"]

#Defaults
@restart=false

#Methods
def available_amps(amp_dir)
  #Get a list of Amps
  amps_list = `ls #{amp_dir}*`
  amps_array = amps_list.split("\n")
end

def backup(war)
  version=`/usr/bin/java -jar #{ALF_MMT} list #{WEBAPPS_DIR}/#{war}.war | grep Version | awk '{print $3}'`

  #Date stamp the war
  $log.info "Backing up #{WEBAPPS_DIR}/#{war}.war to #{BKUP_LOCATION}/#{war}-#{current_date}.war"
  `cp -a #{WEBAPPS_DIR}/#{war}.war #{BKUP_LOCATION}/#{war}-#{current_date}.war`
end

def clear_caches()
  $log.debug "Cleaning caches"  
  delete_dir("#{WEBAPPS_DIR}/alfresco/")
  delete_dir("#{WEBAPPS_DIR}/share/")
  delete_dir('/var/cache/tomcat6/work/',true)
  delete_dir('/var/cache/tomcat6/temp/',true)
  $log.info "Caches cleaned"  
end

def compare_strings(str1,str2,options={})
  matches = options[:matches] || 2
  frequency = options[:frequency] || 2

  #Make one array of words
  words=Array.new
  words << str1.split(' ') << str2.split(' ')
  words.flatten!
  #Hash to store each unique key in and number of occurances
  keys = Hash.new
  words.each do |key|
    if keys.has_key?(key)
      keys[key] +=1
    else
      keys.merge!({key =>1})
    end
  end

  #Now we have a Hash of keys with counts how many matches and what frequency
  #where a match is a unique key >1 and frequency is the count of each key i.e. 
  #matches=7 will mean 7 keys must be >1 frequency=3 means 7 matches must be > 3
  
  act_matches=0
  keys.each_pair do |key,value|
    if value >= frequency
      act_matches +=1
    end
  end
  if act_matches >= matches
    true
  else
    false
  end
end

def compare_versions(ver1,ver2)
  #return largest
  v2_maj=0
  v2_min=0
  v2_tiny=0
  v2_release=0
  v1_maj=0
  v1_min=0
  v1_tiny=0
  v1_release=0
  if ver1 =~ /\./ && ver2 =~ /\./
    #both are dotted notation
    #Compare maj -> release

    #Convert '-' to '.'
    ver1.gsub!(/-/,'.')
    ver2.gsub!(/-/,'.')

    #Compare components numerically so e.g. "10" beats "9"
    v1_maj = ver1.split('.')[0].to_i
    v1_min = (ver1.split('.')[1] || 0).to_i
    v1_tiny = (ver1.split('.')[2] || 0).to_i
    v1_release = (ver1.split('.')[3] || 0).to_i

    v2_maj = ver2.split('.')[0].to_i
    v2_min = (ver2.split('.')[1] || 0).to_i
    v2_tiny = (ver2.split('.')[2] || 0).to_i
    v2_release = (ver2.split('.')[3] || 0).to_i

    if v1_maj > v2_maj
      return ver1
    elsif v1_min > v2_min
      return ver1
    elsif v1_tiny > v2_tiny
      return ver1
    #Don't compare release for now as some amps don't put the release in the amp when installed so you end up re-installing
    #elsif v1_release > v2_release
    #  return ver1
    else
      return ver2
    end
  else
    #Validate both are not-dotted
    if ver1 =~ /\./ || ver2 =~ /\./
      $log.debug "Eiher both types aren't the same or there's only one amp"
      return ver2
    else
      result = ver1<=>ver2
      if result.to_i > 0 && !result.nil?
        return ver1
      else
        return ver2
      end
    end
  end
end

def current_date()
  year=Time.now.year
  month=Time.now.month
  day=Time.now.day
  if month < 10
    month = "0"+month.to_s
  end
  if day < 10
    day = "0"+day.to_s
  end
  "#{year.to_s+month.to_s+day.to_s}"
end

def current_version(app, amp_name)

#
# THIS needs to cope with multiple amps being installed, produce a array hash [{:amp=>"ampname",:version => ver},etc]
#

  if app == "alfresco" || app == "share"
    amp_info = `/usr/bin/java -jar #{ALF_MMT} list #{WEBAPPS_DIR}/#{app}.war`
    amp_title=""
    amp_ver=0
    #$log.debug "Amp info: #{amp_info}"
    amp_info.each_line do |line|
      if line =~ /Title/
        amp_title=line.split("Title:").last.strip.gsub(%r/(-|_|\.)/,' ')
      elsif line =~ /Version/
        # strip/replace ampname, downcase etc
        if compare_strings(amp_name.gsub(%r/(-|_|\.)/,' ').downcase,amp_title.downcase)
          amp_ver=line.split("Version:").last.strip
          $log.info "Installed Amp found for #{amp_name}"
          $log.debug "Installed version: #{amp_ver}"
        else
          $log.debug "No installed amp for #{amp_name} for #{app}"
        end
      end
    end
  else
    $log.warn "The application #{app} can not be found in #{WEBAPPS_DIR}/"
  end
  return amp_ver
end

def delete_dir (path,contents_only=false)
  begin
    if (contents_only)
      $log.debug "Removing #{path}*"
      FileUtils.rm_rf Dir.glob(path+"*")
    else
      $log.debug "Removing #{path}"
      FileUtils.rm_rf path
    end
  rescue Errno::ENOENT
    $log.warn "#{path} Does not exist"
  rescue Errno::EACCES
    $log.warn "No permissions to delete #{path}"
  rescue
    $log.warn "Something went wrong"
  end
end

def firewall(block=false)
  if block
    `/sbin/iptables -I INPUT -m state --state NEW -m tcp -p tcp  --dport 8080 -j DROP`
  else
    `/sbin/iptables -D INPUT -m state --state NEW -m tcp -p tcp  --dport 8080 -j DROP`
  end
end

def get_amp_details(amps)
  amps_hash = Hash.new
  amps.each do |amp|
    amp_hash = Hash.new
    #Return hash with unique amps with just the latest version
    amp_filename = amp.split("/").last
    amp_path = amp
    amp_name=""
    amp_version=""
    first_name=true
    first_ver=true
    #Remove the ".amp" extension and loop through
    amp_filename[0..-5].split("-").each do |comp|
      pos = comp =~ /\d/
      if pos == 0
        if first_ver
          amp_version << comp
          first_ver=false
        else
          #By commenting this out the release will get ignored which because some amps to put it in their version is probably safest
          #amp_version << "-" << comp
        end
      else
        if first_name
          amp_name << comp.downcase
          first_name=false
        else
          amp_name << "_" << comp.downcase
        end
      end
    end

    #If a key of amp name exists, merge the version down hash else merge the lot
    if amps_hash.has_key?(amp_name)
      amp_hash={amp_version => {:path => amp_path, :filename => amp_filename}}
      amps_hash[amp_name].merge!(amp_hash)
    else 
      amp_hash={amp_name =>{amp_version => {:path => amp_path, :filename => amp_filename}}}
      amps_hash.merge!(amp_hash)
    end
  end
  return amps_hash
end

def install_amp(app, amp)
  $log.info "applying amp to #{app}"
  $log.warn "amp path must be passed!" unless !amp.nil?

  $log.debug "Command to install = /usr/bin/java -jar #{ALF_MMT} install #{amp} #{WEBAPPS_DIR}/#{app}.war -nobackup -force"
  `/usr/bin/java -jar #{ALF_MMT} install #{amp} #{WEBAPPS_DIR}/#{app}.war -nobackup -force`
  restart_tomcat?(true)
  $log.debug "Setting flag to restart tomcat"
end

def latest_amps(amp_hash)
  amp_hash.each_pair do |amp,amp_vers|
    latest_amp_ver=0
    $log.debug "Comparing versions for #{amp}"
    amp_vers.each_key do |version|
      $log.debug "Comparing #{latest_amp_ver} with #{version}"
      latest_amp_ver = compare_versions(latest_amp_ver,version)
      $log.info "Latest version for #{amp}: #{latest_amp_ver}"
      if latest_amp_ver != version
        amp_vers.delete(version)
      end
    end
  end
  return amp_hash
end

def next_version?(ver, current_ver, app)
  #Loop through amp versions to work out which is newer than the installed
  #Turn list into array
  next_amp=false
  $log.debug "if #{ver} > #{current_ver}"
  if ( ver.to_i > current_ver.to_i)
    $log.debug "Next #{app} amp version to be applied:  #{ver}"
    next_amp=true
  end
end

def restart_tomcat()
  #If an amp was applied restart
  if (restart_tomcat?)
    $log.info "Restarting Tomcat.... this may take some time"
    $log.debug"Getting pid"
    if (File.exists?('/var/run/tomcat6.pid') )
      pid=File.read('/var/run/tomcat6.pid').to_i
      $log.debug "Killing Tomcat PID= #{pid}"
      begin
        Process.kill("KILL",pid)
        Timeout::timeout(30) do
          begin
            sleep 5
            $log.debug "Sleeping for 5 seconds..."
          end while !!(`ps -p #{pid}`.match pid.to_s)
        end
      rescue Timeout::Error
        $log.debug "didn't kill process in 30 seconds"
      end
    end
    $log.debug "Killed tomcat"

    #Clear caches
    clear_caches
    $log.info "blocking firewall access"
    firewall(true)
    $log.debug "starting tomcat"
    `/sbin/service tomcat6 start`
    if ($?.exitstatus != 0)
      $log.debug "Tomcat6 service failed to start, exitstatus = #{$?.exitstatus}"
    else
      #Tomcat is starting sleep until it has started
      #For now sleep for 180 seconds
      $log.info "Sleeping for 180 seconds"
      sleep 180
      $log.info "un-blocking firewall access"
      firewall(false)
    end
  else
    $log.info "No new amps to be installed"
  end
end

def restart_tomcat?(bool=nil)
  @restart = bool unless bool.nil?
  #$log.debug "Restart tomcat = #{@restart}"
  return @restart
end

# - Methods End

#
# doGreatWork()
#

#Store an Hash of amps
amps=Hash.new

#For each AMP_LOCATIONS find the latest Amps
AMP_LOCATIONS.each do |amp_loc|
  $log.debug "Looking in #{amp_loc} for amps"
  amps.merge!(get_amp_details(available_amps(amp_loc)))
end

#Sort through the array and return only the latest versions of each amp
latest_amps(amps)

amps.each do |amp, details|
  #The Amps in here are the latest of their kind available so check with what is installed
  details.each_pair do |version,value|
    if amp =~ /share/
      if next_version?(version,current_version("share",amp),"share")
        $log.debug "Backing up share war"
        backup("share")
        $log.info "Installing #{amp} (#{version}): #{value[:path]}"
        install_amp("share",value[:path])
      else
        $log.info "No update needed"
      end
    else
      if next_version?(version,current_version("alfresco",amp),"alfresco")
        $log.debug "Backing up alfresco war"
        backup("alfresco")
        $log.info "Installing #{amp} (#{version}): #{value[:path]}"
        install_amp("alfresco",value[:path])
      else
        $log.info "No update needed"
      end
    end
  end
end

$log.debug "Restart tomcat?: #{restart_tomcat?}"
restart_tomcat

$log.info "All done for now"

Okay, two things: it’s a long script, all in one file to make it easy to transport, and I’ve also used a logging class that enables logging to screen / file, which is below :) You could also just remove the require at the top and replace “$log.debug” with “puts”, up to you.

#
#   Set up Logging
#

require 'rubygems'
require 'log4r'

class Logging

  def initialize(log_name,log_location="/var/log/")
    # Create a logger named 'log' that logs to stdout
    $log = Log4r::Logger.new log_name

    # Open a new file logger and ask him not to truncate the file before opening.
    # FileOutputter.new(nameofoutputter, Hash containing(filename, trunc))
    file = Log4r::FileOutputter.new('fileOutputter', :filename => "#{log_location}#{log_name}.log",:trunc => false)

    # You can add as many outputters you want. You can add them using reference
    # or by name specified while creating
    $log.add(file)
    # or mylog.add(fileOutputter) : name we have given.

    # As I have set my logging level to ERROR. only messages greater than or 
    # equal to this level will show. Order is
    # DEBUG < INFO < WARN < ERROR < FATAL

    # specify the format for the message.
    format = Log4r::PatternFormatter.new(:pattern => "[%l] %d: %m")

    # Add formatter to outputter not to logger. 
    # So its like this : you add outputter to logger, and add formattters to outputters.
    # As we haven't added this formatter to outputter we created to log messages at 
    # STDOUT. Log messages at stdout will be simple
    # but the log messages in file will be formatted
    file.formatter = format
    
  end

  def self.log_level(lvl,verbose=false)
    # You can use any Outputter here.
    $log.outputters = Log4r::Outputter.stdout if verbose

    # Log level order is DEBUG < INFO < WARN < ERROR < FATAL
    case lvl
        when    "DEBUG"
            $log.level = Log4r::DEBUG
        when    "INFO"
            $log.level = Log4r::INFO
        when    "WARN"
            $log.level = Log4r::WARN
        when    "ERROR"
            $log.level = Log4r::ERROR
        when    "FATAL"
            $log.level = Log4r::FATAL
        else
             print "You provided an invalid option: #{lvl}"
    end
  end

end

I hope this helps people out; if there are any issues just leave comments and I’ll help :)

I did it, Plugins

I said I couldn’t do it

It was not long ago that I said in the Sentinel Update that I didn’t know how to do plugins. Well, less than a week after writing it I was reading a few articles by Gregory Brown on modules and mixins, and they are the first things I’ve read that explain them in a way I actually understand.

I was doing research into modules and mixins as they seemed a bit pointless, but thanks to the articles by Gregory I was able to understand them, and right in the middle of reading some of the examples and having a play a lightning bolt struck: it all became clear how to implement a plugin manager.

Some Bad Code

Based on some of the stuff I saw I came up with the following. Ignore most of it, I was just hacking around to see if I could get it to work; the names meant more in a previous iteration.

module PluginManager
#Just seeing if this works like magic...
    class LoadPlugin
        def initialize
            #The key is a plugin_name the &block is the code so in theory when initialzed it can be run
            @plugins={} unless !@plugins.nil?
        end

        def add_plugin (key,&block)
            @plugins.merge!({key=>block})
        end

        def run_plugin (key)
            puts "Plugin to Run = #{key}"
            puts "Plugin does:\n"
            @plugins[key].call

        end

        def list_plugins
            @plugins.each_key {|key| puts key}
        end
    end

end

plugins = PluginManager::LoadPlugin.new

plugins.add_plugin (:say_hello) do
    puts "Hello"
end

plugins.add_plugin (:count_to_10) do
    for i in 0..10
        puts "#{i}"
    end
end

plugins.add_plugin (:woop) do
    puts "Wooop!"
end

plugins.add_plugin (:say_goodbye) do
    puts "Good Bye :D"
end

puts "in theory... Multiple plugins have been loaded"
puts "listing plugins:"
plugins.list_plugins
puts "running plugins:"
plugins.run_plugin (:say_hello)
plugins.run_plugin (:woop)
plugins.run_plugin (:count_to_10)
plugins.run_plugin (:say_goodbye)

And when it runs:

in theory... Multiple plugins have been loaded
listing plugins:
say_hello
count_to_10
woop
say_goodbye
running plugins:
Plugin to Run = say_hello
Plugin does:
Hello
Plugin to Run = woop
Plugin does:
Wooop!
Plugin to Run = count_to_10
Plugin does:
0
1
2
3
4
5
6
7
8
9
10
Plugin to Run = say_goodbye
Plugin does:
Good Bye :D

This is good news and I really like the site; I will be using it a lot more as I learn more about Ruby, as it explains things really well, and it looks like if you can afford the $8/month you can get a lot more articles by the same guy at practicingruby.com

Summary

So in short… Sentinel will have plugins, I like the blogs at Ruby Best Practices, and this blog post will also be short :D

soimasysadmin.com

What’s in a name!

I took the decision today to set up soimasysadmin.com to point to this blog; there are a number of reasons for this:

  1. Looks cooler!
  2. Blog could be transferred at a later date
  3. I needed something to do

It came about a couple of days ago when I was looking at another WordPress-based site that was being hosted elsewhere. I do run my own servers at home, but I have home broadband and it probably isn’t as good as what WordPress can supply, so I made the decision to let this be hosted elsewhere too. However, having seen some nicely themed WordPress sites and the versatility of it as a platform, I’ve been quite impressed.

One of the other reasons for looking at this as an option is the ability to manage the content of the site at a greater level, such as hooking in Google Analytics or putting my own AdWords in place, all things to be considered for the future, and as I see this as a long-term game I’m better off making the change now.

I’m going to leave it a few weeks before I actually flip over but I wanted to get the domain out and about and make sure that I can get a couple of referrers updated to help with the transition, hopefully it won’t have a major impact, but who knows!

Originally when I started out I wasn’t sure how long I would keep this going but it’s become a bit of a dumping ground for good information that I’ve learnt over the years and hopefully it’s been useful to more than just myself so over the next year I’ll be thinking about and maybe playing with a few other ideas all of which are helped by having the domain in place.

Oh… I also updated my About page to now have a feedback form so you can use this to contact me rather than commenting if you so wish :)

AWS CopySnapshot – A regional DR backup

Finally!

After many months of talking with Amazon about better ways of getting backups from one region to another, they sneak in a sneaky little update on their blog. I will say it here: world-changing update! The ability to easily and readily sync your EBS data between regions is game changing, I kid you not; in my tests I synced 100GB from us-east-1 to us-west-1 so quickly it was done before I’d switched to that region to see it! However… sometimes it is a little slower… Thinking about it, it could have been a blank volume, I don’t really know :/

At Alfresco we do not heavily use EBS, as explained Here, when we survived a major Amazon issue that affected many larger websites than our own. We do still have EBS volumes, as it is almost impossible to get rid of them, and by their very nature the data on these EBS volumes is very precious, so obviously we want it backed up. A few weeks ago I started writing a backup script for EBS volumes; the script wasn’t well tested, it was quickly written, but it worked. I decided that I would very quickly (well, to be fair I spent ages on it) update the script with the new CopySnapshot feature.

At the time of writing, CopySnapshot exists in one place: the deepest, darkest place known to man, the Amazon Query API interface. This basically means that rather than simply making a method call you have to throw all the data at it and back again to make it hang together. For the real programmers out there this is just an inconvenience; for me it was a nightmare. It was an epic battle between my motivation, my understanding and my Google prowess; in short, I won.

It was never going to be easy…

In the past I have done some basic stuff with REST-type APIs: set some headers, put some variables in the params of the URL and just let it go, all very simple. Amazon’s was slightly more advanced, to say the least.

So I had to use this complicated encoded, parsed, encrypted and back-to-back handshake as described here; with that and the CopySnapshot docs I was ready to rock!

So after failing for an hour to even get my head around the authentication I decided to cheat and use Google. The biggest breakthrough was thanks to James Murty: the AWS class he has written is perfect. The only issue was my understanding of how to use modules in Ruby, which were very new to me. On a side note, I thought modules were meant to fix issues with namespaces, but for some reason even though I included the class in the script it seemed to conflict with the Ruby aws-sdk I already had, so I just had to rename the class / file from AWS to AWSAPI and all was then fine. I also had to add a parameter to pass in the AWS_ACCESS_KEY, which was a little annoying as I thought the class would have taken care of that, but to be fair it wasn’t hard to work out in the end.

So first things first, have a look at the AWS.rb file on the site; it does the whole signature-signing bit well and saves me the hassle of doing or thinking about it. On a side note, this all uses version 2 of the signing, which I imagine will be deprecated at some point as version 4 is out and about Here.

If you were proactive you’ve already read the CopySnapshot docs and noticed that, in plain English or otherwise, that page does not tell you how to copy between regions. I imagine it’s because I don’t know how to really use the tools, but it’s not clear to me… I had noticed that the wording they used was identical to the params being passed in the example, so I tried using Region, DestinationRegion and DestRegion, all failing, kind of as expected seeing as I was left to guess my way through. I was there, that point where you’ve had enough and it doesn’t look like it is ever going to work, so I started writing a support ticket for Amazon so they could point out whatever it was I was missing; at the moment of being just about to hit submit I had a brainwave. If the only option is to specify the source, then how do you know the destination? Well, I realised that each region has its own API url, so would that work as the destination? YES!

The journey was challenging, epic even, for this sysadmin to tackle, and yet here we are: a blog post about regional DR backups of EBS snapshots. So without further ado, and no more gilding the lily, I present some install notes and code…

Make it work

The first thing you will need to do is get the appropriate files, i.e. the AWS.rb from James Murty. Once you have this you will need to make the following changes:

21c21
< module AWS
---
> module AWSAPI

Next you will need to steal the code for the backup script:

#!/usr/bin/ruby

require 'rubygems'
require 'aws-sdk'
require 'uri'
require 'time'  # for Time#iso8601
require 'crack'

#Get options
ENV['AWS_ACCESS_KEY']=ARGV[0]
ENV['AWS_SECRET_KEY']=ARGV[1]
volumes_file=ARGV[2]
source_region=ARGV[3]
source_region ||= "us-east-1"

#Create a class for the aws module
class CopySnapshot
  #This allows me to initalize the module with out re-writing it
  require 'awsapi'
  include AWSAPI

end

def get_dest_url (region)
  case region
  when "us-east-1"
    url = "ec2.us-east-1.amazonaws.com"
  when "us-west-2"
    url = "ec2.us-west-2.amazonaws.com"
  when "us-west-1"
    url = "ec2.us-west-1.amazonaws.com"
  when "eu-west-1"
    url = "ec2.eu-west-1.amazonaws.com"
  when "ap-southeast-1"
    url = "ec2.ap-southeast-1.amazonaws.com"
  when "ap-southeast-2"
    url = "ec2.ap-southeast-2.amazonaws.com"
  when "ap-northeast-1"
    url = "ec2.ap-northeast-1.amazonaws.com"
  when "sa-east-1"
    url = "ec2.sa-east-1.amazonaws.com"
  end
  return url
end

def copy_to_region(description,dest_region,snapshotid, src_region)

  cs = CopySnapshot.new

  #Gen URL
  
  url= get_dest_url(dest_region)
  uri="https://#{url}"

  #Set up Params
  params = Hash.new
  params["Action"] = "CopySnapshot"
  params["Version"] = "2012-12-01"
  params["SignatureVersion"] = "2"
  params["Description"] = description
  params["SourceRegion"] = src_region
  params["SourceSnapshotId"] = snapshotid
  params["Timestamp"] = Time.now.iso8601(10)
  params["AWSAccessKeyId"] = ENV['AWS_ACCESS_KEY']

  resp = begin
    cs.do_query("POST",URI(uri),params)
  rescue Exception => e
    puts e.message
  end

  if resp.is_a?(Net::HTTPSuccess)
    response = Crack::XML.parse(resp.body)
    if response["CopySnapshotResponse"].has_key?('snapshotId')
      puts "Snapshot ID in #{dest_region} is #{response["CopySnapshotResponse"]["snapshotId"]}" 
    end
  else
    puts "Something went wrong: #{resp.class}"
  end
  
end

if File.exist?(volumes_file)
  puts "File found, loading content"
  #Fix contributed by Justin Smith: https://soimasysadmin.com/2013/01/09/aws-copysnapshot-a-regional-dr-backup/#comment-379
  ec2 = AWS::EC2.new(:access_key_id => ENV['AWS_ACCESS_KEY'], :secret_access_key=> ENV['AWS_SECRET_KEY']).regions[source_region]
  File.open(volumes_file, "r") do |fh|
    fh.each do |line|
      volume_id=line.split(',')[0].chomp
      volume_desc=line.split(',')[1].chomp
      # Reset so a destination region from a previous line doesn't leak into this one
      volume_dest_region=nil
      if line.split(',').size >2
        volume_dest_region=line.split(',')[2].to_s.chomp
      end
      puts "Volume ID = #{volume_id} Volume Description = #{volume_desc}"
      v = ec2.volumes["#{volume_id}"]
      if v.exists? 
        puts "creating snapshot"
        date = Time.now
        backup_string="Backup of #{volume_id} - #{date.day}-#{date.month}-#{date.year}"
        puts "#{backup_string}" 
        snapshot = v.create_snapshot(backup_string)
        sleep 1 until [:completed, :error].include?(snapshot.status)
        snapshot.tag("Name", :value =>"#{volume_desc} #{volume_id}")
        # if it should be backed up to another region do so now
        if !volume_dest_region.nil? 
          if !volume_dest_region.match(/\s/) ? true : false
            puts "Backing up to #{volume_dest_region}"
            puts "Snapshot ID = #{snapshot.id}"
            copy_to_region(volume_desc,volume_dest_region,snapshot.id,source_region)
          end
        end
      else
        puts "Volume #{volume_id} no longer exists"
      end
    end
  end
else
  puts "no file #{volumes_file}"
end

Once you have that you will need to create a file with the volumes in to back up, in the following format:

vol-127facd,My vol,us-west-1
vol-1ac123d,My vol2
vol-cd1245f,My vol3,us-west-2

The format is “volume id,description,region”, where the region is where you want to back up to. Once you have these details you just call the script as follows:

ruby ebs_snapshot.rb <Access key> <secret key> <volumes file>

I don’t recommend putting your keys on the CLI or even in a cron job, but it wouldn’t take much to re-factor this into a class if needed and if that bothers you.
It should work quite well; if anyone has any problems let me know and I’ll see what I can do :)

Sentinel update

Many moons ago…

A while back I started to mention the idea of Self healing systems: a dedicated system that makes use of monitoring and real-time system information to make intelligent decisions about what to do, i.e. I write a complicated program to gradually replace myself. It was suggested I use hooks in Nagios to do the tasks, but that misses the intelligence side of what I’m trying to get to; restarting based on Nagios checks is simply an if statement that does something on a certain condition. Sentinel will be more than that.

Back in April I started Sentinel as an open source project. As expected the uptake has been phenomenal: absolutely no one has even looked at it :) Either way I am not deterred. I have been on and off re-factoring Sentinel into something a bit more logical Here, and I have gone from 3 files to some 13, from 1411 words to 2906, and I even have one fully working unit test! I don’t think I’ll be writing more tests for now, as at the moment they are not really helping me get to where I want to be quickly, but I know I’ll need them at some point!

So far all I have done is split out some of the code to give it structure and added the odd bit here and there. The next thing I need to start doing is to make it better; there are a number of options:

  • Writing more providers for it so it can start to manage disks, memory etc etc so it’s a bit more useful
  • Sorting out the structure of the code adding in more error handling / logging and resilience
  • Integration with Nagios or some tool that already monitors system health and use that to base actions off of
  • Daemonize Sentinel so it runs like a real program!
  • Configuration file rather than CLI

What to do

I think for me I’d rather sort out the structure of the code and improve what is already there first; I’m in no rush with this, so the least I could do is make what I have less hacky. This also gives me the opportunity to start working out how I’d rather have the whole thing structured.

I did look at writing a plugin framework so it would be possible to just drop in a module or something similar and it would have the correct information about how to manage what ever it was written to do, but I figured that was a bit beyond me at this time and I have better things to do!

After that I think the configuration file and daemonizing the application; the main reason for this will be to identify any issues with it running continually, and any issues there would be nice to know about sooner rather than later.

This then leaves more providers and Nagios-type integration, which I’m sure will be fun.

Give it AI!

Once those items are done this leaves sentinel with one more thing to do, start intelligently working out solutions to problems, obviously I don’t know the right way to tackle this however I do have a few ideas though.

In my head… I think about how I would solve an issue, and inevitably it starts with gathering information about the system. But how do you know what information is relevant to which problems, and how much weighting should it have? Well, for starters I figure each provider would return a score for how healthy it thinks it is. So for example:

A provider for checking the site is available notices that it’s not available; this produces a score that is very high, say 10000. It then makes sure it’s got the latest information from all providers on the server. One of those providers is disk, which notices one of the volumes is 67% full, but the thresholds have been set to warn at 70% and 95%, so it sets a score of, say, 250, and that is ranked in a list somewhere to come back to if all else fails.

At this point it is unlikely that disk is the culprit; we have to assume that whoever set the thresholds knew something about the system, so more information is needed. It checks the local network and gets back a score of 0: as far as the network provider can tell it’s working fine, it can get to localhost, the gateway and another gateway on the internet. A good test at this point is to try and work out which layer of the OSI model the issue is at, so one of the actions might be to connect to port 80 or 443 or both and see what happens. Is there a web response or not? If there is, does it have any words in it, or a response code that suggests it’s a known web error like a 500, or does the connection fail entirely?

And so on and so forth. This would mean that wherever this logic exists it has to make associations between results and the following actions. One of the ways to do this is to “tag” a provider with the potential subsystems that could affect it, then, based on the score of each of those subsystems, produce a vector of potential areas to check; combined with the score it’s possible to travel the vector and work out how likely each is to fix the issue, and as each one produces a result it either dives into a new vector, more detailed or not. It would then, in theory, be possible to start making correlations between these subsystems: say the web provider requires disk and networking to be available, and both networking and disk require CPU, then it can assume the web provider needs CPU too, and based on how many of these connections exist it can rank it higher or lower, much in the same way a search engine would work.

But all of this is for another day, today is just about saying it’s started and I hope to continue on it this year.