Applying AMPs to Alfresco

A bit of background

Alfresco can be extended in a nice, easy way through the use of Alfresco Module Packages (AMPs). In essence, an AMP is the delta of changes you would have made to the raw source code to apply a customisation, whether your own or one of the modules Alfresco supplies, like the S3 connector.

Over the last three years I’ve seen AMPs applied in a number of ways: running the mmt tool manually, wrapping it in a shell script, and so on. I decided that wasn’t good enough.

Using the mmt tool manually is obviously not brilliant: some poor person has to sit there and run it for every amp, so you may have guessed this is not a good idea.
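
For reference, applying an amp by hand looks roughly like this (the paths and amp name are illustrative; they are the same invocations the script further down uses):

java -jar /var/lib/tomcat6/bin/alfresco-mmt.jar list /var/lib/tomcat6/webapps/alfresco.war
java -jar /var/lib/tomcat6/bin/alfresco-mmt.jar install /var/lib/alfresco/alf_data/amps/my-module.amp /var/lib/tomcat6/webapps/alfresco.war -nobackup -force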

What about wrapping the mmt tool in a shell script that can be triggered by, say, a sysadmin to apply all the amps, or executed once per amp by a configuration management tool like Puppet? This is good: you put the amp into the configuration management tool, push the right buttons and it magically gets applied to the war files, and all is well. Well, sort of. What happens if someone just throws an amp on the server? Who puts it in configuration management? Who’s made a backup? So I decided to write a new script for applying amps that can be used both from a CM tool and as an ad-hoc script.

What does it do?

I’ve written it so it will trawl through a directory, pick up every amp in it and apply each one to the alfresco or share war as needed. What’s quite handy is that it will take several versions of an amp and work out which is the latest, check that against what is already installed in the war, and, if the amp is a newer version, make a backup and then apply it.

For some odd reason I also made it cope with a variety of amp naming schemes, so you could upload alfresco-bob-123.amp or frank-share-super-1.2.3.4-5.amp; it’s your amp, call it what you want. All the script cares about is the correlation between the words in the file name and the amp info reported once it’s installed. As long as at least two words from the file name also appear in the installed amp’s description it will work it out for you; the higher the correlation the more accurate it will be. This is configurable, but I set it to require at least two words each occurring at least twice, and so far… it’s working.

I almost forgot to mention that the script will also stop Alfresco, clear the caches and restart it for you in a pretty safe way.

A Script

Firstly, I realise a blog post is a bad format for sharing a script; in the past I’ve put them in a git repo and shared them that way. This one is in a git repo too, along with some of the other things we have done at Alfresco that are useful for running servers in general or for running Alfresco, and I hope to get that out as a public repo shortly. For now, here it is:

#!/usr/bin/ruby
# Require libs
$:.unshift File.expand_path("../", __FILE__)
require 'lib/logging'
require 'fileutils'
require 'timeout'

# Set up logging provider
Logging::new('alfresco_apply_amps')
Logging.log_level("INFO",false)
$log.info "Starting"

#CONSTANTS
BKUP_LOCATION="/var/lib/alfresco/alf_data/backups"
ALF_MMT="/var/lib/tomcat6/bin/alfresco-mmt.jar"
WEBAPPS_DIR="/var/lib/tomcat6/webapps"
AMP_LOCATIONS=["/var/lib/alfresco/alf_data/amps/"]

#Defaults
@restart=false

#Methods
def available_amps(amp_dir)
  #Get a list of Amps
  amps_list = `ls #{amp_dir}*`
  amps_array = amps_list.split("\n")
end

def backup(war)
  version=`/usr/bin/java -jar #{ALF_MMT} list #{WEBAPPS_DIR}/#{war}.war | grep Version | awk '{print $3}'`

  #Date stamp the war
  $log.info "Backing up #{WEBAPPS_DIR}/#{war}.war to #{BKUP_LOCATION}/#{war}-#{current_date}.war"
  `cp -a #{WEBAPPS_DIR}/#{war}.war #{BKUP_LOCATION}/#{war}-#{current_date}.war`
end

def clear_caches()
  $log.debug "Cleaning caches"  
  delete_dir("#{WEBAPPS_DIR}/alfresco/")
  delete_dir("#{WEBAPPS_DIR}/share/")
  delete_dir('/var/cache/tomcat6/work/',true)
  delete_dir('/var/cache/tomcat6/temp/',true)
  $log.info "Caches cleaned"  
end

def compare_strings(str1,str2,options={})
  matches = options[:matches] || 2
  frequency = options[:frequency] || 2

  #Make one array of words
  words=Array.new
  words << str1.split(' ') << str2.split(' ')
  words.flatten!
  #Hash to store each unique key in and number of occurances
  keys = Hash.new
  words.each do |key|
    if keys.has_key?(key)
      keys[key] +=1
    else
      keys.merge!({key =>1})
    end
  end

  #Now we have a Hash of keys with counts, work out how many matches there are and at what frequency,
  #where a match is a key that appears at least 'frequency' times and 'matches' is how many such keys are needed,
  #i.e. matches=7, frequency=3 means 7 keys must each appear at least 3 times
  
  act_matches=0
  keys.each_pair do |key,value|
    if value >= frequency
      act_matches +=1
    end
  end
  if act_matches >= matches
    true
  else
    false
  end
end

def compare_versions(ver1,ver2)
  #return largest
  v2_maj=0
  v2_min=0
  v2_tiny=0
  v2_release=0
  v1_maj=0
  v1_min=0
  v1_tiny=0
  v1_release=0
  if ver1 =~ /\./ && ver2 =~ /\./
    #both are dotted notation
    #Compare maj -> release

    #Convert '-' to '.'
    ver1.gsub!(/-/,'.')
    ver2.gsub!(/-/,'.')

    #Compare the components numerically rather than as strings
    v1_maj = ver1.split('.')[0].to_i
    v1_min = (ver1.split('.')[1] || 0).to_i
    v1_tiny = (ver1.split('.')[2] || 0).to_i
    v1_release = (ver1.split('.')[3] || 0).to_i

    v2_maj = ver2.split('.')[0].to_i
    v2_min = (ver2.split('.')[1] || 0).to_i
    v2_tiny = (ver2.split('.')[2] || 0).to_i
    v2_release = (ver2.split('.')[3] || 0).to_i

    if v1_maj > v2_maj
      return ver1
    elsif v1_maj == v2_maj && v1_min > v2_min
      return ver1
    elsif v1_maj == v2_maj && v1_min == v2_min && v1_tiny > v2_tiny
      return ver1
    #Don't compare release for now as some amps don't put the release in the amp when installed so you end up re-installing
    #elsif v1_release > v2_release
    #  return ver1
    else
      return ver2
    end
  else
    #Validate both are not-dotted
    if ver1 =~ /\./ || ver2 =~ /\./
      $log.debug "Eiher both types aren't the same or there's only one amp"
      return ver2
    else
      result = ver1<=>ver2
      if result.to_i > 0 && !result.nil?
        return ver1
      else
        return ver2
      end
    end
  end
end

def current_date()
  #Date stamp in YYYYMMDD format
  Time.now.strftime("%Y%m%d")
end

def current_version(app, amp_name)

#
# THIS needs to cope with multiple amps being installed, and produce an array of hashes [{:amp=>"ampname",:version => ver},etc]
#

  if app == "alfresco" || app == "share"
    amp_info = `/usr/bin/java -jar #{ALF_MMT} list #{WEBAPPS_DIR}/#{app}.war`
    amp_title=""
    amp_ver=0
    #$log.debug "Amp info: #{amp_info}"
    amp_info.each_line do |line|
      if line =~ /Title/
        amp_title=line.split("Title:").last.strip.gsub(%r/(-|_|\.)/,' ')
      elsif line =~ /Version/
        # strip/replace ampname, downcase etc
        if compare_strings(amp_name.gsub(%r/(-|_|\.)/,' ').downcase,amp_title.downcase)
          amp_ver=line.split("Version:").last.strip
          $log.info "Installed Amp found for #{amp_name}"
          $log.debug "Installed version: #{amp_ver}"
        else
          $log.debug "No installed amp for #{amp_name} for #{app}"
        end
      end
    end
  else
    $log.warn "The application #{app} can not be found in #{WEBAPPS_DIR}/"
  end
  return amp_ver
end

def delete_dir (path,contents_only=false)
  begin
    if (contents_only)
      $log.debug "Removing #{path}*"
      FileUtils.rm_rf Dir.glob(path+"*")
    else
      $log.debug "Removing #{path}"
      FileUtils.rm_rf path
    end
  rescue Errno::ENOENT
    $log.warn "#{path} Does not exist"
  rescue Errno::EACCES
    $log.warn "No permissions to delete #{path}"
  rescue
    $log.warn "Something went wrong"
  end
end

def firewall(block=false)
  if block
    `/sbin/iptables -I INPUT -m state --state NEW -m tcp -p tcp  --dport 8080 -j DROP`
  else
    `/sbin/iptables -D INPUT -m state --state NEW -m tcp -p tcp  --dport 8080 -j DROP`
  end
end

def get_amp_details(amps)
  amps_hash = Hash.new
  amps.each do |amp|
    amp_hash = Hash.new
    #Return hash with unique amps with just the latest version
    amp_filename = amp.split("/").last
    amp_path = amp
    amp_name=""
    amp_version=""
    first_name=true
    first_ver=true
    #Remove the ".amp" extension and loop through
    amp_filename[0..-5].split("-").each do |comp|
      pos = comp =~ /\d/
      if pos == 0
        if first_ver
          amp_version << comp
          first_ver=false
        else
          #By commenting this out the release gets ignored, which is probably safest as some amps don't put the release in their installed version
          #amp_version << "-" << comp
        end
      else
        if first_name
          amp_name << comp.downcase
          first_name=false
        else
          amp_name << "_" << comp.downcase
        end
      end
    end

    #If a key of amp name exists, merge the version down hash else merge the lot
    if amps_hash.has_key?(amp_name)
      amp_hash={amp_version => {:path => amp_path, :filename => amp_filename}}
      amps_hash[amp_name].merge!(amp_hash)
    else 
      amp_hash={amp_name =>{amp_version => {:path => amp_path, :filename => amp_filename}}}
      amps_hash.merge!(amp_hash)
    end
  end
  return amps_hash
end

def install_amp(app, amp)
  $log.info "applying amp to #{app}"
  $log.warn "amp path must be passed!" unless !amp.nil?

  $log.debug "Command to install = /usr/bin/java -jar #{ALF_MMT} install #{amp} #{WEBAPPS_DIR}/#{app}.war -nobackup -force"
  `/usr/bin/java -jar #{ALF_MMT} install #{amp} #{WEBAPPS_DIR}/#{app}.war -nobackup -force`
  $log.debug "Setting flag to restart tomcat"
  restart_tomcat?(true)
end

def latest_amps(amp_hash)
  amp_hash.each_pair do |amp,amp_vers|
    latest_amp_ver=0
    $log.debug "Comparing versions for #{amp}"
    amp_vers.each_key do |version|
      $log.debug "Comparing #{latest_amp_ver} with #{version}"
      latest_amp_ver = compare_versions(latest_amp_ver,version)
      $log.info "Latest version for #{amp}: #{latest_amp_ver}"
      if latest_amp_ver != version
        amp_vers.delete(version)
      end
    end
  end
  return amp_hash
end

def next_version?(ver, current_ver, app)
  #Work out whether the available amp version is newer than the installed one
  next_amp=false
  $log.debug "if #{ver} > #{current_ver}"
  #Compare the versions component by component rather than just their integer value
  ver_parts = ver.to_s.split('.').map { |part| part.to_i }
  cur_parts = current_ver.to_s.split('.').map { |part| part.to_i }
  if ((ver_parts <=> cur_parts) || 0) > 0
    $log.debug "Next #{app} amp version to be applied:  #{ver}"
    next_amp=true
  end
  next_amp
end

def restart_tomcat()
  #If an amp was applied restart
  if (restart_tomcat?)
    $log.info "Restarting Tomcat.... this may take some time"
    $log.debug"Getting pid"
    if (File.exists?('/var/run/tomcat6.pid') )
      pid=File.read('/var/run/tomcat6.pid').to_i
      $log.debug "Killing Tomcat PID= #{pid}"
      begin
        Process.kill("KILL",pid)
        Timeout::timeout(30) do
          begin
            sleep 5
            $log.debug "Sleeping for 5 seconds..."
          end while !!(`ps -p #{pid}`.match pid.to_s)
        end
      rescue Timeout::Error
        $log.debug "didn't kill process in 30 seconds"
      end
    end
    $log.debug "Killed tomcat"

    #Clear caches
    clear_caches
    $log.info "blocking firewall access"
    firewall(true)
    $log.debug "starting tomcat"
    `/sbin/service tomcat6 start`
    if ($?.exitstatus != 0)
      $log.debug "Tomcat6 service failed to start, exitstatus = #{$?.exitstatus}"
    else
      #Tomcat is starting sleep until it has started
      #For now sleep for 180 seconds
      $log.info "Sleeping for 180 seconds"
      sleep 180
      $log.info "un-blocking firewall access"
      firewall(false)
    end
  else
    $log.info "No new amps to be installed"
  end
end

def restart_tomcat?(bool=nil)
  @restart = bool unless bool.nil?
  #$log.debug "Restart tomcat = #{@restart}"
  return @restart
end

# - Methods End

#
# doGreatWork()
#

#Store an Hash of amps
amps=Hash.new

#For each AMP_LOCATIONS find the latest Amps
AMP_LOCATIONS.each do |amp_loc|
  $log.debug "Looking in #{amp_loc} for amps"
  amps.merge!(get_amp_details(available_amps(amp_loc)))
end

#Sort through the array and return only the latest versions of each amp
latest_amps(amps)

amps.each do |amp, details|
  #The Amps in here are the latest of their kind available so check with what is installed
  details.each_pair do |version,value|
    if amp =~ /share/
      if next_version?(version,current_version("share",amp),"share")
        $log.debug "Backing up share war"
        backup("share")
        $log.info "Installing #{amp} (#{version}): #{value[:path]}"
        install_amp("share",value[:path])
      else
        $log.info "No update needed"
      end
    else
      if next_version?(version,current_version("alfresco",amp),"alfresco")
        $log.debug "Backing up alfresco war"
        backup("alfresco")
        $log.info "Installing #{amp} (#{version}): #{value[:path]}"
        install_amp("alfresco",value[:path])
      else
        $log.info "No update needed"
      end
    end
  end
end

$log.debug "Restart tomcat?: #{restart_tomcat?}"
restart_tomcat

$log.info "All done for now"

Okay, two things: it’s a long script, all in one file, to make it easy to transport, and I’ve used a logging class that enables logging to screen / file, which is below :) You could also just remove the require at the top and replace “$log.debug” with “puts”; up to you.

#
#   Set up Logging
#

require 'rubygems'
require 'log4r'

class Logging

  def initialize(log_name,log_location="/var/log/")
    # Create a logger named 'log' that logs to stdout
    $log = Log4r::Logger.new log_name

    # Open a new file logger and ask it not to truncate the file before opening.
    # FileOutputter.new(nameofoutputter, Hash containing(filename, trunc))
    file = Log4r::FileOutputter.new('fileOutputter', :filename => "#{log_location}#{log_name}.log",:trunc => false)

    # You can add as many outputters as you want. You can add them using a reference
    # or by the name given when they were created
    $log.add(file)
    # or mylog.add(fileOutputter) : name we have given.

    # Only messages greater than or equal to the configured log level will show.
    # The order is
    # DEBUG < INFO < WARN < ERROR < FATAL

    # specify the format for the message.
    format = Log4r::PatternFormatter.new(:pattern => "[%l] %d: %m")

    # Add the formatter to the outputter, not to the logger.
    # So it's like this: you add outputters to the logger, and add formatters to the outputters.
    # As we haven't added this formatter to the stdout outputter, log messages at
    # stdout will be plain, but the log messages in the file will be formatted
    file.formatter = format
    
  end

  def self.log_level(lvl,verbose=false)
    # You can use any Outputter here.
    $log.outputters = Log4r::Outputter.stdout if verbose

    # Log level order is DEBUG < INFO < WARN < ERROR < FATAL
    case lvl
        when    "DEBUG"
            $log.level = Log4r::DEBUG
        when    "INFO"
            $log.level = Log4r::INFO
        when    "WARN"
            $log.level = Log4r::WARN
        when    "ERROR"
            $log.level = Log4r::ERROR
        when    "FATAL"
            $log.level = Log4r::FATAL
        else
             print "You provided an invalid option: #{lvl}"
    end
  end

end

I hope this helps people out; if there are any issues just leave a comment and I’ll help :)

AWS CopySnapshot – A regional DR backup

Finally!

After many months of talking with Amazon about better ways of getting backups from one region to another, they sneaked in a little update on their blog. I will say it here: world-changing update! The ability to easily and readily sync your EBS data between regions is game changing, I kid you not. In my tests I copied 100GB from us-east-1 to us-west-1 so quickly it was done before I’d switched to that region to see it! However… sometimes it is a little slower… Thinking about it, it could have been a blank volume, I don’t really know :/

At Alfresco we do not use EBS heavily, as explained Here, when we survived a major Amazon issue that affected many websites larger than our own. We do still have some EBS volumes, as it is almost impossible to get rid of them entirely, and by their very nature the data on these volumes is precious, so obviously we want it backed up. A few weeks ago I started writing a backup script for EBS volumes; the script wasn’t well tested and was quickly written, but it worked. I decided that I would very quickly (well, to be fair, I spent ages on it) update the script to use the new CopySnapshot feature.

At the time of writing, CopySnapshot exists in one place: the deepest, darkest place known to man, the Amazon Query API interface. This basically means that rather than simply making a method call you have to construct the whole request yourself to make it hang together. For the real programmers out there this is just an inconvenience; for me it was a nightmare, an epic battle between my motivation, my understanding and my Google prowess. In short, I won.

It was never going to be easy…

In the past I have done some basic stuff with REST-type APIs: set a header, put a variable in the URL params and just let it go, all very simple. Amazon’s is slightly more advanced, to say the least.

So I had to use the complicated encoding, parsing and signing handshake described here; with that and the CopySnapshot docs I was ready to rock!

After failing for an hour to even get my head around the authentication I decided to cheat and use Google. The biggest breakthrough was thanks to James Murty: the AWS class he has written is perfect. The only issue was my understanding of how to use modules in Ruby, which were very new to me. On a side note, I thought modules were meant to fix namespace issues, but for some reason even though I included the class in my script it seemed to conflict with the Ruby aws-sdk I already had, so I had to rename the class / file from AWS to AWSAPI and all was then fine. I also had to add a parameter to pass in the AWS_ACCESS_KEY, which was a little annoying as I thought the class would have taken care of that, but to be fair it wasn’t hard to work out in the end.

First things first: have a look at the AWS.rb file on his site. It does the whole request-signing bit well and saves me the hassle of doing or thinking about it. On a side note, this all uses version 2 of the signing, which I imagine will be deprecated at some point as version 4 is out and about Here.
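
For the curious, the gist of version 2 signing is: build a canonical query string from the sorted request parameters, sign it with HMAC-SHA256 using your secret key, and send the result as the Signature parameter. A rough Ruby sketch of the idea follows; AWS.rb does the real work (including strict RFC 3986 encoding of the values), so treat this as illustrative only.

require 'openssl'
require 'base64'
require 'cgi'

# Sketch of AWS Query API SignatureVersion 2 signing
def sign_v2(verb, host, path, params, secret_key)
  encode = lambda { |s| CGI.escape(s.to_s).gsub('+', '%20') }
  canonical = params.sort.map { |k, v| "#{encode.call(k)}=#{encode.call(v)}" }.join('&')
  string_to_sign = [verb, host.downcase, path, canonical].join("\n")
  Base64.encode64(OpenSSL::HMAC.digest(OpenSSL::Digest.new('sha256'), secret_key, string_to_sign)).chomp
end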

If you were proactive you’ve already read the CopySnapshot docs and noticed that, in plain English or otherwise, that page does not tell you how to copy between regions. Maybe it’s because I don’t know how to really use the tools, but it’s not clear to me… I had noticed that the wording they used was identical to the params being passed in the example, so I tried Region, DestinationRegion and DestRegion, all failing, kind of as expected seeing as I was left to guess my way through. I was at that point where you’ve had enough and it doesn’t look like it is ever going to work, so I started writing a support ticket for Amazon so they could point out whatever it was I was missing. Just as I was about to hit submit I had a brainwave: if the only option is to specify the source, then how do you know the destination? Well, I realised that each region has its own API URL, so would that work as the destination? YES!

The journey was challenging, epic even, for this sysadmin to tackle, and yet here we are: a blog post about regional DR backups of EBS snapshots. So without further ado, and no more gilding the lily, I present some install notes and code…

Make it work

The first thing you will need to do is get the appropriate file, the AWS.rb from James Murty. Once you have it you will need to make the following change:

21c21
< module AWS
---
> module AWSAPI

Next you will need to steal the code for the backup script:

#!/usr/bin/ruby

require 'rubygems'
require 'aws-sdk'
require 'uri'
require 'time' # needed for Time#iso8601 further down
require 'crack'

#Get options
ENV['AWS_ACCESS_KEY']=ARGV[0]
ENV['AWS_SECRET_KEY']=ARGV[1]
volumes_file=ARGV[2]
source_region=ARGV[3]
source_region ||= "us-east-1"

#Create a class for the aws module
class CopySnapshot
  #This allows me to initialize the module without re-writing it
  require 'awsapi'
  include AWSAPI

end

def get_dest_url (region)
  case region
  when "us-east-1"
    url = "ec2.us-east-1.amazonaws.com"
  when "us-west-2"
    url = "ec2.us-west-2.amazonaws.com"
  when "us-west-1"
    url = "ec2.us-west-1.amazonaws.com"
  when "eu-west-1"
    url = "ec2.eu-west-1.amazonaws.com"
  when "ap-southeast-1"
    url = "ec2.ap-southeast-1.amazonaws.com"
  when "ap-southeast-2"
    url = "ec2.ap-southeast-2.amazonaws.com"
  when "ap-northeast-1"
    url = "ec2.ap-northeast-1.amazonaws.com"
  when "sa-east-1"
    url = "ec2.sa-east-1.amazonaws.com"
  end
  return url
end

def copy_to_region(description,dest_region,snapshotid, src_region)

  cs = CopySnapshot.new

  #Gen URL
  
  url= get_dest_url(dest_region)
  uri="https://#{url}"

  #Set up Params
  params = Hash.new
  params["Action"] = "CopySnapshot"
  params["Version"] = "2012-12-01"
  params["SignatureVersion"] = "2"
  params["Description"] = description
  params["SourceRegion"] = src_region
  params["SourceSnapshotId"] = snapshotid
  params["Timestamp"] = Time.now.iso8601(10)
  params["AWSAccessKeyId"] = ENV['AWS_ACCESS_KEY']

  resp = begin
    cs.do_query("POST",URI(uri),params)
  rescue Exception => e
    puts e.message
  end

  if resp.is_a?(Net::HTTPSuccess)
    response = Crack::XML.parse(resp.body)
    if response["CopySnapshotResponse"].has_key?('snapshotId')
      puts "Snapshot ID in #{dest_region} is #{response["CopySnapshotResponse"]["snapshotId"]}" 
    end
  else
    puts "Something went wrong: #{resp.class}"
  end
  
end

if File.exist?(volumes_file)
  puts "File found, loading content"
  #Fix contributed by Justin Smith: https://soimasysadmin.com/2013/01/09/aws-copysnapshot-a-regional-dr-backup/#comment-379
  ec2 = AWS::EC2.new(:access_key_id => ENV['AWS_ACCESS_KEY'], :secret_access_key=> ENV['AWS_SECRET_KEY']).regions[source_region]
  File.open(volumes_file, "r") do |fh|
    fh.each do |line|
      volume_id=line.split(',')[0].chomp
      volume_desc=line.split(',')[1].chomp
      #Reset the destination region for each line so a previous line's value doesn't leak through
      volume_dest_region=nil
      if line.split(',').size >2
        volume_dest_region=line.split(',')[2].to_s.chomp
      end
      puts "Volume ID = #{volume_id} Volume Description = #{volume_desc}"
      v = ec2.volumes["#{volume_id}"]
      if v.exists? 
        puts "creating snapshot"
        date = Time.now
        backup_string="Backup of #{volume_id} - #{date.day}-#{date.month}-#{date.year}"
        puts "#{backup_string}" 
        snapshot = v.create_snapshot(backup_string)
        sleep 1 until [:completed, :error].include?(snapshot.status)
        snapshot.tag("Name", :value =>"#{volume_desc} #{volume_id}")
        # if it should be backed up to another region do so now
        if !volume_dest_region.nil? 
          unless volume_dest_region.match(/\s/)
            puts "Backing up to #{volume_dest_region}"
            puts "Snapshot ID = #{snapshot.id}"
            copy_to_region(volume_desc,volume_dest_region,snapshot.id,source_region)
          end
        end
      else
        puts "Volume #{volume_id} no longer exists"
      end
    end
  end
else
  puts "no file #{volumes_file}"
end

Once you have that you will need to create a file listing the volumes to back up, in the following format:

vol-127facd,My vol,us-west-1
vol-1ac123d,My vol2
vol-cd1245f,My vol3,us-west-2

The format is “volume id,description,region”, where the region is the region you want to back up to. Once you have these details you just run the script as follows:

ruby ebs_snapshot.rb <Access key> <secret key> <volumes file>

I don’t recommend putting your keys on the CLI or even in a cron job, but it wouldn’t take much to refactor this into a class if needed and if you were bothered about that.
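
For example, something as simple as this keeps the keys off the command line (the path and file format here are entirely illustrative):

# Read "access_key secret_key" from a root-readable file instead of ARGV
creds = File.read('/etc/ebs_backup/credentials').split
ENV['AWS_ACCESS_KEY'], ENV['AWS_SECRET_KEY'] = creds[0], creds[1]
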
It should work quite well; if anyone has any problems let me know and I’ll see what I can do :)

Updated Alfresco Solr Checks

As some may know…

A little while back I put up some checks for Alfresco Solr Here and wrote a little blog Here

Well, over the last few weeks I have added yet more checks to it, and I’ve also added some caching of the results, so it will no longer make a separate request to Solr for each check; instead it will use a local cached copy of the results and fetch a new one after 5 minutes. The reason for this is that most of the results don’t change that frequently, and with Nagios calling each check that meant around 20 calls to Solr over a 5 minute period, even though each individual check is only verified once every 5 minutes. Now it will pull the report once and reference that cached copy for 5 minutes; after that it will simply pull a new one…
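
The caching itself is nothing clever; the idea is roughly this (a sketch only, the file location and TTL are illustrative):

require 'open-uri'

CACHE_FILE = '/tmp/solr_summary.json' # illustrative location
CACHE_TTL  = 300                      # seconds

# Return the SUMMARY report, refreshing the cached copy when it is stale
def cached_summary(url)
  if File.exist?(CACHE_FILE) && (Time.now - File.mtime(CACHE_FILE)) < CACHE_TTL
    File.read(CACHE_FILE)
  else
    body = open(url).read
    File.open(CACHE_FILE, 'w') { |f| f.write(body) }
    body
  end
end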

In addition to the caching it now has 13 new checks, including cumulative hit ratios, which are typically more relevant than the normal hit ratios as they are based over all time (since reboot); and no, I don’t know what period the normal hit ratios are based over.

There are also checks for the number of transactions remaining and the number of change sets remaining; these, combined with the lag, can give you an indication of how far behind Solr is and how much work it has left to do, so quite useful.

If you need any help with these, or have a few additional checks that are relevant, let me know; I’m happy to help.

Monitoring Alfresco Solr

It’s not just about numbers

Up until recently, if you wanted to monitor Alfresco’s Solr usage you had to either make a costly call to the stats page or use the summary report, which only really gave you a lag number. Luckily, because Alfresco have extended Solr, they have changed the Summary report to provide some really useful information which can then be tracked via Nagios or whatever your favourite monitoring tool is.

Firstly, it’s worth reading the Wiki as it explains the variables better than I would. It’s also worth mentioning that my preferred way of programmatically accessing this page is via JSON, like so:

 
http://localhost:8080/solr/admin/cores?action=SUMMARY&wt=json

It’s worth mentioning that, depending on the JSON parsing library you are using, you can get some fatal parsing errors caused by the hit ratio. For what it’s worth I found Crack to be good; it doesn’t validate the JSON as heavily as the raw json library does, which means you can pull back all the data even if there is a problem with the hit ratios.

On that subject, before the relevant cache has been hit the hit ratio will display “NaN” (Not a Number); once it has been hit it will display the appropriate number, which I’ll dive into a bit more later.
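
If your check treats the value as a number, a small guard for that case saves some grief (just a sketch; raw stands for whatever value you pulled out of the report):

# Treat "NaN" as "no data yet" rather than as a number
hitratio = raw.to_s == "NaN" ? nil : raw.to_f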

So before getting into the nitty-gritty service checks, it’s important to have a good understanding of the numbers. Most of them are straightforward; the only ones that confused me were the hit ratios.

The hit ratio is a number between 0 and 1: when the number is greater than, say, 0.3 all is well; less than 0.3 and things could be bad. However, when the lookup count is less than, say, 100, you would expect the hit ratio to be low, as the cache is not being hit enough to give a meaningful ratio. Other than the hit ratios the rest are pretty straightforward.

Some code

It’s probably worth sharing the class I’m using to access and return the Solr information; that way, if you want to write your own Nagios checks you can just copy / paste.

Firstly, the class that gets all the Solr information:

#
# Solr Metric gatherer

require 'rubygems'
require "crack"
require 'open-uri'

class SolrDAO

  def initialize (url)
    @solr_hash = get_metrics(url)
  end

  def get_lag(index)
    lag = @solr_hash["Summary"][index]["TX Lag"]
    regex= Regexp.new(/\d*/)
    lag_number = regex.match(lag)
    return lag_number
  end

  def get_alfresco_node_in_index(index)
    return @solr_hash["Summary"][index]["Alfresco Nodes in Index"]
  end

  def get_num_docs(index)
    return @solr_hash["Summary"][index]["Searcher"]["numDocs"]
  end

  def get_alfresco_avgTimePerRequest(index)
    return @solr_hash["Summary"][index]["/alfresco"]["avgTimePerRequest"]
  end

  def get_afts_avgTimePerRequest(index)
    return @solr_hash["Summary"][index]["/afts"]["avgTimePerRequest"]
  end

  def get_cmis_avgTimePerRequest(index)
    return @solr_hash["Summary"][index]["/cmis"]["avgTimePerRequest"]
  end

  def get_mean_doc_transformation_time(index)
    return @solr_hash["Summary"][index]["Doc Transformation time (ms)"]["Mean"]
  end

  def get_queryResultCache_lookups(index)
    return @solr_hash["Summary"][index]["/queryResultCache"]["lookups"]
  end

  def get_queryResultCache_hitratio(index)
    return @solr_hash["Summary"][index]["/queryResultCache"]["hitratio"]
  end

  def get_filterCache_lookups(index)
    return @solr_hash["Summary"][index]["/filterCache"]["lookups"]
  end

  def get_filterCache_hitratio(index)
    return @solr_hash["Summary"][index]["/filterCache"]["hitratio"]
  end

  def get_alfrescoPathCache_lookups(index)
    return @solr_hash["Summary"][index]["/alfrescoPathCache"]["lookups"]
  end

  def get_alfrescoPathCache_hitratio(index)
    return @solr_hash["Summary"][index]["/alfrescoPathCache"]["hitratio"]
  end

  def get_alfrescoAuthorityCache_lookups(index)
    return @solr_hash["Summary"][index]["/alfrescoAuthorityCache"]["lookups"]
  end

  def get_alfrescoAuthorityCache_hitratio(index)
    return @solr_hash["Summary"][index]["/alfrescoAuthorityCache"]["hitratio"]
  end

  def get_queryResultCache_warmupTime(index)
    return @solr_hash["Summary"][index]["/queryResultCache"]["warmupTime"]
  end

  def get_filterCache_warmupTime(index)
    return @solr_hash["Summary"][index]["/filterCache"]["warmupTime"]
  end

  def get_alfrescoPathCache_warmupTime(index)
    return @solr_hash["Summary"][index]["/alfrescoPathCache"]["warmupTime"]
  end

  def get_alfrescoAuthorityCache_warmupTime(index)
    return @solr_hash["Summary"][index]["/alfrescoAuthorityCache"]["warmupTime"]
  end
  
  private
  def get_metrics(url)
    url += "&wt=json"
    response = open(url).read
    # Convert to hash
    result_hash = {}
    result_hash = Crack::JSON.parse(response)
    # if the hash has 'Error' as a key, we raise an error
    if result_hash.has_key? 'Error'
      raise "web service error"
    end
    return result_hash
  end

end # End of class

As you can see, it is quite straightforward to extend this if you want to pull back different metrics. At some point I will put this into a GitHub repo for people, or use it in another metrics-based project, but for now just use this.
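
For example, a getter for another field from the summary report follows the same pattern (the key name here is illustrative and may differ between Alfresco versions):

  def get_approx_transactions_remaining(index)
    return @solr_hash["Summary"][index]["Approx transactions remaining"]
  end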

Now, some of you may not be used to using Ruby, so here is a check that checks the filterCache hit ratio:

#!/usr/bin/ruby
$:.unshift File.expand_path("../", __FILE__)
require 'lib/solr_dao.rb'
solr_results=SolrDAO.new("http://localhost:8080/solr/admin/cores?action=SUMMARY")
hitratio=solr_results.get_filterCache_hitratio("alfresco").to_f
lookups=solr_results.get_filterCache_lookups("alfresco").to_i
#Hit ratio is an inverse, 1.0 is perfect 0.1 is crap, and can be ignored if there is less than 100 lookups
inverse=(1.0-hitratio)
critical=0.8
warning=0.7
if (inverse.is_a? Float)
  if ( lookups >= 100 )
    if ( inverse >= warning )
      if (inverse >= critical )
        puts "CRITICAL :: FilterCache hitratio is #{hitratio}|'hitratio'=#{hitratio};#{warning};#{critical};;"
        exit 2
      else
        puts "WARNING :: FilterCache hitratio is #{hitratio}|'hitratio'=#{hitratio};#{warning};#{critical};;"
        exit 1
      end
    else
      puts "OK :: FilterCache hitratio is #{hitratio}|'hitratio'=#{hitratio};#{warning};#{critical};;"
      exit 0
    end
  else
    puts "OK :: FilterCache hitratio is #{hitratio}|'hitratio'=#{hitratio};#{warning};#{critical};;"
    exit 0
  end
else
  puts "UNKNOWN :: FilterCache hitratio is #{hitratio}"
  exit 3
end

To get this to work you’ll just need to put it with your other Nagios checks, and in the same directory create a lib directory containing the SolrDAO class from further up (as solr_dao.rb). If you need to change its location you will only need to adjust the following:


$:.unshift File.expand_path("../", __FILE__)
require 'lib/solr_dao.rb'

Also, if you wanted to, you could modify the script to take the critical and warning thresholds as params so you can easily change them within Nagios.
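
Something along these lines would do it (just a sketch, keeping the current values as defaults):

# e.g. ./check_filtercache_hitratio.rb 0.7 0.8
warning  = (ARGV[0] || 0.7).to_f
critical = (ARGV[1] || 0.8).to_f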

This time, We survived the AWS outage

Another minor bump

Anyone based in the US East region of AWS knows that yet again there were issues with EBS volumes, although you wouldn’t know it if you looked at their website. It’s a bit of a joke when you see headlines like “Amazon outage takes down Reddit, Foursquare, others”, yet on their status page a tiny little note icon appears stating there’s a slight issue, extremely minor, don’t worry about it. Yeah, right.

The main culprits were EC2 and the API, both of which were EBS related.

“Degraded EBS performance in a single Availability Zone
10:38 AM PDT We are currently investigating degraded performance for a small number of EBS volumes in a single Availability Zone in the US-EAST-1 Region.
11:11 AM PDT We can confirm degraded performance for a small number of EBS volumes in a single Availability Zone in the US-EAST-1 Region. Instances using affected EBS volumes will also experience degraded performance.
11:26 AM PDT We are currently experiencing degraded performance for EBS volumes in a single Availability Zone in the US-EAST-1 Region. New launches for EBS backed instances are failing and instances using affected EBS volumes will experience degraded performance.
12:32 PM PDT We are working on recovering the impacted EBS volumes in a single Availability Zone in the US-EAST-1 Region.
1:02 PM PDT We continue to work to resolve the issue affecting EBS volumes in a single availability zone in the US-EAST-1 region. The AWS Management Console for EC2 indicates which availability zone is impaired. “

The actual message is much, much longer, but you get the gist: a small number of people were affected. Yet most of the major websites that use Amazon were affected; how can that be considered small?

Either way, this time we survived, and we survived because we learnt. Back in June and July we experienced these issues with EBS ourselves, so we did something about it; why didn’t everyone else?

How Alfresco Cloud Survived

Back in June and July we were heavily reliant on EBS just like everyone else: we had an EBS-backed AMI on top of which we used puppet to build out the OS. This is pretty much what everyone does, and this is why everyone was affected. Back then we probably had 100 – 150 EBS volumes, so the likelihood of one of them going funny was quite high; now we have about 18, and as soon as we can we will ditch those as well.

After being hit twice in relatively quick succession we realised we had a choice: be lazy or be crazy. We went for crazy, and now it has paid off. We could have been lazy and just said that Amazon had issues, that it wasn’t that frequent and wasn’t likely to happen again; or we could be crazy and reduce our EBS usage as much as possible. We did that.

Over the last few months I’ve added a number of posts about The Cloud, Amazon and Architecting for the cloud, along with a few funky Abnormal puppet set ups and oddities in the middle. All of this was spawned from the EBS outages; we had to be crazy. Amazon tell us all the time: don’t have state, don’t rely on anything other than failure, use multiple AZs, etc. All of those big players that were affected would have been told that they should use multiple availability zones, but as I pointed out Here their AZs can’t be fully independent, and yet again this outage proves it.

Now, up until those outages we had done all of that, but we still trusted Amazon to remain operational. Since July we have made a concerted effort to move our infrastructure onto elements within Amazon that are more stable, hence the removal of EBS. We now only deploy instance-backed EC2 nodes, which means we have no ability to restart a server, but it also means that we can build them quickly and consistently.

We possibly took it to the extreme: our base AMI, now instance backed, consists of a single file that does a git checkout; once it has done that it simply builds itself to the point that chef and puppet can take over and run. The tools used to do this are many, but needless to say there are many hundreds of lines of bash, supported by Ruby, Java, Go and any number of other languages or tools.

We combined this with fully distributing puppet so it runs locally; in theory, once a box is built it is there for the long run. We externalised all of the configuration so puppet was simpler and easier to maintain. Puppet, its config, the base OS and the tools to manage and maintain the systems are all pulled from remote services, including our DNS, which automatically updates itself based on a set of tags.

Summary

So, how did we survive? We decided no box was important: if some crazy person couldn’t randomly delete a box or service and have the system keep working, then we had failed. I can only imagine that the bigger companies, with a lot more money and people and time looking at this, are still treating Amazon as a datacentre rather than as a collection of web services that may or may not work. With distributed puppet and config, once our servers are built they run happily on a local copy of the data, no network needed, and that is important because AWS’s network is not always reliable and nor is their data access. If a box no longer works, delete it; if an environment stops working, rebuild it; if Amazon has a glitch, keep working. Simple.

Puppet with out barriers -part one

Structure is good

Like everyone that has used and is using puppet, the puppet structure we used to manage our nodes was relatively straightforward. But before getting into that, let’s go over the two or three ways you will already know of.

1. Node inheritance. Typically you define a number of nodes that describe the physical and logical structures that suit your business, and inherit from each of these to end up with a host node that has the correct details relevant to the data centre it is in and its logical assignment within it.
2. Class inheritance. Similar to node inheritance: you create your modules, which are typically agnostic and more than likely take a number of parameters, then create a module that contains a set of node types, for example “web” nodes. The web node would include the apache module and do all of the configuration you expect of all apache nodes; this can be taken to the point of each node manifest simply including a web or wiki type class.
3. By far the most common, I imagine, is a mixture of both.

Any of these methods is fine; however, what will essentially happen is duplication, and you’ll be faced with situations where you want 90% of a web node but not that last 10%. You then create a one-off module or node, and before you know it you have several web nodes or several roles defining the same task.

It’s fair to say we don’t really do any of those; we use node inheritance a little to save duplication of code, but that’s about it. I’ll touch on this more next week, but this week is about foundations, and in puppet that means modules.

Parameterise everything

Parameterised classes help with this level of flexibility, but if your module isn’t fully parameterised it’s a bit of a pain, so here’s a slight detour based on experience.

I imagine, like everyone out there, I write puppet modules that are half-arsed on most days. I parameterise the variables I need for my templates and stick to a simple module layout to stop my classes getting too busy. This isn’t a bad way of doing things; you get quite a lot of benefit without a lot of investment in time, and I for one will continue to write modules that are half-arsed, but with one difference: more parameters.

Occasionally I have to extend my module; sometimes I don’t have time to re-factor and make it perfect, so I work around my module. If you’re looking to build technical debt this is a good way of doing it, and if you continue down this route you will end up with modules that are complicated and hard to follow.
Initially, when introducing module layouts to people, they assume it will make life more complicated because you have split a module into classes, but the reality is that rather than having 1 file with 100 lines you have 10 files with 10 lines, and of those 10 files you may only be using 2 in your configuration, which makes life a lot simpler than sifting through code.

A typical module layout I use consists of the following:

[root@rincewind modules]# ls example/
files  manifests  README  templates
[root@rincewind modules]# ls example/manifests/
install.pp  config.pp  init.pp  params.pp

For those not familiar with such naming schemes… init is loaded by default and will typically include the install class and the config class. So if I want to install / configure an entire module that has some reasonable params set, I just “include modulename”. You can read more about modules and how I typically use them Here. It was rather experimental at the time, but with only a few tweaks it is now in a production environment and is by far the easiest module we have to maintain.

The main reason this structure works for us is that we are able to set a number of sensible defaults in params, which reduces the number of params we need to pass in while still allowing us to have a lot of parameters on each class. Typically it means our actual use of the classes, even though they may have 40+ params, won’t normally go past 10.

The big thing we did differently with this module is to parameterise about 90% of all the configuration options. Now, you may think that’s fine for small apps, but what about large ones? Well, we did it with our own product, Alfresco, which is quite large and has a lot of options. Granted, we can’t account for every possible configuration, but we gave it a shot. Consequently we now have a puppet module that out of the box should work with multiple application containers (tomcat, jboss etc.) and also allows the application to be run in different containers on the same machine.

The advantage of this up-front effort on a module which is core to what we are doing is that we are now able to just change options without having to do a major re-factor of the code; adding more complexity in terms of configuration is simpler, as it has been split into multiple classes that are very specific to their tasks. If we want to change most of the configuration it is simply a matter of changing a parameter.

Be specific
So you have a module that is nicely parameterised and that’s it? Well, not quite: you have to make sure your module is specific to its task, and this is one that was a little odd to get over. Puppet is great at configuration management; it’s not so good at managing dependencies and order. So we made the decision to simplify our installation process: rather than having the puppet module try and roll out the application and multiple versions of it, with different configurations for different versions, it simply sets up the environment and sets the configuration with no application deployed.

So yes, you can install within the module if you wish, and we used to do that, along with some complicated expanding / re-zipping of the application to include plugins, but we also have a yum repo, or git, or an FTP server, where we can just write a much more specific script to deal with the installation.

By separating out the installation we have been able to cut a lot of code from our module that was trying to do far more than was sensible.

Be sensible
When you look through the configuration there are sometimes options that will only ever be set if X or Y is set; in this case write the module to cope with that logic rather than just adding everything as a parameter. But when you look through your module and start seeing lots of logic, you’ve gone wrong.

Typically now, when writing modules, the only logic that ends up in the pp file is whether a feature is enabled or disabled; if you try to make a manifest in a module all things to all people you end up with a lot of logic and complications that are not needed. To tackle this, after writing a base for the module I cloned the params file and the application manifest which does most of the config, and made them more specific to the needs of the main task I’m trying to achieve. It is 90% the same as the normal application manifest but with a few extra params, and it allows me to add specific files that are 100% relevant to what we do in the cloud without ruining the main application one, in case we decide to change direction or need a more traditional configuration.

The point is that you should avoid over-complicating each manifest in the module; create more, simpler manifests that are sensible for the task in hand.

This module layout will give you something to work with rather than against. We spent a lot of time making our whole puppet infrastructure as simple as possible to explain; this doesn’t mean it’s simple, it just means everyone understands what it does: me, my boss, my boss’s boss and anyone else that asks “How does it work?”

Alfresco Cloud – Out of beta

Finally!

For those of you that don’t know, I work at Alfresco in the Operations department, specifically looking after and evolving our cloud-based product. It feels like an absolute age that I’ve been working on the cloud product and the release, but finally today (well, 31st May 2012) we took the “Beta” tag off the product.

Being on the support side of the service I know the system very well, and overall I’m really pleased with how it is now and how it will be. The most fantastic thing about the product is knowing what is coming up: just because we have taken the beta label off doesn’t mean we will stop innovating new ways of doing things, utilising the best technology and writing bespoke management tools to help support the environment. Granted, now the Beta tag is off we have reduced the amount of disruptive impact we will have on the system, but, unlike all those months ago, we now have the right framework around managing and testing the changes we are making.

I’m looking forward to the next few months, as I know we’ve got more good stuff coming and I can’t wait to see how the general public take to the product. It’s been an interesting journey and it looks to be getting better!

What can you expect from Alfresco in the cloud?

I’m going to start this with a warning: I’m not in marketing or product design, so this is just the way I see the product and what I like about it. Those of you that have used Alfresco before will be familiar with the Share interface; it is somewhat cut down for the cloud but nonetheless just as powerful. You can still upload your documents and like & comment on them just as always, and you can use the Quick Share feature to share a document via email, Facebook or Twitter, so there’s no need to invite everyone just to see a single document or picture. For those privileged enough to sign up, you can use WebDAV to mount the cloud as a drive on your local PC, very handy…

And the best bit… well you have to sign up to find that….

Only a short update today; it has been a very busy week getting this all sorted and now it’s time to rejoice… and rest.