Monitoring Alfresco Solr

It’s not just about numbers

Until recently, if you wanted to monitor Alfresco’s Solr usage you had to either make a costly call to the stats page or use the summary report, which only really gave you a lag number. Luckily, because Alfresco have extended Solr, they have changed the Summary report to provide some really useful information which can then be tracked via Nagios or whatever your favourite monitoring tool is.

Firstly it’s worth reading the Wiki as it explains the variables better than I would; it’s also worth mentioning that my preferred way of programmatically accessing this page is via JSON, like so:
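
http://localhost:8080/solr/admin/cores?action=SUMMARY&wt=json

(The host and port will vary with your installation; it is the wt=json parameter on the summary URL that returns the report as JSON.)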


It’s worth mentioning that, depending on the JSON parsing library you are using, you can get fatal parsing errors caused by the hit ratio. For what it’s worth I found Crack to be good; it doesn’t validate the JSON as heavily as the standard json library does, which means you can pull back all the data even if there is a problem with the hit ratios.

On that subject, before the relevant cache has been hit, the hit ratio will display “NaN” (Not a Number); once it has been hit it will display the appropriate number, which I’ll dive into a bit more later.

So before getting into the nitty gritty of the service checks, it’s important to have a good understanding of the numbers. Most of them are straightforward; the only ones that confused me were the hit ratios.

The hit ratio is a number between 0 and 1: when the number is greater than, say, 0.3 all is well; less than 0.3 and things could be bad. However, when the number of lookups is less than, say, 100, you would expect the hit ratio to be low, as the cache simply hasn’t been used enough for the ratio to mean anything. For example, a cache with 50 lookups and 10 hits reports a ratio of 0.2, which looks bad but really just means the cache is cold. Other than the hit ratios, the numbers are pretty straightforward.

Some code

It’s probably worth sharing the class I’m using to access and return the Solr information; that way, if you want to write your own Nagios checks, you can just copy and paste it.

Firstly, the class that gets all the Solr information:

# Solr Metric gatherer

require 'rubygems'
require "crack"
require 'open-uri'

class SolrDAO

  def initialize (url)
    @solr_hash = get_metrics(url)
  end

  def get_lag(index)
    lag = @solr_hash["Summary"][index]["TX Lag"]
    # TX Lag is reported as a string (e.g. "0 s"), so pull out the numeric part
    lag_number = /[0-9.]+/.match(lag).to_s
    return lag_number
  end

  def get_alfresco_node_in_index(index)
    return @solr_hash["Summary"][index]["Alfresco Nodes in Index"]
  end

  def get_num_docs(index)
    return @solr_hash["Summary"][index]["Searcher"]["numDocs"]
  end

  def get_alfresco_avgTimePerRequest(index)
    return @solr_hash["Summary"][index]["/alfresco"]["avgTimePerRequest"]
  end

  def get_afts_avgTimePerRequest(index)
    return @solr_hash["Summary"][index]["/afts"]["avgTimePerRequest"]
  end

  def get_cmis_avgTimePerRequest(index)
    return @solr_hash["Summary"][index]["/cmis"]["avgTimePerRequest"]
  end

  def get_mean_doc_transformation_time(index)
    return @solr_hash["Summary"][index]["Doc Transformation time (ms)"]["Mean"]
  end

  def get_queryResultCache_lookups(index)
    return @solr_hash["Summary"][index]["/queryResultCache"]["lookups"]
  end

  def get_queryResultCache_hitratio(index)
    return @solr_hash["Summary"][index]["/queryResultCache"]["hitratio"]
  end

  def get_filterCache_lookups(index)
    return @solr_hash["Summary"][index]["/filterCache"]["lookups"]
  end

  def get_filterCache_hitratio(index)
    return @solr_hash["Summary"][index]["/filterCache"]["hitratio"]
  end

  def get_alfrescoPathCache_lookups(index)
    return @solr_hash["Summary"][index]["/alfrescoPathCache"]["lookups"]
  end

  def get_alfrescoPathCache_hitratio(index)
    return @solr_hash["Summary"][index]["/alfrescoPathCache"]["hitratio"]
  end

  def get_alfrescoAuthorityCache_lookups(index)
    return @solr_hash["Summary"][index]["/alfrescoAuthorityCache"]["lookups"]
  end

  def get_alfrescoAuthorityCache_hitratio(index)
    return @solr_hash["Summary"][index]["/alfrescoAuthorityCache"]["hitratio"]
  end

  def get_queryResultCache_warmupTime(index)
    return @solr_hash["Summary"][index]["/queryResultCache"]["warmupTime"]
  end

  def get_filterCache_warmupTime(index)
    return @solr_hash["Summary"][index]["/filterCache"]["warmupTime"]
  end

  def get_alfrescoPathCache_warmupTime(index)
    return @solr_hash["Summary"][index]["/alfrescoPathCache"]["warmupTime"]
  end

  def get_alfrescoAuthorityCache_warmupTime(index)
    return @solr_hash["Summary"][index]["/alfrescoAuthorityCache"]["warmupTime"]
  end

  def get_metrics(url)
    url += "&wt=json"
    response = open(url).read
    # Convert the JSON response to a hash
    result_hash = Crack::JSON.parse(response)
    # if the hash has 'Error' as a key, we raise an error
    if result_hash.has_key? 'Error'
      raise "web service error"
    end
    return result_hash
  end

end # End of class

As you can see, it is quite straightforward to extend this if you want to pull back different metrics. At some point I will put this in a GitHub repo for people, or use it in another metrics-based project, but for now just use this.
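For example, a quick usage sketch (the URL and the "alfresco" core name are just examples; swap in your own host and core):

solr_dao = SolrDAO.new("http://localhost:8080/solr/admin/cores?action=SUMMARY")
puts solr_dao.get_lag("alfresco")
puts solr_dao.get_filterCache_hitratio("alfresco")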

Now, some of you may not be used to using Ruby, so here is a check that checks the filterCache hit ratio:

$:.unshift File.expand_path("../", __FILE__)
require 'lib/solr_dao.rb'
# Example values; the summary URL, core name and thresholds will differ per install
solr_dao = SolrDAO.new("http://localhost:8080/solr/admin/cores?action=SUMMARY")
index = "alfresco"
warning = 0.7
critical = 0.9
hitratio = solr_dao.get_filterCache_hitratio(index)
lookups = solr_dao.get_filterCache_lookups(index).to_i
#Hit ratio is an inverse, 1.0 is perfect 0.1 is crap, and can be ignored if there is less than 100 lookups
# The check therefore works on 1 - hitratio; before the cache is hit the ratio is "NaN" and we return UNKNOWN
inverse = (hitratio.to_s =~ /NaN/) ? hitratio : (1.0 - hitratio.to_f)
if (inverse.is_a? Float)
  if ( lookups >= 100 )
    if ( inverse >= warning )
      if (inverse >= critical )
        puts "CRITICAL :: FilterCache hitratio is #{hitratio}|'hitratio'=#{hitratio};#{warning};#{critical};;"
        exit 2
      else
        puts "WARNING :: FilterCache hitratio is #{hitratio}|'hitratio'=#{hitratio};#{warning};#{critical};;"
        exit 1
      end
    else
      puts "OK :: FilterCache hitratio is #{hitratio}|'hitratio'=#{hitratio};#{warning};#{critical};;"
      exit 0
    end
  else
    puts "OK :: FilterCache hitratio is #{hitratio}|'hitratio'=#{hitratio};#{warning};#{critical};;"
    exit 0
  end
else
  puts "UNKNOWN :: FilterCache hitratio is #{hitratio}"
  exit 3
end

To get this to work, you’ll just need to put it with your other Nagios checks, and in the same directory create a lib directory containing the SolrDAO class from further up. If you need to change its location, you will only need to adjust the following:

$:.unshift File.expand_path("../", __FILE__)
require 'lib/solr_dao.rb'

Also, if you wanted to, you could modify the script to take the critical and warning thresholds as parameters, so you can easily change them within Nagios.
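
For example, a minimal sketch of that change (the argument order and default values are assumptions):

# Take the thresholds from the command line, falling back to defaults
warning  = (ARGV[0] || 0.7).to_f
critical = (ARGV[1] || 0.9).to_f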

DevOps team DNA

Hi, this is my first post on Matt’s blog. I’ve been an avid supporter of his blogging for a while and today got an invite to contribute. So here’s my post (created very quickly before he changes his mind).

My job has always been within the operations department of software product companies. I started at a small company as ‘everything’ support and slowly drifted towards a specialisation in the more recently branded DevOps-y areas as I made my way through various acquisitions and mergers. Over the past couple of years I’ve found myself building DevOps teams. During that time I’ve discovered some of the things that work and almost everything that doesn’t work (or it feels like that :) ).

Some of the things that have worked..   (for me anyway)

Obviously these are going to be quite subjective and I doubt they will work for everyone. I’ll focus mostly on what I think are the key ingredients of a successful team. Maybe some people will find it interesting. Bear in mind that this only really applies to an operations team that supports a Cloud service.

I’m not a big football fan but I can draw some parallels between football managers and DevOps teams. You don’t see Arsenal winning and losing games based on their process redesigns. I may be simplifying, and I’m sure tactics play a large part, but I believe you get quite a bit more out of a team when you have excellent players: players who excel in different areas. My teams tend to be 5 – 7 players nowadays and between all of us we need to cover a few areas.

The first is product knowledge.. If you have a product guru in your team then you’ve got a productivity catalyst. So many aspects of our work involve investigating whether issues are product vs config and whether we can improve things from an operational perspective that will result in the product running better. The most recent team has a Product Architect and he’s awesome. He’s on the cutting edge of ideas for the product, for Amazon AWS and for all of the supporting technologies. Having a dedicated resource to do all of this in the background is great – it means that when we automate his prototypes and release them we get the maximum benefit. Recent examples include our Public API work and the work being done on our Amazon architecture to improve speed (CDNs, etc.).

The second role I’ve always tried to fill is an engineer (at least one person, preferably two). Get the most senior developer(s) that you can, who know the language and build system of the product that you are supporting. You can now write the high level instrumentation that every DevOps team needs – as is true with any automation project. There is only ever so far you can go with Bash (I tend to take things beyond where they are supposed to be with Bash as it is). Ultimately having a senior developer or two buys you a massive amount of flexibility. Need a web service for something like externalised Puppet variables?.. you can write your own. Backup scripts not fast enough?.. a senior developer will make those scripts look very feeble in comparison when rewritten in their preferred language and multithreaded. I’m careful about not reinventing the wheel and will usually go off and clone something from GitHub before starting from scratch myself. But having some people who can write stuff from scratch is a major advantage. One caveat I would say for this role – hire from outside. Developers usually end up getting pulled back to work on stuff they did at the company at some stage. If you can, hire a new person and liven things up. Obviously tell the engineering teams that the hire(s) are for instrumentation in case they get worried that you want to start adding buttons to the product :)

Lastly, the sysadmins. I’d actually consider myself one of these at heart. Getting a good sysadmin can be tricky. It’s not uncommon to read 100 CVs before finding someone even remotely eligible. For a DevOps team you need a reasonably rare mix of skills: people who know Linux inside out, who can script and get excited by the latest batch of tools, and nowadays you need to throw Puppet / Chef into the mix. I have a couple of these currently and consider myself extremely blessed. Everything that we do is checked into source control (we use AWS as our data centre) and this buys us a lot of things.. like the ability to automate everything, reduce costs by deleting and recreating at whim, and disaster recovery. However, you pay for those things by hiring really good people.. which works out as a saving in the long run once the benefits of the team start to show.

Now if you add in all of those types of role.. what I’ve found works quite nicely is running the team without being too focussed on the separation of responsibility. Everyone is on call 24/7. Everyone is expected to know the product inside out (although nobody will get near the level of expertise of the Product Architect), everyone scripts (even me) and ultimately everyone will end up doing some programming tasks. You can probably see from Matt’s previous blog posts about the Metrics project that he got the chance to learn some Ruby. I think it’s important that everyone knows a bit of everyone else’s job.. although when under pressure everyone naturally drops back into doing what they are good at to speed things along.

This probably looks a little odd from the outside. But it makes things fun, everyone stays engaged and ultimately we all share the same goal: scale to 1 million users :D

This time, We survived the AWS outage

Another minor bump

Anyone based in the US East region of AWS knows that yet again there were issues with EBS volumes, although you wouldn’t know it if you looked at their website. It’s a bit of a joke when you see headlines like “Amazon outage takes down Reddit, Foursquare, others”, yet on their status page a tiny little note icon appears stating there’s a slight issue, extremely minor, don’t worry about it. Yeah, right.

The main culprits were EC2 and the API, both of which were EBS related.

“Degraded EBS performance in a single Availability Zone
10:38 AM PDT We are currently investigating degraded performance for a small number of EBS volumes in a single Availability Zone in the US-EAST-1 Region.
11:11 AM PDT We can confirm degraded performance for a small number of EBS volumes in a single Availability Zone in the US-EAST-1 Region. Instances using affected EBS volumes will also experience degraded performance.
11:26 AM PDT We are currently experiencing degraded performance for EBS volumes in a single Availability Zone in the US-EAST-1 Region. New launches for EBS backed instances are failing and instances using affected EBS volumes will experience degraded performance.
12:32 PM PDT We are working on recovering the impacted EBS volumes in a single Availability Zone in the US-EAST-1 Region.
1:02 PM PDT We continue to work to resolve the issue affecting EBS volumes in a single availability zone in the US-EAST-1 region. The AWS Management Console for EC2 indicates which availability zone is impaired. “

The actual message is much, much longer but you get the gist: a small number of people were affected. Yet most of the major websites that use Amazon were affected; how can that be considered small?

Either way, this time we survived, and we survived because we learnt. Back in June and July we experienced these issues with EBS, so we did something about it. Now, why didn’t everyone else?

How Alfresco Cloud Survived

So, back in June and July we were heavily reliant on EBS just like everyone else: we had an EBS-backed AMI on top of which we used Puppet to build out the OS. This is pretty much what everyone does, and it is why everyone was affected. Back then we probably had 100 – 150 EBS volumes, so the likelihood of one of them going funny was quite high; now we have about 18, and as soon as we can we will ditch those as well.

After being hit twice in relatively quick succession we realised we had a choice: be lazy or be crazy. We went for crazy, and now it has paid off. We could have been lazy and said that Amazon had issues, that they weren’t that frequent and weren’t likely to happen again; or we could be crazy and reduce our EBS usage as much as possible. We did that.

Over the last few months I’ve added a number of posts about The Cloud, Amazon and Architecting for the cloud, along with a few funky Abnormal puppet set ups and oddities in the middle. All of this was spawned from the EBS outages. We had to be crazy; Amazon tell us all the time: don’t have state, don’t rely on anything other than failure, use multiple AZs, etc. All of those big players that were affected would have been told that they should use multiple availability zones, but as I pointed out Here, their AZs can’t be fully independent, and yet again this outage proves it.

Now, up until those outages we had done all of that, but we still trusted Amazon to remain operational. Since July we have made a concerted effort to move our infrastructure onto elements within Amazon that are more stable, hence the removal of EBS. We now only deploy instance-backed EC2 nodes, which means we have no ability to restart a server, but it also means that we can build them quickly and consistently.

We possibly took it to the extreme: our base AMI, now instance-backed, consists of a single file that does a git checkout; once it has done that it simply builds itself to the point where Chef and Puppet can take over and run. The tools used to do this are many, but needless to say they amount to many hundreds of lines of Bash, supported by Ruby, Java, Go and any number of other languages or tools.

We combined this with fully distributing Puppet so it runs locally; in theory, once a box is built it is there for the long run. We externalised all of the configuration so Puppet was simpler and easier to maintain. Puppet, its config, the base OS and the tools to manage and maintain the systems are all pulled from remote services, including our DNS, which automatically updates itself based on a set of tags.


So, how did we survive? We decided that no individual box was important: if a crazy person could not randomly delete a box or service and have the system keep working, then we had failed. I can only imagine that the bigger companies, with a lot more money and people and time looking at this, are still treating Amazon more as a data centre than as a collection of web services that may or may not work. With the distributed Puppet and config, once our servers are built they run happily on a local copy of the data, with no network, and that is important because AWS’s network is not always reliable and nor is their data access. If a box no longer works, delete it; if an environment stops working, rebuild it; if Amazon has a glitch, keep working. Simple.

Sysadmins in a Developers world

It’s all back to front

Well, it was about 9 months back that I touched on Developers in a sysadmin world, and my initial thoughts were along the lines of: we are each better at different tasks. After spending a week doing only development, I am still of the same opinion.

Over the last 6 months we have had our solitary developer coding away making great things happen, predominantly developing a portal that allows us to deploy environments in 15 minutes versus the 2 days it took before. The whole thing is very pretty; it even has its own Favicon.ico, which we are all pleased about. In addition to just deploying, it also allows us to scale the environments it creates up and down, and despite constant interruptions it is coming along really well; in the next month we will be providing it as a service to the engineering teams to self-serve.

As our tooling develops we are also in-housing more and more of it. As regular readers know, I do dabble with the odd program slightly more complex than the average sysadmin might tackle. When we are faced with a situation such as monitoring the operations (by this I mean user growth week on week and the cost of running the environment(s)), it just made more sense to do it ourselves. There are tools out there that provide various dashboards, like Geckoboard, which can do approximately 80% of the job, but it’s that last 20% that adds the usefulness; as such we are trying to develop tools that are pluggable, extensible and support multiple outputs. For example, the Metrics report we have will also support Geckoboard, Graphite and Email, and will probably have its own web interface.

For us it is becoming more about having the flexibility to add and remove components; this introduces challenges, with whatever is being written needing to be pluggable and easy to maintain, which often makes it complicated.

I used classes, as a necessity

Typically, when I program there is not much need for classes, or even objects for that matter; a simple var and some nice loops and conditional statements would be plenty. Well, not so much anymore. The last project was metrics, and as with other projects I got it working within a day or two, and I hated it: it took over 30 seconds to run, it generated the report I needed but not in the right format, and the level of detail in the metrics was not high enough; it could manage weekly figures, but that was not good enough.

I decided that I’d have a chat with a few developers to help with the structure of the application; at first I was dubious, but it turned out well. The key step, which I wouldn’t have made until it was a real problem, was to separate out the tasks that gather the raw data, the bit that stores the data, the bit that manipulates the data into useful numbers and, finally, the bit that outputs the pretty data.

This was an evolutionary step. I would have got to the point of understanding the need to separate each step out, but not until it had become a really big pain many months later. Another advantage of splitting it out was how much simpler each step became: there were classes defining methods for getting data that were used in classes to format the data, which were used in… you get the idea. Rather than one class connecting to Amazon, manipulating the data and returning an object that could be used to generate the metrics, everything was done in much smaller steps. As a result it was a lot easier to write small chunks of code “that just worked”, it made debugging a lot easier, and I feel like I progressed my understanding, which is always a good thing.
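
To make the shape of that split concrete, here is a rough sketch of how the pieces might hang together; the class and method names are illustrative rather than the actual project’s:

# Gathers raw data from a source (anything that responds to fetch)
class Gatherer
  def initialize(source)
    @source = source
  end
  def raw_data
    @source.fetch
  end
end

# Stores previously gathered data so later runs don't have to re-fetch it
class Store
  def initialize(path)
    @path = path
  end
  def save(data)
    File.open(@path, 'a') { |f| f.puts data.inspect }
  end
end

# Turns raw data into useful numbers
class Calculator
  def total(raw)
    raw.inject(0) { |sum, value| sum + value.to_f }
  end
end

# Formats the numbers for a particular output (CSV here; Graphite, email, etc. elsewhere)
class Formatter
  def to_csv(numbers)
    Array(numbers).join(",")
  end
end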

Who should do what

I touched on this in my other post, but I want to amend it based on a better understanding. To summarise, I pretty much said it as it is: developers develop, sysadmins admin. They do, and certainly that should be their focus, but I think there is a lot to be gained from both points of view when each is pushed to work in the other’s world.

Before our developer joined, the focus was on making the build, test and release process better. After forcing the developer to do sysadmin work for a month or so, while the team was trying to grow and cope with the loss of a team member, it became clear that the time being wasted for us all was not in getting a build through, but in us not being able to parallelise the testing or be agile enough to re-deploy an environment if it was not quite right. These steps and understandings would not have happened if we didn’t encroach on each other’s work and gain understanding from the other person’s perspective.


This is what DevOps is really about: forget sysadmins writing code, forget developers doing sysadmin work; it is about us meeting in the middle, understanding the issues we each face and working together to solve bigger problems.

Metrics from Amazon

Amazon have an API for that

It was words to that effect I heard; I wish it had been what I wanted, though. After watching a colleague get metrics for the billing, it was clear after a short look that they were very good for estimating your total expense, but they were never going to be an accurate figure.

So, in short, if you want to be disappointed by Amazon guessing your monthly cost, you can find out some information Here.

If you wish to learn more, read on…

What do you want?

We were looking at the metrics because we wanted to report on our running costs on a granular basis, so we could see the hourly cost of our account as and when new services were turned on or off, or if we added or subtracted a node from an existing system. In addition to that snapshot, we wanted to be able to compare one week to the next, and against some other operational metrics such as users online.

So, after discussing it for a little while, it was clear the Amazon metrics are good for predicting, not good for historical reporting, and ultimately not very accurate, as the figure was always a potential and never an actual. I made the decision that we were better off getting the information ourselves, which at first sounded crazy, and the more I think about it the more I agree; really, Amazon should offer this in a better way through their current billing.

Knowing what we wanted meant I needn’t bother tracking the things we don’t care about. What is really important to us? Is it the disk space being used? Is it the bandwidth? The cost of ELBs? Nope, we really just care about how much the instances we are running cost.

In the end that’s all that mattered: are we costing more money or less money, and with this money are we providing more or less value? Ideally we will reduce cost and increase performance, but until we start tracking the figures we have no idea what is actually going on, unless we spend hours looking at a summarised cost and guessing backwards… well, until now anyway.

Over the last 5 days I’ve spent some time knocking together some Ruby scripts that poll Amazon and get back the cost of an account based on the currently running instances across all regions. For us that is good enough, but I decided to add extra functionality by getting all of the nodes as well; it sort of acts like an audit trail and will allow us to do more in-depth monitoring if we so wish. For example, if we switch instance type, does that save us money and give more performance? Well, we wouldn’t know, especially if we didn’t track what was running.

We also wanted to track the number of objects in a bucket within S3. Our S3 buckets have millions of objects in each of them; if you use the aws-sdk to get this it will take forever plus a day, and if you use right_aws it will still take a long time, just not as long (over 30 minutes for us). This isn’t acceptable, so we’re looking at alternative ways to generate this number more quickly, but the short answer is it’ll be slow. If I come up with a fancy S3 object count alternative I’ll let you know, but for now I have had to abandon that. Unless Amazon want to add two simple options like s3.totalObjects and s3.totalSize…
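
For reference, the slow, naive count looks something like this (aws-sdk v1 style; the bucket name is an example); it is slow precisely because every key has to be listed:

require 'aws-sdk'

s3 = AWS::S3.new(:access_key_id => ENV['AWS_ACCESS_KEY_ID'], :secret_access_key => ENV['AWS_SECRET_ACCESS_KEY'])
# Enumerates every key in the bucket (1000 per request) just to count them
count = s3.buckets['my-bucket'].objects.inject(0) { |total, _obj| total + 1 }
puts count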

It’s just data

So, we gathered our Amazon data and we gathered some data from our product; this was about a day into the project. All of this was done on the fly, and it used to take 20 seconds or so to get the information. We had a quick review and it was decided we should track now versus the previous week; this made a slight difference, as it meant we now needed to store the data.

This is a good thing: by storing the data we care about, we no longer have to make lots of long-winded calls that hang for 20 seconds; it’s all local, speed++.

Needless to say, the joys of storing the data and searching back through it are all interesting, but I’m not going to go into them.

To take the data and turn it into something useful requires reports to be written; while it’s raw data no one will particularly care, but once a figure is associated with a cost per user, or “this account costs X per hour”, people care more. One of the decisions made was to abstract the data separately from the formatting of the output, mainly because there are multiple places to send the data: we might want to graph some of it in Graphite, email some of the other data and squirrel away the rest in a CSV output somewhere.

An advantage of this is that once the data has been generated there’s one file to modify to choose how and what data gets returned, which gives us the ability to essentially write bespoke reports for whatever is floating our boat this week.
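
As a rough illustration of that idea (the class names here are made up, not the real report code), the report layer just hands the same numbers to whichever outputs are enabled:

# Each output knows how to deliver a hash of metrics somewhere different
class CsvOutput
  def deliver(metrics)
    puts metrics.values.join(",")
  end
end

class GraphiteOutput
  def deliver(metrics)
    # Prints Graphite's plaintext format; a real output would send this over the network
    metrics.each { |name, value| puts "metrics.#{name} #{value} #{Time.now.to_i}" }
  end
end

# The report doesn't care where the data ends up
class Report
  def initialize(outputs)
    @outputs = outputs
  end
  def run(metrics)
    @outputs.each { |output| output.deliver(metrics) }
  end
end

Report.new([CsvOutput.new, GraphiteOutput.new]).run({"cost_per_hour" => 12.5, "users" => 1000})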

A Freebie

I thought this might be useful. It’s the class we are using to get our instance data from Amazon; it’s missing a lot of error checking, but so far it is functional, and as everyone knows, before it can be useful it must first work.

# Get instance size cost

require 'net/http'
require 'rubygems'
require 'json'
require 'aws-sdk'

class AWSInstance
  def initialize (access_key, secret_key)
    @access_key_id = access_key
    @secret_access_key = secret_key
  end

  def cost
    instances = get_running_instances
    cost = 0.00
    #Calc Cost
    instances.each_pair do |region,value|
      value.each_pair do |instance_type, nodes|
        cost += (nodes.size.to_f * price(instance_type,region).to_f)
      end
    end
    return cost
  end

  def get_instances
    #Return all running instances as each node has a cost in its hash
    return get_running_instances
  end

  def price (api_size, region)
    rhel_url = "/rhel/pricing-on-demand-instances.json"
    url = "/ec2/pricing/pricing-on-demand-instances.json"
    price = 0
    # Fetch the pricing JSON; we use the RHEL prices here (the host is assumed to be aws.amazon.com)
    response = Net::HTTP.get_response("aws.amazon.com",rhel_url)
    #puts response.body 
    # Convert to hash
    json_hash = JSON.parse(response.body)

    # api_size i.e. m1.xlarge
    # Get some info to help looking up the json
    case api_size
      when "m1.small"
        size = "sm"
        instance_type = "stdODI"
      when "m1.medium"
        size = "med"
        instance_type = "stdODI"
      when "m1.large"
        size = "lg"
        instance_type = "stdODI"
      when "m1.xlarge"
        size = "xl"
        instance_type = "stdODI"
      when "t1.micro"
        size = "u"
        instance_type = "uODI"
      when "m2.xlarge"
        size = "xl"
        instance_type = "hiMemODI"
      when "m2.2xlarge"
        size = "xxl"
        instance_type = "hiMemODI"
      when "m2.4xlarge"
        size = "xxxxl"
        instance_type = "hiMemODI"
      when "c1.medium"
        size = "med"
        instance_type = "hiCPUODI"
      when "c1.xlarge"
        size = "xl"
        instance_type = "hiCPUODI"
      when "cc1.4xlarge"
        size = "xxxxl"
        instance_type = "clusterComputeI"
      when "cc2.8xlarge"
        size = "xxxxxxxxl"
        instance_type = "clusterComputeI"
      when "cg1.4xlarge"
        size = "xxxxl"
        instance_type = "clusterGPUI"
      when "hi1.4xlarge"
        size = "xxxxl"
        instance_type = "hiIoODI"
    end
  # json_hash["config"]["regions"][0]["instanceTypes"][0]["sizes"][3]["valueColumns"][0]["prices"]["USD"]
    json_hash["config"]["regions"].each do |r|
      if (r["region"] == region.sub(/-1$/,''))
        r["instanceTypes"].each do |it|
          if (it["type"] == instance_type)
            it["sizes"].each do |sz|
              if (sz["size"] == size)
                # Walk the structure described in the comment above to get the USD price
                price = sz["valueColumns"][0]["prices"]["USD"]
              end
            end
          end
        end
      end
    end
    return price
  end

  def get_running_instances
    #Set up EC2 connection
    ec2 = AWS::EC2.new(:access_key_id => @access_key_id, :secret_access_key => @secret_access_key)
    instance_hash = Hash.new
    #Get a list of instances
    #Memoize cuts down on chatter
    AWS.memoize do
      ec2.regions.each do |region|
        instance_hash.merge!({region.name => {}})
        region.instances.each do |instance|
          if (instance.status == :running)
            #Need to create a blank hash of instance_type else merge fails
            if (!instance_hash[region.name].has_key?(instance.instance_type) )
              instance_hash[region.name].merge!({instance.instance_type => {}})
            end
            #For all running instances, keyed on instance id
            tmp_hash = {instance.id => {:env => instance.tags.Env, :role => instance.tags.Role, :name => instance.tags.Name, :cost => price(instance.instance_type, region.name)}}
            instance_hash[region.name][instance.instance_type].merge!(tmp_hash)
          end
        end
      end
    end
    return instance_hash
  end
end #End class
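
And a quick sketch of how it gets called (passing the credentials in via environment variables is just an example):

aws = AWSInstance.new(ENV['AWS_ACCESS_KEY_ID'], ENV['AWS_SECRET_ACCESS_KEY'])
printf("Current hourly instance cost: $%.2f\n", aws.cost)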

Release consistency

It’s gotta be… perfect

This is an odd one that the world of DevOps sits in an awkward place on, but it is vital to the operational success of a product. As an operations team we want to ensure the product is performing well, that there are no issues and that there is some control over the release process. Sometimes this leads to the long-winded, bloated procedures that are common in service providers, where people stop thinking and start following; this is a good sign a sysadmin has died inside. From the more development side, we want to be quick and efficient and to reuse tooling. Consistency is important, a known release state is important, and the process side of it should barely exist, as it should be automated with minimal interaction.

So, as operations guys, we may have to make ad-hoc changes to ensure the service continues to work; as a result, a retrospective change is made to ensure the config is good for the long term. Often this stays untested, as you’re not going to re-build the system from scratch to re-test it, are you?

The development teams want it all: rapid, agile change, quick testing suites and an infallible release process with plenty of resilience, double checks and catches for all sorts of error codes, etc. The important thing is that the code makes it out, but it has to be perfect, which leads on to QA.

The QA team want to test the features, the product, the environment, combinations of each, and the impact of going from X to Y while tracking the performance of the environment in between. All QA want is an environment that never changes, with a code base that never changes, so they can test everything in every combination; those bugs must be squashed.

Obviously, everyone’s right, but with so many contradicting opinions it is easy to end up in a blame circle, which isn’t all that productive. The good news is we know what we want…

  • Quick release
  • Quick accurate testing
  • Known Product build
  • Known configuration
  • No ad-hoc changes to systems
  • Ability to make ad-hoc changes to systems
  • Infallible release process

All in all, not that hard, but there are two sticking points here. Point one: infallible releases are a thing of wonder and can only ever be iterated towards; in time they will get better. Day 1 will be crap, day 101 will be better, day 1000 better still. Point two: you can’t have no ad-hoc changes and the ability to make ad-hoc changes, can you? Well, you can.

If you love it, let it go

As a sysadmin, if there’s an issue on production I will in most cases fix it in production; if it is risky I will test it on our staging environment first, but this is usually no good, as staging typically won’t show the issues production does, i.e. all of its servers would be working while production is missing one. This means I have to make ad-hoc changes to production, which causes challenges for the QA team, as now the test environments aren’t always the same; it then screws up the release process, as we made a change in one place and didn’t manually add it in to all the other environments.

So, what if we throw away production, staging or any other environment with every release? This is typically a no-no for traditional operations: why would you throw away a working server? Well, it provides several useful functions:

  1. Removes ad-hoc changes
  2. Tests documentation / configuration management
  3. Enhances DR strategy and familiarisation with it
  4. Clean environment

The typical reason the throw-away approach isn’t taken is a lack of confidence in the tools. Well, bollocks. What good is having configuration management and DR policies if you aren’t using them? If, operationally, you are making changes to Puppet and rolling them forward, you achieve some of this, but it’s not good enough; you still run the risk of never having tested the configuration on a new system.

With every environment being built from scratch with every release, we can tie a release to a specific build number and a specific git commit, which is better for QA as it’s always a known state; if there’s any doubt, delete and re-build.

The release process can now start being baked into the startup of a server / Puppet run, so consistency is increasing and hopefully the infallibility of the release is improving. Add to this a set of system-wide checks for basic health, a set of checks for unit testing the environment and a quick user-level test before handing over for testing, and it’s more likely, more often, that the environments are in a known, consistent state.

All of this goodness starts with deleting it and building fresh. Some of it will happen organically, but by at least making sure all of the configuration and deployment is managed at startup, you can re-build, and you have some DR. Writing some simple smoke tests and some basic automation is a starting point; from there you can build upon the basics to start making it fully bullet-proof.
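
As an idea of what a simple smoke test might look like, here is a minimal sketch; the URL and the pass condition are assumptions, not our actual checks:

require 'net/http'
require 'uri'

# A smoke test only needs to prove the environment is alive and serving
def smoke_test(url)
  response = Net::HTTP.get_response(URI.parse(url))
  response.code == "200"
end

if smoke_test("http://localhost:8080/share")
  puts "OK :: environment is serving pages"
  exit 0
else
  puts "CRITICAL :: environment failed the smoke test"
  exit 2
end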


Be radical, delete something in production today and see if you survive, I know I will.