Monitoring Alfresco Solr

It’s not just about numbers

Until recently, if you wanted to monitor Alfresco’s Solr usage you had to either make a costly call to the stats page or use the summary report, which only really gave you a lag number. Luckily, because Alfresco have extended Solr, they have changed the summary report to provide some really useful information, which can then be tracked via Nagios or whatever your favourite monitoring tool is.

Firstly, it’s worth reading the Wiki as it explains the variables better than I would. It’s also worth mentioning that my preferred way of programmatically accessing this page is via JSON, like so:

 
http://localhost:8080/solr/admin/cores?action=SUMMARY&wt=json

It’s worth mentioning that, depending on the JSON parsing library you are using, you can get some fatal parsing errors caused by the hit ratio. For what it’s worth I found Crack to be good; it doesn’t validate the JSON as heavily as the standard json library does, which means you can pull back all the data even if there is a problem with the hit ratios.

On that subject: before the relevant cache is hit, the hit ratio will display “NaN” (Not a Number); once it has been hit it will display the appropriate number, which I’ll dive into a bit more later.
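If you are handling this yourself in Ruby, the “NaN” is easy to get wrong, because `String#to_f` quietly turns non-numeric text into 0.0. A minimal sketch of guarding against it (the values here are illustrative, not real summary output):

```ruby
# Ruby's String#to_f quietly turns non-numeric text into 0.0, so a "NaN"
# hit ratio must be caught before conversion rather than after.
raw = "NaN"                            # what the summary reports pre-hit
ratio = raw == "NaN" ? nil : raw.to_f
puts ratio.nil? ? "cache not hit yet" : ratio

raw = "0.82"                           # once the cache has been hit
ratio = raw == "NaN" ? nil : raw.to_f
puts ratio
```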

So before getting into the nitty-gritty service checks, it’s important to have a good understanding of the numbers. Most of them are straightforward; the only one that confused me was the hit ratios.

The hit ratio is a number between 0 and 1. When the number is greater than, say, 0.3 all is well; less than 0.3 and things could be bad. However, when the lookup count is less than, say, 100, you would expect the hit ratio to be low, as the cache is not being used enough to give a meaningful ratio. Other than the hit ratio, the rest are pretty straightforward.
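That logic can be sketched like so (the 0.3 and 100 figures are the illustrative numbers above, not official thresholds, and the method name is my own):

```ruby
# Only judge a cache's hit ratio once it has seen enough lookups to be
# meaningful; below that, a low ratio is expected and harmless.
def cache_healthy?(hitratio, lookups, threshold = 0.3, min_lookups = 100)
  return true if lookups < min_lookups  # too few lookups to judge
  hitratio > threshold
end

puts cache_healthy?(0.85, 5000)  # a warm, well-used cache => true
puts cache_healthy?(0.10, 5000)  # heavily used but rarely hitting => false
puts cache_healthy?(0.10, 20)    # barely used, so the low ratio is fine => true
```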

Some code

It’s probably worth me sharing the class I’m using to access and return Solr information; that way, if you want to write your own Nagios checks, you can just copy and paste it.

Firstly, the class that gets all the Solr information:

#
# Solr Metric gatherer

require 'rubygems'
require 'crack'
require 'open-uri'

class SolrDAO

  def initialize (url)
    @solr_hash = get_metrics(url)
  end

  def get_lag(index)
    lag = @solr_hash["Summary"][index]["TX Lag"]
    # The lag comes back as a string, so pull the number out of it
    lag_number = lag[/\d+/].to_i
    return lag_number
  end

  def get_alfresco_node_in_index(index)
    return @solr_hash["Summary"][index]["Alfresco Nodes in Index"]
  end
  
  def get_num_docs(index)
    return @solr_hash["Summary"][index]["Searcher"]["numDocs"]
  end
  
  def get_alfresco_avgTimePerRequest(index)
    return @solr_hash["Summary"][index]["/alfresco"]["avgTimePerRequest"]
  end

  def get_afts_avgTimePerRequest(index)
    return @solr_hash["Summary"][index]["/afts"]["avgTimePerRequest"]
  end

  def get_cmis_avgTimePerRequest(index)
    return @solr_hash["Summary"][index]["/cmis"]["avgTimePerRequest"]
  end

  def get_mean_doc_transformation_time(index)
    return @solr_hash["Summary"][index]["Doc Transformation time (ms)"]["Mean"]
  end

  def get_queryResultCache_lookups(index)
    return @solr_hash["Summary"][index]["/queryResultCache"]["lookups"]
  end
  
  def get_queryResultCache_hitratio(index)
    return @solr_hash["Summary"][index]["/queryResultCache"]["hitratio"]
  end
  
  def get_filterCache_lookups(index)
    return @solr_hash["Summary"][index]["/filterCache"]["lookups"]
  end
  
  def get_filterCache_hitratio(index)
    return @solr_hash["Summary"][index]["/filterCache"]["hitratio"]
  end
  
  def get_alfrescoPathCache_lookups(index)
    return @solr_hash["Summary"][index]["/alfrescoPathCache"]["lookups"]
  end
  
  def get_alfrescoPathCache_hitratio(index)
    return @solr_hash["Summary"][index]["/alfrescoPathCache"]["hitratio"]
  end
  
  def get_alfrescoAuthorityCache_lookups(index)
    return @solr_hash["Summary"][index]["/alfrescoAuthorityCache"]["lookups"]
  end
  
  def get_alfrescoAuthorityCache_hitratio(index)
    return @solr_hash["Summary"][index]["/alfrescoAuthorityCache"]["hitratio"]
  end
  
  def get_queryResultCache_warmupTime(index)
    return @solr_hash["Summary"][index]["/queryResultCache"]["warmupTime"]
  end
  
  def get_filterCache_warmupTime(index)
    return @solr_hash["Summary"][index]["/filterCache"]["warmupTime"]
  end
  
  def get_alfrescoPathCache_warmupTime(index)
    return @solr_hash["Summary"][index]["/alfrescoPathCache"]["warmupTime"]
  end
  
  def get_alfrescoAuthorityCache_warmupTime(index)
    return @solr_hash["Summary"][index]["/alfrescoAuthorityCache"]["warmupTime"]
  end
  
  private
  def get_metrics(url)
    url += "&wt=json"
    response = open(url).read
    # Convert the JSON response to a hash
    result_hash = Crack::JSON.parse(response)
    # if the hash has 'Error' as a key, we raise an error
    if result_hash.has_key? 'Error'
      raise "web service error"
    end
    return result_hash
  end

end # End of class

As you can see, it is quite straightforward to extend this if you want to pull back different metrics. At some point I will hook this into a GitHub repo or use it in another metrics-based project, but for now just use this.
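As an aside, the TX Lag value in the summary is a string rather than a bare number, which is why get_lag runs a regex over it. The idea in isolation (the lag value here is illustrative, not real output):

```ruby
# Pull the leading number out of a lag string such as "12 s";
# String#[] with a regex returns the first match, or nil if none.
lag = "12 s"
lag_number = lag[/\d+/].to_i
puts lag_number  # => 12
```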

Now, some of you may not be used to using Ruby, so here is a check that checks the filterCache hit ratio:

#!/usr/bin/ruby
$:.unshift File.expand_path("../", __FILE__)
require 'lib/solr_dao.rb'
solr_results=SolrDAO.new("http://localhost:8080/solr/admin/cores?action=SUMMARY")
hitratio=solr_results.get_filterCache_hitratio("alfresco")
lookups=solr_results.get_filterCache_lookups("alfresco").to_i
# The inverse of the hit ratio is checked: 1.0 is perfect, 0.1 is poor, and it can be ignored if there are fewer than 100 lookups
critical=0.8
warning=0.7
# Until the cache is hit the summary reports "NaN", which to_f would silently turn into 0.0, so check for it first
if (hitratio.to_s != "NaN")
  hitratio=hitratio.to_f
  inverse=(1.0-hitratio)
  if ( lookups >= 100 )
    if ( inverse >= warning )
      if (inverse >= critical )
        puts "CRITICAL :: FilterCache hitratio is #{hitratio}|'hitratio'=#{hitratio};#{warning};#{critical};;"
        exit 2
      else
        puts "WARNING :: FilterCache hitratio is #{hitratio}|'hitratio'=#{hitratio};#{warning};#{critical};;"
        exit 1
      end
    else
      puts "OK :: FilterCache hitratio is #{hitratio}|'hitratio'=#{hitratio};#{warning};#{critical};;"
      exit 0
    end
  else
    puts "OK :: FilterCache hitratio is #{hitratio}|'hitratio'=#{hitratio};#{warning};#{critical};;"
    exit 0
  end
else
  puts "UNKNOWN :: FilterCache hitratio is #{hitratio}"
  exit 3
end

To get this to work, you’ll just need to put it with your other Nagios checks, and in the same directory put a lib directory containing the SolrDAO class from further up (saved as solr_dao.rb). If you need to change its location, you will only need to adjust the following:


$:.unshift File.expand_path("../", __FILE__)
require 'lib/solr_dao.rb'
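For reference, the layout those two lines assume looks something like this (the check’s filename is my own choice):

```
nagios-plugins/
├── check_solr_filtercache.rb   # the check script above
└── lib/
    └── solr_dao.rb             # the SolrDAO class
```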

Also, if you wanted to, you could modify the script to take the critical and warning thresholds as parameters, so you can easily change them within Nagios.
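A sketch of what that could look like with Ruby’s stdlib OptionParser (the flag names and method are my own choice, not part of the script above):

```ruby
require 'optparse'

# Take the warning/critical thresholds as command-line options rather
# than hard-coding them, so Nagios can pass different values per check.
def parse_thresholds(argv)
  opts = { :warning => 0.7, :critical => 0.8 }  # defaults from the script above
  OptionParser.new do |o|
    o.on("-w RATIO", Float, "warning threshold")  { |v| opts[:warning]  = v }
    o.on("-c RATIO", Float, "critical threshold") { |v| opts[:critical] = v }
  end.parse!(argv)
  opts
end

thresholds = parse_thresholds(["-w", "0.6", "-c", "0.9"])
puts thresholds[:warning]   # now 0.6
puts thresholds[:critical]  # now 0.9
```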

Sentinel – An open source start

Last week I introduced the concept of Self Healing Systems, which then led me to have a tiny bit of a think, and I decided that I would write one. The decision took all of five minutes, but it gives me an excuse to do something a bit more complex than your everyday script.

I created a very simple website here which outlines my goals. As of writing I have got most of the features coded up for the MVP, and I do need to finish it off, which will hopefully be by the time this is published, but let’s see.

I decided to take this on as a project for a number of reasons:

  1. More ruby programming experience
  2. Other than Monit there doesn’t seem to be any other tools, and I had to be told about this one…
  3. It’s a project with just the right amount of programming challenge for me
  4. I like making things work
  5. It may stop me getting called out as often at work if it does what it’s meant to

So, a number of reasons. I’ve already come across a number of things that I don’t know how to solve, or what the right way of doing them is, which is good: I get to do a bit of googling and work out a reasonable way. To be honest, though, that is not going to be enough in the long run; hopefully as time goes on my programming experience will increase sufficiently that I can make improvements to the code as I go.

Why continue if there’s products out there that do the same thing?

Why not? Quite often there’s someone doing the same thing even if you can’t find evidence of it; competition should not be a barrier to entry, especially as people like choice.

I guess the important thing is that it becomes usefully different. Take a look at systems management tools, a personal favourite of mine: you have things like RHN Satellite, Puppet and Chef; three tools, one very different from the other two, and another only slightly different. People like choice; different tools work differently for different people.

I guess what I mean by that is that some people strike a chord with one application or another and as a result become fanboys, normally for no good reason.

There’s also the other side of it. I’ve not used Monit; I probably should, I probably won’t; but it doesn’t sound like where I want to go with Sentinel. Quite simply, I want to replace junior systems administrators. I don’t want another tool to be used; I want a tool that can provide real benefit by doing the checks on the system, making deterministic decisions based on logic and raw data, and looking not just at the system it’s on but at the whole environment it is part of. I think that is a relatively ambitious goal, but I also think it is a useful one, and hopefully it will get to a point where the application is more useful than the MVP and can do more than just look after one system.

Like any good open source product it will probably stay at version 0.X for a long time, until it has a reasonable set of features in it that make it more than just a simple Ruby programme.

A call for help

So I’ve started on this path, and I intend to continue regardless. At the moment, one thing that will help keep me focused is user participation, either through using the script or by logging bugs at the GitHub site it’s hosted on.

I think at the moment what I need is some guidance on the programming of the project. It’s clear to me that in a matter of months, if not weeks, this single-file application will become overly complicated to maintain and would benefit from being split out into classes. Although I know that, I do not know the right way of doing it; I don’t have any experience of larger applications, so if anyone knows, that would be good!

In addition to the architecture of the application, there are some programming issues which I’m sure I can overcome at some point, though I will probably achieve the solution by having a punt and seeing what sticks. There’s a wonderful switch in the code for processor states which needs to change: I need to iterate through each character of the state and report back on its status, whereas at the moment it is just looking for a combination. To start with I took the pragmatic option: add all of the processor states my system has to the switch and hope that’s enough.
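For what it’s worth, a sketch of the per-character approach (the code descriptions are the common ps STAT codes, and the method name is mine, not Sentinel’s):

```ruby
# Describe each character of a ps-style state string (e.g. "Ssl")
# individually, rather than matching whole combinations in a switch.
STATE_CODES = {
  "R" => "running",               "S" => "interruptible sleep",
  "D" => "uninterruptible sleep", "Z" => "zombie",
  "T" => "stopped",               "s" => "session leader",
  "l" => "multi-threaded",        "+" => "foreground"
}

def describe_state(state)
  # fetch with a default keeps unknown codes visible instead of crashing
  state.chars.map { |c| STATE_CODES.fetch(c, "unknown (#{c})") }
end

puts describe_state("Ssl").join(", ")
# => interruptible sleep, session leader, multi-threaded
```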

So if anyone feels like contributing, or can even see a simple way of fixing some dodgy coding, I’d appreciate it. I guess the only thing I ask is that if you are making changes, you see the README, log a ticket in GitHub, and commit the changes with reference to the ticket, so I know what’s happened and why.

So please, please, please get involved with Sentinel