Updated Alfresco Solr Checks

As some may know…

A little while back I put up some checks for Alfresco Solr here and wrote a little blog about them here.

Well, over the last few weeks I have added yet more checks to it, and I’ve also added some caching of the results. It no longer makes a separate request to Solr for each check; instead it uses a locally cached copy of the results and fetches a new one after 5 minutes. The reason for this is that most of the results don’t change that frequently, and Nagios was running each check separately, which meant 20 calls to Solr over a 5 minute period. Each individual check is only verified once every 5 minutes anyway, so now it pulls the report once, references that cached copy for 5 minutes, and after that simply pulls a new one.
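For anyone wanting to do something similar, the caching idea can be sketched roughly like this. This is a minimal sketch rather than the code from the checks themselves; the cached_report name, the cache file path and the block that does the fetching are all made up for illustration.

```ruby
# Hypothetical helper: returns the cached report body if the cache file is
# younger than ttl seconds, otherwise calls the block to fetch a fresh copy.
def cached_report(cache_file, ttl = 300)
  if File.exist?(cache_file) && (Time.now - File.mtime(cache_file)) < ttl
    return File.read(cache_file)
  end
  body = yield                           # e.g. the HTTP request to Solr
  tmp = "#{cache_file}.#{Process.pid}.tmp"
  File.write(tmp, body)                  # write to a temporary file first...
  File.rename(tmp, cache_file)           # ...then swap it into place
  body
end
```

Each check would then call cached_report with the same cache file and a block that requests the summary report, so only the first check in any 5 minute window actually talks to Solr. Writing to a temporary file and renaming it is there so that two checks running at the same time don't read a half-written cache.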

In addition to the caching it now has 13 new checks, including cumulative hit ratios, which are typically more relevant than the normal hit ratios as they are calculated over all time (since reboot). And no, I don’t know what period the normal hit ratios are calculated over.

There are also checks for the number of transactions remaining and the number of change sets remaining. Combined with the lag, these give you an indication of how far behind Solr is and how much work it has left to do, so they are quite useful.

If you need any help with these, or have a few additional checks that are relevant, let me know; I’m happy to help.

Monitoring Alfresco Solr

It’s not just about numbers

Up until recently, if you wanted to monitor Alfresco’s Solr usage you would have had to either use a costly call to the stats page or use the summary report, which only really gave you a lag number. Luckily, because Alfresco have extended Solr, they have changed the summary report to provide some really useful information, which can then be tracked via Nagios or whatever your favourite monitoring tool is.

Firstly, it’s worth reading the Wiki as it explains the variables better than I would. It’s also worth mentioning that my preferred way of programmatically accessing this page is as JSON, by adding wt=json to the report URL, which is what the get_metrics method further down does.

It’s worth mentioning that, depending on the JSON parsing library you are using, you can get fatal parsing errors caused by the hit ratio. For what it’s worth, I found Crack to be good: it doesn’t validate the JSON as strictly as the plain json library does, which means you can pull back all the data even if there is a problem with the hit ratios.

On that subject, until the relevant cache has been hit, the hit ratio will display “NaN” (Not a Number); once it has been hit it will display the appropriate number, which I’ll dive into a bit more later.
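If you want to guard against the “NaN” case in your own checks, something along these lines should do. It is only a sketch: parse_hitratio is a made-up helper name, and whether your parser hands back a string or a float for “NaN” is an assumption you should check against your own output.

```ruby
# Hypothetical helper: turns whatever the parser returned for a hit ratio
# into a Float, or nil when there is no usable number yet.
def parse_hitratio(value)
  ratio = Float(value)          # accepts Floats, Integers and numeric strings
  ratio.nan? ? nil : ratio      # a real NaN float also counts as "no value"
rescue ArgumentError, TypeError
  nil                           # "NaN" as a string, nil, or anything non-numeric
end
```

A check can then treat nil as “cache not hit yet” instead of blowing up on a comparison.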

So before getting into the nitty-gritty service checks, it’s important to have a good understanding of the numbers. Most of them are straightforward; the only ones that confused me were the hit ratios.

The hit ratio is a number between 0 and 1. When the number is greater than, say, 0.3 all is well; less than 0.3 and things could be bad. However, when the cache has had fewer than, say, 100 lookups, you would expect the hit ratio to be low, as the cache has not been used enough to give a meaningful figure. Other than the hit ratio, the numbers are pretty straightforward.
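Put as code, that rule of thumb looks something like the sketch below. The hitratio_status name is made up, and 0.3 and 100 are just the rough figures from the paragraph above rather than official thresholds, so tune them for your own system.

```ruby
# Hypothetical helper. min_ratio and min_lookups default to the rough
# figures mentioned in the text; they are not official recommendations.
def hitratio_status(hitratio, lookups, min_ratio = 0.3, min_lookups = 100)
  # No usable number (for example "NaN" before the cache is hit)
  return :unknown unless hitratio.is_a?(Numeric) && !hitratio.to_f.nan?
  # Too few lookups for the ratio to mean anything
  return :ok if lookups < min_lookups
  hitratio >= min_ratio ? :ok : :warning
end
```

So a ratio of 0.1 after 50 lookups is still fine, while the same ratio after 500 lookups is worth a warning.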

Some code

It’s probably worth me sharing with you the class I’m using to access and return the Solr information; that way, if you want to write your own Nagios checks, you can just copy and paste.

Firstly, the class that gets all the Solr information:

# Solr Metric gatherer

require 'rubygems'
require "crack"
require 'open-uri'

class SolrDAO

  # url is the summary report URL; "&wt=json" is appended to it below
  def initialize(url)
    @solr_hash = get_metrics(url)
  end

  # "TX Lag" is reported as text; this returns the MatchData for the leading digits
  def get_lag(index)
    lag = @solr_hash["Summary"][index]["TX Lag"]
    regex = Regexp.new(/\d*/)
    lag_number = regex.match(lag)
    return lag_number
  end

  def get_alfresco_node_in_index(index)
    return @solr_hash["Summary"][index]["Alfresco Nodes in Index"]
  end

  def get_num_docs(index)
    return @solr_hash["Summary"][index]["Searcher"]["numDocs"]
  end

  def get_alfresco_avgTimePerRequest(index)
    return @solr_hash["Summary"][index]["/alfresco"]["avgTimePerRequest"]
  end

  def get_afts_avgTimePerRequest(index)
    return @solr_hash["Summary"][index]["/afts"]["avgTimePerRequest"]
  end

  def get_cmis_avgTimePerRequest(index)
    return @solr_hash["Summary"][index]["/cmis"]["avgTimePerRequest"]
  end

  def get_mean_doc_transformation_time(index)
    return @solr_hash["Summary"][index]["Doc Transformation time (ms)"]["Mean"]
  end

  def get_queryResultCache_lookups(index)
    return @solr_hash["Summary"][index]["/queryResultCache"]["lookups"]
  end

  def get_queryResultCache_hitratio(index)
    return @solr_hash["Summary"][index]["/queryResultCache"]["hitratio"]
  end

  def get_filterCache_lookups(index)
    return @solr_hash["Summary"][index]["/filterCache"]["lookups"]
  end

  def get_filterCache_hitratio(index)
    return @solr_hash["Summary"][index]["/filterCache"]["hitratio"]
  end

  def get_alfrescoPathCache_lookups(index)
    return @solr_hash["Summary"][index]["/alfrescoPathCache"]["lookups"]
  end

  def get_alfrescoPathCache_hitratio(index)
    return @solr_hash["Summary"][index]["/alfrescoPathCache"]["hitratio"]
  end

  def get_alfrescoAuthorityCache_lookups(index)
    return @solr_hash["Summary"][index]["/alfrescoAuthorityCache"]["lookups"]
  end

  def get_alfrescoAuthorityCache_hitratio(index)
    return @solr_hash["Summary"][index]["/alfrescoAuthorityCache"]["hitratio"]
  end

  def get_queryResultCache_warmupTime(index)
    return @solr_hash["Summary"][index]["/queryResultCache"]["warmupTime"]
  end

  def get_filterCache_warmupTime(index)
    return @solr_hash["Summary"][index]["/filterCache"]["warmupTime"]
  end

  def get_alfrescoPathCache_warmupTime(index)
    return @solr_hash["Summary"][index]["/alfrescoPathCache"]["warmupTime"]
  end

  def get_alfrescoAuthorityCache_warmupTime(index)
    return @solr_hash["Summary"][index]["/alfrescoAuthorityCache"]["warmupTime"]
  end

  def get_metrics(url)
    url += "&wt=json"
    # URI.open needs Ruby 2.5+; plain open(url) no longer fetches URLs on Ruby 3.0+
    response = URI.open(url).read
    # Convert to hash
    result_hash = Crack::JSON.parse(response)
    # if the hash has 'Error' as a key, we raise an error
    if result_hash.has_key? 'Error'
      raise "web service error"
    end
    return result_hash
  end

end # End of class

As you can see, it is quite straightforward to extend this if you want to pull back different metrics. At some point I will put this into a GitHub repo for people, or use it in another metrics-based project, but for now just use this.
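If you find yourself adding lots of near-identical getters, one option is a single generic lookup. The sketch below is not part of the class above: summary_metric is a made-up name, it works on a plain hash, and the "alfresco" core name in the sample data is only an assumption for the example.

```ruby
# Hypothetical generic accessor: walks the parsed report by key path and
# returns nil instead of raising when a key along the way is missing.
def summary_metric(solr_hash, index, *path)
  solr_hash.dig("Summary", index, *path)
end

# Sample data shaped like the parsed report (values invented for illustration)
report = { "Summary" => { "alfresco" => { "/filterCache" => { "lookups" => 250 } } } }

found   = summary_metric(report, "alfresco", "/filterCache", "lookups")
missing = summary_metric(report, "alfresco", "/noSuchCache", "lookups")
```

Here found holds the lookups value and missing is nil, which saves a check from dying with a NoMethodError when a section of the report is absent. Hash#dig needs Ruby 2.3 or later.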

Now some of you may not be used to using Ruby, so here is a check that checks the filterCache hit ratio:

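$:.unshift File.expand_path("../", __FILE__)
require 'lib/solr_dao.rb'
# Assumed setup, not part of the original listing: the report URL and core name
# are read from the command line and the thresholds are placeholder values.
solr = SolrDAO.new(ARGV[0])
hitratio = solr.get_filterCache_hitratio(ARGV[1])
lookups = solr.get_filterCache_lookups(ARGV[1])
warning, critical = 0.7, 0.9
# Hit ratio is an inverse, 1.0 is perfect 0.1 is crap, and can be ignored if there are fewer than 100 lookups
inverse = hitratio.is_a?(Numeric) ? 1.0 - hitratio : nil
if (inverse.is_a? Float)
  if ( lookups >= 100 )
    if ( inverse >= warning )
      if (inverse >= critical )
        puts "CRITICAL :: FilterCache hitratio is #{hitratio}|'hitratio'=#{hitratio};#{warning};#{critical};;"
        exit 2
      else
        puts "WARNING :: FilterCache hitratio is #{hitratio}|'hitratio'=#{hitratio};#{warning};#{critical};;"
        exit 1
      end
    else
      puts "OK :: FilterCache hitratio is #{hitratio}|'hitratio'=#{hitratio};#{warning};#{critical};;"
      exit 0
    end
  else
    puts "OK :: FilterCache hitratio is #{hitratio}|'hitratio'=#{hitratio};#{warning};#{critical};;"
    exit 0
  end
else
  puts "UNKNOWN :: FilterCache hitratio is #{hitratio}"
  exit 3
end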

To get this to work, you'll just need to put it with your other Nagios checks and, in the same directory, create a lib directory containing the SolrDAO class from further up, saved as solr_dao.rb. If you need to change its location you will only need to adjust the following:

$:.unshift File.expand_path("../", __FILE__)
require 'lib/solr_dao.rb'

Also, if you wanted to, you could modify the script to take the critical and warning thresholds as parameters so you can easily change them within Nagios.
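One way to do that is with Ruby's standard optparse library. This is a minimal sketch: the -w and -c flag names follow the usual Nagios plugin convention, and parse_thresholds and the default values are placeholders I've made up for the example.

```ruby
require 'optparse'

# Hypothetical helper: reads -w/--warning and -c/--critical from argv.
# The defaults are placeholders, not recommended thresholds.
def parse_thresholds(argv)
  options = { warning: 0.7, critical: 0.9 }
  OptionParser.new do |opts|
    opts.on("-w", "--warning VALUE", Float, "Warning threshold") { |v| options[:warning] = v }
    opts.on("-c", "--critical VALUE", Float, "Critical threshold") { |v| options[:critical] = v }
  end.parse!(argv)
  options
end
```

In the check you would call parse_thresholds(ARGV) near the top and use the returned values in place of hard-coded thresholds, which lets you set different limits per service definition in Nagios.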