Updated Alfresco Solr Checks

As some may know…

A little while back I put up some checks for Alfresco Solr here and wrote a little blog post about them here.

Well, over the last few weeks I have added yet more checks, and I’ve also added some caching of the results. The checks no longer make a separate request to Solr each time; instead they use a locally cached copy of the summary report and fetch a new one once it is 5 minutes old. The reason is that most of the results don’t change that frequently, and Nagios was running each check separately, which meant 20 calls to Solr over a 5 minute period. Each individual check is only verified once every 5 minutes anyway, so now the report is pulled once and every check references that cached copy until it expires.
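The caching idea can be sketched roughly as below. This is a minimal illustration rather than the exact code in the checks: the helper name `cached_report`, the cache file location and the 300 second default are all assumptions made for the example, and the block passed in stands in for the real request to Solr.

```ruby
require 'tmpdir'

# Hypothetical helper: return the cached report text if the cache file is
# younger than max_age seconds; otherwise run the block to fetch a fresh
# report, store it, and return that.
def cached_report(cache_file, max_age = 300)
  if File.exist?(cache_file) && (Time.now - File.mtime(cache_file)) < max_age
    return File.read(cache_file)
  end
  fresh = yield                       # e.g. the HTTP call to the Solr summary
  tmp = "#{cache_file}.#{Process.pid}.tmp"
  File.write(tmp, fresh)              # write to a temp file first...
  File.rename(tmp, cache_file)        # ...then swap it into place
  fresh
end

# Usage: the block here returns a made-up report instead of calling Solr.
cache = File.join(Dir.tmpdir, "solr_summary_example.json")
report = cached_report(cache) { '{"Summary":{}}' }
puts report.length
```

Writing to a temporary file and renaming it means another check reading the cache at the same moment should never see a half-written report.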

In addition to the caching it now has 13 new checks, including cumulative hit ratios, which are typically more relevant than the normal hit ratios as they are calculated over all time (since reboot). And no, I don’t know what period the normal hit ratios are calculated over.

There are also checks for the number of transactions remaining and the number of change sets remaining. Combined with the lag, these give you an indication of how far behind Solr is and how much work it has left to do, which is quite useful.

If you need any help with these, or have a few additional checks that are relevant, let me know; I’m happy to help.

Monitoring Alfresco Solr

It’s not just about numbers

Up until recently, if you wanted to monitor Alfresco’s Solr usage you had to either make a costly call to the stats page or use the summary report, which only really gave you a lag number. Luckily, because Alfresco have extended Solr, they have changed the summary report to provide some really useful information, which can then be tracked via Nagios or whatever your favourite monitoring tool is.

Firstly, it’s worth reading the Wiki as it explains the variables better than I would. It’s also worth mentioning that my preferred way of accessing this page programmatically is via JSON, like so:

 
http://localhost:8080/solr/admin/cores?action=SUMMARY&wt=json

Depending on the JSON parsing library you are using, you can get fatal parsing errors caused by the hit ratio. For what it’s worth, I found Crack to be good: it doesn’t validate the JSON as strictly as the plain json library does, which means you can pull back all the data even if there is a problem with the hit ratios.

On that subject, before the relevant cache has been hit, the hit ratio will display “NaN” (Not a Number); once it has been hit it will display the appropriate number, which I’ll dive into a bit more later.
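If you would rather stick with Ruby’s standard json library, the sketch below shows one way around the problem. It assumes the report emits a bare NaN token, which is what a strict parser rejects; the JSON fragment is made up for illustration and is far smaller than a real report. The `allow_nan` option is part of the standard library’s `JSON.parse`.

```ruby
require 'json'

# Made-up fragment shaped like the summary report; a real one has many more keys.
raw = '{"Summary":{"alfresco":{"/filterCache":{"lookups":0,"hitratio":NaN}}}}'

# Strict parsing rejects the bare NaN token.
begin
  JSON.parse(raw)
  strict_ok = true
rescue JSON::ParserError
  strict_ok = false
end

# allow_nan: true tells the parser to accept NaN, Infinity and -Infinity.
summary = JSON.parse(raw, allow_nan: true)
ratio = summary["Summary"]["alfresco"]["/filterCache"]["hitratio"]

puts "strict parse ok: #{strict_ok}, hit ratio is NaN: #{ratio.nan?}"
```

The value then comes back as a Float, so a check can test it with `nan?` before comparing it against any thresholds.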

So before getting into the nitty-gritty of the service checks, it’s important to have a good understanding of the numbers. Most of them are straightforward; the only ones that confused me were the hit ratios.

The hit ratio is a number between 0 and 1. When it is greater than, say, 0.3 all is well; less than 0.3 and things could be bad. However, when the number of lookups is less than, say, 100, a low hit ratio is to be expected, as the cache is not being hit enough to give a meaningful figure. Other than the hit ratio, the other numbers are pretty straightforward.
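That rule of thumb can be written down as a small helper. This is only a sketch: the method names and the returned symbols are invented for the example, and 100 and 0.3 are the rough figures from above, not official thresholds.

```ruby
# Hypothetical helper: turn whatever the report gave us into a Float,
# falling back to NaN for values such as "NaN" or nil.
def to_ratio(value)
  Float(value)
rescue ArgumentError, TypeError
  Float::NAN
end

# Hypothetical helper: classify a cache hit ratio, ignoring caches that
# have not been looked up often enough for the ratio to mean anything.
def hit_ratio_state(hitratio, lookups, min_lookups = 100, min_ratio = 0.3)
  ratio = to_ratio(hitratio)
  return :too_few_lookups if lookups.to_i < min_lookups
  return :unknown if ratio.nan?
  ratio >= min_ratio ? :ok : :low
end

puts hit_ratio_state(0.85, 5000)   # a busy cache with a healthy ratio
puts hit_ratio_state(0.10, 5000)   # a busy cache with a poor ratio
puts hit_ratio_state(0.00, 12)     # barely used, so the ratio is ignored
```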

Some code

It’s probably worth me sharing the class I’m using to access and return the Solr information; that way, if you want to write your own Nagios checks you can just copy and paste.

Firstly, the class that gets all the Solr information:

#
# Solr Metric gatherer

require 'rubygems'
require "crack"
require 'open-uri'

class SolrDAO

  def initialize (url)
    @solr_hash = get_metrics(url)
  end

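  def get_lag(index)
    # "TX Lag" comes back as text, so pull out the first run of digits
    lag = @solr_hash["Summary"][index]["TX Lag"].to_s
    match = /\d+/.match(lag)
    # Return an Integer rather than a MatchData object (nil if no digits)
    return match ? match[0].to_i : nil
  end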

  def get_alfresco_node_in_index(index)
    return @solr_hash["Summary"][index]["Alfresco Nodes in Index"]
  end
  
  def get_num_docs(index)
    return @solr_hash["Summary"][index]["Searcher"]["numDocs"]
  end
  
  def get_alfresco_avgTimePerRequest(index)
    return @solr_hash["Summary"][index]["/alfresco"]["avgTimePerRequest"]
  end

  def get_afts_avgTimePerRequest(index)
    return @solr_hash["Summary"][index]["/afts"]["avgTimePerRequest"]
  end

  def get_cmis_avgTimePerRequest(index)
    return @solr_hash["Summary"][index]["/cmis"]["avgTimePerRequest"]
  end

  def get_mean_doc_transformation_time(index)
    return @solr_hash["Summary"][index]["Doc Transformation time (ms)"]["Mean"]
  end

  def get_queryResultCache_lookups(index)
    return @solr_hash["Summary"][index]["/queryResultCache"]["lookups"]
  end
  
  def get_queryResultCache_hitratio(index)
    return @solr_hash["Summary"][index]["/queryResultCache"]["hitratio"]
  end
  
  def get_filterCache_lookups(index)
    return @solr_hash["Summary"][index]["/filterCache"]["lookups"]
  end
  
  def get_filterCache_hitratio(index)
    return @solr_hash["Summary"][index]["/filterCache"]["hitratio"]
  end
  
  def get_alfrescoPathCache_lookups(index)
    return @solr_hash["Summary"][index]["/alfrescoPathCache"]["lookups"]
  end
  
  def get_alfrescoPathCache_hitratio(index)
    return @solr_hash["Summary"][index]["/alfrescoPathCache"]["hitratio"]
  end
  
  def get_alfrescoAuthorityCache_lookups(index)
    return @solr_hash["Summary"][index]["/alfrescoAuthorityCache"]["lookups"]
  end
  
  def get_alfrescoAuthorityCache_hitratio(index)
    return @solr_hash["Summary"][index]["/alfrescoAuthorityCache"]["hitratio"]
  end
  
  def get_queryResultCache_warmupTime(index)
    return @solr_hash["Summary"][index]["/queryResultCache"]["warmupTime"]
  end
  
  def get_filterCache_warmupTime(index)
    return @solr_hash["Summary"][index]["/filterCache"]["warmupTime"]
  end
  
  def get_alfrescoPathCache_warmupTime(index)
    return @solr_hash["Summary"][index]["/alfrescoPathCache"]["warmupTime"]
  end
  
  def get_alfrescoAuthorityCache_warmupTime(index)
    return @solr_hash["Summary"][index]["/alfrescoAuthorityCache"]["warmupTime"]
  end
  
  private
  def get_metrics(url)
    url += "&wt=json"
    # URI.open comes from open-uri; plain open(url) no longer fetches URLs on Ruby 3
    response = URI.open(url).read
    # Convert to hash
    result_hash = Crack::JSON.parse(response)
    # if the hash has 'Error' as a key, we raise an error
    if result_hash.has_key? 'Error'
      raise "web service error"
    end
    return result_hash
  end

end # End of class

As you can see, it is quite straightforward to extend this if you want to pull back different metrics. At some point I will put this into a GitHub repo for people, or use it in another metrics-based project, but for now just use this.
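If adding one getter per metric gets tedious, a generic lookup is one possible way to extend it. The sketch below is not part of the class above: `summary_metric` is a made-up name, and the sample hash is invented data in the same shape the getters expect.

```ruby
# Hypothetical generic accessor: follow a path of keys under
# Summary -> index and return nil if any level is missing.
def summary_metric(solr_hash, index, *path)
  solr_hash.dig("Summary", index, *path)
end

# Made-up sample in the same shape the getters above expect.
sample = {
  "Summary" => {
    "alfresco" => {
      "/filterCache" => { "lookups" => 1200, "hitratio" => 0.91 }
    }
  }
}

lookups = summary_metric(sample, "alfresco", "/filterCache", "lookups")
missing = summary_metric(sample, "alfresco", "/noSuchCache", "lookups")
puts "lookups=#{lookups.inspect} missing=#{missing.inspect}"
```

Returning nil for a missing key lets a check report UNKNOWN instead of dying with a NoMethodError when a cache is absent from the report.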

Now some of you may not be used to using Ruby, so here is a check that checks the filterCache hit ratio:

#!/usr/bin/ruby
$:.unshift File.expand_path("../", __FILE__)
require 'lib/solr_dao.rb'
solr_results=SolrDAO.new("http://localhost:8080/solr/admin/cores?action=SUMMARY")
hitratio=solr_results.get_filterCache_hitratio("alfresco").to_f
lookups=solr_results.get_filterCache_lookups("alfresco").to_i
# A hit ratio of 1.0 is perfect and 0.1 is crap, so the thresholds are applied to the inverse (1.0 - hitratio); it can be ignored if there are fewer than 100 lookups
inverse=(1.0-hitratio)
critical=0.8
warning=0.7
if (inverse.is_a? Float)
  if ( lookups >= 100 )
    if ( inverse >= warning )
      if (inverse >= critical )
        puts "CRITICAL :: FilterCache hitratio is #{hitratio}|'hitratio'=#{hitratio};#{warning};#{critical};;"
        exit 2
      else
        puts "WARNING :: FilterCache hitratio is #{hitratio}|'hitratio'=#{hitratio};#{warning};#{critical};;"
        exit 1
      end
    else
      puts "OK :: FilterCache hitratio is #{hitratio}|'hitratio'=#{hitratio};#{warning};#{critical};;"
      exit 0
    end
  else
    puts "OK :: FilterCache hitratio is #{hitratio}|'hitratio'=#{hitratio};#{warning};#{critical};;"
    exit 0
  end
else
  puts "UNKNOWN :: FilterCache hitratio is #{hitratio}"
  exit 3
end

To get this to work, you'll just need to put it with your other Nagios checks and, in the same directory, create a lib directory containing the SolrDAO class from further up, saved as solr_dao.rb. If you need to change its location you will only need to adjust the following:


$:.unshift File.expand_path("../", __FILE__)
require 'lib/solr_dao.rb'

Also, if you wanted to, you could modify the script to take the critical and warning thresholds as params so you can easily change them within Nagios.
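One way to do that is with Ruby’s standard optparse library, sketched below. The method name `parse_thresholds`, the script name in the banner and the `-w`/`-c` flags are assumptions for the example (the flags follow the usual Nagios plugin convention); the defaults are the values hard-coded in the check.

```ruby
require 'optparse'

# Hypothetical option parsing for the check: returns the warning and
# critical thresholds, falling back to the hard-coded 0.7 and 0.8.
def parse_thresholds(argv)
  options = { warning: 0.7, critical: 0.8 }
  parser = OptionParser.new do |opts|
    opts.banner = "Usage: check_solr_filtercache.rb [-w WARNING] [-c CRITICAL]"
    opts.on("-w", "--warning VALUE", Float, "Warning threshold") { |v| options[:warning] = v }
    opts.on("-c", "--critical VALUE", Float, "Critical threshold") { |v| options[:critical] = v }
  end
  parser.parse(argv)   # parse (not parse!) leaves the caller's array untouched
  options
end

# In the real check you would pass ARGV; a fixed list is used here for illustration.
thresholds = parse_thresholds(%w[-w 0.6 -c 0.9])
warning = thresholds[:warning]
critical = thresholds[:critical]
puts "warning=#{warning} critical=#{critical}"
```

The Nagios command definition could then pass the thresholds as arguments, so they can be tuned per host without editing the script.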