It’s not just about numbers
Up until recently, if you wanted to monitor Alfresco’s Solr usage you had to either make a costly call to the stats page or use the summary report, which only really gave you a lag number. Luckily, because Alfresco have extended Solr, they have changed the Summary report to provide some really useful information, which can then be tracked via Nagios or whatever your favourite monitoring tool is.
Firstly, it’s worth reading the Wiki, as it explains the variables better than I would. My preferred way of accessing this page programmatically is via JSON, like so:
http://localhost:8080/solr/admin/cores?action=SUMMARY&wt=json
Depending on the JSON parsing library you are using, you can get fatal parsing errors caused by the hit ratio. For what it’s worth, I found Crack to be good: it doesn’t validate the JSON as heavily as the raw json library does, which means you can pull back all the data even if there is a problem with the hit ratios.
On that subject, before the relevant cache has been hit, the hit ratio will display “NaN” (Not a Number); once it has been hit, it will display the appropriate number, which I’ll dive into a bit more later.
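If you would rather stay with Ruby’s standard json library, a minimal sketch of coping with the NaN values could look like the following. It assumes the response contains bare NaN tokens (which is what would trip a strict parser); the sample string and the helper names are made up for illustration.

```ruby
require 'json'

# Hypothetical sample shaped like a slice of the SUMMARY output; the bare NaN
# token is an assumption about what an unused cache reports.
NAN_SAMPLE = '{"Summary":{"alfresco":{"/filterCache":{"lookups":0,"hitratio":NaN}}}}'

# Parse leniently: allow_nan lets the standard parser accept NaN instead of
# raising JSON::ParserError.
def parse_summary(body)
  JSON.parse(body, allow_nan: true)
end

# Return the hit ratio as a Float, or nil while the cache has not been hit yet
# (covers both a Float NaN and the string "NaN").
def hitratio_or_nil(value)
  ratio = Float(value, exception: false)
  return nil if ratio.nil? || ratio.nan?
  ratio
end
```

A check can then treat a nil hit ratio as “no data yet” rather than as a bad number.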
So before getting into the nitty-gritty of service checks, it’s important to have a good understanding of the numbers. Most of them are straightforward; the only ones that confused me were the hit ratios.
The hit ratio is a number between 0 and 1. When the number is greater than, say, 0.3 all is well; less than 0.3 and things could be bad. However, when the hit count is less than, say, 100, a low hit ratio is to be expected, as the cache is not being hit enough to give a reasonable figure. The other numbers are pretty straightforward.
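That rule of thumb can be sketched as a small function. The 0.3 and 100 figures are the rough values mentioned above, not tuned thresholds, and the function and constant names are invented for this example; treating exactly 0.3 as healthy is also an assumption.

```ruby
# Illustrative thresholds taken from the rule of thumb above.
MIN_LOOKUPS = 100
MIN_HITRATIO = 0.3

# Classify a cache from its hit ratio and lookup count.
def cache_health(hitratio, lookups)
  # Too few lookups: the ratio is not meaningful yet, so do not judge it.
  return :not_enough_data if lookups < MIN_LOOKUPS
  return :unknown if hitratio.nil?
  ratio = hitratio.to_f
  # NaN means the cache has never been hit.
  return :unknown if ratio.nan?
  ratio >= MIN_HITRATIO ? :ok : :low
end
```

For example, `cache_health(0.1, 20)` is not treated as a problem because 20 lookups is too few to judge.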
Some code
It’s probably worth me sharing the class I’m using to fetch and return Solr information; that way, if you want to write your own Nagios checks, you can just copy and paste.
Firstly, the class that gets all the Solr information:
```ruby
#
# Solr Metric gatherer
#
require 'rubygems'
require 'crack'
require 'open-uri'

class SolrDAO
  def initialize(url)
    @solr_hash = get_metrics(url)
  end

  # Pull the first run of digits out of the "TX Lag" value and return it as an Integer
  def get_lag(index)
    lag = @solr_hash["Summary"][index]["TX Lag"]
    return lag.to_s[/\d+/].to_i
  end

  def get_alfresco_node_in_index(index)
    return @solr_hash["Summary"][index]["Alfresco Nodes in Index"]
  end

  def get_num_docs(index)
    return @solr_hash["Summary"][index]["Searcher"]["numDocs"]
  end

  def get_alfresco_avgTimePerRequest(index)
    return @solr_hash["Summary"][index]["/alfresco"]["avgTimePerRequest"]
  end

  def get_afts_avgTimePerRequest(index)
    return @solr_hash["Summary"][index]["/afts"]["avgTimePerRequest"]
  end

  def get_cmis_avgTimePerRequest(index)
    return @solr_hash["Summary"][index]["/cmis"]["avgTimePerRequest"]
  end

  def get_mean_doc_transformation_time(index)
    return @solr_hash["Summary"][index]["Doc Transformation time (ms)"]["Mean"]
  end

  def get_queryResultCache_lookups(index)
    return @solr_hash["Summary"][index]["/queryResultCache"]["lookups"]
  end

  def get_queryResultCache_hitratio(index)
    return @solr_hash["Summary"][index]["/queryResultCache"]["hitratio"]
  end

  def get_filterCache_lookups(index)
    return @solr_hash["Summary"][index]["/filterCache"]["lookups"]
  end

  def get_filterCache_hitratio(index)
    return @solr_hash["Summary"][index]["/filterCache"]["hitratio"]
  end

  def get_alfrescoPathCache_lookups(index)
    return @solr_hash["Summary"][index]["/alfrescoPathCache"]["lookups"]
  end

  def get_alfrescoPathCache_hitratio(index)
    return @solr_hash["Summary"][index]["/alfrescoPathCache"]["hitratio"]
  end

  def get_alfrescoAuthorityCache_lookups(index)
    return @solr_hash["Summary"][index]["/alfrescoAuthorityCache"]["lookups"]
  end

  def get_alfrescoAuthorityCache_hitratio(index)
    return @solr_hash["Summary"][index]["/alfrescoAuthorityCache"]["hitratio"]
  end

  def get_queryResultCache_warmupTime(index)
    return @solr_hash["Summary"][index]["/queryResultCache"]["warmupTime"]
  end

  def get_filterCache_warmupTime(index)
    return @solr_hash["Summary"][index]["/filterCache"]["warmupTime"]
  end

  def get_alfrescoPathCache_warmupTime(index)
    return @solr_hash["Summary"][index]["/alfrescoPathCache"]["warmupTime"]
  end

  def get_alfrescoAuthorityCache_warmupTime(index)
    return @solr_hash["Summary"][index]["/alfrescoAuthorityCache"]["warmupTime"]
  end

  private

  def get_metrics(url)
    url += "&wt=json"
    # URI.open comes from open-uri; a bare open(url) no longer fetches URLs on Ruby 3.0+
    response = URI.open(url).read
    # Convert to hash
    result_hash = Crack::JSON.parse(response)
    # if the hash has 'Error' as a key, we raise an error
    if result_hash.has_key? 'Error'
      raise "web service error"
    end
    return result_hash
  end
end # End of class
```
As you can see, it is quite straightforward to extend this if you want to pull back different metrics. At some point I will put this into a GitHub repo for people, or use it in another metrics-based project, but for now just use this.
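If writing one getter per metric gets tedious, one possible alternative (not part of the class above) is a generic lookup that follows a list of keys. The helper name and the sample hash below are hypothetical; only the key names come from the getters above.

```ruby
# Look up any value under "Summary" -> index by following a list of keys.
# Returns nil instead of raising when a key along the way is missing.
def summary_metric(solr_hash, index, *path)
  solr_hash.dig("Summary", index, *path)
end

# Hypothetical, trimmed-down hash in the shape the getters above expect.
SAMPLE_SUMMARY = {
  "Summary" => {
    "alfresco" => {
      "Searcher" => { "numDocs" => 1200 },
      "/filterCache" => { "lookups" => 340 }
    }
  }
}

num_docs = summary_metric(SAMPLE_SUMMARY, "alfresco", "Searcher", "numDocs")
```

Inside the class, the same idea would be a method that calls `@solr_hash.dig` in the same way.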
Now, some of you may not be used to using Ruby, so here is a check that checks the filterCache hit ratio:
```ruby
#!/usr/bin/ruby
$:.unshift File.expand_path("../", __FILE__)
require 'lib/solr_dao.rb'

solr_results = SolrDAO.new("http://localhost:8080/solr/admin/cores?action=SUMMARY")
hitratio = solr_results.get_filterCache_hitratio("alfresco").to_f
lookups = solr_results.get_filterCache_lookups("alfresco").to_i

# The check works on the inverse of the hit ratio: a hit ratio of 1.0 is perfect,
# 0.1 is crap, and it can be ignored if there are fewer than 100 lookups
inverse = (1.0 - hitratio)
critical = 0.8
warning = 0.7

if (inverse.is_a? Float)
  if (lookups >= 100)
    if (inverse >= warning)
      if (inverse >= critical)
        puts "CRITICAL :: FilterCache hitratio is #{hitratio}|'hitratio'=#{hitratio};#{warning};#{critical};;"
        exit 2
      else
        puts "WARNING :: FilterCache hitratio is #{hitratio}|'hitratio'=#{hitratio};#{warning};#{critical};;"
        exit 1
      end
    else
      puts "OK :: FilterCache hitratio is #{hitratio}|'hitratio'=#{hitratio};#{warning};#{critical};;"
      exit 0
    end
  else
    # Too few lookups for the ratio to mean anything, so report OK
    puts "OK :: FilterCache hitratio is #{hitratio}|'hitratio'=#{hitratio};#{warning};#{critical};;"
    exit 0
  end
else
  puts "UNKNOWN :: FilterCache hitratio is #{hitratio}"
  exit 3
end
```
To get this to work, you'll just need to put it with your other Nagios checks and, in the same directory, create a lib directory containing the SolrDAO class from further up, saved as solr_dao.rb. If you need to change its location, you will only need to adjust the following:
```ruby
$:.unshift File.expand_path("../", __FILE__)
require 'lib/solr_dao.rb'
```
Also, if you wanted to, you could modify the script to take the critical and warning thresholds as parameters so you can easily change them within Nagios.
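A minimal sketch of that change, using Ruby’s standard optparse library. The `-w`/`-c` flag names, the helper name and the script name in the banner are my own choices for illustration; the defaults are the 0.7 and 0.8 values used in the check above.

```ruby
require 'optparse'

# Parse -w/-c from an argument list, falling back to the defaults
# used in the check above. Values are inverse hit ratios.
def parse_thresholds(argv)
  options = { warning: 0.7, critical: 0.8 }
  OptionParser.new do |opts|
    # Hypothetical script name, for the usage message only.
    opts.banner = "Usage: check_solr_filtercache.rb [-w WARNING] [-c CRITICAL]"
    opts.on("-w", "--warning VALUE", Float, "Inverse hit ratio that triggers WARNING") do |value|
      options[:warning] = value
    end
    opts.on("-c", "--critical VALUE", Float, "Inverse hit ratio that triggers CRITICAL") do |value|
      options[:critical] = value
    end
  end.parse(argv)
  options
end

# In the check itself this would replace the hard-coded values:
#   thresholds = parse_thresholds(ARGV)
#   warning    = thresholds[:warning]
#   critical   = thresholds[:critical]
```

The Nagios command definition can then pass the thresholds per service instead of editing the script.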
Excellent post. It looks like the code can’t be copied and pasted correctly because some of the characters have been changed. I would like to integrate it with the nagios4alfresco plugin.
Hi Toni, Thanks for the kind words, Sorry it’s a bit awkward to copy, how about having the entire git repo instead: https://github.com/soimafreak/ASM :) It’s not perfect code but any contributions welcome!
Great, thanks! I will take a look into it.