Sentinel update

Many moons ago…

A while back I started to mention the idea of Self healing systems a dedicated system that makes use of monitoring and real time system information to make intelligent decisions about what to do, i.e. I write a complicated program to gradually replace my self. It was suggested about using hooks in Nagios to do the tasks but that misses the intelligence side of what I’m trying to get to, restarting based on Nagios checks is simply an if statement that on a certain condition does something, Sentinel will be more that that.

Back in April I started Sentinel as an open source project As expected the uptake has been phenomenal! absolutely no one has even looked at it :) Either way I am not deterred. I have been on and off re-factoring Sentinel into something a bit more logical Here and I have gone from 3 files to some 13! from 1411 words to 2906 and I even have one fully working unit test! I don’t think I’ll be writing more as at the moment it is not really helping me get to where I want to be quickly but I know I’ll need them at some point!

So far all I have done is split out some of the code to give it structure and added the odd bit here and there. The next thing I need to start doing is to make it better, there’s a number of options:

  • Writing more providers for it so it can start to manage disks, memory etc etc so it’s a bit more useful
  • Sorting out the structure of the code adding in more error handling / logging and resilience
  • Integration with Nagios or some tool that already monitors system health and use that to base actions off of
  • Daemonize Sentinel so it runs like a real program!
  • Configuration file rather than CLI

What to do

I think for me I’d rather sort out the structure of the code and improve what is already there first, I’m in no rush with this so the least I could do is make what I have less hacky. This also gives me the opportunity to start working out how I’d rather have the whole thing structured.

I did look at writing a plugin framework so it would be possible to just drop in a module or something similar and it would have the correct information about how to manage what ever it was written to do, but I figured that was a bit beyond me at this time and I have better things to do!

After that I think the configuration file and daemonizing the application, the main reason for this will be to identify any issues with it running continually any issue here would be nice to know sooner rather than later.

This then leaves more providers and nagios type integration which i’m sure will be fun.

Give it AI!

Once those items are done this leaves sentinel with one more thing to do, start intelligently working out solutions to problems, obviously I don’t know the right way to tackle this however I do have a few ideas though.

In my head… I think how I would solve an issue and inevitably it starts with gathering information about the system, but how do you know what information is relavent to which problems and how much weighting should it have? well for starters I figure each provider would return a score about how healthy it thinks it is. So for example:

A provider for checking the site is available notices that it’s not available; this produces a score that is very high say 10000. It then makes sure it’s got the latest information from all providers on the server. One of those providers is disk which notices one of the volumes is 67% full but the thresholds have been set to warn at 70 and 95 % so it sets a score of say 250 and is ranked in a list somewhere to come back to if all else fails.

At this point it is unlikely that disk is the culprit, we have to assume that whomever set the thresholds knew something about the system, so more information is needed, it checks the local network and gets back a score of 0 as far as the network provider can tell it’s working fine it can get to localhost, the gateway another gateway on the internet. A good test at this point is to try and work out which layer of the OSI model the issue is, so one of the actions might be to connect to port 80 or 443 or both and see what happens, is there a web response? or not, if there is does it have any words in it or a response code that suggests it’s a known web error like a 500 or does it not get connected.

And so on and so forth, this would mean that where ever this logic exists it has to make associations betten results and the following actions. one of the ways to do this is to “tag” a provider with potential subsystems that could affect it then based on the score of each of the subsystems produce a vector of potential areas to check, combined with the score it’s possible to travel the vector and work out how likely each is to fix the issue, as and when each one produces a result it either dives in a new vector either more detailed or not. It would then, in theory be possible to start making correlations between these subsystems, so say the web one requires disk and networking to be available and both the networking and disk require CPU then it can assume that web one needs that and base don how many of these connections exist it can rank it higher or lower much in the same way a search engine would work.

But all of this is for another day, today is just about saying it’s started and I hope to continue on it this year.

Sentinel – An open source start

An open source start

Last week I introduced a concept of Self Healing Systems Which then lead me on to have a tiny tony bit of a think and I decided that I would write one, the decision took all of 5 mins but it gives me an excuse to do something a bit more complex than your every day script.

I created a very simple website here which outlines my goals, as of writing I have got most of the features coded up for the MVP, and I do need to finish it off which will hopefully be soon, which will hopefully be by the time this is published, but let’s see.

I decided to take this on as a project for a number of reasons:

  1. More ruby programming experience
  2. Other than Monit there doesn’t seem to be any other tools, and I had to be told about this one…
  3. It’s a project with just the right amount of programming challenge for me
  4. I like making things work
  5. It may stop me getting called out as often at work if it does what it’s meant to

So a number of reasons, and I’ve already come across a number of things that I don’t know how to solve or what the right way of doing it is. Which is good I get to do a bit of googling and work out a reasonable way, but to be honest that is not going to be enough in the long run. hopefully as time goes on my programming experience will increase sufficiently that I can make improvements to the code as time goes by.

Why continue if there’s products out there that do the same thing?

Why not? Quite often there’s someone doing the same thing even if you can’t find evidence of it, competition should not be a barrier to entry, especially as people like choice.

I guess the important thing is that it becomes usefully different, Take a look at systems management tools, a personal favourite of mine, you have things like RHN Satellite, Puppet and Chef 3 tools, 1 very different from the other two and another only slightly different. People like choice, different tools work differently for different people.

I guess what I mean by that is that some people strike an accord with one or another application and as a result become FanBoys, normally for no good reason.

There’s also the other side of it, I’ve not used monit, I probably should, I probably won’t; but it doesn’t sound like where I want to go with Sentinel. Quite simply I want to replace junior systems administrators, I don’t want another tool to be used, I want a tool that can provide real benefit by doing the checks on the system, by making logical deterministic decisions based on logic and raw data, and not just by looking at the systems it’s on but by considering the whole environment in which it is part of. I think that is a relatively ambitious goal, but I also think it is a useful one, and hopefully it will get to a point where the application is more useful than the MVP product and it can do more than just look after one system.

Like any good open source product it will probably stay at version 0.X for a long time until it has a reasonable set of feature sin it that make it more than just a simple ruby programme.

A call for help

So I’ve started on this path, I intend to continue regardless at the moment and one thing that will help keep me focused is user participation either through using the script and logging bugs at the github site it’s hosted on.

I think at the moment what I need is some guidance on the programming of the project, it’s clear to me that in a matter of months if not weeks this single file application will become overly complicated to maintain and would benefit from being split out into classes. Although I know that, I do not know the right way of doing it I don’t have any experience of larger applications and the right way to do it so if anyone knows that would be good!

In addition to the architecture of the application there is just some programming issues which I’m sure I can overcome at some point but I will probably achieve the solution by having a punt and seeing what sticks. There’s a wonderful switch in the code for processor states which needs to change. I need to iterate through each character of the state and report back on it’s status where as at the moment it is just looking for a combination. To start with I took the pragmatic option, Add all of the processor states mys system has to the witch and hope that’s enough.

So if anyone feels like contributing, or can even see a simple way of fixing some dodgy coding, I’d appreciate it, I guess the only thing I ask is if you are making changes, See the README, Log a ticket in github and commit the changes with reference to the ticket so I know what’s happened and why.

So please, please, please get involved with Sentinel

Self healing systems

An odd beginning

So I’m writing this having just spent the last 10 days on my back in pain and finally starting to feel better, it’s not come at a good time as another member of the same team as me decided they had a “better opportunity” This is the second person to have left this organisation without as much as a passing comment to myself that they were even leaving, how strage; but I digress.

Either way it opens up a void, a team of 2 and a manager now down to a team of one, with the one having back pain that could at any moment take me out of action. Unfortunately up to the day before I was not able to make it to work the system we look after has been surprisingly stable, rock like in-fact; as soon as I say “I’m not going to make it in” the system starts having application issues (JVM crashes).

Obviously the cause needs a bit of looking into and a proper fix etc etc, but in the mean time what do we do? I had an idea, A crazy idea which I don’t think is a fix to any problems but it at least a starting point.

Sentinel

I have spent a bit of time exploring Ruby a few weeks back so I started to look at ways of writing something that would do a simple check; is process X running? In the simple version I wrote it just checked that tomcat was running more than one instance (our application always runs 2) if it was 2, do nothing, if it was more than 2 do nothing (something else crazy has happened so it just logs to that affect) but if it was less than 2 it would try a graceful-ish restart of the service.

So this obviously works in the one specific case that we have, but isn’t extensible and it doesn’t do any better checks, which all got me thinking. Why isn’t there something to do this for us? I don’t know of anything that does this, if anyone does I’d appreciate knowing, there’s a number tools that could be muddled together to do the same sort of function.

Nagios would monitor the system, cucumber could monitor the application interactions, Swatch could monitor the logs, but in most cases these are monitoring, I’m sure there’s ways to get them to carry out actions based on certain outcomes but why use so many tools.

Yes, the individual tools probably do the job better than a single tool, but as a sysadmin, I’d rather have one tool to do everything but that isn’t practical either. So can we some how get the benefits of monitoring with nagios but have a tool that is specifically monitoring the application performance nagios is gathering information about and then making decisions based on that?

The big Idea

So I wonder if it’d be possible to write a simple ruby application that every now and then did a number of actions:

  1. Check the service on the box, right number of processes, not zombied etc, etc
  2. Check the disk capacities
  3. Check the CPU utilisation
  4. Check the memory utilisation
  5. Probe the application from the local box, a loopback test of sorts
  6. Integrate with nagios or another monitoring tool to validate the state it thinks the box is in compared witht he locally gathered stats
  7. Depending on the outcome of all the checks carry out a number of actions
  8. Hooks int ticketing systems

When I was thinking this through the other day, it seemed like a good idea, the biggest issue I have is not being a programmer, So I have a steep learning curve, it’s a complicated application, so requires some thought. I would also probably have to ignore everyone that thinks it is a waste of time, which isn’t too hard to do.

I guess what I’m thinking of is something like FBAR. As a system scales up to hundreds of servers the up time and reliability becomes more important, it is sometimes necessary to take a short term view to keep a system working. The most important thing is that those short term views are totaled up and then logged as tickets, 1% of your severs crashing and needing a restart isn’t an issue, but if that 1% becomes 5% and then 10% it’s panic stations!

Summary

I think my mind is made up, a sentinel is needed to keep watch over a solution, and what’s crazy is that the more I think of it the more useful it seems and the more complicated it seems to become. As such I think I’m going to need help!