An odd beginning
I’m writing this having just spent the last 10 days on my back in pain, and I’m finally starting to feel better. It has not come at a good time, as another member of my team decided they had a “better opportunity”. This is the second person to have left this organisation without so much as a passing comment to me that they were even leaving. How strange; but I digress.
Either way it opens up a void: a team of two and a manager is now down to a team of one, and that one has back pain that could take me out of action at any moment. Unfortunately, right up to the day before I was unable to make it in to work, the system we look after had been surprisingly stable, rock-like in fact; as soon as I say “I’m not going to make it in”, the system starts having application issues (JVM crashes).
Obviously the cause needs looking into and a proper fix, but in the meantime what do we do? I had an idea, a crazy idea, which I don’t think is a fix for any of the problems, but it is at least a starting point.
I spent a bit of time exploring Ruby a few weeks back, so I started to look at ways of writing something that would do a simple check: is process X running? The simple version I wrote just checked how many instances of Tomcat were running (our application always runs 2). If it was 2, do nothing; if it was more than 2, do nothing (something else crazy has happened, so it just logs to that effect); but if it was fewer than 2, it would try a graceful-ish restart of the service.
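To make the three cases concrete, here is a minimal sketch of that kind of check. It is not the script I actually wrote: the names `decide_action`, `count_processes` and `check_once` are made up for illustration, counting with `pgrep -f` is an assumption about how you might find the processes, and the restart itself is passed in as a block of code because the real restart command depends on the box.

```ruby
# Hypothetical sketch; every name here is illustrative, not the original script.
EXPECTED_INSTANCES = 2 # our application always runs 2 Tomcat instances

# Pure decision: compare the running count with the expected count.
def decide_action(running, expected = EXPECTED_INSTANCES)
  if running < expected
    :restart       # something has died, try a graceful-ish restart
  elsif running > expected
    :log_anomaly   # something else crazy has happened, only log it
  else
    :none          # all is well
  end
end

# Count processes whose command line matches the pattern.
# Assumes pgrep is installed; it prints one PID per matching process.
def count_processes(pattern)
  IO.popen(["pgrep", "-f", pattern], &:read).lines.count
end

# Act on one observation. The restart is injected as a callable so this
# method never has to know the (site-specific) restart command.
def check_once(running, restart:, log: $stdout)
  action = decide_action(running)
  case action
  when :restart
    log.puts "only #{running} instance(s) running, attempting restart"
    restart.call
  when :log_anomaly
    log.puts "#{running} instances running, expected #{EXPECTED_INSTANCES}; leaving alone"
  end
  action
end
```

In use it would be called as something like `check_once(count_processes("tomcat"), restart: -> { ... })`, with the body of the lambda being whatever restarts the service on your system. Keeping the decision separate from the counting and the restarting means the logic can be tested without touching a live box.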
This obviously works in the one specific case that we have, but it isn’t extensible and it doesn’t do any better checks, which got me thinking: why isn’t there something to do this for us? I don’t know of anything that does (if anyone does, I’d appreciate knowing), although there are a number of tools that could be muddled together to do the same sort of job.
Nagios would monitor the system, Cucumber could monitor the application interactions and Swatch could monitor the logs, but in most cases these are only monitoring. I’m sure there are ways to get them to carry out actions based on certain outcomes, but why use so many tools?
Yes, the individual tools probably do each job better than a single tool would, and as a sysadmin I’d rather have one tool to do everything, but that isn’t practical either. So can we somehow get the benefits of monitoring with Nagios, but have a tool that specifically watches the application performance Nagios is gathering information about and then makes decisions based on it?
The big Idea
So I wonder if it’d be possible to write a simple Ruby application that every now and then carried out a number of actions:
- Check the service on the box: right number of processes, not zombied, etc.
- Check the disk capacities
- Check the CPU utilisation
- Check the memory utilisation
- Probe the application from the local box, a loopback test of sorts
- Integrate with Nagios or another monitoring tool to validate the state it thinks the box is in against the locally gathered stats
- Depending on the outcome of all the checks carry out a number of actions
- Hooks into ticketing systems
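As a rough idea of how the list above might hang together, here is a sketch of a tiny check runner. Everything in it is an assumption for illustration: the `Check` struct, the 80% and 95% thresholds, and the idea that each check boils down to `:ok`, `:warn` or `:critical`. The only real-world format it relies on is the POSIX `df -P` output, where the fifth column is the capacity used.

```ruby
# Hypothetical sketch of a check runner; names and thresholds are assumptions.

# Severity order, used to pick the worst result across all checks.
SEVERITY = { ok: 0, warn: 1, critical: 2 }.freeze

# A check is a name plus a callable that returns :ok, :warn or :critical.
Check = Struct.new(:name, :probe)

# Map a percentage (disk, CPU, memory) onto a status.
def status_for(percent, warn: 80, critical: 95)
  return :critical if percent >= critical
  return :warn if percent >= warn
  :ok
end

# Pull the "Capacity" column out of `df -P <path>` output.
# Assumes a single filesystem line with no spaces in the device name.
def disk_used_percent(df_output)
  df_output.lines[1].split[4].to_i
end

# Run every check. A probe that raises is recorded as :critical
# rather than taking the whole run down with it.
def run_checks(checks)
  checks.each_with_object({}) do |check, results|
    results[check.name] = begin
      check.probe.call
    rescue StandardError
      :critical
    end
  end
end

# The overall state of the box is the worst individual result.
def worst_status(results)
  results.values.max_by { |status| SEVERITY.fetch(status) } || :ok
end
```

A disk check could then be built as `Check.new("disk /", -> { status_for(disk_used_percent(IO.popen(["df", "-P", "/"], &:read))) })`, with CPU, memory and the loopback probe added as further `Check` entries. Deciding what to do with the worst status (restart, raise a ticket, compare with what Nagios thinks) would sit on top of this.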
When I was thinking this through the other day it seemed like a good idea. The biggest issue I have is that I’m not a programmer, so I have a steep learning curve, and it’s a complicated application that requires some thought. I would also probably have to ignore everyone who thinks it is a waste of time, which isn’t too hard to do.
I guess what I’m thinking of is something like FBAR. As a system scales up to hundreds of servers, uptime and reliability become more important, and it is sometimes necessary to take a short-term view to keep a system working. The most important thing is that those short-term fixes are totalled up and then logged as tickets: 1% of your servers crashing and needing a restart isn’t an issue, but if that 1% becomes 5% and then 10%, it’s panic stations!
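The totalling-up part can be sketched in a few lines. This is only an illustration: `escalation_level` is a made-up name, and the 5% and 10% cut-offs are simply the figures from the paragraph above treated as thresholds, not anything taken from FBAR.

```ruby
# Hypothetical sketch; the thresholds are the 5% and 10% figures used above.
WARN_PERCENT  = 5
PANIC_PERCENT = 10

# Given how many servers needed an automatic restart out of the total,
# decide how loudly to shout. Integer arithmetic avoids rounding surprises.
def escalation_level(restarted, total)
  raise ArgumentError, "total must be positive" unless total.positive?

  if restarted * 100 >= total * PANIC_PERCENT
    :panic    # 10% or more: panic stations
  elsif restarted * 100 >= total * WARN_PERCENT
    :warn     # 5% or more: this is becoming a trend
  elsif restarted.positive?
    :ticket   # a few restarts: log a ticket and carry on
  else
    :none     # nothing restarted, nothing to report
  end
end
```

With 200 servers, for example, 2 restarts would only raise a ticket, 10 would be a warning, and 20 would be panic stations.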
I think my mind is made up: a sentinel is needed to keep watch over a solution. What’s crazy is that the more I think about it, the more useful it seems and the more complicated it becomes. As such, I think I’m going to need help!