An odd beginning
So I’m writing this having just spent the last 10 days on my back in pain, and I’m finally starting to feel better. It’s not come at a good time, as another member of my team decided they had a “better opportunity”. This is the second person to have left this organisation without so much as a passing comment to me that they were even leaving. How strange; but I digress.
Either way it opens up a void: a team of 2 and a manager is now down to a team of one, with the one having back pain that could at any moment take me out of action. Unfortunately, up to the day before I was unable to make it to work, the system we look after had been surprisingly stable, rock-like in fact; as soon as I say “I’m not going to make it in”, the system starts having application issues (JVM crashes).
Obviously the cause needs a bit of looking into and a proper fix etc. etc., but in the meantime what do we do? I had an idea, a crazy idea, which I don’t think is a fix to any problems but is at least a starting point.
I spent a bit of time exploring Ruby a few weeks back, so I started to look at ways of writing something that would do a simple check: is process X running? The simple version I wrote just checked how many instances of tomcat were running (our application always runs 2). If it was 2, do nothing; if it was more than 2, do nothing (something else crazy has happened, so it just logs to that effect); but if it was less than 2, it would try a graceful-ish restart of the service.
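To make that decision logic concrete, here is a rough sketch of it. It is in Python rather than the Ruby I actually used, and it is not the original script: the function names, the "tomcat" pattern and the use of pgrep to count processes are all assumptions for illustration.

```python
import subprocess

EXPECTED_INSTANCES = 2  # the application described above always runs 2 instances


def count_processes(pattern: str) -> int:
    """Count processes whose command line matches `pattern` (assumes pgrep is installed)."""
    result = subprocess.run(
        ["pgrep", "-f", "-c", pattern], capture_output=True, text=True
    )
    # pgrep -c prints the number of matches, including 0 when nothing matches
    return int(result.stdout.strip() or 0)


def decide_action(running: int, expected: int = EXPECTED_INSTANCES) -> str:
    """Map the observed process count to one of 'ok', 'log' or 'restart'."""
    if running == expected:
        return "ok"        # exactly what we expect: do nothing
    if running > expected:
        return "log"       # something odd has happened: only log it
    return "restart"       # too few instances: try a graceful-ish restart


if __name__ == "__main__":
    # Hypothetical pattern; the actual restart command is deliberately left out.
    print(decide_action(count_processes("tomcat")))
```

Keeping the decision separate from the process counting means the logic can be checked without touching a live box.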
So this obviously works in the one specific case that we have, but it isn’t extensible and it doesn’t do any better checks, which all got me thinking. Why isn’t there something to do this for us? I don’t know of anything that does this (if anyone does, I’d appreciate knowing), though there are a number of tools that could be muddled together to do the same sort of function.
Nagios would monitor the system, Cucumber could monitor the application interactions and Swatch could monitor the logs, but in most cases these are only monitoring. I’m sure there are ways to get them to carry out actions based on certain outcomes, but why use so many tools?
Yes, the individual tools probably do the job better than a single tool. As a sysadmin I’d rather have one tool to do everything, but that isn’t practical either. So can we somehow get the benefits of monitoring with Nagios, but have a tool that specifically watches the application performance information Nagios is gathering and then makes decisions based on it?
The big Idea
So I wonder if it’d be possible to write a simple Ruby application that every now and then carried out a number of actions:
- Check the service on the box, right number of processes, not zombied etc, etc
- Check the disk capacities
- Check the CPU utilisation
- Check the memory utilisation
- Probe the application from the local box, a loopback test of sorts
- Integrate with Nagios or another monitoring tool to validate the state it thinks the box is in compared with the locally gathered stats
- Depending on the outcome of all the checks carry out a number of actions
- Hooks into ticketing systems
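The list above could be sketched roughly as follows. Again this is Python rather than Ruby, and only an illustration: CheckResult, check_disk, check_load and run_checks are made-up names, the thresholds are arbitrary, and the memory, loopback, Nagios and ticketing pieces are left out.

```python
import os
import shutil
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CheckResult:
    name: str
    value: float
    threshold: float

    @property
    def healthy(self) -> bool:
        return self.value < self.threshold


def check_disk(path: str = "/", threshold: float = 80.0) -> CheckResult:
    """Disk capacity check: percentage of the filesystem at `path` in use."""
    usage = shutil.disk_usage(path)
    return CheckResult("disk", 100.0 * usage.used / usage.total, threshold)


def check_load(threshold: float = 4.0) -> CheckResult:
    """CPU check using the 1-minute load average (Unix only)."""
    return CheckResult("load", os.getloadavg()[0], threshold)


def run_checks(
    checks: List[Callable[[], CheckResult]],
    actions: Dict[str, Callable[[CheckResult], None]],
) -> List[str]:
    """Run every check; for each unhealthy result call the matching action.

    Returns the names of the checks that triggered an action.
    """
    triggered = []
    for check in checks:
        result = check()
        if not result.healthy and result.name in actions:
            actions[result.name](result)
            triggered.append(result.name)
    return triggered
```

Something would then call run_checks([check_disk, check_load], actions) every so often, with the actions dictionary holding whatever clean-up or ticket-logging functions make sense for the box.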
When I was thinking this through the other day it seemed like a good idea. The biggest issue I have is not being a programmer, so I have a steep learning curve; it’s a complicated application, so it requires some thought. I would also probably have to ignore everyone that thinks it is a waste of time, which isn’t too hard to do.
I guess what I’m thinking of is something like FBAR. As a system scales up to hundreds of servers, uptime and reliability become more important, and it is sometimes necessary to take a short-term view to keep a system working. The most important thing is that those short-term fixes are totalled up and then logged as tickets: 1% of your servers crashing and needing a restart isn’t an issue, but if that 1% becomes 5% and then 10% it’s panic stations!
I think my mind is made up, a sentinel is needed to keep watch over a solution, and what’s crazy is that the more I think of it the more useful it seems and the more complicated it seems to become. As such I think I’m going to need help!
Why do you need a self-healing system if you have tools like Puppet & MCollective in your arsenal?
A couple of points: Puppet is a configuration management tool and is great for getting a system back to a known state. MCollective is great for orchestration, but it is not a monitoring tool; it requires your input to make it do anything. You could argue that a bad server should be nuked and rebuilt, and to be honest I’d expect the self healing system to make that call if it wasn’t able to fix it.
Let’s look at a real-world example: for some reason /var fills up to 100%. There may or may not have been alerts, which may have been totally ignored. Either way, the server is now not quite working the way it should.
You would log on, use du and df to identify where the problem is and, once you find the culprit(s), make a call on whether to delete them or leave them; maybe it’s your MySQL DB. If you can’t delete them you’d probably hunt for other low-hanging fruit like compressed archives or old ISOs, whatever you have on the /var partition that you don’t need; probably old logs. You’d then remove them. In total the elapsed time on something like this is probably 15 – 30 mins if you were at your desk and ready to go.
In a self healing system it would have picked up the disk utilisation at 70 or 80%, wherever the threshold was set, and carried out the same actions you’d have done manually: remove any large files, tidy up logs etc., all based on a set of includes / excludes.
Time to fix in this case would be 2-3 seconds from the point it realised it was over-utilised. This is all well and good, but because it healed itself it needs to log you a ticket to tell you it did so; that way, when you come in the next day, you can make a call on whether it was a one-off or whether it would recur.
Hopefully that makes sense.
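As a very rough sketch of the /var example (in Python, with made-up names; heal_disk, the patterns and the shape of the "ticket" dictionary are all assumptions, and a real version would hook into an actual ticketing system):

```python
import fnmatch
import os
import shutil


def over_threshold(path, threshold_pct=80.0):
    """True when the filesystem holding `path` is at or above the threshold."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total >= threshold_pct


def cleanup_candidates(root, includes, excludes):
    """Files under `root` whose names match an include pattern and no exclude pattern."""
    matches = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            included = any(fnmatch.fnmatch(name, p) for p in includes)
            excluded = any(fnmatch.fnmatch(name, p) for p in excludes)
            if included and not excluded:
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


def heal_disk(root, includes, excludes, threshold_pct=80.0, dry_run=True):
    """Clean up when over the threshold; return ticket details, or None if healthy."""
    if not over_threshold(root, threshold_pct):
        return None
    targets = cleanup_candidates(root, includes, excludes)
    if not dry_run:
        for path in targets:
            os.remove(path)
    # The ticket is what lets a human decide the next day whether it was a one-off.
    return {
        "summary": f"Disk usage on {root} reached {threshold_pct}%",
        "dry_run": dry_run,
        "removed": targets,
    }
```

It defaults to a dry run on purpose: something like heal_disk("/var/log", ["*.gz"], ["mysql*"]) only reports what it would remove until you trust the includes / excludes.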
Thank you for your detailed reply!
We are currently checking http://rundeck.org/ as a platform for building our self healing system on top of it.
What do you think? Can it provide a good solution (by plugging it into Nagios as the alerting source)?
Rundeck looks pretty good, but I’d be concerned about the limitations it may place on you in terms of what you can do with the workflows and the actions. It seems you can do quite a lot as long as you invest the time in the scripts it is executing. It doesn’t look like you can pass details from one node to another though.
So for example you might want to find out how many web nodes you have and if you have more than 1 do action A on the tomcat servers but if you only have 1 you might want to do action B.
So with something like MCollective you are able to get those details because you’re running it on a device that can run multiple queries and you can then wrap the workflow within another workflow.
I wouldn’t recommend MCollective though, as it requires some good Ruby to make reasonable use of the system.
From a “Self healing systems” point of view the aim is to get to a point where you set up a web node and define parameters for normal and abnormal operation and, if it’s abnormal, a set of tasks it needs to do to manage itself. As time goes on and you have a backend server that is set up similarly, one of the actions under abnormal load may be to trigger a new box to be created. Part of a new box starting up may be to poke the web node, which in itself may then trigger the web nodes to create a new set of config and tell the backend to pause while it sorts itself out.
It’s hard to get across just the level of automation that is required for this. You can certainly go a long way with MCollective and fancy shell scripts, but at the end of the day they always end up lacking a full understanding of the environment due to a lack of intercommunication, and no one really thinks about message buses or inter-script communication when they create them.
But from a “Self healing” point of view my aim
Why not use Nagios event handlers?
I’ve never used the Nagios event handlers, I’m sure they’re good for actions on the localhost side of things, but how do they cope with communicating across a network and checking the health of another 2 or 3 nodes before taking an action?
I want to try and get the self healing system to a point where it will sort out its own disk issues or application issues etc., and if it is too far gone but there are other healthy servers, it would build a replacement and terminate itself once the replacement is suitable.
In my opinion, this kind of thing has been long overdue. A lot of my work is in high assurance systems. Most mainframe vendors and big NIX vendors like Sun/SGI had lots of RAS features built into their products. Some would be prohibitive for COTS hardware, yet others were simple functions to do mostly what you said. I think some of this should have been built into the Linux server distros by now.
You mentioned tools that already monitor certain things. This made me think of two things. The UNIX philosophy would say we would have a different (tiny) app for each category of issue or OSI layer. The main system monitoring program, which might be a script, would run each at certain intervals to produce summary data, which would be compared to thresholds (a rule engine could work here). If a problem is found, it could give that info to another program that notifies the admin or yet another that fixes the problem. As these jobs all have very separate, isolated functions I see no reason that we need one dedicated monolithic program. This is also a case where reuse is almost a certainty b/c there are many reasons to collect data on one’s system or automate administrative actions.
The other thing I thought about was recovery-oriented architecture. Old school assurance was prevention. These days, recovery is in. The idea is to constantly be restarting nodes to a clean state and design your app/network in a way to allow this without problems. This has a few benefits. First, issues building up in background (or that already happened) will be cleared out. Second, you have high availability built in. Third, and very important, your availability *mechanism* is regularly tested and shown to work. Fourth, updates are easier.
A common set up is a high-uptime load balancer serving 2 to 3 nodes with one getting restarted every 5min to 1hr depending on one’s needs. In my setups, I’ve tried to apply this to every level: servers, firewalls, network switches, ISP connections (sometimes), etc. I recently read about Netflix’s Simian Army and it seems they do about the same thing. I’d like to offer more specifics but we use very different platforms so I hope some of these general ideas help.
That is a good comment! One of the reasons I started Sentinel was that I felt there wasn’t enough in terms of monitoring a process in real time, so at least once every minute that service should be checked, ideally once a second. Some sort of rule engine to work out what to do and some mechanism for doing it should mean that as known “bad” things happen you can take the necessary precautions.
Unfortunately I’ve stopped on this at the moment because of other commitments, but it is something I think could fit in really well. Even now we have issues that we have the correct debugging info for and are awaiting longer-term fixes for, but once that initial investigation is done, why can’t the system automatically detect and react to it rather than letting it get out of hand?
I don’t think the self healing system is about simply rebooting on a regular basis; it is about identifying times where you know a reboot is the solution, gathering what information is available and then logging a ticket or adding a comment to an existing one, and just doing what is essentially grunt work. If, as a sysadmin, there has been more than one occasion where you’ve looked at a problem and within 30 seconds gone “ahh, X, Y & Z to fix”, that should only happen twice: the first is a fluke and a one-off, the second is a problem. From that point onwards the system should know enough to recognise the issue and keep pushing a known fix. Having run books and other things so a human can follow a script to do the same thing is nuts and quite frankly the way of the stone age.
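The “only happen twice” rule could be sketched something like this. It is Python and purely illustrative: KnownFixes, the issue name and the returned labels are all made up.

```python
from collections import Counter


class KnownFixes:
    """Track how often each issue has been seen and apply a fix once one is known."""

    def __init__(self):
        self.seen = Counter()
        self.fixes = {}

    def register_fix(self, issue, fix):
        """Record the fix a human worked out (the 'X, Y & Z') for this issue."""
        self.fixes[issue] = fix

    def handle(self, issue):
        self.seen[issue] += 1
        fix = self.fixes.get(issue)
        if fix is not None:
            fix()                # known issue: push the known fix automatically
            return "auto-fixed"
        if self.seen[issue] == 1:
            return "fluke"       # first sighting: treat as a one-off, a human looks
        return "problem"         # seen again with no fix registered: needs one
```

A human handles the first two sightings and registers the fix; from the third onwards handle() applies it without anyone opening a run book.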
“Having run books and other things so a human can follow a script to do the same thing is nuts and quite frankly the way of the stone age.”
I totally agree. Personally, I think some of the stuff you said is actually easier in modular systems, like microkernels, than mainstream systems. The reason is that those systems are incredibly modular with minimum shared state. The functionality you describe, particularly monitoring/debugging, could be a shim function inserted between a service and an app that did nothing but record the state of that function call (w/ timestamp). These shim functions could produce the raw information that diagnostic software needed to automate future diagnoses or responses. The shims can also be where a hotfix response happens, esp if the component driving them is scriptable. And when unnecessary, these shims could be removed w/out any impact on rest of the system.
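In a mainstream language the shim idea might look roughly like the sketch below. It uses a Python decorator as a stand-in for a shim between a service and an app; the names shim, lookup and calls are hypothetical, and a microkernel would do this at the IPC level rather than inside one process.

```python
import functools
import time


def shim(log):
    """Wrap a function so every call is recorded in `log` with a timestamp."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            record = {
                "name": func.__name__,
                "args": args,
                "kwargs": kwargs,
                "timestamp": time.time(),
            }
            try:
                record["result"] = func(*args, **kwargs)
                return record["result"]
            except Exception as exc:
                record["error"] = repr(exc)  # keep the failure for later diagnosis
                raise
            finally:
                log.append(record)           # recorded whether the call worked or not
        return wrapper
    return decorator


calls = []


@shim(calls)
def lookup(key):
    # Stand-in for a real service call
    return key.upper()
```

Because functools.wraps keeps a reference to the original, lookup.__wrapped__ gives back the unshimmed function, which is the “removed w/out any impact” part.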