Automate to survive

Everyone has a choice

Automate or die. That is pretty much it: you can automate everything, or you can keep working with manual processes that slow you down. If you don’t think you have the time to automate, you’re wrong; you need to automate, and do it quickly, before you get even busier and even further behind. Maybe you think you can’t automate because you don’t have the time to do it justice? Maybe you can’t automate because the task is too big? Too complicated? Well, it’s all rubbish.

Start small

This is a bit like eating an elephant: you have to start somewhere, and you have to start small. By all means try to start big if you want, but smaller is better. Maybe you have a task to check a site for new packages once a month; that is a good place to start, pulling third-party packages from vendors’ sites into your yum repo. Or maybe every time you build a server you need to do x, y & z. These sorts of tasks are achievable for everyone, even those without a good programming background, which leads on to language choice. Not all languages are equal, but knowing two or three is better than knowing just one. At a minimum you want some sort of shell language, so Bash, ash, sh or ksh, and a ‘proper’ language with object-oriented features, like Ruby, Python or Perl. The shell languages are good for turning what you do on a terminal into a reproducible and consistent format, but they are terrible for manipulating multiple data sources and mangling data, although with that said you can do some complicated things with them.

Once you have built up many smaller components of automation, start looking at ways of linking them together so that a series of tasks becomes one. It is this constant cycle of simplifying the process to automate the small chunks, and then linking the small chunks together, that makes an automated system.

Grow large

Over a year ago we used to deploy our environments with Puppet and CloudFormation, and it used to take about 2.5 days to complete and get working; that whole process is now down to 10 minutes thanks to automation. It required many leaps of faith, many poor decisions and a lot of bug fixing, but it got there through simplifying everything down and then automating each component. Other than building the servers and tagging them with appropriate keys in Amazon, the whole process is controlled by bash to the point of a working system, and it is typically very robust. That is a massive time saving, but to get there we had to fail, we had to try and we had to persist.

As a result we now automate large portions of the architecture, to the point where all of our time is split between incidents and project work to implement new features, with hardly any daily grind. Recently I have been working on our DR strategy, taking it to the point of clicking a button to deploy a clean environment that is built from the ground up and automatically pulls the latest backups to restore into it. That is now done, and it saves hours of time building out a DR environment, which makes the recovery time shorter and the process easier. So larger projects are perfectly achievable with the right attitude!

Summary

Give it a go: start small and work up to it, but be unrelenting and do whatever it takes. No matter how much you disagree with an approach, just do it to get the thing automated. Once more is automated you’ll have some time to fix it up properly, or you’ll need to extend it and can make a small part better then.

Release consistency

It’s gotta be… perfect

This is an odd one that the world of devops sits in an awkward place on, but it is vital to the operational success of a product. As an operations team we want to ensure the product is performing well, that there are no issues and that there is some control over the release process. Sometimes this leads to the long-winded, bloated procedures that are common in service providers, where people stop thinking and start following; this is a good sign a sysadmin has died inside. From the development side we want to be quick and efficient and to reuse tooling. Consistency is important and a known state of release is important, but the process side of it should not exist as a manual procedure; it should be automated, with minimal interaction.

So as Operations guys, we may have to make ad-hoc changes to ensure the service continues to work; as a result, a retrospective change is made to ensure the config is good for the long term. Often this can stay untested, as you’re not going to re-build the system from scratch to re-test, are you?

The development teams want it all: rapid, agile change, quick testing suites and an infallible release process with plenty of resilience, double checks and catches for all sorts of error codes, etc. The important thing is that the code makes it out, but it has to be perfect, which leads on to QA.

The QA team want to test the features, the product, the environment, combinations of each and the impact of going from X to Y and tracking the performance of the environment in between. All QA want is an environment that never changes, with a code base that never changes so they can test everything in every combination, those bugs must be squashed.

Obviously, everyone’s right, but with so many contradicting opinions it is easy to end up in a blame circle, which isn’t all that productive. The good news is we know what we want…

  • Quick release
  • Quick accurate testing
  • Known Product build
  • Known configuration
  • No ad-hoc changes to systems
  • Ability to make ad-hoc changes to systems
  • Infallible release process

All in all not that hard, but there are two sticking points here. Point one: infallible releases are a thing of wonder and can only ever be iterated towards; in time it will get better. Day 1 will be crap, day 101 will be better, day 1000 better still. Point two: you can’t have no ad-hoc changes and the ability to make ad-hoc changes, can you? Well, you can.

If you love it, let it go

As a sysadmin, if there’s an issue on production I will in most cases fix it in production. If it is risky I will use our staging environment and test it there, but this is usually no good, as staging typically won’t show the issues production does, e.g. all of its servers would be working while production is missing one. This means I have to make ad-hoc changes to production, which causes challenges for the QA team, as now the test environments aren’t always the same. It then screws up the release process, as we made a change in one place and didn’t manually add it to all the other environments.

So, what if we throw away production, staging or any other environment with every release? This is typically a no-no for traditional operations: why would you throw away a working server? Well, it provides several useful functions:

  1. Removes ad-hoc changes
  2. Tests documentation / configuration management
  3. Enhances DR strategy and familiarisation with it
  4. Clean environment

The typical reason the throw-away approach isn’t done is a lack of confidence in the tools. Well, bollocks. What good is having configuration management and DR policies if you aren’t using them? If you are in an operational role now and are making changes to Puppet and rolling them forward, you achieve some of this, but it’s not good enough; you still have the risk of not having tested the configuration on a new system.

With every environment being built from scratch with every release, we can version control a release to a specific build number and a specific git commit, which is better for QA as it’s always a known state. If there’s any doubt, delete and re-build.

The release process can now start being baked into the startup of a server / Puppet run, so the consistency is increasing and hopefully the infallibility of the release is improving. Add to this a set of system-wide checks for basic health, a set of checks for unit testing the environment and a quick user-level test before handing over for testing, and it’s more likely, more often, that the environments are in a known, consistent state.

All of this goodness starts with deleting it and building fresh. Some of it will happen organically, but by at least making sure all of the configuration / deployment is managed at startup you can re-build, and you have some DR. Writing some simple smoke tests and some basic automation is a starting point; from there you can build upon the basics to start making it fully bulletproof.
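
As a rough idea of how small a smoke test can be, here is a minimal sketch in Ruby. It is only an illustration, not the checks we actually run: the function names and the idea of looking for a known string on a page are assumptions made for the example. It uses only Net::HTTP from the standard library.

```ruby
require "net/http"
require "uri"

# Fetch the body of a URL; any network error propagates to the caller.
def fetch_body(url)
  Net::HTTP.get(URI(url))
end

# True when the page could be fetched and contains the expected text.
# `fetch` is injectable so the check can be exercised without a network.
def smoke_test(url, expected_text, fetch: method(:fetch_body))
  fetch.call(url).include?(expected_text)
rescue StandardError
  # A refused connection or timeout counts as a failed smoke test.
  false
end
```

Something like smoke_test("http://localhost:8080/status", "OK") could then run at the end of a build, with that URL being a made-up example. Treating any error as a plain failure keeps the check simple enough to bolt onto a startup script.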

Summary

Be radical, delete something in production today and see if you survive, I know I will.

Sentinel Update

No, really an actual update

So after feeling all bad for not committing any work to sentinel for a while I decided to brush off the dust and crack on with some simple bits.

As of right now Sentinel will:

  • Perform basic checks of *nix processes
    • Is there a process running?
    • Is it in a running or sleep state (or other healthy state)?
  • Basic check of system health
    • Check disk usage with df
    • If the disk usage is high, log the offending disk info to the system log
    • Check memory usage of system
  • Basic Application health
    • Perform basic URL grab / scrape for search string
    • Check the amount of memory the application is using of what it requested & of the system total
  • Basic actions
    • If a process is zombied / dead, kill it
  • Run as a cron job
  • Log output to file and to screen in log4j-style format
  • Take options from CLI where appropriate
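
To give a flavour of the disk check, here is a minimal sketch of parsing df output in Ruby. It is not the code in Sentinel; the function names and the 90% threshold are assumptions for illustration. It expects the POSIX output format produced by df -P, where the fifth column is the capacity as a percentage.

```ruby
# Parse the text produced by `df -P` into one hash per filesystem.
def parse_df(output)
  output.lines.drop(1).map(&:strip).reject(&:empty?).map do |line|
    # Limit the split to 6 fields so mount points containing spaces survive.
    fs, _blocks, _used, _avail, capacity, mount = line.split(/\s+/, 6)
    { filesystem: fs, percent_used: capacity.to_i, mount: mount }
  end
end

# Return only the filesystems at or above the given usage percentage.
def disks_over(disks, threshold)
  disks.select { |disk| disk[:percent_used] >= threshold }
end
```

In a real run the input would come from something like parse_df(`df -P`), and anything returned by disks_over(disks, 90) would be written to the log.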

This is not too bad; there are a few things still missing from what I’d like it to be able to do before I start testing it out, such as:

  • Identify the service associated with a process and restart it if Sentinel kills the process
  • Restart application if URL check fails X number of times
  • Tidy up disk space based on known “safe” files that can be deleted (like log files over X days)
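
The "restart after X failed URL checks" feature could be sketched roughly as below. This is a guess at one way of doing it, not something that exists in Sentinel yet; the class name and its methods are made up for the example. Because Sentinel runs from cron, a real version would also need to persist the counter between runs (for example in a small state file), which this sketch leaves out.

```ruby
# Tracks consecutive URL check failures and says when a restart is due.
class UrlFailureTracker
  def initialize(max_failures)
    @max_failures = max_failures
    @consecutive = 0
  end

  # Record one check result; returns true when a restart should happen.
  def record(check_passed)
    if check_passed
      # Any success resets the run of failures.
      @consecutive = 0
      return false
    end

    @consecutive += 1
    return false if @consecutive < @max_failures

    # Threshold reached: reset so the next restart needs a fresh run of failures.
    @consecutive = 0
    true
  end
end
```

Counting only consecutive failures means one slow response does not trigger a restart, which seems the safer default for something running unattended.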

That’s all I want to achieve. Once I have that I then need to start with what is a rather tiresome activity which will be the testing of that application to make sure it works in a way that is sensible. After the testing that’s it!

I wish. After I’ve done some testing I will re-structure the code. Although it works as it is, it is not maintainable, and it is becoming harder to write code for it without that code being in the wrong place and not quite doing the right thing… so I want to split the code out so it is easier to see what each bit does. I will probably create a few classes to put the code in and make it better; either way I’ll grab some advice from people that know how to program first, so I can get it into a reasonable structure.

Other than code layout, there’s the structure of the data I’m trying to store. At the moment I am trying to use a default constructor in Ruby to create a class for my scores; however, this is not working out, and I’m pretty sure I’ll need to write my own constructor and methods to get and set the values. This way I’ll be able to create a hash to store the data in instead of a list of variables.
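
A minimal sketch of what such a class might look like is below. The class name, the score names and the idea of summing the scores are all assumptions for illustration rather than what Sentinel actually does; the point is just a hand-written constructor with a hash behind get and set methods.

```ruby
# Holds named scores in a hash instead of a list of separate variables.
class Scores
  # Every named score starts at zero.
  def initialize(*names)
    @values = {}
    names.each { |name| @values[name] = 0 }
  end

  # Getter: raises KeyError for a score that was never declared.
  def [](name)
    @values.fetch(name)
  end

  # Setter: refuses unknown names so typos do not create new scores.
  def []=(name, value)
    raise KeyError, "unknown score: #{name}" unless @values.key?(name)
    @values[name] = value
  end

  # Sum of all scores, handy for an overall health figure.
  def total
    @values.values.sum
  end
end
```

With this, scores = Scores.new(:disk, :memory) followed by scores[:disk] = 3 reads much like using a plain hash, but mistyped names fail loudly.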

Then there’s writing a proper fix to my 50 line case statement which I imagine is no small task but hopefully that will be pleasantly challenging.

Is it worth it?

I think so, even if no one ever uses this I am learning more about programming structures which I think will be rather useful in the long term as I imagine there may be a time when being able to code to a reasonable standard may be needed.

When the re-structure of the code is done and the last of the features is there, it will hopefully be a useful application to deploy, and that should save me from being woken up at 3am to deal with a simple support issue.

So that is basically what I’ve been playing with for a majority of this last week and I will try to continue on with it for now to at least get the final proof of concept done and see how it works out, at that point I’ll start trying to design it a bit better so it works as a proper application rather than a dodgy script.

Sentinel – An open source start

An open source start

Last week I introduced a concept of Self Healing Systems, which then led me on to have a tiny bit of a think, and I decided that I would write one. The decision took all of 5 minutes, but it gives me an excuse to do something a bit more complex than your everyday script.

I created a very simple website here which outlines my goals. As of writing I have got most of the features coded up for the MVP, and I do need to finish it off, which will hopefully be soon, ideally by the time this is published, but let’s see.

I decided to take this on as a project for a number of reasons:

  1. More Ruby programming experience
  2. Other than Monit there don’t seem to be any other tools, and I had to be told about this one…
  3. It’s a project with just the right amount of programming challenge for me
  4. I like making things work
  5. It may stop me getting called out as often at work if it does what it’s meant to

So a number of reasons, and I’ve already come across a number of things that I don’t know how to solve, or where I don’t know what the right way of doing it is. Which is good: I get to do a bit of googling and work out a reasonable way, but to be honest that is not going to be enough in the long run. Hopefully as time goes on my programming experience will increase sufficiently that I can make improvements to the code.

Why continue if there’s products out there that do the same thing?

Why not? Quite often there’s someone doing the same thing even if you can’t find evidence of it, competition should not be a barrier to entry, especially as people like choice.

I guess the important thing is that it becomes usefully different. Take a look at systems management tools, a personal favourite of mine: you have things like RHN Satellite, Puppet and Chef. Three tools, one very different from the other two and another only slightly different. People like choice, and different tools work differently for different people.

I guess what I mean by that is that some people strike a chord with one application or another and as a result become fanboys, normally for no good reason.

There’s also the other side of it. I’ve not used Monit; I probably should, I probably won’t, but it doesn’t sound like where I want to go with Sentinel. Quite simply, I want to replace junior systems administrators. I don’t want just another tool to be used; I want a tool that can provide real benefit by doing the checks on the system and by making deterministic decisions based on logic and raw data, not just by looking at the system it’s on but by considering the whole environment of which it is a part. I think that is a relatively ambitious goal, but I also think it is a useful one, and hopefully it will get to a point where the application is more useful than the MVP and can do more than just look after one system.

Like any good open source product it will probably stay at version 0.X for a long time, until it has a reasonable set of features in it that make it more than just a simple Ruby programme.

A call for help

So I’ve started on this path and I intend to continue regardless at the moment. One thing that will help keep me focused is user participation, through using the script and logging bugs at the GitHub site it’s hosted on.

I think at the moment what I need is some guidance on the programming of the project. It’s clear to me that in a matter of months, if not weeks, this single-file application will become overly complicated to maintain and would benefit from being split out into classes. Although I know that, I do not know the right way of doing it, as I don’t have any experience of larger applications, so if anyone knows, that would be good!

In addition to the architecture of the application there are just some programming issues, which I’m sure I can overcome at some point, but I will probably arrive at the solution by having a punt and seeing what sticks. There’s a wonderful switch in the code for process states which needs to change. I need to iterate through each character of the state and report back on its status, whereas at the moment it is just looking for a combination. To start with I took the pragmatic option: add all of the process states my system has to the switch and hope that’s enough.
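
For anyone wondering what iterating over the state might look like, here is a rough sketch in Ruby. It is not the code in Sentinel: the names are made up, the meanings come from the state codes documented in the Linux procps ps man page (other systems differ), and which states count as healthy is purely an assumption.

```ruby
# Meanings of the characters in the STAT column of Linux `ps`.
STATE_MEANINGS = {
  "D" => "uninterruptible sleep",
  "I" => "idle kernel thread",
  "R" => "running or runnable",
  "S" => "interruptible sleep",
  "T" => "stopped by job control signal",
  "t" => "stopped by debugger",
  "X" => "dead",
  "Z" => "defunct (zombie)",
  "<" => "high priority",
  "N" => "low priority",
  "L" => "has pages locked into memory",
  "s" => "session leader",
  "l" => "multi-threaded",
  "+" => "foreground process group"
}.freeze

# Assumption: these leading characters are treated as healthy.
HEALTHY_STATES = %w[R S D I].freeze

# Describe every character of a state string such as "Ss+".
def describe_state(stat)
  stat.each_char.map { |char| STATE_MEANINGS.fetch(char, "unknown (#{char})") }
end

# The first character is the main state; the rest are modifiers.
def healthy_state?(stat)
  HEALTHY_STATES.include?(stat[0])
end
```

Looking each character up in a hash means a new combination such as "Sl+" needs no extra branch, which is the weakness of matching whole strings in a switch.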

So if anyone feels like contributing, or can even see a simple way of fixing some dodgy coding, I’d appreciate it. I guess the only thing I ask is that if you are making changes, you see the README, log a ticket in GitHub and commit the changes with reference to the ticket, so I know what’s happened and why.

So please, please, please get involved with Sentinel

Self healing systems

An odd beginning

So I’m writing this having just spent the last 10 days on my back in pain and finally starting to feel better. It’s not come at a good time, as another member of my team decided they had a “better opportunity”. This is the second person to have left this organisation without as much as a passing comment to me that they were even leaving. How strange; but I digress.

Either way it opens up a void: a team of 2 and a manager is now down to a team of one, with the one having back pain that could at any moment take me out of action. Up to the day before I was unable to make it to work, the system we look after had been surprisingly stable, rock-like in fact; unfortunately, as soon as I say “I’m not going to make it in”, the system starts having application issues (JVM crashes).

Obviously the cause needs a bit of looking into and a proper fix etc., but in the meantime what do we do? I had an idea, a crazy idea, which I don’t think is a fix to any problems but is at least a starting point.

Sentinel

I spent a bit of time exploring Ruby a few weeks back, so I started to look at ways of writing something that would do a simple check: is process X running? The simple version I wrote just checked how many instances of Tomcat were running (our application always runs 2). If it was 2, do nothing; if it was more than 2, do nothing (something else crazy has happened, so it just logs to that effect); but if it was fewer than 2, it would try a graceful-ish restart of the service.
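
The decision itself boils down to something like the sketch below. This is a reconstruction of the logic described, not the original script; the function name and the symbols it returns are made up, and counting the processes and doing the restart are left out.

```ruby
# Decide what to do given how many instances of the process are running.
# `expected` defaults to 2 because our application always runs two.
def action_for(running, expected = 2)
  if running < expected
    :restart   # too few instances: try a graceful-ish restart
  elsif running > expected
    :log_only  # something odd has happened: log it and leave it alone
  else
    :nothing   # exactly the expected number: all is well
  end
end
```

Keeping the decision separate from the code that counts processes and restarts services makes it easy to test without touching a real Tomcat.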

So this obviously works in the one specific case that we have, but it isn’t extensible and it doesn’t do any better checks, which all got me thinking: why isn’t there something to do this for us? I don’t know of anything that does this (if anyone does, I’d appreciate knowing), though there are a number of tools that could be muddled together to do the same sort of function.

Nagios would monitor the system, Cucumber could monitor the application interactions and Swatch could monitor the logs, but in most cases these are just monitoring. I’m sure there are ways to get them to carry out actions based on certain outcomes, but why use so many tools?

Yes, the individual tools probably do the job better than a single tool, and as a sysadmin I’d rather have one tool to do everything, but that isn’t practical either. So can we somehow get the benefits of monitoring with Nagios but have a tool that specifically watches the application Nagios is gathering information about and then makes decisions based on that?

The big idea

So I wonder if it’d be possible to write a simple Ruby application that every now and then did a number of actions:

  1. Check the service on the box, right number of processes, not zombied etc, etc
  2. Check the disk capacities
  3. Check the CPU utilisation
  4. Check the memory utilisation
  5. Probe the application from the local box, a loopback test of sorts
  6. Integrate with Nagios or another monitoring tool to validate the state it thinks the box is in compared with the locally gathered stats
  7. Depending on the outcome of all the checks carry out a number of actions
  8. Hooks into ticketing systems
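
The overall shape of the list above, run some checks and then act on whichever ones failed, could be sketched like this. It is only a thought experiment at this stage: the Check struct, the function name and the action names are all hypothetical, and the probes are stand-ins for the real checks.

```ruby
# A check is a name, a probe that returns true when healthy,
# and the action to take when it is not.
Check = Struct.new(:name, :probe, :action)

# Run every check and collect the actions for the ones that failed.
# A probe that raises an error is treated as a failed check.
def run_checks(checks)
  checks.each_with_object([]) do |check, actions|
    healthy = begin
      check.probe.call
    rescue StandardError
      false
    end
    actions << check.action unless healthy
  end
end
```

A caller might build a list such as Check.new("disk", -> { true }, :tidy_disk) and friends, then hand the resulting actions to whatever does the restarting and ticket logging.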

When I was thinking this through the other day it seemed like a good idea. The biggest issue I have is not being a programmer, so I have a steep learning curve; it’s a complicated application, so it requires some thought. I would also probably have to ignore everyone that thinks it is a waste of time, which isn’t too hard to do.

I guess what I’m thinking of is something like FBAR. As a system scales up to hundreds of servers, uptime and reliability become more important, and it is sometimes necessary to take a short-term view to keep a system working. The most important thing is that those short-term fixes are totalled up and then logged as tickets: 1% of your servers crashing and needing a restart isn’t an issue, but if that 1% becomes 5% and then 10% it’s panic stations!

Summary

I think my mind is made up, a sentinel is needed to keep watch over a solution, and what’s crazy is that the more I think of it the more useful it seems and the more complicated it seems to become. As such I think I’m going to need help!