Have you noticed…
In short, all monitoring out there sucks. I promised a few months back to do a review. I was wrong; it is not possible. Consider reviewing an industry-standard tool like Nagios: after several hours of installation I might have a server that is not in config management and has no users or servers to monitor… This is why these types of on-premise apps will die out.
Who wants to spend weeks working out the config and management of a system that is meant to make your life easier? Monitoring tools are, very simply put, meant to let you know if serverX is on… or off. Advanced details, like on for service X or off for service Y, come later.
The basic monitoring life cycle should go like this:
Day 1: is the server on or off?
Day 2: are the services I care about running?
Day 3: in X days, Y may happen.
These three things are important to monitoring; they give you some predictability in your service, so the sooner you have them the better. A good monitoring tool should let you answer these questions as quickly as possible from the time you purchase / download it to the moment it’s on your server. Quicker is better!
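As a rough sketch of those three stages in Python: the function names, the use of a TCP connect as a stand-in for "is the server on", and the straight-line projection for "in X days Y may happen" are all assumptions made for illustration, not how any particular monitoring tool does it.

```python
import socket
from typing import Optional, Sequence, Tuple


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Day 1 and Day 2: True if a TCP connection to host:port succeeds.

    Checking the SSH port is used here as a crude "is the server on" test;
    checking a service port answers "is the service I care about running".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def days_until_full(
    samples: Sequence[Tuple[float, float]], capacity: float
) -> Optional[float]:
    """Day 3: fit a least-squares line through (day, usage) samples and
    project how many days after the last sample usage reaches capacity.

    Returns None when there is too little data or usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    if var_x == 0:
        return None
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / var_x
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    last_day = samples[-1][0]
    return max(0.0, (capacity - intercept) / slope - last_day)
```

For example, `port_open("serverX", 22)` would cover Day 1, `port_open("serverX", 443)` Day 2, and `days_until_full(disk_samples, 100.0)` Day 3, where `serverX` and `disk_samples` are placeholders for your own host and data.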
Bang for buck
I am acutely aware that monitoring tools that promise the world cost arms, legs, souls and pride; worse yet, they fail to deliver anything of value that you need. In the past I have seen £100k HP OpenView systems replaced in a couple of weeks by Nagios, and I’ve seen Nagios + Munin replaced by Opsview because it is easier to manage and configure than both individual tools. For those that don’t know, Opsview is a nice front end and config piece for Nagios.
I have even, unfortunately, seen £2k a month wasted on 10 servers with New Relic. I guess the point is… monitoring costs anything from free to ridiculous; the key is always: what does it do for you?
Does it make your life easier?
Can you work quicker with or without your monitoring tool?
On a side note… New Relic’s product is awesome, but if you are not using Java, why bother? If you are, you may find, like me, that your engineers find it useful but not irreplaceable… All I can say is it wasn’t as good as Nagios for alerting and monitoring of the hosts, but it was definitely better at the application level.
Where is the happy ground? You need something as configurable as Nagios, as cheap as Nagios but, most importantly, not Nagios, and this leaves you in an awkward position.
Nagios is awesome and has some cool features, good support, many plugins etc. However, the server doesn’t scale easily, configuration is not as simple as it should be and, quite frankly, the web UI looks like a child vomited hatred on it. Just plain ugly. So you naturally lean towards Opsview, which takes away the config hassle of Nagios by providing Puppet modules and decent web UI config, but now you have to pay. Is it worthwhile? Definitely, it’s better than Nagios, but that isn’t good enough, is it? Certainly it’s a step in the right direction, but it’s not the killer tool.
Likewise, New Relic was meant to be that killer tool, designed for devs by devs. So, in short, complicated, non-standards-compliant and lacking in OS monitoring. So what is a sysadmin to do? Give up? I think not.
It comes down to this: you install tools like Opsview or CheckMK as they at least give you a better interface, but they don’t solve the issues of NRPE or firewall rules having to be opened in all directions. It’s for this reason I think there has to be a better way. I don’t want to think about or spend my time opening up rules; I want something simple and powerful.
There are new tools coming onto the market that sound better to me. Imagine being able to leverage the Nagios community while having an easy-to-drive UI on a monitoring tool that gives you the same power as Chef knife or Puppet marionette, while being able to update all of this through simple git commits or the web UI as you see fit. Writing a new monitoring check is done during the analysis process rather than as backlog, or you can simply utilise the RPC nature of the tool to debug issues in prod and write checks on the fly. Did I mention that while doing this it is also able to act like Pingdom and provide dashboards to management?
So where does this leave us? Well, looking to tools like Dataloop.IO for solutions. I have had the privilege of using this while they are in closed beta, and they’ve been really good at taking on feedback to make it the monitoring platform I need it to be. It is getting close to being ready, and I’m genuinely excited about what is going to happen to this platform over the next year or two.
There are resource types in Puppet which can be used to register a node/service with a Nagios server, assuming you understand exporting/collecting of Exported Resources. Another approach would be to build out the Nagios config via PuppetDB queries for Service resource types, perhaps using a module like dalen/puppetdbquery.
Outside of Nagios, there are now various monitoring tools which tie into registration services like etcd and/or ZooKeeper natively and thus don’t require you to perform relatively advanced Puppet techniques. A lot of that work converged around the #monitoringsucks hashtag a couple of years ago.
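To give a feel for the second approach, here is a minimal Python sketch that renders Nagios host definitions from a node inventory. The inventory is hard-coded in the usage note; in practice it would come from whatever you query (PuppetDB, for instance). The function name, the dictionary keys and the `generic-host` template name are assumptions for illustration only.

```python
from typing import Iterable, Mapping


def render_nagios_hosts(
    nodes: Iterable[Mapping[str, str]], template: str = "generic-host"
) -> str:
    """Render one Nagios `define host` block per node, sorted by name.

    Each node is expected to carry a "name" and an "address" key
    (hypothetical field names chosen for this sketch).
    """
    blocks = []
    for node in sorted(nodes, key=lambda n: n["name"]):
        blocks.append(
            "define host {\n"
            f"    use        {template}\n"
            f"    host_name  {node['name']}\n"
            f"    address    {node['address']}\n"
            "}\n"
        )
    return "\n".join(blocks)
```

Calling `render_nagios_hosts([{"name": "web01", "address": "10.0.0.1"}])` returns text you could write to a file under your Nagios config directory and reload. Sorting by name keeps the output stable between runs, so a config management tool only sees a change when the inventory really changes.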
I’ve used both methods you describe to manage Nagios. I think the exported resources method is the cleanest, but that requires having a puppet master and I don’t like that :)
I’ll go into more detail later as to why Dataloop is better than those, but it has groups defined in the UI which can be physical or geographical, whatever you want and at any tier in the hierarchy. When you bring up an agent you just specify this one group, and the agent will take care of working out what checks it needs to download based on the hierarchy defined in the UI.
However cool that feature is, it’s not the one I think is better than old-school Puppet. When you bring these servers up and they are configured, that is great, there’s a mechanism for it, but what about when they go down? When you terminate the server? In my experience the best you could do is write a custom script to hit an API on shutdown. With Dataloop the agent cleanly de-registers when the server is shut down and the data is removed from the UI. If you stand up a new server and wish to maintain historic information, then you can use the original agent fingerprint to maintain history; in a blue/green deployment, for example, this may be desirable.
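For comparison, the "custom script on shutdown" approach might look something like this Python sketch. It is not how the Dataloop agent works; `make_shutdown_handler`, `install` and the `deregister` callback are hypothetical names, and the callback stands in for whatever API call your monitoring server needs.

```python
import signal
import sys
from typing import Callable


def make_shutdown_handler(deregister: Callable[[], None]):
    """Build a signal handler that de-registers once, then exits."""
    done = False

    def handler(signum, frame):
        nonlocal done
        if not done:
            done = True
            # Hypothetical callback, e.g. an HTTP call that removes
            # this host from the monitoring server.
            deregister()
        sys.exit(0)

    return handler


def install(handler) -> None:
    """Attach the handler to the signals a normal shutdown sends.

    Must be called from the main thread of the process.
    """
    signal.signal(signal.SIGTERM, handler)
    signal.signal(signal.SIGINT, handler)
```

A long-running agent would call `install(make_shutdown_handler(my_deregister))` at start-up, where `my_deregister` is your own function. Note the weakness: a hard power-off or a `kill -9` never runs the handler, which is part of why this is only the best you could do.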
I agree, there are many problems with the current approach to monitoring. First, the complexity of the systems. As you mentioned, it takes days to get the system up and running and months to teach the system what to look for, and then you usually have to hire dedicated people to look after it. Secondly, SNMP monitoring is really basic. It doesn’t give you a hint as to why something went wrong. Lastly, you need to figure out yourself how to fix the problem.
I felt obligated to jump in because I don’t think monitoring sucks anymore. I can name one tool for networking and firewall monitoring that doesn’t take this approach: indeni.
indeni uses SSH to connect to the devices and collect very detailed information, then it analyses the information it collected and returns very detailed alerts. It looks for failures such as mismatches between cluster members, mismatching duplex configurations, flapping VPN connections and more.
The benefits I see:
A. Users don’t teach the system what to look for, the knowledge exists out-of-the-box. The knowledge base grows with each release.
B. Each alert comes with a built-in section of commands and actions you should take in order to solve the problem indeni found.
C. The installation process is really fast (about 30 minutes) and it starts monitoring immediately (given that you have sufficient permissions to SSH to the device).
Disclaimer – I’m a product manager at indeni. Our customers use the most popular monitoring systems in the market and they still buy indeni; they see the added value of a smart, dynamic system over static standard tools that eventually push you into more laborious work.