
Automated Status pages with Status.io Plugin

Originally posted on Dataloop.IO Blog:

When it comes to service status pages, most of us feel they are more of a marketing gimmick than fact. Take Amazon Web Services: the first time you are aware of a problem it is not from the status page, it is when Twitter catches fire with people complaining about the poor service. The trend is alarming, and it is not just Amazon; almost all service providers do the same thing. For some reason special authorisation is required to update the status page, and special people need to confirm that this is the right marketing move for the business. That's not how we work.

People need to trust a service. People want to feel like they are getting the information as and when it happens, not 30 or 40 minutes later, if at all. That is where status.io comes in for us. We needed a…


How to Fail and make it a Success


There comes a point where we all fail. It doesn't matter when; if you don't think it's happened yet, give it time, either way it's coming. The question you need to ask yourself is "What am I going to do about it?". I've worked in places where failure was a point-the-finger affair and places where it wasn't. It is clear to me that failure is the only way to move forward and succeed; you just need the right strategy for dealing with failure, one that allows you to move on with life and make the changes you need to make things better. Remember you are not going to fix every problem immediately on your first attempt, but stick to the process, follow it religiously, and eventually you will be in a better place.
Thomas Edison famously got it right:

I have not failed. I’ve just found 10,000 ways that won’t work…

The whole point of failure is to learn from it, and as long as you remember that, you will succeed. With that in mind, the most common mistake I see is the failure to learn. It's fine to fail; fail all day long if you want. The important thing is to have the right mechanism to cope with failure so you ensure you learn from it. This doesn't mean it needs to be process heavy, but it does need to be done religiously after every failure.

There are a few things I ask after every failure, regardless of whether it was customer facing or internal; failure is failure is failure.

  • What can we do to stop this happening again?
  • How can we get more notification next time?
  • Did we have the right people looking at this at the right time?

I feel the need to be abundantly clear here: "What can we do to stop this happening again?" literally means what crazy ideas do people have to stop this? Do we add a new layer in? Do we double up somehow? Throw things behind a load balancer? It's no good having a room full of bright people if you can't answer this question; there is always something that can be done, whether it's a change in process, some crazy technical solution or just adding more capacity.

Getting more notification is important, and not just after the event: can you predict the event? The obvious example is disk space; with other issues your mileage may vary. Either way you should be able to do something to give yourself a little more time to start dealing with it, even if it's something simple like upping the rate of the checks and the failure notification so you get the alert a minute sooner than before.
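
To make the disk space example concrete, here is a rough sketch (the mount point, interval and thresholds are all made up for illustration) of a check that extrapolates the current growth rate into a time-to-full warning:

#!/usr/bin/env bash
# Sketch only: sample disk usage twice and extrapolate how long until full.
set -euo pipefail

mount_point="/var"   # illustrative mount point
interval=300         # seconds between the two samples

used_kb()  { df -Pk "$mount_point" | awk 'NR==2 {print $3}'; }
avail_kb() { df -Pk "$mount_point" | awk 'NR==2 {print $4}'; }

first=$(used_kb)
sleep "$interval"
second=$(used_kb)

growth=$(( second - first ))   # KB consumed during the interval
if (( growth <= 0 )); then
    echo "OK: $mount_point is not growing"
    exit 0
fi

seconds_left=$(( $(avail_kb) * interval / growth ))
echo "WARNING: $mount_point full in roughly $(( seconds_left / 3600 )) hours at the current rate"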

Having the right people is also important, and I'm not talking about having Bob on call rather than Chris; I'm talking about getting developers awake at the right time. Let's say there's a memory leak: the alert should wake up both a sysadmin/DevOps person and a developer. The only things the sysadmin can do are make sure the memory is freed so the service can start again (ready to fail at some undetermined later point) or automate restarts. These are ways of working around the problem rather than fixing it, and they are worth considering when asking "What can we do to stop this happening again?", but you wrote the app and you have the developers, so why not do both: have the DevOps/sysadmin stabilise the system and minimise the impact while the developers investigate the cause and write a fix for the problem.

With these simple tasks in place, the only sensible thing to do for your service is to fail, lots, regularly, and then put in place the solutions that stop it happening again. Failure is an option and it's one I'd recommend; with the appropriate framework in place!

How to do DevOps in an enterprise

Overview

Over the years we have seen fads come and go and trends ebb and flow as business requirements and drivers change. DevOps, however, has been rather resilient, so much so that all of the enterprises believe they now need it. The interesting point here is that they still don't know why; they just know everyone else has it and that if they don't, they're behind the times.

In some ways this in itself is worrying, because the enterprises need help and guidance. They really need to be reading through this book: Next Gen DevOps: Creating the DevOps organisation

There's an increasing trend to have the DevOps team literally sit between Dev and Ops; companies doing this should stop. Here are a couple of things not to do, with examples.

What Not to do

Sky: see the full post; if that is no longer there, see this pic:
Screen Shot 2015-03-25 at 22.48.49

Defra: I know from people who work there, and from various job adverts, that they implement the same mechanism: one team makes the app, another team builds and packages it, and a third team deploys it. So many broken processes; I feel for them, but it is ultimately what has led to these: 1 & 2

What both of these enterprises have done is jump on a bandwagon and somehow understand DevOps to be the thing that makes deliveries smoother, and therefore sit it between Dev and Ops, mainly because they have an existing Dev team that works and an existing Ops team that works but they don't have a DevOps team.

WallOfConfusion_Release

Let's examine this: Dev on one side, Ops on the other. DevOps is meant to, and I'll spell this out simply for Sky and Defra, break down the wall of confusion. Not, as these guys have done, sit a DevOps team between a Dev team and an Ops team to help make things more agile.

If anyone, anyone… can explain how adding a team in the middle of your existing Dev and Ops teams magically brings you improvements in agility, I'd be interested in knowing how. Before you say something like "They make all the tooling and stuff to make it all quick and to deploy continuously", you should refer to the book above and go get a job in DevOps (you're not in it by definition, sorry). We've always had build engineers for that, so your argument is basically to do what we always did but with more people who are slightly less code focused. Alright…

Then there are the people who think that simply by making developers do operations they get DevOps. Infrastructure as code, they say, yay. I'm not saying developers cannot learn operations, and likewise this rant applies to ops doing dev; the point is that forcing one skill set to do the other is hard.

What to do

Firstly, realise there is no right answer. Secondly, realise that experimenting and trying new things is necessary.

Start with a good foundation. Get a senior Dev and a senior DevOps engineer in the team; for the DevOps engineer I consider these the three most important things:

1. Can program in an object-oriented language and understands classes, modules, inheritance, recursion, etc.
2. Has a fundamental understanding of how the OS works at the filesystem and process level
3. Appreciates pragmatism over perfection

The hard thing for anyone going from Dev to Ops or Ops to Dev is that they have a massive learning curve, and they may not realise what it is until many screw-ups have happened. Having these people in the team is massively important because they will bridge the gap between the Dev and the DevOps, and trust me, there is one.

Once you get these three or four people in a team, the next most important thing to do is to make that team fully accountable and responsible for the end-to-end service. If anyone in the team is not okay with supporting the service they wrote and designed in production, fire them and find someone new. It's paramount to have buy-in within the team on accountability for the product and services in production; it forms a fundamental pillar of continuous feedback: I did something bad, it bit me / my team, they told me I did it badly / I noticed I did something bad, I made it better, we all benefited.

Summary

Having the right people working in a close-knit group and empowering them to own the whole solution, front to back, is the only way you can realistically implement DevOps and see meaningful gains.


EBay – old fashioned security in a modern day

Hello EBay

Firstly, I like EBay and have been using it for over 10 years. When I found out via news forums about the big security issue, I realised I had to do two things.

  1. Update my email address to one I actually use
  2. Set a secure password

For some reason both of these rather simple things caused me problems due to "security", so let's look at each one and work out why it is a problem.

  1. Your email address cannot contain your EBay user ID
  2. The "secure" password can be at most 20 characters long and the only special characters allowed are '-_@'

With point 1 there was nothing I could do other than contact support, which I did (only tonight, after getting bored), and that spawned this. With point 2 I just coped, and when I checked again today I found they now have a new password policy; it seems to have been set by someone sensible and is now as follows:

EBay password policy

So this means my rant about two things now becomes a rant about one thing, but it is the one that annoys me the most, so here goes.

Security Theatre

There are two types of security: theatre and actual. Actual security results in the system being secure, e.g. implementing two-factor authentication. Theatre, on the other hand, is things people do to make you think it's secure, e.g. insisting that your username and email address are different. Why is that theatre? Simply put: firstly, my user ID can easily be found out, so it is effectively public knowledge; secondly, knowing my user ID should not make logging into my account any easier; thirdly, keeping it out of my email address can only stop people guessing my email address or other details.

So EBay have a rule (implemented back in 2004, apparently) that your user ID cannot be present in your email address. Let's say my EBay user ID is soimafreak (it is): I cannot use any of my normal personal email addresses because, like most people, I have an internet handle and I stick to it. Sure, I could use a different username on every site, and that does stop people guessing my username. But, again, knowing my username should not make it easier to hack my account… unless you have poor security to start with… EBay…

Let us go on a storytelling journey now and hypothesise about how bad EBay's security really is at its core. To do this you have to understand that EBay was an original dot-com bubble company, back in the good ol' days when good security consisted of two things: MD5 your users' passwords, and make sure your DB is not accessible on the internet and is restricted by username and password.

As discussed before, MD5 has some flaws, but I imagine that until recently EBay used an approach like this, or maybe worse, for storing passwords. Why is this bad? Well, you are vulnerable to rainbow table attacks, which are very commonplace. Now let's say it gets to 2004 and you hear about people doing that: what simple security precaution could you take without re-hashing everyone's password, which would require everyone to change their passwords? Well, if you insist that the user ID is not the same as, or contained in, the email address, then for those specific users it becomes slightly harder to work out what their login was. Was it a gmail.com, hotmail.com or aol.com address with their user ID on it?
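
To make the rainbow table point concrete, here is a tiny illustration using nothing more than the md5sum and openssl command line tools (a real system should use bcrypt or scrypt rather than salted MD5; this just shows the principle):

# Two users with the same password get the same unsalted MD5 hash,
# so one precomputed (rainbow) table cracks both at once.
echo -n 'password123' | md5sum
echo -n 'password123' | md5sum   # identical output

# With a random per-user salt the stored hashes differ even for the same
# password, so precomputed tables are useless; store the salt next to the hash.
salt=$(openssl rand -hex 8)
echo -n "${salt}password123" | md5sum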

Why is this so pointless?

I'm not saying it was a bad thing to do back then; I'm saying it's a bad thing to still be doing now, because things have moved on. I take my passwords quite seriously, and as time goes on I move more and more websites into KeePassX, where I have no idea what the password is. It would not be hard to guess or work out most people's usernames for websites; I'll give you a clue, it's normally their email address or some other UID like your EBay user ID. So right away I can get everyone's user ID, but I shouldn't be able to break their password. The problem comes if I crack your password on an insecure site: as you may recall from this earlier post, I don't have to know your password, I just have to know a string that generates the same hash, which is why salts are important. So going back to EBay, let's say I pick a random EBay user, my-pet-frog. I found them by searching for "wibble" on EBay, I found this listing, and what's on this page…

mypetfrog

So I now have their email address, or at least a couple more leads to try. So again, what is the point of the original security put in place in 2004, when the real solution is to educate users and to implement actual security rather than security theatre?

Summary

So I ask you, EBay, to implement actual security rather than theatre and, more importantly, to let me change my sodding email address.

Now, as for my-pet-frog, I feel bad. Hopefully they will read this and see that they should not share those details on EBay because of "security concerns". But why shouldn't they? All EBay users should instead insist that EBay implements actual security, so users can use the system properly without having to make their email addresses public because of security theatre and a lack of education from EBay to its users. Anyway, as I was bad and used my-pet-frog as an example, I hope to go some way towards compensating them.

Please check out their EBay shop or their Amazon storefront or, better yet, their actual website, Hotscamp.com; there really are some awesome T-shirts on there, and among my favourites are this Back to the Future one and this Portal one.

I do have a massive transcript of the conversation I had with EBay customer support about this issue, but it is largely irrelevant other than to show that they are tied by the same system and that they were helpful. EBay did graciously allow me to write a letter of complaint to their complaints department, but that was too old fashioned for me, so they get a blog rant instead. However, if you would like to print this blog and send it to their complaints department, here are the details:

Complaint Department 
P.O. Box 9473 
Dublin 15 
Ireland 

Enjoy!

Flexible monitoring, going up and down

The other day…

I wrote a post the other week about how much monitoring sucks, and a number of people on the internet (hello, people) just didn't get it, so I thought more detail would be good. One point that was raised was about the scaling up and down of servers and how that affects the monitoring platform. I wanted to cover this specifically, as it is an important topic for understanding why I said I think Dataloop.IO is the answer.

Nagios + Puppet

Let's look at a typical Puppet / Nagios approach. Puppet has the concept of exported resources: an exported resource can be collected by another server and then actioned. A neat trick is to have a manifest that describes a webserver and looks like this:

# /etc/puppetlabs/puppet/modules/nagios/manifests/target/apache.pp
class nagios::target::apache {
   @@nagios_host { $fqdn:
        ensure => present,
        alias => $hostname,
        address => $ipaddress,
        use => "generic-host",
   }
   @@nagios_service { "check_ping_${hostname}":
        check_command => "check_ping!100.0,20%!500.0,60%",
        use => "generic-service",
        host_name => "$fqdn",
        notification_period => "24x7",
        service_description => "${hostname}_check_ping"
   }
}

The double @ tells Puppet to send this resource to the Puppet database, where something looking for it can pick it up later; that is all the configuration needed to define a host and add a ping check. Once the resource is exported it sits there until it is collected, and the collection looks like this:

# /etc/puppetlabs/puppet/modules/nagios/manifests/monitor.pp
class nagios::monitor {
    package { ['nagios', 'nagios-plugins']: ensure => installed }
    service { 'nagios':
        ensure => running,
        enable => true,
        #subscribe => File[$nagios_cfgdir],
        require => Package['nagios'],
    }
    # collect resources and populate /etc/nagios/nagios_*.cfg
    Nagios_host <<||>>
    Nagios_service <<||>>
}

The spaceship operator (<<||>>) tells Puppet to collect that kind of resource from the exported resources in the Puppet database, in this case resources of type Nagios_host or Nagios_service. This is cool: it means a server that previously had no information about another can now do something useful with the specific information that server provides. It is a good fit for adding new hosts or service checks to Nagios, so let's look at how you remove them next:

N/A

Seriously… If you want to remove one, you have to do the following: reconfigure the host in Puppet so it no longer exports the resources, then purge the database of the previous exports, then re-run Puppet on the Nagios server to re-add all the resources again except the one you removed… sounds fun. You could probably make it work if you knew the server was going to be shut down. If you don't believe me, see this. That's as good as it gets, sorry.
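
For the curious, that clean-up roughly amounts to a decommissioning script like the sketch below. It assumes PuppetDB is the storeconfigs backend, that it runs on the Puppet master, and that the hostnames are placeholders, so treat it as an outline rather than a drop-in solution.

#!/usr/bin/env bash
# Rough sketch of removing a decommissioned host's exported Nagios resources.
# Assumes PuppetDB as the storeconfigs backend; run on the Puppet master.
# Hostnames are placeholders.

node_fqdn="$1"   # e.g. web03.example.com

# 1. Deactivate the node so PuppetDB stops handing out its exported resources.
puppet node deactivate "$node_fqdn"

# 2. On the Nagios server, throw away the generated config, let Puppet rebuild
#    it from the exported resources that remain, then reload Nagios.
ssh nagios.example.com 'sudo rm -f /etc/nagios/nagios_*.cfg && sudo puppet agent --test; sudo service nagios reload'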

The real problem

With the uptake of utility-based computing, servers come and go and we should no longer be precious about them. I always give the same answer when someone in the team asks what we should call the new server.

These are farm animals, not pets

What do I mean by that? Well, I don't care what it's called or even whether it exists; if it causes me any problems I will shoot it in the head and get a new one. Let's look at webservers in an auto-scaling group: I sometimes have 3, sometimes 3000. Trying to manage that flexibility in Puppet will work for scaling up, and I'm sure there's a way to manage the scale-down (if anyone has one, I'd be interested in hearing it).

So why is Dataloop.IO better? Well, I think it's better because I can draw a simple hierarchy in the web UI, take a tag, say 'web', and add it to the 'web servers' service. When I install Dataloop.IO using Puppet or Chef or the setup.sh method, I have to provide a few details: an API key and an optional tag or list of tags. So, assuming the configuration is done correctly, there will be a 'web server' role that all web servers collect from; I just put the tag in there and, hey presto, the server(s) connect to Dataloop.IO in the right container and then download all of their checks. Let's cover a few examples:

name "web"
description "Web server Role for configuring servers"
run_list(
  'recipe[apache]',
  'recipe[dataloop]'
)
default_attributes({
  "dataloop" => {
    "agent" => {
      "api_key" => "someapikey",
      "tags" => "web"
    }
  }
})

I deliberately made this more verbose than it needs to be; in reality Dataloop.IO should be included in a base role, with a simple override of the tags attribute here. The above is the entire configuration needed to have servers dynamically add all of their checks, spin up and down, and de-register themselves from the central service as needed, so you only have servers in Dataloop.IO that are turned on. "So what happens when the power is yanked?" I hear you cry. Well, you get an alert, as you'd expect; it is only when the server is shut down cleanly, rather than having the power cord yanked, that it de-registers.

Let's look at the bash equivalent. Let's say you need a server to have monitoring on it in the next 5 seconds:

sudo curl -s https://download.dataloop.io/setup.sh | bash -s <API_KEY> web

That achieves the same as the Chef example above. Because the configuration of the monitoring is done in Dataloop, the agents are all simple; they just need some auth to connect back in (the API key), and from there you can either drag them into service groups, add tags or add whatever plugins you need. If you tag the group and apply the plugins to the tag, then as long as that tag is specified the agent will get all the relevant plugins. You can also layer as many of these tags on top of each other as you like; the agent will just work it out in real time.

Summary

Yes, you can scale dynamically up and down with Nagios and Puppet or Chef, but most of these tools rely on servers being on all the time, i.e. they are not cloud centric but enterprise focused, where they still name their pets… Dataloop.IO doesn't come with that sort of baggage: no firewall rules, quick and easy to set up and use, as it should be. If you're still not convinced, I understand; watch this video first:

Monitoring sucks, really

Have you noticed…

In short, all monitoring out there sucks. I promised a few months back to do a review; I was wrong, it is not possible. Consider reviewing an industry standard tool like Nagios: after several hours of installing I might have a server set up, but not in config management and with no users or servers to monitor… This is why these kinds of on-premise apps will die out.

Who wants to spend weeks working out the configuration and management of a system that is meant to make your life easier? Monitoring tools are, very simply put, meant to let you know whether server X is on or off; more advanced details, like service X being on or service Y being off, come later.

The basic monitoring life cycle should go like this:

Day 1: is the server on or off?
Day 2: are the services I care about running?
Day 3: in X days, Y may happen

These three things are important to monitoring; they give you some predictability in your service, so the sooner you have them the better. A good monitoring tool is one that lets you answer these questions as quickly as possible, from the time you purchase or download it to the moment it's on your server; quicker is better!
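
Day 1 and Day 2 really can be that simple; a minimal, hypothetical sketch (the host and service names are made up) looks like this:

#!/usr/bin/env bash
# Day 1: is the server on? Day 2: is the service I care about running?
host="web01.example.com"   # illustrative host
service="nginx"            # illustrative service

if ping -c 1 -W 2 "$host" > /dev/null 2>&1; then
    echo "OK: $host is up"
else
    echo "CRITICAL: $host is not responding to ping"
fi

if ssh "$host" "systemctl is-active --quiet $service"; then
    echo "OK: $service is running on $host"
else
    echo "CRITICAL: $service is not running on $host"
fi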

Bang for buck

I am acutely aware that monitoring tools that promise the world cost arms, legs, souls and pride, and worse yet fail to deliver anything of value that you need. In the past I have seen a £100k HP OpenView system replaced in a couple of weeks by Nagios, and I've seen Nagios + Munin replaced by Opsview because it is easier to manage and configure than both individual tools. For those who don't know, Opsview is a nice front end and configuration layer for Nagios.

I have even, unfortunately, seen £2k a month wasted on 10 servers with New Relic. I guess the point is that monitoring costs anything from free to ridiculous; the key questions are always: what does it do for you? Does it make your life easier? Can you work quicker with or without your monitoring tool?

On a side note, New Relic's product is awesome, but if you are not using Java, why bother? If you are, you may find, like me, that your engineers find it useful but not irreplaceable. All I can say is that it wasn't as good as Nagios for alerting on and monitoring the hosts, but it was definitely better at the application.

Where is the happy ground? You need something as configurable as Nagios and as cheap as Nagios but, most importantly, not Nagios, and this leaves you in an awkward position.

Nagios is awesome and has some cool features, good support, many plugins and so on. However, the server doesn't scale easily, configuration is not as simple as it should be and, quite frankly, the web UI looks like a child vomited hatred on it; just plain ugly. So you naturally lean towards Opsview, which takes away the configuration hassle of Nagios by providing Puppet modules and decent web UI configuration, but now you have to pay. Is it worthwhile? Definitely, it's better than Nagios, but that isn't good enough, is it? Certainly it's a step in the right direction, but it's not the killer tool.

Likewise, New Relic was meant to be that killer tool, designed for devs by devs. So, in short: complicated, non-standards-compliant and lacking in OS monitoring. So what is a sysadmin to do? Give up? I think not.

It comes down to this: you install tools like Opsview or CheckMK because they at least give you a better interface, but they don't solve the issues of NRPE or of firewall rules having to be opened in all directions. It's for this reason that I think there has to be a better way; I don't want to spend my time opening up rules, I want something simple and powerful.

There are new tools coming onto the market that sound better to me. Imagine being able to leverage the Nagios community while having an easy-to-drive UI on a monitoring tool that gives you the same power as Chef's knife or Puppet's MCollective, while being able to update all of this through simple git commits or the web UI as you see fit. Writing a new monitoring check is done during the analysis process rather than going into the backlog, or you can simply use the RPC nature of the tool to debug issues in prod and write checks on the fly. Did I mention that while doing all this it can also act like Pingdom and provide dashboards to management?

So where does this leave us? Well, looking to tools like Dataloop.IO for solutions. I have had the privilege of using it while it is in closed beta, and they have been really good at taking on feedback to make it the monitoring platform I need it to be. It is getting close to being ready, and I'm genuinely excited about what is going to happen to this platform over the next year or two.

Foundation building is important

The man who built his house on sand

You are probably familiar with the proverb about the man who built his house on sand; if not, read this. It's important to have a solid foundation to work from when you want to start considering Continuous Delivery (CD) or Continuous Integration (CI).

From an IT perspective, this would be like a CTO dictating that CD is the only way to do things, which, when poorly managed, leads to something that is poorly tested, poorly structured and hard to innovate on. By the time the pessimistic IT bod has mentioned it to his boss and it has been turned into management speak, then translated into senior management speak, it ends up mistranslated into something completely different.

IT bod “It’s taking ages because the puppet manifests are a complete mess where we had to keep rushing stuff”
IT Bods’ Manager “It’s taking longer than expected as the work is more complicated but it will be done soon”
IT Director “We are spending our time making sure we do this right, we don’t cut corners”
CTO “We have a really stable well produced system”

Yay. I'm 90% sure this is how it works… People become afraid to say how bad it is, but from experience I can honestly say that when you start telling people bluntly they stop hassling you; they also stop talking to you, so it is a hard thing to make better, and it's harder still when the whole chain of people desperately wants to come across as having done an awesome job.

Imagine that situation, and add in people who are brought in to deliver just that while being asked to do lots of other stuff that isn't in scope: you can end up with something that, with lots of careful hand holding, produces a build. Maybe it even builds an environment with only 2 or 3 hours of hand holding; maybe it's good enough for production, using VirtualBox. Who knows.

Typically these nightmarish situations exist only because someone wasn't clear in defining what the problem was, or, when they were, they allowed themselves to be pushed over. Well, I'm saying it's not good enough: everyone in the chain has a responsibility to make sure they communicate, clearly and in no uncertain terms, what the problem is, so there is no ambiguity about how bad a situation is.


The latest trend at the moment is all towards Continuous Delivery (CD) and Continuous Integration (CI) and all these other wonderful DevOps words. Although it is possible to take code and deploy it automatically, it is stupid to do so without a sufficient understanding of what the consequences could be. As such, it is important to identify what you need to be able to deliver effectively before working out what you need to do to achieve CI or CD.

So before considering CD or CI you need to be able to do the following things, at a minimum:

  • Easily differentiate between each configuration release
  • Easily differentiate between each infrastructure release
  • Easily differentiate between each application release
  • Be able to build each application server from scratch
  • Be able to build the infrastructure from scratch
  • Be able to track work through a process i.e. request to release for new Infrastructure, Application code or configuration
  • Have an agreed process for peer review of changes
  • Have an agreed release process
  • Be able to manually follow the processes that are in place
  • Adequate test coverage of infrastructure
  • Adequate test coverage of Configuration
  • Adequate test coverage of Application

Once you have those basics in place you can start to look at automating each step; skip the list at your peril. Let's touch on a few for clarity's sake. "Easily differentiate between each XXX release": the reason for these is that at some point someone will say "it's not working and you broke it", and you want to turn that from an opinion-based argument into a factual one. The easiest way to do that is a simple diff between the previous and the current release; no ambiguity, only facts.
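
If your releases are tagged in version control (an assumption, but a cheap one to satisfy), that diff is a one-liner; the tag names and path below are invented for illustration:

git diff --stat release-1.4.0 release-1.5.0        # what changed between releases, at a glance
git diff release-1.4.0 release-1.5.0 -- puppet/    # the full diff of the configuration only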

Let's look at "Be able to build XXX from scratch". This is really important: the only way to guarantee that your box is in the state you think it is in is to build it from scratch. Use a golden disk, an AMI or a plain OS, it doesn't matter, as long as you bring that box up from scratch and build it through to a working state, hands off. I've had conversations with people who don't get it; sometimes the argument goes like this: "We don't need to because everything is in Puppet." Well, lies… No one puts everything in Puppet, and even if you did, I logged on and stopped the process, or I installed a package that wasn't in Puppet, or I started a service, or I changed a file that wasn't managed, etc. etc. No excuses, build from scratch; it's really important for the message it sends to the rest of the business, which is consistency through process.

Processes are important: they describe the things you will and won't do, they need to be public, they need to be really simple, and then they can be automated. Starting without a process just means re-working steps as others in the business turn out to have different opinions about how it should be done, so it's good practice to sort that out as soon as possible.

The last set, "Adequate test coverage of XXX", needs to be in place beforehand. These tests will become your computerised approver, so at the very least they should do everything a human counterpart does to check the system, and they need to evolve over time to include more and more tests. Once the confidence is in the testing, it shouldn't matter when you release or how often, because you have a set of tests that you and the business trust.
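
A computerised approver can start life as nothing more than the checks a human would run by hand after a release, scripted; a minimal sketch with made-up URLs and endpoints might look like this:

#!/usr/bin/env bash
# Minimal post-release smoke test: the same checks a human approver would run
# by hand. URLs, endpoints and the version argument are made up for illustration.
set -euo pipefail

release="$1"                              # e.g. 1.5.0
base_url="https://staging.example.com"

curl -fsS "$base_url/healthcheck" | grep -q '"status":"ok"'   # app reports itself healthy
curl -fsS -o /dev/null --max-time 5 "$base_url/"              # home page answers promptly
curl -fsS "$base_url/version" | grep -q "$release"            # we deployed what we meant to

echo "Smoke tests passed for release $release"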

Summary

It's important not to rush into the final solution; everybody wants it, and it's everyone's responsibility to check and cross-check that the process is being followed sensibly and to call foul if anyone tries to change the process or the requirements. The only way to do this is with some sort of consistency, and that should be the driving force. The business needs to accept that if the pipeline is broken the releases don't happen, but when the pipeline is fixed they should all go fine. This turns the whole release cycle into a maintenance process rather than active involvement in each release, and over time it will become more and more stable and beneficial to the business as a whole. So before trying to do CD or CI, make sure you can put ticks next to the bulleted list above, else you're just wasting time.
