Why is the UK Government struggling with IT?

Recently I’ve had an insight into how the government is conducting its IT. It’s eye-opening. If I had to sum it up, it’s like applying 1990s approaches to current practices. But what does that mean? Let me explain.

Many moons ago, the UK government decided it was not an IT specialist and that it should not be in the business of running IT. This led to it outsourcing a lot of its core IT, often to large organisations, who on the whole I imagine did make it much more cost-effective, most likely at the cost of efficiency. Fast forward, and over the years there have been many changes to encourage open source technologies and agile methodologies, no doubt to address the inefficiencies that have crept in. It was probably not predictable that if you outsource it there’s no incentive for the supplier to do it efficiently. Even now, with the new technologies, there are mindset issues holding back the benefits of a DevOps-style approach.

What do I mean by a DevOps-style approach? I mean that people should be encouraged to use evidence to drive decisions, automate the process, be encouraged to fail, and measure the results against expectations. Based on what I’ve been able to observe over 2 months, there are a number of issues stopping the delivery of true efficiency and agility.

  1. Lack of technical management
  2. Too many consultancies / IT outsourcing companies
  3. Mentality and approach barriers

Let’s tackle the causes of these and how to fix them one at a time.

1, Due to the outsourcing approach and the idea that government does not do IT, the technical decisions are being made by people in the government who are not having the problem adequately explained to them. This is happening because the consultancies do not wish to harm their business by being seen as ‘difficult’ or by saying “No” to release schedules. It also prevents a clear strategy from being defined for the entire department to work towards, as there is no central guidance for the consultancy groups to adhere to. In 2 months I have not met anyone in technology within the department I’m working in who is in a position above me.

2, Due to the preference these days for using small consultancies and bringing in subject matter experts in various fields, there is a lack of cohesion between styles and approaches to delivery. Everyone is pulling in slightly different directions. This is made worse by the lack of central technical leadership, which is why that is the biggest problem that needs fixing.

Stop trying to change things

3, I am not lying when I say I have been told every week for 8 weeks “Stop trying to change things”. There is a culture that has developed where if you speak out you are seen as difficult. If you suggest improvements that require sign-off or budget, it does not happen, because people are scared to approach the subject as they will have to justify how it speeds up delivery; no one cares about cost savings or efficiencies. People can and will change their outlook when it can be evidenced that change is benefiting the process, but that requires point 1 again: good leadership and the ability to explain why fixing things up is worthwhile.

I’m not saying this is easy; there are payment deadlines that have to be met. It was pointed out to me that when government says it is going to do something it really must, no ifs or buts. Any failure is political cannon fodder and the general public are entirely scathing when things don’t happen on time. This makes it harder than it would typically be; however, not impossible.

As a UK taxpayer I’m utterly shocked at how bad it really is. I had suspicions based on what I’d heard or read, but I wasn’t expecting people to be so worn down and unwilling to change, or for the consultancies to be so scared of speaking up, but I guess when you have a revenue stream dependent on the government you wouldn’t want to cut that off. Ultimately my concern is that new technology will be implemented without fixing the root cause of the issue: the culture and organisation. No doubt if anyone with any power were to read this they would insist that checks and measures are in place to ensure this cannot happen; the reality is the people deciding are not qualified to know if it is a good deal or not.

How to do DevOps in an enterprise

Overview

Over the years we have seen fads come and go and trends ebb and flow as business requirements and drivers change. DevOps has, however, been rather resilient, so much so that all of the enterprises believe they now need it. The interesting point here is that they still don’t know why; they just know everyone else has it and if they don’t, they’re behind the times.

In some ways this in itself is worrying, because the enterprises need help and guidance. They really need to be reading through this book: Next Gen DevOps: Creating the DevOps organisation

There’s an increasing trend to have the DevOps team literally sit between Dev and Ops; companies doing this should stop. Here are a couple of things not to do, with examples.

What Not to do

Sky: see the full post; assuming that is no longer there, see this pic:
[Image: Screen Shot 2015-03-25 at 22.48.49]

Defra: I know from people that work there and various jobs that they implement the same mechanism: a team makes the app, a team builds the app and packages it, and another team deploys the app. So many broken processes; I feel for them, but it is ultimately what has led to these: 1 & 2

What both of these enterprises have done is jump on a bandwagon and somehow understand DevOps to be the thing that makes deliveries smoother, and therefore sit it between the Dev and the Ops, mainly because they have an existing Dev team that works and an existing Ops team that works but they don’t have a DevOps team.

[Image: the wall of confusion between Dev and Ops]

Let’s examine this: Dev on one side, Ops on the other. DevOps is meant to, and I’ll spell this out simply for Sky and Defra… break down the wall of confusion. It is not meant to be, as these guys have done, a team implemented between a Dev team and an Ops team to help make things more agile.

If anyone, anyone… can explain how adding a team in the middle of your existing Dev and Ops teams magically brings you improvements in agility, I’d be interested in knowing how. Before you say something like… “They make all the tooling and stuff to make it all quick and to deploy continuously”, you should refer to the book from before and go get a job in DevOps (you’re not in it by definition, sorry). We’ve always had build engineers for that, so your argument is basically to do what we always did but with more people who are slightly less code focused. Alright…

Then there are the people that think simply by making developers do operations they get DevOps. Infrastructure as code, they say, yay. I’m not saying developers cannot learn operations, and likewise this rant applies to ops doing dev… anyway, the point is forcing one skill set to do the other is hard.

What to do

Firstly realise there is no right answer. Secondly realise experimenting and trying new things is necessary.

Start with a good foundation. Get a Senior Dev and a Senior DevOps guy in the team. For the DevOps guy I consider these the 3 most important things:

1, Can program in an object-oriented language and understands classes, modules, inheritance, recursion, etc.
2, Has a fundamental understanding of how the OS works at the filesystem and process level
3, Appreciates pragmatism over perfection

The hard thing for anyone going from Dev to Ops or Ops to Dev is that they have a massive learning curve, and they may not realise what it is until many screw-ups have happened. Having these people in the team is massively important because they will bridge the gap between the Dev and the DevOps, and trust me, there is one.

Once you get these 3 or 4 people in a team, the next most important thing to do is to make that team fully accountable and responsible for the end-to-end service. If anyone in the team is not okay with supporting the service they wrote and designed in production, fire them and find someone new. It’s paramount to have buy-in within the team for accountability of the product and services in production; it forms a fundamental pillar of continuous feedback. I did something bad, it bit me / my team, they told me I did it bad / I noticed I did something bad, I made it better, we all benefited.

Summary

Having the right people working in a close-knit group and empowering them to own the whole solution front to back is the only way you can realistically implement DevOps and have meaningful gains.

Foundation building is important

The man who built his house on sand

You are probably familiar with the proverb about the man who built his house on sand; if not, read this. It’s important to have a solid foundation to work from when you want to start considering Continuous Delivery (CD) or Continuous Integration (CI).

From an IT perspective this would be like a CTO dictating that CD is the only way to do things; which when poorly managed leads to something that is poorly tested, poorly structured and hard to innovate on. By the time the pessimistic IT bod mentioned it to his boss and it was turned into management speak, then translated to senior management speak it ended up being mistranslated into something completely different.

IT bod “It’s taking ages because the puppet manifests are a complete mess where we had to keep rushing stuff”
IT bod’s manager “It’s taking longer than expected as the work is more complicated but it will be done soon”
IT Director “We are spending our time making sure we do this right, we don’t cut corners”
CTO “We have a really stable well produced system”

Yay. I’m 90% sure this is how it works… People become afraid to say how bad it is, but from experience I can honestly say when you start telling people bluntly they stop hassling you. They also stop talking to you, so it is a hard thing to make better, and it’s harder still when the whole train of people desperately want to come across as having done an awesome job.

Imagining that situation, and adding in people that are brought in to deliver just that while being asked to do lots of other stuff that isn’t in scope, you can end up with something that with lots of careful hand-holding produces a build. Maybe it even builds an environment with only 2 or 3 hours of hand-holding; maybe it’s good enough for production using VirtualBox. Who knows.

Typically these nightmarish situations exist only because someone wasn’t clear in defining what the problem was, or when they did they allowed themselves to be pushed over. Well, I’m saying it’s not good enough. Everyone in the chain has a responsibility to make sure they communicate in clear and no uncertain terms what the problem is, so there is no ambiguity about how bad a situation is.

The latest trend at the moment is all towards Continuous Delivery (CD) and Continuous Integration (CI) and all these other wonderful DevOps words. Although it is possible for you to take code and deploy it automatically, it is stupid to do so without a sufficient understanding of what the consequences could be. As such it is important to identify what you need to be able to deliver effectively before working out what you need to do to achieve CI or CD.

So before considering CD or CI you need to be able to do the following things, minimum:

  • Easily differentiate between each configuration release
  • Easily differentiate between each infrastructure release
  • Easily differentiate between each application release
  • Be able to build each application server from scratch
  • Be able to build the infrastructure from scratch
  • Be able to track work through a process i.e. request to release for new Infrastructure, Application code or configuration
  • Have an agreed process for peer review of changes
  • Have an agreed release process
  • Be able to manually follow the processes that are in place
  • Adequate test coverage of infrastructure
  • Adequate test coverage of Configuration
  • Adequate test coverage of Application

Once you have those basics in place you can start to look at automating each step. Skip the list at your peril. Let’s touch on a few for clarity’s sake. “Easily differentiate between XXX”: the reason for these is that at some point someone will say “it’s not working and you broke it”, and you want to turn that from an opinion-based discussion into a factual one. The easiest way to do that is a simple diff between the previous and the current release: no ambiguity, only facts.
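As a rough illustration of that “simple diff”, here is a minimal Ruby sketch. It assumes each release can be flattened into a hash of settings; the `release_diff` name and the sample keys and values are made up for this example and do not come from any particular tool.

```ruby
# Hypothetical helper: compare two releases that have been flattened into
# simple key => value hashes (config settings, package versions and so on).
def release_diff(previous, current)
  (previous.keys | current.keys).sort.each_with_object({}) do |key, changes|
    before = previous[key]
    after = current[key]
    # Only record keys whose value actually changed (missing keys read as nil).
    changes[key] = { before: before, after: after } unless before == after
  end
end

# Made-up sample data, for illustration only.
previous_release = { 'app_version' => '1.4.2', 'max_heap' => '2g' }
current_release  = { 'app_version' => '1.5.0', 'max_heap' => '2g', 'log_level' => 'debug' }

release_diff(previous_release, current_release).each do |key, change|
  puts "#{key}: #{change[:before].inspect} -> #{change[:after].inspect}"
end
```

The point is less the code than the habit: if every configuration, infrastructure and application release can be reduced to something diffable, “you broke it” becomes a question with a factual answer.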

Let’s look at “Be able to build XXX from scratch”. This is really important: the only way to guarantee that your box is in the state you know it to be in is to build it from scratch. Use a golden disk, an AMI or a plain OS; it doesn’t matter as long as you bring that box up from scratch and build it through to a working state (hands off). I’ve had conversations with people that don’t get it; sometimes the arguments go like this… “We don’t need to because everything is in puppet”. Well, lies… No one puts everything in puppet, and even if you did, I logged on and stopped the process, or I installed a package that wasn’t in puppet, or I started a service, or I changed a file that wasn’t, etc. No excuses, build from scratch; it’s really important for the message this sends to the rest of the business, which is consistency through process.

Processes are important: they describe the things you will and won’t do, they need to be public, they need to be really simple, and they can then be automated. Starting without a process is just going to mean re-working steps as others in the business have different opinions about how it should be done, so it’s good practice to sort that out as soon as possible.

The last set is “Adequate test coverage of XXX”. This needs to be in place beforehand: these tests will become your computerised approver, so at the very least they should do everything the human counterpart does to check the system, and they need to evolve as time goes on to include more and more tests. When the confidence is in the testing it shouldn’t matter when you release or how often, as you have a set of tests that you and the business trust.
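To make the “computerised approver” idea a little more concrete, here is a small Ruby sketch. The `Check` struct, the `approve_release` method and the placeholder checks are all hypothetical; real checks would exercise the application, configuration and infrastructure.

```ruby
# A named check wraps a block that returns true when the system looks healthy.
Check = Struct.new(:name, :probe)

# Hypothetical "computerised approver": run every check and only approve the
# release when none of them fail. A check that raises counts as a failure.
def approve_release(checks)
  failures = checks.reject do |check|
    begin
      check.probe.call
    rescue StandardError
      false
    end
  end
  { approved: failures.empty?, failures: failures.map(&:name) }
end

# Placeholder checks; these lambdas just return fixed values.
checks = [
  Check.new('homepage responds', -> { true }),
  Check.new('config file parses', -> { true }),
  Check.new('disk has free space', -> { false })
]

result = approve_release(checks)
puts result[:approved] ? 'release approved' : "release blocked by: #{result[:failures].join(', ')}"
```

Every check a human would have done by eye becomes another entry in the list, which is how the approver evolves over time.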

Summary

It’s important to try not to rush into the final solution. Everybody wants it, but it’s everyone’s responsibility to check and cross-check that the process is being done sensibly and to call foul if anyone tries to change the process or the requirements. The only way to do this is with some sort of consistency and that should be the driving force; the business needs to accept that if the pipeline is broken the releases don’t happen, but when the pipeline is fixed they should all go fine. This turns the whole release cycle into a maintenance process rather than an active involvement in each release, and that will over time be more and more stable and beneficial to the business as a whole. So before trying to do CD or CI, make sure you can put ticks next to the bulleted list above, else you’re just wasting time.

Releasing your first DevOps Application

First the worry

When it comes to releasing the first version of an application it’s always worth weighing up the constraints of your environment and the time frame in which the task was delivered versus the skill set available. Inevitably as a skilled DevOps professional you want to do a good job, well done you; however you have to be strong and realise it is not about delivering perfection from day one but about the journey you must take to get there.

I recall the first deployment I did for a version 1, and every time I have done one since then I have got better, be it a bit more focused or a better starting point. The very first one I did was all over the place: no real configuration management and quite a few manual steps, but a well written process. Unfortunately that project remained in the depths of secrecy and I ended up moving on.

Constantly I see over-engineering and complication added to projects, and the root cause of this is worry. I know, I used to be there doing it. It is difficult to step back and be objective about what the business needs, but as a DevOps professional that is your job. When delivering a solution try and remember these things to help you worry less and focus more:

  1. Before being perfect you must first just “be”
  2. When in doubt, do less
  3. If you do not know when the site is down you will not have a job
  4. Always have a backup

Then the delivery

The above list is rather useful; use it as a bit of guidance. Starting with point 1, some elaboration: when delivering a solution the most important thing is to deliver the solution. So many people forget this part and focus on the technicalities or whether or not it is the “best” way to deliver the solution. In reality, no one will care about that when you are in that meeting explaining why you’re late and have not got a working solution.

Getting stuck in the detail is a horrible place to be. Sometimes it gets too involved or too complicated, leading to much discussion, and inevitably the solution comes out complicated and will take a while to deliver. In these situations point 2 comes in: just do less. It sounds silly, but if you’re rushing around struggling to meet a deadline then you need to take things out of scope and focus on what the actual solution needs to be. Maybe you have to have a manual step; at a later point you can automate it.

The last two points are along the same lines, and those lines are things that get you fired. If your site is down and you don’t know that it is totally down, that’s a bad thing; likewise losing data is considered pretty poor. However, do not get stuck in the trap of assuming you must have full monitoring of every server or that the backup needs to be anything more than a cron job for now.

The “trick” is always around identifying what needs to be done and what could be done; by focusing on what needs to be done first you can then come back to improve the rest.

build, improve, rinse, repeat

As touched on earlier, you are allowed to cut corners and focus on what is necessary; failure to do this will just lead to delays and a business that is getting rapidly turned off DevOps. The first release you do can be complete and utter crap. It can be all manual, with nothing more than a simple web check on port 80; that is okay. The important thing is that you deliver to the deadline and that you have mitigated the main risks of not knowing when the site is down and the potential loss of data. Heck, even single points of failure are allowed as long as you can clearly identify what the risk is and have a solution if that were to happen. In fact, I’d almost go as far as to say this is expected.
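A “simple web check on port 80” really can be this small. The sketch below uses only Ruby’s standard `net/http` library; the `site_up?` name and the five second timeout are assumptions for illustration, and alerting is left out entirely.

```ruby
require 'net/http'
require 'uri'

# Returns true when the URL answers with a 2xx or 3xx status and false on any
# error (connection refused, timeout, DNS failure, 4xx/5xx and so on).
def site_up?(url, timeout: 5)
  uri = URI(url)
  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == 'https',
                             open_timeout: timeout,
                             read_timeout: timeout) do |http|
    http.get(uri.request_uri)
  end
  response.is_a?(Net::HTTPSuccess) || response.is_a?(Net::HTTPRedirection)
rescue StandardError
  false
end
```

Something like `site_up?('http://localhost/')` called from a cron-driven script that emails you on `false` is crude, but it covers point 3 of the list for a version 1; proper monitoring can come later.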

The key is as always to improve, little and often. Step 1, manual; Step 2, automate what is easy; Step 3, automate the rest. It has never been and will never be about perfection from version 0.1 onwards; you just need to improve a little each time in line with that golden view of what perfection is. As long as you know what the end goal is you can work towards it, just don’t get carried away by trying to deliver it all for the first version.

Sinatra – partial templates

Singing a different song

Firstly apologies, it’s been over a month since my last blog post but unfortunately with holidays, illness and change of jobs I’ve been struggling to find any time to write about what I’ve been doing.

A few months back I did a little research into micro web frameworks and did quite a bit of reading around Sinatra, Bottle and Flask. To be honest they all seem good, and I want to play with Flask or Bottle at some point too, but Sinatra is the one I’ve gone for so far as the documentation was the best and it seemed the easiest to use and the easiest to extend, not that I’ll ever be doing that!

Either way I thought I’d have a bit of a play with it and see if I could get something up and working and, locally for now due to a lack of documentation from my hosting provider, I have re-created my website Practical DevOps within Sinatra using rackup, bundler and ERB templates.

Now there are a few reasons I did this: one, I wanted to stop updating every page whenever I needed to update the header or footer of a page; two, I want to implement split testing (also known as A/B testing) using something like Split. With all of this in mind and the necessity of having a bit more programmatic control over the website it seemed like a good idea to go with Sinatra.

The Basics of Sinatra

By default Sinatra looks for static content in a directory called “public” and will look for templates in a folder called “views”, both of which can be configured within the app if needed. So for basic sites this works fine and would have worked fine for me, but I really wanted partial templates to save having to enter the same details on multiple pages, and this can be done with Sinatra partial.

#!/usr/bin/ruby

require 'sinatra'
require 'sinatra/partial'
require 'erb'


module Sinatra
  class App < Sinatra::Base

    register Sinatra::Partial
    set :partial_template_engine, :erb

    #Index page
    ['/?', '/index.html'].each do |path|
      get path do
        erb :index, :locals => {:js_plugins => ["assets/plugins/parallax-slider/js/modernizr.js", "assets/plugins/parallax-slider/js/jquery.cslider.js", "assets/js/pages/index.js"], :js_init => '<script type="text/javascript">
        jQuery(document).ready(function() {
            App.init();
            App.initSliders();
            Index.initParallaxSlider();
        });
        </script>', :css_plugins => ['assets/plugins/parallax-slider/css/parallax-slider.css'], :home_active => true}
      end
    end
  end
end

So let’s look at the above which is simply to serve the index of the site.

   register Sinatra::Partial
   set :partial_template_engine, :erb

The register command is how you extend Sinatra. The sinatra-partial gem, when it is installed, simply drops its code in the Sinatra namespace, and when you call register all of its public methods are registered; this allows you to do stuff like this with magic, or you can use it in the ERB template like this. The next line simply tells Sinatra to use ERB rather than Haml; I chose this because Puppet and Chef both use ERB and as a result I’m a lot more familiar with that.

    #Index page
    ['/?', '/index.html'].each do |path|
      get path do
        erb :index, :locals => {:js_plugins => ["assets/plugins/parallax-slider/js/modernizr.js", "assets/plugins/parallax-slider/js/jquery.cslider.js", "assets/js/pages/index.js"], :js_init => '<script type="text/javascript">
        jQuery(document).ready(function() {
            App.init();
            App.initSliders();
            Index.initParallaxSlider();
        });
        </script>', :css_plugins => ['assets/plugins/parallax-slider/css/parallax-slider.css'], :home_active => true}
      end
    end

One of the nice things with Sinatra is that it’s simple to use the same handler for multiple routes, and the easiest way of doing this is to define an array and simply iterate over it for each path. Sinatra uses the HTTP methods to define its own functions for what should happen, so an HTTP GET requires a route to be defined using the “get [path] block” style syntax, and likewise for post, delete etc.; see the Routes section of the Sinatra docs for more info.

The last section is calling the template. Typically the syntax could just be “erb :page_minus_extension”, which would load the ERB template from the “views” directory mentioned earlier. If you want to pass variables to it you pass the ‘:locals’ option, which takes a hash of variables. All of these variables are only available to the template that was called at the beginning, so getting the variables to the partial requires some work within the template.

Now within the views/index.erb file I have the following:

<%= #include header
partial :"partials/header", :locals => {:css_plugins => css_plugins, :home_active => home_active}
%>

Partial calls another template within the views directory, so as I have a partial called header.erb in views/partials/ it loads that, and by defining the locals again from within the template I am able to pass the variables from index into the header or any other partial as needed.
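If the hand-off of locals feels a bit abstract, the toy below mimics it with nothing but Ruby’s standard ERB library. To be clear, `TinyRenderer` is a made-up stand-in and is not how Sinatra or sinatra-partial are implemented; it only shows why the partial sees nothing unless the calling template passes it on.

```ruby
require 'erb'

# Toy stand-in for Sinatra's `erb` and sinatra-partial's `partial` helpers,
# using only the standard library. It is NOT how either gem is implemented.
class TinyRenderer
  def initialize(templates)
    @templates = templates # template name => ERB source
  end

  # Render a template with only the locals it was explicitly given.
  def render(name, locals = {})
    scope = fresh_binding
    locals.each { |key, value| scope.local_variable_set(key, value) }
    ERB.new(@templates.fetch(name)).result(scope)
  end

  # Callable from inside a template, like `partial` in the post.
  def partial(name, locals: {})
    render(name, locals)
  end

  private

  # A new binding per render, so locals never leak between templates.
  def fresh_binding
    binding
  end
end

templates = {
  'index'  => "<%= partial('header', locals: { title: title }) %><p>Welcome</p>",
  'header' => '<h1><%= title %></h1>'
}

puts TinyRenderer.new(templates).render('index', title: 'Home')
```

Drop the `locals:` argument from the index template and the header fails with a NameError, because `title` was only ever given to the index template.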

Okay, that’s all folks. Hopefully that’s enough to get people up and running. Have a good look at the examples in the git projects, they’re useful, and be sure to read the entire Intro to Sinatra, very useful!

What challenges you?

Over the last few weeks

I have been wondering what most people find challenging in the “modern” IT world. There’s been a recent upsurge in tools and technology that address most problems, which only leaves me to wonder what is filling that gap. What is the current big annoying problem? Maybe it’s not being able to push your architecture into multiple clouds, or having to live with the constraints of small root disk volumes; who knows? Hence the poll :)

Configuration management alone is not the answer

Everything in one place

Normally when businesses start out building a product, especially those that don’t have pre-existing knowledge of configuration management, they tend to just throw the config on the server and then forget what it is. This is all fine, it’s a way of life and progression, and sometimes just bashing it out can prove very valuable indeed, but typically this becomes a nightmare to manage. Very quickly there are 100 servers, all manually built, and it’s a pain in the arse, so then everyone jumps into configuration management.

This is sort of phase 1: everything has become too complicated to manage, no one knows what settings are on what boxes, and more time is spent working out if box 1 is the same as box 2. This leads to the need for some consistency, which leads to configuration management. The sensible approach is to move one application at a time into configuration management fully, not just the configuration files.

During this phase it is critical to be pedantic and get as much as possible into configuration management. If you only do certain components there will always be the question of “does X affect Y, which isn’t in configuration management?” and quite frankly, every time you have that conversation a sysadmin dies of embarrassment.

Reduce & Reuse

After getting to phase 1, probably in a hack and slash way, the same problems that caused the need for phase 1 happen again. There are 100 servers in configuration management and lots of environments, with variables set in the environments, in the servers and in the manifests themselves, and the questions start to become: is that variable overriding that one? Why are there settings for var X in 5 places? Which one wins? Granted, configuration management systems have hierarchies that determine what takes precedence, but that requires someone to always look through multiple definitions. On top of having the variables set in multiple locations, it is probably becoming clear that more variables are needed, more logic is needed, and what was once a sensible default is now crazy.

This is where phase 2 comes in. Aim to move 80%+ of each configuration into variables, have chunks of configuration turned on or off through key variables being set, and set sensible defaults inside a module/cookbook. This is half of phase 2; the second half, and probably the more important side, is to reduce the definitions of the systems down to as few as possible. Back in the day, we used to have a server manifest, an environment manifest and a role manifest, each of which set different variables in different places. How do you make sure that your 5 web servers in prod have the same config as the 5 in staging? That’s 14 manifests! Why not have 1? Just define a role and set the variables appropriately; this can then contain the sensible defaults for that role. All other variables would need to be externalised in something like Hiera, or you would need to push them into Facter / Ohai.
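Stripped of any particular tool, the “one role with defaults plus external data” idea looks roughly like the Ruby sketch below. The role name, keys and values are invented for illustration, and a plain hash stands in for whatever external store (Hiera or similar) you actually use.

```ruby
# Sensible defaults live with the role; the keys and values are made up.
ROLE_DEFAULTS = {
  'web' => { 'port' => 80, 'workers' => 4, 'ssl' => false }
}.freeze

# Build the final config for a node: start from the role's defaults and let
# externalised data (the Hiera-style store) override individual values.
def node_config(role, external_data = {})
  ROLE_DEFAULTS.fetch(role).merge(external_data)
end

staging = node_config('web', { 'workers' => 2 })
prod    = node_config('web', { 'workers' => 8, 'ssl' => true })

# Every 'web' node shares one definition; only the external data differs.
puts staging.inspect
puts prod.inspect
```

There is exactly one place to look for what a web server is, and one place to look for why prod differs from staging.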

By taking this approach to minimising the definitions of what a server should be and reducing them down to one, you are able to reuse the same configuration, so all of your roleX servers are now identical except for whatever variables are set in your external data store, which can now easily be diff’d.

build, don’t configure

By this point, phases 1 & 2 are done and all is well with the world, but there are still some oddities: box X has patch level y and box A has patch level z, or there’s some left-over hack to solve a prod issue which causes a problem on one of the servers. Well, treat your servers as configurable and throw-away-able. There are many technologies to help with this, be it cloud-based with Amazon and OpenStack, or maybe VMware, or even physical servers with Cobbler. This is phase 3: build everything from scratch every time. At this point the consistency of the environment is pretty good, leaving only the data in each environment to contend with.

Summary

Try and treat configuration management as something more than just config files on servers, and be persistent about making everything as simple as possible while trying to get everything into it. If you’re only going to manage the files you might as well use tars, and if that sounds crazy, it’s the same level as phase 1, which is why you have to get everything in. I realise it can seem a massive task, but start with the application stack you’re running and then cherry-pick the modules/cookbooks that already exist for the main OS components like ntp, ssh, etc.

Oh no, not java

How strange

Over the past few months I’ve been writing more and more applications to help maintain and deliver the services we run, from metric gathering to regional DR and anything in the middle. For a while now one of the developers at Alfresco has been writing a framework that makes it easier to write Selenium tests for Alfresco Share, which takes a lot of the hassle out of looking for certain elements or class IDs or updating everything if the UI changes. We have been talking about it for a couple of months, and today I decided to get some time to look at it and ask loads of silly questions about Eclipse and Maven and so on and so forth.

It took about 3 hours to get everything set up and working; most of the time was just spent learning to use Eclipse and Maven, with a walk-through of what the framework can do, how to extend it and how to do stuff with it. Considering I hadn’t done any Java for 6 years it wasn’t that bad, and within 15 mins of being left to it I had made a class that logged in and searched for content inside the repo.

One of the reasons we’re so interested in it is because as DevOps we like simple things and it takes a lot of the hassle out, it means we also get to do some complicated things with Share and we only have to worry about what we want to test or measure. All of this got me thinking about the languages we use and the problems they solve.

Right tool, Right job

Currently in our team we are using Bash, Ruby, Python and Java. Bash is simple and can achieve some good results, although it is typically quite slow. Typically if it is a short script it will end up in Bash, although we also do our orchestration in Bash and it manages the bare metal to working OS by triggering whatever apps we need or setting config.
Ruby is the language of choice for me when I need to do something that requires data to be manipulated, actions to be retried or anything that is more than procedural, and you can rely on it to do a good job in a reasonable time.
Python is new to the team; it fills a gap, which is that it’s as easy to write as Ruby but is more scalable at size. I haven’t done any Python yet so I can’t really comment, but the web app that has been built with it in a couple of weeks is quite impressive. Java is more complicated and harder to write but can offer more complex apps, though typically I’m not sure that you need to make apps that complex.
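As an example of the “retrying actions” case that tends to push a script from Bash into Ruby, here is a small sketch. The `with_retries` helper and its parameter names are made up for illustration rather than taken from any library.

```ruby
# Run the block, retrying up to `attempts` times when `error_class` is raised.
# The wait grows linearly with each failed try. Names are illustrative.
def with_retries(attempts: 3, delay: 0.5, error_class: StandardError)
  tries = 0
  begin
    tries += 1
    yield tries
  rescue error_class
    raise if tries >= attempts
    sleep(delay * tries)
    retry
  end
end

# Example: a flaky action that only succeeds on its third call.
calls = 0
result = with_retries(attempts: 5, delay: 0) do
  calls += 1
  raise 'not ready yet' if calls < 3
  "succeeded after #{calls} calls"
end
puts result
```

Doing the same in Bash is possible, but the counter, the back-off and the error handling get ugly quickly.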

So I’m not a fan of Java, mainly because I think it takes a long time to get anything of any value out of it, especially on a small task. If I had to write an application to manage backups I would not go to Java, as it’s like using a bazooka to hit a fly; likewise using Bash is like using a feather duster, whereas Ruby and Python fit nicely in the middle. Well, after today’s experience I’m glad I’m doing this in Java. I would have spent weeks making something half as good in Ruby just to avoid using Java, and I guess it’s not really that bad.

I could have wasted time doing it all from scratch or just taken what’s already written, so I stole, like any good DevOps guy would.

Summary

I’m probably going to spend some more time in Java over the next week writing something a bit more useful than today’s experiment, so hopefully I will still be optimistic about it all. Maybe I’ll remember why I don’t like Java, or maybe I’ll change my opinion. Who knows!

DevOps team DNA

Hi, this is my first post on Matt’s blog. I’ve been an avid supporter of his blogging for a while and today got an invite to contribute. So here’s my post (created very quickly before he changes his mind).

My job has always been within an operations department of software product companies. I started at a small company as ‘everything’ support and slowly drifted towards a specialisation in the more recently branded DevOps-y areas as I made my way through various acquisitions and mergers. Over the past couple of years I’ve found myself building DevOps teams. During that time I’ve discovered some of the things that work and almost everything that doesn’t work (or it feels like that :) ).

Some of the things that have worked... (for me anyway)

Obviously these are going to be quite subjective and I doubt they will work for everyone. I’ll focus mostly on what I think are the key ingredients of a successful team. Maybe some people will find it interesting. Bear in mind that this only really applies to an operations team that supports a Cloud service.

I’m not a big football fan but I can draw some parallels between football managers and DevOps teams. You don’t see Arsenal winning and losing games based on their process redesigns. I may be simplifying, and I’m sure tactics play a large part, but I believe you get quite a bit more out of a team when you have excellent players who excel in different areas. My teams tend to be 5 to 7 players nowadays and between all of us we need to cover a few areas.

The first is product knowledge. If you have a product guru in your team then you’ve got a productivity catalyst. So many aspects of our work involve investigating whether issues are product vs config, and whether we can improve things from an operational perspective that will result in the product running better. The most recent team has a Product Architect and he’s awesome. He’s on the cutting edge of ideas for the product, for AWS and for all of the supporting technologies. Having a dedicated resource to do all of this in the background is great: it means that when we automate his prototypes and release them we get the maximum benefit. Recent examples include our Public API work and the work being done on our Amazon architecture to improve speed (CDNs etc).

The second role I’ve always tried to fill is an engineer (at least one person, preferably two). Get the most senior developer(s) that you can, who know the language and build system of the product that you are supporting. You can now write the high level instrumentation that every DevOps team needs, as is true with any automation project. There is only ever so far you can go with Bash (I tend to take things beyond where they are supposed to be with Bash as it is). Ultimately having a senior developer or two buys you a massive amount of flexibility. Need a web service for something like externalised Puppet variables? You can write your own. Backup scripts not fast enough? A senior developer will make those scripts look very feeble in comparison when rewritten in their preferred language and multithreaded. I’m careful about not reinventing the wheel and will usually go off and clone something from GitHub before starting from scratch myself. But having some people who can write stuff from scratch is a major advantage. One caveat I would add for this role: hire from outside. Developers hired internally usually end up getting pulled back to work on stuff they did at the company at some stage. If you can, hire a new person and liven things up. Obviously tell the engineering teams that the hire(s) are for instrumentation in case they get worried that you want to start adding buttons to the product :)

Lastly, the sysadmins. I’d actually consider myself one of these at heart. Getting a good sysadmin can be tricky. It’s not uncommon to read 100 CVs before finding someone even remotely eligible. For a DevOps team you need a reasonably rare mix of skills: people who know Linux inside out, who can script and get excited by the latest batch of tools, and nowadays you need to throw Puppet / Chef into the mix. I have a couple of these currently and consider myself extremely blessed. Everything that we do is checked into source control (we use AWS as our data center) and this buys us a lot of things, like the ability to automate everything, reduce costs by deleting and recreating at whim, and disaster recovery. However, you pay for those things by hiring really good people, which is a cost saving in the long run once the benefits of the team start to show.

Now if you add in all of those types of role, what I’ve found works quite nicely is running the team without being too focussed on the separation of responsibility. Everyone is on call 24/7. Everyone is expected to know the product inside out (although nobody will get near the level of expertise of the Product Architect), everyone scripts (even me) and ultimately everyone will end up doing some programming tasks. You can probably see from Matt’s previous blog posts about the Metrics project that he got the chance to learn some Ruby. I think it’s important that everyone knows a bit of everyone else’s job, although when under pressure everyone naturally drops back into doing what they are good at to speed things along.

This probably looks a little odd from the outside. But it makes things fun, everyone stays engaged and ultimately we all share the same goal: scale to 1 million users :D

Sysadmins in a Developers world

It’s all back to front

Well it was about 9 months back when I was touching on Developers in a sysadmin world and my initial thoughts were along the lines of we are better at different tasks, and after spending a week doing only development I am of the same opinion still.

Over the last 6 months we have had our solitary developer coding away making great things happen, predominantly developing a portal that allows us to deploy environments in 15 mins vs the 2 days it took before. The whole thing is very pretty; it even has its own Favicon.ico, which we are all pleased about. In addition to just deploying, it also allows us to scale the environments it creates up and down. Despite constant interruptions it is coming along really well, and in the next month we will be providing it as a service to the engineering teams to self serve.

As more and more of our tools are developing we are also in-housing more and more of them. As the regular readers know, I do dabble with the odd program that is slightly more complex than the average sysadmin might tackle. When we were faced with a situation such as monitoring the operations, by which I mean the user growth week on week and the cost of running the environment(s), it just made more sense to do it ourselves. There are tools out there that provide various dashboards, like Geckoboard, which can all do approximately 80% of the job, but it’s that last 20% that adds the usefulness. As such we are trying to develop tools that are pluggable and extensible and support multiple outputs. For example the Metrics report we have will also support Geckoboard, Graphite and Email, and will probably have its own web interface.

For us it is becoming more about having the flexibility to add and remove components. This introduces challenges, as whatever is being written needs to be pluggable and easy to maintain, which often makes it complicated.
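To give a feel for what “pluggable with multiple outputs” can look like, here is a minimal sketch in Python. It is not the actual Metrics code (which isn’t shown in this post); the registry, the function names and the shape of the report dictionary are all made up for illustration. The only external fact relied on is that Graphite’s plaintext format is one `path value timestamp` line per metric.

```python
from typing import Callable, Dict, List

# Hypothetical registry: output name -> function that renders a report as text.
OUTPUTS: Dict[str, Callable[[dict], str]] = {}


def register_output(name: str):
    """Decorator that adds an output plugin to the registry under `name`."""
    def decorator(func: Callable[[dict], str]) -> Callable[[dict], str]:
        OUTPUTS[name] = func
        return func
    return decorator


@register_output("graphite")
def to_graphite(report: dict) -> str:
    # One "path value timestamp" line per metric, sorted by metric name.
    ts = report["timestamp"]
    return "\n".join(
        f"metrics.{key} {value} {ts}"
        for key, value in sorted(report["values"].items())
    )


@register_output("email")
def to_email(report: dict) -> str:
    # Plain-text body only; actually sending the mail is left out of this sketch.
    lines = [f"{key}: {value}" for key, value in sorted(report["values"].items())]
    return "Weekly metrics\n" + "\n".join(lines)


def render(report: dict, enabled: List[str]) -> Dict[str, str]:
    """Render the report with every enabled plugin; unknown names raise KeyError."""
    return {name: OUTPUTS[name](report) for name in enabled}
```

Adding a new output is then one decorated function, and removing one is a change to the `enabled` list rather than to the report code, e.g. `render(report, ["graphite", "email"])`.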

I used classes, as a necessity

Typically when I program there is not much need for classes or even objects for that matter; a simple var and some nice loops and conditional statements would be plenty. Well, not so much anymore. The last project was Metrics and, as with other projects, I got it working within a day or two, and I hated it. It took over 30 seconds to run and generate the report I needed, the report was not in the right format, and the level of detail in the metrics was not high enough. It could manage weekly figures, but that was not good enough.

I decided that I’d have a chat with a few developers to help with the structure of the application. At first I was dubious, but it turned out well. The key step, which I wouldn’t have made until it was a real problem, was to separate out the tasks that gather the raw data, the bit that manipulates the data into useful numbers, the bit that stores the data and then finally the bit that outputs the pretty data.

This was an evolutionary step. I would have got to the point of understanding the need to separate each step out, but not until it had become a real big pain many months later. Another advantage of splitting it out was how much simpler each step was: there were classes defining methods for getting data that were being used in classes to format the data that were being used… you get the idea. Rather than there being one class to connect to Amazon, manipulate the data and return an object that could be used to generate the metrics, everything was done in much smaller steps. As a result it was a lot easier to write small chunks of code “that just worked” and it made debugging a lot easier. I feel like I progressed my understanding, and this is always a good thing.
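The split described above can be sketched roughly like this in Python. Again, this is an illustration and not the real Metrics project: the function names, the hard-coded sample numbers and the dictionary standing in for storage are all assumptions, and the real gathering step would talk to Amazon rather than return fixed data.

```python
from typing import Dict, List, Optional


def gather() -> List[dict]:
    """Stage 1: collect raw data (hard-coded sample rows in this sketch)."""
    return [
        {"week": 1, "users": 100, "cost": 40.0},
        {"week": 2, "users": 120, "cost": 45.0},
    ]


def transform(raw: List[dict]) -> List[dict]:
    """Stage 2: turn raw rows into useful numbers."""
    results = []
    previous_users: Optional[int] = None
    for row in raw:
        # Week-on-week growth is undefined for the first week we have data for.
        growth = None if previous_users is None else row["users"] - previous_users
        results.append({
            "week": row["week"],
            "user_growth": growth,
            "cost_per_user": round(row["cost"] / row["users"], 3),
        })
        previous_users = row["users"]
    return results


def store(metrics: List[dict], db: Dict[int, dict]) -> None:
    """Stage 3: persist metrics keyed by week (a dict stands in for real storage)."""
    for metric in metrics:
        db[metric["week"]] = metric


def output(db: Dict[int, dict]) -> str:
    """Stage 4: format whatever has been stored, one line per week."""
    return "\n".join(
        f"week {week}: growth={db[week]['user_growth']} "
        f"cost/user={db[week]['cost_per_user']}"
        for week in sorted(db)
    )


def run() -> str:
    """Wire the four stages together."""
    db: Dict[int, dict] = {}
    store(transform(gather()), db)
    return output(db)
```

Each stage only knows about the data handed to it, so any one of them can be tested or swapped on its own, which is where the easier debugging comes from.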

Who should do what

I touched on this in my other post, but I want to amend it based on a better understanding. To summarise, I pretty much said it as it is: developers develop, sysadmins admin. They do, and certainly that should be their focus, but I think there is a lot to be gained from both points of view when pushed to work in the other’s world.

Before our developer joined, the focus was on making the build, test and release process better. After forcing the developer to do sysadmin work for a month or so while the team was trying to grow and cope with the loss of a team member, it became clear that the time wasted for us all was not in getting a build through, but in us not being able to parallelise the testing or being agile enough to re-deploy an environment if it was not quite right. These steps and understandings would not have happened if we didn’t encroach on each other’s work and gain an understanding of the other person’s perspective.

Summary

This is what DevOps is really about, forget sysadmins doing code, forget about developers doing sysadmin work, it is about us meeting in the middle and understanding the issues we each face and working together to solve bigger problems.