A week in the Valley

While out and about…

Over the last week I’ve been out in the Bay Area meeting with an important client, talking about their needs and how we’re going to make things better for them and for us. All in all a good trip (apart from the plane crash). This was my first time in the Bay Area and it seems like a nice enough place; it lives up to expectations in some areas and not in others, and I’m sure with more local knowledge it’s possible to overcome some of the issues I had. The main issue I could see (granted, it was only a week) is that it’s not as nice a place to live as the UK, or even as nice to be in and around as London.

In London, everything is a walk away or a short Tube journey, and better yet, if you’re willing to travel more than an hour each way every day, you can live in the countryside and just commute in. In the Valley, everything is a short car journey away, the public transport seems a bit hit and miss, and taxis aren’t cheap!

I think it’s things like this that will be the end of the Bay Area over the next 10 years unless it changes. I’m not the only one who thinks this, and as time moves on I think we’ll see a shift in tech start-ups away from Silicon Valley into areas that are nicer to live in.

Which is what brings me back to London: there’s a good start-up culture, there’s more investment going on and there are some good companies starting to appear. Unfortunately the UK still isn’t brilliant for tech start-ups, but it will get there in time; it probably just needs a few more years and some brave people to trailblaze.

I think London has the makings of a nice tech hub for Europe and will, over the next few years, start to exceed the Bay Area. The only thing it’s really missing at the moment is the massive success stories that appear in the Bay Area every few years; sure, there are some good companies, but none of them is an Apple, Google or Facebook.

I think I could survive living out there for a while, but not forever. It’s nice being able to go to the beach, the forest or the mountains, whatever you want, all within a reasonable drive, but there’s too much convenience stuff: fast food, corner shops and drive-throughs. Walmart has its purpose, but it’s not for me; Trader Joe’s seemed better, though no soft drinks, just fresh goods and booze… Maybe in time I would have found stuff that felt a little more “me” and a little less American, but I’d have to go and give it a go to find out!

For me personally, I don’t really want to live in London; it’s just too busy. But living out in Hampshire makes London an awkward commute, doable but not every day. As time goes on I’m still hopeful that more start-ups will offer flexible working like we have at Alfresco, where going in 2-3 days a week is the norm and everyone is trusted to do the work. Who knows, over the next 10 years maybe more will start filling the M3/M4 corridor, which would make living in a nice place and commuting to a nice tech company all possible.

It will certainly be interesting to see how the tech industry in the UK changes over the next few years, but I’m certain it’s picking up speed.

Time for an idea

Why not

It’s been a while since I’ve thrown myself into an idea and tried to come out the other side, so I’ve spent the last couple of days just thinking about what’s missing. It doesn’t take much to have an idea; but making sure it’s a good idea, making sure it is unique in its offering and making sure it’s better than anything else is not easy.

At work we are working on an idea: a concept for some sort of DevOps tool that takes a lot of what we do already, simplifies it and merges multiple tools into one place. The driving goal is ease of use: take an entire system or data centre, whatever you want, and within minutes you’ll have the whole thing monitored, feeding metrics back for reporting, performing real-time analysis and trending. It’s still very much at the prototype phase, but it’s a very exciting project that wraps up several elements we as a team are passionate about: ease of use, efficiency, performance, monitoring, measurements and, of course, cool technology. With that said, I still have this urge to do something else. I’m not really sure why; I’m busy enough as it is, but I feel like the world is missing something that is more than just an amalgamation of parts or a re-skin of an existing thing. The question, as always, is what?

There’s a saying that “all the good ideas are gone”. Probably true, but that doesn’t stop people striving for new things. Look at Glass: I’m not convinced it has a long-term future in that styling, but wearable tech certainly does. Look at this wrist computer from the TV show Chuck; just what I’ve always wanted.

Wrist computer

A lot of the best ideas today are based on things that have come before, re-envisioned: Walkman -> iPod; iPaq -> iPad; AltaVista -> Google.

Just because it’s been done in a similar way before doesn’t mean you can’t do it better, or take their ideas and make them work; half of the battle is the conviction to want to do it better or differently. Which is why everyone should try a new idea, and everyone should try something new, to make something better.

What to do

This is the bit I’m struggling with, and it’s the hardest bit of the whole thing. For me it’s not good enough to take an idea and make it better; if someone gives you a product and says “make it better”, it wouldn’t take long for a few ideas to bubble up. I’m thinking more along the lines of taking some wacky, out-there thinking and making it a reality in a way that works, and works well.

I think over the next few weeks I’m just going to write some things down and see which ones stand out, which ones seem stupid/crazy to do, and then probably come up with one that works.

I’m not really sure what that is at all. I could take something like sentinel and munge it into something else, but it just doesn’t feel like the right idea. I had an idea a long time ago, probably three years back, which I talked myself out of because it would take me forever to make and I didn’t have the skills. But things have changed, and it isn’t even groundbreaking; it’s just another internet site.

Either way, I’ll keep plodding along for now and see if I can come up with something, but until then: much more scribbling on paper and throwing things at a wall.

DevOps team DNA

Hi, this is my first post on Matt’s blog. I’ve been an avid supporter of his blogging for a while and today got an invite to contribute. So here’s my post (created very quickly before he changes his mind).

My job has always been within the operations department of software product companies. I started at a small company as ‘everything’ support and slowly drifted towards a specialisation in the more recently branded DevOps-y areas as I made my way through various acquisitions and mergers. Over the past couple of years I’ve found myself building DevOps teams. During that time I’ve discovered some of the things that work and almost everything that doesn’t (or it feels like that :) ).

Some of the things that have worked… (for me anyway)

Obviously these are going to be quite subjective and I doubt they will work for everyone. I’ll focus mostly on what I think are the key ingredients of a successful team. Maybe some people will find it interesting. Bear in mind that this only really applies to an operations team that supports a cloud service.

I’m not a big football fan but I can draw some parallels between football managers and DevOps teams. You don’t see Arsenal winning and losing games based on their process redesigns. I may be simplifying, and I’m sure tactics play a large part, but I believe you get quite a bit more out of a team when you have excellent players: players who excel in different areas. My teams tend to be 5-7 players nowadays, and between all of us we need to cover a few areas.

The first is product knowledge. If you have a product guru in your team then you’ve got a productivity catalyst. So many aspects of our work involve investigating whether issues are product vs config, and whether we can improve things from an operational perspective that will result in the product running better. The most recent team has a Product Architect and he’s awesome. He’s on the cutting edge of ideas for the product, for Amazon AWS and for all of the supporting technologies. Having a dedicated resource to do all of this in the background is great: it means that when we automate his prototypes and release them we get the maximum benefit. Recent examples include our Public API work and the work being done on our Amazon architecture to improve speed (CDNs etc.).

The second role I’ve always tried to fill is an engineer (at least one person, preferably two). Get the most senior developer(s) that you can, who know the language and the build system of the product that you are supporting. You can now write the high-level instrumentation that every DevOps team needs, as is true with any automation project. There is only ever so far you can go with Bash (I tend to take things beyond where they are supposed to be with Bash as it is). Ultimately, having a senior developer or two buys you a massive amount of flexibility. Need a web service for something like externalised Puppet variables? You can write your own. Backup scripts not fast enough? A senior developer will make those scripts look very feeble in comparison when rewritten in their preferred language and multithreaded. I’m careful about not reinventing the wheel and will usually go off and clone something from GitHub before starting from scratch myself, but having some people who can write stuff from scratch is a major advantage. One caveat I would say for this role: hire from outside. Developers usually end up getting pulled back to work on stuff they did at the company at some stage. If you can, hire a new person and liven things up. Obviously tell the engineering teams that the hire(s) are for instrumentation, in case they get worried that you want to start adding buttons to the product :)
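
To make the “write your own web service” point concrete, here’s a minimal sketch of the sort of thing a senior developer can knock out in an afternoon. It assumes the Sinatra gem, and the endpoint and per-node YAML layout are entirely hypothetical, not a description of any real setup:

    # external_vars.rb - tiny web service serving externalised Puppet variables.
    # Hypothetical sketch: the /vars endpoint and per-node YAML files are made up.
    require 'sinatra'
    require 'yaml'
    require 'json'

    # One YAML file per node, e.g. vars/web01.yaml containing key: value pairs.
    get '/vars/:node' do
      path = File.join('vars', "#{params[:node]}.yaml")
      halt 404, 'unknown node' unless File.exist?(path)
      content_type :json
      YAML.load_file(path).to_json
    end

Puppet (or anything else) can then pull a node’s variables over HTTP instead of having them baked into the manifests.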

Lastly, the sysadmins. I’d actually consider myself one of these at heart. Getting a good sysadmin can be tricky; it’s not uncommon to read 100 CVs before finding someone even remotely eligible. For a DevOps team you need a reasonably rare mix of skills: people who know Linux inside out, who can script and get excited by the latest batch of tools, and nowadays you need to throw Puppet / Chef into the mix. I have a couple of these currently and consider myself extremely blessed. Everything that we do is checked into source control (we use AWS as our data centre) and this buys us a lot of things, like the ability to automate everything, reduce costs by deleting and recreating at whim, and disaster recovery. However, you pay for those things by hiring really good people, which works out as a saving in the long run once the benefits of the team start to show.

Now, if you add in all of those types of role, what I’ve found works quite nicely is running the team without being too focussed on the separation of responsibility. Everyone is on call 24/7. Everyone is expected to know the product inside out (although nobody will get near the level of expertise of the Product Architect), everyone scripts (even me) and ultimately everyone will end up doing some programming tasks. You can probably see from Matt’s previous blog posts about the Metrics project that he got the chance to learn some Ruby. I think it’s important that everyone knows a bit of everyone else’s job, although when under pressure everyone naturally drops back into doing what they are good at to speed things along.

This probably looks a little odd from the outside. But it makes things fun, everyone stays engaged and ultimately we all share the same goal: scale to 1 million users :D

This time, we survived the AWS outage

Another minor bump

Anyone based in the US East region of AWS knows that yet again there were issues with EBS volumes, although you wouldn’t know it if you looked at their website. It’s a bit of a joke when you see headlines like “Amazon outage takes down Reddit, Foursquare, others”, yet on their status page a tiny little note icon appears stating there’s a slight issue, extremely minor, don’t worry about it. Yeah, right.

The main culprits were EC2 and the API, both of which were EBS related.

“Degraded EBS performance in a single Availability Zone
10:38 AM PDT We are currently investigating degraded performance for a small number of EBS volumes in a single Availability Zone in the US-EAST-1 Region.
11:11 AM PDT We can confirm degraded performance for a small number of EBS volumes in a single Availability Zone in the US-EAST-1 Region. Instances using affected EBS volumes will also experience degraded performance.
11:26 AM PDT We are currently experiencing degraded performance for EBS volumes in a single Availability Zone in the US-EAST-1 Region. New launches for EBS backed instances are failing and instances using affected EBS volumes will experience degraded performance.
12:32 PM PDT We are working on recovering the impacted EBS volumes in a single Availability Zone in the US-EAST-1 Region.
1:02 PM PDT We continue to work to resolve the issue affecting EBS volumes in a single availability zone in the US-EAST-1 region. The AWS Management Console for EC2 indicates which availability zone is impaired. “

The actual message is much, much longer, but you get the gist: a small number of people were affected. Yet most of the major websites that use Amazon were affected; how can that be considered small?

Either way, this time we survived, and we survived because we learnt. Back in June and July we experienced these issues with EBS, so we did something about it. Now, why didn’t everyone else?

How Alfresco Cloud Survived

So back in June and July we were heavily reliant on EBS, just like everyone else: we had an EBS-backed AMI from which we then used Puppet to build out the OS. This is pretty much what everyone does, and it’s why everyone was affected. Back then we probably had 100-150 EBS volumes, so the likelihood of one of them going funny was quite high; now we have about 18, and as soon as we can we will ditch those as well.

After being hit twice in relatively quick succession we realised we had a choice: be lazy or be crazy. We could have been lazy and said that Amazon had issues, that it wasn’t that frequent and that it wasn’t likely to happen again; or we could be crazy and reduce our EBS usage as much as possible. We went for crazy, and it has now paid off.

Over the last few months I’ve added a number of posts about The Cloud, Amazon and Architecting for the cloud, along with a few funky Abnormal puppet set ups and oddities in the middle. All of this was spawned from the EBS outages; we had to be crazy. Amazon tell us all the time: don’t have state, don’t rely on anything other than failure, use multiple AZs, etc. All of those big players that were affected would have been told they should use multiple availability zones, but as I pointed out here, their AZs can’t be fully independent, and yet again this outage proves it.

Now, up until those outages we had done all of that, but we still trusted Amazon to remain operational. Since July we have made a concerted effort to move our infrastructure onto elements within Amazon that are more stable, hence the removal of EBS. We now only deploy instance-backed EC2 nodes, which means we have no ability to restart a server, but it also means we can build them quickly and consistently.

We possibly took it to the extreme. Our base AMI, now instance backed, consists of a single file that does a git checkout; once it has done that, it simply builds itself to the point that Chef and Puppet can take over and run. The tools used to do this are many, but needless to say it involves many hundreds of lines of Bash, supported by Ruby, Java, Go and any number of other languages or tools.

We combined this with fully distributing Puppet so it runs locally; in theory, once a box is built it is there for the long run. We externalised all of the configuration so Puppet was simpler and easier to maintain. Puppet, its config, the base OS and the tools to manage and maintain the systems are all pulled from remote services, including our DNS, which automatically updates itself based on a set of tags.
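
As a rough illustration of the first-boot flow described above (not our actual bootstrap; the repo URL and paths are hypothetical), the idea is simply: pull everything from git, then run Puppet locally with no master in the way:

    #!/usr/bin/env ruby
    # first_boot.rb - hypothetical sketch of an instance building itself:
    # clone the config repo, then run a local (masterless) puppet apply.
    require 'open3'

    REPO = 'git://example.com/infrastructure.git' # hypothetical URL
    DEST = '/opt/bootstrap'

    def run!(cmd)
      out, status = Open3.capture2e(cmd)
      abort "FAILED: #{cmd}\n#{out}" unless status.success?
      out
    end

    run!("git clone #{REPO} #{DEST}") unless File.directory?(DEST)
    # Masterless: apply the checked-out manifests locally, no puppet master needed.
    run!("puppet apply --modulepath=#{DEST}/modules #{DEST}/manifests/site.pp")

Because the box only needs git and a local Puppet run, a working network is mostly a build-time requirement, which matters when the provider’s network is having a bad day.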

Summary

So, how did we survive? We decided that no box was important: if some crazy person can’t randomly delete a box or service and have the system keep working, then we have failed. I can only imagine that the bigger companies, with a lot more money, people and time looking at this, are still treating Amazon as a data centre rather than as a collection of web services that may or may not work. With the distributed Puppet and config, once our servers are built they run happily on a local copy of the data, with no network, and that is important because AWS’s network is not always reliable and nor is their data access. If a box no longer works, delete it; if an environment stops working, rebuild it; if Amazon has a glitch, keep working. Simple.

Sysadmins in a Developer’s world

It’s all back to front

Well, it was about nine months back that I touched on Developers in a sysadmin world, and my initial thoughts were along the lines of: we are better at different tasks. After spending a week doing only development, I am still of the same opinion.

Over the last six months we have had our solitary developer coding away, making great things happen, predominantly developing a portal that allows us to deploy environments in 15 minutes vs the 2 days it took before. The whole thing is very pretty; it even has its own favicon.ico, which we are all pleased about. In addition to deploying, it also allows us to scale the environments it creates up and down, and despite constant interruptions it is coming along really well; in the next month we will be providing it as a service to the engineering teams to self-serve.

As our tooling develops we are also bringing more and more of it in-house. As regular readers know, I do dabble with the odd program slightly more complex than the average sysadmin might tackle. When we are faced with a situation such as monitoring the operation (by this I mean week-on-week user growth and the cost of running the environment(s)) it just made more sense to do it ourselves. There are tools out there that provide various dashboards, like Geckoboard, which can do approximately 80% of the job, but it’s that last 20% that adds the usefulness. As such, we are trying to develop tools that are pluggable and extensible and support multiple outputs; for example, the metrics report we have will also support Geckoboard, Graphite and email, and will probably have its own web interface.
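
To show what I mean by pluggable outputs, here’s a minimal sketch (the class names and metrics are hypothetical, not our actual code): every output implements the same publish method, so adding a new target never touches the gathering side.

    # Hypothetical sketch of pluggable metric outputs: each output class
    # implements publish(metrics), so new targets bolt on without changes elsewhere.
    require 'socket'

    class GraphiteOutput
      def initialize(host, port = 2003)
        @host, @port = host, port
      end

      def publish(metrics)
        # Graphite's plaintext protocol: "name value timestamp" per line.
        TCPSocket.open(@host, @port) do |sock|
          metrics.each { |name, value| sock.puts "#{name} #{value} #{Time.now.to_i}" }
        end
      end
    end

    class EmailOutput
      def publish(metrics)
        body = metrics.map { |name, value| "#{name}: #{value}" }.join("\n")
        puts "Would email:\n#{body}" # stand-in for whatever mailer is to hand
      end
    end

    outputs = [GraphiteOutput.new('graphite.example.com'), EmailOutput.new]
    metrics = { 'users.total' => 12_345, 'aws.cost.daily' => 432 }
    outputs.each { |output| output.publish(metrics) }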

For us it is becoming more about having the flexibility to add and remove components. This introduces its own challenges, as whatever is being written needs to be pluggable and easy to maintain, which often makes it complicated.

I used classes, as a necessity

Typically when I program there is not much need for classes, or even objects for that matter; a simple var and some nice loops and conditional statements would be plenty. Well, not so much anymore. The last project was metrics and, as with other projects, I got it working within a day or two, and I hated it: it took over 30 seconds to run, it generated the report I needed but not in the right format, and the level of detail in the metrics was not high enough; it could manage weekly, but that was not good enough.

I decided that I’d have a chat with a few developers to help with the structure of the application; at first I was dubious, but it turned out well. The key step, which I wouldn’t have made until it was a real problem, was to separate out the tasks that gather the raw data, the bit that stores the data, the bit that manipulates the data into useful numbers, and finally the bit that outputs the pretty data.

This was an evolutionary step; I would have got to the point of understanding the need to separate each step out, but not until it had become a really big pain many months later. Another advantage of splitting it out was how much simpler each step was: there were classes defining methods for getting data that were being used in classes to format the data that were being used… you get the idea. Rather than one class connecting to Amazon, manipulating the data and returning an object that could be used to generate the metrics, everything was done in much smaller steps. As a result it was a lot easier to write small chunks of code “that just worked”, it made debugging a lot easier, and I feel like I progressed my understanding, which is always a good thing.
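
To make the separation concrete, here’s a toy sketch of the shape it ended up taking (names and data are invented; the real thing pulls from Amazon rather than a canned list): one small class per stage, each consuming the previous stage’s output.

    # Invented sketch of the pipeline split: gather -> store -> transform -> report.
    class Gatherer
      # Stands in for the code that pulls raw numbers from AWS, billing, etc.
      def raw_data
        [{ day: '2012-10-22', users: 100 }, { day: '2012-10-23', users: 130 }]
      end
    end

    class Store
      def initialize
        @rows = []
      end

      def save(rows)
        @rows.concat(rows)
      end

      def all
        @rows
      end
    end

    class Transformer
      # Turns raw rows into the useful numbers: daily growth, in this sketch.
      def growth(rows)
        rows.each_cons(2).map { |a, b| { day: b[:day], growth: b[:users] - a[:users] } }
      end
    end

    class Reporter
      def render(numbers)
        numbers.each { |n| puts "#{n[:day]}: +#{n[:growth]} users" }
      end
    end

    store = Store.new
    store.save(Gatherer.new.raw_data)
    Reporter.new.render(Transformer.new.growth(store.all))

Each piece can now be tested (and debugged) on its own, which is exactly where the one-big-script version fell over.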

Who should do what

I touched on this in my other post, but I want to amend it based on a better understanding. To summarise, I pretty much said it as it is: developers develop, sysadmins admin. They do, and certainly that should be their focus, but I think there is a lot to be gained from both points of view when each is pushed to work in the other’s world.

Before our developer joined, the focus was on making the build, test and release process better. After forcing the developer to do sysadmin work for a month or so, while the team was trying to grow and cope with the loss of a team member, it became clear that the time wasted for us all was not in getting a build through, but in not being able to parallelise the testing or be agile enough to re-deploy an environment if it was not quite right. These steps and understandings would not have happened if we didn’t encroach on each other’s work and gain an understanding of the other person’s perspective.

Summary

This is what DevOps is really about. Forget sysadmins doing code, forget developers doing sysadmin work; it is about us meeting in the middle, understanding the issues we each face and working together to solve bigger problems.

Release consistency

It’s gotta be… perfect

This is an odd one that the world of DevOps sits in an awkward place on, but it is vital to the operational success of a product. As an operations team we want to ensure the product is performing well, that there are no issues and that there is some control over the release process. Sometimes this leads to the long-winded, bloaty procedures that are common in service providers, where people stop thinking and start following; this is a good sign a sysadmin has died inside. From the more development side, we want to be quick and efficient and to reuse tooling. Consistency is important, a known state of release is important; the process side of it should barely exist, because it should be automated with minimal interaction.

So, as operations guys, we may have to make ad-hoc changes to ensure the service continues to work; as a result, a retrospective change is made to ensure the config is good for the long term. Often this stays untested, as you’re not going to re-build the system from scratch just to re-test it, are you?

The development teams want it all: rapid, agile change, quick testing suites and an infallible release process with plenty of resilience, double checks and catches for all sorts of error codes. The important thing is that the code makes it out, but it has to be perfect, which leads on to QA.

The QA team want to test the features, the product, the environment, combinations of each, and the impact of going from X to Y, tracking the performance of the environment in between. All QA want is an environment that never changes, with a code base that never changes, so they can test everything in every combination; those bugs must be squashed.

Obviously everyone’s right, but with so many contradicting opinions it is easy to end up in a blame circle, which isn’t all that productive. The good news is we know what we want…

  • Quick release
  • Quick accurate testing
  • Known Product build
  • Known configuration
  • No ad-hoc changes to systems
  • Ability to make ad-hoc changes to systems
  • Infallible release process

All in all, not that hard, but there are two sticking points here. Point one: infallible release processes are a thing of wonder and can only ever be iterated towards perfection; in time they will get better. Day 1 will be crap, day 101 will be better, day 1000 better still. Point two: you can’t have both no ad-hoc changes and the ability to make ad-hoc changes, can you? Well, you can.

If you love it, let it go

As a sysadmin, if there’s an issue on production I will in most cases fix it in production. If it is risky I will use our staging environment and test the fix there first, but this is usually no good, as staging typically won’t show the issues production does; i.e. all of staging’s servers will be working while production is missing one. This means I have to make ad-hoc changes to production, which causes challenges for the QA team, as now the test environments aren’t all the same; it then screws up the release process, because we made a change in one place and didn’t manually add it to all the other environments.

So, what if we throw away production, staging or any other environment with every release? This is typically a no-no for traditional operations (why would you throw away a working server?), but it provides several useful functions:

  1. Removes ad-hoc changes
  2. Tests documentation / configuration management
  3. Enhances DR strategy and familiarisation with it
  4. Clean environment

The typical reason the throw-away approach isn’t taken is a lack of confidence in the tools. Well, bollocks. What good is having configuration management and DR policies if you aren’t using them? If you are in an operational place now, making changes to Puppet and rolling them forward, you achieve some of this, but it’s not good enough: you still carry the risk of never having tested the configuration on a brand-new system.

With every environment being built from scratch with every release, we can version control a release to a specific build number and a specific git commit, which is better for QA as it’s always a known state; if there’s any doubt, delete and re-build.
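
As an illustration of pinning a release to a known state (a hypothetical sketch; the file name and fields are invented), a tiny manifest written at build time is enough to trace any environment back to an exact build and commit:

    # Hypothetical sketch: write a release manifest at build time so every
    # environment can be traced to an exact build number and git commit.
    require 'json'

    manifest = {
      'build'  => ENV.fetch('BUILD_NUMBER', '0'),
      'commit' => `git rev-parse HEAD`.strip,
      'built'  => Time.now.utc.to_s
    }

    File.write('release.json', JSON.pretty_generate(manifest))
    puts "Release pinned: build #{manifest['build']} @ #{manifest['commit']}"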

The release process can now start being baked into the startup of a server / Puppet run, so consistency is increasing and hopefully the infallibility of the release is improving. Add to this a set of system-wide checks for basic health, a set of checks for unit testing the environment and a quick user-level test before handing over for testing, and it becomes more likely, more often, that the environments are in a known, consistent state.

All of this goodness starts with deleting it and building fresh. Some of it will happen organically, but by at least making sure all of the configuration / deployment is managed at startup, you can re-build, and you have some DR. Writing some simple smoke tests and some basic automation is a starting point; from there you can build upon the basics to start making it fully bulletproof.
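
A smoke test really can start this small; the sketch below (URLs and checks are hypothetical) just asserts that a freshly built environment is alive before anyone hands it to QA:

    #!/usr/bin/env ruby
    # Hypothetical smoke test sketch: a few cheap checks that a freshly
    # built environment is alive before handing it over for real testing.
    require 'net/http'
    require 'uri'

    CHECKS = {
      'app responds'  => 'http://app.example.com/',
      'login page up' => 'http://app.example.com/login',
      'api healthy'   => 'http://app.example.com/api/status'
    }

    failures = CHECKS.reject do |name, url|
      begin
        code = Net::HTTP.get_response(URI(url)).code
        puts "#{name}: HTTP #{code}"
        code == '200'
      rescue StandardError => e
        puts "#{name}: #{e.class}"
        false
      end
    end

    abort "Smoke test FAILED: #{failures.keys.join(', ')}" unless failures.empty?
    puts 'Smoke test passed; environment is at a known good state.'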

Summary

Be radical: delete something in production today and see if you survive. I know I will.

Open source architecture

It all starts with an idea

When you sit down and start to think about it, we as humans are very good at learning new things and using our understanding to progress and do new things based on what we have learnt. The founding principle is that everything I know can be expressed thus:

“What I know” = (“What I learnt” – “What I forgot about what I learnt”) + My experiences

So, based on this, every time I do something new I learn something; I then forget a proportion of that, but have hopefully formed a subconscious opinion of it and have gained an experience.

One of the ways I learn more is to read: I use the internet to find solutions and adapt them to my current situation, and sometimes that results in an experience or something I’ll actually retain and re-use at another time.

Now, when I’m learning I am often also working, so I apply the “what I know” to the situation I am in, with a dash of learning and experience, and with some healthy collaboration end up at a solution. Isn’t that the key? Collaboration is when we start sharing our “what I know” and our experiences of it, to come to a solution that often removes the pain points each member experienced in the past, but potentially introduces more.

So that leads me on to the idea: what if we open source our architecture? Does that provide value? I think so. We have spent months, maybe years, on our architectures, getting them to a point that we as individuals are happy with for various solutions. How cool would it be if you could go to Google and type “open source architecture for hosting Java based cloud solutions” and have it come back with something other than adverts for cloud solutions: diagrams of the solution, configurations, guidelines, bullet-pointed directions of use, etc.? Wouldn’t that be like collaborating with others on your architecture, getting more experience and more knowledge to discuss the idea and suggest improvements? All of a sudden your one or two person IT team is 30 or 40 people discussing the architecture; if just one of those people comes back with something that isn’t crap and is beneficial, doesn’t that make it worthwhile?

Is there value in it?

I keep asking myself this. If, every time I wanted to build a new “solution”, there were “boxed” architectures for hosting applications in Amazon or setting up a data centre infrastructure, I could pick one off the shelf, read some footnotes or supporting documentation, understand the limitations and constraints of the solution, and then adapt it to meet my specific needs.

I may even discover something that could be contributed back to help out. For example, maybe Amazon EBS volumes become increasingly unstable, so I fork the architecture and release an instance-store version that removes the dependency on EBS; or maybe a bug is found with ELBs and access that causes a security hole. Whatever it may be, there is a continuous feedback loop enabling everyone to benefit from the “what I learnt” and “my experience”.

Okay, so in theory it seems fine, but everyone has a unique situation, so whatever is open sourced is potentially a waste? I don’t think so. Most of the principles behind what is done in any environment are the same; they are just tailored. I imagine most corporate data centres have a number of distinct networks: probably a DMZ for web-facing applications, a more standard application network for everything else that is not public facing or is internally accessible, and lastly a secure area for DB servers or other important things that can never be accessed directly.
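
Purely as an illustration of how such a “boxed” architecture might be shared in a forkable, diffable form (the structure and names here are invented, not a proposed standard), even a simple data description goes a long way:

    # Invented sketch: an open-sourced architecture expressed as data, so it
    # can be forked, diffed and adapted. Names and layout are illustrative only.
    ARCHITECTURE = {
      name: 'corporate-data-centre-baseline',
      networks: {
        dmz: {
          purpose: 'web facing applications',
          reachable_from: [:internet],
          hosts: ['load balancers', 'web servers']
        },
        application: {
          purpose: 'internal, non public facing services',
          reachable_from: [:dmz],
          hosts: ['app servers', 'queues']
        },
        secure: {
          purpose: 'DB servers and anything never accessed directly',
          reachable_from: [:application],
          hosts: ['databases']
        }
      },
      notes: 'Dotted-line extras (Development, Testing networks) added per site.'
    }

    ARCHITECTURE[:networks].each do |name, net|
      puts "#{name}: #{net[:purpose]} (reachable from #{net[:reachable_from].join(', ')})"
    end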

I think that’s pretty common, so the open source architecture would depict this, maybe with a dotted line to show A.N.Other network, perhaps for Development or Testing, but it’s a good starting point. That starting point, coupled with the usage notes, diagrams, design principles and documentation, should be enough to get people up and running with a decent architecture.

Summary

I’m sure there’s something in this; I know it would be beneficial. I worry that people will switch off and not take the architectures and adapt them to their specific needs; I hope in most cases that won’t be necessary, but there will always be fringe cases. Hopefully over the coming months this will turn into something more than words, but only time will tell.