It’s gotta be… perfect
This is an odd one that the world of devops sits in an awkward place on but is very vital to the operational success of a product. As an operations team we want to ensure the product is performing well, there’s no issues and there’s some control over the release process, some times this leads to long winded and bloaty procedures that are common in service providers where people stop thinking and start following, this is a good sign a sysadmin has died inside. From the more Development side we want to be quick and efficient and reuse tooling. Consistency is important, a known state of release is important, the process side of it should not exists as it should be automated with the minimal interaction.
So as Operations guys, we may have to make changes ad-hoc to ensure the service continues to work, as a result a retrospective change is made to ensure the config is good for long term; often this can stay untested as you’re not going to re-build the system from scratch to re-test, are you?
The development teams want it all; rapid, agile change, quick testing suites and an infallible release process with plenty of resilience and double checks and catches for all sorts of error codes etc etc, the important thing is that the code makes it out, but it has to be perfect, which leads on to QA.
The QA team want to test the features, the product, the environment, combinations of each and the impact of going from X to Y and tracking the performance of the environment in between. All QA want is an environment that never changes, with a code base that never changes so they can test everything in every combination, those bugs must be squashed.
Obviously, everyones right but with so many contradicting opinions it is easy to end up in a blame circle, which isn’t all that productive. The good news is we know what we want…
- Quick release
- Quick accurate testing
- Known Product build
- Known configuration
- No ad-hoc changes to systems
- Ability to make ad-hoc to systems
- Infallible release process
All in all not that hard, there are two issue points here. Point one, infallible releases are a thing of wonder and can only ever be iterated over to make them perfect, in time it will get better. Day 1 will be crap, day 101 will be better, day 1000 better still. point two, you can’t have no ad-hoc changes and the ability to make ad-hoc changes, can you? Well you can.
If you love it, let it go
As a sysadmin, if there’s an issue on production I will in most cases fix it in production, if it is risky I will use our staging environment and test it on there, but this is usually no good as staging typically won’t show the issues production does, i.e. all of its servers would be working yet production is missing one. This means I have to make ad-hoc changes to production, this causes challenges for the QA team as now the test environments aren’t always the same, it then screws up the release process as we made a change in one place and didn’t manually add it in to all other environments.
So, What if we throw away, production, staging or any other environment with every release? This is typically a no-no for traditional operations, why would you throw away a working server? well it provides several useful functions:
- Removes ad-hoc changes
- Tests documentation / configuration management
- Enhances DR strategy and familiarisation with it
- Clean environment
The typical reason why the throw away approach isn’t done is due to a lack in confidence of the tools. Well, Bollocks. What good is it having configuration management and DR polices if you aren’t using it? if in an operational place now you are making changes to puppet and rolling them forward you achieve some of this, but it’s not good enough, you still have the risk of not having tested the configuration on a new system.
With every environment being built from scratch with every release, we can version control a release to a specific build number and specific git commit which is better for QA as it’s always a known state, if there’s any doubt delete and re-build.
The release process can now start being baked into the startup of a server / puppet run so the consistency is increasing and hopefully the infallibility of the release is improving, adding to this a set of System wide checks for basic health, a set of checks for unit testing the environment, a quick user level test before handing over for testing then it’s more likely, more often that the environments are at a known consistence.
All of this goodness starts with deleting it and building fresh, some of it will happen organically but by at least making sure all of the configuration / deployment is managed at startup you can re-build, and you have some DR. Writing some simple smoke tests and some basic automation is a starting point, from here at least you can build upon the basics to start making it fully bullet proof.
Be radical, delete something in production today and see if you survive, I know I will.