The final instalment
Over the last couple of weeks I have posted a Foundation to what the cloud really is and How to make the best use of your cloud. This week is about tying off loose ends, better ways of working, dispelling a few myths and setting some things straight.
Infrastructure as code
DevOps is not a silver bullet, but it is a framework that encourages teamwork across departments so that code can be deployed to a production environment quickly.
- Agile development
  - Frequent releases of minor changes
  - Often the changes are simpler as they are broken down into smaller pieces
- Configuration management
  - This allows a server (or hundreds) to be managed by a single sysadmin and produce reliable results
  - No need to debug 1 faulty server out of 100; rebuild it and move on
- Close, co-ordinated partnership with engineering
  - Mitigates the “over the wall” mentality
  - Encourages a team mentality to solving issues
  - Better utilises the skills of everyone to solve complex issues
Infrastructure as code is fundamental to rapid deployment. Why hand-build 20 systems when you can create an automated way of doing it? Using the API tools provided by cloud providers, it is possible to build entire infrastructures automatically and rapidly.
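To make the idea a bit more concrete, here is a minimal sketch in Python. It is not any real provider's API: `FakeCloud`, `ServerSpec` and `reconcile` are names I have made up for illustration, and the "cloud" is just a dictionary in memory. The point is only to show the shape of it: you describe what you want, and code works out what to create, rebuild or remove.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerSpec:
    """Desired configuration for one server (illustrative fields only)."""
    name: str
    size: str
    image: str


class FakeCloud:
    """In-memory stand-in for a cloud provider's API (hypothetical)."""

    def __init__(self):
        self.servers = {}

    def create_server(self, spec):
        self.servers[spec.name] = spec

    def delete_server(self, name):
        del self.servers[name]


def reconcile(cloud, desired):
    """Make the cloud match the desired specs and report what changed."""
    wanted = {spec.name: spec for spec in desired}
    changes = {"created": [], "replaced": [], "deleted": []}

    # Remove anything that is running but no longer described in code.
    for name in sorted(set(cloud.servers) - set(wanted)):
        cloud.delete_server(name)
        changes["deleted"].append(name)

    for name in sorted(wanted):
        spec = wanted[name]
        current = cloud.servers.get(name)
        if current is None:
            cloud.create_server(spec)
            changes["created"].append(name)
        elif current != spec:
            # Rebuild rather than debug: drop the drifted server and recreate it.
            cloud.delete_server(name)
            cloud.create_server(spec)
            changes["replaced"].append(name)
    return changes


# Twenty identical web servers described in one line of code.
web_tier = [ServerSpec(f"web-{i:02d}", "small", "base-image-v1") for i in range(1, 21)]

cloud = FakeCloud()
first_run = reconcile(cloud, web_tier)   # builds all twenty
second_run = reconcile(cloud, web_tier)  # nothing left to do
```

Running the same description twice changes nothing the second time, which is what makes this approach safe to repeat. With a real provider the two `FakeCloud` methods would be replaced by calls to that provider's own API or tooling.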
Automation through code is not a new concept. Sysadmins have been doing this for a long time using Bash, Perl, Ruby and other such languages. As a result, the ability to write and understand complicated object-oriented code is becoming more and more important within a sysadmin role; traditionally this was the domain of the developer, and a sysadmin just needed to “hack” together a few commands. Likewise, in this new world development teams are being used by sysadmins to fine-tune the configuration of application platforms such as Tomcat, or to make specific code changes that benefit the operation of the service.
An agile delivery method makes frequent changes possible. At first this can seem crazy: why would you make frequent changes to a stable system? Well, for one, each change is smaller, so any single iteration is less likely to cause a total outage. It also means that if an update does have a negative impact, it can be identified and fixed very quickly, again minimising the total outage time of the system.
In an ideal world you’d be rolling out every individual feature on its own rather than a bunch of features together. This is a difficult concept for development teams and sysadmins to get used to, especially as they are more used to the on-premise way of doing things.
Automation is not everything
I know I said automation is key: the more we automate, the more stable things become. However, automating everything is not practical and can be very time-consuming, and it can also lead to large-scale disaster.
- Automation, although handy, can make life difficult
  - Solutions become more complex
  - When something fails, it fails in style
- Understand what should be automated
  - Yes, you can automate everything, but ask yourself: should you?
  - Automate boring, repetitive tasks
  - Don’t automate large, complex tasks; simplify the tasks and then automate
We need to make sure we automate the things that need to be automated: deployments, updates and DR.
We do not want to spend time automating a solution that is complex; it needs to be simplified first and then automated. The whole point of automation is to free up more time, so if you are spending all of your time automating, you are no longer saving any.
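A rough way to sanity-check whether something is worth automating is to work out how long it takes to pay back the effort. The sketch below is just that arithmetic in Python; the function name and all of the figures are made up for illustration, so plug in your own.

```python
def automation_payback_months(build_hours, manual_minutes, runs_per_month,
                              upkeep_hours_per_month=0.0):
    """Months until the automation has paid for itself, or None if it never does."""
    manual_hours_per_month = manual_minutes * runs_per_month / 60
    saved_per_month = manual_hours_per_month - upkeep_hours_per_month
    if saved_per_month <= 0:
        # Maintaining the automation costs more than doing the job by hand.
        return None
    return build_hours / saved_per_month


# A boring, repetitive task: 30 minutes by hand, 20 times a month,
# one day (8 hours) to automate.
simple_task = automation_payback_months(build_hours=8, manual_minutes=30, runs_per_month=20)

# A large, complex task: an hour by hand twice a month, three weeks (120 hours)
# to automate and 4 hours a month to keep the automation working.
complex_task = automation_payback_months(build_hours=120, manual_minutes=60,
                                         runs_per_month=2, upkeep_hours_per_month=4)
```

With these example numbers the simple task pays for itself in under a month, while the complex one never does, which is the argument for simplifying first and automating second.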
Failure is not an option!
Anyone who thinks things won’t fail is being rather naïve. The most important thing to understand about failures is what you will do when there is one.
- Things will fail
  - Data will be lost
  - A server will crash
  - An update that reduces functionality will make it through QA and into production
  - A sysadmin will remove data by accident
  - The users will crash the system
- Plan for failures
  - If we know things will fail, we can think about how we should deal with them when they happen
  - Create alerts for the failure situations you know could happen
  - Ensure that the fixes for the common ones are well understood
- You cannot plan for everything
  - Accept this, and have good processes in place for DR, backup and partial failures
Following a process makes it quick to resolve an issue, so creating run books and DR plans is a good thing. Holding a wash-up after a failure, to understand what happened, why it happened and how you can prevent it in future, ensures that mitigations are put in place to stop it happening again.
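One way to tie alerts and run books together is to keep them side by side, so every alert you raise already says where the fix is documented. This is only a sketch: the metric names, thresholds and run book paths below are all hypothetical, and a real setup would live in whatever monitoring tool you already use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class KnownFailure:
    """A failure we expect to see one day, and where its fix is written down."""
    name: str
    is_failing: Callable[[dict], bool]
    runbook: str  # hypothetical location of the documented fix


# Illustrative checks only; thresholds and paths are assumptions.
KNOWN_FAILURES = [
    KnownFailure("disk nearly full",
                 lambda m: m.get("disk_used_pct", 0) >= 90,
                 "runbooks/disk-full.txt"),
    KnownFailure("backup overdue",
                 lambda m: m.get("hours_since_backup", 0) > 24,
                 "runbooks/backup-overdue.txt"),
    KnownFailure("error rate high after release",
                 lambda m: m.get("error_rate_pct", 0) > 5,
                 "runbooks/rollback-release.txt"),
]


def raise_alerts(metrics):
    """Return one alert line per known failure that the metrics currently show."""
    return [
        f"ALERT: {failure.name} -> follow {failure.runbook}"
        for failure in KNOWN_FAILURES
        if failure.is_failing(metrics)
    ]


alerts = raise_alerts({"disk_used_pct": 95, "hours_since_backup": 2, "error_rate_pct": 0.1})
```

Anything that fails without matching a known check is exactly the "cannot plan for everything" case, and that is where the general DR and backup processes take over.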
Regularly review operational issues to ensure that the important ones are being dealt with; there’s little point in logging all of the issues if they are not being prioritised appropriately.
DR, backup and restoration of service are the most important elements of an operational service, although no one cares about them until there is a failure. Get these sorted first.
Deploying new code and making updates are nice-to-haves. People want new features, but they pay for uptime and availability of the service. This is kind of counter-intuitive for DevOps, as you want to allow the most rapid of changes to happen, but change still needs control, testing and gatekeeping.
Concentrate on the things that no one cares about unless there’s a failure. Make sure that your DR and backup plan is good, regularly test that it works, and ensure your monitoring is relevant and timely. If you have any issues with any of these, fix them quickly and put the controls in place to ensure they stay up to date.
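Testing that a backup works means actually restoring it, not just checking that a file exists. Here is a small sketch using only the Python standard library: it restores an archive into a scratch directory and compares checksums against the source. The directory layout and file names are invented for the example, and a real backup would of course live somewhere other than next to the data.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def checksums(root):
    """Map each file's relative path to the SHA-256 of its contents."""
    root = Path(root)
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def restore_matches_source(source_dir, archive_path):
    """Restore the archive into a scratch directory and compare it with the source."""
    with tempfile.TemporaryDirectory() as scratch:
        shutil.unpack_archive(archive_path, scratch)
        return checksums(scratch) == checksums(source_dir)


with tempfile.TemporaryDirectory() as work:
    # Made-up data standing in for whatever you actually back up.
    source = Path(work) / "data"
    source.mkdir()
    (source / "orders.csv").write_text("id,total\n1,9.99\n")

    archive = shutil.make_archive(str(Path(work) / "backup"), "zip", root_dir=source)
    fresh_ok = restore_matches_source(source, archive)  # backup is current

    # The data changes but nobody takes a new backup.
    (source / "orders.csv").write_text("id,total\n1,9.99\n2,4.50\n")
    stale_ok = restore_matches_source(source, archive)  # backup is now out of date
```

A check like this is a good candidate for a scheduled job that alerts when it returns False, because it is boring, repetitive and simple.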
With regard to automation, just be sensible about what you are trying to do. If it needs automating and is complicated, find a better way.