Dealing with Technical Debt

A brief history

Last week I touched on Dealing with rapid change and I kinda got carried away and exceeded my self-imposed 800-word limit. So today’s blog is about capturing technical debt and how best to deal with it, so you don’t end up in a situation where the service starts suffering because of tactical decisions made without suitable foresight.

So, to summarise what technical debt is for those that did not read last week’s post (shame on you): technical debt is what you accrue when you have to make a change that does not head towards the overall goal, but instead causes additional headaches for managing the environment or somehow limits you in achieving the overall goal of a perfect system.

Capturing technical debt

My personal preference for tracking technical debt is a ticketing system; just create a way within it to easily identify what is technical debt and what is a normal task. You don’t have to use a ticketing system, you could just write it on paper or in a spreadsheet or whatever, the important thing is that you capture it with a reasonable amount of detail.

Avoid the temptation to fill the ticket with a lot of information; just put enough in it to explain what the problem is and why it needs to be fixed. If you have some suggestions on how it could be fixed, add them to the ticket, but don’t worry about saying “this is the solution”; that can be done when the ticket is actioned.

As a rule of thumb, try to make sure you capture the following:

  • What is the issue?
  • Why is it an issue?
  • Any future tasks that are dependent on it
  • How long the task will take
  • A rough priority

These things will help prioritise the task when it comes to planning the next set of tasks / projects that need to be actioned, but on their own they won’t really get it prioritised. Why? Because it will never be the focus of the business to do the boring work that keeps things stable unless there are issues, or it has to be done as a dependency of another task.

Actioning technical debt

As I pointed out before, the business will never prioritise technical debt unless there is a real reason to do so: service stability, or a dependency on another task. This is a fact of life, and if you’ve found yourself in this situation, getting frustrated about the hundreds of days of backlogged technical debt that is accruing, panic no more.

As I pointed out above, the business will never want to tackle technical debt unless there is good cause to do so, so the point of capturing the tickets with the right information is that the dependencies are clear and the outcomes of not doing the work are clear; this makes it easier to discuss it as a business requirement. That is not enough though: you will get some tasks done, but you will not be decreasing the technical debt, it will continue to increase.

You need to create an environment in your immediate team and the extended team where everyone understands why the technical debt needs to be actioned. This will be easier than convincing the whole business of why it is important. Once you have the buy-in of the teams, everyone will hopefully understand why it is important to keep the technical debt to a minimum, which will help it take a higher priority when it comes to scheduling tasks and reduce the technical debt in the long run. It is also a good idea to identify, as a group, an amount of technical debt that you are happy to live with; this can be calculated from the effort days required to deliver each task, for example agreeing that the combined estimate of all open debt tickets should never exceed a few weeks of effort. This is a sure-fire way of getting technical debt actioned and will help ensure that it remains at a sustainable level.

There’s always a risk that you’ve tried all of the above and still not gotten anywhere: the technical debt keeps rising and the environment continues to get more and more complicated to work in. Simple: do not worry about it. You did what you were meant to do, you raised the issues, you pointed them out to your boss, so you can kick back and relax, maybe even take deep joy in the moment when it all fails, someone asks why, and you just point out the years’ worth of technical debt that has ground your system to a halt. In short, it’s not your problem, it’s your boss’s; you just need to make sure you capture it and raise it.

Summary

Technical debt is bad. If you’re not aware of it you need to be, and you need to have a mechanism for dealing with it; if you don’t, your systems will grind to a halt and you will probably be one of those companies that rebuilds the entire infrastructure every 3-5 years because the technical debt was so large that the environment became unmanageable.

DNS results in AWS aren’t always right

A bit of background

For various reasons that are not too interesting, we have a requirement to run our own local DNS servers that simply hold the forward and reverse DNS zones for a number of instances. I should point out that the nature of AWS means this approach is not really ideal, especially if you are not using EIPs, and there are better ways; however, thanks to various technologies it is possible to get this solution to work, but don’t overlook the elephant in the room.
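
To give that a bit of shape, here is a minimal sketch of the sort of BIND (named.conf) setup I mean; the zone names, file names and forwarder address are purely illustrative and will differ for your VPC.

    options {
        directory "/var/named";

        // Anything we are not authoritative for is sent to the Amazon-provided
        // resolver for the VPC (example address; it is normally the ".2" address
        // of your VPC CIDR block, or 169.254.169.253).
        forwarders { 10.0.0.2; };
    };

    // Local forward zone for our instances (zone and file names are illustrative)
    zone "example.internal" IN {
        type master;
        file "db.example.internal";
    };

    // Matching reverse zone for the VPC subnet (again, illustrative)
    zone "0.10.in-addr.arpa" IN {
        type master;
        file "db.10.0";
    };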

What elephant?

A few months ago, while doing some proof-of-concept work, I hit a specific issue relating to RDS security groups, where I had added the security group my instance was in to the RDS security group to grant it access to the DB. One day, after the proof of concept had been running for a number of weeks, access to the DB suddenly disappeared for no good reason, and we noticed that adding the public IP of the instance to the RDS security group restored access. Odd. The issue happened once and was not seen again for several months, then it came back. Odd again. Luckily the original ticket was still there, and another ticket with AWS was raised, to no avail.

So, a bit of a diversion here: if you are using Multi-AZ RDS instances you can’t afford to cache the DNS record, because at some random moment it may flip over to a new instance (I have no evidence to support this, but also can’t find any to disprove it), so the safest way to get the correct IP address for the DB instance is to ask Amazon for it every time. You can’t simply take whatever the last IP returned was and set up a local hosts file or a private DNS record for it; that’s kinda asking for trouble.
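
To be explicit about what “asking for trouble” looks like, the snippet below is a made-up local zone file entry of the kind you should not create for an RDS endpoint; the names and address are entirely hypothetical.

    ; DO NOT do this for an RDS endpoint: on a Multi-AZ failover the address
    ; can change, and this pinned record will silently go stale.
    mydb    IN  A   10.0.1.25   ; a snapshot of whatever the endpoint resolved to that day

Instead, let clients resolve the full RDS endpoint name (the mydb.xxxxxxxx.eu-west-1.rds.amazonaws.com style name AWS gives you) through the Amazon resolver every time they connect.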

So we had a DNS configuration that worked flawlessly 99.995% of the time, and at some random, unpredictable point it would flake out; it was just a matter of time. As everyone should, we run multiple DNS servers, which made tracking down the issue a little harder… but eventually I did. The results we got back depended on which of our name servers the instance went to, and how busy AWS’s name server was when ours queried it. Occasionally one of our name servers would return the public IP address for the RDS instance, causing the instance to hit the DB on the wrong interface, so the mechanism that does the security group lookup within the RDS security group was failing; it was expecting the private IP address.

The fix

It took a few minutes of looking at the DNS server configuration, and it all looked fine; if it had been running in a corporate network it would have been fine, but it is not, it’s effectively running in a private network which already has a DNS server running split views. The very simple mistake was the way the forwarders had been set up in the config.

See the following excerpt from the BIND documentation:

forward
This option is only meaningful if the forwarders list is not empty. A value of first, the default, causes the server to query the forwarders first, and if that doesn’t answer the question the server will then look for the answer itself. If only is specified, the server will only query the forwarders.

The forward option had been set to first, which for a DNS server in an enterprise is fine: it will use its forwarders first, and if they don’t respond quickly enough it will look up the record itself via the root name servers. This is typically fine, as when you’re looking up a public IP address it doesn’t matter; however, when you’re looking up a record that should return a private IP address against a name server that uses split views, it makes a big difference in terms of routing.

What we were seeing was that when the AWS name servers were under load or not able to respond quickly enough, our name server fell back to resolving via the root name servers, which could only give it the public IP address. Our instance therefore routed out towards the internet, hit Amazon’s internet router, turned around and hit the public interface of the RDS instance from its NAT’d public IP, and thus was not seen as being within the security group. Doh!

Luckily the fix is easy: set it to “forward only” and ta-daa! It may mean that you have to wait a few milliseconds longer now and then, but you will get the right result 100% of the time. I think this is a relatively easy mistake to make, but it can be annoying to track down if you don’t have an understanding of the wider environment.
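
To make that concrete, the relevant bit of the options block in named.conf ends up looking something like this; the forwarder address is an example and should be whichever Amazon-provided resolver your VPC uses.

    options {
        // The Amazon-provided VPC resolver (example address).
        forwarders { 10.0.0.2; };

        // forward first;   // old setting: if the forwarders are slow to answer,
                            // BIND recurses via the root servers itself and can
                            // hand back the public IP of the RDS endpoint
        forward only;       // always use the forwarders, so the split-view
                            // (private) answer comes back consistently
    };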

Summary

Be careful: if you’re running a DNS server in AWS right now, I suggest you double-check your config.

It’s probably also worth learning to use “nslookup <domain> <name server ip>” to help debug any potential issues with your name servers, but be aware that because of the nature of the problem you are not likely to see it for a long, long time. Seriously, we went months without noticing any issue and then it just happens; if you’re not monitoring the solution it could go unnoticed for a very long time.