
A Taxonomy of Technical Debt

711 points | edroche | 8 years ago | engineering.riotgames.com

113 comments

[+] methodover|8 years ago|reply
This is a fantastic article.

Contagion is a really great term. I've seen my poor abstractions be replicated by others on my team, to my horror -- "don't they see why I did that in this particular case, and not in this other case?" Of course, that's entirely, 100% my fault. I picked a poor abstraction, I put it in the code, I didn't document it well enough, and of COURSE other programmers are going to look to it when solving similar problems. They should!

That said... Sometimes I spend a bunch of time finding the right abstraction for a feature that we end up not expanding. And then it feels bad that I spent all this extra time coming up with the "right" solution, instead of just hacking out something that works. Hmm...

[+] wpietri|8 years ago|reply
One team I was part of kept a separate backlog of technical debt and experiments. It was nice to have a place to say, "in 30 days, look at this hacky thing and see if it's worth making better". Or, "I noticed this is a mess, here's how I might clean it up." We'd occasionally talk over the backlog and prioritize it, which helped communicate both the general make-things-better spirit and specific issues like you mention. I really liked it.

One thing that made it work is that we worked on it in small slices all the time, without involving the product manager. It was still visible, so there'd be the occasional question, but as long as we kept delivering user value, nobody worried too much about our mysterious code concerns.

[+] mannykannot|8 years ago|reply
I would like to suggest that there is a fourth dimension that might be called 'interest' as we are using a debt analogy - the tendency for the cost to increase over the time elapsed since the debt was incurred.

When an item of debt is first created, the people making it are often well aware of what they have done and are therefore in a relatively good position to fix it, but that knowledge quickly dissipates, to the point where it is often forgotten that there is a specific issue there. Furthermore, there is a tendency for it to be made less obvious as further changes are layered on top and around (this is distinct from contagion, as it can occur if the later changes are themselves debt-free, or at least independent of the decisions that created the debt and their consequences.)

[+] theptip|8 years ago|reply
I found contagion to be a great clarifying concept too; it's something that I've been looking at in my codebase as the team expands.

My gut feel is that it's not necessarily about what you write in the first place, but what you refactor -- sometimes you can get away with a gradual replacement strategy (like std::string => AString from the article), but if the original pattern is contagious and bad, then you might have to take a more aggressive one-shot refactoring approach.

I've definitely seen this: a localized refactor is made to try out a better way of doing something, we decide that we like the new way, and then we don't find the time to replace the rest of the usages, resulting in a confusing state of affairs where you need to know which is the "blessed"/"correct" way of doing things.

I think that "contagion" is a good lens to use when assessing what the refactoring strategy should be for a given change to the codebase.

[+] danShumway|8 years ago|reply
I've also seen bad pattern replication, and had a difficult time explaining to other teams why it was a problem.

I used to write a lot of app-wide Javascript at a previous job that would get consumed by multiple teams. If I didn't encapsulate something well enough, or if I left a private member exposed, I'd later find a code review where someone was exploiting it.

The worst offender was a team that once used the prototype of a shared class as a mixin, duplicated/mocked just enough of my implementation logic to get three or four methods working, and then left it at that. Of course, the next time I changed any of my code, even in the constructor, their page broke.

My experience has been that when other teams see these patterns, they see a single page or feature that's working at the moment and assume "this must be fine." They don't see the three or four frantic show-stopping bugs that got logged last month.

When I would confront teams about this, often the response that I would get was "Well, if it's good enough as a quick fix for them, why can't we do the same thing? Why are we the only team that has to fix this?"

Of course, when teams don't want to be the first one to break from a bad pattern, the end result is that nobody changes anything.

[+] zachsnow|8 years ago|reply
I have found that the closer I am to the product and the clients that will be affected, and the more thoroughly I understand the use case from the client's perspective, the better I am at understanding how much effort to spend on "getting it right" in this way. Still wrong sometimes though!
[+] hinkley|8 years ago|reply
Contagion is why I want a VCS tool that allows me to keep code review comments with the code. Just because someone senior did something bad two years ago doesn't mean you have carte blanche to make new code that behaves the same way!
[+] worldsayshi|8 years ago|reply
Interesting how you point to a slightly different kind of contagion in replicating code patterns, while the article seems to discuss the kind that is inevitably forced on whoever depends on the code.
[+] outworlder|8 years ago|reply
This is my number one concern in my current team.

I have implemented a bunch of things that, while helpful short term, involved clunky hacks to make up for a lack of tooling or to meet time constraints. And then the solutions get replicated verbatim, because "they work". The more time passes, the worse they become.

[+] 0xdeadbeefbabe|8 years ago|reply
The whole tech debt concept might be the wrong abstraction.
[+] kashyapc|8 years ago|reply
This reminds me of the following, from the book Team Geek[1], chapter "Offensive" Versus "Defensive" Work:

[...] After this bad experience, Ben began to categorize all work as either “offensive” or “defensive.” Offensive work is typically effort toward new user-visible features—shiny things that are easy to show outsiders and get them excited about, or things that noticeably advance the sexiness of a product (e.g., improved UI, speed, or interoperability). Defensive work is effort aimed at the long-term health of a product (e.g., code refactoring, feature rewrites, schema changes, data migration, or improved emergency monitoring). Defensive activities make the product more maintainable, stable, and reliable. And yet, despite the fact that they’re absolutely critical, you get no political credit for doing them. If you spend all your time on them, people perceive your product as holding still. And to make wordplay on an old maxim: “Perception is nine-tenths of the law.”

We now have a handy rule we live by: a team should never spend more than one-third to one-half of its time and energy on defensive work, no matter how much technical debt there is. Any more time spent is a recipe for political suicide.

[1] http://shop.oreilly.com/product/0636920018025.do

[+] hinkley|8 years ago|reply
The XP guys had it right. Amortize all defensive work across EVERY piece of offensive work.

In tech debt parlance, most people are making interest-only payments instead of paying down the principal. Every check you write should do both (extra payments are good but they aren't good enough).

[+] eadmund|8 years ago|reply
It's a great article, but I do have one quibble.

> A hilariously stupid piece of real world foundational debt is the measurement system referred to as United States Customary Units. Having grown up in the US, my brain is filled with useless conversions, like that 5,280 feet are in a mile, and 2 pints are in a quart, while 4 quarts are in a gallon. The US government has considered switching to metric multiple times, but we remain one of seven countries that haven’t adopted Système International as the official measurement system. This debt is baked into road signs, recipes, elementary schools, and human minds.

A not-so-hilariously stupid mistake is to think that the traditional measurement system is stupid. His picture illustrates one of its virtues: the entire liquid-measurement system is based on doubling & halving, which are easy to perform with liquids. The French Revolutionary system, OTOH, requires multiplying & dividing by 10, which is easy to do on paper or with graduated containers, but extremely difficult to do with concrete quantities (proof: with one full litre container and two empty containers, none graduated, attempt to divide the litre into decilitres).
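
To make the halving chain explicit (the cups step is my addition, for illustration):

    1 gallon = 2 half-gallons = 4 quarts = 8 pints = 16 cups   <- every step is a single pour-in-half
    1 litre  = 10 decilitres, and 10 = 2 x 5                    <- no chain of halvings ever produces a fifth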

The real foundational debt is that we use a base-10 system for counting, due to the number of fingers & thumbs on our hands, rather than something better-suited to the task. If we fixed that problem, then suddenly all sorts of numeric troubles would vanish. There's actually a lot to be said about the Babylonian base-60 system, to be honest.

[+] TeMPOraL|8 years ago|reply
That's an... interesting point I haven't seen brought up before. Makes me appreciate the "traditional" system more.

Still, I guess we aren't going to drop base-10 any time soon, so I believe the US should just accept the "traditional" measurement system as something that used to be very practical but no longer is, due to the progress of technology, and switch to SI.

[+] baud147258|8 years ago|reply
What's better about a base-60 system compared to a base-10 system?
[+] mitko|8 years ago|reply
Great article, loved how the examples were presented.

In my time as an engineer, I've found that thinking of tech debt as financial debt also helps. There is the initial convenience (borrowed money) of using the debt-ed approach. Then there is the fix cost, as Bill Clark names it, i.e. how much it takes to pay back the debt if it were money. The impact is akin to the amortization schedule, i.e. what the cost is every time. For normal money, the amortization schedule runs over time, but for tech debt it runs over usage. The amortization schedule of tech debt is discounted over time, as with money: _now_ is more important than _later_.

Contagion is a great concept, and I think it is a better name than interest rate, as the debt will spread through the system, and not just linearly with time.

Tech debt is also multi-dimensional and not fungible like money, which makes it a harder thing to reason about.

But the good news is, in my opinion, that sometimes it is perfectly fine to default on some tech debt and never pay it back: just delete the code. Then taking the tech debt was a win, if the convenience was worth more than the amortized payments.
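
A back-of-the-envelope version of that model (names and numbers are purely illustrative, not from the article):

    // Toy amortization model for tech debt: the debt was worth taking if the
    // up-front convenience exceeded the total cost paid across every use of
    // the hacky code before it was fixed or deleted.
    bool debtWasAWin(double convenience, double perUseCost, int uses) {
        return convenience > perUseCost * uses;  // amortized over usage, not time
    }
    // e.g. debtWasAWin(16.0, 0.5, 20) -> true: saved two days, paid back ten hours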

[+] baddox|8 years ago|reply
I think the main difference is that technical debt is not fungible, i.e. you can’t necessarily easily choose to pay off the highest-interest technical debts first like you would for your personal financial debt.
[+] baud147258|8 years ago|reply
As a financial analogy, I've seen a piece (linked on HN a few years ago) comparing technical debt to unhedged options, meaning you get a benefit up front and you might or might not get bitten by it later.
[+] mmsimanga|8 years ago|reply
In data warehousing and BI, it's MacGyver and data technical debt all the way down. MacGyver because of all the "urgent" reports whipped up for the CEO, the duplicate copies of data, and the reports done by consultants who barely understand the industry. Data debt because of all the bugs and changes passed down as data from source systems.
[+] lmkg|8 years ago|reply
It's practically the definition of data warehousing that its whole purpose in life is to deal with everyone else's bullshit. If you want to combine data from different sources, you have to retroactively fix all the mistakes that the data owners made that don't cause issues for them but do cause issues for you.
[+] worldsayshi|8 years ago|reply
Do any programming paradigms protect better against data debt? The only way I can imagine to significantly protect against this would be some way to generate data migrations based on type changes.
[+] Unkechaug|8 years ago|reply
You're triggering me real hard. 11th hour "I don't care how just make the system get to this number" regardless of the garbage number dumped in. Then you're stuck with a permanent bandaid in the core code that will inevitably screw things up in the future all because of one due date that probably didn't even matter anyway.
[+] jeffdavis|8 years ago|reply
What about "fear"?

The most pernicious thing about technical debt, in my opinion, is that it creates fear in the sense of "I don't want to touch that module".

Even if you try to be objective and use hard facts to overcome the fear, it doesn't matter, because fear destroys creativity, so you've already lost.

[+] kraftman|8 years ago|reply
Your tests should reduce that.
[+] humanrebar|8 years ago|reply
I might have missed it, but missing from the taxonomy: "Pay In Full" Debt.

With this kind of debt, you keep paying the entire cost until the very last use of it is cleaned up.

This kind of debt is especially insidious because there is no incremental benefit to cleaning it up.

[+] jedanbik|8 years ago|reply
Reminds me of risk analysis: Impact times Probability equals Risk.

Contagion seems like a probability factor. Impact is the cost of leaving things unchanged. Fix cost is the cost of fixing the problem.

Risk management in this context then means comparing the Impact cost to the Fix cost in terms of what each means for the business.
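
As a sketch (illustrative names only, nothing here is from the article):

    // Risk = Impact x Probability; treat contagion as the probability factor.
    double risk(double impactCost, double probability) {
        return impactCost * probability;
    }

    // Pay down the debt when the expected cost of leaving it beats the fix cost.
    bool worthFixing(double impactCost, double probability, double fixCost) {
        return risk(impactCost, probability) > fixCost;
    }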

[+] jimmaswell|8 years ago|reply
Somewhat aside, but the brain having to "flip" visual information because it's "upside down" seems suspect to me. Turn it sideways while maintaining all the connections it has to the rest of the body, and what changes? Is it getting visual information sideways that it has to rotate now? Probably not.
[+] ninkendo|8 years ago|reply
Moreover the idea that the collection of neurons that your retina connects to has any concept of "orientation" is nonsense to begin with IMO. It's not that "there's an upside-down image that your brain has to fix", it's just that your brain interprets signals from your retina as a picture in your mind, full stop.

Rods/cones in the top of your retina connect to your brain through neurons; so do the ones at the bottom. But to say that "this 'top' retinal cone should really connect to a 'top' neuron in your brain" doesn't even make sense to me. Since when do the locations of the neurons interpreting the input even matter?

It would be the same with hearing too... you have a left and right ear, but if for some reason those were swapped and your left fed things to the right half of your brain and vice-versa, your brain wouldn't be "flipping it back", because how could the absolute location of the neurons interpreting the sounds even matter?

[+] lifeisstillgood|8 years ago|reply
There are writers who just ooze technical depth of understanding. I think it's something to do with trying to explain something at a layperson's level while leaving many assumptions there for the reader to follow. It's almost the opposite of baffling with bullshit.

Good read and a really useful concept

[+] monkeydust|8 years ago|reply
I am a senior product manager for a large financial technology company.

Over the years I have learnt to become comfortable with allowing my engineering teams to refactor code whilst delivering new functionality.

This has been a process and largely one of trust between me and the engineering leads.

It has also helped that I have seen payback from the investment made in reducing the debt, in terms of delivering new functionality more quickly and with less error-prone code. Although this payback can take a while to see (6 months+, which is a long time for a product person operating in a competitive space!).

Most of my managers don't get this, or if they do, they are too blinded by immediate KPIs from further above to justify it, so in most cases I just tell the engineering guys to add a spread to their estimates to cover the paydown of the debt.

Over the years this has definitely helped me build tighter relationships with engineers, which, as any product manager knows, can have huge benefits.

[+] billysielu|8 years ago|reply
I find it's always worth asking "will this get better over time, or worse?" for everything, ever. Folks just fail to see past the next few months; having at least one person in the room asking this question means that, at worst, they ignore the issue intentionally instead of out of complacency.
[+] hywel|8 years ago|reply
"I’ve rarely encountered discussions of contagion."

This surprised me: contagion is a good metaphor because it is a compounding measure of the growth of the problem. Just like an interest rate (a compounding measure of the growth of debt).

Most senior developers I've met have considered the interest rate of the debt, which seems like it has been renamed here as contagion. Maybe I've been lucky to just know smart people!

From the point of view of explaining these concepts, I'd suggest keeping the metaphors consistent. Tech debt should have an amount owed and an interest rate, tech infection (?) should have a potency and a contagion level.

[+] drawkbox|8 years ago|reply
At pretty much every game studio there is an epic internal battle of standard libs vs. custom, and std::string vs. [some custom string class] (here it is AString) is usually the spark. A constant of internal game development is the belief that you can always build better strings, lists, dictionaries, collections, etc. than the standard lib, basically assuming the standard lib is as it was in the 90s and all the work that has gone into it since is bunk. In some cases, if you are really pushing memory and not writing custom allocators or using something like boost, then yes; but in most cases the custom classes written by an ancient from generations ago are themselves the bigger technical debt.

> One of the best examples of MacGyver debt in the LoL codebase is the use of C++’s std::string vs. our custom AString class. Both are ways to store, modify, and pass around strings of characters. In general, we’ve found that std::string leads to lots of “hidden” memory allocations and performance costs, and makes it easy to write code that does bad things. AString is specifically designed with thoughtful memory management in mind. Our strategy for replacing std::string with AString was to allow both to exist in the codebase and provide conversions between the two (via .c_str() and .Get() respectively). We gave AString a number of ease-of-use improvements that make it easier to work with and encouraged engineers to replace std::string at their leisure as they change code. Thus, we’re slowly phasing std::string out and the “duct tape” interface between the two systems slowly shrinks as we tidy up more of our code.

So now there are two string classes; that is technical debt... One should be consolidated on. The arguments against std::string are sometimes valid, but you can also write custom memory allocators or use better standard lib iterations.
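
The strategy in the quote, as a minimal sketch; this AString is a stand-in of my own, and only the .c_str()/.Get() bridge comes from the article:

    #include <string>

    // Hypothetical stand-in for a custom string class (a real one would do its
    // own memory management; std::string is used here only to keep this short).
    class AString {
    public:
        explicit AString(const char* s) : data_(s) {}
        const char* Get() const { return data_.c_str(); }
    private:
        std::string data_;
    };

    // The "duct tape" interface between the two systems: each side converts
    // through its C-string view, so both can coexist during the migration.
    inline AString ToAString(const std::string& s) { return AString(s.c_str()); }
    inline std::string ToStdString(const AString& s) { return std::string(s.Get()); }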

EA even rewrote the whole standard lib as EASTL [1] to adjust for some of these issues, i.e. fragmented memory. Some games require it; for others it is pure ego. Game development teams have the highest ego-driven development (EDD) I have ever seen, with lots of tricks that take five minutes (but add 2-3 months to testing because of those five-minute solutions) and are more spaghetti than templates that write templates.

The one problem that comes with your own standard lib, or with thinking you are better than boost or similar, is that the learning curve on the internal lib replacements adds technical debt and start-up costs, and the original guy who wrote them is usually long gone. Also, in the end portability suffers, as there are invariably 3-4 versions of the internal libs.

Developers have to weigh the technical debt of custom classes outside the standard libs against the memory issues they solve. Today most machines are not as affected by memory fragmentation and there is more CPU/memory to go around; where fragmentation does matter, you can write custom allocators for std/stl or use something like boost, as sketched below.
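
For example, since C++17 the standard containers can take a pluggable allocation strategy without any rewrite; a minimal sketch (the buffer size is arbitrary):

    #include <cstddef>
    #include <memory_resource>
    #include <vector>

    int main() {
        std::byte buffer[1024];  // fixed stack storage: no heap traffic, no fragmentation
        std::pmr::monotonic_buffer_resource pool(buffer, sizeof buffer);
        std::pmr::vector<int> values(&pool);  // a std container with custom allocation
        for (int i = 0; i < 64; ++i)
            values.push_back(i);  // served from the stack buffer; falls back to the heap if exhausted
    }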

I do love Riot Games and game development teams in general; it's just that I have never worked in or with one that doesn't have the standard-lib-vs-custom battle, and lots of time gets wasted when one approach isn't standardized on or the custom route isn't necessary. Some games and game engines do require it; where they do, you should fully commit one way or the other. Going custom, though, slows down coding for new devs, and invariably there will be multiple versions of those internal libs over time that add up in the debt department.

[1] https://github.com/electronicarts/EASTL

[+] badloginagain|8 years ago|reply
One of the biggest problems of this is the tribal knowledge that develops around it. I worked at a studio that had something very similar to EASTL, but had joined the studio after an exodus of senior people.

It meant I had no idea how to use the custom libs. No documentation, no one left in the office to tell me how it's used, no Stack Overflow to answer even trivial questions.

I left after less than a year. The studio closed down 2 months after I left.

[+] WorldMaker|8 years ago|reply
There's absolutely a "problem domain debt". If you are a game company, the problem domain that gets you paid is your game(s). Time spent rebuilding standard libraries is time not spent working directly on the game, and maybe time not getting properly "paid". Meanwhile, there are already people paid to work on the standard libraries, and it's their job to make those work and continue to improve them.

Certainly there are tradeoffs where you may have to know the standard libraries well enough to know their performance characteristics, or how best to mitigate worst-case scenarios, but if the people paid to build standard libraries are doing their jobs (which you pay them for when you buy that compiler), it should be less debt work to work around an existing solution's rough edges than to build one from scratch.

[+] hinkley|8 years ago|reply
If I had a nickel for every jackass who thought he understood URL parsing well enough to do it by hand, instead of using the goddamned built-in library like a sane person and getting all of its sanity checks and corner-case handling for free, I could retire (and would be a lot less bitter).
[+] ryandrake|8 years ago|reply
It's not just game studios that rewrote the standard library. I've seen non-game companies with their own containers too. In those cases, it's always just very old legacy code that nobody wants to touch; if they were to do it again fresh today, they'd just use what comes with the language.

For places that have their own legacy containers and actively try to move more code to them... I dunno! I think at some point back in the 90's the standard library got the reputation of being junk (perhaps rightfully) among game programmers, and this belief has been cargo-culted all the way into 201x. Who knows.

[+] brann0|8 years ago|reply
You're missing the point of reimplementing some or all of the standard libs. Similarly, disabling C++ exceptions and RTTI is a very common practice in gamedev.

Sometimes you reimplement a certain standard class (vector, string...) to adapt it to the specific needs and usage patterns you have. Standard libs tend to be too general, plagued with allocations and other behaviors that are useless in this specific context and that may negatively impact your performance, cache friendliness, or memory fragmentation...

I agree a simple tiny game doesn't need all of these but when you need to squeeze all the performance you can there's no other option.

So please, do not just dismiss all the gamedev wisdom like that.

[+] jrs95|8 years ago|reply
You would think the approach here would be to just use the standard libs until they're actually contributing to a bottleneck and only then worry about optimizations. Doing it prematurely does seem to be a problem, though I don't think having custom alternatives to parts of the stdlib is bad if you're actually making a meaningful optimization.
[+] VyseofArcadia|8 years ago|reply
My company's string container is definitely MacGyver debt. At some point in the distant past we had to worry about having both Pascal-style and C-style strings...
[+] rickbad68|8 years ago|reply
In my experience, the term 'technical debt' is often hijacked by product-oriented folks, resulting in feature debt being presented as tech debt.
[+] arca_vorago|8 years ago|reply
This seems far too focused on dev tech debt, which has a very narrow scope. I like the article, so I'm not knocking it, just offering a little perspective. As a senior sysadmin, my primary issues in the past have been technical debt across the entire board, number one being too few hires for too much workload due to cheap or nearsighted execs. But I would definitely agree that contagion is a great term for how tech debt grows faster the longer it's left alone.

It's worth remembering that the CTO, the senior sysadmin, and a few others are dealing with all the tech debt of the entire company and IT department, of which dev is only a subset. (Of course this depends on the company, but on HN I sometimes see convos like this where it feels like devs are just talking at each other and not receiving much outside feedback.)

[+] scarface74|8 years ago|reply
I've been binge listening to Software Engineering Radio for the past few months. I am currently listening to an episode where they are talking about technical debt.

The guest's opinion is that clean code is not as important as shipping code: ship the code first and then refactor as needed after you get customers.

http://www.se-radio.net/2015/04/episode-224-sven-johann-and-...

[+] debt|8 years ago|reply
“We gave AString a number of ease-of-use improvements that make it easier to work with and encouraged engineers to replace std::string”

Are you absolutely sure this itself won’t become Foundational technical debt? You seem overly confident, given the metrics, that replacing std::string is a good decision.

[+] LtRandolph|8 years ago|reply
We can't know for certain. But we've had a significant, measurable reduction in CPU cost due to "hidden" memory allocations from things like passing a char* into a function that takes a std::string. (I may be mildly inaccurate here, as I wasn't the guy doing the perf captures; I just talked to him about it.)

I'm particularly impressed by AStackString, which is a subclass that has initial memory allocated on the stack, but automatically converts to dynamic allocation if you exceed that space. So we get quick stack allocation by default, but it will safely handle when it needs to expand.
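
That's essentially the classic small-buffer optimization. A rough sketch of the idea (illustrative names, sizes, and behavior, not our actual implementation):

    #include <cstddef>
    #include <cstring>
    #include <string>

    // Stack-first string: short values live in an inline buffer, longer ones
    // spill over to dynamic allocation (std::string stands in for brevity).
    template <std::size_t N>
    class StackString {
    public:
        explicit StackString(const char* s) : onStack_(std::strlen(s) < N) {
            if (onStack_) std::strcpy(inline_, s);
            else heap_ = s;  // exceeded inline capacity: dynamic fallback
        }
        const char* Get() const { return onStack_ ? inline_ : heap_.c_str(); }
    private:
        char inline_[N] = {};
        std::string heap_;  // only used when the inline buffer is too small
        bool onStack_;
    };

    // StackString<64> name("Teemo");  // fits inline: no heap allocation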

Most of the quality-of-life stuff is around built-in support for printf-style formatting and string searching (including case-insensitive).

[+] jtchang|8 years ago|reply
I love this article. Quickly breaks down the types of debt.
[+] carapace|8 years ago|reply
(MacGyver's name is Angus!?)