
How to Do a Full Rewrite

91 points| tate | 2 years ago |badsoftwareadvice.substack.com

90 comments

[+] simag|2 years ago|reply
Completed a full rewrite of many components of the Kraken.com backend in about 4 years.

The new system is around 1.5M LoC of Rust. There was no serious alternative to rewriting; sometimes you find yourself in a corner and need to fix issues and pay the price.

I wrote about it 3 years ago here https://blog.kraken.com/product/engineering/oxidizing-kraken...

Everything in that blog post still rings true, and in hindsight we were right. But it was a massive grind and required extreme dedication to get it done; for a variety of reasons, that work was very taxing.

We also didn't stop feature development and kept the two systems running concurrently (which explains why it took so long, also growing and training a new team 10x the size took time, so there are many factors).

I'm also against rewrites if I can help it, but reality is complex and sometimes we can't help it. Now, however, since we removed the last pieces of legacy that were preventing larger DB schema changes (or required massive, unreasonable changes to the legacy systems), we've been shipping faster and more easily than ever and have caught up on a lot of the accumulated backlog, including some of the more ambitious projects that were unthinkable in the legacy systems due to limitations.

[+] SenorKimchi|2 years ago|reply
Huge fan of Kraken.

Looking back, is there anything that you would have done differently? I find that half or more of the rewrites that I have dealt with have been driven by all the wrong motivations. You get inevitable turnover and at some point people dislike code that they didn't write themselves and push for a rewrite, maybe changing the stack to something trendy, justifying it with thin arguments. Once the rewrite starts the company ends up treading water for years while incurring a ton of costs. For me, I think only 1 rewrite that I was part of was a good decision in my 15 years in tech. If I could go back in time, I think I would kill all rewrite discussions the moment that someone first whispers the idea.

How did you guys enjoy switching to Rust? I assume the safety and performance benefits for the trading system are a huge plus (didn't Kraken trading go down for an entire week a few years ago?). Did you also rewrite the webapp backend in Rust as well? How has staffing and budgeting been affected? I would assume that the supply of Rust developers is much lower unless you train them in house. Rust sounds fun, but I can't imagine trying to justify a rewrite of a legacy system, a major tech stack change, and training/building a new team all at the same time.

Sorry for the onslaught of questions. The "rewrite it in rust" fever has spread to my work and I'm fighting myself on how to respond.

[+] snvzz|2 years ago|reply
Rewrites typically work well when the people behind the rewrite are the same people who wrote the original code and maintain it.

Often, the requirements changed along the way as the problem domain incrementally became better understood. At that point, the original design is not helping, but sabotaging everything.

This is why that first version should always be considered a prototype. And the next version will probably also be.

Not rewriting will have a much larger cost down the line.

[+] mekoka|2 years ago|reply
Our experience may not have been the same, but I beg to differ. If the old solution is so problematic that nothing can be salvaged from it and the best solution is a full rewrite, instead of some refactoring or modifications, then your problem is not so much that the code is solving the wrong problems. The problem is the team that built it. Every time I was ever called to rewrite a code base, there were certainly some elements that pointed to a clearer understanding of the requirements, but mostly, the necessity to do it from scratch pointed at a people problem.

Many successful projects started as reverse engineered clones of old and established ones, that then became improved versions of their predecessors. Those are rewrites, just not done by the original project's team.

My opinionated rules of full rewrites, informed by experience and observation:

1- Bring in one, just one, project lead who is a specialist of the technical domain. E.g. if you're building a web app, hire a seasoned web developer, instead of relying on your in-house electrical engineer, whom you allowed to architect the previous solution, because they managed to convince you that code is code.

2- Let the new lead vet every member of the old team, including (especially) the old team leaders.

3- Allow the new lead to drop any dead wood.

[+] bazoom42|2 years ago|reply
Why can’t the existing code just be adapted to the new requirement? Code is supposed to be malleable. But perhaps the code was designed too rigidly to support changing requirements. In that case you will have exactly the same problem after the rewrite, the next time requirements change.
[+] mekoka|2 years ago|reply
Obviously satire, but in real life, there are reasons why a full rewrite becomes appealing. Perhaps the solution even.

One is overwhelming technical debt. Code where the project manager didn't believe in encapsulation, or refactoring, or none of that "architectural nonsense", and was only fired 5 years too late. Code that is difficult to understand, maintain, test, debug, change. Code that follows you home after-hours and on the week-ends. Code that nurses you to bed at night, shows up in your dreams, and wakes you up in the morning. Code that has made many a colleague look for employment elsewhere and new hires give up and quit in their first week.

Every time I see someone profess with assurance that you don't rewrite, I just know that that person has never really experienced the hell I've described above.

[+] anonzzzies|2 years ago|reply
I am against rewrites of any significance because they generally just end up worse. Joel Spolsky wrote about that a long time ago, as did others; most software that’s older has millions of badly documented changes applied by 1000s of people over the decades, and rewriting tends to take literally forever (never finishes) or at least much longer than anyone estimated times PI. And then the end result is usually just as crappy but with more bugs.

The Dutch tax software rewrite attempts are an example I am personally familiar with. These attempts made me create a services company to help companies keep legacy software running forever. We support gnarly stuff over 25 years old and still wouldn’t recommend a rewrite for the above reasons.

There are of course cases where rewrites (of significance; rewriting a 50k LoC codebase is not going to be hard) work, but usually the rewrite is done by the same people that did the original, the original wasn’t actually that bad but just too hard to extend in modern times, etc.

[+] BlargMcLarg|2 years ago|reply
> Code where the project manager didn't believe in encapsulation, or refactoring, or none of that "architectural nonsense"

If anything I find the largest proponents to have drunk too deep from that well and cause the rewrites to never be considered, as the time required to do it becomes far too long to be worth the pay-out.

This excludes the worst kind: the over-architected old mess in need of a rewrite because it was based on the wrong assumptions and is now bogged down by 10 layers of abstraction and indirection which don't do anything.

[+] tcbawo|2 years ago|reply
What you are describing sounds more like a problem with leadership to empower ICs and groups to make improvements, or possibly a culture of dumping code, declaring victory, and moving on. If new hires are bailing in their first week, the problems run far deeper than the codebase. Rewriting anything is not likely to change the long term end state unless the company culture has shifted. I have yet to experience an organization where a real cultural shift has happened.
[+] l0b0|2 years ago|reply

> One is overwhelming technical debt. […]

This sounds like an organisation with bigger problems than a single manager. The devs hate the job, but don't have the clout to convince management to let them do things properly? Run. It's not worth it. An organisation doesn't suddenly "heal" once a bad apple is gone. More likely, everybody is at least five years behind current best practice, and very much used to doing things that way.
[+] bazoom42|2 years ago|reply
I have seen horrible code, but the point is that a rewrite will not solve this. You will just lose years of opportunity and end up in exactly the same place again after the rewrite, because the reasons which led the first version to become a big ball of mud will also cause the rewrite to end in the same state.
[+] Aeolun|2 years ago|reply
Somehow I felt like you were describing my Factorio factory…
[+] PaulKeeble|2 years ago|reply
I have seen many organisations try this and end up with a second and even third system with fewer features running in parallel with the first, never catching up to be feature complete.

I am of the opinion that the only real way to do this is to take small pieces and replace them, keeping the main system running while slowly replacing parts of it. It will never be complete, but progress can be made in areas that need improvement. A complete rewrite isn't worth it: they fail so often, cost far more than anyone thinks they do, and rarely achieve the magic improvements they were sold on.

[+] heisenbit|2 years ago|reply
The only way is to accept that the new system is less complete, switch over, and suffer the consequences. The new system will never have all the old features; the question is whether it is good enough to use now and extend in the future. The mission of the builders of the new system should not be feature completeness nor being better, but to kill the old quickly while maintaining a reasonable level of future-proofing. This cannot be driven from the bottom (rewrite for code beauty), as it requires business-level commitment to bear the pain of the changeover. Software is coded business processes, and new software means new processes. A major cost driver for software is inflexible requirements, and taking old code and processes as gospel is a guarantee for a cost explosion.
[+] lazyasciiart|2 years ago|reply
The strangler pattern. I agree.
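For readers who have not met the term, here is a minimal sketch of the strangler idea, assuming requests can be routed by path prefix. The names `StranglerRouter`, `legacy_app` and `new_invoices_service` are made up for illustration; in practice the routing usually lives in a reverse proxy or gateway rather than in application code.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def legacy_app(path: str) -> str:
    """Stand-in for the old system (hypothetical)."""
    return f"legacy handled {path}"


def new_invoices_service(path: str) -> str:
    """Stand-in for one rewritten piece (hypothetical)."""
    return f"new service handled {path}"


class StranglerRouter:
    """Sends migrated path prefixes to new handlers, everything else to legacy."""

    def __init__(self, legacy: Handler):
        self.legacy = legacy
        self.migrated: Dict[str, Handler] = {}

    def migrate(self, prefix: str, handler: Handler) -> None:
        # Move one slice of functionality over to the new implementation.
        self.migrated[prefix] = handler

    def rollback(self, prefix: str) -> None:
        # Undo a migration step; the legacy code is still there to take over.
        self.migrated.pop(prefix, None)

    def handle(self, path: str) -> str:
        # Longest matching prefix wins; anything unmatched stays on legacy.
        for prefix in sorted(self.migrated, key=len, reverse=True):
            if path.startswith(prefix):
                return self.migrated[prefix](path)
        return self.legacy(path)


router = StranglerRouter(legacy_app)
router.migrate("/invoices", new_invoices_service)
print(router.handle("/invoices/1"))
print(router.handle("/orders/2"))
```

The old system keeps serving everything that has not been migrated, so the replacement can stop at any point and still leave a working product.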
[+] iamflimflam1|2 years ago|reply
What I always love about “the rewrite” is the sheer optimism of the people involved - “we’ll be done in six months, and the new system will be a thing of wonder”. Fast forward to several years later. The original proponents will have moved on leaving behind the accumulated bodges and shortcuts from the increasingly desperate efforts to try and get something live… and then the cycle repeats.

There are ways to do this properly, but no one wants to put the effort into understanding the existing code base and the reasons it is the shape it is. Everyone is happy with the “who wrote this crap - we can’t work on it” line.

[+] bakuninsbart|2 years ago|reply
I've only witnessed one full rewrite in my (admittedly short) career so far, and it went shockingly well. The goal was to rewrite a C++ application from the 90s in Java, since the company had largely moved to the web and C++ devs were getting close to retirement. The C++ application was also written in the underfunded startup phase, and then over decades new features had been "tagged on", so the architecture wasn't that robust.

The team tasked with the rewrite set a goal to finish in a year, the first 4-6 months were entirely spent on planning. After that, features were implemented in a modular and iterative approach. I think overall it took a bit longer than a year to be feature complete, but by the end of the year they had a working platform with all the core features implemented.

I think the key here really was very good planning, and the pitfalls you describe can be at least partially ascribed to agile development not being the right tool if you have a very clear and large set of requirements.

[+] loveparade|2 years ago|reply
I see this story here all the time, but I have never seen it play out in the real world. Most rewrites I've seen have been hugely successful. This sounds more like a sticky narrative that everyone keeps repeating to tell a nice story, get attention, and make them look experienced. In reality, no experienced engineer is so naive as to think that the new system will be a perfect thing of wonder or not look at the tradeoffs made in the old codebase. Reality is more nuanced than these simplistic fictional narratives.
[+] TheAceOfHearts|2 years ago|reply
I think a rewrite is fine as long as it's incremental and well-integrated each step along the way, rather than starting off from scratch. Unfortunately this seems like the sort of lesson each engineer needs to learn through experience rather than hearing someone tell you about it. Live and learn I guess. :)
[+] tnr23|2 years ago|reply
I founded a company and reached $6M ARR with >50% net profit margin. After 4 years decided to do a full rewrite. Finished it successfully within a year including full migration of all clients. Was the best decision ever.
[+] dieselgate|2 years ago|reply
Do you consider the rewrite a good decision because it helped increase profits or for other reasons?
[+] seb1204|2 years ago|reply
Congratulations. It reads like you were in control, in the driver's seat and deep in the code base. What I want to say is that I consider a rewrite possible because you controlled most/all aspects. Currently we are moving from several ERP instances/configurations due to acquisitions to one completely new ERP version and architecture. This is a huge project consuming enormous resources and needs a director reporting to the CEO.
[+] Scarblac|2 years ago|reply
Am in the middle of a rewrite now. I'm always against them, but this case is clearly perfect for it.

- The old system used to have many users, but only one large organisation was left.

- The way they use the system is atypical of what it was originally intended for. They use like 20% of the original functionality.

- The old system started as the very first software our company ever produced (long before they brought in professional software developers). Credit to their choice of Django that it worked for fifteen years, but it wasn't very good.

- But in the last nine years of that, hardly any maintenance has been done on it and now none of the build tools work.

We're making a much more focused, modern application now that does everything the customer used in the old one but looks completely different.

So there exists at least one situation where a rewrite is the answer.

[+] baz00|2 years ago|reply
How to do a full rewrite:

1. Create a subsidiary company and transfer the code ownership to that.

2. Sell the subsidiary company to an investor for big money and run away quickly.

3. While rolling around in VC cash, green field a better product from the ground up without all the horrible things you did last time.

4. Goto 1

[+] eastbound|2 years ago|reply
This topic is underrated and extremely important in YOUR career. Youngsters believe rewrites are for old people who didn’t get their code right.

Business apps only last for 15 years max. Every app requires a rewrite after 10 years, apps that don’t do it weren’t actually important for business.

Software stacks are flimsy. Apps need to support NPM, then Kubernetes. There are new security discoveries and ways to prevent them. How do you list your libraries if you are not using NPM? Also we need to move the app to the cloud because installed software is not fashionable. The original software may be reasonable, but assumptions and expectations change.

Why don’t we see them more often, then?

The response is: Because they are worded as “Brand new app from scratch” in job descriptions, and because NPM hasn’t finished its 10-year cycle yet. But rewrites will come, and you had better have a methodology for that.

When you build an app, one day, a requirement will be to make apps easily rewritable.

[+] deterministic|2 years ago|reply
Nope. I work on software that is 40+ years old. The code runs major airlines and airports with very few problems. We are currently upgrading it to run in a browser. Without rewriting it.
[+] amelius|2 years ago|reply
> Youngsters believe rewrites are for old people who didn’t get their code right.

Youngsters haven't yet seen the tyranny of managers and their always changing requirements.

[+] jmyeet|2 years ago|reply
The first rule of rewrites is don't. That is almost always the correct decision. Like Lucy and the football, if you're sure that this rewrite is different, you need to consider how that'll happen.

If you build in parallel, you need to decide what work will go into the existing system and how long the new system will take. Do you pause new features? If yes, I guarantee you that'll be an issue. I also guarantee it'll take longer than you think it will. If you don't pause new features, you'll be chasing moving goal posts so it'll take even longer. All of this is significant extra effort that you have to hope to recoup somehow with the new system.

Remember that it won't take that long before your new system is also considered legacy.

Once you have that new system, how do you deploy it? Dropping it in place is generally a high-risk strategy that'll greatly slow your migration.

Instead of the above, what you generally want is to do partial rewrites in-place using infrastructure you may need to build but you'll want anyway. Here is the progression of, for example, a backend migration:

1. Double-write to the old and new backends. The new system is just a dummy, essentially. You won't be using it (yet). Do offline verification on the old and new backends. Look at metrics for double-writes vs single-writes (eg latency, general performance, crashes);

2. Start reading from the old and new backends. Do a comparison of the data retrieved from both. Log and flag any instances where the data differs. You've now verified the new backend;

3. Start reading from the new backend;

4. Stop writing to the old backend.

This requires things like a robust experiment framework so you can, for example, double-write for 1% of your users and then compare the two groups easily across a wide range of metrics to see if there is any unexpected regression.

The point of all this is you can do partial rollouts of any of these steps and, more importantly, a rollback is trivial.
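The four steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: both backends are plain dicts, `MigratingStore` and `in_rollout` are invented names, and the percentage knobs stand in for whatever experiment framework is actually in use.

```python
import hashlib


def in_rollout(user_id: str, percent: int) -> bool:
    """Deterministically place a user in the first `percent` of 100 buckets."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < percent


class MigratingStore:
    """Wraps an old and a new backend (plain dicts here) behind one interface."""

    def __init__(self, old, new, double_write_pct=0, shadow_read_pct=0,
                 read_new_pct=0, write_old=True):
        self.old, self.new = old, new
        self.double_write_pct = double_write_pct  # step 1
        self.shadow_read_pct = shadow_read_pct    # step 2
        self.read_new_pct = read_new_pct          # step 3
        self.write_old = write_old                # step 4 sets this to False
        self.mismatches = []                      # stand-in for logging/metrics

    def write(self, user_id, key, value):
        if self.write_old:
            self.old[key] = value
        # Once the old backend is retired, every write must reach the new one.
        if not self.write_old or in_rollout(user_id, self.double_write_pct):
            self.new[key] = value

    def read(self, user_id, key):
        if in_rollout(user_id, self.read_new_pct):
            return self.new.get(key)
        value = self.old.get(key)
        if in_rollout(user_id, self.shadow_read_pct):
            # Compare both backends, flag differences, but still serve old data.
            shadow = self.new.get(key)
            if shadow != value:
                self.mismatches.append((key, value, shadow))
        return value


store = MigratingStore({}, {}, double_write_pct=100, shadow_read_pct=100)
store.write("user-1", "balance", 42)
print(store.read("user-1", "balance"), store.mismatches)
```

Each step is a number that can be raised gradually or set back to zero, which is what makes a partial rollout and a rollback cheap.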

[+] samus|2 years ago|reply
The ease of accommodating new features along the way is actually a fine opportunity for the design of the new system to prove its mettle. If new features turn out to be troublesome, the new design is probably flawed as well.
[+] marginalia_nu|2 years ago|reply
> what you generally want is to do partial rewrites in-place using infrastructure you may need to build but you'll want anyway.

Worked in a place where they were embarking on their third generation of failed incremental replacement of a system originally built in COBOL. No indication the third time was the charm either. The big problem with these projects is that they ran for so long that whatever they were using stopped being the hot new thing halfway through, so they always needed to restart with a new paradigm before the old one was fully implemented.

[+] karussell|2 years ago|reply
I once was involved in a bigger rewrite/refactoring of a backend where, in a few areas, we didn't have enough tests. I found goreplay very helpful, as you do not need to explicitly write to two backends; the request data is "doubled" instead. Stateful requests were still a bit problematic, but you can write so-called "middleware" for goreplay (in different languages), and I was able to solve it this way.
[+] what-no-tests|2 years ago|reply
Rewriting is so attractive.

If the code has aged and is suffering from many code smells and anti-patterns, a rewrite becomes even more attractive.

"Why should I spend all this time adding a simple feature to this crappy code??"

But writing code is the easy part.

Architecting a correct solution that meets today's business needs and can be built upon, as well as walking the data, users, and business workflows over to the new system, are the hard part.

I've never seen it go smoothly. I want to see it happen, because I'm an optimist, but so far it's not gone well.

[+] nrr|2 years ago|reply
Oddly, your username is relevant here. It's a common refrain of mine when I hear about someone undertaking a rewrite that becomes something of an albatross.

The only times I see rewrites succeed (where success is measured by how often your customers notice you breaking stuff) is when there's a comprehensive set of integration tests to write against. That really seems to be the sole determining factor.
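One way to picture "tests to write against" is a parity harness that feeds the same inputs to the old and new implementations and reports every disagreement. The sketch below is purely illustrative: `legacy_price` and `rewritten_price` are invented functions, and the rewrite deliberately rounds differently so the harness has something to catch.

```python
def legacy_price(quantity: int, unit_cents: int) -> int:
    """Hypothetical legacy behaviour, quirks included: 10% off at 10+ items."""
    total = quantity * unit_cents
    if quantity >= 10:
        total -= total // 10  # discount is rounded down, so the price rounds up
    return total


def rewritten_price(quantity: int, unit_cents: int) -> int:
    """Hypothetical rewrite: looks equivalent, but rounds the price down."""
    discount = 10 if quantity >= 10 else 0
    return quantity * unit_cents * (100 - discount) // 100


def find_regressions(old, new, cases):
    """Return (args, old_result, new_result) for every case that differs."""
    return [(args, old(*args), new(*args))
            for args in cases if old(*args) != new(*args)]


# Inputs chosen around the discount boundary, where behaviour changes.
CASES = [(q, c) for q in (0, 1, 9, 10, 11, 250) for c in (1, 99, 1999)]

for args, old_result, new_result in find_regressions(
        legacy_price, rewritten_price, CASES):
    print(f"{args}: legacy={old_result} rewrite={new_result}")
```

Here the two functions agree on most inputs and differ by one cent on the orders of 11 items, the kind of difference a customer notices and a code review usually does not.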

[+] KronisLV|2 years ago|reply
> But writing code is the easy part.

If it's inherently easier to write new code than maintaining old code, then we should do our best to also make it a good choice operationally. Maybe write a new service/module alongside the old thing and deploy that, instead of opting for a full rewrite immediately. What else are we supposed to do, suffer with codebases that decline over time, instead of using more modern tooling and frameworks?

If adding a new module to a dated legacy Spring system would involve messing around for weeks with XML configuration mechanisms, brittle runtimes and mechanisms in the codebase and integrating everything with the existing solution whilst having risks in regards to overall stability, then you might as well bootstrap a new service in Spring Boot/Quarkus/whatever you're comfortable with and deploy it alongside the old thing. (using Java as an example here, because the old Spring projects I've seen got pretty bad sometimes)

With reverse proxies or other gateway solutions, as well as containers and orchestration, deploying and routing traffic to this new service (even just for new endpoints in your existing domain/API) is not a big issue and many other concerns are covered, in addition to this new service being simpler to change and evolve over time, as well as replace altogether.

However, the situation around the data and integrating various services with one another still is difficult - because with microservices your data model would be strewn across multiple separate databases, whereas with multiple services connecting to the same database you'd end up with something that's hard to reason about. Whereas if you call APIs directly, you're also introducing an unreliable network connection into the mix between the services, which has some overhead as well, or a complex message queue/pub-sub solution.

I'm glad that we're far along enough for the former problems to be at least reasonably solved (things like 12 Factor Apps are great), but I don't think that there are all that many solutions for the latter group of issues, unfortunately. Even worse if you do a full rewrite and there's an expectation of the new thing to 1:1 match the logic and features of the old one, then you're setting yourself up for failure, unless the project is simple.

[+] layer8|2 years ago|reply
> When you find large chunks of code that seem to serve no purpose - get rid of those immediately!

Yeah, that's a good recipe to break your software in non-obvious ways that only become apparent later, and will end up costing you a lot of money and customer good-will. Chesterton's Fence applies.

My advice: Don't do a full rewrite. Instead rewrite/refactor the most painful parts in small, well-understood increments. Rewriting only 20% may give you 80% of the benefits. Be very conscious and explicit about what your actual pain points are and what exact benefits you'll get by rewriting specific parts.

[+] BigJono|2 years ago|reply
Your advice only works if the "large chunk" of code isn't a large chunk of code lol. If you can safely refactor a subset of the code without the full context then of course you'd do that. But if there's a big enough chunk of code that people can't load it into their brain RAM before making a change, it'll just get bigger and bigger as more people tack on drive by changes. If you're past that point, it's time to ctrl-A delete specifically so you can apply Chesterton's Fence on future refactors.
[+] rbosinger|2 years ago|reply
I normally agree with this sentiment but I have a story about this. I was in a situation where I was handed over old projects to expand on and the choice to rewrite or build on the same codebase was up to me and another dev (management didn't care or know what should be done in this regard). As "mature" devs we knew rewriting wasn't a great idea. So we code spelunked for months, wrote tests, upgraded and cleaned out code bits at a time. Only after succeeding in this did we realize that 85% of the existing code was unused and most of the remaining 15% had pretty substantial bugs in it and naturally ended up mostly rewritten to fix. Turns out the previous dev worked alone for years, didn't use source control, didn't ever delete anything, and that the company has quite a history of pivots. After stepping back from this we realized we should have rewritten, and that we basically did anyhow, except we took the longer road to get there and now still have to continue with some of the framework/lib choices the previous dev had made, because we opted to stick with them in order to facilitate this piece by piece "rewrite". I don't think we did anything wrong here but it's funny how taking this mature approach, in retrospect, probably didn't help much and now we're beyond the opportunity for a rewrite. I know some folks will say that what we did was actually a first step towards a proper rewrite, and I'd agree, except the time to do that never came.
[+] k__|2 years ago|reply
When I was just starting as a developer, I had to maintain an old system.

I didn't know much, so I assumed all its problems stemmed from my missing understanding of the system.

After working on it for a few years, I came to the conclusion that this wasn't the case. The system was inherently broken and not doing a rewrite would be negligence.

I did a rewrite in the end, but I did it incrementally. Built a new API, and replaced parts of the interface with new implementations that used the new API.

Took a year, but it was worth it.

[+] bazoom42|2 years ago|reply
HN comments often complain about incompetent management, but in my experience these misguided rewrite-from-scratch projects are initiated by developers who want to explore some new framework or platform, or who just think green-field development is more fun. And then they convince management a rewrite is necessary.
[+] sjducb|2 years ago|reply
I once pulled off a scratch rewrite.

The original author had spent months writing their own custom ORM framework. The big problem was that this framework didn’t support nested transactions, or validators. This meant that when users uploaded data, some of the data would go in, then a later part would fail and you’d get an inconsistent state. Then users would spend days fixing the inconsistent state through a clunky UI.

I rewrote the whole thing with Django in 2 weeks. Each upload was in a single transaction and if something failed it would roll back the whole upload with a message that said “Upload failed with error {}. No data was added to the database. Please fix the error and retry the upload.”
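A minimal sketch of that all-or-nothing behaviour, using Python's built-in `sqlite3` instead of Django so it runs standalone. The `samples` table, the `upload` function and the negative-amount validator are assumptions for illustration, not the original code; in Django the equivalent wrapper would be `transaction.atomic()`.

```python
import sqlite3


def upload(conn: sqlite3.Connection, rows) -> str:
    """Insert all rows or none; return a user-facing status message."""
    try:
        with conn:  # commits on success, rolls back if the block raises
            for name, amount in rows:
                if amount < 0:  # hypothetical validator
                    raise ValueError(f"negative amount for {name!r}")
                conn.execute(
                    "INSERT INTO samples (name, amount) VALUES (?, ?)",
                    (name, amount),
                )
    except (ValueError, sqlite3.Error) as exc:
        return (f"Upload failed with error {exc}. No data was added to the "
                "database. Please fix the error and retry the upload.")
    return f"Uploaded {len(rows)} rows."


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (name TEXT UNIQUE, amount INTEGER)")

# The second row is invalid, so the first row must not be kept either.
print(upload(conn, [("a", 1), ("b", -2), ("c", 3)]))
print(conn.execute("SELECT COUNT(*) FROM samples").fetchone()[0])
```

Because the whole upload sits in one transaction, a failure halfway through leaves the table exactly as it was, so there is no inconsistent state to repair by hand.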

The users loved it.

This was a “mature” piece of software that had been critical for many years.

[+] MarceColl|2 years ago|reply
That ORM sounds like prisma ORM
[+] brazzy|2 years ago|reply
I've been in charge of a full rewrite of a B2B system that looks to work out pretty well so far (will soon do the full rollout for the first customer after successfully piloting with 20 users).

I think a key to the success is that the old system is so crufty that it's not, in fact, a moving target.

The old system is over 20 years old, a web application written in C. Apart from security considerations, the UI looks archaic, there is no separation between frontend and backend, and data storage is in custom-designed files - no database engine, so it's essentially impossible to add substantial new functionality. Even the most minor config change requires a recompile.

[+] deterministic|2 years ago|reply
I highly recommend reading “Working Effectively with Legacy Code” BEFORE even considering a rewrite. Isolating and rewriting parts of a legacy system step by step is almost always the right way to do it.
[+] bbojan|2 years ago|reply
I don't know. I always thought of software as a living thing. As such, it is never complete.

For example, 98% of the atoms in your body are replaced every year. Doesn't it make more sense to think of software in this way, as opposed to e.g. a car or a building? I guess even a building needs maintenance (roof is replaced, windows are replaced, ...), but it's on such long timescales that we think of it as "finished".

If you think of software as a living thing where X% of it needs to be replaced every year just to keep it where it is, then there is no need for a "big rewrite".

[+] raincole|2 years ago|reply
It's not a very serious article.
[+] mparnisari|2 years ago|reply
My alternative solution is just to quit lol.
[+] amelius|2 years ago|reply
The problem is finding a new job where you don't have to work on legacy systems.