item 35755841

Sensenmann: Code Deletion at Scale

308 points | gslin | 2 years ago | testing.googleblog.com

179 comments

[+] kgeist|2 years ago|reply
The codebase of an internal system our team inherited is pretty old (14 years) and large, and there's a feeling that a lot of it is unused (dead code or unused ancient features) and can be deleted. Fortunately, it's a monolith, and all that matters is the HTTP endpoints defined in a few config files. So we made a script which downloads all access logs from the live server for the last 3 months and tries to find endpoints with zero or very low usage (it did find quite a few). I'm planning the second step to be a script which builds a dependency graph and finds everything reachable only from unused endpoints. I wanted this to be a one-time endeavor, but this blog post made me think it should be run on a regular basis.
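A minimal sketch of that first step (all names are hypothetical, and the regex assumes a common/combined-format access log; adjust it to your server's format): tally hits per endpoint and flag the ones under a threshold.

```python
import re
from collections import Counter

# Matches the request line of a common/combined-format access log entry,
# e.g.: 1.2.3.4 - - [10/Oct/2023:13:55:36 +0000] "GET /api/users?page=2 HTTP/1.1" 200 ...
REQUEST_RE = re.compile(r'"(?:GET|POST|PUT|PATCH|DELETE|HEAD) ([^ ?"]+)')

def endpoint_hits(log_lines):
    """Count requests per path, ignoring query strings."""
    hits = Counter()
    for line in log_lines:
        m = REQUEST_RE.search(line)
        if m:
            hits[m.group(1)] += 1
    return hits

def deletion_candidates(hits, known_endpoints, threshold=10):
    """Endpoints from the config files with zero or very low traffic."""
    return sorted(ep for ep in known_endpoints if hits.get(ep, 0) < threshold)
```

Comparing against the full list of endpoints from the config files (rather than just the log) is what catches the zero-traffic ones, which never appear in the log at all.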
[+] kortilla|2 years ago|reply
Be very careful using this methodology. This is how you end up deleting stuff only called in rare but critical cases:

- during outage

- at end of year

- during audits

- when the one special customer that paid a fortune for an obscure feature decides to use it

It might be the case that none of this applies to your application, but it’s why “check what got hit recently” (3 months is an eye blink in the business and govt world) is dangerous.

[+] jiggawatts|2 years ago|reply
I wonder if APM performance traces could be used to determine unused code paths.

A lot of these tools sample stack traces at intervals, collecting statistics of how often functions are used. If you had enough samples, then with very high confidence you could find code paths that are never used in production.

[+] eitland|2 years ago|reply
Happy to see someone applying some science to our field instead of just going with a hunch.

That said, as kortilla already mentioned, in most cases 3 months is too short a time to get a complete answer.

IIRC "Principles of Network and System Administration" by Mark Burgess has more about this, but intuitively you have to look for the longest cycles that exist. There are also, again as mentioned by kortilla, events that don't happen on a schedule.

One thing I try to do when I come across code that I have to research is to write down in the documentation for the function (Javadoc or similar) why it exists and the conditions under which it can be deleted.

You can also try to add some instrumentation code that notifies you when something is called. That way you can end up like me: every New Year's Day I get a weird SMS at 14:00 or 14:01 from a system that no one can find, but at least I know that some system I can no longer remember still runs somewhere :-)
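A minimal version of that instrumentation in Python (the notify callback is a placeholder for whatever channel you use: SMS, email, a metrics counter):

```python
import functools

def report_first_call(notify):
    """Decorator: fire `notify` the first time the wrapped function runs,
    so 'presumed dead' code announces itself when it turns out it isn't."""
    def decorator(func):
        called = False
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            nonlocal called
            if not called:
                called = True
                notify(f"{func.__module__}.{func.__qualname__} is still alive")
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

Stick `@report_first_call(...)` on a suspect function, deploy, and wait out at least one full business cycle before concluding anything from the silence.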

[+] ly3xqhl8g9|2 years ago|reply
Seems like a good methodology. One caveat: deletion should start coeff * 14 years from the time the watch starts, where coeff is a project complexity coefficient. If the watcher doesn't see a part of the code called in another coeff * 14 years, then it's probably safe to delete it.
[+] lifeisstillgood|2 years ago|reply
There is a meta-meta situation surrounding really good software management.

You can knock up some code that, say, solves a specific business problem right now. (meta:0)

But you need an environment that can take a new piece of code and deploy it and test it (meta:1)

How is that code running? This is shading from production monitoring into QA and performance. (meta:2)

Compare all the running code and its performance against the benefits of replacing code or going back to level 0 and just fixing a business problem (meta:3)

Then this death eater - meta:4, I think.

And to me this is why comments like "software needs to solve business problems" are naive - once you start using software you need more software to manage the software. It's going to grow till it consumes the business.

[+] groestl|2 years ago|reply
> once you start using software you need more software to manage the software - it's going to grow till it consumes the business.

You can replace "software" with "people" and you'll end up with a sentence that's equally true.

[+] valenterry|2 years ago|reply
True. Which is why it's important not to measure the ROI of a feature only by the time and resources you spend on it vs. how much it increases your revenue.

You also have to include that every further feature is now _more_ expensive than before.

[+] croes|2 years ago|reply
FYI: Sensenmann is the German word for Grim Reaper.
[+] sltkr|2 years ago|reply
And the literal meaning is “Scythe Man” (which, as you can probably guess at this point, is a compound noun of “Sense” for scythe and “Mann” for man. Both the German Sense and the English scythe derive from the Proto-Germanic sagu, meaning to cut or saw.)
[+] Zetobal|2 years ago|reply
I still don't get the tech industry's fascination with random German words - at least here it's sort of fitting.
[+] pabs3|2 years ago|reply
They should also look at deleting third_party directories and replacing them with package manager dependencies. For example Qt WebEngine embeds Chromium, which embeds Blink, which embeds wpt, which embeds a bunch of Python libraries and one of those Python libraries (six) is embedded in three other locations within Chromium.

https://sources.debian.org/src/qtwebengine-opensource-src/5.... https://codesearch.debian.net/search?q=package%3Aqtwebengine....

I wonder how many copies of the Python six module the Google monorepo contains.

[+] habosa|2 years ago|reply
I’d assume one. The “one version policy” is one of the strictest codebase rules at Google.

But the Chrome codebase is generally the giant exception to everything due to the major open source components (Android too).

[+] raverbashing|2 years ago|reply
And while six is 'kinda' well behaved, trying to unify dependencies of dependencies is usually an exercise in frustration. Code repetition is cheap.
[+] secondcoming|2 years ago|reply
This is a good plan, but I've often found that with, for example, Debian, the official packages are well behind the times compared to the direct third-party code.

Ubuntu is not too bad.

[+] jmyeet|2 years ago|reply
The most important part of this is that the build units are hermetic and all dependencies are explicit. This is why you need to use something like Bazel/Blaze vs older build systems like make where identifying what's used, particularly when you get into meta-rules, becomes all but impossible.

As the article points out, you also have to look at what's actually run. This is the real advantage of Google infrastructure: the vertical integration means that if a binary is run on Borg, or even on the command line, that can be tracked.

[+] DamonHD|2 years ago|reply
I worry about archival and enough history for diagnosing long-standing subtle issues that take a long time to surface as bugs. This is not theoretical: apparently a TeX bug picked up after many years had been there from the start.
[+] kragen|2 years ago|reply
i think piper saves the full history of the whole monorepo; if that's correct it's not 'deletion' in that sense
[+] mkoubaa|2 years ago|reply
Sounds like a useful system from which almost nothing is usable outside of Google.
[+] quickthrower2|2 years ago|reply
Code deletion is much easier when you have a typed language with no metaprogramming. I guess Go and C code is easier to tree-shake like that than, say, Python or JS.
[+] meindnoch|2 years ago|reply
It's not that simple. A lot of dead code doesn't look dead to static analysis. E.g. gated by long-forgotten feature flags, or platform checks that are no longer relevant.
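A contrived Python illustration of the point (all names hypothetical): to static analysis the legacy branch below is live, since it is referenced and type-checks, yet if the flag has been off for everyone for years, it is dead in practice.

```python
# Long-forgotten flag, permanently off since some old migration.
FLAGS = {"use_legacy_checkout": False}

def checkout(cart):
    if FLAGS["use_legacy_checkout"]:
        # Statically reachable, dynamically dead: only runtime data
        # (the flag value) decides, so tree-shaking cannot remove it.
        return handle_legacy_checkout(cart)
    return handle_checkout_v2(cart)

def handle_legacy_checkout(cart):
    return {"total": sum(cart), "engine": "legacy"}

def handle_checkout_v2(cart):
    return {"total": sum(cart), "engine": "v2"}
```

This is why build-graph analysis usually has to be combined with runtime signals (coverage, sampling) before deletion is proposed.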
[+] falcor84|2 years ago|reply
>For example, if an engineer is unsure how to use a library, they can find examples just by searching

Isn't that the case with all libraries? How does the monorepo help here?

[+] er4hn|2 years ago|reply
Discoverability. It's a lot easier to search one repo than it is to search a set of repos. For the latter, you need to have all the repos listed somewhere and have them be accessible.
[+] kpw94|2 years ago|reply
"searching" how a method is used is as simple as clicking on the symbol.

Think Visual Studio "find all references", but working around the entire company's codebase, not just your current project.

[+] speedgoose|2 years ago|reply
I’m not sure. I also thought that these big repos have to use sparse checkouts to avoid using too much space on developers' machines. So you would have to use an external code search index anyway.
[+] oneplane|2 years ago|reply
On a non-Google level, just being aware of code sitting around costing resources is pretty important. Often, tests and maintenance are simply ignored or not counted as a cost (be it time, money, effort, or otherwise). It's almost in the same realm as "I don't know why it works", which is as dangerous as "I don't know why it doesn't work".
[+] jawns|2 years ago|reply
The most difficult part about code deletion is practicing the Chesterton's Fence principle:

> In the matter of reforming things, as distinct from deforming them, there is one plain and simple principle; a principle which will probably be called a paradox. There exists in such a case a certain institution or law; let us say, for the sake of simplicity, a fence or gate erected across a road. The more modern type of reformer goes gaily up to it and says, “I don’t see the use of this; let us clear it away.” To which the more intelligent type of reformer will do well to answer: “If you don’t see the use of it, I certainly won’t let you clear it away. Go away and think. Then, when you can come back and tell me that you do see the use of it, I may allow you to destroy it.”

https://wiki.lesswrong.com/wiki/Chesterton%27s_Fence

While this tool certainly does the job of proposing code deletions, that's the easier part. The harder part is knowing why the code exists in the first place, which is necessary to know whether it's truly a good idea to remove it. Google, smartly, is leaving that part up to a human (for now).

[+] UncleMeat|2 years ago|reply
"This code has been dead for six months" is a very good heuristic that the code is not relevant. I do occasionally reject the sensenmann CLs, but only very very rarely. This isn't weird code that nobody knows why it exists but it is currently doing something. This is code that cannot execute.
[+] dekhn|2 years ago|reply
Google's response to Chesterton's Fence is: "if you liked it, then put a test on it".

I used to update the internal version of numpy for Google and if people asked me to rollback after I made my update (having fixed all the test failures I could detect), and they didn't have a test, well, that's their problem. The one situation where that rule wouldn't apply is if I somehow managed to break production and we needed to do an emergency rollback.

I shed a tear when some of my old, unused code was autodeleted at Google, but nowadays my attitude is: HEAD of your version control should only contain things which are absolutely necessary from a functional selection perspective.

[+] opportune|2 years ago|reply
I don’t think you understand Sensenmann fully based on this post. At Google basically everything in use has a Bazel-like build target. This means the codebase is effectively a directed “forest”/tree-like data structure with recognizable sources and sinks. If you can trace through the tree and find included-but-not-used code by analyzing build targets, you can safely delete it. There are even systems (though not covering everything) that sample binaries’ function usage, which you could double-check against.

> why the code exists in the first place

If the code is unreachable, it’s at best a “possibly will be used in the future” and most likely simply something that was used but not deleted when its last use was removed (or a YAGNI liability).

If you can find a piece of code that is included in build targets but unreachable in all of them, it’s typically safe to delete. And it’s generally not done without permission: automation will send the change to a team member to double-check that it’s OK to delete and that nobody is going to start using it soon.
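The included-but-unreachable analysis described above can be sketched as plain graph reachability (a simplification of what Bazel-style build graphs enable; target names are made up):

```python
from collections import deque

def unreachable_targets(deps, roots):
    """Given a build graph as {target: [dependencies]} and the roots that
    are actually deployed or run, return the targets no root can reach --
    the deletion candidates a Sensenmann-style tool would propose."""
    reachable = set()
    queue = deque(roots)
    while queue:
        target = queue.popleft()
        if target in reachable:
            continue
        reachable.add(target)
        queue.extend(deps.get(target, ()))
    return sorted(set(deps) - reachable)
```

The hard part in practice isn't the traversal; it's knowing the true set of roots, which is why the runtime signals mentioned elsewhere in the thread matter.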

[+] anonymousiam|2 years ago|reply
It should also be clear that this article is about "deleting" code from an active project, not about "deleting" it entirely from the version control system. Thus, any code "deleted" through the described process could still easily be restored if necessary.
[+] breck|2 years ago|reply
As a counter to Chesterton's Fence: sometimes the fastest way to understand what something does is to remove it and see what complains. You might get only 1 complainer for every 10 fences you take down. Putting that one fence back up takes much longer than taking it down, but the time saved from removing the other 9 unnecessary ones makes it a net win. And this time you can add Documentation to the rebuilt fence.
[+] einpoklum|2 years ago|reply
The article is about the removal of _dead_ code. So, not a "fence across the road" - it's a fence that was moved to the side of the road, already cleared. The question is just whether to dismantle the fence or keep it there just in case.
[+] proper_elb|2 years ago|reply
You raise a good point, and I would answer it with agree and disagree:

Agree: Yes, you are correct, merely observing that a code path was never executed in the last 6 months is not the same as understanding why the code path was created in the first place. There is the quite real possibility of an infrequent event that appears just once every two years or so (of course, this should also be documented somewhere!).

Disagree: Pragmatically, we have an answer if the code path was not executed after 6 months of use in production and test: we know that, with very high probability, the code path was created either by mistake (human factor) or intentionally for some behavior that is no longer expected from our software. To continue the fence metaphor for Sensenmann: after 6 months, we know two things about the fence. 1) It plays no role in keeping out the stuff we want out (that was all done by other fences, the ones that had contact with an animal at least once). 2) It might have been built to keep out flying elephants or whatever, but no such being was observed in the last 6 months (at least the fence made no contact with it, which it should have!) and probably went away.

That said, having a human in the loop is probably a good idea.

[+] specialist|2 years ago|reply
Conducting an inventory (of services) would be cool too.

If a chunk of code isn't actually deployed somewhere, mark it as a candidate for culling.

Probably requires some kind of metadata provenance for deployed artifacts.

For Java, I thought Maven had a stock manifest.mf entry for the source repo. Alas, a quick search only reveals that the archiver plugin has an entry for the project's "url".

https://maven.apache.org/shared-archives/maven-archiver-2.5/...

Which for in-house projects is probably sufficient.

[+] elesbao|2 years ago|reply
I'm generally not fond of monorepos, but reading this and right away being targeted by this tweet https://twitter.com/nikitabier/status/1652764613962760196 made me think about how much time goes into decisions that won't impact users directly, and how hard it becomes to justify that at regular companies in an environment where CEOs are jumping on the layoff bandwagon for no reason.
[+] daxfohl|2 years ago|reply
"deleting code" should have been included in the "two hardest things in computer science".
[+] joebiden2|2 years ago|reply
Sincere question: what is interesting or novel about this? Is it just the scale, or did I miss some subtle aspect?

This is more (or less?) the same as industry best practices, just scaled up. There is a challenge in scaling up, as there is more potential for someone to mess it up. But it's the same technique.

So what am I missing?

[+] whirlwin|2 years ago|reply
Very interesting. At our company we have a spring cleaning event next week, and one of the goals is to delete old code. Some of the comments mention good techniques. Anyone experienced with deeper analysis, e.g. BPF, in this field?
[+] terom|2 years ago|reply
It's about time someone gets a KPI for lines of code removed :)