Show HN: Visualize the entropy of a codebase with a 3D force-directed graph

180 points| gabimtme | 2 years ago |github.com

Hi HN! I'm Gabriel, the author of dep-tree (https://github.com/gabotechs/dep-tree), and I wanted to show off this tool and explain why it's being really useful at my current org for dealing with code complexity.

I work at a startup where the business evolves really fast and requirements change frequently, so it's easy to end up with big piles of code stacked together without a clear structure, especially with tight deadlines. I made dep-tree [1] to help us maintain a clean code architecture and a logical separation of concerns between parts of the application, which is accomplished by: (1) Visualizing the source files and the dependencies between them using a 3D force-directed graph; and (2) Enforcing some dependency rules that allow/forbid dependencies between different parts of the application.

The 3D force-directed graph visualization works like this:

- It takes an entrypoint to the codebase, usually the main executable file or a library's entrypoint (index.js, main.py, etc.)
- It recursively crawls import statements, gathering other source files that are being depended upon
- It creates a directed graph out of that, where nodes are source files and edges are the dependencies between them
- It renders this graph in the browser using a 3D force-directed layout, where attraction/repulsion forces will be applied to each node depending on which other nodes it is connected to.
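As a rough illustration of the crawling and graph-building steps above, here is a minimal Python sketch. It is not dep-tree's actual code: it only handles absolute Python imports, and the helper names `local_imports` and `build_graph` are made up for this example.

```python
import ast
from pathlib import Path


def local_imports(file: Path, root: Path) -> list:
    """Return the files under `root` that `file` pulls in via absolute imports."""
    tree = ast.parse(file.read_text())
    found = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            # "pkg.mod" -> <root>/pkg/mod.py; anything not found is treated as external.
            candidate = root.joinpath(*name.split(".")).with_suffix(".py")
            if candidate.is_file():
                found.append(candidate)
    return found


def build_graph(entrypoint: Path, root: Path) -> dict:
    """Crawl imports from the entrypoint: nodes are files, edges are imports."""
    graph = {}
    stack = [entrypoint]
    while stack:
        current = stack.pop()
        if current in graph:
            continue  # already visited; this also guards against import cycles
        graph[current] = local_imports(current, root)
        stack.extend(graph[current])
    return graph
```

Calling `build_graph(Path("main.py"), Path("."))` would return a dictionary mapping each reachable file to the files it imports, which is the directed graph that then gets laid out.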

With this, properly decoupled codebases will tend to form clusters of nodes, representing logical parts that live together and are clearly separated from other parts, and tightly coupled codebases will be rendered without clear clustering or without a clear structural pattern in the node placement.

Some examples of this visualization for well-known codebases are:

TypeScript: https://dep-tree-explorer.vercel.app/api?repo=https%3A%2F%2F...

React: https://dep-tree-explorer.vercel.app/api?repo=https%3A%2F%2F...

Svelte: https://dep-tree-explorer.vercel.app/api?repo=https%3A%2F%2F...

Langchain: https://dep-tree-explorer.vercel.app/api?repo=https%3A%2F%2F...

Numpy: https://dep-tree-explorer.vercel.app/api?repo=https%3A%2F%2F...

Deno: https://dep-tree-explorer.vercel.app/api?repo=https%3A%2F%2F...

The visualizations are cool, but they're just the first step. The dependency rule checking capability is what makes the tool actually useful on a daily basis and what keeps us using it every day in our CI pipelines for enforcing decoupling. More info about this feature is available in the repo: https://github.com/gabotechs/dep-tree?tab=readme-ov-file#che.... The code is fully open-source.
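To give a feel for what allow/forbid checking over a dependency graph involves, here is a small Python sketch. The rule format, the `find_violations` helper and the `src/models` / `src/services` paths are all hypothetical, chosen only for illustration; the repo documents the tool's real configuration.

```python
from fnmatch import fnmatch

# Hypothetical rule format: source glob -> globs that source must not import.
DENY_RULES = {
    "src/models/*": ["src/services/*"],
}


def find_violations(edges, deny_rules):
    """Return (source, target, rule) for every edge that breaks a deny rule."""
    violations = []
    for source, target in edges:
        for source_glob, forbidden_globs in deny_rules.items():
            if not fnmatch(source, source_glob):
                continue
            for forbidden in forbidden_globs:
                if fnmatch(target, forbidden):
                    rule = f"{source_glob} -> {forbidden}"
                    violations.append((source, target, rule))
    return violations


# Edges are (importing file, imported file) pairs taken from the dependency graph.
edges = [
    ("src/services/billing.py", "src/models/invoice.py"),  # allowed direction
    ("src/models/invoice.py", "src/services/billing.py"),  # forbidden direction
]
for source, target, rule in find_violations(edges, DENY_RULES):
    print(f"{source} must not import {target} (rule: {rule})")
```

In a CI job, a script like this would exit with a non-zero status whenever the list of violations is non-empty.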

59 comments

[+] ytjohn|2 years ago|reply

This is really cool. And as OP pointed out, I really like the pipeline integration. Like when linting catches function-level complexity, but in a cross-functional way. I prefer to think of programs in layers where the top layers can import lower layers, but never the other way (and I'm also very cautious about horizontal imports). Something like this would help track that. Unfortunately, I'd really need it to support Go. I find it interesting that the code is written in Go, but doesn't support Go. But I will watch this project.

From the visualization perspective, it reminds me a lot of Gource. Gource is a cool visualization showing contributions to a repo. You see individual contributors buzzing around updating files on per-commit and per-merge.

https://github.com/acaudwell/Gource

[+] gabimtme|2 years ago|reply

The visualization is actually inspired by Gource, but taken to the 3D space, it's a really cool project.

Golang is very challenging to implement, because dependencies between files inside a package are not explicitly declared: you can just use any function from any file without importing it as long as they both belong to the same package, so supporting Golang would probably require spawning an LSP and resolving symbols.

The reason for implementing dep-tree in Go was because things were going to get algorithmic af, and better to choose a language as simple as possible, knowing that it also needed to be performant.

[+] sam_bristow|2 years ago|reply

A tangentially related tool you can use to look at a repo over time is Git of Theseus[1]. It shows things like "what percentage of the code in this repo survives 6 months?"

[1]https://erikbern.com/2016/12/05/the-half-life-of-code.html

[+] gabimtme|2 years ago|reply

That's really interesting!

[+] Weidenwalker|2 years ago|reply

This is cool, basically the first 3D codebase visualization I've seen that doesn't immediately give me a headache, so good job! :)

Always interesting to see different ways of visualising the same thing. A while ago my friend and I also made a codebase visualisation tool ([https://www.codeatlas.dev/gallery](https://www.codeatlas.dev...), but instead of taking the graph route, we opted for Voronoi treemaps in 2D! It's a tradeoff between form and function for sure, modelling code as a DAG is definitely more powerful for static analysis. However, in most graph-based visualizations (this, gource) I just find myself getting lost super quickly, because the shapes are just not very recognisable.

Really impressed by how polished this already is, nice docs, on-the-fly rendering, congrats!

If I ever find time to work on codebase visualisation again, I might have to steal the idea of codebase entropy to better layout which files to place close to which others!

[+] Weidenwalker|2 years ago|reply

Ooops, I should take more care pasting links from markdown, this one works: https://codeatlas.dev/gallery

[+] daxfohl|2 years ago|reply

I've always felt like instead of public, private, protected, there should be something like security groups and acls on classes and functions. That way it's very explicit when you are newly coupling things, and brings tighter scrutiny to those changes.

Edit: oh, looking at the docs, apparently that's exactly what this tool does. Though it would be nice to have function level granularity. Maybe by annotating the code itself.

[+] sam_bristow|2 years ago|reply

Build systems like Bazel provide mechanisms for controlling access at the module-level. If you're disciplined about not just making everything "public" it can be really powerful. Bazel is a very big hammer though and might be overkill for your projects.

[+] contravariant|2 years ago|reply

Is this just using the word 'entropy' as a stand-in for complexity or is there some actual definition of entropy involved?

[+] gabimtme|2 years ago|reply

Nah, nothing like that, "entropy" in the colloquial meaning of level of disorder, it has proven to be a useful word for people to understand what it is about, even though it's strictly incorrect.

[+] a1o|2 years ago|reply

It would be nice if C++ was supported. A lot of large legacy codebases written in C++ would be interesting to visualize.

[+] sideshowb|2 years ago|reply

Would it work to support doxygen import thereby getting several major languages at once?

[+] gabimtme|2 years ago|reply

Definitely, that and Java sound like two very good candidates.

[+] SushiHippie|2 years ago|reply

Could it be that this can't check absolute imports? My Python project has many files which depend on each other, but they are not linked together in the generated graph. But one of my modules has an __init__.py with relative imports, and this shows links between the files imported in the __init__.py.

Let's say my project looks like this:

src/example/foo.py

src/example/bar.py

And if bar.py contains the statement "from example.foo import Foo" there is no link between the files foo and bar. Though, if the statement is "from .foo import Foo" it shows a link.

[+] gabimtme|2 years ago|reply

That's because dep-tree doesn't know it needs to resolve names starting from `src/`, as your imports have that piece of information trimmed. You can solve this by setting the PYTHONPATH env variable like this:

export PYTHONPATH=src
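To illustrate why the extra search root matters, here is a hedged Python sketch of how a resolver might map an absolute import to a file. It is not dep-tree's implementation; `resolve_module` and `search_roots_from_env` are names invented for this example.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_module(module: str, search_roots: list) -> Optional[Path]:
    """Map a dotted module name to a file under the first root that contains it."""
    relative = Path(*module.split("."))
    for root in search_roots:
        candidates = (root / relative.with_suffix(".py"), root / relative / "__init__.py")
        for candidate in candidates:
            if candidate.is_file():
                return candidate
    return None  # unresolved: treated as external, so no edge is drawn


def search_roots_from_env(project_dir: Path) -> list:
    """Project directory first, then every entry listed in PYTHONPATH."""
    extra = [p for p in os.environ.get("PYTHONPATH", "").split(os.pathsep) if p]
    return [project_dir] + [project_dir / p for p in extra]
```

With only the project directory as a root, "example.foo" is looked up as `example/foo.py` and not found. Once `src` is added as a root, the same name resolves to `src/example/foo.py`.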

[+] Already__Taken|2 years ago|reply

It's cool, but it's half the battle. To keep an eye on decoupling you need to map where the state goes. For web, what parts of the code are using fetch / XMLHttpRequest? Using the URL & params, history? What's using local storage, etc.? It should be able to identify those browser APIs and draw them out like a dep link too. I just had to fix a component that was directly editing URL parameters instead of the store which updated the URL.

[+] MilStdJunkie|2 years ago|reply

This is gonna sound weird as hell, but I would really dig implementing this on a doc repo with CCS (component content), where you re-use document modules[1]. Why do I care? Because some modules support way too much complexity, and entropy is a pretty good measurement of that.

[1] Asciidoc/RsT (include directive for both), XML (DITA/S1000D/DocBook/etc, each with different transclude mechanisms), any markup that supports transclusion.

[+] palmfacehn|2 years ago|reply

I was recently working with a collection of Rust libraries with poor dependency management. Some dependencies wouldn't compile for certain platforms. In most cases these features were totally unnecessary for my usage.

Would love to see a tool that could automatically break these dependencies into optional features within their crate. It felt like a poor use of my time to track everything down manually.

[+] TN1ck|2 years ago|reply

Rich Hickey has a nice talk about this exact problem. He uses this scenario to explain why “classical” dependency management is flawed: you might only want one function of a library that has no dependencies itself, but you have to import the whole thing.

https://youtu.be/oyLBGkS5ICk?si=cawjnPnR9riEyvf2

[+] sideshowb|2 years ago|reply

Very pretty!

Out of interest, I'm thinking how this sort of method works if you ignore the semi-arbitrary distinction between your own code and other libraries. If, say, an array class is used everywhere, wouldn't that look like a bad pattern on the dependency graph? Or is there a way to read the graph that tells you that your pervasive use of np.array is still appropriately decoupled?

[+] gabimtme|2 years ago|reply

That's taken into account while rendering the graph. The attraction force between two nodes is inversely proportional to the number of edges a node has.

If a node is depended upon a lot, all the resulting edges induce weaker forces to adjacent nodes, so this accounts for the fact that some files will be depended upon a lot, and that's fine.

There's also the option to just exclude that kind of files from the analysis with the --exclude flag. I've found that to be useful for massive auto-generated files.
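As a sketch of the degree-based weighting described above, assuming the attraction along an edge is divided by the degree of its busier endpoint (the exact formula dep-tree uses may differ, and `edge_strengths` is a made-up helper):

```python
from collections import Counter


def edge_strengths(edges, base=1.0):
    """Return {edge: attraction strength}, weaker for edges touching hub nodes."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    # Assumption: divide by the degree of the busier endpoint of each edge.
    return {(s, t): base / max(degree[s], degree[t]) for s, t in edges}


edges = [
    ("a.py", "utils.py"),
    ("b.py", "utils.py"),
    ("c.py", "utils.py"),
    ("d.py", "utils.py"),
    ("a.py", "b.py"),
]
strengths = edge_strengths(edges)
```

Here `utils.py` has four edges, so each edge touching it gets a strength of 0.25, while the edge between `a.py` and `b.py` (two edges each) gets 0.5. A widely used file therefore pulls on its dependents less than a direct, focused dependency does.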

[+] christkv|2 years ago|reply

A friend of mine developed a tool chain with coworkers to try to systematically improve code quality on a big Java project in its day: https://xradar.sourceforge.net/. Some of the ideas might be useful for you. I think there is also a link somewhere to the paper they wrote.

[+] leetrout|2 years ago|reply

Off topic but...

> I work at a startup where business evolves really fast, and requirements change frequently, so it's easy to end up with big piles of code stacked together without a clear structure, specially with tight deadlines

That smells.

It sounds like the team could benefit from better stack technologies and a bit more discipline in how it is applied to solutioning.

> Enforcing some dependency rules that allow/forbid dependencies between different parts of the application.

What is the alternative to this tool that lowers the cognitive barrier / builds the right muscles for the team to understand what they should / shouldn't depend on?

[+] gabimtme|2 years ago|reply

> It sounds like the team could benefit from better stack technologies and a bit more discipline in how it is applied to solutioning.

For our specific case it's actually pretty good, we've built a lot of discipline around maintainability, but in general this is a recurring problem in tech teams who might not be able to afford the time it takes to gain discipline.

> What is the alternative to this tool that lowers the cognitive barrier / builds the right muscles for the team to understand what they should / shouldn't depend on?

Some programming languages allow you to split the codebase into modular units (npm workspaces, cargo workspaces, etc..) which forces developers to modularize things, and dependencies between modules need to be explicitly declared.

This is good, but usually not enough, as nothing prevents you from messing things up within a module/workspace.

There's some other tooling with similar functionality to dep-tree, but language-specific and with visualizations not suitable for large codebases (.dot files, 2d svgs...)

[+] nyrikki|2 years ago|reply

Stack technologies tend to bound contexts based on technologies and not on domain boundaries.

This is why we see all these products targeted at companies with 24 microservices with 26 developers who have to run end to end testing on everything.

Architectural erosion is primarily a cultural issue and any tool that helps people discover and call out architectural violations is potentially useful.

Many companies can't just do the inverse Conway law, and if you look at the state of devops report, note how they call out CAB forums and controls being problematic for even high performing companies to become elite.

Take this product as an example: it really just means you want to keep k8s but have given up on loose coupling and high cohesion.

https://www.signadot.com/blog/how-uber-and-doordash-enable-d...

Throwing products at structure problems typically doesn't work.

[+] crucialfelix|2 years ago|reply

It's extremely common to get things twisted up. Even if there is a good tech lead, that person may not be good at writing documentation, may be too busy writing code, and may not yet have a plan for how to keep things organized.

Maintaining a code base requires communication, PR reviews and discipline. That doesn't always happen.

Having lint check rules is brilliant. Never mind discipline, you just need a friendly error to say don't import services into an ORM model file. I'm going to adopt this right away.

[+] gjgtcbkj|2 years ago|reply

There’s such a weird vein of do-nothingness that runs through this comment's attitude. Yeah, of course it’s easy to pick dependencies when you don’t worry about deadlines. A programmer without a deadline is like a fisherman going to the grocery store to buy fish and claiming it’s “best practices”: better results, but what was the point?

[+] rikroots|2 years ago|reply

I "think" I understand what I'm looking at - it's like a 3d dependency tree with added flow of exports -> imports? It certainly looks very pretty![1]

One piece of feedback, if I may. It's really difficult to read the blue labels against the black background. Is there any way to change the palette colors?

[1] https://dep-tree-explorer.vercel.app/api?repo=https%3A%2F%2F...

[+] gabimtme|2 years ago|reply

Well, that's one of the drawbacks of the smart color auto generation... it's not that smart.

That's definitely an improvement point. I have just calibrated things looking at my screen, which might have a high saturation/brightness setting.

Thanks for the feedback!

[+] _ZeD_|2 years ago|reply

This reminds me of doxygen diagrams - https://doxygen.nl/manual/diagrams.html

[+] gabimtme|2 years ago|reply

It's a similar idea, but I often find myself very lost on 2d drawings if the codebase reaches a certain size

[+] airstrike|2 years ago|reply

Pretty cool -- sadly I think this doesn't catch custom `imports` patterns in my package.json[0] so my graph is incomplete

___

0. https://nodejs.org/api/packages.html#subpath-patterns

[+] gabimtme|2 years ago|reply

Yeah, unfortunately custom imports are only implemented if declared in the tsconfig.json as path overrides, but definitely something that should be looked at

[+] enoch2090|2 years ago|reply

This is really cool! We are currently developing a project with heavy C++ and maybe a few Python scripts & wrappers, and we are planning a major refactor. Is it possible to adopt this with a C++ codebase?

[+] gabimtme|2 years ago|reply

Right now it only supports JavaScript, TypeScript, Python and Rust, but it's designed to be extended with any other language. Each language implementation is just some hundreds of lines of code, so it's "easy" to add new ones, I think C/C++ and Java/Kotlin are good candidates that would be very easy to implement.

[+] matheusmoreira|2 years ago|reply

Just tried it with my C project. Entry point extension is not supported. :(

[+] compacct27|2 years ago|reply

Love it, I think dependency trees are super underused data for static analysis.

The visualization here is amazing in its own right as well. Can I ask what part of the codebase renders it and handles the force-directed part?

[+] gabimtme|2 years ago|reply

The portion of the code in charge of rendering lives inside the `internal/entropy` directory (https://github.com/gabotechs/dep-tree/tree/main/internal/ent...).

Force-directed is an algorithm for displaying graphs in a 2d or 3d space, which simulates attraction/repulsion based on the dependencies between the nodes, the wikipedia page explains it really well https://en.wikipedia.org/wiki/Force-directed_graph_drawing
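For anyone who wants to see the idea in code, here is a minimal, unoptimized Python sketch of a 3D force-directed layout. It is not the code in `internal/entropy`. The `force_layout` function and all its parameter values are assumptions chosen for illustration: every pair of nodes repels with an inverse-square force, and each edge pulls its two endpoints together like a spring.

```python
import math
import random


def force_layout(nodes, edges, steps=200, repulsion=1.0, spring=0.05,
                 step_size=0.1, max_step=0.5, seed=0):
    """Return {node: [x, y, z]} after a fixed number of simulation steps."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1, 1) for _ in range(3)] for n in nodes}
    for _ in range(steps):
        force = {n: [0.0, 0.0, 0.0] for n in nodes}
        # Every pair of nodes repels with magnitude repulsion / distance^2.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                delta = [pa - pb for pa, pb in zip(pos[a], pos[b])]
                dist = max(math.sqrt(sum(d * d for d in delta)), 1e-6)
                for k in range(3):
                    push = repulsion * delta[k] / dist ** 3
                    force[a][k] += push
                    force[b][k] -= push
        # Connected nodes attract like a spring (pull grows with distance).
        for a, b in edges:
            for k in range(3):
                pull = spring * (pos[b][k] - pos[a][k])
                force[a][k] += pull
                force[b][k] -= pull
        # Move each node along its net force, clamping each coordinate's move.
        for n in nodes:
            for k in range(3):
                move = step_size * force[n][k]
                pos[n][k] += max(-max_step, min(max_step, move))
    return pos


layout = force_layout(["a", "b", "c"], [("a", "b")])
```

In this toy run, `a` and `b` settle at a distance where the spring and the repulsion balance, while the unconnected `c` keeps drifting away. The same mechanism makes files that depend on each other cluster together in the rendered graph.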

> Love it, I think dependency trees are super underused data for static analysis.

Definitely, especially for evaluating "the big picture" of a codebase

[+] DenisM|2 years ago|reply

I could use something like this for large Java projects.

[+] gabimtme|2 years ago|reply

Java is one of the top candidates for being implemented next actually

[+] jongjong|2 years ago|reply

Great tool!

React's graph looks like a mess. Why am I not surprised...

[+] graphviz|2 years ago|reply

This is nice work on graph visualization, but we learned years ago that readable network visualization does not necessarily mean good software architecture. For example, a good drawing of a tree may be easy to read and even beautiful, but may reflect an underlying design with no re-use or modularity. A graph of relationships between functions in an abstract machine may look very complicated, but that doesn't mean the design is poor.

Graphs are wonderful abstractions for the structures that arise in many kinds of engineering, but you need to focus on understanding those abstractions, not just pictures rendered by heuristics. Visualization can be wonderful, but has its limitations, especially when used out of the box.