Ask HN: Why are Git submodules so bad?

213 points| gavinhoward | 3 years ago | reply

I have been a git user for a long time, but I've never used Subversion or any other VCS more than a little.

I also hardly use Git submodules, but when I do, I don't struggle.

Yet people talk about Git submodules as though they are really hard. I presume I'm just not using them as much as other people, or that my use case for them happens to be on their happy path.

So why are Git submodules so bad?

167 comments

[+] armchairhacker|3 years ago|reply
Git submodules are fine and can be really useful, but they are really hard. I've run into problems like:

1. Git clone not cloning submodules. You need `git submodule update --init` or `git clone --recursive`, I think

2. Git submodules being out-of-sync because I forgot to pull them specifically. I'm pretty sure `git submodule update` doesn't always fix this, though maybe that only happens in case 3)

3. Git diff returns something even after I commit, because the submodule has a change. I have to go into the submodule, and either commit / push that as well or revert it. Basically every operation I do on the git main I need to also do on the submodule if I modified files in both

4. Fixing merge conflicts and using git in one repo is already hard enough. The team I was working on kept having issues with using the wrong submodule commit, not having the same commit / push requirements on submodules, etc.

All of these can be fixed by tools and smart techniques like putting `git submodule update` in the makefile. Git submodules aren't "bad" and honestly they're an essential feature of git. But they are a struggle, and lots of people use monorepos instead (which have their own problems...).
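As a rough sketch of the kind of tooling mentioned above, the snippet below parses the output of `git submodule status`, whose documented prefixes are `-` (not initialized), `+` (checked-out commit differs from the one the parent records) and `U` (merge conflicts). The function names and the sample output are made up for illustration, and paths containing spaces are not handled.

```python
from dataclasses import dataclass
from typing import List

# Prefix characters documented for `git submodule status`.
STATE_BY_PREFIX = {
    "-": "uninitialized",   # submodule has not been initialized
    "+": "out-of-sync",     # checked-out commit differs from the recorded one
    "U": "merge-conflict",  # submodule has merge conflicts
    " ": "ok",              # checked-out commit matches the recorded one
}


@dataclass
class SubmoduleStatus:
    path: str
    commit: str
    state: str


def parse_submodule_status(output: str) -> List[SubmoduleStatus]:
    """Parse raw `git submodule status` output (do not strip it first:
    the leading character of each line carries the state)."""
    results = []
    for line in output.splitlines():
        if not line.strip():
            continue
        prefix, rest = line[0], line[1:]
        fields = rest.split()
        commit, path = fields[0], fields[1]
        state = STATE_BY_PREFIX.get(prefix, "unknown")
        results.append(SubmoduleStatus(path, commit, state))
    return results


def needs_update(statuses: List[SubmoduleStatus]) -> List[str]:
    """Return the paths of submodules that are not in the 'ok' state."""
    return [s.path for s in statuses if s.state != "ok"]


# Hypothetical sample output, not taken from a real repository.
SAMPLE = (
    "-1111111111111111111111111111111111111111 vendor/libfoo\n"
    "+2222222222222222222222222222222222222222 vendor/libbar (v1.2.0-3-g2222222)\n"
    " 3333333333333333333333333333333333333333 vendor/libbaz (v2.0.0)\n"
)

if __name__ == "__main__":
    print(needs_update(parse_submodule_status(SAMPLE)))
```

A build script could run `git submodule update --init --recursive` whenever `needs_update` returns a non-empty list, which is roughly what the makefile trick amounts to.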

[+] fanf2|3 years ago|reply
Switching branches in a repository with submodules is a huge pain, especially if (like the Ansible repo) some branches have the subdirectory in the same repo like normal, and some branches have the same subdirectory in a submodule.
[+] lelandfe|3 years ago|reply
Agreed with all of this!

In git parlance, the submodule porcelain is hard to use (but the plumbing is good)

[+] dtgriscom|3 years ago|reply
`git submodule update --init --recursive` is the magic phrase.

And, yes: submodules are really useful, as well as a PIA.

[+] esperent|3 years ago|reply
> Git submodules are fine and can be really useful, but they are really hard

If an important software tool is hard to use to the point that most people avoid it, then it's not fine. It's broken.

[+] fburnaby|3 years ago|reply
I agree with all of this. Submodules aren't easy but they perform a useful job. It's hard to see how they could be made significantly easier. Where else in software is dependency management easy and convenient?
[+] andi999|3 years ago|reply
What tools fix this?
[+] Phurist|3 years ago|reply
That sounds like a problem that exists between the chair and the keyboard.
[+] d_watt|3 years ago|reply
Git submodules aren't bad in the sense of being buggy; they do what the documentation suggests.

I think they're difficult to use because they break my mental model of how I expect a repository to work. They create a new level of abstraction for figuring out how everything is related, and what commands you need to keep things in sync (as opposed to just a normal pull/branch/push flow). That adds a whole new layer to the way your VCS works that the consumer needs to understand.

The two alternatives are

1. Have a bunch of repositories with an understanding of what an expected file structure is, ala ./projects/repo_1, ./projects/repo_2. You have a master repo with a readme instructing people on how to set it up. In theory, there's a disadvantage here in that it puts more work on the end user to manually set that up, but the advantage is there's a simpler understanding of how everything works together.

2. A mono repo. If you absolutely want all of the files to be linked together in a large repo, why not just put them in the same repo, rather than forking everything out across many repos. You lose a little flexibility in being able to mix and match branches, but nothing a good cherry-pick when needed can't fix.

Either of these strategies solves the same problem sub-modules are usually used to solve, without creating a more burdensome mental model, in my opinion. So the question becomes why use them and add more to understand, if there are simpler patterns to use instead.

[+] aslilac|3 years ago|reply
You completely missed the problem that submodules are actually supposed to solve though. Using them for either of those cases would almost definitely be the wrong choice.

What they're really for, is vendoring someone else's code into yours. They're still not great even at that, but sometimes they're the best option.

[+] dswilkerson|3 years ago|reply
"Have a bunch of repositories with an understanding of what an expected file structure is, ala ./projects/repo_1, ./projects/repo_2. You have a master repo with a readme instructing people on how to set it up. In theory, there's a disadvantage here in that it puts more work on the end user to manually set that up, but the advantage is there's a simpler understanding of how everything works together."

This is what I do. I have something like 17 code repos organized this way, plus lots of testing repos, plus an extra "hub" repo. (Credit to a friend for calling this repo "hub": short, to the point, requires no explanation.) The hub repo is a bunch of scripts and makefiles that configure everything and even clone the rest of the repos for me. It also has special grep and find scripts that will run on all of the repos as their target. The hub repo just needs one env var to tell it where the root of all the repos is. Note that in the file system the hub repo is under the root and a sibling of the code repos, not their parent in the file system.

Each code or test repo has an "externs" subdir populated only with softlinks to the other repos on which it depends. The scripts configure this by default, but it is also straightforward to configure by hand if you want to do something non-typical. For example, if you want to have multiple versions of a repo checked out on, say different branches/commits, you can do that and name each directory with a suffix of the branch/commit. Then the client repos can just point at the one they want. You can have all kinds of different configurations set up at any time. Doing this makes it straightforward to know what you depend on just by looking at the softlinks. There is no confusion at any time.

There are ways of configuring the system that do not even need all of the repos, so this is ideal. Using the hub repo makefile I can clone the whole system with one make target (after cloning hub), I can build the whole system with one target, I can test the whole system with one target. It is a testament to how well it works that I don't even know exactly how many repos I have. In short, it works great.

[+] Dm_Linov|3 years ago|reply
There's actually a third alternative, called Git X-Modules (https://gitmodules.com). It's a tool to relieve the PIA submodules are causing, as described in my comment above :-) In short, it takes all synchronization to the server's side. So you can combine repositories together in any way you like, and still work with a multi-module repository as if it was a regular one - no special commands, specific course of actions, etc.
[+] tannhaeuser|3 years ago|reply
Maybe it's more helpful to think of submodules as a convention in .git to manage commit ids of external repos for code your main repo code depends on, with some assumptions (i.e. own subdirectory) and porcelain that might or might not match your workflow with respect to how that external code is integrated. It can get tedious if having to deal with submodules of submodules etc., but so would other ways to track ids of transitive deps.
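To make that convention concrete: the name-to-path-and-URL half of it lives in the `.gitmodules` file, while the pinned commit id is stored in the parent's tree as a gitlink entry. Below is a minimal sketch that reads the first half. It is a simplified parser (no quoting, escapes, or line continuations), the example file content is hypothetical, and real tooling would more likely call `git config -f .gitmodules --list`.

```python
import re
from typing import Dict, Optional

# Matches section headers such as: [submodule "libfoo"]
SECTION_RE = re.compile(r'^\[submodule\s+"(?P<name>[^"]+)"\]$')


def parse_gitmodules(text: str) -> Dict[str, Dict[str, str]]:
    """Return {submodule name: {key: value}} for a .gitmodules-style text.

    Simplified on purpose: this is not a full git-config parser.
    """
    modules: Dict[str, Dict[str, str]] = {}
    current: Optional[Dict[str, str]] = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # blank line or comment
        if line.startswith("["):
            match = SECTION_RE.match(line)
            # Keys under any non-submodule section are ignored.
            current = modules.setdefault(match.group("name"), {}) if match else None
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
    return modules


# Hypothetical file content, for illustration only.
EXAMPLE = "\n".join([
    '[submodule "libfoo"]',
    "    path = vendor/libfoo",
    "    url = https://example.com/libfoo.git",
    '[submodule "libbar"]',
    "    path = vendor/libbar",
    "    url = git@example.com:team/libbar.git",
    "    branch = main",
])

if __name__ == "__main__":
    for name, settings in parse_gitmodules(EXAMPLE).items():
        print(name, settings["path"], settings["url"])
```

Note that nothing in this file says which commit is pinned. That lives in the parent commit's tree, which is why the two can drift apart.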
[+] pornel|3 years ago|reply
The SVN implementation worked pretty seamlessly, almost like a regular subdirectory.

There was no gotcha of a non-recursive clone/checkout. If you've used this feature, your users wouldn't keep getting "broken" checkouts.

There was no gotcha of state split between .gitmodules, top-level .git state, and submodule's .git, and the issues caused by them being out of sync.

There was no gotcha of referencing an unpushed commit.

Submodules are weirdly special and let all of their primitive implementation details leak out to the interface. You can't just clone the repo any more, you have to clone recursively. You can't just checkout branches any more, you have to init and update the submodules too, with extra complications if the same submodules don't exist in all branches. You can't just commit/push, you have to commit/push the submodules first. With submodules every basic git operation gets extra steps and novel failure modes. Some operations feel outright buggy, e.g. rebase gets confused and fails when an in-tree directory has been changed to a submodule.

Functionality provided by submodules is great, but the implementation feels like it's intentionally trying to make less-than-expert users feel stupid.

[+] Groxx|3 years ago|reply
Submodules are just complicated because Git makes no decisions at all about how they should behave, beyond "never make a decision that could lose data".

So you have to understand the tradeoffs and make every decision at every step. It's the safe option.

Like, what happens if you remove a submodule between revisions? Git won't remove the files, you could have stuff in there. So it just dangles there, as a pile of changed files that you now have to separately, manually remove or commit, because it's no longer tracked as a submodule. And then repeat this same kind of "X could sometimes be risky, so don't do it" behavior for dozens of scenarios.

All of which is in some ways reasonable, and is very much "Git-like behavior". But it's annoying, and if you don't really understand it all it seems like it's just getting in your way all the time for no good reason. Git has been very very slowly improving this behavior in general, but it's still taking an extremely conservative stance on everything, so it'll probably never be streamlined or automagic - making one set of decisions implicitly would get in the way of someone who wants different behavior.

[+] arjvik|3 years ago|reply
What's the mental model for the use of a git submodule?

I've always thought of them as a way to "vendor" a git repository, i.e. declare a dependency on a specific version of another project. I thought they made sense to use only when you're not actively developing the other project (at least within the same mental context). If you did want to develop the other project independently, I thought it best to clone it as a non-submodule somewhere else, push any commits, then pull them down into the submodule.

[+] oivey|3 years ago|reply
I think submodules end up highlighting failures of the (implementation of the) many small repo model. People want to develop in both the submodules and main repo simultaneously, and that’s relatively painful. If you find yourself doing that often, that’s a sign that your repos are highly coupled and failing the intent of the many small repo model anyway.
[+] mewse|3 years ago|reply
I have a custom game engine, and I make games that use that engine.

I have about ten different games, all using the same engine, but they were written over the course of many years and so are written against different commits within that engine repo, and git submodules capture that idea perfectly.

If I didn’t have something like that where I effectively had a long-lived library which I wanted to pull into multiple projects, I probably wouldn’t bother with the submodules. But it’s *so* much more convenient to have them in submodules than to copy the engine code separately into each project’s separate repos and then manage applying every change to the engine in all the different game repos individually.

Really, git submodules are exactly the same thing as subversion ‘externals’, but always pinned to a specific commit (which is an available-but-not-enabled-by-default option under subversion), and with a substantially easier interface so maybe folks who don’t need them are more likely to notice them and grumble about how they don’t solve an issue they have?

IMHO git submodules are a huge quality of life improvement over the same system from subversion (as was available in subversion back in the 1.8.x era; I haven’t really used svn in anger in a long time). I definitely wouldn’t want to go back, or to not have them available.

[+] ghoward|3 years ago|reply
That's how I have thought of them too, so I've never struggled with them. Hence why I asked the question.

I'm glad to know that I'm not the only one to use them as such.

[+] preseinger|3 years ago|reply
The main thing that vendoring is supposed to do is to make it so you can build your code even if all your deps disappear. Submodules don't get you that property.
[+] Pathogen-David|3 years ago|reply
As many others in this thread have stated, the main issue is they have fairly poor UX and if you aren't used to them they can be pretty annoying. They especially have quirks when they're removed from (or moved within) an existing Git repository.

One thing I haven't seen mentioned in this thread though is that they force an opinion of HTTPS vs SSH for the remote repository.

If a developer usually uses SSH, their credential manager might not be authenticated over HTTPS (if they even have one configured at all!) If they usually use HTTPS, they might not even have an SSH keypair associated with their identity. If they're on Windows, setting up SSH is generally even higher friction than it is on Linux.

For someone just casually cloning a repository this is a lot of friction right out of the gate, and they haven't even had to start dealing with deciphering your build instructions yet!
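One way to picture the workaround is a URL rewrite from the SSH form to the HTTPS form. The sketch below assumes the host serves the same repository path over both protocols, which holds for many hosting services but is an assumption here. The function name and example URLs are hypothetical. Git itself offers the `url.<base>.insteadOf` configuration for this kind of rewrite, so a real setup would probably use that rather than custom code.

```python
import re

# scp-style SSH URL, e.g. git@example.com:team/repo.git
# (the user@ part is required here so local paths are not misread).
SCP_STYLE = re.compile(r"^[\w.-]+@(?P<host>[\w.-]+):(?P<path>.+)$")
# ssh:// URL with optional user and port, e.g. ssh://git@example.com:2222/team/repo.git
SSH_SCHEME = re.compile(
    r"^ssh://(?:[\w.-]+@)?(?P<host>[\w.-]+)(?::\d+)?/(?P<path>.+)$"
)


def to_https(url: str) -> str:
    """Rewrite an SSH-style remote URL to HTTPS; return other URLs unchanged.

    Assumes the same host and path are reachable over HTTPS.
    """
    for pattern in (SSH_SCHEME, SCP_STYLE):
        match = pattern.match(url)
        if match:
            path = match.group("path").lstrip("/")
            return "https://{}/{}".format(match.group("host"), path)
    return url


if __name__ == "__main__":
    for example in (
        "git@example.com:team/libbar.git",
        "ssh://git@example.com/team/libbar.git",
        "https://example.com/team/libbar.git",
    ):
        print(example, "->", to_https(example))
```

The `insteadOf` route has the advantage that each developer picks their own protocol locally without anyone editing the shared `.gitmodules`.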

-------

Personally I still use Git submodules despite their flaws because they mesh well with my workflow. (This is partially due to the fact that when I transitioned from Hg to Git it was at a job that used them heavily.)

The reality is every solution to bringing external code into your project (whether it's using submodules, subtrees, tools like Josh, scripts to clone separately, IDE features for multi-repo work, ensuring dependencies are always packages, or just plain ol' copy+pasting) has different pros and cons. None of them are objectively bad, none are objectively best for all situations. You need to determine which workflow makes the most sense for you, your project, and your collaborators.

[+] tommyjl|3 years ago|reply
I recently started using git subtree[0] instead of dealing with all the problems with git submodules, and have been very happy with the experience so far. It does copy every file into your repository, though.

[0]: https://github.com/git/git/blob/master/contrib/subtree/git-s...

[+] matthewmacleod|3 years ago|reply
This is funny - I was going to chip in and say that you really start to appreciate submodules once you’ve experienced the fractal hell of subtrees!

If I ever have to untangle a messed-up repository containing a subtree again in my life I’m quitting development and moving to a cabin in the woods.

[+] FrenchyJiby|3 years ago|reply
Beyond how hard to use they may or may not be, my personal hatred of git submodules is about bypassing your normal dependency management system. See 12 Factors on Dependencies[1].

I've not seen many uses of submodules that weren't better served by adding the package from pypi/npm/crates/...

[1]: https://12factor.net/dependencies

[+] withinboredom|3 years ago|reply
I've been reading a lot of research papers lately, sometimes with POC repos. When compiling the code, it would often fail because dependencies have changed over the years (we're talking about C/C++ code which doesn't have a package manager like you're probably used to). Most of these repos would fail to compile because it expected libraries installed on the system to not have changed in the past 10 years or so. In exactly one of these repos did the author use a git submodule so I was linking directly to the correct version of the library. Granted, those libraries _did not_ do this, so it still failed to compile 7 years later...

If everyone used git submodules for deps, you'd end up with a (block) chain (if everyone's commits were signed) for deps.

[+] gitgud|3 years ago|reply
> bypassing your normal dependency management system

You can use release tags when using Git submodules, which makes it closer to a "normal dependency management system".

It's better than using the commit hash or branch, but still not that great without flexible semver dependencies...

[+] SahAssar|3 years ago|reply
Why would using pypi/npm/crates be better? If I'm already using git (which I will be) then using something else for package management needs to be a huge step up, especially if the other systems you mentioned are language specific and a git solution would probably be language agnostic.
[+] quickthrower2|3 years ago|reply
I interpreted the link as saying don’t make me manually install dependencies.

Using git submodules is no worse or better than say npm in that regard.

[+] jayd16|3 years ago|reply
The security model seems to be "terrible UX defaults that you turn to unsafe defaults instead."

You end up with a lot of gotchas instead of them just working.

The mental model of juggling multiple repos in a non-atomic way also violates the rule of least astonishment. Working with read only submodules smooths this part out at least.

GUI support is slowly getting better at least.

[+] pengaru|3 years ago|reply
> The security model seems to be "terrible UX defaults that you turn to unsafe defaults instead."

Except practically nothing in git is unsafe, it's like Plan9's Venti in the sense that you can only add content to it. You never lose data, but you can easily lose your bearings.

I presume a major reason why there's not much effort put into UX guard rails preventing losing one's bearings in the day-to-day work is a product of the fundamental fact that the underlying committed data is always there. People just need to at least familiarize themselves with `git reflog` to help lose the anxiety.

[+] TillE|3 years ago|reply
It's really annoying that submodules give you a detached HEAD by default, so working on a submodule within a project is prone to mistakes. Otherwise they've been fine for me.
[+] JoshTriplett|3 years ago|reply
The biggest reason I find git submodules painful: a "commitlink" object in a git tree does not count as a reference to that commit or anything that commit references, for the purposes of garbage collecting the repository or pushing and pulling changes. You can't have the only reference in your repository to a given commit be a commitlink within another tree.

I'd like to jettison the entire model of "reference another repository that you may or may not have", along with the `.gitmodules` file as anything other than optional metadata, and instead have them be a fully integrated thing that necessarily comes along with any pull/push/clone.

[+] howinteresting|3 years ago|reply
The problem with submodules is that they're read-write. Read-only submodules would be completely fine.
[+] juped|3 years ago|reply
Yeah this is basically it in the course of actually using them; people should probably only submodule in tags as well. Some of people's problems also come from using them where they're not necessary.
[+] anonymoushn|3 years ago|reply
If you work on blorp which contains openresty which contains luajit, and you have a patch to luajit that you are making because it enables some work in the end product, you need to make a commit to luajit, make a commit to openresty with your changes, make another commit to openresty to change the luajit version, make a commit to blorp with your changes, and make another commit to blorp to change the openresty version. You will create 3 code reviews none of which actually contain all of your changes together. Your coworkers have decent odds of not being able to build their software because they don't have a keybind for `git submodule update --init --recursive` yet.
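The commit sequence described above can be sketched as a small planner. This is only an illustration of the bookkeeping, with hypothetical function and step names. It assumes a single linear nesting chain and that every pointer bump is its own commit.

```python
from typing import List, Set, Tuple


def commit_plan(chain: List[str], changed: Set[str]) -> List[Tuple[str, str]]:
    """List the commits needed, innermost repo first.

    chain   -- repos from outermost to innermost, e.g. ["blorp", "openresty", "luajit"]
    changed -- repos that have code changes of their own
    """
    plan: List[Tuple[str, str]] = []
    child_with_new_commit = None
    for repo in reversed(chain):
        steps = []
        if repo in changed:
            steps.append("commit own changes")
        if child_with_new_commit is not None:
            # The parent must record the child's new commit id.
            steps.append("commit new pointer to " + child_with_new_commit)
        plan.extend((repo, step) for step in steps)
        child_with_new_commit = repo if steps else None
    return plan


if __name__ == "__main__":
    repos = ["blorp", "openresty", "luajit"]
    for repo, step in commit_plan(repos, set(repos)):
        print(repo + ": " + step)
```

For the three-repo case this yields five commits spread over three repositories, matching the count in the comment.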
[+] TrianguloY|3 years ago|reply
I don't use submodules, but I do use git repositories inside other git repositories, and let IntelliJ manage them both simultaneously as if they were two different projects. Works fine.

I once enabled it as submodule to test. It adds nothing and from that moment any change in the child creates a change in the parent, which for my use case is totally unnecessary (I want both of them to be independent, even if hierarchically one is inside the other).

Submodules are probably a good option to have libraries that you rarely touch, so you can update/modify them as with a maven/gradle project. For most other use cases submodules create more problems than advantages.

[+] djmips|3 years ago|reply
On a project where you have a lot of people working on the same code, an advantage (of submodules) is that you can check out someone's branch and it contains the URLs to the specific commits in the submodules that work with their branch.
[+] WanderPanda|3 years ago|reply
What's up with git submodules refusing to clone/update/checkout submodules while having all their files showing up as deleted? I encounter this quite a lot and the solution seems to be a git submodule sync --recursive (or something like that) but I don't get why I run into this in the first place? Probably related to forgetting --recurse-submodules when cloning but what do I know?
[+] lobocinza|3 years ago|reply
Git submodules are cool but can be confusing because people are already used to their language package manager. They also add overhead as changes frequently have to be pulled/pushed downstream/upstream. But in cases where it makes sense to use it, it's a great tool. E.g. a theme that is reused in 3 sites is in its own repo and is a submodule in each site.
[+] Too|3 years ago|reply
A lot of people here complain about the complex UX. This is a big problem but is something you get used to and can live with.

An even bigger problem is when you start substituting a dependency manager with submodules. It has no way to deal with transitive dependencies or diamond-dependencies. What are you going to do when lib A->B->D and A->C->D? Your workspace will now have two duplicate checkouts of D and any update to D requires committing 3 repos in sequence to update the hashes. If you are really unlucky there can only be one instance of D running on the system but the checkouts differ.

The correct way to deal with this is to only have one top level superproject where all repos, even transitive ones are vendored and pinned. The difficulty is knowing if your repo really is the top level, or if someone else will include it in an even bigger context. Rule of thumb would be that your superproject shouldn’t itself contain any code, only the submodules.
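The diamond case can be illustrated with a small sketch that walks a nesting graph and reports every repo that would be checked out more than once. The graph literal and function names are hypothetical, and the code assumes the graph has no cycles.

```python
from collections import defaultdict
from typing import Dict, List


def checkout_paths(graph: Dict[str, List[str]], root: str) -> Dict[str, List[str]]:
    """Map each nested repo to every path at which it would be checked out.

    graph maps a repo to the repos it includes as submodules (assumed acyclic).
    """
    found: Dict[str, List[str]] = defaultdict(list)

    def walk(repo: str, prefix: str) -> None:
        for dep in graph.get(repo, []):
            path = prefix + "/" + dep
            found[dep].append(path)
            walk(dep, path)

    walk(root, root)
    return dict(found)


def duplicated(graph: Dict[str, List[str]], root: str) -> Dict[str, List[str]]:
    """Return only the repos that end up with more than one checkout."""
    paths_by_repo = checkout_paths(graph, root)
    return {repo: paths for repo, paths in paths_by_repo.items() if len(paths) > 1}


# The A->B->D and A->C->D example from the comment above.
GRAPH = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

if __name__ == "__main__":
    print(duplicated(GRAPH, "A"))
```

With the flat superproject layout suggested above, the graph would instead be `{"top": ["A", "B", "C", "D"]}` and `duplicated` would come back empty.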

[+] cppforlife|3 years ago|reply
i never found myself struggling with submodules, but at times i found myself just slightly annoyed (especially when having to remove/replace submodules), especially when they are used for simpler use cases.

i actually ended up creating https://carvel.dev/vendir/ for some of the overlapping use cases. aside from not being git specific (for source content or destination), it's entirely transparent to consumers of the repo as they do not need to know how some subset of content is being managed. (i am of course a fan of committing vendored content into repos and ignore small price of increasing repo size).

[+] Izkata|3 years ago|reply
They act almost the same as pinned-revision svn externals, which people don't really seem to have a problem with. The biggest difference I can think of is needing a special command to pull in the submodules, where svn pulls its externals automatically.