Ask HN: Why are Git submodules so bad?
213 points| gavinhoward | 3 years ago | reply
I also hardly use Git submodules, but when I do, I don't struggle.
Yet people talk about Git submodules as though they are really hard. I presume I'm just not using them as much as other people, or that my use case for them happens to be on their happy path.
So why are Git submodules so bad?
[+] [-] armchairhacker|3 years ago|reply
1. Git clone not cloning submodules. You need `git submodule update --init` or `git clone --recursive` (now spelled `--recurse-submodules`), I think
2. Git submodules being out-of-sync because I forgot to pull them specifically. I'm pretty sure `git submodule update` doesn't always fix this, though maybe that's only when 3) applies
3. Git diff returns something even after I commit, because the submodule has a change. I have to go into the submodule, and either commit / push that as well or revert it. Basically every operation I do on the git main I need to also do on the submodule if I modified files in both
4. Fixing merge conflicts and using git in one repo is already hard enough. The team I was working on kept having issues with using the wrong submodule commit, not having the same commit / push requirements on submodules, etc.
All of these can be fixed by tools and smart techniques like putting `git submodule update` in the makefile. Git submodules aren't "bad" and honestly they're an essential feature of git. But they are a struggle, and lots of people use monorepos instead (which have their own problems...).
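To make point 1 concrete, here is a minimal sketch that builds two throwaway repos in a temp directory and shows the empty-submodule clone plus both fixes. The repo names (`lib`, `app`), the `vendor/lib` path and the demo identity are made up for illustration. The `-c protocol.file.allow=always` override is only there because recent Git versions refuse local-path submodule clones by default; it should not be needed for real remotes.

```shell
set -eu
work=$(mktemp -d) && cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# A library repo and an application repo that embeds it as a submodule.
git init -q lib
(cd lib && echo 'hello' > lib.txt && git add lib.txt && git commit -qm 'lib: initial')
git init -q app
(cd app &&
  git -c protocol.file.allow=always submodule --quiet add "$work/lib" vendor/lib &&
  git commit -qm 'app: add vendor/lib submodule')

# A plain clone creates vendor/lib but leaves it empty...
git clone -q "$work/app" plain
ls -A plain/vendor/lib

# ...until the submodules are initialised and fetched explicitly.
(cd plain &&
  git -c protocol.file.allow=always submodule --quiet update --init --recursive)

# Alternatively, ask for submodules at clone time.
git -c protocol.file.allow=always clone -q --recurse-submodules "$work/app" full
ls full/vendor/lib
```

Putting the `submodule update --init --recursive` line into a makefile target, as suggested above, is one way to stop people from having to remember it.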
[+] [-] fanf2|3 years ago|reply
[+] [-] agilob|3 years ago|reply
git config --global submodule.recurse true
https://git-scm.com/book/en/v2/Git-Tools-Submodules search for "git config"
[+] [-] lelandfe|3 years ago|reply
In git parlance, the submodule porcelain is hard to use (but the plumbing is good)
[+] [-] dtgriscom|3 years ago|reply
And, yes: submodules are really useful, as well as a PIA.
[+] [-] esperent|3 years ago|reply
If an important software tool is hard to use to the point that most people avoid it, then it's not fine. It's broken.
[+] [-] fburnaby|3 years ago|reply
[+] [-] andi999|3 years ago|reply
[+] [-] Phurist|3 years ago|reply
[+] [-] d_watt|3 years ago|reply
I think they're difficult to use because they break my mental model of how I expect a repository to work. They create a new level of abstraction for figuring out how everything is related, and what commands you need to keep things in sync (as opposed to just a normal pull/branch/push flow). That's a whole new layer in the way your VCS works that the consumer needs to understand.
The two alternatives are
1. Have a bunch of repositories with an understanding of what the expected file structure is, à la ./projects/repo_1, ./projects/repo_2. You have a master repo with a readme instructing people on how to set it up. In theory, there's a disadvantage here in that it puts more work on the end user to manually set that up, but the advantage is there's a simpler understanding of how everything works together.
2. A mono repo. If you absolutely want all of the files to be linked together in a large repo, why not just put them in the same repo, rather than forking everything out across many repos. You lose a little flexibility in being able to mix and match branches, but nothing a good cherry-pick when needed can't fix.
Either of these strategies solves the same problem sub-modules are usually used to solve, without creating a more burdensome mental model, in my opinion. So the question becomes why use them and add more to understand, if there are simpler patterns to use instead.
[+] [-] aslilac|3 years ago|reply
What they're really for is vendoring someone else's code into yours. They're still not great even at that, but sometimes they're the best option.
[+] [-] dswilkerson|3 years ago|reply
This is what I do. I have something like 17 code repos organized this way, plus lots of testing repos, plus an extra "hub" repo. (Credit to a friend for calling this repo "hub": short, to the point, requires no explanation.) The hub repo is a bunch of scripts and makefiles that configure everything and even clone the rest of the repos for me. It also has special grep and find scripts that will run on all of the repos as their target. The hub repo just needs one env var to tell it where the root of all the repos is. Note that in the file system the hub repo is under the root and a sibling of the code repos, not their parent in the file system.
Each code or test repo has an "externs" subdir populated only with softlinks to the other repos on which it depends. The scripts configure this by default, but it is also straightforward to configure by hand if you want to do something non-typical. For example, if you want to have multiple versions of a repo checked out on, say different branches/commits, you can do that and name each directory with a suffix of the branch/commit. Then the client repos can just point at the one they want. You can have all kinds of different configurations set up at any time. Doing this makes it straightforward to know what you depend on just by looking at the softlinks. There is no confusion at any time.
There are ways of configuring the system that do not even need all of the repos, so this is ideal. Using the hub repo makefile I can clone the whole system with one make target (after cloning hub), I can build the whole system with one target, I can test the whole system with one target. It is a testament to how well it works that I don't even know exactly how many repos I have. In short, it works great.
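The layout described here can be sketched roughly as follows. Everything named in it is an assumption for illustration: `REPOS_ROOT` stands in for the single env var, `engine` and `game` for two code repos, and plain directories stand in for real clones.

```shell
set -eu
# Hypothetical name for the one env var that locates the root of all repos.
REPOS_ROOT=$(mktemp -d)

# hub sits under the root as a sibling of the code repos, not as their parent.
mkdir -p "$REPOS_ROOT/hub" "$REPOS_ROOT/engine" "$REPOS_ROOT/game/externs"

# game depends on engine, so its externs/ directory gets a softlink to it.
ln -s ../../engine "$REPOS_ROOT/game/externs/engine"

# A second checkout of engine (say, on another branch) lives in a directory
# named with a suffix; the client repo is repointed by replacing the softlink.
mkdir -p "$REPOS_ROOT/engine-experimental"
rm "$REPOS_ROOT/game/externs/engine"
ln -s ../../engine-experimental "$REPOS_ROOT/game/externs/engine"

# The dependencies of game are visible just by listing its softlinks.
ls -l "$REPOS_ROOT/game/externs"
```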
[+] [-] Dm_Linov|3 years ago|reply
[+] [-] tannhaeuser|3 years ago|reply
[+] [-] pornel|3 years ago|reply
There was no gotcha of a non-recursive clone/checkout. If you've used this feature, your users wouldn't keep getting "broken" checkouts.
There was no gotcha of state split between .gitmodules, top-level .git state, and submodule's .git, and the issues caused by them being out of sync.
There was no gotcha of referencing an unpushed commit.
Submodules are weirdly special and let all of their primitive implementation details leak out to the interface. You can't just clone the repo any more, you have to clone recursively. You can't just checkout branches any more, you have to init and update the submodules too, with extra complications if the same submodules don't exist in all branches. You can't just commit/push, you have to commit/push the submodules first. With submodules every basic git operation gets extra steps and novel failure modes. Some operations feel outright buggy, e.g. rebase gets confused and fails when an in-tree directory has been changed to a submodule.
Functionality provided by submodules is great, but the implementation feels like it's intentionally trying to make less-than-expert users feel stupid.
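The "commit the submodule first, then the superproject" dance can be sketched with two throwaway repos. The names (`lib`, `app`, `vendor/lib`) and the demo identity are made up, and pushing is left out because the sketch has no real remote. The `protocol.file.allow` override is only needed because the demo clones from a local path.

```shell
set -eu
work=$(mktemp -d) && cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q lib
(cd lib && echo v1 > lib.txt && git add lib.txt && git commit -qm 'lib: v1')
git init -q app
(cd app &&
  git -c protocol.file.allow=always submodule --quiet add "$work/lib" vendor/lib &&
  git commit -qm 'app: add submodule')

# Step 1: commit inside the submodule (in real use, push it from here too).
(cd app/vendor/lib && echo v2 > lib.txt && git commit -qam 'lib: v2')

# The superproject now reports vendor/lib as modified, even though none of
# its own files changed.
git -C app status --short

# Step 2: record the new submodule commit in the superproject.
(cd app && git add vendor/lib && git commit -qm 'app: bump vendor/lib')

# The commit recorded in the superproject now matches the submodule checkout.
recorded=$(git -C app rev-parse HEAD:vendor/lib)
checked_out=$(git -C app/vendor/lib rev-parse HEAD)
echo "$recorded $checked_out"
```

For the unpushed-commit gotcha, `git push --recurse-submodules=check` makes the superproject push fail if a referenced submodule commit has not been pushed anywhere, and `--recurse-submodules=on-demand` tries to push the submodules first.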
[+] [-] Groxx|3 years ago|reply
So you have to understand the tradeoffs and make every decision at every step. It's the safe option.
Like, what happens if you remove a submodule between revisions? Git won't remove the files, you could have stuff in there. So it just dangles there, as a pile of changed files that you now have to separately, manually remove or commit, because it's no longer tracked as a submodule. And then repeat this same kind of "X could sometimes be risky, so don't do it" behavior for dozens of scenarios.
All of which is in some ways reasonable, and is very much "Git-like behavior". But it's annoying, and if you don't really understand it all it seems like it's just getting in your way all the time for no good reason. Git has been very very slowly improving this behavior in general, but it's still taking an extremely conservative stance on everything, so it'll probably never be streamlined or automagic - making one set of decisions implicitly would get in the way of someone who wants different behavior.
[+] [-] arjvik|3 years ago|reply
I've always thought of them as a way to "vendor" a git repository, i.e. declare a dependency on a specific version of another project. I thought they made sense to use only when you're not actively developing the other project (at least within the same mental context). If you did want to develop the other project independently, I thought it best to clone it as a non-submodule somewhere else, push any commits, then pull them down into the submodule.
[+] [-] oivey|3 years ago|reply
[+] [-] mewse|3 years ago|reply
I have about ten different games, all using the same engine, but they were written over the course of many years and so are written against different commits within that engine repo, and git submodules capture that idea perfectly.
If I didn’t have something like that where I effectively had a long-lived library which I wanted to pull into multiple projects, I probably wouldn’t bother with the submodules. But it’s *so* much more convenient to have them in submodules than to copy the engine code separately into each project’s separate repos and then manage applying every change to the engine in all the different game repos individually.
Really, git submodules are exactly the same thing as subversion ‘externals’, but always pinned to a specific commit (which is an available-but-not-enabled-by-default option under subversion), and with a substantially easier interface so maybe folks who don’t need them are more likely to notice them and grumble about how they don’t solve an issue they have?
IMHO git submodules are a huge quality of life improvement over the same system from subversion (as was available in subversion back in the 1.8.x era; I haven’t really used svn in anger in a long time). I definitely wouldn’t want to go back, or to not have them available.
[+] [-] ghoward|3 years ago|reply
I'm glad to know that I'm not the only one to use them as such.
[+] [-] preseinger|3 years ago|reply
[+] [-] Pathogen-David|3 years ago|reply
One thing I haven't seen mentioned in this thread though is that they force an opinion of HTTPS vs SSH for the remote repository.
If a developer usually uses SSH, their credential manager might not be authenticated over HTTPS (if they even have one configured at all!) If they usually use HTTPS, they might not even have an SSH keypair associated with their identity. If they're on Windows setting up SSH is generally even higher friction than it is on Linux.
For someone just casually cloning a repository this is a lot of friction right out of the gate, and they haven't even had to start dealing with deciphering your build instructions yet!
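One workaround on the consumer side is Git's `url.<base>.insteadOf` setting, which rewrites URLs before they are used. A minimal sketch, with no network access involved; `example/project` is a made-up repository, and the rule is set on a scratch repo only so the demo does not touch your real config:

```shell
set -eu
work=$(mktemp -d) && cd "$work"
git init -q demo && cd demo

# Rewrite SSH-style GitHub URLs to HTTPS (scoped to this scratch repo).
git config url."https://github.com/".insteadOf "git@github.com:"

# A remote declared with the SSH form...
git remote add origin git@github.com:example/project.git

# ...resolves to the HTTPS form, because get-url expands insteadOf rules.
resolved=$(git remote get-url origin)
echo "$resolved"
```

For submodules, the rule would presumably need to go in `--global` config so that it also applies to the separate clones Git makes for each submodule. Swapping the two strings gives the HTTPS-to-SSH direction.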
-------
Personally I still use Git submodules despite their flaws because they mesh well with my workflow. (This is partially due to the fact that when I transitioned from Hg to Git it was at a job that used them heavily.)
The reality is that every way of bringing external code into your project (whether it's submodules, subtrees, tools like Josh, scripts to clone separately, IDE features for multi-repo work, ensuring dependencies are always packages, or just plain ol' copy+pasting) has its own pros and cons. None of them is objectively bad, and none is objectively best for all situations. You need to determine which workflow makes the most sense for you, your project, and your collaborators.
[+] [-] tommyjl|3 years ago|reply
[0]: https://github.com/git/git/blob/master/contrib/subtree/git-s...
[+] [-] matthewmacleod|3 years ago|reply
If I ever have to untangle a messed-up repository containing a subtree again in my life I’m quitting development and moving to a cabin in the woods.
[+] [-] FrenchyJiby|3 years ago|reply
I've not seen many uses of submodules that weren't better served by adding the package from pypi/npm/crates/...
[1]: https://12factor.net/dependencies
[+] [-] withinboredom|3 years ago|reply
If everyone used git submodules for deps, you'd end up with a (block) chain (if everyone's commits were signed) for deps.
[+] [-] gitgud|3 years ago|reply
You can use release tags when using Git submodules, which makes it closer to a "normal dependency management system".
It's better than using the commit hash or branch, but still not that great without flexible semver dependencies...
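Pinning to a release tag might look like the sketch below. The superproject still records a commit hash underneath; the tag is just how that commit gets chosen. The repo names, the `v1.0.0` tag and the demo identity are invented for the example, and the `protocol.file.allow` override is only needed for the local-path clone.

```shell
set -eu
work=$(mktemp -d) && cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# A library with one tagged release and one later, untagged commit.
git init -q lib
(cd lib &&
  echo 1.0 > VERSION && git add VERSION && git commit -qm 'release 1.0.0' &&
  git tag v1.0.0 &&
  echo 1.1-dev > VERSION && git commit -qam 'work in progress')

# Add it as a submodule, then move the checkout back to the release tag
# before committing, so the superproject pins the tagged commit.
git init -q app
(cd app &&
  git -c protocol.file.allow=always submodule --quiet add "$work/lib" vendor/lib &&
  git -C vendor/lib checkout -q v1.0.0 &&
  git add vendor/lib &&
  git commit -qm 'app: pin vendor/lib to v1.0.0')

pinned=$(git -C app rev-parse HEAD:vendor/lib)
tagged=$(git -C lib rev-parse 'v1.0.0^{commit}')
git -C app/vendor/lib describe --tags
```

Upgrading is the same three steps with a newer tag, which is where the lack of semver ranges shows: nothing picks the "latest compatible" tag for you.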
[+] [-] SahAssar|3 years ago|reply
[+] [-] quickthrower2|3 years ago|reply
Using git submodules is no worse or better than say npm in that regard.
[+] [-] jayd16|3 years ago|reply
You end up with a lot of gotchas instead of them just working.
The mental model of juggling multiple repos in a non-atomic way also violates the rule of least astonishment. Working with read only submodules smooths this part out at least.
GUI support is slowly getting better at least.
[+] [-] pengaru|3 years ago|reply
Except practically nothing in git is unsafe, it's like Plan9's Venti in the sense that you can only add content to it. You never lose data, but you can easily lose your bearings.
I presume a major reason why there's not much effort put into UX guard rails preventing losing one's bearings in the day-to-day work is a product of the fundamental fact that the underlying committed data is always there. People just need to at least familiarize themselves with `git reflog` to help lose the anxiety.
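A small sketch of the reflog safety net, using a throwaway repo (file names and identity are made up). Note that it only covers committed work; uncommitted changes wiped by `reset --hard` are not in the reflog.

```shell
set -eu
work=$(mktemp -d) && cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q repo && cd repo
echo one > notes.txt && git add notes.txt && git commit -qm 'first'
echo two > extra.txt && git add extra.txt && git commit -qm 'second'

# "Lose" the second commit by moving the branch back.
git reset -q --hard HEAD~1

# The reflog still lists every position HEAD has been at...
git reflog

# ...so the position before the reset can be restored.
git reset -q --hard 'HEAD@{1}'
ls
```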
[+] [-] TillE|3 years ago|reply
[+] [-] JoshTriplett|3 years ago|reply
I'd like to jettison the entire model of "reference another repository that you may or may not have", along with the `.gitmodules` file as anything other than optional metadata, and instead have them be a fully integrated thing that necessarily comes along with any pull/push/clone.
[+] [-] howinteresting|3 years ago|reply
[+] [-] juped|3 years ago|reply
[+] [-] anonymoushn|3 years ago|reply
[+] [-] TrianguloY|3 years ago|reply
I once enabled it as submodule to test. It adds nothing and from that moment any change in the child creates a change in the parent, which for my use case is totally unnecessary (I want both of them to be independent, even if hierarchically one is inside the other).
Submodules are probably a good option for libraries that you rarely touch, so you can update/modify them as with a maven/gradle project. For most other use cases submodules create more problems than advantages.
[+] [-] djmips|3 years ago|reply
[+] [-] WanderPanda|3 years ago|reply
[+] [-] lobocinza|3 years ago|reply
[+] [-] Too|3 years ago|reply
An even bigger problem is when you start substituting a dependency manager with submodules. It has no way to deal with transitive dependencies or diamond-dependencies. What are you going to do when lib A->B->D and A->C->D? Your workspace will now have two duplicate checkouts of D and any update to D requires committing 3 repos in sequence to update the hashes. If you are really unlucky there can only be one instance of D running on the system but the checkouts differ.
The correct way to deal with this is to only have one top level superproject where all repos, even transitive ones are vendored and pinned. The difficulty is knowing if your repo really is the top level, or if someone else will include it in an even bigger context. Rule of thumb would be that your superproject shouldn’t itself contain any code, only the submodules.
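One way to at least notice the diamond problem is to scan the output of `git submodule status --recursive` for the same dependency pinned at different commits. The sketch below runs on made-up sample output rather than a real repo, and it assumes that two submodules with the same directory name are the same dependency, which will not hold in every layout.

```shell
set -eu
# Hypothetical output of `git submodule status --recursive` in superproject A.
# The hashes and version labels are placeholders, not real commits.
status=$(cat <<'EOF'
 1111111111111111111111111111111111111111 libs/B (v1.0)
 3333333333333333333333333333333333333333 libs/B/libs/D (v2.0)
 2222222222222222222222222222222222222222 libs/C (v1.4)
 4444444444444444444444444444444444444444 libs/C/libs/D (v2.1)
EOF
)

# Group by the last path component and flag names seen with different hashes.
conflicts=$(printf '%s\n' "$status" | awk '
  {
    sha = $1; sub(/^[-+U]/, "", sha)        # drop the status prefix, if any
    n = split($2, parts, "/"); name = parts[n]
    if (name in seen && seen[name] != sha) bad[name] = 1
    seen[name] = sha
  }
  END { for (name in bad) print name }')

echo "pinned at more than one commit: $conflicts"
```

In a real superproject the here-document would be replaced by the actual command output. This only detects the mismatch; resolving it still means bumping D in B and C, then B and C in A.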
[+] [-] AceJohnny2|3 years ago|reply
[1] https://git-scm.com/docs/git-worktree
[+] [-] cppforlife|3 years ago|reply
i actually ended up creating https://carvel.dev/vendir/ for some of the overlapping use cases. aside from not being git specific (for source content or destination), it's entirely transparent to consumers of the repo as they do not need to know how some subset of content is being managed. (i am of course a fan of committing vendored content into repos and ignoring the small price of increased repo size).
[+] [-] Izkata|3 years ago|reply