Alacart|2 years ago
Jokes aside, and coming from a place of ignorance, it's interesting to me that a file count of that size is still a real performance issue for git. I'd have expected that something so ubiquitous and core to most of the software world would have seen improvements there.
Genuine, non-snarky question: Are there some fundamental aspects of git that would make it either very difficult to improve that, or that would sacrifice some important benefits if they were made? Or is this a case of it being a large effort that no one has particularly cared enough to take on yet?
klodolph|2 years ago
It’s hard to look at a million files on disk and figure out which ones have changed. Git, by default, examines the filesystem metadata. It takes a long time to examine the metadata for a million files.
The main alternative approaches are:
- Locking: Git makes all the files read-only, so you have to unlock them first before editing. This way, you only have to look at the unlocked files.
- Watching: Keep a process running in the background and listen to notifications that the files have changed.
- Virtual filesystem: Present a virtual filesystem to the user, so all file modifications go through some kind of Git daemon running in the background.
All three approaches have been used by various version control systems. They’re not easy approaches by any means, and they all have major impacts on the way you have to set up your Git repository.
People also want things like sparse checkouts when working with such large repos.
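Mechanically, git's default approach boils down to one stat() call per tracked file, compared against a saved index. A minimal Python sketch of that idea (the index layout here, a dict of path to (mtime_ns, size), is a simplification I made up for illustration; real git also records inode, ctime, uid, and more, and re-hashes the file content on a metadata mismatch):

```python
import os

def changed_paths(root, index):
    """Stat every tracked file and flag the ones whose metadata no
    longer matches the index -- loosely what `git status` does by
    default. The cost is one stat() per file, which is exactly what
    hurts at a million files.
    """
    changed = []
    for path, (mtime_ns, size) in index.items():
        try:
            st = os.stat(os.path.join(root, path))
        except FileNotFoundError:
            changed.append(path)  # deleted
            continue
        if st.st_mtime_ns != mtime_ns or st.st_size != size:
            changed.append(path)  # possibly modified; git would re-hash it
    return changed
```

The alternatives in the list above all attack this loop: locking shrinks the set of paths to examine, watching replaces polling with notifications, and a virtual filesystem intercepts the writes in the first place.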
HALtheWise|2 years ago
https://www.infoq.com/news/2022/06/git-2-37-released/
10000truths|2 years ago
[1] https://github.com/libgit2/libgit2-backends
robotresearcher|2 years ago
eviks|2 years ago
1MachineElf|2 years ago
>Findlargedir is a tool specifically written to help quickly identify "black hole" directories on any filesystem, i.e. those having more than 100k entries in a single flat structure. When a directory has many entries (directories or files), getting a directory listing gets slower and slower, impacting the performance of all processes attempting to get a directory listing (for instance to delete some files and/or to find some specific files). Processes reading large directory inodes get frozen while doing so and end up in uninterruptible sleep (the "D" state) for longer and longer periods of time. Depending on the filesystem, this might start to become visible at 100k entries and becomes a very noticeable performance impact at 1M+ entries.
>Such directories mostly cannot shrink back even if content gets cleaned up due to the fact that most Linux and Un*x filesystems do not support directory inode shrinking (for instance very common ext3/ext4). This often happens with forgotten Web sessions directory (PHP sessions folder where GC interval was configured to several days), various cache folders (CMS compiled templates and caches), POSIX filesystem emulating object storage, etc.
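One way to spot such directories, and roughly what a tool like findlargedir must do, is to look at the size of the directory inode itself, which on ext3/ext4 grows with the entry count and, as the quote notes, does not shrink back. A rough Python sketch (the 1 MiB threshold and the function names are arbitrary choices for illustration, not findlargedir's actual behavior):

```python
import os

def directory_inode_size(path):
    """Size in bytes of the directory inode itself (not its contents).
    On ext3/ext4 this grows as entries are added and never shrinks, so
    a multi-megabyte st_size flags a current or former 'black hole'
    directory even if it looks empty now.
    """
    return os.stat(path).st_size

def find_bloated_dirs(root, threshold=1 << 20):
    """Walk the tree and yield directories whose inode exceeds threshold."""
    for dirpath, dirnames, _ in os.walk(root):
        if directory_inode_size(dirpath) > threshold:
            yield dirpath
```

Checking st_size avoids actually listing the huge directory, which is the slow, D-state-inducing operation the quote describes.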
bityard|2 years ago
Since git is essentially a filesystem with extensive version control features, it doesn't surprise me that it would have problems handling large numbers of files.
thrashh|2 years ago
But there will be some trade-off.
And I don't think people generally put "a million files" in the requirements because it's fairly rare.
eigenvalue|2 years ago
zamadatix|2 years ago
tp34|2 years ago
ext4 on an old system, feeble in comparison to yours, performs much better.
ext4, 8GB memory, 2 core Intel i7-4600U 2.1GHz, Toshiba THNSNJ25 SSD:
    $ time ls -U | wc -l
    555557

    real    0m0.275s
    user    0m0.022s
    sys     0m0.258s
stat(2) slows it down, but still this is not as poor as your results:
    $ time ls -lU | wc -l
    555557

    real    0m2.514s
    user    0m1.126s
    sys     0m1.407s
Sorting is not prohibitively expensive:
    $ time ls | wc -l
    555556

    real    0m1.438s
    user    0m1.249s
    sys     0m0.193s
Drop caches, then stat (unsorted):

    # echo 3 > /proc/sys/vm/drop_caches
    $ time ls -lU | wc -l
    555557

    real    0m6.431s
    user    0m1.249s
    sys     0m4.324s
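The same enumerate-versus-stat gap these ls timings show can be reproduced outside ls. A small Python sketch, assuming only the standard library, that scans a directory with and without stat()ing each entry:

```python
import os
import time

def time_listing(path, do_stat):
    """Scan a directory and time it. With do_stat=False this is the
    `ls -U` case (read directory entries only); with do_stat=True it
    is the `ls -lU` case (one extra stat() per entry), which is the
    part that dominates at hundreds of thousands of files.
    """
    start = time.perf_counter()
    count = 0
    with os.scandir(path) as it:
        for entry in it:
            count += 1
            if do_stat:
                entry.stat(follow_symlinks=False)
    return count, time.perf_counter() - start
```

On a directory with half a million entries, the do_stat=True run should show the same multiple-of-several slowdown as the -lU timings above, larger still with cold caches.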
Frannyies|2 years ago
I always marvel at it and think: "wow, so git goes through its history, pulls out many small files and chunks and patches, updates the whole file tree, and all of this right after hitting enter, done almost immediately."
kudokatz|2 years ago
I can't speak to improving git, but I think some light on this area can be shed by Linus' tech talk at Google in 2007.
1. Linus says there's a specific focus on full history and content, not files ... so it's a deliberate, different axis of focus than file count:
https://youtu.be/4XpnKHJAok8?t=2586
... AND it's a specific pitfall to avoid when using Git:
https://youtu.be/4XpnKHJAok8?t=4047
2. As Linus tells it, Git appears to be designed specifically for project maintenance while not getting in the way of individual commits and collaboration. The global history and the more expensive operations like "who touched this line" are deliberately content-based, so that lines of a function are tracked across all moves of the content itself.
Maintainer tool enablement: https://youtu.be/4XpnKHJAok8?t=3815
Content tracking slower than file-based "who touched this": https://youtu.be/4XpnKHJAok8?t=4071
===
I have no answer, but ...
Practically, I've used lazy filesystems both for Windows-on-Git via GVFS [1][2] and Google's monorepo jacked into a mercurial client (I think that's what it is?). Both companies have made this work, but as Linus says, a lot of the stuff just doesn't work well with either system.
Windows-on-Git still takes a lot of time overall, and stacking > 10 patches of an exploratory refactor with the monorepo on hg starts slowing WAY WAY down to the point where any source control operations just get in the way.
[1] https://devblogs.microsoft.com/devops/announcing-gvfs-git-vi...
[2] https://github.com/microsoft/VFSForGit