top | item 37295378

Alacart | 2 years ago

Ah yes, I too have accidentally committed node_modules.

Jokes aside, and coming from a place of ignorance, it's interesting to me that a file count of that size is still a real performance issue for git. I'd have expected something so ubiquitous and core to most of the software world to have seen improvements there.

Genuine, non-snarky question: Are there some fundamental aspects of git that would make it either very difficult to improve that, or that would sacrifice some important benefits if they were made? Or is this a case of it being a large effort and no one has particularly cared enough yet to take it on?

klodolph | 2 years ago

> Are there some fundamental aspects of git that would make it either very difficult to improve that, or that would sacrifice some important benefits if they were made?

It’s hard to look at a million files on disk and figure out which ones have changed. Git, by default, examines the filesystem metadata. It takes a long time to examine the metadata for a million files.

The main alternative approaches are:

- Locking: Git makes all the files read-only, so you have to unlock them first before editing. This way, you only have to look at the unlocked files.

- Watching: Keep a process running in the background and listen to notifications that the files have changed.

- Virtual filesystem: Present a virtual filesystem to the user, so all file modifications go through some kind of Git daemon running in the background.

All three approaches have been used by various version control systems. They’re not easy approaches by any means, and they all have major impacts on the way you have to set up your Git repository.
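For what it's worth, the watching approach has since landed in Git itself as the builtin filesystem monitor. A minimal sketch, assuming Git >= 2.37 (where the builtin daemon first shipped, initially for Windows and macOS) and using a throwaway repo as a stand-in for a large one:

```shell
# enable the builtin watcher so `git status` only inspects changed paths
repo=$(mktemp -d)                               # stand-in for your large repo
git init -q "$repo"
git -C "$repo" config core.fsmonitor true       # use the fsmonitor daemon
git -C "$repo" config core.untrackedCache true  # also cache untracked-dir scans
```

After that, the first `git status` starts the daemon, and later invocations consume change events instead of stat()ing every file.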

People also want features like sparse checkout when working with such large repos.
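Sparse checkout in particular already exists; a cone-mode sketch, assuming Git >= 2.25 (the directory names and the throwaway repo are illustrative):

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
mkdir -p src docs other
echo a > src/a; echo b > docs/b; echo c > other/c
git add -A
git -c user.email=you@example.com -c user.name=you commit -qm 'seed'
git sparse-checkout init --cone     # restrict the checkout to directory cones
git sparse-checkout set src docs    # only these trees stay materialized on disk
```

After the `set`, files outside the listed directories are removed from the working tree (but stay in the object database).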

10000truths | 2 years ago

Are there any solutions that use libgit2's ability to define a custom ODB backend? There are even example backends already written [1] that use RDBMSs as the underlying data store.

[1] https://github.com/libgit2/libgit2-backends

robotresearcher | 2 years ago

Has anyone made a system like option 3 that successfully merges git with a filesystem? It could present both git and fs interfaces, but share events internally. I'd be interested to see how that would work.

eviks | 2 years ago

What about asking the OS for the list of changes, the way Everything does on Windows: instantly, for millions of files, at the RAM cost of ~1-2 browser tabs? (That might be limited to NTFS, but still.)

1MachineElf | 2 years ago

Other users have made good comments about performance limitations on the underlying filesystems themselves. Adding to this, I recently encountered the findlargedir tool, which aims to detect potentially problematic directories such as this: https://github.com/dkorunic/findlargedir/

>Findlargedir is a tool specifically written to help quickly identify "black hole" directories on any filesystem, i.e. those having more than 100k entries in a single flat structure. When a directory has many entries (directories or files), getting a directory listing gets slower and slower, impacting the performance of all processes attempting to list it (for instance to delete some files and/or to find some specific files). Processes reading large directory inodes get frozen while doing so and end up in uninterruptible sleep (the "D" state) for longer and longer periods of time. Depending on the filesystem, this might start to become visible around 100k entries and becomes a very noticeable performance impact with 1M+ entries.

>Such directories mostly cannot shrink back even after the content is cleaned up, because most Linux and Un*x filesystems (for instance the very common ext3/ext4) do not support directory inode shrinking. This often happens with forgotten web session directories (a PHP sessions folder whose GC interval was configured to several days), various cache folders (CMS compiled templates and caches), POSIX filesystems emulating object storage, etc.
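In the same spirit, here is a rough shell sketch of the check findlargedir performs (the function name and threshold are illustrative, not the tool's actual interface):

```shell
# print "<count> <dir>" for directories with at least THRESHOLD direct entries
large_dirs() {
  root=$1; threshold=$2
  find "$root" -type d | while read -r dir; do
    n=$(find "$dir" -mindepth 1 -maxdepth 1 | wc -l)
    if [ "$n" -ge "$threshold" ]; then
      printf '%s %s\n' "$n" "$dir"
    fi
  done
}
```

For example, `large_dirs /var/tmp 100000` would hunt for 100k+ entry directories; the real tool parallelizes this and estimates counts from inode sizes, so it's much faster.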

bityard | 2 years ago

IME, on basically all filesystems, just walking a directory tree of lots of files is expensive. Half a million files on modern systems should not be a terribly huge issue but once you get into the millions, just figuring out how to back them all up correctly and in a reasonable time frame starts to become a major admin headache.

Since git is essentially a filesystem with extensive version-control features, it doesn't surprise me that it has problems handling large numbers of files.
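The filesystem framing is fairly literal: Git's object database is content-addressed, so the same bytes always map to the same object no matter how many paths point at them. A quick illustration in a throwaway repo:

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
# hash the content "hello\n"; identical content always yields the same blob id
echo 'hello' | git hash-object --stdin
# -> ce013625030ba8dba906f756967f9e9ca394464a
```

This is why a million identical files are cheap to *store* in Git; the cost is in discovering which of the million working-tree paths changed.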

thrashh | 2 years ago

I mean you can design a filesystem to handle a million files extremely quickly... it just has to be in the requirements up front.

But there will be some trade-off.

And I don't think people generally put "a million files" in the requirements because it's fairly rare.

eigenvalue | 2 years ago

In my experience, the standard linux file system can get very slow even on super powerful machines when you have too many files in a directory. I recently generated ~550,000 files in a directory on a 64-core machine with 256 GB of RAM and an SSD, and it took around 10 seconds to run `ls` on it. So that could be part of it too.

zamadatix | 2 years ago

It sounds suspiciously like you measured the time to display 500k lines in the terminal instead of the time to ls.

tp34 | 2 years ago

What is the "standard linux file system"?

ext4 on an old system, feeble in comparison to yours, performs much better.

ext4, 8 GB memory, 2-core Intel i7-4600U 2.1 GHz, Toshiba THNSNJ25 SSD:

  $ time ls -U | wc -l
  555557

  real  0m0.275s
  user  0m0.022s
  sys   0m0.258s

stat(2) slows it down, but still this is not as poor as your results:

  $ time ls -lU | wc -l
  555557

  real  0m2.514s
  user  0m1.126s
  sys   0m1.407s

Sorting is not prohibitively expensive:

  $ time ls | wc -l
  555556

  real  0m1.438s
  user  0m1.249s
  sys   0m0.193s

Drop caches, sort, and stat:

  # echo 3 > /proc/sys/vm/drop_caches

  $ time ls -lU | wc -l
  555557

  real  0m6.431s
  user  0m1.249s
  sys   0m4.324s

Frannyies | 2 years ago

Funny how different the view can be.

I always marvel at it and think: "Wow, so git goes through its history, pulls out many small files, chunks, and patches, and updates the whole file tree, and all of this is done seemingly the instant after hitting enter."

kudokatz | 2 years ago

> Are there some fundamental aspects of git that would make it either very difficult to improve that, or that would sacrifice some important benefits if they were made?

I can't speak to improving git, but I think Linus's 2007 tech talk at Google sheds some light on this area.

1. Linus says there's a specific focus on full history and content, not files ... so it's a deliberately different axis of focus than file count:

https://youtu.be/4XpnKHJAok8?t=2586

... AND it's a specific pitfall to avoid when using Git:

https://youtu.be/4XpnKHJAok8?t=4047

2. As Linus tells it, Git appears to be designed specifically for project maintenance while not getting in the way of individual commits and collaboration. The global history and the more expensive operations, like asking "who touched this line", are deliberate: they let the lines of a function be tracked across all moves of the content itself.

Maintainer tool enablement: https://youtu.be/4XpnKHJAok8?t=3815

Content tracking slower than file-based "who touched this": https://youtu.be/4XpnKHJAok8?t=4071
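For reference, that content tracking is visible through flags like `--follow` and blame's `-C`; a sketch in a throwaway repo (file names are illustrative):

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git config user.email you@example.com
git config user.name you
echo 'some line' > old.txt
git add old.txt && git commit -qm 'add old.txt'
git mv old.txt new.txt && git commit -qm 'rename to new.txt'
git log --follow --oneline -- new.txt  # both commits, tracked across the rename
git blame -C -C -C new.txt             # -C also chases lines copied between files
```

As Linus notes in the talk, these are slower than a file-based "who touched this" precisely because Git searches content rather than consulting per-file metadata.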

===

I have no answer, but ...

Practically, I've used lazy filesystems both for Windows-on-Git via GVFS [1][2] and for Google's monorepo jacked into a Mercurial client (I think that's what it is?). Both companies have made this work, but as Linus says, a lot of the stuff just doesn't work well with either system.

Windows-on-Git still takes a lot of time overall, and stacking > 10 patches of an exploratory refactor with the monorepo on hg starts slowing WAY WAY down to the point where any source control operations just get in the way.

[1] https://devblogs.microsoft.com/devops/announcing-gvfs-git-vi...

[2] https://github.com/microsoft/VFSForGit