
Git ls-files is Faster Than Fd and Find

156 points | todsacerdoti | 4 years ago | cj.rs

75 comments

[+] cb321|4 years ago|reply
I got run times from the simplest single-threaded directory walk that are only 1.8x slower than git ls-files. (Min time of 10 runs with the git repo housed by /dev/shm on Linux 5.15.)

The "simple" code is in https://github.com/c-blake/cligen/blob/master/cligen/dents.n... (plain `dents find` does not require the special kernel batch-syscall module to be fast. That kernel module is more about statx batching, but io_uring can also do that these days. For those unfamiliar with Nim, it's a high-productivity, high-performance systems language.)

I believe that GNU find is slow because it is specifically written to allow arbitrary filesystem depth, as opposed to depth limited by the open-file-descriptor limit. (I personally consider this over-/mis-engineered, a holdover from bygone systems with default ulimits of, say, only 64 open fds. There should at least be a "fast mode", since, let's be real, file hierarchies deeper than 256..1024 levels, while possible, are rare, and one should optimize for the common case, or at least allow manual instruction that this case prevails. AFAIK there is no such "fast mode". Maybe on some BSD?)

Meanwhile, I think the Rust fd is slow because of (probably counterproductive) multi-threading (at least it does 11,000 calls to futex).
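cb321's "simple" walk is in Nim; a rough single-threaded Python equivalent (a sketch, not the linked code) shows the trade-off being described: recursing with one open directory per level means the maximum depth is bounded by the process's fd limit, and using scandir's cached dirent type avoids a stat() per entry.

```python
import os

def walk(path):
    """Simplest single-threaded walk: one open directory fd per
    recursion level, so maximum depth is bounded by the open-fd
    ulimit -- the trade-off cb321 describes GNU find avoiding."""
    with os.scandir(path) as it:          # fd stays open while we recurse
        for entry in it:
            yield entry.path
            # d_type from readdir() is cached on the entry, so is_dir()
            # usually needs no extra stat() syscall
            if entry.is_dir(follow_symlinks=False):
                yield from walk(entry.path)
```

On filesystems that report d_type, this never stats regular files at all, which is the main cost the "go-to C approaches" mentioned below pay needlessly.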

[+] bscphil|4 years ago|reply
> I believe that GNU find is slow because it is specifically written to allow arbitrary filesystem depth as opposed to "open file descriptor limit-limited depth".

I haven't benchmarked find specifically, but I believe the most common Rust library for the purpose, walkdir[1], also allows arbitrary file system recursion depth, and is extremely fast. It was fairly close to some "naive" limited depth code I wrote in C for the same purpose. A lot of go-to C approaches seem to needlessly call stat on every file, so they're even slower.

I'd be curious to see benchmarks of whether this actually makes a difference.

[1] https://github.com/BurntSushi/walkdir

[+] masklinn|4 years ago|reply
> Meanwhile, I think the Rust fd is slow because of (probably counterproductive) multi-threading (at least it does 11,000 calls to futex).

There’s probably a switch to run single-threaded so that should be testable.

[+] R0b0t1|4 years ago|reply
From experience benchmarking this in the past, Windows' poor NTFS implementation sees a speedup from multithreading (which I can't make sense of), whereas Linux usually paid a penalty for multithreaded directory traversals.

But, as mentioned below, this is faster because git ls-files is using a pregenerated index.

[+] Dylan16807|4 years ago|reply
You could design the code to keep 25 directories open at a time or something, right? Without needing a 'fast mode' that could break in niche circumstances.
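Taking that idea to its extreme (a sketch, not how find is actually implemented): if you read each directory fully into memory and close it before descending, you get unlimited depth with only one directory fd open at any moment, trading fd usage for memory.

```python
import os

def walk_one_fd(root):
    """Depth-unlimited walk that keeps at most ONE directory fd open.

    Each directory is read completely and closed before any child is
    visited, so fd usage is O(1) while the pending list grows with the
    size of the tree rather than its depth."""
    pending = [root]
    while pending:
        path = pending.pop()
        entries = []
        with os.scandir(path) as it:      # fd opened here...
            for entry in it:
                entries.append((entry.path,
                                entry.is_dir(follow_symlinks=False)))
        # ...and closed here, before descending into any child
        for child, is_dir in entries:
            yield child
            if is_dir:
                pending.append(child)
```

Keeping 25 or so directories open, as suggested above, would be a middle ground between this and the one-fd-per-level recursive approach.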
[+] aaaaaaaaaaab|4 years ago|reply
>Meanwhile, I think the Rust fd is slow because of (probably counterproductive) multi-threading (at least it does 11,000 calls to futex).

Afaik rustaceans call this “fearless concurrency”.

[+] filmor|4 years ago|reply
Well, first doing `find > .my-index` and then measuring `cat .my-index` would give you even better results... I don't find it noteworthy that reading from an index is faster than actually recursively walking the filesystem.
[+] rakoo|4 years ago|reply
No, it's not surprising, so why do we still not use indexes for this?

NTFS maintains a journal of all file modifications (https://en.wikipedia.org/wiki/USN_Journal). This is used by Everything (https://www.voidtools.com/support/everything/) to quickly and efficiently index _all_ files and folders. Thanks to that, searching for a file is instantaneous because it's "just an index read".

The feature is common: listing files. We know that indexes help solve the issue. But we still use `find` because there's no efficient file-indexing strategy. Yes, there's updatedb/mlocate, but updating the db requires a full traversal of the filesystem, because there's no efficient listing of changes in Linux, only filesystem-specific mechanisms.

So we will still have articles like this until the situation changes, and it will still be relevant to the discussion because it's not "solved"

[+] yakubin|4 years ago|reply
Yes. I once wrote a tool which spends a lot of time traversing the filesystem, and to my surprise, in one scenario most of its time is spent on-CPU (!) in kernel mode, in the readdir_r(2) syscall implementation. I still haven't dug into what it's doing on the CPU, but it sure is interesting.
[+] cseleborg|4 years ago|reply
Yeah, that's not really the interesting part. It's clever because if you're a developer editing source files, the Git index is relatively up-to-date. So, as the author says, why not take advantage of it?
[+] c_joly|4 years ago|reply
> I don't find it noteworthy that reading from an index is faster than actually recursively walking the filesystem

I agree; what I found noteworthy, though, is that git ls-files uses this index you already have.

(author here, “proof”: https://cj.rs/contact/ & https://keybase.io/leowzukw)

[+] bee_rider|4 years ago|reply
Of course scanning an index is faster than traversing the filesystem.

Is locate/mlocate some obscure command? It works pretty well for this sort of thing (and has the advantage that you wouldn't need to put git repos everywhere, or something silly like that).

I often forget what I've named a pdf that I've downloaded, but usually I'll put something related to the topic of a paper in the file name, so a command like:

    locate -i "*svd*.pdf" 
will have a decent chance of finding any papers about svd's that I have, and is more or less instantaneous.

Although, I think I did first come across the locate command while searching for alternatives to 'find'. I can not, for the life of me, remember the correct order of find's arguments: the path to start searching, then the filename.
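The idea behind locate is simple enough to sketch in a few lines (a toy, with made-up names like `updatedb`/`locate` mirroring the real tools): walk the tree once into a flat index file, then answer queries with a linear scan of that file instead of a filesystem traversal.

```python
import fnmatch
import os

INDEX = os.path.expanduser("~/.toy-locate.db")  # hypothetical index path

def updatedb(root):
    """Walk the tree once and store every path, one per line
    (the expensive step real updatedb runs periodically)."""
    with open(INDEX, "w") as f:
        for dirpath, dirnames, filenames in os.walk(root):
            for name in dirnames + filenames:
                f.write(os.path.join(dirpath, name) + "\n")

def locate(pattern):
    """Case-insensitive glob match against the index -- no disk
    traversal, just a linear scan of one file, like `locate -i`."""
    pattern = pattern.lower()
    with open(INDEX) as f:
        return [line.rstrip("\n") for line in f
                if fnmatch.fnmatch(line.rstrip("\n").lower(), pattern)]
```

`locate("*svd*.pdf")` then returns instantly, at the cost of the index going stale between updatedb runs, which is exactly the trade-off discussed below.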

[+] quicklime|4 years ago|reply
The problem with locate is that it requires the index db to be updated periodically (I think this happens daily by default?). For some use cases, especially those where I'm searching for files in a tree that I'm actively working in, this forces me to fall back to find (or maybe git ls-files now).

I feel like it should be the job of the filesystem to maintain an index and incrementally update it whenever files are created or removed, but afaict no modern filesystem offers it.

[+] njharman|4 years ago|reply
> source code management system (SCM), its main business is not to help you list your files!

This seems an utterly bizarre statement. What does the author imagine an SCM does? Many things, but listing the files whose source it is managing is up there.

It's like saying an FPS is a game, it's not about pushing triangles through the GPU.

The larger purpose is facilitated by, and requires, the lesser purpose.

[+] tyingq|4 years ago|reply
A warm buffer cache makes a big difference too, so if you're benchmarking things like find vs other tools, be sure to empty the cache between runs.

For Linux:

  sync; echo 3 > /proc/sys/vm/drop_caches
In this case, the author is using hyperfine with --warmup 10, so the numbers are all using a warm buffer cache. A cold cache probably would have been more realistic for comparison, since the benchmark is traversing lots of directories.
[+] kelnos|4 years ago|reply
Perhaps, but it depends on how it's used in the real world. The author was benchmarking tools for finding files via a text editor. If that's something done once, then sure, cold caches make sense. But if it's done frequently, presumably those caches will be warm for all but the first run, so the expected, common-case performance is with a warm cache.
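You can see the warm-cache effect from userspace without root (actually dropping caches requires writing /proc/sys/vm/drop_caches): time the same traversal twice, and the second run is served from the kernel's dentry/inode caches. A rough sketch:

```python
import os
import time

def count_entries(root):
    """Traverse root, counting entries and timing the walk.
    os.walk silently skips directories it cannot read."""
    n = 0
    t0 = time.perf_counter()
    for dirpath, dirnames, filenames in os.walk(root):
        n += len(dirnames) + len(filenames)
    return n, time.perf_counter() - t0

# The first run may touch disk; the second typically hits the
# dentry/inode caches and is much faster on a cold start.
n1, first = count_entries("/etc")
n2, second = count_entries("/etc")
print(f"{n1} entries: first {first:.4f}s, second {second:.4f}s")
```

With hyperfine, `--warmup 10` deliberately measures the warm case; a `--prepare` command that drops caches would measure the cold one instead.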
[+] gorgoiler|4 years ago|reply
One thing I’ve never understood about Linux filesystems: given how small and bounded the sets of directories and directory entries are, why is filesystem traversal not instantaneous?
[+] topspin|4 years ago|reply
> why is filesystem traversal not instantaneous

Because a mounted file system isn't a simple indexed list of paths. File systems are shared, mutable state.

The mechanism you're asking about is called the dentry cache[1], and a decent narrative of its operation is found here[2]. It has the fabulously complex job of atomically brokering the state of an arbitrary number of local and remote file systems between an arbitrary number of processes using arbitrary access patterns, without consuming all available RAM. Expecting the dentry cache to yield 'instantaneous' results is unreasonable. Comparing its performance to that of a VCS index is naïve.

[1] https://www.kernel.org/doc/Documentation/filesystems/vfs.txt [2] https://www.halolinux.us/kernel-reference/the-dentry-cache.h... (no endorsement of these people, but the dentry cache description is both concise and sufficient.)

[+] m000|4 years ago|reply
What you describe is an over-specialized optimization that very few users would benefit from, but would still introduce significant complexity.

Linux already transparently caches filesystem metadata. You already get a good speedup if you attempt the same directory walk twice and not much has changed in the filesystem.

[+] wvh|4 years ago|reply
Probably having to check whether the actual disk entries changed is what slows it down. I wonder whether, with today's memory sizes, it would be possible to keep all metadata in memory as a write-through cache. I'm not sure it'd be worth it, though: my system has close to half a million files, but I'm only interested in about a hundred of them. I don't think filesystems are slow in practice for typical human-scale operations, with the exception of non-indexed "search all my files" operations.
[+] amelius|4 years ago|reply
Also, why can tab-completion in Bash hang my shell?
[+] spicybright|4 years ago|reply
Lack of caching, I'd assume. Not because it would be hard, but because no one has taken the time to implement it.
[+] mikewhy|4 years ago|reply
FYI to anyone looking to try this out: the `--others` flag, used to include untracked files, will also print files matched by your .gitignore. So if you have, e.g., a node_modules/ or venv/ folder, all of its contents will be listed.

This is often unwanted noise. I haven't been able to find if there's some combination of flags that would get the desired behaviour, but it's been a while since I've messed around with this.

[+] xyzzy_plugh|4 years ago|reply
What's the desired behaviour you want? `--cached --others --exclude-standard` will show tracked and untracked files, but not ignored ones.
[+] rmetzler|4 years ago|reply
A small typo occurs twice: git ls-file instead of git ls-files.

Thanks for bringing it up, I didn't know about the command. I also might have a use case for it.

[+] c_joly|4 years ago|reply
Thanks for pointing this out and for your kind words. A fix for the typos should be live soon.
[+] cryptonector|4 years ago|reply
This is why I have a git alias `ls` for `ls-files`.