top | item 36783449

Bfs 3.0: The Fastest Find Yet

119 points| rurban | 2 years ago |tavianator.com | reply

53 comments

order
[+] eviks|2 years ago|reply
Is there a chance for an Everything alternative? That's the awesome tool that would allow you to ignore the depth/breadth completely find any file literally instantly (the power of instant feedback is unapralleled) Do non-NTFS systems (even the modern ones) not offer this?
[+] ksherlock|2 years ago|reply
Mac OS spotlight has more-or-less instant file search. Command-F in finder or mdfind in the terminal.
[+] 112233|2 years ago|reply
I have not used Everything, is it aomething similar to SwiftSearch (bypasses fs and parses ntfs disk directly, so no indexing and instant results) ?
[+] mozball|2 years ago|reply
the 'locate' command is faster than everything but arguably not as convenient (being a commandline tool). It seems to support non-ntfs (meaning it doesn't support ntfs) linux filesystems.

edit: catfish is a gui front-end available for locate.

[+] rurban|2 years ago|reply
Related: https://mastodon.social/@pervognsen/110739397974530013

But this tests perf for only finding one file. bfs is optimized for finding many files at once, as it defers closing the dirhandle, and uses io_uring for parallel IO callbacks.

[+] tavianator|2 years ago|reply
`bfs` doesn't actually use io_uring yet, but it is planned. I'm not sure I'd say it's specifically optimized for finding multiple files at once either, I try to make it fast for many different use cases. There's two benchmarks in the blog post and a few more that I run regularly, e.g. https://github.com/tavianator/bfs/pull/107
[+] austin-cheney|2 years ago|reply
I am performing a similar file system tree navigation asynchronously in Node.js which is just a shallow API over the C Linux FS APIs.

I can see you are using opendir and closedir functions? What is the benefit from using the opendir function[1] when readdir[2] can be called on a location directly? Is the benefit that opendir returns a file descriptor for use in opening a stream to gather directory object descriptors?

[1] https://man7.org/linux/man-pages/man3/opendir.3.html

[2] https://man7.org/linux/man-pages/man3/readdir.3.html

Your project is probably more mature but if you want an alternate approach to examine here is I have been doing it: https://github.com/prettydiff/share-file-systems/blob/master...

I considering changing my use of readdir to use the withFileTypes option so that it returns a list of directory entries (objects of artifact name and type) instead of a list of conditions to discern types like I am doing on lines 382-432.

[+] Zababa|2 years ago|reply
Nice to see other alternatives to find. I personally use fd (https://github.com/sharkdp/fd) a lot, as I find the UX much better. There is one thing that I think could be better, around the difference between "wanting to list all files that follow a certain pattern" and "wanting to find one or a few specific files". Technically, those are the same, but an issue I'll often run into is wanting to search something in dotfiles (for example the Go tools), use the unrestricted mode, and it'll find the few files I'm looking for, alongside hundreds of files coming from some cache/backup directory somewhere. This happens even more with rg, as it'll look through the files contents.

I'm not sure if this is me not using the tool how I should, me not using Linux how I should, me using the wrong tool for this job, something missing from the tool or something else entirely. I wonder if other people have this similar "double usage issue", and I'm interested in ways to avoid it.

[+] thecodrr|2 years ago|reply
Would love to see how bfs compares to fdir[0] for directory traversal. Even though fdir is using Node.js underneath, the comparisons I have done with fd & find are pretty close. Of course, bfs would probably be quite a bit faster...but how much faster exactly?

Disclaimer: I am the developer of fdir.

[0] https://github.com/thecodrr/fdir

[+] tavianator|2 years ago|reply
From a quick test, about 5x faster:

    tavianator@tachyon $ cat bench.mjs
    #!/usr/bin/env node

    import { fdir } from "fdir";

    console.log(new fdir().withFullPaths().crawl("/home/tavianator/code/android").sync().length);
    tavianator@tachyon $ hyperfine -w1 "node ./bench.mjs" "bfs ~/code/android -false"
    Benchmark 1: node ./bench.mjs
      Time (mean ± σ):      2.073 s ±  0.031 s    [User: 1.372 s, System: 1.260 s]
      Range (min … max):    2.022 s …  2.128 s    10 runs

    Benchmark 2: bfs ~/code/android -false
      Time (mean ± σ):     417.5 ms ±   6.8 ms    [User: 592.3 ms, System: 2487.7 ms]
      Range (min … max):   405.2 ms … 429.7 ms    10 runs

    Summary
      bfs ~/code/android -false ran
        4.97 ± 0.11 times faster than node ./bench.mjs
Is there a way to call it that doesn't require holding all the paths in memory simultaneously?
[+] rizky05|2 years ago|reply
At a glance, 7.6 million files in under 2.xx seconds using bfs. And fdir 1 million in 1 second. So bfs is 2-3 times faster than fdir.
[+] konart|2 years ago|reply
>bfs operates breadth-first, which typically finds the file(s) you're looking for faster.

I find this statement very confusing.

Typically I'm looking for files in subdirs because I'm not aware of target's path. That's depth first (though of couse there is a change I'm going to find my target quite close to the starting point).

And if I'm looking in the same dir I'm currently in (or I know what dir to look at) - then there is no need for a sophisticated search anyway, as typicall you don't have too many files in a directory unless this is some kind of tmp or cache directory.

[+] rowbin|2 years ago|reply
breath-first doesn't mean it's not looking into subdirectories. It is, but it looks into each subdirectory before decending into the different subdirectories subdirectories.

So it enger first subdir, looks at files there, then goes back up, enters second subdirectory, looks at files there, ... After checking each subdir it goes into the first subdir oft the first subdir, looks at files there, then goes to second subdir of first subdir

[+] konart|2 years ago|reply
I need some test vs fd.
[+] seeknotfind|2 years ago|reply
This is so much nicer than using find while iterating max depth.
[+] leeman2016|2 years ago|reply
There is a mistake in the README of the github repo:

In the 1st part of the "Features" section, the "bfs" and "find" columns' contents are swapped. I was a bit confused while reading it and it took me a while to notice that it was a mistake.

[+] tavianator|2 years ago|reply
Thanks for the heads up! It's fixed now.
[+] Dowwie|2 years ago|reply
I keep coming across examples where io_uring cannot be used. It seems like functionality that needs to be scrapped and replaced by something more robust. It's the fastest thing that no one seems to be able to use.
[+] tavianator|2 years ago|reply
Or maybe just iterated on until it gets more functionality? Linux has quite a large API surface, I'm not surprised it takes a while to replicate all of it.
[+] kristopolous|2 years ago|reply
I tried it out. If there's a lot of results, it seems to fare poorer.

However as a disclosure, I was running this on my "bedside laptop" which is a Core2 from like 15 years ago.