top | item 34439830

Folders with high file counts

89 points | mrzool | 3 years ago | bombich.com

117 comments

[+] c0l0|3 years ago|reply
At $oldjob, taking care of a busy and successful web estate that is now close to 25 years old, one of the ugliest and longest-standing warts was the "image store". That was a simple, flat directory on a single node, shared over NFS, which had accumulated more than 1.2 million (yes, 1_200_000) inodes/files in a single directory. No-one wanted (read: dared) to properly fix the code to rid it of this once-convenient assumption and (lack of) hierarchy, so I tried to work around the ever-growing pain by tuning the filesystem, the host's kernel (a buffed dentry cache size goes a long way at making this kind of abuse more tolerable, for instance), and the NFS server and clients involved to mitigate emerging problems as much as possible. In the end, a readdirplus()/getdents()-chain invoked over NFS only took a few seconds to finish. It's pretty amazing what Linux can cope with, if you make it have to :)
[+] gregmac|3 years ago|reply
I've dealt with a different situation before and handled it with sharding. It's pretty easy to retrofit in code, and it's not hard to do a one-time move of existing files to the new layout.

Simply put, you make an algorithm like:

    const crypto = require('crypto');
    const path = require('path');
    const hash = crypto.createHash('md5').update(filename).digest('hex');
    const folderPath = path.join(root, hash.slice(0, 1), hash.slice(0, 2), hash.slice(0, 3), filename);
The original name can be used to derive the sharded path, so you don't need a lookup database or anything crazy like that.

You end up storing files like:

    .../7/7f/7fa/somefile.png
This scales to whatever depth you want: with 3 levels and 1.2 million files, you can expect about 1200000/16^3 = 293 files per directory, which is trivial.

There's lots of other strategies, too, depending on your needs:

    # filenames are already GUIDs or hashes:
    .../1/1f/1fd/1fd2dd27-d307-47b2-b37d-11903fd0f03d.png

    # date-based to make cleaning up old files easy:
    .../2023/01/19/app.log

    # hash by existing numeric identifier (like customerid) using modulus:
    .../customers/4/94/52194/customerfile.dat
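For illustration only (a sketch of the idea above, not the commenter's actual code), the hash-prefix sharding fits in a few lines of Python; `root` and the depth are placeholders:

```python
import hashlib
import os


def sharded_path(root, filename, depth=3):
    """Derive a shard directory chain from the file's own name.

    Growing prefixes of the hex MD5 digest (e.g. 7, 7f, 7fa) become the
    directories, so the path is recomputable from the name alone and no
    lookup database is needed.
    """
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    prefixes = [digest[: i + 1] for i in range(depth)]
    return os.path.join(root, *prefixes, filename)
```

For example, `sharded_path("/srv/images", "somefile.png")` yields a path of the shape `.../x/xy/xyz/somefile.png`, where x, y, z are the leading hex digits of the name's MD5.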
[+] sixothree|3 years ago|reply
At the same time, nobody would bat an eye at 1 million rows in a database table. People don't really start to notice until you hit 100 million.
[+] desro|3 years ago|reply
I'd be very interested in a more detailed write-up of what you did here -- sounds pretty impressive.
[+] nine_k|3 years ago|reply
It's amazing how 1.2M entries in a directory is a pathologically slow case for a file system, but a trivial number of table rows for a database.

Of course, a B-tree is somewhat larger than whatever was used to store inode data in the (ext2?) filesystem 25 years ago.

[+] jakub_g|3 years ago|reply
As a consumer of Android devices, one thing that's super annoying is that all pictures you take with a camera are stored in a single massive folder.

When you connect to the device from your computer over USB (which is usually USB 2.0 even if USB-C, except on major 2021+ models), it will take forever to enumerate the files in that folder. Once you start copying, there's a big chance it hangs, so you need to disconnect the device and go through that painful process again.

(I know, I'm oldschool, I don't have automatic cloud backup enabled).

[+] MawKKe|3 years ago|reply
..and always with super helpful naming scheme like IMG_0001.JPG, IMG_0002.JPG, ... that restarts from the lowest unused number after cleaning up phone storage, ensuring you'll have naming collisions in your backup folder
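One hedged way around that on the backup side (a hypothetical helper, not an established tool): fall back to a short content-hash suffix when the incoming name already exists in the backup folder:

```python
import hashlib
import os
import shutil


def backup_without_collisions(src, backup_dir):
    """Copy src into backup_dir; if the name is already taken, disambiguate
    it with a short content hash (IMG_0001.JPG -> IMG_0001.<hash>.JPG)."""
    name = os.path.basename(src)
    dest = os.path.join(backup_dir, name)
    if os.path.exists(dest):
        with open(src, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()[:8]
        stem, ext = os.path.splitext(name)
        dest = os.path.join(backup_dir, "{}.{}{}".format(stem, digest, ext))
    shutil.copy2(src, dest)
    return dest
```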
[+] londons_explore|3 years ago|reply
Isn't that mostly because Android devices use the old-school "picture transfer protocol" mode, which involves a round trip to the device for every single object in a directory, rather than being able to just retrieve a directory listing all in one request like typical filesystems?
[+] totetsu|3 years ago|reply
try

adb-sync --reverse /sdcard/DCIM(?) ~/photos.go.here

[+] greatgib|3 years ago|reply
Yes, this is my constant nightmare too! At around 10k photo files, Android starts to become painful to use: the file browser app takes ages to show the file list or crashes, and, as you said, it's a terrible experience when trying to get files onto a computer.

This is mostly because the MTP protocol is really shitty, supporting only a single channel at a time and having to send the whole list of all files at once. But it's also because Android's implementation of it is very buggy.

I was always stunned that for a dozen years, Google stupidly spent millions uselessly reworking the UX twenty times but never solved real, serious usability problems like this one!

[+] sidpatil|3 years ago|reply
I gave up on USB file transfers off of Android phones for that reason.

A faster option is to install an HTTP or FTP server app on the phone, then download the photos through the network.

[+] anshorei|3 years ago|reply
I've always loved how shotwell does it: "YYYY/MM/DD/filename.ext"

The only downside has always been that merging your photo libraries when you did two different things can give some funny results. Shotwell will split the events based on the import, but if you browse the files without shotwell or reimport them later you lose that.

[+] nine_k|3 years ago|reply
Getting the images via e.g. Syncthing running on the phone does not seem to hit this problem. It likely won't go over USB, though; but Wireguard or, better, Nebula are easy to set up on both your phone and your router / NAS / desktop.
[+] rr888|3 years ago|reply
iOS is terrible from Windows too. I figured they really don't care, as you aren't a good customer.
[+] etra0|3 years ago|reply
A bit tangential, but recently my partner had her Google Drive with around ~3,000 files in the root folder (created mostly by Google Classroom), which means the Files app on the iPad couldn't show them all, because for some reason it limits it to 500 files only.

So naturally the next step was to try to clean up the directory. We tried through the webpage: deleting in chunks of 300 files consumed around ~8GB of RAM... and it was slow as hell, and her laptop is a bit old. I moved on to my desktop, and selecting 500 files consumed ~10GB of RAM; it was still slow.

I thought of using Google Colab to access the Drive as a filesystem, but no dice there either because the Google account wasn't managed by her.

In the end, we tried the iPad app: it took like 8 minutes to be able to select all the files, and when deleting them, it took about an hour to actually do it. I imagine it was submitted in batches.

It was stupidly painful.

[+] Nextgrid|3 years ago|reply
FYI, "rclone" is a good tool to access proprietary cloud storage platforms from a consistent CLI.
[+] crazygringo|3 years ago|reply
In case anybody else ever needs to do this, install the official Google Drive for Desktop client on a Mac or PC, and set it to stream (not mirror). And then just use the command-line to delete everything in one fell swoop.
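A hedged sketch of that last step; `DRIVE_ROOT` is a placeholder for wherever the streamed drive mounts (on macOS typically under ~/Library/CloudStorage, and the account-specific folder name below is an assumption), so verify the path with `ls` before running anything destructive:

```shell
# DRIVE_ROOT is an assumed placeholder -- point it at the streamed mount
# after confirming the path yourself.
DRIVE_ROOT="$HOME/Library/CloudStorage/GoogleDrive-you@example.com/My Drive"

# Remove only loose top-level files, leaving folders (and their contents) alone.
if [ -d "$DRIVE_ROOT" ]; then
  find "$DRIVE_ROOT" -maxdepth 1 -type f -delete
fi
```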

GUI file managers are generally not designed for many thousands of files, especially web/mobile ones.

That's very strange that Classroom was creating files in root, however. It's supposed to create them in a single "Classroom" folder in root. You can move the folder if you like (e.g. as a subfolder) but there shouldn't even be any way to point Classroom to use the root-level folder for classes/assignments. (Plus the internal folder structure is hierarchical by class/assignment, so there also shouldn't be thousands of files/folders in a folder in any part of it.)

[+] jaxrtech|3 years ago|reply
Perhaps this is one of the reasons I've seen Linux deployments use XFS (e.g. AWS). If you page through the filesystem documentation, once a directory hits a certain size, it actually switches over to using B+ trees like an RDBMS would.

https://www.kernel.org/pub/linux/utils/fs/xfs/docs/xfs_files... (section 16.2 on PDF pg 127)

[+] pjdesno|3 years ago|reply
Nowadays ext4 has dir_index enabled by default, so it uses hashed B-trees for its directories.

Ric Wheeler posted this nearly a decade and a half ago: "Strangely enough, I have been testing ext4 and stopped filling it at a bit over 1 billion 20KB files on Monday (with 60TB of storage)." and goes on to describe some performance numbers - which would be a lot better on modern hardware. https://listman.redhat.com/archives/ext3-users/2009-Septembe... There's a talk about it, as well: https://lwn.net/Articles/400629/

Unfortunately, it seems like a lot of applications (including ls) default to rather inefficient ways of enumerating files in a directory.
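For what it's worth, the more efficient streaming interface is exposed in most languages; a small Python sketch using `os.scandir`, which yields entries lazily and reads the file type from the directory record itself:

```python
import os


def count_regular_files(path):
    """Count regular files without materializing the whole listing.

    os.scandir streams directory entries and exposes the entry type
    cheaply, so huge directories don't force a stat() call per name the
    way naive enumeration does.
    """
    total = 0
    with os.scandir(path) as it:
        for entry in it:
            if entry.is_file(follow_symlinks=False):
                total += 1
    return total
```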

[+] klooney|3 years ago|reply
AWS probably uses XFS because RedHat defaults (defaulted?) to XFS.
[+] jonwinstanley|3 years ago|reply
I worked at a place where the directory was split based on a character or two from the start of a hash.

They had millions of profile images but didn't want them all in one directory, so they hashed profile ID and used the first 2 letters as the name of a sub-directory. So you end up with sub-directories called aa, ab, ac, ad etc.

It's not perfect but I suppose the original creator had seen issues in the past when directories have too many files in them.

[+] layer8|3 years ago|reply
Ideally, you want the number of letter combinations to roughly correspond to the square root of the number of files you expect to store, to balance the number of directories against the number of files stored per directory. Assuming 26 letters, two letters allow 676 combinations, so the total expected number of files shouldn't exceed ~457,000 (676 squared), or else each subdirectory runs the risk of containing more than 676 files on average (making it unbalanced with the number of directories).

More realistically, you would use base32 or hex encoding. For example, if you want to store up to around 16 million files (2^24) and use a 64-bit hash with hex encoding (16 hex characters), you would use the first three hex characters (12 bits) for the directory name (since 2^12 is the square root of 2^24), and the remaining 13 hex characters for the file name. As a result, the 16 million files would be stored in 4096 directories with roughly 4000 files each.
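The square-root balancing above can be sketched as follows (a hypothetical helper that just restates the comment's arithmetic):

```python
def shard_plan(expected_files, alphabet_size=16):
    """Pick the shortest hash prefix whose directory count is at least
    sqrt(expected_files), so directories ~ files-per-directory.

    Returns (prefix_chars, directory_count).
    """
    prefix_chars = 1
    # Grow the prefix until dirs**2 >= expected_files, i.e. dirs >= sqrt(n),
    # using integer arithmetic to avoid floating-point log edge cases.
    while (alphabet_size ** prefix_chars) ** 2 < expected_files:
        prefix_chars += 1
    return prefix_chars, alphabet_size ** prefix_chars
```

`shard_plan(2**24, 16)` reproduces the comment's numbers: a 3-hex-character prefix, giving 4096 directories of roughly 4096 files each.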

[+] pixl97|3 years ago|reply
Things like the Squid cache/proxy did (does?) this so you don't stick millions of files in the same directory on a busy cache and kill performance. Especially back in the old days when things were on spinning disks.
[+] recuter|3 years ago|reply
Seems perfectly fine. I remember having to do this trick on S3 for similar reasons.
[+] bob1029|3 years ago|reply
Developers can do a lot to fix this by simply choosing SQLite to store all the local things.

Performing backups of our production apps used to take hours (especially in cheap clouds) because of all the loose files. Today, it takes about 3-5 minutes since there are just a handful of consolidated files to worry about.

[+] NovemberWhiskey|3 years ago|reply
Historically, one of the reasons not to create one massive blob with all your stuff in it is containment of the blast radius of any kind of failure. The filesystem at least has (probably) reasonable consistency checking and repair tools, and you're likely to see integrity problems discovered at a file level, rather than experience catastrophic losses.
[+] gmuslera|3 years ago|reply
Or at least using a multilayered approach i.e. all the files starting with A in the A subfolder. You still have to deal with that amount of files, but you don't lose so much time dealing with filesystem/directory operations.

And that might be better for accessing individual files outside the app than a big binary blob whose format you may have no clue about (or for corruption/deletion of that single file). That's why there are no universal solutions; there are many different use cases.

[+] bfgoodrich|3 years ago|reply
One of the trends on Hacker News right now is the assumption that SQLite fixes everything. It's the Flex Tape of the software world.

There is zero reason a process would have more than tiny slowdowns with even millions of files in a folder. Finder has problems if you're trying to look at that folder, for obvious reasons, but it's a bit of a self-own for a backup co to claim that 200,000 files causes their solution to break. That speaks to serious algorithmic issues.

DISCLAIMER: This comment will be auto-dead because of moderation choices by dang (e.g. his pernicious need to pander to the anti-science, far-right crowd). This is a badge of honor. Never vouch for my comments.

[+] makeitdouble|3 years ago|reply
This is an issue that is pretty common when auto generating files.

For instance, when generating receipt PDFs it could feel natural to store them in folders by account ID. Except there will be a bunch of accounts generating 20 or 30 receipts a day, which isn't much on the face of it. But within months it becomes a pain to list receipts across accounts; within a year or two even individual accounts' receipts become a nightmare to list, and fixing the situation requires a few tricks to avoid all the tools that assume directory listings cost nothing.

[+] didgetmaster|3 years ago|reply
This is just another in a long list of problems that existing file systems have, being built on an architecture that was created before there was sufficient storage to hold more than a few hundred total files (decades ago).

I have been working on a new data manager that could replace existing file systems with something much better. You can store hundreds of millions of files in a single container (I call them pods instead of volumes) and put tags on everything. Folders can hold millions of files with virtually no degradation in performance. Searches to find subsets of files based on tags or other criteria are lightning fast.

The software has been in beta for about a year and is available for free download at www.Didgets.com, yet interest has been very moderate in spite of problems like the one discussed in this thread.

demo video: https://www.youtube.com/watch?v=dWIo6sia_hw

[+] zeta0134|3 years ago|reply
So we found out the hard way that having MultiViews enabled in Apache, in your otherwise static folder full of image files, is a reeeeeally bad idea if that folder is filled by automation and contains millions of files. That was a fun support call. :) "Why is our site giving 500s after less than 10 minutes? What are all of those workers doing?"
[+] Joker_vD|3 years ago|reply
Why is it actually so difficult for filesystems to deal with such folders? I mean, a million is not such a large number, not for a computer anyhow. A table with a million rows doesn't generally cause an RDBMS to choke, so why should filesystems?
[+] crazygringo|3 years ago|reply
It's a general issue that occurs whenever you scale from small to large things.

Generally speaking, things that are expected to be "small" are processed as whole units. Memory is allocated to store the whole thing, a function won't return until it's processed the whole thing, the whole thing gets copied multiple times, and any kind of sorting/lookups often just scan the whole thing because with small things that's fastest/simplest.

When you expect something to be large, you architect things totally differently -- you buffer/stream rather than allocate, you pass pointers rather than copying data, you use indexes/hashes for lookups rather than full scans, and so forth.

And generally, filesystem paradigms are designed around small-number assumptions for number of files in directories, and hierarchy depths, and filename lengths, etc. Because these are all things you interact with using "human-scale" tools like GUI file managers and 'ls' commands. Whereas file sizes and disk sizes use large-number assumptions, because a video file is easily 10 GB or much more.

[+] pjc50|3 years ago|reply
Querying is done "client side", rather than in the filesytem API. If you request all million rows from the database and convert them in your ORM, and then query the ORM, you'll see the same kind of problem.

Quite a lot of systems use linked lists which make everything O(n).

And the worst part is this is unfixable without migrating away from APIs which have been frozen for about 40 years at every level in every single piece of software you're using.

[+] gwbas1c|3 years ago|reply
In general, filesystem code isn't optimized to handle millions of files in a single folder. It's generally optimized for thousands, and generally under 100,000 files per folder.

(I used to work on an industry-leading sync product, Syncplicity)

In general, the UI in Explorer and Finder isn't designed to handle more than a couple hundred, or low thousands of, files in a folder. The whole UI metaphor just falls apart when you have millions of files in a folder. There are plenty of other applications that can handle such situations better.

Thus why optimize the filesystem? It's not a database, and if you need a database, you should use a database. SQLite is a wonderful database, and there are plenty more.

But, getting back to the APIs: Even though the Windows kernel and lower-level APIs will let you enumerate files in a folder via paging and filters, there's nothing that forces an application to use paging or OS filters. It might not even be possible: The Windows kernel sends wildcard strings to the filesystem driver; but it doesn't send regexes. So, if you need to filter files in a directory by names that match a regex, you have to load all filenames into your process.

A database like SQLite will have indexes and better filtering capabilities.

Furthermore, filesystems generally are block storage. Each file takes up, on disk, the number of blocks that it needs, rounded up. For example, your 10k file will take up a whole 100k block, if 100k is your block size. Again, a database like SQLite, (or a zip file,) will be much more efficient.
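That rounding-up overhead is easy to quantify; a sketch assuming a hypothetical block size (many filesystems default to 4 KiB):

```python
def on_disk_size(file_size, block_size=4096):
    """Bytes a file actually occupies: its size rounded up to whole blocks."""
    if file_size == 0:
        return 0  # many filesystems store empty files with no data blocks
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks * block_size
```

So a 10,000-byte file occupies 12,288 bytes with 4 KiB blocks, and a million such files lose roughly 2.3 GB to rounding alone.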

[+] ilaksh|3 years ago|reply
I think the article is talking about NFS (Network File System). My small experience with it suggests I should never use it due to extreme sluggishness.

Anyone know of an alternative?

[+] ajuc|3 years ago|reply
Back in the 2010s, in my first job, I saw a bug caused by too many files in one directory. I don't remember the exact details, but it was driving me crazy.

Basically, we wrote temporary files into one big directory and later printed them. And sometimes our code returned "File not found" errors despite the fact that ls filename showed the file was there and had correct permissions.

And when I tried to cat filename from the shell, it also caused the same error :) But if you created another file with a different name in the same directory, it worked correctly :) There was also free space on the disk, and the number of files was high, but not crazy high (a few hundred thousand, I think).

It turns out this particular filesystem (it was ext2 or ext3 with specific parameters, IIRC) can behave like that when there are too many similarly-named files in one directory: the filesystem keeps some metadata with hashes of filenames, those hashes can collide, and it can only handle so many collisions before failing.

The solution was to remove the files after printing them of course, so that they don't accumulate forever.

[+] ape4|3 years ago|reply
Don't modern filesystems handle this nicely - e.g. btrfs?
[+] londons_explore|3 years ago|reply
Even if the filesystem handles it nicely, there are lots of applications which don't. So many applications think it's fine to iterate through every file in a directory for various reasons. If there are 10 million files in a directory, that is effectively a linear scan over a huge database table - which is generally advised against...
[+] mprovost|3 years ago|reply
ReiserFS handles this really well but I don't know if you can call it modern anymore. I can't think of any other filesystem that followed its design choices though.
[+] pixl97|3 years ago|reply
Much better especially on SSD, but you still run into issues.

At the end of the day the filesystem is only part of the issue, quite often it's stupid application interaction where the designer of the application thinks they'll only ever see a few hundred or maybe a thousand files and you present the app with a million files.

[+] red_admiral|3 years ago|reply
Meanwhile, in the OneDrive desktop client, the cost of some operations is proportional not to the number of files in the folder you're trying to open, but to the number of files in the whole filesystem. Your "root" folder can take a lot longer to load if /some/sub/folder has a ton of files in it.
[+] watersb|3 years ago|reply
Bombich.com is a great source of filesystem feature deep dives. He started with running a backup lab, helped people with corner cases and common failure modes. He created a GUI Mac app to guide people through the gnarly bits, and later contributed metadata patches to rsync, for which I am truly grateful.

Back in 2004, I played with file systems, metadata archives, directories with 20,000+ files in them.

Learned that at that time, GNU ls had a polynomial-time sort algorithm. I didn't dig into it as much as I should have, but there are sort algorithms that have already-sorted input as worst-case for runtime.

[+] cratermoon|3 years ago|reply
Way back in the 90s I read "New Need Not Be Slow"[1], about usenet, and one of the issues that came up consistently was performance limits because of the number of inodes and the filesystem. When I was tasked with setting up INN for my organization, I was able to get a DEC running OSF/1 with advfs, which at the time was a highly optimized filesystem, to more or less bypass the performance problems of UFS.

1 http://www.collyer.net/who/geoff/newspaper.pdf

[+] angst_ridden|3 years ago|reply
At previous job v.2, I changed an IoT system that received daily text files from remote devices from dumping millions of them in the top level of the /data partition to using a /data/YYYY/MM/unit_id/ structure.

The claim was that the original files needed to be kept for audits, even after database ingest.

Management didn't care, but I made the change because I wanted to have my terminal not die if I accidentally typed "ls" in the wrong directory.

[+] mort96|3 years ago|reply
> Adding a new file, for example, requires that the filesystem compare the new item name to the name of every other file in the folder to check for conflicts

Is this true? I would've assumed that filesystems have smarter ways to find a file in a folder than to do a linear search through every entry.

That doesn't take away from this post, those smarter datastructures and algorithms will still grow slower with more entries, just not linearly so.

[+] nayuki|3 years ago|reply
> Adding a new file, for example, requires that the filesystem compare the new item name to the name of every other file in the folder to check for conflicts, so trivial tasks like that will take progressively longer as the file count increases.

Not necessarily. Any file system worth its salt is using B-trees or hash tables, where file name existence can be checked in respectively O(log n) or O(1) time.
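As a toy illustration of the O(log n) case (binary search over sorted names standing in for a B-tree probe; not how any real filesystem lays out its directories on disk):

```python
import bisect


def name_exists(sorted_names, name):
    """O(log n) membership test over a sorted list of directory entries."""
    i = bisect.bisect_left(sorted_names, name)
    return i < len(sorted_names) and sorted_names[i] == name
```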

[+] s1mon|3 years ago|reply
"Some of these can safely be deleted if you find crazy-high file counts."

It would be nice to know which of these library folders can be cleared out.