Excellent writeup. Computer systems are discoverable. That attitude, along with some of the basic tools (such as strace, ltrace, man pages, debuggers and a compiler) and a willingness to dive into system source code go a long way. If your tracing leads you to a library call and you don't know what's going on inside, find the source code. If it's the kernel, load up lxr (http://lxr.linux.no/).
Very interesting. I spotted two minor problems with the posted code.
Doing this:
#define BUF_SIZE 1024 * 1024 * 5
to define a numerical constant is a bit scary, since depending on how the symbol is used it can break due to operator precedence at the point of expansion. It's better to enclose the value in parentheses. Personally I would probably write the right-hand side as (5 << 20).
The first time the modification to skip entries with inode 0 is mentioned, the code is wrong:
I did this by adding
if (dp->d_ino == 0) printf(...);
This should use !=, not == (it's correct the second time, but this adds confusion).
I suspect the author is incorrect in his claim that reading in 32k chunks is responsible for the slowness. Due to read ahead and buffering, Unix-like systems tend to do reasonably well on small reads. Yes, big reads are better, but small reads are not unreasonably slow.
To test this, he should try "ls | cat". On large directories that often runs many orders of magnitude faster than "ls". This is because, I believe, ls on most people's systems wants by default to display information such as file modes or type, via coloring or otherwise decorating the file names, and getting that information requires looking at the inode for each file.
It's all those inode lookups that slow things down, as the inodes are likely to be scattered all over the place. When you do "ls | cat", ls notices its output is not a terminal, so it turns off the fancy stuff and just lists the file names. The names can be determined entirely from the directory, and so performance is much better.
The original impetus for the post was python's os.listdir(), which as far as I know doesn't stat(). ls just made the blog post more interesting :-).
I was surprised that the 32K reads were taking so long. It's possible since it was on a virtualized disk ("in the cloud") that something else was slowing down disk IO (like Xen).
But I can assure you that a larger read buffer performed much better in this given scenario.
I just tried this on non-virtualized hardware with 10 million files in a single directory using the zfsonlinux native zfs implementation for linux. It took a little over 4 minutes to do a "\ls -f | wc -l" so this might very well be something to do with virtualization.
I'll try an ext3 file system just for giggles and post the results.
Edit:
Didn't have an ext3 file system with enough free inodes handy so I used an ext4 file system. It takes too long to create 10 million files so I cut the test off early. It took about 7 seconds to complete a "\ls -f | wc -l" with 6.2 million files in a single directory.
You are right that 32k buffers should be more than enough to read 5MB in a reasonable time, regardless of disk/VM architecture, but I don't think stat was the problem either. My guess is that readdir or possibly getdents is O(n^2) somewhere.
[just noticed it was 500M (oh wow), but same difference]
Really surprised by all the high-fiving and positive excitement going on about this article.
Putting eight million files in one directory level aside, the whole basis for this event - using the filesystem as a storage layer for a k/v 'database' - is just twisted.
Systems people tend to do the simplest, reasonable thing that works, then forget about it until it doesn't work anymore. This means that sometimes you'll make design choices that look ugly, but really, they don't matter.
Absolutely. The sad reality is that, to a lot of linux users right now (and I'm talking about people that know they're running linux, not android users), it's a black box.
Software is "packages" that you install with "synaptic".
Throw a novice Ubuntu user into a FreeBSD system, and tell him to install a port, and 9 times out of 10 they'll freak out once they see GCC output on the screen.
Nothing against Ubuntu (RedHat, SuSE, Debian, Arch, et al.), but source compilation is something they all have been letting their users avoid for a long time. The target audience is different.
Are there folks out there that are afraid to compile code and modify it?
Some developers (myself included) may have a general preference for sticking with the "official" packages to avoid extra work when bringing up a new machine or migrating to a new distro version.
Those arguments will remove the need for find to stat each directory entry. Regardless, this is a nice walk through of low level details often overlooked.
Actually 'find' will also stat each entry no matter what.
Many of the standard tools that most people would intuitively expect to be rather optimized (find, rsync, gzip) are embarrassingly inefficient under the hood and turn belly-up when confronted with data of any significant size.
That probably stems from the fact that most of the development on these tools took place during a time when 1GB hard drives were "huge" and SMP was "high end".
He tried using find, but, in this case, the libc readdir function was the bottleneck, so find doesn't help.
Often, when ls is being slow, you can speed things up drastically by disabling the things that require a stat (colors, -F, etc.) that are often added by distro's shell files (invoking ls with a full path is an easy way to disable shell aliases.) Also, sorting is on by default, which slows things for obvious reasons.
When all you need to do is kill the stat overhead for small dirs on slow file systems, "echo *" is beautifully simple.
The man pages really do try to put you off using the syscalls. I only used getdents for the first time when I was writing a syscall interface for Lua https://github.com/justincormack/ljsyscall - not tried it on 8m files but you can set the buffer size so it should work.
The native aio interface has advantages over the posix wrapper too, and some of the other interfaces. On the other hand the futex interface is not nice to use, as it requires architecture specific assembly which seems undocumented...
In a similar fashion, something that reads directories as files on Windows: https://gist.github.com/1148070 - it allows me to get some things done faster (it's very crude, rough code, written for exploration rather than real usage).
They were never meant to have anywhere near that number of files in the first place. The article mentions that the cleaning up of old files failed - presumably they didn't bother with subdirectories because under normal circumstances the single directory approach was working fine.
My impulse would be to modify libc's readdir() to use a larger buffer size instead of using my own C program. Would that be much stupider for some reason (besides packaging/dependencies)? Do libc clients expect 32k buffers and die if they receive something else?
I ran into this problem before. One of our engineers had a script that kept some temporary data in his home directory, which went crazy at one point and generated millions of files (no clue the reasoning for this).
Anyways, this was HELL on our backup solution, which ended up failing several nights in a row on that area. Luckily, since the files were largely useless, this left our options open. I think the kicker was the 'rm' command not working (same issue as listing? This was a Solaris 8 system I think). I believe we ended up moving all of the other data off of that filesystem and newfs'ing it.
Having a fan-out directory instead of putting all files at the root would have helped to avoid this problem altogether.
Here's an example. The .git/objects/ directory can grow to have up to 256 sub-directories. If an object hashes to 0a1b2c3d then it gets written to objects/0a/1b2c3d. Lookups are still fast and navigating can be done without resorting to writing an 'ls' replacement.
ext3 and ext4 both support hashed b-tree (htree) directories, which I guess is what you're thinking of, but that would only matter for creating/deleting/looking up a particular file, not listing them. The fact the system didn't slow to a crawl creating the directory suggests that's not the problem.
bcx: I'd welcome more tests though.
nodata: Running /bin/ls will bypass the alias.
brndnhy: Happy not to be working with devs like this.
scott_s: See this talk by Jonathan Blow, the designer and programmer behind Braid: http://news.ycombinator.com/item?id=2689999
joe_the_user: Sure, you can find out a lot by doing things you really shouldn't do, like using the directory system as a k/v store.
But in the end, you should still learn the lesson that it is really a bad idea.
shabble: And remember to put all your variable declarations at the top of each block, so the compiler can handle it all in a single pass :)
Luyt: "Perhaps the buffer should be dynamically set based on the size of the directory entry file"
This would eliminate the readdir() bottleneck.
carbonica: I was a bit thrown by this advice.
blhack: Source code? Compiler? What's that?
bcx: Certainly find . would have been faster without calling stat().
Does os.listdir() stat?
malkia: Also to get the NTFS streams (metadata).
http://msdn.microsoft.com/en-us/library/aa364226(v=vs.85).as...
pornel: Perhaps

would be fast enough?
jasonmoo: http://blogs.perl.org/users/randal_l_schwartz/2011/03/perl-t...
joblessjunkie: The shell cannot handle long argument lists, so this will fail rather quickly.
mjpizz: http://webcache.googleusercontent.com/search?q=cache:KBsyzf3...