Linus Torvalds on HFS+

[+] kennywinker|11 years ago|reply

While I really like thristian's explanation of why case insensitivity adds massive complexity to a system: (https://news.ycombinator.com/item?id=8876873)

I have to point out that case sensitivity offloads a bunch of that complexity to the user. This is almost definitely why OS X uses case-insensitive HFS+ by default.

As a ultra-simple example: with a case-sensitive file system I can have two directories, "documents" and "Documents". Which one are my documents in? Half in one, half in the other, probably.

I'm not saying Linus at al. are wrong that case-sensitive is the way to go, but there are some reasonable arguments for trying to take that complexity back from the user and deal with it in the file system.

Source: I'm a long time mac user, and I just switched to a case-sensitive file system last year. The win of having my dev machine match my deployment environment (iOS) is bigger than any downsides I've seen yet.

[+] SwellJoe|11 years ago|reply

I find case-insensitivity (as both Mac OS X and Windows do it) user-hostile. So, there's a counter-anecdote.

But, I believe all that proves is that in many cases, "what I'm used to" is the same as "easy and intuitive". There is often no one true way to make something complex easy and intuitive, and sometimes the "easy way" is the way you're used to (but for someone used to another way, it won't be easy).

In this case, in the absence of compelling evidence that case-insensitivity is a net win for the user, the simpler implementation is probably the right one. The number of bugs over the years caused by insensitivity and Unicode mapping and such, is probably further evidence favoring simplicity.

Also, the average Mac user doesn't even know they have a command line (seriously, I can't remember ever saying to a casual Mac user, "Open terminal" and have them say anything other than, "I don't know what that is" or "I don't have that"). There's no reason a higher layer can't make case-insensitive decisions for the user (say, searching for "the clash" will find music by "The Clash"), which doesn't require the freaking filesystem to make those kinds of guesses. Having this happen at the filesystem layer has always seemed utterly mad, to me. And, having it happen at the higher layers is how Linux and UNIX software has always handled it. Somehow we muddle through with the filesystem being entirely naive about the letter "a" being sort of the same as "A" (for humans).

[+] forgottenpass|11 years ago|reply

As a ultra-simple example: with a case-sensitive file system I can have two directories, "documents" and "Documents". Which one are my documents in? Half in one, half in the other, probably.

This is a source of confusion, but filesystem convention is not the place to solve it because it can't solve it. I can still have three directories for bug #12 "ticket_12", "ticket12" and "ticket 12" Which one is my investigation in? Is documentation in "doc" or "docs"? etc...

[+] tikhonj|11 years ago|reply

Funnily enough, I've had the same thoughts about variable names. We expect case-sensitive variable names because that's what we're used to, but they're not obviously the right choice. I've worked on reasonably sized lisp projects with case-insensitive names with no detriment.

Most of the time, having both a `foo` and a `Foo` identifier at the same time is a bad idea: it's hard to remember which is which. The only time it makes sense is when it's backed up by strong conventions: say, `foo` is a method and `Foo` is a class. With these conventions, you're not really remembering `foo` and `Foo`, you're remembering "`foo` method" and "`foo` class". In a sense, it's just a hack to have sigils use an otherwise unused part of the identifier, just like tagged pointers use the unused bits of a pointer for metadata.

Using capitalization for sigils like this is fine, and works well in practice, but you could as easily use some other symbols in the same role. At the very least, it helps to have the conventions built right into your language (like Haskell) than just followed manually (like Java). Moreover, having significant capitalization outside the conventions (ie in the rest of the word) will still just be confusing: was is `subString` or `substring`?

I don't know what the best solution is, but case sensitivity everywhere isn't it.

[+] beedogs|11 years ago|reply

> As a ultra-simple example: with a case-sensitive file system I can have two directories, "documents" and "Documents". Which one are my documents in? Half in one, half in the other, probably.

That only really happens if you're an idiot who just haphazardly throws documents into any directory that looks like it can still hold files. And it's not an indictment of case sensitivity; you can achieve the same sort of stupidity in plenty of other ways if you're determined to do so.

[+] justizin|11 years ago|reply

"This is almost definitely why OS X uses case-insensitive HFS+ by default."

See my other comment, whereby the Darwin team told me they use case-insensitive FS because Microsoft Office internally converts filenames to uppercase or lowercase at its' leisure, often several, if not tens of times during an operation.

I could be wrong, the person / people I talked to could be incorrect, but I've never heard another explanation that was not speculation.

Your reasoning all applies reasonably well to why Microsoft decided to buck the trend and go case insensitive, since case sensitive was the norm up until they came along afaik.

[+] mikeash|11 years ago|reply

Case insensitivity is a tiny part of the problem in my view. On a case insensitive filesystem you can still have distinct directories called Document, Documents, My Documents, Documnts, etc. and really, from the user's point of view, why can't you have two directories called Documents? The entire concept of addressing things by name is counterintuitive. Mapping strings that differ only by case into the same name is a teeny band-aid on a gaping wound.

[+] justizin|11 years ago|reply

Several years ago, when I'd first moved to SF, I got a scholarship to WWDC for working for ACM, and I took advantage of the opportunity to go to a Birds-of-Feather for people interested in "Darwin filesystems" or something like that. It was basically a little conference room with the FS team, and I asked, flat-out, "Why case insensitive?"

And they answered, flat-out: "Microsoft Office"

Even this past week, I was talking with coworkers about having trouble with nothing but Valve Steam on my Macs that have a case-sensitive filesystem[0]. That's particularly odd, since it works on Linux now, but that's another matter.

What I found most notable about this thread, is this quote from Linus:

"And Apple let these monkeys work on their filesystem? Seriously?"

I'm pretty sure Apple actually _fired_ anyone who wanted any of the things done anything close to any way that Linus Torvalds would agree with.

ext* not being particularly perfect, I'm happy to have both. I mean, ext2 is hard to complain about, but it comes from an era where basically all filesystems were terrible, literally the era when SGI started installing backup batteries to race with fsync().

ext4 has an alarming number of corruption bugs, but I'm sure it's not because of insane unicode handling, though I take Linus' description of how the OSX filesystem works with a grain of salt. He can't possibly _care_ to know as much about it as he knows about Linux's.

[0] achievable by formatting HFS+X in Disk Utility in Recovery Mode, then installing onto that drive

[+] phaemon|11 years ago|reply

> ext4 has an alarming number of corruption bugs

Which bugs are these?

> though I take Linus' description of how the OSX filesystem works with a grain of salt. He can't possibly _care_ to know as much about it as he knows about Linux's.

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.g...

[+] thehacker005263|11 years ago|reply

Why wouldn't Microsoft office not work properly if the file system is case-sensitive ?

[+] Someone1234|11 years ago|reply

Can someone explain to me why case insensitivity is a bad thing? Clearly Linus believes so but didn't explain why he believes so.

Most UNIX and Linux systems seem to have an "all lowercase" or "all uppercase" convention, so the fact that they have case sensitivity is often not utilised.

In fact the biggest reason you'd want case sensitivity off the top of my head is legacy support but that's just a circular argument (since you never really reach WHY it was that way originally, just that it was).

I guess based on what he talks about next he is worried about how case insensitivity interacts with other character sets (i.e. does it correctly change their case), but for most sets isn't the lower and upper case defined in the UNICODE language spec itself?

[+] thristian|11 years ago|reply

The prime number-one concern in kernel programming is managing complexity. Well, in most programming really, but in kernel programming unmanaged complexity leads to lost data and sometimes broken hardware instead of "just" crashes.

Case-sensitivity is the easiest thing - you take a bytestring from userspace, you search for it exactly in the filesystem. Difficult to get wrong.

Case-insensitivity for ASCII is slightly more complex - thanks to the clever people who designed ASCII, you can convert lower-case to upper-case by clearing a single bit. You don't want to always clear that bit, or else you'd get weirdness like "`" being the lowercase form of "@", so there's a couple of corner-cases to check.

Case-sensitivity for Unicode is a giant mud-ball by comparison. There's no simple bit flip to apply, just a 66KB table of mappings[1] you have to hard-code. And that's not all! Changing the case of a Unicode string can change its length (ß -> SS), sometimes lower -> upper -> lower is not a round-trip conversion (ß -> SS -> ss), and some case-folding rules depend on locale (In Turkish, uppercase LATIN SMALL LETTER I is LATIN CAPITAL LETTER I WITH DOT ABOVE, not LATIN CAPITAL LETTER I like it is in ASCII). Oh, and since Unicode requires that LATIN SMALL LETTER E + COMBINING ACUTE ACCENT should be treated the same way as LATIN SMALL LETTER E WITH ACUTE, you also need to bring in the Unicode normalisation tables too. And keep them up-to-date with each new release of Unicode.

So the last thing a kernel developer wants is Unicode support in a filesystem.

[1]: http://www.unicode.org/Public/UNIDATA/CaseFolding.txt

[+] terminus|11 years ago|reply

Speaking from imperfect knowledge, I'd guess: case insensitivity implies allowing for aliasing (case aliasing for case insensitive and god-knows-what for Unicode insensitivity.)

Which means that anywhere you handles names, you explicitly handle these aliases. Miss a spot (or an alias), and you have a security bug.

[+] kstenerud|11 years ago|reply

Yes, it would be nice to have universal "uppercase" and "lowercase" rules, but in the real world, collation rules are crazy, arbitrary, and damn near impossible to get right. Now try to get it right in EVERY locale around the world, because if you don't, you open a big gaping security hole or data corruption bug.

[+] pmjordan|11 years ago|reply

I don't know to what extent this affects HFS+, but case insensitivity in general is locale dependent. The classic example is the Turkish distinction between dotted and dotless I - http://en.m.wikipedia.org/wiki/Dotted_and_dotless_I.

Quote:

The dotless I, I ı, denotes the close back unrounded vowel sound (/ɯ/). Neither the upper nor the lower case version has a dot.

The dotted I, İ i, denotes the close front unrounded vowel sound (/i/). Both the upper and lower case versions have a dot.

End quote.

I can imagine that being a nightmare, particularly if one user uses a Turkish locale and another on the same machine uses English. And all of that complexity in the kernel? Ouch. (Or HFS+ just behaves incorrectly for Turkish file names, not sure that's really better?)

[+] dunham|11 years ago|reply

Upper/lower case is locale language sensitive. One specific example, the uppercase for "i" varies depending on whether you have a Turkish locale or not. String.toLowerCase() in Java has exceptions coded in for the "tr", "az", and "lt" locale languages. Some of these languages may cause the length of the string to change when upcased or downcased, too.

Additionally, given a utf8 string, OSX will translate it into another, possibly different utf8 string before using it as a filename (the NFD normalization that Linus mentioned).

[+] dennisnedry|11 years ago|reply

The exploit being referenced allows malicious code to be injected into a file that handles git, which is (.git). Now, normally Git does not allow you to overwrite (.git). Since HFS+ is case-insensitive, writing the file as (.Git) will overwrite the (.git) file and break security. Linus is saying that an uppercase and lowercase character are two different things, and not recognizing that can (and did) cause problems with security.

[+] bigdubs|11 years ago|reply

I've run into issues with it from a programming point of view, you have to make sure to do a (slightly) more expensive case insensitive comparison when dealing with filenames.

[+] terminus|11 years ago|reply

Also a longer critique by John Siracusa here: http://arstechnica.com/apple/2011/07/mac-os-x-10-7/12/#hfs-p...

I wonder how this ties in with the whole Apple philosophy of "Design is how it works."

Clearly the innards look nothing like the facade.

[+] Someone1234|11 years ago|reply

> File system metadata structures in HFS+ have global locks. Only one process can update the file system at a time.

Holy heck! How does that work in practice? Do operations get queued and then a single kernel process who can take the lock do the atomic updates?

I have to imagine that is going to cause a bottleneck however, as all non-read operations need to update the metadata (e.g. timestamp, maybe size if it is stored).

That all being said I haven't noticed OS X being particularly slower to do things than e.g. Windows. So if that is the case they're hiding it well.

[+] yuhong|11 years ago|reply

Wonder why the "HFS+ Private Data" hack was chosen when Rhapsody existed even back in 1997.

[+] kannonboy|11 years ago|reply

Scroll down for Linus' commentary.

> "The true horrors of HFS+ are not in how it's not a great

> filesystem, but in how it's actively designed to be a bad

> filesystem by people who thought they had good ideas."

There doesn't seem to be a way to deep link to comments in G+?

[+] astrodust|11 years ago|reply

G+ is a train wreck for a number of reasons, this included.

Yeah, he's not a fan of HFS+ at all. Wasn't the plan to move to ZFS prior to the Oracle acquisition of Sun? Hopefully that ends up back on track somehow.

[+] AceJohnny2|11 years ago|reply

I'm rather amused (and, as always, a bit saddened) by the comment of Terry A. Davis, of TempleOS fame/notoriety.

"linux doesn't have to search parent directories for file-not-found, but I do."

wat

Edit: further parsing reveals he implemented a (read-only) overlay system in his FS. Interesting, I wonder what the side-effects (vulns) could be?

[+] iso8859-1|11 years ago|reply

The word "vulnerability" doesn't make sense when talking about TempleOS, since it does not even attempt to offer any kind of security.

[+] wiremine|11 years ago|reply

"Quite frankly, HFS+ is probably the worst filesystem ever."

Can anyone summarize why he thinks this?

[+] wazoox|11 years ago|reply

HFS+ has been patched with duct tape and pieces of cardboard for 20 years; receiving journaling, support for Unix attributes, extended attributes, 64 bits sizing, multi-processing, multi-users, hard links, etc. over the years. Really it's a sort of monument to kludge. It should have been ditched like 10 years ago.

[+] kannonboy|11 years ago|reply

You might've missed Linus' second comment on the page, wherein he talks about case-insensitivity and NFD normalization.

[+] yarrel|11 years ago|reply

Because he hasn't used MFS. ;-)

[+] gchpaco|11 years ago|reply

HFS+ is probably the worst filesystem in common use right now; even FAT has the benefit of simplicity. Most of its issues, however, are with its horrific implementation; the Unicode naming is kind of bad but Linus manages to be wrong about several things.

Regarding case sensitivity: it is generally accepted among the user interface crowd that (Western) users don't really understand that 'C' and 'c' are different things; they're "both" 'c'. Case-preserving is thus the accepted practice. However case manipulation is not an operation that can be done absent a locale; my go to example here is that 'i' upcases to 'I' unless you're a Turk in which case it upcases to 'İ'. Similar although not quite as bad is the fact that 'ß' upcases to 'ẞ' U+1E9E in some exotic circumstances; see http://en.wikipedia.org/wiki/Capital_ẞ for details. Similar limitations apply to sorting, which users also expect.

Regarding Unicode: NFD is a normalization format; it converts 'é' U+00E9 and 'é' U+0065 U+0301--which are semantically identical--into the same coding. As it happens NFD picks U+0065 U+0301 for that string; NFC picks U+00E9. Any time there is ambiguity, NF[CD] will retain the original ordering. Calling it "destroying user data" is meaningless histrionics. Most of the time we tend to use NFC. I am told that NFD has certain advantages for sorting, where one might want to match the French word 'être' with the search string 'etre'; in NFC this requires large equivalence tables but in NFD the root character is the same in both cases. Linus's claim 'Even the people who think normalization is a good thing admit that NFD is a bad format, and certainly not for data exchange.' has a big [citation needed] tag attached.

As it happens, my personal belief is the following: Given that users expect case sensitivity and locale specific ordering, which complicate filesystem design tremendously. Given that users mostly interact with the system through GUI dialogs, which already hide system files (files with the hidden bit in HFS+, or files starting with '.'). Therefore, extract the case sensitivity to a layer, used by the GUI, which can understand the user's locale and so fold case properly. This layer should be available to command line applications so that they can use the same rules if they so choose. The underlying filesystem will then be case insensitive, but is still used to encode Unicode data; the right thing to do here is to normalize. Either NFC or NFD is fine, really.

For pedants: the related NFKC/NFKD forms add a canonicalization step and are absolutely not semantically safe in any way, for all that they're useful for sorting.

[+] Someone|11 years ago|reply

And then, you receive a zip file from Linus that has file.c, File.c and FILE.c files on it. You extract it, and then? They either end up on disk, breaking your case-insensitive UI layer (yes, you can see those files, but can you copy them elsewhere?), or they don't, breaking the makefile that's also in the archive. Here be dragons.

Locale-specific ordering of course _must_ be done outside the disk because disks may move between systems with different locales, locales can be changed at will, and multiple users could read the same directory with different active locales (well, must: one could store a locale for sorting per directory and force that on he user, but that is madness)

Also, reading http://dubeiko.com/development/FileSystems/HFSPLUS/tn1150.ht... (which I can't find anymore on Apple.com), HFS+ doesn't quite use full NFD because it sometimes destroys information that Apple deemed worth keeping.

[+] yuhong|11 years ago|reply

Yea, but it would be better to convert filenames from disk at comparison time.

[+] niels_olson|11 years ago|reply

1) What are the odds Tim Cook or Craig Federighi will here about this?

2) What are good filesystems for OS X to adopt? OS X supports other filesystems. Is there a way to force it to install the OS onto a different filesystem, like ext4?

[+] mindajar|11 years ago|reply

Re: 2), there are significant parts of OS X that seem to either check for HFS+ or rely on its implementation and bugs, and those things don't work properly on other filesystems. OpenZFS, for instance, still doesn't work with Spotlight, on which a surprising number of things depend these days.

https://openzfsonosx.org/wiki/FAQ#Limitations

If you want to use all the software features of your Mac, your only option is HFS+.

[+] duskwuff|11 years ago|reply

OS X doesn't have full native support for any filesystem other than HFS+.

It supported UFS up until 10.9, but that was ancient and awful.

It can read and write FAT, but that's even worse. (It doesn't even support permissions, so it couldn't possibly boot from a FAT volume.)

It can read and write FATX. I don't think that supports permissions either, though, and it's also annoyingly patent-encumbered.

It can read NTFS, but not write to it.

[+] unknown|11 years ago|reply

[deleted]

[+] makecheck|11 years ago|reply

I agree with Linus from a technical point of view but I think Apple had many considerations here.

A number of games (and possibly other programs) on the App Store alone specifically mention that they will not work on Macs configured with case-sensitive file systems. My guess is that this aids programmers who may have ported something from Windows and not tested all possible file/path dependencies.

This may also help users when copying files from Windows network disks or Mac legacy systems where (from their point of view) they expect things to work.

[+] unknown|11 years ago|reply

[deleted]

[+] lispm|11 years ago|reply

I agree that HFS+ and its API is outdated.

Case insensitivity I find useful, OTOH.

[+] jen729w|11 years ago|reply

Ding!

[+] yuhong|11 years ago|reply

BTW, one other stupidity is that Linux's HFS+ implementation refuses to mount journaled volumes when Apple designed it to be backward compatible.

[+] cmurf|11 years ago|reply

That's not correct. It won't rw mount journaled HFS+ volumes, they are mounted ro by default. You can force them to mount rw and it'll ignore the journal possibly at the immediate or near future peril of the filesystem, depending on what state its in. I don't know why linux is so far behind on supporting it and doesn't bode well for supporting Core Storage volumes. Near as I can tell while HFS+/X are part of Darwin and thus at least sorta open sourced, I'm not finding any of the Core Storage stuff open sourced meaning it'd have to be completely reverse engineered.

[+] ajross|11 years ago|reply

How could that possibly work? I presume by "backward compatible" you mean that the data outside the journal remains consistent, and the journal layer is capable of detecting modifications made by non-journaled mounts.

That's... fine, I guess. It prevents the obvious corruption cases. But the only plausible recovery mechanism after such a mount is to throw out the journal! That's not likely to be acceptable to most users ("I booted to linux and back, and now a bunch of new files disappeared!").

That's stretching the meaning of "compatible" too far.

[+] jeffehobbs|11 years ago|reply

DING

[+] datashovel|11 years ago|reply

Thank God we have someone like Linus to keep BIG TECH in line. It's not all about marketing Apple, Microsoft. It's about good design, and openness.

[+] datashovel|11 years ago|reply

I would expect far more downvotes than that! Come on what's wrong with you "Hacker News". Only 1 downvote when I'm slammin' Apple for their shitty strategy? Surely some of you have at least some downvotes available to use against that!!!

119 comments