APFS does not normalize Unicode filenames

313 points | okket | 9 years ago | mjtsai.com

137 comments

[+] userbinator|9 years ago|reply
I agree that this is a good change. Unicode, normalisation, character encodings, etc. should really be handled at the presentation layer, and everything below that just treats filenames as sequences of bytes, perhaps with one or two exceptions like '/' and \0.

It is interesting to consider a theoretical system in which paths are represented in 0-terminated count-length format (e.g. "foo/bar/baz/myfile.txt" would be "\003foo\003bar\003baz\012myfile.txt\000"), truly allowing any byte in a filesystem node's name, although that might be going a little bit too far.
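A rough sketch of that count-length idea, for anyone who wants to play with it. The function names `encode_path` and `decode_path` are made up for illustration, and the one-byte length prefix (so components are capped at 255 bytes) is an assumption read off the example string above, not part of any real filesystem API.

```python
def encode_path(path: bytes) -> bytes:
    """Encode b"foo/bar" as length-prefixed components ending in a zero byte."""
    out = bytearray()
    for component in path.split(b"/"):
        if not 0 < len(component) <= 255:
            raise ValueError("component length must be 1..255 bytes")
        out.append(len(component))  # one-byte length prefix
        out += component
    out.append(0)  # a zero length terminates the path
    return bytes(out)


def decode_path(encoded: bytes) -> list[bytes]:
    """Return the components stored in the length-prefixed form."""
    components = []
    i = 0
    while encoded[i] != 0:
        n = encoded[i]
        components.append(encoded[i + 1 : i + 1 + n])
        i += 1 + n
    return components


print(encode_path(b"foo/bar/baz/myfile.txt"))
# Because lengths are explicit, a component may itself contain "/" or a zero byte.
print(decode_path(b"\x03a/b\x03c\x00d\x00"))
```

Since the decoder never scans for a separator, nothing inside a component needs escaping, which is the property the comment is after.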

> Things are much easier for the file system if it can just treat names as bags of bytes.

If you're really talking about bags (unordered sets), that would certainly make for an interesting filesystem since filename.txt, filemane.txt, and maletent.fix would all be the same...

[+] ridiculous_fish|9 years ago|reply
This feels in some sense like punting the problem. What exactly should the presentation layer do when presenting two files, where the first is named with a precomposed character sequence, and the other has the same name but decomposed? Surface the normalization form to the user? Uh, no...

The more fundamental question is whether filenames are under control of the user or the system. The answer today is "both": there's blessed paths /System/Library... and non-blessed paths like ~/Documents/Pokemon.txt. Addressing this properly means reifying that distinction: making apps always be explicit about whether they're working with the user's or the filesystem's view of a file.

While we're here, why should the user even have to name every file? Do you name every piece of junk mail on your desk/kitchen island?

The ideal for the user is something like tagging, where naming stuff is optional, names are forgiving and perhaps not unique, and files are not restricted to being in one directory. Meanwhile, filesystem names are an implementation detail and the filesystem enjoys Unicode ignorance. Spotlight moved a bit in this direction, but the end goal is still awfully far away.

[+] jfim|9 years ago|reply
It's a good idea until you end up with two files that have the "same" name (eg. Amélie.jpg and Amélie.jpg) because one uses decomposed characters (U+0065 and U+0301) and the other one uses a single character (U+00E9).

If the difference is not visible in your browser (it shouldn't be), try copy-pasting those two filenames into a text editor: one of them is 10 characters long and the other is 11.
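If you'd rather not trust copy-paste, the same thing can be checked with Python's standard `unicodedata` module. This is only a sketch: the filename is built from escape codes here so that the example does not depend on how this page happens to encode it.

```python
import unicodedata

name = "Am\u00e9lie.jpg"                   # precomposed é (U+00E9)
nfc = unicodedata.normalize("NFC", name)   # keeps the single code point
nfd = unicodedata.normalize("NFD", name)   # e (U+0065) + combining acute (U+0301)

print(len(nfc), len(nfd))    # 10 11
print(nfc == nfd)            # False: different code point sequences
print(nfc.encode("utf-8"))   # the bytes a byte-exact filesystem would compare
print(nfd.encode("utf-8"))
```

The two strings render identically but compare unequal, so a filesystem that stores names byte for byte will happily keep both.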

[+] quotemstr|9 years ago|reply
No good has ever come from allowing BEL and DEL as part of filenames.

I'm with David Wheeler: https://www.dwheeler.com/essays/fixing-unix-linux-filenames....

We need to limit filenames for the good of the entire system and the whole community. Filenames as byte strings may sound good, but nobody ever thinks of the costs and the scant benefits.

[+] silvestrov|9 years ago|reply
> Things are much easier for the file system if it can just treat names as bags of bytes.

And much, much harder for applications if that "bag of bytes" is an invalid UTF-8 sequence. You will end up with an invalid string (or an exception), and trying to open that file will then fail.

I'd really hope that Apple checks that the filenames are valid UTF-8 as they otherwise can end up with very interesting security bugs.
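A small Python sketch of that failure mode. The filename bytes are made up; the only thing that matters is the 0xFF byte, which cannot occur anywhere in valid UTF-8. How a given app reacts will depend on its language and libraries, so treat this as an illustration rather than a description of what APFS or iOS does.

```python
raw = b"report-\xff.txt"  # hypothetical on-disk name; 0xFF is never valid UTF-8

try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print("strict decode succeeded:", strict_ok)

# Lossy decoding "works", but the result no longer maps back to the same bytes,
# so reopening the file by this string would look up a different name.
lossy = raw.decode("utf-8", errors="replace")
print(lossy.encode("utf-8") == raw)

# Python's workaround for byte-oriented filesystems: surrogateescape keeps the
# undecodable byte as a lone surrogate so the original bytes can be recovered.
roundtrip = raw.decode("utf-8", errors="surrogateescape")
print(roundtrip.encode("utf-8", errors="surrogateescape") == raw)
```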

[+] emn13|9 years ago|reply
I'm not so sure: the thing is, those strings are not just a bag of bytes. They have semantics. Is it legal to include bytes that are 0? How about slashes in file names? How about byte sequences that aren't utf8 at all?

These things aren't just byte streams, they have semantics.

So one risk of treating this as "just bytes" is that bugs will introduce byte sequences that aren't utf8 at all, which will cause other programs to fail, or worse, to "try" and thereby corrupt the data further.

Another risk is that since it's supposed to be utf-8, some programs may do canonicalization internally to avoid confusing situations. This may even happen accidentally (though it's not likely): after all, a unicode-processing system could be forgiven for transparently changing canonicalization. But if a program canonicalizes you can now get really weird behavior such as opening a file, then saving it, and ending up with two files that look identical - because the write wrote to a path that was canonicalized.
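The open-then-save scenario can be imitated in a few lines of Python. Everything here is a stand-in: `directory` is a plain dict playing the role of a filesystem that compares names byte for byte, and the "app" is just a call to `unicodedata.normalize`.

```python
import unicodedata

# Hypothetical stand-in for a directory on a byte-exact filesystem.
directory = {}


def write(name: str, data: bytes) -> None:
    directory[name.encode("utf-8")] = data


# The file already exists under a decomposed (NFD) name.
original = unicodedata.normalize("NFD", "r\u00e9sum\u00e9.txt")
write(original, b"v1")

# A hypothetical app normalizes paths to NFC internally, then saves.
saved_as = unicodedata.normalize("NFC", original)
write(saved_as, b"v2")

print(len(directory))  # 2: the save created a second, identical-looking file
```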

Additionally, even though you can never avoid confusing paths entirely without considering the glyphs rendered, you are losing a very simple check against a fairly large class of errors.

[+] rurban|9 years ago|reply
No. Ever heard about http://websec.github.io/unicode-security-guide/

Identifiers should be identifiable. If a filename is encoded in utf8, it needs to be normalized. To the canonical form of course not the crazy Python 3 or Apple idea of NFD. Which is also slower. If it's encoded as bytes you get garbage in - garbage out.

There's much more to consider, but unfortunately you cannot restrict a directory to forbid mixed scripts or confusables. I summarized a few problems at http://perl11.org/blog/unicode-identifiers.html

[+] saghm|9 years ago|reply
> If you're really talking about bags (unordered sets), that would certainly make for an interesting filesystem since filename.txt, filemane.txt, and maletent.fix would all be the same...

I was thinking the same thing! Although if we're being pedantic, I think "bag" is an "unordered multiset". If it were just a plain unordered set, then "filename.txt" would also be the same as "filenam.txt".

[+] kijeda|9 years ago|reply
Incidentally, your theoretical system is exactly how the DNS stores domain names.
[+] tyingq|9 years ago|reply
>perhaps with one or two exceptions like '/' and \0

Aren't colons (:) an issue with MacOS as well? I think Finder and other userspace apps convert them to slashes. I suppose though, there's a case for the filesystem not caring about that.

[+] sametmax|9 years ago|reply
How do you normalize Arabic or Chinese in a meaningful way for people speaking these languages?
[+] infogulch|9 years ago|reply
If I were allowed one more distinction I would definitely add utf-8 only.
[+] silverwind|9 years ago|reply
The same goes for case-sensitivity which, unfortunately, they see as a defect. It would be great for portability if APFS stayed case-sensitive like the Unix file systems, i.e. just a dumb layer that reads/writes bytes.
[+] QuercusMax|9 years ago|reply
This seems especially bad because US-based developers who don't test with unicode filenames might not come across this issue, leaving all their non-English-speaking customers broken. (Not that this excuses such developers in any way.)

It also means that, yet again, every app will need to be updated for a new version of iOS. Makes me wonder how many apps will be left behind if not updated? Thousands? Hundreds of thousands? Millions?

[+] hamstergene|9 years ago|reply
I'm going through apps on my phone and can hardly think what any of them would use Unicode filenames for. Say, a messenger might use user's nickname to name a history file — that would cause one-time loss of history, but not break the app. What else?

Something tells me practically no apps will be seriously affected.

[+] xenadu02|9 years ago|reply
You should not be using anything other than UUIDs or integers for file names. Maintain your own mapping in a database or file.

Using a network value or a value returned by an API is just asking for trouble.

If a user names a file that will be hidden behind a URL the same advice applies. If not then the user can use any sequence of bytes they want and you shouldn't care.

[+] spullara|9 years ago|reply
Except for filesystem experts like Dropbox, developers probably shouldn't be letting users name their files.
[+] djrogers|9 years ago|reply
iOS 10.3 with APFS has been in public and developer beta for several months - it's up to beta 7 right now in fact. If this were as vast a problem as Michael Tsai presents in this post, wouldn't we (the devs and beta testers) be running into this a lot?

Given how loudly the tech press proclaims any perceived mis-step by Apple, I'd have to believe we'd have been reading tons of 'Apple is Doooooooomed' articles about this by now. Given that this hasn't happened, and I haven't seen similar problem reports on dev forums and other hangouts, I'd lean towards there being some miscommunication or misunderstanding here.

[+] eridius|9 years ago|reply
This is not likely to be a particularly big issue on iOS, because the file system isn't directly exposed to the user (and therefore the user can't go making changes behind the app's back). There are of course still edge cases that could cause a problem, but they're going to be relatively rare.

But this may become a much bigger issue when we start using APFS on macOS.

[+] williamscales|9 years ago|reply
Is it possible that the beta testers are largely in the US and so wouldn't have seen this issue much?
[+] bartvk|9 years ago|reply
As I understand it, it may only be a problem when using the straight libc calls. But most devs just use the file system stuff that's in Apple's Foundation framework.
[+] cesarb|9 years ago|reply
This used to be a pain point with git, when some developers were using MacOS and the repository had file names with accents; to git, it looked like the files had been renamed. Some time later, git added the "core.precomposeunicode" option to work around this problem.
[+] jacobolus|9 years ago|reply
This is a big change. I guess they now decided that compatibility with external systems is a more important goal than end-user-friendliness.

It’s a reasonable decision to come to (especially for iOS where end-users don’t ever really interact with the filesystem directly), but it will cause quite a bit of churn in the short term.

[+] vbezhenar|9 years ago|reply
I'm not sure if normalization is a good idea (mainly because Unicode is a complex beast, and moving that complexity inside a kernel should be carefully weighed), but I'm sure that it doesn't solve any real problem. The characters "A" and "А" look identical, unless you're missing a Cyrillic font, but they won't be normalized to each other, because they are completely different characters. There are many other visually identical strings. So while normalization might solve some simple problems, it's not a complete solution, so the filesystem might as well treat names as byte arrays and let the user solve his own problems.
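This is easy to confirm with Python's standard `unicodedata` module. The sketch below spells the two letters as escape codes (U+0041 and U+0410) so there is no doubt about which is which, and tries all four standard normalization forms.

```python
import unicodedata

latin = "\u0041"     # LATIN CAPITAL LETTER A
cyrillic = "\u0410"  # CYRILLIC CAPITAL LETTER A

# For each normalization form, do the two letters end up equal?
results = {
    form: unicodedata.normalize(form, latin) == unicodedata.normalize(form, cyrillic)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

print(results)  # every value is False
print(unicodedata.name(latin))
print(unicodedata.name(cyrillic))
```

Catching lookalikes of this kind needs a separate confusables check, which is a different mechanism from normalization.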
[+] lathiat|9 years ago|reply
Seems to me that Apple would be smart to hook all of the file functions and survey and/or alert on this situation somehow.

I only learnt about Unicode normalisation recently, while looking at ZFS. It has options for it that I had never seen until I read the Ubuntu Root FS on ZFS guides, which talk about setting them.

[+] kalleboo|9 years ago|reply
Linus Torvalds will be happy to hear that http://www.cio.com/article/2868393/linus-torvalds-apples-hfs...
[+] ben_bai|9 years ago|reply
HFS+ can be configured at creation time to be case-sensitive. I did so a while back. It worked perfectly except for one application which could not find its files. So I had to create a container, format it case-insensitive, and install the app there...
[+] cjensen|9 years ago|reply
This is for iOS, where the app developer fully controls file naming within their sandbox. It is very unlikely that MacOS will fail to normalize because filenames there are presented directly to the user.
[+] alphabettsy|9 years ago|reply
Wouldn't this be seen as an issue in betas? I haven't seen anything indicating this is widespread so far? Why would that be, just not wide enough deployment yet?
[+] Moru|9 years ago|reply
Maybe most beta testers are based in English-speaking countries, or in countries where most people are used to staying away from non-English characters, and never noticed the problem? I live in Sweden and still avoid using åäö in filenames because of old habits from the DOS/Atari era.
[+] makecheck|9 years ago|reply
Technically the presentation-layer problem existed already with things like legacy path separators, making the Finder tell lies in the presence of colons or slashes. I suspect that normalization differences will be a little like telling two files apart when one has a trailing space, or hidden file extensions; there will have to be some distinction but maybe no easy answer.
[+] al2o3cr|9 years ago|reply

> More generally, once APFS is deployed users can legitimately end up with multiple files in the same folder whose names only differ in normalization.

The initial message that starts this off seems to imply the opposite - instead, application developers should be normalizing the name before handing it to the filesystem. In that case, an application which allowed non-normalized naming would arguably have a bug.
[+] kalleboo|9 years ago|reply
Don't Mac apps already have to deal with network and FAT32 drives? Or does macOS already normalize those?
[+] djrogers|9 years ago|reply
Network drives are handled by the file sharing protocol, not the local file system. FAT32 is handled by a FAT32 driver that does the correct thing according to FAT32 rules.
[+] killercup|9 years ago|reply
Not sure if APFS has such a thing, but I think I heard about it a while back:

Could they introduce a directory-level option to automatically normalize all files below that node? (Same with case-sensitivity, which I think Adobe software still has problems with.)

[+] therealmarv|9 years ago|reply
Is APFS still using Apple-style UTF-8 for e.g. umlauts? I had a lot of trouble with rsync and later with Samba (filenames and folders hidden) when I discovered that umlauts on HFS are different from umlauts on e.g. ext4.
[+] eridius|9 years ago|reply
What do you mean by "different"?

Umlauts on HFS+ are still umlauts anywhere else. The only real oddity of HFS+ (beyond the fact that it does normalization at all) is that it's not using NFD, it's using a variant of NFD based on an old version of Unicode (it has to be this way because the normalization tables must be immutable or there are compatibility issues when reading drives written to from different versions of the OS). So if you take a filename and convert it to NFD, it may not be the exact same byte sequence that you get if you plug that filename into HFS+ (but in most cases it will be). But whatever byte sequence you get from HFS+ is still going to be a valid Unicode sequence.

[+] 0x0|9 years ago|reply
If you are rsync'ing from HFS on a Mac to a Linux server, you can use "--iconv=UTF-8-MAC,UTF-8" to fix this problem.
[+] dom0|9 years ago|reply
It doesn't really matter what they do, since filesystem naming is FUBAR and has been FUBAR pretty much since UNIXv1, and possibly even earlier than that outside the UNIX family.
[+] m-j-fox|9 years ago|reply
I want to make the gzipped contents of my file its name and leave the actual contents blank, or use them for whatever metadata. Thanks, APFS!
[+] rick_cheese|9 years ago|reply
The way iOS abstracts the filesystem away from the user's view makes this less of an issue than it otherwise would be, but it's still a good find by the author. As an aside, surely I'm not the only one who thought of [1] when I read "APFS now treats all files as a bag of bytes on iOS" ;)

[1] https://www.youtube.com/watch?v=OT7xc_XqYO8