Unicode in Python 3

[+] wbond|11 years ago|reply

Having written a bunch of Python 2 and porting it to 3 where I deal with unknown encodings (FTP servers), I can't help but disagree with Armin on most of his Python 3 posts.

The crux of his argument with this article is "unix is bytes, you are making me deal with pain to treat it like Unicode." Python 2 just allowed to take crap in and spit crap out. Python 3 requires you to do something more complicated when crap comes in. In my situation, I am regularly putting data into a database (PostgreSQL with UTF-8 encoding) or working with Sublime Text (on all three platforms). You try to pass crap along to those and they explode. You HAVE to deal with crappy input.

In my experience, Python 2 explodes at run time when you get weird crappily-encoded data. And only your end users see it, and it is a huge pain to reproduce and handle. Python 3 forces you to write code that can handle the decoding at the get go. By porting my Python 2 to 3, I uncovered a bunch of places where I was just passing the buck on encoding issues. Python 3 forced me to address the issues.

I'm sure there are bugs and annoyances along the way with Python 3. Oh well. Dealing with text input in any language is a pain. Having worked with Python, C, Ruby and PHP and dealing with properly handling "input" for things like FTP, IMAP, SMTP, HTTP, etc, yeah, it sucks. Transliterating, converting between encodings, wide chars, Windows APIs. Fun stuff. It isn't really Python 3 that is the problem, it is undefined input.

Unfortunately, it seems Armin happens to play in areas where people play fast and loose (or are completely oblivious to encodings). There is probably more pain generally there than dealing with transporting data from native UI widgets to databases. Sorry dude.

Anyway, I never write Python 2 anymore because I hate having this randomly explode for end-users and having to try and trace down the path of text through thousands of lines of code. Python 3 makes it easy for me because I can't just pass bytes along as if they were Unicode, I have to deal with crappy input and ask the user what to do.

Python 2 is a dead end with all sorts of issues. The SSL support in Python 2 is a joke compared to 3. You can't re-use SSL contexts without installing the cryptography package, which requires, cffi, pycparsers and bunch of other crap. Python 2 SSL verification didn't exist unless you roll your own, or use Requests. Except Requests didn't even support HTTPS proxies until less than a year ago.

Good riddance Python 2.

[+] the_mitsuhiko|11 years ago|reply

> Python 3 requires you to do something more complicated when crap comes in.

Or in most cases: Python 3 falls flat on the floor with all kinds of errors because you did not handle unicode with one of the many ways you need to handle it.

On Python 2 you decoded and encoded. On Python 3 you have so many different mental models you constantly need to juggle with. (Is it unicode, is it latin1 transfer encoded unicode, does it contain surrogates) and then for each of them you need to start thinking where you are writing it to. Is it a bytes based stream? then surrogate errors can be escaped and might result in encoding garbage same as in python 2. If it a text stream? Then that no longer works then you can either crash or write different garbage. If it's latin1 transfer encoded then most people don't even know that they have garbage. I filed lots of bugs against that in WSGI libs.

If you write error free Python 3 unicode code, then teach me. (Or show me your repo and I show you all the bugs you now have)

[+] twic|11 years ago|reply

There was a related discussion on the Mercurial mailing list a while back. Not about Python 2 vs 3, but about filename encoding.

Mercurial follows a policy of treating filenames as byte strings. Matt Mackall is very clear about this. Because unix treats filenames as byte strings, this makes Mercurial interoperate with other programs on a unix machine pretty well: you can manage files of any encoding, you can embed filenames in file contents (eg in build scripts) and be confident they will always be byte-for-byte identical with the names managed by Mercurial, etc.

However, it also means Mercurial falls flat on its face when it's asked to share files between machines using different encodings. Names which work fine on one machine will, to human eyes, be garbled nonsense on the other.

This is a problem which does actually happen; there is a slow trickle of bug reports about it. And because of the commitment to unix-style filenames, it will probably never be fixed. List members did try and come up with some ideas to fix it which preserved the unix semantics normal cases, but they weren't popular.

And before anyone gets lippy, i assume Git has the same problem.

Ultimately, i would say this comes down to a conflict between two fundamentally different kinds of users of strings: machines and people. Machines are best served by strings of bytes. People are best served by strings of characters. Usually. And sadly, unix's lack of a known filesystem encoding is too well-established for there to be much chance of building a bridge.

[+] andreasvc|11 years ago|reply

What do you mean by "share files between machines"? Do you mean over a protocol? In that case the protocol over the wire should be well-defined and would avoid problems. If you mean by sharing files over an USB-stick then it's not so much an application problem as an OS issue.

I don't think the argument about machines wanting bytes is true. Machines will accept anything as long as it is well-defined. I'm really curious why there isn't yet some Linux or Posix standard that mandates utf-8. What's the problem with just decreeing that version +1 of the standard now expects utf-8?

[+] marcosdumay|11 years ago|reply

Well, honestly, the other way (interpreting filenames) would create worse interoperability problems.

[+] PeterisP|11 years ago|reply

It seems that git doesn't have this problem - just tried adding an encoding-sensitive nonASCII filename, and it worked correctly when pulled between different operating systems (macos, win7, ubuntu).

[+] overgard|11 years ago|reply

I had to deal with this a lot at a job I used to have (not python specifically, but just with unicode issues), and there's really just not a right answer to how to do any of this. Any solution you pick is going to suck for someone.

One thing he's leaving out of the Python 2 being better aspect: Ok, for cat you can treat everything as one long byte array. But what if, say, I need to count how many characters are in that string? Or what if I need to write a "reverse cat", which reverses the string? Python 2's model is entirely broken there.

Armin suggests that printing broken characters is better than the application exploding and I agree.. sometimes. On the other hand, try explaining to a customer why the junk text they copy pasted from microsoft word into an html form has question marks in it when it shows on your site.

The problem with the whole "treat everything as bytes" thing is that you'll never have a system that quite works. You'll just have a system that mostly works, and mostly for languages closer to english. Going the rigorous route is the hard way, but it will end up with systems that actually work right.

[+] rdtsc|11 years ago|reply

> There is a perfectly other language available called Python 2, it has the larger user base and that user base is barely at all migrating over. At the moment it's just very frustrating.

I come from a different perspective, I looked at the benefits of Python 3 and looked at my existing code base and how it would be better if was written in Python 3 and apart from bragging rights, and having a few built-in modules (that now I get externally) it wouldn't actually be better.

To put it plainly, Python 3, for me, doesn't offer anything at the moment. There is no carrot at the end. I have not seen any problems with Unicode yet. Not saying they might not be lurking there, I just haven't seen them. And, most important, Python 2 doesn't have any stick beating me on the head with, to justify migrating away from it. It is just a really nice language, fast, easy to work with, plenty of libraries.

From from _my_ perspective Python 3 came at the wrong time and offered the wrong thing. I think it should have happened a lot earlier, I think to justify incompatibilities it should have offered a lot more, for example:

* Increased speed (a JIT of some sort)

* Some new built-in concurrency primitives or technologies (something greenlet or message passing based).

* Maybe a built-in web framework (flask) or something like requests or ipython.

It is even hard to come with a list, just because Python 2 with its library ecosystem is already pretty good.

[+] ak217|11 years ago|reply

Is sys.getfilesystemencoding() not a good way to get at filename encoding?

I think on the face of it I do like the Go approach of "everything is a byte string in utf-8" a lot, but I haven't really worked with it so there's probably some horrible pain there somewhere, too. In the meantime Python 3 is a hell of a lot better than Python 2 to me because it doesn't force unicode coercion with the insane ascii default down my throat (by the time most new Python 2 coders realize what's going on, their app already requires serious i18n rework). Also, I don't really know why making sure stuff works when locale is set to C is important - I would simply treat such a situation as broken.

In writing python 2/3 cross-compatible code, I've done the following things when on Python 2 to stay sane:

- Decode sys.argv asap, using sys.stdin.encoding

- Wrap sys.stdin/out/err in text codecs from the io module (https://github.com/kislyuk/eight/blob/master/eight/__init__....). This approximates Python 3 stdio streams, but has slightly different buffering semantics compared to Python 2 and messes around with raw_input, but it works well. Also, my wrappers allow passing bytes on Python 2, since a lot of things will try to do so.

[+] inklesspen|11 years ago|reply

If you want to work with bytes on stdin and stdout, Python 3 documents how to do that, at the same place it documents the stdin and stdout streams.

https://docs.python.org/3/library/sys.html#sys.stdin

All you have to do is use sys.stdin.buffer and sys.stdout.buffer; the caveat is that if sys.stdin has been replaced with a StringIO instance, this won't work. But in Armin's simple cat example, we can trivially make sure that won't happen.

I'd be a lot more willing to listen to this argument if it didn't overlook basic stuff like this.

[+] DasIch|11 years ago|reply

Yes, the documentation mentions that you can use buffer and follows that by a sentence explaining that you can do this unless you can't:

> Note that the streams may be replaced with objects (like io.StringIO) that do not support the buffer attribute or the detach() method and can raise AttributeError or io.UnsupportedOperation.

So no this is neither basic nor easy to do correctly in general. That's only the case, if you are writing an application and use well-behaved libraries that handle the edge cases you introduce.

[+] CatMtKing|11 years ago|reply

I guess it's a little odd that Python 3 treats stdin and stdout by default as unicode text streams. And sys.argv is a list of unicode strings, too, instead of bytes.

[+] andreasvc|11 years ago|reply

> I'd be a lot more willing to listen to this argument if it didn't overlook basic stuff like this.

Just because there's a way around this particular issue doesn't mean that the attitude of Unicode by default of Python 3 isn't problematic. There's also sys.argv, os.listdir, and other filename stuff which Python 3 attempts to decode.

[+] mangecoeur|11 years ago|reply

I get that Armin runs into pain points with Py3, but on the other hand I get annoyed with the heavily English centric criticims - its easy to think py2 was better when you're only ever dealing with ASCII text anyway.

Fact is, most of the world doesn't speak english and needs accents, symbols, or completely different alphabets or characters to represent their language. If POSIX has a problem with that then yes, it is wrong.

Even simple things like french or german accents can make the Py2 csv module explode, while Py3 works like a dream. And anyone who thinks they can just replace accented characters with ASCII equivalents needs to take some language lessons - the result is as borked and nonsensical as if, in some parallel univese, I had to replace every "e" with an "a" in order to load simple english text.

[+] Thrymr|11 years ago|reply

Armin is Austrian. Whatever else you think of his critique, it's probably not English-centric and he's had to deal with accents in his native language.

[+] the_mitsuhiko|11 years ago|reply

My libraries are all supporting unicode on Python 2. And in fact, they do it better than on Python 3. File any unicode bugs you might encounter on Python 2 against me please.

[+] andreasvc|11 years ago|reply

This is not what the blog post is saying. It is saying that Python 3's attitude of forcing Unicode is making life difficult, whereas in Python 2 it is easier to decode to Unicode where needed, and be able to accept non-Unicode data in other cases. That Unicode needs to be supported was never under discussion.

[+] abus|11 years ago|reply

I've used the Py2 csv module with csv files containing accented characters and had no issues. Could you post an example?

This works fine, as does the corresponding writer:

    reader = csv.reader(open(FILENAME, 'rb'))
    for row in reader:
        print row

If you want a unicode string it's as simple as:

    value = row[0].decode('utf8')

Then before writing:

    row[0] = value.encode('utf8')

[+] lmm|11 years ago|reply

If you're happy with Go's "everything is a unicode string" approach then you should be happy to just treat everything as unicode. Don't handle the decode errors - if someone sends some data to your stdin that's not in the correct encoding, too bad.

Yes, python3 makes it hard to write programs that operate on strings as bytes. This is a good thing, because the second you start to do anything more complicated than read in a stream of bytes and dump it straight back out to the shell (the trivial example used here), your code will break. Unix really is wrong here, and the example requirement would seem absurd to anyone not already indoctrinated into the unix approach: you want a program that will join binary files onto each other, but also join strings onto each other, and if one of those strings is in one encoding and one is in another then you want to print a corrupt string, and if one of them is in an encoding that's different from your terminal's then you want to display garbage? Huh? Is that really the program you want to write?

[+] the_mitsuhiko|11 years ago|reply

> If you're happy with Go's "everything is a unicode string" approach then you should be happy to just treat everything as unicode. Don't handle the decode errors - if someone sends some data to your stdin that's not in the correct encoding, too bad.

Go's approach is transparently passing data through. In Python 3 your process crashes if you do that.

[+] ninkendo|11 years ago|reply

Could we all agree to not use the word "unicode" when talking about encoding (ie. how code points are serialized to bytes)? Unicode (ie. the standard set forth by the unicode consortium) has nothing to say about encoding.

I think what you mean is "Go's 'everything is a UTF-8 string' approach", but I'm not familiar enough with Go's internal encoding to know.

For instance, you mention "Don't handle the decode errors", but I can only assume by that you mean UTF-8's decode errors, since UTF-8 has the possibility of having encoding errors, where things like UTF-32 do not. They're both Unicode encodings, so it makes no sense to say it's "Unicode"'s decode errors.

I think the author of this article falls into the same trap all over the place. He uses the word "Unicode" to refer to an encoding all over the place. Until he is able to distinguish the difference between the Unicode standard and all its various encoding methods, there's not much point in reading his article.

(And no, Microsoft's braindead decision to use the word "Unicode" to mean "UCS-2", and later ret-conning it to mean "UTF-16", doesn't count. Don't perpetuate the stupid.)

[+] chimeracoder|11 years ago|reply

> If you're happy with Go's "everything is a unicode string" approach then you should be happy to just treat everything as unicode.

That's actually not really Go's approach. In Go, strings do not have encodings attached to them.

Source files are defined to be UTF-8 (by the compiler), so string literals are always unicode. That's not quite the same thing as saying that the "string" type in Go is always Unicode (it's not). And when you're dealing with a byte slice ([]byte), you cannot make any assumptions about the encoding.

It took a bit to wrap my head around this when I first read about it[0], but now that I think about it, I think it's the right way to go[1].

[0] http://blog.golang.org/strings

[1] And for what it's worth, Go and UTF-8 were designed by (some of) the same people, so one would hope they'd get it right!

[+] cool-RR|11 years ago|reply

Worth it if only for `copyfileobj`. As a seasoned Python expert, I was not familiar with that function. From the docs:

shutil.copyfileobj(fsrc, fdst[, length]) Copy the contents of the file-like object fsrc to the file-like object fdst. The integer length, if given, is the buffer size. In particular, a negative length value means to copy the data without looping over the source data in chunks; by default the data is read in chunks to avoid uncontrolled memory consumption. Note that if the current file position of the fsrc object is not 0, only the contents from the current file position to the end of the file will be copied.

[+] andreasvc|11 years ago|reply

I think the main problem here is an impedance mismatch caused by forcing things to be Unicode. While the Python developers are technically correct (the best kind they say..) in claiming that LANG=C means ASCII, that's not how everything else in UNIX works until now, most applications don't crash because of encoding errors. And filenames are byte strings, so forcing Unicode on them is a bad idea.

It would be great if everyone fixed their locale settings and all their filename encodings but in the meantime this will cause even more friction for Python 3 adoption.

[+] andrewstuart|11 years ago|reply

It's a great concern that some of Python's most respected developers such as mitsuhiko and Zed Shaw are not on board with the current future direction of Python. It would be a better world for all if somehow Python 4 could be something that everyone is happy with - I want the mitsuhikos and Zed Shaws of the world to be writing code that I can run as a Python 3 user, written in a language that these top level developers feel enthused about.

Is there no way forward that everyone agrees on? Has anyone ever proposed a solution?

[+] shadowmint|11 years ago|reply

> That I work with "boundary code" so obviously that's harder on Python 3 now (duh)

mhm. I tell people now and then that python 3 (and the python 3 developers) are hostile to people embedding it and using it for low level tasks specifically because of this unicode stuff, and they tend to tell me I should just suck it up.

I suppose I'm morbidly glad not the only one feeling the pain, but really, it honestly feels like python 3 line is just not making any effort towards making this stuff easier and simpler. :/

[+] ygra|11 years ago|reply

Unicode, dealing with text, i18n are never easy and simple. That being said, there are lots of things that work on both Windows and Unix and use Unicode internally, even for file names and paths (e.g. Qt and the already-mentioned Java). Qt is even used by a popular-ish desktop environment. If that approach were that unsuitable and utterly incompatible with the Unix approach on encodings I wonder why it apparently does work.

[+] sp332|11 years ago|reply

Could you use the "bytes" type instead of the "string" type for that low-level stuff?

[+] andrewstuart|11 years ago|reply

I hear and understand and agree with the issues raised, the question is what is the right way to fix this stuff? How can we get there?

How can we get the Python 2 stalwarts and the Python 3 folks to all sit in the same figurative room and create a future that everyone is happy with?

It would be nice to see the ongoing grumbling about Python 3 replaced with a tangible peace process.

Are the warring parties talking about solutions?

[+] rectangletangle|11 years ago|reply

It'd be nice, I don't want Python to go the way of PERL.

[+] e12e|11 years ago|reply

I don't know... I get an error from the first script with python3:

    $ ls
    test  test3.py  test.py  tøst  日本語
    $ python2.7 test.py *
    hello hellø こにちは tøst 日本語
    import sys
      # (…)
    hello hellø こにちは tøst 日本語
    hello hellø こにちは tøst 日本語
   
    $ python3 test.py *
    Traceback (most recent call last):
      File "test.py", line 13, in <module>
        shutil.copyfileobj(f, sys.stdout)
      File "/usr/lib/python3.2/shutil.py", line 68, in copyfileobj
        fdst.write(buf)
    TypeError: must be str, not bytes
    
    #But I can make it work with:
    $ diff test.py test3.py 
    8c8
    <             f = open(filename, 'rb')
    ---
    >             f = open(filename, 'r')
    
    $ python3 test3.py *
    # same as above

Now, these two scripts are no longer the same, the python3 script outputs text, the python2 script outputs bytes:

    $ python3 test3.py /bin/ls
    Traceback (most recent call last):
      File "test3.py", line 13, in <module>
        shutil.copyfileobj(f, sys.stdout)
      File "/usr/lib/python3.2/shutil.py", line 65, in copyfileobj
        buf = fsrc.read(length)
      File "/usr/lib/python3.2/codecs.py", line 300, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 24: invalid start byte

The other script works like cat -- and dumps all that binary crap to the terminal.

So, yeah, I guess things are different -- not entirely sure that the python3 way is broken, though? It's probably correct to say that it doesn't work well with the "old" unix way in which text was ascii and binary was just bytes -- but consider:

    $ cat /bin/ls |wc
    403    2565  114032
    e12e@stripe:~/tmp/python/unicodetest $ du -b /bin/ls
    114032  /bin/ls

Does that "wordcount" and "linecount" from wc make any sense? For that matter, consider:

    $ cat test
    hello hellø こにちは tøst 日本語
    e12e@stripe:~/tmp/python/unicodetest $ wc test
     1  5 42 test

(Here the word count does make sense, but just because it's an artificial example, it wouldn't make sense for actual Japanese).

The character count is pretty certainly wrong unless you cared about what "du -b" thinks of the number of bytes...

[+] lifeisstillgood|11 years ago|reply

The japanese example is interesting - because wc really rather depends on the language. So does regex. And quite a lot of other things that are useful in a Latin-derived world kind of get harder in a right to left inflected written language (if there is one, some Arabic comes to mind).

I think if anything will force us to rethink the underlying assumptions of Unix, its unicode.

[+] the_mitsuhiko|11 years ago|reply

The first script does not work on Python 3. That's the whole point of the post. You need to use the second one.

[+] lazzlazzlazz|11 years ago|reply

Isn't the entire point of the post that the first Python3 example will not work?

[+] unknown|11 years ago|reply

[deleted]

[+] keyme|11 years ago|reply

Strings should be byte strings. Not ASCII, not Unicode. Bytes.

Strings don't represent Text lest I decide they do. For this a UnicodeString object should exist, and it should _not_ be the default.

In my latest project I've made myself use Python 3.4 over 2.7, for its new great features. So many steps forward, except this one thing.

What a stupid decision are these default Unicode strings...

[+] andrewstuart|11 years ago|reply

Wouldn't people be complaining if "the unicode problem" hadn't been solved in Python rather than leaving it an undefined mess? Now it is a solved problem even if the solution is seen as a problem by some.

[+] pekk|11 years ago|reply

From the one person who has complained most about this topic, making him an expert on complaining about Python 3 but not necessarily as much of an expert on how to cope.

[+] skizm|11 years ago|reply

Bit off topic, but can anyone recommend a good tutorial/book/whatever for python 2 programmers looking to move to (or at least become familiar with) python 3?

[+] maxerickson|11 years ago|reply

What's New in 3.0 has lots of information:

https://docs.python.org/3/whatsnew/3.0.html

and maybe take a look at what standard modules have moved to different name or namespace:

https://docs.python.org/3/py-modindex.html

[+] daftshady|11 years ago|reply

Here's good porting guide http://python3porting.com/

[+] bdevine|11 years ago|reply

I liked the latest edition of "Python Cookbook" by Beazley and Jones for precisely this.

[+] im3w1l|11 years ago|reply

>For instance it will attempt decoding from utf-8 with replacing decoding errors with question marks.

Please don't do this. Replacing with question mark is a lossy transformation. If you use a lossless transformation, a knowledgeable user of your program will be able to reverse the garbling, in their head, or using a tool. Consider Ã¥Ã¤Ã¶, the result of interpreting utf8 åäö as latin1. You could find both the reason and solution by googling on it.

[+] unknown|11 years ago|reply

[deleted]

96 comments