top | item 30117556

(no title)

supergarfield | 4 years ago

What do you think the current text encoding footguns are?

In a different direction, I don't know what your problem domain is/was, but in general when I'm dealing with UTF8, I don't need to convert back to bytes very often. Was the need for conversion mostly due to the libraries that still expected strings instead of bytes?

discuss

order

StewardMcOy|4 years ago

It's been quite a few years since we went through the conversion, and at this point, working in Python 3 is natural to me, so I may not be able to recall all the footguns. I can say a lot of the difficulty was due to libraries, both third-party and standard, and that hasn't improved very much. I don't want to single anyone out here, because it's pervasive. In Pytuon 2, str was the bag of bytes type. I think a lot of libraries didn't want to change to accepting bytes types instead, because it broke API compatibility, but it caused a lot of issues.

I should also say that we were working with files in tons of encodings, not just UTF-8. We had UTF-16 and UTF-32, both little and big endian, with and without BOMs, but we also had S-JIS and a bunch of legacy 8-bit encodings. Often we wouldn't know what encoding a file was in, so we'd have to use the chardet library, along with some home-grown heuristics to guess.

Off the top of my head, the two biggest footguns are:

- There should be no way to read or write the contents of a file into a str without specifying an encoding. locale.getpreferredencoding() is a mistake. File operations should be on bytes only, or require an explicit encoding.

- .encode() and .decode() are very poorly named for what they do, and it wasn't that uncommon that someone would get them backwards. Sometimes, exceptions aren't even thrown for getting them wrong, you just get incorrect data.

Both of which were still issues with Python 2. There's a valid architectural argument to be had between the Python 2 way, where str was a bag of bytes, and the unicode type was for decoded bytes, and the Python 3 way, where the bytes type is your bag of bytes and str holds your decoded string. I favor Python 3's way of doing it, but it's almost six of one, half a dozen of the other. The advantages of one over the other are slight, and given how many library functions relied on the old behavior, it was probably a mistake to change it like that, rather than continuing the Python 2 way, and fixing issues like those above that caused problems.

aeturnum|4 years ago

I haven't done a ton of work with Python recently, but the problems I remember encountering came from the fact that python doesn't try to have encodings in any other part of the basic type system. So like, if you have an int or a float, you can pass those to any interface that takes a 'number-y' value and it will mostly work like you expect. That's also how strings worked in P2 - you could pass them around and things would accept the values (though you might get gibberish out the other side). Now, in P3, things will blow up (which is helpful for finding where you went wrong ofc - I understand the utility), but it means that your code handling things-that-might-be-strings-or-bytes often needs to have a different structure than the rest of your code.

I think the P3 string/byte ecosystem was made substantially weaker by P3 deciding not to lean more into types (something I have complained about on here before!). Like...they are the only values where the stdlib is extremely specific about you passing a value that has the exact right type, but the standard tools for tracking that are pretty poor.

kortex|4 years ago

> but it means that your code handling things-that-might-be-strings-or-bytes often needs to have a different structure than the rest of your code.

Isn't that the point? String and bytes are different beasties. You can often encode strings to bytes and just about anything accepting bytes will accept it, but the converse is not true. Bytes are more permissive in that any sequence of any 0x00-0xff is acceptable, but str implies utf8 (not always guaranteed, I've seen some shit), meaning e.g. you can dump it to json without any escaping/base encoding.