(no title)
supergarfield | 4 years ago
In a different direction, I don't know what your problem domain is/was, but in general when I'm dealing with UTF-8, I don't need to convert back to bytes very often. Was the need for conversion mostly due to libraries that still expected strings instead of bytes?
StewardMcOy|4 years ago
I should also say that we were working with files in tons of encodings, not just UTF-8. We had UTF-16 and UTF-32, both little and big endian, with and without BOMs, but we also had S-JIS and a bunch of legacy 8-bit encodings. Often we wouldn't know what encoding a file was in, so we'd have to use the chardet library, along with some home-grown heuristics to guess.
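For anyone who hasn't had to do this, here is a rough sketch of the BOM-first part of that kind of guessing, using only the standard library. The names `sniff_bom` and `decode_guess` and the fallback list are made up for illustration, not what we actually ran; in practice a statistical detector such as chardet would replace the naive fallback loop.

```python
import codecs

# Known BOMs. Order matters: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so the UTF-32 entries must be checked first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length) if data starts with a known BOM, else (None, 0)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_guess(data: bytes, fallbacks=("utf-8", "shift_jis", "cp1252")):
    """Decode bytes of unknown encoding; return (text, encoding_used).

    Hypothetical heuristic: trust a BOM if present, otherwise take the
    first fallback codec that decodes without error.
    """
    encoding, skip = sniff_bom(data)
    if encoding is not None:
        return data[skip:].decode(encoding), encoding
    for candidate in fallbacks:
        try:
            return data.decode(candidate), candidate
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte value to a code point, so it cannot fail.
    return data.decode("latin-1"), "latin-1"
```

Note that "decodes without error" is a weak signal: plenty of byte sequences are valid in several legacy encodings at once, which is why BOM-less files needed the extra heuristics.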
Off the top of my head, the two biggest footguns are:
- There should be no way to read or write the contents of a file into a str without specifying an encoding. locale.getpreferredencoding() is a mistake. File operations should be on bytes only, or require an explicit encoding.
- .encode() and .decode() are very poorly named for what they do, and it wasn't uncommon for someone to get them backwards. Sometimes no exception is even thrown when you get them wrong; you just get incorrect data.
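To make both points concrete, a small Python 3 sketch (the file name and sample text are arbitrary): encode goes from str to bytes, decode goes from bytes to str, picking the wrong codec can silently produce garbage, and file I/O can always be done with an explicit encoding or in binary mode.

```python
import os
import tempfile

text = "café"

# str -> bytes is encode; bytes -> str is decode.
raw = text.encode("utf-8")
assert raw == b"caf\xc3\xa9"
assert raw.decode("utf-8") == text

# Wrong codec, no exception: latin-1 accepts every byte, so the two
# UTF-8 bytes of "é" come back as two unrelated characters.
garbled = raw.decode("latin-1")
print(garbled)  # cafÃ©

# File I/O without relying on locale.getpreferredencoding():
# either name the encoding explicitly, or open in binary mode.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, "rb") as f:
        on_disk = f.read()
    with open(path, "r", encoding="utf-8") as f:
        round_tripped = f.read()
```

In Python 3, str has no .decode() and bytes has no .encode(), so calling the wrong one fails immediately with an AttributeError. The silent failure that remains is the wrong-codec case shown above.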
Both of these were issues in Python 2 as well. There's a valid architectural argument to be had between the Python 2 way, where str was a bag of bytes and the unicode type was for decoded text, and the Python 3 way, where the bytes type is your bag of bytes and str holds your decoded string. I favor Python 3's way of doing it, but it's almost six of one, half a dozen of the other. The advantages of one over the other are slight, and given how many library functions relied on the old behavior, it was probably a mistake to change it like that rather than keeping the Python 2 model and fixing the issues above that actually caused problems.
aeturnum|4 years ago
I think the P3 string/byte ecosystem was made substantially weaker by P3 deciding not to lean more into types (something I have complained about on here before!). Like...they are the only values where the stdlib is extremely specific about you passing a value that has the exact right type, but the standard tools for tracking that are pretty poor.
kortex|4 years ago
Isn't that the point? Strings and bytes are different beasties. You can often encode strings to bytes and just about anything accepting bytes will accept the result, but the converse is not true. Bytes are more permissive in that any sequence of values in 0x00-0xff is acceptable, but str implies valid Unicode text that can be encoded as UTF-8 (not always guaranteed, I've seen some shit), meaning e.g. you can dump it to JSON without any base64-style encoding.
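A quick sketch of that asymmetry in Python 3, with made-up sample values: a str encodes to bytes, arbitrary bytes need not decode, json.dumps takes str but not bytes, and the "not always guaranteed" case can be reproduced with the surrogateescape error handler.

```python
import json

text = "naïve ☃"
payload = text.encode("utf-8")  # a str can be turned into bytes

# The converse can fail: these bytes are not valid UTF-8.
blob = bytes([0xFF, 0xFE, 0x00, 0x80])
try:
    blob.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# json.dumps accepts str directly but rejects bytes with a TypeError.
as_json = json.dumps({"msg": text})
try:
    json.dumps({"msg": payload})
    bytes_ok = True
except TypeError:
    bytes_ok = False

# A str that is not encodable as UTF-8: surrogateescape smuggles the
# undecodable byte 0xFF into the string as the lone surrogate U+DCFF.
odd = b"\xff".decode("utf-8", errors="surrogateescape")
try:
    odd.encode("utf-8")
    odd_ok = True
except UnicodeEncodeError:
    odd_ok = False
```

All three flags end up False here, so the str type on its own is not proof that the contents are clean text.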