top | item 40171035

(no title)

aktiur | 1 year ago

> strings having an encoding and byte strings being for byte sequences without encodings

You got it kind of backwards. `str` are sequence of unicode codepoints (not UTF-8, which is a specific encoding for unicode codepoints), without reference to any encoding. `bytes` are arbitrary sequence of octets. If you have some `bytes` object that somehow stands for text, you need to know that it is text and what its encoding is to be able to interpret it correctly (by decoding it to `str`).

And, if you got a `str` and want to serialize it (for writing or transmitting), you need to choose an encoding, because different encodings will generate different `bytes`.

As an example :

>>> "évènement".encode("utf-8") b'\xc3\xa9v\xc3\xa8nement'

>>> "évènement".encode("latin-1") b'\xe9v\xe8nement'

discuss

order

chrismorgan|1 year ago

> `str` are sequence of unicode codepoints (not UTF-8, which is a specific encoding for unicode codepoints)

It’s worse than that, actually: UTF-8 is a specific encoding for sequences of Unicode scalar values (which means: code points minus the surrogate range U+D800–U+DFFF). Since str is a sequence of Unicode code points, this means you can make strings that cannot be encoded in any standard encoding:

  >>> '\udead'.encode('utf-16')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeEncodeError: 'utf-16' codec can't encode character '\udead' in position 0: surrogates not allowed
  >>> '\ud83d\ude41'.encode('utf-8')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed
Python 3’s strings are a tragedy. They seized defeat from the jaws of victory.

account42|1 year ago

Maybe we need another PEP that switches the default to WTF-8 [0] aka UTF-8 but let's ignore that a chunk of code points was reserved as surrogates and just encode them like any other code point.

[0] https://simonsapin.github.io/wtf-8/

lucb1e|1 year ago

> `str` are sequence of unicode codepoints [...] without reference to any encoding

I guess I see it from the programmer's perspective: to handle bytes coming from the disk/network as a string, I need to specify an encoding, so they are (to me) byte sequences with an encoding assigned. Didn't realize strings don't have an encoding in Python's internal string handling but are, instead, something like an array of integers pointing to unicode code points. Not sure if this viewpoint means I am getting it backwards but I can see how that was phrased poorly on my part!

tialaramex|1 year ago

There are two distinct questions here, to which implementations can provide different answers

1. Interface: How can I interact with "string" values, what kind of operations can I perform versus what can't be done ? Methods and Operators provided go here.

2. Representation: What is actually stored (in memory) ? Layout goes here.

So you may have understood (1) for Python, but you were badly off on (2). Now, at some level this doesn't matter, but, for performance obviously the choice of what you should do will depend on (2). Most obviously, if the language represents strings as UTF-8 bytes, then "encoding" a string as UTF-8 will be extremely cheap. Whereas, if the language represents them as UTF-16 code units, the UTF-8 encoding operation will be a little slower.