top | item 44973787

(no title)

psionides | 6 months ago

Hmm… Yeah, I guess each language does it kinda differently. At least Ruby also does it similarly like Python.

> I’m confused here. You established indexing is by UTF-8 code unit, then said it’s because of JavaScript which… doesn’t do UTF-8 so well?

It's not that UTF-8 is because of JavaScript, it's that indexing by bytes instead of UTF-8 code units is because of JavaScript. To use UTF-8 in JavaScript, you can use TextEncoder/TextDecoder, which return the string as a Uint8Array, which is indexed by bytes.

So if you have a string "Cześć, #Bluesky!" and you want to mark the "#Bluesky" part with a hashtag link facet, the index range is 9...17 (bytes), and not 7...15 (scalars).

discuss

chrismorgan|6 months ago

> indexing by bytes instead of UTF-8 code units

When the encoding is UTF-8 (which it is here), the code unit is the byte.

They called the fields byteStart and byteEnd, but a more technically precise (no more or less accurate, but more precise) labels would be utf8CodeUnitStart and utf8CodeUnitEnd.

psionides|6 months ago

Sorry, I keep mixing these - bytes instead of scalars, which I think would be more natural to iterate over in most languages (at least the ones I use).