mwsherman | 4 months ago

There is mention of how len() is bytes, not “characters”. A further subtlety: a rune (codepoint) is still not necessarily a “character” in terms of what is displayed for users — that would be a “grapheme”.

A grapheme can be multiple codepoints, with modifiers, joiners, etc.

This is true in all languages; it’s a Unicode thing, not a Go thing. Shameless plug: here is a grapheme tokenizer for Go: https://github.com/clipperhouse/uax29/tree/master/graphemes
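To make the bytes vs. runes vs. graphemes distinction concrete, here is a minimal sketch using only the standard library. The helper name `counts` and the sample strings are made up for illustration. Each sample would typically be displayed as a single user-perceived character, yet none of them has a byte length or rune count of 1.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts is a hypothetical helper: it returns the byte length and the
// rune (code point) count of s. Neither number is the grapheme count.
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	samples := []string{
		"e\u0301",              // "e" followed by a combining acute accent
		"\U0001F44D\U0001F3FD", // thumbs up followed by a skin tone modifier
		"\U0001F468\u200D\U0001F469\u200D\U0001F467", // three emoji joined by zero-width joiners
	}
	for _, s := range samples {
		b, r := counts(s)
		fmt.Printf("%q: %d bytes, %d runes\n", s, b, r)
	}
}
```

Getting the third number, the grapheme count, needs a segmenter that implements the Unicode text segmentation rules (UAX #29), such as the library linked above; the standard library does not provide one.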

HeyImAlex|4 months ago

Here’s my favorite post on the subject: https://adam-p.ca/blog/2025/04/string-length/

debugnik|4 months ago

Finally an article that doesn't pretend grapheme clusters are the be-all end-all of Unicode handling.

I'm saving this one. Not exactly how I'd explain it, but it's simplified enough to share with my current co-workers without being misleading.

virtualritz|4 months ago

len() also returns int rather than uint/uint64 in Go.

I do not use Go but ran into this when I had to write a Go wrapper for some Rust stuff the other day. I was baffled.
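For anyone hitting the same thing in a wrapper, a minimal sketch of what it looks like in practice. The function name `lengthAsUint64` is made up for illustration, and the assumption is that the foreign API wants a `uint64` length. Go has no implicit numeric conversions, so the cast has to be written out.

```go
package main

import "fmt"

// lengthAsUint64 is a hypothetical helper that converts the int returned
// by len into the uint64 a foreign API might expect. len never returns a
// negative value, so the conversion does not lose information.
func lengthAsUint64(b []byte) uint64 {
	return uint64(len(b))
}

func main() {
	data := []byte("h\u00e9llo") // "é" takes two bytes in UTF-8

	n := len(data)
	fmt.Printf("%T %d\n", n, n) // prints "int 6"

	// var u uint64 = len(data) // would not compile: int is not uint64
	fmt.Println(lengthAsUint64(data))
}
```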