item 37853259

zyedidia | 2 years ago

Does anyone know why LSP uses UTF-16 for encoding columns? It seems like everyone agrees it is a bad choice, so I'm curious about the original reasoning. Are there any benefits at all to using UTF-16, or was it something to do with Microsoft legacy code?


jcranmer | 2 years ago

The JavaScript VM, the JVM, the .NET runtime, and several others (including, effectively, the entire Windows API) have UTF-16 baked into their fundamental definition of strings.

I believe the original producers and consumers of LSP were written in languages whose string lengths are measured in UTF-16 code units, so it was literally the easiest way to do it, even though UTF-16 offsets are probably the most painful thing to compute if your own string representation isn't UTF-16.
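To make the pain concrete, here is a minimal sketch (a hypothetical helper, not taken from any real LSP implementation) of what a server with UTF-8 strings must do: to turn an LSP UTF-16 column into a UTF-8 byte offset, it has to walk the line code point by code point, counting both units at once.

```typescript
// Convert an LSP column (counted in UTF-16 code units) into a byte
// offset within the UTF-8 encoding of the same line.
function utf16ColToUtf8Byte(line: string, utf16Col: number): number {
  let bytes = 0;  // UTF-8 bytes consumed so far
  let units = 0;  // UTF-16 code units consumed so far
  for (const ch of line) {                 // iterates by code point
    if (units >= utf16Col) break;
    const cp = ch.codePointAt(0)!;
    units += cp > 0xffff ? 2 : 1;          // astral plane = surrogate pair
    bytes += cp <= 0x7f ? 1 : cp <= 0x7ff ? 2 : cp <= 0xffff ? 3 : 4;
  }
  return bytes;
}
```

For example, in `"a€b"` the euro sign is one UTF-16 unit but three UTF-8 bytes, so column 2 maps to byte offset 4; an emoji like 😀 is two UTF-16 units and four UTF-8 bytes. A real implementation would also have to decide what to do when a column lands between the halves of a surrogate pair.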

LSP eventually got a solution where you can request something other than UTF-16 offset calculations, but I don't remember the details of what that solution is.
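For reference, the mechanism added in LSP 3.17 is position-encoding negotiation: the client lists the encodings it supports and the server picks one. A sketch of the relevant capability fields (shown as TypeScript object literals; `"utf-16"` support remains mandatory):

```typescript
// Client side (initialize request): supported encodings,
// most preferred first.
const clientCapabilities = {
  general: {
    positionEncodings: ["utf-8", "utf-16"],
  },
};

// Server side (initialize result): the single encoding it chose,
// which then applies to all positions in the session.
const serverCapabilities = {
  positionEncoding: "utf-8",
};
```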

hgs3 | 2 years ago

There was a lengthy discussion on this [1]. UTF-16 was used because it was convenient: it's what Microsoft APIs and JavaScript already use (the latter being the language VS Code is written in).

[1] https://github.com/microsoft/language-server-protocol/issues...

mardifoufs | 2 years ago

That thread was infuriating. Since when does an encoding format have an evangelical task force? I'm all for UTF-8 everywhere, but wow, some of the replies were super cringe.

Even when the proposal of "UTF-16 default, UTF-8 optional" was made to keep backwards compatibility, it was not enough. It had to be UTF-8 because it's technically superior, as if that's the only consideration! I agree they should've just picked one, but I still don't think the maintainers needed a refresher on what UTF-8 is every three comments.

slimsag | 2 years ago

Because JavaScript uses UTF-16 for everything, and LSP is a TypeScript-first protocol.

the_mitsuhiko | 2 years ago

Sadly there is some precedent for that: JavaScript source maps also use the same UTF-16 definition for columns.