kdheepak | 1 year ago

In my opinion, one argument for internally representing `String`s as UTF-8 is that it prevents accidentally saving a file as Latin1 or some other encoding. I would like to be able to read a file my coworker sent me, in my favorite language, without first having to figure out what the file's encoding is.

For example, my most recent Julia project has the following line:

    windows1252_to_utf8(s) = decode(Vector{UInt8}(String(coalesce(s, ""))), "Windows-1252")

Figuring out that I had to use Windows-1252 (and not Latin1) took a lot more time than I would have liked it to.
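
To illustrate why the two are easy to confuse, here is a minimal sketch. It assumes the `decode` in the line above comes from the StringEncodings.jl package (the thread does not say so explicitly) and that the package is installed; the sample bytes are made up for illustration. Windows-1252 and Latin1 (ISO-8859-1) agree on most bytes and differ in the 0x80-0x9F range, where Windows-1252 has printable characters such as curly quotes and ISO-8859-1 has control characters. Both decodes succeed, so a wrong guess gives no error, only subtly wrong text.

```julia
# Assumption: StringEncodings.jl is installed (import Pkg; Pkg.add("StringEncodings")).
using StringEncodings

# Made-up sample bytes: 0x93 and 0x94 are curly quotes in Windows-1252,
# 0xE9 is "é" in both Windows-1252 and ISO-8859-1.
raw = UInt8[0x93, 0x63, 0x61, 0x66, 0xE9, 0x94]

as_cp1252 = decode(raw, "Windows-1252")  # curly quotes around "café"
as_latin1 = decode(raw, "ISO-8859-1")    # same letters, but C1 control characters at the ends

# Neither call throws, so the wrong choice is not flagged by the decoder.
println(repr(as_cp1252))
println(repr(as_latin1))

# Compare character by character: only the bytes in 0x80-0x9F decode differently.
for (a, b) in zip(as_cp1252, as_latin1)
    println(repr(a), "  ", repr(b), a == b ? "" : "   <- differs")
end
```

In a file that mostly contains ASCII and a few accented letters, the difference only shows up at characters such as curly quotes, dashes, or the euro sign, which is probably why it can take a while to notice.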

I get that there are some ergonomic challenges around this in languages like Julia that are optimized for data analysis workflows, but imho all data analysis languages/scripts should either be forced to explicitly state the encoding whenever reading or writing a file, or default to UTF-8.

samatman|1 year ago

I don't understand how a language runtime is supposed to prevent your colleague from using an unexpected encoding.

Next time you try to load whoops-weird-encoding.txt as utf-8, and get garbage, may I suggest `file whoops-weird-encoding.txt`? It's pretty good at guessing.

There might be a Julia package which can do that as well. I haven't run into the problem so I have no need to check.
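
A much cruder in-language check is possible without any detection package. The sketch below is not how `file` works and does not guess among legacy encodings: it only tests whether the bytes are well-formed UTF-8 and otherwise falls back to an encoding the caller names. `read_text` and its `fallback` keyword are hypothetical names, and the fallback decoding assumes StringEncodings.jl is installed.

```julia
# Assumption: StringEncodings.jl is installed; it provides decode(bytes, encoding).
using StringEncodings

# Hypothetical helper: return the file's text plus the encoding that was used.
function read_text(path::AbstractString; fallback::AbstractString = "Windows-1252")
    bytes = read(path)              # raw bytes as Vector{UInt8}
    if isvalid(String, bytes)       # true only for well-formed UTF-8
        return String(bytes), "UTF-8"
    end
    # Not valid UTF-8: decode with the caller-supplied guess instead.
    return decode(bytes, fallback), fallback
end

# Usage: write a small non-UTF-8 file, then read it back.
path = tempname()
write(path, UInt8[0x63, 0x61, 0x66, 0xE9])  # "café" with é stored as the single byte 0xE9
text, used = read_text(path)
println(text, " (decoded as ", used, ")")
```

The validity test is a reasonable first filter because text in a single-byte legacy encoding rarely happens to be valid UTF-8 once it contains non-ASCII bytes. The fallback is still only a guess, so the colleague's actual encoding has to come from somewhere else.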