top | item 26911414


speedplane | 4 years ago

> It is often best to avoid non-ASCII characters in source code. ...

>> Depends on the language. In Python 3, files are expected to be UTF-8 by default, and you can change that by adding a "# coding: <charset>" header.
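A minimal sketch of the Python 3 behavior the quote describes (the literals are illustrative; this assumes the source file itself is saved as UTF-8):

```python
# Python 3 source defaults to UTF-8 (PEP 3120), so non-ASCII
# string literals need no declaration header.
greeting = "café"                          # a str of code points
assert len(greeting) == 4                  # 4 characters...
assert len(greeting.encode("utf-8")) == 5  # ...but 5 bytes: 'é' is 2 bytes

# A "# coding: <charset>" header is only needed for non-UTF-8 files,
# e.g. the first line of a Latin-1 file would be:
#   # coding: latin-1
```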

It's interesting that many languages avoid Unicode and non-ASCII text, yet make assumptions about the file and directory structure of the underlying system. It's as if interpreting directory and file system structures is "okay", but interpreting file encodings is not.

> In fact, it's one of the reasons it was a breaking release in the first place, and being able to put non-ASCII characters in strings and comments in my source code are a huge plus.

Sorry, but as a Python dev who went from 2 to 3: yes, native Unicode features are nice, but no, they were not worth breaking two decades of existing code.


BiteCode_dev | 4 years ago

As somebody living in Europe, I think it's a perspective you can have only if you live mostly in an English speaking world.

Up to 2015, my laptop was plagued with Python software (and others) crashing because they didn't handle unicode properly. Reading from my "Vidéos" directory, entering my first name, dealing with the "février" month in date parsing...
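Python 3 handles cases like these natively; a minimal sketch (the path and names are made-up illustrations, not the commenter's actual files):

```python
from pathlib import Path

# Non-ASCII path components are ordinary str objects in Python 3;
# the interpreter encodes them for the OS via the filesystem encoding.
video = Path("Vidéos") / "vacances.mp4"
assert video.name == "vacances.mp4"
assert video.parts[0] == "Vidéos"

# User input is text too, so a name like "Léa" round-trips cleanly
# through ordinary string operations:
name = "Léa"
assert name.upper() == "LÉA"
```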

For the USA, things just work most of the time. For us, it's a nightmare. It's even worse because most software is produced by English-speaking people who share your attitude and are completely oblivious to the crashing bugs they introduce on a daily basis.

In Asia it's even more terrible.

And I've heard people say you can perfectly well write Unicode-aware software in Python 2. Yes, you can, just like you can write memory-safe code in C.

In fact, just teaching Python 2 is painful in Europe. A student writes their name in a comment? Crashes if you forget the encoding header. Using raw_input() with Unicode to ask a question? Crashes. Reading bytes from an image file? You get back a string object. Got garbage bytes in a string object? It silently concatenates with a valid string and produces garbage.

qsort | 4 years ago

> As somebody living in Europe, I think it's a perspective you can have only if you live mostly in an English speaking world.

I live in Europe and I (mostly) agree that (most) code shouldn't (usually) contain any codepoint greater than 127. It's a simple matter of reducing the surface area of possible problems. Code needs to work across machines and platforms, and it's basically guaranteed that someone, somewhere is going to screw up the encoding. I know it shouldn't happen, but it will happen, and ASCII mostly works around that problem.

Another issue is readability. I know ASCII by heart, but I can't memorize Unicode. If identifiers were allowed to contain random characters, it would make my job harder for basically no reason.

Furthermore, the entire ASCII charset (or at least the subset of ASCII that's relevant for source code) can easily be typed from a keyboard. Almost everyone I know uses the US layout for programming because European ones frankly just suck, and that means typing accents, diacritics, emoji or what have you is harder, for no real discernible reason.
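One concrete instance of that "surface area" argument, sketched in Python (which does allow non-ASCII identifiers, per PEP 3131): two visually identical names can be entirely different identifiers.

```python
# Latin 'a' (U+0061) and Cyrillic 'а' (U+0430) look alike in most
# fonts but are distinct identifiers -- Python's NFKC normalization
# of identifiers does not merge them.
a = 1
а = 2  # Cyrillic small 'a'
assert a != а
assert a + а == 3  # two different variables, despite appearances
```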

String handling is a different issue, and I agree that a modern language's native functions should be able to handle UTF-8 without crashing. The above only applies to the literal content of source files.