I couldn't debug the code because of my name

258 points | mikasjp | 4 years ago | mikolaj-kaminski.com

288 comments

xlii|4 years ago

A problem very similar to the one described started my exodus from Google services.

I also have non-latin characters in my name; however, I knew it was always an issue, so I never used them in paths etc.

At some point, a long time ago, I was tasked with doing some maintenance on a Google Cloud service (can't remember the name of the service now), which was doable only through a Python CLI utility, and it failed with a very similar Python error.

What I found out rather quickly is that the utility took my name from my Google+ profile, which did include those non-latin characters. No biggie, I thought, and fired off an e-mail to support (yeah, those were the times when it was still that easy). A few hours passed and I received word that this wouldn't be fixed anytime soon and the best course of action would be to change my name.

Of course, the support person probably meant that I should remove the diacritics from my Google+ profile, but it still left an unpleasant aftertaste for years to come.

perl4ever|4 years ago

A Polish relative of mine used to just give an arbitrary substitute name (e.g. "Dave Smith") for restaurant reservations, because even if they could write his last name, they wouldn't be able to pronounce it.

My sibling has a name that has an accent, and just enters it with the plain letter most of the time. The name was once rare and "ethnic", but became popular a generation later so people know how to pronounce it regardless.

Our parents gave us two middle names, wanting to preserve our grandmothers' surnames, but also in the spirit of "Bobby Tables", having ambivalent feelings about the computerization of society tending towards inflexibility.

According to: https://en.wikipedia.org/wiki/Naming_customs_of_Hispanic_Ame...

...misunderstanding of naming customs in the US has actually led to significant consequences due to last names not matching on legal documents.

I remember reading a story about how there are people in China whose name incorporates a character that is obscure enough that the authorities are trying to eliminate it and get them to change their name. If I recall correctly, Chinese has a particular problem with characters that are part of names that have been around forever, but are no longer used for ordinary writing.

nullspace|4 years ago

> the best course of action would be to change my name

As someone who has been told this, for other reasons, I empathize. My reaction has always been - "Your system can't even handle names, you need to fix it".

Edit: I wish there was a library / service that helped you handle all sorts of edge cases in names, so that you don't have to worry about it. Just use a user-id, and set / get a name from a lib / service that can actually handle it.

mjevans|4 years ago

This is exactly why I hate the way Python3 handles Unicode.

EVERY language should _try_ to handle Unicode such that if a data sequence were valid before it remains valid after. NONE should ever FORCE validation, since sometimes, like in the article's case, the correct answer is GIGO. Just pass it through and hope it continues to work. Sometimes the error is trying to enforce that validation.
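
For what it's worth, Python 3 does have an opt-in pass-through for this situation: the `surrogateescape` error handler (PEP 383) maps undecodable bytes to lone surrogates when decoding and restores them when encoding. A minimal sketch of that round trip; the function name and the sample path bytes are made up for illustration:

```python
def roundtrip(raw: bytes) -> bytes:
    """Decode possibly-invalid UTF-8 and re-encode it without losing bytes."""
    # Undecodable bytes become lone surrogates (U+DC80..U+DCFF) instead of raising.
    text = raw.decode("utf-8", errors="surrogateescape")
    # The same handler turns those surrogates back into the original bytes.
    return text.encode("utf-8", errors="surrogateescape")


# Hypothetical path bytes: 0xB3 on its own is not valid UTF-8.
raw_path = b"/home/miko\xb3aj/project"
same_bytes = roundtrip(raw_path) == raw_path
```

The catch is that the intermediate `str` contains lone surrogates, so printing it or encoding it with the default strict handler can still fail.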

musicale|4 years ago

> the best course of action would be to change my name

That's usually easier than getting a company to fix their software.

tyteen4a03|4 years ago

Fun fact: If you have the exclamation mark (!) in your Windows username, Java will think it's the jar separator and `getResourceAsStream` will refuse to work. This broke many people's Minecraft installation over the years.

The bug in question [0] was reported in 2001 and remains unsolved 20 years later.

[0] https://bugs.java.com/bugdatabase/view_bug.do?bug_id=4523159

kspacewalk2|4 years ago

I find the very idea of putting an exclamation mark on one's username and not expecting eventual problems to be quite curious.

GoblinSlayer|4 years ago

Simply install the game into %ProgramData%

amarshall|4 years ago

For a list of strings that often cause problems to, e.g., add to a test suite, see https://github.com/minimaxir/big-list-of-naughty-strings

ryanianian|4 years ago

Very handy. My previous simple test-case was simply a selection from this well-known text-file which is simply a collection of somewhat uncommon unicode characters, usually used for rendering tests.

https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt

But this set of strings is specifically designed to cause edge-case errors.

Also don't forget Spolsky's seminal "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...

munk-a|4 years ago

It's also important to width-test fields. Never forget to make sure that WWWWWWWWWWWW doesn't cause weird application wrapping.

tomaslaureano|4 years ago

Great resource! I usually use pangrams (holoalphabetic sentences like "The quick brown fox jumps over the lazy dog") to ensure that my code can handle all the alphabet characters for the languages that should be supported at the very minimum.

david422|4 years ago

There's also this article: falsehoods-programmers-believe-about-names:

https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-...

Certainly informative if you haven't seen it before.

My takeaway from it was: design your system to accommodate as much as possible, but since it would basically be impossible to accommodate everything, aim for your target audience.

umvi|4 years ago

Using non-ascii characters in file paths, toolchain config files, and other non-display contexts is just asking for trouble, even if it is your name...

BiteCode_dev|4 years ago

Unfortunately, it's true, most toolchains are stuck in the past, and don't deal with non-ascii characters or even spaces very well. In fact, I just learned that spaces in .desktop file values could cause trouble after a long debugging session.

But it's a shame.

In Europe, we do have a lot of non-ascii characters everywhere. Ubuntu puts a "Vidéo" and a "Téléchargements" directory in my $HOME because I'm french. If I were to use my name as my username I would have even more troubles.

I'm careful with not using special chars in names for work, but it feels like I'm a girl trying to not dress sexy in the wrong part of town: necessary, but I shouldn't have to do this, and it's definitely the others to blame.

All in all, I thank the Gods of encoding for Python 3 unicode handling. Having a scripting language that does the right thing out of the box is wonderful on this side of the pond.

jasonpeacock|4 years ago

This is the modern, post-ASCII computing world, we should no longer be willing to settle for the lowest-common-denominator of ASCII-only strings.

There's no excuse for actively supported, paid products to have these problems today.

fluxem|4 years ago

Also spaces. I spent half an hour debugging why cmake cuda build was failing.

PeterisP|4 years ago

Using non-ascii characters in file paths, toolchain config files and other non-display contexts is something every development team should explicitly, intentionally do in order to catch such bugs.

"Asking for trouble" is a key part of testing. My suggestion would be for a QA person to have their username (and the root folder of the testable project) start with a space, followed by an accented letter, a tab symbol, an apostrophe, an emoji, a Unicode RTL control character and some Arabic text.
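
A rough sketch of that idea as an automated check, assuming a POSIX system whose filesystem encoding is UTF-8 (Windows rejects the tab character in file names). The prefix and the function name are invented for illustration:

```python
import tempfile
from pathlib import Path

# Leading space, accented letter, tab, apostrophe, emoji, RTL mark, Arabic text.
AWKWARD_PREFIX = " é\t'\U0001F600\u200f\u0645\u0631\u062d\u0628\u0627 "


def roundtrip_in_awkward_dir(payload: str) -> str:
    """Write and read back a file inside a directory with a hostile name."""
    with tempfile.TemporaryDirectory(prefix=AWKWARD_PREFIX) as tmp:
        target = Path(tmp) / "data file.txt"
        target.write_text(payload, encoding="utf-8")
        return target.read_text(encoding="utf-8")


result = roundtrip_in_awkward_dir("zażółć")
```

Pointing a real build or test run at a working directory like this would exercise far more code paths than this one read/write pair does.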

gumby|4 years ago

This is blaming the victim

b112|4 years ago

This wouldn't have happened if using rust!

sschueller|4 years ago

Many years ago I could not access the apple developer panel because of the umlaut in my last name. It was eventually fixed but I was quite surprised that such a large company would run into such a basic issue.

devrand|4 years ago

My last name has an apostrophe in it which Apple apparently loves to embed directly into their JavaScript unescaped. For a long time neither I nor Apple could look up AppleCare status on my stuff as they were all linked to my Apple ID. The portal would thus require me to log in, but then would just show a partially rendered page as my last name was causing a JS syntax error.
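
To illustrate the class of bug (not Apple's actual code, which nobody here has seen): pasting a name between single quotes breaks as soon as the name contains one, while serialising it as a JSON string yields a valid JavaScript literal. A hedged sketch with a hypothetical helper:

```python
import json


def js_assignment(var_name: str, value: str) -> str:
    """Build a JavaScript assignment with the value safely quoted."""
    # json.dumps emits a double-quoted literal with quotes, backslashes
    # and control characters escaped.
    return f"var {var_name} = {json.dumps(value)};"


# Naive concatenation: three single quotes, so the script no longer parses.
naive = "var lastName = '" + "O'Rourke" + "';"
# Escaped version: var lastName = "O'Rourke";
safe = js_assignment("lastName", "O'Rourke")
```

When the result is embedded in an inline `<script>` element, a value containing `</script>` needs additional handling that this sketch does not attempt.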

irrational|4 years ago

>such a large company would run into such a basic issue

Every large company is just a conglomeration of smaller departments. Each department has individual contributors. Some individual contributor in that department wrote the code, and if nobody else in their department caught it, nobody else at the large company would have caught it, since they have their own work to consider and don't have time to look at other people's stuff.

rodgerd|4 years ago

If you look at many of the responses here it's sadly unsurprising: small-minded provincialism or outright xenophobia are no less common amongst programmers than the general population.

SergeAx|4 years ago

When I first installed Windows 7, some ten years ago, I entered my Russian name in Cyrillic. When I saw that the system had created a directory with exactly that name under `C:\Users\`, I immediately searched the internet for a way to rename it and did just that. I don't want to know how many messes like the one in this story I thereby escaped.

NB: the method is still the same, it's a second (not accepted) answer here: https://superuser.com/questions/890812/how-to-rename-the-use... (about ProfileImagePath registry value).

vertis|4 years ago

This is sad though. You shouldn't have to change who you are for a computer program.

simonblack|4 years ago

Isn't this one of those "100 things Programmers don't know about People's Names" things?

Like the poor, it will be with us always.

xdfgh1112|4 years ago

I don't know, it's just a Unicode character? Not even a newer one, it's just 2 utf8 bytes. Pretty much everything should support that in 2021.

When I think of 100 things I think of stuff like "some people spell their name in all lowercase and get really funny if you change it"

supernes|4 years ago

It's somewhat common to see videogames issue a patch shortly after release where they fix crashes due to non-ASCII Windows usernames or non-English locales. I'm not sure what the root cause of the confusion is, other than text strings being hard in general.

jerf|4 years ago

It's easy to think the answer is "just UTF-8 everything" but unfortunately the long and twisty history of filesystems means that's not the correct answer, and the "correct answer" is really hard to write down quickly.

If you never display the filename, the answer is to treat existing filenames as bags of bytes, but that breaks down as soon as you need to display them, or if you need to manipulate them by appending unicode to them, in which case you have to decide on an encoding.

Unicode encodings tend to mangle non-Unicode values because they're specified to replace whatever they can't understand with a particular Unicode character, usually represented as a diamond with an inverted ? inside of it.
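
A small illustration of that mangling in Python, using the `replace` error handler; the sample bytes are hypothetical legacy-encoded data:

```python
raw = b"Miko\xb3aj"  # hypothetical legacy-encoded bytes, not valid UTF-8

# The byte that cannot be decoded is swapped for U+FFFD REPLACEMENT CHARACTER.
decoded = raw.decode("utf-8", errors="replace")

# Re-encoding gives the UTF-8 form of U+FFFD, not the original byte, so a
# filename processed this way no longer matches the file on disk.
reencoded = decoded.encode("utf-8")
```

Once the replacement has happened there is no way to get back from `reencoded` to `raw`, which is why a lossy decode is a poor fit for anything that is later used to open a file.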

There are some obscure solutions to this problem, like https://simonsapin.github.io/wtf-8/ (which includes discussion of the 16 bit encodings you need for Windows), but I haven't seen broad support for them. You need a deliberately "noncompliant" encoding/decoding system that doesn't replace unknown characters with replacement characters. Fortunately, compliant systems are becoming more and more popular and available. Unfortunately, that can make file name handling harder than when you had a non-Unicode-compliant handling system for your strings.

garaetjjte|4 years ago

Part of the problem is legacy Windows cruft. For a long time, to properly handle Unicode characters you needed to explicitly use the wide-char UTF-16 functions. The legacy narrow encoding is a system-wide setting that couldn't be set to UTF-8, so only a subset of characters would be represented correctly. Only recently did they introduce the ability to set the narrow encoding for an application to UTF-8 with setlocale, which is a lot saner.

jan_Inkepa|4 years ago

I've been bitten on a few small releases by forgetting that C# localises number->string conversion by default (which makes sense. But if you forget, and you're writing floats to csv files and the decimal points become decimal commas....).

mkotowski|4 years ago

In the case of home-grown code, it could simply be a question of programmer awareness. There are still many outdated and/or unfinished tutorials that use WinAPI without any concern for enabling Unicode and wide-char support.

If we are talking about ready game engines like Unity and Unreal... it is probably a naive assumption about input being 1 byte wide and things getting lost because of that in some gamedev-made script.

tazjin|4 years ago

The number of random encoding problems that still exist is so bizarre. I recently left a UK job after already leaving the country more than a year ago, and in their attempt to mail a P45 form to my new address (in Moscow) the only bits that survived were the string "c/o" and the postal code.

mkotowski|4 years ago

I, too, have the letter Ł in my name, and yes, it is a sick joke that so many things, even in supposedly modern systems, assume that the world runs on ASCII.

In the case of the Windows operating system, the worst fact is that every single part of it behaves differently. Some parts display the path with a wrong encoding, but handle it correctly. A third-party app can display it correctly, but fails while trying to access any file. From what I remember, even the built-in PATH variable editor/manager goes through some arcane steps to display the letters in a wrong way, but getting them to work sometimes.

I can only imagine how much more painful it is for someone using any of the less widely used writing systems or those with more advanced features compared to ASCII (Hebrew's RTL, Arabic script's medial and final forms, etcetera).

gerdesj|4 years ago

Can Ł have an alternative representation? For example the German ß => ss. Also I think ö can be written as oe.

In English we simply shake the big bag of letters, pick a few at random and then throw them at the page until a few stick.

Dannymetconan|4 years ago

I can very much relate to this but also have very little sympathy here.

I have a special character in my name, an apostrophe, and it causes trouble regularly online and with tooling. A number of years ago I decided just to never use it when it came to anything to do with technical work be it email, logins or usernames.

Unicode characters are a pain to deal with, and I have suffered from it first hand trying to handle them. At the end of the day it is much easier just to not use the special characters and move on with your life rather than be battling the constant frustration.

I'm sure these tools have lots of open issues, and you would be surprised at the amount of time, effort and testing that would be required to provide full Unicode support. Most people would see it as a very small positive and not worth the effort. I find it hard to disagree.

jltsiren|4 years ago

My legal last name is "Sirén". When I was younger, I almost always used "Siren", because it was easier to type. Then, ~15 years ago, I started noticing that American websites sometimes rejected it, because they considered it inappropriate. Sometimes "Sirén" would work, sometimes it worked but caused minor annoyances, and sometimes it would not work for technical reasons.

Both versions work most of the time these days, but I still run into trouble once in a while no matter which name I use.

johnorourke|4 years ago

I can relate, mine is O'Rourke and even in 2021 I get:

- websites telling me I have an invalid name

- post addressed to O'Rourke, O\\\Rourke, O&Rourke

- "my account" pages say "Welcome, Mr O\Rourke"

vultour|4 years ago

I'm really surprised someone technically minded thought it was a good idea to put a non-ASCII character in their username. I'd never do that.

jasonpeacock|4 years ago

And yet it's one of the simplest things to add non-ASCII chars to your tests to validate their handling.

It's like not testing if your calculator application can handle negative numbers or decimals.

nradov|4 years ago

In fact it's trivial to generate a text file of all valid Unicode code points and use that as input to unit tests.
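
A sketch of what that could look like in Python, assuming "valid" means every Unicode scalar value, i.e. all code points except the surrogate range. Note that this includes NUL, control characters, noncharacters and unassigned code points, which may or may not be wanted as test input:

```python
def all_scalar_values() -> str:
    """Return one string containing every Unicode scalar value."""
    # Surrogates (U+D800..U+DFFF) are code points but cannot be encoded
    # as UTF-8, so they are skipped.
    return "".join(
        chr(cp) for cp in range(0x110000) if not 0xD800 <= cp <= 0xDFFF
    )


corpus = all_scalar_values()
# Sanity check: the whole corpus survives a UTF-8 round trip.
survives = corpus.encode("utf-8").decode("utf-8") == corpus
```

Writing `corpus` to a file with `encoding="utf-8"` gives the text file described above; feeding it through a parser or a path-handling routine one character at a time tends to be more useful than one giant string.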

mrweasel|4 years ago

It’s a pretty good test case. Similarly, we found a number of bugs in a Django application's path handling because I happened to be using Windows for six months, while the rest of the team was on Linux and Mac.

mikasjp|4 years ago

I think the whole problem is keeping the character encoding consistent in the applications and their dependencies. Programmers often forget this because they avoid non-ASCII characters in their code.

Ansil849|4 years ago

Sometimes even "regular" ASCII surnames cause problems.

When written in the Latin alphabet, my surname is one letter.

I've had an amazing amount of problems with this not just due to technical limitations (like various forms marking the entry as invalid), but--much more aggravatingly--human limitations.

One particularly infuriating anecdote: at a past job many years ago, the email structure was lastname@company.com. I dutifully sent the IT person in charge of creating emails my desired email. The IT person wrote back an amazingly condescending email that as per the policy, emails had to be last names, not individual letters. I then had to go find a bunch of random websites which explained single-letter names and forwarded them to the IT person. They then obliged, but did not apologize for insulting me. That is not right that I had to put up with that.

account42|4 years ago

> One particularly infuriating anecdote: at a past job many years ago, the email structure was lastname@company.com. I dutifully sent the IT person in charge of creating emails my desired email. The IT person wrote back an amazingly condescending email that as per the policy, emails had to be last names, not individual letters. I then had to go find a bunch of random websites which explained single-letter names and forwarded them to the IT person. They then obliged, but did not apologize for insulting me. That is not right that I had to put up with that.

Except single-letter last names are less common than people not following policy and/or abbreviating their name. It could simply be an honest mistake, and the email is just their standard response since they have other things to get to. Did you try simply pointing out that the letter was in fact your last name instead of getting passive-aggressive?

pledess|4 years ago

The article offers a solution of idea.system.path=${root.dir}/JetBrains/Rider/system but doesn't mention the C:\JetBrains directory permissions. Directory permissions under %LOCALAPPDATA% (the location that works for people without a Polish character) should restrict write access to one user. With the Windows default behavior, creating C:\JetBrains would inherit permissions from C:\ - and wouldn't restrict write access to one user. Maybe 99% of the time this is irrelevant (i.e., there's no realistic threat from malicious actors who control unprivileged user accounts on your own development machine). Still, it's a potential downside of the solution, and more motivation for the vendor to fix their code so that Polish characters can be used under %LOCALAPPDATA%.

Kwpolska|4 years ago

If you are on a multi-user system, the path "C:\JetBrains" isn’t really ideal (what if other users also need Rider and have non-ASCII usernames?). That said, you can easily change file permissions on Windows if the default ones don’t work for you.

dmingod666|4 years ago

The domain name to the website is all ascii..

zamalek|4 years ago

If you use a Microsoft account to set up windows then you have no control over the local username.

askvictor|4 years ago

Somewhat surprising that this is an issue with JetBrains, given that they are based in Eastern Europe, and would probably have more direct experience of these sorts of problems than US or UK based companies. OTOH maybe it's just a scale thing - bigger companies have more resources to handle these sort of cases, regardless where they're based (not that they always do...)

m_kos|4 years ago

Isn't it bizarre that we have self-driving cars, the ISS, and phones with 50 megapixel cameras but still struggle with character encoding?

tetha|4 years ago

Character encoding is in a special class of problems. Like time handling.

If you pick up a halfway non-ancient framework in a somewhat common language with a somewhat non-terrible persistence like postgres, you just don't have problems. Just don't care, and it just works.

But it's super easy to derail that fragile correctness with something like MySQLs utf8-ish handling, or some OS's path handling, or 'efficiency', or a user or frontend dev submitting data in a wrong encoding. And then it gets mangled. And then the user is unhappy.

At that point, it becomes very hard to argue why one of the two things is wrong, and the other is not. While the user argues the other way around. Because both look correct, if you look from the right angle. And the only reason why I am right is because of some standard, while the customer is right because of money.

And yes, it is very 'surprising' why our software now functions correctly for russian or greek customers.

quadrifoliate|4 years ago

It's not bizarre at all. Character encodings are a sort of language in themselves, and end up with all the problems that regular old languages have – there's a lot of variety, people can't agree on one particular solution, and there's not a lot of money in taking care of the edge cases.

It would be bizarre if we were at the point where we had perfect translations for everything, but still struggled with character encodings specifically.

zakius|4 years ago

For self-driving cars, the ISS and digital cameras, everything you do is blurry in a sense: a "good enough" approximation is actually good enough. Character encoding and transformations, on the other hand, have to be done perfectly and precisely, and have a surprisingly large number of edge cases.

rcxdude|4 years ago

Sadly there is even still software which fails to build or even fails to run when there is a space in a filename (as is super common on windows file paths, as well as autogenerated CI build folders). It's ridiculous to no end that software cannot handle paths correctly.

f311a|4 years ago

That's a pretty common problem, especially for cyrillic names. People just use ASCII names.

numpad0|4 years ago

Oh, is it not common knowledge that you should not use UTF-8 in a Windows username? That has been the case since the Windows 95 days. Only recently has it supposedly improved, after Microsoft Account login became semi-mandatory.

Fordec|4 years ago

A lot of adults today weren't even alive in 95. Also, the assumption that people are familiar with Windows vs other operating systems is becoming less and less valid. And as the world gets more globalised and remote, it can no longer be assumed that all technical people come from an Anglo-American culture.

chris_overseas|4 years ago

I don't think this bug is anything to do with Windows, rather it is due to the way the paths are handled in the IDE's codebase. Presumably the same problem exists when using these IDEs in conjunction with a path containing non-ascii characters in the Linux or macOS world.

progval|4 years ago

On the contrary, the first bug happens because docker-compose tries to decode the path as UTF-8, but it is not UTF-8-encoded. ("'utf-8' codec can't decode byte")
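
That error message is easy to reproduce in plain Python. The sketch below assumes the path bytes were in a legacy single-byte Windows code page; `cp1250` is only a guess for a Polish system, and the function name is made up:

```python
def decode_error_for(text: str, legacy_encoding: str = "cp1250") -> str:
    """Encode text in a legacy code page, then try to read it as UTF-8."""
    raw = text.encode(legacy_encoding)
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return str(exc)
    return ""  # pure-ASCII input decodes fine under either encoding


message = decode_error_for("Mikołaj")   # starts with "'utf-8' codec can't decode byte"
no_error = decode_error_for("Mikolaj")  # empty string: nothing to trip over
```

This is also why the bug stays invisible to anyone with an ASCII-only username: for those bytes the legacy code page and UTF-8 agree.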

tediousdemise|4 years ago

The solution to this is extremely simple: don't validate usernames, period.

The rationale is from an article someone linked here ("Falsehoods Programmers Believe About Names"):

> Anything someone tells you is their name is—by definition—an appropriate identifier for them.

If you try to validate by checking for profanity, knowing full well that people can have names that contain profane substrings, I have a tongue-in-cheek message for you: you are a fucking asshole.

ddeyar|4 years ago

Some years ago I used the + feature in my gmail address. e.g. myname+ycombinator@gmail.com to track down which service is giving away my email address. It happened more than once that I could not log in anymore at some point because they started to disallow the + character in email addresses. I also got phone calls from some companies complaining that i misspelled my email address because there was their company name in it.

cgufus|4 years ago

hehe, did the same, although not with +, but using a catch-all feature of the provider. I still get a lot of spam and phishing attempts on my „dropbox@<mydomain>“ address. I faintly remember they (dropbox) had a breach some time in the past.

deepsun|4 years ago

> My username contains a "ł" character and because of it, this file cannot be processed properly.

What is so curious there? Some names consist entirely of non-latin characters, and some software doesn't work with non-ASCII symbols. I just cannot understand why it is interesting.

godmode2019|4 years ago

I have a set of names I give to different providers. Advertisers always assume a name is constant and email addresses can change.

I got an email saying 'Hi John I just want to xyz'

I can skip this email as they used a fake name. Works better than other methods I have found.

jonnycomputer|4 years ago

>When I found out that the bug was in the Rider itself, I reported it to technical support. I also found a similar report for PyCharm. Unfortunately, things haven’t moved forward since then.

Unfortunately, typical with Jetbrains.

PrivateButts|4 years ago

Similar to this, Node and NPM get very temperamental when you have a User folder with a space in it. I gave up on the community workarounds and just created a new account and copied my files over to fix it.

trinovantes|4 years ago

In CS, most algorithms assume an ASCII character set. I wonder if there are any string-related algorithms that completely break (functionally or complexity-wise) when given UTF-16 or UTF-8 character sets.

a1369209993|4 years ago

Asymptotic complexity can't change based on the character set, since you can just reuse the same algorithm with larger opaque datums. (Exception being algorithms with O(n^8) or O(n^256) complexity, but no one uses those anyway.)

A variable-width encoding can cause issues in principle, but useful algorithms already have to deal with strings that have a variable-length physical representation anyway (e.g. "yes" vs "no"), so it tends not to be a problem in practice.

deathanatos|4 years ago

> In CS, most algorithms assume an ASCII character set.

They most certainly do not. E.g., a Turing machine assumes an alphabet Γ which is a set of some characters and is defined no further, as any exact definition is meaningless to the theory. (I.e., the algorithm is generic over any alphabet.) The alphabet need not even be text; e.g., for a Turing machine, the set of all octets suffices.

Even for something like Levenshtein distance, the only real requirement of the algorithm is that the abstract "characters" implement equality testing. For Unicode text, I'd start with graphemes, and then look for counter examples.
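
To make the Levenshtein point concrete, here is a minimal sketch of the textbook dynamic-programming version written over arbitrary sequences. It only ever compares items with `==`, so the choice of "character" (code points, bytes, graphemes) is left entirely to the caller:

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def levenshtein(a: Sequence[T], b: Sequence[T]) -> int:
    """Edit distance between two sequences whose items support ==."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


by_code_point = levenshtein("ł", "l")                 # 1
by_byte = levenshtein("ł".encode("utf-8"), b"l")      # 2
```

The algorithm is the same in both calls; only the unit differs, and so does the answer. Passing lists of grapheme clusters (which Python's standard library does not produce on its own) would work the same way.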

spicybright|4 years ago

So frustrating how this still happens. It's too latin centric.

lukaszkups|4 years ago

Ah it's so common for me that I've totally abandoned using Latin letters in my first/last name long time ago (and I recommend the same for you ;))

ygra|4 years ago

One way of working around such issues is to use subst. That way the application thinks your project directory is actually located on P:\ or something like that.

darkhorn|4 years ago

I think it is a Java related issue. Relevant issue occurs in Jaspersoft Report. You cannot install Jaspersoft Report on Turkish Windows no matter what.

Svoka|4 years ago

Did you know that Android still won't build on Windows if you have a Cyrillic letter in your user name?

asimjalis|4 years ago

This is like Kafka’s story in which the protagonist wakes up to find out he’s a (software) bug.

xwdv|4 years ago

What’s wrong with just writing it as Mikolaj? It’s not like it’s a kanji or something.

no_time|4 years ago

Because it's not his name. Imagine you are John but you had to make do with Yohn because the people designing your software didn't need the letter J...

miloignis|4 years ago

From the article:

The first idea was to change the username to one that does not contain Polish characters. It turned out that Windows does not rename the user’s folder when changing the username. Manually renaming the folder was not an option. This way I could corrupt my profile in the system.

The end of the article is about how to change the directory where the temporary files go to one not under the user folder.

ludamad|4 years ago

For the record, it's a stark pronunciation difference as ł has drifted to a very different "w" sound

ssivark|4 years ago

That’s about as aggravating as asking Ryan to change name to Pyan — because the encoding doesn’t support “R” and “P” looks very similar.

wbsss4412|4 years ago

So the solution is for the user to change their entire windows account name, rather than handling common characters in your code?

sophacles|4 years ago

Because that's not their name?

dahfizz|4 years ago

Their URL is even mikolaj-kaminski.com. I get that it's annoying, but I would never use non-ascii chars in a username / file path.

needle0|4 years ago

Then there are the people whose names ARE in Kanji, thankyouverymuch. Ah, no big deal, there's only around 1.6 billion of us.

anotheraccount9|4 years ago

When ł visiłed his page, my browser crashed.

shantnutiwari|4 years ago

the bug was fixed one hour ago-- looks like HN customer service worked again