
Remind HN: Unicode hacks

37 points| olalonde | 15 years ago

Just a friendly reminder that some Unicode characters[1] look like spaces and should be taken into account when writing filtering/trimming functions. It's not a big deal, but it is worth keeping in mind to prevent things like usernames that are basically a bunch of spaces.
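
As a rough illustration of the username case, here is a minimal Python sketch that rejects names made up entirely of space-like or invisible characters. The function name `is_visually_blank` and the `EXTRA_BLANKS` set are hypothetical, and the list of blank-looking code points is an assumption that is certainly not exhaustive.

```python
import unicodedata

# General categories covering spaces, line/paragraph separators, controls,
# and invisible format characters (e.g. U+00A0, U+200B, U+FEFF).
BLANK_CATEGORIES = {"Zs", "Zl", "Zp", "Cc", "Cf"}

# Assumption: a few code points that usually render blank but are classified
# as letters or symbols. Illustrative only, not a complete list.
EXTRA_BLANKS = {"\u115f", "\u1160", "\u2800", "\u3164", "\uffa0"}


def is_visually_blank(name: str) -> bool:
    """Return True if every character is likely to render as empty space."""
    return all(
        ch in EXTRA_BLANKS or unicodedata.category(ch) in BLANK_CATEGORIES
        for ch in name
    )
```

A signup form could refuse any name for which `is_visually_blank(name)` is true. Note that the empty string also counts as blank here.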

[1] http://www.cs.tut.fi/~jkorpela/chars/spaces.html

19 comments

[+] tptacek|15 years ago|reply
This is a classic web security problem; most famously, WinAPI systems have a "flattening" function that converts things like PRIME U+2032 into ASCII 0x27 (the apostrophe that terminates SQL string literals). Database engines can also interpret character sets differently from the rest of the app stack, leading to similar problems. UTF-7 cursed WordPress for something like a year, during which multiple preauth SQL injection flaws were discovered.

The answer to these problems is whitelist filtering and neutralization; if a character isn't known-safe, substitute its HTML entity alternative. If you're writing blacklist filters that need to know what spaces are, you're already playing to lose.
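
A minimal Python sketch of that idea, assuming the output context is HTML text: anything outside a small known-safe set is replaced by a numeric character reference. The `SAFE_CHARS` set and the `neutralize` name are hypothetical choices for illustration; a real whitelist would depend on the application.

```python
import string

# Assumption: a deliberately small set of known-safe characters.
SAFE_CHARS = frozenset(string.ascii_letters + string.digits + " .,-_")


def neutralize(text: str) -> str:
    """Keep whitelisted characters; encode everything else as &#xHEX;."""
    return "".join(
        ch if ch in SAFE_CHARS else f"&#x{ord(ch):X};"
        for ch in text
    )
```

With this, `neutralize("O\u2032Brien")` gives `O&#x2032;Brien`, so the PRIME character never reaches a layer that might flatten it into an apostrophe.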

[+] RyanMcGreal|15 years ago|reply
I just want to drop a thank-you for your dedication to good security practice and steady generosity with advice - in what is likely an intimidating topic for many developers (at least it is for me).
[+] perlgeek|15 years ago|reply
Sorry, whitelisting isn't the answer to SQL injection - bind parameters are.

With bind parameters you can pass data out of band, and the DB engine never tries to parse it as SQL.
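
For example, a small sketch using Python's built-in `sqlite3` module and an in-memory database (the `users` table and the hostile string are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

# Hypothetical hostile input; it is passed as data, never spliced into SQL.
hostile = "x'); DROP TABLE users; --"

# The "?" placeholders are bind parameters: the value travels separately
# from the statement text, so the engine does not parse it as SQL.
conn.execute("INSERT INTO users (name) VALUES (?)", (hostile,))
rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (hostile,)
).fetchall()
```

The table survives and `rows` contains the hostile string stored verbatim.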

[+] Tichy|15 years ago|reply
Hm, how does this work? Wouldn't the WinAPI convert the characters before I do the security parsing (and I agree with the bind parameters comment anyway)? Or is the problem that you run the app on a Linux server and the DB on a Windows server?

In any case: don't use Windows on a server :-)

[+] olalonde|15 years ago|reply
Seems like Twitter is "vulnerable" to U+00A0 tweets: http://twitter.com/#!/olivierll/status/7852651047817216

For those who are wondering, you can type Unicode codes directly from your keyboard (Ubuntu: Ctrl-Shift-u, other OS: http://en.wikipedia.org/wiki/Unicode_input)

[+] alanh|15 years ago|reply
On OS X this particular character is alt+space.
[+] VMG|15 years ago|reply
Interesting: I just tested it in Python, and everything is removed by str.strip() except "\ufeff", which also has zero width.

    >>> print("\ufeff#")
    #
    >>> print(len("\ufeff#".strip()))
    2
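
One possible workaround, sketched here under the assumption that all Unicode format characters (general category Cf, which includes U+FEFF and U+200B) should be trimmed along with whitespace. The helper names are hypothetical.

```python
import unicodedata


def _is_invisible(ch: str) -> bool:
    # Whitespace as str.strip() sees it, plus invisible format characters.
    return ch.isspace() or unicodedata.category(ch) == "Cf"


def strip_invisible(text: str) -> str:
    """Trim leading and trailing whitespace and format characters."""
    start, end = 0, len(text)
    while start < end and _is_invisible(text[start]):
        start += 1
    while end > start and _is_invisible(text[end - 1]):
        end -= 1
    return text[start:end]
```

With this, `len(strip_invisible("\ufeff#"))` is 1. Characters in the middle of the string are left alone.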
[+] stwe|15 years ago|reply
‏There are also other unicode hacks like changing ‏ text direction (U+200F)‏.
[+] stwe|15 years ago|reply
It used to have funny effects on websites (browser name in title bar spelled backwards), but it doesn't seem to work now. The above comment contains the Unicode character three times.
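
A rough Python sketch for spotting such characters in submitted text. The `BIDI_CONTROLS` set lists the directional marks, embeddings, overrides, and isolates as an assumption about what a filter would care about, and `find_bidi_controls` is a hypothetical helper name.

```python
import unicodedata

# Assumption: the direction-control code points a filter might flag.
BIDI_CONTROLS = frozenset(
    "\u061c\u200e\u200f"              # Arabic letter mark, LRM, RLM
    "\u202a\u202b\u202c\u202d\u202e"  # embeddings, overrides, pop
    "\u2066\u2067\u2068\u2069"        # isolates
)


def find_bidi_controls(text: str) -> list[tuple[int, str]]:
    """Return (index, character name) for each direction control found."""
    return [
        (i, unicodedata.name(ch, f"U+{ord(ch):04X}"))
        for i, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]
```

Whether to reject, strip, or merely flag these is a policy decision, since they have legitimate uses in right-to-left text.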
[+] citricsquid|15 years ago|reply
­
[+] citricsquid|15 years ago|reply
Seems ALT+0173 works here as a "blank" character. I'm not sure of its exact purpose, but I've never seen it dealt with, and I often use it as "nothing". The only solution I've seen for properly sanitising Unicode characters is to disable them entirely and print their names instead.
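
For reference, code 173 is U+00AD SOFT HYPHEN. A minimal sketch of the "print their name" approach in Python, assuming that only printable ASCII is shown as-is (the `show_names` helper is hypothetical):

```python
import unicodedata


def show_names(text: str) -> str:
    """Replace everything outside printable ASCII with its Unicode name."""
    out = []
    for ch in text:
        if " " <= ch <= "~":
            out.append(ch)
        else:
            # Unnamed characters (e.g. controls) fall back to U+XXXX.
            name = unicodedata.name(ch, f"U+{ord(ch):04X}")
            out.append(f"<{name}>")
    return "".join(out)
```

For instance, `show_names("a\u00adb")` yields `a<SOFT HYPHEN>b`, which makes an otherwise invisible character obvious to moderators.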