Remind HN: Unicode hacks

37 points| olalonde | 15 years ago

Just a friendly reminder that some Unicode characters[1] look like spaces and should be taken into account when writing filtering/trimming functions. Of course, it's not a big deal, but it's something to keep in mind to prevent things like usernames that are basically a bunch of spaces.

[1] http://www.cs.tut.fi/~jkorpela/chars/spaces.html
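
A minimal Python sketch of such a check, assuming a hypothetical helper named `is_visually_blank` and assuming that the Unicode general categories for separators (Zs, Zl, Zp), controls (Cc) and format characters (Cf) are a reasonable approximation of "renders as nothing". It is not exhaustive: some blank-looking characters sit in other categories, and a few Cf characters are visible.

```python
import unicodedata

# General categories treated as "blank" for this illustration (an assumption,
# not a complete definition of invisibility).
BLANK_CATEGORIES = {"Zs", "Zl", "Zp", "Cc", "Cf"}


def is_visually_blank(name: str) -> bool:
    """Return True if every character falls in a blank-looking category.

    The empty string also counts as blank.
    """
    return all(unicodedata.category(ch) in BLANK_CATEGORIES for ch in name)


# Hypothetical usernames: the first two should be rejected.
print(is_visually_blank("\u00a0\u2003 "))   # no-break space, em space, space
print(is_visually_blank("\u200b\ufeff"))    # zero-width characters only
print(is_visually_blank("\u00a0olalonde"))  # contains visible letters
```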

19 comments

[+] tptacek|15 years ago|reply

This is a classic web security problem; most famously, WinAPI systems have a "flattening" function that converts things like PRIME U+2032 into ASCII 0x27 (the tick that terminates SQL string literals). Database engines can also interpret character sets differently than the rest of the app stack, leading to similar problems. UTF-7 cursed WordPress for something like a year in which multiple preauth SQL injection flaws were discovered.

The answer to these problems is whitelist filtering and neutralization; if a character isn't known-safe, substitute its HTML entity alternative. If you're writing blacklist filters that need to know what spaces are, you're already playing to lose.
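
As a rough illustration of the whitelist-and-neutralize idea, here is a Python sketch. The `SAFE_CHARS` set and the `neutralize` helper are assumptions made up for this example; a real application would choose its own known-safe set for its own output context.

```python
import string

# Hypothetical known-safe set: ASCII letters, digits and a little punctuation.
SAFE_CHARS = frozenset(string.ascii_letters + string.digits + " .,-_")


def neutralize(text: str) -> str:
    """Keep known-safe characters; replace everything else with a numeric
    HTML character reference."""
    return "".join(ch if ch in SAFE_CHARS else f"&#{ord(ch)};" for ch in text)


print(neutralize("O\u2032Brien"))  # PRIME is not in the safe set
print(neutralize("a'b"))           # neither is the ASCII tick
```

The point of the design is that the filter never needs to know what a dangerous character looks like; anything it does not recognize is neutralized.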

[+] RyanMcGreal|15 years ago|reply

I just want to drop a thank-you for your dedication to good security practice and steady generosity with advice - in what is likely an intimidating topic for many developers (at least it is for me).

[+] perlgeek|15 years ago|reply

Sorry, whitelisting isn't the answer to SQL injection - bind parameters are.

With bind parameters you can pass data out of band, and the DB engine never tries to parse it as SQL.
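
A small sketch of what that looks like, using Python's built-in `sqlite3` module with an in-memory database. The table name and the hostile input are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

# Input that would break out of a string literal if it were pasted into SQL.
hostile = "x'); DROP TABLE users; --"

# The ? placeholder sends the value separately from the SQL text,
# so it is stored as data and never parsed as SQL.
conn.execute("INSERT INTO users (name) VALUES (?)", (hostile,))

rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (hostile,)
).fetchall()
print(rows)
```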

[+] Tichy|15 years ago|reply

Hm, how does this work? Wouldn't the WinAPI convert the characters before I do the security parsing (and I agree with the bind parameters comment anyway)? Or is the problem that you run the app on a Linux server and the DB on a Windows server?

In any case: don't use Windows on a server :-)

[+] olalonde|15 years ago|reply

Seems like Twitter is "vulnerable" to U+00A0 tweets: http://twitter.com/#!/olivierll/status/7852651047817216

For those who are wondering, you can type Unicode codes directly from your keyboard (Ubuntu: Ctrl-Shift-u, other OS: http://en.wikipedia.org/wiki/Unicode_input)

[+] Bootvis|15 years ago|reply

A more innocent trick with Unicode and Twitter is squeezing extra characters into a tweet by using Unicode ligatures:

http://en.wikipedia.org/wiki/List_of_precomposed_Latin_chara...

Unfortunately, the number of ligatures is small, but it might come in handy.
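
To see why this saves characters, here is a short Python sketch. It assumes the length limit counts code points, and uses U+FB01 (LATIN SMALL LIGATURE FI) as the example. NFKC normalization is one way a service could undo the trick.

```python
import unicodedata

ligature = "\ufb01"  # LATIN SMALL LIGATURE FI, a single code point
expanded = unicodedata.normalize("NFKC", ligature)

print(len(ligature))  # counted as one character
print(expanded)       # compatibility normalization gives back two letters
print(len(expanded))
```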

[+] citricsquid|15 years ago|reply

http://twitter.com/#!/citricsquid/status/7864128295145472

Same with 0173, although mine seems to produce nothing, whereas yours is a line break (I think?)

[+] alanh|15 years ago|reply

On OS X this particular character is alt+space.

[+] VMG|15 years ago|reply

Interesting - just tested it in Python and everything is removed with str.strip(), except "\ufeff", which also has zero width.

    >>> print("\ufeff#")
    #
    >>> print(len("\ufeff#".strip()))
    2
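
A slightly broader check along the same lines, as a sketch for Python 3. The sample list is hand-picked for illustration and is not the full list from the linked page. On the interpreters I know of, the format characters (category Cf) survive `str.strip()`, not just "\ufeff".

```python
import unicodedata

samples = ["\u00a0", "\u2003", "\u200b", "\u00ad", "\u200f", "\ufeff"]

# A character survives if stripping it on its own leaves a non-empty string.
survivors = [ch for ch in samples if ch.strip()]

for ch in samples:
    status = "kept" if ch in survivors else "stripped"
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {status}")
```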

[+] olalonde|15 years ago|reply

For more details on the potential visual spoofs: http://unicode.org/reports/tr36/#visual_spoofing

[+] stwe|15 years ago|reply

‏There are also other unicode hacks like changing ‏ text direction (U+200F)‏.

[+] stwe|15 years ago|reply

It used to have funny effects on websites (browser name in title bar spelled backwards), but it doesn't seem to work now. The above comment contains the Unicode character three times.

[+] citricsquid|15 years ago|reply

Seems ALT+0173 works here as a "blank" character. I'm not sure of its exact purpose, but I've never seen it dealt with and often use it as "nothing". The only solution I've seen to properly sanitising Unicode characters is just to disable them entirely and print their name.
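
A sketch of the "print their name" approach in Python, using the standard `unicodedata` module. The helper name `reveal_invisibles` and the set of categories it treats as invisible are assumptions for this example. Code point 173 is U+00AD, which `unicodedata` reports as SOFT HYPHEN.

```python
import unicodedata

# Categories treated as invisible here (an assumption, not a complete rule).
INVISIBLE_CATEGORIES = {"Zs", "Zl", "Zp", "Cc", "Cf"}


def reveal_invisibles(text: str) -> str:
    """Replace blank-looking characters (except the plain ASCII space)
    with their bracketed Unicode names."""
    out = []
    for ch in text:
        if ch != " " and unicodedata.category(ch) in INVISIBLE_CATEGORIES:
            # Control characters have no name, so fall back to the code point.
            name = unicodedata.name(ch, f"U+{ord(ch):04X}")
            out.append(f"[{name}]")
        else:
            out.append(ch)
    return "".join(out)


print(reveal_invisibles("a\u00adb"))
print(reveal_invisibles("x\u200fy z"))
```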

[+] unknown|15 years ago|reply

[deleted]