top | item 37747957

(no title)

ssokolow | 2 years ago

> Before comparing strings or searching for a substring, normalize!

...and learn about the TR39 Skeleton Algorithm for Unicode Confusables. Far too few people writing spam-handling code know about that thing.

(Basically, it generates matching keys from arbitrary strings so that visually similar characters compare identical, so those Disqus/Facebook/etc. spam messages promoting things like BITCO1N pump-and-dumps or using esoteric Unicode characters to advertise work-from-home scams will be wasting their time trying to disguise their words.)

...and since it's based on a tabular plaintext definition file, you can write a simple parser and algorithm to work it in reverse and generate sample spam exploiting that approach if you want.

https://www.unicode.org/Public/security/latest/confusables.t...

> and CD-ROM!

I think you mean Microsoft Windows's Joliet extensions to ISO9660 which, by the way, use UCS-2, not UTF-16. (Try generating an ISO on Linux (eg. using K3b) with the Joliet option enabled and watch as filenames with emoji outside the Basic Multilingual Plane cause the process to fail.)

The base ISO9660 filesystem uses bytewise-encoded filenames.

discuss

dystroy|2 years ago

But not all normalizations are done to fight spam, not all of them should be interested in visual similarity.

I normalize strings in searches not because of bad intents but because for all user related purposes "Comunicações" and "Comunicações" are the same, their different encodings being more of an accident.

ssokolow|2 years ago

*nod* ...and stemming is that taken to a greater extreme.

I was just pointing out that Unicode itself has various forms of normalization and normalization-adjacent functionality that people are far too unaware of.