Zero-Width Characters: Invisibly fingerprinting text

[+] bsamuels|8 years ago|reply

This is a pretty common method of watermarking sensitive content in EVE Online alliances, however we've found it suffers from some serious drawbacks.

Zero-Width characters tend to cause lots of issues when they are copied and pasted, which may alert a poorly equipped adversary that they're handling watermarked content. In addition, entities that are aware that you're watermarking text content in this way can just take screenshots of the text, transcribe it, or strip it of all non-ASCII characters.

The best solution I've seen is something I like to call "content transposition". The idea is that you take a paragraph of content and run it through a program that will reorder parts of content and inject/manipulate indefinite articles in order to create your watermark, while keeping the content grammatically correct. That way even if an adversary is fully aware that you're watermarking text content, they need two copies of the watermarked text in order to identify and strip your watermark.

[+] rl3|8 years ago|reply

Almost a decade ago I wrote a tool to search forum dumps from various EVE Online alliances. The content was acquired by spies and often watermarked.

The first barrier was that homoglyphs would inhibit text search, so I had to build an automated homogylph detection and substitution layer.

Once homoglyphs were stripped, the challenge was then to fit the entire search corpus into memory, so I compressed each page with LZMA, loaded it into memory, and decompressed on the fly when searching—probably not optimal, but still way faster than loading from disk.

I always wanted to try reverse engineering some of the watermarking systems so we could modify the watermarks on certain material, subtly leak it, and effectively frame adversaries while protecting our own spies in the process. Fortunately or unfortunately I never got around to that.

[+] reichardt|8 years ago|reply

What kind of content would be distribution that way? Something like battle plans send to individual members? Would they not be able to just snitch the important information without copying word for word the sensitive message?

[+] CapacitorSet|8 years ago|reply

Do you happen to have a link to these "content transposition" tools?

[+] joshstrange|8 years ago|reply

I know there is no way I have the time or energy to play EVE but it is the most fascinating game (MMO) that I have ever seen. The writeups on it are so fascinating.

[+] lost_my_pwd|8 years ago|reply

That same type of "content transposition" tool would also be a way to ensure obfuscation of content source if applied to copied text.

[+] tacon|8 years ago|reply

Back in the Usenet days, around the late '80s, I had read somewhere/somehow a few years before about how classified information would be printed with tiny differences in spacing to track leaks to foreign adversaries. In some newsgroup, I happened to mention I had read about this technique, which while obvious once you think of it was apparently not well known. Certainly by that time, it was very well known to the KGB.

Months after my public comment, I got a phone call from an AT&T inventor who was prosecuting a patent on the same technology. They were very interested where I had read about the technique in the public literature. Alas, I could not remember where I had read that little factoid so I wasn't much use to them. It was disturbing to them that their patent claim was out in the open literature somewhere, but they could not find it.

[+] QAPereo|8 years ago|reply

The general idea is called a Canary Trap, and IIRC it was popularized by Tom Clancy. The most common form is to have some subtle variations in the content or wording, which would give away the origin.

Edit here we go https://en.m.wikipedia.org/wiki/Canary_trap

The Canary Trap, aka The Barium Meal.

[+] dsr_|8 years ago|reply

Whitespace chars at the end of lines was used to encode formatting information; I think clari.net news services were the primary users.

[+] kpU8efre7r|8 years ago|reply

Isn't that mentioned in The Cardinal of the Kremlin?

[+] TrainedMonkey|8 years ago|reply

Some printers do that, it is very expensive and mostly moot because of electronic documents.

See https://qz.com/1002927/computer-printers-have-been-quietly-e...

[+] jchw|8 years ago|reply

Notably, some zero width characters tend to get removed, especially in systems that try to remove excess whitespace. I made a very rudimentary PoC of encoding data in zero width characters, but was hit by a few things:

- Some characters affect word wrap in unexpected ways, depending on the script of the text.

- Some characters impact glyph rendering in minor ways. For example, the ligature between f and l may be interrupted.

- Some characters are outright stripped. For example, Twitter strips U+FEFF.

- Zero width characters often trip up language detection systems. I noticed that Twitter detected my English message as Italian with the presence of some zero width characters.

So it's not necessarily as useful as it seems. If you pick specific characters to strategically avoid these issues, it's hard to make the encoding very efficient.

Still, it probably has it's uses.

[+] nemo1618|8 years ago|reply

Related: we built a homoglyph linter for Go source code, to help detect potentially malicious homoglyph substitution: https://github.com/NebulousLabs/glyphcheck

UTF-8 source code is nice for i18n, but it also opens the door to these kinds of attacks.

[+] robin_reala|8 years ago|reply

That’s a good start, but unless I’m misreading it[1] the range of homoglyphs it checks for is rather small. You might be better off importing the Unicode Consortium’s list of ‘confusables’[2] if you’re planned automated linting.

[1] https://github.com/NebulousLabs/glyphcheck/blob/f6483dd9e97a...

[2] http://www.unicode.org/Public/security/latest/confusables.tx...

[+] gnat|8 years ago|reply

Whereas, in Perl, we have a module for executable whitespace: http://www.perlmonks.org/?node_id=270023

Damian Conway is a wonderful mad genius.

[+] pdkl95|8 years ago|reply

I've thought about using whitespace and/or zero-width characters to embed a cryptographic signature. The goal was a was a browser extension that could sign the contents of an arbitrary <textarea> like an invisible "gpg --clearsign". The signature rules would have to be relaxed to accommodate common transformations servers do to user comments.

Ideally, this would allow people to cryptographically sign comments and automatically verify comments that were signed by the same author, all without changing existing server software or adding ugly "-----BEGIN PGP SIGNED MESSAGE-----" banners.

[+] infogulch|8 years ago|reply

Ok so the high level goal is to be able to inject extra data into a message (which could include a signature). Add in a magic string and then a browser extension could detect it and decode it (maybe add a little badge inline). From the API perspective you'd want a standard way to transform an arbitrary block of text and data you want to attach to it into a text+data blob that still looks and feels like text to humans, and then for the reverse, regex for the magic string and decode from there to the end of the message to get the original text and data out separately.

One way would be to place N zws (zero-width spaces) between each original character, and treat each block as a digit which encodes a number. This could work for large original texts, but it would be clunky and very low bandwidth I think. E.g. if "." is zws, you could encode "fox" and the number 123 as "f.o..x...".

Better I think would be to create an alphabet with several of the zero-with characters and put the whole encoded number somewhere where it's unlikely to get trimmed or mess up line breaks (probably near the end in the interior but on a word boundary next to a space).

The hardest part would be making a transformation that wouldn't simplify it too much but would still be resilient enough to the transformations done by many forums like markdown/bbcode/trimming that the result could be perfectly converted into a PGP message. Maybe include some error correction?

[+] suixo|8 years ago|reply

This looks like an amazing idea, why didn't you proceed further? I would love to see something like it, that could help certify messages :)

[+] jsnell|8 years ago|reply

This isn't as new as the author thinks. Doesn't have to be new to be interesting, of course :)

I know this was done at one large tech company around 2010 for an internal announcement email. Different people got copies with slightly different Unicode whitespace, despite the email having ostensibly gone directly to an "all employees" alias.

The fingerprinting was noticed within half an hour or so. (Somebody pasted a sentence to an internal IRC for discussion, and as is the case with Unicode and IRC, it inevitably showed up as being garbled in exciting ways for some people).

[+] vog|8 years ago|reply

Very good problem description and nice list of countermeasures!

However, the following countermeasure made be wonder:

> Manually retype excerpts to avoid invisible characters and homoglyphs.

Isn't this something you can automate? We should create linters for plain text (rather than code)? For example, depending on the language, reduce the text to a certain set of characters. Every character not in this whitelist is either replaced, or causes an error message the user (journalist) needs to deal with (i.e. remove it, or replace it with an innocent alternative, perhaps even proposing this replacement back to the linter project).

Of course, there are multiple ways for linting, which might become a fingerprint on it own. But then, if there are only 3 or 4 of such linter styles actually used (ideally, standardize on exactly one linting style), you can only tell which linter was used by the journalist, without any information about their source.

[+] 3pt14159|8 years ago|reply

Hi, I'm the author.

I was kinda on the fence, but I was considering my target: Journalists. A journalist may not notice simple differences like an extra space here or there when reading, but probably wouldn't retype a double space. I agree that this should be automated in some way, but it's a bit of an arms race.

[+] adtac|8 years ago|reply

Convert to image and run it through an OCR? Needlessly complicated, but it could possibly mimic the retyping part.

[+] hyperpape|8 years ago|reply

How is that different from his option 5?

I suspect he put it low on the list because it could be a cat and mouse game of trying to anticipate all the potential information leaks.

[+] Nomentatus|8 years ago|reply

Cutting and pasting using Ctrl-Shift-V in Libre Office does the trick. (Then select unformatted.) You still have to manually eliminate the spaces.

[+] ajarmst|8 years ago|reply

This has been a generally known technique for identifying leaks for at least three decades (and certainly much longer---Tom Clancy described it in a novel in 1987). The use of non-printing characters is an obvious extension of the idea. Any journalist who has published copy/pasted material from a confidential source since then is provably incompetent. https://en.wikipedia.org/wiki/Canary_trap

[+] userbinator|8 years ago|reply

This reminds me of a time not too long ago when I was teaching programming courses; at least one of the students would somehow manage to get one of these or other weird characters into source code, resulting in much confusion. On the bright side, I take advantage of the opportunity and an impromptu lesson in data representation and character encoding soon follows.

It's also one of the things where a hex editor is extremely useful --- even if you're not working on low-level, seeing the bits directly can be a great confirmation of correctness.

[+] niij|8 years ago|reply

I had this happen to me once (and only once, I learned my lesson). Our professor gave us a PDF with the problem description and it had a bit of code in it we were to put into our final program. Well, when I copy and pasted it some of the spaces copied as non-ASCII space.

[+] sbrm1|8 years ago|reply

Also keep an eye out for the Greek question mark, looks identical to a semicolon.

[+] peterburkimsher|8 years ago|reply

Fingerprinting data to find when it's been copy-pasted is a neat application for invisible characters!

I've also found a lot of identical characters when handling Chinese text. Note that Google Translate does not handle these correctly.

https://github.com/pingtype/pingtype.github.io/blob/master/r...

It's about the Kangxi Radicals Unicode block, compared the CJK characters block. If you want me to write a blog post about it, please comment and I'll get around to it.

[+] kurthr|8 years ago|reply

It's interesting to me, because I've seen this effect (while copy pasting) and wondered why... if I hadn't been translating I would never have noticed.

Even though some are noticeably different:

⿌黾

[+] lifthrasiir|8 years ago|reply

I really like the fact that ⼚/⺁ and ⽰/⺬ and so on are separately encoded in a single block (technically, though, they aren't).

[+] evfanknitram|8 years ago|reply

A fun fact is that the thumbprint text box in the certificate viewer in Windows starts with an invisible character.

So if you have cert and just want to copy paste the thunbprint to some file or application which needs to load it, then copying the full thumbprint probably won't work.

When I said fun I meant frustrating.

[+] hinkley|8 years ago|reply

At one job I had to declare a moratorium on sending certain bits of information through Outlook because the number of people who needed help cutting and pasting it correctly was becoming a problem. Everything went into the wiki or config files that were checked in or used as attachments.

And don’t get me started on Microsoft and their fucking smart quotes...

[+] twodave|8 years ago|reply

This. I used to maintain a software project that consisted of a few inter-communicating services on clients' windows machines (not just servers, but that would have been optimal). The most difficult part of making a sale was getting the implementation guys to correctly install these components, issue a self-signed key from the machine's local CA, and bind it to the local dns/ssl port. Not to mention most people don't even really understand how/why certificates work, so if they ran into the tiniest snag it was going to completely block progress until a developer could take a look. Barf.

Working with certificates on Windows in general is error-prone and difficult to automate (this coming from someone who spent more than a decade developing in .NET).

[+] adzm|8 years ago|reply

I've dealt with zero width characters on Windows causing problems in a variety of situations; all appear to have originated by copying something from SharePoint or Lync / Skype4business. It was included in a SQL query, causing it to fail to parse. Another time, one ended up somehow getting into a database field which then created filesystem paths containing the character, which was a much trickier one to figure out.

[+] shitloadofbooks|8 years ago|reply

I hate this so fucking much.

[+] 3pt14159|8 years ago|reply

Hi everyone, thanks for the feedback! I'm not sure if this deserves its own submission or not, but I have a short update to this post available here:

https://www.zachaysan.com/writing/2018-01-01-fingerprinting-...

One very interesting comment from an editor of The Weekly Standard.

[+] jeffwass|8 years ago|reply

Elon Musk did this nearly a decade ago at Tesla to try to identify a source of leaks. He sent many copies of a memo with slightly different versions to key team members.

It backfired when one of the top executives forwarded his own copy to the rest of the team.

https://www.cbsnews.com/news/should-management-spy-on-employ...

[+] jwilk|8 years ago|reply

How did it backfire? He forwarded his copy... and what?

[+] coffeedrinker|8 years ago|reply

I use LibreOffice and these appear as grey characters (at least some of the characters do). I've wondered what the grey characters were in the past and now I know.

Copy and paste the examples into LibreOffice and you will immediately see them.

[+] Sir_Cmpwn|8 years ago|reply

For what it's worth, there are genuine use-cases for these. My friends and I use them for our IRC bots so they can mention people's names without notifying them.

[+] jesperlang|8 years ago|reply

This is so sneaky! I am trying to find a good black list and found this:

http://kb.mozillazine.org/Network.IDN.blacklist_chars

Or maybe black listing is not the best approach, maybe a mix of multiple approaches. First strip out stuff and then view the text in a program that displays "unconventional" characters? As a test I pasted the post's test sentences in Vim and the invisible characters are replaced by blocks of <XXXX> that are very hard not to notice. The more you think about it the more tricky corner cases you find :O

[+] WalterBright|8 years ago|reply

The existence of homoglyphs in Unicode is a failure of Unicode's mission. Two sequences of characters that render the same should be the same. Encoding invisible semantic information into Unicode is a huge mistake.

[+] estebank|8 years ago|reply

I disagree, if anything they didn't go far enough. Unifying Chinese and Japanese characters making them locale dependent was, IMO, a mistake. The kind of problem I have with this can be seen with the so called "Turkish I" problem. The Turkish language has 4 'i's: i, I, ı and İ. Unicode decided to encode the first two using the points for latin lower case i and upper case I. In Turkish, the capitalization rules say that i and ı are lower case, and their upper case counterparts are İ and I. You can see that if you have a byte stream for which you don't know the locale, you cannot correctly apply capitalization to it. That way all 'I's would be unambiguous in all contexts. This is not as trivial a problem as it sounds[1]. This could have been avoided if Unicode had a Turkish i and I. You can extrapolate this issue to entire languages.

[1] https://gizmodo.com/382026/a-cellphones-missing-dot-kills-tw...

[+] lifthrasiir|8 years ago|reply

How can it be "a failure of Unicode's mission" when the mission does not really include that? The foremost reason that Unicode exists is the unification of tons of existing legacy character sets and encodings and not the technologically perfect system. There are enough duplicates and invisible characters in a single character set (yes some have duplicates by accident or design), across multiple character sets (so you like to map Latin A and Cyrillic А to the same character while retaining the collation order for each case...), or simply necessary to multilingual systems that jeopardize any attempt to "perfect" it.

Granted, there are several failed experiments (e.g. interlinear annotations which are completely obsoleted by markup languages) and several pain points (e.g. shrug Emoji) inside Unicode. I don't really like them, you may not like them, but how would you make such a system without all these interim works?

[+] tedunangst|8 years ago|reply

So what letters should be used to name the USSR? SSSR or (Cyrillic) CCCP?

Should the Greek alphabet be removed? Why have delta when d is just as good?

[+] guidovranken|8 years ago|reply

I've been wondering whether it could be possible to create a formal language through which one is capable of expressing ideas and facts about the world around us rather than values and variables as is the case for programming language. For humorous effect people sometimes write things like if ( !food ) goToStore(); -- could it be possible to formalize such constructs? If the language is formal, then the author's output can be run through a post-processor that re-formulates the expression such that superfluous data, if any, is removed to make stylographic characteristics disappear, akin to, and maybe using the same foundations as the reduction of mathematical equations into a bare minimal set of symbols. Furthermore, a mathematical approach to reasoning about real-world concepts is interesting. Computational philosophy?

Tangentially related to this topic (ulterior fingerprinting), I wonder whether websites like Twitter might be encoding your IP address or account ID in pixels on your screen (so subtle that it's impossible to discern with the naked eye) to make it easy to track screenshots back to you.

[+] khendron|8 years ago|reply

I just copied and pasted the example text in the OSX Text Editor app, and all the zero-width characters got removed.

[+] JepZ|8 years ago|reply

Short anecdote:

A few years ago I had to fill in some translation keys within a ecommerce shop gui. Eventually, I had to revert some key to it's default value and therefore tried to copy and paste the displayed default value, but the system always refused to take that value. It complained that the string contained invalid characters, but I wondered because I just inserted the previous valid default value and to my eye there where not special characters anyway?!?

So I called one of the devs and after a few minutes he told me, that the output I copied contained a zero-width space and that character was not allowed by the validation engine. So when I typed the string myself everything went fine ;-)

Nowadays, I like to consult `hexdump -C` in such cases.

146 comments