top | item 44243712

dmsnell | 8 months ago

Unicode has a range of Tag Characters (the Tags block, U+E0000 to U+E007F), created for marking regions of text as coming from another language. These were deprecated for this purpose in favor of higher-level markup (such as HTML tags), but the characters still exist.

They are special because they are invisible and sequences of them behave as a single character for cursor movement.

They mirror ASCII so you can encode arbitrary JSON or other data inside them. Quite suitable for marking LLM-generated spans, as long as you don’t mind annoying people with hidden data or deprecated usage.
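A minimal sketch of the idea in Python, assuming the payload is printable ASCII and is simply appended after the visible text. The function names (`to_tags`, `from_tags`, `mark_span`) and the `{"source": "llm"}` metadata shape are made up for illustration and are not any standard.

```python
import json

TAG_BASE = 0xE0000  # start of the Unicode Tags block


def to_tags(payload: str) -> str:
    """Map printable ASCII (U+0020..U+007E) onto the matching tag characters."""
    out = []
    for ch in payload:
        cp = ord(ch)
        if not 0x20 <= cp <= 0x7E:
            raise ValueError(f"not printable ASCII: {ch!r}")
        out.append(chr(TAG_BASE + cp))
    return "".join(out)


def from_tags(text: str) -> str:
    """Recover the ASCII payload from whatever tag characters appear in text."""
    return "".join(
        chr(ord(ch) - TAG_BASE)
        for ch in text
        if 0xE0020 <= ord(ch) <= 0xE007E
    )


def mark_span(visible: str, metadata: dict) -> str:
    """Append metadata as invisible tag characters (hypothetical convention)."""
    # json.dumps escapes non-ASCII by default, so the result is safe to map.
    return visible + to_tags(json.dumps(metadata))


marked = mark_span("Hello", {"source": "llm"})
print(json.loads(from_tags(marked)))
```

The marked string renders as plain "Hello" in most environments, but its length grows by one code point per payload character.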

https://en.m.wikipedia.org/wiki/Tags_(Unicode_block)

akoboldfrying|8 months ago

Can't I get around this by starting my text selection one character after the start of some AI-generated text and ending it one character before the end, Ctrl-C, Ctrl-V?

dmsnell|8 months ago

Yes, that’s correct. All of these measures, of course, stand as a courtesy and are trivial to bypass, as ema notes.

Finding cryptographic-strength measures to identify LLM-generated content is a few orders of magnitude harder than optimistically marking it. Besides, marking relies on the content producer adding those indicators, so producers who skip them can’t be ignored as a major source of missing metadata.

But sometimes lossy mechanisms are still helpful: people without malicious intent might copy and paste without being aware that the content is generated, while an auditor (anyone who inspects one level deeper) can discover the source of the content in some (most?) cases.
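As a rough illustration of what "inspecting one level deeper" could look like, here is a small Python sketch that scans text for runs of tag characters and decodes them back to ASCII. The function name `find_hidden_payloads` is hypothetical, and it only looks at the ASCII-mirroring range U+E0020 to U+E007E.

```python
import re

# Runs of tag characters that mirror printable ASCII.
TAG_RUN = re.compile("[\U000E0020-\U000E007E]+")


def find_hidden_payloads(text: str) -> list[tuple[int, str]]:
    """Return (offset, decoded ASCII) for each run of tag characters."""
    found = []
    for match in TAG_RUN.finditer(text):
        decoded = "".join(chr(ord(ch) - 0xE0000) for ch in match.group())
        found.append((match.start(), decoded))
    return found
```

An empty result only means no marker survived, not that the text is human-written.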

ema|8 months ago

There are many ways to get around this since it is trivial to write code that strips those tags.
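For instance, a stripping routine can be a one-line regex substitution. This is only a sketch in Python; `strip_tags` is an illustrative name, and it removes every code point in the Tags block (U+E0000 to U+E007F), which would also break legitimate uses such as emoji tag sequences.

```python
import re

# Every code point in the Unicode Tags block.
TAG_CHARS = re.compile("[\U000E0000-\U000E007F]")


def strip_tags(text: str) -> str:
    """Remove all tag characters, leaving the visible text untouched."""
    return TAG_CHARS.sub("", text)
```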