These watermarks are not robust to paraphrasing attacks: AUC ROC falls from 0.95 to 0.55 (barely better than guessing) for a 100 token passage.
The existing impossibility results imply that these attacks are essentially unavoidable (https://arxiv.org/abs/2311.04378) and not very costly, so this line of inquiry into LLM watermarking seems like a dead end.
I spent the last five years doing PhD research into steganography, with a particular focus on how to embed messages into LLM outputs. Watermarking is basically one-bit steganography.
The first serious investigations into "secure" steganography were about 30 years ago and it was clearly a dead end even back then. Sure, watermarking might be effective against lazy adversaries--college students, job applicants, etc.--but can be trivially defeated otherwise.
All this time I'd been lamenting my research area as unpopular and boring when I should've been submitting to Nature!
I’ve been working in the space since 2018. Watermarking and fingerprinting (of models themselves and outputs) are useful tools but they have a weak adversary model.
Yet, it doesn’t stop companies from making claims like these, and what’s worse, people buying into them.
Watermarking is not the way to go. It relies on the honesty of the producers, and watermarks can be easily stripped. With images, the way to go is to detect authentic images, not fake ones. I've written about this extensively: https://dev.to/kylepena/addressing-the-threat-of-deep-fakes-...
If there were a law that AI generated text should be watermarked then major corporations would take pains to apply the watermark, because if they didn't then they would be exposed to regulatory and reputational problems.
Watermarking the text would enable people training models to avoid it, and it would allow search engines to determine not to rely on it (if that was the search engine preference).
It would not mean that all text not watermarked was human generated, but it would mean that all text not watermarked and provided by institutional actors could be trusted.
This article goes into it a little bit, but an interview with Scott Aaronson goes into some detail about how watermarking works[0].
He's a theoretical computer scientist but he was recruited by OpenAI to work on AI safety. He has a very practical view on the matter and is focusing his efforts on leveraging the probabilistic nature of LLMs to provide an undetectable digital watermark. So it nudges certain words to be paired together slightly more than random and you can mathematically derive with some level of certainty whether an output or even a section of an output was generated by the LLM. It's really clever and apparently he has a working prototype in development.
Some workarounds he hasn't solved yet include asking for an output in language X and then translating it into language Y. But those may still be solved eventually.
I think watermarking would be a big step forward to practical AI safety and ideally this method would be adopted by all major LLMs.
That part starts around 1 hour 25 min in.
> Scott Aaronson: Exactly. In fact, we have a pseudorandom function that maps the N-gram to, let’s say, a real number from zero to one. Let’s say we call that real number r_i for each possible choice i of the next token. And then let’s say that GPT has told us that the i-th token should be chosen with probability p_i.
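For anyone who wants to see the shape of this, here is a minimal Python sketch. The quote stops before the selection rule, so the rule used below (pick the token that maximises r_i^(1/p_i)) is the one Aaronson has described in public talks, not something stated in the excerpt. The key, the context width, the toy "model" and every function name are made up for illustration; this is not OpenAI's implementation.

```python
import hashlib
import hmac
import math
import random

KEY = b"demo-key"   # hypothetical secret key
CONTEXT = 3         # hypothetical n-gram width (previous tokens that get hashed)
VOCAB = [f"w{i}" for i in range(50)]   # toy vocabulary


def prf(context, token):
    """Keyed pseudorandom value strictly inside (0, 1) for a (context, token) pair."""
    msg = "\x1f".join(list(context) + [token]).encode()
    digest = hmac.new(KEY, msg, hashlib.sha256).digest()
    n = int.from_bytes(digest[:8], "big") >> 12   # keep 52 bits
    return (n + 0.5) / 2**52


def pick_token(context, probs):
    """Choose the token maximising r_i ** (1 / p_i), i.e. log(r_i) / p_i."""
    return max(
        (t for t, p in probs.items() if p > 0),
        key=lambda t: math.log(prf(context, t)) / probs[t],
    )


def toy_model(rng):
    """Stand-in for an LLM: four equally likely candidate tokens per step."""
    return {t: 0.25 for t in rng.sample(VOCAB, 4)}


def generate(n, watermark, seed=0):
    """Generate n tokens from the toy model, with or without the watermark."""
    rng = random.Random(seed)
    tokens = ["<s>"] * CONTEXT
    for _ in range(n):
        probs = toy_model(rng)
        if watermark:
            tokens.append(pick_token(tokens[-CONTEXT:], probs))
        else:
            tokens.append(rng.choice(list(probs)))
    return tokens


def score(tokens):
    """Mean of -ln(1 - r) over the text; close to 1.0 when no watermark is present."""
    terms = [
        -math.log(1.0 - prf(tokens[i - CONTEXT:i], tokens[i]))
        for i in range(CONTEXT, len(tokens))
    ]
    return sum(terms) / len(terms)
```

Detection needs only the key and the text, not the model. On this toy setup the watermarked score should sit well above 1 and the unwatermarked one near 1. The gap shrinks for short texts and for low-entropy output, where one token has nearly all the probability and there is nothing to nudge.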
I don't think that provable watermarking is possible in practice. The method you mention is clever, but before it can work, you would need to know the probability under every other source that could also be used to generate the same output for the same purpose. If you can claim that the probability of the output is much higher under that model than under any other source, including humans, then the watermark might give a stronger indication.
You would also need to define the probability curve as a function of output length. The longer the output, the more certain you can be. What is the smallest number of tokens below which nothing can be proved at all?
You would also need to include humans. Can you define that for a human? All LLMs would have to use the same system uniformly.
Otherwise, "watermarking" is doomed to be misused and to be not reliable enough. False accusations will take place.
>So it nudges certain words to be paired together slightly more than random and you can mathematically derive with some level of certainty whether an output or even a section of an output was generated by the LLM.
hah, every single LLM already watermarks its output by starting the second paragraph with "It is important/essential to remember that..." followed by inane gibberish, no matter what question you ask.
Sounds interesting, but it also sounds like something that could very well be circumvented by using a technique similar to speculative decoding: you use the censored model like you'd use the fast llm in speculative decoding, and you check whether the other model agrees with it or not. But instead of correcting the token every time both models disagree like you'd do with speculative decoding, you just need to change it often enough to mess with the watermark detection function (maybe you'd change every other mismatched token, or maybe one every 5 tokens would be enough to reduce the signal-to-noise ratio below the detection threshold).
You wouldn't even need access to an unwatermarked model: the “correcting model” could itself be watermarked, as long as the same watermarking function isn't applied to both.
"An LLM generates text one token at a time. These tokens can represent a single character, word or part of a phrase. To create a sequence of coherent text, the model predicts the next most likely token to generate. These predictions are based on the preceding words and the probability scores assigned to each potential token.
For example, with the phrase “My favorite tropical fruits are __.” The LLM might start completing the sentence with the tokens “mango,” “lychee,” “papaya,” or “durian,” and each token is given a probability score. When there’s a range of different tokens to choose from, SynthID can adjust the probability score of each predicted token, in cases where it won’t compromise the quality, accuracy and creativity of the output.
This process is repeated throughout the generated text, so a single sentence might contain ten or more adjusted probability scores, and a page could contain hundreds. The final pattern of scores for both the model’s word choices combined with the adjusted probability scores are considered the watermark. This technique can be used for as few as three sentences. And as the text increases in length, SynthID’s robustness and accuracy increases."
I'm fascinated that this approach works at all, but that said, I don't believe watermarking text will ever be practical. Yes, you can do an academic study where you have exactly 1 version of an LLM in exactly 1 parameter configuration, and you can have an algorithm that tweaks the logits of different tokens in a way that produces a recognizable pattern. But you should note that the pattern will be recognizable only when the LLM version is locked and the parameter configuration is locked. Which they won't be in the real world. You will have a bunch of different models, and people will use them with a bunch of different parameter combinations. If your "detector" has to be able to recognize AI generated text from a variety of models and a variety of parameter combinations, it's no longer going to work. Even if you imagine someone bruteforcing all these different combos, the trouble is that some of the combos will produce false positives just because you tested so many of them. Want to get rid of those false positives? Go ahead, make the pattern stronger. And now you're visibly altering the generated text to an extent where that is a quality issue.
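The false-positive point is just multiple-testing arithmetic. A rough sketch, assuming the detectors for each model/parameter combo behave independently (an assumption, and probably an optimistic one); the function names and numbers are illustrative only:

```python
def familywise_fpr(alpha: float, k: int) -> float:
    """Chance that at least one of k independent detectors flags human-written text."""
    return 1.0 - (1.0 - alpha) ** k


def per_test_alpha(target: float, k: int) -> float:
    """Per-detector false positive rate needed to keep the overall rate at `target`."""
    return 1.0 - (1.0 - target) ** (1.0 / k)
```

With a 1% false positive rate per combo and 100 combos, `familywise_fpr(0.01, 100)` is about 0.63, so most human texts would trip at least one detector. Holding the overall rate at 1% means each detector has to run at roughly 0.01%, which is where the "make the pattern stronger" pressure comes from.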
Couldn’t this be easily disrupted as a watermark system by simply changing the words to interfere with the relative checksum?
I suspect sentence structure is also being used or, more likely, is the primary “watermark”. Similar to how you can easily identify if something is at least NOT a Yoda quote based on it having incorrect structure. Combine that with other negative patterns like the quote containing Harry Potter references instead of Star Wars, and you can start to build up a profile of trends like this statement.
By rewriting the sentence structure and altering usual wording instead of directly copying the raw output, it seems like you could defeat any current raw watermarking.
Though this hasn’t stopped Google and others in the past from using bad science and stats to make unhinged, entitled claims, like when they added captcha problems that everybody said would be “literally impossible” for bots to solve.
What a surprise: they were trivial to automate, and the data they produce can be sold for profit at the expense of mass consumer time.
Some comments here point at impossibility results, but after screening hundreds of job applications at work, it's not hard to pick out the LLM writing, even without a watermark. My internal LLM detector is now so sensitive that I can tell when my confirmed-human colleagues used an LLM to rephrase something when it's longer than just one sentence. The writing style is just so different.
Maybe if you prompt it right, it can do a better job of masking itself, but people don't seem to do that.
They use the last N prefix tokens, hash them (with a keyed hash), and use the random value to sample the next token by doing an 8-wise tournament, by assigning random bits to each of the top 8 preferred tokens, making pairwise comparisons, and keeping the token with a larger bit. (Yes, it seems complicated, but apparently it increases the watermarking accuracy compared to a straightforward nucleus sampling.)
The negative of this approach is that you need to rerun the LLM, so you must keep all versions of all LLMs that you trained, forever.
They actually run a 2^30-way tournament (they derive an equivalent form that doesn't require 2B operations). You do not need to run the LLM; detection only depends on the tokenizer.
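A toy Python version of the tournament idea, for illustration only. It uses 8 candidates (3 knockout layers) rather than anything like 2^30, draws the candidates by sampling from the model's distribution (my reading of the published description, not something confirmed in this thread), and every name, key and constant is hypothetical. It is not Google's implementation.

```python
import hashlib
import hmac
import random

KEY = b"demo-key"   # hypothetical watermarking key
CONTEXT = 4         # hypothetical number of prefix tokens that get hashed
LAYERS = 3          # 2**3 = 8 candidates per tournament
VOCAB = [f"w{i}" for i in range(100)]   # toy vocabulary


def g_bit(context, token, layer):
    """Keyed pseudorandom bit for a (context, token, layer) triple."""
    msg = "\x1f".join(list(context) + [token, str(layer)]).encode()
    return hmac.new(KEY, msg, hashlib.sha256).digest()[0] & 1


def tournament(context, probs, rng):
    """Draw 2**LAYERS candidates from the model distribution, then knock out pairwise."""
    tokens, weights = zip(*probs.items())
    players = rng.choices(tokens, weights=weights, k=2 ** LAYERS)
    for layer in range(LAYERS):
        winners = []
        for a, b in zip(players[0::2], players[1::2]):
            ga, gb = g_bit(context, a, layer), g_bit(context, b, layer)
            if ga == gb:
                winners.append(rng.choice((a, b)))   # tie: keep either one
            else:
                winners.append(a if ga > gb else b)
        players = winners
    return players[0]


def toy_model(rng):
    """Stand-in for an LLM: 16 equally likely candidate tokens per step."""
    return {t: 1 / 16 for t in rng.sample(VOCAB, 16)}


def generate(n, watermark, seed=0):
    """Generate n tokens from the toy model, with or without the watermark."""
    rng = random.Random(seed)
    tokens = ["<s>"] * CONTEXT
    for _ in range(n):
        probs = toy_model(rng)
        if watermark:
            tokens.append(tournament(tokens[-CONTEXT:], probs, rng))
        else:
            tokens.append(rng.choices(list(probs), weights=list(probs.values()))[0])
    return tokens


def detect(tokens):
    """Mean g-bit of the observed tokens; about 0.5 when no watermark is present."""
    bits = [
        g_bit(tokens[i - CONTEXT:i], tokens[i], layer)
        for i in range(CONTEXT, len(tokens))
        for layer in range(LAYERS)
    ]
    return sum(bits) / len(bits)
```

Note that `detect` only needs the key and the token sequence, which matches the point that the LLM does not have to be rerun. Watermarked text should average noticeably above 0.5 here because each knockout round favours tokens whose bit is 1.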
This is information-theoretically guaranteed to make LLM output worse.
My reasoning is simple: the only way to watermark text is to inject some relatively low-entropy signal into it, which can be detected later. This has to a) work for "all" output for some values of all, and b) have a low false positive rate on the detection side. The amount of signal involved cannot be subtle, for this reason.
That signal has a subtractive effect on the predictive-output signal. The entropy of the output is fixed by the entropy of natural language, so this is a zero-sum game: the watermark signal will remove fidelity from the predictive output.
you are correct if we suppose we are at a global optimum. however, consider this example:
i have two hands
i have 2 hands
these sentences communicate the same thing but one could be a watermarked result. we can apply this equivalent meaning word/phrase change many times over and be confident something is watermarked while having avoided any semantic shifts.
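A tiny sketch of that "two" vs "2" idea in Python: wherever a word has an equivalent spelling, a keyed hash of the previous word decides which form to emit, and the detector counts how often the text agrees with the key. The key, the word pairs and the function names are invented for the example; a real scheme would need far more substitution points than this.

```python
import hashlib
import hmac

KEY = b"demo-key"   # hypothetical secret key
PAIRS = [("two", "2"), ("cannot", "can't"), ("until", "till"), ("okay", "OK")]
LOOKUP = {form: pair for pair in PAIRS for form in pair}


def keyed_bit(previous_word):
    """Keyed pseudorandom bit derived from the preceding word."""
    return hmac.new(KEY, previous_word.encode(), hashlib.sha256).digest()[0] & 1


def embed(words):
    """Rewrite each interchangeable word to the form selected by the keyed bit."""
    out = []
    for word in words:
        pair = LOOKUP.get(word)
        prev = out[-1] if out else ""
        out.append(pair[keyed_bit(prev)] if pair else word)
    return out


def match_rate(words):
    """Fraction of interchangeable words that agree with the key (None if there are none)."""
    hits = total = 0
    for i, word in enumerate(words):
        pair = LOOKUP.get(word)
        if pair:
            prev = words[i - 1] if i else ""
            total += 1
            hits += pair[keyed_bit(prev)] == word
    return hits / total if total else None
```

Text passed through `embed` matches the key at every substitution point, while text written without the key should match only about half the time, so confidence grows with the number of such points.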
You're not wrong, but natural language has a lot of stylistic "noise" which can be utilized as a subliminal channel without noticeably degrading the semantic signal.
Why do we need that for photo, video and audio? If it's about the general public believing something false, they're not going to check the watermarks of random internet content or trust anyone who says they checked it. If they really want to know, they can go to the source and if they trust that person or organization, they can also trust the content they published. If it's about use in court, we already have a system for that - the person who recorded it appears in court as a witness and promises that they didn't alter it then if it turns out they did, they can go to prison.
> the team tested it on 20 million prompts given to Gemini. Half of those prompts were routed to the SynthID-Text system and got a watermarked response, while the other half got the standard Gemini response. Judging by the “thumbs up” and “thumbs down” feedback from users, the watermarked responses were just as satisfactory to users as the standard ones.
Three comments here:
1. I wonder how many of the 20M prompts got a thumbs up or down. I don't think people click that a lot. Unless the UI enforces it. I haven't used Gemini, so I might be unaware.
2. Judging a single response might not be enough to tell if watermarking is acceptable or not. For instance, imagine the watermarking is adding "However," to the start of each paragraph. In a single GPT interaction you might not notice it. Once you get 3 or 4 responses it might stand out.
3. Since when is Google happy with measuring by self-declared satisfaction? Aren't they the kings of A/B testing and high volume analysis of usage behavior?
I think your idea is basically right, but there are two points to consider:
- Your hypothesis only holds if the alternative LLM is also "sufficiently good". If Gemini does not stay competitive with other LLMs, Google's AI plans have a much more serious problem.
- Your hypothesis assumes that many people will be capable of detecting the watermarks (both of Gemini and other LLMs) so that they can make a conscious choice for another LLM. But the idea behind good watermarking is that it is not that easy to detect.
Why does the user care if it's watermarked? Surely there are only some use cases for this stuff where it matters. Most of the time isn't it just people having ephemeral chats where this wouldn't matter?
People use Google Search despite it being littered with adverts and tracking. Maybe Google are counting on either being better than the competition despite watermarking, or simply accepting that people who don't care are enough of a market that it's still worth adding.
If Google locks in enterprise clients using Google Workspace to Gemini then they won't really have a choice. It is selling it as an "add-on" already: https://workspace.google.com/solutions/ai/#plan
Suffice it to say, no other LLM will come as close as Gemini in integration with Google Docs and other Workspace apps.
Correct me if I'm wrong, but watermarking is only possible if the model accepts a limited set of inputs (which determine the output), produces a limited set of outputs, and is completely deterministic. And you would have to pre-calculate all possible combinations.
And this would also have to be the case for every possible LLM; then you could compare which LLMs could produce which outputs from which inputs. Only then is there some certainty that a given output was produced by this LLM, or that another LLM might also produce it with these inputs.
People made this same argument about DRM escalations, about increasing privacy violations in the browser, and about Google's donations to support climate change misinformation. Even about Facebook interface redesigns. Every variation of "people will be driven to do X" I've ever heard assumes some coherence and unity of collective purpose that rarely matches the reality of how people behave.
There are counter examples, e.g. Unity. But catching that lightning in a bottle is rare and merits special explanation rather than being assumed.
This strikes me as potentially a bad thing for regular people. For example, corporations can still use AI filtering to force job seekers to jump through hoops but job seekers won't be able to use AI to generate the cover letters and resumes that those hoops demand.
To achieve the watermark, they store every output they create and let partners check against it. That’s how I understand the article.
Then they also store everything the partners upload, to check whether it was created by them.
If other AI players also stored everything they create and made it available in a similar way, there could indeed be a working watermark.
If someone used a privately run AI to alter content generated by a publicly run AI, there would still be a recognisable percentage of similarity hinting that it might come from one of the public AIs.
Timestamps would become quite relevant, since much content would start to repeat itself at some point and the generated answers might be similar.
By design, a watermark would make it easy to create a discriminator that distinguishes between LLM content and human content. In that case, just make a discriminator yourself and use regex to find and remove any of the watermarks.
I think people are already doing that. I frequently hear people watermarking their speeches with phrases like "are we aligned on this?", or "let's circle back" and similar.
I really want to be able to try Gemini without the AI watermark. IIRC they've used SynthID from the start and it makes me wonder if it's the source of all of Gemini's issues.
Obviously Google claims that it doesn't cause any issues but I'd think that OpenAI and other competitors would have something similar to SynthID if it didn't impact performance.
I want AI to use just the right word when it’s writing for me. If it’s going to nerf itself to not choose the perfect word so it can be watermarked, then why would I use that product? I’ll go somewhere else. And if it does use just the right word, then how is that different from a great human writer?
Google are obviously pushing this as a way to root out AI blog spam.
If only they can get other providers to use it because of 'safety' or something they won't have to change their indexer much. Otherwise page rank is dead due to the ease of creating content farms.
Do LLMs always pick the most probable next word? I would have thought this would lead to having the same output for every input? How does this deal with the randomness that you get from prompting the same thing over and over?
It depends. If we use beam search we pick the most likely sequence of tokens rather than the most likely token at each point in time. This process is deterministic though.
We can also sample from the distribution, which introduces randomness. Basically, if word1 should be chosen 75% of the time and word2 25% of the time, it will do that.
The randomness you’re seeing can also be due to implementation details.
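To make the difference between picking the top token and sampling concrete, here is a small Python sketch using the 75%/25% example above. The function names are made up, and `temperature` is the usual name for the knob that sharpens or flattens the distribution; nothing here is specific to any particular LLM API.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; temperature must be > 0."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def greedy(logits):
    """Always return the index of the highest-scoring token (deterministic)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample(logits, temperature, rng):
    """Draw a token index at random according to the softmax probabilities."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# word1 at 75%, word2 at 25%
logits = [math.log(0.75), math.log(0.25)]
```

`greedy(logits)` returns word1 every time, while `sample(logits, 1.0, rng)` returns it about three times out of four. Lowering the temperature to 0.5 pushes word1 up to 90%, and as it approaches 0 sampling converges on the greedy choice.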
blintz | 1 year ago
jkhdigital | 1 year ago
sbszllr | 1 year ago
kp1197 | 1 year ago
sgt101 | 1 year ago
TibbityFlanders | 1 year ago
[deleted]
bko | 1 year ago
https://axrp.net/episode/2023/04/11/episode-20-reform-ai-ali...
nicce | 1 year ago
123yawaworht456 | 1 year ago
littlestymaar | 1 year ago
Or am I misunderstanding something?
nprateem | 1 year ago
namanyayg | 1 year ago
Better link: https://deepmind.google/technologies/synthid/
baobabKoodaa | 1 year ago
In summary, this will not work in practice. Ever.
bgro | 1 year ago
ruuda | 1 year ago
auggierose | 1 year ago
My guess is zero times. So, you are not describing an experiment here, you are just describing how you built up your internal bias.
ksaj | 1 year ago
It's not just that they are (somewhat) unusual phrases, it's that ChatGPT comes up with those phrases so very often.
It's quite like how earlier versions always had a "However" in between explanations.
espadrine | 1 year ago
mmoskal | 1 year ago
jkhdigital | 1 year ago
samatman | 1 year ago
This is impossible to avoid or fix.
thornewolf | 1 year ago
jkhdigital | 1 year ago
mateus1 | 1 year ago
fny | 1 year ago
I’m far, far more concerned about photo, video, and audio verification. We need a camera that can guarantee a recording is real.
foxglacier | 1 year ago
playingalong | 1 year ago
tokioyoyo | 1 year ago
aleph_minus_one | 1 year ago
kranner | 1 year ago
beepbooptheory | 1 year ago
onion2k | 1 year ago
ndr | 1 year ago
dartharva | 1 year ago
nicce | 1 year ago
So... impossible?
glenstein | 1 year ago
harimau777 | 1 year ago
sharpshadow | 1 year ago
matteoraso | 1 year ago
js8 | 1 year ago
tomxor | 1 year ago
Great so now people have to be worried about being too statistically similar to an arbitrary "watermark".
rany_ | 1 year ago
lowbloodsugar | 1 year ago
nprateem | 1 year ago
ajwin | 1 year ago
8note | 1 year ago
Setting it to 0 doesn't get you perfectly deterministic output though, per https://medium.com/google-cloud/is-a-zero-temperature-determ... as you don't have perfect control over what approximations are being made in your floating point operations.
janalsncm | 1 year ago
https://community.openai.com/t/a-question-on-determinism/818...