Boaz Barak - The impossibility of (strong) watermarking

Transcript

I will try to stick to the 20 minutes that I have.

This is joint work with Hanlin Zhang,

Ben Edelman, Danilo Francati, Daniele Venturi, and Giuseppe Ateniese.

And if you're interested in the arXiv paper, you can click this QR code. I'm going to talk about the result

showing the impossibility of strong watermarking and then given time,

I'll see if I can say a little bit about what's strong watermarking.

But I think strong watermarking is what we usually think of as watermarking.

But weak watermarking can still be useful for several applications.

So, what is watermarking? So, what's one of the very common uses of, say,

ChatGPT? It's the following: You're a student,

your teacher assigns you to write an essay on something. You have better things to do with your time.

So you ask ChatGPT to write the essay for you, ChatGPT sends it back,

you submit it and you get an A, possibly. So that's great.

But if you are the professor, you might not be so happy with that.

So, the idea of watermarking is to try to prevent this. So, the model,

the generative model will embed some signal in the output using

some secret randomness. Let's call it K. And there are two flavors.

Let's look at the flavor that makes impossibility results stronger, or possibility results easier,

where there is no public verification algorithm, but rather there is a detector that also uses the same shared secret

randomness or secret key K. Then the professor could send the essay to the server.

The server would say no, this was a generated output, and then your grade will be an F.
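As a rough illustration of the shared-key setup just described, here is a minimal Python sketch. It is a toy and not any of the schemes discussed in the talk: the names `is_green`, `generate`, and `green_fraction` are hypothetical, and the "model" simply samples random words while preferring words that a keyed hash marks as "green". Only someone holding the key K can recompute which words are green, which is what makes the detector private.

```python
import hashlib
import hmac
import random

def is_green(key: bytes, prev_word: str, word: str) -> bool:
    """Keyed pseudorandom split: roughly half of all words are 'green' in any context."""
    digest = hmac.new(key, f"{prev_word}|{word}".encode(), hashlib.sha256).digest()
    return digest[0] % 2 == 0

def generate(key: bytes, vocab: list[str], length: int, seed: int = 0) -> list[str]:
    """Toy stand-in for a watermarking model (an assumption, not a real LLM):
    sample a few candidate words and emit a green one whenever possible."""
    rng = random.Random(seed)
    words = ["<s>"]
    for _ in range(length):
        candidates = [rng.choice(vocab) for _ in range(8)]
        green = [w for w in candidates if is_green(key, words[-1], w)]
        words.append(green[0] if green else candidates[0])
    return words[1:]

def green_fraction(key: bytes, words: list[str]) -> float:
    """Detector side: with the same key, count how many words are green."""
    if not words:
        return 0.0
    prev = ["<s>"] + words[:-1]
    hits = sum(is_green(key, p, w) for p, w in zip(prev, words))
    return hits / len(words)
```

With the right key, text from `generate` should show a green fraction close to 1, while ordinary text (or the same text checked under a different key) should sit near one half.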

So, the question is: Can we achieve this?

And basically our result is that we give a generic attack that applies

against all watermarking schemes, no matter if it's video, image, text, et cetera.

But, under certain assumptions - we believe they are natural -

we can debate them and talk about them in the breaks, et cetera.

And second, we show empirical experiments demonstrating that our attack - an

instantiation of our attack - actually works against three recent LLM watermarking schemes.

The instantiation is not perfect, and it does not work perfectly.

But I think it's proof of concept that the ideas kind of work.

And these are the three schemes. Initially, the p-value -

the number that expresses how confident the detector is that the text was generated - the p-value is very,

very low. So you're certain that the text was generated.

Then, after the attack, you really don't have any confidence that the text was generated rather than

human generated. And the quality change is not zero, but the quality degradation is relatively low.

And here is a demonstration for one of these schemes, the UMD scheme.

The attack is iterative: as we run the attack more and more,

we continually change the output.

So the z-score - that is, how many standard deviations away this is from, say, human, non-watermarked content - gradually reduces.
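To make the z-score and p-value concrete, here is a small Python sketch. It assumes a green-list style detector in which each token of unwatermarked text is green with probability `gamma` by chance; the function names and the default `gamma = 0.5` are illustrative assumptions, not values taken from the talk.

```python
import math

def z_score(green_count: int, total: int, gamma: float = 0.5) -> float:
    """Standard deviations between the observed green count and what
    unwatermarked text would give (binomial mean and variance)."""
    expected = gamma * total
    std = math.sqrt(total * gamma * (1 - gamma))
    return (green_count - expected) / std

def p_value(z: float) -> float:
    """One-sided normal tail: chance of a z-score at least this large
    when the text is not watermarked."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Example: 150 green tokens out of 200 gives z of about 7.07 and a tiny p-value;
# 100 out of 200 gives z = 0 and p = 0.5.
```

An attack that pushes the green count back toward `gamma * total` drives the z-score toward zero and the p-value toward one half, which is the "no confidence" regime described above.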

And we measure the quality of the output by asking whether our output versus the

original output is better or worse, asking GPT-4 to judge

whether something is better or worse, and generally that is fairly stable.

The watermarking signal steadily decreases while the quality stays

relatively stable. So, what are the assumptions?

Our first assumption is that we have something which we call a quality oracle.

Think of the attacker: the attacker doesn't really care. If I am a student,

I don't really care about the particular essay that ChatGPT gave me.

I just want to get an A. So our assumption is that if my model is capable enough to generate

output, then it can also rate it. So basically, the idea is that the model that we have,

the student might not have access to the weights of the model,

but they can make queries to it, black-box queries.

And one of the queries that they can make is: please grade this essay.

And the idea is that verification is easier than generation.

Then the second assumption is that we can perturb the output using an open-source model.

And since I'm running out of time, let me just kind of leave this out.

And there is a barcode and generally we use that to do

rejection sampling and do a random walk on the space of outputs while staying in the high-quality space. Thank you.
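The rejection-sampling random walk mentioned at the end of the talk can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `perturb` stands in for the perturbation step (done with an open-source model in the talk), `quality` stands in for the quality oracle, and the acceptance rule (keep a candidate only if its quality is at least that of the original) is an assumed simplification. The toy `SYNONYMS`, `toy_perturb`, and `toy_quality` are hypothetical stand-ins so the example runs on its own.

```python
import random
from typing import Callable

def random_walk_attack(
    text: str,
    perturb: Callable[[str, random.Random], str],
    quality: Callable[[str], float],
    steps: int,
    seed: int = 0,
) -> str:
    """Random walk over outputs that only moves to candidates of acceptable quality."""
    rng = random.Random(seed)
    current = text
    floor = quality(text)  # assumption: never accept anything worse than the original
    for _ in range(steps):
        candidate = perturb(current, rng)
        if quality(candidate) >= floor:  # rejection sampling: discard low-quality moves
            current = candidate
    return current

# Toy stand-ins (hypothetical): word swaps as the perturbation, and a quality
# check that rejects any text containing the deliberately bad token "???".
SYNONYMS = {
    "big": ["large", "huge"],
    "large": ["huge", "sizable"],
    "huge": ["large", "sizable"],
    "sizable": ["large", "huge"],
}

def toy_perturb(text: str, rng: random.Random) -> str:
    words = text.split()
    i = rng.randrange(len(words))
    options = SYNONYMS.get(words[i], []) + ["???"]  # "???" models a bad edit
    words[i] = rng.choice(options)
    return " ".join(words)

def toy_quality(text: str) -> float:
    return 0.0 if "???" in text.split() else 1.0

result = random_walk_attack("big big big big big big", toy_perturb, toy_quality, steps=200)
```

The point of the design is that the attacker never needs the secret key: the walk only queries the quality check, and after enough accepted steps the output has drifted away from the original while remaining acceptable.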

Alignment Workshop