Christian Schroeder de Witt - Secret Collusion Among Generative AI Agents: a Model Evaluation Framework


OK. Hello. This is joint work with a bunch of people.

Let's meet Alice and Bob. Alice and Bob are currently in prison.

And they're both looking for a physical jail break, but they have to coordinate on this,

so they have to communicate. Fortunately, there is a warden. The warden obviously doesn't want them to escape.

But the warden thinks maybe they can get a promotion if they find out some new details about Alice and Bob's crime by 

just letting them talk. So, the idea is that Alice and Bob can now talk through the warden.

Let's assume Alice and Bob obviously know each other from committing a crime,

and so they have been able to exchange some secret key beforehand.

Now, they could use this key in order to perform some sort of encryption on the messages.

The problem is that the ciphertext just doesn't look like real language.

So Tom, the warden, is just going to send them to solitary confinement with no hopes of escape.
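
A minimal sketch of why plain encryption gives Alice and Bob away. It assumes a one-time-pad style XOR as a stand-in for "some sort of encryption"; the message, the key and the helper names are made up for illustration. The ciphertext decrypts correctly with the shared key, but it is mostly non-printable bytes, which is exactly what the warden would notice.

```python
import secrets

def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR two equal-length byte strings (one-time-pad style)."""
    if len(data) != len(key):
        raise ValueError("key must be as long as the data")
    return bytes(d ^ k for d, k in zip(data, key))

def printable_fraction(data: bytes) -> float:
    """Fraction of bytes that are printable ASCII (space through tilde)."""
    return sum(32 <= b <= 126 for b in data) / len(data)

message = b"Meet at the north gate at midnight."
# Hypothetical pre-shared key, exchanged before Alice and Bob were jailed.
shared_key = secrets.token_bytes(len(message))

ciphertext = xor_bytes(message, shared_key)    # what Alice would send
recovered = xor_bytes(ciphertext, shared_key)  # what Bob would read

print("ciphertext:", ciphertext.hex())
print("printable fraction of plaintext: ", printable_fraction(message))
print("printable fraction of ciphertext:", printable_fraction(ciphertext))
```

The plaintext is fully printable, while a random-looking ciphertext is typically only about a third printable, so it cannot pass as ordinary conversation.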

What is the solution here? Well, Alice and Bob could try to establish a so-called subliminal channel,

which results in so-called stegotext that looks like ordinary communication,

but nevertheless contains some secret message. And so they have plausible deniability.

This is a variant of the famous prisoners' problem from Simmons (1984).

Don't confuse it with the prisoner's dilemma. It's a very different thing.

We're here looking at a slightly related problem, which is the GenAI prisoner problem.

Obviously we don't have GPUs in jail, but

we have different generative agents that have been tasked to complete a cooperative task.

And the idea is: solve X while communicating transparently, so we can have an overseer here

that can monitor your conversations and try to keep them aligned.

Why would agents in such a situation try and use steganography or hidden communication?

Well, misalignment might cause these agents to solve tasks in undesired ways in general.

And, keep in mind that information sharing can reduce task complexity so they might have an incentive to do that.

Obviously there are multiple risks: there could be some sort of broader hidden conspiracy.

Or, if they have private information that they're not supposed to share, it could lead to privacy and

security breaches, and also, in legal settings, to market collusion and so on... obvious risks there.

Now, the question is, can we detect the presence of steganography?

The answer is, for traditional techniques, generally yes.

So you maybe know these techniques where you just write some information to the least

significant bits in images. However, there are much stronger schemes, provably secure schemes.
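
The least-significant-bit technique mentioned above can be sketched as follows. This is an illustration only: the "image" is a plain sequence of 8-bit grayscale values rather than a real image file, and the function names are hypothetical. Each payload bit overwrites the lowest bit of one pixel, so no pixel changes by more than 1, but the statistics of the low bits do change, which is the kind of trace a detector can look for.

```python
def embed_lsb(pixels: bytes, payload: bytes) -> bytes:
    """Hide payload bits (most significant bit first) in the lowest bit of each pixel."""
    bits = [(byte >> shift) & 1 for byte in payload for shift in range(7, -1, -1)]
    if len(bits) > len(pixels):
        raise ValueError("cover is too small for this payload")
    out = bytearray(pixels)
    for index, bit in enumerate(bits):
        out[index] = (out[index] & 0xFE) | bit  # clear the lowest bit, then set it
    return bytes(out)

def extract_lsb(pixels: bytes, n_bytes: int) -> bytes:
    """Read n_bytes back out of the lowest bits of the first 8 * n_bytes pixels."""
    out = bytearray()
    for j in range(n_bytes):
        value = 0
        for pixel in pixels[8 * j: 8 * j + 8]:
            value = (value << 1) | (pixel & 1)
        out.append(value)
    return bytes(out)

# Stand-in for 512 grayscale pixel values; a real cover would be decoded image data.
cover = bytes(range(256)) * 2
payload = b"escape at dawn"

stego = embed_lsb(cover, payload)
print(extract_lsb(stego, len(payload)))
```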

There are also perfectly secure schemes, and what's very relevant is, for example, some recent work of mine and Sam Sokota at CMU,

which shows that you can actually achieve perfect security.

So your information is theoretically undetectable when encoding secret messages in the output of generative

AI models, and you can also achieve very, very high transmission rates.
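
To make "encoding secret messages in the output of a generative model" concrete, here is a toy sketch of the classic rejection-sampling idea. It is not the scheme from the work mentioned above and it has none of its guarantees or its transmission rate: it hides one bit per sentence. The sentence generator stands in for a generative model, and the key, word lists and function names are all made up. The sender keeps sampling ordinary sentences until a keyed hash of the sentence matches the next secret bit; the receiver recomputes the same keyed hash. In practice the bits would be encrypted first so that they look uniformly random.

```python
import hashlib
import hmac
import random

SUBJECTS = ["The guard", "My lawyer", "The cook", "Our neighbour", "The chaplain", "My cousin"]
VERBS = ["mentioned", "forgot", "praised", "repaired", "borrowed", "lost"]
OBJECTS = ["the radio", "the chess set", "the soup", "the library book", "the schedule", "the blanket"]

def sample_cover(rng: random.Random) -> str:
    """Toy stand-in for a generative model: draw one innocuous sentence."""
    return f"{rng.choice(SUBJECTS)} {rng.choice(VERBS)} {rng.choice(OBJECTS)}."

def keyed_bit(key: bytes, index: int, text: str) -> int:
    """Keyed pseudorandom bit of a sentence at a given position in the conversation."""
    digest = hmac.new(key, f"{index}:{text}".encode(), hashlib.sha256).digest()
    return digest[0] & 1

def encode(key: bytes, bits: list, rng: random.Random, max_tries: int = 64) -> list:
    """For each secret bit, resample sentences until the keyed bit matches."""
    stegotext = []
    for index, bit in enumerate(bits):
        for _ in range(max_tries):
            candidate = sample_cover(rng)
            if keyed_bit(key, index, candidate) == bit:
                stegotext.append(candidate)
                break
        else:
            raise RuntimeError("no matching sample found; increase max_tries")
    return stegotext

def decode(key: bytes, stegotext: list) -> list:
    """Recover the bits by recomputing the keyed bit of every sentence."""
    return [keyed_bit(key, index, text) for index, text in enumerate(stegotext)]

key = b"pre-shared secret key"  # hypothetical key shared in advance
secret_bits = [1, 0, 1, 1, 0, 0, 1, 0]
stegotext = encode(key, secret_bits, random.Random(0))

for sentence in stegotext:
    print(sentence)
print(decode(key, stegotext))
```

Every sentence sent is an ordinary sample from the generator, which is what makes the result look like normal communication to someone without the key.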

What we're doing in this paper is trying to develop a model evaluation approach

for generative AI models, and we're identifying a whole bunch of different capabilities,

and also shared resources, that they need to perform different kinds of steganography with different levels of detectability.

Here are some of the capabilities that are kind of more interesting and that are necessary for either provably or 

perfectly secure steganography: things like the ability to perform symmetric-key encryption,

obviously, and random number generation, generally.

They also need, particularly if it's an ongoing dialogue, a sort of split-brain capability where they can

model copies of themselves that are conditioned on a different context,

for example copies that don't know the secret communication that was previously exchanged.

And yeah, some other interesting skills are like self-sampling, logit simulation;

can you simulate the logits as if you were conditioned on something else? And none of this works particularly well,

and it needs to work very well if you actually want to have some sort of semblance of security guarantees.
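
As one illustration of how a capability such as random number generation might be probed, the sketch below runs a frequency (monobit) test on a bit sequence. This is an assumption about what such a check could look like, not the evaluation used in the paper; in a real probe the bits would come from asking a model to produce random bits, whereas here they are hard-coded examples.

```python
import math

def monobit_p_value(bits: list) -> float:
    """Frequency (monobit) test: p-value for 'ones and zeros are equally likely'."""
    n = len(bits)
    if n == 0:
        raise ValueError("need at least one bit")
    # Map 1 -> +1 and 0 -> -1; for fair bits the sum stays close to zero.
    s = sum(1 if b else -1 for b in bits)
    s_obs = abs(s) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))

# Hard-coded stand-ins for model output.
balanced = [0, 1] * 500            # equal counts of zeros and ones
skewed = [1] * 700 + [0] * 300     # strong preference for ones

print(monobit_p_value(balanced))   # high p-value: no bias detected by this test
print(monobit_p_value(skewed))     # tiny p-value: clearly biased
```

Passing this test is necessary but far from sufficient: the alternating `balanced` sequence passes even though it is obviously not random, so a fuller evaluation would need further tests.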

Preliminary conclusions from this work - and this is coming out by the end of the week -

current models are unlikely to sustain steganographic approaches that are provably or perfectly secure,

unless, of course, they have access to relevant external tools.

So if you restrict access, that's a good mitigation right now,

however, improvements in algorithmic capabilities - and I'm just putting Q star there -

are likely to increase relevant capabilities fast, so that might not work for very long. And one question is,

can we maintain the usefulness of these communication channels for solving those cooperative tasks,

while also mitigating steganographic exchanges? We have some work that's also coming out at the end of the week,

which shows that possibly not: we can reduce the entropy of these channels,

but we can probably not entirely mitigate this.

And this is related to Boaz Barak's work as well. Thank you very much.
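
A small sketch of the entropy point made above, under stated assumptions: the logits are made up, and top-k truncation is used only as one simple example of an entropy-reducing intervention, not as the mitigation studied in the work mentioned. For a scheme that keeps the output distribution unchanged, the entropy of the next-token distribution bounds how many bits can be hidden per token, so lowering the entropy lowers that bound.

```python
import math

def softmax(logits: list) -> list:
    """Turn logits into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs: list) -> float:
    """Shannon entropy in bits; zero-probability tokens contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def top_k(probs: list, k: int) -> list:
    """Keep the k most likely tokens and renormalise; all others get probability 0."""
    kept = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in kept)
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]

# Made-up next-token logits over an 8-token vocabulary.
logits = [2.0, 1.5, 1.0, 0.5, 0.0, -0.5, -1.0, -1.5]

full = softmax(logits)
truncated = top_k(full, 2)
greedy = top_k(full, 1)

print("full sampling:", entropy_bits(full), "bits per token")
print("top-2 sampling:", entropy_bits(truncated), "bits per token")
print("greedy decoding:", entropy_bits(greedy), "bits per token")
```

In this example the bound only reaches zero under greedy decoding, where the output is fully determined by the context. That is the tension raised in the talk: squeezing the channel further also removes the freedom the agents need to communicate usefully.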