Elad Hazan - AI safety by debate via regret minimization


Pleasure to be here. My name is Elad Hazan. I'm from Princeton [...] it's a tough one to pronounce.

This is joint work with Shani and Angie, who are in the audience, and also Dean Foster.

And I'll talk about AI safety by debate via regret minimization.

So luckily, Sam Bowman already talked about the setting of AI safety by debate.

So now we're all in agreement that it's an interesting setting. It's important.

You can, from inaccurate AIs, get an accurate answer.

And the setting, if you recall, we have a judge. We have two AIs that we're using.

And then it goes back and forth. There is a question, for example, "Which code should I use for my task, A or B?"

And then AI A maybe says, use code A, and AI B says, use code B.

And then they take turns. And they debate.

So, for example, Alice might say: code B has a vulnerability, and here is where it is.

And then Bob, the other AI, might say: yeah, that may be, but it's a rare case, and mine is more efficient, and so on.

So this goes on and on and on.

And eventually, the human judge decides who wins. Human judge or maybe AI judge.

So this setting is by Irving, Christiano, and Amodei. And as a theorist, when I look at it, this is what I think to myself, okay?

This is a game that has maybe 2^1000 strategies.

Because all possible sentences are valid here in the tree.

Or that may be an underestimate. Really, it's the dictionary size raised to the number of words that you have actually written.

Now, of course, there's a lot of duplication, so maybe there are many sentences that mean the same thing.

But that doesn't matter. It's still a game of exponential size.

And then you can ask, well, "How can I solve this game?"

And I've been studying this question for a very long time. I wrote a book about it.

And none of the algorithms in the book are useful for this task. So this is a huge game.

So none of the usual things work. For example, you cannot use multiplicative weights.

You cannot use anything that has anything to do with the size of the game.
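To see why (a toy illustration, not from the talk): multiplicative weights maintains one weight per strategy and touches every strategy every round, so its per-round cost is linear in the number of strategies n. That is fine when n is small, and hopeless when n = 2^1000.

```python
import numpy as np

def multiplicative_weights(loss_rounds, eta=0.1):
    """Multiplicative weights (Hedge) over n explicitly enumerated strategies.

    loss_rounds: sequence of length-n loss vectors with entries in [0, 1].
    Returns the final normalized weight vector.
    Per-round work is O(n) -- infeasible when n = 2**1000.
    """
    loss_rounds = [np.asarray(l, dtype=float) for l in loss_rounds]
    w = np.ones(len(loss_rounds[0]))
    for loss in loss_rounds:
        w = w * np.exp(-eta * loss)  # touches every strategy: O(n) work
        w = w / w.sum()              # renormalize to a distribution
    return w

# Tiny example: strategy 1 consistently incurs less loss, so it gains weight.
w = multiplicative_weights([[1.0, 0.0], [1.0, 0.0], [0.5, 0.0]])
```

The example is only meant to expose the O(n) per-round dependence; the whole point of the talk is that in debate-sized games this loop cannot even be written down.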

So to me - the point of my talk is that there are more questions than answers -

This is a huge, interesting thing that we now need to think about.

How do we operate in these huge games in an efficient way?

It's kind of clear folklore that if you want to do anything meaningful, like minimize regret, you have to work in time polynomial in the size of the game matrix.

And that's impossible. But we have an AI.

So how can we model it and maybe do something efficient with it?

So you can think maybe the AI can give you the best response.

It can kind of look at the strategies we're playing and compute what is the best response to them.

So it turns out there is a lower bound that says you still cannot do anything efficient, meaning sublinear in the size of the matrix, which is 2^1000 by 2^1000.

Still not going to work. So here is another idea. What if we have some randomness?

Maybe the AI can compute a slightly more sophisticated function.

So some approximation to the best response, including some randomness in the strategy.

So you can formulate this. It doesn't have to be exact.

It can be anything approximately close to a distribution over the best responses.
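A minimal sketch of what such a randomized approximate-best-response oracle could look like as an interface (the names and the softmax form are my hypothetical choices, not the paper's construction), shown on a toy game small enough to enumerate. The key point is that the oracle only answers payoff queries; the algorithm never reads the full game matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

def smoothed_best_response(payoff_of, candidate_actions, temperature=0.1):
    """Hypothetical randomized approximate-best-response oracle.

    payoff_of(a): expected payoff of action a against the opponent's
    current mixed strategy. In the debate setting this query would be
    answered by the AI itself, not by enumerating the game.
    Samples from a softmax distribution concentrated on near-best responses;
    lower temperature = closer to an exact best response.
    """
    payoffs = np.array([payoff_of(a) for a in candidate_actions], dtype=float)
    logits = payoffs / temperature
    p = np.exp(logits - logits.max())  # stable softmax over candidates
    p = p / p.sum()
    idx = rng.choice(len(candidate_actions), p=p)
    return candidate_actions[idx]
```

At low temperature and a clear payoff gap, the sample is the exact best response with overwhelming probability; the randomness is what the lower bound above rules out removing.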

And now we can show that you can actually have an efficient algorithm that finds an equilibrium and minimizes internal regret, which is a strong notion of regret that allows for the computation of a correlated equilibrium, in time polynomial in the logarithm of the matrix size.

So instead of 2^1000, it's 1000. So you can actually do something about it.
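To make the notion concrete (a toy illustration of the definition, not the paper's algorithm): internal regret asks, for every pair of actions (i, j), how much you would have saved by replacing every past play of i with j. When every player drives this quantity to zero, the empirical play converges to a correlated equilibrium.

```python
def internal_regret(plays, losses):
    """Internal (pairwise-swap) regret of a play sequence.

    plays:  list of action indices actually played, one per round.
    losses: list of loss vectors; losses[t][a] = loss of action a at round t.
    Returns the max over pairs (i, j) of the total loss saved by
    retroactively swapping every play of i for j.
    """
    n = len(losses[0])
    worst = 0.0
    for i in range(n):
        rounds_i = [t for t, a in enumerate(plays) if a == i]
        for j in range(n):
            if j == i:
                continue
            saving = sum(losses[t][i] - losses[t][j] for t in rounds_i)
            worst = max(worst, saving)
    return worst

# Played action 0 twice while action 1 was better both times: swap regret 2.
r = internal_regret([0, 0, 1], [[1, 0], [1, 0], [0, 1]])
```

Of course, this brute-force check is again linear in the number of actions; the contribution in the talk is achieving low internal regret with only logarithmically many oracle queries.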

And it's an innovative algorithm in the sense that it has to do something non-trivial.

In prior work it was shown that to compute an equilibrium in such a game, a correlated equilibrium, you need to perform a certain type of fixed-point computation.

And the main technical contribution is to show how to do it efficiently with this kind of randomized oracle.

And that's a non-trivial thing to do.

Because you cannot do anything proportional to the size of the matrix. Right?

So you have to do something only proportional to the logarithm of that.

So it gives some kind of algorithmic recipe for how to use the AI, meaning what kind of queries you ask it iteratively in order to minimize the internal regret, which means play in a clever way.

So there are some algorithmic insights from this work about how to add randomness and where to add it. It can be in the probabilities of the judge, in the output logits of the debater, and so forth.

There are many kind of practical ideas that come from it. The actual algorithm itself is not yet practical.

But the ideas coming from it I believe do have potential to improve the debate.

And our experimental study is still being conducted by Angie.

To summarize.

So I think there is a very important point. I think there is a very important set of questions to be asked here.

How do we reason about very large games?

Those that have 2^1000 strategies... In general, nothing can be done.

But using randomness we can circumvent these lower bounds.

And there are some interesting new algorithms that come from these ideas. Thank you.