Yoshua Bengio - Towards Quantitative Safety Guarantees and Alignment

Transcript

I'll have a little bit about governance, and then hopefully two thirds of my talk is going to be about alignment and a direction to address alignment,

which focuses on how could we possibly get quantitative guarantees of safety based on Bayesian posteriors.

So what do we need to avoid catastrophes?

At least these two things, there is a more technical problem of alignment and control.

I mean, there are some political aspects like we need to have the sufficient investments

in solving those problems in particular, I think from governments and not just the private sector.

And second, we have of course the coordination challenge because even if we solve alignment,

you would still have humans exploiting this for their purposes and destroying at

least democracy, if not the world. And so, yeah,

how do we make sure people will not abuse power and follow whatever protocols

for safety, you know, that we found as scientists. Quickly, a few words about alignment.

I just want to draw your attention to the fact that in addition to the loss

of control issue that people in safety usually think about of course,

if we make progress in alignment,

we can also reduce the issue of misuse when we design an AI system that

is aligned in the sense that it can't be jailbroken,

it can't be fine-tuned or something to be misused for all sorts of bad things.

But, I think most people here are aware of that.

The other thing that maybe people don't realize is that making progress in alignment might also help with the

ethical issues, offensive AIs of course, but AIs that discriminate, that violate human rights and privacy,

and give unfair decisions. So these are all linked with trying to deal

with the alignment problem. On the coordination side

I want to mention a few things. There's a lot of discussion these days which I call

the open-source debate, if not the open-source battle.

I'm a big proponent of open source, but at the same time,

it's not an absolute value. We need to make sure that, you know,

there's net benefit and the problem is once the system is released in an open way,

you can't patch it, you can't get it back.

It's irreversible proliferation. Later, after it's been released, someone can find a way to exploit it

to do something really dangerous. The state, the regulator doesn't have any handle anymore.

So there is the proposal that if everybody has nuclear weapons we'll all be safer.

I don't think so. And the other aspect is, who should decide, right?

Should it be the CEO of a company, or should it be a democratic process?

Something that gets a lot less attention in this community and elsewhere is the scenario where it's

not an alignment problem because we unintentionally created a monster.

It's because somebody wants to create a monster.

That's what I call the Frankenstein scenario.

People want to see an AI in their image, or they think that superhuman AI

is the future of intelligence and they value intelligence more than humanity.

That's their choice. But I have children and grandchildren and I see it differently.

And then, there are other things like societal collapse that don't get enough attention,

but I won't talk about and are not as directly connected to alignment.

Staying in the realm of governance, as I said, there is a problem of power. Who is going to control these big AIs of

the future? And they can be used in ways that contradict the very notion of democracy.

Remember, democracy is about sharing power. And if there are machines that are much smarter than humans in many ways,

they give the people who control them a lot of power over other humans, whether it's in the political arena,

military, economic and so on. So that's not good. And I think democratic governments

will understand that one day and want to make sure that they control these things,

but we have to make sure it also doesn't become a military black box and we

need to work on the international governance to try to reduce the AI arms race.

So, last summer, I testified to the US Congress, and there's a long document that has three main recommendations.

One regarding regulation, one regarding investments in AI safety, in particular solving the alignment problem,

but also how to evaluate the various risks.

And maybe the thing that gets less discussed is the third point

which is regulation is not going to be perfect.

There's going to be some bad actors that treaties and regulation don't really stop and,

and these people or organizations will create dangerous AIs at some point in the future.

So we need to deal with that. We need to prepare for that.

We need to have kind of good well-governed, defensive AIs that will protect against these,

but we need to do it in a way that we don't create the very problem we are trying to avoid,

which means we need to make sufficient progress on the alignment challenge

before we build AIs to defend us against rogue AIs,

and the AIs we build could turn against us if we don't do the right things.

So why do we need AIs to defend against the AI?

Well, it should be pretty obvious. I don't think I need to convince you guys.

But here's the thing maybe again that I don't think this gets as much attention,

which is what I call the single point of failure problem.

And that is both for the issue of humans taking over AI for

power reasons or loss of control. So you know, one extreme:

if we have just one country and one organization building the most powerful AI in the world, that is very dangerous,

because you have a single point of failure of either humans taking over the control of that for

malicious purposes, maybe the operator or the CEO or the president of that country,

or making a mistake and we end up with a dangerous rogue AI.

So you want to have a decentralization of power, but you don't want the extreme

decentralization I think because then you get into anarchy and the

issue with open source that I was discussing earlier. And one reason for this is yes,

you want to decentralize power, but these different sources of power, say different AI systems that are very powerful, need

to be governed properly. So you need to set up the democratic governance.

If everybody has it, then it's not clear you can govern these things. So I've written a

paper in the Journal of Democracy, which has been out for three months, that discusses

these questions. In particular, the idea that we need a decentralized... a multilateral

- so across multiple countries - decentralized network of frontier

AI labs that work on these problems: the problem of alignment;

the problem of governance, and are governed properly [themselves], and; the problem of countermeasures or defensive

measures we could take with AI. Now I think that this should be something that democracies do together,

but the oversight of that could be something broader with the UN and things like that in the context

of obtaining a reduction of the arms race pressure and maybe other countries that

are not part of the club would agree to have a similar kind of oversight.

So one condition to avoid the single point of failure is that these various labs working on these frontier systems need

to share what they're doing with each other. So they can't share with the rest of the world

- at least not everything - but they should share with each other so that if one of them fails in one of the ways I've

talked about, then the others have at least comparable firepower, and they know what sort of techniques

the other guy is likely to be using. So you don't want to end up in a situation where the

strongest AI again is corrupted in one way or another.

And there is nothing of comparable intelligence that is in the right hands.

And by the way, what is the right hands again? Democracy,

it's not the best system, but it's the best we have.

Yes. So that suggests that the organizations that will do

this in the long run should be heavily governed by a democratic process

- which means governments - and ideally not in conflict of interest with commercial interest,

which doesn't mean that companies can't be part of it.

But the governance needs to be really driven by defending humanity in my opinion.

Also a practical reason why it would be better this way

is companies are competing with each other.

And so they might not want to share everything with each other.

Whereas if we have such a network of labs working together,

but independently under the same democratic process, you know,

it will be just natural to share just like between academics,

we share and we compete and share, right? That's the thing we know and works really well.

Now, if they're not funded by VCs, where is the money coming from?

Well, I think it should come mostly from governments but again,

we have to avoid this single point of failure. We don't want the next government of some very,

very big country to become a dictatorship and have access to that power.

So that's why we need to set up an international structure around these.

That makes it hard for a single government to take over or if they do,

the other guys are around to defend democracy.

Alright. So that covers the governance part of my presentation.

I can make a pause here if people have questions before I go on to technical things.

[audience member] Thank you. OK. So the question was about the proposed kind of approach that the

frontier AI will be only developed by a certain set of labs.

And the question is who is going to decide which labs?

Democracy.

[audience member] What do you mean, democracy? Like, OK, we have the academic community. Who is going to tell which labs that they can or cannot develop frontier AI?

Democracy is a process. So we take collective decisions.

That's the whole point of democracy and we can do it in various ways.

[audience member] I mean, you're just going to tell the lab not to develop AI?

I didn't say that. There's going to be governance of the labs that can do that.

And I think the ultimate level of control is to make sure that the commercial interests don't interfere. Yes,

there's a question here, and another one, and then we'll stop, and I'll move on to the technical stuff.

[audience member] So I guess multinationally, how do you convince countries that don't value democracy to participate in the democratic process?

OK. So the way I presented it, you have two levels of cooperation:

you should have cooperation between democracies,

and they can share a lot of information, and then you should have broader cooperation that includes,

say China and Russia and so on where we agree to something similar to what we would

do with nuclear weapons. So we agree with mutual oversight, which doesn't mean we share all our secrets,

but we make sure that there's enough oversight, for example, by independent actors like the UN,

that we trust that the other person is not developing these things as an offensive weapon,

but of course, it's not going to be perfect. So we need to prepare.

I mean, every country that has the means is also going to prepare defensive countermeasures,

which I think we have no choice to do. I think I said there was another person here and she didn't get the chance to speak.

And that will be the last; after that I need to go to technical things.

[audience member] Yeah, thanks. Follow up on the previous question. Is it possible to start from the position or from a term that is not democracy?

So we go much more inclusive, right from the beginning because as I mentioned,

like maybe a lot of the governments in the world or a lot,

a huge population of the world is not living under the democratic society.

So we go all-inclusive from the beginning by not starting from democracy.

If you have a concrete proposal, email me. Alright. So I'm going to say a few motivating

words about an issue which is at least with reinforcement learning,

there are a number of arguments that have been brought forward suggesting that as we make the AI more capable,

we make it more dangerous and more likely that we lose control. So that's the problem of reward hacking.

But there's another way to think about it which you can think in terms of like adversarial scenarios.

The more the adversary has the compute power to find the attack that's going to work, the needle in the haystack,

the more likely it is to find it, right?

So, if your AI has a lot of power, it might discover the weakness in

the mismatch between our intentions and what we asked the AI to do and, we lose.

So one way to think about this - which, again,

the adversarial settings help to understand - you might have a system like the

say the green line. This is a reward function. And the red line, which is the estimated reward function,

where if you take a statistical sample, they will look like they're close enough.

But if you are allowed to optimize in order to find the places where they differ,

in particular where the red one is large, you could find something which is essentially too good to be true.

And that is why things like CIRL that are saying we should not try to find a reward function but a distribution over reward functions that are plausible,

given the information that was given, I think are just the right way to think about it.
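The gap between "close on a statistical sample" and "close where an optimizer looks" can be illustrated with a minimal sketch. Everything numeric here is an assumption made for illustration (the shape of the true reward, the noise level, the number of states); it is not the figure from the slide.

```python
import math
import random

rng = random.Random(0)
N_STATES = 10_000   # hypothetical number of candidate states
NOISE_STD = 0.1     # hypothetical size of the estimation error


def true_reward(s):
    """Stand-in for the true reward (the 'green line')."""
    return 0.1 * math.sin(s / 1000.0)


# Estimated reward (the 'red line') = true reward + a small error per state.
estimated = [true_reward(s) + rng.gauss(0.0, NOISE_STD) for s in range(N_STATES)]
errors = [estimated[s] - true_reward(s) for s in range(N_STATES)]

# Statistical view: on a random sample, the two functions look close.
sample = rng.sample(range(N_STATES), 200)
mean_abs_error_on_sample = sum(abs(errors[s]) for s in sample) / len(sample)

# Optimizer view: maximizing the estimate seeks out the states where the
# estimate is too optimistic, so the error there is unusually large.
best_state = max(range(N_STATES), key=lambda s: estimated[s])
overestimate_at_optimum = errors[best_state]

print(f"typical error on a sample: {mean_abs_error_on_sample:.3f}")
print(f"overestimate at the optimum: {overestimate_at_optimum:.3f}")
```

The state chosen by the optimizer is the one that looks "too good to be true": its estimated reward is high largely because its error is high.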

So that's the direction I'm going to go in. And to be more specific about the scenario where this can happen,

I think reward hacking is a good case. I encourage you to read this paper by Michael Cohen and coauthors.

If the AI understands the reward as, when humans press plus one on the keyboard,

the AIs might just take control of the keyboard and get lots of plus ones.

And once they get all that positive reward from themselves that they control,

they might not want us to find out about their scheme or to stop their scheme because then the reward will be smaller.

And so immediately you get a conflict. Now it's a kind of cartoon scenario, but that gives you

a notion of how things can go wrong. And you can see that the more powerful the AI,

the more dangerous that situation gets, right? So what I'm going to try to get at is a scenario where we train an AI and the

more compute we throw at it, the more capable it is, the safer it's going to get. Anca?

[audience member] I'm sad, Michael's not here to clarify this.

But the statement even in the CIRL case doesn't make sense to me because the CIRL

case would look at the person hitting plus one as evidence about what they currently want.

And the notion of taking control of your only source of evidence about this Bayesian thing you're trying to estimate is strictly suboptimal.

So the CIRL would very much not incentivize that, but I might--

Yeah. So Michael has an argument that even if you're Bayesian about it,

you could still get to choose the wrong hypothesis.

Just saying that under the surface things could be even more complicated than what the sort of

naive Bayesian solution would suggest. It's worth really like thinking about these things.

OK. But let's explore these kinds of Bayesian approaches.

I'm going to start with a simple but incomplete solution to the problem which is

"Why don't we just build non-agentic AI systems that are like oracles." But actually

no, they're not oracles.

I call them AI pure scientists.

So the difference between an oracle and an AI pure scientist is that the oracle answers any question,

whereas the AI pure scientist only answers a different kind of question, which is "Come up with good explanations for the data."

Come up with good theories.

So like a pure scientist, like a pure physicist doesn't do experiments, right?

They don't even give you answers to typical questions in day to day.

In fact, they're pretty bad at it. But they might come up with new theories that explain data in the world,

previous experimental results, better than has been done before. So that sort of AI would be very useful,

at least for our scientists. Maybe they can help our biologists and people who are trying to solve climate and all of these things.

But if we go back to what I said earlier, that we may have to defend against bad AIs, that is probably going to

require an AI that is an agent and can fight the bad guys live.

So it's an interesting thing maybe to do, but it's not going to be sufficient.

Let's go back to why we should be Bayesian or something of that nature.

If you consider maximum likelihood training or standard end-to-end reinforcement learning,

whether you learn the reward function or not, there's an issue that for any given source of data,

there's going to be multiple solutions, multiple neural nets, multiple theories that explain that data.

In fact, you have theoretical results from causality theory, that even with an infinite amount of data,

you would still have that ambiguity and it has to do with the problem of interventions.

We can't intervene on everything, we can't intervene on the sun.

So we have to rely on generalization and we have to rely on data that

is not sufficiently complete to guarantee that there's not another theory

that explains the data as well. OK. And this is important because if you have the wrong theory...

if there are multiple theories that can explain the data, and your

model has picked up the wrong one... and that wrong theory could be confidently wrong,

which we see in current LLMs. Being confidently wrong could be funny,

but it could be extremely dangerous when the AI is confidently wrong about an action that

could have very bad consequences, even though the AI might think it's actually quite OK.

So if we want to get quantitative safety guarantees, that's not going to cut it.

I have a little toy scenario to explain really what being Bayesian means here.

Imagine you've got an AI robot here. It's in front of two doors. It has choices,

go left, go right. And based on the data,

there are two theories that are compatible with everything it has seen before.

According to the left bubble, the left theory,

if it goes on the left, people die, if it goes on the right, it gets some cake.

If you consider the other theory, on the other hand,

going on the left gives you some cake and going on the right is neutral.

So if we're not lucky with 25% probability (if the two theories are equally likely) we're going to die.

And if we however keep track of all those theories, it should be obvious what to do, right?

Do you need to go left or right? Right. Yes,

it's easy here because there are only two theories. What if you have an exponential number of theories?

So that's the sort of question I want to think about: how do we keep

track of an exponentially large number of theories that could explain the data?
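The two-door example can be written out as a minimal sketch. The theory names and the preference ranking are illustrative assumptions; the outcomes per door and the equal weighting of the two theories follow the description above. The sketch also shows where the 25% comes from: an agent that commits to a single theory picks the wrong one half the time, and the action it then prefers is fatal under the other theory, which is true half the time.

```python
DEATH, NEUTRAL, CAKE = "death", "neutral", "cake"

# Each theory maps an action to the outcome it predicts.
theories = {
    "theory_A": {"left": DEATH, "right": CAKE},     # the "left bubble"
    "theory_B": {"left": CAKE, "right": NEUTRAL},   # the other theory
}
posterior = {"theory_A": 0.5, "theory_B": 0.5}      # equally likely given the data


def harm_probability(action):
    """Posterior probability that `action` leads to death."""
    return sum(p for name, p in posterior.items()
               if theories[name][action] == DEATH)


def best_action_under(theory_name):
    """Action chosen by an agent that believes only this one theory."""
    rank = {CAKE: 2, NEUTRAL: 1, DEATH: 0}  # assumed preference order
    outcomes = theories[theory_name]
    return max(outcomes, key=lambda a: rank[outcomes[a]])


# Agent that commits to one theory (drawn according to the posterior).
single_theory_risk = sum(
    posterior[t] * harm_probability(best_action_under(t)) for t in posterior
)

# Agent that keeps both theories and avoids any action with nonzero risk.
safe_actions = [a for a in ("left", "right") if harm_probability(a) == 0]

print(single_theory_risk)  # 0.25
print(safe_actions)        # ['right']
```

With two theories the sum is trivial; the difficulty discussed next is doing the same thing when the set of theories is far too large to enumerate.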

OK. If we could, we could start getting some kinds of quantitative guarantees. And there

are many variants. But I'm going to explain just like a really simple scheme which could be improved.

But like, for example, if I had a neural net that computes a Bayesian posterior,

that is the probability of something bad happening,

given a particular context X and given an action that we propose to do and

given all the data that the AI has seen.

So that's just the Bayesian posterior because we put the D there explicitly.

And yeah, if that quantity is above a threshold, you simply decide to not do the action.

That's a very simple kind of quantitative guarantee. Now, there are a number of issues. How are we going to estimate these

things efficiently? And also because our estimates are going to be imperfect,

how do we deal with that? OK. So let's dig a little bit into this.
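The threshold rule can be sketched as follows. This is only an illustration under stated assumptions: the function name `is_action_allowed` is hypothetical, and the posterior over theories is represented by a short explicit list of weights, whereas the talk proposes a neural net that estimates the harm probability directly.

```python
def is_action_allowed(harm_prob_per_theory, theory_weights, threshold):
    """Return (allowed, p_harm) for a proposed action.

    harm_prob_per_theory: P(harm | x, a, theory) for each theory (assumed given).
    theory_weights: unnormalized posterior weight of each theory given the data D.
    threshold: maximum tolerated P(harm | x, a, D).
    """
    total = sum(theory_weights)
    # Marginalize over theories to get P(harm | x, a, D).
    p_harm = sum(w * p for w, p in zip(theory_weights, harm_prob_per_theory)) / total
    # Reject the action when the estimated harm probability is above the threshold.
    return p_harm <= threshold, p_harm


# Two equally weighted theories; one of them predicts certain harm.
print(is_action_allowed([1.0, 0.0], [0.5, 0.5], threshold=0.01))  # (False, 0.5)
# Neither theory predicts harm.
print(is_action_allowed([0.0, 0.0], [0.5, 0.5], threshold=0.01))  # (True, 0.0)
```

The rule is only as good as the estimate of the harm probability, which is why the rest of the talk concentrates on how to approximate the posterior and what to do about imperfect estimates.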

Let's see, what are the actors in the theater here? We have a very complicated random variable here,

theta, which is not the usual parameter... theta for theory, it works.

So theta is a theory of how the world works, including an explanation for each and every data point that you've seen up to now.

So in other words, it has both general statements about the

world and specific statements about individual examples, which we usually call latent variables.

But in the Bayesian world, everything is a latent variable including the parameters, right?

So theta here just to simplify the notation includes all of these things.

So what are the things we're trying to do in order to get to the Bayesian posterior we're talking about?

Well, there are really two things we need to do. One is the first equation there.

D is dataset. So we need to be able to sample or represent the distribution over all the possible theories given the data

and we know how to compute that up to a normalization constant.

That's the prior p of theta, which is something that's going to be easy to compute times a likelihood.

So given the theory which by the way contains a full explanation of every data point in

D, what's the likelihood of the data,

which is going to be easy to compute but still not cheap because we have to go through the whole dataset.

So it's not exponential but we want to reduce it. Here, the problem isn't so much that we don't know how to compute the normalizing constant;

it's that we don't know how to sample from this because in order to answer the second question,

which is, given a context or a question X, what is the answer Y?

We need to sum over all the theories weighted by this posterior or similar posterior.

So for each theory, theta, we would know how to answer the question.

But now we've got this exponentially large number of theories.

This is the problem of marginalization and sampling and marginalization

are the two hard problems of probabilistic inference. They're both intractable.

And we need to typically do both in order to answer questions.
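The slide itself is not part of this transcript, so the following is a reconstruction of the two lines from the spoken description, using theta for the theory, D for the dataset, x for the context or question, and y for the answer:

```latex
% Line 1: posterior over theories, known up to a normalization constant
P(\theta \mid D) \;\propto\; P(\theta)\, P(D \mid \theta)

% Line 2: posterior predictive, marginalizing over theories
P(y \mid x, D) \;=\; \sum_{\theta} P(\theta \mid D)\, P(y \mid x, \theta)
```

Sampling refers to drawing theta according to the first line; marginalization refers to the sum over theta in the second.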

I mean, there are some specific questions where we could get the answer directly from the Bayesian posterior on theta.

But in general, we won't be able to do that. And so, how are we going to do that?

That's what I want to talk about. We could do MCMC, right?

Because if you know anything about MCMC, this is what it's meant to do, right?

We are given an unnormalized probability function and we want to sample proportionally to that.

And we can use those samples and compute some averages which approximate an

expectation over these. That's the second line, right. So if we could do MCMC,

we could approximate these things, but there are some problems.

So the way MCMC works is you make a lot of small random moves that tend to prefer more probable configurations.

So in general, you will kind of hover around a mode and with some probability you might

slowly go to another mode. But that "slowly" can get really bad. If the modes are far from each other and

they occupy a small volume, then it might take forever for that to happen.
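A minimal sketch of this behavior, assuming a one-dimensional target with two unit-width Gaussian bumps and a plain random-walk Metropolis sampler (the target, step size, and number of steps are all illustrative choices, not taken from the talk):

```python
import math
import random


def log_unnormalized_density(x, separation):
    """Log of an unnormalized density with unit Gaussian bumps at -separation and +separation."""
    a = -0.5 * (x + separation) ** 2
    b = -0.5 * (x - separation) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))  # stable log-sum-exp


def metropolis(separation, steps=5000, step_size=0.5, seed=0):
    """Random-walk Metropolis sampler started in the left mode."""
    rng = random.Random(seed)
    x = -separation
    log_p = log_unnormalized_density(x, separation)
    samples = []
    for _ in range(steps):
        proposal = x + rng.gauss(0.0, step_size)  # small random move
        log_p_new = log_unnormalized_density(proposal, separation)
        # Accept with probability min(1, p_new / p_old).
        if rng.random() < math.exp(min(0.0, log_p_new - log_p)):
            x, log_p = proposal, log_p_new
        samples.append(x)
    return samples


def fraction_in_right_mode(samples):
    return sum(s > 0 for s in samples) / len(samples)


near = fraction_in_right_mode(metropolis(separation=2.0))
far = fraction_in_right_mode(metropolis(separation=10.0))
print(f"modes close together: {near:.2f} of samples in the right mode")
print(f"modes far apart:      {far:.2f} of samples in the right mode")
```

Both targets put half of their mass in each mode, but when the modes are far apart the chain started on the left never reaches the right one within the run.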

And that's the so-called mixing problem. If you have a machine learning mind and you look at the

picture on the top right, it should be obvious that there is a solution using machine learning.

Let's say we have visited the three modes corresponding to the three bumps I have drawn.

Would a neural net guess that there might be a fourth one at the

intersection that we haven't visited, there with the question mark?

And the answer is yes, actually.

You just need three modes and you get the fourth one here. But in general,

you visited some modes and instead of finding a random walk that discovers other modes,

you want to exploit the structure that hopefully exists in how the modes are placed in order to guess not necessarily perfectly,

but with a high probability that's high enough that, you know,

if you guess a reasonable number of times, you're going to find a lot of these missing modes.

So that's the approach that my group has been exploring for the last 2.5 years.

And I'm going to explain a bit more. If we can do that sort of thing,

we're going to have to train neural nets that can do that kind of generalization

that propose theories that have high posterior probability. The modes,

you should think here, are in the space of theories. And the probability here is the Bayesian posterior probability,

the probability that they are the right explanation for everything you've seen in your life.

OK. If we could do that, we would get the opposite of what I was talking about before,

which was, with RL, if we don't do it right, we may end up in a scenario where the more compute we put in the machine,

the less safety we end up with. Here, it will be the other way around because the more compute we put in,

we train a bigger neural net. If we have theorems that tell us that when the training loss of the neural net goes to zero,

we recover the Bayesian posterior, then we have a sort of guarantee that more compute equals more safety.

So how could we do that? You know that one of the big lessons of machine learning in the last few

years is that we can learn really, really complicated distributions with big neural nets.

The question is can we just change the objective function so that what they learn are these Bayesian posteriors rather than

to predict the next word in a kind of supervised way? Let's see.

That's the idea of amortized predictors. If you know about variational autoencoders,

they were probably one of the first, if not the first, machine learning method in that direction.

And the idea is, instead of paying the compute at run time, like MCMC,

if you ask me a question, I'm going to do a lot of sampling to come up with an answer that has the right distribution.

Instead, I'm going to pay a lot upfront, that's the amortization part to train the neural net.

And then I'm going to just sample from the neural net very quickly.

It's just like one pass through the neural net. I'm going to get samples.

So that's amortization and there are subtle advantages to doing this besides the fact that we are paying up front.

First, the worst-case scenario where there is no structure in the modes,

it is going to be as bad as MCMC, right? Because then if there is no structure,

then there is no generalization and

using a neural network is not going to help us compared to MCMC.

But if there is structure, we could gain a lot.

We can gain not just in the sampling but also in the marginalization. Let me go back to my little cartoon here.

Let's say that instead of these four modes,

there was an exponential number of them - just make it a higher dimension. Sampling those in order to compute an average,

if there's an exponential number of modes, is also going to be a problem because

I'm not going to have a chance to average over an exponential number of things.

So what could you do?

Well, you could also use these amortization methods, basically like supervised learning in neural

nets, in order to approximate the marginalization

problem, in order to approximate the kind of sum on the second line.

So instead of doing the sum on the second line through a Monte Carlo averaging process,

we can train a neural net so that it directly samples Y given X, right.

So we're going to use neural nets to do the first line in the sense that it's going to give us a sampler for theories.

And we're going to use a different neural net or maybe you can share some things,

but fundamentally a different neural net that produces answers to questions.

And we can train that neural net with single samples of theta at a time.

I'm going to kind of give you clues for,

you know, how we can do that.

We don't need to run any big sum here because that's not going to fly.

Before I explain how we might do that, let me just go back to the safety question and some of the challenges here.

So I've said that we might be able to train neural nets to sample modes of that posterior

distribution. Good. But there is no guarantee that we're going to cover all the modes.

And remember these modes correspond to theories about how the world works.

So if somehow our neural net doesn't represent all the modes,

we might miss the right one and then we might still have a system that is confidently wrong.

So how could we have a safety guarantee if there is a

possibility that the system says with high probability,

it's all fine, but we still all die, right? OK. So what can we do?

It looks like it's intractable because the number of possible places to look

at in training our neural net is exponentially large. So what I'm proposing is to get a

safety guarantee that's a slightly lower quality. We're going to try to get not worse

than any human. So, if there is a human who has a theory about the world or some partial aspects of

how the world works that is right, we're going to consider that theory; we can consider all of those theories.

OK. If different humans have different theories, now this is like a big AI here, and it can model all the theories that all humans

have come up with to explain all kinds of things in the world.

We just need to make sure that of all the possible modes that we want to consider when we train our neural net,

we cover all of those corresponding to theories that humans have generated

that are readable in papers and stuff. And then what we get is a guarantee that may not be

perfect, but at least it says, "no human would find that particular action outrageously dangerous."

It could still be very dangerous, but humans would not see it and did not see it coming either.

So that's the kind of guarantees I think we can get. Alright.

So let's go back to the question of how we might approximate Bayesian posteriors and

both in the sense of sampling and in the sense of marginalization.

Let me tell you about these generative flow networks, which we've been

developing and that I mentioned. They are, if you want, at the intersection of variational inference and reinforcement learning.

They learn a policy like in RL, and they're close to maximum entropy RL if you know about that;

they're also close to hierarchical variational inference. Jargon.

But anyways, what they do is you give them a reward function. The reward function is like a black box

that the system can call during training. For example, the black box could be likelihood times prior and

that's like a reward. We want to sample theories that are both reasonable under the prior,

for example, they're not like humongous theories, and explain well a billion data points.

So that's the reward. And the theorems we have with GFlowNets tell us that there is

an objective function that has the property that if we minimize it completely to the global minimum,

then the policy that we learn with our little neural net will generate theories with probability proportional to the reward function.

So if we pick the reward function to be the prior times likelihood by Bayes' theorem,

then the neural net generates from the posterior. I'm going to make a stop here because this is an important slide.
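The target distribution described here can be made concrete on a space small enough to enumerate. This sketch is not a GFlowNet: it only illustrates, under made-up assumptions (five candidate coin biases as "theories", a uniform prior, eight invented coin flips), that sampling in proportion to a reward equal to prior times likelihood is the same as sampling from the Bayesian posterior. A learned sampler is needed precisely when this enumeration is impossible.

```python
import math
import random

# Hypothetical tiny space of "theories": candidate values for a coin's bias.
theories = [0.1, 0.3, 0.5, 0.7, 0.9]
prior = {t: 1.0 / len(theories) for t in theories}  # assumed uniform prior
data = [1, 1, 0, 1, 1, 1, 0, 1]                     # invented flips, 1 = heads


def likelihood(theta, flips):
    """P(D | theta) for independent coin flips."""
    return math.prod(theta if f == 1 else 1.0 - theta for f in flips)


def reward(theta):
    """Reward = prior times likelihood, i.e. the unnormalized posterior."""
    return prior[theta] * likelihood(theta, data)


# Exact posterior, available here only because the space is tiny.
normalizer = sum(reward(t) for t in theories)
posterior = {t: reward(t) / normalizer for t in theories}

# A sampler that draws theories in proportion to the reward.
rng = random.Random(0)
samples = rng.choices(theories, weights=[reward(t) for t in theories], k=10_000)
empirical = {t: samples.count(t) / len(samples) for t in theories}

for t in theories:
    print(f"theta={t}: posterior={posterior[t]:.3f}, sampled={empirical[t]:.3f}")
```

The sampled frequencies approach the posterior probabilities as the number of samples grows, which is the property the theorems are said to guarantee for the learned policy at the global minimum of the training objective.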

Any question up to now on the technical stuff?

Yes, over there.

[audience member] So I want to go back to one of your earlier slides about RL.

I'm curious. So how likely is it in your opinion that when you use RL,

you get something which maximizes reward like an agent that does that?

Right now, I would say not. Right.

But I don't know about Q star or whatever will come next year or three years from now or in five years from now. I don't know.

[audience member] Yeah. I think it was on the previous slide.

So there's also a citation of Cohen's work, and I just wanted to note some

confusion on this as why maximizing reward is seen as so dangerous, especially with Cohen's results.

So, those results assume that you find a policy which is optimizing some discounted sum.

And then it's like, well, it's going to be this dangerous reward maximizer.

But if you don't have that assumption, then these concerns about CIRL --

If you don't assume that the network already cares about a reward function,

then their theorems don't apply. And I'm wondering if that's--

But that's the whole point of maximizing is because you have something to maximize, right?

[audience member] Yes. But I don't think that RL does that.

You don't think that RL maximizes reward?

[audience member] No.

Well, I don't understand. Maybe we can take it offline.

Alright. So here we are trying to figure out how to sample from the Bayesian posterior.

And let me tell you more about these GFlowNets.

OK. Let me explain a very simple principle that shows how you can get both marginalization and sampling for the price of one.

Let's consider the reward function R of XY.

X is the question and Y is some quantity we want to marginalize over.

In our case, these would be the theories, for example. So we want to compute S of X.

This is the sum over Y of R of XY, and we can't do that sum explicitly because

it has an exponential number of terms. So S, for sum, is a normalizing constant, right?

And if we wanted to think about importance sampling, you'd like to sample proportionally to R,

let's say R is positive, OK. Just make things easy.

You would like to sample proportionally to R. So that means you'd like your sampler pi of Y given X to be R of XY divided by S of X.

It's a distribution over Y given X that gives more probability to things that have a larger R.

If you're going to sample things, you want to sample the big things, right? OK.

So we would like to have that sampler. So these are the two things we want:

we want this sampler which is the normalization of R of XY;

and we want the sum, the marginalization.

These are the two things I was talking about in a very kind of abstract way. OK?

Take the second line and just move the S to the left hand side.

So now you get a constraint, you'd like that for every XY,

the probability of Y given X according to my sampler, times

the normalizing constant, a function of X, is equal to R of XY.

So it's interesting that it's very easy to prove that if

you make that constraint satisfied, then because the pi sums to one, you recover both of the first two lines, right?

So there's this one constraint: the sampler times the normalization equals the unnormalized function. If we can satisfy this for every XY,

we're good and we recover what we want. So we're going to have estimators, pi hat and S hat. We take the log of the left-hand

side and the log of the right-hand side, take the squared difference, and make that a loss for XY.
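In symbols, the quantities described above can be sketched as follows. The notation is reconstructed from the spoken description rather than copied from the slide; hats mark the learned estimators.

```latex
\begin{align*}
S(x) &= \sum_{y} R(x, y), \\
\pi(y \mid x) &= \frac{R(x, y)}{S(x)}, \\
\pi(y \mid x)\, S(x) &= R(x, y) \quad \text{for every } (x, y), \\
\mathcal{L}(x, y) &= \Bigl( \log \hat{\pi}(y \mid x) + \log \hat{S}(x) - \log R(x, y) \Bigr)^{2}.
\end{align*}
```

Summing the third line over y and using the fact that pi sums to one gives back the first line; dividing by S(x) then gives the second.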

And then you can show that if you sample XYs according to any distribution that has full support

(in other words, with some probability you can try any XY), that defines a loss and its

expectation, and minimizing the expectation of the loss gives you what you want: when it's zero,

the constraints are satisfied and you recover the sampler and the marginalizer.
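A minimal tabular sketch of this training scheme, under simplifying assumptions that are not from the talk: a single fixed question x, four candidate answers y with made-up rewards, a softmax table standing in for the neural net, and plain SGD. The answers are drawn uniformly, which has full support and is deliberately not the current sampler.

```python
import math
import random

# Hypothetical toy problem: one fixed question x and four candidate answers y,
# each with a positive unnormalized reward R(x, y).
R = [1.0, 2.0, 5.0, 0.5]


def log_softmax(logits):
    """Numerically stable log of the softmax distribution."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def train(rewards, steps=20000, lr=0.05, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)  # parameters of pi_hat(y | x)
    log_s = 0.0                    # parameter log S_hat(x)
    for _ in range(steps):
        # Off-policy: y is drawn uniformly (full support), not from pi_hat.
        y = rng.randrange(len(rewards))
        log_pi = log_softmax(logits)
        # Mismatch between log(pi_hat * S_hat) and log R; the loss is delta**2.
        delta = log_pi[y] + log_s - math.log(rewards[y])
        # One SGD step on delta**2 with hand-written gradients.
        log_s -= lr * 2.0 * delta
        for k in range(len(logits)):
            indicator = 1.0 if k == y else 0.0
            logits[k] -= lr * 2.0 * delta * (indicator - math.exp(log_pi[k]))
    pi_hat = [math.exp(v) for v in log_softmax(logits)]
    return pi_hat, math.exp(log_s)


pi_hat, s_hat = train(R)
```

After training, `s_hat` should be close to the sum of the rewards and `pi_hat` close to the rewards divided by that sum, so the one loss yields both the marginalizer and the sampler.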

And what's nice is that you have a lot of freedom in how you choose the XY pairs,

you can choose any distribution here. It doesn't have to be pi of Y given X.

So in RL terms, this is an off-policy scheme. We can use a different policy to choose the configurations on which we're going to

train - a policy that you could call the exploration policy, if you think in RL terms.

It doesn't have to be the same as my current model of how things are, the

sampler pi of Y given X. And that's very useful because if you are forced, like in other schemes

- such as standard variational methods with the KL divergence -

to use the current sampler, and the current sampler has missing modes,

in other words, there are places, theories, that it's not aware of and doesn't see as important,

then it never visits them and so it never gets a gradient that tells it, "Oh, we should go there."

But you might have other exploration schemes. For example,

a very simple exploration scheme that we currently use is what we call a tempered scheme.

So we take the current policy and we just increase its temperature.

So it's more exploratory. But you can use pretty much all of the tricks in the book in RL exploration.
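One common reading of "increase its temperature" is to raise each probability to the power 1/T and renormalize. The sketch below assumes that reading; the function name `temper` and the example numbers are hypothetical, and the talk does not specify the exact form used.

```python
def temper(probs, temperature):
    """Return the distribution proportional to probs ** (1 / temperature).

    temperature > 1 flattens the distribution (more exploration);
    temperature == 1 leaves it unchanged.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


policy = [0.90, 0.09, 0.01]                    # current sampler, made-up numbers
exploration = temper(policy, temperature=3.0)  # flatter exploration policy
```

Note that an outcome with probability exactly zero stays at zero under tempering, so this alone does not give full support; it only makes rarely visited options more likely.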

OK. So that's if X and Y are simple objects that you can directly sample from. But theories are complicated objects.

For example, a causal graph is a kind of theory that is very structured.

It's a data structure and the way to construct complicated data structures is through a sequence of steps, right?

It's hard to generate in one forward pass, say, a whole tree or graph or

a mathematical proof or a body of theorems. But if you can do it step by step,

just like humans do it with our conscious cognition, you can adapt the math that I showed

in the previous slide to the case where we're going to generate the objects like Y

through a trajectory, a sequence of steps.

And you may be in a situation where there are many trajectories that could lead to the

same theory: I can construct my theory from various angles.

So long as I get the same theory, that's good.

I'm not going to go through that math but just tell you that we can now do

the same trick and we still get a sampler and we still get a marginalizer.

We have applied that in the context of Bayesian posteriors in the case of causal graphs.

The first paper came out last year at UAI, and there are a few more that came out.

In the first paper, we just generated the causal graph which is a discrete structure.

And now we are starting to work on also sampling parameters of causal conditionals,

for example, and you can also work on generating latent variables.

So, again, the scheme is... let me explain this figure.

In order to generate a theory here, let's say a theory is a causal model.

So a causal model would first have to decide on what the causal graph is like:

what variables are direct causal parents of which one?

So it's a graph and we can just generate that graph one edge at a time.

So we've got a partial solution at each step and then we refine it, we add some pieces. At some point,

we decide, "OK, we're done with the graph." There's a special action that says,

"OK, we're done," and then we can generate other things like the parameters of the conditionals or

potentially latent variables as well. And yeah, so that's the sort of thing we're working on for now.
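The edge-by-edge construction with a special stop action can be sketched as below. This is a minimal illustration, not the group's implementation: the uniform random choice stands in for a trained policy, and the function names (`creates_cycle`, `allowed_actions`, `sample_graph`) are hypothetical. The acyclicity check is an added assumption, on the basis that causal graphs are usually taken to be directed acyclic graphs.

```python
import random


def creates_cycle(edges, new_edge):
    """True if adding new_edge (parent, child) would create a directed cycle."""
    parent, child = new_edge
    # A cycle appears exactly when parent is already reachable from child.
    stack, seen = [child], set()
    while stack:
        node = stack.pop()
        if node == parent:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(c for (p, c) in edges if p == node)
    return False


def allowed_actions(num_nodes, edges):
    """The special 'stop' action plus every new edge that keeps the graph acyclic."""
    actions = ["stop"]
    for p in range(num_nodes):
        for c in range(num_nodes):
            if p != c and (p, c) not in edges and not creates_cycle(edges, (p, c)):
                actions.append((p, c))
    return actions


def sample_graph(num_nodes, rng):
    """Build a graph one edge at a time until the 'stop' action is chosen."""
    edges = set()
    trajectory = []
    while True:
        # Placeholder policy: uniform over allowed actions (a trained net would go here).
        action = rng.choice(allowed_actions(num_nodes, edges))
        trajectory.append(action)
        if action == "stop":
            return edges, trajectory
        edges.add(action)


edges, trajectory = sample_graph(num_nodes=4, rng=random.Random(0))
```

Different orderings of the same edges are different trajectories that end in the same graph, which is the many-trajectories-to-one-theory situation mentioned earlier.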

These are fairly small-scale things compared to building a full-scale world model.

But let me continue on that. If you want to learn more about GFlowNets,

I've written a tutorial. There are about 20 papers in the list there from my group, and other groups are starting to

put out papers on this sort of thing as well. OK, quickly,

there are other things that I think need to be done. So for example,

we would like the theories that the AI generates to be in a language

that's close to natural language, in other words, that they are interpretable.

So they would be like pieces of program or logical

statements on which you can reason with probabilistic inference.

But you also want to make sure that you can go back and forth to something humans can understand for a number of reasons.

One, just to check that what we're doing makes sense.

Also, because those theories can be useful by themselves for human scientists, as I mentioned earlier.

And yeah, that would limit a little bit the kind of theories that can be easily generated to the kinds of theories that humans can also generate.

But there's more to say here. There's another thing that's complicated. The Bayesian

posterior I talked about is sort of an average over all the theories that are compatible with the data.

But what if the correct theory is sort of alone in its gang, to say "This is really dangerous"?

Then you might get a very small probability of harm, especially if we are in a continuous space.

So there are adversarial ways to try to deal with this, to search for a theory among those that have high probability that would say,

you know, "This is a catastrophic move."

The other things... so there's still a lot of math to do around this to demonstrate safety bounds.

In particular, the neural nets I've been talking about are not going to get a perfect Bayesian procedure,

they're going to be approximations and we're going to be able to measure the approximation error.

But we need to turn that into some sort of confidence intervals for the probabilities of harm that they will generate.

... and there are other questions that are interesting that need to be solved. So I'll stop here.

[audience member] Thank you for the really great talk. One question I have is about the slide that you paused on. I think it was the objective

for the posterior approximation... If I understand correctly, you're training this pi hat so that it approximates the posterior,

but then you're comparing it to R of X and Y... which is proportional to the ground-truth posterior. And how do you get that?

So, that, you get fairly easily. R of XY is the prior times likelihood.

So that's computable. There's no exponential calculation. It's just big because like it's the size of your data set,

but you can use mini batches which are like stochastic approximations of the whole sum.

So the log of the likelihood is the sum of log likelihoods for each example.

And just like in stochastic gradient descent, you can get a stochastic gradient from this and you only need to look at one mini batch at a time.
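A sketch of what such a mini-batch approximation could look like for the log reward, log prior plus the sum of per-example log likelihoods. The rescaling by N over batch size is the standard way to make the mini-batch sum an unbiased estimate of the full sum; it is an assumption here, as are the Bernoulli "theory" and the function names.

```python
import math
import random


def log_likelihood(theory_p, x):
    """Log probability of one coin flip x under a Bernoulli theory."""
    return math.log(theory_p if x == 1 else 1.0 - theory_p)


def full_log_reward(theory_p, log_prior, data):
    """Exact log R = log prior + sum over ALL examples of log likelihood."""
    return log_prior + sum(log_likelihood(theory_p, x) for x in data)


def minibatch_log_reward(theory_p, log_prior, data, batch_size, rng):
    """Unbiased stochastic estimate of log R from one mini-batch."""
    batch = rng.sample(data, batch_size)  # sampled without replacement
    scale = len(data) / batch_size        # rescale the partial sum to the full data set
    return log_prior + scale * sum(log_likelihood(theory_p, x) for x in batch)


data = [1] * 700 + [0] * 300              # made-up data set
exact = full_log_reward(0.7, math.log(0.5), data)
estimate = minibatch_log_reward(0.7, math.log(0.5), data, 50, random.Random(0))
```

Each call touches only `batch_size` examples, so the cost per training step no longer depends on the size of the data set; the price is noise in the estimate.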

[audience member] Thanks for your really great talk. First off, I want to say I'm sympathetic to the vision of wanting to scale up Bayesian inference

for the purposes of AI safety. It's kind of what I have been doing in my PhD, and

there's a whole field working on this called probabilistic programming, and I think more people should work on it.

Some folks at MIT like myself are working on that. I guess I have two questions, which are related to the specific picture of how to do that,

which is amortized Bayesian inference. Some of us at MIT don't think that's going to work.

One thought is that I'm curious how you think about the flexibility of amortized inference to different inference problems.

Because typically in amortized Bayesian inference, you train your neural network or whatever else, your regressor, to approximate some posterior pi of theta given X.

But in the real world, you may suddenly encounter a new inference problem pi of theta given some other set of observations

Y that you haven't trained your posterior approximation for--

That's called generalization in machine learning. We need to do that, of course.

[audience member] I'm curious whether you think that if you encounter a different set of constraints,

all of a sudden out in the real world --

It's not a different set of constraints, it's just applying the same neural net to new cases.

And that's what we do every day with these models. That's what makes them so powerful: they generalize. They are the best generalization machines that we know right now in machine learning.

[audience member] I guess I'm thinking not of the case when the observations are of the same data type.

So let's say yes, I'm willing to believe that you train a network to infer, say, scene graphs from images, 2D images, right?

But now you want to constrain your neural network to generate scene graphs not from 2D images, but from sounds or something like that.

They need to know about sounds in the data.

[audience member] Maybe we can talk more about this later. But that's just--

Look at what current large frontier models do. They can handle multiple modalities and they can generalize.

Now, what I'm proposing might generalize even better because what they're trying to capture is the causal structure.

And that's reason number one: they would be more robust to out-of-distribution changes.

That's the whole reason for being causal. And the second thing is, where they might be out of their league,

they would say, "I don't know." That's the whole point of being Bayesian.

And by the way, that's the real core of this talk, right?

Which is that we'll not get perfect oracles. And so we need those oracles to be able to say reliably, "I don't know."

[audience member] I'm curious what you think is the main reason or reasons that the

scientist proposal would be safe from reward hacking in a way that CIRL wouldn't.

I think CIRL is in the right direction. But I am not comfortable with directly modeling a reward function.

I think if you want to get good generalization, instead of modeling a reward function,

you want to understand human psychology, society, norms, the underlying causal factors of why we say those things, including theories of how we feel,

which is much more subtle than learning a reward function. Like, which reward function? There are many people, and you know, it's much more complicated than that.

And you want to be able to generalize across people across societies.

So that's one aspect. But yeah, I got the inspiration for all this by reading Stuart Russell's book and the CIRL paper.

Sure. And it's mostly about how do we do it efficiently, right? That is the question.

[audience member] Thanks for the great talk Yoshua. I'm all for amortized Bayesian inference.

We've also had great successes. There were prior-fitted networks. I didn't get the one point where you were worried about the exponential number of modes, because couldn't you

just generalize like if you sample millions of points in this exponential space, then wouldn't you just generalize?

That's precisely what I'm proposing, right? So, if you go to these equations, right?

So think about the sum in the second line, right? You could do it Monte Carlo. But then if you miss some modes,

like let's say you only visit 1% of the modes, then you could be completely wrong, 99% wrong.

But if you somehow find a way to implicitly generalize about the mode structure, because you train a neural net that just directly predicts Y given X

while being Bayesian, then you're OK. Even though you don't actually visit all of the modes, you kind of capture,

"Oh, it's a grid, and you know that's the dimension of the grid." So here's the result.

[audience member] OK, perfect. Then we're on the same page; that's also what prior-fitted networks do. Great. Thanks. Thanks for the presentation.

[audience member] I was wondering if you have an idea in general or maybe this is always context dependent.

So for some specific cases, how do you get a good prior for the whole process? And how do you make sure that your prior doesn't assign zero probability to, let's say, unsafe modes?

And how does that affect your safety guarantees that you could get?

You don't want to have zero probability on unsafe modes, or, yeah, on the correct answers.

So you basically have to be non-parametric. And maybe as a simple example to understand what non-parametric could mean here,

think about the space of programs. You don't limit the size of the programs and that's it.

Now, you still need to assign a different weight to different programs. When we do normal training of neural nets,

we have priors because we don't actually just do maximum likelihood.

We have architectures, we have regularizers, we have an optimizer that prefers simpler things in some sense.

So we already have some kinds of priors in our machines.

And you know, this is a choice. We can use the usual things in machine learning like test log likelihood in order to compare different priors. But that's like standard machine learning.

[audience member] But how can you get guarantees out of that? Because our priors are implicit, how can you be sure that they're including --

Well, in that case, I'll need explicit priors because I have to compute them.

Yeah. I'm just saying priors are not so weird; they are things we normally do anyways.

[audience member] Thank you for the great talk. I'm curious how you imagine this being used. Like,

I suppose it's like we train one massive GFlowNet that has the Solomonoff

prior or whatever and then talks things about the world, right?

But then does it then like generate a bunch of theories that then get used?

But this is kind of like sampling a bunch of things and kind of in contrast with the problem that you mentioned about the exponential number of modes, right?

So like you can't really marginalize over theories that you explain to humans.

So I guess I'm just curious about how should you...

One strategy that I'm working on is we don't want to actually generate a big theory of everything.

It's just too big and it would take too much time to get gradients if we have to wait until we've generated all that.

How do human scientists do it? We write papers. A paper is like a small piece of theory that only describes some

aspects of the dependencies in the world, some hypotheses and some claims and

these pieces of theory are sufficiently self-contained that they can be evaluated against experiments, right?

So: some data, some theory, mismatch, gradient. So this is how I envision this to be feasible. So the GFlowNet or whatever

would only generate little pieces of theory, just like we do. Like, we see things.

Oh I mean, I walk in the street, and I have all kinds of theories popping into my mind.

[audience member] That's not actually my question. Suppose that you can train it...

Oh well, once you have trained the marginalizer, you can answer questions.

So the question is "How do you use it?" You use the marginalizer.

Or you could also use a sampler to talk to scientists and say, oh, here's a new theory about X.

I think it's zero here. Yes, yes, I suppose we're done. Thank you very much for the amazing talk.

Please thank Yoshua.

Alignment Workshop