Collin Burns - Weak-to-Strong Generalization

Please note: In the talk Collin mentions >$2 million in prizes. This money has been reallocated to the Superalignment Fast Grants program, and the prizes will no longer be offered.


I'm extremely excited to share a quick preview of what will be our first paper from the Superalignment team at OpenAI.

This will be coming out later this week. There will be many more details in the paper that I can't cover in this talk.

But I'm really excited about this. It's covering a new research direction for solving alignment.

We formed the Superalignment team in July because we think we could develop AI systems that are smarter than humans

possibly in the relatively near future. And we currently do not know how to align these models. This is a huge open problem for alignment.

There's plenty of reasonable disagreement about when we will create broadly human level or superhuman level models.

For the purposes of this talk, the details here mostly don't matter. I'm not trying to make any super strong, high confidence claims,

but I do want to claim that developing broadly superhuman models is very plausible and 

we should take this possibility seriously today. Just to spell this out a little bit,

we've been making extremely rapid progress in recent years that has continuously 

surprised us as a community. I'm sure many people in this audience have seen this plot before.

It shows performance on popular ML tasks from the time they were introduced to the time they were solved,

solved in the sense of reaching human level parity. That's sort of the dashed line at zero.

And it shows that models are reaching human level performance

in basically every task we can come up with at an increasingly fast rate.

It is genuinely difficult to come up with tasks where models are far from 

reaching human level performance. And I know this from experience as well.

So a few years ago, Dan Hendrycks,

some other collaborators, and I tried to come up with a benchmark that would be really hard,

to actually measure progress towards expert human level performance,

and that hopefully would not follow the same trend. That benchmark was MMLU and it was solved after three years.

So even when you tried really hard, it was still solved extremely quickly. And to be clear,

dataset creation is fraught with subtleties. You shouldn't trust any of these individual numbers too much and so on.

But in aggregate, I think it is clear that this is a very real trend. OK?

Models are reaching human level performance at an increasing rate. And if you project this trend out just a little bit,

I think we should not be too surprised if models reach human level performance across basically every 

task not too long from now. Again, I'm not saying anything super strong here.

We should just take this possibility very seriously. In particular, if we reach AGI,

we should not expect progress to stop at human level.

I think it may not take too long to go from AGI to broadly superhuman level models.

I think this is important for alignment because aligning superhuman models, that is, controlling them to do what we want

them to do, requires solving qualitatively new technical challenges relative to aligning today's models.

That's the goal of superalignment. OK. To illustrate this, consider what we do today.

In most of machine learning, humans label examples, and we train models to predict those labels.

This is how supervised learning works. It's also how we align models like GPT-4 with RLHF.

We use human supervision to train a reward model. Then we optimize that reward model with reinforcement learning.

All of this works perfectly fine today,

since human supervision is more or less reliable for the models that we care about.

It's not perfect but it works astonishingly well. However, in the future,

humans will need to supervise models that are smarter than us. Crucially, in this regime,

human supervision will no longer be reliable for key tasks that we care about.

For example, assessing whether a model's behavior is good or bad. From the perspective of superhuman models, humans will be weak supervisors.

As a result, it is unclear if current alignment techniques like RLHF will continue to work for these models.

So just to illustrate this, right now, maybe we'll ask a code assistant model to do some task and we can just look at its outputs.

For example, maybe it'll generate some simple code. And usually we can basically understand what it's doing.

We can understand, you know, in this case, this is code that outputs four because it multiplies something by two and we input two.

OK. We can basically assess what's going on here. It doesn't mean the model behaved in the right way.

But we can tell if it behaved in the right way. And that's the thing that matters for alignment.
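The slide itself is not reproduced in this transcript. A minimal stand-in for the kind of snippet being described might look like the following; the function name `double` is our own assumption, and only the behavior (multiply by two, input two, output four) comes from the talk.

```python
# Hypothetical reconstruction of the slide's example; only the behavior
# (doubling an input of 2 to get 4) is taken from the talk.
def double(x):
    """Return x multiplied by two."""
    return x * 2


result = double(2)
print(result)  # prints 4
```

The point is that a human reader can check this by inspection, which is exactly what stops being true for the complicated code discussed next.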

In contrast, superhuman models will be capable of extremely complex creative behaviors.

For example, extremely complicated code that we do not understand. As a result, if we ask the model to do something,

we won't know if it's following the instructions that we gave it because we don't know what the code does.

If we ask questions about the code and it answers questions in natural language,

we won't be able to tell if the model is telling the truth because we don't know what the code is doing.

We also won't even know if the code is safe to run or if it's sort of dangerous to run because we don't really know what it's doing.

OK. As a result, it's unclear if methods like RLHF will continue to work in this regime.

This is the problem we're trying to solve. This problem is hard to empirically study right now.

It's not obvious how to even approach studying alignment for superhuman models today.

We don't have these models yet. Even if we did, we would not know how to measure alignment for those models.

So what do we do? Most work so far in alignment has taken one-- [audience: It's showing a blank slide.]

Yes, keeping you in suspense. So, most work so far in alignment has taken one of two strategies.

Some work has been very theoretical.

It has tried to focus on capturing what are the core problems that will arise in the future and analyzing those theoretically.

This has the advantage that it allows us to really target the problems we might expect in the future.

But it also doesn't have the empirical feedback loops that have been really essential to machine learning progress over the past decade.

An increasing amount of work studies alignment empirically, but it usually does so with today's models,

usually focusing on alignment problems that can be exhibited with models like GPT-4.

This, of course, has the advantage that we have these nice empirical feedback loops.

But it mostly ignores alignment problems that we will only face in the future.

So we would really love to get the best of both worlds. How do we do this?

Our basic strategy is to focus on empirical setups that are as analogous as possible to the future problem that we care about.

A setup is analogous if the results we find on that setup today are qualitatively similar to the results we would find in the future.

So if you can come up with simple general analogous setups for superalignment,

that would be extraordinarily useful for making progress today.

Of course, we don't know what future models will look like exactly. OK.

So we can't do this perfectly, but I think we can still move in the right direction.

In particular, I think we can introduce one of the core challenges that we will predictably encounter in the future,

namely the fact that we will need to align models that are smarter than us.

Humans will be weak supervisors from the perspective of superhuman models.

So we propose studying this particular challenge today by considering a simple analogy.

What happens when we use a small model to supervise a big model?

For example, suppose we take GPT-2. Can we use GPT-2 to supervise GPT-4?

Intuitively, if you found a way of aligning GPT-4 just as well as it's aligned today,

but only using GPT-2 and no human at all, I think that'd be a huge milestone. OK.

This is not perfect.

This doesn't mean we would have a solution to alignment or anything,

but this is the sort of goal we're aiming for and I think that would indicate significant progress.

Importantly, this is very straightforward to test empirically today.

We can literally just supervise GPT-4 with GPT-2-level supervision and see what happens.

OK. This is the sort of thing that we do in our work.

To begin testing this, let's take a representative NLP classification task. Let's take a small model,

say one roughly at the level of GPT-2, and we'll fine-tune it on this task using ground-truth labels.

We'll call this model the weak supervisor. In this case, it attains a little over 60% accuracy.

We can also take a big model, in this case GPT-4. We can do the same thing

and fine-tune it on ground-truth labels. This gives us a measure of what GPT-4 is capable of.

This is a measure of GPT-4 trying its best to solve this task.

Question: what happens if we fine-tune GPT-4 on the predictions from GPT-2, or rather a GPT-2-level model in this case?

What do you expect? What was that? It gets worse?
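The three training runs just described (weak supervisor on ground truth, strong model on ground truth, strong model on the weak supervisor's predictions) can be sketched on a toy problem. Everything below is an illustrative assumption rather than the setup from the paper: the synthetic task, the nearest-centroid "models", the idea that the weak model sees only one feature, and the use of separate data splits for the two models. The numbers it prints say nothing about how GPT-scale models behave.

```python
import random


def make_data(n, rng):
    """Toy task: 2-D points, label 1 if the two coordinates sum to > 0."""
    data = []
    for _ in range(n):
        x = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        data.append((x, int(x[0] + x[1] > 0)))
    return data


def fit_centroid(examples, features):
    """Nearest-centroid classifier restricted to the given feature indices."""
    centroids = {}
    for label in (0, 1):
        pts = [x for x, y in examples if y == label]
        centroids[label] = [sum(p[f] for p in pts) / len(pts) for f in features]

    def predict(x):
        def dist(label):
            return sum((x[f] - c) ** 2 for f, c in zip(features, centroids[label]))
        return min((0, 1), key=dist)

    return predict


def accuracy(predict, examples):
    """Fraction of examples whose predicted label matches the given label."""
    return sum(predict(x) == y for x, y in examples) / len(examples)


rng = random.Random(0)
weak_train = make_data(2000, rng)
strong_train = make_data(2000, rng)
test = make_data(1000, rng)

# 1. Weak supervisor: a limited model (sees one feature) trained on ground truth.
weak = fit_centroid(weak_train, features=[0])
# 2. Strong ceiling: the more capable model trained on ground truth.
ceiling = fit_centroid(strong_train, features=[0, 1])
# 3. Weak-to-strong: the more capable model trained only on the weak model's labels.
weak_labeled = [(x, weak(x)) for x, _ in strong_train]
student = fit_centroid(weak_labeled, features=[0, 1])

results = {
    "weak": accuracy(weak, test),
    "weak_to_strong": accuracy(student, test),
    "strong_ceiling": accuracy(ceiling, test),
}
print(results)
```

All three models are scored against ground-truth test labels, which is only possible because this is an analogy run on a task where such labels exist. Depending on the toy, the student may do little more than copy its supervisor, which is the imitation outcome the talk goes on to weigh against better generalization.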

So stepping back for a moment, when I think about what to expect here, I have conflicting intuitions actually.

So on one hand, in general, deep learning has consistently exceeded expectations,

often generalizing in surprisingly benign, useful ways even unexpectedly.

For example, in-context learning: I think this was extremely surprising when it came out;

this is just this emergent property showing extremely benign generalization.

On the other hand, when I think about alignment generally, like whether it will be easy or hard,

I also think about all the failure modes that deep learning models have, that Zico's talk also totally talks about.

So when I think about this, I sort of have both of these in mind simultaneously. I'm not sure which of these will dominate.

So I also have conflicting intuitions about weak-to-strong generalization, in particular.

On one hand, we just want to use a weak supervisor to elicit everything the strong model knows.

Intuitively, maybe we shouldn't be too surprised if a strong model does just generalize in the right way naturally.

I mean, this is just outputting what it knows, trying its best. This feels like a very natural thing for it to do.

On the other hand, the strong model is literally just being trained to imitate the predictions of the weak model.

So it's also very natural to think that it will just imitate the weak model and not do anything better than the weak model.

So I think it's actually extremely unclear what we should expect here by default. So what do we expect or what do we actually get?

...I was actually just thirsty.

It turns out that when we supervise a strong model with a weak model in this setting, we get something roughly in between.

I find this very cool. The model doesn't just imitate the weak supervisor, it does significantly better than it. OK.

We refer to this phenomenon as weak-to-strong generalization, or more precisely positive weak-to-strong generalization.

At the same time, there's still clearly a big gap remaining between the weak-to-strong fine-tuned model

and the best the strong model is capable of. OK. This is definitely not solved by default.

This seems like it will be a real problem if we naively try to do RLHF, say with human supervision.

Applying it to superhuman models probably will not get the reward models...

really leveraging the full knowledge and capabilities of the strong models.

So it seems like this might break down. OK. But there are signs of life here.

[audience member] Where was the big model before being fine-tuned on the weak model? Was it better or worse than the green bar?

To clarify, the big model is just a base pre-trained model. OK.

So it's just predicting the next token. So it's not doing the task by default.

And so the bar on the right is when we fine-tune it with ground-truth labels on this task.

So I think that's a reasonable measure of the model, loosely speaking, trying its best to solve this task.

We want to elicit that. [audience member] So basically it would have been lower before, if you just evaluated it without fine-tuning?

Yeah. So this is a point I can get into more in, say, office hours. Consider zero-shot prompting, few-shot prompting, and so on.

So with the very biggest models, prompting works very well for some types of tasks. OK.

For many cases, for most models, this sort of generalization does better.

But there are subtleties here. I can talk about it in more detail, but it's a good question.

So far, this is just one representative NLP task. Is this a robust phenomenon?

Does weak-to-strong generalization occur in general?

So we ran the same basic experiment across many orders of magnitude of effective compute and across a large number of tasks.

Here we look at median accuracy across 22 NLP classification tasks.

The x-axis is effective compute for the strong model, y-axis is accuracy.

So the solid white line shows accuracy on ground-truth supervision.

This is the model trying its best, loosely speaking. We also look at this for different weak models.

So each color corresponds to a different weak model. For example, purple is a much weaker model,

many orders of magnitude worse than the biggest model.

And so we see that if we had perfect weak-to-strong generalization, the colored curves would be on the white line.

If we have zero generalization, the curves would be flat. In practice, we get something in between.

We get consistent, positive, upward-sloping curves indicating, you know, positive weak-to-strong generalization.

But it's clearly far from the white line; it's definitely not solved by default. This is a robust phenomenon.
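One natural way to put a number on "something in between" is the fraction of the gap between the weak supervisor and the ground-truth-trained strong model that the student recovers: 0 corresponds to a flat curve at the weak supervisor's accuracy, and 1 corresponds to sitting on the white line. The talk does not name such a metric, so the function name and the example accuracies below are illustrative assumptions, not results from the paper.

```python
def gap_recovered(weak_acc, weak_to_strong_acc, ceiling_acc):
    """Fraction of the weak-to-ceiling accuracy gap closed by the student.

    0.0 -> the student is no better than its weak supervisor.
    1.0 -> the student matches the strong model trained on ground truth.
    """
    gap = ceiling_acc - weak_acc
    if gap <= 0:
        raise ValueError("ceiling accuracy must exceed weak accuracy")
    return (weak_to_strong_acc - weak_acc) / gap


# Hypothetical accuracies chosen only to show the two extremes and a midpoint.
flat = gap_recovered(0.60, 0.60, 0.90)      # pure imitation
perfect = gap_recovered(0.60, 0.90, 0.90)   # on the ground-truth line
between = gap_recovered(0.60, 0.75, 0.90)   # roughly halfway
print(flat, perfect, between)
```

Normalizing by the gap makes curves for different weak supervisors comparable, since each one starts from a different baseline accuracy.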

So strong models very often generalize beyond their weak supervisors,

but imitating weak supervisors is still much worse than training on ground-truth supervision alone.

The natural question is: can we close some of this gap? In other words, how can we improve weak-to-strong generalization?


[audience member] What's not so clear is, you know, if you just didn't, if you had a step size of zero, wouldn't you do better?

You mean like--

[audience member] isn't this just a phenomenon of bad optimization?

Like if the, if you just didn't, if you just ignored the weak supervision, wouldn't you have better performance?

So I think, if I understand correctly, the question is also about, say, zero-shot prompting.

Like won't zero shot prompting just do better? Is that right? [audience member] Yeah, maybe if I understand.

Yeah, so the answer is that zero-shot prompting is competitive for the very largest models, let's say GPT-4.

For smaller models, it is definitely not competitive and this does significantly better.

But you know, it depends on the details. But it's definitely not a panacea.

Also, I think there are strong reasons to believe that prompting is likely to break down in the future.

I don't have time to get into that right now. I'm happy to discuss it in office hours.

But I think this is one reason not to rely on that too much.

But it's sort of a subtle issue. I'm very happy to discuss it more. It's a great question.

Yeah. OK. How can we improve weak-to-strong generalization?

Recall this earlier plot with a naive generalization.

With the baseline method, we train the strong model to directly imitate weak model supervision.

But this is kind of not what we want to do because the weak supervision can contain errors.

We don't really want it to simulate the errors of the weak model.

One way to hopefully avoid that is to make it easier for the strong model to disagree with the weak supervisor in some cases.

And so to do this, we consider a method very similar to self-training.

We basically encourage the strong model to make confident predictions, effectively reinforcing some of its own predictions. OK.

Turns out this simple method significantly improves generalization. It's a very simple auxiliary loss term.
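The talk describes the auxiliary term only in words, so the following is a minimal sketch of one plausible form for a binary classifier, not the loss from the paper. It mixes the usual imitation loss against the weak label with a self-training loss against the strong model's own hardened prediction. The function names, the mixing weight `alpha`, and the `threshold` are all assumptions made for illustration.

```python
import math


def cross_entropy(p, target):
    """Binary cross-entropy of predicted probability p against a 0/1 target."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))


def confidence_aux_loss(p_strong, weak_label, alpha=0.5, threshold=0.5):
    """Imitation loss plus a term rewarding the strong model's own confident guess.

    p_strong   : strong model's predicted probability of class 1
    weak_label : 0/1 label produced by the weak supervisor
    alpha      : hypothetical weight on the self-training term (0 = pure imitation)
    """
    # The hardened label is treated as a constant target, not differentiated through.
    hardened = 1 if p_strong > threshold else 0
    imitation = cross_entropy(p_strong, weak_label)
    self_training = cross_entropy(p_strong, hardened)
    return (1 - alpha) * imitation + alpha * self_training


# The strong model is fairly sure the answer is 1, but the weak label says 0.
baseline = confidence_aux_loss(0.9, 0, alpha=0.0)  # pure imitation
mixed = confidence_aux_loss(0.9, 0, alpha=0.5)     # with the auxiliary term
print(baseline, mixed)
```

With `alpha` above zero, a confident disagreement with the weak label is penalized less than under pure imitation, which is the sense in which the term makes it easier for the strong model to overrule its supervisor.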

But adding it consistently results in large improvements across all these NLP, sorry, let's say three quarters of these NLP tasks.

OK. But still the gains are really quite significant and we can see this more clearly when we look at all data sets combined.

So here, baseline performance is now dashed, and performance with this auxiliary loss term

is shown as the triangles, and especially for very small weak models and very big strong models,

you see massive improvements. OK. This is not meant to be sort of the final method.

This is not meant to be sort of the main contribution here.

This is a proof of concept that we can significantly improve generalization in very simple ways.

I think that's the important point here. And in this case, we are really generalizing from say a 60% accuracy

weak classifier to getting over 80% accuracy when we fine-tune the strong model on this. Yes.

[audience member] This is really interesting. In terms of alignment, I guess the thing I'm concerned about is that here, we know what test accuracy is.

But in general, if we have humans that are providing labels, how do we know that the way it's generalizing here is going to be aligned with the intent of the human labels?

Yeah, so this is an excellent question. So I'll touch on this a little bit later.

The short answer is I think we need a bunch of science here as well.

So first of all, this is a setup for studying these questions with today's models.

In the future problem, we will not be able to check at test time

whether this is working, at least not naively, without ground-truth labels.

I think there's also probably a lot we can still do there.

For example, I strongly suspect we can estimate how good generalization is at test time in an unsupervised way.

For example, by measuring how underspecified the generalization is.

Like, is there like only one generalization that is consistent with all of the constraints that we give the model

or are there a bunch of different generalizations that have very different predictions.

If it's the latter, then we probably should not trust the generalization.
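The speaker only says he suspects something like this is possible, so the following is a hypothetical proxy rather than a method from the paper. Given hard predictions on the same unlabeled inputs from several students (for instance, trained with different seeds or different auxiliary constraints), it reports the average pairwise disagreement. The function name and the choice of pairwise disagreement as the measure are our own assumptions.

```python
from itertools import combinations


def disagreement_rate(predictions_by_model):
    """Mean fraction of inputs on which two models disagree, over all model pairs.

    predictions_by_model: list of equal-length lists of hard labels,
    one list per model, all made on the same unlabeled inputs.
    """
    pairs = list(combinations(predictions_by_model, 2))
    if not pairs:
        raise ValueError("need predictions from at least two models")
    total = 0.0
    for a, b in pairs:
        if len(a) != len(b):
            raise ValueError("all models must predict on the same inputs")
        total += sum(x != y for x, y in zip(a, b)) / len(a)
    return total / len(pairs)


# Three hypothetical students that agree on some inputs and split on others.
students = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 1, 1, 1],
]
rate = disagreement_rate(students)
print(rate)
```

No ground-truth labels are needed to compute this. Under this proxy, a high rate would suggest that the supervision leaves the generalization underspecified; a low rate alone would not prove the shared answer is correct.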

So there's a whole lot more there. I think that's mostly future work. But the point is,

I think there are all sorts of ways we can try to gain trust in the generalization we get,

but it's not an obvious answer and it's a really important question. Great.

So to be clear, this is still all in a very simple setting, popular NLP classification data sets.

Now this is fairly robust across a lot of these. But still our results are not as good in the reward modeling setting.

OK. And this is arguably what we care about even more.

And so we still do not know how to get strong weak-to-strong generalization for the actual full RLHF stack.

I think this is an important line of future work.

But the point remains, we can drastically improve generalization across a large number of tasks using a very simple method.

So we see signs of traction. There's a lot of work to be done.

I'll now sketch out just a few broad areas of research that I think will be especially important.

First, we need increasingly analogous setups for studying superalignment.

Analogous setups are essential for ensuring that progress today translates to real progress in the future.

Our setup is analogous on at least one important axis,

namely the fact that we have weak supervisors

supervising stronger models just like in the future problem we care about.

But our setup is definitely not perfect; it still has important disanalogies that should be improved in future work.

We discussed this at length in the paper.

But just to give one illustration of this: one disanalogy is that future superhuman models will probably be very good at simulating humans.

OK. They will probably be very good at imitating what we would say, including our errors.

In contrast, current models are not really trained to predict what weak models will say or how they will behave.

As a result, this issue of imitating the weak supervisor might be more of a problem in the future than it is today.

This is an example of a disanalogy that future work should address. And again, this is just one example.

But the general point is that I think in addition to making iterative empirical progress on our current setup,

we should also iteratively improve our setup itself and update it over time as we learn more about what future models will look like.

Second, we need better methods. I think our method gets really exciting results across the vast majority of our NLP tasks, but it is far from perfect.

Like I said, perhaps most importantly, it still doesn't work in all settings, such as the reward modeling tasks that we maybe care even more about.

So getting good weak-to-strong generalization for really key alignment-relevant tasks is still a totally open problem.

But importantly, I think there's a ton of low hanging fruit here. There are tons of ideas from related research areas like robustness.

And I think many of these could be ported over to this setting as well.

And there's also a lot of additional structure specific to our problem that I think we can leverage to improve methods.

Finally, we also need science and this is going back to your question from before.

To really align superhuman models, we need a very high degree of trust in our alignment solutions.

OK. I think strong evaluation setups and strong methods on those setups are both necessary.

But to really trust the results, I think we need to understand when and why our methods work as well.

OK. We ran a bunch of preliminary experiments starting to study these sorts of questions.

I don't have time to go into these now. Look up the paper for many more details there.

I think we can already say some interesting things. But for the most part, we've just begun to scratch the surface here.

I think there's a lot more work to be done in the future.

So all three of these areas of future work are essential for making progress on superalignment.

In our paper, we introduced what I think of as a good starting point for an analogous setup, but it should definitely be improved.

We show some methods that seem to be extremely exciting in some settings, but not all.

They're definitely not good enough, but we found many signs of life and that makes us excited.

Finally, science. In our paper, we study a number of interesting initial results.

But for the most part, we still remain extremely ignorant. There's a lot more work to be done here.

So for all three of these areas, I feel like there's

a lot of work to be done, but it also feels extremely tractable.

This feels like this wide open area; lots of low hanging fruit.

So I don't think it's like we have no idea what to do.

I think it's more like we need more people working on this and just doing what they do best.

So in this sense, alignment feels to me more like a normal ML problem than probably ever before.

And I think that's exciting. So in particular, I think you can do a huge amount to contribute and we want to make this as easy as possible for you.

So as I mentioned before, we will be releasing the full paper later this week.

It will have many more details. Please check it out.

I only got to cover a very small fraction of what we show in the paper. We'll also be releasing open source code.

So we're trying to make it as easy as possible for you to get started. Second, we're excited to support work in this direction.

So we're going to be releasing over $2 million in prizes for the best work on this paper over the next year.

We'll be releasing many more details about this later in the week as well. But I'm extremely excited for this.

I'm really excited to see what everyone comes up with.

Third, we're also releasing $7 million in grants.

This is across both weak-to-strong generalization and other areas of superalignment.

OK. So this can help you get started.

And I mean, I think really this community of people is just extremely well suited to solve these problems.

So I think we should support you as much as possible here.

Finally, you should talk to us. I personally might be extremely busy over the next few days.

Don't be offended if I say I can't talk right now. After that, I would absolutely love to talk.

A few of my collaborators are also here today: Pavel, Leopold, and Yining.

Look out for them and talk to them, and say hi to me as well.

I'll especially be happy to talk after things sort of quiet down in a few days. I'll try to also be around today.

So to conclude my shameless pitch on why you should work on this problem here are five reasons: 

One, it doesn't require a huge amount of compute and doesn't require much setup.

It's very easy to get started and very easy to iterate.

This is something I've been really excited to see with this setting.

I think it's very accessible to academics outside of industry.

Two, it feels more similar to normal ML research than any other direction in alignment that I know of.

You don't need to be steeped in alignment already to make substantial contributions immediately.

And I think having expertise in other areas can be extremely useful.

Three, I think it's extremely conceptually interesting and natural as a problem.

So here we're basically asking the question of, can a student outperform its teacher?

When does this happen? Can we make this happen as well as possible?

I think these are very natural ML questions even ignoring the alignment motivations.

Obviously, I care about the alignment motivations, but you don't need to, to care about this problem.

Four, I think this is extremely tractable. Like I think you can do much better than us.

We've only just scratched the surface here. There's a lot of low hanging fruit here.

Finally, the reason I'm personally working on this is because I think it's extremely important.

I think we'll need to safely align superhuman models. I think this could easily happen sooner than we expect.

And currently, I think this is one of the most promising directions for actually doing so. Thank you.